<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Guillaume Valmont&apos;s Weblog</title><description>Guillaume Valmont&apos;s notes on AI-assisted web development experiments and learnings.</description><link>https://blog.valmont.dev/</link><item><title>GPT-5.5 authorship and order effects</title><link>https://blog.valmont.dev/posts/gpt-5-5-authorship-and-order-effects/</link><guid isPermaLink="true">https://blog.valmont.dev/posts/gpt-5-5-authorship-and-order-effects/</guid><description>While comparing medium, high, and extra-high reasoning modes on the same task, I uncovered consistent authorship and presentation-order effects in GPT-5.5’s evaluations.</description><pubDate>Sat, 25 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Key takeaways&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;GPT-5.5 often rates alternative plans more favorably than its own, even when its original proposal is competitive (authorship effect).&lt;/li&gt;
&lt;li&gt;When ranking plans, GPT-5.5 frequently follows the presentation order (order effect).&lt;/li&gt;
&lt;li&gt;Reasoning mode alone is not a reliable indicator of plan quality or evaluation accuracy.&lt;/li&gt;
&lt;li&gt;Human review remains necessary to obtain reliable results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;This article shows that GPT-5.5’s evaluation of plans is systematically affected by two irrelevant factors: whether it authored the plan (authorship) and the order in which plans are presented. These effects are strong enough to undermine ranking-based evaluation workflows.&lt;/p&gt;
&lt;p&gt;Across all runs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPT-5.5 ranked its own plan last in 5/6 runs (authorship effect)&lt;/li&gt;
&lt;li&gt;Rankings matched presentation order in 25/45 cases (order effect)&lt;/li&gt;
&lt;li&gt;Agreement between reasoning modes was low&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Context&lt;/h2&gt;
&lt;p&gt;GPT-5.5 has just been released.&lt;/p&gt;
&lt;p&gt;According to its &lt;a href=&quot;https://openai.com/index/introducing-gpt-5-5/&quot;&gt;introduction by OpenAI&lt;/a&gt;: “&lt;em&gt;Instead of carefully managing every step, you can give GPT‑5.5 a messy, multi-part task and trust it to plan, use tools, check its work, navigate through ambiguity, and keep going.&lt;/em&gt;”&lt;/p&gt;
&lt;p&gt;I wanted to test this claim using a moderately ambiguous prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;on datadog, when i look at web app usage, i see [Image #1]
how to separate each auto and eval page instead of merging them in /?/?/auto ?
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s the image: &lt;a href=&quot;/images/datadog-view-count.png&quot;&gt;image&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I initially assumed the issue was straightforward: a simple toggle to disable Datadog’s default view grouping. To test this, I ran the same prompt using &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, and &lt;code&gt;xhigh&lt;/code&gt; reasoning modes.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;high&lt;/code&gt; and &lt;code&gt;xhigh&lt;/code&gt; modes produced detailed and seemingly convincing plans, suggesting I had underestimated the problem. On closer inspection, however, &lt;code&gt;high&lt;/code&gt; proposed adding Next.js instrumentation, and &lt;code&gt;xhigh&lt;/code&gt; went as far as monkey-patching browser history, both of which were unneeded complexity.&lt;/p&gt;
&lt;p&gt;In contrast, &lt;code&gt;medium&lt;/code&gt; suggested disabling default grouping and setting the view name in the existing &lt;code&gt;beforeSend&lt;/code&gt; hook (an approach that was both simpler and aligned with my initial intuition).&lt;/p&gt;
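&lt;p&gt;As a concrete illustration, here is a minimal sketch of that approach. It assumes Datadog RUM is already initialized elsewhere in the app; the helper names (&lt;code&gt;toViewName&lt;/code&gt;, &lt;code&gt;renameView&lt;/code&gt;) are mine, and the event shape is reduced to the fields the hook touches.&lt;/p&gt;

```typescript
// Pure helper (illustrative name): strip query string / hash and any trailing
// slash so each exercise page keeps its own low-cardinality view name,
// e.g. "/5e/ch11/auto".
export function toViewName(path: string): string {
  const pathname = path.split(/[?#]/)[0];
  return pathname.length > 1 && pathname.endsWith("/")
    ? pathname.slice(0, -1)
    : pathname;
}

// Minimal shape of a RUM view event, reduced to what the hook needs.
type RumViewEvent = { type: string; view: { url: string; name: string } };

// beforeSend-style hook: rename the view instead of disabling automatic
// tracking, following the editable view.name guidance the plans cite.
export function renameView(event: RumViewEvent): boolean {
  if (event.type === "view") {
    event.view.name = toViewName(new URL(event.view.url).pathname);
  }
  return true; // keep the event
}
```

&lt;p&gt;Wired into the SDK’s &lt;code&gt;beforeSend&lt;/code&gt; init option, this keeps automatic SPA view tracking intact while replacing the merged &lt;code&gt;/?/?/auto&lt;/code&gt; name with the real pathname.&lt;/p&gt;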
&lt;p&gt;To verify whether &lt;code&gt;medium&lt;/code&gt; was actually better than &lt;code&gt;high&lt;/code&gt; and &lt;code&gt;xhigh&lt;/code&gt;, I repeated the experiment using a more controlled protocol. That assumption proved wrong, but the repeat led to even more interesting observations.&lt;/p&gt;
&lt;h2&gt;Plan evaluation with authorship&lt;/h2&gt;
&lt;h3&gt;Protocol&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;I ran the same prompt in three separate sessions using plan mode with &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, and &lt;code&gt;xhigh&lt;/code&gt; reasoning.&lt;/li&gt;
&lt;li&gt;I asked each model to evaluate the strengths and weaknesses of the other plans relative to its own in the order: &lt;span class=&quot;nowrap&quot;&gt;&lt;code&gt;medium&lt;/code&gt; -&amp;gt; &lt;code&gt;high&lt;/code&gt; -&amp;gt; &lt;code&gt;xhigh&lt;/code&gt; -&amp;gt; &lt;code&gt;medium&lt;/code&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;I then asked each model to rank the three plans&lt;/li&gt;
&lt;li&gt;I repeated the same steps with the presentation order reversed: &lt;span class=&quot;nowrap&quot;&gt;&lt;code&gt;medium&lt;/code&gt; &amp;lt;- &lt;code&gt;high&lt;/code&gt; &amp;lt;- &lt;code&gt;xhigh&lt;/code&gt; &amp;lt;- &lt;code&gt;medium&lt;/code&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;details&gt;
  &lt;summary&gt;Show direct order rankings&lt;/summary&gt;
&lt;h4&gt;Direct order&lt;/h4&gt;
&lt;h5&gt;Direct order &lt;code&gt;medium&lt;/code&gt; ranking&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;  1. Third plan: “Separate Datadog RUM Views Per Exercise Page”
     Best as-is. It is Datadog-correct, explicit for all current pathnames,
     includes view context, strips query strings, and covers disabled
     initialization.
  2. Second plan: “Datadog RUM View Names”
     Also Datadog-correct and has stronger navigation test detail, but as-is it
     is less explicit about non-exercise pages than the third plan.
  3. My original plan
     Safest and smallest, but least correct for this Datadog issue because it
     relies on beforeSend renaming instead of manual startView lifecycle
     tracking.
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Direct order &lt;code&gt;high&lt;/code&gt; ranking&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;  1. Third plan: automatic tracking + beforeSend rename
     Best tradeoff. Minimal code, uses existing Datadog hook, preserves
     automatic SPA tracking, easy to test, and avoids React hooks.
  2. Second plan: manual tracking, broader and more complete
     Strong if you need full manual control over view lifecycle or per-view
     context on child events. But it is more machinery than needed for
     “show /5e/ch11/auto instead of /?/?/auto”.
  3. My original plan
     Mostly correct, but heavier than necessary and missing a couple practical
     notes the second plan had, especially historical data staying merged and
     the explicit non-initialized test.
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Direct order &lt;code&gt;xhigh&lt;/code&gt; ranking&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;  1. beforeSend rewrite plan: best pragmatic fit.
  2. Third/manual-tracking refined plan: best if you specifically need Datadog’s
     active SDK view state and view context to be set via startView, but more
     invasive.
  3. My original manual-tracking plan: same general idea as plan 3, but less
     complete around dedupe, tests, and no-effect constraints.
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Direct order Interpretation&lt;/h5&gt;
&lt;p&gt;These results are consistent with my earlier, less controlled tests. In those first tests, all three modes ranked &lt;code&gt;medium&lt;/code&gt;’s &lt;code&gt;beforeSend&lt;/code&gt; solution as the best plan. Here, &lt;code&gt;high&lt;/code&gt; and &lt;code&gt;xhigh&lt;/code&gt; consider &lt;code&gt;medium&lt;/code&gt;’s plan the best, while &lt;code&gt;medium&lt;/code&gt; considers &lt;code&gt;xhigh&lt;/code&gt;’s plan the best.&lt;/p&gt;
&lt;p&gt;Notably, &lt;strong&gt;each model consistently ranks its own original plan last&lt;/strong&gt;. Also, both &lt;code&gt;medium&lt;/code&gt; and &lt;code&gt;high&lt;/code&gt; produce the same ranking: third plan -&amp;gt; second plan -&amp;gt; own plan.&lt;/p&gt;
&lt;/details&gt;
&lt;details&gt;
  &lt;summary&gt;Show reverse order rankings&lt;/summary&gt;
&lt;h4&gt;Reverse order&lt;/h4&gt;
&lt;h5&gt;Reverse order &lt;code&gt;medium&lt;/code&gt; ranking&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;  1. beforeSend resolver plan
     Best as-is. Smallest change, preserves Datadog automatic tracking, avoids
     React lifecycle issues, strips query strings, and directly follows
     Datadog’s editable view.name guidance.
  2. History-patching manual tracking plan
     More decision-complete than my original and handles query strings/
     idempotency explicitly, but patching history.pushState / replaceState is
     invasive and riskier in a Next.js App Router app.
  3. My original usePathname + startView plan
     Directionally valid, but weakest as-is. It relies on an underspecified
     callback-ref workaround because effects are banned, lacks explicit
     idempotency, and does not call out query/cardinality or Datadog UI grouping
     details clearly enough.
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Reverse order &lt;code&gt;high&lt;/code&gt; ranking&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;  1. beforeSend plan
     Best as-is. It is the least invasive, uses the existing Datadog beforeSend
     path, avoids banned hooks and route side effects, preserves automatic RUM
     tracking, and explicitly handles query strings/trailing slashes. Main risk:
     it may not fix Datadog views that group by @view.url_path_group.
  2. usePathname + callback ref plan
     Solid, but less good as-is. It uses the official manual-view mechanism and
     Next’s route source of truth, but the callback-ref route tracking is a
     little unusual, and putting it in src/generic conflicts with the earlier
     helper-location decision.
  3. My original plan
     Most complete, but worst as-is. It is too heavy: manual tracking plus
     history patching creates more risk and maintenance cost than needed. The
     structured context and business-helper placement are good, but the
     implementation approach is overbuilt for this problem.
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Reverse order &lt;code&gt;xhigh&lt;/code&gt; ranking&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;  1. Second alternate plan: trackViewsManually + usePathname tracker
     Best overall. It uses Datadog’s official mechanism and a Next.js-native
     route source. It is more likely to fix Web App Usage cleanly than my
     beforeSend plan, without the browser-history patching risk of the first
     alternate.
  2. My original plan: mutate event.view.name in beforeSend
     Safest and smallest code change, but less robust as a Datadog fix. It may
     work for @view.name, but it is not the primary Datadog-recommended
     lifecycle path and may not affect views grouped by @view.url_path_group.
  3. First alternate plan: trackViewsManually + patched history/popstate
     It uses the official Datadog model, but the history patching makes it the
     riskiest as-is. It has more edge cases and more chance of conflicting with
     Next.js routing than the usePathname tracker.
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Reverse order Interpretation&lt;/h5&gt;
&lt;p&gt;There were some differences between the runs: this time, &lt;code&gt;xhigh&lt;/code&gt; produced the best &lt;code&gt;beforeSend&lt;/code&gt; plan instead of &lt;code&gt;medium&lt;/code&gt;. Nonetheless, regardless of reasoning mode, GPT-5.5 still favors presented alternatives over its own original plan. Notably, &lt;code&gt;xhigh&lt;/code&gt; ranks its own plan second rather than last, making it the only mode that does not systematically place its own proposal last.&lt;/p&gt;
&lt;/details&gt;
&lt;h3&gt;Authorship effect&lt;/h3&gt;
&lt;p&gt;The initial goal was to have the modes collectively decide which reasoning mode was best. In the end, the best plan did not come from a particular reasoning mode: it came from whichever run ended with the better choice (use &lt;code&gt;beforeSend&lt;/code&gt;), which was &lt;code&gt;medium&lt;/code&gt; in the first two runs and &lt;code&gt;xhigh&lt;/code&gt; in the last one. This suggests that reasoning mode alone is not a reliable indicator of plan quality.&lt;/p&gt;
&lt;p&gt;Unexpectedly, these results suggest an even more interesting &lt;strong&gt;authorship effect: when evaluating its own earlier plan against alternatives, the model consistently downgraded its original proposal&lt;/strong&gt; and ranked itself last 5/6 times.&lt;/p&gt;
&lt;p&gt;This raises a follow-up question: what happens if plans are evaluated without any notion of authorship?&lt;/p&gt;
&lt;h2&gt;Plan evaluation without authorship&lt;/h2&gt;
&lt;h3&gt;Protocol&lt;/h3&gt;
&lt;p&gt;I repeated the experiment five times per reasoning mode using the following prompt with &lt;code&gt;medium&lt;/code&gt;|&lt;code&gt;high&lt;/code&gt;|&lt;code&gt;xhigh&lt;/code&gt; in different orders:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Evaluate each of the following plans. List their strengths, their weaknesses and rank them.
# Plan 1
[Pasted `medium`|`high`|`xhigh` plan]
# Plan 2
[Pasted `medium`|`high`|`xhigh` plan]
# Plan 3
[Pasted `medium`|`high`|`xhigh` plan]
&lt;/code&gt;&lt;/pre&gt;
&lt;details&gt;
  &lt;summary&gt;Show experiment results&lt;/summary&gt;
&lt;h4&gt;Order 1&lt;/h4&gt;
&lt;p&gt;For the first order, I used the following prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Evaluate each of the following plans. List their strengths, their weaknesses and rank them.
# Plan 1
[Pasted `medium` plan]
# Plan 2
[Pasted `high` plan]
# Plan 3
[Pasted `xhigh` plan]
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;medium&lt;/code&gt; ranked:
&lt;ul&gt;
&lt;li&gt;1 &amp;gt; 2 &amp;gt; 3 so &lt;code&gt;medium&lt;/code&gt; &amp;gt; &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; (3x)&lt;/li&gt;
&lt;li&gt;2 &amp;gt; 3 &amp;gt; 1 so &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; (2x)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;high&lt;/code&gt; ranked:
&lt;ul&gt;
&lt;li&gt;1 &amp;gt; 2 &amp;gt; 3 so &lt;code&gt;medium&lt;/code&gt; &amp;gt; &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; (3x)&lt;/li&gt;
&lt;li&gt;2 &amp;gt; 3 &amp;gt; 1 so &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; (1x)&lt;/li&gt;
&lt;li&gt;2 &amp;gt; 1 &amp;gt; 3 so &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; (1x)&lt;/li&gt;
&lt;li&gt;(Note: runs took between 53 s and 2 min; the 2 &amp;gt; 3 &amp;gt; 1 ranking came from the shortest thinking time and 2 &amp;gt; 1 &amp;gt; 3 from the longest.)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;xhigh&lt;/code&gt; ranked:
&lt;ul&gt;
&lt;li&gt;1 &amp;gt; 2 &amp;gt; 3 so &lt;code&gt;medium&lt;/code&gt; &amp;gt; &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; (3x)&lt;/li&gt;
&lt;li&gt;2 &amp;gt; 1 &amp;gt; 3 so &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; (2x)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Order 2&lt;/h4&gt;
&lt;p&gt;I did the same by changing the order:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Evaluate each of the following plans. List their strengths, their weaknesses and rank them.
# Plan 1
[Pasted `xhigh` plan]
# Plan 2
[Pasted `medium` plan]
# Plan 3
[Pasted `high` plan]
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;medium&lt;/code&gt; ranked:
&lt;ul&gt;
&lt;li&gt;1 &amp;gt; 2 &amp;gt; 3 so &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; &amp;gt; &lt;code&gt;high&lt;/code&gt; (3x)&lt;/li&gt;
&lt;li&gt;3 &amp;gt; 1 &amp;gt; 2 so &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; (1x)&lt;/li&gt;
&lt;li&gt;2 &amp;gt; 1 &amp;gt; 3 so &lt;code&gt;medium&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;high&lt;/code&gt; (1x)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;high&lt;/code&gt; ranked:
&lt;ul&gt;
&lt;li&gt;3 &amp;gt; 1 &amp;gt; 2 so &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; (4x)&lt;/li&gt;
&lt;li&gt;2 &amp;gt; 1 &amp;gt; 3 so &lt;code&gt;medium&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;high&lt;/code&gt; (1x)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;xhigh&lt;/code&gt; ranked:
&lt;ul&gt;
&lt;li&gt;1 &amp;gt; 2 &amp;gt; 3 so &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; &amp;gt; &lt;code&gt;high&lt;/code&gt; (4x)&lt;/li&gt;
&lt;li&gt;3 &amp;gt; 1 &amp;gt; 2 so &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; (1x)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Order 3&lt;/h4&gt;
&lt;p&gt;I did the same by changing the order:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Evaluate each of the following plans. List their strengths, their weaknesses and rank them.
# Plan 1
[Pasted `high` plan]
# Plan 2
[Pasted `xhigh` plan]
# Plan 3
[Pasted `medium` plan]
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;medium&lt;/code&gt; ranked:
&lt;ul&gt;
&lt;li&gt;2 &amp;gt; 3 &amp;gt; 1 so &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; &amp;gt; &lt;code&gt;high&lt;/code&gt; (4x)&lt;/li&gt;
&lt;li&gt;1 &amp;gt; 2 &amp;gt; 3 so &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; (1x)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;high&lt;/code&gt; ranked:
&lt;ul&gt;
&lt;li&gt;1 &amp;gt; 2 &amp;gt; 3 so &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; (3x)&lt;/li&gt;
&lt;li&gt;2 &amp;gt; 1 &amp;gt; 3 so &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; (2x)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;xhigh&lt;/code&gt; ranked:
&lt;ul&gt;
&lt;li&gt;1 &amp;gt; 2 &amp;gt; 3 so &lt;code&gt;high&lt;/code&gt; &amp;gt; &lt;code&gt;xhigh&lt;/code&gt; &amp;gt; &lt;code&gt;medium&lt;/code&gt; (5x)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;h3&gt;Order effect&lt;/h3&gt;
&lt;p&gt;Across 45 rankings (3 orders × 3 modes × 5 runs), the 1 &amp;gt; 2 &amp;gt; 3 ranking appeared 25 times, about 56% of the time. Under order-independent rankings, any single ordering would be expected only about 1/6 (17%) of the time, so this suggests an &lt;strong&gt;order effect: when ranking plans, the model frequently ranked them in the same order as they were presented&lt;/strong&gt;.&lt;/p&gt;
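&lt;p&gt;As a back-of-the-envelope check (not a formal test: runs within a session are probably not fully independent), a uniformly random ranking of three plans matches the presentation order with probability 1/6, so 25 matches out of 45 can be compared against a Binomial(45, 1/6) tail:&lt;/p&gt;

```typescript
// P(X >= k) for X ~ Binomial(n, p), summed term by term.
function binomTail(n: number, k: number, p: number): number {
  let tail = 0;
  for (let i = k; i <= n; i++) {
    let coeff = 1;
    for (let j = 1; j <= i; j++) {
      coeff *= (n - i + j) / j; // builds C(n, i) incrementally
    }
    tail += coeff * Math.pow(p, i) * Math.pow(1 - p, n - i);
  }
  return tail;
}

// 25 order-matching rankings out of 45, against a 1/6 chance rate.
const pValue = binomTail(45, 25, 1 / 6);
console.log(pValue); // far below any conventional significance threshold
```

&lt;p&gt;Even with the independence caveat, a 56% match rate is hard to attribute to chance.&lt;/p&gt;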
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;p&gt;This experiment is limited in scope: it examines a single problem, a small number of runs, and one evaluation format; reasoning modes are compared against each other, but there is no comparison between runs sharing the same reasoning mode. Results may vary across tasks, prompts, or domains. Further testing would be needed to generalize these findings.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This experiment does not establish a universal bias in GPT-5.5, but it highlights consistent patterns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When comparing its own plan with alternatives, GPT-5.5 rated alternative plans more favorably than its own earlier proposal (authorship effect).&lt;/li&gt;
&lt;li&gt;When ranking anonymous plans, GPT-5.5 frequently ranked them in the same order as they were presented (order effect).&lt;/li&gt;
&lt;li&gt;These effects interact: once authorship is removed, presentation order appears to become the dominant signal.&lt;/li&gt;
&lt;li&gt;Reasoning mode alone was not a reliable indicator of plan quality.&lt;/li&gt;
&lt;li&gt;GPT-5.5’s plan rankings were inconsistent in this setup and should be treated with caution.&lt;/li&gt;
&lt;li&gt;Regardless of reasoning mode, model-generated plans benefit from external judgment, especially when the task involves implementation tradeoffs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommendations:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Be extra cautious when proposing modifications to a plan (the model may accept them without questioning them enough).&lt;/li&gt;
&lt;li&gt;Don’t trust a model to rank outputs (either its own or another model’s outputs).&lt;/li&gt;
&lt;li&gt;Instead of list ranking, use pairwise comparison and judge externally.&lt;/li&gt;
&lt;li&gt;Don’t treat “reasoning level” as a quality guarantee.&lt;/li&gt;
&lt;/ul&gt;
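&lt;p&gt;To make the pairwise recommendation concrete, here is a sketch of how rankings could be aggregated from pairwise judgments. The &lt;code&gt;judge&lt;/code&gt; function is a stand-in for whatever model call you use; asking each pair in both presentation orders means an order-biased judge splits its votes instead of dominating the ranking.&lt;/p&gt;

```typescript
// A judge receives two plans and returns the one it prefers.
type Judge = (a: string, b: string) => string;

// Ask about every ordered pair, so each pair is judged twice with the
// presentation order swapped; count how often each plan wins.
function pairwiseWins(plans: string[], judge: Judge): Record<string, number> {
  const wins: Record<string, number> = {};
  for (const p of plans) wins[p] = 0;
  for (const a of plans) {
    for (const b of plans) {
      if (a !== b) wins[judge(a, b)] += 1;
    }
  }
  return wins;
}

// Final ranking: most pairwise wins first.
function rankByWins(plans: string[], judge: Judge): string[] {
  const wins = pairwiseWins(plans, judge);
  return [...plans].sort((a, b) => wins[b] - wins[a]);
}
```

&lt;p&gt;The result still needs external judgment on close calls, but it degrades more gracefully under an order effect than a single “rank these three plans” prompt.&lt;/p&gt;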
</content:encoded></item></channel></rss>