
Why the Structure of AI's Output Matters

7 min read

Here's something most developers miss when working with LLMs: the order in which you ask for information dramatically affects the quality of responses. Not by a little—we're talking 20-30% accuracy improvements in some cases, just by rearranging JSON fields.

This isn't about prompt engineering tricks or clever workarounds. It's about understanding how these models actually work at a fundamental level.

How LLM Chat Systems Actually Work

Large language models like GPT-4, Claude, or Llama operate on a deceptively simple principle: they predict the next token, one at a time, from left to right. That's it. The entire multi-billion parameter model boils down to answering "what comes next?" over and over again.

This process is called autoregressive generation. The model:

  1. Takes your prompt and any conversation history
  2. Converts it into tokens (roughly word chunks)
  3. Predicts the probability distribution for the next token
  4. Samples a token from that distribution
  5. Appends it to the sequence
  6. Repeats until it generates a stop token

Think of it like writing a sentence where you can only see what you've written so far—you can't go back and change earlier words based on what comes later. Each token is generated based purely on what came before it.
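To make that concrete, here's a minimal sketch of the loop in Rust. The model is a toy stand-in (a hand-written lookup over a four-token vocabulary, not a neural network), but the decode loop itself has the same shape a real system uses: predict, sample, append, repeat until a stop token.

```rust
// Toy autoregressive decoding loop. `TinyModel` is a stand-in for a real LLM:
// it scores the next token over a fixed 4-token vocabulary, but the loop
// itself (predict -> sample -> append -> repeat until stop) is the real shape.

const VOCAB: [&str; 4] = ["approve", "reject", "because", "<stop>"];

struct TinyModel;

impl TinyModel {
    // Probability distribution over the next token, conditioned only on the
    // tokens generated so far -- never on anything that comes later.
    fn next_token_probs(&self, context: &[usize]) -> [f64; 4] {
        match context.last().copied() {
            None => [0.2, 0.1, 0.7, 0.0],    // start: likely "because"
            Some(2) => [0.6, 0.3, 0.0, 0.1], // after "because": pick a verdict
            _ => [0.0, 0.0, 0.0, 1.0],       // after a verdict: stop
        }
    }
}

// Greedy decoding: always take the most likely token. Real systems usually
// sample (temperature, top-p), but the sequential commitment is the same.
fn argmax(probs: &[f64; 4]) -> usize {
    probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let model = TinyModel;
    let mut generated: Vec<usize> = Vec::new();

    loop {
        let probs = model.next_token_probs(&generated);
        let next = argmax(&probs);
        generated.push(next); // committed: the model cannot revise this later
        if VOCAB[next] == "<stop>" || generated.len() > 16 {
            break;
        }
    }

    let text: Vec<&str> = generated.iter().map(|&i| VOCAB[i]).collect();
    println!("{}", text.join(" ")); // prints: because approve <stop>
}
```

Notice that `generated.push(next)` is the whole story: once a token is in the context, every later prediction is conditioned on it, and nothing can take it back.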

This matters more than you might think.

A Basic Example: Making Decisions

Let's say you're building a system where an LLM needs to approve or reject user requests. You want structured output so you can parse it reliably. You have two obvious ways to structure this:

Option A: Decision First

{
  "decision": "approve",
  "reason": "User has valid credentials and request is within quota"
}

Option B: Reason First

{
  "reason": "User has valid credentials and request is within quota",
  "decision": "approve"
}

Same information, right? Just different ordering. But here's what happens in practice:

With Option A (decision first), the model generates:

  1. Opens JSON: {
  2. Key: "decision"
  3. Colon: :
  4. Value: `"approve"` ← Makes this choice NOW
  5. Then fills in reasoning to justify what it already decided

With Option B (reason first), the model generates:

  1. Opens JSON: {
  2. Key: "reason"
  3. Colon: :
  4. Value: Thinks through the actual criteria ← Reasoning happens HERE
  5. Then makes decision based on reasoning it just generated

The difference? In Option A, the model commits to an answer before doing the reasoning. In Option B, it does the reasoning first, then decides.

Why One is Better Than the Other: The Deep Dive

The reason this matters comes down to three fundamental properties of autoregressive models:

1. Sequential Dependencies and "Exposure Bias"

When an LLM generates text, each token influences all subsequent tokens. This creates a dependency chain. If the model makes a mistake early, that error can cascade through the rest of the generation. Researchers use the term "exposure bias" for a closely related problem: the model is trained on ground-truth prefixes but at inference must build on its own, possibly flawed, output, which amplifies exactly this kind of cascading.

In the decision-first approach, if the model picks "approve" prematurely (maybe because approvals are more common in training data), it will then generate reasoning that supports that decision—regardless of whether it's correct. The model becomes committed to justifying its initial choice.

In the reason-first approach, the model works through the relevant factors first. When it reaches the decision field, it has already generated the logical analysis. The decision becomes a natural conclusion rather than a premature commitment.

2. The Left-to-Right Constraint

Remember: LLMs can only condition on previous tokens, never future ones. This is a fundamental architectural constraint of decoder-style transformers: they use causal attention masks that literally prevent the model from "peeking ahead."

When you structure output with decision first, you're asking the model to make a decision without the benefit of its own reasoning. It's like asking someone to pick A, B, or C before they've finished reading the question.

When you put reasoning first, the model builds up context and analysis tokens that are then available in its attention context when generating the decision token. The decision can attend to the reasoning that came before it.
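If you haven't seen a causal mask before, it's nothing exotic: position i is allowed to attend to positions 0 through i and nothing after. A tiny illustrative sketch:

```rust
// Build a causal (lower-triangular) attention mask for a sequence of length n.
// mask[i][j] == true means token i is allowed to attend to token j.
fn causal_mask(n: usize) -> Vec<Vec<bool>> {
    (0..n)
        .map(|i| (0..n).map(|j| j <= i).collect())
        .collect()
}

fn main() {
    for row in causal_mask(4) {
        let line: Vec<&str> = row.iter().map(|&ok| if ok { "1" } else { "." }).collect();
        println!("{}", line.join(" "));
    }
    // Prints:
    // 1 . . .
    // 1 1 . .
    // 1 1 1 .
    // 1 1 1 1
}
```

Row i of that matrix is exactly what the i-th token gets to look at: everything generated before it, and nothing after.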

3. Probability Distribution Shaping

Each token the model generates shifts the probability distribution for subsequent tokens. Early tokens have outsized influence on this distribution shift.

If "decision: approve" comes first, the probability distribution for the reasoning text gets shaped toward justifications for approval. The model is now in "explain why this is approved" mode rather than "analyze whether this should be approved" mode.

If reasoning comes first, the probability distribution evolves based on the actual analysis. The decision field's probabilities get shaped by genuine evaluation rather than post-hoc justification.
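Written out with the chain rule (x for the prompt, r for the reasoning text, d for the decision), both schemas describe the same joint distribution, but they sample the decision from very different conditionals:

```latex
% Decision first: d is sampled from P(d | x), before any reasoning exists.
P(d, r \mid x) = P(d \mid x) \cdot P(r \mid x, d)

% Reason first: d is sampled from P(d | x, r), conditioned on the reasoning it just wrote.
P(r, d \mid x) = P(r \mid x) \cdot P(d \mid x, r)
```

In the first factorization, the decision's distribution never sees the reasoning at all; the reasoning only explains a choice that is already fixed.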

Real-World Numbers

Recent experiments show this isn't theoretical:

  • Researchers testing structured outputs on reasoning tasks found 20-30% accuracy improvements simply by reordering JSON fields to put reasoning before conclusions
  • In the MATH500 dataset, models with tag-based structured reasoning (putting verification and reasoning before answers) maintained significantly better accuracy
  • Even non-reasoning models like GPT-4 show measurable improvements when reasoning fields precede decision fields

Practical Implications

This matters for any application using structured outputs:

For classification tasks: Put your reasoning/justification field before your label/category field.

For yes/no decisions: Generate explanation before decision.

For complex analysis: Structure your output so that supporting data and analysis fields come before summary/conclusion fields.

For multi-step reasoning: Use sequential reasoning tags or fields before final outputs.

The pattern is consistent: anything that should inform the decision needs to be generated before the decision token itself.
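In Rust, for example, this can be as simple as declaring the fields in the order you want them generated. Here's a sketch using serde, assuming you feed serialized examples of a hypothetical `Verdict` type to the model as few-shot samples or format documentation; serde_json writes struct fields in declaration order, and parsing stays order-insensitive either way.

```rust
// Dependencies: serde (with the "derive" feature) and serde_json.
use serde::{Deserialize, Serialize};

// Reason-first schema: the field order here is deliberate. serde_json
// serializes struct fields in declaration order, so any example output you
// show the model puts the reasoning before the decision -- the generation
// order you want.
#[derive(Serialize, Deserialize, Debug)]
struct Verdict {
    /// Free-text analysis of the request; generated first.
    reason: String,
    /// "approve" or "reject"; generated after the reasoning above.
    decision: String,
}

fn main() -> Result<(), serde_json::Error> {
    // Parsing is order-insensitive, so this still deserializes correctly even
    // if a model happens to emit the keys in the other order.
    let raw = r#"{"reason":"User has valid credentials and request is within quota","decision":"approve"}"#;
    let verdict: Verdict = serde_json::from_str(raw)?;
    println!("{}", serde_json::to_string_pretty(&verdict)?);
    Ok(())
}
```

The type then quietly encodes the reason-first convention everywhere you serialize example outputs.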

The Counterintuitive Part

Here's what makes this particularly interesting: from a data structure perspective, JSON key order doesn't matter. From a parsing perspective, we get the same object either way. But from an LLM generation perspective, the order is everything.

We're not dealing with a traditional software system where data can be computed in any order and then assembled. We're dealing with a sequential generator that must commit to each token as it goes. The structure of the output determines the order of reasoning.

What This Means for Your Prompts

When designing prompts that expect structured output:

  1. Think about generation order, not just data structure. Consider what the model needs to have generated before it can accurately generate each subsequent field.
  2. Put "thinking" before "deciding." Any field that represents analysis, reasoning, or supporting information should come before fields that represent conclusions or decisions.
  3. Test both orderings. The difference can be dramatic; it's worth running evals to measure the impact for your specific use case (see the sketch after this list).
  4. Watch for cascade errors. If you see the model consistently making wrong decisions but providing correct-seeming reasoning, check if your structure forces it to decide before reasoning.
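Here's a rough idea of what that A/B eval could look like. `call_model` is a placeholder for whatever LLM client you actually use, and the prompts and test cases are made up for illustration; the only real work is comparing accuracy across the two orderings.

```rust
// Rough A/B harness comparing the two field orderings on a labelled eval set.
// Dependency: serde_json. `call_model` is a placeholder: wire it to your client.
fn call_model(_system_prompt: &str, _user_input: &str) -> String {
    unimplemented!("plug in your LLM client here")
}

// Pull the decision out of the model's JSON output; both orderings parse the same.
fn extract_decision(json_text: &str) -> Option<String> {
    let value: serde_json::Value = serde_json::from_str(json_text).ok()?;
    value.get("decision")?.as_str().map(str::to_owned)
}

// Fraction of cases where the extracted decision matches the expected label.
fn accuracy(system_prompt: &str, cases: &[(&str, &str)]) -> f64 {
    let correct = cases
        .iter()
        .filter(|(input, expected)| {
            extract_decision(&call_model(system_prompt, input)).as_deref() == Some(*expected)
        })
        .count();
    correct as f64 / cases.len() as f64
}

fn main() {
    // (request, expected decision) pairs you trust the labels on.
    let cases = [
        ("valid user, request within quota", "approve"),
        ("expired credentials", "reject"),
    ];

    let decision_first = r#"Reply with JSON: {"decision": ..., "reason": ...}"#;
    let reason_first = r#"Reply with JSON: {"reason": ..., "decision": ...}"#;

    println!("decision-first accuracy: {:.2}", accuracy(decision_first, &cases));
    println!("reason-first accuracy:   {:.2}", accuracy(reason_first, &cases));
}
```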

The Bigger Picture

This goes beyond just JSON field ordering. It reveals something fundamental about working with LLMs: they're not traditional computing systems. They don't compute answers and then format them—they think by generating tokens in sequence.

Understanding this changes how you should design LLM-powered systems. The output structure isn't just a formatting choice—it's part of how the model reasons about the problem.

Next time you're designing a structured output schema, remember: you're not just defining a data format. You're defining the order in which the model will think about the problem.

And sometimes, that makes all the difference.