May 20261 min readJohan Bretonneau

Structured Output Across Providers
JSON Mode, Tool Use, and response_format - Compared

Getting structured JSON from LLMs sounds simple until you need it to work reliably across OpenAI, Anthropic, Google, and Mistral. Here are the failure modes nobody documents.

Structured output is one of those features that looks solved until you run it in production across multiple providers. Then you discover that "JSON mode" means four different things depending on who you're calling, and two of them will silently return invalid JSON under load.

Here's what actually works, and where each provider breaks.

The Four Approaches

1. OpenAI - response_format: { type: "json_object" }

The original. Instructs the model to return valid JSON without specifying a schema. Works reliably. The failure mode: it returns valid JSON, but not the schema you wanted. You asked for { "name": string, "age": number } and got { "result": "John, age 42" }. The model interpreted the format request but not your schema intent.

The fix: OpenAI's structured outputs with strict: true and a JSON schema definition. This enforces schema compliance at the decoding level - the model is constrained to produce tokens that match your schema. Reliability: ~99.7% in our tests.

Routing implication: Strict structured outputs require GPT-4o or GPT-4o-mini specifically. Not all OpenAI models support it.

2. Anthropic - Tool use

Claude doesn't have a response_format equivalent. The idiomatic approach is to define a tool with your desired JSON schema and instruct the model to call it. Claude returns a tool_use block with the structured data.

This is more verbose than OpenAI's approach, but it's actually more reliable - Claude is excellent at tool call formatting (94% on our instruction-following benchmark). The schema validation happens at the application layer, not the provider layer.

The failure mode: Claude sometimes returns a text response explaining why it can't call the tool, instead of calling it. This happens when the prompt context makes the tool call semantically inappropriate. Rate: ~2% in our tests on ambiguous prompts.

The fix: Add explicit instruction: "You MUST respond by calling the [tool_name] tool. Do not respond with text." Drops the failure rate to < 0.3%.

3. Google Gemini - response_mime_type + schema

Gemini supports response_mime_type: "application/json" with an optional response_schema. The schema validation is done server-side.

The failure mode we didn't expect: Under high load (concurrent requests during our peak test window), Gemini 2.0 Flash returned syntactically invalid JSON in 8.2% of requests - not an error code, just malformed output. Gemini 2.0 Pro showed the same issue at 1.4%. This was not documented anywhere we could find.

The fix: Always wrap Gemini structured output calls in a JSON parse try/catch and retry on parse failure. With one retry, effective failure rate drops to < 0.5%.

Routing implication: If you're using Gemini for cost reasons on structured output tasks, budget for retry logic. The savings are still significant even with the occasional retry token cost.

4. Mistral - response_format: { type: "json_object" }

Similar to OpenAI's original JSON mode - no schema enforcement, just a guarantee of valid JSON syntax. Reliable syntax (< 0.5% parse failure), but no schema validation.

The pattern that works: Use Mistral's JSON mode with explicit schema description in the system prompt. Something like:

Respond with a JSON object matching this exact schema:
{"name": "string", "confidence": "number between 0 and 1", "tags": "array of strings"}
Do not include any other fields.

Schema compliance: ~93% in our tests. Good enough for most extraction tasks.

Cross-Provider Reliability Summary

ProviderApproachSyntax reliabilitySchema complianceNotes
OpenAI (strict)response_format + schema99.7%99.5%Best overall
AnthropicTool use99.6%98.2%Best for complex schemas
MistralJSON mode + prompt99.5%93.1%Good for simple schemas
Gemini Proresponse_schema98.6%96.8%Retry on parse failure
Gemini Flashresponse_schema91.8%89.3%Requires retry logic

The Routing Strategy for Structured Output

You're not stuck with one provider for all structured output. The right routing depends on schema complexity and latency requirements:

Simple schemas (< 5 fields, flat structure): Gemini Flash + retry. Cheapest path, schema compliance is sufficient, retry overhead is acceptable.

Complex schemas (nested objects, arrays, enums): Anthropic tool use or OpenAI strict. The schema validation is load-bearing here; a 93% compliance rate means 7% of your outputs need human review or a retry.

Real-time structured output (< 500ms): GPT-4o-mini with strict mode. Gemini Flash is cheaper but the retry risk adds unpredictable latency. OpenAI strict is fast and predictable.

Cost-sensitive batch structured extraction: Gemini Flash with retry budget. Even at 8% retry rate, the blended cost is still 4× cheaper than OpenAI strict for the same task.

What This Means for Provider-Agnostic Code

If you're building a system that routes structured output requests across providers, your parsing layer needs to handle four different response shapes:

def extract_structured(response, provider):
    if provider == "openai":
        return json.loads(response.choices[0].message.content)
    elif provider == "anthropic":
        tool_block = next(b for b in response.content if b.type == "tool_use")
        return tool_block.input
    elif provider in ("gemini", "mistral"):
        try:
            return json.loads(response.text)
        except json.JSONDecodeError:
            raise RetryableError("Invalid JSON from provider")

HiWay2LLM normalizes this across providers - your application sees a consistent structured_output field regardless of which provider handled the request. The provider-specific extraction and retry logic lives in the routing layer, not your application code.

The One Rule

Validate output against your schema in your application, regardless of which provider you use. Provider-level schema enforcement is a best-effort guarantee, not a contract. The one time you skip the application-level validation is the one time a Gemini Flash response returns {"name": null} when you expected a string, and it silently propagates downstream.

Schema validation at the application boundary costs microseconds. Debugging a corrupted database because you trusted provider-level enforcement costs hours.

Start Saving →

No credit card required

Share

Was this useful?

Comments

Be the first to comment.