Structured Output Parsing in Production

Getting an LLM to return valid JSON sounds simple until you try it at scale. In production, you need structured output that’s not just syntactically valid, but semantically correct and consistently formatted.

The Structured Output Spectrum

From least to most reliable:

  1. Raw text parsing — regex/string matching on free-form output. Fragile, don’t do this.
  2. JSON mode — model forced to output valid JSON. Syntactically correct, but no schema guarantees.
  3. Function calling — model outputs conform to a schema. Best option when available (see the sketch below, alongside JSON mode).
  4. Constrained generation — grammar-based decoding that makes invalid output impossible. Most reliable, but not available everywhere.
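
For example, with an OpenAI-style chat completions client, the difference between levels 2 and 3 looks roughly like this. This is a sketch: it assumes a prompt string and an API key in the environment, and parameter names vary by provider and model.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Level 2, JSON mode: output is guaranteed to parse as JSON, nothing more.
# (OpenAI's JSON mode also expects the word "JSON" to appear in the prompt.)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

# Level 3, schema-constrained ("structured outputs"): output must match the schema.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "analysis",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"summary": {"type": "string"}},
                "required": ["summary"],
                "additionalProperties": False,
            },
        },
    },
)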

Schema Design for Models

The shape of your output schema directly affects reliability:

from typing import Any, Optional

from pydantic import BaseModel, Field

# Bad: complex nested structure with optional fields
class AnalysisResult(BaseModel):
    entities: list[Entity]                # Entity and Relationship are nested models defined elsewhere
    relationships: list[Relationship]
    metadata: Optional[dict[str, Any]] = None
    confidence: Optional[float] = None
    sub_analyses: Optional[list["AnalysisResult"]] = None

# Good: flat, required fields, bounded lists
class AnalysisResult(BaseModel):
    entity_names: list[str] = Field(max_length=10)
    primary_relationship: str
    confidence: float = Field(ge=0, le=1)
    summary: str = Field(max_length=500)

Rules:

  • Prefer flat structures over deep nesting
  • Make fields required with defaults rather than optional
  • Bound list lengths to prevent unbounded generation
  • Use enums for categorical fields
  • Add field descriptions — they act as inline instructions (both shown in the sketch below)
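
A sketch of the last two rules. The Sentiment enum and ReviewAnalysis model here are hypothetical examples, not part of any real pipeline:

from enum import Enum

from pydantic import BaseModel, Field

class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"

class ReviewAnalysis(BaseModel):
    # The enum limits the categorical field to known values
    sentiment: Sentiment = Field(description="Overall sentiment of the review")
    # Descriptions end up in the JSON schema, where the model reads them as instructions
    summary: str = Field(max_length=300, description="One-sentence plain-English summary")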

Validation and Recovery

Schema validation catches structural issues; semantic validation catches the errors a schema can't express:

from pydantic import ValidationError

async def extract_with_validation(prompt, schema, max_retries=3):
    for attempt in range(max_retries):
        try:
            raw = await llm.generate(prompt, response_format=schema)
            result = schema.model_validate_json(raw)

            # Semantic validation: schema-valid output can still be wrong
            errors = validate_semantics(result)
            if errors:
                prompt = append_correction(prompt, raw, errors)
                continue

            return result
        except ValidationError as e:
            # Schema failure: feed the error back into the prompt and retry
            prompt = append_schema_error(prompt, raw, e)
            continue

    raise ExtractionFailed(f"Failed after {max_retries} attempts")

The correction prompt is critical. Don’t just say “try again” — show the model its previous output, what was wrong, and what you expected instead.
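
The helper functions in the loop above are left undefined. A minimal sketch of what they could look like (the exact wording of the correction message is yours to tune):

from pydantic import ValidationError

def append_correction(prompt: str, previous_output: str, errors: list[str]) -> str:
    """Build a retry prompt that shows the model its output and what was wrong with it."""
    error_list = "\n".join(f"- {e}" for e in errors)
    return (
        f"{prompt}\n\n"
        f"Your previous answer was:\n{previous_output}\n\n"
        f"It had these problems:\n{error_list}\n\n"
        "Return a corrected answer that fixes every problem and matches the required schema exactly."
    )

def append_schema_error(prompt: str, previous_output: str, error: ValidationError) -> str:
    """Same idea, driven by the schema validation error instead of semantic checks."""
    return append_correction(prompt, previous_output, [str(error)])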

Partial Extraction

Sometimes you don’t need perfect output — you need the best available data:

from pydantic_core import PydanticUndefined

def extract_partial(raw_output, schema):
    """Extract as many valid fields as possible, even if the full schema fails."""
    result = {}
    for field_name, field_info in schema.model_fields.items():
        try:
            value = extract_field(raw_output, field_name, field_info)
            if value is not None:
                result[field_name] = value
        except (ValueError, KeyError):
            # Fall back to the field's default, if it has one; required fields stay absent
            if field_info.default is not PydanticUndefined:
                result[field_name] = field_info.default
    return result

This is especially useful for non-critical pipelines where having 80% of the data is better than failing entirely.
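
The extract_field helper above is left undefined. One hedged sketch, assuming the raw output is a JSON string and only top-level keys are pulled:

import json

def extract_field(raw_output: str, field_name: str, field_info):
    """Pull one top-level field out of the raw output."""
    # json.JSONDecodeError is a subclass of ValueError, so malformed output
    # is caught by the except clause in extract_partial above
    data = json.loads(raw_output)
    return data[field_name]  # KeyError if the field is missing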

Monitoring Output Quality

Track these metrics in production:

  • Parse success rate — percentage of outputs that validate against the schema
  • Retry rate — how often extraction needs multiple attempts
  • Field completeness — percentage of fields successfully extracted
  • Semantic accuracy — spot-checked against ground truth

A sudden drop in parse success rate usually means the model was updated or your prompts drifted.
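
A minimal way to track the first three is a counter object updated inside the extraction loop. The shape below is illustrative and not tied to any particular monitoring stack:

from dataclasses import dataclass

@dataclass
class ExtractionMetrics:
    attempts: int = 0         # individual generation calls
    parses: int = 0           # calls whose output validated against the schema
    requests: int = 0         # logical extraction requests (each may take several attempts)
    retried: int = 0          # requests that needed more than one attempt
    fields_expected: int = 0
    fields_filled: int = 0

    @property
    def parse_success_rate(self) -> float:
        return self.parses / self.attempts if self.attempts else 0.0

    @property
    def retry_rate(self) -> float:
        return self.retried / self.requests if self.requests else 0.0

    @property
    def field_completeness(self) -> float:
        return self.fields_filled / self.fields_expected if self.fields_expected else 0.0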
