SPEC-1-Recipe Ingredient Line Ingestion Fidelity

Background

Your recipe reader needs to support ingredient-aware behavior such as swaps, overrides, and structured parsing. Right now, ingredient data is being lost or collapsed, which breaks downstream behaviors. In the example shown, ingredient_lines_raw is empty even though the recipe clearly references ingredients in the body, and the override result degrades to a single structured ingredient (broccoli) rather than operating against preserved source lines.

The immediate design goal is to preserve upstream ingredient lines exactly as provided so the system can safely layer parsing, rendering, and override logic on top later. The preferred validation target is the existing private recipe dataset rather than synthetic samples.

Requirements

Must have

  • Treat provided ingredient lines as the source of truth when they exist.
  • For recipe records, stage must be immutable after ingestion except by explicit admin or data-migration action.
  • Store raw ingredient lines in ingredient_lines_raw.
  • Ensure ingredient_lines_raw is a list of strings.
  • Preserve each input line as a separate list item.
  • Preserve original line order.
  • Do not merge adjacent lines.
  • Do not rewrite, normalize, or reconstruct ingredient lines.
  • Do not deduplicate ingredient lines during ingestion.
  • Ignore truly empty lines, but preserve lines with visible ingredient content.
  • Keep ingredient_lines_raw populated even if structured parsing fails.
  • Derive structured ingredients from ingredient_lines_raw when raw lines exist.
  • Do not let fallback extraction overwrite explicit ingredient-line data.
  • A record with provided ingredient lines must not normalize to "ingredient_lines_raw": [] unless the source truly had none.
  • Validate behavior against the existing owned recipe corpus.
  • For recipe records, only these stage values are valid: Make ahead, Prepare the meal.
  • The following values must not be treated as valid recipe stages: N/A, Reflection, Still growing, To be planted.

Should have

  • Capture ingestion provenance, such as ingredient_lines_source = section_lines | heuristic_body | none (the values defined under Data model).
  • Add a validation flag for records whose stage is outside the allowed recipe-stage set.
  • Store a lightweight ingestion diagnostic when raw lines were expected but absent.
  • Add regression fixtures from real recipes that previously failed.
  • Keep parsing, rendering, and override stages independently retryable.

Could have

  • Add a checksum over ordered raw ingredient lines for debugging ingestion regressions.
  • Add a side-by-side debug view in Flask showing upstream lines, stored raw lines, parsed ingredients, and applied overrides.
  • Preserve section headers separately later if your recipes contain grouped ingredient blocks.
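
The checksum idea in the first item above could be sketched as follows. The helper name raw_lines_checksum, the choice of SHA-256, and the length-prefix scheme are assumptions made for illustration; none of them exist in the current code.

```python
import hashlib


def raw_lines_checksum(lines: list[str]) -> str:
    """Hypothetical helper: order-sensitive digest of raw ingredient lines."""
    digest = hashlib.sha256()
    for line in lines:
        encoded = line.encode("utf-8")
        # Length-prefix each line so ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()
```

Storing this value next to ingredient_lines_raw would let a regression run compare digests first and only diff the lines when the digests differ.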

Won't have

  • Perfect ingredient parsing in this change.
  • Smarter substitution logic in this change.
  • Body-text rewriting in this change.
  • Near-duplicate collapsing in this change.
  • Broader recipe architecture redesign in this change.

Method

Core design decision

Treat the recipe author's line structure inside the Ingredients section of Content as the raw ingredient source when no dedicated upstream ingredient field exists. In other words:

  • Content remains the upstream container.
  • The Ingredients section within Content is the source of truth for raw ingredient lines.
  • ingredient_lines_raw must be produced by losslessly copying lines from that section, not by heuristically recognizing only lines that look parseable.

This changes the ingestion strategy from heuristic ingredient detection to section-aware line preservation.

Proposed ingestion flow

  1. Parse the raw Content into coarse sections such as Overview, Ingredients, Directions, Notes.
  2. If an Ingredients section is found, extract its lines in order.
  3. For each non-empty line in that section:
    • preserve the line exactly as authored
    • store it as one entry in ingredient_lines_raw
    • do not require bullets, quantities, or unit keywords
  4. Ignore truly empty lines.
  5. Optionally trim only trailing newline characters used by transport, but do not normalize internal spacing or wording.
  6. Only if no Ingredients section exists should fallback extraction run.
  7. Fallback output must be marked as inferred provenance and must never overwrite section-derived lines.

Data model

Recommended normalized recipe fields:

{
  "title": "...",
  "body": "...",
  "ingredient_lines_raw": [
    "2 zucchini",
    "1 onion",
    "garam masala"
  ],
  "ingredient_lines_source": "section_lines",
  "ingredient_lines_status": "present",
  "ingredients": [],
  "ingredients_parsed_from": "ingredient_lines_raw",
  "stage": "Prepare the meal"
}

Recommended provenance enum:

  • section_lines — extracted from an explicit Ingredients block in Content
  • heuristic_body — inferred because no clear Ingredients block existed
  • none — no ingredient lines found

Recommended status enum:

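  • present
  • missing_section
  • empty_section
  • fallback_used
  • missing_review_required (used in the Implementation examples when no approved source lines exist)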

Behavioral rules

Rule 1: preserve first, parse second

ingredient_lines_raw is populated before any structured parsing runs.

Rule 2: parsing cannot destroy raw capture

If structured parsing fails, returns partial results, or produces only one detected ingredient, ingredient_lines_raw remains unchanged.

Rule 3: overrides should prefer parsed ingredients derived from raw lines

Override logic should operate on structured ingredients derived from ingredient_lines_raw. If parsing is incomplete, UI/debug views should still show the preserved raw lines so failures are visible rather than silently collapsed.

Rule 4: fallback is second-class

Fallback extraction from general prose is allowed only when no Ingredients section can be identified. Fallback-derived lines must be marked distinctly and must not be confused with authored section lines.

Why the current approach fails

Today the extractor appears to require bullets and/or quantity tokens. That means this valid authored block can be dropped:

Ingredients
zucchini
onion
green pepper
jalapeño
garam masala

Because preservation is gated by parse-like signals, ingestion loses user intent before downstream stages begin. Once that happens, override logic can only operate on whatever sparse structured inference remains.

Recommended algorithm

Section-aware extraction algorithm

Input: raw Content string
Output: ordered ingredient_lines_raw plus provenance

1. Split Content into lines preserving order.
2. Detect section headers using a configurable header matcher.
3. If an Ingredients header is found:
   a. collect subsequent lines until the next section header or end of content
   b. discard truly empty lines
   c. store each remaining line exactly as one element of ingredient_lines_raw
   d. set ingredient_lines_source = section_lines
4. Else:
   a. run existing heuristic extraction as fallback
   b. if any lines found, set ingredient_lines_source = heuristic_body
   c. otherwise set ingredient_lines_source = none

Header matcher

Start with a configurable case-insensitive list:

  • ingredients
  • ingredient
  • for the chicken
  • for the sauce
  • for serving

This should be configuration, not hard-coded business logic, because your corpus may use custom section names.
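
A configurable matcher along these lines might look like the sketch below. The function names normalize_header and make_header_matcher are hypothetical, and tolerating a trailing colon is an assumption about how headers appear in the corpus. Normalization is applied only for comparison; it is never applied to stored lines.

```python
def normalize_header(line: str) -> str:
    """Comparison key: trimmed, lowercased, with one trailing colon dropped (assumption)."""
    text = line.strip().lower()
    if text.endswith(":"):
        text = text[:-1].rstrip()
    return text


def make_header_matcher(aliases):
    """Return a predicate that reports whether a line is one of the configured headers."""
    normalized = frozenset(normalize_header(alias) for alias in aliases)

    def is_header(line: str) -> bool:
        return normalize_header(line) in normalized

    return is_header


# Aliases would normally be loaded from configuration; this list mirrors the one above.
is_ingredient_header = make_header_matcher(
    ["ingredients", "ingredient", "for the chicken", "for the sauce", "for serving"]
)
```

Because the alias list is passed in rather than embedded in the function, adding a corpus-specific header such as For the marinade is a configuration change only.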

PlantUML

@startuml
start
:Load CSV row;
:Read Content;
:Split into ordered lines;
if (Ingredients section found?) then (yes)
  :Capture section lines exactly;
  :Drop only truly empty lines;
  :Set source = section_lines;
else (no)
  :Run heuristic fallback extraction;
  if (Fallback found lines?) then (yes)
    :Set source = heuristic_body;
  else (no)
    :Set source = none;
  endif
endif
:Persist ingredient_lines_raw;
:Attempt structured parsing from ingredient_lines_raw;
:Persist parsed ingredients separately;
stop
@enduml

Storage invariants

The normalized record must satisfy:

  • ingredient_lines_raw is always an array
  • when source lines were captured, array length must equal number of preserved non-empty ingredient section lines
  • line order equals original authored order
  • duplicate lines are allowed
  • parsing result cannot mutate prior raw lines
  • for recipe records, stage must be either Make ahead or Prepare the meal

Corpus-first test strategy

Use your own recipe dataset as the acceptance suite.

Create a regression folder with real examples in these buckets:

  • one ingredient per line, no bullets, no quantities
  • bullet list ingredients
  • measured ingredients with quantities
  • mixed sections such as Ingredients plus Directions
  • subsection headers like For the marinade
  • duplicate ingredient lines
  • recipes with no clear Ingredients section

For each fixture, assert:

  • expected ingredient_lines_raw
  • expected ingredient_lines_source
  • whether fallback was used
  • that structured parsing can pass or fail without changing the raw-line assertions above

Likely code change location

The highest-value change is to replace or front-load extract_ingredient_lines(...) with a section-aware extractor in rag_setup/import_recipes.py before current heuristic logic runs. The current heuristic extractor can remain as fallback rather than primary ingestion behavior.

Implementation

Assumptions

  • The current ingestion entry point is rag_setup/import_recipes.py.
  • Content contains human-authored sections.
  • Existing heuristic extraction should be retained only as fallback for legacy records without a recognizable Ingredients block.
  • The first goal is safe ingestion fidelity against your corpus, not perfect parsing.

Step 1: Introduce a section-aware extractor

Add a new function before the current heuristic extractor, for example:

def extract_ingredient_section_lines(content: str, header_aliases: list[str]) -> tuple[list[str], str]:
    """
    Returns (ingredient_lines_raw, ingredient_lines_source)
    source is one of: section_lines, none
    """

Behavior:

  • split content into ordered lines
  • detect an Ingredients header using configurable aliases
  • capture subsequent lines until the next recognized section header or end of content
  • ignore truly empty lines
  • preserve every other line exactly as written
  • return ([], "none") if no ingredient section is found
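
A minimal sketch of this behavior follows. Several details are assumptions rather than settled design: the optional section_headers parameter (not in the signature above), treating whitespace-only lines as empty, tolerating a trailing colon on headers, letting any alias line open or continue capture without being stored itself, and returning ([], "section_lines") for a header with no lines so that callers can tell an empty section from a missing one.

```python
SECTION_HEADERS = ("overview", "ingredients", "instructions", "directions", "method", "notes")


def _header_key(line: str) -> str:
    """Comparison key for header detection; stored lines are never altered."""
    return line.strip().rstrip(":").strip().lower()


def extract_ingredient_section_lines(
    content: str,
    header_aliases: list[str],
    section_headers: tuple[str, ...] = SECTION_HEADERS,
) -> tuple[list[str], str]:
    """Return (ingredient_lines_raw, ingredient_lines_source)."""
    aliases = {alias.lower() for alias in header_aliases}
    boundaries = {header.lower() for header in section_headers} - aliases
    captured: list[str] = []
    in_section = False
    section_found = False

    for line in content.splitlines():  # drops only the line terminators
        key = _header_key(line)
        if key in aliases:
            in_section = True
            section_found = True
            continue  # the header itself is not an ingredient line
        if key in boundaries:
            in_section = False
            continue
        if in_section and line.strip():
            captured.append(line)  # exact authored text, no normalization

    if not section_found:
        return [], "none"
    return captured, "section_lines"
```

Note that no line is rejected for lacking a bullet, digit, or unit keyword; the only filter inside the section is the empty-line check.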

Step 2: Flag missing ingredient sections for review

When no recognizable Ingredients section is found, the system should explicitly flag the recipe for review instead of silently treating prose extraction as equivalent source data.

Suggested behavior:

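def extract_ingredient_lines_with_review_flag(content: str) -> tuple[list[str], str, list[str]]:
    """Return (ingredient_lines_raw, ingredient_lines_source, ingestion_flags)."""
    lines, _source = extract_ingredient_section_lines(content, HEADER_ALIASES)
    if lines:
        return lines, "section_lines", []

    # No usable lines: report "none" and flag the record for review instead of guessing.
    return [], "none", ["missing_ingredients_section"]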

This makes the missing state visible and reviewable.

Optional secondary mode for tooling only:

  • allow a separate operator/debug action to run heuristic extraction for suggestion purposes
  • store those suggestions separately from ingredient_lines_raw
  • never promote suggestions into source-of-truth fields without review

Recommended field for operator suggestions:

record["ingredient_line_suggestions"] = heuristic_suggestions

This field must be treated as advisory only.

Step 3: Persist provenance in the normalized record

Update normalization output to always emit:

  • ingredient_lines_raw
  • ingredient_lines_source

Example:

record["ingredient_lines_raw"] = ingredient_lines_raw
record["ingredient_lines_source"] = ingredient_lines_source

Optional but useful:

record["ingredient_lines_status"] = (
    "present" if ingredient_lines_raw else "missing_review_required"
)
record["ingestion_flags"] = ingestion_flags

Possible flags:

  • missing_ingredients_section
  • empty_ingredients_section
  • section_header_ambiguous

Step 3b: Validate recipe stage

For recipe normalization, validate stage against a fixed allowlist:

ALLOWED_RECIPE_STAGES = {"Make ahead", "Prepare the meal"}

Behavior:

  • preserve the original input value in the record
  • if the value is in the allowlist, accept it unchanged
  • if the value is not in the allowlist, add an ingestion or data-quality flag such as invalid_recipe_stage
  • do not silently remap unrelated values like Reflection or To be planted into recipe stages
  • do not let routine agent logic change stage

Recommended optional field:

record["stage_is_valid"] = record.get("stage") in ALLOWED_RECIPE_STAGES

If stricter enforcement is preferred, invalid values can be routed to review instead of normal publication.
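
The validation behavior above could be wrapped in a single helper, sketched here. The function name apply_stage_validation is hypothetical, and exact, case-sensitive matching against the allowlist is an assumption carried over from the set-membership check above.

```python
ALLOWED_RECIPE_STAGES = frozenset({"Make ahead", "Prepare the meal"})


def apply_stage_validation(record: dict) -> dict:
    """Return a copy of the record with stage_is_valid set and a flag added if needed."""
    updated = dict(record)
    flags = list(record.get("ingestion_flags", []))
    is_valid = record.get("stage") in ALLOWED_RECIPE_STAGES
    if not is_valid and "invalid_recipe_stage" not in flags:
        flags.append("invalid_recipe_stage")
    # The original stage value is left untouched; only metadata is added.
    updated["stage_is_valid"] = is_valid
    updated["ingestion_flags"] = flags
    return updated
```

Returning a copy rather than editing in place keeps the input record available for comparison in tests.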

Step 4: Make parsing downstream-only

Any structured ingredient parsing should consume ingredient_lines_raw and must not repopulate or replace it from body.

Safe contract:

record["ingredients"] = parse_structured_ingredients(record["ingredient_lines_raw"])

Not allowed:

  • reconstructing ingredient_lines_raw from parsed ingredients
  • clearing ingredient_lines_raw when parsing fails
  • replacing section-derived lines with fallback-derived lines later in the pipeline

If ingredient_lines_status == "missing_review_required", parsing should either be skipped or clearly marked as operating on no approved source lines.
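
One way to enforce this contract is to hand the parser a copy of the raw lines and treat any parser error as a non-fatal outcome. In the sketch below, attach_parsed_ingredients and the parse_status field are hypothetical names, and the parser is passed in as a callable so the example does not depend on the real parse_structured_ingredients.

```python
def attach_parsed_ingredients(record: dict, parser) -> dict:
    """Run parser on a copy of the raw lines; never modify ingredient_lines_raw."""
    raw_lines = record["ingredient_lines_raw"]
    try:
        # Pass a copy so a misbehaving parser cannot mutate the stored list.
        record["ingredients"] = parser(list(raw_lines))
        record["parse_status"] = "ok"
    except Exception:
        # Deliberately broad for this sketch: a parser error must not fail ingestion.
        record["ingredients"] = []
        record["parse_status"] = "failed"
    return record
```

With this wrapper, a parser failure leaves ingredient_lines_raw exactly as captured and records the failure where the debug view can show it.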

Step 5: Add regression fixtures from your own corpus

Create real-input fixtures from recipes that currently fail. For each fixture, store:

  • source Content
  • expected ingredient_lines_raw
  • expected ingredient_lines_source

Recommended fixture buckets:

  • one ingredient per line with no bullets
  • ingredient lines with bullets
  • ingredient lines with quantities
  • mixed sectioned recipes
  • recipes with no Ingredients header
  • recipes with ambiguous headers that should be flagged
  • recipes with an Ingredients header but no captured lines
  • records with valid recipe stages
  • records with invalid non-recipe stages

Step 6: Add invariant tests

Tests should assert:

assert isinstance(record["ingredient_lines_raw"], list)
assert all(isinstance(x, str) for x in record["ingredient_lines_raw"])

For section-derived fixtures:

  • exact string equality per line
  • exact ordering
  • duplicates preserved
  • non-empty section lines are not dropped because they lack digits or units

For parser-failure scenarios:

  • ingredient_lines_raw remains populated
  • ingredients may be partial or empty without failing ingestion

For stage validation scenarios:

  • Make ahead is accepted
  • Prepare the meal is accepted
  • N/A, Reflection, Still growing, and To be planted are flagged as invalid recipe stages
  • ingestion does not silently rewrite invalid stage values

Step 7: Improve debug visibility in Flask

In the recipe debug view, display these fields together:

  • ingredient_lines_source
  • ingredient_lines_status
  • ingestion_flags
  • ingredient_lines_raw
  • optional ingredient_line_suggestions
  • parsed ingredients
  • applied overrides

UI expectation:

  • if no Ingredients section is found, show a visible review warning
  • allow you to inspect the original Content
  • optionally show heuristic suggestions in a separate clearly-labeled block such as Suggested ingredient lines (unapproved)

That will make it obvious whether a failure is due to ingestion, parsing, missing section structure, or override logic.

Suggested rollout sequence

  1. implement section-aware extraction
  2. keep old heuristic logic available for suggestions only, never as a writer of ingredient_lines_raw
  3. regenerate a sample normalized output from your corpus
  4. verify recipes with missing Ingredients sections are visibly flagged for review
  5. diff old vs new ingredient_lines_raw
  6. run override tests on recipes containing substitutions like zucchini → broccoli
  7. promote once lossless capture is verified
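
For the diff in step 5, the standard library's difflib is probably sufficient. The helper name diff_raw_lines below is hypothetical; it assumes the old and new normalized outputs are both available as lists of strings for the same recipe.

```python
import difflib


def diff_raw_lines(old: list[str], new: list[str]) -> list[str]:
    """Return unified-diff lines describing how ingredient_lines_raw changed."""
    return list(
        difflib.unified_diff(old, new, fromfile="old", tofile="new", lineterm="")
    )
```

An empty result means the two versions are identical; lines starting with + show what the section-aware extractor now captures that the old extractor dropped.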

Minimal pseudocode

HEADER_ALIASES = {
    "ingredients",
    "ingredient",
}

SECTION_HEADERS = {
    "overview",
    "ingredients",
    "instructions",
    "directions",
    "method",
    "notes",
}


def normalize_recipe(row: dict) -> dict:
    content = row.get("Content", "") or ""

    ingredient_lines_raw, ingredient_lines_source, ingestion_flags = extract_ingredient_lines_with_review_flag(content)

    record = {
        "title": row.get("Recipe", ""),
        "body": content,
        "dish": row.get("Dish", ""),
        "stage": row.get("Stages", ""),
        "food_pics": row.get("food_pics", ""),
        "ingredient_lines_raw": ingredient_lines_raw,
        "ingredient_lines_source": ingredient_lines_source,
        "ingredient_lines_status": "present" if ingredient_lines_raw else "missing_review_required",
        "ingestion_flags": ingestion_flags,
    }

    if record["ingredient_lines_raw"]:
        record["ingredients"] = parse_structured_ingredients(record["ingredient_lines_raw"])
    else:
        record["ingredients"] = []

    return record

Definition of done

This implementation is complete when:

  • recipes authored with one ingredient per line populate ingredient_lines_raw
  • the array preserves exact line boundaries and ordering
  • recipes without an Ingredients section are flagged for your review
  • heuristic extraction, if retained, is suggestions-only unless explicitly approved
  • parsing failures do not empty raw lines
  • overrides can be tested against recipes from your own dataset with visible raw-line provenance

Milestones

TBD

Gathering Results

TBD