Phase 3 - Recipe ingestion | Eckford on the side

The immediate design goal is to preserve upstream ingredient lines exactly as provided so the system can safely layer parsing, rendering, and override logic on top later. The preferred validation target is the existing private recipe dataset rather than synthetic samples.

Key rule

When explicit ingredient lines are present in source input, they must be persisted losslessly and used as the sole raw ingredient source for downstream ingredient behavior.

SPEC-1-Recipe Ingredient Line Ingestion Fidelity

Background

Your recipe reader needs to support ingredient-aware behavior such as swaps, overrides, and structured parsing. Right now, ingredient data is being lost or collapsed, which breaks downstream behaviors. In the example shown, ingredient_lines_raw is empty even though the recipe clearly references ingredients in the body, and the override result degrades to a single structured ingredient (broccoli) rather than operating against preserved source lines.

Requirements

Must have

Treat provided ingredient lines as the source of truth when they exist.
Store raw ingredient lines in ingredient_lines_raw.
Ensure ingredient_lines_raw is a list of strings.
Preserve each input line as a separate list item.
Preserve original line order.
Do not merge adjacent lines.
Do not rewrite, normalize, or reconstruct ingredient lines.
Do not deduplicate ingredient lines during ingestion.
Ignore truly empty lines, but preserve lines with visible ingredient content.
Keep ingredient_lines_raw populated even if structured parsing fails.
Derive structured ingredients from ingredient_lines_raw when raw lines exist.
Do not let fallback extraction overwrite explicit ingredient-line data.
A record with provided ingredient lines must not normalize to "ingredient_lines_raw": [] unless the source truly had none.
Validate behavior against the existing owned recipe corpus.

Should have

Capture ingestion provenance, such as ingredient_lines_source = section_lines | heuristic_body | none.
Store a lightweight ingestion diagnostic when raw lines were expected but absent.
Add regression fixtures from real recipes that previously failed.
Keep parsing, rendering, and override stages independently retryable.

Could have

Add a checksum over ordered raw ingredient lines for debugging ingestion regressions.
Add a side-by-side debug view in Flask showing upstream lines, stored raw lines, parsed ingredients, and applied overrides.
Preserve section headers separately later if your recipes contain grouped ingredient blocks.
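
The checksum idea in the list above could look roughly like the sketch below. The function name raw_lines_checksum and the choice of SHA-256 over a JSON encoding are assumptions for illustration, not part of the spec. Any stable hash that covers both line order and line boundaries would serve the same debugging purpose.

```python
import hashlib
import json


def raw_lines_checksum(ingredient_lines_raw: list[str]) -> str:
    """Return a SHA-256 hex digest over the ordered raw ingredient lines.

    Hypothetical helper: the name and hashing scheme are assumptions.
    """
    # JSON-encode the list so that order and line boundaries are part of the
    # hashed payload: ["a b"] and ["a", "b"] serialize differently.
    payload = json.dumps(
        ingredient_lines_raw, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


lines = ["2 zucchini", "1 onion", "garam masala"]
print(raw_lines_checksum(lines))
```

Storing this value next to ingredient_lines_raw would let two ingestion runs be compared without diffing every line.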

Won't have

Perfect ingredient parsing in this change.
Smarter substitution logic in this change.
Body-text rewriting in this change.
Near-duplicate collapsing in this change.
Broader recipe architecture redesign in this change.

Method

Core design decision

Treat the recipe author's line structure inside the Ingredients section of Content as the raw ingredient source when no dedicated upstream ingredient field exists. In other words:

Content remains the upstream container.
The Ingredients section within Content is the source of truth for raw ingredient lines.
ingredient_lines_raw must be produced by losslessly copying lines from that section, not by heuristically recognizing only lines that look parseable.

This changes the ingestion strategy from heuristic ingredient detection to section-aware line preservation.

Proposed ingestion flow

Parse the raw Content into coarse sections such as Overview, Ingredients, Directions, Notes.
If an Ingredients section is found, extract its lines in order.
For each non-empty line in that section:
- preserve the line exactly as authored
- store it as one entry in ingredient_lines_raw
- do not require bullets, quantities, or unit keywords
Ignore truly empty lines.
Optionally trim only trailing newline characters used by transport, but do not normalize internal spacing or wording.
Only if no Ingredients section exists should fallback extraction run.
Fallback output must be marked as inferred provenance and must never overwrite section-derived lines.

Data model

Recommended normalized recipe fields:

{
  "title": "...",
  "body": "...",
  "ingredient_lines_raw": [
    "2 zucchini",
    "1 onion",
    "garam masala"
  ],
  "ingredient_lines_source": "section_lines",
  "ingredient_lines_status": "present",
  "ingredients": [],
  "ingredients_parsed_from": "ingredient_lines_raw"
}

Recommended provenance enum:

section_lines — extracted from an explicit Ingredients block in Content
heuristic_body — inferred because no clear Ingredients block existed
none — no ingredient lines found

Recommended status enum:

present
missing_section
empty_section
fallback_used
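
One possible way to pin these two enums down in code is sketched below. The class names and the derive_status helper are hypothetical, and the mapping from "was a section found" plus "were lines captured" to a status is an assumption about how the four values are meant to be used. The Implementation section's pseudocode uses a different status string, missing_review_required, so this sketch follows the lists above only.

```python
from enum import Enum


class IngredientLinesSource(str, Enum):
    """Provenance values from the list above (class name is an assumption)."""

    SECTION_LINES = "section_lines"
    HEURISTIC_BODY = "heuristic_body"
    NONE = "none"


class IngredientLinesStatus(str, Enum):
    """Status values from the list above (class name is an assumption)."""

    PRESENT = "present"
    MISSING_SECTION = "missing_section"
    EMPTY_SECTION = "empty_section"
    FALLBACK_USED = "fallback_used"


def derive_status(section_found: bool, lines: list[str]) -> IngredientLinesStatus:
    """Assumed mapping: section presence and captured lines decide the status."""
    if section_found:
        # A header with nothing under it is distinct from no header at all.
        if lines:
            return IngredientLinesStatus.PRESENT
        return IngredientLinesStatus.EMPTY_SECTION
    # No section: any lines present must have come from fallback extraction.
    if lines:
        return IngredientLinesStatus.FALLBACK_USED
    return IngredientLinesStatus.MISSING_SECTION


print(derive_status(True, ["2 zucchini"]).value)
print(derive_status(False, []).value)
```

Mixing in str keeps the members comparable to the plain strings already stored in normalized records.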

Behavioral rules

Rule 1: preserve first, parse second

ingredient_lines_raw is populated before any structured parsing runs.

Rule 2: parsing cannot destroy raw capture

If structured parsing fails, returns partial results, or produces only one detected ingredient, ingredient_lines_raw remains unchanged.

Rule 3: overrides should prefer parsed ingredients derived from raw lines

Override logic should operate on structured ingredients derived from ingredient_lines_raw. If parsing is incomplete, UI/debug views should still show the preserved raw lines so failures are visible rather than silently collapsed.

Rule 4: fallback is second-class

Fallback extraction from general prose is allowed only when no Ingredients section can be identified. Fallback-derived lines must be marked distinctly and must not be confused with authored section lines.
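
Rules 1 and 2 can be enforced with a thin wrapper around whatever parser is in use. The sketch below is an illustration under assumptions: parse_without_touching_raw and the ingredients_parse_error field are hypothetical names, and the parser is passed in as a callable because its real signature is not shown in this document.

```python
from typing import Callable


def parse_without_touching_raw(
    record: dict,
    parser: Callable[[list[str]], list[dict]],
) -> dict:
    """Attach parsed ingredients to a record without mutating its raw lines."""
    # Hand the parser a copy so it cannot change the stored list in place.
    raw_copy = list(record["ingredient_lines_raw"])
    try:
        record["ingredients"] = parser(raw_copy)
        record["ingredients_parse_error"] = None
    except Exception as exc:  # sketch only: real code may catch narrower errors
        # Parsing failed; raw lines stay as captured and the failure is recorded.
        record["ingredients"] = []
        record["ingredients_parse_error"] = str(exc)
    return record


def failing_parser(lines: list[str]) -> list[dict]:
    """Stand-in parser that misbehaves on purpose."""
    lines.clear()
    raise ValueError("could not parse")


record = {"ingredient_lines_raw": ["zucchini", "onion"]}
print(parse_without_touching_raw(record, failing_parser))
```

Even with a parser that clears its input and raises, the record keeps both raw lines and reports the failure separately.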

Why the current approach fails

Today the extractor appears to require bullets and/or quantity tokens. That means this valid authored block can be dropped:

Ingredients
zucchini
onion
green pepper
jalapeño
garam masala

Because preservation is gated by parse-like signals, ingestion loses user intent before downstream stages begin. Once that happens, override logic can only operate on whatever sparse structured inference remains.

Fixed

The check below prints the stored raw lines for each record in recipes_normalized.jsonl that has any:

python3 - <<'PY'
import json
from pathlib import Path

# Pick the first match in sorted order so repeated runs use the same file.
paths = sorted(Path(".").rglob("recipes_normalized.jsonl"))
if not paths:
    print("No recipes_normalized.jsonl found under the current directory.")
    raise SystemExit(1)

path = paths[0]
print(f"Using: {path}")

with path.open("r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        row = json.loads(line)
        lines = row.get("ingredient_lines_raw", [])
        if lines:
            print("\nTITLE:", row.get("title"))
            print("SOURCE:", row.get("ingredient_lines_source"))
            print("STATUS:", row.get("ingredient_lines_status"))
            print("LINES:")
            for i, item in enumerate(lines, 1):
                print(f"  {i}. {item}")
PY

Sample output:

TITLE: Homegrown Pesto
SOURCE: section_lines
STATUS: present
LINES:
  1.         2 cups (60g) fresh basil leaves*
  2.         1/3 cup (48g) pine nuts*
  3.         1/3 cup (25g) freshly grated or shredded parmesan cheese
  4.         3 small cloves garlic (roasted garlic or fresh)*
  5.         1/3 cup (80ml) olive oil
  6.         1/2 fresh squeezed lemon
  7.         pinch of salt
  8.         freshly ground black pepper

Recommended algorithm

Section-aware extraction algorithm

Input: raw Content string
Output: ordered ingredient_lines_raw plus provenance

1. Split Content into lines preserving order.
2. Detect section headers using a configurable header matcher.
3. If an Ingredients header is found:
   a. collect subsequent lines until the next section header or end of content
   b. discard truly empty lines
   c. store each remaining line exactly as one element of ingredient_lines_raw
   d. set ingredient_lines_source = section_lines
4. Else:
   a. run existing heuristic extraction as fallback
   b. if any lines found, set ingredient_lines_source = heuristic_body
   c. otherwise set ingredient_lines_source = none

Header matcher

Start with a configurable case-insensitive list:

ingredients
ingredient
for the chicken
for the sauce
for serving

This should be configuration, not hard-coded business logic, because your corpus may use custom section names.
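
A configurable matcher along these lines might be sketched as follows. The names normalize_header and build_header_matcher are hypothetical, and tolerating a trailing colon (for example "Ingredients:") is an assumption about how headers appear in your corpus. The alias list is passed in so it can come from configuration rather than code.

```python
from typing import Callable, Iterable


def normalize_header(line: str) -> str:
    """Case-fold a candidate header and drop outer whitespace and a trailing colon."""
    return line.strip().rstrip(":").strip().casefold()


def build_header_matcher(aliases: Iterable[str]) -> Callable[[str], bool]:
    """Return a predicate that reports whether a whole line is a known header."""
    normalized = {normalize_header(alias) for alias in aliases}

    def is_header(line: str) -> bool:
        # Only whole-line matches count; a line that merely contains the word
        # "ingredients" is not treated as a header.
        return normalize_header(line) in normalized

    return is_header


# The alias list would normally be loaded from configuration.
config = {"ingredient_headers": ["ingredients", "ingredient", "for the sauce"]}
is_ingredient_header = build_header_matcher(config["ingredient_headers"])

print(is_ingredient_header("INGREDIENTS:"))
print(is_ingredient_header("2 cups of ingredients"))
```

Matching whole lines only keeps the matcher from treating ordinary prose or ingredient lines as section boundaries.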

PlantUML

@startuml
start
:Load CSV row;
:Read Content;
:Split into ordered lines;
if (Ingredients section found?) then (yes)
  :Capture section lines exactly;
  :Drop only truly empty lines;
  :Set source = section_lines;
else (no)
  :Run heuristic fallback extraction;
  if (Fallback found lines?) then (yes)
    :Set source = heuristic_body;
  else (no)
    :Set source = none;
  endif
endif
:Persist ingredient_lines_raw;
:Attempt structured parsing from ingredient_lines_raw;
:Persist parsed ingredients separately;
stop
@enduml

Storage invariants

The normalized record must satisfy:

ingredient_lines_raw is always an array
when source lines were captured, array length must equal number of preserved non-empty ingredient section lines
line order equals original authored order
duplicate lines are allowed
parsing result cannot mutate prior raw lines
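
These invariants lend themselves to a small checker that can run after ingestion and again after parsing. The sketch below is illustrative: check_storage_invariants is a hypothetical name, and "truly empty" is assumed to mean empty or whitespace-only.

```python
from typing import Optional


def check_storage_invariants(
    record: dict,
    authored_lines: Optional[list[str]] = None,
) -> list[str]:
    """Return human-readable violations; an empty list means the record passes."""
    raw = record.get("ingredient_lines_raw")
    if not isinstance(raw, list):
        return ["ingredient_lines_raw is not a list"]

    violations = []
    if not all(isinstance(item, str) for item in raw):
        violations.append("ingredient_lines_raw contains a non-string item")

    if authored_lines is not None:
        # Assumption: "truly empty" means empty or whitespace-only.
        expected = [line for line in authored_lines if line.strip()]
        if len(raw) != len(expected):
            violations.append(
                f"expected {len(expected)} lines, found {len(raw)}"
            )
        elif raw != expected:
            # Same length but different content or order. Duplicates are fine
            # as long as they match the authored lines one for one.
            violations.append("lines differ from authored content or order")
    return violations


authored = ["zucchini", "", "onion", "onion"]
record = {"ingredient_lines_raw": ["zucchini", "onion", "onion"]}
print(check_storage_invariants(record, authored))
```

Running the same check before and after structured parsing is one way to confirm that parsing did not mutate the raw lines.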

Corpus-first test strategy

Use your own recipe dataset as the acceptance suite.

Create a regression folder with real examples in these buckets:

one ingredient per line, no bullets, no quantities
bullet list ingredients
measured ingredients with quantities
mixed sections such as Ingredients plus Directions
subsection headers like For the marinade
duplicate ingredient lines
recipes with no clear Ingredients section

For each fixture, assert:

expected ingredient_lines_raw
expected ingredient_lines_source
whether fallback was used
structured parsing may pass or fail independently

Likely code change location

The highest-value change is to replace or front-load extract_ingredient_lines(...) with a section-aware extractor in rag_setup/import_recipes.py before current heuristic logic runs. The current heuristic extractor can remain as fallback rather than primary ingestion behavior.

Implementation

Assumptions

The current ingestion entry point is rag_setup/import_recipes.py.
Content contains human-authored sections.
Existing heuristic extraction should be retained only as fallback for legacy records without a recognizable Ingredients block.
The first goal is safe ingestion fidelity against your corpus, not perfect parsing.

Step 1: Introduce a section-aware extractor

Add a new function before the current heuristic extractor, for example:

def extract_ingredient_section_lines(content: str, header_aliases: list[str]) -> tuple[list[str], str]:
    """
    Returns (ingredient_lines_raw, ingredient_lines_source)
    source is one of: section_lines, none
    """

Behavior:

split content into ordered lines
detect an Ingredients header using configurable aliases
capture subsequent lines until the next recognized section header or end of content
ignore truly empty lines
preserve every other line exactly as written
return ([], "none") if no ingredient section is found
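
A minimal sketch of a body for that function, following the behavior list above, might look like this. Several details are assumptions rather than part of the spec: the optional section_headers parameter, the _header_key helper, tolerance for a trailing colon on headers, and treating whitespace-only lines as "truly empty". An Ingredients header with no lines under it also returns ([], "none") here, since the stub's docstring allows only section_lines or none.

```python
HEADER_ALIASES = ["ingredients", "ingredient"]
SECTION_HEADERS = ["overview", "instructions", "directions", "method", "notes"]


def _header_key(line: str) -> str:
    """Normalize a line for header comparison (hypothetical helper)."""
    return line.strip().rstrip(":").strip().casefold()


def extract_ingredient_section_lines(
    content: str,
    header_aliases: list[str],
    section_headers: list[str] = SECTION_HEADERS,
) -> tuple[list[str], str]:
    """
    Returns (ingredient_lines_raw, ingredient_lines_source)
    source is one of: section_lines, none
    """
    aliases = {_header_key(alias) for alias in header_aliases}
    boundaries = {_header_key(header) for header in section_headers} | aliases

    captured: list[str] = []
    in_section = False
    # splitlines() removes line terminators and leaves everything else intact.
    for line in content.splitlines():
        key = _header_key(line)
        if not in_section:
            if key in aliases:
                in_section = True
            continue
        if key in boundaries:
            break  # the next recognized header ends the Ingredients section
        if not line.strip():
            continue  # assumed meaning of "truly empty"
        captured.append(line)  # stored exactly as authored

    if not captured:
        return [], "none"
    return captured, "section_lines"


content = "Overview\nA quick curry.\n\nIngredients\nzucchini\nonion\n\ngaram masala\nDirections\nChop everything."
print(extract_ingredient_section_lines(content, HEADER_ALIASES))
```

Note that lines such as "zucchini" are captured even though they carry no bullet, quantity, or unit, which is the case the current heuristic drops.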

Step 2: Flag missing ingredient sections for review

When no recognizable Ingredients section is found, the system should explicitly flag the recipe for review instead of silently treating prose extraction as equivalent source data.

Suggested behavior:

def extract_ingredient_lines_with_review_flag(content: str) -> tuple[list[str], str, list[str]]:
    lines, source = extract_ingredient_section_lines(content, HEADER_ALIASES)
    if lines:
        return lines, "section_lines", []

    return [], "none", ["missing_ingredients_section"]

This makes the missing state visible and reviewable.

Optional secondary mode for tooling only:

allow a separate operator/debug action to run heuristic extraction for suggestion purposes
store those suggestions separately from ingredient_lines_raw
never promote suggestions into source-of-truth fields without review

Recommended suggested field for operator assistance:

record["ingredient_line_suggestions"] = heuristic_suggestions

This field must be treated as advisory only.

Step 3: Persist provenance in the normalized record

Update normalization output to always emit:

ingredient_lines_raw
ingredient_lines_source

Example:

record["ingredient_lines_raw"] = ingredient_lines_raw
record["ingredient_lines_source"] = ingredient_lines_source

Optional but useful:

record["ingredient_lines_status"] = (
    "present" if ingredient_lines_raw else "missing_review_required"
)
record["ingestion_flags"] = ingestion_flags

Possible flags:

missing_ingredients_section
empty_ingredients_section
section_header_ambiguous

Step 4: Make parsing downstream-only

Any structured ingredient parsing should consume ingredient_lines_raw and must not repopulate or replace it from body.

Safe contract:

record["ingredients"] = parse_structured_ingredients(record["ingredient_lines_raw"])

Not allowed:

reconstructing ingredient_lines_raw from parsed ingredients
clearing ingredient_lines_raw when parsing fails
replacing section-derived lines with fallback-derived lines later in the pipeline

If ingredient_lines_status == "missing_review_required", parsing should either be skipped or clearly marked as operating on no approved source lines.

Step 5: Add regression fixtures from your own corpus

Create real-input fixtures from recipes that currently fail. For each fixture, store:

source Content
expected ingredient_lines_raw
expected ingredient_lines_source

Recommended fixture buckets:

one ingredient per line with no bullets
ingredient lines with bullets
ingredient lines with quantities
mixed sectioned recipes
recipes with no Ingredients header
recipes with ambiguous headers that should be flagged
recipes with an Ingredients header but no captured lines

Step 6: Add invariant tests

Tests should assert:

assert isinstance(record["ingredient_lines_raw"], list)
assert all(isinstance(x, str) for x in record["ingredient_lines_raw"])

For section-derived fixtures:

exact string equality per line
exact ordering
duplicates preserved
non-empty section lines are not dropped because they lack digits or units

For parser-failure scenarios:

ingredient_lines_raw remains populated
ingredients may be partial or empty without failing ingestion

Step 7: Improve debug visibility in Flask

In the recipe debug view, display these fields together:

ingredient_lines_source
ingredient_lines_status
ingestion_flags
ingredient_lines_raw
optional ingredient_line_suggestions
parsed ingredients
applied overrides

UI expectation:

if no Ingredients section is found, show a visible review warning
allow you to inspect the original Content
optionally show heuristic suggestions in a separate clearly-labeled block such as Suggested ingredient lines (unapproved)

That will make it obvious whether a failure is due to ingestion, parsing, missing section structure, or override logic.
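
The data behind such a view could be assembled by a plain function and then handed to a Flask template or JSON response. The sketch below stays framework-free so it runs on its own. build_debug_payload and the review_warning key are hypothetical names, and the record is assumed to carry the fields listed above.

```python
def build_debug_payload(record: dict, applied_overrides: list) -> dict:
    """Collect the fields the debug view shows side by side (hypothetical helper)."""
    flags = record.get("ingestion_flags", [])
    return {
        "ingredient_lines_source": record.get("ingredient_lines_source", "none"),
        "ingredient_lines_status": record.get("ingredient_lines_status"),
        "ingestion_flags": flags,
        "ingredient_lines_raw": record.get("ingredient_lines_raw", []),
        # Advisory only: kept apart from the source-of-truth raw lines.
        "ingredient_line_suggestions": record.get("ingredient_line_suggestions", []),
        "ingredients": record.get("ingredients", []),
        "applied_overrides": applied_overrides,
        # Original Content, so the author can inspect what was ingested.
        "content": record.get("body", ""),
        # Drives the visible review warning in the UI.
        "review_warning": "missing_ingredients_section" in flags,
    }


record = {
    "body": "Chop the zucchini and fry it.",
    "ingredient_lines_raw": [],
    "ingredient_lines_source": "none",
    "ingredient_lines_status": "missing_review_required",
    "ingestion_flags": ["missing_ingredients_section"],
}
print(build_debug_payload(record, applied_overrides=[]))
```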

Suggested rollout sequence

implement section-aware extraction
keep old heuristic logic available only as a suggestions-only fallback (see Step 2)
regenerate a sample normalized output from your corpus
verify recipes with missing Ingredients sections are visibly flagged for review
diff old vs new ingredient_lines_raw
run override tests on recipes containing substitutions like zucchini → broccoli
promote once lossless capture is verified
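
For the "diff old vs new ingredient_lines_raw" step, Python's standard difflib module is probably enough. The sketch below assumes both runs have already been loaded into lists of strings. diff_raw_lines is a hypothetical helper name, and the old/ and new/ labels are arbitrary.

```python
import difflib


def diff_raw_lines(title: str, old_lines: list[str], new_lines: list[str]) -> list[str]:
    """Return unified-diff lines between old and new ingredient_lines_raw."""
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=f"old/{title}",
            tofile=f"new/{title}",
            lineterm="",  # inputs have no trailing newlines, so add none
        )
    )


old = []  # e.g. the heuristic extractor dropped every line
new = ["zucchini", "onion", "green pepper"]
for row in diff_raw_lines("Zucchini curry", old, new):
    print(row)
```

An empty result means the two runs stored identical lines for that recipe, so only recipes with a non-empty diff need a manual look.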

Minimal pseudocode

HEADER_ALIASES = {
    "ingredients",
    "ingredient",
}

SECTION_HEADERS = {
    "overview",
    "ingredients",
    "instructions",
    "directions",
    "method",
    "notes",
}


def normalize_recipe(row: dict) -> dict:
    content = row.get("Content", "") or ""

    ingredient_lines_raw, ingredient_lines_source, ingestion_flags = extract_ingredient_lines_with_review_flag(content)

    record = {
        "title": row.get("Recipe", ""),
        "body": content,
        "dish": row.get("Dish", ""),
        "stage": row.get("Stages", ""),
        "food_pics": row.get("food_pics", ""),
        "ingredient_lines_raw": ingredient_lines_raw,
        "ingredient_lines_source": ingredient_lines_source,
        "ingredient_lines_status": "present" if ingredient_lines_raw else "missing_review_required",
        "ingestion_flags": ingestion_flags,
    }

    if record["ingredient_lines_raw"]:
        record["ingredients"] = parse_structured_ingredients(record["ingredient_lines_raw"])
    else:
        record["ingredients"] = []

    return record

Definition of done

This implementation is complete when:

recipes with one-authored-line-per-ingredient populate ingredient_lines_raw
the array preserves exact line boundaries and ordering
recipes without an Ingredients section are flagged for your review
heuristic extraction, if retained, is suggestions-only unless explicitly approved
parsing failures do not empty raw lines
overrides can be tested against recipes from your own dataset with visible raw-line provenance

Milestones

TBD

Gathering Results

TBD
