The immediate design goal is to preserve upstream ingredient lines exactly as provided so the system can safely layer parsing, rendering, and override logic on top later. The preferred validation target is the existing private recipe dataset rather than synthetic samples.
Key rule
When explicit ingredient lines are present in source input, they must be persisted losslessly and used as the sole raw ingredient source for downstream ingredient behavior.
SPEC-1-Recipe Ingredient Line Ingestion Fidelity
Background
Your recipe reader needs to support ingredient-aware behavior such as swaps, overrides, and structured parsing. Right now, ingredient data is being lost or collapsed, which breaks downstream behaviors. In the example shown, ingredient_lines_raw is empty even though the recipe clearly references ingredients in the body, and the override result degrades to a single structured ingredient (broccoli) rather than operating against preserved source lines.
Requirements
Must have
- Treat provided ingredient lines as the source of truth when they exist.
- Store raw ingredient lines in ingredient_lines_raw.
- Ensure ingredient_lines_raw is a list of strings.
- Preserve each input line as a separate list item.
- Preserve original line order.
- Do not merge adjacent lines.
- Do not rewrite, normalize, or reconstruct ingredient lines.
- Do not deduplicate ingredient lines during ingestion.
- Ignore truly empty lines, but preserve lines with visible ingredient content.
- Keep ingredient_lines_raw populated even if structured parsing fails.
- Derive structured ingredients from ingredient_lines_raw when raw lines exist.
- Do not let fallback extraction overwrite explicit ingredient-line data.
- A record with provided ingredient lines must not normalize to "ingredient_lines_raw": [] unless the source truly had none.
- Validate behavior against the existing owned recipe corpus.
Should have
- Capture ingestion provenance, such as ingredient_lines_source = upstream_explicit | inferred_body | none.
- Store a lightweight ingestion diagnostic when raw lines were expected but absent.
- Add regression fixtures from real recipes that previously failed.
- Keep parsing, rendering, and override stages independently retryable.
Could have
- Add a checksum over ordered raw ingredient lines for debugging ingestion regressions.
- Add a side-by-side debug view in Flask showing upstream lines, stored raw lines, parsed ingredients, and applied overrides.
- Preserve section headers separately later if your recipes contain grouped ingredient blocks.
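The checksum idea above could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name ingredient_lines_checksum is hypothetical, and SHA-256 with a length prefix per line is just one reasonable choice.

```python
import hashlib


def ingredient_lines_checksum(lines: list[str]) -> str:
    """Return a SHA-256 hex digest over the ordered raw ingredient lines.

    Hypothetical helper: the name and the hashing scheme are assumptions.
    """
    digest = hashlib.sha256()
    for line in lines:
        encoded = line.encode("utf-8")
        # Length-prefix each line so that ["ab", "c"] and ["a", "bc"]
        # produce different digests (line boundaries are part of the hash).
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()
```

Because the digest depends on order and line boundaries, any reordering, merging, or rewriting of lines between two ingestion runs shows up as a changed checksum.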
Won't have
- Perfect ingredient parsing in this change.
- Smarter substitution logic in this change.
- Body-text rewriting in this change.
- Near-duplicate collapsing in this change.
- Broader recipe architecture redesign in this change.
Method
Core design decision
Treat the recipe author's line structure inside the Ingredients section of Content as the raw ingredient source when no dedicated upstream ingredient field exists. In other words:
- Content remains the upstream container.
- The Ingredients section within Content is the source of truth for raw ingredient lines.
- ingredient_lines_raw must be produced by losslessly copying lines from that section, not by heuristically recognizing only lines that look parseable.
This changes the ingestion strategy from heuristic ingredient detection to section-aware line preservation.
Proposed ingestion flow
- Parse the raw Content into coarse sections such as Overview, Ingredients, Directions, Notes.
- If an Ingredients section is found, extract its lines in order.
- For each non-empty line in that section:
  - preserve the line exactly as authored
  - store it as one entry in ingredient_lines_raw
  - do not require bullets, quantities, or unit keywords
- Ignore truly empty lines.
- Optionally trim only trailing newline characters used by transport, but do not normalize internal spacing or wording.
- Only if no Ingredients section exists should fallback extraction run.
- Fallback output must be marked as inferred provenance and must never overwrite section-derived lines.
Data model
Recommended normalized recipe fields:
{
  "title": "...",
  "body": "...",
  "ingredient_lines_raw": [
    "2 zucchini",
    "1 onion",
    "garam masala"
  ],
  "ingredient_lines_source": "section_lines",
  "ingredient_lines_status": "present",
  "ingredients": [],
  "ingredients_parsed_from": "ingredient_lines_raw"
}
Recommended provenance enum:
- section_lines: extracted from an explicit Ingredients block in Content
- heuristic_body: inferred because no clear Ingredients block existed
- none: no ingredient lines found
Recommended status enum:
- present
- missing_section
- empty_section
- fallback_used
Behavioral rules
Rule 1: preserve first, parse second
ingredient_lines_raw is populated before any structured parsing runs.
Rule 2: parsing cannot destroy raw capture
If structured parsing fails, returns partial results, or produces only one detected ingredient, ingredient_lines_raw remains unchanged.
Rule 3: overrides should prefer parsed ingredients derived from raw lines
Override logic should operate on structured ingredients derived from ingredient_lines_raw. If parsing is incomplete, UI/debug views should still show the preserved raw lines so failures are visible rather than silently collapsed.
Rule 4: fallback is second-class
Fallback extraction from general prose is allowed only when no Ingredients section can be identified. Fallback-derived lines must be marked distinctly and must not be confused with authored section lines.
Why the current approach fails
Today the extractor appears to require bullets and/or quantity tokens. That means this valid authored block can be dropped:
Ingredients
zucchini
onion
green pepper
jalapeño
garam masala
Because preservation is gated by parse-like signals, ingestion loses user intent before downstream stages begin. Once that happens, override logic can only operate on whatever sparse structured inference remains.
Fixed. Check against the normalized output:
python3 - <<'PY'
import json
from pathlib import Path

# Locate the normalized output anywhere under the current directory.
paths = list(Path(".").rglob("recipes_normalized.jsonl"))
if not paths:
    print("No recipes_normalized.jsonl found under the current directory.")
    raise SystemExit(1)

path = paths[0]
print(f"Using: {path}")

# Print the preserved raw lines for every recipe that has any.
with path.open("r", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        lines = row.get("ingredient_lines_raw", [])
        if lines:
            print("\nTITLE:", row.get("title"))
            print("SOURCE:", row.get("ingredient_lines_source"))
            print("STATUS:", row.get("ingredient_lines_status"))
            print("LINES:")
            for i, item in enumerate(lines, 1):
                print(f" {i}. {item}")
PY
Sample output:
TITLE: Homegrown Pesto
SOURCE: section_lines
STATUS: present
LINES:
 1. 2 cups (60g) fresh basil leaves*
 2. 1/3 cup (48g) pine nuts*
 3. 1/3 cup (25g) freshly grated or shredded parmesan cheese
 4. 3 small cloves garlic (roasted garlic or fresh)*
 5. 1/3 cup (80ml) olive oil
 6. 1/2 fresh squeezed lemon
 7. pinch of salt
 8. freshly ground black pepper
Recommended algorithm
Section-aware extraction algorithm
Input: raw Content string
Output: ordered ingredient_lines_raw plus provenance
1. Split Content into lines preserving order.
2. Detect section headers using a configurable header matcher.
3. If an Ingredients header is found:
   a. collect subsequent lines until the next section header or end of content
   b. discard truly empty lines
   c. store each remaining line exactly as one element of ingredient_lines_raw
   d. set ingredient_lines_source = section_lines
4. Else:
   a. run existing heuristic extraction as fallback
   b. if any lines found, set ingredient_lines_source = heuristic_body
   c. otherwise set ingredient_lines_source = none
Header matcher
Start with a configurable case-insensitive list:
- ingredients
- ingredient
- for the chicken
- for the sauce
- for serving
This should be configuration, not hard-coded business logic, because your corpus may use custom section names.
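A configurable, case-insensitive matcher along these lines might look like the sketch below. The name make_header_matcher is hypothetical, and the decision to ignore a trailing colon and surrounding "#" or "*" markup is an assumption about how headers appear in the corpus, not something the current code is known to do.

```python
import re


def make_header_matcher(aliases: list[str]):
    """Build a predicate that reports whether a line is a known header.

    Hypothetical helper. Assumption: headers may carry leading '#'/'*'
    markup, surrounding whitespace, or a trailing colon, all of which are
    ignored for matching only (the stored lines are never altered here).
    """
    normalized = {a.strip().casefold() for a in aliases if a.strip()}

    def is_header(line: str) -> bool:
        # Strip decoration from both ends, then compare case-insensitively.
        candidate = re.sub(r"^[#*\s]+|[:*\s]+$", "", line).casefold()
        return candidate in normalized

    return is_header
```

Usage: build the matcher once from configuration, for example `is_header = make_header_matcher(["ingredients", "ingredient", "for the sauce"])`, then call `is_header(line)` while scanning Content.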
PlantUML
@startuml
start
:Load CSV row;
:Read Content;
:Split into ordered lines;
if (Ingredients section found?) then (yes)
:Capture section lines exactly;
:Drop only truly empty lines;
:Set source = section_lines;
else (no)
:Run heuristic fallback extraction;
if (Fallback found lines?) then (yes)
:Set source = heuristic_body;
else (no)
:Set source = none;
endif
endif
:Persist ingredient_lines_raw;
:Attempt structured parsing from ingredient_lines_raw;
:Persist parsed ingredients separately;
stop
@enduml
Storage invariants
The normalized record must satisfy:
- ingredient_lines_raw is always an array
- when source lines were captured, array length must equal number of preserved non-empty ingredient section lines
- line order equals original authored order
- duplicate lines are allowed
- parsing result cannot mutate prior raw lines
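These invariants could be checked with a small helper such as the sketch below. The function name check_raw_line_invariants and the wording of the problem messages are hypothetical; the expected_lines argument stands for the non-empty lines of the authored Ingredients section, supplied by a fixture.

```python
from typing import Optional


def check_raw_line_invariants(
    record: dict, expected_lines: Optional[list[str]] = None
) -> list[str]:
    """Return a list of invariant violations (empty list means OK).

    Hypothetical helper; message texts are illustrative only.
    """
    raw = record.get("ingredient_lines_raw")
    if not isinstance(raw, list):
        return ["ingredient_lines_raw is not a list"]

    problems = []
    if not all(isinstance(x, str) for x in raw):
        problems.append("ingredient_lines_raw contains a non-string item")
    if any(isinstance(x, str) and x.strip() == "" for x in raw):
        problems.append("ingredient_lines_raw contains an empty line")
    # List equality covers length, order, duplicates, and exact strings.
    if expected_lines is not None and raw != expected_lines:
        problems.append("ingredient_lines_raw differs from the authored lines")
    return problems
```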
Corpus-first test strategy
Use your own recipe dataset as the acceptance suite.
Create a regression folder with real examples in these buckets:
- one ingredient per line, no bullets, no quantities
- bullet list ingredients
- measured ingredients with quantities
- mixed sections such as Ingredients plus Directions
- subsection headers like For the marinade
- duplicate ingredient lines
- recipes with no clear Ingredients section
For each fixture, assert:
- expected ingredient_lines_raw
- expected ingredient_lines_source
- whether fallback was used
- structured parsing may pass or fail independently
Likely code change location
The highest-value change is to replace or front-load extract_ingredient_lines(...) with a section-aware extractor in rag_setup/import_recipes.py before current heuristic logic runs. The current heuristic extractor can remain as fallback rather than primary ingestion behavior.
Implementation
Assumptions
- The current ingestion entry point is rag_setup/import_recipes.py.
- Content contains human-authored sections.
- Existing heuristic extraction should be retained only as fallback for legacy records without a recognizable Ingredients block.
- The first goal is safe ingestion fidelity against your corpus, not perfect parsing.
Step 1: Introduce a section-aware extractor
Add a new function before the current heuristic extractor, for example:
def extract_ingredient_section_lines(content: str, header_aliases: list[str]) -> tuple[list[str], str]:
    """
    Returns (ingredient_lines_raw, ingredient_lines_source)
    source is one of: section_lines, none
    """
Behavior:
- split content into ordered lines
- detect an Ingredients header using configurable aliases
- capture subsequent lines until the next recognized section header or end of content
- ignore truly empty lines
- preserve every other line exactly as written
- return ([], "none") if no ingredient section is found
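One possible body for that function, following the behavior listed above, is sketched here. It is an illustration under assumptions: the third parameter section_headers and its default tuple are hypothetical additions, and a header is assumed to be a line that equals a known name once surrounding whitespace and a trailing colon are ignored.

```python
DEFAULT_SECTION_HEADERS = (
    "overview", "ingredients", "instructions", "directions", "method", "notes",
)


def _header_key(line: str) -> str:
    # Assumption: headers match after trimming whitespace and one trailing colon.
    return line.strip().rstrip(":").strip().casefold()


def extract_ingredient_section_lines(
    content: str,
    header_aliases: list[str],
    section_headers: tuple[str, ...] = DEFAULT_SECTION_HEADERS,  # hypothetical
) -> tuple[list[str], str]:
    """Return (ingredient_lines_raw, ingredient_lines_source)."""
    aliases = {a.casefold() for a in header_aliases}
    boundaries = {h.casefold() for h in section_headers}
    captured: list[str] = []
    in_section = False

    # splitlines() drops only the line terminators; other characters,
    # including leading and internal spacing, are left untouched.
    for line in content.splitlines():
        key = _header_key(line)
        if not in_section:
            if key in aliases:
                in_section = True
            continue
        if key in boundaries:
            break  # next recognized section header ends the block
        if line.strip() == "":
            continue  # ignore truly empty lines
        captured.append(line)  # no bullets, digits, or units required

    return (captured, "section_lines") if captured else ([], "none")
```

With the two-value return shape, a header that is present but followed by no lines is indistinguishable from a missing section; distinguishing the two would need an extra flag such as empty_ingredients_section from the caller.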
Step 2: Flag missing ingredient sections for review
When no recognizable Ingredients section is found, the system should explicitly flag the recipe for review instead of silently treating prose extraction as equivalent source data.
Suggested behavior:
def extract_ingredient_lines_with_review_flag(content: str) -> tuple[list[str], str, list[str]]:
    lines, source = extract_ingredient_section_lines(content, HEADER_ALIASES)
    if lines:
        return lines, "section_lines", []
    return [], "none", ["missing_ingredients_section"]
This makes the missing state visible and reviewable.
Optional secondary mode for tooling only:
- allow a separate operator/debug action to run heuristic extraction for suggestion purposes
- store those suggestions separately from ingredient_lines_raw
- never promote suggestions into source-of-truth fields without review
Recommended suggested field for operator assistance:
record["ingredient_line_suggestions"] = heuristic_suggestions
This field must be treated as advisory only.
Step 3: Persist provenance in the normalized record
Update normalization output to always emit:
- ingredient_lines_raw
- ingredient_lines_source
Example:
record["ingredient_lines_raw"] = ingredient_lines_raw
record["ingredient_lines_source"] = ingredient_lines_source
Optional but useful:
record["ingredient_lines_status"] = (
    "present" if ingredient_lines_raw else "missing_review_required"
)
record["ingestion_flags"] = ingestion_flags
Possible flags:
- `missing_ingredients_section`
- `empty_ingredients_section`
- `section_header_ambiguous`
Step 4: Make parsing downstream-only
Any structured ingredient parsing should consume ingredient_lines_raw and must not repopulate or replace it from body.
Safe contract:
record["ingredients"] = parse_structured_ingredients(record["ingredient_lines_raw"])
Not allowed:
- reconstructing ingredient_lines_raw from parsed ingredients
- clearing ingredient_lines_raw when parsing fails
- replacing section-derived lines with fallback-derived lines later in the pipeline
If ingredient_lines_status == "missing_review_required", parsing should either be skipped or clearly marked as operating on no approved source lines.
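The downstream-only contract could be enforced with a wrapper like the sketch below. The wrapper name attach_structured_ingredients and the ingredients_parse_error field are hypothetical; the parser argument stands in for parse_structured_ingredients, whose real signature is assumed to take a list of strings.

```python
from typing import Callable


def attach_structured_ingredients(
    record: dict, parser: Callable[[list[str]], list[dict]]
) -> dict:
    """Populate record["ingredients"] without ever touching the raw lines.

    Hypothetical wrapper; `parser` stands in for the real parsing function.
    """
    raw = record.get("ingredient_lines_raw", [])
    try:
        # Hand the parser a copy so in-place edits cannot reach stored lines.
        parsed = parser(list(raw))
    except Exception as exc:
        # Parsing failure degrades "ingredients" only; raw capture survives.
        parsed = []
        record["ingredients_parse_error"] = str(exc)  # hypothetical field
    record["ingredients"] = parsed
    return record
```

The wrapper never assigns to ingredient_lines_raw, so a failing or misbehaving parser can at worst leave ingredients empty or partial.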
Step 5: Add regression fixtures from your own corpus
Create real-input fixtures from recipes that currently fail. For each fixture, store:
- source Content
- expected ingredient_lines_raw
- expected ingredient_lines_source
Recommended fixture buckets:
- one ingredient per line with no bullets
- ingredient lines with bullets
- ingredient lines with quantities
- mixed sectioned recipes
- recipes with no Ingredients header
- recipes with ambiguous headers that should be flagged
- recipes with an Ingredients header but no captured lines
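A fixture runner for these buckets might be sketched as follows. The JSON key names (content, expected_ingredient_lines_raw, expected_ingredient_lines_source), the one-file-per-fixture layout, and both function names are assumptions made for illustration; the extractor argument is any callable that returns (lines, source) for a Content string.

```python
import json
from pathlib import Path


def run_fixture(fixture: dict, extractor) -> list[str]:
    """Compare extractor output with one fixture; return failure messages."""
    lines, source = extractor(fixture["content"])
    failures = []
    # Exact list equality checks strings, order, and duplicates in one step.
    if lines != fixture["expected_ingredient_lines_raw"]:
        failures.append("ingredient_lines_raw mismatch")
    if source != fixture["expected_ingredient_lines_source"]:
        failures.append("ingredient_lines_source mismatch")
    return failures


def run_fixture_dir(directory: str, extractor) -> dict[str, list[str]]:
    """Run every *.json fixture in a folder; return only the failing ones."""
    results = {}
    for path in sorted(Path(directory).glob("*.json")):
        fixture = json.loads(path.read_text(encoding="utf-8"))
        failures = run_fixture(fixture, extractor)
        if failures:
            results[path.name] = failures
    return results
```

Structured parsing is deliberately left out of the comparison, so a fixture can pass ingestion even when the parser fails on its lines.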
Step 6: Add invariant tests
Tests should assert:
assert isinstance(record["ingredient_lines_raw"], list)
assert all(isinstance(x, str) for x in record["ingredient_lines_raw"])
For section-derived fixtures:
- exact string equality per line
- exact ordering
- duplicates preserved
- non-empty section lines are not dropped because they lack digits or units
For parser-failure scenarios:
- ingredient_lines_raw remains populated
- ingredients may be partial or empty without failing ingestion
Step 7: Improve debug visibility in Flask
In the recipe debug view, display these fields together:
- ingredient_lines_source
- ingredient_lines_status
- ingestion_flags
- ingredient_lines_raw
- optional ingredient_line_suggestions
- parsed ingredients
- applied overrides
UI expectation:
- if no Ingredients section is found, show a visible review warning
- if no Ingredients section is found, show a visible review warning
- allow you to inspect the original Content
- optionally show heuristic suggestions in a separate clearly-labeled block such as Suggested ingredient lines (unapproved)
That will make it obvious whether a failure is due to ingestion, parsing, missing section structure, or override logic.
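The data behind such a view could be assembled by a plain function like the sketch below, which a Flask route could then pass to a template. The function name build_debug_payload and the needs_review key are hypothetical; the other keys are the record fields named in this spec.

```python
def build_debug_payload(record: dict, applied_overrides: list[dict]) -> dict:
    """Collect the fields the debug view shows side by side.

    Hypothetical helper; it only reads from the record and copies lists so
    the view layer cannot modify stored data.
    """
    flags = list(record.get("ingestion_flags", []))
    return {
        "ingredient_lines_source": record.get("ingredient_lines_source", "none"),
        "ingredient_lines_status": record.get("ingredient_lines_status"),
        "ingestion_flags": flags,
        "ingredient_lines_raw": list(record.get("ingredient_lines_raw", [])),
        # Advisory only: rendered in a separate, clearly labeled block.
        "ingredient_line_suggestions": list(
            record.get("ingredient_line_suggestions", [])
        ),
        "ingredients": list(record.get("ingredients", [])),
        "applied_overrides": list(applied_overrides),
        # Drives the visible review warning in the UI.
        "needs_review": "missing_ingredients_section" in flags,
        "body": record.get("body", ""),
    }
```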
Suggested rollout sequence
- implement section-aware extraction
- keep old heuristic logic as fallback
- regenerate a sample normalized output from your corpus
- verify recipes with missing Ingredients sections are visibly flagged for review
- diff old vs new ingredient_lines_raw
- run override tests on recipes containing substitutions like zucchini → broccoli
- promote once lossless capture is verified
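For the old-versus-new comparison in the rollout, a per-recipe diff could be produced with the standard library's difflib, as in this sketch. The function name diff_raw_lines and the old/ and new/ labels are hypothetical.

```python
import difflib


def diff_raw_lines(old: list[str], new: list[str], title: str = "") -> list[str]:
    """Return a unified diff between two ingredient_lines_raw lists.

    Hypothetical helper. An empty result means the lists are identical.
    """
    return list(
        difflib.unified_diff(
            old,
            new,
            fromfile=f"old/{title}",
            tofile=f"new/{title}",
            lineterm="",  # list items carry no trailing newline
        )
    )
```

Running this over every recipe in the old and new normalized outputs gives a reviewable list of exactly which lines were gained, lost, or reordered by the new extractor.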
Minimal pseudocode
HEADER_ALIASES = {
    "ingredients",
    "ingredient",
}

SECTION_HEADERS = {
    "overview",
    "ingredients",
    "instructions",
    "directions",
    "method",
    "notes",
}

def normalize_recipe(row: dict) -> dict:
    content = row.get("Content", "") or ""
    ingredient_lines_raw, ingredient_lines_source, ingestion_flags = extract_ingredient_lines_with_review_flag(content)
    record = {
        "title": row.get("Recipe", ""),
        "body": content,
        "dish": row.get("Dish", ""),
        "stage": row.get("Stages", ""),
        "food_pics": row.get("food_pics", ""),
        "ingredient_lines_raw": ingredient_lines_raw,
        "ingredient_lines_source": ingredient_lines_source,
        "ingredient_lines_status": "present" if ingredient_lines_raw else "missing_review_required",
        "ingestion_flags": ingestion_flags,
    }
    if record["ingredient_lines_raw"]:
        record["ingredients"] = parse_structured_ingredients(record["ingredient_lines_raw"])
    else:
        record["ingredients"] = []
    return record
Definition of done
This implementation is complete when:
- recipes with one-authored-line-per-ingredient populate ingredient_lines_raw
- the array preserves exact line boundaries and ordering
- recipes without an Ingredients section are flagged for your review
- heuristic extraction, if retained, is suggestions-only unless explicitly approved
- parsing failures do not empty raw lines
- overrides can be tested against recipes from your own dataset with visible raw-line provenance
Milestones
TBD
Gathering Results
TBD