
An Overview of Scenarios Already Completed on Eckford

1. AI/ML-adjacent pipeline development

We built a multi-step processing pipeline around recipes.

The pipeline moved roughly like this:

Backdrop CMS recipe content
        ↓
CSV export
        ↓
Python import / normalization
        ↓
recipe JSONL
        ↓
runtime recipe catalog
        ↓
ingredient extraction
        ↓
nutrition input filtering
        ↓
nutrition lookup
        ↓
runtime-ready recipe data

The important part: we separated the work into stages.

Each stage had a job:

Source Door              [DONE]
Retrieval Door           [DONE]
Classification Door      [DONE]
Nutrition Input Layer    [NOW BUILDING]
Nutrition Contract Door  [READY]

That “doors” idea mattered because it prevented the project from becoming one giant soup pot.

Each door answered:

Do we have source data?
Can we retrieve it?
Can we classify it?
Can we prepare nutrition inputs?
Can we define the contract for nutrition output?

This is pipeline thinking. It is ML-adjacent because, before any model or calculation can work, the input data has to be shaped, filtered, and validated.
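
The door idea can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the check functions, the `state` dict, and the `ingredient_lines` key are hypothetical names chosen for the example.

```python
# Hypothetical door checks: each inspects a shared state dict and returns True or False.
def source_door(state):
    return len(state.get("rows", [])) > 0

def retrieval_door(state):
    return all(row.get("Recipe") for row in state.get("rows", []))

def classification_door(state):
    return all("ingredient_lines" in row for row in state.get("rows", []))

# Doors are ordered; a later door is never checked before an earlier one passes.
DOORS = [
    ("Source Door", source_door),
    ("Retrieval Door", retrieval_door),
    ("Classification Door", classification_door),
]

def run_doors(state):
    """Run doors in order and stop at the first one that does not pass."""
    report = []
    for name, check in DOORS:
        passed = bool(check(state))
        report.append((name, "DONE" if passed else "BLOCKED"))
        if not passed:
            break
    return report
```

Stopping at the first blocked door is the point: a later stage never runs on data that an earlier stage has not vouched for.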


2. Data normalization

The recipe data started in Backdrop CMS.

The exported data was not naturally “machine clean.” It had fields like:

Recipe
Content
Dish
Stage
food_pics
Season
status
Post date

We used Python scripts to convert that into normalized runtime files.

The key files were something like:

recipes.csv
recipes_normalized.jsonl
runtime_recipes.json
runtime_build_summary.json

The normalization goal was:

Keep source content unchanged.
Extract what we need.
Flag what is missing.
Build clean runtime data.

So instead of editing the original recipe text directly, we created structured derived files.

That was a good design choice.

It meant:

Backdrop remains source of truth.
Python processing creates machine-readable outputs.
Problems become flags, not silent failures.

Very Get Smart. Very “don’t lie to the machine.” 🕵️
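
A minimal sketch of that normalization step, assuming a CSV export with the column names listed above. The function names, the output keys, and the choice of required fields are assumptions for illustration; the real scripts may differ.

```python
import csv
import io
import json

# Assumption: these two columns must be present for a row to be usable.
REQUIRED_FIELDS = ("Recipe", "Content")

def normalize_row(row):
    """Build a derived record; the source text is copied through, never edited."""
    flags = [
        f"missing_{name.lower()}"
        for name in REQUIRED_FIELDS
        if not (row.get(name) or "").strip()
    ]
    return {
        "title": (row.get("Recipe") or "").strip(),
        "content": row.get("Content") or "",          # unchanged source content
        "dish": (row.get("Dish") or "").strip() or None,
        "flags": flags,                               # problems become flags
    }

def csv_to_jsonl(csv_text):
    """Convert exported CSV text into one JSON object per line (JSONL)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(normalize_row(row)) for row in reader)
```

A row with an empty `Recipe` column still produces a record, but it carries a `missing_recipe` flag instead of failing silently.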


3. Classification gates

This was the big one.

We realized that not every ingredient line should be used for nutrition calculation.

A recipe might contain:

salt
pepper
basil
oregano
garlic powder
1 tablespoon olive oil
2 cups chickpeas
1 pound pasta

For nutrition, some lines matter more than others.

So we created the concept of classification gates:

candidate ingredient line?
        ↓
primary ingredient?
        ↓
herb/spice/seasoning?
        ↓
nutrition-relevant?
        ↓
send to lookup or skip?

The classification gate separated:

primary ingredients

from:

herbs
spices
seasonings
minor flavorings

That gave us this rule:

Nutrition input is derived from ingredient lines by:
- filtering candidate lines
- classifying herb/spice/seasoning vs primary
- selecting primary lines only
- passing normalized lines to deterministic lookup

Source data remains unchanged.

That is the heart of the work.

It was not just “calculate calories.”
It was deciding what deserves to be calculated.

That is very close to applied ML thinking, even if the classifier was not yet a trained scikit-learn model.
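
Here is a rough, rule-based sketch of the gate, using the example lines from above. The seasoning set, the regular expression, and the function names are assumptions made for the example; a real gate would need a much larger vocabulary and more unit handling.

```python
import re

# Assumption: a tiny hand-written seasoning list, for illustration only.
SEASONINGS = {"salt", "pepper", "basil", "oregano", "garlic powder"}

# Optional quantity, optional unit, then the ingredient name.
LINE_PATTERN = re.compile(
    r"^\s*(?P<qty>\d+(?:[./]\d+)?)?\s*"
    r"(?P<unit>(?:cups?|tablespoons?|teaspoons?|pounds?)\b)?\s*"
    r"(?P<name>\S.*?)\s*$",
    re.IGNORECASE,
)

def classify_line(line):
    """Gate one ingredient line: not a candidate, seasoning, or primary."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return {"line": line, "kind": "not_candidate", "send_to_lookup": False}
    name = match.group("name").lower()
    kind = "seasoning" if name in SEASONINGS else "primary"
    return {
        "line": line,                       # original text is kept as-is
        "name": name,
        "quantity": match.group("qty"),
        "unit": match.group("unit"),
        "kind": kind,
        "send_to_lookup": kind == "primary",
    }

def nutrition_inputs(lines):
    """Select primary lines only; everything else is skipped, not deleted."""
    return [r for r in map(classify_line, lines) if r["send_to_lookup"]]
```

Run on the eight example lines, this keeps the olive oil, chickpeas, and pasta lines and skips the five seasonings.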


4. Vector retrieval

We also worked with vector/retrieval concepts.

The Chroma/vector side was there to help with recipe and nutrition search. The important architectural decision was:

Chroma can help retrieve or compare information.
Chroma is not the source of truth for nutrition.

The source of truth was the structured nutrition dataset, especially:

eurofir_mediterranean.csv

That distinction mattered.

Vector retrieval is fuzzy and useful. Nutrition calculation needs traceability.

So the pattern became:

Use retrieval to help find candidates.
Use deterministic lookup for final nutrition values.

That is exactly the right instinct for this kind of system.

No “the embedding felt hungry so it guessed 400 calories.” Absolutely not. Fork confiscated. 🍴
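
The two-step pattern can be sketched as below. To keep the example self-contained, the fuzzy step uses the standard library's `difflib` as a stand-in for Chroma, and the table is a tiny in-memory dict rather than the real CSV. The keys, record ids, and numbers are placeholders, not real nutrition figures.

```python
import difflib

# Stand-in for rows loaded from the nutrition CSV.
# Placeholder keys and values for illustration only, not real data.
NUTRITION_TABLE = {
    "chickpeas, canned": {"record_id": "demo-001", "kcal_per_100g": 100.0},
    "olive oil": {"record_id": "demo-002", "kcal_per_100g": 200.0},
    "pasta, dry": {"record_id": "demo-003", "kcal_per_100g": 300.0},
}

def retrieve_candidates(query, limit=3):
    """Fuzzy step: suggest table keys that look similar to the query."""
    return difflib.get_close_matches(
        query.lower(), list(NUTRITION_TABLE), n=limit, cutoff=0.4
    )

def deterministic_lookup(key):
    """Exact step: return the stored record for this key, or None. Never guesses."""
    record = NUTRITION_TABLE.get(key)
    return None if record is None else {"key": key, **record}

def resolve(query):
    """Retrieval proposes candidates; values only ever come from the table."""
    candidates = retrieve_candidates(query)
    if not candidates:
        return {"query": query, "status": "review_required", "record": None}
    return {
        "query": query,
        "status": "matched",
        "candidates": candidates,          # kept for traceability
        "record": deterministic_lookup(candidates[0]),
    }
```

The retrieval step can be swapped for a vector store without touching `deterministic_lookup`, which is the part that has to stay traceable.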


5. Agent/tool integration

We also had an agent/tool layer around the project.

The tools/frameworks in play included:

Python
Chainlit
OpenAI Agents SDK
Chroma / chromadb
Flask viewer or admin pages
CSV / JSONL / JSON runtime files
recipe_catalog.py
nutrition_lookup.py
nutrition_calculator.py

The agent idea was not “let the AI invent nutrition.”

The safer design was:

Agent can help inspect, retrieve, explain, or route.
Tools do the deterministic work.
Structured data provides the answer.

That is a good architecture.

The agent is the helpful clerk.
The nutrition calculator is the ledger.
The source CSV is the law book.

Tiny bureaucracy, but useful. 🧾
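
A plain-Python sketch of the clerk/ledger split, with no agent framework involved. The tool name, the `LEDGER` dict, and the `route` function are all hypothetical; this is not the OpenAI Agents SDK API, just the shape of the idea.

```python
# Stand-in for the structured nutrition data (placeholder record id).
LEDGER = {"olive oil": "demo-002"}

def lookup_ingredient(name):
    """Deterministic tool: exact match against structured data, or a review flag."""
    record_id = LEDGER.get(name.strip().lower())
    if record_id is None:
        return {"name": name, "status": "review_required"}
    return {"name": name, "status": "matched", "record_id": record_id}

# Only registered tools can be called; the agent cannot invent new ones.
TOOLS = {"lookup_ingredient": lookup_ingredient}

def route(tool_name, **kwargs):
    """Clerk step: pick a registered tool and pass the request through."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        return {"status": "unknown_tool", "tool": tool_name}
    return tool(**kwargs)
```

Whatever sits in front of `route` (a chat UI, an agent loop) can only ask; the answer always comes from a tool reading structured data.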


6. Nutrition calculation from structured data

The nutrition calculation piece depended on turning recipe ingredients into lookup-ready inputs.

So the pipeline needed to know things like:

What is the ingredient?
How much is there?
Is the unit usable?
Is this a primary ingredient?
Can it be matched to nutrition data?
Should it be skipped or flagged?

The calculation depended on structured records, not raw prose.

The nutrition system was moving toward this kind of contract:

normalized ingredient line
quantity
unit
ingredient name
classification
lookup candidate
matched nutrition record
confidence / review flag
calculated nutrition values
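
One way to pin that contract down is a small dataclass. The field names below mirror the list above, but the class name, the types, and the readiness rule are assumptions for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class NutritionInput:
    normalized_line: str
    quantity: Optional[float]
    unit: Optional[str]
    ingredient_name: str
    classification: str                      # e.g. "primary" or "seasoning"
    lookup_candidate: Optional[str] = None
    matched_record_id: Optional[str] = None
    review_flag: Optional[str] = None        # set when a human should look
    calculated: Optional[dict] = None        # filled in only after lookup

    def is_lookup_ready(self):
        """Assumed rule: primary, has quantity and unit, and is not flagged."""
        return (
            self.classification == "primary"
            and self.quantity is not None
            and self.unit is not None
            and self.review_flag is None
        )
```

For example, `NutritionInput("2 cups chickpeas", 2.0, "cups", "chickpeas", "primary")` is lookup-ready, while a bare `salt` line with no quantity is not. `asdict(...)` turns a record into a plain dict for writing to JSON.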

And when the recipe was not ready, it got flagged.

Examples of skip/review reasons included:

missing ingredient section
insufficient structured ingredients
possible truncation
not recipe full
recipe partial

That was a major piece of the work: not pretending every recipe was ready.

The build summary gave a reality check:

runtime-ready recipes
skipped recipes
missing ingredient sections
partial recipes
review-needed items

Again: evidence first. The machine does not get to swagger.
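
A build summary like that can be produced with a simple count. This sketch assumes each record is a dict with an optional `skip_reason` key; that key and the output field names are hypothetical, not the actual layout of runtime_build_summary.json.

```python
from collections import Counter

def build_summary(records):
    """Count ready vs skipped recipes and tally the skip reasons."""
    reasons = Counter(r["skip_reason"] for r in records if r.get("skip_reason"))
    ready = sum(1 for r in records if not r.get("skip_reason"))
    return {
        "total": len(records),
        "runtime_ready": ready,
        "skipped": len(records) - ready,
        "skip_reasons": dict(reasons),
    }
```

The summary is derived from the records themselves, so the counts cannot drift away from what was actually built.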


The actual achievement

The main thing you did was not “train a recipe model.”

You built the foundation that a model would need.

That foundation includes:

clean recipe exports
normalized runtime data
ingredient extraction
classification rules
nutrition-relevant filtering
deterministic nutrition lookup
review flags
runtime summaries
agent/tool architecture
vector retrieval support

That is exactly the kind of pre-ML work that decides whether an ML system succeeds or turns into a glittering garbage barge.

The best plain-English summary

We built a Python-based recipe intelligence pipeline for Get Smart.

It took messy recipe content from Backdrop CMS, normalized it into structured runtime files, identified usable ingredient lines, separated primary ingredients from herbs and seasonings, and prepared clean nutrition inputs for deterministic lookup against a European nutrition dataset.

The system preserved the original source content, produced review flags when recipes were incomplete, and used retrieval/vector tooling and agent-style interfaces as support layers rather than as the final authority.

Where this can go next

The next “Yes” step is to add a small, real ML layer on top of this existing foundation.

The best candidates are:

1. scikit-learn classification
   Predict whether a recipe is nutrition-ready or review-required.

2. scikit-learn clustering
   Group recipes by ingredient patterns, dish type, or nutrition profile.

3. scikit-learn regression
   Estimate calories or nutrition bands from structured ingredient features.

4. dimensionality reduction
   Visualize recipes based on ingredient/nutrition similarity.

For Friday/Saturday, the cleanest first bite is probably:

Build a scikit-learn classifier that predicts:
nutrition_ready vs review_required

Because you already have the raw material:

recipe fields
status
stage
dish
ingredient structure
skip reasons
review flags
runtime-ready status

That would turn your existing Get Smart work from:

AI/ML-adjacent pipeline

into:

hands-on scikit-learn classification project

That is the bridge. The planks are already stacked beside the river.
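
A first plank might look like the sketch below. It requires scikit-learn to be installed. The rows are toy, hand-written examples, not project data, and the feature names (`dish`, `n_ingredient_lines`, `has_ingredient_section`) are assumptions about what could be derived from the runtime files. Skip reasons are deliberately left out of the features, since they would leak the label.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy rows for illustration only; real features would come from the runtime files.
rows = [
    {"dish": "main", "n_ingredient_lines": 8, "has_ingredient_section": 1},
    {"dish": "dip", "n_ingredient_lines": 6, "has_ingredient_section": 1},
    {"dish": "salad", "n_ingredient_lines": 10, "has_ingredient_section": 1},
    {"dish": "main", "n_ingredient_lines": 0, "has_ingredient_section": 0},
    {"dish": "dip", "n_ingredient_lines": 1, "has_ingredient_section": 0},
    {"dish": "salad", "n_ingredient_lines": 0, "has_ingredient_section": 0},
]
# Labels would come from the existing review flags.
labels = [
    "nutrition_ready", "nutrition_ready", "nutrition_ready",
    "review_required", "review_required", "review_required",
]

# DictVectorizer one-hot encodes string values and passes numbers through.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(rows)

model = DecisionTreeClassifier(random_state=0)
model.fit(X, labels)

new_recipe = {"dish": "main", "n_ingredient_lines": 7, "has_ingredient_section": 1}
prediction = model.predict(vectorizer.transform([new_recipe]))[0]
```

With real data the next steps would be a train/test split and a look at which recipes the model gets wrong, but six toy rows are enough to show the moving parts.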