It's still out there! You landed in CSV because "something-somewhere" still speaks CSV, not because your system wanted it.
And now you’ve quietly corrected course:
👉 CSV = extraction format
👉 JSONL = working format
That’s the right architecture.
Why you’re loving JSONL (and should)
“One line per row” is doing a lot of heavy lifting:
- each record is independent
- no column alignment drama
- no quoting gymnastics
- no guessing where a field ends
- easy to stream, filter, debug
You can literally:
grep '"classification": "recipe_partial"' rag_setup/recipes_normalized.jsonl
and it just works. No parser needed. That’s power.
The hidden win
You’ve separated concerns cleanly:
Backdrop (Views) → CSV ← selected by choice
CSV → JSONL ← your normalization step
JSONL → app ← your actual system
So now:
- if "something-somewhere" changes → only CSV layer is affected
- your app stays stable
- your pipeline stays deterministic
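That normalization step is small enough to sketch. A minimal version, assuming the paths from this post and a CSV export with headers (real classification logic would live in your pipeline, not here):

```python
import csv
import json

def csv_to_jsonl(src_path, dst_path):
    """Convert a CSV export into JSONL: one JSON object per line."""
    with open(src_path, encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Each CSV row becomes a standalone JSON object on its own line.
            dst.write(json.dumps(dict(row), ensure_ascii=False) + "\n")
```

Because each output line is independent, changes on the CSV side stay contained in this one function.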
This is the real upgrade
You didn’t just fix the data.
You created a pipeline boundary.
And that’s why this feels better:
👉 You’re no longer “working with export data”
👉 You’re working with your own normalized dataset
One-line truth
CSV got the data out.
JSONL makes it usable.
And yeah… one-line-per-object is ridiculously satisfying once you’ve fought CSV long enough 😄
Yes, JSONL. And yes, there is a plan behind the pick.
How to access it
A .jsonl file is:
- one JSON object per line
- easy to stream
- easy to debug
- easy to grep
- easy to append/rebuild
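The append point deserves a one-liner of its own: adding a record never requires rewriting the file. A sketch (function name is illustrative):

```python
import json

def append_record(path, record):
    """Append one record as a single JSON line; no rewrite needed."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Rebuilding is the same idea in reverse: write the whole file fresh, one line per record.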
Your file is here:
rag_setup/recipes_normalized.jsonl
Quick ways to look at it
Show first few lines
head -n 5 rag_setup/recipes_normalized.jsonl
Pretty-print first few records
python - <<'PY'
import json

with open("rag_setup/recipes_normalized.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        obj = json.loads(line)
        print("\n---")
        print(json.dumps(obj, indent=2, ensure_ascii=False))
        if i == 2:
            break
PY
Read it in Python
import json

rows = []
with open("rag_setup/recipes_normalized.jsonl", encoding="utf-8") as f:
    for line in f:
        rows.append(json.loads(line))

print(rows[0]["title"])
Filter recipe-like rows
python - <<'PY'
import json

with open("rag_setup/recipes_normalized.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        if row.get("classification") in {"recipe_full", "recipe_partial"}:
            print(row["title"])
PY
Why JSONL was a good pick
Because it sits nicely between raw CSV and a “real database.”
CSV is good for:
- export
- transport
- simple tabular fields
JSONL is better for:
- normalized records
- nested fields later
- incremental pipeline stages
- line-by-line processing
- keeping each row as a standalone object
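An incremental stage in that style is just read lines → transform → write lines. A sketch, with an illustrative function name (`transform` returns a new record, or None to drop the row):

```python
import json

def run_stage(src_path, dst_path, transform):
    """One pipeline stage: stream records in, write transformed records out."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            out = transform(json.loads(line))
            if out is not None:
                dst.write(json.dumps(out, ensure_ascii=False) + "\n")
```

Stages chain naturally: each one's output JSONL is the next one's input.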
That matters because your data is already outgrowing plain flat CSV thinking.
For example, later you may want each row to hold things like:
{
  "title": "Marinated Beans",
  "classification": "recipe_partial",
  "servings": 2,
  "ingredients_raw": [
    "1 small shallot, finely chopped",
    "2-4 tbsp white wine vinegar"
  ]
}
That is natural in JSONL and awkward in CSV.
The real plan behind it
Think of it like this:
Backdrop CSV
→ normalize/clean/classify
→ JSONL
→ build richer base_recipe objects
→ feed app/tests/effective_recipe
So JSONL is the staging format.
Not the final destination, not the source of truth, but a very handy workbench.
One-line answer
You access it by reading it one line at a time as JSON; it was chosen because it beats CSV for pipeline-friendly, semi-structured recipe records.