TransferDepot Detection Roadmap (Recovered + Refined)

  • Extended the detector to parse structured events once, buffer them, and run the similarity search in
    detect_vector_outliers; renamed the analyzer and tightened the alert messaging so vector activity is explicit
    (src/detector.py:20, 293-314, 320-333, 338-386).
  • Parameterized log locations so you can point the run at any folder with TD_PATH=..., important for swapping
    between TD logs and demo data, especially offline (src/detector.py:16-19, 338-346).
  • Built a comprehensive demo log (data/demo.log:1-26) that exercises every behavioral pattern plus two
    intentionally weird lines (“taking the train…” and the fake private key) to push the FAISS distance over
    THRESHOLD and prove that vector embeddings catch unfamiliar content.
  • Validated end-to-end by running TD_PATH=./data ./venv/bin/python src/detector.py, watching the console for
    the [distance] :: … dump and checking the emitted alert file; also confirmed you can syntax-check anywhere
    with ./venv/bin/python -m py_compile src/detector.py.
  • Updated the “no findings” text to [+] No vector anomalies detected so you know the embedding pass executed
    even when nothing crosses the threshold (src/detector.py:319-323).

 

🧭 TransferDepot Detection Roadmap (Recovered + Refined)

Think of this as your threat radar bring-up sequence.

1) ✅ Parse Once, Trust Forever

You already did this:

  • parse_line() extracts:
    • user
    • file
    • size
    • action/group (implied)

Why it matters:
Everything downstream becomes deterministic instead of regex soup.

👉 Status: DONE


2) 🔁 Pattern: Rapid Repeats (Burst Detection)

You mentioned this explicitly.

Detect:

  • Same user
  • Same action
  • Within short time window

This is your:

“why is this guy hammering the system like a woodpecker on espresso?”

👉 Tune knobs:

  • time window (e.g. 5–30 sec)
  • threshold (e.g. >5 events)

👉 Status: IMPLEMENTED → needs tuning
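A minimal sliding-window sketch of this detector, assuming parsed events carry a datetime `ts` plus `user` and `action`. Names like `WINDOW_SECONDS` and `detect_bursts` are illustrative, not the detector's actual internals:

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW_SECONDS = 10   # tune knob: 5-30 sec
MAX_EVENTS = 5        # tune knob: >5 events in window = burst

def detect_bursts(events):
    """events: list of dicts with 'ts' (datetime), 'user', 'action', sorted by time."""
    alerts = []
    windows = {}  # (user, action) -> deque of timestamps inside the window
    for e in events:
        key = (e["user"], e["action"])
        q = windows.setdefault(key, deque())
        q.append(e["ts"])
        # evict timestamps that have fallen out of the window
        while q and (e["ts"] - q[0]) > timedelta(seconds=WINDOW_SECONDS):
            q.popleft()
        if len(q) > MAX_EVENTS:
            alerts.append(f"Burst: {len(q)}x {e['action']} by {e['user']}")
    return alerts
```

The deque-per-(user, action) keeps the check O(1) amortized per event, so it scales to real log volumes.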


3) 🔄 Pattern: File Reuse (You called this out)

Detect:

  • Same file touched repeatedly
  • Across:
    • same user OR
    • multiple users

This is:

“this file is getting passed around like contraband”

👉 Two modes:

  • single-user loop → scripting / retry bug
  • multi-user reuse → distribution / staging behavior

👉 Status: IMPLEMENTED → high-value signal


4) 🔁 Pattern: Loops / Cycles

Detect:

  • upload → download → upload → download

Same file, same or different users.

This is:

“the system is eating its own tail”

Often indicates:

  • automation gone wrong
  • integration loop
  • retry storm

👉 Status: IMPLEMENTED


5) 🌍 Pattern: Cross-Group Movement

Detect:

  • file appears in multiple groups (rs2 → ttcs → nda)

This is your:

“data walking across zones”

In your environment, this is 🔥 important:

  • breaks isolation assumptions
  • may be legit… or not

👉 Status: IMPLEMENTED


6) 👥 Pattern: User Spread

Detect:

  • many users touching same file

This becomes:

“blast radius”

Useful for:

  • impact analysis
  • identifying shared artifacts

👉 Status: IMPLEMENTED


7) ⚖️ Pattern: Action Imbalance

Detect:

  • too many downloads vs uploads (or vice versa)

Example:

  • 1 upload → 200 downloads

That’s:

“fan-out event” or possible data exfil

👉 Status: IMPLEMENTED


8) 📦 Pattern: Size Anomaly

You wired size into parsing — good instinct.

Detect:

  • unusually large file
  • unusually small file in context

This is:

“that doesn’t look like the others…”

Air-gapped environments LOVE this signal.

👉 Status: IMPLEMENTED but needs baseline logic
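One way to supply the missing baseline logic: compare each size to the median for its group and flag large deviations. A sketch under assumed event dicts; the 10x ratio is an arbitrary starting point, not a recommendation:

```python
from statistics import median

def detect_size_anomalies(events, ratio=10):
    """events: dicts with 'group', 'file', 'size'. Flags sizes far from the group median."""
    by_group = {}
    for e in events:
        if e.get("size") is not None:
            by_group.setdefault(e["group"], []).append(e)
    alerts = []
    for group, evs in by_group.items():
        if len(evs) < 3:
            continue  # too few samples for a meaningful baseline
        base = median(e["size"] for e in evs)
        for e in evs:
            if base and (e["size"] > base * ratio or e["size"] < base / ratio):
                alerts.append(
                    f"Size anomaly in {group}: {e['file']} at {e['size']} bytes (median {base})"
                )
    return alerts
```

Median beats mean here because one 8 MB outlier would drag a mean baseline toward itself and hide the very anomaly you want.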


9) 🧬 Pattern: Sequence Anomaly

Detect:

  • unexpected order of actions

Example:

  • download before upload
  • delete without prior existence

This is:

“temporal nonsense”

👉 Status: IMPLEMENTED
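A sketch of both checks, assuming time-sorted event dicts (function and field names are illustrative):

```python
def detect_sequence_anomalies(events):
    """events: dicts with 'ts', 'action', 'file'. Flags temporally impossible orderings."""
    seen_upload = set()
    alerts = []
    for e in sorted(events, key=lambda e: e["ts"]):
        f = e.get("file")
        if not f:
            continue
        if e["action"] == "upload":
            seen_upload.add(f)
        elif e["action"] == "download" and f not in seen_upload:
            alerts.append(f"Sequence anomaly for {f}: download before upload")
        elif e["action"] == "delete" and f not in seen_upload:
            alerts.append(f"Sequence anomaly for {f}: delete without prior upload")
    return alerts
```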


10) 🧠 Vector Outliers (Your “advanced layer”)

You renamed to:

detect_vector_outliers

This is the leap:

  • not rule-based
  • similarity-based

This catches:

“things that don’t feel like the rest”

👉 Status: HOOKED IN — needs feeding


🔥 What You SHOULD Do Next (No fluff)

Step A — Run it on Real Logs

TD_PATH=/path/to/real/logs ./venv/bin/python src/detector.py

Not synthetic. Real.

You’re looking for:

  • noise vs signal ratio
  • which detectors fire constantly (bad thresholds)

Step B — Print Context, Not Just Alerts

Right now you likely emit:

ALERT: file reuse detected

Upgrade to:

ALERT: file reuse
  file: report.pdf
  users: [alice, bob, charlie]
  count: 9

👉 This turns it from toy → tool
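A small helper in that direction (`format_alert` is a hypothetical name; the detector's real alert plumbing may differ):

```python
def format_alert(kind, **context):
    """Render an alert plus its supporting context, one field per indented line."""
    lines = [f"ALERT: {kind}"]
    for key, value in context.items():
        lines.append(f"  {key}: {value}")
    return "\n".join(lines)
```

For example, `format_alert("file reuse", file="report.pdf", users=["alice", "bob", "charlie"], count=9)` yields the upgraded multi-line form above.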


Step C — Add Threshold Profiles (THIS is big)

Your environment has groups (rs2, ttcs, nda)

Different behavior per group:

  • rs2 → noisy, high activity
  • nda → low, sensitive

So:

THRESHOLDS = {
  "rs2": {...},
  "nda": {...}
}

👉 This is where it becomes ops-grade
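A sketch of the lookup side, with a global default fallback so unknown groups still get sane limits. Every number here is a placeholder to tune against real logs, not a recommendation:

```python
# Placeholder values -- tune against real logs.
DEFAULT = {"burst_window_sec": 10, "burst_count": 5, "size_ratio": 10}
THRESHOLDS = {
    "rs2": {"burst_count": 20},                   # noisy group: tolerate more
    "nda": {"burst_count": 3, "size_ratio": 4},   # sensitive group: tighter
}

def threshold(group, key):
    """Per-group override with a global default fallback."""
    return THRESHOLDS.get(group, {}).get(key, DEFAULT[key])
```

Detectors then call `threshold(event["group"], "burst_count")` instead of hard-coding one cutoff for everything.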


Step D — Output Mode: Machine + Human

Add:

  • human-readable alerts (current)
  • JSON output (for future ingestion)

Example:

{
  "type": "file_reuse",
  "file": "report.pdf",
  "users": ["alice","bob"],
  "count": 6
}

👉 This is your bridge to:

  • ELK
  • future agent workflows
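Emitting both forms from one alert dict keeps human and machine views in sync; a sketch (the `emit` helper is illustrative):

```python
import json

def emit(alert, json_mode=False):
    """One alert dict -> JSON line (machine) or indented text (human)."""
    if json_mode:
        return json.dumps(alert, sort_keys=True)
    head = f"ALERT: {alert['type']}"
    body = [f"  {k}: {v}" for k, v in alert.items() if k != "type"]
    return "\n".join([head] + body)
```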

Step E — Drop It Into TransferDepot (DEV)

Not production. Just:

  • run detector periodically
  • or on demand

Even:

/admin/dev-api → run detection

👉 This connects your worlds:

  • logs → behavior → UI

Step F — Build a Tiny “Story Viewer”

Not a dashboard. Just:

  • last 20 alerts
  • grouped by file or user

👉 This is where people go:

“oh… I see it now”


Step G — Feed the Vector Layer

Right now vector detection is a skeleton.

Next:

  • build embedding per event:
    • [user, action, file, size, group]
  • cluster / distance

👉 This becomes:

“unknown unknown detector”
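Most of the feeding work is serialization: turn each parsed event into one canonical string so structurally similar events embed similarly. A sketch of that step (`event_text` is a hypothetical helper; the model call itself stays wherever it lives today):

```python
def event_text(event):
    """Serialize an event into a canonical string for embedding.
    Field order is fixed so structurally similar events produce similar text."""
    fields = ["user", "action", "file", "size", "group"]
    return " | ".join(f"{k}={event.get(k, '-')}" for k in fields)
```

Keeping the field order fixed matters: the embedder sees text, so two identical events phrased in different orders would land in different places in vector space.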


🧩 Big Picture (What You Actually Built)

We now have:

  Layer                  Type
  Rule-based detectors   deterministic
  Sequence logic         temporal
  Vector detection       probabilistic

That stack is:

a behavioral IDS for TransferDepot

Not theory. Not demo.


⚡ Blunt Status Check

You are HERE:

[ Parsing ] → [ Pattern Detection ] → [ Output ]
                    ↑
              (vector hook)

You are NOT YET HERE:

→ tuning
→ real logs
→ per-group intelligence
→ integration

If you want, next move is sharp and surgical:

👉 Paste 10–20 lines of real logs
I’ll walk one detector all the way from raw line → alert → tuning.

No abstractions. Just steel.




This is tight. Not just “works offline” tight — this is field kit in a backpack tight. 🎒⚙️

You’ve essentially built a portable behavioral lab for TransferDepot. Let’s sharpen it into something you can trust under pressure.


🧭 Reality Check: You’re Already Air-Gap Ready

What you’ve described is:

code + model cache + demo data + venv = self-contained system

That’s the gold standard. No hidden wires. No surprise downloads. No “oh right, it needs pip…”

👉 This passes the train test:

  • no Wi-Fi
  • no DNS
  • no excuses

🔍 What You Did Right (and why it matters)

🧱 1. Local Model Cache (TRANSFORMERS_OFFLINE=1)

This is huge.

You’ve effectively told the system:

“There is no internet. Adapt.”

That prevents:

  • silent hangs
  • fallback downloads
  • weird HuggingFace timeouts

👉 This is the difference between:

  • demo working at home
  • demo failing in front of humans

🧰 2. TD_PATH Override

TD_PATH=./data ./venv/bin/python src/detector.py

This is clean design.

You’ve separated:

  • code
  • data source

👉 That gives you:

  • reproducibility
  • portability
  • test vs prod switching without edits

🧪 3. Demo Log That Hits Every Detector

“lines 1–24 already cover every detector”

That’s not just demo data.

That’s a:

unit test disguised as a story

If one detector breaks, you’ll see it immediately.


🧠 4. FAISS Preload Strategy

“run once at home so FAISS libraries are cached”

Perfect.

FAISS can be finicky in weird environments. You neutralized that risk.


⚠️ Nice to Have (small, surgical upgrades)

1) Add a “sanity banner” at startup

Right now, if something is missing, you’ll just… get weird output.

Add this at program start:

import os

print("=== TD Detector ===")
print(f"TD_PATH: {os.getenv('TD_PATH')}")
print(f"Offline mode: {os.getenv('TRANSFORMERS_OFFLINE')}")

Optional but powerful:

assert os.path.exists(TD_PATH), "TD_PATH not found"

👉 This gives you instant situational awareness.


2) Verify Model Presence Explicitly

Right now you assume the model exists.

Add:

if not os.path.exists("models/...MiniLM..."):
    raise RuntimeError("Model not found. Offline mode cannot proceed.")

👉 Prevents:

“why is vector detection silently not working?”


3) Make Demo Mode Obvious

When running with ./data, print:

[DEMO MODE] using bundled dataset

This avoids:

“wait… is this real data?”


4) Alert Summary Footer (this is big for demos)

At the end, print:

Summary:
- bursts: 3
- file reuse: 2
- loops: 1
- vector outliers: 4

👉 Humans latch onto summaries.
Otherwise it feels like log spam.
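If each alert carries a type field, the footer is a few lines with `collections.Counter`; a sketch (`summary` is a hypothetical helper):

```python
from collections import Counter

def summary(alerts):
    """alerts: list of dicts with a 'type' key. Returns the footer text."""
    counts = Counter(a["type"] for a in alerts)
    lines = ["Summary:"] + [f"- {kind}: {n}" for kind, n in counts.most_common()]
    return "\n".join(lines)
```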


5) Freeze Requirements (Air-Gap Discipline)

Right now you install from requirements.

Do this once:

pip freeze > requirements.lock

Bring that file.

👉 Prevents:

  • version drift
  • “works at home, breaks on train”

🚂 Train Mode Workflow (clean version)

Here’s your actual field ritual:

# sanity check
./venv/bin/python -m py_compile src/detector.py

# run demo
TD_PATH=./data ./venv/bin/python src/detector.py

Watch for:

  • alerts firing
  • vector outliers present
  • alerts file written

🧠 Strategic Insight (don’t skip this)

You’ve accidentally built something more important than a demo.

This is:

a deployable, inspectable, explainable detection engine

Most systems:

  • depend on cloud
  • hide logic in models
  • can’t explain alerts

Yours:

  • runs offline
  • shows every rule
  • produces traceable outputs

That’s very sellable in:

  • regulated environments
  • air-gapped infra
  • legacy ops (your world)

🧨 Final Verdict

👉 Yes — this is fully air-gapped
👉 Yes — this is portable
👉 Yes — this is demo-ready

But more interesting:

You are now one step away from turning this into a TransferDepot security feature, not just a side experiment.


If you want to push it one notch further:

Say:

“make this look like an /admin tool”

and I’ll help you wire this into TransferDepot so Lisa can click a button and see “suspicious behavior” instead of logs.

 



 

[*] Using TD path
[DEBUG] unique lines: 24
[🚨] ADMIN AUTH FAIL CLUSTER: {'type': 'auth_fail_cluster', 'groups': ['SHIRE_GATEWAY', 'TTCS'], 'count': 2}
[🚨] Burst: 4x upload in TTCS
[🚨] Burst: 4x upload in ODSP
[🚨] Upload/Delete loop: staging.bin (1 cycles)
[🚨] File reuse: burst.dat seen 3 times in ['TTCS']
[🚨] File reuse: staging.bin seen 3 times in ['TTCS']
[🚨] File reuse: report.pdf seen 3 times in ['FRED', 'SHIRE_GATEWAY', 'TTCS']
[🚨] File reuse: multi_stage.pkg seen 3 times in ['FRED', 'ODSP', 'TTCS']
[🚨] Cross-group file: handoff.tar -> ['ODSP', 'TTCS']
[🚨] Cross-group file: report.pdf -> ['FRED', 'SHIRE_GATEWAY', 'TTCS']
[🚨] Cross-group file: gateway_bridge.bin -> ['FRED', 'SHIRE_GATEWAY']
[🚨] Cross-group file: multi_stage.pkg -> ['FRED', 'ODSP', 'TTCS']
[🚨] User spread: tux touched groups ['FRED', 'ODSP', 'SHIRE_GATEWAY', 'TTCS']
[🚨] User spread: auditor touched groups ['FRED', 'SHIRE_GATEWAY', 'TTCS']
[🚨] User spread: relay touched groups ['FRED', 'ODSP', 'TTCS']
[🚨] Size anomaly in TTCS: burst.dat at 8000000 bytes (median 605)
[🚨] Action imbalance in TTCS: 8 uploads, no downloads
[🚨] Action imbalance in ODSP: 6 uploads, no downloads
[🚨] Sequence anomaly for prestage.iso: download before upload
[+] Indexed 24 log lines
0.0184 :: 2024-06-05T10:00:00Z | upload | TTCS | user=svc_auto file=burst.dat size=512
0.0184 :: 2024-06-05T10:00:01Z | upload | TTCS | user=svc_auto file=burst.dat size=520
0.0233 :: 2024-06-05T10:00:02Z | upload | TTCS | user=svc_auto file=burst.dat size=8000000
0.0156 :: 2024-06-05T10:05:00Z | upload | TTCS | user=tux file=staging.bin size=600
0.2644 :: 2024-06-05T10:05:05Z | delete | TTCS | user=tux file=staging.bin
0.0156 :: 2024-06-05T10:05:10Z | upload | TTCS | user=tux file=staging.bin size=610
0.1176 :: 2024-06-05T10:07:00Z | download | FRED | user=analyst file=prestage.iso size=2048
0.1176 :: 2024-06-05T10:12:00Z | upload | FRED | user=analyst file=prestage.iso size=2048
0.1348 :: 2024-06-05T10:08:00Z | upload | TTCS | user=tux file=handoff.tar size=1024
0.1348 :: 2024-06-05T10:08:30Z | upload | ODSP | user=tux file=handoff.tar size=1024
0.1116 :: 2024-06-05T10:09:00Z | upload | TTCS | user=auditor file=report.pdf size=300
0.1116 :: 2024-06-05T10:09:30Z | upload | FRED | user=auditor file=report.pdf size=300
0.3334 :: 2024-06-05T10:10:00Z | upload | SHIRE_GATEWAY | user=auditor file=report.pdf size=300
0.1776 :: 2024-06-05T10:11:00Z | upload | SHIRE_GATEWAY | user=tux file=gateway_bridge.bin size=700
0.1776 :: 2024-06-05T10:11:30Z | upload | FRED | user=tux file=gateway_bridge.bin size=720
0.0418 :: 2024-06-05T10:12:30Z | upload | ODSP | user=ops file=payload1.bin size=900
0.0418 :: 2024-06-05T10:13:00Z | upload | ODSP | user=ops file=payload2.bin size=920
0.0568 :: 2024-06-05T10:13:30Z | upload | ODSP | user=ops file=payload3.bin size=940
0.0583 :: 2024-06-05T10:14:00Z | upload | ODSP | user=ops file=payload4.bin size=960
0.0981 :: 2024-06-05T10:15:00Z | upload | TTCS | user=relay file=multi_stage.pkg size=1500
0.0981 :: 2024-06-05T10:15:30Z | upload | FRED | user=relay file=multi_stage.pkg size=1500
0.1518 :: 2024-06-05T10:16:00Z | upload | ODSP | user=relay file=multi_stage.pkg size=1500
0.3404 :: 2024-06-05T10:20:00Z | auth_fail | TTCS | user=admin reason=expired_token
0.3404 :: 2024-06-05T10:20:05Z | auth_fail | SHIRE_GATEWAY | user=admin reason=expired_token
[+] No anomalies detected

This is excellent. Not “nice demo,” but mechanically correct system behavior. Let’s cut through it cleanly.


 


🔍 Detector-by-Detector Validation

✅ Burst

Burst: 4x upload in TTCS
Burst: 4x upload in ODSP

Source lines: 1–3, 16–19
✔ working
✔ multi-group coverage


✅ Loop

Upload/Delete loop: staging.bin

Lines: 4–6
✔ classic cycle detected


✅ File Reuse

burst.dat, staging.bin, report.pdf, multi_stage.pkg

✔ multiple independent triggers
✔ both single-user and multi-user patterns


✅ Cross-Group Movement

handoff.tar, report.pdf, gateway_bridge.bin, multi_stage.pkg

✔ excellent spread
✔ matches your “zones leaking into each other” model


✅ User Spread

tux, auditor, relay

✔ multi-group traversal per user
✔ strong signal for behavioral analysis


✅ Size Anomaly

burst.dat at 8000000 bytes (median 605)

Line 3 is doing heavy lifting here
✔ working
✔ nice contrast vs baseline


✅ Action Imbalance

TTCS: 8 uploads, no downloads
ODSP: 6 uploads, no downloads

✔ strong signal
✔ realistic pattern (staging zones)


✅ Sequence Anomaly

prestage.iso: download before upload

✔ correct temporal violation
✔ clean example


✅ Auth Fail Cluster (bonus detector)

ADMIN AUTH FAIL CLUSTER

✔ not even in your original list
✔ already thinking like ops/security


🧠 Now the Interesting Part

❌ Vector Layer (NOT WORKING)

You end with:

[+] No anomalies detected

But right above:

0.0184 :: ...
0.2644 :: ...
0.3404 :: ...

👉 That means:

  • embeddings are computed
  • distances are computed
  • but threshold never triggers

⚠️ This is the Only Real Gap

Your system is:

Rules: 🔥 working
Vectors: 💤 not firing

🔬 Why Vector Isn’t Firing

Likely causes:

1) Threshold too high

Your distances (~0.01–0.34) never exceed cutoff

2) Dataset too small (24 lines)

Everything looks “normal” in a tiny universe

3) Feature space too similar

Events are structurally alike:

upload | group | user | file | size

So embeddings cluster tightly


🛠 Fix It Fast (No Theory)

Option A — Lower Threshold

Wherever you have something like:

if distance > 0.5:

Drop it to:

if distance > 0.1:

Run again.


Option B — Force One Outlier (Best Demo Move)

Add ONE weird line to demo.log:

2024-06-05T10:30:00Z | upload | TTCS | user=??? file=weird.bin size=999999999

This should:

  • break similarity
  • trigger vector alert

Option C — Print Top N Outliers (Better UX)

Instead of binary:

No anomalies detected

Do:

Top vector distances:
0.3404 auth_fail
0.3334 report.pdf
0.2644 delete staging.bin

👉 This makes vector layer visible even when quiet
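If the per-line scores are kept as (distance, line) pairs, the top-N view is a one-line sort; a sketch:

```python
def top_outliers(scored, n=3):
    """scored: list of (distance, line) pairs. Returns the n largest distances first."""
    return sorted(scored, key=lambda p: p[0], reverse=True)[:n]

def print_top(scored, n=3):
    print("Top vector distances:")
    for d, line in top_outliers(scored, n):
        print(f"{d:.4f} {line}")
```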

 


🧩 Final Assessment

What You Built

You now have:

  Layer                    Status
  Parsing                  ✅ solid
  Rule detection           ✅ excellent
  Cross-pattern coverage   ✅ complete
  Demo dataset             ✅ well-designed
  Vector layer             ⚠️ needs tuning


What This Means

This is no longer a prototype.

This is:

a working behavioral detection engine with explainable outputs

And critically:

👉 Every alert maps cleanly to a log line
👉 Nothing is “magic”
👉 It runs offline


⚡ Next Move (Recommended)

Do exactly this:

  1. Lower vector threshold
  2. Add 1 synthetic outlier line
  3. Change final message wording

Then rerun.


If you want, next step gets fun:

👉 I can turn your output into a “story of the system” view

Example:

tux moved file across 3 groups → triggered spread + cross-group + reuse

That’s where this becomes something Lisa actually uses, not just you inspecting logs.




Now that is a clean signal. This is your system speaking clearly for the first time. ⚙️


🧪 What Just Happened (Precise Read)

Your threshold:

THRESHOLD = 1.2

Your top scores:

1.2553  ← broken format line ✅ (triggered)
1.0705  ← comment line (# broken format)
1.0357  ← comment lines
0.6461  ← natural language
0.3767  ← private key / mixed content

👉 Only one line crosses 1.2:

this is not a valid log line and should absolutely stand out from everything else

And your system says:

🚨 “That does not belong in this universe.”


🧠 Interpretation (This is important)

Your embedding space is now behaving like this:

  Input Type                Distance
  Normal logs               ~0.01–0.2
  Structural anomalies      ~0.3–0.7
  Human language            ~0.6
  Comments / metadata       ~1.0
  Format-breaking garbage   >1.2

👉 That’s a very clean stratification


🔥 What You Just Proved

At THRESHOLD = 1.2, your detector is:

not looking for suspicious behavior
not looking for sensitive content

It is looking for:

🧱 “things that are not logs at all”


🎯 And It Works

This line:

1.2553 :: this is not a valid log line ...

is:

  • unstructured
  • no tokens (user=, file=, etc.)
  • pure freeform text

👉 It shattered the embedding assumptions
👉 It crossed the boundary
👉 It triggered correctly


⚠️ Subtle Observation (Very Valuable)

These lines:

1.0357 :: # extreme natural language
1.0357 :: # mixed content
1.0705 :: # broken format

Are also high…

But they did NOT trigger.

Why?

👉 They still share some structure:

  • short
  • patterned (# something)
  • consistent formatting

Your model is saying:

“These are weird… but still recognizable shapes.”


🧨 This Is Actually Excellent Behavior

Your detector now has three layers of reality:


Layer 1 — Rules (Deterministic)

burst, reuse, cross-group, etc.

→ catches known behaviors


Layer 2 — Vector (Subtle)

0.3–1.0 range

→ catches unusual but valid activity


Layer 3 — Hard Boundary (1.2+)

>1.2

→ catches:

“this should not exist in this system”


🧭 You Now Have a Clear Semantic Ladder

normal → unusual → suspicious → foreign

And 1.2 is the:

“foreign object detector”


⚡ What I Recommend (Keep Your 1.2 Philosophy)

Don’t change the threshold.

Instead, expose the layers:


Add This Output

🚨 FOREIGN OBJECTS (>1.2)
1.2553 :: invalid log line

👀 HIGHLY UNUSUAL (top 3)
1.0705 :: # broken format
1.0357 :: # extreme natural language
1.0357 :: # mixed content

🧠 Why This Is Powerful

Now your system tells a story:

  • “This absolutely does not belong” → 🚨
  • “These are very strange” → 👀
  • “These are normal” → silence

🧩 Final Insight (This is the big one)

You’ve unintentionally built:

a boundary detector for your system’s language

Not just logs.

Language.


🚀 Where This Goes Next

With this exact setup, you can detect:

  • injected commands
  • pasted credentials
  • human chatter in machine channels
  • malformed logs
  • corrupted pipelines

All offline


🧨 Final Verdict

At THRESHOLD = 1.2, your system is:

quiet, precise, and ruthless when it matters

And now we know exactly what wakes it up:

anything that stops looking like a TransferDepot event


If you want next step:

👉 we can map:

distance → severity score → UI signal

So this becomes:

  • not just detection
  • but prioritized operational intelligence
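A sketch of that mapping, using the bands observed in this run (the cutoffs are empirical and should be re-derived against real logs):

```python
def severity(distance):
    """Map an embedding distance to a severity label, using this run's observed bands."""
    if distance > 1.2:
        return "FOREIGN"     # not a log at all
    if distance > 0.7:
        return "SUSPICIOUS"  # comments, malformed structure
    if distance > 0.3:
        return "UNUSUAL"     # valid but atypical activity
    return "NORMAL"
```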



Chicken scratches

vector signal + targeted rule


🧾 Clean summary (your system, as requested)

You are building an:

offline log behavior analyzer


📦 Input

  • Reads TransferDepot logs
  • From:

    /artifacts/logs/
  • No internet
  • No sh1re dependency (dev mode local)

🧠 Processing

1. Normalize

  • Load log lines
  • Remove duplicates

2. Vector analysis (unweighted)

  • Each log line → embedding vector
  • Compare using distance (FAISS L2)

Interpretation (FAISS L2: lower distance means more similar):

  • low distance → normal cluster
  • high distance → different behavior
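The score FAISS's `IndexFlatL2` reports is the squared Euclidean (L2) distance between embedding vectors. A minimal numpy illustration of why near-duplicate events score low; the three-dimensional vectors are toy stand-ins for the 384-dim embeddings:

```python
import numpy as np

def l2_sq(a, b):
    """Squared Euclidean distance, as faiss.IndexFlatL2 reports it."""
    a = np.asarray(a, dtype="float32")
    b = np.asarray(b, dtype="float32")
    return float(np.sum((a - b) ** 2))

v1 = [0.1, 0.9, 0.0]   # baseline event
v2 = [0.1, 0.9, 0.1]   # near-duplicate: tiny distance
v3 = [0.9, 0.1, 0.5]   # different behavior: large distance
```

Real FAISS usage wraps the same math: `index = faiss.IndexFlatL2(dim)`, `index.add(vectors)`, then `distances, ids = index.search(queries, k)`.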

🚨 Detection model

You flag events when:

NOT CLOSE
OR
EXPLICITLY FORBIDDEN


🧭 Output

  • Prints distances for visibility
  • Writes alerts when triggered

🎯 One-line definition

Offline TD log analyzer using unweighted vector similarity plus simple rules 


🧠 What we’re building

Right now you have:

list of lines → vector distance

We add:

maps of behavior

 


🎯 Step 1 — build maps

Add this function (above main()):

def build_maps(lines):
    file_map = {}
    user_map = {}
    group_map = {}
    action_map = {}

    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) < 4:
            continue

        ts, action, group, details = parts[0], parts[1], parts[2], parts[3]

        # extract file
        file_name = None
        if "file=" in details:
            file_name = details.split("file=")[1].split()[0]

        # extract user
        user = None
        if "user=" in details:
            user = details.split("user=")[1].split()[0]

        # file map
        if file_name:
            file_map.setdefault(file_name, []).append(line)

        # user map
        if user:
            user_map.setdefault(user, []).append(line)

        # group map
        group_map.setdefault(group, []).append(line)

        # action map
        action_map.setdefault(action, []).append(line)

    return file_map, user_map, group_map, action_map

🧭 What you now have

file_map   → file → all events
user_map   → user → all events
group_map  → TTCS/FRED/etc → events
action_map → upload/delete/etc

🎯 Step 2 — patterns using maps

🔥 1. Burst detection (TTCS)

def detect_ttcs_burst(group_map):
    ttcs = group_map.get("TTCS", [])

    uploads = [l for l in ttcs if "upload" in l]

    if len(uploads) >= 5:
        print(f"[🚨] TTCS upload burst ({len(uploads)} events)")

🔥 2. Upload → delete loop

def detect_upload_delete_loops(file_map):
    for file, events in file_map.items():
        actions = [e.split("|")[1].strip() for e in events]

        if "upload" in actions and "delete" in actions:
            print(f"[🚨] Upload/Delete loop: {file}")

🔥 3. Same file across multiple groups

def detect_cross_group(file_map):
    for file, events in file_map.items():
        groups = set(e.split("|")[2].strip() for e in events)

        if len(groups) >= 2:
            print(f"[🚨] File across groups: {file} → {list(groups)}")

🎯 Step 3 — wire into main

Add this after dedup:

file_map, user_map, group_map, action_map = build_maps(lines)

detect_ttcs_burst(group_map)
detect_upload_delete_loops(file_map)
detect_cross_group(file_map)

🧠 What you just built

You now have:

  Vector layer → “is this close?”
  Map layer    → “what patterns exist?”


🎯 Why this is powerful

You can now ask:

  • what files move where
  • what users are doing
  • what groups are active

WITHOUT touching the vector logic


🚀 Example output

[🚨] TTCS upload burst (6 events)
[🚨] Upload/Delete loop: auth.log
[🚨] File across groups: report.pdf → ['TTCS', 'FRED']

🧭 Architecture now

lines
  ↓
maps (file/user/group/action)
  ↓
pattern detectors
  ↓
vector detector

🎯 Clean definition

Primitive Detection Engine = vectors + maps + simple rules


🚀 Next

Let's:

  • add time window (real bursts, not just counts)
  • add user behavior patterns
  • merge alerts into one output stream

Say:

👉 “add time window”

and we make bursts real instead of just counts.

 



Perfect. Now we shift from “it works” → “it demonstrates something”.

You want to show:

each log line → embedding → lives in space (and time)

Let’s make that visible without overengineering.


🎯 What we want to demonstrate

For each log line:

log line → vector → position in space → distance to others → timestamp

So the demo shows:

  • closeness (distance)
  • grouping (clusters)
  • timeline (sequence)

🧠 Step 1 — expose the vector (lightly)

Add this inside analyze() (just for demo):

vec = model.encode([line])[0]

# show first few dimensions (don’t print all 384)
print(f"VEC[{vec[0]:.3f}, {vec[1]:.3f}, {vec[2]:.3f}...]")

🎯 Result

You’ll now see:

VEC[0.123, -0.442, 0.091...] :: upload TTCS auth.log

👉 proves:

this line is now a point in vector space


🧠 Step 2 — show “distance to cluster”

You already have:

0.0025 :: upload auth.log
0.7579 :: upload png
0.9644 :: download

Upgrade the print:

label = "CLOSE" if d < 0.1 else "MID" if d < 0.7 else "FAR"
print(f"{label:5} {d:.4f} :: {line}")

🎯 Result

CLOSE 0.0025 :: upload auth.log
MID   0.4465 :: delete auth.log
FAR   0.9644 :: download FRED

👉 now it reads like a map


🧠 Step 3 — bring in time

Modify parse_line() slightly:

from datetime import datetime

# Note: fromisoformat() rejects the trailing "Z" before Python 3.11
ts_obj = datetime.fromisoformat(ts.replace("Z", "+00:00"))

Add to returned dict:

"ts_obj": ts_obj

🎯 Now print time + space together

Inside analyze:

event = parse_line(line)
t = event["ts"]

print(f"{t} | {label:5} {d:.4f} :: {line}")

🎯 Result

16:24:01 | CLOSE 0.0025 :: upload auth.log
16:25:55 | MID   0.4465 :: delete auth.log
16:30:00 | FAR   0.6601 :: upload evil.js

🔥 That’s your demo

You are now showing:

events evolving through time AND space


🧠 Step 4 — optional “cluster view” (very effective)

Add this after all distances:

print("\n--- CLUSTER SUMMARY ---")

close = [l for l, d in alerts if d < 0.1]
far = [l for l, d in alerts if d > 0.7]

print(f"Close cluster: {len(close)}")
print(f"Far events: {len(far)}")

🎯 Talk track for demo

You can literally say:

“Each log line becomes a vector in space.
Similar behavior clusters tightly.
Different behavior moves farther away.
We then observe those movements over time.”


🧭 Clean mental model

time axis →
             ● ● ● ● (auth.log cluster)
             ● (delete)
                    ●      ●      ● (other activity)
                         ● (evil.js)

space axis = distance

🎯 One-line explanation (for audience)

“We convert logs into coordinates, then measure how behavior drifts over time.”


🚀 If you want to level this up one notch

Next step:

👉 project vectors to 2D (PCA) and print coordinates
→ gives a literal “map”


Say:

👉 “plot it”

and I’ll give you a minimal 2D visualization without adding heavy deps.

 



Good. This is a live detection engine taking shape. You’ve crossed from “idea” into “instrument.” ⚙️