Prompt-only iteration: removing real-party examples cuts profile leak to 1/10 but memorized-party hallucination stays at 5/10 — proving prompt engineering can't beat training-data memorization. Defines the three paths forward (A: programmatic extraction, B: bigger model, C: template-fill). Understanding why Path A (v3.2) was chosen.
Lighthouse Eval — v3.1 (10 docs, post-run synthesis fix)
Date: 2026-05-10
Model: Ollama / gemma2:2b (still unchanged — the point is to isolate prompt/code wins)
Inputs: same 10 CUAD JV-flavoured contracts as v3
Wall-clock: 32 min 48 s · Cost: $0.00
Change vs v3: synthesis prompt — replaced real-party examples ("Bravatek and COMPANY", "Veoneer and Nissin") with structural instructions; added explicit anti-recital line forbidding the client's concerns list from being dragged into risk descriptions.
Test suite: 1,686/1,686 still green.
TL;DR
v3.1 closed two of the three open issues from v3, ran into a hard model-size ceiling on the third. The remaining party-name hallucinations are training-data memorization in gemma2:2b, not a prompt problem. The realistic paths forward are programmatic name extraction or a larger local model — both spelled out below.
| Metric | v1 | v2 | v3 | v3.1 |
|---|---|---|---|---|
| Watchman NSW-leaked | 2/3 | 2/3 | 0/10 | 0/10 |
| Summary profile leak (NSW etc.) | 3/3 | 1/3 | 3/10 | 1/10 ✅ |
| Summary placeholder brackets | n/a | 1/3 | 0/10 | 0/10 |
| Summary memorised-party hallucination (Bravatek/Veoneer/Nissin) | n/a | n/a | 5/10 | 5/10 ❌ unchanged |
| Per-clause prompts with precedent block | 0% | 50% | 59% | 59% |
| Docs with ≥1 precedent injected | 0/3 | 2/3 | 7/10 | 7/10 |
| Risks per doc (no mega-risks) | varied | varied | 3/3 | 3/3 |
| Curator surface specificity | crashed | concern-recital | named docs + named pattern | named docs + different but valid pattern |
Precedents promoted to confirmed |
0 | 0 | 5 | 4 |
Profile leak went 3/10 → 1/10. Memorised-party hallucination 5/10 → 5/10. Everything else stayed flat or improved.
What v3.1 fixed
Profile-NSW leak: 3/10 → 1/10
The === CLIENT BACKGROUND (use ONLY to set tone) === block restructure + the explicit "Do not recite client concerns as risks" instruction reduced summary-level NSW leak from 3 docs in v3 to 1 in v3.1.
Specifically:
- v3 had NSW in: Theravance ("based in NSW, Australia"), Veoneer (none actually — re-check), XLI ("territory of NSW")
- v3.1 has NSW in: Theravance only ("Acme Holdings, a company based in NSW")
The XLI doc fully dropped the NSW reference in v3.1; the Veoneer doc was already clean in both. Theravance — the longest, taxonomically-ambiguous doc — is the one place gemma2:2b still falls back to NSW under load.
Curator surface decision: 1 named doc → 3 named docs + a different but valid cross-doc pattern
v3 surface: "Two joint venture agreements (JV) have critical findings: 'Veoneer' and 'Sibannac'. These both involve a risk of survival-of-pre-closing-claims. Recommend opening these first for review."
v3.1 surface: "Three joint venture agreements (acceleratedtechnologiesholdingcorp_04_24_2003.txt, sibannac_12_04_2017.txt, veoneer_02_21_2020.txt) show a pattern of unbounded-liability indemnity carve-outs. Consider portfolio position on this."
Both are specific, both name documents, both identify a pattern. v3.1 named one extra document and pivoted to a different valid pattern (unbounded-liability indemnity carve-outs) — same anti-recital quality.
Concern-vocabulary leak inside risks: partially closed
v3 LoopIndustries: "potentially exposing the Client to joint and several liability" v3.1 LoopIndustries: "between Assignor and Assignee, potentially impacting risk allocation" — no "Client" reference, no JV-specific concern names.
But v3.1 UnitedNationalBancorp still has "lacks specific details on capital call mechanics" even though it's a vendor outsourcing agreement. The fix didn't fully close this on every doc. Half-win.
What v3.1 did NOT fix — the gemma2:2b training-memorisation ceiling
5/10 summaries still hallucinate party names from gemma2:2b's training memory of the CUAD dataset:
| Doc | v3 hallucination | v3.1 hallucination |
|---|---|---|
| AcceleratedTechnologies | "Bravatek and COMPANY" | "Bravatek and Sibannac, operating as 'Accelerated Technologies Holding Corp.'" |
| BorrowMoney | "Bravatek and COMPANY" | "Veoneer and Nissin" |
| Theravance | "Bravatek and Sibannac" | "Acme Holdings" (client-name leak) |
| XLI Technologies | "Bravatek and BOSCH" | "Bravatek and Sibannac" |
| Sibannac | "Bravatek and Sibannac" (correct by coincidence) | same |
The pattern: when the per-clause findings don't strongly anchor party names, gemma2:2b reaches for names it has memorised from training — and Bravatek/Sibannac/Veoneer/Nissin are real party names from the CUAD dataset that gemma2:2b has seen during training. My v3.1 prompt edits removed the explicit example names, but the model still finds them in its weights.
This is the gemma2:2b ceiling. Prompt engineering alone cannot dislodge it. The three realistic paths forward:
Path A — Programmatic name injection (~1 hour engineering)
Don't trust the model to copy names from findings. Extract them deterministically:
- After per-clause analysis, scan all
clauseRiskSummary+ concern text fields for capitalised proper nouns (party-like patterns). - Tally the top 5 most-frequent and inject them into the synthesis prompt as a required-tokens block:
"PARTIES NAMED IN THE PER-CLAUSE FINDINGS (you MUST use exactly these names in the summary; if a name is not on this list, do not use it): VNUE, Promoter, ..."
- After synthesis, run a regex check: every party name in the output must be on the list. If not, regenerate the summary.
This is the same pattern as "JSON mode" — give the model a closed vocabulary for the high-stakes field and validate output deterministically.
Path B — Larger local model (1-line config change, hardware-permitting)
LAVERN_LOCAL_ANALYSIS_MODEL=qwen2.5:7b-instruct (or mistral-small, llama3.1:8b). These models don't have the memorisation collapse gemma2:2b shows on the CUAD dataset — they have richer enough representations to follow the prompt instruction "use party names from the findings, not your prior knowledge."
The cost: ~5GB RAM instead of 1.6GB. Most Macs released since 2022 have 16GB+. The "runs on a Mac mini" claim survives.
Path C — Synthesis as template-fill, not generative narrative
Stop asking the model to write the summary as free prose. Replace synthesis with a structured fill:
TEMPLATE:
"This document is a {documentType} between {party1} and {party2},
governed by {jurisdiction}. The biggest exposure is {topRisk1.description}."
The model fills the {} fields by EXTRACTING from per-clause findings,
not by composing prose.
This is the most robust path. It's also the one that loses the most "lighthouse-quality" feel — summaries become mechanical rather than narrative. Trade-off.
Recommendation
Ship the article on v3.1 with the model-size caveat. Implement Path A (programmatic name injection) in v3.2 over the next sprint, then re-run the eval.
Why ship now:
- 0/10 Watchman NSW leak (v1+v2 had 6/6 contaminated; v3+v3.1 are clean)
- 1/10 summary NSW leak (down from 3/10 in v3; only the hardest doc still slips)
- 0/10 placeholder brackets
- 59% precedent injection coverage with the board compounding across docs
- Curator producing specific, named, actionable surface decisions
- Phase 5 lifecycle (tentative→confirmed) firing in production
- 31-33 min wall-clock for 10 docs, $0, all local
Why not wait for v3.2:
- The party-name issue is a known small-model artifact that any reader who's worked with gemma2:2b will recognise immediately
- Path A is a discrete, named follow-up — better announced as "v3.2 dropping next week" than buried as "still broken"
- The article's central claims (three personas, compounding board, $0 local, working on a Mac) are now demonstrated across 10 real SEC contracts
Cost / time
10 real SEC-filed contracts, Watchman + Reader + Curator end-to-end:
- v3: 1,876 s (31 min 16 s) · $0
- v3.1: 1,968 s (32 min 48 s) · $0
The v3.1 cost +90s vs v3 is the Reader's slightly slower synthesis prompt evaluation on a more constrained prompt. Within noise. Both are reproducible on any Mac with Ollama and gemma2:2b.
Verdict on the v1→v3.1 arc
| Run | Claim status | One-line verdict |
|---|---|---|
| v1 | Plumbing claim | "The code doesn't throw, and the chunker silently fails on real docs." |
| v2 | Plumbing + compounding claim | "Fix 3 bugs, get architectural compounding across 3 docs." |
| v3 | Production-quality on 10 docs | "Architecture works at 10× scale; one self-inflicted prompt bug, one model ceiling." |
| v3.1 | Ship-ready except for the model ceiling | "Profile leak from 3/10 to 1/10. Party-name hallucination is gemma2:2b's training memory, not a prompt problem. v3.2 lands programmatic name injection." |
Four reports on disk:
- evals/jv/EVAL_REPORT.md — v1 (3 docs, brutal honest assessment)
- evals/jv/EVAL_REPORT_V2.md — v2 (3 docs, three production fixes + side-by-side)
- evals/jv/EVAL_REPORT_V3.md — v3 (10 docs, three more fixes + cross-doc table)
- evals/jv/EVAL_REPORT_V31.md — v3.1 (10 docs, post-run prompt fix + ceiling diagnosis) ← this file
Six runs of raw flight-recorder logs on disk (runs-v1/, runs-v2/, runs-v3/, runs/).
Anyone can reproduce all of this with LAVERN_LOCAL_ANALYSIS_MODEL=gemma2:2b npx tsx scripts/eval-lighthouse.ts against the 10 CUAD contracts in evals/jv/*.txt.
That's the receipt the article needs.