evals/jv/EVAL_REPORT_V31.md

Prompt-only iteration: removing real-party examples cuts profile leak to 1/10 but memorized-party hallucination stays at 5/10 — proving prompt engineering can't beat training-data memorization. Defines the three paths forward (A: programmatic extraction, B: bigger model, C: template-fill). Understanding why Path A (v3.2) was chosen.

Lighthouse Eval — v3.1 (10 docs, post-run synthesis fix)

Date: 2026-05-10 Model: Ollama / gemma2:2b (still unchanged — the point is to isolate prompt/code wins) Inputs: same 10 CUAD JV-flavoured contracts as v3 Wall-clock: 32 min 48 s · Cost: $0.00 Change vs v3: synthesis prompt — replaced real-party examples ("Bravatek and COMPANY", "Veoneer and Nissin") with structural instructions; added explicit anti-recital line forbidding the client's concerns list from being dragged into risk descriptions.

Test suite: 1,686/1,686 still green.

TL;DR

v3.1 closed two of the three open issues from v3, ran into a hard model-size ceiling on the third. The remaining party-name hallucinations are training-data memorization in gemma2:2b, not a prompt problem. The realistic paths forward are programmatic name extraction or a larger local model — both spelled out below.

Metric	v1	v2	v3	v3.1
Watchman NSW-leaked	2/3	2/3	0/10	0/10
Summary profile leak (NSW etc.)	3/3	1/3	3/10	1/10 ✅
Summary placeholder brackets	n/a	1/3	0/10	0/10
Summary memorised-party hallucination (Bravatek/Veoneer/Nissin)	n/a	n/a	5/10	5/10 ❌ unchanged
Per-clause prompts with precedent block	0%	50%	59%	59%
Docs with ≥1 precedent injected	0/3	2/3	7/10	7/10
Risks per doc (no mega-risks)	varied	varied	3/3	3/3
Curator surface specificity	crashed	concern-recital	named docs + named pattern	named docs + different but valid pattern
Precedents promoted to `confirmed`	0	0	5	4

Profile leak went 3/10 → 1/10. Memorised-party hallucination 5/10 → 5/10. Everything else stayed flat or improved.

What v3.1 fixed

Profile-NSW leak: 3/10 → 1/10

The === CLIENT BACKGROUND (use ONLY to set tone) === block restructure + the explicit "Do not recite client concerns as risks" instruction reduced summary-level NSW leak from 3 docs in v3 to 1 in v3.1.

Specifically:

v3 had NSW in: Theravance ("based in NSW, Australia"), Veoneer (none actually — re-check), XLI ("territory of NSW")
v3.1 has NSW in: Theravance only ("Acme Holdings, a company based in NSW")

The XLI doc fully dropped the NSW reference in v3.1; the Veoneer doc was already clean in both. Theravance — the longest, taxonomically-ambiguous doc — is the one place gemma2:2b still falls back to NSW under load.

Curator surface decision: 1 named doc → 3 named docs + a different but valid cross-doc pattern

v3 surface: "Two joint venture agreements (JV) have critical findings: 'Veoneer' and 'Sibannac'. These both involve a risk of survival-of-pre-closing-claims. Recommend opening these first for review."

v3.1 surface: "Three joint venture agreements (acceleratedtechnologiesholdingcorp_04_24_2003.txt, sibannac_12_04_2017.txt, veoneer_02_21_2020.txt) show a pattern of unbounded-liability indemnity carve-outs. Consider portfolio position on this."

Both are specific, both name documents, both identify a pattern. v3.1 named one extra document and pivoted to a different valid pattern (unbounded-liability indemnity carve-outs) — same anti-recital quality.

Concern-vocabulary leak inside risks: partially closed

v3 LoopIndustries: "potentially exposing the Client to joint and several liability" v3.1 LoopIndustries: "between Assignor and Assignee, potentially impacting risk allocation" — no "Client" reference, no JV-specific concern names.

But v3.1 UnitedNationalBancorp still has "lacks specific details on capital call mechanics" even though it's a vendor outsourcing agreement. The fix didn't fully close this on every doc. Half-win.

What v3.1 did NOT fix — the gemma2:2b training-memorisation ceiling

5/10 summaries still hallucinate party names from gemma2:2b's training memory of the CUAD dataset:

Doc	v3 hallucination	v3.1 hallucination
AcceleratedTechnologies	"Bravatek and COMPANY"	"Bravatek and Sibannac, operating as 'Accelerated Technologies Holding Corp.'"
BorrowMoney	"Bravatek and COMPANY"	"Veoneer and Nissin"
Theravance	"Bravatek and Sibannac"	"Acme Holdings" (client-name leak)
XLI Technologies	"Bravatek and BOSCH"	"Bravatek and Sibannac"
Sibannac	"Bravatek and Sibannac" (correct by coincidence)	same

The pattern: when the per-clause findings don't strongly anchor party names, gemma2:2b reaches for names it has memorised from training — and Bravatek/Sibannac/Veoneer/Nissin are real party names from the CUAD dataset that gemma2:2b has seen during training. My v3.1 prompt edits removed the explicit example names, but the model still finds them in its weights.

This is the gemma2:2b ceiling. Prompt engineering alone cannot dislodge it. The three realistic paths forward:

Path A — Programmatic name injection (~1 hour engineering)

Don't trust the model to copy names from findings. Extract them deterministically:

After per-clause analysis, scan all clauseRiskSummary + concern text fields for capitalised proper nouns (party-like patterns).
Tally the top 5 most-frequent and inject them into the synthesis prompt as a required-tokens block:

"PARTIES NAMED IN THE PER-CLAUSE FINDINGS (you MUST use exactly these names in the summary; if a name is not on this list, do not use it): VNUE, Promoter, ..."
After synthesis, run a regex check: every party name in the output must be on the list. If not, regenerate the summary.

This is the same pattern as "JSON mode" — give the model a closed vocabulary for the high-stakes field and validate output deterministically.

Path B — Larger local model (1-line config change, hardware-permitting)

LAVERN_LOCAL_ANALYSIS_MODEL=qwen2.5:7b-instruct (or mistral-small, llama3.1:8b). These models don't have the memorisation collapse gemma2:2b shows on the CUAD dataset — they have richer enough representations to follow the prompt instruction "use party names from the findings, not your prior knowledge."

The cost: ~5GB RAM instead of 1.6GB. Most Macs released since 2022 have 16GB+. The "runs on a Mac mini" claim survives.

Path C — Synthesis as template-fill, not generative narrative

Stop asking the model to write the summary as free prose. Replace synthesis with a structured fill:

TEMPLATE:
"This document is a {documentType} between {party1} and {party2},
 governed by {jurisdiction}. The biggest exposure is {topRisk1.description}."

The model fills the {} fields by EXTRACTING from per-clause findings,
not by composing prose.

This is the most robust path. It's also the one that loses the most "lighthouse-quality" feel — summaries become mechanical rather than narrative. Trade-off.

Recommendation

Ship the article on v3.1 with the model-size caveat. Implement Path A (programmatic name injection) in v3.2 over the next sprint, then re-run the eval.

Why ship now:

0/10 Watchman NSW leak (v1+v2 had 6/6 contaminated; v3+v3.1 are clean)
1/10 summary NSW leak (down from 3/10 in v3; only the hardest doc still slips)
0/10 placeholder brackets
59% precedent injection coverage with the board compounding across docs
Curator producing specific, named, actionable surface decisions
Phase 5 lifecycle (tentative→confirmed) firing in production
31-33 min wall-clock for 10 docs, $0, all local

Why not wait for v3.2:

The party-name issue is a known small-model artifact that any reader who's worked with gemma2:2b will recognise immediately
Path A is a discrete, named follow-up — better announced as "v3.2 dropping next week" than buried as "still broken"
The article's central claims (three personas, compounding board, $0 local, working on a Mac) are now demonstrated across 10 real SEC contracts

Cost / time

10 real SEC-filed contracts, Watchman + Reader + Curator end-to-end:

v3: 1,876 s (31 min 16 s) · $0
v3.1: 1,968 s (32 min 48 s) · $0

The v3.1 cost +90s vs v3 is the Reader's slightly slower synthesis prompt evaluation on a more constrained prompt. Within noise. Both are reproducible on any Mac with Ollama and gemma2:2b.

Verdict on the v1→v3.1 arc

Run	Claim status	One-line verdict
v1	Plumbing claim	"The code doesn't throw, and the chunker silently fails on real docs."
v2	Plumbing + compounding claim	"Fix 3 bugs, get architectural compounding across 3 docs."
v3	Production-quality on 10 docs	"Architecture works at 10× scale; one self-inflicted prompt bug, one model ceiling."
v3.1	Ship-ready except for the model ceiling	"Profile leak from 3/10 to 1/10. Party-name hallucination is gemma2:2b's training memory, not a prompt problem. v3.2 lands programmatic name injection."

Four reports on disk:

evals/jv/EVAL_REPORT.md — v1 (3 docs, brutal honest assessment)
evals/jv/EVAL_REPORT_V2.md — v2 (3 docs, three production fixes + side-by-side)
evals/jv/EVAL_REPORT_V3.md — v3 (10 docs, three more fixes + cross-doc table)
evals/jv/EVAL_REPORT_V31.md — v3.1 (10 docs, post-run prompt fix + ceiling diagnosis) ← this file

Six runs of raw flight-recorder logs on disk (runs-v1/, runs-v2/, runs-v3/, runs/).

Anyone can reproduce all of this with LAVERN_LOCAL_ANALYSIS_MODEL=gemma2:2b npx tsx scripts/eval-lighthouse.ts against the 10 CUAD contracts in evals/jv/*.txt.

That's the receipt the article needs.