Document-backed retrieval. Users upload contracts, memos, and reference material; Lavern indexes them into chunks searchable by FTS5 with BM25 ranking, legal-synonym expansion, and n-gram re-ranking. All queries are user-scoped — user A never sees user B's collection.
Indexing pipeline: buffer → parseDocument() → walk sections → chunk → SQLite.
The indexer reuses the same parser as the rest of Lavern (PDF, DOCX,
Markdown, plain text) and the same section detector, so headings line up
with what agents see when they read the source document.
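The "walk sections → chunk" step can be sketched as follows. This is a minimal illustration, not Lavern's actual chunker: the `Section` and `Chunk` shapes, the `chunkSections` name, and the 1200-character budget are all assumptions, since the output of parseDocument() is not shown here.

```typescript
// Hypothetical shapes; the real parser output may differ.
interface Section {
  heading: string;
  text: string;
}

interface Chunk {
  heading: string; // heading of the section the chunk came from
  content: string;
  ordinal: number; // position of the chunk within the document
}

// Pack paragraphs into chunks of roughly maxChars, never mixing text from
// two sections. A single paragraph longer than maxChars is kept whole.
function chunkSections(sections: Section[], maxChars = 1200): Chunk[] {
  const chunks: Chunk[] = [];
  for (const section of sections) {
    const paragraphs = section.text
      .split(/\n\s*\n/)
      .map((p) => p.trim())
      .filter((p) => p.length > 0);
    let buffer = "";
    const flush = (): void => {
      if (buffer.length > 0) {
        chunks.push({ heading: section.heading, content: buffer, ordinal: chunks.length });
        buffer = "";
      }
    };
    for (const paragraph of paragraphs) {
      // Start a new chunk when this paragraph would push the buffer over budget.
      if (buffer.length > 0 && buffer.length + 2 + paragraph.length > maxChars) {
        flush();
      }
      buffer = buffer.length > 0 ? `${buffer}\n\n${paragraph}` : paragraph;
    }
    flush();
  }
  return chunks;
}
```

Carrying the heading on every chunk is what lets a search hit point back to the section an agent would see in the source document.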
Search is a hybrid two-stage process. BM25 keyword search over-fetches
3× the requested k; n-gram overlap then re-ranks those candidates for
conceptual similarity, and the top k by combined score are returned. If
the FTS5 query fails to parse, search falls back to a LIKE substring
match. Each result carries chunk_id, the parent document_id, collection
metadata, heading, content, doc type, and jurisdiction, which is enough
for an agent to cite back to the source.
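The second stage can be sketched like this. Everything beyond the 3× over-fetch is an assumption made for illustration: the use of character trigrams, the min-max normalisation of BM25, the equal weighting, and the names `Candidate`, `ngramOverlap`, and `rerank` are not taken from Lavern's source.

```typescript
interface Candidate {
  chunkId: number;
  content: string;
  // SQLite FTS5's bm25() gives better matches numerically smaller
  // (more negative) scores.
  bm25: number;
}

const OVERFETCH = 3; // from the text: stage one fetches 3x the requested k

// How many rows stage one should request from FTS5.
function fetchLimit(k: number): number {
  return k * OVERFETCH;
}

// Character trigrams of the lowercased, whitespace-normalised text.
function trigrams(text: string): Set<string> {
  const s = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= s.length; i++) grams.add(s.slice(i, i + 3));
  return grams;
}

// Fraction of the query's trigrams that also occur in the content (0..1).
function ngramOverlap(query: string, content: string): number {
  const q = trigrams(query);
  if (q.size === 0) return 0;
  const c = trigrams(content);
  let shared = 0;
  q.forEach((g) => {
    if (c.has(g)) shared++;
  });
  return shared / q.size;
}

// Blend normalised BM25 with trigram overlap and keep the top k.
function rerank(query: string, candidates: Candidate[], k: number, weight = 0.5): Candidate[] {
  const relevance = candidates.map((c) => -c.bm25); // flip sign: higher is better
  const lo = Math.min(...relevance);
  const hi = Math.max(...relevance);
  return candidates
    .map((c) => {
      const keyword = hi > lo ? (-c.bm25 - lo) / (hi - lo) : 1;
      const score = (1 - weight) * keyword + weight * ngramOverlap(query, c.content);
      return { c, score };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.c);
}
```

Over-fetching matters because the re-ranker can only promote chunks that stage one returned; with a fetch limit of exactly k, the second stage could reorder results but never change which ones appear.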

| Endpoint | Purpose |
|---|---|
| POST /api/knowledge-base/collections | Create a collection |
| GET /api/knowledge-base/collections | List collections (user-scoped) |
| POST /api/knowledge-base/collections/:id/upload | Upload + index a document |
| GET /api/knowledge-base/search | FTS5 search across collections |
| DELETE /api/knowledge-base/collections/:id | Drop a collection and all chunks |
| DELETE /api/knowledge-base/documents/:id | Drop a single document |

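A small route builder shows how a client might address these endpoints. The paths and methods come from the table; the `q` and `limit` query parameter names on the search route are assumptions for illustration and may not match the real API, and `Route` and `routes` are hypothetical names.

```typescript
interface Route {
  method: "GET" | "POST" | "DELETE";
  path: string;
}

const KB_BASE = "/api/knowledge-base";

const routes = {
  createCollection: (): Route => ({ method: "POST", path: `${KB_BASE}/collections` }),
  listCollections: (): Route => ({ method: "GET", path: `${KB_BASE}/collections` }),
  upload: (collectionId: string): Route => ({
    method: "POST",
    path: `${KB_BASE}/collections/${encodeURIComponent(collectionId)}/upload`,
  }),
  // ASSUMPTION: the query parameter names `q` and `limit` are hypothetical.
  search: (query: string, limit = 10): Route => ({
    method: "GET",
    path: `${KB_BASE}/search?q=${encodeURIComponent(query)}&limit=${limit}`,
  }),
  dropCollection: (collectionId: string): Route => ({
    method: "DELETE",
    path: `${KB_BASE}/collections/${encodeURIComponent(collectionId)}`,
  }),
  dropDocument: (documentId: string): Route => ({
    method: "DELETE",
    path: `${KB_BASE}/documents/${encodeURIComponent(documentId)}`,
  }),
};
```

For example, `routes.search("force majeure", 5)` yields a GET to `/api/knowledge-base/search?q=force%20majeure&limit=5`, which can be passed to whatever HTTP client the caller already uses.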
Agents reach the knowledge base through the knowledge-base.ts
MCP tool. Permission is phase-gated by src/permissions/: a
researcher can read freely, while an orchestrator typically cannot.
Retrieved chunks go onto the debate board with the same provenance shape
as document quotes, so verification can still string-match every cited span.
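The string-match check could look roughly like the sketch below. The `Citation` and `RetrievedChunk` shapes and the `verifyCitation` name are hypothetical, and collapsing whitespace before comparing is an assumption; the real verifier may match more strictly.

```typescript
// Hypothetical provenance shapes, loosely following the result fields above.
interface RetrievedChunk {
  chunkId: number;
  documentId: number;
  heading: string;
  content: string;
}

interface Citation {
  chunkId: number;
  quote: string; // the span an agent claims to be quoting
}

// Collapse runs of whitespace so line wraps in the source do not break a match.
function normaliseWhitespace(s: string): string {
  return s.replace(/\s+/g, " ").trim();
}

// A citation passes only if its quote appears verbatim (modulo whitespace)
// in the chunk it points at.
function verifyCitation(citation: Citation, chunks: RetrievedChunk[]): boolean {
  const chunk = chunks.find((c) => c.chunkId === citation.chunkId);
  if (!chunk) return false;
  const quote = normaliseWhitespace(citation.quote);
  if (quote.length === 0) return false;
  return normaliseWhitespace(chunk.content).includes(quote);
}
```

Because the check is a plain substring test, a paraphrased or invented quote fails even when it points at a real chunk.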
scripts/seed-knowledge-base.ts seeds the knowledge base with five
permissively licensed corpora out of the box.

| Corpus | Contents | License |
|---|---|---|
| CUAD | 510 commercial contracts, 41 clause types | CC BY 4.0 |
| MAUD | 152 merger agreements, 92 deal points | CC BY 4.0 |
| ACORD | 126K+ clause-retrieval pairs | CC BY 4.0 |
| UNFAIR-ToS | 5.5K sentences, 8 unfair-clause types | CC BY-SA 4.0 |
| LEDGAR | 60K SEC provisions, 98 clause types | CC BY-SA 4.0 |

ContractNLI was dropped — its CC BY-NC-SA 4.0 license is incompatible with Apache 2.0 redistribution.
The knowledge base is one of two persistence layers. The other is
Clawern's precedent-board.ts: an institutional-memory store
that accumulates findings across engagements per client, with
O(1) dedup, relevance search, and decay/compaction. KB is content you
brought in; the precedent board is what Lavern learned along the way.
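Constant-time dedup of the kind described for the precedent board is commonly built on a hash map keyed by client and normalised text. The sketch below shows that general technique only; `FindingStore` and its fields are hypothetical, and it does not reflect how precedent-board.ts is actually implemented (relevance search and decay/compaction are left out).

```typescript
interface Finding {
  clientId: string;
  text: string;
  seenCount: number; // how many times the same finding was recorded
}

class FindingStore {
  private byKey = new Map<string, Finding>();

  // Key on client plus case- and whitespace-normalised text, so the same
  // finding recorded for two clients stays separate.
  private static key(clientId: string, text: string): string {
    const normalised = text.toLowerCase().replace(/\s+/g, " ").trim();
    return `${clientId}\u0000${normalised}`;
  }

  // Returns true when the finding is new, false when it merged into an
  // existing entry. Map lookups make this expected constant time.
  add(clientId: string, text: string): boolean {
    const key = FindingStore.key(clientId, text);
    const existing = this.byKey.get(key);
    if (existing) {
      existing.seenCount += 1;
      return false;
    }
    this.byKey.set(key, { clientId, text, seenCount: 1 });
    return true;
  }

  count(): number {
    return this.byKey.size;
  }
}
```

A repeat count like `seenCount` is one plausible input to a later decay or compaction pass, since findings seen only once are natural candidates to age out first.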