What they remember

Knowledge base

Document-backed retrieval. Users upload contracts, memos, and reference material; Lavern indexes them into chunks searchable by FTS5 with BM25 ranking, legal-synonym expansion, and n-gram re-ranking. All queries are user-scoped — user A never sees user B's collection.
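As a rough illustration of legal-synonym expansion, the sketch below rewrites a user query into an FTS5 MATCH expression in which each term is OR-ed with its synonyms. The synonym table, the `quoteTerm` helper, and the `expandQuery` name are hypothetical; Lavern's actual synonym list and query builder are not shown in this document.

```typescript
// Illustrative synonym table; the real legal-synonym list is not shown here.
const SYNONYMS: Record<string, string[]> = {
  terminate: ["cancel", "rescind"],
  indemnify: ["hold harmless"],
};

// Quote a term as an FTS5 string literal (embedded double quotes are doubled).
const quoteTerm = (term: string): string => `"${term.replace(/"/g, '""')}"`;

// Turn "terminate lease" into ("terminate" OR "cancel" OR "rescind") AND ("lease").
function expandQuery(input: string): string {
  return input
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => t.length > 0)
    .map((t) => `(${[t, ...(SYNONYMS[t] ?? [])].map(quoteTerm).join(" OR ")})`)
    .join(" AND ");
}
```

Quoting every term keeps user input from being interpreted as FTS5 operators, which is one way to reduce how often a query fails to parse.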

Pipeline

buffer → parseDocument() → walk sections → chunk → SQLite. The indexer reuses the same parser as the rest of Lavern (PDF, DOCX, Markdown, plain text) and the same section detector, so headings line up with what agents see when they read the source document.
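The walk-and-chunk step might look roughly like the sketch below. It is an illustration under assumptions, not Lavern's indexer: the `Section` and `Chunk` shapes, the `chunkSections` name, and the 800-character default are all hypothetical, and the real output of parseDocument() may differ.

```typescript
// Hypothetical shapes; the real parseDocument() output is not shown in this document.
interface Section {
  heading: string;
  text: string;
}

interface Chunk {
  heading: string; // heading of the section the chunk came from
  content: string;
  index: number; // position of the chunk within the document
}

// Split each section into chunks of at most maxChars, packing whole paragraphs
// together where they fit. A chunk never straddles two sections.
function chunkSections(sections: Section[], maxChars = 800): Chunk[] {
  const chunks: Chunk[] = [];
  for (const section of sections) {
    let current = "";
    const flush = (): void => {
      if (current.trim().length > 0) {
        chunks.push({ heading: section.heading, content: current.trim(), index: chunks.length });
      }
      current = "";
    };
    for (const para of section.text.split(/\n{2,}/)) {
      // Hard-split any paragraph that is itself longer than maxChars.
      for (let i = 0; i < para.length; i += maxChars) {
        const piece = para.slice(i, i + maxChars);
        if (current.length > 0 && current.length + piece.length + 2 > maxChars) flush();
        current = current.length > 0 ? `${current}\n\n${piece}` : piece;
      }
    }
    flush();
  }
  return chunks;
}
```

Because every chunk carries its section heading, a search hit can be traced back to the same heading an agent would see in the source document.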

Retrieval

Hybrid two-stage: BM25 keyword search over-fetches 3× the requested top-k, n-gram overlap then re-ranks those candidates for conceptual similarity, and the top-k by combined score is returned. If FTS parsing fails, retrieval falls back to LIKE substring search. Each result carries chunk_id, the parent document_id, collection metadata, heading, content, doc type, and jurisdiction, which is enough for an agent to cite back to source.
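The second stage can be sketched as follows. This is a minimal illustration, not Lavern's scoring code: character trigrams, Jaccard overlap, and the equal 0.5/0.5 weighting are assumptions, as are the `Candidate` shape and function names. It also assumes the BM25 score has already been normalized so that higher is better (SQLite's bm25() returns lower values for better matches, so some conversion would be needed first).

```typescript
interface Candidate {
  chunkId: string;
  content: string;
  bm25: number; // assumed already normalized to [0, 1], higher is better
}

// Character n-grams of a lowercased, whitespace-normalized string.
function ngrams(text: string, n = 3): Set<string> {
  const s = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + n <= s.length; i++) grams.add(s.slice(i, i + n));
  return grams;
}

// Jaccard overlap between two n-gram sets.
function overlap(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 || b.size === 0) return 0;
  let shared = 0;
  a.forEach((g) => {
    if (b.has(g)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Stage 2: re-rank the over-fetched BM25 candidates and keep the top k.
function rerank(query: string, candidates: Candidate[], k: number, alpha = 0.5): Candidate[] {
  const q = ngrams(query);
  return candidates
    .map((c) => ({ c, score: alpha * c.bm25 + (1 - alpha) * overlap(q, ngrams(c.content)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.c);
}
```

In this sketch the caller would fetch 3 × k candidates from FTS5 and pass them to `rerank` with the original k, so a chunk that BM25 ranked slightly lower can still surface if it overlaps the query more closely.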

API surface

POST /api/knowledge-base/collections             Create a collection
GET /api/knowledge-base/collections              List collections (user-scoped)
POST /api/knowledge-base/collections/:id/upload  Upload + index a document
GET /api/knowledge-base/search                   FTS5 search across collections
DELETE /api/knowledge-base/collections/:id       Drop a collection and all chunks
DELETE /api/knowledge-base/documents/:id         Drop a single document
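A client call against the search endpoint might look like the sketch below. Only the path comes from the endpoint list above; the `q` and `limit` query parameters, the bearer-token header, and both function names are assumptions made for illustration, so check the actual route handler before relying on them.

```typescript
// Hypothetical client helper. The path is from the endpoint list; the `q` and
// `limit` query parameters are assumptions, not documented names.
function buildSearchUrl(baseUrl: string, query: string, limit = 10): string {
  const url = new URL("/api/knowledge-base/search", baseUrl);
  url.searchParams.set("q", query);
  url.searchParams.set("limit", String(limit));
  return url.toString();
}

// Assumes bearer-token auth, which is also not confirmed by this document.
async function searchKnowledgeBase(baseUrl: string, token: string, query: string): Promise<unknown> {
  const res = await fetch(buildSearchUrl(baseUrl, query), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`Search failed with status ${res.status}`);
  return res.json();
}
```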

Agent access

Agents reach the knowledge base through the knowledge-base.ts MCP tool. Permission is phase-gated by src/permissions/: a researcher can read freely; an orchestrator typically cannot. Retrieved chunks go onto the debate board with the same provenance shape as document quotes, so verification can still string-match every cited span.
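The string-match check on cited spans could be sketched like this. The `Citation` shape, the whitespace normalization, and the `unverifiedCitations` name are assumptions for illustration; the actual provenance shape on the debate board is not shown in this document.

```typescript
// Hypothetical provenance shape: a quote plus the chunk it claims to come from.
interface Citation {
  chunkId: string;
  quote: string;
}

// Collapse whitespace so line wrapping in the source does not cause false negatives.
const normalize = (s: string): string => s.replace(/\s+/g, " ").trim();

// Returns the citations whose quote does NOT appear verbatim in the cited chunk,
// including citations that point at a chunk id that does not exist.
function unverifiedCitations(citations: Citation[], chunks: Map<string, string>): Citation[] {
  return citations.filter((c) => {
    const content = chunks.get(c.chunkId);
    return content === undefined || !normalize(content).includes(normalize(c.quote));
  });
}
```

An empty result means every cited span was found in its chunk; anything returned is a candidate for rejection or re-retrieval.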

Seeded legal datasets

scripts/seed-knowledge-base.ts populates five permissively licensed corpora out of the box.

CUAD        510 commercial contracts, 41 clause types  CC BY 4.0
MAUD        152 merger agreements, 92 deal points      CC BY 4.0
ACORD       126K+ clause-retrieval pairs               CC BY 4.0
UNFAIR-ToS  5.5K sentences, 8 unfair-clause types      CC BY-SA 4.0
LEDGAR      60K SEC provisions, 98 clause types        CC BY-SA 4.0

ContractNLI was dropped — its CC BY-NC-SA 4.0 license is incompatible with Apache 2.0 redistribution.

Two memories, one repo

The knowledge base is one of two persistence layers. The other is Clawern's precedent-board.ts: an institutional-memory store that accumulates findings across engagements per client, with O(1) dedup, relevance search, and decay/compaction. KB is content you brought in; the precedent board is what Lavern learned along the way.
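To make the precedent-board ideas concrete, here is a small sketch of per-client storage with constant-time dedup, decay, and compaction. It is not the code in precedent-board.ts: the class name, the normalized-text dedup key, the exponential half-life, and the 0.1 compaction threshold are all assumptions chosen for illustration.

```typescript
interface Finding {
  text: string;
  weight: number; // bumped each time the same finding is seen again
  lastSeen: number; // epoch milliseconds
}

class PrecedentBoardSketch {
  // Keyed by client, then by normalized finding text, so dedup is a single Map lookup.
  private byClient = new Map<string, Map<string, Finding>>();
  private readonly halfLifeMs: number;

  constructor(halfLifeMs = 90 * 24 * 60 * 60 * 1000) {
    this.halfLifeMs = halfLifeMs;
  }

  add(clientId: string, text: string, now: number): void {
    const key = text.toLowerCase().replace(/\s+/g, " ").trim();
    let findings = this.byClient.get(clientId);
    if (!findings) {
      findings = new Map<string, Finding>();
      this.byClient.set(clientId, findings);
    }
    const existing = findings.get(key);
    if (existing) {
      existing.weight += 1;
      existing.lastSeen = now;
    } else {
      findings.set(key, { text, weight: 1, lastSeen: now });
    }
  }

  // Exponential decay: a finding's score halves every halfLifeMs without a re-sighting.
  score(f: Finding, now: number): number {
    return f.weight * Math.pow(0.5, (now - f.lastSeen) / this.halfLifeMs);
  }

  // Compaction: drop findings whose decayed score fell below minScore.
  compact(clientId: string, now: number, minScore = 0.1): number {
    const findings = this.byClient.get(clientId);
    if (!findings) return 0;
    let removed = 0;
    findings.forEach((f, key) => {
      if (this.score(f, now) < minScore) {
        findings.delete(key);
        removed++;
      }
    });
    return removed;
  }

  size(clientId: string): number {
    const findings = this.byClient.get(clientId);
    return findings ? findings.size : 0;
  }
}
```

In this sketch a finding that keeps recurring across engagements gains weight and resets its clock, while one-off findings fade and are eventually compacted away.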