How to seed the knowledge base with the five bundled datasets (CUAD, MAUD, ACORD, UNFAIR-ToS, LEDGAR) via one idempotent script, plus the sharpest licensing discussion in the repo: what CC BY-SA actually obligates, and why ContractNLI is excluded. Useful when bootstrapping a KB or checking dataset license obligations.
Importing the Legal Reference Datasets
How to load the five bundled legal datasets (CUAD, MAUD, ACORD, UNFAIR-ToS, LEDGAR) into your Lavern instance's knowledge base, so the agents — and the dashboard search — can retrieve them as reference material.
The import is driven by one script, scripts/seed-knowledge-base.ts. It downloads each dataset from its public source, chunks it, and indexes it as a global collection that every user on the instance can search.
1. What gets imported
| Dataset | Collection name | Contents | doc_type | Source | License |
|---|---|---|---|---|---|
| CUAD | CUAD — Commercial Contract Clauses | 510 commercial contracts, 41 clause types (clause text tagged by type) | precedent | theatticusproject/cuad (GitHub zip) | CC BY 4.0 |
| MAUD | MAUD — Merger Agreement Deal Points | 152 merger agreements, 92 deal-point annotations | precedent | theatticusproject/maud (HF) | CC BY 4.0 |
| ACORD | ACORD — Clause Retrieval Pairs | 114 queries, 126K+ expert-rated clause-retrieval pairs | precedent | theatticusproject/acord (HF) | CC BY 4.0 |
| UNFAIR-ToS | UNFAIR-ToS — Unfair Terms of Service Clauses | ~5.5K ToS sentences, 8 unfair-clause types (EU consumer law) | regulation | coastalcph/lex_glue/unfair_tos (HF) | CC BY-SA 4.0 |
| LEDGAR | LEDGAR — SEC Contract Provisions | 60K labeled SEC provisions, 98 clause types | regulation | coastalcph/lex_glue/ledgar (HF) | CC BY-SA 4.0 |
ContractNLI is not included and cannot be seeded by this script. Its CC BY-NC-SA 4.0 license is non-commercial, which is incompatible with a commercial distribution. Passing --contractnli exits with an error by design. If you need it for personal, non-commercial use, fetch it from HuggingFace yourself and accept its terms separately. Do not bundle it into a commercial app.
See §6 Licensing before shipping these in a commercial product — all five are commercially usable, but BY-SA carries attribution + share-alike duties.
2. Prerequisites
- Node.js with the project's dev dependencies installed (npm install). The script runs via npx tsx (its shebang is #!/usr/bin/env npx tsx).
- The database. You don't need to pre-create it: the script calls initDatabase() itself, which runs the migrations and creates the kb_collections / kb_documents / kb_chunks tables and the kb_chunks_fts full-text index. The DB location follows your config (SHEM_DB_PATH, default ./data/lavern.db).
- Outbound network access to:
  - https://datasets-server.huggingface.co for MAUD, ACORD, UNFAIR-ToS, and LEDGAR
  - https://github.com/TheAtticusProject/cuad for CUAD (downloaded as a zip; the HF Datasets Server can't serve CUAD because the dataset relies on a custom Python loading script).

  If you're behind a restricted egress policy (e.g. the Azure deployment's network rules), allow-list those two hosts, or run the seed once in an environment that can reach them and copy the resulting DB + cache.
- Disk for the download cache at ./data/seed-cache/ (raw JSON rows; CUAD also caches its zip + extraction there).
3. Running the import
From the repo root:
# Seed ALL five datasets (idempotent — skips any already imported)
npx tsx scripts/seed-knowledge-base.ts
# Re-seed everything from scratch (deletes existing collections first)
npx tsx scripts/seed-knowledge-base.ts --force
# Seed a single dataset
npx tsx scripts/seed-knowledge-base.ts --cuad
npx tsx scripts/seed-knowledge-base.ts --maud
npx tsx scripts/seed-knowledge-base.ts --acord
npx tsx scripts/seed-knowledge-base.ts --unfair-tos
npx tsx scripts/seed-knowledge-base.ts --ledgar
Behavior:
- No flags → seeds all five.
- Idempotent → if a collection already has data it's skipped (the script prints "Already seeded. Use --force to re-seed."). Safe to re-run.
- --force → deletes the existing collection and re-imports it fresh.
- Per-dataset flags can be combined (e.g. --unfair-tos --ledgar).
- The full run takes a while: it paginates the HF Datasets Server 100 rows at a time with a 500 ms throttle between pages (to dodge HTTP 429s), and LEDGAR alone is 60K rows. First run is slow; subsequent runs use the cache and are fast.
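The flag rules above can be sketched as a small pure function. This is a minimal illustration, not the script's actual code: resolveSeedPlan, SeedPlan, and the error message are hypothetical names chosen for the example; only the flag names come from this document.

```typescript
// Hypothetical sketch of the flag rules described above; not the seeder's real code.
const DATASET_FLAGS = ["--cuad", "--maud", "--acord", "--unfair-tos", "--ledgar"] as const;
type DatasetFlag = (typeof DATASET_FLAGS)[number];

interface SeedPlan {
  datasets: DatasetFlag[]; // which datasets to seed, in table order
  force: boolean; // true means delete and re-import existing collections
}

function resolveSeedPlan(argv: string[]): SeedPlan {
  if (argv.includes("--contractnli")) {
    // ContractNLI is CC BY-NC-SA 4.0, so the seeder refuses it by design.
    throw new Error("ContractNLI is not bundled: its license is non-commercial.");
  }
  const selected = DATASET_FLAGS.filter((flag) => argv.includes(flag));
  return {
    // No per-dataset flag means "seed all five".
    datasets: selected.length > 0 ? selected : [...DATASET_FLAGS],
    force: argv.includes("--force"),
  };
}
```

In a real entry point the input would typically be process.argv.slice(2); taking the array as a parameter keeps the rule easy to test in isolation.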
Expected tail of a successful run:
═══════════════════════════════════════════════════════
Done. CUAD: <n>. MAUD: <n>. ACORD: <n>. UNFAIR-ToS: <n>. LEDGAR: <n>
Agents can now search with: search_knowledge_base("limitation of liability SaaS")
4. What happens under the hood
For each dataset the script:
- initDatabase(): opens/migrates the DB.
- ensureSystemUser(): creates a __system__ user (__system__@lavern.internal, no login) that owns the global collections.
- ensureGlobalCollection(name, description, docType): creates a row in kb_collections with is_global = 1, owned by __system__. Global means every user's search_knowledge_base can see it (it's not scoped to one user).
- Fetch the rows:
  - HF datasets (MAUD/ACORD/UNFAIR-ToS/LEDGAR) via the paginated Datasets Server, cached to ./data/seed-cache/<name>-rows.json.
  - CUAD via the GitHub data.zip, extracted to ./data/seed-cache/cuad-extracted/ and parsed from train_separate_questions.json.
- Chunk + index: groups annotations (by contract, deal point, or unfair label), then inserts kb_documents + kb_chunks rows in a single transaction. Each chunk carries metadata JSON recording its source (CUAD/MAUD/…) and dataset-specific fields (e.g. unfairType, clauseType, category).
- FTS indexing is automatic: the kb_chunks_fts virtual table is kept in sync by triggers on kb_chunks, so no separate indexing step is needed.
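The "fetch the rows" step for the HF datasets amounts to a throttled pagination loop. The sketch below shows the general shape under stated assumptions: fetchAllRows and the fetchPage callback are hypothetical names (the real script may structure this differently), and the callback stands in for the actual Datasets Server request so the loop can be exercised without network access.

```typescript
// Hypothetical sketch: page through a row source 100 rows at a time, pausing between pages.
type Row = Record<string, unknown>;
// fetchPage is an assumed callback; a real one would call the HF Datasets Server.
type FetchPage = (offset: number, length: number) => Promise<Row[]>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchAllRows(
  fetchPage: FetchPage,
  pageSize = 100,
  throttleMs = 500,
): Promise<Row[]> {
  const rows: Row[] = [];
  for (let offset = 0; ; offset += pageSize) {
    const page = await fetchPage(offset, pageSize);
    rows.push(...page);
    // A short (or empty) page means the source is exhausted.
    if (page.length < pageSize) break;
    // Pause before the next request to stay under the rate limit.
    await sleep(throttleMs);
  }
  return rows;
}
```

Writing the collected rows to a JSON file after the loop, and reading that file first on later runs, would give the cache behavior described above.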
Note: chunk content lands in kb_chunks; the per-document doc_type column on kb_documents is written as precedent by the shared insert helper, while the collection's doc_type reflects the table above (precedent vs regulation). Retrieval filters on either, so both are queryable.
Content + metadata stored per chunk
Every chunk persists both the searchable text and a structured metadata
JSON blob, so you get content and relevant metadata for each dataset. The
metadata is stored on kb_chunks.metadata and returned alongside results; the
search-facing fields are heading (the clause/label name) and content (the
text that gets full-text indexed).
| Dataset | heading | content (indexed) | metadata JSON fields |
|---|---|---|---|
| CUAD | clause type | the clause text | clauseType, contractTitle, source: "CUAD" |
| MAUD | deal-point heading | deal-point text | dealPoint (the question), category, answer, source: "MAUD" |
| ACORD | clause category | the clause text | clauseCategory, clauseId, associatedQueries (top 5 matching queries), source: "ACORD" |
| UNFAIR-ToS | Unfair: <type> | the ToS sentence | unfairType, allLabels (all unfair labels on that sentence), source: "UNFAIR-ToS" |
| LEDGAR | provision type | the provision text | provisionType, source: "LEDGAR" |
Common to all: each chunk also carries chunk_index, word_count, the parent
document_id/collection_id, and user_id = __system__. The parent
kb_documents row records a synthetic filename (e.g. LEDGAR-<type>.txt,
<contractTitle>.txt), word_count, and page_count.
If you want additional metadata captured (e.g. preserving CUAD's source
contract category, MAUD's per-question answer rationale, or a jurisdiction
value — currently left empty), that's a small change in the relevant seedX
function: extend the JSON.stringify({ … }) block before insertChunk.run(...),
and/or pass a non-empty jurisdiction to insertDoc. The retrieval layer already
parses and returns metadata to callers, so anything you add there flows through
to the agents and the API without further changes.
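As a rough illustration of extending the metadata blob, the sketch below builds a CUAD chunk's metadata string with one optional extra field. buildCuadMetadata, CuadChunkInput, and contractCategory are hypothetical names for this example; the existing keys (clauseType, contractTitle, source) are the ones listed in the table above.

```typescript
// Hypothetical sketch of a per-chunk metadata builder for CUAD.
interface CuadChunkInput {
  clauseType: string;
  contractTitle: string;
  contractCategory?: string; // assumed extra field, not stored today
}

function buildCuadMetadata(input: CuadChunkInput): string {
  return JSON.stringify({
    clauseType: input.clauseType,
    contractTitle: input.contractTitle,
    source: "CUAD",
    // Emit the extra key only when a value is present, so existing rows keep their shape.
    ...(input.contractCategory ? { contractCategory: input.contractCategory } : {}),
  });
}
```

Adding keys conditionally keeps previously seeded chunks and newly seeded ones readable by the same consumer code.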
5. Verifying the import
Via the agent tool (what the agents use) — search_knowledge_base:
search_knowledge_base("limitation of liability SaaS")
search_knowledge_base("merger termination fee")
search_knowledge_base("unilateral change of terms")
Via the API / dashboard — the knowledge-base route (src/api/routes/knowledge-base.ts) exposes search and a collection listing to the UI.
Via SQL (sanity check):
SELECT c.name, COUNT(ch.id) AS chunks
FROM kb_collections c
LEFT JOIN kb_chunks ch ON ch.collection_id = c.id
WHERE c.is_global = 1
GROUP BY c.id;
You should see all five collections with non-zero chunk counts.
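If you want to automate that check, the rows returned by the query can be validated with a small helper. This is a sketch only: checkSeedCounts and CollectionCount are hypothetical names, and the function assumes you have already run the SQL above with whatever SQLite client you use and mapped each row to a name and a chunk count.

```typescript
// Hypothetical sketch: validate the (name, chunks) rows produced by the sanity-check query.
interface CollectionCount {
  name: string;
  chunks: number;
}

function checkSeedCounts(rows: CollectionCount[], expectedCollections = 5): string[] {
  const problems: string[] = [];
  if (rows.length !== expectedCollections) {
    problems.push(`expected ${expectedCollections} global collections, found ${rows.length}`);
  }
  for (const row of rows) {
    // LEFT JOIN yields a count of 0 for a collection that exists but has no chunks.
    if (row.chunks === 0) problems.push(`collection "${row.name}" has no chunks`);
  }
  return problems;
}
```

An empty result means the import looks healthy; any returned strings describe what to investigate.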
6. Licensing obligations
All five are safe for commercial use (none is non-commercial). But you take on obligations on distribution — handle these before shipping:
- Attribution (all five). Credit each dataset's creator (the Atticus Project for CUAD/MAUD/ACORD; LexGLUE for UNFAIR-ToS/LEDGAR), link the license, and indicate that you adapted it (the seeder reshapes/chunks the data). Surface this in a NOTICE file and/or an in-app "Data Sources" page.
- ShareAlike (UNFAIR-ToS and LEDGAR, both CC BY-SA 4.0). Any adaptation of those datasets that you distribute must be licensed under BY-SA 4.0 (or a later version, or a BY-SA Compatible License). This does not infect your application code or your proprietary logic: software that queries the data is not an adaptation of it. Keep the datasets as separate collections (they already are) and don't merge BY-SA content into a combined work you want to keep proprietary.
- Gray area: derived artifacts. If you generate and distribute a derivative of the BY-SA datasets (a cleaned corpus export, or an embeddings / vector index of those specific datasets), it may count as an adaptation and need to be BY-SA. Keeping such derivatives server-side and non-distributed avoids the question. Get IP counsel to confirm for your distribution model.
- No extra restrictions. Don't add legal terms or DRM that limit what recipients can do with the BY-SA material itself.
This is informational, not legal advice — confirm with counsel for a commercial launch.
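One way to keep the attribution duty from drifting is to generate the NOTICE text from a single list. The sketch below is illustrative only: renderNotice, DatasetAttribution, and the sentence wording are hypothetical, only two of the five datasets are shown, and the creator names follow the attribution bullet above. Have counsel review the final wording.

```typescript
// Hypothetical sketch: render NOTICE lines from a list of dataset attributions.
interface DatasetAttribution {
  dataset: string;
  creator: string;
  license: string;
  licenseUrl: string;
}

const ATTRIBUTIONS: DatasetAttribution[] = [
  {
    dataset: "CUAD",
    creator: "The Atticus Project",
    license: "CC BY 4.0",
    licenseUrl: "https://creativecommons.org/licenses/by/4.0/",
  },
  {
    dataset: "LEDGAR",
    creator: "LexGLUE",
    license: "CC BY-SA 4.0",
    licenseUrl: "https://creativecommons.org/licenses/by-sa/4.0/",
  },
];

function renderNotice(entries: DatasetAttribution[]): string {
  return entries
    .map(
      (e) =>
        // Each line credits the creator, links the license, and states that the data was adapted.
        `${e.dataset} by ${e.creator}, licensed under ${e.license} (${e.licenseUrl}). ` +
        `Adapted: reshaped and chunked for search.`,
    )
    .join("\n");
}
```

The same list could feed an in-app "Data Sources" page, so both surfaces stay in sync.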
7. Looking ahead: seeding under Postgres + pgvector
Today the seeder writes to SQLite and relies on the FTS5 triggers to index chunks for lexical search. Under the planned Azure migration (docs/azure-migration.md), the KB moves to Postgres + pgvector, which changes seeding in two ways:
- Embeddings become part of seeding. Each chunk needs a vector (embedding vector(1536)) generated via Azure OpenAI text-embedding-3. seed-knowledge-base.ts becomes an embedding call site alongside src/knowledge-base/indexer.ts: embed each chunk (batched) before insert, and write content_tsv for the lexical half.
- Backfill. Any rows seeded before the migration need a one-time pass to populate their embeddings.
Until then, the SQLite seeding flow above is unchanged. (The BY-SA embeddings caveat in §6 applies to the vectors you generate for UNFAIR-ToS and LEDGAR.)
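The "embed each chunk (batched) before insert" step might look roughly like the following. This is a sketch under assumptions, since the migration is still planned: embedChunks, toBatches, the batch size of 16, and the embedBatch callback are all hypothetical, and the callback stands in for whatever embeddings client the migration ends up using.

```typescript
// Hypothetical sketch: embed chunk texts in fixed-size batches before inserting them.
// embedBatch is an assumed callback standing in for the real embeddings API call.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

async function embedChunks(
  texts: string[],
  embedBatch: EmbedBatch,
  batchSize = 16,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    const result = await embedBatch(batch);
    // Guard against a provider returning fewer vectors than inputs.
    if (result.length !== batch.length) {
      throw new Error("embedding count does not match batch size");
    }
    vectors.push(...result);
  }
  return vectors; // vectors[i] corresponds to texts[i]
}
```

The same helper could serve the backfill pass: select rows that lack an embedding, run them through embedChunks, and update each row with its vector.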
8. Adding your own dataset
To import something beyond these five, mirror an existing seedX function in
scripts/seed-knowledge-base.ts:
- Add a collection name constant and a seedYourData(force) function.
- Fetch your rows (reuse fetchAllHfRows for HF datasets, or your own loader).
- Call ensureGlobalCollection(name, description, docType), picking precedent, regulation, playbook, template, or prior_analysis for doc_type.
- Insert kb_documents + kb_chunks via prepareInsertStatements(), setting heading, content, and a metadata JSON source tag. FTS indexing is automatic.
- Wire a --your-flag into main().
Confirm the license permits your intended (commercial) use and redistribution before bundling it — that's exactly the check ContractNLI failed.
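Before the insert step, a new seed function typically needs to group raw rows into documents and chunks. The sketch below shows one possible shape for that grouping, assuming a simple labeled dataset: groupIntoDocuments, LabeledRow, and DocumentDraft are hypothetical names, and the filename pattern mirrors the LEDGAR-<type>.txt convention mentioned earlier. The actual database writes are left out because the signatures of the script's insert helpers are not shown in this document.

```typescript
// Hypothetical sketch: group labeled rows into one document per label, one chunk per row.
interface LabeledRow {
  label: string;
  text: string;
}

interface DocumentDraft {
  filename: string; // synthetic filename, e.g. "<source>-<label>.txt"
  chunks: { heading: string; content: string; metadata: string }[];
}

function groupIntoDocuments(rows: LabeledRow[], source: string): DocumentDraft[] {
  const byLabel = new Map<string, DocumentDraft>();
  for (const row of rows) {
    let doc = byLabel.get(row.label);
    if (!doc) {
      doc = { filename: `${source}-${row.label}.txt`, chunks: [] };
      byLabel.set(row.label, doc);
    }
    doc.chunks.push({
      heading: row.label, // search-facing label
      content: row.text, // text that gets full-text indexed
      metadata: JSON.stringify({ label: row.label, source }), // source tag for provenance
    });
  }
  // Map preserves insertion order, so documents come out in first-seen label order.
  return Array.from(byLabel.values());
}
```

Each DocumentDraft would then map to one kb_documents row and its chunks to kb_chunks rows, inserted in a single transaction as the bundled seed functions do.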
Quick reference
npx tsx scripts/seed-knowledge-base.ts # all five, idempotent
npx tsx scripts/seed-knowledge-base.ts --force # wipe + re-import all
npx tsx scripts/seed-knowledge-base.ts --ledgar # one dataset
# cache lives in ./data/seed-cache/ · DB at $SHEM_DB_PATH (default ./data/lavern.db)