docs/importing-datasets.md

How to seed the knowledge base with the five bundled datasets (CUAD, MAUD, ACORD, UNFAIR-ToS, LEDGAR) using one idempotent script, along with the repo's most detailed licensing discussion: what CC BY-SA actually obligates, and why ContractNLI is excluded. Read this when bootstrapping a KB or checking dataset license obligations.

Importing the Legal Reference Datasets

How to load the five bundled legal datasets (CUAD, MAUD, ACORD, UNFAIR-ToS, LEDGAR) into your Lavern instance's knowledge base, so the agents — and the dashboard search — can retrieve them as reference material.

The import is driven by one script, scripts/seed-knowledge-base.ts. It downloads each dataset from its public source, chunks it, and indexes it as a global collection that every user on the instance can search.

1. What gets imported

| Dataset | Collection name | Contents | `doc_type` | Source | License |
| --- | --- | --- | --- | --- | --- |
| CUAD | CUAD — Commercial Contract Clauses | 510 commercial contracts, 41 clause types (clause text tagged by type) | precedent | `theatticusproject/cuad` (GitHub zip) | CC BY 4.0 |
| MAUD | MAUD — Merger Agreement Deal Points | 152 merger agreements, 92 deal-point annotations | precedent | `theatticusproject/maud` (HF) | CC BY 4.0 |
| ACORD | ACORD — Clause Retrieval Pairs | 114 queries, 126K+ expert-rated clause-retrieval pairs | precedent | `theatticusproject/acord` (HF) | CC BY 4.0 |
| UNFAIR-ToS | UNFAIR-ToS — Unfair Terms of Service Clauses | ~5.5K ToS sentences, 8 unfair-clause types (EU consumer law) | regulation | `coastalcph/lex_glue/unfair_tos` (HF) | CC BY-SA 4.0 |
| LEDGAR | LEDGAR — SEC Contract Provisions | 60K labeled SEC provisions, 98 clause types | regulation | `coastalcph/lex_glue/ledgar` (HF) | CC BY-SA 4.0 |

ContractNLI is not included and cannot be seeded by this script. Its CC BY-NC-SA 4.0 license is non-commercial, incompatible with a commercial distribution. Passing --contractnli exits with an error by design. If you need it for personal, non-commercial use, fetch it from HuggingFace yourself and accept its terms separately. Do not bundle it into a commercial app.

See §6 Licensing before shipping these in a commercial product — all five are commercially usable, but BY-SA carries attribution + share-alike duties.

2. Prerequisites

Node.js with the project's dev dependencies installed (npm install). The script runs via npx tsx (its shebang is #!/usr/bin/env npx tsx).
The database. You don't need to pre-create it — the script calls initDatabase() itself, which runs the migrations and creates the kb_collections / kb_documents / kb_chunks tables and the kb_chunks_fts full-text index. The DB location follows your config (SHEM_DB_PATH, default ./data/lavern.db).
Outbound network access to:
- https://datasets-server.huggingface.co — MAUD, ACORD, UNFAIR-ToS, LEDGAR
- https://github.com/TheAtticusProject/cuad (CUAD, downloaded as a zip; the HF Datasets Server can't serve CUAD because the dataset relies on custom Python code).
If you're behind a restricted egress policy (e.g. the Azure deployment's network rules), allow-list those two hosts, or run the seed once in an environment that can reach them and copy the resulting DB + cache.
Disk for the download cache at ./data/seed-cache/ (raw JSON rows; CUAD also caches its zip + extraction there).

3. Running the import

From the repo root:

```bash
# Seed all five datasets (idempotent: skips any already imported)
npx tsx scripts/seed-knowledge-base.ts

# Re-seed everything from scratch (deletes existing collections first)
npx tsx scripts/seed-knowledge-base.ts --force

# Seed a single dataset
npx tsx scripts/seed-knowledge-base.ts --cuad
npx tsx scripts/seed-knowledge-base.ts --maud
npx tsx scripts/seed-knowledge-base.ts --acord
npx tsx scripts/seed-knowledge-base.ts --unfair-tos
npx tsx scripts/seed-knowledge-base.ts --ledgar
```

Behavior:

No flags → seeds all five.
Idempotent → if a collection already has data, it is skipped and the script prints "Already seeded. Use --force to re-seed." Safe to re-run.
--force → deletes the existing collection and re-imports it fresh.
Per-dataset flags can be combined (e.g. --unfair-tos --ledgar).
The full run takes a while: it paginates the HF Datasets Server 100 rows at a time with a 500 ms throttle between pages (to avoid HTTP 429s). LEDGAR alone is 60K rows, which means about 600 pages and roughly five minutes of throttle delay on top of download time. The first run is slow; subsequent runs use the cache and are fast.
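The flag rules above can be sketched as a small pure function. This is an illustration under stated assumptions, not the script's actual code: `planSeed`, `SeedPlan`, and `DATASET_FLAGS` are hypothetical names, and only the flags themselves come from this page.

```typescript
// Hypothetical sketch of the flag handling described above.
const DATASET_FLAGS = ["--cuad", "--maud", "--acord", "--unfair-tos", "--ledgar"] as const;
type DatasetFlag = (typeof DATASET_FLAGS)[number];

interface SeedPlan {
  datasets: DatasetFlag[]; // datasets to seed, in table order
  force: boolean;          // true when --force was passed
}

function planSeed(argv: string[]): SeedPlan {
  // ContractNLI is deliberately unsupported (non-commercial license).
  if (argv.includes("--contractnli")) {
    throw new Error("ContractNLI (CC BY-NC-SA 4.0) cannot be seeded by this script.");
  }
  const picked = DATASET_FLAGS.filter((flag) => argv.includes(flag));
  return {
    // No per-dataset flag means "seed all five".
    datasets: picked.length > 0 ? picked : [...DATASET_FLAGS],
    force: argv.includes("--force"),
  };
}
```

For example, `planSeed(["--unfair-tos", "--ledgar"])` selects only the two BY-SA datasets, while `planSeed([])` selects all five.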

Expected tail of a successful run:

```text
═══════════════════════════════════════════════════════
Done. CUAD: <n>. MAUD: <n>. ACORD: <n>. UNFAIR-ToS: <n>. LEDGAR: <n>
Agents can now search with: search_knowledge_base("limitation of liability SaaS")
```

4. What happens under the hood

For each dataset the script:

initDatabase() — opens/migrates the DB.
ensureSystemUser() — creates a __system__ user (__system__@lavern.internal, no login) that owns the global collections.
ensureGlobalCollection(name, description, docType) — creates a row in kb_collections with is_global = 1, owned by __system__. Global means every user's search_knowledge_base can see it (it's not scoped to one user).
Fetch the rows:
- HF datasets (MAUD/ACORD/UNFAIR-ToS/LEDGAR) via the paginated Datasets Server, cached to ./data/seed-cache/<name>-rows.json.
- CUAD via the GitHub data.zip, extracted to ./data/seed-cache/cuad-extracted/ and parsed from train_separate_questions.json.
Chunk + index — groups annotations (by contract, deal point, or unfair label), then inserts kb_documents + kb_chunks rows in a single transaction. Each chunk carries metadata JSON recording its source (CUAD/MAUD/…) and dataset-specific fields (e.g. unfairType, clauseType, category).
FTS indexing is automatic — the kb_chunks_fts virtual table is kept in sync by triggers on kb_chunks, so no separate indexing step is needed.
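The paginated fetch in the steps above might look roughly like the following. It is a minimal sketch, assuming a page fetcher is injected: `fetchAllRows`, `FetchPage`, and `RowsPage` are hypothetical names (the script's own helper is `fetchAllHfRows`, whose signature is not shown here), and the HTTP call and the on-disk cache are left out.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not the script's real helpers.
interface RowsPage<T> {
  rows: T[];
  total: number; // total rows the server reports for the split
}

type FetchPage<T> = (offset: number, length: number) => Promise<RowsPage<T>>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchAllRows<T>(
  fetchPage: FetchPage<T>,
  pageSize = 100,   // page size quoted in section 3
  throttleMs = 500, // pause between pages to stay under rate limits
): Promise<T[]> {
  const all: T[] = [];
  let offset = 0;
  while (true) {
    const page = await fetchPage(offset, pageSize);
    all.push(...page.rows);
    offset += page.rows.length;
    // Stop when the server returns an empty page or everything has been read.
    if (page.rows.length === 0 || offset >= page.total) break;
    await sleep(throttleMs);
  }
  return all;
}
```

Injecting `fetchPage` keeps the loop testable without network access; a real caller would wrap the Datasets Server request and write the collected rows to the cache file afterwards.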

Note: chunk content lands in kb_chunks. The shared insert helper writes the per-document doc_type column on kb_documents as precedent for every dataset, while the collection's doc_type follows the table in §1 (precedent or regulation). UNFAIR-ToS and LEDGAR documents are therefore stored as precedent inside collections typed regulation. Retrieval can filter on either column, so both values are queryable.

Content + metadata stored per chunk

Every chunk persists both the searchable text and a structured metadata JSON blob. The metadata is stored in kb_chunks.metadata and returned alongside search results; the search-facing fields are heading (the clause or label name) and content (the text that is full-text indexed).

| Dataset | `heading` | `content` (indexed) | `metadata` JSON fields |
| --- | --- | --- | --- |
| CUAD | clause type | the clause text | `clauseType`, `contractTitle`, `source: "CUAD"` |
| MAUD | deal-point heading | deal-point text | `dealPoint` (the question), `category`, `answer`, `source: "MAUD"` |
| ACORD | clause category | the clause text | `clauseCategory`, `clauseId`, `associatedQueries` (top 5 matching queries), `source: "ACORD"` |
| UNFAIR-ToS | `Unfair: <type>` | the ToS sentence | `unfairType`, `allLabels` (all unfair labels on that sentence), `source: "UNFAIR-ToS"` |
| LEDGAR | provision type | the provision text | `provisionType`, `source: "LEDGAR"` |

Common to all: each chunk also carries chunk_index, word_count, the parent document_id/collection_id, and user_id = __system__. The parent kb_documents row records a synthetic filename (e.g. LEDGAR-<type>.txt, <contractTitle>.txt), word_count, and page_count.
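Callers that consume kb_chunks.metadata may find it convenient to model the per-dataset fields as a discriminated union on `source`. The sketch below is inferred from the table above; the value types (for example, `clauseId` as a string) and the names `ChunkMetadata` and `describeChunk` are assumptions, and the script's real types may differ.

```typescript
// Shapes inferred from the metadata table; field value types are assumptions.
type ChunkMetadata =
  | { source: "CUAD"; clauseType: string; contractTitle: string }
  | { source: "MAUD"; dealPoint: string; category: string; answer: string }
  | { source: "ACORD"; clauseCategory: string; clauseId: string; associatedQueries: string[] }
  | { source: "UNFAIR-ToS"; unfairType: string; allLabels: string[] }
  | { source: "LEDGAR"; provisionType: string };

// Parse the JSON stored in kb_chunks.metadata and narrow on `source`.
function describeChunk(metadataJson: string): string {
  const meta = JSON.parse(metadataJson) as ChunkMetadata;
  switch (meta.source) {
    case "CUAD":
      return `CUAD clause "${meta.clauseType}" from ${meta.contractTitle}`;
    case "MAUD":
      return `MAUD deal point "${meta.dealPoint}" (${meta.category})`;
    case "ACORD":
      return `ACORD clause ${meta.clauseId} in "${meta.clauseCategory}"`;
    case "UNFAIR-ToS":
      return `UNFAIR-ToS sentence flagged as ${meta.allLabels.join(", ")}`;
    case "LEDGAR":
      return `LEDGAR provision "${meta.provisionType}"`;
    default:
      throw new Error("Unknown metadata source");
  }
}
```

Narrowing on `source` lets the compiler check that each branch only touches fields that dataset actually records.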

If you want additional metadata captured (e.g. preserving CUAD's source contract category, MAUD's per-question answer rationale, or a jurisdiction value — currently left empty), that's a small change in the relevant seedX function: extend the JSON.stringify({ … }) block before insertChunk.run(...), and/or pass a non-empty jurisdiction to insertDoc. The retrieval layer already parses and returns metadata to callers, so anything you add there flows through to the agents and the API without further changes.

5. Verifying the import

Via the agent tool (what the agents use) — search_knowledge_base:

```text
search_knowledge_base("limitation of liability SaaS")
search_knowledge_base("merger termination fee")
search_knowledge_base("unilateral change of terms")
```

Via the API / dashboard — the knowledge-base route (src/api/routes/knowledge-base.ts) exposes search and a collection listing to the UI.

Via SQL (sanity check):

```sql
SELECT c.name, COUNT(ch.id) AS chunks
FROM kb_collections c
LEFT JOIN kb_chunks ch ON ch.collection_id = c.id
WHERE c.is_global = 1
GROUP BY c.id;
```

You should see all five collections with non-zero chunk counts.
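If you want to automate that check, the result rows of the query can be compared against the five expected collection names. This is a hedged sketch: `findUnseeded` and `CollectionCount` are hypothetical names, the collection names are copied from the table in §1, and running the query itself (with whatever SQLite client the project uses) is left to the caller.

```typescript
// Collection names copied from the table in section 1.
const EXPECTED_COLLECTIONS = [
  "CUAD — Commercial Contract Clauses",
  "MAUD — Merger Agreement Deal Points",
  "ACORD — Clause Retrieval Pairs",
  "UNFAIR-ToS — Unfair Terms of Service Clauses",
  "LEDGAR — SEC Contract Provisions",
];

// One row of the sanity-check query: collection name plus chunk count.
interface CollectionCount {
  name: string;
  chunks: number;
}

// Returns the expected collections that are missing or have zero chunks.
function findUnseeded(rows: CollectionCount[]): string[] {
  const counts = new Map(rows.map((r) => [r.name, r.chunks] as const));
  return EXPECTED_COLLECTIONS.filter((name) => (counts.get(name) ?? 0) === 0);
}
```

An empty array means every bundled collection has at least one chunk; any names returned point at datasets to re-run, for example with the matching per-dataset flag.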

6. Licensing obligations

All five are safe for commercial use (none is non-commercial). But you take on obligations on distribution — handle these before shipping:

Attribution (all five). Credit each dataset's creator (the Atticus Project for CUAD/MAUD/ACORD; LexGLUE for UNFAIR-ToS/LEDGAR), link the license, and indicate that you adapted it (the seeder reshapes/chunks the data). Surface this in a NOTICE file and/or an in-app "Data Sources" page.
ShareAlike (UNFAIR-ToS, LEDGAR — CC BY-SA 4.0). Any adaptation of those datasets that you distribute must be licensed BY-SA 4.0. This does not infect your application code or your proprietary logic — software that queries the data is not an adaptation of it. Keep the datasets as separate collections (they already are) and don't merge BY-SA content into a combined work you want to keep proprietary.
Gray area — derived artifacts. If you generate and distribute a derivative of the BY-SA datasets (a cleaned corpus export, or an embeddings / vector index of those specific datasets), it may count as an adaptation and need to be BY-SA. Keeping such derivatives server-side and non-distributed avoids the question. Get IP counsel to confirm for your distribution model.
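No additional restrictions. Don't apply legal terms or technological measures (such as DRM) to the BY-SA material itself that would stop recipients from doing what the license permits.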

This is informational, not legal advice — confirm with counsel for a commercial launch.
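One way to keep the attribution text in sync with the datasets you ship is to generate the NOTICE entries from a small table. The sketch below is only an illustration: `renderNotice`, `DatasetCredit`, and the wording of each line are assumptions, and the creators are the ones named in the attribution item above. The license URLs are the standard Creative Commons deed addresses.

```typescript
type LicenseName = "CC BY 4.0" | "CC BY-SA 4.0";

interface DatasetCredit {
  name: string;
  creator: string; // as named in the attribution item above
  license: LicenseName;
}

const LICENSE_URLS: Record<LicenseName, string> = {
  "CC BY 4.0": "https://creativecommons.org/licenses/by/4.0/",
  "CC BY-SA 4.0": "https://creativecommons.org/licenses/by-sa/4.0/",
};

const CREDITS: DatasetCredit[] = [
  { name: "CUAD", creator: "The Atticus Project", license: "CC BY 4.0" },
  { name: "MAUD", creator: "The Atticus Project", license: "CC BY 4.0" },
  { name: "ACORD", creator: "The Atticus Project", license: "CC BY 4.0" },
  { name: "UNFAIR-ToS", creator: "LexGLUE", license: "CC BY-SA 4.0" },
  { name: "LEDGAR", creator: "LexGLUE", license: "CC BY-SA 4.0" },
];

// Renders one attribution line per dataset: creator, license link, adaptation note.
function renderNotice(credits: DatasetCredit[]): string {
  return credits
    .map(
      (c) =>
        `${c.name}, by ${c.creator}. Licensed under ${c.license} ` +
        `(${LICENSE_URLS[c.license]}). Adapted: reshaped and chunked for search.`,
    )
    .join("\n");
}
```

The output of `renderNotice(CREDITS)` could be written to a NOTICE file or rendered on a "Data Sources" page; have counsel review the final wording.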

7. Looking ahead: seeding under Postgres + pgvector

Today the seeder writes to SQLite and relies on the FTS5 triggers to index chunks for lexical search. Under the planned Azure migration (docs/azure-migration.md), the KB moves to Postgres + pgvector, which changes seeding in two ways:

Embeddings become part of seeding. Each chunk needs a vector (embedding vector(1536)) generated via Azure OpenAI text-embedding-3. seed-knowledge-base.ts becomes an embedding call site alongside src/knowledge-base/indexer.ts — embed each chunk (batched) before insert, and write content_tsv for the lexical half.
Backfill. Any rows seeded before the migration need a one-time pass to populate their embeddings.
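The batched embedding step could be shaped roughly as follows. This is a sketch under stated assumptions, not the planned implementation: the embedder is injected as a hypothetical `EmbedBatch` function (the Azure OpenAI call is not shown in this document), the batch size of 64 is arbitrary, and `embedChunks` and `toVectorLiteral` are illustrative names. The only external fact relied on is that pgvector accepts vectors as bracketed text literals.

```typescript
// Hypothetical embedder signature; the real Azure OpenAI call is not shown here.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

interface ChunkRow {
  id: string;
  content: string;
}

const EMBEDDING_DIM = 1536; // matches the planned vector(1536) column

// pgvector accepts vectors as text literals such as '[0.1,0.2,0.3]'.
function toVectorLiteral(vector: number[]): string {
  return `[${vector.join(",")}]`;
}

// Embeds chunks in batches and returns chunk id -> vector literal.
async function embedChunks(
  chunks: ChunkRow[],
  embed: EmbedBatch,
  batchSize = 64, // arbitrary; tune to the embedding service's limits
): Promise<Map<string, string>> {
  const out = new Map<string, string>();
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const vectors = await embed(batch.map((c) => c.content));
    if (vectors.length !== batch.length) {
      throw new Error("Embedder returned a different number of vectors than inputs");
    }
    batch.forEach((chunk, j) => {
      if (vectors[j].length !== EMBEDDING_DIM) {
        throw new Error(`Chunk ${chunk.id}: expected ${EMBEDDING_DIM} dimensions`);
      }
      out.set(chunk.id, toVectorLiteral(vectors[j]));
    });
  }
  return out;
}
```

The same function would serve the backfill pass: select rows whose embedding is still empty, run them through `embedChunks`, and update each row with its literal.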

Until then, the SQLite seeding flow above is unchanged. (The BY-SA embeddings caveat in §6 applies to the vectors you generate for UNFAIR-ToS and LEDGAR.)

8. Adding your own dataset

To import something beyond these five, mirror an existing seedX function in scripts/seed-knowledge-base.ts:

Add a collection name constant and a seedYourData(force) function.
Fetch your rows (reuse fetchAllHfRows for HF datasets, or your own loader).
ensureGlobalCollection(name, description, docType) — pick precedent, regulation, playbook, template, or prior_analysis for doc_type.
Insert kb_documents + kb_chunks via prepareInsertStatements(), setting heading, content, and a metadata JSON source tag. FTS indexing is automatic.
Wire a --your-flag into main().
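The overall shape of such a function, including the skip-unless-force behavior, can be sketched against an in-memory stand-in. Everything below is hypothetical: `KbStore`, `createMemoryStore`, `SourceRow`, and the collection name are illustrative, and the real script would use its own ensureGlobalCollection and prepareInsertStatements helpers against the database instead.

```typescript
// Hypothetical stand-in for the database layer; not the script's real API.
interface Chunk {
  heading: string;
  content: string;
  metadata: string; // JSON string, like kb_chunks.metadata
}

interface KbStore {
  chunkCount(collection: string): number;
  clear(collection: string): void;
  insertChunk(collection: string, chunk: Chunk): void;
}

function createMemoryStore(): KbStore {
  const data = new Map<string, Chunk[]>();
  return {
    chunkCount: (c) => data.get(c)?.length ?? 0,
    clear: (c) => { data.delete(c); },
    insertChunk: (c, chunk) => { data.set(c, [...(data.get(c) ?? []), chunk]); },
  };
}

interface SourceRow {
  label: string;
  text: string;
}

const YOUR_COLLECTION = "Your Dataset Collection"; // placeholder name

// Returns the number of chunks inserted (0 when the collection was skipped).
function seedYourData(store: KbStore, rows: SourceRow[], force: boolean): number {
  if (store.chunkCount(YOUR_COLLECTION) > 0) {
    if (!force) return 0;         // already seeded: skip, like the bundled seeders
    store.clear(YOUR_COLLECTION); // --force: wipe, then re-import
  }
  for (const row of rows) {
    store.insertChunk(YOUR_COLLECTION, {
      heading: row.label,
      content: row.text,
      metadata: JSON.stringify({ source: "YourData", label: row.label }),
    });
  }
  return rows.length;
}
```

Keeping the `source` tag in the metadata mirrors the bundled datasets, so results from your collection can be told apart from the others downstream.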

Confirm the license permits your intended (commercial) use and redistribution before bundling it — that's exactly the check ContractNLI failed.

Quick reference

```bash
npx tsx scripts/seed-knowledge-base.ts                 # all five, idempotent
npx tsx scripts/seed-knowledge-base.ts --force         # wipe + re-import all
npx tsx scripts/seed-knowledge-base.ts --ledgar        # one dataset
# cache lives in ./data/seed-cache/  · DB at $SHEM_DB_PATH (default ./data/lavern.db)
```