GenAI Exclusion Rules · DLC-M04-002

DLC-M04-002 · Module 04 Deliverable

Matter

Not every document belongs in a generative-AI workflow. Some make the model hallucinate; others make the cost curve absurd; a few will get you sanctioned. This one-pager lists the categorical exclusions we recommend before you route documents to LLM-based summarization, Q&A, or first-pass triage.

Categorical Exclusions

Do not route the following into a generative-AI workflow. If any exception is warranted, document it and get client sign-off.

Class	Why	Detail
Bad OCR / low text yield	Hallucination risk	Documents whose extracted text is < 100 characters, or whose text is majority-non-alphanumeric, will pull the model toward inventing content. Route through linear review or re-OCR first.
Oversized documents (> 200 pages)	Truncation risk	LLM context windows are finite (often on the order of 100k tokens, though limits vary by model and version), and a 200-page document can approach or exceed that. Tooling often truncates long documents silently, and the model summarizes what it saw, not what you sent. Chunk manually with overlap, or exclude.
Spreadsheets with formulas	Semantic collapse	Excel exported to plain text loses structure. A GenAI model reading a flattened spreadsheet will invent narrative about "trends" and "totals" that don't exist. Convert to well-labeled tables or keep in linear review.
Encoded / obfuscated content	Prompt injection	Base64 blobs, JSON payloads, source code, escaped strings — the model may attempt to "execute" them as instructions. Route out.
Documents with adversarial content	Prompt injection	Text containing instruction-shaped strings ("ignore prior instructions", "system:", "you are now"). Assume the model will comply. Filter aggressively.
Images without meaningful OCR	Nothing to summarize	The model summarizes text, not pixels — even multimodal models lose fidelity on screenshots and photos of documents. Route to human vision review.
Multi-language documents (mixed script)	Case-by-case	Model quality varies dramatically by language. Confirm the model was benchmarked on the languages present; if not, route to bilingual review.
Documents flagged as potentially privileged	Client policy	Some firms categorically exclude anything that has hit a privilege search. Others don't. Set the rule at matter start, in writing.
Deposition transcripts, contracts, court filings	Depends on task	Summarization is generally fine. Q&A ("what does the contract require on delivery?") is high-risk without human verification. Never rely on the model's citations.
Anything used for training the model	Contamination risk	If your vendor's LLM was trained on public case law, do not use it to summarize the same case law and treat the output as independent.
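
For the oversized-document row, "chunk manually with overlap" can be sketched as follows. This is a minimal illustration, not a recommended configuration: it assumes the document is already available as a list of per-page text strings, and the function name, the 20-page chunk size, and the 2-page overlap are all hypothetical values chosen for the example. Token-based chunking would follow the same pattern with a tokenizer in place of page counts.

```python
def chunk_with_overlap(pages, chunk_size=20, overlap=2):
    """Split a list of page texts into overlapping chunks.

    Returns (first_page, last_page, text) tuples using 1-based, inclusive
    page numbers, so each chunk's summary can be traced back to the source.
    chunk_size and overlap are illustrative defaults, not recommendations.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    start = 0
    while start < len(pages):
        end = min(start + chunk_size, len(pages))
        chunks.append((start + 1, end, "\n".join(pages[start:end])))
        if end == len(pages):
            break  # last page reached; avoid emitting a redundant tail chunk
        start += step
    return chunks
```

Keeping the page range with each chunk matters more than the exact sizes: it is what lets a reviewer check a chunk-level summary against the pages it came from.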

Filter Logic — Copy-Paste Ready

Use these as boolean filters when constructing the "eligible for GenAI" population in your review tool. Apply as an EXCLUDE set on top of the responsive-eligible set.

EXCLUDE FROM GENAI IF ANY OF:
  ExtractedTextLength < 100
  OR PageCount > 200
  OR (FileExtension IN ('xlsx','xls','csv') AND FormulaCount > 0)
  OR FileType IN ('source_code','json','xml','base64')
  OR PrivilegeFlag = 'Potentially Privileged'
  OR TextContains(['ignore prior instructions','system:','you are now','disregard'])
  OR OCRConfidence < 0.85
  OR LanguageDetection NOT IN ApprovedLanguages
  OR (FileType IN ('audio','video') AND TranscriptExists = FALSE)
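
Where the review tool exposes these fields through an export rather than a query builder, the same logic can be approximated in a script. The sketch below is one possible reading of the filter, under stated assumptions: the Doc class and its field names are hypothetical stand-ins for whatever the platform exports, phrase matching is assumed to be case-insensitive, and the "mostly non-alphanumeric" test comes from the exclusions table rather than the filter itself. It returns the reasons for exclusion instead of a bare yes/no so the result can be logged.

```python
from dataclasses import dataclass
from typing import Optional

# Values copied from the filter above; adjust per matter.
INJECTION_PHRASES = ("ignore prior instructions", "system:", "you are now", "disregard")
SPREADSHEET_EXTS = {"xlsx", "xls", "csv"}
ENCODED_TYPES = {"source_code", "json", "xml", "base64"}
MEDIA_TYPES = {"audio", "video"}


@dataclass
class Doc:
    """Hypothetical export row; real field names depend on the review tool."""
    text: str
    page_count: int
    file_extension: str
    file_type: str
    formula_count: int = 0
    privilege_flag: str = ""
    ocr_confidence: Optional[float] = None  # None = native text, no OCR run
    language: str = "en"
    transcript_exists: bool = False


def mostly_non_alphanumeric(text):
    """True when over half of the non-whitespace characters are not letters or digits."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return True
    return sum(c.isalnum() for c in visible) * 2 < len(visible)


def exclusion_reasons(doc, approved_languages=frozenset({"en"})):
    """Return every rule the document trips; an empty list means eligible."""
    reasons = []
    lowered = doc.text.lower()  # assumption: matching is case-insensitive
    if len(doc.text) < 100:
        reasons.append("low_text_yield")
    if mostly_non_alphanumeric(doc.text):
        reasons.append("mostly_non_alphanumeric")  # from the table, not the filter
    if doc.page_count > 200:
        reasons.append("oversized")
    if doc.file_extension.lower().lstrip(".") in SPREADSHEET_EXTS and doc.formula_count > 0:
        reasons.append("spreadsheet_with_formulas")
    if doc.file_type in ENCODED_TYPES:
        reasons.append("encoded_content")
    if doc.privilege_flag == "Potentially Privileged":
        reasons.append("potentially_privileged")
    if any(phrase in lowered for phrase in INJECTION_PHRASES):
        reasons.append("instruction_shaped_text")
    if doc.ocr_confidence is not None and doc.ocr_confidence < 0.85:
        reasons.append("low_ocr_confidence")
    if doc.language not in approved_languages:
        reasons.append("unapproved_language")
    if doc.file_type in MEDIA_TYPES and not doc.transcript_exists:
        reasons.append("media_without_transcript")
    return reasons
```

A document is eligible only when the returned list is empty. Note that a broad phrase such as "disregard" will also catch ordinary business correspondence, so expect false positives and review the flagged set rather than assuming every hit is adversarial.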

Documents We Recommend For GenAI

Class	Task	Why it works
Clean-text emails (native-extracted)	Summarization, threading enrichment	Well-structured, short, model-friendly.
Word docs with extracted text ≥ 500 chars	Summarization, first-pass triage	The typical "letter, memo, brief" — well within model capabilities.
PDF-with-text (not scanned)	Summarization, Q&A with citations	Cite pages back to the doc for verification.
Deposition summaries (already digest form)	Q&A over the summary, not the transcript	Reduces token cost; humans still verify.
Foreign-language, benchmarked	Translation-then-summary	Modern models are strong on French, Spanish, German, Portuguese. Weaker on low-resource languages.

The one universal rule: Every GenAI output that a lawyer will rely on gets a spot-check by a human. Recall Mata v. Avianca (S.D.N.Y. 2023): the sanction wasn't for using AI. It was for filing what the AI produced, including citations to cases that did not exist, without checking. Context always matters.

Post-GenAI Validation

Every generative-AI pass requires validation. This is not optional — it is what makes the workflow defensible.

Validation	How to size	What "pass" looks like
Hallucination check	Sample 100 model outputs	Every claim in the output is verifiable in the source. The share of sampled outputs that meet this standard should be > 98%.
Citation accuracy	Every doc where model cited a source	Cited page/paragraph exists and says what the model said it said.
False-negative sampling	Random sample of excluded set (n from DLC-M04-001)	Model didn't wrongly classify responsive documents as non-responsive at a rate higher than the agreed threshold.
Prompt-injection test	Manual review of top-token-count docs	Model output isn't following instructions found in the document text.
Consistency check	Same 20 documents run twice	Outputs are substantively identical. Divergence indicates temperature/config drift.
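
Two of the checks above reduce to simple arithmetic once reviewers have recorded their findings, and can be sketched as follows. The function names, the input shapes, and the 0.9 similarity threshold are assumptions made for illustration. Character-level similarity from Python's difflib is only a crude proxy for "substantively identical": it narrows down which pairs a human should read, and does not replace that reading.

```python
from difflib import SequenceMatcher


def pass_rate(verified):
    """Share of sampled outputs in which a reviewer verified every claim.

    verified: list of booleans, one per sampled output, recorded by a human.
    """
    if not verified:
        raise ValueError("no sampled outputs to score")
    return sum(verified) / len(verified)


def divergent_outputs(run_a, run_b, min_similarity=0.9):
    """Flag documents whose two runs differ by more than the threshold.

    run_a, run_b: dicts mapping document ID to model output text.
    min_similarity is an illustrative cutoff, not a validated one.
    Only IDs present in both runs are compared.
    """
    flagged = []
    for doc_id in sorted(run_a.keys() & run_b.keys()):
        ratio = SequenceMatcher(None, run_a[doc_id], run_b[doc_id]).ratio()
        if ratio < min_similarity:
            flagged.append((doc_id, round(ratio, 2)))
    return flagged
```

With a sample of 100 outputs, one failure gives a pass rate of 0.99 and two failures give exactly 0.98, which does not clear a "> 98%" bar. Decide at matter start whether the threshold is strict and write it into the validation protocol.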

Disclosure & Documentation

For each matter using GenAI in a workflow that touches the production, the following must be documented and retained:

The vendor, model version, and any temperature / top-p settings used.
The exclusion rules applied (this document, plus any matter-specific additions).
The validation protocol run, and its results.
Any prompt templates used, including system prompts.
The name of the human reviewer who signed off on the GenAI output before it influenced a coding call.
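
The items above can be captured as one structured record per matter, which makes gaps easy to detect before anything is produced. The sketch below is a hypothetical schema, not a required format: the class name, field names, and the completeness check are all assumptions introduced for the example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class GenAIUsageRecord:
    """Hypothetical per-matter record mirroring the list above."""
    matter_id: str
    vendor: str
    model_version: str
    temperature: float
    top_p: float
    exclusion_rules: list      # this document plus matter-specific additions
    validation_protocol: str
    validation_results: dict
    prompt_templates: dict     # include system prompts
    signoff_reviewer: str


def missing_fields(record):
    """Names of fields left empty; numeric zero counts as a real value."""
    empty = []
    for name, value in asdict(record).items():
        if value is None or value == "" or value == [] or value == {}:
            empty.append(name)
    return empty


def to_json(record):
    """Serialize the record for retention alongside the matter file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

Running missing_fields before sign-off gives a quick completeness check. A temperature of 0.0 is treated as a recorded setting rather than a blank.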

Standing orders: A growing number of federal courts (N.D. Tex., D.D.C., E.D. Pa. among them) require disclosure of AI use in filings. Check the local rules and the assigned judge's standing orders before every filing. When in doubt, disclose.