The core question
Every analytic, from near-dupe detection to GenAI Q&A, has a cost. That cost is worth paying only when the review time it eliminates is worth more than the analytic costs to run. This reference gives you shortcuts for answering that question at the SOW stage, before you commit.
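The test above can be written as a one-line comparison. The following is a minimal sketch, not a pricing model: the function names, the review rate (50 documents/hour), the reviewer rate ($60/hour), and the document counts are all hypothetical inputs chosen for illustration.

```python
def review_savings(docs_removed: int, docs_per_hour: float, hourly_rate: float) -> float:
    """Dollar value of the review time an analytic eliminates."""
    hours_saved = docs_removed / docs_per_hour
    return hours_saved * hourly_rate


def is_worth_running(analytic_cost: float, docs_removed: int,
                     docs_per_hour: float, hourly_rate: float) -> bool:
    """True only when the eliminated review cost exceeds the cost to run."""
    return review_savings(docs_removed, docs_per_hour, hourly_rate) > analytic_cost


# Hypothetical inputs: the analytic removes 20,000 documents from review,
# reviewers average 50 documents/hour at $60/hour.
savings = review_savings(20_000, 50, 60.0)            # 400 hours * $60 = 24000.0
print(is_worth_running(5_000.0, 20_000, 50, 60.0))    # True: 24000 > 5000
print(is_worth_running(30_000.0, 20_000, 50, 60.0))   # False: 24000 < 30000
```

Swap in your own review rate and blended reviewer cost; the comparison itself does not change.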
Break-even matrix
| Analytic | Min corpus | Sweet spot | Best when |
| --- | --- | --- | --- |
| Near-duplicate | Any | Any | Always. It's cheap and it works. |
| Email threading (inclusive) | > 5k emails | 50k–2M emails | Volume is email-heavy; threading is standard practice. |
| Concept clustering | ~30k | 75k–500k | Corpus is text-rich and you don't know the vocabulary yet. |
| Communication analytics | ~10k emails/msgs | 50k+ comms | Custodian identification matters; social-graph analysis needed. |
| Entity extraction / PII | Any | Any regulated-data matter | You need to find named entities or PII at scale. |
| TAR 2 · CAL | ~30k | 75k–2M | Corpus is large, richness is moderate, reviewer hours are the constraint. |
| GenAI summarization | Any | 10k–200k | Documents are long-form (contracts, reports); reviewer time on first-read is the bottleneck. |
| GenAI Q&A | Any | Not size-driven — task-driven | Investigation phase, deposition prep, or issue-focused fact-finding. |
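As a rough first-pass screen, the "Min corpus" column can be encoded as a lookup. This sketch makes simplifying assumptions: approximate values ("~30k", "> 5k") are flattened into hard floors, "Any" becomes 0, and emails, messages, and documents are all counted as one `corpus_size` figure. The `MIN_CORPUS` and `candidates` names are hypothetical, and the "Sweet spot" and "Best when" columns still need human judgment.

```python
# Minimum corpus sizes taken from the matrix above.
# Assumption: "~" and ">" values are treated as simple floors; "Any" is 0.
MIN_CORPUS = {
    "Near-duplicate": 0,
    "Email threading (inclusive)": 5_000,
    "Concept clustering": 30_000,
    "Communication analytics": 10_000,
    "Entity extraction / PII": 0,
    "TAR 2 · CAL": 30_000,
    "GenAI summarization": 0,
    "GenAI Q&A": 0,
}


def candidates(corpus_size: int) -> list[str]:
    """Return the analytics whose minimum corpus size is met."""
    return [name for name, floor in MIN_CORPUS.items() if corpus_size >= floor]


# An 8,000-item corpus clears the floor for threading but not for
# clustering, communication analytics, or TAR.
for name in candidates(8_000):
    print(name)
```

Passing the size screen only means an analytic is not ruled out; whether it pays off still depends on the cost comparison.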
What vendors tend to charge for
- Fixed setup / index build — often one-time per matter.
- Per-document processed — the recurring cost that scales with corpus.
- Per-query or per-search — some AI-enabled search tools charge for each query you run.
- Per-model-call — GenAI tools price per LLM API call, sometimes with token pass-through.
- Regex vs. AI-search premiums — many platforms charge more for AI-enabled searches than for standard Boolean or regex searches; ask before you run them.
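The charge types listed above can be rolled up into a single estimate when comparing quotes. This is a sketch under stated assumptions: the `Quote` class, its field names, and every rate in the example are hypothetical, and real vendor quotes may bundle or tier these charges differently.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    """One vendor quote, split into the charge types listed above (all hypothetical)."""
    setup_fee: float = 0.0        # fixed setup / index build, one-time per matter
    per_document: float = 0.0     # recurring cost that scales with corpus size
    per_query: float = 0.0        # per-query or per-search charge
    per_model_call: float = 0.0   # per LLM API call for GenAI tools

    def total(self, documents: int, queries: int = 0, model_calls: int = 0) -> float:
        """Estimated total cost for the given usage volumes."""
        return (self.setup_fee
                + self.per_document * documents
                + self.per_query * queries
                + self.per_model_call * model_calls)


# Hypothetical quote: $2,500 setup, $0.02 per document, $0.01 per model call.
quote = Quote(setup_fee=2_500.0, per_document=0.02, per_model_call=0.01)
estimate = quote.total(documents=100_000, model_calls=100_000)
print(f"Estimated cost: ${estimate:,.2f}")
```

Running the same usage volumes through each vendor's `Quote` puts list prices on a comparable footing before negotiation.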
The pricing trap
Vendors quote list price. On matters of any size, negotiate. On matters of significant size, do not accept the first quote — competitive pricing exists. Fair is fair.