Sampling Calculator · DLC-M04-001

DLC-M04-001 · Sampling Calculator.xlsx

Cell B9 formula: =CEILING(NORMSINV(1-(1-C)/2)^2*p*(1-p)/E^2 * N/(N+NORMSINV(1-(1-C)/2)^2*p*(1-p)/E^2-1), 1)

DLC-M04-001 · Live Calculator

Sampling Calculator

Purpose: Control set sizing · Elusion sampling

Inputs: you enter. Outputs: calculated live. Formula: two-sided normal approximation with finite-population correction.
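
The spreadsheet formula above can be mirrored outside Excel. The following is a minimal Python sketch, assuming the standard library's `statistics.NormalDist` stands in for `NORMSINV`; the function name `required_sample_size` and its argument names are illustrative and not part of the workbook.

```python
import math
from statistics import NormalDist


def required_sample_size(N: int, p: float, C: float, E: float) -> int:
    """Sample size for estimating a proportion, with finite-population correction.

    Hypothetical helper that mirrors the workbook formula term by term.
    """
    # Two-sided critical value, the equivalent of NORMSINV(1-(1-C)/2).
    z = NormalDist().inv_cdf(1 - (1 - C) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / E ** 2
    # Finite-population correction, then round up to a whole document.
    return math.ceil(n0 * N / (N + n0 - 1))


if __name__ == "__main__":
    print(required_sample_size(N=100_000, p=0.5, C=0.95, E=0.03))
```

Because the result is rounded up, it can be one document higher than tables that round to the nearest integer.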

	A	B	C	D
1	Inputs
2	Parameter	Value	Unit	Notes
3	Corpus size (N)		docs	Total documents in the population you're sampling from.
4	Expected prevalence / richness (p)		decimal	Best guess at the responsive rate. Use 0.5 for maximum conservatism.
5	Confidence level (C)		decimal	Typically 0.95 (courts have accepted 0.90 in some matters).
6	Margin of error (E)		decimal	Half-width of confidence interval. 0.03 = ±3%.
7	Outputs
8	Result	Value	Unit	Interpretation
9	Required sample size (n)	—	docs	Randomly draw exactly this many documents for your sample.
10	Expected responsive in sample	—	docs	n × p. Your reviewers should see roughly this many responsive.
11	Sample as % of corpus	—	%	Rule of thumb — anything under 5% is generally proportionate.
12	z-score (from C)	—	—	1.96 at 95%, 1.645 at 90%, 2.576 at 99%.
13	Estimated review cost (@ $2.50 / doc)	—	USD	Adjust in-house rate; contract-review typical range $1.50–$5/doc.
14	Reference · Common Sample Sizes at 95% Confidence
15	Corpus size	±5% MoE, p=0.5	±3% MoE, p=0.5	±3% MoE, p=0.1
16	10,000	370	965	370
17	50,000	382	1,045	382
18	100,000	383	1,056	383
19	500,000	384	1,065	384
20	1,000,000	384	1,066	384
21	10,000,000	385	1,067	385
22	Notes on Use
23	For control sets in TAR 1: set richness (p) close to the expected responsive rate on the seed set. Below 0.5, a lower p gives a smaller required n, but the sample then contains fewer responsive documents, so precision estimates carry higher variance.
24	For elusion sampling in TAR 2: p is your expected elusion rate (usually low — 0.01 to 0.05). E is how tight you need the ceiling on missed responsives.
25	For validation of a GenAI pass: sample the excluded set, not the included set. p = expected false-negative rate.
26	Court-tested defaults: C=0.95, E=0.03 for control sets; C=0.95, E=0.02 for high-stakes elusion sampling.
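
The Outputs rows of the sheet follow directly from n and the inputs. A minimal sketch, assuming the $2.50 per document rate shown in the sheet as the default; the function name `derived_outputs` and its dictionary keys are hypothetical.

```python
from statistics import NormalDist


def derived_outputs(n: int, N: int, p: float, C: float,
                    cost_per_doc: float = 2.50) -> dict:
    """Compute the calculator's secondary outputs from a sample size n."""
    return {
        # n x p: roughly how many responsive documents reviewers should see.
        "expected_responsive": n * p,
        # Sample as a percentage of the corpus.
        "sample_pct_of_corpus": 100 * n / N,
        # Two-sided z-score implied by the confidence level.
        "z_score": NormalDist().inv_cdf(1 - (1 - C) / 2),
        # Review cost at the assumed per-document rate.
        "estimated_review_cost_usd": n * cost_per_doc,
    }


if __name__ == "__main__":
    for key, value in derived_outputs(n=1056, N=100_000, p=0.5, C=0.95).items():
        print(f"{key}: {value:.3f}")
```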

Statistical caveats. The formula assumes simple random sampling from a finite population. If you're stratifying (e.g. sampling separately by custodian), size each stratum independently. The normal approximation degrades when np < 5 or n(1−p) < 5 — for extreme prevalence use exact binomial methods.
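
For the low-prevalence case the caveat describes, one exact binomial option is a one-sided Clopper-Pearson upper bound. The source does not name a specific exact method, so this choice is an assumption, as are the function names below. The sketch uses only the standard library and finds the bound by bisection.

```python
import math


def normal_approx_ok(n: int, p: float) -> bool:
    """Rule of thumb from the caveat: need np >= 5 and n(1-p) >= 5."""
    return n * p >= 5 and n * (1 - p) >= 5


def binom_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x) for X ~ Binomial(n, p), with each term built in log space."""
    total = 0.0
    for k in range(x + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_pmf)
    return total


def exact_upper_bound(x: int, n: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper bound on the true rate.

    x is the number of hits (e.g. missed responsive documents) found in n.
    """
    if x >= n:
        return 1.0
    alpha = 1 - confidence
    lo, hi = x / n, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        # The CDF falls as p rises; home in on the p where it equals alpha.
        if binom_cdf(x, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    print(normal_approx_ok(96, 0.01))       # np is below 5 here
    print(exact_upper_bound(x=0, n=300))    # bound when no hits are found
```

With zero hits the bound has the closed form 1 − (1 − confidence)^(1/n), which gives a quick sanity check on the bisection.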

Sampling Calculator · Printed Reference Sheet

DLC-M04-001 · Printed Reference

Sampling — Quick Reference Card

Version

Why we sample

Sampling proves — to a stated statistical confidence — that a decision made on a large corpus (which items to review, which to exclude, which to produce) is not producing systematically wrong outcomes. It replaces the alternative of reviewing everything, which is often infeasible and always expensive.

The math, in one paragraph

To estimate the responsive rate of a corpus within ±E at confidence C, draw n = (z² × p × (1−p)) / E² documents, adjusted for finite population when n is a meaningful fraction of N. z is the two-sided standard-normal critical value for C, that is, the quantile at 1 − (1 − C)/2 (1.96 at 95%). p is your prior estimate of the responsive rate; use 0.5 when you have no prior, since p × (1−p) peaks there and gives the largest n. The formula is not specific to responsiveness: it works for any binary classification, such as responsive/not, privileged/not, or correctly-coded/not.
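
Written out, the calculation reads as follows. The finite-population correction is the form used in the calculator's spreadsheet formula; the worked numbers (C = 0.95, p = 0.5, E = 0.03, N = 10,000) are an illustrative choice, not a recommendation.

```latex
\[
\begin{aligned}
n_0 &= \frac{z^2\,p\,(1-p)}{E^2}
     = \frac{1.96^2 \times 0.5 \times 0.5}{0.03^2} \approx 1067.1 \\[4pt]
n   &= \left\lceil \frac{n_0\,N}{N + n_0 - 1} \right\rceil
     = \left\lceil \frac{1067.1 \times 10{,}000}{10{,}000 + 1067.1 - 1} \right\rceil
     = \lceil 964.3 \rceil = 965
\end{aligned}
\]
```

Rounding up rather than to the nearest integer keeps the achieved margin of error from exceeding E.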

Three sampling protocols we recommend

Protocol	When	Numbers
TAR 1 Control Set	Predictive coding, static training	C=0.95, E=0.03, p=your best richness estimate. Typical n = 400–1,100.
TAR 2 Elusion (each round)	Continuous Active Learning	C=0.95, E=0.02, p=0.01–0.05. By formula n ≈ 100–460 at these p; 2,401 at the conservative p=0.5.
Production QC	Before shipping a production	C=0.95, E=0.05, p=0.5 on responsive-tag accuracy. Typical n = 385.
Privilege QC	Before shipping a priv log	C=0.99, E=0.02, p=your priv rate. Typical n = 1,000–4,000.
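
To see what the protocol parameters imply, the table's presets can be run through the uncorrected formula (no finite-population adjustment). In this sketch the dictionary `PROTOCOLS`, the function `uncorrected_n`, and the p values marked "assumed" are hypothetical placeholders, because the table leaves p to your own estimate.

```python
import math
from statistics import NormalDist

# C and E come from the protocol table; p values marked "assumed" are
# illustrative only and should be replaced with your own estimate.
PROTOCOLS = {
    "TAR 1 Control Set": {"C": 0.95, "E": 0.03, "p": 0.10},  # p assumed
    "TAR 2 Elusion":     {"C": 0.95, "E": 0.02, "p": 0.05},  # top of stated range
    "Production QC":     {"C": 0.95, "E": 0.05, "p": 0.50},
    "Privilege QC":      {"C": 0.99, "E": 0.02, "p": 0.50},  # p assumed (worst case)
}


def uncorrected_n(C: float, E: float, p: float) -> int:
    """Sample size before any finite-population correction, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - C) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / E ** 2)


for name, params in PROTOCOLS.items():
    print(f"{name}: n = {uncorrected_n(**params)}")
```

For a small corpus the finite-population correction will bring these figures down.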

What to tell the court

"We sampled [n] documents randomly drawn from a population of [N]. At [C]% confidence, the observed [responsive / elusion / privilege] rate of [x]% is within ±[E]% of the true population rate. The sample was reviewed by [role] on [dates] under the same protocol as the main population."
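
The bracketed fields can be filled from the review results. The sketch below assumes you report the margin actually achieved at the observed rate (normal approximation with finite-population correction) rather than the design value E; that choice, and the names `achieved_margin` and `court_statement`, are assumptions and not part of the reference card.

```python
import math
from statistics import NormalDist


def achieved_margin(n: int, N: int, hits: int, C: float = 0.95) -> float:
    """Half-width of the normal-approximation interval at the observed rate."""
    z = NormalDist().inv_cdf(1 - (1 - C) / 2)
    rate = hits / n
    fpc = (N - n) / (N - 1)  # finite-population correction applied to the variance
    return z * math.sqrt(rate * (1 - rate) / n * fpc)


def court_statement(n: int, N: int, hits: int, C: float,
                    kind: str, role: str, dates: str) -> str:
    """Fill the reporting template with sample results."""
    rate = hits / n
    margin = achieved_margin(n, N, hits, C)
    return (
        f"We sampled {n:,} documents randomly drawn from a population of {N:,}. "
        f"At {C:.0%} confidence, the observed {kind} rate of {rate:.1%} is within "
        f"±{margin:.1%} of the true population rate. The sample was reviewed by "
        f"{role} on {dates} under the same protocol as the main population."
    )


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    print(court_statement(1056, 100_000, 32, 0.95, "elusion",
                          "[role]", "[dates]"))
```

If the observed rate is very low, the normal approximation may not hold and an exact binomial interval is the safer figure to report.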

One rule: Sample once, review once, and report the results, good or bad. Do not keep re-sampling until the number looks better. That's not sampling; that's guessing.
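
One way to make "sample once" auditable is to fix and record a random seed before drawing, so the same draw can be reproduced later. A minimal sketch using Python's standard `random` module; the function name `draw_sample`, the document ID format, and the seed value are hypothetical.

```python
import random


def draw_sample(doc_ids, n: int, seed: int) -> list:
    """Draw n distinct documents uniformly at random, reproducibly."""
    # A dedicated generator keeps the draw independent of global random state.
    rng = random.Random(seed)
    # Sorting first means the order the IDs arrive in cannot change the draw.
    return rng.sample(sorted(doc_ids), n)


if __name__ == "__main__":
    population = [f"DOC-{i:06d}" for i in range(10_000)]  # illustrative IDs
    sample = draw_sample(population, n=370, seed=20240101)
    print(len(sample), sample[:3])
```

Record the seed, the population list, and the sample size alongside the results so the draw can be re-created on request.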