If you’ve ever run a z-test comparing subgroups on weighted survey data without adjusting your standard errors, you’ve been getting the wrong answer. Not approximately wrong. Systematically wrong, in a direction that makes you find too many significant differences. Here’s how to fix it.
The problem in thirty seconds
A UK government department commissions a survey of 2,000 adults. The raw sample under-represents young men and over-represents older women, so rim weights (also called raking weights) are applied. Each respondent gets a weight reflecting how many people in the population they “represent.”
The analyst then compares the proportion who agree with a policy between two regions. They compute weighted proportions — 62% in the North vs 55% in the South — and run a standard z-test to check if the 7-percentage-point gap is statistically significant. The z-test returns p = 0.03. They report it as significant at the 95% level.
But the z-test formula they used — the one in every introductory textbook — assumes every respondent counts equally. It doesn’t know that the weights have inflated some respondents’ influence and reduced others’. The actual uncertainty around those estimates is larger than the formula thinks. If they’d computed the standard errors correctly, they’d have found p = 0.09. Not significant. A different conclusion entirely.
This isn’t a rare edge case. It happens on every weighted survey, on every significance test, every time the standard errors aren’t adjusted. The question is only how badly it distorts the results.
Why weighting inflates uncertainty
To understand why, think about what a weight of 3.0 means. It means this respondent is “standing in” for three people in the population. But they’re still only one person, with one set of views. You’ve tripled their influence on the estimate without tripling the actual information in the sample. That mismatch — more influence, same information — is where the extra uncertainty comes from.
The effect cuts both ways. A respondent with a weight of 0.5 contributes less to the estimate than an unweighted respondent would, but they still occupy a “slot” in the sample. You can’t unbuy their interview.
The net result is that variable weights always reduce the effective information in your sample. This is measured by the design effect due to weighting (sometimes called DEFF or d²):
DEFF = 1 + CV²(w)
where CV²(w) is the squared coefficient of variation of the weights — the variance of the weights divided by the square of the mean weight. If all weights are identical (i.e., no weighting), the CV is zero and DEFF = 1: no information loss. The more variable the weights, the larger the DEFF.
The effective sample size is then:
neff = n / DEFF = (Σwi)² / Σwi²
This is Kish’s formula. For a typical UK social survey with rim weighting, a DEFF of 1.2 to 1.8 is common. That means a sample of 2,000 might have an effective sample size of only 1,100 to 1,700. Your standard errors should be 10–34% wider than the textbook formula produces — wider by the square root of the DEFF. And because the standard error sits in the denominator of every test statistic, each test statistic shrinks by that same factor. Differences that looked significant aren’t.
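Both quantities are a single pass over the weight vector. A minimal Python sketch (the function names are mine, not from any package):

```python
def kish_deff(weights):
    """Design effect due to weighting: 1 + CV^2 of the weights."""
    n = len(weights)
    mean_w = sum(weights) / n
    var_w = sum((w - mean_w) ** 2 for w in weights) / n  # population variance
    return 1 + var_w / mean_w ** 2

def kish_neff(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)
```

The two agree by construction: `kish_deff(w)` equals `len(w) / kish_neff(w)`, so a sample of 2,000 with a weight CV of 0.5 has a DEFF of 1.25 and an effective sample size of 1,600.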
Why Kish’s formula isn’t enough
If DEFF gives you the effective sample size, why not just plug that into the standard formula and call it a day? Many analysts do exactly this. It’s the most popular “hack” in the industry. It doesn’t work, for two reasons.
First, the design effect varies by variable. A variable that is correlated with the weighting dimensions (say, employment status, if age and sex are the weighting targets) will have a larger design effect than a variable that’s orthogonal to them (say, favourite colour). Kish’s formula gives you a single global DEFF based on the weights alone. It knows nothing about the variable you’re analysing. Using it for every test applies the same correction everywhere, which is too conservative for some variables and too liberal for others.
Second, and more fundamentally, Kish’s formula only captures the weighting component of the design effect. If your survey also has clustering (multi-stage sampling) or stratification, those affect the variance too. For many UK government surveys — telephone omnibus surveys, simple online panels with quota sampling — this isn’t a concern because there’s no explicit clustering. But it’s a limitation you should be aware of. What you actually need is a method that computes the correct standard error for each specific statistic, accounting for the actual weights on the actual data. That method is Taylor series linearization.
Taylor series linearization: the intuition
Here’s the core insight, without any formulas.
A weighted proportion is a ratio of two uncertain quantities: the weighted count of “yes” responses divided by the total sum of weights. Both the numerator and the denominator are random variables — they’d come out differently if you drew a different sample. The standard formula for the variance of a proportion, p(1−p)/n, pretends the denominator is fixed. It isn’t.
Taylor series linearization handles this by converting the ratio problem into a simpler problem. It assigns each respondent a “pseudo-value” that captures how much they contribute to the uncertainty in the ratio estimate. Once you have these pseudo-values, computing the variance is just a matter of calculating the weighted variance of a single variable — which is straightforward.
The name “Taylor series” comes from the mathematical technique used to derive those pseudo-values: a first-order Taylor expansion of the ratio around the population values. But the resulting computation is simple arithmetic. You don’t need calculus to implement it. You need a for-loop and some subtractions.
The mathematics, step by step
Step 1: Express the statistic as a ratio of weighted totals
A weighted proportion is:
p̂ = Ŷ / X̂ = (Σ wi · yi) / (Σ wi)
where yi is a 0/1 indicator for the category of interest, and wi is the weight for respondent i. The numerator Ŷ is the weighted count of “yes” responses. The denominator X̂ is the sum of all weights. A weighted mean of a continuous variable has the same form — replace yi with any variable. Every proportion is a mean of a binary variable. Every mean is a ratio estimator.
Step 2: The Taylor expansion
For any ratio R̂ = Ŷ/X̂, we want Var(R̂). Apply a first-order Taylor expansion around the population values (Y, X):
R̂ − R ≈ (1/X) · [(Ŷ − Y) − R · (X̂ − X)]
This is just the multivariable chain rule applied to f(a,b) = a/b, evaluated at (Y, X) and perturbed by (Ŷ − Y, X̂ − X). Taking the variance of both sides:
Var(R̂) ≈ (1/X²) · [Var(Ŷ) + R² · Var(X̂) − 2R · Cov(Ŷ, X̂)]
Step 3: Construct the linearized variable
Define, for each respondent i:
zi = (yi − R̂ · xi) / X̂
For a proportion, xi = 1 for every respondent (each contributes 1 to the denominator), so this simplifies to:
zi = (yi − p̂) / (Σ wi)
The zi is a “pseudo-value” or “linearized value.” It’s the first-order influence of respondent i on the estimate. If respondent i said “yes” (yi = 1), their pseudo-value is positive and proportional to (1 − p̂). If they said “no” (yi = 0), it’s negative and proportional to p̂. Respondents in the minority exert more influence on the estimate, which is intuitively right.
Now the key result:
Var(R̂) ≈ Var(Σ wi · zi)
We’ve converted the variance of a nonlinear ratio into the variance of a linear weighted total.
Step 4: Compute the variance of the weighted total
For a design with no stratification and no clustering — which is the case for most UK social surveys using rim weights from online panels — each respondent is their own “primary sampling unit.” The variance formula is:
V̂(p̂) = [n / (n − 1)] · Σi (wizi − w̄z)²
where w̄z = (1/n) · Σ wizi is the mean of the wizi products, and n is the actual (unweighted) number of respondents. This is what R’s survey package computes when you call svymean() on a binary variable with svydesign(id = ~1, weights = ~wt).
The standard error is the square root of this variance: SE(p̂) = √V̂(p̂)
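Steps 1 through 4 translate directly into code. A minimal Python sketch under the same assumptions (single-stage design, no strata or clusters; the function name is illustrative):

```python
import math

def linearized_se(y, w):
    """Weighted proportion and its Taylor-linearized standard error."""
    n, sw = len(y), sum(w)
    p = sum(wi * yi for wi, yi in zip(w, y)) / sw       # weighted proportion
    wz = [wi * (yi - p) / sw for wi, yi in zip(w, y)]   # w_i times pseudo-value z_i
    m = sum(wz) / n                                     # mean of the w_i z_i terms
    var = n / (n - 1) * sum((v - m) ** 2 for v in wz)
    return p, math.sqrt(var)
```

The mean m is always zero for this estimator — the pseudo-values balance by construction — but it is kept so the code mirrors the formula in Step 4 line by line.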
Step 5: Build the test statistic
To test whether the proportion differs between two independent subgroups (A and B), compute the linearized SE for each subgroup separately using only that subgroup’s data and weights, then:
t = (p̂A − p̂B) / √[V̂(p̂A) + V̂(p̂B)]
Because the subgroups contain different respondents, there’s no covariance term — the variances simply add. Compare this t-statistic to a t-distribution with degrees of freedom = nA + nB − 2.
A worked example with real numbers
Let’s make this concrete with a small example you can verify by hand. Suppose you survey 10 people about whether they support a policy. Five are in the North, five are in the South. Rim weighting has been applied:
| Respondent | Region | Support (y) | Weight (w) |
|---|---|---|---|
| 1 | North | 1 | 1.5 |
| 2 | North | 1 | 0.8 |
| 3 | North | 0 | 1.2 |
| 4 | North | 1 | 0.7 |
| 5 | North | 0 | 1.8 |
| 6 | South | 0 | 1.3 |
| 7 | South | 1 | 0.6 |
| 8 | South | 0 | 1.1 |
| 9 | South | 0 | 1.4 |
| 10 | South | 0 | 0.9 |
Naive approach (wrong)
The naive analyst computes weighted proportions and plugs them into the standard formula.
North: p̂N = (1.5 + 0.8 + 0.7) / (1.5 + 0.8 + 1.2 + 0.7 + 1.8) = 3.0 / 6.0 = 0.500
South: p̂S = 0.6 / (1.3 + 0.6 + 1.1 + 1.4 + 0.9) = 0.6 / 5.3 = 0.1132
SEnaive = √[(0.500 × 0.500)/5 + (0.1132 × 0.8868)/5] = 0.2647
znaive = (0.500 − 0.1132) / 0.2647 = 1.461
p-value ≈ 0.144
Linearized approach (correct)
North (n=5, Σw = 6.0, p̂ = 0.500): compute zi = (yi − 0.5)/6.0
| i | yi | wi | zi | wizi |
|---|---|---|---|---|
| 1 | 1 | 1.5 | 0.08333 | 0.12500 |
| 2 | 1 | 0.8 | 0.08333 | 0.06667 |
| 3 | 0 | 1.2 | −0.08333 | −0.10000 |
| 4 | 1 | 0.7 | 0.08333 | 0.05833 |
| 5 | 0 | 1.8 | −0.08333 | −0.15000 |
The wizi column sums to exactly zero, as it always does for a proportion, so the mean term in the variance formula drops out and the variance is just the scaled sum of squares:
V̂(p̂N) = (5/4) × (0.12500² + 0.06667² + 0.10000² + 0.05833² + 0.15000²) = 1.25 × 0.05597 = 0.06997
SE(p̂N) = √0.06997 = 0.2645
South (n=5, Σw = 5.3, p̂ = 0.1132): compute zi = (yi − 0.1132)/5.3
| i | yi | wi | zi | wizi |
|---|---|---|---|---|
| 6 | 0 | 1.3 | −0.02136 | −0.02777 |
| 7 | 1 | 0.6 | 0.16732 | 0.10039 |
| 8 | 0 | 1.1 | −0.02136 | −0.02350 |
| 9 | 0 | 1.4 | −0.02136 | −0.02991 |
| 10 | 0 | 0.9 | −0.02136 | −0.01922 |
V̂(p̂S) = 1.25 × 0.01266 = 0.01583 → SE(p̂S) = 0.1258
SEdiff = √(0.06997 + 0.01583) = 0.2929
t = (0.500 − 0.1132) / 0.2929 = 1.320 (df = 8)
p-value = 0.223 (two-tailed)
Comparing the results
| Approach | SE of difference | Test statistic | p-value |
|---|---|---|---|
| Naive (wrong) | 0.2647 | 1.461 | 0.144 |
| TSL (correct) | 0.2929 | 1.320 | 0.223 |
The linearized standard error is about 11% larger than the naive one. The p-value shifts from 0.144 to 0.223. In larger datasets where the naive p-value is just below 0.05, this correction routinely pushes it above.
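The whole table can be reproduced with a short standalone script. A Python sketch (every name here is mine; the p-values in the text additionally need a normal and a t CDF, so only the standard errors and test statistics are computed):

```python
import math

north_y = [1, 1, 0, 1, 0]
north_w = [1.5, 0.8, 1.2, 0.7, 1.8]
south_y = [0, 1, 0, 0, 0]
south_w = [1.3, 0.6, 1.1, 1.4, 0.9]

def weighted_prop(y, w):
    """Weighted proportion: weighted yes-count over total weight."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def tsl_var(y, w):
    """Linearized variance of a weighted proportion (no strata or clusters)."""
    n, sw = len(y), sum(w)
    p = weighted_prop(y, w)
    wz = [wi * (yi - p) / sw for wi, yi in zip(w, y)]
    m = sum(wz) / n
    return n / (n - 1) * sum((v - m) ** 2 for v in wz)

p_n = weighted_prop(north_y, north_w)   # 0.500
p_s = weighted_prop(south_y, south_w)   # ≈ 0.1132

# Naive: p(1 - p)/n with the unweighted group sizes
se_naive = math.sqrt(p_n * (1 - p_n) / 5 + p_s * (1 - p_s) / 5)
z_naive = (p_n - p_s) / se_naive        # ≈ 1.461

# Linearized: variances add across independent subgroups
se_tsl = math.sqrt(tsl_var(north_y, north_w) + tsl_var(south_y, south_w))
t_tsl = (p_n - p_s) / se_tsl            # ≈ 1.320
```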
What happens at scale
The toy example above uses only 10 respondents to keep the arithmetic visible. The effect becomes more consequential with real survey data.
Consider a DWP survey of 4,000 claimants, rim-weighted by age, sex, benefit type, and region. The weights have a coefficient of variation of 0.55. That gives a global DEFF of about 1.30 and an effective sample size of roughly 3,080.
Compare claimants aged 18–34 (n = 600, heavily upweighted) versus 55+ (n = 1,200, slightly downweighted). The young group’s weights have a much higher CV — say 0.70 vs 0.35. Their subgroup-specific DEFFs are about 1.49 and 1.12, giving effective sample sizes of roughly 403 and 1,071.
If observed weighted proportions are 48% (young) vs 41% (old), the naive z-test gives z = 2.817, p = 0.005. The linearized version gives t ≈ 2.407, p ≈ 0.016. Still significant at 5%, but the p-value has more than tripled.
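The arithmetic behind those figures is the effective-sample-size approximation from earlier, applied per subgroup. A back-of-envelope Python sketch (the figures are the illustrative ones above, not real data; the exact linearized answer would need the respondent-level records):

```python
import math

def neff(n, cv):
    """Kish effective sample size from a subgroup's weight CV."""
    return n / (1 + cv ** 2)

# Illustrative subgroup figures from the example above
n_a, cv_a, p_a = 600, 0.70, 0.48    # claimants aged 18-34
n_b, cv_b, p_b = 1200, 0.35, 0.41   # claimants aged 55+

se_naive = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
se_adj = math.sqrt(p_a * (1 - p_a) / neff(n_a, cv_a)
                   + p_b * (1 - p_b) / neff(n_b, cv_b))

print((p_a - p_b) / se_naive)  # ≈ 2.82
print((p_a - p_b) / se_adj)    # ≈ 2.41
```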
For comparisons between smaller subgroups — say region by ethnicity cells with n = 150 each and heavy weighting — the correction can easily flip a result from significant to non-significant. This is the scenario that matters most in government research, where subgroup comparisons drive policy recommendations.
The Rao-Scott extension for chi-squared tests
The approach above works for pairwise comparisons between subgroups. But survey reports also commonly present cross-tabulations with an overall test of association — typically a chi-squared test. The chi-squared test has its own weighting problem, and its own solution.
The standard Pearson chi-squared statistic, computed on weighted cell proportions, follows a distribution that is not chi-squared under complex sampling. The first-order Rao-Scott correction divides the Pearson statistic by the average eigenvalue of the design-effect matrix:
X²RS1 = X² / δ̄
The second-order Rao-Scott correction — the default in R’s survey package — goes further, using a Satterthwaite-type F-approximation that matches both the mean and variance of the reference distribution. In R:
design <- svydesign(id = ~1, weights = ~wt, data = mydata)
svychisq(~row_var + col_var, design = design, statistic = "F")
The hacks, and why they fail
If you’ve worked in survey research for any length of time, you’ve encountered — or used — one of these workarounds. Each is an attempt to correct for the weighting effect without actually computing the right standard errors. Each fails in specific, predictable ways.
Hack 1: Use the unweighted sample size
Works when weighting only affects the composition of the groups being compared. Breaks when the weights change the proportions themselves — which is almost always the case when weighting on multiple dimensions.
Hack 2: Test on the unweighted data
Gives valid unweighted estimates with valid standard errors — but the estimates don’t reflect your target population. If a weight corrects for over-representation of one response category, the unweighted test may show a significant difference that disappears once the population correction is applied.
Hack 3: Scale weights to average 1
Ensures the test sees the correct sample size in aggregate, but still ignores the variable-specific design effect. Off by 5–15% on standard errors with moderate rim weights; 30%+ with extreme weights (common in longitudinal studies with attrition adjustments).
Hack 4: Use the effective sample size
The most sophisticated hack and genuinely useful for back-of-envelope calculations. Gets within 5–10% of the right answer for most UK social research. Its limitation: it applies the same correction to every variable, when the true design effect is variable-specific.
What software gets this right
R’s survey package
The reference implementation. Twenty years of development, validated against SAS, Stata, and SUDAAN, used by the US Census Bureau. Implements Taylor linearization as the default.
design <- svydesign(id = ~1, weights = ~weight, data = df)
svyttest(outcome ~ group, design = design)
svychisq(~row + col, design = design)
Stata & SPSS Complex Samples
Stata handles this natively with svy: prefix commands. SPSS Complex Samples is a paid add-on that does the job.
Python
One viable option: samplics, which implements Taylor linearization and Rao-Scott corrections. The more popular statsmodels treats weights as frequency weights, not sampling weights, and will give you wrong answers.
JavaScript
Has nothing. No npm package, no library, no implementation of any kind. If you’re building web dashboards for government survey results — increasingly common — you’re either calling R from the backend, precomputing everything, or getting the wrong answer.
Where this leaves us
The mathematics of Taylor series linearization aren’t complicated once you break them down. The key steps are: express your statistic as a ratio of weighted totals, compute linearized pseudo-values for each respondent, calculate the weighted variance of those pseudo-values, and take the square root to get your standard error.
For UK government social research, the practical implications are clear. If your agency is running significance tests on weighted data without linearization (or an equivalent correction), you are systematically overestimating the precision of your estimates and reporting too many significant differences.
R does it. Python nearly does. JavaScript doesn’t at all. The gap is real, and it matters — particularly as government research moves toward interactive digital outputs where the statistical engine runs in the browser. Closing that gap is a tractable engineering project built on well-understood mathematics. The formulas in this article are everything you need.