Effective Sample Size
Posted by Tom Leinster
On a scale of 0 to 10, how much does the average citizen of the Republic of Elbonia trust the president?
You’re conducting a survey to find out, and you’ve calculated that in order to get the precision you want, you’re going to need a sample of 100 statistically independent individuals. Now you have to decide how to do this.
You could stand in the central square of the capital city and survey the next 100 people who walk by. But these opinions won’t be independent: probably politics in the capital isn’t representative of politics in Elbonia as a whole.
So you consider travelling to 100 different locations in the country and asking one Elbonian at each. But apart from anything else, this is far too expensive for you to do.
Maybe a compromise would be OK. You could go to 10 locations and ask… 20 people at each? 30? How many would you need in order to match the precision of 100 independent individuals — to have an “effective sample size” of 100?
The answer turns out to be closely connected to a quantity I’ve written about many times before: magnitude. Let me explain…
The general situation is that we have a large population of individuals (in this case, Elbonians), and with each individual there is associated a real number (in this case, their level of trust in the president). So we have a probability distribution, and we’re interested in discovering some statistic $\theta$ of it (in this case, the mean, but it might instead be the median or the variance or the 90th percentile). We do this by taking some sample of individuals, and then doing something with the sampled data to produce an estimate of $\theta$.
The “something” we do with the sampled data is called an estimator. So, an estimator is a real-valued function on the set of possible sample data. For instance, if you’re trying to estimate the mean of the population, and we denote the sample data by $x_1, \ldots, x_n$, then the obvious estimator for the population mean would be just the sample mean,
$$\frac{1}{n}(x_1 + \cdots + x_n).$$
But it’s important to realize that the best estimator for a given statistic of the population (such as the mean) needn’t be that same statistic applied to the sample. For example, suppose we wish to know the mean mass of men from Mali. Unfortunately, we’ve only weighed three men from Mali, and two of them are brothers. Writing $x_1, x_2, x_3$ for the three measured masses, you could use
$$\frac{1}{3}(x_1 + x_2 + x_3)$$
as your estimator, but since body mass is somewhat genetic, that would give undue importance to one particular family. At the opposite extreme, you could use
$$\frac{1}{4}(x_1 + x_2) + \frac{1}{2} x_3$$
(where $x_3$ is the mass of the non-brother). But that would be going too far, as it gives the non-brother as much importance as the two brothers put together. Probably the best answer is somewhere in between. Exactly where in between depends on the correlation between masses of brothers, which is a quantity we might reasonably estimate from data gathered elsewhere in the world.
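Jumping ahead a little to machinery developed below, here’s a minimal numerical sketch (mine, not part of the original discussion) of finding that in-between weighting, assuming the brothers’ masses have correlation $\rho$ and the third man’s mass is uncorrelated with theirs:

```python
import numpy as np

rho = 0.5  # assumed correlation between the brothers' masses

# Correlation matrix: x1, x2 are the brothers; x3 is unrelated to them.
R = np.array([[1.0, rho, 0.0],
              [rho, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Unbiased estimators a*x1 + a*x2 + (1 - 2a)*x3: the variance is
# proportional to coeffs @ R @ coeffs, so scan over a and minimize.
a = np.linspace(0.0, 0.5, 5001)
coeffs = np.stack([a, a, 1 - 2 * a])                   # shape (3, len(a))
variance = np.einsum("ik,ij,jk->k", coeffs, R, coeffs)
print(a[variance.argmin()])  # ~0.2857, i.e. 1/(3 + rho)
```

At $\rho = 0$ the optimal weight on each brother is $1/3$ (the plain sample mean), and as $\rho \to 1$ it tends to $1/4$ (the second estimator above): perfectly correlated brothers really do count as one man.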
(There’s a deliberate echo here of something I wrote previously: in what proportions should we sow poppies, Polish wheat and Persian wheat in order to maximize biological diversity? The similarity is no coincidence.)
There are several qualities we might seek in an estimator. I’ll focus on two.
High precision The precision of an estimator is the reciprocal of its variance. To make sense of this, you have to realize that estimators are random variables too! An estimator with high precision, or low variance, is not much changed by the effects of randomness. It will give more or less the same answer if you run it multiple times.
For instance, suppose we’ve decided to do the Elbonian survey by asking 30 people in each of the 5 biggest cities and 20 people from each of 3 chosen villages, then taking some specific weighted mean of the resulting data. If that’s a high-precision estimator, it will give more or less the same final answer no matter which specific Elbonians happen to have been stopped by the pollsters.
Unbiased An estimator of some statistic is unbiased if its expected value is equal to that statistic for the population.
For example, suppose we’re trying to estimate the variance of some distribution. If our sample consists of a measly two individuals, then the variance of the sample is likely to be much less than the variance of the population. After all, with only two individuals observed, we’ve barely begun to glimpse the full variation of the population as a whole. It can actually be shown that with a sample size of two, the expected value of the sample variance is half the population variance. So the sample variance is a biased estimator of the population variance, but twice the sample variance is an unbiased estimator.
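Here’s a quick simulation sketch (my own check, not from the post) of that claim, for samples of size two drawn from a normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0  # population variance

# Draw many samples of size 2 and take the naive sample variance of
# each, i.e. dividing by n = 2 with no correction (ddof=0).
samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, 2))
naive_var = samples.var(axis=1)

print(naive_var.mean())      # ~2.0: half the population variance
print(2 * naive_var.mean())  # ~4.0: doubling makes it unbiased
```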
(Being unbiased is perhaps a less crucial property of an estimator than it might at first appear. Suppose the boss of a chain of pizza takeaways wants to know the average size of pizzas ordered. “Size” could be measured by diameter — what you order by — or area — what you eat. But since the relationship between diameter and area is quadratic rather than linear, an unbiased estimator of one will be a biased estimator of the other.)
No matter what statistic you’re trying to estimate, you can talk about the “effective sample size” of an estimator. But for simplicity, I’ll only talk about estimating the mean.
Here’s a loose definition:
The effective sample size of an estimator of the population mean is the number $n_{\text{eff}}$ with the property that our estimator has the same precision (or variance) as the estimator got by sampling $n_{\text{eff}}$ independent individuals.
Let’s unpack that.
Suppose we choose $n$ individuals at random from the population (with replacement, if you care). So we have independent, identically distributed random variables $X_1, \ldots, X_n$. As above, we take the sample mean
$$\bar{X} = \frac{1}{n}(X_1 + \cdots + X_n)$$
as our estimator of the population mean. Since variance is additive for independent random variables, the variance of this estimator is
$$\mathrm{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}(X_i) = \frac{\sigma^2}{n},$$
where $\sigma^2$ is the population variance. The precision of the estimator is, therefore, $n/\sigma^2$. That makes sense: as your sample size increases, the precision of your estimate increases too.
Now, suppose we have some other estimator $\hat{X}$ of the population mean. It’s a random variable, so it has a variance $\mathrm{Var}(\hat{X})$. The effective sample size of the estimator is the number $n_{\text{eff}}$ satisfying
$$\mathrm{Var}(\hat{X}) = \frac{\sigma^2}{n_{\text{eff}}}.$$
This doesn’t entirely make sense, as the unique number $n_{\text{eff}}$ satisfying this equation needn’t be an integer, so we can’t sensibly talk about a sample of size $n_{\text{eff}}$. Nevertheless, we can absolutely rigorously define the effective sample size of our estimator $\hat{X}$ as
$$n_{\text{eff}} = \frac{\sigma^2}{\mathrm{Var}(\hat{X})}.$$
And that’s the definition. Differently put,
$$\text{effective sample size} = \sigma^2 \times \text{precision}.$$
Trivial examples If $\hat{X}$ is the mean value of $n$ uncorrelated individuals, then the effective sample size is $n$. If $\hat{X}$ is the mean value of $n$ extremely highly correlated individuals, then the variance of the estimator is little less than the variance of a single individual, so the effective sample size is little more than $1$.
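To make the definition concrete, here’s a simulation sketch (assuming, purely for illustration, $n$ equicorrelated standard normal variables) that estimates the variance of the plain sample mean and converts it into an effective sample size:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 10, 1.0

def effective_sample_size(rho, trials=200_000):
    """Empirical effective sample size of the plain sample mean of n
    standard normals with common pairwise correlation rho."""
    R = np.full((n, n), rho) + (1 - rho) * np.eye(n)
    L = np.linalg.cholesky(R)                  # to correlate the draws
    X = rng.standard_normal((trials, n)) @ L.T
    return sigma2 / X.mean(axis=1).var()

print(effective_sample_size(0.0))   # ~10: as good as 10 independent draws
print(effective_sample_size(0.95))  # ~1.05: barely one individual's worth
```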
Now, suppose our pollsters have come back from their trips to various parts of Elbonia. Together, they’ve asked $n$ individuals how much they trust the president. We want to take that data and use it to estimate the population mean — that is, the mean level of trust in the president across Elbonia — in as precise a way as possible.
We’re going to restrict ourselves to unbiased estimators, so that the expected value of the estimator is the population mean. We’re also going to consider only linear estimators: those of the form
$$a_1 X_1 + \cdots + a_n X_n,$$
where $X_1, \ldots, X_n$ are the trust levels expressed by the $n$ Elbonians surveyed and $a_1, \ldots, a_n$ are constants.
Question:
What choice of unbiased linear estimator maximizes the effective sample size?
To answer this, we need to recall some basic statistical notions…
Correlation and covariance
Variance is a quadratic form, and covariance is the corresponding bilinear form. That is, take two random variables $X$ and $Y$, with respective means $\mu_X$ and $\mu_Y$. Then their covariance is
$$\mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mu_X)(Y - \mu_Y)\big].$$
This is bilinear in $X$ and $Y$, and $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$.
$\mathrm{Cov}(X, Y)$ is bounded above and below by $\pm \sigma_X \sigma_Y$, the product of the standard deviations. It’s natural to normalize, dividing through by $\sigma_X \sigma_Y$ to obtain a number between $-1$ and $1$. This gives the correlation coefficient
$$\rho_{X Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$
Alternatively, we can first scale $X$ and $Y$ to have variance $1$, then take the covariance, and this also gives the correlation:
$$\rho_{X Y} = \mathrm{Cov}\!\left(\frac{X}{\sigma_X}, \frac{Y}{\sigma_Y}\right).$$
Now suppose we have $n$ random variables $X_1, \ldots, X_n$. The correlation matrix is the $n \times n$ matrix $R$ whose $(i, j)$-entry is the correlation $\rho_{X_i X_j}$. Correlation matrices have some easily-proved properties:
The entries are all in $[-1, 1]$.
The diagonal entries are all $1$.
The matrix is symmetric.
The matrix is positive semidefinite. That’s because the corresponding quadratic form sends $\mathbf{a} = (a_1, \ldots, a_n)$ to $\mathrm{Var}\big(\sum_i a_i X_i / \sigma_{X_i}\big)$, and variances are nonnegative.
And actually, it’s not so hard to prove that any matrix with these properties is the correlation matrix of some sequence of random variables.
In what follows, for simplicity, I’ll quietly assume that the correlation matrices we encounter are strictly positive definite. This only amounts to assuming that no nonzero linear combination of the $X_i$ has variance zero — in other words, that there are no exact linear relationships between the random variables involved.
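As a small sanity check (my own illustration, not part of the argument), all four properties are easy to verify numerically for a correlation matrix estimated from data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlation matrix of five dependent variables, estimated from
# simulated draws (each column is one variable).
data = rng.standard_normal((10_000, 5))
data[:, 1] += 0.8 * data[:, 0]                  # introduce dependence
R = np.corrcoef(data, rowvar=False)

print(np.all(np.abs(R) <= 1))                   # entries in [-1, 1]
print(np.allclose(np.diag(R), 1))               # unit diagonal
print(np.allclose(R, R.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(R) >= -1e-12))  # positive semidefinite
```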
Back to the main question
Here’s where we got to. We surveyed $n$ individuals from our population, giving identically distributed but not necessarily independent random variables $X_1, \ldots, X_n$. Some of them will be correlated because of geographical clustering.
We’re trying to use this data to estimate the population mean in as precise a way as possible. Specifically, we’re looking for numbers $a_1, \ldots, a_n$ such that the linear estimator $\sum_i a_i X_i$ is unbiased and has the maximum possible effective sample size.
The effective sample size was defined as $\sigma^2 / \mathrm{Var}\big(\sum_i a_i X_i\big)$, where $\sigma^2$ is the variance of the distribution we’re drawing from. Now we need to work out the variance in the denominator.
Let $R$ denote the correlation matrix of $X_1, \ldots, X_n$. I said a moment ago that variance is the quadratic form corresponding to the bilinear form represented by the covariance matrix. Since each $X_i$ has variance $\sigma^2$, the covariance matrix is just $\sigma^2$ times the correlation matrix $R$. Hence
$$\mathrm{Var}\Big(\sum_i a_i X_i\Big) = \sigma^2\, \mathbf{a}^T R \mathbf{a},$$
where $(-)^T$ denotes a transpose and $\mathbf{a} = (a_1, \ldots, a_n)^T$.
So, the effective sample size of our estimator $\sum_i a_i X_i$ is
$$\frac{\sigma^2}{\sigma^2\, \mathbf{a}^T R \mathbf{a}} = \frac{1}{\mathbf{a}^T R \mathbf{a}}.$$
We also wanted our estimator to be unbiased. Its expected value is
$$\mathbb{E}\Big(\sum_i a_i X_i\Big) = \sum_i a_i \mu = \Big(\sum_i a_i\Big) \mu,$$
where $\mu$ is the population mean. So, unbiasedness means exactly that $\sum_i a_i = 1$.
Putting this together, the maximum possible effective sample size among all unbiased linear estimators is
$$\sup\Big\{ \frac{1}{\mathbf{a}^T R \mathbf{a}} \ :\ \mathbf{a} \in \mathbb{R}^n,\ \sum_i a_i = 1 \Big\}.$$
Which $\mathbf{a}$ achieves this maximum, and what is the maximum possible effective sample size? That’s easy, and in fact it’s something that’s appeared many times on this blog before…
The magnitude of a matrix
The magnitude $|M|$ of an invertible $n \times n$ matrix $M$ is the sum of all $n^2$ entries of $M^{-1}$. To calculate it, you don’t need to go as far as inverting $M$. It’s much easier to find the unique column vector $\mathbf{w}$ satisfying
$$M \mathbf{w} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$$
(the weighting of $M$), then calculate $|M| = \sum_i w_i$. This sum is the magnitude of $M$, since $w_i$ is the $i$th row-sum of $M^{-1}$.
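In code, the recipe is one line of linear algebra (a minimal sketch using numpy):

```python
import numpy as np

def magnitude(M):
    """Magnitude of an invertible matrix M: the sum of the entries of
    the weighting w, the solution of M @ w = (1, ..., 1)."""
    w = np.linalg.solve(M, np.ones(len(M)))
    return w.sum()

# Agrees with summing all the entries of the inverse directly:
M = np.array([[1.0, 0.5],
              [0.5, 1.0]])
print(magnitude(M))            # 1.333... = 4/3
print(np.linalg.inv(M).sum())  # same
```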
Most of what I’ve written about magnitude has been in the situation where we start with a finite metric space $A = \{a_1, \ldots, a_n\}$, and we use the matrix $Z$ with entries $Z_{i j} = e^{-d(a_i, a_j)}$. This turns out to give interesting information about $A$. In the metric situation, the entries of the matrix are between $0$ and $1$. Often $Z$ is positive definite (e.g. when $A$ is a subset of Euclidean space), as correlation matrices are.
When $M$ is positive definite, there’s a third way to describe the magnitude:
$$|M| = \sup_{\mathbf{a} \neq 0} \frac{\big(\sum_i a_i\big)^2}{\mathbf{a}^T M \mathbf{a}}.$$
The supremum is attained just when $\mathbf{a}$ is a scalar multiple of the weighting $\mathbf{w}$, and the proof is a simple application of the Cauchy–Schwarz inequality.
But that supremum is exactly the expression we had for the maximum effective sample size: the ratio is unchanged if we rescale $\mathbf{a}$, so we may as well normalize so that $\sum_i a_i = 1$, and then it reduces to $1/(\mathbf{a}^T R \mathbf{a})$. So:
The maximum possible value of $1/(\mathbf{a}^T R \mathbf{a})$, over all $\mathbf{a}$ with $\sum_i a_i = 1$, is the magnitude $|R|$.
Or more wordily:
The maximum effective sample size of an unbiased linear estimator of the mean is the magnitude of the sample correlation matrix.
Or wordily but approximately:
Effective sample size = magnitude of correlation matrix.
Moreover, we know how to attain that maximum. It’s attained if and only if our estimator is
$$\frac{1}{\sum_i w_i} \sum_i w_i X_i,$$
where $\mathbf{w}$ is the weighting of the correlation matrix $R$.
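Putting the pieces together in code (a sketch of my own, with a made-up correlation matrix): the optimal coefficients are $a_i = w_i / \sum_j w_j$, and the resulting effective sample size is the magnitude $|R|$.

```python
import numpy as np

def best_linear_estimator(R):
    """Coefficients of the maximum-precision unbiased linear estimator
    for variables with correlation matrix R, together with its
    effective sample size (the magnitude of R)."""
    w = np.linalg.solve(R, np.ones(len(R)))  # the weighting of R
    return w / w.sum(), w.sum()

# Say individuals 1 and 2 were surveyed in the same city and
# individual 3 in a distant village (the correlations are assumptions):
R = np.array([[1.0, 0.6, 0.1],
              [0.6, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
a, n_eff = best_linear_estimator(R)
print(a)      # down-weights the two correlated individuals
print(n_eff)  # ~2.03, and a @ R @ a == 1/n_eff
```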
I’m not too sure where this “result” — observation, really — comes from. I learned it from the statistician Paul Blackwell at Sheffield, who, like me, had been reading this paper:
Andrew Solow and Stephen Polasky, Measuring biological diversity. Environmental and Ecological Statistics 1 (1994), 95–103.
In turn, Solow and Polasky refer to this:
Morris Eaton, A group action on covariances with applications to the comparison of linear normal experiments. In: Moshe Shaked and Y.L. Tong (eds.), Stochastic inequalities: Papers from the AMS-IMS-SIAM Joint Summer Research Conference held in Seattle, Washington, July 1991, Institute of Mathematical Statistics Lecture Notes — Monograph Series, Volume 22, 1992.
But the result is so simple that I’d imagine it’s much older. I’ve been wondering whether it’s essentially the Gauss-Markov theorem; I thought it was, then I thought it wasn’t. Does anyone know?
The surprising behaviour of effective sample size
You might expect the effective size of a sample of $n$ individuals to be at most $n$. It’s not.
You might expect the effective sample size to go down as the correlations within the sample go up. It doesn’t.
This behaviour appears in even the simplest nontrivial example:
Example Suppose our sample consists of just two individuals. Call the sampled values $X_1$ and $X_2$, and write the correlation matrix as
$$R = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$
Then the maximum-precision unbiased linear estimator is $\frac{1}{2}(X_1 + X_2)$, and its effective sample size is
$$|R| = \frac{2}{1 + \rho}.$$
As the correlation $\rho$ between the two variables increases from $0$ to $1$, the effective sample size decreases from $2$ to $1$, as you’d expect.
But when $\rho < 0$, the effective sample size is greater than 2. In fact, as $\rho \to -1$, the effective sample size tends to $\infty$. That’s intuitively plausible. For if $\rho$ is close to $-1$ then, writing $\mu$ for the population mean and $Y_i = X_i - \mu$, we have $Y_2 \approx -Y_1$, and so $\frac{1}{2}(X_1 + X_2) = \mu + \frac{1}{2}(Y_1 + Y_2)$ is a very good estimator of $\mu$. In the extreme, when $\rho = -1$, it’s an exact estimator of $\mu$ — it’s infinitely precise.
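Checking the formula with the `magnitude` function sketched above:

```python
import numpy as np

def magnitude(M):
    w = np.linalg.solve(M, np.ones(len(M)))
    return w.sum()

for rho in [0.0, 0.5, -0.5, -0.9]:
    R = np.array([[1.0, rho],
                  [rho, 1.0]])
    print(rho, magnitude(R), 2 / (1 + rho))  # last two columns agree
```

For $\rho = -0.9$ this gives an effective sample size of $20$ from just two individuals.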
The fact that the effective sample size can be greater than the actual sample size seems to be very well known. For instance, there’s a whole page about it in the documentation for Q, which is apparently “analysis software for market research”.
What’s interesting is that this doesn’t only occur when some of the variables are negatively correlated. It can also happen when all the correlations are nonnegative, as in the following example from the paper by Eaton cited above.
Example Consider the correlation matrix
$$R = \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & \rho \\ 0 & \rho & 1 \end{pmatrix},$$
where $0 \leq \rho < 1/\sqrt{2}$. This is positive definite, so it’s the correlation matrix of some random variables $X_1, X_2, X_3$.
A routine computation shows that
$$|R| = \frac{3 - 4\rho}{1 - 2\rho^2}.$$
As we’ve shown, this is the greatest possible effective sample size you can achieve by taking an unbiased linear combination of $X_1$, $X_2$ and $X_3$.
When $\rho = 0$, it’s $3$, as you’d expect: the variables are uncorrelated. As $\rho$ increases, $|R|$ decreases, again as you’d expect: more correlation between the variables leads to a smaller effective sample size. This behaviour continues until $\rho = 1/2$, where $|R| = 2$.
But then something strange happens. As $\rho$ increases from $1/2$ towards $1/\sqrt{2}$, the effective sample size increases from $2$ to $\infty$. Increasing the correlation increases the effective sample size. For instance, when $\rho = 0.7$, we have $|R| = 10$: the maximum-precision estimator is as precise as if we’d chosen $10$ independent individuals! For that value of $\rho$, the maximum-precision estimator turns out to be
$$\tfrac{3}{2} X_1 - 2 X_2 + \tfrac{3}{2} X_3.$$
Go figure!
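Here’s a sketch reproducing those numbers with the `best_linear_estimator` function from above:

```python
import numpy as np

def best_linear_estimator(R):
    w = np.linalg.solve(R, np.ones(len(R)))  # the weighting of R
    return w / w.sum(), w.sum()

def eaton(rho):
    return np.array([[1.0, rho, 0.0],
                     [rho, 1.0, rho],
                     [0.0, rho, 1.0]])

for rho in [0.0, 0.5, 0.7]:
    a, n_eff = best_linear_estimator(eaton(rho))
    print(rho, round(n_eff, 4), a.round(4))
# rho = 0.0: n_eff = 3,  a = ( 1/3,  1/3, 1/3)
# rho = 0.5: n_eff = 2,  a = ( 0.5,  0.0, 0.5)
# rho = 0.7: n_eff = 10, a = ( 1.5, -2.0, 1.5)
```

Note the case $\rho = 1/2$: there the optimal estimator ignores $X_2$ entirely.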
This is very like the fact that a metric space with $n$ points can have magnitude (“effective number of points”) greater than $n$, even if the associated matrix $Z$ is positive definite.
These examples may seem counterintuitive, but Eaton cautions us to beware of our feeble intuitions:
These examples show that our rather vague intuitive feeling that “positive correlation tends to decrease information content in an experiment” is very far from the truth, even for rather simple normal experiments with three observations.
Anyone with any statistical knowledge who’s still reading will easily have picked up on the fact that I’m a total amateur. If that’s you, I’d love to hear your comments!
Re: Effective Sample Size
I am also an amateur at statistics. However, on the question of how $n$ positively correlated samples can have an effective sample size greater than $n$, I wonder how you can know what the true correlation matrix of your samples is. Presumably that knowledge is what somehow gets you the extra power of your experiment.