How big a sample?

March 23, 2009 By Alan J. Salzberg
Suppose we want to figure out what percentage of BIGbank's 1,000,000 loans are bad. We also want to look at smallbank, with 100,000 loans. Many people seem to think you'd need to look at 10 times as many loans from BIGbank as you would for smallbank.


The fact is that you would use the same size sample, in almost all practical circumstances, for the two populations above. Ditto if the population were 100,000,000 or 1,000.


The reasons for this, and the concept behind it, go back to the early part of the 20th century, when modern experimental methods were developed by (Sir) Ronald A. Fisher. Though Wikipedia correctly cites Fisher in its entry on experimental design, the seminal book, Design of Experiments, is out of stock at Amazon (for $157.50, you can get a reprint of this and two other texts together in a single book). Luckily, for a mere $15.30, you can get David Salsburg's (no relation and he spells my name wrong! ;-) ) A Lady Tasting Tea, which talks about Fisher's work. Maybe this is why no one knows this important fact about sample size--because we statisticians have bought up all the books that you would otherwise be breaking down the doors (or clogging the internet) to buy. Fisher developed the idea of using randomization to create a mathematical and probability framework for making inferences from data. In English? He figured out a great way to do experiments, and this idea, randomization, is what allows us to make statistical inferences about all sorts of things (and the lack of randomization is what sometimes makes it very difficult to prove otherwise obvious things).


Why doesn't (population) size matter?
To answer this question, we have to use the concept of randomization, as developed by Fisher. First, let's think about the million loans we want to know about at BIGbank. Each of them is no doubt very different, and we could probably group them into thousands of different categories. Yet, let's ignore that and just look at the two categories we care about: 1) good loan or 2) bad loan. Now, with enough time studying a given loan, suppose we can reasonably make a determination about which category it falls into. Thus, if we had enough time, we could look at the million loans and figure out that G% are good and B% (100% - G%) are bad.


Now suppose that we took BIGbank's loan database (ok, we need to assume they know who they loaned money to), and randomly sampled 100 loans from it. Now, stop for a second. Take a deep breath. You have just entered probability bliss -- all with that one word, randomly. The beauty of what we've just done is that we've taken a million disparate loans and, with them, formed a set of 100 "good"s and "bad"s that are identical in their probability distribution. This means that each of the 100 sampled loans that we are about to draw has exactly a G% chance of being a good one and a B% chance of being a bad one, corresponding to the actual proportions in the population of 1,000,000.


If this makes sense so far, skip this paragraph. Otherwise, envision the million loans as quarters lying on a football field. Quarters heads up denote good loans and quarters tails up denote bad loans. We randomly select a single coin. What chance does it have of being heads up? G%, of course, because exactly G% of the million are heads up and we had an equal chance of selecting each one.


Now, once we actually select (and look at) one of the coins, the chances for the second selection change slightly, because where we had G% exactly, now there is one fewer quarter to choose from, so we have to adjust accordingly. However, that adjustment is very slight. Suppose G were 90%. Then, for the second selection, if the first were a good coin, we'd have an 899999/999999 chance of selecting another good one (that's an 89.99999% chance instead of a 90% chance). For smallbank, we'd be looking at a whopping reduction to an 89.9999% chance from a 90% chance. This gives an inkling of why population size, as long as it is much bigger than sample size, doesn't much matter.
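For anyone who wants to check that arithmetic, here is a minimal sketch using exact fractions. The function name second_draw_good is made up for this illustration, and the code assumes exactly 90% of each bank's loans are good and that the first loan drawn was a good one.

```python
from fractions import Fraction

def second_draw_good(population: int, good_share: Fraction) -> Fraction:
    """Chance the second loan drawn is good, given the first one drawn was good.

    Hypothetical helper for illustration; assumes good_share * population
    is a whole number of loans.
    """
    good = population * good_share          # good loans before any draw
    return (good - 1) / Fraction(population - 1)  # one good loan is gone

for name, size in [("BIGbank", 1_000_000), ("smallbank", 100_000)]:
    p = second_draw_good(size, Fraction(9, 10))
    print(f"{name}: {p} = {float(p):.7%}")
```

Either way, the second draw is within a hair of 90%, which is the whole point.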


So, now we have a sample set of 100 loans. We find that 80 are good and 20 are bad. Right off, we know that, whether dealing with the 100,000 population or the 1,000,000 population, our best guess for the percentage of good loans, G, is 80%. That is because of how we selected our sample. It doesn't matter one bit how different the loans are. They are just quarters on a football field. It follows from the fact that we selected them randomly.


We can also calculate several other facts, based on this sample. For example, if the actual proportion of good loans were 90% (900,000 out of 1,000,000), we'd get 80 or fewer in our sample of 100 only 0.1977% of the time. The corresponding figure, if we had sampled from the population of 100,000 (and had 90,000 good loans), would be 0.1968%. What does this lead us to conclude? Very likely, the proportion of "good" loans is less than 90%. We can continue to do this calculation for different possible values of G:
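A rough sketch of how one might reproduce the two figures for G = 90%, assuming they are exact tail probabilities from the hypergeometric distribution (the one named later in this post). The helper hypergeom_tail is a made-up name for this illustration; it uses only Python's standard library.

```python
from math import comb

def hypergeom_tail(k_max, population, good, sample):
    """P(at most k_max good loans) when `sample` loans are drawn
    without replacement from `population` loans, `good` of them good.
    Hypothetical helper written for this illustration."""
    bad = population - good
    k_min = max(0, sample - bad)  # fewest good loans the sample can contain
    ways = sum(comb(good, k) * comb(bad, sample - k)
               for k in range(k_min, k_max + 1))
    return ways / comb(population, sample)  # exact integers, one final division

p_big = hypergeom_tail(80, 1_000_000, 900_000, 100)
p_small = hypergeom_tail(80, 100_000, 90_000, 100)
print(f"BIGbank:   {p_big:.4%}")
print(f"smallbank: {p_small:.4%}")
```

The counting is done in exact integer arithmetic, so the only rounding happens in the final division.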

If G were 89%: you would get 80 or fewer 0.586% of the time.
If G were 88%: you would get 80 or fewer 1.47% of the time.
If G were 87%: you would get 80 or fewer 3.12% of the time.
If G were 86.3%: you would get 80 or fewer 5.0% of the time.
If G were 86%: you would get 80 or fewer 6.14% of the time.

In each of the above cases, the difference between a population of 1,000,000 and 100,000 loans makes a difference only at the second decimal place, if that.
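To see both populations side by side for each value of G, something like the following sketch would do. It assumes the same hypergeometric calculation as above; tail_at_most and tail_table are names invented here, and G is stored in tenths of a percent so that the number of good loans is always a whole number.

```python
from math import comb

def tail_at_most(k_max, population, good, sample):
    """P(at most k_max good loans in a sample drawn without replacement).
    Hypothetical helper written for this illustration."""
    bad = population - good
    ways = sum(comb(good, k) * comb(bad, sample - k)
               for k in range(max(0, sample - bad), k_max + 1))
    return ways / comb(population, sample)

# G in tenths of a percent: 89%, 88%, 87%, 86.3%, 86%
G_PER_MILLE = [890, 880, 870, 863, 860]

def tail_table(populations=(1_000_000, 100_000), sample=100, k_max=80):
    """One row per G: (G in percent, [tail for each population])."""
    rows = []
    for g in G_PER_MILLE:
        tails = [tail_at_most(k_max, n, n * g // 1000, sample)
                 for n in populations]
        rows.append((g / 10, tails))
    return rows

for g_percent, tails in tail_table():
    print(f"G = {g_percent:4.1f}%: " + "  vs  ".join(f"{t:.3%}" for t in tails))
```

Each printed row shows the 1,000,000-loan figure first and the 100,000-loan figure second.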


Such a process allows us to create something called a confidence interval. A confidence interval kind of turns this calculation on its head and says, "Hey, if we only get 80 or fewer in a sample 1.47% of the time when the population is 88% good, and I got only 80 good loans in my sample, it doesn't sound too likely that the population is 88% good." The question then becomes, at what percentage would you start to worry?


For absolutely no reason at all (and I mean that), people seem to like to limit this percent to 5%. Thus, in the example above, most would allow that, if we estimated G such that 5% (or more) of the time, 80 or fewer of 100 loans would be good (where 80 is the number of good in our sample), then they would feel comfortable. Thus, for the above, we would say, "with 95% confidence, 86.3% or fewer of the loans in the population are good." We could just as well have figured out the number that corresponded to 1% and stated the above in terms of 99% confidence, with the corresponding higher G, or figured out the number that corresponds to 30% and stated the above in terms of 70% confidence. However, everyone seems to love 5% and the 95% confidence that goes with it.
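Here is one way the 86.3% figure could be found by machine rather than by trial and error. This is a sketch under an assumption: it swaps in the binomial distribution (the infinite-population case) for the exact calculation, since the two agree closely here, and searches by bisection for the G at which "80 or fewer" happens exactly 5% of the time. The names binom_tail and upper_bound are invented for the example.

```python
from math import comb

def binom_tail(k_max, n, p):
    """P(at most k_max successes in n independent draws with success chance p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_max + 1))

def upper_bound(good_in_sample, n, alpha=0.05, tol=1e-9):
    """Largest G for which seeing good_in_sample or fewer good loans
    still happens at least alpha of the time (found by bisection).
    Hypothetical helper; assumes an effectively infinite population."""
    lo, hi = good_in_sample / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_tail(good_in_sample, n, mid) > alpha:
            lo = mid   # still plausible, so the bound lies higher
        else:
            hi = mid   # too unlikely, so the bound lies lower
    return (lo + hi) / 2

print(f"95% confidence: G is at most {upper_bound(80, 100):.1%}")
print(f"99% confidence: G is at most {upper_bound(80, 100, alpha=0.01):.1%}")
```

Changing alpha is all it takes to trade the beloved 5% for 1% or 30%.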


Back to sample size versus population. As stated above, the population size, though 10 times bigger, doesn't make a difference. For a given probability above, we are using the hypergeometric distribution to calculate the exact figure (the mathematics behind it are discussed some in my earlier post).
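For reference, the hypergeometric calculation can be written out as below. The symbols are notation introduced here, not taken from the earlier post: N is the population size, K the number of good loans in it, n the sample size, and X the number of good loans in the sample.

```latex
\[
P(X \le 80) \;=\; \sum_{k=0}^{80}
  \frac{\binom{K}{k}\,\binom{N-K}{\,n-k\,}}{\binom{N}{n}},
\qquad K = G \cdot N, \quad n = 100 .
\]
```

As N grows with K/N held at G, each term approaches the binomial probability of k good loans in n independent draws, which is the "infinite" population case.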


Here are some of the chances associated with a G of 85% and a sample size of 100 that yields 80 or fewer good loans.

Population infinite: 10.65443%
Population 1,000,000: 10.65331%
Population 100,000: 10.64%
Population 10,000: 10.54%
Population 1,000: 9.49%
Population 500: 8.21%

This example follows the rule of thumb: you can ignore the population size unless the sample is at least 10% of the population.
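A sketch of how a table like the one above might be generated, assuming the finite populations use the hypergeometric distribution and the "infinite" row uses the binomial. The helper names are invented for this illustration, and G is fixed at 85% with a sample of 100.

```python
from math import comb

def hypergeom_tail(k_max, population, good, sample):
    """P(at most k_max good loans), sampling without replacement.
    Hypothetical helper written for this illustration."""
    bad = population - good
    ways = sum(comb(good, k) * comb(bad, sample - k)
               for k in range(max(0, sample - bad), k_max + 1))
    return ways / comb(population, sample)

def binom_tail(k_max, sample, p):
    """Same tail for an effectively infinite population."""
    return sum(comb(sample, k) * p**k * (1 - p)**(sample - k)
               for k in range(k_max + 1))

SAMPLE, K_MAX = 100, 80
tails = {n: hypergeom_tail(K_MAX, n, n * 85 // 100, SAMPLE)
         for n in (500, 1_000, 10_000, 100_000, 1_000_000)}
infinite = binom_tail(K_MAX, SAMPLE, 0.85)

print(f"Population infinite: {infinite:.5%}")
for n in sorted(tails, reverse=True):
    share = SAMPLE / n  # fraction of the population that was sampled
    print(f"Population {n:,}: {tails[n]:.5%} (sample is {share:.2%} of it)")
```

The printed sampling fraction makes the rule of thumb easy to eyeball: the figures only start to drift once the sample is a sizable share of the population.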