Why are polls always wrong?

November 2, 2020 By Alan Salzberg

Even the best pre-election polls have errors.  Here are a few.

Sampling Error.  Sampling error is the difference between the sample result and the result you would get if the poll were given to every member of the sampled population (hopefully all voters; if not, that mismatch is the source of other errors--see below).  Sampling errors exist simply because of the random nature of sampling: you randomly select from a population of voters who are 50-50 on an issue, but the ones you sample might be 47-53 just by chance.  Sampling error depends only on the size of the poll (it's slightly more complicated if the survey is stratified (grouped) in advance, but we'll assume it is a "simple" random sample).  In that case, sampling error is at most around 3-5% for most polls (whose size is between 400 and 1,000).  By "at most," I mean 19 out of 20 samples will come within around 4% (the exact number depends on sample size) of the actual value.  Most of the 19 that are within 4% will be off by much less than 4%, but the twentieth will be off by more than 4%.
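To see where those 3-5% figures come from, here is a small sketch using the standard 95% margin-of-error formula for a simple random sample, with the worst-case split of 50-50 (the sample sizes are the ones mentioned above; the formula itself is textbook, not taken from this post):

```python
import math

def margin_of_error(n, p=0.5):
    # Approximate 95% margin of error for a simple random sample:
    # 1.96 * sqrt(p*(1-p)/n), largest when p = 0.5.
    return 1.96 * math.sqrt(p * (1 - p) / n)

for n in (400, 1000):
    print(f"n={n}: +/-{margin_of_error(n):.1%}")
# n=400 gives roughly +/-4.9%; n=1000 gives roughly +/-3.1%
```

So a poll of 400 lands near the top of the 3-5% range and a poll of 1,000 near the bottom, which is why "4%" is a reasonable round number for typical polls.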

There are two good things about sampling error with respect to election polling:

1) Sampling error is not biased -- a poll result is no more likely to be off in one direction than in the other.

2) If you are using a site like https://projects.fivethirtyeight.com/polls/ that aggregates polls, the sampling error is much lower, because these sites combine many independent polls.  Therefore, instead of several hundred voters, the averages combine the results from thousands.  This means the margin of error might be more like 1% (the margin of error generally halves when you quadruple the sample size).
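The halving-when-you-quadruple rule falls directly out of the square root in the margin-of-error formula.  A quick illustration (the poll sizes here are made up for the example):

```python
import math

def margin_of_error(n, p=0.5):
    # 95% margin of error for a simple random sample (worst case p = 0.5)
    return 1.96 * math.sqrt(p * (1 - p) / n)

# One hypothetical poll of 600 vs. an aggregate of ten such polls (6,000).
single = margin_of_error(600)
aggregate = margin_of_error(6000)
print(f"single poll: +/-{single:.1%}, aggregate: +/-{aggregate:.1%}")
# Quadrupling 600 to 2,400 exactly halves the margin (sqrt(4) = 2);
# pooling ten polls shrinks it by a factor of sqrt(10), about 3.2.
```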

So net-net, sampling error will not make the difference here.

Non-response Error.  You might be able to guess what non-response error is -- it is the error that occurs because some portion of the people you tried to survey did not respond (or said they are undecided).  Like sampling error, this can be completely random (the people you missed might, on average, be planning to vote in the same proportions for Biden or Trump as the people you reached).  However, if people who tend to vote for one candidate are less likely to respond, or less likely to answer their phone, then this error can run in a particular direction.  If non-response is small, like only a couple percent, it needs to be highly skewed to matter.  Suppose, for example, that the non-response rate is 10%, and that the 10% who did not respond are voting 70-30 for Trump, versus 40-60 among the people who did respond.  Then the actual share voting for Trump is 70%*10% + 40%*90% = 43%.  So this 10% moved the dial, but only by 3% (from 40% to 43%), not 10%.  Unfortunately, non-response is far higher than 10% -- it's more like 90% (see, for example, this Pew Research discussion).  In the past, non-response has not skewed that far in one direction or the other in election polls, but even if it makes a little difference, the polls could be off by several percent.
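The example above is just a weighted average of the two groups.  Here it is worked through in code, using the same numbers from the paragraph:

```python
# Worked version of the example above: 10% non-response, where
# non-responders vote 70-30 for Trump and responders vote 40-60.
nonresponse_rate = 0.10
trump_among_nonresponders = 0.70
trump_among_responders = 0.40

# Actual Trump share = weighted average of the two groups.
actual = (trump_among_nonresponders * nonresponse_rate
          + trump_among_responders * (1 - nonresponse_rate))
print(f"poll shows {trump_among_responders:.0%}, actual is {actual:.0%}")
# The poll (responders only) shows 40%; the true figure is 43%.
```

Note how a very lopsided 70-30 skew among non-responders still only moves the result 3 points, because it is diluted by the 90% who did respond; that dilution disappears as non-response climbs toward 90%.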

In short, non-response error could easily be the difference-maker, but given past outcomes, it is unlikely to matter.

Measurement Error.  Measurement error occurs when the survey response differs from what you are trying to measure.  Why would it occur here?  1. People change their minds.  2. People lie (embarrassed to say who they are voting for, perhaps).  In 2016, polls tightened in the couple of days before the election, indicating people were changing their minds toward Trump, and polls taken just three or four days before the election were thus inaccurate.  FiveThirtyEight counts "leaning" voters as voting for the candidate they are leaning toward, so if even this "leaning" switches in one direction more than the other, the polls will be off.  Also, it is believed that some Trump voters didn't want to say they were voting for Trump, even to an anonymous pollster.  Polls do not seem to be tightening this time, indicating that polls taken a few days ago do not have measurement error due to people changing their minds.  But are Trump voters still more likely to not say who they are voting for, or to lie about it?  My own guess is that this is less true now--many people voting for Trump seem to be very proud of it.  But that's just a guess, based on zero statistical data.

In other words, my take is measurement error will be a smaller difference than last time, and thus not enough to matter. 

Specification Error.  Specification error can mean many things, but I'm using it here to mean the error that occurs when the sampled population is different from the target population -- polls aim to survey people who are actually going to vote (the target population), and thus give results for likely voters.  This has two obvious problems: people the surveyor believes are going to vote may not vote, and people the surveyor believes will not vote might vote.  In any election where turnout changes a lot from prior elections, this error can be substantial.  It is clear that in 2020, many more people are voting, so our guesses of who will vote based on past elections will obviously exclude some people.  Are these people biasing the polls?  That, of course, depends on whether the unlikely voters who do vote are more for Biden or Trump.  The conventional wisdom is that heavy turnout favors Democrats, but it is not at all clear whether that will be true this year.  Republicans had a lot of success with voter registration efforts (see, for example, this article).  With a substantial increase in voting this year, the outcome of the election could hinge on this difference.
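The same weighted-average arithmetic from the non-response example applies here.  A hypothetical sketch (all of these numbers are invented for illustration; they are not estimates from this post or any poll):

```python
# Hypothetical specification error: a likely-voter model captures 90% of
# the people who actually vote, and the missed 10% (unexpected new
# voters) break differently from the modeled electorate.
modeled_share_of_electorate = 0.90   # invented for illustration
biden_among_modeled = 0.52           # what the poll of likely voters shows
biden_among_missed = 0.40            # unexpected voters breaking for Trump

poll_result = biden_among_modeled
actual = (biden_among_modeled * modeled_share_of_electorate
          + biden_among_missed * (1 - modeled_share_of_electorate))
print(f"poll shows {poll_result:.1%}, actual is {actual:.1%}")
# A 52% poll lead shrinks to about 50.8% -- a roughly 1.2-point error
# from mis-specifying just 10% of the electorate.
```

Because 2020 turnout is far above past elections, the missed slice could be much larger than 10%, which is why this error, unlike the others, could plausibly be big enough to matter.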

So, who is going to win?  Given Biden's consistent lead of a few points in enough states and what we know about past and present errors, the only possible error that seems large enough to overcome Biden's lead is specification error.  In other words, for Trump to win, there would need to be a large specification error in the polling.  That is certainly possible, given the unique nature of the election and the enormous, and enormously different, turnout that is expected.