Herd immunity is the point at which new infections stop increasing exponentially and instead decay exponentially. At this point, the disease will die out on its own, though it may still take some time. Naive estimates of the herd immunity threshold for COVID put it around 70%, meaning 70% of the population needs to be immune before new infections start to shrink and the virus dies out.
However, experts agree that this estimate, though it works well when talking about vaccinations, is much too high when people are obtaining immunity by contracting the disease. Instead, the consensus appears to be that herd immunity for COVID would be between about 40% and 50% if there is no vaccination.
In this article (https://www.medrxiv.org/content/10.1101/2020.05.06.20093336v2), authors Tom Britton, Pieter Trapman, and Frank Ball show that with some assumptions herd immunity could be 43%. In this news article (https://www.quantamagazine.org/the-tricky-math-of-covid-19-herd-immunity-20200630/), Britton says it might even be lower than 43%, pointing out that his paper does not account for all the factors that would lower it. Virginia Pitzer of Yale School of Public Health believes it is "between 40 and 50%", and Harvard's Mark Lipsitch puts it at "about 50%." In other words, experts agree it is hard to pinpoint, but also agree it is much lower than the 70% arrived at using a naive calculation.
Now, we'll spend some time walking through the basic math and some examples, to see why the naive estimate is around 70% and how the actual threshold could be much lower.
Herd immunity is directly related to the reproduction rate R, discussed in a prior blog by Joshua Lafair (https://salthillstatistics.com/posts/66). To summarize, R is the average number of people that each infected person infects, and R0 is this number at "Time 0", the point at which a new disease (with no vaccinations or natural immunity) breaks out. So if R is 3, and we started with 100 infected people, they would infect 300 people, who infect 900 people, who infect 2700 people, and so forth. That quick increase is the exponential increase we saw with COVID in March in NYC and right now in the South. Think about what would happen, though, if R were below 1. Suppose R were 0.5 -- the average person would infect only 0.5 people. That means if we started with 100 infected people again, they would infect 50 people, who would infect 25, who would infect 12.5, etc. In all, the disease would die out after another 100 people (50 + 25 + 12.5 + ... = 100) were infected. If R is below 1 but closer to it, we might get 200 or 300 more people infected, but the disease would still die out, because it is "decaying" exponentially. When R is greater than 1, the disease does not die out, but increases exponentially.
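To make the growth and decay arithmetic concrete, here is a minimal Python sketch. The function names are my own (hypothetical, not from the prior blog), and it assumes the simplest possible picture: every generation of cases is just the previous one multiplied by R.

```python
# Illustrative sketch; function names are hypothetical, not from the article.

def generations(initial, r, n):
    """New infections in each of n successive generations, starting from `initial` cases."""
    counts = []
    current = initial
    for _ in range(n):
        current = current * r  # each case infects r others on average
        counts.append(current)
    return counts


def total_when_decaying(initial, r):
    """Total further infections when r < 1 (sum of the geometric series)."""
    if not 0 <= r < 1:
        raise ValueError("the series only converges for 0 <= r < 1")
    return initial * r / (1 - r)


print(generations(100, 3, 3))    # growth: 300, 900, 2700
print(generations(100, 0.5, 3))  # decay: 50.0, 25.0, 12.5
for r in (0.5, 2 / 3, 0.75):
    print(round(r, 2), round(total_when_decaying(100, r)))
```

Under this assumption, an R of 0.5 gives 100 further infections, while an R of 2/3 or 0.75 gives 200 or 300, which is where the "200 or 300 more people" figure comes from.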
Once R is below 1, we are said to have herd immunity. Suppose, for example, that 100 new infected people come to NYC after NYC has achieved herd immunity. Those 100 will infect fewer than 100 people (because R is less than 1), who will infect fewer than that, etc. So even though we have people coming in and carrying the disease, it will not explode again. It's still not good, because these new disease carriers will infect people, but we do not need to worry about them causing another expanding outbreak like we had in March, if we are at herd immunity.
COVID is believed to have had an initial R, called R0, of around 3. There's a lot of debate about what the exact number was but it is definitely well above 1. As people get infected, the reproduction rate R drops from R0 such that an R0 of 3 quickly becomes a lower reproduction rate R, eventually hitting herd immunity. In a simple model, how does this work mathematically?
Let's take what's called a homogeneous population -- everyone is the same. Suppose first that each person interacts with 15 random people long enough while infected to potentially pass on COVID. Suppose second that each person exposed for enough time has a 1 in 5 chance of getting COVID. This means the R0 is 3: a newly infected person will be in contact with 15 people, of whom 3 (one-fifth of 15) will get infected.
Now, as the disease "flows" (I hate to use a Trump term) through the population, some of the 15 random people will be immune or already infected. Suppose that 1/3 have already been infected. Then, of a person's 15 random contacts, only 10 could be newly infected, and only 2 of those (10 divided by 5) will get infected. So R will drop to 2 once 1/3 of the population is or has been infected. Now, think about when 2/3 of the population has had COVID. Each person can only possibly infect 5 of their 15 random contacts, and will only infect 1 of them (since the infection rate is 1/5). Now R is just 1--each infected person infects only 1 other person. Just after that point, when more than 2/3 have been infected, R drops below 1 and we have herd immunity as described above.
So, in this naive model, we achieve herd immunity at 67% infected. The formula for this naive herd immunity is simple: 1 - 1/R0, and I've just explained how it works.
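The same calculation can be sketched in a few lines of Python. The function names are hypothetical, and the code assumes exactly the homogeneous model above: 15 contacts per person, a 1 in 5 chance of infection per contact, and immune people spread evenly among everyone's contacts.

```python
# Illustrative sketch of the homogeneous model; names are hypothetical.

def effective_r(r0, immune_fraction):
    """R once a given fraction of contacts can no longer be infected."""
    return r0 * (1 - immune_fraction)


def naive_threshold(r0):
    """Immune fraction at which R falls to 1 in the homogeneous model: 1 - 1/R0."""
    if r0 <= 1:
        return 0.0  # the disease already dies out with nobody immune
    return 1 - 1 / r0


r0 = 15 * (1 / 5)  # 15 contacts, 1 in 5 chance of infection each
for immune in (0, 1 / 3, 2 / 3):
    print(round(immune, 2), round(effective_r(r0, immune), 2))
print(round(naive_threshold(r0), 2))
```

Setting effective_r equal to 1 and solving for the immune fraction is what gives the formula 1 - 1/R0.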
So why doesn't the naive estimate work? Because the two homogeneity assumptions, that 1) everyone contacts the same number of people, and 2) each person contacted has the same chance of being infected, are both wrong. Populations are not homogeneous. Some people have many more contacts than average, violating assumption 1. Having more contacts is a double-edged sword: if you are infected, you are more likely to infect others, and if you are not infected, you are more likely to get infected because you come in contact with so many people. And some people have a much higher chance of being infected by each person they contact and are thus more susceptible to the disease, violating assumption 2.
I'll use a couple of examples to explain why the falsity of these assumptions makes the naive herd immunity estimate too high.
First let's relax the assumption that everyone comes in sufficient contact with 15 people. We will keep the average number of contacts at 15, though. Instead of everyone regularly coming in extended contact with 15 people, suppose 4 out of 5 people are grandparents who come in contact with, and potentially infect, just 2 other people, while 1 out of 5 are college kids who regularly come in contact with 67 other people. So if we have 5 random people, 4 will be grandparents with 8 total contacts (forget about overlap) and one will be a college student with 67 contacts. So in total we have 5 people and 8 + 67 = 75 contacts. The average is thus still 15: 75 divided by 5.
First let's see that this change does not change the math of R0. Suppose a random 100 people get the disease initially. Four-fifths of the 100 will be grandparents, so there will be about 80 grandparents initially getting the disease. These 80 grandparents will contact 160 people (2 each) and spread it to 32 (one-fifth of those contacted). Returning to the initial 100 infected, the remaining 20 will be college kids. The 20 college kids will contact 1340 people (20*67) and spread it to 268 (1340 divided by 5). So in total, we have 300 (32+268) infections from the initial 100. This is an R0 of 3 because we went from 100 to 300 new infections in the initial round of reproduction. In other words, the example we created with a heterogeneous population of grandparents and college kids has an R0 of 3, just like the original homogeneous population example.
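Here is a small Python sketch of that first round of infections, for readers who want to check the numbers or try other mixes. The function name and the (share, contacts) layout are my own illustrative choices; the code assumes the initial cases are spread across the groups in proportion to their population shares.

```python
# Illustrative sketch; the function name and data layout are hypothetical.

def first_generation(initial, groups, p_infect):
    """Infections caused by `initial` random cases.

    groups: list of (population_share, contacts_per_person) pairs.
    p_infect: chance that a contact becomes infected.
    """
    total = 0.0
    for share, contacts in groups:
        cases_in_group = initial * share
        total += cases_in_group * contacts * p_infect
    return total


groups = [(0.8, 2), (0.2, 67)]  # grandparents, college kids
new_infections = first_generation(100, groups, 1 / 5)
r0 = new_infections / 100
print(round(new_infections), round(r0, 2))
```

Because R0 here depends only on the average number of contacts, any mix of groups that averages 15 contacts per person gives the same R0 of 3.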
Now remember in the homogeneous population the R stays above 1 until 2/3rds of the population has the disease, at which point we reach herd immunity. So how does the disease progress in this heterogeneous population? Let's use our example, and make one other simplifying assumption: the college kids only hang out with college kids and the grandparents only hang out with grandparents. This is not quite reality but it simplifies everything. Now we have two populations that never interact.
We can then treat each of them as a homogeneous population. The grandparents have an R0 of 2/5 = 0.4 (2 contacts each and a 1 in 5 chance of infection). Because that is already below 1, they are herd immune from the start, at 0% infected. The college kids have an R0 of 67/5 = 13.4 (67 contacts each and a 1 in 5 chance of infection), so they will reach herd immunity after 1 - 1/13.4 are infected, or about 93%. Since grandparents are 80% of the population and college kids are 20%, we have herd immunity at 80%*0% + 20%*93% = 18.6%.
So even though the population as a whole has an R0 of 3, we reach herd immunity at 18.6%. In this example, heterogeneity in the number of contacts pulls the threshold well below the 67% we found for the homogeneous population.
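The two-group calculation can be sketched as follows. The function names are hypothetical, and the code leans on the same simplifying assumption as the example: the groups never mix, so each one can be treated as its own homogeneous population with threshold 1 - 1/R0.

```python
# Illustrative sketch for non-mixing groups; names are hypothetical.

def group_threshold(r0):
    """Herd immunity threshold for one homogeneous group."""
    if r0 <= 1:
        return 0.0  # already below 1, so herd immune at 0% infected
    return 1 - 1 / r0


def combined_threshold(groups):
    """groups: list of (population_share, group_r0) pairs."""
    return sum(share * group_threshold(r0) for share, r0 in groups)


contact_groups = [(0.8, 2 / 5), (0.2, 67 / 5)]  # grandparents, college kids
for share, r0 in contact_groups:
    print(share, round(r0, 2), round(group_threshold(r0), 3))
print(round(combined_threshold(contact_groups), 3))
```

Carrying full precision gives roughly 18.5% rather than 18.6%; the small gap comes from rounding the college kids' threshold up to 93% before multiplying by 20%.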
Now let's look at the second homogeneity assumption: that everyone has the same chance (1/5) of acquiring the disease from an infected contact. Let's relax this assumption and assume that 20% of the population is in nursing homes and is "susceptible" and 80% are in track clubs and are much less susceptible. Nursing home residents have a 1 in 2 chance of acquiring the disease from an infected contact, whereas members of track clubs have a 1 in 8 chance.
Assume 100 random people get the disease at first. Keep the prior assumption of 15 contacts per person, and assume that nursing home residents never contact track club members and vice versa. Then 20 of the initial 100 will be nursing home residents, and they will give the disease to half of their nursing home contacts: 20*15*1/2 = 150 people in total. The other 80 will be track club members, and they will give the disease to one in eight of their contacts: 80*15*1/8 = 150 people in total. So the initial 100 random infections will result in 300 additional infections, and thus the R0 is 3, the same as in the homogeneous population.
Let's calculate herd immunity in this population. We again will assume two completely separate populations (of nursing home residents and track club members). We can calculate herd immunity for each by first computing R0. For nursing homes, R0 is 15/2 = 7.5 (15 contacts times a 1 in 2 chance of contracting the disease). This implies herd immunity at 1 - 1/7.5 = 87%. For track clubs, R0 is 15/8 = 1.875 (15 contacts times a 1 in 8 chance of contracting the disease). This implies herd immunity at 1 - 1/1.875 = 47%.
This means that the full population will reach herd immunity after 87% of nursing home residents and 47% of track club members are infected, that is, after 87%*20% + 47%*80% = 55% of the population is infected.
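For the susceptibility example, a minimal Python sketch of both steps (overall R0, then the combined threshold) might look like this. The dictionary layout and function names are hypothetical; the code assumes the two groups never mix and that each group's R0 is its contacts times its chance of infection per contact.

```python
# Illustrative sketch for non-mixing groups; names and layout are hypothetical.

def group_r0(contacts, p_infect):
    """R0 within a group: contacts times chance of infection per contact."""
    return contacts * p_infect


def group_threshold(r0):
    """Herd immunity threshold for one homogeneous group."""
    if r0 <= 1:
        return 0.0
    return 1 - 1 / r0


# group name -> (population share, contacts per person, chance of infection)
groups = {
    "nursing_home": (0.2, 15, 1 / 2),
    "track_club": (0.8, 15, 1 / 8),
}

overall_r0 = sum(share * group_r0(c, p) for share, c, p in groups.values())
combined = sum(
    share * group_threshold(group_r0(c, p)) for share, c, p in groups.values()
)
print(round(overall_r0, 2), round(combined, 2))
```

The overall R0 is still 3, while the combined threshold is about 55%, below the 67% of the homogeneous case.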
So what have we shown here?
That population heterogeneity, either in number of contacts or in susceptibility, leads to lower herd immunity than the homogeneous population estimate. You might wonder whether these examples could go the other way. That is, can the herd immunity threshold increase when you make the population heterogeneous? The answer is no--heterogeneous populations do not increase the herd immunity threshold. In the simple types of non-overlapping populations in my examples, this is proved by showing that the maximum of the herd immunity equation occurs when the R0 for both populations is equal (this can be shown with simple math and a tiny bit of calculus). For a more complicated model and discussion, see this paper (https://www.medrxiv.org/content/10.1101/2020.04.27.20081893v3.full.pdf ).
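This is not a proof, but the claim about the maximum can be checked numerically for one case. The sketch below uses hypothetical names and fixes an 80/20 split of two non-mixing groups with an overall R0 of 3 (the share-weighted average of the group R0s, as in my examples). It then tries many ways of dividing that R0 between the groups and reports which division gives the highest threshold.

```python
# Numerical check for one case only, not a proof; names are hypothetical.

def group_threshold(r0):
    """Herd immunity threshold for one homogeneous group."""
    if r0 <= 1:
        return 0.0
    return 1 - 1 / r0


def combined_threshold(share_a, r_a, overall_r0):
    """Threshold for two non-mixing groups whose share-weighted R0 is overall_r0."""
    share_b = 1 - share_a
    r_b = (overall_r0 - share_a * r_a) / share_b  # keeps the weighted average fixed
    return share_a * group_threshold(r_a) + share_b * group_threshold(r_b)


share_a, overall_r0 = 0.8, 3.0
candidates = [i / 20 for i in range(0, 76)]  # group A's R0 from 0.00 to 3.75
results = [(combined_threshold(share_a, r_a, overall_r0), r_a) for r_a in candidates]
best_threshold, best_r_a = max(results)
print(round(best_threshold, 3), best_r_a)
```

In this sweep the highest threshold shows up when group A's R0 is 3, which forces group B's R0 to be 3 as well. That is the homogeneous case, and every other division tried gives a lower threshold.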
So herd immunity might be at 50% or even 40%. What does this mean for NYC, and the country?
In New York City, we may be nearing this herd immunity threshold. This means that the relaxing of social distancing measures and re-opening of businesses will not lead to another explosive outbreak, even without good case tracking. Assuming we are not at herd immunity but close, such relaxation will lead to increasing cases, but at a slower rate and for a much shorter time than in March, because R, if it is not below 1, is now much closer to 1. We can also probably withstand some influxes of infectious people without seeing major outbreaks (but again, we might expect to see small upticks in cases when this happens).
The rest of the country is nowhere near herd immunity, so it needs a different strategy in order to safely re-open. We can use population heterogeneity to enable a safer reopening. But that is the subject for another blog.