Winning an election by 1 vote. What are the chances?

Voting is thought of as the most important civic duty.  But does your one vote really count?  I mean, how many elections are won by just a vote?  Certainly not many.  Yet, I just read that in Virginia, an election for state representative was won by a single vote!  In this election there were a total of 23,215 votes and the final outcome (as of a recount and as of this writing) was 11,608-11,607.  That’s right, it was won by a single vote.  All those voters who would’ve voted for the candidate with 11,607 votes but didn’t bother to show up are really kicking themselves now.

To figure out the chances of winning by a vote, let’s first make a couple of simplifying assumptions.  First, let’s assume that the eligible voting age population is exactly split 50-50 in their preferences (so if everyone voted the election would be a tie).  For simplicity, let’s also assume this population is much larger than the number of people who actually vote (this is sometimes true in smaller elections but less true in national elections).

If there are just 3 people voting, the chances of it being 2-1 or 1-2 are 75% (6 out of 8).  We can figure  this out by simply enumerating the 8 equally likely possibilities of the three people and counting how many are 1 vote wins (A is one candidate and B is the other):

AAA, AAB, ABA, BAA, BAB, BBA, ABB, BBB  : 6 of the 8 (all but the first and last) are a 2-1 win.  There is a faster way to do this by using a “binomial” distribution.  Using the R language, you can just double the chances of it being 2-1 (since 1-2 and 2-1 are equally likely) and the code looks like this: 2*dbinom(1,3,.5) .  But the R language and binomial distribution are for another time.

It is probably not surprising that as you increase the number of people voting, a 1 vote spread, even if the electorate is evenly split, is less and less likely.  For a coin toss, you might have heard the percent heads would get closer and closer to 50% as you flip more and more times.  That is true.  However, the *difference* in the number of heads and tails tends to get larger and larger.  The chance of a one vote difference is the same as the chance of there being a difference of only one between heads and tails, which gets smaller as the number of flips increases.

For 1,001 voters, the one vote spread has a chance of about 5%.

For 10,001 votes, the one vote spread has a chance of 1.6%.

For 23,215 votes (the number in the Virginia election), the chances are just 1.0%

For a presidential election with about 100 million votes (an odd total, so that a one vote margin is possible)?  It’s 0.016%, or about 1 in 6,000.  Pretty small, but certainly large enough for me to go to the ballot box.
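The percentages above can be reproduced with the same dbinom call used for the 3-voter case.  This is a minimal sketch in R: the function name one_vote_margin is made up for illustration, and 100,000,001 stands in for "100 million" because a one vote margin needs an odd total.

```r
# Chance of a one-vote margin when n voters (n odd) each pick A or B with probability 0.5.
one_vote_margin <- function(n) {
  stopifnot(n %% 2 == 1)           # a one-vote margin needs an odd total
  2 * dbinom((n - 1) / 2, n, 0.5)  # loser gets (n - 1) / 2 votes; double for either winner
}

# 100000001 is an assumed stand-in for "100 million" votes.
totals  <- c(3, 1001, 10001, 23215, 100000001)
chances <- sapply(totals, one_vote_margin)
print(data.frame(voters = totals, chance = signif(chances, 3)))
```

The chance shrinks roughly like one over the square root of the number of voters, which is why it falls so slowly.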





Is the NY Times right about college diversity?

The New York Times reported today that “Blacks and Hispanics are more underrepresented at top colleges than 35 years ago.”  Is this correct?

The short answer is yes with respect to blacks and no with respect to Hispanics.

The Times presents a series of graphs that show the raw percentage difference of underrepresentation.   For example, the Times reports that “Black students are just 6 percent of freshmen but 15 percent of college-age Americans” and therefore shows a graph with a 9 percentage point (15 minus 6) difference.  This has expanded from about a 7 point difference.  The raw percentage differences appear small and don’t really tell the story.  A 5 point difference between 80 and 85% is very different than a 5 point difference between 10 and 15%.

Because of this issue, the ratio of enrollment to college-age population is a much better measure.  Using the ratio, we can calculate that blacks enroll only at about 40% (6%/15%) of their population percentage.  Back in 1980, it was closer to 45% (6%/13%), so it’s gotten a little worse and the Times is correct.

From the graphs shown by the Times, we can see that Hispanics are currently (2015) 22% of the population but only 13% of freshman enrollment.  This gap is 9 percentage points, far larger than the 3 point gap in 1980, and therefore the Times concludes the situation has worsened.  Though the gap is larger on a pure percentage basis, the situation is actually better in 2015, and the Times’ conclusion is wrong with respect to Hispanics.  Hispanics went from an enrollment ratio of 50% (3%/6%) in 1980 to one of about 60% (13%/22%).  This appears to be true not just overall but in all the breakdowns the Times shows.
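As a quick check of the ratio arithmetic, here is a small sketch in R using only the percentages quoted above.  The function name representation_ratio is a made-up label for enrollment share divided by population share.

```r
# Enrollment share divided by population share, using the percentages quoted above.
representation_ratio <- function(enrolled_pct, population_pct) enrolled_pct / population_pct

ratios <- c(
  black_1980    = representation_ratio(6, 13),
  black_2015    = representation_ratio(6, 15),
  hispanic_1980 = representation_ratio(3, 6),
  hispanic_2015 = representation_ratio(13, 22)
)
print(round(ratios, 2))
```

A ratio of 1 would mean a group enrolls exactly in proportion to its share of the college-age population.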

The Times article has a number of graphs showing the percentage enrollment, but doesn’t address the reasons for the deficit.  You can’t very well enroll in college without a high school degree and the Times’ numbers (except for 1980) are for number of people in the right age group and not number of people with high school diplomas.  Lack of a diploma would indicate an issue not with the universities but a deficit earlier on.

And a diploma is not the only requirement for college enrollment.  A proper comparison would look at similarly situated individuals where the only difference is racial or ethnic background.  Only this sort of comparison could tease out the real effect of universities (as opposed to other reasons for the differences).  As another recent article has discussed, even though Asians are over-represented, they may still face racial discrimination because there may be even more qualified Asians than are admitted.


What are the chances of three snow days in one winter?

Any time there is even a ghost of a chance of snow, my three kids get very excited (ok, let’s face it, being a southern transplant to NY, so do I).  We check the forecast in the hopes it will give us some idea of how likely it is that school will be canceled.  Here in NYC, my wife tells me it used to be rare to never, but, of late, it seems the threshold is about 8 inches of snow.  Less than that and everyone wakes up early and trudges through.  More than that and we sleep late and go sledding.

This last winter we had one snow day early on and then nothing but beautiful weather.  My kids informed me at the time that they really didn’t want to have more than 3 snow days, because after that, they need to make it up.  I have no idea if this is true, but it is the setup for a nice probability problem.

It goes like this: if the chance of any random winter day being a snow day is p, what is the chance that you have exactly the sweet spot of 3 snow days in a season where there are n school days? For the example below, we’ll assume the chance of a snow day is 1 in 20, and that there are 60 school days in winter.  If we make a couple of assumptions, the calculation of whether there will be exactly 3 days becomes a standard probability question that is solvable using a statistical distribution called the Binomial Distribution.

But first things first.  Let’s assume (almost certainly wrongly) that whether or not a particular day is a snow day is completely independent of whether there were snow days on any prior days.  Also, let’s assume that the chances of 1 in 20 (5%) apply equally for the first day of winter as they do for the middle of winter or the end of winter.  So, in other words, the chance of 1 in 20 stays the same throughout the winter and doesn’t vary based on the kind of winter we have had and doesn’t even vary based on whether we had 2 feet of snow and canceled school the day before.

The first important concept is that, given these assumptions of independence and equal probability, the chance of having, say, two snow days in a row is just the chance of 1 snow day squared.  In other words, the chance that we have a snow day the first 2 days of winter is (.05)*(.05)=.0025, or 1 in 400.  Extending that to the required 3 days, we have .05*.05*.05, which is 1 in 8,000, an exceedingly small number, implying that only once in 8,000 years would we be lucky enough to have the first three days as snow days.

But that would not be good unless there were no snow days the rest of the season.  Therefore, we need to account for the chances that the other 57 days are *not* snow days.  That number is .05^3 (as before) multiplied by the chance of no snow day (.95) to the power of the number of days that occurs, which is 57.  Therefore, the chance that we have 3 snow days followed by 57 non-snow days is (.05^3)*(.95^57)= about 1 in 150,000.  It’s looking really grim right now.  On average, it will take 150,000 winters before we have one where the first three days are snow days and we have no more the rest of the season.

But luckily, we are trying to figure out something more likely, and that is the chance it will be a snow day on *any* 3 of the 60 days of winter.  We don’t care at all if it is the first three, the last three, or any three in the middle, and they don’t need to be consecutive.  It turns out that the chance we calculated, of 1 in 150,000, is the correct chance of a snow day on a particular set of 3 days no matter which particular set we choose.  For example, the chance of a snow day on January 1, February 1, and March 1, but not at all the rest of the winter, is also 1 in 150,000, as is the chance of a snow day on Jan 31, Feb 28, and March 31 (this is true because we assume that the chance of a snow day is the same every day and snow days are independent of one another).

This realization allows us to solve the original question of the chance of exactly three snow days simply by counting all the different ways we can choose 3 days out of 60 and multiplying that count by the chance of 1 in 150,000.  Luckily for us, it turns out there are quite a few ways to choose 3 days out of 60: about 34,000 ways.  This is computed by figuring out that there are 60 ways to choose the first snow day, 59 ways to choose the second and 58 ways to choose the third (60*59*58) and dividing this by 6 to account for the fact that the selection doesn’t need to distinguish between the first, second and third days.  They’re all just snow days and can be ordered 123, 132, 213, 231, 321 or 312 (i.e., in 6 ways).  Net-net, the chance of exactly three days being a snow day is about 34000/150000, or a little more than 1 in 5.
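The steps above can be sketched in a few lines of R, using the post’s numbers (p = 1 in 20, 60 school days, 3 snow days).  The variable names are made up for illustration; dbinom is R’s built-in binomial probability and serves as a cross-check on the hand calculation.

```r
p <- 0.05   # chance that any given school day is a snow day (the post's 1 in 20)
n <- 60     # school days in the winter
k <- 3      # the "sweet spot" number of snow days

one_particular_set <- p^k * (1 - p)^(n - k)  # e.g. the first three days and no others
ways_to_pick_days  <- choose(n, k)            # 60 * 59 * 58 / 6
by_hand  <- ways_to_pick_days * one_particular_set
built_in <- dbinom(k, n, p)                   # R's binomial probability function

cat("1 in", round(1 / one_particular_set), "for one particular set of days\n")
cat(ways_to_pick_days, "ways to pick the days\n")
cat("chance of exactly 3 snow days:", round(by_hand, 3), "\n")
```

Changing k, n, or p answers the same question for other sweet spots, season lengths, or climates, under the same independence assumptions.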

If you want to learn more about the general formulation of this type of problem, look up the Binomial Distribution (Wikipedia can be a good start).

Super Bowl Coin Toss

Is there any advantage to calling the coin toss in the Super Bowl?  It doesn’t appear so, as in the 49 games to date, tails has come up 25 times and heads has come up 24.  This is not surprising, given the conventional wisdom that a tossed coin comes up heads half the time.

Suppose, however, that you are playing a game with a coin toss, and you suspect that your opponent has an unfair coin, but you don’t know whether it comes up heads more often or tails more often.  As long as you are calling the toss, you can make the game fair by calling heads half the time and tails half the time (secretly flip your own coin first to decide).  Even if the game coin comes up, say, heads, all the time, you are fine, because you will have a 50-50 shot at choosing heads.  It is easy to check that no matter what percentage of the time heads comes up on the game coin, you will have a 50 percent chance of winning if you randomize your call.

Here’s how.  Suppose the game coin comes up heads a fraction p of the time.  Since you choose heads 50% of the time, you will win by having chosen heads .5*p of the time, and you will win by having chosen tails .5*(1-p) of the time.  Add these together and you get .5p+.5-.5p=.5=50%.  So you never need to worry about the fairness of your opponent’s coin as long as you call the toss.
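A short R sketch of the same sum, evaluated at a few values of p, shows the answer never moves off 50%.  The function name win_with_random_call is made up for illustration.

```r
# Chance of winning when you call heads or tails at random (50-50)
# and the game coin lands heads with probability p.
win_with_random_call <- function(p) 0.5 * p + 0.5 * (1 - p)

p_values <- c(0, 0.1, 0.5, 0.9, 1)
print(win_with_random_call(p_values))
```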

Some people have suggested a different solution.  It is as follows: flip two coins.  The caller calls even (both heads or both tails) or odd (one heads, one tails).  Assuming the two coins both have a p chance of coming up heads, this gives the following odds.  For Even it is p^2 (both heads) + (1-p)^2 (both tails), which equals 2p^2 + 1 - 2p.  It is easy to see that this hits its minimum at p=.5 (ok, not so easy if you don’t remember calculus, but take the derivative and set it to 0 to determine that p = .5 is a min or a max, and take the second derivative to determine that it is a minimum).
Thus, if you call even, your odds will be AT LEAST 50% = 2(.5)^2 + 1 - 2(.5).  If p is EITHER smaller or larger than 0.5, then your odds of winning by calling even are better than 50-50.

The second method is therefore advantageous to the person calling ‘even’ or ‘odd’, because by always calling ‘even’ the caller has odds of winning 50% if the coin is fair, and more than 50% if the coin is unfair, no matter whether it comes up heads more often or tails more often. 
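For anyone who would rather check the calculus numerically, here is a minimal R sketch.  The function name p_even is made up for illustration, and optimize is R’s built-in one-dimensional minimizer.

```r
# Chance that two flips of the same coin match ("even"), when heads has probability p.
p_even <- function(p) p^2 + (1 - p)^2

p_grid <- seq(0, 1, by = 0.1)
print(data.frame(p = p_grid, even = p_even(p_grid), odd = 1 - p_even(p_grid)))

# Numerical check of the calculus: the minimum over [0, 1] sits at p = 0.5.
best <- optimize(p_even, interval = c(0, 1))
cat("minimum near p =", round(best$minimum, 3), "with value", round(best$objective, 3), "\n")
```

The "odd" column is never above 50%, which is the flip side of the advantage to calling "even".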

Bridge splits re-visited

A couple years back, I wrote on the chances of various “splits” in bridge (and explained why this is something bridge players care about) in this post, which also explains the math behind the chances.

Here are the chances of the different splits of 7 trumps that are out, between the other two players.

4-3 split: 62.2%

5-2 split: 30.5%

6-1 split: 6.8%

7-0 split: 0.5%


For completeness, here are splits with 6 and fewer (from the prior post).

For hands with 6 trumps out:

3-3 split : 35.5%

4-2 split: 48.4%

5-1 split: 14.5%

6-0 split:  1.5%


For hands with 5 trumps out, we get:

3-2 split: 67.8%

4-1 split: 28.3%

5-0 split: 3.9%


For hands with 4 trumps out:

2-2 split: 40.7%

3-1 split: 49.7%

4-0 split: 9.6%


For hands with 3 trumps out:

2-1 split: 78%

3-0 split: 22%


For hands with 2 trumps out:

1-1 split: 52%

2-0 split: 48%

It’s worth mentioning that these probabilities are unconditional.  Since the bidding that precedes playing any given hand gives some information, it is typically true that some splits can be ruled out or downplayed.  For example, in the 4 spade hand I played last night, a 5-2 or (especially) worse split seemed unlikely, because there was no double from the other side, so I would’ve put the chances of a 4-3 split far higher than the unconditional 62%.
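The whole table can be regenerated with R’s choose function.  This is a minimal sketch; the function name split_chance is made up for illustration, and it assumes, as the table does, that nothing is known about the opponents’ 26 cards beyond the number of trumps among them.

```r
# Chance that the missing trumps split a-b between the two opponents,
# who hold 13 cards each (26 unseen cards in total).
split_chance <- function(a, b) {
  one_way <- choose(13, a) * choose(13, b) / choose(26, a + b)
  if (a == b) one_way else 2 * one_way  # uneven splits can go to either opponent
}

for (out in 7:2) {
  for (big in out:ceiling(out / 2)) {
    small <- out - big
    cat(sprintf("%d trumps out, %d-%d split: %.1f%%\n",
                out, big, small, 100 * split_chance(big, small)))
  }
}
```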

Going to College: what are the chances?

The NY Times answers it today.  It seems they have done a simple linear regression of percentile of income vs. college attendance (not graduation).  You can go to the site and guess the relationship (spoiler alert: below is how I guessed in purple against the actual in grey).  It seems I did better than 95% but most were still similar to mine.

[Image: my guess (purple) against the actual relationship (grey), from the NY Times feature “You Draw It: How Family Income Affects Children’s College Chances”]

What is a p value and why do you care?

I feel like I’ve written this too many times, but here we go again.

There was a splendid article in the New York Times today concerning Bayesian statistics, except that, as usual, it had some errors.


Lest you think me overly pedantic, I will note that Andrew Gelman, the Columbia professor profiled in much of the article, has already posted his own blog entry highlighting a bunch of the errors (including the one I focus on) here.


Concerning p-values the article states:

“accepting everything with a p-value of 5 percent means that one in 20 “statistically significant” results are nothing but random noise.”  This is nonsense.  I found this nonsense particularly interesting because I recently read almost this exact line in a work written by an MIT professor.


P-value explained in brief


Before I get to explaining why the Times is wrong, I need to explain what a p-value is.  A p-value is a probability calculation, first of all.  Second of all, it has an inherent assumption behind it (technically speaking, it is a conditional probability calculation).  Thus, it calculates a probability assuming a certain state of the world.  If that state of the world does not exist, then the probability is inapplicable.


An example: I declare: “The probability you will drown today is 99%.”  “Not true,” you say, “I am not going swimming today and am in the middle of a desert.”  “I forgot to mention,” I explain, “that this was under the assumption that you were in the middle of the Atlantic with no land for 40 miles and water that is 300 feet deep.”  The p-value is a probability like that: it is based on an assumption.


The assumption behind the p-value is often called a Null Hypothesis. The p-value is the chance of obtaining your particular favorable research result, under the “Null Hypothesis” assumption that your research is garbage.    It is the chance that, given your research is useless, you obtained a result at least as positive as the one you did.  But, you say, “my research may not be totally useless!”  The p-value doesn’t care about that one bit.


More detail using an SAT prep course example

Suppose we are trying to determine whether an SAT prep course results in a better score for the SAT. The Null Hypothesis would be characterized as follows:

H0: the average change in score after the course is 0 points or even negative.  In shorthand, we could call the average change in score D (for difference) and say H0: D<=0.  Of course, we are hoping the course results in a higher score, so there is also a research (alternative) hypothesis, HA: D>0.  For the purposes of this example, we will assume the change that occurs is wholly due to the course and not to other factors, such as the students becoming more mature with or without the course, the later test being easier, etc.


Now suppose we have an experiment where we randomly selected 100 students who took the SAT and gave them the course before they re-took the exam.  We measure each student’s change and thus calculate the average d for the sample (I am using a small d to denote the sample average while the large D is the average if we were to measure it across the universe of all students who ever existed or will exist).  Suppose that this average for the 100 students is a score increase of 40 points.  We would like to know, given the average difference, d, in the sample, is the universe average D greater than 0?  Classical statistics neither tells us the answer to this question nor does it even give the probability that the answer to this question is “yes.”


Instead, classical statistics allows us only to calculate the p-value: P(d>=40| D<=0).  In words, the p-value for this example is the probability that the average difference in our sample is 40 or more, given that the Universe average difference is 0 or less (Null Hypothesis is true).  If this probability is less than 5%, we usually conclude the Null Hypothesis is FALSE, and if the Null Hypothesis were in fact true, we would be incorrectly concluding statistical significance.  This incorrect conclusion is often called a false positive.  The chance of a false positive can be written in shorthand as P(FP|H0), where FP is false positive, “|” means given, and H0 means Null hypothesis.  (Technically, but not important here, we calculate the probability at D=0 even though the Null Hypothesis covers values less than zero, because that gives the highest (most conservative) value.)  If the threshold for statistical significance is set at a p-value of 5%, that means P(FP|H0)=5%.
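To make P(d>=40 | D=0) concrete, here is a rough R sketch.  It needs one number the example does not give, the spread of individual score changes, so sd_guess = 200 is purely an assumption for illustration, as is the use of a normal approximation for the sample average.

```r
n        <- 100   # students in the sample (from the example)
d        <- 40    # sample average score change (from the example)
sd_guess <- 200   # ASSUMED standard deviation of individual changes; not from the post

standard_error <- sd_guess / sqrt(n)
z       <- (d - 0) / standard_error       # evaluated at the boundary D = 0
p_value <- pnorm(z, lower.tail = FALSE)   # P(d >= 40 | D = 0), normal approximation

cat("z =", z, " p-value =", round(p_value, 4), "\n")
cat("significant at 5%?", p_value <= 0.05, "\n")
```

With a larger assumed spread the same 40 point average would give a larger p-value, which is a reminder that the p-value depends on the noise as well as on the size of the effect.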


A more general way of defining the p-value is that the p-value is the chance of obtaining a result at least as extreme as our sample result under the condition/assumption that the Null Hypothesis is true.  If the Null Hypothesis is false (in our example if the universe difference is more than 0), the p-value is meaningless.


So why do we even use the p-value?  The idea is that if the p-value is extremely small, it indicates that our underlying Null Hypothesis is false.  In fact, it says either we got really lucky or we were just assuming the wrong thing.  Thus, if it is low enough, we assume we couldn’t have been that lucky and instead decide that the Null Hypothesis must have been false.  BINGO–we then have a statistically significant result.


If we set the level for statistical significance at 5% (sometimes it is set at 1% or 10%), p-values at or below 5% result in rejection of the Null Hypothesis and a declaration of a statistically significant difference.   This mode of analysis leads to four possibilities:

False Positive (FP), False Negative (FN), True Positive (TP), and True Negative(TN).

False Positives occur when the research is useless but we nonetheless get a result that leads us to conclude it is useful.

False Negatives occur when the research is useful but we nonetheless get a result that leads us to conclude that it is useless.

True Positives occur when the research is useful and we get a result that leads us to conclude that it is useful.

True Negatives occur when the research is useless and we get a result that leads us to conclude that it is useless.

We only know if the result was positive (statistically significant) or negative (not statistically significant)–we never know if the result was TRUE (correct)  or FALSE (incorrect).  The p-value limits the *chance* of a false positive to 5%.  It does not explicitly deal with FN, TP, or TN.


Back to the Question of how many published studies are garbage, but it gets a little technical

Now, back to the quote in the article: “accepting everything with a p-value of 5 percent means that one in 20 “statistically significant” results are nothing but random noise.”

Let’s consider a journal that publishes 100 statistically significant results regarding SAT courses that improve scores and statistical significance is based on p-values of 5% or below.  In other words, this journal published 100 articles with research showing that 100 different courses were helpful.  What number of these courses actually are helpful?

Given what we have just learned about the p-value, I hope your answer is ‘we have no idea.’ There is no way to answer this question without more information.  It may be that all 100 courses are helpful and it may be that none of them are.  Why?  Because we do not know if these are all FPs or all TPs or something in-between–we only know that they are positive, statistically significant results.


To figure out the breakdown, let’s do some math.  First, create an equation, using some of the terminology from earlier in the post.

The number of statistically significant results = False positives (FP) plus True positives (TP).  This is simple enough.


We can go one step further and define the probability of a false positive given the Null hypothesis is true and the probability of a true positive given the alternative hypothesis is true: P(FP|H0) and P(TP|HA).  We know that P(FP|H0) is 5%, because we set it by only considering a result statistically significant when the p-value is 5% or below.  However, we do not know P(TP|HA), the chances of getting a true positive when the alternative hypothesis is true.  The absolute best case scenario is that it is 100%; that is, any time a course is useful, we get a statistically significant result.


Suppose that we know that a fraction B of courses are bad and a fraction (1-B) of courses are helpful.  Bad courses do not improve scores and helpful courses do.   Further, let’s suppose that N courses in total were considered, in order to get the 100 with statistically significant results.   In other words, a total of N studies were performed on courses and those with statistically significant results were published by the journal.  Let’s further assume the extreme concept above that ALL good courses will be found to be good (no False Negatives), so that P(TP|HA)=100%.  Now we have the components to figure out how many bad courses are among the 100 publications regarding helpful courses.


The number of statistically significant results is :

100= B*N*P(FP|H0) + (1-B)*N*P(TP|HA)

This first term just multiplies the (unknown) percent of courses that are bad by the total studies performed by the percent of studies that will give the false positive result that says the course is good.  The second term is analogous, but for good courses that achieve true positive results.  These reduce to:

100 = N(B*5% + (1-B)*100%)  [because the FP chances are 5% and TP chances are 100% ]

= N(.05B +1 – B)      [algebra]

= N(1-.95B)              [more algebra]

==> B =  (20/19)*(1- 100/N) [more algebra]

The number of bad courses among those published equals B*N*P(FP|H0), which in turn equals (1/19)*(N-100) [using more algebra].


If you skipped the algebra, what this comes down to is that the number of bad courses published depends on N, the total number of different courses that were researched.

If N were 100, then 0 of the publications were garbage and all 100 were useful.

If the N were 1,000, then about 947 were garbage, about 47 of which were FPs and thus among the 100 publications.  So 47 garbage courses were among the 100 published.

If the total courses reviewed were 500, then about 421 were garbage, about 21 of which were FPs and thus among the 100 publications.

You might notice, that given our assumptions, N cannot be below 100, the point at which no studies published are garbage.

Also, N cannot be above 2000, the point at which all studies published are garbage.
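The cases above can be tabulated with a short R sketch of the same algebra.  The names published, alpha, power and breakdown are made up for illustration, and power = 1 is the best case assumption from the text (no False Negatives).

```r
published <- 100   # statistically significant results in the journal
alpha     <- 0.05  # P(FP | H0)
power     <- 1     # P(TP | HA), the best case assumed above

breakdown <- function(N) {
  B <- (1 - published / N) / (1 - alpha)   # same as (20/19) * (1 - 100/N)
  c(studies = N,
    bad_courses = B * N,
    false_positives = B * N * alpha,
    true_positives = (1 - B) * N * power)
}

print(round(t(sapply(c(100, 500, 1000, 2000), breakdown)), 1))
```

In every row the false positives and true positives add up to the 100 published results; only the mix changes with N.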


You might be thinking–we have no idea how many studies are done for each journal article accepted for publication though, and thus knowing that 100 studies are published tells us nothing about how many are garbage–it could be anything from 0 to 100% of all studies! Correct.  We need more information to crack this problem. However, 5%  garbage may not be so terrible anyway.


While it might seem obvious that 0 FPs is the goal, such a stringent goal, even if possible, would almost certainly lead to many more FNs, meaning good and important research would be ignored because its statistical significance did not meet a more stringent standard.  In other words, if standards were raised to 1% or 0.1%, then some TPs under the 5% standard would become FNs under the more stringent standard, important research–thought to be garbage–would be ignored, and scientific progress would be delayed.


To Huck or not to Huck?

I play a lot of Ultimate Frisbee, a game akin to football in that there are end zones, but akin to soccer in that there is constant action until someone scores.  In Ultimate, you can only advance by throwing the disc (so-called because we generally do not use Wham-O branded discs, which are called Frisbees). An incomplete pass or a pass out of bounds is a turnover, as is a “stall,” where the offense holds the disc without throwing for more than 10 seconds.

In other words, in order for the offense to score, you need to complete passes until someone catches the disc in the end zone.  The accepted method of doing this is to complete shorter, high-percentage passes.  On a non-windy day, it seems fairly simple for at least one of your six teammates to get open and thus you can march down the field.  Of course, one long pass, or “huck,” can shortcut the process and give your team the quick score.  Much like football, the huck is not typically done except in desperation (game almost over due to time or thrower almost stalled).

However, I am not at all sure this logic makes sense.  Suppose you need six short passes to advance to a score.  If your team completes short passes with a probability of 90%, you will score about 53% of the time (90% to the sixth power gives the chances of completing six passes in a row).  In other words, as long as the chance of completing the huck is more than 53%, you would have a better chance of scoring with a huck.

Thus, the relative chances of scoring via the two methods depend on three things: 1) chance of completing a short pass, 2) chance of completing a huck, and 3) number of short passes needed for a score.  The graph below shows the threshold huck completion rate (the rate above which it makes more sense to huck) for different short pass completion rates, always assuming 6 short passes are enough for a score and one huck is enough for a score.

In case it is difficult to see, at a 95% short pass completion rate, your huck percentage needs to be 74% in order for it to be better to huck.  If your short pass completion rate is 50%, huck away unless your huck completion rate is less than 2%.
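The threshold is simply the short pass completion rate raised to the number of passes needed.  Here is a minimal R sketch; the function name huck_threshold and the particular rates listed are chosen for illustration.

```r
# A huck is the better bet once its completion rate beats the chance of
# stringing together the required number of short passes.
huck_threshold <- function(short_pass_rate, passes_needed = 6) short_pass_rate^passes_needed

rates <- c(0.50, 0.80, 0.90, 0.95, 0.99)
print(data.frame(short_pass_rate = rates,
                 huck_threshold  = round(huck_threshold(rates), 3)))
```

Passing a different passes_needed shows how quickly the case for hucking improves when the field is long.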

Of course, this simple analysis assumes 6 throws equals a score, and it also leaves out a number of other factors.  For example, an incomplete huck confers a field advantage to the hucking team because the opposing team has to begin from the point of the incompletion (as long as it was in-bounds).  On the other hand, it may not take long for the opposing team to figure out the hucking strategy and play a zone style defense that will lower the hucking chances considerably.

Citibike Rides–what are the chances?

I have been working with Joe Jansen on the Citibike data in the R Language.  Citibike is New York’s bike sharing program, which started in May and currently has more than 80,000 annual members.  The R Language is a freely available programming language for statistics, modeled on the S language that was originally designed at Bell Labs.

Joe has downloaded all the data and done an extensive analysis, which you can find here.  I did a simpler regression model and graphed it using ggplot2 in R.  I found maximum temperature, humidity, wind, and amount of sunshine to be significant factors (rain was not, but of course sunshine and rain are confounded, so you wouldn’t necessarily assume both would be important factors in a regression model).  The day of the week, surprisingly, was not.  The R-squared for my model, which predicts trips per 1,000 (annual) members, was more than 70%.

Here is a graph of the results of predicted versus actual, with the day of the week shown by colored points.

You can see I am an amateur at ggplot, as the legend for day of the week has the days of the week out of order (but in alphabetical order).  Help on that and other aspects of ggplot for this graph would be welcome (please comment accordingly).
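On the legend order: ggplot2 lists a discrete legend in the order of the factor’s levels, and a plain character column gets alphabetical levels by default.  A minimal sketch of the usual fix follows; the data frame trips and its column weekday are hypothetical stand-ins, not the actual Citibike data.

```r
# Hypothetical stand-in for the trip data; `weekday` is an assumed column name.
trips <- data.frame(weekday = c("Friday", "Monday", "Saturday", "Sunday",
                                "Thursday", "Tuesday", "Wednesday"))

day_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# A factor's levels, not the alphabet, decide the legend order in ggplot2.
trips$weekday <- factor(trips$weekday, levels = day_order)
print(levels(trips$weekday))

# With ggplot2 loaded, mapping colour = weekday would then list the days in this order.
```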

If day of the week made a difference, for any given point on the x-axis (predicted trips) you would have more of a certain color that is high on the y-axis than other colors.  For example, if more trips occurred on weekends, you would have more of the green colors (Saturday and Sunday) on top.  However, no such effect seems to exist.  I guess people are enjoying Citibike every day of the week, or casual riders on the weekends are roughly making up for weekday commuting riders.

What are the chances of different “splits” in bridge?

If you know how to play bridge, skip to the fourth paragraph!
In bridge, 13 cards are dealt to each of 4 players (so all 52 cards are dealt).  Players sitting across from each other are partners, so we could think of the two teams’ positions as North and South and East and West on a compass.  A process of “bidding” ensues, in which the team with the highest bid has selected a “trump” suit and a number of rounds, or “tricks,” that they have contracted to take.


Suppose North-South had the highest bid and North is playing the hand.  Then East “leads” a card, meaning East places a card (any card he/she wants) face up on the table.  The play goes clockwise, East-> South-> West -> North.  South, West and North must play a card of the same suit that East played.  When four cards are down, the highest one wins the “trick” and that winner puts any card of his/hers down, in order to begin a new trick.  Play continues until 13 rounds of 4 cards each have been played.


Suppose that West wins a trick and thus gets to play a card.  He plays the Ace of Hearts.  North, who is next and otherwise required to play hearts, is out of hearts.  North can play any other suit, but if he chooses to play the “trump” suit (say Spades are trump), then he automatically wins the trick unless East or South is also out of hearts and plays a higher card in Spades (the trump suit).  In other words, trumps are very valuable.  In the bidding process, the teams try to bid in such a way that the trump suit is one in which they have a lot of cards.  Generally, the team with the winning bid (the “contract”) will have at least 7 of the 13 trumps between the two of them, meaning the other team will have 6 or fewer.  Whatever number the opponents have, it is generally advantageous to the contract winners if the opponents have the same number each rather than the trumps being skewed to one or the other opponent.


Bridge players begin here:
So here is the probability piece.  Suppose you and your partner hold 7 trumps between you, so the opponents hold the other 6.  What are the chances the opponents each have 3?  Have 4 and 2?  Have 5 and 1?  Have 6 and 0?  To solve this sort of problem, we use combinations.  See my earlier post for some detail (and more odds of bridge hands).


The opponents have 26 cards altogether and we want to know the number of different groups of six among those 26 cards.  Think of this as a process of picking six cards from the 26.  You have 26 choices for the first card, 25 for the second, and so on, and thus there are 26*25*24*23*22*21 total ‘permutations’ of size 6.  However, we do not care what order the cards are in.  Any particular set of six cards can be arranged in 6*5*4*3*2*1 different orders (6 possible positions for the first card, 5 for the second, etc.), and thus we need to divide the number of permutations by 6*5*4*3*2*1 in order to get the number of unique sets when order does not matter.  Again, see my earlier post for a more detailed explanation of this concept.
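
As a quick check of the counting argument above, here is a small R sketch that divides the ordered count by 6! and compares the result with R’s built-in choose().  The variable names are just mine, picked for illustration.

```r
# Ordered ways to pick 6 cards from the opponents' 26 cards
ordered_picks <- 26 * 25 * 24 * 23 * 22 * 21

# Each unordered set of 6 cards shows up 6! times among the ordered picks
orderings_per_set <- factorial(6)

unordered_sets <- ordered_picks / orderings_per_set

# choose(26, 6) should give the same number of unordered sets
stopifnot(isTRUE(all.equal(unordered_sets, choose(26, 6))))
print(unordered_sets)
```

Both routes give 230,230 possible sets of six cards.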


The R language allows for calculation of this combination of 6 out of 26 with the command “choose(26,6).”  This is the denominator when we calculate probabilities, because it gives the total number of equally likely ways the 6 trumps can fall among the opponents’ 26 cards.  For the numerator, we count the two opponents’ hands of 13 cards each separately and multiply.  With an even 3-3 split, there are “13 choose 3” ways for the trumps to fall in each hand.
To calculate that probability in R, we write:   choose(13,3)*choose(13,3)/choose(26,6) and get 35.5%.


How about hands with a 4-2 split?  That is the number of ways Opponent 1’s hand can have 4 trumps multiplied by the number of ways Opponent 2’s hand can have 2 trumps PLUS the number of ways Opponent 2’s hand can have 4 trumps multiplied by the number of ways Opponent 1’s hand can have 2 trumps, all divided by the same denominator.  Since the two cases are equally likely, we can just double the probability of Opponent 1 having 4 and Opponent 2 having 2.  We get: choose(13,4)*choose(13,2)*2/choose(26,6) = 48.4% chance of one opponent having 4 and the other having 2 trumps.


Continuing this calculation, we get the following chances for hands with 6 trumps in the opponents’ hands (6 trumps “out”):
3-3 split : 35.5%
4-2 split: 48.4%
5-1 split: 14.5%
6-0 split:  1.5%


For hands with 5 trumps out, we get:
3-2 split: 67.8%
4-1 split: 28.3%
5-0 split: 3.9%


For hands with 4 trumps out:
2-2 split: 40.7%
3-1 split: 49.7%
4-0 split: 9.6%


For hands with 3 trumps out:
2-1 split: 78%
3-0 split: 22%


For hands with 2 trumps out:
1-1 split: 52%
2-0 split: 48%


I find it interesting that the even split (for 2, 4, or 6 trumps out) is only the most likely scenario when 2 trumps are out.  When 4 trumps are out, a 3-1 split is more likely.  When 6 are out, a 4-2 split is more likely.
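
The split tables above can be reproduced with a few lines of R.  This is only a sketch: split_prob and split_table are helper names I made up here, and they assume (as in the calculations above) that each opponent holds 13 of the 26 unseen cards.

```r
# Probability that the missing trumps split a-b between the two opponents,
# where each opponent holds 13 of the 26 unseen cards.
split_prob <- function(a, b) {
  out <- a + b
  ways <- choose(13, a) * choose(13, b)
  # An uneven split can go either way (e.g. 4-2 or 2-4), so count it twice
  if (a != b) ways <- 2 * ways
  ways / choose(26, out)
}

# All splits for a given number of trumps out, most even split listed first
split_table <- function(out) {
  larger <- out:ceiling(out / 2)
  probs <- sapply(larger, function(a) split_prob(a, out - a))
  names(probs) <- paste0(larger, "-", out - larger)
  rev(probs)
}

for (out in 6:2) {
  cat(out, "trumps out:\n")
  print(round(100 * split_table(out), 1))
}
```

Calling names(which.max(split_table(4))) returns "3-1", which matches the observation that the even split is not the most likely one when 4 trumps are out.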



Simpson’s Paradox

A North Slope real estate broker (named North) is trying to convince you that North Slope is a more affluent neighborhood than South Slope.  To prove it, he explains that professionals in North Slope earn a median income of $150,000, versus only $100,000 in South Slope.  Working class folks fare better in North Slope also, with hourly workers making $30,000 a year to South Slope’s $25,000.


The South Slope real estate broker (named South) explains that North is crazy.  South Slope is much more affluent.  The median income in South Slope is $80,000 versus the North Slope median of $40,000.


Question: Who is lying, North or South?
Answer: It could be neither.
Consider the breakdown of income shown below.

We can see that North is not lying.  Half the hourly South Slope workers earn $20K and half $30K, for a median of 25K.  A similar calculation for the North Slope workers yields an hourly median of 30K.  For professionals in the South Slope, the median is $100K, with half earning $80K and half earning $120K.  In the North Slope, a similar calculation yields the median of $150,000.


South is not lying either.  For the South Slope, the median is $80,000, since at least half of the workers make $80,000 or less and at least half make $80,000 or more (according to the definition of median, at least half must be at or above the median and at least half must be at or below it).  For the North Slope, the median is $40,000.


What happened here?  The problem, and the reason for the conflict between the wages according to type of work and the overall wages, is that the percentage of residents in each category does not match.  Though both professionals and hourly workers make more in the North Slope, a far larger share of North Slope residents are hourly workers than is the case in the South Slope.  Thus, the overall median (or mean) income is lower in the North Slope.
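
To see the reversal in numbers, here is a minimal R sketch.  Only the medians come from the example above; the head counts and the individual incomes (in thousands of dollars) are hypothetical values I picked so that those medians come out right.

```r
# Hypothetical head counts and incomes (in $ thousands), chosen only to
# reproduce the medians quoted in the example; they are not real data.
south <- data.frame(
  type   = rep(c("hourly", "professional"), times = c(20, 80)),
  income = c(rep(c(20, 30), each = 10), rep(c(80, 120), each = 40))
)
north <- data.frame(
  type   = rep(c("hourly", "professional"), times = c(80, 20)),
  income = c(rep(c(20, 40), each = 40), rep(c(100, 200), each = 10))
)

# Medians within each type of work: North Slope is higher in both
print(tapply(south$income, south$type, median))
print(tapply(north$income, north$type, median))

# Overall medians: South Slope is higher
print(c(south = median(south$income), north = median(north$income)))
```

In this made-up population, 80% of North Slope residents are hourly workers versus 20% in the South Slope, and that difference in mix is what flips the overall comparison.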


While Wikipedia has an entry for Simpson’s Paradox, a specific example of which I described above, it seems that most people are unaware of it.  My motivation for writing about it is not the made-up example I present above but the fact that I encounter it so much in my everyday work.  I either make my clients very happy by explaining that the ‘bad’ effect they have found may well be spurious, or anger them when I explain that the interesting relationship they have found is a mere statistical anomaly.


The Worst Graph

One reason for quotes like “there are lies, damn lies, and statistics” is graphs like these:


This was on the front of this morning with the caption: “Huge US Oil Boom ahead: The U.S. will overtake Saudi Arabia to become the world’s biggest oil producer before 2020.”

I was shocked at first glance, because I thought oil production was going to go up 10 or 20-fold from the tiny amount in 2011 to the huge amount in 2015.  That does indeed sound huge.  Then I looked at the left y-axis, where I can see it is only going from 8 million to 10 million barrels a day, an increase of about 25%.  

Fine, you say, but you can still easily see that the light blue bar is above the dark blue bar starting in 2025, showing the US overtakes Saudi Arabia.  

I’m afraid not.  The two bars are not Saudi Arabia versus US production but oil versus gas production, and it is not even clear whose production is depicted.  Is it the whole world, the US, Saudi Arabia?  The article puts US production at 5.8 million barrels a day in 2011, so it appears not to be US production, but other sources put it at closer to 9 million, so maybe it is the US.  

Ok, you say, despite the poor caption, at least you can clearly see that gas production begins to top oil production (in whatever country the graph is depicting) around 2025.  

Not really.  Since oil is in millions of barrels per day and the gas is in billions of cubic meters (per day, per month, per year, who knows?), this is actually not the case.  The year 2030 shows oil at about 10 million barrels per day and gas at nearly 800 billion cubic meters.  Which is more?  Maybe the readers of money can quickly translate these figures into BTUs or some useful measure of production output, but I sure can’t tell you.

Fine, you say, but since they start at about the same level, we at least know that gas increases more than oil over the time period.  

Sorry, even that is incorrect.  Look at the scale on the left axis (oil), which starts at 8 and goes to 12, a 50% increase.  The right axis starts at 600 and goes to 800, a 33% increase.  Thus, oil goes from 8 to just over 10 (more than a 25% increase) while gas goes from a little over 600 to just a little under 800 (less than a 33% increase, so maybe a little more than oil but maybe not).
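
The axis arithmetic is easy to check in R.  The helper pct_increase is just a name I made up for this sketch, and the inputs are the rough values read off the graph’s axes and bars.

```r
# Percent increase from a starting value to an ending value
pct_increase <- function(from, to) 100 * (to - from) / from

oil_axis <- pct_increase(8, 12)     # full span of the left (oil) axis
gas_axis <- pct_increase(600, 800)  # full span of the right (gas) axis
oil_bars <- pct_increase(8, 10)     # oil bars, read as roughly 8 to 10

print(round(c(oil_axis = oil_axis, gas_axis = gas_axis, oil_bars = oil_bars), 1))
```

Because the two axes cover different percentage ranges, bars of the same height do not represent the same relative change.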

The only thing that appears to be correct about this graph is the year, until you realize that in the first period, there are only four years (2011-2015) while in the other periods, there are five year differences.


Election Polls

With the upcoming election, I have been following my favorite prediction site.  That site has a big map showing current predictions state-by-state as well as the overall electoral vote prediction.  It also shows the senate predictions.  It has been amazingly accurate in the past (though, of course, this doesn’t mean that the site’s predictions won’t change considerably between now and the election). The predictions are all based on some sort of averaging of polls, and the site shows the results of each poll.  What I have found interesting (and it has been noted on the site) is that some polls appear to lean toward Obama while others lean toward Romney.  In other words, the polls appear to have biases.


Why?  Theories abound about this, and much of it comes down to the polling methodology.  The most compelling reason I have seen comes from Nate Silver’s blog on the New York Times site.  Silver’s blog compares traditional polls, which call only land-line phones, with more modern polls, which call cell phones along with land-lines.


As shown by a chart in Silver’s blog, there is a clear and consistent difference in every swing state between the two types of polls, with modern polls leaning toward Obama.  This is consistent with the idea that younger people are both more likely to vote for Obama and more likely to not have landlines.  This issue has been pointed out before, and a Pew Research Report in 2010 noted substantial differences in party affiliation between voters who had a landline and those who only had a cell phone.


There is no doubt that the percentage of homes without landlines is rising rapidly.  See, for example, the CDC Report from last year, showing that about 30% of adults did not have a landline in 2011, about twice the percentage in 2008.  This increase in wireless-only homes does not necessarily mean an increase in bias (more and more Republicans may be shedding their landlines, and thus the bias could fall even as wireless-only usage increases).  Still, the departure in the polls indicates that a bias persists.


Born to Run?

About a year ago, I read a book called “Born to Run,” by Christopher McDougall, who last week wrote an article in the New York Times Magazine on the same subject.


McDougall’s basic premise is that we were faster and less injury-prone before we started wearing all these fancy running shoes and that they are what’s causing running injuries. For example, in the New York Times article:


“Back in the ’60s, Americans ‘ran way more and way faster in the thinnest little shoes, and we never got hurt,’ Amby Burfoot, a longtime Runner’s World editor and former Boston Marathon champion, said during a talk before the Lehigh Valley Half-Marathon I attended last year. ‘I never even remember talking about injuries back then,’ Burfoot said. ‘So you’ve got to wonder what’s changed.’”

Statistics frowns on such anecdotal evidence, though it does make a good story. Did we really run faster? There are a lot of facts that we can look at, though average times aren’t among them. Marathon records (shown in Wikipedia) for men have indeed only downticked a little since the sixties. In 1970, Ron Hill of the UK (close enough, running-shoe-wise, to be considered American?) set a record of 2:09:29. This year, a new record of 2:03:38 was set (the most recent US record was 2:05:38 in 2002). Six minutes in 40 years doesn’t seem like much, but is it because of the shoes or because the sport has matured? And are Americans seen less because running isn’t really a big competitive sport here?
When you look at women’s times, the changes are much more dramatic. Women began running marathons more recently, and fewer participated in the sport in general until relatively recently. In 1970, the women’s marathon record was 3:02:53 (set by an American). In 2003, Paula Radcliffe (England) ran it in 2:15:25. That’s a 47-minute improvement, or nearly 2 minutes per mile. In the 2011 New York Marathon, 40 women from the US bested the 1970 record time (see marathon site here for results).
So, I can’t agree that we ran “way faster” 40 years ago. This doesn’t mean that barefoot runners are slower than shod runners, because changes over the last forty years in the level of competition, and improvements in training and fitness, rather than shoes, might have been the factors contributing to improved times.
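
For anyone who wants to check the arithmetic on the record times quoted above, here is a short R sketch. The helper to_minutes is a name I made up, and 26.2 miles is the usual rounded marathon distance.

```r
# Convert an "h:mm:ss" marathon time to minutes
to_minutes <- function(hms) {
  parts <- as.numeric(strsplit(hms, ":")[[1]])
  parts[1] * 60 + parts[2] + parts[3] / 60
}

marathon_miles <- 26.2  # approximate marathon distance in miles

# Record times as quoted in the post
men_gain   <- to_minutes("2:09:29") - to_minutes("2:03:38")
women_gain <- to_minutes("3:02:53") - to_minutes("2:15:25")

gains <- c(men = men_gain, women = women_gain)
print(round(gains, 1))                   # minutes gained over the whole race
print(round(gains / marathon_miles, 2))  # minutes gained per mile
```

The men’s record improved by a little under 6 minutes and the women’s by about 47 and a half minutes, which works out to roughly 1.8 minutes per mile.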
How about injury rates? Do people get more injuries with running shoes than without? Unfortunately, any data on injury rates is tainted by the changes in the makeup of the population that runs (from a small, highly fit population to a large population more varied in fitness; think of the then-overweight President Clinton running with a stop at McDonald’s post-jog), and there haven’t been any studies that directly compare injuries over time for barefoot running against running shoe running. A good summary article is here.
A recent article in Nature, while not looking at historical data, supports McDougall’s contention that running shoes can be more harmful than bare feet when running. The article is lead-authored by Daniel Lieberman, a big advocate of barefoot running, so his bias may have been to look at things he believed were helpful about barefoot running and not at aspects of barefoot running that may be harmful. The article looks at impact forces and not at injuries, and doesn’t consider that runners with shoes may be able to change their stride to reduce the impact forces (McDougall says this is hard to do with running shoes, and, from my own experience, I tend to agree, though I don’t think it is impossible).
The statistical net-net is that there is no direct evidence either way right now. I admit some bias, but the lack of evidence, given the power and money behind the shoe industry, tends to make me believe that, at best, fancy shoes are no better than bare feet. If there were an effect in favor of shoes, I would certainly think we’d have seen a study by now (this is something correctly pointed out by McDougall and other advocates of barefoot running). Therefore, don’t be surprised if you see me running with feet au naturel someday soon.

Detecting cheating

In my professional work, I like being the statistical sleuth, trying to figure out whether a person or company cheated, and how much they cheated. Thus it was with a lot of interest that I read a recent article in USA Today describing suspicious activity on some standardized tests in DC schools.


It seems that standardized test scores at certain DC schools have improved dramatically. For example, the article says, “in 2008, 84% of fourth-grade math students were listed as proficient or advanced, up from 22% for the previous fourth-grade class.” Of course, this could just be part of the amazing turnaround.
However, the review found that this dramatic change corresponded with another interesting statistic: the school had a very high number of erased answers that were changed from wrong answers to right answers (WTR erasures). Again, here’s what the article said: “On the 2009 reading test, for example, seventh-graders in one Noyes classroom averaged 12.7 wrong-to-right erasures per student on answer sheets; the average for seventh-graders in all D.C. schools on that test was less than 1. The odds are better for winning the Powerball grand prize than having that many erasures by chance.”
Here’s my problem with this logic: the calculation of the chances assumes that each student is acting independently and erasing much more than usual. In other words, the chances are calculated assuming that the students are randomly grouped by school with respect to the number of WTR erasures they have, and thus no school should have a particularly high or low number of erasures: number of erasures and the associated school would be statistically independent.
This statistical independence assumption falls apart if there is cheating, wherein teachers erase wrong answers and change them to correct answers after the test is completed. However, the statistical independence assumption could also fall apart for innocuous reasons.
Suppose the students at this school were instructed to arbitrarily fill in the last 10 questions immediately upon beginning the exam (this might be a good strategy if there is no penalty for guessing and if many students do not finish the exam). Then, the ones who get to the end of the test are erasing most of their guesses. This is a completely legitimate strategy, but it would raise the number of WTR erasures a great deal. A lot of more complicated test-taking strategies would also lead to more erasures, and if this school in particular taught those strategies, there would be a very high chance of far more erasures at this school than at others. Some of the people interviewed cited strategies that may have led to more erasures.
Thus, the high erasure rate, even WTR erasures, may have a relatively simple explanation: this school effectively coached the kids in test taking while other schools did not or coached the children differently.
The article provides a link to several documents summarizing the results of the analysis. What I find interesting is that the worst school in terms of WTR erasures, BS Monroe ES, also has a lot of WTW (wrong-to-wrong) erasures. On average, this school has between 2 and 3 WTW erasures per student, or about 1 WTW for every 5 WTR erasures. A more interesting, and I think more revealing, analysis would be to see how this ratio compares to the normal ratio. If the normal ratio is 1 WTW to 5 WTR, it indicates cheating may not have been the reason for the erasures (unless the cheaters were purposefully erasing some answers and changing them to wrong answers, which seems unlikely since there is no indication potential cheaters realized erasures could be detected at all). If the general ratio is far from 5 to 1, it would be another indicator of a different process going on at BS Monroe ES, perhaps involving cheating, though it is still hard to rule out other, innocuous explanations that involve test-taking strategy.
Another analysis would be to look at the WTR vs. WTW erasures student by student. Presumably, students who answered a higher percentage of un-erased problems correctly would have a better ratio of WTR to WTW erasures. If that were not true, then it would lead more clearly to the conclusion that someone else was doing the erasing.
The research revealed in the article shows the correlation of two things: a dramatic increase in test scores and a dramatic number of WTR erasures. Cheating is one explanation for these increases. Another, however, is the implementation of a smart test-taking strategy at the school, which might well be part of an overall program to increase the test scores and improve the school. A statistical test can have a seemingly dramatic result (less likely than winning the lottery), but while defeating a specific hypothesis (independence of erasures by school), it doesn’t necessarily prove another hypothesis (cheating).
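
Here is a rough R sketch of that last point. It is not the calculation the article’s analysts did (I don’t know their method). It simply assumes each student’s WTR erasure count follows a Poisson distribution, and the class size of 25 and the district average of exactly 1 are numbers I made up for illustration.

```r
# Assumed inputs: the article reports only the two averages
# (12.7 for the classroom, "less than 1" for the district).
class_size   <- 25    # hypothetical number of students in the classroom
district_avg <- 1     # hypothetical district-wide mean WTR erasures per student
class_avg    <- 12.7  # classroom average reported in the article

observed_total <- ceiling(class_size * class_avg)

# Hypothesis 1: students erase independently at the district rate, so the
# classroom total is Poisson with mean class_size * district_avg.
# Work on the log scale because the tail probability is extremely small.
log10_p_independent <- ppois(observed_total - 1,
                             lambda = class_size * district_avg,
                             lower.tail = FALSE, log.p = TRUE) / log(10)

# Hypothesis 2: everyone in the classroom shares a higher erasure rate,
# for whatever reason (coaching in test-taking strategy, or tampering).
p_shared <- ppois(observed_total - 1,
                  lambda = class_size * class_avg,
                  lower.tail = FALSE)

print(log10_p_independent)  # log10 of the chance under independence
print(p_shared)             # chance under a shared classroom-wide rate
```

Under these assumptions the first probability is astronomically small, while the second is close to one half. The data reject independence very firmly, but they cannot tell us why the classroom shared a high erasure rate.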

Throw away your cold medicine again?

A couple years ago, I wrote about a study that looked at the effect of a seawater nasal spray on the health of children (see that post).


Yesterday’s New York Times explored a very similar claim. Anahad O’Connor’s column, “Really? The Claim: Gargling With Salt Water Can Ease Cold Symptoms,” looks at a study of 387 Japanese adults aged 18 to 65 (see this page for an abstract). Treatment groups gargled with PLAIN water or a “povidone-iodine” solution. Those gargling with plain water did the best, with 0.17 URTIs (upper respiratory tract infections) every 30 person-days, meaning about 1 in 6 get a URTI per month if they gargle with water. The control group had a rate of .26, meaning about 1 in 4 got a URTI. The iodine group had a rate of .24, also meaning about 1 in 4 got a URTI.
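
The “1 in N” figures above are just reciprocals of the monthly rates, which a couple of lines of R can confirm. The relative reduction at the end is my own added calculation from the same three rates, not a figure quoted from the study.

```r
# URTI rates per 30 person-days, as quoted above
rates <- c(water = 0.17, iodine = 0.24, control = 0.26)

# "About 1 in N per month" is the reciprocal of the monthly rate
one_in_n <- 1 / rates
print(round(one_in_n, 1))

# Relative reduction in the rate compared with the control group, in percent
reduction <- 100 * (1 - rates / rates[["control"]])
print(round(reduction, 1))
```

Treating a rate per 30 person-days as “1 in N people per month” is an approximation, since one person could have more than one infection in a month.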


So water looks pretty good. The only caveat, and it is the same as the issue I mentioned in the earlier post, is that the outcomes were self-measured. The people doing the gargling reported whether or not they had a URTI. In Japan, where the study was performed, there is a strong bias toward water gargling, at least according to the abstract of the study, which says: “Gargling to wash the throat is commonly performed in Japan, and people believe that such hygienic routine, especially with gargle medicine, prevents upper respiratory tract infections (URTIs).” In fact, the article reports that those in the control group gargled one time a day on average as well (but those in the treated group gargled around 3 times a day). This affinity for water gargling and the belief that it stops infection may result in water-garglers reporting fewer infections, thus throwing the results of the study into question.


The New York Times, by the way, gives recommendations based on an upcoming book by Philip Hagen, to gargle with *salt* water, but cites this study, which is referring to *plain* water only.


My conclusion? If you THINK it is going to work, it’s fairly likely water gargling will be effective, and it is a lot cheaper than buying some kind of preventative medicine. If you don’t think it will work, this study provides little help in deciding whether it actually will work.

You asked for it, you got it. Toyota!

I think that’s how the ad line went. When? maybe 25 years ago.


Well, it seems to apply now. Sudden acceleration. Mention a problem with a car, any problem with any car, and people will start crawling out of the woodwork with the complaint. Why? It’s a numbers game. There were more than 100,000 pri-i(?) sold in the US in 2005-9. With that many people driving them around, any tiny problem that is reported is going to be “substantiated” by others. Those of us old enough to remember the Audi 5000 found the high correlation between those Audis with sudden acceleration and those sold to 85-year-old ladies inexplicable (studies mostly concluded it was driver error–see a recent article here in Wired).


The latest, after the brake-related Prius recall, is the claim of sudden acceleration. A guy in California managed to call 911 while it was happening–pretty amazing, huh? Unless, of course, you made it all up. Here’s what the current thoughts about it are (from wikipedia):
“On March 8, 2010, a 2008 Prius allegedly uncontrollably accelerated to 94 miles per hour on a California Highway (US), and the Prius had to be stopped with the verbal assistance of the California Highway Patrol as news cameras watched. Subsequent to the event, media investigations uncovered suspicious information about the alleged runaway Prius driver, 61-year old James Sikes, including false police reports, suspect insurance claims, theft and fraud allegations, television aspirations, and bankruptcy. Sikes was found to be US$19,000 behind in his Prius car payments and had US$700,000 in accumulated debt. Sikes stated he wanted a new car as compensation for the incident. Analyses by Edmunds and Forbes found Sikes’ acceleration claims and fears of shifting to neutral implausible, with Edmunds concluding that ‘in other words, this is BS’, and Forbes comparing it to the balloon boy hoax.”


Notwithstanding the apparent CA tale above, the reality is that the rare problem is a tough nut to crack statistically. Suppose there is an issue in 1 in 10,000 Priuses and that this issue only crops up on one in 10,000 rides in those cars. Thus, it occurs on 1 in 100 million Prius rides. Even among those, it may be a very short-lived problem and not cause any injury or accident. Such a rare problem might be drowned out by other driver error problems, such as accidentally hitting the gas instead of the brake, perceiving that the car is accelerating when it is not, or hitting both the gas and the brake simultaneously in an attempt to hit the brake. Each of these things can be exceedingly rare (1 in a million) and still be 100 times as common as the real problem.
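
The numbers in that paragraph are themselves made up for the sake of argument, but the arithmetic can be laid out in a few lines of R. The one billion rides at the end is an extra hypothetical figure of mine, added only to turn the probabilities into counts.

```r
# Hypothetical rates from the paragraph above
p_car_affected  <- 1 / 10000  # share of cars that have the defect
p_ride_triggers <- 1 / 10000  # chance the defect shows up on a ride in such a car
p_driver_error  <- 1 / 1e6    # chance of one kind of driver error on a ride

# Chance that a randomly chosen ride hits the real defect
p_defect_ride <- p_car_affected * p_ride_triggers

# How many times more common the driver error is than the real defect
error_to_defect <- p_driver_error / p_defect_ride
print(error_to_defect)

# Expected counts over a hypothetical one billion rides
rides <- 1e9
print(c(defect = rides * p_defect_ride, driver_error = rides * p_driver_error))
```

With these rates, about 10 real incidents would be buried among about 1,000 driver errors of a single kind.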


There are other ways to go about teasing out rare events. In the lab, a machine could possibly simulate conditions that were occurring when the supposed sudden acceleration took place and see if it is repeatable. Yet these conditions are hard to figure out, as they are determined with the imperfect information of the person reporting the incident. As might be the case with the recent report, that person could be lying, but even if not, they are likely shaken up enough that they cannot remember the exact conditions very well. Consider airline crashes, where we often have very objective information (the black box), but it is still very difficult to figure out what happened and why.


One thing seems certain to be true: we won’t know whether or not Prius cars are at fault for a long time to come, and far fewer of them will be bought in the next couple years.

More germs = less disease?

So says an article in today’s Science Daily, which reports on a recent study at Northwestern of children from the Philippines. The study finds that children from the Philippines have much lower levels of C-reactive protein (CRP), which indicates better resistance to disease. Exposure to germs was much higher for the children in the Philippines.


So what’s wrong with this study? It’s a very tenuous association, and from what I can gather in the articles, no attempt was made to ensure the children in the U.S. that were compared to the children in the Philippines were similar in other ways. They might be different in CRP due to other environmental or hereditary factors. Perhaps it’s the weather? The diet? One of any number of things could account for the difference.


In addition, the study appears to ignore the much higher infant mortality rate and much lower life expectancy in the Philippines (you can try for life expectancy and other information by country). In other words, even if higher germ exposure does mean lower CRP, does it actually mean less disease and longer life? The broad indication is that it does not.


In order for the study to be valid, it needs to adjust for whatever inherent differences (in addition to germ exposure) exist between Filipino and US children, and then see if CRP levels are still different. An even better way to do such a study would be to study children living in similar environments (same place, socio-economic situation, etc.) and determine if the ones exposed to more germs had lower levels of CRP when they reached adulthood.


I’ve seen articles (see this for example, but I can’t find a more definitive one at this time) that indicate that children with early exposure to farm animals have fewer allergies, but nothing showing exposure to more serious germs is good. And some of the germs that we are exposed to are more than just common germs–they are deadly. It might be that those who are exposed to these deadly germs early, and live, are much better off later in life, but that is no reason to expose them to those germs unnecessarily. Of course, you wouldn’t give your child a deadly disease so that, if they survived, they’d be resistant to it later in life.


We live in a society that is sometimes alarmist concerning germs, and I have written about this. Yet this doesn’t mean that, on the whole, a clean environment does not promote good health, and the article cited above seems to only have the most tenuous of indications that it may not.

Why Swine Flu is not a bunch of hogwash

This updates my previous blog: “Why Swine flu is a bunch of hogwash?”


Things have changed a bit in the months since that blog, and the hysteria I cited has leveled off. President Obama did declare a swine flu emergency a couple days ago, but I think that was a good idea.


Here is what has changed:
1) Swine flu deaths have been at epidemic levels the last three weeks. The chart below (from the CDC) shows flu and pneumonia deaths as a percentage of all deaths. The upper black line indicates epidemic level, and the red line is the current level. The graph shows four years of weekly figures.



While this graph doesn’t look too serious, and 2008 levels were much further above the threshold at their peak, the scary thing here is that it is so early in the season. This graph serves as a reminder, too, that every year the flu kills thousands of people, and the flu vaccine could prevent a large number of those deaths.


2) Hospitals are already getting crowded. One of the big problems with a real epidemic is the overcrowding of hospitals. This means that the really sick people cannot get treatment, and that is part of the reason the emergency was declared. See this article in USA Today about overcrowding. OK, so it’s USA Today, a paper that loves hyperbole, but, again, it’s early in the season and any indication of overcrowding at this point is scary.


3) The vaccine is not yet fully available. The regular flu vaccine has been out for weeks. Unfortunately, almost none of the flu this year seems to be covered by that vaccine. The majority seems to be 2009 H1N1 (the swine flu). See this chart for a breakdown. Note the orange/brown is 2009 H1N1, and note the yellow means it is not tested for sub-type, so almost all typed flu is swine flu.


That’s why I am worried. The other concern is that, even when the vaccine does come out, people won’t take it. See my brother’s blog about why you should and the crazies who say you should not.