I have been working with Joe Jansen on the Citibike data in the R language. Citibike is New York’s bike-sharing program, which started in May and currently has more than 80,000 annual members. R is a freely available, object-oriented programming language for statistical computing; it is an implementation of the S language, which was originally designed at Bell Labs for doing statistics.
Joe has downloaded all the data and done an extensive analysis, which you can find here. I fit a simpler regression model and graphed it using ggplot2 in R. I found maximum temperature, humidity, wind, and amount of sunshine to be significant factors. Rain was not, but of course sunshine and rain are strongly correlated, so you wouldn’t necessarily expect both to be significant in the same regression model. The day of the week, surprisingly, was not significant either. The R-squared for my model, which predicts trips per 1,000 annual members, was more than 70%.
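For readers who want to try something similar, here is a minimal sketch of this kind of regression using base R's `lm()`. It is not the model from this post: the data frame `trips`, its column names, and every value in it are simulated purely for illustration, so the coefficients and R-squared it produces say nothing about the real Citibike data.

```r
# Hypothetical data: all column names and values are simulated for illustration.
set.seed(42)
n <- 120
trips <- data.frame(
  max_temp = runif(n, 50, 95),   # assumed: daily maximum temperature
  humidity = runif(n, 30, 90),   # assumed: relative humidity
  wind     = runif(n, 0, 20),    # assumed: average wind speed
  sunshine = runif(n, 0, 100)    # assumed: percent of possible sunshine
)

# Simulated response: trips per 1,000 annual members, with random noise.
trips$trips_per_1000 <- 200 + 4 * trips$max_temp - 1.5 * trips$humidity -
  3 * trips$wind + 1.2 * trips$sunshine + rnorm(n, sd = 30)

# Ordinary least squares regression on the weather variables.
fit <- lm(trips_per_1000 ~ max_temp + humidity + wind + sunshine, data = trips)

# Coefficient table with p-values, and the share of variance explained.
print(summary(fit)$coefficients)
r_squared <- summary(fit)$r.squared

# Store fitted values so they can be plotted against the actual values.
trips$predicted <- fitted(fit)
```

With real data, `trips` would be replaced by a data frame of daily trip counts joined to daily weather observations.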
Here is a graph of predicted versus actual trips, with the day of the week shown by the color of each point.
You can see I am an amateur at ggplot: the legend lists the days of the week in alphabetical order rather than calendar order. Help with that and other aspects of ggplot for this graph would be welcome (please comment accordingly).
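On the legend order: ggplot2 orders a discrete legend by the levels of the factor, and R sorts factor levels alphabetically unless told otherwise. A common fix is to set the levels explicitly before plotting. The sketch below assumes a data frame with columns named `predicted`, `actual`, and `weekday`; those names and the data are made up for illustration.

```r
library(ggplot2)

days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

# Hypothetical data standing in for the predicted and actual trips.
set.seed(1)
plot_data <- data.frame(
  predicted = runif(70, 100, 600),
  weekday   = rep(days, 10),
  stringsAsFactors = FALSE
)
plot_data$actual <- plot_data$predicted + rnorm(70, sd = 40)

# Setting the levels explicitly gives the legend calendar order
# instead of the default alphabetical order.
plot_data$weekday <- factor(plot_data$weekday, levels = days)

p <- ggplot(plot_data, aes(x = predicted, y = actual, colour = weekday)) +
  geom_point() +
  # Reference line where actual equals predicted.
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Predicted trips per 1,000 members",
       y = "Actual trips per 1,000 members",
       colour = "Day of week")
p
```

The same `factor(..., levels = ...)` step applies to any categorical variable whose natural order is not alphabetical, such as month names.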
If day of the week made a difference, then for any given point on the x-axis (predicted trips), points of a certain color would sit higher on the y-axis (actual trips) than points of other colors. For example, if more trips occurred on weekends, you would have more of the green colors (Saturday and Sunday) on top. However, no such effect seems to exist. I guess people are enjoying Citibike every day of the week, or casual riders on the weekends are roughly making up for weekday commuting riders.
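The visual check can be backed up with a formal comparison: fit the model with and without day of the week and compare the two with an F-test via `anova()`. This is a sketch of one reasonable approach, not necessarily how significance was judged in this post. The data are simulated, with no weekday effect built in, and the column names are assumptions.

```r
# Hypothetical data: simulated so that weekday has no real effect.
set.seed(7)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
n <- 140
sim <- data.frame(
  max_temp = runif(n, 50, 95),
  weekday  = factor(rep(days, length.out = n), levels = days)
)
sim$trips_per_1000 <- 100 + 5 * sim$max_temp + rnorm(n, sd = 30)

# Model without and with day of the week.
fit_weather <- lm(trips_per_1000 ~ max_temp, data = sim)
fit_weekday <- update(fit_weather, . ~ . + weekday)

# F-test for the nested models: a large p-value in the "Pr(>F)" column
# means weekday adds little beyond the weather term.
cmp <- anova(fit_weather, fit_weekday)
print(cmp)
```

A single joint test like this is usually easier to interpret than reading the six separate weekday coefficients one by one.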