Motivation: Asset Idle Time
Asset-sharing companies capture the lost value of idle assets by predicting who will need them and when, saving you money and scraping a little off the top as profit. The key value these services add is in understanding and predicting demand for these assets.
The D.C. Bike Share system has a similar problem. They want to predict how many bikes are needed so they can maintain the right number of bikes across the city.
Data: DC Bike Share / Kaggle
Follow along in my Jupyter Notebook here: https://github.com/Ryanglambert/dc_bike_share_analysis/blob/master/Bike_Share_EDA.ipynb
Given date, time, weather, and other variables we will predict how many bikes will be used in a given hour. Let's take a look at the data!
I'll also be using Kaggle.com to check how well our model is doing against the "hold out" test set.
- **datetime** - hourly date + timestamp
- **season** - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- **holiday** - whether the day is considered a holiday
- **workingday** - whether the day is neither a weekend nor holiday
- **weather** -
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- **temp** - temperature in Celsius
- **atemp** - "feels like" temperature in Celsius
- **humidity** - relative humidity
- **windspeed** - wind speed
- **count** - number of total rentals
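Before exploring, the `datetime` column needs to be parsed so that hour and month can be used as features. Here's a minimal sketch of that step, using a tiny inline sample in place of the Kaggle `train.csv` (column names follow the data dictionary above; the values are illustrative):

```python
import pandas as pd

# Toy sample shaped like the Kaggle bike share data (not real values)
raw = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"],
    "season": [1, 1],
    "holiday": [0, 0],
    "workingday": [0, 0],
    "weather": [1, 1],
    "temp": [9.84, 9.02],
    "atemp": [14.395, 13.635],
    "humidity": [81, 80],
    "windspeed": [0.0, 0.0],
    "count": [16, 40],
})

# Parse the timestamp and pull out the time features we care about
raw["datetime"] = pd.to_datetime(raw["datetime"])
raw["hour"] = raw["datetime"].dt.hour
raw["month"] = raw["datetime"].dt.month
print(raw[["hour", "month", "count"]])
```

On the real data you would `pd.read_csv("train.csv", parse_dates=["datetime"])` and apply the same extraction.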
Plenty of combinations to consider. Using `seaborn`, let's visualize which variables correlate with bike share use. (I've made dummy variables for all categorical variables.)
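The dummy-variable step can be sketched with `pd.get_dummies` (the toy frame here is illustrative; on the real data you'd pass the full set of categorical columns):

```python
import pandas as pd

# Toy frame with the categorical columns from the data dictionary
df = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "weather": [1, 1, 2, 3],
    "count": [5, 120, 300, 20],
})

# One dummy column per category level; original columns are dropped
dummies = pd.get_dummies(df, columns=["season", "weather"])
print(sorted(dummies.columns))
```

Each categorical column becomes a set of 0/1 indicator columns (`season_1` … `season_4`, `weather_1` … `weather_3`), which is what lets the correlation plot treat every category level separately.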
Categorical Variables' Effect on Bike Share Use
Let's look at how time of day contributes to bike use.
Time of Day Effect On Bike Use
Zooming out a bit, the variance is many multiples of the expected value. I did not expect this at all.
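This overdispersion is easy to check directly: for a Poisson-distributed outcome the variance should roughly equal the mean, but here it is many times larger. A quick sketch with toy counts (not the real data):

```python
import pandas as pd

# Toy hourly counts; the real data shows the same pattern at scale
counts = pd.Series([1, 3, 5, 40, 90, 250, 400, 12, 7, 300])

mean, var = counts.mean(), counts.var()
# For Poisson data this ratio would be near 1; here it is far larger
print(f"mean={mean:.1f}, variance={var:.1f}, ratio={var / mean:.1f}")
```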
Selecting the Right Distribution: Generalized Linear Models
The Negative Binomial distribution is a cousin of the Poisson, with one small difference: where Poisson assumes the variance equals the mean, the Negative Binomial lets the variance exceed the mean by some multiple. That matches the overdispersion we just saw.
First Model: GLM with Negative Binomial Family
Pearson Chi^2: 1320
Our histogram of residuals is approximately normal. From these two plots, I feel confident I have picked the right kind of distribution.
Kaggle Leader Board: 2522
Second Model: Negative Binomial GLM + Interaction Terms
Aside: Interaction Terms
Interaction terms with dummy variables.
Interaction terms are the pairwise products of our features, ignoring the squared terms. Expanding (A + B) * (A + B) = A^2 + 2AB + B^2, we keep only the 'AB' cross-term and discard the squares. Scikit-learn makes this easy for us.
Any cross-term involving a dummy variable that equals 0 simply goes to zero, switching those parameters off for that observation. Here's an example of what that looks like.
Interaction Model Performance
Pearson Chi^2: 599
Kaggle Leaderboard: 1827
Something I might do in the future is model stacking: you find how to separate the data into groups that separate models fit better, then use each model for its own group of data points. There is also ensembling, where each model gets a weighted vote on the outcome variable.
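The weighted-vote idea can be sketched in a few lines (model names, predictions, and weights here are made up for illustration):

```python
import numpy as np

# Each model's predicted hourly counts for two hypothetical hours
preds = {"nb_glm": np.array([120.0, 40.0]),
         "nb_glm_interactions": np.array([150.0, 30.0])}

# Weights would normally come from validation performance
weights = {"nb_glm": 0.4, "nb_glm_interactions": 0.6}

# Ensemble prediction = weighted average of the models' predictions
ensemble = sum(w * preds[name] for name, w in weights.items())
print(ensemble)  # [138.  34.]
```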
(1) Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.