## Motivation: Asset Idle Time

Asset-sharing systems are a way of minimizing 'asset idle time': the hours your car sits in your carport, or the week your home stands empty while you're on vacation. Getaround, Zipcar, Airbnb, and many others make minimizing it a core goal.

Asset Sharing companies capture the lost value of these idle assets by predicting who will need them and when, saving you money, and scraping a little off the top for profit. The key value these services add is in understanding and predicting the needs for these assets.

The D.C. Bike Share system has a similar problem. They want to predict how many bikes are needed so they maintain the right number of bikes for the city.


## Data: DC Bike Share / Kaggle

The data we're using is the DC bike share dataset hosted by UCI, which is also the subject of a Kaggle challenge. (1)

Given date, time, weather, and other variables we will predict how many bikes will be used in a given hour. Let's take a look at the data!

I'll also be using Kaggle.com to check how well our model is doing against the "hold out" test set.

**Follow along in my Jupyter Notebook here:** https://github.com/Ryanglambert/dc_bike_share_analysis/blob/master/Bike_Share_EDA.ipynb


**Predictors**

- **datetime** - hourly date + timestamp

- **season** - 1 = spring, 2 = summer, 3 = fall, 4 = winter

- **holiday** - whether the day is considered a holiday

- **workingday** - whether the day is neither a weekend nor holiday

- **weather**

- 1: Clear, Few clouds, Partly cloudy

- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

- **temp** - temperature in Celsius

- **atemp** - "feels like" temperature in Celsius

- **humidity** - relative humidity

- **windspeed** - wind speed

**Response Variable**

- **count** - number of total rentals

When are you more likely to ride a bike? When it rains or when it's sunny? When it's cold or when it's warm? When you're going to work? "If it's raining I'll just drive my car..." except, as a bike share user, you may not own a car.

Plenty of combinations to consider. Using `seaborn`, let's visualize which variables correlate with bike share use. (I've made dummy variables for all categorical variables.)
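As a rough sketch of how that visualization might be set up (the tiny dataframe below is made-up stand-in data, not the real Kaggle file):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up stand-in rows mimicking a few of the Kaggle columns.
df = pd.DataFrame({
    "season":     [1, 2, 3, 4, 1, 2],
    "workingday": [0, 1, 1, 0, 1, 0],
    "temp":       [9.0, 22.0, 28.0, 12.0, 25.0, 10.0],
    "humidity":   [80, 55, 60, 70, 50, 85],
    "count":      [40, 180, 210, 60, 190, 35],
})

# Dummy-encode the categorical variable, then correlate everything
# against the response.
dummies = pd.get_dummies(df, columns=["season"], dtype=int)
corr = dummies.corr(numeric_only=True)["count"].sort_values(ascending=False)

sns.heatmap(corr.to_frame(), annot=True, cmap="coolwarm")
plt.tight_layout()
```

On the real data the same pattern of code applies; only the `read_csv` step and the full column list differ.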


## Categorical Variables' Effect on Bike Share Use

There's some signal in here, but not much. Temperature and humidity appear to have the strongest effects of the variables shown.

Let's look at how time of day contributes to bike use.


## Time of Day Effect On Bike Use

It looks like it does, and pretty significantly.

Zooming out a bit, the variance is many multiples of the expected value. I did not expect this at all.


## Selecting the Right Distribution: Generalized Linear Models

Since my response variable is the number of bikes used, that number will always be positive. The first thought might be to use the Poisson distribution. Poisson is usually what you use when you're modeling counts, or something that can only be positive. However, the Poisson distribution has equal mean and variance. For us, it looks more like variance is some multiple of the mean instead. The variance is HUGE!

There is a cousin to Poisson called the Negative Binomial distribution. It is the same as Poisson with one small difference: instead of the variance being locked equal to the mean, it is allowed to grow with it (variance = mu + alpha * mu^2), which is exactly the overdispersion we're seeing.
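That difference can be sketched numerically with simulated draws (the mean and dispersion values below are arbitrary illustrations): a Poisson sample's variance sits near its mean, while a negative binomial sample with the same mean is far more spread out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Poisson: variance equals the mean.
pois = rng.poisson(lam=50, size=100_000)

# Negative binomial with mean mu and variance mu + alpha * mu**2.
# numpy parameterizes by (n, p), so convert from (mu, alpha).
mu, alpha = 50.0, 0.5
n = 1.0 / alpha
p = n / (n + mu)
nb = rng.negative_binomial(n=n, p=p, size=100_000)

print(pois.mean(), pois.var())  # both near 50
print(nb.mean(), nb.var())      # mean near 50, variance near 50 + 0.5 * 50**2 = 1300
```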


## First Model: GLM with a Negative Binomial Family

**Deviance: 1503.3**

**Pearson Chi^2: 1320**

The residuals are heteroscedastic. The variance in errors increases with the size of the prediction. That is appropriate for Poisson and Negative Binomial distributions, and it's expected since we know the variance increases as some multiple of our expected value.

Our histogram of residuals is mostly normally distributed. From these two plots I feel confident I have picked the right kind of distribution.


## Kaggle Leaderboard: 2522

Can we do better?

## Second Model: Negative Binomial GLM + Interaction Terms

Let's make our model better using interaction terms.

## Aside: Interaction Terms

As we saw previously, the weekends and weekdays have distinct behaviors to account for. Other features will likely have similar effects. Whether or not it's a holiday, whether or not it's raining, etc.

How do we make our model account for these categories?

Interaction terms with dummy variables.

Interaction terms are the pairwise products of our features, ignoring the squared terms. i.e. expanding (A + B) * (A + B) = A^2 + 2AB + B^2, we're only interested in the 'AB' cross term and we'll ignore A^2 and B^2. Scikit-learn makes this easy for us.

`sklearn.preprocessing.PolynomialFeatures(2, interaction_only=True)`

Cross terms involving a dummy variable equal to 0 simply go to zero, effectively switching those parameters off for that row. Here's an example of what that looks like.
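A small sketch of that zeroing behavior, using a made-up "spring" dummy alongside a temperature column:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two rows: a spring day and a non-spring day, both at 20 C.
X = np.array([
    [1, 20.0],  # spring dummy = 1
    [0, 20.0],  # spring dummy = 0
])

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
Xt = poly.fit_transform(X)

# Columns are: spring, temp, spring * temp.
print(Xt)
# [[ 1. 20. 20.]
#  [ 0. 20.  0.]]  <- the cross term vanishes when the dummy is 0
```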


Days that were not categorized as "spring" had a zero in the spring dummy, so when multiplied by the corresponding row in "temp", the cross term went to zero.

## Interaction Model Performance

**Deviance: 589.29**

**Pearson Chi^2: 599**

## Kaggle Leaderboard: 1827

By adding interaction terms we brought our RMSLE score from 0.7 down to 0.5.

What is RMSLE? (Root Mean Squared Log Error)

Like RMSE (Root Mean Squared Error), RMSLE is a fit score, but it uses the log of the model's outputs versus the log of the actuals. The reason for this is the same reason we're using a non-linear link function: we're predicting non-negative count data.

The way you can interpret RMSLE is essentially how many factors of the constant e (e = 2.71828..., that e) we're off by.
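For concreteness, here's a small sketch of computing RMSLE by hand (the Kaggle metric takes logs of y + 1 so that zero counts are handled; the sample values below are made up):

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Root mean squared error taken in log space; log1p guards against log(0).
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([100, 100, 100])
y_pred = np.array([50, 100, 200])  # half, exact, double

score = rmsle(y_true, y_pred)
print(score)           # roughly 0.56 for these values
print(np.exp(score))   # rough multiplicative "off by" factor
```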


Back to evaluating the performance of the models, recall the two scores: 0.7 and 0.5.

Exponentiating gives:

First model:

**e^0.7 ≈ 2**

On average, our model was off by a factor of 2: if the actual count were 100 cyclists, our model would typically predict somewhere between 50 and 200.

Model with interaction terms:

**e^0.5 ≈ 1.65**

On average, our model with interaction terms was off by a factor of 1.65: if the actual count were 100 cyclists, it would typically predict somewhere between 61 and 165.

## What's next

The intent of this project was to brush up on multivariable linear modeling. I hadn't expected to use something like Negative Binomial Distribution, but was excited to go through the exercise.

Something I might do in the future is model stacking, where a second-level model is trained on the predictions of several base models. A related idea is segmenting the data into groups that are better fit by separate models and using each model on its own group. There is also ensembling, where several models are given a weighted vote on the outcome variable.


*Reference:*

(1) Fanaee-T, Hadi, and Gama, Joao, *Event labeling combining ensemble detectors and background knowledge*, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.