R vs Python Predictive Analytics Part 2: Results

Part 1 of the series outlined the modelling choices. This part shows the code and results from R and Python predictive models.

The pipelines are simple, so you should have no difficulty following the code. Let’s dive into it!

Both scripts consist of the same distinct phases:

  • Load the dataset
  • Define the cross validator
  • Define the model to fit and test
  • Print the cross validation results
  • Save the model for later use

I think that the best way to understand the code is to read it line-by-line and look for these phases.

R Modelling

R script on GitHub

# Import necessary packages
# Caret for modelling
library(caret)
# MASS for the Boston dataset
library(MASS)

# Load data
data("Boston")

# Define cross-validation procedure
train_control <- trainControl(method="cv", number=5)
# Fit the model (random forest)
model <- train(medv~., data=Boston, trControl=train_control,
               method="rf")

# Get the cross-validation results for the best tuning parameter value
rmse <- model$results[model$results$mtry == model$bestTune$mtry, "RMSE"]
# Approximate 95% confidence interval: +/- 2 standard deviations of the fold RMSEs
ninefiveconf <- model$results[model$results$mtry == model$bestTune$mtry, "RMSESD"] * 2

print("Random forest regressor (R)")
print("5-fold cross-validation performance")
print("Root mean squared error with 95% confidence interval:")
print(sprintf("%.3f ( +/- %.3f)", rmse, ninefiveconf))

# Save model for later usage
save(model, file="R_model.Rdata")

Python Modelling

Python script on GitHub


# Import necessary packages
import pickle

import numpy as np
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score

# Declare the regressor
rgr = RandomForestRegressor(n_estimators=100)
# Split the data into features and targets
X_data = load_boston().data
y_data = load_boston().target

# Calculate cross-validation scores (MSE)
MSEs = cross_val_score(estimator=rgr,
                       X=X_data,
                       y=y_data,
                       scoring='mean_squared_error',
                       cv=5)

# Flip the sign (https://github.com/scikit-learn/scikit-learn/issues/2439)
MSEs *= -1
# Convert MSE to get RMSE
RMSEs = np.sqrt(MSEs)

# Report model performance
print "Random forest regressor (Python)"
print "5-fold cross-validation performance"
print "Root mean squared error with 95% confidence interval:"
print ("{:.3f} (+/- {:.3f})").format(RMSEs.mean(), RMSEs.std()*2)
print ""

# Fit model with the whole dataset
rgr.fit(X_data, y_data)

# Save model for later usage
model_name = "python_rgr_model.pkl"
with open(model_name, 'wb') as f:
    pickle.dump(rgr, f)

print "Model saved as: {}".format(model_name)

Baseline estimator

A baseline estimator was built for benchmarking and sanity checking: it simply predicts the average real-estate price for every observation. The R and Python models are compared against this baseline.


import numpy as np

from sklearn.datasets import load_boston

y_data = load_boston().target
# The mean is used as a baseline estimator
# R and Python estimators are later compared to this
# baseline.
# Calculate mean:
y_mean = y_data.mean()
# Difference (error) from the mean for each observation:
y_err = y_data - y_mean
# Calculate RMSE:
# Squared Error (SE):
y_SE = y_err**2
# Mean Squared Error (MSE):
y_MSE = y_SE.mean()
# Root Mean Squared Error (RMSE):
y_RMSE = np.sqrt(y_MSE)
# print "RMSE baseline: {:.3f}".format(y_RMSE)
# 9.188
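
As a quick sanity check of the arithmetic above: an estimator that always predicts the mean has, by definition, an RMSE equal to the population standard deviation of the target, so the baseline can also be computed in one line.

from sklearn.datasets import load_boston

y_data = load_boston().target
# .std() defaults to ddof=0, which matches the manual calculation above
print("RMSE baseline: {:.3f}".format(y_data.std()))
# 9.188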

Results

The models did comparatively well (compared to the baseline :D), so this kind of minimal modelling succeeded at least to some extent. Note that the sample size was small and the results were strongly affected by variance: in Python, one fold performed much worse than the others and dragged down the overall score (a way to inspect the per-fold scores is sketched after the list). The “final” results with 95% confidence intervals:

  • R: 3.139 ( +/- 1.090)
  • Python: 4.407 ( +/- 2.664)
  • Baseline: 9.188
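
Since fold-to-fold variance drives most of the uncertainty here, it is worth looking at the individual fold scores and not just the mean. A minimal sketch, assuming the RMSEs array from the Python script above is still in scope:

# Print each fold's RMSE to spot folds that perform much worse than the rest
for fold, fold_rmse in enumerate(RMSEs, start=1):
    print("Fold {}: RMSE {:.3f}".format(fold, fold_rmse))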

What kind of performance could be achieved with extensive preprocessing, feature selection, model tuning, ensembling, and so on? Would the additional effort be profitable for the client? How could the real-estate agent benefit from this kind of modelling? And most importantly: how do we develop valuable business cases around predictive modelling?