R vs Python Predictive Analytics Part 1: Setup

This is part 1 of a short blog series on R, Python and predictive analytics. Part 1 shows how to set up a simple predictive analytics pipeline in both languages. Part 2 shows the code and the results for each language. The code is also available on GitHub.

Extensive comparisons of these two languages already exist. To get a general idea of the differences and similarities you should check:

These two resources should give you a pretty good understanding of the languages. From a predictive analytics point of view the languages are relatively similar. This series offers a quick comparison in a predictive analytics context.

Modelling Choices

The analysis is done with the Boston house prices dataset. The dataset has 14 attributes; the modelling problem is to predict one of them, the median value of house prices, from the other 13.

The basic development workflow is very similar in both languages. I used Vim with Python plugins to write the Python script and the RStudio IDE for R. In Python, scikit-learn handled modelling and cross-validation; in R, caret did the same job. Both are general-purpose machine learning packages that provide tools for the whole pipeline from preprocessing to cross-validation. Be sure to check them out!

The pipeline included 5-fold cross-validation with model performance evaluation. Root mean squared error (RMSE) with a 95% confidence interval was calculated for both implementations. After the script runs, the model is saved to the hard drive and the model performance is redirected to a text file.
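To make the evaluation step concrete, here is a minimal sketch in plain Python of the arithmetic behind it: splitting the data into 5 folds, computing RMSE, and turning the per-fold scores into a 95% interval. It deliberately avoids scikit-learn so the calculation stays visible. The function names are my own, the fold scores are made-up numbers, and the interval uses a normal approximation (mean plus or minus 1.96 standard errors), which is an assumption and not necessarily how the scripts in this series compute it.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def rmse_confidence_interval(fold_scores, z=1.96):
    """Return (mean, lower, upper) for the per-fold RMSE scores.

    Assumption: normal approximation with z = 1.96 for a 95% interval.
    """
    mean = statistics.mean(fold_scores)
    standard_error = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
    return mean, mean - z * standard_error, mean + z * standard_error


if __name__ == "__main__":
    # Made-up per-fold RMSE values, for illustration only.
    illustrative_scores = [4.0, 5.0, 6.0, 5.0, 5.0]
    mean, lower, upper = rmse_confidence_interval(illustrative_scores)
    print(f"RMSE: {mean:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With only five folds the normal approximation is on the optimistic side; a t-distribution quantile would give a somewhat wider interval.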

I built a baseline estimator for sanity checking and benchmarking. The baseline always predicts the average house price, so it does not have any intelligence in it, and any useful model should beat it.
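A baseline like this can be sketched in a few lines. The class below is a hypothetical illustration, not the code from this series: it memorises the mean of the training targets and returns it for every input. The toy numbers are invented. In scikit-learn, DummyRegressor with strategy="mean" provides the same behaviour.

```python
import math
import statistics


class MeanBaseline:
    """Hypothetical baseline: predicts the training-set mean for every row."""

    def fit(self, X, y):
        # The features are ignored on purpose; only the targets matter.
        self.mean_ = statistics.mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


if __name__ == "__main__":
    # Invented toy data: one feature per row, prices in arbitrary units.
    X_train, y_train = [[1], [2], [3]], [20.0, 22.0, 24.0]
    X_test, y_test = [[9], [10]], [21.0, 25.0]

    baseline = MeanBaseline().fit(X_train, y_train)
    print("Baseline RMSE:", rmse(y_test, baseline.predict(X_test)))
```

Any real model should reach a lower RMSE than this on the same folds; if it does not, something in the pipeline is probably broken.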

Part 2 – Code and Modelling Results

In part 2 you will see how the modelling was actually done and how the models performed.