Last year Microsoft acquired Revolution Analytics, which provided open-source and enterprise distributions of the hugely popular statistical computing language R. Earlier this year, Microsoft launched Microsoft R Open and Microsoft R Server, which are built upon Revolution Analytics' software. The open (source and free) version promises improved performance through multithreaded math libraries, and reproducibility through snapshots of the CRAN repository and the checkpoint package, which lets you get the older versions of the R packages you use.
However, the more interesting product (from our point of view) is Microsoft R Server, which is also the topic of this post. According to Microsoft, applying open-source R at enterprise scale poses the following challenges: R needs to load data into memory, it is inherently single-threaded, it requires skilled resources to scale out, it is not straightforward to deploy in production, and it lacks professional support. Microsoft promises that R Server solves all of these issues, tackling them with the RevoScaleR, DeployR and RevoPemaR packages on top of R Open.
PEMA stands for parallel external memory algorithm, and RevoPemaR is a framework for writing them. In all their simplicity, PEMAs work as follows: data is loaded into memory and processed in chunks. Results are updated after each chunk is processed, so only the intermediate results need to be carried over to the next chunk. The process can also be parallelized over multiple cores or a cluster, so huge volumes of data can be analyzed incredibly fast.
For example, we could calculate the mean of a column in a huge data frame as follows. First, set the sum and the observation count to zero (initialize); then, for each chunk of data, calculate the sum and the number of rows (process data) and add them to the running totals (update results); finally, when all the data has been processed, divide the sum by the number of observations (process results). PEMAs scale linearly over an arbitrary number of rows and cores, but they are limited to algorithms that can be processed in chunks.
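The chunked mean described above can be sketched in plain R (the function name and chunk size are our own choices, for illustration only):

```r
# PEMA-style mean: process a vector in chunks, carrying only the running
# sum and observation count (the intermediate results) between chunks.
chunked_mean <- function(x, chunk_size = 1000) {
  total <- 0; n <- 0                                   # initialize
  for (start in seq(1, length(x), by = chunk_size)) {
    chunk <- x[start:min(start + chunk_size - 1, length(x))]
    total <- total + sum(chunk)                        # process data + update results
    n <- n + length(chunk)
  }
  total / n                                            # process results
}

chunked_mean(1:10000)  # same result as mean(1:10000)
```

Because each chunk only contributes a sum and a count, the chunks could just as well be processed on different cores or machines and combined at the end.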
RevoPemaR is based on R's reference classes, and writing a fully custom algorithm is fairly complicated. The package contains PemaBaseClass, an R reference class whose initialize(), processData(), updateResults() and processResults() methods are (partly) overridden when writing custom solutions.
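To give a feel for the pattern, here is a toy analogue built with R's built-in reference classes rather than the actual RevoPemaR API (the class name, fields and chunking are our own; PemaBaseClass itself works differently in detail):

```r
# A toy reference class with the initialize/processData/updateResults/
# processResults structure of a PEMA, computing a mean over chunks.
PemaMean <- setRefClass("PemaMean",
  fields = list(total = "numeric", n = "numeric"),
  methods = list(
    initialize = function(...) {
      total <<- 0; n <<- 0                    # initialize
      callSuper(...)
    },
    processData = function(chunk) {
      # intermediate results for one chunk
      list(total = sum(chunk), n = length(chunk))
    },
    updateResults = function(res) {
      total <<- total + res$total             # fold chunk results in
      n <<- n + res$n
    },
    processResults = function() total / n     # final result
  )
)

pm <- PemaMean$new()
for (chunk in split(1:100, rep(1:4, each = 25))) {
  pm$updateResults(pm$processData(chunk))
}
pm$processResults()  # 50.5, i.e. mean(1:100)
```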
While RevoPemaR is the underlying engine, ScaleR provides a collection of ready-made PEMAs and data manipulation capabilities. ScaleR offers many of the most commonly applied statistical techniques and machine learning models in an easy-to-use, R-like syntax. ScaleR works with data of any size, but what is interesting is how it scales both up and out. Scaling up means deploying a larger instance; scaling out means deploying more instances (a cluster).
ScaleR provides something for every part of the common data analysis process. It can utilize a variety of data sources, ranging from a CSV file to a Teradata database or a Hadoop cluster. A binary file format called XDF is ScaleR's native format and, as we'll see later, it allows extremely efficient data retrieval and processing. As an alternative to importing data into R Server for computation, we can get around this step and send our R code to the data instead. This is done by setting a new "compute context" to tell the program that we want to send the script to, for example, SQL Server 2016, which has R built in.
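As a sketch (assuming the RevoScaleR package that ships with R Server; file names and the connection string are made up, and this will not run without an R Server installation), importing a CSV into XDF and switching the compute context might look like:

```r
library(RevoScaleR)

# Import a CSV into ScaleR's native XDF format for efficient chunked access
rxImport(inData = "claims.csv", outFile = "claims.xdf", overwrite = TRUE)

# By default, computations run locally ...
rxSetComputeContext(RxLocalSeq())

# ... but the same script can instead be sent to SQL Server 2016 for
# in-database execution by changing the compute context
cc <- RxInSqlServer(connectionString = "Driver=SQL Server;Server=myserver;...")
rxSetComputeContext(cc)
```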
For the pre-modeling part, the main workhorse for data manipulation is rxDataStep, which provides the functionality to create new variables, select columns, select rows by condition and clean missing data. Other available manipulations include factorization, sorting, merging and splitting. ScaleR also offers basic tools for exploratory data analysis: summary statistics, cross-tabulations and correlation matrices, as well as simple plots such as histograms and line plots.
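For illustration, a single rxDataStep call can combine several of these manipulations in one pass over the data (column names here are invented, and this sketch again assumes RevoScaleR):

```r
library(RevoScaleR)

# Derive a new variable, keep selected columns, filter rows and drop
# missing values, all in one chunked pass from XDF to XDF
rxDataStep(inData = "claims.xdf", outFile = "claims_clean.xdf",
           transforms = list(cost_per_claim = cost / number),
           varsToKeep = c("age", "type", "cost", "number"),
           rowSelection = !is.na(cost) & cost > 0,
           overwrite = TRUE)
```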
Statistical and machine learning models included in ScaleR at the moment are
- Linear models
- Generalized linear models
- Decision trees
- Stochastic gradient boosting
- Naïve Bayes classifier
- K-means clustering
and people at Microsoft have indicated that there’s more to come.
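All of these follow the same rx* calling convention; for instance, fitting a linear model directly against an XDF file might look like this (formula and file name illustrative, assuming RevoScaleR):

```r
library(RevoScaleR)

# rxLinMod reads and processes the XDF file in chunks, so the data
# never needs to fit in memory at once
fit <- rxLinMod(cost ~ age + type, data = "claims.xdf")
summary(fit)
```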
DeployR is meant to act as an intermediary between application developers and data scientists, allowing both to focus on their own work. In a high-level workflow, the data scientist produces an R script as usual and deposits it into the DeployR repository manager. The server automatically turns the script into an API. The developer works on "the other side" and integrates the model into the data product.
Everything we read and heard from Microsoft sounded a little too good to be true, so we wanted to test it for ourselves. Besides, what's the point of getting a new toy if you don't get to play with it? Tests were run on a 4-core, 28 GB RAM R Server instance (Standard DS12 v2) in Azure. Documentation for the setup (and everything else) is lacking, but setting up the server from the Azure Marketplace was pretty straightforward nonetheless (we used the Windows-based server; Linux is also available). The server came with the MRO Rgui, but we opted to work with RStudio (MRO Rgui probably resonates with useRs who are into retro stuff).
After some simple data manipulation operations and data formatting, the first obvious test was fitting a linear model. We used a linear model throughout, since the same principles apply to all algorithms processed in chunks. We tested performance with different types of data sources as well as against an open-source R setup. For open-source R, we opted for ff and biglm (if you know of superior solutions, let me know!). This is not really an "apples to apples" comparison, since biglm fits the model by updating a QR decomposition and cannot easily be parallelized (if you have a solution, contact me), and ff and XDF are not exactly the same either. However, this is not an issue, since our aim was to compare the R Server solution to an open-source one.
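For reference, the open-source approach we compared against works roughly like this (a minimal sketch assuming the CRAN biglm package, shown here on a built-in toy data set rather than our test data):

```r
library(biglm)  # CRAN package: bounded-memory linear regression

# Split a data frame into chunks to mimic out-of-memory processing
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

# Fit on the first chunk, then update the QR decomposition chunk by chunk
fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])
for (chunk in chunks[-1]) fit <- update(fit, chunk)
coef(fit)  # close to coef(lm(mpg ~ wt + hp, data = mtcars))
```

Because each update depends on the previous state of the decomposition, the chunks must be processed sequentially, which is exactly why this approach is hard to parallelize.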
The first tests were conducted with mock-up insurance claims data (http://packages.revolutionanalytics.com/datasets, claims2.csv) that contains 13M+ rows and 36 columns. First we wanted to test how well the ScaleR version of a linear model (rxLinMod) compares against an open-source R alternative (biglm) and how well it scales with the number of cores. We fitted a linear model to explain the amount of an insurance claim. The model doesn't make much sense, but it is as good as any for clocking runtimes. The test results are depicted in the table below and show that rxLinMod performs very well no matter what the data source. The results also show that XDF files are superior to CSV files for out-of-core computations. Performance does not scale linearly with the number of cores; the PEMA execution framework and the results below suggest that I/O is the most probable bottleneck.
The results leave us with quite a few questions. Firstly, 13M rows is still quite small. Secondly, how about working with a database? And how does it perform with wider data sets? To explore the first question, we used the famous airline delay data, also called "the iris of big data". The name is a bit misleading, since 150 million rows hardly qualifies as big data. However, it is of a size that would be a real pain to work with on your laptop.
We fitted a linear model to explain late arrivals with three numeric variables and one factor. The following results indicate that rxLinMod scales well with the number of rows, but yet again, scaling with more cores is not that great.
To explore the second and third questions, we created a new data set in an SQL database in Azure (S1 Standard with 20 DTUs). The data set contained 10 features following a normal distribution and a response variable that is a linear function of the features. This allowed us to test performance against a database, to see how rxLinMod scales with the number of features, and to get a sanity check on the results, since we know the true data generating process. The 19-feature run used the 10 original variables plus 9 interaction terms.
The results in the table below show that rxLinMod scales practically linearly with the number of columns when using XDF files. The relative gain from adding cores is much higher with the database, but the database is really slow in the first place. This is another I/O-related problem.
The database takes its time, since it needs to retrieve the data for the script to run. Microsoft SQL Server 2016 (which came out on June 1st) allows you to do the exact opposite: it has R built in, so you can send your script to the data, thereby cutting out the majority of the time. SQL Server is one of the "external compute contexts" you can use (now or in the near future), the others being Hadoop, Spark and Teradata. We'll test this out too in the near future.
Even though Microsoft R Server contains a modest number of algorithms, the performance was impressive, and most use cases do not require the latest statistical methods or neural networks. Microsoft folks working on R Server have indicated that there's more to come, and when needed, R Server still gives access to all our favorite open-source R packages. With data sets growing fast, the ability to send your script to the data instead of moving the data around will provide substantial efficiency gains. Overall, R Server is a powerful addition to Azure's analytical capabilities and is definitely going to be added to BDP's solution stack. However, R Server is not the only solution that lets you break the memory barrier. We'll introduce a Python-based alternative after the holidays.