Comparing R to Matlab for Data Mining

For the past three years or so, I have used R daily, and most of that daily use is spent on Machine Learning/Data Mining problems.

I was an exclusive Matlab user while at university; at the time I thought it was an excellent set of tools/platform. I am sure it is today as well.

The Neural Network Toolbox, the Optimization Toolbox, Statistics Toolbox, and Curve Fitting Toolbox are each highly desirable (if not essential) for someone using MATLAB for ML/Data Mining work, yet they are all separate from the base MATLAB environment--in other words, they have to be purchased separately.

My Top 5 list for Learning ML/Data Mining in R:

  • Mining Association Rules in R

This refers to a couple of things. First, a group of R packages whose names all begin with arules (available from CRAN); you can find the complete list (arules, arulesViz, etc.) on the Project Homepage. Second, all of these packages are based on a data-mining technique known as Market-Basket Analysis, or alternatively as Association Rules. In many respects, this family of algorithms is the essence of data-mining--exhaustively traverse large transaction databases and find above-average associations or correlations among the fields (variables or features) in those databases. In practice, you connect them to a data source and let them run overnight. The central R package in the set mentioned above is called arules. On the CRAN package page for arules, you will find links to a couple of excellent secondary sources (vignettes in R's lexicon) on the arules package and on the Association Rules technique in general.
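To make "above-average associations" concrete, here is a minimal base-R sketch of the three usual rule metrics (support, confidence, lift). The transaction data and the helper names `support` and `rule_metrics` are made up for illustration; this is not how arules is implemented, only the arithmetic behind a single rule.

```r
# Toy transaction database (made-up data, for illustration only).
transactions <- list(
  c("bread", "milk"),
  c("bread", "diapers", "beer", "eggs"),
  c("milk", "diapers", "beer", "cola"),
  c("bread", "milk", "diapers", "beer"),
  c("bread", "milk", "diapers", "cola")
)

# Hypothetical helper: fraction of transactions containing every item in `items`.
support <- function(items, txns) {
  mean(vapply(txns, function(t) all(items %in% t), logical(1)))
}

# Hypothetical helper: support, confidence, and lift for the rule lhs => rhs.
rule_metrics <- function(lhs, rhs, txns) {
  supp_rule <- support(c(lhs, rhs), txns)
  conf <- supp_rule / support(lhs, txns)   # P(rhs | lhs)
  lift <- conf / support(rhs, txns)        # > 1 means an above-average association
  c(support = supp_rule, confidence = conf, lift = lift)
}

m <- rule_metrics("diapers", "beer", transactions)
print(round(m, 3))
```

A lift above 1 means the right-hand item shows up more often alongside the left-hand item than it does overall. With arules installed, the search over all candidate rules would be handled by its `apriori()` function, roughly `apriori(trans, parameter = list(supp = 0.4, conf = 0.7))`, where `trans` is a transactions object and the thresholds are placeholders.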

  • The Elements of Statistical Learning (ESL)

The most current edition of this book is available in digital form for free. Likewise, all of the data sets used in ESL are available for free download at the book's website. (As an aside, I have the free digital version; I also purchased the hardback version from BN.com; all of the color plots in the digital version are reproduced in the hardbound version.) ESL contains thorough introductions to at least one exemplar from most of the major ML rubrics--e.g., neural networks, SVM, KNN; unsupervised techniques (LDA, PCA, MDS, SOM, clustering), numerous flavors of regression, CART, Bayesian techniques, as well as model aggregation techniques (Boosting, Bagging) and model tuning (regularization). Finally, get the R package that accompanies the book from CRAN (which will save you the trouble of having to download and enter the datasets).

  • CRAN Task View: Machine Learning

The 3,500+ packages available for R are divided up by domain into about 30 package families or 'Task Views'. Machine Learning is one of these families. The Machine Learning Task View contains about 50 or so packages. Some of these are flagged as core packages of the Task View, including e1071 (a sprawling ML package that includes working code for quite a few of the usual ML categories).
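As a rough sketch of how this looks in practice: the ctv package can install an entire Task View in one call, and e1071 can then be tried on a built-in data set. The install lines are commented out because they need network access and a configured CRAN mirror; the SVM fit on iris is only an illustrative choice, and the accuracy computed here is on the training data, so it is optimistic.

```r
# One-time setup (assumes internet access and a CRAN mirror):
# install.packages("ctv")                 # helper package for Task Views
# ctv::install.views("MachineLearning")   # install every package in the view

# Guarded so the script still runs where e1071 is not installed.
if (requireNamespace("e1071", quietly = TRUE)) {
  # Fit a support vector machine classifier with e1071's default settings.
  fit <- e1071::svm(Species ~ ., data = iris)

  # Predict on the same data; this is training accuracy, not a real estimate.
  pred <- predict(fit, iris)
  accuracy <- mean(pred == iris$Species)

  # Confusion table of predicted vs. actual species.
  print(table(predicted = pred, actual = iris$Species))
}
```

For an honest error estimate you would hold out a test set or cross-validate rather than predicting on the training rows.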

  • Revolution Analytics Blog

With particular focus on the posts tagged with Predictive Analytics

  • ML in R tutorial, comprising a slide deck and R code, by Josh Reich

A thorough study of the code would, by itself, be an excellent introduction to ML in R.

And one final resource that I think is excellent, but that didn't make it into the top 5:

  • A Guide to Getting Started in Machine Learning [in R]

posted at the blog A Beautiful WWW


Please look at the CRAN Task Views, and in particular at the CRAN Task View on Machine Learning and Statistical Learning, which summarises this nicely.