Calculating the Pearson correlation coefficient using R and Ruby

January 21, 2009

In a previous article I talked about using the GNU Scientific Library to implement the Pearson correlation algorithm, as used for example in acts_as_recommendable. As a prelude to some forthcoming articles, I'd like to show you how easy it is to implement the same thing using the Ruby bindings to the R project.

For this to work you'll need to install R and the rsruby gem. Take a look at the documentation for the rsruby gem: while installation is straightforward, there are a couple of things to be aware of.

Having done that, let's reopen the Pearson class from the previous article and add a new R-based method:

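class Pearson
  def initialize
    # Load the bindings and grab the shared R interpreter instance
    require 'rsruby'
    @r = RSRuby.instance
  end

  # Delegate to R's 'cor' function, which returns the correlation of x and y
  def R_pearson(x,y)
    @r.cor(x,y)
  end
end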

The new initialize method sets up communication with the R instance. The actual definition of the R_pearson method is simple: conversion from Ruby Arrays to R vectors is handled automatically by the bindings, so you only need to know the correct R function to call, in this case 'cor'. Take a look through the R manual to learn more about this powerful tool; almost all of its features are accessible through the Ruby bindings.
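If you want to sanity-check whatever comes back from R, it helps to have a small reference to compare against. By default R's 'cor' computes Pearson's r, and the plain Ruby sketch below does the same thing on a tiny data set. The helper name pearson_r is my own invention for illustration (it is not the ruby_pearson method from the previous article), and it assumes a Ruby recent enough to have Array#sum.

```ruby
# Hypothetical helper, not part of the Pearson class from the earlier article.
# Computes Pearson's r for two equal-length numeric arrays.
def pearson_r(x, y)
  raise ArgumentError, "arrays must be the same length" unless x.length == y.length

  n = x.length.to_f
  mean_x = x.sum / n
  mean_y = y.sum / n

  # Sum of cross products and sums of squared deviations from the means
  sxy = x.zip(y).sum { |a, b| (a - mean_x) * (b - mean_y) }
  sxx = x.sum { |a| (a - mean_x)**2 }
  syy = y.sum { |b| (b - mean_y)**2 }

  sxy / Math.sqrt(sxx * syy)
end

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

puts pearson_r(x, y)          # perfectly linear, so r is 1.0
puts pearson_r(x, y.reverse)  # perfectly inverse, so r is -1.0
```

Feeding the same two arrays to the R-based method should give matching values, up to floating point rounding.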

Here's a quick modification to the benchmark to compare performance:

require 'benchmark'

n = 100000
x = []; n.times{x << rand}
y = []; n.times{y << rand}

p = Pearson.new()

Benchmark.bm() do |bm|
  bm.report("Ruby:") {p.ruby_pearson(x,y)}
  bm.report("GSL:") {p.gsl_pearson(x,y)}
  bm.report("Inline:") {p.inline_pearson(n,x,y)}
  bm.report("R:") {p.R_pearson(x,y)}
end

And the results:


	user	system	total	real
Ruby:	1.590000	0.020000	1.610000	(1.610470)
GSL:	0.010000	0.000000	0.010000	(0.062538)
Inline:	0.010000	0.000000	0.010000	(0.004548)
R:	0.220000	0.010000	0.230000	(0.227184)

The R version is around 7 times faster than the native Ruby version, but not as fast as the C-based approaches described earlier.
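To put numbers on that comparison, here is a small sketch that works out the speedup of each approach over the pure Ruby baseline. The figures are the "real" column from the run above; the variable names are just for illustration, and your own timings will of course differ.

```ruby
# Wall-clock ("real") times in seconds, copied from the benchmark output
real = {
  "Ruby"   => 1.610470,
  "GSL"    => 0.062538,
  "Inline" => 0.004548,
  "R"      => 0.227184
}

# How many times faster each approach is than the pure Ruby baseline
baseline = real["Ruby"]
speedup = real.transform_values { |seconds| baseline / seconds }

speedup.each do |label, factor|
  puts format("%-8s %7.1fx", "#{label}:", factor)
end
```

On these numbers R comes out at roughly 7x, GSL at roughly 26x and the inline C version at roughly 350x.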

The real power of interfacing with R, however, is the ability to quickly swap out one algorithm for another, or to experiment interactively with your data in an irb console. Once you have an algorithm that works well for your case, you can re-implement it in a faster language if performance is a concern.
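As an example of the kind of swap I mean, R's 'cor' function takes a method argument that selects "pearson", "kendall" or "spearman", so in R itself trying a rank-based correlation is a one-word change (check the rsruby documentation for how named arguments are passed through the bindings). To see what that saves you, here is a rough plain Ruby sketch of Spearman's rho. The helper names ranks and spearman_rho are hypothetical, and the rank-difference formula used here assumes there are no tied values in either array.

```ruby
# Hypothetical helper: ranks the values from 1 (smallest) to n (largest).
# Assumes there are no tied values.
def ranks(values)
  order = values.each_with_index.sort_by { |value, _| value }.map { |_, index| index }
  result = Array.new(values.length)
  order.each_with_index { |original_index, position| result[original_index] = position + 1 }
  result
end

# Hypothetical helper: Spearman's rho via the rank-difference formula,
# which is exact only when neither array contains ties.
def spearman_rho(x, y)
  raise ArgumentError, "arrays must be the same length" unless x.length == y.length

  n = x.length
  d_squared = ranks(x).zip(ranks(y)).sum { |rx, ry| (rx - ry)**2 }
  1.0 - (6.0 * d_squared) / (n * (n**2 - 1))
end

x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [1.0, 8.0, 27.0, 64.0, 125.0]  # monotonic in x, but not linear

puts spearman_rho(x, y)            # 1.0, since the rank order is identical
puts spearman_rho(x, y.reverse)    # -1.0, since the rank order is reversed
```

That is a fair amount of code to write and test by hand, which is exactly what makes the one-argument change in R attractive while you are still experimenting.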

I'll be talking about R and Ruby a little more in the future, so subscribe to the RSS feed if you're interested.