Identify Programming Languages with SourceClassifier
Do you need to identify the programming language used in a snippet of code, for example in a pastie-style application or in your blog's comment system? I've just released version 0.2.1 of SourceClassifier over on GitHub.
SourceClassifier identifies programming languages using a Bayesian classifier trained on a corpus generated from the "Computer Language Benchmarks Game":http://shootout.alioth.debian.org/. It is written in Ruby and available as a gem. Out of the box, SourceClassifier recognises C, Java, JavaScript, Perl, Python and Ruby. A nice advantage of using a Bayesian classifier to identify the source code is that even false matches will still give some usable highlighting. To train the classifier to identify new languages, download the sources from GitHub.
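To give a feel for what "Bayesian classifier" means here, the following is a minimal naive Bayes sketch in plain Ruby. It is not the gem's implementation (that work is delegated to the Classifier gem); the class name TinyBayes, the tokenizer and the two tiny training snippets are all made up for illustration, and it assumes equal prior probability for every language.

```ruby
# A toy naive Bayes classifier over source tokens.
# Illustration only: TinyBayes is a hypothetical name, not part of SourceClassifier.
class TinyBayes
  def initialize
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) } # label => { token => count }
    @totals = Hash.new(0)                            # label => tokens seen
    @vocab  = {}                                     # every distinct token seen
  end

  # Split text into identifier-like words and single punctuation characters.
  def tokenize(text)
    text.scan(/[A-Za-z_]+|[^\sA-Za-z_0-9]/)
  end

  def train(label, text)
    tokenize(text).each do |tok|
      @counts[label][tok] += 1
      @totals[label] += 1
      @vocab[tok] = true
    end
  end

  # Return the label with the highest summed log-probability,
  # using add-one smoothing and assuming equal priors for all labels.
  def classify(text)
    tokens = tokenize(text)
    scores = @counts.keys.map do |label|
      denominator = @totals[label] + @vocab.size
      score = tokens.sum do |tok|
        Math.log((@counts[label][tok] + 1.0) / denominator)
      end
      [label, score]
    end
    scores.max_by { |_, score| score }.first
  end
end

bayes = TinyBayes.new
bayes.train("Ruby", "def greet(name)\n  puts name\nend\n")
bayes.train("C", "#include <stdio.h>\nint main() { printf(\"hi\"); return 0; }\n")
puts bayes.classify("def add(a, b)\n  a + b\nend") # prints "Ruby"
```

With a real corpus the idea is the same: tokens such as def and end pull the score towards Ruby, while #include and braces pull it towards C.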
Usage
First, install the gem using GitHub as a source:
$ gem sources -a http://gems.github.com
$ sudo gem install chrislo-sourceclassifier
Then, to use it:
require 'rubygems'
require 'sourceclassifier'

s = SourceClassifier.new

ruby_text = <<EOT
def my_sorting_function(a)
  a.sort
end
EOT

c_text = <<EOT
#include <unistd.h>
int main() {
  write(1, "hello world\n", 12);
  return(0);
}
EOT

s.identify(ruby_text) #=> Ruby
s.identify(c_text) #=> Gcc
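Since identify returns the training label rather than a conventional language name (Gcc for C in the example above), an application that feeds a syntax highlighter may want a small translation step. Here is one possible sketch. The names LEXER_FOR_LABEL, lexer_for and StubClassifier are hypothetical, the lexer names are assumptions about your highlighter, and the stub only stands in for SourceClassifier so the snippet runs without the gem.

```ruby
# Hypothetical mapping from classifier labels to highlighter lexer names.
# "Gcc" and "Ruby" are the only labels shown in this post; the lexer
# names on the right are assumptions, not something the gem provides.
LEXER_FOR_LABEL = { "Gcc" => "c" }.freeze

# Works with any object that responds to identify(source).
def lexer_for(classifier, source)
  label = classifier.identify(source).to_s
  return "text" if label.empty?                 # nothing identified: plain text
  LEXER_FOR_LABEL.fetch(label) { label.downcase } # unmapped labels are lowercased
end

# Stand-in for SourceClassifier so this sketch is self-contained.
class StubClassifier
  def identify(source)
    source.include?("#include") ? "Gcc" : "Ruby"
  end
end

puts lexer_for(StubClassifier.new, "#include <unistd.h>") # prints "c"
puts lexer_for(StubClassifier.new, "def foo; end")        # prints "ruby"
```

In a real application you would pass SourceClassifier.new in place of the stub and extend the mapping for whichever labels your training directories use.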
Training
Download the sources from GitHub and, in that directory, run the training rake task:
$ rake train
In the ./sources directory are sub-directories for each language you wish to identify. Each sub-directory contains examples of programs written in that language. The name of the directory is significant: it is the value returned by the SourceClassifier.identify() method, which is why the C example above is reported as Gcc.
The rake task populate can be used to build these sub-directories from a checkout of the computer language shootout sources, but you are free to train the classifier using any available examples.
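Before training on your own examples, it can help to check what the classifier will actually see. The sketch below lists each label directory and how many example files it holds. The helper name training_summary is made up, and it assumes examples sit directly inside each language directory rather than in nested folders; it does not call into the gem at all.

```ruby
# Summarise a training directory laid out as one sub-directory per language.
# training_summary is a hypothetical helper, not part of SourceClassifier.
# Assumes example files sit directly inside each language directory.
def training_summary(root = "sources")
  Dir.children(root).sort.each_with_object({}) do |name, summary|
    path = File.join(root, name)
    next unless File.directory?(path) # skip stray files next to the label dirs

    summary[name] = Dir.children(path).count do |entry|
      File.file?(File.join(path, entry))
    end
  end
end

# Only print when run from a checkout that has a ./sources directory.
if File.directory?("sources")
  training_summary.each do |label, count|
    puts "#{label}: #{count} example file(s)"
  end
end
```

A label with very few examples is worth topping up before running rake train, since the directory name is exactly what identify will return.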
Acknowledgments
This library depends heavily on the great Classifier gem by Lucas Carlson and David Fayram II.