sofia-ml
A fork of the sofia-ml machine learning program. The project is written mainly in C++ and C and is licensed under Apache-2.0.
Project homepage: http://code.google.com/p/sofia-ml/
==Introduction==
The suite of fast incremental algorithms for machine learning (sofia-ml) can be used for training models for classification or ranking, using several different techniques. This release is intended to aid researchers and practitioners who require fast methods for classification and ranking on large, sparse data sets.
Supported learners include:
* Pegasos SVM
* Stochastic Gradient Descent (SGD) SVM
* Passive-Aggressive Perceptron
* Perceptron with Margins
* ROMMA
These learners can be configured for classification and ranking, with several sampling methods available.
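To give a sense of the core idea behind the fastest of these learners, the Pegasos update (following Shalev-Shwartz et al., cited in the References below) can be sketched as follows. This is an illustrative dense-vector reimplementation under simplified assumptions, not the sofia-ml source, which operates on sparse vectors:

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

// One Pegasos step: take a regularized (sub)gradient step on the hinge
// loss for a single sampled example, then project w back onto the L2
// ball of radius 1/sqrt(lambda). Dense vectors are used for clarity.
void PegasosStep(std::vector<double>& w, const std::vector<double>& x,
                 double y, double lambda, int t) {
  double eta = 1.0 / (lambda * t);  // Pegasos learning rate schedule
  double dot = 0.0;
  for (size_t i = 0; i < w.size(); ++i) dot += w[i] * x[i];
  for (size_t i = 0; i < w.size(); ++i) {
    w[i] *= (1.0 - eta * lambda);               // L2 regularization shrink
    if (y * dot < 1.0) w[i] += eta * y * x[i];  // hinge-loss subgradient
  }
  double norm = 0.0;
  for (double wi : w) norm += wi * wi;
  norm = std::sqrt(norm);
  double radius = 1.0 / std::sqrt(lambda);
  if (norm > radius)
    for (double& wi : w) wi *= radius / norm;   // projection step
}
```

Because each step touches only one example, training cost is independent of the data set size, which is where the fast training times reported below come from.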
This implementation gives very fast training times. For example, 100,000 Pegasos SVM training iterations can be performed on data from the CCAT task of the RCV1 benchmark data set (roughly 780,000 examples) in 0.1 CPU seconds on an ordinary 2.4GHz laptop, with no loss in classification performance compared with other SVM methods. On the LETOR learning-to-rank benchmark tasks, 100,000 Pegasos SVM rank steps complete in 0.2 CPU seconds on an ordinary laptop.
The primary computational bottleneck is actually reading the data from disk; sofia-ml reads and parses data substantially faster than the other SVM packages we tested. For example, sofia-ml can read and parse data nearly 10 times faster than the reference Pegasos implementation by Shalev-Shwartz, and nearly 3 times faster than svm_perf by Joachims.
This package provides a commandline utility for training models and using them to predict on new data, and also exposes an API for model training and prediction. The underlying libraries for data sets, weight vectors, and example vectors are also provided for researchers wishing to use these classes to implement other algorithms.
==Quick Start==
These quick-start instructions assume the use of the unix/linux commandline, with g++ installed. There are no external code dependencies.
Step 1: Check out the code:
svn checkout http://sofia-ml.googlecode.com/svn/trunk/sofia-ml sofia-ml-read-only
Step 2: Compile the code:
cd sofia-ml-read-only/src/
make
ls ../sofia-ml
The executable should be in the main sofia-ml-read-only directory. To build and run the unit tests:
make clean
make all_test
Step 3: Test the code:
cd ..
./sofia-ml
This should display the set of commandline flags and descriptions.
./sofia-ml --learner_type pegasos --loop_type stochastic --lambda 0.1 --iterations 100000 --dimensionality 150000 --training_file demo/demo.train --model_out demo/model
This should display something like the following:
Reading training data from: demo/demo.train
Time to read training data: 0.056134
Time to complete training: 0.075364
Writing model to: demo/model
Done.
./sofia-ml --model_in demo/model --test_file demo/demo.train --results_file demo/results.txt
Should display the following:
Reading model from: demo/model
Done.
Reading test data from: demo/demo.train
Time to read test data: 0.046729
Time to make test prediction results: 0.000844
Writing test results to: demo/results.txt
Done.
head -5 demo/results.txt
Format is: <prediction> \t <label>. Each line of the results file corresponds to the same line (in order) in the test file.
1.02114 1
1.18046 1
-1.24609 -1
-1.12822 -1
-1.41046 -1
Note that exact results may vary slightly because these algorithms train
by randomly sampling one example at a time.
perl eval.pl demo/results.txt
Should display something like:
Results for demo/results.txt:
Accuracy 0.9880 (using threshold 0.00) (988/1000)
Precision 0.9719 (using threshold 0.00) (311/320)
Recall 0.9904 (using threshold 0.00) (311/314)
ROC area: 0.999406
Total of 1000 trials.
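The metrics that eval.pl reports can be computed directly from the (prediction, label) pairs in the results file. A minimal sketch, assuming a decision threshold of 0.0 (this is an illustrative reimplementation, not the eval.pl source):

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Metrics {
  double accuracy, precision, recall;
};

// Compute accuracy, precision, and recall at threshold 0.0 over
// (prediction, true label) pairs, as read from a results file.
Metrics Evaluate(const std::vector<std::pair<double, double>>& results) {
  int tp = 0, fp = 0, tn = 0, fn = 0;
  for (const auto& r : results) {
    bool predicted_pos = r.first > 0.0;   // prediction vs. threshold 0.0
    bool actual_pos = r.second > 0.0;     // true label from test file
    if (predicted_pos && actual_pos) ++tp;
    else if (predicted_pos && !actual_pos) ++fp;
    else if (!predicted_pos && actual_pos) ++fn;
    else ++tn;
  }
  Metrics m;
  m.accuracy = static_cast<double>(tp + tn) / results.size();
  m.precision = tp + fp > 0 ? static_cast<double>(tp) / (tp + fp) : 0.0;
  m.recall = tp + fn > 0 ? static_cast<double>(tp) / (tp + fn) : 0.0;
  return m;
}
```

For example, precision is the fraction of positive predictions that were correct (311/320 above), while recall is the fraction of true positives that were found (311/314 above).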
==Data Format==
This package uses the popular SVM-light sparse data format.
The feature ids are expected to be in ascending numerical order. The lowest allowable feature id is 1 (0 is reserved internally for the bias term). Any feature not specified is assumed to have value 0, allowing for sparse representation.
The class label for test data is required but not used; it's okay to put in a dummy placeholder value such as 0 for test data. For binary-class classification problems, the training labels should be 1 or -1. For ranking problems, the labels may be any numeric value, with higher values being judged as "more preferred".
Currently, the comment string is not used. However, it is available for use in other algorithms, and can also be useful to aid in bookkeeping of data files.
Examples:
1 1:1.2 3:-0.5
-1 qid:3 5011:1.2
-1 1:7 3:-0.5#This example is especially interesting.
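The format above is simple enough to parse with a few lines of code. The following is an illustrative sketch (not the parser sofia-ml actually uses), assuming whitespace-separated tokens and `#` as the comment delimiter:

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct Example {
  double label = 0.0;
  long qid = 0;                                  // 0 when no qid: is present
  std::vector<std::pair<int, double>> features;  // (feature id, value) pairs
  std::string comment;
};

// Parse one SVM-light line: "<label> [qid:<n>] <id>:<val> ... [#comment]".
Example ParseSvmLightLine(const std::string& line) {
  Example ex;
  std::string body = line;
  std::string::size_type hash = line.find('#');
  if (hash != std::string::npos) {               // split off the comment
    ex.comment = line.substr(hash + 1);
    body = line.substr(0, hash);
  }
  std::istringstream in(body);
  in >> ex.label;
  std::string token;
  while (in >> token) {                          // remaining key:value tokens
    std::string::size_type colon = token.find(':');
    std::string key = token.substr(0, colon);
    std::string value = token.substr(colon + 1);
    if (key == "qid") {
      ex.qid = std::atol(value.c_str());
    } else {
      ex.features.emplace_back(std::atoi(key.c_str()),
                               std::atof(value.c_str()));
    }
  }
  return ex;
}
```

Parsing the third example above yields label -1, features (1, 7) and (3, -0.5), and the comment string; a qid token, when present, is stored separately for use by the ranking loops.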
==Commandline Details==
File Input and Output
* --model_in: file from which to read a previously trained model
* --model_out: file to which the trained model is written
* --training_file: file containing training data
* --test_file: file containing test data
* --results_file: file to which prediction results are written
Learning Options
* --learner_type: learning algorithm to use (see the supported learners above)
* --loop_type: sampling loop to use (for example, stochastic)
* --eta_type: learning rate schedule
* --dimensionality: dimensionality of the feature space; must exceed the largest feature id
* --iterations: number of training iterations to perform
* --lambda: regularization parameter for Pegasos and SGD SVM
* --passive_aggressive_c: C parameter for the Passive-Aggressive learner
* --passive_aggressive_lambda: lambda parameter for the Passive-Aggressive learner
* --perceptron_margin_size: margin size for the Perceptron with Margins learner
* --hash_mask_bits: number of bits to use for feature hashing
Other Options
* --random_seed: seed for the random number generator
* --training_objective: report the value of the objective function on the training data after training
==References==
If you use this source code for scientific research, please cite the following:
* D. Sculley. Large Scale Learning to Rank. NIPS Workshop on Advances in Ranking, 2009. Presents the indexed sampling methods used for learning to rank, including the rank and roc loops.
Additional reading and references:
* K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. J. Mach. Learn. Res., 7, 2006. Presents the Passive-Aggressive Perceptron algorithm.
* T. Joachims. Optimizing search engines using clickthrough data. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002. Presents the RankSVM objective function, a pairwise objective function used by the rank loop method in sofia-ml.
* Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Mach. Learn., 46(1-3), 2002. Presents the ROMMA algorithm.
* S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML ’07: Proceedings of the 24th international conference on Machine learning, 2007. Presents the Pegasos SVM algorithm.