Immgen is a free project written mainly in Python and R.
It provides some basic tools to access and normalise the IMMGEN arrays available from GEO.
This project aims to wrap up a set of tools using Python and R (Bioconductor and aroma.affymetrix) in order to easily process data generated using the IMMGEN protocol.
For more information about the Immunological Genome Project see their website at http://www.immgen.org
The raw data sets are made available via the GEO database, and can be found using the GEO accession ID GSE15907.
This project is part of the NIH Nanomedicine Center for Mechanobiology. The code was written in the Wiggins Group at Columbia University.
This will only work on a UNIX computer. It will definitely NOT work on Windows.
You'll need a special R package, aroma.affymetrix, for this to work.
It allows you to normalise this many arrays in finite memory (i.e. on your laptop).
You'll need the following things from Bioconductor
First you need to set up your folder structure in which to store the data. This is mandated by the aroma.affymetrix package. A valid, populated folder structure is generated using the following python command:
python preprocess_setup.py /path/to/rootfolder
This will download and unzip the WHOLE immgen corpus which is, at time of writing, 508 CEL files. So be prepared to a) wait and b) use a lot of space.
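As a rough illustration of the folder structure mentioned above, here is a minimal Python sketch that creates an empty skeleton in the style aroma.affymetrix expects (rawData/<dataset>/<chipType> and annotationData/chipTypes/<chipType>). This is not the content of preprocess_setup.py. The function name, the dataset name "GSE15907" and the chip type "MoGene-1_0-st-v1" are assumptions made for the example, so adjust them to whatever your setup actually uses.

```python
from pathlib import Path


def make_aroma_skeleton(root, dataset="GSE15907", chip_type="MoGene-1_0-st-v1"):
    """Create an empty aroma-style directory skeleton under `root`.

    `dataset` and `chip_type` are illustrative defaults (assumptions),
    not values read from preprocess_setup.py.
    """
    root = Path(root)
    # Raw CEL files live under rawData/<dataset>/<chipType>/
    raw_dir = root / "rawData" / dataset / chip_type
    # Chip annotation files (e.g. CDF) live under annotationData/chipTypes/<chipType>/
    annotation_dir = root / "annotationData" / "chipTypes" / chip_type
    for directory in (raw_dir, annotation_dir):
        directory.mkdir(parents=True, exist_ok=True)
    return raw_dir, annotation_dir
```

Calling make_aroma_skeleton("/path/to/rootfolder") only creates the directories; it does not download anything.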
Then run the R script to actually do all the preprocessing, using either
source("preprocess_rscript.r")
at the R prompt, or, from the shell,
R CMD BATCH preprocess_rscript.r
This takes a lot of time, and seemingly needs a lot of RAM, despite aroma's claim that these things can work in finite memory. IMPORTANT: You will need to edit preprocess_rscript.r so that the paths in there make sense on your system.
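If you would rather launch the batch run from Python (for example at the end of your own setup script), something like the following sketch might do. The two helper names are hypothetical and not part of this project; the only external command used is the R CMD BATCH call shown above.

```python
import shutil
import subprocess


def build_r_batch_command(script="preprocess_rscript.r"):
    """Return the argument list for running `script` with R CMD BATCH."""
    return ["R", "CMD", "BATCH", script]


def run_r_batch(script="preprocess_rscript.r"):
    """Run the R script in batch mode, failing loudly if R is missing."""
    if shutil.which("R") is None:
        raise RuntimeError("R was not found on PATH")
    # check=True raises CalledProcessError if R exits with a non-zero status
    subprocess.run(build_r_batch_command(script), check=True)
```

Remember that the paths inside preprocess_rscript.r still have to be edited by hand first; this wrapper does not touch them.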
This program uses Python to:
If you've already done all this manually, then the above is skipped, unless there's something wrong with the CEL files or the directory structure, in which case an error will be issued.
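To give an idea of what such a check could look like, here is a small Python sketch that inspects an existing folder and reports obvious problems. It is not the check this project performs; the function name, the default dataset name and chip type, and the specific tests (directory present, at least one CEL file, no empty files) are all assumptions chosen for illustration.

```python
from pathlib import Path


def find_layout_problems(root, dataset="GSE15907", chip_type="MoGene-1_0-st-v1"):
    """Return a list of human-readable problems found under `root`.

    An empty list means nothing obviously wrong was found. The dataset
    and chip type defaults are assumptions, not values from this project.
    """
    raw_dir = Path(root) / "rawData" / dataset / chip_type
    if not raw_dir.is_dir():
        return [f"missing directory: {raw_dir}"]

    # Match .CEL and .cel alike, in a stable order
    cel_files = sorted(p for p in raw_dir.iterdir() if p.suffix.lower() == ".cel")
    if not cel_files:
        return [f"no CEL files found in: {raw_dir}"]

    # A zero-byte file usually means an interrupted download or unzip
    return [f"empty CEL file: {p.name}" for p in cel_files if p.stat().st_size == 0]
```

A caller could raise an error whenever the returned list is non-empty and skip the download step otherwise.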
Then this program uses the aroma package in R to perform:
All of this is in the immgen.preprocess function in the preprocess.r file. Bioconductor ExpressionSet objects are then formed and saved in the userData folder in the data folder you specified.