Ms is a project mainly written in SHELL and R, based on the GPL-3.0 license.
MS thesis - improving multiple alignments by removing sequences
UNNAMED is a classifier of outlier sequences in multiple alignments
Website: https://github.com/alevchuk/ms
Requirements: GNU Linux environment R Ruby Bioperl (needed by GUIDANCE) dos2unix (needed by GUIDANCE) ROCR, ggplot2 (R packages)
Included packages: MAFFT v6.857 GUIDANCE v1.1 norMD 1_3 GNU Parallel v20110822 bio3d 1.1-3
Tested on: Debian GNU/Linux 6.0 (Lenny)
Contact: Aleksandr Levchuk [email protected]
Installing:
One command is all it takes. Simply run: make.
Takes about 4 minutes. Installs to
Running the protein sequence family cleanup: You can detect nonmember contaminates by running: ./008-remover <x.fasta> replace <x.fasta> with a valid path to any FASTA file. For usage information run "./008-remover" without any arguments.
Running the complete experiment: Data is not included in this code repository. Follow data/DATA_README for downloading current versions of CDD and Uniprot.
Once data tar.gz are downloaded, run each command manually: 002-start, 004-reshape-cdd, 006-reshape-uniprot, 008-sample, ...
006-reshape-uniprot is the most time consuming non-parallelizable task, it converts FASTA to TAB and takes 7.6 hours. 009-add-random can take 1 hour for sample size 100 and 5 samples total. 010-run-guidance and 026-run-normd are highly parallel each script will only take a few minutes, but there are 2 SAMPLE_SIZE NUM_SAMPLES * (1 + NUM_RANDOM_INJECTIONS) scripts which are generated by 010-run-guidance and 026-run-guidance. These scripts should preferably run on a cluster via qsub or on a number of local CPU-cores via GNU parallel:
ls ./trial-cdd-2011-XX-uniprot-2011-XX/010-run-guidance-data-in/tasks/* | \ ./opt/parallel-20110822/bin/parallel
ls ./trial-cdd-2011-XX-uniprot-2011-XX/026-run-normd-data-in/tasks/* | \ ./opt/parallel-20110822/bin/parallel
Both runs above can start immediately after 009-add-random is completed.
Data Availability: http://biocluster.ucr.edu/projects/GROUPBALANCER/
Contributing: This code is GPLv3. Free Software. So please don't hesitate to make a fork on GitHub, make changes, and send GitHub Pull Requests. I will gladly accept reasonable revisions and will cite you (the author) as you prefer.
Pipeline API: Almost all scripts use ./api/pipeline.sh a very small script to help avoid repeating some fine-granularity operations, useful is you forget which operations already completed and if some cluster jobs terminate.
References: Thompson, J. D., Plewniak, F., Ripp, R., Thierry, J.-C., Poch, O., 2001. Towards a reliable objective function for multiple sequence alignments. Journal of Molecular Biology 314 (4), 937-951. URL http://www.sciencedirect.com/science/article/pii/S0022283601951873
Penn, O., Privman, E., Landan, G., Graur, D., Pupko, T., Aug 2010. An alignment confidence score capturing robustness to guide tree uncertainty. Mol Biol Evol 27 (8), 1759-1767. URL http://www.hubmed.org/display.cgi?uids=20207713
O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.