ms

ms is a project written mainly in shell and R, licensed under GPL-3.0.

MS thesis - improving multiple alignments by removing sequences

UNNAMED is a classifier of outlier sequences in multiple alignments

Website: https://github.com/alevchuk/ms

Requirements:
  GNU Linux environment
  R
  Ruby
  Bioperl (needed by GUIDANCE)
  dos2unix (needed by GUIDANCE)
  ROCR, ggplot2 (R packages)

Included packages:
  MAFFT v6.857
  GUIDANCE v1.1
  norMD 1_3
  GNU Parallel v20110822
  bio3d 1.1-3

Tested on: Debian GNU/Linux 6.0 (Squeeze)

Contact: Aleksandr Levchuk [email protected]

Installing: One command is all it takes: run make. It takes about 4 minutes and installs to ./opt, a directory local to the project.

Running the protein sequence family cleanup: You can detect nonmember contaminants by running:

./008-remover <x.fasta>

Replace <x.fasta> with a valid path to any FASTA file. For usage information, run ./008-remover without any arguments.
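Before handing a file to ./008-remover, it can help to confirm that it at least looks like FASTA. The sketch below is an illustration only and is not part of the project: looks_like_fasta is a hypothetical helper that checks just one thing, that the first non-blank line starts with '>' (the FASTA header convention). It says nothing about what 008-remover itself validates.

```shell
# Hypothetical pre-check, not part of the ms repository.
# Succeeds if the file is readable and its first non-blank line
# begins with '>', as a FASTA header line does.
looks_like_fasta() {
    [ -r "$1" ] || return 1
    first=$(grep -v '^[[:space:]]*$' "$1" | head -n 1)
    case $first in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Usage sketch (x.fasta is a placeholder path):
#   looks_like_fasta x.fasta && ./008-remover x.fasta
```

The check is deliberately shallow: it only guards against passing an obviously wrong file, such as a tab-delimited table.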

Running the complete experiment: Data is not included in this code repository. Follow data/DATA_README for downloading current versions of CDD and Uniprot.

Once the data tar.gz archives are downloaded, run each command manually in numbered order: 002-start, 004-reshape-cdd, 006-reshape-uniprot, 008-sample, ...

006-reshape-uniprot is the most time-consuming non-parallelizable task: it converts FASTA to TAB and takes 7.6 hours. 009-add-random can take 1 hour for a sample size of 100 and 5 samples total.

010-run-guidance and 026-run-normd are highly parallel. Each generated script takes only a few minutes, but together 010-run-guidance and 026-run-normd generate 2 * SAMPLE_SIZE * NUM_SAMPLES * (1 + NUM_RANDOM_INJECTIONS) scripts. These scripts should preferably be run on a cluster via qsub or on a number of local CPU cores via GNU parallel:

ls ./trial-cdd-2011-XX-uniprot-2011-XX/010-run-guidance-data-in/tasks/* | \
  ./opt/parallel-20110822/bin/parallel

ls ./trial-cdd-2011-XX-uniprot-2011-XX/026-run-normd-data-in/tasks/* | \
  ./opt/parallel-20110822/bin/parallel
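For the cluster route mentioned above, the same task directories can be walked and each script handed to the scheduler. The following is a minimal sketch under assumptions: submit_tasks and the SUBMIT variable are hypothetical names, not part of this repository, and it assumes that your scheduler's qsub accepts a script path as its only argument. Setting SUBMIT=echo gives a dry run that only prints the paths.

```shell
# Hypothetical helper, not part of the ms repository.
# Submits every file in a tasks directory with qsub (or with the
# command named in SUBMIT, e.g. SUBMIT=echo for a dry run).
submit_tasks() {
    tasks_dir=$1
    submit_cmd=${SUBMIT:-qsub}
    for task in "$tasks_dir"/*; do
        [ -f "$task" ] || continue   # skip if the glob matched nothing
        $submit_cmd "$task"
    done
}

# Usage sketch (dry run first, then a real submission):
#   SUBMIT=echo submit_tasks ./trial-cdd-2011-XX-uniprot-2011-XX/010-run-guidance-data-in/tasks
#   submit_tasks ./trial-cdd-2011-XX-uniprot-2011-XX/010-run-guidance-data-in/tasks
```

Any queue, resource, or working-directory options depend on the local cluster setup and are left out here on purpose.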

Both runs above can start immediately after 009-add-random is completed.
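To get a feel for how many task scripts to expect, the count can be worked out in the shell. This sketch assumes the expression for the number of generated scripts is the product 2 * SAMPLE_SIZE * NUM_SAMPLES * (1 + NUM_RANDOM_INJECTIONS). SAMPLE_SIZE=100 and NUM_SAMPLES=5 are taken from the 009-add-random timing note; NUM_RANDOM_INJECTIONS=3 is a made-up value for illustration, and count_task_scripts is a hypothetical helper.

```shell
# Hypothetical helper, not part of the ms repository.
# Arguments: SAMPLE_SIZE NUM_SAMPLES NUM_RANDOM_INJECTIONS
count_task_scripts() {
    sample_size=$1
    num_samples=$2
    num_random_injections=$3
    # Factor 2: one set of scripts for GUIDANCE, one for norMD.
    echo $(( 2 * sample_size * num_samples * (1 + num_random_injections) ))
}

# NUM_RANDOM_INJECTIONS=3 is an assumed value for illustration only.
count_task_scripts 100 5 3   # prints 4000
```

At a few minutes per script, a count in the thousands is why a cluster or several local cores is the preferred way to run them.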

Data Availability: http://biocluster.ucr.edu/projects/GROUPBALANCER/

Contributing: This code is free software under GPLv3, so please don't hesitate to fork it on GitHub, make changes, and send pull requests. I will gladly accept reasonable revisions and will cite you (the author) as you prefer.

Pipeline API: Almost all scripts use ./api/pipeline.sh, a very small script that helps avoid repeating some fine-granularity operations. It is useful if you forget which operations have already completed or if some cluster jobs terminate.
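The general idea of skipping operations that have already completed can be illustrated with marker files. The sketch below is not the contents of ./api/pipeline.sh and does not describe its interface; run_once and the marker-file convention are assumptions made up for this example.

```shell
# Hypothetical illustration, NOT the real ./api/pipeline.sh.
# Runs a command only if its marker file does not exist yet, and
# creates the marker only when the command succeeds.
run_once() {
    marker=$1
    shift
    if [ -e "$marker" ]; then
        echo "skip: $marker already done"
        return 0
    fi
    "$@" && touch "$marker"
}

# Usage sketch (step.done and some-step are placeholder names):
#   run_once ./step.done ./some-step
```

Because the marker is written only after success, a step interrupted by a terminated cluster job is simply run again on the next invocation.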

References: Thompson, J. D., Plewniak, F., Ripp, R., Thierry, J.-C., Poch, O., 2001. Towards a reliable objective function for multiple sequence alignments. Journal of Molecular Biology 314 (4), 937-951. URL http://www.sciencedirect.com/science/article/pii/S0022283601951873

Penn, O., Privman, E., Landan, G., Graur, D., Pupko, T., Aug 2010. An alignment confidence score capturing robustness to guide tree uncertainty. Mol Biol Evol 27 (8), 1759-1767. URL http://www.hubmed.org/display.cgi?uids=20207713

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
