AutoSOME-Seq: Software and Documentation

Description

AutoSOME-Seq is a tool for clustering large biological sequence datasets without prior knowledge of cluster number or structure. It is best suited to short sequences (5-20 characters) and medium-length sequences (up to 50 characters) with compositional bias and varied levels of degeneracy (e.g., protein motifs or domains), though longer sequences are also acceptable. For an overview of the method, see Newman and Cooper, 2011 (search for AutoSOME-TR).

Download

Click here. No installation is necessary; simply unzip the contents into a single directory.

Requirements

Java 1.6+. Allocating more than 1.6 GB RAM requires 64-bit Java. Multiple CPUs are recommended for faster performance.

Usage

java -Xmx1600m -Xms1600m -jar autosome-seq.jar input.in parameters

Run without arguments for help. Allocate more memory (in MB) using -Xmx and -Xms arguments (e.g., 10 GB: -Xmx10000m -Xms10000m).

1) Input format:

Text file with three tab-delimited fields per row: sequence id (tab) consensus motif (tab) full sequence

If there is no consensus motif, the second and third fields should be identical.

Example input 1 (small set of miscellaneous peptides; sequence ids represent previously defined clusters)

Example input 2 (protein tandem repeat domains from the Florida lancelet Branchiostoma floridae)
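The three-field input format described above can be produced with a short script. The following is a minimal Python sketch, not part of AutoSOME-Seq: the function names format_record and write_input are hypothetical, and the sequence ids and sequences are made up for illustration.

```python
def format_record(seq_id, full_sequence, consensus=None):
    """Return one input row: sequence id, consensus motif, full sequence."""
    # With no consensus motif, the second and third fields are identical.
    if consensus is None:
        consensus = full_sequence
    fields = [seq_id, consensus, full_sequence]
    for field in fields:
        # A tab or line break inside a field would corrupt the row layout.
        if any(ch in field for ch in "\t\r\n"):
            raise ValueError(f"field contains a tab or line break: {field!r}")
    return "\t".join(fields)


def write_input(records, path):
    """Write (seq_id, full_sequence, consensus_or_None) tuples, one row each."""
    with open(path, "w", newline="\n") as handle:
        for seq_id, full_sequence, consensus in records:
            handle.write(format_record(seq_id, full_sequence, consensus) + "\n")


# Made-up example rows, for illustration only.
print(format_record("seq1", "PPVYKPPVEKPP", consensus="PPVYK"))
print(format_record("seq2", "GGSGGS"))
```

Calling write_input with a list of such tuples and a file name (for example input.in) yields a file that can be passed as the input argument shown under Usage.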

2) Main options (default value):

-N		perform nucleotide sequence clustering (amino acid clustering)
-t	Integer	set number of threads (# available CPUs)
-Z		use dipeptide/dinucleotide vector for compositional clustering (false)
-I	0-1	set consensus error threshold for homology clustering (0.3)
-p	0-1	set p-value threshold for compositional clustering (0.01)
-e	Integer	set number of individual runs to merge into ensemble (20)
-v		display previous clustering results: input=clustering output text file (false)
-D	Directory	set output directory (parent directory of input file)
-$		write intermediate data to temp folder to save memory (false)
-S	Integer	set maximum cluster size for performing sequence alignments (3000)
-T	Integer	set maximum cluster size for identifying sub-clusters (10)
-U	File	set amino acid substitution alphabet for compositional clustering (none) example
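When AutoSOME-Seq is launched from a script, the command line from the Usage section can be assembled programmatically. The following is a minimal Python sketch: build_command is a hypothetical helper, and it assumes the jar is named autosome-seq.jar and sits in the working directory. Only flags that take no value (-N, -Z) are shown, because the exact syntax for options that take a value is not stated here; run the program without arguments to check it.

```python
import shlex


def build_command(input_path, memory_mb=1600, options=(), jar="autosome-seq.jar"):
    """Assemble the java command as a list of arguments."""
    memory_mb = int(memory_mb)
    if memory_mb <= 0:
        raise ValueError("memory_mb must be a positive number of megabytes")
    heap = f"{memory_mb}m"
    # -Xmx and -Xms are set to the same value, as in the Usage example.
    # Options follow the input file, matching "input.in parameters".
    return ["java", f"-Xmx{heap}", f"-Xms{heap}", "-jar", jar, input_path, *options]


# Nucleotide clustering with dinucleotide vectors and 10 GB of memory.
cmd = build_command("input.in", memory_mb=10000, options=["-N", "-Z"])
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) to start the run. Passing options=["-v"] with a clustering output text file as input_path gives the command that opens the results browser.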

3) Output:

AutoSOME-Seq produces several HTML and text output files, named as follows:

AutoSOME_inputName_X_EY_PvalZ_Summary.html
AutoSOME_inputName_X_EY_PvalZ.html
AutoSOME_inputName_X_EY_PvalZ.txt

where X = COMPOSITION or HOMOLOGY, Y = number of ensemble runs, and Z = p-value threshold.
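For scripts that collect results from many runs, the naming pattern above can be split back into its parts. The following is a minimal Python sketch: parse_output_name is a hypothetical helper, and it assumes that X, Y, and Z are substituted literally into the pattern (for example, a file named AutoSOME_input_HOMOLOGY_E20_Pval0.01.txt). The exact formatting of the p-value in real file names has not been verified, so it is kept as text.

```python
import re

# Pattern: AutoSOME_inputName_X_EY_PvalZ[_Summary].(html|txt)
OUTPUT_NAME = re.compile(
    r"^AutoSOME_(?P<input_name>.+)"
    r"_(?P<mode>COMPOSITION|HOMOLOGY)"
    r"_E(?P<runs>\d+)"
    r"_Pval(?P<p_value>.+?)"
    r"(?P<summary>_Summary)?"
    r"\.(?P<extension>html|txt)$"
)


def parse_output_name(filename):
    """Return the parts of an output file name, or None if it does not match."""
    match = OUTPUT_NAME.match(filename)
    if match is None:
        return None
    return {
        "input_name": match["input_name"],
        "mode": match["mode"],
        "runs": int(match["runs"]),
        "p_value": match["p_value"],  # kept as text (formatting is an assumption)
        "is_summary": match["summary"] is not None,
        "extension": match["extension"],
    }


# Hypothetical file name built from the pattern, for illustration only.
print(parse_output_name("AutoSOME_input_HOMOLOGY_E20_Pval0.01_Summary.html"))
```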

4) Open and browse results:

To display cluster results, run AutoSOME-Seq with the command line option (-v) and use the text file (.txt) corresponding to your cluster output as input. A graphical user interface will appear.

Bawazer et al. cluster results

All identified sequence clusters (download).
To display clusters, download the text file and open it using the AutoSOME-Seq browser (-v option). Allocate at least 1 GB RAM (see Usage). To view sequence clusters as sequence logo diagrams (e.g., Figure 2A), select View>motif logo>frequency view, then select View>motif logo>nucleotide colors.

References

1) Newman AM, Cooper JB (2010) AutoSOME: a clustering method for identifying gene expression modules without prior knowledge of cluster number. BMC Bioinformatics 11: 117.

2) Newman AM, Cooper JB (2011) Global analysis of proline-rich tandem repeat proteins reveals broad phylogenetic diversity in plant secretomes. PLoS ONE 6(8): e23167.

3) Bawazer LA*, Newman AM*, Gu Q, Ibish A, Arcila M, Kosik KS, Cooper JB, Meldrum F, Morse DE (2012) Efficient selection of biomineralizing DNA aptamers using next generation sequencing and population clustering. (submitted)

Last updated: 06-30-13