AutoSOME-Seq: Software and Documentation


  • Description
  • Download
  • Requirements
  • Usage
  • Bawazer et al. cluster results
  • References

  • Description

  • AutoSOME-Seq is a tool for clustering large biological sequence datasets without prior knowledge of cluster number or structure. It is best suited to short sequences (5-20 characters) and medium-length sequences (up to 50 characters) with compositional bias and varied levels of degeneracy, such as protein motifs or domains, though longer sequences are also acceptable. For an overview of the method, see Newman and Cooper, 2011 (search for AutoSOME-TR).

  • Download

    Click here to download. No installation is necessary; simply unzip the contents into a single directory.


  • Requirements

  • Java 1.6+; to allocate more than 1.6 GB of RAM, 64-bit Java is required. Multiple CPUs are recommended for faster performance.

  • Usage

  • java -Xmx1600m -Xms1600m -jar autosome-seq.jar parameters

  • Run without arguments for help. Allocate more memory (in MB) using -Xmx and -Xms arguments (e.g., 10 GB: -Xmx10000m -Xms10000m).
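The memory arguments described above can also be assembled by a small script. The following is a minimal Python sketch, not part of AutoSOME-Seq: the helper name `build_command` is hypothetical, and `parameters` is passed through unchanged because its exact syntax is the one printed by the tool's own help.

```python
def build_command(memory_mb, parameters, jar="autosome-seq.jar"):
    """Return the argument list for launching AutoSOME-Seq with a given heap size.

    memory_mb  -- heap size in MB, used for both -Xmx and -Xms as in the text above
    parameters -- list of AutoSOME-Seq arguments, passed through unchanged
    jar        -- path to the jar file (assumed to be in the working directory)
    """
    if memory_mb <= 0:
        raise ValueError("memory_mb must be positive")
    heap = f"{int(memory_mb)}m"
    return ["java", f"-Xmx{heap}", f"-Xms{heap}", "-jar", jar, *parameters]


# Example: request about 10 GB; with no parameters the tool shows its help.
command = build_command(10000, [])
print(" ".join(command))
```

The returned list can be handed to `subprocess.run(command)` from the Python standard library to start the run.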
  • 1) Input format:

  • Text file with three tab-delimited fields per row: sequence id (tab) consensus motif (tab) full sequence

  • Example input 1 (small set of miscellaneous peptides; sequence ids represent previously defined clusters)
  • Example input 2 (protein tandem repeat domains from the Florida lancelet Branchiostoma floridae)
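The three-field input format can be produced and checked with a short script. This is a minimal Python sketch under stated assumptions: the helper names `write_input` and `count_valid_rows` are hypothetical, and the two records are made-up placeholders rather than real AutoSOME-Seq example data.

```python
import os
import tempfile


def write_input(path, records):
    """Write (sequence id, consensus motif, full sequence) rows as tab-delimited text."""
    with open(path, "w") as handle:
        for record in records:
            fields = [str(field) for field in record]
            if len(fields) != 3:
                raise ValueError(f"expected 3 fields, got {len(fields)}: {record!r}")
            if any("\t" in field or "\n" in field for field in fields):
                raise ValueError(f"fields must not contain tabs or newlines: {record!r}")
            handle.write("\t".join(fields) + "\n")


def count_valid_rows(path):
    """Count non-empty rows, raising if any row lacks exactly three tab-delimited fields."""
    rows = 0
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:
                continue
            if len(line.split("\t")) != 3:
                raise ValueError(f"line {line_number}: expected 3 tab-delimited fields")
            rows += 1
    return rows


# Placeholder records: (sequence id, consensus motif, full sequence).
records = [
    ("id_1", "PPVYK", "PPVYKPPVYKPPVHK"),
    ("id_2", "GGSGG", "GGSGGGGTGGGGSGG"),
]
input_path = os.path.join(tempfile.mkdtemp(), "example_input.txt")
write_input(input_path, records)
print(count_valid_rows(input_path))
```

Checking the field count before a long clustering run is a cheap way to catch rows where a tab was lost or doubled during export.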
  • 2) Main options (default value):

    Option  Argument   Description (default)
    -N                 perform nucleotide sequence clustering (amino acid clustering)
    -t      Integer    set number of threads (# available CPUs)
    -Z                 use dipeptide/dinucleotide vector for compositional clustering (false)
    -I      0-1        set consensus error threshold for homology clustering (0.3)
    -p      0-1        set p-value threshold for compositional clustering (0.01)
    -e      Integer    set number of individual runs to merge into ensemble (20)
    -v                 display previous clustering results: input=clustering output text file (false)
    -D      Directory  set output directory (parent directory of input file)
    -$                 write intermediate data to temp folder to save memory (false)
    -S      Integer    set maximum cluster size for performing sequence alignments (3000)
    -T      Integer    set maximum cluster size for identifying sub-clusters (10)
    -U      File       set amino acid substitution alphabet for compositional clustering (none)
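Option values can be checked against the ranges listed above before a run is launched. The following Python sketch is illustrative only: the helper name `find_option_problems` is hypothetical, the 0-1 range is assumed to be inclusive, and integer options are assumed to require a value of at least 1. It does not build a command line, since the exact argument syntax is the one shown by the tool's help.

```python
# Flags whose value is documented as a number in the range 0-1.
UNIT_INTERVAL_FLAGS = ("-I", "-p")
# Flags whose value is documented as an Integer.
INTEGER_FLAGS = ("-t", "-e", "-S", "-T")


def find_option_problems(options):
    """Return a list of messages describing values outside the documented ranges.

    options -- dict mapping a flag (e.g. "-p") to its intended value
    """
    problems = []
    for flag, value in options.items():
        is_bool = isinstance(value, bool)  # bool is a subclass of int, so exclude it
        if flag in UNIT_INTERVAL_FLAGS:
            if is_bool or not isinstance(value, (int, float)) or not 0 <= value <= 1:
                problems.append(f"{flag} expects a number from 0 to 1, got {value!r}")
        elif flag in INTEGER_FLAGS:
            if is_bool or not isinstance(value, int) or value < 1:
                problems.append(f"{flag} expects a positive integer, got {value!r}")
    return problems


# The documented defaults pass the check; an out-of-range p-value does not.
print(find_option_problems({"-I": 0.3, "-p": 0.01, "-e": 20}))
print(find_option_problems({"-p": 1.5}))
```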

  • 3) Output:

  • AutoSOME-Seq produces several html and text output files, as detailed below:
  • where, X = COMPOSITION/HOMOLOGY, Y = number of ensemble runs, Z = p-value threshold
  • 4) Open and browse results:

  • To display cluster results, run AutoSOME-Seq with the command line option (-v) and use the text file (.txt) corresponding to your cluster output as input. A graphical user interface will appear.

  • Bawazer et al. cluster results


  • References

  • 1) Newman AM, Cooper JB (2010) AutoSOME: a clustering method for identifying gene expression modules without prior knowledge of cluster number. BMC Bioinformatics 11: 117.

  • 2) Newman AM, Cooper JB (2011) Global analysis of proline-rich tandem repeat proteins reveals broad phylogenetic diversity in plant secretomes. PLoS ONE 6(8): e23167.

  • 3) Bawazer LA*, Newman AM*, Gu Q, Ibish A, Arcila M, Kosik KS, Cooper JB, Meldrum F, Morse DE (2012) Efficient selection of biomineralizing DNA aptamers using next generation sequencing and population clustering. (submitted)

  • Last updated: 06-30-13