Benchmark Datasets

The columns of each benchmark dataset below have been normalized to the range 0-100, so the datasets can be clustered by AutoSOME without further processing. All benchmark datasets except rings and bars were obtained from the University of California, Irvine (UCI) Machine Learning Repository.
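For readers who want to prepare their own data the same way, the sketch below shows one common way to bring each column into the 0-100 range. It assumes plain min-max scaling, which this page does not state explicitly; the function name `scale_columns_0_100` and the sample values are hypothetical.

```python
def scale_columns_0_100(rows):
    """Min-max scale each column of a row-major table into [0, 100].

    Assumption: min-max scaling is used; the exact method applied to the
    benchmark files is not documented here.
    """
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        if span == 0:
            # A constant column carries no information; map it to 0.
            scaled_columns.append([0.0] * len(col))
        else:
            scaled_columns.append([(v - lo) / span * 100.0 for v in col])
    # Transpose back to row-major order.
    return [list(r) for r in zip(*scaled_columns)]


# Hypothetical sample: three observations, two features.
sample = [[5.1, 3.5], [4.9, 3.0], [6.3, 3.3]]
scaled = scale_columns_0_100(sample)
```

After scaling, the smallest value in every non-constant column is 0 and the largest is 100.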

rings

Two interlocking orthogonal rings

bars

Eight straight lines: two horizontal and six vertical

wine

Wine dataset

zoo

Zoo animals dataset

iris

Iris dataset

wisc

Breast cancer Wisconsin (original) dataset

derm

Dermatology dataset

Microarray Datasets

The microarray datasets available for download here are provided without normalization. To replicate the results in Newman AM, Cooper JB (2010) BMC Bioinformatics, 11:117, cluster both cancer cell line datasets using unit variance normalization, Euclidean distance for distance matrix construction, and 500 ensemble runs (comparable results are possible with as few as 50-100 ensemble runs).
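To make the two preprocessing terms above concrete, here is a minimal sketch of unit variance normalization followed by a Euclidean distance matrix. It is not AutoSOME's code: it assumes that unit variance normalization means centering each column on zero and dividing by its population standard deviation, and that distances are computed between rows. Both function names are hypothetical.

```python
import math


def unit_variance_columns(rows):
    """Give each column zero mean and unit variance (assumed definition)."""
    out_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        if std == 0:
            # Constant column: nothing to scale.
            out_cols.append([0.0] * len(col))
        else:
            out_cols.append([(v - mean) / std for v in col])
    return [list(r) for r in zip(*out_cols)]


def euclidean_distance_matrix(rows):
    """Symmetric matrix of pairwise Euclidean distances between rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical expression table: three items, two measurements each.
expression = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
normalized = unit_variance_columns(expression)
distances = euclidean_distance_matrix(normalized)
```

In AutoSOME itself these steps are selected through the program's settings; the sketch only illustrates what the options compute under the stated assumptions.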

For the GSE11508 dataset, we used 100 ensemble runs and the following normalization settings: unit variance, median centering of rows and columns, and sum of squares = 1 for rows and columns. All other parameters were left at their default values. (We used the Write Ensemble Runs to Disk option because of memory limitations. 30-50 ensemble runs should yield comparable results, and depending on your RAM, writing intermediate runs to disk may not be necessary.)
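The median centering and sum of squares settings can be illustrated with the short sketch below. The order in which AutoSOME applies these steps to rows and columns is not given on this page, so the order used here (rows, then columns, for each step) is an assumption, and all helper names are hypothetical.

```python
import statistics


def median_center(values):
    """Subtract the median so the vector's median becomes 0."""
    med = statistics.median(values)
    return [v - med for v in values]


def scale_sum_of_squares(values):
    """Rescale a vector so that its squared entries sum to 1."""
    total = sum(v * v for v in values)
    if total == 0:
        # An all-zero vector cannot be rescaled; return it unchanged.
        return list(values)
    norm = total ** 0.5
    return [v / norm for v in values]


def apply_to_rows(table, fn):
    return [fn(list(row)) for row in table]


def apply_to_columns(table, fn):
    cols = [fn(list(col)) for col in zip(*table)]
    return [list(row) for row in zip(*cols)]


# Hypothetical 3 x 3 table; the application order is an assumption.
table = [[1.0, 4.0, 7.0], [2.0, 6.0, 3.0], [9.0, 5.0, 8.0]]
step = apply_to_rows(table, median_center)
step = apply_to_columns(step, median_center)
step = apply_to_rows(step, scale_sum_of_squares)
normalized = apply_to_columns(step, scale_sum_of_squares)
```

Note that when steps are applied one after another, only the last one is guaranteed to hold exactly: in this sketch every column ends with a sum of squares of 1, while the rows are only approximately normalized.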

cancer cells I

Filtered microarray dataset (de Souto et al. 2008)
primary dataset: Alizadeh et al. 2000 (Nature, 403:503)

cancer cells II

Filtered microarray dataset (Brunet et al. 2004)
primary dataset: Golub et al. 1999 (Science, 286:531)

human cell lines (GSE11508)

Original microarray dataset
primary dataset: Müller et al. 2008 (Nature, 455:401)

All of the Above

Download