Benchmark Datasets
The columns of each benchmark dataset below are normalized into the range 0-100 and can be clustered by AutoSOME without further processing. All benchmark datasets were obtained from the University of California, Irvine Machine Learning Repository, except for the rings and bars datasets.
rings
Two interlocking orthogonal rings
bars
Eight straight lines: two horizontal and six vertical
wine
Wine dataset
zoo
Zoo animals dataset
iris
Iris dataset
wisc
Breast cancer Wisconsin (original) dataset
derm
Dermatology dataset
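For readers preparing their own data in the same way as the benchmark datasets, the 0-100 column normalization can be sketched as follows. This is a minimal illustration that assumes a linear min-max rescaling of each column; the page does not state exactly how the benchmark files were scaled, and the function name and the sample values are hypothetical.

```python
def scale_columns_0_100(rows):
    """Linearly rescale each column of a list-of-rows table into [0, 100].

    Assumption: min-max scaling per column. Constant columns are mapped
    to 0.0, which is an arbitrary choice made for this sketch.
    """
    scaled_columns = []
    for col in zip(*rows):  # iterate over columns
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_columns.append(
            [0.0 if span == 0 else 100.0 * (v - lo) / span for v in col]
        )
    # Transpose back to a list of rows.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical toy table: three samples, two features.
data = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
scaled = scale_columns_0_100(data)
```

After this step every column has a minimum of 0 and a maximum of 100, so features measured on different scales contribute comparably to distance calculations.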
Microarray Datasets
The microarray datasets available for download here have not been normalized by us. To replicate the results in Newman AM, Cooper JB (2010) BMC Bioinformatics, 11:117, cluster both cancer cell line datasets using unit variance normalization, Euclidean distance for distance matrix construction, and 500 ensemble runs (comparable results are possible with as few as 50-100 ensemble runs).
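The two preprocessing terms named above, unit variance normalization and Euclidean distance, can be illustrated outside of AutoSOME with a short sketch. It assumes that unit variance normalization means shifting each column to zero mean and dividing by its population standard deviation, and that distances are computed between rows; AutoSOME's own implementation may differ in both respects. The function names and sample values are hypothetical.

```python
import math
import statistics


def unit_variance_columns(rows):
    """Give each column zero mean and unit (population) standard deviation.

    Assumption: per-column z-scoring; constant columns are mapped to 0.0.
    """
    out = []
    for col in zip(*rows):
        mean = statistics.fmean(col)
        sd = statistics.pstdev(col)
        out.append([0.0 if sd == 0 else (v - mean) / sd for v in col])
    return [list(r) for r in zip(*out)]


def euclidean_distance_matrix(rows):
    """Return the symmetric matrix of pairwise Euclidean distances between rows."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


# Hypothetical toy expression table: three rows, two columns.
expression = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
normalized = unit_variance_columns(expression)
distances = euclidean_distance_matrix(normalized)
```

This only mirrors the settings conceptually; the ensemble runs themselves are performed by AutoSOME and are not reproduced here.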
For the GSE11508 dataset, we used 100 ensemble runs and the following normalization settings: unit variance, median-centering of rows and columns, and sum of squares = 1 for rows and columns. Other parameters were set to default values. (We used the Write Ensemble Runs to Disk option due to memory limitations; 30-50 ensemble runs should yield comparable results, and depending on your RAM, writing intermediate runs to disk may not be necessary.)
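The median-centering and sum of squares = 1 settings can be sketched in the same spirit. The code below is only an illustration of what each operation does to a table; the order in which AutoSOME applies them, and whether it iterates them, is not stated here and is not assumed. All function names and sample values are hypothetical.

```python
import math
import statistics


def transpose(rows):
    """Swap rows and columns of a list-of-rows table."""
    return [list(col) for col in zip(*rows)]


def median_center_rows(rows):
    """Subtract each row's median from its values."""
    centered = []
    for row in rows:
        m = statistics.median(row)
        centered.append([v - m for v in row])
    return centered


def unit_sum_of_squares_rows(rows):
    """Scale each row so its squared values sum to 1 (all-zero rows stay zero)."""
    scaled = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        scaled.append([v / norm if norm else 0.0 for v in row])
    return scaled


def apply_to_columns(row_fn, rows):
    """Apply a row-wise function to the columns by transposing twice."""
    return transpose(row_fn(transpose(rows)))


# Hypothetical toy table: two rows, three columns.
table = [[1.0, 2.0, 6.0], [4.0, 5.0, 9.0]]
row_centered = median_center_rows(table)
col_centered = apply_to_columns(median_center_rows, table)
row_scaled = unit_sum_of_squares_rows(table)
```

Note that a single pass over the rows followed by a pass over the columns generally leaves the rows only approximately normalized, since the column step changes the row values again.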
cancer cells I
Filtered microarray dataset
(de Souto et al. 2008)
primary dataset: Alizadeh et al. 2000 (Nature, 403:503)
cancer cells II
Filtered microarray dataset
(Brunet et al. 2004)
primary dataset: Golub et al. 1999 (Science, 286:531)
human cell lines (GSE11508)
Original microarray dataset
primary dataset: Müller et al. 2008 (Nature, 455:401)
All of the Above
Download