Dataset Downloads
The following simulated datasets can be used to test and compare the performance of gene set/pathway-based approaches.
Simulated Datasets Used in GSAASeqSP Paper
gsaaseqsp_simulation.zip includes 200 simulated replicates for six scenarios S1-S6 (for details, see the paper "GSAASeqSP: A toolset for gene set association analysis of RNA-Seq data"). Each simulated replicate includes the following files: exp.cls, the phenotype label file for the gene expression dataset, and geneset.gmt, the gene set dataset file.
Scenario S1-S6 |
---|
download gsaaseqsp_simulation.zip |
Simulated Datasets Used in GSAA Paper
gsaa_simulation_oddratio_1.1_1.3.zip includes 200 simulated replicates for five scenarios S1-S5 (for details, see the paper "Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets"). For these simulations, the odds ratios for causal loci were drawn from the uniform distribution U [1.1, 1.3].
gsaa_simulation_oddratio_1.2_1.4.zip includes 200 simulated replicates for five scenarios S1-S5. For these simulations, the odds ratios for causal loci were drawn from the uniform distribution U [1.2, 1.4].
Each simulated replicate includes the following files: exp.cls, the phenotype label file for the gene expression dataset, and geneset.gmt, the gene set dataset file.
U [1.1, 1.3] | U [1.2, 1.4] |
---|---|
download gsaa_simulation_oddratio_1.1_1.3.zip | download gsaa_simulation_oddratio_1.2_1.4.zip |
The following datasets were based on a different simulation strategy.
Gene Expression Datasets
Each gene expression dataset includes 1000 genes, of which only the first 20 are causal. Gene expression values were drawn from normal distributions. Three different scenarios were simulated (the three pairs of distributions in the table below); each scenario contains 30 replicates.
Sample size | N(10.5, 1) & N(10, 1) | N(10.3, 1) & N(10, 1) | N(10.1, 1) & N(10, 1) |
---|---|---|---|
100 (50 cases & 50 controls) | download exp_100_10.5.zip | download exp_100_10.3.zip | download exp_100_10.1.zip |
200 (100 cases & 100 controls) | download exp_200_10.5.zip | download exp_200_10.3.zip | download exp_200_10.1.zip |
400 (200 cases & 200 controls) | download exp_400_10.5.zip | download exp_400_10.3.zip | download exp_400_10.1.zip |
1200 (600 cases & 600 controls) | download exp_1200_10.5.zip | download exp_1200_10.3.zip | download exp_1200_10.1.zip |
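The simulation design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script that produced the downloadable files: it assumes that, for the 20 causal genes, cases are drawn from the first distribution (for example N(10.5, 1)) and controls from N(10, 1), and that all non-causal genes are drawn from N(10, 1) in both groups. The function name `simulate_expression` and its parameters are hypothetical.

```python
import random

def simulate_expression(n_cases, n_controls, case_mean, control_mean=10.0,
                        sd=1.0, n_genes=1000, n_causal=20, seed=0):
    """Return (matrix, labels); matrix[g][s] is the value of gene g in sample s."""
    rng = random.Random(seed)
    labels = [1] * n_cases + [0] * n_controls  # 1 = case, 0 = control
    matrix = []
    for g in range(n_genes):
        causal = g < n_causal  # assumption: only the first n_causal genes differ
        row = []
        for label in labels:
            # Causal genes are shifted in cases only; everything else uses control_mean.
            mean = case_mean if (causal and label == 1) else control_mean
            row.append(rng.gauss(mean, sd))
        matrix.append(row)
    return matrix, labels

# One replicate of the "100 samples, N(10.5, 1) & N(10, 1)" setting.
matrix, labels = simulate_expression(n_cases=50, n_controls=50, case_mean=10.5)
```

Changing `case_mean` to 10.3 or 10.1 and the group sizes to 100, 200 or 600 gives the other settings in the table.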
SNP Datasets
Each SNP dataset includes 1000 genes, of which the first 20 are causal. Each causal gene covers three SNP markers, and only the second marker is in LD with the disease variant. All other genes also have three SNP markers, but none of them is in LD with the disease variant.
Genotype data were generated by SIMLA (see the paper "Extension of the SIMLA package for generating pedigrees with complex inheritance patterns: environmental covariates, gene-gene and gene-environment interaction"). We first generated genotype data for pedigrees and then took the proband of each pedigree to form unrelated population samples.
Parameters for the disease models were based on the susceptibility locus rs17221417, uncovered by a published Crohn's disease (CD) genome-wide association study (see the paper "Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls"). We chose this locus because it represents caspase recruitment domain-containing protein 15 (CARD15, also called NOD2), which is the first confirmed CD-susceptibility gene (see the papers "A frameshift mutation in NOD2 associated with susceptibility to Crohn's disease" and "Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease"), and because it has a moderate genotype relative risk. Based on information for this locus, we set the risk allele frequency for the simulation to 0.287, and the homozygote and heterozygote genotype relative risks to 1.617 and 1.08, respectively. We set the disease prevalence in the population to 0.001985 to match the estimated prevalence of CD in North America (see the paper "The epidemiology and natural history of Crohn's disease in population-based patient cohorts from North America: a systematic review").
In this simulation study, causal genes are detected by indirect association, based on the LD between markers and causal variants. Five different scenarios were simulated (the five R^2 values in the table below); each scenario contains 30 replicates.
Two types of disease models, a dominant model and a recessive model, were simulated. Data files with "_d" are from the dominant disease model, while data files with "_r" are from the recessive disease model.
Sample size | R^2=1 | R^2=0.9 | R^2=0.7 | R^2=0.5 | R^2=0.3 |
---|---|---|---|---|---|
100 (50 cases & 50 controls) | | | | | |
200 (100 cases & 100 controls) | | | | | |
400 (200 cases & 200 controls) | | | | | |
1200 (600 cases & 600 controls) | | | | | |
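To make the disease-model parameters quoted above concrete, the following sketch converts them into genotype-specific penetrances. It is a standard textbook calculation, not SIMLA's internal procedure, and it rests on two assumptions that the text does not state: genotype frequencies follow Hardy-Weinberg equilibrium, and both relative risks are measured against the non-risk homozygote. It uses the two stated relative risks directly and does not model the separate dominant and recessive variants. The function name `penetrances` is hypothetical.

```python
def penetrances(risk_allele_freq, grr_het, grr_hom, prevalence):
    """Return (genotype frequencies, penetrances) keyed by 'aa', 'Aa', 'AA'."""
    p = risk_allele_freq
    q = 1.0 - p
    # Assumption: Hardy-Weinberg genotype frequencies; 'A' is the risk allele.
    geno_freq = {"aa": q * q, "Aa": 2 * p * q, "AA": p * p}
    # Assumption: relative risks are relative to the non-risk homozygote 'aa'.
    rel_risk = {"aa": 1.0, "Aa": grr_het, "AA": grr_hom}
    # Prevalence = baseline penetrance * frequency-weighted mean relative risk.
    mean_rr = sum(geno_freq[g] * rel_risk[g] for g in geno_freq)
    baseline = prevalence / mean_rr
    pen = {g: baseline * rel_risk[g] for g in geno_freq}
    return geno_freq, pen

geno_freq, pen = penetrances(0.287, grr_het=1.08, grr_hom=1.617,
                             prevalence=0.001985)
for genotype in ("aa", "Aa", "AA"):
    print(genotype, geno_freq[genotype], pen[genotype])
```

By construction, the frequency-weighted average of the three penetrances equals the population prevalence of 0.001985.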
Phenotype Label Files & Gene List File
Sample size | Phenotype label files (expression) | Phenotype label files (SNP) | Gene list file |
---|---|---|---|
100 (50 cases & 50 controls) | download pheno_100_exp.cls | download pheno_100_snp.cls | download genes.txt |
200 (100 cases & 100 controls) | download pheno_200_exp.cls | download pheno_200_snp.cls | |
400 (200 cases & 200 controls) | download pheno_400_exp.cls | download pheno_400_snp.cls | |
1200 (600 cases & 600 controls) | download pheno_1200_exp.cls | download pheno_1200_snp.cls | |
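A phenotype label file can be read with a short parser such as the one below. This is a sketch that assumes the .cls files follow the categorical CLS layout used by GSEA-style tools: a first line with the number of samples and classes, a second line starting with "#" that names the classes, and a third line with one label per sample. This page does not confirm that layout, so check a downloaded file first. The function name `parse_cls` and the example content are hypothetical.

```python
def parse_cls(text):
    """Parse categorical CLS text and return (class_names, labels)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Line 1: "<number of samples> <number of classes> 1"
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    # Line 2: "# <class name> <class name> ..."
    class_names = lines[1].lstrip("#").split()
    # Line 3: one label per sample, separated by whitespace
    labels = lines[2].split()
    if len(class_names) != n_classes:
        raise ValueError("class count does not match the header line")
    if len(labels) != n_samples:
        raise ValueError("sample count does not match the header line")
    return class_names, labels

# Hypothetical six-sample example, not taken from the downloadable files.
example = "6 2 1\n# case control\n1 1 1 0 0 0\n"
class_names, labels = parse_cls(example)
```

For a real file, pass the result of `open(path).read()` to `parse_cls`.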
Gene Set Datasets
Each gene set dataset includes 100 simulated gene sets, each containing 20 genes. Only the first gene set includes causal genes, and the percentage of causal genes (PRG) in that first gene set varies across four simulated scenarios:
20/20 | 15/20 | 10/20 | 5/20 |
---|---|---|---|
download gt_20_20.gmt | download gt_15_20.gmt | download gt_10_20.gmt | download gt_5_20.gmt |
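The share of causal genes in each gene set can be checked with a small script. The sketch below assumes the usual GMT layout, in which each line is tab-separated and holds a gene set name, a description, and then the member genes. The gene identifiers and the functions `parse_gmt` and `causal_fraction` are hypothetical; the real identifiers should come from genes.txt.

```python
def parse_gmt(text):
    """Parse GMT text and return {gene set name: list of genes}."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        # Field 0 is the set name, field 1 a description, the rest are genes.
        gene_sets[fields[0]] = [g for g in fields[2:] if g]
    return gene_sets

def causal_fraction(genes, causal):
    """Fraction of the genes in one set that belong to the causal set."""
    return sum(g in causal for g in genes) / len(genes)

# Hypothetical miniature example: two sets of four genes each.
causal = {"G1", "G2", "G3", "G4"}
example = "set1\tna\tG1\tG2\tG7\tG8\nset2\tna\tG5\tG6\tG9\tG10\n"
sets = parse_gmt(example)
fractions = {name: causal_fraction(genes, causal)
             for name, genes in sets.items()}
```

Applied to a file such as gt_10_20.gmt, with the first 20 genes taken as causal, the first set would be expected to yield 0.5 and all other sets 0 if the files match the description above.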