GSAA - gene set association analysis

Dataset Downloads

The following simulated datasets can be used to test and compare the performance of gene set/pathway based approaches.

Simulated Datasets Used in GSAASeqSP Paper

gsaaseqsp_simulation.zip includes 200 simulated replicates for six scenarios S1-S6 (see the paper "GSAASeqSP: A toolset for gene set association analysis of RNA-Seq data" for details).

Each simulated replicate includes the following files:
The gene expression dataset for each scenario is as follows:
Scenario S1: exp_count1.gct
Scenario S2: exp_count2.gct
Scenario S3: exp_count3.gct
Scenario S4: exp_count4.gct
Scenario S5: exp_count5.gct
Scenario S6: exp_count6.gct

The phenotype label file for the gene expression datasets is exp.cls

The gene set dataset file is geneset.gmt

Scenario S1-S6
download gsaaseqsp_simulation.zip
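
The file extensions above suggest the tab-delimited formats commonly used by gene set analysis tools. The following is a minimal parsing sketch, assuming geneset.gmt uses the usual GMT layout (set name, description, then member genes on one line) and exp.cls uses the three-line categorical CLS layout (counts, class names, per-sample labels). The function names and the in-memory sample content are hypothetical and are not taken from the archives.

```python
import io


def parse_gmt(handle):
    """Return {gene_set_name: [gene, ...]} from a tab-delimited GMT stream."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


def parse_cls(handle):
    """Return (class_names, labels) from a categorical CLS stream."""
    lines = [ln.strip() for ln in handle if ln.strip()]
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("CLS header does not match its contents")
    return class_names, labels


# Made-up content for illustration only (assumption, not real data).
gmt_text = "SET1\tna\tG1\tG2\tG3\nSET2\tna\tG4\tG5\n"
cls_text = "4 2 1\n# case control\n0 0 1 1\n"

gene_sets = parse_gmt(io.StringIO(gmt_text))
class_names, labels = parse_cls(io.StringIO(cls_text))
```

To read the real files, the same functions could be given an open file handle, for example `open("geneset.gmt")`, in place of the `io.StringIO` objects.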

Simulated Datasets Used in GSAA Paper

gsaa_simulation_oddratio_1.1_1.3.zip includes 200 simulated replicates for five scenarios S1-S5 (see the paper "Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets" for details). For these simulations, the odds ratios for causal loci were drawn from the uniform distribution U[1.1, 1.3]. gsaa_simulation_oddratio_1.2_1.4.zip includes 200 simulated replicates for five scenarios S1-S5. For these simulations, the odds ratios for causal loci were drawn from U[1.2, 1.4].

Each simulated replicate includes the following files:
The gene expression dataset and SNP dataset for each scenario are as follows:
Scenario S1: exp1.gct geno.snp
Scenario S2: exp2.gct geno1.snp
Scenario S3: exp3.gct geno.snp
Scenario S4: exp4.gct geno.snp
Scenario S5: exp5.gct geno2.snp

The phenotype label file for the gene expression datasets is exp.cls
The phenotype label file for the SNP datasets is snp.cls

The gene set dataset file is geneset.gmt

U [1.1, 1.3]	U [1.2, 1.4]
download gsaa_simulation_oddratio_1.1_1.3.zip	download gsaa_simulation_oddratio_1.2_1.4.zip

The following datasets were based on a different simulation strategy.

Gene Expression Datasets

Each gene expression dataset includes 1000 genes. Only the first 20 genes are causal genes. Gene expression values were drawn from normal distributions. Three different scenarios were simulated:
1) Expression values of causal genes in the case group are drawn from N(10.5, 1); values for all other genes, in both groups, are drawn from N(10, 1);
2) Expression values of causal genes in the case group are drawn from N(10.3, 1); values for all other genes, in both groups, are drawn from N(10, 1);
3) Expression values of causal genes in the case group are drawn from N(10.1, 1); values for all other genes, in both groups, are drawn from N(10, 1).

Each scenario contains 30 replicates.
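
The sampling scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the code used to produce the downloadable files: it assumes that causal genes in the control group follow the same N(10, 1) baseline as non-causal genes, that cases are listed before controls, and it uses a hypothetical function name and seed.

```python
import random


def simulate_expression(n_cases, n_controls, case_mean,
                        n_genes=1000, n_causal=20, seed=0):
    """Return a list of gene rows; each row holds one value per sample."""
    rng = random.Random(seed)
    rows = []
    for gene in range(n_genes):
        causal = gene < n_causal  # the first n_causal genes are causal
        row = []
        for sample in range(n_cases + n_controls):
            is_case = sample < n_cases  # assumption: cases come first
            # Only causal genes in case samples get the shifted mean.
            mean = case_mean if (causal and is_case) else 10.0
            row.append(rng.gauss(mean, 1.0))
        rows.append(row)
    return rows


# Scenario 1 at the smallest sample size: 50 cases and 50 controls.
rows = simulate_expression(50, 50, case_mean=10.5)

# Average case-minus-control difference over the 20 causal genes.
case_vals = [v for row in rows[:20] for v in row[:50]]
ctrl_vals = [v for row in rows[:20] for v in row[50:]]
shift = sum(case_vals) / len(case_vals) - sum(ctrl_vals) / len(ctrl_vals)
```

The value of `shift` should land close to the 0.5 difference in means for scenario 1, and it shrinks towards 0.3 and 0.1 when `case_mean` is set to 10.3 and 10.1.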

Sample size	N(10.5, 1) & N(10, 1)	N(10.3, 1) & N(10, 1)	N(10.1, 1) & N(10, 1)
100 (50 cases & 50 controls)	download exp_100_10.5.zip	download exp_100_10.3.zip	download exp_100_10.1.zip
200 (100 cases & 100 controls)	download exp_200_10.5.zip	download exp_200_10.3.zip	download exp_200_10.1.zip
400 (200 cases & 200 controls)	download exp_400_10.5.zip	download exp_400_10.3.zip	download exp_400_10.1.zip
1200 (600 cases & 600 controls)	download exp_1200_10.5.zip	download exp_1200_10.3.zip	download exp_1200_10.1.zip

SNP Datasets

Each SNP dataset includes 1000 genes. The first 20 genes are causal genes. Each causal gene covers three SNP markers, of which only the second is in LD with the disease variant. All other genes also have three SNP markers, but none of them is in LD with the disease variant.

Genotype data were generated by SIMLA (see the paper "Extension of the SIMLA package for generating pedigrees with complex inheritance patterns: environmental covariates, gene-gene and gene-environment interaction"). We first generated genotype data for pedigrees and then took the proband of each pedigree to form unrelated population samples.

Parameters for the disease models were based on the susceptibility locus rs17221417, identified by a published Crohn’s disease (CD) genome-wide association study (see the paper "Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls"). We chose this locus because it represents caspase recruitment domain-containing protein 15 (CARD15, also called NOD2), which is the first confirmed CD-susceptibility gene (see the papers "A frameshift mutation in NOD2 associated with susceptibility to Crohn's disease" and "Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease"), and because it has a moderate genotype relative risk. Based on information for this locus, we set the risk allele frequency for the simulation to 0.287, and the homozygote and heterozygote genotype relative risks to 1.617 and 1.08, respectively. We set the disease prevalence in the population to 0.001985 to match the estimated prevalence of CD in North America (see the paper "The epidemiology and natural history of Crohn's disease in population-based patient cohorts from North America: a systematic review").
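
To give a feel for what the quoted parameters imply, the sketch below converts the risk allele frequency, genotype relative risks and prevalence into per-genotype disease probabilities (penetrances). It is a textbook-style calculation and not a description of how SIMLA was configured: it assumes Hardy-Weinberg genotype frequencies, treats the relative risks as relative to the non-risk homozygote, and does not model the separate dominant and recessive settings. The function name is hypothetical.

```python
def genotype_penetrances(risk_allele_freq, grr_het, grr_hom, prevalence):
    """Return (penetrance, genotype_freq) dictionaries keyed by genotype."""
    p = risk_allele_freq
    q = 1.0 - p
    # Hardy-Weinberg genotype frequencies (an assumption of this sketch).
    freq = {"aa": q * q, "Aa": 2 * p * q, "AA": p * p}
    # Relative risks, taking the non-risk homozygote "aa" as the reference.
    rel_risk = {"aa": 1.0, "Aa": grr_het, "AA": grr_hom}
    # prevalence = sum over genotypes of freq * baseline * relative risk
    baseline = prevalence / sum(freq[g] * rel_risk[g] for g in freq)
    penetrance = {g: baseline * rel_risk[g] for g in freq}
    return penetrance, freq


penetrance, freq = genotype_penetrances(0.287, 1.08, 1.617, 0.001985)

# Sanity check: averaging penetrance over genotypes recovers the prevalence.
implied_prevalence = sum(freq[g] * penetrance[g] for g in freq)
```

Under these assumptions the baseline penetrance for the non-risk homozygote comes out slightly below the population prevalence, because the two risk genotypes carry the excess.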

In this simulation study, we detect causal genes by indirect association, based on the LD between markers and causal variants. Five different scenarios were simulated:
1) R^2 between the disease variant and the second marker is 1 for all causal genes;
2) R^2 between the disease variant and the second marker is 0.9 for all causal genes;
3) R^2 between the disease variant and the second marker is 0.7 for all causal genes;
4) R^2 between the disease variant and the second marker is 0.5 for all causal genes;
5) R^2 between the disease variant and the second marker is 0.3 for all causal genes.

Each scenario contains 30 replicates. Two types of disease models, a dominant model and a recessive model, were simulated. Data files with "_d" in the name are from the dominant disease model, while data files with "_r" are from the recessive disease model.
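
For readers unfamiliar with the R^2 values that define the five scenarios, the sketch below uses the standard definition for two biallelic loci, r^2 = D^2 / (pA(1-pA) pB(1-pB)) with D = pAB - pA*pB. The helper that inverts the formula assumes, purely for illustration, that the marker and the disease variant have the same allele frequency; that assumption and both function names are not taken from the simulation itself.

```python
import math


def ld_r_squared(hap_ab, freq_a, freq_b):
    """r^2 between two biallelic loci from haplotype and allele frequencies."""
    d = hap_ab - freq_a * freq_b  # linkage disequilibrium coefficient D
    return d * d / (freq_a * (1 - freq_a) * freq_b * (1 - freq_b))


def haplotype_freq_for_r2(target_r2, freq):
    """Haplotype frequency giving target_r2 when both loci share `freq`.

    Assumption: equal allele frequencies at both loci and positive D,
    so that D = sqrt(r^2) * freq * (1 - freq).
    """
    return freq * freq + math.sqrt(target_r2) * freq * (1 - freq)


# Round trip for the five scenario values at a risk allele frequency of 0.287.
recovered = {}
for target in (1.0, 0.9, 0.7, 0.5, 0.3):
    hap = haplotype_freq_for_r2(target, 0.287)
    recovered[target] = ld_r_squared(hap, 0.287, 0.287)
```

With equal allele frequencies, R^2 = 1 means the two alleles always travel together on the same haplotype, and lower values mean the marker is a progressively noisier proxy for the disease variant.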

Sample size	R^2=1	R^2=0.9	R^2=0.7	R^2=0.5	R^2=0.3
100 (50 cases & 50 controls)	snp_100_1_d.zip snp_100_1_r.zip	snp_100_0.9_d.zip snp_100_0.9_r.zip	snp_100_0.7_d.zip snp_100_0.7_r.zip	snp_100_0.5_d.zip snp_100_0.5_r.zip	snp_100_0.3_d.zip snp_100_0.3_r.zip
200 (100 cases & 100 controls)	snp_200_1_d.zip snp_200_1_r.zip	snp_200_0.9_d.zip snp_200_0.9_r.zip	snp_200_0.7_d.zip snp_200_0.7_r.zip	snp_200_0.5_d.zip snp_200_0.5_r.zip	snp_200_0.3_d.zip snp_200_0.3_r.zip
400 (200 cases & 200 controls)	snp_400_1_d.zip snp_400_1_r.zip	snp_400_0.9_d.zip snp_400_0.9_r.zip	snp_400_0.7_d.zip snp_400_0.7_r.zip	snp_400_0.5_d.zip snp_400_0.5_r.zip	snp_400_0.3_d.zip snp_400_0.3_r.zip
1200 (600 cases & 600 controls)	snp_1200_1_d.zip snp_1200_1_r.zip	snp_1200_0.9_d.zip snp_1200_0.9_r.zip	snp_1200_0.7_d.zip snp_1200_0.7_r.zip	snp_1200_0.5_d.zip snp_1200_0.5_r.zip	snp_1200_0.3_d.zip snp_1200_0.3_r.zip

Phenotype label files & Gene list file

Sample size	Phenotype label files (expression)	Phenotype label files (SNP)	Gene list file
100 (50 cases & 50 controls)	download pheno_100_exp.cls	download pheno_100_snp.cls	download genes.txt
200 (100 cases & 100 controls)	download pheno_200_exp.cls	download pheno_200_snp.cls
400 (200 cases & 200 controls)	download pheno_400_exp.cls	download pheno_400_snp.cls
1200 (600 cases & 600 controls)	download pheno_1200_exp.cls	download pheno_1200_snp.cls

Gene set datasets

Each gene set dataset includes 100 simulated gene sets, each containing 20 genes. Only the first gene set includes causal genes. The percentage of causal genes (PRG) in that first gene set varies across scenarios. Four different scenarios were simulated:
1) The first gene set includes 20 causal genes;
2) The first gene set includes 15 causal genes;
3) The first gene set includes 10 causal genes;
4) The first gene set includes 5 causal genes.
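
The four scenarios above can be illustrated with a small generator. This is a sketch under assumptions, not the procedure used to build the downloadable .gmt files: gene names G1 to G1000, set names SET1 to SET100, the "na" description field and the random sampling of non-causal members are all hypothetical choices.

```python
import random


def build_gene_sets(n_causal_in_first, n_genes=1000, n_causal=20,
                    n_sets=100, set_size=20, seed=0):
    """Return ({set_name: [gene, ...]}, set_of_causal_genes)."""
    rng = random.Random(seed)
    genes = [f"G{i}" for i in range(1, n_genes + 1)]  # hypothetical names
    causal, non_causal = genes[:n_causal], genes[n_causal:]
    # First set: the requested number of causal genes, padded with
    # randomly chosen non-causal genes up to set_size.
    padding = rng.sample(non_causal, set_size - n_causal_in_first)
    sets = {"SET1": causal[:n_causal_in_first] + padding}
    # Remaining sets: non-causal genes only (assumption: sets may overlap).
    for i in range(2, n_sets + 1):
        sets[f"SET{i}"] = rng.sample(non_causal, set_size)
    return sets, set(causal)


def to_gmt(sets):
    """Serialize gene sets as GMT text: name, description, then genes."""
    return "".join(
        f"{name}\tna\t" + "\t".join(members) + "\n"
        for name, members in sets.items()
    )


# Scenario 2: 15 of the 20 genes in the first set are causal.
sets, causal = build_gene_sets(15)
prg = sum(g in causal for g in sets["SET1"]) / len(sets["SET1"])
gmt_text = to_gmt(sets)
```

Here `prg` is the fraction of causal genes in the first set, which corresponds to the 20/20, 15/20, 10/20 and 5/20 labels used for the four files.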

20/20	15/20	10/20	5/20
download gt_20_20.gmt	download gt_15_20.gmt	download gt_10_20.gmt	download gt_5_20.gmt