Introduction to transcriptome-wide association studies (TWAS)

TWAS are techniques that were concurrently created by Gamazon et al, 2015 and Gusev et al, 2016 to leverage eQTLs to identify gene-trait associations. A traditional TWAS pipeline is as follows:

  1. In a tissue that is relevant to a phenotype of interest (i.e. breast tumor for breast cancer outcomes or the pre-frontal cortex for neuropsychiatric disorders), we train predictive models of tissue-specific expression from genotype.
  2. Then, in a large GWAS cohort, we use these predictive models to impute tissue-specific genetically regulated expression to detect gene-trait associations.

In training predictive expression models, most TWAS methods (e.g. PrediXcan, FUSION, TIGAR, EpiXcan, etc) focus on local genetic variation around the gene of interest (0.5 to 1 Megabase around the gene). However, Boyle et al, 2017 and Liu et al, 2019 have proposed that up to 70% of gene expression can be attributed to distal variation in the genome, suggesting that the inclusion of these distal variants in TWAS models may improve prediction and power to detect gene-trait associations.

For this reason, we have developed Multi-Omic Strategies for TWAS (MOSTWAS), an intuitive suite of tools to prioritize distal variants in transcriptomic prediction and conduct TWAS-like association testing using GWAS summary statistics. MOSTWAS incorporates two methods to include distal-eQTLs in transcriptomic prediction: Mediator-enriched TWAS (MeTWAS) and distal-eQTL prioritization via mediation analysis (DePMA).

  • MeTWAS first identifies \(m\) mediators (e.g. CpG sites, miRNAs, gene coding for transcription factors, etc) such that the intensity (methylation or expression levels) of these mediators are associated to the mRNA expression. A model for the genetically regulated intensities (GRIn) is estimated using the local SNPs to these mediators, and the GRIn of these mediators are imputed into the training set. The final gene expression model is estimated by incorporating the GRIn of the mediators as fixed effects and the local SNPs to the gene as regularized effects (see Methods for more details).

  • DePMA first identified testing triplets of the gene of interest, a distal eSNP (SNP in an eQTL) to the gene, and any associated mediators local to the eQTL. We estimate the total mediation effect (TME) of the eQTL on the gene through the set of mediators and test the two-sided hypothesis of \(H_0: TME = 0\). Any distal-eSNP with a significant TME is included with the SNPs local to the gene of interest to form the final design matrix. A model including all local SNPs and all significant distal-eSNPs is fit to the expression of the gene using either elastic net or linear mixed modeling (see Methods for more details).

All MOSTWAS models output:

  • the SNP-gene effect sizes (i.e. the model),
  • cross-validation \(R^2\),
  • heritability estimate from GCTA,
  • \(P\)-value for the likelihood ratio test for heritability, and
  • predicted expression in the reference panel.

We recommend training genes with both MeTWAS and DePMA and prioritizing the model with greater cross-validation \(R^2\) for association testing.

Using MOSTWAS

MOSTWAS is meant to be run on a high-performance cluster (HPC) and integrates with GCTA to perform file formatting, linkage disequilibrium estimation, and heritability estimation. If your HPC already has PLINK and GCTA installed, be sure to load and add those modules to your personal PATH with something like this:

module add plink
module add gcta

If PLINK and GCTA aren’t automatically installed, you can simply download and install these software add them manually to your PATH:

module use /path/to/personal/plink
module use /path/to/persona/gcta
module load plink
module load gcta64
module add plink
module add gcta

Now, we can go to R to use MOSTWAS.

Installing MOSTWAS

MOSTWAS is available freely on Github. To install the package:

devtools::install_github('bhattacharya-a-bt/MOSTWAS')

Necessary input files

Let’s assume that the reference eQTL panel has \(n\) samples. MOSTWAS, at the very least, requires the input files for the following:

  1. SNP dosages and locations,
  2. gene expression and locations,
  3. mediator intensities and locations, and
  4. relevant covariates.

Mediators can include any biomarker that may influence the transcription of a gene (i.e. transcription factors, microRNAs, DNA CpG methylation sites, etc). The term “intensity” here is used as a catch-all term for expression and methylation of mediators.

MOSTWAS uses files that are formatted in the MatrixEQTL format (see the MatrixEQTL documentation). Briefly, SNP dosage, gene expression, and mediator intensity files all have \(n+1\) columns, where the first column is an identifier for the biomarker (SNP, gene, or mediator) and the last \(n\) columns are the scaled values for that biomarker. The location files have columns (in order) for the identifier, the chromosome, and the base-pair position of the biomarker. Gene location files have a fourth column with the end location of the gene.

We recommend that the file that contains the covariates include principal components of the SNP dosage matrix (in the range of 5-20, depending on the dataset). Relevant clinical covariates can be included as well. This file has \(n+1\) columns, where the first column is the identifier for the covariate and the last \(n\) columns contain the values for the covariates across the \(n\) samples.

Here’s an example of what a SNP dosage file looks like:

#>       SNP     sample1      sample2    sample3    sample4    sample5     sample6
#> 1:  snp55  0.71266631 -0.008309014 -1.1007406  0.1860351 -1.2589214  0.24350206
#> 2: snp148 -0.07356440  0.128855402  0.7145094  0.1764178 -0.2153845  1.11436266
#> 3: snp211 -0.03763417 -0.145875628 -0.2464703  0.9158482 -2.4719617 -0.07655271
#> 4:  snp79 -0.68166048 -0.163910957 -0.3197862  0.3201767 -0.6741693  0.93292505
#> 5:  snp61 -0.32427027  1.763552003  1.3626443 -0.3666873 -0.5012972 -0.15798174
#> 6:  snp56  0.06016044  0.762586512 -1.2278826 -0.9406113  1.5423258 -0.68150951
#>       sample7    sample8     sample9
#> 1: -1.5894114 -0.5572063  0.04947015
#> 2:  1.2341042 -1.1133253 -0.45554882
#> 3: -0.3354548 -1.6306354  0.65788940
#> 4: -2.0010025 -0.3878239 -1.48546741
#> 5: -0.0174584  1.7226320  1.46918455
#> 6: -0.2409669 -0.5585003 -0.22176117

Here’s an example of what a SNP location file looks like:

#>    snpid chr        pos
#> 1:  snp1  10 8066096444
#> 2:  snp2  10 8066309675
#> 3:  snp3  10 8066711134
#> 4:  snp4  10 8067381843
#> 5:  snp5  10 8065968791
#> 6:  snp6  10 8067362206

For training models using the MeTWAS() and DePMA() models, we recommend keeping the genotype files in binary PLINK format (i.e. .bed/.bim/.fam) and reading in the file using the bigsnpr package. Here’s a link to their documentation.

eQTL analysis

MeTWAS and DePMA require results for QTL analysis:

  • MeTWAS requires associations between mediators and genes, and
  • DePMA requires distal-eQTLs between SNPs and genes, and local mediator QTLs for those distal-eQTLs.

We have provided a wrapper function to do these eQTL analyses with MatrixEQTL:


eQTL_MOSTWAS(SNP_file_name = 'snps.txt',
             snps_location_file_name = 'snpLocs.txt',
             SNP_file_name_dis = 'snps_dis.txt',
             expression_file_name = 'genes.txt',
             gene_location_file_name = 'geneLocs.txt',
             mediators_file_name = 'mediators.txt',
             meds_location_file_name = 'medLocs.txt',
             covariates_file_name = 'covs.txt',
             output_file_name_loc_qtl = 'out_loc_qtl.txt',
             output_file_name_dis_qtl = 'out_dis_qtl.txt',
             output_file_name_loc_med = 'out_loc_med.txt',
             output_file_name_dis_med = 'out_dos_med.txt',
             p_loc_qtl = 1e-6,
             p_dis_qtl = 1e-6,
             FDRcut = 0.05,
             useModel = 'modelLinear',
             DePMA = F)

We have provided a toggle DePMA that takes in a logical object (i.e. TRUE or FALSE) to do either QTL analysis for DePMA or MeTWAS. MatrixEQTL is also a very intuitive software, so you can easily code your own analyses.

Please cite Shabalin 2012 if you use MOSTWAS!

Training models with MeTWAS

MOSTWAS wrapper functions have multiple options to provide users with freedom in training predictive models. It may get a little confusing, but we’ll cover the options one-by-one.

You can run MeTWAS all in one, for a given gene (call this gene test):


MeTWAS(geneInt,
       snpObj,
       mediator,
       medLocs,
       covariates,
       dimNumeric,
       qtlFull,
       h2Pcutoff = .1,
       numMed = 5,
       seed = 1218,
       k = 5,
       cisDist = 1e6,
       parallel = F,
       prune = F,
       ldThresh = .5,
       cores,
       verbose = T,
       R2Cutoff = 0.01,
       modelDir,
       tempFolder)

A few notes on these options:

  • snpObj is a bigSNP object that is read in using the bigsnpr package. This is a much more memory-efficient and faster approach.
  • dimNumeric takes in the number of leading rows of the covariates file that are genotype PCs for heritability estimation using GCTA.
  • qtlFull takes in the QTL results, as a data.frame or data.table that give associations between mediators and genes. This is in the format of MatrixEQTL output.
  • h2Pcutoff gives a \(P\)-value cutoff for heritability estimation using GCTA
  • numMed gives an upper limit to the number of top gene-associated mediators to limit computational time. Ideally, in the distal QTL analysis between mediators (predictors) and genes (outcomes)
  • parallel toggles parallel fitting of the numMed local-only models of mediators with an mclapply() implementation. The cores option allows you to set the number of parallel cores for this implementation.
  • prune togggles wheter the SNP dosages need to be LD-pruned. We have seen previously that LD-pruning may improve prediction Bhattacharya et al 2020. ldThresh provides the LD threshold for clumping.
  • snpAnnot takes in a data.frame of SNP annotations that includes three columns. This is generally needed if the SNP identifiers are not in 1000GP form (i.e. chrom:position:REF:ALT).
  • Models are outputted to modelDir and backing files for bigsnpr packages are stored in tempFolder

Training models with DePMA

You can run DePMA all in one, for a given gene (call this gene test). Many of the options are similar, so we’ll highlight the different options in DePMA:


DePMA(geneInt = 'test',
      snpObj,
      mediator,
      medLocs,
      covariates,
      cisDist = 1e6,
      qtlTra,
      qtMed,
      h2Pcutoff,
      dimNumeric,
      verbose,
      seed,
      sobel = F,
      nperms = 1000,
      k,
      parallel,
      parType = 'no',
      prune,
      ldThresh = .5,
      cores,
      qtlTra_parts,
      qtMed_parts,
      modelDir,
      tempFolder,
      R2Cutoff)

A few notes on these options:

  • qtlTra and qtMed take in outputs from eQTL_MOSTWAS if the DePMA = T toggle is set. These are data.frame or data.table objects that contain the MatrixEQTL output for associations between distal SNPs and genes and those distal-eQTLs and any local mediators. For cross-validation (generally \(k < 5\) folds), you must also run fold-wise QTL analysis and input these filenames in qtlTra_parts and qtMed_parts as character vectors.
  • sobel toggles whether the asymptotic Sobel test is used, instead of the permutation test, to test if the total mediation effect for a distal-eQTL is large. The asymptotic Sobel test is less powerful yet much faster compared to the permutation test.
  • We use the boot package for permutation. nperms sets the number of resamplings in permutation test, and parType sets the parallelization method as in the boot package.

Running multiple genes in parallel

Usually, you’ll need to train models for 15,000+ genes in an RNA-seq panel. Serially running DePMA is slow (MeTWAS is quite fast), even when parallelizing within a single gene. We recommend a batch submission procedure, perhaps using the rslurm package. Given that the genes of interest as in the vector geneList, we can simply define a data.frame of parameters and submit multiple jobs all in one:


pars = data.frame(geneInt = geneList)
batch_DeP <- function(geneInt){
  DePMA(geneInt,snpObj,mediator,...)
}
library(rslurm)
sjob <- slurm_apply(batch_DeP, pars, jobname = 'test_DePMA',
                    nodes = 2, cpus_per_node = 2, submit = TRUE)

This generates R scripts for every gene and submits them to your SLURM cluster. You can delete all the files using cleanup_files(sjob).

Other options include the BatchJobs package on CRAN or snakemake.

With slurm_apply(), we were able to fit a whole RNA-seq panel using both MeTWAS and DePMA in roughly 20 hours.

FUSION-like association testing

We provide functions for testing as well. The function burdenTest() does the weighted burned test of association and gives the option for permutation testing, as in FUSION:

burdenTest(wgt, ### .RData file name for a given gene's MOSTWAS model
           snps, ### data.frame/data.table for SNP dosages
           sumStats, ### data.frame/data.table for GWAS summary statistics
           snpAnnot = NULL,
           onlyCis = F, ### toggle to include only local SNPs
           beta, ### column name for effect sizes in GWAS summary stats
           se, ### column name for standard errors for betas in GWAS summary stats
           chr, ### column name for SNP chromosome locations in GWAS summary stats
           pos, ### column name for SNP basepair position in GWAS summary stats
           ref, ### column name for SNP reference allele in GWAS summary stats
           pval, ### column name for P-value in GWAS summary stats
           R2cutoff = 0.01, ### cross-validation R2 cutoff to test a gene
           alpha, ### P-value cutoff in TWAS association to conduct permuation test
           nperms = 1e3 ### number of permutations in permuation test)

A comment on permutation testing:

  • Permutation testing assess wheter the TWAS association persists when conditioning on large GWAS effect sizes. This test is meant to be conservatively used, so we recommend imposing a stringent \(P\)-value cutoff to prioritize a gene for permutation testing (i.e. alpha at the Bonferroni cutoff).

Distal-SNPs added last test

For genes with significant TWAS associations from burdenTest(), we may be interested in assessing whether the distal SNPs provide any added information beyond what we can observe from the local variation around the gene of interest. We developed an added-last test of distal SNPs using summary statistics. It can be run using the addedLastTest() function that take in similar options as burdenTest():

addedLastTest(wgt,
              snps,
              sumStats,
              snpAnnot = NULL,
              beta,
              se,
              chr,
              pos,
              ref,
              pval,
              R2cutoff,
              locChrom ### chromosome on which the gene of interest exists)

The added-last test need not be run only on genes prioritized for permutation testing. We recommned any significant gene in overall TWAS association to be prioritized for added-last testing. The results from this test can allow users to focus on these distal regions for functional hypothesis generation and possible follow-up studies.