MOSTWAS_vignette.Rmd
TWAS are techniques that were concurrently created by Gamazon et al, 2015 and Gusev et al, 2016 to leverage eQTLs to identify gene-trait associations. A traditional TWAS pipeline is as follows:
In training predictive expression models, most TWAS methods (e.g. PrediXcan, FUSION, TIGAR, EpiXcan, etc) focus on local genetic variation around the gene of interest (0.5 to 1 Megabase around the gene). However, Boyle et al, 2017 and Liu et al, 2019 have proposed that up to 70% of gene expression can be attributed to distal variation in the genome, suggesting that the inclusion of these distal variants in TWAS models may improve prediction and power to detect gene-trait associations.
For this reason, we have developed Multi-Omic Strategies for TWAS (MOSTWAS), an intuitive suite of tools to prioritize distal variants in transcriptomic prediction and conduct TWAS-like association testing using GWAS summary statistics. MOSTWAS incorporates two methods to include distal-eQTLs in transcriptomic prediction: Mediator-enriched TWAS (MeTWAS) and distal-eQTL prioritization via mediation analysis (DePMA).
MeTWAS first identifies \(m\) mediators (e.g. CpG sites, miRNAs, gene coding for transcription factors, etc) such that the intensity (methylation or expression levels) of these mediators are associated to the mRNA expression. A model for the genetically regulated intensities (GRIn) is estimated using the local SNPs to these mediators, and the GRIn of these mediators are imputed into the training set. The final gene expression model is estimated by incorporating the GRIn of the mediators as fixed effects and the local SNPs to the gene as regularized effects (see Methods for more details).
DePMA first identified testing triplets of the gene of interest, a distal eSNP (SNP in an eQTL) to the gene, and any associated mediators local to the eQTL. We estimate the total mediation effect (TME) of the eQTL on the gene through the set of mediators and test the two-sided hypothesis of \(H_0: TME = 0\). Any distal-eSNP with a significant TME is included with the SNPs local to the gene of interest to form the final design matrix. A model including all local SNPs and all significant distal-eSNPs is fit to the expression of the gene using either elastic net or linear mixed modeling (see Methods for more details).
All MOSTWAS models output:
We recommend training genes with both MeTWAS and DePMA and prioritizing the model with greater cross-validation \(R^2\) for association testing.
MOSTWAS is meant to be run on a high-performance cluster (HPC) and integrates with GCTA to perform file formatting, linkage disequilibrium estimation, and heritability estimation. If your HPC already has PLINK and GCTA installed, be sure to load and add those modules to your personal PATH with something like this:
If PLINK and GCTA aren’t automatically installed, you can simply download and install these software add them manually to your PATH:
module use /path/to/personal/plink
module use /path/to/persona/gcta
module load plink
module load gcta64
module add plink
module add gcta
Now, we can go to R
to use MOSTWAS.
MOSTWAS is available freely on Github. To install the package:
devtools::install_github('bhattacharya-a-bt/MOSTWAS')
Let’s assume that the reference eQTL panel has \(n\) samples. MOSTWAS, at the very least, requires the input files for the following:
Mediators can include any biomarker that may influence the transcription of a gene (i.e. transcription factors, microRNAs, DNA CpG methylation sites, etc). The term “intensity” here is used as a catch-all term for expression and methylation of mediators.
MOSTWAS uses files that are formatted in the MatrixEQTL
format (see the MatrixEQTL documentation). Briefly, SNP dosage, gene expression, and mediator intensity files all have \(n+1\) columns, where the first column is an identifier for the biomarker (SNP, gene, or mediator) and the last \(n\) columns are the scaled values for that biomarker. The location files have columns (in order) for the identifier, the chromosome, and the base-pair position of the biomarker. Gene location files have a fourth column with the end location of the gene.
We recommend that the file that contains the covariates include principal components of the SNP dosage matrix (in the range of 5-20, depending on the dataset). Relevant clinical covariates can be included as well. This file has \(n+1\) columns, where the first column is the identifier for the covariate and the last \(n\) columns contain the values for the covariates across the \(n\) samples.
Here’s an example of what a SNP dosage file looks like:
#> SNP sample1 sample2 sample3 sample4 sample5 sample6
#> 1: snp55 0.71266631 -0.008309014 -1.1007406 0.1860351 -1.2589214 0.24350206
#> 2: snp148 -0.07356440 0.128855402 0.7145094 0.1764178 -0.2153845 1.11436266
#> 3: snp211 -0.03763417 -0.145875628 -0.2464703 0.9158482 -2.4719617 -0.07655271
#> 4: snp79 -0.68166048 -0.163910957 -0.3197862 0.3201767 -0.6741693 0.93292505
#> 5: snp61 -0.32427027 1.763552003 1.3626443 -0.3666873 -0.5012972 -0.15798174
#> 6: snp56 0.06016044 0.762586512 -1.2278826 -0.9406113 1.5423258 -0.68150951
#> sample7 sample8 sample9
#> 1: -1.5894114 -0.5572063 0.04947015
#> 2: 1.2341042 -1.1133253 -0.45554882
#> 3: -0.3354548 -1.6306354 0.65788940
#> 4: -2.0010025 -0.3878239 -1.48546741
#> 5: -0.0174584 1.7226320 1.46918455
#> 6: -0.2409669 -0.5585003 -0.22176117
Here’s an example of what a SNP location file looks like:
#> snpid chr pos
#> 1: snp1 10 8066096444
#> 2: snp2 10 8066309675
#> 3: snp3 10 8066711134
#> 4: snp4 10 8067381843
#> 5: snp5 10 8065968791
#> 6: snp6 10 8067362206
For training models using the MeTWAS()
and DePMA()
models, we recommend keeping the genotype files in binary PLINK format (i.e. .bed/.bim/.fam) and reading in the file using the bigsnpr
package. Here’s a link to their documentation.
MeTWAS and DePMA require results for QTL analysis:
We have provided a wrapper function to do these eQTL analyses with MatrixEQTL
:
eQTL_MOSTWAS(SNP_file_name = 'snps.txt',
snps_location_file_name = 'snpLocs.txt',
SNP_file_name_dis = 'snps_dis.txt',
expression_file_name = 'genes.txt',
gene_location_file_name = 'geneLocs.txt',
mediators_file_name = 'mediators.txt',
meds_location_file_name = 'medLocs.txt',
covariates_file_name = 'covs.txt',
output_file_name_loc_qtl = 'out_loc_qtl.txt',
output_file_name_dis_qtl = 'out_dis_qtl.txt',
output_file_name_loc_med = 'out_loc_med.txt',
output_file_name_dis_med = 'out_dos_med.txt',
p_loc_qtl = 1e-6,
p_dis_qtl = 1e-6,
FDRcut = 0.05,
useModel = 'modelLinear',
DePMA = F)
We have provided a toggle DePMA
that takes in a logical object (i.e. TRUE
or FALSE
) to do either QTL analysis for DePMA or MeTWAS. MatrixEQTL
is also a very intuitive software, so you can easily code your own analyses.
Please cite Shabalin 2012 if you use MOSTWAS!
MOSTWAS wrapper functions have multiple options to provide users with freedom in training predictive models. It may get a little confusing, but we’ll cover the options one-by-one.
You can run MeTWAS all in one, for a given gene (call this gene test
):
MeTWAS(geneInt,
snpObj,
mediator,
medLocs,
covariates,
dimNumeric,
qtlFull,
h2Pcutoff = .1,
numMed = 5,
seed = 1218,
k = 5,
cisDist = 1e6,
parallel = F,
prune = F,
ldThresh = .5,
cores,
verbose = T,
R2Cutoff = 0.01,
modelDir,
tempFolder)
A few notes on these options:
snpObj
is a bigSNP object that is read in using the bigsnpr
package. This is a much more memory-efficient and faster approach.dimNumeric
takes in the number of leading rows of the covariates
file that are genotype PCs for heritability estimation using GCTA.qtlFull
takes in the QTL results, as a data.frame
or data.table
that give associations between mediators and genes. This is in the format of MatrixEQTL output.h2Pcutoff
gives a \(P\)-value cutoff for heritability estimation using GCTAnumMed
gives an upper limit to the number of top gene-associated mediators to limit computational time. Ideally, in the distal QTL analysis between mediators (predictors) and genes (outcomes)parallel
toggles parallel fitting of the numMed
local-only models of mediators with an mclapply()
implementation. The cores
option allows you to set the number of parallel cores for this implementation.prune
togggles wheter the SNP dosages need to be LD-pruned. We have seen previously that LD-pruning may improve prediction Bhattacharya et al 2020. ldThresh
provides the LD threshold for clumping.snpAnnot
takes in a data.frame of SNP annotations that includes three columns. This is generally needed if the SNP identifiers are not in 1000GP form (i.e. chrom:position:REF:ALT).modelDir
and backing files for bigsnpr
packages are stored in tempFolder
You can run DePMA all in one, for a given gene (call this gene test
). Many of the options are similar, so we’ll highlight the different options in DePMA:
DePMA(geneInt = 'test',
snpObj,
mediator,
medLocs,
covariates,
cisDist = 1e6,
qtlTra,
qtMed,
h2Pcutoff,
dimNumeric,
verbose,
seed,
sobel = F,
nperms = 1000,
k,
parallel,
parType = 'no',
prune,
ldThresh = .5,
cores,
qtlTra_parts,
qtMed_parts,
modelDir,
tempFolder,
R2Cutoff)
A few notes on these options:
qtlTra
and qtMed
take in outputs from eQTL_MOSTWAS
if the DePMA = T
toggle is set. These are data.frame
or data.table
objects that contain the MatrixEQTL output for associations between distal SNPs and genes and those distal-eQTLs and any local mediators. For cross-validation (generally \(k < 5\) folds), you must also run fold-wise QTL analysis and input these filenames in qtlTra_parts
and qtMed_parts
as character vectors.sobel
toggles whether the asymptotic Sobel test is used, instead of the permutation test, to test if the total mediation effect for a distal-eQTL is large. The asymptotic Sobel test is less powerful yet much faster compared to the permutation test.boot
package for permutation. nperms
sets the number of resamplings in permutation test, and parType
sets the parallelization method as in the boot
package.Usually, you’ll need to train models for 15,000+ genes in an RNA-seq panel. Serially running DePMA is slow (MeTWAS is quite fast), even when parallelizing within a single gene. We recommend a batch submission procedure, perhaps using the rslurm
package. Given that the genes of interest as in the vector geneList
, we can simply define a data.frame
of parameters and submit multiple jobs all in one:
pars = data.frame(geneInt = geneList)
batch_DeP <- function(geneInt){
DePMA(geneInt,snpObj,mediator,...)
}
library(rslurm)
sjob <- slurm_apply(batch_DeP, pars, jobname = 'test_DePMA',
nodes = 2, cpus_per_node = 2, submit = TRUE)
This generates R
scripts for every gene and submits them to your SLURM cluster. You can delete all the files using cleanup_files(sjob)
.
Other options include the BatchJobs
package on CRAN or snakemake.
With slurm_apply()
, we were able to fit a whole RNA-seq panel using both MeTWAS and DePMA in roughly 20 hours.
We provide functions for testing as well. The function burdenTest()
does the weighted burned test of association and gives the option for permutation testing, as in FUSION:
burdenTest(wgt, ### .RData file name for a given gene's MOSTWAS model
snps, ### data.frame/data.table for SNP dosages
sumStats, ### data.frame/data.table for GWAS summary statistics
snpAnnot = NULL,
onlyCis = F, ### toggle to include only local SNPs
beta, ### column name for effect sizes in GWAS summary stats
se, ### column name for standard errors for betas in GWAS summary stats
chr, ### column name for SNP chromosome locations in GWAS summary stats
pos, ### column name for SNP basepair position in GWAS summary stats
ref, ### column name for SNP reference allele in GWAS summary stats
pval, ### column name for P-value in GWAS summary stats
R2cutoff = 0.01, ### cross-validation R2 cutoff to test a gene
alpha, ### P-value cutoff in TWAS association to conduct permuation test
nperms = 1e3 ### number of permutations in permuation test)
A comment on permutation testing:
alpha
at the Bonferroni cutoff).For genes with significant TWAS associations from burdenTest()
, we may be interested in assessing whether the distal SNPs provide any added information beyond what we can observe from the local variation around the gene of interest. We developed an added-last test of distal SNPs using summary statistics. It can be run using the addedLastTest()
function that take in similar options as burdenTest()
:
addedLastTest(wgt,
snps,
sumStats,
snpAnnot = NULL,
beta,
se,
chr,
pos,
ref,
pval,
R2cutoff,
locChrom ### chromosome on which the gene of interest exists)
The added-last test need not be run only on genes prioritized for permutation testing. We recommned any significant gene in overall TWAS association to be prioritized for added-last testing. The results from this test can allow users to focus on these distal regions for functional hypothesis generation and possible follow-up studies.