Parses a GTF file and creates a similarity matrix for isoforms of a specified gene based on shared exon structure. This matrix can be used with multivariate_graph_reg for graph-regularized prediction.

Usage

similarity_from_gtf(
  gtf_path,
  gene,
  method = c("jaccard", "overlap_coef", "binary"),
  transcript_ids = NULL,
  min_transcripts = 2,
  verbose = TRUE
)

Arguments

gtf_path

character, path to GTF file (can be gzipped)

gene

character, gene name (HGNC symbol) or Ensembl gene ID

method

character, similarity method:

  • "jaccard": Jaccard index based on exon overlap (default)

  • "overlap_coef": Overlap coefficient = overlap / min(length_i, length_j)

  • "binary": 1 if any overlap, 0 otherwise

transcript_ids

character vector, optional subset of transcript IDs to include. If NULL, uses all transcripts for the gene.

min_transcripts

integer, minimum number of transcripts required (default 2)

verbose

logical, print progress messages

Value

A list containing:

  • similarity_matrix: symmetric matrix of pairwise isoform similarities

  • transcript_ids: character vector of transcript IDs (matrix row/col names)

  • gene_id: matched gene identifier

  • n_transcripts: number of transcripts

  • n_exons: named vector of exon counts per transcript

Details

The function computes pairwise similarity between isoforms based on their exon structure. Isoforms that share more exonic sequence are considered more similar, reflecting the biological intuition that they likely share more cis-regulatory effects.

The Jaccard similarity is computed as:

$$J(i,j) = \frac{|\text{overlap}(\text{exons}_i, \text{exons}_j)|}{|\text{union}(\text{exons}_i, \text{exons}_j)|}$$

where overlap and union are computed in terms of genomic base pairs.
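
As a rough illustration of the base-pair arithmetic described above, the following sketch computes the three similarity measures on made-up exon coordinates. It is not the package's implementation: the helpers exon_positions() and jaccard_bp() are hypothetical, the coordinates are invented, and the overlap coefficient is assumed to use each isoform's total exonic length as its denominator.

```r
# Hypothetical helper: expand exon intervals (1-based, inclusive) into the
# set of genomic positions they cover. Adequate for a toy example; real
# annotation-scale data would call for interval arithmetic instead.
exon_positions <- function(starts, ends) {
  unique(unlist(Map(seq, starts, ends)))
}

# Toy isoforms on the same chromosome and strand (coordinates are made up).
isoforms <- list(
  tx1 = exon_positions(c(100, 300), c(199, 399)),  # 200 exonic bp
  tx2 = exon_positions(c(150, 300), c(249, 349)),  # 150 exonic bp
  tx3 = exon_positions(500, 599)                   # 100 exonic bp, no overlap
)

# Jaccard index in base pairs: shared positions / positions in either isoform.
jaccard_bp <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Fill a named pairwise matrix, mirroring the shape of similarity_matrix.
ids <- names(isoforms)
sim <- matrix(0, length(ids), length(ids), dimnames = list(ids, ids))
for (i in ids) {
  for (j in ids) {
    sim[i, j] <- jaccard_bp(isoforms[[i]], isoforms[[j]])
  }
}

# The other two measures for the tx1/tx2 pair (assumed definitions).
shared_bp    <- length(intersect(isoforms$tx1, isoforms$tx2))
overlap_coef <- shared_bp / min(length(isoforms$tx1), length(isoforms$tx2))
binary       <- as.numeric(shared_bp > 0)
```

Here tx1 and tx2 share 100 bp out of a 250 bp union, so their Jaccard similarity is 0.4, while tx3 shares nothing with either and scores 0. The overlap coefficient for the same pair is higher (100 / 150) because it normalizes by the shorter isoform only.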

Examples

if (FALSE) { # \dontrun{
# Create similarity matrix for BRCA1 isoforms
sim <- similarity_from_gtf("gencode.v45.annotation.gtf.gz", "BRCA1")

# Use with graph-regularized prediction (X and Y must already be defined)
model <- multivariate_graph_reg(X, Y, similarity_matrix = sim$similarity_matrix)
} # }