[23]:
import sys
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
sys.path.insert(0, "../../")
import spycone as spy
import subprocess
from gtfparse import read_gtf

Transcript-level workflow

Prepare the dataset

The dataset is SARS-Cov-2 infection with 8 time points and 4 replicates.

[ ]:
#sample data
subprocess.call("wget https://zenodo.org/record/7228475/files/tutorial_alt_sorted_bc_tpm.csv?download=1 -O alt_sorted_bc_tpm.csv", shell=True)
subprocess.call("wget https://zenodo.org/record/7228475/files/tutorial_alt_genelist.csv?download=1 -O alt_genelist.csv", shell=True)

data = pd.read_csv("alt_sorted_bc_tpm.csv", sep="\t")
genelist = pd.read_csv("alt_genelist.csv", sep="\t")

geneid= list(map(lambda x: str(int(x)) if not np.isnan(x) else x,  genelist['gene'].tolist()))
transcriptid = genelist['isoforms'].to_list()

Read the data to the dataset function.

[25]:
dset = spy.dataset(ts=data,
        transcript_id=transcriptid,
        gene_id = geneid,
        species=9606,
        keytype='entrezgeneid',
        timepts=4, reps1=3)

Import biological network of your choice with BioNetwork, Spycone provides Biogrid, IID network in entrez ID as node name. Please specify the keytype if you are using a different ID.

[26]:
bionet = spy.BioNetwork(path="human", data=(('weight',float),))

Preprocessing

Filtering out genes that has expression across all time points lower than 1. By giving the biological network, it removes genes from the dataset that are not in the network.

[27]:
spy.preprocess(dset, bionet, cutoff=1)

Detect isoform switch

iso object contains method for isoform switch detect and total isoform usage calculation.

[28]:
iso = spy.iso_function(dset)
        #run isoform switch
ascov=iso.detect_isoform_switch(filtering=False, min_diff=0.05, corr_cutoff=0.5, event_im_cutoff=0.1, p_val_cutoff=0.05)
iso.detect_isoform_switch returns a dataframe containing the metrics of all isoform switch events detected.
iso_ratio : ratio of switching time to number of time points
diff : difference of relative abundances before and after switch
p_value : combined p-value of maj_pval and min_pval (corresponding p-value from t-test of the transcripts)
corr : dissimilar correlation (inverted pearson correlation)
final_sp : best switch point
event_importance : the ratio of the relative abundance of the events to the highest expressed transcripts
exclusive_domains : domain loss/gain of the event
adj_pval : p-values after multiple test correction
[29]:
ascov.head()
[29]:
gene gene_symb major_transcript minor_transcript switch_prob corr diff event_importance exclusive_domains p_value adj_pval
0 1019 CDK4 ENST00000552862.1 ENST00000257904.11 1.0 0.822139 0.066160 0.343667 [] 0.002980 0.002838
1 54882 ANKHD1 ENST00000433049.5 ENST00000394722.7 1.0 0.706529 0.073003 0.891058 [PF12796, PF00013] 0.056052 0.039891
2 9140 ATG12 ENST00000379594.7 ENST00000509910.2 1.0 0.613932 0.050557 0.537392 [PF04110] 0.028167 0.022642
3 64419 MTMR14 ENST00000353332.9 ENST00000431250.1 1.0 0.663200 0.062577 0.564498 [] 0.001605 0.001549
4 51184 GPN3 ENST00000228827.8 ENST00000537466.6 1.0 0.913216 0.089762 0.189560 [] 0.001605 0.001549

You can also plot out the genes using switch_plot, by provide your dataset object and the result dataframe from detect_isoform_switch.

[30]:
%matplotlib inline
spy.switch_plot("CDK4", dset, ascov)
_images/alternative_17_0.png
_images/alternative_17_1.png
_images/alternative_17_2.png
_images/alternative_17_3.png

With this function, you can also plot all isoforms in one view, by putting all_isoforms=True.

[31]:
%matplotlib inline
spy.switch_plot("BRCC3", dset, ascov, all_isoforms=True)
_images/alternative_19_0.png

Clustering of total isoform usage

Total isoform usage measures the magnitude of change of isoform relative abundance over time. It indicates the changes due to post-transcriptional regulation (e.g. aternative splicing). You can cluster the genes with total isoform gene, thus getting groups of gene with similar patterns of change of total isoform usage over time.

[32]:
dset = iso.total_isoform_usage(ascov, gene_level=True)
[33]:
asclu = spy.clustering(dset, algorithm='hierarchical', metrics="euclidean", linkage="ward", input_type="isoformusage", n_clusters=8, composite=False)
c = asclu.find_clusters()
[34]:
%matplotlib inline
spy.vis_all_clusters(asclu, col_wrap=5,y_label="total isoform usage")
_images/alternative_23_0.png

Gene set enrichment analysis

Perform gene set enrichment analysis using clusters_gsea. Change the gene_sets parameter into the choice of your knowledge base or gene set database, e.g. Reactome, KEGG, etc. Use spy.list_genesets to view the available knowledge base.

With the isoform switch detection, we can enriched the domains affect by the isoform switch events using Nease.

Nease only support human at the moment.

[35]:
asclu_go, _ = spy.clusters_gsea(dset, "hsapiens", is_results=ascov, method="nease", gene_sets = ['Reactome'])
[36]:
%matplotlib inline
spy.gsea_plot(asclu_go, cluster=4, nterms=15)
_images/alternative_26_0.png

DOMINO on the domain level

[37]:
mods = spy.run_domino(asclu, network_file=bionet, output_file_path="newslices.txt")

Splicing Factor analysis

Perform co-expression analysis to look for splicing factors with similar expression patterns as isoform relative abundance.

[38]:
sf=spy.SF_coexpression(dset, padj_method="fdr_bh",corr_cutoff=0)
sf_df, cluster_sf = sf.coexpression_with_sf()

Read gtf annotation file for exons information.

[ ]:
# obtain exons gtf, or input customs gtf
subprocess.call("wget https://zenodo.org/record/7229557/files/exons.gtf?download=1 -O exons.gtf", shell=True)
gtf = read_gtf("exons.gtf")

Co-expression analysis result in dataframe sf_df that contains the co-expressed splicing factors and the corresponding cluster. We are specifically interesting in splicing factors that positively and negatively correlated to isoforms in the same gene.

[40]:
thiscluster=6 #which cluster
end="5p" #specify 3'p or 5'p
sub = sf_df[sf_df['cluster']==thiscluster].drop_duplicates()
subcount = sub.groupby(['iso_genesymb', 'sf']) #extract list of co-expressed splicing factors
subcount = subcount['bin_corr'].agg(sum).reset_index()
subcount=subcount[subcount['bin_corr']==0]

To perform motif search, SF_motifsearch creates a motif object, taking a list of splicing factors, list of genes (e.g. genes in a cluster), dataset object, and the gtf file). Motif searching (search_motifs) will be focusing on loss exons or gain exons with exons parameter and 3p or 5p end with site parameter. Background are the constants exons which can be done by search_background_motifs.

Motif search may take a long time, it is recommended to run in command line environment.

[41]:
motif=spy.SF_motifsearch(subcount['sf'].unique(), asclu.symbs_clusters[thiscluster], dataset=dset, gtf=gtf)
lossmots = motif.search_motifs(exons="loss", site=end)
gainmots = motif.search_motifs(exons="gain", site=end)
bg = motif.search_background_motifs(site=end)
[42]:
alldf=motif._df_forvis(lossmots, gainmots, bg=bg)
pvals = motif._get_pvals(alldf)
[43]:
palette = {"lost exons":"#505BEA",
            "gained exons": "#F5D776",
           "background":"#76DACA"}
motif.vis_motif_density(alldf=alldf, pvals=pvals, pval_cutoff=0.05, palette=palette)
<Figure size 576x648 with 0 Axes>
_images/alternative_40_1.png