cell2cell
  • Home
  • API Documentation

cell2cell Tutorials

  • Cell-cell communication from bulk dataset
  • Cell-cell communication from single-cell dataset

Tensor-cell2cell Tutorials

  • Obtaining patterns of cell-cell communication with Tensor-cell2cell
  • Downstream analysis 1: Factor-specific analyses
  • Downstream analysis 2: Gene Set Enrichment Analysis
    • 1 - Load Data
    • 2 - Generate LR set from a Gene set (KEGG LR set in this case)
    • 3 - GSEA
      • Run the GSEA in each of the factors
      • Visualization
  • Inspecting CCC patterns from spatial transcriptomics
  • Running Tensor-cell2cell on your own GPU or on Google Colab's GPU
cell2cell
  • »
  • Tensor-cell2cell Tutorials »
  • Downstream analysis 2: Gene Set Enrichment Analysis
  • Edit on GitHub

Downstream analysis 2: Gene Set Enrichment Analysis¶

This tutorial is focused on running GSEA based on the loadings that the ligand-receptor pairs obtained from the tensor factorization. Tensor-cell2cell does not have functions for running GSEA directly from the tool. Here, we use the library gseapy, which has its own documentation: https://gseapy.readthedocs.io/en/latest/

IMPORTANT: Even though we use gseapy as an external tool, we have to build a gene set for ligand-receptor pairs because the standard GSEA is intended for individual genes.

In [1]:
Copied!
import cell2cell as c2c

import numpy as np
import pandas as pd

import gseapy
from statsmodels.stats.multitest import fdrcorrection
from collections import defaultdict

import seaborn as sns
%matplotlib inline
import cell2cell as c2c import numpy as np import pandas as pd import gseapy from statsmodels.stats.multitest import fdrcorrection from collections import defaultdict import seaborn as sns %matplotlib inline
In [2]:
Copied!
import warnings
warnings.filterwarnings('ignore')
import warnings warnings.filterwarnings('ignore')

1 - Load Data¶

Specify folders where the data is located, and where the outputs will be written:

In [3]:
Copied!
import os

data_folder = './'
directory = os.fsencode(data_folder)

output_folder = './results/'
if not os.path.isdir(output_folder):
    os.mkdir(output_folder)
import os data_folder = './' directory = os.fsencode(data_folder) output_folder = './results/' if not os.path.isdir(output_folder): os.mkdir(output_folder)

Open the loadings obtained from the tensor factorization

In [4]:
Copied!
factors = c2c.io.load_tensor_factors('./results/Loadings.xlsx')
factors = c2c.io.load_tensor_factors('./results/Loadings.xlsx')

Load gene set used as reference to build the LR set

In this case we use the KEGG pathways. The .gmt file was obtained from http://www.gsea-msigdb.org/gsea/downloads.jsp. If you are interested in using another gene set such as GO Terms, you can download it from the same website.

In [5]:
Copied!
pathway_per_gene = defaultdict(set)
with open('KEGG.gmt', 'rb') as f:
    for i, line in enumerate(f):
        l = line.decode("utf-8").split('\t')
        l[-1] = l[-1].replace('\n', '')
        l = [pw for pw in l if ('http' not in pw)] # Remove website info
        for gene in l[1:]:
            pathway_per_gene[gene] = pathway_per_gene[gene].union(set([l[0]]))
pathway_per_gene = defaultdict(set) with open('KEGG.gmt', 'rb') as f: for i, line in enumerate(f): l = line.decode("utf-8").split('\t') l[-1] = l[-1].replace('\n', '') l = [pw for pw in l if ('http' not in pw)] # Remove website info for gene in l[1:]: pathway_per_gene[gene] = pathway_per_gene[gene].union(set([l[0]]))

This file contains pathways associated with each gene. For example for IL6:

In [6]:
Copied!
pathway_per_gene['IL6']
pathway_per_gene['IL6']
Out[6]:
{'KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION',
 'KEGG_CYTOSOLIC_DNA_SENSING_PATHWAY',
 'KEGG_GRAFT_VERSUS_HOST_DISEASE',
 'KEGG_HEMATOPOIETIC_CELL_LINEAGE',
 'KEGG_HYPERTROPHIC_CARDIOMYOPATHY_HCM',
 'KEGG_INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION',
 'KEGG_JAK_STAT_SIGNALING_PATHWAY',
 'KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY',
 'KEGG_PATHWAYS_IN_CANCER',
 'KEGG_PRION_DISEASES',
 'KEGG_TOLL_LIKE_RECEPTOR_SIGNALING_PATHWAY'}

Open LR pairs from CellChat

Different databases of ligand-receptor interactions could be used. We previously created a repository that includes many available DBs (https://github.com/LewisLabUCSD/Ligand-Receptor-Pairs). In this tutorial, we employ the ligand-receptor pairs from CellChat (https://doi.org/10.1038/s41467-021-21246-9), which includes multimeric protein complexes.

The database must contain columns for the interactions using the same gene name nomenclature used in the .gmt file.

In [7]:
Copied!
lr_pairs = pd.read_csv('https://raw.githubusercontent.com/LewisLabUCSD/Ligand-Receptor-Pairs/master/Human/Human-2020-Jin-LR-pairs.csv')
lr_pairs = lr_pairs.astype(str)
lr_pairs = pd.read_csv('https://raw.githubusercontent.com/LewisLabUCSD/Ligand-Receptor-Pairs/master/Human/Human-2020-Jin-LR-pairs.csv') lr_pairs = lr_pairs.astype(str)
In [8]:
Copied!
lr_pairs.head(2)
lr_pairs.head(2)
Out[8]:
interaction_name pathway_name ligand receptor agonist antagonist co_A_receptor co_I_receptor evidence annotation interaction_name_2 ligand_symbol receptor_symbol ligand_ensembl receptor_ensembl interaction_symbol interaction_ensembl
0 TGFB1_TGFBR1_TGFBR2 TGFb TGFB1 TGFbR1_R2 TGFb agonist TGFb antagonist nan TGFb inhibition receptor KEGG: hsa04350 Secreted Signaling TGFB1 - (TGFBR1+TGFBR2) TGFB1 TGFBR1&TGFBR2 ENSG00000105329 ENSG00000106799&ENSG00000163513 TGFB1^TGFBR1&TGFBR2 ENSG00000105329^ENSG00000106799&ENSG00000163513
1 TGFB2_TGFBR1_TGFBR2 TGFb TGFB2 TGFbR1_R2 TGFb agonist TGFb antagonist nan TGFb inhibition receptor KEGG: hsa04350 Secreted Signaling TGFB2 - (TGFBR1+TGFBR2) TGFB2 TGFBR1&TGFBR2 ENSG00000092969 ENSG00000106799&ENSG00000163513 TGFB2^TGFBR1&TGFBR2 ENSG00000092969^ENSG00000106799&ENSG00000163513

2 - Generate LR set from a Gene set (KEGG LR set in this case)¶

Here, we label a LR interaction with a given pathway if all members of the interactions are labeled with that pathway.

In [9]:
Copied!
# If the LR db include protein complexes. 
# This is the character separating members
complex_sep = '&' 

# Dictionary to save the LR interaction (key) and the annotated pathways (values).
pathway_sets = defaultdict(set)

# Iterate through the interactions in the LR DB.
for idx, row in lr_pairs.iterrows():
    lr_label = row['interaction_symbol']
    lr = lr_label.split('^')
    
    # Gene members of the ligand and the receptor in the LR pair
    if complex_sep is None:
        ligands = [lr[0]]
        receptors = [lr[1]]
    else:
        ligands = lr[0].split(complex_sep)
        receptors = lr[1].split(complex_sep)
    
    # Find pathways associated with all members of the ligand
    for i, ligand in enumerate(ligands):
        if i == 0:
            ligand_pathways = pathway_per_gene[ligand]
        else:
            ligand_pathways = ligand_pathways.intersection(pathway_per_gene[ligand])
            
    # Find pathways associated with all members of the receptor
    for i, receptor in enumerate(receptors):
        if i == 0:
            receptor_pathways = pathway_per_gene[receptor]
        else:
            receptor_pathways = receptor_pathways.intersection(pathway_per_gene[receptor])

    # Keep only pathways that are in both ligand and receptor.
    lr_pathways = ligand_pathways.intersection(receptor_pathways)
    for p in lr_pathways:
        pathway_sets[p] = pathway_sets[p].union([lr_label])
# If the LR db include protein complexes. # This is the character separating members complex_sep = '&' # Dictionary to save the LR interaction (key) and the annotated pathways (values). pathway_sets = defaultdict(set) # Iterate through the interactions in the LR DB. for idx, row in lr_pairs.iterrows(): lr_label = row['interaction_symbol'] lr = lr_label.split('^') # Gene members of the ligand and the receptor in the LR pair if complex_sep is None: ligands = [lr[0]] receptors = [lr[1]] else: ligands = lr[0].split(complex_sep) receptors = lr[1].split(complex_sep) # Find pathways associated with all members of the ligand for i, ligand in enumerate(ligands): if i == 0: ligand_pathways = pathway_per_gene[ligand] else: ligand_pathways = ligand_pathways.intersection(pathway_per_gene[ligand]) # Find pathways associated with all members of the receptor for i, receptor in enumerate(receptors): if i == 0: receptor_pathways = pathway_per_gene[receptor] else: receptor_pathways = receptor_pathways.intersection(pathway_per_gene[receptor]) # Keep only pathways that are in both ligand and receptor. lr_pathways = ligand_pathways.intersection(receptor_pathways) for p in lr_pathways: pathway_sets[p] = pathway_sets[p].union([lr_label])

Keep only pathways annotations that contain at least K LR pairs

Here K=15

In [10]:
Copied!
K=15

lr_set = defaultdict(set)
    
for k, v in pathway_sets.items():
    if len(v) >= K:
        lr_set[k] = v
K=15 lr_set = defaultdict(set) for k, v in pathway_sets.items(): if len(v) >= K: lr_set[k] = v

And this results in a total of 22 pathways that were assigned to at least 15 LR pairs.

In [11]:
Copied!
len(lr_set)
len(lr_set)
Out[11]:
22

This LR set can be exported to be used in other analyses

In [12]:
Copied!
c2c.io.export_variable_with_pickle(lr_set, output_folder + '/CellChat-LR-KEGG-set.pkl')

# It can be loaded with:
# lr_set = c2c.io.load_variable_with_pickle(output_folder + '/CellChat-LR-KEGG-set.pkl')
c2c.io.export_variable_with_pickle(lr_set, output_folder + '/CellChat-LR-KEGG-set.pkl') # It can be loaded with: # lr_set = c2c.io.load_variable_with_pickle(output_folder + '/CellChat-LR-KEGG-set.pkl')
./results//CellChat-LR-KEGG-set.pkl  was correctly saved.

3 - GSEA¶

Here we use the gseapy.prerank() function based on the loadings of the LR pairs.

GSEA parameters

In [13]:
Copied!
weight = 1
min_size = 15
permutations = 999
significance_threshold = 0.05
weight = 1 min_size = 15 permutations = 999 significance_threshold = 0.05

Loadings of the LR pairs

In [14]:
Copied!
loadings = factors['Ligand-Receptor Pairs'].reset_index()
loadings = factors['Ligand-Receptor Pairs'].reset_index()

Run the GSEA in each of the factors¶

In [15]:
Copied!
for factor in loadings.columns[1:]:
    # Rank LR pairs of each factor by their respective loadings
    test = loadings[['index', factor]]
    test.columns = [0, 1]
    test = test.sort_values(by=1, ascending=False)
    test.reset_index(drop=True, inplace=True)

    # RUN GSEA
    gseapy.prerank(rnk=test,
                   gene_sets=lr_set,
                   min_size=min_size,
                   weighted_score_type=weight,
                   processes=6,
                   permutation_num=permutations, # reduce number to speed up testing
                   outdir=output_folder + '/GSEA/' + factor, format='png', seed=6)
for factor in loadings.columns[1:]: # Rank LR pairs of each factor by their respective loadings test = loadings[['index', factor]] test.columns = [0, 1] test = test.sort_values(by=1, ascending=False) test.reset_index(drop=True, inplace=True) # RUN GSEA gseapy.prerank(rnk=test, gene_sets=lr_set, min_size=min_size, weighted_score_type=weight, processes=6, permutation_num=permutations, # reduce number to speed up testing outdir=output_folder + '/GSEA/' + factor, format='png', seed=6)

Adjust P-values across all factors (GSEA only adjusts within factor)

First we read the files generated by gseapy within each Factor folder. Then put all those P-values together and perform a Benjamini-Hochberg correction.

In [16]:
Copied!
pvals = []
terms = []
factors = []
nes = []
for factor in loadings.columns[1:]:
    p_report = pd.read_csv(output_folder + '/GSEA/' + factor + '/gseapy.prerank.gene_sets.report.csv')
    pval = p_report['pval'].values.tolist()
    pvals.extend(pval)
    terms.extend(p_report.Term.values.tolist())
    factors.extend([factor] * len(pval))
    nes.extend(p_report['nes'].values.tolist())
pval_df = pd.DataFrame(np.asarray([factors, terms, nes, pvals]).T, columns=['Factor', 'Term', 'NES', 'P-value'])
pval_df = pval_df.loc[pval_df['P-value'] != 'nan']
pval_df['P-value'] = pd.to_numeric(pval_df['P-value'])
pval_df['P-value'] = pval_df['P-value'].replace(0., 1./(permutations+1))
pval_df['NES'] = pd.to_numeric(pval_df['NES'])
pvals = [] terms = [] factors = [] nes = [] for factor in loadings.columns[1:]: p_report = pd.read_csv(output_folder + '/GSEA/' + factor + '/gseapy.prerank.gene_sets.report.csv') pval = p_report['pval'].values.tolist() pvals.extend(pval) terms.extend(p_report.Term.values.tolist()) factors.extend([factor] * len(pval)) nes.extend(p_report['nes'].values.tolist()) pval_df = pd.DataFrame(np.asarray([factors, terms, nes, pvals]).T, columns=['Factor', 'Term', 'NES', 'P-value']) pval_df = pval_df.loc[pval_df['P-value'] != 'nan'] pval_df['P-value'] = pd.to_numeric(pval_df['P-value']) pval_df['P-value'] = pval_df['P-value'].replace(0., 1./(permutations+1)) pval_df['NES'] = pd.to_numeric(pval_df['NES'])

BH correction:

In [17]:
Copied!
pval_df['Adj. P-value'] = fdrcorrection(pval_df['P-value'].values,
                                        alpha=significance_threshold)[1]
pval_df['Adj. P-value'] = fdrcorrection(pval_df['P-value'].values, alpha=significance_threshold)[1]

Enriched LR pairs

In [18]:
Copied!
pval_df.loc[(pval_df['Adj. P-value'] < significance_threshold) & (pval_df['NES'] > 0.)]
pval_df.loc[(pval_df['Adj. P-value'] < significance_threshold) & (pval_df['NES'] > 0.)]
Out[18]:
Factor Term NES P-value Adj. P-value
1 Factor 1 KEGG_ERBB_SIGNALING_PATHWAY 1.454242 0.001000 0.008709
2 Factor 1 KEGG_CELL_ADHESION_MOLECULES_CAMS 1.398443 0.001000 0.008709
3 Factor 1 KEGG_AXON_GUIDANCE 1.241130 0.006006 0.029029
17 Factor 2 KEGG_AXON_GUIDANCE 1.563928 0.001000 0.008709
18 Factor 2 KEGG_CELL_ADHESION_MOLECULES_CAMS 1.309822 0.005005 0.029029
31 Factor 3 KEGG_ERBB_SIGNALING_PATHWAY 1.486924 0.003003 0.018662
32 Factor 3 KEGG_AXON_GUIDANCE 1.388006 0.001001 0.008709
34 Factor 3 KEGG_ECM_RECEPTOR_INTERACTION 1.327261 0.001000 0.008709
35 Factor 3 KEGG_SMALL_CELL_LUNG_CANCER 1.335054 0.008008 0.033176
36 Factor 3 KEGG_FOCAL_ADHESION 1.279068 0.001001 0.008709
45 Factor 4 KEGG_ERBB_SIGNALING_PATHWAY 1.741709 0.001000 0.008709
46 Factor 4 KEGG_CELL_ADHESION_MOLECULES_CAMS 1.287627 0.003003 0.018662
47 Factor 4 KEGG_AXON_GUIDANCE 1.254568 0.007007 0.032085
61 Factor 5 KEGG_CELL_ADHESION_MOLECULES_CAMS 1.422309 0.001000 0.008709
62 Factor 5 KEGG_MAPK_SIGNALING_PATHWAY 1.310704 0.008008 0.033176
63 Factor 5 KEGG_REGULATION_OF_ACTIN_CYTOSKELETON 1.312981 0.002002 0.015834
64 Factor 5 KEGG_ERBB_SIGNALING_PATHWAY 1.326159 0.006006 0.029029
75 Factor 6 KEGG_CELL_ADHESION_MOLECULES_CAMS 1.297625 0.003003 0.018662
78 Factor 6 KEGG_FOCAL_ADHESION 1.190575 0.006006 0.029029

Depleted LR pairs

In [19]:
Copied!
pval_df.loc[(pval_df['Adj. P-value'] < significance_threshold) & (pval_df['NES'] < 0.)]
pval_df.loc[(pval_df['Adj. P-value'] < significance_threshold) & (pval_df['NES'] < 0.)]
Out[19]:
Factor Term NES P-value Adj. P-value
0 Factor 1 KEGG_NOTCH_SIGNALING_PATHWAY -1.380117 0.001 0.008709
15 Factor 2 KEGG_NOTCH_SIGNALING_PATHWAY -2.423423 0.001 0.008709

Exported Normalized Enrichment Scores and their adjusted P-values across factors

In [20]:
Copied!
pval_df.to_excel(output_folder + '/GSEA-Adj-Pvals.xlsx')
pval_df.to_excel(output_folder + '/GSEA-Adj-Pvals.xlsx')

Visualization¶

DataFrame needed for the pval_plot included in cell2cell

In [21]:
Copied!
pval_pivot = pval_df.pivot(index="Term", columns="Factor", values="Adj. P-value").fillna(1.)
scores = pval_df.pivot(index="Term", columns="Factor", values="NES").fillna(0)
pval_pivot = pval_df.pivot(index="Term", columns="Factor", values="Adj. P-value").fillna(1.) scores = pval_df.pivot(index="Term", columns="Factor", values="NES").fillna(0)

Dot Plot

In [22]:
Copied!
with sns.axes_style("darkgrid"):
    dotplot = c2c.plotting.pval_plot.generate_dot_plot(pval_df=pval_pivot,
                                                       score_df=scores,
                                                       significance=significance_threshold,
                                                       xlabel='',
                                                       ylabel='KEGG Pathways',
                                                       cbar_title='NES',
                                                       cmap='PuOr',
                                                       figsize=(9,16),
                                                       label_size=20,
                                                       title_size=20,
                                                       tick_size=14,
                                                       filename=output_folder + '/GSEA-Dotplot.svg'
                                                      )
with sns.axes_style("darkgrid"): dotplot = c2c.plotting.pval_plot.generate_dot_plot(pval_df=pval_pivot, score_df=scores, significance=significance_threshold, xlabel='', ylabel='KEGG Pathways', cbar_title='NES', cmap='PuOr', figsize=(9,16), label_size=20, title_size=20, tick_size=14, filename=output_folder + '/GSEA-Dotplot.svg' )

This dot plot indicates the pathways or LR sets enriched/depleted (depending on the color) in each of the factors. The size of the circles indicates whether they are significant. The threshold size is for the significance threshold (usually P < 0.05).

Previous Next

Built with MkDocs using a theme provided by Read the Docs.
GitHub « Previous Next »