Running Tensor-cell2cell on your own GPU or on Google Colab's GPU¶

You can run this notebook locally or on Google Colab directly here

Before running this notebook, make sure to have a proper NVIDIA GPU driver (https://www.nvidia.com/Download/index.aspx) as well as the CUDA toolkit (https://developer.nvidia.com/cuda-toolkit) installed.

Then, make sure to create an environment with Pytorch following these instructions to enable CUDA

Citations¶

If you plan using this notebook with your own scRNA-seq data, please cite these references:

Tensor-cell2cell Tool: Armingol E., Baghdassarian H., Martino C., Perez-Lopez A., Aamodt C., Knight R., Lewis N.E. Context-aware deconvolution of cell-cell communication with Tensor-cell2cell. Nat Commun 13, 3665 (2022). https://doi.org/10.1038/s41467-022-31369-2
Repository of Ligand-Receptor Pairs used to download the list of LR pairs: Armingol, E., Officer, A., Harismendy, O. et al. Deciphering cell–cell interactions and communication from gene expression. Nat Rev Genet 22, 71–88 (2021). https://doi.org/10.1038/s41576-020-00292-x
List of Ligand-Receptor Pairs: Jin, S., Guerrero-Juarez, C.F., Zhang, L. et al. Inference and analysis of cell-cell communication using CellChat. Nat Commun 12, 1088 (2021). https://doi.org/10.1038/s41467-021-21246-9

Initial setups¶

Installing necessary libraries¶

In [ ]:

            
                Copied!
                
# To install a stable version of cell2cell run this:
#!pip install cell2cell
# To install a stable version of cell2cell run this:
#!pip install cell2cell

In [3]:

            
                Copied!
                
import cell2cell as c2c
import scanpy as sc

import pandas as pd
import numpy as np

from tqdm import tqdm

%matplotlib inline
import cell2cell as c2c
import scanpy as sc

import pandas as pd
import numpy as np

from tqdm import tqdm

%matplotlib inline

In [4]:

            
                Copied!
                
import warnings
warnings.filterwarnings('ignore')
import warnings
warnings.filterwarnings('ignore')

In [5]:

            
                Copied!
                
c2c.__version__
c2c.__version__

Out[5]:

'0.6.4'

Load Data¶

Directories to store data¶

In [6]:

            
                Copied!
                
output_folder = './BALF-COVID19/'
c2c.io.directories.create_directory(output_folder)
output_folder = './BALF-COVID19/'
c2c.io.directories.create_directory(output_folder)

./BALF-COVID19/ already exists.

Example dataset of COVID-19¶

This is a BALF COVID-19 dataset that consists in 63k immune and epithelial cells in lungs from 3 control, 3 moderate COVID-19, and 6 severe COVID-19 patients.

This dataset was previously published in [1], and this object contains the raw counts for the annotated cell types available in: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145926

Here we obtain an AnnData object containing the raw counts.

References:

[1] Liao, M., Liu, Y., Yuan, J. et al. Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat Med 26, 842–844 (2020). https://doi.org/10.1038/s41591-020-0901-9

In [7]:

            
                Copied!
                
adata = c2c.datasets.balf_covid()
adata = c2c.datasets.balf_covid()

In [8]:

            
                Copied!
                
adata
adata

Out[8]:

AnnData object with n_obs × n_vars = 63103 × 33538
    obs: 'sample', 'sample_new', 'group', 'disease', 'hasnCoV', 'cluster', 'celltype', 'condition'

Ligand-Receptor pairs¶

Different databases of ligand-receptor interactions could be used. We previously created a repository that includes many available DBs (https://github.com/LewisLabUCSD/Ligand-Receptor-Pairs). In this tutorial, we employ the ligand-receptor pairs from CellChat (https://doi.org/10.1038/s41467-021-21246-9), which includes multimeric protein complexes.

In [9]:

            
                Copied!
                
lr_pairs = pd.read_csv('https://raw.githubusercontent.com/LewisLabUCSD/Ligand-Receptor-Pairs/master/Human/Human-2020-Jin-LR-pairs.csv')
lr_pairs = lr_pairs.astype(str)
lr_pairs = pd.read_csv('https://raw.githubusercontent.com/LewisLabUCSD/Ligand-Receptor-Pairs/master/Human/Human-2020-Jin-LR-pairs.csv')
lr_pairs = lr_pairs.astype(str)

In [10]:

            
                Copied!
                
lr_pairs.head(2)
lr_pairs.head(2)

Out[10]:

	interaction_name	pathway_name	ligand	receptor	agonist	antagonist	co_A_receptor	co_I_receptor	evidence	annotation	interaction_name_2	ligand_symbol	receptor_symbol	ligand_ensembl	receptor_ensembl	interaction_symbol	interaction_ensembl
0	TGFB1_TGFBR1_TGFBR2	TGFb	TGFB1	TGFbR1_R2	TGFb agonist	TGFb antagonist	nan	TGFb inhibition receptor	KEGG: hsa04350	Secreted Signaling	TGFB1 - (TGFBR1+TGFBR2)	TGFB1	TGFBR1&TGFBR2	ENSG00000105329	ENSG00000106799&ENSG00000163513	TGFB1^TGFBR1&TGFBR2	ENSG00000105329^ENSG00000106799&ENSG00000163513
1	TGFB2_TGFBR1_TGFBR2	TGFb	TGFB2	TGFbR1_R2	TGFb agonist	TGFb antagonist	nan	TGFb inhibition receptor	KEGG: hsa04350	Secreted Signaling	TGFB2 - (TGFBR1+TGFBR2)	TGFB2	TGFBR1&TGFBR2	ENSG00000092969	ENSG00000106799&ENSG00000163513	TGFB2^TGFBR1&TGFBR2	ENSG00000092969^ENSG00000106799&ENSG00000163513

Interaction columns contianing the gene names

In [11]:

            
                Copied!
                
int_columns = ('ligand_symbol', 'receptor_symbol')
int_columns = ('ligand_symbol', 'receptor_symbol')

Data Preprocessing¶

Information of samples¶

First, generate a dictionary indicating what condition group is associated to each sample/context

In [12]:

            
                Copied!
                
context_dict = adata.obs.set_index('sample')['condition'].sort_values().to_dict()
context_dict = adata.obs.set_index('sample')['condition'].sort_values().to_dict()

In [13]:

            
                Copied!
                
context_dict
context_dict

Out[13]:

{'C100': 'Control',
 'C52': 'Control',
 'C51': 'Control',
 'C142': 'Moderate COVID-19',
 'C141': 'Moderate COVID-19',
 'C144': 'Moderate COVID-19',
 'C143': 'Severe COVID-19',
 'C148': 'Severe COVID-19',
 'C149': 'Severe COVID-19',
 'C146': 'Severe COVID-19',
 'C145': 'Severe COVID-19',
 'C152': 'Severe COVID-19'}

Generate list of RNA-seq data (per sample) with single cells aggregated into cell types¶

Here, we convert the single-cell expression levels into cell-type expression levels. A basic context-wise preprocessing is performed here, keeping only genes that are expressed at least in 4 single cells.

Important: A list of aggregated gene expression matrices must be used for building the 4D-communication tensor. Each of the gene expression matrices is for one of the contexts, following the same order as the previous dictionary (context_dict).

Each gene expression is aggregated here across all single cells annotated with the same cell type as the average log1p(CPM) value. Other aggregation methods could be used, please inspect parameter method='average' in the aggregate_single_cell() function. Additionally, other gene expression values could be passed, using other preprocessing approaches such as the non-zero fraction of cells expressing a gene or batch correction methods.

In [14]:

            
                Copied!
                
                    
                    
                
                

        
rnaseq_matrices = []

# Iteraty by sample/context
for context in tqdm(context_dict.keys()):
    # Obtain metadata for context
    meta_context = adata.obs.loc[adata.obs['sample'] == context].copy()
    # Single cells in the context
    cells = list(meta_context.index)

    # Rename index name to identify the barcodes when aggregating expression
    meta_context.index.name = 'barcode'

    # Subset RNAseq data by the single cells in the sample/context
    tmp_adata = adata[cells]

    # Keep genes in each sample with at least 4 single cells expressing it
    genes = sc.pp.filter_genes(tmp_adata, min_cells=4, inplace=False)[0]
    
    # Convert raw counts into log1p(CPM)
    sc.pp.normalize_total(tmp_adata, target_sum=1e6)
    sc.pp.log1p(tmp_adata)

    # Convert to dataframe and keep filtered genes
    tmp_adata = tmp_adata.to_df().loc[:, genes]

    # Aggregate gene expression of single cells into cell types
    exp_df = c2c.preprocessing.aggregate_single_cells(rnaseq_data=tmp_adata,
                                                      metadata=meta_context,
                                                      barcode_col='barcode',
                                                      celltype_col='celltype',
                                                      method='average',
                                                     )

    rnaseq_matrices.append(exp_df)
rnaseq_matrices = []

# Iteraty by sample/context
for context in tqdm(context_dict.keys()):
    # Obtain metadata for context
    meta_context = adata.obs.loc[adata.obs['sample'] == context].copy()
    # Single cells in the context
    cells = list(meta_context.index)

    # Rename index name to identify the barcodes when aggregating expression
    meta_context.index.name = 'barcode'

    # Subset RNAseq data by the single cells in the sample/context
    tmp_adata = adata[cells]

    # Keep genes in each sample with at least 4 single cells expressing it
    genes = sc.pp.filter_genes(tmp_adata, min_cells=4, inplace=False)[0]
    
    # Convert raw counts into log1p(CPM)
    sc.pp.normalize_total(tmp_adata, target_sum=1e6)
    sc.pp.log1p(tmp_adata)

    # Convert to dataframe and keep filtered genes
    tmp_adata = tmp_adata.to_df().loc[:, genes]

    # Aggregate gene expression of single cells into cell types
    exp_df = c2c.preprocessing.aggregate_single_cells(rnaseq_data=tmp_adata,
                                                      metadata=meta_context,
                                                      barcode_col='barcode',
                                                      celltype_col='celltype',
                                                      method='average',
                                                     )

    rnaseq_matrices.append(exp_df)

100%|███████████████████████████████████████████| 12/12 [00:09<00:00,  1.23it/s]

Preprocessing of LR pairs¶

Generate a dictionary with function info for each LR pairs.

Keys are LIGAND_NAME^RECEPTOR_NAME (this is the same nomenclature that will be used when building the 4D-communication tensor later), and values are the function in the annotation column in the dataframe containing ligand-receptor pairs. Other functional annotations of the LR pairs could be used if available.

In [15]:

            
                Copied!
                
ppi_functions = dict()

for idx, row in lr_pairs.iterrows():
    ppi_label = row[int_columns[0]] + '^' + row[int_columns[1]]
    ppi_functions[ppi_label] = row['annotation']
ppi_functions = dict()

for idx, row in lr_pairs.iterrows():
    ppi_label = row[int_columns[0]] + '^' + row[int_columns[1]]
    ppi_functions[ppi_label] = row['annotation']

Tensor-cell2cell analysis¶

Build 4D-Communication Tensor Here we use as input the list of gene expression matrices that were aggregated into a cell-type granularity. This list contains the expression matrices of all samples/contexts.

The following functions perform all the steps to build a 4D-communication tensor:

TC2C .

The parameter communication_score indicates the scoring function to use.

how='inner' is used to keep only cell types and genes that are across all contexts. Use how='outer' to include all cell types, even if they are not in all samples/contexts. When using the last option, a tensor.mask attribute is created, where zeros indicates what are the missing cell pairs and LR interactions in the given contexts, and ones indicate those that are present. Thus, we can deal with missing values by assigning them a value of zero during the tensor decomposition. We can also control the minimum fraction of samples that must containg a LR pair or a cell type to be included in the tensor, through the parameter outer_fraction.

complex_sep='&' is used to specify that the list of ligand-receptor pairs contains protein complexes and that subunits are separated by '&'. If the list does not have complexes, use complex_sep=None instead.

In [16]:

            
                Copied!
                
                    
                    
                
                

        
tensor = c2c.tensor.InteractionTensor(rnaseq_matrices=rnaseq_matrices,
                                      ppi_data=lr_pairs,
                                      context_names=list(context_dict.keys()),
                                      how='outer',
                                      outer_fraction=0.5, # Considers elements in at least 50% of samples
                                      complex_sep='&',
                                      interaction_columns=int_columns,
                                      communication_score='expression_gmean',
                                     )
tensor = c2c.tensor.InteractionTensor(rnaseq_matrices=rnaseq_matrices,
                                      ppi_data=lr_pairs,
                                      context_names=list(context_dict.keys()),
                                      how='outer',
                                      outer_fraction=0.5, # Considers elements in at least 50% of samples
                                      complex_sep='&',
                                      interaction_columns=int_columns,
                                      communication_score='expression_gmean',
                                     )

Getting expression values for protein complexes
Building tensor for the provided context

Evaluate some tensor properties¶

Tensor shape

This indicates the number of elements in each tensor dimension: (Contexts, LR pairs, Sender cells, Receiver cells)

In [17]:

            
                Copied!
                
tensor.tensor.shape
tensor.tensor.shape

Out[17]:

(12, 807, 10, 10)

Fraction of excluded elements

This represents the fraction of values that are ignored (masked) in the analysis.

In [18]:

            
                Copied!
                
tensor.excluded_value_fraction()
tensor.excluded_value_fraction()

Out[18]:

0.28569496076001655

Prepare Tensor Metadata¶

To interpret analysis on the tensor, we can assign groups to each sample/context, and to every elements in the other dimensions (LR pairs and cells).

We can generate respective dictionaries manually or automatically from DBs.

In [19]:

            
                Copied!
                
meta_tensor = c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor,
                                                  metadata_dicts=[context_dict, ppi_functions, None, None],
                                                  fill_with_order_elements=True
                                                  )
meta_tensor = c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor,
                                                  metadata_dicts=[context_dict, ppi_functions, None, None],
                                                  fill_with_order_elements=True
                                                  )

Run Tensor-cell2cell¶

Here we use the tf_optimization='regular', which is faster but generates less robust results. We recommend using tf_optimization='robust, which takes longer to run.

In [20]:

            
                Copied!
                
                    
                    
                
                

        
tensor2 = c2c.analysis.run_tensor_cell2cell_pipeline(tensor,
                                                     meta_tensor,
                                                     copy_tensor=True, # Whether to output a new tensor or modifying the original
                                                     rank=None, # Number of factors to perform the factorization. If None, it is automatically determined by an elbow analysis
                                                     tf_optimization='regular', # To define how robust we want the analysis to be.
                                                     random_state=888, # Random seed for reproducibility
                                                     backend='pytorch', # This enables a banckend that supports using a GPU.
                                                     device='cuda', # Device to use. If using GPU and PyTorch, use 'cuda'. For CPU use 'cpu'
                                                     elbow_metric='error', # Metric to use in the elbow analysis.
                                                     smooth_elbow=False, # Whether smoothing the metric of the elbow analysis.
                                                     upper_rank=25, # Max number of factors to try in the elbow analysis
                                                     tf_init='random', # Initialization method of the tensor factorization
                                                     tf_svd='numpy_svd', # Type of SVD to use if the initialization is 'svd'
                                                     cmaps=None, # Color palettes to use in color each of the dimensions. Must be a list of palettes.
                                                     sample_col='Element', # Columns containing the elements in the tensor metadata
                                                     group_col='Category', # Columns containing the major groups in the tensor metadata
                                                     fig_fontsize=14, # Fontsize of the figures generated
                                                     output_folder=output_folder, # Whether to save the figures and loadings in files. If so, a folder pathname must be passed
                                                     output_fig=True, # Whether to output the figures. If False, figures won't be saved a files if a folder was passed in output_folder.
                                                     fig_format='pdf', # File format of the figures.
                                                    )
tensor2 = c2c.analysis.run_tensor_cell2cell_pipeline(tensor,
                                                     meta_tensor,
                                                     copy_tensor=True, # Whether to output a new tensor or modifying the original
                                                     rank=None, # Number of factors to perform the factorization. If None, it is automatically determined by an elbow analysis
                                                     tf_optimization='regular', # To define how robust we want the analysis to be.
                                                     random_state=888, # Random seed for reproducibility
                                                     backend='pytorch', # This enables a banckend that supports using a GPU.
                                                     device='cuda', # Device to use. If using GPU and PyTorch, use 'cuda'. For CPU use 'cpu'
                                                     elbow_metric='error', # Metric to use in the elbow analysis.
                                                     smooth_elbow=False, # Whether smoothing the metric of the elbow analysis.
                                                     upper_rank=25, # Max number of factors to try in the elbow analysis
                                                     tf_init='random', # Initialization method of the tensor factorization
                                                     tf_svd='numpy_svd', # Type of SVD to use if the initialization is 'svd'
                                                     cmaps=None, # Color palettes to use in color each of the dimensions. Must be a list of palettes.
                                                     sample_col='Element', # Columns containing the elements in the tensor metadata
                                                     group_col='Category', # Columns containing the major groups in the tensor metadata
                                                     fig_fontsize=14, # Fontsize of the figures generated
                                                     output_folder=output_folder, # Whether to save the figures and loadings in files. If so, a folder pathname must be passed
                                                     output_fig=True, # Whether to output the figures. If False, figures won't be saved a files if a folder was passed in output_folder.
                                                     fig_format='pdf', # File format of the figures.
                                                    )

Running Elbow Analysis

  0%|          | 0/25 [00:00<?, ?it/s]

The rank at the elbow is: 10
Running Tensor Factorization
Generating Outputs
Loadings of the tensor factorization were successfully saved into ./BALF-COVID19//Loadings.xlsx

Top genes in each factor

In [21]:

            
                Copied!
                
for i in range(tensor2.rank):
    print(tensor2.get_top_factor_elements('Ligand-Receptor Pairs', 'Factor {}'.format(i+1), 10))
    print('')
for i in range(tensor2.rank):
    print(tensor2.get_top_factor_elements('Ligand-Receptor Pairs', 'Factor {}'.format(i+1), 10))
    print('')

CD99^CD99             0.396975
MIF^CD74&CXCR4        0.368345
MIF^CD74&CD44         0.348051
LGALS9^PTPRC          0.319667
TNFSF13B^TNFRSF13B    0.244478
LGALS9^CD44           0.239305
APP^CD74              0.226378
SELPLG^SELL           0.223956
PTPRC^CD22            0.206616
CXCL10^CXCR3          0.174164
Name: Factor 1, dtype: float32

MIF^CD74&CD44        0.607854
LGALS9^CD44          0.450888
CD99^CD99            0.402892
LGALS9^PTPRC         0.260705
LGALS9^HAVCR2        0.258238
MIF^CD74&CXCR4       0.166606
APP^CD74             0.142985
COL9A2^CD44          0.112663
CD22^PTPRC           0.103688
NAMPT^ITGA5&ITGB1    0.098853
Name: Factor 2, dtype: float32

CCL8^CCR1            0.328729
SPP1^CD44            0.318466
CCL3L1^CCR1          0.311767
CCL3^CCR1            0.311655
CCL7^CCR1            0.280976
ANXA1^FPR1           0.240010
MIF^CD74&CXCR4       0.191990
NAMPT^ITGA5&ITGB1    0.188227
MIF^CD74&CD44        0.179285
SPP1^ITGA5&ITGB1     0.165535
Name: Factor 3, dtype: float32

CCL5^CCR1            0.405997
CD99^PILRA           0.308518
ANXA1^FPR1           0.298691
PTPRC^MRC1           0.209846
CCL5^CCR5            0.194877
CD99^CD99            0.193112
ANXA1^FPR2           0.189305
MIF^CD74&CXCR4       0.182442
SELPLG^SELL          0.169664
ITGAL&ITGB2^ICAM1    0.165196
Name: Factor 4, dtype: float32

MDK^NCL         0.379103
APP^CD74        0.365288
LAMC2^CD44      0.290392
LAMB3^CD44      0.283392
LAMA5^CD44      0.221012
PTN^NCL         0.194058
GDF15^TGFBR2    0.178591
LAMB2^CD44      0.167368
LAMA3^CD44      0.153982
COL4A5^CD44     0.147150
Name: Factor 5, dtype: float32

PTPRC^MRC1       0.392890
FN1^CD44         0.374219
MIF^CD74&CD44    0.336574
RETN^CAP1        0.302886
LGALS9^CD44      0.274760
GRN^SORT1        0.239909
HLA-DMA^CD4      0.239026
LGALS9^PTPRC     0.222783
PECAM1^PECAM1    0.168159
RETN^TLR4        0.152095
Name: Factor 6, dtype: float32

APP^CD74         0.463218
HLA-DMA^CD4      0.393801
LGALS9^PTPRC     0.250710
HLA-DOA^CD4      0.226411
LGALS9^HAVCR2    0.209939
GRN^SORT1        0.196472
IL16^CD4         0.183482
CD99^PILRA       0.183337
ICAM1^SPN        0.174727
PECAM1^PECAM1    0.164378
Name: Factor 7, dtype: float32

CD99^CD99              0.279922
NAMPT^INSR             0.274649
SEMA4D^PLXNB2          0.239051
GRN^SORT1              0.223482
SEMA4D^PLXNB1          0.199096
SELL^PODXL             0.187005
TNFSF10^TNFRSF10A      0.169820
AREG^EGFR              0.156962
TGFB1^ACVR1B&TGFBR2    0.152839
AREG^EGFR&ERBB2        0.152592
Name: Factor 8, dtype: float32

CD99^CD99         0.344679
LGALS9^PTPRC      0.341739
MIF^CD74&CD44     0.220866
MIF^CD74&CXCR4    0.212046
CXCL16^CXCR6      0.210066
CXCL10^CXCR3      0.209790
CD48^CD244        0.207834
CCL4^CCR5         0.188202
CCL5^CCR5         0.175231
ALCAM^CD6         0.174957
Name: Factor 9, dtype: float32

RETN^CAP1        0.394428
FN1^CD44         0.363788
MDK^NCL          0.291849
SIGLEC1^SPN      0.274590
LGALS9^CD44      0.228481
LGALS9^PTPRC     0.193877
THBS1^CD47       0.177242
CCL3^CCR1        0.141885
CCL8^CCR1        0.139728
LGALS9^HAVCR2    0.126326
Name: Factor 10, dtype: float32

Export objects¶

Beyond the generated results, we can export the objects here for using them in posterior analyses.

In [22]:

            
                Copied!
                
# Export Tensor after decomposition

c2c.io.export_variable_with_pickle(tensor, output_folder + 'Tensor.pkl')
# Export Tensor after decomposition

c2c.io.export_variable_with_pickle(tensor, output_folder + 'Tensor.pkl')

./BALF-COVID19/Tensor.pkl  was correctly saved.

In [23]:

            
                Copied!
                
# Export Tensor Metadata

c2c.io.export_variable_with_pickle(meta_tensor, output_folder + 'Tensor-Metadata.pkl')
# Export Tensor Metadata

c2c.io.export_variable_with_pickle(meta_tensor, output_folder + 'Tensor-Metadata.pkl')

./BALF-COVID19/Tensor-Metadata.pkl  was correctly saved.

In [ ]: