Patterns of cell-cell communication and metabolic activities (scCellFie)¶
Previously we showed how Tensor-cell2cell can be used to extract patterns of multiple modalities of cell-cell communication (CCC). Here we will show how our CTCA can be extended to couple CCC and intracellular metabolic activity. Here we use the tools cell2cell and scCellFie to predict CCC and the activity of metabolic tasks, respectively.
In this case, we use samples from cortical brain organoids across different time points, previously published by Trujillo et al (2019).
use_gpu = False
import tensorly as tl
if use_gpu:
tl.set_backend('pytorch')
Importing packages to use
import cell2cell as c2c
import scanpy as sc
import sccellfie
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
/Users/earmingol/miniforge3/envs/cell2cell/lib/python3.10/site-packages/dask/dataframe/__init__.py:31: FutureWarning: The legacy Dask DataFrame implementation is deprecated and will be removed in a future version. Set the configuration option `dataframe.query-planning` to `True` or None to enable the new Dask Dataframe implementation and silence this warning. warnings.warn( /Users/earmingol/miniforge3/envs/cell2cell/lib/python3.10/site-packages/xarray_schema/__init__.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81. from pkg_resources import DistributionNotFound, get_distribution /Users/earmingol/miniforge3/envs/cell2cell/lib/python3.10/site-packages/anndata/utils.py:434: FutureWarning: Importing read_text from `anndata` is deprecated. Import anndata.io.read_text instead. warnings.warn(msg, FutureWarning)
import warnings
warnings.filterwarnings('ignore')
IMPORTANT: In this notebook, the version 0.8.0 of cell2cell is used. Older versions do not implement the Coupled Tensor Component Analysis (CTCA).
1 - Load Data¶
Specify folders where the data is located, and where the outputs will be written:
import os
data_folder = './data/'
directory = os.fsencode(data_folder)
output_folder = './results/scCellFie/'
if not os.path.isdir(output_folder):
os.makedirs(output_folder)
RNA-seq data¶
A preprocessed version of the original paper (Trujillo et al, 2019) can be found here
rnaseq = sc.read_h5ad(data_folder + '/annotated_seurat_norm_harmony_2022.h5ad')
rnaseq = rnaseq.raw.to_adata()
rnaseq.obs.head(2)
orig.ident | nCount_RNA | nFeature_RNA | percent.mt | RNA_snn_res.0.5 | seurat_clusters | celltype | old.ident | |
---|---|---|---|---|---|---|---|---|
Month-01_AAACCTGAGAATTGTG-1 | Month-01 | 6417.0 | 2431 | 2.898551 | 11 | 11 | Other | 11 |
Month-01_AAACCTGAGCAACGGT-1 | Month-01 | 8522.0 | 2464 | 2.663694 | 6 | 6 | Progenitor | 6 |
Ligand-Receptor pairs¶
Different databases of ligand-receptor interactions could be used. We previously created a repository that includes many available DBs (https://github.com/LewisLabUCSD/Ligand-Receptor-Pairs). In this tutorial, we employ the ligand-receptor pairs from CellChat (https://doi.org/10.1038/s41467-021-21246-9), which includes multimeric protein complexes.
lr_pairs = pd.read_csv('https://raw.githubusercontent.com/LewisLabUCSD/Ligand-Receptor-Pairs/master/Human/Human-2020-Jin-LR-pairs.csv')
lr_pairs = lr_pairs.astype(str)
lr_pairs.head(2)
interaction_name | pathway_name | ligand | receptor | agonist | antagonist | co_A_receptor | co_I_receptor | evidence | annotation | interaction_name_2 | ligand_symbol | receptor_symbol | ligand_ensembl | receptor_ensembl | interaction_symbol | interaction_ensembl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TGFB1_TGFBR1_TGFBR2 | TGFb | TGFB1 | TGFbR1_R2 | TGFb agonist | TGFb antagonist | nan | TGFb inhibition receptor | KEGG: hsa04350 | Secreted Signaling | TGFB1 - (TGFBR1+TGFBR2) | TGFB1 | TGFBR1&TGFBR2 | ENSG00000105329 | ENSG00000106799&ENSG00000163513 | TGFB1^TGFBR1&TGFBR2 | ENSG00000105329^ENSG00000106799&ENSG00000163513 |
1 | TGFB2_TGFBR1_TGFBR2 | TGFb | TGFB2 | TGFbR1_R2 | TGFb agonist | TGFb antagonist | nan | TGFb inhibition receptor | KEGG: hsa04350 | Secreted Signaling | TGFB2 - (TGFBR1+TGFBR2) | TGFB2 | TGFBR1&TGFBR2 | ENSG00000092969 | ENSG00000106799&ENSG00000163513 | TGFB2^TGFBR1&TGFBR2 | ENSG00000092969^ENSG00000106799&ENSG00000163513 |
# interaction columns:
int_columns = ('ligand_symbol', 'receptor_symbol')
2 - Data Preprocessing¶
RNA-seq for CCC¶
Organize data to create tensor
First, define contexts
contexts = ['Month-01', 'Month-03', 'Month-06', 'Month-10']
Generate list of RNA-seq data containing all contexts (time points)
Here, we convert the single-cell expression levels into cell-type expression levels. A basic context-wise preprocessing is performed here, keeping only genes that are expressed at least in 4 single cells, and cell types per context with at least 20 cells.
Important: A list of aggregated gene expression matrices must be used for building the 4D-communication tensor. Each of the gene expression matrices is for one of the contexts, following the same order as the previouslist (contexts
).
The aggregation here corresponds to mean expression value across cells of each cell type. Other aggregation methods could be used by changing the parameter method='average'
in the aggregate_single_cell()
function. Additionally, other gene expression values could be passed, using other preprocessing approaches such as log(x+1) or batch correction methods.
rnaseq_matrices = []
for context in tqdm(contexts):
meta_context = rnaseq.obs.loc[(rnaseq.obs['orig.ident'] == context) \
& (rnaseq.obs['celltype'] != 'Other') & (rnaseq.obs['celltype'] != 'MC')]
# Remove celltype per context that has few single cells
min_sc_number = 20
excluded_sc = []
for idx, row in (meta_context.groupby(['celltype'])[['celltype']].count() >= min_sc_number).iterrows():
if ~row['celltype']:
excluded_sc.append(idx)
meta_context = meta_context.loc[~meta_context['celltype'].isin(excluded_sc)]
cells = list(meta_context.index)
meta_context.index.name = 'barcode'
tmp_data = rnaseq[cells]
# Keep genes in each sample with at least 4 single cells expressing it
sc.pp.filter_genes(tmp_data, min_cells=4)
# Normalize and Log1p
sc.pp.normalize_total(tmp_data, target_sum=1e6)
sc.pp.log1p(tmp_data)
# Aggregate gene expression of single cells into cell types
exp_df = c2c.preprocessing.aggregate_single_cells(rnaseq_data=tmp_data.to_df(),
metadata=meta_context,
barcode_col='barcode',
celltype_col='celltype',
method='average',
)
rnaseq_matrices.append(exp_df)
0%| | 0/4 [00:00<?, ?it/s]
rnaseq_matrices
is the list we will use for building the tensor based on protein-based ligand-receptor pairs.
LR pairs¶
Remove bidirectionality in the list of protein-based ligand-receptor pairs. That is, remove repeated interactions where both interactions are the same but in different order:
From this list:
Ligand | Receptor |
---|---|
Protein A | Protein B |
Protein B | Protein A |
We will have:
Ligand | Receptor |
---|---|
Protein A | Protein B |
lr_pairs = c2c.preprocessing.ppi.remove_ppi_bidirectionality(ppi_data=lr_pairs,
interaction_columns=int_columns
)
Removing bidirectionality of PPI network
lr_pairs.shape
(1988, 17)
Generate a dictionary with function info for each LR pairs.
Keys are LIGAND_NAME^RECEPTOR_NAME (this is the same nomenclature that will be used when building the protein-based 4D-communication tensor later), and values are the function in the annotation column in the dataframe containing ligand-receptor pairs. Other functional annotations of the LR pairs could be used if available.
ppi_functions = dict()
for idx, row in lr_pairs.iterrows():
ppi_functions[row['interaction_symbol']] = row['annotation']
3 - Predict metabolic activities with scCellFie¶
Here, the python tool scCellFie to predict metabolic activities based on the gene expression of enzymes that are part of a specific metabolic task, as indicated in this tutorial.
We only need the raw counts in the single-cell data to make the predictions, as scCellFie performs all necessary processing. However, to perform a smoothing based on neighbors, we will compute the neighbors based on the harmony components:
sc.pp.neighbors(rnaseq, use_rep='X_harmony')
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
rnaseq
AnnData object with n_obs × n_vars = 15973 × 21042 obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'RNA_snn_res.0.5', 'seurat_clusters', 'celltype', 'old.ident' var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable' uns: 'neighbors' obsm: 'X_harmony', 'X_pca', 'X_umap' obsp: 'distances', 'connectivities'
results = sccellfie.run_sccellfie_pipeline(rnaseq,
organism='human',
sccellfie_data_folder=None,
n_counts_col='nCount_RNA', # Total counts per cell will be computed if left as None,
process_by_group=False, # Whether to do the processing by cell groups
groupby=None, # Column indicating cell groups if `process_by_group=True`
neighbors_key='neighbors', # Neighbors information if precomputed. Otherwise, it will be computed here
n_neighbors=10, # Number of neighbors to use
batch_key=None, # there is no batch_key in this dataset
threshold_key='sccellfie_threshold', # This is for using the default database. If personalized thresholds are used, specificy column name
smooth_cells=True, # Whether to perform gene expression smoothing before running the tool
alpha=0.33, # Importance of neighbors' expression for the smoothing (0 to 1)
chunk_size=5000, # Number of chunks to run the processing steps (helps with the memory)
disable_pbar=False,
save_folder=None, # In case results will be saved. If so, results will not be returned and should be loaded from the folder (see sccellfie.io.load_data function
save_filename=None # Name for saving the files, otherwise a default name will be used
)
==== scCellFie Pipeline: Initializing ==== Loading scCellFie database for organism: human ==== scCellFie Pipeline: Processing entire dataset ==== ---- scCellFie Step: Preprocessing data ---- ---- scCellFie Step: Preparing inputs ---- Gene names corrected to match database: 22 Shape of new adata object: (15973, 838) Number of GPRs: 764 Shape of tasks by genes: (218, 838) Shape of reactions by genes: (764, 838) Shape of tasks by reactions: (218, 764) ---- scCellFie Step: Smoothing gene expression ----
Smoothing Expression: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.33it/s]
---- scCellFie Step: Computing gene scores ---- ---- scCellFie Step: Computing reaction activity ----
Cell Rxn Activities: 100%|███████████████████████████████████████████████████| 15973/15973 [01:39<00:00, 159.88it/s]
---- scCellFie Step: Computing metabolic task activity ---- Removed 0 metabolic tasks with zeros across all cells. ==== scCellFie Pipeline: Processing completed successfully ====
The results can be found here:
results.keys()
dict_keys(['adata', 'gpr_rules', 'task_by_gene', 'rxn_by_gene', 'task_by_rxn', 'rxn_info', 'task_info', 'thresholds', 'organism'])
To access metabolic activities, we need to inspect results['adata']
:
The processed single-cell data is located in the AnnData object
results['adata']
.The reaction activities for each cell are located in the AnnData object
results['adata'].reactions
.The metabolic task activities for each cell are located in the AnnData object
results['adata'].metabolic_tasks
.
met_activities = results['adata'].metabolic_tasks.copy()
We will scale these activities with respect to the ranges previously computed across the CZI CELLxGENE Human Cell Atlas
minmax = pd.read_csv('https://raw.githubusercontent.com/ventolab/sccellfie-website/refs/heads/main/data/CELLxGENEMetabolicTasksMinMax.csv', index_col=0)
minmax
(R)-3-Hydroxybutanoate synthesis | 3'-Phospho-5'-adenylyl sulfate synthesis | AMP salvage from adenine | ATP generation from glucose (hypoxic conditions) - glycolysis | ATP regeneration from glucose (normoxic conditions) - glycolysis + krebs cycle | Acetoacetate synthesis | Alanine degradation | Alanine synthesis | Arachidonate degradation | Arachidonate synthesis | ... | Valine to succinyl-coA | Vesicle secretion | beta-Alanine degradation | beta-Alanine synthesis | cis-vaccenic acid degradation | cis-vaccenic acid synthesis | gamma-Linolenate degradation | gamma-Linolenate synthesis | glyco-cholate synthesis | tauro-cholate synthesis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
single_cell_min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
single_cell_max | 7.419829 | 11.795217 | 20.651447 | 13.613079 | 4.979552 | 8.038148 | 3.030361 | 7.951063 | 1.752682 | 4.296328 | ... | 6.424888 | 2.780753 | 2.980583 | 6.203485 | 2.161476 | 4.050148 | 2.341566 | 7.578767 | 2.982760 | 2.982760 |
cell_type_min | 0.041299 | 0.000000 | 0.000000 | 0.033310 | 0.024252 | 0.023852 | 0.028060 | 0.025760 | 0.000000 | 0.021624 | ... | 0.000000 | 0.000000 | 0.025981 | 0.029364 | 0.000000 | 0.013848 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
cell_type_max | 4.403184 | 4.105401 | 5.187688 | 7.840590 | 3.159272 | 4.670686 | 1.874831 | 4.480565 | 0.854841 | 2.594763 | ... | 1.313472 | 0.798270 | 1.899791 | 3.463882 | 1.151658 | 2.225383 | 1.263210 | 1.716187 | 0.941299 | 0.941299 |
4 rows × 218 columns
We divide each task value by the max value found across the human cell atlas, so we scale them between 0 and 1:
met_activities.X = np.clip(met_activities.X / minmax.loc['single_cell_max', :].values.reshape(-1), a_min=0., a_max=1.0)
Similar to the CCC part where we prepared the gene expression data, here we convert the single-cell metabolic activities into cell-type activity levels.
A basic context-wise preprocessing is performed here, keeping only cell types per context with at least 20 cells.
met_matrices = []
for context in tqdm(contexts):
meta_context = rnaseq.obs.loc[(rnaseq.obs['orig.ident'] == context) \
& (rnaseq.obs['celltype'] != 'Other') & (rnaseq.obs['celltype'] != 'MC')]
# Remove celltype per context that has few single cells
min_sc_number = 20
excluded_sc = []
for idx, row in (meta_context.groupby(['celltype'])[['celltype']].count() >= min_sc_number).iterrows():
if ~row['celltype']:
excluded_sc.append(idx)
meta_context = meta_context.loc[~meta_context['celltype'].isin(excluded_sc)]
cells = list(meta_context.index)
meta_context.index.name = 'barcode'
tmp_data = met_activities[cells]
# Aggregate gene expression of single cells into cell types, using the trimean
exp_df = sccellfie.expression.aggregation.agg_expression_cells(met_activities,
groupby='celltype',
agg_func='trimean')
met_matrices.append(exp_df)
0%| | 0/4 [00:00<?, ?it/s]
met_matrices[0]
Task | (R)-3-Hydroxybutanoate synthesis | 3'-Phospho-5'-adenylyl sulfate synthesis | AMP salvage from adenine | ATP generation from glucose (hypoxic conditions) - glycolysis | ATP regeneration from glucose (normoxic conditions) - glycolysis + krebs cycle | Acetoacetate synthesis | Alanine degradation | Alanine synthesis | Arachidonate degradation | Arachidonate synthesis | ... | Valine to succinyl-coA | Vesicle secretion | beta-Alanine degradation | beta-Alanine synthesis | cis-vaccenic acid degradation | cis-vaccenic acid synthesis | gamma-Linolenate degradation | gamma-Linolenate synthesis | glyco-cholate synthesis | tauro-cholate synthesis |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GABAergic | 0.257290 | 0.070321 | 0.016719 | 0.248574 | 0.289472 | 0.257290 | 0.316788 | 0.251865 | 0.122373 | 0.317457 | ... | 0.024489 | 0.088211 | 0.349128 | 0.258951 | 0.183256 | 0.262169 | 0.175722 | 0.041018 | 0.017065 | 0.017065 |
Glia | 0.228816 | 0.017635 | 0.033727 | 0.215604 | 0.267410 | 0.227830 | 0.336807 | 0.231558 | 0.161978 | 0.301874 | ... | 0.033127 | 0.068634 | 0.364833 | 0.233392 | 0.210315 | 0.257721 | 0.202597 | 0.049065 | 0.034382 | 0.034382 |
Glutamatergic | 0.185345 | 0.064322 | 0.025364 | 0.173935 | 0.232887 | 0.184854 | 0.271598 | 0.184064 | 0.127847 | 0.251595 | ... | 0.024005 | 0.070838 | 0.304126 | 0.193836 | 0.175435 | 0.211203 | 0.169420 | 0.043661 | 0.019224 | 0.019224 |
IP | 0.222547 | 0.052597 | 0.027519 | 0.213665 | 0.270883 | 0.222173 | 0.310499 | 0.222034 | 0.138435 | 0.299648 | ... | 0.024196 | 0.081357 | 0.348271 | 0.224283 | 0.204476 | 0.246714 | 0.195585 | 0.040729 | 0.022202 | 0.022202 |
MC | 0.176754 | 0.014631 | 0.020480 | 0.168017 | 0.204469 | 0.175424 | 0.274017 | 0.175904 | 0.115840 | 0.236990 | ... | 0.029954 | 0.058975 | 0.300923 | 0.181178 | 0.175457 | 0.199442 | 0.165586 | 0.043496 | 0.018289 | 0.018289 |
Other | 0.136290 | 0.022540 | 0.034058 | 0.128478 | 0.162129 | 0.136080 | 0.209196 | 0.134473 | 0.086839 | 0.178832 | ... | 0.031152 | 0.055257 | 0.232028 | 0.137534 | 0.123759 | 0.149276 | 0.118593 | 0.025574 | 0.017805 | 0.017805 |
Progenitor | 0.142688 | 0.011824 | 0.060387 | 0.132497 | 0.168674 | 0.142371 | 0.215409 | 0.141622 | 0.105195 | 0.191222 | ... | 0.040749 | 0.068754 | 0.249974 | 0.149202 | 0.142666 | 0.151951 | 0.138825 | 0.012511 | 0.032817 | 0.032817 |
7 rows × 218 columns
4 - Build Tensors¶
Build 4D-Communication Tensor¶
Here we use as input the list of gene expression matrices (rnaseq_matrices
) that were aggregated into a cell-type granularity. This list contains the expression matrices of all samples/contexts.
The following functions perform all the steps to build a 4D-communication tensor:
The parameter communication_score
indicates the scoring function to use.
how='outer_cells'
considers only LR pairs that are present in all contexts, while all cell types in the whole dataset are kept, regardless they appear only in one or a few contexts.
complex_sep='&'
is used to specify that the list of ligand-receptor pairs contains protein complexes and that subunits are separated by '&'. If the list does not have complexes, use complex_sep=None
instead.
tensor_lr = c2c.tensor.InteractionTensor(rnaseq_matrices=rnaseq_matrices,
ppi_data=lr_pairs,
context_names=contexts,
how='outer_cells',
complex_sep='&',
interaction_columns=int_columns,
communication_score='expression_gmean',
order_labels=['Contexts', 'Protein LRIs', 'Sender Cells', 'Receiver Cells'],
device='cuda' if use_gpu else 'cpu'
)
Getting expression values for protein complexes Building tensor for the provided context
tensor_lr.tensor.shape
(4, 422, 5, 5)
Build 3D Metabolic Task Activity Tensor¶
To build our 3D tensor with dimensions as contexts, receiver cells, and metabolic tasks, we need to make sure that all cell types in the 4D tensor are present in each matrix (for those missing we will assign a NaN value).
We get the list of cell types:
celltypes = tensor_lr.order_names[-1]
Then we reindex all matrices with metabolic activities (this will generate missing rows (index), and fill them with NaNs):
met_activities_nan = [met.reindex(index=celltypes) for met in met_matrices]
Now we can use these scCellFie outputs directly to generate our 3D metabolic activity tensor. First we generate just the tensor containing the values using tensorly:
tensor_3d = tl.tensor(met_activities_nan)
tensor_3d.shape
(4, 5, 218)
We then obtain the name of metabolic task present in the data:
met_names = met_activities_nan[0].columns.tolist()
Now we take this tensor to generate an InteractionTensor
object using cell2cell:
tensor_met = c2c.tensor.PreBuiltTensor(tensor_3d,
order_names=[contexts, celltypes, met_names],
order_labels=['Contexts', 'Receiver Cells', 'Metabolic Tasks']
)
tensor_met.tensor.shape
(4, 5, 218)
5 - Generate coupled tensors¶
After constructing our 4D communication tensor and 3D metabolic task activity tensor, we can create a CoupledInteractionTensor
object in Tensor-cell2cell v2. This object will later be used to perform CTCA.
The key step here is to define which dimensions should be coupled between the tensors. In this example:
- Context dimensions are coupled directly, since they are identical across both tensors.
- Cell type dimension (in the metabolic task tensor) can be coupled either with the sender or receiver dimension (in the communication tensor):
- Couple to sender: Metabolic task activity is aligned with the cells producing the ligands of each LR pair. This suggests that the metabolic state of sender cells could influence or coordinate with their ability to produce signaling molecules.
- Couple to receiver: Metabolic task activity is aligned with the cells receiving the signal through the receptor. This suggests that metabolic programs may be regulated downstream of the receptors, capturing how cell-cell communication drives metabolic changes in the receiving cells.
In this tutorial, we choose the second option (coupling to the receiver). This means we focus on metabolic task activity that is potentially triggered by LR interactions within the CCC program.
Concretely, the context dimension corresponds to dim 0
in both tensors, while the receiver cell dimension is dim 3
in the 4D communication tensor and dim 2
in the 3D metabolic task tensor.
coupled_tensor = c2c.tensor.CoupledInteractionTensor(tensor1=tensor_lr,
tensor2=tensor_met,
tensor1_name='cell2cell',
tensor2_name='scCellFie',
# Below we specify how dims of tensor1 are coupled with dims of tensor2
mode_mapping={'shared' : [(0,0), (3,1)]}
)
We can observe that this object includes both tensors simultaneously.
coupled_tensor.shape
((4, 422, 5, 5), (4, 5, 218))
Assume that missing values are real zeros
Instead of imputing missing values with our method, we will assume that they are real zeros.
coupled_tensor.mask1 = None
coupled_tensor.mask2 = None
6 - Run Analysis¶
Perform tensor factorization¶
Next, we apply a non-negative canonical coupled component analysis (CTCA). Briefly, this tensor decomposition identifies a low-rank tensor (here, a rank of 10) for each modality's tensor that better approximates each original tensor. These low-rank tensors can be represented as the sum of a set of rank-1 tensors (10 of them in this case). Each rank-1 tensor represents a factor in the decomposition and can be further represented as the outer product of n vectors, where n represents the number of tensor dimensions. Each vector represents one of the n tensor dimensions for that factor and its values, corresponding to individual elements in each dimension, represent the factor loadings.
In the CTCA, we have vectors that are shared between the coupled tensors (here contexts and receiver cells), while we have other vectors that are private for each modality (LR pair and sender cells for the 4D communication tensor; metabolic tasks for the 3D metabolic activity tensor). The loadings shared between both tensors behave exactly the same in both modalities, representing that we are capturing LR pairs and metabolic tasks from each modality that present similar trends in each factor. If each tensor was decomposed separately with the standard TCA in Tensor-cell2cell v1, we would not ensure that the key LR pair and metabolic tasks for each factor would behave the same, as this could lead to loadings for the context and receiver cells with different behaviors between modalities.
To illustrate how performing the CTCA for identifying LR pairs acting coordinated with metabolic activities, we skipped the Elbow Analysis, which can be found in our previous tutorial for protein- and metabolite-based LR pairs. Instead, we run the factorization using a manual number of 10 factors:
coupled_tensor.compute_tensor_factorization(rank=10, # This is just for illustrative purposes - to keep a low number of factors
init='svd',
random_state=888,
runs=1, tol=1e-8, n_iter_max=500 # Robust version of tensor decomposition
)
7 - Results¶
Metadata¶
Generate a list containing metadata for each tensor order/dimension - Later used for coloring factor plots
This is a list containing metadata for each dimension of the tensor (contexts, LR pairs, sender cells, receiver cells). Each metadata corresponds to a dataframe with info of each element in the respective dimension.
We start with the protein-based tensor.
meta_tf = c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor_lr,
metadata_dicts=[None, ppi_functions, None, None],
fill_with_order_elements=True
)
We continue with the metabolic activity tensor. We can find metabolic tasks metadata in results['task_info']
:
results['task_info'].head()
Task | System | Subsystem | |
---|---|---|---|
0 | (R)-3-Hydroxybutanoate synthesis | CARBOHYDRATES METABOLISM | KETOGENESIS |
1 | 3'-Phospho-5'-adenylyl sulfate synthesis | NUCLEOTIDE METABOLISM | COFACTOR |
2 | AMP salvage from adenine | NUCLEOTIDE METABOLISM | SALVAGE |
3 | ATP generation from glucose (hypoxic condition... | ENERGY METABOLISM | ATP GENERATION |
4 | ATP regeneration from glucose (normoxic condit... | ENERGY METABOLISM | ATP GENERATION |
Here we will use categories at the System level:
task_mapper = results['task_info'].set_index('Task')['System'].to_dict()
met_functions = {task: task_mapper[task] for task in tensor_met.order_names[2]}
meta_tf2 = c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor_met,
metadata_dicts=[None, None, met_functions],
fill_with_order_elements=True
)
Generate unified metadata for CTCA results
By using the metadata generated separately for each tensor, we generate a new metadata that incorporates metadata for the shared and private dimensions between both modalities.
new_meta_tf = coupled_tensor.reorder_metadata(meta_tf, meta_tf2)
Plot factor loadings¶
After performing the CTCA, a set of loadings is obtained in a factor-specific way, for each tensor. These loadings provide details of what elements of each dimension are important within each factor. Here we plot first the loadings for the shared dimensions, followed by the private dimensions of the first and second tensor (LR pairs and Metabolic Tasks, respectively).
We generate the order of LR pairs and metabolic tasks given their categories:
lr_order = meta_tf[1].sort_values(['Category', 'Element'])['Element'].values.tolist()
met_order = meta_tf2[2].sort_values(['Category', 'Element'])['Element'].values.tolist()
And color palettes for each of the dimensions:
# Color palettes for each of the coupled tensor dimensions, ordered as order_sorting list below
cmaps = ['viridis', 'tab20', 'tab20', 'Dark2', 'tab20']
factors, axes = c2c.plotting.tensor_factors_plot(interaction_tensor=coupled_tensor,
metadata=new_meta_tf,
sample_col='Element',
group_col='Category',
meta_cmaps=cmaps,
# Order of dimensions
order_sorting=['Contexts', 'Sender Cells', 'Receiver Cells',
'Protein LRIs', 'Metabolic Tasks'
],
# Order of elements
reorder_elements={'Protein LRIs' : lr_order,
'Metabolic Tasks': met_order},
fontsize=14,
filename=output_folder + 'Tensor-Factorization.pdf'
)
Each factor represents a context-dependent communication pattern. In this visualization, each row is a factor and each column is a tensor dimension in that factor (Shared dimensions: contexts and receiver-cells; Private dimensions: LR pairs, sender-cells, and metabolic tasks).
The loadings represent the contribution of each element to that pattern, and their combinations between all dimensions represent the overall pattern. For example, in Factor 9, the identified communication pattern occurs in early stages, decreasing across time, where glial cells and progenitors play a major role as sender and as receivers. Then we can identify the top LR pairs and metabolic tasks contributing from cognate modalities. Thus, we can say that the receptors of these LR pairs act in coordination with the metabolic tasks in the receiver cells.
Top LR pairs and metabolic tasks¶
Top-10 LR pairs
for i in range(coupled_tensor.rank):
print(coupled_tensor.get_top_factor_elements('Protein LRIs', 'Factor {}'.format(i+1), 10))
print('')
MDK^NCL 0.447461 CDH2^CDH2 0.354247 CD99^CD99 0.293988 MDK^SDC2 0.288428 PTN^NCL 0.238056 MDK^LRP1 0.173126 PTN^SDC2 0.152314 MPZL1^MPZL1 0.135136 MDK^PTPRZ1 0.132628 GAS6^AXL 0.112833 Name: Factor 1, dtype: float64 MDK^NCL 0.468916 PTN^NCL 0.411515 MDK^SDC2 0.396334 CD99^CD99 0.338853 PTN^SDC2 0.335042 MDK^PTPRZ1 0.198042 CDH2^CDH2 0.173948 MDK^LRP1 0.169168 PTN^PTPRZ1 0.168926 MDK^SDC4 0.105507 Name: Factor 2, dtype: float64 PTN^PTPRZ1 0.531429 MDK^PTPRZ1 0.461338 PTN^NCL 0.439472 MDK^NCL 0.339584 CD99^CD99 0.184761 CDH2^CDH2 0.158804 PTN^SDC3 0.109892 PTN^SDC1 0.105353 MDK^SDC1 0.098751 MDK^LRP1 0.097240 Name: Factor 3, dtype: float64 NCAM1^NCAM1 0.420415 MDK^NCL 0.390320 PTN^NCL 0.345838 CADM1^CADM1 0.325055 CDH2^CDH2 0.295218 MDK^LRP1 0.169197 EFNB2^EPHA4 0.135251 PTN^SDC3 0.128014 NRXN1^NLGN2 0.121424 MPZL1^MPZL1 0.119317 Name: Factor 4, dtype: float64 PTN^PTPRZ1 0.449854 MDK^PTPRZ1 0.434368 PTN^SDC3 0.295742 CDH2^CDH2 0.226626 PTN^NCL 0.220143 MDK^NCL 0.192212 MDK^LRP1 0.164232 CD99^CD99 0.147850 MDK^ITGA6&ITGB1 0.140440 DLL1^NOTCH2 0.122669 Name: Factor 5, dtype: float64 MDK^PTPRZ1 0.478962 MDK^NCL 0.427506 CD99^CD99 0.425680 MDK^LRP1 0.192323 EFNA3^EPHA5 0.185464 EFNA3^EPHA4 0.174341 CADM1^CADM1 0.163048 EFNA3^EPHA7 0.138361 DLK1^NOTCH1 0.121665 LRRC4B^PTPRF 0.116205 Name: Factor 6, dtype: float64 MDK^NCL 0.348511 MDK^SDC2 0.325627 DLL3^NOTCH2 0.251937 DLK1^NOTCH2 0.247194 NCAM1^FGFR1 0.228189 MDK^ITGA6&ITGB1 0.208784 CDH2^CDH2 0.203821 CD99^CD99 0.179026 PTN^SDC2 0.176697 PTN^NCL 0.169592 Name: Factor 7, dtype: float64 PTN^PTPRZ1 0.335747 NCAM1^NCAM1 0.316447 NCAM1^FGFR1 0.303607 MDK^PTPRZ1 0.234244 NRXN1^NLGN1 0.213643 CADM1^CADM1 0.196564 MDK^ITGA6&ITGB1 0.195380 DLL3^NOTCH2 0.195363 NRXN1^NLGN3 0.192741 NRXN1^NLGN2 0.185744 Name: Factor 8, dtype: float64 MDK^ITGA6&ITGB1 0.359098 DLK1^NOTCH2 0.218617 MDK^SDC2 0.180041 COL4A5^ITGAV&ITGB8 0.178171 CD99^CD99 0.174371 DLK1^NOTCH3 0.163681 COL4A6^ITGAV&ITGB8 0.154054 COL6A1^ITGAV&ITGB8 0.145910 GAS6^AXL 0.142663 LAMB2^ITGA6&ITGB1 0.126305 Name: Factor 9, dtype: float64 NCAM1^NCAM1 0.411756 MDK^NCL 0.245947 CADM1^CADM1 0.227029 CADM3^CADM3 0.199214 L1CAM^L1CAM 0.195029 PTN^SDC3 0.184067 PTN^NCL 0.173581 SEMA6A^PLXNA4 0.163215 MDK^LRP1 0.158855 EFNB3^EPHB6 0.157455 Name: Factor 10, dtype: float64
Top 10 Metabolic Tasks
for i in range(coupled_tensor.rank):
print(coupled_tensor.get_top_factor_elements('Metabolic Tasks', 'Factor {}'.format(i+1), 10))
print('')
Ceramide synthesis 0.174836 Sphingomyelin synthesis 0.159966 beta-Alanine degradation 0.155451 Proline degradation 0.154990 Phosphatidyl-choline synthesis 0.152423 Phosphatidyl-ethanolamine synthesis 0.151645 Phosphatidyl-serine synthesis 0.150507 Ornithine degradation 0.149437 Valine degradation 0.144739 Arginine degradation 0.142601 Name: Factor 1, dtype: float64 Ceramide synthesis 0.200701 Isoleucine degradation 0.165953 Synthesis of galactosyl glucosyl ceramide (link with ganglioside metabolism) 0.163736 Ornithine degradation 0.153292 beta-Alanine degradation 0.151075 Phosphatidyl-ethanolamine synthesis 0.150496 Proline degradation 0.149675 Phosphatidyl-choline synthesis 0.149418 Lysine degradation 0.148358 Threonine degradation 0.147263 Name: Factor 2, dtype: float64 Proline degradation 0.195586 beta-Alanine degradation 0.192549 Ornithine degradation 0.188544 Phosphatidyl-ethanolamine synthesis 0.183235 Phosphatidyl-choline synthesis 0.182657 Phosphatidyl-serine synthesis 0.178943 Threonine degradation 0.178147 Lysine degradation 0.171664 Glutamate degradation 0.168758 Leucine degradation 0.165325 Name: Factor 3, dtype: float64 Ceramide synthesis 0.263489 Sphingomyelin synthesis 0.226176 Synthesis of galactosyl glucosyl ceramide (link with ganglioside metabolism) 0.210807 Isoleucine degradation 0.206439 Valine degradation 0.194876 Alanine degradation 0.178883 (R)-3-Hydroxybutanoate synthesis 0.157006 Glycine synthesis 0.150121 Triacylglycerol synthesis 0.149223 Aspartate synthesis 0.142947 Name: Factor 4, dtype: float64 Sphingomyelin synthesis 0.196100 Valine degradation 0.178646 Ceramide synthesis 0.167458 Alanine degradation 0.165018 Synthesis of galactosyl glucosyl ceramide (link with ganglioside metabolism) 0.139448 Isoleucine degradation 0.138287 Ornithine degradation 0.137874 Phosphatidyl-serine synthesis 0.137472 Arginine degradation 0.135105 Proline degradation 0.134954 Name: Factor 5, dtype: float64 Sphingomyelin synthesis 0.357071 Valine degradation 0.342053 Alanine degradation 0.330187 Aspartate synthesis 0.246301 ATP generation from glucose (hypoxic conditions) - glycolysis 0.235262 Cardiolipin synthesis 0.215084 Synthesis of globoside (link with globoside metabolism) 0.209443 Glycine degradation 0.184424 Linolenate degradation 0.178612 Guanosine triphosphate synthesis (GTP) 0.176691 Name: Factor 6, dtype: float64 Conversion of glutamate to glutamine 0.369020 Synthesis of taurine from cysteine 0.316840 Keratan sulfate degradation 0.271127 Degradation of adenine to urate 0.263173 Fructose to glucose conversion (via fructose-6-phosphate) 0.262595 AMP salvage from adenine 0.259930 Homocysteine synthesis (need methionine) 0.219605 Cysteine synthesis (need serine and methionine) 0.211448 Synthesis of creatine from arginine 0.207733 Biosynthesis of core4 (N-Acetyl-beta-D-glucosaminyl-1,6-(N-acetyl-beta-D-glucosaminyl-1,3)-N-acetyl-D-galactosaminyl-R) 0.206311 Name: Factor 7, dtype: float64 Sphingomyelin synthesis 0.367898 Cysteine degradation 0.340516 Palmitolate degradation 0.306788 Palmitate degradation 0.294419 Elaidate degradation 0.280084 Deoxyguanosine triphosphate synthesis (dGTP) 0.254476 Guanosine triphosphate synthesis (GTP) 0.253583 Valine degradation 0.229901 Cardiolipin synthesis 0.223260 Alanine degradation 0.198617 Name: Factor 8, dtype: float64 Synthesis of taurine from cysteine 0.624341 Keratan sulfate biosynthesis from N-glycan 0.373960 Synthesis of coenzyme A 0.263272 Keratan sulfate biosynthesis from O-glycan (core 4-linked) 0.220560 Keratan sulfate biosynthesis from O-glycan (core 2-linked) 0.220560 UDP-glucose synthesis 0.199930 Heme synthesis 0.197824 Synthesis of aspartate from glutamine 0.188478 O-linked glycosylation 0.177559 Branching (N-acetylglucosaminyltransferases) 0.172314 Name: Factor 9, dtype: float64 Conversion of asparate to asparagine 0.260433 Sphingomyelin synthesis 0.248437 Synthesis of galactosyl glucosyl ceramide (link with ganglioside metabolism) 0.247289 Degradation of cytosine 0.199181 Degradation of uracil 0.199181 3'-Phospho-5'-adenylyl sulfate synthesis 0.190187 Methionine degradation 0.188159 Arachidonate synthesis 0.182839 Triacylglycerol synthesis 0.181313 Ornithine degradation 0.172243 Name: Factor 10, dtype: float64
Export Loadings¶
THESE VALUES CAN BE USED FOR DOWNSTREAM ANALYSES
coupled_tensor.export_factor_loadings(output_folder + 'Coupled-Loadings.xlsx')
Coupled tensor factor loadings saved to ./results/scCellFie/Coupled-Loadings.xlsx
8 - Downstream Analyses¶
Generate factor-specific networks¶
We can generate factor-specific communication networks connecting sender-to-receiver cells.
networks = c2c.analysis.tensor_downstream.get_factor_specific_ccc_networks(coupled_tensor.factors,
sender_label='Sender Cells',
receiver_label='Receiver Cells')
Generate matrix of cell-cell pairs by factors
network_by_factors = c2c.analysis.tensor_downstream.flatten_factor_ccc_networks(networks, orderby='receivers')
Select cell-cell pairs with high potential of interaction
_ = plt.hist(network_by_factors.values.flatten(), bins = 20)
# To consider only cell pairs with high potential to interact in each factor
cci_threshold = 0.2
Visualize networks of factors with significant differences between groups
net_fig, net_ax = c2c.plotting.ccc_networks_plot(coupled_tensor.factors,
included_factors=None,
ccc_threshold=cci_threshold, # Only important communication
nrows=2,
panel_size=(8, 8),
network_layout='circular',
filename=output_folder + 'Networks.pdf'
)
Clustermaps¶
Similarly, we can visualize the contribution of each dimension element across factors by plotting their loadings
factor_sorted = ['Factor {}'.format(f) for f in range(1, coupled_tensor.rank+1)]
Cluster cell-cell pairs by their potential across factors
Here we display the outer product between the sender and receiver cell dimensions, which are the value governing the factor-specific networks above.
cci_cm = c2c.plotting.loading_clustermap(network_by_factors[factor_sorted],
use_zscore=False,
loading_threshold=cci_threshold, # To consider relevant CCIs
figsize=(20, 9),
tick_fontsize=16,
cmap='YlOrBr',
filename=output_folder + 'Clustermap-Factor-specific-CCI.pdf',
row_cluster=False
)
Cluster LR pairs by their importance across factors
We can plot the key LR pairs from the CCC modality and TFs from the TF activity modality. First we identify what could represent a high loading given their separate distributions.
_ = plt.hist(coupled_tensor.factors['Protein LRIs'].values.flatten(), bins = 20)
_ = plt.hist(coupled_tensor.factors['Metabolic Tasks'].values.flatten(), bins = 20)
And set and arbitrary threshold to filter out those that have low loadings
# To consider only most relevant LR pairs
lr_threshold = 0.15
met_threshold = 0.15
Since we have loadings from different modalities, we want to visualize them in a comparable way, so we normalize each of them by their maximum value in each case.
df1 = coupled_tensor.factors['Protein LRIs'].copy()
df1 = df1[(df1.T > lr_threshold).any()]
df1 = (df1 / df1.max().max())
df2 = coupled_tensor.factors['Metabolic Tasks'].copy()
df2 = df2[(df2.T > met_threshold).any()]
df2 = (df2 / df2.max().max())
lr_cm = c2c.plotting.loading_clustermap(df1,
loading_threshold=lr_threshold, # To consider only top LRs
use_zscore=False,
figsize=(10, 5),
cbar_fontsize=20,
tick_fontsize=10,
cmap='YlOrBr',
filename=output_folder + 'Clustermap-Factor-specific-ProtLRs.pdf',
row_cluster=False
)
lr_cm = c2c.plotting.loading_clustermap(df2,
loading_threshold=met_threshold, # To consider only top LRs
use_zscore=False,
figsize=(16, 5),
cbar_fontsize=20,
tick_fontsize=10,
cmap='YlOrBr',
filename=output_folder + 'Clustermap-Factor-specific-MetabolicTasks.pdf',
row_cluster=False
)
Given that we max-scaled them, we can also put them together in an unified plot
df = pd.concat([df1, df2])
lr_cm = c2c.plotting.loading_clustermap(df,
loading_threshold=0.1, # To consider only top LRs and TFs
use_zscore=False,
figsize=(24, 5),
cbar_fontsize=20,
tick_fontsize=10,
cmap='YlOrBr',
filename=output_folder + 'Clustermap-Factor-specific-LR-MetTasks.pdf',
row_cluster=False
)