Building multimodal 4D communication tensors (MEBOCOST)¶
This tutorial is focused on running Tensor-cell2cell v2 using the tools cell2cell and MEBOCOST to predict protein- and metabolite-mediated cell-cell communication (CCC), respectively.
In this case, we use samples from cortical brain organoids across different time points, previously published on https://doi.org/10.1016/j.stem.2019.08.002.
Read this if you plan using a GPU to speed-up the analysis. Before running this notebook, make sure to have a proper NVIDIA GPU driver (https://www.nvidia.com/Download/index.aspx) as well as the CUDA toolkit (https://developer.nvidia.com/cuda-toolkit) installed.
Then, make sure to create an environment with Pytorch >= v1.8.0 following these instructions to enable CUDA.
https://pytorch.org/get-started/locally/
Once you have everything installed, run the next blocks two blocks of code before everything.
If you are using a NVIDIA GPU, with PyTorch and CUDA, set the following variable to be True
use_gpu = True
, otherwise set it use_gpu = False
use_gpu = True
import tensorly as tl
if use_gpu:
tl.set_backend('pytorch')
import cell2cell as c2c
import scanpy as sc
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
1 - Load Data¶
import os
data_folder = './data/'
directory = os.fsencode(data_folder)
output_folder = './data/Tensor-cell2cell/'
if not os.path.isdir(output_folder):
os.mkdir(output_folder)
RNA-seq data¶
A preprocessed version of the original paper (Trujillo et al, 2019) can be found here
rnaseq = sc.read_h5ad(data_folder + '/annotated_seurat_norm_harmony_2022.h5ad')
rnaseq = rnaseq.raw.to_adata()
rnaseq.obs.head(2)
orig.ident | nCount_RNA | nFeature_RNA | percent.mt | RNA_snn_res.0.5 | seurat_clusters | celltype | old.ident | |
---|---|---|---|---|---|---|---|---|
Month-01_AAACCTGAGAATTGTG-1 | Month-01 | 6417.0 | 2431 | 2.898551 | 11 | 11 | Other | 11 |
Month-01_AAACCTGAGCAACGGT-1 | Month-01 | 8522.0 | 2464 | 2.663694 | 6 | 6 | Progenitor | 6 |
Ligand-Receptor pairs¶
Different databases of ligand-receptor interactions could be used. We previously created a repository that includes many available DBs (https://github.com/LewisLabUCSD/Ligand-Receptor-Pairs). In this tutorial, we employ the ligand-receptor pairs from CellChat (https://doi.org/10.1038/s41467-021-21246-9), which includes multimeric protein complexes.
lr_pairs = pd.read_csv('https://raw.githubusercontent.com/LewisLabUCSD/Ligand-Receptor-Pairs/master/Human/Human-2020-Jin-LR-pairs.csv')
lr_pairs = lr_pairs.astype(str)
lr_pairs.head(2)
interaction_name | pathway_name | ligand | receptor | agonist | antagonist | co_A_receptor | co_I_receptor | evidence | annotation | interaction_name_2 | ligand_symbol | receptor_symbol | ligand_ensembl | receptor_ensembl | interaction_symbol | interaction_ensembl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TGFB1_TGFBR1_TGFBR2 | TGFb | TGFB1 | TGFbR1_R2 | TGFb agonist | TGFb antagonist | nan | TGFb inhibition receptor | KEGG: hsa04350 | Secreted Signaling | TGFB1 - (TGFBR1+TGFBR2) | TGFB1 | TGFBR1&TGFBR2 | ENSG00000105329 | ENSG00000106799&ENSG00000163513 | TGFB1^TGFBR1&TGFBR2 | ENSG00000105329^ENSG00000106799&ENSG00000163513 |
1 | TGFB2_TGFBR1_TGFBR2 | TGFb | TGFB2 | TGFbR1_R2 | TGFb agonist | TGFb antagonist | nan | TGFb inhibition receptor | KEGG: hsa04350 | Secreted Signaling | TGFB2 - (TGFBR1+TGFBR2) | TGFB2 | TGFBR1&TGFBR2 | ENSG00000092969 | ENSG00000106799&ENSG00000163513 | TGFB2^TGFBR1&TGFBR2 | ENSG00000092969^ENSG00000106799&ENSG00000163513 |
# interaction columns:
int_columns = ('ligand_symbol', 'receptor_symbol')
MEBOCOST ouputs¶
In this tutorial we assume that you already ran MEBOCOST and saved its ouputs. You can find an example on how running it and format outputs to be used here:
import os
mebocost_folder = data_folder + '/mebocost-results/'
directory = os.fsencode(mebocost_folder )
data = dict()
for file in sorted(os.listdir(directory)):
filename = os.fsdecode(file)
if filename.endswith(".csv"):
print(filename)
basename = os.path.basename(filename)
sample = basename.split('.')[0]
sample = sample.lstrip('MEBOCOST')[1:]
data[sample] = pd.read_csv(mebocost_folder + filename)
data[sample] = data[sample].loc[data[sample]['Annotation'].str.contains('Receptor')]
else:
continue
MEBOCOST-Month-01.csv MEBOCOST-Month-03.csv MEBOCOST-Month-06.csv MEBOCOST-Month-10.csv
2 - Data Preprocessing¶
RNA-seq¶
Organize data to create tensor
First, define contexts
contexts = ['Month-01', 'Month-03', 'Month-06', 'Month-10']
Generate list of RNA-seq data containing all contexts (time points)
Here, we convert the single-cell expression levels into cell-type expression levels. A basic context-wise preprocessing is performed here, keeping only genes that are expressed at least in 4 single cells, and cell types per context with at least 20 cells.
Important: A list of aggregated gene expression matrices must be used for building the 4D-communication tensor. Each of the gene expression matrices is for one of the contexts, following the same order as the previouslist (contexts
).
The aggregation here corresponds to mean expression value across cells of each cell type. Other aggregation methods could be used by changing the parameter method='average'
in the aggregate_single_cell()
function. Additionally, other gene expression values could be passed, using other preprocessing approaches such as log(x+1) or batch correction methods.
rnaseq_matrices = []
for context in tqdm(contexts):
meta_context = rnaseq.obs.loc[(rnaseq.obs['orig.ident'] == context) \
& (rnaseq.obs['celltype'] != 'Other') & (rnaseq.obs['celltype'] != 'MC')]
# Remove celltype per context that has few single cells
min_sc_number = 20
excluded_sc = []
for idx, row in (meta_context.groupby(['celltype'])[['celltype']].count() >= min_sc_number).iterrows():
if ~row['celltype']:
excluded_sc.append(idx)
meta_context = meta_context.loc[~meta_context['celltype'].isin(excluded_sc)]
cells = list(meta_context.index)
meta_context.index.name = 'barcode'
tmp_data = rnaseq[cells]
# Keep genes in each sample with at least 4 single cells expressing it
sc.pp.filter_genes(tmp_data, min_cells=4)
# Normalize and Log1p
sc.pp.normalize_total(tmp_data, target_sum=1e6)
sc.pp.log1p(tmp_data)
# Aggregate gene expression of single cells into cell types
exp_df = c2c.preprocessing.aggregate_single_cells(rnaseq_data=tmp_data.to_df(),
metadata=meta_context,
barcode_col='barcode',
celltype_col='celltype',
method='average',
)
rnaseq_matrices.append(exp_df)
HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=4.0), HTML(value='')))
rnaseq_matrices
is the list we will use for building the tensor based on protein-based ligand-receptor pairs.
LR pairs¶
Remove bidirectionality in the list of protein-based ligand-receptor pairs. That is, remove repeated interactions where both interactions are the same but in different order:
From this list:
Ligand | Receptor |
---|---|
Protein A | Protein B |
Protein B | Protein A |
We will have:
Ligand | Receptor |
---|---|
Protein A | Protein B |
lr_pairs = c2c.preprocessing.ppi.remove_ppi_bidirectionality(ppi_data=lr_pairs,
interaction_columns=int_columns
)
Removing bidirectionality of PPI network
lr_pairs.shape
(1988, 17)
Generate a dictionary with function info for each LR pairs.
Keys are LIGAND_NAME^RECEPTOR_NAME (this is the same nomenclature that will be used when building the protein-based 4D-communication tensor later), and values are the function in the annotation column in the dataframe containing ligand-receptor pairs. Other functional annotations of the LR pairs could be used if available.
ppi_functions = dict()
for idx, row in lr_pairs.iterrows():
ppi_functions[row['interaction_symbol']] = row['annotation']
Similarly, we can generate the functions for the metabolite-based ligand-receptor pairs. For that, we open the MEBOCOST database:
met_sensor = pd.read_csv('https://raw.githubusercontent.com/kaifuchenlab/MEBOCOST/refs/heads/main/data/mebocost_db/human/human_met_sensor_update_May21_2025.tsv',
sep='\t',
index_col=0
)
met_sensor.head(2)
HMDB_ID | standard_metName | metName | Gene_name | Protein_name | Evidence | Annotation | |
---|---|---|---|---|---|---|---|
ID | |||||||
1 | HMDB0006247 | 25-Hydroxycholesterol | 25-Hydroxycholesterol | ABCA1 | Phospholipid-transporting ATPase ABCA1 | 16611739 | Transporter |
2 | HMDB0000517 | L-Arginine | Arginine; L-Arginine | SLC7A1 | High affinity cationic amino acid transporter 1 | 33209975; 17325243; 33639836; 23831616; 221449... | Transporter |
Then, generate the LIGAND_NAME^RECEPTOR_NAME nomenclature that will be used to build the tensor later.
met_sensor['interaction'] = met_sensor.apply(lambda row: row['standard_metName'] + '^' + row['Gene_name'], axis=1)
We get the annotation of each ligand-receptor interaction. Again, this could be a different information/annotation if available.
met_mapper = met_sensor.set_index('interaction').to_dict()['Annotation']
And add it to our ppi_function
dictionary
for m in tensor_met.order_names[1]:
ppi_functions[m] = 'Metabolite ' + met_mapper[m]
3 - Build Tensors¶
Build 4D-Communication Tensor¶
Here we use as input the list of gene expression matrices that were aggregated into a cell-type granularity. This list contains the expression matrices of all samples/contexts.
The following functions perform all the steps to build a 4D-communication tensor:
The parameter communication_score
indicates the scoring function to use.
how='outer_cells'
considers only LR pairs that are present in all contexts, while all cell types in the whole dataset are kept, regardless they appear only in one or a few contexts.
complex_sep='&'
is used to specify that the list of ligand-receptor pairs contains protein complexes and that subunits are separated by '&'. If the list does not have complexes, use complex_sep=None
instead.
tensor_lr = c2c.tensor.InteractionTensor(rnaseq_matrices=rnaseq_matrices,
ppi_data=lr_pairs,
context_names=contexts,
how='outer_cells',
complex_sep='&',
interaction_columns=int_columns,
communication_score='expression_gmean',
order_labels=['Contexts', 'Protein LRIs', 'Sender Cells', 'Receiver Cells'],
device='cuda' if use_gpu else 'cpu'
)
Getting expression values for protein complexes Building tensor for the provided context
tensor_lr.tensor.shape
torch.Size([4, 422, 5, 5])
By using our MEBOCOST outputs directly, we can generate our metabolite-based 4D communication tensor:
tensor_met = c2c.tensor.dataframes_to_tensor(context_df_dict=data,
sender_col='Sender',
receiver_col='Receiver',
ligand_col='Metabolite_Name',
receptor_col='Sensor',
score_col='Commu_Score',
order_labels=['Contexts', 'Metabolite LRIs', 'Sender Cells', 'Receiver Cells'],
how='outer_cells', # Multiple options implemented now
lr_sep='^',
context_order=sorted(data.keys()),
sort_elements=True,
device='cuda' if use_gpu else 'cpu'
)
100%|██████████| 4/4 [00:00<00:00, 6.45it/s]
tensor_met.tensor.shape
torch.Size([4, 52, 5, 5])
Modify MEBOCOST scores
Here, we modify the original MEBOCOST scores (which are the product between metabolite production and sensor expression). Given the way that MEBOCOST computes the scores, it allows negative values (when the enzymes consuming the metabolite present higher expression than those producing it), we make these values to be zero, as in our case we are only interested when the metabolite is produced.
Additionally, we use the square root of the final score, to represent the geometric mean between the ligand and the receptor, consistent with our calculations for the protein-based LR pairs above.
tensor_met.tensor = tl.sqrt((tensor_met.tensor >= 0) * tensor_met.tensor) # Remove negative cases and convert to gmean
Generate a list containing metadata for each tensor order/dimension - Later used for coloring factor plots
This is a list containing metadata for each dimension of the tensor (contexts, LR pairs, sender cells, receiver cells). Each metadata corresponds to a dataframe with info of each element in the respective dimension.
We start with the protein-based tensor.
meta_tf = c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor_lr,
metadata_dicts=[None, ppi_functions, None, None],
fill_with_order_elements=True
)
We continue with the metabolite-based tensor.
meta_tf2 = c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor_met,
metadata_dicts=[None, ppi_functions, None, None],
fill_with_order_elements=True
)
Re-order ligand-receptor interactions based on their categories¶
Later, we will be interested in plotting our results and group LR pairs by their annotations, so we sort them in a way that those with similar annotations will be plot together.
First, we sort them in the metadata.
meta_tf[1].sort_values(['Category', 'Element'], inplace=True)
meta_tf2[1].sort_values(['Category', 'Element'], inplace=True)
Then, based on this new order, we sort their locations within the tensors.
tensor_lr = c2c.tensor.subset_tensor(tensor_lr, subset_dict={1: meta_tf[1]['Element'].values.tolist()})
tensor_met = c2c.tensor.subset_tensor(tensor_met, subset_dict={1: meta_tf2[1]['Element'].values.tolist()})
4 - Export tensors and metadata¶
Finally, we export both, metadata and tensor for both modalities (protein- and metabolite-based communication).
c2c.io.save_data.export_variable_with_pickle(tensor_lr, output_folder + '/Initial-c2c-Tensor.pkl')
c2c.io.save_data.export_variable_with_pickle(tensor_met, output_folder + '/Initial-MEBOCOST-Tensor.pkl')
/data/Tensor-cell2cell//Initial-c2c-Tensor.pkl was correctly saved. /data/Tensor-cell2cell//Initial-MEBOCOST-Tensor.pkl was correctly saved.
c2c.io.save_data.export_variable_with_pickle(meta_tf, output_folder + '/Meta-Initial-c2c-Tensor.pkl')
c2c.io.save_data.export_variable_with_pickle(meta_tf2, output_folder + '/Meta-Initial-MEBOCOST-Tensor.pkl')
/data/Tensor-cell2cell//Meta-Initial-c2c-Tensor.pkl was correctly saved. /data/Tensor-cell2cell//Meta-Initial-MEBOCOST-Tensor.pkl was correctly saved.