Building multimodal 4D communication tensors (MEBOCOST)¶

This tutorial is focused on running Tensor-cell2cell v2 using the tools cell2cell and MEBOCOST to predict protein- and metabolite-mediated cell-cell communication (CCC), respectively.

In this case, we use samples from cortical brain organoids across different time points, previously published on https://doi.org/10.1016/j.stem.2019.08.002.

Read this if you plan using a GPU to speed-up the analysis. Before running this notebook, make sure to have a proper NVIDIA GPU driver (https://www.nvidia.com/Download/index.aspx) as well as the CUDA toolkit (https://developer.nvidia.com/cuda-toolkit) installed.

Then, make sure to create an environment with Pytorch >= v1.8.0 following these instructions to enable CUDA.

https://pytorch.org/get-started/locally/

Once you have everything installed, run the next blocks two blocks of code before everything.

If you are using a NVIDIA GPU, with PyTorch and CUDA, set the following variable to be True

use_gpu = True, otherwise set it use_gpu = False

In [1]:

Copied!

use_gpu = True

import tensorly as tl
if use_gpu:
    tl.set_backend('pytorch')
use_gpu = True

import tensorly as tl
if use_gpu:
    tl.set_backend('pytorch')

In [2]:

Copied!





import cell2cell as c2c
import scanpy as sc

import numpy as np
import pandas as pd

from tqdm.auto import tqdm

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import cell2cell as c2c
import scanpy as sc

import numpy as np
import pandas as pd

from tqdm.auto import tqdm

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:

Copied!

import warnings
warnings.filterwarnings('ignore')
import warnings
warnings.filterwarnings('ignore')

1 - Load Data¶

In [4]:

Copied!





import os

data_folder = './data/'
directory = os.fsencode(data_folder)

output_folder = './data/Tensor-cell2cell/'
if not os.path.isdir(output_folder):
    os.mkdir(output_folder)
import os

data_folder = './data/'
directory = os.fsencode(data_folder)

output_folder = './data/Tensor-cell2cell/'
if not os.path.isdir(output_folder):
    os.mkdir(output_folder)

RNA-seq data¶

A preprocessed version of the original paper (Trujillo et al, 2019) can be found here

In [5]:

Copied!

rnaseq = sc.read_h5ad(data_folder + '/annotated_seurat_norm_harmony_2022.h5ad')
rnaseq = sc.read_h5ad(data_folder + '/annotated_seurat_norm_harmony_2022.h5ad')

In [6]:

Copied!

rnaseq = rnaseq.raw.to_adata()
rnaseq = rnaseq.raw.to_adata()

In [7]:

Copied!

rnaseq.obs.head(2)
rnaseq.obs.head(2)

Out[7]:

	orig.ident	nCount_RNA	nFeature_RNA	percent.mt	RNA_snn_res.0.5	seurat_clusters	celltype	old.ident
Month-01_AAACCTGAGAATTGTG-1	Month-01	6417.0	2431	2.898551	11	11	Other	11
Month-01_AAACCTGAGCAACGGT-1	Month-01	8522.0	2464	2.663694	6	6	Progenitor	6

Ligand-Receptor pairs¶

Different databases of ligand-receptor interactions could be used. We previously created a repository that includes many available DBs (https://github.com/LewisLabUCSD/Ligand-Receptor-Pairs). In this tutorial, we employ the ligand-receptor pairs from CellChat (https://doi.org/10.1038/s41467-021-21246-9), which includes multimeric protein complexes.

In [8]:

Copied!

lr_pairs = pd.read_csv('https://raw.githubusercontent.com/LewisLabUCSD/Ligand-Receptor-Pairs/master/Human/Human-2020-Jin-LR-pairs.csv')
lr_pairs = lr_pairs.astype(str)
lr_pairs = pd.read_csv('https://raw.githubusercontent.com/LewisLabUCSD/Ligand-Receptor-Pairs/master/Human/Human-2020-Jin-LR-pairs.csv')
lr_pairs = lr_pairs.astype(str)

In [9]:

Copied!

lr_pairs.head(2)
lr_pairs.head(2)

Out[9]:

	interaction_name	pathway_name	ligand	receptor	agonist	antagonist	co_A_receptor	co_I_receptor	evidence	annotation	interaction_name_2	ligand_symbol	receptor_symbol	ligand_ensembl	receptor_ensembl	interaction_symbol	interaction_ensembl
0	TGFB1_TGFBR1_TGFBR2	TGFb	TGFB1	TGFbR1_R2	TGFb agonist	TGFb antagonist	nan	TGFb inhibition receptor	KEGG: hsa04350	Secreted Signaling	TGFB1 - (TGFBR1+TGFBR2)	TGFB1	TGFBR1&TGFBR2	ENSG00000105329	ENSG00000106799&ENSG00000163513	TGFB1^TGFBR1&TGFBR2	ENSG00000105329^ENSG00000106799&ENSG00000163513
1	TGFB2_TGFBR1_TGFBR2	TGFb	TGFB2	TGFbR1_R2	TGFb agonist	TGFb antagonist	nan	TGFb inhibition receptor	KEGG: hsa04350	Secreted Signaling	TGFB2 - (TGFBR1+TGFBR2)	TGFB2	TGFBR1&TGFBR2	ENSG00000092969	ENSG00000106799&ENSG00000163513	TGFB2^TGFBR1&TGFBR2	ENSG00000092969^ENSG00000106799&ENSG00000163513

In [10]:

Copied!

# interaction columns:
int_columns = ('ligand_symbol', 'receptor_symbol')
# interaction columns:
int_columns = ('ligand_symbol', 'receptor_symbol')

MEBOCOST ouputs¶

In this tutorial we assume that you already ran MEBOCOST and saved its ouputs. You can find an example on how running it and format outputs to be used here:

Run MEBOCOST for Tensor-cell2cell v2

In [11]:

Copied!





import os

mebocost_folder = data_folder + '/mebocost-results/'
directory = os.fsencode(mebocost_folder )


data = dict()
for file in sorted(os.listdir(directory)):
    filename = os.fsdecode(file)
    if filename.endswith(".csv"): 
        print(filename)
        basename = os.path.basename(filename)
        sample = basename.split('.')[0]
        sample = sample.lstrip('MEBOCOST')[1:]
        data[sample] = pd.read_csv(mebocost_folder  + filename)
        data[sample] = data[sample].loc[data[sample]['Annotation'].str.contains('Receptor')]
    else:
        continue
import os

mebocost_folder = data_folder + '/mebocost-results/'
directory = os.fsencode(mebocost_folder )


data = dict()
for file in sorted(os.listdir(directory)):
    filename = os.fsdecode(file)
    if filename.endswith(".csv"): 
        print(filename)
        basename = os.path.basename(filename)
        sample = basename.split('.')[0]
        sample = sample.lstrip('MEBOCOST')[1:]
        data[sample] = pd.read_csv(mebocost_folder  + filename)
        data[sample] = data[sample].loc[data[sample]['Annotation'].str.contains('Receptor')]
    else:
        continue

MEBOCOST-Month-01.csv
MEBOCOST-Month-03.csv
MEBOCOST-Month-06.csv
MEBOCOST-Month-10.csv

2 - Data Preprocessing¶

RNA-seq¶

Organize data to create tensor

First, define contexts

In [16]:

Copied!

contexts = ['Month-01', 'Month-03', 'Month-06', 'Month-10']
contexts = ['Month-01', 'Month-03', 'Month-06', 'Month-10']

Generate list of RNA-seq data containing all contexts (time points)

Here, we convert the single-cell expression levels into cell-type expression levels. A basic context-wise preprocessing is performed here, keeping only genes that are expressed at least in 4 single cells, and cell types per context with at least 20 cells.

Important: A list of aggregated gene expression matrices must be used for building the 4D-communication tensor. Each of the gene expression matrices is for one of the contexts, following the same order as the previouslist (contexts).

The aggregation here corresponds to mean expression value across cells of each cell type. Other aggregation methods could be used by changing the parameter method='average' in the aggregate_single_cell() function. Additionally, other gene expression values could be passed, using other preprocessing approaches such as log(x+1) or batch correction methods.

In [17]:

Copied!





rnaseq_matrices = []

for context in tqdm(contexts):
    meta_context = rnaseq.obs.loc[(rnaseq.obs['orig.ident'] == context) \
                                  & (rnaseq.obs['celltype'] != 'Other') & (rnaseq.obs['celltype'] != 'MC')]
    
    # Remove celltype per context that has few single cells
    min_sc_number = 20
    excluded_sc = []
    for idx, row in (meta_context.groupby(['celltype'])[['celltype']].count() >= min_sc_number).iterrows():
        if ~row['celltype']:
            excluded_sc.append(idx)
        
    meta_context = meta_context.loc[~meta_context['celltype'].isin(excluded_sc)]
    cells = list(meta_context.index)

    meta_context.index.name = 'barcode'
    tmp_data = rnaseq[cells]
    # Keep genes in each sample with at least 4 single cells expressing it
    sc.pp.filter_genes(tmp_data, min_cells=4)
    
    # Normalize and Log1p
    sc.pp.normalize_total(tmp_data, target_sum=1e6)
    sc.pp.log1p(tmp_data)

    # Aggregate gene expression of single cells into cell types
    exp_df = c2c.preprocessing.aggregate_single_cells(rnaseq_data=tmp_data.to_df(),
                                                      metadata=meta_context,
                                                      barcode_col='barcode',
                                                      celltype_col='celltype',
                                                      method='average',
                                                     )

    rnaseq_matrices.append(exp_df)
rnaseq_matrices = []

for context in tqdm(contexts):
    meta_context = rnaseq.obs.loc[(rnaseq.obs['orig.ident'] == context) \
                                  & (rnaseq.obs['celltype'] != 'Other') & (rnaseq.obs['celltype'] != 'MC')]
    
    # Remove celltype per context that has few single cells
    min_sc_number = 20
    excluded_sc = []
    for idx, row in (meta_context.groupby(['celltype'])[['celltype']].count() >= min_sc_number).iterrows():
        if ~row['celltype']:
            excluded_sc.append(idx)
        
    meta_context = meta_context.loc[~meta_context['celltype'].isin(excluded_sc)]
    cells = list(meta_context.index)

    meta_context.index.name = 'barcode'
    tmp_data = rnaseq[cells]
    # Keep genes in each sample with at least 4 single cells expressing it
    sc.pp.filter_genes(tmp_data, min_cells=4)
    
    # Normalize and Log1p
    sc.pp.normalize_total(tmp_data, target_sum=1e6)
    sc.pp.log1p(tmp_data)

    # Aggregate gene expression of single cells into cell types
    exp_df = c2c.preprocessing.aggregate_single_cells(rnaseq_data=tmp_data.to_df(),
                                                      metadata=meta_context,
                                                      barcode_col='barcode',
                                                      celltype_col='celltype',
                                                      method='average',
                                                     )

    rnaseq_matrices.append(exp_df)

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=4.0), HTML(value='')))

rnaseq_matrices is the list we will use for building the tensor based on protein-based ligand-receptor pairs.

LR pairs¶

Remove bidirectionality in the list of protein-based ligand-receptor pairs. That is, remove repeated interactions where both interactions are the same but in different order:

From this list:

Ligand	Receptor
Protein A	Protein B
Protein B	Protein A

We will have:

Ligand	Receptor
Protein A	Protein B

In [13]:

Copied!

lr_pairs = c2c.preprocessing.ppi.remove_ppi_bidirectionality(ppi_data=lr_pairs, 
                                                             interaction_columns=int_columns
                                                             )
lr_pairs = c2c.preprocessing.ppi.remove_ppi_bidirectionality(ppi_data=lr_pairs, 
                                                             interaction_columns=int_columns
                                                             )

Removing bidirectionality of PPI network

In [14]:

Copied!

lr_pairs.shape
lr_pairs.shape

Out[14]:

(1988, 17)

Generate a dictionary with function info for each LR pairs.

Keys are LIGAND_NAME^RECEPTOR_NAME (this is the same nomenclature that will be used when building the protein-based 4D-communication tensor later), and values are the function in the annotation column in the dataframe containing ligand-receptor pairs. Other functional annotations of the LR pairs could be used if available.

In [15]:

Copied!

ppi_functions = dict()

for idx, row in lr_pairs.iterrows():
    ppi_functions[row['interaction_symbol']] = row['annotation']
ppi_functions = dict()

for idx, row in lr_pairs.iterrows():
    ppi_functions[row['interaction_symbol']] = row['annotation']

Similarly, we can generate the functions for the metabolite-based ligand-receptor pairs. For that, we open the MEBOCOST database:

In [23]:

Copied!





met_sensor = pd.read_csv('https://raw.githubusercontent.com/kaifuchenlab/MEBOCOST/refs/heads/main/data/mebocost_db/human/human_met_sensor_update_May21_2025.tsv', 
                         sep='\t',
                         index_col=0
                        )
met_sensor = pd.read_csv('https://raw.githubusercontent.com/kaifuchenlab/MEBOCOST/refs/heads/main/data/mebocost_db/human/human_met_sensor_update_May21_2025.tsv', 
                         sep='\t',
                         index_col=0
                        )

In [24]:

Copied!

met_sensor.head(2)
met_sensor.head(2)

Out[24]:

	HMDB_ID	standard_metName	metName	Gene_name	Protein_name	Evidence	Annotation
ID
1	HMDB0006247	25-Hydroxycholesterol	25-Hydroxycholesterol	ABCA1	Phospholipid-transporting ATPase ABCA1	16611739	Transporter
2	HMDB0000517	L-Arginine	Arginine; L-Arginine	SLC7A1	High affinity cationic amino acid transporter 1	33209975; 17325243; 33639836; 23831616; 221449...	Transporter

Then, generate the LIGAND_NAME^RECEPTOR_NAME nomenclature that will be used to build the tensor later.

In [25]:

Copied!

met_sensor['interaction'] = met_sensor.apply(lambda row: row['standard_metName'] + '^' + row['Gene_name'], axis=1)
met_sensor['interaction'] = met_sensor.apply(lambda row: row['standard_metName'] + '^' + row['Gene_name'], axis=1)

We get the annotation of each ligand-receptor interaction. Again, this could be a different information/annotation if available.

In [26]:

Copied!

met_mapper = met_sensor.set_index('interaction').to_dict()['Annotation']
met_mapper = met_sensor.set_index('interaction').to_dict()['Annotation']

And add it to our ppi_function dictionary

In [27]:

Copied!

for m in tensor_met.order_names[1]:
    ppi_functions[m] = 'Metabolite ' + met_mapper[m]
for m in tensor_met.order_names[1]:
    ppi_functions[m] = 'Metabolite ' + met_mapper[m]

3 - Build Tensors¶

Build 4D-Communication Tensor¶

Here we use as input the list of gene expression matrices that were aggregated into a cell-type granularity. This list contains the expression matrices of all samples/contexts.

The following functions perform all the steps to build a 4D-communication tensor:

4dtensor

The parameter communication_score indicates the scoring function to use.

how='outer_cells' considers only LR pairs that are present in all contexts, while all cell types in the whole dataset are kept, regardless they appear only in one or a few contexts.

complex_sep='&' is used to specify that the list of ligand-receptor pairs contains protein complexes and that subunits are separated by '&'. If the list does not have complexes, use complex_sep=None instead.

In [18]:

Copied!





tensor_lr = c2c.tensor.InteractionTensor(rnaseq_matrices=rnaseq_matrices,
                                         ppi_data=lr_pairs,
                                         context_names=contexts,
                                         how='outer_cells',
                                         complex_sep='&',
                                         interaction_columns=int_columns,
                                         communication_score='expression_gmean',
                                         order_labels=['Contexts', 'Protein LRIs', 'Sender Cells', 'Receiver Cells'],
                                         device='cuda' if use_gpu else 'cpu'
                                        )
tensor_lr = c2c.tensor.InteractionTensor(rnaseq_matrices=rnaseq_matrices,
                                         ppi_data=lr_pairs,
                                         context_names=contexts,
                                         how='outer_cells',
                                         complex_sep='&',
                                         interaction_columns=int_columns,
                                         communication_score='expression_gmean',
                                         order_labels=['Contexts', 'Protein LRIs', 'Sender Cells', 'Receiver Cells'],
                                         device='cuda' if use_gpu else 'cpu'
                                        )

Getting expression values for protein complexes
Building tensor for the provided context

In [19]:

Copied!

tensor_lr.tensor.shape
tensor_lr.tensor.shape

Out[19]:

torch.Size([4, 422, 5, 5])

By using our MEBOCOST outputs directly, we can generate our metabolite-based 4D communication tensor:

In [20]:

Copied!





tensor_met = c2c.tensor.dataframes_to_tensor(context_df_dict=data,
                                             sender_col='Sender',
                                             receiver_col='Receiver',
                                             ligand_col='Metabolite_Name',
                                             receptor_col='Sensor',
                                             score_col='Commu_Score',
                                             order_labels=['Contexts', 'Metabolite LRIs', 'Sender Cells', 'Receiver Cells'],
                                             how='outer_cells', # Multiple options implemented now
                                             lr_sep='^',
                                             context_order=sorted(data.keys()),
                                             sort_elements=True,
                                             device='cuda' if use_gpu else 'cpu'
                                            )
tensor_met = c2c.tensor.dataframes_to_tensor(context_df_dict=data,
                                             sender_col='Sender',
                                             receiver_col='Receiver',
                                             ligand_col='Metabolite_Name',
                                             receptor_col='Sensor',
                                             score_col='Commu_Score',
                                             order_labels=['Contexts', 'Metabolite LRIs', 'Sender Cells', 'Receiver Cells'],
                                             how='outer_cells', # Multiple options implemented now
                                             lr_sep='^',
                                             context_order=sorted(data.keys()),
                                             sort_elements=True,
                                             device='cuda' if use_gpu else 'cpu'
                                            )

100%|██████████| 4/4 [00:00<00:00,  6.45it/s]

In [21]:

Copied!

tensor_met.tensor.shape
tensor_met.tensor.shape

Out[21]:

torch.Size([4, 52, 5, 5])

Modify MEBOCOST scores

Here, we modify the original MEBOCOST scores (which are the product between metabolite production and sensor expression). Given the way that MEBOCOST computes the scores, it allows negative values (when the enzymes consuming the metabolite present higher expression than those producing it), we make these values to be zero, as in our case we are only interested when the metabolite is produced.

Additionally, we use the square root of the final score, to represent the geometric mean between the ligand and the receptor, consistent with our calculations for the protein-based LR pairs above.

In [22]:

Copied!

tensor_met.tensor = tl.sqrt((tensor_met.tensor >= 0) * tensor_met.tensor) # Remove negative cases and convert to gmean
tensor_met.tensor = tl.sqrt((tensor_met.tensor >= 0) * tensor_met.tensor) # Remove negative cases and convert to gmean

Generate a list containing metadata for each tensor order/dimension - Later used for coloring factor plots

This is a list containing metadata for each dimension of the tensor (contexts, LR pairs, sender cells, receiver cells). Each metadata corresponds to a dataframe with info of each element in the respective dimension.

We start with the protein-based tensor.

In [28]:

Copied!





meta_tf = c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor_lr,
                                              metadata_dicts=[None, ppi_functions, None, None],
                                              fill_with_order_elements=True
                                             )
meta_tf = c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor_lr,
                                              metadata_dicts=[None, ppi_functions, None, None],
                                              fill_with_order_elements=True
                                             )

We continue with the metabolite-based tensor.

In [29]:

Copied!





meta_tf2 = c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor_met,
                                              metadata_dicts=[None, ppi_functions, None, None],
                                              fill_with_order_elements=True
                                             )
meta_tf2 = c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor_met,
                                              metadata_dicts=[None, ppi_functions, None, None],
                                              fill_with_order_elements=True
                                             )

Re-order ligand-receptor interactions based on their categories¶

Later, we will be interested in plotting our results and group LR pairs by their annotations, so we sort them in a way that those with similar annotations will be plot together.

First, we sort them in the metadata.

In [30]:

Copied!

meta_tf[1].sort_values(['Category', 'Element'], inplace=True)
meta_tf2[1].sort_values(['Category', 'Element'], inplace=True)
meta_tf[1].sort_values(['Category', 'Element'], inplace=True)
meta_tf2[1].sort_values(['Category', 'Element'], inplace=True)

Then, based on this new order, we sort their locations within the tensors.

In [31]:

Copied!

tensor_lr = c2c.tensor.subset_tensor(tensor_lr, subset_dict={1: meta_tf[1]['Element'].values.tolist()})
tensor_met = c2c.tensor.subset_tensor(tensor_met, subset_dict={1: meta_tf2[1]['Element'].values.tolist()})
tensor_lr = c2c.tensor.subset_tensor(tensor_lr, subset_dict={1: meta_tf[1]['Element'].values.tolist()})
tensor_met = c2c.tensor.subset_tensor(tensor_met, subset_dict={1: meta_tf2[1]['Element'].values.tolist()})

4 - Export tensors and metadata¶

Before exporting the tensors, we move them to the CPU to avoid incompatibilities with machines without an enabled GPU for analysis:

In [ ]:

Copied!

tensor_lr.to_device('cpu')
tensor_met.to_device('cpu')
tensor_lr.to_device('cpu')
tensor_met.to_device('cpu')

Finally, we export both, metadata and tensor for both modalities (protein- and metabolite-based communication).

In [32]:

Copied!

c2c.io.save_data.export_variable_with_pickle(tensor_lr, output_folder + '/Initial-c2c-Tensor.pkl')
c2c.io.save_data.export_variable_with_pickle(tensor_met, output_folder + '/Initial-MEBOCOST-Tensor.pkl')
c2c.io.save_data.export_variable_with_pickle(tensor_lr, output_folder + '/Initial-c2c-Tensor.pkl')
c2c.io.save_data.export_variable_with_pickle(tensor_met, output_folder + '/Initial-MEBOCOST-Tensor.pkl')

/data/Tensor-cell2cell//Initial-c2c-Tensor.pkl  was correctly saved.
/data/Tensor-cell2cell//Initial-MEBOCOST-Tensor.pkl  was correctly saved.

In [33]:

Copied!

c2c.io.save_data.export_variable_with_pickle(meta_tf, output_folder + '/Meta-Initial-c2c-Tensor.pkl')
c2c.io.save_data.export_variable_with_pickle(meta_tf2, output_folder + '/Meta-Initial-MEBOCOST-Tensor.pkl')
c2c.io.save_data.export_variable_with_pickle(meta_tf, output_folder + '/Meta-Initial-c2c-Tensor.pkl')
c2c.io.save_data.export_variable_with_pickle(meta_tf2, output_folder + '/Meta-Initial-MEBOCOST-Tensor.pkl')

/data/Tensor-cell2cell//Meta-Initial-c2c-Tensor.pkl  was correctly saved.
/data/Tensor-cell2cell//Meta-Initial-MEBOCOST-Tensor.pkl  was correctly saved.