Documentation for cell2cell

This documentation is for our cell2cell suite, which includes the regular cell2cell and Tensor-cell2cell tools. The former is for inferring cell-cell interactions and communication in one sample or context, while the latter is for deconvolving complex patterns of cell-cell communication across multiple samples or contexts simultaneously into interpretable factors representing patterns of communication.

Here, multiple classes and functions are implemented to facilitate the analyses, including a variety of visualizations to simplify the interpretation of results:

  • cell2cell.analysis : Includes simplified pipelines for running the analyses, and functions for downstream analyses of Tensor-cell2cell.
  • cell2cell.clustering : Includes multiple scipy-based functions for clustering.
  • cell2cell.core : Includes the core functions for inferring cell-cell interactions and communication. It includes scoring methods, cell classes, and interaction spaces.
  • cell2cell.datasets : Includes toy datasets and annotations for testing functions in basic scenarios.
  • cell2cell.external : Includes built-in approaches borrowed from other tools to avoid incompatibilities (e.g. UMAP, tensorly, and PCoA).
  • cell2cell.io : Includes functions for opening and saving diverse types of files.
  • cell2cell.plotting : Includes all the visualization options that cell2cell offers.
  • cell2cell.preprocessing : Includes functions for manipulating data and variables (e.g. data preprocessing, integration, permutation, among others).
  • cell2cell.spatial : Includes filtering of cell-cell interactions results given intercellular distance, as well as defining neighborhoods by grids or moving windows.
  • cell2cell.stats : Includes statistical analyses such as enrichment analysis, multiple test correction methods, permutation approaches, and Gini coefficient.
  • cell2cell.tensor : Includes all functions pertinent to the analysis of Tensor-cell2cell.
  • cell2cell.utils : Includes general utilities for analyzing networks and performing parallel computing.

Below, all the inputs, parameters (including their different options), and outputs are detailed. Source code of the functions is also included.

analysis

cell2cell_pipelines

BulkInteractions

Interaction class with all necessary methods to run the cell2cell pipeline on a bulk RNA-seq dataset. Cells here could be represented by tissues, samples or any bulk organization of cells.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for a bulk RNA-seq experiment. Columns are samples and rows are genes.

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

metadata : pandas.DataFrame, default=None Metadata associated with the samples in the RNA-seq dataset.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

communication_score : str, default='expression_thresholding' Type of communication score used to infer the potential use of a given ligand-receptor pair by a pair of cells/tissues/samples. Available communication scores are:

- 'expression_thresholding' : Computes the joint presence of a ligand from a
                             sender cell and of a receptor on a receiver cell
                             from binarizing their gene expression levels.
- 'expression_mean' : Computes the average between the expression of a ligand
                      from a sender cell and the expression of a receptor on a
                      receiver cell.
- 'expression_product' : Computes the product between the expression of a
                        ligand from a sender cell and the expression of a
                        receptor on a receiver cell.
- 'expression_gmean' : Computes the geometric mean between the expression
                      of a ligand from a sender cell and the
                      expression of a receptor on a receiver cell.
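
As a rough illustration of the four options above, the following minimal sketch computes each score for a single ligand-receptor pair from two expression values. The helper `communication_score_sketch` and its arguments are hypothetical and not part of cell2cell; the thresholding branch assumes that a gene counts as present when its expression is strictly greater than the threshold.

```python
import math

def communication_score_sketch(ligand_expr, receptor_expr, method, threshold=10):
    """Hypothetical helper (not part of cell2cell) scoring one ligand-receptor pair."""
    if method == 'expression_thresholding':
        # Binarize each partner, then require both to be present.
        ligand_on = 1 if ligand_expr > threshold else 0
        receptor_on = 1 if receptor_expr > threshold else 0
        return ligand_on * receptor_on
    if method == 'expression_mean':
        return (ligand_expr + receptor_expr) / 2
    if method == 'expression_product':
        return ligand_expr * receptor_expr
    if method == 'expression_gmean':
        return math.sqrt(ligand_expr * receptor_expr)
    raise ValueError(f"Unknown communication score: {method}")

# Ligand expressed at 16 in the sender, receptor at 4 in the receiver.
ligand, receptor = 16.0, 4.0
scores = {
    method: communication_score_sketch(ligand, receptor, method)
    for method in ('expression_thresholding', 'expression_mean',
                   'expression_product', 'expression_gmean')
}
print(scores)
```

With the default threshold of 10, the receptor is below the cutoff, so the thresholding score is 0 even though the continuous scores are non-zero.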

cci_score : str, default='bray_curtis' Scoring function to aggregate the communication scores between a pair of cells. It computes an overall potential of cell-cell interactions. Options:

- 'bray_curtis' : Bray-Curtis-like score.
- 'jaccard' : Jaccard-like score.
- 'count' : Number of LR pairs that the pair of cells use.
- 'icellnet' : Sum of the L-R expression product of a pair of cells
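
To make the aggregation step more concrete, here is a minimal sketch of how such scores could be computed from per-pair ligand and receptor values. The helper `cci_score_sketch` is hypothetical, and the Bray-Curtis-like and Jaccard-like formulas below are one common form assumed for illustration; the exact expressions used by cell2cell may differ.

```python
def cci_score_sketch(sender_ligands, receiver_receptors, method):
    """Hypothetical helper (not part of cell2cell) aggregating LR-pair values.

    Both arguments are equal-length lists with one entry per LR pair: the
    ligand value in the sender cell and the receptor value in the receiver.
    """
    products = [lig * rec for lig, rec in zip(sender_ligands, receiver_receptors)]
    shared = sum(products)
    total = (sum(lig ** 2 for lig in sender_ligands)
             + sum(rec ** 2 for rec in receiver_receptors))
    if method == 'count':
        # Number of LR pairs with both partners present.
        return sum(1 for p in products if p != 0)
    if method == 'icellnet':
        # Sum of ligand-receptor products.
        return shared
    if method == 'bray_curtis':
        # Assumed form: 2 * shared / (sender total + receiver total).
        return 2 * shared / total if total else 0.0
    if method == 'jaccard':
        # Assumed form: shared / (union of sender and receiver totals).
        return shared / (total - shared) if total else 0.0
    raise ValueError(f"Unknown CCI score: {method}")

# Binarized values for four LR pairs.
ligands_in_sender = [1, 1, 0, 1]
receptors_in_receiver = [1, 0, 0, 1]
bray_curtis = cci_score_sketch(ligands_in_sender, receptors_in_receiver, 'bray_curtis')
jaccard = cci_score_sketch(ligands_in_sender, receptors_in_receiver, 'jaccard')
count = cci_score_sketch(ligands_in_sender, receptors_in_receiver, 'count')
print(bray_curtis, jaccard, count)
```

In this sketch the Bray-Curtis-like and Jaccard-like scores are normalized by how many ligands and receptors each cell has, whereas 'count' and 'icellnet' grow with the number of active pairs.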

cci_type : str, default='undirected' Specifies whether the cci_score is computed in a directed or undirected way. For a pair of cells A and B, 'directed' means that ligands are considered only from cell A and receptors only from cell B, or vice versa, whereas 'undirected' simultaneously considers signaling from cell A to cell B and from cell B to cell A.
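
The difference between the two modes can be sketched with a toy example that counts active LR pairs. All names below are hypothetical and not part of cell2cell. The sketch also shows why appending the swapped version of every LR pair, as the class does for the undirected case, covers both signaling directions in a single pass.

```python
# Toy LR list and binarized expression for two cells (illustrative values).
lr_pairs = [('L1', 'R1'), ('L2', 'R2')]
cell_a = {'L1': 1, 'R1': 0, 'L2': 1, 'R2': 1}
cell_b = {'L1': 0, 'R1': 1, 'L2': 1, 'R2': 1}

def active_pairs(sender, receiver, pairs):
    """Count pairs whose first partner is on in the sender and second in the receiver."""
    return sum(1 for first, second in pairs if sender[first] and receiver[second])

# Directed: each direction is evaluated on its own.
a_to_b = active_pairs(cell_a, cell_b, lr_pairs)
b_to_a = active_pairs(cell_b, cell_a, lr_pairs)

# Undirected: both directions are considered simultaneously.
undirected = a_to_b + b_to_a

# Equivalent view: make the LR list bidirectional and evaluate it once.
bidirectional_pairs = lr_pairs + [(rec, lig) for lig, rec in lr_pairs]
undirected_via_bidirectional = active_pairs(cell_a, cell_b, bidirectional_pairs)

print(a_to_b, b_to_a, undirected, undirected_via_bidirectional)
```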

sample_col : str, default='sampleID' Column-name for the samples in the metadata.

group_col : str, default='tissue' Column-name for the grouping information associated with the samples in the metadata.

expression_threshold : float, default=10 Threshold value to binarize gene expression when using communication_score='expression_thresholding'. Units have to be the same as the rnaseq_data matrix (e.g., TPMs, counts, etc.).

complex_sep : str, default=None Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

complex_agg_method : str, default='min' Method to aggregate the expression value of multiple genes in a complex.

- 'min' : Minimum expression value among all genes.
- 'mean' : Average expression value among all genes.
- 'gmean' : Geometric mean expression value among all genes.
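
A minimal sketch of how complex_sep and complex_agg_method work together is shown below. The helper `complex_expression_sketch` and the expression values are hypothetical and only illustrate the idea of splitting a complex name into subunits and aggregating their expression.

```python
import math

def complex_expression_sketch(partner, expression, complex_sep='&',
                              complex_agg_method='min'):
    """Hypothetical helper (not part of cell2cell) aggregating subunit expression."""
    # Without a separator, the partner is treated as a single gene.
    subunits = partner.split(complex_sep) if complex_sep else [partner]
    values = [expression[gene] for gene in subunits]
    if complex_agg_method == 'min':
        return min(values)
    if complex_agg_method == 'mean':
        return sum(values) / len(values)
    if complex_agg_method == 'gmean':
        return math.prod(values) ** (1 / len(values))
    raise ValueError(f"Unknown aggregation method: {complex_agg_method}")

# Illustrative expression values for the two subunits of "CD74&CD44".
expression = {'CD74': 8.0, 'CD44': 2.0}
aggregated = {
    method: complex_expression_sketch('CD74&CD44', expression,
                                      complex_agg_method=method)
    for method in ('min', 'mean', 'gmean')
}
print(aggregated)
```

The 'min' option is the most conservative choice here, since the complex is only as abundant as its least expressed subunit.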

verbose : boolean, default=False Whether to print the steps of the analysis.

Attributes

rnaseq_data : pandas.DataFrame Gene expression data for a bulk RNA-seq experiment. Columns are samples and rows are genes.

metadata : pandas.DataFrame Metadata associated with the samples in the RNA-seq dataset.

index_col : str Column-name for the samples in the metadata.

group_col : str Column-name for the grouping information associated with the samples in the metadata.

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

complex_sep : str Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

complex_agg_method : str Method to aggregate the expression value of multiple genes in a complex.

- 'min' : Minimum expression value among all genes.
- 'mean' : Average expression value among all genes.
- 'gmean' : Geometric mean expression value among all genes.

ref_ppi : pandas.DataFrame Reference list of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication. It could be the same as 'ppi_data' if ppi_data is not bidirectional (bidirectional meaning that it contains the ProtA-ProtB interaction as well as the ProtB-ProtA interaction). ref_ppi must be undirected (it contains only ProtA-ProtB and not ProtB-ProtA). It is set to None when cci_type is 'directed'.
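
The idea of an undirected reference list can be sketched as follows. The helper `remove_bidirectionality_sketch` is hypothetical, and keeping the first occurrence of each pair is an assumption made for illustration rather than a statement about how the library chooses which orientation to keep.

```python
def remove_bidirectionality_sketch(pairs):
    """Hypothetical helper (not part of cell2cell).

    Keeps only one orientation of each interaction, so that ProtA-ProtB and
    ProtB-ProtA are not both present. The first occurrence is kept (assumption).
    """
    seen = set()
    unique = []
    for prot_a, prot_b in pairs:
        key = frozenset((prot_a, prot_b))
        if key not in seen:
            seen.add(key)
            unique.append((prot_a, prot_b))
    return unique

bidirectional = [('ProtA', 'ProtB'), ('ProtB', 'ProtA'), ('ProtC', 'ProtD')]
reference = remove_bidirectionality_sketch(bidirectional)
print(reference)
```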

interaction_columns : tuple Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

analysis_setup : dict Contains main setup for running the cell-cell interactions and communication analyses. Three main setups are needed (passed as keys):

- 'communication_score' : is the type of communication score used to detect
    active ligand-receptor pairs between each pair of cells.
    It can be:

    - 'expression_thresholding'
    - 'expression_product'
    - 'expression_mean'
    - 'expression_gmean'

- 'cci_score' : is the scoring function to aggregate the communication
    scores.
    It can be:

    - 'bray_curtis'
    - 'jaccard'
    - 'count'
    - 'icellnet'

- 'cci_type' : is the type of interaction between two cells. If it is
    undirected, ligands and receptors from both cells are considered
    together. If it is directed, ligands from the first cell and receptors
    from the second are considered separately from ligands of the second
    cell and receptors of the first one.
    So, it can be:

    - 'undirected'
    - 'directed'

cutoff_setup : dict Contains two keys: 'type' and 'parameter'. The first key represents the way a cutoff or threshold is applied, while 'parameter' is the value used to binarize the expression values. The key 'type' can be:

    - 'local_percentile' : computes the value of a given percentile, for each
        gene independently. In this case, the parameter corresponds to the
        percentile to compute, as a float value between 0 and 1.
    - 'global_percentile' : computes the value of a given percentile from all
        genes and samples simultaneously. In this case, the parameter
        corresponds to the percentile to compute, as a float value between
        0 and 1. All genes have the same cutoff.
    - 'file' : loads a cutoff table from a file. Parameter in this case is the
        path of that file. It must contain the same genes as index and same
        samples as columns.
    - 'multi_col_matrix' : a dataframe must be provided, containing a cutoff
        for each gene in each sample. This allows using specific cutoffs for
        each sample. The columns here must be the same as the ones in the
        rnaseq_data.
    - 'single_col_matrix' : a dataframe must be provided, containing a cutoff
        for each gene in only one column. These cutoffs will be applied to
        all samples.
    - 'constant_value' : binarizes the expression. Evaluates whether
        expression is greater than the value input in the parameter.
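
The following sketch contrasts three of the cutoff types above on a toy expression table. The helpers `quantile`, `cutoffs_sketch` and `binarize` are hypothetical and not part of cell2cell, and the linear-interpolation percentile is an assumption; the library may compute percentiles differently.

```python
def quantile(values, q):
    """Linear-interpolation quantile for q between 0 and 1 (assumed method)."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])

def cutoffs_sketch(expression, cutoff_setup):
    """Hypothetical helper returning one cutoff per gene."""
    kind, parameter = cutoff_setup['type'], cutoff_setup['parameter']
    if kind == 'constant_value':
        return {gene: parameter for gene in expression}
    if kind == 'local_percentile':
        # Each gene gets its own cutoff, computed across its samples.
        return {gene: quantile(values, parameter)
                for gene, values in expression.items()}
    if kind == 'global_percentile':
        # One cutoff computed from all genes and samples, shared by every gene.
        pooled = [v for values in expression.values() for v in values]
        cutoff = quantile(pooled, parameter)
        return {gene: cutoff for gene in expression}
    raise ValueError(f"Cutoff type not covered by this sketch: {kind}")

def binarize(expression, cutoffs):
    """Mark a gene as present (1) where expression is greater than its cutoff."""
    return {gene: [1 if v > cutoffs[gene] else 0 for v in values]
            for gene, values in expression.items()}

# Rows are genes, values are expression in three samples (illustrative).
expression = {'GeneA': [0.0, 10.0, 20.0], 'GeneB': [100.0, 200.0, 300.0]}

constant = cutoffs_sketch(expression, {'type': 'constant_value', 'parameter': 10})
local = cutoffs_sketch(expression, {'type': 'local_percentile', 'parameter': 0.5})
global_ = cutoffs_sketch(expression, {'type': 'global_percentile', 'parameter': 0.5})

print(binarize(expression, constant))
print(binarize(expression, local))
print(binarize(expression, global_))
```

Note that BulkInteractions itself always sets the type to 'constant_value' with expression_threshold as the parameter, as its constructor shows.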

interaction_space : cell2cell.core.interaction_space.InteractionSpace Interaction space that contains all the required elements to perform the cell-cell interaction and communication analysis between every pair of cells. After performing the analyses, the results are stored in this object.

Source code in cell2cell/analysis/cell2cell_pipelines.py
class BulkInteractions:
    '''Interaction class with all necessary methods to run the cell2cell pipeline
    on a bulk RNA-seq dataset. Cells here could be represented by tissues, samples
    or any bulk organization of cells.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for a bulk RNA-seq experiment. Columns are samples
        and rows are genes.

    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    metadata : pandas.DataFrame, default=None
        Metadata associated with the samples in the RNA-seq dataset.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    communication_score : str, default='expression_thresholding'
        Type of communication score to infer the potential use of a given ligand-
        receptor pair by a pair of cells/tissues/samples.
        Available communication_scores are:

        - 'expression_thresholding' : Computes the joint presence of a ligand from a
                                     sender cell and of a receptor on a receiver cell
                                     from binarizing their gene expression levels.
        - 'expression_mean' : Computes the average between the expression of a ligand
                              from a sender cell and the expression of a receptor on a
                              receiver cell.
        - 'expression_product' : Computes the product between the expression of a
                                ligand from a sender cell and the expression of a
                                receptor on a receiver cell.
        - 'expression_gmean' : Computes the geometric mean between the expression
                              of a ligand from a sender cell and the
                              expression of a receptor on a receiver cell.

    cci_score : str, default='bray_curtis'
        Scoring function to aggregate the communication scores between a pair of
        cells. It computes an overall potential of cell-cell interactions.
        Options:

        - 'bray_curtis' : Bray-Curtis-like score.
        - 'jaccard' : Jaccard-like score.
        - 'count' : Number of LR pairs that the pair of cells use.
        - 'icellnet' : Sum of the L-R expression product of a pair of cells

    cci_type : str, default='undirected'
        Specifies whether the cci_score is computed in a directed or undirected
        way. For a pair of cells A and B, directed means that the ligands are
        considered only from cell A and receptors only from cell B, or vice versa,
        whereas undirected simultaneously considers signaling from cell A to
        cell B and from cell B to cell A.

    sample_col : str, default='sampleID'
        Column-name for the samples in the metadata.

    group_col : str, default='tissue'
        Column-name for the grouping information associated with the samples
        in the metadata.

    expression_threshold : float, default=10
        Threshold value to binarize gene expression when using
        communication_score='expression_thresholding'. Units have to be the
        same as the rnaseq_data matrix (e.g., TPMs, counts, etc.).

    complex_sep : str, default=None
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    complex_agg_method : str, default='min'
        Method to aggregate the expression value of multiple genes in a
        complex.

        - 'min' : Minimum expression value among all genes.
        - 'mean' : Average expression value among all genes.
        - 'gmean' : Geometric mean expression value among all genes.

    verbose : boolean, default=False
        Whether to print the steps of the analysis.

    Attributes
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for a bulk RNA-seq experiment. Columns are samples
        and rows are genes.

    metadata : pandas.DataFrame
        Metadata associated with the samples in the RNA-seq dataset.

    index_col : str
        Column-name for the samples in the metadata.

    group_col : str
        Column-name for the grouping information associated with the samples
        in the metadata.

    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used for
        inferring the cell-cell interactions and communication.

    complex_sep : str
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    complex_agg_method : str
        Method to aggregate the expression value of multiple genes in a
        complex.

        - 'min' : Minimum expression value among all genes.
        - 'mean' : Average expression value among all genes.
        - 'gmean' : Geometric mean expression value among all genes.

    ref_ppi : pandas.DataFrame
        Reference list of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication. It could be the
        same as 'ppi_data' if ppi_data is not bidirectional (bidirectional meaning
        that it contains the ProtA-ProtB interaction as well as the ProtB-ProtA
        interaction). ref_ppi must be undirected (it contains only ProtA-ProtB and
        not ProtB-ProtA). It is set to None when cci_type is 'directed'.

    interaction_columns : tuple
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    analysis_setup : dict
        Contains main setup for running the cell-cell interactions and communication
        analyses.
        Three main setups are needed (passed as keys):

        - 'communication_score' : is the type of communication score used to detect
            active ligand-receptor pairs between each pair of cells.
            It can be:

            - 'expression_thresholding'
            - 'expression_product'
            - 'expression_mean'
            - 'expression_gmean'

        - 'cci_score' : is the scoring function to aggregate the communication
            scores.
            It can be:

            - 'bray_curtis'
            - 'jaccard'
            - 'count'
            - 'icellnet'

        - 'cci_type' : is the type of interaction between two cells. If it is
            undirected, ligands and receptors from both cells are considered
            together. If it is directed, ligands from the first cell and receptors
            from the second are considered separately from ligands of the second
            cell and receptors of the first one.
            So, it can be:

            - 'undirected'
            - 'directed'

    cutoff_setup : dict
        Contains two keys: 'type' and 'parameter'. The first key represents the
        way a cutoff or threshold is applied, while 'parameter' is the value used
        to binarize the expression values.
        The key 'type' can be:

            - 'local_percentile' : computes the value of a given percentile, for each
                gene independently. In this case, the parameter corresponds to the
                percentile to compute, as a float value between 0 and 1.
            - 'global_percentile' : computes the value of a given percentile from all
                genes and samples simultaneously. In this case, the parameter
                corresponds to the percentile to compute, as a float value between
                0 and 1. All genes have the same cutoff.
            - 'file' : loads a cutoff table from a file. Parameter in this case is the
                path of that file. It must contain the same genes as index and same
                samples as columns.
            - 'multi_col_matrix' : a dataframe must be provided, containing a cutoff
                for each gene in each sample. This allows using specific cutoffs for
                each sample. The columns here must be the same as the ones in the
                rnaseq_data.
            - 'single_col_matrix' : a dataframe must be provided, containing a cutoff
                for each gene in only one column. These cutoffs will be applied to
                all samples.
            - 'constant_value' : binarizes the expression. Evaluates whether
                expression is greater than the value input in the parameter.

    interaction_space : cell2cell.core.interaction_space.InteractionSpace
        Interaction space that contains all the required elements to perform the
        cell-cell interaction and communication analysis between every pair of cells.
        After performing the analyses, the results are stored in this object.
    '''
    def __init__(self, rnaseq_data, ppi_data, metadata=None, interaction_columns=('A', 'B'),
                 communication_score='expression_thresholding', cci_score='bray_curtis', cci_type='undirected',
                 sample_col='sampleID', group_col='tissue', expression_threshold=10, complex_sep=None,
                 complex_agg_method='min', verbose=False):
        # Placeholders
        self.rnaseq_data = rnaseq_data
        self.metadata = metadata
        self.index_col = sample_col
        self.group_col = group_col
        self.analysis_setup = dict()
        self.cutoff_setup = dict()
        self.complex_sep = complex_sep
        self.complex_agg_method = complex_agg_method
        self.interaction_columns = interaction_columns

        # Analysis setup
        self.analysis_setup['communication_score'] = communication_score
        self.analysis_setup['cci_score'] = cci_score
        self.analysis_setup['cci_type'] = cci_type
        self.analysis_setup['ccc_type'] = cci_type

        # Initialize PPI
        genes = list(rnaseq_data.index)
        ppi_data_ = ppi.filter_ppi_by_proteins(ppi_data=ppi_data,
                                               proteins=genes,
                                               complex_sep=complex_sep,
                                               upper_letter_comparison=False,
                                               interaction_columns=self.interaction_columns)

        self.ppi_data = ppi.remove_ppi_bidirectionality(ppi_data=ppi_data_,
                                                        interaction_columns=self.interaction_columns,
                                                        verbose=verbose)
        if self.analysis_setup['cci_type'] == 'undirected':
            self.ref_ppi = self.ppi_data.copy()
            self.ppi_data = ppi.bidirectional_ppi_for_cci(ppi_data=self.ppi_data,
                                                          interaction_columns=self.interaction_columns,
                                                          verbose=verbose)
        else:
            self.ref_ppi = None

        # Thresholding
        self.cutoff_setup['type'] = 'constant_value'
        self.cutoff_setup['parameter'] = expression_threshold

        # Interaction Space
        self.interaction_space = initialize_interaction_space(rnaseq_data=self.rnaseq_data,
                                                              ppi_data=self.ppi_data,
                                                              cutoff_setup=self.cutoff_setup,
                                                              analysis_setup=self.analysis_setup,
                                                              complex_sep=complex_sep,
                                                              complex_agg_method=complex_agg_method,
                                                              interaction_columns=self.interaction_columns,
                                                              verbose=verbose)

    def compute_pairwise_cci_scores(self, cci_score=None, use_ppi_score=False, verbose=True):
        '''Computes overall CCI scores for each pair of cells.

        Parameters
        ----------
        cci_score : str, default=None
            Scoring function to aggregate the communication scores between
            a pair of cells. It computes an overall potential of cell-cell
            interactions. If None, it will use the one stored in the
            attribute analysis_setup of this object.
            Options:

            - 'bray_curtis' : Bray-Curtis-like score.
            - 'jaccard' : Jaccard-like score.
            - 'count' : Number of LR pairs that the pair of cells use.
            - 'icellnet' : Sum of the L-R expression product of a pair of cells

        use_ppi_score : boolean, default=False
            Whether to use the weights of LR pairs specified in the ppi_data
            when computing the scores.

        verbose : boolean, default=True
            Whether to print the steps of the analysis.
        '''
        self.interaction_space.compute_pairwise_cci_scores(cci_score=cci_score,
                                                           use_ppi_score=use_ppi_score,
                                                           verbose=verbose)

    def compute_pairwise_communication_scores(self, communication_score=None, use_ppi_score=False, ref_ppi_data=None,
                                              interaction_columns=None, cells=None, cci_type=None, verbose=True):
        '''Computes the communication scores for each LR pair in
        a given pair of sender-receiver cells.

        Parameters
        ----------
        communication_score : str, default=None
            Type of communication score to infer the potential use of
            a given ligand-receptor pair by a pair of cells/tissues/samples.
            If None, the score stored in the attribute analysis_setup
            will be used.
            Available communication_scores are:

            - 'expression_thresholding' : Computes the joint presence of a
                                          ligand from a sender cell and of
                                          a receptor on a receiver cell from
                                          binarizing their gene expression levels.
            - 'expression_mean' : Computes the average between the expression
                                  of a ligand from a sender cell and the
                                  expression of a receptor on a receiver cell.
            - 'expression_product' : Computes the product between the expression
                                    of a ligand from a sender cell and the
                                    expression of a receptor on a receiver cell.
            - 'expression_gmean' : Computes the geometric mean between the expression
                                  of a ligand from a sender cell and the
                                  expression of a receptor on a receiver cell.

        use_ppi_score : boolean, default=False
            Whether to use the weights of LR pairs specified in the ppi_data
            when computing the scores.

        ref_ppi_data : pandas.DataFrame, default=None
            Reference list of protein-protein interactions (or
            ligand-receptor pairs) used for inferring the cell-cell
            interactions and communication. It could be the same as
            'ppi_data' if ppi_data is not bidirectional (bidirectional
            meaning that it contains the ProtA-ProtB interaction as well
            as the ProtB-ProtA interaction). ref_ppi_data must be undirected
            (it contains only ProtA-ProtB and not ProtB-ProtA). If None,
            the one stored in the attribute ref_ppi will be used.

        interaction_columns : tuple, default=None
            Contains the names of the columns where to find the
            partners in a dataframe of protein-protein interactions.
            If the list is for ligand-receptor pairs, the first column
            is for the ligands and the second for the receptors. If
            None, the one stored in the attribute interaction_columns
            will be used.

        cells : list, default=None
            List of cells to consider.

        cci_type : str, default=None
            Type of interaction between two cells. Used to specify
            if we want to consider a LR pair in both directions.
            It can be:

            - 'undirected'
            - 'directed'

            If None, 'directed' will be used.

        verbose : boolean, default=True
            Whether to print the steps of the analysis.
        '''
        if interaction_columns is None:
            interaction_columns = self.interaction_columns # Used only for ref_ppi_data

        if ref_ppi_data is None:
            ref_ppi_data = self.ref_ppi

        if cci_type is None:
            cci_type = 'directed'

        self.analysis_setup['ccc_type'] = cci_type

        self.interaction_space.compute_pairwise_communication_scores(communication_score=communication_score,
                                                                     use_ppi_score=use_ppi_score,
                                                                     ref_ppi_data=ref_ppi_data,
                                                                     interaction_columns=interaction_columns,
                                                                     cells=cells,
                                                                     cci_type=cci_type,
                                                                     verbose=verbose)

    @property
    def interaction_elements(self):
        '''Returns the interaction elements within an interaction space.'''
        if hasattr(self.interaction_space, 'interaction_elements'):
            return self.interaction_space.interaction_elements
        else:
            return None
interaction_elements (read-only property)

Returns the interaction elements within an interaction space.

compute_pairwise_cci_scores(self, cci_score=None, use_ppi_score=False, verbose=True)

Computes overall CCI scores for each pair of cells.

Parameters

cci_score : str, default=None Scoring function to aggregate the communication scores between a pair of cells. It computes an overall potential of cell-cell interactions. If None, it will use the one stored in the attribute analysis_setup of this object. Options:

- 'bray_curtis' : Bray-Curtis-like score.
- 'jaccard' : Jaccard-like score.
- 'count' : Number of LR pairs that the pair of cells use.
- 'icellnet' : Sum of the L-R expression product of a pair of cells

use_ppi_score : boolean, default=False Whether to use the weights of LR pairs specified in the ppi_data when computing the scores.

verbose : boolean, default=True Whether to print the steps of the analysis.

Source code in cell2cell/analysis/cell2cell_pipelines.py
def compute_pairwise_cci_scores(self, cci_score=None, use_ppi_score=False, verbose=True):
    '''Computes overall CCI scores for each pair of cells.

    Parameters
    ----------
    cci_score : str, default=None
        Scoring function to aggregate the communication scores between
        a pair of cells. It computes an overall potential of cell-cell
        interactions. If None, it will use the one stored in the
        attribute analysis_setup of this object.
        Options:

        - 'bray_curtis' : Bray-Curtis-like score.
        - 'jaccard' : Jaccard-like score.
        - 'count' : Number of LR pairs that the pair of cells use.
        - 'icellnet' : Sum of the L-R expression product of a pair of cells

    use_ppi_score : boolean, default=False
        Whether to use the weights of LR pairs specified in the ppi_data
        when computing the scores.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.
    '''
    self.interaction_space.compute_pairwise_cci_scores(cci_score=cci_score,
                                                       use_ppi_score=use_ppi_score,
                                                       verbose=verbose)
compute_pairwise_communication_scores(self, communication_score=None, use_ppi_score=False, ref_ppi_data=None, interaction_columns=None, cells=None, cci_type=None, verbose=True)

Computes the communication scores for each LR pair in a given pair of sender-receiver cells.

Parameters

communication_score : str, default=None Type of communication score to infer the potential use of a given ligand-receptor pair by a pair of cells/tissues/samples. If None, the score stored in the attribute analysis_setup will be used. Available communication_scores are:

- 'expression_thresholding' : Computes the joint presence of a
                              ligand from a sender cell and of
                              a receptor on a receiver cell from
                              binarizing their gene expression levels.
- 'expression_mean' : Computes the average between the expression
                      of a ligand from a sender cell and the
                      expression of a receptor on a receiver cell.
- 'expression_product' : Computes the product between the expression
                        of a ligand from a sender cell and the
                        expression of a receptor on a receiver cell.
- 'expression_gmean' : Computes the geometric mean between the expression
                      of a ligand from a sender cell and the
                      expression of a receptor on a receiver cell.
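
The four options above can be illustrated with a minimal, standard-library sketch for a single ligand-receptor pair. This is not the cell2cell implementation: the function name `communication_score`, its `threshold` argument, and the use of a strict "greater than" comparison for binarization are assumptions made for illustration.

```python
import math

def communication_score(ligand, receptor, method="expression_product", threshold=0.0):
    """Score one LR pair given the ligand expression in the sender cell
    and the receptor expression in the receiver cell (illustrative only)."""
    if method == "expression_thresholding":
        # Binarize each value, then require that both partners are present.
        return int(ligand > threshold) * int(receptor > threshold)
    if method == "expression_mean":
        return (ligand + receptor) / 2
    if method == "expression_product":
        return ligand * receptor
    if method == "expression_gmean":
        # Geometric mean of two non-negative values.
        return math.sqrt(ligand * receptor)
    raise ValueError(f"Unknown communication score: {method}")

ligand_in_sender, receptor_in_receiver = 4.0, 1.0
scores = {
    method: communication_score(ligand_in_sender, receptor_in_receiver,
                                method, threshold=2.0)
    for method in ("expression_thresholding", "expression_mean",
                   "expression_product", "expression_gmean")
}
print(scores)
```

With a threshold of 2.0 the receptor is considered absent, so the thresholding score is 0 even though the other three scores are positive.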

use_ppi_score : boolean, default=False Whether to use the weights of LR pairs specified in ppi_data when computing the scores.

ref_ppi_data : pandas.DataFrame, default=None Reference list of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication. It can be the same as 'ppi_data' if ppi_data is not bidirectional (a bidirectional list contains the ProtA-ProtB interaction as well as the ProtB-ProtA interaction). ref_ppi_data must be undirected (it contains only ProtA-ProtB and not ProtB-ProtA). If None, the one stored in the attribute ref_ppi will be used.
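
The difference between an undirected reference list and a bidirectional list can be sketched on plain tuples. The helpers below (`drop_reversed_duplicates` and `add_reversed_pairs`) are hypothetical stand-ins written for this illustration; they are not the library functions and work on lists of tuples rather than DataFrames.

```python
def drop_reversed_duplicates(pairs):
    """Keep a single orientation per interaction (undirected reference list)."""
    seen = set()
    undirected = []
    for a, b in pairs:
        if (a, b) in seen or (b, a) in seen:
            continue  # this interaction is already represented
        seen.add((a, b))
        undirected.append((a, b))
    return undirected

def add_reversed_pairs(pairs):
    """Append B-A for every A-B that lacks it (bidirectional list)."""
    existing = set(pairs)
    result = list(pairs)
    for a, b in pairs:
        if (b, a) not in existing:
            existing.add((b, a))
            result.append((b, a))
    return result

ppi = [("L1", "R1"), ("R1", "L1"), ("L2", "R2")]
ref_ppi = drop_reversed_duplicates(ppi)
bidirectional_ppi = add_reversed_pairs(ref_ppi)
print(ref_ppi)
print(bidirectional_ppi)
```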

interaction_columns : tuple, default=None Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors. If None, the one stored in the attribute interaction_columns will be used.

cells : list, default=None List of cells to consider.

cci_type : str, default=None Type of interaction between two cells. Used to specify if we want to consider a LR pair in both directions. It can be:

- 'undirected'
- 'directed'

If None, 'directed' will be used.

verbose : boolean, default=True Whether to print the steps of the analysis.

Source code in cell2cell/analysis/cell2cell_pipelines.py
def compute_pairwise_communication_scores(self, communication_score=None, use_ppi_score=False, ref_ppi_data=None,
                                          interaction_columns=None, cells=None, cci_type=None, verbose=True):
    '''Computes the communication scores for each LR pair in
    a given pair of sender-receiver cells

    Parameters
    ----------
    communication_score : str, default=None
        Type of communication score to infer the potential use of
        a given ligand-receptor pair by a pair of cells/tissues/samples.
        If None, the score stored in the attribute analysis_setup
        will be used.
        Available communication_scores are:

        - 'expression_thresholding' : Computes the joint presence of a
                                      ligand from a sender cell and of
                                      a receptor on a receiver cell from
                                      binarizing their gene expression levels.
        - 'expression_mean' : Computes the average between the expression
                              of a ligand from a sender cell and the
                              expression of a receptor on a receiver cell.
        - 'expression_product' : Computes the product between the expression
                                of a ligand from a sender cell and the
                                expression of a receptor on a receiver cell.
        - 'expression_gmean' : Computes the geometric mean between the expression
                              of a ligand from a sender cell and the
                              expression of a receptor on a receiver cell.

    use_ppi_score : boolean, default=False
        Whether to use the weights of LR pairs specified in ppi_data
        when computing the scores.

    ref_ppi_data : pandas.DataFrame, default=None
        Reference list of protein-protein interactions (or
        ligand-receptor pairs) used for inferring the cell-cell
        interactions and communication. It could be the same as
        'ppi_data' if ppi_data is not bidirectional (that is,
        contains ProtA-ProtB interaction as well as ProtB-ProtA
        interaction). ref_ppi must be undirected (contains only
        ProtA-ProtB and not ProtB-ProtA interaction). If None
        the one stored in the attribute ref_ppi will be used.

    interaction_columns : tuple, default=None
        Contains the names of the columns where to find the
        partners in a dataframe of protein-protein interactions.
        If the list is for ligand-receptor pairs, the first column
        is for the ligands and the second for the receptors. If
        None, the one stored in the attribute interaction_columns
        will be used.

    cells : list, default=None
        List of cells to consider.

    cci_type : str, default=None
        Type of interaction between two cells. Used to specify
        if we want to consider a LR pair in both directions.
        It can be:

        - 'undirected'
        - 'directed'

        If None, 'directed' will be used.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.
    '''
    if interaction_columns is None:
        interaction_columns = self.interaction_columns # Used only for ref_ppi_data

    if ref_ppi_data is None:
        ref_ppi_data = self.ref_ppi

    if cci_type is None:
        cci_type = 'directed'

    self.analysis_setup['ccc_type'] = cci_type

    self.interaction_space.compute_pairwise_communication_scores(communication_score=communication_score,
                                                                 use_ppi_score=use_ppi_score,
                                                                 ref_ppi_data=ref_ppi_data,
                                                                 interaction_columns=interaction_columns,
                                                                 cells=cells,
                                                                 cci_type=cci_type,
                                                                 verbose=verbose)

SingleCellInteractions

Interaction class with all necessary methods to run the cell2cell pipeline on a single-cell RNA-seq dataset.

Parameters

rnaseq_data : pandas.DataFrame or scanpy.AnnData Gene expression data for a single-cell RNA-seq experiment. If it is a dataframe, columns are single cells and rows are genes, while if it is an AnnData object, columns are genes and rows are single cells.

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

metadata : pandas.DataFrame Metadata containing the cell type of each single cell in the RNA-seq dataset.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

communication_score : str, default='expression_thresholding' Type of communication score to infer the potential use of a given ligand-receptor pair by a pair of cells/tissues/samples. Available communication_scores are:

- 'expression_thresholding' : Computes the joint presence of a ligand from a
                             sender cell and of a receptor on a receiver cell
                             from binarizing their gene expression levels.
- 'expression_mean' : Computes the average between the expression of a ligand
                      from a sender cell and the expression of a receptor on a
                      receiver cell.
- 'expression_product' : Computes the product between the expression of a
                        ligand from a sender cell and the expression of a
                        receptor on a receiver cell.
- 'expression_gmean' : Computes the geometric mean between the expression
                      of a ligand from a sender cell and the
                      expression of a receptor on a receiver cell.

cci_score : str, default='bray_curtis' Scoring function to aggregate the communication scores between a pair of cells. It computes an overall potential of cell-cell interactions. Options:

- 'bray_curtis' : Bray-Curtis-like score.
- 'jaccard' : Jaccard-like score.
- 'count' : Number of LR pairs that the pair of cells use.
- 'icellnet' : Sum of the L-R expression product of a pair of cells
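
Two of the options above are described concretely enough to sketch: 'count' and 'icellnet'. The snippet below is an illustration under assumptions, not the library code. It assumes the sender's ligand values and the receiver's receptor values are aligned by LR pair, and that a pair is "used" when both values exceed a threshold. The Bray-Curtis-like and Jaccard-like scores are not sketched because their formulas are not given here.

```python
def cci_count(ligands, receptors, threshold):
    """Number of LR pairs whose ligand and receptor both exceed the threshold."""
    return sum(int(l > threshold) * int(r > threshold)
               for l, r in zip(ligands, receptors))

def cci_icellnet(ligands, receptors):
    """Sum of ligand x receptor expression products over all LR pairs."""
    return sum(l * r for l, r in zip(ligands, receptors))

# One value per LR pair: ligands from the sender, receptors from the receiver.
sender_ligands = [0.5, 0.0, 2.0]
receiver_receptors = [1.0, 3.0, 0.125]

count_score = cci_count(sender_ligands, receiver_receptors, threshold=0.2)
icellnet_score = cci_icellnet(sender_ligands, receiver_receptors)
print(count_score, icellnet_score)
```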

cci_type : str, default='undirected' Specifies whether to compute the cci_score in a directed or undirected way. For a pair of cells A and B, directed means that the ligands are considered only from cell A and receptors only from cell B, or vice versa, while undirected simultaneously considers signaling from cell A to cell B and from cell B to cell A.

expression_threshold : float, default=0.2 Threshold value to binarize gene expression when using communication_score='expression_thresholding'. Units have to be the same as the aggregated gene expression matrix (e.g., counts, fraction of cells with non-zero counts, etc.).

aggregation_method : str, default='nn_cell_fraction' Specifies the method to use to aggregate gene expression of single cells into their respective cell types. Used to perform the CCI analysis since it is on the cell types rather than single cells. Options are:

- 'nn_cell_fraction' : Among the single cells composing a cell type, it
    calculates the fraction of single cells with non-zero count values
    of a given gene.
- 'average' : Computes the average gene expression among the single cells
    composing a cell type for a given gene.
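
A minimal sketch of the two aggregation methods, using plain dictionaries instead of a DataFrame or AnnData object. The function `aggregate_by_cell_type` and the input layout are assumptions made for this example.

```python
from collections import defaultdict

def aggregate_by_cell_type(expression, cell_types, method="nn_cell_fraction"):
    """expression: {barcode: {gene: value}}; cell_types: {barcode: cell type}.
    Returns {cell type: {gene: aggregated value}}."""
    grouped = defaultdict(list)
    for barcode, profile in expression.items():
        grouped[cell_types[barcode]].append(profile)

    aggregated = {}
    for cell_type, profiles in grouped.items():
        n_cells = len(profiles)
        genes = profiles[0].keys()
        if method == "nn_cell_fraction":
            # Fraction of single cells with a non-zero value for each gene.
            aggregated[cell_type] = {
                g: sum(p[g] != 0 for p in profiles) / n_cells for g in genes
            }
        elif method == "average":
            # Mean expression across the single cells of the cell type.
            aggregated[cell_type] = {
                g: sum(p[g] for p in profiles) / n_cells for g in genes
            }
        else:
            raise ValueError(f"Unknown aggregation method: {method}")
    return aggregated

expression = {
    "c1": {"TGFB1": 0, "CD44": 3},
    "c2": {"TGFB1": 2, "CD44": 1},
    "c3": {"TGFB1": 4, "CD44": 0},
}
cell_types = {"c1": "T", "c2": "T", "c3": "B"}

fractions = aggregate_by_cell_type(expression, cell_types, "nn_cell_fraction")
averages = aggregate_by_cell_type(expression, cell_types, "average")
print(fractions)
print(averages)
```

Note that the two methods produce values in different units (a fraction between 0 and 1 versus an expression level), which is why expression_threshold has to be chosen in the units of the aggregated matrix.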

barcode_col : str, default='barcodes' Column-name for the single cells in the metadata.

celltype_col : str, default='cell_types' Column name in the metadata used for grouping single cells into cell types by the selected aggregation method.

complex_sep : str, default=None Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

complex_agg_method : str, default='min' Method to aggregate the expression value of multiple genes in a complex.

- 'min' : Minimum expression value among all genes.
- 'mean' : Average expression value among all genes.
- 'gmean' : Geometric mean expression value among all genes.
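
As an illustration of how complex_sep and complex_agg_method interact, the sketch below splits a partner name into subunits and aggregates their expression values. The helper `complex_expression` and the dictionary input are hypothetical and chosen only for this example.

```python
import math

def complex_expression(partner, expression, complex_sep="&", method="min"):
    """Aggregate the expression of all subunits of a (possibly multimeric) partner."""
    subunits = partner.split(complex_sep)
    values = [expression[gene] for gene in subunits]
    if method == "min":
        return min(values)
    if method == "mean":
        return sum(values) / len(values)
    if method == "gmean":
        return math.prod(values) ** (1 / len(values))
    raise ValueError(f"Unknown complex aggregation method: {method}")

expression = {"CD74": 8.0, "CD44": 2.0}
complex_min = complex_expression("CD74&CD44", expression, method="min")
complex_mean = complex_expression("CD74&CD44", expression, method="mean")
complex_gmean = complex_expression("CD74&CD44", expression, method="gmean")
print(complex_min, complex_mean, complex_gmean)
```

Using 'min' treats the least expressed subunit as the limiting factor of the complex, whereas 'mean' and 'gmean' let highly expressed subunits partly compensate for it.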

verbose : boolean, default=False Whether to print the steps of the analysis.

Attributes

rnaseq_data : pandas.DataFrame or scanpy.AnnData Gene expression data for a single-cell RNA-seq experiment. If it is a dataframe, columns are single cells and rows are genes, while if it is an AnnData object, columns are genes and rows are single cells.

metadata : pandas.DataFrame Metadata containing the cell type of each single cell in the RNA-seq dataset.

index_col : str Column-name for the single cells in the metadata.

group_col : str Column name in the metadata used for grouping single cells into cell types by the selected aggregation method.

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

complex_sep : str Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

complex_agg_method : str Method to aggregate the expression value of multiple genes in a complex.

- 'min' : Minimum expression value among all genes.
- 'mean' : Average expression value among all genes.
- 'gmean' : Geometric mean expression value among all genes.

ref_ppi : pandas.DataFrame Reference list of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication. It could be the same as 'ppi_data' if ppi_data is not bidirectional (that is, contains ProtA-ProtB interaction as well as ProtB-ProtA interaction). ref_ppi must be undirected (contains only ProtA-ProtB and not ProtB-ProtA interaction).

interaction_columns : tuple Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

analysis_setup : dict Contains main setup for running the cell-cell interactions and communication analyses. Three main setups are needed (passed as keys):

- 'communication_score' : is the type of communication score used to detect
    active ligand-receptor pairs between each pair of cells.
    It can be:

    - 'expression_thresholding'
    - 'expression_product'
    - 'expression_mean'
    - 'expression_gmean'

- 'cci_score' : is the scoring function to aggregate the communication
    scores.
    It can be:

    - 'bray_curtis'
    - 'jaccard'
    - 'count'
    - 'icellnet'

- 'cci_type' : is the type of interaction between two cells. If it is
    undirected, all ligands and receptors are considered from both cells.
    If it is directed, ligands from one cell and receptors from the other
    are considered separately with respect to ligands from the second
    cell and receptor from the first one.
    So, it can be:

    - 'undirected'
    - 'directed'

cutoff_setup : dict Contains two keys: 'type' and 'parameter'. The first key represents the way to use a cutoff or threshold, while 'parameter' is the value used to binarize the expression values. The key 'type' can be:

- 'local_percentile' : computes the value of a given percentile, for each
    gene independently. In this case, the parameter corresponds to the
    percentile to compute, as a float value between 0 and 1.
- 'global_percentile' : computes the value of a given percentile from all
    genes and samples simultaneously. In this case, the parameter
    corresponds to the percentile to compute, as a float value between
    0 and 1. All genes have the same cutoff.
- 'file' : load a cutoff table from a file. Parameter in this case is the
    path of that file. It must contain the same genes as index and same
    samples as columns.
- 'multi_col_matrix' : a dataframe must be provided, containing a cutoff
    for each gene in each sample. This allows to use specific cutoffs for
    each sample. The columns here must be the same as the ones in the
    rnaseq_data.
- 'single_col_matrix' : a dataframe must be provided, containing a cutoff
    for each gene in only one column. These cutoffs will be applied to
    all samples.
- 'constant_value' : binarizes the expression. Evaluates whether
    expression is greater than the value input in the parameter.
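
The percentile-based and constant cutoffs can be sketched as follows. This is an illustration under assumptions rather than the library code: the helpers `percentile`, `compute_cutoffs` and `binarize` are hypothetical, the expression table is a dictionary of gene to per-sample values, and the percentile uses linear interpolation between ranks, which may differ from the convention cell2cell uses.

```python
def percentile(values, q):
    """Percentile for q in [0, 1] using linear interpolation (assumed convention)."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

def compute_cutoffs(expression, cutoff_type, parameter):
    """expression: {gene: [value per sample]}. Returns {gene: cutoff}."""
    if cutoff_type == "local_percentile":
        # One cutoff per gene, computed from that gene's values only.
        return {gene: percentile(vals, parameter) for gene, vals in expression.items()}
    if cutoff_type == "global_percentile":
        # A single cutoff computed from all genes and samples pooled together.
        pooled = [v for vals in expression.values() for v in vals]
        cutoff = percentile(pooled, parameter)
        return {gene: cutoff for gene in expression}
    if cutoff_type == "constant_value":
        return {gene: parameter for gene in expression}
    raise ValueError(f"Cutoff type not covered in this sketch: {cutoff_type}")

def binarize(expression, cutoffs):
    """1 where expression is greater than the gene's cutoff, else 0."""
    return {gene: [int(v > cutoffs[gene]) for v in vals]
            for gene, vals in expression.items()}

expression = {"A": [0.0, 2.0, 4.0], "B": [10.0, 20.0, 30.0]}
local_cutoffs = compute_cutoffs(expression, "local_percentile", 0.5)
global_cutoffs = compute_cutoffs(expression, "global_percentile", 0.5)
constant_binary = binarize(expression, compute_cutoffs(expression, "constant_value", 3.0))
print(local_cutoffs, global_cutoffs, constant_binary)
```

The 'file', 'multi_col_matrix' and 'single_col_matrix' types are omitted because they only change where the per-gene cutoffs come from, not how they are applied.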

interaction_space : cell2cell.core.interaction_space.InteractionSpace Interaction space that contains all the required elements to perform the cell-cell interaction and communication analysis between every pair of cells. After performing the analyses, the results are stored in this object.

aggregation_method : str Specifies the method to use to aggregate gene expression of single cells into their respective cell types. Used to perform the CCI analysis since it is on the cell types rather than single cells. Options are:

- 'nn_cell_fraction' : Among the single cells composing a cell type, it
    calculates the fraction of single cells with non-zero count values
    of a given gene.
- 'average' : Computes the average gene expression among the single cells
    composing a cell type for a given gene.

ccc_permutation_pvalues : pandas.DataFrame Contains the P-values of the permutation analysis on the communication scores.

cci_permutation_pvalues : pandas.DataFrame Contains the P-values of the permutation analysis on the CCI scores.

__adata : boolean Auxiliary variable used for storing whether rnaseq_data is an AnnData object.
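
The permutation P-values stored in the attributes above come from shuffling cell-type labels and recomputing the scores to build a null distribution. The standard-library sketch below illustrates that idea on a toy statistic. It is not the cell2cell implementation: the group mean is a stand-in for recomputing CCI or communication scores, the one-sided empirical P-value with a +1 correction is an assumed convention, and no multiple-test correction is applied. Deriving one seed per permutation as `random_state + i` mirrors the seeding in the source code further down this page.

```python
import random

def permutation_pvalue(values, labels, group, permutations=100, random_state=None):
    """One-sided empirical P-value for the mean of `values` within `group`."""
    def group_mean(current_labels):
        selected = [v for v, lab in zip(values, current_labels) if lab == group]
        return sum(selected) / len(selected)

    observed = group_mean(labels)
    null_scores = []
    for i in range(permutations):
        # A different but reproducible seed for each permutation.
        seed = None if random_state is None else random_state + i
        shuffled = list(labels)
        random.Random(seed).shuffle(shuffled)
        null_scores.append(group_mean(shuffled))

    # Count null scores at least as extreme as the observed one.
    n_extreme = sum(score >= observed for score in null_scores)
    return (n_extreme + 1) / (permutations + 1)

values = [5.0, 6.0, 7.0, 0.0, 0.1, 0.2, 0.0, 0.1]
labels = ["T", "T", "T", "B", "B", "B", "B", "B"]
pvalue = permutation_pvalue(values, labels, group="T", permutations=100, random_state=0)
pvalue_again = permutation_pvalue(values, labels, group="T", permutations=100, random_state=0)
print(pvalue)
```

Shuffling preserves the number of cells per label, so every permuted group has the same size as the observed one.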

Source code in cell2cell/analysis/cell2cell_pipelines.py
class SingleCellInteractions:
    '''Interaction class with all necessary methods to run the cell2cell pipeline
        on a single-cell RNA-seq dataset.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame or scanpy.AnnData
        Gene expression data for a single-cell RNA-seq experiment. If it is a
        dataframe, columns are single cells and rows are genes, while if it is
        an AnnData object, columns are genes and rows are single cells.

    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    metadata : pandas.DataFrame
        Metadata containing the cell type of each single cell in the
        RNA-seq dataset.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    communication_score : str, default='expression_thresholding'
        Type of communication score to infer the potential use of a given
        ligand-receptor pair by a pair of cells/tissues/samples.
        Available communication_scores are:

        - 'expression_thresholding' : Computes the joint presence of a ligand from a
                                     sender cell and of a receptor on a receiver cell
                                     from binarizing their gene expression levels.
        - 'expression_mean' : Computes the average between the expression of a ligand
                              from a sender cell and the expression of a receptor on a
                              receiver cell.
        - 'expression_product' : Computes the product between the expression of a
                                ligand from a sender cell and the expression of a
                                receptor on a receiver cell.
        - 'expression_gmean' : Computes the geometric mean between the expression
                              of a ligand from a sender cell and the
                              expression of a receptor on a receiver cell.

    cci_score : str, default='bray_curtis'
        Scoring function to aggregate the communication scores between a pair of
        cells. It computes an overall potential of cell-cell interactions.
        Options:

        - 'bray_curtis' : Bray-Curtis-like score.
        - 'jaccard' : Jaccard-like score.
        - 'count' : Number of LR pairs that the pair of cells use.
        - 'icellnet' : Sum of the L-R expression product of a pair of cells

    cci_type : str, default='undirected'
        Specifies whether to compute the cci_score in a directed or undirected
        way. For a pair of cells A and B, directed means that the ligands are
        considered only from cell A and receptors only from cell B, or vice versa,
        while undirected simultaneously considers signaling from cell A to
        cell B and from cell B to cell A.

    expression_threshold : float, default=0.2
        Threshold value to binarize gene expression when using
        communication_score='expression_thresholding'. Units have to be the
        same as the aggregated gene expression matrix (e.g., counts, fraction
        of cells with non-zero counts, etc.).

    aggregation_method : str, default='nn_cell_fraction'
        Specifies the method to use to aggregate gene expression of single
        cells into their respective cell types. Used to perform the CCI
        analysis since it is on the cell types rather than single cells.
        Options are:

        - 'nn_cell_fraction' : Among the single cells composing a cell type, it
            calculates the fraction of single cells with non-zero count values
            of a given gene.
        - 'average' : Computes the average gene expression among the single cells
            composing a cell type for a given gene.

    barcode_col : str, default='barcodes'
        Column-name for the single cells in the metadata.

    celltype_col : str, default='cell_types'
        Column name in the metadata used for grouping single cells into cell
        types by the selected aggregation method.

    complex_sep : str, default=None
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    complex_agg_method : str, default='min'
        Method to aggregate the expression value of multiple genes in a
        complex.

        - 'min' : Minimum expression value among all genes.
        - 'mean' : Average expression value among all genes.
        - 'gmean' : Geometric mean expression value among all genes.

    verbose : boolean, default=False
        Whether to print the steps of the analysis.

    Attributes
    ----------
    rnaseq_data : pandas.DataFrame or scanpy.AnnData
        Gene expression data for a single-cell RNA-seq experiment. If it is a
        dataframe, columns are single cells and rows are genes, while if it is
        an AnnData object, columns are genes and rows are single cells.

    metadata : pandas.DataFrame
        Metadata containing the cell type of each single cell in the
        RNA-seq dataset.

    index_col : str
        Column-name for the single cells in the metadata.

    group_col : str
        Column name in the metadata used for grouping single cells into cell
        types by the selected aggregation method.

    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used for
        inferring the cell-cell interactions and communication.

    complex_sep : str
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    complex_agg_method : str
        Method to aggregate the expression value of multiple genes in a
        complex.

        - 'min' : Minimum expression value among all genes.
        - 'mean' : Average expression value among all genes.
        - 'gmean' : Geometric mean expression value among all genes.

    ref_ppi : pandas.DataFrame
        Reference list of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication. It could be the
        same as 'ppi_data' if ppi_data is not bidirectional (that is, contains
        ProtA-ProtB interaction as well as ProtB-ProtA interaction). ref_ppi must
        be undirected (contains only ProtA-ProtB and not ProtB-ProtA interaction).

    interaction_columns : tuple
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    analysis_setup : dict
        Contains main setup for running the cell-cell interactions and communication
        analyses.
        Three main setups are needed (passed as keys):

        - 'communication_score' : is the type of communication score used to detect
            active ligand-receptor pairs between each pair of cells.
            It can be:

            - 'expression_thresholding'
            - 'expression_product'
            - 'expression_mean'
            - 'expression_gmean'

        - 'cci_score' : is the scoring function to aggregate the communication
            scores.
            It can be:

            - 'bray_curtis'
            - 'jaccard'
            - 'count'
            - 'icellnet'

        - 'cci_type' : is the type of interaction between two cells. If it is
            undirected, all ligands and receptors are considered from both cells.
            If it is directed, ligands from one cell and receptors from the other
            are considered separately with respect to ligands from the second
            cell and receptor from the first one.
            So, it can be:

            - 'undirected'
            - 'directed'

    cutoff_setup : dict
        Contains two keys: 'type' and 'parameter'. The first key represents the
        way to use a cutoff or threshold, while 'parameter' is the value used
        to binarize the expression values.
        The key 'type' can be:

        - 'local_percentile' : computes the value of a given percentile, for each
            gene independently. In this case, the parameter corresponds to the
            percentile to compute, as a float value between 0 and 1.
        - 'global_percentile' : computes the value of a given percentile from all
            genes and samples simultaneously. In this case, the parameter
            corresponds to the percentile to compute, as a float value between
            0 and 1. All genes have the same cutoff.
        - 'file' : load a cutoff table from a file. Parameter in this case is the
            path of that file. It must contain the same genes as index and same
            samples as columns.
        - 'multi_col_matrix' : a dataframe must be provided, containing a cutoff
            for each gene in each sample. This allows to use specific cutoffs for
            each sample. The columns here must be the same as the ones in the
            rnaseq_data.
        - 'single_col_matrix' : a dataframe must be provided, containing a cutoff
            for each gene in only one column. These cutoffs will be applied to
            all samples.
        - 'constant_value' : binarizes the expression. Evaluates whether
            expression is greater than the value input in the parameter.

    interaction_space : cell2cell.core.interaction_space.InteractionSpace
        Interaction space that contains all the required elements to perform the
        cell-cell interaction and communication analysis between every pair of cells.
        After performing the analyses, the results are stored in this object.

    aggregation_method : str
        Specifies the method to use to aggregate gene expression of single
        cells into their respective cell types. Used to perform the CCI
        analysis since it is on the cell types rather than single cells.
        Options are:

        - 'nn_cell_fraction' : Among the single cells composing a cell type, it
            calculates the fraction of single cells with non-zero count values
            of a given gene.
        - 'average' : Computes the average gene expression among the single cells
            composing a cell type for a given gene.

    ccc_permutation_pvalues : pandas.DataFrame
        Contains the P-values of the permutation analysis on the
        communication scores.

    cci_permutation_pvalues : pandas.DataFrame
        Contains the P-values of the permutation analysis on the
        CCI scores.

    __adata : boolean
        Auxiliary variable used for storing whether rnaseq_data
        is an AnnData object.
    '''
    compute_pairwise_cci_scores = BulkInteractions.compute_pairwise_cci_scores
    compute_pairwise_communication_scores = BulkInteractions.compute_pairwise_communication_scores
    interaction_elements = BulkInteractions.interaction_elements

    def __init__(self, rnaseq_data, ppi_data, metadata, interaction_columns=('A', 'B'),
                 communication_score='expression_thresholding', cci_score='bray_curtis', cci_type='undirected',
                 expression_threshold=0.20, aggregation_method='nn_cell_fraction', barcode_col='barcodes',
                 celltype_col='cell_types', complex_sep=None, complex_agg_method='min', verbose=False):
        # Placeholders
        self.rnaseq_data = rnaseq_data
        self.metadata = metadata
        self.index_col = barcode_col
        self.group_col = celltype_col
        self.aggregation_method = aggregation_method
        self.analysis_setup = dict()
        self.cutoff_setup = dict()
        self.complex_sep = complex_sep
        self.complex_agg_method = complex_agg_method
        self.interaction_columns = interaction_columns
        self.ccc_permutation_pvalues = None
        self.cci_permutation_pvalues = None

        if isinstance(rnaseq_data, scanpy.AnnData):
            self.__adata = True
            genes = list(rnaseq_data.var.index)
        else:
            self.__adata = False
            genes = list(rnaseq_data.index)

        # Analysis
        self.analysis_setup['communication_score'] = communication_score
        self.analysis_setup['cci_score'] = cci_score
        self.analysis_setup['cci_type'] = cci_type
        self.analysis_setup['ccc_type'] = cci_type

        # Initialize PPI
        ppi_data_ = ppi.filter_ppi_by_proteins(ppi_data=ppi_data,
                                               proteins=genes,
                                               complex_sep=complex_sep,
                                               upper_letter_comparison=False,
                                               interaction_columns=interaction_columns)

        self.ppi_data = ppi.remove_ppi_bidirectionality(ppi_data=ppi_data_,
                                                        interaction_columns=interaction_columns,
                                                        verbose=verbose)

        if self.analysis_setup['cci_type'] == 'undirected':
            self.ref_ppi = self.ppi_data
            self.ppi_data = ppi.bidirectional_ppi_for_cci(ppi_data=self.ppi_data,
                                                          interaction_columns=interaction_columns,
                                                          verbose=verbose)
        else:
            self.ref_ppi = None

        # Thresholding
        self.cutoff_setup['type'] = 'constant_value'
        self.cutoff_setup['parameter'] = expression_threshold


        # Aggregate single-cell RNA-Seq data
        self.aggregated_expression = rnaseq.aggregate_single_cells(rnaseq_data=self.rnaseq_data,
                                                                   metadata=self.metadata,
                                                                   barcode_col=self.index_col,
                                                                   celltype_col=self.group_col,
                                                                   method=self.aggregation_method,
                                                                   transposed=self.__adata)

        # Interaction Space
        self.interaction_space = initialize_interaction_space(rnaseq_data=self.aggregated_expression,
                                                              ppi_data=self.ppi_data,
                                                              cutoff_setup=self.cutoff_setup,
                                                              analysis_setup=self.analysis_setup,
                                                              complex_sep=self.complex_sep,
                                                              complex_agg_method=self.complex_agg_method,
                                                              interaction_columns=self.interaction_columns,
                                                              verbose=verbose)

    def permute_cell_labels(self, permutations=100, evaluation='communication', fdr_correction=True, random_state=None,
                            verbose=False):
        '''Performs permutation analysis of cell-type labels. Detects
        significant CCI or communication scores.

        Parameters
        ----------
        permutations : int, default=100
            Number of permutations. In each of them, a random
            shuffle of cell-type labels is performed, followed by
            computing CCI or communication scores to create a null
            distribution.

        evaluation : str, default='communication'
            Whether to calculate P-values for CCI or communication scores.

            - 'interactions' : For CCI scores.
            - 'communication' : For communication scores.

        fdr_correction : boolean, default=True
            Whether to perform a multiple test correction after
            computing P-values. In this case, it corresponds to an
            FDR or Benjamini-Hochberg correction, using an alpha
            equal to 0.05.

        random_state : int, default=None
            Seed for randomization.

        verbose : boolean, default=False
            Whether printing or not steps of the analysis.
        '''
        if evaluation == 'communication':
            if 'communication_matrix' not in self.interaction_space.interaction_elements.keys():
                raise ValueError('Run the method compute_pairwise_communication_scores() before permutation analysis.')
            score = self.interaction_space.interaction_elements['communication_matrix'].copy()
        elif evaluation == 'interactions':
            if not hasattr(self.interaction_space, 'distance_matrix'):
                raise ValueError('Run the method compute_pairwise_cci_scores() before permutation analysis.')
            score = self.interaction_space.interaction_elements['cci_matrix'].copy()
        else:
            raise ValueError('Not a valid evaluation')

        randomized_scores = []

        analysis_setup = self.analysis_setup.copy()
        ppi_data = self.ppi_data
        if (evaluation == 'communication') & (self.analysis_setup['cci_type'] != self.analysis_setup['ccc_type']):
            analysis_setup['cci_type'] = analysis_setup['ccc_type']
            if self.analysis_setup['cci_type'] == 'directed':
                ppi_data = ppi.bidirectional_ppi_for_cci(ppi_data=self.ppi_data,
                                                         interaction_columns=self.interaction_columns,
                                                         verbose=verbose)
            elif self.analysis_setup['cci_type'] == 'undirected':
                ppi_data = self.ref_ppi

        for i in tqdm(range(permutations), disable=not verbose):
            if random_state is not None:
                seed = random_state + i
            else:
                seed = random_state

            randomized_meta = manipulate_dataframes.shuffle_cols_in_df(df=self.metadata.reset_index(),
                                                                       columns=self.group_col,
                                                                       random_state=seed)

            aggregated_expression = rnaseq.aggregate_single_cells(rnaseq_data=self.rnaseq_data,
                                                                  metadata=randomized_meta,
                                                                  barcode_col=self.index_col,
                                                                  celltype_col=self.group_col,
                                                                  method=self.aggregation_method,
                                                                  transposed=self.__adata)

            interaction_space = initialize_interaction_space(rnaseq_data=aggregated_expression,
                                                             ppi_data=ppi_data,
                                                             cutoff_setup=self.cutoff_setup,
                                                             analysis_setup=analysis_setup,
                                                             complex_sep=self.complex_sep,
                                                             complex_agg_method=self.complex_agg_method,
                                                             interaction_columns=self.interaction_columns,
                                                             verbose=False)

            if evaluation == 'communication':
                interaction_space.compute_pairwise_communication_scores(verbose=False)
                randomized_scores.append(interaction_space.interaction_elements['communication_matrix'].values.flatten())
            elif evaluation == 'interactions':
                interaction_space.compute_pairwise_cci_scores(verbose=False)
                randomized_scores.append(interaction_space.interaction_elements['cci_matrix'].values.flatten())

        randomized_scores = np.array(randomized_scores)
        base_scores = score.values.flatten()
        pvals = np.ones(base_scores.shape)
        n_pvals = len(base_scores)
        randomized_scores = randomized_scores.reshape((-1, n_pvals))
        for i in range(n_pvals):
            dist = randomized_scores[:, i]
            dist = np.append(dist, base_scores[i])
            pvals[i] = permutation.compute_pvalue_from_dist(obs_value=base_scores[i],
                                                            dist=dist,
                                                            consider_size=True,
                                                            comparison='different'
                                                            )
        pval_df = pd.DataFrame(pvals.reshape(score.shape), index=score.index, columns=score.columns)

        if fdr_correction:
            symmetric = manipulate_dataframes.check_symmetry(df=pval_df)
            if symmetric:
                pval_df = multitest.compute_fdrcorrection_symmetric_matrix(X=pval_df,
                                                                           alpha=0.05)
            else:
                pval_df = multitest.compute_fdrcorrection_asymmetric_matrix(X=pval_df,
                                                                            alpha=0.05)

        if evaluation == 'communication':
            self.ccc_permutation_pvalues = pval_df
        elif evaluation == 'interactions':
            self.cci_permutation_pvalues = pval_df
        return pval_df
interaction_elements property readonly

Returns the interaction elements within an interaction space.

compute_pairwise_cci_scores(self, cci_score=None, use_ppi_score=False, verbose=True)

Computes overall CCI scores for each pair of cells.

Parameters

cci_score : str, default=None Scoring function to aggregate the communication scores between a pair of cells. It computes an overall potential of cell-cell interactions. If None, it will use the one stored in the attribute analysis_setup of this object. Options:

- 'bray_curtis' : Bray-Curtis-like score.
- 'jaccard' : Jaccard-like score.
- 'count' : Number of LR pairs that the pair of cells use.
- 'icellnet' : Sum of the L-R expression product of a pair of cells
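
The first three options can be illustrated on binarized data. The following is a minimal sketch, not the library's implementation: the function name `cci_scores` is hypothetical, and the Bray-Curtis-like and Jaccard-like formulas are the standard similarity forms, assumed here for illustration.

```python
def cci_scores(ligands_cell1, receptors_cell2):
    """Aggregate binary LR-pair usage into overall CCI-style scores.

    Both inputs are equal-length lists of 0/1 flags, one entry per LR pair:
    whether cell 1 expresses the ligand and whether cell 2 expresses the
    receptor. The formulas are assumptions made for this illustration.
    """
    # Number of LR pairs for which both partners are present.
    shared = sum(l * r for l, r in zip(ligands_cell1, receptors_cell2))
    # Total number of ligands in cell 1 plus receptors in cell 2.
    total = sum(l * l for l in ligands_cell1) + sum(r * r for r in receptors_cell2)
    if total == 0:
        return {'count': 0, 'bray_curtis': 0.0, 'jaccard': 0.0}
    return {
        'count': shared,
        'bray_curtis': 2 * shared / total,
        'jaccard': shared / (total - shared),
    }


scores = cci_scores([1, 1, 0, 1], [1, 0, 0, 1])
print(scores)
```

Here two of the four LR pairs are usable, so 'count' is 2, while the two similarity-like scores normalize that number by how many ligands and receptors each cell expresses.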

use_ppi_score : boolean, default=False Whether using a weight of LR pairs specified in the ppi_data to compute the scores.

verbose : boolean, default=True Whether printing or not steps of the analysis.

Source code in cell2cell/analysis/cell2cell_pipelines.py
def compute_pairwise_cci_scores(self, cci_score=None, use_ppi_score=False, verbose=True):
    '''Computes overall CCI scores for each pair of cells.

    Parameters
    ----------
    cci_score : str, default=None
        Scoring function to aggregate the communication scores between
        a pair of cells. It computes an overall potential of cell-cell
        interactions. If None, it will use the one stored in the
        attribute analysis_setup of this object.
        Options:

        - 'bray_curtis' : Bray-Curtis-like score.
        - 'jaccard' : Jaccard-like score.
        - 'count' : Number of LR pairs that the pair of cells use.
        - 'icellnet' : Sum of the L-R expression product of a pair of cells

    use_ppi_score : boolean, default=False
        Whether using a weight of LR pairs specified in the ppi_data
        to compute the scores.

    verbose : boolean, default=True
        Whether printing or not steps of the analysis.
    '''
    self.interaction_space.compute_pairwise_cci_scores(cci_score=cci_score,
                                                       use_ppi_score=use_ppi_score,
                                                       verbose=verbose)
compute_pairwise_communication_scores(self, communication_score=None, use_ppi_score=False, ref_ppi_data=None, interaction_columns=None, cells=None, cci_type=None, verbose=True)

Computes the communication scores for each LR pair in a given pair of sender-receiver cells.

Parameters

communication_score : str, default=None Type of communication score to infer the potential use of a given ligand-receptor pair by a pair of cells/tissues/samples. If None, the score stored in the attribute analysis_setup will be used. Available communication_scores are:

- 'expression_thresholding' : Computes the joint presence of a
                              ligand from a sender cell and of
                              a receptor on a receiver cell from
                              binarizing their gene expression levels.
- 'expression_mean' : Computes the average between the expression
                      of a ligand from a sender cell and the
                      expression of a receptor on a receiver cell.
- 'expression_product' : Computes the product between the expression
                        of a ligand from a sender cell and the
                        expression of a receptor on a receiver cell.
- 'expression_gmean' : Computes the geometric mean between the expression
                      of a ligand from a sender cell and the
                      expression of a receptor on a receiver cell.
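
For a single ligand-receptor pair, the four scores described above reduce to simple arithmetic. The sketch below is illustrative only: `communication_score` is a hypothetical helper, and it assumes one shared `cutoff` for both genes, whereas the library allows gene-specific cutoffs.

```python
import math


def communication_score(ligand_expr, receptor_expr, method='expression_product', cutoff=None):
    """Score one LR pair from the ligand expression in the sender cell
    and the receptor expression in the receiver cell."""
    if method == 'expression_thresholding':
        if cutoff is None:
            raise ValueError('expression_thresholding needs a cutoff')
        # Binarize both genes, then require joint presence.
        return int(ligand_expr > cutoff and receptor_expr > cutoff)
    if method == 'expression_mean':
        return (ligand_expr + receptor_expr) / 2
    if method == 'expression_product':
        return ligand_expr * receptor_expr
    if method == 'expression_gmean':
        return math.sqrt(ligand_expr * receptor_expr)
    raise ValueError(f'Unknown method: {method}')


for method in ('expression_mean', 'expression_product', 'expression_gmean'):
    print(method, communication_score(4.0, 9.0, method=method))
print('expression_thresholding',
      communication_score(4.0, 9.0, method='expression_thresholding', cutoff=5.0))
```

With a ligand at 4.0 and a receptor at 9.0, the thresholding score is 0 for a cutoff of 5.0 because the ligand does not pass it, even though the continuous scores are all positive.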

use_ppi_score : boolean, default=False Whether using a weight of LR pairs specified in the ppi_data to compute the scores.

ref_ppi_data : pandas.DataFrame, default=None Reference list of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication. It can be the same as 'ppi_data' as long as ppi_data is not bidirectional; a bidirectional list contains the ProtA-ProtB interaction as well as the ProtB-ProtA interaction. ref_ppi_data must be undirected (it contains only ProtA-ProtB and not ProtB-ProtA). If None, the one stored in the attribute ref_ppi will be used.

interaction_columns : tuple, default=None Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors. If None, the one stored in the attribute interaction_columns will be used.

cells : list, default=None List of cells to consider.

cci_type : str, default=None Type of interaction between two cells. Used to specify if we want to consider a LR pair in both directions. It can be:

- 'undirected'
- 'directed'

If None, 'directed' will be used.

verbose : boolean, default=True Whether printing or not steps of the analysis.

Source code in cell2cell/analysis/cell2cell_pipelines.py
def compute_pairwise_communication_scores(self, communication_score=None, use_ppi_score=False, ref_ppi_data=None,
                                          interaction_columns=None, cells=None, cci_type=None, verbose=True):
    '''Computes the communication scores for each LR pair in
    a given pair of sender-receiver cells.

    Parameters
    ----------
    communication_score : str, default=None
        Type of communication score to infer the potential use of
        a given ligand-receptor pair by a pair of cells/tissues/samples.
        If None, the score stored in the attribute analysis_setup
        will be used.
        Available communication_scores are:

        - 'expression_thresholding' : Computes the joint presence of a
                                      ligand from a sender cell and of
                                      a receptor on a receiver cell from
                                      binarizing their gene expression levels.
        - 'expression_mean' : Computes the average between the expression
                              of a ligand from a sender cell and the
                              expression of a receptor on a receiver cell.
        - 'expression_product' : Computes the product between the expression
                                of a ligand from a sender cell and the
                                expression of a receptor on a receiver cell.
        - 'expression_gmean' : Computes the geometric mean between the expression
                              of a ligand from a sender cell and the
                              expression of a receptor on a receiver cell.

    use_ppi_score : boolean, default=False
        Whether using a weight of LR pairs specified in the ppi_data
        to compute the scores.

    ref_ppi_data : pandas.DataFrame, default=None
        Reference list of protein-protein interactions (or
        ligand-receptor pairs) used for inferring the cell-cell
        interactions and communication. It can be the same as
        'ppi_data' as long as ppi_data is not bidirectional; a
        bidirectional list contains the ProtA-ProtB interaction
        as well as the ProtB-ProtA interaction. ref_ppi_data must
        be undirected (it contains only ProtA-ProtB and not
        ProtB-ProtA). If None, the one stored in the attribute
        ref_ppi will be used.

    interaction_columns : tuple, default=None
        Contains the names of the columns where to find the
        partners in a dataframe of protein-protein interactions.
        If the list is for ligand-receptor pairs, the first column
        is for the ligands and the second for the receptors. If
        None, the one stored in the attribute interaction_columns
        will be used.

    cells : list, default=None
        List of cells to consider.

    cci_type : str, default=None
        Type of interaction between two cells. Used to specify
        if we want to consider a LR pair in both directions.
        It can be:

        - 'undirected'
        - 'directed'

        If None, 'directed' will be used.

    verbose : boolean, default=True
        Whether printing or not steps of the analysis.
    '''
    if interaction_columns is None:
        interaction_columns = self.interaction_columns # Used only for ref_ppi_data

    if ref_ppi_data is None:
        ref_ppi_data = self.ref_ppi

    if cci_type is None:
        cci_type = 'directed'

    self.analysis_setup['ccc_type'] = cci_type

    self.interaction_space.compute_pairwise_communication_scores(communication_score=communication_score,
                                                                 use_ppi_score=use_ppi_score,
                                                                 ref_ppi_data=ref_ppi_data,
                                                                 interaction_columns=interaction_columns,
                                                                 cells=cells,
                                                                 cci_type=cci_type,
                                                                 verbose=verbose)
permute_cell_labels(self, permutations=100, evaluation='communication', fdr_correction=True, random_state=None, verbose=False)

Performs permutation analysis of cell-type labels. Detects significant CCI or communication scores.

Parameters

permutations : int, default=100 Number of permutations. In each one, cell-type labels are randomly shuffled and CCI or communication scores are recomputed to build a null distribution.

evaluation : str, default='communication' Whether calculating P-values for CCI or communication scores.

- 'interactions' : For CCI scores.
- 'communication' : For communication scores.

fdr_correction : boolean, default=True Whether performing a multiple test correction after computing P-values. In this case corresponds to an FDR or Benjamini-Hochberg correction, using an alpha equal to 0.05.
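
The Benjamini-Hochberg adjustment mentioned here can be sketched in a few lines. This is a generic, self-contained version of the procedure on a flat list of P-values; the name `benjamini_hochberg` is hypothetical and it is not the routine the library calls.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values, in the input order."""
    n = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that adjusted
    # values never increase as the raw P-values decrease.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adjusted]
print(adjusted, significant)
```

A symmetric P-value matrix holds every test twice, so one option in that case is to pass only the upper triangle to such a function and mirror the adjusted values afterwards.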

random_state : int, default=None Seed for randomization.

verbose : boolean, default=False Whether printing or not steps of the analysis.
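
The idea behind the permutation test can be sketched with the standard library alone. The example is a simplified stand-in, assuming a toy score (the mean expression of one cell type) in place of CCI or communication scores; `shuffle_labels`, `group_mean` and `empirical_pvalue` are hypothetical helpers, and the two-sided P-value definition used here is one common choice, not necessarily the library's.

```python
import random


def shuffle_labels(labels, seed=None):
    """Return a shuffled copy of the cell-type labels."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def group_mean(values, labels, group):
    """Toy score: mean expression over the cells carrying a given label."""
    selected = [v for v, l in zip(values, labels) if l == group]
    return sum(selected) / len(selected)


def empirical_pvalue(observed, null_scores):
    """Two-sided empirical P-value: fraction of null scores at least as far
    from the null mean as the observed score, counting the observation itself."""
    center = sum(null_scores) / len(null_scores)
    obs_dev = abs(observed - center)
    n_extreme = sum(1 for s in null_scores if abs(s - center) >= obs_dev)
    return (n_extreme + 1) / (len(null_scores) + 1)


expression = [5.0, 6.0, 5.5, 1.0, 0.5, 1.5]
labels = ['T', 'T', 'T', 'B', 'B', 'B']
observed = group_mean(expression, labels, 'T')

random_state = 0
# One seed per permutation, derived from the base seed.
null = [group_mean(expression, shuffle_labels(labels, seed=random_state + i), 'T')
        for i in range(200)]
pval = empirical_pvalue(observed, null)
print(observed, pval)
```

Adding one to both numerator and denominator keeps the P-value above zero, so its smallest possible value shrinks as the number of permutations grows.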

Source code in cell2cell/analysis/cell2cell_pipelines.py
def permute_cell_labels(self, permutations=100, evaluation='communication', fdr_correction=True, random_state=None,
                        verbose=False):
    '''Performs permutation analysis of cell-type labels. Detects
    significant CCI or communication scores.

    Parameters
    ----------
    permutations : int, default=100
        Number of permutations. In each one, cell-type labels are
        randomly shuffled and CCI or communication scores are
        recomputed to build a null distribution.

    evaluation : str, default='communication'
        Whether calculating P-values for CCI or communication scores.

        - 'interactions' : For CCI scores.
        - 'communication' : For communication scores.

    fdr_correction : boolean, default=True
        Whether performing a multiple test correction after
        computing P-values. In this case corresponds to an
        FDR or Benjamini-Hochberg correction, using an alpha
        equal to 0.05.

    random_state : int, default=None
        Seed for randomization.

    verbose : boolean, default=False
        Whether printing or not steps of the analysis.
    '''
    if evaluation == 'communication':
        if 'communication_matrix' not in self.interaction_space.interaction_elements.keys():
            raise ValueError('Run the method compute_pairwise_communication_scores() before permutation analysis.')
        score = self.interaction_space.interaction_elements['communication_matrix'].copy()
    elif evaluation == 'interactions':
        if not hasattr(self.interaction_space, 'distance_matrix'):
            raise ValueError('Run the method compute_pairwise_cci_scores() before permutation analysis.')
        score = self.interaction_space.interaction_elements['cci_matrix'].copy()
    else:
        raise ValueError('Not a valid evaluation')

    randomized_scores = []

    analysis_setup = self.analysis_setup.copy()
    ppi_data = self.ppi_data
    if (evaluation == 'communication') & (self.analysis_setup['cci_type'] != self.analysis_setup['ccc_type']):
        analysis_setup['cci_type'] = analysis_setup['ccc_type']
        if self.analysis_setup['cci_type'] == 'directed':
            ppi_data = ppi.bidirectional_ppi_for_cci(ppi_data=self.ppi_data,
                                                     interaction_columns=self.interaction_columns,
                                                     verbose=verbose)
        elif self.analysis_setup['cci_type'] == 'undirected':
            ppi_data = self.ref_ppi

    for i in tqdm(range(permutations), disable=not verbose):
        if random_state is not None:
            seed = random_state + i
        else:
            seed = random_state

        randomized_meta = manipulate_dataframes.shuffle_cols_in_df(df=self.metadata.reset_index(),
                                                                   columns=self.group_col,
                                                                   random_state=seed)

        aggregated_expression = rnaseq.aggregate_single_cells(rnaseq_data=self.rnaseq_data,
                                                              metadata=randomized_meta,
                                                              barcode_col=self.index_col,
                                                              celltype_col=self.group_col,
                                                              method=self.aggregation_method,
                                                              transposed=self.__adata)

        interaction_space = initialize_interaction_space(rnaseq_data=aggregated_expression,
                                                         ppi_data=ppi_data,
                                                         cutoff_setup=self.cutoff_setup,
                                                         analysis_setup=analysis_setup,
                                                         complex_sep=self.complex_sep,
                                                         complex_agg_method=self.complex_agg_method,
                                                         interaction_columns=self.interaction_columns,
                                                         verbose=False)

        if evaluation == 'communication':
            interaction_space.compute_pairwise_communication_scores(verbose=False)
            randomized_scores.append(interaction_space.interaction_elements['communication_matrix'].values.flatten())
        elif evaluation == 'interactions':
            interaction_space.compute_pairwise_cci_scores(verbose=False)
            randomized_scores.append(interaction_space.interaction_elements['cci_matrix'].values.flatten())

    randomized_scores = np.array(randomized_scores)
    base_scores = score.values.flatten()
    pvals = np.ones(base_scores.shape)
    n_pvals = len(base_scores)
    randomized_scores = randomized_scores.reshape((-1, n_pvals))
    for i in range(n_pvals):
        dist = randomized_scores[:, i]
        dist = np.append(dist, base_scores[i])
        pvals[i] = permutation.compute_pvalue_from_dist(obs_value=base_scores[i],
                                                        dist=dist,
                                                        consider_size=True,
                                                        comparison='different'
                                                        )
    pval_df = pd.DataFrame(pvals.reshape(score.shape), index=score.index, columns=score.columns)

    if fdr_correction:
        symmetric = manipulate_dataframes.check_symmetry(df=pval_df)
        if symmetric:
            pval_df = multitest.compute_fdrcorrection_symmetric_matrix(X=pval_df,
                                                                       alpha=0.05)
        else:
            pval_df = multitest.compute_fdrcorrection_asymmetric_matrix(X=pval_df,
                                                                        alpha=0.05)

    if evaluation == 'communication':
        self.ccc_permutation_pvalues = pval_df
    elif evaluation == 'interactions':
        self.cci_permutation_pvalues = pval_df
    return pval_df

initialize_interaction_space(rnaseq_data, ppi_data, cutoff_setup, analysis_setup, excluded_cells=None, complex_sep=None, complex_agg_method='min', interaction_columns=('A', 'B'), verbose=True)

Initializes an InteractionSpace object to perform the analyses.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for a bulk RNA-seq experiment or a single-cell experiment after aggregation into cell types. Columns are samples and rows are genes.

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

cutoff_setup : dict Contains two keys: 'type' and 'parameter'. The first key represents the way to use a cutoff or threshold, while 'parameter' is the value used to binarize the expression values. The key 'type' can be:

- 'local_percentile' : computes the value of a given percentile, for each
    gene independently. In this case, the parameter corresponds to the
    percentile to compute, as a float value between 0 and 1.
- 'global_percentile' : computes the value of a given percentile from all
    genes and samples simultaneously. In this case, the parameter
    corresponds to the percentile to compute, as a float value between
    0 and 1. All genes have the same cutoff.
- 'file' : load a cutoff table from a file. Parameter in this case is the
    path of that file. It must contain the same genes as index and same
    samples as columns.
- 'multi_col_matrix' : a dataframe must be provided, containing a cutoff
    for each gene in each sample. This allows to use specific cutoffs for
    each sample. The columns here must be the same as the ones in the
    rnaseq_data.
- 'single_col_matrix' : a dataframe must be provided, containing a cutoff
    for each gene in only one column. These cutoffs will be applied to
    all samples.
- 'constant_value' : binarizes the expression. Evaluates whether
    expression is greater than the value input in the parameter.
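
The difference between the two percentile-based types can be seen on a toy table. This is a minimal sketch under stated assumptions: expression is held in nested dictionaries rather than a pandas.DataFrame, `percentile` is a hypothetical helper using linear interpolation, and binarization uses a strict "greater than" comparison as described for 'constant_value'.

```python
def percentile(values, q):
    """Percentile of a list for q between 0 and 1, with linear interpolation."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Toy expression table: genes as outer keys, cells as inner keys.
expression = {
    'GeneA': {'Cell1': 0.0, 'Cell2': 2.0, 'Cell3': 4.0, 'Cell4': 6.0},
    'GeneB': {'Cell1': 10.0, 'Cell2': 10.0, 'Cell3': 30.0, 'Cell4': 50.0},
}
q = 0.5

# 'local_percentile': one cutoff per gene, computed across its cells.
local_cutoffs = {gene: percentile(list(cells.values()), q)
                 for gene, cells in expression.items()}

# 'global_percentile': a single cutoff from all genes and cells together.
all_values = [v for cells in expression.values() for v in cells.values()]
global_cutoff = percentile(all_values, q)

# Binarize with the gene-specific cutoffs.
binarized = {gene: {cell: int(v > local_cutoffs[gene]) for cell, v in cells.items()}
             for gene, cells in expression.items()}

print(local_cutoffs, global_cutoff)
print(binarized)
```

With a global cutoff, a lowly expressed gene such as GeneA in this toy table would never pass the threshold, which is the practical reason to consider gene-specific cutoffs.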

analysis_setup : dict Contains main setup for running the cell-cell interactions and communication analyses. Three main setups are needed (passed as keys):

- 'communication_score' : is the type of communication score used to detect
    active ligand-receptor pairs between each pair of cells.
    It can be:

    - 'expression_thresholding'
    - 'expression_product'
    - 'expression_mean'
    - 'expression_gmean'

- 'cci_score' : is the scoring function to aggregate the communication
    scores.
    It can be:

    - 'bray_curtis'
    - 'jaccard'
    - 'count'
    - 'icellnet'

- 'cci_type' : is the type of interaction between two cells. If it is
    undirected, all ligands and receptors are considered from both cells.
    If it is directed, ligands from one cell and receptors from the other
    are considered separately with respect to ligands from the second
    cell and receptor from the first one.
    So, it can be:

    - 'undirected'
    - 'directed'

excluded_cells : list, default=None List of cells in the rnaseq_data to be excluded. If None, all cells are considered.

complex_sep : str, default=None Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

complex_agg_method : str, default='min' Method to aggregate the expression value of multiple genes in a complex.

- 'min' : Minimum expression value among all genes.
- 'mean' : Average expression value among all genes.
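
How complex_sep and complex_agg_method work together can be sketched as follows. The helper `complex_expression` and the toy values are hypothetical; the sketch only mirrors the behavior described above and is not the library's code.

```python
def complex_expression(partner, expression, complex_sep='&', complex_agg_method='min'):
    """Aggregate the expression of all subunits of a complex, per cell.

    `expression` maps gene -> {cell: value}. A partner without the
    separator is treated as a single-gene complex.
    """
    subunits = partner.split(complex_sep) if complex_sep else [partner]
    cells = expression[subunits[0]].keys()
    aggregated = {}
    for cell in cells:
        values = [expression[gene][cell] for gene in subunits]
        if complex_agg_method == 'min':
            aggregated[cell] = min(values)
        elif complex_agg_method == 'mean':
            aggregated[cell] = sum(values) / len(values)
        else:
            raise ValueError(f'Unknown complex_agg_method: {complex_agg_method}')
    return aggregated


expression = {
    'CD74': {'Cell1': 8.0, 'Cell2': 1.0},
    'CD44': {'Cell1': 2.0, 'Cell2': 3.0},
}
by_min = complex_expression('CD74&CD44', expression, complex_agg_method='min')
by_mean = complex_expression('CD74&CD44', expression, complex_agg_method='mean')
print(by_min, by_mean)
```

Taking the minimum treats the least expressed subunit as limiting for the whole complex, whereas the mean lets a highly expressed subunit compensate for a weak one.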

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

verbose : boolean, default=True Whether printing or not steps of the analysis.

Returns

interaction_space : cell2cell.core.interaction_space.InteractionSpace Interaction space that contains all the required elements to perform the cell-cell interaction and communication analysis between every pair of cells. After performing the analyses, the results are stored in this object.

Source code in cell2cell/analysis/cell2cell_pipelines.py
def initialize_interaction_space(rnaseq_data, ppi_data, cutoff_setup, analysis_setup, excluded_cells=None,
                                 complex_sep=None, complex_agg_method='min', interaction_columns=('A', 'B'),
                                 verbose=True):
    '''Initializes an InteractionSpace object to perform the analyses.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for a bulk RNA-seq experiment or a single-cell
        experiment after aggregation into cell types. Columns are samples
        and rows are genes.

    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    cutoff_setup : dict
        Contains two keys: 'type' and 'parameter'. The first key represents the
        way to use a cutoff or threshold, while parameter is the value used
        to binarize the expression values.
        The key 'type' can be:

        - 'local_percentile' : computes the value of a given percentile, for each
            gene independently. In this case, the parameter corresponds to the
            percentile to compute, as a float value between 0 and 1.
        - 'global_percentile' : computes the value of a given percentile from all
            genes and samples simultaneously. In this case, the parameter
            corresponds to the percentile to compute, as a float value between
            0 and 1. All genes have the same cutoff.
        - 'file' : load a cutoff table from a file. Parameter in this case is the
            path of that file. It must contain the same genes as index and same
            samples as columns.
        - 'multi_col_matrix' : a dataframe must be provided, containing a cutoff
            for each gene in each sample. This allows to use specific cutoffs for
            each sample. The columns here must be the same as the ones in the
            rnaseq_data.
        - 'single_col_matrix' : a dataframe must be provided, containing a cutoff
            for each gene in only one column. These cutoffs will be applied to
            all samples.
        - 'constant_value' : binarizes the expression. Evaluates whether
            expression is greater than the value input in the parameter.

    analysis_setup : dict
        Contains main setup for running the cell-cell interactions and communication
        analyses.
        Three main setups are needed (passed as keys):

        - 'communication_score' : is the type of communication score used to detect
            active ligand-receptor pairs between each pair of cells.
            It can be:

            - 'expression_thresholding'
            - 'expression_product'
            - 'expression_mean'
            - 'expression_gmean'

        - 'cci_score' : is the scoring function to aggregate the communication
            scores.
            It can be:

            - 'bray_curtis'
            - 'jaccard'
            - 'count'
            - 'icellnet'

        - 'cci_type' : is the type of interaction between two cells. If it is
            undirected, all ligands and receptors are considered from both cells.
            If it is directed, ligands from one cell and receptors from the other
            are considered separately with respect to ligands from the second
            cell and receptor from the first one.
            So, it can be:

            - 'undirected'
            - 'directed'

    excluded_cells : list, default=None
        List of cells in the rnaseq_data to be excluded. If None, all cells
        are considered.

    complex_sep : str, default=None
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    complex_agg_method : str, default='min'
        Method to aggregate the expression value of multiple genes in a
        complex.

        - 'min' : Minimum expression value among all genes.
        - 'mean' : Average expression value among all genes.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    verbose : boolean, default=True
        Whether printing or not steps of the analysis.

    Returns
    -------
    interaction_space : cell2cell.core.interaction_space.InteractionSpace
        Interaction space that contains all the required elements to perform the
        cell-cell interaction and communication analysis between every pair of cells.
        After performing the analyses, the results are stored in this object.
    '''
    if excluded_cells is None:
        excluded_cells = []

    included_cells = sorted(list((set(rnaseq_data.columns) - set(excluded_cells))))

    interaction_space = ispace.InteractionSpace(rnaseq_data=rnaseq_data[included_cells],
                                                ppi_data=ppi_data,
                                                gene_cutoffs=cutoff_setup,
                                                communication_score=analysis_setup['communication_score'],
                                                cci_score=analysis_setup['cci_score'],
                                                cci_type=analysis_setup['cci_type'],
                                                complex_sep=complex_sep,
                                                complex_agg_method=complex_agg_method,
                                                interaction_columns=interaction_columns,
                                                verbose=verbose)
    return interaction_space

tensor_downstream

compute_gini_coefficients(result, sender_label='Sender Cells', receiver_label='Receiver Cells')

Computes the Gini coefficient of the distribution of edge weights in each factor-specific cell-cell communication network. The factors are those obtained from the tensor decomposition with Tensor-cell2cell.

Parameters

result : any Tensor class in cell2cell.tensor.tensor or a dict Either a Tensor type or a dictionary which resulted from the tensor decomposition. If it is a dict, it should be the one in, for example, InteractionTensor.factors

sender_label : str Label for the dimension of sender cells. Usually found in InteractionTensor.order_labels

receiver_label : str Label for the dimension of receiver cells. Usually found in InteractionTensor.order_labels

Returns

gini_df : pandas.DataFrame Dataframe containing the Gini coefficient of each factor from a tensor decomposition. Calculated on the factor-specific cell-cell communication networks.

Source code in cell2cell/analysis/tensor_downstream.py
def compute_gini_coefficients(result, sender_label='Sender Cells', receiver_label='Receiver Cells'):
    '''
    Computes the Gini coefficient of the distribution of edge weights
    in each factor-specific cell-cell communication network. The factors
    are those obtained from the tensor decomposition with Tensor-cell2cell.

    Parameters
    ----------
    result : any Tensor class in cell2cell.tensor.tensor or a dict
        Either a Tensor type or a dictionary which resulted from the tensor
        decomposition. If it is a dict, it should be the one in, for example,
        InteractionTensor.factors

    sender_label : str
        Label for the dimension of sender cells. Usually found in
        InteractionTensor.order_labels

    receiver_label : str
        Label for the dimension of receiver cells. Usually found in
        InteractionTensor.order_labels

    Returns
    -------
    gini_df : pandas.DataFrame
        Dataframe containing the Gini coefficient of each factor from
        a tensor decomposition. Calculated on the factor-specific
        cell-cell communication networks.
    '''
    if hasattr(result, 'factors'):
        result = result.factors
        if result is None:
            raise ValueError('A tensor factorization must be run on the tensor before calling this function.')
    elif isinstance(result, dict):
        pass
    else:
        raise ValueError('result is not of a valid type. It must be an InteractionTensor or a dict.')

    factors = sorted(list(set(result[sender_label].columns) & set(result[receiver_label].columns)))

    ginis = []
    for f in factors:
        factor_net = get_joint_loadings(result=result,
                                        dim1=sender_label,
                                        dim2=receiver_label,
                                        factor=f
                                        )
        gini = gini_coefficient(factor_net.values.flatten())
        ginis.append((f, gini))
    gini_df = pd.DataFrame.from_records(ginis, columns=['Factor', 'Gini'])
    return gini_df
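
The computation for a single factor can be sketched as follows. This is a minimal illustration with made-up loadings; the `gini` helper is hypothetical and uses a textbook formula, so it may differ in detail from the `gini_coefficient` function that cell2cell calls.

```python
import numpy as np
import pandas as pd

def gini(values):
    """Textbook Gini coefficient of non-negative values (hypothetical helper)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float((2 * ranks - n - 1).dot(x) / (n * x.sum()))

# Toy loadings for one factor (made-up numbers).
senders = pd.Series([0.9, 0.1], index=['T cell', 'B cell'])
receivers = pd.Series([0.5, 0.5], index=['T cell', 'B cell'])

# Factor-specific network: outer product of sender and receiver loadings.
factor_net = pd.DataFrame(np.outer(senders, receivers),
                          index=senders.index, columns=receivers.index)

uneven = gini(factor_net.values.flatten())  # edge weights dominated by one sender
even = gini(np.ones(4))                     # all edges equally weighted
```

A factor whose communication is concentrated in a few sender-receiver pairs yields a higher coefficient than one whose edge weights are evenly spread.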

flatten_factor_ccc_networks(networks, orderby='senders')

Flattens all adjacency matrices in the factor-specific cell-cell communication networks. It generates a matrix where rows are cell-cell pairs and columns are factors.

Parameters

networks : dict A dictionary containing a pandas.DataFrame for each of the factors (factor names are the keys of the dict). These dataframes are the adjacency matrices of the CCC networks.

orderby : str Order of the flattened cell-cell pairs. Options are 'senders' and 'receivers'. 'senders' means to flatten the matrices so that all cell-cell pairs with the same sender cell are placed next to each other. 'receivers' means the same, but grouping by the receiver cell instead.

Returns

flatten_networks : pandas.DataFrame A dataframe wherein each column contains a flattened factor-specific network. Rows are the directed cell-cell pairs.

Source code in cell2cell/analysis/tensor_downstream.py
def flatten_factor_ccc_networks(networks, orderby='senders'):
    '''
    Flattens all adjacency matrices in the factor-specific
    cell-cell communication networks. It generates a matrix
    where rows are cell-cell pairs and columns are factors.

    Parameters
    ----------
    networks : dict
        A dictionary containing a pandas.DataFrame for each of the factors
        (factor names are the keys of the dict). These dataframes are the
        adjacency matrices of the CCC networks.

    orderby : str
        Order of the flattened cell-cell pairs. Options are 'senders' and
        'receivers'. 'senders' means to flatten the matrices so that all
        cell-cell pairs with the same sender cell are placed next to each other.
        'receivers' means the same, but grouping by the receiver cell instead.

    Returns
    -------
    flatten_networks : pandas.DataFrame
        A dataframe wherein each column contains a flattened factor-specific
        network. Rows are the directed cell-cell pairs.
    '''
    senders = sorted(set.intersection(*[set(v.index) for v in networks.values()]))
    receivers = sorted(set.intersection(*[set(v.columns) for v in networks.values()]))

    if orderby == 'senders':
        cell_pairs = [s + ' --> ' + r for s in senders for r in receivers]
        flatten_order = 'C'
    elif orderby == 'receivers':
        cell_pairs = [s + ' --> ' + r for r in receivers for s in senders]
        flatten_order = 'F'
    else:
        raise ValueError("`orderby` must be either 'senders' or 'receivers'.")

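    # Reorder each network to match the sorted sender/receiver labels before flattening
    data = np.asarray([v.loc[senders, receivers].values.flatten(flatten_order) for v in networks.values()]).T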
    flatten_networks = pd.DataFrame(data=data,
                                    index=cell_pairs,
                                    columns=list(networks.keys())
                                    )
    return flatten_networks
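
The two orderings can be illustrated on a single toy adjacency matrix. This is a minimal sketch with made-up values, using only NumPy and pandas: row-major ('C') flattening groups pairs by sender, and column-major ('F') flattening groups them by receiver.

```python
import numpy as np
import pandas as pd

# Toy adjacency matrix: rows are senders, columns are receivers.
net = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['A', 'B'])
senders, receivers = list(net.index), list(net.columns)

# 'senders' ordering: row-major ('C') flattening keeps each sender's pairs together.
pairs_by_sender = [s + ' --> ' + r for s in senders for r in receivers]
by_sender = pd.Series(net.values.flatten('C'), index=pairs_by_sender)

# 'receivers' ordering: column-major ('F') flattening keeps each receiver's pairs together.
pairs_by_receiver = [s + ' --> ' + r for r in receivers for s in senders]
by_receiver = pd.Series(net.values.flatten('F'), index=pairs_by_receiver)
```

Both series hold the same value for each labelled pair; only the order of the entries differs.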

get_factor_specific_ccc_networks(result, sender_label='Sender Cells', receiver_label='Receiver Cells')

Generates adjacency matrices for each of the factors obtained from a tensor decomposition. These matrices represent a cell-cell communication directed network.

Parameters

result : any Tensor class in cell2cell.tensor.tensor or a dict Either a Tensor type or a dictionary which resulted from the tensor decomposition. If it is a dict, it should be the one in, for example, InteractionTensor.factors

sender_label : str Label for the dimension of sender cells. Usually found in InteractionTensor.order_labels

receiver_label : str Label for the dimension of receiver cells. Usually found in InteractionTensor.order_labels

Returns

networks : dict A dictionary containing a pandas.DataFrame for each of the factors (factor names are the keys of the dict). These dataframes are the adjacency matrices of the CCC networks.

Source code in cell2cell/analysis/tensor_downstream.py
def get_factor_specific_ccc_networks(result, sender_label='Sender Cells', receiver_label='Receiver Cells'):
    '''
    Generates adjacency matrices for each of the factors
    obtained from a tensor decomposition. These matrices represent a
    cell-cell communication directed network.

    Parameters
    ----------
    result : any Tensor class in cell2cell.tensor.tensor or a dict
        Either a Tensor type or a dictionary which resulted from the tensor
        decomposition. If it is a dict, it should be the one in, for example,
        InteractionTensor.factors

    sender_label : str
        Label for the dimension of sender cells. Usually found in
        InteractionTensor.order_labels

    receiver_label : str
        Label for the dimension of receiver cells. Usually found in
        InteractionTensor.order_labels

    Returns
    -------
    networks : dict
        A dictionary containing a pandas.DataFrame for each of the factors
        (factor names are the keys of the dict). These dataframes are the
        adjacency matrices of the CCC networks.
    '''
    if hasattr(result, 'factors'):
        result = result.factors
        if result is None:
            raise ValueError('A tensor factorization must be run on the tensor before calling this function.')
    elif isinstance(result, dict):
        pass
    else:
        raise ValueError('result is not of a valid type. It must be an InteractionTensor or a dict.')

    factors = sorted(list(set(result[sender_label].columns) & set(result[receiver_label].columns)))

    networks = dict()
    for f in factors:
        networks[f] = get_joint_loadings(result=result,
                                         dim1=sender_label,
                                         dim2=receiver_label,
                                         factor=f
                                         )
    return networks
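
One possible use of the returned dictionary is to look up the strongest directed edge in each factor. The sketch below builds the networks by hand from a toy `factors` dictionary with made-up loadings (so it runs without cell2cell); `strongest_edge` is a hypothetical helper, not part of the library.

```python
import numpy as np
import pandas as pd

cells = ['Macrophage', 'T cell', 'Fibroblast']
# Toy loadings (made-up values): rows are cells, columns are factors.
factors = {
    'Sender Cells': pd.DataFrame({'Factor 1': [0.8, 0.1, 0.1],
                                  'Factor 2': [0.1, 0.2, 0.7]}, index=cells),
    'Receiver Cells': pd.DataFrame({'Factor 1': [0.2, 0.7, 0.1],
                                    'Factor 2': [0.6, 0.3, 0.1]}, index=cells),
}

# One adjacency matrix per factor: outer product of sender and receiver loadings.
networks = {}
for f in factors['Sender Cells'].columns:
    s = factors['Sender Cells'][f]
    r = factors['Receiver Cells'][f]
    networks[f] = pd.DataFrame(np.outer(s, r), index=s.index, columns=r.index)

def strongest_edge(net):
    """Return the (sender, receiver) labels of the largest entry (hypothetical helper)."""
    i, j = np.unravel_index(net.values.argmax(), net.shape)
    return net.index[i], net.columns[j]

top_edges = {f: strongest_edge(net) for f, net in networks.items()}
```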

get_joint_loadings(result, dim1, dim2, factor)

Creates the joint loading distribution between two tensor dimensions for a given factor output from decomposition.

Parameters

result : any Tensor class in cell2cell.tensor.tensor or a dict Either a Tensor type or a dictionary which resulted from the tensor decomposition. If it is a dict, it should be the one in, for example, InteractionTensor.factors

dim1 : str One of the tensor dimensions (options are in the keys of the dict, or interaction.factors.keys())

dim2 : str A second tensor dimension (options are in the keys of the dict, or interaction.factors.keys())

factor : str One of the factors output from the decomposition (e.g. 'Factor 1').

Returns

joint_dist : pandas.DataFrame Joint distribution of factor loadings for the specified dimensions. Rows correspond to elements in dim1 and columns to elements in dim2.

Source code in cell2cell/analysis/tensor_downstream.py
def get_joint_loadings(result, dim1, dim2, factor):
    """
    Creates the joint loading distribution between two tensor dimensions for a
    given factor output from decomposition.

    Parameters
    ----------
    result : any Tensor class in cell2cell.tensor.tensor or a dict
        Either a Tensor type or a dictionary which resulted from the tensor
        decomposition. If it is a dict, it should be the one in, for example,
        InteractionTensor.factors

    dim1 : str
        One of the tensor dimensions (options are in the keys of the dict,
        or interaction.factors.keys())

    dim2 : str
        A second tensor dimension (options are in the keys of the dict,
        or interaction.factors.keys())

    factor : str
        One of the factors output from the decomposition (e.g. 'Factor 1').

    Returns
    -------
    joint_dist : pandas.DataFrame
        Joint distribution of factor loadings for the specified dimensions.
        Rows correspond to elements in dim1 and columns to elements in dim2.
    """
    if hasattr(result, 'factors'):
        result = result.factors
        if result is None:
            raise ValueError('A tensor factorization must be run on the tensor before calling this function.')
    elif isinstance(result, dict):
        pass
    else:
        raise ValueError('result is not of a valid type. It must be an InteractionTensor or a dict.')

    assert dim1 in result.keys(), 'The specified dimension ' + dim1 + ' is not present in the `result` input'
    assert dim2 in result.keys(), 'The specified dimension ' + dim2 + ' is not present in the `result` input'

    vec1 = result[dim1][factor]
    vec2 = result[dim2][factor]

    # Calculate the outer product
    joint_dist = pd.DataFrame(data=np.outer(vec1, vec2),
                              index=vec1.index,
                              columns=vec2.index)

    joint_dist.index.name = dim1
    joint_dist.columns.name = dim2
    return joint_dist
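
The joint loading is an outer product, so it applies to any two dimensions of the tensor, not only senders and receivers. A minimal sketch with made-up loadings; the dimension names 'Contexts' and 'Ligand-Receptor Pairs' are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy loadings for 'Factor 1' in two different dimensions (made-up values).
vec1 = pd.Series([0.6, 0.8], index=['Ctrl', 'Disease'], name='Factor 1')
vec2 = pd.Series([0.0, 0.5, 1.0], index=['LR1', 'LR2', 'LR3'], name='Factor 1')

# Entry (i, j) is the loading of element i in dim1 times that of element j in dim2.
joint = pd.DataFrame(np.outer(vec1, vec2), index=vec1.index, columns=vec2.index)
joint.index.name = 'Contexts'
joint.columns.name = 'Ligand-Receptor Pairs'
```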

get_lr_by_cell_pairs(result, lr_label, sender_label, receiver_label, order_cells_by='receivers', factor=None, cci_threshold=None, lr_threshold=None)

Returns a dataframe containing the product loadings for each combination of ligand-receptor pair and sender-receiver pair.

Parameters

result : any Tensor class in cell2cell.tensor.tensor or a dict Either a Tensor type or a dictionary which resulted from the tensor decomposition. If it is a dict, it should be the one in, for example, InteractionTensor.factors

lr_label : str Label for the dimension of the ligand-receptor pairs. Usually found in InteractionTensor.order_labels

sender_label : str Label for the dimension of sender cells. Usually found in InteractionTensor.order_labels

receiver_label : str Label for the dimension of receiver cells. Usually found in InteractionTensor.order_labels

order_cells_by : str, default='receivers' Order of the returned dataframe. Options are 'senders' and 'receivers'. 'senders' means to order the dataframe so that all cell-cell pairs with the same sender cell are placed next to each other. 'receivers' means the same, but grouping by the receiver cell instead.

factor : str, default=None Name of the factor to be used to compute the product loadings. If None, all factors will be included to compute them.

cci_threshold : float, default=None Threshold applied to the product loadings of the sender-receiver pairs. If specified, only cell-cell pairs with a product loading above the threshold in at least one of the included factors are kept in the returned dataframe.

lr_threshold : float, default=None Threshold applied to the ligand-receptor loadings. If specified, only LR pairs with a loading above the threshold in at least one of the included factors are kept in the returned dataframe.

Returns

cci_lr : pandas.DataFrame Dataframe containing the product loadings for each combination of ligand-receptor pair and sender-receiver pair. If the factor is specified, the returned dataframe will contain the product loadings of that factor. If the factor is not specified, the returned dataframe will contain the product loadings across all factors.

Source code in cell2cell/analysis/tensor_downstream.py
def get_lr_by_cell_pairs(result, lr_label, sender_label, receiver_label, order_cells_by='receivers', factor=None,
                         cci_threshold=None, lr_threshold=None):
    '''
    Returns a dataframe containing the product loadings for each combination
    of ligand-receptor pair and sender-receiver pair.

    Parameters
    ----------
    result : any Tensor class in cell2cell.tensor.tensor or a dict
        Either a Tensor type or a dictionary which resulted from the tensor
        decomposition. If it is a dict, it should be the one in, for example,
        InteractionTensor.factors

    lr_label : str
        Label for the dimension of the ligand-receptor pairs. Usually found in
        InteractionTensor.order_labels

    sender_label : str
        Label for the dimension of sender cells. Usually found in
        InteractionTensor.order_labels

    receiver_label : str
        Label for the dimension of receiver cells. Usually found in
        InteractionTensor.order_labels

    order_cells_by : str, default='receivers'
        Order of the returned dataframe. Options are 'senders' and
        'receivers'. 'senders' means to order the dataframe so that all
        cell-cell pairs with the same sender cell are placed next to each other.
        'receivers' means the same, but grouping by the receiver cell instead.

    factor : str, default=None
        Name of the factor to be used to compute the product loadings.
        If None, all factors will be included to compute them.

    cci_threshold : float, default=None
        Threshold applied to the product loadings of the sender-receiver pairs.
        If specified, only cell-cell pairs with a product loading above the
        threshold in at least one of the included factors are kept
        in the returned dataframe.

    lr_threshold : float, default=None
        Threshold applied to the ligand-receptor loadings.
        If specified, only LR pairs with a loading above the
        threshold in at least one of the included factors are kept
        in the returned dataframe.

    Returns
    -------
    cci_lr : pandas.DataFrame
        Dataframe containing the product loadings for each combination
        of ligand-receptor pair and sender-receiver pair. If the factor is specified,
        the returned dataframe will contain the product loadings of that factor.
        If the factor is not specified, the returned dataframe will contain the
        product loadings across all factors.
    '''
    if hasattr(result, 'factors'):
        result = result.factors
        if result is None:
            raise ValueError('A tensor factorization must be run on the tensor before calling this function.')
    elif isinstance(result, dict):
        pass
    else:
        raise ValueError('result is not of a valid type. It must be an InteractionTensor or a dict.')

    assert lr_label in result.keys(), 'The specified dimension ' + lr_label + ' is not present in the `result` input'
    assert sender_label in result.keys(), 'The specified dimension ' + sender_label + ' is not present in the `result` input'
    assert receiver_label in result.keys(), 'The specified dimension ' + receiver_label + ' is not present in the `result` input'

    # Sort factors
    sorted_factors = sorted(result[lr_label].columns, key=lambda x: int(x.split(' ')[1]))

    # Get CCI network per factor
    networks = get_factor_specific_ccc_networks(result=result,
                                                sender_label=sender_label,
                                                receiver_label=receiver_label)

    # Flatten networks
    network_by_factors = flatten_factor_ccc_networks(networks=networks, orderby=order_cells_by)

    # Get final dataframe
    df1 = network_by_factors[sorted_factors]
    df2 = result[lr_label][sorted_factors]

    if factor is not None:
        df1 = df1[factor]
        df2 = df2[factor]
        if cci_threshold is not None:
            df1 = df1[(df1 > cci_threshold)]
        if lr_threshold is not None:
            df2 = df2[(df2 > lr_threshold)]
        data = pd.DataFrame(np.outer(df1, df2), index=df1.index, columns=df2.index)
    else:
        if cci_threshold is not None:
            df1 = df1[(df1.T > cci_threshold).any()]  # Top sender-receiver pairs
        if lr_threshold is not None:
            df2 = df2[(df2.T > lr_threshold).any()]  # Top LR Pairs
        data = np.matmul(df1, df2.T)

    cci_lr = pd.DataFrame(data.T.values,
                          columns=df1.index,
                          index=df2.index
                          )

    cci_lr.columns.name = 'Sender-Receiver Pair'
    cci_lr.index.name = 'Ligand-Receptor Pair'
    return cci_lr
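
When no factor is specified, the product loadings amount to a matrix product summed over factors, optionally after thresholding. A minimal sketch with made-up loadings; the labels and the threshold value are illustrative assumptions:

```python
import pandas as pd

factors = ['Factor 1', 'Factor 2']
# Flattened CCC networks: rows are cell pairs, columns are factors (toy values).
cci = pd.DataFrame([[0.5, 0.0], [0.1, 0.4], [0.05, 0.05]],
                   index=['A --> B', 'B --> A', 'B --> B'], columns=factors)
# LR loadings: rows are LR pairs, columns are factors (toy values).
lr = pd.DataFrame([[0.9, 0.1], [0.2, 0.8]],
                  index=['L1^R1', 'L2^R2'], columns=factors)

# Keep cell pairs exceeding the threshold in at least one factor.
cci_threshold = 0.1
cci_kept = cci[(cci > cci_threshold).any(axis=1)]

# Sum over factors of (LR loading x cell-pair loading).
cci_lr = lr.dot(cci_kept.T)
cci_lr.index.name = 'Ligand-Receptor Pair'
cci_lr.columns.name = 'Sender-Receiver Pair'
```

Here 'B --> B' is dropped because none of its loadings exceeds the threshold, while 'B --> A' is kept on the strength of Factor 2 alone.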

tensor_pipelines

run_tensor_cell2cell_pipeline(interaction_tensor, tensor_metadata, copy_tensor=False, rank=None, tf_optimization='regular', random_state=None, backend=None, device=None, elbow_metric='error', smooth_elbow=False, upper_rank=25, tf_init='random', tf_svd='numpy_svd', cmaps=None, sample_col='Element', group_col='Category', fig_fontsize=14, output_folder=None, output_fig=True, fig_format='pdf', **kwargs)

Runs basic pipeline of Tensor-cell2cell (excluding downstream analyses).

Parameters

interaction_tensor : cell2cell.tensor.BaseTensor A communication tensor generated with any of the tensor classes in cell2cell.tensor.

tensor_metadata : list List of pandas dataframes with metadata information for the elements of each dimension in the tensor. In each dataframe, the column named by sample_col contains the name of each element in the tensor, while the column named by group_col contains the metadata or grouping information of each element.

copy_tensor : boolean, default=False Whether to generate a copy of the original tensor to avoid modifying it.

rank : int, default=None Rank of the Tensor Factorization (number of factors to deconvolve the original tensor). If None, it will be automatically inferred from an elbow analysis.

tf_optimization : str, default='regular' Defines whether to run the optimization with a higher number of iterations, more independent factorization runs, and higher resolution (lower tolerance), or with fewer iterations and factorization runs and a lower resolution. Options are:

- 'regular' : It uses 100 max iterations, 1 factorization run, and 1e-7 tolerance.
              Faster to run.
- 'robust' : It uses 500 max iterations, 100 factorization runs, and 1e-8 tolerance.
             Slower to run.

random_state : int, default=None Seed for randomization.

backend : str, default=None Backend that TensorLy will use to perform calculations on this tensor. When None, the currently active backend is used (usually 'numpy'). Options are: {'cupy', 'jax', 'mxnet', 'numpy', 'pytorch', 'tensorflow'}

device : str, default=None Device to use when the backend allows multiple devices. Options are: {'cpu', 'cuda:0', None}

elbow_metric : str, default='error' Metric to perform the elbow analysis (y-axis).

    - 'error' : Normalized error to compute the elbow.
    - 'similarity' : Similarity based on CorrIndex (1-CorrIndex).

smooth_elbow : boolean, default=False Whether to smooth the elbow-analysis curve with a Savitzky-Golay filter.

upper_rank : int, default=25 Upper bound of ranks to explore with the elbow analysis.

tf_init : str, default='random' Initialization method for computing the Tensor Factorization. Options are: {'svd', 'random'}

tf_svd : str, default='numpy_svd' Function to compute the SVD for initializing the Tensor Factorization, acceptable values in tensorly.SVD_FUNS

cmaps : list, default=None A list of colormaps used for coloring elements in each dimension. The length of this list must equal the number of dimensions of the tensor. If None, default colormaps are selected according to the number of dimensions (tensors with up to 5 dimensions are supported).

sample_col : str, default='Element' Name of the column containing the element names in the metadata.

group_col : str, default='Category' Name of the column containing the metadata or grouping information for each element in the metadata.

fig_fontsize : int, default=14 Font size of the tick labels. Axis labels will be 1.2 times the fontsize.

output_folder : str, default=None Path to the folder where the figures generated will be saved. If None, figures will not be saved.

output_fig : boolean, default=True Whether to generate the figures with matplotlib.

fig_format : str, default='pdf' Format used to store figures when an output_folder is specified and output_fig is True. Ignored otherwise.

**kwargs : dict Extra arguments for the tensor factorization according to inputs in tensorly.

Returns

interaction_tensor : cell2cell.tensor.tensor.BaseTensor Either the original input interaction_tensor or a copy of it. This also stores the results from running the Tensor-cell2cell pipeline in the corresponding attributes.

Source code in cell2cell/analysis/tensor_pipelines.py
def run_tensor_cell2cell_pipeline(interaction_tensor, tensor_metadata, copy_tensor=False, rank=None,
                                  tf_optimization='regular', random_state=None, backend=None, device=None,
                                  elbow_metric='error', smooth_elbow=False, upper_rank=25, tf_init='random',
                                  tf_svd='numpy_svd', cmaps=None, sample_col='Element', group_col='Category',
                                  fig_fontsize=14, output_folder=None, output_fig=True, fig_format='pdf', **kwargs):
    '''
    Runs basic pipeline of Tensor-cell2cell (excluding downstream analyses).

    Parameters
    ----------
    interaction_tensor : cell2cell.tensor.BaseTensor
        A communication tensor generated with any of the tensor classes in
        cell2cell.tensor.

    tensor_metadata : list
        List of pandas dataframes with metadata information for the elements of each
        dimension in the tensor. In each dataframe, the column named by `sample_col`
        contains the name of each element in the tensor, while the column named by
        `group_col` contains the metadata or grouping information of each
        element.

    copy_tensor : boolean, default=False
        Whether to generate a copy of the original tensor to avoid modifying it.

    rank : int, default=None
        Rank of the Tensor Factorization (number of factors to deconvolve the original
        tensor). If None, it will be automatically inferred from an elbow analysis.

    tf_optimization : str, default='regular'
        Defines whether to run the optimization with a higher number of iterations,
        more independent factorization runs, and higher resolution (lower tolerance),
        or with fewer iterations and factorization runs and a lower resolution.
        Options are:

        - 'regular' : It uses 100 max iterations, 1 factorization run, and 1e-7 tolerance.
                      Faster to run.
        - 'robust' : It uses 500 max iterations, 100 factorization runs, and 1e-8 tolerance.
                     Slower to run.

    random_state : int, default=None
        Seed for randomization.

    backend : str, default=None
        Backend that TensorLy will use to perform calculations
        on this tensor. When None, the currently active backend
        is used (usually 'numpy'). Options are:
        {'cupy', 'jax', 'mxnet', 'numpy', 'pytorch', 'tensorflow'}

    device : str, default=None
        Device to use when the backend allows multiple devices. Options are:
        {'cpu', 'cuda:0', None}

    elbow_metric : str, default='error'
        Metric to perform the elbow analysis (y-axis).

            - 'error' : Normalized error to compute the elbow.
            - 'similarity' : Similarity based on CorrIndex (1-CorrIndex).

    smooth_elbow : boolean, default=False
        Whether to smooth the elbow-analysis curve with a Savitzky-Golay filter.

    upper_rank : int, default=25
        Upper bound of ranks to explore with the elbow analysis.

    tf_init : str, default='random'
        Initialization method for computing the Tensor Factorization.
        Options are: {'svd', 'random'}

    tf_svd : str, default='numpy_svd'
        Function to compute the SVD for initializing the Tensor Factorization,
        acceptable values in tensorly.SVD_FUNS

    cmaps : list, default=None
        A list of colormaps used for coloring elements in each dimension. The length
        of this list must equal the number of dimensions of the tensor. If None,
        default colormaps are selected according to the number of dimensions
        (tensors with up to 5 dimensions are supported).

    sample_col : str, default='Element'
        Name of the column containing the element names in the metadata.

    group_col : str, default='Category'
        Name of the column containing the metadata or grouping information for each
        element in the metadata.

    fig_fontsize : int, default=14
        Font size of the tick labels. Axis labels will be 1.2 times the fontsize.

    output_folder : str, default=None
        Path to the folder where the figures generated will be saved.
        If None, figures will not be saved.

    output_fig : boolean, default=True
        Whether to generate the figures with matplotlib.

    fig_format : str, default='pdf'
        Format used to store figures when an `output_folder` is specified
        and `output_fig` is True. Ignored otherwise.

    **kwargs : dict
        Extra arguments for the tensor factorization according to inputs in
        tensorly.

    Returns
    -------
    interaction_tensor : cell2cell.tensor.tensor.BaseTensor
        Either the original input `interaction_tensor` or a copy of it.
        This also stores the results from running the Tensor-cell2cell
        pipeline in the corresponding attributes.
    '''
    if copy_tensor:
        interaction_tensor = interaction_tensor.copy()

    dim = len(interaction_tensor.tensor.shape)

    ### OUTPUT FILENAMES ###
    if output_folder is None:
        elbow_filename = None
        tf_filename = None
        loading_filename = None
    else:
        elbow_filename = output_folder + '/Elbow.{}'.format(fig_format)
        tf_filename = output_folder + '/Tensor-Factorization.{}'.format(fig_format)
        loading_filename = output_folder + '/Loadings.xlsx'

    ### PALETTE COLORS FOR ELEMENTS IN TENSOR DIMS ###
    if cmaps is None:
        cmap_5d = ['tab10', 'viridis', 'Dark2_r', 'tab20', 'tab20']
        cmap_4d = ['plasma', 'Dark2_r', 'tab20', 'tab20']

        if dim == 5:
            cmaps = cmap_5d
        elif dim <= 4:
            cmaps = cmap_4d[-dim:]
        else:
            raise ValueError('Tensors with more than 5 dimensions are not supported')

    assert len(cmaps) == dim, "`cmaps` must contain one colormap per dimension of the tensor."

    ### FACTORIZATION PARAMETERS ###
    if tf_optimization == 'robust':
        elbow_runs = 20
        tf_runs = 100
        tol = 1e-8
        n_iter_max = 500
    elif tf_optimization == 'regular':
        elbow_runs = 10
        tf_runs = 1
        tol = 1e-7
        n_iter_max = 100
    else:
        raise ValueError("`tf_optimization` must be either 'robust' or 'regular'.")

    if backend is not None:
        tl.set_backend(backend)

    if device is not None:
        interaction_tensor.to_device(device=device)

    ### ANALYSIS ###
    # Elbow
    if rank is None:
        print('Running Elbow Analysis')
        fig1, error = interaction_tensor.elbow_rank_selection(upper_rank=upper_rank,
                                                              runs=elbow_runs,
                                                              init=tf_init,
                                                              svd=tf_svd,
                                                              automatic_elbow=True,
                                                              metric=elbow_metric,
                                                              output_fig=output_fig,
                                                              smooth=smooth_elbow,
                                                              random_state=random_state,
                                                              fontsize=fig_fontsize,
                                                              filename=elbow_filename,
                                                              tol=tol, n_iter_max=n_iter_max,
                                                              **kwargs
                                                              )

        rank = interaction_tensor.rank

    # Factorization
    print('Running Tensor Factorization')
    interaction_tensor.compute_tensor_factorization(rank=rank,
                                                    init=tf_init,
                                                    svd=tf_svd,
                                                    random_state=random_state,
                                                    runs=tf_runs,
                                                    normalize_loadings=True,
                                                    tol=tol, n_iter_max=n_iter_max,
                                                    **kwargs
                                                    )

    ### EXPORT RESULTS ###
    if output_folder is not None:
        print('Generating Outputs')
        interaction_tensor.export_factor_loadings(loading_filename)

    if output_fig:
        fig2, axes = tensor_factors_plot(interaction_tensor=interaction_tensor,
                                         metadata=tensor_metadata,
                                         sample_col=sample_col,
                                         group_col=group_col,
                                         meta_cmaps=cmaps,
                                         fontsize=fig_fontsize,
                                         filename=tf_filename
                                         )

    return interaction_tensor
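
The presets selected by `tf_optimization` and the default colormaps can be summarized in a small standalone sketch. Both helpers below are hypothetical (they are not part of cell2cell); they only mirror the values shown in the source above so the choices are easier to inspect.

```python
def factorization_settings(tf_optimization='regular'):
    """Hypothetical helper mirroring the presets used by the pipeline."""
    presets = {
        'regular': dict(elbow_runs=10, tf_runs=1, tol=1e-7, n_iter_max=100),
        'robust': dict(elbow_runs=20, tf_runs=100, tol=1e-8, n_iter_max=500),
    }
    if tf_optimization not in presets:
        raise ValueError("`tf_optimization` must be either 'robust' or 'regular'.")
    return presets[tf_optimization]


def default_cmaps(dim):
    """Hypothetical helper: default colormaps for a tensor with `dim` dimensions."""
    cmap_5d = ['tab10', 'viridis', 'Dark2_r', 'tab20', 'tab20']
    cmap_4d = ['plasma', 'Dark2_r', 'tab20', 'tab20']
    if dim == 5:
        return cmap_5d
    if dim <= 4:
        # Lower-dimensional tensors reuse the last `dim` colormaps of the 4D palette.
        return cmap_4d[-dim:]
    raise ValueError('Tensors with more than 5 dimensions are not supported')


robust = factorization_settings('robust')
cmaps_3d = default_cmaps(3)
```

The 'robust' preset runs 100 independent factorizations with a tighter tolerance, which is why it is noticeably slower than 'regular'.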

clustering

cluster_interactions

compute_distance(data_matrix, axis=0, metric='euclidean')

Computes the pairwise distance between elements in a matrix of shape m x n. Uses the function scipy.spatial.distance.pdist

Parameters

data_matrix : pandas.DataFrame or ndarray An m x n matrix used to compute the distances

axis : int, default=0 To decide on which elements to compute the distance. If axis=0, the distances will be between elements in the rows, while axis=1 will lead to distances between elements in the columns.

metric : str, default='euclidean' The distance metric to use. The distance function can be 'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.

Returns

D : ndarray A square, symmetric distance matrix with zeros on the diagonal. Its shape is m x m if axis=0 or n x n if axis=1. The condensed output of scipy.spatial.distance.pdist is expanded with scipy.spatial.distance.squareform, so entry D[i, j] holds the distance between elements i and j.

Source code in cell2cell/clustering/cluster_interactions.py
def compute_distance(data_matrix, axis=0, metric='euclidean'):
    '''Computes the pairwise distance between elements in a
    matrix of shape m x n. Uses the function
    scipy.spatial.distance.pdist

    Parameters
    ----------
    data_matrix : pandas.DataFrame or ndarray
        An m x n matrix used to compute the distances.

    axis : int, default=0
        Axis whose elements are compared. If axis=0, the
        distances are computed between the rows; if axis=1,
        they are computed between the columns.

    metric : str, default='euclidean'
        The distance metric to use. The distance function can be 'braycurtis',
        'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice',
        'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski',
        'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao',
        'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.

    Returns
    -------
    D : ndarray
        A square, symmetric distance matrix with zeros on the diagonal.
        Its shape is m x m if axis=0 or n x n if axis=1. The condensed
        output of scipy.spatial.distance.pdist is expanded with
        scipy.spatial.distance.squareform, so entry D[i, j] holds the
        distance between elements i and j.
    '''
    if isinstance(data_matrix, pd.DataFrame):
        data = data_matrix.values
    else:
        data = data_matrix
    if axis == 0:
        D = sp.distance.squareform(sp.distance.pdist(data, metric=metric))
    elif axis == 1:
        D = sp.distance.squareform(sp.distance.pdist(data.T, metric=metric))
    else:
        raise ValueError('Not valid axis. Use 0 or 1.')
    return D
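
As a minimal sketch of the computation described above (calling SciPy directly rather than cell2cell), the snippet below builds square distance matrices for the rows and for the columns of a small made-up matrix. The names data, row_dist and col_dist are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy 3 x 2 matrix: three elements in the rows, two in the columns
data = np.array([[0.0, 0.0],
                 [3.0, 4.0],
                 [6.0, 8.0]])

# Counterpart of axis=0: distances between rows, expanded to a square matrix
row_dist = squareform(pdist(data, metric='euclidean'))

# Counterpart of axis=1: transpose first to get distances between columns
col_dist = squareform(pdist(data.T, metric='euclidean'))

print(row_dist.shape)  # (3, 3)
print(col_dist.shape)  # (2, 2)
```

The square form is redundant (each distance appears twice), but it can be indexed directly by element position.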

compute_linkage(distance_matrix, method='ward', optimal_ordering=True)

Returns a linkage for a given distance matrix using a specific method.

Parameters

distance_matrix : numpy.ndarray A square array containing the distance between a given row and a given column. Diagonal elements must be zero.

method : str, 'ward' by default Method to compute the linkage. It could be:

- 'single'
- 'complete'
- 'average'
- 'weighted'
- 'centroid'
- 'median'
- 'ward'
For more details, go to:
https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.cluster.hierarchy.linkage.html

optimal_ordering : boolean, default=True Whether to reorder the leaves of the dendrogram so that the distance between successive leaves is minimal. For more information, see scipy.cluster.hierarchy.optimal_leaf_ordering

Returns

Z : numpy.ndarray The hierarchical clustering encoded as a linkage matrix.

Source code in cell2cell/clustering/cluster_interactions.py
def compute_linkage(distance_matrix, method='ward', optimal_ordering=True):
    '''
    Returns a linkage for a given distance matrix using a specific method.

    Parameters
    ----------
    distance_matrix : numpy.ndarray
        A square array containing the distance between a given row and a
        given column. Diagonal elements must be zero.

    method : str, 'ward' by default
        Method to compute the linkage. It could be:

        - 'single'
        - 'complete'
        - 'average'
        - 'weighted'
        - 'centroid'
        - 'median'
        - 'ward'
        For more details, go to:
        https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.cluster.hierarchy.linkage.html

    optimal_ordering : boolean, default=True
        Whether to reorder the leaves of the dendrogram so that the distance
        between successive leaves is minimal. For more information, see
        scipy.cluster.hierarchy.optimal_leaf_ordering

    Returns
    -------
    Z : numpy.ndarray
        The hierarchical clustering encoded as a linkage matrix.
    '''
    if isinstance(distance_matrix, pd.DataFrame):
        # Copy so that zeroing the diagonal below does not touch the input
        data = distance_matrix.values.copy()
    else:
        data = distance_matrix.copy()
    if ~(data.transpose() == data).all():
        raise ValueError('The matrix is not symmetric')

    np.fill_diagonal(data, 0.0)

    # Compute linkage
    D = sp.distance.squareform(data)
    Z = hc.linkage(D, method=method, optimal_ordering=optimal_ordering)
    return Z
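
The conversion from a square matrix to a linkage can be sketched as follows. This is an illustration with SciPy on invented one-dimensional points, not a call to cell2cell; the variable names are assumptions.

```python
import numpy as np
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import pdist, squareform

# Four toy points forming two obvious groups
points = np.array([[0.0], [1.0], [10.0], [11.0]])
square = squareform(pdist(points))  # 4 x 4, symmetric, zero diagonal

# linkage works on the condensed form, so convert the square matrix back
condensed = squareform(square)
Z = hc.linkage(condensed, method='ward', optimal_ordering=True)

print(Z.shape)  # (3, 4): one row per merge
```

Each row of Z records one merge: the two merged cluster indices, their distance, and the size of the new cluster.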

get_clusters_from_linkage(linkage, threshold, criterion='maxclust', labels=None)

Gets clusters from a linkage given a threshold and a criterion.

Parameters

linkage : numpy.ndarray The hierarchical clustering encoded with the matrix returned by the linkage function (Z).

threshold : float The threshold to apply when forming flat clusters.

criterion : str, 'maxclust' by default The criterion to use in forming flat clusters. Depending on the criterion, the threshold has different meanings. More information on: https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.cluster.hierarchy.fcluster.html

labels : array-like, None by default List of labels of the elements contained in the linkage. The order must match the order they were provided when generating the linkage.

Returns

clusters : dict A dictionary containing the clusters obtained. The keys correspond to the cluster numbers and the values to a list with element names given the labels, or the element index based on the linkage.

Source code in cell2cell/clustering/cluster_interactions.py
def get_clusters_from_linkage(linkage, threshold, criterion='maxclust', labels=None):
    '''
    Gets clusters from a linkage given a threshold and a criterion.

    Parameters
    ----------
    linkage : numpy.ndarray
        The hierarchical clustering encoded with the matrix returned by
        the linkage function (Z).

    threshold : float
        The threshold to apply when forming flat clusters.

    criterion : str, 'maxclust' by default
        The criterion to use in forming flat clusters. Depending on the
        criterion, the threshold has different meanings. More information on:
        https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.cluster.hierarchy.fcluster.html

    labels : array-like, None by default
        List of labels of the elements contained in the linkage. The order
        must match the order they were provided when generating the linkage.

    Returns
    -------
    clusters : dict
        A dictionary containing the clusters obtained. The keys correspond to
        the cluster numbers and the values to a list with element names given the
        labels, or the element index based on the linkage.
    '''

    cluster_ids = hc.fcluster(linkage, threshold, criterion=criterion)
    clusters = dict()
    for c in np.unique(cluster_ids):
        clusters[c] = []

    for i, c in enumerate(cluster_ids):
        if labels is not None:
            clusters[c].append(labels[i])
        else:
            clusters[c].append(i)
    return clusters
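
A minimal sketch of the same grouping step, written against SciPy only: flat cluster IDs from fcluster are collected into a dictionary of labels. The points and labels are invented for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy as hc

points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = ['a', 'b', 'c', 'd']  # same order as the rows of points
Z = hc.linkage(points, method='ward')

# With criterion='maxclust', the threshold is the maximum number of clusters
cluster_ids = hc.fcluster(Z, 2, criterion='maxclust')

# Group the labels by cluster ID
clusters = {}
for label, cid in zip(labels, cluster_ids):
    clusters.setdefault(int(cid), []).append(label)

print(clusters)
```

With other criteria, such as 'distance', the same threshold argument is interpreted as a cut-off height instead of a cluster count.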

core special

cci_scores

compute_braycurtis_like_cci_score(cell1, cell2, ppi_score=None)

Calculates a Bray-Curtis-like score for the interaction between two cells based on their intercellular protein-protein interactions such as ligand-receptor interactions.

Parameters

cell1 : cell2cell.core.cell.Cell First cell-type/tissue/sample to compute interaction between a pair of them. In a directed interaction, this is the sender.

cell2 : cell2cell.core.cell.Cell Second cell-type/tissue/sample to compute interaction between a pair of them. In a directed interaction, this is the receiver.

Returns

cci_score : float Overall score for the interaction between a pair of cell-types/tissues/samples. In this case, it is a Bray-Curtis-like score.

Source code in cell2cell/core/cci_scores.py
def compute_braycurtis_like_cci_score(cell1, cell2, ppi_score=None):
    '''Calculates a Bray-Curtis-like score for the interaction between
    two cells based on their intercellular protein-protein
    interactions such as ligand-receptor interactions.

    Parameters
    ----------
    cell1 : cell2cell.core.cell.Cell
        First cell-type/tissue/sample to compute interaction
        between a pair of them. In a directed interaction,
        this is the sender.

    cell2 : cell2cell.core.cell.Cell
        Second cell-type/tissue/sample to compute interaction
        between a pair of them. In a directed interaction,
        this is the receiver.

    Returns
    -------
    cci_score : float
        Overall score for the interaction between a pair of
        cell-types/tissues/samples. In this case, it is a
        Bray-Curtis-like score.
    '''
    c1 = cell1.weighted_ppi['A'].values
    c2 = cell2.weighted_ppi['B'].values

    if (len(c1) == 0) or (len(c2) == 0):
        return 0.0

    if ppi_score is None:
        ppi_score = np.array([1.0] * len(c1))

    # Bray Curtis similarity
    numerator = 2 * np.nansum(c1 * c2 * ppi_score)
    denominator = np.nansum(c1 * c1 * ppi_score) + np.nansum(c2 * c2 * ppi_score)

    if denominator == 0.0:
        return 0.0

    cci_score = numerator / denominator

    if np.isnan(cci_score):
        return 0.0
    return cci_score
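
The score computed above is 2 * sum(w * a * b) / (sum(w * a^2) + sum(w * b^2)), where a and b are the per-PPI values of the sender and receiver and w is the optional PPI weight. A small worked sketch with invented binary values, using plain NumPy arrays instead of Cell objects:

```python
import numpy as np

# Hypothetical per-PPI values: partner A in the sender, partner B in the receiver
c1 = np.array([1.0, 0.0, 1.0])
c2 = np.array([1.0, 1.0, 0.0])
ppi_score = np.ones(len(c1))  # equal weight for every PPI

numerator = 2 * np.nansum(c1 * c2 * ppi_score)
denominator = np.nansum(c1 * c1 * ppi_score) + np.nansum(c2 * c2 * ppi_score)

# Guard against an all-zero pair, as the function above does
bray_curtis_like = numerator / denominator if denominator != 0 else 0.0
print(bray_curtis_like)  # 0.5
```

Only the first PPI has both partners present, so the numerator is 2 while each cell contributes 2 to the denominator.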

compute_count_score(cell1, cell2, ppi_score=None)

Calculates the number of active protein-protein interactions for the interaction between two cells, which could be the number of active ligand-receptor interactions.

Parameters

cell1 : cell2cell.core.cell.Cell First cell-type/tissue/sample to compute interaction between a pair of them. In a directed interaction, this is the sender.

cell2 : cell2cell.core.cell.Cell Second cell-type/tissue/sample to compute interaction between a pair of them. In a directed interaction, this is the receiver.

Returns

cci_score : float Overall score for the interaction between a pair of cell-types/tissues/samples.

Source code in cell2cell/core/cci_scores.py
def compute_count_score(cell1, cell2, ppi_score=None):
    '''Calculates the number of active protein-protein interactions
    for the interaction between two cells, which could be the number
    of active ligand-receptor interactions.

    Parameters
    ----------
    cell1 : cell2cell.core.cell.Cell
        First cell-type/tissue/sample to compute interaction
        between a pair of them. In a directed interaction,
        this is the sender.

    cell2 : cell2cell.core.cell.Cell
        Second cell-type/tissue/sample to compute interaction
        between a pair of them. In a directed interaction,
        this is the receiver.

    Returns
    -------
    cci_score : float
        Overall score for the interaction between a pair of
        cell-types/tissues/samples.
    '''
    c1 = cell1.weighted_ppi['A'].values
    c2 = cell2.weighted_ppi['B'].values

    if (len(c1) == 0) or (len(c2) == 0):
        return 0.0

    if ppi_score is None:
        ppi_score = np.array([1.0] * len(c1))

    mult = c1 * c2 * ppi_score
    cci_score = np.nansum(mult != 0) # Count all active pathways (different to zero)

    if np.isnan(cci_score):
        return 0.0
    return cci_score
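
As an illustration of the counting rule, a PPI is treated as active when the product of both partners (and the weight) differs from zero. The values below are invented.

```python
import numpy as np

# Hypothetical per-PPI values for a sender (c1) and a receiver (c2)
c1 = np.array([2.0, 0.0, 0.5, 1.0])
c2 = np.array([1.0, 3.0, 0.0, 4.0])
ppi_score = np.ones(len(c1))

mult = c1 * c2 * ppi_score
count = int(np.sum(mult != 0))  # PPIs where both partners are non-zero
print(count)  # 2
```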

compute_icellnet_score(cell1, cell2, ppi_score=None)

Calculates the sum of communication scores for the interaction between two cells. Based on ICELLNET.

Parameters

cell1 : cell2cell.core.cell.Cell First cell-type/tissue/sample to compute interaction between a pair of them. In a directed interaction, this is the sender.

cell2 : cell2cell.core.cell.Cell Second cell-type/tissue/sample to compute interaction between a pair of them. In a directed interaction, this is the receiver.

Returns

cci_score : float Overall score for the interaction between a pair of cell-types/tissues/samples.

Source code in cell2cell/core/cci_scores.py
def compute_icellnet_score(cell1, cell2, ppi_score=None):
    '''Calculates the sum of communication scores
    for the interaction between two cells. Based on ICELLNET.

    Parameters
    ----------
    cell1 : cell2cell.core.cell.Cell
        First cell-type/tissue/sample to compute interaction
        between a pair of them. In a directed interaction,
        this is the sender.

    cell2 : cell2cell.core.cell.Cell
        Second cell-type/tissue/sample to compute interaction
        between a pair of them. In a directed interaction,
        this is the receiver.

    Returns
    -------
    cci_score : float
        Overall score for the interaction between a pair of
        cell-types/tissues/samples.
    '''
    c1 = cell1.weighted_ppi['A'].values
    c2 = cell2.weighted_ppi['B'].values

    if (len(c1) == 0) or (len(c2) == 0):
        return 0.0

    if ppi_score is None:
        ppi_score = np.array([1.0] * len(c1))

    mult = c1 * c2 * ppi_score
    cci_score = np.nansum(mult)

    if np.isnan(cci_score):
        return 0.0
    return cci_score

compute_jaccard_like_cci_score(cell1, cell2, ppi_score=None)

Calculates a Jaccard-like score for the interaction between two cells based on their intercellular protein-protein interactions such as ligand-receptor interactions.

Parameters

cell1 : cell2cell.core.cell.Cell First cell-type/tissue/sample to compute interaction between a pair of them. In a directed interaction, this is the sender.

cell2 : cell2cell.core.cell.Cell Second cell-type/tissue/sample to compute interaction between a pair of them. In a directed interaction, this is the receiver.

Returns

cci_score : float Overall score for the interaction between a pair of cell-types/tissues/samples. In this case it is a Jaccard-like score.

Source code in cell2cell/core/cci_scores.py
def compute_jaccard_like_cci_score(cell1, cell2, ppi_score=None):
    '''Calculates a Jaccard-like score for the interaction between
    two cells based on their intercellular protein-protein
    interactions such as ligand-receptor interactions.

    Parameters
    ----------
    cell1 : cell2cell.core.cell.Cell
        First cell-type/tissue/sample to compute interaction
        between a pair of them. In a directed interaction,
        this is the sender.

    cell2 : cell2cell.core.cell.Cell
        Second cell-type/tissue/sample to compute interaction
        between a pair of them. In a directed interaction,
        this is the receiver.

    Returns
    -------
    cci_score : float
        Overall score for the interaction between a pair of
        cell-types/tissues/samples. In this case it is a
        Jaccard-like score.
    '''
    c1 = cell1.weighted_ppi['A'].values
    c2 = cell2.weighted_ppi['B'].values

    if (len(c1) == 0) or (len(c2) == 0):
        return 0.0

    if ppi_score is None:
        ppi_score = np.array([1.0] * len(c1))

    # Extended Jaccard similarity
    numerator = np.nansum(c1 * c2 * ppi_score)
    denominator = np.nansum(c1 * c1 * ppi_score) + np.nansum(c2 * c2 * ppi_score) - numerator

    if denominator == 0.0:
        return 0.0

    cci_score = numerator / denominator

    if np.isnan(cci_score):
        return 0.0
    return cci_score
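
The Jaccard-like score is sum(a * b) / (sum(a^2) + sum(b^2) - sum(a * b)). A short sketch with invented values follows; it also checks the algebraic relation J = BC / (2 - BC) with the Bray-Curtis-like score BC, which follows from the two definitions when the same weights are used.

```python
import numpy as np

# Hypothetical per-PPI values, unweighted
c1 = np.array([1.0, 0.0, 1.0])
c2 = np.array([1.0, 1.0, 0.0])

shared = np.nansum(c1 * c2)
total = np.nansum(c1 * c1) + np.nansum(c2 * c2)

jaccard_like = shared / (total - shared)
bray_curtis_like = 2 * shared / total

print(jaccard_like)                               # 1/3
print(bray_curtis_like / (2 - bray_curtis_like))  # same value
```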

matmul_bray_curtis_like(A_scores, B_scores, ppi_score=None)

Computes Bray-Curtis-like scores using matrices of proteins by cell-types/tissues/samples.

Parameters

A_scores : array-like Matrix of size NxM, where N are the proteins in the first column of a list of PPIs and M are the cell-types/tissues/samples.

B_scores : array-like Matrix of size NxM, where N are the proteins in the second column of a list of PPIs and M are the cell-types/tissues/samples.

Returns

bray_curtis : numpy.array Matrix MxM, representing the CCI score for all pairs of cell-types/tissues/samples. In directed interactions, the vertical axis (axis 0) represents the senders, while the horizontal axis (axis 1) represents the receivers.

Source code in cell2cell/core/cci_scores.py
def matmul_bray_curtis_like(A_scores, B_scores, ppi_score=None):
    '''Computes Bray-Curtis-like scores using matrices of proteins by
    cell-types/tissues/samples.

    Parameters
    ----------
    A_scores : array-like
        Matrix of size NxM, where N are the proteins in the first
        column of a list of PPIs and M are the
        cell-types/tissues/samples.

    B_scores : array-like
        Matrix of size NxM, where N are the proteins in the second
        column of a list of PPIs and M are the
        cell-types/tissues/samples.

    Returns
    -------
    bray_curtis : numpy.array
        Matrix MxM, representing the CCI score for all pairs of
        cell-types/tissues/samples. In directed interactions,
        the vertical axis (axis 0) represents the senders, while
        the horizontal axis (axis 1) represents the receivers.
    '''
    if ppi_score is None:
        ppi_score = np.array([1.0] * A_scores.shape[0])
    ppi_score = ppi_score.reshape((len(ppi_score), 1))

    numerator = np.matmul(np.multiply(A_scores, ppi_score).transpose(), B_scores)

    A_module = np.sum(np.multiply(np.multiply(A_scores, A_scores), ppi_score), axis=0)
    B_module = np.sum(np.multiply(np.multiply(B_scores, B_scores), ppi_score), axis=0)
    denominator = A_module.reshape((A_module.shape[0], 1)) + B_module

    bray_curtis = np.divide(2.0*numerator, denominator)
    return bray_curtis
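
To see how the matrix form covers all sender-receiver pairs at once, the sketch below reproduces the computation on two invented 3 x 2 matrices (3 PPIs, 2 cell types) without PPI weights. Entry [i, j] is the score with cell type i as sender and j as receiver.

```python
import numpy as np

# Rows are PPIs, columns are cell types (hypothetical binary values)
A_scores = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
B_scores = np.array([[1.0, 1.0],
                     [1.0, 0.0],
                     [0.0, 1.0]])

numerator = A_scores.T @ B_scores          # M x M dot products
A_module = np.sum(A_scores ** 2, axis=0)   # one value per sender
B_module = np.sum(B_scores ** 2, axis=0)   # one value per receiver

# Broadcasting a column against a row gives every sender-receiver sum
denominator = A_module[:, None] + B_module[None, :]
scores = 2.0 * numerator / denominator
print(scores)
```

Here scores[0, 1] equals 1.0 because the first column of A_scores is identical to the second column of B_scores.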

matmul_cosine(A_scores, B_scores, ppi_score=None)

Computes cosine-similarity scores using matrices of proteins by cell-types/tissues/samples.

Parameters

A_scores : array-like Matrix of size NxM, where N are the proteins in the first column of a list of PPIs and M are the cell-types/tissues/samples.

B_scores : array-like Matrix of size NxM, where N are the proteins in the second column of a list of PPIs and M are the cell-types/tissues/samples.

Returns

cosine : numpy.array Matrix MxM, representing the CCI score for all pairs of cell-types/tissues/samples. In directed interactions, the vertical axis (axis 0) represents the senders, while the horizontal axis (axis 1) represents the receivers.

Source code in cell2cell/core/cci_scores.py
def matmul_cosine(A_scores, B_scores, ppi_score=None):
    '''Computes cosine-similarity scores using matrices of proteins by
    cell-types/tissues/samples.

    Parameters
    ----------
    A_scores : array-like
        Matrix of size NxM, where N are the proteins in the first
        column of a list of PPIs and M are the
        cell-types/tissues/samples.

    B_scores : array-like
        Matrix of size NxM, where N are the proteins in the second
        column of a list of PPIs and M are the
        cell-types/tissues/samples.

    Returns
    -------
    cosine : numpy.array
        Matrix MxM, representing the CCI score for all pairs of
        cell-types/tissues/samples. In directed interactions,
        the vertical axis (axis 0) represents the senders, while
        the horizontal axis (axis 1) represents the receivers.
    '''
    if ppi_score is None:
        ppi_score = np.array([1.0] * A_scores.shape[0])
    ppi_score = ppi_score.reshape((len(ppi_score), 1))

    numerator = np.matmul(np.multiply(A_scores, ppi_score).transpose(), B_scores)

    A_module = np.sum(np.multiply(np.multiply(A_scores, A_scores), ppi_score), axis=0) ** 0.5
    B_module = np.sum(np.multiply(np.multiply(B_scores, B_scores), ppi_score), axis=0) ** 0.5
    denominator = A_module.reshape((A_module.shape[0], 1)) * B_module

    cosine = np.divide(numerator, denominator)
    return cosine
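
A small worked sketch of the cosine computation on invented 2 x 2 matrices (2 PPIs, 2 cell types), without PPI weights. The only difference from the Bray-Curtis-like form is that the column norms are multiplied rather than the squared norms added.

```python
import numpy as np

# Rows are PPIs, columns are cell types (hypothetical values)
A_scores = np.array([[3.0, 0.0],
                     [4.0, 1.0]])
B_scores = np.array([[3.0, 1.0],
                     [4.0, 0.0]])

numerator = A_scores.T @ B_scores
A_module = np.sqrt(np.sum(A_scores ** 2, axis=0))  # column norms of A
B_module = np.sqrt(np.sum(B_scores ** 2, axis=0))  # column norms of B
denominator = A_module[:, None] * B_module[None, :]

cosine = numerator / denominator
print(cosine)
```

A column made entirely of zeros has norm 0, so the corresponding entries would be a 0/0 division and evaluate to NaN.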

matmul_count_active(A_scores, B_scores, ppi_score=None)

Computes the count of active protein-protein interactions used for intercellular communication using matrices of proteins by cell-types/tissues/samples.

Parameters

A_scores : array-like Matrix of size NxM, where N are the proteins in the first column of a list of PPIs and M are the cell-types/tissues/samples.

B_scores : array-like Matrix of size NxM, where N are the proteins in the second column of a list of PPIs and M are the cell-types/tissues/samples.

Returns

counts : numpy.array Matrix MxM, representing the CCI score for all pairs of cell-types/tissues/samples. In directed interactions, the vertical axis (axis 0) represents the senders, while the horizontal axis (axis 1) represents the receivers.

Source code in cell2cell/core/cci_scores.py
def matmul_count_active(A_scores, B_scores, ppi_score=None):
    '''Computes the count of active protein-protein interactions
    used for intercellular communication using matrices of proteins by
    cell-types/tissues/samples.

    Parameters
    ----------
    A_scores : array-like
        Matrix of size NxM, where N are the proteins in the first
        column of a list of PPIs and M are the
        cell-types/tissues/samples.

    B_scores : array-like
        Matrix of size NxM, where N are the proteins in the second
        column of a list of PPIs and M are the
        cell-types/tissues/samples.

    Returns
    -------
    counts : numpy.array
        Matrix MxM, representing the CCI score for all pairs of
        cell-types/tissues/samples. In directed interactions,
        the vertical axis (axis 0) represents the senders, while
        the horizontal axis (axis 1) represents the receivers.
    '''
    if ppi_score is None:
        ppi_score = np.array([1.0] * A_scores.shape[0])
    ppi_score = ppi_score.reshape((len(ppi_score), 1))

    counts = np.matmul(np.multiply(A_scores, ppi_score).transpose(), B_scores)
    return counts
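
The function body is a plain matrix product, so the result is a count only when the inputs are binary (1 for a present protein, 0 otherwise). The sketch below assumes such binarized inputs; with continuous values the same product would instead give a sum of expression products.

```python
import numpy as np

# Binarized presence/absence: rows are PPIs, columns are cell types
A_scores = np.array([[1, 0],
                     [0, 1],
                     [1, 1]])
B_scores = np.array([[1, 1],
                     [1, 0],
                     [0, 1]])

# counts[i, j] = number of PPIs with partner A present in sender i
# and partner B present in receiver j
counts = A_scores.T @ B_scores
print(counts)
```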

matmul_jaccard_like(A_scores, B_scores, ppi_score=None)

Computes Jaccard-like scores using matrices of proteins by cell-types/tissues/samples.

Parameters

A_scores : array-like Matrix of size NxM, where N are the proteins in the first column of a list of PPIs and M are the cell-types/tissues/samples.

B_scores : array-like Matrix of size NxM, where N are the proteins in the second column of a list of PPIs and M are the cell-types/tissues/samples.

Returns

jaccard : numpy.array Matrix MxM, representing the CCI score for all pairs of cell-types/tissues/samples. In directed interactions, the vertical axis (axis 0) represents the senders, while the horizontal axis (axis 1) represents the receivers.

Source code in cell2cell/core/cci_scores.py
def matmul_jaccard_like(A_scores, B_scores, ppi_score=None):
    '''Computes Jaccard-like scores using matrices of proteins by
    cell-types/tissues/samples.

    Parameters
    ----------
    A_scores : array-like
        Matrix of size NxM, where N are the proteins in the first
        column of a list of PPIs and M are the
        cell-types/tissues/samples.

    B_scores : array-like
        Matrix of size NxM, where N are the proteins in the second
        column of a list of PPIs and M are the
        cell-types/tissues/samples.

    Returns
    -------
    jaccard : numpy.array
        Matrix MxM, representing the CCI score for all pairs of
        cell-types/tissues/samples. In directed interactions,
        the vertical axis (axis 0) represents the senders, while
        the horizontal axis (axis 1) represents the receivers.
    '''
    if ppi_score is None:
        ppi_score = np.array([1.0] * A_scores.shape[0])
    ppi_score = ppi_score.reshape((len(ppi_score), 1))

    numerator = np.matmul(np.multiply(A_scores, ppi_score).transpose(), B_scores)

    A_module = np.sum(np.multiply(np.multiply(A_scores, A_scores), ppi_score), axis=0)
    B_module = np.sum(np.multiply(np.multiply(B_scores, B_scores), ppi_score), axis=0)
    denominator = A_module.reshape((A_module.shape[0], 1)) + B_module - numerator

    jaccard = np.divide(numerator, denominator)
    return jaccard

cell

Cell

Specific cell-type/tissue/organ element in an RNAseq dataset.

Parameters

sc_rnaseq_data : pandas.DataFrame A gene expression matrix with a single column that corresponds to a specific cell-type/tissue/sample, while the rows are genes. The column name will be the label of the instance.

verbose : boolean, default=True Whether to print the steps of the analysis.

Attributes

id : int ID number of the instance generated.

type : str Name of the respective cell-type/tissue/sample.

rnaseq_data : pandas.DataFrame Copy of sc_rnaseq_data.

weighted_ppi : pandas.DataFrame Dataframe created from a list of protein-protein interactions, where the columns of the interacting proteins are replaced by a score or a preprocessed gene expression of the respective proteins.

Source code in cell2cell/core/cell.py
class Cell:
    '''Specific cell-type/tissue/organ element in an RNAseq dataset.

    Parameters
    ----------
    sc_rnaseq_data : pandas.DataFrame
        A gene expression matrix with a single column that
        corresponds to a specific cell-type/tissue/sample, while
        the rows are genes. The column name will be the label
        of the instance.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Attributes
    ----------
    id : int
        ID number of the instance generated.

    type : str
        Name of the respective cell-type/tissue/sample.

    rnaseq_data : pandas.DataFrame
        Copy of sc_rnaseq_data.

    weighted_ppi : pandas.DataFrame
        Dataframe created from a list of protein-protein interactions,
        where the columns of the interacting proteins are replaced by
        a score or a preprocessed gene expression of the respective
        proteins.
    '''
    _id_counter = 0  # Number of active instances
    _id = 0 # Unique ID

    def __init__(self, sc_rnaseq_data, verbose=True):
        self.id = Cell._id
        Cell._id_counter += 1
        Cell._id += 1

        self.type = str(sc_rnaseq_data.columns[-1])

        # RNAseq datasets
        self.rnaseq_data = sc_rnaseq_data.copy()
        self.rnaseq_data.columns = ['value']

        # Binary ppi datasets
        self.weighted_ppi = pd.DataFrame(columns=['A', 'B', 'score'])

        # Object created
        if verbose:
            print("New cell instance created for " + self.type)

    def __del__(self):
        Cell._id_counter -= 1

    def __str__(self):
        return str(self.type)

    __repr__ = __str__

__repr__(self) special

Return str(self).

Source code in cell2cell/core/cell.py
def __str__(self):
    return str(self.type)

get_cells_from_rnaseq(rnaseq_data, cell_columns=None, verbose=True)

Creates new instances of Cell based on the RNAseq data of each cell-type/tissue/sample in a gene expression matrix.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for an RNA-seq experiment. Columns are cell-types/tissues/samples and rows are genes.

cell_columns : array-like, default=None List of names of cell-types/tissues/samples in the dataset to be used. If None, all columns will be used.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

cells : dict Dictionary containing all Cell instances generated from an RNAseq dataset. The keys of this dictionary are the names of the corresponding Cell instances.

Source code in cell2cell/core/cell.py
def get_cells_from_rnaseq(rnaseq_data, cell_columns=None, verbose=True):
    '''
    Creates new instances of Cell based on the RNAseq data of each
    cell-type/tissue/sample in a gene expression matrix.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for an RNA-seq experiment. Columns are
        cell-types/tissues/samples and rows are genes.

    cell_columns : array-like, default=None
        List of names of cell-types/tissues/samples in the dataset
        to be used. If None, all columns will be used.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    cells : dict
        Dictionary containing all Cell instances generated from an RNAseq dataset.
        The keys of this dictionary are the names of the corresponding Cell instances.
    '''
    if verbose:
        print("Generating objects according to RNAseq datasets provided")
    cells = dict()
    if cell_columns is None:
        cell_columns = rnaseq_data.columns

    for cell in cell_columns:
        cells[cell] = Cell(rnaseq_data[[cell]], verbose=verbose)
    return cells
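
The loop above relies on rnaseq_data[[cell]] (double brackets), which keeps each slice as a one-column DataFrame, the shape that Cell expects. The sketch below shows only that slicing step with pandas on an invented expression table; it does not construct Cell objects, and the gene and cell-type names are made up.

```python
import pandas as pd

# Hypothetical expression table: genes in rows, cell types in columns
rnaseq_data = pd.DataFrame({'T cell': [5.0, 0.0],
                            'B cell': [1.0, 2.0]},
                           index=['GeneA', 'GeneB'])

# Double brackets return a one-column DataFrame instead of a Series
per_cell = {name: rnaseq_data[[name]] for name in rnaseq_data.columns}

print(per_cell['T cell'].shape)        # (2, 1)
print(per_cell['T cell'].columns[-1])  # T cell
```

The last column name of each slice is what the Cell constructor stores as the type of the instance.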

communication_scores

aggregate_ccc_matrices(ccc_matrices, method='gmean')

Aggregates matrices of communication scores. Each matrix has the communication scores across all pairs of cell-types/tissues/samples for a different pair of interacting proteins.

Parameters

ccc_matrices : list List of matrices of communication scores. Each matrix is for a specific pair of interacting proteins.

method : str, default='gmean' Method to aggregate the matrices element-wise. Options are:

- 'gmean' : Geometric mean in an element-wise way.
- 'sum' : Sum in an element-wise way.
- 'mean' : Mean in an element-wise way.

Returns

aggregated_ccc_matrix : numpy.array A matrix containing aggregated communication scores from multiple PPIs. Its shape is MxM, where M is the number of cell-types/tissues/samples. In directed interactions, the vertical axis (axis 0) represents the senders, while the horizontal axis (axis 1) represents the receivers.

Source code in cell2cell/core/communication_scores.py
def aggregate_ccc_matrices(ccc_matrices, method='gmean'):
    '''Aggregates matrices of communication scores. Each
    matrix has the communication scores across all pairs
    of cell-types/tissues/samples for a different
    pair of interacting proteins.

    Parameters
    ----------
    ccc_matrices : list
        List of matrices of communication scores. Each matrix
        is for a specific pair of interacting proteins.

    method : str, default='gmean'
        Method to aggregate the matrices element-wise.
        Options are:

        - 'gmean' : Geometric mean in an element-wise way.
        - 'sum' : Sum in an element-wise way.
        - 'mean' : Mean in an element-wise way.

    Returns
    -------
    aggregated_ccc_matrix : numpy.array
        A matrix containing aggregated communication scores
        from multiple PPIs. Its shape is MxM, where M are all
        cell-types/tissues/samples. In directed interactions, the
        vertical axis (axis 0) represents the senders, while the
        horizontal axis (axis 1) represents the receivers.
    '''
    if method == 'gmean':
        aggregated_ccc_matrix = gmean(ccc_matrices)
    elif method == 'sum':
        aggregated_ccc_matrix = np.nansum(ccc_matrices, axis=0)
    elif method == 'mean':
        aggregated_ccc_matrix = np.nanmean(ccc_matrices, axis=0)
    else:
        raise ValueError("Not a valid method")

    return aggregated_ccc_matrix
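
As a sketch of the three aggregation options, the snippet below stacks two invented 2 x 2 score matrices (one per PPI) and aggregates them element-wise along the first axis. It assumes gmean refers to scipy.stats.gmean, whose default axis is 0.

```python
import numpy as np
from scipy.stats import gmean

# One hypothetical M x M matrix of communication scores per PPI
ccc_matrices = np.stack([np.array([[1.0, 4.0], [9.0, 16.0]]),
                         np.array([[4.0, 1.0], [1.0, 4.0]])])

geometric = gmean(ccc_matrices)              # element-wise along axis 0
summed = np.nansum(ccc_matrices, axis=0)
averaged = np.nanmean(ccc_matrices, axis=0)

print(geometric)
print(summed)
print(averaged)
```

In every case the output keeps the M x M shape of a single input matrix.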

compute_ccc_matrix(prot_a_exp, prot_b_exp, communication_score='expression_product')

Computes communication scores for a specific protein-protein interaction using vectors of gene expression levels for a given interacting protein produced by different cell-types/tissues/samples.

Parameters

prot_a_exp : array-like Vector with gene expression levels for an interacting protein A in a given PPI. Coordinates are different cell-types/tissues/samples.

prot_b_exp : array-like Vector with gene expression levels for an interacting protein B in a given PPI. Coordinates are different cell-types/tissues/samples.

communication_score : str, default='expression_product' Scoring function for computing the communication score. Options are:

- 'expression_product' : Multiplication between the expression
    of the interacting proteins.
- 'expression_mean' : Average between the expression
    of the interacting proteins.
- 'expression_gmean' : Geometric mean between the expression
    of the interacting proteins.

Returns

communication_scores : numpy.array Matrix MxM, representing the CCC scores of a specific PPI across all pairs of cell-types/tissues/samples. M are all cell-types/tissues/samples. In directed interactions, the vertical axis (axis 0) represents the senders, while the horizontal axis (axis 1) represents the receivers.
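
The three scoring options can be illustrated with NumPy on invented expression vectors for two cell types. This is a sketch of the arithmetic only: entry [i, j] combines protein A in sender i with protein B in receiver j.

```python
import numpy as np

# Hypothetical expression of protein A and protein B across two cell types
prot_a_exp = np.array([1.0, 4.0])
prot_b_exp = np.array([4.0, 1.0])

# 'expression_product': a_i * b_j for every sender i and receiver j
product = np.outer(prot_a_exp, prot_b_exp)

# 'expression_mean': (a_i + b_j) / 2, written here with broadcasting
mean = (prot_a_exp[:, None] + prot_b_exp[None, :]) / 2.0

# 'expression_gmean': sqrt(a_i * b_j)
geometric = np.sqrt(product)

print(product)
print(mean)
print(geometric)
```

The product and geometric mean are zero whenever either partner is absent, whereas the arithmetic mean can stay positive with only one partner expressed.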

Source code in cell2cell/core/communication_scores.py
def compute_ccc_matrix(prot_a_exp, prot_b_exp, communication_score='expression_product'):
    '''Computes communication scores for a specific
    protein-protein interaction using vectors of gene expression
    levels for a given interacting protein produced by
    different cell-types/tissues/samples.

    Parameters
    ----------
    prot_a_exp : array-like
        Vector with gene expression levels for an interacting protein A
        in a given PPI. Coordinates are different cell-types/tissues/samples.

    prot_b_exp : array-like
        Vector with gene expression levels for an interacting protein B
        in a given PPI. Coordinates are different cell-types/tissues/samples.

    communication_score : str, default='expression_product'
        Scoring function for computing the communication score.
        Options are:

        - 'expression_product' : Multiplication between the expression
            of the interacting proteins.
        - 'expression_mean' : Average between the expression
            of the interacting proteins.
        - 'expression_gmean' : Geometric mean between the expression
            of the interacting proteins.

    Returns
    -------
    communication_scores : numpy.array
        Matrix MxM, representing the CCC scores of a specific PPI
        across all pairs of cell-types/tissues/samples. M are all
        cell-types/tissues/samples. In directed interactions, the
        vertical axis (axis 0) represents the senders, while the
        horizontal axis (axis 1) represents the receivers.
    '''
    if communication_score == 'expression_product':
        communication_scores = np.outer(prot_a_exp, prot_b_exp)
    elif communication_score == 'expression_mean':
        communication_scores = (np.outer(prot_a_exp, np.ones(prot_b_exp.shape)) + np.outer(np.ones(prot_a_exp.shape), prot_b_exp)) / 2.
    elif communication_score == 'expression_gmean':
        communication_scores = np.sqrt(np.outer(prot_a_exp, prot_b_exp))
    else:
        raise ValueError("Not a valid communication_score")
    return communication_scores
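
The three scoring options can be reproduced on a toy pair of expression vectors, as in the sketch below. The expression values and the two cell types are hypothetical. Rows of each resulting matrix correspond to senders (protein A) and columns to receivers (protein B).

```python
import numpy as np

# Hypothetical expression of each partner across two cell types (X, Y).
prot_a_exp = np.array([1.0, 4.0])  # protein A, expressed by the senders
prot_b_exp = np.array([9.0, 1.0])  # protein B, expressed by the receivers

# 'expression_product': outer product of the two vectors.
product = np.outer(prot_a_exp, prot_b_exp)

# 'expression_mean': pairwise average, written here with broadcasting
# (equivalent to the two outer products with vectors of ones).
mean = (prot_a_exp[:, None] + prot_b_exp[None, :]) / 2.0

# 'expression_gmean': square root of the pairwise product.
gmean = np.sqrt(product)

print(product)
print(mean)
print(gmean)
```

Entry `[0, 1]` of each matrix is the score for X as sender and Y as receiver, which in general differs from entry `[1, 0]`.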

get_binary_scores(cell1, cell2, ppi_score=None)

Computes binary communication scores for all protein-protein interactions between a pair of cell-types/tissues/samples. This corresponds to an AND function between binary values for each interacting protein coming from each cell.

Parameters

cell1 : cell2cell.core.cell.Cell First cell-type/tissue/sample to compute the communication score. In a directed interaction, this is the sender.

cell2 : cell2cell.core.cell.Cell Second cell-type/tissue/sample to compute the communication score. In a directed interaction, this is the receiver.

ppi_score : array-like, default=None An array with a weight for each PPI. The weight multiplies the communication scores.

Returns

communication_scores : numpy.array An array with the communication scores for each intercellular PPI.

Source code in cell2cell/core/communication_scores.py
def get_binary_scores(cell1, cell2, ppi_score=None):
    '''Computes binary communication scores for all
    protein-protein interactions between a pair of
    cell-types/tissues/samples. This corresponds to
    an AND function between binary values for each
    interacting protein coming from each cell.

    Parameters
    ----------
    cell1 : cell2cell.core.cell.Cell
        First cell-type/tissue/sample to compute the communication
        score. In a directed interaction, this is the sender.

    cell2 : cell2cell.core.cell.Cell
        Second cell-type/tissue/sample to compute the communication
        score. In a directed interaction, this is the receiver.

    ppi_score : array-like, default=None
        An array with a weight for each PPI. The weight
        multiplies the communication scores.

    Returns
    -------
    communication_scores : numpy.array
        An array with the communication scores for each intercellular
        PPI.
    '''
    c1 = cell1.weighted_ppi['A'].values
    c2 = cell2.weighted_ppi['B'].values

    if (len(c1) == 0) or (len(c2) == 0):
        return 0.0

    if ppi_score is None:
        ppi_score = np.array([1.0] * len(c1))

    communication_scores = c1 * c2 * ppi_score
    return communication_scores
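
The AND behaviour follows from multiplying binary vectors element-wise. The sketch below uses plain arrays in place of the `weighted_ppi` columns of two `Cell` objects; the values and weights are hypothetical.

```python
import numpy as np

# Binarized expression for four PPIs (hypothetical values).
c1 = np.array([1, 0, 1, 1])  # partner A in the sender
c2 = np.array([1, 1, 0, 1])  # partner B in the receiver

# Without weights, the product is 1 only where both partners are present.
unweighted = c1 * c2

# Optional per-PPI weights scale the binary scores.
ppi_score = np.array([1.0, 1.0, 1.0, 0.5])
weighted = c1 * c2 * ppi_score

# The unweighted result matches a logical AND of the two vectors.
same_as_and = np.array_equal(unweighted, np.logical_and(c1, c2).astype(int))

print(unweighted, weighted, same_as_and)
```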

get_continuous_scores(cell1, cell2, ppi_score=None, method='expression_product')

Computes continuous communication scores for all protein-protein interactions between a pair of cell-types/tissues/samples. This corresponds to a specific scoring function between preprocessed continuous expression values for each interacting protein coming from each cell.

Parameters

cell1 : cell2cell.core.cell.Cell First cell-type/tissue/sample to compute the communication score. In a directed interaction, this is the sender.

cell2 : cell2cell.core.cell.Cell Second cell-type/tissue/sample to compute the communication score. In a directed interaction, this is the receiver.

ppi_score : array-like, default=None An array with a weight for each PPI. The weight multiplies the communication scores.

method : str, default='expression_product' Scoring function for computing the communication score. Options are:

- 'expression_product' : Multiplication between the expression
    of the interacting proteins. One coming from cell1 and the
    other from cell2.
- 'expression_mean' : Average between the expression
    of the interacting proteins. One coming from cell1 and the
    other from cell2.
- 'expression_gmean' : Geometric mean between the expression
    of the interacting proteins. One coming from cell1 and the
    other from cell2.

Returns

communication_scores : numpy.array An array with the communication scores for each intercellular PPI.

Source code in cell2cell/core/communication_scores.py
def get_continuous_scores(cell1, cell2, ppi_score=None, method='expression_product'):
    '''Computes continuous communication scores for all
    protein-protein interactions between a pair of
    cell-types/tissues/samples. This corresponds to
    a specific scoring function between preprocessed continuous
    expression values for each interacting protein coming from
    each cell.

    Parameters
    ----------
    cell1 : cell2cell.core.cell.Cell
        First cell-type/tissue/sample to compute the communication
        score. In a directed interaction, this is the sender.

    cell2 : cell2cell.core.cell.Cell
        Second cell-type/tissue/sample to compute the communication
        score. In a directed interaction, this is the receiver.

    ppi_score : array-like, default=None
        An array with a weight for each PPI. The weight
        multiplies the communication scores.

    method : str, default='expression_product'
        Scoring function for computing the communication score.
        Options are:
            - 'expression_product' : Multiplication between the expression
                of the interacting proteins. One coming from cell1 and the
                other from cell2.
            - 'expression_mean' : Average between the expression
                of the interacting proteins. One coming from cell1 and the
                other from cell2.
            - 'expression_gmean' : Geometric mean between the expression
                of the interacting proteins. One coming from cell1 and the
                other from cell2.

    Returns
    -------
    communication_scores : numpy.array
        An array with the communication scores for each intercellular
        PPI.
    '''
    c1 = cell1.weighted_ppi['A'].values
    c2 = cell2.weighted_ppi['B'].values

    if method == 'expression_product':
        communication_scores = score_expression_product(c1, c2)
    elif method == 'expression_mean':
        communication_scores = score_expression_mean(c1, c2)
    elif method == 'expression_gmean':
        communication_scores = np.sqrt(score_expression_product(c1, c2))
    else:
        raise ValueError('{} is not implemented yet'.format(method))

    if ppi_score is None:
        ppi_score = np.array([1.0] * len(c1))

    communication_scores = communication_scores * ppi_score
    return communication_scores
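
A minimal array-only sketch of the same scoring logic is shown below. `continuous_scores` is a hypothetical helper written for this example, not part of cell2cell; it takes plain vectors instead of `Cell` objects, and all input values are made up.

```python
import numpy as np

def continuous_scores(c1, c2, ppi_score=None, method='expression_product'):
    """Array-only sketch of the scoring step (hypothetical helper)."""
    c1 = np.asarray(c1, dtype=float)
    c2 = np.asarray(c2, dtype=float)
    if method == 'expression_product':
        scores = c1 * c2
    elif method == 'expression_mean':
        scores = (c1 + c2) / 2.0
    elif method == 'expression_gmean':
        scores = np.sqrt(c1 * c2)
    else:
        raise ValueError('{} is not implemented yet'.format(method))
    # Default to a weight of 1 for every PPI.
    if ppi_score is None:
        ppi_score = np.ones(len(c1))
    return scores * np.asarray(ppi_score, dtype=float)

ligands = [4.0, 0.0, 9.0]    # partner A in the sender, one value per PPI
receptors = [1.0, 5.0, 4.0]  # partner B in the receiver
weights = [1.0, 1.0, 0.5]    # optional PPI weights

product = continuous_scores(ligands, receptors)
mean = continuous_scores(ligands, receptors, method='expression_mean')
weighted_gmean = continuous_scores(ligands, receptors, ppi_score=weights,
                                   method='expression_gmean')
print(product, mean, weighted_gmean)
```

One practical difference between the options: for the second PPI, where the ligand is not expressed, the product and geometric mean are zero, while the mean is still positive.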

score_expression_mean(c1, c2)

Computes the expression mean score

Parameters

c1 : array-like A 1D-array containing the preprocessed expression values for the interactors in the first column of a list of protein-protein interactions.

c2 : array-like A 1D-array containing the preprocessed expression values for the interactors in the second column of a list of protein-protein interactions.

Returns

(c1 + c2)/2. : array-like Average of vectors.

Source code in cell2cell/core/communication_scores.py
def score_expression_mean(c1, c2):
    '''Computes the expression mean score

    Parameters
    ----------
    c1 : array-like
        A 1D-array containing the preprocessed expression values
        for the interactors in the first column of a list of
        protein-protein interactions.

    c2 : array-like
        A 1D-array containing the preprocessed expression values
        for the interactors in the second column of a list of
        protein-protein interactions.

    Returns
    -------
    (c1 + c2)/2. : array-like
        Average of vectors.
    '''
    if (len(c1) == 0) or (len(c2) == 0):
        return 0.0
    return (c1 + c2)/2.

score_expression_product(c1, c2)

Computes the expression product score

Parameters

c1 : array-like A 1D-array containing the preprocessed expression values for the interactors in the first column of a list of protein-protein interactions.

c2 : array-like A 1D-array containing the preprocessed expression values for the interactors in the second column of a list of protein-protein interactions.

Returns

c1 * c2 : array-like Multiplication of vectors.

Source code in cell2cell/core/communication_scores.py
def score_expression_product(c1, c2):
    '''Computes the expression product score

    Parameters
    ----------
    c1 : array-like
        A 1D-array containing the preprocessed expression values
        for the interactors in the first column of a list of
        protein-protein interactions.

    c2 : array-like
        A 1D-array containing the preprocessed expression values
        for the interactors in the second column of a list of
        protein-protein interactions.

    Returns
    -------
    c1 * c2 : array-like
        Multiplication of vectors.
    '''
    if (len(c1) == 0) or (len(c2) == 0):
        return 0.0
    return c1 * c2

interaction_space

InteractionSpace

Interaction space that contains all the required elements to perform the analysis between every pair of cells.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for a bulk RNA-seq experiment or a single-cell experiment after aggregation into cell types. Columns are cell-types/tissues/samples and rows are genes.

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

gene_cutoffs : dict Contains two keys: 'type' and 'parameter'. The first key represents the way to use a cutoff or threshold, while 'parameter' is the value used to binarize the expression values. The key 'type' can be:

- 'local_percentile' : computes the value of a given percentile, for each
    gene independently. In this case, the parameter corresponds to the
    percentile to compute, as a float value between 0 and 1.
- 'global_percentile' : computes the value of a given percentile from all
    genes and samples simultaneously. In this case, the parameter
    corresponds to the percentile to compute, as a float value between
    0 and 1. All genes have the same cutoff.
- 'file' : load a cutoff table from a file. Parameter in this case is the
    path of that file. It must contain the same genes as index and same
    samples as columns.
- 'multi_col_matrix' : a dataframe must be provided, containing a cutoff
    for each gene in each sample. This allows using specific cutoffs for
    each sample. The columns here must be the same as the ones in the
    rnaseq_data.
- 'single_col_matrix' : a dataframe must be provided, containing a cutoff
    for each gene in only one column. These cutoffs will be applied to
    all samples.
- 'constant_value' : binarizes the expression. Evaluates whether
    expression is greater than the value input in the parameter.
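
To make the difference between the percentile-based and constant cutoffs concrete, the sketch below binarizes a small made-up expression matrix in three ways. It is an illustration only: the use of `np.quantile` with its default interpolation and the strict "greater than" comparison for the percentile types are assumptions, not a description of how `cutoffs.get_cutoffs` is implemented.

```python
import numpy as np

# Hypothetical expression matrix: rows are genes, columns are samples.
expression = np.array([
    [0.0, 2.0, 4.0],
    [10.0, 20.0, 30.0],
])
percentile = 0.5  # the 'parameter' value for percentile-based types

# 'local_percentile': one cutoff per gene, computed across its samples.
local_cutoffs = np.quantile(expression, percentile, axis=1, keepdims=True)

# 'global_percentile': a single cutoff from all genes and samples.
global_cutoff = np.quantile(expression, percentile)

# 'constant_value': the parameter itself is the cutoff.
constant_cutoff = 3.0

# Binarize: 1 where expression is greater than the cutoff.
local_binary = (expression > local_cutoffs).astype(int)
global_binary = (expression > global_cutoff).astype(int)
constant_binary = (expression > constant_cutoff).astype(int)

print(local_binary)
print(global_binary)
print(constant_binary)
```

With a local cutoff, the lowly expressed first gene can still be called present in its highest sample, whereas the global cutoff marks it absent everywhere.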

communication_score : str, default='expression_thresholding' Type of communication score used to detect active ligand-receptor pairs between each pair of cells. See cell2cell.core.communication_scores for more details. It can be:

- 'expression_thresholding'
- 'expression_product'
- 'expression_mean'
- 'expression_gmean'

cci_score : str, default='bray_curtis' Scoring function to aggregate the communication scores. See cell2cell.core.cci_scores for more details. It can be:

- 'bray_curtis'
- 'jaccard'
- 'count'
- 'icellnet'
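
For intuition about how these options differ, the sketch below computes each score on toy vectors. The Bray-Curtis and Jaccard expressions are the textbook similarity forms for binary vectors and are an assumption here; the "-like" variants in cell2cell.core.cci_scores may differ in detail. All input values are hypothetical.

```python
import numpy as np

# Binarized expression for four LR pairs (hypothetical values).
c1 = np.array([1, 1, 0, 1])  # ligands in the sender
c2 = np.array([1, 0, 0, 1])  # receptors in the receiver

# 'count': number of LR pairs with both partners present.
shared = float(np.sum(c1 * c2))
count = shared

# Textbook Bray-Curtis similarity for binary vectors (assumed form).
bray_curtis = 2.0 * shared / (np.sum(c1) + np.sum(c2))

# Textbook Jaccard similarity for binary vectors (assumed form).
jaccard = shared / (np.sum(c1) + np.sum(c2) - shared)

# 'icellnet': sum of ligand-receptor expression products,
# computed on continuous expression values.
ligand_exp = np.array([2.0, 0.5, 0.0, 3.0])
receptor_exp = np.array([1.0, 0.0, 4.0, 2.0])
icellnet = float(np.sum(ligand_exp * receptor_exp))

print(count, bray_curtis, jaccard, icellnet)
```

The first two scores are bounded between 0 and 1, while 'count' and 'icellnet' are unbounded, which matters when the scores are later converted into distances.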

cci_type : str, default='undirected' Type of interaction between two cells. If it is undirected, all ligands and receptors are considered from both cells. If it is directed, ligands from one cell and receptors from the other are considered separately with respect to ligands from the second cell and receptors from the first one. So, it can be:

- 'undirected'
- 'directed'
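
One way to read the distinction is sketched below with a simple count of matching ligand-receptor pairs. The helper functions and values are hypothetical, and treating the undirected score as the sum of both directions is an assumption made for illustration, not a statement about the library's internals.

```python
import numpy as np

# Binarized ligand and receptor expression for two LR pairs
# in two cell types (hypothetical values).
ligands = {'X': np.array([1, 0]), 'Y': np.array([0, 1])}
receptors = {'X': np.array([1, 0]), 'Y': np.array([1, 0])}

def directed_count(sender, receiver):
    """LR pairs with the ligand in the sender and receptor in the receiver."""
    return int(np.sum(ligands[sender] * receptors[receiver]))

def undirected_count(cell_a, cell_b):
    """Both directions pooled, so the result does not depend on the order."""
    return directed_count(cell_a, cell_b) + directed_count(cell_b, cell_a)

x_to_y = directed_count('X', 'Y')
y_to_x = directed_count('Y', 'X')
x_y_undirected = undirected_count('X', 'Y')
y_x_undirected = undirected_count('Y', 'X')

print(x_to_y, y_to_x, x_y_undirected, y_x_undirected)
```

In the directed case the resulting score matrix is generally asymmetric, while in the undirected case it is symmetric.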

cci_matrix_template : pandas.DataFrame, default=None A matrix of shape MxM where M are cell-types/tissues/samples. This is used as template for storing CCI scores. It may be useful for specifying which pairs of cells to consider.

complex_sep : str, default=None Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

complex_agg_method : str, default='min' Method to aggregate the expression value of multiple genes in a complex.

- 'min' : Minimum expression value among all genes.
- 'mean' : Average expression value among all genes.
- 'gmean' : Geometric mean expression value among all genes.
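
The sketch below shows how a partner name containing `complex_sep` could be split into subunits and aggregated with each method. `aggregate_complex`, the gene names used as dictionary keys, and the expression values are hypothetical and serve only to illustrate the three options.

```python
import math

# Hypothetical expression values for individual genes in one sample.
expression = {'CD74': 8.0, 'CD44': 2.0, 'MIF': 5.0}
complex_sep = '&'

def aggregate_complex(partner, method='min'):
    """Aggregate the expression of all subunits of a partner (sketch)."""
    values = [expression[gene] for gene in partner.split(complex_sep)]
    if method == 'min':
        return min(values)
    elif method == 'mean':
        return sum(values) / len(values)
    elif method == 'gmean':
        return math.prod(values) ** (1.0 / len(values))
    raise ValueError('Not a valid complex_agg_method')

complex_min = aggregate_complex('CD74&CD44', method='min')
complex_mean = aggregate_complex('CD74&CD44', method='mean')
complex_gmean = aggregate_complex('CD74&CD44', method='gmean')
single_gene = aggregate_complex('MIF', method='min')

print(complex_min, complex_mean, complex_gmean, single_gene)
```

The 'min' option is the most conservative: the complex is only as expressed as its least expressed subunit. A partner without the separator is treated as a single-gene "complex" and keeps its own value.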

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

verbose : boolean, default=True Whether to print the steps of the analysis.

Attributes

communication_score : str Type of communication score used to detect active ligand-receptor pairs between each pair of cells. See cell2cell.core.communication_scores for more details. It can be:

- 'expression_thresholding'
- 'expression_product'
- 'expression_mean'
- 'expression_gmean'

cci_score : str Scoring function to aggregate the communication scores. See cell2cell.core.cci_scores for more details. It can be:

- 'bray_curtis'
- 'jaccard'
- 'count'
- 'icellnet'

cci_type : str Type of interaction between two cells. If it is undirected, all ligands and receptors are considered from both cells. If it is directed, ligands from one cell and receptors from the other are considered separately with respect to ligands from the second cell and receptors from the first one. So, it can be:

- 'undirected'
- 'directed'

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

modified_rnaseq : pandas.DataFrame Preprocessed gene expression data for a bulk or single-cell RNA-seq experiment. Columns are cell-types/tissues/samples and rows are genes. The preprocessing may correspond to scoring the gene expression as binary or continuous values depending on the scoring function for cell-cell interactions/communication.

interaction_elements : dict Dictionary containing all the pairs of cells considered (under the key of 'pairs'), Cell instances (under key 'cells') which include all cells/tissues/organs with their associated datasets (rna_seq, weighted_ppi, etc) and a Cell-Cell Interaction Matrix to store CCI scores (under key 'cci_matrix'). A communication matrix is also stored in this object when the communication scores are computed in the InteractionSpace class (under key 'communication_matrix').

distance_matrix : pandas.DataFrame Contains distances for each pair of cells, computed from the CCI scores previously obtained (and stored in interaction_elements['cci_matrix']).
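
The conversion from CCI scores to distances can be sketched as follows, mirroring the two branches written in the source of compute_pairwise_cci_scores: a plain `1 - score` for scores bounded between 0 and 1, and a regularized form for the unbounded 'count' and 'icellnet' scores. The matrices are hypothetical, and plain NumPy arrays stand in for the pandas DataFrames used by the class.

```python
import numpy as np

# Bounded scores (e.g. 'bray_curtis' or 'jaccard'), hypothetical values.
cci_bounded = np.array([[1.0, 0.8],
                        [0.8, 0.5]])
distance_bounded = 1.0 - cci_bounded
np.fill_diagonal(distance_bounded, 0.0)  # no distance from a cell to itself

# Unbounded scores (e.g. 'count' or 'icellnet'), hypothetical values.
cci_counts = np.array([[4.0, 2.0],
                       [2.0, 8.0]])
mean = np.nanmean(cci_counts)
# Regularized distance: scores are rescaled into [0, 1) before subtracting.
distance_counts = 1.0 - cci_counts / (cci_counts + mean)
np.fill_diagonal(distance_counts, 0.0)

print(distance_bounded)
print(distance_counts)
```

Setting the diagonal to zero discards autocrine interactions from the distance matrix, as the comment in the source notes.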

Source code in cell2cell/core/interaction_space.py
class InteractionSpace():
    '''
    Interaction space that contains all the required elements to perform the analysis between every pair of cells.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for a bulk RNA-seq experiment or a single-cell
        experiment after aggregation into cell types. Columns are
        cell-types/tissues/samples and rows are genes.

    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    gene_cutoffs : dict
        Contains two keys: 'type' and 'parameter'. The first key represents the
        way to use a cutoff or threshold, while parameter is the value used
        to binarize the expression values.
        The key 'type' can be:

        - 'local_percentile' : computes the value of a given percentile, for each
            gene independently. In this case, the parameter corresponds to the
            percentile to compute, as a float value between 0 and 1.
        - 'global_percentile' : computes the value of a given percentile from all
            genes and samples simultaneously. In this case, the parameter
            corresponds to the percentile to compute, as a float value between
            0 and 1. All genes have the same cutoff.
        - 'file' : load a cutoff table from a file. Parameter in this case is the
            path of that file. It must contain the same genes as index and same
            samples as columns.
        - 'multi_col_matrix' : a dataframe must be provided, containing a cutoff
            for each gene in each sample. This allows using specific cutoffs for
            each sample. The columns here must be the same as the ones in the
            rnaseq_data.
        - 'single_col_matrix' : a dataframe must be provided, containing a cutoff
            for each gene in only one column. These cutoffs will be applied to
            all samples.
        - 'constant_value' : binarizes the expression. Evaluates whether
            expression is greater than the value input in the parameter.

    communication_score : str, default='expression_thresholding'
        Type of communication score used to detect active ligand-receptor
        pairs between each pair of cells. See
        cell2cell.core.communication_scores for more details.
        It can be:

        - 'expression_thresholding'
        - 'expression_product'
        - 'expression_mean'
        - 'expression_gmean'

    cci_score : str, default='bray_curtis'
        Scoring function to aggregate the communication scores. See
        cell2cell.core.cci_scores for more details.
        It can be:

        - 'bray_curtis'
        - 'jaccard'
        - 'count'
        - 'icellnet'

    cci_type : str, default='undirected'
        Type of interaction between two cells. If it is undirected, all ligands
        and receptors are considered from both cells. If it is directed, ligands
        from one cell and receptors from the other are considered separately with
        respect to ligands from the second cell and receptors from the first one.
        So, it can be:

        - 'undirected'
        - 'directed'

    cci_matrix_template : pandas.DataFrame, default=None
        A matrix of shape MxM where M are cell-types/tissues/samples. This
        is used as template for storing CCI scores. It may be useful
        for specifying which pairs of cells to consider.

    complex_sep : str, default=None
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    complex_agg_method : str, default='min'
        Method to aggregate the expression value of multiple genes in a
        complex.

        - 'min' : Minimum expression value among all genes.
        - 'mean' : Average expression value among all genes.
        - 'gmean' : Geometric mean expression value among all genes.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Attributes
    ----------
    communication_score : str
        Type of communication score used to detect active ligand-receptor
        pairs between each pair of cells. See
        cell2cell.core.communication_scores for more details.
        It can be:

        - 'expression_thresholding'
        - 'expression_product'
        - 'expression_mean'
        - 'expression_gmean'

    cci_score : str
        Scoring function to aggregate the communication scores. See
        cell2cell.core.cci_scores for more details.
        It can be:

        - 'bray_curtis'
        - 'jaccard'
        - 'count'
        - 'icellnet'

    cci_type : str
        Type of interaction between two cells. If it is undirected, all ligands
        and receptors are considered from both cells. If it is directed, ligands
        from one cell and receptors from the other are considered separately with
        respect to ligands from the second cell and receptors from the first one.
        So, it can be:

        - 'undirected'
        - 'directed'

    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    modified_rnaseq : pandas.DataFrame
        Preprocessed gene expression data for a bulk or single-cell RNA-seq experiment.
        Columns are cell-types/tissues/samples and rows are genes. The preprocessing
        may correspond to scoring the gene expression as binary or continuous values
        depending on the scoring function for cell-cell interactions/communication.

    interaction_elements : dict
        Dictionary containing all the pairs of cells considered (under
        the key of 'pairs'), Cell instances (under key 'cells')
        which include all cells/tissues/organs with their associated datasets
        (rna_seq, weighted_ppi, etc) and a Cell-Cell Interaction Matrix
        to store CCI scores (under key 'cci_matrix'). A communication matrix
        is also stored in this object when the communication scores are
        computed in the InteractionSpace class (under key
        'communication_matrix')

    distance_matrix : pandas.DataFrame
        Contains distances for each pair of cells, computed from
        the CCI scores previously obtained (and stored in
        interaction_elements['cci_matrix']).
    '''

    def __init__(self, rnaseq_data, ppi_data, gene_cutoffs, communication_score='expression_thresholding',
                 cci_score='bray_curtis', cci_type='undirected', cci_matrix_template=None, complex_sep=None,
                 complex_agg_method='min', interaction_columns=('A', 'B'), verbose=True):

        self.communication_score = communication_score
        self.cci_score = cci_score
        self.cci_type = cci_type

        if self.communication_score == 'expression_thresholding':
            if 'type' in gene_cutoffs.keys():
                cutoff_values = cutoffs.get_cutoffs(rnaseq_data=rnaseq_data,
                                                    parameters=gene_cutoffs,
                                                    verbose=verbose)
            else:
                raise ValueError("If dataframe is not included in gene_cutoffs, please provide the type of method to obtain them.")
        else:
            cutoff_values = None

        prot_a = interaction_columns[0]
        prot_b = interaction_columns[1]
        self.ppi_data = ppi_data.copy()
        if ('A' in self.ppi_data.columns) and (prot_a != 'A'):
            self.ppi_data = self.ppi_data.drop(columns='A')
        if ('B' in self.ppi_data.columns) and (prot_b != 'B'):
            self.ppi_data = self.ppi_data.drop(columns='B')
        self.ppi_data = self.ppi_data.rename(columns={prot_a : 'A', prot_b : 'B'})
        if 'score' not in self.ppi_data.columns:
            self.ppi_data = self.ppi_data.assign(score=1.0)

        self.modified_rnaseq = integrate_data.get_modified_rnaseq(rnaseq_data=rnaseq_data,
                                                                  cutoffs=cutoff_values,
                                                                  communication_score=self.communication_score,
                                                                  )

        self.interaction_elements = generate_interaction_elements(modified_rnaseq=self.modified_rnaseq,
                                                                  ppi_data=self.ppi_data,
                                                                  cci_matrix_template=cci_matrix_template,
                                                                  cci_type=self.cci_type,
                                                                  complex_sep=complex_sep,
                                                                  complex_agg_method=complex_agg_method,
                                                                  verbose=verbose)

        self.interaction_elements['ppi_score'] = self.ppi_data['score'].values

    def pair_cci_score(self, cell1, cell2, cci_score='bray_curtis', use_ppi_score=False, verbose=True):
        '''
        Computes a CCI score for a pair of cells.

        Parameters
        ----------
        cell1 : cell2cell.core.cell.Cell
            First cell-type/tissue/sample to compute the communication
            score. In a directed interaction, this is the sender.

        cell2 : cell2cell.core.cell.Cell
            Second cell-type/tissue/sample to compute the communication
            score. In a directed interaction, this is the receiver.

        cci_score : str, default='bray_curtis'
            Scoring function to aggregate the communication scores between
            a pair of cells. It computes an overall potential of cell-cell
            interactions.
            Options:

            - 'bray_curtis' : Bray-Curtis-like score
            - 'jaccard' : Jaccard-like score
            - 'count' : Number of LR pairs that the pair of cells uses
            - 'icellnet' : Sum of the L-R expression product of a pair of cells

        use_ppi_score : boolean, default=False
            Whether using a weight of LR pairs specified in the ppi_data
            to compute the scores.

        verbose : boolean, default=True
            Whether to print the steps of the analysis.

        Returns
        -------
        cci_value : float
            Overall score for the interaction between a pair of
            cell-types/tissues/samples, computed with the scoring
            function selected in cci_score.
        '''

        if verbose:
            print("Computing interaction score between {} and {}".format(cell1.type, cell2.type))

        if use_ppi_score:
            ppi_score = self.ppi_data['score'].values
        else:
            ppi_score = None
        # Calculate cell-cell interaction score
        if cci_score == 'bray_curtis':
            cci_value = cci_scores.compute_braycurtis_like_cci_score(cell1, cell2, ppi_score=ppi_score)
        elif cci_score == 'jaccard':
            cci_value = cci_scores.compute_jaccard_like_cci_score(cell1, cell2, ppi_score=ppi_score)
        elif cci_score == 'count':
            cci_value = cci_scores.compute_count_score(cell1, cell2, ppi_score=ppi_score)
        elif cci_score == 'icellnet':
            cci_value = cci_scores.compute_icellnet_score(cell1, cell2, ppi_score=ppi_score)
        else:
            raise NotImplementedError("CCI score {} to compute pairwise cell-interactions is not implemented".format(cci_score))
        return cci_value

    def compute_pairwise_cci_scores(self, cci_score=None, use_ppi_score=False, verbose=True):
        '''Computes overall CCI scores for each pair of cells.

        Parameters
        ----------
        cci_score : str, default=None
            Scoring function to aggregate the communication scores between
            a pair of cells. It computes an overall potential of cell-cell
            interactions. If None, it will use the one stored in the
            attribute cci_score of this object.
            Options:

            - 'bray_curtis' : Bray-Curtis-like score
            - 'jaccard' : Jaccard-like score
            - 'count' : Number of LR pairs that the pair of cells uses
            - 'icellnet' : Sum of the L-R expression product of a pair of cells

        use_ppi_score : boolean, default=False
            Whether using a weight of LR pairs specified in the ppi_data
            to compute the scores.

        verbose : boolean, default=True
            Whether to print the steps of the analysis.

        Returns
        -------
        self.interaction_elements['cci_matrix'] : pandas.DataFrame
            Contains CCI scores for each pair of cells
        '''
        if cci_score is None:
            cci_score = self.cci_score
        else:
            assert isinstance(cci_score, str)

        ### Compute pairwise physical interactions
        if verbose:
            print("Computing pairwise interactions")

        # Compute pair by pair
        for pair in self.interaction_elements['pairs']:
            cell1 = self.interaction_elements['cells'][pair[0]]
            cell2 = self.interaction_elements['cells'][pair[1]]
            cci_value = self.pair_cci_score(cell1,
                                            cell2,
                                            cci_score=cci_score,
                                            use_ppi_score=use_ppi_score,
                                            verbose=verbose)
            self.interaction_elements['cci_matrix'].at[pair[0], pair[1]] = cci_value
            if self.cci_type == 'undirected':
                self.interaction_elements['cci_matrix'].at[pair[1], pair[0]] = cci_value

        # Compute using matmul -> Too slow and uses a lot of memory TODO: Try to optimize this
        # if cci_score == 'bray_curtis':
        #     cci_matrix = cci_scores.matmul_bray_curtis_like(self.interaction_elements['A_score'],
        #                                                     self.interaction_elements['B_score'])
        # self.interaction_elements['cci_matrix'] = pd.DataFrame(cci_matrix,
        #                                                        index=self.interaction_elements['cell_names'],
        #                                                        columns=self.interaction_elements['cell_names']
        #                                                        )

        # Generate distance matrix
        if cci_score not in ['count', 'icellnet']:
            self.distance_matrix = self.interaction_elements['cci_matrix'].apply(lambda x: 1 - x)
        else:
            #self.distance_matrix = self.interaction_elements['cci_matrix'].div(self.interaction_elements['cci_matrix'].max().max()).apply(lambda x: 1 - x)
            # Regularized distance
            mean = np.nanmean(self.interaction_elements['cci_matrix'])
            self.distance_matrix = self.interaction_elements['cci_matrix'].div(self.interaction_elements['cci_matrix'] + mean).apply(lambda x: 1 - x)
        np.fill_diagonal(self.distance_matrix.values, 0.0)  # Make diagonal zero (delete autocrine-interactions)

    def pair_communication_score(self, cell1, cell2, communication_score='expression_thresholding',
                                 use_ppi_score=False, verbose=True):
        '''Computes a communication score for each protein-protein interaction
        between a pair of cells.

        Parameters
        ----------
        cell1 : cell2cell.core.cell.Cell
            First cell-type/tissue/sample to compute the communication
            score. In a directed interaction, this is the sender.

        cell2 : cell2cell.core.cell.Cell
            Second cell-type/tissue/sample to compute the communication
            score. In a directed interaction, this is the receiver.

        communication_score : str, default='expression_thresholding'
            Type of communication score to infer the potential use of
            a given ligand-receptor pair by a pair of cells/tissues/samples.
            It must belong to the same family (binary or continuous) as
            the score used to build this interaction space.
            Available communication_scores are:

            - 'expression_thresholding' : Computes the joint presence of a
                                         ligand from a sender cell and of
                                         a receptor on a receiver cell from
                                         binarizing their gene expression levels.
            - 'expression_mean' : Computes the average between the expression
                                  of a ligand from a sender cell and the
                                  expression of a receptor on a receiver cell.
            - 'expression_product' : Computes the product between the expression
                                    of a ligand from a sender cell and the
                                    expression of a receptor on a receiver cell.
            - 'expression_gmean' : Computes the geometric mean between the expression
                                   of a ligand from a sender cell and the
                                   expression of a receptor on a receiver cell.

        use_ppi_score : boolean, default=False
            Whether to weight the LR pairs by the scores specified
            in ppi_data when computing the scores.

        verbose : boolean, default=True
            Whether to print the steps of the analysis.

        Returns
        -------
        communication_scores : numpy.array
            An array with the communication scores for each intercellular
            PPI.
        '''
        # TODO: Implement communication scores
        if verbose:
            print("Computing communication score between {} and {}".format(cell1.type, cell2.type))

        # Check that new score is the same type as score used to build interaction space (binary or continuous)
        if (communication_score in ['expression_product', 'expression_correlation', 'expression_mean', 'expression_gmean']) \
                & (self.communication_score in ['expression_thresholding', 'differential_combinations']):
            raise ValueError('Cannot use {} for this interaction space'.format(communication_score))
        if (communication_score in ['expression_thresholding', 'differential_combinations']) \
                & (self.communication_score in ['expression_product', 'expression_correlation', 'expression_mean', 'expression_gmean']):
            raise ValueError('Cannot use {} for this interaction space'.format(communication_score))

        if use_ppi_score:
            ppi_score = self.ppi_data['score'].values
        else:
            ppi_score = None

        if communication_score in ['expression_thresholding', 'differential_combinations']:
            communication_value = communication_scores.get_binary_scores(cell1=cell1,
                                                                         cell2=cell2,
                                                                         ppi_score=ppi_score)
        elif communication_score in ['expression_product', 'expression_correlation', 'expression_mean', 'expression_gmean']:
            communication_value = communication_scores.get_continuous_scores(cell1=cell1,
                                                                             cell2=cell2,
                                                                             ppi_score=ppi_score,
                                                                             method=communication_score)
        else:
            raise NotImplementedError(
                "Communication score {} to compute pairwise cell-communication is not implemented".format(communication_score))
        return communication_value

    def compute_pairwise_communication_scores(self, communication_score=None, use_ppi_score=False, ref_ppi_data=None,
                                              interaction_columns=('A', 'B'), cells=None, cci_type=None, verbose=True):
        '''Computes the communication scores for each LR pair in
        each pair of sender-receiver cells.

        Parameters
        ----------
        communication_score : str, default=None
            Type of communication score to infer the potential use of
            a given ligand-receptor pair by a pair of cells/tissues/samples.
            If None, the score stored in the attribute analysis_setup
            will be used.
            Available communication_scores are:

            - 'expression_thresholding' : Computes the joint presence of a
                                         ligand from a sender cell and of
                                         a receptor on a receiver cell from
                                         binarizing their gene expression levels.
            - 'expression_mean' : Computes the average between the expression
                                  of a ligand from a sender cell and the
                                  expression of a receptor on a receiver cell.
            - 'expression_product' : Computes the product between the expression
                                    of a ligand from a sender cell and the
                                    expression of a receptor on a receiver cell.
            - 'expression_gmean' : Computes the geometric mean between the expression
                                    of a ligand from a sender cell and the
                                    expression of a receptor on a receiver cell.

        use_ppi_score : boolean, default=False
            Whether to weight the LR pairs by the scores specified
            in ppi_data when computing the scores.

        ref_ppi_data : pandas.DataFrame, default=None
            Reference list of protein-protein interactions (or
            ligand-receptor pairs) used for inferring the cell-cell
            interactions and communication. It could be the same as
            'ppi_data' if ppi_data is not bidirectional (that is, if it
            does not contain both the ProtA-ProtB and the ProtB-ProtA
            interaction). ref_ppi_data must be undirected (contains only
            ProtA-ProtB and not ProtB-ProtA interaction). If None,
            the ppi_data stored in this object will be used.

        interaction_columns : tuple, default=('A', 'B')
            Contains the names of the columns where to find the
            partners in ref_ppi_data. If the list is for ligand-receptor
            pairs, the first column is for the ligands and the second
            for the receptors.

        cells : list, default=None
            List of cells to consider. If None, all cells in the
            interaction space are used.

        cci_type : str, default=None
            Type of interaction between two cells. Used to specify
            if we want to consider a LR pair in both directions.
            It can be:
                - 'undirected'
                - 'directed'
            If None, the one stored in the attribute analysis_setup
            will be used.

        verbose : boolean, default=True
            Whether to print the steps of the analysis.

        Returns
        -------
        self.interaction_elements['communication_matrix'] : pandas.DataFrame
            Contains communication scores for each LR pair in a
            given pair of sender-receiver cells.
        '''
        if communication_score is None:
            communication_score = self.communication_score
        else:
            assert isinstance(communication_score, str)

        # Cells to consider
        if cells is None:
            cells = self.interaction_elements['cell_names']

        # Labels:
        if cci_type is None:
            cell_pairs = self.interaction_elements['pairs']
        elif cci_type != self.cci_type:
            cell_pairs = generate_pairs(cells, cci_type)
        else:
            #cell_pairs = generate_pairs(cells, self.cci_type) # Think about other scenarios that may need this line
            cell_pairs = self.interaction_elements['pairs']
        col_labels = ['{};{}'.format(pair[0], pair[1]) for pair in cell_pairs]

        # Ref PPI data
        if ref_ppi_data is None:
            ref_index = self.ppi_data.apply(lambda row: (row['A'], row['B']), axis=1)
            keep_index = list(range(self.ppi_data.shape[0]))
        else:
            ref_ppi = ref_ppi_data.copy()
            prot_a = interaction_columns[0]
            prot_b = interaction_columns[1]
            if ('A' in ref_ppi.columns) & (prot_a != 'A'):
                ref_ppi = ref_ppi.drop(columns='A')
            if ('B' in ref_ppi.columns) & (prot_b != 'B'):
                ref_ppi = ref_ppi.drop(columns='B')
            ref_ppi = ref_ppi.rename(columns={prot_a: 'A', prot_b: 'B'})
            ref_index = list(ref_ppi.apply(lambda row: (row['A'], row['B']), axis=1).values)
            keep_index = list(pd.merge(self.ppi_data, ref_ppi, how='inner').index)

        # DataFrame to Store values
        communication_matrix = pd.DataFrame(index=ref_index, columns=col_labels)

        ### Compute pairwise physical interactions
        if verbose:
            print("Computing pairwise communication")

        for i, pair in enumerate(cell_pairs):
            cell1 = self.interaction_elements['cells'][pair[0]]
            cell2 = self.interaction_elements['cells'][pair[1]]

            comm_score = self.pair_communication_score(cell1,
                                                       cell2,
                                                       communication_score=communication_score,
                                                       use_ppi_score=use_ppi_score,
                                                       verbose=verbose)
            kept_values = comm_score.flatten()[keep_index]
            communication_matrix[col_labels[i]] = kept_values

        self.interaction_elements['communication_matrix'] = communication_matrix
compute_pairwise_cci_scores(self, cci_score=None, use_ppi_score=False, verbose=True)

Computes overall CCI scores for each pair of cells.

Parameters

cci_score : str, default=None Scoring function to aggregate the communication scores between a pair of cells. It computes an overall potential of cell-cell interactions. If None, it will use the one stored in the attribute analysis_setup of this object. Options:

- 'bray_curtis' : Bray-Curtis-like score
- 'jaccard' : Jaccard-like score
- 'count' : Number of LR pairs that the pair of cells uses
- 'icellnet' : Sum of the L-R expression product of a pair of cells

use_ppi_score : boolean, default=False Whether to weight the LR pairs by the scores specified in ppi_data when computing the scores.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

self.interaction_elements['cci_matrix'] : pandas.DataFrame Contains CCI scores for each pair of cells
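
Besides filling the CCI matrix, this method also derives a distance matrix from it. The following is a minimal sketch of that conversion as it appears to be intended, using plain nested lists instead of a pandas.DataFrame: bounded scores ('bray_curtis', 'jaccard') become 1 - x, while unbounded ones ('count', 'icellnet') are first regularized as x / (x + mean). The helper name `cci_to_distance` is hypothetical and is not part of cell2cell.

```python
import math


def cci_to_distance(cci_matrix, cci_score):
    """Hypothetical helper: convert a square CCI matrix (list of lists) into distances."""
    n = len(cci_matrix)
    if cci_score not in ('count', 'icellnet'):
        # Scores bounded in [0, 1]: the distance is the complement of the score.
        dist = [[1.0 - x for x in row] for row in cci_matrix]
    else:
        # Unbounded scores: regularize with the mean of the non-NaN entries.
        # Assumes the mean is positive, so x + mean is never zero.
        values = [x for row in cci_matrix for x in row if not math.isnan(x)]
        mean = sum(values) / len(values)
        dist = [[1.0 - x / (x + mean) for x in row] for row in cci_matrix]
    # Zero the diagonal so that autocrine interactions do not contribute.
    for i in range(n):
        dist[i][i] = 0.0
    return dist


bounded = cci_to_distance([[1.0, 0.25], [0.25, 0.5]], 'bray_curtis')
counts = cci_to_distance([[4.0, 2.0], [2.0, 0.0]], 'count')
```

The membership test is negated with `not in` on purpose: applying the bitwise `~` operator to a Python bool yields -1 or -2, which are both truthy, so it cannot be used to select between the two branches.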

Source code in cell2cell/core/interaction_space.py
def compute_pairwise_cci_scores(self, cci_score=None, use_ppi_score=False, verbose=True):
    '''Computes overall CCI scores for each pair of cells.

    Parameters
    ----------
    cci_score : str, default=None
        Scoring function to aggregate the communication scores between
        a pair of cells. It computes an overall potential of cell-cell
        interactions. If None, it will use the one stored in the
        attribute analysis_setup of this object.
        Options:

        - 'bray_curtis' : Bray-Curtis-like score
        - 'jaccard' : Jaccard-like score
        - 'count' : Number of LR pairs that the pair of cells uses
        - 'icellnet' : Sum of the L-R expression product of a pair of cells

    use_ppi_score : boolean, default=False
        Whether to weight the LR pairs by the scores specified
        in ppi_data when computing the scores.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    self.interaction_elements['cci_matrix'] : pandas.DataFrame
        Contains CCI scores for each pair of cells
    '''
    if cci_score is None:
        cci_score = self.cci_score
    else:
        assert isinstance(cci_score, str)

    ### Compute pairwise physical interactions
    if verbose:
        print("Computing pairwise interactions")

    # Compute pair by pair
    for pair in self.interaction_elements['pairs']:
        cell1 = self.interaction_elements['cells'][pair[0]]
        cell2 = self.interaction_elements['cells'][pair[1]]
        cci_value = self.pair_cci_score(cell1,
                                        cell2,
                                        cci_score=cci_score,
                                        use_ppi_score=use_ppi_score,
                                        verbose=verbose)
        self.interaction_elements['cci_matrix'].at[pair[0], pair[1]] = cci_value
        if self.cci_type == 'undirected':
            self.interaction_elements['cci_matrix'].at[pair[1], pair[0]] = cci_value

    # Compute using matmul -> Too slow and uses a lot of memory TODO: Try to optimize this
    # if cci_score == 'bray_curtis':
    #     cci_matrix = cci_scores.matmul_bray_curtis_like(self.interaction_elements['A_score'],
    #                                                     self.interaction_elements['B_score'])
    # self.interaction_elements['cci_matrix'] = pd.DataFrame(cci_matrix,
    #                                                        index=self.interaction_elements['cell_names'],
    #                                                        columns=self.interaction_elements['cell_names']
    #                                                        )

    # Generate distance matrix
    if cci_score not in ['count', 'icellnet']:
        self.distance_matrix = self.interaction_elements['cci_matrix'].apply(lambda x: 1 - x)
    else:
        #self.distance_matrix = self.interaction_elements['cci_matrix'].div(self.interaction_elements['cci_matrix'].max().max()).apply(lambda x: 1 - x)
        # Regularized distance
        mean = np.nanmean(self.interaction_elements['cci_matrix'])
        self.distance_matrix = self.interaction_elements['cci_matrix'].div(self.interaction_elements['cci_matrix'] + mean).apply(lambda x: 1 - x)
    np.fill_diagonal(self.distance_matrix.values, 0.0)  # Make diagonal zero (delete autocrine-interactions)
compute_pairwise_communication_scores(self, communication_score=None, use_ppi_score=False, ref_ppi_data=None, interaction_columns=('A', 'B'), cells=None, cci_type=None, verbose=True)

Computes the communication scores for each LR pair in each pair of sender-receiver cells.

Parameters

communication_score : str, default=None Type of communication score to infer the potential use of a given ligand-receptor pair by a pair of cells/tissues/samples. If None, the score stored in the attribute analysis_setup will be used. Available communication_scores are:

- 'expression_thresholding' : Computes the joint presence of a
                             ligand from a sender cell and of
                             a receptor on a receiver cell from
                             binarizing their gene expression levels.
- 'expression_mean' : Computes the average between the expression
                      of a ligand from a sender cell and the
                      expression of a receptor on a receiver cell.
- 'expression_product' : Computes the product between the expression
                        of a ligand from a sender cell and the
                        expression of a receptor on a receiver cell.
- 'expression_gmean' : Computes the geometric mean between the expression
                        of a ligand from a sender cell and the
                        expression of a receptor on a receiver cell.

use_ppi_score : boolean, default=False Whether to weight the LR pairs by the scores specified in ppi_data when computing the scores.

ref_ppi_data : pandas.DataFrame, default=None Reference list of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication. It could be the same as 'ppi_data' if ppi_data is not bidirectional (that is, if it does not contain both the ProtA-ProtB and the ProtB-ProtA interaction). ref_ppi_data must be undirected (contains only ProtA-ProtB and not ProtB-ProtA interaction). If None, the ppi_data stored in this object will be used.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in ref_ppi_data. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

cells : list, default=None List of cells to consider. If None, all cells in the interaction space are used.

cci_type : str, default=None Type of interaction between two cells. Used to specify if we want to consider a LR pair in both directions. It can be:

- 'undirected'
- 'directed'

If None, the one stored in the attribute analysis_setup will be used.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

self.interaction_elements['communication_matrix'] : pandas.DataFrame Contains communication scores for each LR pair in a given pair of sender-receiver cells.
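
To make the layout of the resulting matrix concrete, here is a small sketch under stated assumptions: rows are indexed by (ligand, receptor) tuples and columns are labeled 'sender;receiver'. `make_cell_pairs` is a hypothetical stand-in for the library's pair generator (whose exact behavior is not shown here), the toy expression sets are invented, and a dict of dicts replaces the pandas.DataFrame.

```python
from itertools import combinations, product


def make_cell_pairs(cells, cci_type):
    """Hypothetical stand-in for the pair generator; the real helper may differ."""
    if cci_type == 'directed':
        # Every ordered (sender, receiver) combination, including self-pairs.
        return list(product(cells, repeat=2))
    if cci_type == 'undirected':
        # Each unordered pair once, followed by the self-pairs.
        return list(combinations(cells, 2)) + [(c, c) for c in cells]
    raise ValueError("cci_type must be 'directed' or 'undirected'")


cells = ['T', 'B']
lr_pairs = [('L1', 'R1'), ('L2', 'R2')]  # row index: (ligand, receptor)
pairs = make_cell_pairs(cells, 'directed')
col_labels = ['{};{}'.format(s, r) for s, r in pairs]  # 'sender;receiver'

# Toy binarized expression: the genes considered "on" in each cell.
expressed = {'T': {'L1', 'R2'}, 'B': {'R1', 'L2'}}

# matrix[(ligand, receptor)][label] is 1 when the sender expresses the ligand
# and the receiver expresses the receptor, otherwise 0.
matrix = {
    lr: {
        label: int(lr[0] in expressed[s] and lr[1] in expressed[r])
        for label, (s, r) in zip(col_labels, pairs)
    }
    for lr in lr_pairs
}
```

In this toy case only the 'T;B' column is active for ('L1', 'R1') and only 'B;T' for ('L2', 'R2'), which shows why a directed analysis keeps both orderings of a cell pair as separate columns.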

Source code in cell2cell/core/interaction_space.py
def compute_pairwise_communication_scores(self, communication_score=None, use_ppi_score=False, ref_ppi_data=None,
                                          interaction_columns=('A', 'B'), cells=None, cci_type=None, verbose=True):
    '''Computes the communication scores for each LR pair in
    each pair of sender-receiver cells.

    Parameters
    ----------
    communication_score : str, default=None
        Type of communication score to infer the potential use of
        a given ligand-receptor pair by a pair of cells/tissues/samples.
        If None, the score stored in the attribute analysis_setup
        will be used.
        Available communication_scores are:

        - 'expression_thresholding' : Computes the joint presence of a
                                     ligand from a sender cell and of
                                     a receptor on a receiver cell from
                                     binarizing their gene expression levels.
        - 'expression_mean' : Computes the average between the expression
                              of a ligand from a sender cell and the
                              expression of a receptor on a receiver cell.
        - 'expression_product' : Computes the product between the expression
                                of a ligand from a sender cell and the
                                expression of a receptor on a receiver cell.
        - 'expression_gmean' : Computes the geometric mean between the expression
                                of a ligand from a sender cell and the
                                expression of a receptor on a receiver cell.

    use_ppi_score : boolean, default=False
        Whether to weight the LR pairs by the scores specified
        in ppi_data when computing the scores.

    ref_ppi_data : pandas.DataFrame, default=None
        Reference list of protein-protein interactions (or
        ligand-receptor pairs) used for inferring the cell-cell
        interactions and communication. It could be the same as
        'ppi_data' if ppi_data is not bidirectional (that is, if it
        does not contain both the ProtA-ProtB and the ProtB-ProtA
        interaction). ref_ppi_data must be undirected (contains only
        ProtA-ProtB and not ProtB-ProtA interaction). If None,
        the ppi_data stored in this object will be used.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the
        partners in ref_ppi_data. If the list is for ligand-receptor
        pairs, the first column is for the ligands and the second
        for the receptors.

    cells : list, default=None
        List of cells to consider. If None, all cells in the
        interaction space are used.

    cci_type : str, default=None
        Type of interaction between two cells. Used to specify
        if we want to consider a LR pair in both directions.
        It can be:
            - 'undirected'
            - 'directed'
        If None, the one stored in the attribute analysis_setup
        will be used.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    self.interaction_elements['communication_matrix'] : pandas.DataFrame
        Contains communication scores for each LR pair in a
        given pair of sender-receiver cells.
    '''
    if communication_score is None:
        communication_score = self.communication_score
    else:
        assert isinstance(communication_score, str)

    # Cells to consider
    if cells is None:
        cells = self.interaction_elements['cell_names']

    # Labels:
    if cci_type is None:
        cell_pairs = self.interaction_elements['pairs']
    elif cci_type != self.cci_type:
        cell_pairs = generate_pairs(cells, cci_type)
    else:
        #cell_pairs = generate_pairs(cells, self.cci_type) # Think about other scenarios that may need this line
        cell_pairs = self.interaction_elements['pairs']
    col_labels = ['{};{}'.format(pair[0], pair[1]) for pair in cell_pairs]

    # Ref PPI data
    if ref_ppi_data is None:
        ref_index = self.ppi_data.apply(lambda row: (row['A'], row['B']), axis=1)
        keep_index = list(range(self.ppi_data.shape[0]))
    else:
        ref_ppi = ref_ppi_data.copy()
        prot_a = interaction_columns[0]
        prot_b = interaction_columns[1]
        if ('A' in ref_ppi.columns) & (prot_a != 'A'):
            ref_ppi = ref_ppi.drop(columns='A')
        if ('B' in ref_ppi.columns) & (prot_b != 'B'):
            ref_ppi = ref_ppi.drop(columns='B')
        ref_ppi = ref_ppi.rename(columns={prot_a: 'A', prot_b: 'B'})
        ref_index = list(ref_ppi.apply(lambda row: (row['A'], row['B']), axis=1).values)
        keep_index = list(pd.merge(self.ppi_data, ref_ppi, how='inner').index)

    # DataFrame to Store values
    communication_matrix = pd.DataFrame(index=ref_index, columns=col_labels)

    ### Compute pairwise physical interactions
    if verbose:
        print("Computing pairwise communication")

    for i, pair in enumerate(cell_pairs):
        cell1 = self.interaction_elements['cells'][pair[0]]
        cell2 = self.interaction_elements['cells'][pair[1]]

        comm_score = self.pair_communication_score(cell1,
                                                   cell2,
                                                   communication_score=communication_score,
                                                   use_ppi_score=use_ppi_score,
                                                   verbose=verbose)
        kept_values = comm_score.flatten()[keep_index]
        communication_matrix[col_labels[i]] = kept_values

    self.interaction_elements['communication_matrix'] = communication_matrix
pair_cci_score(self, cell1, cell2, cci_score='bray_curtis', use_ppi_score=False, verbose=True)

Computes a CCI score for a pair of cells.

Parameters

cell1 : cell2cell.core.cell.Cell First cell-type/tissue/sample to compute the communication score. In a directed interaction, this is the sender.

cell2 : cell2cell.core.cell.Cell Second cell-type/tissue/sample to compute the communication score. In a directed interaction, this is the receiver.

cci_score : str, default='bray_curtis' Scoring function to aggregate the communication scores between a pair of cells. It computes an overall potential of cell-cell interactions. Options:

- 'bray_curtis' : Bray-Curtis-like score
- 'jaccard' : Jaccard-like score
- 'count' : Number of LR pairs that the pair of cells uses
- 'icellnet' : Sum of the L-R expression product of a pair of cells

use_ppi_score : boolean, default=False Whether to weight the LR pairs by the scores specified in ppi_data when computing the scores.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

cci_score : float Overall score for the interaction between a pair of cell-types/tissues/samples. The kind of score is determined by the cci_score argument (a Bray-Curtis-like score by default).
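
For intuition about the options above, the sketch below computes three of them on toy vectors. The formulas are assumptions based on the usual Bray-Curtis-like and Jaccard-like overlap definitions; the actual implementations live in `cci_scores` and are not reproduced in this section, so treat the function names and the zero-denominator handling as hypothetical.

```python
def braycurtis_like(ligands_a, receptors_b):
    """Assumed form: 2 * sum(a*b) / (sum(a^2) + sum(b^2))."""
    shared = sum(a * b for a, b in zip(ligands_a, receptors_b))
    total = sum(a * a for a in ligands_a) + sum(b * b for b in receptors_b)
    return 2.0 * shared / total if total else 0.0


def jaccard_like(ligands_a, receptors_b):
    """Assumed form: sum(a*b) / (sum(a^2) + sum(b^2) - sum(a*b))."""
    shared = sum(a * b for a, b in zip(ligands_a, receptors_b))
    total = sum(a * a for a in ligands_a) + sum(b * b for b in receptors_b)
    union = total - shared
    return shared / union if union else 0.0


def count_score(ligands_a, receptors_b):
    """Number of LR pairs active in both cells (for binarized inputs)."""
    return sum(a * b for a, b in zip(ligands_a, receptors_b))


# Binarized presence of each ligand in the sender and of its receptor in the receiver.
sender_ligands = [1, 1, 0]
receiver_receptors = [1, 0, 1]

bc = braycurtis_like(sender_ligands, receiver_receptors)
jac = jaccard_like(sender_ligands, receiver_receptors)
cnt = count_score(sender_ligands, receiver_receptors)
```

With these inputs only the first LR pair is shared, so the count is 1, while the two normalized scores relate that single shared pair to the total number of active partners in both cells.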

Source code in cell2cell/core/interaction_space.py
def pair_cci_score(self, cell1, cell2, cci_score='bray_curtis', use_ppi_score=False, verbose=True):
    '''
    Computes a CCI score for a pair of cells.

    Parameters
    ----------
    cell1 : cell2cell.core.cell.Cell
        First cell-type/tissue/sample to compute the communication
        score. In a directed interaction, this is the sender.

    cell2 : cell2cell.core.cell.Cell
        Second cell-type/tissue/sample to compute the communication
        score. In a directed interaction, this is the receiver.

    cci_score : str, default='bray_curtis'
        Scoring function to aggregate the communication scores between
        a pair of cells. It computes an overall potential of cell-cell
        interactions.
        Options:

        - 'bray_curtis' : Bray-Curtis-like score
        - 'jaccard' : Jaccard-like score
        - 'count' : Number of LR pairs that the pair of cells uses
        - 'icellnet' : Sum of the L-R expression product of a pair of cells

    use_ppi_score : boolean, default=False
        Whether to weight the LR pairs by the scores specified
        in ppi_data when computing the scores.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    cci_score : float
        Overall score for the interaction between a pair of
        cell-types/tissues/samples. The kind of score is determined
        by the cci_score argument (a Bray-Curtis-like score by default).
    '''

    if verbose:
        print("Computing interaction score between {} and {}".format(cell1.type, cell2.type))

    if use_ppi_score:
        ppi_score = self.ppi_data['score'].values
    else:
        ppi_score = None
    # Calculate cell-cell interaction score
    if cci_score == 'bray_curtis':
        cci_value = cci_scores.compute_braycurtis_like_cci_score(cell1, cell2, ppi_score=ppi_score)
    elif cci_score == 'jaccard':
        cci_value = cci_scores.compute_jaccard_like_cci_score(cell1, cell2, ppi_score=ppi_score)
    elif cci_score == 'count':
        cci_value = cci_scores.compute_count_score(cell1, cell2, ppi_score=ppi_score)
    elif cci_score == 'icellnet':
        cci_value = cci_scores.compute_icellnet_score(cell1, cell2, ppi_score=ppi_score)
    else:
        raise NotImplementedError("CCI score {} to compute pairwise cell-interactions is not implemented".format(cci_score))
    return cci_value
pair_communication_score(self, cell1, cell2, communication_score='expression_thresholding', use_ppi_score=False, verbose=True)

Computes a communication score for each protein-protein interaction between a pair of cells.

Parameters

cell1 : cell2cell.core.cell.Cell First cell-type/tissue/sample to compute the communication score. In a directed interaction, this is the sender.

cell2 : cell2cell.core.cell.Cell Second cell-type/tissue/sample to compute the communication score. In a directed interaction, this is the receiver.

communication_score : str, default='expression_thresholding' Type of communication score to infer the potential use of a given ligand-receptor pair by a pair of cells/tissues/samples. It must belong to the same family (binary or continuous) as the score used to build this interaction space. Available communication_scores are:

- 'expression_thresholding' : Computes the joint presence of a
                             ligand from a sender cell and of
                             a receptor on a receiver cell from
                             binarizing their gene expression levels.
- 'expression_mean' : Computes the average between the expression
                      of a ligand from a sender cell and the
                      expression of a receptor on a receiver cell.
- 'expression_product' : Computes the product between the expression
                        of a ligand from a sender cell and the
                        expression of a receptor on a receiver cell.
- 'expression_gmean' : Computes the geometric mean between the expression
                       of a ligand from a sender cell and the
                       expression of a receptor on a receiver cell.

use_ppi_score : boolean, default=False Whether to weight the LR pairs by the scores specified in ppi_data when computing the scores.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

communication_scores : numpy.array An array with the communication scores for each intercellular PPI.
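
The per-pair scores listed above can be illustrated for a single ligand-receptor pair. This is only a sketch: `toy_communication_score` and `same_family` are hypothetical names, the two score families mirror the lists checked in the method's source, and the thresholding variant assumes its inputs were already binarized to 0 or 1.

```python
import math

# Score families; a score from one family cannot be used in an
# interaction space that was built with a score from the other.
BINARY = ('expression_thresholding', 'differential_combinations')
CONTINUOUS = ('expression_product', 'expression_correlation',
              'expression_mean', 'expression_gmean')


def same_family(requested, built_with):
    """True when both scores are binary or both are continuous."""
    return ((requested in BINARY and built_with in BINARY)
            or (requested in CONTINUOUS and built_with in CONTINUOUS))


def toy_communication_score(ligand, receptor, method):
    """Hypothetical single-pair version of the listed scores."""
    if method == 'expression_thresholding':
        # Joint presence; inputs are assumed to be binarized already.
        return ligand * receptor
    if method == 'expression_mean':
        return (ligand + receptor) / 2.0
    if method == 'expression_product':
        return ligand * receptor
    if method == 'expression_gmean':
        return math.sqrt(ligand * receptor)
    raise NotImplementedError(method)


mean_score = toy_communication_score(4.0, 9.0, 'expression_mean')
product_score = toy_communication_score(4.0, 9.0, 'expression_product')
gmean_score = toy_communication_score(4.0, 9.0, 'expression_gmean')
binary_score = toy_communication_score(1, 0, 'expression_thresholding')
```

For a ligand expressed at 4.0 and a receptor at 9.0, the mean is 6.5, the product 36.0 and the geometric mean 6.0, while a missing receptor makes the thresholded score 0.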

Source code in cell2cell/core/interaction_space.py
def pair_communication_score(self, cell1, cell2, communication_score='expression_thresholding',
                             use_ppi_score=False, verbose=True):
    '''Computes a communication score for each protein-protein interaction
    between a pair of cells.

    Parameters
    ----------
    cell1 : cell2cell.core.cell.Cell
        First cell-type/tissue/sample to compute the communication
        score. In a directed interaction, this is the sender.

    cell2 : cell2cell.core.cell.Cell
        Second cell-type/tissue/sample to compute the communication
        score. In a directed interaction, this is the receiver.

    communication_score : str, default='expression_thresholding'
        Type of communication score to infer the potential use of
        a given ligand-receptor pair by a pair of cells/tissues/samples.
        It must belong to the same family (binary or continuous) as
        the score used to build this interaction space.
        Available communication_scores are:

        - 'expression_thresholding' : Computes the joint presence of a
                                     ligand from a sender cell and of
                                     a receptor on a receiver cell from
                                     binarizing their gene expression levels.
        - 'expression_mean' : Computes the average between the expression
                              of a ligand from a sender cell and the
                              expression of a receptor on a receiver cell.
        - 'expression_product' : Computes the product between the expression
                                of a ligand from a sender cell and the
                                expression of a receptor on a receiver cell.
        - 'expression_gmean' : Computes the geometric mean between the expression
                               of a ligand from a sender cell and the
                               expression of a receptor on a receiver cell.

    use_ppi_score : boolean, default=False
        Whether to weight the LR pairs by the scores specified
        in ppi_data when computing the scores.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    communication_scores : numpy.array
        An array with the communication scores for each intercellular
        PPI.
    '''
    # TODO: Implement communication scores
    if verbose:
        print("Computing communication score between {} and {}".format(cell1.type, cell2.type))

    # Check that new score is the same type as score used to build interaction space (binary or continuous)
    if (communication_score in ['expression_product', 'expression_correlation', 'expression_mean', 'expression_gmean']) \
            and (self.communication_score in ['expression_thresholding', 'differential_combinations']):
        raise ValueError('Cannot use {} for this interaction space'.format(communication_score))
    if (communication_score in ['expression_thresholding', 'differential_combinations']) \
            and (self.communication_score in ['expression_product', 'expression_correlation', 'expression_mean', 'expression_gmean']):
        raise ValueError('Cannot use {} for this interaction space'.format(communication_score))

    if use_ppi_score:
        ppi_score = self.ppi_data['score'].values
    else:
        ppi_score = None

    if communication_score in ['expression_thresholding', 'differential_combinations']:
        communication_value = communication_scores.get_binary_scores(cell1=cell1,
                                                                     cell2=cell2,
                                                                     ppi_score=ppi_score)
    elif communication_score in ['expression_product', 'expression_correlation', 'expression_mean', 'expression_gmean']:
        communication_value = communication_scores.get_continuous_scores(cell1=cell1,
                                                                         cell2=cell2,
                                                                         ppi_score=ppi_score,
                                                                         method=communication_score)
    else:
        raise NotImplementedError(
            "Communication score {} to compute pairwise cell-communication is not implemented".format(communication_score))
    return communication_value
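
As a rough illustration of the four score types listed in the docstring above, the following standalone sketch computes each one for a single ligand-receptor pair. It does not call cell2cell: the function name lr_pair_score and the threshold argument are assumptions made for this example, and the library's own get_binary_scores and get_continuous_scores may differ in detail (for instance, in how expression is binarized).

```python
import math


def lr_pair_score(ligand_expr, receptor_expr, method="expression_mean", threshold=0.0):
    """Hypothetical helper: score one ligand-receptor pair for a sender/receiver."""
    if method == "expression_thresholding":
        # Binarize both partners, then require joint presence (assumed rule).
        return int(ligand_expr > threshold) * int(receptor_expr > threshold)
    if method == "expression_mean":
        return (ligand_expr + receptor_expr) / 2.0
    if method == "expression_product":
        return ligand_expr * receptor_expr
    if method == "expression_gmean":
        return math.sqrt(ligand_expr * receptor_expr)
    raise NotImplementedError("Unknown communication score: {}".format(method))


# Ligand expressed at 4.0 in the sender, receptor at 9.0 in the receiver.
scores = {m: lr_pair_score(4.0, 9.0, method=m, threshold=5.0)
          for m in ("expression_thresholding", "expression_mean",
                    "expression_product", "expression_gmean")}
print(scores)
```

With a threshold of 5.0 the ligand counts as absent, so the binary score is 0 even though the continuous scores are all positive. This is one reason the function refuses to mix binary and continuous scores within the same interaction space.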

generate_interaction_elements(modified_rnaseq, ppi_data, cci_type='undirected', cci_matrix_template=None, complex_sep=None, complex_agg_method='min', interaction_columns=('A', 'B'), verbose=True)

Create all elements needed to perform the analyses of pairwise cell-cell interactions/communication. Corresponds to the interaction elements used by the class InteractionSpace.

Parameters

modified_rnaseq : pandas.DataFrame Preprocessed gene expression data for a bulk or single-cell RNA-seq experiment. Columns are cell-types/tissues/samples and rows are genes. The preprocessing may correspond to scoring the gene expression as binary or continuous values depending on the scoring function for cell-cell interactions/communication.

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

cci_type : str, default='undirected' Specifies whether to compute the cci_score in a directed or undirected way. For a pair of cells A and B, directed means that ligands are considered only from cell A and receptors only from cell B, or vice versa, while undirected simultaneously considers signaling from cell A to cell B and from cell B to cell A.

cci_matrix_template : pandas.DataFrame, default=None A matrix of shape MxM where M are cell-types/tissues/samples. This is used as template for storing CCI scores. It may be useful for specifying which pairs of cells to consider.

complex_sep : str, default=None Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

complex_agg_method : str, default='min' Method to aggregate the expression value of multiple genes in a complex.

- 'min' : Minimum expression value among all genes.
- 'mean' : Average expression value among all genes.
- 'gmean' : Geometric mean expression value among all genes.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

interaction_elements : dict Dictionary containing all the pairs of cells considered (under the key 'pairs'), Cell instances (under the key 'cells') which include all cells/tissues/organs with their associated datasets (rna_seq, weighted_ppi, etc.), and a Cell-Cell Interaction Matrix to store CCI scores (under the key 'cci_matrix'). A communication matrix is also stored in this object when the communication scores are computed in the InteractionSpace class (under the key 'communication_score').

Source code in cell2cell/core/interaction_space.py
def generate_interaction_elements(modified_rnaseq, ppi_data, cci_type='undirected', cci_matrix_template=None,
                                  complex_sep=None, complex_agg_method='min', interaction_columns=('A', 'B'),
                                  verbose=True):
    '''Create all elements needed to perform the analyses of pairwise
    cell-cell interactions/communication. Corresponds to the interaction
    elements used by the class InteractionSpace.

    Parameters
    ----------
    modified_rnaseq : pandas.DataFrame
        Preprocessed gene expression data for a bulk or single-cell RNA-seq experiment.
        Columns are cell-types/tissues/samples and rows are genes. The preprocessing
        may correspond to scoring the gene expression as binary or continuous values
        depending on the scoring function for cell-cell interactions/communication.

    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used for
        inferring the cell-cell interactions and communication.

    cci_type : str, default='undirected'
        Specifies whether to compute the cci_score in a directed or undirected
        way. For a pair of cells A and B, directed means that ligands are
        considered only from cell A and receptors only from cell B, or vice versa,
        while undirected simultaneously considers signaling from cell A to
        cell B and from cell B to cell A.

    cci_matrix_template : pandas.DataFrame, default=None
        A matrix of shape MxM where M are cell-types/tissues/samples. This
        is used as template for storing CCI scores. It may be useful
        for specifying which pairs of cells to consider.

    complex_sep : str, default=None
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    complex_agg_method : str, default='min'
        Method to aggregate the expression value of multiple genes in a
        complex.

        - 'min' : Minimum expression value among all genes.
        - 'mean' : Average expression value among all genes.
        - 'gmean' : Geometric mean expression value among all genes.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    interaction_elements : dict
        Dictionary containing all the pairs of cells considered (under
        the key 'pairs'), Cell instances (under the key 'cells')
        which include all cells/tissues/organs with their associated datasets
        (rna_seq, weighted_ppi, etc.), and a Cell-Cell Interaction Matrix
        to store CCI scores (under the key 'cci_matrix'). A communication matrix
        is also stored in this object when the communication scores are
        computed in the InteractionSpace class (under the key
        'communication_score').
    '''

    if verbose:
        print('Creating Interaction Space')

    # Include complex expression
    if complex_sep is not None:
        col_a_genes, complex_a, col_b_genes, complex_b, complexes = get_genes_from_complexes(ppi_data=ppi_data,
                                                                                             complex_sep=complex_sep,
                                                                                             interaction_columns=interaction_columns
                                                                                             )
        modified_rnaseq = add_complexes_to_expression(rnaseq_data=modified_rnaseq,
                                                      complexes=complexes,
                                                      agg_method=complex_agg_method
                                                      )

    # Cells
    cell_instances = list(modified_rnaseq.columns)  # @Erick, check if position 0 of columns contain index header.
    cell_number = len(cell_instances)

    # Generate pairwise interactions
    pairwise_interactions = generate_pairs(cell_instances, cci_type)

    # Interaction elements
    interaction_elements = {}
    interaction_elements['cell_names'] = cell_instances
    interaction_elements['pairs'] = pairwise_interactions
    interaction_elements['cells'] = cell.get_cells_from_rnaseq(modified_rnaseq, verbose=verbose)

    # Cell-specific scores in PPIs

    # For matmul functions
    #interaction_elements['A_score'] = np.array([], dtype=np.int64)#.reshape(ppi_data.shape[0],0)
    #interaction_elements['B_score'] = np.array([], dtype=np.int64)#.reshape(ppi_data.shape[0],0)

    # For 'for' loop
    for cell_instance in interaction_elements['cells'].values():
        cell_instance.weighted_ppi = integrate_data.get_weighted_ppi(ppi_data=ppi_data,
                                                                     modified_rnaseq_data=cell_instance.rnaseq_data,
                                                                     column='value', # value is in each cell
                                                                     )
        #interaction_elements['A_score'] = np.hstack([interaction_elements['A_score'], cell_instance.weighted_ppi['A'].values])
        #interaction_elements['B_score'] = np.hstack([interaction_elements['B_score'], cell_instance.weighted_ppi['B'].values])

    # Cell-cell interaction matrix
    if cci_matrix_template is None:
        interaction_elements['cci_matrix'] = pd.DataFrame(np.zeros((cell_number, cell_number)),
                                                          columns=cell_instances,
                                                          index=cell_instances)
    else:
        interaction_elements['cci_matrix'] = cci_matrix_template
    return interaction_elements
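
The complex_sep and complex_agg_method parameters can be illustrated with a small standalone sketch. It is not the library's add_complexes_to_expression; the helper name aggregate_complex and the plain-dict expression table are assumptions for this example, and handling of subunits missing from the expression data is deliberately left out.

```python
import math
from statistics import mean


def aggregate_complex(expression, complex_name, complex_sep="&", agg_method="min"):
    """Hypothetical helper: collapse the subunits of a complex into one value."""
    subunits = complex_name.split(complex_sep)
    values = [expression[gene] for gene in subunits]
    if agg_method == "min":
        return min(values)
    if agg_method == "mean":
        return mean(values)
    if agg_method == "gmean":
        return math.prod(values) ** (1.0 / len(values))
    raise ValueError("Unknown agg_method: {}".format(agg_method))


# Expression of two subunits in one cell type (made-up values).
expression = {"CD74": 8.0, "CD44": 2.0}
for method in ("min", "mean", "gmean"):
    print(method, aggregate_complex(expression, "CD74&CD44", agg_method=method))
```

The default 'min' is the most conservative choice of the three: the complex is only considered as expressed as its least expressed subunit.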

generate_pairs(cells, cci_type, self_interaction=True, remove_duplicates=True)

Generates a list of pairs of interacting cell-types/tissues/samples.

Parameters

cells : list A list of cell-type/tissue/sample names.

cci_type : str Type of interactions. Options are:

- 'directed' : Directed cell-cell interactions, so pair A-B is different
    to pair B-A and both are considered.
- 'undirected' : Undirected cell-cell interactions, so pair A-B is equal
    to pair B-A and just one of them is considered.

self_interaction : boolean, default=True Whether to consider autocrine interactions (pair A-A, B-B, etc.).

remove_duplicates : boolean, default=True Whether to remove duplicated pairs when the list of cells contains repeated names. If False and a list [A, A, B] is passed, pairs could be [A-A, A-A, A-B, A-A, A-A, A-B, B-A, B-A, B-B] when self_interaction is True and cci_type is 'directed'. In the same scenario but with remove_duplicates set to True, the resulting list would be [A-A, A-B, B-A, B-B].

Returns

pairs : list List with pairs of interacting cell-types/tissues/samples.

Source code in cell2cell/core/interaction_space.py
def generate_pairs(cells, cci_type, self_interaction=True, remove_duplicates=True):
    '''Generates a list of pairs of interacting cell-types/tissues/samples.

    Parameters
    ----------
    cells : list
        A list of cell-type/tissue/sample names.

    cci_type : str
        Type of interactions.
        Options are:

        - 'directed' : Directed cell-cell interactions, so pair A-B is different
            to pair B-A and both are considered.
        - 'undirected' : Undirected cell-cell interactions, so pair A-B is equal
            to pair B-A and just one of them is considered.

    self_interaction : boolean, default=True
        Whether to consider autocrine interactions (pair A-A, B-B, etc.).

    remove_duplicates : boolean, default=True
        Whether to remove duplicated pairs when the list of cells contains
        repeated names. If False and a list [A, A, B] is passed, pairs could be
        [A-A, A-A, A-B, A-A, A-A, A-B, B-A, B-A, B-B] when self_interaction is True
        and cci_type is 'directed'. In the same scenario but with remove_duplicates
        set to True, the resulting list would be [A-A, A-B, B-A, B-B].

    Returns
    -------
    pairs : list
        List with pairs of interacting cell-types/tissues/samples.
    '''
    if self_interaction:
        if cci_type == 'directed':
            pairs = list(itertools.product(cells, cells))
            #pairs = list(itertools.combinations(cells + cells, 2)) # Directed
        elif cci_type == 'undirected':
            pairs = list(itertools.combinations(cells, 2)) + [(c, c) for c in cells] # Undirected
        else:
            raise NotImplementedError("CCI type has to be directed or undirected")
    else:
        if cci_type == 'directed':
            pairs_ = list(itertools.product(cells, cells))
            pairs = []
            for p in pairs_:
                if p[0] == p[1]:
                    continue
                else:
                    pairs.append(p)
        elif cci_type == 'undirected':
            pairs = list(itertools.combinations(cells, 2))
        else:
            raise NotImplementedError("CCI type has to be directed or undirected")
    if remove_duplicates:
        pairs = list(set(pairs))  # Remove duplicates
    return pairs
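
The effect of cci_type, self_interaction and remove_duplicates on the number of pairs can be checked with a standalone sketch. The function sketch_pairs below is a hypothetical re-implementation written only for this example. It follows the same itertools logic as the listing above, except that duplicates are dropped with dict.fromkeys so that the output order is reproducible (the set-based version above does not guarantee any order).

```python
import itertools


def sketch_pairs(cells, cci_type, self_interaction=True, remove_duplicates=True):
    """Hypothetical, order-preserving variant of the pairing logic."""
    if cci_type == "directed":
        pairs = list(itertools.product(cells, cells))
        if not self_interaction:
            pairs = [p for p in pairs if p[0] != p[1]]
    elif cci_type == "undirected":
        pairs = list(itertools.combinations(cells, 2))
        if self_interaction:
            pairs += [(c, c) for c in cells]
    else:
        raise NotImplementedError("CCI type has to be directed or undirected")
    if remove_duplicates:
        pairs = list(dict.fromkeys(pairs))  # keeps first-seen order
    return pairs


cells = ["A", "B", "C"]
for cci_type in ("directed", "undirected"):
    for self_interaction in (True, False):
        n = len(sketch_pairs(cells, cci_type, self_interaction))
        print(cci_type, "self_interaction =", self_interaction, "->", n, "pairs")

# Repeated names, as in the docstring example.
print(sketch_pairs(["A", "A", "B"], "directed"))
```

For n distinct cells this gives n*n directed pairs with autocrine interactions, n*(n-1) without them, n*(n+1)/2 undirected pairs with autocrine interactions, and n*(n-1)/2 without them.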

datasets special

anndata

balf_covid(filename='BALF-COVID19-Liao_et_al-NatMed-2020.h5ad')

BALF samples from COVID-19 patients. The data consists of 63k immune and epithelial cells in lungs from 3 control, 3 moderate COVID-19, and 6 severe COVID-19 patients.

This dataset was previously published in [1], and this object contains the raw counts for the annotated cell types available in: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145926

References: [1] Liao, M., Liu, Y., Yuan, J. et al. Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat Med 26, 842–844 (2020). https://doi.org/10.1038/s41591-020-0901-9

Parameters

filename : str, default='BALF-COVID19-Liao_et_al-NatMed-2020.h5ad' Path to the h5ad file in case it was manually downloaded.

Returns

Annotated data matrix.

Source code in cell2cell/datasets/anndata.py
def balf_covid(filename='BALF-COVID19-Liao_et_al-NatMed-2020.h5ad'):
    """BALF samples from COVID-19 patients
    The data consists of 63k immune and epithelial cells in lungs
    from 3 control, 3 moderate COVID-19, and 6 severe COVID-19 patients.

    This dataset was previously published in [1], and this object contains
    the raw counts for the annotated cell types available in:
    https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145926

    References:
    [1] Liao, M., Liu, Y., Yuan, J. et al.
        Single-cell landscape of bronchoalveolar immune cells in patients
        with COVID-19. Nat Med 26, 842–844 (2020).
        https://doi.org/10.1038/s41591-020-0901-9

    Parameters
    ----------
    filename : str, default='BALF-COVID19-Liao_et_al-NatMed-2020.h5ad'
        Path to the h5ad file in case it was manually downloaded.

    Returns
    -------
    Annotated data matrix.
    """
    url = 'https://zenodo.org/record/7535867/files/BALF-COVID19-Liao_et_al-NatMed-2020.h5ad'
    adata = read(filename, backup_url=url)
    return adata

gsea_data

gsea_msig(organism='human', pathwaydb='GOBP', readable_name=False)

Load a MSigDB from a gmt file

Parameters

organism : str, default='human' Organism for which the DB will be loaded. Available options are {'human', 'mouse'}.

pathwaydb : str, default='GOBP' Molecular Signature Database to load. Available options are {'GOBP', 'KEGG', 'Reactome'}.

readable_name : boolean, default=False If True, the pathway names are transformed to a more readable format. That is, removing underscores and pathway DB name at the beginning.

Returns

pathway_per_gene : defaultdict Dictionary containing all genes in the DB as keys, and their values are lists with their pathway annotations.

Source code in cell2cell/datasets/gsea_data.py
def gsea_msig(organism='human', pathwaydb='GOBP', readable_name=False):
    '''Load a MSigDB from a gmt file

    Parameters
    ----------
    organism : str, default='human'
        Organism for which the DB will be loaded.
        Available options are {'human', 'mouse'}.

    pathwaydb : str, default='GOBP'
        Molecular Signature Database to load.
        Available options are {'GOBP', 'KEGG', 'Reactome'}

    readable_name : boolean, default=False
        If True, the pathway names are transformed to a more readable format.
        That is, removing underscores and pathway DB name at the beginning.

    Returns
    -------
    pathway_per_gene : defaultdict
        Dictionary containing all genes in the DB as keys, and
        their values are lists with their pathway annotations.
    '''
    _check_pathwaydb(organism, pathwaydb)

    pathway_per_gene = load_gmt(readable_name=readable_name, **PATHWAY_DATA[organism][pathwaydb])
    return pathway_per_gene
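
To make the shape of the returned dictionary concrete, here is a hedged sketch of how a GMT file (one gene set per line: set name, description, then member genes, all tab-separated) can be inverted into a gene-to-pathways mapping. The function pathways_per_gene is hypothetical and is not the library's load_gmt. In particular, the readable_name handling simply drops the first underscore-delimited token and replaces the remaining underscores with spaces, which is an assumption about what "removing underscores and pathway DB name" means.

```python
from collections import defaultdict


def pathways_per_gene(gmt_lines, readable_name=False):
    """Hypothetical helper: invert GMT gene sets into gene -> list of pathways."""
    result = defaultdict(list)
    for line in gmt_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        name, genes = fields[0], fields[2:]
        if readable_name:
            # Assumed convention: "GOBP_CELL_ADHESION" -> "CELL ADHESION"
            name = " ".join(name.split("_")[1:])
        for gene in genes:
            result[gene].append(name)
    return result


# Two made-up gene sets in GMT layout.
gmt_lines = [
    "GOBP_CELL_ADHESION\tdescription\tCD44\tITGB1\n",
    "GOBP_IMMUNE_RESPONSE\tdescription\tCD44\tCD74\n",
]
print(dict(pathways_per_gene(gmt_lines)))
print(dict(pathways_per_gene(gmt_lines, readable_name=True)))
```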

heuristic_data

HeuristicGOTerms

GO terms for contact and secreted proteins.

Attributes

contact_go_terms : list List of GO terms associated with proteins that participate in contact interactions (usually on the surface of cells).

mediator_go_terms : list List of GO terms associated with secreted proteins that mediate intercellular interactions or communication.

Source code in cell2cell/datasets/heuristic_data.py
class HeuristicGOTerms:
    '''GO terms for contact and secreted proteins.

    Attributes
    ----------
    contact_go_terms : list
        List of GO terms associated with proteins that
        participate in contact interactions (usually
        on the surface of cells).

    mediator_go_terms : list
        List of GO terms associated with secreted
        proteins that mediate intercellular interactions
        or communication.
    '''
    def __init__(self):
        self.contact_go_terms = ['GO:0007155',  # Cell adhesion
                                 'GO:0022608',  # Multicellular organism adhesion
                                 'GO:0098740',  # Multiorganism cell adhesion
                                 'GO:0098743',  # Cell aggregation
                                 'GO:0030054',  # Cell-junction #
                                 'GO:0009986',  # Cell surface #
                                 'GO:0097610',  # Cell surface furrow
                                 'GO:0007160',  # Cell-matrix adhesion
                                 'GO:0043235',  # Receptor complex,
                                 'GO:0008305',  # Integrin complex,
                                 'GO:0043113',  # Receptor clustering
                                 'GO:0009897',  # External side of plasma membrane #
                                 'GO:0038023',  # Signaling receptor activity #
                                 ]

        self.mediator_go_terms = ['GO:0005615',  # Extracellular space
                                  'GO:0005576',  # Extracellular region
                                  'GO:0031012',  # Extracellular matrix
                                  'GO:0005201',  # Extracellular matrix structural constituent
                                  'GO:1990430',  # Extracellular matrix protein binding
                                  'GO:0048018',  # Receptor ligand activity #
                                  ]

random_data

generate_random_cci_scores(cell_number, labels=None, symmetric=True, random_state=None)

Generates a square cell-cell interaction matrix with random scores.

Parameters

cell_number : int Number of cells.

labels : list, default=None List containing a label for each cell. The length of this list must match cell_number.

symmetric : boolean, default=True Whether to generate a symmetric CCI matrix.

random_state : int, default=None Seed for randomization.

Returns

cci_matrix : pandas.DataFrame Matrix with rows and columns as cells. Values represent a random CCI score between 0 and 1.

Source code in cell2cell/datasets/random_data.py
def generate_random_cci_scores(cell_number, labels=None, symmetric=True, random_state=None):
    '''Generates a square cell-cell interaction
    matrix with random scores.

    Parameters
    ----------
    cell_number : int
        Number of cells.

    labels : list, default=None
        List containing a label for each cell. The length of
        this list must match cell_number.

    symmetric : boolean, default=True
        Whether to generate a symmetric CCI matrix.

    random_state : int, default=None
        Seed for randomization.

    Returns
    -------
    cci_matrix : pandas.DataFrame
        Matrix with rows and columns as cells. Values
        represent a random CCI score between 0 and 1.
    '''
    if labels is not None:
        assert len(labels) == cell_number, "Length of labels must match cell_number"
    else:
        labels = ['Cell-{}'.format(n) for n in range(1, cell_number+1)]

    if random_state is not None:
        np.random.seed(random_state)
    cci_scores = np.random.random((cell_number, cell_number))
    if symmetric:
        cci_scores = (cci_scores + cci_scores.T) / 2.
    cci_matrix = pd.DataFrame(cci_scores, index=labels, columns=labels)

    return cci_matrix
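
The symmetrization step, averaging the random matrix with its transpose, can be sketched without NumPy or pandas. The helper random_symmetric_scores is hypothetical and uses Python's random module rather than np.random, so its values will not match the library's output for the same seed.

```python
import random


def random_symmetric_scores(cell_number, seed=None):
    """Hypothetical helper: random scores in [0, 1) made symmetric by averaging."""
    rng = random.Random(seed)
    m = [[rng.random() for _ in range(cell_number)] for _ in range(cell_number)]
    # (M + M^T) / 2: entry (i, j) becomes the mean of (i, j) and (j, i).
    return [[(m[i][j] + m[j][i]) / 2.0 for j in range(cell_number)]
            for i in range(cell_number)]


scores = random_symmetric_scores(4, seed=0)
for row in scores:
    print(["{:.3f}".format(v) for v in row])
```

Averaging keeps every value within the original range, but off-diagonal entries are then the mean of two independent uniform draws, so they follow a triangular rather than a uniform distribution, while diagonal entries stay uniform.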

generate_random_metadata(cell_labels, group_number)

Randomly assigns groups to cell labels.

Parameters

cell_labels : list A list of cell labels.

group_number : int Number of major groups of cells.

Returns

metadata : pandas.DataFrame DataFrame containing the major groups that each cell received randomly (under column 'Group'). Cells are under the column 'Cell'.

Source code in cell2cell/datasets/random_data.py
def generate_random_metadata(cell_labels, group_number):
    '''Randomly assigns groups to cell labels.

    Parameters
    ----------
    cell_labels : list
        A list of cell labels.

    group_number : int
        Number of major groups of cells.

    Returns
    -------
    metadata : pandas.DataFrame
        DataFrame containing the major groups that each cell
        received randomly (under column 'Group'). Cells are
        under the column 'Cell'.
    '''
    metadata = pd.DataFrame()
    metadata['Cell'] = cell_labels

    groups = list(range(1, group_number+1))
    metadata['Group'] = metadata['Cell'].apply(lambda x: np.random.choice(groups, 1)[0])
    return metadata

generate_random_ppi(max_size, interactors_A, interactors_B=None, random_state=None, verbose=True)

Generates a random list of protein-protein interactions.

Parameters

max_size : int Maximum number of interactions to obtain. Because the PPIs are obtained by independently resampling interactors A and B rather than by enumerating all possible combinations (which may demand too much memory), some PPIs can be duplicated; dropping these duplicates can yield fewer PPIs than max_size.

interactors_A : list A list of protein names to include in the first column of the PPIs.

interactors_B : list, default=None A list of protein names to include in the second column of the PPIs. If None, interactors_A will be used as interactors_B too.

random_state : int, default=None Seed for randomization.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

ppi_data : pandas.DataFrame DataFrame containing a list of protein-protein interactions. It has three columns: 'A', 'B', and 'score' for interactors A, B and weights of interactions, respectively.

Source code in cell2cell/datasets/random_data.py
def generate_random_ppi(max_size, interactors_A, interactors_B=None, random_state=None, verbose=True):
    '''Generates a random list of protein-protein interactions.

    Parameters
    ----------
    max_size : int
        Maximum number of interactions to obtain. Because the PPIs
        are obtained by independently resampling interactors A and B
        rather than by enumerating all possible combinations (which may
        demand too much memory), some PPIs can be duplicated; dropping
        these duplicates can yield fewer PPIs than max_size.

    interactors_A : list
        A list of protein names to include in the first column of
        the PPIs.

    interactors_B : list, default=None
        A list of protein names to include in the second column
        of the PPIs. If None, interactors_A will be used as
        interactors_B too.

    random_state : int, default=None
        Seed for randomization.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    ppi_data : pandas.DataFrame
        DataFrame containing a list of protein-protein interactions.
        It has three columns: 'A', 'B', and 'score' for interactors
        A, B and weights of interactions, respectively.
    '''
    if interactors_B is not None:
        assert max_size <= len(interactors_A)*len(interactors_B), "The maximum size can't be greater than all combinations between partners A and B"
    else:
        assert max_size <= len(interactors_A)**2, "The maximum size can't be greater than all combinations of partners A"


    if verbose:
        print('Generating random PPI network.')

    def small_block_ppi(size, interactors_A, interactors_B, random_state):
        if random_state is not None:
            random_state += 1
        if interactors_B is None:
            interactors_B = interactors_A

        col_A = resample(interactors_A, n_samples=size, random_state=random_state)
        col_B = resample(interactors_B, n_samples=size, random_state=random_state)

        ppi_data = pd.DataFrame()
        ppi_data['A'] = col_A
        ppi_data['B'] = col_B
        ppi_data = ppi_data.assign(score=1.0)  # assign() returns a new DataFrame; keep the result

        ppi_data = ppi.remove_ppi_bidirectionality(ppi_data, ('A', 'B'), verbose=verbose)
        ppi_data = ppi_data.drop_duplicates()
        ppi_data.reset_index(inplace=True, drop=True)
        return ppi_data

    ppi_data = small_block_ppi(max_size*2, interactors_A, interactors_B, random_state)

    # TODO: This part need to be fixed, it does not converge to the max_size -> len((set(A)) * len(set(B) - set(A)))
    # while ppi_data.shape[0] < size:
    #     if random_state is not None:
    #         random_state += 2
    #     b = small_block_ppi(size, interactors_A, interactors_B, random_state)
    #     print(b)
    #     ppi_data = pd.concat([ppi_data, b])
    #     ppi_data = ppi.remove_ppi_bidirectionality(ppi_data, ('A', 'B'), verbose=verbose)
    #     ppi_data = ppi_data.drop_duplicates()
    #     ppi_data.dropna()
    #     ppi_data.reset_index(inplace=True, drop=True)
    #     print(ppi_data.shape[0])

    if ppi_data.shape[0] > max_size:
        ppi_data = ppi_data.loc[list(range(max_size)), :]
    ppi_data.reset_index(inplace=True, drop=True)
    return ppi_data
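
Why the result can contain fewer than max_size interactions is easiest to see in a reduced sketch: candidates are drawn with replacement, so repeated pairs appear and are then dropped. The helper sample_unique_ppis is hypothetical. It uses random.choices in place of the resample call above and skips the bidirectionality removal, so it only illustrates the oversample-then-deduplicate idea.

```python
import random


def sample_unique_ppis(max_size, interactors_a, interactors_b=None, seed=None):
    """Hypothetical helper: oversample pairs with replacement, then deduplicate."""
    rng = random.Random(seed)
    if interactors_b is None:
        interactors_b = interactors_a
    # Draw twice as many candidates as requested, with replacement.
    col_a = rng.choices(interactors_a, k=max_size * 2)
    col_b = rng.choices(interactors_b, k=max_size * 2)
    # Drop duplicated pairs while keeping first-seen order.
    unique_pairs = list(dict.fromkeys(zip(col_a, col_b)))
    return unique_pairs[:max_size]


proteins = ["P1", "P2", "P3"]
ppis = sample_unique_ppis(max_size=9, interactors_a=proteins, seed=1)
print(len(ppis), "unique PPIs out of a requested maximum of 9")
```

With only three proteins there are nine possible ordered pairs, so 18 random draws will often miss some of them and the output falls short of max_size.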

generate_random_rnaseq(size, row_names, random_state=None, verbose=True)

Generates an RNA-seq dataset that is normally distributed gene-wise and size normalized (each column sums up to a million).

Parameters

size : int Number of cell-types/tissues/samples (columns).

row_names : array-like List containing the name of genes (rows).

random_state : int, default=None Seed for randomization.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

df : pandas.DataFrame Dataframe containing gene expression given the list of genes for each cell-type/tissue/sample.

Source code in cell2cell/datasets/random_data.py
def generate_random_rnaseq(size, row_names, random_state=None, verbose=True):
    '''
    Generates an RNA-seq dataset that is normally distributed gene-wise and size
    normalized (each column sums up to a million).

    Parameters
    ----------
    size : int
        Number of cell-types/tissues/samples (columns).

    row_names : array-like
        List containing the name of genes (rows).

    random_state : int, default=None
        Seed for randomization.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    df : pandas.DataFrame
        Dataframe containing gene expression given the list
        of genes for each cell-type/tissue/sample.
    '''
    if verbose:
        print('Generating random RNA-seq dataset.')
    columns = ['Cell-{}'.format(c) for c in range(1, size+1)]

    if random_state is not None:
        np.random.seed(random_state)
    data = np.random.randn(len(row_names), len(columns))    # Normal distribution
    row_min = np.abs(np.amin(data, axis=1))    # Renamed to avoid shadowing the built-in min
    row_min = row_min.reshape((len(row_min), 1))

    data = data + row_min
    df = pd.DataFrame(data, index=row_names, columns=columns)
    if verbose:
        print('Normalizing random RNA-seq dataset (into TPM)')
    df = rnaseq.scale_expression_by_sum(df, axis=0, sum_value=1e6)
    return df
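
The two transformations applied to the random values, shifting each gene so that it is non-negative and then scaling each sample to a fixed total, can be sketched with the standard library alone. The helper random_expression is hypothetical and returns plain Python structures instead of a DataFrame. It mirrors the steps of the listing above but not its random number stream.

```python
import random


def random_expression(gene_names, n_samples, seed=None, total=1e6):
    """Hypothetical helper: non-negative random expression, columns summing to total."""
    rng = random.Random(seed)
    raw = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in gene_names]
    # Shift each gene (row) by the absolute value of its minimum.
    shifted = [[v + abs(min(row)) for v in row] for row in raw]
    # Scale each sample (column) so that it sums to `total`.
    col_sums = [sum(row[j] for row in shifted) for j in range(n_samples)]
    scaled = [[row[j] / col_sums[j] * total if col_sums[j] else 0.0
               for j in range(n_samples)] for row in shifted]
    columns = ["Cell-{}".format(c) for c in range(1, n_samples + 1)]
    return columns, dict(zip(gene_names, scaled))


genes = ["G{}".format(i) for i in range(20)]
columns, expr = random_expression(genes, n_samples=3, seed=0)
totals = [sum(values[j] for values in expr.values()) for j in range(3)]
print(columns, totals)
```

Because each gene is shifted by its own minimum, a gene whose smallest draw is negative ends up with an expression of exactly zero in at least one sample.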

toy_data

generate_toy_distance()

Generates a square matrix with cell-cell distance.

Returns

distance : pandas.DataFrame DataFrame with Euclidean-like distance between each pair of cells in the toy RNA-seq dataset.

Source code in cell2cell/datasets/toy_data.py
def generate_toy_distance():
    '''Generates a square matrix with cell-cell distance.

    Returns
    -------
    distance : pandas.DataFrame
        DataFrame with Euclidean-like distance between each
        pair of cells in the toy RNA-seq dataset.
    '''
    data = np.asarray([[0.0, 10.0, 12.0, 5.0, 3.0],
                       [10.0, 0.0, 15.0, 8.0, 9.0],
                       [12.0, 15.0, 0.0, 4.5, 7.5],
                       [5.0, 8.0, 4.5, 0.0, 6.5],
                       [3.0, 9.0, 7.5, 6.5, 0.0],
                       ])
    distance = pd.DataFrame(data,
                            index=['C1', 'C2', 'C3', 'C4', 'C5'],
                            columns=['C1', 'C2', 'C3', 'C4', 'C5']
                            )
    return distance

generate_toy_metadata()

Generates metadata for cells in the toy RNA-seq dataset.

Returns

metadata : pandas.DataFrame DataFrame with metadata for each cell. Metadata contains the major groups of those cells.

Source code in cell2cell/datasets/toy_data.py
def generate_toy_metadata():
    '''Generates metadata for cells in the toy RNA-seq dataset.

    Returns
    -------
    metadata : pandas.DataFrame
        DataFrame with metadata for each cell. Metadata contains the
        major groups of those cells.
    '''
    data = np.asarray([['C1', 'G1'],
                       ['C2', 'G2'],
                       ['C3', 'G3'],
                       ['C4', 'G3'],
                       ['C5', 'G1']
                       ])

    metadata = pd.DataFrame(data, columns=['#SampleID', 'Groups'])
    return metadata

generate_toy_ppi(prot_complex=False)

Generates a toy list of protein-protein interactions.

Parameters

prot_complex : boolean, default=False Whether to include PPIs whose interactors may be multimeric complexes.

Returns

ppi : pandas.DataFrame Dataframe containing PPIs. Columns are 'A' (first interacting partners), 'B' (second interacting partners) and 'score' for weighting each PPI.

Source code in cell2cell/datasets/toy_data.py
def generate_toy_ppi(prot_complex=False):
    '''Generates a toy list of protein-protein interactions.

    Parameters
    ----------
    prot_complex : boolean, default=False
        Whether to include PPIs whose interactors can be
        multimeric complexes.

    Returns
    -------
    ppi : pandas.DataFrame
        Dataframe containing PPIs. Columns are 'A' (first interacting
        partners), 'B' (second interacting partners) and 'score'
        for weighting each PPI.
    '''
    if prot_complex:
        data = np.asarray([['Protein-A', 'Protein-B'],
                           ['Protein-B', 'Protein-C'],
                           ['Protein-C', 'Protein-A'],
                           ['Protein-B', 'Protein-B'],
                           ['Protein-B', 'Protein-A'],
                           ['Protein-E', 'Protein-F'],
                           ['Protein-F', 'Protein-F'],
                           ['Protein-C&Protein-E', 'Protein-F'],
                           ['Protein-B', 'Protein-E'],
                           ['Protein-A&Protein-B', 'Protein-F'],
                           ])
    else:
        data = np.asarray([['Protein-A', 'Protein-B'],
                           ['Protein-B', 'Protein-C'],
                           ['Protein-C', 'Protein-A'],
                           ['Protein-B', 'Protein-B'],
                           ['Protein-B', 'Protein-A'],
                           ['Protein-E', 'Protein-F'],
                           ['Protein-F', 'Protein-F'],
                           ['Protein-C', 'Protein-F'],
                           ['Protein-B', 'Protein-E'],
                           ['Protein-A', 'Protein-F'],
                           ])
    ppi = pd.DataFrame(data, columns=['A', 'B'])
    ppi = ppi.assign(score=1.0)
    return ppi
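
With `prot_complex=True`, some interactors are written as several protein names joined by `&` (for example `Protein-C&Protein-E`). A minimal sketch of splitting such labels into their subunits follows. The helper `split_complex` is hypothetical, the `&` separator is inferred from the toy data only, and how cell2cell itself processes complexes is not shown in this section.

```python
def split_complex(interactor, complex_sep='&'):
    """Split an interactor label into its subunits (hypothetical helper)."""
    return interactor.split(complex_sep)


# Subset of the toy PPIs generated with prot_complex=True
ppis = [('Protein-A', 'Protein-B'),
        ('Protein-C&Protein-E', 'Protein-F'),
        ('Protein-A&Protein-B', 'Protein-F')]

# Each PPI becomes a pair of subunit lists
subunits = [(split_complex(a), split_complex(b)) for a, b in ppis]

# All individual proteins that would need to be present in the expression data
proteins = sorted({p for a, b in subunits for p in a + b})
```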

generate_toy_rnaseq()

Generates a toy RNA-seq dataset

Returns

rnaseq : pandas.DataFrame DataFrame containing the toy RNA-seq dataset. Columns are cells and rows are genes.

Source code in cell2cell/datasets/toy_data.py
def generate_toy_rnaseq():
    '''Generates a toy RNA-seq dataset

    Returns
    -------
    rnaseq : pandas.DataFrame
        DataFrame containing the toy RNA-seq dataset. Columns
        are cells and rows are genes.
    '''
    data = np.asarray([[5, 10, 8, 15, 2],
                       [15, 5, 20, 1, 30],
                       [18, 12, 5, 40, 20],
                       [9, 30, 22, 5, 2],
                       [2, 1, 1, 27, 15],
                       [30, 11, 16, 5, 12],
                       ])

    rnaseq = pd.DataFrame(data,
                          index=['Protein-A', 'Protein-B', 'Protein-C', 'Protein-D', 'Protein-E', 'Protein-F'],
                          columns=['C1', 'C2', 'C3', 'C4', 'C5']
                          )
    rnaseq.index.name = 'gene_id'
    return rnaseq

external special

goenrich

gene2go(filename, experimental=False, tax_id=9606, **kwds)

read go-annotation file

:param filename: path to the GO annotation file (passed to pandas.read_csv)
:param experimental: use only experimentally validated annotations
:param tax_id: filter according to taxon

Source code in cell2cell/external/goenrich.py
def gene2go(filename, experimental=False, tax_id=9606, **kwds):
    """ read go-annotation file

    :param filename: path to the GO annotation file (passed to pandas.read_csv)
    :param experimental: use only experimentally validated annotations
    :param tax_id: filter according to taxon
    """
    defaults = {'comment': '#',
                'names': GENE2GO_COLUMNS}
    defaults.update(kwds)
    result = pd.read_csv(filename, sep='\t', **defaults)

    retain_mask = result.tax_id == tax_id
    result.drop(result.index[~retain_mask], inplace=True)

    if experimental:
        retain_mask = result.Evidence.isin(EXPERIMENTAL_EVIDENCE)
        result.drop(result.index[~retain_mask], inplace=True)

    return result
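
The two filtering steps in `gene2go` (keep one taxon, then optionally keep experimentally supported rows) can be sketched on a small in-memory table. The column subset and the evidence codes below are assumptions for illustration; the real `GENE2GO_COLUMNS` and `EXPERIMENTAL_EVIDENCE` constants are defined elsewhere in `goenrich.py` and are not shown here.

```python
import io

import pandas as pd

# Hypothetical column subset and evidence codes (assumptions, see lead-in)
columns = ['tax_id', 'GeneID', 'GO_ID', 'Evidence']
experimental_evidence = ('EXP', 'IDA', 'IPI', 'IMP', 'IGI', 'IEP')

# Made-up tab-separated records with a commented header line
raw = io.StringIO(
    "#tax_id\tGeneID\tGO_ID\tEvidence\n"
    "9606\t1\tGO:0005576\tIDA\n"
    "9606\t2\tGO:0005615\tIEA\n"
    "10090\t3\tGO:0005576\tIDA\n"
)
table = pd.read_csv(raw, sep='\t', comment='#', names=columns)

# Step 1: keep a single taxon
human = table[table.tax_id == 9606]
# Step 2: keep rows whose evidence code is in the experimental list
human_experimental = human[human.Evidence.isin(experimental_evidence)]
```

Unlike the source above, this sketch builds new filtered frames instead of dropping rows in place; the retained rows are the same.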

goa(filename, experimental=True, **kwds)

read go-annotation file

:param filename: path to the GO annotation file (passed to pandas.read_csv)
:param experimental: use only experimentally validated annotations

Source code in cell2cell/external/goenrich.py
def goa(filename, experimental=True, **kwds):
    """ read go-annotation file

    :param filename: path to the GO annotation file (passed to pandas.read_csv)
    :param experimental: use only experimentally validated annotations
    """
    defaults = {'comment': '!',
                'names': GENE_ASSOCIATION_COLUMNS}

    if experimental and 'usecols' in kwds:
        kwds['usecols'] += ('evidence_code',)

    defaults.update(kwds)
    result = pd.read_csv(filename, sep='\t', **defaults)

    if experimental:
        retain_mask = result.evidence_code.isin(EXPERIMENTAL_EVIDENCE)
        result.drop(result.index[~retain_mask], inplace=True)

    return result

ontology(file)

read ontology from file

:param file: file path or file handle

Source code in cell2cell/external/goenrich.py
def ontology(file):
    """ read ontology from file
    :param file: file path or file handle
    """
    O = nx.DiGraph()

    if isinstance(file, str):
        f = open(file)
        we_opened_file = True
    else:
        f = file
        we_opened_file = False

    try:
        tokens = _tokenize(f)
        terms = _filter_terms(tokens)
        entries = _parse_terms(terms)
        nodes, edges = zip(*entries)
        O.add_nodes_from(nodes)
        O.add_edges_from(itertools.chain.from_iterable(edges))
        O.graph['roots'] = {data['name'] : n for n, data in O.nodes.items()
                if data['name'] == data['namespace']}
    finally:
        if we_opened_file:
            f.close()

    for root in O.graph['roots'].values():
        for n, depth in nx.shortest_path_length(O, root).items():
            node = O.nodes[n]
            node['depth'] = min(depth, node.get('depth', float('inf')))
    return O.reverse()
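
The last loop in `ontology` gives every term a `depth` equal to its shortest distance from any root. The sketch below reproduces that idea with a plain breadth-first search instead of networkx. The mini-ontology and the helper `min_depths` are made up for illustration, and the sketch assumes edges are followed from the roots outward.

```python
from collections import deque

# Hypothetical mini-ontology: each term maps to the terms reachable from it
children = {'root': ['a', 'b'], 'a': ['c'], 'b': ['c', 'd'],
            'c': ['e'], 'd': [], 'e': []}


def min_depths(children, roots):
    """Breadth-first search giving each term its minimum distance to any root."""
    depth = {}
    for root in roots:
        seen = {root: 0}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for child in children.get(node, []):
                if child not in seen:
                    seen[child] = seen[node] + 1
                    queue.append(child)
        # Keep the smallest depth found so far across all roots
        for node, d in seen.items():
            depth[node] = min(d, depth.get(node, float('inf')))
    return depth


depths = min_depths(children, roots=['root'])
```

Term `c` is reachable through both `a` and `b`, and the search records the shorter of the two routes.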

sgd(filename, experimental=False, **kwds)

read yeast genome database go-annotation file

:param filename: path to the GO annotation file (passed to pandas.read_csv)
:param experimental: use only experimentally validated annotations

Source code in cell2cell/external/goenrich.py
def sgd(filename, experimental=False, **kwds):
    """ read yeast genome database go-annotation file
    :param filename: path to the GO annotation file (passed to pandas.read_csv)
    :param experimental: use only experimentally validated annotations
    """
    return goa(filename, experimental, **kwds)

gseapy

generate_lr_geneset(lr_list, complex_sep=None, lr_sep='^', pathway_per_gene=None, organism='human', pathwaydb='GOBP', min_pathways=15, max_pathways=10000, readable_name=False, output_folder=None)

Generate a gene set from a list of LR pairs.

Parameters

lr_list : list List of LR pairs.

complex_sep : str, default=None Separator of the members of a complex. If None, the ligand and receptor are assumed to be single genes.

lr_sep : str, default='^' Separator of the ligand and receptor in the LR pair.

pathway_per_gene : dict, default=None Dictionary with genes as keys and pathways as values. You can pass this if you are using annotations other than those available in cell2cell.datasets.gsea_data.gsea_msig().

organism : str, default='human' Organism for which the DB will be loaded. Available options are {'human', 'mouse'}.

pathwaydb : str, default='GOBP' Molecular Signature Database to load. Available options are {'GOBP', 'KEGG', 'Reactome'}.

min_pathways : int, default=15 Minimum number of LR pairs that a pathway must contain to be kept in the returned gene set.

max_pathways : int, default=10000 Maximum number of LR pairs that a pathway can contain to be kept in the returned gene set.

readable_name : boolean, default=False If True, the pathway names are transformed to a more readable format.

output_folder : str, default=None Path to store the GMT file. If None, it stores the gmt file in the current directory.

Returns

lr_set : dict Dictionary with pathways as keys and LR pairs as values.

Source code in cell2cell/external/gseapy.py
def generate_lr_geneset(lr_list, complex_sep=None, lr_sep='^', pathway_per_gene=None, organism='human', pathwaydb='GOBP',
                        min_pathways=15, max_pathways=10000, readable_name=False, output_folder=None):
    '''Generate a gene set from a list of LR pairs.

    Parameters
    ----------
    lr_list : list
        List of LR pairs.

    complex_sep : str, default=None
        Separator of the members of a complex. If None, the ligand and receptor are assumed to be single genes.

    lr_sep : str, default='^'
        Separator of the ligand and receptor in the LR pair.

    pathway_per_gene : dict, default=None
        Dictionary with genes as keys and pathways as values.
        You can pass this if you are using annotations other than those
        available in `cell2cell.datasets.gsea_data.gsea_msig()`.

    organism : str, default='human'
        Organism for which the DB will be loaded.
        Available options are {'human', 'mouse'}.

    pathwaydb : str, default='GOBP'
        Molecular Signature Database to load.
        Available options are {'GOBP', 'KEGG', 'Reactome'}

    min_pathways : int, default=15
        Minimum number of LR pairs that a pathway must contain to be kept.

    max_pathways : int, default=10000
        Maximum number of LR pairs that a pathway can contain to be kept.

    readable_name : boolean, default=False
        If True, the pathway names are transformed to a more readable format.

    output_folder : str, default=None
        Path to store the GMT file. If None, it stores the gmt file in the
        current directory.

    Returns
    -------
    lr_set : dict
        Dictionary with pathways as keys and LR pairs as values.
    '''
    # Check if the LR gene set is already available
    _check_pathwaydb(organism, pathwaydb)

    # Obtain annotations
    gmt_info = PATHWAY_DATA[organism][pathwaydb].copy()
    if output_folder is not None:
        gmt_info['filename'] = os.path.join(output_folder, gmt_info['filename'])
    if pathway_per_gene is None:
        pathway_per_gene = load_gmt(readable_name=readable_name, **gmt_info)

    # Dictionary to save the LR interaction (key) and the annotated pathways (values).
    pathway_sets = defaultdict(set)

    # Iterate through the interactions in the LR DB.
    for lr_label in lr_list:
        lr = lr_label.split(lr_sep)

        # Gene members of the ligand and the receptor in the LR pair
        if complex_sep is None:
            ligands = [lr[0]]
            receptors = [lr[1]]
        else:
            ligands = lr[0].split(complex_sep)
            receptors = lr[1].split(complex_sep)

        # Find pathways associated with all members of the ligand
        for i, ligand in enumerate(ligands):
            if i == 0:
                ligand_pathways = pathway_per_gene[ligand]
            else:
                ligand_pathways = ligand_pathways.intersection(pathway_per_gene[ligand])

        # Find pathways associated with all members of the receptor
        for i, receptor in enumerate(receptors):
            if i == 0:
                receptor_pathways = pathway_per_gene[receptor]
            else:
                receptor_pathways = receptor_pathways.intersection(pathway_per_gene[receptor])

        # Keep only pathways that are in both ligand and receptor.
        lr_pathways = ligand_pathways.intersection(receptor_pathways)
        for p in lr_pathways:
            pathway_sets[p] = pathway_sets[p].union([lr_label])

    lr_set = defaultdict(set)

    for k, v in pathway_sets.items():
        if min_pathways <= len(v) <= max_pathways:
            lr_set[k] = v
    return lr_set
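
The core of `generate_lr_geneset` is a set intersection: a pathway is assigned to an LR pair only if every gene of the ligand and every gene of the receptor is annotated to it. A minimal sketch under made-up annotations is shown below. The gene and pathway names, the helper `pathways_for`, and the `&` complex separator are illustrative assumptions (the function itself defaults to `complex_sep=None`).

```python
from collections import defaultdict

# Hypothetical gene -> pathways annotation, in the format returned by load_gmt
pathway_per_gene = defaultdict(set, {
    'L1': {'P1', 'P2'},
    'R1': {'P1', 'P3'},
    'R2': {'P1', 'P2'},
    'L2': {'P2'},
})


def pathways_for(label, lr_sep='^', complex_sep='&'):
    """Pathways shared by every gene on both sides of an LR pair (illustrative helper)."""
    ligand, receptor = label.split(lr_sep)
    genes = ligand.split(complex_sep) + receptor.split(complex_sep)
    shared = set(pathway_per_gene[genes[0]])
    for gene in genes[1:]:
        shared &= pathway_per_gene[gene]
    return shared


lr_list = ['L1^R1', 'L1^R1&R2', 'L2^R2']
lr_set = defaultdict(set)
for label in lr_list:
    for pathway in pathways_for(label):
        lr_set[pathway].add(label)
lr_set = dict(lr_set)

# Size filter, applied to the number of LR pairs collected per pathway
filtered = {p: lrs for p, lrs in lr_set.items() if 2 <= len(lrs) <= 10000}
```

Intersecting over all genes at once gives the same result as intersecting the ligand and receptor pathway sets separately and then combining them, as the source does.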

load_gmt(filename, backup_url=None, readable_name=False)

Load a GMT file.

Parameters

filename : str Path to the GMT file.

backup_url : str, default=None URL to download the GMT file from if not present locally.

readable_name : boolean, default=False If True, the pathway names are transformed to a more readable format. That is, removing underscores and pathway DB name at the beginning.

Returns

pathway_per_gene : dict Dictionary with genes as keys and pathways as values.

Source code in cell2cell/external/gseapy.py
def load_gmt(filename, backup_url=None, readable_name=False):
    '''Load a GMT file.

    Parameters
    ----------
    filename : str
        Path to the GMT file.

    backup_url : str, default=None
        URL to download the GMT file from if not present locally.

    readable_name : boolean, default=False
        If True, the pathway names are transformed to a more readable format.
        That is, removing underscores and pathway DB name at the beginning.

    Returns
    -------
    pathway_per_gene : dict
        Dictionary with genes as keys and pathways as values.
    '''
    from pathlib import Path

    path = Path(filename)
    if path.is_file():
        f = open(path, 'rb')
    else:
        if backup_url is not None:
            try:
                _download(backup_url, path)
            except ValueError:  # invalid URL
                print('Invalid filename or URL')
            f = open(path, 'rb')
        else:
            print('Invalid filename')

    pathway_per_gene = defaultdict(set)
    with f:
        for i, line in enumerate(f):
            l = line.decode("utf-8").split('\t')
            l[-1] = l[-1].replace('\n', '')
            l = [pw for pw in l if ('http' not in pw)]  # Remove website info
            pathway_name = l[0]
            if readable_name:
                pathway_name = ' '.join(pathway_name.split('_')[1:])
            for gene in l[1:]:
                pathway_per_gene[gene] = pathway_per_gene[gene].union(set([pathway_name]))
    return pathway_per_gene
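
To make the parsing step concrete, the sketch below applies the same per-line logic to two made-up GMT records held in memory. The helper `parse_gmt`, the pathway names, and the URLs are illustrative assumptions; file lookup and downloading are left out.

```python
import io
from collections import defaultdict

# Two made-up GMT records: pathway name, URL, then gene symbols
gmt_bytes = io.BytesIO(
    b"GOBP_CELL_ADHESION\thttp://example.org/a\tGENE1\tGENE2\n"
    b"GOBP_IMMUNE_RESPONSE\thttp://example.org/b\tGENE2\tGENE3\n"
)


def parse_gmt(handle, readable_name=False):
    """Map each gene to the set of pathways that list it (illustrative helper)."""
    pathway_per_gene = defaultdict(set)
    for line in handle:
        fields = line.decode('utf-8').rstrip('\n').split('\t')
        fields = [x for x in fields if 'http' not in x]  # drop the URL column
        name = fields[0]
        if readable_name:
            # Drop the DB prefix (first token) and replace underscores with spaces
            name = ' '.join(name.split('_')[1:])
        for gene in fields[1:]:
            pathway_per_gene[gene].add(name)
    return pathway_per_gene


genes = parse_gmt(gmt_bytes, readable_name=True)
```

The result is keyed by gene rather than by pathway, which is the orientation `generate_lr_geneset` expects for its `pathway_per_gene` argument.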

run_gsea(loadings, lr_set, output_folder, weight=1, min_size=15, permutations=999, processes=6, random_state=6, significance_threshold=0.05)

Run GSEA using the LR gene set.

Parameters

loadings : pandas.DataFrame Dataframe with the loadings of the LR pairs for each factor.

lr_set : dict Dictionary with pathways as keys and LR pairs as values. LR pairs must match the indexes in the loadings dataframe.

output_folder : str Path to the output folder.

weight : int, default=1 Weight to use for score underlying the GSEA (parameter p).

min_size : int, default=15 Minimum number of LR pairs that a pathway must contain.

permutations : int, default=999 Number of permutations to use for the GSEA. The total permutations will be this number plus 1 (this extra case is the unpermuted one).

processes : int, default=6 Number of processes to use for the GSEA.

random_state : int, default=6 Random seed to use for the GSEA.

significance_threshold : float, default=0.05 Significance threshold to use for the FDR correction.

Returns

pvals : pandas.DataFrame Dataframe containing the FDR-adjusted P-values for each pathway (rows) in each of the factors (columns).

scores : pandas.DataFrame Dataframe containing the Normalized Enrichment Scores (NES) for each pathway (rows) in each of the factors (columns).

gsea_df : pandas.DataFrame Dataframe with the detailed GSEA results.

Source code in cell2cell/external/gseapy.py
def run_gsea(loadings, lr_set, output_folder, weight=1, min_size=15, permutations=999, processes=6,
             random_state=6, significance_threshold=0.05):
    '''Run GSEA using the LR gene set.

    Parameters
    ----------
    loadings : pandas.DataFrame
        Dataframe with the loadings of the LR pairs for each factor.

    lr_set : dict
        Dictionary with pathways as keys and LR pairs as values.
        LR pairs must match the indexes in the loadings dataframe.

    output_folder : str
        Path to the output folder.

    weight : int, default=1
        Weight to use for score underlying the GSEA (parameter p).

    min_size : int, default=15
        Minimum number of LR pairs that a pathway must contain.

    permutations : int, default=999
        Number of permutations to use for the GSEA. The total permutations
        will be this number plus 1 (this extra case is the unpermuted one).

    processes : int, default=6
        Number of processes to use for the GSEA.

    random_state : int, default=6
        Random seed to use for the GSEA.

    significance_threshold : float, default=0.05
        Significance threshold to use for the FDR correction.

    Returns
    -------
    pvals : pandas.DataFrame
        Dataframe containing the FDR-adjusted P-values for each pathway
        (rows) in each of the factors (columns).

    scores : pandas.DataFrame
        Dataframe containing the Normalized Enrichment Scores (NES)
        for each pathway (rows) in each of the factors (columns).

    gsea_df : pandas.DataFrame
        Dataframe with the detailed GSEA results.
    '''
    import numpy as np
    import pandas as pd

    from statsmodels.stats.multitest import fdrcorrection
    from cell2cell.io.directories import create_directory

    create_directory(output_folder)
    gseapy = _check_if_gseapy()

    lr_set_ = lr_set.copy()
    df = loadings.reset_index()
    for factor in tqdm(df.columns[1:]):
        # Rank LR pairs of each factor by their respective loadings
        test = df[['index', factor]]
        test.columns = [0, 1]
        test = test.sort_values(by=1, ascending=False)
        test.reset_index(drop=True, inplace=True)

        # RUN GSEA
        gseapy.prerank(rnk=test,
                       gene_sets=lr_set_,
                       min_size=min_size,
                       weighted_score_type=weight,
                       processes=processes,
                       permutation_num=permutations,  # reduce number to speed up testing
                       outdir=output_folder + '/GSEA/' + factor, format='pdf', seed=random_state)

    # Adjust P-values
    pvals = []
    terms = []
    factors = []
    nes = []
    for factor in df.columns[1:]:
        p_report = pd.read_csv(output_folder + '/GSEA/' + factor + '/gseapy.gene_set.prerank.report.csv')
        pval = p_report['NOM p-val'].values.tolist()
        pvals.extend(pval)
        terms.extend(p_report.Term.values.tolist())
        factors.extend([factor] * len(pval))
        nes.extend(p_report['NES'].values.tolist())
    gsea_df = pd.DataFrame(np.asarray([factors, terms, nes, pvals]).T, columns=['Factor', 'Term', 'NES', 'P-value'])
    gsea_df = gsea_df.loc[gsea_df['P-value'] != 'nan']
    gsea_df['P-value'] = pd.to_numeric(gsea_df['P-value'])
    gsea_df['P-value'] = gsea_df['P-value'].replace(0., 1. / (permutations + 1))
    gsea_df['NES'] = pd.to_numeric(gsea_df['NES'])
    # Corrected P-value
    gsea_df['Adj. P-value'] = fdrcorrection(gsea_df['P-value'].values,
                                            alpha=significance_threshold)[1]
    gsea_df.to_excel(output_folder + '/GSEA/GSEA-Adj-Pvals.xlsx')

    pvals = gsea_df.pivot(index="Term", columns="Factor", values="Adj. P-value").fillna(1.)
    scores = gsea_df.pivot(index="Term", columns="Factor", values="NES").fillna(0)

    # Sort factors
    sorted_columns = [f for f in df.columns[1:] if (f in pvals.columns) & (f in scores.columns)]
    pvals = pvals[sorted_columns]
    scores = scores[sorted_columns]
    return pvals, scores, gsea_df
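
The P-value handling after the GSEA runs has two steps: P-values of exactly 0 are replaced by `1 / (permutations + 1)`, and all P-values are then adjusted together. The sketch below illustrates both steps in plain Python, assuming the adjustment is the Benjamini-Hochberg procedure (to my understanding the default of statsmodels' `fdrcorrection`). The helper `benjamini_hochberg` and the example P-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


permutations = 999
raw = [0.0, 0.01, 0.04, 0.5]

# A P-value of 0 is replaced by the smallest value the permutation test can resolve
floored = [1.0 / (permutations + 1) if p == 0.0 else p for p in raw]
adjusted = benjamini_hochberg(floored)
```

Because the correction is applied once over all factors and pathways, adding more factors makes the adjustment stricter for every term.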

pcoa

pcoa(distance_matrix, method='eigh', number_of_dimensions=0, inplace=False)

Perform Principal Coordinate Analysis.

Principal Coordinate Analysis (PCoA) is a method similar to Principal Components Analysis (PCA), with the difference that PCoA operates on distance matrices, typically with non-Euclidean and thus ecologically meaningful distances like UniFrac in microbiome research.

In ecology, the Euclidean distance preserved by PCA is often not a good choice because it deals poorly with double zeros. Species have unimodal distributions along environmental gradients, so if a species is absent from two sites at the same time, it can't be known if an environmental variable is too high in one of them and too low in the other, or too low in both, etc. On the other hand, if a species is present in two sites, that means that the sites are similar.

Note that the returned eigenvectors are not normalized to unit length.

Parameters

distance_matrix : pandas.DataFrame A distance matrix.

method : str, optional Eigendecomposition method to use in performing PCoA. By default, uses SciPy's eigh, which computes exact eigenvectors and eigenvalues for all dimensions. The alternate method, fsvd, uses faster heuristic eigendecomposition but loses accuracy. The magnitude of accuracy lost is dependent on the dataset.

number_of_dimensions : int, optional Dimensions to reduce the distance matrix to. This number determines how many eigenvectors and eigenvalues will be returned. By default, equal to the number of dimensions of the distance matrix, as default eigendecomposition using SciPy's eigh method computes all eigenvectors and eigenvalues. If using fast heuristic eigendecomposition through fsvd, a desired number of dimensions should be specified. Note that the default eigendecomposition method eigh does not natively support specifying a number of dimensions to reduce a matrix to, so if this parameter is specified, all eigenvectors and eigenvalues will simply be computed with no speed gain, and only the number specified by number_of_dimensions will be returned. Specifying a value of 0, the default, will set number_of_dimensions equal to the number of dimensions of the specified distance_matrix.

inplace : bool, optional If true, centers a distance matrix in-place in a manner that reduces memory consumption.

Returns

ordination_dict : dict Dictionary that stores the PCoA results: eigenvalues ('eigvals'), the proportion explained by each of them ('proportion_explained'), and transformed sample coordinates ('samples'), plus the short and long method names.

See Also

OrdinationResults

Notes

If the distance is not Euclidean (for example if it is a semimetric and the triangle inequality doesn't hold), negative eigenvalues can appear. There are different ways to deal with that problem (see Legendre & Legendre 1998, § 9.2.3), but none are currently implemented here. However, a warning is raised whenever negative eigenvalues appear, allowing the user to decide if they can be safely ignored.

Source code in cell2cell/external/pcoa.py
def pcoa(distance_matrix, method="eigh", number_of_dimensions=0,
         inplace=False):
    r"""Perform Principal Coordinate Analysis.
    Principal Coordinate Analysis (PCoA) is a method similar
    to Principal Components Analysis (PCA) with the difference that PCoA
    operates on distance matrices, typically with non-Euclidean and thus
    ecologically meaningful distances like UniFrac in microbiome research.
    In ecology, the Euclidean distance preserved by Principal
    Component Analysis (PCA) is often not a good choice because it
    deals poorly with double zeros (Species have unimodal
    distributions along environmental gradients, so if a species is
    absent from two sites at the same time, it can't be known if an
    environmental variable is too high in one of them and too low in
    the other, or too low in both, etc. On the other hand, if a
    species is present in two sites, that means that the sites are
    similar.).
    Note that the returned eigenvectors are not normalized to unit length.
    Parameters
    ----------
    distance_matrix : pandas.DataFrame
        A distance matrix.
    method : str, optional
        Eigendecomposition method to use in performing PCoA.
        By default, uses SciPy's `eigh`, which computes exact
        eigenvectors and eigenvalues for all dimensions. The alternate
        method, `fsvd`, uses faster heuristic eigendecomposition but loses
        accuracy. The magnitude of accuracy lost is dependent on dataset.
    number_of_dimensions : int, optional
        Dimensions to reduce the distance matrix to. This number determines
        how many eigenvectors and eigenvalues will be returned.
        By default, equal to the number of dimensions of the distance matrix,
        as default eigendecomposition using SciPy's `eigh` method computes
        all eigenvectors and eigenvalues. If using fast heuristic
        eigendecomposition through `fsvd`, a desired number of dimensions
        should be specified. Note that the default eigendecomposition
        method `eigh` does not natively support specifying a number of
        dimensions to reduce a matrix to, so if this parameter is specified,
        all eigenvectors and eigenvalues will simply be computed with no
        speed gain, and only the number specified by `number_of_dimensions`
        will be returned. Specifying a value of `0`, the default, will
        set `number_of_dimensions` equal to the number of dimensions of the
        specified `distance_matrix`.
    inplace : bool, optional
        If true, centers a distance matrix in-place in a manner that reduces
        memory consumption.
    Returns
    -------
    OrdinationResults
        Object that stores the PCoA results, including eigenvalues, the
        proportion explained by each of them, and transformed sample
        coordinates.
    See Also
    --------
    OrdinationResults
    Notes
    -----
    .. note:: If the distance is not euclidean (for example if it is a
        semimetric and the triangle inequality doesn't hold),
        negative eigenvalues can appear. There are different ways
        to deal with that problem (see Legendre & Legendre 1998, \S
        9.2.3), but none are currently implemented here.
        However, a warning is raised whenever negative eigenvalues
        appear, allowing the user to decide if they can be safely
        ignored.
    """
    distance_matrix = convert_to_distance_matrix(distance_matrix)

    # Center distance matrix, a requirement for PCoA here
    matrix_data = center_distance_matrix(distance_matrix.values, inplace=inplace)

    # If no dimension specified, by default will compute all eigenvectors
    # and eigenvalues
    if number_of_dimensions == 0:
        if method == "fsvd" and matrix_data.shape[0] > 10:
            warn("FSVD: since no value for number_of_dimensions is specified, "
                 "PCoA for all dimensions will be computed, which may "
                 "result in long computation time if the original "
                 "distance matrix is large.", RuntimeWarning)

        # distance_matrix is guaranteed to be square
        number_of_dimensions = matrix_data.shape[0]
    elif number_of_dimensions < 0:
        raise ValueError('Invalid operation: cannot reduce distance matrix '
                         'to negative dimensions using PCoA. Did you intend '
                         'to specify the default value "0", which sets '
                         'the number_of_dimensions equal to the '
                         'dimensionality of the given distance matrix?')

    # Perform eigendecomposition
    if method == "eigh":
        # eigh does not natively support specifying number_of_dimensions, i.e.
        # there are no speed gains unlike in FSVD. Later, we slice off unwanted
        # dimensions to conform the result of eigh to the specified
        # number_of_dimensions.

        eigvals, eigvecs = eigh(matrix_data)
        long_method_name = "Principal Coordinate Analysis"
    elif method == "fsvd":
        eigvals, eigvecs = _fsvd(matrix_data, number_of_dimensions)
        long_method_name = "Approximate Principal Coordinate Analysis " \
                           "using FSVD"
    else:
        raise ValueError(
            "PCoA eigendecomposition method {} not supported.".format(method))

    # cogent makes eigenvalues positive by taking the
    # abs value, but that doesn't seem to be an approach accepted
    # by L&L to deal with negative eigenvalues. We raise a warning
    # in that case. First, we make values close to 0 equal to 0.
    negative_close_to_zero = np.isclose(eigvals, 0)
    eigvals[negative_close_to_zero] = 0
    if np.any(eigvals < 0):
        warn(
            "The result contains negative eigenvalues."
            " Please compare their magnitude with the magnitude of some"
            " of the largest positive eigenvalues. If the negative ones"
            " are smaller, it's probably safe to ignore them, but if they"
            " are large in magnitude, the results won't be useful. See the"
            " Notes section for more details. The smallest eigenvalue is"
            " {0} and the largest is {1}.".format(eigvals.min(),
                                                  eigvals.max()),
            RuntimeWarning
        )

    # eigvals might not be ordered, so we first sort them, then analogously
    # sort the eigenvectors by the ordering of the eigenvalues too
    idxs_descending = eigvals.argsort()[::-1]
    eigvals = eigvals[idxs_descending]
    eigvecs = eigvecs[:, idxs_descending]

    # If we return only the coordinates that make sense (i.e., that have a
    # corresponding positive eigenvalue), then Jackknifed Beta Diversity
    # won't work as it expects all the OrdinationResults to have the same
    # number of coordinates. In order to solve this issue, we return the
    # coordinates that have a negative eigenvalue as 0
    num_positive = (eigvals >= 0).sum()
    eigvecs[:, num_positive:] = np.zeros(eigvecs[:, num_positive:].shape)
    eigvals[num_positive:] = np.zeros(eigvals[num_positive:].shape)

    if method == "fsvd":
        # Since the dimension parameter, hereafter referred to as 'd',
        # restricts the number of eigenvalues and eigenvectors that FSVD
        # computes, we need to use an alternative method to compute the sum
        # of all eigenvalues, used to compute the array of proportions
        # explained. Otherwise, the proportions calculated will only be
        # relative to d number of dimensions computed; whereas we want
        # it to be relative to the entire dimensionality of the
        # centered distance matrix.

        # An alternative method of calculating the sum of eigenvalues is by
        # computing the trace of the centered distance matrix.
        # See proof outlined here: https://goo.gl/VAYiXx
        sum_eigenvalues = np.trace(matrix_data)
    else:
        # Calculate proportions the usual way
        sum_eigenvalues = np.sum(eigvals)

    proportion_explained = eigvals / sum_eigenvalues

    # In case eigh is used, eigh computes all eigenvectors and -values.
    # So if number_of_dimensions was specified, we manually need to ensure
    # only the requested number of dimensions
    # (number of eigenvectors and eigenvalues, respectively) are returned.
    eigvecs = eigvecs[:, :number_of_dimensions]
    eigvals = eigvals[:number_of_dimensions]
    proportion_explained = proportion_explained[:number_of_dimensions]

    # Scale eigenvectors to have length = sqrt(eigenvalue). This
    # works because np.linalg.eigh returns normalized
    # eigenvectors. Each row contains the coordinates of the
    # objects in the space of principal coordinates. Note that at
    # least one eigenvalue is zero because only n-1 axes are
    # needed to represent n points in a euclidean space.
    coordinates = eigvecs * np.sqrt(eigvals)

    axis_labels = ["PC%d" % i for i in range(1, number_of_dimensions + 1)]

    ordination_dict = {'short_method_name' : "PCoA",
                       'long_method_name' : long_method_name,
                       'eigvals' : pd.Series(eigvals, index=axis_labels),
                       'samples' : pd.DataFrame(coordinates, index=distance_matrix.index, columns=axis_labels),
                       'proportion_explained' :pd.Series(proportion_explained, index=axis_labels)
                       }
    return ordination_dict
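
The `eigh` path above can be sketched end to end with NumPy on a small Euclidean example. This is a minimal illustration, not the function itself: the points are made up, and the Gower centering of `-0.5 * D**2` is an assumption about what `center_distance_matrix` does, since that helper is not shown here.

```python
import numpy as np

# Four made-up points in 2D with known coordinates
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Gower centering of -0.5 * D^2 (assumed to match center_distance_matrix)
n = dist.shape[0]
centering = np.eye(n) - np.ones((n, n)) / n
gower = centering @ (-0.5 * dist ** 2) @ centering

# Exact eigendecomposition, sorted so the largest eigenvalue comes first
eigvals, eigvecs = np.linalg.eigh(gower)
order = eigvals.argsort()[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
eigvals[np.isclose(eigvals, 0)] = 0  # remove numerical noise around zero

proportion_explained = eigvals / eigvals.sum()
coordinates = eigvecs * np.sqrt(eigvals)  # scale each axis by sqrt(eigenvalue)

# Distances between the recovered coordinates
rec = coordinates[:, None, :] - coordinates[None, :, :]
rec_dist = np.sqrt((rec ** 2).sum(axis=-1))
```

For Euclidean input such as this, the recovered coordinates reproduce the original distances up to rotation and reflection, and only the first two axes carry variance.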

pcoa_biplot(ordination, y)

Compute the projection of descriptors into a PCoA matrix.

This implementation is as described in Chapter 9 of Legendre & Legendre, Numerical Ecology 3rd edition.

Parameters

ordination : OrdinationResults The computed principal coordinates analysis of dimensions (n, c) where the matrix y will be projected onto.

y : DataFrame Samples by features table of dimensions (n, m). These can be environmental features or abundance counts. This table should be normalized in cases of dimensionally heterogeneous physical variables.

Returns

OrdinationResults The modified input object that includes projected features onto the ordination space in the features attribute.

Source code in cell2cell/external/pcoa.py
def pcoa_biplot(ordination, y):
    """Compute the projection of descriptors into a PCoA matrix
    This implementation is as described in Chapter 9 of Legendre & Legendre,
    Numerical Ecology 3rd edition.
    Parameters
    ----------
    ordination: OrdinationResults
        The computed principal coordinates analysis of dimensions (n, c) where
        the matrix ``y`` will be projected onto.
    y: DataFrame
        Samples by features table of dimensions (n, m). These can be
        environmental features or abundance counts. This table should be
        normalized in cases of dimensionally heterogeneous physical variables.
    Returns
    -------
    OrdinationResults
        The modified input object that includes projected features onto the
        ordination space in the ``features`` attribute.
    """

    # acknowledge that most saved ordinations lack a name, however if they have
    # a name, it should be PCoA
    if (ordination['short_method_name'] != '' and
            ordination['short_method_name'] != 'PCoA'):
        raise ValueError('This biplot computation can only be performed in a '
                         'PCoA matrix.')

    if set(y.index) != set(ordination['samples'].index):
        raise ValueError('The eigenvectors and the descriptors must describe '
                         'the same samples.')

    eigvals = ordination['eigvals']
    coordinates = ordination['samples']
    N = coordinates.shape[0]

    # align the descriptors and eigenvectors in a sample-wise fashion
    y = y.reindex(coordinates.index)

    # S_pc from equation 9.44
    # Represents the covariance matrix between the features matrix and the
    # column-centered eigenvectors of the pcoa.
    spc = (1 / (N - 1)) * y.values.T.dot(scale(coordinates, ddof=1))

    # U_proj from equation 9.55, is the matrix of descriptors to be projected.
    #
    # Only get the power of non-zero values, otherwise this will raise a
    # divide by zero warning. There shouldn't be negative eigenvalues(?)
    Uproj = np.sqrt(N - 1) * spc.dot(np.diag(np.power(eigvals, -0.5,
                                                      where=eigvals > 0)))

    ordination['features'] = pd.DataFrame(data=Uproj,
                                       index=y.columns.copy(),
                                       columns=coordinates.columns.copy())

    return ordination
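
The two matrix products in the listing above (S_pc, then U_proj) can be traced on a toy case. The sketch below is NumPy-only and makes several assumptions that are not part of the library: the data table `y` is hypothetical, the PCoA is built directly from centered data (the Euclidean case), and only the two positive axes are kept.

```python
import numpy as np

# Hypothetical samples-by-features table (5 samples, 2 features).
y = np.array([[1.0, 2.0], [3.0, 1.0], [4.0, 6.0], [7.0, 5.0], [5.0, 9.0]])
y_centered = y - y.mean(axis=0)
n = y.shape[0]

# Euclidean case: the centered matrix F equals Yc @ Yc.T.
f = y_centered @ y_centered.T
eigvals, eigvecs = np.linalg.eigh(f)
order = np.argsort(eigvals)[::-1][:2]  # keep the two positive axes
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
coordinates = eigvecs * np.sqrt(eigvals)

# Standardize the coordinates column-wise with ddof=1, as in the listing.
standardized = (coordinates - coordinates.mean(axis=0)) / coordinates.std(axis=0, ddof=1)

# S_pc: covariance between the descriptors and the standardized axes.
spc = y_centered.T @ standardized / (n - 1)

# U_proj: descriptors projected onto the ordination space.
u_proj = np.sqrt(n - 1) * spc @ np.diag(eigvals ** -0.5)
```

In this particular setup the projected descriptors have unit length, and multiplying the centered data by `u_proj` gives back the sample coordinates, which is a convenient sanity check.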

pcoa_utils

center_distance_matrix(distance_matrix, inplace=False)

Centers a distance matrix.

Note: If the distance used was Euclidean, pairwise distances needn't be computed from the data table Y because F_matrix = Y.dot(Y.T) (if Y has been centered). But since we're expecting distance_matrix to be non-Euclidean, we do the following computation as per Numerical Ecology (Legendre & Legendre 1998).

Parameters

distance_matrix : 2D array_like Distance matrix.

inplace : bool, optional Whether or not to center the given distance matrix in-place, which is more efficient in terms of memory and computation.

Source code in cell2cell/external/pcoa_utils.py
def center_distance_matrix(distance_matrix, inplace=False):
    """
    Centers a distance matrix.
    Note: If the distance used was Euclidean, pairwise distances
    needn't be computed from the data table Y because F_matrix =
    Y.dot(Y.T) (if Y has been centered).
    But since we're expecting distance_matrix to be non-Euclidean,
    we do the following computation as per
    Numerical Ecology (Legendre & Legendre 1998).
    Parameters
    ----------
    distance_matrix : 2D array_like
        Distance matrix.
    inplace : bool, optional
        Whether or not to center the given distance matrix in-place, which
        is more efficient in terms of memory and computation.
    """
    if inplace:
        return _f_matrix_inplace(_e_matrix_inplace(distance_matrix))
    else:
        return f_matrix(e_matrix(distance_matrix))

corr(x, y=None)

Computes correlation between columns of x, or x and y.

Correlation is covariance of (columnwise) standardized matrices, so each matrix is first centered and scaled to have variance one, and then their covariance is computed.

Parameters

x : 2D array_like Matrix of shape (n, p). Correlation between its columns will be computed.

y : 2D array_like, optional Matrix of shape (n, q). If provided, the correlation is computed between the columns of x and the columns of y. Else, it's computed between the columns of x.

Returns

correlation Matrix of computed correlations. Has shape (p, p) if y is not provided, else has shape (p, q).

Source code in cell2cell/external/pcoa_utils.py
def corr(x, y=None):
    """Computes correlation between columns of `x`, or `x` and `y`.
    Correlation is covariance of (columnwise) standardized matrices,
    so each matrix is first centered and scaled to have variance one,
    and then their covariance is computed.
    Parameters
    ----------
    x : 2D array_like
        Matrix of shape (n, p). Correlation between its columns will
        be computed.
    y : 2D array_like, optional
        Matrix of shape (n, q). If provided, the correlation is
        computed between the columns of `x` and the columns of
        `y`. Else, it's computed between the columns of `x`.
    Returns
    -------
    correlation
        Matrix of computed correlations. Has shape (p, p) if `y` is
        not provided, else has shape (p, q).
    """
    x = np.asarray(x)
    if y is not None:
        y = np.asarray(y)
        if y.shape[0] != x.shape[0]:
            raise ValueError("Both matrices must have the same number of rows")
        x, y = scale(x), scale(y)
    else:
        x = scale(x)
        y = x
    # Notice that scaling was performed with ddof=0 (dividing by n,
    # the default), so now we need to remove it by also using ddof=0
    # (dividing by n)
    return x.T.dot(y) / x.shape[0]
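
As a quick check of the idea described above (correlation as the covariance of column-standardized matrices), the following sketch repeats the computation with plain NumPy and compares it with `np.corrcoef`. The input matrix is a hypothetical example.

```python
import numpy as np

# Hypothetical matrix with 4 observations (rows) and 3 variables (columns).
x = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 1.0],
              [4.0, 3.0, 3.0]])

# Standardize each column with ddof=0 (divide by n).
x_std = (x - x.mean(axis=0)) / x.std(axis=0)

# Covariance of the standardized columns, again dividing by n.
correlation = x_std.T @ x_std / x.shape[0]

# Reference result from NumPy, treating columns as variables.
reference = np.corrcoef(x, rowvar=False)
```

The key point is that the same divisor (n) is used for both the scaling and the final covariance, so the two factors cancel consistently.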

e_matrix(distance_matrix)

Compute E matrix from a distance matrix. Squares and divides by -2 the input elementwise. Eq. 9.20 in Legendre & Legendre 1998.

Source code in cell2cell/external/pcoa_utils.py
def e_matrix(distance_matrix):
    """Compute E matrix from a distance matrix.
    Squares and divides by -2 the input elementwise. Eq. 9.20 in
    Legendre & Legendre 1998."""
    return distance_matrix * distance_matrix / -2

f_matrix(E_matrix)

Compute F matrix from E matrix. Centring step: for each element, the means of the corresponding row and column are subtracted, and the mean of the whole matrix is added. Eq. 9.21 in Legendre & Legendre 1998.

Source code in cell2cell/external/pcoa_utils.py
def f_matrix(E_matrix):
    """Compute F matrix from E matrix.
    Centring step: for each element, the means of the corresponding
    row and column are subtracted, and the mean of the whole
    matrix is added. Eq. 9.21 in Legendre & Legendre 1998."""
    row_means = E_matrix.mean(axis=1, keepdims=True)
    col_means = E_matrix.mean(axis=0, keepdims=True)
    matrix_mean = E_matrix.mean()
    return E_matrix - row_means - col_means + matrix_mean
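
The note in center_distance_matrix says that, for Euclidean distances, the centered matrix equals Y.dot(Y.T) once Y is centered. The sketch below checks this on a toy example. The two helpers are copied from the listings above so the block runs on its own; the data points are hypothetical.

```python
import numpy as np

def e_matrix(distance_matrix):
    # Square the distances elementwise and divide by -2 (Eq. 9.20).
    return distance_matrix * distance_matrix / -2

def f_matrix(E_matrix):
    # Subtract row and column means, then add the grand mean (Eq. 9.21).
    row_means = E_matrix.mean(axis=1, keepdims=True)
    col_means = E_matrix.mean(axis=0, keepdims=True)
    return E_matrix - row_means - col_means + E_matrix.mean()

# Hypothetical data table: the corners of a 3 x 4 rectangle.
y = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)

# Centered matrix obtained from the distances alone.
f = f_matrix(e_matrix(dist))

# Same matrix obtained directly from the centered data table.
y_centered = y - y.mean(axis=0)
gram = y_centered @ y_centered.T
```

After centering, every row and every column of `f` has mean zero, which is why at least one eigenvalue of the centered matrix is zero.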

mean_and_std(a, axis=None, weights=None, with_mean=True, with_std=True, ddof=0)

Compute the weighted average and standard deviation along the specified axis.

Parameters

a : array_like Calculate average and standard deviation of these values.

axis : int, optional Axis along which the statistics are computed. The default is to compute them on the flattened array.

weights : array_like, optional An array of weights associated with the values in a. Each value in a contributes to the average according to its associated weight. The weights array can either be 1-D (in which case its length must be the size of a along the given axis) or of the same shape as a. If weights=None, then all data in a are assumed to have a weight equal to one.

with_mean : bool, optional, defaults to True Compute average if True.

with_std : bool, optional, defaults to True Compute standard deviation if True.

ddof : int, optional, defaults to 0 It means delta degrees of freedom. Variance is calculated by dividing by n - ddof (where n is the number of elements). By default it computes the maximum likelihood estimator.

Returns

average, std Return the average and standard deviation along the specified axis. If either of them was not requested, None is returned in its place.

Source code in cell2cell/external/pcoa_utils.py
def mean_and_std(a, axis=None, weights=None, with_mean=True, with_std=True,
                 ddof=0):
    """Compute the weighted average and standard deviation along the
    specified axis.
    Parameters
    ----------
    a : array_like
        Calculate average and standard deviation of these values.
    axis : int, optional
        Axis along which the statistics are computed. The default is
        to compute them on the flattened array.
    weights : array_like, optional
        An array of weights associated with the values in `a`. Each
        value in `a` contributes to the average according to its
        associated weight. The weights array can either be 1-D (in
        which case its length must be the size of `a` along the given
        axis) or of the same shape as `a`. If `weights=None`, then all
        data in `a` are assumed to have a weight equal to one.
    with_mean : bool, optional, defaults to True
        Compute average if True.
    with_std : bool, optional, defaults to True
        Compute standard deviation if True.
    ddof : int, optional, defaults to 0
        It means delta degrees of freedom. Variance is calculated by
        dividing by `n - ddof` (where `n` is the number of
        elements). By default it computes the maximum likelihood
        estimator.
    Returns
    -------
    average, std
        Return the average and standard deviation along the specified
        axis. If any of them was not required, returns `None` instead
    """
    if not (with_mean or with_std):
        raise ValueError("Either the mean or standard deviation need to be"
                         " computed.")
    a = np.asarray(a)
    if weights is None:
        avg = a.mean(axis=axis) if with_mean else None
        std = a.std(axis=axis, ddof=ddof) if with_std else None
    else:
        avg = np.average(a, axis=axis, weights=weights)
        if with_std:
            if axis is None:
                variance = np.average((a - avg)**2, weights=weights)
            else:
                # Make sure that the subtraction to compute variance works for
                # multidimensional arrays
                a_rolled = np.rollaxis(a, axis)
                # Numpy doesn't have a weighted std implementation, but this is
                # stable and fast
                variance = np.average((a_rolled - avg)**2, axis=0,
                                      weights=weights)
            if ddof != 0:  # Don't waste time if variance doesn't need scaling
                if axis is None:
                    variance *= a.size / (a.size - ddof)
                else:
                    variance *= a.shape[axis] / (a.shape[axis] - ddof)
            std = np.sqrt(variance)
        else:
            std = None
        avg = avg if with_mean else None
    return avg, std
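
To make the weighted branch above concrete, here is a small sketch (plain NumPy, hypothetical numbers) showing that integer weights behave like repeated observations when ddof is 0.

```python
import numpy as np

# Hypothetical values and integer weights.
values = np.array([1.0, 2.0, 4.0])
weights = np.array([1.0, 1.0, 2.0])

# Weighted average: (1*1 + 1*2 + 2*4) / 4.
avg = np.average(values, weights=weights)

# Weighted variance around the weighted average, then the standard deviation.
variance = np.average((values - avg) ** 2, weights=weights)
std = np.sqrt(variance)

# The same statistics from the data with the last value repeated twice.
expanded = np.array([1.0, 2.0, 4.0, 4.0])
```

Here `avg` and `std` agree with `expanded.mean()` and `expanded.std()`, which is one way to interpret what the weights do.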

scale(a, weights=None, with_mean=True, with_std=True, ddof=0, copy=True)

Scale array by columns to have weighted average 0 and standard deviation 1.

Parameters

a : array_like 2D array whose columns are standardized according to the weights.

weights : array_like, optional Array of weights associated with the columns of a. By default, the scaling is unweighted.

with_mean : bool, optional, defaults to True Center columns to have 0 weighted mean.

with_std : bool, optional, defaults to True Scale columns to have unit weighted std.

ddof : int, optional, defaults to 0 If with_std is True, variance is calculated by dividing by n - ddof (where n is the number of elements). By default it computes the maximum likelihood estimator.

copy : bool, optional, defaults to True If True, standardize a copy of a and leave the input untouched; if False, standardize in place when possible.

Returns

2D ndarray Scaled array.

Notes

Wherever std equals 0, it is replaced by 1 in order to avoid division by zero.

Source code in cell2cell/external/pcoa_utils.py
def scale(a, weights=None, with_mean=True, with_std=True, ddof=0, copy=True):
    """Scale array by columns to have weighted average 0 and standard
    deviation 1.
    Parameters
    ----------
    a : array_like
        2D array whose columns are standardized according to the
        weights.
    weights : array_like, optional
        Array of weights associated with the columns of `a`. By
        default, the scaling is unweighted.
    with_mean : bool, optional, defaults to True
        Center columns to have 0 weighted mean.
    with_std : bool, optional, defaults to True
        Scale columns to have unit weighted std.
    ddof : int, optional, defaults to 0
        If with_std is True, variance is calculated by dividing by `n
        - ddof` (where `n` is the number of elements). By default it
        computes the maximum likelihood estimator.
    copy : bool, optional, defaults to True
        If True, standardize a copy of `a` and leave the input untouched;
        if False, standardize in place when possible.
    Returns
    -------
    2D ndarray
        Scaled array.
    Notes
    -----
    Wherever std equals 0, it is replaced by 1 in order to avoid
    division by zero.
    """
    if copy:
        a = a.copy()
    a = np.asarray(a, dtype=np.float64)
    avg, std = mean_and_std(a, axis=0, weights=weights, with_mean=with_mean,
                            with_std=with_std, ddof=ddof)
    if with_mean:
        a -= avg
    if with_std:
        std[std == 0] = 1.0
        a /= std
    return a
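
The note about zero standard deviations is easiest to see on a constant column. This NumPy-only sketch mirrors the unweighted path of the listing above on a hypothetical array.

```python
import numpy as np

# Hypothetical array: the second column is constant, so its std is 0.
a = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])

avg = a.mean(axis=0)
std = a.std(axis=0)  # ddof=0

# Replace zero std by 1 so the division below is safe.
std[std == 0] = 1.0

scaled = (a - avg) / std
```

The constant column becomes all zeros instead of NaN, while the other column ends up with mean 0 and standard deviation 1.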

svd_rank(M_shape, S, tol=None)

Matrix rank of M given its singular values S. See np.linalg.matrix_rank for a rationale on the tolerance (we're not using that function because it doesn't let us reuse a precomputed SVD).

Source code in cell2cell/external/pcoa_utils.py
def svd_rank(M_shape, S, tol=None):
    """Matrix rank of `M` given its singular values `S`.
    See `np.linalg.matrix_rank` for a rationale on the tolerance
    (we're not using that function because it doesn't let us reuse a
    precomputed SVD)."""
    if tol is None:
        tol = S.max() * max(M_shape) * np.finfo(S.dtype).eps
    return np.sum(S > tol)
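
A short sketch of the same tolerance rule, applied to a hypothetical matrix whose second row is twice the first (so its rank is 2). The singular values are computed once and reused, which is the motivation given in the docstring.

```python
import numpy as np

# Hypothetical rank-deficient matrix: row 2 is 2 * row 1.
m = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])

# Singular values only; in practice these would come from an existing SVD.
s = np.linalg.svd(m, compute_uv=False)

# Default tolerance: largest singular value * largest dimension * machine epsilon.
tol = s.max() * max(m.shape) * np.finfo(s.dtype).eps

# Count singular values above the tolerance.
rank = int(np.sum(s > tol))
```

The result agrees with `np.linalg.matrix_rank(m)`, which applies the same default tolerance.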

umap

run_umap(rnaseq_data, axis=1, metric='euclidean', min_dist=0.4, n_neighbors=8, random_state=None, **kwargs)

Runs UMAP on an expression matrix.

Parameters

rnaseq_data : pandas.DataFrame A dataframe of gene expression values wherein the rows are the genes or embeddings of a dimensionality reduction method and columns the cells, tissues or samples.

axis : int, default=1 An axis of the dataframe (0 across rows, 1 across columns). Across rows means that the UMAP is to compare genes, while across columns is to compare cells, tissues or samples.

metric : str, default='euclidean' The distance metric to use. The distance function can be 'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.

min_dist : float, default=0.4 The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.

n_neighbors : float, default=8 The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

random_state : int, default=None Seed for randomization.

**kwargs : dict Extra arguments for UMAP as defined in umap.UMAP.

Returns

umap_df : pandas.DataFrame Dataframe containing the UMAP embeddings for the axis analyzed. Contains columns 'umap1' and 'umap2'.

Source code in cell2cell/external/umap.py
def run_umap(rnaseq_data, axis=1, metric='euclidean', min_dist=0.4, n_neighbors=8, random_state=None, **kwargs):
    '''Runs UMAP on an expression matrix.
    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        A dataframe of gene expression values wherein the rows are the genes or
        embeddings of a dimensionality reduction method and columns the cells,
        tissues or samples.

    axis : int, default=1
        An axis of the dataframe (0 across rows, 1 across columns).
        Across rows means that the UMAP is to compare genes, while
        across columns is to compare cells, tissues or samples.

    metric : str, default='euclidean'
        The distance metric to use. The distance function can be 'braycurtis',
        'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice',
        'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski',
        'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao',
        'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.

    min_dist: float, default=0.4
        The effective minimum distance between embedded points. Smaller values
        will result in a more clustered/clumped embedding where nearby points
        on the manifold are drawn closer together, while larger values will
        result in a more even dispersal of points. The value should be set
        relative to the ``spread`` value, which determines the scale at which
        embedded points will be spread out.

    n_neighbors: float, default=8
        The size of local neighborhood (in terms of number of neighboring
        sample points) used for manifold approximation. Larger values
        result in more global views of the manifold, while smaller
        values result in more local data being preserved. In general
        values should be in the range 2 to 100.

    random_state : int, default=None
        Seed for randomization.

    **kwargs : dict
        Extra arguments for UMAP as defined in umap.UMAP.

    Returns
    -------
    umap_df : pandas.DataFrame
        Dataframe containing the UMAP embeddings for the axis analyzed.
        Contains columns 'umap1' and 'umap2'.
    '''
    # Organize data
    if axis == 0:
        df = rnaseq_data
    elif axis == 1:
        df = rnaseq_data.T
    else:
        raise ValueError("The parameter axis must be either 0 or 1.")

    # Compute distances
    D = sp.distance.pdist(df, metric=metric)
    D_sq = sp.distance.squareform(D)

    # Run UMAP
    model = umap.UMAP(metric="precomputed",
                      min_dist=min_dist,
                      n_neighbors=n_neighbors,
                      random_state=random_state,
                      **kwargs
                      )

    trans_D = model.fit_transform(D_sq)

    # Organize results
    umap_df = pd.DataFrame(trans_D, columns=['umap1', 'umap2'], index=df.index)
    return umap_df
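
The effect of the axis parameter on the "Compute distances" step can be sketched without UMAP itself. The block below is a NumPy-only stand-in for pdist followed by squareform with the Euclidean metric; the helper name `pairwise_euclidean` and the toy matrix are assumptions of this example, and the embedding step is left out.

```python
import numpy as np

# Hypothetical toy matrix: 3 genes (rows) x 4 samples (columns).
expression = np.array([[1.0, 2.0, 0.0, 5.0],
                       [0.0, 1.0, 3.0, 2.0],
                       [4.0, 0.0, 1.0, 1.0]])

def pairwise_euclidean(rows):
    # Square matrix of Euclidean distances between the rows of the input.
    diff = rows[:, None, :] - rows[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# axis=0: rows are compared as they are, so genes are compared.
d_genes = pairwise_euclidean(expression)

# axis=1: the matrix is transposed first, so samples are compared.
d_samples = pairwise_euclidean(expression.T)
```

With axis=1 the square distance matrix has one row per sample, and that is the matrix handed to UMAP with metric="precomputed" in the listing above.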

io

directories

create_directory(pathname)

Creates a directory.

Uses a path to create a directory. It creates all intermediate folders before creating the leaf folder.

Parameters

pathname : str Full path of the folder to create.

Source code in cell2cell/io/directories.py
def create_directory(pathname):
    '''Creates a directory.

    Uses a path to create a directory. It creates
    all intermediate folders before creating the
    leaf folder.

    Parameters
    ----------
    pathname : str
        Full path of the folder to create.
    '''
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
        print("{} was created successfully.".format(pathname))
    else:
        print("{} already exists.".format(pathname))
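
The behavior described above (intermediate folders are created, and an existing folder is not an error) comes from os.makedirs. The sketch below shows it in a temporary directory; the nested folder names are hypothetical, and `exist_ok=True` is used here in place of the explicit isdir check in the listing.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical nested path; none of the intermediate folders exist yet.
    leaf = os.path.join(root, "results", "figures", "umap")

    # Creates "results", "results/figures" and the leaf folder in one call.
    os.makedirs(leaf, exist_ok=True)
    created = os.path.isdir(leaf)
    intermediate_exists = os.path.isdir(os.path.join(root, "results", "figures"))

    # Calling it again on an existing folder does not raise.
    os.makedirs(leaf, exist_ok=True)
    still_there = os.path.isdir(leaf)
```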

get_files_from_directory(pathname, dir_in_filepath=False)

Obtains a list of filenames in a folder.

Parameters

pathname : str Full path of the folder to explore.

dir_in_filepath : boolean, default=False Whether to prepend pathname to the filenames.

Returns

filenames : list A list containing the names (strings) of the files in the folder.

Source code in cell2cell/io/directories.py
def get_files_from_directory(pathname, dir_in_filepath=False):
    '''Obtains a list of filenames in a folder.

    Parameters
    ----------
    pathname : str
        Full path of the folder to explore.

    dir_in_filepath : boolean, default=False
        Whether to prepend `pathname` to the filenames.

    Returns
    -------
    filenames : list
        A list containing the names (strings) of the files
        in the folder.
    '''
    directory = os.fsencode(pathname)
    filenames = [pathname + '/' + os.fsdecode(file) if dir_in_filepath else os.fsdecode(file) for file in os.listdir(directory)]
    return filenames
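
The two output modes can be illustrated with the standard library alone. This sketch uses a temporary folder with hypothetical file names and os.path.join for the full paths (the listing above concatenates with '/' instead). Note that os.listdir returns every entry in the folder, including subfolders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create two hypothetical files to list.
    for name in ("a.csv", "b.csv"):
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("x\n")

    # Equivalent of dir_in_filepath=False: bare names only.
    names = sorted(os.listdir(folder))

    # Equivalent of dir_in_filepath=True: names prefixed with the folder.
    paths = sorted(os.path.join(folder, name) for name in os.listdir(folder))
    all_exist = all(os.path.isfile(path) for path in paths)
```

Sorting is added here because the order returned by os.listdir is arbitrary.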

read_data

load_cutoffs(cutoff_file, gene_column=None, drop_nangenes=True, log_transformation=False, verbose=True, **kwargs)

Loads a table of cutoff or thresholding values for each gene.

Parameters

cutoff_file : str Absolute path to a file containing thresholding values for genes. Genes are rows and threshold values are in the only column beyond the one containing the gene names.

gene_column : str, default=None Column name where the gene labels are contained. If None, the first column will be assumed to contain gene names.

drop_nangenes : boolean, default=True Whether to drop genes that are empty across all columns.

log_transformation : boolean, default=False Whether to apply a log10 transformation to the data.

verbose : boolean, default=True Whether to print the steps of the analysis.

**kwargs : dict Extra arguments for loading files with the function cell2cell.io.read_data.load_table

Returns

cutoff_data : pandas.DataFrame Dataframe with the cutoff values for each gene. Rows are genes and just one column is included, which corresponds to 'value', wherein the thresholding or cutoff values are contained.

Source code in cell2cell/io/read_data.py
def load_cutoffs(cutoff_file, gene_column=None, drop_nangenes=True, log_transformation=False, verbose=True, **kwargs):
    '''Loads a table of cutoff or thresholding values for each gene.

    Parameters
    ----------
    cutoff_file : str
        Absolute path to a file containing thresholding values for genes.
        Genes are rows and threshold values are in the only column beyond
        the one containing the gene names.

    gene_column : str, default=None
        Column name where the gene labels are contained. If None, the
        first column will be assumed to contain gene names.

    drop_nangenes : boolean, default=True
        Whether to drop genes that are empty across all columns.

    log_transformation : boolean, default=False
        Whether to apply a log10 transformation to the data.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    **kwargs : dict
        Extra arguments for loading files with the function
        cell2cell.io.read_data.load_table

    Returns
    -------
    cutoff_data : pandas.DataFrame
        Dataframe with the cutoff values for each gene. Rows are genes
        and just one column is included, which corresponds to 'value',
        wherein the thresholding or cutoff values are contained.
    '''
    if verbose:
        print("Opening Cutoff datasets from {}".format(cutoff_file))
    cutoff_data = load_table(cutoff_file, verbose=verbose, **kwargs)
    if gene_column is not None:
        cutoff_data = cutoff_data.set_index(gene_column)
    else:
        cutoff_data = cutoff_data.set_index(cutoff_data.columns[0])

    # Keep only numeric datasets
    cutoff_data = cutoff_data.select_dtypes([np.number])

    if drop_nangenes:
        cutoff_data = rnaseq.drop_empty_genes(cutoff_data)

    if log_transformation:
        cutoff_data = rnaseq.log10_transformation(cutoff_data)

    cols = list(cutoff_data.columns)
    cols[0] = 'value'
    cutoff_data.columns = cols
    return cutoff_data
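
The reshaping done after the table is loaded can be followed on an in-memory example. This sketch uses pandas directly instead of load_table, and the gene names, column names and values are hypothetical. The empty-gene filter and the log10 option are left out.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical cutoff file: gene names, one numeric column, one text column.
raw = io.StringIO(
    "gene,cutoff,source\n"
    "GeneA,1.5,lab\n"
    "GeneB,0.2,lab\n"
    "GeneC,10.0,lab\n"
)
table = pd.read_csv(raw)

# gene_column=None: the first column is taken as the gene index.
cutoffs = table.set_index(table.columns[0])

# Keep only numeric columns, which drops the text column "source".
cutoffs = cutoffs.select_dtypes([np.number])

# Rename the first remaining column to 'value'.
cols = list(cutoffs.columns)
cols[0] = 'value'
cutoffs.columns = cols
```

The result has genes as rows and a single 'value' column, which is the shape described in the Returns section.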

load_go_annotations(goa_file, experimental_evidence=True, verbose=True)

Loads GO annotations for each gene in a given organism.

Parameters

goa_file : str Absolute path to a ga file. It could also be a URL, for example: goa_file = 'http://current.geneontology.org/annotations/wb.gaf.gz'

experimental_evidence : boolean, default=True Whether to consider only annotations with experimental evidence (at least one article/evidence).

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

goa : pandas.DataFrame Dataframe containing information about GO term annotations of each gene for a given organism according to the ga file.

Source code in cell2cell/io/read_data.py
def load_go_annotations(goa_file, experimental_evidence=True, verbose=True):
    '''Loads GO annotations for each gene in a given organism.

    Parameters
    ----------
    goa_file : str
        Absolute path to a ga file. It could also be a URL, for example:
        goa_file = 'http://current.geneontology.org/annotations/wb.gaf.gz'

    experimental_evidence : boolean, default=True
        Whether to consider only annotations with experimental evidence
        (at least one article/evidence).

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    goa : pandas.DataFrame
        Dataframe containing information about GO term annotations of each
        gene for a given organism according to the ga file.
    '''
    import cell2cell.external.goenrich as goenrich

    if verbose:
        print("Opening GO annotations from {}".format(goa_file))

    goa = goenrich.goa(goa_file, experimental_evidence)
    goa_cols = list(goa.columns)
    goa = goa[goa_cols[:3] + [goa_cols[4]]]
    new_cols = ['db', 'Gene', 'Name', 'GO']
    goa.columns = new_cols
    if verbose:
        print(goa_file + ' was correctly loaded')
    return goa

load_go_terms(go_terms_file, verbose=True)

Loads GO term information from an obo-basic file.

Parameters

go_terms_file : str Absolute path to an obo file. It could also be a URL, for example: go_terms_file = 'http://purl.obolibrary.org/obo/go/go-basic.obo'

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

go_terms : networkx.Graph NetworkX Graph containing GO terms datasets from .obo file.

Source code in cell2cell/io/read_data.py
def load_go_terms(go_terms_file, verbose=True):
    '''Loads GO term information from an obo-basic file.

    Parameters
    ----------
    go_terms_file : str
        Absolute path to an obo file. It could also be a URL, for example:
        go_terms_file = 'http://purl.obolibrary.org/obo/go/go-basic.obo'

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    go_terms : networkx.Graph
        NetworkX Graph containing GO terms datasets from .obo file.
    '''
    import cell2cell.external.goenrich as goenrich

    if verbose:
        print("Opening GO terms from {}".format(go_terms_file))
    go_terms = goenrich.ontology(go_terms_file)
    if verbose:
        print(go_terms_file + ' was correctly loaded')
    return go_terms

load_metadata(metadata_file, cell_labels=None, index_col=None, **kwargs)

Loads a metadata table for a given list of cells.

Parameters

metadata_file : str Absolute path to a file containing a metadata table for cell-types/tissues/samples in a RNA-seq dataset.

cell_labels : list, default=None List of cell-types/tissues/samples to consider. Names must match the labels in the metadata table. These names must be contained in the values of the column indicated by index_col.

index_col : str, default=None Column to be considered the index of the metadata. If None, the index will be the row numbers.

**kwargs : dict Extra arguments for loading files with the function cell2cell.io.read_data.load_table

Returns

meta : pandas.DataFrame Metadata for the cell-types/tissues/samples provided.

Source code in cell2cell/io/read_data.py
def load_metadata(metadata_file, cell_labels=None, index_col=None, **kwargs):
    '''Loads a metadata table for a given list of cells.

    Parameters
    ----------
    metadata_file : str
        Absolute path to a file containing a metadata table for
        cell-types/tissues/samples in a RNA-seq dataset.

    cell_labels : list, default=None
        List of cell-types/tissues/samples to consider. Names must
        match the labels in the metadata table. These names must
        be contained in the values of the column indicated
        by index_col.

    index_col : str, default=None
        Column to be considered the index of the metadata.
        If None, the index will be the row numbers.

    **kwargs : dict
        Extra arguments for loading files with the function
        cell2cell.io.read_data.load_table

    Returns
    -------
    meta : pandas.DataFrame
        Metadata for the cell-types/tissues/samples provided.
    '''
    meta = load_table(metadata_file, **kwargs)
    if index_col is None:
        index_col = list(meta.columns)[0]
        indexing = False
    else:
        indexing = True
    if cell_labels is not None:
        meta = meta.loc[meta[index_col].isin(cell_labels)]

    if indexing:
        meta.set_index(index_col, inplace=True)
    return meta
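
The filtering and indexing logic above can be tried on an in-memory table. The sketch below uses pandas directly rather than load_table, and the column names and labels are hypothetical.

```python
import pandas as pd

# Hypothetical metadata table.
meta = pd.DataFrame({
    "cell": ["T cell", "B cell", "Monocyte"],
    "tissue": ["blood", "blood", "marrow"],
})
cell_labels = ["T cell", "Monocyte"]

# index_col=None: the first column is used for filtering only,
# and the original row numbers remain as the index.
subset = meta.loc[meta["cell"].isin(cell_labels)]

# index_col="cell": same filter, then that column becomes the index.
indexed = subset.set_index("cell")
```

In the first case the 'cell' column stays in the table; in the second it is moved into the index.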

load_ppi(ppi_file, interaction_columns, sort_values=None, score=None, rnaseq_genes=None, complex_sep=None, dropna=False, strna='', upper_letter_comparison=False, verbose=True, **kwargs)

Loads a list of protein-protein interactions from a table and returns it in a simplified format.

Parameters

ppi_file : str Absolute path to a file containing a list of protein-protein interactions.

interaction_columns : tuple Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors. Example: ('partner_A', 'partner_B').

sort_values : str, default=None Name of the column used for sorting the table. If it is not None, the whole dataframe will be sorted by that column in ascending order.

score : str, default=None Column name of a column containing weights to consider in the cell-cell interactions/communication analyses. If None, no weights are used and PPIs are assumed to have an equal contribution to CCI and CCC scores.

rnaseq_genes : list, default=None List of genes in a RNA-seq dataset to filter the list of PPIs. If None, the entire list will be used.

complex_sep : str, default=None Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44". If None, it is assumed that the list does not contain complexes.

dropna : boolean, default=False Whether to drop PPIs with any missing information.

strna : str, default='' If dropna is False, missing values will be filled with strna.

upper_letter_comparison : boolean, default=False Whether to convert the gene names in the expression matrices and the protein names in the ppi_data to uppercase in order to match their names and integrate their respective expression levels. Useful when there are inconsistencies in the names between the expression matrix and the ligand-receptor annotations.

**kwargs : dict Extra arguments for loading files with the function cell2cell.io.read_data.load_table

Returns

simplified_ppi : pandas.DataFrame A simplified list of PPIs. In this case, interaction_columns are renamed into 'A' and 'B' for the first and second interacting proteins, respectively. A third column 'score' is included, containing weights of PPIs.

Source code in cell2cell/io/read_data.py
def load_ppi(ppi_file, interaction_columns, sort_values=None, score=None, rnaseq_genes=None, complex_sep=None,
             dropna=False, strna='', upper_letter_comparison=False, verbose=True, **kwargs):
    '''Loads a list of protein-protein interactions from a table and
    returns it in a simplified format.

    Parameters
    ----------
    ppi_file : str
        Absolute path to a file containing a list of protein-protein
        interactions.

    interaction_columns : tuple
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors. Example: ('partner_A', 'partner_B').

    sort_values : str, default=None
        Name of the column used for sorting the table. If it is not None,
        the whole dataframe is sorted by that column in ascending
        order.

    score : str, default=None
        Column name of a column containing weights to consider in the cell-cell
        interactions/communication analyses. If None, no weights are used and
        PPIs are assumed to have an equal contribution to CCI and CCC scores.

    rnaseq_genes : list, default=None
        List of genes in a RNA-seq dataset to filter the list of PPIs. If None,
        the entire list will be used.

    complex_sep : str, default=None
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44". If None, it is assumed
        that the list does not contain complexes.

    dropna : boolean, default=False
        Whether to drop PPIs with any missing information.

    strna : str, default=''
        If dropna is False, missing values will be filled with strna.

    upper_letter_comparison : boolean, default=False
        Whether to convert the gene names in the expression matrices and the
        protein names in the ppi_data to uppercase so that their names match
        and their respective expression levels can be integrated. Useful when
        there are inconsistencies in the names between the expression matrix
        and the ligand-receptor annotations.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    **kwargs : dict
        Extra arguments passed to the function
        cell2cell.io.read_data.load_table when loading the file.

    Returns
    -------
    simplified_ppi : pandas.DataFrame
        A simplified list of PPIs. In this case, interaction_columns are renamed
        into 'A' and 'B' for the first and second interacting proteins, respectively.
        A third column 'score' is included, containing weights of PPIs.
    '''
    if verbose:
        print("Opening PPI datasets from {}".format(ppi_file))
    ppi_data = load_table(ppi_file, verbose=verbose,  **kwargs)

    simplified_ppi = ppi.preprocess_ppi_data(ppi_data=ppi_data,
                                             interaction_columns=interaction_columns,
                                             sort_values=sort_values,
                                             score=score,
                                             rnaseq_genes=rnaseq_genes,
                                             complex_sep=complex_sep,
                                             dropna=dropna,
                                             strna=strna,
                                             upper_letter_comparison=upper_letter_comparison,
                                             verbose=verbose)
    return simplified_ppi
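
As a rough illustration of the simplified format described above, the following pure-Python sketch mimics the documented output layout (columns 'A', 'B', and 'score') on a toy list of ligand-receptor pairs. It does not call cell2cell: the helper name `simplify_ppi`, the toy records, the `confidence` column, and the choice of a weight of 1.0 when `score` is None are all assumptions made for illustration.

```python
def simplify_ppi(records, interaction_columns, score=None, upper=False):
    """Hypothetical helper: mimic the documented 'A', 'B', 'score' layout."""
    col_a, col_b = interaction_columns
    simplified = []
    for row in records:
        a, b = row[col_a], row[col_b]
        if upper:
            # Analogous in spirit to upper_letter_comparison=True
            a, b = a.upper(), b.upper()
        # Assumption: equal contribution is represented here as a weight of 1.0
        weight = 1.0 if score is None else row[score]
        simplified.append({'A': a, 'B': b, 'score': weight})
    return simplified


# Toy ligand-receptor list (made up for this example)
toy_ppi = [
    {'partner_A': 'Mif', 'partner_B': 'CD74&CD44', 'confidence': 0.9},
    {'partner_A': 'Tgfb1', 'partner_B': 'Tgfbr1&Tgfbr2', 'confidence': 0.7},
]

simplified = simplify_ppi(toy_ppi, ('partner_A', 'partner_B'), upper=True)
weighted = simplify_ppi(toy_ppi, ('partner_A', 'partner_B'), score='confidence')

# Subunits of a complex can be recovered by splitting on complex_sep ('&' here)
subunits = simplified[0]['B'].split('&')
```

In this sketch the first interaction column always ends up in 'A' (ligands) and the second in 'B' (receptors), matching the convention stated for interaction_columns.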

load_rnaseq(rnaseq_file, gene_column, drop_nangenes=True, log_transformation=False, verbose=True, **kwargs)

Loads a gene expression matrix for a RNA-seq experiment. Preprocessing steps can be done on-the-fly.

Parameters

rnaseq_file : str Absolute path to a file containing a gene expression matrix. Genes are rows and cell-types/tissues/samples are columns.

gene_column : str Column name where the gene labels are contained.

drop_nangenes : boolean, default=True Whether to drop genes that are empty across all columns.

log_transformation : boolean, default=False Whether to apply a log10 transformation to the data.

verbose : boolean, default=True Whether to print the steps of the analysis.

**kwargs : dict Extra arguments passed to the function cell2cell.io.read_data.load_table when loading the file.

Returns

rnaseq_data : pandas.DataFrame Gene expression data for a bulk RNA-seq experiment or a single-cell experiment after aggregation into cell types. Columns are cell-types/tissues/samples and rows are genes.

Source code in cell2cell/io/read_data.py
def load_rnaseq(rnaseq_file, gene_column, drop_nangenes=True, log_transformation=False, verbose=True, **kwargs):
    '''
    Loads a gene expression matrix for a RNA-seq experiment. Preprocessing
    steps can be done on-the-fly.

    Parameters
    ----------
    rnaseq_file : str
        Absolute path to a file containing a gene expression matrix. Genes
        are rows and cell-types/tissues/samples are columns.

    gene_column : str
        Column name where the gene labels are contained.

    drop_nangenes : boolean, default=True
        Whether to drop genes that are empty across all columns.

    log_transformation : boolean, default=False
        Whether to apply a log10 transformation to the data.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    **kwargs : dict
        Extra arguments passed to the function
        cell2cell.io.read_data.load_table when loading the file.

    Returns
    -------
    rnaseq_data : pandas.DataFrame
        Gene expression data for a bulk RNA-seq experiment or a single-cell
        experiment after aggregation into cell types. Columns are
        cell-types/tissues/samples and rows are genes.
    '''
    if verbose:
        print("Opening RNAseq datasets from {}".format(rnaseq_file))
    rnaseq_data = load_table(rnaseq_file, verbose=verbose, **kwargs)
    if gene_column is not None:
        rnaseq_data = rnaseq_data.set_index(gene_column)
    # Keep only numeric datasets
    rnaseq_data = rnaseq_data.select_dtypes([np.number])

    if drop_nangenes:
        rnaseq_data = rnaseq.drop_empty_genes(rnaseq_data)

    if log_transformation:
        rnaseq_data = rnaseq.log10_transformation(rnaseq_data)

    rnaseq_data = rnaseq_data.drop_duplicates()
    return rnaseq_data

load_table(filename, format='auto', sep='\t', sheet_name=0, compression=None, verbose=True, **kwargs)

Opens a file containing a table into a pandas dataframe.

Parameters

filename : str Absolute path to a file storing a table.

format : str, default='auto' Format of the file. Options are:

- 'auto' : Automatically determines the format given
    the file extension. Files ending with .gz will be
    considered as tsv files.
- 'excel' : An excel file, either .xls or .xlsx
- 'csv' : Comma separated value format
- 'tsv' : Tab separated value format
- 'txt' : Text file

sep : str, default='\t' Separation between columns. Examples are: '\t', ' ', ';', ',', etc.

sheet_name : str, int, list, or None, default=0 Strings are used for sheet names. Integers are used in zero-indexed sheet positions. Lists of strings/integers are used to request multiple sheets. Specify None to get all sheets. Available cases:

- Defaults to 0: 1st sheet as a DataFrame
- 1: 2nd sheet as a DataFrame
- "Sheet1": Load sheet with name “Sheet1”
- [0, 1, "Sheet5"]: Load first, second and sheet named
    “Sheet5” as a dict of DataFrame
- None: All sheets.

compression : str, or None, default=None For on-the-fly decompression of on-disk data. If ‘infer’, detects compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no decompression). If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Options: {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

verbose : boolean, default=True Whether to print the steps of the analysis.

**kwargs : dict Extra arguments for loading files with the respective pandas function given the format of the file.

Returns

table : pandas.DataFrame Dataframe containing the table stored in a file.

Source code in cell2cell/io/read_data.py
def load_table(filename, format='auto', sep='\t', sheet_name=0, compression=None, verbose=True, **kwargs):
    '''Opens a file containing a table into a pandas dataframe.

    Parameters
    ----------
    filename : str
        Absolute path to a file storing a table.

    format : str, default='auto'
        Format of the file.
        Options are:

        - 'auto' : Automatically determines the format given
            the file extension. Files ending with .gz will be
            considered as tsv files.
        - 'excel' : An excel file, either .xls or .xlsx
        - 'csv' : Comma separated value format
        - 'tsv' : Tab separated value format
        - 'txt' : Text file

    sep : str, default='\t'
        Separation between columns. Examples are: '\t', ' ', ';', ',', etc.

    sheet_name : str, int, list, or None, default=0
        Strings are used for sheet names. Integers are used in zero-indexed
        sheet positions. Lists of strings/integers are used to request
        multiple sheets. Specify None to get all sheets.
        Available cases:

        - Defaults to 0: 1st sheet as a DataFrame
        - 1: 2nd sheet as a DataFrame
        - "Sheet1": Load sheet with name “Sheet1”
        - [0, 1, "Sheet5"]: Load first, second and sheet named
            “Sheet5” as a dict of DataFrame
        - None: All sheets.

    compression : str, or None, default=None
        For on-the-fly decompression of on-disk data. If ‘infer’, detects
        compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’
        (otherwise no decompression). If using ‘zip’, the ZIP file must contain
        only one data file to be read in. Set to None for no decompression.
        Options: {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    **kwargs : dict
        Extra arguments for loading files with the respective pandas function
        given the format of the file.

    Returns
    -------
    table : pandas.DataFrame
        Dataframe containing the table stored in a file.
    '''
    if filename is None:
        return None

    if format == 'auto':
        if ('.gz' in filename):
            format = 'csv'
            sep = '\t'
            compression='gzip'
        elif ('.xlsx' in filename) or ('.xls' in filename):
            format = 'excel'
        elif ('.csv' in filename):
            format = 'csv'
            sep = ','
            compression=None
        elif ('.tsv' in filename) or ('.txt' in filename):
            format = 'csv'
            sep = '\t'
            compression=None

    if format == 'excel':
        table = pd.read_excel(filename, sheet_name=sheet_name, **kwargs)
    elif (format == 'csv') | (format == 'tsv') | (format == 'txt'):
        table = pd.read_csv(filename, sep=sep, compression=compression, **kwargs)
    else:
        if verbose:
            print("Specify a correct format")
        return None
    if verbose:
        print(filename + ' was correctly loaded')
    return table
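
The 'auto' branch above can be summarized with a small standalone sketch. The helper name `resolve_auto_format` is hypothetical and not part of cell2cell; it only mirrors the order of the substring checks in the source shown above, and returns None for sep and compression where load_table leaves the caller's values untouched.

```python
def resolve_auto_format(filename):
    """Hypothetical helper mirroring the format='auto' checks of load_table.

    Returns a (format, sep, compression) tuple.
    """
    if '.gz' in filename:
        # Checked first, so any name containing '.gz' is read as gzipped tsv
        return 'csv', '\t', 'gzip'
    if '.xls' in filename:
        # Also matches '.xlsx'; sep and compression are not used for excel
        return 'excel', None, None
    if '.csv' in filename:
        return 'csv', ',', None
    if ('.tsv' in filename) or ('.txt' in filename):
        return 'csv', '\t', None
    # Unrecognized extension: the format stays 'auto' and load_table returns None
    return 'auto', None, None


resolved = {name: resolve_auto_format(name)
            for name in ['counts.tsv', 'counts.csv', 'counts.csv.gz',
                         'lr_pairs.xlsx', 'counts.parquet']}
```

Because the '.gz' check comes first, a file named counts.csv.gz is read with a tab separator under format='auto'. Passing format='csv' together with sep=',' and compression='gzip' explicitly avoids this.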

load_tables_from_directory(pathname, extension, sep='\t', sheet_name=0, compression=None, verbose=True, **kwargs)

Opens all tables with the same extension in a folder.

Parameters

pathname : str Full path of the folder to explore.

extension : str Extension of the file. Options are:

- 'excel' : An excel file, either .xls or .xlsx
- 'csv' : Comma separated value format
- 'tsv' : Tab separated value format
- 'txt' : Text file

sep : str, default='\t' Separation between columns. Examples are: '\t', ' ', ';', ',', etc.

sheet_name : str, int, list, or None, default=0 Strings are used for sheet names. Integers are used in zero-indexed sheet positions. Lists of strings/integers are used to request multiple sheets. Specify None to get all sheets. Available cases:

- Defaults to 0: 1st sheet as a DataFrame
- 1: 2nd sheet as a DataFrame
- "Sheet1": Load sheet with name “Sheet1”
- [0, 1, "Sheet5"]: Load first, second and sheet named
    “Sheet5” as a dict of DataFrame
- None: All sheets.

compression : str, or None, default=None For on-the-fly decompression of on-disk data. Set to None for no decompression. If using ‘zip’, the ZIP file must contain only one data file to be read in. The value ‘infer’ is not accepted here, because the compression name is also appended to the extension to select files (e.g. '.csv.gzip' for extension='csv' and compression='gzip'). Options: {‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

verbose : boolean, default=True Whether to print the steps of the analysis.

**kwargs : dict Extra arguments for loading files with the respective pandas function given the format of the file.

Returns

data : dict Dictionary containing the tables (pandas.DataFrame) loaded from the files. Keys are the filenames without the extension and values are the dataframes.

Source code in cell2cell/io/read_data.py
def load_tables_from_directory(pathname, extension, sep='\t', sheet_name=0, compression=None, verbose=True, **kwargs):
    '''Opens all tables with the same extension in a folder.

    Parameters
    ----------
    pathname : str
        Full path of the folder to explore.

    extension : str
        Extension of the file.
        Options are:

        - 'excel' : An excel file, either .xls or .xlsx
        - 'csv' : Comma separated value format
        - 'tsv' : Tab separated value format
        - 'txt' : Text file

    sep : str, default='\t'
        Separation between columns. Examples are: '\t', ' ', ';', ',', etc.

    sheet_name : str, int, list, or None, default=0
        Strings are used for sheet names. Integers are used in zero-indexed
        sheet positions. Lists of strings/integers are used to request
        multiple sheets. Specify None to get all sheets.
        Available cases:

        - Defaults to 0: 1st sheet as a DataFrame
        - 1: 2nd sheet as a DataFrame
        - "Sheet1": Load sheet with name “Sheet1”
        - [0, 1, "Sheet5"]: Load first, second and sheet named
            “Sheet5” as a dict of DataFrame
        - None: All sheets.

    compression : str, or None, default=None
        For on-the-fly decompression of on-disk data. Set to None for no
        decompression. If using ‘zip’, the ZIP file must contain only one
        data file to be read in. The value ‘infer’ is not accepted here,
        because the compression name is also appended to the extension to
        select files (e.g. '.csv.gzip' for extension='csv' and
        compression='gzip').
        Options: {‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    **kwargs : dict
        Extra arguments for loading files with the respective pandas function
        given the format of the file.

    Returns
    -------
    data : dict
        Dictionary containing the tables (pandas.DataFrame) loaded from the files.
        Keys are the filenames without the extension and values are the dataframes.
    '''
    assert extension in ['excel', 'csv', 'tsv', 'txt'], "Enter a valid `extension`."

    filenames = get_files_from_directory(pathname=pathname,
                                         dir_in_filepath=True)

    data = dict()
    if compression is None:
        comp = ''
    else:
        assert compression in ['gzip', 'bz2', 'zip', 'xz'], "Enter a valid `compression`."
        comp = '.' + compression
    for filename in filenames:
        if filename.endswith('.' + extension + comp):
            print('Loading {}'.format(filename))
            basename = os.path.basename(filename)
            sample = basename.split('.' + extension)[0]
            data[sample] = load_table(filename=filename,
                                      format=extension,
                                      sep=sep,
                                      sheet_name=sheet_name,
                                      compression=compression,
                                      verbose=verbose, **kwargs)
    return data
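
To make the file selection and the dictionary keys concrete, here is a minimal sketch of the suffix test and key derivation used in the loop above. The helper name `match_and_name` and the example paths are hypothetical; only the string logic is taken from the source shown.

```python
import os


def match_and_name(filename, extension, compression=None):
    """Hypothetical helper: return the dictionary key for a file, or None if skipped."""
    # The expected suffix is built from the extension and the compression name
    comp = '' if compression is None else '.' + compression
    if not filename.endswith('.' + extension + comp):
        return None
    basename = os.path.basename(filename)
    # The key is everything before the first '.<extension>' in the basename
    return basename.split('.' + extension)[0]


key_plain = match_and_name('/data/run1/sampleA.csv', 'csv')
key_gzip = match_and_name('/data/run1/sampleB.tsv.gzip', 'tsv', compression='gzip')
skipped = match_and_name('/data/run1/sampleC.tsv.gz', 'tsv', compression='gzip')
```

Under this reading, gzip-compressed files are only picked up when they are named with a '.gzip' suffix, since the suffix is built from the compression value itself.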

load_tensor(filename, backend=None, device=None)

Imports a communication tensor that could be used with Tensor-cell2cell.

Parameters

filename : str Absolute path to a file storing a communication tensor that was previously saved by using pickle.

backend : str, default=None Backend that TensorLy will use to perform calculations on this tensor. When None, the currently active backend is used, which is usually 'numpy'. Options are: {'cupy', 'jax', 'mxnet', 'numpy', 'pytorch', 'tensorflow'}

device : str, default=None Device to use when the backend allows using multiple devices. Options are: {'cpu', 'cuda:0', None}

Returns

interaction_tensor : cell2cell.tensor.BaseTensor A communication tensor generated with any of the tensor classes in cell2cell.tensor.

Source code in cell2cell/io/read_data.py
def load_tensor(filename, backend=None, device=None):
    '''Imports a communication tensor that could be used
    with Tensor-cell2cell.

    Parameters
    ----------
    filename : str
        Absolute path to a file storing a communication tensor
        that was previously saved by using pickle.

    backend : str, default=None
        Backend that TensorLy will use to perform calculations
        on this tensor. When None, the currently active backend
        is used, which is usually 'numpy'. Options are:
        {'cupy', 'jax', 'mxnet', 'numpy', 'pytorch', 'tensorflow'}

    device : str, default=None
        Device to use when backend allows using multiple devices. Options are:
         {'cpu', 'cuda:0', None}

    Returns
    -------
    interaction_tensor : cell2cell.tensor.BaseTensor
        A communication tensor generated with any of the tensor classes in
        cell2cell.tensor.
    '''
    interaction_tensor = load_variable_with_pickle(filename)

    if 'tl' not in globals():
        import tensorly as tl

    if backend is not None:
        tl.set_backend(backend)

    if device is None:
        interaction_tensor.tensor = tl.tensor(interaction_tensor.tensor)
        interaction_tensor.loc_nans = tl.tensor(interaction_tensor.loc_nans)
        interaction_tensor.loc_zeros = tl.tensor(interaction_tensor.loc_zeros)
        if interaction_tensor.mask is not None:
            interaction_tensor.mask = tl.tensor(interaction_tensor.mask)
    else:
        if tl.get_backend() in ['pytorch', 'tensorflow']:  # Potential TODO: Include other backends that support different devices
            interaction_tensor.tensor = tl.tensor(interaction_tensor.tensor, device=device)
            interaction_tensor.loc_nans = tl.tensor(interaction_tensor.loc_nans, device=device)
            interaction_tensor.loc_zeros = tl.tensor(interaction_tensor.loc_zeros, device=device)
            if interaction_tensor.mask is not None:
                interaction_tensor.mask = tl.tensor(interaction_tensor.mask, device=device)
        else:
            interaction_tensor.tensor = tl.tensor(interaction_tensor.tensor)
            interaction_tensor.loc_nans = tl.tensor(interaction_tensor.loc_nans)
            interaction_tensor.loc_zeros = tl.tensor(interaction_tensor.loc_zeros)
            if interaction_tensor.mask is not None:
                interaction_tensor.mask = tl.tensor(interaction_tensor.mask)
    return interaction_tensor

load_tensor_factors(filename)

Imports factors previously exported from a tensor decomposition done in a cell2cell.tensor.BaseTensor-like object.

Parameters

filename : str Absolute path to a file storing an excel file containing the factors, their loadings, and element names for each of the dimensions of a previously decomposed tensor.

Returns

factors : collections.OrderedDict An ordered dictionary wherein keys are the names of each tensor dimension, and values are the loadings in a pandas.DataFrame. In this dataframe, rows are the elements of the respective dimension and columns are the factors from the tensor factorization. Values are the corresponding loadings.

Source code in cell2cell/io/read_data.py
def load_tensor_factors(filename):
    '''Imports factors previously exported from a tensor
    decomposition done in a cell2cell.tensor.BaseTensor-like object.

    Parameters
    ----------
    filename : str
        Absolute path to a file storing an excel file containing
        the factors, their loadings, and element names for each
        of the dimensions of a previously decomposed tensor.

    Returns
    -------
    factors : collections.OrderedDict
        An ordered dictionary wherein keys are the names of each
        tensor dimension, and values are the loadings in a pandas.DataFrame.
        In this dataframe, rows are the elements of the respective dimension
        and columns are the factors from the tensor factorization. Values
        are the corresponding loadings.
    '''
    from collections import OrderedDict

    xls = pd.ExcelFile(filename)

    factors = OrderedDict()
    for sheet_name in xls.sheet_names:
        factors[sheet_name] = xls.parse(sheet_name, index_col=0)

    return factors

load_variable_with_pickle(filename)

Imports a large size variable stored in a file previously exported with pickle.

Parameters

filename : str Absolute path to a file storing a python variable that was previously created by using pickle.

Returns

variable : a python variable The variable of interest.

Source code in cell2cell/io/read_data.py
def load_variable_with_pickle(filename):
    '''Imports a large size variable stored in a file previously
    exported with pickle.

    Parameters
    ----------
    filename : str
        Absolute path to a file storing a python variable that
        was previously created by using pickle.

    Returns
    -------
    variable : a python variable
        The variable of interest.
    '''

    max_bytes = 2 ** 31 - 1
    bytes_in = bytearray(0)
    input_size = os.path.getsize(filename)
    with open(filename, 'rb') as f_in:
        for _ in range(0, input_size, max_bytes):
            bytes_in += f_in.read(max_bytes)
    variable = pickle.loads(bytes_in)
    return variable

save_data

export_variable_with_pickle(variable, filename)

Exports a large size variable in a python readable way using pickle.

Parameters

variable : a python variable Variable to export

filename : str Complete path to the file wherein the variable will be stored. For example: /home/user/variable.pkl

Source code in cell2cell/io/save_data.py
def export_variable_with_pickle(variable, filename):
    '''Exports a large size variable in a python readable way
    using pickle.

    Parameters
    ----------
    variable : a python variable
        Variable to export

    filename : str
        Complete path to the file wherein the variable will be
        stored. For example:
        /home/user/variable.pkl
    '''

    max_bytes = 2 ** 31 - 1

    bytes_out = pickle.dumps(variable)
    with open(filename, 'wb') as f_out:
        for idx in range(0, len(bytes_out), max_bytes):
            f_out.write(bytes_out[idx:idx + max_bytes])
    print(filename, ' was correctly saved.')
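
The export and load functions above write and read the pickled bytes in blocks of at most 2 ** 31 - 1 bytes, presumably to stay below platform limits on single large reads and writes. The round trip can be sketched with the standard library alone; the function names `export_in_chunks` and `load_in_chunks` are hypothetical, and `max_bytes` is deliberately tiny here so that several blocks are actually used.

```python
import os
import pickle
import tempfile


def export_in_chunks(variable, filename, max_bytes):
    """Write the pickled variable to disk in blocks of at most max_bytes."""
    bytes_out = pickle.dumps(variable)
    with open(filename, 'wb') as f_out:
        for idx in range(0, len(bytes_out), max_bytes):
            f_out.write(bytes_out[idx:idx + max_bytes])


def load_in_chunks(filename, max_bytes):
    """Read the file back block by block and unpickle the joined bytes."""
    bytes_in = bytearray(0)
    input_size = os.path.getsize(filename)
    with open(filename, 'rb') as f_in:
        for _ in range(0, input_size, max_bytes):
            bytes_in += f_in.read(max_bytes)
    return pickle.loads(bytes_in)


original = {'factors': list(range(1000)), 'name': 'toy'}
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'variable.pkl')
    export_in_chunks(original, path, max_bytes=64)
    n_blocks = -(-os.path.getsize(path) // 64)  # ceiling division
    restored = load_in_chunks(path, max_bytes=64)
```

As with any pickle file, only files from trusted sources should be loaded this way, because unpickling can execute arbitrary code.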

plotting special

aesthetics

generate_legend(color_dict, loc='center left', bbox_to_anchor=(1.01, 0.5), ncol=1, fancybox=True, shadow=True, title='Legend', fontsize=14, sorted_labels=True, ax=None)

Adds a legend to a previous plot or displays an independent legend given specific colors for labels.

Parameters

color_dict : dict Dictionary containing tuples in the RGBA format for indicating colors of major groups of cells. Keys are the labels and values are the RGBA tuples.

loc : str, default='center left' Alignment of the legend given the location specified in bbox_to_anchor.

bbox_to_anchor : tuple, default=(1.01, 0.5) Location of the legend in a (X, Y) format. For example, if you want your axes legend located at the figure's top right-hand corner instead of the axes' corner, simply specify the corner's location and the coordinate system of that location, which in this case would be (1, 1).

ncol : int, default=1 Number of columns to display the legend.

fancybox : boolean, default=True Whether round edges should be enabled around the FancyBboxPatch which makes up the legend's background.

shadow : boolean, default=True Whether to draw a shadow behind the legend.

title : str, default='Legend' Title of the legend box

fontsize : int, default=14 Size of the text in the legends.

sorted_labels : boolean, default=True Whether to sort the labels alphabetically.

ax : matplotlib.axes.Axes, default=None Axes instance for a plot. If None, the legend is added to the current matplotlib axes (a new figure is generated if none exists).

Returns

legend1 : matplotlib.legend.Legend A legend object in a figure.

Source code in cell2cell/plotting/aesthetics.py
def generate_legend(color_dict, loc='center left', bbox_to_anchor=(1.01, 0.5), ncol=1, fancybox=True, shadow=True,
                    title='Legend', fontsize=14, sorted_labels=True, ax=None):
    '''Adds a legend to a previous plot or displays an independent legend
    given specific colors for labels.

    Parameters
    ----------
    color_dict : dict
        Dictionary containing tuples in the RGBA format for indicating colors
        of major groups of cells. Keys are the labels and values are the RGBA
        tuples.

    loc : str, default='center left'
        Alignment of the legend given the location specified in bbox_to_anchor.

    bbox_to_anchor : tuple, default=(1.01, 0.5)
        Location of the legend in a (X, Y) format. For example, if you want
        your axes legend located at the figure's top right-hand corner instead
        of the axes' corner, simply specify the corner's location and the
        coordinate system of that location, which in this case would be (1, 1).

    ncol : int, default=1
        Number of columns to display the legend.

    fancybox : boolean, default=True
        Whether round edges should be enabled around the FancyBboxPatch which
        makes up the legend's background.

    shadow : boolean, default=True
        Whether to draw a shadow behind the legend.

    title : str, default='Legend'
        Title of the legend box

    fontsize : int, default=14
        Size of the text in the legends.

    sorted_labels : boolean, default=True
        Whether to sort the labels alphabetically.

    ax : matplotlib.axes.Axes, default=None
        Axes instance for a plot. If None, the legend is added to the
        current matplotlib axes (a new figure is generated if none exists).

    Returns
    -------
    legend1 : matplotlib.legend.Legend
        A legend object in a figure.
    '''
    color_patches = []
    if sorted_labels:
        iteritems = sorted(color_dict.items())
    else:
        iteritems = color_dict.items()
    for k, v in iteritems:
        color_patches.append(patches.Patch(color=v, label=str(k).replace('_', ' ')))

    if ax is None:
        legend1 = plt.legend(handles=color_patches,
                             loc=loc,
                             bbox_to_anchor=bbox_to_anchor,
                             ncol=ncol,
                             fancybox=fancybox,
                             shadow=shadow,
                             title=title,
                             title_fontsize=fontsize,
                             fontsize=fontsize)
    else:
        legend1 = ax.legend(handles=color_patches,
                            loc=loc,
                            bbox_to_anchor=bbox_to_anchor,
                            ncol=ncol,
                            fancybox=fancybox,
                            shadow=shadow,
                            title=title,
                            title_fontsize=fontsize,
                            fontsize=fontsize)
    return legend1

get_colors_from_labels(labels, cmap='gist_rainbow', factor=1)

Generates colors for each label in a list given a colormap

Parameters

labels : list A list of labels to assign a color.

cmap : str, default='gist_rainbow' A matplotlib color palette name.

factor : int, default=1 Factor to amplify the separation of colors.

Returns

colors : dict A dictionary where the keys are the labels and the values correspond to the assigned colors.

Source code in cell2cell/plotting/aesthetics.py
def get_colors_from_labels(labels, cmap='gist_rainbow', factor=1):
    '''Generates colors for each label in a list given a colormap

    Parameters
    ----------
    labels : list
        A list of labels to assign a color.

    cmap : str, default='gist_rainbow'
        A matplotlib color palette name.

    factor : int, default=1
        Factor to amplify the separation of colors.

    Returns
    -------
    colors : dict
        A dictionary where the keys are the labels and the values
        correspond to the assigned colors.
    '''
    assert factor >= 1

    colors = dict.fromkeys(labels, ())

    factor = int(factor)
    cm_ = plt.get_cmap(cmap)

    is_number = all((isinstance(e, float) or isinstance(e, int)) for e in labels)

    if not is_number:
        NUM_COLORS = factor * len(colors)
        for i, label in enumerate(colors.keys()):
            colors[label] = cm_((1 + ((factor-1)/factor)) * i / NUM_COLORS)
    else:
        max_ = np.nanmax(labels)
        min_ = np.nanmin(labels)
        norm = Normalize(vmin=-min_, vmax=max_)

        m = cm.ScalarMappable(norm=norm, cmap=cmap)
        for label in colors.keys():
            colors[label] = m.to_rgba(label)
    return colors
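
For non-numeric labels, the function above samples the colormap at positions given by the expression inside `cm_(...)`. The sketch below reproduces only that arithmetic, without matplotlib, so the effect of `factor` can be inspected directly; the helper name `colormap_positions` is hypothetical.

```python
def colormap_positions(n_labels, factor=1):
    """Positions in [0, 1] at which the colormap is sampled for categorical labels."""
    factor = int(factor)
    num_colors = factor * n_labels
    return [(1 + ((factor - 1) / factor)) * i / num_colors
            for i in range(n_labels)]


positions_f1 = colormap_positions(4, factor=1)
positions_f2 = colormap_positions(4, factor=2)
```

Here `n_labels` stands for the number of unique labels, since the source builds its dictionary with `dict.fromkeys(labels, ())`, which collapses repeated labels.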

map_colors_to_metadata(metadata, ref_df=None, colors=None, sample_col='#SampleID', group_col='Groups', cmap='gist_rainbow')

Assigns a color to elements in a dataframe containing metadata.

Parameters

metadata : pandas.DataFrame A dataframe with metadata for specific elements.

ref_df : pandas.DataFrame, default=None A dataframe whose columns contain a subset of the elements in the metadata. If provided, the metadata is reindexed to follow the order of these columns.

colors : dict, default=None Dictionary containing tuples in the RGBA format for indicating colors of major groups of cells. If colors is specified, cmap will be ignored.

sample_col : str, default='#SampleID' Column in the metadata for elements to color.

group_col : str, default='Groups' Column in the metadata containing the major groups of the elements to color.

cmap : str, default='gist_rainbow' Name of the color palette for coloring the major groups of elements.

Returns

new_colors : pandas.Series A pandas series where the index is the list of elements in the sample_col and the values are the colors assigned to each element given their groups. The series is named with group_col.capitalize().

Source code in cell2cell/plotting/aesthetics.py
def map_colors_to_metadata(metadata, ref_df=None, colors=None, sample_col='#SampleID', group_col='Groups',
                           cmap='gist_rainbow'):
    '''Assigns a color to elements in a dataframe containing metadata.

    Parameters
    ----------
    metadata : pandas.DataFrame
        A dataframe with metadata for specific elements.

    ref_df : pandas.DataFrame, default=None
        A dataframe whose columns contain a subset of the
        elements in the metadata. If provided, the metadata is
        reindexed to follow the order of these columns.

    colors : dict, default=None
        Dictionary containing tuples in the RGBA format for indicating colors
        of major groups of cells. If colors is specified, cmap will be
        ignored.

    sample_col : str, default='#SampleID'
        Column in the metadata for elements to color.

    group_col : str, default='Groups'
        Column in the metadata containing the major groups of the elements
        to color.

    cmap : str, default='gist_rainbow'
        Name of the color palette for coloring the major groups of elements.

    Returns
    -------
    new_colors : pandas.Series
        A pandas series where the index is the list of elements in the
        sample_col and the values are the colors assigned to each element
        given their groups. The series is named with group_col.capitalize().
    '''
    if ref_df is not None:
        meta_ = metadata.set_index(sample_col).reindex(ref_df.columns)
    else:
        meta_ = metadata.set_index(sample_col)
    labels = meta_[group_col].unique().tolist()
    if colors is None:
        colors = get_colors_from_labels(labels, cmap=cmap)
    else:
        upd_dict = dict([(v, (1., 1., 1., 1.)) for v in labels if v not in colors.keys()])
        colors.update(upd_dict)

    new_colors = meta_[group_col].map(colors)
    new_colors.index = meta_.index
    new_colors.name = group_col.capitalize()

    return new_colors
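
The fallback applied when a partial colors dictionary is passed can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper name fill_missing_group_colors and the group names are hypothetical. Unlike the source above, which calls colors.update and therefore modifies the dictionary passed by the caller, this sketch works on a copy.

```python
WHITE = (1.0, 1.0, 1.0, 1.0)  # RGBA tuple used for groups without a color


def fill_missing_group_colors(labels, colors):
    """Return a copy of `colors` where every label lacking a color gets white.

    Hypothetical helper mirroring the fallback in map_colors_to_metadata.
    """
    filled = dict(colors)  # copy, so the caller's dictionary is left untouched
    for label in labels:
        filled.setdefault(label, WHITE)
    return filled


# Hypothetical major groups; only one of them has a user-defined color.
groups = ['Immune', 'Stromal', 'Epithelial']
partial = {'Immune': (0.8, 0.1, 0.1, 1.0)}
filled = fill_missing_group_colors(groups, partial)
```

Passing a copy of your own dictionary to map_colors_to_metadata is one way to keep the original unchanged if that matters in your workflow.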

ccc_plot

clustermap_ccc(interaction_space, metadata=None, sample_col='#SampleID', group_col='Groups', meta_cmap='gist_rainbow', colors=None, cell_labels=('SENDER-CELL', 'RECEIVER-CELL'), metric='jaccard', method='ward', optimal_leaf=True, excluded_cells=None, title='', only_used_lr=True, cbar_title='Presence', cbar_fontsize=12, row_fontsize=8, col_fontsize=8, filename=None, **kwargs)

Generates a clustermap (heatmap + dendrograms from a hierarchical clustering) based on CCC scores for each LR pair in every cell-cell pair.

Parameters

interaction_space : cell2cell.core.interaction_space.InteractionSpace Interaction space that contains a communication matrix after running the method compute_pairwise_communication_scores. Alternatively, this object can be a numpy array or a pandas DataFrame containing the communication matrix. It can also be a SingleCellInteractions or a BulkInteractions object after running the method compute_pairwise_communication_scores.

metadata : pandas.DataFrame, default=None Metadata associated with the cells, cell types or samples in the matrix containing CCC scores. If None, cells will not be colored by major groups.

sample_col : str, default='#SampleID' Column in the metadata for the cells, cell types or samples in the matrix containing CCC scores.

group_col : str, default='Groups' Column in the metadata containing the major groups of cells, cell types or samples in the matrix with CCC scores.

meta_cmap : str, default='gist_rainbow' Name of the color palette for coloring the major groups of cells.

colors : dict, default=None Dictionary containing tuples in the RGBA format for indicating colors of major groups of cells. If colors is specified, meta_cmap will be ignored.

cell_labels : tuple, default=('SENDER-CELL','RECEIVER-CELL') A tuple containing the labels for indicating the group colors of sender and receiver cells if metadata or colors are provided.

metric : str, default='jaccard' The distance metric to use. The distance function can be 'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.

method : str, default='ward' Clustering method for computing a linkage as in scipy.cluster.hierarchy.linkage

optimal_leaf : boolean, default=True Whether to sort the leaves of the dendrograms to have a minimal distance between successive leaves. For more information, see scipy.cluster.hierarchy.optimal_leaf_ordering

excluded_cells : list, default=None List containing cell names that are present in the interaction_space object but that will be excluded from this plot.

title : str, default='' Title of the clustermap.

only_used_lr : boolean, default=True Whether to display only the LR pairs that were used by at least one pair of cells. If True, LR pairs that were not used will not be displayed.

cbar_title : str, default='Presence' Title for the colorbar, depending on the score employed.

cbar_fontsize : int, default=12 Font size for the colorbar title as well as labels for axes X and Y.

row_fontsize : int, default=8 Font size for the rows in the clustermap (ligand-receptor pairs).

col_fontsize : int, default=8 Font size for the columns in the clustermap (sender-receiver cell pairs).

filename : str, default=None Path to save the figure. If None, the figure is not saved.

**kwargs : dict Dictionary containing arguments for the seaborn.clustermap function.

Returns

fig : seaborn.matrix.ClusterGrid A seaborn ClusterGrid instance.
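
How excluded_cells acts on the columns of the communication matrix can be sketched as follows. This is a simplified illustration, assuming the columns are strings of the form 'sender;receiver' (the source below splits them on ';'); the helper name filter_cell_pairs and the cell names are hypothetical. Because the check uses Python's `in` on strings, it is a substring match, so a short name can also remove longer names that contain it.

```python
def filter_cell_pairs(columns, excluded_cells=None):
    """Keep the 'sender;receiver' labels that contain none of the excluded names.

    Hypothetical helper mirroring the filtering loop in clustermap_ccc.
    """
    if excluded_cells is None:
        return list(columns)
    return [pair for pair in columns
            if not any(excluded in pair for excluded in excluded_cells)]


columns = ['T;B', 'B;T', 'B;NK', 'NK;NK']
kept = filter_cell_pairs(columns, excluded_cells=['NK'])

# Substring matching: excluding 'T' also removes pairs involving 'Treg'.
kept_substring = filter_cell_pairs(['T;B', 'Treg;B', 'B;B'], excluded_cells=['T'])
```

If cell names overlap in this way, renaming them before plotting, or passing an already filtered DataFrame, avoids dropping more columns than intended.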

Source code in cell2cell/plotting/ccc_plot.py
def clustermap_ccc(interaction_space, metadata=None, sample_col='#SampleID', group_col='Groups',
                   meta_cmap='gist_rainbow', colors=None, cell_labels=('SENDER-CELL','RECEIVER-CELL'),
                   metric='jaccard', method='ward', optimal_leaf=True, excluded_cells=None, title='',
                   only_used_lr=True, cbar_title='Presence', cbar_fontsize=12, row_fontsize=8, col_fontsize=8,
                   filename=None, **kwargs):
    '''Generates a clustermap (heatmap + dendrograms from a hierarchical
    clustering) based on CCC scores for each LR pair in every cell-cell pair.

    Parameters
    ----------
    interaction_space : cell2cell.core.interaction_space.InteractionSpace
        Interaction space that contains a communication matrix after running
        the method compute_pairwise_communication_scores. Alternatively, this
        object can be a numpy array or a pandas DataFrame. It can also be a
        SingleCellInteractions or a BulkInteractions object after running
        the method compute_pairwise_communication_scores.

    metadata : pandas.DataFrame, default=None
        Metadata associated with the cells, cell types or samples in the
        matrix containing CCC scores. If None, cells will not be colored
        by major groups.

    sample_col : str, default='#SampleID'
        Column in the metadata for the cells, cell types or samples
        in the matrix containing CCC scores.

    group_col : str, default='Groups'
        Column in the metadata containing the major groups of cells, cell types
        or samples in the matrix with CCC scores.

    meta_cmap : str, default='gist_rainbow'
        Name of the color palette for coloring the major groups of cells.

    colors : dict, default=None
        Dictionary containing tuples in the RGBA format for indicating colors
        of major groups of cells. If colors is specified, meta_cmap will be
        ignored.

    cell_labels : tuple, default=('SENDER-CELL','RECEIVER-CELL')
        A tuple containing the labels for indicating the group colors of
        sender and receiver cells if metadata or colors are provided.

    metric : str, default='jaccard'
        The distance metric to use. The distance function can be 'braycurtis',
        'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice',
        'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski',
        'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao',
        'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.

    method : str, default='ward'
        Clustering method for computing a linkage as in
        scipy.cluster.hierarchy.linkage

    optimal_leaf : boolean, default=True
        Whether to sort the leaves of the dendrograms to have a minimal distance
        between successive leaves. For more information, see
        scipy.cluster.hierarchy.optimal_leaf_ordering

    excluded_cells : list, default=None
        List containing cell names that are present in the interaction_space
        object but that will be excluded from this plot.

    title : str, default=''
        Title of the clustermap.

    only_used_lr : boolean, default=True
        Whether to display only the LR pairs that were used by at least
        one pair of cells. If True, LR pairs that were not used will
        not be displayed.

    cbar_title : str, default='Presence'
        Title for the colorbar, depending on the score employed.

    cbar_fontsize : int, default=12
        Font size for the colorbar title as well as labels for axes X and Y.

    row_fontsize : int, default=8
        Font size for the rows in the clustermap (ligand-receptor pairs).

    col_fontsize : int, default=8
        Font size for the columns in the clustermap (sender-receiver cell pairs).

    filename : str, default=None
        Path to save the figure. If None, the figure is not
        saved.

    **kwargs : dict
        Dictionary containing arguments for the seaborn.clustermap function.

    Returns
    -------
    fig : seaborn.matrix.ClusterGrid
        A seaborn ClusterGrid instance.
    '''
    if hasattr(interaction_space, 'interaction_elements'):
        print('Interaction space detected as an InteractionSpace class')
        if 'communication_matrix' not in interaction_space.interaction_elements.keys():
            raise ValueError('First run the method compute_pairwise_communication_scores() in your interaction' + \
                             ' object to generate a communication matrix.')
        else:
            df_ = interaction_space.interaction_elements['communication_matrix'].copy()
    elif (type(interaction_space) is np.ndarray) or (type(interaction_space) is pd.core.frame.DataFrame):
        print('Interaction space detected as a communication matrix')
        df_ = interaction_space
    elif hasattr(interaction_space, 'interaction_space'):
        print('Interaction space detected as a Interactions class')
        if 'communication_matrix' not in interaction_space.interaction_space.interaction_elements.keys():
            raise ValueError('First run the method compute_pairwise_communication_scores() in your interaction' + \
                             ' object to generate a communication matrix.')
        else:
            df_ = interaction_space.interaction_space.interaction_elements['communication_matrix'].copy()
    else:
        raise ValueError('First run the method compute_pairwise_communication_scores() in your interaction' + \
                         ' object to generate a communication matrix.')

    if excluded_cells is not None:
        included_cells = []
        for cells in df_.columns:
            include = True
            for excluded_cell in excluded_cells:
                if excluded_cell in cells:
                    include = False
            if include:
                included_cells.append(cells)
    else:
        included_cells = list(df_.columns)

    df_ = df_[included_cells]
    df_ = df_.dropna(how='all', axis=0)
    df_ = df_.dropna(how='all', axis=1)
    df_ = df_.fillna(0)
    if only_used_lr:
        df_ = df_[(df_.T != 0).any()]

    # Clustering
    dm_rows = compute_distance(df_, axis=0, metric=metric)
    row_linkage = compute_linkage(dm_rows, method=method, optimal_ordering=optimal_leaf)

    dm_cols = compute_distance(df_, axis=1, metric=metric)
    col_linkage = compute_linkage(dm_cols, method=method, optimal_ordering=optimal_leaf)

    # Colors
    if metadata is not None:
        metadata2 = metadata.reset_index()
        meta_ = metadata2.set_index(sample_col)
        if excluded_cells is not None:
            meta_ = meta_.loc[~meta_.index.isin(excluded_cells)]
        labels = meta_[group_col].values.tolist()

        if colors is None:
            colors = get_colors_from_labels(labels, cmap=meta_cmap)
        else:
            assert all(elem in colors.keys() for elem in set(labels))

        col_colors_L = pd.DataFrame(included_cells)[0].apply(lambda x: colors[metadata2.loc[metadata2[sample_col] == x.split(';')[0],
                                                                                           group_col].values[0]])
        col_colors_L.index = included_cells
        col_colors_L.name = cell_labels[0]

        col_colors_R = pd.DataFrame(included_cells)[0].apply(lambda x: colors[metadata2.loc[metadata2[sample_col] == x.split(';')[1],
                                                                                           group_col].values[0]])
        col_colors_R.index = included_cells
        col_colors_R.name = cell_labels[1]

        # Clustermap
        fig = sns.clustermap(df_,
                             cmap=sns.dark_palette('red'), #plt.get_cmap('YlGnBu_r'),
                             col_linkage=col_linkage,
                             row_linkage=row_linkage,
                             col_colors=[col_colors_L,
                                         col_colors_R],
                             **kwargs
                             )

        fig.ax_heatmap.set_yticklabels(fig.ax_heatmap.yaxis.get_majorticklabels(), rotation=0, ha='left')
        fig.ax_heatmap.set_xticklabels(fig.ax_heatmap.xaxis.get_majorticklabels(), rotation=90)

        fig.ax_col_colors.set_yticks(np.arange(0.5, 2., step=1))
        fig.ax_col_colors.set_yticklabels(list(cell_labels), fontsize=row_fontsize)
        fig.ax_col_colors.yaxis.tick_right()
        plt.setp(fig.ax_col_colors.get_yticklabels(), rotation=0, visible=True)


    else:
        fig = sns.clustermap(df_,
                             cmap=sns.dark_palette('red'),
                             col_linkage=col_linkage,
                             row_linkage=row_linkage,
                             **kwargs
                             )

    # Title
    if len(title) > 0:
        fig.ax_col_dendrogram.set_title(title, fontsize=24)

    # Color bar label
    cbar = fig.ax_heatmap.collections[0].colorbar
    cbar.ax.set_ylabel(cbar_title, fontsize=cbar_fontsize)
    cbar.ax.yaxis.set_label_position("left")

    # Change tick labels
    xlabels = [' --> '.join(i.get_text().split(';')) \
               for i in fig.ax_heatmap.xaxis.get_majorticklabels()]
    ylabels = [' --> '.join(i.get_text().replace('(', '').replace(')', '').replace("'", "").split(', ')) \
               for i in fig.ax_heatmap.yaxis.get_majorticklabels()]

    fig.ax_heatmap.set_xticklabels(xlabels,
                                   fontsize=col_fontsize,
                                   rotation=90,
                                   rotation_mode='anchor',
                                   va='center',
                                   ha='right')
    fig.ax_heatmap.set_yticklabels(ylabels)

    # Save Figure
    if filename is not None:
        plt.savefig(filename,
                    dpi=300,
                    bbox_inches='tight')
    return fig
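
The tick-label rewriting at the end of clustermap_ccc can be tried in isolation. The sketch below reproduces the two string transformations from the source above as standalone functions; the function names and the example labels are hypothetical, and it assumes column labels look like 'sender;receiver' and row labels look like the string form of a (ligand, receptor) tuple.

```python
def format_cell_pair(label):
    """Turn 'sender;receiver' into 'sender --> receiver'."""
    return ' --> '.join(label.split(';'))


def format_lr_pair(label):
    """Turn "('ligand', 'receptor')" into 'ligand --> receptor'."""
    cleaned = label.replace('(', '').replace(')', '').replace("'", "")
    return ' --> '.join(cleaned.split(', '))


x_label = format_cell_pair('Tcell;Bcell')
y_label = format_lr_pair("('TGFB1', 'TGFBR1')")
```

Note that names which themselves contain ';', parentheses or quotes would be altered by these replacements.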

cci_plot

clustermap_cci(interaction_space, method='ward', optimal_leaf=True, metadata=None, sample_col='#SampleID', group_col='Groups', meta_cmap='gist_rainbow', colors=None, excluded_cells=None, title='', cbar_title='CCI score', cbar_fontsize=18, filename=None, **kwargs)

Generates a clustermap (heatmap + dendrograms from a hierarchical clustering) based on CCI scores of cell-cell pairs.

Parameters

interaction_space : cell2cell.core.interaction_space.InteractionSpace Interaction space that contains a distance matrix after running the method compute_pairwise_cci_scores. Alternatively, this object can be a numpy array or a pandas DataFrame. It can also be a SingleCellInteractions or a BulkInteractions object after running the method compute_pairwise_cci_scores.

method : str, default='ward' Clustering method for computing a linkage as in scipy.cluster.hierarchy.linkage

optimal_leaf : boolean, default=True Whether to sort the leaves of the dendrograms to have a minimal distance between successive leaves. For more information, see scipy.cluster.hierarchy.optimal_leaf_ordering

metadata : pandas.DataFrame, default=None Metadata associated with the cells, cell types or samples in the matrix containing CCI scores. If None, cells will not be colored by major groups.

sample_col : str, default='#SampleID' Column in the metadata for the cells, cell types or samples in the matrix containing CCI scores.

group_col : str, default='Groups' Column in the metadata containing the major groups of cells, cell types or samples in the matrix with CCI scores.

meta_cmap : str, default='gist_rainbow' Name of the color palette for coloring the major groups of cells.

colors : dict, default=None Dictionary containing tuples in the RGBA format for indicating colors of major groups of cells. If colors is specified, meta_cmap will be ignored.

excluded_cells : list, default=None List containing cell names that are present in the interaction_space object but that will be excluded from this plot.

title : str, default='' Title of the clustermap.

cbar_title : str, default='CCI score' Title for the colorbar, depending on the score employed.

cbar_fontsize : int, default=18 Font size for the colorbar title as well as labels for axes X and Y.

filename : str, default=None Path to save the figure. If None, the figure is not saved.

**kwargs : dict Dictionary containing arguments for the seaborn.clustermap function.

Returns

hier : seaborn.matrix.ClusterGrid A seaborn ClusterGrid instance.

Source code in cell2cell/plotting/cci_plot.py
def clustermap_cci(interaction_space, method='ward', optimal_leaf=True, metadata=None, sample_col='#SampleID',
                   group_col='Groups', meta_cmap='gist_rainbow', colors=None, excluded_cells=None, title='',
                   cbar_title='CCI score', cbar_fontsize=18, filename=None, **kwargs):
    '''Generates a clustermap (heatmap + dendrograms from a hierarchical
    clustering) based on CCI scores of cell-cell pairs.

    Parameters
    ----------
    interaction_space : cell2cell.core.interaction_space.InteractionSpace
        Interaction space that contains a distance matrix after running
        the method compute_pairwise_cci_scores. Alternatively, this object
        can be a numpy array or a pandas DataFrame. It can also be a
        SingleCellInteractions or a BulkInteractions object after running
        the method compute_pairwise_cci_scores.

    method : str, default='ward'
        Clustering method for computing a linkage as in
        scipy.cluster.hierarchy.linkage

    optimal_leaf : boolean, default=True
        Whether to sort the leaves of the dendrograms to have a minimal distance
        between successive leaves. For more information, see
        scipy.cluster.hierarchy.optimal_leaf_ordering

    metadata : pandas.DataFrame, default=None
        Metadata associated with the cells, cell types or samples in the
        matrix containing CCI scores. If None, cells will not be colored
        by major groups.

    sample_col : str, default='#SampleID'
        Column in the metadata for the cells, cell types or samples
        in the matrix containing CCI scores.

    group_col : str, default='Groups'
        Column in the metadata containing the major groups of cells, cell types
        or samples in the matrix with CCI scores.

    meta_cmap : str, default='gist_rainbow'
        Name of the color palette for coloring the major groups of cells.

    colors : dict, default=None
        Dictionary containing tuples in the RGBA format for indicating colors
        of major groups of cells. If colors is specified, meta_cmap will be
        ignored.

    excluded_cells : list, default=None
        List containing cell names that are present in the interaction_space
        object but that will be excluded from this plot.

    title : str, default=''
        Title of the clustermap.

    cbar_title : str, default='CCI score'
        Title for the colorbar, depending on the score employed.

    cbar_fontsize : int, default=18
        Font size for the colorbar title as well as labels for axes X and Y.

    filename : str, default=None
        Path to save the figure. If None, the figure is not
        saved.

    **kwargs : dict
        Dictionary containing arguments for the seaborn.clustermap function.

    Returns
    -------
    hier : seaborn.matrix.ClusterGrid
        A seaborn ClusterGrid instance.
    '''
    if hasattr(interaction_space, 'distance_matrix'):
        print('Interaction space detected as an InteractionSpace class')
        distance_matrix = interaction_space.distance_matrix
        space_type = 'class'
    elif (type(interaction_space) is np.ndarray) or (type(interaction_space) is pd.core.frame.DataFrame):
        print('Interaction space detected as a distance matrix')
        distance_matrix = interaction_space
        space_type = 'matrix'
    elif hasattr(interaction_space, 'interaction_space'):
        print('Interaction space detected as a Interactions class')
        if not hasattr(interaction_space.interaction_space, 'distance_matrix'):
            raise ValueError('First run the method compute_pairwise_cci_scores() in your interaction' + \
                             ' object to generate a distance matrix.')
        else:
            interaction_space = interaction_space.interaction_space
            distance_matrix = interaction_space.distance_matrix
            space_type = 'class'
    else:
        raise ValueError('First run the method compute_pairwise_cci_scores() in your interaction' + \
                         ' object to generate a distance matrix.')

    # Drop excluded cells
    if excluded_cells is not None:
        df = distance_matrix.loc[~distance_matrix.index.isin(excluded_cells),
                                 ~distance_matrix.columns.isin(excluded_cells)].copy()
    else:
        df = distance_matrix.copy()

    # Check symmetry to get linkage
    symmetric = check_symmetry(df)
    if (not symmetric) and (type(interaction_space) is pd.core.frame.DataFrame):
        assert set(df.index) == set(df.columns), 'The distance matrix does not have the same elements in rows and columns'

    # Obtain info for generating plot
    linkage = _get_distance_matrix_linkages(df=df,
                                            kwargs=kwargs,
                                            method=method,
                                            optimal_ordering=optimal_leaf,
                                            symmetric=symmetric
                                            )

    kwargs_ = kwargs.copy()


    # PLOT CCI MATRIX
    if space_type == 'class':
        df = interaction_space.interaction_elements['cci_matrix']
    else:
        df = distance_matrix

    if excluded_cells is not None:
        df = df.loc[~df.index.isin(excluded_cells),
                    ~df.columns.isin(excluded_cells)]

    # Colors
    if metadata is not None:
        col_colors = map_colors_to_metadata(metadata=metadata,
                                            ref_df=df,
                                            colors=colors,
                                            sample_col=sample_col,
                                            group_col=group_col,
                                            cmap=meta_cmap)

        if not symmetric:
            row_colors = col_colors
        else:
            row_colors = None
    else:
        col_colors = None
        row_colors = None

    # Plot hierarchical clustering (triangular)
    hier = _plot_triangular_clustermap(df=df,
                                       symmetric=symmetric,
                                       linkage=linkage,
                                       col_colors=col_colors,
                                       row_colors=row_colors,
                                       title=title,
                                       cbar_title=cbar_title,
                                       cbar_fontsize=cbar_fontsize,
                                       **kwargs_)

    if not symmetric:
        hier.ax_heatmap.set_xlabel('Receiver cells', fontsize=cbar_fontsize)
        hier.ax_heatmap.set_ylabel('Sender cells', fontsize=cbar_fontsize)

    if filename is not None:
        plt.savefig(filename, dpi=300,
                    bbox_inches='tight')
    return hier
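
The role of the symmetry flag in clustermap_cci can be illustrated without pandas. The sketch below is an assumption-laden illustration: is_symmetric is a hypothetical stand-in for the library's check_symmetry (whose implementation is not shown here), and axis_labels only mirrors the branch in the source above that labels the axes when the matrix is not symmetric, that is, when sender and receiver roles differ.

```python
def is_symmetric(matrix, tol=1e-9):
    """Return True if a square list-of-lists matrix equals its transpose.

    Hypothetical stand-in for check_symmetry, shown for illustration only.
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False  # non-square matrices cannot be symmetric
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))


def axis_labels(symmetric):
    """Return (xlabel, ylabel); labels are only needed for directed scores."""
    if not symmetric:
        return ('Receiver cells', 'Sender cells')
    return (None, None)


undirected = [[1.0, 0.2], [0.2, 1.0]]  # score(A, B) == score(B, A)
directed = [[1.0, 0.2], [0.7, 1.0]]    # score(A, B) != score(B, A)
labels_directed = axis_labels(is_symmetric(directed))
labels_undirected = axis_labels(is_symmetric(undirected))
```

The sketch uses `not` rather than `~` to negate the flag: `~` is a bitwise operator, so on a plain Python bool it yields a non-zero integer that is always truthy.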

circular_plot

circos_plot(interaction_space, sender_cells, receiver_cells, ligands, receptors, excluded_score=0, metadata=None, sample_col='#SampleID', group_col='Groups', meta_cmap='Set2', cells_cmap='Pastel1', colors=None, ax=None, figsize=(10, 10), fontsize=14, legend=True, ligand_label_color='dimgray', receptor_label_color='dimgray', filename=None)

Generates the circos plot in the exact order that sender and receiver cells are provided. Similarly, ligands and receptors are sorted by the order they are input.

Parameters

interaction_space : cell2cell.core.interaction_space.InteractionSpace Interaction space that contains a communication matrix after running the method compute_pairwise_communication_scores. Alternatively, this object can be a SingleCellInteractions or a BulkInteractions object after running the method compute_pairwise_communication_scores.

sender_cells : list List of cells to be included as senders.

receiver_cells : list List of cells to be included as receivers.

ligands : list List of genes/proteins to be included as ligands produced by the sender cells.

receptors : list List of genes/proteins to be included as receptors produced by the receiver cells.

excluded_score : float, default=0 Rows that have a communication score equal to or lower than this will be dropped from the network.

metadata : pandas.DataFrame, default=None Metadata associated with the cells, cell types or samples in the matrix containing CCC scores. If None, cells will be colored only by individual cells.

sample_col : str, default='#SampleID' Column in the metadata for the cells, cell types or samples in the matrix containing CCC scores.

group_col : str, default='Groups' Column in the metadata containing the major groups of cells, cell types or samples in the matrix with CCC scores.

meta_cmap : str, default='Set2' Name of the matplotlib color palette for coloring the major groups of cells.

cells_cmap : str, default='Pastel1' Name of the color palette for coloring individual cells.

colors : dict, default=None Dictionary containing tuples in the RGBA format for indicating colors of major groups of cells. If colors is specified, meta_cmap will be ignored.

ax : matplotlib.axes.Axes, default=None Axes instance for a plot.

figsize : tuple, default=(10, 10) Size of the figure (width, height), each in inches.

fontsize : int, default=14 Font size for ligand and receptor labels.

legend : boolean, default=True Whether to include legends for cell and cell group colors as well as ligand/receptor colors.

ligand_label_color : str, default='dimgray' Name of the matplotlib color for the labels of ligands.

receptor_label_color : str, default='dimgray' Name of the matplotlib color for the labels of receptors.

filename : str, default=None Path to save the figure. If None, the figure is not saved.

Returns

ax : matplotlib.axes.Axes Axes instance containing a circos plot.
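
When an existing ax is passed, the source below derives the radius and center of the circos plot from the current axis limits. The arithmetic can be sketched in plain Python; the function name circle_from_limits is hypothetical, and the 1.05 margin is the factor that appears in the source.

```python
def circle_from_limits(xlim, ylim, margin=1.05):
    """Return (radius, center) of the largest circle fitting the axis limits.

    Hypothetical helper mirroring the `ax is not None` branch of circos_plot.
    """
    x_range = abs(xlim[1] - xlim[0])
    y_range = abs(ylim[1] - ylim[0])
    radius = min(x_range / 2.0, y_range / 2.0) / margin
    center = (sum(xlim) / 2.0, sum(ylim) / 2.0)
    return radius, center


# Limits that circos_plot itself sets when it creates the axes (R = 1.0).
radius, center = circle_from_limits((-1.05, 1.05), (-1.05, 1.05))

# Non-square limits: the shorter side bounds the radius.
wide_radius, wide_center = circle_from_limits((0.0, 4.0), (0.0, 2.0))
```

With non-square limits the plot is sized by the shorter side, so setting equal ranges and an equal aspect ratio on your own axes beforehand is a reasonable precaution.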

Source code in cell2cell/plotting/circular_plot.py
def circos_plot(interaction_space, sender_cells, receiver_cells, ligands, receptors, excluded_score=0, metadata=None,
                sample_col='#SampleID', group_col='Groups', meta_cmap='Set2', cells_cmap='Pastel1', colors=None, ax=None,
                figsize=(10,10), fontsize=14, legend=True, ligand_label_color='dimgray', receptor_label_color='dimgray',
                filename=None):
    '''Generates the circos plot in the exact order that sender and
    receiver cells are provided. Similarly, ligands and receptors are
    sorted by the order they are input.

    Parameters
    ----------
    interaction_space : cell2cell.core.interaction_space.InteractionSpace
        Interaction space that contains a communication matrix after running
        the method compute_pairwise_communication_scores. Alternatively, this
        object can be a SingleCellInteractions or a BulkInteractions object
        after running the method compute_pairwise_communication_scores.

    sender_cells : list
        List of cells to be included as senders.

    receiver_cells : list
        List of cells to be included as receivers.

    ligands : list
        List of genes/proteins to be included as ligands produced by the
        sender cells.

    receptors : list
        List of genes/proteins to be included as receptors produced by the
        receiver cells.

    excluded_score : float, default=0
        Rows that have a communication score equal to or lower than this
        will be dropped from the network.

    metadata : pandas.DataFrame, default=None
        Metadata associated with the cells, cell types or samples in the
        matrix containing CCC scores. If None, cells will be colored only
        by individual cells.

    sample_col : str, default='#SampleID'
        Column in the metadata for the cells, cell types or samples
        in the matrix containing CCC scores.

    group_col : str, default='Groups'
        Column in the metadata containing the major groups of cells, cell types
        or samples in the matrix with CCC scores.

    meta_cmap : str, default='Set2'
        Name of the matplotlib color palette for coloring the major groups
        of cells.

    cells_cmap : str, default='Pastel1'
        Name of the color palette for coloring individual cells.

    colors : dict, default=None
        Dictionary containing tuples in the RGBA format for indicating colors
        of major groups of cells. If colors is specified, meta_cmap will be
        ignored.

    ax : matplotlib.axes.Axes, default=None
        Axes instance for a plot.

    figsize : tuple, default=(10, 10)
        Size of the figure (width, height), each in inches.

    fontsize : int, default=14
        Font size for ligand and receptor labels.

    legend : boolean, default=True
        Whether to include legends for cell and cell group colors as well
        as ligand/receptor colors.

    ligand_label_color : str, default='dimgray'
        Name of the matplotlib color for the labels of
        ligands.

    receptor_label_color : str, default='dimgray'
        Name of the matplotlib color for the labels of
        receptors.

    filename : str, default=None
        Path to save the figure. If None, the figure is not
        saved.

    Returns
    -------
    ax : matplotlib.axes.Axes
        Axes instance containing a circos plot.
    '''
    if hasattr(interaction_space, 'interaction_elements'):
        if 'communication_matrix' not in interaction_space.interaction_elements.keys():
            raise ValueError('Run the method compute_pairwise_communication_scores() before generating circos plots.')
        else:
            readable_ccc = get_readable_ccc_matrix(interaction_space.interaction_elements['communication_matrix'])
    elif hasattr(interaction_space, 'interaction_space'):
        if 'communication_matrix' not in interaction_space.interaction_space.interaction_elements.keys():
            raise ValueError('Run the method compute_pairwise_communication_scores() before generating circos plots.')
        else:
            readable_ccc = get_readable_ccc_matrix(interaction_space.interaction_space.interaction_elements['communication_matrix'])
    else:
        raise ValueError('Not a valid interaction_space')

    # Figure setups
    if ax is None:
        R = 1.0
        center = (0, 0)

        fig = plt.figure(figsize=figsize, frameon=False)
        ax = fig.add_axes([0., 0., 1., 1.], aspect='equal')
        ax.set_axis_off()
        ax.set_xlim((-R * 1.05 + center[0]), (R * 1.05 + center[0]))
        ax.set_ylim((-R * 1.05 + center[1]), (R * 1.05 + center[1]))
    else:
        xlim = ax.get_xlim()
        ylim = ax.get_ylim()
        x_range = abs(xlim[1] - xlim[0])
        y_range = abs(ylim[1] - ylim[0])

        R = np.nanmin([x_range/2.0, y_range/2.0]) / 1.05
        center = (np.nanmean(xlim), np.nanmean(ylim))
    current_ax = ax
    # Elements to build network
    # TODO: Add option to select sort_by: None (as input), cells, proteins or metadata
    sorted_nodes = sort_nodes(sender_cells=sender_cells,
                              receiver_cells=receiver_cells,
                              ligands=ligands,
                              receptors=receptors
                              )

    # Build network
    G = _build_network(sender_cells=sender_cells,
                       receiver_cells=receiver_cells,
                       ligands=ligands,
                       receptors=receptors,
                       sorted_nodes=sorted_nodes,
                       readable_ccc=readable_ccc,
                       excluded_score=excluded_score
                       )

    # Get coordinates
    nodes_dict = get_arc_angles(G=G,
                                sorting_feature='sorting')

    edges_dict = dict()
    for k, v in nodes_dict.items():
        edges_dict[k] = get_cartesian(theta=np.nanmean(v),
                                      radius=0.95*R/2.,
                                      center=center,
                                      angle='degrees')

    small_R = determine_small_radius(edges_dict)

    # Colors
    cells = list(set(sender_cells+receiver_cells))
    if metadata is not None:
        meta = metadata.set_index(sample_col).reindex(cells)
        meta = meta[[group_col]].fillna('NA')
        labels = meta[group_col].unique().tolist()
        if colors is None:
            colors = get_colors_from_labels(labels, cmap=meta_cmap)
        meta['Color'] = [colors[idx] for idx in meta[group_col]]
    else:
        meta = pd.DataFrame(index=cells)
    # Colors for cells, not major groups
    colors = get_colors_from_labels(cells, cmap=cells_cmap)
    meta['Cells-Color'] = [colors[idx] for idx in meta.index]
    # signal_colors = {'ligand' : 'brown', 'receptor' : 'black'}
    signal_colors = {'ligand' : ligand_label_color, 'receptor' : receptor_label_color}

    # Draw edges
    # TODO: Automatically determine lw given the size of the arcs (each ligand or receptor)
    lw = 10
    for l, r in G.edges:
        path = Path([edges_dict[l], center, edges_dict[r]],
                    [Path.MOVETO, Path.CURVE3, Path.CURVE3])


        patch = patches.FancyArrowPatch(path=path,
                                        arrowstyle="->,head_length={},head_width={}".format(lw/3.0,lw/3.0),
                                        lw=lw/2.,
                                        edgecolor='gray', #meta.at[l.split('^')[0], 'Cells-Color'],
                                        zorder=1,
                                        facecolor='none',
                                        alpha=0.15)
        ax.add_patch(patch)

    # Draw nodes
    # TODO: Automatically determine lw given the size of the figure
    cell_legend = dict()
    if metadata is not None:
        meta_legend = dict()
    else:
        meta_legend = None
    for k, v in nodes_dict.items():
        diff = 0.05 * abs(v[1]-v[0]) # Distance to subtract so that neighboring nodes do not touch each other

        cell = k.split('^')[0]
        cell_color = meta.at[cell, 'Cells-Color']
        cell_legend[cell] = cell_color
        ax.add_patch(Arc(center, R, R,
                         theta1=diff + v[0],
                         theta2=v[1] - diff,
                         edgecolor=cell_color,
                         lw=lw))
        coeff = 1.0
        if metadata is not None:
            coeff = 1.1

            meta_color = meta.at[cell, 'Color']
            ax.add_patch(Arc(center, coeff * R, coeff * R,
                             theta1=diff + v[0],
                             theta2=v[1] - diff,
                             edgecolor=meta_color,
                             lw=10))

            meta_legend[meta.at[cell, group_col]] = meta_color

        label_coord = get_cartesian(theta=np.nanmean(v),
                                    radius=coeff * R/2. + min([small_R*3, coeff * R*0.05]),
                                    center=center,
                                    angle='degrees')

        # Add Labels
        v2 = label_coord
        x = v2[0]
        y = v2[1]
        rotation = np.nanmean(v)

        if x >= center[0]:  # Compare against the circle center, which is not (0, 0) when an existing ax is passed
            ha = 'left'
        else:
            ha = 'right'
        va = 'center'

        if (rotation <= 90):
            rotation = abs(rotation)
        elif (rotation <= 180):
            rotation = -abs(rotation - 180)
        elif (rotation <= 270):
            rotation = abs(rotation - 180)
        else:
            rotation = abs(rotation)

        ax.text(x, y,
                k.split('^')[1],
                rotation=rotation,
                rotation_mode="anchor",
                horizontalalignment=ha,
                verticalalignment=va,
                color=signal_colors[G.nodes[k]['signal']],
                fontsize=fontsize
                )

    # Draw legend
    if legend:
        plt.gcf().canvas.draw()
        lgd = generate_circos_legend(cell_legend=cell_legend,
                                     meta_legend=meta_legend,
                                     signal_legend=signal_colors,
                                     fontsize=fontsize,
                                     ax=ax
                                     )

    if filename is not None:
        plt.savefig(filename, dpi=300,
                    bbox_inches='tight')
    return ax
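
The label orientation branch in the source above can be read in isolation. The following is a minimal sketch of that branch only, assuming the arc midpoint is given in degrees within [0, 360]; `label_rotation` is a hypothetical helper name, not a cell2cell function.

```python
def label_rotation(theta):
    """Text rotation (degrees) for a label placed at the arc midpoint theta (degrees)."""
    if theta <= 90:
        return abs(theta)
    elif theta <= 180:
        # Upper-left quadrant: flip by 180 degrees so the text is not upside down
        return -abs(theta - 180)
    elif theta <= 270:
        # Lower-left quadrant: same flip, positive result
        return abs(theta - 180)
    else:
        return abs(theta)

for theta in (45, 135, 225, 315):
    print(theta, label_rotation(theta))
```

Here 135 maps to -45 and 225 maps to 45, while 45 and 315 are left unchanged. The source pairs this with right-aligned text on the left side of the circle, so labels read outward from the arc.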

determine_small_radius(coordinate_dict)

Computes the radius of a circle whose diameter is the distance between the centers of two nodes.

Parameters

coordinate_dict : dict A dictionary containing the coordinates to plot each node.

Returns

radius : float Half the distance between the centers of the two nodes.
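
As a quick illustration, the sketch below reproduces the same arithmetic with the standard library only. `small_radius` and the coordinates are hypothetical names and values, not part of cell2cell. As in the source listed below, only the first two entries of the dictionary are used.

```python
import math

def small_radius(coordinate_dict):
    """Half the Euclidean distance between the first two nodes (illustrative stand-in)."""
    # Only the first two entries are used, in insertion order
    (x1, y1), (x2, y2) = list(coordinate_dict.values())[:2]
    return math.hypot(x1 - x2, y1 - y2) / 2.0

coords = {'A^L1': (0.0, 0.0), 'B^R1': (3.0, 4.0), 'C^R1': (10.0, 10.0)}
print(small_radius(coords))  # 2.5
```

A dictionary with fewer than two nodes cannot be unpacked this way, so callers would need at least two entries.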

Source code in cell2cell/plotting/circular_plot.py
def determine_small_radius(coordinate_dict):
    '''Computes the radius of a circle whose diameter is the distance
    between the centers of two nodes.

    Parameters
    ----------
    coordinate_dict : dict
        A dictionary containing the coordinates to plot each node.

    Returns
    -------
    radius : float
        Half the distance between the centers of the two nodes.
    '''
    # Cartesian coordinates
    keys = list(coordinate_dict.keys())
    circle1 = np.asarray(coordinate_dict[keys[0]])
    circle2 = np.asarray(coordinate_dict[keys[1]])
    diff = circle1 - circle2
    distance = np.sqrt(diff[0]**2 + diff[1]**2)
    radius = distance / 2.0
    return radius

generate_circos_legend(cell_legend, signal_legend=None, meta_legend=None, fontsize=14, ax=None)

Adds legends to circos plot.

Parameters

cell_legend : dict Dictionary containing the colors for the cells.

signal_legend : dict, default=None Dictionary containing the colors for the LR pairs in a given pair of cells. Corresponds to the colors of the links.

meta_legend : dict, default=None Dictionary containing the colors for the cells given their major groups.

fontsize : int, default=14 Size of the labels in the legend.

ax : matplotlib.axes.Axes, default=None Axes on which the cell legend is drawn. If None, the current axes are used.

Source code in cell2cell/plotting/circular_plot.py
def generate_circos_legend(cell_legend, signal_legend=None, meta_legend=None, fontsize=14, ax=None):
    '''Adds legends to circos plot.

    Parameters
    ----------
    cell_legend : dict
        Dictionary containing the colors for the cells.

    signal_legend : dict, default=None
        Dictionary containing the colors for the LR pairs in a given pair
        of cells. Corresponds to the colors of the links.

    meta_legend : dict, default=None
        Dictionary containing the colors for the cells given their major
        groups.

    fontsize : int, default=14
        Size of the labels in the legend.

    ax : matplotlib.axes.Axes, default=None
        Axes on which the cell legend is drawn. If None, the current
        axes are used.
    '''
    legend_fontsize = int(fontsize * 0.9)

    if ax is None:
        ax = plt.gca()

    # Cell legend
    lgd = generate_legend(color_dict=cell_legend,
                          loc='center left',
                          bbox_to_anchor=(1.01, 0.5),
                          ncol=1,
                          fancybox=True,
                          shadow=True,
                          title='Cells',
                          fontsize=legend_fontsize,
                          ax=ax
                          )


    if signal_legend is not None:
        # Signal legend
        # generate_legend(color_dict=signal_legend,
        #                 loc='upper center',
        #                 bbox_to_anchor=(0.5, -0.01),
        #                 ncol=2,
        #                 fancybox=True,
        #                 shadow=True,
        #                 title='Signals',
        #                 fontsize=legend_fontsize
        #                 )
        pass

    if meta_legend is not None:
        #ax.add_artist(lgd)
        fig = plt.gcf()
        ax2 = fig.add_axes([0., 0., 1., 1.], aspect='equal')
        ax2.set_axis_off()
        # Meta legend
        lgd3 = generate_legend(color_dict=meta_legend,
                               loc='center right',
                               bbox_to_anchor=(-0.01, 0.5),
                               ncol=1,
                               fancybox=True,
                               shadow=True,
                               title='Groups',
                               fontsize=legend_fontsize,
                               ax=ax2
                               )
        return lgd3
    return lgd

get_arc_angles(G, sorting_feature=None)

Obtains the angles of polar coordinates to plot nodes as arcs of a circumference.

Parameters

G : networkx.Graph or networkx.DiGraph A networkx graph.

sorting_feature : str, default=None A node attribute present in the dictionary associated with each node. The values associated with this attribute will be used for sorting the nodes.

Returns

angles : dict A dictionary containing the angles for positioning the nodes in polar coordinates. Keys are the node names and values are tuples with angles for the start and end of the arc that represents a node.
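
The angle assignment can be illustrated without a graph. This is a minimal sketch, assuming a plain dict that maps each node name to its sorting value in place of a networkx graph; `arc_angles` and the node names are hypothetical.

```python
def arc_angles(sorting):
    """Give each node an equal slice of 360 degrees, ordered by its sorting value."""
    nodes = sorted(sorting, key=sorting.get)
    step = 360 / len(nodes)
    # Each node spans from step * i to step * (i + 1) degrees
    return {node: (step * i, step * (i + 1)) for i, node in enumerate(nodes)}

ranks = {'B^R1': 2, 'A^L1': 0, 'B^L1': 1, 'C^R1': 3}
angles = arc_angles(ranks)
print(angles['A^L1'])  # (0.0, 90.0)
print(angles['C^R1'])  # (270.0, 360.0)
```

Every node receives an arc of the same width, so the arc size carries no information about communication scores.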

Source code in cell2cell/plotting/circular_plot.py
def get_arc_angles(G, sorting_feature=None):
    '''Obtains the angles of polar coordinates to plot nodes as arcs of
     a circumference.

    Parameters
    ----------
    G : networkx.Graph or networkx.DiGraph
        A networkx graph.

    sorting_feature : str, default=None
        A node attribute present in the dictionary associated with
        each node. The values associated with this attribute will
        be used for sorting the nodes.

    Returns
    -------
    angles : dict
        A dictionary containing the angles for positioning the
        nodes in polar coordinates. Keys are the node names and
        values are tuples with angles for the start and end of
        the arc that represents a node.
    '''
    elements = list(set(G.nodes()))
    n_elements = len(elements)

    if sorting_feature is not None:
        sorting_list = [G.nodes[n][sorting_feature] for n in elements]
        elements = [x for _, x in sorted(zip(sorting_list, elements), key=lambda pair: pair[0])]

    angles = dict()

    for i, node in enumerate(elements):
        theta1 = (360 / n_elements) * i
        theta2 = (360 / n_elements) * (i + 1)
        angles[node] = (theta1, theta2)
    return angles

get_cartesian(theta, radius, center=(0, 0), angle='radians')

Performs a polar to cartesian coordinates conversion.

Parameters

theta : float or ndarray An angle for a polar coordinate.

radius : float The radius in a polar coordinate.

center : tuple, default=(0,0) The center of the circle in the cartesian coordinates.

angle : str, default='radians' Type of angle that theta is. Options are:

- 'degrees' : from 0 to 360
- 'radians' : from 0 to 2*numpy.pi

Returns

(x, y) : tuple Cartesian coordinates for the X and Y axes, respectively.
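
The conversion is x = radius * cos(theta) + center_x and y = radius * sin(theta) + center_y. Below is a minimal sketch with the standard `math` module instead of numpy; `to_cartesian` is a hypothetical stand-in that handles scalars only.

```python
import math

def to_cartesian(theta, radius, center=(0, 0), angle='radians'):
    """Polar to cartesian conversion for a single angle (illustrative stand-in)."""
    if angle == 'degrees':
        theta = math.radians(theta)
    elif angle != 'radians':
        raise ValueError('Not a valid angle. Use radians or degrees.')
    x = radius * math.cos(theta) + center[0]
    y = radius * math.sin(theta) + center[1]
    return (x, y)

x, y = to_cartesian(90, 0.5, center=(1, 1), angle='degrees')
print(round(x, 6), round(y, 6))  # 1.0 1.5
```

The results are rounded for display because cos(pi/2) is not exactly zero in floating point.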

Source code in cell2cell/plotting/circular_plot.py
def get_cartesian(theta, radius, center=(0,0), angle='radians'):
    '''Performs a polar to cartesian coordinates conversion.

    Parameters
    ----------
    theta : float or ndarray
        An angle for a polar coordinate.

    radius : float
        The radius in a polar coordinate.

    center : tuple, default=(0,0)
        The center of the circle in the cartesian coordinates.

    angle : str, default='radians'
        Type of angle that theta is. Options are:
         - 'degrees' : from 0 to 360
         - 'radians' : from 0 to 2*numpy.pi

    Returns
    -------
    (x, y) : tuple
        Cartesian coordinates for the X and Y axes, respectively.
    '''
    if angle == 'degrees':
        theta_ = 2.0*np.pi*theta/360.
    elif angle == 'radians':
        theta_ = theta
    else:
        raise ValueError('Not a valid angle. Use radians or degrees.')
    x = radius*np.cos(theta_) + center[0]
    y = radius*np.sin(theta_) + center[1]
    return (x, y)

get_node_colors(G, coloring_feature=None, cmap='viridis')

Generates colors for each node in a network given one of their properties.

Parameters

G : networkx.Graph A graph containing a list of nodes

coloring_feature : str, default=None A node attribute present in the dictionary associated with each node. The values associated with this attribute will be used for coloring the nodes.

cmap : str, default='viridis' Name of a matplotlib color palette for coloring the nodes.

Returns

node_colors : dict A dictionary wherein each key is a node and values are tuples containing colors in the RGBA format.

feature_colors : dict A dictionary wherein each key is a value of the node attribute given by coloring_feature and values are tuples containing colors in the RGBA format.
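
To show how colors are spread over the palette, here is a minimal sketch that avoids matplotlib and networkx. It assumes a plain dict mapping nodes to their feature values, and `gray_cmap` is a hypothetical stand-in for a matplotlib colormap; both function names are illustrative only.

```python
def gray_cmap(fraction):
    """Hypothetical stand-in for a colormap: maps a value in [0, 1] to an RGBA gray."""
    return (fraction, fraction, fraction, 1.0)

def node_colors_from_feature(node_features, cmap=gray_cmap):
    # Sorted here only to make the output reproducible; the library iterates over a set
    features = sorted(set(node_features.values()))
    n = len(features)
    # Feature i of n is sampled at position i / n of the palette
    feature_colors = {f: cmap(i / n) for i, f in enumerate(features)}
    node_colors = {node: feature_colors[f] for node, f in node_features.items()}
    return node_colors, feature_colors

node_colors, feature_colors = node_colors_from_feature(
    {'n1': 'ligand', 'n2': 'receptor', 'n3': 'ligand'})
print(feature_colors)  # {'ligand': (0.0, 0.0, 0.0, 1.0), 'receptor': (0.5, 0.5, 0.5, 1.0)}
```

Because the sampled positions are i / n, the last feature never reaches position 1.0, so the upper end of the palette is not used.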

Source code in cell2cell/plotting/circular_plot.py
def get_node_colors(G, coloring_feature=None, cmap='viridis'):
    '''Generates colors for each node in a network given one of their
    properties.

    Parameters
    ----------
    G : networkx.Graph
        A graph containing a list of nodes

    coloring_feature : str, default=None
        A node attribute present in the dictionary associated with
        each node. The values associated with this attribute will
        be used for coloring the nodes.

    cmap : str, default='viridis'
        Name of a matplotlib color palette for coloring the nodes.

    Returns
    -------
    node_colors : dict
        A dictionary wherein each key is a node and values are
        tuples containing colors in the RGBA format.

    feature_colors : dict
        A dictionary wherein each key is a value for the attribute
        of nodes in the coloring_feature property and values are
        tuples containing colors in the RGBA format.
    '''
    if coloring_feature is None:
        raise ValueError('Node feature not specified!')

    # Get what features we have in the network G
    features = set()
    for n in G.nodes():
        features.add(G.nodes[n][coloring_feature])

    # Generate colors for each feature
    NUM_COLORS = len(features)
    cm = plt.get_cmap(cmap)
    feature_colors = dict()
    for i, f in enumerate(features):
        feature_colors[f] = cm(1. * i / NUM_COLORS)  # color will now be an RGBA tuple

    # Map feature colors into nodes
    node_colors = dict()
    for n in G.nodes():
        feature = G.nodes[n][coloring_feature]
        node_colors[n] = feature_colors[feature]
    return node_colors, feature_colors

get_readable_ccc_matrix(ccc_matrix)

Transforms a CCC matrix from an InteractionSpace instance into a readable dataframe.

Parameters

ccc_matrix : pandas.DataFrame A dataframe containing the communication scores for a given combination between a pair of sender-receiver cells and a ligand-receptor pair. Columns are pairs of cells and rows LR pairs.

Returns

readable_ccc : pandas.DataFrame A dataframe containing flat information in each row about communication scores for a given pair of cells and a specific LR pair. A row contains the sender and receiver cells as well as the ligand and the receptor participating in an interaction and their respective communication score. Columns are: ['sender', 'receiver', 'ligand', 'receptor', 'communication_score']
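
The reshaping can be pictured with plain Python containers. This is a minimal sketch, assuming a nested dict in place of the pandas dataframe, with (ligand, receptor) tuples as row labels and 'sender;receiver' strings as column labels; `flatten_ccc` and the scores are made up for illustration. Row order may differ from the order produced by the pandas-based function.

```python
def flatten_ccc(ccc):
    """Flatten {(ligand, receptor): {'sender;receiver': score}} into one row per score."""
    rows = []
    for (ligand, receptor), scores in ccc.items():
        for cell_pair, score in scores.items():
            # Column labels join the sender and receiver cells with ';'
            sender, receiver = cell_pair.split(';')
            rows.append({'sender': sender, 'receiver': receiver,
                         'ligand': ligand, 'receptor': receptor,
                         'communication_score': score})
    return rows

toy = {('L1', 'R1'): {'A;B': 0.8, 'B;A': 0.1}}  # made-up scores
for row in flatten_ccc(toy):
    print(row)
```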

Source code in cell2cell/plotting/circular_plot.py
def get_readable_ccc_matrix(ccc_matrix):
    '''Transforms a CCC matrix from an InteractionSpace instance
    into a readable dataframe.

    Parameters
    ----------
    ccc_matrix : pandas.DataFrame
        A dataframe containing the communication scores for a given combination
        between a pair of sender-receiver cells and a ligand-receptor pair.
        Columns are pairs of cells and rows LR pairs.

    Returns
    -------
    readable_ccc : pandas.DataFrame
        A dataframe containing flat information in each row about communication
        scores for a given pair of cells and a specific LR pair. A row contains
        the sender and receiver cells as well as the ligand and the receptor
        participating in an interaction and their respective communication
        score. Columns are: ['sender', 'receiver', 'ligand', 'receptor',
        'communication_score']
    '''
    readable_ccc = ccc_matrix.copy()
    readable_ccc.index.name = 'LR'
    readable_ccc.reset_index(inplace=True)

    readable_ccc = readable_ccc.melt(id_vars=['LR'])
    readable_ccc['sender'] = readable_ccc['variable'].apply(lambda x: x.split(';')[0])
    readable_ccc['receiver'] = readable_ccc['variable'].apply(lambda x: x.split(';')[1])
    readable_ccc['ligand'] = readable_ccc['LR'].apply(lambda x: x[0])
    readable_ccc['receptor'] = readable_ccc['LR'].apply(lambda x: x[1])
    readable_ccc = readable_ccc[['sender', 'receiver', 'ligand', 'receptor', 'value']]
    readable_ccc.columns = ['sender', 'receiver', 'ligand', 'receptor', 'communication_score']
    return readable_ccc

sort_nodes(sender_cells, receiver_cells, ligands, receptors)

Orders the nodes for plotting, with sender cells first. Cells keep the order of the input lists (this function applies no alphabetical sorting). For each sender, it creates sender-ligand pairs; if the sender is also a receiver, it then creates the sender-receptor pairs for that shared cell. Finally, it creates receiver-receptor pairs for the receivers that are not shared with the senders.

Parameters

sender_cells : list List of sender cells to sort.

receiver_cells : list List of receiver cells to sort.

ligands : list List of ligands to sort.

receptors : list List of receptors to sort.

Returns

sorted_nodes : dict A dictionary where keys are the cell-protein nodes (named 'cell^protein') and values are the positions they obtained (a ranking from 0 to N-1, where N is the total number of nodes).
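
A small worked example makes the ordering concrete. The sketch below is a compact re-implementation for illustration, with the hypothetical name `sort_nodes_sketch` and made-up cell and protein names; it follows the loop structure of the source listed below.

```python
def sort_nodes_sketch(sender_cells, receiver_cells, ligands, receptors):
    """Rank 'cell^protein' nodes: senders first, then receivers that are not senders."""
    shared = set(sender_cells) & set(receiver_cells)
    order = []
    for c in sender_cells:
        order += [f'{c}^{p}' for p in ligands]
        if c in shared:
            # A cell that is both sender and receiver gets its receptors right after its ligands
            order += [f'{c}^{p}' for p in receptors]
    for c in receiver_cells:
        if c not in shared:
            order += [f'{c}^{p}' for p in receptors]
    return {node: i for i, node in enumerate(order)}

ranking = sort_nodes_sketch(['A', 'B'], ['B', 'C'], ['L1'], ['R1'])
print(ranking)  # {'A^L1': 0, 'B^L1': 1, 'B^R1': 2, 'C^R1': 3}
```

Cell B is both a sender and a receiver, so its ligand and receptor nodes sit next to each other on the circle.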

Source code in cell2cell/plotting/circular_plot.py
def sort_nodes(sender_cells, receiver_cells, ligands, receptors):
    '''Orders the nodes for plotting, with sender cells first. Cells keep
    the order of the input lists. For each sender, it creates pairs of
    senders-ligands and, if the sender is also a receiver, pairs of
    senders-receptors for that shared cell. Then it creates pairs of
    receivers-receptors for those cells that are not shared with senders.

    Parameters
    ----------
    sender_cells : list
        List of sender cells to sort.

    receiver_cells : list
        List of receiver cells to sort.

    ligands : list
        List of ligands to sort.

    receptors : list
        List of receptors to sort.

    Returns
    -------
    sorted_nodes : dict
        A dictionary where keys are the nodes of cells-proteins and values
        are the position they obtained (a ranking from 0 to N-1, where N is
        the total number of nodes).
    '''
    sorted_nodes = dict()
    count = 0

    both = set(sender_cells) & set(receiver_cells) # Intersection
    for c in sender_cells:
        for p in ligands:
            sorted_nodes[(c + '^' + p)] = count
            count += 1

        if c in both:
            for p in receptors:
                sorted_nodes[(c + '^' + p)] = count
                count += 1

    for c in receiver_cells:
        if c not in both:
            for p in receptors:
                sorted_nodes[(c + '^' + p)] = count
                count += 1
    return sorted_nodes

factor_plot

ccc_networks_plot(factors, included_factors=None, sender_label='Sender Cells', receiver_label='Receiver Cells', ccc_threshold=None, panel_size=(8, 8), nrows=2, network_layout='spring', edge_color='magenta', edge_width=25, edge_arrow_size=20, edge_alpha=0.25, node_color='#210070', node_size=1000, node_alpha=0.9, node_label_size=20, node_label_alpha=0.7, node_label_offset=(0.1, -0.2), factor_title_size=36, filename=None)

Plots factor-specific cell-cell communication networks resulting from decomposition with Tensor-cell2cell.

Parameters

factors : dict Ordered dictionary containing a dataframe with the factor loadings for each dimension/order of the tensor.

included_factors : list, default=None Factors to be included. Factor names must be the same as the key values in the factors dictionary.

sender_label : str Label for the dimension of sender cells. It is one key of the factors dict.

receiver_label : str Label for the dimension of receiver cells. It is one key of the factors dict.

ccc_threshold : float, default=None Threshold to consider only edges with a higher weight than this value.

panel_size : tuple, default=(8, 8) Size of one subplot or network (width*height), each in inches.

nrows : int, default=2 Number of rows in the set of subplots.

network_layout : str, default='spring' Visualization layout of the networks. It uses algorithms implemented in NetworkX, including:

- 'spring' : Fruchterman-Reingold force-directed algorithm.
- 'circular' : Position nodes on a circle.

edge_color : str, default='magenta' Color of the edges in the network.

edge_width : int, default=25 Thickness of the edges in the network.

edge_arrow_size : int, default=20 Size of the arrow of an edge pointing towards the receiver cells.

edge_alpha : float, default=0.25 Transparency of the edges. Values must be between 0 and 1. Higher values indicate less transparency.

node_color : str, default="#210070" Color of the nodes in the network.

node_size : int, default=1000 Size of the nodes in the network.

node_alpha : float, default=0.9 Transparency of the nodes. Values must be between 0 and 1. Higher values indicate less transparency.

node_label_size : int, default=20 Size of the labels for the node names.

node_label_alpha : float, default=0.7 Transparency of the node labels. Values must be between 0 and 1. Higher values indicate less transparency.

node_label_offset : tuple, default=(0.1, -0.2) Offset values to move the node labels away from the center of the nodes.

factor_title_size : int, default=36 Size of the subplot titles. Each network has a title like 'Factor 1', 'Factor 2', ... ,'Factor R'.

filename : str, default=None Path to save the figure. If None, the figure is not saved.

Returns

fig : matplotlib.figure.Figure A matplotlib figure.

axes : matplotlib.axes.Axes or array of Axes Matplotlib axes representing the subplots containing the networks.
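
Two pieces of arithmetic determine the output: the subplot grid derived from nrows and panel_size, and the edge filtering applied by ccc_threshold. The sketch below isolates both with the standard library; `subplot_grid` and `mask_edges` are hypothetical helper names, and nested lists stand in for the adjacency dataframe.

```python
import math

def subplot_grid(n_factors, nrows=2, panel_size=(8, 8)):
    """Rows, columns and figure size used to lay out one network per factor."""
    nrows = min(n_factors, nrows)
    ncols = math.ceil(n_factors / nrows)
    figsize = (panel_size[0] * ncols, panel_size[1] * nrows)
    return nrows, ncols, figsize

def mask_edges(weights, threshold):
    """Keep weights strictly greater than the threshold and set the rest to 0."""
    return [[w if w > threshold else 0 for w in row] for row in weights]

print(subplot_grid(5))  # (2, 3, (24, 16))
print(mask_edges([[0.2, 0.6], [0.5, 0.0]], threshold=0.5))  # [[0, 0.6], [0, 0]]
```

With five factors the grid has six panels, and the source hides the unused one. A weight equal to the threshold is dropped, since the comparison is strict.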

Source code in cell2cell/plotting/factor_plot.py
def ccc_networks_plot(factors, included_factors=None, sender_label='Sender Cells', receiver_label='Receiver Cells',
                      ccc_threshold=None, panel_size=(8, 8), nrows=2, network_layout='spring', edge_color='magenta',
                      edge_width=25, edge_arrow_size=20, edge_alpha=0.25, node_color="#210070", node_size=1000,
                      node_alpha=0.9, node_label_size=20, node_label_alpha=0.7, node_label_offset=(0.1, -0.2),
                      factor_title_size=36, filename=None):
    '''Plots factor-specific cell-cell communication networks
    resulting from decomposition with Tensor-cell2cell.

    Parameters
    ----------
    factors : dict
        Ordered dictionary containing a dataframe with the factor loadings for each
        dimension/order of the tensor.

    included_factors : list, default=None
        Factors to be included. Factor names must be the same as the key values in
        the factors dictionary.

    sender_label : str
        Label for the dimension of sender cells. It is one key of the factors dict.

    receiver_label : str
        Label for the dimension of receiver cells. It is one key of the factors dict.

    ccc_threshold : float, default=None
        Threshold to consider only edges with a higher weight than this value.

    panel_size : tuple, default=(8, 8)
        Size of one subplot or network (width*height), each in inches.

    nrows : int, default=2
        Number of rows in the set of subplots.

    network_layout : str, default='spring'
        Visualization layout of the networks. It uses algorithms implemented
        in NetworkX, including:
            -'spring' : Fruchterman-Reingold force-directed algorithm.
            -'circular' : Position nodes on a circle.

    edge_color : str, default='magenta'
        Color of the edges in the network.

    edge_width : int, default=25
        Thickness of the edges in the network.

    edge_arrow_size : int, default=20
        Size of the arrow of an edge pointing towards the receiver cells.

    edge_alpha : float, default=0.25
        Transparency of the edges. Values must be between 0 and 1. Higher
        values indicate less transparency.

    node_color : str, default="#210070"
        Color of the nodes in the network.

    node_size : int, default=1000
        Size of the nodes in the network.

    node_alpha : float, default=0.9
        Transparency of the nodes. Values must be between 0 and 1. Higher
        values indicate less transparency.

    node_label_size : int, default=20
        Size of the labels for the node names.

    node_label_alpha : float, default=0.7
        Transparency of the node labels. Values must be between 0 and 1.
        Higher values indicate less transparency.

    node_label_offset : tuple, default=(0.1, -0.2)
        Offset values to move the node labels away from the center of the nodes.

    factor_title_size : int, default=36
        Size of the subplot titles. Each network has a title like 'Factor 1',
        'Factor 2', ... ,'Factor R'.

    filename : str, default=None
        Path to save the figure. If None, the figure is not saved.

    Returns
    -------
    fig : matplotlib.figure.Figure
        A matplotlib figure.

    axes : matplotlib.axes.Axes or array of Axes
        Matplotlib axes representing the subplots containing the networks.
    '''
    networks = get_factor_specific_ccc_networks(result=factors,
                                                sender_label=sender_label,
                                                receiver_label=receiver_label)

    if included_factors is None:
        factor_labels = [f'Factor {i}' for i in range(1, len(networks) + 1)]
    else:
        factor_labels = included_factors

    nrows = min([len(factor_labels), nrows])
    ncols = int(np.ceil(len(factor_labels) / nrows))
    fig, axes = plt.subplots(nrows, ncols, figsize=(panel_size[0] * ncols, panel_size[1] * nrows))
    if len(factor_labels) == 1:
        axs = np.array([axes])
    else:
        axs = axes.flatten()

    for i, factor in enumerate(factor_labels):
        ax = axs[i]
        if ccc_threshold is not None:
            # Considers edges with weight above ccc_threshold
            df = networks[factor].gt(ccc_threshold).astype(int).multiply(networks[factor])
        else:
            df = networks[factor]

        # Networkx Directed Network
        G = nx.convert_matrix.from_pandas_adjacency(df, create_using=nx.DiGraph())

        # Layout for visualization - Node positions
        if network_layout == 'spring':
            pos = nx.spring_layout(G,
                                   k=1.,
                                   seed=888
                                   )
        elif network_layout == 'circular':
            pos = nx.circular_layout(G)
        else:
            raise ValueError("network_layout should be either 'spring' or 'circular'")

        # Weights for edge thickness
        weights = np.asarray([G.edges[e]['weight'] for e in G.edges()])

        # Visualize network
        nx.draw_networkx_edges(G,
                               pos,
                               alpha=edge_alpha,
                               arrowsize=edge_arrow_size,
                               width=weights * edge_width,
                               edge_color=edge_color,
                               connectionstyle="arc3,rad=-0.3",
                               ax=ax
                               )
        nx.draw_networkx_nodes(G,
                               pos,
                               node_color=node_color,
                               node_size=node_size,
                               alpha=node_alpha,
                               ax=ax
                               )
        label_options = {"ec": "k", "fc": "white", "alpha": node_label_alpha}
        _ = nx.draw_networkx_labels(G,
                                    {k: v + np.array(node_label_offset) for k, v in pos.items()},
                                    font_size=node_label_size,
                                    bbox=label_options,
                                    ax=ax
                                    )

        ax.set_frame_on(False)
        xlim = ax.get_xlim()
        ylim = ax.get_ylim()
        coeff = 1.4
        ax.set_xlim((xlim[0] * coeff, xlim[1] * coeff))
        ax.set_ylim((ylim[0] * coeff, ylim[1] * coeff))
        ax.set_title(factor, fontsize=factor_title_size, fontweight='bold')

    # Remove extra subplots
    for j in range(i+1, axs.shape[0]):
        ax = axs[j]
        ax.axis(False)

    plt.tight_layout()
    if filename is not None:
        plt.savefig(filename, dpi=300, bbox_inches='tight')
    return fig, axes

context_boxplot(context_loadings, metadict, included_factors=None, group_order=None, statistical_test='Mann-Whitney', pval_correction='benjamini-hochberg', text_format='star', nrows=1, figsize=(12, 6), cmap='tab10', title_size=14, axis_label_size=12, group_label_rotation=45, ylabel='Context Loadings', dot_color='lightsalmon', dot_edge_color='brown', filename=None, verbose=False)

Plots a boxplot to compare the loadings of context groups in each of the factors resulting from a tensor decomposition.

Parameters

context_loadings : pandas.DataFrame Dataframe containing the loadings of each of the contexts from a tensor decomposition. Rows are contexts and columns are the factors obtained.

metadict : dict A dictionary containing the group to which each context belongs. Keys correspond to the indexes in context_loadings and values are the respective groups. For example: metadict={'Context 1' : 'Group 1', 'Context 2' : 'Group 1', 'Context 3' : 'Group 2', 'Context 4' : 'Group 2'}

included_factors : list, default=None Factors to be included. Factor names must be the same as column elements in the context_loadings.

group_order : list, default=None Order of the groups to plot the boxplots. Considering the example of the metadict, it could be: group_order=['Group 1', 'Group 2'] or group_order=['Group 2', 'Group 1']. If None, the order in which groups are found in metadict will be used.

statistical_test : str, default='Mann-Whitney' The statistical test to compare context groups within each factor. Options include: 't-test_ind', 't-test_welch', 't-test_paired', 'Mann-Whitney', 'Mann-Whitney-gt', 'Mann-Whitney-ls', 'Levene', 'Wilcoxon', 'Kruskal'.

pval_correction : str, default='benjamini-hochberg' Multiple test correction method to reduce false positives. Options include: 'bonferroni', 'bonf', 'Bonferroni', 'holm-bonferroni', 'HB', 'Holm-Bonferroni', 'holm', 'benjamini-hochberg', 'BH', 'fdr_bh', 'Benjamini-Hochberg', 'fdr_by', 'Benjamini-Yekutieli', 'BY', None

text_format : str, default='star' Format to display the results of the statistical test. Options are:

- 'star', to display P-values < 1e-4 as "****"; < 1e-3 as "***";
          < 1e-2 as "**"; < 0.05 as "*", and < 1 as "ns".
- 'simple', to display P-values < 1e-5 as "1e-5"; < 1e-4 as "1e-4";
          < 1e-3 as "0.001"; < 1e-2 as "0.01"; and < 5e-2 as "0.05".

nrows : int, default=1 Number of rows to generate the subplots.

figsize : tuple, default=(12, 6) Size of the figure (width*height), each in inches.

cmap : str, default='tab10' Name of the color palette for coloring the major groups of contexts.

title_size : int, default=14 Font size of the title in each of the factor boxplots.

axis_label_size : int, default=12 Font size of the labels for X and Y axes.

group_label_rotation : int, default=45 Angle of rotation for the tick labels in the X axis.

ylabel : str, default='Context Loadings' Label for the Y axis.

dot_color : str, default='lightsalmon' A matplotlib color for the dots representing individual contexts in the boxplot. For more info see: https://matplotlib.org/stable/gallery/color/named_colors.html

dot_edge_color : str, default='brown' A matplotlib color for the edge of the dots in the boxplot. For more info see: https://matplotlib.org/stable/gallery/color/named_colors.html

filename : str, default=None Path to save the figure. If None, the figure is not saved.

verbose : boolean, default=False Whether to print the results of the pairwise statistical tests for each factor.

Returns

fig : matplotlib.figure.Figure A matplotlib figure.

axes : matplotlib.axes.Axes or array of Axes Matplotlib axes representing the subplots containing the boxplots.

Source code in cell2cell/plotting/factor_plot.py
def context_boxplot(context_loadings, metadict, included_factors=None, group_order=None, statistical_test='Mann-Whitney',
                    pval_correction='benjamini-hochberg', text_format='star', nrows=1, figsize=(12, 6), cmap='tab10',
                    title_size=14, axis_label_size=12, group_label_rotation=45, ylabel='Context Loadings',
                    dot_color='lightsalmon', dot_edge_color='brown', filename=None, verbose=False):
    '''Plots a boxplot to compare the loadings of context groups in each
    of the factors resulting from a tensor decomposition.

    Parameters
    ----------
    context_loadings : pandas.DataFrame
        Dataframe containing the loadings of each of the contexts
        from a tensor decomposition. Rows are contexts and columns
        are the factors obtained.

    metadict : dict
        A dictionary mapping each context to its group. Keys correspond
        to the index labels of `context_loadings`
        and values are the respective groups. For example:
        metadict={'Context 1' : 'Group 1', 'Context 2' : 'Group 1',
                  'Context 3' : 'Group 2', 'Context 4' : 'Group 2'}

    included_factors : list, default=None
        Factors to be included. Factor names must be the same as column elements
        in the context_loadings.

    group_order : list, default=None
        Order of the groups to plot the boxplots. Considering the
        example of the metadict, it could be:
        group_order=['Group 1', 'Group 2'] or
        group_order=['Group 2', 'Group 1']
        If None, the order that groups are found in `metadict`
        will be considered.

    statistical_test : str, default='Mann-Whitney'
        The statistical test to compare context groups within each factor.
        Options include:
        't-test_ind', 't-test_welch', 't-test_paired', 'Mann-Whitney',
        'Mann-Whitney-gt', 'Mann-Whitney-ls', 'Levene', 'Wilcoxon', 'Kruskal'.

    pval_correction : str, default='benjamini-hochberg'
        Multiple test correction method to reduce false positives.
        Options include:
        'bonferroni', 'bonf', 'Bonferroni', 'holm-bonferroni', 'HB',
        'Holm-Bonferroni', 'holm', 'benjamini-hochberg', 'BH', 'fdr_bh',
        'Benjamini-Hochberg', 'fdr_by', 'Benjamini-Yekutieli', 'BY', None

    text_format : str, default='star'
        Format to display the results of the statistical test.
        Options are:

        - 'star', to display P-values < 1e-4 as "****"; < 1e-3 as "***";
                  < 1e-2 as "**"; < 0.05 as "*", and < 1 as "ns".
        - 'simple', to display P-values < 1e-5 as "1e-5"; < 1e-4 as "1e-4";
                  < 1e-3 as "0.001"; < 1e-2 as "0.01"; and < 5e-2 as "0.05".

    nrows : int, default=1
        Number of rows to generate the subplots.

    figsize : tuple, default=(12, 6)
        Size of the figure (width*height), each in inches.

    cmap : str, default='tab10'
        Name of the color palette for coloring the major groups of contexts.

    title_size : int, default=14
        Font size of the title in each of the factor boxplots.

    axis_label_size : int, default=12
        Font size of the labels for X and Y axes.

    group_label_rotation : int, default=45
        Angle of rotation for the tick labels in the X axis.

    ylabel : str, default='Context Loadings'
        Label for the Y axis.

    dot_color : str, default='lightsalmon'
        A matplotlib color for the dots representing individual contexts
        in the boxplot. For more info see:
        https://matplotlib.org/stable/gallery/color/named_colors.html

    dot_edge_color : str, default='brown'
        A matplotlib color for the edge of the dots in the boxplot.
        For more info see:
        https://matplotlib.org/stable/gallery/color/named_colors.html

    filename : str, default=None
        Path to save the figure. If None, the figure is not saved.

    verbose : boolean, default=False
        Whether to print the results of the pairwise statistical tests
        for each factor.

    Returns
    -------
    fig : matplotlib.figure.Figure
        A matplotlib figure.

    axes : matplotlib.axes.Axes or array of Axes
           Matplotlib axes representing the subplots containing the boxplots.
    '''
    if group_order is not None:
        assert len(set(group_order) & set(metadict.values())) == len(set(metadict.values())), "All groups in `metadict` must be contained in `group_order`"
    else:
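        # Keep the order in which groups first appear in `metadict`, as documented
        # (a plain set() does not guarantee any order)
        group_order = list(dict.fromkeys(metadict.values()))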
    df = context_loadings.copy()

    if included_factors is None:
        factor_labels = list(df.columns)
    else:
        factor_labels = included_factors
    rank = len(factor_labels)
    df['Group'] = [metadict[idx] for idx in df.index]

    nrows = min([rank, nrows])
    ncols = int(np.ceil(rank/nrows))
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize, sharey='none')

    if rank == 1:
        axs = np.array([axes])
    else:
        axs = axes.flatten()


    for i, factor in enumerate(factor_labels):
        ax = axs[i]
        x, y = 'Group', factor

        order = group_order

        # Plot the boxes
        ax = sns.boxplot(x=x,
                         y=y,
                         data=df,
                         order=order,
                         whis=[0, 100],
                         width=.6,
                         palette=cmap,
                         boxprops=dict(alpha=.5),
                         ax=ax
                         )

        # Plot the dots
        sns.stripplot(x=x,
                      y=y,
                      data=df,
                      size=6,
                      order=order,
                      color=dot_color,
                      edgecolor=dot_edge_color,
                      linewidth=0.6,
                      jitter=False,
                      ax=ax
                      )

        if statistical_test is not None:
            # Add annotations about statistical test
            from itertools import combinations

            pairs = list(combinations(order, 2))
            annotator = Annotator(ax=ax,
                                  pairs=pairs,
                                  data=df,
                                  x=x,
                                  y=y,
                                  order=order)
            annotator.configure(test=statistical_test,
                                text_format=text_format,
                                loc='inside',
                                comparisons_correction=pval_correction,
                                verbose=verbose
                                )
            annotator.apply_and_annotate()

        ax.set_title(factor, fontsize=title_size)

        ax.set_xlabel('', fontsize=axis_label_size)
        if (i == 0) | (((i) % ncols) == 0):
            ax.set_ylabel(ylabel, fontsize=axis_label_size)
        else:
            ax.set_ylabel(' ', fontsize=axis_label_size)

        ax.set_xticklabels(ax.get_xticklabels(),
                           rotation=group_label_rotation,
                           rotation_mode='anchor',
                           va='bottom',
                           ha='right')

    # Remove extra subplots
    for j in range(i+1, axs.shape[0]):
        ax = axs[j]
        ax.axis(False)

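    # With a single factor, plt.subplots returns one Axes object (no `shape` attribute)
    if isinstance(axes, np.ndarray) and axes.shape[0] > 1: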
        axes = axes.reshape(axes.shape[0], -1)
        fig.align_ylabels(axes[:, 0])

    plt.tight_layout(rect=[0, 0.03, 1, 0.99])
    if filename is not None:
        plt.savefig(filename, dpi=300, bbox_inches='tight')
    return fig, axes

loading_clustermap(loadings, loading_threshold=0.0, use_zscore=True, metric='euclidean', method='ward', optimal_leaf=True, figsize=(15, 8), heatmap_lw=0.2, cbar_fontsize=12, tick_fontsize=10, cmap=None, cbar_label=None, filename=None, **kwargs)

Plots a clustermap of the tensor-factorization loadings from one tensor dimension or the joint loadings from multiple tensor dimensions.

Parameters


loadings : pandas.DataFrame Loadings for a given tensor dimension after running the tensor decomposition. Rows are the elements in one dimension or joint pairs/n-tuples in multiple dimensions. It is recommended that the loadings resulting from the decomposition should be l2-normalized prior to their use, by considering all dimensions together. For example, take the factors dictionary found in any InteractionTensor or any BaseTensor derived class, and execute cell2cell.tensor.normalize(factors).

loading_threshold : float, default=0.0 Threshold to filter out elements in the loadings dataframe. Only elements with a loading greater than this threshold in at least one of the factors are plotted.
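The filtering rule can be illustrated without pandas. In this sketch each element maps to its list of loadings, one per factor; the data layout and the name `filter_by_threshold` are assumptions made for the example, while the function itself operates on a DataFrame.

```python
def filter_by_threshold(loadings, loading_threshold=0.0):
    """Keep elements whose loading exceeds the threshold in at least one factor."""
    return {element: values for element, values in loadings.items()
            if any(v > loading_threshold for v in values)}


# Hypothetical loadings: one list per element, one value per factor
loadings = {'Element A': [0.02, 0.40, 0.01],
            'Element B': [0.03, 0.04, 0.02],
            'Element C': [0.15, 0.00, 0.05]}
kept = filter_by_threshold(loadings, loading_threshold=0.1)
```

Here 'Element B' is dropped because none of its loadings is above 0.1.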

use_zscore : boolean, default=True Whether to convert loadings to z-scores across factors.
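As a sketch of what "z-scores across factors" means, the example below standardizes each element's loadings over the factors and then derives symmetric color limits by rounding the largest absolute z-score up, as the source of this function does. It assumes the population standard deviation (ddof=0, the default of scipy.stats.zscore); the helper names are hypothetical.

```python
import math
import statistics


def zscores_across_factors(values):
    """Standardize one element's loadings across factors (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # zero for constant loadings, which would fail
    return [(v - mean) / std for v in values]


def symmetric_limits(rows):
    """Return (vmin, vmax) centered on zero and rounded up to an integer."""
    largest = max(abs(z) for row in rows for z in row)
    bound = float(math.ceil(largest))
    return -bound, bound


loadings = {'Element A': [0.1, 0.2, 0.9], 'Element B': [0.5, 0.4, 0.3]}
z = {name: zscores_across_factors(vals) for name, vals in loadings.items()}
vmin, vmax = symmetric_limits(z.values())
```

Centering the limits on zero keeps a diverging palette such as 'vlag' balanced around the mean loading of each element.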

metric : str, default='euclidean' The distance metric to use. The distance function can be 'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.

method : str, default='ward' Method to compute the linkage. It could be:

   - 'single'
   - 'complete'
   - 'average'
   - 'weighted'
   - 'centroid'
   - 'median'
   - 'ward'
   For more details, go to:
   https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.cluster.hierarchy.linkage.html

optimal_leaf : boolean, default=True Whether to sort the leaves of the dendrograms to minimize the distance between successive leaves. For more information, see scipy.cluster.hierarchy.optimal_leaf_ordering

figsize : tuple, default=(15, 8) Size of the figure (width*height), each in inches.

heatmap_lw : float, default=0.2 Width of the lines that will divide each cell.

cbar_fontsize : int, default=12 Font size for the colorbar title.

tick_fontsize : int, default=10 Font size for ticks in the x and y axes.

cmap : str, default=None Name of the color palette for coloring the heatmap. If None, cmap='Blues' would be used when use_zscore=False; and cmap='vlag' when use_zscore=True.

cbar_label : str, default=None Label for the color bar. If None, the default label is 'Z-scores across factors' when use_zscore=True and 'Loadings' when use_zscore=False.

filename : str, default=None Path to save the figure. If None, the figure is not saved.

**kwargs : dict Dictionary containing arguments for the seaborn.clustermap function.

Returns


cm : seaborn.matrix.ClusterGrid A seaborn ClusterGrid instance.

Source code in cell2cell/plotting/factor_plot.py
def loading_clustermap(loadings, loading_threshold=0., use_zscore=True, metric='euclidean', method='ward',
                       optimal_leaf=True, figsize=(15, 8), heatmap_lw=0.2, cbar_fontsize=12, tick_fontsize=10, cmap=None,
                       cbar_label=None, filename=None, **kwargs):
    '''Plots a clustermap of the tensor-factorization loadings from one tensor dimension or
    the joint loadings from multiple tensor dimensions.

    Parameters
    ----------
    loadings : pandas.DataFrame
        Loadings for a given tensor dimension after running the tensor
        decomposition. Rows are the elements in one dimension or joint
        pairs/n-tuples in multiple dimensions. It is recommended that
        the loadings resulting from the decomposition should be
        l2-normalized prior to their use, by considering all dimensions
        together. For example, take the factors dictionary found in any
        InteractionTensor or any BaseTensor derived class, and execute
        cell2cell.tensor.normalize(factors).

    loading_threshold : float, default=0.0
        Threshold to filter out elements in the loadings dataframe.
        This plot considers elements with loadings greater than this
        threshold in at least one of the factors.

    use_zscore : boolean, default=True
        Whether to convert loadings to z-scores across factors.

    metric : str, default='euclidean'
        The distance metric to use. The distance function can be 'braycurtis',
        'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice',
        'euclidean', 'hamming', 'jaccard', 'jensenshannon', 'kulsinski',
        'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao',
        'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.

    method : str, default='ward'
        Method to compute the linkage.
        It could be:

        - 'single'
        - 'complete'
        - 'average'
        - 'weighted'
        - 'centroid'
        - 'median'
        - 'ward'
        For more details, go to:
        https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.cluster.hierarchy.linkage.html

    optimal_leaf : boolean, default=True
        Whether to sort the leaves of the dendrograms to minimize the distance
        between successive leaves. For more information, see
        scipy.cluster.hierarchy.optimal_leaf_ordering

    figsize : tuple, default=(15, 8)
        Size of the figure (width*height), each in inches.

    heatmap_lw : float, default=0.2
        Width of the lines that will divide each cell.

    cbar_fontsize : int, default=12
        Font size for the colorbar title.

    tick_fontsize : int, default=10
        Font size for ticks in the x and y axes.

    cmap : str, default=None
        Name of the color palette for coloring the heatmap. If None,
        cmap='Blues' would be used when use_zscore=False; and cmap='vlag' when use_zscore=True.

    cbar_label : str, default=None
        Label for the color bar. If None, the default label is 'Z-scores \n across factors'
        when `use_zscore` is True and 'Loadings' when `use_zscore` is False.

    filename : str, default=None
        Path to save the figure. If None, the figure is not saved.

    **kwargs : dict
        Dictionary containing arguments for the seaborn.clustermap function.

    Returns
    -------
    cm : seaborn.matrix.ClusterGrid
        A seaborn ClusterGrid instance.
    '''
    df = loadings.copy()
    df = df[(df.T > loading_threshold).any()].T
    if use_zscore:
        df = df.apply(zscore)
        if cmap is None:
            cmap = 'vlag'
        val = np.ceil(max([abs(df.min().min()), abs(df.max().max())]))
        vmin, vmax = -1. * val, val
        if cbar_label is None:
            cbar_label = 'Z-scores \n across factors'
    else:
        if cmap is None:
            cmap='Blues'
        vmin, vmax = 0., df.max().max()
        if cbar_label is None:
            cbar_label = 'Loadings'
    # Clustering
    dm_rows = compute_distance(df, axis=0, metric=metric)
    row_linkage = compute_linkage(dm_rows, method=method, optimal_ordering=optimal_leaf)

    dm_cols = compute_distance(df, axis=1, metric=metric)
    col_linkage = compute_linkage(dm_cols, method=method, optimal_ordering=optimal_leaf)

    # Clustermap
    cm = sns.clustermap(df,
                        cmap=cmap,
                        col_linkage=col_linkage,
                        row_linkage=row_linkage,
                        vmin=vmin,
                        vmax=vmax,
                        xticklabels=1,
                        figsize=figsize,
                        linewidths=heatmap_lw,
                        **kwargs
                        )

    # Color bar label
    cbar = cm.ax_heatmap.collections[0].colorbar
    cbar.ax.set_ylabel(cbar_label, fontsize=cbar_fontsize)
    cbar.ax.yaxis.set_label_position("left")

    # Tick labels
    cm.ax_heatmap.set_yticklabels(cm.ax_heatmap.yaxis.get_majorticklabels(), rotation=0, ha='left', fontsize=tick_fontsize)
    plt.setp(cm.ax_heatmap.xaxis.get_majorticklabels(), fontsize=tick_fontsize)

    # Resize clustermap and dendrograms
    hm = cm.ax_heatmap.get_position()
    w_mult = 1.0
    h_mult = 1.0
    cm.ax_heatmap.set_position([hm.x0, hm.y0, hm.width * w_mult, hm.height])
    row = cm.ax_row_dendrogram.get_position()
    row_d_mult = 0.33
    cm.ax_row_dendrogram.set_position(
        [row.x0 + row.width * (1 - row_d_mult), row.y0, row.width * row_d_mult, row.height * h_mult])

    col = cm.ax_col_dendrogram.get_position()
    cm.ax_col_dendrogram.set_position([col.x0, col.y0, col.width * w_mult, col.height * 0.5])

    # Set the axis-label font sizes before saving so they appear in the saved file
    cm.ax_heatmap.set_xlabel(cm.ax_heatmap.get_xlabel(), fontsize=int(1.2 * tick_fontsize))
    cm.ax_heatmap.set_ylabel(cm.ax_heatmap.get_ylabel(), fontsize=int(1.2 * tick_fontsize))

    if filename is not None:
        plt.savefig(filename, dpi=300, bbox_inches='tight')

    return cm

pcoa_plot

pcoa_3dplot(interaction_space, metadata=None, sample_col='#SampleID', group_col='Groups', pcoa_method='eigh', meta_cmap='gist_rainbow', colors=None, excluded_cells=None, title='', axis_fontsize=14, legend_fontsize=12, figsize=(6, 5), view_angles=(30, 135), filename=None)

Projects the cells into a Euclidean space (PCoA) given their distances based on their CCI scores. Then, plots each cell by its first three coordinates in a 3D scatter plot.

Parameters

interaction_space : cell2cell.core.interaction_space.InteractionSpace Interaction space that contains a distance matrix after running the method compute_pairwise_cci_scores. Alternatively, this object can be a numpy array or a pandas DataFrame containing the distance matrix, or a SingleCellInteractions or BulkInteractions object after running the method compute_pairwise_cci_scores.

metadata : pandas.DataFrame, default=None Metadata associated with the cells, cell types or samples in the matrix containing CCI scores. If None, cells will not be colored by major groups.

sample_col : str, default='#SampleID' Column in the metadata for the cells, cell types or samples in the matrix containing CCI scores.

group_col : str, default='Groups' Column in the metadata containing the major groups of cells, cell types or samples in the matrix with CCI scores.

pcoa_method : str, default='eigh' Eigendecomposition method to use in performing PCoA. By default, uses SciPy's eigh, which computes exact eigenvectors and eigenvalues for all dimensions. The alternate method, fsvd, uses faster heuristic eigendecomposition but loses accuracy. The magnitude of accuracy lost is dependent on dataset.
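For orientation, classical PCoA with an exact eigendecomposition can be sketched in a few lines of NumPy: square the distances, double-center them, and scale the leading eigenvectors by the square roots of their eigenvalues. This is a generic textbook sketch under that assumption, not the implementation bundled in cell2cell.external; `classical_pcoa` is a hypothetical name.

```python
import numpy as np


def classical_pcoa(distances, n_components=3):
    """Return (coordinates, proportion explained) from a distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram matrix
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; reverse to descending
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Clip tiny negative eigenvalues caused by floating-point error
    positive = np.clip(eigvals, 0.0, None)
    coords = eigvecs[:, :n_components] * np.sqrt(positive[:n_components])
    explained = positive / positive.sum()
    return coords, explained[:n_components]


# Three points forming a 3-4-5 right triangle
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords, explained = classical_pcoa(dist)
```

For Euclidean input such as this triangle, the pairwise distances between the returned coordinates reproduce the original distance matrix.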

meta_cmap : str, default='gist_rainbow' Name of the color palette for coloring the major groups of cells.

colors : dict, default=None Dictionary containing tuples in the RGBA format for indicating colors of major groups of cells. If colors is specified, meta_cmap will be ignored.

excluded_cells : list, default=None List containing cell names that are present in the interaction_space object but that will be excluded from this plot.

title : str, default='' Title of the PCoA 3D plot.

axis_fontsize : int, default=14 Size of the font for the labels of each axis (X, Y and Z).

legend_fontsize : int, default=12 Size of the font for labels in the legend.

figsize : tuple, default=(6, 5) Size of the figure (width*height), each in inches.

view_angles : tuple, default=(30, 135) Rotation angles of the plot. Set the elevation and azimuth of the axes.

filename : str, default=None Path to save the figure. If None, the figure is not saved.

Returns

results : dict Dictionary that contains:

- 'fig' : matplotlib.figure.Figure, containing the whole figure
- 'axes' : matplotlib.axes.Axes, containing the axes of the 3D plot
- 'ordination' : Ordination or projection obtained from the PCoA
- 'distance_matrix' : Distance matrix used to perform the PCoA (usually in
    interaction_space.distance_matrix)
Source code in cell2cell/plotting/pcoa_plot.py
def pcoa_3dplot(interaction_space, metadata=None, sample_col='#SampleID', group_col='Groups', pcoa_method='eigh',
                meta_cmap='gist_rainbow', colors=None, excluded_cells=None, title='', axis_fontsize=14, legend_fontsize=12,
                figsize=(6, 5), view_angles=(30, 135), filename=None):
    '''Projects the cells into a Euclidean space (PCoA) given their distances
    based on their CCI scores. Then, plots each cell by its first three
    coordinates in a 3D scatter plot.

    Parameters
    ----------
    interaction_space : cell2cell.core.interaction_space.InteractionSpace
        Interaction space that contains a distance matrix after running
        the method compute_pairwise_cci_scores. Alternatively, this object
        can be a numpy-array or a pandas DataFrame. Also, a
        SingleCellInteractions or a BulkInteractions object after running
        the method compute_pairwise_cci_scores.

    metadata : pandas.DataFrame, default=None
        Metadata associated with the cells, cell types or samples in the
        matrix containing CCI scores. If None, cells will not be colored
        by major groups.

    sample_col : str, default='#SampleID'
        Column in the metadata for the cells, cell types or samples
        in the matrix containing CCI scores.

    group_col : str, default='Groups'
        Column in the metadata containing the major groups of cells, cell types
        or samples in the matrix with CCI scores.

    pcoa_method : str, default='eigh'
        Eigendecomposition method to use in performing PCoA.
        By default, uses SciPy's `eigh`, which computes exact
        eigenvectors and eigenvalues for all dimensions. The alternate
        method, `fsvd`, uses faster heuristic eigendecomposition but loses
        accuracy. The magnitude of accuracy lost is dependent on dataset.

    meta_cmap : str, default='gist_rainbow'
        Name of the color palette for coloring the major groups of cells.

    colors : dict, default=None
        Dictionary containing tuples in the RGBA format for indicating colors
        of major groups of cells. If colors is specified, meta_cmap will be
        ignored.

    excluded_cells : list, default=None
        List containing cell names that are present in the interaction_space
        object but that will be excluded from this plot.

    title : str, default=''
        Title of the PCoA 3D plot.

    axis_fontsize : int, default=14
        Size of the font for the labels of each axis (X, Y and Z).

    legend_fontsize : int, default=12
        Size of the font for labels in the legend.

    figsize : tuple, default=(6, 5)
        Size of the figure (width*height), each in inches.

    view_angles : tuple, default=(30, 135)
        Rotation angles of the plot. Set the elevation and
        azimuth of the axes.

    filename : str, default=None
        Path to save the figure. If None, the figure is not saved.

    Returns
    -------
    results : dict
        Dictionary that contains:

        - 'fig' : matplotlib.figure.Figure, containing the whole figure
        - 'axes' : matplotlib.axes.Axes, containing the axes of the 3D plot
        - 'ordination' : Ordination or projection obtained from the PCoA
        - 'distance_matrix' : Distance matrix used to perform the PCoA (usually in
            interaction_space.distance_matrix)
    '''
    if hasattr(interaction_space, 'distance_matrix'):
        print('Interaction space detected as an InteractionSpace class')
        distance_matrix = interaction_space.distance_matrix
    elif (type(interaction_space) is np.ndarray) or (type(interaction_space) is pd.core.frame.DataFrame):
        print('Interaction space detected as a distance matrix')
        distance_matrix = interaction_space
    elif hasattr(interaction_space, 'interaction_space'):
        print('Interaction space detected as a Interactions class')
        if not hasattr(interaction_space.interaction_space, 'distance_matrix'):
            raise ValueError('First run the method compute_pairwise_cci_scores() in your interaction' + \
                             ' object to generate a distance matrix.')
        else:
            distance_matrix = interaction_space.interaction_space.distance_matrix
    else:
        raise ValueError('First run the method compute_pairwise_cci_scores() in your interaction' + \
                         ' object to generate a distance matrix.')

    # Drop excluded cells
    if excluded_cells is not None:
        df = distance_matrix.loc[~distance_matrix.index.isin(excluded_cells),
                                 ~distance_matrix.columns.isin(excluded_cells)]
    else:
        df = distance_matrix

    # PCoA
    ordination = pcoa(df, method=pcoa_method)
    ordination = _check_ordination(ordination)
    ordination['samples'].index = df.index

    # Biplot
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(111, projection='3d')
    #ax = Axes3D(fig) # Not displayed in newer versions

    if metadata is None:
        metadata = pd.DataFrame()
        metadata[sample_col] = list(distance_matrix.columns)
        metadata[group_col] = list(distance_matrix.columns)

    meta_ = metadata.set_index(sample_col)
    if excluded_cells is not None:
        meta_ = meta_.loc[~meta_.index.isin(excluded_cells)]
    labels = meta_[group_col].values.tolist()

    if colors is None:
        colors = get_colors_from_labels(labels, cmap=meta_cmap)
    else:
        assert all(elem in colors.keys() for elem in set(labels))

    # Plot each data point with respective color
    for i, cell_type in enumerate(sorted(meta_[group_col].unique())):
        cells = list(meta_.loc[meta_[group_col] == cell_type].index)
        if colors is not None:
            ax.scatter(ordination['samples'].loc[cells, 'PC1'],
                       ordination['samples'].loc[cells, 'PC2'],
                       ordination['samples'].loc[cells, 'PC3'],
                       color=colors[cell_type],
                       s=50,
                       edgecolors='k',
                       label=cell_type)
        else:
            ax.scatter(ordination['samples'].loc[cells, 'PC1'],
                       ordination['samples'].loc[cells, 'PC2'],
                       ordination['samples'].loc[cells, 'PC3'],
                       s=50,
                       edgecolors='k',
                       label=cell_type)

    # Plot texts
    # Round the percentage to two decimals (the 2 belongs to np.round, not to format)
    ax.set_xlabel('PC1 ({}%)'.format(np.round(ordination['proportion_explained']['PC1'] * 100, 2)), fontsize=axis_fontsize)
    ax.set_ylabel('PC2 ({}%)'.format(np.round(ordination['proportion_explained']['PC2'] * 100, 2)), fontsize=axis_fontsize)
    ax.set_zlabel('PC3 ({}%)'.format(np.round(ordination['proportion_explained']['PC3'] * 100, 2)), fontsize=axis_fontsize)

    ax.set_xticklabels([])
    ax.set_yticklabels([])
    ax.set_zticklabels([])

    ax.view_init(view_angles[0], view_angles[1])
    plt.legend(loc='center left', bbox_to_anchor=(1.35, 0.5),
               ncol=2, fancybox=True, shadow=True, fontsize=legend_fontsize)
    plt.title(title, fontsize=16)

    #distskbio = skbio.DistanceMatrix(df, ids=df.index) # Not using skbio for now

    # Save plot
    if filename is not None:
        plt.savefig(filename, dpi=300,
                    bbox_inches='tight')

    results = {'fig' : fig, 'axes' : ax, 'ordination' : ordination, 'distance_matrix' : df} # df used to be distskbio
    return results

pval_plot

dot_plot(sc_interactions, evaluation='communication', significance=0.05, senders=None, receivers=None, figsize=(16, 9), tick_size=8, cmap='PuOr', filename=None)

Generates a dot plot for the CCI or communication scores given their P-values. Dot sizes are given by -log10(P-value) and colors by the value of the CCI or communication score.
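The size transformation can be sketched as follows. Permutation P-values can be exactly zero, where the logarithm is undefined, so this example clips them at a small floor; the floor value and the helper name `neg_log10_pvalue` are assumptions made for this sketch and are not taken from the function's source.

```python
import math


def neg_log10_pvalue(pvalue, floor=1e-9):
    """Return -log10(P-value), clipping P-values below `floor` (assumed value)."""
    return -math.log10(max(pvalue, floor))


sizes = [neg_log10_pvalue(p) for p in (1.0, 0.05, 0.01, 0.0)]
```

Smaller P-values therefore give larger dots, and a P-value of 1 gives a dot of size zero.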

Parameters

sc_interactions : cell2cell.analysis.cell2cell_pipelines.SingleCellInteractions Interaction class with all necessary methods to run the cell2cell pipeline on a single-cell RNA-seq dataset. The method permute_cell_labels() must be run before generating this plot.

evaluation : str, default='communication' P-values of CCI or communication scores used for this plot. Options are:

- 'interactions' : For CCI scores
- 'communication' : For communication scores

significance : float, default=0.05 The significance threshold to be plotted. LR pairs or cell-cell pairs with at least one P-value below this threshold will be considered.

senders : list, default=None Optional filter to plot specific sender cells.

receivers : list, default=None Optional filter to plot specific receiver cells.
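For evaluation='communication', the P-value columns are named 'sender;receiver', and the senders and receivers filters select among them. The sketch below mirrors the selection logic in the source of this function, including its use of a substring test (so a filter of 'T' would also match 'T cells'); `select_pair_columns` is a hypothetical wrapper name.

```python
def select_pair_columns(columns, senders=None, receivers=None):
    """Select 'sender;receiver' column names according to the filters."""
    if senders is not None and receivers is not None:
        # All requested combinations, even those absent from `columns`
        return [s + ';' + r for r in receivers for s in senders]
    if senders is not None:
        return [c for s in senders for c in columns if s in c.split(';')[0]]
    if receivers is not None:
        return [c for r in receivers for c in columns if r in c.split(';')[1]]
    return list(columns)


columns = ['T cells;B cells', 'B cells;T cells', 'NK;T cells']
by_sender = select_pair_columns(columns, senders=['T cells'])
by_receiver = select_pair_columns(columns, receivers=['T cells'])
both = select_pair_columns(columns, senders=['NK'], receivers=['B cells'])
```

When both filters are given, a requested pair that is missing from the data (such as 'NK;B cells' here) is still returned; the function then fills its P-values with 1.0 when reindexing.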

figsize : tuple, default=(16, 9) Size of the figure (width*height), each in inches.

tick_size : int, default=8 Specifies the size of ticklabels as well as the maximum size of the dots.

cmap : str, default='PuOr' A matplotlib color palette name.

filename : str, default=None Path to save the figure. If None, the figure is not saved.

Returns

fig : matplotlib.figure.Figure Figure object made with matplotlib

Source code in cell2cell/plotting/pval_plot.py
def dot_plot(sc_interactions, evaluation='communication', significance=0.05, senders=None, receivers=None,
             figsize=(16, 9), tick_size=8, cmap='PuOr', filename=None):
    '''Generates a dot plot for the CCI or communication scores given their
    P-values. Dot sizes are given by -log10(P-value) and colors
    by the value of the CCI or communication score.

    Parameters
    ----------
    sc_interactions : cell2cell.analysis.cell2cell_pipelines.SingleCellInteractions
        Interaction class with all necessary methods to run the cell2cell
        pipeline on a single-cell RNA-seq dataset. The method
        permute_cell_labels() must be run before generating this plot.

    evaluation : str, default='communication'
        P-values of CCI or communication scores used for this plot.
        - 'interactions' : For CCI scores
        - 'communication' : For communication scores

    significance : float, default=0.05
        The significance threshold to be plotted. LR pairs or cell-cell
        pairs with at least one P-value below this threshold will be
        considered.

    senders : list, default=None
        Optional filter to plot specific sender cells.

    receivers : list, default=None
        Optional filter to plot specific receiver cells.

    figsize : tuple, default=(16, 9)
        Size of the figure (width*height), each in inches.

    tick_size : int, default=8
        Specifies the size of ticklabels as well as the maximum size
        of the dots.

    cmap : str, default='PuOr'
        A matplotlib color palette name.

    filename : str, default=None
        Path to save the figure. If None, the figure is not saved.

    Returns
    -------
    fig : matplotlib.figure.Figure
        Figure object made with matplotlib
    '''
    if evaluation == 'communication':
        if not (hasattr(sc_interactions, 'ccc_permutation_pvalues')):
            raise ValueError(
                'Run the method permute_cell_labels() with evaluation="communication" before plotting communication P-values.')
        else:
            pvals = sc_interactions.ccc_permutation_pvalues
            scores = sc_interactions.interaction_space.interaction_elements['communication_matrix']
    elif evaluation == 'interactions':
        if not (hasattr(sc_interactions, 'cci_permutation_pvalues')):
            raise ValueError(
                'Run the method permute_cell_labels() with evaluation="interactions" before plotting interaction P-values.')
        else:
            pvals = sc_interactions.cci_permutation_pvalues
            scores = sc_interactions.interaction_space.interaction_elements['cci_matrix']
    else:
        raise ValueError('evaluation has to be either "communication" or "interactions"')

    pval_df = pvals.copy()
    score_df = scores.copy()

    # Filter cells
    if evaluation == 'communication':
        if (senders is not None) and (receivers is not None):
            new_cols = [s + ';' + r for r in receivers for s in senders]
        elif senders is not None:
            new_cols = [c for s in senders for c in pval_df.columns if (s in c.split(';')[0])]
        elif receivers is not None:
            new_cols = [c for r in receivers for c in pval_df.columns if (r in c.split(';')[1])]
        else:
            new_cols = list(pval_df.columns)
        pval_df = pval_df.reindex(new_cols, fill_value=1.0, axis='columns')
        xlabel = 'Sender-Receiver Pairs'
        ylabel = 'Ligand-Receptor Pairs'
        title = 'Communication Score'
    elif evaluation == 'interactions':
        if senders is not None:
            # Missing labels get P-value 1.0 (not significant), as in the branch above
            pval_df = pval_df.reindex(senders, fill_value=1.0, axis='index')
        if receivers is not None:
            pval_df = pval_df.reindex(receivers, fill_value=1.0, axis='columns')
        xlabel = 'Receiver Cells'
        ylabel = 'Sender Cells'
        title = 'CCI Score'

    pval_df.columns = [' --> '.join(str(c).split(';')) for c in pval_df.columns]
    pval_df.index = [' --> '.join(str(r).replace('(', '').replace(')', '').replace("'", "").split(', ')) \
                for r in pval_df.index]


    score_df.columns = [' --> '.join(str(c).split(';')) for c in score_df.columns]
    score_df.index = [' --> '.join(str(r).replace('(', '').replace(')', '').replace("'", "").split(', ')) \
                for r in score_df.index]

    fig = generate_dot_plot(pval_df=pval_df,
                            score_df=score_df,
                            xlabel=xlabel,
                            ylabel=ylabel,
                            cbar_title=title,
                            cmap=cmap,
                            figsize=figsize,
                            significance=significance,
                            label_size=24,
                            title_size=20,
                            tick_size=tick_size,
                            filename=filename
                            )
    return fig
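
The column filtering and relabeling done in the source above for evaluation='communication' can be illustrated without pandas. This is only a minimal sketch, not the library's implementation: filter_pair_columns and pretty_label are hypothetical helpers and the column names are made up. Two differences are assumptions of the sketch: names are compared exactly (the source tests `s in c.split(';')[0]`, which is a substring match on strings), and when both filters are given the sketch keeps only existing columns instead of building every sender-receiver combination.

```python
def filter_pair_columns(columns, senders=None, receivers=None):
    """Keep 'sender;receiver' column names matching the requested cells."""
    kept = []
    for col in columns:
        sender, receiver = col.split(';')
        if senders is not None and sender not in senders:
            continue
        if receivers is not None and receiver not in receivers:
            continue
        kept.append(col)
    return kept


def pretty_label(column):
    """Turn 'sender;receiver' into the 'sender --> receiver' tick label."""
    return ' --> '.join(column.split(';'))


# Hypothetical sender;receiver column names
columns = ['T;B', 'T;NKT', 'NKT;B', 'B;T']
selected = filter_pair_columns(columns, senders=['T'])
labels = [pretty_label(c) for c in selected]
print(labels)
```

With the substring test used in the source, senders=['T'] would also keep 'NKT;B', because 'T' is contained in 'NKT'. Cell-type names that are substrings of one another may therefore need extra care.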

generate_dot_plot(pval_df, score_df, significance=0.05, xlabel='', ylabel='', cbar_title='Score', cmap='PuOr', figsize=(16, 9), label_size=20, title_size=20, tick_size=14, filename=None)

Generates a dot plot for given P-values and respective scores.

Parameters

pval_df : pandas.DataFrame A dataframe containing the P-values, with multiple elements in both rows and columns

score_df : pandas.DataFrame A dataframe containing the scores that were tested. Rows and columns must be the same as in pval_df.

significance : float, default=0.05 The significance threshold to be plotted. LR pairs or cell-cell pairs with at least one P-value below this threshold will be considered.

xlabel : str, default='' Name or label of the X axis.

ylabel : str, default='' Name or label of the Y axis.

cbar_title : str, default='Score' A title for the colorbar associated with the scores in score_df. It is usually the name of the score.

cmap : str, default='PuOr' A matplotlib color palette name.

figsize : tuple, default=(16, 9) Size of the figure (width*height), each in inches.

label_size : int, default=20 Specifies the size of the labels of both X and Y axes.

title_size : int, default=20 Specifies the size of the title of the colorbar and P-val sizes.

tick_size : int, default=14 Specifies the size of ticklabels as well as the maximum size of the dots.

filename : str, default=None Path to save the figure. If None, the figure is not saved.

Returns

fig : matplotlib.figure.Figure Figure object made with matplotlib
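
How a P-value becomes a dot size can be worked through with plain arithmetic. The sketch below follows the formula in this function's source listing (-log10 of the P-value, capped at 3, rescaled to the range 0 to 1, then scaled by tick_size); dot_area is a hypothetical helper written for illustration and is not part of cell2cell.

```python
import math


def dot_area(p_value, tick_size=14, cap=3.0):
    """Marker area for one P-value, following the formula in the source listing."""
    v = -math.log10(p_value + 1e-9)   # small offset avoids log10(0)
    v = min(v, cap)                   # values above the cap share the largest dot
    return ((v / cap) * tick_size * 2) ** 2


for p in (0.05, 0.001, 1e-12):
    print(p, round(dot_area(p), 1))
```

Because of the cap, every P-value at or below roughly 0.001 is drawn with about the same maximum dot size, so the plot separates "barely significant" from "clearly significant" but not the finer gradations beyond that.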

Source code in cell2cell/plotting/pval_plot.py
def generate_dot_plot(pval_df, score_df, significance=0.05, xlabel='', ylabel='', cbar_title='Score', cmap='PuOr',
                      figsize=(16, 9), label_size=20, title_size=20, tick_size=14, filename=None):
    '''Generates a dot plot for given P-values and respective scores.

    Parameters
    ----------
    pval_df : pandas.DataFrame
        A dataframe containing the P-values, with multiple elements
        in both rows and columns

    score_df : pandas.DataFrame
        A dataframe containing the scores that were tested. Rows and
        columns must be the same as in `pval_df`.

    significance : float, default=0.05
        The significance threshold to be plotted. LR pairs or cell-cell
        pairs with at least one P-value below this threshold will be
        considered.

    xlabel : str, default=''
        Name or label of the X axis.

    ylabel : str, default=''
        Name or label of the Y axis.

    cbar_title : str, default='Score'
        A title for the colorbar associated with the scores in
        `score_df`. It is usually the name of the score.

    cmap : str, default='PuOr'
        A matplotlib color palette name.

    figsize : tuple, default=(16, 9)
        Size of the figure (width*height), each in inches.

    label_size : int, default=20
        Specifies the size of the labels of both X and Y axes.

    title_size : int, default=20
        Specifies the size of the title of the colorbar and P-val sizes.

    tick_size : int, default=14
        Specifies the size of ticklabels as well as the maximum size
        of the dots.

    filename : str, default=None
        Path to save the figure. If None, the figure is not saved.

    Returns
    -------
    fig : matplotlib.figure.Figure
        Figure object made with matplotlib
    '''
    # Preprocessing
    df = pval_df.lt(significance).astype(int)
    # Drop all zeros
    df = df.loc[(df != 0).any(axis=1)]
    df = df.T.loc[(df != 0).any(axis=0)].T
    pval_df = pval_df[df.columns].loc[df.index].applymap(lambda x: -1. * np.log10(x + 1e-9))

    # Set dot sizes and color range
    max_abs = np.max([np.abs(np.min(np.min(score_df))), np.abs(np.max(np.max(score_df)))])
    norm = mpl.colors.Normalize(vmin=-1. * max_abs, vmax=max_abs)
    max_size = mpl.colors.Normalize(vmin=0., vmax=3)

    # Colormap (plt.get_cmap also works on matplotlib >= 3.9, where mpl.cm.get_cmap was removed)
    cmap = plt.get_cmap(cmap)

    # Dot plot
    fig, (ax2, ax) = plt.subplots(2, 1, figsize=figsize, gridspec_kw={'height_ratios': [1, 9]})
    for i, idx in enumerate(pval_df.index):
        for j, col in enumerate(pval_df.columns):
            color = np.asarray(cmap(norm(score_df[[col]].loc[[idx]].values.item()))).reshape(1, -1)
            v = pval_df[[col]].loc[[idx]].values.item()
            size = ((max_size(np.min([v, 3])) * tick_size * 2) ** 2)
            ax.scatter(j, i, s=size, c=color)

    # Change tick labels
    xlabels = list(pval_df.columns)
    ylabels = list(pval_df.index)

    ax.set_xticks(ticks=range(0, len(pval_df.columns)))
    ax.set_xticklabels(xlabels,
                       fontsize=tick_size,
                       rotation=90,
                       rotation_mode='anchor',
                       va='center',
                       ha='right')

    ax.set_yticks(ticks=range(0, len(pval_df.index)))
    ax.set_yticklabels(ylabels,
                       fontsize=tick_size,
                       rotation=0, ha='right', va='center'
                       )

    plt.gca().invert_yaxis()

    plt.tick_params(axis='both',
                    which='both',
                    bottom=True,
                    top=False,
                    right=False,
                    left=True,
                    labelleft=True,
                    labelbottom=True)
    ax.set_xlabel(xlabel, fontsize=label_size)
    ax.set_ylabel(ylabel, fontsize=label_size)

    # Colorbar
    # create an axes on the top side of ax. The width of cax will be 3%
    # of ax and the padding between cax and ax will be fixed at 0.21 inch.
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("top", size="3%", pad=0.21)

    cbar = plt.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=cmap),
                        cax=cax,
                        orientation='horizontal'
                        )
    cbar.ax.tick_params(labelsize=tick_size)

    cax.tick_params(axis='x',  # changes apply to the x-axis
                    which='both',  # both major and minor ticks are affected
                    bottom=False,  # ticks along the bottom edge are off
                    top=True,  # ticks along the top edge are on
                    labelbottom=False,  # labels along the bottom edge are off
                    labeltop=True
                    )
    cax.set_title(cbar_title, fontsize=title_size)

    for i, v in enumerate([np.min([np.min(np.min(pval_df)), -1. * np.log10(0.99)]), -1. * np.log10(significance + 1e-9), 3.0]): # old min np.min(np.min(pval_df))
        ax2.scatter(i, 0, s=(max_size(v) * tick_size * 2) ** 2, c='k')
        ax2.scatter(i, 1, s=0, c='k')
        if v == 3.0:
            extra = '>='
        elif i == 1:
            extra = 'Threshold: '
        else:
            extra = ''
        ax2.annotate(extra + str(np.round(abs(v), 4)), (i, 1), fontsize=tick_size, horizontalalignment='center')
    ax2.set_ylim(-0.5, 2)
    ax2.axis('off')
    ax2.set_title('-log10(P-value) sizes', fontsize=title_size)

    if filename is not None:
        plt.savefig(filename, dpi=300, bbox_inches='tight')
    return fig

tensor_plot

generate_plot_df(interaction_tensor)

Generates a melted dataframe with loadings for each element in all dimensions across factors.

Parameters

interaction_tensor : cell2cell.tensor.BaseTensor A communication tensor generated with any of the tensor classes in cell2cell.tensor

Returns

plot_df : pandas.DataFrame A dataframe containing loadings for every element of all dimensions across factors from the decomposition. Rows are loadings of individual elements of each dimension in a given factor, while columns are the following list ['Factor', 'Variable', 'Value', 'Order']
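
The long ("melted") layout described above can be sketched in plain Python. This is an illustration under assumptions, not the library code: melt_loadings is a hypothetical helper, the loadings are made up, and the factor names ('Factor 1', 'Factor 2') are assumed labels. Each output row corresponds to one row of plot_df.

```python
from collections import OrderedDict


def melt_loadings(factors):
    """Flatten {order: {element: [loading per factor]}} into long-format rows."""
    rows = []
    for order, loadings in factors.items():
        for element, values in loadings.items():
            for k, value in enumerate(values, start=1):
                rows.append({'Factor': 'Factor {}'.format(k),
                             'Variable': element,
                             'Value': value,
                             'Order': order})
    return rows


# Toy loadings for a rank-2 decomposition (values are made up)
factors = OrderedDict([
    ('LRs', {'L1^R1': [0.9, 0.1], 'L2^R2': [0.2, 0.8]}),
    ('Sender', {'T': [0.7, 0.3]}),
])
rows = melt_loadings(factors)
print(len(rows), rows[0])
```

One row per (dimension, element, factor) combination is the shape that plotting libraries such as seaborn expect when grouping by the 'Order' and 'Factor' columns.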

Source code in cell2cell/plotting/tensor_plot.py
def generate_plot_df(interaction_tensor):
    '''Generates a melted dataframe with loadings for each element in all dimensions
    across factors

    Parameters
    ----------
    interaction_tensor : cell2cell.tensor.BaseTensor
        A communication tensor generated with any of the tensor classes in
        cell2cell.tensor

    Returns
    -------
    plot_df : pandas.DataFrame
        A dataframe containing loadings for every element of all dimensions across
        factors from the decomposition. Rows are loadings of individual elements of
        each dimension in a given factor, while columns are the following list
        ['Factor', 'Variable', 'Value', 'Order']
    '''
    tensor_dim = len(interaction_tensor.tensor.shape)
    if interaction_tensor.order_labels is None:
        if tensor_dim == 4:
            factor_labels = ['Context', 'LRs', 'Sender', 'Receiver']
        elif tensor_dim > 4:
            factor_labels = ['Context-{}'.format(i + 1) for i in range(tensor_dim - 3)] + ['LRs', 'Sender', 'Receiver']
        elif tensor_dim == 3:
            factor_labels = ['LRs', 'Sender', 'Receiver']
        else:
            raise ValueError('Too few dimensions in the tensor')
    else:
        assert len(interaction_tensor.order_labels) == tensor_dim, "The length of order_labels must match the number of orders/dimensions in the tensor"
        factor_labels = interaction_tensor.order_labels
    plot_df = pd.DataFrame()
    for lab, order_factors in enumerate(interaction_tensor.factors.values()):
        sns_df = order_factors.T
        sns_df.index.name = 'Factors'
        melt_df = pd.melt(sns_df.reset_index(), id_vars=['Factors'], value_vars=sns_df.columns)
        melt_df = melt_df.assign(Order=factor_labels[lab])

        plot_df = pd.concat([plot_df, melt_df])
    plot_df.columns = ['Factor', 'Variable', 'Value', 'Order']

    return plot_df

plot_elbow(loss, elbow=None, figsize=(4, 2.25), ylabel='Normalized Error', fontsize=14, filename=None)

Plots the errors of an elbow analysis with just one run of a tensor factorization for each rank.

Parameters

loss : list List of tuples with (x, y) coordinates for the elbow analysis. X values are the different ranks and Y values are the errors of each decomposition.

elbow : int, default=None X coordinate to color the error as red. Usually used to represent the detected elbow.

figsize : tuple, default=(4, 2.25) Figure size, width by height

ylabel : str, default='Normalized Error' Label for the y-axis

fontsize : int, default=14 Fontsize for axis labels.

filename : str, default=None Path to save the figure of the elbow analysis. If None, the figure is not saved.

Returns

fig : matplotlib.figure.Figure Figure object made with matplotlib
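
The expected shape of loss and the way elbow is used to pick the highlighted point can be shown with a small example. The error values below are made up, and the snippet only mirrors the indexing seen in the source listing (loss[elbow - 1]); it does not call cell2cell.

```python
# Hypothetical errors for ranks 1 to 5 (values are made up)
loss = [(1, 0.90), (2, 0.60), (3, 0.40), (4, 0.37), (5, 0.35)]

ranks, errors = zip(*loss)       # the two sequences that plt.plot(*zip(*loss)) receives
elbow = 3
elbow_point = loss[elbow - 1]    # ranks start at 1, list positions at 0

print(ranks, errors, elbow_point)
```

Since the highlighted point is looked up by position, the red marker only lands on the intended rank when loss starts at rank 1 and lists consecutive ranks. With cell2cell installed, passing this loss and elbow=3 to plot_elbow should mark the point at rank 3.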

Source code in cell2cell/plotting/tensor_plot.py
def plot_elbow(loss, elbow=None, figsize=(4, 2.25), ylabel='Normalized Error', fontsize=14, filename=None):
    '''Plots the errors of an elbow analysis with just one run of a tensor factorization
    for each rank.

    Parameters
    ----------
    loss : list
        List of  tuples with (x, y) coordinates for the elbow analysis. X values are
        the different ranks and Y values are the errors of each decomposition.

    elbow : int, default=None
        X coordinate to color the error as red. Usually used to represent the detected
        elbow.

    figsize : tuple, default=(4, 2.25)
        Figure size, width by height

    ylabel : str, default='Normalized Error'
        Label for the y-axis

    fontsize : int, default=14
        Fontsize for axis labels.

    filename : str, default=None
        Path to save the figure of the elbow analysis. If None, the figure is not
        saved.

    Returns
    -------
    fig : matplotlib.figure.Figure
        Figure object made with matplotlib
    '''

    fig = plt.figure(figsize=figsize)

    plt.plot(*zip(*loss))
    plt.tick_params(axis='both', labelsize=fontsize)
    plt.xlabel('Rank', fontsize=int(1.2*fontsize))
    plt.ylabel(ylabel, fontsize=int(1.2 * fontsize))

    if elbow is not None:
        _ = plt.plot(*loss[elbow - 1], 'ro')

    if filename is not None:
        plt.savefig(filename, dpi=300,
                    bbox_inches='tight')
    return fig

plot_multiple_run_elbow(all_loss, elbow=None, ci='95%', figsize=(4, 2.25), ylabel='Normalized Error', fontsize=14, smooth=False, filename=None)

Plots the errors of an elbow analysis with multiple runs of a tensor factorization for each rank.

Parameters

all_loss : ndarray Array containing the errors associated with multiple runs for a given rank. This array is of shape (runs, upper_rank).

elbow : int, default=None X coordinate to color the error as red. Usually used to represent the detected elbow.

ci : str, default='95%' Confidence interval for representing the multiple runs in each rank. Either 'std' or '95%'.

figsize : tuple, default=(4, 2.25) Figure size, width by height

ylabel : str, default='Normalized Error' Label for the y-axis

fontsize : int, default=14 Fontsize for axis labels.

smooth : boolean, default=False Whether to smooth the curve with a Savitzky-Golay filter.

filename : str, default=None Path to save the figure of the elbow analysis. If None, the figure is not saved.

Returns

fig : matplotlib.figure.Figure Figure object made with matplotlib
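
The shaded band drawn around the mean error can be reproduced with the standard library. The sketch below mirrors the arithmetic in the source listing (mean plus or minus coeff times the standard deviation, with coeff 1.96 for '95%' and 1.0 for 'std'); error_band is a hypothetical helper, the error values are made up, and skipping NaN entries is meant to imitate np.nanmean and np.nanstd.

```python
import math
import statistics


def error_band(all_loss, ci='95%'):
    """Return (lower, mean, upper) per rank from rows of per-run errors."""
    if ci == '95%':
        coeff = 1.96
    elif ci == 'std':
        coeff = 1.0
    else:
        raise ValueError("Specify a correct ci. Either '95%' or 'std'")

    band = []
    for column in zip(*all_loss):                           # one column per rank
        values = [v for v in column if not math.isnan(v)]   # skip NaN entries
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)                     # population std, numpy's default
        band.append((mean - coeff * std, mean, mean + coeff * std))
    return band


# Three runs (rows) evaluated at two ranks (columns); values are made up
all_loss = [[1.0, 0.5],
            [0.8, 0.5],
            [0.9, float('nan')]]
for rank, (low, mean, high) in enumerate(error_band(all_loss, ci='std'), start=1):
    print(rank, round(low, 3), round(mean, 3), round(high, 3))
```

Note that the '95%' option scales the spread of the runs by 1.96; it is not divided by the square root of the number of runs, so it describes the variability of individual runs rather than the uncertainty of the mean.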

Source code in cell2cell/plotting/tensor_plot.py
def plot_multiple_run_elbow(all_loss, elbow=None, ci='95%', figsize=(4, 2.25), ylabel='Normalized Error', fontsize=14,
                            smooth=False, filename=None):
    '''Plots the errors of an elbow analysis with multiple runs of a tensor
    factorization for each rank.

    Parameters
    ----------
    all_loss : ndarray
        Array containing the errors associated with multiple runs for a given rank.
        This array is of shape (runs, upper_rank).

    elbow : int, default=None
        X coordinate to color the error as red. Usually used to represent the detected
        elbow.

    ci : str, default='95%'
        Confidence interval for representing the multiple runs in each rank.
        {'std', '95%'}

    figsize : tuple, default=(4, 2.25)
        Figure size, width by height

    ylabel : str, default='Normalized Error'
        Label for the y-axis

    fontsize : int, default=14
        Fontsize for axis labels.

    smooth : boolean, default=False
        Whether to smooth the curve with a Savitzky-Golay filter.

    filename : str, default=None
        Path to save the figure of the elbow analysis. If None, the figure is not
        saved.

    Returns
    -------
    fig : matplotlib.figure.Figure
        Figure object made with matplotlib
    '''
    fig = plt.figure(figsize=figsize)

    x = list(range(1, all_loss.shape[1]+1))
    mean = np.nanmean(all_loss, axis=0)
    std = np.nanstd(all_loss, axis=0)

    if smooth:
        mean = smooth_curve(mean)

    # Plot Mean
    plt.plot(x, mean, 'ob')

    # Plot CI
    if ci == '95%':
        coeff = 1.96
    elif ci == 'std':
        coeff = 1.0
    else:
        raise ValueError("Specify a correct ci. Either '95%' or 'std'")

    plt.fill_between(x, mean-coeff*std, mean+coeff*std, color='steelblue', alpha=.2,
                     label=r'$\pm$ {} std'.format(coeff))


    plt.tick_params(axis='both', labelsize=fontsize)
    plt.xlabel('Rank', fontsize=int(1.2*fontsize))
    plt.ylabel(ylabel, fontsize=int(1.2 * fontsize))

    if elbow is not None:
        _ = plt.plot(x[elbow - 1], mean[elbow - 1], 'ro')

    if filename is not None:
        plt.savefig(filename, dpi=300,
                    bbox_inches='tight')
    return fig

reorder_dimension_elements(factors, reorder_elements, metadata=None)

Reorders elements in the dataframes including factor loadings.

Parameters

factors : dict Ordered dictionary containing a dataframe with the factor loadings for each dimension/order of the tensor.

reorder_elements : dict Dictionary for reordering elements in each of the tensor dimensions. Keys of this dictionary can be any or all of the keys in interaction_tensor.factors. Values are lists with the names or labels of the elements in a tensor dimension. For example, for the context dimension, all elements included in interaction_tensor.factors['Context'].index must be present.

metadata : list, default=None List of pandas dataframes with metadata information for elements of each dimension in the tensor. A column named by the variable sample_col contains the name of each element in the tensor, while another column named by the variable group_col contains the metadata or grouping information of each element.

Returns

reordered_factors : dict Ordered dictionary containing a dataframe with the factor loadings for each dimension/order of the tensor. This dictionary includes the new orders.

new_metadata : list List of pandas dataframes with metadata information for elements of each dimension in the tensor. A column named by the variable sample_col contains the name of each element in the tensor, while another column named by the variable group_col contains the metadata or grouping information of each element. In this case, elements are sorted according to reorder_elements.
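
The two requirements on reorder_elements (its keys must be dimensions present in factors, and every element of a reordered dimension must be listed) can be sketched on plain lists. reorder_elements_in is a hypothetical helper for illustration only; it works on element names rather than on the pandas dataframes the library uses, and it raises exceptions where the source uses assertions.

```python
def reorder_elements_in(factors, reorder_elements):
    """Reorder element names per dimension after validating the request."""
    unknown = set(reorder_elements) - set(factors)
    if unknown:
        raise KeyError('Unknown dimensions: {}'.format(sorted(unknown)))

    reordered = {}
    for order, elements in factors.items():
        if order not in reorder_elements:
            reordered[order] = list(elements)    # untouched dimension
            continue
        new_order = reorder_elements[order]
        missing = set(elements) - set(new_order)
        if missing:
            raise ValueError('Missing elements in {}: {}'.format(order, sorted(missing)))
        # dict.fromkeys drops repeated names while keeping the first occurrence
        reordered[order] = list(dict.fromkeys(new_order))
    return reordered


# Hypothetical element names for two tensor dimensions
factors = {'Context': ['C1', 'C2', 'C3'], 'Sender': ['T', 'B']}
result = reorder_elements_in(factors, {'Context': ['C3', 'C1', 'C2', 'C1']})
print(result)
```

Dimensions that are not mentioned keep their original order, and a name repeated in the requested order is kept only at its first position, which matches the duplicate handling in the source listing.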

Source code in cell2cell/plotting/tensor_plot.py
def reorder_dimension_elements(factors, reorder_elements, metadata=None):
    '''Reorders elements in the dataframes including factor loadings.

    Parameters
    ----------
    factors : dict
        Ordered dictionary containing a dataframe with the factor loadings for each
        dimension/order of the tensor.

    reorder_elements : dict
        Dictionary for reordering elements in each of the tensor dimensions.
        Keys of this dictionary can be any or all of the keys in
        interaction_tensor.factors. Values are lists with the names or labels of the
        elements in a tensor dimension. For example, for the context dimension,
        all elements included in interaction_tensor.factors['Context'].index must
        be present.

    metadata : list, default=None
        List of pandas dataframes with metadata information for elements of each
        dimension in the tensor. A column named by the variable `sample_col` contains
        the name of each element in the tensor, while another column named by the
        variable `group_col` contains the metadata or grouping information of each
        element.

    Returns
    -------
    reordered_factors : dict
        Ordered dictionary containing a dataframe with the factor loadings for each
        dimension/order of the tensor. This dictionary includes the new orders.

    new_metadata : list
        List of pandas dataframes with metadata information for elements of each
        dimension in the tensor. A column named by the variable `sample_col` contains
        the name of each element in the tensor, while another column named by the
        variable `group_col` contains the metadata or grouping information of each
        element. In this case, elements are sorted according to reorder_elements.

    '''
    assert all(k in factors.keys() for k in reorder_elements.keys()), "Keys in 'reorder_elements' must be only keys in 'factors'"
    assert all((len(set(factors[key].index).difference(set(reorder_elements[key]))) == 0) for key in reorder_elements.keys()), "All elements of each dimension included should be present"

    reordered_factors = factors.copy()
    # metadata is optional, so only copy it when it was provided
    new_metadata = metadata.copy() if metadata is not None else None

    i = 0
    for k, df in reordered_factors.items():
        if k in reorder_elements.keys():
            df = df.loc[reorder_elements[k]]
            reordered_factors[k] = df[~df.index.duplicated(keep='first')]
            if new_metadata is not None:
                meta = new_metadata[i]
                meta['Element'] = pd.Categorical(meta['Element'], ordered=True, categories=list(reordered_factors[k].index))
                new_metadata[i] = meta.sort_values(by='Element').reset_index(drop=True)
        else:
            reordered_factors[k] = df
        i += 1
    return reordered_factors, new_metadata

tensor_factors_plot(interaction_tensor, order_labels=None, reorder_elements=None, metadata=None, sample_col='Element', group_col='Category', meta_cmaps=None, fontsize=20, plot_legend=True, filename=None)

Plots the loadings for each element in each dimension of the tensor, generated by a tensor factorization.

Parameters

interaction_tensor : cell2cell.tensor.BaseTensor A communication tensor generated with any of the tensor classes in cell2cell.tensor.

order_labels : list, default=None List with the labels of each dimension to use in the plot. If None, the default names given when factorizing the tensor will be used.

reorder_elements : dict, default=None Dictionary for reordering elements in each of the tensor dimensions. Keys of this dictionary can be any or all of the keys in interaction_tensor.factors. Values are lists with the names or labels of the elements in a tensor dimension. For example, for the context dimension, all elements included in interaction_tensor.factors['Context'].index must be present.

metadata : list, default=None List of pandas dataframes with metadata information for elements of each dimension in the tensor. A column named by the variable sample_col contains the name of each element in the tensor, while another column named by the variable group_col contains the metadata or grouping information of each element.

sample_col : str, default='Element' Name of the column containing the element names in the metadata.

group_col : str, default='Category' Name of the column containing the metadata or grouping information for each element in the metadata.

meta_cmaps : list, default=None A list of colormaps used for coloring elements in each dimension. The length of this list is equal to the number of dimensions of the tensor. If None, all dimensions will be colored with the colormap 'gist_rainbow'.

fontsize : int, default=20 Font size of the tick labels. Axis labels will be 1.2 times the fontsize.

plot_legend : boolean, default=True Whether to plot the legends for the coloring of each element in their respective dimensions.

filename : str, default=None Path to save the figure. If None, the figure is not saved.

Returns

fig : matplotlib.figure.Figure Figure object made with matplotlib

axes : matplotlib.axes.Axes or array of Axes List of Axes for each subplot in the figure.
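
The layout expected for metadata (one table per tensor dimension, with an element column and a grouping column) and the documented fallback for meta_cmaps can be sketched as follows. Both helpers are hypothetical and the sample, ligand-receptor and cell names are made up; raising an error on a length mismatch is an assumption of the sketch, not documented library behavior.

```python
def build_metadata(groups_per_dimension, sample_col='Element', group_col='Category'):
    """One table (list of row dicts) per tensor dimension, in dimension order."""
    return [[{sample_col: element, group_col: group} for element, group in groups.items()]
            for groups in groups_per_dimension]


def resolve_cmaps(meta_cmaps, n_dims, default='gist_rainbow'):
    """Use the documented default colormap when none is given."""
    if meta_cmaps is None:
        return [default] * n_dims
    if len(meta_cmaps) != n_dims:
        # Assumption of this sketch: reject lists that do not cover every dimension
        raise ValueError('Provide one colormap per tensor dimension')
    return list(meta_cmaps)


groups = [
    {'Sample1': 'Control', 'Sample2': 'Treated'},   # contexts
    {'L1^R1': 'Cytokine'},                          # ligand-receptor pairs
    {'T': 'Lymphoid', 'Mono': 'Myeloid'},           # sender cells
    {'T': 'Lymphoid', 'Mono': 'Myeloid'},           # receiver cells
]
metadata = build_metadata(groups)
cmaps = resolve_cmaps(None, n_dims=len(metadata))
print(metadata[0], cmaps)
```

Each inner list of row dictionaries can be turned into a dataframe with pandas.DataFrame(rows) before being passed as one entry of metadata.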

Source code in cell2cell/plotting/tensor_plot.py
def tensor_factors_plot(interaction_tensor, order_labels=None, reorder_elements=None, metadata=None,
                        sample_col='Element', group_col='Category', meta_cmaps=None, fontsize=20, plot_legend=True,
                        filename=None):
    '''Plots the loadings for each element in each dimension of the tensor, generated by
    a tensor factorization.

    Parameters
    ----------
    interaction_tensor : cell2cell.tensor.BaseTensor
        A communication tensor generated with any of the tensor classes in
        cell2cell.tensor.

    order_labels : list, default=None
        List with the labels of each dimension to use in the plot. If None, the
        default names given when factorizing the tensor will be used.

    reorder_elements : dict, default=None
        Dictionary for reordering elements in each of the tensor dimensions.
        Keys of this dictionary can be any or all of the keys in
        interaction_tensor.factors. Values are lists with the names or labels of the
        elements in a tensor dimension. For example, for the context dimension,
        all elements included in interaction_tensor.factors['Context'].index must
        be present.

    metadata : list, default=None
        List of pandas dataframes with metadata information for elements of each
        dimension in the tensor. A column named by the variable `sample_col` contains
        the name of each element in the tensor, while another column named by the
        variable `group_col` contains the metadata or grouping information of each
        element.

    sample_col : str, default='Element'
        Name of the column containing the element names in the metadata.

    group_col : str, default='Category'
        Name of the column containing the metadata or grouping information for each
        element in the metadata.

    meta_cmaps : list, default=None
        A list of colormaps used for coloring elements in each dimension. The length
        of this list is equal to the number of dimensions of the tensor. If None, all
        dimensions will be colored with the colormap 'gist_rainbow'.

    fontsize : int, default=20
        Font size of the tick labels. Axis labels will be 1.2 times the fontsize.

    plot_legend : boolean, default=True
        Whether to plot the legends for the coloring of each element in their
        respective dimensions.

    filename : str, default=None
        Path to save the figure. If None, the figure is not saved.

    Returns
    -------
    fig : matplotlib.figure.Figure
        Figure object made with matplotlib

    axes : matplotlib.axes.Axes or array of Axes
        List of Axes for each subplot in the figure.
    '''
    # Prepare inputs for matplotlib
    assert interaction_tensor.factors is not None, "First run the method 'compute_tensor_factorization' in your InteractionTensor"
    dim = len(interaction_tensor.factors)

    if order_labels is not None:
        assert dim == len(order_labels), "The length of order_labels must match the order of the tensor (order {})".format(dim)
    else:
        order_labels = list(interaction_tensor.factors.keys())

    rank = interaction_tensor.rank
    fig, axes = tensor_factors_plot_from_loadings(factors=interaction_tensor.factors,
                                                  rank=rank,
                                                  order_labels=order_labels,
                                                  reorder_elements=reorder_elements,
                                                  metadata=metadata,
                                                  sample_col=sample_col,
                                                  group_col=group_col,
                                                  meta_cmaps=meta_cmaps,
                                                  fontsize=fontsize,
                                                  plot_legend=plot_legend,
                                                  filename=filename)
    return fig, axes

tensor_factors_plot_from_loadings(factors, rank=None, order_labels=None, reorder_elements=None, metadata=None, sample_col='Element', group_col='Category', meta_cmaps=None, fontsize=20, plot_legend=True, filename=None)

Plots the loadings for each element in each dimension of the tensor, generated by a tensor factorization.

Parameters

factors : collections.OrderedDict An ordered dictionary wherein keys are the names of each tensor dimension, and values are the loadings in a pandas.DataFrame. In this dataframe, rows are the elements of the respective dimension and columns are the factors from the tensor factorization. Values are the corresponding loadings.

rank : int, default=None Number of factors generated from the decomposition

order_labels : list, default=None List with the labels of each dimension to use in the plot. If None, the default names given when factorizing the tensor will be used.

reorder_elements : dict, default=None Dictionary for reordering elements in each of the tensor dimensions. Keys of this dictionary can be any or all of the keys in interaction_tensor.factors. Values are lists with the names or labels of the elements in a tensor dimension. For example, for the context dimension, all elements included in interaction_tensor.factors['Context'].index must be present.

metadata : list, default=None List of pandas dataframes with metadata information for elements of each dimension in the tensor. A column named by the variable sample_col contains the name of each element in the tensor, while another column named by the variable group_col contains the metadata or grouping information of each element.

sample_col : str, default='Element' Name of the column containing the element names in the metadata.

group_col : str, default='Category' Name of the column containing the metadata or grouping information for each element in the metadata.

meta_cmaps : list, default=None A list of colormaps used for coloring elements in each dimension. The length of this list is equal to the number of dimensions of the tensor. If None, all dimensions will be colored with the colormap 'gist_rainbow'.

fontsize : int, default=20 Font size of the tick labels. Axis labels will be 1.2 times the fontsize.

plot_legend : boolean, default=True Whether to plot the legends for the coloring of each element in their respective dimensions.

filename : str, default=None Path to save the figure. If None, the figure is not saved.

Returns

fig : matplotlib.figure.Figure Figure object made with matplotlib

axes : matplotlib.axes.Axes or array of Axes List of Axes for each subplot in the figure.

Source code in cell2cell/plotting/tensor_plot.py
def tensor_factors_plot_from_loadings(factors, rank=None, order_labels=None, reorder_elements=None, metadata=None,
                                      sample_col='Element', group_col='Category', meta_cmaps=None, fontsize=20, plot_legend=True,
                                      filename=None):
    '''Plots the loadings for each element in each dimension of the tensor, generated by
    a tensor factorization.

    Parameters
    ----------
    factors : collections.OrderedDict
        An ordered dictionary wherein keys are the names of each
        tensor dimension, and values are the loadings in a pandas.DataFrame.
        In this dataframe, rows are the elements of the respective dimension
        and columns are the factors from the tensor factorization. Values
        are the corresponding loadings.

    rank : int, default=None
        Number of factors generated from the decomposition

    order_labels : list, default=None
        List with the labels of each dimension to use in the plot. If None, the
        default names given when factorizing the tensor will be used.

    reorder_elements : dict, default=None
        Dictionary for reordering elements in each of the tensor dimensions.
        Keys of this dictionary can be any or all of the keys in
        interaction_tensor.factors. Values are lists with the names or labels of the
        elements in a tensor dimension. For example, for the context dimension,
        all elements included in interaction_tensor.factors['Context'].index must
        be present.

    metadata : list, default=None
        List of pandas dataframes with metadata information for elements of each
        dimension in the tensor. A column named by the variable `sample_col` contains
        the name of each element in the tensor, while another column named by the
        variable `group_col` contains the metadata or grouping information of each
        element.

    sample_col : str, default='Element'
        Name of the column containing the element names in the metadata.

    group_col : str, default='Category'
        Name of the column containing the metadata or grouping information for each
        element in the metadata.

    meta_cmaps : list, default=None
        A list of colormaps used for coloring elements in each dimension. The length
        of this list is equal to the number of dimensions of the tensor. If None, all
        dimensions will be colored with the colormap 'gist_rainbow'.

    fontsize : int, default=20
        Font size of the tick labels. Axis labels will be 1.2 times the fontsize.

    plot_legend : boolean, default=True
        Whether to plot the legends for the coloring of each element in their
        respective dimensions.

    filename : str, default=None
        Path to save the figure. If None, the figure is not saved.

    Returns
    -------
    fig : matplotlib.figure.Figure
        Figure object made with matplotlib

    axes : matplotlib.axes.Axes or array of Axes
        List of Axes for each subplot in the figure.
    '''
    # Prepare inputs for matplotlib
    if rank is not None:
        assert list(factors.values())[0].shape[1] == rank, "Rank must match the number of columns in dataframes in `factors`"
    else:
        rank = list(factors.values())[0].shape[1]

    dim = len(factors)

    if order_labels is not None:
        assert dim == len(order_labels), "The length of order_labels must match the order of the tensor (order {})".format(dim)
    else:
        order_labels = list(factors.keys())

    if metadata is not None:
        meta_og = metadata.copy()
    if reorder_elements is not None:
        factors, metadata = reorder_dimension_elements(factors=factors,
                                                       reorder_elements=reorder_elements,
                                                       metadata=metadata)

    if metadata is None:
        metadata = [None] * dim
        meta_colors = [None] * dim
        element_colors = [None] * dim
    else:
        if meta_cmaps is None:
            meta_cmaps = ['gist_rainbow']*len(metadata)
        assert len(metadata) == len(meta_cmaps), "Provide a cmap for each order"
        assert len(metadata) == len(factors), "Provide a metadata for each order. If there is no metadata for any, replace with None"
        meta_colors = [get_colors_from_labels(m[group_col], cmap=cmap) if ((m is not None) & (cmap is not None)) else None for m, cmap in zip(meta_og, meta_cmaps)]
        element_colors = [map_colors_to_metadata(metadata=m,
                                                 colors=mc,
                                                 sample_col=sample_col,
                                                 group_col=group_col,
                                                 cmap=cmap).to_dict() if ((m is not None) & (cmap is not None)) else None for m, cmap, mc in zip(metadata, meta_cmaps, meta_colors)]

    # Make the plot
    fig, axes = plt.subplots(nrows=rank,
                             ncols=dim,
                             figsize=(10, int(rank * 1.2 + 1)),
                             sharex='col',
                             #sharey='col'
                             )

    axes = axes.reshape((rank, dim))

    # Factor by factor
    if rank > 1:
        # Iterates horizontally (dimension by dimension)
        for ind, (order_factors, axs) in enumerate(zip(factors.values(), axes.T)):
            if isinstance(order_factors, pd.Series):
                order_factors = order_factors.to_frame().T
            # Iterates vertically (factor by factor)
            for i, (df_row, ax) in enumerate(zip(order_factors.T.iterrows(), axs)):
                factor_name = df_row[0]
                factor = df_row[1]
                sns.despine(top=True, ax=ax)
                if (metadata[ind] is not None) & (meta_colors[ind] is not None):
                    plot_colors = [element_colors[ind][idx] for idx in order_factors.index]
                    ax.bar(range(len(factor)), factor.values.tolist(), color=plot_colors)
                else:
                    ax.bar(range(len(factor)), factor.values.tolist())
                axes[i, 0].set_ylabel(factor_name, fontsize=int(1.2*fontsize))
                if i < len(axs):
                    ax.tick_params(axis='x', which='both', length=0)
                    ax.tick_params(axis='both', labelsize=fontsize)
                    plt.setp(ax.get_xticklabels(), visible=False)
            axs[-1].set_xlabel(order_labels[ind], fontsize=int(1.2*fontsize), labelpad=fontsize)
    else:
        for ind, order_factors in enumerate(factors.values()):
            if isinstance(order_factors, pd.Series):
                order_factors = order_factors.to_frame().T
            ax = axes[ind]
            ax.set_xlabel(order_labels[ind], fontsize=int(1.2*fontsize), labelpad=fontsize)
            for i, df_row in enumerate(order_factors.T.iterrows()):
                factor_name = df_row[0]
                factor = df_row[1]
                sns.despine(top=True, ax=ax)
                if (metadata[ind] is not None) & (meta_colors[ind] is not None):
                    plot_colors = [element_colors[ind][idx] for idx in order_factors.index]
                    ax.bar(range(len(factor)), factor.values.tolist(), color=plot_colors)
                else:
                    ax.bar(range(len(factor)), factor.values.tolist())
                ax.set_ylabel(factor_name, fontsize=int(1.2*fontsize))

    fig.align_ylabels(axes[:,0])
    plt.tight_layout()

    # Include legends of coloring the elements in each dimension.
    if plot_legend:
        # Set current axis:
        ax = axes[0, -1]
        plt.sca(ax)

        # Legends
        fig.canvas.draw()
        renderer = fig.canvas.get_renderer()
        bbox_cords =  (1.05, 1.2)

        N=len(order_labels) - 1
        for ind, order in enumerate(order_labels):
            if (metadata[ind] is not None) & (meta_colors[ind] is not None):
                lgd = generate_legend(color_dict=meta_colors[ind],
                                      bbox_to_anchor=bbox_cords,
                                      loc='upper left',
                                      title=order_labels[ind],
                                      fontsize=fontsize,
                                      sorted_labels=False,
                                      ax=ax
                                      )
                cords = lgd.get_window_extent(renderer).transformed(ax.transAxes.inverted())
                xrange = abs(cords.p0[0] - cords.p1[0])
                bbox_cords = (bbox_cords[0] + xrange + 0.05, bbox_cords[1])
                if ind != N:
                    ax.add_artist(lgd)

    if filename is not None:
        plt.savefig(filename, dpi=300,
                    bbox_inches='tight')
    return fig, axes
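
The expected shape of the inputs can be sketched as follows. This is a minimal illustration assuming a rank-2 factorization with two dimensions; the dimension names, element names, and loading values are hypothetical, and only pandas is used, so no plot is drawn.

```python
from collections import OrderedDict

import pandas as pd

factor_names = ["Factor 1", "Factor 2"]

# One DataFrame of loadings per tensor dimension:
# rows are elements, columns are factors (hypothetical values).
factors = OrderedDict([
    ("Contexts", pd.DataFrame([[0.9, 0.1], [0.2, 0.8]],
                              index=["C1", "C2"], columns=factor_names)),
    ("Sender Cells", pd.DataFrame([[0.7, 0.3], [0.1, 0.6], [0.4, 0.4]],
                                  index=["T cell", "B cell", "NK cell"],
                                  columns=factor_names)),
])

# One metadata entry per dimension; None where a dimension has no metadata.
# Column names match the defaults sample_col='Element' and group_col='Category'.
metadata = [
    pd.DataFrame({"Element": ["C1", "C2"], "Category": ["Control", "Treated"]}),
    None,
]

# The same quantities the function derives before plotting.
rank = list(factors.values())[0].shape[1]  # number of factors (subplot rows)
dim = len(factors)                         # number of dimensions (subplot columns)
assert len(metadata) == dim
```

With cell2cell installed, objects built this way should be usable as the `factors` and `metadata` arguments of tensor_factors_plot_from_loadings.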

umap_plot

umap_biplot(umap_df, figsize=(8, 8), ax=None, show_axes=True, show_legend=True, hue=None, cmap='tab10', fontsize=20, filename=None)

Plots a UMAP biplot for the UMAP embeddings.

Parameters

umap_df : pandas.DataFrame Dataframe containing the UMAP embeddings for the axis analyzed. It must contain columns 'umap1' and 'umap2'. If a hue column is provided in the parameter 'hue', that column must be provided in this dataframe.

figsize : tuple, default=(8, 8) Size of the figure (width*height), each in inches.

ax : matplotlib.axes.Axes, default=None The matplotlib axes containing a plot.

show_axes : boolean, default=True Whether to show lines, ticks and ticklabels of both axes.

show_legend : boolean, default=True Whether to include the legend when a hue is provided.

hue : vector or key in 'umap_df' Grouping variable that will produce points with different colors. Can be either categorical or numeric, although color mapping will behave differently in the latter case.

cmap : str, default='tab10' Name of the color palette for coloring elements with UMAP embeddings.

fontsize : int, default=20 Fontsize of the axis labels (UMAP1 and UMAP2).

filename : str, default=None Path to save the figure. If None, the figure is not saved.

Returns

fig : matplotlib.figure.Figure A matplotlib Figure instance.

ax : matplotlib.axes.Axes The matplotlib axes containing the plot.

Source code in cell2cell/plotting/umap_plot.py
def umap_biplot(umap_df, figsize=(8 ,8), ax=None, show_axes=True, show_legend=True, hue=None,
                cmap='tab10', fontsize=20, filename=None):
    '''Plots a UMAP biplot for the UMAP embeddings.

    Parameters
    ----------
    umap_df : pandas.DataFrame
        Dataframe containing the UMAP embeddings for the axis analyzed.
        It must contain columns 'umap1' and 'umap2'. If a hue column is
        provided in the parameter 'hue', that column must be provided
        in this dataframe.

    figsize : tuple, default=(8, 8)
        Size of the figure (width*height), each in inches.

    ax : matplotlib.axes.Axes, default=None
        The matplotlib axes containing a plot.

    show_axes : boolean, default=True
        Whether to show lines, ticks and ticklabels of both axes.

    show_legend : boolean, default=True
        Whether to include the legend when a hue is provided.

    hue : vector or key in 'umap_df'
        Grouping variable that will produce points with different colors.
        Can be either categorical or numeric, although color mapping will
        behave differently in the latter case.

    cmap : str, default='tab10'
        Name of the color palette for coloring elements with UMAP embeddings.

    fontsize : int, default=20
        Fontsize of the axis labels (UMAP1 and UMAP2).

    filename : str, default=None
        Path to save the figure. If None, the figure is not saved.

    Returns
    -------
        fig : matplotlib.figure.Figure
            A matplotlib Figure instance.

        ax : matplotlib.axes.Axes
            The matplotlib axes containing the plot.
    '''

    if ax is None:
        fig = plt.figure(figsize=figsize)

    ax = sns.scatterplot(x='umap1',
                         y='umap2',
                         data=umap_df,
                         hue=hue,
                         palette=cmap,
                         ax=ax
                         )

    if show_axes:
        sns.despine(ax=ax,
                    offset=15
                    )

        ax.tick_params(axis='both',
                       which='both',
                       colors='black',
                       width=2,
                       length=5
                       )
    else:
        ax.set_xticks([])
        ax.set_yticks([])
        for key, spine in ax.spines.items():
            spine.set_visible(False)


    for tick in ax.get_xticklabels():
        tick.set_fontproperties('arial')
        tick.set_weight("bold")
        tick.set_color("black")
        tick.set_fontsize(int(0.7*fontsize))
    for tick in ax.get_yticklabels():
        tick.set_fontproperties('arial')
        tick.set_weight("bold")
        tick.set_color("black")
        tick.set_fontsize(int(0.7*fontsize))

    ax.set_xlabel('UMAP 1', fontsize=fontsize)
    ax.set_ylabel('UMAP 2', fontsize=fontsize)

    if (show_legend) & (hue is not None):
        # Put the legend out of the figure
        legend = ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
        legend.set_title(hue)
        legend.get_title().set_fontsize(int(0.7*fontsize))

        for text in legend.get_texts():
            text.set_fontsize(int(0.7*fontsize))

    if filename is not None:
        plt.savefig(filename, dpi=300, bbox_inches='tight')

    if ax is None:
        return fig, ax
    else:
        return ax

preprocessing

cutoffs

get_constant_cutoff(rnaseq_data, constant_cutoff=10)

Generates a cutoff/threshold dataframe for all genes in rnaseq_data assigning a constant value as the cutoff.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for a bulk RNA-seq experiment or a single-cell experiment after aggregation into cell types. Columns are cell-types/tissues/samples and rows are genes.

constant_cutoff : float, default=10 Cutoff or threshold assigned to each gene.

Returns

cutoffs : pandas.DataFrame A dataframe containing the value corresponding to cutoff or threshold assigned to each gene. Rows are genes and the column corresponds to 'value'. All values are the same and correspond to the constant_cutoff.

Source code in cell2cell/preprocessing/cutoffs.py
def get_constant_cutoff(rnaseq_data, constant_cutoff=10):
    '''
    Generates a cutoff/threshold dataframe for all genes
    in rnaseq_data assigning a constant value as the cutoff.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for a bulk RNA-seq experiment or a single-cell
        experiment after aggregation into cell types. Columns are
        cell-types/tissues/samples and rows are genes.

    constant_cutoff : float, default=10
        Cutoff or threshold assigned to each gene.

    Returns
    -------
    cutoffs : pandas.DataFrame
        A dataframe containing the value corresponding to cutoff or threshold
        assigned to each gene. Rows are genes and the column corresponds to
        'value'. All values are the same and correspond to the
        constant_cutoff.
    '''
    cutoffs = pd.DataFrame(index=rnaseq_data.index)
    cutoffs['value'] = constant_cutoff
    return cutoffs
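
A small sketch of what the constant cutoff looks like on toy data. The expression values and the helper name `constant_cutoff_table` are hypothetical; the helper repeats the logic of the source above so the example runs without cell2cell. The final binarization step is an assumed downstream use, not part of this function.

```python
import pandas as pd

# Toy expression matrix (hypothetical values): rows are genes, columns are cell types.
rnaseq_data = pd.DataFrame(
    {"T_cell": [0.0, 12.0, 50.0], "B_cell": [3.0, 9.0, 80.0]},
    index=["GeneA", "GeneB", "GeneC"],
)

def constant_cutoff_table(rnaseq_data, constant_cutoff=10):
    # Same logic as get_constant_cutoff: a single 'value' column
    # holding the same number for every gene.
    cutoffs = pd.DataFrame(index=rnaseq_data.index)
    cutoffs["value"] = constant_cutoff
    return cutoffs

cutoffs = constant_cutoff_table(rnaseq_data, constant_cutoff=10)

# Assumed downstream use: flag a gene as expressed in a cell type
# when its expression is greater than its cutoff.
binary = rnaseq_data.gt(cutoffs["value"], axis=0).astype(int)
```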

get_cutoffs(rnaseq_data, parameters, verbose=True)

This function creates cutoff/threshold values for genes in rnaseq_data and the respective cells/tissues/samples by a given method or parameter.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for a bulk RNA-seq experiment or a single-cell experiment after aggregation into cell types. Columns are cell-types/tissues/samples and rows are genes.

parameters : dict This dictionary must contain a 'parameter' key and a 'type' key. The first one is the respective parameter to compute the threshold or cutoff values. The type corresponds to the approach to compute the values according to the parameter employed. Options of 'type' that can be used:

- 'local_percentile' : computes the value of a given percentile,
                       for each gene independently. In this case,
                       the parameter corresponds to the percentile
                       to compute, as a float value between 0 and 1.
- 'global_percentile' : computes the value of a given percentile
                        from all genes and samples simultaneously.
                        In this case, the parameter corresponds to
                        the percentile to compute, as a float value
                        between 0 and 1. All genes have the same cutoff.
- 'file' : load a cutoff table from a file. Parameter in this case is
           the path of that file. It must contain the same genes as
           index and same samples as columns.
- 'multi_col_matrix' : a dataframe must be provided, containing a
                       cutoff for each gene in each sample. This allows
                       using specific cutoffs for each sample. The
                       columns here must be the same as the ones in the
                       rnaseq_data.
- 'single_col_matrix' : a dataframe must be provided, containing a
                        cutoff for each gene in only one column. These
                        cutoffs will be applied to all samples.
- 'constant_value' : binarizes the expression. Evaluates whether
                     expression is greater than the value input in
                     the 'parameter'.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

cutoffs : pandas.DataFrame Dataframe wherein rows are genes in rnaseq_data. Depending on the type in the parameters dictionary, it may have only one column ('value') or the same columns that rnaseq_data has, generating specific cutoffs for each cell/tissue/sample.

Source code in cell2cell/preprocessing/cutoffs.py
def get_cutoffs(rnaseq_data, parameters, verbose=True):
    '''
    This function creates cutoff/threshold values for genes
    in rnaseq_data and the respective cells/tissues/samples
    by a given method or parameter.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for a bulk RNA-seq experiment or a single-cell
        experiment after aggregation into cell types. Columns are
        cell-types/tissues/samples and rows are genes.

    parameters : dict
        This dictionary must contain a 'parameter' key and a 'type' key.
        The first one is the respective parameter to compute the threshold
        or cutoff values. The type corresponds to the approach to
        compute the values according to the parameter employed.
        Options of 'type' that can be used:

        - 'local_percentile' : computes the value of a given percentile,
                               for each gene independently. In this case,
                               the parameter corresponds to the percentile
                               to compute, as a float value between 0 and 1.
        - 'global_percentile' : computes the value of a given percentile
                                from all genes and samples simultaneously.
                                In this case, the parameter corresponds to
                                the percentile to compute, as a float value
                                between 0 and 1. All genes have the same cutoff.
        - 'file' : load a cutoff table from a file. Parameter in this case is
                   the path of that file. It must contain the same genes as
                   index and same samples as columns.
        - 'multi_col_matrix' : a dataframe must be provided, containing a
                               cutoff for each gene in each sample. This allows
                               using specific cutoffs for each sample. The
                               columns here must be the same as the ones in the
                               rnaseq_data.
        - 'single_col_matrix' : a dataframe must be provided, containing a
                                cutoff for each gene in only one column. These
                                cutoffs will be applied to all samples.
        - 'constant_value' : binarizes the expression. Evaluates whether
                             expression is greater than the value input in
                             the 'parameter'.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    cutoffs : pandas.DataFrame
        Dataframe wherein rows are genes in rnaseq_data. Depending on the type in
        the parameters dictionary, it may have only one column ('value') or the
        same columns that rnaseq_data has, generating specific cutoffs for each
        cell/tissue/sample.
    '''
    parameter = parameters['parameter']
    type = parameters['type']
    if verbose:
        print("Calculating cutoffs for gene abundances")
    if type == 'local_percentile':
        cutoffs = get_local_percentile_cutoffs(rnaseq_data, parameter)
        cutoffs.columns = ['value']
    elif type == 'global_percentile':
        cutoffs = get_global_percentile_cutoffs(rnaseq_data, parameter)
        cutoffs.columns = ['value']
    elif type == 'constant_value':
        cutoffs = get_constant_cutoff(rnaseq_data, parameter)
        cutoffs.columns = ['value']
    elif type == 'file':
        cutoffs = read_data.load_cutoffs(parameter,
                                         format='auto')
        cutoffs = cutoffs.loc[rnaseq_data.index]
    elif type == 'multi_col_matrix':
        cutoffs = parameter
        cutoffs = cutoffs.loc[rnaseq_data.index]
        cutoffs = cutoffs[rnaseq_data.columns]
    elif type == 'single_col_matrix':
        cutoffs = parameter
        cutoffs.columns = ['value']
        cutoffs = cutoffs.loc[rnaseq_data.index]
    else:
        raise ValueError(type + ' is not a valid cutoff')
    return cutoffs
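
The `parameters` dictionary can be sketched as below. All values are hypothetical; the last lines repeat the alignment performed in the 'multi_col_matrix' branch of the source above, so the example runs with pandas alone.

```python
import pandas as pd

rnaseq_data = pd.DataFrame(
    {"S1": [1.0, 5.0], "S2": [2.0, 8.0]},
    index=["GeneA", "GeneB"],
)

# Hypothetical `parameters` dictionaries for three of the supported types.
local_params = {"type": "local_percentile", "parameter": 0.75}
global_params = {"type": "global_percentile", "parameter": 0.75}
constant_params = {"type": "constant_value", "parameter": 10}

# 'multi_col_matrix': one cutoff per gene and per sample. In this toy table
# the genes and samples are in a different order and there is an extra gene.
cutoff_table = pd.DataFrame(
    {"S2": [0.5, 4.0, 9.0], "S1": [0.2, 3.0, 7.0]},
    index=["GeneB", "GeneA", "GeneZ"],
)
multi_params = {"type": "multi_col_matrix", "parameter": cutoff_table}

# Same alignment as in the source: select the genes of rnaseq_data first,
# then its columns, so rows and columns end up in the same order.
aligned = multi_params["parameter"].loc[rnaseq_data.index]
aligned = aligned[rnaseq_data.columns]
```

With cell2cell installed, each dictionary would be passed as the second argument, for example get_cutoffs(rnaseq_data, local_params).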

get_global_percentile_cutoffs(rnaseq_data, percentile=0.75)

Obtains a global value associated with a given percentile across cells/tissues/samples and genes in a rnaseq_data.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for a bulk RNA-seq experiment or a single-cell experiment after aggregation into cell types. Columns are cell-types/tissues/samples and rows are genes.

percentile : float, default=0.75 This is the percentile to be computed.

Returns

cutoffs : pandas.DataFrame A dataframe containing the value corresponding to the percentile across the dataset. Rows are genes and the column corresponds to 'value'. All values here are the same global percentile.

Source code in cell2cell/preprocessing/cutoffs.py
def get_global_percentile_cutoffs(rnaseq_data, percentile=0.75):
    '''
    Obtains a global value associated with a given percentile across
    cells/tissues/samples and genes in a rnaseq_data.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for a bulk RNA-seq experiment or a single-cell
        experiment after aggregation into cell types. Columns are
        cell-types/tissues/samples and rows are genes.

    percentile : float, default=0.75
        This is the percentile to be computed.

    Returns
    -------
    cutoffs : pandas.DataFrame
        A dataframe containing the value corresponding to the percentile
        across the dataset. Rows are genes and the column corresponds to
        'value'. All values here are the same global percentile.
    '''
    cutoffs = pd.DataFrame(index=rnaseq_data.index, columns=['value'])
    cutoffs['value'] = np.quantile(rnaseq_data.values, percentile)
    return cutoffs
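
A minimal worked example of the global percentile, using hypothetical values. It repeats the two lines of the source above to show that the matrix is flattened before the quantile is taken.

```python
import numpy as np
import pandas as pd

rnaseq_data = pd.DataFrame(
    {"S1": [1.0, 3.0], "S2": [2.0, 4.0]},
    index=["GeneA", "GeneB"],
)

# Without an axis argument, np.quantile flattens the matrix, so the
# percentile is computed over all genes and samples at once: [1, 2, 3, 4].
global_value = np.quantile(rnaseq_data.values, 0.75)

cutoffs = pd.DataFrame(index=rnaseq_data.index, columns=["value"])
cutoffs["value"] = global_value  # every gene receives the same cutoff
```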

get_local_percentile_cutoffs(rnaseq_data, percentile=0.75)

Obtains a local value associated with a given percentile across cells/tissues/samples for each gene in a rnaseq_data.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for a bulk RNA-seq experiment or a single-cell experiment after aggregation into cell types. Columns are cell-types/tissues/samples and rows are genes.

percentile : float, default=0.75 This is the percentile to be computed.

Returns

cutoffs : pandas.DataFrame A dataframe containing, for each gene, the value corresponding to the percentile across cells/tissues/samples. Rows are genes and the column corresponds to 'value'.

Source code in cell2cell/preprocessing/cutoffs.py
def get_local_percentile_cutoffs(rnaseq_data, percentile=0.75):
    '''
    Obtains a local value associated with a given percentile across
    cells/tissues/samples for each gene in a rnaseq_data.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for a bulk RNA-seq experiment or a single-cell
        experiment after aggregation into cell types. Columns are
        cell-types/tissues/samples and rows are genes.

    percentile : float, default=0.75
        This is the percentile to be computed.

    Returns
    -------
    cutoffs : pandas.DataFrame
        A dataframe containing, for each gene, the value corresponding to the
        percentile across cells/tissues/samples. Rows are genes and the column
        corresponds to 'value'.
    '''
    cutoffs = rnaseq_data.quantile(percentile, axis=1).to_frame()
    cutoffs.columns = ['value']
    return cutoffs
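
A minimal worked example of the local percentile, using hypothetical values. Unlike the global version, each gene gets its own cutoff, computed only from that gene's row.

```python
import pandas as pd

rnaseq_data = pd.DataFrame(
    {"S1": [1.0, 10.0], "S2": [2.0, 20.0], "S3": [3.0, 30.0]},
    index=["GeneA", "GeneB"],
)

# axis=1 takes the quantile along the columns, giving one value per gene (row).
cutoffs = rnaseq_data.quantile(0.75, axis=1).to_frame()
cutoffs.columns = ["value"]

gene_a_cutoff = cutoffs.loc["GeneA", "value"]  # 75th percentile of [1, 2, 3]
gene_b_cutoff = cutoffs.loc["GeneB", "value"]  # 75th percentile of [10, 20, 30]
```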

find_elements

find_duplicates(element_list)

Finds duplicate items and lists their index locations. Function based on: https://stackoverflow.com/a/5419576/12032899

Parameters

element_list : list List of elements

Returns

duplicate_dict : dict Dictionary with duplicate items. Keys are the items, and values are lists with the respective indexes where they are.

Source code in cell2cell/preprocessing/find_elements.py
def find_duplicates(element_list):
    '''Function based on: https://stackoverflow.com/a/5419576/12032899
    Finds duplicate items and lists their index locations.

    Parameters
    ----------
    element_list : list
        List of elements

    Returns
    -------
    duplicate_dict : dict
        Dictionary with duplicate items. Keys are the items, and values
        are lists with the respective indexes where they are.
    '''
    tally = defaultdict(list)
    for i,item in enumerate(element_list):
        tally[item].append(i)

    duplicate_dict = {key : locs for key,locs in tally.items()
                            if len(locs)>1}
    return duplicate_dict
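
The tally approach can be illustrated on a short list. The helper name `duplicates_with_positions` and the input are hypothetical; the body follows the source above.

```python
from collections import defaultdict

def duplicates_with_positions(element_list):
    # Record every position at which each item appears.
    tally = defaultdict(list)
    for i, item in enumerate(element_list):
        tally[item].append(i)
    # Keep only the items seen more than once.
    return {key: locs for key, locs in tally.items() if len(locs) > 1}

result = duplicates_with_positions(["a", "b", "a", "c", "b", "a"])
```

Here "c" is dropped because it appears only once, while "a" and "b" keep the full list of their positions.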

get_element_abundances(element_lists)

Computes the fraction of occurrence of each element in a list of lists.

Parameters

element_lists : list List of lists of elements. Elements will be counted only once in each of the lists.

Returns

abundance_dict : dict Dictionary containing the number of times that an element was present, divided by the total number of lists in element_lists.

Source code in cell2cell/preprocessing/find_elements.py
def get_element_abundances(element_lists):
    '''Computes the fraction of occurrence of each element
    in a list of lists.

    Parameters
    ----------
    element_lists : list
        List of lists of elements. Elements will be
        counted only once in each of the lists.

    Returns
    -------
    abundance_dict : dict
        Dictionary containing the number of times that an
        element was present, divided by the total number of
        lists in `element_lists`.
    '''
    abundance_dict = Counter(itertools.chain(*map(set, element_lists)))
    total = len(element_lists)
    abundance_dict = {k : v/total for k, v in abundance_dict.items()}
    return abundance_dict
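
A short sketch of the counting logic on hypothetical lists. Converting each list to a set first is what makes repeated elements within one list count only once.

```python
import itertools
from collections import Counter

element_lists = [
    ["A", "B", "A"],  # "A" is repeated but counts once for this list
    ["A", "C"],
    ["B"],
]

# Count in how many lists each element appears.
counts = Counter(itertools.chain(*map(set, element_lists)))

# Divide by the number of lists to obtain a fraction of occurrence.
total = len(element_lists)
abundance_dict = {k: v / total for k, v in counts.items()}
```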

get_elements_over_fraction(abundance_dict, fraction)

Obtains a list of elements whose fraction of occurrence is at least the threshold.

Parameters

abundance_dict : dict Dictionary containing the number of times that an element was present, divided by the total number of possible occurrences.

fraction : float Threshold to filter the elements. Elements with a fraction of occurrence of at least this threshold will be included.

Returns

elements : list List of elements that met the fraction criteria.

Source code in cell2cell/preprocessing/find_elements.py
def get_elements_over_fraction(abundance_dict, fraction):
    '''Obtains a list of elements whose
    fraction of occurrence is at least the threshold.

    Parameters
    ----------
    abundance_dict : dict
        Dictionary containing the number of times that an
        element was present, divided by the total number of
        possible occurrences.

    fraction : float
        Threshold to filter the elements. Elements with a fraction of
        occurrence of at least this threshold will be included.

    Returns
    -------
    elements : list
        List of elements that met the fraction criteria.
    '''
    elements = [k for k, v in abundance_dict.items() if v >= fraction]
    return elements
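
A brief example of the filter with hypothetical abundances. The comparison is inclusive, so an element exactly at the threshold is kept.

```python
# Hypothetical fractions of occurrence for three elements.
abundance_dict = {"A": 1.0, "B": 0.5, "C": 0.25}

fraction = 0.5

# Inclusive comparison: "B" sits exactly at the threshold and is retained.
elements = [k for k, v in abundance_dict.items() if v >= fraction]
```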

gene_ontology

find_all_children_of_go_term(go_terms, go_term_name, output_list, verbose=True)

Finds all children GO terms (below in hierarchy) of a given GO term.

Parameters

go_terms : networkx.Graph NetworkX Graph containing GO terms datasets from .obo file. It could be loaded using cell2cell.io.read_data.load_go_terms(filename).

go_term_name : str GO term whose children are to be found. For example: 'GO:0007155'.

output_list : list List used to perform a Depth First Search and find the children in a recursive way. The children found are appended to this list in place.

verbose : boolean, default=True Whether to print the steps of the analysis.

Source code in cell2cell/preprocessing/gene_ontology.py
def find_all_children_of_go_term(go_terms, go_term_name, output_list, verbose=True):
    '''
    Finds all children GO terms (below in hierarchy) of
    a given GO term.

    Parameters
    ----------
    go_terms : networkx.Graph
        NetworkX Graph containing GO terms datasets from .obo file.
        It could be loaded using
        cell2cell.io.read_data.load_go_terms(filename).

    go_term_name : str
        GO term whose children are to be found. For example:
        'GO:0007155'.

    output_list : list
        List used to perform a Depth First Search and find the
        children in a recursive way. The children found are
        appended to this list in place.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.
    '''
    for child in networkx.ancestors(go_terms, go_term_name):
        if child not in output_list:
            if verbose:
                print('Retrieving children for ' + go_term_name)
            output_list.append(child)
        find_all_children_of_go_term(go_terms, child, output_list, verbose)
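
The traversal can be sketched without networkx. This sketch assumes that edges in the GO graph point from a child term to its parent, which would explain why the source collects children through networkx.ancestors; the term names, the `parents` mapping, and the helper `find_children` are hypothetical.

```python
# Hypothetical toy hierarchy: each term maps to its parent terms
# (assumed child -> parent edge direction).
parents = {
    "GO:child1": ["GO:root"],
    "GO:child2": ["GO:root"],
    "GO:grandchild": ["GO:child1"],
    "GO:root": [],
}

def find_children(parents, term, output_list):
    # Depth-first search: the children of a term are the nodes that
    # have an edge pointing to it. Results are appended in place.
    for candidate, its_parents in parents.items():
        if term in its_parents and candidate not in output_list:
            output_list.append(candidate)
            find_children(parents, candidate, output_list)

children = []
find_children(parents, "GO:root", children)
```

As in the documented function, nothing is returned; the caller supplies an empty list and reads the result from it afterwards.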

find_go_terms_from_keyword(go_terms, keyword, verbose=False)

Uses a keyword to find related GO terms.

Parameters

go_terms : networkx.Graph NetworkX Graph containing GO terms datasets from .obo file. It could be loaded using cell2cell.io.read_data.load_go_terms(filename).

keyword : str Keyword to be included in the names of retrieved GO terms.

verbose : boolean, default=False Whether to print the steps of the analysis.

Returns

go_filter : list List containing all GO terms related to a keyword.

Source code in cell2cell/preprocessing/gene_ontology.py
def find_go_terms_from_keyword(go_terms, keyword, verbose=False):
    '''
    Uses a keyword to find related GO terms.

    Parameters
    ----------
    go_terms : networkx.Graph
        NetworkX Graph containing GO terms datasets from .obo file.
        It could be loaded using
        cell2cell.io.read_data.load_go_terms(filename).

    keyword : str
        Keyword to be included in the names of retrieved GO terms.

    verbose : boolean, default=False
        Whether to print the steps of the analysis.

    Returns
    -------
    go_filter : list
        List containing all GO terms related to a keyword.
    '''
    go_filter = []
    for go, node in go_terms.nodes.items():
        if keyword in node['name']:
            go_filter.append(go)
            if verbose:
                print(go, node['name'])
    return go_filter
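
The keyword search is a plain substring test on each term name, which the sketch below reproduces on a hypothetical stand-in for `go_terms.nodes` (the term IDs and names are made up, and `terms_from_keyword` is an illustrative helper). Because Python's `in` is case-sensitive, so is the match.

```python
# Hypothetical stand-in for go_terms.nodes: term ID -> attributes with a 'name'.
nodes = {
    "GO:A": {"name": "cell adhesion"},
    "GO:B": {"name": "Cell migration"},
    "GO:C": {"name": "regulation of cell adhesion"},
}

def terms_from_keyword(nodes, keyword):
    # Keep the IDs of the terms whose name contains the keyword.
    return [go for go, node in nodes.items() if keyword in node["name"]]

adhesion_terms = terms_from_keyword(nodes, "adhesion")
lowercase_cell_terms = terms_from_keyword(nodes, "cell")  # skips 'Cell migration'
```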

get_genes_from_go_hierarchy(go_annotations, go_terms, go_filter, go_header='GO', gene_header='Gene', verbose=False)

Obtains genes associated with specific GO terms and their children GO terms (below in the hierarchy).

Parameters

go_annotations : pandas.DataFrame Dataframe containing information about GO term annotations of each gene for a given organism according to the ga file. It can be loaded with the function cell2cell.io.read_data.load_go_annotations().

go_terms : networkx.Graph NetworkX Graph containing the GO terms loaded from an .obo file. It can be loaded using cell2cell.io.read_data.load_go_terms(filename).

go_filter : list List containing one or more GO-terms to find associated genes.

go_header : str, default='GO' Column name wherein GO terms are located in the dataframe.

gene_header : str, default='Gene' Column name wherein genes are located in the dataframe.

verbose : boolean, default=False Whether to print the steps of the analysis.

Returns

genes : list List of genes associated with the GO-terms contained in go_filter or with any of their children GO terms.

Source code in cell2cell/preprocessing/gene_ontology.py
def get_genes_from_go_hierarchy(go_annotations, go_terms, go_filter, go_header='GO', gene_header='Gene', verbose=False):
    '''
    Obtains genes associated with specific GO terms and their
    children GO terms (below in the hierarchy).

    Parameters
    ----------
    go_annotations : pandas.DataFrame
        Dataframe containing information about GO term annotations of each
        gene for a given organism according to the ga file. It can be loaded
        with the function cell2cell.io.read_data.load_go_annotations().

    go_terms : networkx.Graph
        NetworkX Graph containing the GO terms loaded from an .obo file.
        It can be loaded using
        cell2cell.io.read_data.load_go_terms(filename).

    go_filter : list
        List containing one or more GO-terms to find associated genes.

    go_header : str, default='GO'
        Column name wherein GO terms are located in the dataframe.

    gene_header : str, default='Gene'
        Column name wherein genes are located in the dataframe.

    verbose : boolean, default=False
        Whether to print the steps of the analysis.

    Returns
    -------
    genes : list
        List of genes associated with the GO-terms contained in
        go_filter or with any of their children GO terms.
    '''
    go_hierarchy = go_filter.copy()
    n_terms = len(go_hierarchy)  # Only loop over the input terms; the list grows inside the loop
    for i in range(n_terms):
        find_all_children_of_go_term(go_terms, go_hierarchy[i], go_hierarchy, verbose=verbose)
    go_hierarchy = list(set(go_hierarchy))
    genes = get_genes_from_go_terms(go_annotations=go_annotations,
                                    go_filter=go_hierarchy,
                                    go_header=go_header,
                                    gene_header=gene_header,
                                    verbose=verbose)
    return genes
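
The expansion to children terms can be pictured as follows. This is a minimal sketch that assumes a toy hierarchy stored as a plain dictionary (parent to direct children) instead of the NetworkX graph used by the library; the term IDs, the gene names, and the helper expand_with_children are hypothetical.

```python
import pandas as pd

# Hypothetical toy hierarchy: each term maps to its direct children.
children = {
    'GO:PARENT': ['GO:CHILD1', 'GO:CHILD2'],
    'GO:CHILD1': ['GO:GRANDCHILD'],
    'GO:CHILD2': [],
    'GO:GRANDCHILD': [],
    'GO:OTHER': [],
}


def expand_with_children(terms, children):
    """Return the given terms plus every term below them in the hierarchy."""
    expanded = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in expanded:
            expanded.add(term)
            stack.extend(children.get(term, []))
    return expanded


# Sample annotation table using the default column names 'GO' and 'Gene'.
go_annotations = pd.DataFrame({
    'GO': ['GO:PARENT', 'GO:CHILD1', 'GO:GRANDCHILD', 'GO:OTHER'],
    'Gene': ['geneA', 'geneB', 'geneC', 'geneD'],
})

go_hierarchy = expand_with_children(['GO:PARENT'], children)
mask = go_annotations['GO'].isin(go_hierarchy)
genes = sorted(go_annotations.loc[mask, 'Gene'].unique())
print(genes)
```

Genes annotated only to a child or grandchild term are picked up even though only the parent term was requested, while genes under unrelated terms are left out.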

get_genes_from_go_terms(go_annotations, go_filter, go_header='GO', gene_header='Gene', verbose=True)

Finds genes associated with specific GO-terms.

Parameters

go_annotations : pandas.DataFrame Dataframe containing information about GO term annotations of each gene for a given organism according to the ga file. It can be loaded with the function cell2cell.io.read_data.load_go_annotations().

go_filter : list List containing one or more GO-terms to find associated genes.

go_header : str, default='GO' Column name wherein GO terms are located in the dataframe.

gene_header : str, default='Gene' Column name wherein genes are located in the dataframe.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

genes : list List of genes that are associated with GO-terms contained in go_filter.

Source code in cell2cell/preprocessing/gene_ontology.py
def get_genes_from_go_terms(go_annotations, go_filter, go_header='GO', gene_header='Gene', verbose=True):
    '''
    Finds genes associated with specific GO-terms.

    Parameters
    ----------
    go_annotations : pandas.DataFrame
        Dataframe containing information about GO term annotations of each
        gene for a given organism according to the ga file. It can be loaded
        with the function cell2cell.io.read_data.load_go_annotations().

    go_filter : list
        List containing one or more GO-terms to find associated genes.

    go_header : str, default='GO'
        Column name wherein GO terms are located in the dataframe.

    gene_header : str, default='Gene'
        Column name wherein genes are located in the dataframe.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    genes : list
        List of genes that are associated with GO-terms contained in
        go_filter.
    '''
    if verbose:
        print('Filtering genes by using GO terms')
    genes = list(go_annotations.loc[go_annotations[go_header].isin(go_filter)][gene_header].unique())
    return genes
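
The filtering step is a row selection followed by de-duplication. The sketch below repeats that selection on a small, made-up annotation table with the default column names; the term IDs and gene names are hypothetical.

```python
import pandas as pd

# Made-up annotations: one row per (GO term, gene) association.
go_annotations = pd.DataFrame({
    'GO': ['GO:EX1', 'GO:EX1', 'GO:EX2', 'GO:EX3', 'GO:EX1'],
    'Gene': ['geneA', 'geneB', 'geneB', 'geneC', 'geneA'],
})
go_filter = ['GO:EX1', 'GO:EX2']

# Keep rows whose GO term is in the filter, then list each gene once,
# in order of first appearance.
mask = go_annotations['GO'].isin(go_filter)
genes = list(go_annotations.loc[mask, 'Gene'].unique())
print(genes)
```

A gene annotated to several of the requested terms, or listed more than once for the same term, appears only once in the result.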

integrate_data

get_modified_rnaseq(rnaseq_data, cutoffs=None, communication_score='expression_thresholding')

Preprocess gene expression into values used by a communication scoring function (either continuous or binary).

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for RNA-seq experiment. Columns are cell-types/tissues/samples and rows are genes.

cutoffs : pandas.DataFrame A dataframe containing the value corresponding to cutoff or threshold assigned to each gene. Rows are genes and columns could be either 'value' for a single threshold for all cell-types/tissues/samples or the names of cell-types/tissues/samples for thresholding in a specific way. They can be obtained with the function cell2cell.preprocessing.cutoffs.get_cutoffs()

communication_score : str, default='expression_thresholding' Type of communication score used to detect active ligand-receptor pairs between each pair of cells. See cell2cell.core.communication_scores for more details. It can be:

- 'expression_thresholding'
- 'expression_product'
- 'expression_mean'
- 'expression_gmean'

Returns

modified_rnaseq : pandas.DataFrame Preprocessed gene expression given a communication scoring function to use. Rows are genes and columns are cell-types/tissues/samples.

Source code in cell2cell/preprocessing/integrate_data.py
def get_modified_rnaseq(rnaseq_data, cutoffs=None, communication_score='expression_thresholding'):
    '''
    Preprocess gene expression into values used by a communication
    scoring function (either continuous or binary).

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for RNA-seq experiment. Columns are
        cell-types/tissues/samples and rows are genes.

    cutoffs : pandas.DataFrame
        A dataframe containing the value corresponding to cutoff or threshold
        assigned to each gene. Rows are genes and columns could be either
        'value' for a single threshold for all cell-types/tissues/samples
        or the names of cell-types/tissues/samples for thresholding in a
        specific way.
        They can be obtained with the function
        cell2cell.preprocessing.cutoffs.get_cutoffs()

    communication_score : str, default='expression_thresholding'
        Type of communication score used to detect active ligand-receptor
        pairs between each pair of cells. See
        cell2cell.core.communication_scores for more details.
        It can be:

        - 'expression_thresholding'
        - 'expression_product'
        - 'expression_mean'
        - 'expression_gmean'

    Returns
    -------
    modified_rnaseq : pandas.DataFrame
        Preprocessed gene expression given a communication scoring
        function to use. Rows are genes and columns are
        cell-types/tissues/samples.
    '''
    if communication_score == 'expression_thresholding':
        modified_rnaseq = get_thresholded_rnaseq(rnaseq_data, cutoffs)
    elif communication_score in ['expression_product', 'expression_mean', 'expression_gmean']:
        modified_rnaseq = rnaseq_data.copy()
    else:
        # As other score types are implemented, other elif condition will be included here.
        raise NotImplementedError("Score type {} to compute pairwise cell-interactions is not implemented".format(communication_score))
    return modified_rnaseq
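
The branching on communication_score can be summarized with a small stand-in. This is a sketch only: modify_expression is a hypothetical local helper that mirrors the three branches shown above, it handles just the single 'value' cutoff column, and the expression values are made up.

```python
import pandas as pd

CONTINUOUS_SCORES = ('expression_product', 'expression_mean', 'expression_gmean')


def modify_expression(rnaseq_data, cutoffs=None,
                      communication_score='expression_thresholding'):
    """Local sketch of the branching; not the library function."""
    if communication_score == 'expression_thresholding':
        # Binary values: 1.0 where expression is strictly above the gene's cutoff.
        return rnaseq_data.gt(cutoffs['value'], axis=0).astype(float)
    if communication_score in CONTINUOUS_SCORES:
        # Continuous scores work on the expression values as they are.
        return rnaseq_data.copy()
    raise NotImplementedError(
        "Score type {} is not implemented".format(communication_score))


rnaseq = pd.DataFrame({'CellA': [5.0, 0.0], 'CellB': [1.0, 3.0]},
                      index=['gene1', 'gene2'])
cutoffs = pd.DataFrame({'value': [2.0, 2.0]}, index=['gene1', 'gene2'])

binary = modify_expression(rnaseq, cutoffs)
continuous = modify_expression(rnaseq, communication_score='expression_product')
print(binary)
print(continuous)
```

Only 'expression_thresholding' needs cutoffs; for the continuous scores the cutoffs argument can stay None.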

get_ppi_dict_from_go_terms(ppi_data, go_annotations, go_terms, contact_go_terms, mediator_go_terms=None, use_children=True, go_header='GO', gene_header='Gene', interaction_columns=('A', 'B'), verbose=True)

Filters a complete list of protein-protein interactions into sublists containing proteins involved in different kinds of intercellular interactions, by provided lists of GO terms.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

go_annotations : pandas.DataFrame Dataframe containing information about GO term annotations of each gene for a given organism according to the ga file. It can be loaded with the function cell2cell.io.read_data.load_go_annotations().

go_terms : networkx.Graph NetworkX Graph containing the GO terms loaded from an .obo file. It can be loaded using cell2cell.io.read_data.load_go_terms(filename).

contact_go_terms : list GO terms for selecting proteins participating in cell contact interactions (e.g. surface proteins, receptors).

mediator_go_terms : list, default=None GO terms for selecting proteins participating in mediated or secreted signaling (e.g. extracellular proteins, ligands). If None, only interactions involved in cell contacts will be returned.

use_children : boolean, default=True Whether to also consider the children GO terms (below in the hierarchy) of the ones passed as inputs (contact_go_terms and mediator_go_terms).

go_header : str, default='GO' Column name wherein GO terms are located in the dataframe.

gene_header : str, default='Gene' Column name wherein genes are located in the dataframe.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

ppi_dict : dict Dictionary containing lists of PPIs involving proteins that participate in different kinds of intercellular interactions. Options are under the keys:

- 'contacts' : Contains proteins participating in cell contact interactions (e.g. surface proteins, receptors).
- 'mediated' : Contains proteins participating in mediated or secreted signaling (e.g. ligand-receptor interactions).
- 'combined' : Contains both 'contacts' and 'mediated' PPIs.
- 'complete' : Contains all combinations of interactions between ligands, receptors, surface proteins, etc.

If mediator_go_terms input is None, this dictionary will contain PPIs only for 'contacts'.

Source code in cell2cell/preprocessing/integrate_data.py
def get_ppi_dict_from_go_terms(ppi_data, go_annotations, go_terms, contact_go_terms, mediator_go_terms=None, use_children=True,
                               go_header='GO', gene_header='Gene', interaction_columns=('A', 'B'), verbose=True):
    '''
    Filters a complete list of protein-protein interactions into
    sublists containing proteins involved in different kinds of
    intercellular interactions, by provided lists of GO terms.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    go_annotations : pandas.DataFrame
        Dataframe containing information about GO term annotations of each
        gene for a given organism according to the ga file. It can be loaded
        with the function cell2cell.io.read_data.load_go_annotations().

    go_terms : networkx.Graph
        NetworkX Graph containing the GO terms loaded from an .obo file.
        It can be loaded using
        cell2cell.io.read_data.load_go_terms(filename).

    contact_go_terms : list
        GO terms for selecting proteins participating in cell contact
        interactions (e.g. surface proteins, receptors).

    mediator_go_terms : list, default=None
        GO terms for selecting proteins participating in mediated or
        secreted signaling (e.g. extracellular proteins, ligands).
        If None, only interactions involved in cell contacts
        will be returned.

    use_children : boolean, default=True
        Whether to also consider the children GO terms (below in the hierarchy)
        of the ones passed as inputs (contact_go_terms and mediator_go_terms).

    go_header : str, default='GO'
        Column name wherein GO terms are located in the dataframe.

    gene_header : str, default='Gene'
        Column name wherein genes are located in the dataframe.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    ppi_dict : dict
        Dictionary containing lists of PPIs involving proteins that
        participate in different kinds of intercellular interactions.
        Options are under the keys:

        - 'contacts' : Contains proteins participating in cell contact
                interactions (e.g. surface proteins, receptors)
        - 'mediated' : Contains proteins participating in mediated or
                secreted signaling (e.g. ligand-receptor interactions)
        - 'combined' : Contains both 'contacts' and 'mediated' PPIs.
        - 'complete' : Contains all combinations of interactions between
                ligands, receptors, surface proteins, etc.

        If mediator_go_terms input is None, this dictionary will contain
        PPIs only for 'contacts'.
    '''
    # mediator_go_terms=None is allowed by the signature, but the GO helpers
    # need a list, so the mediator lookup is skipped in that case.
    mediator_proteins = None
    if use_children:
        contact_proteins = gene_ontology.get_genes_from_go_hierarchy(go_annotations=go_annotations,
                                                                     go_terms=go_terms,
                                                                     go_filter=contact_go_terms,
                                                                     go_header=go_header,
                                                                     gene_header=gene_header,
                                                                     verbose=verbose)

        if mediator_go_terms is not None:
            mediator_proteins = gene_ontology.get_genes_from_go_hierarchy(go_annotations=go_annotations,
                                                                          go_terms=go_terms,
                                                                          go_filter=mediator_go_terms,
                                                                          go_header=go_header,
                                                                          gene_header=gene_header,
                                                                          verbose=verbose)
    else:
        contact_proteins = gene_ontology.get_genes_from_go_terms(go_annotations=go_annotations,
                                                                 go_filter=contact_go_terms,
                                                                 go_header=go_header,
                                                                 gene_header=gene_header,
                                                                 verbose=verbose)

        if mediator_go_terms is not None:
            mediator_proteins = gene_ontology.get_genes_from_go_terms(go_annotations=go_annotations,
                                                                      go_filter=mediator_go_terms,
                                                                      go_header=go_header,
                                                                      gene_header=gene_header,
                                                                      verbose=verbose)

    # Avoid same genes in list
    #contact_proteins = list(set(contact_proteins) - set(mediator_proteins))

    ppi_dict = get_ppi_dict_from_proteins(ppi_data=ppi_data,
                                          contact_proteins=contact_proteins,
                                          mediator_proteins=mediator_proteins,
                                          interaction_columns=interaction_columns,
                                          verbose=verbose)

    return ppi_dict

get_ppi_dict_from_proteins(ppi_data, contact_proteins, mediator_proteins=None, interaction_columns=('A', 'B'), bidirectional=True, verbose=True)

Filters a complete list of protein-protein interactions into sublists containing proteins involved in different kinds of intercellular interactions, by provided lists of proteins.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

contact_proteins : list Protein names of proteins participating in cell contact interactions (e.g. surface proteins, receptors).

mediator_proteins : list, default=None Protein names of proteins participating in mediated or secreted signaling (e.g. extracellular proteins, ligands). If None, only interactions involved in cell contacts will be returned.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

bidirectional : boolean, default=True Whether to duplicate PPIs in both directions. That is, if the list contains the ProtA-ProtB interaction, the interaction ProtB-ProtA will also be included.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

ppi_dict : dict Dictionary containing lists of PPIs involving proteins that participate in different kinds of intercellular interactions. Options are under the keys:

- 'contacts' : Contains proteins participating in cell contact interactions (e.g. surface proteins, receptors).
- 'mediated' : Contains proteins participating in mediated or secreted signaling (e.g. ligand-receptor interactions).
- 'combined' : Contains both 'contacts' and 'mediated' PPIs.
- 'complete' : Contains all combinations of interactions between ligands, receptors, surface proteins, etc.

If mediator_proteins input is None, this dictionary will contain PPIs only for 'contacts'.

Source code in cell2cell/preprocessing/integrate_data.py
def get_ppi_dict_from_proteins(ppi_data, contact_proteins, mediator_proteins=None, interaction_columns=('A', 'B'),
                               bidirectional=True, verbose=True):
    '''
    Filters a complete list of protein-protein interactions into
    sublists containing proteins involved in different kinds of
    intercellular interactions, by provided lists of proteins.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    contact_proteins : list
        Protein names of proteins participating in cell contact
        interactions (e.g. surface proteins, receptors).

    mediator_proteins : list, default=None
        Protein names of proteins participating in mediated or
        secreted signaling (e.g. extracellular proteins, ligands).
        If None, only interactions involved in cell contacts
        will be returned.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    bidirectional : boolean, default=True
        Whether to duplicate PPIs in both directions.
        That is, if the list contains the ProtA-ProtB interaction,
        the interaction ProtB-ProtA will also be included.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    ppi_dict : dict
        Dictionary containing lists of PPIs involving proteins that
        participate in different kinds of intercellular interactions.
        Options are under the keys:

        - 'contacts' : Contains proteins participating in cell contact
                interactions (e.g. surface proteins, receptors)
        - 'mediated' : Contains proteins participating in mediated or
                secreted signaling (e.g. ligand-receptor interactions)
        - 'combined' : Contains both 'contacts' and 'mediated' PPIs.
        - 'complete' : Contains all combinations of interactions between
                ligands, receptors, surface proteins, etc.

        If mediator_proteins input is None, this dictionary will contain
        PPIs only for 'contacts'.
    '''


    ppi_dict = dict()
    ppi_dict['contacts'] = ppi.filter_ppi_network(ppi_data=ppi_data,
                                                  contact_proteins=contact_proteins,
                                                  mediator_proteins=mediator_proteins,
                                                  interaction_type='contacts',
                                                  interaction_columns=interaction_columns,
                                                  bidirectional=bidirectional,
                                                  verbose=verbose)
    if mediator_proteins is not None:
        for interaction_type in ['mediated', 'combined', 'complete']:
            ppi_dict[interaction_type] = ppi.filter_ppi_network(ppi_data=ppi_data,
                                                          contact_proteins=contact_proteins,
                                                          mediator_proteins=mediator_proteins,
                                                          interaction_type=interaction_type,
                                                          interaction_columns=interaction_columns,
                                                          bidirectional=bidirectional,
                                                          verbose=verbose)

    return ppi_dict
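
The effect of bidirectional=True, as described in the parameter list, can be pictured with the sketch below. The actual filtering is delegated to ppi.filter_ppi_network, whose internals are not shown here, so make_bidirectional is a hypothetical helper, and dropping duplicated rows is an assumption of this sketch rather than documented behavior. The protein names are made up.

```python
import pandas as pd

ppi_data = pd.DataFrame({
    'A': ['ProtA', 'ProtC'],
    'B': ['ProtB', 'ProtD'],
    'score': [1.0, 1.0],
})


def make_bidirectional(ppi_data, interaction_columns=('A', 'B')):
    """Add the reversed version of every interaction (hypothetical helper)."""
    col_a, col_b = interaction_columns
    # Swap the two partner columns, then restore the original column order.
    reversed_ppi = ppi_data.rename(columns={col_a: col_b, col_b: col_a})
    reversed_ppi = reversed_ppi[ppi_data.columns]
    combined = pd.concat([ppi_data, reversed_ppi], ignore_index=True)
    # Assumption of this sketch: identical rows are kept only once.
    return combined.drop_duplicates().reset_index(drop=True)


bidirectional = make_bidirectional(ppi_data)
print(bidirectional)
```

With two input interactions, the result holds four rows: ProtA-ProtB, ProtC-ProtD, and their reversed counterparts.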

get_thresholded_rnaseq(rnaseq_data, cutoffs)

Binarizes an RNA-seq dataset given cutoff or threshold values.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for RNA-seq experiment. Columns are cell-types/tissues/samples and rows are genes.

cutoffs : pandas.DataFrame A dataframe containing the value corresponding to cutoff or threshold assigned to each gene. Rows are genes and columns could be either 'value' for a single threshold for all cell-types/tissues/samples or the names of cell-types/tissues/samples for thresholding in a specific way. They can be obtained with the function cell2cell.preprocessing.cutoffs.get_cutoffs()

Returns

binary_rnaseq_data : pandas.DataFrame Gene expression converted into binary values (1.0 where the expression is above the cutoff, 0.0 otherwise), given cutoffs or thresholds that are either general or specific to each cell-type/tissue/sample.

Source code in cell2cell/preprocessing/integrate_data.py
def get_thresholded_rnaseq(rnaseq_data, cutoffs):
    '''Binarizes an RNA-seq dataset given cutoff or threshold
    values.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for RNA-seq experiment. Columns are
        cell-types/tissues/samples and rows are genes.

    cutoffs : pandas.DataFrame
        A dataframe containing the value corresponding to cutoff or threshold
        assigned to each gene. Rows are genes and columns could be either
        'value' for a single threshold for all cell-types/tissues/samples
        or the names of cell-types/tissues/samples for thresholding in a
        specific way.
        They can be obtained with the function
        cell2cell.preprocessing.cutoffs.get_cutoffs()

    Returns
    -------
    binary_rnaseq_data : pandas.DataFrame
        Gene expression converted into binary values (1.0 where the
        expression is above the cutoff, 0.0 otherwise), given cutoffs
        or thresholds that are either general or specific to each
        cell-type/tissue/sample.
    '''
    binary_rnaseq_data = rnaseq_data.copy()
    columns = list(cutoffs.columns)
    if (len(columns) == 1) and ('value' in columns):
        binary_rnaseq_data = binary_rnaseq_data.gt(list(cutoffs.value.values), axis=0)
    elif sorted(columns) == sorted(list(rnaseq_data.columns)):  # Check there is a column for each cell type
        for col in columns:
            binary_rnaseq_data[col] = binary_rnaseq_data[col].gt(list(cutoffs[col].values), axis=0) # ge
    else:
        raise KeyError("The cutoff data provided does not have a 'value' column or does not match the columns in rnaseq_data.")
    binary_rnaseq_data = binary_rnaseq_data.astype(float)
    return binary_rnaseq_data
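
The two accepted cutoff layouts behave as sketched below on made-up data. Note one difference from the source shown above: the library passes the cutoffs as a plain list, so they are matched to genes by position, whereas this sketch aligns them by gene label. Both give the same result when the two dataframes list the genes in the same order.

```python
import pandas as pd

rnaseq = pd.DataFrame({'CellA': [5.0, 2.0, 0.5], 'CellB': [1.0, 4.0, 3.0]},
                      index=['gene1', 'gene2', 'gene3'])

# Layout 1: a single 'value' column, one cutoff per gene for all samples.
general = pd.DataFrame({'value': [2.0, 2.0, 1.0]}, index=rnaseq.index)
binary_general = rnaseq.gt(general['value'], axis=0).astype(float)

# Layout 2: one column per sample, so each sample has its own cutoffs.
specific = pd.DataFrame({'CellA': [6.0, 1.0, 0.1], 'CellB': [0.5, 4.0, 2.0]},
                        index=rnaseq.index)
binary_specific = rnaseq.gt(specific).astype(float)

print(binary_general)
print(binary_specific)
```

The comparison is strict: gene2 in CellA has an expression of 2.0 and a general cutoff of 2.0, so it is set to 0.0.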

get_weighted_ppi(ppi_data, modified_rnaseq_data, column='value', interaction_columns=('A', 'B'))

Assigns preprocessed gene expression values to proteins in a list of protein-protein interactions.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication. Besides the columns named in interaction_columns, it must include a 'score' column.

modified_rnaseq_data : pandas.DataFrame Preprocessed gene expression given a communication scoring function to use. Rows are genes and columns are cell-types/tissues/samples.

column : str, default='value' Name of the column in modified_rnaseq_data from which the gene expression values are taken.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

Returns

weighted_ppi : pandas.DataFrame List of protein-protein interactions that contains gene expression values instead of the names of interacting proteins. Gene expression values are preprocessed given a communication scoring function to use.

Source code in cell2cell/preprocessing/integrate_data.py
def get_weighted_ppi(ppi_data, modified_rnaseq_data, column='value', interaction_columns=('A', 'B')):
    '''
    Assigns preprocessed gene expression values to
    proteins in a list of protein-protein interactions.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication. Besides
        the columns named in interaction_columns, it must include a
        'score' column.

    modified_rnaseq_data : pandas.DataFrame
        Preprocessed gene expression given a communication scoring
        function to use. Rows are genes and columns are
        cell-types/tissues/samples.

    column : str, default='value'
        Name of the column in modified_rnaseq_data from which the gene
        expression values are taken.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    Returns
    -------
    weighted_ppi : pandas.DataFrame
        List of protein-protein interactions that contains gene expression
        values instead of the names of interacting proteins. Gene expression
        values are preprocessed given a communication scoring function to
        use.
    '''
    prot_a = interaction_columns[0]
    prot_b = interaction_columns[1]
    weighted_ppi = ppi_data.copy()
    weighted_ppi[prot_a] = weighted_ppi[prot_a].apply(func=lambda row: modified_rnaseq_data.at[row, column]) # Replaced .loc by .at
    weighted_ppi[prot_b] = weighted_ppi[prot_b].apply(func=lambda row: modified_rnaseq_data.at[row, column])
    weighted_ppi = weighted_ppi[[prot_a, prot_b, 'score']].reset_index(drop=True).fillna(0.0)
    return weighted_ppi
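
The replacement of protein names by expression values can be sketched as a lookup per partner column. The data below are made up, and the sketch uses Series.map instead of the .at lookup from the source above: with map, a protein missing from the expression table becomes NaN and is then filled with 0.0, whereas the .at lookup would raise a KeyError.

```python
import pandas as pd

ppi_data = pd.DataFrame({
    'A': ['lig1', 'lig2', 'lig3'],
    'B': ['rec1', 'rec2', 'rec1'],
    'score': [1.0, 1.0, 1.0],
})

# Preprocessed (here binary) expression for one cell type; lig3 is absent.
modified_rnaseq = pd.DataFrame({'CellA': [1.0, 0.0, 1.0, 1.0]},
                               index=['lig1', 'lig2', 'rec1', 'rec2'])

weighted_ppi = ppi_data.copy()
for col in ('A', 'B'):
    # Look up each protein name in the expression values of CellA.
    weighted_ppi[col] = weighted_ppi[col].map(modified_rnaseq['CellA'])
weighted_ppi = weighted_ppi[['A', 'B', 'score']].reset_index(drop=True).fillna(0.0)
print(weighted_ppi)
```

Each row now holds the expression of both partners in the chosen cell type, ready to be combined by a communication score.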

manipulate_dataframes

check_presence_in_dataframe(df, elements, columns=None)

Searches for elements in a dataframe and returns those that are present in the dataframe.

Parameters

df : pandas.DataFrame A dataframe

elements : list List of elements to find in the dataframe. They must be a data type contained in the dataframe.

columns : list, default=None Names of columns to consider in the search. If None, all columns are used.

Returns

found_elements : list List of elements in the input list that were found in the dataframe.

Source code in cell2cell/preprocessing/manipulate_dataframes.py
def check_presence_in_dataframe(df, elements, columns=None):
    '''
    Searches for elements in a dataframe and returns those
    that are present in the dataframe.

    Parameters
    ----------
    df : pandas.DataFrame
        A dataframe

    elements : list
        List of elements to find in the dataframe. They
        must be a data type contained in the dataframe.

    columns : list, default=None
        Names of columns to consider in the search. If
        None, all columns are used.

    Returns
    -------
    found_elements : list
        List of elements in the input list that were found
        in the dataframe.
    '''
    if columns is None:
        columns = list(df.columns)
    df_elements = pd.Series(np.unique(df[columns].values.flatten()))
    df_elements = df_elements.loc[df_elements.isin(elements)].values
    found_elements = list(df_elements)
    return found_elements
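
A typical use is checking which proteins of interest appear in an interaction list. The sketch below repeats the logic shown above in a local helper named present (a hypothetical name) on made-up data, so it runs without the library.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['lig1', 'lig2', 'lig3'],
                   'B': ['rec1', 'rec2', 'rec1']})
elements = ['lig1', 'rec1', 'missing']


def present(df, elements, columns=None):
    """Return the elements that occur in the selected columns of df."""
    if columns is None:
        columns = list(df.columns)
    # Unique values across the selected columns, in sorted order.
    values = pd.Series(np.unique(df[columns].values.flatten()))
    return list(values.loc[values.isin(elements)].values)


found_all = present(df, elements)
found_in_a = present(df, elements, columns=['A'])
print(found_all)
print(found_in_a)
```

Restricting the search to column 'A' leaves out 'rec1', which only occurs in column 'B'.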

check_symmetry(df)

Checks whether a dataframe is symmetric.

Parameters

df : pandas.DataFrame A dataframe.

Returns

symmetric : boolean Whether a dataframe is symmetric.

Source code in cell2cell/preprocessing/manipulate_dataframes.py
def check_symmetry(df):
    '''
    Checks whether a dataframe is symmetric.

    Parameters
    ----------
    df : pandas.DataFrame
        A dataframe.

    Returns
    -------
    symmetric : boolean
        Whether a dataframe is symmetric.
    '''
    shape = df.shape
    if shape[0] == shape[1]:
        symmetric = (df.values.transpose() == df.values).all()
    else:
        symmetric = False
    return symmetric
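
The check compares the raw values with their transpose, so row and column labels play no role. The sketch below applies the same comparison through a hypothetical local helper, is_symmetric, to a few made-up dataframes.

```python
import numpy as np
import pandas as pd


def is_symmetric(df):
    """Same test as above: square shape and values equal to their transpose."""
    n_rows, n_cols = df.shape
    return n_rows == n_cols and bool((df.values.T == df.values).all())


symmetric_df = pd.DataFrame([[0.0, 1.0], [1.0, 0.0]])
asymmetric_df = pd.DataFrame([[0.0, 1.0], [2.0, 0.0]])
rectangular_df = pd.DataFrame([[0.0, 1.0, 2.0], [1.0, 0.0, 3.0]])
with_nan_df = pd.DataFrame([[np.nan, 1.0], [1.0, 0.0]])

for name, frame in [('symmetric', symmetric_df), ('asymmetric', asymmetric_df),
                    ('rectangular', rectangular_df), ('with NaN', with_nan_df)]:
    print(name, is_symmetric(frame))
```

Since NaN never compares equal to itself, a dataframe containing missing values is reported as not symmetric by this test.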

convert_to_distance_matrix(df)

Converts a symmetric dataframe into a distance dataframe. That is, diagonal elements are all zero.

Parameters

df : pandas.DataFrame A dataframe.

Returns

df_ : pandas.DataFrame A copy of df, but with all diagonal elements set to zero.

Source code in cell2cell/preprocessing/manipulate_dataframes.py
def convert_to_distance_matrix(df):
    '''
    Converts a symmetric dataframe into a distance dataframe.
    That is, diagonal elements are all zero.

    Parameters
    ----------
    df : pandas.DataFrame
        A dataframe.

    Returns
    -------
    df_ : pandas.DataFrame
        A copy of df, but with all diagonal elements set
        to zero.
    '''
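    if check_symmetry(df):
        df_ = df.copy()
        # Check every diagonal element; the trace alone can be zero when
        # nonzero diagonal values cancel out.
        if np.any(np.diag(df_.values) != 0.0):
            # Warn instead of raising, so the diagonal is still replaced below.
            import warnings
            warnings.warn("Diagonal elements are not zero. Automatically replaced by zeros")
        np.fill_diagonal(df_.values, 0.0)
    else:
        raise ValueError('The DataFrame is not symmetric')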
    return df_
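
As a rough illustration of the intended result, the sketch below zeroes the diagonal of a small symmetric dataframe without calling the library. The cell names and values are made up, and the sketch works on an explicit copy of the values rather than on df.values.

```python
import numpy as np
import pandas as pd

similarity = pd.DataFrame([[1.0, 0.3], [0.3, 1.0]],
                          index=['CellA', 'CellB'],
                          columns=['CellA', 'CellB'])

# Work on an explicit copy so the input dataframe is left untouched.
values = similarity.to_numpy(copy=True)
if not (values.T == values).all():
    raise ValueError('The DataFrame is not symmetric')
np.fill_diagonal(values, 0.0)
zero_diagonal = pd.DataFrame(values, index=similarity.index,
                             columns=similarity.columns)
print(zero_diagonal)
```

Only the diagonal changes; off-diagonal values are kept as they are, so any conversion from similarities to distances has to happen before this step.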

shuffle_cols_in_df(df, columns, shuffling_number=1, random_state=None)

Randomly shuffles specific columns in a dataframe.

Parameters

df : pandas.DataFrame A dataframe.

columns : list Names of columns to shuffle.

shuffling_number : int, default=1 Number of shuffles per column.

random_state : int, default=None Seed for randomization.

Returns

df_ : pandas.DataFrame A shuffled dataframe.

Source code in cell2cell/preprocessing/manipulate_dataframes.py
def shuffle_cols_in_df(df, columns, shuffling_number=1, random_state=None):
    '''
    Randomly shuffles specific columns in a dataframe.

    Parameters
    ----------
    df : pandas.DataFrame
        A dataframe.

    columns : list
        Names of columns to shuffle.

    shuffling_number : int, default=1
        Number of shuffles per column.

    random_state : int, default=None
        Seed for randomization.

    Returns
    -------
    df_ : pandas.DataFrame
        A shuffled dataframe.
    '''
    df_ = df.copy()
    if isinstance(columns, str):
        columns = [columns]

    for col in columns:
        for i in range(shuffling_number):
            if random_state is not None:
                np.random.seed(random_state + i)
            df_[col] = np.random.permutation(df_[col].values)
    return df_
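
One consequence of the loop above is worth seeing in isolation: when `random_state` is given, the seed is reset to `random_state + i` for every column, so each listed column should receive the same permutation of positions. The sketch below mirrors that loop with the standard library `random` module on a dict of lists. It is an illustration under those assumptions (`shuffle_columns` is a hypothetical name, and the generator differs from NumPy's), not the library's implementation.

```python
import random


def shuffle_columns(table, columns, shuffling_number=1, random_state=None):
    """Shuffle the listed columns of a dict-of-lists table, returning a copy."""
    shuffled = {name: list(values) for name, values in table.items()}
    for col in columns:
        for i in range(shuffling_number):
            # Re-seeding for every column mirrors the loop in the source above.
            seed = None if random_state is None else random_state + i
            random.Random(seed).shuffle(shuffled[col])
    return shuffled


table = {'gene': ['g1', 'g2', 'g3', 'g4'],
         'score': [1, 2, 3, 4],
         'label': ['a', 'b', 'c', 'd']}
out = shuffle_columns(table, ['gene', 'score'], random_state=0)

# 'gene' and 'score' were permuted identically, so their pairing is preserved,
# while the pairing with the untouched 'label' column is broken.
pairs_before = set(zip(table['gene'], table['score']))
pairs_after = set(zip(out['gene'], out['score']))
```

If independent permutations per column are needed, one option is to call the function once per column with different seeds.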

shuffle_dataframe(df, shuffling_number=1, axis=0, random_state=None)

Randomly shuffles a whole dataframe across a given axis.

Parameters

df : pandas.DataFrame A dataframe.

shuffling_number : int, default=1 Number of shuffles per column.

axis : int, default=0 An axis of the dataframe (0 across rows, 1 across columns). With axis=0, the values within each column are shuffled independently; with axis=1, the values within each row are shuffled independently.

random_state : int, default=None Seed for randomization.

Returns

df_ : pandas.DataFrame A shuffled dataframe.

Source code in cell2cell/preprocessing/manipulate_dataframes.py
def shuffle_dataframe(df, shuffling_number=1, axis=0, random_state=None):
    '''
    Randomly shuffles a whole dataframe across a given axis.

    Parameters
    ----------
    df : pandas.DataFrame
        A dataframe.

    shuffling_number : int, default=1
        Number of shuffles per column.

    axis : int, default=0
        An axis of the dataframe (0 across rows, 1 across columns).
        With axis=0, the values within each column are shuffled
        independently; with axis=1, the values within each row are
        shuffled independently.

    random_state : int, default=None
        Seed for randomization.

    Returns
    -------
    df_ : pandas.DataFrame
        A shuffled dataframe.
    '''
    df_ = df.copy()
    axis = int(not axis)  # pandas.DataFrame is always 2D
    to_shuffle = np.rollaxis(df_.values, axis)
    for _ in range(shuffling_number):
        for i, view in enumerate(to_shuffle):
            if random_state is not None:
                np.random.seed(random_state + i)
            np.random.shuffle(view)
    df_ = pd.DataFrame(np.rollaxis(to_shuffle, axis=axis), index=df_.index, columns=df_.columns)
    return df_
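
The meaning of `axis` is easy to mix up, so here is a small sketch of the two modes on a list of rows. It is a simplified stand-in: `shuffle_table` is a hypothetical name, and it uses a single standard-library generator instead of re-seeding NumPy for each row or column as the source does.

```python
import random


def shuffle_table(rows, axis=0, random_state=None):
    """Shuffle a list-of-rows table.

    axis=0 shuffles the values within each column independently;
    axis=1 shuffles the values within each row independently.
    """
    rng = random.Random(random_state)
    if axis == 0:
        columns = [list(col) for col in zip(*rows)]  # transpose to columns
        for col in columns:
            rng.shuffle(col)
        return [list(row) for row in zip(*columns)]  # transpose back
    shuffled = [list(row) for row in rows]
    for row in shuffled:
        rng.shuffle(row)
    return shuffled


rows = [[1, 10, 100],
        [2, 20, 200],
        [3, 30, 300]]
by_column = shuffle_table(rows, axis=0, random_state=1)  # columns keep their values
by_row = shuffle_table(rows, axis=1, random_state=1)     # rows keep their values
```

In both modes the set of values in each shuffled column (axis=0) or row (axis=1) is unchanged; only their order differs.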

shuffle_rows_in_df(df, rows, shuffling_number=1, random_state=None)

Randomly shuffles specific rows in a dataframe.

Parameters

df : pandas.DataFrame A dataframe.

rows : list Names of rows (or indexes) to shuffle.

shuffling_number : int, default=1 Number of shuffles per row.

random_state : int, default=None Seed for randomization.

Returns

df_.T : pandas.DataFrame A shuffled dataframe.

Source code in cell2cell/preprocessing/manipulate_dataframes.py
def shuffle_rows_in_df(df, rows, shuffling_number=1, random_state=None):
    '''
    Randomly shuffles specific rows in a dataframe.

    Parameters
    ----------
    df : pandas.DataFrame
        A dataframe.

    rows : list
        Names of rows (or indexes) to shuffle.

    shuffling_number : int, default=1
        Number of shuffles per row.

    random_state : int, default=None
        Seed for randomization.

    Returns
    -------
    df_.T : pandas.DataFrame
        A shuffled dataframe.
    '''
    df_ = df.copy().T
    if isinstance(rows, str):
        rows = [rows]

    for row in rows:
        for i in range(shuffling_number):
            if random_state is not None:
                np.random.seed(random_state + i)
            df_[row] = np.random.permutation(df_[row].values)
    return df_.T

subsample_dataframe(df, n_samples, random_state=None)

Randomly subsamples rows of a dataframe.

Parameters

df : pandas.DataFrame A dataframe.

n_samples : int Number of samples, rows in this case. If n_samples is larger than the number of rows, the entire dataframe will be returned, but shuffled.

random_state : int, default=None Seed for randomization.

Returns

subsampled_df : pandas.DataFrame A subsampled and shuffled dataframe.

Source code in cell2cell/preprocessing/manipulate_dataframes.py
def subsample_dataframe(df, n_samples, random_state=None):
    '''
    Randomly subsamples rows of a dataframe.

    Parameters
    ----------
    df : pandas.DataFrame
        A dataframe.

    n_samples : int
        Number of samples, rows in this case. If
        n_samples is larger than the number of rows,
        the entire dataframe will be returned, but
        shuffled.

    random_state : int, default=None
        Seed for randomization.

    Returns
    -------
    subsampled_df : pandas.DataFrame
        A subsampled and shuffled dataframe.
    '''
    items = list(df.index)
    if n_samples > len(items):
        n_samples = len(items)
    if isinstance(random_state, int):
        random.seed(random_state)
    random.shuffle(items)

    subsampled_df = df.loc[items[:n_samples],:]
    return subsampled_df
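
The subsampling above amounts to shuffling the row labels and keeping the first `n_samples` of them. A minimal sketch of that idea on a plain list of labels follows; `subsample_rows` is a hypothetical name, and it uses a local `random.Random` instance where the source seeds the module-level generator.

```python
import random


def subsample_rows(row_labels, n_samples, random_state=None):
    """Return up to n_samples row labels in shuffled order."""
    items = list(row_labels)
    n_samples = min(n_samples, len(items))  # cap at the number of rows
    rng = random.Random(random_state)
    rng.shuffle(items)
    return items[:n_samples]


labels = ['cell1', 'cell2', 'cell3', 'cell4', 'cell5']
subset = subsample_rows(labels, 3, random_state=42)
everything = subsample_rows(labels, 10, random_state=42)  # more than available
```

Requesting more samples than there are rows returns every row once, in shuffled order, which matches the behavior described for `n_samples` above.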

ppi

bidirectional_ppi_for_cci(ppi_data, interaction_columns=('A', 'B'), verbose=True)

Makes a list of protein-protein interactions bidirectional. That is, each PPI such as ProtA-ProtB is repeated in the other direction (ProtB-ProtA) if that direction is not already present.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

interaction_columns : tuple Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

bi_ppi_data : pandas.DataFrame Bidirectional ppi_data. Contains duplicated PPIs in both directions. That is, it contains both ProtA-ProtB and ProtB-ProtA interactions.

Source code in cell2cell/preprocessing/ppi.py
def bidirectional_ppi_for_cci(ppi_data, interaction_columns=('A', 'B'), verbose=True):
    '''
    Makes a list of protein-protein interactions bidirectional.
    That is, each PPI such as ProtA-ProtB is repeated in the other direction
    (ProtB-ProtA) if that direction is not already present.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    interaction_columns : tuple
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    bi_ppi_data : pandas.DataFrame
        Bidirectional ppi_data. Contains duplicated PPIs in both directions.
        That is, it contains both ProtA-ProtB and ProtB-ProtA interactions.
    '''
    if verbose:
        print("Making bidirectional PPI for CCI.")
    if ppi_data.shape[0] == 0:
        return ppi_data.copy()

    ppi_A = ppi_data.copy()
    col1 = ppi_A[[interaction_columns[0]]]
    col2 = ppi_A[[interaction_columns[1]]]
    ppi_B = ppi_data.copy()
    ppi_B[interaction_columns[0]] = col2
    ppi_B[interaction_columns[1]] = col1

    if verbose:
        print("Removing duplicates in bidirectional PPI network.")
    bi_ppi_data = pd.concat([ppi_A, ppi_B], join="inner")
    bi_ppi_data = bi_ppi_data.drop_duplicates()
    bi_ppi_data.reset_index(inplace=True, drop=True)
    return bi_ppi_data
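
The effect of making a PPI list bidirectional can be sketched on a plain list of `(A, B)` tuples. This is an illustration of the concatenate-then-deduplicate idea only; `make_bidirectional` is a hypothetical name, not a cell2cell function.

```python
def make_bidirectional(pairs):
    """Add the reverse of every (A, B) pair and drop exact duplicates."""
    reversed_pairs = [(b, a) for a, b in pairs]
    # dict.fromkeys keeps the first occurrence of each pair, in order
    return list(dict.fromkeys(pairs + reversed_pairs))


ppi = [('L1', 'R1'), ('R1', 'L1'), ('L2', 'R2')]
bi_ppi = make_bidirectional(ppi)
# bi_ppi == [('L1', 'R1'), ('R1', 'L1'), ('L2', 'R2'), ('R2', 'L2')]
```

The pair that already existed in both directions is not repeated. In the DataFrame version, duplicates are detected across all columns, so rows that differ in any additional column (for example a score) are kept as distinct rows.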

filter_complex_ppi_by_proteins(ppi_data, proteins, complex_sep='&', upper_letter_comparison=True, interaction_columns=('A', 'B'))

Filters a list of protein-protein interactions that contains protein complexes so that it keeps only interactions whose proteins or subunits are in a given list of protein or gene names.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

proteins : list A list of protein names to filter PPIs.

complex_sep : str, default='&' Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

upper_letter_comparison : boolean, default=True Whether to convert the protein names in the list of proteins and in ppi_data to uppercase so that they can be matched. Useful when there are inconsistencies between the names that come from an expression matrix and those from ligand-receptor annotations.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

Returns

integrated_ppi : pandas.DataFrame A filtered list of PPIs, containing protein complexes in some cases, by a given list of proteins or gene names.

Source code in cell2cell/preprocessing/ppi.py
def filter_complex_ppi_by_proteins(ppi_data, proteins, complex_sep='&', upper_letter_comparison=True,
                                   interaction_columns=('A', 'B')):
    '''
    Filters a list of protein-protein interactions that contains protein
    complexes so that it keeps only interactions whose proteins or subunits
    are in a given list of protein or gene names.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    proteins : list
        A list of protein names to filter PPIs.

    complex_sep : str, default='&'
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    upper_letter_comparison : boolean, default=True
        Whether to convert the protein names in the list of proteins and in
        ppi_data to uppercase so that they can be matched. Useful when there
        are inconsistencies between the names that come from an expression
        matrix and those from ligand-receptor annotations.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    Returns
    -------
    integrated_ppi : pandas.DataFrame
        A filtered list of PPIs, containing protein complexes in some cases,
        by a given list of proteins or gene names.
    '''
    col_a = interaction_columns[0]
    col_b = interaction_columns[1]

    integrated_ppi = ppi_data.copy()

    if upper_letter_comparison:
        integrated_ppi[col_a] = integrated_ppi[col_a].apply(lambda x: str(x).upper())
        integrated_ppi[col_b] = integrated_ppi[col_b].apply(lambda x: str(x).upper())
        prots = set([str(p).upper() for p in proteins])
    else:
        prots = set(proteins)

    col_a_genes, complex_a, col_b_genes, complex_b, complexes = get_genes_from_complexes(ppi_data=integrated_ppi,
                                                                                         complex_sep=complex_sep,
                                                                                         interaction_columns=interaction_columns
                                                                                         )

    shared_a_genes = set(col_a_genes & prots)
    shared_b_genes = set(col_b_genes & prots)

    shared_a_complexes = set(complex_a & prots)
    shared_b_complexes = set(complex_b & prots)

    integrated_a_complexes = set()
    integrated_b_complexes = set()
    for k, v in complexes.items():
        if all(p in shared_a_complexes for p in v):
            integrated_a_complexes.add(k)
        elif all(p in shared_b_complexes for p in v):
            integrated_b_complexes.add(k)

    integrated_a = shared_a_genes.union(integrated_a_complexes)
    integrated_b = shared_b_genes.union(integrated_b_complexes)

    filter = (integrated_ppi[col_a].isin(integrated_a)) & (integrated_ppi[col_b].isin(integrated_b))
    integrated_ppi = ppi_data.loc[filter].reset_index(drop=True)

    return integrated_ppi
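
The core rule for complexes is that a partner such as "CD74&CD44" is kept only when every one of its subunits is in the protein list. The sketch below shows that rule on a list of `(A, B)` tuples. It is a simplified illustration: `filter_pairs_with_complexes` is a hypothetical name, and it does not reproduce the per-column bookkeeping that the source performs through `get_genes_from_complexes`.

```python
def filter_pairs_with_complexes(pairs, proteins, complex_sep='&', upper=True):
    """Keep (A, B) pairs whose partners, or all of their subunits, are in proteins."""
    normalize = (lambda name: str(name).upper()) if upper else str
    allowed = {normalize(p) for p in proteins}

    def is_covered(partner):
        # A single protein is treated as a complex with one subunit.
        subunits = normalize(partner).split(complex_sep)
        return all(subunit in allowed for subunit in subunits)

    return [(a, b) for a, b in pairs if is_covered(a) and is_covered(b)]


pairs = [('MIF', 'CD74&CD44'),
         ('MIF', 'CD74&CXCR4'),
         ('Tgfb1', 'TGFBR1&TGFBR2')]
expressed = ['mif', 'cd74', 'cd44', 'tgfb1', 'tgfbr1']
kept = filter_pairs_with_complexes(pairs, expressed)
# kept == [('MIF', 'CD74&CD44')]
```

The second and third pairs are dropped because one subunit of the receptor complex (CXCR4 and TGFBR2, respectively) is missing from the list, even though the ligand and the other subunit are present.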

filter_ppi_by_proteins(ppi_data, proteins, complex_sep=None, upper_letter_comparison=True, interaction_columns=('A', 'B'))

Filters a list of protein-protein interactions to contain only interacting proteins in a list of specific protein or gene names.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

proteins : list A list of protein names to filter PPIs.

complex_sep : str, default=None Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

upper_letter_comparison : boolean, default=True Whether to convert the protein names in the list of proteins and in ppi_data to uppercase so that they can be matched. Useful when there are inconsistencies between the names that come from an expression matrix and those from ligand-receptor annotations.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

Returns

integrated_ppi : pandas.DataFrame A filtered list of PPIs by a given list of proteins or gene names.

Source code in cell2cell/preprocessing/ppi.py
def filter_ppi_by_proteins(ppi_data, proteins, complex_sep=None, upper_letter_comparison=True, interaction_columns=('A', 'B')):
    '''
    Filters a list of protein-protein interactions to contain
    only interacting proteins in a list of specific protein or gene names.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    proteins : list
        A list of protein names to filter PPIs.

    complex_sep : str, default=None
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    upper_letter_comparison : boolean, default=True
        Whether to convert the protein names in the list of proteins and in
        ppi_data to uppercase so that they can be matched. Useful when there
        are inconsistencies between the names that come from an expression
        matrix and those from ligand-receptor annotations.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    Returns
    -------
    integrated_ppi : pandas.DataFrame
        A filtered list of PPIs by a given list of proteins or gene names.
    '''

    col_a = interaction_columns[0]
    col_b = interaction_columns[1]

    integrated_ppi = ppi_data.copy()

    if upper_letter_comparison:
        integrated_ppi[col_a] = integrated_ppi[col_a].apply(lambda x: str(x).upper())
        integrated_ppi[col_b] = integrated_ppi[col_b].apply(lambda x: str(x).upper())
        prots = set([str(p).upper() for p in proteins])
    else:
        prots = set(proteins)

    if complex_sep is not None:
        integrated_ppi = filter_complex_ppi_by_proteins(ppi_data=integrated_ppi,
                                                        proteins=prots,
                                                        complex_sep=complex_sep,
                                                        upper_letter_comparison=False, # Because it was run above
                                                        interaction_columns=interaction_columns,
                                                        )
    else:
        integrated_ppi = integrated_ppi[(integrated_ppi[col_a].isin(prots)) & (integrated_ppi[col_b].isin(prots))]
    integrated_ppi = integrated_ppi.reset_index(drop=True)
    return integrated_ppi

filter_ppi_network(ppi_data, contact_proteins, mediator_proteins=None, reference_list=None, bidirectional=True, interaction_type='contacts', interaction_columns=('A', 'B'), verbose=True)

Filters a list of protein-protein interactions to contain interacting proteins involved in different kinds of cell-cell interactions/communication.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

contact_proteins : list Protein names of proteins participating in cell contact interactions (e.g. surface proteins, receptors).

mediator_proteins : list, default=None Protein names of proteins participating in mediated or secreted signaling (e.g. extracellular proteins, ligands). If None, only interactions involved in cell contacts will be returned.

reference_list : list, default=None Reference list of protein names. When it is not None, filtered PPIs from contact_proteins and mediator_proteins are kept only if those proteins are also present in this list.

bidirectional : boolean, default=True Whether to duplicate PPIs in both directions. That is, if the list contains the ProtA-ProtB interaction, the ProtB-ProtA interaction will also be included.

interaction_type : str, default='contacts' Type of intercellular interaction/communication in which the proteins must be involved. Available types are:

- 'contacts' : Contains proteins participating in cell contact
        interactions (e.g. surface proteins, receptors)
- 'mediated' : Contains proteins participating in mediated or
        secreted signaling (e.g. ligand-receptor interactions)
- 'combined' : Contains both 'contacts' and 'mediated' PPIs.
- 'complete' : Contains all combinations of interactions between
        ligands, receptors, surface proteins, etc.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

new_ppi_data : pandas.DataFrame A filtered list of PPIs by a given list of proteins or gene names depending on the type of intercellular communication.

Source code in cell2cell/preprocessing/ppi.py
def filter_ppi_network(ppi_data, contact_proteins, mediator_proteins=None, reference_list=None, bidirectional=True,
                       interaction_type='contacts', interaction_columns=('A', 'B'), verbose=True):
    '''
    Filters a list of protein-protein interactions to contain interacting
    proteins involved in different kinds of cell-cell interactions/communication.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    contact_proteins : list
        Protein names of proteins participating in cell contact
        interactions (e.g. surface proteins, receptors).

    mediator_proteins : list, default=None
        Protein names of proteins participating in mediated or
        secreted signaling (e.g. extracellular proteins, ligands).
        If None, only interactions involved in cell contacts
        will be returned.

    reference_list : list, default=None
        Reference list of protein names. When it is not None, filtered PPIs
        from contact_proteins and mediator_proteins are kept only if those
        proteins are also present in this list.

    bidirectional : boolean, default=True
        Whether to duplicate PPIs in both directions.
        That is, if the list contains the ProtA-ProtB interaction,
        the ProtB-ProtA interaction will also be included.

    interaction_type : str, default='contacts'
        Type of intercellular interaction/communication in which the
        proteins must be involved.
        Available types are:

        - 'contacts' : Contains proteins participating in cell contact
                interactions (e.g. surface proteins, receptors)
        - 'mediated' : Contains proteins participating in mediated or
                secreted signaling (e.g. ligand-receptor interactions)
        - 'combined' : Contains both 'contacts' and 'mediated' PPIs.
        - 'complete' : Contains all combinations of interactions between
                ligands, receptors, surface proteins, etc.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    new_ppi_data : pandas.DataFrame
        A filtered list of PPIs by a given list of proteins or gene names
        depending on the type of intercellular communication.
    '''

    new_ppi_data = get_filtered_ppi_network(ppi_data=ppi_data,
                                            contact_proteins=contact_proteins,
                                            mediator_proteins=mediator_proteins,
                                            reference_list=reference_list,
                                            interaction_type=interaction_type,
                                            interaction_columns=interaction_columns,
                                            verbose=verbose)

    if bidirectional:
        new_ppi_data = bidirectional_ppi_for_cci(ppi_data=new_ppi_data,
                                                 interaction_columns=interaction_columns,
                                                 verbose=verbose)
    return new_ppi_data

get_all_to_all_ppi(ppi_data, proteins, interaction_columns=('A', 'B'))

Filters a list of protein-protein interactions to contain only proteins in a given list in both columns of interacting partners.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

proteins : list A list of protein names to filter PPIs.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

Returns

new_ppi_data : pandas.DataFrame A filtered list of PPIs by a given list of proteins or gene names.

Source code in cell2cell/preprocessing/ppi.py
def get_all_to_all_ppi(ppi_data, proteins, interaction_columns=('A', 'B')):
    '''
    Filters a list of protein-protein interactions to
    contain only proteins in a given list in both
    columns of interacting partners.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    proteins : list
        A list of protein names to filter PPIs.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    Returns
    -------
    new_ppi_data : pandas.DataFrame
        A filtered list of PPIs by a given list of proteins or gene names.
    '''
    header_interactorA = interaction_columns[0]
    header_interactorB = interaction_columns[1]
    new_ppi_data = ppi_data.loc[ppi_data[header_interactorA].isin(proteins) & ppi_data[header_interactorB].isin(proteins)]
    new_ppi_data = new_ppi_data.drop_duplicates()
    return new_ppi_data

get_filtered_ppi_network(ppi_data, contact_proteins, mediator_proteins=None, reference_list=None, interaction_type='contacts', interaction_columns=('A', 'B'), verbose=True)

Filters a list of protein-protein interactions to contain interacting proteins involved in different kinds of cell-cell interactions/communication.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

contact_proteins : list Protein names of proteins participating in cell contact interactions (e.g. surface proteins, receptors).

mediator_proteins : list, default=None Protein names of proteins participating in mediated or secreted signaling (e.g. extracellular proteins, ligands). If None, only interactions involved in cell contacts will be returned.

reference_list : list, default=None Reference list of protein names. When it is not None, filtered PPIs from contact_proteins and mediator_proteins are kept only if those proteins are also present in this list.

interaction_type : str, default='contacts' Type of intercellular interaction/communication in which the proteins must be involved. Available types are:

- 'contacts' : Contains proteins participating in cell contact
        interactions (e.g. surface proteins, receptors)
- 'mediated' : Contains proteins participating in mediated or
        secreted signaling (e.g. ligand-receptor interactions)
- 'combined' : Contains both 'contacts' and 'mediated' PPIs.
- 'complete' : Contains all combinations of interactions between
        ligands, receptors, surface proteins, etc.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

new_ppi_data : pandas.DataFrame A filtered list of PPIs by a given list of proteins or gene names depending on the type of intercellular communication.

Source code in cell2cell/preprocessing/ppi.py
def get_filtered_ppi_network(ppi_data, contact_proteins, mediator_proteins=None, reference_list=None,
                             interaction_type='contacts', interaction_columns=('A', 'B'), verbose=True):
    '''
    Filters a list of protein-protein interactions to contain interacting
    proteins involved in different kinds of cell-cell interactions/communication.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    contact_proteins : list
        Protein names of proteins participating in cell contact
        interactions (e.g. surface proteins, receptors).

    mediator_proteins : list, default=None
        Protein names of proteins participating in mediated or
        secreted signaling (e.g. extracellular proteins, ligands).
        If None, only interactions involved in cell contacts
        will be returned.

    reference_list : list, default=None
        Reference list of protein names. When it is not None, filtered PPIs
        from contact_proteins and mediator_proteins are kept only if those
        proteins are also present in this list.

    interaction_type : str, default='contacts'
        Type of intercellular interaction/communication in which the
        proteins must be involved.
        Available types are:

        - 'contacts' : Contains proteins participating in cell contact
                interactions (e.g. surface proteins, receptors)
        - 'mediated' : Contains proteins participating in mediated or
                secreted signaling (e.g. ligand-receptor interactions)
        - 'combined' : Contains both 'contacts' and 'mediated' PPIs.
        - 'complete' : Contains all combinations of interactions between
                ligands, receptors, surface proteins, etc.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    new_ppi_data : pandas.DataFrame
        A filtered list of PPIs by a given list of proteins or gene names
        depending on the type of intercellular communication.
    '''
    if (mediator_proteins is None) and (interaction_type != 'contacts'):
        raise ValueError("mediator_proteins cannot be None when interaction_type is not contacts")

    # Consider only genes that are in the reference_list:
    if reference_list is not None:
        contact_proteins = list(filter(lambda x: x in reference_list , contact_proteins))
        if mediator_proteins is not None:
            mediator_proteins = list(filter(lambda x: x in reference_list, mediator_proteins))

    if verbose:
        print('Filtering PPI interactions by using a list of genes for {} interactions'.format(interaction_type))
    if interaction_type == 'contacts':
        new_ppi_data = get_all_to_all_ppi(ppi_data=ppi_data,
                                          proteins=contact_proteins,
                                          interaction_columns=interaction_columns)

    elif interaction_type == 'complete':
        total_proteins = list(set(contact_proteins + mediator_proteins))

        new_ppi_data = get_all_to_all_ppi(ppi_data=ppi_data,
                                          proteins=total_proteins,
                                          interaction_columns=interaction_columns)
    else:
        # All the following interactions incorporate contacts-mediator interactions
        mediated = get_one_group_to_other_ppi(ppi_data=ppi_data,
                                              proteins_a=contact_proteins,
                                              proteins_b=mediator_proteins,
                                              interaction_columns=interaction_columns)
        if interaction_type == 'mediated':
            new_ppi_data = mediated
        elif interaction_type == 'combined':
            contacts = get_all_to_all_ppi(ppi_data=ppi_data,
                                          proteins=contact_proteins,
                                          interaction_columns=interaction_columns)
            new_ppi_data = pd.concat([contacts, mediated], ignore_index = True).drop_duplicates()
        else:
            raise NameError('Not valid interaction type to filter the PPI network')
    new_ppi_data.reset_index(inplace=True, drop=True)
    return new_ppi_data
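
The four interaction types can be compared side by side with a small sketch on `(A, B)` tuples. Two things here are assumptions rather than facts from this section: `filter_by_interaction_type` is a hypothetical name, and the 'mediated' branch assumes that `get_one_group_to_other_ppi` (whose source is not shown here) keeps pairs with one partner in each group, in either column order. The protein names are illustrative only.

```python
def filter_by_interaction_type(pairs, contact, mediator=None, interaction_type='contacts'):
    """Filter (A, B) pairs by the kind of intercellular interaction."""
    if mediator is None and interaction_type != 'contacts':
        raise ValueError('mediator is required unless interaction_type is contacts')
    contact = set(contact)
    mediator = set(mediator or [])

    def all_to_all(group):
        # Both partners must belong to the same group.
        return [(a, b) for a, b in pairs if a in group and b in group]

    def group_to_group():
        # Assumption: one partner from each group, in either column order.
        return [(a, b) for a, b in pairs
                if (a in contact and b in mediator) or (a in mediator and b in contact)]

    if interaction_type == 'contacts':
        return all_to_all(contact)
    if interaction_type == 'complete':
        return all_to_all(contact | mediator)
    if interaction_type == 'mediated':
        return group_to_group()
    if interaction_type == 'combined':
        return list(dict.fromkeys(all_to_all(contact) + group_to_group()))
    raise ValueError('Not a valid interaction type')


pairs = [('ITGB1', 'VCAM1'), ('TGFB1', 'TGFBR1'), ('TGFB1', 'IL6')]
contact = ['ITGB1', 'VCAM1', 'TGFBR1']
mediator = ['TGFB1', 'IL6']

results = {kind: filter_by_interaction_type(pairs, contact, mediator, kind)
           for kind in ('contacts', 'mediated', 'combined', 'complete')}
```

Under these assumptions, 'contacts' keeps only the pair between two contact proteins, 'mediated' keeps only the contact-mediator pair, 'combined' keeps both of those, and 'complete' additionally keeps the mediator-mediator pair.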

get_genes_from_complexes(ppi_data, complex_sep='&', interaction_columns=('A', 'B'))

Gets protein/gene names for individual proteins (subunits when in complex) in a list of PPIs. If protein is a complex, for example ProtA&ProtB, it will return ProtA and ProtB separately.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

complex_sep : str, default='&' Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

Returns

col_a_genes : set Protein/gene names of the single (non-complex) proteins in the first column of interacting partners.

complex_a : set Subunit names of all complexes present in the first column of interacting partners.

col_b_genes : set Protein/gene names of the single (non-complex) proteins in the second column of interacting partners.

complex_b : set Subunit names of all complexes present in the second column of interacting partners.

complexes : dict Dictionary where keys are the complex names in the list of PPIs, while values are sets of subunits for the respective complex names.

Source code in cell2cell/preprocessing/ppi.py
def get_genes_from_complexes(ppi_data, complex_sep='&', interaction_columns=('A', 'B')):
    '''
    Gets protein/gene names for individual proteins (subunits when in complex)
    in a list of PPIs. If protein is a complex, for example ProtA&ProtB, it will
    return ProtA and ProtB separately.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    complex_sep : str, default='&'
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    Returns
    -------
    col_a_genes : set
        Protein/gene names of the single (non-complex) proteins in the first
        column of interacting partners.

    complex_a : set
        Subunit names of all complexes present in the first column of
        interacting partners.

    col_b_genes : set
        Protein/gene names of the single (non-complex) proteins in the second
        column of interacting partners.

    complex_b : set
        Subunit names of all complexes present in the second column of
        interacting partners.

    complexes : dict
        Dictionary where keys are the complex names in the list of PPIs, while
        values are sets of subunits for the respective complex names.
    '''
    col_a = interaction_columns[0]
    col_b = interaction_columns[1]

    col_a_genes = set()
    col_b_genes = set()

    complexes = dict()
    complex_a = set()
    complex_b = set()
    for idx, row in ppi_data.iterrows():
        prot_a = row[col_a]
        prot_b = row[col_b]

        if complex_sep in prot_a:
            comp = set([l for l in prot_a.split(complex_sep)])
            complexes[prot_a] = comp
            complex_a = complex_a.union(comp)
        else:
            col_a_genes.add(prot_a)

        if complex_sep in prot_b:
            comp = set([r for r in prot_b.split(complex_sep)])
            complexes[prot_b] = comp
            complex_b = complex_b.union(comp)
        else:
            col_b_genes.add(prot_b)

    return col_a_genes, complex_a, col_b_genes, complex_b, complexes
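
The splitting logic above can be illustrated with a minimal, standalone sketch. It does not call cell2cell; the helper name split_partners and the toy ligand-receptor pairs are assumptions made only for this example.

```python
import pandas as pd

# Hypothetical ligand-receptor list; '&' joins the subunits of a complex.
ppi = pd.DataFrame({'A': ['MIF', 'TGFB1'],
                    'B': ['CD74&CD44', 'TGFBR1&TGFBR2']})

def split_partners(column, complex_sep='&'):
    """Return (single proteins, subunits of complexes, complex -> subunits)."""
    singles, subunits, complexes = set(), set(), {}
    for name in column:
        if complex_sep in name:
            parts = set(name.split(complex_sep))
            complexes[name] = parts   # keep the mapping complex -> subunits
            subunits |= parts         # pool subunits of all complexes
        else:
            singles.add(name)         # plain protein, not a complex
    return singles, subunits, complexes

a_genes, a_subunits, a_complexes = split_partners(ppi['A'])
b_genes, b_subunits, b_complexes = split_partners(ppi['B'])
```

In this toy case the first column only holds single proteins, while every partner in the second column is a complex, so b_genes is empty and b_subunits collects the four receptor subunits.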

get_one_group_to_other_ppi(ppi_data, proteins_a, proteins_b, interaction_columns=('A', 'B'))

Filters a list of protein-protein interactions to contain specific proteins in the first column of interacting partners and other specific proteins in the second column.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

proteins_a : list A list of protein names to filter the first column of interacting proteins in a list of PPIs.

proteins_b : list A list of protein names to filter the second column of interacting proteins in a list of PPIs.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

Returns

new_ppi_data : pandas.DataFrame A list of PPIs filtered by the given lists of protein or gene names.

Source code in cell2cell/preprocessing/ppi.py
def get_one_group_to_other_ppi(ppi_data, proteins_a, proteins_b, interaction_columns=('A', 'B')):
    '''Filters a list of protein-protein interactions to
    contain specific proteins in the first column of
    interacting partners and other specific proteins in
    the second column.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    proteins_a : list
        A list of protein names to filter the first column of interacting
        proteins in a list of PPIs.

    proteins_b : list
        A list of protein names to filter the second column of interacting
        proteins in a list of PPIs.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    Returns
    -------
    new_ppi_data : pandas.DataFrame
        A list of PPIs filtered by the given lists of protein or gene names.
    '''
    header_interactorA = interaction_columns[0]
    header_interactorB = interaction_columns[1]
    direction1 = ppi_data.loc[ppi_data[header_interactorA].isin(proteins_a) & ppi_data[header_interactorB].isin(proteins_b)]
    direction2 = ppi_data.loc[ppi_data[header_interactorA].isin(proteins_b) & ppi_data[header_interactorB].isin(proteins_a)]
    direction2.columns = [header_interactorB, header_interactorA, 'score']
    direction2 = direction2[[header_interactorA, header_interactorB, 'score']]
    new_ppi_data = pd.concat([direction1, direction2], ignore_index=True).drop_duplicates()
    return new_ppi_data
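
As a rough illustration of the two-direction filtering described above, the following standalone sketch reproduces the idea with plain pandas. The protein names and the variable names ligands and receptors are hypothetical, and the sketch assumes the table has exactly the columns 'A', 'B' and 'score'.

```python
import pandas as pd

ppi = pd.DataFrame({'A': ['L1', 'R2', 'L3'],
                    'B': ['R1', 'L2', 'L1'],
                    'score': [1.0, 0.5, 0.8]})
ligands = ['L1', 'L2', 'L3']   # plays the role of proteins_a
receptors = ['R1', 'R2']       # plays the role of proteins_b

# Rows already oriented as (proteins_a, proteins_b)
forward = ppi[ppi['A'].isin(ligands) & ppi['B'].isin(receptors)]

# Rows stored the other way round: swap the partner columns so that
# proteins_a always end up in column 'A'
backward = ppi[ppi['A'].isin(receptors) & ppi['B'].isin(ligands)]
backward = backward.rename(columns={'A': 'B', 'B': 'A'})[['A', 'B', 'score']]

filtered = pd.concat([forward, backward], ignore_index=True).drop_duplicates()
```

The L3-L1 row is dropped because both partners belong to the same group, and the R2-L2 row is kept after being reoriented as L2-R2.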

preprocess_ppi_data(ppi_data, interaction_columns, sort_values=None, score=None, rnaseq_genes=None, complex_sep=None, dropna=False, strna='', upper_letter_comparison=True, verbose=True)

Preprocesses a list of protein-protein interactions by removing bidirectionality and keeping the minimum number of columns.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

interaction_columns : tuple Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

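sort_values : str, default=None Column name to sort PPIs in an ascending manner. If None, sorting is not done.

score : str, default=None Column name where weights for the PPIs are specified. If None, a default score of one is automatically assigned to each PPI.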

rnaseq_genes : list, default=None List of protein or gene names to filter the PPIs.

complex_sep : str, default=None Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

dropna : boolean, default=False Whether dropping incomplete PPIs (with NaN values).

strna : str, default='' String to replace empty or NaN values with.

upper_letter_comparison : boolean, default=True Whether making uppercase the gene names in the expression matrices and the protein names in the ppi_data to match their names and integrate their respective expression level. Useful when there are inconsistencies in the names between the expression matrix and the ligand-receptor annotations.

verbose : boolean, default=True Whether printing or not steps of the analysis.

Returns

simplified_ppi : pandas.DataFrame A simplified list of protein-protein interactions. It contains neither duplicated interactions in both directions (if ProtA-ProtB and ProtB-ProtA interactions are present, only the one that appears first is kept) nor extra columns beyond the interacting ones. It contains only three columns: 'A', 'B', 'score', wherein 'A' and 'B' are the interacting partners in the PPI and 'score' represents a weight of the interaction for computing cell-cell interactions/communication.

Source code in cell2cell/preprocessing/ppi.py
def preprocess_ppi_data(ppi_data, interaction_columns, sort_values=None, score=None, rnaseq_genes=None, complex_sep=None,
             dropna=False, strna='', upper_letter_comparison=True, verbose=True):
    '''
    Preprocesses a list of protein-protein interactions by
    removing bidirectionality and keeping the minimum number of columns.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    interaction_columns : tuple
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

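    sort_values : str, default=None
        Column name to sort PPIs in an ascending manner. If None,
        sorting is not done.

    score : str, default=None
        Column name where weights for the PPIs are specified. If
        None, a default score of one is automatically assigned
        to each PPI.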

    rnaseq_genes : list, default=None
        List of protein or gene names to filter the PPIs.

    complex_sep : str, default=None
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    dropna : boolean, default=False
        Whether dropping incomplete PPIs (with NaN values).

    strna : str, default=''
        String to replace empty or NaN values with.

    upper_letter_comparison : boolean, default=True
        Whether making uppercase the gene names in the expression matrices and the
        protein names in the ppi_data to match their names and integrate their
        respective expression level. Useful when there are inconsistencies in the
        names between the expression matrix and the ligand-receptor annotations.

    verbose : boolean, default=True
        Whether printing or not steps of the analysis.

    Returns
    -------
    simplified_ppi : pandas.DataFrame
        A simplified list of protein-protein interactions. It contains neither
        duplicated interactions in both directions (if ProtA-ProtB and
        ProtB-ProtA interactions are present, only the one that appears first
        is kept) nor extra columns beyond the interacting ones. It contains only
        three columns: 'A', 'B', 'score', wherein 'A' and 'B' are the interacting
        partners in the PPI and 'score' represents a weight of the interaction
        for computing cell-cell interactions/communication.
    '''
    if sort_values is not None:
        ppi_data = ppi_data.sort_values(by=sort_values)
    if dropna:
        # list() is needed because indexing a DataFrame with a tuple looks up a single column
        ppi_data = ppi_data.loc[ppi_data[list(interaction_columns)].dropna().index,:]
    if strna is not None:
        assert(isinstance(strna, str)), "strna has to be a string."
        ppi_data = ppi_data.fillna(strna)
    unidirectional_ppi = remove_ppi_bidirectionality(ppi_data, interaction_columns, verbose=verbose)
    simplified_ppi = simplify_ppi(unidirectional_ppi, interaction_columns, score, verbose=verbose)
    if rnaseq_genes is not None:
        if complex_sep is None:
            simplified_ppi = filter_ppi_by_proteins(ppi_data=simplified_ppi,
                                                    proteins=rnaseq_genes,
                                                    upper_letter_comparison=upper_letter_comparison,
                                                    )
        else:
            simplified_ppi = filter_complex_ppi_by_proteins(ppi_data=simplified_ppi,
                                                            proteins=rnaseq_genes,
                                                            complex_sep=complex_sep,
                                                            interaction_columns=('A', 'B'),
                                                            upper_letter_comparison=upper_letter_comparison
                                                            )
    simplified_ppi = simplified_ppi.drop_duplicates().reset_index(drop=True)
    return simplified_ppi
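
The difference between the dropna and strna options can be seen on a toy table. This is a minimal sketch with plain pandas rather than a call to cell2cell; the column names 'ligand', 'receptor' and 'source' are hypothetical.

```python
import numpy as np
import pandas as pd

ppi = pd.DataFrame({'ligand': ['L1', 'L2', np.nan],
                    'receptor': ['R1', np.nan, 'R3'],
                    'source': ['db1', np.nan, 'db2']})
cols = ['ligand', 'receptor']   # plays the role of interaction_columns

# dropna=True: keep only rows where both interacting partners are present
complete = ppi.loc[ppi[cols].dropna().index, :]

# strna='': every missing value is replaced by an empty string instead
filled = ppi.fillna('')
```

With dropna only the first row survives, whereas with strna all three rows are kept and the missing partners become empty strings.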

remove_ppi_bidirectionality(ppi_data, interaction_columns, verbose=True)

Removes duplicate interactions. For example, when ProtA-ProtB and ProtB-ProtA interactions are present in the dataset, only one of them will be kept.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

interaction_columns : tuple Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

verbose : boolean, default=True Whether printing or not steps of the analysis.

Returns

unidirectional_ppi : pandas.DataFrame List of protein-protein interactions without duplicated interactions in both directions (if ProtA-ProtB and ProtB-ProtA interactions are present, only the one that appears first is kept).

Source code in cell2cell/preprocessing/ppi.py
def remove_ppi_bidirectionality(ppi_data, interaction_columns, verbose=True):
    '''
    Removes duplicate interactions. For example, when ProtA-ProtB and
    ProtB-ProtA interactions are present in the dataset, only one of
    them will be kept.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    interaction_columns : tuple
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    verbose : boolean, default=True
        Whether printing or not steps of the analysis.

    Returns
    -------
    unidirectional_ppi : pandas.DataFrame
        List of protein-protein interactions without duplicated interactions
        in both directions (if ProtA-ProtB and ProtB-ProtA interactions are
        present, only the one that appears first is kept).
    '''
    if verbose:
        print('Removing bidirectionality of PPI network')
    header_interactorA = interaction_columns[0]
    header_interactorB = interaction_columns[1]
    IA = ppi_data[[header_interactorA, header_interactorB]]
    IB = ppi_data[[header_interactorB, header_interactorA]]
    IB.columns = [header_interactorA, header_interactorB]
    repeated_interactions = pd.merge(IA, IB, on=[header_interactorA, header_interactorB])
    repeated = list(np.unique(repeated_interactions.values.flatten()))
    df = pd.DataFrame(combinations(sorted(repeated), 2), columns=[header_interactorA, header_interactorB])
    df = df[[header_interactorB, header_interactorA]]   # To keep lexicographically sorted interactions
    df.columns = [header_interactorA, header_interactorB]
    unidirectional_ppi = pd.merge(ppi_data, df, indicator=True, how='outer').query('_merge=="left_only"').drop('_merge', axis=1)
    unidirectional_ppi.reset_index(drop=True, inplace=True)
    return unidirectional_ppi
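
For intuition, the following standalone sketch removes reversed duplicates with a different, simpler technique than the merge-based code above: it builds an order-independent key per pair and keeps the first occurrence. The function name drop_reversed_pairs is hypothetical, and the result may differ from remove_ppi_bidirectionality in which of the two rows is retained.

```python
import pandas as pd

def drop_reversed_pairs(ppi, cols=('A', 'B')):
    """Keep the first row of every unordered pair of partners."""
    a, b = cols
    # ('X', 'Y') and ('Y', 'X') map to the same sorted tuple
    keys = [tuple(sorted(pair)) for pair in zip(ppi[a], ppi[b])]
    is_repeat = pd.Series(keys, index=ppi.index).duplicated(keep='first')
    return ppi.loc[~is_repeat].reset_index(drop=True)

ppi = pd.DataFrame({'A': ['L1', 'R1', 'L2'],
                    'B': ['R1', 'L1', 'R2']})
unidirectional = drop_reversed_pairs(ppi)
```

Here the R1-L1 row is discarded because L1-R1 already appeared, leaving two interactions.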

simplify_ppi(ppi_data, interaction_columns, score=None, verbose=True)

Reduces a dataframe of protein-protein interactions into a simpler version with only three columns (A, B and score).

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

interaction_columns : tuple Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

score : str, default=None Column name where weights for the PPIs are specified. If None, a default score of one is automatically assigned to each PPI.

verbose : boolean, default=True Whether printing or not steps of the analysis.

Returns

simple_ppi : pandas.DataFrame A simplified dataframe of protein-protein interactions with only three columns: 'A', 'B', 'score', wherein 'A' and 'B' are the interacting partners in the PPI and 'score' represents a weight of the interaction for computing cell-cell interactions/communication.

Source code in cell2cell/preprocessing/ppi.py
def simplify_ppi(ppi_data, interaction_columns, score=None, verbose=True):
    '''
    Reduces a dataframe of protein-protein interactions into
    a simpler version with only three columns (A, B and score).

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    interaction_columns : tuple
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    score : str, default=None
        Column name where weights for the PPIs are specified. If
        None, a default score of one is automatically assigned
        to each PPI.

    verbose : boolean, default=True
        Whether printing or not steps of the analysis.

    Returns
    -------
    simple_ppi : pandas.DataFrame
        A simplified dataframe of protein-protein interactions with only
        three columns: 'A', 'B', 'score', wherein 'A' and 'B' are the
        interacting partners in the PPI and 'score' represents a weight
        of the interaction for computing cell-cell interactions/communication.
    '''
    if verbose:
        print('Simplifying PPI network')
    header_interactorA = interaction_columns[0]
    header_interactorB = interaction_columns[1]
    if score is None:
        simple_ppi = ppi_data[[header_interactorA, header_interactorB]]
        simple_ppi = simple_ppi.assign(score = 1.0)
    else:
        # Copy the selection so that filling scores below does not write into a view of ppi_data
        simple_ppi = ppi_data[[header_interactorA, header_interactorB, score]].copy()
        simple_ppi[score] = simple_ppi[score].fillna(np.nanmin(simple_ppi[score].values))
    cols = ['A', 'B', 'score']
    simple_ppi.columns = cols
    return simple_ppi
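
The two scoring branches can be sketched on a toy table as follows. This is an illustration with plain pandas under assumed column names ('ligand', 'receptor', 'weight', 'source'), not a call to simplify_ppi itself.

```python
import numpy as np
import pandas as pd

ppi = pd.DataFrame({'ligand': ['L1', 'L2', 'L3'],
                    'receptor': ['R1', 'R2', 'R3'],
                    'weight': [0.9, np.nan, 0.4],
                    'source': ['db1', 'db2', 'db1']})

# score=None: every interaction gets a weight of 1.0
unweighted = ppi[['ligand', 'receptor']].assign(score=1.0)
unweighted.columns = ['A', 'B', 'score']

# score='weight': missing weights are replaced by the smallest observed one
weighted = ppi[['ligand', 'receptor', 'weight']].copy()
weighted['weight'] = weighted['weight'].fillna(weighted['weight'].min())
weighted.columns = ['A', 'B', 'score']
```

In both cases the extra 'source' column is discarded and the remaining columns are renamed to 'A', 'B' and 'score'.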

rnaseq

add_complexes_to_expression(rnaseq_data, complexes, agg_method='min')

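Adds multimeric complexes into the gene expression matrix. Their expression values are aggregated from the respective subunits composing them according to agg_method (by default, the minimum expression value among the subunits). Complexes with any subunit missing from the matrix are assigned an expression of zero.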

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for RNA-seq experiment. Columns are cell-types/tissues/samples and rows are genes.

complexes : dict Dictionary where keys are the complex names in the list of PPIs, while values are list of subunits for the respective complex names.

agg_method : str, default='min' Method to aggregate the expression value of multiple genes in a complex.

- 'min' : Minimum expression value among all genes.
- 'mean' : Average expression value among all genes.
- 'gmean' : Geometric mean expression value among all genes.
Returns

tmp_rna : pandas.DataFrame Gene expression data for RNA-seq experiment containing multimeric complex names. Their expression values are aggregated from the respective subunits composing them according to agg_method. Columns are cell-types/tissues/samples and rows are genes.

Source code in cell2cell/preprocessing/rnaseq.py
def add_complexes_to_expression(rnaseq_data, complexes, agg_method='min'):
    '''
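    Adds multimeric complexes into the gene expression matrix.
    Their expression values are aggregated from the respective
    subunits composing them according to agg_method (by default,
    the minimum expression value among the subunits). Complexes
    with any subunit missing from the matrix are assigned an
    expression of zero.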

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for RNA-seq experiment. Columns are
        cell-types/tissues/samples and rows are genes.

    complexes : dict
        Dictionary where keys are the complex names in the list of PPIs, while
        values are list of subunits for the respective complex names.

    agg_method : str, default='min'
        Method to aggregate the expression value of multiple genes in a
        complex.

        - 'min' : Minimum expression value among all genes.
        - 'mean' : Average expression value among all genes.
        - 'gmean' : Geometric mean expression value among all genes.

    Returns
    -------
    tmp_rna : pandas.DataFrame
        Gene expression data for RNA-seq experiment containing multimeric
        complex names. Their expression values are aggregated from the
        respective subunits composing them according to agg_method. Columns
        are cell-types/tissues/samples and rows are genes.
    '''
    tmp_rna = rnaseq_data.copy()
    for k, v in complexes.items():
        if all(g in tmp_rna.index for g in v):
            df = tmp_rna.loc[v, :]
            if agg_method == 'min':
                tmp_rna.loc[k] = df.min().values.tolist()
            elif agg_method == 'mean':
                tmp_rna.loc[k] = df.mean().values.tolist()
            elif agg_method == 'gmean':
                tmp_rna.loc[k] = df.apply(lambda x: np.exp(np.mean(np.log(x)))).values.tolist()
            else:
                raise ValueError("{} is not a valid agg_method".format(agg_method))
        else:
            tmp_rna.loc[k] = [0] * tmp_rna.shape[1]
    return tmp_rna
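
To make the three aggregation options concrete, here is a small standalone sketch that computes them for one hypothetical complex, 'CD74&CD44', on made-up expression values. It mirrors the idea of the function above with plain pandas and NumPy rather than calling it.

```python
import numpy as np
import pandas as pd

rnaseq = pd.DataFrame({'CellA': [4.0, 1.0, 3.0],
                       'CellB': [2.0, 8.0, 0.0]},
                      index=['CD74', 'CD44', 'MIF'])
subunits = ['CD74', 'CD44']   # subunits of the hypothetical complex
sub = rnaseq.loc[subunits]

agg_min = sub.min()                      # 'min': limiting subunit per sample
agg_mean = sub.mean()                    # 'mean': arithmetic mean per sample
agg_gmean = np.exp(np.log(sub).mean())   # 'gmean': geometric mean per sample

# Append the complex as a new row, here using the minimum
with_complex = rnaseq.copy()
with_complex.loc['CD74&CD44'] = agg_min.to_numpy()
```

The minimum treats the least expressed subunit as limiting for the whole complex, which is one common motivation for using it as the default.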

aggregate_single_cells(rnaseq_data, metadata, barcode_col='barcodes', celltype_col='cell_types', method='average', transposed=True)

Aggregates gene expression of single cells into cell types for each gene.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for a single-cell RNA-seq experiment. Columns are single cells and rows are genes. If columns are genes and rows are single cells, specify transposed=True.

metadata : pandas.DataFrame Metadata containing the cell types for each single cell in the RNA-seq dataset.

barcode_col : str, default='barcodes' Column-name for the single cells in the metadata.

celltype_col : str, default='cell_types' Column-name in the metadata for grouping single cells into cell types by the selected aggregation method.

method : str, default='average' Specifies the method to use to aggregate gene expression of single cells into their respective cell types. Used to perform the CCI analysis since it is on the cell types rather than single cells. Options are:

- 'nn_cell_fraction' : Among the single cells composing a cell type, it
    calculates the fraction of single cells with non-zero count values
    of a given gene.
- 'average' : Computes the average gene expression among the single cells
    composing a cell type for a given gene.

transposed : boolean, default=True Whether the rnaseq_data is organized with columns as genes and rows as single cells.

Returns

agg_df : pandas.DataFrame Dataframe containing the gene expression values that were aggregated by cell types. Columns are cell types and rows are genes.

Source code in cell2cell/preprocessing/rnaseq.py
def aggregate_single_cells(rnaseq_data, metadata, barcode_col='barcodes', celltype_col='cell_types', method='average',
                           transposed=True):
    '''Aggregates gene expression of single cells into cell types for each gene.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for a single-cell RNA-seq experiment. Columns are
        single cells and rows are genes. If columns are genes and rows are
        single cells, specify transposed=True.

    metadata : pandas.DataFrame
        Metadata containing the cell types for each single cell in the
        RNA-seq dataset.

    barcode_col : str, default='barcodes'
        Column-name for the single cells in the metadata.

    celltype_col : str, default='cell_types'
        Column-name in the metadata for grouping single cells into cell types
        by the selected aggregation method.

    method : str, default='average'
        Specifies the method to use to aggregate gene expression of single
        cells into their respective cell types. Used to perform the CCI
        analysis since it is on the cell types rather than single cells.
        Options are:

        - 'nn_cell_fraction' : Among the single cells composing a cell type, it
            calculates the fraction of single cells with non-zero count values
            of a given gene.
        - 'average' : Computes the average gene expression among the single cells
            composing a cell type for a given gene.

    transposed : boolean, default=True
        Whether the rnaseq_data is organized with columns as
        genes and rows as single cells.

    Returns
    -------
    agg_df : pandas.DataFrame
        Dataframe containing the gene expression values that were aggregated
        by cell types. Columns are cell types and rows are genes.
    '''
    assert metadata is not None, "Please provide metadata containing the barcodes and cell-type annotation."
    assert method in ['average', 'nn_cell_fraction'], "{} is not a valid option for method".format(method)

    meta = metadata.reset_index()
    meta = meta[[barcode_col, celltype_col]].set_index(barcode_col)
    mapper = meta[celltype_col].to_dict()

    if transposed:
        df = rnaseq_data.copy()  # Copy so the caller's DataFrame is not relabeled in place
    else:
        df = rnaseq_data.T
    df.index = [mapper[c] for c in df.index]
    df.index.name = 'celltype'
    df.reset_index(inplace=True)

    agg_df = pd.DataFrame(index=df.columns).drop('celltype')

    for celltype, ct_df in df.groupby('celltype'):
        ct_df = ct_df.drop('celltype', axis=1)
        if method == 'average':
            agg = ct_df.mean()
        elif method == 'nn_cell_fraction':
            agg = ((ct_df > 0).sum() / ct_df.shape[0])
        agg_df[celltype] = agg
    return agg_df
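
The two aggregation methods can be sketched with a pandas groupby, which is a different route from the explicit loop above but computes the same kind of summary. The barcodes, gene names and cell-type labels below are made up for illustration.

```python
import pandas as pd

# Rows are single cells and columns are genes (the transposed=True layout)
counts = pd.DataFrame({'GeneX': [0, 2, 4, 6],
                       'GeneY': [1, 0, 0, 3]},
                      index=['bc1', 'bc2', 'bc3', 'bc4'])
metadata = pd.DataFrame({'barcodes': ['bc1', 'bc2', 'bc3', 'bc4'],
                         'cell_types': ['T', 'T', 'B', 'B']})

# Cell-type label of every barcode, aligned to the rows of counts
labels = metadata.set_index('barcodes')['cell_types'].reindex(counts.index)
groups = labels.to_numpy()

# 'average': mean expression of each gene within each cell type
average = counts.groupby(groups).mean().T

# 'nn_cell_fraction': fraction of cells with non-zero counts per cell type
nn_cell_fraction = (counts > 0).astype(int).groupby(groups).mean().T
```

Both results have genes as rows and cell types as columns, matching the orientation described for agg_df.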

divide_expression_by_max(rnaseq_data, axis=1)

Normalizes each gene value given the max value across an axis.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for RNA-seq experiment. Columns are cell-types/tissues/samples and rows are genes.

axis : int, default=1 Axis to perform the max-value normalization. Options are {0 for normalizing across rows (column-wise) or 1 for normalizing across columns (row-wise)}.

Returns

new_data : pandas.DataFrame A gene expression data for RNA-seq experiment with normalized values. Columns are cell-types/tissues/samples and rows are genes.

Source code in cell2cell/preprocessing/rnaseq.py
def divide_expression_by_max(rnaseq_data, axis=1):
    '''
    Normalizes each gene value given the max value across an axis.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for RNA-seq experiment. Columns are
        cell-types/tissues/samples and rows are genes.

    axis : int, default=1
        Axis to perform the max-value normalization. Options
        are {0 for normalizing across rows (column-wise) or 1 for
        normalizing across columns (row-wise)}.

    Returns
    -------
    new_data : pandas.DataFrame
        A gene expression data for RNA-seq experiment with
        normalized values. Columns are
        cell-types/tissues/samples and rows are genes.
    '''
    new_data = rnaseq_data.div(rnaseq_data.max(axis=axis), axis=int(not axis))
    new_data = new_data.fillna(0.0).replace(np.inf, 0.0)
    return new_data
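
The effect of the axis argument is easiest to see on a toy matrix. The sketch below repeats the same div/max pattern with plain pandas; the gene and sample names are hypothetical.

```python
import pandas as pd

rnaseq = pd.DataFrame({'S1': [2.0, 0.0, 10.0],
                       'S2': [4.0, 0.0, 5.0]},
                      index=['G1', 'G2', 'G3'])

# axis=1: each gene (row) is divided by its own maximum across samples
by_gene = rnaseq.div(rnaseq.max(axis=1), axis=0).fillna(0.0)

# axis=0: each sample (column) is divided by its own maximum across genes
by_sample = rnaseq.div(rnaseq.max(axis=0), axis=1).fillna(0.0)
```

G2 is zero everywhere, so the row-wise division yields 0/0; the fillna step turns those undefined values into zeros, as in the function above.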

divide_expression_by_mean(rnaseq_data, axis=1)

Normalizes each gene value given the mean value across an axis.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for RNA-seq experiment. Columns are cell-types/tissues/samples and rows are genes.

axis : int, default=1 Axis to perform the mean-value normalization. Options are {0 for normalizing across rows (column-wise) or 1 for normalizing across columns (row-wise)}.

Returns

new_data : pandas.DataFrame A gene expression data for RNA-seq experiment with normalized values. Columns are cell-types/tissues/samples and rows are genes.

Source code in cell2cell/preprocessing/rnaseq.py
def divide_expression_by_mean(rnaseq_data, axis=1):
    '''
    Normalizes each gene value given the mean value across an axis.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for RNA-seq experiment. Columns are
        cell-types/tissues/samples and rows are genes.

    axis : int, default=1
        Axis to perform the mean-value normalization. Options
        are {0 for normalizing across rows (column-wise) or 1 for
        normalizing across columns (row-wise)}.

    Returns
    -------
    new_data : pandas.DataFrame
        A gene expression data for RNA-seq experiment with
        normalized values. Columns are
        cell-types/tissues/samples and rows are genes.
    '''
    new_data = rnaseq_data.div(rnaseq_data.mean(axis=axis), axis=int(not axis))
    new_data = new_data.fillna(0.0).replace(np.inf, 0.0)
    return new_data

drop_empty_genes(rnaseq_data)

Drops genes that are all zeroes and/or without expression values for all cell-types/tissues/samples.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for RNA-seq experiment. Columns are cell-types/tissues/samples and rows are genes.

Returns

data : pandas.DataFrame A gene expression data for RNA-seq experiment without empty genes. Columns are cell-types/tissues/samples and rows are genes.

Source code in cell2cell/preprocessing/rnaseq.py
def drop_empty_genes(rnaseq_data):
    '''Drops genes that are all zeroes and/or without
    expression values for all cell-types/tissues/samples.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for RNA-seq experiment. Columns are
        cell-types/tissues/samples and rows are genes.

    Returns
    -------
    data : pandas.DataFrame
        A gene expression data for RNA-seq experiment without
        empty genes. Columns are
        cell-types/tissues/samples and rows are genes.
    '''
    data = rnaseq_data.copy()
    data = data.dropna(how='all')
    data = data.fillna(0)  # Put zeros to all missing values
    data = data.loc[data.sum(axis=1) != 0]  # Drop rows with 0 across all cells/tissues
    return data
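
Each of the three steps removes or repairs a different kind of row, which the following toy example makes explicit. It repeats the same pandas calls on made-up data.

```python
import numpy as np
import pandas as pd

rnaseq = pd.DataFrame({'S1': [1.0, 0.0, np.nan, np.nan],
                       'S2': [3.0, 0.0, np.nan, 2.0]},
                      index=['G1', 'G2', 'G3', 'G4'])

kept = rnaseq.dropna(how='all')          # G3 has no values at all
kept = kept.fillna(0)                    # G4's missing value becomes 0
kept = kept.loc[kept.sum(axis=1) != 0]   # G2 is zero in every sample
```

Only G1 and G4 remain; G4 is retained because it is expressed in one sample even though its other value was missing.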

log10_transformation(rnaseq_data, addition=1e-06)

Log-transforms gene expression values in a gene expression matrix for an RNA-seq experiment.

Parameters

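rnaseq_data : pandas.DataFrame Gene expression data for RNA-seq experiment. Columns are cell-types/tissues/samples and rows are genes.

addition : float, default=1e-6 Value added to every expression value before taking the logarithm, so that zero expression values do not produce log10(0).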

Returns

data : pandas.DataFrame A gene expression data for RNA-seq experiment with log-transformed values. Values are log10(expression + addition). Columns are cell-types/tissues/samples and rows are genes.

Source code in cell2cell/preprocessing/rnaseq.py
def log10_transformation(rnaseq_data, addition = 1e-6):
    '''Log-transforms gene expression values in a
    gene expression matrix for an RNA-seq experiment.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for RNA-seq experiment. Columns are
        cell-types/tissues/samples and rows are genes.

    addition : float, default=1e-6
        Value added to every expression value before the
        transformation, so that zeros do not produce log10(0).

    Returns
    -------
    data : pandas.DataFrame
        Gene expression data for RNA-seq experiment with
        log-transformed values. Values are log10(expression + addition).
        Columns are cell-types/tissues/samples and rows are genes.
    '''
    ### Apply this only after applying "drop_empty_genes" function
    data = rnaseq_data.copy()
    data = data.apply(lambda x: np.log10(x + addition))
    data = data.replace([np.inf, -np.inf], np.nan)
    return data
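
The effect of `addition` is easiest to see on a small matrix. The sketch below reproduces the log10(expression + addition) formula with NumPy and pandas only. The input values are invented, and the pseudocount of 1 is an assumption chosen to make the results round numbers; it is not the library default.

```python
import numpy as np
import pandas as pd

# Hypothetical expression values chosen so that value + 1 is a power of ten
rnaseq_data = pd.DataFrame(
    {"sample_A": [0.0, 9.0], "sample_B": [99.0, 999.0]},
    index=["gene1", "gene2"],
)

# log10(expression + addition) with a pseudocount of 1
log_data = np.log10(rnaseq_data + 1.0)

# With the default addition of 1e-6, a zero maps to log10(1e-6)
zero_with_default = np.log10(0.0 + 1e-6)

print(log_data)
print(zero_with_default)
```

With the small default pseudocount, zeros end up far below the expressed genes on the log scale, which is worth keeping in mind when choosing `addition`.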

scale_expression_by_sum(rnaseq_data, axis=0, sum_value=1000000.0)

Normalizes all samples to sum up to the same scale factor.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for RNA-seq experiment. Columns are cell-types/tissues/samples and rows are genes.

axis : int, default=0 Axis to perform the global-scaling normalization. Options are {0 for normalizing across rows (column-wise) or 1 for normalizing across columns (row-wise)}.

sum_value : float, default=1e6 Scaling factor. The normalized axis will sum up to this value.

Returns

scaled_data : pandas.DataFrame Gene expression data for RNA-seq experiment with scaled values. All rows or columns, depending on the specified axis, sum up to the same value. Columns are cell-types/tissues/samples and rows are genes.

Source code in cell2cell/preprocessing/rnaseq.py
def scale_expression_by_sum(rnaseq_data, axis=0, sum_value=1e6):
    '''
    Normalizes all samples to sum up to the same scale factor.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for RNA-seq experiment. Columns are
        cell-types/tissues/samples and rows are genes.

    axis : int, default=0
        Axis to perform the global-scaling normalization. Options
        are {0 for normalizing across rows (column-wise) or 1 for
        normalizing across columns (row-wise)}.

    sum_value : float, default=1e6
        Scaling factor. The normalized axis will sum up to this value.

    Returns
    -------
    scaled_data : pandas.DataFrame
        Gene expression data for RNA-seq experiment with
        scaled values. All rows or columns, depending on the specified
        axis, sum up to the same value. Columns are
        cell-types/tissues/samples and rows are genes.
    '''
    data = rnaseq_data.values
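    # keepdims=True keeps the reduced axis so the division broadcasts for axis=0 and axis=1
    data = sum_value * np.divide(data, np.nansum(data, axis=axis, keepdims=True))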
    scaled_data = pd.DataFrame(data, index=rnaseq_data.index, columns=rnaseq_data.columns)
    return scaled_data
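
The following is a minimal sketch of global scaling for both axis choices, written with NumPy and pandas only. The helper name `scale_by_sum` and the toy values are hypothetical. The sketch reduces with `keepdims=True` so that the totals broadcast against a non-square matrix along either axis.

```python
import numpy as np
import pandas as pd

def scale_by_sum(df, axis=0, sum_value=1e6):
    """Hypothetical helper: scale each column (axis=0) or row (axis=1) to sum_value."""
    values = df.values
    totals = np.nansum(values, axis=axis, keepdims=True)  # shape (1, n_cols) or (n_rows, 1)
    return pd.DataFrame(sum_value * values / totals, index=df.index, columns=df.columns)

# Toy matrix with one missing value (genes x samples)
rnaseq_data = pd.DataFrame(
    {"sample_A": [1.0, 3.0, 6.0], "sample_B": [2.0, 2.0, np.nan]},
    index=["gene1", "gene2", "gene3"],
)

per_sample = scale_by_sum(rnaseq_data, axis=0)                 # each column sums to 1e6
per_gene = scale_by_sum(rnaseq_data, axis=1, sum_value=1.0)    # each row sums to 1

print(per_sample)
print(per_gene)
```

Missing values are ignored when computing the totals and remain missing in the scaled output.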

signal

smooth_curve(values, window_length=None, polyorder=3, **kwargs)

Apply a Savitzky-Golay filter to an array to smooth the curve.

Parameters

values : array-like An array or list of values.

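window_length : int, default=None Size of the window of values used to smooth the curve. If None, it is set to half the number of values. Even window lengths are increased by one to make them odd.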

polyorder : int, default=3 The order of the polynomial used to fit the samples.

**kwargs : dict Extra arguments for the scipy.signal.savgol_filter function.

Returns

smooth_values : array-like An array or list of values representing the smoothed curve.

Source code in cell2cell/preprocessing/signal.py
def smooth_curve(values, window_length=None, polyorder=3, **kwargs):
    '''Apply a Savitzky-Golay filter to an array to smooth the curve.

    Parameters
    ----------
    values : array-like
        An array or list of values.

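    window_length : int, default=None
        Size of the window of values used to smooth the curve. If None,
        it is set to half the number of values. Even window lengths are
        increased by one to make them odd.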

    polyorder : int, default=3
        The order of the polynomial used to fit the samples.

    **kwargs : dict
        Extra arguments for the scipy.signal.savgol_filter function.

    Returns
    -------
    smooth_values : array-like
        An array or list of values representing the smoothed curve.
    '''
    size = len(values)
    if window_length is None:
        window_length = int(size / min([2, size]))
    if window_length % 2 == 0:
        window_length += 1
    assert(polyorder < window_length), "polyorder must be less than window_length."
    smooth_values = savgol_filter(values, window_length, polyorder, **kwargs)
    return smooth_values
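
To make the default window rule concrete, the sketch below recomputes it in a hypothetical helper (`default_window`) and then calls `scipy.signal.savgol_filter` directly. A cubic curve is used as input on purpose: a Savitzky-Golay filter with `polyorder=3` fits local cubic polynomials, so such a curve should pass through essentially unchanged.

```python
import numpy as np
from scipy.signal import savgol_filter

def default_window(size):
    """Hypothetical helper mirroring the default window_length rule in the source."""
    window_length = int(size / min([2, size]))
    if window_length % 2 == 0:
        window_length += 1  # the filter is applied with an odd window
    return window_length

x = np.linspace(-1, 1, 20)
values = x ** 3                               # cubic test curve
window_length = default_window(len(values))   # 20 values -> 10 -> 11
smooth_values = savgol_filter(values, window_length, 3)

print(window_length)
print(np.max(np.abs(smooth_values - values)))
```

On noisy data, a larger `window_length` gives a smoother curve, while a higher `polyorder` follows the original values more closely.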

spatial

distances

celltype_pair_distance(df1, df2, method='min', distance='euclidean')

Calculates the distance between two sets of data points (single cell coordinates) represented by df1 and df2. It supports two distance metrics: Euclidean and Manhattan distances. The method parameter allows you to specify how the distances between the two sets are aggregated.

Parameters

df1 : pandas.DataFrame The first set of single cell coordinates.

df2 : pandas.DataFrame The second set of single cell coordinates.

method : str, default='min' The aggregation method for the calculated distances. It can be one of 'min', 'max', or 'mean'.

distance : str, default='euclidean' The distance metric to use. It can be 'euclidean' or 'manhattan'.

Returns

agg_dist : float The aggregated distance between the two sets of data points based on the specified method and distance metric.

Source code in cell2cell/spatial/distances.py
def celltype_pair_distance(df1, df2, method='min', distance='euclidean'):
    '''
    Calculates the distance between two sets of data points (single cell coordinates)
    represented by df1 and df2. It supports two distance metrics: Euclidean and Manhattan
    distances. The method parameter allows you to specify how the distances between the
    two sets are aggregated.

    Parameters
    ----------
    df1 : pandas.DataFrame
        The first set of single cell coordinates.

    df2 : pandas.DataFrame
        The second set of single cell coordinates.

    method : str, default='min'
        The aggregation method for the calculated distances. It can be one of 'min',
        'max', or 'mean'.

    distance : str, default='euclidean'
        The distance metric to use. It can be 'euclidean' or 'manhattan'.

    Returns
    -------
    agg_dist : float
        The aggregated distance between the two sets of data points based on the specified
        method and distance metric.
    '''
    if distance == 'euclidean':
        distances = euclidean_distances(df1, df2)
    elif distance == 'manhattan':
        distances = manhattan_distances(df1, df2)
    else:
        raise NotImplementedError("{} distance is not implemented.".format(distance.capitalize()))

    if method == 'min':
        agg_dist = np.nanmin(distances)
    elif method == 'max':
        agg_dist = np.nanmax(distances)
    elif method == 'mean':
        agg_dist = np.nanmean(distances)
    else:
        raise NotImplementedError('Method {} is not implemented.'.format(method))
    return agg_dist
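
The aggregation options can be checked by hand on a tiny example. The sketch below uses `scipy.spatial.distance.cdist` as a stand-in for the pairwise distance helpers called in the source (an assumption made only to keep the example self-contained), with made-up coordinates for two groups of cells.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical (X, Y) coordinates for the cells of two cell types
cells_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cells_b = np.array([[4.0, 0.0], [0.0, 3.0]])

# All pairwise distances: rows follow cells_a, columns follow cells_b
euclidean = cdist(cells_a, cells_b, metric="euclidean")
manhattan = cdist(cells_a, cells_b, metric="cityblock")

closest = np.nanmin(euclidean)       # corresponds to method='min'
farthest = np.nanmax(euclidean)      # corresponds to method='max'
average_l1 = np.nanmean(manhattan)   # corresponds to method='mean' with Manhattan distance

print(closest, farthest, average_l1)
```

With `method='min'`, two cell types count as close as soon as any single pair of their cells is close, while `'mean'` reflects the overall separation of the two groups.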

pairwise_celltype_distances(df, group_col, coord_cols=['X', 'Y'], method='min', distance='euclidean', pairs=None)

Calculates pairwise distances between groups of single cells. It computes an aggregate distance between all possible combinations of groups.

Parameters

df : pandas.DataFrame A dataframe where each row is a single cell, and there are columns containing spatial coordinates and cell group.

group_col : str The name of the column that defines the groups for which distances are calculated.

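coord_cols : list, default=['X', 'Y'] The list of column names that represent the coordinates of the single cells.

method : str, default='min' The aggregation method for the calculated distances. It can be one of 'min', 'max', or 'mean'.

distance : str, default='euclidean' The distance metric to use. It can be 'euclidean' or 'manhattan'.

pairs : list, default=None A list of specific group pairs for which distances should be calculated. If not provided, all possible combinations of group pairs will be considered.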

Returns

distances : pandas.DataFrame The pairwise distances between groups based on the specified group column. In this dataframe rows and columns are the cell groups used to compute distances.

Source code in cell2cell/spatial/distances.py
def pairwise_celltype_distances(df, group_col, coord_cols=['X', 'Y'],
                                method='min', distance='euclidean', pairs=None):
    '''
    Calculates pairwise distances between groups of single cells. It computes an
    aggregate distance between all possible combinations of groups.

    Parameters
    ----------
    df : pandas.DataFrame
        A dataframe where each row is a single cell, and there are columns containing
        spatial coordinates and cell group.

    group_col : str
        The name of the column that defines the groups for which distances are calculated.

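    coord_cols : list, default=['X', 'Y']
        The list of column names that represent the coordinates of the single cells.

    method : str, default='min'
        The aggregation method for the calculated distances. It can be one of 'min',
        'max', or 'mean'.

    distance : str, default='euclidean'
        The distance metric to use. It can be 'euclidean' or 'manhattan'.

    pairs : list, default=None
        A list of specific group pairs for which distances should be calculated.
        If not provided, all possible combinations of group pairs will be considered.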

    Returns
    -------
    distances : pandas.DataFrame
        The pairwise distances between groups based on the specified group column.
        In this dataframe rows and columns are the cell groups used to compute distances.
    '''
    # TODO: Adapt code below to receive AnnData or MuData objects
    # df_ = pd.DataFrame(adata.obsm['spatial'], index=adata.obs_names, columns=['X', 'Y'])
    # df = adata.obs[[group_col]]
    df_ = df[coord_cols]
    groups = df[group_col].unique()
    distances = pd.DataFrame(np.zeros((len(groups), len(groups))),
                             index=groups,
                             columns=groups)

    if pairs is None:
        pairs = list(itertools.combinations(groups, 2))

    for pair in pairs:
        dist = celltype_pair_distance(df_.loc[df[group_col] == pair[0]], df_.loc[df[group_col] == pair[1]],
                                      method=method,
                                      distance=distance
                                      )
        distances.loc[pair[0], pair[1]] = dist
        distances.loc[pair[1], pair[0]] = dist
    return distances
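
A small sketch of how the symmetric group-by-group matrix is filled, assuming a toy dataframe with hypothetical column names (`X`, `Y`, `cell_type`) and using `scipy.spatial.distance.cdist` in place of the library helper. It follows the same loop over `itertools.combinations` as the source.

```python
import itertools
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# One row per cell: coordinates plus a group label (toy values)
df = pd.DataFrame({
    "X": [0.0, 1.0, 4.0, 5.0, 0.0],
    "Y": [0.0, 0.0, 0.0, 0.0, 3.0],
    "cell_type": ["T", "T", "B", "B", "NK"],
})

groups = df["cell_type"].unique()
distances = pd.DataFrame(np.zeros((len(groups), len(groups))), index=groups, columns=groups)

for g1, g2 in itertools.combinations(groups, 2):
    xy1 = df.loc[df["cell_type"] == g1, ["X", "Y"]].to_numpy(dtype=float)
    xy2 = df.loc[df["cell_type"] == g2, ["X", "Y"]].to_numpy(dtype=float)
    dist = np.nanmin(cdist(xy1, xy2))   # minimum Euclidean distance between the groups
    distances.loc[g1, g2] = dist
    distances.loc[g2, g1] = dist        # keep the matrix symmetric

print(distances)
```

Because only pairs of different groups are visited, the diagonal keeps its initial value of 0, so a group is always at distance 0 from itself in the resulting matrix.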

filtering

dist_filter_liana(liana_outputs, distances, max_dist, min_dist=0, source_col='source', target_col='target', keep_dist=False)

Filters a dataframe with outputs from LIANA based on distance thresholds applied to another dataframe containing distances between cell groups.

Parameters

liana_outputs : pandas.DataFrame Dataframe containing the results from LIANA, where rows are pairs of ligand-receptor interactions by pair of source-target cell groups.

distances : pandas.DataFrame Square dataframe containing distances between pairs of cell groups.

max_dist : float The distance threshold used to filter the pairs from the liana_outputs dataframe.

min_dist : float, default=0 The minimum distance between cell groups required to keep their pairs from the liana_outputs dataframe.

source_col : str, default='source' Column name in both dataframes that represents the source cell groups.

target_col : str, default='target' Column name in both dataframes that represents the target cell groups.

keep_dist : bool, default=False To determine whether to keep the 'distance' column in the filtered output. If set to True, the 'distance' column will be retained; otherwise, it will be dropped and the LIANA dataframe will contain the original columns.

Returns

filtered_liana_outputs : pandas.DataFrame Dataframe containing the pairs from the liana_outputs dataframe that meet the distance threshold criteria.

Source code in cell2cell/spatial/filtering.py
def dist_filter_liana(liana_outputs, distances, max_dist, min_dist=0, source_col='source', target_col='target',
                      keep_dist=False):
    '''
    Filters a dataframe with outputs from LIANA based on distance thresholds
    applied to another dataframe containing distances between cell groups.

    Parameters
    ----------
    liana_outputs : pandas.DataFrame
        Dataframe containing the results from LIANA, where rows are pairs of
        ligand-receptor interactions by pair of source-target cell groups.

    distances : pandas.DataFrame
        Square dataframe containing distances between pairs of cell groups.

    max_dist : float
        The distance threshold used to filter the pairs from the liana_outputs dataframe.

    min_dist : float, default=0
        The minimum distance between cell groups required to keep their pairs
        from the liana_outputs dataframe.

    source_col : str, default='source'
        Column name in both dataframes that represents the source cell groups.

    target_col : str, default='target'
         Column name in both dataframes that represents the target cell groups.

    keep_dist : bool, default=False
        To determine whether to keep the 'distance' column in the filtered output.
        If set to True, the 'distance' column will be retained; otherwise, it will be dropped
        and the LIANA dataframe will contain the original columns.

    Returns
    -------
    filtered_liana_outputs : pandas.DataFrame
        Dataframe containing the pairs from the liana_outputs dataframe that meet
        the distance threshold criteria.
    '''
    # Convert distances to a long-form dataframe
    distances = distances.stack().reset_index()
    distances.columns = [source_col, target_col, 'distance']

    # Merge the long-form distances DataFrame with liana_outputs
    merged_df = liana_outputs.merge(distances, on=[source_col, target_col], how='left')

    # Filter based on the distance threshold
    filtered_liana_outputs = merged_df[(min_dist <= merged_df['distance']) & (merged_df['distance'] <= max_dist)]

    if not keep_dist:
        filtered_liana_outputs = filtered_liana_outputs.drop(['distance'], axis=1)

    return filtered_liana_outputs
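
The stack, merge, and filter steps can be followed on a toy table. In the sketch below the columns other than `source` and `target` (`ligand_complex`, `receptor_complex`, `score`) are placeholders for whatever a LIANA output contains, and the distance values are invented.

```python
import pandas as pd

# Placeholder LIANA-like output: one row per ligand-receptor pair and cell group pair
liana_outputs = pd.DataFrame({
    "source": ["T", "T", "B"],
    "target": ["B", "NK", "NK"],
    "ligand_complex": ["L1", "L2", "L3"],
    "receptor_complex": ["R1", "R2", "R3"],
    "score": [0.9, 0.5, 0.7],
})

# Square distance matrix between cell groups (toy values)
cells = ["T", "B", "NK"]
distances = pd.DataFrame([[0.0, 3.0, 3.0], [3.0, 0.0, 5.0], [3.0, 5.0, 0.0]],
                         index=cells, columns=cells)

# Square matrix -> long format with one row per (source, target) pair
long_dist = distances.stack().reset_index()
long_dist.columns = ["source", "target", "distance"]

merged = liana_outputs.merge(long_dist, on=["source", "target"], how="left")
min_dist, max_dist = 0.0, 4.0
filtered = merged[(min_dist <= merged["distance"]) & (merged["distance"] <= max_dist)]
filtered = filtered.drop(columns=["distance"])   # same effect as keep_dist=False

print(filtered)
```

Since the merge is a left join, a source-target pair that is absent from the distance matrix gets a missing distance and does not pass the filter.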

dist_filter_tensor(interaction_tensor, distances, max_dist, min_dist=0, source_axis=2, target_axis=3)

Filters an Interaction Tensor based on intercellular distances between cell types.

Parameters

interaction_tensor : cell2cell.tensor.BaseTensor A communication tensor generated with any of the tensor classes in cell2cell.tensor

distances : pandas.DataFrame Square dataframe containing distances between pairs of cell groups. It must contain all cell groups that act as sender and receiver cells in the tensor.

max_dist : float The maximum distance between cell pairs to consider them in the interaction tensor.

min_dist : float, default=0 The minimum distance between cell pairs to consider them in the interaction tensor.

source_axis : int, default=2 The index indicating the axis in the tensor corresponding to sender cells.

target_axis : int, default=3 The index indicating the axis in the tensor corresponding to receiver cells.

Returns

new_interaction_tensor : cell2cell.tensor.BaseTensor A tensor with communication scores set to zero for cell type pairs whose intercellular distance falls outside the range defined by min_dist and max_dist.

Source code in cell2cell/spatial/filtering.py
def dist_filter_tensor(interaction_tensor, distances, max_dist, min_dist=0, source_axis=2, target_axis=3):
    '''
    Filters an Interaction Tensor based on intercellular distances between cell types.

    Parameters
    ----------
    interaction_tensor : cell2cell.tensor.BaseTensor
        A communication tensor generated with any of the tensor classes in
        cell2cell.tensor

    distances : pandas.DataFrame
        Square dataframe containing distances between pairs of cell groups. It must contain
        all cell groups that act as sender and receiver cells in the tensor.

    max_dist : float
        The maximum distance between cell pairs to consider them in the interaction tensor.

    min_dist : float, default=0
        The minimum distance between cell pairs to consider them in the interaction tensor.

    source_axis : int, default=2
        The index indicating the axis in the tensor corresponding to sender cells.

    target_axis : int, default=3
        The index indicating the axis in the tensor corresponding to receiver cells.

    Returns
    -------
    new_interaction_tensor : cell2cell.tensor.BaseTensor
        A tensor with communication scores set to zero for cell type pairs whose
        intercellular distance falls outside the range defined by min_dist and max_dist.
    '''
    # Evaluate whether we provide distances for all cell types in the tensor
    assert all([cell in distances.index for cell in
                interaction_tensor.order_names[source_axis]]), "Distances not provided for all sender cells"
    assert all([cell in distances.columns for cell in
                interaction_tensor.order_names[target_axis]]), "Distances not provided for all receiver cells"

    source_cell_groups = interaction_tensor.order_names[source_axis]
    target_cell_groups = interaction_tensor.order_names[target_axis]

    # Use only cell types in the tensor
    dist_df = distances.loc[source_cell_groups, target_cell_groups]

    # Filter cell types by intercellular distances
    dist = ((min_dist <= dist_df) & (dist_df <= max_dist)).astype(int).values

    # Map the axis re-arrangement needed to keep the original tensor shape
    tensor_shape = list(interaction_tensor.tensor.shape)
    original_order = list(range(len(tensor_shape)))
    new_order = []

    # Generate template tensor with cells to keep
    template_tensor = dist
    for i, size in enumerate(tensor_shape):
        if (i != source_axis) and (i != target_axis):
            template_tensor = [template_tensor] * size
            new_order.insert(0, i)
    template_tensor = np.array(template_tensor)

    new_order += [source_axis, target_axis]
    changes_needed = [new_order.index(i) for i in original_order]

    # Re-arrange axes by the order
    template_tensor = template_tensor.transpose(changes_needed)

    # Create tensorly object
    template_tensor = tl.tensor(template_tensor, **tl.context(interaction_tensor.tensor))

    assert template_tensor.shape == interaction_tensor.tensor.shape, "Filtering of cells was not properly done. Revise code of this function (template tensor)"

    # tensor = tl.zeros_like(interaction_tensor.tensor, **tl.context(tensor))
    new_interaction_tensor = interaction_tensor.copy()
    new_interaction_tensor.tensor = new_interaction_tensor.tensor * template_tensor
    # Make masked cells by distance to be real zeros
    new_interaction_tensor.loc_zeros = (new_interaction_tensor.tensor == 0).astype(int) - new_interaction_tensor.loc_nans
    return new_interaction_tensor
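
The core idea, a binary sender-by-receiver mask multiplied into every context and ligand-receptor slice, can be sketched with plain NumPy. This is a simplified illustration under stated assumptions: a bare array stands in for the tensor object, the sender and receiver axes are the last two (matching the defaults `source_axis=2` and `target_axis=3`), and NumPy broadcasting replaces the replicate-and-transpose step used in the source.

```python
import numpy as np
import pandas as pd

cells = ["T", "B", "NK"]
# Toy distances between cell groups (senders in rows, receivers in columns)
dist_df = pd.DataFrame([[0.0, 3.0, 3.0], [3.0, 0.0, 5.0], [3.0, 5.0, 0.0]],
                       index=cells, columns=cells)

min_dist, max_dist = 0.0, 4.0
mask = ((min_dist <= dist_df) & (dist_df <= max_dist)).astype(int).values  # 1 = keep

# Stand-in 4D tensor: (contexts, ligand-receptor pairs, senders, receivers)
tensor = np.ones((2, 4, 3, 3))

# The (3, 3) mask broadcasts over the two leading axes
filtered = tensor * mask

print(filtered[0, 0])
```

Every score for the B to NK pair (distance 5) becomes zero in all contexts and ligand-receptor pairs, while pairs within the distance range are left untouched.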

neighborhoods

add_sliding_window_info_to_adata(adata, window_mapping)

Adds window information to the AnnData object's .obs DataFrame. Each window is represented as a column, and cells/spots belonging to a window are marked with a 1.0, while others are marked with a 0.0. It modifies the adata object in place.

Parameters

adata : AnnData The AnnData object to which the window information will be added.

window_mapping : dict A dictionary mapping each window to a set of cell/spot indices or barcodes. This is the output from the create_sliding_windows function.

Source code in cell2cell/spatial/neighborhoods.py
def add_sliding_window_info_to_adata(adata, window_mapping):
    """
    Adds window information to the AnnData object's .obs DataFrame. Each window is represented
    as a column, and cells/spots belonging to a window are marked with a 1.0, while others are marked
    with a 0.0. It modifies the `adata` object in place.

    Parameters
    ----------
    adata : AnnData
        The AnnData object to which the window information will be added.

    window_mapping : dict
        A dictionary mapping each window to a set of cell/spot indices or barcodes.
        This is the output from the `create_sliding_windows` function.
    """

    # Initialize all window columns to 0.0
    for window in sorted(window_mapping.keys()):
        adata.obs[window] = 0.0

    # Mark cells that belong to each window
    for window, barcode_indices in window_mapping.items():
        # Convert the set to a list: recent pandas versions reject sets as .loc indexers
        adata.obs.loc[list(barcode_indices), window] = 1.0
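
The resulting layout is one indicator column per window. The sketch below shows it on a plain pandas DataFrame that stands in for `adata.obs` (an assumption made to avoid requiring an AnnData object), with hypothetical barcodes and window names. The sets are converted to lists before indexing because recent pandas versions do not accept sets in `.loc`.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs, indexed by cell/spot barcode
obs = pd.DataFrame(index=["bc1", "bc2", "bc3"])

# Same shape as the mapping returned by create_sliding_windows
window_mapping = {"window_0_0": {"bc1", "bc2"}, "window_1_0": {"bc2", "bc3"}}

# Initialize every window column to 0.0, then flag the members of each window
for window in sorted(window_mapping):
    obs[window] = 0.0
for window, barcodes in window_mapping.items():
    obs.loc[list(barcodes), window] = 1.0

print(obs)
```

A cell that lies in overlapping windows, such as `bc2` here, is flagged in several columns at once.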

calculate_window_size(adata, num_windows)

Calculates the window size required to fit a specified number of windows across the width of the coordinate space in spatial transcriptomics data.

Parameters

adata : AnnData The AnnData object containing spatial transcriptomics data. The spatial coordinates must be stored in adata.obsm['spatial'].

num_windows : int The desired number of windows to fit across the width of the coordinate space.

Returns

window_size : float The calculated size of each window to fit the specified number of windows across the width of the coordinate space.

Source code in cell2cell/spatial/neighborhoods.py
def calculate_window_size(adata, num_windows):
    """
    Calculates the window size required to fit a specified number of windows
    across the width of the coordinate space in spatial transcriptomics data.

    Parameters
    ----------
    adata : AnnData
        The AnnData object containing spatial transcriptomics data. The spatial coordinates
        must be stored in `adata.obsm['spatial']`.

    num_windows : int
        The desired number of windows to fit across the width of the coordinate space.

    Returns
    -------
    window_size : float
        The calculated size of each window to fit the specified number of windows
        across the width of the coordinate space.
    """

    # Extract X coordinates
    x_coords = adata.obsm['spatial'][:, 0]

    # Determine the range of X coordinates
    x_min, x_max = np.min(x_coords), np.max(x_coords)

    # Calculate the window size
    window_size = (x_max - x_min) / num_windows

    return window_size
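
A worked example of the calculation, assuming a small NumPy array in place of `adata.obsm['spatial']` with invented coordinates:

```python
import numpy as np

# Hypothetical stand-in for adata.obsm['spatial']: one (x, y) row per cell/spot
spatial = np.array([[0.0, 5.0], [40.0, 80.0], [100.0, 20.0]])
num_windows = 4

x_coords = spatial[:, 0]
window_size = (np.max(x_coords) - np.min(x_coords)) / num_windows

print(window_size)
```

Only the X range enters the calculation, so the extent of the Y coordinates does not change the result.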

create_sliding_windows(adata, window_size, stride)

Maps windows to the cells they contain based on spatial transcriptomics data. Returns a dictionary where keys are window identifiers and values are sets of cell indices.

Parameters

adata : AnnData The AnnData object containing spatial transcriptomics data. The spatial coordinates must be stored in adata.obsm['spatial'].

window_size : float The size of each square window along each dimension.

stride : float The stride with which the window moves along each dimension.

Returns

window_mapping : dict A dictionary mapping each window to a set of cell indices that fall within that window.

Source code in cell2cell/spatial/neighborhoods.py
def create_sliding_windows(adata, window_size, stride):
    """
    Maps windows to the cells they contain based on spatial transcriptomics data.
    Returns a dictionary where keys are window identifiers and values are sets of cell indices.

    Parameters
    ----------
    adata : AnnData
        The AnnData object containing spatial transcriptomics data. The spatial coordinates
        must be stored in `adata.obsm['spatial']`.

    window_size : float
        The size of each square window along each dimension.

    stride : float
        The stride with which the window moves along each dimension.

    Returns
    -------
    window_mapping : dict
        A dictionary mapping each window to a set of cell indices that fall within that window.
    """

    # Get the spatial coordinates
    coords = pd.DataFrame(adata.obsm['spatial'], index=adata.obs_names, columns=['X', 'Y'])

    # Define the range of the sliding windows
    x_min, y_min = coords.min()
    x_max, y_max = coords.max()
    x_windows = np.arange(x_min, x_max - window_size + stride, stride)
    y_windows = np.arange(y_min, y_max - window_size + stride, stride)

    # Function to find all windows a point belongs to
    def find_windows(coord, window_edges):
        return [i for i, edge in enumerate(window_edges) if edge <= coord < edge + window_size]

    # Initialize the window mapping
    window_mapping = {}

    # Assign cells to all overlapping windows
    for cell_idx, (x, y) in enumerate(zip(coords['X'], coords['Y'])):
        cell_windows = ["window_{}_{}".format(wx, wy)
                        for wx in find_windows(x, x_windows)
                        for wy in find_windows(y, y_windows)]

        for win in cell_windows:
            if win not in window_mapping:
                window_mapping[win] = set()
            window_mapping[win].add(coords.index[cell_idx])  # This stores the cell/spot barcodes
            # For memory efficiency, it could be `window_mapping[win].add(cell_idx)` instead

    return window_mapping
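
How the window edges and the membership test interact is easiest to see in one dimension. The sketch below copies the edge computation and the `find_windows` helper from the source and applies them to made-up values for the coordinate range, `window_size`, and `stride`.

```python
import numpy as np

# Made-up coordinate range and window settings
x_min, x_max = 0.0, 10.0
window_size, stride = 4.0, 2.0

# Left edges of the windows along one dimension
x_windows = np.arange(x_min, x_max - window_size + stride, stride)

def find_windows(coord, window_edges):
    # A coordinate belongs to every window whose half-open interval contains it
    return [i for i, edge in enumerate(window_edges) if edge <= coord < edge + window_size]

inside = find_windows(5.0, x_windows)    # covered by two overlapping windows
at_max = find_windows(10.0, x_windows)   # sits exactly on the right border of the last window

print(x_windows, inside, at_max)
```

With these particular values the last window spans [6, 10), so a cell located exactly at the maximum coordinate is not assigned to any window, because the upper bound of each interval is exclusive. Whether this happens in practice depends on how `window_size` and `stride` divide the coordinate range.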

create_spatial_grid(adata, num_bins, copy=False)

Segments spatial transcriptomics data into a square grid based on spatial coordinates and annotates each cell or spot with its corresponding grid position.

Parameters

adata : AnnData The AnnData object containing spatial transcriptomics data. The spatial coordinates must be stored in adata.obsm['spatial']. This object is either modified in place or a copy is returned based on the copy parameter.

num_bins : int The number of bins (squares) along each dimension of the grid. The grid is square, so this number applies to both the horizontal and vertical divisions.

copy : bool, default=False If True, the function operates on and returns a copy of the input AnnData object. If False, the function modifies the input AnnData object in place.

Returns

adata_ : AnnData or None If copy=True, a new AnnData object with added grid annotations is returned. Otherwise, the input object is modified in place and None is returned.

Source code in cell2cell/spatial/neighborhoods.py
def create_spatial_grid(adata, num_bins, copy=False):
    """
    Segments spatial transcriptomics data into a square grid based on spatial coordinates
    and annotates each cell or spot with its corresponding grid position.

    Parameters
    ----------
    adata : AnnData
        The AnnData object containing spatial transcriptomics data. The spatial coordinates
        must be stored in `adata.obsm['spatial']`. This object is either modified in place
        or a copy is returned based on the `copy` parameter.

    num_bins : int
        The number of bins (squares) along each dimension of the grid. The grid is square,
        so this number applies to both the horizontal and vertical divisions.

    copy : bool, default=False
        If True, the function operates on and returns a copy of the input AnnData object.
        If False, the function modifies the input AnnData object in place.

    Returns
    -------
    adata_ : AnnData or None
        If `copy=True`, a new AnnData object with added grid annotations is returned.
        Otherwise, the input object is modified in place and None is returned.
    """

    if copy:
        adata_ = adata.copy()
    else:
        adata_ = adata

    # Get the spatial coordinates
    coords = pd.DataFrame(adata.obsm['spatial'], index=adata.obs_names, columns=['X', 'Y'])

    # Define the bins for each dimension
    x_min, y_min = coords.min()
    x_max, y_max = coords.max()
    x_bins = np.linspace(x_min, x_max, num_bins + 1)
    y_bins = np.linspace(y_min, y_max, num_bins + 1)

    # Digitize the coordinates into bins
    adata_.obs['grid_x'] = np.digitize(coords['X'], x_bins, right=False) - 1
    adata_.obs['grid_y'] = np.digitize(coords['Y'], y_bins, right=False) - 1

    # Adjust indices to start from 0 and end at num_bins - 1
    adata_.obs['grid_x'] = np.clip(adata_.obs['grid_x'], 0, num_bins - 1)
    adata_.obs['grid_y'] = np.clip(adata_.obs['grid_y'], 0, num_bins - 1)

    # Combine grid indices to form a grid cell identifier
    adata_.obs['grid_cell'] = adata_.obs['grid_x'].astype(str) + "_" + adata_.obs['grid_y'].astype(str)

    if copy:
        return adata_
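
The binning logic can be traced on a few points. The sketch below applies the same `np.linspace`, `np.digitize`, and `np.clip` steps to a plain DataFrame of made-up coordinates, which stands in for `adata.obsm['spatial']`.

```python
import numpy as np
import pandas as pd

# Made-up coordinates; the last point sits on the upper border of both axes
coords = pd.DataFrame({"X": [0.0, 1.9, 2.0, 9.9, 10.0],
                       "Y": [0.0, 10.0, 5.0, 0.0, 10.0]})
num_bins = 5

# num_bins + 1 equally spaced edges per dimension
x_bins = np.linspace(coords["X"].min(), coords["X"].max(), num_bins + 1)
y_bins = np.linspace(coords["Y"].min(), coords["Y"].max(), num_bins + 1)

# digitize is 1-based; subtract 1 and clip so the maximum lands in the last bin
grid_x = np.clip(np.digitize(coords["X"], x_bins, right=False) - 1, 0, num_bins - 1)
grid_y = np.clip(np.digitize(coords["Y"], y_bins, right=False) - 1, 0, num_bins - 1)

grid_cell = [f"{gx}_{gy}" for gx, gy in zip(grid_x, grid_y)]
print(grid_cell)
```

The clipping step matters for points on the upper border: without it, the point at X = 10 would receive index 5, one past the last valid bin.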

stats

enrichment

fisher_representation(sample_size, class_in_sample, population_size, class_in_population)

Performs an analysis of enrichment/depletion based on observation in a sample. It computes a p-value using Fisher's exact test.

Parameters

sample_size : int Size of the sample obtained or number of elements obtained from the analysis.

class_in_sample : int Number of elements of a given class that are contained in the sample. This is the class to be tested.

population_size : int Size of the sampling space. That is, the total number of possible elements to be chosen when sampling.

class_in_population : int Number of elements of a given class that are contained in the population. This is the class to be tested.

Returns

results : dict A dictionary containing the odds ratios and p-values for depletion and enrichment analysis.

Source code in cell2cell/stats/enrichment.py
def fisher_representation(sample_size, class_in_sample, population_size, class_in_population):
    '''
    Performs an analysis of enrichment/depletion based on observation
    in a sample. It computes a p-value using Fisher's exact test.

    Parameters
    ----------
    sample_size : int
        Size of the sample obtained or number of elements
        obtained from the analysis.

    class_in_sample : int
        Number of elements of a given class that are
        contained in the sample. This is the class to be tested.

    population_size : int
        Size of the sampling space. That is, the total number
        of possible elements to be chosen when sampling.

    class_in_population : int
        Number of elements of a given class that are contained
        in the population. This is the class to be tested.

    Returns
    -------
    results : dict
        A dictionary containing the odds ratios and p-values for
        depletion and enrichment analysis.
    '''
    # Computing the number of elements that are not in the same class
    nonclass_in_sample = sample_size - class_in_sample
    nonclass_in_population = population_size - class_in_population

    # Remaining elements in population after sampling
    rem_class = class_in_population - class_in_sample
    rem_nonclass = nonclass_in_population - nonclass_in_sample

    # Depletion Analysis
    depletion_odds, depletion_fisher_p_val = st.fisher_exact([[class_in_sample, rem_class],
                                                              [nonclass_in_sample, rem_nonclass]],
                                                             alternative='less')

    # Enrichment Analysis
    enrichment_odds, enrichment_fisher_p_val = st.fisher_exact([[class_in_sample, rem_class],
                                                                [nonclass_in_sample, rem_nonclass]],
                                                               alternative='greater')

    p_vals = (depletion_fisher_p_val, enrichment_fisher_p_val)
    odds = (depletion_odds, enrichment_odds)
    results = {'pval' : p_vals,
               'odds' : odds,
              }
    return results
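
To see how the four inputs turn into the 2x2 contingency table, the sketch below rebuilds the table for invented counts and calls `scipy.stats.fisher_exact` directly, as the source does. The numbers (a sample of 10 elements with 8 from the class, drawn from a population of 100 with 20 from the class) are illustrative only.

```python
from scipy import stats as st

# Illustrative counts
sample_size, class_in_sample = 10, 8
population_size, class_in_population = 100, 20

nonclass_in_sample = sample_size - class_in_sample                             # 2
rem_class = class_in_population - class_in_sample                              # 12
rem_nonclass = (population_size - class_in_population) - nonclass_in_sample    # 78

# Rows: class / non-class. Columns: in the sample / remaining in the population
table = [[class_in_sample, rem_class],
         [nonclass_in_sample, rem_nonclass]]

depletion_odds, depletion_p = st.fisher_exact(table, alternative="less")
enrichment_odds, enrichment_p = st.fisher_exact(table, alternative="greater")

print(enrichment_odds, enrichment_p, depletion_p)
```

Both calls use the same table, so the two odds ratios are identical and only the p-values differ between the depletion and enrichment tests.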

hypergeom_representation(sample_size, class_in_sample, population_size, class_in_population)

Performs an analysis of enrichment/depletion based on observation in a sample. It computes a p-value given a hypergeometric distribution.

Parameters

sample_size : int Size of the sample obtained or number of elements obtained from the analysis.

class_in_sample : int Number of elements of a given class that are contained in the sample. This is the class to be tested.

population_size : int Size of the sampling space. That is, the total number of possible elements to be chosen when sampling.

class_in_population : int Number of elements of a given class that are contained in the population. This is the class to be tested.

Returns

p_vals : tuple A tuple containing the p-values for depletion and enrichment analysis, respectively.

Source code in cell2cell/stats/enrichment.py
def hypergeom_representation(sample_size, class_in_sample, population_size, class_in_population):
    '''
    Performs an analysis of enrichment/depletion based on observation
    in a sample. It computes a p-value given a hypergeometric
    distribution.

    Parameters
    ----------
    sample_size : int
        Size of the sample obtained or number of elements
        obtained from the analysis.

    class_in_sample : int
        Number of elements of a given class that are
        contained in the sample. This is the class to be tested.

    population_size : int
        Size of the sampling space. That is, the total number
        of possible elements to be chosen when sampling.

    class_in_population : int
        Number of elements of a given class that are contained
        in the population. This is the class to be tested.

    Returns
    -------
    p_vals : tuple
        A tuple containing the p-values for depletion and
        enrichment analysis, respectively.
    '''
    # Computing the number of elements that are not in the same class
    nonclass_in_sample = sample_size - class_in_sample
    nonclass_in_population = population_size - class_in_population

    # Remaining elements in population after sampling
    rem_class = class_in_population - class_in_sample
    rem_nonclass = nonclass_in_population - nonclass_in_sample

    # Depletion Analysis
    depletion_hyp_p_val = st.hypergeom.cdf(class_in_sample, population_size, class_in_population, sample_size)

    # Enrichment Analysis
    enrichment_hyp_p_val = 1.0 - st.hypergeom.cdf(class_in_sample - 1.0, population_size, class_in_population,
                                                  sample_size)

    p_vals = (depletion_hyp_p_val, enrichment_hyp_p_val)
    return p_vals
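
The two tail probabilities returned above can be cross-checked by hand. The sketch below is a dependency-free illustration rather than the library's implementation: `hypergeom_pmf` and `hypergeom_tails` are hypothetical helper names, and the numbers in the example are made up. It sums the hypergeometric probability mass function directly, so depletion is P(X <= class_in_sample) and enrichment is P(X >= class_in_sample).

```python
from math import comb


def hypergeom_pmf(k, population_size, class_in_population, sample_size):
    """P(X = k) when drawing sample_size elements without replacement."""
    # math.comb returns 0 when k exceeds n, so impossible outcomes get probability 0.
    return (comb(class_in_population, k)
            * comb(population_size - class_in_population, sample_size - k)
            / comb(population_size, sample_size))


def hypergeom_tails(sample_size, class_in_sample, population_size, class_in_population):
    """Return (depletion, enrichment) tail probabilities (hypothetical helper)."""
    args = (population_size, class_in_population, sample_size)
    # Depletion: probability of observing class_in_sample or fewer elements.
    depletion = sum(hypergeom_pmf(k, *args) for k in range(0, class_in_sample + 1))
    # Enrichment: probability of observing class_in_sample or more elements.
    enrichment = sum(hypergeom_pmf(k, *args) for k in range(class_in_sample, sample_size + 1))
    return depletion, enrichment


# Made-up example: 5 of 10 sampled elements belong to a class that
# represents 20 of the 100 elements in the population.
depletion_p, enrichment_p = hypergeom_tails(sample_size=10,
                                            class_in_sample=5,
                                            population_size=100,
                                            class_in_population=20)
print(depletion_p, enrichment_p)
```

Both tails include the observed count, so the two p-values add up to one plus the probability of the observed count itself.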

gini

gini_coefficient(distribution)

Computes the Gini coefficient of an array of values. Code borrowed from: https://stackoverflow.com/questions/39512260/calculating-gini-coefficient-in-python-numpy

Parameters

distribution : array-like An array of values representing the distribution to be evaluated.

Returns

gini : float Gini coefficient for the evaluated distribution.

Source code in cell2cell/stats/gini.py
def gini_coefficient(distribution):
    """Computes the Gini coefficient of an array of values.
    Code borrowed from:
    https://stackoverflow.com/questions/39512260/calculating-gini-coefficient-in-python-numpy

    Parameters
    ----------
    distribution : array-like
        An array of values representing the distribution
        to be evaluated.

    Returns
    -------
    gini : float
        Gini coefficient for the evaluated distribution.
    """
    diffsum = 0
    for i, xi in enumerate(distribution[:-1], 1):
        diffsum += np.sum(np.abs(xi - distribution[i:]))
    gini = diffsum / (len(distribution)**2 * np.mean(distribution))
    return gini
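
To make the formula concrete, here is a minimal pure-Python sketch of the same pairwise-difference computation. `gini_pairwise` is a hypothetical name used only for this illustration. Note that the source function slices and subtracts from `distribution`, so in practice it expects a NumPy array rather than a plain list.

```python
from itertools import combinations


def gini_pairwise(values):
    """Sum of |xi - xj| over unordered pairs, divided by n^2 * mean."""
    values = [float(v) for v in values]
    n = len(values)
    mean = sum(values) / n
    diffsum = sum(abs(a - b) for a, b in combinations(values, 2))
    return diffsum / (n ** 2 * mean)


equal = gini_pairwise([1, 1, 1, 1])         # perfectly even distribution
concentrated = gini_pairwise([0, 0, 0, 1])  # everything held by one element
print(equal, concentrated)                  # 0.0 0.75
```

With this normalization, the most unequal distribution of n values scores (n - 1) / n rather than exactly 1, which is why the second example gives 0.75.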

multitest

compute_fdrcorrection_asymmetric_matrix(X, alpha=0.1)

Computes an FDR correction using the Benjamini-Hochberg procedure on an asymmetric matrix of p-values. Here, the correction is performed for every value in X.

Parameters

X : pandas.DataFrame An asymmetric dataframe of P-values.

alpha : float, default=0.1 Error rate of the FDR correction. Must be 0 < alpha < 1.

Returns

adj_X : pandas.DataFrame An asymmetric dataframe with adjusted P-values of X.

Source code in cell2cell/stats/multitest.py
def compute_fdrcorrection_asymmetric_matrix(X, alpha=0.1):
    '''
    Computes an FDR correction using the Benjamini-Hochberg procedure
    on an asymmetric matrix of p-values. Here, the correction
    is performed for every value in X.

    Parameters
    ----------
    X : pandas.DataFrame
        An asymmetric dataframe of P-values.

    alpha : float, default=0.1
        Error rate of the FDR correction. Must be 0 < alpha < 1.

    Returns
    -------
    adj_X : pandas.DataFrame
        An asymmetric dataframe with adjusted P-values of X.
    '''
    pandas = False
    a = X.copy()

    if isinstance(X, pd.DataFrame):
        pandas = True
        a = X.values
        index = X.index
        columns = X.columns

    # Original data
    pvals = a.flatten()

    # New data
    rej, adj_pvals = fdrcorrection(pvals, alpha=alpha)

    # Reorder_data
    #adj_X = adj_pvals.reshape(-1, a.shape[1])
    adj_X = adj_pvals.reshape(a.shape) # Allows using tensors

    if pandas:
        adj_X = pd.DataFrame(adj_X, index=index, columns=columns)
    return adj_X
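
The flatten, correct, and reshape pattern used above can be sketched with NumPy alone. The source delegates the correction to `fdrcorrection`. The `benjamini_hochberg` helper below is a hand-written stand-in for illustration only, and the p-values are made up.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values for a 1-D array (illustrative stand-in)."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    # Scale each sorted p-value by n / rank.
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value down.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


pvals = np.array([[0.01, 0.04],
                  [0.03, 0.20]])

# Every entry counts as a separate test: flatten, adjust, restore the shape.
adj = benjamini_hochberg(pvals.flatten()).reshape(pvals.shape)
print(adj)
```

Reshaping to the original shape, instead of assuming two dimensions, is what lets the same pattern work for higher-order arrays.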

compute_fdrcorrection_symmetric_matrix(X, alpha=0.1)

Computes an FDR correction using the Benjamini-Hochberg procedure on a symmetric matrix of p-values. Here, only the diagonal and values on the upper triangle are considered to avoid repetition with the lower triangle.

Parameters

X : pandas.DataFrame A symmetric dataframe of P-values.

alpha : float, default=0.1 Error rate of the FDR correction. Must be 0 < alpha < 1.

Returns

adj_X : pandas.DataFrame A symmetric dataframe with adjusted P-values of X.

Source code in cell2cell/stats/multitest.py
def compute_fdrcorrection_symmetric_matrix(X, alpha=0.1):
    '''
    Computes an FDR correction using the Benjamini-Hochberg procedure
    on a symmetric matrix of p-values. Here, only the diagonal
    and values on the upper triangle are considered to avoid
    repetition with the lower triangle.

    Parameters
    ----------
    X : pandas.DataFrame
        A symmetric dataframe of P-values.

    alpha : float, default=0.1
        Error rate of the FDR correction. Must be 0 < alpha < 1.

    Returns
    -------
    adj_X : pandas.DataFrame
        A symmetric dataframe with adjusted P-values of X.
    '''
    pandas = False
    a = X.copy()

    if isinstance(X, pd.DataFrame):
        pandas = True
        a = X.values
        index = X.index
        columns = X.columns

    # Original data
    upper_idx = np.triu_indices_from(a)
    pvals = a[upper_idx]

    # New data
    adj_X = np.zeros(a.shape)
    rej, adj_pvals = fdrcorrection(pvals.flatten(), alpha=alpha)

    # Reorder_data
    adj_X[upper_idx] = adj_pvals
    adj_X = adj_X + np.triu(adj_X, 1).T

    if pandas:
        adj_X = pd.DataFrame(adj_X, index=index, columns=columns)
    return adj_X
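
The triangle extraction and mirroring steps are easier to follow on a small matrix. In the sketch below, a Bonferroni-style scaling is swapped in for the Benjamini-Hochberg step purely to keep the example short. It is not what the function uses, and the p-values are made up.

```python
import numpy as np

pvals = np.array([[0.01, 0.02, 0.03],
                  [0.02, 0.04, 0.05],
                  [0.03, 0.05, 0.06]])

# Diagonal plus upper triangle: 6 unique tests instead of 9 entries.
upper_idx = np.triu_indices_from(pvals)
unique_pvals = pvals[upper_idx]

# Stand-in adjustment (Bonferroni-style), NOT the Benjamini-Hochberg procedure.
adjusted = np.minimum(unique_pvals * unique_pvals.size, 1.0)

# Write the adjusted values back and mirror the strict upper triangle.
adj = np.zeros(pvals.shape)
adj[upper_idx] = adjusted
adj = adj + np.triu(adj, 1).T
print(adj)
```

Counting only the unique entries matters because the number of tests enters the correction. Using all nine entries would treat each off-diagonal p-value as two tests.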

permutation

compute_pvalue_from_dist(obs_value, dist, consider_size=False, comparison='upper')

Computes the probability of observing a value in a given distribution.

Parameters

obs_value : float An observed value used to get a p-value from a distribution.

dist : array-like A simulated or empirical distribution of values used to compare the observed value and get a p-value.

consider_size : boolean, default=False Whether to consider the size of the distribution so that a p-value of zero is replaced by (approximately) the reciprocal of that size.

comparison : str, default='upper' Type of hypothesis testing:

- 'lower' : Lower-tailed, whether the value is smaller than most
    of the values in the distribution.
- 'upper' : Upper-tailed, whether the value is greater than most
    of the values in the distribution.
- 'different' : Two-tailed, whether the value is different from
    most of the values in the distribution.

Returns

pval : float P-value obtained from comparing the observed value and values in the distribution.

Source code in cell2cell/stats/permutation.py
def compute_pvalue_from_dist(obs_value, dist, consider_size=False, comparison='upper'):
    '''
    Computes the probability of observing a value in a given distribution.

    Parameters
    ----------
    obs_value : float
        An observed value used to get a p-value from a distribution.

    dist : array-like
        A simulated or empirical distribution of values used to compare
        the observed value and get a p-value.

    consider_size : boolean, default=False
        Whether to consider the size of the distribution so that a
        p-value of zero is replaced by (approximately) the reciprocal
        of that size.

    comparison : str, default='upper'
        Type of hypothesis testing:

        - 'lower' : Lower-tailed, whether the value is smaller than most
            of the values in the distribution.
        - 'upper' : Upper-tailed, whether the value is greater than most
            of the values in the distribution.
        - 'different' : Two-tailed, whether the value is different than
            most of the values in the distribution.

    Returns
    -------
    pval : float
        P-value obtained from comparing the observed value and values in the
        distribution.
    '''
    # Omit nan values
    dist_ = [x for x in dist if ~np.isnan(x)]

    # All values in dist are NaNs or obs_value is NaN
    if (len(dist_) == 0) | np.isnan(obs_value):
        return 1.0

    # No NaN values
    if comparison == 'lower':
        pval = scipy.stats.percentileofscore(dist_, obs_value) / 100.0
    elif comparison == 'upper':
        pval = 1.0 - scipy.stats.percentileofscore(dist_, obs_value) / 100.0
    elif comparison == 'different':
        percentile = scipy.stats.percentileofscore(dist_, obs_value) / 100.0
        if percentile <= 0.5:
            pval = 2.0 * percentile
        else:
            pval = 2.0 * (1.0 - percentile)
    else:
        raise NotImplementedError('Comparison {} is not implemented'.format(comparison))

    if (consider_size) & (pval == 0.):
        pval = 1./(len(dist_) + 1e-6)
    elif pval < 0.:
        pval = 1. / (len(dist_) + 1e-6)
    return pval
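
The three comparison modes can be illustrated without SciPy. The sketch below is a simplification, not the function above: it assumes the observed value does not tie with any value in the distribution, in which case the percentile is just the fraction of values below it. `empirical_pvalue` is a hypothetical name and the null distribution is made up.

```python
import math


def empirical_pvalue(obs_value, dist, comparison='upper', consider_size=False):
    """Empirical p-value assuming obs_value ties with no element of dist."""
    values = [x for x in dist if not math.isnan(x)]
    if not values or math.isnan(obs_value):
        return 1.0

    # Fraction of the null distribution strictly below the observation.
    percentile = sum(x < obs_value for x in values) / len(values)

    if comparison == 'lower':
        pval = percentile
    elif comparison == 'upper':
        pval = 1.0 - percentile
    elif comparison == 'different':
        pval = 2.0 * percentile if percentile <= 0.5 else 2.0 * (1.0 - percentile)
    else:
        raise ValueError('Unknown comparison: {}'.format(comparison))

    # Optionally avoid reporting an exact zero for a finite null distribution.
    if consider_size and pval == 0.0:
        pval = 1.0 / (len(values) + 1e-6)
    return pval


null = [float(i) for i in range(100)]               # made-up null distribution
upper = empirical_pvalue(94.5, null, 'upper')       # about 0.05
two_sided = empirical_pvalue(94.5, null, 'different')  # about 0.10
floored = empirical_pvalue(1000.0, null, 'upper', consider_size=True)  # about 0.01
print(upper, two_sided, floored)
```

The last call shows the purpose of `consider_size`: with 100 values in the null distribution, the smallest p-value that can be supported is about 1/100, so an exact zero is replaced by that floor.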

pvalue_from_dist(obs_value, dist, label='', consider_size=False, comparison='upper')

Computes a p-value for an observed value given a simulated or empirical distribution. It plots the distribution and prints the p-value.

Parameters

obs_value : float An observed value used to get a p-value from a distribution.

dist : array-like A simulated or empirical distribution of values used to compare the observed value and get a p-value.

label : str, default='' Label used for the histogram plot. Useful for identifying it across multiple plots.

consider_size : boolean, default=False Whether to consider the size of the distribution so that a p-value of zero is replaced by (approximately) the reciprocal of that size.

comparison : str, default='upper' Type of hypothesis testing:

- 'lower' : Lower-tailed, whether the value is smaller than most
    of the values in the distribution.
- 'upper' : Upper-tailed, whether the value is greater than most
    of the values in the distribution.
- 'different' : Two-tailed, whether the value is different from
    most of the values in the distribution.

Returns

fig : matplotlib.figure.Figure Figure that shows the histogram for dist.

pval : float P-value obtained from comparing the observed value and values in the distribution.

Source code in cell2cell/stats/permutation.py
def pvalue_from_dist(obs_value, dist, label='', consider_size=False, comparison='upper'):
    '''
    Computes a p-value for an observed value given a simulated or
    empirical distribution. It plots the distribution and prints
    the p-value.

    Parameters
    ----------
    obs_value : float
        An observed value used to get a p-value from a distribution.

    dist : array-like
        A simulated or empirical distribution of values used to compare
        the observed value and get a p-value.

    label : str, default=''
        Label used for the histogram plot. Useful for identifying it
        across multiple plots.

    consider_size : boolean, default=False
        Whether to consider the size of the distribution so that a
        p-value of zero is replaced by (approximately) the reciprocal
        of that size.

    comparison : str, default='upper'
        Type of hypothesis testing:

        - 'lower' : Lower-tailed, whether the value is smaller than most
            of the values in the distribution.
        - 'upper' : Upper-tailed, whether the value is greater than most
            of the values in the distribution.
        - 'different' : Two-tailed, whether the value is different than
            most of the values in the distribution.

    Returns
    -------
    fig : matplotlib.figure.Figure
        Figure that shows the histogram for dist.

    pval : float
        P-value obtained from comparing the observed value and values in the
        distribution.
    '''
    pval = compute_pvalue_from_dist(obs_value=obs_value,
                                    dist=dist,
                                    consider_size=consider_size,
                                    comparison=comparison
                                    )

    print('P-value is: {}'.format(pval))

    with sns.axes_style("darkgrid"):
        if label == '':
            if pval > 1. - 1. / len(dist):
                label_ = 'p-val: >{:g}'.format(float('{:.1g}'.format(1. - 1. / len(dist))))
            elif pval < 1. / len(dist):
                label_ = 'p-val: <{:g}'.format(float('{:.1g}'.format(1. / len(dist))))
            else:
                label_ = 'p-val: {0:.2E}'.format(pval)
        else:
            if pval > 1. - 1. / len(dist):
                label_ = label + ' - p-val: >{:g}'.format(float('{:.1g}'.format(1. - 1. / len(dist))))
            elif pval < 1. / len(dist):
                label_ = label + ' - p-val: <{:g}'.format(float('{:.1g}'.format(1. / len(dist))))
            else:
                label_ = label + ' - p-val: {0:.2E}'.format(pval)
        fig = sns.distplot(dist, hist=True, kde=True, norm_hist=False, rug=False, label=label_)
        fig.axvline(x=obs_value, color=fig.get_lines()[-1].get_c(), ls='--')

        fig.tick_params(axis='both', which='major', labelsize=16)
    lgd = plt.legend(loc='center left', bbox_to_anchor=(1.01, 0.5),
                     ncol=1, fancybox=True, shadow=True, fontsize=14)
    return fig, pval

random_switching_ppi_labels(ppi_data, genes=None, random_state=None, interaction_columns=('A', 'B'), permuted_column='both')

Randomly permutes the labels of interacting proteins in a list of protein-protein interactions.

Parameters

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

genes : list, default=None List of genes, with names matching proteins in the PPIs, to exclusively consider in the analysis.

random_state : int, default=None Seed for randomization.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

permuted_column : str, default='both' Column among the interaction_columns to permute. Options are:

- 'first' : To permute labels considering only proteins in the first
    column.
- 'second' : To permute labels considering only proteins in the second
    column.
- 'both' : To permute labels considering all the proteins in the list.

Returns

ppi_data_ : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) with randomly permuted labels of proteins.

Source code in cell2cell/stats/permutation.py
def random_switching_ppi_labels(ppi_data, genes=None, random_state=None, interaction_columns=('A', 'B'), permuted_column='both'):
    '''
    Randomly permutes the labels of interacting proteins in
    a list of protein-protein interactions.

    Parameters
    ----------
    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    genes : list, default=None
        List of genes, with names matching proteins in the PPIs, to exclusively
        consider in the analysis.

    random_state : int, default=None
        Seed for randomization.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a
        dataframe of protein-protein interactions. If the list is for
        ligand-receptor pairs, the first column is for the ligands and the second
        for the receptors.

    permuted_column : str, default='both'
        Column among the interaction_columns to permute.
        Options are:

        - 'first' : To permute labels considering only proteins in the first
            column.
        - 'second' : To permute labels considering only proteins in the second
            column.
        - 'both' : To permute labels considering all the proteins in the list.

    Returns
    -------
    ppi_data_ : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) with randomly
        permuted labels of proteins.
    '''
    ppi_data_ = ppi_data.copy()
    prot_a = interaction_columns[0]
    prot_b = interaction_columns[1]
    if permuted_column == 'both':
        if genes is None:
            genes = list(np.unique(ppi_data_[interaction_columns].values.flatten()))
        else:
            genes = list(set(genes))
        mapper = dict(zip(genes, shuffle(genes, random_state=random_state)))
        ppi_data_[prot_a] = ppi_data_[prot_a].apply(lambda x: mapper[x])
        ppi_data_[prot_b] = ppi_data_[prot_b].apply(lambda x: mapper[x])
    elif permuted_column == 'first':
        if genes is None:
            genes = list(np.unique(ppi_data_[prot_a].values.flatten()))
        else:
            genes = list(set(genes))
        mapper = dict(zip(genes, shuffle(genes, random_state=random_state)))
        ppi_data_[prot_a] = ppi_data_[prot_a].apply(lambda x: mapper[x])
    elif permuted_column == 'second':
        if genes is None:
            genes = list(np.unique(ppi_data_[prot_b].values.flatten()))
        else:
            genes = list(set(genes))
        mapper = dict(zip(genes, shuffle(genes, random_state=random_state)))
        ppi_data_[prot_b] = ppi_data_[prot_b].apply(lambda x: mapper[x])
    else:
        raise ValueError("Not a valid option for permuted_column. Use 'first', 'second' or 'both'.")
    return ppi_data_
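
The core idea is a one-to-one relabeling: each protein name is mapped to another name drawn from the same set, and the mapping is applied to every interaction. The sketch below shows this with the standard library only. `permute_labels` and the protein names are hypothetical, and it uses `random.Random` where the source uses `shuffle` on a dataframe column.

```python
import random


def permute_labels(pairs, seed=None):
    """Relabel proteins in a list of (A, B) pairs with a random bijection."""
    labels = sorted({protein for pair in pairs for protein in pair})
    shuffled = labels[:]
    random.Random(seed).shuffle(shuffled)
    # Each original label maps to exactly one shuffled label.
    mapper = dict(zip(labels, shuffled))
    permuted = [(mapper[a], mapper[b]) for a, b in pairs]
    return permuted, mapper


pairs = [('L1', 'R1'), ('L1', 'R2'), ('L2', 'R1')]  # made-up interactions
permuted, mapper = permute_labels(pairs, seed=0)
print(mapper)
print(permuted)
```

Because the mapping is a bijection, the shape of the network is preserved: interactions that shared a partner before the permutation still share one afterwards, and only the identities change. In the source function, the lookup `mapper[x]` raises a `KeyError` if `genes` is provided but does not contain every protein found in the permuted column(s).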

run_label_permutation(rnaseq_data, ppi_data, genes, analysis_setup, cutoff_setup, permutations=10000, permuted_label='gene_labels', excluded_cells=None, consider_size=True, verbose=False)

Permutes a label before computing cell-cell interaction scores.

Parameters

rnaseq_data : pandas.DataFrame Gene expression data for a bulk RNA-seq experiment or a single-cell experiment after aggregation into cell types. Columns are cell-types/tissues/samples and rows are genes.

ppi_data : pandas.DataFrame List of protein-protein interactions (or ligand-receptor pairs) used for inferring the cell-cell interactions and communication.

genes : list List of genes in rnaseq_data to exclusively consider in the analysis.

analysis_setup : dict Contains main setup for running the cell-cell interactions and communication analyses. Three main setups are needed (passed as keys):

- 'communication_score' : is the type of communication score used to detect
    active ligand-receptor pairs between each pair of cells.
    It can be:

    - 'expression_thresholding'
    - 'expression_product'
    - 'expression_mean'
- 'cci_score' : is the scoring function to aggregate the communication
    scores.
    It can be:

    - 'bray_curtis'
    - 'jaccard'
    - 'count'
- 'cci_type' : is the type of interaction between two cells. If it is
    undirected, all ligands and receptors are considered from both cells.
    If it is directed, ligands from one cell and receptors from the other
     are considered separately with respect to ligands from the second
     cell and receptor from the first one.
     So, it can be:

     - 'undirected'
     - 'directed'

cutoff_setup : dict Contains two keys: 'type' and 'parameter'. The first key represents the way to use a cutoff or threshold, while 'parameter' is the value used to binarize the expression values. The key 'type' can be:

- 'local_percentile' : computes the value of a given percentile, for each
    gene independently. In this case, the parameter corresponds to the
    percentile to compute, as a float value between 0 and 1.
- 'global_percentile' : computes the value of a given percentile from all
    genes and samples simultaneously. In this case, the parameter
    corresponds to the percentile to compute, as a float value between
    0 and 1. All genes have the same cutoff.
- 'file' : load a cutoff table from a file. Parameter in this case is the
    path of that file. It must contain the same genes as index and same
    samples as columns.
- 'multi_col_matrix' : a dataframe must be provided, containing a cutoff
    for each gene in each sample. This allows to use specific cutoffs for
    each sample. The columns here must be the same as the ones in the
    rnaseq_data.
- 'single_col_matrix' : a dataframe must be provided, containing a cutoff
    for each gene in only one column. These cutoffs will be applied to
    all samples.
- 'constant_value' : binarizes the expression. Evaluates whether
    expression is greater than the value input in the parameter.

permutations : int, default=10000 Number of permutations. In each one, labels are randomly shuffled and CCI scores are recomputed to build a null distribution.

permuted_label : str, default='gene_labels' Label to be permuted. Types are:

    - 'genes' : Permutes cell-labels in a gene-specific way.
    - 'gene_labels' : Permutes the labels of genes in the RNA-seq dataset.
    - 'cell_labels' : Permutes the labels of cell-types/tissues/samples
        in the RNA-seq dataset.

excluded_cells : list, default=None List of cells to exclude from the analysis.

consider_size : boolean, default=True Whether considering the size of the distribution for limiting small probabilities to be as minimal as the reciprocal of the size.

verbose : boolean, default=False Whether printing or not steps of the analysis.

Returns

cci_pvals : pandas.DataFrame Matrix where rows and columns are cell-types/tissues/samples and each value is a P-value for the corresponding CCI score.

Source code in cell2cell/stats/permutation.py
def run_label_permutation(rnaseq_data, ppi_data, genes, analysis_setup, cutoff_setup, permutations=10000,
                          permuted_label='gene_labels', excluded_cells=None, consider_size=True, verbose=False):
    '''Permutes a label before computing cell-cell interaction scores.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression data for a bulk RNA-seq experiment or a single-cell
        experiment after aggregation into cell types. Columns are
        cell-types/tissues/samples and rows are genes.

    ppi_data : pandas.DataFrame
        List of protein-protein interactions (or ligand-receptor pairs) used
        for inferring the cell-cell interactions and communication.

    genes : list
        List of genes in rnaseq_data to exclusively consider in the analysis.

    analysis_setup : dict
        Contains main setup for running the cell-cell interactions and communication
        analyses. Three main setups are needed (passed as keys):

        - 'communication_score' : is the type of communication score used to detect
            active ligand-receptor pairs between each pair of cells.
            It can be:

            - 'expression_thresholding'
            - 'expression_product'
            - 'expression_mean'
        - 'cci_score' : is the scoring function to aggregate the communication
            scores.
            It can be:

            - 'bray_curtis'
            - 'jaccard'
            - 'count'
        - 'cci_type' : is the type of interaction between two cells. If it is
            undirected, all ligands and receptors are considered from both cells.
            If it is directed, ligands from one cell and receptors from the other
             are considered separately with respect to ligands from the second
             cell and receptor from the first one.
             So, it can be:

             - 'undirected'
             - 'directed'

    cutoff_setup : dict
        Contains two keys: 'type' and 'parameter'. The first key represents the
        way to use a cutoff or threshold, while 'parameter' is the value used
        to binarize the expression values.
        The key 'type' can be:

        - 'local_percentile' : computes the value of a given percentile, for each
            gene independently. In this case, the parameter corresponds to the
            percentile to compute, as a float value between 0 and 1.
        - 'global_percentile' : computes the value of a given percentile from all
            genes and samples simultaneously. In this case, the parameter
            corresponds to the percentile to compute, as a float value between
            0 and 1. All genes have the same cutoff.
        - 'file' : load a cutoff table from a file. Parameter in this case is the
            path of that file. It must contain the same genes as index and same
            samples as columns.
        - 'multi_col_matrix' : a dataframe must be provided, containing a cutoff
            for each gene in each sample. This allows to use specific cutoffs for
            each sample. The columns here must be the same as the ones in the
            rnaseq_data.
        - 'single_col_matrix' : a dataframe must be provided, containing a cutoff
            for each gene in only one column. These cutoffs will be applied to
            all samples.
        - 'constant_value' : binarizes the expression. Evaluates whether
            expression is greater than the value input in the parameter.

    permutations : int, default=10000
        Number of permutations. In each one, labels are randomly
        shuffled and CCI scores are recomputed to build a null
        distribution.

    permuted_label : str, default='gene_labels'
        Label to be permuted.
        Types are:

            - 'genes' : Permutes cell-labels in a gene-specific way.
            - 'gene_labels' : Permutes the labels of genes in the RNA-seq dataset.
            - 'cell_labels' : Permutes the labels of cell-types/tissues/samples
                in the RNA-seq dataset.

    excluded_cells : list, default=None
        List of cells to exclude from the analysis.

    consider_size : boolean, default=True
        Whether considering the size of the distribution for limiting
        small probabilities to be as minimal as the reciprocal of the size.

    verbose : boolean, default=False
        Whether printing or not steps of the analysis.

    Returns
    -------
    cci_pvals : pandas.DataFrame
        Matrix where rows and columns are cell-types/tissues/samples and
        each value is a P-value for the corresponding CCI score.
    '''
    # Placeholders
    scores = np.array([])
    diag = np.array([])

    # Info to use
    if genes is None:
        genes = list(rnaseq_data.index)

    if excluded_cells is not None:
        included_cells = sorted(list(set(rnaseq_data.columns) - set(excluded_cells)))
    else:
        included_cells = sorted(list(set(rnaseq_data.columns)))

    rnaseq_data_ = rnaseq_data.loc[genes, included_cells]

    # Permutations
    for i in tqdm(range(permutations)):

        if permuted_label == 'genes':
            # Shuffle genes
            shuffled_rnaseq_data = shuffle_rows_in_df(df=rnaseq_data_,
                                                      rows=genes)
        elif permuted_label == 'cell_labels':
            # Shuffle cell labels
            shuffled_rnaseq_data = rnaseq_data_.copy()
            shuffled_cells = shuffle(list(shuffled_rnaseq_data.columns))
            shuffled_rnaseq_data.columns = shuffled_cells

        elif permuted_label == 'gene_labels':
            # Shuffle gene labels
            shuffled_rnaseq_data = rnaseq_data.copy()
            shuffled_genes = shuffle(list(shuffled_rnaseq_data.index))
            shuffled_rnaseq_data.index = shuffled_genes
        else:
            raise ValueError('Not a valid permuted_label.')


        interaction_space = ispace.InteractionSpace(rnaseq_data=shuffled_rnaseq_data.loc[genes, included_cells],
                                                    ppi_data=ppi_data,
                                                    gene_cutoffs=cutoff_setup,
                                                    communication_score=analysis_setup['communication_score'],
                                                    cci_score=analysis_setup['cci_score'],
                                                    cci_type=analysis_setup['cci_type'],
                                                    verbose=verbose)

        # Keep scores
        cci = interaction_space.interaction_elements['cci_matrix'].loc[included_cells, included_cells]
        cci_diag = np.diag(cci).copy()
        np.fill_diagonal(cci.values, 0.0)

        iter_scores = scipy.spatial.distance.squareform(cci)
        iter_scores = np.reshape(iter_scores, (len(iter_scores), 1)).T

        iter_diag = np.reshape(cci_diag, (len(cci_diag), 1)).T

        if scores.shape == (0,):
            scores = iter_scores
        else:
            scores = np.concatenate([scores, iter_scores], axis=0)

        if diag.shape == (0,):
            diag = iter_diag
        else:
            diag = np.concatenate([diag, iter_diag], axis=0)

    # Base CCI scores
    base_interaction_space = ispace.InteractionSpace(rnaseq_data=rnaseq_data_[included_cells],
                                                     ppi_data=ppi_data,
                                                     gene_cutoffs=cutoff_setup,
                                                     communication_score=analysis_setup['communication_score'],
                                                     cci_score=analysis_setup['cci_score'],
                                                     cci_type=analysis_setup['cci_type'],
                                                     verbose=verbose)

    # Keep scores
    base_cci = base_interaction_space.interaction_elements['cci_matrix'].loc[included_cells, included_cells]
    base_cci_diag = np.diag(base_cci).copy()
    np.fill_diagonal(base_cci.values, 0.0)

    base_scores = scipy.spatial.distance.squareform(base_cci)

    # P-values
    pvals = np.zeros((scores.shape[1], 1))
    diag_pvals = np.zeros((diag.shape[1], 1))

    for i in range(scores.shape[1]):
        pvals[i] = compute_pvalue_from_dist(base_scores[i],
                                            scores[:, i],
                                            consider_size=consider_size,
                                            comparison='different'
                                            )

    for i in range(diag.shape[1]):
        diag_pvals[i] = compute_pvalue_from_dist(base_cci_diag[i],
                                                 diag[:, i],
                                                 consider_size=consider_size,
                                                 comparison='different'
                                                 )

    # DataFrame
    cci_pvals = scipy.spatial.distance.squareform(pvals.reshape((len(pvals),)))
    for i, v in enumerate(diag_pvals.flatten()):
        cci_pvals[i, i] = v
    cci_pvals = pd.DataFrame(cci_pvals, index=base_cci.index, columns=base_cci.columns)

    return cci_pvals
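
As a usage sketch, the two setup dictionaries could be assembled as below. The chosen values are illustrative assumptions, not recommendations. `run_permutation_test` is a hypothetical wrapper, and the import path is inferred from the source file location shown above. The import is deferred so the dictionaries can be inspected without cell2cell installed.

```python
# Illustrative configuration: keys follow the documented options,
# values are assumptions made for this example.
analysis_setup = {
    'communication_score': 'expression_thresholding',
    'cci_score': 'bray_curtis',
    'cci_type': 'undirected',
}

cutoff_setup = {
    'type': 'local_percentile',  # per-gene percentile cutoff
    'parameter': 0.75,           # percentile expressed as a float in [0, 1]
}


def run_permutation_test(rnaseq_data, ppi_data, permutations=100):
    """Hypothetical wrapper around run_label_permutation."""
    # Imported lazily; the module path is assumed from cell2cell/stats/permutation.py.
    from cell2cell.stats.permutation import run_label_permutation

    return run_label_permutation(rnaseq_data=rnaseq_data,
                                 ppi_data=ppi_data,
                                 genes=None,  # None falls back to all genes in rnaseq_data
                                 analysis_setup=analysis_setup,
                                 cutoff_setup=cutoff_setup,
                                 permutations=permutations,
                                 permuted_label='gene_labels',
                                 excluded_cells=None,
                                 consider_size=True,
                                 verbose=False)
```

A smaller `permutations` value than the default of 10000 is used here only to keep a trial run short, since each permutation rebuilds a full interaction space. The resulting p-values are two-tailed because the function compares scores with `comparison='different'`.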

tensor special

external_scores

dataframes_to_tensor(context_df_dict, sender_col, receiver_col, ligand_col, receptor_col, score_col, how='inner', outer_fraction=0.0, lr_fill=nan, cell_fill=nan, lr_sep='^', dup_aggregation='max', context_order=None, order_labels=None, sort_elements=True, device=None)

Generates an InteractionTensor from a dictionary containing dataframes for all contexts.

Parameters

context_df_dict : dict Dictionary containing a dataframe for each context. The dataframe must contain columns containing sender cells, receiver cells, ligands, receptors, and communication scores, separately. Keys are context names and values are dataframes.

sender_col : str Name of the column containing the sender cells in all context dataframes.

receiver_col : str Name of the column containing the receiver cells in all context dataframes.

ligand_col : str Name of the column containing the ligands in all context dataframes.

receptor_col : str Name of the column containing the receptors in all context dataframes.

score_col : str Name of the column containing the communication scores in all context dataframes.

how : str, default='inner' Approach to consider cell types and genes present across multiple contexts.

- 'inner' : Considers only cell types and LR pairs that are present in all
            contexts (intersection).
- 'outer' : Considers all cell types and LR pairs that are present
            across contexts (union).
- 'outer_lrs' : Considers only cell types that are present in all
                contexts (intersection), while all LR pairs that are
                present across contexts (union).
- 'outer_cells' : Considers only LR pairs that are present in all
                  contexts (intersection), while all cell types that are
                  present across contexts (union).
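To make the four how options concrete, the following pure-Python sketch applies them to toy sets of LR pairs and cell types. The data and the helper names combine and resolve_how are hypothetical; senders and receivers are collapsed into a single set of cell types for brevity, and outer_fraction is assumed to be 0 so that the outer options reduce to a plain union.

```python
# Hypothetical toy data: LR pairs and cell types observed in each context.
lrs_per_context = {
    "ctx1": {"A^X", "B^Y", "C^Z"},
    "ctx2": {"A^X", "B^Y"},
    "ctx3": {"A^X", "C^Z"},
}
cells_per_context = {
    "ctx1": {"B", "NK", "T"},
    "ctx2": {"B", "T"},
    "ctx3": {"NK", "T"},
}


def combine(sets, mode):
    """Return the sorted intersection ('inner') or union ('outer') of the sets."""
    sets = list(sets)
    if mode == "inner":
        return sorted(set.intersection(*sets))
    return sorted(set.union(*sets))


def resolve_how(how):
    """Return the (LR pairs, cell types) kept under a given `how` option."""
    if how not in ("inner", "outer", "outer_lrs", "outer_cells"):
        raise ValueError("Not a valid input for parameter 'how'")
    lr_mode = "outer" if how in ("outer", "outer_lrs") else "inner"
    cell_mode = "outer" if how in ("outer", "outer_cells") else "inner"
    lr_pairs = combine(lrs_per_context.values(), lr_mode)
    cells = combine(cells_per_context.values(), cell_mode)
    return lr_pairs, cells


for option in ("inner", "outer", "outer_lrs", "outer_cells"):
    print(option, resolve_how(option))
```

In this toy case only the LR pair A^X and the cell type T are shared by all three contexts, so they are the only elements that survive how='inner'.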

outer_fraction : float, default=0.0 Threshold to filter the elements when how includes any outer option. Elements present in at least this fraction of the contexts (in context_df_dict) will be included. When this value is 0, all elements across the contexts are considered. When this value is 1, it acts like how='inner'.
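The effect of outer_fraction can be illustrated with a short standard-library sketch. The function elements_over_fraction below is a hypothetical stand-in written for this example; the library relies on its own helpers (get_element_abundances and get_elements_over_fraction), whose internals may differ.

```python
from collections import Counter


def elements_over_fraction(element_lists, fraction):
    """Keep elements present in at least `fraction` of the lists (assumed rule)."""
    n_lists = len(element_lists)
    # Count each element at most once per context.
    counts = Counter(e for elements in element_lists for e in set(elements))
    return sorted(e for e, c in counts.items() if c / n_lists >= fraction)


# Hypothetical cell types found in three contexts.
cells = [["A", "B", "C"], ["A", "B"], ["A"]]

print(elements_over_fraction(cells, 0.0))  # behaves like a union
print(elements_over_fraction(cells, 0.5))  # present in at least half of the contexts
print(elements_over_fraction(cells, 1.0))  # behaves like an intersection
```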

lr_fill : float, default=numpy.nan Value to fill communication scores when a ligand-receptor pair is not present across all contexts.

cell_fill : float, default=numpy.nan Value to fill communication scores when a cell is not present across all ligand-receptor pairs or all contexts.

lr_sep : str, default='^' Separation character to join ligands and receptors into a LR pair name.

dup_aggregation : str, default='max' Approach to aggregate communication scores when there are multiple instances of an LR pair for a specific sender-receiver pair in one of the dataframes.

- 'max' : Maximum of the multiple instances
- 'min' : Minimum of the multiple instances
- 'mean' : Average of the multiple instances
- 'median' : Median of the multiple instances
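As a rough illustration of these aggregation options, the sketch below groups duplicated (sender, receiver, ligand, receptor) records and reduces their scores. It uses plain tuples rather than dataframes, and the name aggregate_duplicates and the toy rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

AGGREGATORS = {"max": max, "min": min, "mean": mean, "median": median}


def aggregate_duplicates(rows, dup_aggregation="max"):
    """Collapse repeated (sender, receiver, ligand, receptor) rows into one score."""
    if dup_aggregation not in AGGREGATORS:
        raise ValueError("Please use a valid option for `dup_aggregation`.")
    grouped = defaultdict(list)
    for sender, receiver, ligand, receptor, score in rows:
        grouped[(sender, receiver, ligand, receptor)].append(score)
    reducer = AGGREGATORS[dup_aggregation]
    return {key: reducer(scores) for key, scores in grouped.items()}


# Hypothetical rows: the first three repeat the same interaction.
rows = [
    ("T", "B", "L1", "R1", 1.0),
    ("T", "B", "L1", "R1", 2.0),
    ("T", "B", "L1", "R1", 6.0),
    ("B", "T", "L1", "R1", 5.0),
]

for option in ("max", "min", "mean", "median"):
    print(option, aggregate_duplicates(rows, option))
```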

context_order : list, default=None List used to sort the contexts when building the tensor. It must contain every key in context_df_dict.keys() exactly once.

order_labels : list, default=None List containing the labels for each order or dimension of the tensor. For example: ['Contexts', 'Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']

sort_elements : boolean, default=True Whether to sort the elements in the InteractionTensor alphabetically. The Context Dimension is not sorted if a 'context_order' list is provided.

device : str, default=None Device to use when backend is pytorch. Options are: {'cpu', 'cuda:0', None}

Returns

interaction_tensor : cell2cell.tensor.PreBuiltTensor A communication tensor generated for the Tensor-cell2cell pipeline.
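The layout of the resulting tensor (contexts x LR pairs x sender cells x receiver cells) and the role of the fill values can be sketched with nested lists. This is an illustrative reconstruction, not the library code: the toy scores and the names build_tensor and build_mask are hypothetical, and the mask mirrors what the source below computes when how is not 'inner'.

```python
import math

# Hypothetical long-format scores per context:
# (LR pair, sender, receiver) -> communication score.
context_scores = {
    "ctx1": {("A^X", "T", "B"): 0.9, ("A^X", "B", "T"): 0.4},
    "ctx2": {("A^X", "T", "B"): 0.7},
}
contexts = ["ctx1", "ctx2"]
lr_pairs = ["A^X"]
senders = ["B", "T"]
receivers = ["B", "T"]


def build_tensor(fill=math.nan):
    """Nest lists as contexts x LR pairs x senders x receivers."""
    return [
        [
            [
                [context_scores[ctx].get((lr, s, r), fill) for r in receivers]
                for s in senders
            ]
            for lr in lr_pairs
        ]
        for ctx in contexts
    ]


def build_mask(tensor):
    """1 where a score was observed, 0 where the entry is NaN."""
    return [
        [[[0 if math.isnan(x) else 1 for x in row] for row in mat] for mat in block]
        for block in tensor
    ]


tensor = build_tensor()
mask = build_mask(tensor)
print(tensor[0][0][1][0])  # score for ctx1, A^X, sender T, receiver B
print(mask[1][0])          # observed entries for ctx2, A^X
```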

Source code in cell2cell/tensor/external_scores.py
def dataframes_to_tensor(context_df_dict, sender_col, receiver_col, ligand_col, receptor_col, score_col, how='inner',
                         outer_fraction=0.0, lr_fill=np.nan, cell_fill=np.nan, lr_sep='^', dup_aggregation='max',
                         context_order=None, order_labels=None, sort_elements=True, device=None):
    '''Generates an InteractionTensor from a dictionary
    containing dataframes for all contexts.

    Parameters
    ----------
    context_df_dict : dict
        Dictionary containing a dataframe for each context. The dataframe
        must contain columns containing sender cells, receiver cells,
        ligands, receptors, and communication scores, separately.
        Keys are context names and values are dataframes.

    sender_col : str
        Name of the column containing the sender cells in all context
        dataframes.

    receiver_col : str
        Name of the column containing the receiver cells in all context
        dataframes.

    ligand_col : str
        Name of the column containing the ligands in all context
        dataframes.

    receptor_col : str
        Name of the column containing the receptors in all context
        dataframes.

    score_col : str
        Name of the column containing the communication scores in all context
        dataframes.

    how : str, default='inner'
        Approach to consider cell types and genes present across multiple contexts.

        - 'inner' : Considers only cell types and LR pairs that are present in all
                    contexts (intersection).
        - 'outer' : Considers all cell types and LR pairs that are present
                    across contexts (union).
        - 'outer_lrs' : Considers only cell types that are present in all
                        contexts (intersection), while all LR pairs that are
                        present across contexts (union).
        - 'outer_cells' : Considers only LR pairs that are present in all
                          contexts (intersection), while all cell types that are
                          present across contexts (union).

    outer_fraction : float, default=0.0
        Threshold to filter the elements when `how` includes any outer option.
        Elements with a fraction abundance across contexts (in `context_df_dict`)
        at least this threshold will be included. When this value is 0, considers
        all elements across the samples. When this value is 1, it acts as using
        `how='inner'`.

    lr_fill : float, default=numpy.nan
        Value to fill communication scores when a ligand-receptor pair is not
        present across all contexts.

    cell_fill : float, default=numpy.nan
        Value to fill communication scores when a cell is not
        present across all ligand-receptor pairs or all contexts.

    lr_sep : str, default='^'
        Separation character to join ligands and receptors into a LR pair name.

    dup_aggregation : str, default='max'
        Approach to aggregate communication score if there are multiple instances
        of an LR pair for a specific sender-receiver pair in one of the dataframes.

        - 'max' : Maximum of the multiple instances
        - 'min' : Minimum of the multiple instances
        - 'mean' : Average of the multiple instances
        - 'median' : Median of the multiple instances

    context_order : list, default=None
        List used to sort the contexts when building the tensor. Elements must
        be all elements in context_df_dict.keys().

    order_labels : list, default=None
        List containing the labels for each order or dimension of the tensor. For
        example: ['Contexts', 'Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']

    sort_elements : boolean, default=True
        Whether alphabetically sorting elements in the InteractionTensor.
        The Context Dimension is not sorted if a 'context_order' list is provided.

    device : str, default=None
        Device to use when backend is pytorch. Options are:
        {'cpu', 'cuda:0', None}

    Returns
    -------
    interaction_tensor : cell2cell.tensor.PreBuiltTensor
        A communication tensor generated for the Tensor-cell2cell pipeline.
    '''
    # Make sure that all contexts contain needed info
    if context_order is None:
        sort_context = sort_elements
        context_order = list(context_df_dict.keys())
    else:
        assert all([c in context_df_dict.keys() for c in context_order]), "The list 'context_order' must contain all context names contained in the keys of 'context_df_dict'"
        assert len(context_order) == len(context_df_dict.keys()), "Each context name must be contained only once in the list 'context_order'"
        sort_context = False
    cols = [sender_col, receiver_col, ligand_col, receptor_col, score_col]
    assert all([c in df.columns for c in cols for df in context_df_dict.values()]), "All input columns must be contained in all dataframes included in 'context_df_dict'"

    # Copy context dict to make modifications
    cont_dict = {k : v.copy()[cols] for k, v in context_df_dict.items()}

    # Labels for each dimension
    if order_labels is None:
        order_labels = ['Contexts', 'Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']

    # Find all existing LR pairs, sender and receiver cells across contexts
    lr_dict = defaultdict(set)
    sender_dict = defaultdict(set)
    receiver_dict = defaultdict(set)

    for k, df in cont_dict.items():
        df['LRs'] = df.apply(lambda row: row[ligand_col] + lr_sep + row[receptor_col], axis=1)
        # This is to consider only LR pairs in each context that are present in all cell pairs. Disabled for now.
        # if how in ['inner', 'outer_cells']:
        #     ccc_df = df.pivot(index='LRs', columns=[sender_col, receiver_col], values=score_col)
        #     ccc_df = ccc_df.dropna(how='any')
        #     lr_dict[k].update(list(ccc_df.index))
        # else:
        lr_dict[k].update(df['LRs'].unique().tolist())
        sender_dict[k].update(df[sender_col].unique().tolist())
        receiver_dict[k].update(df[receiver_col].unique().tolist())

    # Subset LR pairs, sender and receiver cells given parameter 'how'
    df_lrs = [list(lr_dict[k]) for k in context_order]
    df_senders = [list(sender_dict[k]) for k in context_order]
    df_receivers  = [list(receiver_dict[k]) for k in context_order]

    if how == 'inner':
        lr_pairs = list(set.intersection(*map(set, df_lrs)))
        sender_cells = list(set.intersection(*map(set, df_senders)))
        receiver_cells = list(set.intersection(*map(set, df_receivers)))
    elif how == 'outer':
        lr_pairs = get_elements_over_fraction(abundance_dict=get_element_abundances(element_lists=df_lrs),
                                              fraction=outer_fraction)
        sender_cells = get_elements_over_fraction(abundance_dict=get_element_abundances(element_lists=df_senders),
                                              fraction=outer_fraction)
        receiver_cells = get_elements_over_fraction(abundance_dict=get_element_abundances(element_lists=df_receivers),
                                              fraction=outer_fraction)
    elif how == 'outer_lrs':
        lr_pairs = get_elements_over_fraction(abundance_dict=get_element_abundances(element_lists=df_lrs),
                                              fraction=outer_fraction)
        sender_cells = list(set.intersection(*map(set, df_senders)))
        receiver_cells = list(set.intersection(*map(set, df_receivers)))
    elif how == 'outer_cells':
        lr_pairs = list(set.intersection(*map(set, df_lrs)))
        sender_cells = get_elements_over_fraction(abundance_dict=get_element_abundances(element_lists=df_senders),
                                                  fraction=outer_fraction)
        receiver_cells = get_elements_over_fraction(abundance_dict=get_element_abundances(element_lists=df_receivers),
                                                    fraction=outer_fraction)
    else:
        raise ValueError("Not a valid input for parameter 'how'")

    if sort_elements:
        if sort_context:
            context_order = sorted(context_order)
        lr_pairs = sorted(lr_pairs)
        sender_cells = sorted(sender_cells)
        receiver_cells = sorted(receiver_cells)

    # Build temporary tensor to pass to PreBuiltTensor
    tmp_tensor = []
    for k in tqdm(context_order):
        v = cont_dict[k]
        # 3D tensor for the context
        tmp_3d_tensor = []
        for lr in lr_pairs:
            df = v.loc[v['LRs'] == lr]
            if df.shape[0] == 0:  # TODO: Check behavior when df is empty
                df = pd.DataFrame(lr_fill, index=sender_cells, columns=receiver_cells)
            else:
                if df[cols[:-1]].duplicated().any():
                    assert dup_aggregation in ['max', 'min', 'mean', 'median'], "Please use a valid option for `dup_aggregation`."
                    df = getattr(df.groupby(cols[:-1]), dup_aggregation)().reset_index()
                df = df.pivot(index=sender_col, columns=receiver_col, values=score_col)
                df = df.reindex(sender_cells, fill_value=cell_fill).reindex(receiver_cells, fill_value=cell_fill, axis='columns')

            tmp_3d_tensor.append(df.values)
        tmp_tensor.append(tmp_3d_tensor)

    # Create InteractionTensor using PreBuiltTensor
    tensor = np.asarray(tmp_tensor)
    if how != 'inner':
        mask = (~np.isnan(tensor)).astype(int)
        loc_nans = (np.isnan(tensor)).astype(int)
    else:
        mask = None
        loc_nans = np.zeros(tensor.shape, dtype=int)

    interaction_tensor = PreBuiltTensor(tensor=tensor,
                                        order_names=[context_order, lr_pairs, sender_cells, receiver_cells],
                                        order_labels=order_labels,
                                        mask=mask,
                                        loc_nans=loc_nans,
                                        device=device)
    return interaction_tensor

factor_manipulation

normalize_factors(factors)

L2-normalizes the factors considering all tensor dimensions from a tensor decomposition result

Parameters

factors : dict Ordered dictionary containing a dataframe with the factor loadings for each dimension/order of the tensor. This is the result of a tensor decomposition; it can be found as the attribute factors in any tensor class derived from the class BaseTensor (e.g. BaseTensor.factors).

Returns

norm_factors : dict The normalized factors.
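The normalization divides every factor (column) of each loading matrix by its L2 norm. A minimal sketch with nested lists instead of dataframes is shown below; the helper name l2_normalize_columns and the toy loadings are hypothetical.

```python
import math


def l2_normalize_columns(matrix):
    """Divide every column of a row-major matrix by its L2 norm."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    return [[row[j] / norms[j] for j in range(n_cols)] for row in matrix]


# Hypothetical loadings: rows are elements of one dimension, columns are factors.
factors = {
    "Contexts": [[3.0, 0.0], [4.0, 2.0]],
    "Sender Cells": [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
}
norm_factors = {name: l2_normalize_columns(m) for name, m in factors.items()}
print(norm_factors["Contexts"])
```

After normalization, each column of every matrix has unit L2 norm, which makes loadings comparable across factors.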

Source code in cell2cell/tensor/factor_manipulation.py
def normalize_factors(factors):
    '''
    L2-normalizes the factors considering all tensor dimensions
    from a tensor decomposition result

    Parameters
    ----------
    factors : dict
        Ordered dictionary containing a dataframe with the factor loadings for each
        dimension/order of the tensor. This is the result from a tensor decomposition,
        it can be found as the attribute `factors` in any tensor class derived from the
        class BaseTensor (e.g. BaseTensor.factors).

    Returns
    -------
    norm_factors : dict
        The normalized factors.
    '''
    norm_factors = dict()
    for k, v in factors.items():
        norm_factors[k] = v / np.linalg.norm(v, axis=0)
    return norm_factors

shuffle_factors(factors, axis=0)

Randomly shuffles the values of the factors in the tensor decomposition.

Source code in cell2cell/tensor/factor_manipulation.py
def shuffle_factors(factors, axis=0):
    '''
    Randomly shuffles the values of the factors in the tensor decomposition.
    '''
    raise NotImplementedError

factorization

normalized_error(reference_tensor, reconstructed_tensor)

Computes a normalized error between two tensors

Parameters

reference_tensor : ndarray or list A tensor that could be a list of lists, a multidimensional numpy array, or a tensorly.tensor. This tensor is the input of a tensor decomposition and serves as the reference when computing the normalized error of a tensor reconstructed from the factors of that decomposition.

reconstructed_tensor : ndarray or list A tensor that could be a list of lists, a multidimensional numpy array, or a tensorly.tensor. This tensor is an approximation of the reference_tensor, computed from the resulting factors of a tensor decomposition.

Returns

norm_error : float The normalized error between a reference tensor and a reconstructed tensor. The error is normalized by dividing by the Frobenius norm of the reference tensor.
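In other words, the value is the norm of the difference between both tensors divided by the norm of the reference tensor. The sketch below reproduces that ratio for nested lists using only the standard library; the names flatten and frobenius_normalized_error are hypothetical, and the library itself delegates to tensorly for this computation.

```python
import math


def flatten(tensor):
    """Yield the scalar entries of an arbitrarily nested list."""
    if isinstance(tensor, (list, tuple)):
        for item in tensor:
            yield from flatten(item)
    else:
        yield tensor


def frobenius_normalized_error(reference, reconstructed):
    """Return ||reference - reconstructed|| / ||reference|| (Frobenius norms)."""
    ref = list(flatten(reference))
    rec = list(flatten(reconstructed))
    diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, rec)))
    ref_norm = math.sqrt(sum(a ** 2 for a in ref))
    return diff_norm / ref_norm


# Hypothetical 2 x 2 example: only one entry differs.
reference = [[3.0, 0.0], [0.0, 4.0]]
reconstructed = [[3.0, 0.0], [0.0, 1.0]]
print(frobenius_normalized_error(reference, reconstructed))
```

A value of 0 means a perfect reconstruction, and larger values indicate a poorer fit.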

Source code in cell2cell/tensor/factorization.py
def normalized_error(reference_tensor, reconstructed_tensor):
    '''Computes a normalized error between two tensors

    Parameters
    ----------
    reference_tensor : ndarray list
        A tensor that could be a list of lists, a multidimensional numpy array or
        a tensorly.tensor. This tensor is the input of a tensor decomposition and
        used as reference in the normalized error for a new tensor reconstructed
        from the factors of the tensor decomposition.

    reconstructed_tensor : ndarray list
        A tensor that could be a list of lists, a multidimensional numpy array or
        a tensorly.tensor. This tensor is an approximation of the reference_tensor
        by using the resulting factors of a tensor decomposition to compute it.

    Returns
    -------
    norm_error : float
        The normalized error between a reference tensor and a reconstructed tensor.
        The error is normalized by dividing by the Frobenius norm of the reference
        tensor.
    '''
    norm_error = tl.norm(reference_tensor - reconstructed_tensor) / tl.norm(reference_tensor)
    return tl.to_numpy(norm_error)

metrics

correlation_index(factors_1, factors_2, tol=5e-16, method='stacked')

CorrIndex implementation to assess tensor decomposition outputs, from [1] Sobhani et al. 2022 (https://doi.org/10.1016/j.sigpro.2022.108457). The metric is invariant to scaling and column permutation, where each column is a factor.

Parameters

factors_1 : dict Ordered dictionary containing a dataframe with the factor loadings for each dimension/order of the tensor. This is the result of a tensor decomposition; it can be found as the attribute factors in any tensor class derived from the class BaseTensor (e.g. BaseTensor.factors).

factors_2 : dict Similar to factors_1 but coming from another tensor decomposition of a tensor with equal shape.

tol : float, default=5e-16 Precision threshold below which to call the CorrIndex score 0.

method : str, default='stacked' Method to obtain the CorrIndex by comparing the A matrices from two decompositions. Possible options are:

- 'stacked' : The original method implemented in [1]. Here all A matrices from the same decomposition are
              vertically concatenated, building a big A matrix for each decomposition.
- 'max_score' : This computes the CorrIndex for each pair of A matrices (i.e. between A_1 in factors_1 and
                factors_2, between A_2 in factors_1 and factors_2, and so on). Then the max score is
                selected (the most conservative approach). In other words, it selects the max score among the
                CorrIndexes computed dimension-wise.
- 'min_score' : Similar to 'max_score', but the min score is selected (the least conservative approach).
- 'avg_score' : Similar to 'max_score', but the average score is selected.

Returns

score : float CorrIndex metric in the range [0, 1]; a lower score indicates higher similarity between the matrices.
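The difference between the method options lies in what is compared and how the per-dimension results are combined. The sketch below illustrates only those two steps (stacking and aggregation) with nested lists; it does not compute the CorrIndex itself. The names stack_factors and aggregate_scores and the numeric values are hypothetical.

```python
from statistics import mean


def stack_factors(factor_matrices):
    """Vertically concatenate loading matrices that share the same rank."""
    ranks = {len(matrix[0]) for matrix in factor_matrices}
    if len(ranks) != 1:
        raise ValueError("Factors should be loading matrices of the same rank")
    return [row for matrix in factor_matrices for row in matrix]


def aggregate_scores(per_dimension_scores, method):
    """Combine dimension-wise CorrIndex scores into a single value."""
    aggregators = {"max_score": max, "min_score": min, "avg_score": mean}
    if method not in aggregators:
        raise ValueError("method must be 'max_score', 'min_score' or 'avg_score'")
    return aggregators[method](per_dimension_scores)


# Hypothetical rank-2 loadings for a tensor of shape (2, 3).
stacked = stack_factors([[[1.0, 0.0], [0.0, 1.0]],
                         [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]])
print(len(stacked), len(stacked[0]))  # rows = 2 + 3, columns = rank

# Hypothetical CorrIndex values, one per tensor dimension.
scores = [0.0, 0.25, 0.5]
print(aggregate_scores(scores, "max_score"))
```

With 'stacked', a single CorrIndex is computed on the concatenated matrices, so no aggregation is needed. Because lower scores mean higher similarity, taking the maximum is the most conservative choice.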

Source code in cell2cell/tensor/metrics.py
def correlation_index(factors_1, factors_2, tol=5e-16, method='stacked'):
    """
    CorrIndex implementation to assess tensor decomposition outputs.
    From [1] Sobhani et al 2022 (https://doi.org/10.1016/j.sigpro.2022.108457).
    Metric is scaling and column-permutation invariant, wherein each column is a factor.

    Parameters
    ----------
    factors_1 : dict
        Ordered dictionary containing a dataframe with the factor loadings for each
        dimension/order of the tensor. This is the result from a tensor decomposition,
        it can be found as the attribute `factors` in any tensor class derived from the
        class BaseTensor (e.g. BaseTensor.factors).

    factors_2 : dict
        Similar to factors_1 but coming from another tensor decomposition of a tensor
        with equal shape.

    tol : float, default=5e-16
        Precision threshold below which to call the CorrIndex score 0.

    method : str, default='stacked'
        Method to obtain the CorrIndex by comparing the A matrices from two decompositions.
        Possible options are:

        - 'stacked' : The original method implemented in [1]. Here all A matrices from the same decomposition are
                      vertically concatenated, building a big A matrix for each decomposition.
        - 'max_score' : This computes the CorrIndex for each pair of A matrices (i.e. between A_1 in factors_1 and
                        factors_2, between A_2 in factors_1 and factors_2, and so on). Then the max score is
                        selected (the most conservative approach). In other words, it selects the max score among the
                        CorrIndexes computed dimension-wise.
        - 'min_score' : Similar to 'max_score', but the min score is selected (the least conservative approach).
        - 'avg_score' : Similar to 'max_score', but the avg score is selected.

    Returns
    -------
    score : float
         CorrIndex metric [0,1]; lower score indicates higher similarity between matrices
    """
    factors_1 = list(factors_1.values())
    factors_2 = list(factors_2.values())

    # check input factors shape
    for factors in [factors_1, factors_2]:
        if len({np.shape(A)[1] for A in factors}) != 1:
            raise ValueError('Factors should be a list of loading matrices of the same rank')

    # check method
    options = ['stacked', 'max_score', 'min_score', 'avg_score']
    if method not in options:
        raise ValueError("The `method` must be either option among {}".format(options))

    if method == 'stacked':
        # vertically stack loading matrices (shape: sum(tensor.shape) x R)
        X_1 = [np.concatenate(factors_1, 0)]
        X_2 = [np.concatenate(factors_2, 0)]
    else:
        X_1 = factors_1
        X_2 = factors_2

    for x1, x2 in zip(X_1, X_2):
        if np.shape(x1) != np.shape(x2):
            raise ValueError('Factor matrices should be of the same shapes')

    # normalize columns to L2 norm - even if ran decomposition with normalize_factors=True
    col_norm_1 = [np.linalg.norm(x1, axis=0) for x1 in X_1]
    col_norm_2 = [np.linalg.norm(x2, axis=0) for x2 in X_2]
    for cn1, cn2 in zip(col_norm_1, col_norm_2):
        if np.any(cn1 == 0) or np.any(cn2 == 0):
            raise ValueError('Column norms must be non-zero')
    X_1 = [x1 / cn1 for x1, cn1 in zip(X_1, col_norm_1)]
    X_2 = [x2 / cn2 for x2, cn2 in zip(X_2, col_norm_2)]

    corr_idxs = [_compute_correlation_index(x1, x2, tol=tol) for x1, x2 in zip(X_1, X_2)]

    if method == 'stacked':
        score = corr_idxs[0]
    elif method == 'max_score':
        score = np.max(corr_idxs)
    elif method == 'min_score':
        score = np.min(corr_idxs)
    elif method == 'avg_score':
        score = np.mean(corr_idxs)
    else:
        score = 1.0
    return score

pairwise_correlation_index(factors, tol=5e-16, method='stacked')

Computes the CorrIndex between all pairs of decompositions in a list of factors

Parameters

factors : list List with multiple ordered dictionaries, each containing a dataframe with the factor loadings for each dimension/order of the tensor. Each dictionary is the result of a tensor decomposition; it can be found as the attribute factors in any tensor class derived from the class BaseTensor (e.g. BaseTensor.factors).

tol : float, default=5e-16 Precision threshold below which to call the CorrIndex score 0.

method : str, default='stacked' Method to obtain the CorrIndex by comparing the A matrices from two decompositions. Possible options are:

- 'stacked' : The original method implemented in [1]. Here all A matrices from the same decomposition are
              vertically concatenated, building a big A matrix for each decomposition.
- 'max_score' : This computes the CorrIndex for each pair of A matrices (i.e. between A_1 in factors_1 and
                factors_2, between A_2 in factors_1 and factors_2, and so on). Then the max score is
                selected (the most conservative approach). In other words, it selects the max score among the
                CorrIndexes computed dimension-wise.
- 'min_score' : Similar to 'max_score', but the min score is selected (the least conservative approach).
- 'avg_score' : Similar to 'max_score', but the average score is selected.

Returns

scores : pd.DataFrame Dataframe with the CorrIndex metric for each pair of decompositions. The metric is bounded in [0, 1]; a lower score indicates higher similarity between matrices.
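The pairwise scheme evaluates each unordered pair once and mirrors the value across the diagonal, which stays at 0. A minimal sketch of that pattern follows; pairwise_scores is a hypothetical helper, and the absolute difference between integers merely stands in for the real CorrIndex computation.

```python
from itertools import combinations


def pairwise_scores(items, score_fn):
    """Fill a symmetric N x N matrix (list of lists) with score_fn(a, b)."""
    n = len(items)
    scores = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        value = score_fn(items[i], items[j])
        scores[i][j] = value
        scores[j][i] = value
    return scores


# Hypothetical stand-in for decompositions and for correlation_index.
runs = [1, 4, 9]
matrix = pairwise_scores(runs, lambda a, b: abs(a - b))
for row in matrix:
    print(row)
```

For N decompositions, this requires N * (N - 1) / 2 evaluations rather than N * N.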

Source code in cell2cell/tensor/metrics.py
def pairwise_correlation_index(factors, tol=5e-16, method='stacked'):
    '''
    Computes the CorrIndex between all pairs of factors

    Parameters
    ----------
    factors : list
        List with multiple Ordered dictionaries, each containing a dataframe with
        the factor loadings for each dimension/order of the tensor. This is the
        result from a tensor decomposition, it can be found as the attribute
        `factors` in any tensor class derived from the class BaseTensor
        (e.g. BaseTensor.factors).

    tol : float, default=5e-16
        Precision threshold below which to call the CorrIndex score 0.

    method : str, default='stacked'
        Method to obtain the CorrIndex by comparing the A matrices from two decompositions.
        Possible options are:

        - 'stacked' : The original method implemented in [1]. Here all A matrices from the same decomposition are
                      vertically concatenated, building a big A matrix for each decomposition.
        - 'max_score' : This computes the CorrIndex for each pair of A matrices (i.e. between A_1 in factors_1 and
                        factors_2, between A_2 in factors_1 and factors_2, and so on). Then the max score is
                        selected (the most conservative approach). In other words, it selects the max score among the
                        CorrIndexes computed dimension-wise.
        - 'min_score' : Similar to 'max_score', but the min score is selected (the least conservative approach).
        - 'avg_score' : Similar to 'max_score', but the avg score is selected.

    Returns
    -------
    scores : pd.DataFrame
         Dataframe with the CorrIndex metric for each pair of decompositions.
         The metric is bounded in [0, 1]; a lower score indicates higher similarity between matrices.
    '''
    N = len(factors)
    idxs = list(range(N))
    pairs = list(combinations(idxs, 2))
    scores = pd.DataFrame(np.zeros((N, N)),index=idxs, columns=idxs)
    for p1, p2 in pairs:
        corrindex = correlation_index(factors_1=factors[p1],
                                      factors_2=factors[p2],
                                      tol=tol,
                                      method=method
                                      )

        scores.at[p1, p2] = corrindex
        scores.at[p2, p1] = corrindex
    return scores

subset

find_element_indexes(interaction_tensor, elements, axis=0, remove_duplicates=True, keep='first', original_order=False)

Finds the location/indexes of a list of elements in one of the axes of an InteractionTensor.

Parameters

interaction_tensor : cell2cell.tensor.BaseTensor A communication tensor generated with any of the tensor classes in cell2cell.tensor

elements : list A list of names for the elements to find in one of the axes.

axis : int, default=0 An axis of the interaction_tensor, representing one of its dimensions.

remove_duplicates : boolean, default=True Whether to remove duplicated names in elements.

keep : str, default='first' Determines which duplicates (if any) to keep. Options are:

- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.

original_order : boolean, default=False Whether to keep the original order of the elements in interaction_tensor.order_names[axis] (True) or the new order as indicated in elements (False).

Returns

indexes : list List of indexes for the elements that were found in the indicated axis of the interaction_tensor.
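The interplay between the requested elements and the keep option can be sketched on a plain list of axis names. The function find_indexes below is a simplified, hypothetical reimplementation that assumes duplicates are always removed and that the order of elements is preserved (original_order=False); it is not the library code.

```python
def find_indexes(axis_names, elements, keep="first"):
    """Return positions of `elements` in `axis_names`, handling duplicated names."""
    positions = {}
    for idx, name in enumerate(axis_names):
        positions.setdefault(name, []).append(idx)

    to_exclude = set()
    for idxs in positions.values():
        if len(idxs) > 1:  # the name appears more than once in the axis
            if keep == "first":
                to_exclude.update(idxs[1:])
            elif keep == "last":
                to_exclude.update(idxs[:-1])
            elif keep is False:
                to_exclude.update(idxs)
            else:
                raise ValueError("Not a valid option for `keep`")

    wanted = list(dict.fromkeys(elements))  # de-duplicate, preserving order
    return [i for name in wanted
            for i in positions.get(name, []) if i not in to_exclude]


# Hypothetical axis in which 'A' appears twice; 'Z' is not in the axis.
axis_names = ["A", "B", "A", "C"]
for option in ("first", "last", False):
    print(option, find_indexes(axis_names, ["C", "A", "Z"], keep=option))
```

Names that are absent from the axis (here 'Z') are simply skipped in this sketch.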

Source code in cell2cell/tensor/subset.py
def find_element_indexes(interaction_tensor, elements, axis=0, remove_duplicates=True, keep='first', original_order=False):
    '''Finds the location/indexes of a list of elements in one of the
    axis of an InteractionTensor.

    Parameters
    ----------
    interaction_tensor : cell2cell.tensor.BaseTensor
        A communication tensor generated with any of the tensor class in
        cell2cell.tensor

    elements : list
        A list of names for the elements to find in one of the axis.

    axis : int, default=0
        An axis of the interaction_tensor, representing one of
        its dimensions.

    remove_duplicates : boolean, default=True
        Whether removing duplicated names in `elements`.

    keep : str, default='first'
        Determines which duplicates (if any) to keep.
        Options are:

        - first : Drop duplicates except for the first occurrence.
        - last : Drop duplicates except for the last occurrence.
        - False : Drop all duplicates.

    original_order : boolean, default=False
        Whether keeping the original order of the elements in
        interaction_tensor.order_names[axis] or keeping the
        new order as indicated in `elements`.

    Returns
    -------
    indexes : list
        List of indexes for the elements that were found in the
        indicated axis of the interaction_tensor.
    '''
    assert axis < len(interaction_tensor.tensor.shape), \
        "List index out of range. 'axis' must be one of the axis in the tensor."
    assert axis < len(interaction_tensor.order_names), \
        "List index out of range. interaction_tensor.order_names must have element names for each axis of the tensor."

    elements = sorted(set(elements), key=list(elements).index)

    if original_order:
        # Avoids error for considering elements not in the tensor
        elements = set(elements).intersection(set(interaction_tensor.order_names[axis]))
        elements = sorted(elements, key=interaction_tensor.order_names[axis].index)


    # Find duplicates if we are removing them
    to_exclude = []
    if remove_duplicates:
        dup_dict = find_duplicates(interaction_tensor.order_names[axis])

        if len(dup_dict) > 0:  # Only if we have duplicate items
            if keep == 'first':
                for k, v in dup_dict.items():
                    to_exclude.extend(v[1:])
            elif keep == 'last':
                for k, v in dup_dict.items():
                    to_exclude.extend(v[:-1])
            elif not keep:
                for k, v in dup_dict.items():
                    to_exclude.extend(v)
            else:
                raise ValueError("Not a valid option was selected for the parameter `keep`")

    # Find indexes in the tensor
    indexes = sum([np.where(np.asarray(interaction_tensor.order_names[axis]) == element)[0].tolist()
                   for element in elements], [])

    # Exclude duplicates if any to exclude
    indexes = [idx for idx in indexes if idx not in to_exclude]
    return indexes

subset_metadata(tensor_metadata, interaction_tensor, sample_col='Element')

Subsets the metadata of an InteractionTensor to contain only elements in a reference InteractionTensor (interaction_tensor).

Parameters

tensor_metadata : list List of pandas dataframes with metadata information for elements of each dimension in the tensor. A column named as indicated by sample_col contains the name of each element in the tensor, while another column (referred to as group_col) contains the metadata or grouping information of each element.

interaction_tensor : cell2cell.tensor.BaseTensor A communication tensor generated with any of the tensor classes in cell2cell.tensor. This tensor is used as reference to subset the metadata. The subset metadata will contain only elements that are present in this tensor, so if the metadata was originally built for another tensor, the elements that are exclusive to that original tensor will be excluded.

sample_col : str, default='Element' Name of the column containing the element names in the metadata.

Returns

subset_metadata : list List of pandas dataframes with metadata information for elements contained in interaction_tensor.order_names. It is a subset of tensor_metadata.
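Conceptually, each metadata table is filtered and reordered to follow the element names of the matching tensor dimension, and dimensions without metadata (None) are passed through. The sketch below mimics this with lists of dictionaries instead of dataframes; subset_rows, subset_all, and the toy metadata are hypothetical.

```python
def subset_rows(metadata_rows, element_names, sample_col="Element"):
    """Keep and reorder metadata rows so they follow `element_names`."""
    by_name = {row[sample_col]: row for row in metadata_rows}
    # A name missing from the metadata raises KeyError in this sketch.
    return [by_name[name] for name in element_names]


def subset_all(tensor_metadata, order_names, sample_col="Element"):
    """Apply subset_rows per dimension, passing None entries through."""
    return [
        None if meta is None else subset_rows(meta, names, sample_col)
        for meta, names in zip(tensor_metadata, order_names)
    ]


# Hypothetical metadata for the context dimension; no metadata for LR pairs.
context_meta = [
    {"Element": "ctx1", "Category": "control"},
    {"Element": "ctx2", "Category": "treated"},
    {"Element": "ctx3", "Category": "treated"},
]
order_names = [["ctx3", "ctx1"], ["A^X"]]
result = subset_all([context_meta, None], order_names)
print(result)
```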

Source code in cell2cell/tensor/subset.py
def subset_metadata(tensor_metadata, interaction_tensor, sample_col='Element'):
    '''Subsets the metadata of an InteractionTensor to contain only
    elements in a reference InteractionTensor (interaction_tensor).

    Parameters
    ----------
    tensor_metadata : list
        List of pandas dataframes with metadata information for elements of each
        dimension in the tensor. A column called as the variable `sample_col` contains
        the name of each element in the tensor while another column called as the
        variable `group_col` contains the metadata or grouping information of each
        element.

    interaction_tensor : cell2cell.tensor.BaseTensor
        A communication tensor generated with any of the tensor class in
        cell2cell.tensor. This tensor is used as reference to subset the metadata.
        The subset metadata will contain only elements that are present in this
        tensor, so if metadata was originally built for another tensor, the elements
        that are exclusive for that original tensor will be excluded.

    sample_col : str, default='Element'
        Name of the column containing the element names in the metadata.

    Returns
    -------
    subset_metadata : list
        List of pandas dataframes with metadata information for elements contained
        in `interaction_tensor.order_names`. It is a subset of `tensor_metadata`.
    '''
    subset_metadata = []
    for i, meta in enumerate(tensor_metadata):
        if meta is not None:
            tmp_meta = meta.set_index(sample_col)
            tmp_meta = tmp_meta.loc[interaction_tensor.order_names[i], :]
            tmp_meta = tmp_meta.reset_index()
            subset_metadata.append(tmp_meta)
        else:
            subset_metadata.append(None)
    return subset_metadata

subset_tensor(interaction_tensor, subset_dict, remove_duplicates=True, keep='first', original_order=False)

Subsets an InteractionTensor to contain only specific elements in respective dimensions.

Parameters

interaction_tensor : cell2cell.tensor.BaseTensor A communication tensor generated with any of the tensor classes in cell2cell.tensor.

subset_dict : dict Dictionary to subset the tensor. It must contain the axes or dimensions that will be subset as the keys of the dictionary, and the values correspond to lists of element names for the respective axes or dimensions. Those axes that are not present in this dictionary will not be subset. E.g. {0 : ['Context 1', 'Context2'], 1: ['LR 10', 'LR 100']}

remove_duplicates : boolean, default=True Whether to remove duplicated names in elements.

keep : str, default='first' Determines which duplicates (if any) to keep. Options are:

- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
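
The three `keep` options can be illustrated on a plain list of element names. This is a minimal sketch of the semantics listed above; `drop_duplicated_names` is a hypothetical helper, not a cell2cell function, and how the library applies these options internally is not shown here.

```python
from collections import Counter


def drop_duplicated_names(names, keep='first'):
    """Illustrative helper (not part of cell2cell) mimicking the `keep` options."""
    if keep == 'first':
        # dict.fromkeys preserves the first occurrence of each name
        return list(dict.fromkeys(names))
    if keep == 'last':
        # Deduplicate from the end, then restore the original direction
        return list(dict.fromkeys(reversed(names)))[::-1]
    if keep is False:
        # Drop every name that appears more than once
        counts = Counter(names)
        return [n for n in names if counts[n] == 1]
    raise ValueError("keep must be 'first', 'last' or False")


names = ['LR 10', 'LR 100', 'LR 10', 'LR 7']
first = drop_duplicated_names(names, keep='first')  # ['LR 10', 'LR 100', 'LR 7']
last = drop_duplicated_names(names, keep='last')    # ['LR 100', 'LR 10', 'LR 7']
none = drop_duplicated_names(names, keep=False)     # ['LR 100', 'LR 7']
print(first, last, none)
```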

original_order : boolean, default=False Whether to keep the original order of the elements in interaction_tensor.order_names (True) or the new order indicated by the lists in subset_dict (False).

Returns

subset_tensor : cell2cell.tensor.BaseTensor A copy of interaction_tensor that was subset to contain only the elements specified for the respective axis in the subset_dict. Corresponds to a communication tensor generated with any of the tensor classes in cell2cell.tensor.
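
The core of the subsetting can be sketched with NumPy alone. The example below is an assumption-laden simplification: it works on a bare array and a list of names instead of a cell2cell tensor object, and the name lookup is a simplified placeholder that keeps the order given in `subset_dict` and ignores duplicates.

```python
import numpy as np

# Toy 3D array standing in for the values of a communication tensor
values = np.arange(2 * 3 * 2).reshape(2, 3, 2)
order_names = [['Context 1', 'Context 2'],
               ['LR 1', 'LR 10', 'LR 100'],
               ['Cell A', 'Cell B']]

# Axes 0 and 1 are subset; axis 2 is absent from the dictionary and stays intact
subset_dict = {0: ['Context 2'], 1: ['LR 100', 'LR 10']}

subset_values = values
subset_names = [list(names) for names in order_names]
for axis, elements in subset_dict.items():
    # Simplified lookup of the positions of the requested names
    idx = [order_names[axis].index(e) for e in elements if e in order_names[axis]]
    # Take the selected positions along this axis and update the names
    subset_values = subset_values.take(indices=idx, axis=axis)
    subset_names[axis] = [order_names[axis][i] for i in idx]

print(subset_values.shape)  # (1, 2, 2)
print(subset_names[1])      # ['LR 100', 'LR 10']
```

Because each axis is processed independently, the element names and the array stay aligned as long as both are updated with the same index list.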

Source code in cell2cell/tensor/subset.py
def subset_tensor(interaction_tensor, subset_dict, remove_duplicates=True, keep='first', original_order=False):
    '''Subsets an InteractionTensor to contain only specific elements in
    respective dimensions.

    Parameters
    ----------
    interaction_tensor : cell2cell.tensor.BaseTensor
        A communication tensor generated with any of the tensor classes in
        cell2cell.tensor.

    subset_dict : dict
        Dictionary to subset the tensor. It must contain the axes or
        dimensions that will be subset as the keys of the dictionary
        and the values correspond to lists of element names for the
        respective axes or dimensions. Those axes that are not present
        in this dictionary will not be subset.
        E.g. {0 : ['Context 1', 'Context2'], 1: ['LR 10', 'LR 100']}

    remove_duplicates : boolean, default=True
        Whether to remove duplicated names in `elements`.

    keep : str, default='first'
        Determines which duplicates (if any) to keep.
        Options are:

        - first : Drop duplicates except for the first occurrence.
        - last : Drop duplicates except for the last occurrence.
        - False : Drop all duplicates.

    original_order : boolean, default=False
        Whether to keep the original order of the elements in
        interaction_tensor.order_names (True) or the new order
        indicated by the lists in the `subset_dict` (False).

    Returns
    -------
    subset_tensor : cell2cell.tensor.BaseTensor
        A copy of interaction_tensor that was subset to contain
        only the elements specified for the respective axis in the
        `subset_dict`. Corresponds to a communication tensor
        generated with any of the tensor classes in cell2cell.tensor.
    '''
    # Perform a deep copy of the original tensor and reset previous factorization
    subset_tensor = copy.deepcopy(interaction_tensor)
    subset_tensor.rank = None
    subset_tensor.tl_object = None
    subset_tensor.factors = None

    # Initialize tensor into a numpy object for performing subset
    context = tl.context(subset_tensor.tensor)
    tensor = tl.to_numpy(subset_tensor.tensor)
    mask = None
    if subset_tensor.mask is not None:
        mask = tl.to_numpy(subset_tensor.mask)

    # Search for indexes
    axis_idxs = dict()
    for k, v in subset_dict.items():
        if k < len(tensor.shape):
            if len(v) != 0:
                idx = find_element_indexes(interaction_tensor=subset_tensor,
                                           elements=v,
                                           axis=k,
                                           remove_duplicates=remove_duplicates,
                                           keep=keep,
                                           original_order=original_order
                                           )
                if len(idx) == 0:
                    print("No elements found for axis {}. It will return an empty tensor.".format(k))
                axis_idxs[k] = idx
        else:
            print("Axis {} is out of index, not considering elements in this axis.".format(k))

    # Subset tensor
    for k, v in axis_idxs.items():
        if tensor.shape != (0,):  # Avoids error when returned empty tensor
            tensor = tensor.take(indices=v,
                                 axis=k
                                 )

            subset_tensor.order_names[k] = [subset_tensor.order_names[k][i] for i in v]
            if mask is not None:
                mask = mask.take(indices=v,
                                 axis=k
                                 )

    # Restore tensor and mask properties
    tensor = tl.tensor(tensor, **context)
    if mask is not None:
        mask = tl.tensor(mask, **context)

    subset_tensor.tensor = tensor
    subset_tensor.mask = mask
    return subset_tensor

tensor

BaseTensor

Empty base tensor class that contains the main functions for the Tensor Factorization of a Communication Tensor

Attributes

communication_score : str Type of communication score to infer the potential use of a given ligand-receptor pair by a pair of cells/tissues/samples. Available communication_scores are:

- 'expression_mean' : Computes the average between the expression of a ligand
                      from a sender cell and the expression of a receptor on a
                      receiver cell.
- 'expression_product' : Computes the product between the expression of a
                        ligand from a sender cell and the expression of a
                        receptor on a receiver cell.
- 'expression_gmean' : Computes the geometric mean between the expression
                           of a ligand from a sender cell and the
                           expression of a receptor on a receiver cell.
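
For a single ligand-receptor pair and a single sender-receiver pair, the three scores reduce to simple arithmetic. The expression values below are hypothetical and only serve to show the formulas implied by the descriptions above.

```python
import math

ligand_expr = 4.0    # Hypothetical expression of a ligand in the sender cell
receptor_expr = 9.0  # Hypothetical expression of a receptor in the receiver cell

scores = {
    # Arithmetic mean of both expression values
    'expression_mean': (ligand_expr + receptor_expr) / 2,
    # Product of both expression values
    'expression_product': ligand_expr * receptor_expr,
    # Geometric mean: square root of the product
    'expression_gmean': math.sqrt(ligand_expr * receptor_expr),
}

for name, value in scores.items():
    print(name, value)
```

Note that the product and the geometric mean are zero whenever either partner is not expressed, whereas the arithmetic mean stays positive if one of them is.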

how : str Approach to consider cell types and genes present across multiple contexts.

- 'inner' : Considers only cell types and genes that are present in all
            contexts (intersection).
- 'outer' : Considers all cell types and genes that are present
            across contexts (union).
- 'outer_genes' : Considers only cell types that are present in all
                  contexts (intersection), while all genes that are
                  present across contexts (union).
- 'outer_cells' : Considers only genes that are present in all
                  contexts (intersection), while all cell types that are
                  present across contexts (union).
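
The four options amount to combining set intersections and unions over contexts. The following sketch uses made-up gene and cell-type names; `select_elements` is a hypothetical helper written for illustration, not a cell2cell function.

```python
# Hypothetical gene and cell-type names observed in each context
contexts = {
    'Context 1': {'genes': {'G1', 'G2', 'G3'}, 'cells': {'T', 'B'}},
    'Context 2': {'genes': {'G2', 'G3', 'G4'}, 'cells': {'T', 'NK'}},
}


def select_elements(contexts, how='inner'):
    """Illustrative helper (not part of cell2cell) returning (genes, cells)."""
    gene_sets = [c['genes'] for c in contexts.values()]
    cell_sets = [c['cells'] for c in contexts.values()]
    inner_genes = set.intersection(*gene_sets)
    outer_genes = set.union(*gene_sets)
    inner_cells = set.intersection(*cell_sets)
    outer_cells = set.union(*cell_sets)
    options = {
        'inner': (inner_genes, inner_cells),
        'outer': (outer_genes, outer_cells),
        'outer_genes': (outer_genes, inner_cells),
        'outer_cells': (inner_genes, outer_cells),
    }
    genes, cells = options[how]
    return sorted(genes), sorted(cells)


for how in ['inner', 'outer', 'outer_genes', 'outer_cells']:
    print(how, select_elements(contexts, how=how))
```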

outer_fraction : float Threshold to filter the elements when how includes any outer option. Elements with a fraction abundance across samples (in rnaseq_matrices) at least this threshold will be included. When this value is 0, considers all elements across the samples. When this value is 1, it acts as using how='inner'.
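
The effect of the threshold can be sketched as follows, assuming that the fraction of an element is the share of samples in which it appears. The helper `filter_by_fraction` and the sample contents are hypothetical; the exact counting used by the library is not shown here.

```python
def filter_by_fraction(sample_elements, outer_fraction):
    """Illustrative helper (not part of cell2cell): keep elements present in
    at least `outer_fraction` of the samples."""
    n_samples = len(sample_elements)
    all_elements = set().union(*sample_elements)
    kept = []
    for element in all_elements:
        # Share of samples that contain this element
        fraction = sum(element in s for s in sample_elements) / n_samples
        if fraction >= outer_fraction:
            kept.append(element)
    return sorted(kept)


# Hypothetical cell types detected in three samples
samples = [{'T', 'B', 'NK'}, {'T', 'B'}, {'T'}]

print(filter_by_fraction(samples, 0))    # ['B', 'NK', 'T'], same as the union
print(filter_by_fraction(samples, 0.5))  # ['B', 'T']
print(filter_by_fraction(samples, 1))    # ['T'], same as the intersection
```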

tensor : tensorly.tensor Tensor object created with the library tensorly.

genes : list List of strings detailing the genes used through all contexts. Obtained depending on the attribute 'how'.

cells : list List of strings detailing the cells used through all contexts. Obtained depending on the attribute 'how'.

order_names : list List of lists containing the string names of each element in each of the dimensions or orders in the tensor. For a 4D-Communication tensor, the first list should contain the names of the contexts, the second the names of the ligand-receptor interactions, the third the names of the sender cells and the fourth the names of the receiver cells.

order_labels : list List of labels for dimensions or orders in the tensor.
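
When order_labels is None, the source of compute_tensor_factorization shown further down falls back on default labels that depend on the number of dimensions. The standalone function below mirrors that fallback for illustration; the name `default_order_labels` is hypothetical.

```python
def default_order_labels(tensor_dim):
    """Illustrative mirror of the default labels chosen for a tensor
    with `tensor_dim` dimensions (hypothetical helper name)."""
    base = ['Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']
    if tensor_dim == 3:
        return base
    if tensor_dim == 4:
        return ['Contexts'] + base
    if tensor_dim > 4:
        # Every dimension before the last three is treated as a context
        return ['Contexts-{}'.format(i + 1) for i in range(tensor_dim - 3)] + base
    raise ValueError('Too few dimensions in the tensor')


print(default_order_labels(4))
print(default_order_labels(5))
```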

tl_object : ndarray list A tensorly object containing a list of initialized factors of the tensor decomposition where element i is of shape (tensor.shape[i], rank).

norm_tl_object : ndarray list A tensorly object containing a list of initialized factors of the tensor decomposition where element i is of shape (tensor.shape[i], rank). This results from normalizing the factor loadings of the tl_object.

factors : dict Ordered dictionary containing a dataframe with the factor loadings for each dimension/order of the tensor.

rank : int Rank of the Tensor Factorization (number of factors to deconvolve the original tensor).

mask : ndarray list Helps to avoid missing values during a tensor factorization. A mask should be a boolean array of the same shape as the original tensor and should be 0 where the values are missing and 1 everywhere else.

explained_variance_ : float Explained variance score for a tensor factorization.

explained_variance_ratio_ : ndarray list Percentage of variance explained by each of the factors. Only present when "normalize_loadings" is True. Otherwise, it is None.
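
In the source of compute_tensor_factorization shown further down, these values are obtained by dividing the weights of the normalized factors by their sum, optionally after sorting them from highest to lowest. A minimal sketch with hypothetical weights:

```python
import numpy as np

# Hypothetical weights of three normalized factors
weights = np.array([1.0, 5.0, 4.0])

# Order factors from highest to lowest weight
w_order = weights.argsort()[::-1]

# Each value is the share of the total weight carried by a factor
explained_variance_ratio = weights[w_order] / weights.sum()
print(explained_variance_ratio)
```

The values are fractions that sum to 1, so multiplying by 100 gives percentages.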

loc_nans : ndarray list An array of shape equal to tensor with ones where NaN values were assigned when building the tensor. Other values are zeros. It stores the location of the NaN values.

loc_zeros : ndarray list An array of shape equal to tensor with ones where zeros that are not in loc_nans are located. Other values are assigned a zero. It tracks the real zero values rather than NaN values that were converted to zero.
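
The relationship between mask, loc_nans and loc_zeros can be illustrated with NumPy on a small array. This is only a sketch of the definitions given above, using hypothetical values; it is not the routine the library uses when building a tensor.

```python
import numpy as np

# Hypothetical raw values with one missing entry and one true zero
raw = np.array([[1.5, np.nan],
                [0.0, 2.0]])

# Ones where NaN values are located, zeros elsewhere
loc_nans = np.isnan(raw).astype(int)

# Mask: 0 where values are missing and 1 everywhere else
mask = 1 - loc_nans

# Missing values converted to zero, as needed before a factorization
filled = np.nan_to_num(raw, nan=0.0)

# Ones only where a zero is real, i.e. not a converted NaN
loc_zeros = ((filled == 0) & (loc_nans == 0)).astype(int)

print(loc_nans)
print(mask)
print(loc_zeros)
```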

elbow_metric : str Stores the metric used to perform the elbow analysis (y-axis).

    - 'error' : Normalized error to compute the elbow.
    - 'similarity' : Similarity based on CorrIndex (1-CorrIndex).

elbow_metric_mean : ndarray list Metric computed from the elbow analysis for each of the ranks evaluated. This list contains (X,Y) pairs where X values are the different ranks and Y values are the mean value of the metric employed. This mean is computed from multiple runs, or corresponds to the values of a single run. The metric could be the normalized error of the decomposition or the similarity between multiple runs with different initializations, based on the CorrIndex.

elbow_metric_raw : ndarray list Similar to elbow_metric_mean, but instead of containing (X, Y) pairs, it is an array of shape runs by ranks that were used for the analysis. It contains all the metrics for each run in each of the evaluated ranks.
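
The two elbow attributes are related by an average over runs. The sketch below assumes that ranks are evaluated from 1 upward and uses made-up metric values.

```python
import numpy as np

# Hypothetical raw metric: 2 runs (rows) by 3 evaluated ranks (columns)
elbow_metric_raw = np.array([[0.9, 0.6, 0.5],
                             [0.7, 0.4, 0.3]])

# Assumption: evaluated ranks start at 1
ranks = range(1, elbow_metric_raw.shape[1] + 1)

# (rank, mean metric across runs) pairs
elbow_metric_mean = [(r, float(m)) for r, m in zip(ranks, elbow_metric_raw.mean(axis=0))]
print(elbow_metric_mean)
```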

shape : tuple Shape of the tensor.

Source code in cell2cell/tensor/tensor.py
class BaseTensor():
    '''Empty base tensor class that contains the main functions for the Tensor
    Factorization of a Communication Tensor

    Attributes
    ----------
    communication_score : str
        Type of communication score to infer the potential use of a given ligand-
        receptor pair by a pair of cells/tissues/samples.
        Available communication_scores are:

        - 'expression_mean' : Computes the average between the expression of a ligand
                              from a sender cell and the expression of a receptor on a
                              receiver cell.
        - 'expression_product' : Computes the product between the expression of a
                                ligand from a sender cell and the expression of a
                                receptor on a receiver cell.
        - 'expression_gmean' : Computes the geometric mean between the expression
                                   of a ligand from a sender cell and the
                                   expression of a receptor on a receiver cell.

    how : str
        Approach to consider cell types and genes present across multiple contexts.

        - 'inner' : Considers only cell types and genes that are present in all
                    contexts (intersection).
        - 'outer' : Considers all cell types and genes that are present
                    across contexts (union).
        - 'outer_genes' : Considers only cell types that are present in all
                          contexts (intersection), while all genes that are
                          present across contexts (union).
        - 'outer_cells' : Considers only genes that are present in all
                          contexts (intersection), while all cell types that are
                          present across contexts (union).

    outer_fraction : float
        Threshold to filter the elements when `how` includes any outer option.
        Elements with a fraction abundance across samples (in `rnaseq_matrices`)
        at least this threshold will be included. When this value is 0, considers
        all elements across the samples. When this value is 1, it acts as using
        `how='inner'`.

    tensor : tensorly.tensor
        Tensor object created with the library tensorly.

    genes : list
        List of strings detailing the genes used through all contexts. Obtained
        depending on the attribute 'how'.

    cells : list
        List of strings detailing the cells used through all contexts. Obtained
        depending on the attribute 'how'.

    order_names : list
        List of lists containing the string names of each element in each of the
        dimensions or orders in the tensor. For a 4D-Communication tensor, the first
        list should contain the names of the contexts, the second the names of the
        ligand-receptor interactions, the third the names of the sender cells and the
        fourth the names of the receiver cells.

    order_labels : list
        List of labels for dimensions or orders in the tensor.

    tl_object : ndarray list
        A tensorly object containing a list of initialized factors of the tensor
        decomposition where element `i` is of shape (tensor.shape[i], rank).

    norm_tl_object : ndarray list
        A tensorly object containing a list of initialized factors of the tensor
        decomposition where element `i` is of shape (tensor.shape[i], rank). This
        results from normalizing the factor loadings of the tl_object.

    factors : dict
        Ordered dictionary containing a dataframe with the factor loadings for each
        dimension/order of the tensor.

    rank : int
        Rank of the Tensor Factorization (number of factors to deconvolve the original
        tensor).

    mask : ndarray list
        Helps to avoid missing values during a tensor factorization. A mask should be
        a boolean array of the same shape as the original tensor and should be 0
        where the values are missing and 1 everywhere else.

    explained_variance_ : float
        Explained variance score for a tensor factorization.

    explained_variance_ratio_ : ndarray list
        Percentage of variance explained by each of the factors. Only present when
        "normalize_loadings" is True. Otherwise, it is None.

    loc_nans : ndarray list
        An array of shape equal to `tensor` with ones where NaN values were assigned
        when building the tensor. Other values are zeros. It stores the
        location of the NaN values.

    loc_zeros : ndarray list
        An array of shape equal to `tensor` with ones where zeros that are not in
        `loc_nans` are located. Other values are assigned a zero. It tracks the
        real zero values rather than NaN values that were converted to zero.

    elbow_metric : str
        Stores the metric used to perform the elbow analysis (y-axis).

            - 'error' : Normalized error to compute the elbow.
            - 'similarity' : Similarity based on CorrIndex (1-CorrIndex).

    elbow_metric_mean : ndarray list
        Metric computed from the elbow analysis for each of the different ranks
        evaluated. This list contains (X,Y) pairs where X values are the
        different ranks and Y values are the mean value of the metric employed.
        This mean is computed from multiple runs, or the values for just one run.
        Metric could be the normalized error of the decomposition or the similarity
        between multiple runs with different initialization, based on the
        CorrIndex.

    elbow_metric_raw : ndarray list
        Similar to `elbow_metric_mean`, but instead of containing (X, Y) pairs,
        it is an array of shape runs by ranks that were used for the analysis.
        It contains all the metrics for each run in each of the evaluated ranks.

    shape : tuple
        Shape of the tensor.
    '''
    def __init__(self):
        # Save variables for this class
        self.communication_score = None
        self.how = None
        self.outer_fraction = None
        self.tensor = None
        self.genes = None
        self.cells = None
        self.order_names = [None, None, None, None]
        self.order_labels = None
        self.tl_object = None
        self.norm_tl_object = None
        self.factors = None
        self.rank = None
        self.mask = None
        self.explained_variance_ = None
        self.explained_variance_ratio_ = None
        self.loc_nans = None
        self.loc_zeros = None
        self.elbow_metric = None
        self.elbow_metric_mean = None
        self.elbow_metric_raw = None

    def copy(self):
        '''Performs a deep copy of this object.'''
        import copy
        return copy.deepcopy(self)

    @property
    def shape(self):
        '''Returns the shape of the tensor'''
        if hasattr(self.tensor, 'shape'):
            return self.tensor.shape
        else:
            return ()

    def write_file(self, filename):
        '''Exports this object into a pickle file.

        Parameters
        ----------
        filename : str
            Complete path to the file wherein the variable will be
            stored. For example:
            /home/user/variable.pkl
        '''
        from cell2cell.io.save_data import export_variable_with_pickle
        export_variable_with_pickle(self, filename=filename)

    def to_device(self, device):
        '''Changes the device where the tensor
        is analyzed.

        Parameters
        ----------
        device : str
            Device name to use for the decomposition.
            Options could be 'cpu', 'cuda', 'gpu', depending on
            the backend used with tensorly.
        '''
        try:
            self.tensor = tl.tensor(self.tensor, device=device)
            if self.mask is not None:
                self.mask = tl.tensor(self.mask, device=device)
        except Exception:
            print('Device is either not available or the backend used with tensorly does not support this device. '
                  'Try changing it with tensorly.set_backend("<backend_name>") before.')
            self.tensor = tl.tensor(self.tensor)
            if self.mask is not None:
                self.mask = tl.tensor(self.mask)

    def compute_tensor_factorization(self, rank, tf_type='non_negative_cp', init='svd', svd='numpy_svd', random_state=None,
                                     runs=1, normalize_loadings=True, var_ordered_factors=True, n_iter_max=100, tol=10e-7,
                                     verbose=False, **kwargs):
        '''Performs a Tensor Factorization.
        Nothing is returned; instead, the attributes `factors` and `rank`
        of the Tensor class are updated.

        Parameters
        ----------
        rank : int
            Rank of the Tensor Factorization (number of factors to deconvolve the original
            tensor).

        tf_type : str, default='non_negative_cp'
            Type of Tensor Factorization.

            - 'non_negative_cp' : Non-negative PARAFAC through the traditional ALS.
            - 'non_negative_cp_hals' : Non-negative PARAFAC through the Hierarchical ALS.
                                       It reaches an optimal solution faster than the
                                       traditional ALS, but it does not allow a mask.
            - 'parafac' : PARAFAC through the traditional ALS. It allows negative loadings.
            - 'constrained_parafac' : PARAFAC through the traditional ALS. It allows
                                      negative loadings. Also, it incorporates L1 and L2
                                      regularization, includes a 'non_negative' option, and
                                      allows constraining the sparsity of the decomposition.
                                      For more information, see
                                      http://tensorly.org/stable/modules/generated/tensorly.decomposition.constrained_parafac.html#tensorly.decomposition.constrained_parafac


        init : str, default='svd'
            Initialization method for computing the Tensor Factorization.
            {'svd', 'random'}

        svd : str, default='numpy_svd'
            Function to use to compute the SVD, acceptable values in tensorly.SVD_FUNS

        random_state : int, default=None
            Seed for randomization.

        runs : int, default=1
            Number of models to choose among and find the lowest error.
            This helps to avoid local minima when using runs > 1.

        normalize_loadings : boolean, default=True
            Whether to normalize the loadings in each factor to unit
            Euclidean length.

        var_ordered_factors : boolean, default=True
            Whether to order factors by the variance they explain. The order is from
            highest to lowest variance. `normalize_loadings` must be True. Otherwise,
            this parameter is ignored.

        tol : float, default=10e-7
            Tolerance for the decomposition algorithm to stop when the variation in
            the reconstruction error is less than the tolerance. Lower `tol` helps
            to improve the solution obtained from the decomposition, but it takes
            longer to run.

        n_iter_max : int, default=100
            Maximum number of iterations to reach an optimal solution with the
            decomposition algorithm. Higher `n_iter_max` helps to improve the solution
            obtained from the decomposition, but it takes longer to run.

        verbose : boolean, default=False
            Whether to print the steps of the analysis.

        **kwargs : dict
            Extra arguments for the tensor factorization according to inputs in tensorly.
        '''
        tensor_dim = len(self.tensor.shape)
        best_err = np.inf
        tf = None

        if kwargs is None:
            kwargs = {'return_errors' : True}
        else:
            kwargs['return_errors'] = True

        for run in tqdm(range(runs), disable=(runs==1)):
            if random_state is not None:
                random_state_ = random_state + run
            else:
                random_state_ = None
            local_tf, errors = _compute_tensor_factorization(tensor=self.tensor,
                                                             rank=rank,
                                                             tf_type=tf_type,
                                                             init=init,
                                                             svd=svd,
                                                             random_state=random_state_,
                                                             mask=self.mask,
                                                             n_iter_max=n_iter_max,
                                                             tol=tol,
                                                             verbose=verbose,
                                                             **kwargs)
            # This helps to obtain proper error when the mask is not None.
            if self.mask is None:
                err = tl.to_numpy(errors[-1])
                if best_err > err:
                    best_err = err
                    tf = local_tf
            else:
                err = _compute_norm_error(self.tensor, local_tf, self.mask)
                if best_err > err:
                    best_err = err
                    tf = local_tf

        if runs > 1:
            print('Best model has a normalized error of: {0:.3f}'.format(best_err))

        self.tl_object = tf
        if normalize_loadings:
            self.norm_tl_object = tl.cp_tensor.cp_normalize(self.tl_object)

        factor_names = ['Factor {}'.format(i) for i in range(1, rank+1)]
        if self.order_labels is None:
            if tensor_dim == 4:
                order_labels = ['Contexts', 'Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']
            elif tensor_dim > 4:
                order_labels = ['Contexts-{}'.format(i+1) for i in range(tensor_dim-3)] + ['Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']
            elif tensor_dim == 3:
                order_labels = ['Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']
            else:
                raise ValueError('Too few dimensions in the tensor')
        else:
            assert len(self.order_labels) == tensor_dim, "The length of order_labels must match the number of orders/dimensions in the tensor"
            order_labels = self.order_labels

        if normalize_loadings:
            (weights, factors) = self.norm_tl_object
            weights = tl.to_numpy(weights)
            if var_ordered_factors:
                w_order = weights.argsort()[::-1]
                factors = [tl.to_numpy(f)[:, w_order] for f in factors]
                self.explained_variance_ratio_ = weights[w_order] / sum(weights)
            else:
                factors = [tl.to_numpy(f) for f in factors]
                self.explained_variance_ratio_ = weights / sum(weights)

        else:
            (weights, factors) = self.tl_object
            self.explained_variance_ratio_ = None

        self.explained_variance_ = self.explained_variance()

        self.factors = OrderedDict(zip(order_labels,
                                       [pd.DataFrame(tl.to_numpy(f), index=idx, columns=factor_names) for f, idx in zip(factors, self.order_names)]))
        self.rank = rank

    def elbow_rank_selection(self, upper_rank=50, runs=20, tf_type='non_negative_cp', init='random', svd='numpy_svd',
                             metric='error', random_state=None, n_iter_max=100, tol=10e-7, automatic_elbow=True,
                             manual_elbow=None, smooth=False, mask=None, ci='std', figsize=(4, 2.25), fontsize=14,
                             filename=None, output_fig=True, verbose=False, **kwargs):
        '''Elbow analysis on the error achieved by the Tensor Factorization for
        selecting the number of factors to use. A plot is made with the results.

        Parameters
        ----------
        upper_rank : int, default=50
            Upper bound of ranks to explore with the elbow analysis.

        runs : int, default=20
            Number of tensor factorization performed for a given rank. Each
            factorization varies in the seed of initialization.

        tf_type : str, default='non_negative_cp'
            Type of Tensor Factorization.

            - 'non_negative_cp' : Non-negative PARAFAC through the traditional ALS.
            - 'non_negative_cp_hals' : Non-negative PARAFAC through the Hierarchical ALS.
                                       It reaches an optimal solution faster than the
                                       traditional ALS, but it does not allow a mask.
            - 'parafac' : PARAFAC through the traditional ALS. It allows negative loadings.
            - 'constrained_parafac' : PARAFAC through the traditional ALS. It allows
                                      negative loadings. Also, it incorporates L1 and L2
                                      regularization, includes a 'non_negative' option, and
                                      allows constraining the sparsity of the decomposition.
                                      For more information, see
                                      http://tensorly.org/stable/modules/generated/tensorly.decomposition.constrained_parafac.html#tensorly.decomposition.constrained_parafac

        init : str, default='random'
            Initialization method for computing the Tensor Factorization.
            {'svd', 'random'}

        svd : str, default='numpy_svd'
            Function to compute the SVD, acceptable values in tensorly.SVD_FUNS

        metric : str, default='error'
            Metric to perform the elbow analysis (y-axis)

            - 'error' : Normalized error to compute the elbow.
            - 'similarity' : Similarity based on CorrIndex (1-CorrIndex).

        random_state : int, default=None
            Seed for randomization.

        tol : float, default=10e-7
            Tolerance for the decomposition algorithm to stop when the variation in
            the reconstruction error is less than the tolerance. Lower `tol` helps
            to improve the solution obtained from the decomposition, but it takes
            longer to run.

        n_iter_max : int, default=100
            Maximum number of iterations to reach an optimal solution with the
            decomposition algorithm. Higher `n_iter_max` helps to improve the solution
            obtained from the decomposition, but it takes longer to run.

        automatic_elbow : boolean, default=True
            Whether to use an automatic strategy to find the elbow. If True, the method
            implemented by the package kneed is used.

        manual_elbow : int, default=None
            Rank or number of factors to highlight in the curve of error achieved by
            the Tensor Factorization. This input is considered only when
            `automatic_elbow=False`.

        smooth : boolean, default=False
            Whether to smooth the curve with a Savitzky-Golay filter.

        mask : ndarray list, default=None
            Helps to avoid missing values during a tensor factorization. A mask should be
            a boolean array of the same shape as the original tensor and should be 0
            where the values are missing and 1 everywhere else.

        ci : str, default='std'
            Confidence interval for representing the multiple runs in each rank.
            {'std', '95%'}

        figsize : tuple, default=(4, 2.25)
            Figure size, width by height

        fontsize : int, default=14
            Fontsize for axis labels.

        filename : str, default=None
            Path to save the figure of the elbow analysis. If None, the figure is not
            saved.

        output_fig : boolean, default=True
            Whether to generate the figure with matplotlib.

        verbose : boolean, default=False
            Whether to print the steps of the analysis.

        **kwargs : dict
            Extra arguments for the tensor factorization according to inputs in
            tensorly.

        Returns
        -------
        fig : matplotlib.figure.Figure
            Figure object made with matplotlib

        loss : list
            List of normalized errors for each rank. Here the errors are the average
            across distinct runs for each rank.
        '''
        assert metric in ['similarity', 'error'], "`metric` must be either 'similarity' or 'error'"
        ylabel = {'similarity' : 'Similarity\n(1-CorrIndex)', 'error' : 'Normalized Error'}

        # Run analysis
        if verbose:
            print('Running Elbow Analysis')

        if mask is None:
            if self.mask is not None:
                mask = self.mask

        if metric == 'similarity':
            assert runs > 1, "`runs` must be greater than 1 when `metric` = 'similarity'"
        if runs == 1:
            loss = _run_elbow_analysis(tensor=self.tensor,
                                       upper_rank=upper_rank,
                                       tf_type=tf_type,
                                       init=init,
                                       svd=svd,
                                       random_state=random_state,
                                       mask=mask,
                                       n_iter_max=n_iter_max,
                                       tol=tol,
                                       verbose=verbose,
                                       **kwargs
                                       )
            loss = [(l[0], l[1].item()) for l in loss]
            all_loss = np.array([[l[1] for l in loss]])
            if automatic_elbow:
                if smooth:
                    loss_ = [l[1] for l in loss]
                    loss = smooth_curve(loss_)
                    loss = [(i + 1, l) for i, l in enumerate(loss)]
                rank = int(_compute_elbow(loss))
            else:
                rank = manual_elbow
            if output_fig:
                fig = plot_elbow(loss=loss,
                                 elbow=rank,
                                 figsize=figsize,
                                 ylabel=ylabel[metric],
                                 fontsize=fontsize,
                                 filename=filename)
            else:
                fig = None
        elif runs > 1:
            all_loss = _multiple_runs_elbow_analysis(tensor=self.tensor,
                                                     upper_rank=upper_rank,
                                                     runs=runs,
                                                     tf_type=tf_type,
                                                     init=init,
                                                     svd=svd,
                                                     metric=metric,
                                                     random_state=random_state,
                                                     mask=mask,
                                                     n_iter_max=n_iter_max,
                                                     tol=tol,
                                                     verbose=verbose,
                                                     **kwargs
                                                     )

            # Same outputs as runs = 1
            loss = np.nanmean(all_loss, axis=0).tolist()
            if smooth:
                loss = smooth_curve(loss)
            loss = [(i + 1, l) for i, l in enumerate(loss)]

            if automatic_elbow:
                rank = int(_compute_elbow(loss))
            else:
                rank = manual_elbow

            if output_fig:
                fig = plot_multiple_run_elbow(all_loss=all_loss,
                                              ci=ci,
                                              elbow=rank,
                                              figsize=figsize,
                                              ylabel=ylabel[metric],
                                              smooth=smooth,
                                              fontsize=fontsize,
                                              filename=filename)
            else:
                fig = None

        else:
            assert runs > 0, "Input runs must be an integer greater than 0"

        # Store results
        self.rank = rank
        self.elbow_metric = metric
        self.elbow_metric_mean = loss
        self.elbow_metric_raw = all_loss

        if self.rank is not None:
            assert(isinstance(rank, int)), 'rank must be an integer.'
            print('The rank at the elbow is: {}'.format(self.rank))
        return fig, loss

    def get_top_factor_elements(self, order_name, factor_name, top_number=10):
        '''Obtains the top-elements with higher loadings for a given factor

        Parameters
        ----------
        order_name : str
            Name of the dimension/order in the tensor according to the keys of the
            dictionary in BaseTensor.factors. The attribute factors is built once the
            tensor factorization is run.

        factor_name : str
            Name of one of the factors. E.g., 'Factor 1'

        top_number : int, default=10
            Number of top-elements to return

        Returns
        -------
        top_elements : pandas.DataFrame
            A dataframe with the loadings of the top-elements for the given factor.
        '''
        top_elements = self.factors[order_name][factor_name].sort_values(ascending=False).head(top_number)
        return top_elements

    def export_factor_loadings(self, filename):
        '''Exports the factor loadings of the tensor into an Excel file

        Parameters
        ----------
        filename : str
            Full path and filename to store the file. E.g., '/home/user/Loadings.xlsx'
        '''
        writer = pd.ExcelWriter(filename)
        for k, v in self.factors.items():
            v.to_excel(writer, sheet_name=k)
        writer.close()
        print('Loadings of the tensor factorization were successfully saved into {}'.format(filename))

    def excluded_value_fraction(self):
        '''Returns the fraction of excluded values in the tensor,
        given the values that are masked in tensor.mask

        Returns
        -------
        excluded_fraction : float
            Fraction of missing/excluded values in the tensor.
        '''
        if self.mask is None:
            print("The interaction tensor does not have masked values")
            return 0.0
        else:
            fraction = tl.sum(self.mask) / tl.prod(tl.tensor(self.tensor.shape))
            excluded_fraction = 1.0 - fraction.item()
            return excluded_fraction

    def sparsity_fraction(self):
        '''Returns the fraction of values that are zeros in the tensor,
        given the values that are in tensor.loc_zeros

        Returns
        -------
        sparsity_fraction : float
            Fraction of values that are real zeros.
        '''
        if self.loc_zeros is None:
            print("The interaction tensor does not have zeros")
            return 0.0
        else:
            sparsity_fraction = tl.sum(self.loc_zeros) / tl.prod(tl.tensor(self.tensor.shape))
        sparsity_fraction = sparsity_fraction.item()
        return sparsity_fraction

    def missing_fraction(self):
        '''Returns the fraction of values that are missing (NaNs) in the tensor,
        given the values that are in tensor.loc_nans

        Returns
        -------
        missing_fraction : float
            Fraction of values that are missing (NaNs).
        '''
        if self.loc_nans is None:
            print("The interaction tensor does not have missing values")
            return 0.0
        else:
            missing_fraction = tl.sum(self.loc_nans) / tl.prod(tl.tensor(self.tensor.shape))
        missing_fraction = missing_fraction.item()
        return missing_fraction

    def explained_variance(self):
        '''Computes the explained variance score for a tensor decomposition. Inspired by the
        function in sklearn.metrics.explained_variance_score.

        Returns
        -------
        explained_variance : float
            Explained variance score for a tensor factorization.
        '''
        assert self.tl_object is not None, "Must run compute_tensor_factorization before using this method."
        tensor = self.tensor
        rec_tensor = self.tl_object.to_tensor()
        mask = self.mask

        if mask is not None:
            tensor = tensor * mask
            rec_tensor = rec_tensor * mask

        y_diff_avg = tl.mean(tensor - rec_tensor)
        numerator = tl.norm(tensor - rec_tensor - y_diff_avg)

        tensor_avg = tl.mean(tensor)
        denominator = tl.norm(tensor - tensor_avg)

        if denominator == 0.:
            explained_variance = 0.0
        else:
            explained_variance =  1. - (numerator / denominator)
            explained_variance = explained_variance.item()
        return explained_variance
shape (read-only property)

Returns the shape of the tensor

compute_tensor_factorization(self, rank, tf_type='non_negative_cp', init='svd', svd='numpy_svd', random_state=None, runs=1, normalize_loadings=True, var_ordered_factors=True, n_iter_max=100, tol=1e-06, verbose=False, **kwargs)

Performs a Tensor Factorization. Nothing is returned; instead, the attributes factors and rank of the Tensor class are updated.

Parameters

rank : int Rank of the Tensor Factorization (number of factors to deconvolve the original tensor).

tf_type : str, default='non_negative_cp' Type of Tensor Factorization.

- 'non_negative_cp' : Non-negative PARAFAC through the traditional ALS.
- 'non_negative_cp_hals' : Non-negative PARAFAC through the Hierarchical ALS.
                           It reaches an optimal solution faster than the
                           traditional ALS, but it does not allow a mask.
- 'parafac' : PARAFAC through the traditional ALS. It allows negative loadings.
- 'constrained_parafac' : PARAFAC through the traditional ALS. It allows
                          negative loadings. Also, it incorporates L1 and L2
                          regularization, includes a 'non_negative' option, and
                          allows constraining the sparsity of the decomposition.
                          For more information, see
                          http://tensorly.org/stable/modules/generated/tensorly.decomposition.constrained_parafac.html#tensorly.decomposition.constrained_parafac

init : str, default='svd' Initialization method for computing the Tensor Factorization.

svd : str, default='numpy_svd' Function to use to compute the SVD, acceptable values in tensorly.SVD_FUNS

random_state : int, default=None Seed for randomization.

runs : int, default=1 Number of models to fit; the one with the lowest error is kept. Using runs > 1 helps to avoid local minima.

normalize_loadings : boolean, default=True Whether to normalize the loadings in each factor to unit Euclidean length.

var_ordered_factors : boolean, default=True Whether to order factors by the variance they explain. The order is from highest to lowest variance. normalize_loadings must be True. Otherwise, this parameter is ignored.

tol : float, default=10e-7 Tolerance for the decomposition algorithm to stop when the variation in the reconstruction error is less than the tolerance. Lower tol helps to improve the solution obtained from the decomposition, but it takes longer to run.

n_iter_max : int, default=100 Maximum number of iterations to reach an optimal solution with the decomposition algorithm. Higher n_iter_max helps to improve the solution obtained from the decomposition, but it takes longer to run.

verbose : boolean, default=False Whether to print the steps of the analysis.

**kwargs : dict Extra arguments for the tensor factorization according to inputs in tensorly.
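
As a rough illustration of what normalize_loadings and var_ordered_factors amount to, the sketch below uses plain NumPy rather than tensorly. Each factor column is scaled to unit Euclidean length, the scaling constants are absorbed into one weight per factor, and factors are then sorted by decreasing weight (the weights stand in for explained variance, as in the source listing). The function name `normalize_and_order` and the toy matrices are hypothetical and only serve the example.

```python
import numpy as np

def normalize_and_order(factors):
    """Normalize factor columns to unit length and order them by weight.

    `factors` is a list of 2-D arrays, one per tensor dimension, all with
    the same number of columns (the rank). Hypothetical helper, not part
    of cell2cell or tensorly.
    """
    rank = factors[0].shape[1]
    weights = np.ones(rank)
    normalized = []
    for f in factors:
        norms = np.linalg.norm(f, axis=0)   # Euclidean length of each column
        norms[norms == 0] = 1.0             # leave all-zero columns untouched
        weights *= norms                    # absorb the scaling into the weights
        normalized.append(f / norms)
    order = weights.argsort()[::-1]         # highest weight first
    ordered = [f[:, order] for f in normalized]
    ratio = weights[order] / weights.sum()
    return ordered, ratio

# Toy loadings for two dimensions and two factors (made-up numbers).
contexts = np.array([[3.0, 0.0], [4.0, 2.0]])
lr_pairs = np.array([[1.0, 6.0], [0.0, 8.0]])
ordered, ratio = normalize_and_order([contexts, lr_pairs])
print(ratio)
```

In this toy case the second factor carries the larger weight, so it is moved to the first column after ordering.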

Source code in cell2cell/tensor/tensor.py
def compute_tensor_factorization(self, rank, tf_type='non_negative_cp', init='svd', svd='numpy_svd', random_state=None,
                                 runs=1, normalize_loadings=True, var_ordered_factors=True, n_iter_max=100, tol=10e-7,
                                 verbose=False, **kwargs):
    '''Performs a Tensor Factorization.
    Nothing is returned; instead, the attributes factors and rank
    of the Tensor class are updated.

    Parameters
    ----------
    rank : int
        Rank of the Tensor Factorization (number of factors to deconvolve the original
        tensor).

    tf_type : str, default='non_negative_cp'
        Type of Tensor Factorization.

        - 'non_negative_cp' : Non-negative PARAFAC through the traditional ALS.
        - 'non_negative_cp_hals' : Non-negative PARAFAC through the Hierarchical ALS.
                                   It reaches an optimal solution faster than the
                                   traditional ALS, but it does not allow a mask.
        - 'parafac' : PARAFAC through the traditional ALS. It allows negative loadings.
        - 'constrained_parafac' : PARAFAC through the traditional ALS. It allows
                                  negative loadings. Also, it incorporates L1 and L2
                                  regularization, includes a 'non_negative' option, and
                                  allows constraining the sparsity of the decomposition.
                                  For more information, see
                                  http://tensorly.org/stable/modules/generated/tensorly.decomposition.constrained_parafac.html#tensorly.decomposition.constrained_parafac


    init : str, default='svd'
        Initialization method for computing the Tensor Factorization.
        {‘svd’, ‘random’}

    svd : str, default='numpy_svd'
        Function to use to compute the SVD, acceptable values in tensorly.SVD_FUNS

    random_state : int, default=None
        Seed for randomization.

    runs : int, default=1
        Number of models to fit; the one with the lowest error is kept.
        Using runs > 1 helps to avoid local minima.

    normalize_loadings : boolean, default=True
        Whether to normalize the loadings in each factor to unit
        Euclidean length.

    var_ordered_factors : boolean, default=True
        Whether to order factors by the variance they explain. The order is from
        highest to lowest variance. `normalize_loadings` must be True. Otherwise,
        this parameter is ignored.

    tol : float, default=10e-7
        Tolerance for the decomposition algorithm to stop when the variation in
        the reconstruction error is less than the tolerance. Lower `tol` helps
        to improve the solution obtained from the decomposition, but it takes
        longer to run.

    n_iter_max : int, default=100
        Maximum number of iterations to reach an optimal solution with the
        decomposition algorithm. Higher `n_iter_max` helps to improve the solution
        obtained from the decomposition, but it takes longer to run.

    verbose : boolean, default=False
        Whether to print the steps of the analysis.

    **kwargs : dict
        Extra arguments for the tensor factorization according to inputs in tensorly.
    '''
    tensor_dim = len(self.tensor.shape)
    best_err = np.inf
    tf = None

    if kwargs is None:
        kwargs = {'return_errors' : True}
    else:
        kwargs['return_errors'] = True

    for run in tqdm(range(runs), disable=(runs==1)):
        if random_state is not None:
            random_state_ = random_state + run
        else:
            random_state_ = None
        local_tf, errors = _compute_tensor_factorization(tensor=self.tensor,
                                                         rank=rank,
                                                         tf_type=tf_type,
                                                         init=init,
                                                         svd=svd,
                                                         random_state=random_state_,
                                                         mask=self.mask,
                                                         n_iter_max=n_iter_max,
                                                         tol=tol,
                                                         verbose=verbose,
                                                         **kwargs)
        # This helps to obtain proper error when the mask is not None.
        if self.mask is None:
            err = tl.to_numpy(errors[-1])
            if best_err > err:
                best_err = err
                tf = local_tf
        else:
            err = _compute_norm_error(self.tensor, local_tf, self.mask)
            if best_err > err:
                best_err = err
                tf = local_tf

    if runs > 1:
        print('Best model has a normalized error of: {0:.3f}'.format(best_err))

    self.tl_object = tf
    if normalize_loadings:
        self.norm_tl_object = tl.cp_tensor.cp_normalize(self.tl_object)

    factor_names = ['Factor {}'.format(i) for i in range(1, rank+1)]
    if self.order_labels is None:
        if tensor_dim == 4:
            order_labels = ['Contexts', 'Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']
        elif tensor_dim > 4:
            order_labels = ['Contexts-{}'.format(i+1) for i in range(tensor_dim-3)] + ['Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']
        elif tensor_dim == 3:
            order_labels = ['Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']
        else:
            raise ValueError('Too few dimensions in the tensor')
    else:
        assert len(self.order_labels) == tensor_dim, "The length of order_labels must match the number of orders/dimensions in the tensor"
        order_labels = self.order_labels

    if normalize_loadings:
        (weights, factors) = self.norm_tl_object
        weights = tl.to_numpy(weights)
        if var_ordered_factors:
            w_order = weights.argsort()[::-1]
            factors = [tl.to_numpy(f)[:, w_order] for f in factors]
            self.explained_variance_ratio_ = weights[w_order] / sum(weights)
        else:
            factors = [tl.to_numpy(f) for f in factors]
            self.explained_variance_ratio_ = weights / sum(weights)

    else:
        (weights, factors) = self.tl_object
        self.explained_variance_ratio_ = None

    self.explained_variance_ = self.explained_variance()

    self.factors = OrderedDict(zip(order_labels,
                                   [pd.DataFrame(tl.to_numpy(f), index=idx, columns=factor_names) for f, idx in zip(factors, self.order_names)]))
    self.rank = rank
copy(self)

Performs a deep copy of this object.
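
To illustrate why a deep copy is useful here, the sketch below uses a hypothetical stand-in class (`ToyTensor`, not part of cell2cell) that holds nested state. Modifying the copy leaves the original untouched, which a shallow copy would not guarantee for nested containers.

```python
import copy

class ToyTensor:
    """Hypothetical stand-in for a tensor object with nested attributes."""
    def __init__(self):
        self.factors = {'Contexts': [0.1, 0.9]}
        self.rank = None

    def copy(self):
        # Same approach as the method documented above: copy.deepcopy on self.
        return copy.deepcopy(self)

original = ToyTensor()
duplicate = original.copy()

# Changes to the duplicate do not propagate back to the original.
duplicate.factors['Contexts'][0] = 5.0
duplicate.rank = 3
print(original.factors, original.rank)
```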

Source code in cell2cell/tensor/tensor.py
def copy(self):
    '''Performs a deep copy of this object.'''
    import copy
    return copy.deepcopy(self)
elbow_rank_selection(self, upper_rank=50, runs=20, tf_type='non_negative_cp', init='random', svd='numpy_svd', metric='error', random_state=None, n_iter_max=100, tol=1e-06, automatic_elbow=True, manual_elbow=None, smooth=False, mask=None, ci='std', figsize=(4, 2.25), fontsize=14, filename=None, output_fig=True, verbose=False, **kwargs)

Elbow analysis on the error achieved by the Tensor Factorization for selecting the number of factors to use. A plot is made with the results.

Parameters

upper_rank : int, default=50 Upper bound of ranks to explore with the elbow analysis.

runs : int, default=20 Number of tensor factorizations performed for a given rank. Each factorization varies in the seed of initialization.

tf_type : str, default='non_negative_cp' Type of Tensor Factorization.

- 'non_negative_cp' : Non-negative PARAFAC through the traditional ALS.
- 'non_negative_cp_hals' : Non-negative PARAFAC through the Hierarchical ALS.
                           It reaches an optimal solution faster than the
                           traditional ALS, but it does not allow a mask.
- 'parafac' : PARAFAC through the traditional ALS. It allows negative loadings.
- 'constrained_parafac' : PARAFAC through the traditional ALS. It allows
                          negative loadings. Also, it incorporates L1 and L2
                          regularization, includes a 'non_negative' option, and
                          allows constraining the sparsity of the decomposition.
                          For more information, see
                          http://tensorly.org/stable/modules/generated/tensorly.decomposition.constrained_parafac.html#tensorly.decomposition.constrained_parafac

init : str, default='random' Initialization method for computing the Tensor Factorization.

svd : str, default='numpy_svd' Function to compute the SVD, acceptable values in tensorly.SVD_FUNS

metric : str, default='error' Metric to perform the elbow analysis (y-axis)

- 'error' : Normalized error to compute the elbow.
- 'similarity' : Similarity based on CorrIndex (1-CorrIndex).

random_state : int, default=None Seed for randomization.

tol : float, default=10e-7 Tolerance for the decomposition algorithm to stop when the variation in the reconstruction error is less than the tolerance. Lower tol helps to improve the solution obtained from the decomposition, but it takes longer to run.

n_iter_max : int, default=100 Maximum number of iterations to reach an optimal solution with the decomposition algorithm. Higher n_iter_max helps to improve the solution obtained from the decomposition, but it takes longer to run.

automatic_elbow : boolean, default=True Whether to use an automatic strategy to find the elbow. If True, the method implemented by the package kneed is used.

manual_elbow : int, default=None Rank or number of factors to highlight in the curve of error achieved by the Tensor Factorization. This input is considered only when automatic_elbow=False.

smooth : boolean, default=False Whether to smooth the curve with a Savitzky-Golay filter.

mask : ndarray list, default=None Helps to avoid missing values during a tensor factorization. A mask should be a boolean array of the same shape as the original tensor and should be 0 where the values are missing and 1 everywhere else.

ci : str, default='std' Confidence interval for representing the multiple runs in each rank.

figsize : tuple, default=(4, 2.25) Figure size, width by height

fontsize : int, default=14 Fontsize for axis labels.

filename : str, default=None Path to save the figure of the elbow analysis. If None, the figure is not saved.

output_fig : boolean, default=True Whether to generate the figure with matplotlib.

verbose : boolean, default=False Whether to print the steps of the analysis.

**kwargs : dict Extra arguments for the tensor factorization according to inputs in tensorly.

Returns

fig : matplotlib.figure.Figure Figure object made with matplotlib

loss : list List of normalized errors for each rank. Here the errors are the average across distinct runs for each rank.
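
The multi-run flow can be sketched roughly as follows: errors from several runs are averaged per rank while ignoring NaNs, and an elbow is then located on the averaged curve. The elbow rule used here (the point farthest from the straight line joining the first and last points) is a simplified stand-in and not the method implemented in kneed. The function names and the error values are made up for illustration.

```python
import numpy as np

def mean_loss_per_rank(all_loss):
    """Average errors across runs; rows are runs, columns are ranks 1..N."""
    mean = np.nanmean(np.asarray(all_loss, dtype=float), axis=0)
    return [(i + 1, float(l)) for i, l in enumerate(mean)]

def farthest_point_elbow(loss):
    """Return the rank whose point lies farthest from the first-to-last chord.

    Assumes at least two ranks and a curve that is not perfectly flat.
    """
    x = np.array([r for r, _ in loss], dtype=float)
    y = np.array([l for _, l in loss], dtype=float)
    # Rescale both axes to [0, 1] so that distances are comparable.
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    distance = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(loss[int(distance.argmax())][0])

# Two hypothetical runs over ranks 1 to 5; one value is missing.
all_loss = [[0.90, 0.50, 0.30, 0.28, 0.27],
            [0.94, 0.54, np.nan, 0.26, 0.25]]
loss = mean_loss_per_rank(all_loss)
elbow = farthest_point_elbow(loss)
print(loss, elbow)
```

The list of (rank, value) tuples mirrors the shape of the `loss` output built in the source listing.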

Source code in cell2cell/tensor/tensor.py
def elbow_rank_selection(self, upper_rank=50, runs=20, tf_type='non_negative_cp', init='random', svd='numpy_svd',
                         metric='error', random_state=None, n_iter_max=100, tol=10e-7, automatic_elbow=True,
                         manual_elbow=None, smooth=False, mask=None, ci='std', figsize=(4, 2.25), fontsize=14,
                         filename=None, output_fig=True, verbose=False, **kwargs):
    '''Elbow analysis on the error achieved by the Tensor Factorization for
    selecting the number of factors to use. A plot is made with the results.

    Parameters
    ----------
    upper_rank : int, default=50
        Upper bound of ranks to explore with the elbow analysis.

    runs : int, default=20
        Number of tensor factorizations performed for a given rank. Each
        factorization varies in the seed of initialization.

    tf_type : str, default='non_negative_cp'
        Type of Tensor Factorization.

        - 'non_negative_cp' : Non-negative PARAFAC through the traditional ALS.
        - 'non_negative_cp_hals' : Non-negative PARAFAC through the Hierarchical ALS.
                                   It reaches an optimal solution faster than the
                                   traditional ALS, but it does not allow a mask.
        - 'parafac' : PARAFAC through the traditional ALS. It allows negative loadings.
        - 'constrained_parafac' : PARAFAC through the traditional ALS. It allows
                                  negative loadings. Also, it incorporates L1 and L2
                                  regularization, includes a 'non_negative' option, and
                                  allows constraining the sparsity of the decomposition.
                                  For more information, see
                                  http://tensorly.org/stable/modules/generated/tensorly.decomposition.constrained_parafac.html#tensorly.decomposition.constrained_parafac

    init : str, default='random'
        Initialization method for computing the Tensor Factorization.
        {‘svd’, ‘random’}

    svd : str, default='numpy_svd'
        Function to compute the SVD, acceptable values in tensorly.SVD_FUNS

    metric : str, default='error'
        Metric to perform the elbow analysis (y-axis)

        - 'error' : Normalized error to compute the elbow.
        - 'similarity' : Similarity based on CorrIndex (1-CorrIndex).

    random_state : int, default=None
        Seed for randomization.

    tol : float, default=10e-7
        Tolerance for the decomposition algorithm to stop when the variation in
        the reconstruction error is less than the tolerance. Lower `tol` helps
        to improve the solution obtained from the decomposition, but it takes
        longer to run.

    n_iter_max : int, default=100
        Maximum number of iterations to reach an optimal solution with the
        decomposition algorithm. Higher `n_iter_max` helps to improve the solution
        obtained from the decomposition, but it takes longer to run.

    automatic_elbow : boolean, default=True
        Whether to use an automatic strategy to find the elbow. If True, the method
        implemented by the package kneed is used.

    manual_elbow : int, default=None
        Rank or number of factors to highlight in the curve of error achieved by
        the Tensor Factorization. This input is considered only when
        `automatic_elbow=False`.

    smooth : boolean, default=False
        Whether to smooth the curve with a Savitzky-Golay filter.

    mask : ndarray list, default=None
        Helps to avoid missing values during a tensor factorization. A mask should be
        a boolean array of the same shape as the original tensor and should be 0
        where the values are missing and 1 everywhere else.

    ci : str, default='std'
        Confidence interval for representing the multiple runs in each rank.
        {'std', '95%'}

    figsize : tuple, default=(4, 2.25)
        Figure size, width by height

    fontsize : int, default=14
        Fontsize for axis labels.

    filename : str, default=None
        Path to save the figure of the elbow analysis. If None, the figure is not
        saved.

    output_fig : boolean, default=True
        Whether to generate the figure with matplotlib.

    verbose : boolean, default=False
        Whether to print the steps of the analysis.

    **kwargs : dict
        Extra arguments for the tensor factorization according to inputs in
        tensorly.

    Returns
    -------
    fig : matplotlib.figure.Figure
        Figure object made with matplotlib

    loss : list
        List of normalized errors for each rank. Here the errors are the average
        across distinct runs for each rank.
    '''
    assert metric in ['similarity', 'error'], "`metric` must be either 'similarity' or 'error'"
    ylabel = {'similarity' : 'Similarity\n(1-CorrIndex)', 'error' : 'Normalized Error'}

    # Run analysis
    if verbose:
        print('Running Elbow Analysis')

    if mask is None:
        if self.mask is not None:
            mask = self.mask

    if metric == 'similarity':
        assert runs > 1, "`runs` must be greater than 1 when `metric` = 'similarity'"
    if runs == 1:
        loss = _run_elbow_analysis(tensor=self.tensor,
                                   upper_rank=upper_rank,
                                   tf_type=tf_type,
                                   init=init,
                                   svd=svd,
                                   random_state=random_state,
                                   mask=mask,
                                   n_iter_max=n_iter_max,
                                   tol=tol,
                                   verbose=verbose,
                                   **kwargs
                                   )
        loss = [(l[0], l[1].item()) for l in loss]
        all_loss = np.array([[l[1] for l in loss]])
        if automatic_elbow:
            if smooth:
                loss_ = [l[1] for l in loss]
                loss = smooth_curve(loss_)
                loss = [(i + 1, l) for i, l in enumerate(loss)]
            rank = int(_compute_elbow(loss))
        else:
            rank = manual_elbow
        if output_fig:
            fig = plot_elbow(loss=loss,
                             elbow=rank,
                             figsize=figsize,
                             ylabel=ylabel[metric],
                             fontsize=fontsize,
                             filename=filename)
        else:
            fig = None
    elif runs > 1:
        all_loss = _multiple_runs_elbow_analysis(tensor=self.tensor,
                                                 upper_rank=upper_rank,
                                                 runs=runs,
                                                 tf_type=tf_type,
                                                 init=init,
                                                 svd=svd,
                                                 metric=metric,
                                                 random_state=random_state,
                                                 mask=mask,
                                                 n_iter_max=n_iter_max,
                                                 tol=tol,
                                                 verbose=verbose,
                                                 **kwargs
                                                 )

        # Same outputs as runs = 1
        loss = np.nanmean(all_loss, axis=0).tolist()
        if smooth:
            loss = smooth_curve(loss)
        loss = [(i + 1, l) for i, l in enumerate(loss)]

        if automatic_elbow:
            rank = int(_compute_elbow(loss))
        else:
            rank = manual_elbow

        if output_fig:
            fig = plot_multiple_run_elbow(all_loss=all_loss,
                                          ci=ci,
                                          elbow=rank,
                                          figsize=figsize,
                                          ylabel=ylabel[metric],
                                          smooth=smooth,
                                          fontsize=fontsize,
                                          filename=filename)
        else:
            fig = None

    else:
        assert runs > 0, "Input runs must be an integer greater than 0"

    # Store results
    self.rank = rank
    self.elbow_metric = metric
    self.elbow_metric_mean = loss
    self.elbow_metric_raw = all_loss

    if self.rank is not None:
        assert(isinstance(rank, int)), 'rank must be an integer.'
        print('The rank at the elbow is: {}'.format(self.rank))
    return fig, loss
excluded_value_fraction(self)

Returns the fraction of excluded values in the tensor, given the values that are masked in tensor.mask

Returns

excluded_fraction : float Fraction of missing/excluded values in the tensor.
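
A minimal sketch of the same arithmetic with plain NumPy instead of tensorly, assuming a mask that holds 1 for kept entries and 0 for excluded ones. The function name `excluded_fraction` and the toy mask are hypothetical.

```python
import numpy as np

def excluded_fraction(mask):
    """Fraction of entries flagged as excluded (0) in a 0/1 mask."""
    mask = np.asarray(mask)
    kept_fraction = float(mask.sum()) / mask.size
    return 1.0 - kept_fraction

# Toy 3-D mask with 24 entries, 4 of which are excluded.
mask = np.ones((2, 3, 4))
mask[0, 0, :] = 0
fraction = excluded_fraction(mask)
print(fraction)
```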

Source code in cell2cell/tensor/tensor.py
def excluded_value_fraction(self):
    '''Returns the fraction of excluded values in the tensor,
    given the values that are masked in tensor.mask

    Returns
    -------
    excluded_fraction : float
        Fraction of missing/excluded values in the tensor.
    '''
    if self.mask is None:
        print("The interaction tensor does not have masked values")
        return 0.0
    else:
        fraction = tl.sum(self.mask) / tl.prod(tl.tensor(self.tensor.shape))
        excluded_fraction = 1.0 - fraction.item()
        return excluded_fraction
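As an illustration of the computation above, the following sketch uses plain NumPy instead of tensorly. The mask is a made-up example in which 1 marks a value that is used and 0 a value that is excluded.

```python
import numpy as np

# Hypothetical 2x2x2 mask: 1 = value used in the factorization, 0 = excluded
mask = np.ones((2, 2, 2), dtype=int)
mask[0, 0, 0] = 0
mask[1, 1, 0] = 0

# Fraction of entries kept, and its complement (the excluded fraction)
kept_fraction = mask.sum() / mask.size
excluded_fraction = 1.0 - float(kept_fraction)
```

Two of the eight entries are masked out here, so the excluded fraction is 0.25.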
explained_variance(self)

Computes the explained variance score for a tensor decomposition. Inspired by the function sklearn.metrics.explained_variance_score.

Returns

explained_variance : float Explained variance score for a tensor factorization.

Source code in cell2cell/tensor/tensor.py
def explained_variance(self):
    '''Computes the explained variance score for a tensor decomposition. Inspired by the
    function sklearn.metrics.explained_variance_score.

    Returns
    -------
    explained_variance : float
        Explained variance score for a tensor factorization.
    '''
    assert self.tl_object is not None, "Must run compute_tensor_factorization before using this method."
    tensor = self.tensor
    rec_tensor = self.tl_object.to_tensor()
    mask = self.mask

    if mask is not None:
        tensor = tensor * mask
        rec_tensor = rec_tensor * mask  # Mask the reconstruction, not the original tensor

    y_diff_avg = tl.mean(tensor - rec_tensor)
    numerator = tl.norm(tensor - rec_tensor - y_diff_avg)

    tensor_avg = tl.mean(tensor)
    denominator = tl.norm(tensor - tensor_avg)

    if denominator == 0.:
        explained_variance = 0.0
    else:
        explained_variance =  1. - (numerator / denominator)
        explained_variance = explained_variance.item()
    return explained_variance
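A minimal NumPy sketch of this score may help to see how it behaves. It assumes the intended behavior is to mask both the original and the reconstructed tensor, and `explained_variance_sketch` is a hypothetical stand-alone function, not part of cell2cell. Note that the score is built from a ratio of norms, so its values are not expected to match sklearn's variance-based score.

```python
import numpy as np


def explained_variance_sketch(tensor, rec_tensor, mask=None):
    """Norm-based explained variance for an original and a reconstructed tensor."""
    tensor = np.asarray(tensor, dtype=float)
    rec_tensor = np.asarray(rec_tensor, dtype=float)
    if mask is not None:
        # Ignore excluded entries in both tensors
        tensor = tensor * mask
        rec_tensor = rec_tensor * mask

    residual = tensor - rec_tensor
    numerator = np.linalg.norm(residual - residual.mean())
    denominator = np.linalg.norm(tensor - tensor.mean())
    if denominator == 0.0:
        return 0.0
    return float(1.0 - numerator / denominator)


original = np.arange(8, dtype=float).reshape(2, 2, 2)
perfect = explained_variance_sketch(original, original.copy())
shifted = explained_variance_sketch(original, original + 0.5)
poor = explained_variance_sketch(original, np.zeros_like(original))
```

A perfect reconstruction scores 1.0, and so does one that differs only by a constant offset, because the mean of the residual is subtracted. An all-zero reconstruction scores 0.0.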
export_factor_loadings(self, filename)

Exports the factor loadings of the tensor into an Excel file

Parameters

filename : str Full path and filename to store the file. E.g., '/home/user/Loadings.xlsx'

Source code in cell2cell/tensor/tensor.py
def export_factor_loadings(self, filename):
    '''Exports the factor loadings of the tensor into an Excel file

    Parameters
    ----------
    filename : str
        Full path and filename to store the file. E.g., '/home/user/Loadings.xlsx'
    '''
    writer = pd.ExcelWriter(filename)
    for k, v in self.factors.items():
        v.to_excel(writer, sheet_name=k)
    writer.close()
    print('Loadings of the tensor factorization were successfully saved into {}'.format(filename))
get_top_factor_elements(self, order_name, factor_name, top_number=10)

Obtains the top elements with the highest loadings for a given factor

Parameters

order_name : str Name of the dimension/order in the tensor according to the keys of the dictionary in BaseTensor.factors. The attribute factors is built once the tensor factorization is run.

factor_name : str Name of one of the factors. E.g., 'Factor 1'

top_number : int, default=10 Number of top-elements to return

Returns

top_elements : pandas.Series A series with the loadings of the top elements for the given factor.

Source code in cell2cell/tensor/tensor.py
def get_top_factor_elements(self, order_name, factor_name, top_number=10):
    '''Obtains the top elements with the highest loadings for a given factor

    Parameters
    ----------
    order_name : str
        Name of the dimension/order in the tensor according to the keys of the
        dictionary in BaseTensor.factors. The attribute factors is built once the
        tensor factorization is run.

    factor_name : str
        Name of one of the factors. E.g., 'Factor 1'

    top_number : int, default=10
        Number of top-elements to return

    Returns
    -------
    top_elements : pandas.Series
        A series with the loadings of the top elements for the given factor.
    '''
    top_elements = self.factors[order_name][factor_name].sort_values(ascending=False).head(top_number)
    return top_elements
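The selection above can be reproduced with pandas alone. In this sketch the `factors` dictionary, the cell names and the loadings are all made up for illustration; in practice `factors` is populated after running the tensor factorization.

```python
import pandas as pd

# Hypothetical loadings for one tensor dimension: rows are elements, columns are factors
loadings = pd.DataFrame({'Factor 1': [0.10, 0.70, 0.05, 0.40],
                         'Factor 2': [0.60, 0.02, 0.30, 0.08]},
                        index=['T cells', 'B cells', 'NK cells', 'Monocytes'])
factors = {'Sender Cells': loadings}

# Select one factor (a column), sort from highest to lowest and keep the top 2
top_elements = (factors['Sender Cells']['Factor 1']
                .sort_values(ascending=False)
                .head(2))
```

Selecting a single factor yields a pandas Series indexed by element name, here 'B cells' followed by 'Monocytes'.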
missing_fraction(self)

Returns the fraction of values that are missing (NaNs) in the tensor, given the values that are in tensor.loc_nans

Returns

missing_fraction : float Fraction of values that are missing (NaNs).

Source code in cell2cell/tensor/tensor.py
def missing_fraction(self):
    '''Returns the fraction of values that are missing (NaNs) in the tensor,
    given the values that are in tensor.loc_nans

    Returns
    -------
    missing_fraction : float
        Fraction of values that are missing (NaNs).
    '''
    if self.loc_nans is None:
        print("The interaction tensor does not have missing values")
        return 0.0
    else:
        missing_fraction = tl.sum(self.loc_nans) / tl.prod(tl.tensor(self.tensor.shape))
    missing_fraction = missing_fraction.item()
    return missing_fraction
sparsity_fraction(self)

Returns the fraction of values that are zeros in the tensor, given the values that are in tensor.loc_zeros

Returns

sparsity_fraction : float Fraction of values that are real zeros.

Source code in cell2cell/tensor/tensor.py
def sparsity_fraction(self):
    '''Returns the fraction of values that are zeros in the tensor,
    given the values that are in tensor.loc_zeros

    Returns
    -------
    sparsity_fraction : float
        Fraction of values that are real zeros.
    '''
    if self.loc_zeros is None:
        print("The interaction tensor does not have zeros")
        return 0.0
    else:
        sparsity_fraction = tl.sum(self.loc_zeros) / tl.prod(tl.tensor(self.tensor.shape))
    sparsity_fraction = sparsity_fraction.item()
    return sparsity_fraction
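Both `missing_fraction` and `sparsity_fraction` divide the count of flagged entries by the total number of entries in the tensor. A small NumPy sketch, using made-up indicator arrays in place of `loc_nans` and `loc_zeros`:

```python
import numpy as np

shape = (2, 2, 2)
# Hypothetical indicator arrays: 1 marks a NaN (loc_nans) or a true zero (loc_zeros)
loc_nans = np.zeros(shape, dtype=int)
loc_nans[0, 1, 0] = 1
loc_zeros = np.zeros(shape, dtype=int)
loc_zeros[0, 0, 0] = 1
loc_zeros[1, 0, 1] = 1
loc_zeros[1, 1, 0] = 1

n_values = np.prod(shape)
missing_fraction = float(loc_nans.sum() / n_values)
sparsity_fraction = float(loc_zeros.sum() / n_values)
# Whatever is neither missing nor a true zero holds a nonzero score
nonzero_fraction = 1.0 - missing_fraction - sparsity_fraction
```

Because the two indicator arrays do not overlap, the missing, zero and nonzero fractions add up to 1.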
to_device(self, device)

Changes the device where the tensor is analyzed.

Parameters

device : str Device name to use for the decomposition. Options could be 'cpu', 'cuda', 'gpu', depending on the backend used with tensorly.

Source code in cell2cell/tensor/tensor.py
def to_device(self, device):
    '''Changes the device where the tensor
    is analyzed.

    Parameters
    ----------
    device : str
        Device name to use for the decomposition.
        Options could be 'cpu', 'cuda', 'gpu', depending on
        the backend used with tensorly.
    '''
    try:
        self.tensor = tl.tensor(self.tensor, device=device)
        if self.mask is not None:
            self.mask = tl.tensor(self.mask, device=device)
    except Exception:
        print('Device is either not available or the backend used with tensorly does not support this device. '
              'Try changing it with tensorly.set_backend("<backend_name>") before.')
        self.tensor = tl.tensor(self.tensor)
        if self.mask is not None:
            self.mask = tl.tensor(self.mask)
write_file(self, filename)

Exports this object into a pickle file.

Parameters

filename : str Complete path to the file wherein the variable will be stored. For example: /home/user/variable.pkl

Source code in cell2cell/tensor/tensor.py
def write_file(self, filename):
    '''Exports this object into a pickle file.

    Parameters
    ----------
    filename : str
        Complete path to the file wherein the variable will be
        stored. For example:
        /home/user/variable.pkl
    '''
    from cell2cell.io.save_data import export_variable_with_pickle
    export_variable_with_pickle(self, filename=filename)
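For readers unfamiliar with pickle files, the round trip can be sketched with the standard library only. This is an illustration, not the implementation of `export_variable_with_pickle`; the `results` dictionary is a hypothetical stand-in for the object being stored.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the object to store (a plain dict keeps the sketch portable)
results = {'rank': 4, 'order_labels': ['Contexts', 'Ligand-Receptor Pairs',
                                       'Sender Cells', 'Receiver Cells']}

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, 'variable.pkl')
    # Write the object in binary mode
    with open(filename, 'wb') as fh:
        pickle.dump(results, fh)
    # Read it back to confirm that the round trip preserves the content
    with open(filename, 'rb') as fh:
        restored = pickle.load(fh)
```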

InteractionTensor (BaseTensor)

4D-Communication Tensor built from gene expression matrices for different contexts and a list of ligand-receptor pairs

Parameters

rnaseq_matrices : list A list with dataframes of gene expression wherein the rows are the genes and columns the cell types, tissues or samples.

ppi_data : pandas.DataFrame A dataframe containing protein-protein interactions (rows). It has to contain at least two columns, one for the first protein partner in the interaction as well as the second protein partner.

order_labels : list, default=None List containing the labels for each order or dimension of the tensor. For example: ['Contexts', 'Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']

context_names : list, default=None A list of strings containing the names of the corresponding contexts to each rnaseq_matrix. The length of this list must match the length of the list rnaseq_matrices.

how : str, default='inner' Approach to consider cell types and genes present across multiple contexts.

- 'inner' : Considers only cell types and genes that are present in all
            contexts (intersection).
- 'outer' : Considers all cell types and genes that are present
            across contexts (union).
- 'outer_genes' : Considers only cell types that are present in all
                  contexts (intersection), while all genes that are
                  present across contexts (union).
- 'outer_cells' : Considers only genes that are present in all
                  contexts (intersection), while all cell types that are
                  present across contexts (union).

outer_fraction : float, default=0.0 Threshold to filter the elements when how includes any outer option. Elements whose fractional abundance across samples (in rnaseq_matrices) is at least this threshold will be included. When this value is 0, all elements across the samples are considered. When this value is 1, it acts as using how='inner'.

communication_score : str, default='expression_mean' Type of communication score to infer the potential use of a given ligand-receptor pair by a pair of cells/tissues/samples. Available communication_scores are:

- 'expression_mean' : Computes the average between the expression of a ligand
                      from a sender cell and the expression of a receptor on a
                      receiver cell.
- 'expression_product' : Computes the product between the expression of a
                        ligand from a sender cell and the expression of a
                        receptor on a receiver cell.
- 'expression_gmean' : Computes the geometric mean between the expression
                       of a ligand from a sender cell and the
                       expression of a receptor on a receiver cell.

complex_sep : str, default=None Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

complex_agg_method : str, default='min' Method to aggregate the expression value of multiple genes in a complex.

- 'min' : Minimum expression value among all genes.
- 'mean' : Average expression value among all genes.
- 'gmean' : Geometric mean expression value among all genes.

upper_letter_comparison : boolean, default=True Whether to convert the gene names in the expression matrices and the protein names in the ppi_data to uppercase so that they match and their respective expression levels can be integrated. Useful when there are inconsistencies in the names between the expression matrix and the ligand-receptor annotations.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

group_ppi_by : str, default=None Column name in the list of PPIs used for grouping individual PPIs into major groups such as signaling pathways.

group_ppi_method : str, default='gmean' Method for aggregating multiple PPIs into major groups.

- 'mean' : Computes the average communication score among all PPIs of the
           group for a given pair of cells/tissues/samples
- 'gmean' : Computes the geometric mean of the communication scores among all
            PPIs of the group for a given pair of cells/tissues/samples
- 'sum' : Computes the sum of the communication scores among all PPIs of the
          group for a given pair of cells/tissues/samples

device : str, default=None Device to use when backend allows using multiple devices. Options are: {'cpu', 'cuda:0', None}

verbose : boolean, default=True Whether to print the steps of the analysis.

Source code in cell2cell/tensor/tensor.py
class InteractionTensor(BaseTensor):
    '''4D-Communication Tensor built from gene expression matrices for different contexts
     and a list of ligand-receptor pairs

    Parameters
    ----------
    rnaseq_matrices : list
        A list with dataframes of gene expression wherein the rows are the genes and
        columns the cell types, tissues or samples.

    ppi_data : pandas.DataFrame
        A dataframe containing protein-protein interactions (rows). It has to
        contain at least two columns, one for the first protein partner in the
        interaction as well as the second protein partner.

    order_labels : list, default=None
        List containing the labels for each order or dimension of the tensor. For
        example: ['Contexts', 'Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']

    context_names : list, default=None
        A list of strings containing the names of the corresponding contexts to each
        rnaseq_matrix. The length of this list must match the length of the list
        rnaseq_matrices.

    how : str, default='inner'
        Approach to consider cell types and genes present across multiple contexts.

        - 'inner' : Considers only cell types and genes that are present in all
                    contexts (intersection).
        - 'outer' : Considers all cell types and genes that are present
                    across contexts (union).
        - 'outer_genes' : Considers only cell types that are present in all
                          contexts (intersection), while all genes that are
                          present across contexts (union).
        - 'outer_cells' : Considers only genes that are present in all
                          contexts (intersection), while all cell types that are
                          present across contexts (union).

    outer_fraction : float, default=0.0
        Threshold to filter the elements when `how` includes any outer option.
        Elements with a fraction abundance across samples (in `rnaseq_matrices`)
        at least this threshold will be included. When this value is 0, considers
        all elements across the samples. When this value is 1, it acts as using
        `how='inner'`.

    communication_score : str, default='expression_mean'
        Type of communication score to infer the potential use of a given ligand-
        receptor pair by a pair of cells/tissues/samples.
        Available communication_scores are:

        - 'expression_mean' : Computes the average between the expression of a ligand
                              from a sender cell and the expression of a receptor on a
                              receiver cell.
        - 'expression_product' : Computes the product between the expression of a
                                ligand from a sender cell and the expression of a
                                receptor on a receiver cell.
        - 'expression_gmean' : Computes the geometric mean between the expression
                               of a ligand from a sender cell and the
                               expression of a receptor on a receiver cell.

    complex_sep : str, default=None
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    complex_agg_method : str, default='min'
        Method to aggregate the expression value of multiple genes in a
        complex.

        - 'min' : Minimum expression value among all genes.
        - 'mean' : Average expression value among all genes.
        - 'gmean' : Geometric mean expression value among all genes.

    upper_letter_comparison : boolean, default=True
        Whether to convert the gene names in the expression matrices and the
        protein names in the ppi_data to uppercase so that they match and their
        respective expression levels can be integrated. Useful when there are
        inconsistencies in the names between the expression matrix and the
        ligand-receptor annotations.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a dataframe of
        protein-protein interactions. If the list is for ligand-receptor pairs, the
        first column is for the ligands and the second for the receptors.

    group_ppi_by : str, default=None
        Column name in the list of PPIs used for grouping individual PPIs into major
        groups such as signaling pathways.

    group_ppi_method : str, default='gmean'
        Method for aggregating multiple PPIs into major groups.

        - 'mean' : Computes the average communication score among all PPIs of the
                   group for a given pair of cells/tissues/samples
        - 'gmean' : Computes the geometric mean of the communication scores among all
                    PPIs of the group for a given pair of cells/tissues/samples
        - 'sum' : Computes the sum of the communication scores among all PPIs of the
                  group for a given pair of cells/tissues/samples

    device : str, default=None
        Device to use when backend allows using multiple devices. Options are:
         {'cpu', 'cuda:0', None}

    verbose : boolean, default=True
        Whether to print the steps of the analysis.
    '''
    def __init__(self, rnaseq_matrices, ppi_data, order_labels=None, context_names=None, how='inner', outer_fraction=0.0,
                 communication_score='expression_mean', complex_sep=None, complex_agg_method='min',
                 upper_letter_comparison=True, interaction_columns=('A', 'B'), group_ppi_by=None,
                 group_ppi_method='gmean', device=None, verbose=True):
        # Asserts
        if group_ppi_by is not None:
            assert group_ppi_by in ppi_data.columns, "Using {} for grouping PPIs is not possible. Not present among columns in ppi_data".format(group_ppi_by)


        # Init BaseTensor
        BaseTensor.__init__(self)

        # Generate expression values for protein complexes in PPI data
        if complex_sep is not None:
            if verbose:
                print('Getting expression values for protein complexes')
            col_a_genes, complex_a, col_b_genes, complex_b, complexes = get_genes_from_complexes(ppi_data=ppi_data,
                                                                                                 complex_sep=complex_sep,
                                                                                                 interaction_columns=interaction_columns
                                                                                                 )
            mod_rnaseq_matrices = [add_complexes_to_expression(rnaseq, complexes, agg_method=complex_agg_method) for rnaseq in rnaseq_matrices]
        else:
            mod_rnaseq_matrices = [df.copy() for df in rnaseq_matrices]

        # Uppercase for Gene names
        if upper_letter_comparison:
            for df in mod_rnaseq_matrices:
                df.index = [idx.upper() for idx in df.index]

        # Deduplicate gene names
        mod_rnaseq_matrices = [df[~df.index.duplicated(keep='first')] for df in mod_rnaseq_matrices]

        # Get context CCC tensor
        tensor, genes, cells, ppi_names, mask = build_context_ccc_tensor(rnaseq_matrices=mod_rnaseq_matrices,
                                                                         ppi_data=ppi_data,
                                                                         how=how,
                                                                         outer_fraction=outer_fraction,
                                                                         communication_score=communication_score,
                                                                         complex_sep=complex_sep,
                                                                         upper_letter_comparison=upper_letter_comparison,
                                                                         interaction_columns=interaction_columns,
                                                                         group_ppi_by=group_ppi_by,
                                                                         group_ppi_method=group_ppi_method,
                                                                         verbose=verbose)

        # Distinguish NaNs from real zeros
        if mask is None:
            self.loc_nans = np.zeros(tensor.shape, dtype=int)
        else:
            self.loc_nans = np.ones(tensor.shape, dtype=int) - np.array(mask)
        self.loc_zeros = (tensor == 0).astype(int) - self.loc_nans
        self.loc_zeros = (self.loc_zeros > 0).astype(int)

        # Generate names for the elements in each dimension (order) in the tensor
        if context_names is None:
            context_names = ['C-' + str(i) for i in range(1, len(mod_rnaseq_matrices)+1)]
            # for PPIS use ppis, and sender & receiver cells, use cells

        # Save variables for this class
        self.communication_score = communication_score
        self.how = how
        self.outer_fraction = outer_fraction
        if device is None:
            self.tensor = tl.tensor(tensor)
            self.loc_nans = tl.tensor(self.loc_nans)
            self.loc_zeros = tl.tensor(self.loc_zeros)
            self.mask = mask
        else:
            if tl.get_backend() in ['pytorch', 'tensorflow']: # Potential TODO: Include other backends that support different devices
                self.tensor = tl.tensor(tensor, device=device)
                self.loc_nans = tl.tensor(self.loc_nans, device=device)
                self.loc_zeros = tl.tensor(self.loc_zeros, device=device)
                if mask is not None:
                    self.mask = tl.tensor(mask, device=device)
                else:
                    self.mask = mask
            else:
                self.tensor = tl.tensor(tensor)
                self.loc_nans = tl.tensor(self.loc_nans)
                self.loc_zeros = tl.tensor(self.loc_zeros)
                if mask is not None:
                    self.mask = tl.tensor(mask)
                else:
                    self.mask = mask
        self.genes = genes
        self.cells = cells
        self.order_labels = order_labels
        self.order_names = [context_names, ppi_names, self.cells, self.cells]
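The three `communication_score` options can be illustrated for a single ligand-receptor pair with NumPy broadcasting. This is a sketch with made-up expression values and cell names, not the code path used by `build_context_ccc_tensor`.

```python
import numpy as np

# Hypothetical expression of one ligand and one receptor in three cell types
cells = ['A', 'B', 'C']
ligand_expr = np.array([4.0, 0.0, 1.0])    # expression in each potential sender
receptor_expr = np.array([1.0, 9.0, 4.0])  # expression in each potential receiver

# Rows are sender cells, columns are receiver cells
lig = ligand_expr[:, np.newaxis]
rec = receptor_expr[np.newaxis, :]

scores = {
    'expression_mean': (lig + rec) / 2.0,
    'expression_product': lig * rec,
    'expression_gmean': np.sqrt(lig * rec),
}
```

Each entry of `scores` is a 3x3 sender-by-receiver matrix. The product and the geometric mean are zero whenever either partner is not expressed (sender 'B' here), whereas the arithmetic mean can remain positive in that case, which is worth keeping in mind when choosing a score.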

PreBuiltTensor (BaseTensor)

Initializes a cell2cell.tensor.BaseTensor with a prebuilt communication tensor

Parameters

tensor : ndarray list Prebuilt tensor. Could be a list of lists, a numpy array or a tensorly.tensor.

order_names : list List of lists containing the string names of each element in each of the dimensions or orders in the tensor. For a 4D-Communication tensor, the first list should contain the names of the contexts, the second the names of the ligand-receptor interactions, the third the names of the sender cells and the fourth the names of the receiver cells.

order_labels : list, default=None List containing the labels for each order or dimension of the tensor. For example: ['Contexts', 'Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']

mask : ndarray list, default=None Helps avoid missing values during a tensor factorization. A mask should be a boolean array of the same shape as the original tensor and should be 0 where the values are missing and 1 everywhere else.

loc_nans : ndarray list, default=None An array of shape equal to tensor with ones where NaN values were assigned when building the tensor. Other values are zeros. It stores the location of the NaN values.

device : str, default=None Device to use when backend allows using multiple devices. Options are: {'cpu', 'cuda:0', None}

Source code in cell2cell/tensor/tensor.py
class PreBuiltTensor(BaseTensor):
    '''Initializes a cell2cell.tensor.BaseTensor with a prebuilt communication tensor

    Parameters
    ----------
    tensor : ndarray list
        Prebuilt tensor. Could be a list of lists, a numpy array or a tensorly.tensor.

    order_names : list
        List of lists containing the string names of each element in each of the
        dimensions or orders in the tensor. For a 4D-Communication tensor, the first
        list should contain the names of the contexts, the second the names of the
        ligand-receptor interactions, the third the names of the sender cells and the
        fourth the names of the receiver cells.

    order_labels : list, default=None
        List containing the labels for each order or dimension of the tensor. For
        example: ['Contexts', 'Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']

    mask : ndarray list, default=None
        Helps avoid missing values during a tensor factorization. A mask should be
        a boolean array of the same shape as the original tensor and should be 0
        where the values are missing and 1 everywhere else.

    loc_nans : ndarray list, default=None
        An array of shape equal to `tensor` with ones where NaN values were assigned
        when building the tensor. Other values are zeros. It stores the
        location of the NaN values.

    device : str, default=None
        Device to use when backend allows using multiple devices. Options are:
         {'cpu', 'cuda:0', None}
    '''
    def __init__(self, tensor, order_names, order_labels=None, mask=None, loc_nans=None, device=None):
        # Init BaseTensor
        BaseTensor.__init__(self)

        # Initialize tensor
        try:
            context = tl.context(tensor)
        except Exception:
            context = {'dtype': tensor.dtype, 'device' : None}
        tensor = tl.to_numpy(tensor)
        if mask is not None:
            mask = tl.to_numpy(mask)

        # Location of NaNs and zeros
        tmp_nans = (np.isnan(tensor)).astype(int) # Find extra NaNs that were not considered
        if loc_nans is None:
            self.loc_nans = np.zeros(tuple(tensor.shape), dtype=int)
        else:
            assert loc_nans.shape == tensor.shape, "`loc_nans` and `tensor` must be of the same shape"
            self.loc_nans = np.array(loc_nans.copy())
        self.loc_nans = self.loc_nans + tmp_nans
        self.loc_nans = (self.loc_nans > 0).astype(int)

        self.loc_zeros = (np.array(tensor) == 0.).astype(int) - self.loc_nans
        self.loc_zeros = (self.loc_zeros > 0).astype(int)

        # Store tensor
        tensor_ = np.nan_to_num(tensor)
        if device is not None:
            context['device'] = device
        if 'device' not in context.keys():
            self.tensor = tl.tensor(tensor_)
            self.loc_nans = tl.tensor(self.loc_nans)
            self.loc_zeros = tl.tensor(self.loc_zeros)
            if mask is None:
                self.mask = mask
            else:
                self.mask = tl.tensor(mask)
        else:
            self.tensor = tl.tensor(tensor_, device=context['device'])
            self.loc_nans = tl.tensor(self.loc_nans, device=context['device'])
            self.loc_zeros = tl.tensor(self.loc_zeros, device=context['device'])
            if mask is None:
                self.mask = mask
            else:
                self.mask = tl.tensor(mask, device=context['device'])
        self.order_names = order_names
        if order_labels is None:
            self.order_labels = ['Dimension-{}'.format(i + 1) for i in range(len(self.tensor.shape))]
        else:
            self.order_labels = order_labels
        assert len(self.tensor.shape) == len(self.order_labels), "The length of order_labels must match the number of orders/dimensions in the tensor"
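The bookkeeping of NaNs and true zeros in the constructor can be followed on a tiny example. The sketch below uses NumPy only and made-up values; the `mask` built at the end is an assumption about how a caller might derive one, since the class itself only stores the mask it receives.

```python
import numpy as np

# Hypothetical 1x2x2 tensor: one NaN was recorded by the user, another one was not
tensor = np.array([[[0.0, np.nan],
                    [np.nan, 2.5]]])
user_loc_nans = np.array([[[0, 1],
                           [0, 0]]])

# Merge the user-provided locations with any NaN found in the tensor itself
found_nans = np.isnan(tensor).astype(int)
loc_nans = ((user_loc_nans + found_nans) > 0).astype(int)

# True zeros are zero-valued entries that are not flagged as NaN
loc_zeros = (((tensor == 0.0).astype(int) - loc_nans) > 0).astype(int)

# NaNs are replaced by zeros in the stored tensor
stored = np.nan_to_num(tensor)
# Assumption: a caller could exclude the former NaNs with a mask like this one
mask = 1 - loc_nans
```

After this step the stored tensor contains no NaNs, while `loc_nans` and `loc_zeros` keep track of which zeros were originally missing values and which were real.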

aggregate_ccc_tensor(ccc_tensor, ppi_data, group_ppi_by=None, group_ppi_method='gmean')

Aggregates communication scores of multiple PPIs into major groups (e.g., pathways) in a communication tensor

Parameters

ccc_tensor : ndarray list List of directed cell-cell communication matrices, one for each ligand-receptor pair in ppi_data. These matrices contain the communication score for pairs of cells for the corresponding PPI. This tensor represents a 3D-communication tensor for the context.

ppi_data : pandas.DataFrame A dataframe containing protein-protein interactions (rows). It has to contain at least two columns, one for the first protein partner in the interaction as well as the second protein partner.

group_ppi_by : str, default=None Column name in the list of PPIs used for grouping individual PPIs into major groups such as signaling pathways.

group_ppi_method : str, default='gmean' Method for aggregating multiple PPIs into major groups.

- 'mean' : Computes the average communication score among all PPIs of the
           group for a given pair of cells/tissues/samples
- 'gmean' : Computes the geometric mean of the communication scores among all
            PPIs of the group for a given pair of cells/tissues/samples
- 'sum' : Computes the sum of the communication scores among all PPIs of the
          group for a given pair of cells/tissues/samples
Returns

aggregated_tensor : ndarray list List of directed cell-cell communication matrices, one for each major group of ligand-receptor pairs in ppi_data. These matrices contain the communication score for pairs of cells for the corresponding PPI group. This tensor represents a 3D-communication tensor for the context, but for major groups instead of individual PPIs.

Source code in cell2cell/tensor/tensor.py
def aggregate_ccc_tensor(ccc_tensor, ppi_data, group_ppi_by=None, group_ppi_method='gmean'):
    '''Aggregates communication scores of multiple PPIs into major groups
    (e.g., pathways) in a communication tensor

    Parameters
    ----------
    ccc_tensor : ndarray list
        List of directed cell-cell communication matrices, one for each ligand-
        receptor pair in ppi_data. These matrices contain the communication score for
        pairs of cells for the corresponding PPI. This tensor represents a
        3D-communication tensor for the context.

    ppi_data : pandas.DataFrame
        A dataframe containing protein-protein interactions (rows). It has to
        contain at least two columns, one for the first protein partner in the
        interaction as well as the second protein partner.

    group_ppi_by : str, default=None
        Column name in the list of PPIs used for grouping individual PPIs into major
        groups such as signaling pathways.

    group_ppi_method : str, default='gmean'
        Method for aggregating multiple PPIs into major groups.

        - 'mean' : Computes the average communication score among all PPIs of the
                   group for a given pair of cells/tissues/samples
        - 'gmean' : Computes the geometric mean of the communication scores among all
                    PPIs of the group for a given pair of cells/tissues/samples
        - 'sum' : Computes the sum of the communication scores among all PPIs of the
                  group for a given pair of cells/tissues/samples

    Returns
    -------
    aggregated_tensor : ndarray list
        List of directed cell-cell communication matrices, one for each major group of
        ligand-receptor pairs in ppi_data. These matrices contain the communication
        score for pairs of cells for the corresponding PPI group. This tensor
        represents a 3D-communication tensor for the context, but for major groups
        instead of individual PPIs.
    '''
    tensor_ = np.array(ccc_tensor)
    aggregated_tensor = []
    for group, df in ppi_data.groupby(group_ppi_by):
        lr_idx = list(df.index)
        ccc_matrices = tensor_[lr_idx]
        aggregated_tensor.append(aggregate_ccc_matrices(ccc_matrices=ccc_matrices,
                                                        method=group_ppi_method).tolist())
    return aggregated_tensor
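The grouping logic can be sketched end to end with NumPy and pandas. The tensor, the 'pathway' column and the `aggregate` helper are hypothetical; `aggregate` stands in for `aggregate_ccc_matrices`, whose exact behavior is not shown in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical 3D tensor: three PPIs, each a 2x2 sender-by-receiver matrix
ccc_tensor = np.array([[[1.0, 4.0], [2.0, 0.0]],
                       [[4.0, 1.0], [8.0, 5.0]],
                       [[3.0, 3.0], [3.0, 3.0]]])
ppi_data = pd.DataFrame({'A': ['L1', 'L2', 'L3'],
                         'B': ['R1', 'R2', 'R3'],
                         'pathway': ['P1', 'P1', 'P2']})


def aggregate(matrices, method='gmean'):
    """Hypothetical aggregator over the first axis (one matrix per PPI)."""
    if method == 'mean':
        return matrices.mean(axis=0)
    if method == 'sum':
        return matrices.sum(axis=0)
    if method == 'gmean':
        # n-th root of the product; any zero score sets the result to zero
        return np.prod(matrices, axis=0) ** (1.0 / matrices.shape[0])
    raise ValueError('Unknown method: {}'.format(method))


groups, aggregated = [], []
for group, df in ppi_data.groupby('pathway'):
    groups.append(group)
    # Index labels are used as positions, so a default integer index is assumed
    aggregated.append(aggregate(ccc_tensor[list(df.index)], method='gmean'))
aggregated = np.array(aggregated)
```

The result has one matrix per pathway, in the sorted order returned by `groupby`. Since the index labels of `ppi_data` are used to slice the tensor by position, resetting the index of a filtered PPI table before aggregating is probably a sensible precaution.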

build_context_ccc_tensor(rnaseq_matrices, ppi_data, how='inner', outer_fraction=0.0, communication_score='expression_product', complex_sep=None, upper_letter_comparison=True, interaction_columns=('A', 'B'), group_ppi_by=None, group_ppi_method='gmean', verbose=True)

Builds a 4D-Communication tensor. Takes the gene expression matrices and the list of PPIs to compute the communication scores between the interacting cells for each PPI. This is done for each context.

Parameters

rnaseq_matrices : list A list with dataframes of gene expression wherein the rows are the genes and columns the cell types, tissues or samples.

ppi_data : pandas.DataFrame A dataframe containing protein-protein interactions (rows). It has to contain at least two columns, one for the first protein partner in the interaction as well as the second protein partner.

how : str, default='inner' Approach to consider cell types and genes present across multiple contexts.

- 'inner' : Considers only cell types and genes that are present in all
            contexts (intersection).
- 'outer' : Considers all cell types and genes that are present
            across contexts (union).
- 'outer_genes' : Considers only cell types that are present in all
                  contexts (intersection), while all genes that are
                  present across contexts (union).
- 'outer_cells' : Considers only genes that are present in all
                  contexts (intersection), while all cell types that are
                  present across contexts (union).

outer_fraction : float, default=0.0 Threshold to filter the elements when how includes any outer option. Elements with a fraction abundance across samples (in rnaseq_matrices) at least this threshold will be included. When this value is 0, considers all elements across the samples. When this value is 1, it acts as using how='inner'.
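The interplay between `how` and `outer_fraction` can be illustrated with plain Python sets. This is only a sketch: `select_elements` is a hypothetical helper, not part of cell2cell, and it assumes that the "fraction abundance" of an element is the fraction of contexts in which it appears.

```python
from collections import Counter

def select_elements(element_lists, mode="inner", outer_fraction=0.0):
    """Hypothetical helper: pick the elements to keep across contexts."""
    if mode == "inner":
        # Intersection: elements present in every context
        return set.intersection(*map(set, element_lists))
    # Outer: keep elements present in at least `outer_fraction` of the contexts
    counts = Counter(e for elements in element_lists for e in set(elements))
    n_contexts = len(element_lists)
    return {e for e, c in counts.items() if c / n_contexts >= outer_fraction}

# Cell types observed in three toy contexts
contexts = [["T", "B", "NK"], ["T", "B"], ["T", "Mono"]]

inner = select_elements(contexts, "inner")
union = select_elements(contexts, "outer", 0.0)
majority = select_elements(contexts, "outer", 0.6)
strict = select_elements(contexts, "outer", 1.0)
```

In this toy case `majority` keeps only the cell types found in at least 60% of the contexts, and `strict` matches `inner`, in line with the description of `outer_fraction=1`.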

communication_score : str, default='expression_product' Type of communication score to infer the potential use of a given ligand-receptor pair by a pair of cells/tissues/samples. Available communication_scores are:

- 'expression_mean' : Computes the average between the expression of a ligand
                      from a sender cell and the expression of a receptor on a
                      receiver cell.
- 'expression_product' : Computes the product between the expression of a
                        ligand from a sender cell and the expression of a
                        receptor on a receiver cell.
- 'expression_gmean' : Computes the geometric mean between the expression
                       of a ligand from a sender cell and the
                       expression of a receptor on a receiver cell.
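The three scores can be sketched with NumPy for a single ligand-receptor pair. The expression values below are made up, and the way cell2cell computes these matrices internally may differ; the snippet only illustrates the definitions given above.

```python
import numpy as np

ligand = np.array([4.0, 0.0, 1.0])    # ligand expression in each sender cell
receptor = np.array([1.0, 9.0, 4.0])  # receptor expression in each receiver cell

# Rows index sender cells, columns index receiver cells.
product = np.outer(ligand, receptor)                # 'expression_product'
mean = (ligand[:, None] + receptor[None, :]) / 2.0  # 'expression_mean'
gmean = np.sqrt(product)                            # 'expression_gmean'
```

Note that the product and geometric mean are zero whenever either partner is not expressed, whereas the mean can remain positive.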

complex_sep : str, default=None Symbol that separates the protein subunits in a multimeric complex. For example, '&' is the complex_sep for a list of ligand-receptor pairs where a protein partner could be "CD74&CD44".

upper_letter_comparison : boolean, default=True Whether making uppercase the gene names in the expression matrices and the protein names in the ppi_data to match their names and integrate their respective expression level. Useful when there are inconsistencies in the names between the expression matrix and the ligand-receptor annotations.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

group_ppi_by : str, default=None Column name in the list of PPIs used for grouping individual PPIs into major groups such as signaling pathways.

group_ppi_method : str, default='gmean' Method for aggregating multiple PPIs into major groups.

- 'mean' : Computes the average communication score among all PPIs of the
           group for a given pair of cells/tissues/samples
- 'gmean' : Computes the geometric mean of the communication scores among all
            PPIs of the group for a given pair of cells/tissues/samples
- 'sum' : Computes the sum of the communication scores among all PPIs of the
          group for a given pair of cells/tissues/samples
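As a rough illustration of the aggregation methods, the sketch below collapses a toy 3D tensor (PPIs x senders x receivers) into one matrix per group. The `aggregate` function and the `groups` mapping are hypothetical stand-ins, and the geometric mean is computed through logarithms, which assumes strictly positive scores.

```python
import numpy as np

# Toy 3D tensor: 3 PPIs x 2 sender cells x 2 receiver cells
ccc_tensor = np.array([
    [[1.0, 4.0], [2.0, 8.0]],
    [[4.0, 1.0], [8.0, 2.0]],
    [[3.0, 3.0], [3.0, 3.0]],
])
# Hypothetical grouping of PPI indices into pathways
groups = {"pathway_1": [0, 1], "pathway_2": [2]}

def aggregate(matrices, method="gmean"):
    """Collapse the PPI axis of a stack of cell-cell matrices."""
    if method == "mean":
        return matrices.mean(axis=0)
    if method == "sum":
        return matrices.sum(axis=0)
    if method == "gmean":
        # Geometric mean via logs; assumes all scores are > 0
        return np.exp(np.log(matrices).mean(axis=0))
    raise ValueError("method must be 'mean', 'gmean' or 'sum'")

by_gmean = {g: aggregate(ccc_tensor[idx], "gmean") for g, idx in groups.items()}
by_sum = {g: aggregate(ccc_tensor[idx], "sum") for g, idx in groups.items()}
```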

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

tensors : list List of 3D-Communication tensors for each context. This list corresponds to the 4D-Communication tensor.

genes : list List of genes included in the tensor.

cells : list List of cells included in the tensor.

ppi_names : list List of names for each of the PPIs included in the tensor. Used as labels for the elements in the cognate tensor dimension (in the attribute order_names of the InteractionTensor).

mask_tensor : numpy.array Mask used to exclude values in the tensor. When using how='outer' it masks missing values (e.g., cell types that are not present in a given context), while using how='inner' sets mask_tensor to None.
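The mask logic can be reproduced on a toy array. The sketch below mirrors the NaN-based masking used in the function's source code, applied to made-up data in which one cell type is absent from the second context.

```python
import numpy as np

# Toy 4D tensor: contexts x PPIs x sender cells x receiver cells
tensors = np.ones((2, 1, 2, 2))
tensors[1, :, 1, :] = np.nan  # cell type 1 is missing as sender in context 1
tensors[1, :, :, 1] = np.nan  # cell type 1 is missing as receiver in context 1

mask_tensor = (~np.isnan(tensors)).astype(int)  # 1 = observed, 0 = missing
filled = np.nan_to_num(tensors)                 # missing values become 0.0
```

The mask lets a downstream factorization distinguish true zeros from values that were never measured.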

Source code in cell2cell/tensor/tensor.py
def build_context_ccc_tensor(rnaseq_matrices, ppi_data, how='inner', outer_fraction=0.0,
                             communication_score='expression_product', complex_sep=None, upper_letter_comparison=True,
                             interaction_columns=('A', 'B'), group_ppi_by=None, group_ppi_method='gmean', verbose=True):
    '''Builds a 4D-Communication tensor.
    Takes the gene expression matrices and the list of PPIs to compute
    the communication scores between the interacting cells for each PPI.
    This is done for each context.

    Parameters
    ----------
    rnaseq_matrices : list
        A list with dataframes of gene expression wherein the rows are the genes and
        columns the cell types, tissues or samples.

    ppi_data : pandas.DataFrame
        A dataframe containing protein-protein interactions (rows). It has to
        contain at least two columns, one for the first protein partner in the
        interaction as well as the second protein partner.

    how : str, default='inner'
        Approach to consider cell types and genes present across multiple contexts.

        - 'inner' : Considers only cell types and genes that are present in all
                    contexts (intersection).
        - 'outer' : Considers all cell types and genes that are present
                    across contexts (union).
        - 'outer_genes' : Considers only cell types that are present in all
                          contexts (intersection), while all genes that are
                          present across contexts (union).
        - 'outer_cells' : Considers only genes that are present in all
                          contexts (intersection), while all cell types that are
                          present across contexts (union).

    outer_fraction : float, default=0.0
        Threshold to filter the elements when `how` includes any outer option.
        Elements with a fraction abundance across samples (in `rnaseq_matrices`)
        at least this threshold will be included. When this value is 0, considers
        all elements across the samples. When this value is 1, it acts as using
        `how='inner'`.

    communication_score : str, default='expression_product'
        Type of communication score to infer the potential use of a given ligand-
        receptor pair by a pair of cells/tissues/samples.
        Available communication_scores are:

        - 'expression_mean' : Computes the average between the expression of a ligand
                              from a sender cell and the expression of a receptor on a
                              receiver cell.
        - 'expression_product' : Computes the product between the expression of a
                                ligand from a sender cell and the expression of a
                                receptor on a receiver cell.
        - 'expression_gmean' : Computes the geometric mean between the expression
                               of a ligand from a sender cell and the
                               expression of a receptor on a receiver cell.

    complex_sep : str, default=None
        Symbol that separates the protein subunits in a multimeric complex.
        For example, '&' is the complex_sep for a list of ligand-receptor pairs
        where a protein partner could be "CD74&CD44".

    upper_letter_comparison : boolean, default=True
        Whether making uppercase the gene names in the expression matrices and the
        protein names in the ppi_data to match their names and integrate their
        respective expression level. Useful when there are inconsistencies in the
        names between the expression matrix and the ligand-receptor annotations.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a dataframe of
        protein-protein interactions. If the list is for ligand-receptor pairs, the
        first column is for the ligands and the second for the receptors.

    group_ppi_by : str, default=None
        Column name in the list of PPIs used for grouping individual PPIs into major
        groups such as signaling pathways.

    group_ppi_method : str, default='gmean'
        Method for aggregating multiple PPIs into major groups.

        - 'mean' : Computes the average communication score among all PPIs of the
                   group for a given pair of cells/tissues/samples
        - 'gmean' : Computes the geometric mean of the communication scores among all
                    PPIs of the group for a given pair of cells/tissues/samples
        - 'sum' : Computes the sum of the communication scores among all PPIs of the
                  group for a given pair of cells/tissues/samples

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    tensors : list
        List of 3D-Communication tensors for each context. This list corresponds to
        the 4D-Communication tensor.

    genes : list
        List of genes included in the tensor.

    cells : list
        List of cells included in the tensor.

    ppi_names: list
        List of names for each of the PPIs included in the tensor. Used as labels for the
        elements in the cognate tensor dimension (in the attribute order_names of the
        InteractionTensor)

    mask_tensor: numpy.array
        Mask used to exclude values in the tensor. When using how='outer' it masks
        missing values (e.g., cell types that are not present in a given context),
        while using how='inner' makes the mask_tensor to be None.
    '''
    df_idxs = [list(rnaseq.index) for rnaseq in rnaseq_matrices]
    df_cols = [list(rnaseq.columns) for rnaseq in rnaseq_matrices]

    if how == 'inner':
        genes = set.intersection(*map(set, df_idxs))
        cells = set.intersection(*map(set, df_cols))
    elif how == 'outer':
        genes = set(get_elements_over_fraction(abundance_dict=get_element_abundances(element_lists=df_idxs),
                                               fraction=outer_fraction))
        cells = set(get_elements_over_fraction(abundance_dict=get_element_abundances(element_lists=df_cols),
                                               fraction=outer_fraction))
    elif how == 'outer_genes':
        genes = set(get_elements_over_fraction(abundance_dict=get_element_abundances(element_lists=df_idxs),
                                               fraction=outer_fraction))
        cells = set.intersection(*map(set, df_cols))
    elif how == 'outer_cells':
        genes = set.intersection(*map(set, df_idxs))
        cells = set(get_elements_over_fraction(abundance_dict=get_element_abundances(element_lists=df_cols),
                                               fraction=outer_fraction))
    else:
        raise ValueError('Provide a valid way to build the tensor; "how" must be "inner", "outer", "outer_genes" or "outer_cells"')

    # Preserve order or sort new set (either inner or outer)
    if set(df_idxs[0]) == genes:
        genes = df_idxs[0]
    else:
        genes = sorted(list(genes))

    if set(df_cols[0]) == cells:
        cells = df_cols[0]
    else:
        cells = sorted(list(cells))

    # Filter PPI data to keep only interactions whose partners are among the selected genes
    ppi_data_ = filter_ppi_by_proteins(ppi_data=ppi_data,
                                       proteins=genes,
                                       complex_sep=complex_sep,
                                       upper_letter_comparison=upper_letter_comparison,
                                       interaction_columns=interaction_columns)

    if verbose:
        print('Building tensor for the provided context')

    tensors = [generate_ccc_tensor(rnaseq_data=rnaseq.reindex(genes).reindex(cells, axis='columns'),
                                   ppi_data=ppi_data_,
                                   communication_score=communication_score,
                                   interaction_columns=interaction_columns) for rnaseq in rnaseq_matrices]

    if group_ppi_by is not None:
        ppi_names = [group for group, _ in ppi_data_.groupby(group_ppi_by)]
        tensors = [aggregate_ccc_tensor(ccc_tensor=t,
                                        ppi_data=ppi_data_,
                                        group_ppi_by=group_ppi_by,
                                        group_ppi_method=group_ppi_method) for t in tensors]
    else:
        ppi_names = [row[interaction_columns[0]] + '^' + row[interaction_columns[1]] for idx, row in ppi_data_.iterrows()]

    # Generate mask:
    if how != 'inner':
        mask_tensor = (~np.isnan(np.asarray(tensors))).astype(int)
    else:
        mask_tensor = None
    tensors = np.nan_to_num(tensors)
    return tensors, genes, cells, ppi_names, mask_tensor

generate_ccc_tensor(rnaseq_data, ppi_data, communication_score='expression_product', interaction_columns=('A', 'B'))

Computes a 3D-Communication tensor for a given context based on the gene expression matrix and the list of PPIs.

Parameters

rnaseq_data : pandas.DataFrame Gene expression matrix for a given context, sample or condition. Rows are genes and columns are cell types/tissues/samples.

ppi_data : pandas.DataFrame A dataframe containing protein-protein interactions (rows). It has to contain at least two columns, one for the first protein partner in the interaction as well as the second protein partner.

communication_score : str, default='expression_product' Type of communication score to infer the potential use of a given ligand-receptor pair by a pair of cells/tissues/samples. Available communication_scores are:

- 'expression_mean' : Computes the average between the expression of a ligand
                      from a sender cell and the expression of a receptor on a
                      receiver cell.
- 'expression_product' : Computes the product between the expression of a
                        ligand from a sender cell and the expression of a
                        receptor on a receiver cell.
- 'expression_gmean' : Computes the geometric mean between the expression
                       of a ligand from a sender cell and the
                       expression of a receptor on a receiver cell.

interaction_columns : tuple, default=('A', 'B') Contains the names of the columns where to find the partners in a dataframe of protein-protein interactions. If the list is for ligand-receptor pairs, the first column is for the ligands and the second for the receptors.

Returns

ccc_tensor : ndarray list List of directed cell-cell communication matrices, one for each ligand-receptor pair in ppi_data. These matrices contain the communication score for pairs of cells for the corresponding PPI. This tensor represents a 3D-communication tensor for the context.
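To make the shape of the output concrete, the following sketch stacks one sender-by-receiver matrix per ligand-receptor pair. It uses a plain dictionary of toy expression vectors instead of a pandas.DataFrame, made-up gene names, and the 'expression_product' score; it is an illustration of the idea rather than the library's implementation.

```python
import numpy as np

cells = ["T", "B"]
# Toy expression values: gene -> expression in each cell type (order of `cells`)
expression = {
    "LIG1": np.array([2.0, 0.0]),
    "REC1": np.array([1.0, 3.0]),
    "LIG2": np.array([1.0, 1.0]),
    "REC2": np.array([0.0, 5.0]),
}
lr_pairs = [("LIG1", "REC1"), ("LIG2", "REC2")]

# One directed matrix per pair: rows are senders, columns are receivers
ccc_tensor = np.array([np.outer(expression[lig], expression[rec])
                       for lig, rec in lr_pairs])
```

The resulting array has shape (number of PPIs, number of cells, number of cells), so `ccc_tensor[0, 0, 1]` is the score for LIG1 sent by T cells and REC1 received by B cells.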

Source code in cell2cell/tensor/tensor.py
def generate_ccc_tensor(rnaseq_data, ppi_data, communication_score='expression_product', interaction_columns=('A', 'B')):
    '''Computes a 3D-Communication tensor for a given context based on the gene
    expression matrix and the list of PPIs.

    Parameters
    ----------
    rnaseq_data : pandas.DataFrame
        Gene expression matrix for a given context, sample or condition. Rows are
        genes and columns are cell types/tissues/samples.

    ppi_data : pandas.DataFrame
        A dataframe containing protein-protein interactions (rows). It has to
        contain at least two columns, one for the first protein partner in the
        interaction as well as the second protein partner.

    communication_score : str, default='expression_product'
        Type of communication score to infer the potential use of a given ligand-
        receptor pair by a pair of cells/tissues/samples.
        Available communication_scores are:

        - 'expression_mean' : Computes the average between the expression of a ligand
                              from a sender cell and the expression of a receptor on a
                              receiver cell.
        - 'expression_product' : Computes the product between the expression of a
                                ligand from a sender cell and the expression of a
                                receptor on a receiver cell.
        - 'expression_gmean' : Computes the geometric mean between the expression
                               of a ligand from a sender cell and the
                               expression of a receptor on a receiver cell.

    interaction_columns : tuple, default=('A', 'B')
        Contains the names of the columns where to find the partners in a dataframe of
        protein-protein interactions. If the list is for ligand-receptor pairs, the
        first column is for the ligands and the second for the receptors.

    Returns
    -------
    ccc_tensor : ndarray list
        List of directed cell-cell communication matrices, one for each ligand-
        receptor pair in ppi_data. These matrices contain the communication score for
        pairs of cells for the corresponding PPI. This tensor represents a
        3D-communication tensor for the context.
    '''
    ppi_a = interaction_columns[0]
    ppi_b = interaction_columns[1]

    ccc_tensor = []
    for idx, ppi in ppi_data.iterrows():
        v = rnaseq_data.loc[ppi[ppi_a], :].values
        w = rnaseq_data.loc[ppi[ppi_b], :].values
        ccc_tensor.append(compute_ccc_matrix(prot_a_exp=v,
                                             prot_b_exp=w,
                                             communication_score=communication_score).tolist())
    return ccc_tensor

generate_tensor_metadata(interaction_tensor, metadata_dicts, fill_with_order_elements=True)

Uses a list of dicts (or None when a dict is missing) to generate a list of metadata for each order in the tensor.

Parameters

interaction_tensor : cell2cell.tensor.BaseTensor A communication tensor.

metadata_dicts : list A list of dictionaries. Each dictionary represents an order of the tensor. In an interaction tensor these orders should be contexts, LR pairs, sender cells and receiver cells. The keys are the elements in each order (they are contained in interaction_tensor.order_names) and the values are the categories that each element will be assigned as metadata.

fill_with_order_elements : boolean, default=True Whether to use each element of a dimension as its own metadata when None is passed instead of a dictionary for the respective order/dimension. If True, each element in that order is used as its own category; if False, that dimension will not contain metadata.

Returns

metadata : list A list of pandas.DataFrames that will be used as an input of cell2cell.plotting.tensor_factors_plot.
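The category assignment can be sketched without pandas. `categories_for_order` is a hypothetical helper that follows the fallback visible in the function's source code: when no dictionary is given, each element becomes its own category if the order has fewer than 75 elements, and otherwise every element receives the order label.

```python
def categories_for_order(names, mapping, order_label, max_unique=75):
    """Hypothetical helper: one category per element of a tensor order."""
    if mapping is not None:
        # Explicit metadata: look up each element in the dictionary
        return [mapping[name] for name in names]
    if len(names) < max_unique:
        # Few elements: each element is its own category
        return list(names)
    # Many elements: fall back to a single label for the whole order
    return [order_label] * len(names)

contexts = ["C1", "C2", "C3"]
context_meta = {"C1": "Control", "C2": "Control", "C3": "Disease"}
lr_pairs = ["LR{}".format(i) for i in range(100)]

ctx_categories = categories_for_order(contexts, context_meta, "Contexts")
lr_categories = categories_for_order(lr_pairs, None, "Ligand-Receptor Pairs")
cell_categories = categories_for_order(["T", "B"], None, "Sender Cells")
```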

Source code in cell2cell/tensor/tensor.py
def generate_tensor_metadata(interaction_tensor, metadata_dicts, fill_with_order_elements=True):
    '''Uses a list of dicts (or None when a dict is missing) to generate a list of
    metadata for each order in the tensor.

    Parameters
    ----------
    interaction_tensor : cell2cell.tensor.BaseTensor
        A communication tensor.

    metadata_dicts : list
        A list of dictionaries. Each dictionary represents an order of the tensor. In
        an interaction tensor these orders should be contexts, LR pairs, sender cells
        and receiver cells. The keys are the elements in each order (they are
        contained in interaction_tensor.order_names) and the values are the categories
        that each element will be assigned as metadata.

    fill_with_order_elements : boolean, default=True
        Whether to use each element of a dimension as its own metadata when None is
        passed instead of a dictionary for the respective order/dimension. If True,
        each element in that order is used as its own category; if False, that
        dimension will not contain metadata.

    Returns
    -------
    metadata : list
        A list of pandas.DataFrames that will be used as an input of
        cell2cell.plotting.tensor_factors_plot.
    '''
    tensor_dim = len(interaction_tensor.tensor.shape)
    assert (tensor_dim == len(metadata_dicts)), "metadata_dicts should be of the same size as the number of orders/dimensions in the tensor"

    if interaction_tensor.order_labels is None:
        if tensor_dim == 4:
            default_cats = ['Contexts', 'Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']
        elif tensor_dim > 4:
            default_cats = ['Contexts-{}'.format(i + 1) for i in range(tensor_dim - 3)] + ['Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']
        elif tensor_dim == 3:
            default_cats = ['Ligand-Receptor Pairs', 'Sender Cells', 'Receiver Cells']
        else:
            raise ValueError('Too few dimensions in the tensor')
    else:
        assert len(interaction_tensor.order_labels) == tensor_dim, "The length of order_labels must match the number of orders/dimensions in the tensor"
        default_cats = interaction_tensor.order_labels

    if fill_with_order_elements:
        metadata = [pd.DataFrame(index=names) for names in interaction_tensor.order_names]
    else:
        metadata = [pd.DataFrame(index=names) if (meta is not None) else None for names, meta in zip(interaction_tensor.order_names, metadata_dicts)]

    for i, meta in enumerate(metadata):
        if meta is not None:
            if metadata_dicts[i] is not None:
                meta['Category'] = [metadata_dicts[i][idx] for idx in meta.index]
            else: # dict is None and fill with order elements TRUE
                if len(meta.index) < 75:
                    meta['Category'] = meta.index
                else:
                    meta['Category'] = len(meta.index) * [default_cats[i]]
            meta.index.name = 'Element'
            meta.reset_index(inplace=True)
    return metadata

interactions_to_tensor(interactions, experiment='single_cell', context_names=None, how='inner', outer_fraction=0.0, communication_score='expression_product', upper_letter_comparison=True, verbose=True)

Takes a list of Interaction pipelines (see classes in cell2cell.analysis.pipelines) and generates a communication tensor.

Parameters

interactions : list List of Interaction pipelines. The Interactions must all be either BulkInteractions or SingleCellInteractions.

experiment : str, default='single_cell' Type of Interaction pipelines in the list. Either 'single_cell' or 'bulk'.

context_names : list List of context names or labels for each of the Interaction pipelines. This list matches the length of interactions and the labels have to follow the same order.

how : str, default='inner' Approach to consider cell types and genes present across multiple contexts.

- 'inner' : Considers only cell types and genes that are present in all
            contexts (intersection).
- 'outer' : Considers all cell types and genes that are present
            across contexts (union).
- 'outer_genes' : Considers only cell types that are present in all
                  contexts (intersection), while all genes that are
                  present across contexts (union).
- 'outer_cells' : Considers only genes that are present in all
                  contexts (intersection), while all cell types that are
                  present across contexts (union).

outer_fraction : float, default=0.0 Threshold to filter the elements when how includes any outer option. Elements with a fraction abundance across samples at least this threshold will be included. When this value is 0, considers all elements across the samples. When this value is 1, it acts as using how='inner'.

communication_score : str, default='expression_product' Type of communication score to infer the potential use of a given ligand-receptor pair by a pair of cells/tissues/samples. Available communication_scores are:

- 'expression_mean' : Computes the average between the expression of a ligand
                      from a sender cell and the expression of a receptor on a
                      receiver cell.
- 'expression_product' : Computes the product between the expression of a
                        ligand from a sender cell and the expression of a
                        receptor on a receiver cell.
- 'expression_gmean' : Computes the geometric mean between the expression
                       of a ligand from a sender cell and the
                       expression of a receptor on a receiver cell.

upper_letter_comparison : boolean, default=True Whether to convert the gene names in the expression matrices and the protein names in the ppi_data to uppercase to match their names and integrate their respective expression level. Useful when there are inconsistencies in the names between the expression matrix and the ligand-receptor annotations.

verbose : boolean, default=True Whether to print the steps of the analysis.

Returns

tensor : cell2cell.tensor.InteractionTensor A 4D-communication tensor.

Source code in cell2cell/tensor/tensor.py
def interactions_to_tensor(interactions, experiment='single_cell', context_names=None, how='inner', outer_fraction=0.0,
                           communication_score='expression_product', upper_letter_comparison=True, verbose=True):
    '''Takes a list of Interaction pipelines (see classes in
    cell2cell.analysis.pipelines) and generates a communication
    tensor.

    Parameters
    ----------
    interactions : list
        List of Interaction pipelines. The Interactions must all be either
        BulkInteractions or SingleCellInteractions.

    experiment : str, default='single_cell'
        Type of Interaction pipelines in the list. Either 'single_cell' or 'bulk'.

    context_names : list
        List of context names or labels for each of the Interaction pipelines. This
        list matches the length of interactions and the labels have to follow the
        same order.

    how : str, default='inner'
        Approach to consider cell types and genes present across multiple contexts.

        - 'inner' : Considers only cell types and genes that are present in all
                    contexts (intersection).
        - 'outer' : Considers all cell types and genes that are present
                    across contexts (union).
        - 'outer_genes' : Considers only cell types that are present in all
                          contexts (intersection), while all genes that are
                          present across contexts (union).
        - 'outer_cells' : Considers only genes that are present in all
                          contexts (intersection), while all cell types that are
                          present across contexts (union).

    outer_fraction : float, default=0.0
        Threshold to filter the elements when `how` includes any outer option.
        Elements with a fraction abundance across samples at least this
        threshold will be included. When this value is 0, considers
        all elements across the samples. When this value is 1, it acts as using
        `how='inner'`.

    communication_score : str, default='expression_product'
        Type of communication score to infer the potential use of a given ligand-
        receptor pair by a pair of cells/tissues/samples.
        Available communication_scores are:

        - 'expression_mean' : Computes the average between the expression of a ligand
                              from a sender cell and the expression of a receptor on a
                              receiver cell.
        - 'expression_product' : Computes the product between the expression of a
                                ligand from a sender cell and the expression of a
                                receptor on a receiver cell.
        - 'expression_gmean' : Computes the geometric mean between the expression
                               of a ligand from a sender cell and the
                               expression of a receptor on a receiver cell.

    upper_letter_comparison : boolean, default=True
        Whether making uppercase the gene names in the expression matrices and the
        protein names in the ppi_data to match their names and integrate their
        respective expression level. Useful when there are inconsistencies in the
        names between the expression matrix and the ligand-receptor annotations.

    verbose : boolean, default=True
        Whether to print the steps of the analysis.

    Returns
    -------
    tensor : cell2cell.tensor.InteractionTensor
        A 4D-communication tensor.

    '''
    ppis = []
    rnaseq_matrices = []
    complex_sep = interactions[0].complex_sep
    complex_agg_method = interactions[0].complex_agg_method
    interaction_columns = interactions[0].interaction_columns
    for Int_ in interactions:
        if Int_.analysis_setup['cci_type'] == 'undirected':
            ppis.append(Int_.ref_ppi)
        else:
            ppis.append(Int_.ppi_data)

        if experiment == 'single_cell':
            rnaseq_matrices.append(Int_.aggregated_expression)
        elif experiment == 'bulk':
            rnaseq_matrices.append(Int_.rnaseq_data)
        else:
            raise ValueError("experiment must be 'single_cell' or 'bulk'")

    ppi_data = pd.concat(ppis)
    ppi_data = ppi_data.drop_duplicates().reset_index(drop=True)

    tensor = InteractionTensor(rnaseq_matrices=rnaseq_matrices,
                               ppi_data=ppi_data,
                               context_names=context_names,
                               how=how,
                               outer_fraction=outer_fraction,
                               complex_sep=complex_sep,
                               complex_agg_method=complex_agg_method,
                               interaction_columns=interaction_columns,
                               communication_score=communication_score,
                               upper_letter_comparison=upper_letter_comparison,
                               verbose=verbose
                               )
    return tensor

tensor_manipulation

concatenate_interaction_tensors(interaction_tensors, axis, order_labels, remove_duplicates=False, keep='first', mask=None, device=None)

Concatenates interaction tensors in a given tensor dimension or axis.

Parameters

interaction_tensors : list List of any tensor class in cell2cell.tensor.

axis : int The axis along which the arrays will be joined. If axis is None, arrays are flattened before use.

order_labels : list List of labels for dimensions or orders in the tensor.

remove_duplicates : boolean, default=False Whether removing duplicated names in the concatenated axis.

keep : str, default='first' Determines which duplicates (if any) to keep. Options are:

- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.

mask : ndarray list Helps avoiding missing values during a tensor factorization. A mask should be a boolean array of the same shape as the original tensor and should be 0 where the values are missing and 1 everywhere else. This must be of equal shape as the concatenated tensor.

device : str, default=None Device to use when backend is pytorch. Options are: {'cpu', 'cuda', None}

Returns

concatenated_tensor : cell2cell.tensor.PreBuiltTensor Final tensor after concatenation. It is a PreBuiltTensor that works with any interaction tensor based on the class BaseTensor.
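The effect of `axis`, `remove_duplicates` and `keep='first'` can be sketched with NumPy arrays. This is an assumption-laden illustration: the library operates on its own tensor classes through tensorly, and the duplicate handling shown here is only one plausible reading of the parameter descriptions above.

```python
import numpy as np

# Two toy 4D tensors that share every axis except the context axis (axis 0)
names_a, names_b = ["C1", "C2"], ["C2", "C3"]
tensor_a = np.full((2, 3, 2, 2), 1.0)
tensor_b = np.full((2, 3, 2, 2), 2.0)

concat = np.concatenate([tensor_a, tensor_b], axis=0)
names = names_a + names_b

# Keep only the first occurrence of each duplicated name (as with keep='first')
seen, keep_idx = set(), []
for i, name in enumerate(names):
    if name not in seen:
        seen.add(name)
        keep_idx.append(i)

dedup = concat[keep_idx]
dedup_names = [names[i] for i in keep_idx]
```

Here the duplicated context "C2" is taken from the first tensor, so the deduplicated tensor has three contexts instead of four.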

Source code in cell2cell/tensor/tensor_manipulation.py
def concatenate_interaction_tensors(interaction_tensors, axis, order_labels, remove_duplicates=False, keep='first',
                                    mask=None, device=None):
    '''Concatenates interaction tensors in a given tensor dimension or axis.

    Parameters
    ----------
    interaction_tensors : list
        List of any tensor class in cell2cell.tensor.

    axis : int
        The axis along which the arrays will be joined. If axis is None, arrays are flattened before use.

    order_labels : list
        List of labels for dimensions or orders in the tensor.

    remove_duplicates : boolean, default=False
        Whether to remove duplicated names in the concatenated axis.

    keep : str, default='first'
        Determines which duplicates (if any) to keep.
        Options are:

        - first : Drop duplicates except for the first occurrence.
        - last : Drop duplicates except for the last occurrence.
        - False : Drop all duplicates.

    mask : ndarray list, default=None
        Helps to avoid missing values during a tensor factorization. A mask should be
        a boolean array that is 0 where the values are missing and 1 everywhere else.
        It must have the same shape as the concatenated tensor. If None, the masks of
        the input tensors are concatenated, provided that every input tensor has a mask.

    device : str, default=None
        Device to use when backend is pytorch. Options are:
         {'cpu', 'cuda', None}

    Returns
    -------
    concatenated_tensor : cell2cell.tensor.PreBuiltTensor
        Final tensor after concatenation. It is a PreBuiltTensor that works
        with any interaction tensor based on the class BaseTensor.
    '''
    # Assert that all other dimensions contain the same elements:
    shape = len(interaction_tensors[0].tensor.shape)
    assert all(shape == len(tensor.tensor.shape) for tensor in interaction_tensors[1:]), "Tensors must have same number of dimensions"

    for i in range(shape):
        if i != axis:
            elements = interaction_tensors[0].order_names[i]
            for tensor in interaction_tensors[1:]:
                assert elements == tensor.order_names[i], "Tensors must have the same elements in the other axes."

    # Initialize tensors into a numpy object for performing subset
    # Use the same context as first tensor for everything
    try:
        context = tl.context(interaction_tensors[0].tensor)
    except Exception:
        context = {'dtype': interaction_tensors[0].tensor.dtype, 'device' : None}

    # Concatenate tensors
    concat_tensor = tl.concatenate([tensor.tensor.to('cpu') for tensor in interaction_tensors], axis=axis)
    if mask is not None:
        assert mask.shape == concat_tensor.shape, "Mask must have the same shape of the concatenated tensor. Here: {}".format(concat_tensor.shape)
    else: # Generate a new mask from all previous masks if all are not None
        if all([tensor.mask is not None for tensor in interaction_tensors]):
            mask = tl.concatenate([tensor.mask.to('cpu') for tensor in interaction_tensors], axis=axis)
        else:
            mask = None

    concat_tensor = tl.tensor(concat_tensor, device=context['device'])
    if mask is not None:
        mask = tl.tensor(mask, device=context['device'])

    # Concatenate names of elements for the given axis but keep the others as in one tensor
    order_names = []
    for i in range(shape):
        tmp_names = []
        if i == axis:
            for tensor in interaction_tensors:
                tmp_names += tensor.order_names[i]
        else:
            tmp_names = interaction_tensors[0].order_names[i]
        order_names.append(tmp_names)

    # Generate final object
    concatenated_tensor = PreBuiltTensor(tensor=concat_tensor,
                                         order_names=order_names,
                                         order_labels=order_labels,
                                         mask=mask,  # Change if you want to omit values in the decomposition
                                         device=device
                                        )

    # Remove duplicates
    if remove_duplicates:
        concatenated_tensor = subset_tensor(interaction_tensor=concatenated_tensor,
                                            subset_dict={axis: order_names[axis]},
                                            remove_duplicates=remove_duplicates,
                                            keep=keep,
                                            original_order=False)
    return concatenated_tensor

utils special

networks

export_network_to_cytoscape(network, filename)

Exports a network into a JSON file that is readable by the software Cytoscape.

Parameters

network : networkx.Graph, networkx.DiGraph or a pandas.DataFrame A networkx Graph or Directed Graph, or an adjacency matrix, where rows and columns are nodes and values represent the weight of the respective edge.

filename : str Path to save the network into a Cytoscape-readable format (JSON file in this case). E.g. '/home/user/network.json'
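
As a rough illustration of the kind of JSON that Cytoscape reads, the sketch below builds a Cytoscape.js-style "elements" structure from a small edge list using only the standard library. The helper to_cytoscape_elements and the example edges are hypothetical; the file written by this function comes from networkx and may contain additional keys and attributes.

```python
import json


def to_cytoscape_elements(edges):
    """Build a Cytoscape.js-style dictionary from (source, target, weight) tuples.

    Hypothetical helper for illustration only; not part of cell2cell.
    """
    nodes = sorted({node for source, target, _ in edges for node in (source, target)})
    return {
        'elements': {
            'nodes': [{'data': {'id': node, 'name': node}} for node in nodes],
            'edges': [
                {'data': {'source': source, 'target': target, 'weight': weight}}
                for source, target, weight in edges
            ],
        }
    }


# Made-up weighted edges between three cell types
edges = [('A', 'B', 2.0), ('B', 'C', 0.5)]
json_str = json.dumps(to_cytoscape_elements(edges))
```

Each node and edge is wrapped in a "data" object, which is where Cytoscape looks for identifiers and attributes such as the edge weight.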

Source code in cell2cell/utils/networks.py
def export_network_to_cytoscape(network, filename):
    '''
    Exports a network into a JSON file that is readable
    by the software Cytoscape.

    Parameters
    ----------
    network : networkx.Graph, networkx.DiGraph or a pandas.DataFrame
        A networkx Graph or Directed Graph, or an adjacency matrix,
        where rows and columns are nodes and values represent the
        weight of the respective edge.

    filename : str
        Path to save the network into a Cytoscape-readable format
        (JSON file in this case). E.g. '/home/user/network.json'
    '''
    # Accept either a network or an adjacency matrix. isinstance also
    # matches nx.DiGraph, which is a subclass of nx.Graph.
    if not isinstance(network, nx.Graph):
        network = generate_network_from_adjacency(network,
                                                  package='networkx')

    data = nx.readwrite.json_graph.cytoscape.cytoscape_data(network)

    # Export
    import json
    json_str = json.dumps(data)
    with open(filename, 'w') as outfile:
        outfile.write(json_str)

export_network_to_gephi(network, filename, format='excel', network_type='Undirected')

Exports a network into a spreadsheet that is readable by the software Gephi.

Parameters

network : networkx.Graph, networkx.DiGraph or a pandas.DataFrame A networkx Graph or Directed Graph, or an adjacency matrix, where rows and columns are nodes and values represent the weight of the respective edge.

filename : str Path to save the network into a Gephi-readable format.

format : str, default='excel' Format to export the spreadsheet. Options are:

- 'excel' : An excel file, either .xls or .xlsx
- 'csv' : Comma separated value format
- 'tsv' : Tab separated value format

network_type : str, default='Undirected' Type of edges in the network. They could be either 'Undirected' or 'Directed'.
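
The exported spreadsheet is an edge table with the columns Source, Target, Type and Weight. A minimal sketch of that layout, using only the standard library, is shown below. The helper adjacency_to_gephi_rows and the nested-dict adjacency matrix are hypothetical, and the sketch assumes that a zero entry means "no edge".

```python
import csv
import io


def adjacency_to_gephi_rows(adjacency, network_type='Undirected'):
    """Turn a nested-dict adjacency matrix into Gephi-style edge rows.

    Hypothetical helper for illustration only; not part of cell2cell.
    """
    rows = []
    seen = set()
    for source, neighbors in adjacency.items():
        for target, weight in neighbors.items():
            if weight == 0:
                continue  # Assumption: zero means there is no edge
            # In undirected networks, (A, B) and (B, A) are the same edge
            if network_type == 'Undirected':
                key = frozenset((source, target))
            else:
                key = (source, target)
            if key in seen:
                continue
            seen.add(key)
            rows.append({'Source': source, 'Target': target,
                         'Type': network_type, 'Weight': weight})
    return rows


# Made-up symmetric adjacency matrix for three nodes
adjacency = {
    'A': {'A': 0, 'B': 2.0, 'C': 0},
    'B': {'A': 2.0, 'B': 0, 'C': 0.5},
    'C': {'A': 0, 'B': 0.5, 'C': 0},
}
rows = adjacency_to_gephi_rows(adjacency)

# Write the rows as CSV text, as with format='csv'
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Source', 'Target', 'Type', 'Weight'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Gephi recognizes the Source, Target, Type and Weight column names when importing an edge table, which is why the column names are capitalized in the export.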

Source code in cell2cell/utils/networks.py
def export_network_to_gephi(network, filename, format='excel', network_type='Undirected'):
    '''
    Exports a network into a spreadsheet that is readable
    by the software Gephi.

    Parameters
    ----------
    network : networkx.Graph, networkx.DiGraph or a pandas.DataFrame
        A networkx Graph or Directed Graph, or an adjacency matrix,
        where rows and columns are nodes and values represent the
        weight of the respective edge.

    filename : str
        Path to save the network into a Gephi-readable format.

    format : str, default='excel'
        Format to export the spreadsheet. Options are:

        - 'excel' : An excel file, either .xls or .xlsx
        - 'csv' : Comma separated value format
        - 'tsv' : Tab separated value format

    network_type : str, default='Undirected'
        Type of edges in the network. They could be either
        'Undirected' or 'Directed'.
    '''
    # Accept either a network or an adjacency matrix. isinstance also
    # matches nx.DiGraph, which is a subclass of nx.Graph.
    if not isinstance(network, nx.Graph):
        network = generate_network_from_adjacency(network,
                                                  package='networkx')

    gephi_df = nx.to_pandas_edgelist(network)
    gephi_df = gephi_df.assign(Type=network_type)
    # When weight is not in the network
    if ('weight' not in gephi_df.columns):
        gephi_df = gephi_df.assign(weight=1)

    # Transform column names
    gephi_df = gephi_df[['source', 'target', 'Type', 'weight']]
    gephi_df.columns = [c.capitalize() for c in gephi_df.columns]

    # Save with different formats
    if format == 'excel':
        gephi_df.to_excel(filename, sheet_name='Edges', index=False)
    elif format == 'csv':
        gephi_df.to_csv(filename, sep=',', index=False)
    elif format == 'tsv':
        gephi_df.to_csv(filename, sep='\t', index=False)
    else:
        raise ValueError("Format not supported.")

generate_network_from_adjacency(adjacency_matrix, package='networkx')

Generates a network or graph object from an adjacency matrix.

Parameters

adjacency_matrix : pandas.DataFrame An adjacency matrix, where rows and columns are nodes and values represent the weight of the respective edge.

package : str, default='networkx' Package or python library used to build the network. Implemented options are {'networkx'}. Support for 'igraph' is planned but not implemented yet.

Returns

network : graph-like A graph object built with a python-library for networks.

Source code in cell2cell/utils/networks.py
def generate_network_from_adjacency(adjacency_matrix, package='networkx'):
    '''
    Generates a network or graph object from an adjacency matrix.

    Parameters
    ----------
    adjacency_matrix : pandas.DataFrame
        An adjacency matrix, where rows and columns are nodes
        and values represent the weight of the respective edge.

    package : str, default='networkx'
        Package or python library used to build the network.
        Implemented options are {'networkx'}. Support for
        'igraph' is planned but not implemented yet.

    Returns
    -------
    network : graph-like
        A graph object built with a python-library for networks.
    '''
    if package == 'networkx':
        network = nx.from_pandas_adjacency(adjacency_matrix)
    elif package == 'igraph':
        # A = adjacency_matrix.values
        # network = igraph.Graph.Weighted_Adjacency((A > 0).tolist(), mode=igraph.ADJ_UNDIRECTED)
        #
        # # Add edge weights and node labels.
        # network.es['weight'] = A[A.nonzero()]
        # network.vs['label'] = list(adjacency_matrix.columns)
        #
        # Warning("iGraph functionalities are not completely implemented yet.")
        raise NotImplementedError("Network using package {} not implemented".format(package))
    else:
        raise NotImplementedError("Network using package {} not implemented".format(package))
    return network

parallel_computing

agents_number(n_jobs)

Computes the number of agents/cores/threads that the computer can really provide given a number of jobs/threads requested.

Parameters

n_jobs : int Number of threads for parallelization.

Returns

agents : int Number of threads that the computer can really provide.
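
The mapping from the requested n_jobs to the number of agents can be sketched as follows. The function resolve_agents is hypothetical; it takes the CPU count as an explicit argument (n_cpus) so that the examples are reproducible on any machine, and it clamps the result to at least one agent.

```python
def resolve_agents(n_jobs, n_cpus):
    """Resolve a requested number of jobs against the available CPUs.

    Hypothetical helper for illustration only; not part of cell2cell.
    """
    if n_jobs < 0:
        # Negative values count back from the total: -1 means all CPUs,
        # -2 means all but one, and so on. Never go below one agent.
        return max(1, n_cpus + 1 + n_jobs)
    if n_jobs == 0:
        return 1
    # Positive requests are capped at the number of available CPUs
    return min(n_jobs, n_cpus)


# Made-up machine with 8 CPUs
requests = [-1, -2, 0, 4, 16, -20]
resolved = [resolve_agents(n, n_cpus=8) for n in requests]
```

On this assumed 8-CPU machine, a request of -1 resolves to 8 agents, -2 to 7, 0 to 1, 4 to 4, and 16 is capped at 8.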

Source code in cell2cell/utils/parallel_computing.py
def agents_number(n_jobs):
    '''
    Computes the number of agents/cores/threads that the
    computer can really provide given a number of
    jobs/threads requested.

    Parameters
    ----------
    n_jobs : int
        Number of threads for parallelization.

    Returns
    -------
    agents : int
        Number of threads that the computer can really provide.
    '''
    if n_jobs < 0:
        agents = cpu_count() + 1 + n_jobs
        if agents < 1:  # Always provide at least one agent
            agents = 1
    elif n_jobs > cpu_count():
        agents = cpu_count()

    elif n_jobs == 0:
        agents = 1
    else:
        agents = n_jobs
    return agents

parallel_spatial_ccis(inputs)

Parallel computing in cell2cell.analysis.pipelines.SpatialSingleCellInteractions. Not implemented yet; the function is currently a placeholder.

Source code in cell2cell/utils/parallel_computing.py
def parallel_spatial_ccis(inputs):
    '''
    Parallel computing in cell2cell.analysis.pipelines.SpatialSingleCellInteractions.
    Not implemented yet; the function is currently a placeholder.
    '''
    # TODO: Implement this for enabling spatial analysis and compute interactions in parallel

    # from cell2cell.core import spatial_operation
    #results = spatial_operation()

    # return results
    pass