web123456

Single-cell analysis (18): Cell communication analysis and visualization based on CellPhoneDB (Previous)

Cell communication analysis can give us some information about the mutual regulation/communication between cellular taxa, and this regulation between cells is mainly achieved through ligand binding and signaling. There may be specific cellular communication relationships for different differentiation and disease processes, so it is crucial to elucidate these communication relationships.

CellPhoneDB comes with a detailed ligand database that integrates previous public databases and will be manually corrected for more accurate ligand annotations. In addition, ligands with multiple subunits are also annotated. The graph below shows how many secreted and membrane proteins, protein complexes, and ligand relationships are included in the CellPhoneDB database, and what databases they are derived from.

1. CellPhoneDB infers the principles of cellular communication

Given the expression matrix and cell annotations, for the gene1-gene2 interactions, calculate the mean value of gene1 expression in one clusterA and the mean value of gene2 expression in another clusterB, and the mean value of the two is the MEAN; after randomly replacing the cell label, calculate the MEAN value of gene1 expression in "clusterA" and the mean value of gene2 expression in "clusterB" based on the new labels, and then find a mean value of MEAN, such a process. After randomly changing the label of the cell, based on the new label, calculate the expression mean of gene1 in "clusterA" and the expression mean of gene2 in "clusterB", and then seek a mean value of MEAN, and this process is repeated many times, then we can get a distribution of MEAN, i.e.null distributionWhere MEAN is located in this distribution, and at more extreme locations, constitutes a percentage of the p-value (definition of p-value). So CellPhoneDB's presumption of a significantly enriched receptor-ligand relationship between two cell types is still essentially based on the amount of receptor expression inside one cell type, and the amount of ligand expression inside the other cell type. Furthermore, if a relationship is ubiquitous (evident between all cell types), it is not found.

There are also a couple of other things to keep in mind:

  • Downsampling when large samples are taken and only 1/3 of the cells are analyzed
  • When multiple subunits are present, the one with low expression is considered.
  • Genes with a certain threshold of expression share are analyzed, the default is 10 percent

2. How to present the results

This is an example of the visualization dedicated to the original article, and there are two things to note here:

  1. The heatmap on the right represents the number of cell type two-by-two interactions, and we can see that along the diagonal, left and right are symmetrical, i.e., the number of interactions is the same for A-B vs. B-A. Why is this?
  2. On the left is a bubble diagram of the interactions of specific receptor ligand pairs, cell pairs, with the size of the dots indicating the level of significance and the color of theThe means of the average expression level of interacting molecule 1 in cluster 1 and interacting molecule 2 in cluster 2 Notice how it says.interacting molecule 1/2, without saying which is the receptor and which is the ligand.

The reason has to do with the built-in list of gene-gene interactions in CellPhoneDB, which can't distinguish between receptor and ligand, and for gene1-gene2, it can be either a gene1 ligand-gene2 receptor or a gene1 receptor-gene2 ligand (as shown in the figure below). I personally think it is also for this reason, that the heat map on the right in order to say it is convenient, so that no matter whether to do the receptor or ligand relationship are counted as two cell interactions, so A-B and B-A in the heat map is the same value (otherwise the horizontal and vertical coordinates to write an interacting molecule, to see the people will naturally ask, this molecule is the receptor or the ligand it, plus) together would save time - both are included).

This is mentioned on github:

Also for this reason, I see articles that keep an eye on CellPhoneDB's graphs if it's used, and if it's useddirected graphIndicating the number of cell groups in relation to each other two by two, I would wonder if this is appropriate (of course it isn't)

3. Practical analysis

Public backstage reply20210723Get the test data for this demo, as well as the main visualization code.

3.1 Format of the input file

annotated document Two columns in total, Cell column cell_type column, with column names; .csv, .txt suffixes are available

Expression paper Normalize the matrix after the general simple division of normalize on the line; .csv, .txt suffix can be

3.2 Operation

hardwareThe installation is not covered here, create acondaEnvironment.pip installJust download and install

The main code to run CellPhoneDB is simple:

source /home/huangsiyuan/miniconda3/bin/activate cpdb

file_count=/home/huangsiyuan/cpdb/test_normat.txt
file_anno=/home/huangsiyuan/cpdb/test_anno.txt
outdir=/home/huangsiyuan/cpdb/test

if [ ! -d ${outdir} ]; then
  mkdir ${outdir}
fi

cellphonedb method statistical_analysis \
            --counts-data hgnc_symbol \
            --output-path ${outdir} \
            --threshold 0.01 \ #Percentage of cells expressing the specific ligand or receptor
            --threads 10 \
            ${file_anno} ${file_count}

source /home/huangsiyuan/miniconda3/bin/deactivate cpdb

#If the cell count is too high,Sampling parameters can be added,Only analyze by default1/3human cell
#--subsampling
#--subsampling-log true #For those who do notlogTransformed data,And add this parameter.
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21

After this step in thetestInside the folder, 4 files are generated




significant_means.txt
  • 1
  • 2
  • 3

Among them.

  • The rows are receptor-ligand pairs, the columns are cell pairs, and the values are the average of the receptor and ligand expression means in the corresponding clusters;
  • Comparison of format andSimilarly, the value is the p-value;
  • significant_means.txtThe format and content are similar to that of theSimilar, although only means with p-values less than 0.05 were retained.

4. Visualization of results

In this step, I generally only use the above mentionedcap (a poem)Documentation Let's start by drawing those two diagrams, modeled after the original literature

library(tidyverse)
library(RColorBrewer)
library(scales)

pvalues=(". /test/",header = T,sep = "\t",stringsAsFactors = F)
pvalues=pvalues[,12:dim(pvalues)[2]] # Do not focus on the first 11 columns at this time
statdf=(colSums(pvalues < 0.05)) #count the number of significant ligand pairs in the case of a particular cell pair; threshold can be chosen by yourself
colnames(statdf)=c("number")

# Molecules at the front of the list are defined as indexa; molecules at the back of the list are defined as indexb
statdf$indexb=str_replace(rownames(statdf),"^. *\\\." ,"")
statdf$indexa=str_replace(rownames(statdf),"\\... *$","")
# Set the order of the appropriate cell types
rankname=sort(unique(statdf$indexa))
# Convert to factor type, when drawing the graph, the graph will be arranged in the pre-set order
statdf$indexa=factor(statdf$indexa,levels = rankname)
statdf$indexb=factor(statdf$indexb,levels = rankname)

statdf%>%ggplot(aes(x=indexa,y=indexb,fill=number))+geom_tile(color="white")+
  scale_fill_gradientn(colours = c("#4393C3", "#ffdbba", "#B2182B"),limits=c(0,20))+
  scale_x_discrete("cluster 1 produces molecule 1")+
  scale_x_discrete("cluster 1 produces molecule 1")+ scale_y_discrete("cluster 2 produces molecule 2")+
  theme_minimal()+
  theme(
     = element_text(hjust = 1, vjust = NULL, angle = 45),
     = element_blank()
  )
ggsave(filename = "." ,device = "pdf",width = 12,height = 10,units = c("cm"))
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27

The inconsistency here with the diagrams in the literature is that my diagram is not symmetric about the diagonal because I did not sum the interactions A-B, B-A

For example, in the output of CellPhoneDB, there are 10 significant interactions between A-B and 20 significant interactions between B-A [1]. However, the interactions of A-B actually include 8 times of A as ligand and 2 times of A as receptor, and the interactions of B-A actually include 19 times of B as ligand and 1 time of B as receptor, so technically speaking, the two types of A and B cells interact with each other, and A acts as ligand for 9 times, and B acts as ligand for 21 times [②], which is the information that CellPhoneDB can not give. Of course, the total number of interactions is still 30 [③].

In other words, the symmetrical figure in the literature gives information ③, the one above me gives information ①, and information ② is not known (if one goes one by one with the naked eye to see which is receptor and which is ligand in the CellPhoneDB database for gene1-gene2, one can still count).

Due to the length of this article, the rest of the visualization will be shown in the next article, so stay tuned~!


bibliography

[1] Efremova M, Vento-Tormo M, Teichmann S A, et al. CellPhoneDB: inferring cell–cell communication from combined expression of multi-subunit ligand–receptor complexes[J]. Nature protocols, 2020, 15(4): 1484-1506.

Due to the limited level, there are mistakes, welcome to criticize and correct!