
Single-cell sequencing data integration

I. Principles of single-cell sequencing data integration

Integrating single-cell sequencing data is somewhat similar to assembling and aligning genomes. Integration requires finding the similar parts of the two single-cell datasets, which implies that a subset of cells from the two different experiments share a similar biological state (even though the absolute gene expression values of these cells may differ, the cell populations are consistent or similar as a whole).

A. Two single-cell datasets from different experiments contain cell populations with similar biological states, but the query dataset also contains cell populations unique to itself.

B. Perform canonical correlation analysis (CCA) to project both datasets into a shared low-dimensional space, followed by L2 normalization.

C. Identify mutual nearest neighbors (MNNs) between the two datasets in this shared space; these cell pairs act as anchors between the two datasets and drive the integration.

D. Each anchor pair is assigned a consistency score.

E. Based on these anchor cells and their scores, correction vectors are computed and used to integrate the datasets.

Source: Single-cell sequencing data integration – Zhihu (/p/158974557?ivk_sa=1024320u)
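To make step C concrete, here is a minimal, purely illustrative R sketch of mutual-nearest-neighbor identification. It assumes the two datasets have already been embedded in the same shared space (for example after CCA); the function name find_mnn, the inputs emb1/emb2 and the neighborhood size k are all hypothetical, and Seurat's actual implementation (FindIntegrationAnchors) is considerably more sophisticated.

  # Toy MNN search between two embeddings that share a coordinate space.
  # emb1, emb2: cells-x-dimensions matrices with the same number of columns.
  find_mnn <- function(emb1, emb2, k = 5) {
    n1 <- nrow(emb1)
    d  <- as.matrix(dist(rbind(emb1, emb2)))                   # all pairwise distances
    cross <- d[1:n1, (n1 + 1):(n1 + nrow(emb2)), drop = FALSE] # dataset 1 vs dataset 2
    nn12 <- t(apply(cross, 1, function(x) order(x)[1:k]))      # neighbors of set-1 cells in set 2
    nn21 <- t(apply(cross, 2, function(x) order(x)[1:k]))      # neighbors of set-2 cells in set 1
    anchors <- list()
    for (i in seq_len(nrow(nn12))) {
      for (j in nn12[i, ]) {
        if (i %in% nn21[j, ]) anchors[[length(anchors) + 1]] <- c(i, j)  # mutual pair
      }
    }
    do.call(rbind, anchors)  # one row per anchor: (cell index in dataset 1, cell index in dataset 2)
  }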

II. General steps

Source: Comprehensive integration of single-cell data – Zhihu (/p/465227964)

(1) Normalization

In all analyses, we apply standard preprocessing to the single-cell RNA-seq datasets. Unless otherwise stated, we first log-normalize all datasets. The purpose of log-normalization is to prevent the data variance from being dominated by highly expressed genes; after log-normalization, variance is largely decoupled from expression level, so genes with lower expression are less likely to be overlooked.
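As a hedged illustration of what this step does, the snippet below log-normalizes a raw genes-by-cells count matrix in the same spirit as Seurat's NormalizeData with the "LogNormalize" method; the matrix name counts and the scale factor of 10,000 are assumptions for the example.

  # Scale each cell to a common total, then take log(1 + x).
  log_normalize <- function(counts, scale.factor = 1e4) {
    cpx <- t(t(counts) / colSums(counts)) * scale.factor  # per-cell library-size scaling
    log1p(cpx)                                            # natural log of (1 + scaled value)
  }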

After that, we apply a z-score transformation, a standard step before dimensionality reduction by PCA. The purpose of the z-score transformation is shown in the figure below: for technical reasons, the transcript levels detected in different cells differ (i.e., the data are noisy), and after the transformation the data become comparable across cells.
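A minimal sketch of the per-gene z-score step, assuming norm_data is the log-normalized genes-by-cells matrix from the previous step (in Seurat this role is played by ScaleData):

  # Center and scale each gene across cells so genes are comparable before PCA.
  z_score <- function(norm_data) {
    t(scale(t(norm_data)))  # scale() works column-wise, so transpose to act on genes
  }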

(2) Feature selection for individual datasets

In each dataset, we next perform feature selection, selecting a set of features (i.e., genes) that exhibit a high degree of variation across cells and can be prioritized for downstream analysis.

For each gene, we calculate the variance of its standardized value across all cells, and we directly use this variance to rank the features, selecting the 2000 genes with the largest standardized variance as "highly variable genes". This process can be done using the FindVariableFeatures function in Seurat v3.
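A hedged sketch of this ranking, assuming scaled is a z-scored genes-by-cells matrix with gene names as row names (FindVariableFeatures with the "vst" method additionally models the mean-variance trend, so this is a simplification):

  gene_var <- apply(scaled, 1, var)                        # variance of each gene across cells
  hvg <- names(sort(gene_var, decreasing = TRUE))[1:2000]  # top 2000 most variable genes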

(3) Feature selection for integrated analysis of multiple datasets

When integrating all datasets, we want to give priority to genes that are highly variable in multiple datasets. Therefore, we first select features for each dataset individually, using the process described above. We then rank genes by the number of datasets in which they were independently identified as highly variable and select the top 2000 for downstream analysis. These steps can be accomplished with the SelectIntegrationFeatures function in Seurat v3.
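For example, assuming obj.list is a list of Seurat objects that have each already been normalized and run through FindVariableFeatures, the shared feature set could be obtained with:

  features <- SelectIntegrationFeatures(object.list = obj.list, nfeatures = 2000)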

(4) Identification of anchor correspondences between two datasets

The key step in integrative analysis is to identify anchors between the two datasets. An anchor is a pair of cells (one from each dataset) that are each other's most closely matched cells across datasets and are therefore predicted to come from a common biological state. Finding anchors lets us establish cellular correspondences between the datasets, which drive the subsequent integration. Anchors for reference assembly or for transfer learning can be computed with the functions FindIntegrationAnchors and FindTransferAnchors in the Seurat v3 package, respectively.
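As a hedged example of the transfer-learning route (the reference-assembly route is shown in full in the code below), suppose ref and query are two preprocessed Seurat objects and ref carries a hypothetical celltype metadata column:

  # Anchors between a reference and a query, then label transfer onto the query.
  anchors.xfer <- FindTransferAnchors(reference = ref, query = query, dims = 1:20)
  predictions  <- TransferData(anchorset = anchors.xfer, refdata = ref$celltype, dims = 1:20)
  query <- AddMetaData(query, metadata = predictions)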

III. Code Implementation

Approach 1: Standard Integration Process for Seurat v3

(1) Data acquisition

  # Clear the environment and develop good habits
  rm(list = ls())
  setwd("")
  library(Seurat)
  library(multtest)
  library(dplyr)
  library(ggplot2)
  library(patchwork)
  library(SeuratData)  # provides the functions InstallData and LoadData
  library(tidyverse)

The test data is the ifnb dataset provided by SeuratData, which contains two PBMC samples: one treated with interferon stimulation and the other a control.

  # install.packages('devtools')
  library(devtools)
  devtools::install_github('satijalab/seurat-data', force = TRUE)
  InstallData("ifnb")
  LoadData("ifnb")

The ifnb object is split by its stim status into a list of two Seurat objects (STIM and CTRL), corresponding to the interferon-stimulated sample and the control sample.

  ifnb.list <- SplitObject(ifnb, split.by = "stim")
  C57 <- ifnb.list$CTRL
  AS1 <- ifnb.list$STIM

(2) Data processing

Normalize the data and select highly variable genes for each sample separately

  myfunction1 <- function(obj){
    obj <- NormalizeData(obj, normalization.method = "LogNormalize", scale.factor = 10000)
    obj <- FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000)
    return(obj)
  }
  C57 <- myfunction1(C57)
  AS1 <- myfunction1(AS1)

Integration: FindIntegrationAnchors is used to find the anchors (matched cell pairs) used for sample integration

  anchors <- FindIntegrationAnchors(object.list = list(C57, AS1), dims = 1:20)
  scRNA <- IntegrateData(anchorset = anchors, dims = 1:20)

Note: compared with the Harmony approach, the integration steps above are very memory- and time-consuming for larger datasets (tens of thousands of cells); for example, a dataset of about 9 GB could not be run with 32 GB of RAM. In addition, if any Seurat object contains very few cells (roughly 200 or fewer), this step throws an error. In both situations the Harmony approach is recommended.
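If the error with very small objects is hit, one hedged workaround (before switching to Harmony) is to lower the neighborhood sizes used by FindIntegrationAnchors and IntegrateData, since their defaults (k.filter = 200, k.weight = 100) can exceed the number of cells in a small object; the exact values below are illustrative only.

  n.min   <- min(ncol(C57), ncol(AS1))  # cells in the smaller object
  anchors <- FindIntegrationAnchors(object.list = list(C57, AS1), dims = 1:20,
                                    k.filter = min(200, n.min))
  scRNA   <- IntegrateData(anchorset = anchors, dims = 1:20,
                           k.weight = min(100, n.min - 1))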

(3) Data analysis and visualization

1. Switch to the integrated assay. Setting the assay to "integrated" means the following analysis is based on the integrated data, which can be thought of as batch-corrected data.

  DefaultAssay(scRNA) <- "integrated"
  # Run the standard workflow for visualization and clustering
  scRNA <- ScaleData(scRNA, features = rownames(scRNA))
  scRNA <- RunPCA(scRNA, npcs = 50, verbose = FALSE)
  scRNA <- FindNeighbors(scRNA, dims = 1:30)
  scRNA <- FindClusters(scRNA, resolution = 0.5)
  scRNA <- RunUMAP(scRNA, dims = 1:30)
  scRNA <- RunTSNE(scRNA, dims = 1:30)
  p1 <- DimPlot(scRNA, label = TRUE, group.by = "stim")  # integrated

  # Visualization
  p2 <- DimPlot(scRNA, reduction = "umap", group.by = "stim")
  p3 <- DimPlot(scRNA, reduction = "umap", label = TRUE, repel = TRUE)

 

2. Setting DefaultAssay to "RNA" means the next analysis is based on the original (uncorrected) values. Identification of conserved cell type markers can still use the clusters obtained from the integrated data, but using the integrated values for differential expression is inappropriate; most DE tools expect raw counts.

  DefaultAssay(scRNA) <- "RNA"
  scRNA <- ScaleData(scRNA, features = rownames(scRNA))
  scRNA <- RunPCA(scRNA, npcs = 50, verbose = FALSE)
  scRNA <- FindNeighbors(scRNA, dims = 1:30)
  scRNA <- FindClusters(scRNA, resolution = 0.5)
  scRNA <- RunUMAP(scRNA, dims = 1:30)
  scRNA <- RunTSNE(scRNA, dims = 1:30)
  p4 <- DimPlot(scRNA, label = TRUE, group.by = "stim")
  p1 | p4

 

As can be seen, a batch effect appears if the raw values are analyzed: there are almost no overlapping cell populations between the two samples. Integration methods remove the batch effect by first identifying pairs of cells across datasets that are in matching biological states (anchors), and then correcting the batch effect between the datasets based on these anchors. After batch correction, cells of the same type cluster together, which makes cell-type identification easier.
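Following the note above about doing marker identification on the RNA assay, a hedged example of finding markers for one cluster that are conserved across the stimulation conditions (cluster "0" is arbitrary, and FindConservedMarkers requires the metap package):

  DefaultAssay(scRNA) <- "RNA"
  cluster0.markers <- FindConservedMarkers(scRNA, ident.1 = 0, grouping.var = "stim",
                                           verbose = FALSE)
  head(cluster0.markers)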

Approach 2: Harmony

Its main advantages are speed and low memory usage.

Principle: different colors represent different datasets and different shapes represent different cell types. Harmony first uses principal component analysis (PCA) to embed the transcriptome expression profiles into a low-dimensional space, and then applies an iterative procedure to remove dataset-specific effects.

(A) Harmony probabilistically assigns cells to clusters, maximizing the diversity of datasets within each cluster.

(B) For each cluster, Harmony computes a global centroid across all datasets, as well as a centroid for each dataset.

(C) Within each cluster, Harmony uses these centroids to compute a correction factor for each dataset.

(D) Finally, Harmony corrects each cell with a cell-specific factor: because Harmony uses soft clustering, each cell is corrected by a linear combination of the cluster-level correction factors, weighted by its soft cluster assignments from step A. Steps A through D are repeated until convergence; with each round, the dependence between cluster assignment and dataset decreases.

Source: A journey of single-cell data: "harmony" integration across platforms – Tencent Cloud+ Community (/developer/article/1650648)
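To make steps A to D above more concrete, here is a toy, purely illustrative R sketch of a single correction round: given a PCA embedding, a dataset label per cell, and a soft cluster-assignment matrix, it shifts each cell by the weighted difference between its dataset's cluster centroid and the global cluster centroid. All names are hypothetical, and the real algorithm (harmony::RunHarmony) additionally encourages dataset diversity within clusters and iterates until convergence.

  # Z: cells x PCs embedding; batch: dataset label per cell;
  # R: cells x clusters soft-assignment matrix (rows sum to 1).
  correct_once <- function(Z, batch, R) {
    Zc <- Z
    for (k in seq_len(ncol(R))) {
      w <- R[, k]
      global.cent <- colSums(Z * w) / sum(w)                   # global centroid of cluster k (B)
      for (b in unique(batch)) {
        idx <- batch == b
        batch.cent <- colSums(Z[idx, , drop = FALSE] * w[idx]) / sum(w[idx])
        shift <- batch.cent - global.cent                      # dataset-specific correction (C)
        Zc[idx, ] <- Zc[idx, ] - outer(w[idx], shift)          # weighted per-cell correction (D)
      }
    }
    Zc
  }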

1. Data processing

Before running Harmony, create a Seurat object and run the standard workflow up to PCA, as in the first approach.

  if (!require(harmony)) devtools::install_github("immunogenomics/harmony")
  library(harmony)
  # pbmc: the unintegrated Seurat object created beforehand (e.g. the ifnb object loaded above)
  scRNA_harmony <- pbmc
  scRNA_harmony <- scRNA_harmony %>%
    Seurat::NormalizeData() %>%
    FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>%
    ScaleData()
  scRNA_harmony <- RunPCA(scRNA_harmony, npcs = 50, verbose = FALSE)

2. Run Harmony

The easiest way to run Harmony is to pass in the Seurat object and specify the variable to integrate over. RunHarmony returns the Seurat object with corrected Harmony coordinates. Setting plot_convergence to TRUE lets us confirm that the Harmony objective function improves in each round.

  ##### Run the standard Seurat workflow up to the PCA step, then integrate with Harmony,
  ##### which can simply be regarded as a new kind of dimensionality reduction. #####
  scRNA_harmony <- scRNA_harmony %>% RunHarmony("stim", plot_convergence = TRUE)
  # Set reduction = "harmony" so subsequent steps use the Harmony-corrected PCA embedding
  scRNA_harmony <- scRNA_harmony %>%
    RunUMAP(reduction = "harmony", dims = 1:30) %>%
    FindNeighbors(reduction = "harmony", dims = 1:30) %>%
    FindClusters(resolution = 0.5) %>%
    identity()
  scRNA_harmony <- scRNA_harmony %>%
    RunTSNE(reduction = "harmony", dims = 1:30)

3. Visualization

  # tSNE plot grouped by sample, group.by = "stim"
  p5 <- DimPlot(scRNA_harmony, reduction = "tsne", group.by = "stim", pt.size = 0.5) +
    theme(axis.title = element_blank(),   # the original theme() arguments were lost;
          axis.text  = element_blank(),   # blanking the axis title, text and ticks is
          axis.ticks = element_blank())   # an assumed reconstruction
  # tSNE plot labelled by cluster, group.by = "ident"
  p6 <- DimPlot(scRNA_harmony, reduction = "tsne", group.by = "ident", pt.size = 0.5,
                label = TRUE, repel = TRUE) +
    theme(axis.title = element_blank(),
          axis.text  = element_blank(),
          axis.ticks = element_blank())
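The two panels can then be compared side by side with patchwork (loaded earlier); saving to a file is optional and the file name is just an example.

  p5 | p6
  # ggsave("harmony_tsne.png", p5 | p6, width = 10, height = 5)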