The ConsensusClusterPlus package implements consensus clustering in R.
There are three main steps: 1, preparing the input data; 2, running the clustering; 3, computing cluster consensus and item consensus.
- 1-Input data
The input has no special requirements: a normalized expression matrix with samples as columns and genes as rows.
Note that the package's example workflow starts by keeping the top 5,000 most variable genes, ranked by median absolute deviation (MAD), to get cleaner clusters (much like highly variable gene selection in single-cell analysis). How many genes to keep and how to select them is up to you, because this step uses base R statistics rather than functions built into the package.
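library(ALL)
data(ALL)
# Expression matrix: genes in rows, samples in columns
d = exprs(ALL)
d[1:5,1:5]

# Keep the 5,000 genes with the highest median absolute deviation
mads = apply(d,1,mad)
d = d[rev(order(mads))[1:5000],]

# Median-center each gene
d = sweep(d,1, apply(d,1,median,na.rm=T))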
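Since the gene count and the selection method are free choices, it can help to wrap the selection step in a small function. The following is only a sketch in base R on simulated data; `select_top_genes` and the matrix `expr` are hypothetical names made up for this example and are not part of ConsensusClusterPlus.

```r
# Hypothetical helper: keep the n rows (genes) with the largest spread.
# `stat` is any per-gene variability function (mad, var, IQR, ...).
select_top_genes <- function(mat, n = 5000, stat = mad) {
  scores <- apply(mat, 1, stat)
  n <- min(n, nrow(mat))
  keep <- order(scores, decreasing = TRUE)[seq_len(n)]
  mat[keep, , drop = FALSE]
}

# Simulated matrix standing in for real data: 200 genes x 12 samples
set.seed(1)
expr <- matrix(rnorm(200 * 12), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("s", 1:12)))
# Inflate the spread of the first 20 genes
expr[1:20, ] <- expr[1:20, ] * 5

top_mad <- select_top_genes(expr, n = 20, stat = mad)
top_var <- select_top_genes(expr, n = 20, stat = var)

# Median-center each selected gene
centered <- sweep(top_mad, 1, apply(top_mad, 1, median, na.rm = TRUE))
dim(centered)
```

Swapping `stat = mad` for `stat = var` changes the ranking criterion without touching the rest of the workflow.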
- 2-Clustering
Several important parameters:
pItem: proportion of items (columns, i.e. samples) drawn in each resampling
pFeature: proportion of features (rows, i.e. genes) drawn in each resampling
maxK: maximum number of clusters to evaluate
reps: number of resamplings
clusterAlg: clustering algorithm; "hc" is agglomerative hierarchical clustering
distance: distance measure; "pearson" is 1 - Pearson correlation
Note: in practice, maxK and reps can be set higher, e.g. maxK = 20 and reps = 1000.
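To make these parameters concrete, here is a rough base-R sketch of the resampling idea behind a consensus matrix: repeatedly draw a fraction `pItem` of the samples, cluster them hierarchically on 1 - Pearson correlation, and record how often each pair of samples lands in the same cluster. It is an illustration on simulated data, not the package's internal code; `consensus_sketch` and `d_sim` are hypothetical names, average linkage is an assumption of the sketch, and feature resampling is left out (equivalent to pFeature = 1).

```r
# Simulated data: 50 genes x 20 samples in two groups of 10
set.seed(42)
d_sim <- matrix(rnorm(50 * 20), nrow = 50,
                dimnames = list(NULL, paste0("s", 1:20)))
d_sim[1:25, 1:10]   <- d_sim[1:25, 1:10] + 3    # group 1 high in genes 1-25
d_sim[26:50, 11:20] <- d_sim[26:50, 11:20] + 3  # group 2 high in genes 26-50

consensus_sketch <- function(d, k, reps = 50, pItem = 0.8) {
  n <- ncol(d)
  together <- matrix(0, n, n)  # times a pair fell in the same cluster
  sampled  <- matrix(0, n, n)  # times a pair was drawn in the same resample
  for (r in seq_len(reps)) {
    idx <- sort(sample(n, floor(pItem * n)))      # resample items (columns)
    dist_r <- as.dist(1 - cor(d[, idx], method = "pearson"))
    cl <- cutree(hclust(dist_r, method = "average"), k = k)
    sampled[idx, idx]  <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  # Consensus = co-clustering count / co-sampling count (guard against 0/0)
  cm <- together / pmax(sampled, 1)
  dimnames(cm) <- list(colnames(d), colnames(d))
  cm
}

cm2 <- consensus_sketch(d_sim, k = 2)
round(cm2[c("s1", "s2", "s11", "s12"), c("s1", "s2", "s11", "s12")], 2)
```

Raising `reps` makes the consensus values more stable at the cost of run time, which is why larger values are used in practice.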
library(ConsensusClusterPlus)
# Plots are written to this directory
title=tempdir()
results = ConsensusClusterPlus(d,maxK=6,reps=50,pItem=0.8,pFeature=1,
    title=title,clusterAlg="hc",distance="pearson",seed=1262118388.71279,plot="png")
The result is a list in which the k-th element holds the results for k clusters (for example, results[[2]] for k = 2).
### View important results

# consensusMatrix - the consensus matrix.
# For example, the top five rows and columns of results for k=2:
results[[2]][["consensusMatrix"]][1:5,1:5]

# consensusTree - hclust object
results[[2]][["consensusTree"]]

# consensusClass - the sample classifications
results[[2]][["consensusClass"]][1:5]

# ml - consensus matrix result
# clrs - colors for cluster
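To see how a consensus matrix, a tree, and class labels relate to one another, the sketch below treats 1 - consensus as a distance, builds a hierarchical tree, and cuts it at k = 2. The toy matrix `cm_toy` holds made-up values, and the use of average linkage is an assumption for illustration rather than a statement about what the package does internally.

```r
# Toy consensus matrix for 4 samples (made-up values)
cm_toy <- matrix(c(1.0, 0.8, 0.2, 0.1,
                   0.8, 1.0, 0.3, 0.2,
                   0.2, 0.3, 1.0, 0.9,
                   0.1, 0.2, 0.9, 1.0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("s", 1:4), paste0("s", 1:4)))

# High consensus means "close", so use 1 - consensus as the distance
tree <- hclust(as.dist(1 - cm_toy), method = "average")

# Cut the tree into k = 2 groups to get one class label per sample
classes <- cutree(tree, k = 2)
classes
table(classes)
```

The resulting named vector plays the same role as consensusClass: one integer label per sample, ready to be used as a grouping variable in downstream analysis.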
- 3-Computing cluster consensus vs. item consensus
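Cluster consensus is the average pairwise consensus among the members of a cluster; item consensus is the average consensus between one item and the members of a given cluster. The two are loosely analogous to within-cluster homogeneity and to the concept of module membership in WGCNA, respectively.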
icl = calcICL(results,title=title,plot="png")

icl[["clusterConsensus"]]
#     k cluster clusterConsensus
#[1,] 2       1        0.7681668
#[2,] 2       2        0.9788274
#[3,] 3       1        0.6176820
#[4,] 3       2        0.9190744
#[5,] 3       3        1.0000000
#[6,] 4       1        0.8446083

icl[["itemConsensus"]][1:5,]
#  k cluster  item itemConsensus
#1 2       1 28031     0.6173782
#2 2       1 28023     0.5797202
#3 2       1 43012     0.5961974
#4 2       1 28042     0.5644619
#5 2       1 28047     0.6259350
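Both quantities are simple averages over entries of the consensus matrix, which the following base-R sketch works through by hand. The matrix `cm`, the labels in `classes`, and the two helper functions are hypothetical and use made-up values; excluding an item's consensus with itself is a choice made in this sketch.

```r
# Toy consensus matrix for 5 samples (made-up values)
cm <- matrix(c(1.0, 0.9, 0.8, 0.1, 0.0,
               0.9, 1.0, 0.9, 0.2, 0.1,
               0.8, 0.9, 1.0, 0.1, 0.2,
               0.1, 0.2, 0.1, 1.0, 0.9,
               0.0, 0.1, 0.2, 0.9, 1.0),
             nrow = 5, byrow = TRUE,
             dimnames = list(paste0("s", 1:5), paste0("s", 1:5)))
classes <- c(s1 = 1, s2 = 1, s3 = 1, s4 = 2, s5 = 2)

# Cluster consensus: mean consensus over all distinct pairs in one cluster
cluster_consensus <- function(cm, classes, cl) {
  members <- names(classes)[classes == cl]
  sub <- cm[members, members, drop = FALSE]
  mean(sub[upper.tri(sub)])
}

# Item consensus: mean consensus between one item and a cluster's members,
# leaving out the item itself
item_consensus <- function(cm, classes, item, cl) {
  members <- setdiff(names(classes)[classes == cl], item)
  mean(cm[item, members])
}

cc1 <- cluster_consensus(cm, classes, 1)      # mean of 0.9, 0.8, 0.9
cc2 <- cluster_consensus(cm, classes, 2)      # single pair: 0.9
ic_own   <- item_consensus(cm, classes, "s1", 1)  # mean of 0.9, 0.8
ic_other <- item_consensus(cm, classes, "s1", 2)  # mean of 0.1, 0.0
c(cc1 = cc1, cc2 = cc2, ic_own = ic_own, ic_other = ic_other)
```

A sample whose item consensus is high for its own cluster and low for the others is a confident member of that cluster; similar values across clusters flag an ambiguous sample.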
- 4-Graphical presentation
For further details, refer to:
Bioconductor - ConsensusClusterPlus
R Language | ConsensusClusterPlus Package for Consensus Clustering
In practice, consensus clustering is often tied to a specific biological process, such as angiogenesis or hypoxia. First find the related gene set, subset the expression matrix to those genes, and then cluster the samples into subtypes. Once the subtypes are defined, the downstream analysis can go in many directions: digging into the molecular mechanisms that differ between subtypes, building co-expression networks, or exploring the diagnostic or prognostic value of the subtypes.
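The subsetting step is a one-liner, but it is worth checking which gene-set members are actually present in the matrix. A minimal sketch on simulated data follows; `expr`, `hypoxia_genes`, and the gene names are placeholders invented for this example, not a real curated gene set.

```r
# Simulated expression matrix: 100 genes x 8 samples
set.seed(7)
expr <- matrix(rnorm(100 * 8), nrow = 100,
               dimnames = list(paste0("GENE", 1:100), paste0("s", 1:8)))

# Hypothetical gene set; in practice it would come from a curated source
hypoxia_genes <- c("GENE3", "GENE17", "GENE42", "GENE88", "NOT_MEASURED")

# Keep only the genes that are present, and report the ones that are not
present <- intersect(hypoxia_genes, rownames(expr))
missing <- setdiff(hypoxia_genes, rownames(expr))
expr_sub <- expr[present, , drop = FALSE]

dim(expr_sub)
missing
```

The reduced matrix `expr_sub` would then take the place of `d` in the clustering call.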
Alternatively, first use an algorithm such as CIBERSORT to compute an immune infiltration matrix, and then cluster the samples on the infiltration results.
This idea is not very different from scoring a gene set with ssGSEA or GSVA and splitting the samples into high and low groups at the median score. The GSVA approach is more intuitive and the prognostic or diagnostic differences tend to be more pronounced, while distance-based clustering may have an advantage when it comes to co-expression networks.
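For comparison, the median-split alternative can be sketched in a few lines of base R. The score used here is a plain mean of per-gene z-scores, which is only a stand-in and is not ssGSEA or GSVA; `expr` and `gene_set` are simulated placeholders.

```r
# Simulated expression matrix: 30 genes x 10 samples
set.seed(3)
expr <- matrix(rnorm(30 * 10), nrow = 30,
               dimnames = list(paste0("g", 1:30), paste0("s", 1:10)))
gene_set <- paste0("g", 1:5)   # hypothetical gene set

# Stand-in score (NOT ssGSEA/GSVA): z-score each gene across samples,
# then average the z-scores of the gene-set genes per sample
z <- t(scale(t(expr[gene_set, ])))
score <- colMeans(z)

# Split samples at the median score
group <- ifelse(score > median(score), "high", "low")
table(group)
```

Unlike consensus clustering, this always yields exactly two groups of roughly equal size, regardless of whether the data contain a natural split.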