Single Cell Analysis (2): Expression Matrix using Cell Ranger

Cell Ranger is an idiot.hardwareYou just need to provide the original fastq file and it will return the feature-barcodeExpression matrix. Why not say gene-cell, for example, cell hashing data to get the matrix and tag rows, and columns can not be sure is a cell, may take into account this is not called gene-cell matrix it ~ it is 10xgenomics to provide the official comparison of quantitative software, there are four sub-commands, I've only ever used the cellranger count, the other three cellranger mkfastq, cellranger aggr, cellranger reanalyze did not use, there is no effect.
downloading：https://support./single-cell-gene-expression/software/downloads/latest
mounting：https://support./single-cell-gene-expression/software/pipelines/latest/installation

Before we get into the use of Cell Ranger, let's take a look at what 10X single-cell data looks like

This is the sequencing data for a sample of 5 Lines, and may be only one Line if there is enough data. as you can see, their naming format is relatively standardized, and you should try not to change the naming yourself after receiving the data from the company. One more detail to note is that the directory where these fastq files are stored should be underlined with the first underscore_Otherwise, the subsequent cell ranger will not recognize the files in the directory and will report an error.

[error] Unable to detect the chemistry for the following dataset. 
Please validate it and/or specify the chemistry 
via the --chemistry argument.

It's not really the -chemistry parameter.
To understand the contents of the document more clearly, let's look at the sequencing schematic for a 10X single cell

The Read1 sequence was originally attached to the magnetic beads, there are cellular barcode (same on all beads), there are UMI (different), and there is poly-T. Read2 is the RNA from the cell, and after the two of them are attached to the complementary pairing, there will be a sample index sequence attached to the other end of Read2. What is the function of this sample index sequence? You can refer to the role of index primers in illumina sequencing:

Simply put, in order to sequence multiple samples in a single sequencing, the sequences originating from a specific sample are followed by a specific index, and then split according to the correspondence after the sequencing. One sample corresponds to 4 indexes:

It's easier to understand what's inside each file when you look at it again, so let's take a Line as an example:

less -S S20191015T1_S6_L001_I1_001. | head -n 8
less -S S20191015T1_S6_L001_R1_001. | head -n 8
less -S S20191015T1_S6_L001_R2_001. | head -n 8

Actually, this index sequence is contained in lines 1, 5, 9... of the file, which is a bit redundant and generally not too concerned about it. This file has up to four sequences, interested partners can take a look.

Inside the R1 file is the cellular barcode information, and the extra sequences have been removed. v2 reagent base length for 10X is 26 and v3 reagent base length is 28.

The last file is the cDNA sequence corresponding to the real transcripts
The previous article talked about cell hashing sequencing has transcript information, the file obtained is the same as above; there is also a cell surface protein information, according to this protein information to distinguish the cell source, as follows:

As you can see from the figure, it is almost the same as the normal transcript building library, that is, the part of R2 is replaced with HTO sequence, and the whole fragment length is also changed.

The two graphs above are two cell hashing sequencing that I have seen in actual processing, the first is TotalSeqA and the second is TotalSeqB. In TotalSeqA, the first base of R2 starts with the HTO sequence (followed by the polyA sequence), whereas in TotalSeqB, the first 10 bases of R2 are any bases of N, and the 11th base is the start position of the HTO sequence, and the length of the HTO sequence is 16.

To summarize, there are two sets of sequencing data for cell hashing, one for regular transcript fastq, and one for protein information (also can be said to be sample information.) So to deal with this kind of data, you have to confirm with the sequencing company to make sure whether TotalSeqA or B is used and the correspondence between the sample and the HTO sequence.

Next, let's talk about how to use Cell Ranger to process normal 10X single-cell sequencing data, as well as cell hashing single-cell sequencing data.
Regular 10X

indir=/project_2019_11/data/S20191015T1
outdir=/project_2019_11/cellranger/
sample=S20191015T1
ncells=5000 #expected number of cells, this parameter doesn't have a big impact on the final number of cells you can get, so don't bother.
threads=20

refpath=/ref/10x/human/refdata-cellranger-GRCh38-3.0.0
cellranger=/softwore/bin/cellranger
cd ${outdir}
${cellranger} count --id=${sample} \
                 ---transcriptome=${refpath} \
                 --fastqs=${indir} \\
                 ---sample=${sample} \\
                 --expect-cells=${ncells} \
                 --localcores=${threads}

cell hashing

total_seq_A
Two folders need to be prepared in advance, for example, I use total_seq_A or total_seq_B to store the correspondence between HTO sequences and sample sources:

$ ls
feature.
$ cat feature.
id,name,read,pattern,sequence,feature_type
tag1,tag1,R2,^(BC),GTCAACTCTTTAGCG,Antibody Capture
tag2,tag2,R2,^(BC),TGATGGCCTATTGGG,Antibody Capture

tag1, tag2 corresponds to which sample is known in advance; ^(BC) can be seen as a regular expression indicating that the R2 sequence starts with a barcode (aka HTO sequence)
total_seq_B

$ ls

$ cat 
id,name,read,pattern,sequence,feature_type
tag6,tag6,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,GGTTGCCAGATGTCA,Antibody Capture
tag7,tag7,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,TGTCTTTCCTGCCAG,Antibody Capture

5PNNNNNNNNNN(BC)NNNNNNNNNdenote5beginning，10One base later it'sHTOsequences，后面的sequences随意
lib_csv
The second folder, lib_csv, is used to store the paths of the two sets of cell hashing data in csv format, with the column sample as the folder name.

$ cat 
fastqs,sample,library_type
/project_2019_11/data/fastq/,S20200612P1320200702N,Gene Expression
/project_2019_11/data/antibody_barcode/,S20200612P13F20200702N,Antibody Capture

The final script is as follows

lib_dir=/script/cellranger/1/lib_csv/
#need to be changed based on your seq-tech: total_seq_A or total_seq_B
feature_ref_dir=/script/cellranger/1/total_seq_A/
outdir=/project_2019_11/cellranger/
sample=S20191017P11
ncells=5000
threads=20
refpath=/ref/10x/human/refdata-cellranger-GRCh38-3.0.0
cellranger=/softwore/bin/cellranger

cd ${outdir}
${cellranger} count --libraries=${lib_dir}${sample}. \
        --r1-length=28 \
        --feature-ref=${feature_ref_dir}feature. \
        --transcriptome=${refpath} \
        --localcores=${threads} \
        --expect-cells=${ncells} \
        --id=${sample}

The final expression matrix is output to the

${outdir}${sample_id}/outs/filtered_feature_bc_matrix

$ cd S20200619P11120200716NC/outs/filtered_feature_bc_matrix/
$ ls
    

$ less -S 
ENSG00000243485	MIR1302-2HG	Gene Expression
ENSG00000237613	FAM138A	Gene Expression
......
ENSG00000277475	AC213203.1	Gene Expression
ENSG00000268674	FAM231C	Gene Expression
tag7	tag7	Antibody Capture
tag8	tag8	Antibody Capture

Stored is the gene information, because it is cell hashing data, the matrix at the end of a few more rows of tag information, a total of 33540 rows

$ less -S  | head -n 4
AAACCCAAGACTTAAG-1
AAACCCAAGCTACTGT-1
AAACCCAAGGACTGGT-1
AAACCCAAGGCCTGCT-1

Stored is the final cellular barcode, which is 10139 lines long.

$ less -S  | head -n 8
%%MatrixMarket matrix coordinate integer general
%metadata_json: {"format_version": 2, "software_version": "3.1.0"}
33540 10139 15746600
65 1 1
103 1 1
155 1 2
179 1 2
191 1 1

is the matrix information, except for the first three rows, the remaining rows are equal to the FEATURES multiplied by the number of CBs, and the second column indicates the CB number from 1 to 10139, 1 repeated 33540 times, corresponding to the 33540 FEATURES in the first column. the third column indicates the UMI
The following script converts these three files into the common matrix form

path1=/softwore/biosoft/cellranger-3.1.0/cellranger
path2=/project_2019_11/cellranger/

i=S20191211P71
${path1} mat2csv ${path2}${i}/outs/filtered_feature_bc_matrix ${path2}Feature_Barcode_Matrices/${i}.
sed 's/,/\t/g' ${path2}Feature_Barcode_Matrices/${i}.  > ${path2}Feature_Barcode_Matrices/${i}.
sed -i 's/^\t//g' ${path2}Feature_Barcode_Matrices/${i}.
rm -f ${path2}Feature_Barcode_Matrices/${i}.