Seurat 包图文详解 | 单细胞转录组(scRNA-seq)分析02

一、创建 Seurat 对象

使用的示例数据集来自10X Genome 测序的 Peripheral Blood Mononuclear Cells (PBMC)。

下载链接：https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

library(dplyr)
library(Seurat)

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200library(dplyr)
library(Seurat)

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc

二、标准预处理流程

流程包括：

基于质控指标（QC metric）来筛选细胞
数据归一化和缩放
高异质性基因检测

1.基因质控指标来筛选细胞

质控指标：

每个细胞中检测到的基因数
- 低质量的细胞和空油滴（droplet）只有少量基因
- 两个及以上的细胞会有异常的高基因数
每个细胞中的UMI总数（与上类似）
线粒体基因组的reads比例
- 低质量或死细胞会有大百分比的线粒体基因组
- 使用PercentageFeatureSet函数来计数线粒体质控指标
- MT-是线粒体基因

# 计算线粒体read的百分比
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# 显示前5个细胞的质控指标
head(pbmc@meta.data, 5# 计算线粒体read的百分比
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# 显示前5个细胞的质控指标
head(pbmc@meta.data, 5)

通过上图，过滤标准设定为：

过滤UMI数大于2500，小于200的细胞
过滤线粒体百分比大于5%的细胞

查看特征与特征间的相关性

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt"plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")

plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA"plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

过滤

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

看看相关性

p1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
p2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
CombinePlots(plots = list(p1, p2)p1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
p2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
CombinePlots(plots = list(p1, p2))

2.归一化数据

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

LogNormalize that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in pbmc[["RNA"]]@data.

上述代码可以替换为：pbmc <- NormalizeData(pbmc)

3.识别高异质性特征

高异质性：这些特征在有的细胞中高表达，有的细胞中低表达。在下游分析中关注这些基因有助于找到单细胞数据集中的生物信号[https://www.nature.com/articles/nmeth.2645 ]

# 识别前2000个特征
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# 识别前10的高异质性基因
top10 <- head(VariableFeatures(pbmc), 10)

# 绘图看看
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2)# 识别前2000个特征
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# 识别前10的高异质性基因
top10 <- head(VariableFeatures(pbmc), 10)

# 绘图看看
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2))

4.缩放数据

这是在PCA等降维操作前的一个步骤，ScaleData函数：

转换每个基因的表达值，使每个细胞的平均表达值为0
转换每个基因的表达值，使细胞间方差为1
- 此步骤在下游分析中具有相同的权重，因此高表达的基因不会占主导地位

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
head(pbmc[["RNA"]]@scale.data,5all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
head(pbmc[["RNA"]]@scale.data,5)

5.线性维度约化 PCA

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

可视化细胞与特征间的PCA有三种方式：

VizDimLoadings

print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
# 绘图
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca"print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
# 绘图
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

DimPlot

DimPlot(pbmc, reduction = "pca"DimPlot(pbmc, reduction = "pca")

DimHeatmap

DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUEDimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

主要用来查看数据集中的异质性的主要来源，并且可以确定哪些PC维度可以用于下一步的下游分析。

细胞和特征根据PCA分数来排序

DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUEDimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)

5.确定数据集的维度

为了克服在单细胞数据中在单个特征中的技术噪音，Seurat 聚类细胞是基于PCA分数的。每个PC代表着一个‘元特征’（带有跨相关特征集的信息）。因此，最主要的主成分代表了压缩的数据集。问题是要选多少PC呢？

方法一：JackStrawPlot

作者受JackStraw procedure 启发。随机置换数据的一部分子集（默认1%）再运行PCA，构建了一个’null distribution’的特征分数，重复这一步。最终会识别出低P-value特征的显著PCs

pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
# 绘图看看
JackStrawPlot(pbmc, dims = 1:15pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
# 绘图看看
JackStrawPlot(pbmc, dims = 1:15)

In this case it appears that there is a sharp drop-off in significance after the first 10-12 PCs

在上图中展示出在前10到12台PC之后，重要性显著下降

方法二：ElbowPlot

“ElbowPlot”：基于每个分量所解释的方差百分比对主要成分进行排名。在此示例中，我们可以在PC9-10周围观察到“elbow ”，这表明大多数真实信号是在前10台PC中捕获的。

ElbowPlot(pbmcElbowPlot(pbmc)

为了识别出数据的真实维度，有三种方法：

用更加受监督的方法来确定PCs的异质性，比如可以结合GSEA来分析（ The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example ）
The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff.
The third is a heuristic that is commonly used, and can be calculated instantly.

在这个例子中三种方法均产生了相似的结果，以PC 7-12作为阈值。

这个例子中，作者选择10，但是实际过程中还要考虑：

树突状细胞和NK细胞可能在PCs12和13中识别，这可能定义了罕见的免疫亚群（比如，MZB1是浆细胞样的er）。但是除非有一定的知识量，否则很难从背景噪音中发现。
用户可以选择不同的PCs再进行下游分析，比如选10，15，50等。结果常常有很多的不同。
建议在选择该参数时候，尽量偏高一点。如果仅仅使用5PCs会对下游分析产生不利影响

6.聚类细胞

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
# 查看前5聚类
head(Idents(pbmc), 5pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
# 查看前5聚类
head(Idents(pbmc), 5)

7.非线性维度约化（UMAP/TSNE）

# 使用UMAP聚类
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")
# 显示在聚类标签
DimPlot(pbmc, reduction = "umap", label = TRUE# 使用UMAP聚类
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")
# 显示在聚类标签
DimPlot(pbmc, reduction = "umap", label = TRUE)

# 使用TSNE聚类
pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne")
# 显示在聚类标签
DimPlot(pbmc, reduction = "tsne", label = TRUE# 使用TSNE聚类
pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne")
# 显示在聚类标签
DimPlot(pbmc, reduction = "tsne", label = TRUE)

8.发现差异表达特征（cluster bioers）

# 发现聚类一的所有biomarkers
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)

# 查找将聚类5与聚类0和3区分的所有标记
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)

# 与所有其他细胞相比，找到每个簇的标记，仅报告阳性细胞
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE# 发现聚类一的所有biomarkers
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)

# 查找将聚类5与聚类0和3区分的所有标记
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)

# 与所有其他细胞相比，找到每个簇的标记，仅报告阳性细胞
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)

可视化

# 绘图看看
VlnPlot(pbmc, features = c("MS4A1", "CD79A")# 绘图看看
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# 使用原始count绘制
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE# 使用原始count绘制
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))

mark

RidgePlot(pbmc, features = c("MS4A1", "CD79A")RidgePlot(pbmc, features = c("MS4A1", "CD79A"))

DotPlot(pbmc, features = c("MS4A1", "CD79A")DotPlot(pbmc, features = c("MS4A1", "CD79A"))

top10 <- pbmc.ers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend(top10 <- pbmc.ers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

9.识别细胞类型

在这个数据集的情况下，我们可以使用 canonical markers 轻松地将无偏聚类与已知的细胞类型相匹配。

Cluster ID	Markers	Cell Type
0	IL7R, CCR7	Naive CD4+ T
1	IL7R, S100A4	Memory CD4+
2	CD14, LYZ	CD14+ Mono
3	MS4A1	B
4	CD8A	CD8+ T
5	FCGR3A, MS4A7	FCGR3A+ Mono
6	GNLY, NKG7	NK
7	FCER1A, CST3	DC
8	PPBP	Platelet

new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend(new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()