一、创建 Seurat 对象 使用的示例数据集来自10X Genome 测序的 Peripheral Blood Mononuclear Cells (PBMC)。
下载链接:https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
library( dplyr) library( Seurat) pbmc.data <- Read10X( data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/" ) pbmc <- CreateSeuratObject( counts = pbmc.data, project = "pbmc3k" , min.cells = 3 , min.features = 200 library( dplyr) library( Seurat) pbmc.data <- Read10X( data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/" ) pbmc <- CreateSeuratObject( counts = pbmc.data, project = "pbmc3k" , min.cells = 3 , min.features = 200 ) pbmc
二、标准预处理流程 流程包括:
基于质控指标(QC metric)来筛选细胞
数据归一化和缩放
高异质性基因检测
1.基因质控指标来筛选细胞 质控指标:
每个细胞中检测到的基因数
低质量的细胞和空油滴(droplet)只有少量基因
两个及以上的细胞会有异常的高基因数
每个细胞中的UMI总数(与上类似)
线粒体基因组的reads比例
pbmc[[ "percent.mt" ] ] <- PercentageFeatureSet( pbmc, pattern = "^MT-" ) VlnPlot( pbmc, features = c ( "nFeature_RNA" , "nCount_RNA" , "percent.mt" ) , ncol = 3 ) head( pbmc@ meta.data, 5 pbmc[[ "percent.mt" ] ] <- PercentageFeatureSet( pbmc, pattern = "^MT-" ) VlnPlot( pbmc, features = c ( "nFeature_RNA" , "nCount_RNA" , "percent.mt" ) , ncol = 3 ) head( pbmc@ meta.data, 5 )
通过上图,过滤标准设定为:
过滤UMI数大于2500,小于200的细胞
过滤线粒体百分比大于5%的细胞
查看特征与特征间的相关性
plot1 <- FeatureScatter( pbmc, feature1 = "nCount_RNA" , feature2 = "percent.mt" plot1 <- FeatureScatter( pbmc, feature1 = "nCount_RNA" , feature2 = "percent.mt" )
plot2 <- FeatureScatter( pbmc, feature1 = "nCount_RNA" , feature2 = "nFeature_RNA" plot2 <- FeatureScatter( pbmc, feature1 = "nCount_RNA" , feature2 = "nFeature_RNA" )
过滤
pbmc <- subset( pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5 pbmc <- subset( pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5 )
看看相关性
p1 <- FeatureScatter( pbmc, feature1 = "nCount_RNA" , feature2 = "percent.mt" ) p2 <- FeatureScatter( pbmc, feature1 = "nCount_RNA" , feature2 = "nFeature_RNA" ) CombinePlots( plots = list ( p1, p2) p1 <- FeatureScatter( pbmc, feature1 = "nCount_RNA" , feature2 = "percent.mt" ) p2 <- FeatureScatter( pbmc, feature1 = "nCount_RNA" , feature2 = "nFeature_RNA" ) CombinePlots( plots = list ( p1, p2) )
2.归一化数据 pbmc <- NormalizeData( pbmc, normalization.method = "LogNormalize" , scale.factor = 10000 pbmc <- NormalizeData( pbmc, normalization.method = "LogNormalize" , scale.factor = 10000 )
LogNormalize that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in pbmc[["RNA"]]@data
.
上述代码可以替换为:pbmc <- NormalizeData(pbmc)
3.识别高异质性特征 高异质性:这些特征在有的细胞中高表达,有的细胞中低表达。在下游分析中关注这些基因有助于找到单细胞数据集中的生物信号[https://www.nature.com/articles/nmeth.2645 ]
pbmc <- FindVariableFeatures( pbmc, selection.method = "vst" , nfeatures = 2000 ) top10 <- head( VariableFeatures( pbmc) , 10 ) plot1 <- VariableFeaturePlot( pbmc) plot2 <- LabelPoints( plot = plot1, points = top10, repel = TRUE ) CombinePlots( plots = list ( plot1, plot2) pbmc <- FindVariableFeatures( pbmc, selection.method = "vst" , nfeatures = 2000 ) top10 <- head( VariableFeatures( pbmc) , 10 ) plot1 <- VariableFeaturePlot( pbmc) plot2 <- LabelPoints( plot = plot1, points = top10, repel = TRUE ) CombinePlots( plots = list ( plot1, plot2) )
4.缩放数据 这是在PCA等降维操作前的一个步骤,ScaleData
函数:
转换每个基因的表达值,使每个细胞的平均表达值为0
转换每个基因的表达值,使细胞间方差为1
此步骤在下游分析中具有相同的权重,因此高表达的基因不会占主导地位
all.genes <- rownames( pbmc) pbmc <- ScaleData( pbmc, features = all.genes) head( pbmc[[ "RNA" ] ] @ scale.data, 5 all.genes <- rownames( pbmc) pbmc <- ScaleData( pbmc, features = all.genes) head( pbmc[[ "RNA" ] ] @ scale.data, 5 )
5.线性维度约化 PCA pbmc <- RunPCA( pbmc, features = VariableFeatures( object = pbmc) pbmc <- RunPCA( pbmc, features = VariableFeatures( object = pbmc) )
可视化细胞与特征间的PCA有三种方式:
VizDimLoadings print( pbmc[[ "pca" ] ] , dims = 1 : 5 , nfeatures = 5 ) VizDimLoadings( pbmc, dims = 1 : 2 , reduction = "pca" print( pbmc[[ "pca" ] ] , dims = 1 : 5 , nfeatures = 5 ) VizDimLoadings( pbmc, dims = 1 : 2 , reduction = "pca" )
DimPlot DimPlot( pbmc, reduction = "pca" DimPlot( pbmc, reduction = "pca" )
DimHeatmap DimHeatmap( pbmc, dims = 1 , cells = 500 , balanced = TRUE DimHeatmap( pbmc, dims = 1 , cells = 500 , balanced = TRUE )
主要用来查看数据集中的异质性的主要来源,并且可以确定哪些PC维度可以用于下一步的下游分析。
细胞和特征根据PCA分数来排序
DimHeatmap( pbmc, dims = 1 : 15 , cells = 500 , balanced = TRUE DimHeatmap( pbmc, dims = 1 : 15 , cells = 500 , balanced = TRUE )
5.确定数据集的维度 为了克服在单细胞数据中在单个特征中的技术噪音,Seurat 聚类细胞是基于PCA分数的。每个PC代表着一个‘元特征’(带有跨相关特征集的信息)。因此,最主要的主成分代表了压缩的数据集。问题是要选多少PC呢?
方法一:JackStrawPlot 作者受JackStraw procedure 启发。随机置换数据的一部分子集(默认1%)再运行PCA,构建了一个’null distribution’的特征分数,重复这一步。最终会识别出低P-value特征的显著PCs
pbmc <- JackStraw( pbmc, num.replicate = 100 ) pbmc <- ScoreJackStraw( pbmc, dims = 1 : 20 ) JackStrawPlot( pbmc, dims = 1 : 15 pbmc <- JackStraw( pbmc, num.replicate = 100 ) pbmc <- ScoreJackStraw( pbmc, dims = 1 : 20 ) JackStrawPlot( pbmc, dims = 1 : 15 )
In this case it appears that there is a sharp drop-off in significance after the first 10-12 PCs
在上图中展示出在前10到12台PC之后,重要性显著下降
方法二:ElbowPlot “ElbowPlot”:基于每个分量所解释的方差百分比对主要成分进行排名。 在此示例中,我们可以在PC9-10周围观察到“elbow ”,这表明大多数真实信号是在前10台PC中捕获的。
ElbowPlot( pbmcElbowPlot( pbmc)
为了识别出数据的真实维度,有三种方法:
用更加受监督的方法来确定PCs的异质性,比如可以结合GSEA来分析( The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example )
The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff.
The third is a heuristic that is commonly used, and can be calculated instantly.
在这个例子中三种方法均产生了相似的结果,以PC 7-12作为阈值。
这个例子中,作者选择10,但是实际过程中还要考虑:
树突状细胞和NK细胞可能在PCs12和13中识别,这可能定义了罕见的免疫亚群(比如,MZB1是浆细胞样的er)。但是除非有一定的知识量,否则很难从背景噪音中发现。
用户可以选择不同的PCs再进行下游分析,比如选10,15,50等。结果常常有很多的不同。
建议在选择该参数时候,尽量偏高一点。如果仅仅使用5PCs会对下游分析产生不利影响
6.聚类细胞 pbmc <- FindNeighbors( pbmc, dims = 1 : 10 ) pbmc <- FindClusters( pbmc, resolution = 0.5 ) head( Idents( pbmc) , 5 pbmc <- FindNeighbors( pbmc, dims = 1 : 10 ) pbmc <- FindClusters( pbmc, resolution = 0.5 ) head( Idents( pbmc) , 5 )
7.非线性维度约化(UMAP/TSNE) pbmc <- RunUMAP( pbmc, dims = 1 : 10 ) DimPlot( pbmc, reduction = "umap" ) DimPlot( pbmc, reduction = "umap" , label = TRUE pbmc <- RunUMAP( pbmc, dims = 1 : 10 ) DimPlot( pbmc, reduction = "umap" ) DimPlot( pbmc, reduction = "umap" , label = TRUE )
pbmc <- RunTSNE( pbmc, dims = 1 : 10 ) DimPlot( pbmc, reduction = "tsne" ) DimPlot( pbmc, reduction = "tsne" , label = TRUE pbmc <- RunTSNE( pbmc, dims = 1 : 10 ) DimPlot( pbmc, reduction = "tsne" ) DimPlot( pbmc, reduction = "tsne" , label = TRUE )
8.发现差异表达特征(cluster bioers) cluster1.markers <- FindMarkers( pbmc, ident.1 = 1 , min.pct = 0.25 ) head( cluster1.markers, n = 5 ) cluster5.markers <- FindMarkers( pbmc, ident.1 = 5 , ident.2 = c ( 0 , 3 ) , min.pct = 0.25 ) head( cluster5.markers, n = 5 ) pbmc.markers <- FindAllMarkers( pbmc, only.pos = TRUE , min.pct = 0.25 , logfc.threshold = 0.25 ) pbmc.markers %>% group_by( cluster) %>% top_n( n = 2 , wt = avg_logFC) cluster1.markers <- FindMarkers( pbmc, ident.1 = 0 , logfc.threshold = 0.25 , test.use = "roc" , only.pos = TRUE cluster1.markers <- FindMarkers( pbmc, ident.1 = 1 , min.pct = 0.25 ) head( cluster1.markers, n = 5 ) cluster5.markers <- FindMarkers( pbmc, ident.1 = 5 , ident.2 = c ( 0 , 3 ) , min.pct = 0.25 ) head( cluster5.markers, n = 5 ) pbmc.markers <- FindAllMarkers( pbmc, only.pos = TRUE , min.pct = 0.25 , logfc.threshold = 0.25 ) pbmc.markers %>% group_by( cluster) %>% top_n( n = 2 , wt = avg_logFC) cluster1.markers <- FindMarkers( pbmc, ident.1 = 0 , logfc.threshold = 0.25 , test.use = "roc" , only.pos = TRUE )
可视化
VlnPlot( pbmc, features = c ( "MS4A1" , "CD79A" ) VlnPlot( pbmc, features = c ( "MS4A1" , "CD79A" ) )
VlnPlot( pbmc, features = c ( "NKG7" , "PF4" ) , slot = "counts" , log = TRUE VlnPlot( pbmc, features = c ( "NKG7" , "PF4" ) , slot = "counts" , log = TRUE )
FeaturePlot( pbmc, features = c ( "MS4A1" , "GNLY" , "CD3E" , "CD14" , "FCER1A" , "FCGR3A" , "LYZ" , "PPBP" , "CD8A" ) FeaturePlot( pbmc, features = c ( "MS4A1" , "GNLY" , "CD3E" , "CD14" , "FCER1A" , "FCGR3A" , "LYZ" , "PPBP" , "CD8A" ) )
RidgePlot( pbmc, features = c ( "MS4A1" , "CD79A" ) RidgePlot( pbmc, features = c ( "MS4A1" , "CD79A" ) )
DotPlot( pbmc, features = c ( "MS4A1" , "CD79A" ) DotPlot( pbmc, features = c ( "MS4A1" , "CD79A" ) )
top10 <- pbmc.ers %>% group_by( cluster) %>% top_n( n = 10 , wt = avg_logFC) DoHeatmap( pbmc, features = top10$ gene) + NoLegend( top10 <- pbmc.ers %>% group_by( cluster) %>% top_n( n = 10 , wt = avg_logFC) DoHeatmap( pbmc, features = top10$ gene) + NoLegend( )
9.识别细胞类型 在这个数据集的情况下,我们可以使用 canonical markers 轻松地将无偏聚类与已知的细胞类型相匹配。
Cluster ID
Markers
Cell Type
0
IL7R, CCR7
Naive CD4+ T
1
IL7R, S100A4
Memory CD4+
2
CD14, LYZ
CD14+ Mono
3
MS4A1
B
4
CD8A
CD8+ T
5
FCGR3A, MS4A7
FCGR3A+ Mono
6
GNLY, NKG7
NK
7
FCER1A, CST3
DC
8
PPBP
Platelet
new.cluster.ids <- c ( "Naive CD4 T" , "Memory CD4 T" , "CD14+ Mono" , "B" , "CD8 T" , "FCGR3A+ Mono" , "NK" , "DC" , "Platelet" ) names ( new.cluster.ids) <- levels( pbmc) pbmc <- RenameIdents( pbmc, new.cluster.ids) DimPlot( pbmc, reduction = "umap" , label = TRUE , pt.size = 0.5 ) + NoLegend( new.cluster.ids <- c ( "Naive CD4 T" , "Memory CD4 T" , "CD14+ Mono" , "B" , "CD8 T" , "FCGR3A+ Mono" , "NK" , "DC" , "Platelet" ) names ( new.cluster.ids) <- levels( pbmc) pbmc <- RenameIdents( pbmc, new.cluster.ids) DimPlot( pbmc, reduction = "umap" , label = TRUE , pt.size = 0.5 ) + NoLegend( )