2024-06-16

Xiaoguo Shows You How to Use ScType, a Single-Cell Annotation Tool

师妹生信果 2024-02-22 19:00:08

Reposted from the WeChat official account: R语言学徒
http://mp.weixin.qq.com/s?__biz=Mzg5MDk3Mzg4OA==&mid=2247490154&idx=1&sn=78fd173d88b6f8dfaab8f8a409948dff

Introduction

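Are you still struggling to find the right tool for annotating your single-cell data? Don't worry: Xiaoguo will walk you through annotating your single-cell clusters step by step. Let's get started!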

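ScType is an automated cell-type identification tool. You provide your scRNA-seq data, and it annotates cell types quickly using the comprehensive ScType marker database, with no manual curation required. This speeds up the analysis and, according to its authors, gives highly accurate results. Unlike many other methods, ScType is unsupervised: it relies on how specific each marker gene is across cell clusters and cell types, identifying highly specific positive and negative marker genes that serve as evidence for a given cell-type annotation.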

No more tedious manual annotation: ScType opens up new possibilities for your single-cell data analysis. Enough talk; follow along with Xiaoguo and annotate your own single-cell data!

ScType interactive web platform

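The ScType interactive web platform is available at http://sctype.app. It provides a complete pipeline for single-cell RNA-seq data analysis, including data processing, normalization, clustering, and cell-type annotation. The steps are as follows: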

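(1) Click the Upload tab to upload a dataset. For details, see the technical documentation (http://sctype.app).

(2) Select your scRNA-seq data file or drag it onto the blue area. ScType accepts several input formats for scRNA-seq data.

(3) Specify the analysis options.

(4) Once the data is uploaded and the analysis options are set, click the Start Analysis button.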

ScType workflow in code

# Set the working directory; change this to your own path!
# (forward slashes avoid the invalid "\w" escape in "D:\work")
setwd("D:/work")

# Load the R packages
lapply(c("dplyr", "Seurat", "HGNChelper"), library, character.only = TRUE)

# Data download: https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "./filtered_gene_bc_matrices/hg19/")

# Initialize the Seurat object with the raw (non-normalized) data
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

# QC metric: percentage of mitochondrial counts per cell
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Optional filtering based on QC metric visualizations, see the Seurat tutorial:
# https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
# pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

# Normalize the data and find variable features
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# Scale the data and run PCA
pbmc <- ScaleData(pbmc, features = rownames(pbmc))
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

# Check the number of PCs (based on the elbow plot, we use 10 PCs downstream)
ElbowPlot(pbmc)

# Cluster and visualize
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.8)
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")
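To make the "LogNormalize" step less of a black box, here is a minimal base-R sketch of the arithmetic it is documented to perform: each cell's counts are divided by that cell's total, multiplied by the scale factor, and passed through log1p. The toy matrix and the variable names (counts, scale_factor, lognorm) are invented for illustration; this is not Seurat's implementation, just the same formula on a tiny example.

```r
# Toy raw count matrix: genes in rows, cells in columns (values invented).
counts <- matrix(c(10,  0,
                   30,  5,
                   60, 15),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))

scale_factor <- 10000

# Divide each column (cell) by its total count, multiply by the scale factor,
# then apply log1p, i.e. log(1 + x).
lognorm <- log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)
print(round(lognorm, 3))
```

Zero counts stay at zero after the transform, which is one reason log1p is used rather than a plain log.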

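## Cell-type annotation
# Assign cell types automatically with ScType. First, load two helper functions.
# Load the gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# Load the cell-type scoring function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")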

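Next, let's prepare the gene sets from the input cell marker file.

By default we use the built-in cell marker database, but feel free to use your own data.

Just prepare an input XLSX file in the same format as our database file.

The database file should contain four columns: tissueType (tissue type), cellName (cell type), geneSymbolmore1 (positive marker genes), and geneSymbolmore2 (marker genes not expected to be expressed by the cell type).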
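As a rough illustration of that four-column layout, the sketch below builds a tiny marker table in base R. The two rows and their marker genes are only examples chosen for illustration (not taken from the ScType database), custom_db is a hypothetical name, and the comma-separated gene lists are an assumption about the file format; check an actual ScType database file before relying on it.

```r
# Hypothetical custom marker table with the four columns described above.
# Rows and gene lists are illustrative only.
custom_db <- data.frame(
  tissueType      = c("Immune system", "Immune system"),
  cellName        = c("T cells", "B cells"),
  geneSymbolmore1 = c("CD3D,CD3E", "MS4A1,CD79A"),  # positive markers
  geneSymbolmore2 = c("MS4A1", "CD3D"),             # markers not expected
  stringsAsFactors = FALSE
)

# Split a comma-separated marker string back into a character vector.
positive_markers <- strsplit(custom_db$geneSymbolmore1, ",")
names(positive_markers) <- custom_db$cellName
print(positive_markers)
```

A table like this would still need to be saved as an XLSX file (for example with a spreadsheet program) before it can be passed on as a database file.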

Finally, let's assign a cell type to each cluster.

# DB file
db_ = "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";
tissue = "Immune system" # e.g. Immune system,Pancreas,Liver,Eye,Kidney,Brain,Lung,Adrenal,Heart,Intestine,Muscle,Placenta,Spleen,Stomach,Thymus

# Prepare gene sets
gs_list = gene_sets_prepare(db_, tissue)

# Get the cell-type by cell score matrix
# (the @scale.data accessor assumes a Seurat v4-style assay; Seurat v5 assays
#  store the scaled matrix differently)
es.max = sctype_score(scRNAseqData = pbmc[["RNA"]]@scale.data, scaled = TRUE,
                      gs = gs_list$gs_positive, gs2 = gs_list$gs_negative)

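Note: the scRNAseqData argument should correspond to your input scRNA-seq matrix.

With Seurat, this is pbmc[["RNA"]]@scale.data (the default); if sctransform was used for normalization, it is pbmc[["SCT"]]@scale.data; and when several single-cell datasets are analyzed jointly, it is pbmc[["integrated"]]@scale.data.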

# Merge by cluster
cL_resutls = do.call("rbind", lapply(unique(pbmc@meta.data$seurat_clusters), function(cl){
  es.max.cl = sort(rowSums(es.max[ ,rownames(pbmc@meta.data[pbmc@meta.data$seurat_clusters==cl, ])]), decreasing = TRUE)
  head(data.frame(cluster = cl, type = names(es.max.cl), scores = es.max.cl, ncells = sum(pbmc@meta.data$seurat_clusters==cl)), 10)
}))
sctype_scores = cL_resutls %>% group_by(cluster) %>% top_n(n = 1, wt = scores)

# Set low-confidence clusters (low ScType score) to "Unknown"
sctype_scores$type[as.numeric(as.character(sctype_scores$scores)) < sctype_scores$ncells/4] = "Unknown"
print(sctype_scores[,1:3])
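The per-cluster aggregation above is easier to follow on a toy example. The sketch below uses an invented 2 x 5 score matrix and invented cluster labels (es, clusters, and toy_calls are hypothetical names): it sums the per-cell scores within each cluster, keeps the top-scoring cell type, and applies the same "score below ncells/4 becomes Unknown" rule. It needs only base R.

```r
# Toy cell-type-by-cell score matrix (values invented for illustration).
es <- matrix(c(3, 2, 0.1, 0.2, 0.1,
               0, 1, 0.2, 0.1, 0.3),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("T cells", "B cells"), paste0("cell", 1:5)))

# Toy cluster assignment: one cluster id per cell.
clusters <- c(cell1 = "0", cell2 = "0", cell3 = "1", cell4 = "1", cell5 = "1")

# For each cluster, sum the scores over its cells and keep the best cell type.
toy_calls <- do.call(rbind, lapply(unique(clusters), function(cl) {
  cells  <- names(clusters)[clusters == cl]
  totals <- sort(rowSums(es[, cells, drop = FALSE]), decreasing = TRUE)
  data.frame(cluster = cl, type = names(totals)[1],
             score = unname(totals[1]), ncells = length(cells),
             stringsAsFactors = FALSE)
}))

# Low-confidence rule: a summed score below ncells / 4 is labelled "Unknown".
toy_calls$type[toy_calls$score < toy_calls$ncells / 4] <- "Unknown"
print(toy_calls)
```

Here cluster 0 keeps its "T cells" label, while cluster 1 has a best summed score of about 0.6 against a threshold of 0.75 and is therefore set to "Unknown".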

Results

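The sctype_score function accepts positive and negative markers through the gs and gs2 arguments. If you have no negative markers (i.e. markers that provide evidence against a cell being of a specific cell type), simply set the gs2 argument to NULL (gs2 = NULL).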
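To give a feel for how positive and negative markers can pull a score in opposite directions, here is a deliberately simplified base-R sketch. It is not the sctype_score implementation (which, among other things, also weights markers by their specificity): toy_marker_score is a hypothetical helper that adds up scaled expression of positive markers, subtracts that of negative markers, and divides each sum by the square root of the number of markers. The expression values are invented.

```r
# Toy scaled expression matrix: genes in rows, cells in columns (invented).
z <- matrix(c( 2.0, -1.0,
               1.5, -0.5,
              -1.0,  2.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("CD3D", "CD3E", "MS4A1"), c("cell1", "cell2")))

# Hypothetical helper, NOT the ScType implementation.
toy_marker_score <- function(z, pos, neg = NULL) {
  pos <- intersect(pos, rownames(z))
  if (length(pos) == 0) return(setNames(rep(0, ncol(z)), colnames(z)))
  score <- colSums(z[pos, , drop = FALSE]) / sqrt(length(pos))
  neg <- intersect(neg, rownames(z))
  if (length(neg) > 0) {
    # Negative markers count against the cell type.
    score <- score - colSums(z[neg, , drop = FALSE]) / sqrt(length(neg))
  }
  score
}

# "T cell"-like score with and without a negative marker.
with_neg    <- toy_marker_score(z, pos = c("CD3D", "CD3E"), neg = "MS4A1")
without_neg <- toy_marker_score(z, pos = c("CD3D", "CD3E"), neg = NULL)
print(rbind(with_neg, without_neg))
```

Passing neg = NULL mirrors the gs2 = NULL case: only the positive markers contribute.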

## Visualization
# Overlay the identified cell types on the UMAP plot
pbmc@meta.data$customclassif = ""
for(j in unique(sctype_scores$cluster)){
  cl_type = sctype_scores[sctype_scores$cluster==j,]
  pbmc@meta.data$customclassif[pbmc@meta.data$seurat_clusters == j] = as.character(cl_type$type[1])
}

DimPlot(pbmc, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'customclassif')
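The loop above copies each cluster's label onto its cells. The same lookup can be sketched without a loop using match(); the toy vectors below (cell_clusters, toy_scores, cell_labels) are invented for illustration and stand in for the Seurat metadata and the score table.

```r
# Toy inputs (invented): one cluster id per cell, and one label per cluster.
cell_clusters <- c("0", "1", "0", "2")
toy_scores <- data.frame(cluster = c("0", "1", "2"),
                         type    = c("T cells", "B cells", "Unknown"),
                         stringsAsFactors = FALSE)

# match() returns, for each cell, the row of its cluster in toy_scores
# (the first matching row if a cluster appears more than once).
cell_labels <- toy_scores$type[match(cell_clusters, toy_scores$cluster)]
print(cell_labels)
```

Taking the first matching row plays the same role as cl_type$type[1] in the loop, which matters when two cell types tie for the top score in a cluster.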

In addition, we can draw a bubble plot showing all the cell types that ScType considered when annotating each cluster.

# Load the R packages
lapply(c("ggraph","igraph","tidyverse", "data.tree"), library, character.only = TRUE)

# Prepare the edges (cluster -> candidate cell type)
cL_resutls = cL_resutls[order(cL_resutls$cluster),]
edges = cL_resutls
edges$type = paste0(edges$type,"_",edges$cluster)
edges$cluster = paste0("cluster ", edges$cluster)
edges = edges[,c("cluster", "type")]
colnames(edges) = c("from", "to")
rownames(edges) <- NULL

# Prepare the nodes: level 1 = clusters, level 2 = candidate cell types
nodes_lvl1 = sctype_scores[,c("cluster", "ncells")]
nodes_lvl1$cluster = paste0("cluster ", nodes_lvl1$cluster)
nodes_lvl1$Colour = "#f1f1ef"
nodes_lvl1$ord = 1
nodes_lvl1$realname = nodes_lvl1$cluster
nodes_lvl1 = as.data.frame(nodes_lvl1)
nodes_lvl2 = c()
ccolss = c("#5f75ae","#92bbb8","#64a841","#e5486e","#de8e06","#eccf5a","#b5aa0f","#e4b680","#7ba39d","#b15928","#ffff99", "#6a3d9a","#cab2d6","#ff7f00","#fdbf6f","#e31a1c","#fb9a99","#33a02c","#b2df8a","#1f78b4","#a6cee3")
for (i in 1:length(unique(cL_resutls$cluster))){
  dt_tmp = cL_resutls[cL_resutls$cluster == unique(cL_resutls$cluster)[i], ]
  nodes_lvl2 = rbind(nodes_lvl2, data.frame(cluster = paste0(dt_tmp$type,"_",dt_tmp$cluster), ncells = dt_tmp$scores, Colour = ccolss[i], ord = 2, realname = dt_tmp$type))
}
nodes = rbind(nodes_lvl1, nodes_lvl2)
nodes$ncells[nodes$ncells<1] = 1
files_db = openxlsx::read.xlsx(db_)[,c("cellName","shortName")]
files_db = unique(files_db)
nodes = merge(nodes, files_db, all.x = TRUE, all.y = FALSE, by.x = "realname", by.y = "cellName", sort = FALSE)
nodes$shortName[is.na(nodes$shortName)] = nodes$realname[is.na(nodes$shortName)]
nodes = nodes[,c("cluster", "ncells", "Colour", "ord", "shortName", "realname")]

# Build the graph and draw the circle-packing bubble plot
mygraph <- graph_from_data_frame(edges, vertices=nodes)
gggr <- ggraph(mygraph, layout = 'circlepack', weight=I(ncells)) +
  geom_node_circle(aes(filter=ord==1, fill=I("#F5F5F5"), colour=I("#D3D3D3")), alpha=0.9) +
  geom_node_circle(aes(filter=ord==2, fill=I(Colour), colour=I("#D3D3D3")), alpha=0.9) +
  theme_void() +
  geom_node_text(aes(filter=ord==2, label=shortName, colour=I("#ffffff"), fill="white", repel = FALSE, parse = TRUE, size = I(log(ncells,25)*1.5))) +
  geom_node_label(aes(filter=ord==1, label=shortName, colour=I("#000000"), size = I(3), fill="white", parse = TRUE), repel = TRUE, segment.linetype="dotted")

# Show the UMAP and the bubble plot side by side
scater::multiplot(DimPlot(pbmc, reduction = "umap", label = TRUE, repel = TRUE, cols = ccolss), gggr, cols = 2)

Summary

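Cell-type annotation is essential to single-cell data analysis, but earlier approaches were often tedious and time-consuming. ScType offers an efficient and accurate alternative: you provide the data, and ScType returns cell-type annotations, which can speed up your research considerably. Follow Xiaoguo's steps and give it a try; ScType will help you explore the cellular diversity in your single-cell data.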

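With ScType you can unlock your research potential and explore the fascinating world of cell types. If your coding skills are still shaky, don't worry about not being able to annotate single-cell data: try our zero-code bioinformatics analysis platform. Here is the address:

http://www.biocloudservice.com/home.html

Whatever analysis you need, Xiaoguo can provide it!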

Reference

Ianevski A, Giri AK, Aittokallio T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun. 2022;13(1):1246.
