2024-06-15

Converting Probe IDs in GEO Data

Original · 小果 · 生信果 · 2023-02-24 19:00:36

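When working with GEO data, the rows of the expression matrix are often labeled with probe IDs rather than the gene symbols needed for analysis, so the probe IDs have to be converted first. Below we use the GSE63067 dataset as an example and walk through the conversion.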

#01

Download the GEO data, extract the expression matrix, and check whether its row names are probe IDs.

> library(GEOquery)  # provides getGEO(); also attaches Biobase, which provides exprs()
> eSet <- getGEO("GSE63067",
+                destdir = '.',
+                getGPL = FALSE)
> exp <- exprs(eSet[[1]])
> exp[1:4,1:4]
          GSM1539877 GSM1539878 GSM1539879 GSM1539880
1007_s_at   2.871410   2.852016   2.765939   2.872312
1053_at     2.680588   2.688279   2.527078   2.625285
117_at      2.835742   3.029779   2.774594   2.981834
121_at      2.916225   2.871338   2.852980   2.804634

The row names of this expression matrix are 1007_s_at, 1053_at, 117_at, and so on. These are probe IDs, not gene symbols.
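Instead of inspecting the row names by eye, the check can be scripted. The following is a minimal sketch that assumes Affymetrix-style probe IDs ending in `_at`, as seen in the output above. The helper `looks_like_affy_probe` and the toy vectors are hypothetical names introduced for illustration, and the pattern would not apply to platforms that use a different ID scheme.

```r
# Heuristic: Affymetrix-style probe IDs end in "_at" (e.g. 1007_s_at).
# This is an assumption that holds for the IDs shown above, not a general rule.
looks_like_affy_probe <- function(ids) {
  grepl("_at$", ids)
}

# Toy inputs: the first four row names from the matrix, and some gene symbols
probe_like  <- c("1007_s_at", "1053_at", "117_at", "121_at")
symbol_like <- c("DDR1", "RFC2", "HSPA6", "PAX8")

# Fraction of IDs in each vector that match the probe pattern
frac_probe  <- mean(looks_like_affy_probe(probe_like))
frac_symbol <- mean(looks_like_affy_probe(symbol_like))

cat("probe-like vector:", frac_probe, "\n")
cat("symbol-like vector:", frac_symbol, "\n")
```

On a real matrix, the same function could be applied to `rownames(exp)`. A fraction close to 1 would suggest that the rows are still probe IDs.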

#02

Download the GPL data and extract the probe information with AnnoProbe

> library(AnnoProbe)
> idprob = AnnoProbe::idmap("GPL570",type = 'soft')
trying URL 'http://49.235.27.111/GEOmirror/GPL/GPL570_soft.rda'
Content type 'application/octet-stream' length 295573 bytes (288 KB)
downloaded 288 KB
> head(idprob)
         ID           symbol
1 1007_s_at DDR1 /// MIR4640
2   1053_at             RFC2
3    117_at            HSPA6
4    121_at             PAX8
5 1255_g_at           GUCA1A
6   1294_at MIR5193 /// UBA7

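The output shows that every probe ID in idprob has a corresponding gene symbol, and that the ID column of idprob matches the row names of the expression matrix from step 1. We can now proceed with the ID conversion.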
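The match between the annotation's ID column and the matrix row names can also be confirmed programmatically before converting. The sketch below uses small toy objects (`probe_ids` and `toy_idprob`, both hypothetical stand-ins for `rownames(exp)` and `idprob`) so that it runs without downloading anything.

```r
# Toy stand-in for rownames(exp)
probe_ids <- c("1007_s_at", "1053_at", "117_at", "121_at")

# Toy stand-in for idprob, using rows copied from the head() output above
toy_idprob <- data.frame(
  ID     = c("1007_s_at", "1053_at", "117_at", "121_at", "1255_g_at"),
  symbol = c("DDR1 /// MIR4640", "RFC2", "HSPA6", "PAX8", "GUCA1A"),
  stringsAsFactors = FALSE
)

# Which matrix rows have an entry in the annotation table?
matched   <- probe_ids %in% toy_idprob$ID
n_matched <- sum(matched)
unmatched <- probe_ids[!matched]

cat(n_matched, "of", length(probe_ids), "probes have an annotation entry\n")
```

With the real objects, a large number of unmatched probes would usually point to the wrong GPL platform having been chosen for the annotation.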

#03

Annotate genes with the probe ID information, i.e. convert the probe IDs

> expr_1 <- filterEM(exp, idprob)
input expression matrix is 54675 rows(genes or probes) and 18 columns(samples).
input probe2gene is 54675 rows(genes or probes)
after remove NA or useless probes for probe2gene, 54675 rows(genes or probes) left
There are 54675 of 54675 probes can be annotated.
output expression matrix is 23521 rows(genes or probes) and 18 columns(samples).
> expr_1[1:4,1:4]
       GSM1539877 GSM1539878 GSM1539879 GSM1539880
ZZZ3     2.767478   2.673070   2.872924   2.851977
ZZEF1    2.921419   2.880373   2.803486   2.900011
ZYX      3.212532   3.162545   2.950402   3.020136
ZYG11B   3.160104   3.161922   3.325905   3.241667

Inspecting expr_1 shows that the row names of the expression matrix are now gene symbols, which completes the probe ID conversion.
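The log above goes from 54675 probes to 23521 rows, which implies that probes mapping to the same gene symbol are collapsed into a single row. The sketch below illustrates one common way to do this in base R: keep, for each symbol, the probe with the highest median expression. This is an illustrative assumption and is not claimed to be exactly what `filterEM` does internally. The function `collapse_probes` and all toy data are hypothetical.

```r
# Toy expression matrix: 4 probes x 3 samples (made-up values)
toy_exp <- matrix(
  c(2.0, 2.1, 2.2,
    5.0, 5.1, 5.2,
    3.0, 3.1, 3.2,
    4.0, 4.1, 4.2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3", "p4"), c("S1", "S2", "S3"))
)

# Toy probe-to-symbol table: p1 and p2 share GENE_A, p4 has no symbol
toy_map <- data.frame(
  ID     = c("p1", "p2", "p3", "p4"),
  symbol = c("GENE_A", "GENE_A", "GENE_B", NA),
  stringsAsFactors = FALSE
)

collapse_probes <- function(expr, map) {
  # Drop probes without a usable symbol or without expression data
  map <- map[!is.na(map$symbol) & map$symbol != "", ]
  map <- map[map$ID %in% rownames(expr), ]
  # Median expression of each remaining probe across samples
  map$median <- apply(expr[map$ID, , drop = FALSE], 1, median)
  # Within each symbol, put the highest-median probe first, then keep it
  map <- map[order(map$symbol, -map$median), ]
  map <- map[!duplicated(map$symbol), ]
  # Subset the matrix to the kept probes and relabel rows with symbols
  out <- expr[map$ID, , drop = FALSE]
  rownames(out) <- map$symbol
  out
}

toy_genes <- collapse_probes(toy_exp, toy_map)
print(toy_genes)
```

Other choices, such as averaging all probes of a gene, are also used in practice. Which one is appropriate depends on the downstream analysis.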
