All You Need in One Post: GWAS Gene Annotation with bedtools

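bedtools is a powerful toolkit for working with BED, VCF, GFF and similar interval formats, developed by the Quinlan lab at the University of Utah. At present it runs mainly as a command-line program on Linux, UNIX and similar operating systems. Today, Xiaoguo walks you through this genome arithmetic toolset.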

Main features of bedtools

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.

usage: bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]

intersect Find overlapping intervals in various ways.

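Computes the intersection between intervals. Useful for annotating peaks, working out which genomic regions reads align to, and checking how peaks from different samples overlap.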

window Find overlapping intervals within a window around an interval.

closest Find the closest, potentially non-overlapping interval.

Finds the nearest interval, which may not overlap.

coverage Compute the coverage over defined intervals.

Computes coverage over intervals.

map Apply a function to a column for each overlapping interval.

genomecov Compute the coverage over an entire genome.

merge Combine overlapping/nearby intervals into a single interval.

Merges overlapping or adjacent intervals.

cluster Cluster (but don’t merge) overlapping/nearby intervals.

complement Extract intervals _not_ represented by an interval file.

Returns the complementary intervals.

subtract Remove intervals based on overlaps b/w two files.

Computes the set difference between intervals.

slop Adjust the size of intervals.

Adjusts interval size, e.g. to get the region 3 kb upstream and downstream of a transcription start site.

flank Create new intervals from the flanks of existing intervals.

sort Order the intervals in a file.

Sorts intervals; some commands require sorted BED input.

random Generate random intervals in a genome.

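Generates random intervals to use as a background set.

shuffle Randomly redistribute intervals in a genome.

Generates random intervals based on a given BED file, to use as a background set.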

sample Sample random records from file using reservoir sampling.

spacing Report the gap lengths between intervals in a file.

annotate Annotate coverage of features from multiple files.

[ Multi-way file comparisons ]

multiinter Identifies common intervals among multiple interval files.

unionbedg Combines coverage intervals from multiple BEDGRAPH files.

[ Paired-end manipulation ]

pairtobed Find pairs that overlap intervals in various ways.

pairtopair Find pairs that overlap other pairs in various ways.

[ Format conversion ]

bamtobed Convert BAM alignments to BED (& other) formats.

bedtobam Convert intervals to BAM records.

bamtofastq Convert BAM records to FASTQ records.

bedpetobam Convert BEDPE intervals to BAM records.

bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.

[ Fasta manipulation ]

getfasta Use intervals to extract sequences from a FASTA file.

Extracts the FASTA sequences at the given positions.

maskfasta Use intervals to mask sequences from a FASTA file.

nuc Profile the nucleotide content of intervals in a FASTA file.

[ BAM focused tools ]

multicov Counts coverage from multiple BAMs at specific intervals.

tag Tag BAM alignments based on overlaps with interval files.

[ Statistical relationships ]

jaccard Calculate the Jaccard statistic b/w two sets of intervals.

Measures the similarity between two datasets.

reldist Calculate the distribution of relative distances b/w two files.

fisher Calculate Fisher statistic b/w two feature files.

[ Miscellaneous tools ]

overlap Computes the amount of overlap from two intervals.

igv Create an IGV snapshot batch script.

Generates a script for batch-capturing IGV screenshots.

links Create a HTML page of links to UCSC locations.

makewindows Make interval “windows” across a genome.

Splits the given regions into small bins of a specified size and step.

groupby Group by common cols. & summarize oth. cols. (~ SQL “groupBy”)

Grouped summarization; it works on more than just BED files.

expand Replicate lines based on lists of values in columns.

split Split a file into multiple files with equal records or base pairs.

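Today's topic is annotation in GWAS analysis. Once a GWAS has identified significant SNPs, they have to be annotated before candidate genes can be found. GWAS rests on the idea that a SNP is in linkage disequilibrium (LD) with the gene controlling the trait, which is why we can infer that genes near a significant SNP are candidate genes for the trait.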
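To make the "genes near a significant SNP" idea concrete, here is a minimal pure-Python sketch, similar in spirit to the window sub-command listed above. The gene and SNP coordinates are invented, the function name genes_near is hypothetical, and the 2000 bp window is an arbitrary assumption; a sensible window depends on how quickly LD decays in your population.

```python
# Toy data; every coordinate and name here is made up for illustration.
# BED-style intervals are 0-based and half-open: [start, end).
genes = [
    ("chr1", 1000, 5000, "geneA"),
    ("chr1", 20000, 26000, "geneB"),
    ("chr2", 300, 900, "geneC"),
]
snps = [
    ("chr1", 5999, 6000, "snp1"),  # about 1 kb past the end of geneA
    ("chr2", 499, 500, "snp2"),    # inside geneC
]

def genes_near(snp, genes, window):
    """Return names of genes overlapping the SNP extended by `window` bp on each side."""
    chrom, start, end, _ = snp
    lo, hi = max(0, start - window), end + window
    return [name for g_chrom, g_start, g_end, name in genes
            if g_chrom == chrom and g_start < hi and g_end > lo]

for snp in snps:
    # window=2000 is an assumed value, not a recommendation
    print(snp[3], genes_near(snp, genes, window=2000))
```

With a window of 0 the same function reports only genes that actually contain the SNP, which is the case the intersect command in this post handles.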

Enough talk, let's get to the practical part.

Two ways to install the software

Option 1: download and build. Change into the directory where you keep your software.

wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz

After the download finishes, extract the archive and compile

tar -zxvf bedtools-2.30.0.tar.gz

cd bedtools2

make

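Option 2: install directly with conda

conda activate xxx    # activate a conda environment (xxx stands for your environment name)

conda install -c conda-forge -c bioconda bedtools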

bedtools intersect -a xx.bed -b xx.bed -wa -wb | bedtools groupby -i - -g 1-4 -c 8 -o collapse >xx.txt

## Reference: https://www.cnblogs.com/xudongliang/p/5051503.html

## -a BED file

chr start end gene

## -b BED file

chr start end SNP_ID

## bedtools intersect -a xx.bed -b xx.bed

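By default, only the overlapping portion of each A interval is reported.

## bedtools intersect -a cpg.bed -b exon.bed -wa

With -wa, the whole A interval is reported whenever it overlaps an interval in B, once for every overlap.

## bedtools intersect -a cpg.bed -b exon.bed -wb

With -wb, the overlapping portion of A is reported, followed by the entire B interval.

## bedtools intersect -a cpg.bed -b exon.bed -wa -wb

With both -wa and -wb, the original A and B intervals are reported as a pair for each overlap.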

## -c: for each interval in A, count the number of overlaps with B

## -v: report only the A intervals that have no overlap with B

## Reference: https://zhuanlan.zhihu.com/p/93660864
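The differences between the default output, -wa, -wb, -c and -v are easier to see on a tiny example. The following is a pure-Python sketch that mimics those reporting styles; it is not how bedtools is implemented. The function names and the cpg/exon intervals are made up for illustration.

```python
# Toy intervals (invented): chrom, start, end, name; 0-based, half-open.
A = [("chr1", 100, 200, "cpg1"), ("chr1", 500, 600, "cpg2")]
B = [("chr1", 150, 180, "exon1"), ("chr1", 190, 250, "exon2")]

def overlap(a, b):
    """Return the shared (start, end) of two intervals, or None if they do not overlap."""
    if a[0] != b[0]:
        return None
    start, end = max(a[1], b[1]), min(a[2], b[2])
    return (start, end) if start < end else None

def intersect(A, B, wa=False, wb=False):
    """Mimic the default, -wa, -wb and -wa -wb reporting styles."""
    rows = []
    for a in A:
        for b in B:
            hit = overlap(a, b)
            if hit is None:
                continue
            # default: A trimmed to the overlap; wa=True: A exactly as given
            left = a if wa else (a[0], hit[0], hit[1]) + a[3:]
            # wb=True: append the whole B interval after the A columns
            rows.append(left + b if wb else left)
    return rows

def count_overlaps(A, B):
    """Like -c: every A interval plus its number of overlaps with B."""
    return [a + (sum(overlap(a, b) is not None for b in B),) for a in A]

def no_overlap(A, B):
    """Like -v: only the A intervals that overlap nothing in B."""
    return [a for a in A if all(overlap(a, b) is None for b in B)]

print(intersect(A, B))                    # trimmed A, one row per overlap
print(intersect(A, B, wa=True, wb=True))  # full A and full B as pairs
print(count_overlaps(A, B))
print(no_overlap(A, B))
```

Here cpg1 overlaps two exons, so it appears twice in the default and -wa style output, while cpg2 overlaps nothing and only shows up under the -v style.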

##bedtools groupby -i out -g 1-4 -c 8 -o collapse

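## -i: the input file; because the data arrives through a pipe (|), no input file is named explicitly

## -g: the columns to group on. "-g 1-4" groups rows whose first four columns are identical.

## -c: the column to summarize. "-c 8" summarizes the values in column 8.

## -o: the summary operation. "-o collapse" joins the values into a comma-separated list.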
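As a rough picture of what the groupby step does to the intersect -wa -wb output, here is a small pure-Python sketch. The rows and the function name groupby_collapse are made up for illustration. Like bedtools groupby, it assumes rows with the same grouping key are already adjacent, which holds for intersect output because the hits for one A interval are written consecutively.

```python
# Invented rows shaped like `intersect -wa -wb` output:
# columns 1-4 are the gene (A), columns 5-8 are the SNP (B).
rows = [
    ("chr1", 1000, 5000, "geneA", "chr1", 1199, 1200, "snp1"),
    ("chr1", 1000, 5000, "geneA", "chr1", 3499, 3500, "snp2"),
    ("chr2", 300, 900, "geneC", "chr2", 499, 500, "snp3"),
]

def groupby_collapse(rows, group_cols, value_col):
    """Join `value_col` with commas for each run of rows sharing `group_cols`.

    Column numbers are 1-based, as on the bedtools command line.
    """
    out = []
    prev_key, values = None, []
    for row in rows:
        key = tuple(row[c - 1] for c in group_cols)
        if prev_key is not None and key != prev_key:
            # the key changed: emit the finished group and start a new one
            out.append(prev_key + (",".join(values),))
            values = []
        prev_key = key
        values.append(str(row[value_col - 1]))
    if prev_key is not None:
        out.append(prev_key + (",".join(values),))
    return out

# equivalent in spirit to: -g 1-4 -c 8 -o collapse
for line in groupby_collapse(rows, group_cols=[1, 2, 3, 4], value_col=8):
    print(*line, sep="\t")
```

The result has one line per gene, with all SNP IDs that fall inside that gene listed in the last column.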

/home/bedtools/bedtools2-2.25.0/bin/bedtools intersect -a gene.bed -b SNP.bed -wa -wb | /home/bedtools/bedtools2-2.25.0/bin/bedtools groupby -i - -g 1-4 -c 8 -o collapse >xx.txt

## Extract FASTA sequences with bedtools.

bedtools getfasta -fi /home/Reference/Sus_101/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa -bed lncRNA.bed > lncRNABSSeq.fa

bedtools getfasta -fi Gallus_gallus.GRCg6a.dna.toplevel.fa -bed /root/bedtools2/4T-5T-BU -name -fo /root/bedtools2/4T-5T-BU.fa

# BED file columns: chromosome, start, end, gene ID
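One detail that often trips people up with getfasta is that BED coordinates are 0-based with an exclusive end. The following pure-Python sketch shows the coordinate convention on an invented 12 bp "genome"; the function name get_fasta and the data are hypothetical, and the header uses the chrom:start-end form that getfasta writes by default.

```python
# Invented toy reference and BED record; real inputs would be read from files.
genome = {"chr1": "ACGTACGTACGT"}
bed = [("chr1", 2, 6, "gene1")]

def get_fasta(genome, bed):
    """Return FASTA text for each BED interval, using 0-based half-open coordinates."""
    records = []
    for chrom, start, end, _name in bed:
        # BED start is 0-based and end is exclusive, which matches Python slicing
        seq = genome[chrom][start:end]
        records.append(f">{chrom}:{start}-{end}\n{seq}")
    return "\n".join(records)

print(get_fasta(genome, bed))
```

The interval 2 to 6 therefore yields the four bases at 1-based positions 3 through 6.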

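Overall, the bedtools utilities cover a wide range of genomics analysis tasks. Each individual tool is designed to do one relatively simple job, but chaining several bedtools operations on the UNIX command line makes quite sophisticated analyses possible. For details, see: https://www.jieandze1314.com/post/c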
