How to extract regions with a minimum value of coverage

One of the most important parameters in NGS analysis is coverage. Coverage (or depth) is defined as the number of times one base has been sequenced. It’s a very important parameter in variant and small indels detection and, generally, the sequencing processes are designed depending on this parameter. Last week I found with the problem that I had to retrieve the intervals with coverage greater than 30 from a alignment file in bam format. How could we do that?

You need to install samtools and bedtools in your machine. And with only one command line you can get these intervals:

samtools mpileup -B -Q 0 -d 8000 -f reference_genome.fasta alingment.bam | awk -F'\t' '{if($4>=30) print $1"\t"$2-1"\t"$2}' | mergeBed -i stdin

Firstly, with samtools program we convert the alignment bam file in pileup format. Secondly, awk helps us to extract only the positions with coverage greater than 30 and print them in bed format. As bed format is 0-based, we must print the position less 1 as start of the interval in each position. Finally, with bedtools, function mergeBed, we merge overlapping repetitive elements into a single entry.

That’s it.


Deja un comentario

Tu dirección de correo electrónico no será publicada. Los campos necesarios están marcados *