Skip to content

How to Preprocess ATAC seq Data for JAMM

Mahmoud Ibrahim edited this page Oct 21, 2019 · 6 revisions

Update (Oct. 2019): The information in this tutorial may be a bit outdated, please check out the methods section in this manuscript for a more recent pipeline. You can also contact me (see below) if you have questions, or need help.

On this page we will give some notes on how to process ATAC-seq data so that it's ready for peak calling with JAMM.

ATAC-seq is Paired-end (almost always!)

When an ATAC-seq experiment is performed, the fragments produced will not all have the same length. Some will be small (~200 bp or so on bioanalyzer plots) and some will be bigger. The small ones correspond to fragments that do NOT span nucleosomes. The longer fragments correspond to cases where the transposase cuts in two open regions that are separated by one or more nucleosomes. Therefore, those fragments do NOT correspond fully to open regions only because they span one or more nucleosomes but their starts and ends DO mark open regions.

In some cases, ATAC-seq will be sequenced in single-end mode. This means that there will be no way to tell the size of each fragment from the resulting data. In this case, analysis of ATAC-seq is very straightforward: Take your reads and count the 5' ends (with JAMM, that means using the parameters -f1 -d y).

In most cases, ATAC-seq is sequenced in paired-end mode. This means that you do know the size of each fragment and that you can take advantage of this information: it will allow you to generate a nucleosome occupany track from ATAC-seq data in addition to the open region track. To learn how to get the nucleosome track, check Buenrostro et al.. In the rest of this page, we will tell you how to get open regions.

Use the 5' ends or ...

To get open regions from paired-end ATAC-seq data using JAMM, you should first generate a paired-end BED file (bedpe) from your paired-end BAM file using bedtools bamtobed -bedpe. Before you use bedtools, make sure the bamfile is name-sorted using samtools sort -n.

Then you can proceed in two ways:

  1. Convert your paired-end BED file into a regular single-end BED file by concatenating the second mate information (typically columns 4,5,6,7,8 and 10) after the first mate information (typically columns 1,2,3,7,8 and 9). Then use this file to call peaks with JAMM with parameter -f 1 -d y. This way, JAMM will use the 5' ends of the reads to call peaks (just like what you would do with DNase-seq). Summary: You are now calling peaks using the 5' ends of all fragments..just like you would with DNase-seq and single-end ATAC-seq.

  2. Use the fragment length information to separate the long fragments (nucleosomal fragments) from the short fragments (open region fragments, how to do this in Buenrostro et al.). Then, use only the open region fragments to call peaks using JAMM in paired-end mode with the parameters -t paired. JAMM will then use the entire fragment length to construct the read coverage profile and call peaks. Summary: You are now calling peaks using the entire fragment length from short open region fragments..you ignored all the information from the nucleosomal fragments.

In very informal tests, we found that the first way produces better results, although the output is really very similar. Theoretically, you would use the first way when your data is clean with high signal-to-noise ratio and you want to make sure you capture all peaks present in your data. The second way might be beneficial when your data is bit a noisy with low signal-to-noise ratio and you are hoping to clean it up a bit before calling peaks.

Very soon, example code with example data will be added here to turn this into a proper tutorial. In the meantime if you have any more questions, please don'T hesitate to email us.

Questions or Corrections or Comments?