How to identify motifs using ChIP-Seq data

Here I explained how to analyze GABP ChIP-Seq data using QuEST. We will use the QuEST output to detect motifs in the GABP data and hence determine the DNA binding specificity of GABP.

The idea is very simple: I am going to take the QuEST peaks, output 200 bp of sequence around each significant peak, and then analyze a small subset with the de novo motif finder MEME (thanks to the MEME team for bringing such a wonderful tool to the community!).

1. First we need to obtain a reference genome, which we will use to extract the sequence around each binding site.

- go to the UCSC Genome Browser by typing "http://genome.ucsc.edu/" into your web browser

- click on the "Genomes" tab in the Genome Browser home page

- select the hg18 assembly of the human genome by choosing "Mammal" in the clade, "Human" in the genome and "Mar. 2006" in the assembly drop-down menus. You should see "hg18" appear in the heading "About the Human Mar. 2006 (hg18) assembly" (see the screenshot below)

- scroll down this page and under the "Assembly Details" section you will see the links to the assembly download page (see the screenshot below). Click on the http link.

- scroll down the Downloads page until you see the "Human Genome" section. Click on "Full data set" under Mar. 2006 (hg18) (see the screenshot below).

- scroll down this page and download the "chromFa.zip" file. Save it in a convenient location.

- unpack the reference genome. You can do this by cd-ing into the directory where you saved chromFa.zip and typing "unzip chromFa.zip" in the unix shell

- move the files that we don't need (the "random" and haplotype sequences) into a separate directory. Type the following commands:

> mkdir not_needed

> mv *random* *hap* ./not_needed

- concatenate the individual chromosomes to produce a single fa file containing the entire genome:

type "cat *.fa > all_chr.fasta". You will now have a file all_chr.fasta containing the entire reference genome.
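Before moving on, it can be worth checking that the concatenation produced what you expect. The sketch below is only an illustration: the function names count_fasta_records and list_fasta_headers are made up for this example and are not part of QuEST or the UCSC distribution. It counts the FASTA header lines (lines starting with ">") and lists them, so you should see one header per chromosome file that went into all_chr.fasta and none of the random or hap sequences.

```shell
# Hypothetical helper (not part of QuEST): count the FASTA records in a file
# by counting the header lines, which start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: print the record names, one per line, without the ">".
list_fasta_headers() {
    grep '^>' "$1" | sed 's/^>//'
}

# Example usage, assuming all_chr.fasta is in the current directory:
#   count_fasta_records all_chr.fasta
#   list_fasta_headers all_chr.fasta
```

If a name containing "random" or "hap" shows up in the list, the corresponding file was not moved out of the directory before running cat.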

2. Download and unpack this script. Type "gzip -d output_genomic_regions_from_peak_calls.pl.gz". Type "perl output_genomic_regions_from_peak_calls.pl" to see the command-line options:

3. Extract 200 bp around each peak call

type the following command (make sure to replace paths to all_chr.fasta and peak call file if they are different on your computer):

"perl output_genomic_regions_from_peak_calls.pl -i ./QuEST_analysis/calls/peak_caller.ChIP.out.accepted -o ./GABP -r ./../hg18_distribution/all_chr.fasta -w 200"

You will now have the files GABP.peaks.200bp.fa and GABP.regions.200bp.fa. Because these files contain a lot of sequence, it is usually impractical to run a motif search on the entire dataset unless you have access to a large cluster and have MEME configured to run on it. Because of this, we will do a motif search on the top 150 sequences only.
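A quick way to confirm that the extraction worked is to tabulate the sequence lengths in the output. The following is a minimal sketch, not something shipped with QuEST: fasta_length_summary is a hypothetical helper name, and it assumes only that the file is ordinary FASTA (header lines starting with ">", sequence on one or more following lines). It prints each observed sequence length followed by the number of records with that length.

```shell
# Hypothetical helper (not part of QuEST): for a FASTA file, print
# "length count" pairs, one per line, sorted by length.
fasta_length_summary() {
    awk '
        # On each header, record the length of the previous record (if any).
        /^>/ { if (name != "") lengths[len]++; name = $0; len = 0; next }
        # Sequence lines: accumulate the length of the current record.
             { len += length($0) }
        # Do not forget the last record in the file.
        END  { if (name != "") lengths[len]++
               for (l in lengths) print l, lengths[l] }
    ' "$1" | sort -n
}

# Example usage, assuming the file from step 3 is in the current directory:
#   fasta_length_summary GABP.peaks.200bp.fa
```

With "-w 200" you would expect nearly all records to fall at or close to 200 bp; many much shorter records would suggest a problem with the reference path or the peak file.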

4. Extract a subset of 150 sequences by typing the following command:

"awk 'BEGIN{counter=0}{if(counter < 300){print $0; counter++; }}' GABP.peaks.200bp.fa > GABP.top150.peaks.200bp.fa"

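The command keeps the first 300 lines of the file, which is 150 sequences provided each record takes exactly two lines (a header line and a sequence line). The file GABP.top150.peaks.200bp.fa will contain what you need to submit to the motif finder.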
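The awk command in this step counts lines, so it yields 150 sequences only when every record is a header plus a single sequence line. If you are not sure that holds for your file, a variant that counts header lines instead is safer. This is a sketch under that assumption; fasta_head is a hypothetical name chosen for this example, not a tool from QuEST or MEME.

```shell
# Hypothetical helper: print the first N FASTA records of a file,
# however many lines each record spans.
#   $1 = number of records to keep, $2 = input FASTA file
fasta_head() {
    awk -v n="$1" '
        /^>/     { seen++ }   # a header line starts a new record
        seen > n { exit }     # stop as soon as record N+1 begins
                 { print }    # otherwise pass the line through
    ' "$2"
}

# Example usage, equivalent to step 4 for two-line records:
#   fasta_head 150 GABP.peaks.200bp.fa > GABP.top150.peaks.200bp.fa
```

Either way, "grep -c '^>' GABP.top150.peaks.200bp.fa" should report 150 before you upload the file.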

5. Upload GABP.top150.peaks.200bp.fa file to the MEME jobs page.

- type in "http://meme.sdsc.edu/meme4_1/intro.html" in the web browser address field. Click on the MEME logo (see the screenshot below).

- type in your email address, select the file "GABP.top150.peaks.200bp.fa", hit "Start search" (see the screenshot below)

You will see the following page and will have to wait for some minutes before the job finishes on the MEME cluster.

After some time, you can refresh the page to see whether the job has finished. Click on the "MEME output as HTML" link:

- experience the awesome power of MEME:

You can see that MEME found a canonical GABP motif in every one of the 150 sequences. The sequence logos are good for showing in talks.

Good luck with analyzing your ChIP-Seq data.

Notes:

1. The MEME web-based submission only allows small data sets to be analyzed (something like 150 sequences of 200 bp). If you need to process larger sets, it is best to install MEME locally and run it from the command line (this is not covered here).

References

MEME:

1. Timothy L. Bailey, Nadya Williams, Chris Misleh, and Wilfred W. Li, "MEME: discovering and analyzing DNA and protein sequence motifs", Nucleic Acids Research, Vol. 34, pp. W369-W373, 2006.

2. Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.

Last modified: Apr. 16 2009