bioinformatics and randomness


Hello all!

I just completed my undergrad senior thesis recently and I wanted to share a little bit with you. All in all, it was a amazing experience! I had a lot of help with my thesis from my advisors and other students but it was a very independent project for me also, and I feel very privileged to have gotten involved in such a cool field—bioinformatics!

The thesis is titled “Comparison of transcription factor binding sites from ChIP-seq experiments” and I can make it available on request or once it’s published on my university’s website. It contains a lot of stuff that I have already talked about on this blog already, but a lot of new stuff too!

For my final project, one of the main things that I did was to use the limma package from Bioconductor as a way to compare ChIP-seq data. My original goal was to use some standard t-tests to do comparisons, but I found digging into the literature that there are improvements including the moderated t-test from the limma package.

As a starting point, I use the binned read counts that are outputted as wiggle files from MACS, find the bins that are differentially bound by comparing the entire datasets using limma (Figure 1). This reveals intuitively that the read counts that are high in one experiment but low in the other are identified as differential.


Figure 1. Differential ChIP-seq read counts from two ChIP-seq experiments that are identified using limma

The next thing that I did was to merge all the nearby binding sites together. I used a very simple merge approach using BEDTools by merging nearby positions only <20bp apart. It turns out that the merging performs very well and many nearby binding sites are merged together from the high-confidence sites that limma identifies (Figure 2)


Figure 2. The ratio of the merged peaks (output) to the inputted bins (input) The decreasing ratio is a sign of many nearby bins that are identified by limma being merged together at different p-value thresholds.

The next thing I did was inspect the ‘signals’ to show that these peaks resemble intuitively differential peaks. Indeed, many of these these peaks are highly significant and highly differentially bound (Figure 3). I attribute the success of this approach to the limma package. In my paper, I also looked at how my method compares with other software such as MACS and DIME for finding differential binding sites.


Figure 3 An example of a differential binding site, showing a peak that is reproduced in the replicates from one strain but that is not present in the  replicates of the other strain.

These differential binding sites appear in many different forms, and one question is what effect do these binding sites have? These binding sites essentially regulate the gene expression of nearby genes, so I looked at the gene expression data that was also made available and assessed the differential gene expression vs the differential binding (Figure 4). This method is inspired by plotMM from the Rcade package from Bioconductor. The gene expression data was obtained with help from the GEO2R interface that is fairly new interface for GEOquery on the NCBI website!


Figure 4. The differences in binding sites are associated with changes to gene expression also. Comparison of the log ratio of differential binding sites vs the log ratio of the gene expression data. The lightly colored points are the gene expression from a random gene, and the dark colored points are gene expression for the nearest gene to the binding site

Overall, I was able to really deeply explore one approach for comparing ChIP-seq data. I think the availability of high-throughput sequencing data is very cool on the GEO database, and many new types of bioinformatics tools are needed to help analyze it! I don’t think my work will stop just because I finished my project, and I will hope to find new and interesting things for the future!

PS: there are some interesting comparisons of limma and edgeR that I did not consider in my project but which would be interesting to analyze.