Wednesday, November 26, 2008

Binning Algorithms for Metagenomic Sequencing

One of my assistants, SM, who is at least as smart as me and twice as hard working, wrote to ask my advice.

SM: Do you know of any good binning algorithms for metagenomic sequencing?
DL: Huh? What does "binning algorithms for metagenomic sequencing" mean?

SM has not given me an answer. Either she assumes I am joking and actually do know (which I don't), or she assumes it would take far too long to explain it to me (which I will pretend to resent). So now I shall try to reckon out on my own what "binning algorithms for metagenomic sequencing" means.

Metagenomics, according to my sources (Wikipedia), "is the study of genetic material recovered directly from environmental samples." So, you take a pinch of garden dirt, extract all the DNA in it and then set out to study it in some way. You are metagenomisizing.

Sequencing, in the context of genetics, means figuring out the sequence of DNA bases (A's, T's, G's and C's) that make up part of the genome of an organism. So metagenomic sequencing presumably is taking the DNA from your pinch of dirt, then trying to figure out the sequence of DNA bases that make up all the genomes of all the organisms whose DNA is jumbled together in that dirt. A pinch of dirt, I am guessing, has DNA from hundreds of types of bacteria, a huge number of types of fungi, various protozoans and whatever else has dropped seeds, pollen, poo, tissue or hair in that vicinity in the recent past. And much of that DNA isn't going to be whole chromosomes, but whatever bits and pieces are still mostly intact after all that pooing and shedding and biodegrading. You'll have a real mishmash.

This, I suspect, is where the "binning algorithm" comes in. Binning is any process where you have a large number of elements and you want to separate them into a smaller number of categories. A binning algorithm would be the set of rules one uses to make those categorization decisions. In the context of metagenomics, I'm guessing that each bin represents a species. You have a snippet of DNA and you need to assign it to an organism, both so that you don't count every bit of DNA as a separate organism and so that you get a sense of how much of each species is represented. So the set of rules you use to assign snippets of DNA extracted from your pinch of dirt to different species is your binning algorithm for metagenomic sequencing. I think.

My friend DS works on this kind of stuff. I'll write to him and ask.
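
If my guess is right, the crudest possible version might look like the toy sketch below. Everything in it is an assumption made up for illustration: the function names, the two "reference species," and the rule itself (put each snippet in the bin of whichever reference shares the most short subsequences, or k-mers, with it). Real binning tools are surely far more sophisticated than this.

```python
def kmers(seq, k=4):
    """Return the set of all length-k substrings of a DNA string."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_snippet(snippet, references, k=4):
    """Return the reference species sharing the most k-mers with the snippet.

    Returns None when the snippet shares no k-mers with any reference.
    """
    snippet_kmers = kmers(snippet, k)
    best_species, best_shared = None, 0
    for species, genome in references.items():
        shared = len(snippet_kmers & kmers(genome, k))
        if shared > best_shared:
            best_species, best_shared = species, shared
    return best_species


# Made-up reference sequences, for illustration only.
references = {
    "species_A": "ATATATATATATATATATAT",
    "species_B": "GCGCGCGCGCGCGCGCGCGC",
}

# Sort a few made-up snippets into one bin per species.
snippets = ["ATATATAT", "GCGCGC", "ATATA"]
bins = {}
for s in snippets:
    bins.setdefault(bin_snippet(s, references), []).append(s)
```

Counting how many snippets land in each bin would then give the rough sense of representation per species that I was guessing at.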

UPDATE:
I wrote to SM and DS and asked:
Will one of you tell me what "binning algorithms for metagenomic sequencing" means?
I know what each word means, but I could come up with three or four very different guesses as to what the whole phrase means. What does each bin represent?

DS writes: [Bins represent] Taxa. In metagenomic sequencing, you get a soup of reads from all the strains of microbes present in your sample. "Binning" is the process of trying to guess which species each read comes from (or genus, or kingdom for that matter).

All methods in the literature so far are "supervised", meaning that you can only assign a read to a taxon bin if you know something about that taxon in advance (e.g., you have an isolate genome). However, environmental samples may contain previously unknown taxa: new bacterial divisions are still being discovered fairly rapidly, and at the strain level of course nearly everything is novel. A supervised binning process ought to throw up its hands at sequences from novel taxa, since they don't match any known bins. An "unsupervised" process would create new bins on the fly, in order to lump together reads that seem to be related to each other, independent of reference sequences. No published methods do that yet, though.

The accuracy of binning varies dramatically depending on the complexity of the community, the read length, the phylogenetic resolution you're asking for, and many other parameters.

Hope this helps,

-ds
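
To make DS's supervised versus unsupervised distinction concrete, here is a toy sketch. It is not any published method: the function names, the k-mer "containment" score, the threshold and the example sequences are all assumptions chosen for illustration. Each read goes to the best-matching existing bin; where a purely supervised binner would throw up its hands, this one opens a new bin on the fly.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of a DNA string."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(read_kmers, profile):
    """Fraction of the read's k-mers that also occur in the bin's profile."""
    if not read_kmers:
        return 0.0
    return len(read_kmers & profile) / len(read_kmers)


def bin_reads(reads, known_taxa, threshold=0.5, k=3):
    """Assign each read to a bin, creating a new bin when nothing fits.

    known_taxa maps taxon name -> reference sequence (hypothetical data).
    Returns a list of (read, bin_name) pairs in input order.
    """
    profiles = {name: kmers(seq, k) for name, seq in known_taxa.items()}
    novel = set()  # names of bins created on the fly
    assignments = []
    for read in reads:
        read_kmers = kmers(read, k)
        best_name, best_score = None, 0.0
        for name, profile in profiles.items():
            score = containment(read_kmers, profile)
            if score > best_score:
                best_name, best_score = name, score
        if best_score < threshold:
            # No existing bin fits: open a new one seeded by this read.
            best_name = f"novel_{len(novel) + 1}"
            novel.add(best_name)
            profiles[best_name] = set(read_kmers)
        elif best_name in novel:
            # Let a novel bin's profile grow as related reads arrive.
            profiles[best_name] |= read_kmers
        assignments.append((read, best_name))
    return assignments


# Made-up example: one known taxon, plus reads from something unknown.
known_taxa = {"taxon_A": "ATATATATATATATAT"}
reads = ["ATATAT", "GCGCGCGC", "CGCGCG", "TATATA"]
result = bin_reads(reads, known_taxa)
```

In this made-up example the two GC-rich reads match nothing known, so they end up lumped together in a bin called `novel_1`, while the other two reads go to `taxon_A`.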

1 comment:

Unknown said...

For accurate and specific taxonomic binning of metagenomic sequences, use an algorithm named SOrt-ITEMS.
The paper is available in the journal Bioinformatics at this URL:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/14/1722

Software is available for download at
http://metagenomics.atc.tcs.com/binning/SOrt-ITEMS