Sunday, March 25, 2007

Update on the unfinished count of the human genome

I'm a big fan of Bill Clinton, but he did have a talent for useful hokum. It often served a greater cause, but it did have the disadvantage of being a bit ummm untrue. The Y2K "human genome sequenced" story was a bit like that. The real timeline of the project seems suspiciously close to how long the grumpy old skeptics thought it would take. We're still slogging away. Carl Zimmer brings us up to date. The original article has a fascinating link to "PANTHER", an academic project for assigning genes to functional categories. Don't miss Zimmer's ending sentence ...
The Loom : You Don't Miss Those 8,000 Genes, Do You?

... When Craig Venter and his colleagues published their rough draft of the human genome in 2001 they identified 26,588 human genes. They then broke those genes down by their functions. Some were involved in building DNA, some in relaying signals, and so on. Remarkably, though, they classified 12809 genes--almost half--as "molecular function unknown."

... There are web sites where you can observe works in progress, such as the human genome. One of those sites is called PANTHER. I contacted the top scientist behind it, Paul D. Thomas, with my question, and he sent me a link. When I clicked on the link, I got the pie chart I've posted here (click on the image to go to the original page if it's hard to read).

The pie shows that we're now down to just 18,308 genes. That's over 8,000 genes fewer than six years ago. Many sequences that once looked like full-fledged genes, capable of generating a protein, now don't make the grade. Some genes turned out to be pseudogenes--vestiges of genes that once worked but have been since wrecked by mutations. In other cases, DNA segments that appeared to be parts of separate genes have turned out to be part of the same gene.

Today scientists still don't know the function of 5898 genes in the human genome... For all the work that has poured into the genome, for all the grand announcements, we still don't have the faintest idea of what about a third of our genes are for.

... few human genes have experimental evidence for their function in humans. In one study of 35329 proteins, scientists estimated that only 2784 met this gold standard.

... And then there's the whole matter of all the other DNA that doesn't encode proteins (98.5% of the genome all told). A lot of it is most likely a mishmash of broken genes and viral DNA. It's possible to cut huge swaths of it out of a mouse's genome with no apparent ill effect. But there are also a lot of important players hiding in that wilderness--switches that proteins can use to turn genes on and off, sequences that do not give rise to proteins but rather RNA molecules that create their own control system for a cell. In all of these complications, scientists will probably find the answer to the question, "How do roughly the same number of genes encode such different kinds of animals?" Complexity isn't purely a matter of the number of genes you have. It's also how you use them.

... I would not have been able to have created this pie chart without Thomas's help. Perhaps some science writers will become more like investigative political reporters who know how to sift through Federal election databases for the real news...
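
The figures in the excerpt above are easy to sanity-check. Here is a minimal Python sketch; the variable names and the `share` helper are my own, and the numbers are copied straight from the quoted text rather than from PANTHER itself.

```python
# Figures as quoted in the excerpt above (not re-derived from PANTHER).
genes_2001 = 26_588              # genes in the 2001 rough draft
unknown_2001 = 12_809            # "molecular function unknown" in 2001
genes_2007 = 18_308              # genes in the 2007 pie chart
unknown_2007 = 5_898             # function still unknown in 2007
proteins_studied = 35_329        # proteins in the study Zimmer cites
proteins_with_evidence = 2_784   # those with experimental evidence in humans


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


genes_dropped = genes_2001 - genes_2007

print("Genes dropped since 2001:", genes_dropped)
print("Unknown function, 2001 (%):", share(unknown_2001, genes_2001))
print("Unknown function, 2007 (%):", share(unknown_2007, genes_2007))
print("Experimental evidence (%):", share(proteins_with_evidence, proteins_studied))
```

The results line up with the wording in the excerpt: "over 8,000" fewer genes, "almost half" unknown in 2001, and "about a third" unknown in 2007.
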
I recall from Dickson's 1970s "Dorsai" cycle that much research in that "space opera" consisted of mining "the encyclopedia" (read: the web) for knowledge. Zimmer is quietly predicting "knowledge mining" will become a big part of science -- not just writing about it, but also doing it. In fact, I'm not sure there's a clear difference between knowledge mining and classic science, though I confess knowledge mining seems to have some resemblance to medieval scholasticism.

We're now in the story of the 'incredible shrinking genome'. Meanwhile we learn elsewhere in Zimmer (I think it was there) that humans and chimpanzees are much less alike than we'd thought. There are many ways to encode complexity, and evolved organisms have a rather baroque approach to solving such problems.

Update: If we extrapolate a bit, it would not be surprising to discover that we have about 16,000 genes that code for 35,000 proteins. So there's roughly a 2:1 compression ratio from gene to expression, which is comparable to what the best lossless compression algorithms achieve on highly complex data. The 2:1 ratio presumably has some implications for how natural selection can proceed. It's easy to imagine that a mutation in a gene that "improves" the function of one protein product might disable another protein coded by the same gene. So evolution would typically proceed "two steps forward and one step backwards", or at best with peculiar side-effects on a secondary protein arising from changes to the primary protein.
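
For what it's worth, here is the arithmetic behind that ratio as a tiny Python sketch. Both inputs are my own extrapolated guesses from the paragraph above, not measured values, and the variable names are just for illustration.

```python
# Both figures are speculative extrapolations from the post, not data.
estimated_genes = 16_000
estimated_proteins = 35_000

# Average number of distinct proteins per gene under these guesses.
proteins_per_gene = estimated_proteins / estimated_genes

print("Proteins per gene:", proteins_per_gene)
```

That works out to a little over two proteins per gene, which is where the rough "2:1" comes from.
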
