
Thursday, May 20, 2010

Junk RNA or Imaginary RNA?

RNA is very popular these days. It seems as though new varieties of RNA are being discovered just about every month. There have been breathless reports claiming that almost all of our genome is transcribed and most of this RNA has to be functional even though we don't yet know what the function is. The fervor with which some people advocate a paradigm shift in thinking about RNA approaches that of a cult follower [see Greg Laden Gets Suckered by John Mattick].

We've known for decades that there are many types of RNA besides messenger RNA (mRNA encodes proteins). Besides the standard ribosomal RNAs and transfer RNAs (tRNAs), there are a variety of small RNAs required for splicing and many other functions. There's no doubt that some of the new discoveries are important as well. This is especially true of small regulatory RNAs.

However, the idea that a huge proportion of our genome could be devoted to synthesizing functional RNAs does not fit with the data showing that most of our genome is junk [see Shoddy But Not "Junk"?]. That hasn't stopped RNA cultists from promoting experiments leading to the conclusion that almost all of our genome is transcribed.

That may change. A paper just published in PLoS Biology shows that the earlier work was prone to artifacts. Some of those RNAs may not even be there and others are present in tiny amounts.

Late to the Party

Several people have already written about this paper, including Carl Zimmer and PZ Myers. There are also summaries in Nature News and PLoS Biology.

The work was done by Harm van Bakel in Tim Hughes' lab, right here in Toronto. It's only a few floors, and a bridge, from where I'm sitting right now. The title of their paper tries to put a positive spin on the results: "Most 'Dark Matter' Transcripts Are Associated With Known Genes" [van Bakel et al. (2010)]. Nobody's buying that spin. They all recognize that the important result is not that non-coding RNAs are mostly associated with genes but the fact that they are not found in the rest of the genome. In other words, most of our genome is not transcribed in spite of what was said in earlier papers.

Van Bakel et al. compared two different types of analysis. The first, called "tiling arrays," is a technique where bulk RNA (cDNA, actually) is hybridized to a series of probes on a microchip. The probes are short pieces of DNA corresponding to genomic sequences spaced every few thousand base pairs along each chromosome. When some RNA fragment hybridizes to one of these probes you score that as a "hit." The earlier experiments used this technique and the results indicated that almost every probe could hybridize an RNA fragment. Thus, as you scanned the chip you saw that almost every spot recorded a "hit." The conclusion was that almost all of the genome is transcribed even though only 2% corresponds to known genes.

The second type of analysis is called RNA-Seq and it relies on direct sequencing of RNA fragments. Basically, you copy the RNA into DNA, selecting for small 200 bp fragments. Using new sequencing technology, you then determine the sequence of one (single end) or both ends (paired end) of this cDNA. You may only get 30 bp of good sequence information but that's sufficient to place the transcript on the known genome sequence. By collecting millions of sequence reads, you can determine what parts of the genome are transcribed and you can also determine the frequency of transcription. The technique is much more quantitative than tiling experiments.
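To make the counting step concrete, here is a minimal sketch in Python. It is not the pipeline used in the paper. The function name, the toy coordinates, and the half-open interval convention are all assumptions for illustration, and it presumes the reads have already been placed on the genome.

```python
from bisect import bisect_right

def classify_reads(read_positions, genes):
    """Count mapped reads that fall inside or outside annotated gene intervals.

    read_positions: iterable of 0-based genomic coordinates, one per mapped read.
    genes: non-overlapping (start, end) half-open intervals, sorted by start.
    """
    starts = [start for start, _ in genes]
    counts = {"genic": 0, "intergenic": 0}
    for pos in read_positions:
        i = bisect_right(starts, pos) - 1  # last gene starting at or before pos
        if i >= 0 and pos < genes[i][1]:
            counts["genic"] += 1
        else:
            counts["intergenic"] += 1
    return counts

# Toy data (made up): two "genes" on a 10 kb stretch and six mapped reads.
toy_genes = [(1000, 3000), (6000, 7500)]
toy_reads = [1200, 1250, 2999, 3000, 6500, 9000]
print(classify_reads(toy_reads, toy_genes))
```

With millions of reads, the ratio of the two counts gives a rough idea of how much transcription comes from the regions between genes. A real analysis would also have to deal with introns, strand, and reads that map to more than one place.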

Van Bakel et al. show that using RNA-Seq they detect very little transcription from the regions between genes. On the other hand, using tiling arrays they detect much more transcription from these regions. They conclude that the tiling arrays are producing spurious results, possibly due to cross-hybridization or possibly due to detection of very low abundance transcripts. In other words, the conclusion that most of our genome is transcribed may be an artifact of the method.

The parts of the genome that are presumed to be transcribed, but for which no function is known, are called "dark matter." Here's the important finding in the authors' own words.
To investigate the extent and nature of transcriptional dark matter, we have analyzed a diverse set of human and mouse tissues and cell lines using tiling microarrays and RNA-Seq. A meta-analysis of single- and paired-end read RNA-Seq data reveals that the proportion of transcripts originating from intergenic and intronic regions is much lower than identified by whole-genome tiling arrays, which appear to suffer from high false-positive rates for transcripts expressed at low levels.
Many of us dismissed the earlier results as transcriptional noise or "junk RNA." We thought that much of the genome could be transcribed at a very low level but this was mostly due to accidental transcription from spurious promoters. This low level of "accidental" transcription is perfectly consistent with what we know about RNA polymerase and DNA binding proteins [What is a gene, post-ENCODE?, How RNA Polymerase Binds to DNA]. Although we might have suspected that some of the "transcription" was a true artifact, it was difficult to see how the papers could have failed to consider such a possibility. They had been through peer review and the reviewers seemed to be satisfied with the data and the interpretation.

That's gonna change. I suspect that from now on everybody is going to ignore the tiling array experiments and pretend they don't exist. Not only that, but in light of recent results, I suspect more and more scientists will announce that they never believed the earlier results in the first place. Too bad they never said that in print.


van Bakel, H., Nislow, C., Blencowe, B. and Hughes, T. (2010) Most "Dark Matter" Transcripts Are Associated With Known Genes. PLoS Biology 8: e1000371 [doi:10.1371/journal.pbio.1000371]

17 comments :

Georgi Marinov said...

Just to correct one thing: 30bp of sequence is what people were getting in 2008, now it's 75-100bp and you can sequence both ends of the fragment. The paper in question uses 2x50bp reads, however they only sequenced 23M reads per sample on average. Which is significant (and outdated too, the new HiSeq/SOLID4 instruments get 100M reads in a single lane so massive improvements in read numbers are coming soon), as I will explain in a second.

One of the fundamental differences between tiling arrays and RNA-Seq is that RNA-Seq is a digital measurement, while arrays are an analog one. So with arrays there is the possibility of compressing the dynamic range of the assay and seeing a lot more of the truly rare stuff that you would have to sequence billions of RNA-Seq reads to get to. Which will happen in the not so distant future, but we don't have that now, and the paper in question certainly hasn't done it either.
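The dependence on read depth can be put into rough numbers. This is a minimal sketch, assuming each read is an independent draw so that the read count for a given transcript is approximately Poisson. The function name and the transcript fractions are hypothetical; only the 23M read depth comes from the discussion above.

```python
import math

def detection_probability(total_reads, transcript_fraction):
    """Probability of seeing at least one read from a transcript.

    Assumes reads are independent draws, so the number of reads from the
    transcript is roughly Poisson with mean total_reads * transcript_fraction.
    """
    expected = total_reads * transcript_fraction
    return 1.0 - math.exp(-expected)

# 23 million reads is the average depth quoted above; the fractions are made up.
for fraction in (1e-6, 1e-8, 1e-10):
    p = detection_probability(23_000_000, fraction)
    print(f"fraction {fraction:.0e}: P(at least one read) = {p:.4f}")
```

Under this simple model, a transcript that contributes less than about one read in the total library is more likely to be missed than seen, which is why depth matters for the rare stuff.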

While I am no fan of the "The whole genome is transcribed, let's celebrate" spin of the data, the paper in question by no means puts an end to the discussion. The genome may very well be transcribed at relatively low levels, with those transcripts being degraded very quickly so that they become very hard to detect. Which does not mean that those transcripts have any function, or that even the process of transcription itself is important (we know it is for some types of heterochromatin assembly processes, for example), which is a more likely possibility, but still not supported by sufficient evidence.

Sean Eddy said...

Georgi, Figure 1A and 1B in the paper already address your point. The authors show strong evidence that RNAseq is far more sensitive than tiled arrays.

Georgi Marinov said...

I don't see how it does that.

Larry Moran said...

Georgi Marinov says,

Just to correct one thing: 30bp of sequence is what people were getting in 2008, now it's 75-100bp and you can sequence both ends of the fragment.

I'm aware of the optimistic claims in the latest papers. However, in this paper the authors were concerned about the stringency of their data so they restricted their hits to the first 25-28 bases allowing for one mismatch.

I agree with the rest of your comment. What this paper claims is that any remaining intergenic RNA must be confined to the occasional transcript every few cell generations. Such rare transcripts are much more compatible with accident than design, don't you think?

Larry Moran said...

to Sean Eddy,

You were the "academic" editor for this paper. I know you have an interest in the topic so what's your take on the earlier literature?

Papers were being published without any attempt to account for possible artifacts and without any attempt to mention that accidental transcription was a serious possibility. How did those papers get by reviewers and editors?

Why did real scientific papers become indistinguishable from press releases?

Georgi Marinov said...

Such rare transcripts are much more compatible with accident than design, don't you think?

Where have I mentioned anything about design? And I didn't say that they are extremely rarely transcribed, there is a difference between RNA levels and transcriptional activity. Things may be getting transcribed, because for some reason the process of transcription itself is important or for no reason at all, and then degraded very quickly.

Georgi Marinov said...

Also, remember that ENCODE is being done genome-wide right now so there will be more on the subject in the near future. Here is some of the data that has been publicly released and you may want to take a look at:

http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=160136038&c=chrX&g=wgEncodeCshlShortRnaSeq

http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=160136038&c=chrX&g=wgEncodeRikenCage

Alex said...

What amuses me is that Dr. Blencowe is an author on the paper you're praising right now, whereas not two weeks ago you expressed scepticism about his paper on the "splicing code".

Speaking of big biology papers, there's this one, doi:10.1126/science.1176495, which I haven't read yet but looks really neat

Anonymous said...

It's good to finally see results that make sense in the context of what is already known. "The crazier it sounds the better" attitude of Nature's editors really is a powerful force.

Larry Moran said...

Dunbar says,

What amuses me is that Dr. Blencowe is an author on the paper you're praising right now, whereas not two weeks ago you expressed scepticism about his paper on the "splicing code".

It amuses me too. Blencowe seems to have never met a splicing alternative that he doesn't believe and yet here he is on a paper that questions the significance of most low level transcripts.

Go figure.

I note that his contribution on the van Bakel paper is minor and he is not listed as one of the people who actually wrote the paper.

jbw said...

Reading the article raised the following question in my mind:

Are the parts of the DNA that are not transcribed capable of being transcribed? Is there some defect in the sequence or are they not transcribed because of the machinery of transcription?

If the sequence is OK, do they code for proteins?

DK said...

jbw:
Are the parts of the DNA that are not transcribed capable of being transcribed? Is there some defect in the sequence or are they not transcribed because of the machinery of transcription?

Any part of DNA has *some* potential to be transcribed. With some low probability, various transcription factors can bind to any piece of DNA and lead to the formation of what is called "transcription initiation complex". Once this happens, there will be some RNA made. Most will be very short but some will look "normal". This is what Larry refers to as "transcription noise".

If the sequence is OK, do they code for proteins?

Some low percentage may even have sequence that's enough to encode a polypeptide (i.e., ATG followed by >50 in-frame codons before hitting TAA/TAG/TGA). The majority of those will code for "garbage" proteins that won't fold into anything functional.
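A minimal sketch of that kind of scan, assuming the standard genetic code (stop codons TAA, TAG and TGA) and looking at one strand only. The function name and the choice to report only the first ATG in each open stretch are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code

def find_orfs(seq, min_codons=50):
    """Return (start, end) pairs for ATG...stop stretches on the given strand.

    An ORF is reported when more than min_codons in-frame codons lie between
    the ATG and the stop codon. Coordinates are 0-based and end is the index
    just past the stop codon. Only the first ATG in each stretch is used.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                codons_between = (i - start) // 3 - 1  # exclude the ATG itself
                if codons_between > min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

# Toy sequence (made up): ATG, 60 alanine codons, then a stop.
toy = "ATG" + "GCT" * 60 + "TAA"
print(find_orfs(toy))
```

Finding such a stretch says nothing about whether the polypeptide would ever be made or would fold. It only shows that the sequence passes the length test described above.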

jbw said...

Thanks for the answer. Another question. Which evolved first, DNA or proteins?

Georgi Marinov said...

Most likely proteins, if RNA was first

jbw said...

So RNA, then proteins, then DNA. Does this mean the first RNA was junk RNA and that protein coding RNA evolved from junk RNA?

Georgi Marinov said...

Junk RNA probably arose very early

Anonymous said...

I'll sign.

Stephen Anstey
Student, Memorial University
St. John's, NL