Wednesday, August 30, 2017

Experts meet to discuss non-coding RNAs - fail to answer the important question

The human genome is pervasively transcribed. More than 80% of the genome is complementary to transcripts that have been detected in some tissue or cell type. The important question is whether most of these transcripts have a biological function. How many genes are there that produce functional non-coding RNA?

There's a reason why this question is important. It's because we have every reason to believe that spurious transcription is common in large genomes like ours. Spurious, or accidental, transcription occurs when the transcription initiation complex binds nonspecifically to sites in the genome that are not real promoters. Spurious transcription also occurs when the initiation complex (RNA polymerase plus factors) fires in the wrong direction from real promoters. Binding and inappropriate transcription are aided by the binding of transcription factors to nonpromoter regions of the genome, a well-known feature of all DNA binding proteins [see Are most transcription factor binding sites functional?].

The controversy over the role of these transcripts has been around for many decades but it has become more important in recent years as many labs have focused on identifying transcripts. After devoting much time and effort to the task, these groups are not inclined to admit they have been looking at junk RNA. Instead, they tend to focus on trying to prove that most of the transcripts are functional.

Keep in mind that the correct default explanation is that a transcript is just spurious junk unless someone has demonstrated that it has a function. This is especially true of transcripts that are present at less than one copy per cell, are not conserved in other species, and have only been detected in a few types of cells. That's the majority of transcripts.

Nobody knows how many different transcripts have been detected since there's no comprehensive database that combines all of the data. I suspect there are several hundred thousand different transcripts. Human genome annotators have struggled to represent this data accurately. They have rejected or ignored most of the transcripts and focused on those that are most likely to have a biological function. Unfortunately, their criteria for functionality are weak and this leads them to include a great many putative genes in their annotated genome. For example, the latest annotation by Ensembl lists 22,521 genes for noncoding RNAs. This is slightly more than the total number of protein-coding genes (20,338) [Human assembly and gene annotation].

It's important to note two things about the work of these annotators. First, they have correctly rejected most of the transcripts. Second, they cannot provide solid evidence that most of those 22,521 putative noncoding RNA genes are actually functional. What they really should be saying is that these are the best candidates for real genes.

The experts held a meeting recently in Heraklion, Greece (June 9-14, 2017). You would think that a major emphasis in that meeting would have been on identifying how many of these transcripts are biologically functional but that doesn't seem to have been a major theme according to the brief report published in Genome Biology [Canonical mRNA is the exception, rather than the rule].

Let's look at what the authors have to say about the important question.

"Investigations into gene regulation and disease pathogenesis have been protein-centric for decades. However, in recent years there has been a profound expansion in our knowledge of the variety and complexity of eukaryotic RNA species, particularly the non-coding RNA families. Vast amounts of RNA sequencing data generated from various library preparation methods have revealed these non-coding RNA species to be unequivocally more abundant than canonical mRNA species."

This is very misleading. It's certainly true that there are far more than 20,000 transcripts but that's not controversial. What's controversial is how many of those transcripts are functional and how many genes are devoted to producing those functional transcripts.

The report on the meeting doesn't offer an opinion on that matter unless the authors are referring only to functional RNA species. I get the impression that most of the people who attend these meetings are reluctant to state unequivocally whether there's convincing evidence of function for more than 5,000 RNAs. I don't think that evidence exists. Until it does, the default scientific position is that there are far fewer genes for functional noncoding RNAs than for proteins.


  1. Actually, the evidence is for no more than 19,000 protein-coding genes (Ezkurdia I, Juan D, Rodriguez JM, Frankish A, Diekhans M, Harrow J, Vazquez J, Valencia A, Tress ML (November 2014). "Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes". Human Molecular Genetics. 23 (22): 5866–78. PMC 4204768. PMID 24939910. doi:10.1093/hmg/ddu309.)

    1. How many genes do we have and what happened to the orphans?

      There are many estimates of the total number of protein-coding genes. The best data indicate a number closer to 19,000 than 20,000 but it's impossible to come up with a precise number at the present time.

      The fact that Ensembl annotators still list more than 20,000 protein-coding genes shows us that their criteria are not up to date. In fairness, it's a lot of work to confirm each and every putative gene.

      How many proteins in the human proteome?

      How many proteins do humans make?

  2. "Keep in mind that the correct default explanation is that a transcript is just spurious junk unless someone has demonstrated that it has a function. This is especially true of transcripts present at less than one copy per cell; are not conserved in other species; and have only been detected in a few types of cells."

    The point about absolute numbers seems to be incontestable. And I think it is equally true that the same (or modified) mRNAs should be found in related species. But it's not clear that being found in only certain types of cell is a priori a sign they have no function. Shouldn't the possibility that they play a role in cellular differentiation be ruled out?

    1. Larry's point is that the 'null hypothesis' that a transcript is spurious would be weakened by showing that a transcript is found in many cell types.