| Field | Value |
|---|---|
| Content | |
| Description | whole-genome data |
| Contact | |
| Research center | University of California Santa Cruz |
| Laboratory | Center for Biomolecular Science and Engineering |
| Authors | Brian J Raney[1] |
| Primary citation | PMID 21037257 |
| Release date | 2010 |
| Access | |
| Website | encodeproject.org |
The Encyclopedia of DNA Elements (ENCODE) is a public research consortium[2] launched by the US National Human Genome Research Institute (NHGRI) in September 2003.[1][3][4][5] Its goal is to find all functional elements in the human genome, which makes it one of NHGRI's most critical projects since the completion of the Human Genome Project. All data generated in the course of the project are released rapidly into public databases.
On 5 September 2012, initial results of the project were released in a coordinated set of 30 papers published in the journals Nature (6 papers), Genome Biology (6 papers), and Genome Research (18 papers).[6][7] Taken together, these publications show that approximately 20% of noncoding DNA in the human genome is functional, while an additional 60% is transcribed with no known function.[8] Much of this functional noncoding DNA is involved in regulating the expression of coding genes.[9] Furthermore, the expression of each coding gene is controlled by multiple regulatory sites, some located near the gene and some distant from it. These results demonstrate that gene regulation is far more complex than was previously believed.[10]
Genome-wide association studies have determined that approximately 90% of the single-letter sequence differences associated with various diseases fall outside protein-coding regions. Previously it was not clear how these sequence differences could influence disease. In many cases, the new gene regulatory sites discovered by the ENCODE project provide an explanation.[2]
The human genome consists of just over 3 billion DNA base pairs. The Human Genome Project, completed in 2003, sequenced the entire genome for one specific person. In the years since then, the genomes of many other individual people have been sequenced, partly under the auspices of the 1000 Genomes Project. Sequencing a genome produces several gigabytes of raw data, however, and does not directly say anything about how the genome works. The aim of the ENCODE project is to determine which parts of the DNA are biologically active and to make an initial assessment of their functions.
The part of the DNA that has long been best understood is the exome, consisting of around 20,000 protein-coding genes. These genes, however, make up in total only around 1.5% of the DNA, and are separated from each other by long stretches of DNA that do not code for proteins. This remaining DNA includes the so-called regulome, which comprises a variety of DNA elements that in one way or another modulate the expression of protein-coding genes. It has not been clear, though, how much of the total DNA belongs to the regulome. Until recently, the majority view has been that much of the DNA is "junk", that is, DNA that is never transcribed and has no biological function. The central goal of the ENCODE project is to map out the regulome by determining which parts of the DNA belong to it and the mechanisms by which those parts influence gene transcription.[11]
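As a back-of-envelope illustration, the proportions quoted above can be converted into base-pair counts. The sketch below assumes a round genome size of 3 billion base pairs (the text says "just over") and takes the approximate 1.5% and 20,000-gene figures at face value. The function name and the per-gene average are hypothetical conveniences for illustration, not ENCODE results.

```python
from fractions import Fraction

# Approximate figures taken from the text; all are rounded assumptions.
GENOME_BP = 3_000_000_000          # "just over 3 billion" base pairs, rounded
CODING_SHARE = Fraction(15, 1000)  # "around 1.5%" of the DNA
GENE_COUNT = 20_000                # "around 20,000" protein-coding genes


def coding_breakdown(genome_bp=GENOME_BP, coding_share=CODING_SHARE, genes=GENE_COUNT):
    """Split the genome into protein-coding and non-coding base pairs."""
    coding_bp = int(genome_bp * coding_share)
    noncoding_bp = genome_bp - coding_bp
    # Mean coding length per gene: a crude average, for illustration only.
    mean_coding_bp_per_gene = coding_bp // genes
    return coding_bp, noncoding_bp, mean_coding_bp_per_gene


coding, noncoding, per_gene = coding_breakdown()
print(f"coding: {coding:,} bp; non-coding: {noncoding:,} bp; ~{per_gene:,} bp per gene")
```

Under these assumptions, about 45 million base pairs are protein-coding, which leaves nearly 2.96 billion base pairs whose role ENCODE set out to characterize.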
The project was initiated with a $12 million pilot phase to evaluate a variety of methods for use in later stages. A number of then-existing techniques were used to analyse about 1% of the genome (30 million base pairs). The results of these analyses were evaluated on their ability to identify regions of DNA that were known or suspected to contain functional elements. Half of the sample area studied in this phase was selected manually, and the other half was selected at random.[12] The manually chosen regions were picked for the presence of well-studied genes and the availability of comparative data. Methods evaluated included chromatin immunoprecipitation (ChIP) and quantitative PCR.
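The size of the pilot sample follows directly from these percentages. A minimal sketch, assuming a round genome size of 3 billion base pairs; the function and key names are hypothetical and do not correspond to any ENCODE software:

```python
from fractions import Fraction


def pilot_budget(genome_bp=3_000_000_000,
                 pilot_share=Fraction(1, 100),
                 manual_share=Fraction(1, 2)):
    """Return the base pairs covered by the pilot and its two halves."""
    pilot_bp = int(genome_bp * pilot_share)      # about 1% of the genome
    manual_bp = int(pilot_bp * manual_share)     # manually selected half
    random_bp = pilot_bp - manual_bp             # randomly selected remainder
    return {"pilot_bp": pilot_bp, "manual_bp": manual_bp, "random_bp": random_bp}


budget = pilot_budget()
for name, bp in budget.items():
    print(f"{name}: {bp / 1_000_000:.0f} Mb")
```

With these inputs the pilot covers 30 Mb, split into 15 Mb of manually selected regions and 15 Mb of randomly selected regions.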
The ENCODE pilot project rapidly released all of its data into public databases.[13] The pilot phase was completed successfully, and its results were published in June 2007 in Nature[4] and in a special issue of Genome Research.[14]
In September 2007, NHGRI began funding the production phase of the ENCODE project. In this phase, the goal was to analyze the entire genome and to conduct "additional pilot-scale studies".[15]
As in the pilot project, the production effort is organized as an open consortium. In October 2007, NHGRI awarded grants totaling more than $80 million over four years.[16] The production phase also includes a Data Coordination Center, a Data Analysis Center, and a Technology Development Effort.[17]
By 2010, over 1,000 genome-wide data sets had been produced by the ENCODE project. Taken together, these data sets show which regions are transcribed into RNA, which regions are likely to control the genes that are used in a particular type of cell, and which regions are associated with a wide variety of proteins. The primary assays used in ENCODE are ChIP-seq, DNase I hypersensitivity assays, RNA-seq, and assays of DNA methylation.
In September 2012, the project released a much more extensive set of results in 30 papers published simultaneously in several journals: six in Nature, six in Genome Biology, and 18 in a special issue of Genome Research.[18] The most striking finding was that the fraction of human DNA that is biologically active is considerably higher than even the most optimistic previous estimates. In an overview paper, the ENCODE Consortium reported that its members were able to assign biochemical functions to over 80% of the genome.[9] Much of this was found to be involved in controlling the expression levels of coding DNA, which makes up less than 1% of the genome.
The most important new elements of the "encyclopedia" include:
The Model Organism ENCyclopedia Of DNA Elements (modENCODE) project is a continuation of the original ENCODE project targeting the identification of functional elements in selected model organism genomes, specifically, Drosophila melanogaster and Caenorhabditis elegans.[23] The extension to model organisms permits biological validation of the computational and experimental findings of the ENCODE project, something that is difficult or impossible to do in humans.[23]
Funding for the modENCODE project was announced by the National Institutes of Health (NIH) in 2007 and included several different research institutions in the US.[24][25]
In late 2010, the modENCODE consortium unveiled its first set of results with publications on annotation and integrative analysis of the worm and fly genomes in Science.[26][27] Data from these publications is available from the modENCODE web site.[28]
An analysis of transcription factor binding data generated by the ENCODE project is available in the web-accessible repository FactorBook.[29]