On 5 September 2012, initial results of the project were released in a coordinated set of 30 papers published in the journals Nature (6 papers), Genome Biology (6 papers), and Genome Research (18 papers). The summary publication claimed to have assigned a biochemical function to 80% of the human genome. Much of this functional non-coding DNA is involved in regulating the expression of coding genes. Furthermore, the expression of each coding gene is controlled by multiple regulatory sites located both near to and distant from the gene. These results demonstrate that gene regulation is far more complex than was previously believed.
Genome-wide association studies have determined that approximately 90% of single-letter differences in sequences that are associated with various diseases fall outside of protein coding regions. Previously it was not clear how these sequence differences could influence disease; however, new gene regulatory sites discovered by the ENCODE project in many cases provide an explanation.
The human genome consists of just over 3 billion DNA base pairs. The Human Genome Project, completed in 2003, sequenced the entire genome of one specific person. In the years since then, the genomes of many other individuals have been sequenced, partly under the auspices of the 1000 Genomes Project. Sequencing a genome produces several gigabytes of raw data, but it does not directly say anything about how the genome works. The aim of the ENCODE project is to determine which parts of the DNA are biologically active and to make an initial assessment of their functions.
The part of the DNA that has long been best understood is the exome, consisting of around 20,000 protein-coding genes. These genes, however, make up in total only around 1.5% of the DNA, and are separated from each other by long stretches of DNA that do not code for proteins. This remaining DNA includes the so-called regulome, which comprises a variety of DNA elements that in one way or another modulate the expression of protein-coding genes. It has not been clear, though, how much of the total DNA belongs to the regulome. Until recently, the majority view has been that much of the DNA is "junk": DNA that is never transcribed and has no biological function. The central goal of the ENCODE project is to map out the regulome, by determining which parts of the DNA belong to it and the mechanisms by which those parts influence gene transcription.
The project was initiated with a $12 million pilot phase to evaluate a variety of different methods for use in later stages. A number of then-existing techniques were used to analyse about 1% of the genome (30 million base pairs). The results of these analyses were evaluated based on their ability to identify regions of DNA that were known or suspected to contain functional elements. Half of the sample area studied in this phase was chosen manually, whilst the other half was selected at random. The manually chosen regions were picked for the presence of well-studied genes and the availability of comparative data. Methods evaluated included chromatin immunoprecipitation (ChIP) and quantitative PCR.
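The sizes quoted for the pilot phase follow from simple arithmetic, which the minimal sketch below works through. It assumes a rounded genome size of exactly 3 billion base pairs (the article says "just over 3 billion"); the function and constant names are hypothetical and chosen only for illustration.

```python
# Rounded figures taken from the text; the genome is in fact slightly
# larger than 3 billion base pairs, so these values are approximations.
GENOME_BP = 3_000_000_000   # approximate size of the human genome
PILOT_PERCENT = 1           # share of the genome analysed in the pilot
MANUAL_PERCENT = 50         # share of the pilot area chosen manually


def pilot_region_sizes(genome_bp=GENOME_BP,
                       pilot_percent=PILOT_PERCENT,
                       manual_percent=MANUAL_PERCENT):
    """Return (pilot, manual, random) region sizes in base pairs."""
    # Integer arithmetic avoids floating-point rounding surprises.
    pilot_bp = genome_bp * pilot_percent // 100
    manual_bp = pilot_bp * manual_percent // 100
    random_bp = pilot_bp - manual_bp
    return pilot_bp, manual_bp, random_bp


pilot_bp, manual_bp, random_bp = pilot_region_sizes()
print(f"Pilot area:        {pilot_bp:,} bp")
print(f"Manually selected: {manual_bp:,} bp")
print(f"Randomly selected: {random_bp:,} bp")
```

Under these rounded assumptions, the pilot area splits into two equal halves, one manually chosen and one randomly selected.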
The ENCODE pilot project rapidly released all of its data into public databases. The pilot phase was successfully finished and the results were published in June 2007 in Nature and in a special issue of Genome Research.
Image of ENCODE data in the UCSC Genome Browser. This shows several tracks containing information on gene regulation. The gene on the left (ATP2B4) is transcribed in a wide variety of cells. The gene on the right is only transcribed in a few types of cells, including embryonic stem cells.
In September 2007, NHGRI began funding the production phase of the ENCODE project. In this phase, the goal was to analyze the entire genome and to conduct "additional pilot-scale studies".
As in the pilot project, the production effort is organized as an open consortium. In October 2007, NHGRI awarded grants totaling more than $80 million over four years. The production phase also includes a Data Coordination Center, a Data Analysis Center, and a Technology Development Effort.
By 2010, over 1,000 genome-wide data sets had been produced by the ENCODE project. Taken together, these data sets show which regions are transcribed into RNA, which regions are likely to control the genes that are used in a particular type of cell, and which regions are associated with a wide variety of proteins. The primary assays used in ENCODE are ChIP-seq, DNase I Hypersensitivity, RNA-seq, and assays of DNA methylation.
In September 2012, the project released a much more extensive set of results in 30 papers published simultaneously in several journals: six in Nature, six in Genome Biology, and 18 in a special issue of Genome Research. The most striking finding was that the fraction of human DNA that is biologically active is considerably higher than even the most optimistic previous estimates. In an overview paper, the ENCODE Consortium reported that its members were able to assign biochemical functions to over 80% of the genome. Much of this was found to be involved in controlling the expression levels of coding DNA, which makes up less than 1% of the genome.
The most important new elements of the "encyclopedia" include:
A comprehensive map of DNase I hypersensitive sites, which are markers for regulatory DNA that is typically located adjacent to genes and allows chemical factors to influence their expression. The map identified nearly 3 million sites of this type, including nearly all that were previously known and many that are novel.
A lexicon of short DNA sequences that form recognition motifs for DNA-binding proteins. Approximately 8.4 million such sequences were found, comprising a fraction of the total DNA roughly twice the size of the exome. Thousands of transcription promoters were found to make use of a single stereotyped 50-base-pair footprint.
A preliminary sketch of the architecture of the network of human transcription factors, that is, factors that bind to DNA in order to promote or inhibit the expression of genes. The network was found to be quite complex, with factors that operate at different levels as well as numerous feedback loops of various types.
A measurement of the fraction of the human genome that is capable of being transcribed into RNA. This fraction was estimated to add up to more than 75% of the total DNA, a much higher value than previous estimates. The project also began to characterize the types of RNA transcripts that are generated at various locations.
The Nature editors and ENCODE authors "... collaborated over many months to make the biggest splash possible and capture the attention of not only the research community but also of the public at large." The ENCODE project's claim that 80% of the human genome has biochemical function was rapidly picked up by the popular press, which described the results of the project as leading to the death of junk DNA.
However, the conclusion that most of the genome is functional was severely criticized on the grounds that the ENCODE project used a far too liberal definition of "functional", namely that anything that is transcribed must be functional. This conclusion was reached despite the widely accepted view that many transcribed DNA elements, such as pseudogenes, are nevertheless non-functional. Furthermore, the ENCODE project emphasized sensitivity over specificity, leading to the detection of many false positives. The lack of appropriate control experiments was another major criticism of ENCODE, since random DNA has been reported to mimic ENCODE-like "functional" behavior.
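The trade-off between sensitivity and specificity behind this criticism can be illustrated with a small calculation. The sketch below is not ENCODE's method and uses no ENCODE data: the number of candidate sites, the functional fraction, and both assay profiles are hypothetical values chosen only to show how a permissive screen inflates the false-positive count.

```python
def screen_outcomes(total_sites, functional_fraction, sensitivity, specificity):
    """Expected outcomes of a screen that labels sites as 'functional'.

    sensitivity: share of truly functional sites that the screen detects.
    specificity: share of non-functional sites that the screen correctly rejects.
    """
    functional = total_sites * functional_fraction
    nonfunctional = total_sites - functional
    true_pos = functional * sensitivity
    false_pos = nonfunctional * (1 - specificity)
    # Precision: share of sites labelled 'functional' that really are.
    precision = true_pos / (true_pos + false_pos)
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "precision": precision,
    }


# Hypothetical scenario: 1,000 candidate sites, 10% of them truly functional.
permissive = screen_outcomes(1000, 0.10, sensitivity=0.95, specificity=0.60)
strict = screen_outcomes(1000, 0.10, sensitivity=0.70, specificity=0.95)

for name, result in (("permissive", permissive), ("strict", strict)):
    print(f"{name:>10}: "
          f"TP={result['true_positives']:.0f}, "
          f"FP={result['false_positives']:.0f}, "
          f"precision={result['precision']:.2f}")
```

In this made-up scenario the permissive screen recovers more of the truly functional sites, but most of what it labels "functional" is not. That is the pattern critics argued applies when sensitivity is favoured over specificity.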
The project has also been criticized for its high cost ($400 million) and for favoring "Soviet-style" big science, which takes money away from highly productive investigator-initiated research.
The Model Organism ENCyclopedia Of DNA Elements (modENCODE) project is a continuation of the original ENCODE project targeting the identification of functional elements in selected model organism genomes, specifically, Drosophila melanogaster and Caenorhabditis elegans. The extension to model organisms permits biological validation of the computational and experimental findings of the ENCODE project, something that is difficult or impossible to do in humans.
In late 2010, the modENCODE consortium unveiled its first set of results with publications on annotation and integrative analysis of the worm and fly genomes in Science. Data from these publications is available from the modENCODE web site.
Graur D, Zheng Y, Price N, Azevedo RB, Zufall RA, Elhaik E (2013). "On the immortality of television sets: "function" in the human genome according to the evolution-free gospel of ENCODE". Genome Biol Evol 5 (3): 578–90. doi:10.1093/gbe/evt028. PMC 3622293. PMID 23431001.
White MA, Myers CA, Corbo JC, Cohen BA (July 2013). "Massively parallel in vivo enhancer assay reveals that highly local features determine the cis-regulatory function of ChIP-seq peaks". Proc. Natl. Acad. Sci. U.S.A. 110 (29): 11952–7. doi:10.1073/pnas.1307449110. PMID 23818646. Lay summary – thefinchandpea.com.