1. Isolation of Full Length cDNAs
a. Rationale. The goal of the NSF-supported 2010 project is to identify the functions of all of the genes in the reference plant Arabidopsis. The genome is estimated to contain 26,000 genes. Identifying all the genes will lead to the identification of the proteins they encode, which is of paramount importance for elucidating the various signaling pathways and for all the other disciplines of Plant Biology. Establishing a database of the proteins encoded by the Arabidopsis genomic sequence will revolutionize Plant Biology. Two major approaches are currently used for mapping the transcriptional units in the Arabidopsis genomic sequence. (1) ESTs, which are small fragments of the mRNA transcribed from a specific gene and represent only a small portion of the encoded protein. There are approximately 8,000 unique EST sequences from Arabidopsis, representing ~40% of the potential transcription units. (2) Annotation of sequences using various programs that predict the presence and structure of genes across the Arabidopsis genome. Great progress has been made in computational biology during the past 10 years in developing gene prediction tools. The question immediately arises whether the annotation of the Arabidopsis genome sequence is accurate. The answer is simply "No", because the programs used by the various sequencing groups cannot accurately predict all the genes and their structures; the inaccuracy of the annotated sequence reflects the inaccuracy of the software used. For annotation to be precise, it needs to be experimentally verified by the isolation of full-length cDNAs. The task of annotating the entire genome is similar to the historical mapping of the North American continent. The first maps were inaccurate because the tools used did not have the precision for fine mapping. As the mapping tools improved, the maps became more accurate. A similar situation exists regarding the annotation of various genomes.
We propose here to construct 15,000 full-length cDNAs identified by global expression studies using gene chip technology. This is the most technically difficult part of our proposal, but its successful completion will allow the isolation of full-length cDNAs for the entire Arabidopsis genome. The full-length cDNAs will be deposited in the Arabidopsis Biological Resource Center to be used by the entire Plant Biology community for protein biochemical studies. We define a full-length cDNA as one that contains part of the 5' UTR (+1) and the polyadenylation site.
b. Isolation of PolyA
d. The UPS cloning system. The availability of the full-length cDNAs cloned into these novel vectors will greatly facilitate their use by the Plant Biology community. Recently, a new cloning method was described by Steve Elledge's laboratory (33) that facilitates the rapid and systematic construction of recombinant DNA molecules. The central cloning method is named the univector plasmid-fusion system (UPS). The UPS uses Cre-lox site-specific recombination to catalyze plasmid fusion between the univector (a plasmid containing the gene of interest) and host vectors containing regulatory information. Fusion events are genetically selected and place the gene under the control of new regulatory elements. A second UPS-related method allows the precise transfer of only the coding sequence from the univector into a host vector. The UPS eliminates the need for restriction enzymes, DNA ligases and many of the in vitro manipulations required for cloning, and allows the rapid construction of multiple constructs for expression in multiple organisms.
Unlike the conventional 'cut-and-paste' strategy of restriction-enzyme-based methods, UPS assembles recombinant DNA by plasmid fusion through site-specific recombination. UPS can be used to fuse a coding region of interest either with a specific promoter to gain novel transcriptional regulation, or with another coding sequence to produce a fusion protein with new properties. UPS eliminates the use of restriction enzymes and DNA ligase; instead, both functions are carried out simultaneously by a single enzyme, Cre. This relieves the constraints on cloning vectors with respect to DNA sequence and size, because the UPS reaction is independent of vector size or sequence. Furthermore, the time-consuming steps inherent in conventional cloning, such as identifying a suitable vector, designing a cloning strategy, restriction endonuclease digestion, agarose gel electrophoresis, isolation of DNA fragments, and the ligation reaction, are replaced by a single 20-minute UPS reaction. Because of the uniformity and simplicity of UPS, dozens of constructs can be made simultaneously simply by using different recipient vectors. In addition, unlike restriction enzymes and DNA ligases, GST-Cre can be made inexpensively in large quantities. These features will save investigators significant amounts of time and expense. Furthermore, these recombination-based technologies open new avenues for the systematic manipulation and processing of large gene sets in parallel, a feature that is essential for future functional genomic research (33).
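The plasmid fusion described above can be pictured with a small Python sketch. It is only a schematic model, not the published UPS protocol: each plasmid is treated as a circular list of feature names, each carries exactly one loxP site, and both lists are assumed to be read in the same direction relative to the loxP orientation. The feature names other than loxP (cDNA, KanR, AmpR, ori_uni, ori_host, promoter) and the function names are hypothetical placeholders chosen for illustration.

```python
def rotate_to(features, site):
    """Rotate a circular feature list so that it starts at `site`."""
    i = features.index(site)
    return features[i:] + features[:i]


def cre_fuse(univector, host, site="loxP"):
    """Schematic Cre-mediated fusion of two circular plasmids.

    Each input must carry exactly one copy of `site`. The result is the
    cointegrate, again a circular list, with one copy of the site at each
    junction between the two parental plasmids.
    """
    for name, plasmid in (("univector", univector), ("host", host)):
        if plasmid.count(site) != 1:
            raise ValueError(f"{name} must carry exactly one {site} site")
    uni = rotate_to(univector, site)  # starts at loxP, then the cDNA
    hst = rotate_to(host, site)       # starts at loxP, ends at the promoter
    # Crossing over at the shared site joins the two circles into one.
    return uni + hst


# Hypothetical feature layouts (illustrative only).
univector = ["loxP", "cDNA", "KanR", "ori_uni"]
host = ["promoter", "loxP", "AmpR", "ori_host"]

fused = cre_fuse(univector, host)
print(fused)
```

Because the list is circular, the last element (the promoter) sits immediately upstream of the first loxP site and the cDNA, which is the arrangement that places the gene under the control of the host vector's regulatory element.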
2. Sequencing the Full Length cDNAs
We propose to isolate and sequence 15,000 full-length cDNAs. The average length of each cDNA is estimated to be 2 kb. Thus, 30 Mb of sequence will be produced over the course of this grant. Currently, we (the SPP Consortium, SPPC) are sequencing the 30 Mb of chromosome 1 of Arabidopsis, which is estimated to be completed between the end of 2000 and the beginning of 2001. The sequencing of full-length cDNAs will be initiated at the end of 2000 (a year after the start date of this proposal). Thus, as the SPPC completes its part of the Arabidopsis Genome Sequencing Project, it will move on to sequencing 30 Mb of full-length cDNAs. This arrangement is highly efficient because the human and instrumentation resources that were established for the Arabidopsis genomic sequence will now be used for sequencing the full-length cDNAs.
a. Sequencing Strategy. Accurate sequence of the 15,000 cDNAs will provide experimental verification of the annotation and will allow the precise determination of the molecular mass of each protein encoded by the Arabidopsis genes. To facilitate rapid, accurate and efficient sequencing of the cDNAs, we will follow the concatenation cDNA sequencing (CCS) procedure, in which multiple short DNA molecules (1-2 kb) are first ligated to form long DNA fragments, or concatemers, which are then randomly sheared and sequenced (56), exactly as is currently done for genomic BAC clones 100 kb in size (PGEC: http://pgec-genome.pw.usda.gov; PENN: http://genome.bio.upenn.edu; Stanford: http://sequence-www.stanford.edu/ara/ArabidopsisSeqStanford.html ). The pUN120 clones will be digested with NotI/SfiI and separated by agarose gel electrophoresis, and the cDNA inserts will be purified by agarase treatment according to the manufacturer's instructions. The cDNAs will then be ligated to form concatemers. The boundaries between individual cDNAs are recognized at the computer editing stage by virtue of the restriction sites used when the clones were first isolated; in our case, these will be NotI/SfiI. The sites are electronically "cut" prior to computer assembly, and as a result each cDNA sequence is ultimately identified as an individually assembled contiguous sequence (contig). CCS is an alternative to oligonucleotide walking and deletion library construction (56).
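The electronic "cutting" step described above can be illustrated with a short Python sketch. This is not the software used by the Consortium; it is a simplified illustration that assumes the standard recognition sequences of NotI (GCGGCCGC) and SfiI (GGCCNNNNNGGCC), splits a sequence wherever either site occurs, and discards the site itself. The function name `split_concatemer`, the `min_len` parameter, and the toy sequences are hypothetical. A real pipeline would also have to handle sequencing errors within the sites and cDNAs that contain an internal site.

```python
import re

# Recognition sequences: NotI = GCGGCCGC, SfiI = GGCCNNNNNGGCC (N = any base).
SITE = re.compile(r"GCGGCCGC|GGCC[ACGTN]{5}GGCC")


def split_concatemer(seq, min_len=1):
    """Split a concatemer sequence at every NotI or SfiI recognition site.

    The recognition sites themselves are dropped (a simplification), and
    pieces shorter than `min_len` bases are discarded.
    """
    pieces = SITE.split(seq.upper())
    return [p for p in pieces if len(p) >= min_len]


# Toy example: three short "cDNAs" joined through one NotI and one SfiI site.
cdnas = ["ATATATAT", "TTTTAAAATT", "ACACACAC"]
concatemer = cdnas[0] + "GCGGCCGC" + cdnas[1] + "GGCCATTTAGGCC" + cdnas[2]

print(split_concatemer(concatemer))
```

In this toy case the three input fragments are recovered unchanged, which mirrors how each cDNA ends up as its own contig once the junction sites have been removed.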
A summary of the sequencing approach is shown below in Fig. 1. This strategy has been very successful for sequencing chromosomes of Arabidopsis.
b. Operation of the SPP Consortium. Each member group of the SPPC performs a unique role to serve the Consortium. Joe Ecker's and Ron Davis's laboratories will send the cDNA clones to A. Theologis' lab, where the shotgun libraries will be constructed in M13. The libraries will be sent to Ron Davis' Genome Center, where they will be electroporated into E. coli, plaques will be picked and grown, and templates will be prepared using the high-throughput template preparation robot. Subsequently, the templates will be distributed equally among the three laboratories for sequencing. Each site is responsible for a given concatemer, so all the templates prepared from a given clone are sent to the appropriate site, where the sequence is generated, assembled, and submitted to GenBank.
The Consortium will provide the following sequencing capacity towards proposed sequencing: Stanford, U. Penn.
A MegaBACE capillary sequencer has been requested by PENN and PGEC to replace the 377s after the Arabidopsis genome-sequencing project is completed.
c. Construction of Sequencing Libraries. M13 libraries are constructed with concatemer DNA isolated with a protocol from the MIT Genome Center, which was optimized at the PGEC site. The DNA is randomly fragmented using the hydrodynamic shearing process developed at the Stanford Center by Dr. Peter Oefner (38). This method uses an HPLC pump with a manifold valve and a stainless steel tee, with recirculation of the DNA sample. The size of the sheared DNA can be changed by varying the flow rate as the DNA is recirculated through the tee. This shearing procedure has been used for the construction of all 95 M13 libraries used in the Arabidopsis sequencing efforts by the SPPC. The clones that have been sequenced from these libraries show a random distribution of ends and are distributed relatively evenly across the entire clone/genome. The ends of the sheared DNA are repaired with T4 DNA polymerase, and adaptors containing non-complementary cohesive ends are ligated, as diagrammed in the figure below.
The excess adaptors are removed by agarose gel electrophoresis, and the adapted DNA is ligated into an M13 cloning vector. The M13 vector is cleaved with HindIII, and a single-stranded non-complementary 9-mer (linker) is ligated at the HindIII site, generating an end complementary to the cohesive end of the adapted DNA (see Figure 2). This cloning procedure is highly efficient and does not require phosphatase treatment of the vector. More importantly, it reduces chimera formation, which can cause considerable difficulty during sequence assembly.
d. Genome Sequencing Technologies and Strategies. The new instrumentation developed at Stanford for the "front-end" of the sequencing process was designed and constructed with funding from the NCHGR. These instruments have been used at Stanford throughout the current Arabidopsis sequencing project to pick and grow M13 clones and to prepare template DNA for all three sites of the SPP Consortium.
The next step in the sequencing process after DNA template production is preparation of the sequencing reactions. Instrumentation currently in use for this process includes the ABI TurboCatalyst (Stanford and PGEC), which produces one 96-well plate of dye-primer reactions as an ethanol precipitate in ~8 hours, including all pipetting and thermal cycling steps. The alternative method involves setting up the reactions manually with a multi-channel pipettor or with a robot such as the Hamilton, Robbins, or Biomek 2000, transferring them to a PE9700 or MJ thermal cycler, and purifying the products by ethanol precipitation or column chromatography (all three sites of the SPPC). The latter method is faster than the Catalyst but more labor-intensive. The sequencing reactions are then run on ABI377 Sequencers.
In the first year of our current sequencing project, most of the shotgun phase of sequencing was carried out using dye primer chemistry with ThermoSequenase (Amersham) and Operon dye-labeled primers. ABI terminator chemistry was used only for finishing reactions, since terminator reactions were 4-5X more expensive than dye primer reactions. Recently, however, ABI has released new reagents that use TaqFS and BigDye labeling for both dye primer and dye terminator chemistry. The BigDye reagents are energy-transfer dyes with increased sensitivity, resulting in more reliable detection even with low amounts of template DNA. Amersham has a comparable reagent in its ET-primers, but no comparable terminator chemistry. We have also been able to use dilutions of the BigDye chemistry without affecting sequence quality, thereby lowering the cost. Because the cost of the BigDye terminator reactions is now ~25% of the previous cost, we expect to do a higher proportion of the shotgun coverage with dye terminator chemistry. This will also help with throughput requirements, since dye terminator reactions are done in a single reaction plate compared to four plates for dye primer chemistry.
A combination of approaches is being used for finishing the sequences to high accuracy. After assembly of the shotgun data with phredPhrap, if gaps exist, primers are synthesized at the ends of the contigs to walk further on existing templates if possible, to PCR across the gap, or to walk on the plasmid clone. Regions of low quality or low coverage are identified and designated for resequencing with an alternative chemistry. The cDNA sequence will be compared to the genomic sequence, and if discrepancies exist, the cDNA will be compared with the genomic DNA using denaturing HPLC (20,34,39,51,52).
e. Sequence Quantity and Quality. In order to complete 30 Mb of cDNA sequence in 4 years (cDNA sequencing will be initiated during the second year of the project), 7.5 Mb of sequence will be produced per year at the three SPP sites, or 2.5 Mb/year/site. This corresponds to 75 shotgun libraries/year to be constructed by the PGEC site.
30 Mb sequence produced/4 years = 7.5 Mb/year
5x shotgun coverage x 7.5 Mb = 37.5 x 10^6 raw bases/year
with an 80% overall success rate, 45 x 10^6 raw bases/year will be required.
[45 x 10^6 raw bases/year]/[400 bases/lane] = 112,500 lanes/year
[112,500 lanes/yr]/[48 weeks/yr] = 2344 lanes/week
[2344 lanes/week]/[5 days/wk] = 468 lanes/day
[468 lanes/day]/[96 lanes/gel] = 5 gels/day
[5 gels/day]/3 sites = 2 gels/day/site
The current sequencing capacity of the SPP is 10 gels/day/site.
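The throughput arithmetic above can be reproduced with a few lines of Python, which makes it easy to see how the daily gel load responds to a change in coverage or read length. This is a minimal sketch: the function and parameter names are hypothetical, and the 80% success rate is modeled as a 20% overhead on the raw bases (37.5 x 10^6 x 1.2 = 45 x 10^6), which is the reading that matches the figure quoted above. Dividing by 0.8 instead would give about 46.9 x 10^6 raw bases/year, and the per-site load would still come to about 2 gels/day.

```python
import math


def sequencing_load(total_mb=30.0, years=4, coverage=5, overhead=0.20,
                    bases_per_lane=400, weeks_per_year=48, days_per_week=5,
                    lanes_per_gel=96, sites=3):
    """Return the yearly and daily sequencing load implied by the inputs."""
    finished_per_year = total_mb * 1e6 / years        # finished bases/year
    raw_per_year = finished_per_year * coverage       # shotgun raw bases/year
    raw_required = raw_per_year * (1 + overhead)      # allow for failed lanes
    lanes_per_year = raw_required / bases_per_lane
    lanes_per_day = lanes_per_year / weeks_per_year / days_per_week
    gels_per_day = lanes_per_day / lanes_per_gel
    return {
        "raw_required": raw_required,
        "lanes_per_year": lanes_per_year,
        "lanes_per_day": lanes_per_day,
        # Gels are whole units, so round up.
        "gels_per_day": math.ceil(gels_per_day),
        "gels_per_day_per_site": math.ceil(gels_per_day / sites),
    }


load = sequencing_load()
for key, value in load.items():
    print(f"{key}: {value}")
```

With the default values the sketch gives 5 gels/day in total and 2 gels/day/site, well below the stated capacity of 10 gels/day/site.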
f. Costs of Production Sequencing. We are requesting a total of $7.5 million (50% of the proposed budget) to complete the sequence of 30 Mb of cDNAs, resulting in a cost of $0.25/base. This cost is based on the actual data obtained during the Arabidopsis sequencing project by the SPPC, which had a cost of $0.50/base, but with 10x coverage. It is lower than that of the genomic sequence because the coverage for the cDNA sequence will be 5x.
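As a quick consistency check on the cost figures above, the following Python sketch computes the cost per finished base from the requested amount and compares it with the genomic cost rescaled from 10x to 5x coverage. The assumption that cost scales linearly with shotgun coverage is a simplification made here for illustration, and the function names are hypothetical.

```python
import math


def cost_per_base(total_dollars, finished_bases):
    """Cost per finished base for a given budget and amount of sequence."""
    return total_dollars / finished_bases


def scale_cost_by_coverage(cost, old_coverage, new_coverage):
    """Rescale a per-base cost, assuming cost is proportional to coverage."""
    return cost * new_coverage / old_coverage


requested = cost_per_base(7.5e6, 30e6)             # $7.5 million for 30 Mb
rescaled = scale_cost_by_coverage(0.50, 10, 5)     # genomic cost at 5x

print(f"requested: ${requested:.2f}/base, rescaled genomic: ${rescaled:.2f}/base")
```

Both routes give the same per-base figure, which is consistent with the explanation that the lower cDNA cost follows from the lower coverage.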
1. Abel, S. and Theologis, A. 1996. Early genes and auxin action. Plant Physiol., 111:9-17.