MBGP: Introduction: Why choose SAGE?

Melbourne Brain Genome Project

Why choose SAGE?

Because SAGE relies on sequencing to identify genes, it may be considered a variant of expressed sequence tag (EST) analysis. Large-scale EST sequencing is somewhat quantitative and an effective approach to gene discovery, but it is laborious because of the length of the clones and the high level of redundancy: more than 2 million human ESTs collapse by UniGene clustering to only approximately 86,000 unique genes (dbEST release, October 2000). In comparison, by reducing the sequencing effort to the minimum sequence length required for unambiguous transcript identification, SAGE achieves an approximately 40-fold increase in efficiency over EST sequencing. Techniques such as cDNA arrays (Schena 1995) or oligonucleotide arrays (Lockhart 1996) have been used to compare the expression of thousands of genes in a variety of tissues, but they are limited to analysing previously identified transcripts. Aside from its independence from prior knowledge of a transcript, other important advantages of SAGE include:
i) expression data is in a standard format allowing ready comparison of data sets from different experiments and labs
ii) the ability to cumulatively acquire, store and reanalyse data as genome projects are significantly advanced or completed
iii) the generation of absolute rather than relative quantitative expression data
iv) the existence of large public SAGE data sets for numerous human tissues, both normal and diseased (NCBI_SAGE), and the initiation of similar inventories for the mouse
v) the ability to annotate genomic sequences with quantitative expression information derived from SAGE and to uncover higher order organizational patterns of chromosomal arrangement (e.g. RIDGEs, Regions of IncreaseD Gene Expression, in the Human Transcriptome Map; Caron 2000).
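
The redundancy figure quoted above, and the way digital tag counts lend themselves to pooling (advantages i and ii), can be illustrated with a minimal Python sketch. The tag strings and the two libraries are invented for illustration only; they are not real SAGE tags, and the sketch is not the pipeline used by any of the cited studies.

```python
from collections import Counter

# Figures quoted in the text (dbEST release, October 2000).
est_sequences = 2_000_000
unigene_clusters = 86_000

# Average number of ESTs sequenced per unique gene.
redundancy = est_sequences / unigene_clusters
print(f"ESTs per unique gene: {redundancy:.1f}")

# Hypothetical tag lists from two made-up SAGE libraries.
lib_a = Counter(["AAACCCGGGT", "AAACCCGGGT", "TTTGGGCCCA"])
lib_b = Counter(["AAACCCGGGT", "GGGTTTAAAC"])

# Because each library is simply a table of tag counts, data from
# different experiments can be pooled by adding the counts.
pooled = lib_a + lib_b
for tag, count in pooled.most_common():
    print(tag, count)
```

Adding two `Counter` objects sums the counts tag by tag, which is the sense in which tag counts from separate experiments can be accumulated and reanalysed later.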

A recent SAGE study of the human transcriptome (3.5 million transcripts from 19 tissues; Velculescu 1999) revealed expression of 43,000 different genes in a single cell type, with expression levels ranging from 0.3 to 9,417 transcript copies per cell. 83% of transcripts were present at levels as low as one copy per cell. The 633 most highly expressed genes accounted for 45% of the cellular mRNA mass fraction. Most unique transcripts were expressed at low levels (<5 copies per cell): 94% of the unique transcripts expressed made up just under 25% of the mRNA mass of the cell, as previously shown by reassociation kinetic studies. Only 52% of these low copy transcripts matched expressed sequences (mRNAs and ESTs) in GenBank/EMBL. Similar data for transcript abundance have recently been reported in an analysis of the mouse brain transcriptome (150,000 tags representing 42,738 unique transcripts; Chrast 2000). However, the 42,738 unique tags matched only approximately 4,000 known genes (there are ~6,000 mouse mRNA sequences in GenBank) and approximately 10,000 EST clusters of unknown function (from a total of ~70,000 mouse UniGene clusters), whereas the remaining transcript tags (76%), mainly for genes with low expression levels, had no match in the public databases (Chrast 2000).
The SAGE tag sequence for an unknown gene is sufficient information to generate longer cDNA fragments for gene identification (Polyak 1997), and this process will be further facilitated as genome sequencing nears completion. Many genes not currently present in public databases will be predicted from accumulating genomic sequence, and the corresponding cDNAs will eventually be arrayed. However, it is questionable whether available, or even developing, microarray technologies will have the sensitivity to detect and quantitate the preponderance of low abundance transcripts, given hybridization kinetic limitations.
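
The abundance figures above rest on converting tag counts into transcript copies per cell. A minimal Python sketch of that arithmetic follows. The value of 300,000 mRNA molecules per cell is an assumed figure chosen for illustration, not one stated in the text, and the function names and example counts are hypothetical.

```python
def copies_per_cell(tag_count, total_tags, transcripts_per_cell=300_000):
    """Scale a tag's share of the library to copies per cell.

    transcripts_per_cell is an assumption, not a value from the text.
    """
    return tag_count / total_tags * transcripts_per_cell


def low_abundance_summary(counts, threshold=5.0, transcripts_per_cell=300_000):
    """Return (fraction of unique tags, fraction of mRNA mass) for tags
    whose estimated abundance is below `threshold` copies per cell."""
    total = sum(counts.values())
    low = [
        c for c in counts.values()
        if copies_per_cell(c, total, transcripts_per_cell) < threshold
    ]
    return len(low) / len(counts), sum(low) / total


# Under the assumed cell content, a tag seen once in a million tags
# corresponds to about 0.3 copies per cell.
single = copies_per_cell(1, 1_000_000)

# Invented library: two abundant tags and four rare ones.
toy_counts = {"a": 290_000, "b": 9_996, "c": 1, "d": 1, "e": 1, "f": 1}
unique_fraction, mass_fraction = low_abundance_summary(toy_counts)
print(f"{unique_fraction:.0%} of unique tags, {mass_fraction:.4%} of mRNA mass")
```

In the toy library most unique tags are rare yet contribute little to total mRNA mass, which is the same qualitative pattern the human and mouse studies report.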
The power of SAGE for gene discovery, combined with the sensitivity of an in-depth SAGE analysis, as exemplified by the human and mouse studies described above and enabled by facilities such as those of the AGRF, will be exploited by the MBGP.