A SAGE library yields results in this format:
ACGTGTGACT   12
GTACGTACGT   2
GTAACCAGTA   35
where the first column contains the tag sequence (of either 10 or 17 base pairs depending on whether you are working with a short or long SAGE library), and the second column contains the number of times that tag has been seen in the library (the count of that tag). It is necessary to relate the tag sequence to another piece of sequence that is annotated in some format.
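As an illustration of the format above, the following is a minimal Python sketch that reads tag/count lines into a dictionary. The function name parse_sage_library is hypothetical, and it assumes whitespace-separated columns and tags of exactly 10 or 17 bases, as described above.

```python
def parse_sage_library(text):
    """Parse 'TAG<whitespace>COUNT' lines into a dict of tag -> count."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        tag, count = fields[0].upper(), int(fields[1])
        if len(tag) not in (10, 17):
            # short SAGE tags are 10 bases, long SAGE tags are 17
            raise ValueError(f"unexpected tag length {len(tag)}: {tag}")
        # add up counts in case the same tag is listed more than once
        counts[tag] = counts.get(tag, 0) + count
    return counts


example = """ACGTGTGACT   12
GTACGTACGT   2
GTAACCAGTA   35"""

library = parse_sage_library(example)
print(library)
```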
The MBGP first approached this problem by taking sequences from the Reference Sequence (RefSeq) collection at NCBI and extracting the 10/17 base pairs immediately following the 3'-most CATG. The experimental SAGE tags were then matched to these "virtual" SAGE tags and their annotation (see https://mbgproject.org/MBGP_Tools_assign.html for an online tool to use this resource).
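The virtual-tag idea can be sketched in a few lines of Python. This is an illustration only, not the MBGP implementation: the function name virtual_tag, the transcript names and the transcript sequences are all made up for the example.

```python
def virtual_tag(transcript, tag_length=10, site="CATG"):
    """Return the tag_length bases after the 3'-most CATG, or None."""
    seq = transcript.upper()
    pos = seq.rfind(site)  # rfind gives the last, i.e. 3'-most, occurrence
    if pos == -1:
        return None  # no CATG site in this sequence
    tag = seq[pos + len(site): pos + len(site) + tag_length]
    # a CATG too close to the 3' end cannot yield a full-length tag
    return tag if len(tag) == tag_length else None


# Hypothetical annotated transcripts (names and sequences are made up).
transcripts = {
    "transcript_A": "GGCATGAAACCCGGGTTTACATGACGTGTGACTAAAAAA",
    "transcript_B": "TTCATGGTACGTACGTCC",
}

# Build the virtual tag -> annotation lookup.
virtual = {}
for name, seq in transcripts.items():
    tag = virtual_tag(seq)
    if tag is not None:
        virtual.setdefault(tag, []).append(name)

# Match experimental tags (tag -> count) against the virtual tags.
experimental = {"ACGTGTGACT": 12, "GTACGTACGT": 2, "GTAACCAGTA": 35}
assigned = {tag: virtual.get(tag, []) for tag in experimental}
print(assigned)
```

In this toy example the third experimental tag finds no virtual tag, so it is left without annotation.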
Advantages:
  • Non-redundancy
  • Updates to reflect current knowledge of sequence data and biology
  • Data validation and format consistency
  • Ongoing curation by NCBI staff and collaborators, with review status indicated on each record
Disadvantages:
  • A large number of experimental tags were failing to be identified.
This disadvantage meant that the power of the data within the SAGE libraries was not being used. Hence the MBGP designed a new approach, SAGEome. SAGEome focuses on
  • the 3' end of the gene (with regard to the SAGE technique)
  • making use of all sources of transcribed (i.e. experimentally verified) sequence data
  • not limiting our use of sequence to just consensus sequences
  • making use of genomic sequence
  • beginning with the most complete starting information, i.e. the DNA sequences submitted to public databases.
SAGEome works by taking all sources of transcribed sequence data that have been aligned to the genome using the alignment program blat (see http://genome.ucsc.edu/cgi-bin/hgBlat) with the Oct. 2003 assembly of the mouse genome. Details of sequences currently used:
  • 2,998,011 all_est (*)
  • 151,784 all_mrna
  • 14,875 mgcGenes (*)
  • 17,451 refGene
  • 61,603 rikenClones
* Despite their redundancy and over-representation of highly expressed genes, ESTs are cheap, easy and quick to obtain relative to full genomic sequencing and currently sample more mouse genes than any other data source.
It is important not to rely solely on clone data, as a cDNA sequence with a common start and stop codon is considered a full-length clone even if it does not have the poly-A signal or tail.
These sequences are then assigned to clusters. Assignment is determined by whether the sequences overlap, or in the case of ESTs, if they are known to originate from the same clone (this information is derived from UniGene). A number of parameters are applied to the clusters:
  • poly-A tails are extracted directly from the genome (i.e. not from the EST sequences)
  • For those clones which are annotated as "5'" and aligned in "+" orientation or annotated as "3'" and aligned in "-" orientation or the splice sites indicate "+" orientation, poly-A tails are searched for within 35 bases from the start and 30 bases before the start.
  • For clones which are annotated as "3'" and aligned in "+" orientation or annotated as "5'" and aligned in "-" orientation or the splice sites indicate "-" orientation, poly-A tails are searched for within 35 bases from the end and 30 bases after the end.
  • Orientation for genes and ESTs is predicted from the splice sites; only splice sites where the intron is larger than 12 bases are taken into account (others are frequently just gaps in the alignment).
  • Searching for SAGE tags starts at all poly-A tags, extended poly-A tags, refSeq genes, mgcGenes, rikenClones and all_mRnas.
  • Genes or ESTs have to have at least one exon with 30b length
  • Exons have to overlap by at least 10b to be clustered
  • ESTs of the same clone are clustered if their distance is less than 10000kb.
  • Exons which are more than 1000000 bases apart are not used for clustering.
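The clustering rules in the list above can be sketched as follows. This is a simplified Python illustration, not the SAGEome code: the record layout (name, chrom, exons, clone), the function names and the example coordinates are assumptions. The 30-base rule is read as "at least 30 bases", the clone distance is taken literally as 10000 kb, and strand, poly-A handling and the 1000000-base exon rule are not modelled.

```python
from itertools import combinations

MIN_EXON_LENGTH = 30                 # read as "at least 30 bases"
MIN_EXON_OVERLAP = 10                # bases two exons must share
MAX_CLONE_DISTANCE = 10_000 * 1_000  # "10000kb", read literally


def overlap(a, b):
    """Number of bases shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def should_join(x, y):
    """Decide whether two aligned sequences belong in the same cluster."""
    if x["chrom"] != y["chrom"]:
        return False
    if any(overlap(e, f) >= MIN_EXON_OVERLAP
           for e in x["exons"] for f in y["exons"]):
        return True
    clone = x.get("clone")
    if clone is not None and clone == y.get("clone"):
        # gap between the two alignments (0 if their spans touch)
        x_start = min(s for s, _ in x["exons"])
        x_end = max(e for _, e in x["exons"])
        y_start = min(s for s, _ in y["exons"])
        y_end = max(e for _, e in y["exons"])
        gap = max(0, max(x_start, y_start) - min(x_end, y_end))
        return gap < MAX_CLONE_DISTANCE
    return False


def cluster(alignments):
    """Group alignments into clusters with a small union-find."""
    usable = [a for a in alignments
              if any(e - s >= MIN_EXON_LENGTH for s, e in a["exons"])]
    parent = list(range(len(usable)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(usable)), 2):
        if should_join(usable[i], usable[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, a in enumerate(usable):
        groups.setdefault(find(i), []).append(a["name"])
    return sorted(sorted(g) for g in groups.values())


# Six made-up ESTs: three overlapping, two overlapping, one on its own.
# est3 and est4 share a (hypothetical) clone identifier.
ests = [
    {"name": "est1", "chrom": "chr1", "exons": [(100, 200)]},
    {"name": "est2", "chrom": "chr1", "exons": [(150, 260)]},
    {"name": "est3", "chrom": "chr1", "exons": [(240, 330)], "clone": "cloneA"},
    {"name": "est4", "chrom": "chr1", "exons": [(1000, 1100)], "clone": "cloneA"},
    {"name": "est5", "chrom": "chr1", "exons": [(1050, 1200)]},
    {"name": "est6", "chrom": "chr1", "exons": [(1500, 1600)]},
]
clusters = cluster(ests)
print(clusters)
```

Here est1 to est5 end up in one cluster (two overlapping groups linked by a shared clone), while est6 forms a cluster of its own.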
A diagram indicating how SAGEome works is shown below:-
In this diagram six ESTs (in red) are aligned against the genome (in blue, at the bottom), representing two clusters. The 'T's indicate where SAGE tags have been found (i.e. the 10/17 bases after the 3'-most or the second 3'-most CATG). The green lines indicate splicing. The three ESTs in group one clearly belong together as their sequences overlap. Equally, the two ESTs in group two belong together. Groups one and two belong in the same cluster, as the pink dotted line indicates that an EST in each group comes from the same clone. The sixth EST, although nearby, is not from the same clone as any of the other ESTs and has no overlapping sequence, hence it belongs in another cluster.
ESTGraph begins at the 3' end of the cluster and works towards the 5' end until it reaches a polyadenylation signal (shown in blue). It then "walks" along the cluster, looking for the 3'-most CATG and then the second 3'-most CATG. In the example above, there are three "walks". The first runs from the polyadenylation signal at A to B. The second "walk" has two routes, from the polyadenylation signal at C to B, and again from C to D. The third runs from the polyadenylation signal at E to B.
Interpreting your results.
Question My tag matches to hundreds of things, help!
Answer Okay, let us take an example: catgtggttgctgggaattga. Looking this tag up with the SAGEome program (see https://mbgproject.org/cgi-bin/tools/programs.pl, "GET GENE and GENOME POSITION for sage libraries from database"), gives a total of 107 matches! Looking closer at these results, it is clear that all hits are on different clusters (all have different cluster numbers). As each of our clusters represents a transcriptional unit, each cluster can be considered to be a gene. The positions of these genes are shown in the column 'position'.
Question Why does one tag have several refGene types associated with it?
Answer Here it is usually the case that the tag maps to multiple places on the genome. For example, consider the tag catgaaagaattcatactgga. Using the alignment program blat (see http://genome.ucsc.edu/cgi-bin/hgBlat) with the Oct. 2003 assembly of the mouse genome, this 17 base pair tag plus the 4 base pair enzyme site (a total of 21 bases) maps completely (i.e. all 21 bases match) to 71 places in the genome. Mapping can occur on either strand.
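To show what "maps completely on either strand" means, here is a minimal Python sketch that counts exact matches of the 21-base sequence on both strands. It is a naive string search over a made-up toy genome, not blat, and the function names and chromosome names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def exact_hits(query, genome):
    """List (chrom, position, strand) for every exact match of query."""
    query = query.upper()
    patterns = {"+": query}
    rc = reverse_complement(query)
    if rc != query:  # avoid counting a palindromic query twice
        patterns["-"] = rc
    hits = []
    for chrom, seq in genome.items():
        seq = seq.upper()
        for strand, pattern in patterns.items():
            start = seq.find(pattern)
            while start != -1:
                hits.append((chrom, start, strand))
                start = seq.find(pattern, start + 1)
    return hits


tag = "catgaaagaattcatactgga"  # 4-base enzyme site + 17-base tag

# Toy genome (made up): one forward-strand and one reverse-strand copy.
toy_genome = {
    "chrA": "TTT" + tag.upper() + "GGG",
    "chrB": "CC" + reverse_complement(tag.upper()) + "AA",
}
hits = exact_hits(tag, toy_genome)
print(hits)
```

A match on the "-" strand is found by searching for the reverse complement of the query, which is why one tag can be reported on both strands.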
Question Why does my tag match to *random*?
Answer Please see http://genome.ucsc.edu/FAQ/FAQdownloads#download10
Question Why does my tag match to chrUn?
Answer Please see http://genome.ucsc.edu/FAQ/FAQdownloads#download11
Question Why does my column 'search_method' state either 'graph_last' or 'graph_first'?
Answer This is because our SAGEome method takes both the ten/seventeen (short or long SAGE tags) base pairs after the 3'-most CATG, and the ten/seventeen base pairs after the second 3'-most CATG. This is particularly useful in cases where there is a 3' CATG with fewer than ten/seventeen bases following it.
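The following Python sketch illustrates why taking both sites helps. It is an illustration under stated assumptions, not the SAGEome code: the function name, the result labels and the example sequence are made up, and the labels are deliberately not mapped onto 'graph_last' or 'graph_first'.

```python
def candidate_tags(transcript, tag_length=17, site="CATG"):
    """Return the tags after the 3'-most and second 3'-most CATG.

    A value is None when the site is missing or has fewer than
    tag_length bases following it.
    """
    seq = transcript.upper()

    # Collect the position of every CATG, in 5' to 3' order.
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)

    result = {"3prime_most": None, "second_3prime_most": None}
    # Walk the sites from the 3' end; zip stops after two labels.
    for label, pos in zip(result, reversed(positions)):
        tag = seq[pos + len(site): pos + len(site) + tag_length]
        if len(tag) == tag_length:
            result[label] = tag
    return result


# Made-up sequence: the 3'-most CATG has only three bases after it.
example = "AACATGACGTGTGACTTTCATGGTA"
tags = candidate_tags(example, tag_length=10)
print(tags)
```

In this example the 3'-most CATG lies too close to the end of the sequence to give a full short tag, so only the second 3'-most CATG yields a usable tag.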
Question I have one tag which matches to one gene id, but this one gene id appears several times?!
Answer Okay, let us take an example: the tag catgaaaaagtaccagagctg. This tag has 61 matches with SAGEome. This includes eight matches to AK122514. Looking this accession number up at NCBI (see http://www.ncbi.nlm.nih.gov), we discover that this accession number refers to an mRNA. If we take this mRNA and look it up at the Draft Genome Browser at UCSC (see http://genome.ucsc.edu), we can see that it has eleven matches to the genome (chromosome 18, chromosome 4, chromosome 11 twice, chromosome 5 twice, chromosome 2, chromosome X twice, chromosome 13 and chromosome 1). Looking again at our tag list we can see that AK122514 represents a cluster on chr 11 (twice), chr 13, chr 18, chr5 and chrX (twice).
Question What can be considered a good match? Are RefSeq annotations better than mgcGenes?
Answer As a rough guide, the most "trustworthy" sequences are reviewed RefSeq and SWISS-PROT. Both are high-quality data sets carefully reviewed by human experts. The next level of confidence applies to:
  • other RefSeq
  • TrEMBL
  • UCSC Known Genes
Use the mRNA track and other comparative genomic data tracks as supporting evidence. Gene prediction tracks may help, but should be considered in the main as "hypothetical".

Key points
All of the original alignments (except the Riken data) came from the UCSC site. If you would like to find out further information on this site but do not know where to begin, you can search the entire site by using Google, using this format (replace random with the keyword that you are looking for):-