A SAGE library yields results in this format:
The MBGP first approached this problem by taking sequences from the Reference Sequence (RefSeq) collection at NCBI and extracting the 10 or 17 base pairs (short or long SAGE tags) immediately following the 3'-most CATG. The experimental SAGE tags were then matched to these "virtual" SAGE tags and their annotation (see https://mbgproject.org/MBGP_Tools_assign.html for an online tool that uses this resource). Advantages:
It is important not to rely solely on clone data, as a cDNA sequence with the usual start and stop codons is considered a full-length clone even if it lacks the poly(A) signal or tail. These sequences are then assigned to clusters. Two sequences are placed in the same cluster if they overlap or, in the case of ESTs, if they are known to originate from the same clone (this information is derived from UniGene). A number of parameters are applied to the clusters:
In this diagram six ESTs (in red) are aligned against the genome (in blue, at the bottom), representing two clusters. The 'T's indicate where SAGE tags have been found (i.e. 10/17 bases after the 3'-most or the second 3'-most CATG). The green lines indicate splicing. The three ESTs in group one clearly belong together, as their sequences overlap. Equally, the two ESTs in group two belong together. Groups one and two belong in the same cluster, as the pink dotted line indicates that these two ESTs are from the same clone. The sixth EST, although nearby, is not from the same clone as any of the other ESTs and has no overlapping sequence, hence it belongs in another cluster.

ESTGraph begins at the 3' end of the cluster and works toward the 5' end until it reaches a polyadenylation signal (shown in blue). It then "walks" along the cluster, looking for the 3'-most CATG and then the second 3'-most CATG. In the example above, there are three "walks". The first runs from the polyadenylation signal at A to B. The second "walk" has two routes: from the polyadenylation signal at C to B, and from C to D. The third runs from the polyadenylation signal at E to B.
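The clustering rule described above (join sequences that overlap, or ESTs that come from the same clone) can be sketched with a small union-find. This is only an illustration, not the MBGP implementation: the `Est` record, its field names, the coordinates and the clone label are all hypothetical, and overlap is tested on plain genomic intervals, ignoring splicing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Est:
    """Hypothetical record: one aligned sequence on a single chromosome."""
    name: str
    start: int                   # genomic start, inclusive
    end: int                     # genomic end, exclusive
    clone: Optional[str] = None  # clone identifier, if known


def cluster_ests(ests: List[Est]) -> List[List[str]]:
    """Group ESTs that overlap or share a clone (transitively)."""
    parent = {e.name: e.name for e in ests}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ests):
        for b in ests[i + 1:]:
            overlaps = a.start < b.end and b.start < a.end
            same_clone = a.clone is not None and a.clone == b.clone
            if overlaps or same_clone:
                parent[find(a.name)] = find(b.name)

    clusters = {}
    for e in ests:
        clusters.setdefault(find(e.name), []).append(e.name)
    return sorted(sorted(members) for members in clusters.values())


# Six made-up ESTs mirroring the diagram: two overlapping groups linked
# by a shared clone, plus one nearby EST that stands alone.
ests = [
    Est("est1", 100, 300),
    Est("est2", 250, 450),
    Est("est3", 400, 600, clone="cloneA"),
    Est("est4", 900, 1100, clone="cloneA"),
    Est("est5", 1050, 1250),
    Est("est6", 1400, 1600),
]
print(cluster_ests(ests))
```

With these invented coordinates the first five ESTs end up in one cluster and the sixth in its own, matching the two clusters in the diagram.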
Interpreting your results
Question | My tag matches to hundreds of things, help!
Answer | Okay, let us take an example: catgtggttgctgggaattga. Looking this tag up with the SAGEome program (see https://mbgproject.org/cgi-bin/tools/programs.pl, "GET GENE and GENOME POSITION for sage libraries from database") gives a total of 107 matches! A closer look at these results shows that all hits are on different clusters (each has a different cluster number). As each of our clusters represents a transcriptional unit, each cluster can be considered to be a gene. The positions of these genes are shown in the 'position' column.
Question | Why does one tag have several refGene types associated with it?
Answer | Usually this is because the tag maps to multiple places on the genome. For example, consider the tag catgaaagaattcatactgga. Using the alignment program BLAT (see http://genome.ucsc.edu/cgi-bin/hgBlat) with the Oct. 2003 assembly of the mouse genome, this 17 base pair tag plus the 4 base pair enzyme site (a total of 21 bases) maps completely (i.e. all 21 bases match) to 71 places in the genome. Mapping can occur on either strand.
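To see why both strands matter, here is a minimal sketch of an exact-match search that checks a sequence for a tag and for its reverse complement. It is a plain string scan for illustration only, not BLAT or the SAGEome method; the function names and the toy sequence are hypothetical.

```python
from typing import List, Tuple

# Translation table for complementing DNA bases (either case).
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_exact_hits(sequence: str, tag: str) -> List[Tuple[int, str]]:
    """Return (position, strand) for every perfect match of the tag.

    Positions are 0-based offsets on the forward strand. A tag that is
    its own reverse complement is reported on both strands.
    """
    sequence = sequence.lower()
    tag = tag.lower()
    hits = []
    for strand, query in (("+", tag), ("-", reverse_complement(tag))):
        pos = sequence.find(query)
        while pos != -1:
            hits.append((pos, strand))
            pos = sequence.find(query, pos + 1)  # allow overlapping hits
    return sorted(hits)


# Toy sequence: one forward copy and one reverse-strand copy of the tag.
tag = "catgaaagaattcatactgga"
toy_genome = "aa" + tag + "cc" + reverse_complement(tag) + "gg"
print(find_exact_hits(toy_genome, tag))
```

Note that the enzyme site itself reads the same on both strands, so a reverse-strand hit ends in CATG rather than starting with it.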
Question | Why does my tag match to *random*?
Answer | Please see http://genome.ucsc.edu/FAQ/FAQdownloads#download10
Question | Why does my tag match to chrUn?
Answer | Please see http://genome.ucsc.edu/FAQ/FAQdownloads#download11
Question | Why does my column 'search_method' state either 'graph_last' or 'graph_first'?
Answer | This is because our SAGEome method takes both the ten/seventeen base pairs (short or long SAGE tags) after the 3'-most CATG, and the ten/seventeen base pairs after the second 3'-most CATG. This is particularly useful in cases where the 3'-most CATG has fewer than ten/seventeen bases following it.
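As a rough illustration of this idea, the sketch below pulls the bases after the 3'-most CATG and after the second 3'-most CATG from a single transcript sequence. The function name, the labels and the example sequence are hypothetical; this is not the SAGEome code, and it does not say which site corresponds to 'graph_last' or 'graph_first'.

```python
from typing import List, Optional, Tuple

SITE = "catg"  # enzyme recognition site preceding each tag


def catg_tags(transcript: str,
              tag_length: int = 17) -> List[Tuple[str, int, Optional[str]]]:
    """Return (label, site position, tag) for the last two CATG sites.

    The tag is the `tag_length` bases right after the site, without the
    CATG itself. It is None when fewer than `tag_length` bases follow.
    """
    seq = transcript.lower()
    results = []
    end = len(seq)
    for label in ("3'-most CATG", "second 3'-most CATG"):
        pos = seq.rfind(SITE, 0, end)  # right-most site before `end`
        if pos == -1:
            break
        tag = seq[pos + len(SITE): pos + len(SITE) + tag_length]
        results.append((label, pos, tag if len(tag) == tag_length else None))
        end = pos  # continue the search upstream of this site
    return results


# Made-up transcript: the last CATG has only four bases after it, so
# only the second 3'-most CATG yields a full short (10 bp) tag.
example = "tt" + "catg" + "aaaaaccccc" + "gg" + "catg" + "acgt"
for row in catg_tags(example, tag_length=10):
    print(row)
```

In this example the 3'-most site gives no usable tag, while the second site gives aaaaaccccc, which is the situation the answer above describes.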
Question | I have one tag which matches to one gene id, but this one gene id appears several times?!
Answer | Okay, let us take an example: the tag catgaaaaagtaccagagctg. This tag has 61 matches with SAGEome, including eight matches to AK122514. Looking this accession number up at NCBI (see http://www.ncbi.nlm.nih.gov), we discover that it refers to an mRNA. If we look this mRNA up in the Draft Genome Browser at UCSC (see http://genome.ucsc.edu), we can see that it has eleven matches to the genome (chromosome 18, chromosome 4, chromosome 11 twice, chromosome 5 twice, chromosome 2, chromosome X twice, chromosome 13 and chromosome 1). Looking again at our tag list, we can see that AK122514 represents a cluster on chr11 (twice), chr13, chr18, chr5 and chrX (twice).
Question | What can be considered a good match? Are RefSeq annotations better than mgcGenes?
Answer | As a rough guide, the most "trustworthy" sequences are reviewed RefSeq and SWISS-PROT entries. Both are high-quality data sets carefully reviewed by human experts. The next level of confidence applies to:
Key points