Melbourne Brain Genome Project


This webpage aims to provide all tools (or links to appropriate tools) to analyse SAGE and LongSAGE data beginning from the raw chromatograms. Read through the steps and begin at the one that is most appropriate to your stage of analysis. If you need additional explanations, leave a request via contact form at

•Step 1 : Process the sequence
If the chromatograms for the sequences are available, then ideally you should use the basecalling program phred to read the trace data and call the bases. Sequencing errors occur at an estimated rate per base and can corrupt the ditags. Their effect can be reduced by using the quality scores obtained with phred and discarding ambiguous sequences. Phred scores report the confidence in each base call as an expected error frequency, which depends on the sequencing technology and chemistry. Phred is free for academic users and can be obtained from
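Phred quality scores follow the standard relation Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The Python sketch below shows that conversion and one simple way to hide unreliable calls. The function names and the cut-off of 20 are illustrative assumptions, not MBGP settings.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score implied by an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def mask_low_quality(bases, quals, min_q=20):
    """Replace every base whose quality is below min_q with 'N'.

    min_q=20 (a 1 in 100 error rate) is only an example threshold.
    """
    return "".join(b if q >= min_q else "N" for b, q in zip(bases, quals))


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
    print(mask_low_quality("ACGT", [30, 10, 25, 5]))
```

A sequence that ends up with many masked positions is a candidate for being discarded as ambiguous.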
•Step 2 : Extract SAGE tags from sequence
The MBGP recommends the use of SAGE2000 to extract tags from SAGE sequences. This tool produces a list of tags and their respective abundances, extracted from raw sequence. The raw sequence is a file (.seq) containing sequences from clones of concatenated SAGE ditags. The SAGE2000 analysis software tabulates the occurrence of each tag and creates a report of each tag and its abundance level. There is an online tool available here that will extract SAGE tags; however, it does not currently account for linker sequences. Sequences that have a low number of tags tend to fall into three categories:
1. Sequence-verified empty vector
2. Sequences composed partially or exclusively of 'long dimers'
3. Sequences with dimers of highly variable lengths
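
As a rough illustration of what tag extraction involves, the Python sketch below splits a clone sequence at the anchoring enzyme site, keeps ditags of plausible length, and reads one tag from each end. It is not the SAGE2000 algorithm. It assumes NlaIII (CATG) as the anchoring enzyme, 10 bp tags and a ditag window of 20 to 26 bp, it ignores linker sequences, and all names are hypothetical.

```python
from collections import Counter

ANCHOR = "CATG"                 # assumed anchoring enzyme site (NlaIII)
TAG_LEN = 10                    # assumed tag length for standard SAGE
MIN_DITAG, MAX_DITAG = 20, 26   # assumed acceptable ditag lengths

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def extract_tags(clone_seq, seen_ditags=None):
    """Count tags in one clone sequence of concatenated ditags.

    Pass the same seen_ditags set for every clone of a library so that a
    repeated ditag is counted only once across the whole library.
    """
    seen = set() if seen_ditags is None else seen_ditags
    counts = Counter()
    fragments = clone_seq.upper().split(ANCHOR)
    # The first and last fragments are not flanked by an anchor on both sides.
    for ditag in fragments[1:-1]:
        if not MIN_DITAG <= len(ditag) <= MAX_DITAG:
            continue            # too short, or a 'long dimer'
        if set(ditag) - set("ACGT"):
            continue            # ambiguous base call
        key = min(ditag, reverse_complement(ditag))
        if key in seen:
            continue            # duplicate ditag
        seen.add(key)
        counts[ditag[:TAG_LEN]] += 1                        # left-hand tag
        counts[reverse_complement(ditag[-TAG_LEN:])] += 1   # right-hand tag
    return counts
```

Under these assumptions, a clone made up mostly of over-long fragments yields few or no tags, which matches the second category above.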

For internal users only, a new tool has been added to the MBGP site, to extract tags and account for sequencing errors simultaneously. This tool can be found here, and requires a zip archive of files in either phred phd format or ABI phd format (i.e. quality files).

•Step 3 : Determine which genes are significantly differentially expressed

Files of tags and their respective abundances (the tag counts) can be analysed to determine which tags (and therefore which genes) are significantly differentially expressed, by clicking here. The submitted file must be in the following format:

Gene Library_A Library_B
A 22 4
B 6 8
C 0 3
To convert your SAGE library files of tags and counts into this format, please click here. Note - it is advisable to give your genes intuitive names to aid with identification in the resulting statistics. Further information on the statistical test used can be found here.
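
For readers who prefer to build the file themselves, the Python sketch below merges per-library tag counts into the layout shown above, writing 0 where a tag is absent from a library. It is not the MBGP conversion tool. The function name is hypothetical, and the single-space delimiter is an assumption taken from the example; the real form may expect tabs.

```python
def merge_libraries(libraries, sep=" "):
    """Merge {library_name: {tag_or_gene: count}} into one table.

    Returns the table as text: a header row, then one row per tag,
    with 0 for any tag that a library does not contain.
    """
    names = list(libraries)
    tags = sorted(set().union(*(lib.keys() for lib in libraries.values())))
    lines = [sep.join(["Gene"] + names)]
    for tag in tags:
        counts = [str(libraries[name].get(tag, 0)) for name in names]
        lines.append(sep.join([tag] + counts))
    return "\n".join(lines)


if __name__ == "__main__":
    libs = {
        "Library_A": {"A": 22, "B": 6},
        "Library_B": {"A": 4, "B": 8, "C": 3},
    }
    print(merge_libraries(libs))
```

Run on the counts from the example, this reproduces the four lines shown above.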

For internal users only, this tool is also available with the selection of all of our libraries, and can be accessed by clicking here.

A number of new tools for the manipulation of SAGE data have been made available. These can be accessed by internal users only, by clicking here.

•Step 4 : Find information on your unknown SAGE tag

"Unknown" SAGE tags can be assigned to known SAGE tags previously extracted from databases including the Reference sequence collection at NCBI. To map your SAGE tags against selected databases, please click here.

For internal users only, "unknown" SAGE tags can have identities assigned to them using ESTgraph, which can be found as the tool GET GENE and GENOME POSITION for sage libraries from database. The file must be either in this format:

TAG             COUNT
for one library, or this format if you are looking at tags that appear in several libraries:

TAGTGATGAG      12                              36                      0
CGTAGCACGT      3                               17                      88
GGTACCAGTA      34                              2                       23
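
Before uploading, it can help to check that a file really has this shape. The Python sketch below reads whitespace-separated lines of a tag followed by one or more counts. The function name and the validation rules are assumptions for illustration, not a description of how ESTgraph reads its input.

```python
def parse_tag_table(text):
    """Parse lines of 'TAG count [count ...]' into {tag: [counts]}.

    Blank lines and a 'TAG COUNT' header line are skipped. A line with a
    non-ACGT tag, no counts, or a non-integer count raises ValueError.
    """
    table = {}
    for line_no, line in enumerate(text.splitlines(), 1):
        fields = line.split()
        if not fields:
            continue
        tag, counts = fields[0].upper(), fields[1:]
        if tag == "TAG":
            continue  # header row
        if not counts or set(tag) - set("ACGT"):
            raise ValueError(f"line {line_no}: expected a tag and counts")
        table[tag] = [int(c) for c in counts]
    return table


if __name__ == "__main__":
    example = (
        "TAGTGATGAG      12      36      0\n"
        "CGTAGCACGT      3       17      88\n"
        "GGTACCAGTA      34      2       23\n"
    )
    for tag, counts in parse_tag_table(example).items():
        print(tag, counts)
```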
For further information on ESTgraph, including how and why it was constructed, please click here.
The poster from ISMB2004 can be viewed here.

•Step 5 : Pursuing your gene(s) of interest

Once a SAGE tag has been assigned to a RefSeq gene, further information can be retrieved on the gene it was assigned to. To do this, simply upload your list of RefSeq accession numbers into this form. Further information on the genes of interest can be obtained from these websites:

Assists in looking for items or concepts that may be present in common between two distinct sets of articles. For example, it can aid in picking up distant relationships that a NCBI PubMed search may miss.
Pathways depicting molecular relationships, either browse by category or search by title.
FatiGO carries out simple datamining using GO. Input is either a list of Gene Symbols, or UniGene cluster ids.
This is a Java application that allows users to search for gene products with particular gene ontology attributes, or combinations of attributes. GoFish ranks gene products by the degree to which they satisfy a Boolean query.
A web-based tool to annotate a large number of genes simultaneously using the vocabulary of gene ontology. Further information on the GO accession numbers (e.g. GO:0007186) can be found by searching the GO Browser at Mouse Genome Informatics.
A network of concurring genes and proteins extends through the scientific literature touching on phenotypes, pathologies and gene function. This information system provides this network as a natural way of accessing the more than ten million abstracts in PubMed. Currently only set up to search one gene at a time.
Note - with the introduction of proprietary software, the free web public service is 12 months out of date
Enables user selection of a gene, with the result that the literature neighbourhood around the gene is then shown. All information is extracted from the published literature.
A web interface to the mouse SAGE data published by Trendelenburg et al. (2002) Serial analysis of Gene Expression Identifies Metallothionein-II as Major Neuroprotective Gene in Mouse Focal Cerebral Ischemia. Journal of Neuroscience 22 (14):5879-5888. The data can be searched in several ways, including by using a tag or gene name.
Signal Transduction Knowledge Environment. Free registration is required to see these signal transduction connection maps. For further information, see the Science article pdf.
TNF Pathway
Fas Signaling Pathway
G alpha i Pathway [Gai]
G alpha s Pathway [Gas]
Gaq Pathway [Gaq]
Ga12 Pathway [Ga12]
Ga13 Pathway
T Cell Signal Transduction
B Cell Antigen Receptor
Estrogen Receptor Pathway
Wnt/β-Catenin Pathway
TGF-β Pathway
Differentiation Pathway in PC12 Cells
Jasmonate Biochemical Pathway
Cytokinin Signaling Pathway
Integrin Signaling Pathway
JAK-STAT Pathway
Interferon Pathway
STAT3 Pathway
Type 1 Interferon (IFNα/β) Pathway
Phosphoinositide 3-Kinase Pathway
Tagmapper is a comprehensive tool used to perform tag-to-gene mapping. It is built on the premise that a 10 bp SAGE tag, when derived from a defined position within the transcript, contains sufficient information to identify the corresponding gene. It promises to be an innovative and user-friendly tool that is built using open source software.
A web site and database that can be used to systematically map gene loci (QTLs, modifiers) that modulate variation in the expression levels of transcripts. Entering a keyword returns the expression data for that transcript, giving an indication of how expression varies with genotype.
•Step 6 : Use of the UniGene Tag maps

An alternative method of assigning information to tags is to use the Comparative Count Display software written at the NCBI, which can be used over the web here. This compares two libraries of tags and their abundances by generating a probability that the frequency of any tag in the distribution differs by more than a given fold factor from the other distribution.
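
The Comparative Count Display uses its own probability calculation, which is not reproduced here. As a simpler stand-in for comparing one tag between two libraries, the Python sketch below applies a two-sided Fisher's exact test to the tag's count and the total tag count of each library. The function name is hypothetical and the test is a swapped-in technique, not the NCBI method. It requires Python 3.8 or later for math.comb.

```python
from math import comb


def fisher_exact_two_sided(count_a, total_a, count_b, total_b):
    """Two-sided Fisher's exact test for one tag in two libraries.

    count_a, count_b: occurrences of the tag in libraries A and B.
    total_a, total_b: total number of tags sequenced in each library.
    Returns the probability of a split at least as unlikely as the
    observed one, given that the tag is equally frequent in both.
    """
    k = count_a + count_b          # all occurrences of this tag
    n = total_a + total_b          # all tags in both libraries
    denom = comb(n, k)

    def prob(x):
        # Hypergeometric probability that x of the k occurrences fall in A.
        return comb(total_a, x) * comb(total_b, k - x) / denom

    p_obs = prob(count_a)
    lo = max(0, k - total_b)
    hi = min(k, total_a)
    # Sum every outcome no more likely than the observed one; the small
    # tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


if __name__ == "__main__":
    # Hypothetical library sizes; only the tag counts 22 and 4 are examples.
    print(fisher_exact_two_sided(22, 50000, 4, 50000))
```

Because many tags are tested at once, any threshold applied to these values should allow for multiple testing.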

There are various other tools available for use on SAGE data. A list of the more commonly available tools is shown below. If you are aware of any software missing from this list, please let us know.

Tool Brief summary
CSAGE An open source UNIX/Linux tool written in C intended for the automated analysis of data generated with SAGE. Software and source are freely available. Please direct inquiries to . pdf
eSAGE A comprehensive software package for managing and analyzing data generated by SAGE, freely available for non-commercial use. pdf
An Open Source gene expression database and integrated tool set. Also see pdf
POWER_SAGE Program that can run 'virtual' SAGE studies with different combinations of sample size and tag frequency and determine the power for each combination - useful for planning SAGE experiments. pdf
SAGE2000 SAGE2000 is freely available to academic investigators for non-commercial use.
SAGE Library Analysis A method for predicting the reliability of low-abundance SAGE tags that has been developed at Serono. The tool is in the form of a web page which allows you to upload a SAGE library and have it processed. For references, see Colinge & Feger, Detecting the impact of sequencing errors on SAGE data, 2001, Bioinformatics, 17:840-842. The implementation is now available online at the following address:
SAGElyzer SAGElyzer is an R package for integrating gene expression data from various SAGE libraries, clustering transcripts on the basis of SAGE expression pattern, and annotating SAGE tags using public databases. For further information, please see the Bioconductor listing or the pdf.
SAGEmap NCBI's SAGEmap resource is a public resource for serial analysis of gene expression data. It is possible to retrieve information by use of:
•a UniGene cluster id
•a sequence with a GI number
•a tag.
SAGEscreen SAGEscreen is a multi-step procedure that addresses ditag processing, estimation of empirical error rates from highly abundant tags, grouping of similar-sequence tags and statistical testing of observed counts. SAGEscreen is available for academic users from Reference: Correction of sequence-based artifacts in serial analysis of gene expression Akmaev and Wang, Bioinformatics 2004 20: 1254-1263 pdf.
SAGEstat A Windows executable compatible with all versions of Windows. To request a copy of this program please send an e-mail to [email protected] with the comment SAGEstat in the subject field. A description of the use of the program in planning SAGE experiments can be found in: Ruijter et al. (2002) Statistical evaluation of SAGE libraries: consequences for experimental design Physiological Genomics 11: 37-44. pdf A second reference can be found in: Kal et al. (1999) Dynamics of Gene Expression Revealed by Comparison of Serial Analysis of Gene Expression Transcript Profiles from Yeast Grown on Two Different Carbon Sources. Molecular Biology of the Cell 10, 1859-1872. pdf
A computer program that can search for SAGE tags in genomic sequences.
A web-based application for analysis of SAGE data. [Registration required].pdf
•Step 7 : No SAGE tag, but a gene/genes of interest ...
If you do not have a SAGE tag or tags, but instead have a gene or list of genes, and are interested in knowing which SAGE tags can be extracted from the sequence of the gene(s) and their respective abundances within all the SAGE libraries of the MBGP, please click here. This tool works equally well if you have a list of 10 bp SAGE tags and are interested in knowing their respective abundances within all the SAGE libraries of the MBGP.
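The idea of predicting a tag from a gene sequence can be sketched in Python as follows: find the anchoring enzyme site closest to the 3' end of the transcript and take the bases that follow it. This is a minimal illustration, not the MBGP tool. It assumes NlaIII (CATG), 10 bp tags and a transcript given 5' to 3' on the sense strand, and the function names and library layout are hypothetical.

```python
def virtual_tag(transcript, anchor="CATG", tag_len=10):
    """Return the tag following the 3'-most anchor site, or None.

    None is returned when the transcript has no anchor site or when
    fewer than tag_len bases follow the last site.
    """
    seq = transcript.upper()
    pos = seq.rfind(anchor)
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def tag_abundance(tag, libraries):
    """Look up one tag in {library_name: {tag: count}}; absent tags give 0."""
    return {name: counts.get(tag, 0) for name, counts in libraries.items()}


if __name__ == "__main__":
    mrna = "GGGCATGAAACCCGGGTTTCATGACGTACGTACGGAAAA"   # made-up sequence
    tag = virtual_tag(mrna)
    libs = {"lib1": {"ACGTACGTAC": 7}, "lib2": {}}     # made-up counts
    print(tag, tag_abundance(tag, libs))
```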
•Step 8 : Using the Draft Genome Browser at UCSC
Our SAGE tags and ESTgraph are available for browsing by internal users only at this webpage.
Finally, all results need to be validated by an alternate method, such as RT-PCR or in situ hybridisation on the same experimental material. Ambion provide a very informative and concise guide to RT-PCR.

Last modified on 25th October 2004
The MBGP is not responsible for the content of any external websites listed on this site.