ASSEDA User Guide

Table of contents

What is the Automated Splice Site Analyses Server?
What is the basis for identification of the binding sites?
Which reference sequence of Human Genome is this based on?
Why is a list of various accession numbers displayed when I submit a HUGO designated gene name?
What does "Submit own sequence" mean?
What does "Submit mRNA Accession #" mean?
What does "Designated Gene Name" mean?
What does "Mutation / Variant" mean?
What does "Window Range" mean?
What does "Analyze following Sites:" mean?
What does "Mutation Coordinate is specified relative to the beginning of either the:" mean?
What does "Translate all forward frames in the lister generated map" mean?
What does "mRNA Accession No." mean?
Why is Gene Name required in "Submit mRNA Accession #" option?
What does "Direction" mean?
What does "Sequence" mean?
How much time does it take to analyze one mutation?
Why does it take such a long time to analyze one small mutation?
What do the subdivisions "acceptor, donor, etc." in the results page mean?
Explain different headings in the results tables
Any dependencies?
Why am I supposed to submit at least 110 bases in the "Submit Own Sequence" option?
How do you compute the fold change in binding affinity?
Architecture and Program Flow Chart of ASSEDA
What does "Lower Threshold of Information Content" mean?
What does "Mutation Coordinate is specified relative to the beginning of either the:" mean?
What does "Do not produce Visualization map of binding sites" mean?
What does "Consideration of nearest ESE/ISS" mean?
What does "Treat splicing regulatory elements as:" mean?
What does "Threshold for Molecular Phenotype to declare potential exon skipping" mean?
The Logic and Formulation of Exon Definition for Splice and Splicing Regulatory Sites with Negative Information Content

What is the Automated Splice Site Analyses Server?

A system that uses information theory-based models to evaluate changes in splice site strength.

What is the basis for identification of the binding sites?

Shannon Information theory. Read about the application of information theory to molecular biology at Dr. Tom Schneider's page.

Which reference sequence of Human Genome is this based on?

The system is currently based on the April 2003 reference sequence. It will soon be extended to all drafts available in the UCSC Genome Browser, so that users can choose the draft they are interested in.

Why is a list of various accession numbers displayed when I submit a HUGO designated gene name?

Multiple accession numbers may be attached to the same functional gene name. In such cases, a list of the mRNA accession numbers is displayed, allowing the user to choose one of them.

Users are advised to choose the mRNA accession number with the largest range of base pairs.

What does "Submit own sequence" mean?

If you find that a particular accession number is missing from the April 2003 assembly of the UCSC genome, you can use the "Submit own sequence" option. This option lets you submit your own sequence and make the desired mutation(s). Hopefully, the server will be updated very soon, so that it is not limited to the April 2003 draft alone.

What does "Submit mRNA Accession #" mean?

Some accession numbers still do not have an associated gene name. In such cases, the user can use this option, which allows mutational analyses to be done without an associated HUGO designated gene name.

To find out why the "Gene Name" field is still requested, see the section on why Gene Name is required in the "Submit mRNA Accession #" option.

What does "Designated Gene Name" mean?

The Designated Gene Name is a HUGO designated gene name that is present in the UCSC Genome Browser. To look up the name of a gene, use the UCSC Genome Browser or the Genew database search engine. If you cannot find the gene name, you can choose the "Submit mRNA Accession #" option, where the gene name is requested for naming purposes only.

What does "Mutation / Variant" mean?

This is the field where the user submits a mutation / variant. The mutation must conform strictly to the HUGO Designated Mutation Nomenclature. Multiple mutations / variants can be analyzed together by submitting them separated by a '+'.

What does "Window Range" mean?

The Window Range is the number of bases before and after the mutated base within which the information content of sites is calculated. Sites falling outside the window are ignored. In the case of haplotypes, all sites falling between the mutated bases are also considered. The window range is limited to 1000 bases to reduce the overhead of scanning all the base pairs. The default value is 54, which is twice the range of the acceptor Ri(b,l) matrix.
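As a rough illustration of the window described above, the following Python sketch computes the interval that would be scanned for one or more mutated positions. The function name `scan_interval` and the use of inclusive coordinates are assumptions made for this example; only the default of 54 and the limit of 1000 come from this guide.

```python
DEFAULT_WINDOW = 54   # default window range stated in this guide
MAX_WINDOW = 1000     # upper limit stated in this guide


def scan_interval(mutation_positions, window=DEFAULT_WINDOW):
    """Return the (start, end) coordinates covered by the window.

    Hypothetical helper: coordinates are treated as inclusive and are
    not clipped to the ends of the sequence.
    """
    if not mutation_positions:
        raise ValueError("at least one mutation position is required")
    if not 0 <= window <= MAX_WINDOW:
        raise ValueError("window must be between 0 and %d" % MAX_WINDOW)
    # For haplotypes, everything between the outermost mutated bases
    # is included, plus the window on either side.
    start = min(mutation_positions) - window
    end = max(mutation_positions) + window
    return start, end


print(scan_interval([1000]))             # (946, 1054)
print(scan_interval([1000, 1200], 100))  # (900, 1300)
```

With the default window, a single mutation is scanned across 54 bases on each side of the mutated base.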

What does "Analyze following Sites:" mean?

A variety of information weight matrices (Ri(b,l) matrices) are available, each of which recognizes a certain kind of site. The user can choose one or more of them. The acceptor and donor Ri(b,l) matrices are scanned by default. In the near future, more Ri(b,l) matrices will be added to the list.
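To show how an Ri(b,l) matrix scores a site, here is a minimal Python sketch. It assumes the textbook form Ri(b,l) = 2 + log2 f(b,l), leaves out the small-sample correction, and uses a made-up two-position frequency table. The function names and values are illustrative only and are not ASSEDA's models or code.

```python
import math


def weight_matrix(frequencies):
    """Build Ri(b,l) = 2 + log2 f(b,l) for each position l and base b.

    Simplification: no small-sample correction, and no frequency may be zero.
    """
    return [{base: 2.0 + math.log2(f) for base, f in column.items()}
            for column in frequencies]


def score_site(matrix, site):
    """Ri of one site: the sum of the matrix weights of its bases."""
    return sum(column[base] for column, base in zip(matrix, site))


def scan_sequence(matrix, sequence, threshold=0.0):
    """Slide the matrix along the sequence; keep sites with Ri >= threshold."""
    width = len(matrix)
    hits = []
    for offset in range(len(sequence) - width + 1):
        ri = score_site(matrix, sequence[offset:offset + width])
        if ri >= threshold:
            hits.append((offset, ri))
    return hits


# Toy frequencies (hypothetical), chosen as powers of two for exact results.
toy = weight_matrix([
    {"a": 0.5, "c": 0.25, "g": 0.125, "t": 0.125},
    {"a": 0.125, "c": 0.125, "g": 0.5, "t": 0.25},
])
print(scan_sequence(toy, "agcat"))  # [(0, 2.0), (3, 1.0)]
```

Each selected matrix would be scanned in this way over the window around the mutation, once for the wild-type and once for the mutant sequence.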

The binding site selection method has been redesigned, using checkboxes instead of a drop down menu. Over time, the number of models developed for ASSEDA has increased, making the old selection method cumbersome. The new method allows for easier selection of specific models, and can easily be expanded without adding clutter. Simply check the models before submitting your mutation. Additional models of RNA binding proteins involved in splicing are planned to be added in the near future.

List of available binding sites: Donors and acceptors (human and mouse), branch point, SF2/ASF (SRSF1), SC35 (SRSF2), SRp40 (SRSF5), SRp55 (SRSF6), hnRNPA1, hnRNPH1.

What does "Mutation Coordinate is specified relative to the beginning of either the:" mean?

CDS (coding sequence) denotes the section that describes the gene's open reading frame (ORF), the portion of the sequence that codes for a protein product.

Most authors who report a mutation treat the initial start codon as position 1, whereas some publications treat the start position of the gene as position 1. This option lets users set the numbering to match their own convention:

Open Reading Frame (Initial CDS position in NCBI mRNA Accession): The initial start codon is considered as position 1.

First Position of the NCBI mRNA Accession: The first position of the mRNA Accession is considered as position 1.

What does "Translate all forward frames in the lister generated map" mean?

Every region of DNA has six possible reading frames, three in each direction. The visualization map of binding sites is configured to show only forward frames. When the user selects this option, the map displays all three forward frames with the encoded amino acids. This lets the user see whether the mutation shifts the reading frame.
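The idea of the three forward frames can be sketched in Python as follows. This is an independent illustration using the standard genetic code, not the code that produces the lister map; `translate_forward_frames` is a hypothetical name, and incomplete trailing codons are simply dropped.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_forward_frames(sequence):
    """Translate the three forward reading frames of a DNA sequence.

    Returns a list of three amino acid strings ('*' marks a stop codon,
    'X' an unrecognized codon).
    """
    seq = sequence.lower()
    frames = []
    for offset in range(3):
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        frames.append("".join(CODON_TABLE.get(codon, "X") for codon in codons))
    return frames


print(translate_forward_frames("atggcctaa"))      # ['MA*', 'WP', 'GL']
# Inserting one base after the start codon changes the first frame:
print(translate_forward_frames("atgagcctaa")[0])  # MSL
```

Comparing the frames before and after a mutation shows whether the downstream amino acids, and therefore the reading frame, have changed.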

What does "mRNA Accession No." mean?

The mRNA Accession Number is the accession number associated with the gene name. The user must enter an accession number that is present in the April 2003 draft of the UCSC genome assembly, as this system is based on that draft. The mRNA Accession Number of a gene can be found using the Genew database search engine or the UCSC Genome Browser. The accession number must not be a RefSeq accession number.

If you find that a particular accession number is missing in the April 2003 draft of the UCSC genome assembly, then you can make use of the option "Submit own sequence". This option enables you to submit your own sequence and make desired mutation(s). Hopefully, the server will be updated very soon, so that it is not limited to the April 2003 draft alone.

Why is Gene Name required in "Submit mRNA Accession #" option?

This is simply to provide the gene name in the results pages. The user can enter any gene name, but it will not be tested or verified. It is used to generate comprehensive information results only.

What does "Direction" mean?

Direction is the strand of the sequence pasted in the sequence text box. The user can specify either '+' or '-' strand.

What does "Sequence" mean?

This is the text box where users can paste their own sequence. The sequence is expected to contain only the characters a, g, c, or t. Any other characters found will be removed from the sequence.
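The clean-up described above can be approximated with the short Python sketch below. It is an assumption that uppercase letters are accepted and converted to lowercase; the guide only states that characters other than a, g, c, and t are removed. `clean_sequence` is a hypothetical name.

```python
import re


def clean_sequence(raw):
    """Keep only a, c, g and t; drop whitespace, digits and other symbols.

    Assumption: input is lowercased first, so 'ACGT' is kept as 'acgt'.
    """
    return re.sub(r"[^acgt]", "", raw.lower())


print(clean_sequence("ACGT nn-acg\n7t"))  # acgtacgt
```

Because removed characters shorten the sequence, it is worth checking the length after clean-up when a minimum number of bases is required.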

How much time does it take to analyze one mutation?

Depending on the option chosen (submitting your own sequence, a designated gene name, or an mRNA accession number), it takes approximately 30 to 60 seconds to analyze one mutation under normal load. A longer delay may be expected if the load is high.

Why does it take such a long time to analyze one small mutation?

The submitted mutation(s) / variant(s) are parsed and the base pairs where the changes take place are identified. All base pairs within the window range of those positions are then pulled out of the library file for that chromosome, which consists of millions of bases. Identifying and pulling out specific parts of the chromosome naturally leads to a delay.

Not to forget, "It's always Worth Waiting!"

What do the subdivisions "acceptor, donor, etc." in the results page mean?

The sites found by scanning with the various information weight matrices (Ri(b,l) matrices) are categorized as decreased, increased, or no change, depending on how their information content changes. The total sites subdivision contains all recognized sites whose information content is greater than the threshold set by the user. These categories are displayed under the subheading of their respective information weight matrix name.
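The grouping described above might look like the following Python sketch. Several details are assumptions made for illustration: sites are given as (coordinate, initial Ri, final Ri) tuples, a site is kept when either its initial or its final Ri reaches the threshold, and all four groups use the same filter. The function name `categorize_sites` is hypothetical.

```python
def categorize_sites(sites, threshold=0.0):
    """Group sites by the direction of their information content change.

    sites: iterable of (coordinate, initial_ri, final_ri) tuples.
    Assumption: a site is reported if either Ri value reaches the threshold.
    """
    groups = {"decreased": [], "increased": [], "no change": [], "total": []}
    for coordinate, initial_ri, final_ri in sites:
        if max(initial_ri, final_ri) < threshold:
            continue  # below the user's threshold, not reported
        groups["total"].append(coordinate)
        if final_ri < initial_ri:
            groups["decreased"].append(coordinate)
        elif final_ri > initial_ri:
            groups["increased"].append(coordinate)
        else:
            groups["no change"].append(coordinate)
    return groups


example = [(100, 5.0, 2.0), (120, 1.0, 4.0), (140, 3.0, 3.0), (160, -4.0, -2.0)]
print(categorize_sites(example)["total"])  # [100, 120, 140]
```

One such grouping would be produced for each selected information weight matrix.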

Explain different headings in the results tables.

Genomic Coordinate: The genomic coordinate number of the base where the information content is measured.

Position Relative to Natural Site: The relative distance of the base from the closest natural site.

Closest Natural Site: The genomic coordinate number of the closest natural site. Clicking this link opens a window showing the information content of all the natural sites of that particular mRNA accession.

Initial(Ri): Initial information content measured at the base before the mutation is made.

Final(Ri): Final information content measured at the base after the mutation is made.

ΔRi: Final(Ri) - Initial(Ri); change of information content obtained at the site due to a mutation or variant.

Fold change: A single bit difference in Ri value corresponds to at least a two-fold difference in binding site strength. Fold change indicates the change in binding affinity of two sites.

Fold Change = 2^(ΔRi), where ΔRi is the difference between the individual information contents of the two sites (wild type, mutant type).

% Binding (Final/Initial): Indicates the change of binding energy calculated as a percentage.

Initial(Z): Z score for this evaluation, assuming that individual information values form a Gaussian distribution.

Final(Z): Z score after mutation.

ΔZ: Change in Z score obtained at the site due to a mutation or variant.

Any dependencies?

This system uses the Delila system tools for the identification of potential sites.

Why am I supposed to submit at least 110 bases in the "Submit Own Sequence" option?

The submitted sequence is scanned by the weight matrices selected by the user. The acceptor and donor weight matrices are used by default. Each matrix scans twice its own length on either side of the base(s) where the change is made. Since the acceptor weight matrix is the longest (27 bases), it scans 2 × 27 = 54 bases on each side of the changed base, which adds up to about 110 bases.

How do you compute the fold change in binding affinity?

The fold change in binding affinity of two sites (wild type, mutant) is 2^(ΔRi), where ΔRi is the difference between their respective individual information contents.
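As a worked example of this formula, the Python sketch below computes the fold change from the initial and final Ri values. Reporting the magnitude together with a direction label is a presentation choice made for this example, and the function name `fold_change` is hypothetical.

```python
def fold_change(initial_ri, final_ri):
    """Return (fold, direction) for a change in information content.

    fold is 2 raised to the size of the change in bits, so a 7 bit
    decrease is reported as a 128-fold decrease.
    """
    delta_ri = final_ri - initial_ri
    fold = 2.0 ** abs(delta_ri)
    if delta_ri < 0:
        direction = "decrease"
    elif delta_ri > 0:
        direction = "increase"
    else:
        direction = "no change"
    return fold, direction


print(fold_change(10.0, 3.0))  # (128.0, 'decrease')
print(fold_change(3.0, 4.0))   # (2.0, 'increase')
```

A change of one bit therefore corresponds to a two-fold change in binding affinity.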

Architecture and Program Flow Chart of ASSEDA

The architecture diagram can be found here, and the program flow can be found here.

What does "Lower Threshold of Information Content" mean?

By default, the ASSEDA server will only report potential binding sites that have a calculated bit score of 0 bits or more. We allow users to change this minimum in case they want to: 1) Increase the threshold to filter regions with a high number of potential binding sites, or 2) Decrease the threshold below 0 bits to investigate very weak splice binding sites that may be supported by splicing regulatory elements.

What does "Mutation Coordinate is specified relative to the beginning of either the:" mean?

CDS (coding sequence) denotes the section that describes the gene's open reading frame (ORF), the portion of the sequence that codes for a protein product. Most authors who report a mutation treat the initial start codon as position 1, whereas some publications treat the start position of the gene as position 1. This option lets users set the numbering to match their own convention:

Open Reading Frame (Initial CDS position in NCBI mRNA Accession): The initial start codon is considered as position 1.

First Position of the NCBI mRNA Accession: The first position of the mRNA Accession is considered as position 1.
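The relationship between the two numbering conventions can be sketched in Python as follows. The sketch assumes 1-based positions, a known CDS start position within the mRNA accession, and positions at or after the start codon only. The function names and the example CDS start of 201 are hypothetical.

```python
def cds_to_mrna(cds_position, cds_start):
    """Convert a position counted from the start codon to mRNA numbering.

    cds_start: 1-based mRNA position of the first base of the start codon.
    """
    if cds_position < 1:
        raise ValueError("only positions at or after the start codon are handled")
    return cds_start + cds_position - 1


def mrna_to_cds(mrna_position, cds_start):
    """Convert an mRNA position to numbering from the start codon."""
    if mrna_position < cds_start:
        raise ValueError("position lies upstream of the start codon")
    return mrna_position - cds_start + 1


# Hypothetical accession whose start codon begins at mRNA position 201.
print(cds_to_mrna(10, 201))   # 210
print(mrna_to_cds(210, 201))  # 10
```

Choosing the wrong convention shifts every coordinate by the length of the sequence upstream of the start codon, so the option should match the numbering used in the source publication.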

What does "Do not produce Visualization map of binding sites" mean?

This option enables the user to generate results without the visualization map (i.e., sequence walkers).

What does "Consideration of nearest ESE/ISS" mean?

When calculating total exon information content (when Molecular Phenotype Prediction by Exon Definition is selected), splicing regulatory elements are not accounted for by default. If the user suspects that a mutation is altering an ESE/ISS, it can be included in the calculation (currently, only SF2/ASF and SC35 sites are available). As a single mutation can lead to multiple redundant changes, only one altered site is considered (i.e., if two sites are weakened, the one that was initially stronger is considered, as it is the one most likely to be used).
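The selection rule in the example above (of several weakened sites, use the one that was initially strongest) can be sketched in Python as shown below. Only the weakened case given in this guide is covered. The tuple layout and the name `pick_weakened_site` are assumptions for illustration.

```python
def pick_weakened_site(sites):
    """Pick the weakened regulatory site with the highest initial Ri.

    sites: list of (name, initial_ri, final_ri) tuples.
    Returns None if no site was weakened by the mutation.
    """
    weakened = [site for site in sites if site[2] < site[1]]
    if not weakened:
        return None
    return max(weakened, key=lambda site: site[1])


candidates = [
    ("SF2/ASF", 4.0, 2.0),   # weakened
    ("SC35", 6.5, 5.0),      # weakened, initially the strongest
    ("SF2/ASF", 3.0, 5.0),   # strengthened, ignored here
]
print(pick_weakened_site(candidates))  # ('SC35', 6.5, 5.0)
```

Picking a single site keeps redundant changes caused by the same mutation from being counted more than once.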

What does "Treat splicing regulatory elements as:" mean?

When calculating total exon information content, ESE/ISS consideration can be selected by the user (above). This option allows the user to treat these splicing regulatory elements as exonic splicing enhancers (the exonic enhancer strength is added to the calculation) or intronic splicing silencers (altered intronic regulatory elements are subtracted from the calculation). A second gap surprisal is also factored into the calculation; it is specific to the regulatory binding site type (SF2/ASF or SC35) and to whether ESE or ISS is selected.

What does "Threshold for Molecular Phenotype to declare potential exon skipping" mean?

Within the Phenotype Prediction tab, the second tab (Isoform Structure) will display a diagram for each predicted exon splice form. When a natural site is weakened, exon skipping can occur. ASSEDA will draw the exon skipping splice form if the mutation 1) abolishes the natural site (final Ri below 1.6 bits) or 2) leads to a natural site decrease of at least 7 bits (a 128-fold decrease in binding affinity). This option allows the user to change the 7 bit value. This exon skipping splice form will also appear in the Custom UCSC Track tab.
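The two conditions above can be written as a small Python predicate. This is a sketch of the rule as stated in this guide, not ASSEDA's implementation. The constant and function names are hypothetical, and requiring the site to be weakened at all is an assumption taken from the sentence "When a natural site is weakened".

```python
ABOLISHED_BELOW_BITS = 1.6    # final Ri below this counts as abolished
DEFAULT_DECREASE_BITS = 7.0   # default threshold the user may change


def predicts_exon_skipping(initial_ri, final_ri,
                           decrease_threshold=DEFAULT_DECREASE_BITS):
    """Return True if the exon skipping splice form would be drawn."""
    weakened = final_ri < initial_ri
    abolished = final_ri < ABOLISHED_BELOW_BITS
    large_decrease = (initial_ri - final_ri) >= decrease_threshold
    return weakened and (abolished or large_decrease)


print(predicts_exon_skipping(8.0, 1.0))   # True: site abolished
print(predicts_exon_skipping(12.0, 5.0))  # True: decrease of 7 bits
print(predicts_exon_skipping(10.0, 5.0))  # False: decrease of 5 bits
```

Lowering `decrease_threshold` makes the exon skipping form appear for smaller decreases in natural site strength.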

The Logic and Formulation of Exon Definition for Splice and Splicing Regulatory Sites with Negative Information Content

There have been recent changes to the Exon Definition formulation with regard to the impact of negative values. This update will not affect individual Ri values, but may affect previous computations of Ri,total involving sites that were abolished. The Logic and Formulation of Exon Definition for Splice and Splicing Regulatory Factors is described in detail here.