LonelyWords

RNA Structure Analysis

A large part of bioinformatics research deals with RNA structure analysis. Because RNA strands are relatively simple, it is feasible to compute possible RNA secondary structure foldings, which are remarkably useful to biologists. A scientist is often interested in more than one candidate folding, since in real systems several different foldings may occur. The relative frequency with which a given structure occurs depends on its thermodynamic energy level.
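
The link between energy level and relative frequency can be illustrated with a small sketch. It assumes that the candidate foldings follow a Boltzmann distribution, that free energies are given in kcal/mol, and that the temperature is 37 °C; the function name and the example energies are made up for illustration and do not come from any particular tool.

```python
import math

# Gas constant in kcal/(mol*K); free energies are assumed to be in kcal/mol.
GAS_CONSTANT = 0.0019872


def boltzmann_frequencies(energies, temperature_kelvin=310.15):
    """Return the relative frequency of each folding from its free energy."""
    rt = GAS_CONSTANT * temperature_kelvin
    # Shift by the minimum energy for numerical stability; ratios are unchanged.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / rt) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical free energies (kcal/mol) of three candidate foldings.
candidate_energies = [-12.3, -11.8, -9.5]
frequencies = boltzmann_frequencies(candidate_energies)
for energy, freq in zip(candidate_energies, frequencies):
    print(f"{energy:6.1f} kcal/mol -> {freq:.3f}")
```

Under these assumptions the folding with the lowest energy is the most frequent one, but foldings with slightly higher energy still occur with noticeable frequency, which is why more than one candidate is usually of interest.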

In the common biological use case, RNA secondary structure analysis is only one step in an iterative process. In most cases biologists are not interested in the secondary structure or in secondary structure alignments as a final result, but as an intermediate result that serves as a puzzle piece to support a specific hypothesis or to point towards another. RNA structure formats therefore have to be especially well defined.

A widespread, simple but powerful notation for RNA structures is the Vienna DotBracket format, which the secondary structure formats described here use for structure description. While this may require an extra parsing step during data extraction, the benefits of simplicity and human readability prevail.
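
The extra parsing step is small. The following sketch shows one way to turn a plain DotBracket string into a list of base pairs using a stack. It handles only round brackets and dots; the function name is an illustrative choice and not part of any of the tools mentioned here.

```python
def parse_dot_bracket(structure):
    """Return a sorted list of (i, j) base pairs (0-based) from a dot-bracket string."""
    stack = []
    pairs = []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            # Remember the opening position until its partner is found.
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {position}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


pairs = parse_dot_bracket("((..((...))..))")
print(pairs)
```

Each closing bracket is matched with the most recently opened bracket, so properly nested structures are recovered in a single pass over the string.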

RNAStructML

RNAStructML is a format for RNA secondary structure data, e.g. the output of secondary structure predictors like RNAshapes or Mfold. Tools that consume RNA secondary structure data, such as RNAforester, would use it as an input source.

RNAStructML can contain any number of independent sets of RNA structure data. Each set can contain sequence data (based on SequenceML), program data (derived from hobitTypes.xsd), and a positive number of shape and structure data sets each. While a structure defines a specific secondary structure, a shape is more general and defines a specific type of structure, in other words a structure class.

RNAStructML is easy to understand, which is a big advantage over existing formats like RNAML. The reason is that secondary structure information is stored as DotBracket strings, which combine a simple syntax with powerful semantics. Readability is an often underestimated quality factor. It carries additional weight in bioinformatics, where users of bioinformatics tools, and sometimes even their programmers, commonly lack a computer science education and therefore need easy-to-understand data formats.

RNAStructML can easily be extended with additional information through several 'anyAttribute' attribute fields. This makes RNAStructML a very flexible format that can easily be adapted to specific needs, which are not uncommon in bioinformatics.


One widely accepted use of RNA tools such as RNAshapes or RNAfold is the proposal of RNA secondary structures, based on thermodynamic principles or, more commonly, on string pairing algorithms. Depending on the research topic, scientists may be interested only in the consensus structure or, more generally, in all possible structure foldings. In a common RNA analysis workflow, generating the secondary structure is not the last step, but only one program call in a longer analytic process.

Furthermore, a format should be both easy to understand and human readable to gain acceptance in a worldwide community.

RNAStructML basically contains a list of RNA secondary structure elements, each composed of three data elements belonging to the structure information:

  * The first subelement contains the sequence data. Any RNA secondary structure
    is based on some RNA sequence, but sequence information is optional
    in RNAStructML. The sequence element is based on the SequenceML structure, but does not
    allow amino acid sequences.
  * The second element is a set of structures and shapes. While a structure
    element represents a specific structure, a shape is more general. Shapes
    represent structure classes and define the general type of structures.
    The data structure allows and encourages referencing from structures to
    shapes using generated shape ids. The RNA secondary structure and the shape
    information are stored using a Vienna-DotBracket-like description, making both
    fields very human readable. Basically, a string consisting of brackets and dots
    describes the structure. Opening and closing brackets encode corresponding
    bonds, while dots encode unpaired nucleotides. Additional bracket types can
    encode pseudoknot structures. Shapes are described by underscores and
    higher-order bracket pairs. A bracket pair in a shape stands for any number of pairing
    bases, while an underscore stands for a region of unpaired bases of length > 1.
  * The third element contains program information. This field allows describing
    which tools were used to generate the results. The program call,
    version, name and date of execution can all be stored here.
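
The three parts described above can be mirrored in a small in-memory data model. The sketch below is not the RNAStructML schema: all class and field names are hypothetical, the example shape string and the program values are placeholders, and the length check assumes one DotBracket symbol per nucleotide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Shape:
    shape_id: str   # generated id that structures can refer to
    notation: str   # shape string made of brackets and underscores


@dataclass
class Structure:
    dot_bracket: str                 # one symbol per nucleotide
    shape_ref: Optional[str] = None  # id of the shape this structure belongs to


@dataclass
class ProgramInfo:
    name: str
    version: str
    call: str
    date: str


@dataclass
class StructureRecord:
    structures: List[Structure]
    shapes: List[Shape] = field(default_factory=list)
    sequence: Optional[str] = None         # optional, nucleotides only
    program: Optional[ProgramInfo] = None

    def problems(self) -> List[str]:
        """Collect dangling shape references and length mismatches."""
        found = []
        known_ids = {shape.shape_id for shape in self.shapes}
        for index, structure in enumerate(self.structures):
            if structure.shape_ref is not None and structure.shape_ref not in known_ids:
                found.append(f"structure {index}: unknown shape id {structure.shape_ref!r}")
            if self.sequence is not None and len(structure.dot_bracket) != len(self.sequence):
                found.append(f"structure {index}: length differs from sequence")
        return found


# Placeholder values for illustration only.
record = StructureRecord(
    sequence="GGGAAACCC",
    shapes=[Shape(shape_id="shape1", notation="[]")],
    structures=[Structure(dot_bracket="(((...)))", shape_ref="shape1")],
    program=ProgramInfo(name="example-tool", version="0.0",
                        call="example-tool input.fa", date="2000-01-01"),
)
print(record.problems())
```

Keeping shapes in their own list and letting structures point to them by id follows the referencing scheme described above, so several structures can share one shape without repeating it.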

RNAStructAlignmentML

RNAStructAlignmentML is a format for storing RNA secondary structure alignments as computed by, e.g., the RNAforester tool. Structure alignments are a 'hot topic' in bioinformatics these days, since the approach is closer to real biological systems. This is because the chemical properties of a given molecule depend to a greater extent on its general structure than on its sequence.

  * RNA structure alignment data for tools like RNAforester
  * can contain any number of independent sets of RNA structure alignment data
  * based heavily on RNAStructML
  * a data set can contain
     * at least two sequence data sets (based on AlignmentML), each with a secondary structure element (at least the structure must exist!)
     * program data (same as RNAStructML, derived from hobitTypes.xsd)
     * a consensus shape element
  * great human readability: structure in alignedDotBracket
  * same flexibility as RNAStructML
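
The requirements listed above can be checked with a few lines of code. The sketch below is only an illustration: the class and function names are hypothetical, and it assumes that aligned sequences and aligned DotBracket strings mark gaps with '-' in the same columns, which the format description here does not confirm.

```python
from dataclasses import dataclass
from typing import List

GAP = "-"  # assumed gap symbol in aligned sequences and structures


@dataclass
class AlignedRow:
    name: str
    aligned_sequence: str
    aligned_structure: str


def check_alignment(rows: List[AlignedRow]) -> List[str]:
    """Return a list of problems found in one alignment data set."""
    problems = []
    if len(rows) < 2:
        problems.append("a data set needs at least two aligned rows")
    lengths = {len(r.aligned_sequence) for r in rows}
    lengths |= {len(r.aligned_structure) for r in rows}
    if len(lengths) > 1:
        problems.append("aligned rows differ in length")
    for row in rows:
        for seq_symbol, struct_symbol in zip(row.aligned_sequence, row.aligned_structure):
            # Assumption: a gap in the sequence is also a gap in the structure.
            if (seq_symbol == GAP) != (struct_symbol == GAP):
                problems.append(f"{row.name}: gaps differ between sequence and structure")
                break
    return problems


def ungap(text: str) -> str:
    """Remove gap symbols to recover the unaligned string."""
    return text.replace(GAP, "")


rows = [
    AlignedRow("a", "GGGAAACCC", "(((...)))"),
    AlignedRow("b", "GGG-AACCC", "(((-..)))"),
]
print(check_alignment(rows))
print(ungap(rows[1].aligned_structure))
```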

RNAStructAlignmentML is a format for storing alignments of RNA secondary structures. This format is based on RNAStructML, but extends it with a great...