Deeper Analysis into Genome Database Systems
- Typologies of recoverable data
Genomic databases contain a large set of data types. The data recoverable from these
databases can be divided into six categories. The first is Genomic segments, which
include all the nucleotide subsequences that are meaningful from a biological point
of view, such as genes and clones. The second is Maps, which are the “result from
projects that produced the sequencing and mapping of the DNA of diverse organisms,
such as the Human Genome Project”. The third is Variants and mutations, which are
alterations that occur at low frequency in DNA and proteins. The fourth is Pathways,
which describe the interactions of genes. The fifth is Expression data, which are
experimental data about the different levels of expression of genes. The final
category is Bibliographic references, which are the repositories of relevant
biological literature.
- Database schema types
Most genome databases follow the relational data model. There are five main types
of database schema in genome databases: the unspecific schema, GUS, the Genolist
schema, the Chado schema and the Pathway Tool schema. The unspecific schema category
groups relational schemas and other schemas that do not belong to any of the four
named types. The Genomic Unified Schema, or GUS, is a relational schema suited to
large sets of information, and it is used mainly in genome database systems. The
Genolist schema is also a relational schema, and it is used to manage bacterial
genome database systems. The Chado schema is a schema which has “an extensible
module structure”. It is used in the Beetle Bees Genome Database. The Pathway Tool
schema is an object schema based on an ontology that defines a large set of classes,
attributes and relations to model genome databases.
- Query types
Genome database systems support three main types of query: the simple query, the
batch query and the analysis query. A simple query recovers data “satisfying some
standard search parameters.” A batch query consists of a set of simple queries that
are processed simultaneously. The most relevant and complex type is the analysis
query, which can be further divided into two categories: pattern search and sequence
similarity. A pattern search takes a pattern and a DNA sequence as input and returns
“those subsequences of DNA sequence, which turn out to be most strongly related to
the input pattern”. A sequence similarity query takes a DNA sequence as input and
returns those “sequences found in the database that are the most similar to the
input sequence”.
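To make the two analysis query types concrete, here is a small Python sketch. It is only an illustration under my own assumptions: the function names, the toy sequences and the use of regular expressions and difflib are all mine. Real genome databases typically rely on dedicated alignment tools such as BLAST for similarity search, so treat the scoring here as a stand-in.

```python
import re
from difflib import SequenceMatcher


def pattern_search(pattern: str, dna: str) -> list[tuple[int, str]]:
    """Return (start position, matched subsequence) for every match of the pattern.

    The pattern is a regular expression; the lookahead lets overlapping
    matches be reported too.
    """
    return [(m.start(), m.group(1)) for m in re.finditer(f"(?=({pattern}))", dna)]


def most_similar(query: str, database: dict[str, str], top: int = 2) -> list[tuple[str, float]]:
    """Rank the stored sequences by a simple similarity ratio to the query.

    SequenceMatcher is a generic text-matching tool used here as a stand-in
    for a proper sequence alignment algorithm.
    """
    scored = [
        (name, SequenceMatcher(None, query, seq, autojunk=False).ratio())
        for name, seq in database.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]


# Toy data, made up for this example.
toy_dna = "GGTATAAACCTATATAGG"
toy_database = {"seqA": "ACGTACGT", "seqB": "ACGTTCGT", "seqC": "TTTTGGGG"}

hits = pattern_search("TATA[AT]A", toy_dna)
ranking = most_similar("ACGTACGT", toy_database)
```

Here `hits` holds the positions and text of the subsequences that match the pattern, and `ranking` lists the stored sequences closest to the query, best match first.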
- Query methods
Genome database systems offer four main query methods: text based queries,
graphical interaction based queries, sequence based queries and query language
based queries. Text based queries are the most common. With this method the user
specifies sets of words and can combine them with logical operators. The next most
common method is the graphical interaction based query. “A large set of genomic
information can be visualized by physical and genetic maps, whereby queries can be
formulated by interacting with graphical objects representing the annotated genomic
data.” In sequence based queries, a nucleotide sequence and a data mining algorithm
are given as inputs, and alignments and similar information are returned as
outputs. Query language based queries use languages such as SQL to interact with
the database.
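The difference between a text based query and a query language based query can be sketched with Python's built-in sqlite3 module. Everything in this example is an assumption made for illustration: the `gene` table, its columns, the gene names and the helper functions are hypothetical and are not taken from any real genome database.

```python
import sqlite3

# In-memory toy database with a hypothetical "gene" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (name TEXT, organism TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [
        ("geneA", "Escherichia coli", "DNA repair protein"),
        ("geneB", "Escherichia coli", "membrane transport protein"),
        ("geneC", "Bacillus subtilis", "DNA repair helicase"),
    ],
)


def text_query(conn, all_of=(), none_of=()):
    """Text based query: every word in all_of must appear (AND) and no word
    in none_of may appear (NOT) in the description."""
    clauses, params = [], []
    for word in all_of:
        clauses.append("description LIKE ?")
        params.append(f"%{word}%")
    for word in none_of:
        clauses.append("description NOT LIKE ?")
        params.append(f"%{word}%")
    where = " AND ".join(clauses) or "1"  # no words given: match every row
    sql = f"SELECT name FROM gene WHERE {where} ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


def sql_query(conn, organism):
    """Query language based access: the SQL statement is written directly."""
    rows = conn.execute(
        "SELECT name FROM gene WHERE organism = ? ORDER BY name", (organism,)
    )
    return [row[0] for row in rows]


repair_not_helicase = text_query(conn, all_of=["DNA", "repair"], none_of=["helicase"])
e_coli_genes = sql_query(conn, "Escherichia coli")
```

In the text based case the user only supplies words and operators, and the system builds the query. In the query language case the user works with the schema directly.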
- Export formats
Genome databases use several formats to represent query results. HTML, XML, FASTA
and flat files are some examples.
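As a quick illustration of one of these formats, the sketch below writes sequences as FASTA text: a header line starting with ">" followed by the sequence wrapped over one or more lines. The function name, the record names and the default line width of 60 are my own choices for the example.

```python
def to_fasta(records: dict[str, str], width: int = 60) -> str:
    """Format {header: sequence} pairs as FASTA text.

    Each record becomes a ">" header line followed by the sequence,
    wrapped to at most `width` characters per line.
    """
    lines = []
    for header, sequence in records.items():
        lines.append(f">{header}")
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)


# Toy record, made up for this example.
fasta_text = to_fasta({"seq1 toy record": "ACGTACGTAC"}, width=4)
```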
From the next post onward I hope to discuss some of the major research done in the field of genome database systems. Till then, take care everyone. Goodbye!