Harindu's Personal Blog: May 2014

Sunday, May 4, 2014

Genome Database Systems - Deeper Analysis

Hey guys! As I promised yesterday, today I will be giving you a deeper analysis of genome database systems using a few criteria.

Deeper Analysis into Genome Database Systems

Typologies of recoverable data

Genomic databases contain a large set of data types. The data recoverable from these databases can be divided into six categories. The first is genomic segments, which include all nucleotide subsequences that are meaningful from a biological point of view, such as genes and clones. The second is maps, which “result from projects that produced the sequencing and mapping of the DNA of diverse organisms, such as the Human Genome Project”. The third is variants and mutations, which are alterations that occur at low frequency in DNA and proteins. The fourth is pathways, which describe the interactions of genes. The fifth is expression data, which are experimental data about the different levels of expression of genes. The final category is bibliographic references, which are repositories of relevant biological literature.

Database schema types

Most genome databases follow the relational data model. There are five main types of database schema in genome databases: the unspecific schema, GUS, the Genolist schema, the Chado schema and the Pathway Tools schema. The unspecific schema category covers all relational schemas, and any other schemas, that do not belong to one of the named types. The Genomics Unified Schema, or GUS, is a relational schema suited to large sets of information, and it is widely used in genome database systems. The Genolist schema is also a relational schema, and it is used to manage bacterial genome database systems. The Chado schema is a schema which has “an extensible module structure”. It is used in the Beetle Bees Genome Database. The Pathway Tools schema is an object schema based on an ontology that defines a large set of classes, attributes and relations for modelling genome databases.

Query types

In genome database systems we can see three main types of queries: simple queries, batch queries and analysis queries. A simple query recovers data “satisfying some standard search parameters.” A batch query consists of a bunch of simple queries that are processed simultaneously. The most relevant and complex type is the analysis query, which can be further divided into two categories: pattern search and sequence similarity. A pattern search takes a pattern and a DNA sequence as input and returns “those subsequences of DNA sequence, which turn out to be most strongly related to the input pattern”. A sequence similarity query takes a DNA sequence as input and returns those “sequences found in the database that are the most similar to the input sequence”.
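To make the two analysis query types a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the function names (`pattern_search`, `similarity`, `most_similar`) are made up, the pattern search is a plain regular-expression match over IUPAC ambiguity codes, and the similarity score is a simple k-mer Jaccard index. Real genome databases rely on far more sophisticated tools (BLAST, for example), so treat this as a toy.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pattern_search(pattern, sequence):
    """Return (start, match) pairs for every occurrence of an IUPAC pattern.

    A lookahead is used so that overlapping occurrences are reported too.
    """
    regex = "".join(IUPAC[ch] for ch in pattern.upper())
    matches = re.finditer(f"(?=({regex}))", sequence.upper())
    return [(m.start(), m.group(1)) for m in matches]


def kmers(sequence, k=3):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def similarity(a, b, k=3):
    """Toy similarity score: Jaccard index of the two k-mer sets (0.0 to 1.0)."""
    ka, kb = kmers(a.upper(), k), kmers(b.upper(), k)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def most_similar(query, database, top=2):
    """Return the names of the `top` database sequences most similar to query.

    `database` is a hypothetical dict mapping a sequence name to its residues.
    """
    ranked = sorted(database.items(),
                    key=lambda item: similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top]]
```

For example, `pattern_search("TATAWA", some_dna)` looks for TATA followed by A or T and then A, while `most_similar(query, {"s1": "...", "s2": "..."})` ranks stored sequences against the query.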

Query methods

In genome database systems there are four main query methods: text based queries, graphical interaction based queries, sequence based queries and query language based queries. The most common are the text based ones, in which the user specifies sets of words and can combine them with logical operators. The next most common method is the graphical interaction based query. “A large set of genomic information can be visualized by physical and genetic maps, whereby queries can be formulated by interacting with graphical objects representing the annotated genomic data.” In sequence based queries, a nucleotide sequence and a data mining algorithm are given as inputs, and alignments and similar information are returned as outputs. Query language based queries use languages such as SQL for low-level interaction with the database.
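Here is a small sketch of how a text based query with logical operators might be translated into a query language based one. Everything in it is an assumption made for illustration: the `gene` table, its columns, the placeholder rows and the `text_query` helper are hypothetical and do not come from any real genome database. It uses Python's built-in `sqlite3` module.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical `gene` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (name TEXT, organism TEXT, description TEXT)")
    # Placeholder rows, invented purely for this example.
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
        ("gene_a", "organism_x", "kinase involved in cell cycle"),
        ("gene_b", "organism_x", "membrane transport kinase"),
        ("gene_c", "organism_y", "ribosomal protein"),
    ])
    conn.commit()
    return conn


def text_query(conn, all_of=(), none_of=()):
    """Keyword search over descriptions.

    Every word in `all_of` must appear (AND) and no word in `none_of`
    may appear (AND NOT). Words are passed as bound parameters, so the
    SQL text itself never contains user input.
    """
    clauses, params = [], []
    for word in all_of:
        clauses.append("description LIKE ?")
        params.append(f"%{word}%")
    for word in none_of:
        clauses.append("description NOT LIKE ?")
        params.append(f"%{word}%")
    where = " AND ".join(clauses) or "1"  # no keywords: match everything
    rows = conn.execute(
        f"SELECT name FROM gene WHERE {where} ORDER BY name", params)
    return [row[0] for row in rows]
```

A call such as `text_query(conn, all_of=["kinase"], none_of=["membrane"])` then plays the role of the text query "kinase AND NOT membrane".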

Export Formats

Genome databases use several formats to represent query results. HTML, XML, FASTA and flat files are some examples.

From the next post onward I hope to discuss some of the major research done in the field of genome database systems. Till then, take care everyone. Good bye!
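As an illustration of one of these formats, FASTA stores each sequence as a header line starting with ">" followed by the residues wrapped over several lines. The sketch below is a minimal writer; the `to_fasta` name, the record layout (identifier, description, sequence) and the default line width of 60 are my own choices for the example, not something prescribed by any particular database.

```python
def to_fasta(records, width=60):
    """Format (identifier, description, sequence) tuples as FASTA text.

    Each record becomes a '>' header line followed by the sequence
    split into lines of at most `width` characters.
    """
    lines = []
    for identifier, description, sequence in records:
        lines.append(f">{identifier} {description}".rstrip())
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)
```

Calling `print(to_fasta([("seq1", "demo record", "ACGTACGTAC")], width=4), end="")` would write the header line followed by the residues in chunks of four.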

Saturday, May 3, 2014

Genome Database Systems - Important areas in Data Management

Hey guys, today I will be discussing a very important aspect of genome database systems: data management.

Important areas in Data Management in Genome Database Systems

Nonstandard and unstructured data.

Most genome data are nonstandard and unstructured. It is not clear whether every position in a DNA sequence should be treated as a data object. There are structural data such as proteins, but they need a 3D representation. Techniques from GIS (geographic information systems) and CAD (computer aided design), as well as from geometric modeling, need to be applied for efficient indexing and querying. Scientists currently use tools like BLAST or PSI-BLAST to do pattern searches, and this capability needs to be integrated into the DBMS.

Complex query processing.

As discussed under the characteristics of genome databases, queries here often depend on the similarity of sequences, graphs and 3-D shapes, which makes them really hard to implement. Relational and object DBMSs are not capable of processing these types of queries on their own. Therefore DBMS developers have implemented path oriented queries and specialized libraries to cater to this requirement.

Data interpretation and Meta data management.

Adequate mechanisms should be implemented in the database system to manage metadata, because scientists need enough metadata to interpret the data, and that metadata has to be kept well maintained. Several techniques have been used for this, such as annotations and ontologies.

Data integration across related databases.

Various genome databases are interrelated, and there should be a proper management tool to handle these kinds of cross-database links. Currently there are no uniform interfaces, and no consolidation of data has been done, that would let information be accessed in an integrated fashion in any given context or by any particular classification.

Need for a set of uniform data management solutions.

There is a tremendous need for this kind of uniform data management solution because of “typical problems in databases of heterogeneous data integration - multiple models, multiple formats, different underlying files and database systems, and a large amount of context-sensitive semantic content.”

This was a brief explanation of the key areas of data management in genome database systems. In the next post I hope to give you guys a much deeper analysis of genome database systems. Till then, good bye folks. :D

Friday, May 2, 2014

Genome database systems - Overview

Hi guys, I am back to blogging after some time. This time I thought of writing about a very demanding and growing area of bioinformatics: genome databases. For the last six months I have been working on genome database systems for my independent study. In this post I would like to give a small overview of genome database systems and some of their characteristics.

Overview of Genome Database Systems

The study of genes and proteins has become an extremely important area of modern biology, and these fields are better known as genomics and proteomics. Large amounts of biological data are used frequently in these areas, so the databases which contain these data play a vital role in biology and medicine. The term genome refers to the total amount of genetic code present in the cells of an organism. Genomics consists of two component areas, namely structural genomics and functional genomics. Genome databases store this information and, unlike gene databases, they contain both coding and non-coding intergenic sequences. Following are some examples of genome databases.

Saccharomyces genome database
Mouse genome database
Human genome database
European mutant mouse pathology database
MITOMAP
Kyoto Encyclopedia of Genes and Genomes

Characteristics of Genome Database Systems

Data are highly complex when compared with most other domains and applications.

Compared with the data types of most other domains, genome data are highly complex. This can be explained using the following example. The MITOMAP database stores the human mitochondrial genome. “This single genome is a small, circular piece of DNA encompassing information about 16,569 nucleotide bases; 52 gene encoding messenger RNA, ribosomal RNA, and transfer RNA; 1000 known population variants; over 60 known disease associations.” These types of data should be stored in a way that can be processed by computers and can also be handled by biologists. At first, relational DBMS and object oriented DBMS approaches were taken to model this data, but scientists then moved on to their own ways of representing it. Currently, however, relational DBMSs are used for the sake of long-term maintenance and ease of curation.

Schemas change at a rapid pace.

Because schemas change so quickly, released databases need features that support data object migration and schema evolution in order to keep information flowing smoothly. Most relational and object databases have a fixed schema, however. To cope with this, some databases, for example GenBank, publish a new schema release every two or three years.
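To show what schema evolution with data migration can look like in a relational system, here is a minimal sketch using Python's built-in `sqlite3` module. It is an assumption-laden toy: the `dna_sequence` table, the `topology` column and the two-step version history are hypothetical, and this is not how GenBank or any other real database manages its releases. It uses SQLite's `PRAGMA user_version` to remember which schema version a database file is at.

```python
import sqlite3


def migrate(conn):
    """Bring a database up to schema version 2, one step at a time.

    Version 1 creates a hypothetical `dna_sequence` table; version 2 adds
    a `topology` column, and rows that already exist pick up its default.
    """
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    if version < 1:
        conn.execute(
            "CREATE TABLE dna_sequence ("
            "id INTEGER PRIMARY KEY, residues TEXT NOT NULL)")
        conn.execute("PRAGMA user_version = 1")
    if version < 2:
        conn.execute(
            "ALTER TABLE dna_sequence "
            "ADD COLUMN topology TEXT DEFAULT 'linear'")
        conn.execute("PRAGMA user_version = 2")
    conn.commit()
```

Running `migrate` on a database that is already at version 2 does nothing, so it is safe to call it every time the application starts.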

Representations of the same data by different biologists will likely be different.

Therefore there should be mechanisms to maintain the uniformity of the database. To achieve this, queries which can interrelate and link different schemas have been used.

The amount and range of variability in data is high. Hence, biological systems must be flexible in handling data types and values.

Defining and executing complex queries.

These databases are mainly used by biologists, who often do not have deep technical knowledge of how queries are structured or how the data are stored. Therefore simple interfaces with integrated query templates should be provided.

So guys, I hope this information will be helpful if you are a bioinformatics enthusiast like me and are craving more insight into this field. In the next post I would like to discuss genome database systems in more detail. Till then, good bye. Learn and empower yourselves.
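A query template can be as simple as a named, parameterized SQL statement that the interface fills in for the user. The sketch below is only one possible way to do it, and all of the names in it (`QUERY_TEMPLATES`, `run_template`, the `gene` table with its `organism` and `length_bp` columns, the placeholder rows) are hypothetical.

```python
import sqlite3

# Named templates: the biologist picks a name and supplies the values,
# without ever writing SQL. Table and column names are hypothetical.
QUERY_TEMPLATES = {
    "genes_by_organism":
        "SELECT name FROM gene WHERE organism = :organism ORDER BY name",
    "genes_longer_than":
        "SELECT name FROM gene WHERE length_bp > :min_length ORDER BY name",
}


def demo_connection():
    """Create an in-memory database with placeholder rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (name TEXT, organism TEXT, length_bp INTEGER)")
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
        ("gene_a", "organism_x", 1200),
        ("gene_b", "organism_x", 300),
        ("gene_c", "organism_y", 900),
    ])
    conn.commit()
    return conn


def run_template(conn, template_name, **params):
    """Run a named template with the given parameter values."""
    sql = QUERY_TEMPLATES[template_name]
    return [row[0] for row in conn.execute(sql, params)]
```

With this in place, something like `run_template(conn, "genes_longer_than", min_length=500)` is all a form-based interface would need to call behind the scenes.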