Career in Bioinformatics

Part A: Mapping the genome
Part B: Computing the genome

Part B: Computing the genome


The deluge of data generated by the Human Genome Project (HGP) and other genomic research presents a broad array of commercial opportunities. One such opportunity is Bioinformatics: information control for the Biotechnology industry. The successful mapping of the human genome has released detailed, complex and voluminous data that needs to be read and analyzed correctly for the initial research to make headway. For example, the National Center for Biotechnology Information processes nearly 3 million requests a day from biologists and other researchers. Compounding this is the fact that there are as many kinds of biological data as there are experiments. Currently gene and genome sequences are the most abundantly collected data types, followed by protein atomic coordinates, DNA sequences, etc. Industries involved in Genomics, Pharmaceutics, Proteomics, Gene Expression, Genotyping etc. are heavily dependent on the timely development of computational analysis and effective data management.

It is in these areas that the industries spawned by the Human Genome Project are looking for the services of IT professionals or Bioinformaticians: professional data analysts who can work with the avalanche of data generated by the experimental biological community and by a growing number of data factory projects, e.g., genome sequencing projects. Their work involves designing sophisticated databases that can accurately represent map information (linkage, STSs, physical location, disease loci) and sequences (genomic, DNA, protein), and linking them to each other and to bibliographic text databases of the scientific and medical literature; improving database design, software for database access and manipulation, and data-entry procedures; and mining the database to develop new hypotheses, new models of how biological systems function, and even rules and patterns that can be used to analyze data sets. Companies like IBM, Satyam and Wipro in India, seeking a role in the information revolution with DNA at its core, are already extending their IT services to companies interested in the potential for targeting and applying genome data. Apart from the bellwethers, dozens of small companies have sprung up to sell information, technologies, and services to facilitate basic research into genes and their functions. These new entrepreneurs offer an abundance of genomic services and applications, including additional databases with DNA sequences from humans, animals, plants, and microbes.

Key Bioinformatic Areas
There are a variety of areas in the still-emerging territory of Bioinformatics:

Seamless High Performance Computing: Megabases of DNA sequence being analyzed each day will strain the capacity of existing supercomputing centers. Interoperability between high-performance computing centers will be needed to provide the aggregate computing power, managed through the use of sophisticated resource management tools. The system must be fault-tolerant to machine and network failures so that no data or results are lost.

Sequence Annotation: Computers can be used very effectively to indicate the location of genes and of regions that control the expression of genes, and to discover relationships between each new sequence and other known sequences from many different organisms. The process is referred to as sequence annotation. Annotation (the elucidation and description of biologically relevant features in the sequence) is the essential prerequisite before the genome sequence data can be useful, and the quality with which annotation is done will directly affect the value of the sequence.
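
As a rough illustration of what "indicating the location of genes" can mean computationally, the sketch below scans the forward strand of a DNA string for open reading frames (a start codon followed, in the same frame, by a stop codon). It is a deliberately simplified toy and not how production annotation pipelines work; the function name find_orfs and the min_codons parameter are assumptions made for this example.

```python
# Toy forward-strand ORF scan (illustrative assumption, not a real gene finder).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) index pairs of open reading frames.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon; `min_codons`
    counts every codon in the ORF, including ATG and the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                j = i + 3
                while j + 3 <= len(seq):
                    if seq[j:j + 3] in STOP_CODONS:
                        if (j + 3 - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        break
                    j += 3
                i = j + 3                       # resume after this ORF
            else:
                i += 3
    return sorted(orfs)
```

Calling find_orfs("CCATGAAATAGGG") reports a single candidate region covering ATGAAATAG. A real annotation system would also scan the reverse strand and weigh statistical evidence rather than rely on start and stop codons alone.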

Simulation: The process involves using known information about a system along with a mathematical or physicochemical model to simulate properties of the system. The category is incredibly diverse, ranging from simulating the motion of interacting protein molecules to modelling the flow of chemicals through biochemical pathways.
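
To make the idea of modelling flow through a biochemical pathway concrete, here is a minimal sketch that assumes a hypothetical two-step pathway A -> B -> C with first-order rate constants k1 and k2, integrated with simple fixed-step Euler updates. The pathway, the rate law and the function name simulate_pathway are all assumptions chosen for illustration; real pathway models are usually far more detailed and use more robust numerical solvers.

```python
def simulate_pathway(a0, k1, k2, dt=0.01, steps=1000):
    """Euler simulation of a hypothetical pathway A -> B -> C.

    a0 is the starting amount of A; k1 and k2 are assumed first-order
    rate constants for the two steps. Returns the final (a, b, c).
    """
    a, b, c = a0, 0.0, 0.0
    for _ in range(steps):
        flux_ab = k1 * a * dt   # amount converted from A to B this step
        flux_bc = k2 * b * dt   # amount converted from B to C this step
        a -= flux_ab
        b += flux_ab - flux_bc
        c += flux_bc
    return a, b, c
```

Because each step only moves material from one pool to the next, the total a + b + c stays equal to a0 (up to floating-point rounding), which is a handy sanity check on a model of this kind.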

Data Mining and Information Retrieval: Methods are needed to locate and retrieve information relevant to newly discovered genes. If similar genes or proteins are discovered through sequence comparison, often experiments have been performed on one or more homologues that can provide insight into the newly discovered gene or protein. Relevant information is contained in more than 100 databases scattered throughout the world, including DNA and protein sequence databases, genome mapping databases, metabolic pathway databases, gene expression databases, gene function and phenotype databases, and protein structure databases. This data can provide insight into a gene’s biochemical or whole organism function, pattern of expression in tissues, protein structure type or class, functional family, metabolic role, and potential relationship to disease phenotypes.

The target data resources are very heterogeneous (i.e., structured in a variety of ways), and some are merely text-based and poorly formatted, making the identification of relevant information and its retrieval difficult. Intelligent information retrieval technology is being applied to this domain to improve the reliability of such systems. One challenge here is that information relevant to an important gene or protein may appear in any database at any time. As a result, systems now being developed dynamically update the descriptions of genes and proteins in a data warehouse and continually poll remote data resources for new information.

Data Warehousing: The information retrieved by intelligent agents or calculated by the analysis system must be collected and stored in a local repository from which it can be retrieved and used in further analysis processes, seen by researchers, or downloaded into community databases. Numerous data of many types need to be stored and managed in such a way that descriptions of genomic regions and links to external data can be maintained and updated continually. In addition, large volumes of data in the warehouse must be accessible to the analysis systems running at multiple sites at a moment’s notice.
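
A local repository of this kind can be pictured with a very small relational sketch. The example below uses Python's built-in sqlite3 module purely for illustration; the table names, column names and helper functions are assumptions invented for this example and do not describe any particular genome data warehouse.

```python
import sqlite3

# Hypothetical two-table layout: genomic features plus links to remote databases.
SCHEMA = """
CREATE TABLE IF NOT EXISTS feature (
    feature_id  TEXT PRIMARY KEY,
    chromosome  TEXT NOT NULL,
    start_pos   INTEGER NOT NULL,
    end_pos     INTEGER NOT NULL,
    description TEXT
);
CREATE TABLE IF NOT EXISTS external_link (
    feature_id  TEXT NOT NULL REFERENCES feature(feature_id),
    source_db   TEXT NOT NULL,
    accession   TEXT NOT NULL,
    PRIMARY KEY (feature_id, source_db, accession)
);
"""

def open_warehouse(path=":memory:"):
    """Open (or create) the warehouse and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def upsert_feature(conn, feature_id, chromosome, start, end, description):
    """Insert a feature, or overwrite its description if it already exists."""
    conn.execute(
        "INSERT OR REPLACE INTO feature VALUES (?, ?, ?, ?, ?)",
        (feature_id, chromosome, start, end, description),
    )
    conn.commit()

def add_link(conn, feature_id, source_db, accession):
    """Record a pointer from a feature to an entry in a remote database."""
    conn.execute(
        "INSERT OR IGNORE INTO external_link VALUES (?, ?, ?)",
        (feature_id, source_db, accession),
    )
    conn.commit()

def features_in_region(conn, chromosome, start, end):
    """Return (feature_id, description) for features overlapping a region."""
    rows = conn.execute(
        "SELECT feature_id, description FROM feature "
        "WHERE chromosome = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY start_pos",
        (chromosome, end, start),
    )
    return rows.fetchall()
```

Re-running upsert_feature with the same feature_id replaces the stored description, which mirrors the idea of continually updating descriptions as new information arrives.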

Visualization for Data and Collaboration: The sheer volume and complexity of the analyzed information and links to data in many remote databases require advanced data visualization methods to allow user access to the data. Users need to interface with the raw sequence data; the analysis process; the resulting synthesis of gene models, features, patterns, genome map data, and anatomical or disease phenotypes; and other relevant data. In addition, collaborations among multiple sites are required for most large genome analysis problems. Even more complex and hierarchical displays are needed that will be able to zoom in from each chromosome to see the chromosome fragments (or clones) that have been sequenced and then display the genes and other functional features at the sequence level. Linked (or hyperlinked) to each feature will be detailed information about its properties, the computational or experimental methods used for its characterization, and further links to many remote databases that contain additional information. Analysis processes and intelligent retrieval agents will provide the feature details available in the interface and dynamically construct links to remote data.

Parallel Algorithms for Sequence Analysis: The recognition of important features in a sequence, such as genes, must be highly automated to eliminate the need for time-consuming manual gene model building. Five distinct types of algorithms (pattern recognition, statistical measurement, sequence comparison, gene modeling, and data mining) must be combined into a coordinated toolkit to synthesize the complete analysis. One of the key types of algorithms needed is pattern recognition. Methods need to be designed to detect the subtle statistical patterns characteristic of biologically important sequence features, such as genes or gene regulatory regions. DNA sequences are remarkably difficult to interpret through visual examination.

In genomics and computational biology, pattern recognition systems often employ artificial neural networks or other similar classifiers to distinguish sequence regions containing a particular feature from those regions that do not. Machine-learning methods allow computer-based systems to learn about patterns from examples in DNA sequence. They have proven to be valuable because our biological understanding of the properties of sequence patterns is very limited. Also, the underlying patterns in the sequence corresponding to genes or other features are often very weak, so several measures must be combined to improve the reliability of the prediction.
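
The idea of learning from examples and combining several weak measures can be sketched with the simplest possible neural classifier, a single perceptron. The two measures used here (GC fraction and CG-dinucleotide frequency), the function names and the training loop are assumptions chosen for illustration only; real gene-recognition systems use far more features and far more training data.

```python
def features(seq):
    """Combine two simple measures, plus a constant bias input, for one sequence."""
    seq = seq.upper()
    gc = sum(base in "GC" for base in seq) / len(seq)
    cpg = sum(seq[i:i + 2] == "CG" for i in range(len(seq) - 1)) / (len(seq) - 1)
    return [1.0, gc, cpg]

def predict(weights, seq):
    """Return 1 if the weighted sum of the measures is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features(seq)))
    return 1 if score > 0 else 0

def train_perceptron(examples, epochs=50, lr=1.0):
    """Classic perceptron rule: nudge the weights after every misclassification."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        errors = 0
        for seq, label in examples:
            delta = label - predict(weights, seq)
            if delta:
                errors += 1
                weights = [w + lr * delta * x
                           for w, x in zip(weights, features(seq))]
        if errors == 0:          # every training example is now classified correctly
            break
    return weights

# Made-up training examples: label 1 marks the "feature" class, 0 the background.
TRAINING = [
    ("GCGCGCGGCC", 1), ("ATATTTAAAT", 0),
    ("CGCGGGCCGC", 1), ("TTAATATGAA", 0),
    ("GGCGCGCCGA", 1), ("AATTTACATA", 0),
]
```

After weights = train_perceptron(TRAINING), a call such as predict(weights, "GCGGCGCGCC") classifies an unseen sequence using the learned combination of the two measures.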

High-speed sequence comparison represents another important class of algorithms used to compare one DNA or protein sequence with another in a way that extracts how and where the two sequences are similar. Many organisms share many of the same basic genes and proteins, and information about a gene or protein in one organism provides insight into the function of its “relatives” or “homologues” in other organisms.

Experiments in simpler organisms often provide insight into the importance of a gene in humans, so sequence comparison is a very important tool. Often the most accurate and sensitive methods for making this comparison are carried out using massively parallel computational platforms.
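
One well-known way to find how and where two sequences are similar is local alignment by dynamic programming (the Smith-Waterman approach). The sketch below is a plain single-threaded version with assumed scoring values (match=2, mismatch=-1, gap=-2); it is meant only to show the principle, not the massively parallel implementations mentioned above.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, smith_waterman("TTTACGTAAA", "GGACGTCC") picks out the shared stretch ACGT. The running time grows with the product of the two sequence lengths, which is one reason large-scale comparisons are often run on parallel hardware.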

Key skills: Knowledge of relational databases like Oracle and Sybase, the ability to work comfortably in a command-line scripting environment, and knowledge of programming languages such as C and C++ and a scripting language such as Perl are the fundamental skills. A detailed list of the skills needed for various posts is given below.

Software Engineer Informatics: If you want to be a software engineer (informatics), you must possess knowledge of relational databases like Oracle, Sybase or SQL. Strong object-oriented design and development skills in Java or C++ would be of great help.

Software Engineer Bioinformatics: Strong object-oriented design and skills in Java, C and C++, along with knowledge of Oracle PL/SQL, are needed. XML, middleware or application servers are a plus.

Support Engineers: Here again, strong object-oriented design and skills in Java, C and C++, along with knowledge of Oracle PL/SQL, are needed. XML, middleware or application servers are a plus.

Quality Engineers: Familiarity with sequence analysis tools such as BLAST and FASTA is desired. Other desired skills are Perl and shell programming, Oracle SQL and Unix computing.

Programmer Analyst: Knowledge of the Unix operating environment and of database management systems like SQL, Sybase and Oracle is a plus. Knowledge of user application software such as PC database packages, spreadsheets and word processing programs would also be helpful.

With the crash in the US markets, IT companies are desperately looking for a new emerging area that can help them tide over the rough times. With the biotechnology industry being flooded with funds, Bioinformatics may just be it.

(Assure Consulting acknowledges help of existing sources, especially Biospace’s Genomics Primer in compiling this article.)
