Career in Bioinformatics
Part A: Mapping the genome
Table: Business Models for Leading Genomics Companies
Stages and models: Sequencing; Genomic Map; Target Drug Discovery Companies; Lead Generation/Lead Optimization; Forward Integrated Drug Discovery Companies; High Throughput Screening
Companies: Human Genome Sciences; Millennium Pharmaceuticals; Maxygen; Cambridge Antibody Tech.; Protein Design Labs, Inc.
In June 2000, the working draft of the human genome was announced. The Human Genome Project’s success in sequencing the chemical bases of DNA opened a new frontier for the IT industry: bioinformatics. The discipline combines the power of biology, mathematics, and computers to study data relating to the three billion units of the human genome, the entire sequence of DNA made up of the four nucleotide bases A, C, T and G spread across 23 chromosomes. With bioinformatics tipped to be a $25 billion market in the next five years in India, IT industries are gearing up to meet the challenges of data management and control for the biotech industry. Although bioinformatics is the new buzzword in the IT industry, most techies know little about the field beyond a functional definition, and are unaware of the high-throughput genomic computational technologies that are transforming the biological sciences and the IT industry alike.
Mapping the Genome
Genomics is the study of the genetic material carried on the chromosomes in the nuclei of cells. As we know, human body cells have 46 chromosomes: one pair of sex chromosomes and 22 pairs of autosomes, which have no role in sex determination.
Each chromosome is made up of two strands of DNA which wrap around each other in the form of a double helix. Each chromosomal strand is made up of four DNA bases, or nucleotides: adenine (A), thymine (T), guanine (G), and cytosine (C). When the individual chromosomal strands bond, adenine binds to thymine and guanine binds to cytosine. Each set of two bound nucleotides is a base pair. Ninety-seven per cent of these base pairs have no known function, and have been labeled “junk DNA” by scientists. Interspersed along the strands, however, are distinct nucleotide sequences called genes, each of which contains instructions that govern the body’s development and functions. The human genome, which refers to all the DNA, genes included, stretched along the chromosomes in cells, contains by current estimates anywhere between 30,000 and 100,000 genes.
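The pairing rule can be made concrete with a few lines of Python. This is only an illustrative sketch: the function names and the sample strand are invented for the example and do not come from any particular bioinformatics package.

```python
# Base-pairing rule: adenine pairs with thymine, guanine pairs with cytosine.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the partner base for each base of `strand`, in the same order."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the partner strand read in its own direction.

    The two strands of the double helix run in opposite directions,
    so the partner strand is the complement read backwards.
    """
    return complement(strand)[::-1]


if __name__ == "__main__":
    strand = "ATGC"  # made-up example strand
    print(complement(strand))
    print(reverse_complement(strand))
```

Given one strand, the other is fully determined, which is why a sequence is normally written down for one strand only.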
Genes, and the instructions they contain, express themselves by forming proteins, which are large molecules that give the body structure and carry out its functions. Examples of proteins which we are familiar with include enzymes, which facilitate, or catalyse, the reactions in cells; antibodies, produced by the immune system to fight infection when antigens invade the body; and hormones, which regulate functions such as growth and reproduction.
But for all the hype, mere knowledge of the sequence and of the number of genes is by itself useless, just as knowing the number of pages in a book is not enough to understand its meaning. To apply the knowledge successfully, normal genes, and variations known as polymorphisms, will have to be pieced together with their corresponding proteins like a giant jigsaw puzzle, identifying hidden biochemical pathways at work in health and disease. Some of these proteins will serve as drugs directly. Others will serve as new targets for drug intervention. Protein, antibody, and small-molecule (chemical) drugs will be developed to act on these targets with much more selectivity and potency than seen today.
Hence there is a tremendous opportunity for companies to intercede at various levels in this process with technologies and information that will revolutionise human health and fulfil the aspiration unleashed by the unraveling of the genomic sequence.
Areas in Genomics
In the near term, efforts in the post-genomic-sequence era will focus on the following areas:
Genetic variability or SNPs: The complete genetic makeup of an individual is known as a genotype. The way in which the genotype expresses itself, i.e., a person’s physical traits, is called a phenotype. The genotype of all humans is 99.9 per cent the same; only the 0.1 per cent difference in the genetic makeup of individuals accounts for their uniqueness, or their differing phenotypes. These genetic differences often come in the form of Single Nucleotide Polymorphisms (SNPs), so named because a single nucleotide at a given position differs from one individual to another.
This subsector is hence concerned with how the SNPs that constitute most of the genetic differences between individuals influence susceptibility to disease. Genetic variations might also explain why a given drug works well with some patients, but not with others.
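As a rough illustration of what a SNP looks like in data, the sketch below compares two short sequences position by position. It assumes the sequences are already aligned and of equal length, so that every difference is a single-base substitution; the sequences and function names are made up for the example.

```python
def find_snps(seq_a, seq_b):
    """List (position, base_in_a, base_in_b) wherever two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def percent_identity(seq_a, seq_b):
    """Percentage of positions at which the two sequences carry the same base."""
    differences = len(find_snps(seq_a, seq_b))
    return 100.0 * (len(seq_a) - differences) / len(seq_a)


if __name__ == "__main__":
    # Hypothetical 16-base stretch from two individuals, differing at one position.
    person_1 = "ATGGCCTTAGACCGTA"
    person_2 = "ATGGCCTTGGACCGTA"
    print(find_snps(person_1, person_2))
    print(percent_identity(person_1, person_2))
```

Real genomes also differ by insertions and deletions, which a position-by-position comparison like this cannot handle; those need a proper alignment step first.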
Functional genomics: With a catalog of the entire genome, gene sequences can be more rapidly isolated, but functional genomics – the process of determining the function of genes to find those that will make good targets for drug discovery – is going to be a growing field of research.
Lead generation and optimisation: Identifying targets for drug discovery from the total pool of genes will allow companies to produce and optimise drug leads using drug discovery methods such as antibody generation (the most direct route to finding an inhibitor of a given target), combinatorial chemistry (for generating small-molecule compounds) and high-throughput screening (to find active small-molecule compounds among the millions available in libraries).
Gene expression: Almost all cells in the body (mature red blood cells, which have no nucleus, are an exception) contain all 23 pairs of chromosomes, which contain all genes. But not all genes are "turned on" to make their respective proteins. If a gene is turned on, it is said to be expressed. This is what makes a kidney cell a kidney cell and a nerve cell a nerve cell. For example, the insulin gene is turned on in specialized cells in the pancreas called islet cells. The insulin gene also resides in kidney cells, but it is turned off. In specialized kidney cells, the erythropoietin gene is turned on. Much is being learned about gene function by comparing levels of gene expression in body tissues and organs in states of health and disease. Scientists are also studying gene expression of tissues in response to drug therapy. This approach, known as pharmacogenomics, can provide important clues on toxicity and efficacy even before a drug enters human clinical testing.
Genes produce their proteins in two steps, the first of which is called transcription. During transcription, the bonds that hold the two DNA strands together break apart locally. Once the strands are separated, an enzyme called RNA polymerase moves along one DNA strand and builds a molecule of messenger RNA (mRNA) that copies the strand’s coding portions, the nucleotides that make up its genes.
When this transcription process is complete, the mRNA molecule will have formed a string of coding nucleotide bases. The mRNA then carries this string outside the nucleus and into the cytoplasm of the cell. Here, the string of bases is exposed to the ribosome, an organelle that carries out the work of translation—the process of carrying out the nucleotides’ instructions to form a protein.
The bases along the nucleotide string work three at a time. That is, each string of three bases, known as a codon, codes for a specific amino acid, one of the molecules that form proteins. As the codons are read, molecules of transfer RNA (tRNA) carry the appropriate amino acids to the ribosome. The various amino acids, each of which was coded for by a different set of codons, are then strung together by the ribosome to form a protein.
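The two steps described above, transcription and translation, can be sketched in Python. The sketch makes some simplifying assumptions: the input is taken to be the coding strand of a gene, so the mRNA is the same sequence with uracil (U) in place of thymine (T); reading starts at the first base; and amino acids are written as one-letter codes, with "*" marking a stop codon. The table is the standard genetic code in its usual compact layout; the function names and the sample gene are invented for illustration.

```python
from itertools import product

# Standard genetic code, listed for codons in the order UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T is replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time until a stop codon or the end."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":  # stop codon: the protein is complete
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene = "ATGGCCTAA"  # made-up gene: start codon, one more codon, stop codon
    mrna = transcribe(gene)
    print(mrna)
    print(translate(mrna))
```

Because 64 codons map onto only 20 amino acids and a stop signal, most amino acids are coded for by several codons, which is visible in the repeated letters of the table.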
By analysing genes and their expression, scientists can gain insight into the causes of illness. By looking at genetic variabilities, they can ascertain why a given population is more susceptible to certain diseases than others. Such analysis can also give them clues as to how to customize treatments to fit individual genotypes.
Commercialising the genome
The genomics industry is trying to leverage the enormous potential of these discoveries. Apart from bioinformatics, the general categories of future genomic product revenues are described below.
Genome sequence: A genome sequence is the ordered list of the DNA bases in a given organism. It is the ultimate description of the genome, as the periodic table of the elements is for chemistry or as an alphabet is for a language. Now complete, the sequence of the human genome will be made available free of charge to the public, both by Celera and by the DOE/NIH Human Genome Project. The commercial value of the sequence will be realised as additional layers of genomic information are added to it. As information accumulates from other downstream genomics activities, such as polymorphism discovery, disease association, genotyping, pharmacogenomics, and phenotyping, it will be added to the databases and annotated. These databases will become important repositories of information for biotechnology and pharmaceutical companies.
Gene discovery and function: Newly discovered genes may be patented if their function is known, as may the proteins for which they code. These proteins may be useful as protein drugs in and of themselves, an approach that served as the basis for the founding of the biotechnology industry. For example, the erythropoietin gene codes for the protein hormone erythropoietin, which Amgen turned into a blockbuster drug (Epogen) used to stimulate the bone marrow to produce red blood cells for the treatment of anemia.
Human Genome Sciences is at the forefront of this strategy, with three novel protein drugs currently in Phase II clinical testing. Lower organisms may be subject to genetic manipulations that are highly revealing of the roles of a gene, but that are not feasible in humans for obvious ethical reasons. An example is the use of genetically engineered mouse models. It is possible to engineer a mouse so that it is incapable of expressing a gene, potentially revealing how other proteins in the organism are affected by such a loss. Alternatively, expression of the gene can be manipulated up or down, or regulated in a time- or location-dependent manner. Additional constraints on the utility of such models, beyond the appropriateness of the model itself, derive from the size of the organism, the reproductive turnover time, and similar factors.
Other proteins serve as cellular receptors that mediate specific cellular processes. Once associated with disease, these receptors can serve as important targets for development of new protein, antibody, and small-molecule drugs. Abnormal genes can also serve as diagnostic markers for associated diseases.
Expression Profiling: No gene functions in an isolated fashion. Genes, and the proteins they encode, are parts of larger systems, interacting with each other in complicated ways. These systems are known as pathways, and as the expression of a particular gene is regulated by the body to occur at specific times and in specific locations, so the expression of entire pathways of genes is likewise regulated. The aim of expression profiling is to use the patterns of gene expression as a clue for understanding the underlying pathways.
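One common way to compare expression between two states is the log2 fold change, sketched below. The gene names and expression counts are entirely made up for illustration, and the pseudocount of 1 (added to avoid dividing by zero for unexpressed genes) is an assumption of this sketch, not a fixed convention.

```python
import math


def log2_fold_changes(healthy, diseased, pseudocount=1.0):
    """Return log2(diseased / healthy) for every gene measured in both samples."""
    return {
        gene: math.log2((diseased[gene] + pseudocount) / (healthy[gene] + pseudocount))
        for gene in healthy
        if gene in diseased
    }


def changed_genes(fold_changes, threshold=1.0):
    """Return genes whose expression at least doubled or halved (for threshold 1)."""
    return sorted(gene for gene, change in fold_changes.items() if abs(change) >= threshold)


if __name__ == "__main__":
    # Hypothetical expression levels in healthy and diseased tissue.
    healthy = {"gene_a": 7, "gene_b": 15, "gene_c": 31}
    diseased = {"gene_a": 31, "gene_b": 15, "gene_c": 7}
    changes = log2_fold_changes(healthy, diseased)
    print(changes)
    print(changed_genes(changes))
```

Genes that rise and fall together across many samples are candidates for belonging to the same pathway, which is the clue that expression profiling tries to exploit.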
Genotyping: Progress in the hunt for disease genes depends on the ability to access individuals with disease and analyze their genetic makeup. Several methods based on target signal amplification technology have now emerged that can rapidly test for specific SNPs. The process, broadly known as genotyping, will soon reach industrial scale within several companies. High-throughput factories will comb through hundreds of tissue samples per day looking for genetic needles in disease haystacks.
Pharmacogenomics: Pharmacogenomics is the study of genotype and its relationship to drug action. The goal is to use the right drug for the right person at the right time. Surprisingly, fewer than half of patients experience the intended benefit of most drugs, despite taking them as directed. Not surprisingly, most drugs have side effects, ranging from the merely bothersome (like rash and fatigue) to life-threatening reactions. Why do some patients benefit and some not? Why do some patients have violent reactions while others are unfazed? The answers are increasingly believed to be genetic.
Variations in drug-metabolizing enzymes, transporters, receptors, and other drug targets explain these individual responses. Pharmacogenomic approaches apply this information throughout the drug development process to define subsets of patient populations who are both likely to benefit from a drug and unlikely to experience adverse events. The concept is already in place with Herceptin: only patients whose tumours overexpress the HER2 gene and its receptor are treated.
Soon, diseases that have been viewed as one, like hypertension and depression, will be seen as many. The day will come when doctors will diagnose several kinds of hypertension, each with a specific therapy known through pharmacogenomic information to be effective. The upside of this for the pharmaceutical industry is that many drugs that failed to show benefit in broad clinical trials may be resurrected with new trials tailored to treat individuals who are genetically likely to benefit.
Proteomics: An offshoot of the genomics revolution has been the advent of proteomics. As genomics is the study of the genome, the complete collection of genes in an organism, proteomics is defined as the study of the proteome, the complete complement of proteins. Information about a protein includes its amino acid sequence, its mass, and other physical properties that might be thought of as protein chemistry. Other investigations deal more with the proteome as a whole, such as large-scale pathway interaction assays, which look at large numbers of physical protein-protein interactions in parallel. Proteomics is highly complementary to genomics. For example, a proteomic lead is typically strong in potential, but difficult to follow up on. The ability to work backward from proteomics to genomics and discover the gene responsible for that protein opens many avenues of further investigation.
Bioinformatics: High-throughput approaches to data gathering necessitate substantial data-sorting abilities. The expanding field of bioinformatics involves the application of computing techniques toward processing all of this data and making it accessible and useful. Computers are used to reassemble the pieces of DNA sequence that come out of sequencing efforts, to process the enormous data sets arising from expression profiling, and at all stages of the increasingly automated discovery process to manage the robotic systems that perform the various procedures. And, of course, bioinformatics offers a means of interpreting the results of all this manipulated data.
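To give a flavour of the reassembly task, the sketch below joins two sequence fragments wherever the end of one matches the start of the other. It is a deliberately simplified illustration with invented fragments and function names; real assembly software has to cope with sequencing errors, repeated regions and millions of fragments.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left, right, min_length=3):
    """Join two fragments on their overlap, or return None if they do not overlap."""
    size = overlap(left, right, min_length)
    if size == 0:
        return None
    return left + right[size:]


if __name__ == "__main__":
    # Two made-up fragments that share the four bases "CTTA".
    fragment_1 = "ATGGCCTTA"
    fragment_2 = "CTTAGACC"
    print(overlap(fragment_1, fragment_2))
    print(merge(fragment_1, fragment_2))
```

The `min_length` parameter guards against joining fragments on a match of one or two bases, which would happen by chance far too often to be meaningful.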
A particularly large potential application uses computers to predict the significance of sequence data. The idea of using sequence to predict function starts with the observation that many genes fall into what are known as families, similar both in sequence and in function. Usually the genes, and the proteins that they encode, have retained certain critical, defining features over the course of evolution. These similarities go all the way down to the DNA sequence level, meaning that the family as a whole can be defined by a sort of DNA fingerprint. An analysis of new genomic sequence data that looks for similarities to known gene families can shed light on the function of a gene absent any other knowledge. Given the fact that the DNA sequence of a gene alone is sufficient to determine the properties of a protein, it makes sense that researchers would attempt to understand this translation to the level necessary to predict the characteristics of a gene based solely on that sequence. As simple as this might sound, the difficulty of such an endeavor is fearsome, and such attempts remain for the most part rudimentary. In truth, sequence analysis never relies solely on the sequence, since it is necessarily a comparative enterprise reliant upon the "wet biology" that has preceded it. The problem of sequence-function predictive analysis is multifarious, reflecting both a lack of computational power and a lack of accumulated knowledge on the basis of which to model the systems involved. One example is the relatively small amount that we know about the final three-dimensional structures of proteins, due to the difficulty involved in such structure determination. Nevertheless, there is considerable optimism that such attempts, known collectively as structural genomics, will eventually become an essential tool in the biologist’s toolbox.
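The idea of assigning a new sequence to a known family can be illustrated with a crude similarity measure: the share of short subsequences (k-mers) that two sequences have in common. This is a stand-in chosen for brevity, not the method used by production tools, which rely on sequence alignment and statistical family models. The family names, sequences and function names below are all hypothetical.

```python
def kmers(sequence, k=4):
    """Return the set of all length-k subsequences of `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(seq_a, seq_b, k=4):
    """Shared k-mers divided by all distinct k-mers in either sequence (0.0 to 1.0)."""
    kmers_a, kmers_b = kmers(seq_a, k), kmers(seq_b, k)
    if not kmers_a and not kmers_b:
        return 0.0
    return len(kmers_a & kmers_b) / len(kmers_a | kmers_b)


def best_family(query, families, k=4):
    """Score the query against each family's members and return the best match."""
    scores = {
        name: max(jaccard(query, member, k) for member in members)
        for name, members in families.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical families with one known member each.
    families = {
        "family_x": ["ATGGCCTTAGAC"],
        "family_y": ["TTTTCCCCGGGGAAAA"],
    }
    new_sequence = "ATGGCCTTAGAT"
    print(best_family(new_sequence, families))
```

As the article notes, a match of this kind only suggests a function; the suggestion is as good as the laboratory work that characterised the known family members in the first place.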
Genomics Business Models: Four primary business models have emerged in the genomics industry: structural genomics information companies, functional genomics information companies, target drug discovery companies and enabling genomic technology companies.
Structural Genomics Information Companies: These are the genomics information companies, defining the structure of the human genome and its related proteins. They seek to become the "Bloombergs" of the pharmaceutical industry, providing must-have genomic sequence, variation, and function information in gigantic databases. Initially, companies like Celera and Incyte have pursued a database subscription model for deep-pocketed large pharmaceutical customers. As with the genomic technology companies, the key challenge is to stay ahead of the information obsolescence curve.
Functional Genomics Information Companies: These companies start life as genomics services providers to pharmaceutical companies. The pharmaceutical deals validate the technology and pay the bills as they build their own pipelines of proprietary drugs in an effort to move into the genomic drugs category. Like other genomics participants, functional genomics companies are meaningful competitors in the intellectual property race. Because of customer demand, they are at the forefront of pharmacogenomics, applying genomic technology to the evaluation of big pharma pipeline drugs even before they enter human clinical testing. The key challenge for these companies is to preserve enough of their discoveries and intellectual property to remain competitive as standalone entities. Examples include CuraGen, Tularik, Myriad, Genset, Lexicon Genetics, and Gene Logic.
Target Drug Discovery Companies: These are the companies at the forefront of developing tomorrow’s genomic drugs. For example, Human Genome Sciences has identified hundreds of proteins that have potential for use directly as injectable drugs. Three of these are currently undergoing Phase II clinical testing. The other major player, Millennium Pharmaceuticals, is target-based, with antibody drugs in human clinical testing today and several small-molecule compounds directed to new genomic targets in preclinical development. The central challenge for this model is success in clinical trials.
Enabling Genomic Technology Companies: These are the tool companies providing the picks and shovels of the genomics industry. New research tools, gene sequencers, chips, and hardware have enabled the entire industry in barely a decade. The business models are similar to the hardware and processor models in the technology industry, with the addition of diagnostic and reagent sales. The key challenge for these companies will be to remain at the leading edge of the innovation curve as yesterday’s technologies become commoditised.
Part B: Computing the genome