Version July 2007

In 2000, the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI) announced their intention to set up a major effort to annotate, describe and distribute to the life science community a large body of highly curated information on human protein sequences. This initiative - named the Human Proteome Initiative (HPI) - is combined with an appeal to the user community to participate actively in the effort at various levels.

Once upon a time...
By 2004, approximately 99% of the human euchromatic genome had been accurately sequenced, and the current challenge for the scientific community has become the (re)annotation of the human genome. Four institutes - the European Bioinformatics Institute (EBI), the National Center for Biotechnology Information (NCBI), the University of California at Santa Cruz (UCSC) and the Wellcome Trust Sanger Institute (WTSI) - joined their efforts in order to create a standard set of gene annotations. Toward this end, they launched the Consensus CDS (CCDS) project. Such a collaborative approach, which involves sharing results obtained by different automated and manual methods, will undoubtedly be extremely fruitful.

After several years of wild guesses, a consensus has been reached, and the number of human genes is currently estimated to range from 20,000 to 25,000. One of the challenges in human biology is to understand how such a relatively limited number of genes can give rise to an organism as complex as Mozart, Matisse or Marie Curie. Complexity is generated at several levels, chiefly through alternative splicing and post-translational modifications (PTMs).

Largely underestimated in the past, alternative splicing appears today to be one of the most important biological events in generating complexity; indeed, it is believed that at least 40-60% of human genes have alternatively spliced isoforms. Large-scale studies on chromosomes 21 and 22 indicate that over 80% of the genes could undergo alternative splicing.

Genomic information alone does not suffice to predict all the PTMs that most proteins undergo. Once synthesized on the ribosomes, proteins are subject to a multitude of PTMs.
They are cleaved (thus eliminating signal sequences, transit or pro-peptides and initiator methionines); many simple chemical groups can be attached to them (acetyl, methyl, phosphoryl, etc.), as well as a number of more complex molecules, such as sugars and lipids; and finally, proteins can be internally or externally cross-linked (e.g. disulfide bonds). More than two hundred different types of PTM are currently known and many more are yet to be discovered.
When the complexity generated by alternative splicing is combined with that produced by PTMs, it appears that the number of different protein molecules expressed by the 20,000 to 25,000 protein-encoding genes is probably more than one million (Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).

While the considerations above concern protein complexity at the level of an individual, additional diversity factors - at the genomic level this time - have to be taken into account when dealing with the entire human population: these are polymorphisms, commonly termed "c-SNPs" (coding single nucleotide polymorphisms), which, after translation, give rise to "SAPs" (single amino-acid polymorphisms). While some of these polymorphisms are linked to disease states, the majority are not, though in many cases they can have a direct or indirect effect on the activities of the proteins.

HPI goals and means

In this context, the HPI's aim is to annotate all known human protein sequences according to the quality standards of UniProtKB/Swiss-Prot. Most UniProtKB/Swiss-Prot sequences are derived from the translation of EMBL/GenBank/DDBJ database nucleotide sequences. Sequences derived from the same gene are manually merged into a single UniProtKB/Swiss-Prot record. During this process, sequence comparison allows us to find and show the most reliable sequence. All discrepancies are carefully analyzed and stored. These discrepancies can be due to alternative splicing, polymorphism, or unknown reasons such as sequencing errors or as yet uncharacterized polymorphisms. Currently, an average of about 6 nucleotide entries are used to create one human UniProtKB/Swiss-Prot entry, and this number is growing continuously.

These sequences can be further - fully or partially - confirmed by direct protein sequencing, either by the classical Edman sequencing technique or by mass spectrometry methods. Currently, more than 15% of the human entries contain such data.
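As a rough illustration of the sequence comparison step described above, the following Python sketch compares several translated sequences for the same gene and lists the positions where they disagree. It is not the curation procedure used for UniProtKB/Swiss-Prot, which is manual. The function name `consensus_and_discrepancies`, the majority-vote rule used as a stand-in for "most reliable sequence", the assumption that the inputs are already aligned to equal length, and the example sequences are all hypothetical.

```python
from collections import Counter


def consensus_and_discrepancies(sequences):
    """Return a majority-vote consensus and the positions where inputs differ.

    Assumption: the sequences are already aligned and of equal length.
    Majority vote is only a toy stand-in for a curator's judgment.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be pre-aligned to equal length")

    consensus = []
    discrepancies = []
    for pos in range(length):
        # Count the residues observed at this position across all inputs.
        counts = Counter(s[pos] for s in sequences)
        residue, _ = counts.most_common(1)[0]
        consensus.append(residue)
        if len(counts) > 1:
            # Record 1-based position and the observed residue counts.
            discrepancies.append((pos + 1, dict(counts)))
    return "".join(consensus), discrepancies


# Hypothetical translations of three nucleotide entries for one gene.
entries = ["MKTAYIAK", "MKTAYIAK", "MKTGYIAK"]
consensus, diffs = consensus_and_discrepancies(entries)
print(consensus)
print(diffs)
```

Each recorded discrepancy would still need to be examined by a curator to decide whether it reflects alternative splicing, a polymorphism or a sequencing error; the sketch only locates the differences.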
In addition to accurate sequences, UniProtKB/Swiss-Prot manual annotation strives to provide, for each known protein, a wealth of information that includes the description of its function, domain structure, subcellular location, post-translational modifications, variants, similarities to other proteins, etc. This involves not only a critical examination of computer predictions obtained with constantly improving bioinformatics tools but also the careful review of the scientific literature. The HPI project contains a number of sub-components, which are briefly described below:
We need you

For all aspects of the HPI project, we would appreciate the help and collaboration of the scientific community. Information regarding the human proteome is highly critical to a large section of the life science community. We therefore greatly encourage the user community to fully participate in this initiative by providing information not only to help the comprehensive annotation of the human proteome but also to speed it up. The HPI project is a long-term challenge. It will take years to annotate and periodically re-annotate all human proteins so as to obtain a full and useful compendium which will describe the function and, more specifically, the role of these crucial actors involved in most, if not all, biological processes.
"May you live in interesting times!"is supposedly a proverb used by the Chinese in Antiquity, which was less a blessing, however, than a curse... There is no doubt that the life science community is living in interesting times; it would be agreeable to believe that this is not a curse, but clearly a blessing. |