The recent progress in genome sequencing has led to a rapid enrichment of protein databases with an unprecedented variety of deduced protein sequences, most of them without a documented functional role. Computational biology strives to extract the maximal possible information from these sequences by classifying them according to their homologous relationships and by predicting their likely biochemical activities and/or cellular functions, three-dimensional structures and evolutionary origin. This challenge is daunting, given that even in Escherichia coli, arguably the best-studied organism (1), only ~40% of the gene products have been characterized experimentally (2). On the other hand, computer analysis of complete microbial genomes has shown that prokaryotic proteins are in general highly conserved, with ~70% of them containing ancient conserved regions (ACRs) (3). This allows one to transfer functional information from experimentally characterized proteins to their homologs from poorly studied organisms.

For such functional predictions to be reliable, it is critical to infer orthologous relationships between genes from different species. Orthologs are direct evolutionary counterparts related by vertical descent, as opposed to paralogs, which are genes within the same genome related by duplication (4, 5). Typically, orthologous proteins have the same domain architecture and the same function, although there are significant exceptions and complications to this generalization, particularly among multicellular eukaryotes (6).

The Clusters of Orthologous Groups of proteins (COGs) database has been designed as an attempt to classify proteins from completely sequenced genomes on the basis of the orthology concept (7). The COGs reflect one-to-many and many-to-many orthologous relationships as well as simple one-to-one relationships (hence Orthologous Groups of proteins). The original set included the proteins from five bacterial, one archaeal and one eukaryotic genomes and consisted of 720 COGs; subsequently, a sixth bacterial genome was added, with the number of COGs increasing to 860 (8). Here we report the current status of the COG database, which now consists of 2091 COGs and includes proteins from 21 complete genomes.

COGs have been identified on the basis of an all-against-all sequence comparison of the proteins encoded in complete genomes using the gapped BLAST program (9) after masking low-complexity and predicted coiled-coil regions (7). The COG construction procedure is based on the simple notion that any group of at least three proteins from distant genomes that are more similar to each other than they are to any other proteins from the same genomes are most likely to belong to an orthologous family. This prediction holds even if the absolute level of sequence similarity between the proteins in question is relatively low, and thus the COG approach accommodates both slow-evolving and fast-evolving genes. Briefly, COG construction includes the following steps.

1. Perform the all-against-all protein sequence comparison.
2. Detect and collapse obvious paralogs, that is, proteins from the same genome that are more similar to each other than to any proteins from other species.
3. Detect triangles of mutually consistent, genome-specific best hits (BeTs), taking into account the paralogous groups detected at step 2.
4. Merge triangles with a common side to form COGs.
5. Examine the pictorial representation of the BLAST search outputs for each group to eliminate false positives and to identify groups that contain multidomain proteins. The sequences of detected multidomain proteins are split into single-domain segments and steps 1–4 are repeated with these sequences, which results in the assignment of individual domains to COGs in accordance with their distinct evolutionary affinities.
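The triangle detection and merging steps of COG construction can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that paralogs have already been collapsed, that the genome-specific best hits are available as a dictionary, and, as a simplification, that each side of a triangle is a reciprocal best hit (the text only requires the hits to be mutually consistent). The names `find_triangles`, `merge_triangles`, `example_genomes` and `example_hits` are hypothetical, and the protein identifiers are toy data.

```python
from itertools import combinations


def find_triangles(best_hit, genome_of):
    """Find triples of proteins from three different genomes in which
    every pair is a reciprocal genome-specific best hit (assumed rule)."""
    def is_side(p, q):
        return (best_hit.get((p, genome_of[q])) == q
                and best_hit.get((q, genome_of[p])) == p)

    triangles = []
    for a, b, c in combinations(sorted(genome_of), 3):
        if len({genome_of[a], genome_of[b], genome_of[c]}) < 3:
            continue  # a triangle must span three distinct genomes
        if is_side(a, b) and is_side(b, c) and is_side(a, c):
            triangles.append(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) using union-find."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    side_owner = {}  # side -> index of the first triangle containing it
    for i, tri in enumerate(triangles):
        for side in combinations(sorted(tri), 2):
            if side in side_owner:
                parent[find(i)] = find(side_owner[side])
            else:
                side_owner[side] = i

    clusters = {}
    for i, tri in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(tri)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: one protein in each of four genomes (hypothetical identifiers).
example_genomes = {"a1": "A", "b1": "B", "c1": "C", "d1": "D"}
# (protein, target genome) -> best hit of that protein in the target genome.
example_hits = {
    ("a1", "B"): "b1", ("a1", "C"): "c1", ("a1", "D"): "d1",
    ("b1", "A"): "a1", ("b1", "C"): "c1", ("b1", "D"): "d1",
    ("c1", "A"): "a1", ("c1", "B"): "b1",
    ("d1", "A"): "a1", ("d1", "B"): "b1", ("d1", "C"): "c1",
}

if __name__ == "__main__":
    print(merge_triangles(find_triangles(example_hits, example_genomes)))
```

In the toy data, `c1` has no best hit in genome D, so only the triangles a1-b1-c1 and a1-b1-d1 form. They share the side a1-b1 and are therefore merged into a single group of four proteins. Enumerating all triples is cubic in the number of proteins, which is acceptable for a sketch but would need a hit-driven search for real genomes.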