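143 complete bacterial genomic sequences: selected sample essay
2016-01-13 Source: 51due faculty team Category: Essay samples
Bacterial genomes are compact: most of the sequence carries coding information. Any statistical study of a bacterial genomic sequence will therefore detect coding information as the main source of heterogeneity. Even without using any biological information, all of these methods can be viewed as a specific clustering of relatively short genomic fragments. The sample essay below discusses this in detail.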
Abstract
Coding information is the main source of heterogeneity (non-randomness) in the sequences of bacterial genomes. This information can be naturally modeled by analysing cluster structures in the “in-phase” triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in all 143 completely sequenced bacterial genomes available in GenBank in August 2004, and explained its properties. The 7-cluster structure is responsible for the main part of sequence heterogeneity in bacterial genomes. In this sense, the 7-cluster structure is the basic model of a bacterial genome sequence. We demonstrated that four basic “pure” types of this model are observed in nature: “parallel triangles”, “perpendicular triangles”, the degenerate case and the flower-like type. We show that the codon usage of bacterial genomes is, with high accuracy, a multi-linear function of their genomic G+C-content (more precisely, two similar functions, one for eubacterial genomes and the other for archaea).
Introduction
Bacterial genomes are compact: most of the sequence contains coding information. Hence any statistical study of a bacterial genomic sequence will detect coding information as the main source of heterogeneity (non-randomness). This is confirmed by mining sequences “from scratch”, without the use of any biological information, using entropic or Hidden Markov Modeling (HMM) statistical approaches (for example, see [1], [2], [3], [18]). All these methods can be seen as a specific clustering of relatively short genomic fragments, with lengths in the range 200-400 bp, comparable to the average length of a piece of coding information. Surprisingly, not much is known about the properties of the cluster structure itself, independently of the gene recognition problems in which it has long been used implicitly (see, for example, the early paper [5] on the well-known GENMARK gene predictor, or [20] on the GLIMMER approach).
Only recently was the structure described explicitly. In [12], [13], [24] and [25] the structure was visualized in the 64-dimensional space of non-overlapping triplet distributions for several genomes. The same dataset was also visualized in [14] and [11] using non-linear principal manifolds. In [19] several particular cases of this structure were observed in the context of the Z-curve methodology, in the 9-dimensional space of Z-coordinates: it was claimed that the structure has an interesting flower-like pattern but can be observed only for GC-rich genomes. This somewhat contradicts the earlier results of [24], where the same flower-like picture was demonstrated for the AT-rich genome of Helicobacter pylori. This shows that this simple and basic structure is far from being completely understood and described.
The problem can be stated in the following way: there is a set of genomic fragments of length 100-1000 bp representing a genome almost uniformly. There are various ways to produce this set: for example, by a sliding window with a given step (in this case sequence assembly is generally not needed), or by taking the full set of ORFs (in this case one needs to know the assembled sequence). We construct a distribution of points in a multidimensional space of statistics calculated on the fragments and study the cluster structure of this distribution. The following questions arise: What is the number of clusters? What is the character of their mutual locations? Is there a “thin structure” in the clusters?
How does the structure depend on the properties of the genomic sequence, and can we predict it? Every fragment can be characterized by a “frequency dictionary” of short words (see examples in [8], [9], [10], [15]). For our purposes we use the frequencies of non-overlapping triplets, counted from the first base pair of a fragment. Thus every fragment is a point in the 64-dimensional space of triplet frequencies. This choice is not unique; moreover, we use dimension reduction techniques to simplify this description and extract its essential features. The cluster structure we are going to describe is universal in the sense that it is observed in any bacterial genome and with any type of statistics that takes coding phase shifts into account. The structure is basic in the sense that it is the first thing revealed by the analysis, serving as the principal source of sequence non-randomness. In [12], [13], [19], [24] it was shown that even simple clustering methods like K-Means or Fuzzy K-Means give good results when the structure is applied to gene-finding.
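The fragment-to-point mapping described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 300 bp window, the 50 bp step and the toy sequence are all assumptions chosen for the example.

```python
from itertools import product

BASES = "ACGT"
# All 64 triplets in a fixed order, so every fragment maps onto the same axes.
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]


def sliding_fragments(genome, width=300, step=50):
    """Yield (start, fragment) pairs for windows of `width` bp taken every `step` bp.

    The default width and step are illustrative assumptions.
    """
    for start in range(0, len(genome) - width + 1, step):
        yield start, genome[start:start + width]


def triplet_frequencies(fragment):
    """Return the 64 frequencies of non-overlapping triplets, read from the first base."""
    counts = dict.fromkeys(TRIPLETS, 0)
    total = 0
    for i in range(0, len(fragment) - 2, 3):
        word = fragment[i:i + 3]
        if word in counts:  # skip triplets containing ambiguous bases such as N
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRIPLETS)
    return [counts[t] / total for t in TRIPLETS]


if __name__ == "__main__":
    toy_genome = "ATGGCGAAATTTGGGCCC" * 40  # made-up sequence, not real data
    points = [triplet_frequencies(f) for _, f in sliding_fragments(toy_genome)]
    print(len(points), "fragments, each a point in", len(points[0]), "dimensions")
```

The resulting list of 64-dimensional points is the kind of input that a clustering method or a dimension reduction technique would then be applied to.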
Discussion
In this paper we prove the universal 7-cluster structure in triplet distributions of bacterial genomes. Some hints of this structure appeared a long time ago, but only recently was it explicitly demonstrated and studied. The 7-cluster structure is the main source of sequence heterogeneity (non-randomness) in bacterial genomes. In this sense, the 7-cluster structure is the basic model of a bacterial genome sequence. We demonstrated that four basic “pure” types of this model are observed in nature: “parallel triangles”, “perpendicular triangles”, the degenerate case and the flower-like type (see Fig. 4). To explain the properties and types of the structure that occur in natural bacterial genomic sequences, we studied 143 bacterial genomes available in GenBank in August 2004. We showed that, surprisingly, the codon usage of the genomes can be very closely approximated by a multi-linear function of their genomic G+C-content (more precisely, by two similar functions, one for eubacterial genomes and the other for archaea).
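This excerpt does not spell out where the number seven comes from. One reading, consistent with the remark that the statistics must take coding phase shifts into account, is that a fragment cut at an arbitrary position overlaps a gene in one of three phases on the forward strand or one of three on the reverse strand, with non-coding fragments forming the seventh group. The sketch below, whose function names and toy sequence are assumptions made for illustration, shows how the same fragment yields six different triplet counts depending on strand and offset.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def triplet_counts(seq, offset=0):
    """Count non-overlapping triplets starting `offset` bases into `seq`."""
    counts = {}
    for i in range(offset, len(seq) - 2, 3):
        word = seq[i:i + 3]
        counts[word] = counts.get(word, 0) + 1
    return counts


def six_phase_counts(fragment):
    """Triplet counts for three offsets on each strand (six readings in total)."""
    readings = {}
    for strand, seq in (("+", fragment), ("-", reverse_complement(fragment))):
        for offset in range(3):
            readings[(strand, offset)] = triplet_counts(seq, offset)
    return readings


if __name__ == "__main__":
    toy_fragment = "ATGGCC" * 10  # made-up periodic sequence, not real data
    for key, counts in six_phase_counts(toy_fragment).items():
        print(key, counts)
```

For a real coding fragment only one of the six readings is the true codon distribution; the other five are its shifted or reverse-complemented images, which is why fragments cut at arbitrary positions fall into distinct groups.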
In the 64-dimensional space of all possible triplet distributions the bacterial codon distributions are close to two curves, coinciding at their AT-rich ends and diverging at their GC-rich ends. Moving along these curves, we meet all naturally occurring 7-cluster structures in the following order: “parallel triangles” for the AT-rich genomes (G+C-content around 25%), then “perpendicular triangles” for G+C-content around 35%, switching gradually to the degenerate case in the region of GC=50%; finally, the degeneracy is resolved in one plane, leading to the flower-like symmetric pattern (starting from GC=60%). All these events can be illustrated using the material from the web-site we established [7]. The properties of the 7-cluster structure have natural interpretations in the language of Hidden Markov Models. The locations of clusters in multidimensional space correspond to in-state transition probabilities, and the way clusters touch each other reflects inter-state transition probabilities. Our clustering approach is independent of Hidden Markov Modeling, though it can serve as a source of information for adjusting the learning parameters. In our paper we analyzed only triplet distributions.
It is easy to generalize our approach to longer (or shorter) words. In-phase hexamers, for example, are characterized by the same 7-cluster structure. However, our experience shows that most of the information is contained in triplets: the correlations in the order of codons are small and the formulas (1) work reasonably well. Other papers confirm this observation (see, for example, [1], [12]). The subject of the paper has many possible continuations. There are several basic questions: how can one explain the one-dimensional model of codon usage, and why do the signatures in the middle of the G+C-content scale have palindromic structures? There are questions about how our model is connected with codon bias in translationally biased genomes: the corresponding cluster structure is the second hierarchical level, or the “thin structure”, in every cluster of the 7-cluster structure (see, for example, [6]). The following question is also important: is it possible to detect and use the universal 7-cluster structure for higher eukaryotic genomes, where this structure also exists (see [24]) but is hidden by the huge non-coding cluster? The information about the 7-cluster structure can be readily introduced into existing or new software for gene prediction, sequence alignment and genome classification.
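The G+C-content landmarks quoted in the discussion (around 25%, around 35%, about 50%, and 60% upward) can be turned into a rough lookup. The sketch below is only an illustration: the text describes a gradual transition between types, so the nearest-landmark rule and the function names are assumptions, not the authors' classification.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of `seq`."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


# Landmarks taken from the discussion; the transitions between them are gradual.
LANDMARKS = [
    (0.25, "parallel triangles"),
    (0.35, "perpendicular triangles"),
    (0.50, "degenerate case"),
    (0.60, "flower-like"),
]


def nearest_landmark(gc):
    """Return the structure type whose quoted G+C landmark is closest to `gc`.

    Illustrative assumption: picks the nearest landmark instead of modelling
    the gradual switch between types described in the text.
    """
    return min(LANDMARKS, key=lambda item: abs(item[0] - gc))[1]


if __name__ == "__main__":
    toy_sequence = "ATATATGCATATTTAA"  # made-up sequence, not real data
    gc = gc_content(toy_sequence)
    print(round(gc, 3), nearest_landmark(gc))
```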
