Requirements: Download and install Anaconda or Miniconda. The interface does not work for whole-genome prediction. This command is used together with the -s, --predict-sequences or -PG, --parse-genome arguments. This command limits the number of sequences the sliding window cuts from the genome.
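The sliding-window cutting with a cap on the number of sequences can be pictured with a minimal Python sketch. This is an illustration only, not the tool's actual code: the function name `sliding_windows` and the parameters `window`, `step`, and `max_windows` are hypothetical, and the default window length is an arbitrary choice.

```python
from typing import Iterator, Optional, Tuple


def sliding_windows(
    genome: str,
    window: int = 40,
    step: int = 1,
    max_windows: Optional[int] = None,
) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) pairs cut from a single genome sequence.

    `max_windows` (hypothetical name) plays the role of the option that
    limits how many sequences the sliding window cuts from the genome.
    """
    emitted = 0
    for start in range(0, len(genome) - window + 1, step):
        if max_windows is not None and emitted >= max_windows:
            return
        yield start, genome[start:start + window]
        emitted += 1


# Toy genome; a real run would read one sequence from a FASTA file.
genome = "ACGTACGTACGTTTGACATATAAT"
first_three = list(sliding_windows(genome, window=10, step=2, max_windows=3))
all_windows = list(sliding_windows(genome, window=10, step=2))
```

Capping the number of windows is mainly useful for quick test runs, since a full genome yields roughly one window per position.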
This command is used only with the -pg, --parse-genome argument. The mandatory argument used with this command is -f, --fasta. Note that the FASTA file should contain a single sequence.
The promoter region is a key element required for the production of RNA in bacteria.
While new high-throughput technology allows massively parallel mapping of promoter elements, we still mainly rely on bioinformatics tools to predict such elements in bacterial genomes. Additionally, despite many different prediction tools having become popular to identify bacterial promoters, no systematic comparison of such tools has been performed.
To address this gap, we used data sets of experimentally validated promoters from Escherichia coli and a control data set composed of randomly generated sequences with similar nucleotide distributions. We compared the performance of the tools using metrics such as specificity, sensitivity, accuracy, and Matthews correlation coefficient (MCC).
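For reference, the four metrics named above can be computed from a confusion matrix as in the following sketch. The standard textbook definitions are used; the function name `confusion_metrics` and the example counts are illustrative and are not taken from the study.

```python
import math
from typing import Dict


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Return sensitivity, specificity, accuracy, and MCC.

    tp/fn count real promoters predicted as promoter/non-promoter;
    tn/fp count control sequences predicted as non-promoter/promoter.
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up counts, for illustration only.
example = confusion_metrics(tp=80, fp=10, tn=90, fn=20)
```

MCC is the most informative single number here because it stays low when a tool labels nearly everything as a promoter, a case in which sensitivity alone looks excellent.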
Of these tools, iPro70-FMWin exhibited the best results for most of the metrics used. We present here some potentials and limitations of available tools, and we hope that future work can build upon our effort to systematically characterize this useful class of bioinformatics tools.
Also, when combining new DNA elements into synthetic sequences, predicting the potential generation of new promoter sequences is critical. In recent years, many bioinformatics tools have been created to allow users to predict promoter elements in a sequence or genome of interest.
Here, we assess the predictive power of some of the main prediction tools available using well-defined promoter data sets. Using Escherichia coli as a model organism, we demonstrated that while some tools are biased toward AT-rich sequences, others are very efficient in identifying real promoters with low false-negative rates.
We hope the potentials and limitations presented here will help the microbiology community to choose promoter prediction tools among many available alternatives. Thus, the correct mapping of promoters is a critical step when studying gene expression dynamics in bacteria. While the definition of promoters could vary widely, here we will consider promoters as the core elements recognized by the sigma subunit of the RNA polymerase (RNAP). In Escherichia coli, seven alternative sigma factors are responsible for gene expression, and sigma 70 is the most important one, as it is required for the expression of housekeeping genes (2, 3).
In addition to the core promoter region, other cis-regulatory elements can play relevant roles in the regulation of gene expression (4). In this sense, the production of RNA at the transcription start site (TSS) is the result of the interplay between the core promoter region and the cis-regulatory elements (5).
Mapping of functional promoter elements has been performed mostly using low-throughput techniques such as promoter probing, primer extension, and DNA footprinting. However, the rapidly growing number of fully sequenced bacterial genomes greatly exceeds our ability to map promoter elements experimentally.
Yet, over the past years, a growing number of computational strategies have evolved in complexity. Notable novel approaches have emerged, such as sequence alignment-based kernels for support vector machines (14, 15), profiles of hidden Markov models combined with artificial neural networks (16), and weighted rules extracted from neural network models. New ways to extract information from DNA sequences to perform predictions have also appeared. Thus, there are now several numerical representations of DNA sequences, each with its own properties (18-20), such as methods that use k-mer frequencies or variations (21, 22) and other methods that include physicochemical properties of DNA. Recently, machine learning (ML) techniques have been used to obtain insight from different sources across diverse fields of biology; an extensive survey can be seen in Libbrecht and Noble [24], Camacho et al.
Among the ML algorithms used for this purpose, we can mention support vector machines (27), neural networks (28), logistic regression (29), decision trees (30), and hidden Markov models (31). Despite the existence of all these modern techniques, promoters cannot always be inferred from their sequence alone, and it is currently unclear how efficient these tools are.
This occurs because each new tool is validated without standardized data sets or methods, making it difficult to compare emerging alternatives with the current state of the art. In this work, we summarize general aspects of the available promoter prediction tools and compare their main strengths and weaknesses. For this, we compared the performance of these tools using experimentally validated promoters from E.
Unexpectedly, we show that some very popular tools such as BPROM performed very poorly compared to tools created over the last 2 years. We hope our results can help both users to choose a suitable tool for their specific applications and developers to construct novel tools that overcome the key limitations reported here.
In this section, we present a succinct explanation of each methodology (see Table 1) as well as usability information (requirements, acceptable file types, etc.).
Below, we briefly describe how each tool was built and some of its main features.
BPROM (33) was developed as a module of an annotation pipeline for microbial sequences to find promoters in upstream regions of predicted open reading frames (ORFs). To train the model, the authors used a data set of experimentally validated promoters from elsewhere. They applied linear discriminant analysis to discriminate between those promoters and inner regions of protein-coding sequences.
For attributes, they used five position weight matrices of conserved promoter motifs, and they also considered the distance between the -10 and -35 boxes and the ratio of densities of octanucleotides overrepresented in known bacterial transcription factor binding sites (TFBSs) relative to their occurrence in coding regions.
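To make the position weight matrix (PWM) and box-distance attributes concrete, here is a minimal Python sketch. It is not BPROM's implementation: the motif instances are toy variants of the well-known TTGACA (-35) and TATAAT (-10) consensus sequences, the background is assumed uniform, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

ALPHABET = "ACGT"
PWM = List[Dict[str, float]]


def build_pwm(sites: List[str], pseudocount: float = 1.0) -> PWM:
    """Build a log-odds PWM from aligned, equal-length motif instances.

    Assumes a uniform background (0.25 per base), a simplification.
    """
    pwm: PWM = []
    total = len(sites) + pseudocount * len(ALPHABET)
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / 0.25)
            for base in ALPHABET
        })
    return pwm


def score_at(sequence: str, pwm: PWM, start: int) -> float:
    """Sum the per-position log-odds scores for the window at `start`."""
    return sum(pwm[i][sequence[start + i]] for i in range(len(pwm)))


def best_hit(sequence: str, pwm: PWM) -> Tuple[int, float]:
    """Return (start, score) of the highest-scoring window."""
    starts = range(len(sequence) - len(pwm) + 1)
    best = max(starts, key=lambda s: score_at(sequence, pwm, s))
    return best, score_at(sequence, pwm, best)


# Toy motif instances (illustrative, not a curated data set).
pwm35 = build_pwm(["TTGACA", "TTGATA", "TTTACA", "TTGACT", "CTGACA"])
pwm10 = build_pwm(["TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT"])

spacer_seq = "GCCGGCGCGGCCGGCGC"
sequence = "GG" + "TTGACA" + spacer_seq + "TATAAT" + "GCGG"

start35, score35 = best_hit(sequence, pwm35)
start10, score10 = best_hit(sequence, pwm10)
# Distance between the end of the -35 box and the start of the -10 box.
spacer = start10 - (start35 + len(pwm35))
```

In a classifier of this kind, the two PWM scores and the spacer length would be three of the input features; the discriminant step itself is not shown.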
This tool is available as a web application, and users can submit a local file or paste the sequence into the web form. It quickly returns the results on the screen, with the possible -10 and -35 boxes of predicted promoters and their positions in the submitted sequence.
Its positive data set consists of experimentally validated E.
Its negative data set consists of genomic regions where there is no experimental evidence for the presence of TSSs. The authors started with 30 features distributed among these types: promoter element motifs (PWMs), the distance between the elements, oligomer scores, TFBS density, and physicochemical properties.
The final set of features was selected by evaluating their predictive power using the Mahalanobis distance and was then used to train a neural network. This tool is available as a web application or as a stand-alone tool for Linux. On the website, an e-mail address is needed to log in, and the results are saved for a week.
BacPP (17) is a prediction tool to find E.
For a positive data set, the authors used promoter sequences from RegulonDB for six different sigma factors in E. Each nucleotide of these sequences was transformed into binary digits and used to train neural networks. To use this tool, the user must create a login on the website, then paste the sequences or submit a FASTA file according to their model, and select the sigma factors of interest.
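One common way to turn nucleotides into binary digits is one-hot encoding, sketched below. The exact encoding used by the tool is not specified in this text, so the mapping and the function name `one_hot_encode` are assumptions made for illustration.

```python
from typing import List

# Assumed mapping: one bit per base, in the order A, C, G, T.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(sequence: str) -> List[int]:
    """Flatten a DNA sequence into a binary vector of length 4 * len(sequence).

    Unknown symbols (for example N) are encoded as four zeros.
    """
    encoded: List[int] = []
    for base in sequence.upper():
        encoded.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return encoded


vector = one_hot_encode("TATAAT")
```

A fixed-length promoter window therefore becomes a fixed-length input vector, which is what a feed-forward neural network expects.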
CNNProm (34) is a web tool that can predict prokaryotic and eukaryotic promoters from large genomic sequences or multi-FASTA files. In the case of E. Each of these sequences was transformed into a binary four-dimensional vector and used directly as features to train a convolutional neural network. To use this predictor, users must enter the sequences or the file on the website and choose the organism model.
The image generation and selection are conducted by applying an evolutionary approach and calculating the similarity of these images in a set of E.
The authors measured the accuracy of the tool by analyzing the set of promoters and protein-coding sequences. To use this software, it is necessary to download the executable files, execute the evolutionary algorithm with the promoters of interest, and then run the classifier software, which uses the resulting model generated in the previous step.
Virtual Footprint (36) is a web framework for prokaryotic regulon prediction. To make the prediction, it is necessary to upload a DNA sequence or a FASTA file, select different PWMs for core promoter elements or other transcription factor binding sites, and set some parameters.
Its training data set consists of sigma 70 promoter sequences from data set RegulonDB 9. These features include, for example, different kinds of k-mer and g-gapped k-mer compositions and statistical and nucleotide frequency measures.
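As an illustration of the k-mer and g-gapped compositions mentioned above, the sketch below computes k-mer frequencies and, for the gapped case, the frequencies of nucleotide pairs separated by `gap` positions. Definitions of g-gapped features vary between tools, so this is one common reading and not necessarily the one used by the authors; the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(sequence: str, k: int) -> Dict[str, float]:
    """Relative frequency of every possible k-mer over A, C, G, T."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = max(sum(counts.values()), 1)  # avoid division by zero
    return {
        "".join(kmer): counts["".join(kmer)] / total
        for kmer in product("ACGT", repeat=k)
    }


def gapped_dinucleotide_frequencies(sequence: str, gap: int) -> Dict[str, float]:
    """Relative frequency of base pairs separated by `gap` positions."""
    counts = Counter(
        sequence[i] + sequence[i + gap + 1]
        for i in range(len(sequence) - gap - 1)
    )
    total = max(sum(counts.values()), 1)
    return {
        "".join(pair): counts["".join(pair)] / total
        for pair in product("ACGT", repeat=2)
    }


dinucleotides = kmer_frequencies("ACGTAC", k=2)
gapped = gapped_dinucleotide_frequencies("ACGTAC", gap=1)
```

Each function returns a fixed-length feature vector (16 values for dinucleotides) regardless of sequence length, which makes the output directly usable by classifiers such as logistic regression.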
Among the machine learning methods tested by the authors, logistic regression achieved the best results. They also applied the AdaBoost technique for feature selection to improve prediction.
The attributes generated from the sequences were position-specific trinucleotide propensity and electron-ion interaction pseudopotentials of nucleotides, considering single- or double-stranded DNA, to reveal trinucleotide distribution differences between the samples and represent the interaction of trinucleotides, respectively.
For model training, the authors used experimentally confirmed promoter sequences from RegulonDB 9. It is important to emphasize that sequences with more than 0.
Their feature extraction was based on multiwindow-based pseudo K-tuple nucleotide composition, which uses a sliding window to extract and encode physicochemical attributes of different regions of a given sequence.
To train their model, the authors used experimentally validated promoter sequences from RegulonDB for all types of sigma factors in E. Their feature extraction was divided into two types: the first was used to represent global features, applying biprofile Bayes and k-nearest neighbor (KNN) features, and the second was used to represent local features, applying k-tuple nucleotide composition (a sequence-based feature) and dinucleotide-based auto-covariance (which considers physicochemical properties).
This method also performs two steps of classification: first, it determines whether a given sequence is a promoter, and then it decides to which class of sigma promoter it belongs. The authors used the SVM method for classification and the F-score method for feature selection.
To compare the performance of the promoter prediction tools presented above, we analyzed the positive and negative data sets as described in Materials and Methods.
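The negative (control) data set is described in this text only as randomly generated sequences with nucleotide distributions similar to the promoters. One simple way to build such a control is sketched below; the actual procedure is the one given in Materials and Methods, and the function name `random_controls`, the fixed seed, and the toy input are assumptions.

```python
import random
from collections import Counter
from typing import List


def random_controls(promoters: List[str], seed: int = 0) -> List[str]:
    """Generate one random sequence per promoter, matching its length.

    Bases are drawn independently with probabilities equal to the pooled
    base frequencies of the promoter set (a zero-order model).
    """
    rng = random.Random(seed)
    counts = Counter("".join(promoters))
    bases = "ACGT"
    weights = [counts[base] for base in bases]
    return [
        "".join(rng.choices(bases, weights=weights, k=len(promoter)))
        for promoter in promoters
    ]


# Toy positive set, for illustration only.
positives = ["TTGACATATAAT", "TTTACAGCTATAAT", "TTGATATACAAT"]
controls = random_controls(positives)
```

Because the control keeps the AT content of real promoters but not their motif structure, a tool that scores it highly is likely responding to base composition, not to promoter signals.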
Of the 10 algorithms selected, BacPP could not be tested with our entire data set because multi-FASTA files were not supported, and Virtual Footprint produced a large number of predicted -10 boxes for sigma 70 in both positive and negative data sets, a number that greatly exceeded the number of sequences analyzed.
Thus, these two tools were not considered in further analyses. The best performance was observed for CNNProm From the analysis presented in Fig.
Analysis of the performance of promoter prediction tools. (A) Percentage of sequences predicted as sigma dependent promoters in both data sets.
The percentage of correct classifications of experimental promoters (blue) and the percentage of misclassified random sequences (gray) are presented. The vertical dashed line separates the five best tools from the three worst tools analyzed. (B) Metrics used to evaluate the performance of the tools. It is important to emphasize that two tools presented the highest sensitivity associated with low specificity, i.
The vertical dashed line divides the four best tools from the four worst tools.
Next, we performed a hierarchical clustering analysis using the results from the five tools that presented the best results.
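A hierarchical clustering of tools by their predictions can be sketched as follows. The distance measure and linkage used in the study are not stated in this text, so the Hamming distance and average linkage below are assumptions, and the tool names and prediction vectors are made up.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Cluster = Tuple[str, ...]


def hamming(a: List[int], b: List[int]) -> float:
    """Fraction of sequences on which two tools disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(
    predictions: Dict[str, List[int]],
) -> List[Tuple[Cluster, Cluster, float]]:
    """Agglomerative clustering; returns the merges in the order made."""

    def dist(c1: Cluster, c2: Cluster) -> float:
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(
            hamming(predictions[a], predictions[b]) for a, b in pairs
        ) / len(pairs)

    clusters: List[Cluster] = [(name,) for name in predictions]
    merges: List[Tuple[Cluster, Cluster, float]] = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=lambda p: dist(*p))
        merges.append((left, right, dist(left, right)))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges


# 1 = predicted promoter, 0 = predicted non-promoter, per test sequence.
toy_predictions = {
    "tool_A": [1, 1, 0, 1, 0, 1],
    "tool_B": [1, 1, 0, 1, 1, 1],
    "tool_C": [0, 0, 1, 0, 1, 0],
}
merges = average_linkage(toy_predictions)
```

Tools that merge early make nearly the same calls on the same sequences, which suggests they capture similar sequence features.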
As can be seen in Fig.