Annotate with description an article

4/10/2024

Reactive Oxygen Species (ROS) are constitutively produced in plants during photosynthesis, respiration, and photorespiration, but are also produced in a controlled manner as signal or active molecules. In all cases, ROS homeostasis can be controlled by a large set of proteins described as the ROS gene network (Inupakutika et al., 2016). Most of the proteins of this network are members of large superfamilies characterized by PFAM domains that are more or less specific. Indeed, one PFAM entry may encompass several classes or subclasses of proteins (Table 1, gray cells) and lead to mis-annotations.

Peroxidases, which belong to this network, participate in oxidation-reduction reactions using hydrogen peroxide (H₂O₂) as an electron acceptor and various substrates as electron donors. They may or may not contain a haem prosthetic group, justifying further subdivision into two major protein families, namely "haem peroxidases" and "non-haem peroxidases." The haem peroxidases, such as the non-animal peroxidase family, are found in all kingdoms (Passardi et al., 2007). The non-animal peroxidase family was first described on the basis of structural homology (Welinder et al., 1992). It includes three classes of peroxidases: Class I (CI Prxs), Class II (CII Prxs), and Class III (CIII Prxs). This family is grouped under a unique PFAM entry (PF00141) (Table 1), which describes the conserved peroxidase domain (mainly the haem binding sites). This PFAM domain can extract most of the non-animal encoded sequences from any annotated genome, but unfortunately, it does not discriminate between the three classes (Figure 2B) and may produce erroneous annotations that require correction by experts. Over the past 5 years, 12 global phylogenetic and expression analyses of CIII Prxs from different plant species have been published, including four in 2020 (Ren et al., 2014; Wang et al., 2015; Cao et al., 2016; Moural et al., 2017; Duan et al., 2019; Wu et al., 2019; Yan et al., 2019; Zhu et al., 2019; Li et al., 2020; Xiao et al., 2020; Yang et al., 2020; Cai et al., 2021).

Due to high sequence identity and duplication rate, the risks of mis-prediction are exacerbated in the case of multigenic families (Figure 1; Fawal et al., 2014). In addition, protein annotation or function assignment, based on the presence of a hypothetical protein domain or on homology with known proteins, can also lead to an inappropriate annotation. The risk of mis-annotation is high for proteins containing multiple domains or small domain(s) common to several classes of proteins. For example, the PFAM domain PF07992 (Pyridine nucleotide-disulphide oxidoreductase) is detected in MonoDehydroAscorbate Reductases (MDARs), Glutathione Reductases (GRs), and in the Thioredoxin family (Trx), but does not discriminate between these three different families (Table 1). Mis-annotations are also observed for proteins belonging to superfamilies with a conserved domain and a large number of protein families and classes. As an example, 198 genes of the MYB superfamily have been detected in Arabidopsis thaliana (Yanhui et al., 2006), but the PFAM domain PF00249 (Myb_DNA-binding) does not discriminate between the R2R3-MYB, the R1R2R3-MYB, the MYB-related, and the atypical MYB families. In addition, the PF00249 entry also contains the SANT domain, which has a strong structural similarity to the Myb domain but is functionally divergent. Therefore, using this PFAM entry to extract MYB proteins returns many false positives (total of 326 sequences from A.
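The ambiguity described above can be made concrete with a small lookup. The following is only a minimal sketch: the mapping restates the PFAM entries and families named in the text, while the helper name `annotate_from_pfam` and the `needs_expert_review` flag are hypothetical and introduced purely for illustration.

```python
# Families that share each PFAM entry, as listed in the text.
# This is an illustrative subset, not the full content of Table 1.
PFAM_TO_FAMILIES = {
    "PF00141": ["CI Prx", "CII Prx", "CIII Prx"],
    "PF07992": ["MDAR", "GR", "Trx"],
    "PF00249": ["R2R3-MYB", "R1R2R3-MYB", "MYB-related",
                "atypical MYB", "SANT-domain protein"],
}


def annotate_from_pfam(pfam_id):
    """Hypothetical helper: list the families compatible with one PFAM hit.

    A hit is flagged for expert review whenever it does not point to
    exactly one family (several candidates, or an unknown entry).
    """
    candidates = PFAM_TO_FAMILIES.get(pfam_id, [])
    return {
        "pfam": pfam_id,
        "candidates": candidates,
        "needs_expert_review": len(candidates) != 1,
    }


if __name__ == "__main__":
    for pfam_id in PFAM_TO_FAMILIES:
        result = annotate_from_pfam(pfam_id)
        print(result["pfam"], result["candidates"], result["needs_expert_review"])
```

Every entry in this subset maps to more than one family, so a domain hit alone never settles the annotation; a second line of evidence (for example class-specific motifs or phylogeny) would be needed to choose among the candidates.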

In the last few decades, the explosion of genomic projects has produced huge sets of predicted genes and annotated sequences. The prediction of a gene structure can be defined as the capacity to determine the start and the stop of the gene as well as the positions of introns, if present. Despite the number of high-performing gene prediction programs combining ab initio and homology-based approaches (Mathe et al., 2002; Hoff and Stanke, 2015), the rate of mis-predicted genes is not negligible and can be due to several factors (Scalzitti et al., 2020). For example, unusually long introns, short exons, or long genes can generate incomplete or partially predicted gene structures; short intergenic regions can lead to gene fusion; DNA sequencing errors (nucleotide deletions or insertions) introducing frameshifts can affect predictions; and non-canonical splice sites, overlapping genes, and genes located within introns are also a source of erroneous predictions.
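Some of the error sources listed above can be screened for automatically once a gene model is available. The sketch below is an illustration under stated assumptions, not a method from the cited works: the `GeneModel` class, the `flag_gene_model` function, and the default thresholds (`max_intron`, `min_exon`) are hypothetical, exons are assumed to be coding only, on the forward strand, and given as 0-based half-open intervals, and only the common GT-AG intron boundary is treated as canonical.

```python
from dataclasses import dataclass

# Canonical donor and acceptor dinucleotides of the common GT-AG introns.
CANONICAL_SPLICE = ("GT", "AG")


@dataclass
class GeneModel:
    # Sorted (start, end) coding exons, 0-based half-open, forward strand.
    exons: list


def introns(model):
    """Derive intron intervals as the gaps between consecutive exons."""
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(model.exons, model.exons[1:])]


def flag_gene_model(model, genome, max_intron=5000, min_exon=20):
    """Return human-readable warnings for one predicted gene structure.

    The thresholds are arbitrary placeholders and should be tuned per species.
    """
    flags = []
    for start, end in model.exons:
        if end - start < min_exon:
            flags.append(f"short exon {start}-{end}")
    for start, end in introns(model):
        if end - start > max_intron:
            flags.append(f"unusually long intron {start}-{end}")
        donor, acceptor = genome[start:start + 2], genome[end - 2:end]
        if (donor.upper(), acceptor.upper()) != CANONICAL_SPLICE:
            flags.append(f"non-canonical splice site at intron {start}-{end}")
    coding_length = sum(end - start for start, end in model.exons)
    if coding_length % 3 != 0:
        flags.append("coding length not a multiple of 3 (possible frameshift)")
    return flags


if __name__ == "__main__":
    genome = "ATGAAA" + "GTCCCCAG" + "TTTTAA"
    model = GeneModel(exons=[(0, 6), (14, 20)])
    print(flag_gene_model(model, genome, max_intron=100, min_exon=3))
```

A flag here only marks a gene model as worth a second look; non-canonical splice sites and long introns do occur in real genes, so such checks complement expert curation rather than replace it.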
