GenBank.
Sayers Eric W,Cavanaugh Mark,Clark Karen,Pruitt Kim D,Schoch Conrad L,Sherry Stephen T,Karsch-Mizrachi Ilene
Nucleic acids research
GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public database that contains 15.3 trillion base pairs from over 2.5 billion nucleotide sequences for 504 000 formally described species. Recent updates include resources for data from the SARS-CoV-2 virus, including a SARS-CoV-2 landing page, NCBI Datasets, NCBI Virus and the Submission Portal. We also discuss upcoming changes to GI identifiers, a new data management interface for BioProject, and advice for providing contextual metadata in submissions.
10.1093/nar/gkab1135
Gene: a gene-centered information resource at NCBI.
Brown Garth R,Hem Vichet,Katz Kenneth S,Ovetsky Michael,Wallin Craig,Ermolaeva Olga,Tolstoy Igor,Tatusova Tatiana,Pruitt Kim D,Maglott Donna R,Murphy Terence D
Nucleic acids research
The National Center for Biotechnology Information's (NCBI) Gene database (www.ncbi.nlm.nih.gov/gene) integrates gene-specific information from multiple data sources. NCBI Reference Sequence (RefSeq) genomes for viruses, prokaryotes and eukaryotes are the primary foundation for Gene records in that they form the critical association between sequence and a tracked gene upon which additional functional and descriptive content is anchored. Additional content is integrated based on the genomic location and RefSeq transcript and protein sequence data. The content of a Gene record represents the integration of curation and automated processing from RefSeq, collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI. Records in Gene are assigned unique, tracked integers as identifiers. The content (citations, nomenclature, genomic location, gene products and their attributes, phenotypes, sequences, interactions, variation details, maps, expression, homologs, protein domains and external databases) is available via interactive browsing through NCBI's Entrez system, via NCBI's Entrez programming utilities (E-Utilities and Entrez Direct) and for bulk transfer by FTP.
10.1093/nar/gku1055
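The abstract above notes that Gene records carry integer identifiers and can be retrieved through NCBI's E-Utilities. As a minimal sketch of what such a request might look like, the snippet below only builds an ESummary URL for the `gene` database and does not send it. The helper name `gene_esummary_url` is hypothetical, and the example identifiers (7157 for human TP53, 4193 for human MDM2) are illustrative choices that do not come from the abstract.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities endpoints.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def gene_esummary_url(gene_ids, retmode="json"):
    """Build an ESummary request URL for one or more Gene IDs.

    Hypothetical helper for illustration: it only constructs the URL
    and performs no network access.
    """
    # Gene identifiers are integers; int() rejects anything else early.
    ids = ",".join(str(int(g)) for g in gene_ids)
    # safe="," keeps the comma-separated ID list readable in the URL.
    query = urlencode({"db": "gene", "id": ids, "retmode": retmode}, safe=",")
    return f"{EUTILS_BASE}/esummary.fcgi?{query}"


if __name__ == "__main__":
    # Example IDs (assumed for illustration): 7157 = TP53, 4193 = MDM2.
    print(gene_esummary_url([7157, 4193]))
```

The resulting URL can be opened in a browser or passed to any HTTP client. Anyone scripting many requests should check NCBI's current usage guidelines first.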
A sequence-based evolutionary distance method for Phylogenetic analysis of highly divergent proteins.
Scientific reports
Because of the limited effectiveness of prevailing phylogenetic methods when applied to highly divergent protein sequences, the phylogenetic analysis problem remains challenging. Here, we propose a sequence-based evolutionary distance algorithm termed sequence distance (SD), which innovatively incorporates site-to-site correlation within protein sequences into the distance estimation. In protein superfamilies, SD can effectively distinguish evolutionary relationships both within and between protein families, producing phylogenetic trees that closely align with those based on structural information, even with sequence identity less than 20%. SD is highly correlated with the similarity of the protein structure, and can calculate evolutionary distances for thousands of protein pairs within seconds using a single CPU, which is significantly faster than most protein structure prediction methods that demand high computational resources and long run times. The development of SD will significantly advance phylogenetics, providing researchers with a more accurate and reliable tool for exploring evolutionary relationships.
10.1038/s41598-023-47496-9
p53: 800 million years of evolution and 40 years of discovery.
Levine Arnold J
Nature reviews. Cancer
The evolutionarily conserved p53 protein and its cellular pathways mediate tumour suppression through an informed, regulated and integrated set of responses to environmental perturbations resulting in either cellular death or the maintenance of cellular homeostasis. The p53 and MDM2 proteins form a central hub in this pathway that receives stressful inputs via MDM2 and responds via p53 by informing and altering a great many other pathways and functions in the cell. The MDM2-p53 hub is one of the hubs most highly connected to other signalling pathways in the cell, and this may be why TP53 is the most commonly mutated gene in human cancers. Initial or truncal TP53 gene mutations (the first mutations in a stem cell) are selected for early in cancer development in ectodermal and mesodermal-derived tissue-specific stem and progenitor cells and then, following additional mutations, produce tumours from those tissue types. In endodermal-derived tissue-specific stem or progenitor cells, TP53 mutations are functionally selected as late mutations transitioning the mutated cell into a malignant tumour. The order in which oncogenes or tumour suppressor genes are functionally selected for in a stem cell impacts the timing and development of a tumour.
10.1038/s41568-020-0262-1
Analyzing Phylogenetic Trees with a Tree Lattice Coordinate System and a Graph Polynomial.
Systematic biology
Phylogenetic trees are a central tool in many areas of life science and medicine. They demonstrate evolutionary patterns among species, genes, and patterns of ancestry among sets of individuals. The tree shapes and branch lengths of phylogenetic trees encode evolutionary and epidemiological information. To extract information from tree shapes and branch lengths, representation and comparison methods for phylogenetic trees are needed. Representing and comparing tree shapes and branch lengths of phylogenetic trees are challenging, for a tree shape is unlabeled and can be displayed in numerous different forms, and branch lengths of a tree shape are specific to edges whose positions vary with respect to the displayed forms of the tree shape. In this article, we introduce representation and comparison methods for rooted unlabeled phylogenetic trees based on a tree lattice that serves as a coordinate system for rooted binary trees with branch lengths and a graph polynomial that fully characterizes tree shapes. We show that the introduced tree representations and metrics provide distance-based likelihood-free methods for tree clustering, parameter estimation, and model selection and apply the methods to analyze phylogenies reconstructed from virus sequences. [Graph polynomial; likelihood-free inference; phylogenetics; tree lattice; tree metrics.].
10.1093/sysbio/syac008
MEGA11: Molecular Evolutionary Genetics Analysis Version 11.
Tamura Koichiro,Stecher Glen,Kumar Sudhir
Molecular biology and evolution
The Molecular Evolutionary Genetics Analysis (MEGA) software has matured to contain a large collection of methods and tools of computational molecular evolution. Here, we describe new additions that make MEGA a more comprehensive tool for building timetrees of species, pathogens, and gene families using rapid relaxed-clock methods. Methods for estimating divergence times and confidence intervals are implemented to use probability densities for calibration constraints for node-dating and sequence sampling dates for tip-dating analyses. They are supported by new options for tagging sequences with spatiotemporal sampling information, an expanded interactive Node Calibrations Editor, and an extended Tree Explorer to display timetrees. Also added is a Bayesian method for estimating neutral evolutionary probabilities of alleles in a species using multispecies sequence alignments and a machine learning method to test for the autocorrelation of evolutionary rates in phylogenies. The computer memory requirements for the maximum likelihood analysis are reduced significantly through reprogramming, and the graphical user interface has been made more responsive and interactive for very big data sets. These enhancements will improve the user experience, quality of results, and the pace of biological discovery. Natively compiled graphical user interface and command-line versions of MEGA11 are available for Microsoft Windows, Linux, and macOS from www.megasoftware.net.
10.1093/molbev/msab120
MEGA: Machine Learning-Enhanced Graph Analytics for Infodemic Risk Management.
IEEE journal of biomedical and health informatics
The COVID-19 pandemic brought not only global devastation but also an unprecedented infodemic of false or misleading information that spread rapidly through online social networks. Network analysis plays a crucial role in the science of fact-checking by modeling and learning the risk of infodemics through statistical processes and computation on mega-sized graphs. This article proposes MEGA, Machine Learning-Enhanced Graph Analytics, a framework that combines feature engineering and graph neural networks to enhance the efficiency of learning performance involving massive graphs. Infodemic risk analysis is a unique application of the MEGA framework, which involves detecting spambots by counting triangle motifs and identifying influential spreaders by computing the distance centrality. The MEGA framework is evaluated using the COVID-19 pandemic Twitter dataset, demonstrating superior computational efficiency and classification accuracy.
10.1109/JBHI.2023.3314632
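The abstract above names two graph features, triangle motif counts and distance centrality, without giving details. The sketch below shows one plain-Python way such features might be computed on a small undirected graph. It is not the MEGA framework's implementation: the function names are hypothetical, "distance centrality" is assumed here to mean closeness centrality (the reciprocal of the mean shortest-path distance to reachable nodes), and the toy edge list is invented for illustration.

```python
from collections import deque


def build_adjacency(edges):
    """Return an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops; they cannot form triangles
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def triangles_per_node(adj):
    """Count the triangles that pass through each node."""
    counts = {}
    for node, nbrs in adj.items():
        # Each triangle through `node` is an edge between two of its
        # neighbours, and the sum below sees every such edge twice.
        links = sum(len(nbrs & adj[n]) for n in nbrs)
        counts[node] = links // 2
    return counts


def closeness_centrality(adj, source):
    """Closeness of `source` within its connected component (via BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    # (reachable nodes) / (sum of distances); 0.0 for an isolated node.
    return (len(dist) - 1) / total if total else 0.0


if __name__ == "__main__":
    # Toy graph (invented): a triangle a-b-c with a pendant node d on c.
    graph = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
    print(triangles_per_node(graph))
    print(closeness_centrality(graph, "c"), closeness_centrality(graph, "d"))
```

On graphs of the size the abstract describes, a dedicated graph library or a sampling-based approximation would likely be needed. This version is only meant to make the two quantities concrete.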
MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms.
Kumar Sudhir,Stecher Glen,Li Michael,Knyaz Christina,Tamura Koichiro
Molecular biology and evolution
The Molecular Evolutionary Genetics Analysis (Mega) software implements many analytical methods and tools for phylogenomics and phylomedicine. Here, we report a transformation of Mega to enable cross-platform use on Microsoft Windows and Linux operating systems. Mega X does not require virtualization or emulation software and provides a uniform user experience across platforms. Mega X has additionally been upgraded to use multiple computing cores for many molecular evolutionary analyses. Mega X is available in two interfaces (graphical and command line) and can be downloaded from www.megasoftware.net free of charge.
10.1093/molbev/msy096