SciELO - Scientific Electronic Library Online

vol.117 número1-2François Levaillant: Explorer and biologistDung beetles do the dirty work for the planet with style and charisma índice de autoresíndice de materiabúsqueda de artículos
Home Pagelista alfabética de revistas  

Servicios Personalizados



Links relacionados

  • En proceso de indezaciónCitado por Google
  • En proceso de indezaciónSimilares en Google


South African Journal of Science

versión On-line ISSN 1996-7489
versión impresa ISSN 0038-2353

S. Afr. j. sci. vol.117 no.1-2 Pretoria ene./feb. 2021 



Artificial intelligence enhanced molecular databases can enable improved user-friendly bioinformatics and pave the way for novel applications



Chenjerayi Kashangura

Biological Sciences Department, University of Zimbabwe, Harare, Zimbabwe




Keywords: deep learning, deep reasoning, algorithm



Molecular databases have enabled scientists across the globe to collaborate and contribute to the growth of the databases. The current form of the databases involves researcher input which is acted upon by algorithms developed by bioinformaticians leading to outputs for researchers. Experimental data analysis using the molecular databases normally results in a reduction in cost and time for in vitro experiments preceded by in silico stages. Molecular biology technologies are applied in multiple disciplines, generating enormous amounts of data every day, which, when deposited, requires professional staff to annotate, verify submissions and generally maintain the database. The rapid rise of artificial intelligence (AI) can be used to enhance molecular databases through incorporation of deep learning and deep reasoning to enable the molecular databases to partially self-maintain, bringing novel applications and the potential for an improved user-friendly interface for researchers who are not trained in bioinformatics to generate data that require bioinformatics-related analysis.

Bioinformatics has been around since the 1960s, whilst online molecular databases that handle data generated by disciplines in the life sciences have existed since the 1990s1, with researchers contributing by submitting data generated through experiments conducted in vitro and in vivo or by utilising the molecular databases for in silico analysis. The backbone of this analysis is bioinformatics presented in various algorithms tailor-made for different molecular data, such as DNA (genomics), RNA (transcriptomics) and protein (proteomics), to produce particular outcomes. This analysis relies on the researcher interrogating the database through the algorithms.

A plethora of databases such as GenBank has been published in the Nucleic Acids Research Database issues for the past 26 years, highlighting a range of different molecular data and a synchronous range of bioinformatics capabilities within the databases.2-4 The tools to use on data contained in the molecular databases is determined by researchers. Often a researcher may opt to use the bioinformatics tools they are well acquainted with and opt not to use other tools which might require arduous training. However, the data submitted to providers of molecular databases require skilled professionals to update and secure systems and verify the accuracy of the data submitted1 and funds are required for such staff.

The growing availability of AI brings the possibility of self-learning, self-reasoning and self-improving molecular databases that can be considered to be 'next-generation enhanced molecular databases' which can assist in lowering the cost of maintenance, ultimately reducing databases becoming defunct and introducing novel applications and a new in-depth analysis of molecular data. These next-generation databases can enable bioinformatics to become more user-friendly to non-bioinformatics trained researchers who, owing to the multi- and interdisciplinary nature of molecular life sciences data, sometimes face a daunting task in analysing data.

This next generation may take the form of AI led or incorporated databases. These can have various forms of deep learning, deep reasoning and reading algorithms that enable the databases to learn experientially. These databases will not be static but will be capable of increasing the database routine tasks that can be accomplished through experiential learning so that activities such as curation, annotation and archiving can be partly accomplished, with human verification required, together with the introduction of new potential applications.

Molecular databases enhanced with AI may be useful for a new level of deep analysis that involves the AI selecting the parameters to be used through experiential learning in areas that include three-dimensional modelling5, binding predictions and domain calling6, epigenome, especially differential DNA methylation patterns7, deep analysis of multi-omic data8,9, sequence-based taxonomy3, precision medicine and drug discovery10,11. The experiential learning capability suggests that the database can, for example, interrogate a submitted set of molecular data and determine the nature of the data and carry out basic analysis to assign identity, annotation and generation of output from the AI selected parameters, like a phylogenetic tree or identified potential antimicrobial agents. A user interface can then pop up with suggestions for further data analysis, for example, if the AI has identified a potential therapeutic agent against existing or emerging pathogens, then prediction AI simulations12 are selected and run using deep learning algorithms that would run in silico trained parameter algorithms for predicting pathogen 'growth inhibition', or 'neutralise pathogen receptor access', or 'prevent multiplication'. This will greatly assist those researchers who are not bioinformatics trained to generate molecular life sciences data that require bioinformatics analysis. These next-generation databases can potentially provide improved user-friendly interaction with bioinformatics by providing single-click buttons for running particular bioinformatics tools whilst the parameters are selected by the AI after analysis of the submitted data set.

The current challenge of big data analysis may be ameliorated by development of AI that performs a comparative analysis of recently submitted big data sets against existing similar data sets and possibly suggests areas that need modification in terms of analysis for bioinformaticians to develop.


Competing interests

I declare that there are no competing interests.



1. Imker HJ. 25 Years of molecular biology databases: A study of proliferation, impact and maintenance. Front Res Metr Anal. 2018;3, Art. #18, 13 pages.        [ Links ]

2. Johnson G, Wu TT. Kabat Database and its applications: 30 Years after the first variability plot. Nucleic Acids Res. 2000;28(1):214-218.        [ Links ]

3. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, et al. GenBank. Nucleic Acids Res. 2013;41(D1):D36-D42.        [ Links ]

4. Rigden DJ, Fernandez XM. The 26th annual Nucleic Acids Research database issue and molecular biology database collection. Nucleic Acids Res. 2019;47(D1):D1-D7.        [ Links ]

5. Ay F, Noble WS. Analysis methods for studying the 3D architecture of the genome. Genome Biol. 2015;16(1), Art. #183, 15 pages.        [ Links ]

6. Joyce AR Zhang C, Bradley P Havranek, JJ. Structure-based modelling of protein: DNA specificity. Brief Funct Genomics. 2014;14(1):39-49.        [ Links ]

7. Huh I, Wu X, Park T, Yi SV. Detecting differential DNA methylation from sequencing of bisulfite converted DNA of diverse species. Brief Bioinform. 2019;20(1):33-46.        [ Links ]

8. Sangalaringam A, Ullah AZD, Marzek J, Gadaleta E, Nagano A, Ross-Adams H, et al. 'Multi-omic' data analysis using O-miner. Brief Bioinform. 2019;20(1):130-143.        [ Links ]

9. Spies D, Renz PF, Beyer TA, Ciaudo C. Comparative analysis of differential gene expression tools for RNA sequencing time course data. Brief Bioinform. 2019;20(1):288-289.        [ Links ]

10. Stokes JM, Yang K, Swanson K, Jin W, Cubillos-Ruiz A, Donghia NM, et al. A deep learning approach to antibiotic discovery. Cell. 2020;180(4):688-702.        [ Links ]

11. Hoffman M, Kleine-Weber H, Schroeder S, Krüger N, Herrler T, Erichsen S, et al. SARS-Cov-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically-proven protease inhibitor. Cell. 2020;181(2):271-280.        [ Links ]

12. Rodriguez AC. Simulation of genes and genomes forward in time. Curr Genomics. 2010;11(1):58-61.        [ Links ]



Chenjerayi Kashangura

Published: 29 January 2021

Creative Commons License Todo el contenido de esta revista, excepto dónde está identificado, está bajo una Licencia Creative Commons