Predicting bacterial phenotypic traits through improved machine learning using high-quality, curated datasets

Departments

Microbial Ecology and Diversity
Bioinformatics, IT & Databases
Microorganisms

Abstract

Predicting prokaryotic phenotypes (observable traits that govern functionality, adaptability, and interactions) holds significant potential for fields such as biotechnology, environmental sciences, and evolutionary biology. In this study, we leverage machine learning to explore the relationship between prokaryotic genotypes and phenotypes. Utilizing the highly standardized datasets in the BacDive database, we model eight physiological properties based on protein family inventories, evaluate model performance using multiple metrics, and examine the biological implications of our predictions. The high confidence values achieved underscore the importance of data quality and quantity for reliably inferring bacterial phenotypes. Our approach generates 50,396 completely new datapoints for 15,938 strains, now openly available in the BacDive database, thereby enriching existing phenotypic resources and enabling further research. The open-source software we provide can be readily applied to other datasets, such as those from metagenomic studies, and to various applications, including assessing the potential of soil bacteria for bioremediation.

Related Activities

This is referenced by
Refined variant calling pipeline on RNA-seq data of breast cancer cell lines without matched-normal samples
Eberth S., Koblitz J., Steenpaß L. and Pommerenke C.
BMC Res Notes 18(1): 67 (2025)

Associated Infrastructures

BacDive
Bacterial Diversity Database

Associated Projects

AIMARIA
Accelerating Innovation in Marine AI-powered Bioprospecting

01.04.2026-31.03.2030

DiASPora
Digital Approaches for the Synthesis of Poorly Accessible Biodiversity Information
Leibniz SAW  

01.04.2020-31.03.2023

Cite this activity

Koblitz J., Reimer L.C., Pukall R. and Overmann J. Predicting bacterial phenotypic traits through improved machine learning using high-quality, curated datasets. Communications Biology (2025). 10.1038/s42003-025-08313-3

Details

Date 07.06.2025
Journal Communications Biology
Issue 1
Volume 8
Pages 897
Publication Language English
Open Access Status Open Access (gold)
Online Ahead Of Print No
