Tackling bias in machine learning for protein engineering

#Machine Learning #Algorithmic Bias #Protein Engineering

Context

Machine learning has become essential for accelerating peptide and protein engineering. However, many ML models suffer from hidden biases (taxonomic overrepresentation, structural homogeneity, and incomplete sequence-function data) that compromise their generalisability and their utility in drug discovery.

Client/Partner Type:

Academic research, with direct implications for biotech/biopharma R&D

Challenge

Despite claims of high accuracy, most protein ML models are:

Trained on biased datasets (e.g., overrepresentation of α-helical antimicrobial peptides (AMPs) or of specific bacterial sources)
Poor at detecting out-of-distribution sequences
Not evaluated with realistic structural or evolutionary diversity

These shortcomings hinder discovery of novel, safe, and effective molecules.

Solutions

Across three integrated studies, we developed and validated methods to diagnose, benchmark, and mitigate bias in ML-guided peptide and protein modelling:

(1) Structural & Taxonomic Bias in Peptide Datasets

Analysed the structural diversity of 5,800+ AMPs using PEP2D secondary structure predictions
Visualised fold distributions via ternary plots (Helix-Strand-Coil)
Identified overrepresentation of α-helical peptides in models trained on GRAMPA and APD
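
The Helix-Strand-Coil summary above can be sketched in a few lines of Python. This is a minimal illustration, not the published analysis code: it assumes each prediction has already been reduced to a three-state string (H = helix, E = strand, anything else = coil), and the function names `ss_fractions` and `ternary_xy` are hypothetical. It converts one string into its three fractions and then into 2D coordinates inside an equilateral triangle, which is what a ternary plot displays.

```python
import math
from collections import Counter


def ss_fractions(ss):
    """Return (helix, strand, coil) fractions for a three-state string.

    Assumption: 'H' marks helix, 'E' marks strand, and every other
    character is counted as coil.
    """
    if not ss:
        raise ValueError("empty secondary structure string")
    counts = Counter(ss.upper())
    n = len(ss)
    helix = counts["H"] / n
    strand = counts["E"] / n
    coil = 1.0 - helix - strand
    return helix, strand, coil


def ternary_xy(helix, strand, coil):
    """Map a composition to (x, y) inside an equilateral triangle.

    Vertices: pure helix at (0, 0), pure strand at (1, 0),
    pure coil at (0.5, sqrt(3)/2).
    """
    total = helix + strand + coil
    if total <= 0:
        raise ValueError("fractions must sum to a positive value")
    x = (strand + 0.5 * coil) / total
    y = (math.sqrt(3) / 2) * coil / total
    return x, y


# Toy prediction string (made up for illustration only).
fractions = ss_fractions("HHHHCCEE")
point = ternary_xy(*fractions)
```

Plotting the resulting points for a whole dataset makes clustering near the helix vertex visible at a glance, which is the kind of overrepresentation described above.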

Plisson, F. Overcoming the Challenges in Machine Learning-Guided Antimicrobial Peptide Design. Proceedings of the 36th European and the 12th International Peptide Symposium 2022, 207-210. ISBN: 9798987214008. [PDF]

(2) Rapid Estimation of the Peptide/Protein Structural Landscape with Existing Protein Structure Prediction (PSP) Tools

Evaluated the performance of four protein structure predictors: Jpred4, PEP2D, PSIPRED, and AlphaFold2 (AF2)
Compared accuracy across sequence lengths, disordered regions, and structural classes
Identified the most robust predictors (e.g., AlphaFold2) for ML-ready structure annotation
Published reproducible benchmarking pipelines on GitHub [GitHub 1] [GitHub 2]
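
One common way to compare secondary structure predictors is per-residue three-state agreement with a reference annotation, often called Q3. The sketch below shows that idea only; it is not taken from the published pipelines, and whether Q3 was the metric used there is an assumption. The function names, tool labels, and toy sequences are all hypothetical.

```python
def q3_accuracy(predicted, reference):
    """Fraction of residues whose predicted state matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have equal length")
    if not reference:
        raise ValueError("empty reference string")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def mean_q3_per_tool(predictions, references):
    """Average Q3 per predictor.

    predictions: {tool_name: {peptide_id: three-state string}}
    references:  {peptide_id: three-state string}
    Peptides missing from a tool's output are skipped for that tool.
    """
    scores = {}
    for tool, preds in predictions.items():
        per_peptide = [
            q3_accuracy(preds[pid], ref)
            for pid, ref in references.items()
            if pid in preds
        ]
        if per_peptide:
            scores[tool] = sum(per_peptide) / len(per_peptide)
        else:
            scores[tool] = float("nan")
    return scores


# Toy data, invented for illustration; not real predictor output.
references = {"pep1": "HHHCCC", "pep2": "CCEEEC"}
predictions = {
    "tool_a": {"pep1": "HHHHCC", "pep2": "CCEEEC"},
    "tool_b": {"pep1": "CCCCCC", "pep2": "CCEECC"},
}
scores = mean_q3_per_tool(predictions, references)
```

Grouping the peptides by length or structural class before calling `mean_q3_per_tool` would give the kind of stratified comparison described above.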

Aldas-Bulos, V. D. & Plisson, F.*. Benchmarking protein structure predictors to assist machine learning-guided peptide discovery. Digital Discovery 2023, 2, 981-993. [DOI]

(3) Integrating Structure Awareness into ML Modelling Applied to Peptides & Proteins

Developed structure-aware pipelines that combine secondary structure features, dimensionality reduction, and outlier detection
Mitigated structural bias in ML predictions using sampling methods
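
As one illustration of a sampling method, the sketch below balances a dataset by undersampling each dominant structural class to the size of the rarest one. This is a deliberately simple stand-in chosen for clarity, not the sampling strategy from the cited study, and it leaves out the dimensionality reduction and outlier detection steps. The function names and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def dominant_class(helix, strand, coil):
    """Label a peptide by its largest secondary structure fraction.

    Ties resolve in the order helix, strand, coil.
    """
    fractions = {"helix": helix, "strand": strand, "coil": coil}
    return max(fractions, key=fractions.get)


def balanced_undersample(records, seed=0):
    """Return peptide ids with equal counts per dominant class.

    records: list of (peptide_id, (helix, strand, coil)) tuples.
    Only classes present in the data are balanced; an absent class
    cannot be created by undersampling.
    """
    groups = defaultdict(list)
    for peptide_id, fractions in records:
        groups[dominant_class(*fractions)].append(peptide_id)
    if not groups:
        return []
    n_per_class = min(len(ids) for ids in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for label in sorted(groups):
        sample.extend(rng.sample(groups[label], n_per_class))
    return sample


# Toy helix-heavy dataset, invented for illustration only.
records = [
    ("p1", (0.8, 0.0, 0.2)),
    ("p2", (0.7, 0.1, 0.2)),
    ("p3", (0.9, 0.0, 0.1)),
    ("p4", (0.6, 0.1, 0.3)),
    ("p5", (0.1, 0.6, 0.3)),
    ("p6", (0.0, 0.7, 0.3)),
    ("p7", (0.1, 0.1, 0.8)),
]
balanced_ids = balanced_undersample(records)
```

Undersampling discards data, so on small peptide datasets oversampling or sample weighting may be preferable; the right choice depends on the dataset and the model.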

Aguilera-Puga, M. d. C. & Plisson, F.*. Structure-aware machine learning strategies for antimicrobial peptide discovery. Scientific Reports 2024, 14, 11995. [DOI] [GitHub]

Outcomes

Revealed and quantified algorithmic biases in widely used peptide ML tools
Provided benchmarking standards and best practices for building more generalisable models
Proposed the integration of structure-aware features and outlier detection into workflows
Published three peer-reviewed articles, including in Scientific Reports and Digital Discovery
Delivered open-source software for peptide structure validation and ML readiness

Tools/Expertise Used:

Python (Scikit-learn), Protein Structure Prediction (PEP2D, AlphaFold2), Dimensionality Reduction (Ternary Plots), Protein Structure Benchmarking, Sequence-Structure Feature Engineering

N.B. These innovations now underpin Ingenie Bio’s “Bias-Aware AI Modeling” service offering, helping clients build more trustworthy tools for peptide and protein design.