
Tackling bias in machine learning for protein engineering

#Machine Learning #Algorithmic Bias #Protein Engineering

Context

Machine learning (ML) has become essential for accelerating peptide and protein engineering. However, many ML models suffer from hidden biases, including taxonomic overrepresentation, structural homogeneity, and incomplete sequence-function data, which compromise their generalisability and utility in drug discovery.


Client/Partner Type:

Academic research, with direct implications for biotech/biopharma R&D


Challenge

Despite high accuracy claims, most protein ML models are:

  • Trained on biased datasets (e.g., overrepresentation of α-helical antimicrobial peptides (AMPs) or of specific bacterial sources)

  • Poor at detecting out-of-distribution sequences

  • Not evaluated with realistic structural or evolutionary diversity


These shortcomings hinder discovery of novel, safe, and effective molecules.


Solutions

Across three integrated studies, we developed and validated methods to diagnose, benchmark, and mitigate bias in ML-guided peptide and protein modelling:


(1) Structural & Taxonomic Bias in Peptide Datasets

  • Analysed the structural diversity of 5,800+ AMPs using PEP2D secondary structure predictions

  • Visualised fold distributions via ternary plots (Helix-Strand-Coil)

  • Identified overrepresentation of α-helical peptides in models trained on GRAMPA and APD


Plisson, F. Overcoming the Challenges in Machine Learning-Guided Antimicrobial Peptide Design. Proceedings of the 36th European and the 12th International Peptide Symposium 2022, 207-210. ISBN: 9798987214008. [PDF]
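The helix-strand-coil analysis above can be illustrated with a minimal sketch. It assumes each peptide already has a three-state secondary-structure string using the letters H (helix), E (strand), and C (coil); the function names, the example peptides, and the "largest fraction wins" rule for assigning a class are hypothetical choices for illustration, not the published method. The three fractions sum to one, which is what lets each peptide be placed as a single point on a ternary plot.

```python
from collections import Counter
from typing import Dict, Tuple


def hsc_fractions(ss: str) -> Tuple[float, float, float]:
    """Return (helix, strand, coil) fractions of a 3-state structure string."""
    if not ss:
        raise ValueError("empty secondary-structure string")
    counts = Counter(ss.upper())
    n = len(ss)
    helix = counts["H"] / n
    strand = counts["E"] / n
    coil = 1.0 - helix - strand  # assumption: anything not H or E is coil
    return helix, strand, coil


def dominant_class(ss: str) -> str:
    """Label a peptide by its largest structural fraction (illustrative rule)."""
    helix, strand, coil = hsc_fractions(ss)
    fractions = {"helix": helix, "strand": strand, "coil": coil}
    return max(fractions, key=fractions.get)


def class_distribution(predictions: Dict[str, str]) -> Counter:
    """Count peptides per dominant class to expose structural imbalance."""
    return Counter(dominant_class(ss) for ss in predictions.values())


# Hypothetical predictions, not taken from any real dataset.
example_predictions = {
    "pep1": "CHHHHHHHHHHC",
    "pep2": "CEEEECCEEEEC",
    "pep3": "CCCCCCCCHHHC",
}
print(class_distribution(example_predictions))
```

A strongly skewed count (for instance, most peptides labelled "helix") would be one simple signal of the structural bias described in this study.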


(2) Rapid Estimation of the Peptide/Protein Structural Landscape with Existing Protein Structure Prediction (PSP) Tools

  • Evaluated the performance of four protein structure predictors: Jpred4, PEP2D, PSIPRED, and AlphaFold2 (AF2)

  • Compared accuracy across sequence lengths, disordered regions, and structural classes

  • Identified most robust predictors (e.g., AlphaFold2) for ML-ready structure annotation

  • Published reproducible benchmarking pipelines on GitHub [GitHub 1] [GitHub 2]


Aldas-Bulos, V. D. & Plisson, F.*. Benchmarking protein structure predictors to assist machine learning-guided peptide discovery. Digital Discovery 2023, 2, 981-993. [DOI]
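As a rough illustration of how predictors can be compared, the sketch below scores predicted secondary-structure strings against reference strings with Q3 accuracy (the fraction of residues assigned the correct three-state label). Q3 is a common metric for this task, but the source does not state which metrics the benchmark used, so treat it as an assumption. The function names and the toy data are hypothetical.

```python
from typing import Dict


def q3_accuracy(predicted: str, reference: str) -> float:
    """Fraction of residues whose predicted state matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference strings differ in length")
    if not reference:
        raise ValueError("empty reference string")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def mean_q3(predictions: Dict[str, str], references: Dict[str, str]) -> float:
    """Average Q3 over all peptides that one predictor has annotated."""
    if not predictions:
        raise ValueError("no predictions to score")
    scores = [q3_accuracy(ss, references[name]) for name, ss in predictions.items()]
    return sum(scores) / len(scores)


# Hypothetical reference annotations and predictor outputs.
references = {"pep1": "CHHHHHHC", "pep2": "CEEECEEC"}
predictor_outputs = {
    "predictor_A": {"pep1": "CHHHHHHC", "pep2": "CEEECCCC"},
    "predictor_B": {"pep1": "CCHHHHCC", "pep2": "CEEECEEC"},
}

for name, preds in predictor_outputs.items():
    print(name, round(mean_q3(preds, references), 3))
```

Grouping peptides by sequence length or structural class before averaging would give the per-category comparison described in the bullets above.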


(3) Integrating Structure Awareness into ML Modelling of Peptides & Proteins

  • Developed structure-aware pipelines that combine secondary structure, dimensionality reduction, and outlier detection

  • Mitigated structural bias in ML predictions with sampling methods


Aguilera-Puga, M. d. C. & Plisson, F.*. Structure-aware machine learning strategies for antimicrobial peptide discovery. Scientific Reports 2024, 14, 11995. [DOI] [GitHub]
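One simple way to picture bias mitigation by sampling is random undersampling, where every structural class is cut down to the size of the rarest one before training. The source does not specify which sampling methods were used, so this is a swapped-in illustration rather than the published approach; the function name, the class labels, and the peptide identifiers are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def undersample_by_class(
    items: Sequence[str], labels: Sequence[Hashable], seed: int = 0
) -> List[Tuple[str, Hashable]]:
    """Randomly keep the same number of items from every class."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    if not items:
        return []

    # Group items by their structural class label.
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[label].append(item)

    # Every class is reduced to the size of the smallest class.
    n_keep = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # fixed seed for reproducible subsets

    balanced = []
    for label in sorted(groups, key=str):
        for item in rng.sample(groups[label], n_keep):
            balanced.append((item, label))
    return balanced


# Hypothetical, helix-heavy toy dataset.
peptides = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
classes = ["helix", "helix", "helix", "helix", "strand", "strand", "coil"]
print(undersample_by_class(peptides, classes))
```

Undersampling discards data, so on small peptide datasets it trades training-set size for balance; oversampling or class weighting are alternatives with the opposite trade-off.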


Outcomes

  • Revealed and quantified algorithmic biases in widely used peptide ML tools

  • Provided benchmarking standards and best practices for building more generalisable models

  • Proposed the integration of structure-aware features and outlier detection into workflows

  • Published three peer-reviewed articles, including in Scientific Reports and Digital Discovery

  • Delivered open-source software for peptide structure validation and ML readiness


Tools/Expertise Used:

Python (scikit-learn), Protein Structure Prediction (PEP2D, AlphaFold2), Dimensionality Reduction (Ternary Plots), Protein Structure Benchmarking, Sequence-Structure Feature Engineering


N.B. These innovations now underpin Ingenie Bio’s “Bias-Aware AI Modeling” service offering, helping clients build more trustworthy tools for peptide and protein design.
