AutoPeptideML: A study on how to build more trustworthy peptide bioactivity predictors
Abstract
Automated machine learning (AutoML) solutions can bridge the gap between new computational advances and their real-world applications by enabling experimental scientists to build trustworthy models. We considered the effect of different design choices in the development of binary predictors of peptide bioactivity. We found that the choice of negative peptides and the use of homology-based partitioning strategies when constructing the evaluation set both have a significant impact on perceived model performance, and that they provide a more realistic estimate of how the model performs when exposed to new data. We also show that using protein language models to generate peptide representations can both simplify the computational pipelines and improve model performance, and that state-of-the-art protein language models perform similarly regardless of size or architecture. Finally, we integrate these results into an easy-to-use AutoML tool that supports the development of new, robust predictive models for peptide bioactivity by biologists without strong machine learning expertise. Source code, documentation, and data are available at \url{https://github.com/IBM/AutoPeptideML}, and a dedicated web server is available at \url{http://peptide.ucd.ie/AutoPeptideML}.