Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates
2025 (English)
In: Journal of the American Chemical Society, ISSN 0002-7863, E-ISSN 1520-5126, Vol. 147, no 9, p. 7476-7484
Article in journal (Refereed) Published
Abstract [en]
The development of machine learning models to predict the regioselectivity of C(sp³)-H functionalization reactions is reported. A data set for dioxirane oxidations was curated from the literature and used to generate a model to predict the regioselectivity of C-H oxidation. To assess whether smaller, intentionally designed data sets could provide accuracy on complex targets, a series of acquisition functions were developed to select the most informative molecules for the specific target. Active learning-based acquisition functions that leverage predicted reactivity and model uncertainty were found to outperform those based on molecular and site similarity alone. The use of acquisition functions for data set elaboration significantly reduced the number of data points needed to perform accurate prediction, and it was found that smaller, machine-designed data sets can give accurate predictions when larger, randomly selected data sets fail. Finally, the workflow was experimentally validated on five complex substrates and shown to be applicable to predicting the regioselectivity of arene C-H radical borylation. These studies provide a quantitative alternative to the intuitive extrapolation from “model substrates” that is frequently used to estimate reactivity on complex molecules.
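The abstract states that the best-performing acquisition functions combine predicted reactivity with model uncertainty, but it does not give their form. The sketch below is only an illustration of that general idea, not the authors' method: it ranks candidate molecules by predicted reactivity plus a weighted uncertainty term (an upper-confidence-bound style score). The `Candidate` class, the `weight` parameter, the score formula, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate molecule for data set elaboration (hypothetical structure)."""
    name: str
    predicted_reactivity: float  # assumed: model-predicted reactivity on a 0..1 scale
    uncertainty: float           # assumed: model uncertainty, e.g. an ensemble spread


def acquisition_score(candidate, weight=1.0):
    """Assumed score: predicted reactivity plus weighted uncertainty."""
    return candidate.predicted_reactivity + weight * candidate.uncertainty


def select_batch(candidates, batch_size, weight=1.0):
    """Return the batch_size candidates with the highest acquisition score."""
    ranked = sorted(
        candidates,
        key=lambda c: acquisition_score(c, weight),
        reverse=True,
    )
    return ranked[:batch_size]


# Made-up pool of candidates, for illustration only.
pool = [
    Candidate("mol_a", 0.20, 0.05),
    Candidate("mol_b", 0.60, 0.30),
    Candidate("mol_c", 0.80, 0.02),
    Candidate("mol_d", 0.40, 0.45),
]

picked = [c.name for c in select_batch(pool, batch_size=2)]
print(picked)
```

Setting `weight=0` reduces the ranking to predicted reactivity alone, while larger values favor molecules the model is least sure about; how the published workflow balances the two is not specified in this record.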
Place, publisher, year, edition, pages
American Chemical Society (ACS), 2025. Vol. 147, no 9, p. 7476-7484
National Category
Bioinformatics and Computational Biology
Bioinformatics (Computational Biology)
Theoretical Chemistry
Identifiers
URN: urn:nbn:se:kth:diva-361456
DOI: 10.1021/jacs.4c15902
ISI: 001437834600001
PubMedID: 39982221
Scopus ID: 2-s2.0-86000184966
OAI: oai:DiVA.org:kth-361456
DiVA, id: diva2:1945886
Note
QC 20250324
2025-03-19, 2025-03-19, 2025-03-24
Bibliographically approved