Machine Learning Methods for Image-based Phenotypic Profiling in Early Drug Discovery
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). ORCID iD: 0000-0003-2920-8510
2024 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In the search for new therapeutic treatments, strategies to make the drug discovery process more efficient are crucial. Image-based phenotypic profiling, with its millions of images of fluorescently stained cells, is a rich and effective means to capture the morphological effects of potential treatments on living systems. Within this complex data await biological insights and new therapeutic opportunities, but computational tools are needed to unlock them.

This thesis examines the role of machine learning in improving the utility and analysis of phenotypic screening data. It focuses on challenges specific to this domain, such as the lack of reliable labels essential for supervised learning, and confounding factors in the data that are often unavoidable due to experimental variability. We explore transfer learning to boost model generalization and robustness, analyzing how domain distance, initialization, dataset size, and architecture affect the effectiveness of applying weights pre-trained on natural images to biomedical contexts.

Building upon this, we investigate self-supervised pretraining for phenotypic image data, but find that its direct application is inadequate in this context, as it fails to differentiate between various biological effects. To overcome this, we develop new self-supervised learning strategies designed to let the network disregard confounding experimental noise, enhancing its ability to discern the impacts of different treatments. We further develop a technique that allows a model trained for phenotypic profiling to be adapted to new, unseen data without labels or supervised learning; with this approach, a general phenotypic profiling model can be readily adapted to data from different sites.

Beyond these technical contributions, we also show that bioactive compounds identified using the approaches outlined in this thesis have subsequently been confirmed in biological assays through replication in an industrial setting. Our findings indicate that while phenotypic data and biomedical imaging present complex challenges, machine learning techniques can play a pivotal role in making early drug discovery more efficient and effective.
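Concretely, image-based phenotypic profiling as used throughout the thesis means turning images into feature vectors and aggregating them into one profile per treatment, so treatments can be compared by profile similarity. The sketch below illustrates only that aggregation-and-comparison step; the encoder outputs, array shapes, and cosine similarity measure are illustrative assumptions, not the thesis's exact pipeline.

```python
# Minimal phenotypic-profiling sketch: image embeddings -> per-treatment
# profiles, compared against a control. Encoder outputs and data layout
# are illustrative assumptions, not the thesis's exact pipeline.
import numpy as np

def profile(embeddings_per_image: np.ndarray) -> np.ndarray:
    # Aggregate image-level embeddings (N, D) into one treatment profile (D,).
    return embeddings_per_image.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# treatment_embeddings: {treatment: (N_images, D) array of encoder outputs}
treatment_embeddings = {
    "DMSO_control": np.random.randn(64, 128),
    "compound_A": np.random.randn(64, 128),
}

control = profile(treatment_embeddings["DMSO_control"])
for name, emb in treatment_embeddings.items():
    # Low similarity to the control profile suggests a morphological effect.
    print(name, cosine(profile(emb), control))
```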

Abstract [sv]

In the search for new medicines, strategies to make the drug discovery process more efficient are crucial. Image-based phenotypic profiling, with its millions of images of fluorescently stained cells, offers a rich and effective way to capture the morphological effects of potential treatments on living systems. Within such complex data, unknown biological insights can be identified and new drug treatments discovered, but analysis methods capable of extracting the information are required to discern them.

This thesis explores the role of machine learning in improving the usability and analysis of phenotypic data. It takes on challenges specific to this type of data, such as the lack of reliable annotations required for supervised learning, as well as confounding factors in the data that are often unavoidable due to experimental variation. We explore transfer learning to increase model generalization and robustness, and analyze how factors such as domain distance, initialization, dataset size, and model architecture affect the effectiveness of applying weights pre-trained on natural domains to biomedical ones.

We then delve into self-supervised learning for phenotypic image data, but find that its direct application is insufficient in this context, as it fails to distinguish between different biological effects. To address this, we develop new self-supervised learning strategies designed so that the model can ignore experimental noise, improving its ability to discern the effects of different treatments. We also develop a technique that allows a model trained for phenotypic profiling to be adapted to new data from an unseen source without any annotations or supervised learning. With this method, a general phenotypic profiling model can easily be adapted to data from different sources without annotations.

Beyond our technical contributions, we also show that bioactive compounds identified with the methods in this thesis have been confirmed experimentally. Our results suggest that although phenotypic data and biomedical image data pose complex challenges, machine learning can play a decisive role in making the early phase of drug discovery more efficient.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2024, p. 79
Series
TRITA-EECS-AVL ; 2024:53
Keywords [en]
Phenotypic Profiling, Drug Discovery, Biomedical Imaging
Keywords [sv]
Phenotypic profiling, drug discovery, biomedical imaging
National Category
Computer graphics and computer vision
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-346574
ISBN: 978-91-8040-954-4 (print)
OAI: oai:DiVA.org:kth-346574
DiVA, id: diva2:1858989
Public defence
2024-06-12, https://kth-se.zoom.us/j/67796518372, D3, Lindstedtsvägen 9, Stockholm, 14:00 (English)
Note

QC 20240520

Available from: 2024-05-20. Created: 2024-05-20. Last updated: 2025-02-07. Bibliographically approved.
List of papers
1. What Makes Transfer Learning Work for Medical Images: Feature Reuse & Other Factors
2022 (English) In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Institute of Electrical and Electronics Engineers (IEEE), 2022, p. 9215-9224. Conference paper, Published paper (Refereed)
Abstract [en]

Transfer learning is a standard technique to transfer knowledge from one domain to another. For applications in medical imaging, transfer from ImageNet has become the de facto approach, despite differences in the tasks and image characteristics between the domains. However, it is unclear what factors determine whether, and to what extent, transfer learning to the medical domain is useful. The longstanding assumption that features from the source domain get reused has recently been called into question. Through a series of experiments on several medical image benchmark datasets, we explore the relationship between transfer learning and data size, the capacity and inductive bias of the model, and the distance between the source and target domains. Our findings suggest that transfer learning is beneficial in most cases, and we characterize the important role feature reuse plays in its success.
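The comparison at the heart of the study, identical architectures initialized either with ImageNet-pretrained weights or at random, is straightforward to set up in code. The sketch below uses PyTorch/torchvision as an assumed framework (the record does not name one); the class count and optimizer settings are placeholders.

```python
# Minimal transfer-learning sketch: ImageNet initialization vs. random
# initialization of the same architecture. Framework (PyTorch/torchvision)
# and all hyperparameters are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # placeholder: class count of the medical target dataset

def build_model(pretrained: bool) -> nn.Module:
    weights = models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
    model = models.resnet50(weights=weights)
    # Replace the 1000-way ImageNet head with a task-specific classifier.
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
    return model

# Transfer setup: start from ImageNet features and fine-tune everything.
transfer_model = build_model(pretrained=True)
# Baseline: identical architecture trained from scratch.
scratch_model = build_model(pretrained=False)

optimizer = torch.optim.SGD(transfer_model.parameters(), lr=1e-3, momentum=0.9)
```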

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2022
Series
IEEE Conference on Computer Vision and Pattern Recognition, ISSN 1063-6919
National Category
Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-322794 (URN)
10.1109/CVPR52688.2022.00901 (DOI)
000870759102028 ()
2-s2.0-85137378486 (Scopus ID)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 18-24, 2022, New Orleans, LA
Note

Part of proceedings ISBN 978-1-6654-6946-3

QC 20230131

Available from: 2023-01-31. Created: 2023-01-31. Last updated: 2024-05-20. Bibliographically approved.
2. Cell Painting-based bioactivity prediction boosts high-throughput screening hit-rates and compound diversity
2024 (English) In: Nature Communications, E-ISSN 2041-1723, Vol. 15, no. 1, article id 3470. Article in journal (Refereed), Published
Abstract [en]

Identifying active compounds for a target is a time- and resource-intensive task in early drug discovery. Accurate bioactivity prediction using morphological profiles could streamline the process, enabling smaller, more focused compound screens. We investigate the potential of deep learning on unrefined single-concentration activity readouts and Cell Painting data to predict compound activity across 140 diverse assays. We observe an average ROC-AUC of 0.744 ± 0.108, with 62% of assays achieving ≥0.7, 30% ≥0.8, and 7% ≥0.9. In many cases, high prediction performance can be achieved using only brightfield images instead of multichannel fluorescence images. A comprehensive analysis shows that Cell Painting-based bioactivity prediction is robust across assay types, technologies, and target classes, with cell-based assays and kinase targets being particularly well suited for prediction. Experimental validation confirms the enrichment of active compounds. Our findings indicate that models trained on Cell Painting data, combined with a small set of single-concentration data points, can reliably predict the activity of a compound library across diverse targets and assays while maintaining high hit rates and scaffold diversity. This approach has the potential to reduce the size of screening campaigns, saving time and resources, and enabling primary screening with more complex assays.
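The summary statistics quoted above (mean ± std ROC-AUC and the share of assays clearing 0.7, 0.8, and 0.9) follow mechanically from per-assay predictions. The sketch below shows that computation with scikit-learn; the assay_results layout and values are illustrative assumptions.

```python
# Per-assay ROC-AUC summary, mirroring the statistics reported in the
# abstract (mean +/- std, share of assays above 0.7/0.8/0.9).
# The `assay_results` structure and values are illustrative assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score

# assay_results: {assay_id: (y_true, y_score)} with binary activity labels
# and model-predicted activity scores for the compounds tested in that assay.
assay_results = {
    "assay_001": (np.array([0, 1, 1, 0]), np.array([0.2, 0.8, 0.6, 0.3])),
    # ... one entry per assay (140 in the paper)
}

aucs = np.array([
    roc_auc_score(y_true, y_score)
    for y_true, y_score in assay_results.values()
])

print(f"ROC-AUC: {aucs.mean():.3f} +/- {aucs.std():.3f}")
for threshold in (0.7, 0.8, 0.9):
    share = (aucs >= threshold).mean()
    print(f"assays with ROC-AUC >= {threshold}: {share:.0%}")
```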

Place, publisher, year, edition, pages
Springer Nature, 2024
National Category
Biological Sciences
Identifiers
urn:nbn:se:kth:diva-346401 (URN)
10.1038/s41467-024-47171-1 (DOI)
38658534 (PubMedID)
2-s2.0-85191297869 (Scopus ID)
Note

QC 20240516

Available from: 2024-05-14. Created: 2024-05-14. Last updated: 2024-05-20. Bibliographically approved.
3. Metadata-guided Consistency Learning for High Content Images
2023 (English) In: Proceedings of Machine Learning Research, Volume 227: Medical Imaging with Deep Learning, ML Research Press, 2023. Conference paper, Published paper (Refereed)
Abstract [en]

High content imaging assays can capture rich phenotypic response data for large sets of compound treatments, aiding in the characterization and discovery of novel drugs. However, extracting representative features from high content images that capture subtle nuances in phenotypes remains challenging. The lack of high-quality labels makes it difficult to achieve satisfactory results with supervised deep learning. Self-supervised learning methods have shown great success on natural images and offer an attractive alternative for microscopy images as well. However, we find that self-supervised learning techniques underperform on high content imaging assays. One challenge is the undesirable domain shifts present in the data, known as batch effects, which are caused by biological noise or uncontrolled experimental conditions. To this end, we introduce Cross-Domain Consistency Learning (CDCL), a self-supervised approach that is able to learn in the presence of batch effects. CDCL enforces the learning of biological similarities while disregarding undesirable batch-specific signals, leading to more useful and versatile representations. The resulting features are organised according to morphological changes and are more useful for downstream tasks, such as distinguishing treatments and mechanisms of action.
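The abstract states the core mechanism (enforce biological similarity across batches while disregarding batch-specific signals) but not the exact loss. The sketch below expresses that idea as a generic cross-batch consistency term; the treatment-pairing scheme and cosine objective are assumptions for illustration, not the paper's exact formulation.

```python
# Generic cross-batch consistency term: embeddings of the same treatment,
# imaged in two different experimental batches, are pulled together.
# This illustrates the idea in the abstract; the actual CDCL objective
# may differ.
import torch
import torch.nn.functional as F

def cross_batch_consistency_loss(
    z_batch_a: torch.Tensor,  # embeddings of treatments from batch A, (N, D)
    z_batch_b: torch.Tensor,  # embeddings of the SAME treatments from batch B
) -> torch.Tensor:
    # Normalizing removes scale differences; maximizing cosine similarity
    # across batches means the only signal the encoder can keep is the one
    # shared by both views, i.e. the treatment effect, not the batch effect.
    z_a = F.normalize(z_batch_a, dim=1)
    z_b = F.normalize(z_batch_b, dim=1)
    return (1.0 - (z_a * z_b).sum(dim=1)).mean()

# Usage: sample the same compounds from two batches, embed both, and add
# this term to the self-supervised training loss.
```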

Place, publisher, year, edition, pages
ML Research Press, 2023
National Category
Computer and Information Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-346566 (URN)
001221108600055 ()
2-s2.0-85189329755 (Scopus ID)
Conference
6th International Conference on Medical Imaging with Deep Learning (MIDL 2023), Nashville, United States of America, July 10-12, 2023
Note

QC 20240521

Available from: 2024-05-17. Created: 2024-05-17. Last updated: 2025-02-27. Bibliographically approved.
4. Bridging Generalization Gaps in High Content Imaging Through Online Self-Supervised Domain Adaptation
2024 (English) In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2024, p. 7723-7732. Conference paper, Published paper (Refereed)
Abstract [en]

High Content Imaging (HCI) plays a vital role in modern drug discovery and development pipelines, facilitating various stages from hit identification to candidate drug characterization. Applying machine learning models to these datasets can prove challenging, as they typically consist of multiple batches affected by experimental variation, especially if different imaging equipment has been used. Moreover, as new data arrive, it is preferable that they are analyzed in an online fashion. To overcome this, we propose CODA, an online self-supervised domain adaptation approach. CODA divides the classifier's role into a generic feature extractor and a task-specific model. We adapt the feature extractor's weights to the new domain using cross-batch self-supervision while keeping the task-specific model unchanged. Our results demonstrate that this strategy significantly reduces the generalization gap, achieving up to a 300% improvement when applied to data from different labs utilizing different microscopes. CODA can be applied to new, unlabeled out-of-domain data sources of different sizes, from a single plate to multiple experimental batches.
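The division of labor the abstract describes, a frozen task-specific head on top of a feature extractor that is adapted with self-supervision on unlabeled target data, can be sketched as a short adaptation loop. The loop below is a hedged illustration: the self_supervised_loss callable, model objects, and training details (optimizer, step count) are placeholders, not the published CODA procedure.

```python
# CODA-style adaptation loop as described in the abstract: the task-specific
# head stays frozen while the feature extractor is updated on unlabeled
# target-domain images via a self-supervised loss. All concrete components
# here are placeholders.
import torch

def adapt_to_new_domain(feature_extractor, task_head, target_loader,
                        self_supervised_loss, steps=1000, lr=1e-4):
    # The task head was trained on the source domain and encodes the
    # profiling task itself; it is not updated during adaptation.
    for p in task_head.parameters():
        p.requires_grad = False

    optimizer = torch.optim.Adam(feature_extractor.parameters(), lr=lr)
    data_iter = iter(target_loader)
    for _ in range(steps):
        try:
            images = next(data_iter)
        except StopIteration:  # restart the loader when exhausted
            data_iter = iter(target_loader)
            images = next(data_iter)
        loss = self_supervised_loss(feature_extractor, images)  # no labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Predictions on the new domain reuse the unchanged task head.
    return lambda x: task_head(feature_extractor(x))
```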

National Category
Computer graphics and computer vision
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-346570 (URN)
10.1109/WACV57701.2024.00756 (DOI)
2-s2.0-85192009362 (Scopus ID)
Conference
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, January 3-8, 2024
Note

QC 20240522

Available from: 2024-05-17. Created: 2024-05-17. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

Kappa (11578 kB), 561 downloads
File information
File name: SUMMARY01.pdf
File size: 11578 kB
Checksum (SHA-512): 34c39bcf4cdf1c3c10b4b65324a2e53b07a5267b2e9c85b52a8866f353e33e04e79e8ab307142a0abb2dec898d80906d431f51d925cf6b676a22a541fd321dad
Type: summary
Mimetype: application/pdf

Authority records

Fredin Haslum, Johan
