Handling Sparsity with Random Forests when Predicting Adverse Drug Events from Electronic Health Records
Stockholms universitet, Institutionen för data- och systemvetenskap.
Stockholms universitet, Institutionen för data- och systemvetenskap.
2014 (English). In: IEEE International Conference on Healthcare Informatics (ICHI): Proceedings, IEEE Computer Society, 2014, p. 17-22. Conference paper, Published paper (Refereed)
Abstract [en]

When electronic health record (EHR) data are used to build models for predicting adverse drug events (ADEs), one typically faces the problem of data sparsity, i.e., the drugs and diagnosis codes that could be used for predicting a certain ADE are absent for most observations. For such tasks, the ability of the employed machine learning technique to handle sparsity effectively is crucial. The state-of-the-art random forest algorithm is frequently employed for this type of data. It has, however, recently been demonstrated that the algorithm is biased towards the majority class, which may result in low predictive performance on EHR data with large numbers of sparse features. In this study, approaches to handling this problem are empirically evaluated using 14 ADE datasets and three performance metrics: F1-score, AUC and Brier score. Two resampling-based techniques are investigated and compared to two baseline approaches. The experimental results indicate that, for larger forests, the resampling methods outperform the baseline approaches with respect to F1-score, which is consistent with that metric being affected by class bias. The approaches perform at a similar level with respect to AUC, which can be explained by that metric not being sensitive to class bias. Finally, when considering the squared error (Brier score) of individual predictions, one of the baseline approaches turns out to be ahead of the others. A bias-variance analysis shows that this is an effect of the individual trees being more correct on average for the baseline approach, and that this outweighs the expected loss from a lower variance. The main conclusion is that the suggested choice of approach to handling sparsity is highly dependent on the performance metric, or the task, of interest. If the task is to accurately assign an ADE to a patient record, a sampling-based approach is recommended. If the task is to rank patients according to their risk of a certain ADE, the choice of approach is of minor importance. Finally, if the task is to accurately assign probabilities for a certain ADE, then one of the baseline approaches is recommended.
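
The way the three metrics respond differently to class bias can be illustrated with a small sketch. It is not taken from the paper: the labels, the two sets of predicted probabilities, the 0.5 decision threshold, and all function names are assumptions chosen for illustration. The first set of probabilities ranks every positive above every negative but stays below the threshold (a majority-class bias); the second shifts all probabilities upward by the same amount.

```python
# Toy illustration (hypothetical data, not from the paper) of how F1-score,
# AUC and Brier score react to predictions biased towards the majority class.

def f1_score(y_true, y_prob, threshold=0.5):
    """F1 of the positive class after thresholding the probabilities."""
    pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, y_prob):
    """Probability that a random positive is ranked above a random negative."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared error of the predicted probabilities (lower is better)."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

# Assumed toy data: 8 negatives, 2 positives (an imbalanced problem).
y_true = [0] * 8 + [1] * 2

# Majority-biased scores: correct ranking, but positives stay below 0.5.
biased = [0.1] * 8 + [0.4] * 2
# Same ranking with every probability shifted upward by 0.2.
shifted = [0.3] * 8 + [0.6] * 2

for name, probs in [("biased", biased), ("shifted", shifted)]:
    print(name,
          "F1=%.3f" % f1_score(y_true, probs),
          "AUC=%.3f" % auc(y_true, probs),
          "Brier=%.3f" % brier_score(y_true, probs))
```

On this toy data the shift raises F1 from 0 to 1, leaves AUC unchanged because the ranking is the same, and increases the Brier score from 0.080 to 0.104. This only shows that the three metrics can disagree about the same predictions; it says nothing about the magnitudes reported in the study.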

Place, publisher, year, edition, pages
IEEE Computer Society, 2014. p. 17-22
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:kth:diva-221551
DOI: 10.1109/ICHI.2014.10
ISI: 000380410000003
Scopus ID: 2-s2.0-84949924850
ISBN: 978-1-4799-5701-9 (electronic)
OAI: oai:DiVA.org:kth-221551
DiVA, id: diva2:1175255
Conference
IEEE International Conference on Healthcare Informatics, Verona, Italy, 15-17 September 2014
Note

QC 20180123

Available from: 2014-12-19. Created: 2018-01-17. Last updated: 2018-01-23. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

By author/editor
Boström, Henrik