Search results 1 - 50 of 117
  • 1.
    Ahlberg, Ernst
    et al.
    Predictive Compound ADME & Safety, Drug Safety & Metabolism, AstraZeneca IMED Biotech Unit, Mölndal, Sweden.
    Winiwarter, Susanne
    Predictive Compound ADME & Safety, Drug Safety & Metabolism, AstraZeneca IMED Biotech Unit, Mölndal, Sweden.
    Boström, Henrik
    Department of Computer and Systems Sciences, Stockholm University, Sweden.
    Linusson, Henrik
    Department of Information Technology, University of Borås, Sweden.
    Löfström, Tuve
    Högskolan i Jönköping, JTH. Forskningsmiljö Datavetenskap och informatik.
    Norinder, Ulf
    Swetox, Karolinska Institutet, Unit of Toxicology Sciences, Sweden.
    Johansson, Ulf
    Högskolan i Jönköping, JTH, Datateknik och informatik.
    Engkvist, Ola
    External Sciences, Discovery Sciences, AstraZeneca IMED Biotech Unit, Mölndal, Sweden.
    Hammar, Oscar
    Quantitative Biology, Discovery Sciences, AstraZeneca IMED Biotech Unit, Mölndal, Sweden.
    Bendtsen, Claus
    Quantitative Biology, Discovery Sciences, AstraZeneca IMED Biotech Unit, Cambridge, UK.
    Carlsson, Lars
    Quantitative Biology, Discovery Sciences, AstraZeneca IMED Biotech Unit, Mölndal, Sweden.
    Using conformal prediction to prioritize compound synthesis in drug discovery (2017) In: Proceedings of Machine Learning Research, Volume 60: Conformal and Probabilistic Prediction and Applications, 13-16 June 2017, Stockholm, Sweden / [ed] Alex Gammerman, Vladimir Vovk, Zhiyuan Luo, and Harris Papadopoulos, 2017, p. 174-184. Conference paper (Refereed)
    Abstract [en]

    The choice of how much money and resources to spend to understand certain problems is of high interest in many areas. This work illustrates how computational models can be more tightly coupled with experiments to generate decision data at lower cost without reducing the quality of the decision. Several different strategies are explored to illustrate the trade-off between lowering costs and maintaining decision quality.

    AUC is used as a performance metric, and the number of objects that can be learnt from is constrained. Some of the strategies described reach AUC values over 0.9 and outperform strategies that are more random. The strategies that use conformal predictor p-values show varying results, although some are top performing.

    The application studied is taken from the drug discovery process. In the early stages of this process, compounds that could potentially become marketed drugs are routinely tested in experimental assays to understand their distribution and interactions in humans.
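
    As an illustration of the p-values referred to in this abstract, a minimal split (inductive) conformal classifier in Python might look as follows. The nonconformity measure (one minus the predicted probability of the candidate label) and all names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def icp_p_values(model, X_cal, y_cal, X_test):
    """Split-conformal p-values for a fitted probabilistic classifier.

    Nonconformity score: 1 - predicted probability of the candidate label
    (an assumed, commonly used choice).
    """
    cal_probs = model.predict_proba(X_cal)
    # Nonconformity of each calibration example w.r.t. its true label
    cal_scores = 1.0 - cal_probs[np.arange(len(y_cal)), y_cal]
    test_probs = model.predict_proba(X_test)
    p_values = np.zeros_like(test_probs)
    for label in range(test_probs.shape[1]):
        test_scores = 1.0 - test_probs[:, label]
        # Fraction of calibration scores at least as nonconforming
        p_values[:, label] = (
            (cal_scores[None, :] >= test_scores[:, None]).sum(axis=1) + 1
        ) / (len(cal_scores) + 1)
    return p_values
```

    Compounds could then, for instance, be ranked by the p-value of the "active" class when deciding which ones to synthesize next.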

  • 2.
    Asker, Lars
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Karlsson, Isak
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Mining Candidates for Adverse Drug Interactions in Electronic Patient Records (2014) In: PETRA '14: Proceedings of the 7th International Conference on Pervasive Technologies Related to Assistive Environments, New York: ACM Press, 2014, article id 22. Conference paper (Refereed)
    Abstract [en]

    Electronic patient records provide a valuable source of information for detecting adverse drug events. In this paper, we explore two different but complementary approaches to extracting useful information from electronic patient records with the goal of identifying candidate drugs, or combinations of drugs, to be further investigated for suspected adverse drug events. We propose a novel filter-and-refine approach that combines sequential pattern mining and disproportionality analysis. The proposed method is expected to identify groups of possibly interacting drugs suspected of causing certain adverse drug events. We perform an empirical investigation of the proposed method using a subset of the Stockholm electronic patient record corpus. The data used in this study consists of all diagnoses and medications for a group of patients diagnosed with at least one heart-related diagnosis during the period 2008-2010. The study shows that the method is indeed able to detect combinations of drugs that occur more frequently for patients with cardiovascular diseases than for patients in a control group, providing opportunities for finding candidate drugs that cause adverse drug effects through interaction.

  • 3.
    Asker, Lars
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Persson, Hans
    Identifying Factors for the Effectiveness of Treatment of Heart Failure: A Registry Study (2016) In: IEEE 29th International Symposium on Computer-Based Medical Systems: CBMS 2016, IEEE Computer Society, 2016. Conference paper (Refereed)
    Abstract [en]

    An administrative health register containing health care data for over 2 million patients will be used to search for factors that can affect the treatment of heart failure. In the study, we will measure the effects of employed treatment for various groups of heart failure patients, using different measures of effectiveness. Significant deviations in effectiveness of treatments of the various patient groups will be reported, and factors that may help explain the effect of treatment will be analyzed. Identification of the most important factors that may help explain the observed deviations between the different groups will be derived through generation of predictive models, for which variable importance can be calculated. The findings may affect recommended treatments as well as highlight deviations from national guidelines.

  • 4.
    Asker, Lars
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Learning from Swedish Healthcare Data (2016) In: Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments, Association for Computing Machinery (ACM), 2016, Vol. 29, article id 47. Conference paper (Refereed)
    Abstract [en]

    We present two ongoing projects aimed at learning from health care records. The first project, DADEL, focuses on high-performance data mining for detecting adverse drug events in healthcare, and uses electronic patient records covering seven years of patient record data from the Stockholm region in Sweden. The second project focuses on heart failure and on understanding the differences in treatment between various groups of patients. It uses a Swedish administrative health register containing health care data for over two million patients.

  • 5.
    Boström, Henrik
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Calibrating Random Forests (2008) In: Proceedings of the Seventh International Conference on Machine Learning and Applications (ICMLA'08), IEEE Computer Society, 2008, p. 121-126, article id 4724964. Conference paper (Refereed)
    Abstract [en]

     When using the output of classifiers to calculate the expected utility of different alternatives in decision situations, the correctness of predicted class probabilities may be of crucial importance. However, even very accurate classifiers may output class probabilities of rather poor quality. One way of overcoming this problem is by means of calibration, i.e., mapping the original class probabilities to more accurate ones. Previous studies have however indicated that random forests are difficult to calibrate by standard calibration methods. In this work, a novel calibration method is introduced, which is based on a recent finding that probabilities predicted by forests of classification trees have a lower squared error compared to those predicted by forests of probability estimation trees (PETs). The novel calibration method is compared to the two standard methods, Platt scaling and isotonic regression, on 34 datasets from the UCI repository. The experiment shows that random forests of PETs calibrated by the novel method significantly outperform uncalibrated random forests of both PETs and classification trees, as well as random forests calibrated with the two standard methods, with respect to the squared error of predicted class probabilities.
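
    A hedged sketch of the two standard calibration methods this abstract compares against (Platt scaling and isotonic regression), using scikit-learn; the paper's own novel calibration method is not reproduced here, and the dataset is synthetic.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for method in ("sigmoid", "isotonic"):  # Platt scaling / isotonic regression
    calibrated = CalibratedClassifierCV(
        RandomForestClassifier(n_estimators=100, random_state=0),
        method=method, cv=5)
    calibrated.fit(X_tr, y_tr)
    probs = calibrated.predict_proba(X_te)[:, 1]
    # Squared error of predicted class probabilities (Brier score)
    print(method, round(brier_score_loss(y_te, probs), 4))
```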

  • 6.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Concurrent Learning of Large-Scale Random Forests (2011) In: Scandinavian Conference on Artificial Intelligence, IOS Press, 2011. Conference paper (Refereed)
    Abstract [en]

    The random forest algorithm belongs to the class of ensemble learning methods that are embarrassingly parallel, i.e., the learning task can be straightforwardly divided into subtasks that can be solved independently by concurrent processes. A parallel version of the random forest algorithm has been implemented in Erlang, a concurrent programming language originally developed for telecommunication applications. The implementation can be used for generating very large forests, or handling very large datasets, in a reasonable time frame. This allows for investigating potential gains in predictive performance from generating large-scale forests. An empirical investigation on 34 datasets from the UCI repository shows that forests of 1000 trees significantly outperform forests of 100 trees with respect to accuracy, area under ROC curve (AUC) and Brier score. However, increasing the forest sizes to 10 000 or 100 000 trees does not give any further significant performance gains.
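
    The paper's implementation is in Erlang; purely as an illustration of the embarrassingly parallel structure, a rough Python analogue using a process pool might look like this (a sketch under assumed defaults, not the implementation described above).

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def _fit_tree(args):
    X, y, seed = args
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), len(X))  # bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    return tree.fit(X[idx], y[idx])

def fit_forest(X, y, n_trees=1000, workers=8):
    # Each tree is grown independently, so trees map cleanly onto processes.
    # On platforms that spawn processes, call this under
    # `if __name__ == "__main__":`.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_fit_tree, [(X, y, s) for s in range(n_trees)]))

def forest_predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])
    # Unweighted majority vote over the trees (labels assumed 0..k-1)
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```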

  • 7.
    Boström, Henrik
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Estimating class probabilities in random forests (2007) In: Proceedings - 6th International Conference on Machine Learning and Applications, ICMLA 2007, IEEE Computer Society, 2007, p. 211-216. Conference paper (Refereed)
    Abstract [en]

    For both single probability estimation trees (PETs) and ensembles of such trees, commonly employed class probability estimates correct the observed relative class frequencies in each leaf to avoid anomalies caused by small sample sizes. The effect of such corrections in random forests of PETs is investigated, and the use of the relative class frequency is compared to using two corrected estimates, the Laplace estimate and the m-estimate. An experiment with 34 datasets from the UCI repository shows that estimating class probabilities using relative class frequency clearly outperforms both using the Laplace estimate and the m-estimate with respect to accuracy, area under the ROC curve (AUC) and Brier score. Hence, in contrast to what is commonly employed for PETs and ensembles of PETs, these results strongly suggest that a non-corrected probability estimate should be used in random forests of PETs. The experiment further shows that learning random forests of PETs using relative class frequency significantly outperforms learning random forests of classification trees (i.e., trees for which only an unweighted vote on the most probable class is counted) with respect to both accuracy and AUC, but that the latter is clearly ahead of the former with respect to Brier score.
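
    For reference, the three leaf-probability estimates compared in this abstract can be written out as follows (standard textbook definitions, not code from the paper).

```python
def relative_frequency(n_c, n):
    """Uncorrected estimate: fraction of leaf examples belonging to class c."""
    return n_c / n

def laplace(n_c, n, k):
    """Laplace estimate: add one pseudo-count per class (k classes)."""
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, prior_c, m):
    """m-estimate: shrink the leaf frequency toward the class prior."""
    return (n_c + m * prior_c) / (n + m)

# A leaf with 3 of 4 examples in class c, two classes, prior 0.5, m = 2:
# relative_frequency(3, 4) = 0.75
# laplace(3, 4, 2) = m_estimate(3, 4, 0.5, 2) = 2/3
```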

  • 8.
    Boström, Henrik
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Feature vs. classifier fusion for predictive data mining - A case study in pesticide classification (2007) In: FUSION 2007 - 2007 10th International Conference on Information Fusion, Institute of Electrical and Electronics Engineers (IEEE), 2007, p. 1-7, article id 4408024. Conference paper (Refereed)
    Abstract [en]

    Two strategies for fusing information from multiple sources when generating predictive models in the domain of pesticide classification are investigated: i) fusing different sets of features (molecular descriptors) before building a model and ii) fusing the classifiers built from the individual descriptor sets. An empirical investigation demonstrates that the choice of strategy can have a significant impact on the predictive performance. Furthermore, the experiment shows that the best strategy is dependent on the type of predictive model considered. When generating a decision tree for pesticide classification, a statistically significant difference in accuracy is observed in favor of combining predictions from the individual models compared to generating a single model from the fused set of molecular descriptors. On the other hand, when the model consists of an ensemble of decision trees, a statistically significant difference in accuracy is observed in favor of building the model from the fused set of descriptors compared to fusing ensemble models built from the individual sources.

  • 9.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Forests of probability estimation trees (2012) In: International Journal of Pattern Recognition and Artificial Intelligence, ISSN 0218-0014, Vol. 26, no 2, article id 1251001. Article in journal (Refereed)
    Abstract [en]

    Probability estimation trees (PETs) generalize classification trees in that they assign class probability distributions instead of class labels to examples that are to be classified. This property has been demonstrated to allow PETs to outperform classification trees with respect to ranking performance, as measured by the area under the ROC curve (AUC). It has further been shown that the use of probability correction improves the performance of PETs. This has led to the use of probability correction also in forests of PETs. However, it was recently observed that probability correction may in fact deteriorate performance of forests of PETs. A more detailed study of the phenomenon is presented and the reasons behind this observation are analyzed. An empirical investigation is presented, comparing forests of classification trees to forests of both corrected and uncorrected PETs on 34 data sets from the UCI repository. The experiment shows that a small forest (10 trees) of probability corrected PETs gives a higher AUC than a similar-sized forest of classification trees, hence providing evidence in favor of using forests of probability corrected PETs. However, the picture changes when increasing the forest size, as the AUC is no longer improved by probability correction. For accuracy and squared error of predicted class probabilities (Brier score), probability correction even leads to a negative effect. An analysis of the mean squared error of the trees in the forests and their variance shows that although probability correction results in trees that are more correct on average, the variance is reduced at the same time, leading to an overall loss of performance for larger forests. The main conclusions are that probability correction should only be employed in small forests of PETs, and that for larger forests, classification trees and PETs are equally good alternatives.

  • 10.
    Boström, Henrik
    KTH, Superseded Departments (pre-2005), Computer and Systems Sciences, DSV. Stockholms universitet, Institutionen för data- och systemvetenskap.
    Maximizing the Area under the ROC Curve using Incremental Reduced Error Pruning (2005) In: Proceedings of the ICML 2005 Workshop on ROC Analysis in Machine Learning, Bonn: ACM Press, 2005. Conference paper (Refereed)
  • 11.
    Boström, Henrik
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Maximizing the Area under the ROC Curve with Decision Lists and Rule Sets (2007) In: Proceedings of the 7th SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2007, p. 27-34. Conference paper (Refereed)
    Abstract [en]

    Decision lists (or ordered rule sets) have two attractive properties compared to unordered rule sets: they require a simpler classification procedure and they allow for a more compact representation. However, it is an open question what effect these properties have on the area under the ROC curve (AUC). Two ways of forming decision lists are considered in this study: by generating a sequence of rules, with a default rule for one of the classes, and by imposing an order upon rules that have been generated for all classes. An empirical investigation shows that the latter method gives a significantly higher AUC than the former, demonstrating that the compactness obtained by using one of the classes as a default is indeed associated with a cost. Furthermore, by using all applicable rules rather than the first in an ordered set, an even further significant improvement in AUC is obtained, demonstrating that the simple classification procedure is also associated with a cost. The observed gains in AUC for unordered rule sets compared to decision lists can be explained by the fact that learning rules for all classes, as well as combining multiple rules, allows examples to be ranked on a more fine-grained scale than when applying rules in a fixed order and providing a default rule for one of the classes.

  • 12.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Method for efficiently checking coverage of rules derived from a logical theory (2003) Patent (Other (popular science, discussion, etc.))
    Abstract [en]

    The method is used in a computer and includes the steps of providing a logical theory (12, 30) that has clauses. A rule (14) is generated that is a resolvent of clauses in the logical theory. An example (16) is retrieved. A proof tree (18, 40) is generated from the example (16) using the logical theory (12, 30). The proof tree (18, 40) is transformed into a database (20, 42) of a coverage check apparatus (28). The rule (14) is converted into a partial proof tree (60) that has nodes (62, 54, 66). The partial proof tree is transformed into a database query (22) of the coverage check apparatus (28). The query (22, 72) is executed to identify tuples in the database (20, 42) that correspond to the nodes of the partial proof tree.

  • 13.
    Boström, Henrik
    KTH, Superseded Departments (pre-2005), Computer and Systems Sciences, DSV. Stockholms universitet, Institutionen för data- och systemvetenskap.
    Pruning and Exclusion Criteria for Unordered Incremental Reduced Error Pruning (2004) In: Proceedings of the Workshop on Advances in Rule Learning at the 15th European Conference on Machine Learning, 2004. Conference paper (Refereed)
  • 14.
    Boström, Henrik
    et al.
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Andler, Sten F.
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Brohede, Marcus
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Johansson, Ronnie
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Karlsson, Alexander
    Högskolan i Skövde, Institutionen för kommunikation och information.
    van Laere, Joeri
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Niklasson, Lars
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Nilsson, Marie
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Persson, Anne
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Ziemke, Tom
    Högskolan i Skövde, Institutionen för kommunikation och information.
    On the Definition of Information Fusion as a Field of Research (2007) Report (Other academic)
    Abstract [en]

    A more precise definition of the field of information fusion can be of benefit to researchers within the field, who may use such a definition when motivating their own work and evaluating the contributions of others. Moreover, it can enable researchers and practitioners outside the field to more easily relate their own work to the field and more easily understand the scope of the techniques and methods developed in the field. Previous definitions of information fusion are reviewed from that perspective, including definitions of data and sensor fusion, and their appropriateness as definitions for the entire research field is discussed. Based on strengths and weaknesses of existing definitions, a novel definition is proposed, which is argued to effectively fulfill the requirements that can be put on a definition of information fusion as a field of research.

  • 15.
    Boström, Henrik
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    De-identifying health records by means of active learning (2012) In: ICML 2012 Workshop on Machine Learning for Clinical Data Analysis, 2012. Conference paper (Refereed)
    Abstract [en]

    An experiment on classifying words in Swedish health records as belonging to one of eight protected health information (PHI) classes, or to the non-PHI class, by means of active learning has been conducted, in which three selection strategies were evaluated in conjunction with random forests: the commonly employed approach of choosing the most uncertain examples, choosing randomly, and choosing the most certain examples. Surprisingly, random selection outperformed choosing the most uncertain examples with respect to ten considered performance metrics. Moreover, choosing the most certain examples outperformed random selection with respect to nine out of ten metrics.
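
    The three selection strategies can be sketched as follows; this is a minimal binary-classification sketch, and the margin-based uncertainty score is an assumption rather than the paper's exact criterion.

```python
import numpy as np

def select_batch(model, X_pool, batch_size, strategy="uncertain", rng=None):
    """Pick indices of unlabeled pool examples under three strategies."""
    rng = rng or np.random.default_rng()
    if strategy == "random":
        return rng.permutation(len(X_pool))[:batch_size]
    probs = model.predict_proba(X_pool)[:, 1]
    margin = np.abs(probs - 0.5)  # distance from the decision boundary
    if strategy == "uncertain":
        order = np.argsort(margin)   # least confident first
    else:                            # "certain": most confident first
        order = np.argsort(-margin)
    return order[:batch_size]
```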

  • 16.
    Boström, Henrik
    et al.
    KTH, School of Information and Communication Technology (ICT). Stockholms universitet, Institutionen för data- och systemvetenskap.
    Gurung, Ram Bahadur
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Asker, Lars
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Karlsson, Isak
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Lindgren, Tony
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Conformal prediction using random survival forests (2017) In: Proceedings - 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017, Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 812-817. Conference paper (Refereed)
    Abstract [en]

    Random survival forests constitute a robust approach to survival modeling, i.e., predicting the probability that an event will occur before or on a given point in time. As for most standard predictive models, no guarantee for the prediction error is provided for this model; instead, it is typically evaluated empirically. Conformal prediction is a rather recent framework, which allows the error of a model to be determined by a user-specified confidence level, something which is achieved by considering set rather than point predictions. The framework, which has been applied to some of the most popular classification and regression techniques, is here for the first time applied to survival modeling, through random survival forests. An empirical investigation is presented where the technique is evaluated on datasets from two real-world applications: predicting component failure in trucks using operational data, and predicting survival and treatment of heart failure patients from administrative healthcare data. The experimental results show that the error levels indeed are very close to the provided confidence levels, as guaranteed by the conformal prediction framework, and that the error for predicting each outcome, i.e., event or no-event, can be controlled separately. The latter may, however, lead to less informative predictions, i.e., larger prediction sets, in case the class distribution is heavily imbalanced.
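
    The per-outcome error control mentioned at the end of the abstract corresponds to label-conditional (Mondrian) calibration; a generic sketch, independent of the survival-forest model used in the paper:

```python
import numpy as np

def mondrian_p_values(cal_scores, cal_labels, test_scores):
    """Label-conditional p-values: calibrate each class on its own examples.

    cal_scores:  nonconformity scores of the calibration examples
    cal_labels:  their true labels (0..k-1)
    test_scores: (n_test, k) nonconformity of each test example per label
    """
    p = np.zeros_like(test_scores, dtype=float)
    for c in range(test_scores.shape[1]):
        s_c = cal_scores[cal_labels == c]  # class-c calibration scores only
        counts = (s_c[None, :] >= test_scores[:, c][:, None]).sum(axis=1)
        p[:, c] = (counts + 1) / (len(s_c) + 1)
    return p

# At confidence 1 - eps, the prediction set contains every label with p > eps,
# so the error rate is controlled separately within each class.
```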

  • 17.
    Boström, Henrik
    et al.
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Johansson, Ronnie
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Karlsson, Alexander
    Högskolan i Skövde, Institutionen för kommunikation och information.
    On Evidential Combination Rules for Ensemble Classifiers (2008) In: Proceedings of the 11th International Conference on Information Fusion, IEEE, 2008, p. 553-560, article id 4632259. Conference paper (Refereed)
    Abstract [en]

    Ensemble classifiers are known to generally perform better than each individual classifier of which they consist. One approach to classifier fusion is to apply Shafer’s theory of evidence. While most approaches have adopted Dempster’s rule of combination, a multitude of combination rules have been proposed. A number of combination rules as well as two voting rules are compared when used in conjunction with a specific kind of ensemble classifier, known as random forests, w.r.t. accuracy, area under ROC curve and Brier score on 27 datasets. The empirical evaluation shows that the choice of combination rule can have a significant impact on the performance for a single dataset, but in general the evidential combination rules do not perform better than the voting rules for this particular ensemble design. Furthermore, among the evidential rules, the associative ones appear to have better performance than the non-associative.
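
    Dempster's rule of combination, the baseline among the evidential rules discussed here, can be stated compactly (a standard textbook formulation, not the paper's code); masses are dictionaries mapping frozensets of class labels to belief mass.

```python
def dempster_combine(m1, m2):
    """Dempster's rule: multiply masses of intersecting focal sets and
    renormalize by the non-conflicting mass."""
    combined, conflict = {}, 0.0
    for a, w1 in m1.items():
        for b, w2 in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

# e.g. two classifiers over classes {0, 1}:
# m1 = {frozenset({0}): 0.7, frozenset({0, 1}): 0.3}
# m2 = {frozenset({1}): 0.4, frozenset({0, 1}): 0.6}
```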

  • 18.
    Boström, Henrik
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Linusson, Henrik
    Löfström, Tuve
    Johansson, Ulf
    Accelerating difficulty estimation for conformal regression forests (2017) In: Annals of Mathematics and Artificial Intelligence, ISSN 1012-2443, E-ISSN 1573-7470, Vol. 81, no 1-2, p. 125-144. Article in journal (Refereed)
    Abstract [en]

    The conformal prediction framework allows for specifying the probability of making incorrect predictions by a user-provided confidence level. In addition to a learning algorithm, the framework requires a real-valued function, called nonconformity measure, to be specified. The nonconformity measure does not affect the error rate, but the resulting efficiency, i.e., the size of output prediction regions, may vary substantially. A recent large-scale empirical evaluation of conformal regression approaches showed that using random forests as the learning algorithm together with a nonconformity measure based on out-of-bag errors normalized using a nearest-neighbor-based difficulty estimate, resulted in state-of-the-art performance with respect to efficiency. However, the nearest-neighbor procedure incurs a significant computational cost. In this study, a more straightforward nonconformity measure is investigated, where the difficulty estimate employed for normalization is based on the variance of the predictions made by the trees in a forest. A large-scale empirical evaluation is presented, showing that both the nearest-neighbor-based and the variance-based measures significantly outperform a standard (non-normalized) nonconformity measure, while no significant difference in efficiency between the two normalized approaches is observed. The evaluation moreover shows that the computational cost of the variance-based measure is several orders of magnitude lower than when employing the nearest-neighbor-based nonconformity measure. The use of out-of-bag instances for calibration does, however, result in nonconformity scores that are distributed differently from those obtained from test instances, questioning the validity of the approach. An adjustment of the variance-based measure is presented, which is shown to be valid and also to have a significant positive effect on the efficiency. For conformal regression forests, the variance-based nonconformity measure is hence a computationally efficient and theoretically well-founded alternative to the nearest-neighbor procedure.
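
    The variance-based nonconformity measure can be sketched as follows for a scikit-learn regression forest. The smoothing constant beta and the calibration details are assumptions; the paper's adjusted, out-of-bag-calibrated variant is not reproduced here.

```python
import numpy as np

def variance_nonconformity(forest, X, y, beta=0.01):
    """|y - y_hat| / (sigma + beta), where sigma is the standard deviation
    of the per-tree predictions -- the variance-based difficulty estimate."""
    per_tree = np.stack([t.predict(X) for t in forest.estimators_])
    y_hat, sigma = per_tree.mean(axis=0), per_tree.std(axis=0)
    return np.abs(y - y_hat) / (sigma + beta)

def prediction_interval(forest, x, q, beta=0.01):
    """Interval y_hat +/- q * (sigma + beta); q is the calibration-score
    quantile matching the chosen confidence level."""
    per_tree = np.array([t.predict(x.reshape(1, -1))[0]
                         for t in forest.estimators_])
    y_hat, sigma = per_tree.mean(), per_tree.std()
    return y_hat - q * (sigma + beta), y_hat + q * (sigma + beta)
```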

  • 19.
    Boström, Henrik
    et al.
    Department of Computer and Systems Sciences, Stockholm University, Kista, Sweden.
    Linusson, Henrik
    Department of Information Technology, University of Borås, Borås, Sweden.
    Löfström, Tuve
    Department of Information Technology, University of Borås, Borås, Sweden.
    Johansson, Ulf
    Högskolan i Jönköping, JTH, Datateknik och informatik.
    Evaluation of a variance-based nonconformity measure for regression forests (2016) In: 5th International Symposium on Conformal and Probabilistic Prediction with Applications, COPA 2016, Springer, 2016, Vol. 9653, p. 75-89. Conference paper (Refereed)
    Abstract [en]

    In a previous large-scale empirical evaluation of conformal regression approaches, random forests using out-of-bag instances for calibration together with a k-nearest-neighbor-based nonconformity measure were shown to obtain state-of-the-art performance with respect to efficiency, i.e., average size of prediction regions. However, the use of the nearest-neighbor procedure not only requires that all training data be retained in conjunction with the underlying model, but also incurs a significant computational overhead during both training and testing. In this study, a more straightforward nonconformity measure is investigated, where the difficulty estimate employed for normalization is based on the variance of the predictions made by the trees in a forest. A large-scale empirical evaluation is presented, showing that both the nearest-neighbor-based and the variance-based measures significantly outperform a standard (non-normalized) nonconformity measure, while no significant difference in efficiency between the two normalized approaches is observed. Moreover, the evaluation shows that state-of-the-art performance is achieved by the variance-based measure at a computational cost that is several orders of magnitude lower than when employing the nearest-neighbor-based nonconformity measure.

  • 20.
    Boström, Henrik
    et al.
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Norinder, Ulf
    Utilizing Information on Uncertainty for In Silico Modeling using Random Forests (2009) In: Proceedings of the 3rd Skövde Workshop on Information Fusion Topics (SWIFT 2009), University of Skövde, 2009, p. 59-62. Conference paper (Refereed)
    Abstract [en]

    Information on uncertainty of measurements or estimates of molecular properties is rarely utilized by in silico predictive models. In this study, different approaches to handling uncertain numerical features are explored when using the state-of-the-art random forest algorithm for generating predictive models. Two main approaches are considered: i) sampling from probability distributions prior to tree generation, which does not require any change to the underlying tree learning algorithm, and ii) adjusting the algorithm to allow for handling probability distributions, similar to how missing values typically are handled, i.e., partitions may include fractions of examples. An experiment with six datasets concerning the prediction of various chemical properties is presented, where 95% confidence intervals are included for one of the 92 numerical features. In total, five approaches to handling uncertain numeric features are compared: ignoring the uncertainty, sampling from distributions that are assumed to be uniform and normal respectively, and adjusting tree learning to handle probability distributions that are assumed to be uniform and normal respectively. The experimental results show that all approaches that utilize information on uncertainty indeed outperform the single approach that ignores it, both with respect to accuracy and area under ROC curve. A decomposition of the squared error of the constituent classification trees shows that the highest variance is obtained by ignoring the information on uncertainty, but that this also results in the highest mean squared error of the constituent trees.
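
    A minimal sketch of the sampling-based approach (i), assuming a 95% confidence interval is given for the uncertain feature; the normal case uses the fact that such an interval spans roughly plus or minus 1.96 standard deviations.

```python
import numpy as np

def sample_uncertain_feature(mid, ci_low, ci_high, dist="normal", rng=None):
    """Draw a value for an uncertain numeric feature prior to tree growth."""
    rng = rng or np.random.default_rng()
    if dist == "uniform":
        return rng.uniform(ci_low, ci_high)
    sigma = (ci_high - ci_low) / (2 * 1.96)  # 95% CI -> standard deviation
    return rng.normal(mid, sigma)
```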

  • 21.
    Carlsson, Lars
    et al.
    Ahlberg, Ernst
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Johansson, Ulf
    Linusson, Henrik
    Modifications to p-Values of Conformal Predictors (2015) In: Statistical Learning and Data Sciences: Third International Symposium, SLDS 2015, Egham, UK, April 20-23, 2015, Proceedings / [ed] Alexander Gammerman, Vladimir Vovk, Harris Papadopoulos, Springer, 2015, Vol. 9047, p. 251-259. Conference paper (Refereed)
    Abstract [en]

    The original definition of a p-value in a conformal predictor can sometimes lead to overly conservative prediction regions when the number of training or calibration examples is small. The situation can be improved by using a modification to define an approximate p-value. Two modified p-values are presented that converge to the original p-value as the number of training or calibration examples goes to infinity.

    Numerical experiments empirically support the use of a p-value we call the interpolated p-value for conformal prediction. The interpolated p-value seems to produce prediction sets whose error rate corresponds well to the prescribed significance level.
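
    The paper's interpolated p-value is not reproduced here; for contrast, the textbook smoothed p-value, another standard modification of the original definition, looks as follows.

```python
import numpy as np

def smoothed_p_value(cal_scores, test_score, rng=None):
    """Smoothed conformal p-value: ties are broken by a uniform tau,
    which makes the p-value exactly valid. (Standard definition; the
    paper's interpolated p-value is a different modification.)"""
    rng = rng or np.random.default_rng()
    greater = np.sum(cal_scores > test_score)
    ties = np.sum(cal_scores == test_score)
    return (greater + rng.uniform() * (ties + 1)) / (len(cal_scores) + 1)
```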

  • 22.
    Dalianis, Hercules
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Releasing a Swedish Clinical Corpus after Removing all Words - De-identification Experiments with Conditional Random Fields and Random Forests (2012) In: Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012), 2012, p. 45-48. Conference paper (Refereed)
    Abstract [en]

    Patient records contain valuable information in the form of both structured data and free text; however, this information is sensitive since it can reveal the identity of patients. In order to allow new methods and techniques to be developed and evaluated on real world clinical data without revealing such sensitive information, researchers could be given access to de-identified records without protected health information (PHI), such as names, telephone numbers, and so on. One approach to minimizing the risk of revealing PHI when releasing text corpora from such records is to include only features of the words instead of the words themselves. Such features may include parts of speech, word length, and so on, from which the sensitive information cannot be derived. In order to investigate what performance losses can be expected when replacing specific words with features, an experiment with two state-of-the-art machine learning methods, conditional random fields and random forests, is presented, comparing their ability to support de-identification, using the Stockholm EPR PHI corpus as a benchmark test. The results indicate severe performance losses when the actual words are removed, leading to the conclusion that the chosen features are not sufficient for the suggested approach to be viable.

  • 23.
    Deegalla, Sampath
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Classification of Microarrays with kNN: Comparison of Dimensionality Reduction Methods (2007) In: Intelligent Data Engineering and Automated Learning - IDEAL 2007 / [ed] Hujun Yin, Peter Tino, Emilio Corchado, Will Byrne, Xin Yao, Berlin, Heidelberg: Springer Verlag, 2007, p. 800-809. Conference paper (Refereed)
    Abstract [en]

    Dimensionality reduction can often improve the performance of the k-nearest neighbor classifier (kNN) for high-dimensional data sets, such as microarrays. The effect of the choice of dimensionality reduction method on the predictive performance of kNN for classifying microarray data is an open issue, and four common dimensionality reduction methods, Principal Component Analysis (PCA), Random Projection (RP), Partial Least Squares (PLS) and Information Gain (IG), are compared on eight microarray data sets. It is observed that all dimensionality reduction methods result in more accurate classifiers than what is obtained from using the raw attributes. Furthermore, it is observed that both PCA and PLS reach their best accuracies with fewer components than the other two methods, and that RP needs far more components than the others to outperform kNN on the non-reduced dataset. None of the dimensionality reduction methods can be concluded to generally outperform the others, although PLS is shown to be superior on all four binary classification tasks; the main conclusion from the study is that the choice of dimensionality reduction method can be of major importance when classifying microarrays using kNN.

  • 24.
    Deegalla, Sampath
    et al.
    Dept. of Computer and Systems Sciences, Stockholm University, Sweden.
    Boström, Henrik
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Fusion of Dimensionality Reduction Methods: a Case Study in Microarray Classification (2009) In: Proceedings of the 12th International Conference on Information Fusion, ISIF, 2009, p. 460-465, article id 5203771. Conference paper (Refereed)
    Abstract [en]

    Dimensionality reduction has been demonstrated to improve the performance of the k-nearest neighbor (kNN) classifier for high-dimensional data sets, such as microarrays. However, the effectiveness of different dimensionality reduction methods varies, and it has been shown that no single method consistently outperforms the others. In contrast to using a single method, two approaches to fusing the result of applying dimensionality reduction methods are investigated: feature fusion and classifier fusion. It is shown that by fusing the output of multiple dimensionality reduction techniques, either by fusing the reduced features or by fusing the output of the resulting classifiers, both higher accuracy and higher robustness towards the choice of number of dimensions are obtained.
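
    The two fusion schemes can be sketched as follows, with PCA and random projection standing in for the dimensionality reduction methods; the specific reducers and kNN settings are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.random_projection import GaussianRandomProjection

def feature_fusion(reducers, X_train, y_train, X_test):
    """Concatenate all reduced feature sets, then train a single kNN."""
    Ztr = np.hstack([r.fit_transform(X_train) for r in reducers])
    Zte = np.hstack([r.transform(X_test) for r in reducers])
    return KNeighborsClassifier().fit(Ztr, y_train).predict(Zte)

def classifier_fusion(reducers, X_train, y_train, X_test):
    """Train one kNN per reduced set and average their class probabilities."""
    classes = np.unique(y_train)
    probs = []
    for r in reducers:
        knn = KNeighborsClassifier().fit(r.fit_transform(X_train), y_train)
        probs.append(knn.predict_proba(r.transform(X_test)))
    return classes[np.mean(probs, axis=0).argmax(axis=1)]

# e.g. reducers = [PCA(n_components=10),
#                  GaussianRandomProjection(n_components=10)]
```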

  • 25.
    Deegalla, Sampath
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Improving Fusion of Dimensionality Reduction Methods for Nearest Neighbor Classification (2009) In: 8th International Conference on Machine Learning and Applications, ICMLA 2009, IEEE Computer Society, 2009, p. 771-775. Conference paper (Refereed)
    Abstract [en]

    In previous studies, performance improvement of nearest neighbor classification of high-dimensional data, such as microarrays, has been investigated using dimensionality reduction. It has been demonstrated that fusing the outputs of dimensionality reduction methods, either by fusing classifiers obtained from each set of reduced features, or by fusing all reduced features, is better than using any single dimensionality reduction method. However, none of the fusion methods consistently outperform the use of a single dimensionality reduction method. Therefore, a new way of fusing features and classifiers is proposed, which is based on searching for the optimal number of dimensions for each considered dimensionality reduction method. An empirical evaluation on microarray classification is presented, comparing classifier and feature fusion with and without the proposed method, in conjunction with three dimensionality reduction methods: Principal Component Analysis (PCA), Partial Least Squares (PLS) and Information Gain (IG). The new classifier fusion method outperforms the previous one in 4 out of 8 cases, and is on par with the best single dimensionality reduction method. The novel feature fusion method is however outperformed by the previous method, which selects the same number of features from each dimensionality reduction method. Hence, it is concluded that the idea of optimizing the number of features separately for each dimensionality reduction method can only be recommended for classifier fusion.

  • 26.
    Deegalla, Sampath
    et al.
    KTH, School of Information and Communication Technology (ICT), Computer and Systems Sciences, DSV.
    Boström, Henrik
    KTH, School of Information and Communication Technology (ICT), Computer and Systems Sciences, DSV.
    Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification (2006) In: Publications of the Finnish Artificial Intelligence Society, 2006, p. 23-30. Conference paper (Refereed)
    Abstract [en]

    The computational cost of using nearest neighbor classification often prevents the method from being applied in practice when dealing with high-dimensional data, such as images and microarrays. One possible solution to this problem is to reduce the dimensionality of the data, ideally without losing predictive performance. Two different dimensionality reduction methods, principal component analysis (PCA) and random projection (RP), are compared w.r.t. the performance of the resulting nearest neighbor classifier on five image data sets and two microarray data sets. The experimental results show that PCA results in higher accuracy than RP for all the data sets used in this study. However, it is also observed that RP generally outperforms PCA for higher numbers of dimensions. This leads to the conclusion that PCA is more suitable in time-critical cases (i.e., when distance calculations involving only a few dimensions can be afforded), while RP can be more suitable when less severe dimensionality reduction is required. In 6 and 4 cases out of 7, respectively, the use of PCA and RP even outperforms using the non-reduced feature set, hence not only resulting in more efficient, but also more effective, nearest neighbor classification.
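
    A runnable comparison in the spirit of this study, using scikit-learn and the digits images as a stand-in for the image/microarray data; the number of components is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.random_projection import GaussianRandomProjection

X, y = load_digits(return_X_y=True)  # 64-dimensional image data
for name, reducer in [
        ("PCA", PCA(n_components=20, random_state=0)),
        ("RP", GaussianRandomProjection(n_components=20, random_state=0))]:
    Xr = reducer.fit_transform(X)
    score = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                            Xr, y, cv=5).mean()
    print(name, round(score, 3))
```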

  • 27.
    Deegalla, Sampath
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Walgama, Keerthi
    Choice of Dimensionality Reduction Methods for Feature and Classifier Fusion with Nearest Neighbor Classifiers (2012) In: 15th International Conference on Information Fusion, IEEE Computer Society, 2012, p. 875-881, article id 6289894. Conference paper (Refereed)
    Abstract [en]

    High-dimensional data often cause problems for currently used learning algorithms in terms of efficiency and effectiveness. One solution to this problem is to apply dimensionality reduction, by which the original feature set can be reduced to a small number of features while gaining improved accuracy and/or efficiency of the learning algorithm. We have investigated multiple dimensionality reduction methods for nearest neighbor classification in high dimensions. In previous studies, we have demonstrated that fusion of different outputs of dimensionality reduction methods, either by combining classifiers built on reduced features, or by combining reduced features and then applying the classifier, may yield higher accuracies than when using individual reduction methods. However, none of the previous studies have investigated which dimensionality reduction methods to choose for fusion, when outputs of multiple dimensionality reduction methods are available. Therefore, we have empirically investigated different combinations of the output of four dimensionality reduction methods on 18 medicinal chemistry datasets. The empirical investigation demonstrates that fusion of nearest neighbor classifiers obtained from multiple reduction methods in all cases outperforms the use of individual dimensionality reduction methods, while fusion of different feature subsets is quite sensitive to the choice of dimensionality reduction methods.

  • 28.
    Dudas, C.
    et al.
    Ng, A.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Information Extraction in Manufacturing using Data Mining Techniques (2008) In: Proceedings of the Swedish Production Symposium, 2008. Conference paper (Refereed)
  • 29.
    Dudas, Catarina
    et al.
    Högskolan i Skövde, Forskningscentrum för Virtuella system.
    Boström, Henrik
    Högskolan i Skövde, Forskningscentrum för Informationsteknologi.
    Using Uncertain Chemical and Thermal Data to Predict Product Quality in a Casting Process (2009) In: Proceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data / [ed] Jian Pei, Lise Getoor, Ander De Keijzer, ACM Press, 2009, p. 57-61. Conference paper (Refereed)
    Abstract [en]

    Process and casting data from different sources have been collected and merged for the purpose of predicting, and determining what factors affect, the quality of cast products in a foundry. One problem is that the measurements cannot be directly aligned, since they are collected at different points in time, and instead they have to be approximated for specific time points, hence introducing uncertainty. An approach for addressing this problem is investigated, where uncertain numeric features values are represented by intervals and random forests are extended to handle such intervals. A preliminary experiment shows that the suggested way of forming the intervals, together with the extension of random forests, results in higher predictive performance compared to using single (expected) values for the uncertain features together with standard random forests.

  • 30.
    Dudas, Catarina
    et al.
    University of Skövde.
    Ng, Amos
    University of Skövde.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Information extraction from solution set of simulation-based multi-objective optimization using data mining (2009) In: Proceedings of Industrial Simulation Conference (ISC) 2009, Eurosis, 2009, p. 65-69. Conference paper (Refereed)
    Abstract [en]

    In this work, we investigate ways of extracting information from simulations, in particular from simulation-based multi-objective optimisation, in order to acquire information that can support human decision makers who aim to optimise manufacturing processes. Applying data mining to analyse data generated using simulation is a fairly unexplored area. With the observation that the solutions obtained from a simulation-based multi-objective optimisation are all optimal (or close to the optimal Pareto front), so that they are bound to follow and exhibit certain relationships among variables vis-à-vis objectives, it is argued that using data mining to discover these relationships could be a promising procedure. The aim of this paper is to provide the empirical results from two simulation case studies to support such a hypothesis.

  • 31.
    Dudas, Catarina
    et al.
    Högskolan i Skövde, Institutionen för teknik och samhälle.
    Ng, Amos
    Högskolan i Skövde, Institutionen för teknik och samhälle.
    Boström, Henrik
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Knowledge Extraction in Manufacturing using Data Mining Techniques (2008) In: Proceedings of the Swedish Production Symposium 2008, Stockholm, Sweden, November 18-20, 2008, 8 pages. Conference paper (Refereed)
    Abstract [en]

    Nowadays many production companies collect and store production and process data in large databases. Unfortunately the data is rarely used in the most value-generating way, i.e., finding patterns of inconsistencies and relationships between process settings and quality outcome. This paper addresses the benefits of using data mining techniques in manufacturing applications. Two different applications are laid out, but the technique and software used are the same in both cases. The first case deals with how data mining can be used to discover the effect of process timing and settings on the quality outcome in the casting industry. The result of a multi-objective optimization of a camshaft process is used as the second case. This study focuses on finding the most appropriate dispatching rule settings in the buffers on the line. The use of data mining techniques in these two cases generated previously unknown knowledge. For example, in order to maximize throughput in the camshaft production, let the dispatching rule for the most severe bottleneck be of type Shortest Processing Time (SPT) and for the second bottleneck use any but Most Work Remaining (MWKR).

  • 32.
    Dudas, Catarina
    et al.
    Högskolan i Skövde, Forskningscentrum för Virtuella system.
    Ng, Amos H.C.
    Högskolan i Skövde, Institutionen för ingenjörsvetenskap.
    Pehrsson, Leif
    Högskolan i Skövde, Institutionen för ingenjörsvetenskap.
    Boström, Henrik
    Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden.
    Integration of data mining and multi-objective optimisation for decision support in production system development (2014) In: International Journal of Computer Integrated Manufacturing (Print), ISSN 0951-192X, E-ISSN 1362-3052, Vol. 27, no 9, p. 824-839. Article in journal (Refereed)
    Abstract [en]

    Multi-objective optimisation (MOO) is a powerful approach for generating a set of optimal trade-off (Pareto) design alternatives that the decision-maker can evaluate and then choose the most-suitable configuration, based on some high-level strategic information. Nevertheless, in practice, choosing among a large number of solutions on the Pareto front is often a daunting task, if proper analysis and visualisation techniques are not applied. Recent research advancements have shown the advantages of using data mining techniques to automate the post-optimality analysis of Pareto-optimal solutions for engineering design problems. Nonetheless, it is argued that the existing approaches are inadequate for generating high-quality results, when the set of the Pareto solutions is relatively small and the solutions close to the Pareto front have almost the same attributes as the Pareto-optimal solutions, of which both are commonly found in many real-world system problems. The aim of this paper is therefore to propose a distance-based data mining approach for the solution sets generated from simulation-based optimisation, in order to address these issues. Such an integrated data mining and MOO procedure is illustrated with the results of an industrial cost optimisation case study. Particular emphasis is paid to showing how the proposed procedure can be used to assist decision-makers in analysing and visualising the attributes of the design alternatives in different regions of the objective space, so that informed decisions can be made in production systems development.

  • 33.
    Dudas, Catarina
    et al.
    Ng, Amos H. C.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Post-analysis of multi-objective optimization solutions using decision trees (2015) In: Intelligent Data Analysis, ISSN 1088-467X, E-ISSN 1571-4128, Vol. 19, no 2, p. 259-278. Article in journal (Refereed)
    Abstract [en]

    Evolutionary algorithms are often applied to solve multi-objective optimization problems. Such algorithms effectively generate solutions of wide spread, and have good convergence properties. However, they do not provide any characteristics of the found optimal solutions, something which may be very valuable to decision makers. By performing a post-analysis of the solution set from multi-objective optimization, relationships between the input space and the objective space can be identified. In this study, decision trees are used for this purpose. It is demonstrated that they may effectively capture important characteristics of the solution sets produced by multi-objective optimization methods. It is furthermore shown that the discovered relationships may be used for improving the search for additional solutions. Two multi-objective problems are considered in this paper: a well-studied benchmark function problem with an optimal Pareto front that is known beforehand, which is used for verification purposes, and a multi-objective optimization problem of a real-world production system. The results show that useful relationships may be identified by employing decision tree analysis of the solution sets from multi-objective optimizations.
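
    The post-analysis idea can be illustrated in a few lines: fit a shallow decision tree that separates (near-)Pareto-optimal solutions from the rest and read off its rules. The synthetic data below is purely illustrative and does not come from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X_solutions = rng.uniform(size=(500, 3))  # decision variables (synthetic)
# Label solutions in a toy "optimal" region, standing in for Pareto membership
on_front = (X_solutions[:, 0] + X_solutions[:, 1] < 0.5).astype(int)

tree = DecisionTreeClassifier(max_depth=3).fit(X_solutions, on_front)
print(export_text(tree))  # readable rules characterizing the optimal region
```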

  • 34.
    Gammerman, Alexander
    et al.
    Royal Holloway, University of London, Egham, Surrey, England.
    Vovk, Vladimir
    Royal Holloway, University of London, Egham, Surrey, England.
    Boström, Henrik
    KTH, School of Electrical Engineering and Computer Science (EECS), Software and Computer systems, SCS.
    Carlsson, Lars
    Stena Line AB, Gothenburg, Sweden.
    Conformal and probabilistic prediction with applications: editorial (2019) In: Machine Learning, ISSN 0885-6125, E-ISSN 1573-0565, Vol. 108, no 3, p. 379-380. Article in journal (Other academic)
  • 35.
    Gurung, R. B.
    et al.
    Lindgren, T.
    Boström, H.
    KTH, School of Information and Communication Technology (ICT).
    Learning random forest from histogram data using split specific axis rotation (2018) In: International Journal of Machine Learning and Computing, ISSN 2010-3700, Vol. 8, no 1, p. 74-79. Article in journal (Refereed)
    Abstract [en]

    Machine learning algorithms for data containing histogram variables have not been explored to any major extent. In this paper, an adapted version of the random forest algorithm is proposed to handle variables of this type, assuming identical structure of the histograms across observations, i.e., the histograms for a variable all use the same number and width of bins. The standard approach of representing bins as separate variables may lead the learning algorithm to overlook the underlying dependencies. In contrast, the proposed algorithm handles each histogram as a unit. When performing split evaluation of a histogram variable during tree growth, a sliding window of fixed size is employed by the proposed algorithm to constrain the sets of bins that are considered together. A small number of all possible sets of bins are randomly selected, and principal component analysis (PCA) is applied locally on all examples in a node. Split evaluation is then performed on each principal component. Results from applying the algorithm to both synthetic and real-world data are presented, showing that the proposed algorithm outperforms the standard approach of using random forests together with bins represented as separate variables, with respect to both AUC and accuracy. In addition to introducing the new algorithm, we elaborate on how real-world data for predicting NOx sensor failure in heavy duty trucks was prepared, demonstrating that predictive performance can be further improved by adding variables that represent changes of the histograms over time.
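
    A sketch of the split-feature construction described above: random fixed-size windows of bins are rotated with a local PCA, and the component scores become candidate split variables. Window size and count are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA

def candidate_split_features(H, window=4, n_windows=5, rng=None):
    """H: (n_examples, n_bins) histogram variable with identical bin
    structure across observations. Randomly pick bin windows, rotate each
    with a local PCA, and return the component scores as candidate split
    features (split evaluation would then run per component)."""
    rng = rng or np.random.default_rng()
    n_bins = H.shape[1]
    feats = []
    for _ in range(n_windows):
        start = rng.integers(0, n_bins - window + 1)
        block = H[:, start:start + window]  # bins considered together
        pca = PCA(n_components=min(window, block.shape[0]))
        feats.append(pca.fit_transform(block))
    return np.hstack(feats)
```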

  • 36.
    Gurung, Ram B.
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Lindgren, Tony
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Learning Decision Trees from Histogram Data (2015) In: Proceedings of the 2015 International Conference on Data Mining: DMIN 2015 / [ed] Robert Stahlbock, Gary M. Weiss, AAAI Press, 2015, p. 139-145. Conference paper (Refereed)
    Abstract [en]

    When applying learning algorithms to histogram data, bins of such variables are normally treated as separate independent variables. However, this may lead to a loss of information, as the underlying dependencies may not be fully exploited. In this paper, we adapt the standard decision tree learning algorithm to handle histogram data by proposing a novel method for partitioning examples using binned variables. Results from applying the algorithm to both synthetic and real-world data sets demonstrate that exploiting dependencies in histogram data may have positive effects on both predictive performance and model size, as measured by the number of nodes in the decision tree. These gains are, however, associated with an increased computational cost and more complex split conditions. To address the former issue, an approximate method is proposed, which speeds up the learning process substantially while retaining the predictive performance.

  • 37.
    Gurung, Ram B.
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Lindgren, Tony
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Learning Decision Trees from Histogram Data Using Multiple Subsets of Bins2016In: Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference / [ed] Zdravko Markov, Ingrid Russell, AAAI Press , 2016, p. 430-435Conference paper (Refereed)
    Abstract [en]

    The standard approach of learning decision trees from histogram data is to treat the bins as independent variables. However, as the underlying dependencies among the bins might not be completely exploited by this approach, an algorithm has been proposed for learning decision trees from histogram data by considering all bins simultaneously while partitioning examples at each node of the tree. Although the algorithm has been demonstrated to improve predictive performance, its computational complexity has turned out to be a major bottleneck, in particular for histograms with a large number of bins. In this paper, we propose instead a sliding window approach to select subsets of the bins to be considered simultaneously while partitioning examples. This significantly reduces the number of possible splits to consider, allowing for substantially larger histograms to be handled. We also propose to evaluate the original bins independently, in addition to evaluating the subsets of bins when performing splits. This ensures that the information obtained by treating bins simultaneously is an additional gain compared to what is considered by the standard approach. Results of experiments applying the new algorithm to both synthetic and real-world datasets demonstrate improvements in predictive performance without excessive computational cost.

  • 38.
    Gurung, Ram B.
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Lindgren, Tony
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Predicting NOx sensor failure in heavy duty trucks using histogram-based random forests2017In: International Journal of Prognostics and Health Management, ISSN 2153-2648, E-ISSN 2153-2648, Vol. 8, no 1, article id 008Article in journal (Refereed)
    Abstract [en]

    Being able to accurately predict the impending failures of truck components is often associated with significant cost savings, customer satisfaction, and flexibility in maintenance service plans. However, because of the diversity in how trucks are typically configured and used under different conditions, creating accurate prediction models is not an easy task. This paper describes an effort to create such a prediction model for the NOx sensor, i.e., a component measuring the emitted level of nitrogen oxide in the exhaust of the engine. This component was chosen because it is vital for the truck to function properly, while at the same time being very fragile and costly to repair. As input to the model, technical specifications of trucks and their operational data are used. The process of collecting the data and making it ready for training the model via a slightly modified random forest learning algorithm is described, along with various challenges encountered during this process. The operational data consists of features represented as histograms, posing an additional challenge for the data analysis task. In the study, a modified version of the random forest algorithm is employed, which exploits the fact that the individual bins in the histograms are related, in contrast to the standard approach that would consider the bins as independent features. Experiments conducted using the updated random forest algorithm clearly show that the modified version is indeed beneficial compared to the standard random forest algorithm. The performance of the resulting prediction model for the NOx sensor is promising and may be adopted for the benefit of operators of heavy trucks.

  • 39. Henelius, Andreas
    et al.
    Puolamaki, Kai
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Asker, Lars
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    A peek into the black box: exploring classifiers by randomization2014In: Data mining and knowledge discovery, ISSN 1384-5810, E-ISSN 1573-756X, Vol. 28, no 5-6, p. 1503-1529Article in journal (Refereed)
    Abstract [en]

    Classifiers are often opaque and cannot easily be inspected to gain understanding of which factors are of importance. We propose an efficient iterative algorithm to find the attributes and dependencies used by any classifier when making predictions. The performance and utility of the algorithm are demonstrated on two synthetic and 26 real-world datasets, using 15 commonly used learning algorithms to generate the classifiers. The empirical investigation shows that the novel algorithm is indeed able to find groupings of interacting attributes exploited by the different classifiers. These groupings allow for finding similarities among classifiers for a single dataset as well as for determining the extent to which different classifiers exploit such interactions in general.
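
    A toy version of the underlying randomization idea can convey the mechanism: permute attributes within each predicted class, jointly for attributes in the same candidate group and independently otherwise, then measure how much the classifier's predictions change. A grouping that keeps an interaction together should preserve fidelity; splitting it should not. The search over groupings performed by the actual algorithm is omitted here, and all data and names are illustrative.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(2)
    X = rng.normal(size=(1000, 4))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)   # label depends on an interaction
    clf = RandomForestClassifier(random_state=0).fit(X, y)
    base = clf.predict(X)

    def fidelity(groups):
        # permute each group's columns jointly, within each predicted class
        Xp = X.copy()
        for cls in np.unique(base):
            idx = np.where(base == cls)[0]
            for g in groups:
                perm = rng.permutation(idx)
                Xp[np.ix_(idx, g)] = X[np.ix_(perm, g)]
        return np.mean(clf.predict(Xp) == base)

    print(fidelity([[0, 1], [2], [3]]))    # interaction kept together -> high fidelity
    print(fidelity([[0], [1], [2], [3]]))  # interaction split -> fidelity drops
    ```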

  • 40. Henelius, Andreas
    et al.
    Puolamäki, Kai
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Clustering with Confidence: Finding Clusters with Statistical Guarantees2016In: Article in journal (Refereed)
    Abstract [en]

    Clustering is a widely used unsupervised learning method for finding structure in data. However, the resulting clusters are typically presented without any guarantees on their robustness; slightly changing the data sample used, or re-running a clustering algorithm involving some stochastic component, may lead to completely different clusters. There is, hence, a need for techniques that can quantify the instability of the generated clusters. In this study, we propose a technique for quantifying the instability of a clustering solution and for finding robust clusters, termed core clusters, which correspond to clusters where the co-occurrence probability of each data item within a cluster is at least 1 − α. We demonstrate how solving the core clustering problem is linked to finding the largest maximal cliques in a graph. We show that the method can be used with both clustering and classification algorithms. The proposed method is tested on both simulated and real datasets. The results show that the obtained clusters indeed meet the guarantees on robustness.
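
    The following sketch, with invented parameters, shows how the core-cluster idea can be approximated: estimate pairwise co-occurrence probabilities over repeated clustering runs, keep the pairs that co-occur with probability at least 1 − α, and read maximal cliques off the resulting graph.

    ```python
    import numpy as np
    import networkx as nx
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
    runs = 50
    co = np.zeros((len(X), len(X)))
    for r in range(runs):
        labels = KMeans(n_clusters=3, n_init=1, random_state=r).fit_predict(X)
        co += (labels[:, None] == labels[None, :])   # pairwise co-occurrence counts
    co /= runs

    alpha = 0.1
    G = nx.Graph()
    G.add_edges_from((i, j) for i in range(len(X)) for j in range(i + 1, len(X))
                     if co[i, j] >= 1 - alpha)
    cores = sorted(nx.find_cliques(G), key=len, reverse=True)[:3]
    print([len(c) for c in cores])                   # sizes of the core clusters
    ```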

  • 41.
    Henelius, Andreas
    et al.
    Finnish Institute of Occupational Health.
    Puolamäki, Kai
    Finnish Institute of Occupational Health.
    Karlsson, Isak
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Asker, Lars
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Papapetrou, Panagiotis
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    GoldenEye++: a Closer Look into the Black Box2015In: International Symposium on Statistical Learning and Data Science, Springer Publishing Company , 2015Conference paper (Refereed)
  • 42.
    Henriksson, Aron
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Modeling Electronic Health Records in Ensembles of Semantic Spaces for Adverse Drug Event Detection2015In: 2015 IEEE International Conference on Bioinformatics and Biomedicine: Proceedings / [ed] Jun (Luke) Huan et al., IEEE Computer Society, 2015, p. 343-350, article id 7359705Conference paper (Refereed)
    Abstract [en]

    Electronic health records (EHRs) are emerging as a potentially valuable source for pharmacovigilance; however, adverse drug events (ADEs), which can be encoded in EHRs by a set of diagnosis codes, are heavily underreported. Alerting systems, able to detect potential ADEs on the basis of patient-specific EHR data, would help to mitigate this problem. To that end, the use of machine learning has proven to be both efficient and effective; however, challenges remain in representing the heterogeneous EHR data, which moreover tends to be high-dimensional and exceedingly sparse, in a manner conducive to learning high-performing predictive models. Prior work has shown that distributional semantics – that is, natural language processing methods that, traditionally, model the meaning of words in semantic (vector) space on the basis of co-occurrence information – can be exploited to create effective representations of sequential EHR data, not only free-text in clinical notes but also various clinical events such as diagnoses, drugs and measurements. When modeling data in semantic space, an important design decision concerns the size of the context window around an object of interest, which governs the scope of co-occurrence information that is taken into account and affects the composition of the resulting semantic space. Here, we report on experiments conducted on 27 clinical datasets, demonstrating that performance can be significantly improved by modeling EHR data in ensembles of semantic spaces, consisting of multiple semantic spaces built with different context window sizes. A follow-up investigation is conducted to study the impact on predictive performance as increasingly more semantic spaces are included in the ensemble, demonstrating that accuracy tends to improve with the number of semantic spaces, albeit not monotonically so. Finally, a number of different strategies for combining the semantic spaces are explored, demonstrating the advantage of early (feature) fusion over late (classifier) fusion. Ensembles of semantic spaces allow multiple views of (sparse) data to be captured (densely) and thereby enable improved performance to be obtained on the task of detecting ADEs in EHRs.
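
    One way to picture an ensemble of semantic spaces is sketched below: a simple co-occurrence matrix is reduced with SVD once per context window size, a record is represented by the mean of its event vectors in each space, and early fusion concatenates the per-space representations. Everything here (vocabulary size, window sizes, dimensionality) is an illustrative stand-in for the paper's setup, not its actual representation-learning procedure.

    ```python
    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    rng = np.random.default_rng(3)
    vocab = 50
    seqs = [rng.integers(0, vocab, size=30) for _ in range(200)]  # event sequences

    def semantic_space(window, dim=10):
        # count co-occurrences within the given context window, then reduce
        C = np.zeros((vocab, vocab))
        for s in seqs:
            for i, w in enumerate(s):
                for j in range(max(0, i - window), min(len(s), i + window + 1)):
                    if j != i:
                        C[w, s[j]] += 1
        return TruncatedSVD(n_components=dim, random_state=0).fit_transform(C)

    spaces = {w: semantic_space(w) for w in (2, 4, 8)}   # the "ensemble"

    def represent(seq):
        # early (feature) fusion: concatenate the mean event vector per space
        return np.concatenate([spaces[w][seq].mean(axis=0) for w in spaces])

    X = np.array([represent(s) for s in seqs])
    print(X.shape)                                       # (200, 3 * 10)
    ```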

  • 43.
    Henriksson, Aron
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Modeling Heterogeneous Clinical Sequence Data in Semantic Space for Adverse Drug Event Detection2015In: Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics / [ed] Eric Gaussier, Longbing Cao, Patrick Gallinari, James Kwok, Gabriella Pasi, Osmar Zaiane, Institute of Electrical and Electronics Engineers (IEEE), 2015, p. 792-799, article id 7344867Conference paper (Refereed)
    Abstract [en]

    The enormous amounts of data that are continuously recorded in electronic health record systems offer ample opportunities for data science applications to improve healthcare. There are, however, challenges involved in using such data for machine learning, such as high dimensionality and sparsity, as well as an inherent heterogeneity that does not allow the distinct types of clinical data to be treated in an identical manner. On the other hand, there are also similarities across data types that may be exploited, e.g., the possibility of representing some of them as sequences. Here, we apply the notions underlying distributional semantics, i.e., methods that model the meaning of words in semantic (vector) space on the basis of co-occurrence information, to four distinct types of clinical data: free-text notes, on the one hand, and clinical events, in the form of diagnosis codes, drug codes and measurements, on the other hand. Each semantic space contains continuous vector representations for every unique word and event, which can then be used to create representations of, e.g., care episodes that, in turn, can be exploited by the learning algorithm. This approach not only reduces sparsity, but also takes into account, and explicitly models, similarities between various items, and it does so in an entirely data-driven fashion. Here, we report on a series of experiments using the random forest learning algorithm that demonstrate the effectiveness, in terms of accuracy and area under ROC curve, of the proposed representation form over the commonly used bag-of-items counterpart. The experiments are conducted on 27 real datasets that each involves the (binary) classification task of detecting a particular adverse drug event. It is also shown that combining structured and unstructured data leads to significant improvements over using only one of them.
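
    The representation described above can be caricatured as follows: one semantic space per data type, a care episode encoded as the concatenation of the mean vectors of its events in each space, and a random forest trained on the result. The spaces below are faked with random matrices purely to show the data flow, and the labels are synthetic; none of this is the study's actual data or pipeline.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(7)
    dim = 10
    spaces = {t: rng.normal(size=(100, dim))   # 100 codes per type, fake vectors
              for t in ("diagnosis", "drug", "measurement", "word")}

    def episode_vector(episode):
        # episode: dict type -> array of event codes observed in the episode
        return np.concatenate([spaces[t][codes].mean(axis=0)
                               for t, codes in episode.items()])

    episodes = [{t: rng.integers(0, 100, size=5) for t in spaces} for _ in range(300)]
    X = np.array([episode_vector(e) for e in episodes])
    y = rng.integers(0, 2, size=300)           # ADE / no ADE (synthetic)
    clf = RandomForestClassifier(random_state=0).fit(X, y)
    print(clf.score(X, y))
    ```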

  • 44.
    Henriksson, Aron
    et al.
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Zhao, Jing
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Dalianis, Hercules
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Ensembles of randomized trees using diverse distributed representations of clinical events2016In: BMC Medical Informatics and Decision Making, ISSN 1472-6947, E-ISSN 1472-6947, Vol. 16, no 2, article id 69Article in journal (Refereed)
    Abstract [en]

    Background: Learning deep representations of clinical events based on their distributions in electronic health records has been shown to allow for subsequent training of higher-performing predictive models compared to the use of shallow, count-based representations. The predictive performance may be further improved by utilizing multiple representations of the same events, which can be obtained by, for instance, manipulating the representation learning procedure. The question, however, remains how to make best use of a set of diverse representations of clinical events – modeled in an ensemble of semantic spaces – for the purpose of predictive modeling. Methods: Three different ways of exploiting a set of (ten) distributed representations of four types of clinical events – diagnosis codes, drug codes, measurements, and words in clinical notes – are investigated in a series of experiments using ensembles of randomized trees. Here, the semantic space ensembles are obtained by varying the context window size in the representation learning procedure. The proposed method trains a forest wherein each tree is built from a bootstrap replicate of the training set whose entire original feature set is represented in a randomly selected set of semantic spaces – corresponding to the considered data types – of a given context window size. Results: The proposed method significantly outperforms concatenating the multiple representations of the bagged dataset; it also significantly outperforms representing, for each decision tree, only a subset of the features in a randomly selected set of semantic spaces. A follow-up analysis indicates that the proposed method exhibits less diversity while significantly improving average tree performance. It is also shown that the size of the semantic space ensemble has a significant impact on predictive performance and that performance tends to improve as the size increases. Conclusions: The strategy for utilizing a set of diverse distributed representations of clinical events when constructing ensembles of randomized trees has a significant impact on predictive performance. The most successful strategy – significantly outperforming the considered alternatives – involves randomly sampling distributed representations of the clinical events when building each decision tree in the forest.
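
    The winning strategy can be sketched compactly: each tree is grown from a bootstrap replicate whose entire feature set is represented in one randomly selected semantic space. The code below is an illustrative reconstruction with fake representation matrices and invented names, not the study's implementation.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_forest(reprs, y, n_trees=25, seed=4):
        # reprs: dict window_size -> (n_samples, n_features) representation matrix
        rng = np.random.default_rng(seed)
        forest, n = [], len(y)
        for _ in range(n_trees):
            w = rng.choice(list(reprs))        # one semantic space per tree
            boot = rng.integers(0, n, size=n)  # bootstrap replicate
            forest.append((w, DecisionTreeClassifier().fit(reprs[w][boot], y[boot])))
        return forest

    def predict(forest, reprs_row):
        # reprs_row: dict window_size -> (1, n_features) row in each space
        votes = [int(t.predict(reprs_row[w])[0]) for w, t in forest]
        return max(set(votes), key=votes.count)

    rng = np.random.default_rng(0)
    reprs = {w: rng.normal(size=(100, 8)) for w in (2, 4, 8)}   # fake spaces
    y = rng.integers(0, 2, size=100)
    forest = fit_forest(reprs, y)
    print(predict(forest, {w: reprs[w][:1] for w in reprs}))
    ```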

  • 45.
    Hollmen, Jaakko
    et al.
    Aalto Univ, Dept Comp Sci, Espoo, Finland.
    Asker, Lars
    Stockholm Univ, Dept Comp & Syst Sci, Stockholm, Sweden.
    Karlsson, Isak
    Stockholm Univ, Dept Comp & Syst Sci, Stockholm, Sweden.
    Papapetrou, Panagiotis
    Stockholm Univ, Dept Comp & Syst Sci, Stockholm, Sweden.
    Boström, Henrik
    KTH, School of Electrical Engineering and Computer Science (EECS), Software and Computer systems, SCS.
    Wikner, Birgitta Norstedt
    Karolinska Inst, Dept Med, Ctr Pharmacoepidemiol CPE, Stockholm, Sweden.
    Ohman, Inger
    Karolinska Inst, Dept Med, Ctr Pharmacoepidemiol CPE, Stockholm, Sweden.
    Exploring epistaxis as an adverse effect of anti-thrombotic drugs and outdoor temperature2018In: 11TH ACM INTERNATIONAL CONFERENCE ON PERVASIVE TECHNOLOGIES RELATED TO ASSISTIVE ENVIRONMENTS (PETRA 2018), ASSOC COMPUTING MACHINERY , 2018, p. 1-4Conference paper (Refereed)
    Abstract [en]

    Electronic health records contain a wealth of epidemiological information about diseases at the population level. Using a database of medical diagnoses and drug prescriptions in electronic health records, we investigate the correlation between outdoor temperature and the incidence of epistaxis over time for two groups of patients. One group consists of patients that had been diagnosed with epistaxis and also been prescribed at least one of the three anti-thrombotic agents: Warfarin, Apixaban, or Rivaroxaban. The other group consists of patients that had been diagnosed with epistaxis and not been prescribed any of the three anti-thrombotic drugs. We find a strong negative correlation between the incidence of epistaxis and outdoor temperature for the group that had not been prescribed any of the three anti-thrombotic drugs, while there is a weaker correlation between incidence of epistaxis and outdoor temperature for the other group. It is, however, clear that both groups are affected in a similar way, such that the incidence of epistaxis increases with colder temperatures.
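
    A minimal version of this kind of analysis, with simulated monthly data and assumed column names (not the study's actual schema), might look as follows:

    ```python
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(5)
    months = pd.date_range("2015-01", periods=36, freq="MS")
    # simulate seasonal temperature and a cold-driven incidence signal
    temp = 10 + 12 * np.sin(2 * np.pi * months.month / 12) + rng.normal(0, 2, 36)
    cases = 50 - 1.5 * temp + rng.normal(0, 5, 36)

    df = pd.DataFrame({"month": months, "temp": temp, "cases": cases})
    print(df["temp"].corr(df["cases"]))   # Pearson correlation (negative here)
    ```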

  • 46. Hulth, Anette
    et al.
    Karlgren, Jussi
    SICS.
    Jonsson, Anna
    Boström, Henrik
    Asker, Lars
    Automatic Keyword Extraction Using Domain Knowledge2008In: Computational Linguistics and Intelligent Text Processing, Berlin / Heidelberg: Springer , 2008, 1Chapter in book (Refereed)
    Abstract [en]

    Documents can be assigned keywords by frequency analysis of the terms found in the document text, which arguably is the primary source of knowledge about the document itself. By including a hierarchically organised domain-specific thesaurus as a second knowledge source, the quality of such keywords was improved considerably, as measured by match to previously manually assigned keywords. In the presented experiment, the combination of the evidence from frequency analysis and the hierarchically organised thesaurus was done using inductive logic programming.

  • 47.
    Jacobsson, Micael
    et al.
    Uppsala universitet, Avdelningen för organisk farmaceutisk kemi.
    Lidén, Per
    Stjernschantz, Eva
    Boström, Henrik
    Stockholm University, Sweden.
    Norinder, Ulf
    Improving structure-based virtual screening by multivariate analysis of scoring data2003In: Journal of Medicinal Chemistry, ISSN 0022-2623, E-ISSN 1520-4804, Vol. 46, no 26, p. 5781-5789Article in journal (Refereed)
    Abstract [en]

    Three different multivariate statistical methods, PLS discriminant analysis, rule-based methods, and Bayesian classification, have been applied to multidimensional scoring data from four different target proteins: estrogen receptor alpha (ERalpha), matrix metalloprotease 3 (MMP3), factor Xa (fXa), and acetylcholine esterase (AChE). The purpose was to build classifiers able to discriminate between active and inactive compounds, given a structure-based virtual screen. Seven different scoring functions were used to generate the scoring matrices. The classifiers were compared to classical consensus scoring and single scoring functions. The classifiers show a superior performance, with rule-based methods being most effective. The precision of correctly predicting an active compound is about 90% for three of the targets and about 25% for acetylcholine esterase. On the basis of these results, a new two-stage approach is suggested for structure-based virtual screening where limited activity information is available.
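
    The classification setup can be illustrated with synthetic data: a compounds-by-scoring-functions matrix serves as input to a rule-based learner (a depth-limited decision tree as a stand-in), evaluated against a simple mean-score consensus baseline. All data and parameters below are invented.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(6)
    n, k = 400, 7                        # compounds, scoring functions
    active = rng.random(n) < 0.2
    # actives score slightly higher under each scoring function
    S = rng.normal(size=(n, k)) + active[:, None] * rng.uniform(0.5, 1.5, k)

    consensus = S.mean(axis=1)           # classical consensus scoring
    clf_auc = cross_val_score(DecisionTreeClassifier(max_depth=4), S, active,
                              scoring="roc_auc", cv=5).mean()
    print(roc_auc_score(active, consensus))  # consensus baseline
    print(clf_auc)                           # rule-based classifier (CV)
    ```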

  • 48. Jansson, Karl
    et al.
    Sundell, Håkan
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    Boström, Henrik
    Stockholms universitet, Institutionen för data- och systemvetenskap.
    gpuRF and gpuERT: efficient and Scalable GPU Algorithms for Decision Tree Ensembles2014In: Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International, IEEE Computer Society , 2014, p. 1612-1621Conference paper (Refereed)
  • 49.
    Jansson, Karl
    et al.
    Högskolan i Borås, Institutionen Handels- och IT-högskolan.
    Sundell, Håkan
    Högskolan i Borås, Institutionen Handels- och IT-högskolan.
    Boström, Henrik
    Parallel tree-ensemble algorithms for GPUs using CUDA2013Conference paper (Refereed)
    Abstract [en]

    We present two new parallel implementations of the tree-ensemble algorithms Random Forest (RF) and Extremely randomized trees (ERT) for emerging many-core platforms, e.g., contemporary graphics cards suitable for general-purpose computing (GPGPU). Random Forest and Extremely randomized trees are ensemble learners for classification and regression. They operate by constructing a multitude of decision trees at training time and outputting a prediction by comparing the outputs of the individual trees. Thanks to the inherent parallelism of the task, an obvious platform for its computation is to employ contemporary GPUs with a large number of processing cores. Previous parallel algorithms for Random Forests in the literature are either designed for traditional multi-core CPU platforms or for early GPUs with simpler hardware architectures and relatively few cores. The new parallel algorithms are designed for contemporary GPUs with a large number of cores and take into account aspects of the newer hardware architectures, such as memory hierarchy and thread scheduling. They are implemented using the C/C++ language and the CUDA interface for best possible performance on NVidia-based GPUs. An experimental study comparing with the most important previous solutions for CPU and GPU platforms shows significant improvements for the new implementations, often by several orders of magnitude.

  • 50.
    Johansson, Ronnie
    et al.
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Boström, Henrik
    Högskolan i Skövde, Institutionen för kommunikation och information.
    Karlsson, Alexander
    Högskolan i Skövde, Institutionen för kommunikation och information.
    A Study on Class-Specifically Discounted Belief for Ensemble Classifiers2008In: Proceedings of the IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI 2008), IEEE Press , 2008, p. 614-619Conference paper (Refereed)
    Abstract [en]

    Ensemble classifiers are known to generally perform better than their constituent classifiers. Whereas much work has focused on the generation of classifiers for ensembles, much less attention has been given to the fusion of individual classifier outputs. One approach to fusing the outputs is to apply Shafer's theory of evidence, which provides a flexible framework for expressing and fusing beliefs. However, representing and fusing beliefs is non-trivial, since it can be performed in a multitude of ways within the evidential framework. In a previous article, we compared different evidential combination rules for ensemble fusion. The study involved a single belief representation, which involved discounting (i.e., weighting) the classifier outputs with classifier reliability. The classifier reliability was interpreted as the classifier's estimated accuracy, i.e., the percentage of correctly classified examples. However, classifiers may have different performance for different classes, and in this work we assign the reliability of a classifier output depending on the class-specific reliability of the classifier. Using 27 UCI datasets, we compare the two different ways of expressing beliefs and some evidential combination rules. The result of the study indicates that there is indeed an advantage of utilizing class-specific reliability compared to accuracy in an evidential framework for combining classifiers in the ensemble design considered.
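
    The discounting scheme can be made concrete with a two-class, two-classifier toy: each classifier's output receives mass equal to its reliability for the class it predicts, the remainder goes to ignorance, and the masses are fused with Dempster's rule. The reliabilities below are made up for illustration.

    ```python
    from itertools import product

    FRAME = frozenset({"A", "B"})   # the frame of discernment

    def discounted_mass(pred, reliability):
        # mass 'reliability' on the predicted class, the rest on ignorance
        return {frozenset({pred}): reliability, FRAME: 1 - reliability}

    def dempster(m1, m2):
        # Dempster's rule: multiply masses on intersecting focal sets,
        # then renormalize by the non-conflicting mass
        combined, conflict = {}, 0.0
        for (s1, v1), (s2, v2) in product(m1.items(), m2.items()):
            inter = s1 & s2
            if inter:
                combined[inter] = combined.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2
        return {s: v / (1 - conflict) for s, v in combined.items()}

    # classifier 1 predicts A and is reliable for A; classifier 2 predicts B weakly
    m = dempster(discounted_mass("A", 0.9), discounted_mass("B", 0.6))
    print(m)   # belief concentrates on A
    ```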
