Modern smart assembly lines commonly include electric tools with built-in sensors to tighten safety-critical joints. These sensors generate data that are subsequently analyzed by human experts to diagnose potential tightening errors. Previous research aimed to automate this diagnosis by developing diagnostic models based on tightening theory and on friction coefficients calibrated in specific lab setups. Generalizing these results is difficult and often unsuccessful, since friction coefficients vary between lab and production environments. To overcome this problem, this paper presents a novel methodology that builds multi-label classification deep learning models for diagnosing tightening errors using production data. The proposed methodology comprises three key contributions: the Labrador method, the Model Combo (MoBo) framework, and a heuristic evaluation method. Labrador is an elastic deep-learning-based sensor fusion method that (1) uses feature encoders to extract features; (2) conducts data-level and/or feature-level sensor fusion in both the time and frequency domains; and (3) performs multi-label classification to detect and diagnose tightening errors. MoBo is a configurable, modular framework that supports Labrador in identifying optimal feature encoders. Together, MoBo and Labrador make it easy to explore and design a bounded search space of sensor fusion strategies (SFSs) and feature encoders. To identify the optimal solution within this search space, the paper introduces a heuristic method: by evaluating the trade-off between machine learning (ML) metrics (e.g., accuracy, subset accuracy, and F1) and operational (OP) metrics (e.g., inference latency), it identifies the most suitable solution for the requirements of each individual use case. In the experimental evaluation, we apply the proposed methodology to identify the most suitable multi-label classification solutions for diagnosing tightening errors. When optimized for ML metrics, the identified solution achieves 99.69% accuracy, 93.39% subset accuracy, 97.39% F1, and 6.68 ms inference latency; when optimized for OP metrics, it achieves 99.66% accuracy, 92.65% subset accuracy, 97.28% F1, and 2.41 ms inference latency.
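
As a rough illustration of the kind of model Labrador describes, the PyTorch sketch below encodes two assumed tightening signals (torque and angle) in both the time and frequency domains, fuses the resulting features by concatenation, and applies a multi-label classification head. This is not the paper's actual implementation: the signal names, encoder architecture, feature dimensions, and number of labels are all assumptions made for illustration.

```python
# Hypothetical sketch of feature-level sensor fusion in time and frequency
# domains with a multi-label head; NOT the paper's actual Labrador model.
import torch
import torch.nn as nn

class FusionMultiLabelNet(nn.Module):
    def __init__(self, n_labels=4, feat_dim=64):
        super().__init__()
        # One small 1-D CNN encoder per (sensor, domain) branch; illustrative only.
        def encoder():
            return nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
                nn.AdaptiveAvgPool1d(8), nn.Flatten(),
                nn.Linear(16 * 8, feat_dim), nn.ReLU(),
            )
        self.time_enc = nn.ModuleList([encoder(), encoder()])
        self.freq_enc = nn.ModuleList([encoder(), encoder()])
        # Feature-level fusion: concatenate the four branch features.
        self.head = nn.Linear(4 * feat_dim, n_labels)

    def forward(self, torque, angle):  # each signal: (batch, seq_len)
        feats = []
        for i, sig in enumerate((torque, angle)):
            x_time = sig.unsqueeze(1)                         # time-domain branch
            x_freq = torch.fft.rfft(sig).abs().unsqueeze(1)   # magnitude spectrum branch
            feats += [self.time_enc[i](x_time), self.freq_enc[i](x_freq)]
        # Logits per error label; train with sigmoid + BCEWithLogitsLoss for multi-label output.
        return self.head(torch.cat(feats, dim=1))

model = FusionMultiLabelNet()
logits = model(torch.randn(2, 512), torch.randn(2, 512))  # -> shape (2, 4)
```

A data-level fusion variant would instead stack the raw signals along the channel dimension before a single shared encoder; the feature-level form is shown here because it maps naturally onto per-sensor, per-domain encoders of the kind MoBo is described as swapping in and out.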
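The heuristic selection step can be pictured as a weighted trade-off between ML and OP metrics. The sketch below uses an assumed scoring scheme (the weights, latency budget, and formula are not taken from the paper); the two candidate rows simply reuse the subset accuracy, F1, and latency figures reported above.

```python
# Hypothetical selection heuristic: weigh ML quality against inference latency.
candidates = [
    # (name, subset_accuracy, f1, latency_ms) -- figures from the reported results
    ("ml_optimized_solution", 0.9339, 0.9739, 6.68),
    ("op_optimized_solution", 0.9265, 0.9728, 2.41),
]

def score(subset_acc, f1, latency_ms, w_ml=0.7, w_op=0.3, latency_budget_ms=10.0):
    """Assumed weighted score: reward ML quality, penalize latency against a budget."""
    ml_term = (subset_acc + f1) / 2.0
    op_term = max(0.0, 1.0 - latency_ms / latency_budget_ms)
    return w_ml * ml_term + w_op * op_term

best = max(candidates, key=lambda c: score(*c[1:]))
print(best[0])
```

Shifting `w_ml` versus `w_op`, or tightening `latency_budget_ms`, changes which candidate wins, which is how use-case requirements such as a hard real-time latency constraint would steer the selection.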