Background
Deep learning (DL)-based systems have not yet been broadly implemented in clinical practice, in part because their robustness across imaging protocols is unknown.
Purpose
We aimed to evaluate the performance of several previously developed DL-based models, which were trained under standardized reference CT imaging protocols to distinguish idiopathic pulmonary fibrosis (IPF) from non-IPF among interstitial lung disease (ILD) patients. To assess model performance, we used CT scans from non-IPF ILD subjects acquired under various imaging protocols.
Methods
Three DL-based models, one 2D and two 3D, were previously developed to classify ILD patients as IPF or non-IPF based on chest CT scans. These models were trained on CT image data from 389 IPF and 700 non-IPF ILD patients, obtained retrospectively from five multicenter studies. For some patients, multiple CT scans were acquired (e.g., one at inhalation and one at exhalation) and/or reconstructed (e.g., thin slice and/or thick slice). For each patient, one CT image data set was therefore selected for constructing the classification model, and the parameters of that data set serve as the reference conditions. In one non-IPF ILD study, owing to its specific study protocol, many patients had multiple CT image data sets that were acquired in both prone and supine positions and/or reconstructed with different imaging parameters. To assess the robustness of the previously developed models under different (i.e., non-reference) imaging protocols, we identified 343 subjects from this study who had CT data from both the reference condition (used in model construction) and non-reference conditions (i.e., evaluation conditions), and used these data in the model evaluation analysis. We report the specificities of the three models under the non-reference conditions. A generalized linear mixed-effects model (GLMM) was used to identify the CT technical and clinical parameters significantly associated with inconsistent diagnostic results between the reference and evaluation conditions. The selected parameters were effective tube current-time product (known as "effective mAs"), reconstruction kernel, slice thickness, patient orientation (prone or supine), CT scanner model, and clinical diagnosis. Limitations include the retrospective nature of this study.
Results
For all three DL models, the overall specificity of the previously trained IPF diagnosis model decreased under the non-reference conditions (p < 0.05 for two out of three models). The GLMM further suggested that, for at least one out of three models, mean effective mAs across the scan was the key factor associated with the decrease in model predictive performance (p < 0.001). The difference in mean effective mAs between the reference and evaluation conditions (p = 0.03) and slice thickness (3 mm; p = 0.03) were flagged as significant factors for one out of three models. Other factors were not statistically significant (p > 0.05).
Conclusion
Preliminary findings demonstrated a lack of robustness of the DL-based IPF diagnosis models when applied to CT series collected under different imaging protocols, indicating that care should be taken with the acquisition and reconstruction conditions used when developing and deploying DL models in clinical practice.
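
The specificity comparison described in the Methods can be illustrated with a minimal sketch. Because every evaluated subject is non-IPF, specificity is simply the fraction of subjects predicted as non-IPF. The abstract does not state which test produced the reported p-values; the exact McNemar test on paired predictions used here is an assumption, as are all function names, the toy data, and the simplification to one evaluation scan per subject.

```python
from math import comb

def specificity(pred_is_ipf):
    """Fraction of non-IPF subjects correctly predicted as non-IPF."""
    return sum(1 for p in pred_is_ipf if not p) / len(pred_is_ipf)

def discordant_counts(ref_pred, eval_pred):
    """Count subjects whose prediction flips between the two conditions."""
    # Correct under reference, wrong under evaluation.
    ref_only_correct = sum(1 for r, e in zip(ref_pred, eval_pred) if not r and e)
    # Wrong under reference, correct under evaluation.
    eval_only_correct = sum(1 for r, e in zip(ref_pred, eval_pred) if r and not e)
    return ref_only_correct, eval_only_correct

def exact_mcnemar_p(b, c):
    """Two-sided exact McNemar p-value (assumed test, not from the source)."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical predictions (True = predicted IPF) for six non-IPF subjects.
ref_pred = [False, False, False, False, True, False]
eval_pred = [False, True, True, False, True, True]

spec_ref = specificity(ref_pred)
spec_eval = specificity(eval_pred)
b, c = discordant_counts(ref_pred, eval_pred)
p_value = exact_mcnemar_p(b, c)
```

The discordant pairs counted here correspond to the "inconsistent diagnostic results" that the GLMM models as its outcome. A mixed-effects model additionally accounts for subjects contributing several evaluation scans, which this sketch does not.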