Recent advances in artificial intelligence (AI) have sparked interest in developing explainable AI (XAI) methods for clinical decision support systems, especially in translational research. Although XAI methods may enhance trust in black-box models, evaluating their effectiveness has been challenging: conventional evaluation relies on human (expert) intervention and additional annotations, and automated strategies are lacking. To enable a thorough assessment, we propose a patch perturbation-based approach that automatically evaluates the quality of explanations in medical imaging analysis. To eliminate the human effort required by conventional evaluation methods, our approach executes poisoning attacks during model retraining by generating both static and dynamic triggers. We then propose a comprehensive set of evaluation metrics for the model inference stage that enables evaluation from multiple perspectives, covering correctness, completeness, consistency, and complexity. In addition, we include an extensive case study that applies widely used XAI methods to COVID-19 X-ray image classification tasks, as well as a thorough review of existing XAI methods in medical imaging analysis and their evaluation availability. The proposed patch perturbation-based workflow offers model developers an automated and generalizable evaluation strategy to identify potential pitfalls and optimize their explainable solutions, while also aiding end users in comparing and selecting XAI methods that meet specific clinical needs in real-world clinical research and practice.
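The core idea of patch-based poisoning, static triggers stamped at a fixed location versus dynamic triggers stamped at random locations, can be sketched as below. This is a minimal illustration only; the function names, patch size, and relabeling scheme are assumptions for exposition, not the paper's actual trigger design.

```python
import numpy as np

def add_static_trigger(image, patch_size=8, value=1.0):
    """Stamp a fixed bright square in the bottom-right corner (static trigger)."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:] = value
    return poisoned

def add_dynamic_trigger(image, rng, patch_size=8, value=1.0):
    """Stamp the same square at a random location (dynamic trigger)."""
    poisoned = image.copy()
    h, w = image.shape
    r = rng.integers(0, h - patch_size + 1)
    c = rng.integers(0, w - patch_size + 1)
    poisoned[r:r + patch_size, c:c + patch_size] = value
    return poisoned, (r, c)

def poison_dataset(images, labels, target_label, rate, rng):
    """Relabel a fraction of trigger-stamped images to the attacker's target class."""
    n_poison = int(rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images, labels = images.copy(), labels.copy()
    for i in idx:
        images[i] = add_static_trigger(images[i])
        labels[i] = target_label
    return images, labels
```

After retraining on the poisoned set, an explanation method can then be scored by how strongly its attribution maps concentrate on the known trigger region.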
Multimodal image registration is key to many image-guided clinical interventions. However, it is a challenging task because of the complicated and unknown relationships between different modalities. Currently, supervised deep learning is the state-of-the-art approach, in which registration is performed end-to-end in a single shot. It therefore requires a large amount of ground-truth data to improve the results of deep neural networks, and supervised methods may yield models biased toward the annotated structures. An alternative that addresses these challenges is unsupervised learning. In this study, we designed a novel deep unsupervised Convolutional Neural Network (CNN)-based model for affine co-registration of computed tomography/magnetic resonance (CT/MR) brain images. For this purpose, we created a dataset of 1100 CT/MR slice pairs from the brains of 110 neuropsychiatric patients with and without tumors. Next, 12 landmarks were selected and annotated on each slice by an experienced radiologist, enabling the computation of a series of evaluation metrics: target registration error (TRE), Dice similarity, Hausdorff distance, and the Jaccard coefficient. The proposed method registered the multimodal images with a TRE of 9.89, Dice similarity of 0.79, Hausdorff distance of 7.15, and Jaccard coefficient of 0.75, values that are acceptable for clinical applications. Moreover, the approach registered the images in 203 ms; the short registration time and high accuracy make it suitable for clinical use. The results illustrate that our proposed method achieves competitive performance against related approaches in both computation time and evaluation metrics.
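The evaluation metrics named above have standard definitions that can be sketched briefly. This is a minimal reference implementation under the usual conventions (binary overlap masks, landmark TRE as mean Euclidean distance); the function names are illustrative, not taken from the paper.

```python
import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum())

def jaccard_coefficient(mask_a, mask_b):
    """Jaccard = |A ∩ B| / |A ∪ B| for binary masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union

def target_registration_error(fixed_pts, warped_pts):
    """Mean Euclidean distance between corresponding landmarks after registration."""
    return np.linalg.norm(fixed_pts - warped_pts, axis=1).mean()
```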
by
Xin Tie;
Muheon Shin;
Ali Pirasteh;
Nevein Ibrahim;
Zachary Huemann;
Sharon M. Castellino;
Kara M. Kelly;
John Garrett;
Junjie Hu;
Steve Y. Cho;
Tyler J. Bradshaw
Purpose:
To determine if fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports.
Materials and Methods:
Twelve language models were trained on a corpus of PET reports using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. An extra input token encoded the reading physician's identity, allowing the models to learn physician-specific reporting styles. Our corpus comprised 37,370 retrospective PET reports collected at our institution between 2010 and 2022. To identify the best LLM, 30 evaluation metrics were benchmarked against quality scores from two nuclear medicine (NM) physicians, and the most aligned metrics were used to select the model for expert evaluation. In a subset of the data, model-generated impressions and original clinical impressions were assessed by three NM physicians across six quality dimensions (3-point scale) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. Bootstrap resampling was used for statistical analysis.
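The combination of a physician-identity token with teacher forcing can be sketched as follows: the identity token is prepended to the encoder input, and the reference impression is shifted right to form the decoder input. This is a schematic sketch at the token level; the special-token format and function name are assumptions, and the actual models were fine-tuned transformer LLMs rather than this toy pipeline.

```python
def build_training_pair(findings_tokens, impression_tokens, physician_id,
                        bos="<s>", eos="</s>"):
    """Prepend a physician-identity token to the findings (encoder input) and
    shift the reference impression right for teacher forcing."""
    encoder_input = [f"<physician_{physician_id}>"] + findings_tokens
    decoder_input = [bos] + impression_tokens   # what the decoder sees at each step
    labels = impression_tokens + [eos]          # what it must predict at each step
    return encoder_input, decoder_input, labels
```

At inference time, choosing which identity token to prepend steers the model toward that physician's reporting style.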
Results:
Of all evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the highest Spearman’s ρ correlations (ρ=0.568 and 0.563) with physician preferences. Based on these metrics, the fine-tuned PEGASUS model was selected as the top LLM. When physicians reviewed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08 out of 5. Physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P=0.41).
Conclusion:
Personalized impressions generated by PEGASUS were clinically useful, highlighting its potential to expedite PET reporting.
by
Andrea Sikora;
Tianyi Zhang;
David J Murphy;
Susan E. Smith;
Brian Murray;
Rishi Kamaleswaran;
Xianyan Chen;
Mitchell S. Buckley;
Sandra Rowe;
John W. Devlin
Fluid overload, while common in the ICU and associated with serious sequelae, is hard to predict and may be influenced by ICU medication use. Machine learning (ML) approaches may offer advantages over traditional regression techniques for predicting it. We compared the ability of traditional regression techniques and different ML-based modeling approaches to identify clinically meaningful predictors of fluid overload. This was a retrospective, observational cohort study of adult patients admitted to an ICU for ≥ 72 h between 10/1/2015 and 10/31/2020 with available fluid balance data. Models were created to predict fluid overload (a positive fluid balance ≥ 10% of admission body weight) in the 48–72 h after ICU admission. Potential patient and medication predictor variables (n = 28) were collected at either baseline or 24 h after ICU admission. The optimal traditional logistic regression model was created using backward selection. Supervised, classification-based ML models were trained and optimized, including a meta-modeling approach. Area under the receiver operating characteristic curve (AUROC), positive predictive value (PPV), and negative predictive value (NPV) were compared between the traditional and ML prediction models. A total of 49 of the 391 (12.5%) patients developed fluid overload. Among the ML models, XGBoost had the highest performance (AUROC 0.78, PPV 0.27, NPV 0.94) and performed similarly to the final traditional logistic regression model (AUROC 0.70, PPV 0.20, NPV 0.94). Feature importance analysis revealed that severity-of-illness scores and medication-related data were the most important predictors of fluid overload. In the context of our study, ML and traditional models performed similarly in predicting fluid overload in the ICU. Baseline severity of illness and ICU medication regimen complexity are important predictors of fluid overload.
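The outcome definition and the reported PPV/NPV comparison can be sketched concretely. This is a minimal sketch assuming the usual 1 L ≈ 1 kg equivalence for fluid balance; the function names are illustrative and not from the study's codebase.

```python
def fluid_overload_label(fluid_balance_l, admission_weight_kg, threshold=0.10):
    """Positive fluid balance >= 10% of admission body weight (1 L ~ 1 kg)."""
    return fluid_balance_l >= threshold * admission_weight_kg

def ppv_npv(y_true, y_pred):
    """PPV = TP/(TP+FP); NPV = TN/(TN+FN), from binary labels and predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tn / (tn + fn)
```

The high NPV (0.94) alongside the low PPV (0.27) reflects the class imbalance: only 12.5% of patients developed fluid overload, so negative predictions are correct far more often than positive ones.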