Open Access Publications from the University of California

Open Access Policy Deposits

This series is automatically populated with publications deposited by UC Irvine Donald Bren School of Information and Computer Sciences Department of Computer Science researchers in accordance with the University of California’s open access policies. For more information see Open Access Policy Deposits and the UC Publication Management System.
Evaluation of LLMs accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval.


Large language models (LLMs) are fundamentally transforming human-facing applications in the health and well-being domains: boosting patient engagement, accelerating clinical decision-making, and facilitating medical education. Although state-of-the-art LLMs have shown superior performance in several conversational applications, evaluations within nutrition and diet applications are still insufficient. In this paper, we propose to employ the Registered Dietitian (RD) exam to conduct a standard and comprehensive evaluation of state-of-the-art LLMs, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, assessing both accuracy and consistency in nutrition queries. Our evaluation includes 1050 RD exam questions encompassing several nutrition topics and proficiency levels. In addition, for the first time, we examine the impact of Zero-Shot (ZS), Chain of Thought (CoT), Chain of Thought with Self Consistency (CoT-SC), and Retrieval Augmented Prompting (RAP) on both accuracy and consistency of the responses. Our findings revealed that while these LLMs obtained acceptable overall performance, their results varied considerably with different prompts and question domains. GPT-4o with CoT-SC prompting outperformed the other approaches, whereas Gemini 1.5 Pro with ZS recorded the highest consistency. For GPT-4o and Claude 3.5, CoT improved the accuracy, and CoT-SC improved both accuracy and consistency. RAP was particularly effective for GPT-4o to answer Expert level questions. Consequently, choosing the appropriate LLM and prompting technique, tailored to the proficiency level and specific domain, can mitigate errors and potential risks in diet and nutrition chatbots.
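The self-consistency step in CoT-SC can be illustrated with a minimal sketch: sample several chain-of-thought answers to the same question and keep the majority answer. The function name, the sample data, and the use of the majority share as an agreement score are assumptions for illustration; the abstract does not describe the authors' implementation or their consistency metric.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent answer and the fraction of samples that chose it.

    Hypothetical helper: the majority share is used here only as a rough
    agreement indicator, not as the paper's consistency measure.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Made-up final answers from five independent chain-of-thought runs
# on a single multiple-choice question.
samples = ["B", "B", "C", "B", "A"]
answer, agreement = self_consistent_answer(samples)
print(answer, agreement)
```

In this toy run the majority answer is "B", chosen by three of the five samples.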

Benefit of Varying Navigation Strategies in Robot Teams


Inspired by recent human studies, this paper investigates the benefits of employing varying navigation strategies in robot teams. We explore how mixed navigation strategies impact task completion time, environment exploration, and overall system effectiveness in multi-robot systems. Experiments were conducted in a simulated rectangular environment using Clearpath PR2 robots and evaluated navigation strategies observed in humans: 1) Route (RT) knowledge, where agents follow a predefined path; 2) Survey (SW) knowledge, where agents take the shortest path while avoiding obstacles; 3) mixed strategies with varying proportions, such as 40% RT and 60% SW (0.4RT 0.6SW) and 60% RT and 40% SW (0.6RT 0.4SW); and 4) a strategy for individual team members in which agents start on a route and switch to survey 10% of the time (0.9RT 0.1SW), mirroring the behavior of human participants in a similar study. While a pure SW strategy is the most efficient in time taken and a pure RT strategy covers more of the environment, a mixture of strategies appears to be a beneficial trade-off between mission completion time and area coverage. These results highlight the advantages of population variability, suggesting potential benefits in both biological and robotic populations.

Princ-wiki-a Mathematica: Wikipedia Editing and Mathematics


This essay incorporates with permission material from our pseudonymous colleague XOR'easter, who also contributed many suggestions during the writing process. By the extent of XOR’easter’s contributions, they would normally be credited as an author. However it was not possible in time to find a way to strictly preserve anonymity and assign legal copyright. All four contributors disagree with this exclusion. I regret its necessity — Ed.

Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI


Generative Artificial Intelligence is set to revolutionize healthcare delivery by transforming traditional patient care into a more personalized, efficient, and proactive process. Chatbots, serving as interactive conversational models, will probably drive this patient-centered transformation in healthcare. Through the provision of various services, including diagnosis, personalized lifestyle recommendations, dynamic scheduling of follow-ups, and mental health support, the objective is to substantially augment patient health outcomes, all the while mitigating the workload burden on healthcare providers. The life-critical nature of healthcare applications necessitates establishing a unified and comprehensive set of evaluation metrics for conversational models. Existing evaluation metrics proposed for various generic large language models (LLMs) demonstrate a lack of comprehension regarding medical and health concepts and their significance in promoting patients’ well-being. Moreover, these metrics neglect pivotal user-centered aspects, including trust-building, ethics, personalization, empathy, user comprehension, and emotional support. The purpose of this paper is to explore state-of-the-art LLM-based evaluation metrics that are specifically applicable to the assessment of interactive conversational models in healthcare. Subsequently, we present a comprehensive set of evaluation metrics designed to thoroughly assess the performance of healthcare chatbots from an end-user perspective. These metrics encompass an evaluation of language processing abilities, impact on real-world clinical tasks, and effectiveness in user-interactive conversations. Finally, we engage in a discussion concerning the challenges associated with defining and implementing these metrics, with particular emphasis on confounding factors such as the target audience, evaluation methods, and prompt techniques involved in the evaluation process.

Enhancing Radiologist Efficiency with AI: A Multi-Reader Multi-Case Study on Aortic Dissection Detection and Prioritization.


BACKGROUND AND OBJECTIVES: Acute aortic dissection (AD) is a life-threatening condition in which early detection can significantly improve patient outcomes and survival. This study evaluates the clinical benefits of integrating a deep learning (DL)-based application for the automated detection and prioritization of AD on chest CT angiographies (CTAs), with a focus on the reduction in scan-to-assessment time (STAT) and interpretation time (IT). MATERIALS AND METHODS: This retrospective Multi-Reader Multi-Case (MRMC) study compared AD detection with and without artificial intelligence (AI) assistance. The ground truth was established by two U.S. board-certified radiologists, while three additional expert radiologists served as readers. Each reader assessed the same CTAs in two phases: assessment without AI assistance (pre-AI arm) and, after a 1-month washout period, assessment aided by device outputs (post-AI arm). STAT and IT metrics were compared between the two arms. RESULTS: This study included 285 CTAs (95 per reader, per arm) with a mean patient age of 58.5 years ±14.7 (SD); 52% were male and the prevalence of AD was 37%. AI assistance significantly reduced the STAT for detecting 33 true positive AD cases from 15.84 min (95% CI: 13.37-18.31 min) without AI to 5.07 min (95% CI: 4.23-5.91 min) with AI, representing a 68% reduction (p < 0.01). The IT was also significantly reduced, from 21.22 s (95% CI: 19.87-22.58 s) without AI to 14.17 s (95% CI: 13.39-14.95 s) with AI (p < 0.05). CONCLUSIONS: The integration of a DL-based algorithm for AD detection on chest CTAs significantly reduces both the STAT and IT. By prioritizing urgent cases, the AI-assisted approach outperforms the standard First-In, First-Out (FIFO) workflow.
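The reported 68% STAT reduction follows directly from the two mean times in the abstract. The short check below reproduces that arithmetic; the function name is an illustrative assumption, and the IT percentage it prints is derived here from the reported means rather than stated in the abstract.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (before - after) / before


# Mean values as reported in the abstract.
stat_reduction = percent_reduction(15.84, 5.07)   # scan-to-assessment time, minutes
it_reduction = percent_reduction(21.22, 14.17)    # interpretation time, seconds

print(f"STAT reduction: {stat_reduction:.0f}%")
print(f"IT reduction: {it_reduction:.0f}%")
```

The first figure matches the 68% quoted in the results; the second works out to roughly one third.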

First measurement of the total inelastic cross section of positively charged kaons on argon at energies between 5.0 and 7.5 GeV


ProtoDUNE Single-Phase (ProtoDUNE-SP) is a 770-ton liquid argon time projection chamber that operated in a hadron test beam at the CERN Neutrino Platform in 2018. We present a measurement of the total inelastic cross section of charged kaons on argon as a function of kaon energy using 6 and 7 GeV/c beam momentum settings. The flux-weighted average of the extracted inelastic cross section at each beam momentum setting was measured to be 380±26 mbarns for the 6 GeV/c setting and 379±35 mbarns for the 7 GeV/c setting.

Noncrossing Longest Paths and Cycles


Edge crossings in geometric graphs are sometimes undesirable as they could lead to unwanted situations such as collisions in motion planning and inconsistency in VLSI layout. Short geometric structures such as shortest perfect matchings, shortest spanning trees, shortest spanning paths, and shortest spanning cycles on a given point set are inherently noncrossing. However, the longest such structures need not be noncrossing. In fact, it is intuitive to expect many edge crossings in various geometric graphs that are longest. Recently, Álvarez-Rebollar, Cravioto-Lagos, Marín, Solé-Pi, and Urrutia (Graphs and Combinatorics, 2024) constructed a set of points for which the longest perfect matching is noncrossing. They raised several challenging questions in this direction. In particular, they asked whether the longest spanning path, on any finite set of points in the plane, must have a pair of crossing edges. They also conjectured that the longest spanning cycle must have a pair of crossing edges. In this paper, we give a negative answer to the question and also refute the conjecture. We present a framework for constructing arbitrarily large point sets for which the longest perfect matchings, the longest spanning paths, and the longest spanning cycles are noncrossing.
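To make the notion of a noncrossing spanning path concrete, here is a minimal sketch that tests whether a path through a sequence of points has any pair of properly crossing edges. It uses the standard orientation test and assumes points in general position (no three collinear); the function names are illustrative, and this is not the construction from the paper.

```python
def orient(p, q, r):
    """Twice the signed area of triangle pqr (positive if counterclockwise)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a, b, c, d):
    """True if segments ab and cd cross at a point interior to both.

    Assumes general position, so collinear and touching cases are ignored.
    """
    return (orient(a, b, c) * orient(a, b, d) < 0
            and orient(c, d, a) * orient(c, d, b) < 0)


def is_noncrossing_path(points):
    """Check every pair of non-adjacent edges of the path for a crossing."""
    edges = list(zip(points, points[1:]))
    for i in range(len(edges)):
        # Adjacent edges share an endpoint and cannot properly cross.
        for j in range(i + 2, len(edges)):
            if segments_cross(*edges[i], *edges[j]):
                return False
    return True


# Two paths on the corners of the unit square.
print(is_noncrossing_path([(0, 0), (1, 0), (1, 1), (0, 1)]))  # follows the boundary
print(is_noncrossing_path([(0, 0), (1, 1), (1, 0), (0, 1)]))  # uses both diagonals
```

The second path is the longer of the two, and it is the one with a crossing, which illustrates the intuition that long structures tend to cross.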

Drawing Planar Graphs and 1-Planar Graphs Using Cubic Bézier Curves with Bounded Curvature


We study algorithms for drawing planar graphs and 1-planar graphs using cubic Bézier curves with bounded curvature. We show that any n-vertex 1-planar graph has a 1-planar RAC drawing using a single cubic Bézier curve per edge, and this drawing can be computed in O(n) time given a combinatorial 1-planar drawing. We also show that any n-vertex planar graph G can be drawn in O(n) time with a single cubic Bézier curve per edge, in an O(n) × O(n) bounding box, such that the edges have Θ(1/degree(v)) angular resolution, for each v ∈ G, and O(√n) curvature.
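For readers unfamiliar with the quantity being bounded, the sketch below evaluates the curvature of a planar cubic Bézier curve at a parameter t using the standard formula κ(t) = |x'y'' − y'x''| / (x'² + y'²)^(3/2). The function name and control points are illustrative assumptions; the paper's drawing algorithm is not reproduced here.

```python
def bezier_curvature(p0, p1, p2, p3, t):
    """Unsigned curvature of the cubic Bezier curve with control points p0..p3 at t."""
    s = 1.0 - t
    # First derivative B'(t).
    dx = 3*s*s*(p1[0]-p0[0]) + 6*s*t*(p2[0]-p1[0]) + 3*t*t*(p3[0]-p2[0])
    dy = 3*s*s*(p1[1]-p0[1]) + 6*s*t*(p2[1]-p1[1]) + 3*t*t*(p3[1]-p2[1])
    # Second derivative B''(t).
    ddx = 6*s*(p2[0]-2*p1[0]+p0[0]) + 6*t*(p3[0]-2*p2[0]+p1[0])
    ddy = 6*s*(p2[1]-2*p1[1]+p0[1]) + 6*t*(p3[1]-2*p2[1]+p1[1])
    speed_sq = dx*dx + dy*dy
    if speed_sq == 0.0:
        raise ValueError("curvature is undefined where the derivative vanishes")
    return abs(dx*ddy - dy*ddx) / speed_sq**1.5


# Collinear, evenly spaced control points give a straight segment (zero curvature).
print(bezier_curvature((0, 0), (1, 0), (2, 0), (3, 0), 0.5))
# A curved example, evaluated at the start of the curve.
print(bezier_curvature((0, 0), (1, 0), (1, 1), (0, 1), 0.0))
```

Sampling this function along each edge of a drawing is one simple way to estimate its maximum curvature numerically.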

iMIRACLE: an Iterative Multi-View Graph Neural Network to Model Intercellular Gene Regulation from Spatial Transcriptomic Data.


Spatial transcriptomics has transformed genomic research by measuring spatially resolved gene expression, allowing us to investigate how cells adapt to their microenvironment by modulating their expressed genes. This essential process usually starts from cell-cell communication (CCC) via ligand-receptor (LR) interaction, leading to regulatory changes within the receiver cell. However, few methods have been developed to connect the two and provide biological insights into intercellular regulation. To fill this gap, we propose iMiracle, an iterative multi-view graph neural network that models each cell's intercellular regulation with three key features. Firstly, iMiracle integrates inter- and intra-cellular networks to jointly estimate cell-type- and micro-environment-driven gene expressions. Optionally, it allows prior knowledge of intra-cellular networks as pre-structured masks to maintain biological relevance. Secondly, iMiracle employs iterative learning to overcome the sparsity of spatial transcriptomic data and gradually fill in the missing edges in the CCC network. Thirdly, iMiracle infers a cell-specific ligand-gene regulatory score based on the contributions of different LR pairs to interpret inter-cellular regulation. We applied iMiracle to nine simulated and eight real datasets from three sequencing platforms and demonstrated that iMiracle consistently outperformed ten methods in gene expression imputation and four methods in regulatory score inference. Lastly, we developed iMiracle as open-source software and anticipate that it can be a powerful tool in decoding the complexities of inter-cellular transcriptional regulation.