Deep learning models have become the backbone of modern computer vision systems, achieving striking success in tasks ranging from image to video understanding. Yet while the benchmark performance of deep neural networks appears never to saturate as they are fed more data and compute, real-world applications of these models often fall short of expectations. This shortfall is largely due to biases in the data, which misrepresent the underlying distribution and lead to poor generalization and discriminatory outcomes. In this thesis, we investigate the problem of bias in vision and multimodal learning systems, proposing methods to identify, measure, and mitigate biases in both the data and the models.
The first part of the dissertation introduces a new form of bias, representation bias, which measures the extent to which ground-truth labels can be inferred from spurious features in the data. We apply this concept to study static bias in video action recognition, and propose a procedure, RESOUND, to guide the collection of datasets free of representation bias. We further develop a debiasing method, REPAIR, that mitigates the bias of existing datasets, and show that debiasing the training data improves the generalization of video models to new datasets.
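To make the notion concrete, the following is a minimal formalization in the spirit of RESOUND and REPAIR; the notation ($\mathcal{F}$, $\gamma^*$, $w$) is illustrative and may differ from the exact definitions used in the thesis. Representation bias can be measured by how far the best classifier restricted to a family $\mathcal{F}$ of spurious features exceeds chance on dataset $\mathcal{D}$, and a REPAIR-style resampling then learns per-example weights that minimize this quantity:

\[
B(\mathcal{F}, \mathcal{D}) = \log\!\big( M \, \gamma^*(\mathcal{F}, \mathcal{D}) \big),
\qquad
\gamma^*(\mathcal{F}, \mathcal{D}) = \max_{h \in \mathcal{F}} \; \Pr_{(x,y) \sim \mathcal{D}}\!\big[ h(x) = y \big],
\]
\[
\min_{w \in [0,1]^n} \; \max_{h \in \mathcal{F}} \;
\frac{\sum_{i=1}^{n} w_i \, \mathbf{1}\!\left[ h(x_i) = y_i \right]}{\sum_{i=1}^{n} w_i},
\]

where $M$ is the number of classes, so that $B = 0$ when spurious features predict labels only at chance ($\gamma^* = 1/M$) and grows as they become more predictive.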
The second part shifts the focus to model-centric approaches that reduce the vulnerability of models to data biases. We propose dynamic representation learning, a novel framework for quantifying and minimizing the static bias of video classification models, and use it to study the impact of bias removal on the transferability of video representations. We then extend debiasing to multimodal learning, proposing a sparse video-text transformer for efficient modeling of long clips, and a training curriculum that enables temporal learning beyond static visual features.
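One illustrative way to quantify a video classifier's reliance on static cues is sketched below; this is an assumption-laden sketch, not the thesis's exact protocol. It compares a PyTorch model's accuracy on full clips against its accuracy when temporal information is destroyed by repeating a single frame across the clip (the model and data loader are assumed to exist):

    # Minimal sketch (assumed PyTorch interfaces; illustrative only):
    # estimate static bias as the gap between full-clip accuracy and
    # accuracy on clips with all motion removed.
    import torch

    @torch.no_grad()
    def static_bias_gap(model, loader, device="cuda"):
        model.eval().to(device)
        full_correct = static_correct = total = 0
        for clips, labels in loader:          # clips: (B, C, T, H, W)
            clips, labels = clips.to(device), labels.to(device)
            # Full-clip predictions can use both motion and appearance.
            full_pred = model(clips).argmax(dim=1)
            # Repeat the middle frame over time: appearance only, no motion.
            mid = clips.shape[2] // 2
            static = clips[:, :, mid : mid + 1].expand_as(clips).contiguous()
            static_pred = model(static).argmax(dim=1)
            full_correct += (full_pred == labels).sum().item()
            static_correct += (static_pred == labels).sum().item()
            total += labels.numel()
        # A small gap means predictions rest mostly on static cues.
        return full_correct / total - static_correct / total

A model whose accuracy barely drops on the frozen clips is, in this sense, statically biased: its predictions do not depend on temporal dynamics.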
In the final part of the dissertation, we advocate a holistic view of bias mitigation that considers dataset and model biases jointly, and apply this strategy to improve the fairness of vision-language foundation models using generated counterfactuals. We demonstrate the benefit of this approach in unifying bias mitigation across diverse tasks and domains, and discuss the potential of holistic debiasing for future research.