Scholarly Works (1572 results)

Article
Peer Reviewed

A Distributional Account of Covariance Effects and Talker Adaptation in Infant and Adult Phonetic Category Recognition

Jones, Bevan

Proceedings of the Annual Meeting of the Cognitive Science Society, Volume 32 (2010)

Article
Peer Reviewed

Helper T cells for cytotoxic T lymphocytes need not be I region restricted

UC Berkeley Previously Published Works (1982)

We investigated the antigenic requirements for restimulation of H-2-restricted cytolytic T lymphocytes (CTL) in vitro to determine whether H-2 I region-restricted helper T cells are required in these responses. In one set of experiments, we studied the in vitro response of (responder x nonresponder)F(1) female T cells to the male antigen H-Y. We chose to examine this response because it has been suggested that the defect in nonresponder strains is a failure of helper T cells to recognize H-Y in association with nonresponder I region determinants. However, we find that nonresponder male stimulator cells are as effective as F(1) male stimulator cells at inducing H-Y-specific CTL responses. This finding calls into question reports that secondary CTL responses to H-Y are dependent upon the activation of H-Y-specific helper T cells restricted to responder type I region determinants. In a second set of experiments, we examined the requirements for restimulation of H-2-restricted T cells specific for minor-histocompatibility antigens from long-term mixed lymphocyte cultures. These cultures were established by repeatedly restimulating cultures of specific T cells with H-2-matched stimulator cells expressing foreign minor histocompatibility antigens. We found that H-2D-restricted T cells, including CTL, could be restimulated with cells that were matched with the responding cells at only the D region genes. This response did not appear to result from positive allogeneic effects or from antigen processing and "representation" by responder type APC that might contaminate the cultures. Thus, we find no evidence for a requirement for I region-restricted helper T cells in these CTL responses. However, helper T cells are required because we find that CTL lines derived by limit-dilution cloning from these long-term MLC are absolutely dependent upon exogenous helper factors for growth.
The simplest interpretation of these results is that the helper cells are restricted to H-2 antigens other than I region antigens or to antigens that code outside of the H-2 complex. Finally, we show that factor-dependent CTL lines must recognize their specific antigen to proliferate, even in the presence of exogenous factors. The requirement of activated CTL for antigen to proliferate provides an explanation for how specific CTL can be selectively enriched in MLC by specific antigen stimulation. Furthermore, it is at variance with reports that memory CTL or activated CTL require only interleukin 2 for restimulation.

Thesis
Peer Reviewed

Hardware Architectures for Lossless Compression

Sarangi, Satyabrata
Advisor(s): Baas, Bevan

UC Davis Electronic Theses and Dissertations (2022)

Demands for storing huge volumes of data and limited communication network bandwidth call for effective data compression with high performance and energy efficiency to reduce high storage and communication costs for a wide range of systems and applications. Data compression consists of two types: lossy and lossless. Canonical Huffman encoding and Gzip are two popular lossless compression techniques. Hardware implementation of such lossless compression techniques in many-core processors and other hardware platforms is crucial in achieving optimized performance and energy efficiency. This dissertation analyzes static and dynamic canonical Huffman codec algorithms and Gzip compression techniques, and presents high-throughput and energy-efficient hardware and software implementations with good compression ratios.

Canonical Huffman encoding is naturally sequential and consists of several tasks: finding a histogram of symbol frequencies, sorting of symbols, Huffman tree creation, building canonical code tables, and encoding symbols. This dissertation demonstrates energy-efficient canonical Huffman encoder architectures exploiting task-level parallelism for the above tasks and introduces a concurrent approach to execute sorting, Huffman tree creation, and code length computation tasks that yields better memory efficiency and performance than the conventional approach. The proposed architectures are implemented on a many-core array, Intel i7, Nvidia GT 750M, Intel FPGA, and 45-nm ASIC.
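
The task sequence named in this abstract (histogram, sorting, tree creation, canonical code table, encoding) can be illustrated with a minimal software sketch. This is a generic textbook formulation, not the dissertation's architecture; the function names `code_lengths`, `canonical_codes`, and `encode` are hypothetical, and codes are returned as bit strings purely for readability.

```python
import heapq
from collections import Counter


def code_lengths(data: bytes) -> dict:
    """Histogram the symbols, then derive Huffman code lengths with a min-heap."""
    freq = Counter(data)
    if len(freq) == 1:
        # A lone symbol still needs a 1-bit code.
        return {next(iter(freq)): 1}
    # Heap items: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(w, sym, [sym]) for sym, w in freq.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    while len(heap) > 1:
        w1, t1, s1 = heapq.heappop(heap)
        w2, t2, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, min(t1, t2), s1 + s2))
    return lengths


def canonical_codes(lengths: dict) -> dict:
    """Assign canonical codes: sort by (length, symbol), then count upward."""
    codes = {}
    code = 0
    prev_len = 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len  # append zeros when the length grows
        codes[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes


def encode(data: bytes):
    """Return the encoded bit string and the code table used."""
    codes = canonical_codes(code_lengths(data))
    return "".join(codes[b] for b in data), codes


if __name__ == "__main__":
    bits, table = encode(b"abracadabra")
    print(len(bits), {chr(s): c for s, c in table.items()})
```

Because canonical codes are fully determined by the code lengths, a decoder only needs the length of each symbol's code to rebuild the table.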

The many-core encoder implementations achieve a scaled throughput per chip area that is 89.2x and 4.7x greater on average and 44.7x and 8.2x greater in terms of scaled energy efficiency (compressed bits encoded per energy) than the Intel i7 and Nvidia GT 750M, respectively, executing the common Corpus benchmarks for data compression. Encoder implementations on the many-core processor array yield scaled throughput per chip area and scaled energy efficiency that is 58x and 4.8x greater on average than the state-of-the-art efficient canonical Huffman encoder implementation on Tesla V100 GPUs executing the enwik8 dataset.

Scaled synthesis results from a proposed pipelined canonical Huffman encoder 45-nm ASIC show 2.44x greater throughput, 3.5x lower total power dissipation, and 7.6x lower energy dissipation than Intel FPGA implementations.

Next, this dissertation presents bit-parallel static and dynamic canonical Huffman decoder implementations using an optimized lookup table approach on a fine-grain many-core array, Intel i7, Nvidia GT 750M, Intel FPGA, and 45-nm ASIC. The many-core implementations achieve a scaled throughput per chip area that is 891x and 7x greater on average and scaled energy efficiency (compressed bits decoded per energy) that is 149.5x and 3.9x greater on average than the i7 and GT 750M, respectively.

The 45-nm ASIC synthesis results show that the pipelined and memory-efficient static decoder yields a 5.1x throughput improvement and 13.4x energy efficiency improvement over the FPGA implementation.

Furthermore, this dissertation presents energy-efficient and high-throughput Gzip hardware architectures using a chained hash bank memory design of depth three for the LZ77 encoder and synthesis results of the Gzip compression engines implemented in a 45-nm ASIC. The proposed encoder architectures exploit both static and dynamic canonical Huffman encoders along with a pipelined LZ77 encoder. The pipelined Gzip engine using a static canonical Huffman encoder with a parallel window size (PWS) of 16 bytes per clock cycle achieves a maximum input throughput of 2.53 GB/s, while the dynamic canonical Huffman encoder-based Gzip compressor achieves a maximum input throughput of 0.52 GB/s. To the best of our knowledge, this Gzip compressor offers the highest reported compression ratio in the literature of 2.47 for the Calgary Corpus benchmark.

Finally, this dissertation presents DeepScaleTool, an open-source tool for the accurate estimation of deep-submicron technology scaling by modeling and curve fitting published data by a leading commercial fabrication company for silicon fabrication technology generations from 130 nm to 7 nm for the key parameters of area, delay, and energy.

Thesis
Peer Reviewed

Computing Numerical Functions on Many-Core Processor Arrays

Huo, Yuxuan
Advisor(s): Baas, Bevan

UC Davis Electronic Theses and Dissertations (2024)

Numerical algorithms are a fundamental part of a chip and play a crucial role in it. The efficient manipulation of numerical data is essential for achieving optimal performance and desired functionality of a chip. The algorithms are designed on the chip to solve complex mathematical problems in different fields. Therefore, an efficient and accurate numerical algorithm can improve the practicality of a chip. This paper presents some basic numerical algorithms that can be applied to the target chip, and the target platform is Asynchronous Array of Simple Processors 3 (AsAP3). The paper uses shift division as the basic dividing function throughout the algorithms to replace the traditional divisions. This paper implements trigonometric functions, the exponential function, the natural logarithm function, and the LRN function on the AsAP3 platform. This paper applies Taylor series, CORDIC, and binary search algorithms to the implemented functions. Furthermore, this paper records the numerical results of these functions generated by AsAP3 and compares them with the reference values calculated by the MATLAB program. It analyzes the difference, SNR value, and throughput of simulated results to examine the accuracy of the calculation. The paper also displays difference and ratio graphs to visually present the magnitude of the difference. The results and comparisons show that the numerical algorithms offer satisfactory performance on the target platform. The applications are programmed in C in Visual Studio and transferred to the AsAP3 platform. The comparison between the generated value and reference value is completed in MATLAB.
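
To illustrate how CORDIC evaluates trigonometric functions using only shifts and additions, here is a minimal fixed-point sketch in Python. It is the standard rotation-mode formulation, not the thesis's AsAP3 code; the 16-bit fraction, the 16 iterations, and the name `cordic_sin_cos` are assumptions chosen for the example.

```python
import math

ITERATIONS = 16          # assumed iteration count
FRAC_BITS = 16           # assumed fixed-point format: 16 fractional bits
ONE = 1 << FRAC_BITS

# Precomputed table of atan(2^-i) in fixed point.
ANGLES = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# The rotations scale the vector by a constant gain; start from its inverse.
_gain = 1.0
for _i in range(ITERATIONS):
    _gain *= math.sqrt(1.0 + 2.0 ** (-2 * _i))
K = round(ONE / _gain)


def cordic_sin_cos(theta: float):
    """Return (sin, cos) of theta, valid for roughly |theta| <= pi/2."""
    x, y, z = K, 0, round(theta * ONE)
    for i in range(ITERATIONS):
        # Rotate toward z == 0 using only shifts and adds.
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
    return y / ONE, x / ONE


if __name__ == "__main__":
    s, c = cordic_sin_cos(0.5)
    print(s, math.sin(0.5), c, math.cos(0.5))
```

The final division by `ONE` only converts the fixed-point result back to a float for printing; the iteration itself uses no multiplication or division.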

Thesis
Peer Reviewed

Dynamic Voltage and Frequency Scaling Controller and Circuits Using Multiple Back Bias Voltages

Cui, Jin
Advisor(s): Baas, Bevan

UC Davis Electronic Theses and Dissertations (2023)

Power and thermal limits have become increasingly significant for integrated circuits as the scale of integration keeps growing. Ultra-Thin Body and Buried Oxide (UTBB) Fully Depleted Silicon-on-Insulator (FD-SOI) is a technology aimed at improving the device performance and power efficiency at the same time. A thin buried oxide (BOX) layer is introduced to not only lower the leakage currents, but also enable a strong back biasing (BB) voltage that is adjustable through front-side contacts. As a result, the threshold voltage is tunable to achieve high performance across a wide range of supply voltages. A 28 nm UTBB FD-SOI Low Threshold Voltage (LVT) technology from STMicroelectronics provides transistors that operate normally across a wide supply voltage range.

It is a common practice that digital circuits are throttled according to their real-time workload to conserve power and reduce heat generation. This is achieved by introducing a dynamic voltage and frequency scaling (DVFS) circuit which optimizes the supply voltage and clock frequency automatically.

In this thesis, the 28 nm UTBB FD-SOI technology is characterized through transistor-level circuit simulations. A DVFS controller design that supports two supply voltages and two back-bias voltages targeting the aforementioned technology to optimize circuit performance and reduce power consumption is proposed. Power gates are used to switch between voltages and shut down unused components. The DVFS controller suggests clock frequencies and voltages dynamically based on workload to maximize power efficiency without significantly sacrificing performance. Additionally, the controller’s output is manually configurable to accommodate user control. In a simulation conducted on inverter chains, BB provides as much as 17% reduction in propagation delay versus no BB at 1.0 V nominal supply voltage, and a maximum 56% reduction at 0.5 V. The DVFS design contributes to an average of 20.5% and a maximum of 56.3% reduction in total energy consumed in the simulated applications versus no DVFS while maintaining 96% of the throughput.

Thesis
Peer Reviewed

H.264 Codec Implementation on a Many-Core Processor Array

Callahan, Aidan
Advisor(s): Baas, Bevan

UC Davis Electronic Theses and Dissertations (2022)

Due to the rise of higher resolution video over limited transmission bandwidths, video compression algorithms have revolutionized the way we view digital video today. H.264, also known as Advanced Video Coding (AVC), is a popular standard for the compression of video content. H.264/AVC offers excellent compression performance due to a collection of algorithmic improvements over its predecessors. The H.264/AVC standard algorithm requires a high level of computational complexity with the opportunity to compute many subtasks in parallel. Consequently, a fine-grained many-core platform is a promising solution for the H.264/AVC algorithm. In this work, a baseline H.264/AVC encoder and decoder (codec) is designed and simulated on the KiloCore II chip. The encoder processes 27,239 macroblocks-per-second at 449 mW without any algorithm-specific hardware. With the introduction of a motion estimation accelerator, the encoder is able to process 73,010 macroblocks-per-second at 635 mW. The decoder, on the other hand, processes 24,347 macroblocks-per-second at 482 mW. KiloCore II is a competitive platform for video compression, achieving a 1.8x - 49.1x and 1.4x - 8.1x higher throughput relative to compared codec designs.

Thesis
Peer Reviewed

Design of Display Stream Compression Video Codecs

Wu, Shifu
Advisor(s): Baas, Bevan

UC Davis Electronic Theses and Dissertations (2021)

Video displays with ultra-high-definition (UHD) resolutions such as 4K (3840×2160) and 8K (7680×4320) are now available. Video frame rates such as 120 frames per second (fps) and beyond are becoming more prevalent. Moreover, new display technologies have enabled wide color gamut (WCG) and high dynamic range (HDR). As a result, the required bandwidth to transmit uncompressed video data over display links has dramatically increased (e.g., 120 Gbps for 8K videos with 30-bit color at 120 fps); however, the physical layer bandwidth is not keeping pace with this demand. To address the disparity, a widely accepted low-cost solution is to compress the video streams prior to transmission and decompress upon being displayed. The Display Stream Compression (DSC) standard developed by Video Electronics Standards Association (VESA) enables low-cost and low-latency hardware implementations of visually lossless video codecs over display links. This dissertation analyzes the DSC algorithm, presents three hardware encoder architectures and the design of a fabricated and first published encoder chip, discusses four hardware decoder architectures and six decoder implementations, and describes the design of many-core software DSC decoders.
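
The 120 Gbps figure quoted in this abstract can be reproduced with simple arithmetic. The sketch below counts active pixels only (blanking intervals and link-encoding overhead are ignored, which is an assumption of this example), and the function name `uncompressed_gbps` is hypothetical.

```python
def uncompressed_gbps(width: int, height: int, bits_per_pixel: int, fps: int) -> float:
    """Raw video bandwidth in gigabits per second, active pixels only."""
    return width * height * bits_per_pixel * fps / 1e9


if __name__ == "__main__":
    # 8K video, 30-bit color, 120 frames per second.
    print(round(uncompressed_gbps(7680, 4320, 30, 120), 1))
```

The result is about 119.4 Gbps, consistent with the rounded value of 120 Gbps in the abstract.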

The DSC encoder hardware architectures for a slice encoder, a slice-interleaved encoder, and a time-interleaved encoder are presented. A DSC encoder chip based on the time-interleaved encoder architecture and supporting up to 4K video resolution is designed and fabricated in TSMC 28 nm CMOS technology. The chip is capable of processing two slices in parallel, resulting in a throughput of two pixels per cycle for 4:4:4 pixels, and four pixels per cycle for 4:2:2 and 4:2:0 pixels. The chip shares combinational computational resources across slices, requiring 627.7 K logic gates, yielding a 1.75 times logic area reduction. The time-interleaved encoding scheme lowers energy per pixel by 1.87–1.96 times compared to non-interleaved encoding at nominal voltage. At 1.15 V, the chip achieves up to 1448 megapixels per second (Mpixels/s) at 362 MHz, which is equivalent to 174.6 fps for 4K videos. At 0.8 V, it achieves 441 Mpixels/s for 4:2:2/4:2:0 pixels and 221 Mpixels/s for 4:4:4 pixels, while demonstrating a minimum energy of 84 pJ, 92 pJ, and 163 pJ per 4:2:0, 4:2:2, and 4:4:4 pixel, respectively. The chip achieves 2.7–33 times lower area and 2.0–45 times better throughput per area compared to prior video encoder chips.
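
The encoder throughput numbers in this paragraph are mutually consistent, which a short calculation shows. The sketch assumes the 1448 Mpixels/s peak corresponds to the four-pixels-per-cycle formats (4:2:2 and 4:2:0) at 362 MHz; the function names are hypothetical.

```python
def pixels_per_second(clock_hz: int, pixels_per_cycle: int) -> int:
    """Peak pixel throughput for a given clock rate and per-cycle width."""
    return clock_hz * pixels_per_cycle


def frames_per_second(pixel_rate: float, width: int, height: int) -> float:
    """Frame rate sustained at a given pixel throughput."""
    return pixel_rate / (width * height)


if __name__ == "__main__":
    rate = pixels_per_second(362_000_000, 4)          # 4:2:2 / 4:2:0 case
    print(rate / 1e6, "Mpixels/s")
    print(round(frames_per_second(rate, 3840, 2160), 1), "fps at 4K")
```

At 362 MHz and four pixels per cycle this gives 1448 Mpixels/s, or about 174.6 fps for 3840×2160 frames, matching the reported values.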

Next, this dissertation presents four hardware architectures for DSC decoders. A slice decoder design realizes the DSC algorithm and achieves a throughput of three pixels per cycle for 4:4:4 pixels, and six pixels per cycle for 4:2:2 and 4:2:0 pixel formats. A slice-interleaved decoder architecture is proposed to support decoding of multiple columns of slices per picture with minimum area overhead. A parallel slice decoder architecture utilizes multiple parallel slice decoders to linearly increase the throughput. In addition, a parallel-interleaved decoder architecture offers area and throughput trade-offs. To evaluate the proposed architectures, six decoders are implemented in 28 nm CMOS using a standard-cell-based design flow. The six implemented decoders support decoding of 1080p (1920 × 1080) videos, and require 169.6–627 K logic gates. At 1.0 V, the decoders operate at maximum frequencies of 495–610 MHz, achieve maximum throughput of 1490–13,040 Mpixels/s while dissipating 22.9–58.6 pJ per pixel. The slice decoder achieves five times higher throughput than prior work.

Furthermore, this dissertation presents the design of software DSC decoders on a fine-grained many-core processor array. The slice decoder exploits fine-grained task-level and component-level parallelism and can decode pictures configured into one column of slices; it is implemented with 88 processors and 2 memory modules. The parallel slice decoders facilitate higher performance by leveraging scalable slice-level parallelism; two designs that process two and four slices in parallel are implemented utilizing from 178 processors and 4 memory modules to 359 processors and 6 memory modules. At 1.75 GHz and 1.1 V, the proposed decoders decode 1080p videos in 4:2:0, 4:2:2, and 4:4:4 pixel formats—achieving up to 94.7 fps, 95.6 fps, and 47.9 fps. The minimum energy of 11.8 nJ, 13.3 nJ, and 23.4 nJ per 4:2:0, 4:2:2, and 4:4:4 pixel is achieved in the slice decoder at 0.76 V. The proposed designs achieve up to 159 times higher throughput and 769 times lower energy per pixel than a DSC decoder implemented on one core of an Intel i7-7700HQ processor.

Thesis
Peer Reviewed

An Energy-Efficient SqueezeNet Implementation on the KiloCore Platform

Dong, Ziyuan
Advisor(s): Baas, Bevan

UC Davis Electronic Theses and Dissertations (2022)

Many Convolutional Neural Networks (CNNs) have been developed for object detection, image classification, and facial recognition applications. Although many deep convolutional neural networks have focused on improving accuracy, few have focused on reducing the number of required hardware resources. While reducing hardware requirements is expected to reduce throughput performance, these simpler architectures are expected to provide advantages such as lower latency, lower power, and smaller memory requirements. In addition, simpler CNNs can be implemented on more devices, and are in general easier to train because they contain fewer parameters that need to be trained. This thesis proposes a KiloCore implementation of SqueezeNet, a lightweight CNN that offers low energy and high throughput, and contains 1,248,424 parameters inside 22 layers composed of 18 convolutional layers and 4 pooling layers.

This thesis presents an implementation of SqueezeNet running on a fine-grain many-core processor array called KiloCore. The metrics to be compared include energy per frame, power, throughput, throughput per area, energy-delay product (EDP), and memory. We compare with: SqueezeNet implementations running on an Intel Xeon E3-1275 v5 @ 3.6 GHz, an Intel i5-5250U @ 2.7 GHz, an Intel Knights Landing @ 1.7 GHz, a Qualcomm Snapdragon 810 @ 1.5 GHz, an NVIDIA Pascal @ 3.0 GHz, and an ARMv71 @ 0.9 GHz.

The KiloCore many-core implementation achieves a 1.0× – 17.0× lower energy per frame and 3.1× – 35.3× lower power dissipation. Regarding throughput performance, the KiloCore implementation is 4.8× higher than an ARMv71 processor. The EDP value for the KiloCore implementation is in the middle range among other hardware platform implementations, and the EDP is 95.2× lower compared to an ARMv71 processor. The SqueezeNet implementation on KiloCore has significantly lower memory requirements than other programmable processors.

Thesis
Peer Reviewed

Residual Neural Network on a Many-Core Platform

Wu, Haotian
Advisor(s): Baas, Bevan M

UC Davis Electronic Theses and Dissertations (2021)

Deep neural networks are used in many applications such as image classification, image recognition, and natural language processing. For real-time applications, lower latency (i.e., the time between when the input arrives and output is generated) is crucial. ResNet (residual neural network) is one of the most widely used deep neural networks in recent years. The most important reason for its prevalence is that the residual structure can gain accuracy from considerably increased depth. Learning residual functions with reference to the layer inputs, instead of learning unreferenced functions, makes this feature stand out. This thesis proposes a many-core implementation of ResNet-34 that is complete except for the softmax layer and offers low latency and high throughput performance, i.e., more images classified per second. Details of residual neural network architecture, algorithms of kernels, and mapping methodology are also presented.

The many-core implementation is compared against several general-purpose processor, GPU, and FPGA implementations of ResNet. The key metrics by which these platforms are compared are throughput per area, throughput per watt, and energy-delay product (EDP) gathered during the inference of 1 image. Since different fabrication technologies are used, throughput, area, and energy dissipation for all platforms are scaled to 32 nm. The many-core implementation offers 10.75×–67.89× improvement over general-purpose processors, and 6.6×–8.6× over GPUs in throughput per area. Meanwhile, the many-core implementation provides 7.06×–18.71× improvement in throughput per watt over general-purpose processors, 2.67×–3.65× over FPGAs, and 1.32×–4.98× over GPUs. Also, the proposed implementation has the lowest EDP among all platforms, which offers 2,329×–2,529× improvement over CPUs, 46×–579× over GPUs and FPGAs.

Article
Peer Reviewed

A Split-Decoding Message Passing Algorithm for Low Density Parity Check Decoders

UC Davis Previously Published Works (2010)

A Split decoding algorithm is proposed which divides each row of the parity check matrix into two or multiple nearly-independent simplified partitions. The proposed method significantly reduces the wire interconnect and decoder complexity and therefore results in fast, small, and energy-efficient circuits. Three full-parallel decoder chips for a (2,048, 1,723) LDPC code compliant with the 10GBASE-T standard using MinSum normalized, MinSum Split-2, and MinSum Split-4 methods are designed in 65 nm, seven metal layer CMOS. The Split-4 decoder occupies 6.1 mm², operates at 146 MHz, and delivers 19.9 Gbps throughput with 15 decoding iterations. At 0.79 V, it operates at 47 MHz, delivers 6.4 Gbps, and dissipates 226 mW. Compared to MinSum normalized, the Split-4 decoder chip is 3.3 times smaller, has a clock rate and throughput 2.5 times higher, is 2.5 times more energy efficient, and has an error performance degradation of 0.55 dB with 15 iterations.
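
For context on the baseline named in this abstract, the following is a minimal sketch of the check-node update in normalized MinSum decoding. It shows only the standard unsplit update, not the paper's Split method; the normalization factor `alpha=0.75` and the function name `check_node_update` are assumptions for illustration.

```python
def check_node_update(msgs, alpha=0.75):
    """Normalized MinSum: each output uses the sign product and minimum
    magnitude of all *other* incoming messages, scaled by alpha."""
    signs = [1 if m >= 0 else -1 for m in msgs]
    total_sign = 1
    for s in signs:
        total_sign *= s

    # Track the smallest and second-smallest magnitudes in one pass.
    min1 = min2 = float("inf")
    idx_min1 = -1
    for i, m in enumerate(msgs):
        mag = abs(m)
        if mag < min1:
            min1, min2, idx_min1 = mag, min1, i
        elif mag < min2:
            min2 = mag

    out = []
    for i in range(len(msgs)):
        # Excluding message i: remove its sign, and use min2 if i held min1.
        mag = min2 if i == idx_min1 else min1
        out.append(alpha * total_sign * signs[i] * mag)
    return out


if __name__ == "__main__":
    print(check_node_update([2.0, -1.0, 3.0, -4.0]))
```

Every output depends on all inputs of the row, which is why a full-parallel decoder needs dense wiring per check node; the abstract's partitioning of each row targets that interconnect cost.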