Scholarly Works (32 results)

Thesis
Peer Reviewed

Non-linguistic Vocalization Recognition Based on Convolutional, Long Short-Term Memory, Deep Neural Networks

Qiu, Liang
Advisor(s): He, Lei

UCLA Electronic Theses and Dissertations (2018)

Non-linguistic Vocalization Recognition refers to the detection and classification of non-speech vocal sounds such as laughter, sneezing, coughing, crying and screaming. It can be seen as a subtask of Acoustic Event Detection (AED). Previous research has made great progress in increasing the accuracy of AED. On the front end, many kinds of features have been explored, including Mel-Frequency Cepstral Coefficients (MFCCs), Gammatone Cepstral Coefficients (GTCCs) and other hand-crafted features. On the back end, models and methods such as Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), Bags-of-Audio-Words (BoAW), Support Vector Machines (SVMs) and various types of neural networks have been tried.

Recent research on Automatic Speech Recognition (ASR) and Acoustic Scene Classification (ASC) shows the advantage of using Convolutional, Long Short-Term Memory, Deep Neural Networks (CLDNNs) on audio processing tasks. In this thesis, I build a non-linguistic vocalization recognition system using CLDNNs. Log Mel-filterbank coefficients are adopted as input features, and data augmentation methods such as random shifting and noise mixture are discussed. The resulting system is evaluated on a custom dataset collected from several sources and tested for real-time application. The performance of CLDNNs for non-linguistic vocalization recognition is also compared with hybrid GMM-SVMs, Convolutional Neural Networks, Long Short-Term Memory networks and a fully connected Deep Neural Network trained on VGGish embeddings.

The results indicate that CLDNNs outperform the other models in classification precision and recall. Visualizations of CLDNNs are presented to help understand the framework. The model is shown to be accurate and fast enough for real-time applications.
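The two augmentation methods named in the abstract can be illustrated with a minimal sketch. The abstract does not say how they are implemented in the thesis, so the definitions here are assumptions: "random shifting" is taken to be a circular shift by a random offset, and "noise mixture" is taken to be additive noise scaled to a target signal-to-noise ratio. The function names `random_shift` and `mix_noise` are hypothetical.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(v * v for v in samples) / len(samples)


def random_shift(signal, max_shift, rng):
    """Circularly shift the signal by a random offset in [-max_shift, max_shift].

    Assumption: the shift wraps around; the thesis may pad or crop instead.
    """
    k = rng.randint(-max_shift, max_shift) % len(signal)
    return list(signal[-k:]) + list(signal[:-k]) if k else list(signal)


def mix_noise(signal, noise, snr_db):
    """Add noise scaled so that signal power / noise power matches snr_db."""
    gain = math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))
    mixed = [s + gain * n for s, n in zip(signal, noise)]
    return mixed, gain


rng = random.Random(0)
# Toy "vocalization": a short sine burst; toy "background": Gaussian noise.
signal = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]

shifted = random_shift(signal, max_shift=20, rng=rng)
mixed, gain = mix_noise(shifted, noise, snr_db=10.0)
achieved_snr_db = 10 * math.log10(power(shifted) / power([gain * n for n in noise]))
```

Scaling the noise rather than the signal keeps the vocalization at its original level, which is one common choice; scaling the signal instead would work equally well for this sketch.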

Thesis
Peer Reviewed

Applications of Formal And Semi-formal Verification on Software Testing, High-level Synthesis And Energy Internet

Gao, Min
Advisor(s): He, Lei

UCLA Electronic Theses and Dissertations (2018)

With the increasing power of computers and advances in constraint solving technologies, formal and semi-formal verification have received great attention in many applications. Formal verification is the act of proving or disproving the correctness of intended algorithms underlying a system with respect to a certain formal specification or property. These verification techniques have a wide range of applications in real life. This dissertation describes the applications of formal and semi-formal verification in four parts. The first part of the dissertation focuses on software testing. For software testing, symbolic/concolic testing reasons about data symbolically but enumerates program paths. The existing concolic technique enumerates paths sequentially, leading to poor branch coverage in limited time. We improve concolic testing with bounded model checking (BMC). During concolic testing, we identify program regions that can be encoded by BMC on the fly so that program paths within these regions are checked simultaneously. We have implemented the new algorithm on top of KLEE and called the new tool Llsplat. We have compared Llsplat with KLEE using 10 programs from the Windows NT Drivers Simplified and 88 programs from the GNU Coreutils benchmark sets. With 3600 seconds of testing time for each program, Llsplat provides on average 13% relative branch coverage improvement on all 10 programs in the Windows drivers set, and on average 16% relative branch coverage improvement on 80 out of 88 programs in the GNU Coreutils set.

The second part of the dissertation implements symbolic/concolic testing methods on an embedded platform. With the wider use of, and higher demand for, embedded systems, the reliability of embedded software becomes a critical issue. It is therefore important to design a test harness that can test embedded software comprehensively and systematically on the real platform or in a hardware-in-the-loop framework. We present our design prototype Codecomb. Codecomb implements symbolic/concolic execution that is able to achieve high branch coverage with its generated test cases. It mainly exploits a client/server architecture to isolate the testing tools from the program under test, so that complex computing jobs are performed on the server side. Experimental results show that Codecomb can detect program deficiencies automatically on the embedded platform, and precisely locate errors such as buffer overflows and memory leaks in a running program.

The third part of the dissertation applies formal and semi-formal methods to high-level synthesis (HLS) for VLSI. Verifying functional equivalence of high-level synthesis with formal methods ensures the correctness of the transformation flow. Current verification work widely uses static analysis such as model checking, while a pure dynamic execution flow is missing. In this part, we propose a functional verification flow for HLS utilizing symbolic execution on both C and Verilog directly. Specifically, at the behavioral C level we collect program traces via symbolic execution. At the Verilog level, we first generate a satisfiability modulo theories (SMT) representation of the circuit. Then we propose a light-weight pure symbolic execution framework to collect Verilog's on-the-fly time invariant version-based traces. To alleviate the scalability issue, we develop an operation abstraction method using SMT solvers to match potential C and Verilog traces. Extensive experiments on circuits from numerical computing and the CHStone benchmark verify the validity and effectiveness of the flow.

The last part of the dissertation investigates applications to the Energy Internet. An Energy Router (ER) based system is a crucial part of energy transmission and management in the Energy Internet for green cities. During its design process, a sound formal verification and a performance monitoring scheme are needed to check its reliability and meaningful quantitative properties. In this part, we provide formal verification solutions for ER based systems by proposing a continuous-time Markov chain model describing the architecture of an ER based system. To verify a real-world function of the ER based system, we choose electricity trading and propose a Markov decision process model running on an ER subsystem to describe the trading behaviour. To monitor the system performance, we project the energy scheduling process in the ER based system, and then implement this scheduling process on top of a cloud computing experiment tool. Finally, we perform extensive experimental evaluations to investigate the system reliability properties, quantitative properties, and scheduling behaviours. The experiments verify the effectiveness of the proposed models and the monitoring scheme.

Thesis
Peer Reviewed

Efficient yet Accurate Models for Photovoltaic Modules with Shading Effects

Tu, Tianheng
Advisor(s): He, Lei

UCLA Electronic Theses and Dissertations (2014)

Mismatches between solar cells, such as shading effects, significantly reduce the output power of a photovoltaic (PV) module. An effective method to maintain power output in the presence of these mismatches is to add bypass diodes. However, existing modeling approaches cannot efficiently model PV modules with bypass diodes in place. Starting with an equivalent circuit model of a PV module, we develop a Colony-Wise model to reduce the model complexity while maintaining the same accuracy compared to the well-accepted Ground Truth model. To further reduce the model complexity, a Two-Colony model with a constant computational complexity is developed. We analyze the accuracy of the Two-Colony model by comparing its Power-Voltage (P-V) curves to the results generated by the Ground Truth model. Experimental results show that the maximum output power values have an average error of 3.4% and average correlation of 0.962 with respect to the Ground Truth model's results.

Thesis
Peer Reviewed

Conventional and Machine Learning Assisted High Sigma Analysis

Wu, Wei
Advisor(s): He, Lei

UCLA Electronic Theses and Dissertations (2016)

Statistical circuit simulation exhibits increasing importance for circuit designs under process variations. In particular, high sigma analysis is needed to optimize highly-duplicated standard cells, where an extremely rare circuit failure event could lead to catastrophe of the entire chip. Conventional importance sampling (IS) approaches perform high sigma analysis efficiently at low dimensionality, but perform poorly either when there are a larger number of process variation variables, or when the failing samples are distributed in multiple regions.

In this dissertation, a series of high sigma analysis approaches is proposed. First, high dimensional importance sampling (HDIS) is presented to mitigate the dimensionality problem in traditional IS. A maximum entropy (MAXENT) based approach is proposed to model the distribution of circuit performance under process variation. MAXENT models the distribution overall, but does not specifically model the tail. To fix this issue, a piecewise distribution model (PDM) is proposed that treats the distribution as multiple segments and models each segment using MAXENT, thereby improving accuracy in the high sigma tail.

Moreover, two machine learning assisted approaches are proposed for high sigma analysis. The rare-event microscope (REscope) trains one or more classifiers to filter out the majority of the unlikely-to-fail samples and surgically looks into the likely-to-fail ones, whose distribution is analytically modeled as a generalized Pareto distribution to estimate failure probability. Finally, a hyperspherical clustering and sampling (HSCS) algorithm is proposed to cluster failing samples and to perform importance sampling around those clusters to cover all failure regions. Experimental results demonstrate that the proposed approaches are 2-3 orders of magnitude faster than Monte Carlo, and more accurate than both academic solutions such as IS and Markov Chain Monte Carlo, and industrial solutions such as mixture IS used by ProPlus Design Automation, Inc.

Thesis
Peer Reviewed

Towards Quantum Computing: Solving Satisfiability Problem by Quantum Annealing

Su, Juexiao
Advisor(s): He, Lei

UCLA Electronic Theses and Dissertations (2018)

To date, conventional computers have never been able to efficiently handle certain tasks, where the number of computation steps is likely to blow up as the problem size increases. As an emerging technology and new computing paradigm, quantum computing has a great potential to tackle those hard tasks efficiently. Among all the existing quantum computation models, quantum annealing has drawn significant attention in recent years due to the realization of the commercialized quantum annealer, sparking research interests in developing applications to solve problems that are intractable for classical computers.

However, designing and implementing algorithms that harness the computation power of a quantum annealer remains a challenging task. Generally, it requires mapping the given optimization problem into a quadratic unconstrained binary optimization (QUBO) problem and embedding the resulting QUBO onto the physical architecture of the quantum annealer. Additionally, practical quantum annealers are susceptible to errors, leading to a low probability of obtaining the correct solution.

In this study, we focus on solving the Boolean satisfiability (SAT) problem using a quantum annealer while addressing practical limitations. We have proposed a mapping technique that maps a SAT problem to QUBO, and we have further devised a tool flow that embeds the QUBO onto the architecture of a quantum annealing device. Additionally, we have optimized the proposed embedding flow to reduce run-time in addition to shortening the qubit chain length, leading to robust quantum annealing. To further improve the reliability of quantum annealing, we have also developed a post-processing embedding technique that enlarges the energy gap between the ground state and the first excited state. To demonstrate the effectiveness of the proposed methods, we have conducted experiments on real quantum annealing devices manufactured by D-Wave Systems, showing compelling results for using a quantum annealer to solve the SAT problem.
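To make the SAT-to-QUBO step concrete, here is a minimal sketch using a standard textbook penalty, not the mapping technique developed in the thesis (which the abstract does not describe). It is restricted to clauses with two literals, where the penalty for a violated clause is already quadratic: a clause (l1 OR l2) contributes (1 - l1)(1 - l2), which equals 1 only when both literals are false. Exhaustive search stands in for the annealer, and all function names are hypothetical.

```python
from itertools import product


def clause_to_qubo(clause):
    """Penalty (1 - l1)(1 - l2) of a 2-literal clause as offset/linear/quadratic terms.

    A literal is (variable_index, is_positive). The factor (1 - literal) is
    written as a + b*x: (1, -1) for a positive literal, (0, 1) for a negated one.
    """
    (i, pos_i), (j, pos_j) = clause
    a1, b1 = (1, -1) if pos_i else (0, 1)
    a2, b2 = (1, -1) if pos_j else (0, 1)
    offset = a1 * a2
    linear = {i: a2 * b1}
    linear[j] = linear.get(j, 0) + a1 * b2
    quadratic = {}
    if i == j:
        linear[i] += b1 * b2  # x * x = x for binary variables
    else:
        quadratic[(min(i, j), max(i, j))] = b1 * b2
    return offset, linear, quadratic


def build_qubo(clauses):
    """Sum the per-clause penalties into one QUBO."""
    offset, linear, quadratic = 0, {}, {}
    for clause in clauses:
        c_off, c_lin, c_quad = clause_to_qubo(clause)
        offset += c_off
        for key, coeff in c_lin.items():
            linear[key] = linear.get(key, 0) + coeff
        for key, coeff in c_quad.items():
            quadratic[key] = quadratic.get(key, 0) + coeff
    return offset, linear, quadratic


def energy(qubo, x):
    """Evaluate the QUBO objective for a 0/1 assignment x."""
    offset, linear, quadratic = qubo
    total = offset
    total += sum(coeff * x[i] for i, coeff in linear.items())
    total += sum(coeff * x[i] * x[j] for (i, j), coeff in quadratic.items())
    return total


def violated(clauses, x):
    """Count clauses with no true literal under assignment x."""
    return sum(1 for clause in clauses
               if not any(x[i] == int(pos) for i, pos in clause))


# (x0 OR x1) AND (NOT x0 OR x2) AND (NOT x1 OR NOT x2)
clauses = [((0, True), (1, True)), ((0, False), (2, True)), ((1, False), (2, False))]
qubo = build_qubo(clauses)
assignments = list(product((0, 1), repeat=3))
best = min(assignments, key=lambda x: energy(qubo, x))
```

With this penalty the QUBO energy of any assignment equals its number of violated clauses, so a ground state of energy 0 is a satisfying assignment. Clauses with three or more literals need auxiliary variables to stay quadratic, which this sketch does not attempt.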

Thesis
Peer Reviewed

Logic Synthesis for FPGA Reliability

Feng, Zhe
Advisor(s): He, Lei

UCLA Electronic Theses and Dissertations (2013)

Logic synthesis is one of the key stages in the computer-aided design (CAD) flow for a field programmable gate array (FPGA) based design. It usually consists of a series of optimization iterations to improve the quality of results (QoR) of the design. Besides the traditional optimization objectives (e.g., performance, area, power), the reliability is becoming a main concern as modern FPGAs have advanced to 20nm technology, due to reduction in core voltage, decrease in transistor geometry, and increase in switching speed. However, existing techniques for enhancing the reliability of FPGA based designs fall behind industrial needs in terms of cost (e.g., area and power overhead), CAD flow, runtime, and the FPGA architecture.

To address these problems, this dissertation proposes several novel logic synthesis algorithms. The first algorithm seeks a formal method to improve the reliability of FPGA based designs while incurring minimal area and power overhead. The algorithm formulates the problem of FPGA reliability under random faults as a stochastic satisfiability (SSAT) based Boolean matching, and employs robust templates to rewrite the look-up table (LUT) based netlist to maximize the stochastic yield rate. To avoid breaking the current CAD flow, a logic synthesis algorithm is presented that performs a SAT-based in-place reconfiguration in the LUT to mask soft errors, without changing the functionality or topology of the LUT based netlist. In addition, the dissertation proposes three fast in-place logic synthesis algorithms targeting the modern FPGA architecture including both LUTs and interconnects, which perform simulation guided netlist analyses and utilize don't cares in the netlist to enhance the reliability of the design. The effectiveness of the proposed algorithms is verified by experimental results.

Thesis
Peer Reviewed

A Fast Method for SRAM Failure Estimation

Gao, Min
Advisor(s): He, Lei

UCLA Electronic Theses and Dissertations (2012)

The SRAM cell is an important memory component that is widely used in integrated circuit design. Its performance is crucial to the entire circuit. However, inevitable process variations have introduced significant changes in the performance of fabricated SRAM cells and led to severe circuit failure. Consequently, the failure probability of an SRAM cell must be kept extremely small. These extremely small probability events are considered to be "rare events". The most straightforward method of estimating failure probability is using the classical Monte Carlo method. However, this method is extremely impractical in the case of rare events because of its drastically long run time. Therefore, a method to efficiently estimate the failure probability of rare events is strongly desired.

In this thesis, a novel and fast failure analysis for SRAM cells is proposed to efficiently and accurately estimate the failure probability of a rare event. The proposed approach is based on Probability Collective based Importance Sampling. This approach increases the convergence rate by finding the closest approximation of the optimal distribution used in Importance Sampling. In order to find the closest approximation of the optimal sampling distribution, the proposed method minimizes the Figure of Merit of the estimated probability in each iteration. Experimental results show that the proposed method can have an average of 5x speed up compared to Probability Collective based Importance Sampling. Moreover, multiple trials between these two methods show that the proposed method offers a faster convergence rate and greater stability.
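The contrast between plain Monte Carlo and importance sampling for rare events can be shown on a toy problem. The sketch below is not the method proposed in the thesis and does not implement Probability Collective based Importance Sampling or the Figure of Merit minimization; it is plain mean-shift importance sampling on an assumed one-dimensional standard normal "process variable", where "failure" means exceeding a threshold of 4 sigma. The function names are hypothetical.

```python
import math
import random


def exact_tail(threshold):
    """P(X > threshold) for X ~ N(0, 1), used only as a reference value."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


def importance_sampling_tail(threshold, n_samples, seed=0):
    """Estimate P(X > threshold) by sampling from N(threshold, 1).

    Each failing sample is weighted by p(x) / q(x), where p is the N(0, 1)
    density and q is the shifted proposal density. For these two Gaussians
    the ratio simplifies to exp(-threshold * x + threshold**2 / 2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


estimate = importance_sampling_tail(4.0, 20000)
reference = exact_tail(4.0)
relative_error = abs(estimate - reference) / reference
```

With the proposal centered on the failure boundary, roughly half of the samples land in the failure region, whereas plain Monte Carlo with the same 20000 samples would on average see fewer than one failure at this threshold. Choosing a good proposal distribution in many dimensions is the hard part that methods like the one in the thesis address.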

Thesis
Peer Reviewed

Software and Hardware Co-optimization for Deep Learning Algorithms on FPGA

Wu, Chen
Advisor(s): He, Lei

UCLA Electronic Theses and Dissertations (2022)

Over recent years, deep learning paradigms such as convolutional neural networks (CNNs) have shown great success in various families of tasks, including object detection and autonomous driving. To extend such success to non-Euclidean data, graph convolutional networks (GCNs) have been introduced, and have quickly attracted industrial and academic attention as a popular solution to real-world problems. However, both CNNs and GCNs often have huge computation and memory complexity, which calls for specific hardware architectures to accelerate these algorithms. In this dissertation, we propose several architectures to accelerate CNNs and GCNs based on FPGA platforms.

We start from the domain-specific FPGA-overlay processor (OPU) on commonly used CNNs, such as VGG, Inception, ResNet, and YoloV2. The data is first quantized to 8-bit fixed-point with little accuracy loss to reduce computation complexity and memory requirement. A fully-pipelined dataflow architecture is proposed to accelerate the typical layers (i.e., convolutional, pooling, residual, inception, and activation layers) in CNNs. Experimental results show that OPU is 9.6× faster than the Jetson TX2 GPU on a cascade of three CNNs, which are used for a curbside parking system.

However, 8-bit fixed-point data representation always needs re-training to maintain accuracy for deep CNNs. To overcome this limitation, we propose a low precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication by one 4-bit multiply-adder (MAC) and one 3-bit adder. Therefore, we can implement four 8-bit LPFP multiplications using one DSP48E1 of the Xilinx Kintex-7 family or one DSP48E2 of the Xilinx Ultrascale/Ultrascale Plus family, whereas one DSP can only implement two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that on average, we improve throughput by 1.5× over existing FPGA accelerators. Particularly for VGG16 and Yolo, compared with seven FPGA accelerators, we improve average throughput by 3.5× and 27.5× and average throughput per DSP by 4.1× and 5×, respectively.

CNNs quantized with mixed precision, on the other hand, benefit from low precision while maintaining accuracy. To better leverage the advantages of mixed precision, we propose a Mixed Precision FPGA-based Overlay Processor (MP-OPU) for both conventional and lightweight CNNs. The micro-architecture of MP-OPU considers sharing of the computation core with mixed precision weights and activations to improve computation efficiency. In addition, run-time scheduling of external memory access and data arrangement are optimized to further leverage the advantages of mixed precision data representation. Our experimental results show that MP-OPU reaches 4.92 TOPS peak throughput when implemented on a Xilinx VC709 FPGA (with all DSPs configured to support 2-bit multipliers). Moreover, compared with existing FPGA accelerators/processors, MP-OPU achieves on average 12.9× latency reduction and 2.2× better throughput per DSP for conventional CNNs, and 7.6× latency reduction and 2.9× better throughput per DSP for lightweight CNNs.

Graph convolutional networks (GCNs) have been introduced to effectively process non-Euclidean graph data. However, GCNs incur a large amount of irregularity in computation and memory access, which prevents efficient use of previous CNN accelerators/processors. We therefore propose a lightweight FPGA-based accelerator, named LW-GCN, to tackle irregularity in computation and memory access in GCN inference. We first decompose the main GCN operations into Sparse Matrix-Matrix Multiplication (SpMM) and Matrix-Matrix Multiplication (MM). Thereafter, we propose a novel compression format to balance workload across PEs and prevent data hazards. In addition, we quantize the data into 16-bit fixed-point, apply workload tiling, and map both SpMM and MM onto a uniform architecture on resource limited devices. Evaluations on GCN and GraphSAGE are performed on a Xilinx Kintex-7 FPGA with three popular datasets. Compared with existing CPU, GPU and state-of-the-art FPGA-based accelerators, LW-GCN reduces latency by up to 60×, 12× and 1.7× and increases power efficiency by up to 912×, 511× and 3.87×, respectively. Moreover, compared with Nvidia's latest edge GPU, the Jetson Xavier NX, LW-GCN achieves speedup and energy savings of 32× and 84×, respectively.

Lastly, we extend our GCN inference accelerator to a GCN training accelerator, called SkeletonGCN. To better fit the properties of GCN training, we add more software-hardware co-optimizations. First, we simplify the non-linear operations in GCN training to better fit the FPGA computation, and identify reusable intermediate results to eliminate redundant computation. Second, we optimize the previous compression format to further reduce memory bandwidth while allowing efficient decompression on hardware. Finally, we propose a unified architecture to support SpMM, MM and MM with transpose, all on the same group of PEs to increase DSP utilization on the FPGA. Evaluations are performed on a Xilinx Alveo U200 board. Compared with an existing FPGA-based accelerator on the same network architecture, SkeletonGCN can achieve up to 11.3× speedup while maintaining the same training accuracy with 16-bit fixed-point data representation. In addition, SkeletonGCN is 178× and 13.1× faster than state-of-the-art CPU and GPU implementations on popular datasets, respectively.

To summarize, we have been working on FPGA-based acceleration for deep learning algorithms of CNNs and GCNs in both the inference and training processes. All the accelerators/processors were hand-coded and have been fully verified. In addition, the related tool chains for generating golden results and running instructions for the accelerators/processors have also been finished.
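The 8-bit fixed-point quantization mentioned for OPU can be illustrated with a minimal sketch. The abstract does not give the quantization scheme, so the details here are assumptions: a signed 8-bit code, a fixed number of fractional bits chosen by the caller, round-to-nearest, and saturation at the range limits. The function names and the example weights are hypothetical.

```python
import math


def quantize_fixed_point(value, frac_bits, total_bits=8):
    """Map a real value to a signed fixed-point integer code.

    The code represents value * 2**frac_bits, rounded to the nearest integer
    and saturated to the signed range of total_bits.
    """
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    code = math.floor(value * scale + 0.5)  # round half up
    return max(q_min, min(q_max, code))


def dequantize_fixed_point(code, frac_bits):
    """Recover the real value represented by a fixed-point code."""
    return code / (1 << frac_bits)


# Example weights; the last one is outside the representable range and saturates.
weights = [0.5, -0.731, 0.0123, 1.9, -3.0]
frac_bits = 6  # range is about [-2.0, 1.984] with a step of 1/64
codes = [quantize_fixed_point(w, frac_bits) for w in weights]
restored = [dequantize_fixed_point(c, frac_bits) for c in codes]
```

More fractional bits give a finer step but a narrower range, so the choice of `frac_bits` trades rounding error against saturation. Values inside the range are reproduced to within half a step, while out-of-range values are clipped.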

Thesis
Peer Reviewed

A Moment Matching Based Fitting Algorithm for High Sigma Distribution Modeling

Krishnan, Rahul
Advisor(s): He, Lei

UCLA Electronic Theses and Dissertations (2015)

The impact of process variations continues to grow as transistor feature size shrinks. Such variations in transistor parameters lead to variations and unpredictability in circuit output, and may ultimately cause circuits to violate specifications, leading to circuit failure. In fact, timely failures in critical circuits may lead to catastrophic failures in the entire chip. As such, statistical modeling of circuit behavior is becoming increasingly important. However, existing statistical circuit simulation approaches fail to accurately and efficiently analyze the high sigma behavior of probabilistic circuit output. To this end, we propose PDM (Piecewise Distribution Model), a piecewise distribution fitting approach via moment matching using maximum entropy to model the high sigma behavior of analog/mixed-signal (AMS) circuit probability distributions. PDM is independent of the number of input dimensions and matches region-specific probabilistic moments, which allows for significantly greater accuracy compared to other moment matching approaches. PDM also utilizes Spearman's rank correlation coefficient to select the optimal approximation for the tail of the distribution. Experiments on a known mathematical distribution and various circuits obtain accurate results up to 4.8 sigma with more than 2 orders of magnitude speedup relative to Monte Carlo.

Thesis
Peer Reviewed

Stochastic Yield Analysis of Rare Failure Events in High-Dimensional Variation Space

Shi, Xiao
Advisor(s): He, Lei

UCLA Electronic Theses and Dissertations (2020)

As the semiconductor industry keeps shrinking the feature size to the nanometer scale, circuit reliability has become an area of growing concern due to the uncertainty introduced by process variations. For highly-replicated standard cells, the failure event for each individual component must be extremely rare in order to maintain a sufficiently high yield rate. Existing yield analysis approaches work well at low dimensionality, but are less effective either when there are a large number of circuit parameters, or when the failure samples are distributed in multiple regions. In this thesis, four novel high sigma analysis approaches are proposed.

First, we propose an adaptive importance sampling (AIS) algorithm. AIS adjusts the sampling region over several iterations, while existing methods pre-decide a static sampling distribution. At each iteration, AIS generates samples from the current proposed distribution. Next, AIS carefully assigns a weight to each sample based on its tilted occurrence probability between failure region and current failure region distribution. Then we design two adaptive frameworks based on Resampling and population Metropolis-Hastings (MH) to iteratively search for failure regions.

Second, we develop an Adaptive Clustering and Sampling (ACS) method to estimate the failure rate of high-dimensional and multi-failure-region circuit cases. The basic idea of the algorithm is to cluster failure samples and build a global sampling distribution at each iteration. Specifically, in the clustering step, we propose a multi-cone clustering method, which partitions the parametric space and clusters failure samples. Then a global sampling distribution is constructed from a set of weighted Gaussian distributions. Next, we calculate an importance weight for each sample based on the discrepancy between the sampling distribution and the target distribution. The failure probability is updated at the end of each iteration. This clustering and sampling procedure proceeds iteratively until all the failure regions are covered.

Moreover, two meta-model based approaches are proposed for high sigma analysis. The Low-Rank Tensor Approximation (LRTA) formulates the meta-model in tensor space by representing a multi-way tensor as a finite sum of rank-one tensors. The polynomial degree of our LRTA model grows linearly with circuit dimension, which makes it especially promising for high-dimensional circuit problems. Then we solve our LRTA model efficiently with a robust greedy algorithm, and calibrate it iteratively with an adaptive sampling method. The meta-model based importance sampling (MIS) method utilizes a Gaussian Process meta-model to construct a quasi-optimal importance sampling distribution, and performs Markov Chain Monte Carlo (MCMC) simulation to generate new samples from the proposed distribution. By updating our global Importance Sampling estimator in an iterated framework, MIS leads to better efficiency and higher accuracy than traditional importance sampling methods. Experimental results validate that the proposed approaches are 3 orders of magnitude faster than Monte Carlo, and more accurate than both academic solutions such as importance sampling and classification based methods, and industrial solutions such as mixture IS used by Intel.
