# UCLA Electronic Theses and Dissertations

## Title

Baseband and LO Techniques with Integrated CMOS for Wideband Millimeter-Wave Sensing

**Permalink** https://escholarship.org/uc/item/8jv0v4nm

**Author** Zhang, Yan

**Publication Date** 2020

Peer reviewed|Thesis/dissertation

# UNIVERSITY OF CALIFORNIA

Los Angeles

Baseband and LO Techniques with Integrated CMOS for Wideband Millimeter-Wave Sensing

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Electrical Engineering

by

Yan Zhang

© Copyright by

Yan Zhang

2020

## ABSTRACT OF THE DISSERTATION

Baseband and LO Techniques with Integrated CMOS for Wideband Millimeter-Wave Sensing

by

Yan Zhang

Doctor of Philosophy in Electrical Engineering

University of California, Los Angeles, 2020

Professor Asad M. Madni, Co-Chair

Professor Mau-Chung Frank Chang, Co-Chair

Millimeter-wave circuits and systems are fundamental building blocks in scientific instruments for space and Earth exploration. One important class of such instruments is radio frequency (RF) spectroscopy with frontends operating at sub-millimeter wavelengths. Modern missions prefer cheaper, smaller, and more efficient components for multi-pixel and multi-sensor integration. This dissertation examines the building blocks of RF spectroscopy and presents efficient baseband and LO designs in integrated CMOS technology for instrument miniaturization.

This work started with the development of three generations of RF spectrometer SoCs, where CMOS devices take the more traditional role in baseband processing. Emphasis is given to the design, optimization, and implementation of high-speed and high-channel-count real-input FFT cores with specialized pre- and post-processing requirements. Optimization techniques at the algorithmic (zero-padded complex FFT), architectural (parallel-pipeline partition and extended Radix-$2^{K}$ factorization), and algebraic (linear approximation, constant multiplier, and tapered bit depth) levels allow the designs to fit in compact floorplans with limited routing resources at up to 12GHz equivalent speed. The FFT cores are among the largest and most efficient designs for high-speed applications, whereas the SoCs represent the first dedicated spectroscopic processors with the highest level of integration.

The second part of this dissertation, breaking away from convention, explores the use of CMOS digital inverter rings for ultra-wide millimeter-wave frequency synthesis as a compact and scalable alternative to high-order LC networks. Based on ring oscillator scaling properties and the optimal conditions for superharmonic injection locking, a new methodology is proposed to co-design robust multi-ratio VCO-ILFD pairs. Placed in a cascaded PLL, the fabricated prototype covers 23-39GHz with phase noise below -96dBc/Hz at 1MHz offset. Further improvement can be achieved with better device and passive modeling. An additional reconfigurable frequency multiplier is also proposed to take advantage of the quadrature output for tri-band (28/39/60-GHz) LO generation.

The dissertation of Yan Zhang is approved.

Kuo-Nan Liou

Yuanxun Ethan Wang

Chih-Kong Ken Yang

Asad M. Madni, Committee Co-Chair

Mau-Chung Frank Chang, Committee Co-Chair

University of California, Los Angeles

2020

To my family

# **Table of Contents**

- ABSTRACT OF THE DISSERTATION
- LIST OF FIGURES
- LIST OF TABLES
- ACKNOWLEDGEMENTS
- VITA
- SELECT PUBLICATIONS
- CHAPTER 1 Introduction
  - 1.1 Motivation
  - 1.2 Dissertation Organization
- CHAPTER 2 Spectroscopy and Spectrometer
  - 2.1 Overview
  - 2.2 The Radiometer Receiver
  - 2.3 The LO Chain
  - 2.4 Spectrometer Considerations
    - 2.4.1 Quantization Depth
    - 2.4.2 Bandwidth and Resolution
    - 2.4.3 Averaging Duration and Efficiency
    - 2.4.4 Robustness and Reconfigurability
- CHAPTER 3 Integrated CMOS Spectrometers for Wideband Remote Sensing
  - 3.1 System Overview
  - 3.2 The Mixed-Signal Interface and Clock Management
  - 3.3 Area- and Energy-Efficient Digital Design
    - 3.3.1 FFT Hardware and Architectural Optimization based on Radix-2<sup>K</sup> Factorization
    - 3.3.2 Accumulation-Aware Real-Data FFT Algorithm
    - 3.3.3 High Speed Accumulator with 48-bit Output
    - 3.3.4 Efficient Windowing with On-the-Fly Coefficient Generation
  - 3.4 Laboratory Testing and Field Application
- CHAPTER 4 Low-Noise Inverter-Ring-based Wideband Millimeter-Wave Synthesis
  - 4.1 Introduction
  - 4.2 PLL-based Frequency Synthesizers
  - 4.3 Architectural Considerations for Low Noise and Ultra-Wide Frequency Synthesis
  - 4.4 Wideband Ring-Based VCO-ILFD Co-Design with Multi-Ratio Pairing
    - 4.4.1 Injection-Locking and the Superharmonic ILFDs
    - 4.4.2 Optimal Ring-Based VCO-ILFD Co-Design
    - 4.4.3 Multi-Ratio Pairing with Oscillator Multiplexing
  - 4.5 Prototype Implementation
    - 4.5.1 Frequency Planning and Ring Oscillator Design
    - 4.5.2 Noise, Spur, and Power Considerations for the Cascaded PLL
    - 4.5.3 Dual-Band Frequency Buffer for Tri-Band Extension
  - 4.6 Measurement and Comparison
- CHAPTER 5 Conclusion and Future Directions
- REFERENCES

## LIST OF FIGURES

- Fig. 1.1. The LO chain in the PISSARRO instrument [6]
- Fig. 1.2. Noninvasive glucose monitoring using millimeter-wave transmission [11]
- Fig. 1.3. Globally allocated and targeted bands for 5G communication as of mid-2019 [12]
- Fig. 2.1. Spectral lines on Comet 67P taken by the MIRO radiometer/spectrometer
- Fig. 2.2. Conceptual diagram of spectroscopic sensing on a THz telescope
- Fig. 2.3. Astronomical requirement on spectral resolution and bandwidth [17]
- Fig. 2.4. A commercial FFT spectrometer from Omnisys Instruments
- Fig. 2.5. Transition between spectroscopy and radiometry sharing the same frontend receiver
- Fig. 2.6. Illustration of measurement resolution with input brightness $T_{sys}$ and output voltage $v$
- Fig. 2.7. Illustration of impact of phase noise on spectral resolution
- Fig. 2.8. ADC quantization noise, SQNR, and SFDR
- Fig. 2.9. SNR degradation vs. quantization depth in a spectrometer system
- Fig. 2.10. Detection of added brightness with two-level digitization and failure to do so with an incorrect common-mode voltage or equivalently the comparator threshold
- Fig. 2.11. Spectral continuum around 340GHz reported in [41]
- Fig. 3.1. System level block diagram of proposed spectrometer SoCs
- Fig. 3.2. The simulation view of the system level diagram and outputs at labelled locations
- Fig. 3.3. Single-channel ADC (S-VI) output DEMUX2 and its timing diagram
- Fig. 3.4. Interleaved ADC (S-VII) output DEMUX2 and its timing diagram
- Fig. 3.5. A simplified block diagram of the clock management unit (CMU)
- Fig. 3.6. (a) Divide-by-2 (b) Phase ambiguity (c) DFF-based arbitration (d) Corrected output
- Fig. 3.7. Coarse- (top) and fine-delay unit cell schematics and their resolution simulations
- Fig. 3.8. Two ways to decompose an 8-point FFT using Radix-4 and Radix-2 butterflies
- Fig. 3.9. Diagram of (a) memory-based time-multiplexed architecture; (b) the SDF architecture; and general tradeoffs between time-multiplexed and parallel implementations
- Fig. 3.10. Example Radix-2 to Radix-2<sup>2</sup> transformation in an 8-point DIF FFT
- Fig. 3.11. Radix-$2^{K}$ (K = 1, 2, 3, 4) factorization with regrouped twiddle factors
- Fig. 3.12. Multiplier controls in the Radix-2<sup>4</sup> SDF coincide better with an incrementing counter
- Fig. 3.13. The FFT architecture in this work highlighting the parallel SDF stages
- Fig. 3.14. The SFG and pipeline partition of the 32-point Radix-2<sup>2</sup> FFT
- Fig. 3.15. The SFG of the 16-point DIF RFFT in [49]
- Fig. 3.16. Unit accumulators and the critical path in the SRAM-based design
- Fig. 3.17. Frequency responses of different windowing techniques around center bin
- Fig. 3.18. Illustration of the symmetry in the trigonometric and sinc functions and the 6-stage CORDIC divider (bottom)
- Fig. 3.19. Frequency responses with different coefficient digitization schemes
- Fig. 3.20. Laboratory measurement setup for the S-VI
- Fig. 3.21. Measured and expected spectral feature of $H_2O$ at 556.935 GHz
- Fig. 3.22. (a) Diagram of measurement setup for the S-VII with the 180GHz receiver; (b) and (c) spectral features for $CH_3CN$ and $H_2O$ respectively with the frequency dithering technique
- Fig. 3.23. The S-VII-based spectroscopy instrument onboard the RECTANGLE flight
- Fig. 4.1. Frequency planning with a 24-40GHz source
- Fig. 4.2. Simplified PLL and its phase and noise transfer model
- Fig. 4.3. Example type-I and type-II PLLs and their VCO noise suppression characteristic
- Fig. 4.4. (a) Switched-capacitor and (b) LC-VCO performance swept against frequency
- Fig. 4.5. Cascaded LO generation with a static frequency multiplier
- Fig. 4.6. Cascaded LO generation with an ILFM with an example floorplan in [77]
- Fig. 4.7. The cascaded PLL architecture and its phase noise improvement
- Fig. 4.8. Proposed Cascaded PLL with mmW ring oscillator and programmable dividers
- Fig. 4.9. Logic implementation of basic programmable digital dividers
- Fig. 4.10. Model for analyzing injection-locked ring oscillator with the effect of injection in red
- Fig. 4.11. Phasor diagrams for a free-running (left) and an injection-locked ring oscillator
- Fig. 4.12. Behavior model for a superharmonic ILFD
- Fig. 4.13. Example LC-VCO- and CML-ring-based ILFD-2 with embedded mixer
- Fig. 4.14. Desired frequency relationship for best noise suppression in ILFD
- Fig. 4.15. Ring oscillator scaling properties
- Fig. 4.16. ILFD-2 optimization with exposed feedback and simulated locking range increase
- Fig. 4.17. (a) Latch-up in a 4-stage ring (b) multiplexing between ILFD-2 and ILFD-3
- Fig. 4.18. Final functional design of (a) ILFD-4 and (b) ILFD-5
- Fig. 4.19. Simulated input-output relationship under process variation for the ILFD-4 with main VCO (left) and the input-referred locking range for the ILFD-5
- Fig. 4.20. Top-level block diagram of the proposed cascaded PLL
- Fig. 4.21. Main ring VCO design and its simulated tuning with negative $K_{VCO}$
- Fig. 4.22. PLL<sub>LB</sub> implementation details
- Fig. 4.23. Select PLL<sub>HB</sub> implementation details
- Fig. 4.24. Gilbert-cell-based buffer/doubler design and its dual-band load
- Fig. 4.25. Prototype chip micrographs and power consumption breakdown
- Fig. 4.26. Spectrum and phase noise measurement of PLL<sub>LB</sub>
- Fig. 4.27. High- and low-band phase noise measurement of PLL<sub>HB</sub>
- Fig. 4.28. PLL<sub>HB</sub> mid-band output phase noise measurement at 31.46GHz

## LIST OF TABLES

- Table 1. HIFI Heterodyne Spectrometer for Astrophysics
- Table 2. MIRO Heterodyne Spectrometer for Planetary Science
- Table 3. MLS Heterodyne Spectrometer for Earth Science
- Table 4. Receiver Technologies for THz Remote Sensing
- Table 5. System Specifications for Proposed Spectrometer SoCs
- Table 6. CSD Representation of Recurring Values for Select Twiddle Factors
- Table 7. FFT Specifications for the Proposed Spectrometer SoCs
- Table 8. Comparison with Prior Arts
- Table 9. Wideband mmW PLL Comparison

### ACKNOWLEDGEMENTS

First and foremost, I would like to express my utmost gratitude to my advisor, Prof. Mau-Chung Frank Chang, for his continuous support of my study and research at UCLA, as well as various related excursions. His patience and trust guided me through a few initial setbacks in this journey and have kept me going ever since. His openness and vision allowed me to follow my passion, exploring diverse topics within the realm of circuits and embedded systems, and eventually converge under one common theme. I feel immensely privileged to have been able to focus on innovating and executing without worrying about the complex logistics associated with integrated circuit fabrication, all thanks to Prof. Chang's entrepreneurship and resourcefulness.

My sincere gratitude also goes to Prof. Asad Madni, who was so kind as to serve as my advisor when Prof. Chang was away on leave of absence, for which I am eternally indebted. I am also indebted to the rest of my Ph.D. committee: Prof. Chih-Kong Ken Yang, Prof. Yuanxun Ethan Wang, and Prof. Kuo-Nan Liou, whose combined expertise, from digital to RF design and from signal-processing techniques to the final application in astronomical and atmospheric study, has set the bar for this dissertation.

My special thanks go to Dr. Adrian Tang and Dr. Yanghyo Rod Kim, with whom I worked day and night on the development of spectrometer SoCs and from whom I learned so much. In particular, Dr. Tang introduced me to the fascinating world of space instrumentation along with many practical considerations in millimeter-wave sensing.

I thank and cherish the guidance, help, and companionship of my friends and colleagues in the electrical engineering department, including but not limited to Dr. Wei-Han Cho, Dr. Wenlong Jiang, Dr. Long Kong, Dr. Jinxi Guo, Dr. Jieqiong Du, Dr. Mahmoud Elhebeary, Bu Shi, Yu Zhao, Weiyu Len, Rulin Huang, Weikang Qiao, Chia-Jen Liang, and Jia Zhou. I would also like to thank our wonderful administrative analyst, Janet Lin, for her support with the day-to-day logistics of our research.

Lastly, I would like to express my deepest appreciation to my parents, Chunhua Zhang and Huoying Shi, for their love, support, and any unspoken sacrifice made for the betterment of my education and wellbeing throughout all these years. I know deep down that they are my anchor. The same appreciation also goes to my fiancée, Zhengwei Zhou, who has stood by my side since day one of our graduate school. Her love and patience calm me as we ride together through the ups and downs of school and life.

### VITA

| 2008-2012 | B.S. (Electrical Engineering), Arizona State University, Tempe, AZ       |
|-----------|--------------------------------------------------------------------------|
| 2012-2014 | M.S. (Electrical Engineering), University of California, Los Angeles, CA |
| 2014-2016 | Analog Design Engineer, Teradyne Inc., Agoura Hills, CA                  |

### SELECT PUBLICATIONS

**Y. Zhang,** Y. Zhao, R. Huang, C.-J. Liang, C.-W. Chiang, Y.-C. Kuan, and M.-C. F. Chang, "A 23.6-38.3GHz Low-Noise PLL with Digital Ring Oscillator and Multi-Ratio Injection-Locked Dividers for Millimeter-Wave Sensing," *2020 IEEE Radio-Frequency Integrated Circuits Symposium (RFIC)*, Los Angeles, CA, USA, 2020, pp. 1-4.

Y. Kim, **Y. Zhang**, T. J. Reck, D. J. Nemchick, G. Chattopadhyay, B. Drouin, M.-C. F. Chang, and A. Tang, "A 183-GHz InP/CMOS-Hybrid Heterodyne-Spectrometer for Spaceborne Atmospheric Remote Sensing," in *IEEE Transactions on Terahertz Science and Technology*, vol. 9, no. 3, pp. 313-334, May 2019.

**Y. Zhang,** Y. Kim, A. Tang, J. H. Kawamura, T. J. Reck and M. Frank Chang, "Integrated Wide-Band CMOS Spectrometer Systems for Spaceborne Telescopic Sensing," in *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 66, no. 5, pp. 1863-1873, May 2019.

**Y. Zhang,** Y. Kim, A. Tang, J. Kawamura, T. Reck and M.-C. F. Chang, "A 2.6GS/s Spectrometer System in 65nm CMOS for Spaceborne Telescopic Sensing," *2018 IEEE International Symposium on Circuits and Systems (ISCAS),* Florence, 2018, pp. 1-4.

L. Du, **Y. Zhang,** F. Hsiao, A. Tang, Y. Zhao, Y. Li, Z.-Z. Chen, L. Huang, and M.-C. F. Chang, "A 2.3mW 11cm-range bootstrapped and correlated-double-sampling (BCDS) 3D touch sensor for mobile devices," *2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers*, San Francisco, CA, 2015, pp. 1-3.

### CHAPTER 1

### Introduction

#### 1.1 Motivation

In 2015, our group debuted the first partially integrated CMOS back-end processor for RF spectroscopy [1] (hereafter referred to as "spectrometer") that incorporated a 7-bit analog-to-digital converter (ADC) with a Nyquist bandwidth of 1.1GHz and a 512-point FFT processor. Both were IP blocks originally designed for a 60GHz (802.15.3c) transceiver [2] in 65nm CMOS technology. Despite being an expedited effort, it was sufficient to convince the science community of the stability and sensitivity attainable using the fully custom design approach for spectroscopic measurements. The significant reduction in instrument size, weight, and power consumption (SWaP), combined with hardware reconfigurability and ease of use, led to tremendous savings in mission launch and operation, which further validated custom design as a worthwhile investment. The then-self-initiated attempt has now grown into officially funded projects supporting potentially life-changing space explorations and Earth science studies.

In addition to spectrometers, the miniaturization effort has extended to components of other sensing instruments such as radiometers and radars. One common design need is custom digital blocks, such as the FFT processors in spectrometers and the arbitrary waveform generators (AWGs) for direct digital synthesis or carrier modulation in radars [3]. As the enabler of large-scale integration, CMOS technology is naturally suited to digital implementations, which were indeed the priority targets of our effort. However, they were not easy targets, and they posed two key challenges. First, to not only satisfy mission requirements but also outperform the off-the-shelf alternatives, the specifications of our designs were ambitious. For example, while the 3GPP-LTE standard required a 20MHz FFT with up to 2048 points [4], and the DVB-T standard an 8192-point FFT at 8MHz [5], our design target was 16384 points at close to 400MHz. The ambitious goals accentuated the gravity of the second challenge: limited design resources, as we were a small research team developing SoCs for an important yet niche application space.
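To put the three FFT specifications quoted above on a common scale, the following is a minimal sketch that estimates the streaming butterfly rate of each. It assumes a plain radix-2 decomposition, continuous back-to-back frames, and that the quoted 20MHz, 8MHz, and 400MHz figures can be treated as sample rates. All three are simplifying assumptions, and the function name `fft_workload` is illustrative only. It does not model the real-input or Radix-2<sup>K</sup> optimizations developed later in this work.

```python
def fft_workload(points, sample_rate_hz):
    """Rough streaming workload of a radix-2 FFT.

    Assumes `points` is a power of two and that frames are processed
    back to back at `sample_rate_hz` samples per second (both assumptions).
    """
    stages = points.bit_length() - 1            # log2(points) for a power of two
    butterflies_per_frame = (points // 2) * stages
    # One frame is consumed every points / sample_rate_hz seconds.
    butterflies_per_second = butterflies_per_frame * sample_rate_hz // points
    bin_width_hz = sample_rate_hz / points      # frequency span of one output bin
    return stages, butterflies_per_second, bin_width_hz


# (FFT size, assumed sample rate in Hz) taken from the figures quoted in the text.
CASES = {
    "3GPP-LTE [4]": (2048, 20_000_000),
    "DVB-T [5]": (8192, 8_000_000),
    "This work": (16384, 400_000_000),
}

for name, (points, rate) in CASES.items():
    stages, per_second, bin_hz = fft_workload(points, rate)
    print(f"{name:>12}: {stages} stages, "
          f"{per_second / 1e9:.3f} G butterflies/s, "
          f"{bin_hz / 1e3:.2f} kHz per bin")
```

Under these assumptions the 16384-point target requires roughly 25 times the butterfly throughput of the LTE case, which is one way to read the word "ambitious" above.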



Fig 1.1. The LO chain in the PISSARRO instrument [6].

For RF instruments with frontends or carriers operating at millimeter-wave (mmW) frequencies (30 to 300GHz) and beyond, local oscillator (LO) generation was the second recurring design scenario where a CMOS implementation would dramatically improve the system efficiency [6]. In a typical frequency multiplication chain as indicated in Fig. 1.1, the first stage can be replaced with a CMOS-based mmW phase-locked loop (PLL). Our group has demonstrated several LC-VCO-based PLLs with fundamental output up to 180GHz [7-9], but they were still lacking in frequency coverage, as LC-based approaches are inherently narrow-band. On the other hand, frequencies of interest for radiometers and spectrometers vary widely depending on the observation target. If the PLL range remained small, it would be impossible to find the common denominator during frequency planning, and each new mission would therefore require a new design. It was only logical to develop an efficient and ultra-wide-range mmW frequency synthesizer as the versatile first stage in the LO chain. Such a wideband synthesizer would also benefit other applications, such as biomedical sensing, with one example shown in Fig. 1.2, or more pervasively, mmW communication, as cellular communication enters its fifth generation and supports mmW frequencies for the first time [10]. As shown in Fig. 1.3, even though bands around 28, 39, and 67GHz are allocated for mmW 5G by the FCC in the United States, the international assignments are much more diverse. A compact and scalable continuous ultra-wide frequency synthesis technique would prove invaluable in the global adoption of mmW 5G.
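The frequency-planning argument above can be sketched numerically. The example below assumes an idealized LO chain built only from cascaded doublers and triplers, so that the overall multiplication factor has the form $2^a 3^b$, and asks which factors place a target frequency inside the tuning range of the first-stage source. The 24-40GHz range follows the caption of Fig. 4.1, and the 183GHz and 556.935GHz targets appear elsewhere in this dissertation. The 28-30GHz narrow-band source and the function name `multiplier_options` are hypothetical.

```python
def multiplier_options(target_ghz, lo_ghz, hi_ghz, max_factor=64):
    """List (N, source frequency) pairs with N = 2**a * 3**b such that
    target_ghz / N falls inside the source tuning range [lo_ghz, hi_ghz].

    The doubler/tripler-only chain is an illustrative assumption.
    """
    options = []
    for n in range(1, max_factor + 1):
        m = n
        while m % 2 == 0:      # strip factors of two (doublers)
            m //= 2
        while m % 3 == 0:      # strip factors of three (triplers)
            m //= 3
        if m != 1:             # N has another prime factor: not realizable here
            continue
        f_src = target_ghz / n
        if lo_ghz <= f_src <= hi_ghz:
            options.append((n, f_src))
    return options


for target in (183.0, 556.935):
    wide = multiplier_options(target, 24.0, 40.0)     # wideband source
    narrow = multiplier_options(target, 28.0, 30.0)   # hypothetical narrow source
    print(f"{target} GHz: wideband {wide}, narrow-band {narrow}")
```

With the wide range, both targets are reachable through at least one multiplication factor, whereas the hypothetical narrow-band source reaches neither, which illustrates why a narrow PLL range tends to force a new design for each mission.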



Fig 1.2. Noninvasive glucose monitoring using millimeter-wave transmission [11]

[Figure: chart of 5G spectrum allocations by country or region, with columns for <1GHz, 3GHz, 4GHz, 5GHz, 24-28GHz, and higher mmW bands (37GHz and above).]

Fig 1.3. Globally allocated and targeted bands for 5G communication as of mid-2019 [12].

#### 1.2 Dissertation Organization

The remaining chapters are organized as follows. Chapter 2 introduces the scientific principles behind RF spectroscopy and derives the specification for each component. Special attention is given to assessing whether a CMOS-based implementation of each block is viable for an optimal system-on-chip (SoC) design. Chapter 3 gives a detailed treatment of the design, implementation, and measurement of three generations of spectrometers, with a focus on the digital side of the mixed-signal SoCs. Chapter 4 transitions to mmW LO generation by first examining two challenges to ultra-wide mmW frequency synthesis, one widely recognized and the other largely neglected. The proposed inverter-based mmW VCO-and-multi-ratio-divider subsystem, with a simple co-design methodology, is then introduced, followed by the design, analysis, and measurement of the proposed cascaded PLL architecture and its prototypes. Chapter 5 concludes the dissertation and outlines future directions.

## **CHAPTER 2**

## **Spectroscopy and Spectrometer**

### 2.1 Overview

Sub-mmW or terahertz (THz) spectroscopy is an essential passive remote sensing technique for scientific study. Not only does it help with astrophysics missions with large land-based or space-borne radio telescopes such as ALMA [13] and Herschel [14], it also provides invaluable insights for Earth and planetary explorations on compact platforms such as probes and CubeSats. For example, the MIRO radiometer/spectrometer [15] onboard the Rosetta spacecraft measured seven gas species (CO, CH<sub>3</sub>OH, NH<sub>3</sub>, H<sub>2</sub>O, as well as three oxygen-related isotopes of water, H<sub>2</sub><sup>16</sup>O, H<sub>2</sub><sup>17</sup>O, and H<sub>2</sub><sup>18</sup>O) near 190 GHz and 562 GHz to study the evolution of outgassing water and other molecules on Comet 67P/Churyumov-Gerasimenko. The regular to heavy water ratio (D/H isotope ratio) collected in this way has raised new questions in our understanding of how the solar system first formed [16]. A map of the spectral diagrams superimposed on the picture of the planetary body is shown in Fig. 2.1.



Fig 2.1. Spectral lines on Comet 67P taken by the MIRO radiometer/spectrometer.



Fig 2.2. Conceptual diagram of spectroscopic sensing on a THz telescope.

As depicted in the conceptual diagram in Fig. 2.2, spectroscopic sensing starts at the THz front end with a large-aperture antenna. The presumably white noise in the background gets shaped by the absorption or emission from the resonating molecules in the observation path, producing characteristic spectral lines. As molecules only exhibit discernible features in the gas phase at low pressure, typical targets of the investigations are distant interstellar clouds as well as the atmospheres of Earth and other planetary bodies whose surfaces emit gases and other volatiles. The spectra are captured by the antenna and then coherently downconverted by a mixer and a local oscillator (LO) to a frequency convenient for information extraction and display. The back-end processor that follows and resolves the frequency features is defined as the spectrometer. In our case, frequency resolution is performed digitally through analog-to-digital conversion and Fourier transformation.
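The digital back end described above can be illustrated with a minimal sketch, assuming white Gaussian background noise, a single weak tone standing in for a spectral line, and ideal unquantized samples. The helper names (`fft`, `accumulate_spectrum`), the 64-point frame size, the frame count, and the line amplitude are illustrative assumptions and are not parameters of the SoCs in this work. A textbook recursive radix-2 FFT is used purely for brevity.

```python
import cmath
import math
import random


def fft(x):
    """Textbook recursive radix-2 decimation-in-time FFT (len(x) a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor times odd part
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def accumulate_spectrum(samples, n_fft):
    """Average the power spectrum over consecutive n_fft-sample frames.

    Only bins 0..n_fft/2 are kept because the input is real.
    """
    n_frames = len(samples) // n_fft
    acc = [0.0] * (n_fft // 2 + 1)
    for f in range(n_frames):
        spec = fft(samples[f * n_fft:(f + 1) * n_fft])
        for k in range(len(acc)):
            acc[k] += abs(spec[k]) ** 2
    return [a / n_frames for a in acc]


random.seed(0)                      # fixed seed so the run is repeatable
N_FFT, N_FRAMES = 64, 256           # assumed frame size and averaging depth
LINE_BIN, LINE_AMP = 20, 0.3        # assumed line position and amplitude

samples = [
    random.gauss(0.0, 1.0)          # unit-variance background noise
    + LINE_AMP * math.cos(2 * math.pi * LINE_BIN * n / N_FFT)
    for n in range(N_FFT * N_FRAMES)
]
spectrum = accumulate_spectrum(samples, N_FFT)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print("strongest averaged bin:", peak_bin)
```

In a single frame the line can be masked by the bin-to-bin fluctuation of the noise. Averaging the power spectrum over many frames reduces that fluctuation so that the line bin stands out from the noise floor.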

The combined construct of the THz receiver and the spectrometer bears a strong topological resemblance to a heterodyne RF receiver for wireless communication. We make the distinctions in later sections, but it is important to note that while digital signal processing (DSP) is ubiquitous in the world of cellular communication, it is relatively new in spectroscopy. Major missions to date have relied on analog spectrometers such as the filter-bank spectrometer (FBS), the chirp transform spectrometer (CTS), the analog autocorrelator spectrometer (AACS), and the acousto-optical spectrometer (AOS) [17]. FBS, CTS, and AACS all have roots in the Fourier transform, and their implementations involve analog generation of variable-frequency tones (for filtering and chirping) and delays that are sensitive to changes in operating conditions such as temperature and supply. AOS works by mapping the RF spectra to the angular dispersion of diffracted light modulated by the input-induced ultrasound waves [18] and thus requires a bulky combination of light source and lenses. Special care is also needed to address the mechanical and temperature instability of AOS as well as the nonlinearity present in the mapping process. Ironically, all the analog spectrometers would still eventually need a digital computer to process and display the spectral information. Fig. 2.3 gives a rough idea of the bandwidth and resolution requirements for different observation targets and those achievable using analog spectrometers. To give a few quantitative examples and to set up the ensuing discussions, detailed performance metrics for the Herschel-HIFI, the ROSETTA-MIRO, and the microwave limb sounder (MLS) spectroscopic instruments are listed in Tables 1-3.



Fig 2.3. Astronomical requirement on spectral resolution and bandwidth [17].

| Receiver Tech.     | Receiver Freq. (GHz) | Receiver Noise Temp. (K) |
|--------------------|----------------------|--------------------------|
| SIS                | 480-1250             | 40-1000                  |
| HEB                | 1410-1910            | 40-1000                  |
| Spectrometer Tech. | IF Bandwidth (MHz)   | Resolution (MHz)         |
| AACS/AOS           | 4000                 | 0.125-1                  |

Table 1. HIFI Heterodyne Spectrometer for Astrophysics

Table 2. MIRO Heterodyne Spectrometer for Planetary Science

| Receiver Tech.     | Receiver Freq. (GHz) | Receiver Noise Temp. (K) |
|--------------------|----------------------|--------------------------|
| Schottky Diode     | 190                  | 800                      |
| Schottky Diode     | 562                  | 3600                     |
| Spectrometer Tech. | IF Bandwidth (MHz)   | Resolution (MHz)         |
| CTS                | 180                  | 0.044                    |

Table 3. MLS Heterodyne Spectrometer for Earth Science

| Receiver Tech.     | Receiver Freq. (GHz) | Receiver Noise Temp. (K) |
|--------------------|----------------------|--------------------------|
| InP/Schottky       | 118                  | 1200                     |
| Schottky Diode     | 190-2500             | 1000-13000               |
| Spectrometer Tech. | IF Bandwidth (MHz)   | Resolution (MHz)         |
| FBS                | 1300                 | 6-96                     |



Fig 2.4. A commercial FFT spectrometer from Omnisys Instruments.

The slow adoption of digital techniques for spectral analysis is partially due to the extended development cycle of space missions and scientific instruments in general. Nonetheless, their advantages are overwhelming and the effort to adopt them has been consistent. Well-designed digital circuits are inherently stable, immune to noise, and robust against supply and temperature variations. Run-time parameters such as bandwidth (channel count and resolution) and integration time can easily be reconfigured, especially for FPGA-based solutions. Fig. 2.4 shows a commercial FFT spectrometer from Omnisys Instruments. Built with discrete (10-bit) ADCs and FPGAs, it is the size of a bench-top instrument and consumes 20 watts per channel. In comparison, a typical non-solar power source, such as the radioisotope thermoelectric generators (RTGs) on the Voyager spacecraft or the Curiosity rover, can provide a total power of less than 200 watts [19]. In addition, the payload cost today is \$10,000 per pound to Earth orbit [20] and much more into outer space. Towards a low-SPaW instrument, the true power of the digital spectrometer lies in its potential for integration, especially with ultra-scaled CMOS technology. Most importantly, such integration must go beyond just the digital processor, unlike in [21], as the overhead to ensure signal integrity among different blocks at multi-GHz speed quickly adds up. A deep understanding of each block, both in terms of how it fits into an efficient spectroscopy system and of its own design tradeoffs, is required to proceed.



Fig 2.5. Transition between spectroscopy and radiometry sharing the same frontend receiver.





#### 2.2 The Radiometer Receiver

Spectroscopy is a wideband sensing technique where multiple features can be resolved in the frequency domain across a certain range. If the input signal power at only one frequency point is of interest, that is, if the spectrometer is replaced by a power detector as indicated in Fig. 2.5, then the instrument becomes a radiometer. We start with the analysis of a simple direct-detection radiometer receiver in order to rationalize the design choices in the subsequent sections. As noise and changes in noise are of concern here, we adopt the term "noise temperature", which is defined as  $T = \frac{N}{kB}$ , where *N* is the noise power within bandwidth *B*, and  $k = 1.38 \times 10^{-23}$  J·K<sup>-1</sup> is the Boltzmann constant.

Functionally, a radiometer receiver outputs an average power or voltage value,  $\bar{v}$ , according to the input signal level,  $T_{sys}$ . Both input and output are stochastic processes and therefore,

$$\bar{v}(T_{sys}) = E\{v(t)\}|_{T_{sys}} \tag{2.1}$$

where the ensemble average includes all possible outputs, v(t), that correspond to the input level  $T_{sys}$  [22]. For real measurements, we care about the change at the output,  $\Delta v$ , in response to the change at the input,  $\Delta T$ , as depicted in Fig. 2.6. The condition for such resolvability is,

$$\frac{\Delta S_0}{N_0} \ge 1 \tag{2.2}$$

where  $\Delta S_0$  is the power in the *signal change* and  $N_0$  is the signal noise power at the output [23]. The minimal  $\Delta T$  that can be resolved at the output is defined as the receiver resolution or sensitivity, or sometimes the noise equivalent temperature difference (NE $\Delta$ T). Skipping the middle steps, the NE $\Delta$ T for an ideal direct-detection radiometer can be approximated as,

$$\Delta T \cong \frac{T_{sys}}{\sqrt{B\tau}} \tag{2.3}$$

where *B* is the bandwidth of the detector and  $\tau$  is the post-detection integration time constant. To achieve a low NE $\Delta$ T, a lower  $T_{sys}$ , a wider detection bandwidth, and a long integration time are desired. The receiver noise,  $T_{rec}$ , directly contributes to  $T_{sys}$ , on top of the contribution from receiver background,  $T_b$ .
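As a quick numerical illustration of (2.3), the following minimal sketch evaluates the ideal NE$\Delta$T for a few integration times. The function name and the operating point (a 1000 K system temperature and a 1 MHz detection bandwidth) are hypothetical values chosen only for illustration and are not taken from any instrument in this work.

```python
import math

def ne_delta_t(t_sys_k, bandwidth_hz, tau_s):
    """Ideal direct-detection radiometer sensitivity per Eq. (2.3), in kelvin."""
    return t_sys_k / math.sqrt(bandwidth_hz * tau_s)

# Hypothetical operating point, for illustration only.
T_SYS_K = 1000.0
BANDWIDTH_HZ = 1.0e6

for tau_s in (1.0, 4.0, 100.0):
    dt = ne_delta_t(T_SYS_K, BANDWIDTH_HZ, tau_s)
    print(f"tau = {tau_s:6.1f} s -> NEdT = {dt:.3f} K")
```

Because the sensitivity scales with the inverse square root of the product of bandwidth and integration time, quadrupling either one halves NE$\Delta$T, whereas halving $T_{sys}$ achieves the same improvement directly.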

For Earth and planetary targets,  $T_b$  is relatively high and therefore the measurements are limited by the background noise.  $T_{rec}$  becomes less critical and is only required to be comparable to  $T_b$  in the typical range of several hundred Kelvin. For space-borne missions with limited payload and available power, non-cryogenic receivers are preferred. The primary technology of choice is NASA's GaAs Schottky mixer operating up to 1.2THz with noise temperatures between 800K and 4000K at room temperature [24-26]. The emerging InP technology is also a viable option with low-noise amplifiers (LNAs) demonstrated up to 850GHz [27]. Astrophysics observations, on the other hand, are only limited by the cosmic microwave background (CMB) at around 3K and hence a much higher sensitivity is needed for the receiver. Ground- and space-station-based telescopes are the platforms of choice where size and weight requirements are much relaxed. Cryogenic systems such as hot-electron bolometer (HEB) receivers and superconductor-insulator-superconductor (SIS) receivers are used to achieve noise temperatures down to the 50K to 60K range at around 500GHz when cooled to 15K [28].

CMOS devices, with a maximal transit frequency ( $f_T$ ) around 300GHz in modern generations, are intrinsically unfit for THz radiometer receivers. Even at more modest mmW frequencies such as 180GHz, InP-based MMIC receivers can readily deliver a noise temperature as low as 200K [29], which is equivalent to a noise figure of 2.3dB and is at the same level as what the best CMOS receivers achieve in the low-GHz range [30]. The only possibility to incorporate CMOS for its integration advantage is to exploit Friis' formulas for noise and rely on non-CMOS initial stages to suppress the higher noise in the trailing CMOS components, including mixers and IF amplifiers. A 183GHz fully integrated CMOS heterodyne receiver with an external InP LNA is reported in [31] with a noise temperature of less than 1000K. Table 4 summarizes the key receiver technologies for THz remote sensing.

| Tech. | Freq. Range (THz) | Min. T <sub>rec</sub> (K) | Cryogenic | Remote Sensing Application  |
|-------|-------------------|---------------------------|-----------|-----------------------------|
| HEB   | < 5.0             | 40                        | Yes       | Astrophysics                |
| SIS   | < 1.5             | 40                        | Yes       | Astrophysics                |
| GaAs  | < 1.5             | 800                       | Optional  | Planetary and Earth Science |
| InP   | < 1.0             | 1000                      | Optional  | Planetary and Earth Science |
| SiGe  | < 0.5             | 4000                      | Optional  | TBD                         |
| CMOS  | < 0.5             | 90000                     | No        | TBD                         |

 Table 4. Receiver Technologies for THz Remote Sensing
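The noise-figure conversion and the Friis argument above can be checked with a short sketch. It assumes the standard 290 K reference temperature and the usual cascade form in which each stage's noise temperature is divided by the gain preceding it. The two-stage example (a 200 K front end with 20 dB of gain ahead of a 90000 K stage) is hypothetical and only loosely borrows magnitudes from Table 4; it does not model the receiver in [31].

```python
import math

T0_K = 290.0  # standard reference temperature for noise figure

def noise_figure_db(t_e_k):
    """Convert an equivalent noise temperature (K) to a noise figure (dB)."""
    return 10.0 * math.log10(1.0 + t_e_k / T0_K)

def cascade_noise_temp(stages):
    """Friis cascade: stages is a list of (noise_temp_K, gain_dB) in signal order."""
    total_k = 0.0
    gain_so_far = 1.0
    for t_e_k, gain_db in stages:
        total_k += t_e_k / gain_so_far      # later stages are suppressed by prior gain
        gain_so_far *= 10.0 ** (gain_db / 10.0)
    return total_k

nf_200k = noise_figure_db(200.0)            # close to the 2.3 dB quoted in the text

# Hypothetical chain: low-noise front end followed by a noisy back-end stage.
chain = [(200.0, 20.0), (90000.0, 0.0)]
t_cascade_k = cascade_noise_temp(chain)
print(f"NF(200 K) = {nf_200k:.2f} dB, cascade = {t_cascade_k:.0f} K")
```

In this sketch, 20 dB of front-end gain reduces the back-end contribution by a factor of 100, which is why a sufficiently high-gain non-CMOS first stage can dominate the overall noise temperature.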

### 2.3 The LO Chain

Spectroscopy down-converts the incident radiation to lower frequencies for power spectral density (PSD) calculation. Since mixing is multiplication in the time domain and convolution in the frequency domain, an ideal LO would preserve the THz spectrum while moving it to IF without changing its shape or the signal-to-noise ratio (SNR) as defined in (2.2). A realistic LO, however, suffers from random disturbances in phase due to various noise sources at the circuit and system level. Such a clock signal can be represented by,

$$x(t) = A\cos(2\pi f_0 t + \phi(t)) \tag{2.4}$$

where  $\phi(t)$  represents the phase noise. Under the "narrowband FM" assumption, the one-sided PSD of x(t) can be derived as,

$$S_{x}^{(1)}(f) = A^2 \pi \delta (2\pi f - 2\pi f_0) + \frac{A^2}{2} S_{\phi}^{(2)}(f - f_0) \tag{2.5}$$

where  $\delta(\cdot)$  is the Dirac delta function and  $S_{\phi}^{(2)}$  is the two-sided PSD of phase noise. In an oscillator such as a voltage-controlled oscillator (VCO), the phase noise PSD assumes the skirt-resembling shape typically of the form,

$$S_{\phi}(\Delta f) = n \left( \frac{1}{\Delta f^2} + \frac{f_c}{\Delta f^3} \right) \tag{2.6}$$

where *n* is a constant of proportionality representing the portion of  $S_{\phi}$  due to white noise sources at  $\Delta f = 1$  Hz and  $f_c$  is the flicker noise corner frequency [32]. Because oscillators by themselves are autonomous circuits that respond open-loop to operating conditions, they exhibit frequency drift. Typical THz LO chains for spectroscopy therefore employ phase-locked sources starting at microwave frequencies (3-30GHz) followed by wideband multipliers. At the output of a phase-locked loop (PLL), the VCO spectrum is shaped with a more apparent pedestal. Reference [33] estimates the final output PSD after frequency multiplication as,

$$S_{LO}(f) = \exp(-\phi_p) \,\delta(f) + \frac{1 - \exp(-\phi_p)}{(\pi B_0/2) \left[1 + \left(\frac{f}{B_0/2}\right)^2\right]} \tag{2.7}$$

where  $\phi_p$  is the mean-square value of the phase fluctuations and  $B_0$  is the 3-dB pedestal bandwidth that can be further approximated as,

$$B_0 = b \frac{\phi_p}{1 - \exp(-\phi_p)} \tag{2.8}$$

Equation (2.7) can be interpreted as stating that the power in the carrier and the power in the pedestal are given by  $\exp(-\phi_p)$  and  $1 - \exp(-\phi_p)$ , respectively. Again, due to the mixing process as illustrated in Fig. 2.7, the THz spectral lines get shaped by the LO spectrum during down-conversion. With  $S_i$  and  $S_o$  representing the two-sided power spectrum at the input and output, and *H* the single-bin frequency response, their relationship can be formulated as,

$$S_o(f) = |H(f)|^2 \int_{-\infty}^{\infty} S_{LO}(f - f') S_i(f') df' \tag{2.9}$$

Mathematically from (2.9), a realistic LO spectrum effectively broadens the single-bin frequency response and reduces resolution of spectroscopy. The pedestal, if wide enough, encroaches onto adjacent channels and thus its width must be minimized. Such a simple conclusion remains true in general, but the consideration on resolution and its tolerance is much more complicated. For example, if the lines are already strong enough and much broader than the designed resolution, the degradation from LO nonidealities will not matter. Furthermore, if the pedestal is due to white noise sources, its impact can be mitigated with more averaging. More discussion on resolution is given in the next section, but it is reasonable to say that SNR consideration alone does not exclude a well-designed CMOS-based synthesizer as the locked frequency source.
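The power split implied by (2.7) can be verified numerically. The sketch below integrates the Lorentzian pedestal term over a wide frequency span and compares the result with $1 - \exp(-\phi_p)$. The values of $\phi_p$ and $B_0$ are hypothetical, and the finite integration span is an approximation that slightly truncates the Lorentzian tails.

```python
import numpy as np

def pedestal_psd(f, phi_p, b0):
    """Lorentzian pedestal term of Eq. (2.7), offset frequency f in Hz."""
    half = b0 / 2.0
    return (1.0 - np.exp(-phi_p)) / (np.pi * half * (1.0 + (f / half) ** 2))

phi_p = 0.1   # mean-square phase fluctuation (hypothetical)
b0 = 1.0e6    # 3-dB pedestal bandwidth in Hz (hypothetical)

# Power in the delta-function carrier term.
carrier_power = np.exp(-phi_p)

# Riemann-sum integration of the pedestal over +/- 2000 pedestal bandwidths.
f = np.linspace(-2000.0 * b0, 2000.0 * b0, 400_001)
df = f[1] - f[0]
pedestal_power = np.sum(pedestal_psd(f, phi_p, b0)) * df

# The pedestal drops to half its peak at an offset of B0/2.
half_power_ratio = pedestal_psd(b0 / 2.0, phi_p, b0) / pedestal_psd(0.0, phi_p, b0)

print(f"carrier = {carrier_power:.4f}, pedestal = {pedestal_power:.4f}")
```

The two terms sum to approximately unity, so any power that phase noise removes from the carrier reappears in the pedestal, where it is spread across neighboring channels by the convolution in (2.9).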

With noise being less of a concern, device and circuit capabilities ultimately determine what components go in the LO chain. The traditional approach adopted mechanically tuned cascaded Schottky diode frequency multipliers at sub-mmW wavelengths that are driven by phase-locked Gunn oscillators at microwave frequencies. Above 1THz, massive far-infrared (FIR) laser-based sources might be required, where changing frequencies implies changing the gas in the laser. Breakthroughs in device fabrication technology beyond integrated CMOS have enabled solid-state-based replacements along the entire chain. A 200- to 1500-GHz all-solid-state broad-band frequency multiplier chain was reported in [34] with planar Schottky-barrier varactor diode frequency doublers in GaAs-based substrate-less membrane technology. CMOS-based frequency synthesizers and frequency multipliers can reliably cover up to around 500GHz but suffer from low and narrowband output power. Thanks to the development of monolithic microwave integrated circuit (MMIC) power amplifiers (PAs) in the Ka- and W-bands [35-37], the trend has moved towards pairing ultra-wide-range CMOS sources with external MMIC PAs as the frequency-agile initial stages in the LO chain.



Fig 2.7. Illustration of impact of phase noise on spectral resolution.

#### 2.4 Spectrometer Considerations

Unlike the frontend components discussed so far, the design considerations at IF and baseband are less concerned with noise and raw device capability as we adopt digital processing. Rather, practical sensing scenarios and implementation constraints guide the design process.

#### 2.4.1 Quantization Depth

At IF, the down-converted analog spectra first go through the ADCs where errors between the analog values and the digitized outputs are introduced as the quantization noise. With proper rounding and sufficient levels, it can be approximated to be uniformly distributed with a zero mean and a constant variance (Fig. 2.8). The term signal-to-quantization-noise-ratio (SQNR) is often used to characterize the relationship between the maximum nominal signal strength and the quantization error in an *ideal* ADC. It is directly related to the ADC resolution by,

$$\text{SQNR (dB)} = 20 \log_{10} 2^D \approx 6.02 \cdot D \tag{2.10}$$

where *D* is the number of quantization bits. A similar concept is the spurious-free dynamic range (SFDR), also shown in Fig. 2.8, which measures the ratio of the fundamental signal strength to the strongest spurious tone at the output. It is typically limited by the nonlinearity and deterministic mismatch in the system and represents the smallest signal power that can be distinguished from a large interfering signal [38].
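To make (2.10) concrete, the following sketch compares the ideal SQNR with a value measured from a simulated uniform quantizer, and checks that the error variance is close to the textbook value of $\Delta^2/12$ for a step size $\Delta$. The test input (uniformly distributed over the full scale), the mid-rise quantizer, and all function names are assumptions made for this illustration; a different input distribution would shift the measured figure.

```python
import numpy as np

def ideal_sqnr_db(bits):
    """Ideal SQNR per Eq. (2.10)."""
    return 20.0 * np.log10(2.0 ** bits)

def measured_sqnr_db(bits, n_samples=200_000, seed=0):
    """Quantize a full-scale uniform input and measure SQNR and error statistics."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n_samples)        # full-scale test input (assumed)
    step = 2.0 / 2 ** bits                       # quantizer step over [-1, 1)
    codes = np.floor((x + 1.0) / step)           # integer codes 0 .. 2^bits - 1
    xq = (codes + 0.5) * step - 1.0              # mid-rise reconstruction levels
    err = x - xq
    sqnr_db = 10.0 * np.log10(np.var(x) / np.var(err))
    var_ratio = np.var(err) / (step ** 2 / 12.0) # close to 1 for uniform error
    return sqnr_db, err.mean(), var_ratio

for bits in (3, 4):
    sqnr_db, mean_err, var_ratio = measured_sqnr_db(bits)
    print(f"D = {bits}: ideal {ideal_sqnr_db(bits):.2f} dB, measured {sqnr_db:.2f} dB")
```

For the 3-bit and 4-bit depths used in this work, the ideal figures are roughly 18 dB and 24 dB, respectively.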



Fig 2.8. ADC quantization noise, SQNR, and SFDR.

Both SQNR and SFDR are important metrics of ADC sensitivity. For many applications where information is deterministically embedded in small signal changes, they also set the system sensitivity. Spectroscopy, on the other hand, detects the relative change in noise power and is fundamentally immune to quantization noise. The SNR definition in (2.2) can be rewritten as,

$$SNR = \frac{\sigma^2}{\bar{v}^2} \tag{2.11}$$

where  $\sigma^2$  is the variance of the analog noise spectra and  $\bar{v}^2$  is the total noise power. With enough averaging,  $\bar{v}^2$  remains constant as quantization noise has a zero mean. The added "randomness" due to digitization does not impact  $\sigma^2$  directly but rather raises the floor when a narrow-band spectral feature is to be extracted, resulting in an equivalent SNR loss. The ratio between the variance of the analog noise voltage and that of the digitized version is defined as quantization efficiency (inverse of SNR degradation in dB) and is exhaustively studied in [39] with both closed-form expressions and numerical lookup tables. Fig. 2.9 is a replot of such data to represent the equivalent SNR degradation. Intuitively speaking, the larger degradation at small bit depths can be attributed to the breakdown of the well-behaved assumptions about quantization noise.



Fig 2.9. SNR degradation vs. quantization depth in a spectrometer system.

Fig. 2.10 illustrates the detection process using a simple comparator (two levels). Averaging is the essence here, but another underlying assumption has been that the common-mode level across the detection bandwidth stays constant relative to the thresholds, which are also typically assumed to be constant, so that no bias is introduced to the quantization noise distribution. In other words, gain flatness at IF needs to be budgeted in the SNR degradation as if it were due to quantization. For astrophysics applications, where measurements are extremely SNR limited and the averaging time ranges from weeks to months, the IF responses are well controlled with 1-2dB gain fluctuation across several GHz and so a very modest bit count is required. For planetary and Earth science studies, where IF passband ripple could be as high as 6-7dB [40], a higher bit depth is required to ensure the changes in common mode are also well digitized and the non-biased noise variance is extracted in the end. In this work, a simple 3-bit ADC is adopted for two of the three spectrometers and a 4-bit ADC is used for the third design at a more advanced technology node.



**Fig 2.10.** Detection of added brightness with two-level digitization and failure to do so with an incorrect common-mode voltage or equivalently the comparator threshold.

#### 2.4.2 Bandwidth and Resolution

Intuitively, a wider IF passband allows more spectral features to be simultaneously processed by the FFT engine along a continuous frequency range and thus has a direct impact on the scientific value of the spectrometer system. As evidenced by the MIRO example in Section 2.1, concurrent observation of multiple features is important because the ratio of two or more quantities (water isotopes in the MIRO example) may bear equal or greater value to an investigation than individual abundance. For a heterodyne receiver, a wide IF passband can be achieved by either a sliding LO or a fixed LO at a larger frequency offset from RF. In the latter case, the ADC and the subsequent stages take the full brunt of the bandwidth requirement as the sampling speed and digital throughput must both keep up. The seemingly smart approach with the sliding LO, however, is not suited for concurrent observation. Orbiting or fly-by spacecraft often have high velocity relative to their sensing targets, and LO switching takes time to settle. By the time the instrument has sensed the first line and the LO has switched to the second, the spacecraft will be at a different location, with little to be inferred about the substance ratios in either location. For this reason, only the fixed-LO approach is considered in this work and, as a result, the desire for faster ADCs and FFTs goes unabated. Shown in Fig. 2.11 is the emission spectrum for the Earth's limb around 340GHz. It was compiled for the compact-and-adaptable microwave limb sounder (CAMLS) project [41], highlighting molecules of interest to both pollution and climate change. The line marker at 341GHz is the center LO location where both the lower sideband (LSB) and the upper sideband (USB) can be covered with a sampling bandwidth of 20GHz. One thing worth noting here is that since the line locations are all known a priori, they can be recovered even with frequency folding, or aliasing, which helps to alleviate the speed requirement of the spectrometer system.

Besides bandwidth, the achievable spectral resolution also impacts the applicability of an instrument. Referring back to Fig. 2.3, the resolution requirement for spectroscopy ranges from tens of kHz to nearly 100MHz depending on the observation target. The scientific justifications of these requirements, whether it is any of the various pressure broadening scenarios or the need for velocity resolution due to Doppler shift, are fascinating yet quite irrelevant to the hardware treatment here. What does matter is that the number of FFT channels, *N*, grows linearly as the bandwidth expands and the resolution refines, which in turn increases the hardware cost more than linearly according to the  $O(N \cdot \log N)$  complexity of the FFT. This work sets its sights on the sliver of bandwidth where water resides in Fig. 2.11, which is about 6GHz. As multi-GHz ADCs are not trivial to design, the development actually started with 3GS/s ADCs (1.5GHz Nyquist bandwidth) and worked its way up. A minimal resolution of around 1MHz is maintained throughout the generations even as the speed and number of points double.
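The relationship between bandwidth, channel count, and cost can be illustrated with simple arithmetic. The sketch below uses the bandwidth and channel-count pairs listed in Table 5 and takes $N \log_2 N$ as a relative cost proxy. This proxy is an assumption for illustration and ignores constant factors, bit widths, and architecture.

```python
import math

def resolution_mhz(bandwidth_hz, channels):
    """Channel spacing when the bandwidth is split evenly over the channels."""
    return bandwidth_hz / channels / 1.0e6

def fft_cost(channels):
    """Relative cost proxy following the O(N log N) complexity of the FFT."""
    return channels * math.log2(channels)

# (bandwidth in Hz, channel count) pairs as listed in Table 5.
designs = {"S-VI": (1.5e9, 1024), "S-VII": (3.0e9, 4096), "S-VIII": (6.0e9, 8192)}

base_cost = fft_cost(designs["S-VI"][1])
for name, (bandwidth_hz, channels) in designs.items():
    res = resolution_mhz(bandwidth_hz, channels)
    rel = fft_cost(channels) / base_cost
    print(f"{name}: {res:.2f} MHz per channel, relative cost {rel:.1f}x")
```

Going from 1024 to 8192 channels is an 8x increase in *N* but a 10.4x increase in this cost proxy, which is the more-than-linear growth referred to above.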

As an interesting aside, Fig. 2.11 illustrates more than just the location of spectral lines. The conical shape of the actual features, coupled with uneven peak heights, visually indicates a varying degree of detectability and possibly a changing resolution requirement. Indeed, a figure-of-merit (FoM) can be defined to gauge how difficult it is to observe a certain molecule using spectroscopy. The expression is given as,

$$FoM = \frac{a\mu^2}{Q} \tag{2.12}$$

where *a* is the volume mixing ratio, or abundance,  $\mu$  the molecule's permanent dipole moment, and *Q* the approximate number of molecular states over which the total abundance is distributed [42]. Such a self-explanatory FoM is corroborated by the strong H<sub>2</sub>O, O<sub>2</sub>, and SO<sub>2</sub> lines in Fig. 2.11, whose abundances are understandably higher in the Earth's atmosphere. Molecules with lower FoMs can be detected with extra effort to improve the SNR, such as longer averaging or a higher-sensitivity instrument. Outside of these options, the only way to improve the FoM is to enhance the abundance, as both  $\mu$  and *Q* are intrinsic molecular properties. For lab testing to verify the performance of our spectrometer design, measurements are done with artificially controlled abundance of H<sub>2</sub>O and CH<sub>3</sub>CN molecules.



Fig 2.11. Spectral continuum around 340GHz reported in [41].

#### 2.4.3 Averaging Duration and Efficiency

Taking a few steps back, one may question the necessity of dedicated FFT hardware when there is no apparent need for real-time processing. The math involved, which will be detailed in Chapter 3, also appears complicated enough to warrant a software implementation. However, all software routines run on a microprocessor, or CPU, which is naturally inferior in terms of computation throughput when compared to FPGA- and ASIC-based designs. In order to bridge the speed differential with respect to data acquisition without having to discard any, a massive amount of memory is required to serve as the reservoir for the fast-incoming data while the CPU slowly processes them. For the "entry-level" 3-bit 3GS/s ADC, 9Gb of storage accumulates every second. As spectroscopic observation can last from seconds to several months long, the sheer volume of storage needed to accommodate the software solution renders it an impractical option.
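The storage figure quoted above follows directly from the ADC parameters. The short sketch below reproduces it and extrapolates to longer observation windows. The helper names and the chosen durations are hypothetical, and decimal units (1 TB = 10<sup>12</sup> bytes) are assumed.

```python
def raw_rate_bps(bits_per_sample, sample_rate_hz):
    """Raw ADC output data rate in bits per second."""
    return bits_per_sample * sample_rate_hz

def storage_tb(bits_per_sample, sample_rate_hz, duration_s):
    """Storage needed to buffer every raw sample, in decimal terabytes."""
    return raw_rate_bps(bits_per_sample, sample_rate_hz) * duration_s / 8.0 / 1.0e12

rate = raw_rate_bps(3, 3.0e9)              # the 3-bit, 3 GS/s ADC from the text
per_day_tb = storage_tb(3, 3.0e9, 86400.0)

print(f"raw rate: {rate / 1e9:.0f} Gb/s")
print(f"one day of raw samples: {per_day_tb:.1f} TB")
```

Even a single day of raw samples at the lowest sampling speed approaches 100 TB, which motivates reducing the data on chip by transforming and accumulating as the samples arrive.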

The underlying condition for requiring the massive storage above is that no sampled data is discarded. To discard any data is equivalent to not observing for some interval within the observation window, which would potentially degrade the SNR. The fundamental principle behind this is that the convergence of mean and variance using averaging assumes a stationary input, or one that is wide-sense stationary (WSS) within a certain time frame. It is highly desirable to maximize data points within a continuous period rather than taking them in intervals over which the distribution of the process itself might have changed significantly. From this perspective, the spectrometer is an implicit real-time system in that data must be processed as they come in without incurring large buffers and long delays. The real-time constraint is another way to explain why a fixed LO is preferred over a sliding LO for the receiver design, in that the spacecraft movement typically dictates a short WSS window and tolerates no LO settling.

For real hardware implementations, some processing latency and downtime for resetting internal states are expected within a "reset-and-run" cycle. Averaging efficiency can then be defined as the maximal continuous run time over the total cycle time,


$$\eta_A = \frac{T_{run}}{T_{run} + T_{stop}} \tag{2.13}$$

For a well-thought-out architecture,  $T_{run}$  is largely set by the maximal average count and  $T_{stop}$  by the reset time, which usually masks out the latency. In this work, averaging is implemented as accumulation with normalization done in software. The maximal accumulation count with full-scale input is about 134 million (2<sup>27</sup>) while the reset time is kept under 300 cycles to keep  $\eta_A$  as close to unity as possible. The corresponding accumulation duration is about 180 seconds. Beyond this duration, a combination of hardware and software averaging can be adopted; a modern CPU should be able to handle a 64-bit addition every three minutes.
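The numbers above can be cross-checked against (2.13) under a few stated assumptions. The sketch assumes the S-VII parameters from Table 5 (6 GS/s sampling, 4096 channels), that each spectrum consumes twice as many real samples as there are channels, and that the 300 reset cycles are counted at the DSP clock, taken here as 1/32 of the sampling rate. None of these assumptions is confirmed by the text at this point, so the result is indicative only.

```python
def averaging_efficiency(t_run_s, t_stop_s):
    """Averaging efficiency per Eq. (2.13)."""
    return t_run_s / (t_run_s + t_stop_s)

# Assumed S-VII-like parameters (see the lead-in for the caveats).
SAMPLE_RATE_HZ = 6.0e9
SAMPLES_PER_SPECTRUM = 2 * 4096     # real-input frame length (assumed)
MAX_COUNT = 2 ** 27                 # maximal accumulation count
RESET_CYCLES = 300                  # upper bound on reset time, in clock cycles
DSP_CLOCK_HZ = SAMPLE_RATE_HZ / 32  # clock domain of the reset (assumed)

t_run_s = MAX_COUNT * SAMPLES_PER_SPECTRUM / SAMPLE_RATE_HZ
t_stop_s = RESET_CYCLES / DSP_CLOCK_HZ
eta_a = averaging_efficiency(t_run_s, t_stop_s)

print(f"T_run  = {t_run_s:.2f} s")
print(f"T_stop = {t_stop_s * 1e6:.2f} us")
print(f"1 - eta_A = {1.0 - eta_a:.2e}")
```

Under these assumptions the run time comes out at roughly 183 seconds, consistent with the accumulation duration quoted above, and the efficiency differs from unity by less than one part in ten million.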

#### 2.4.4 Robustness and Reconfigurability

A spaceborne instrument experiences extreme operating conditions such as wide temperature variations due to entering and exiting direct sunlight, secondary radiation effects, and even vapor condensation for certain Earth science missions. The spectrometer in this work, with analog and digital blocks integrated on the same die, tends to be more vulnerable to timing uncertainties and random disturbances. To safeguard the robustness of the system, monitoring and actuating circuits are included at numerous places to allow real-time calibration of each block. For example, a replica biasing scheme [43] is used for the ADC, where an exact copy of the active ADC is placed in close vicinity for common-mode and DC bias control. Coarse and fine programmable delay elements are inserted at the mixed-signal interface to guarantee a positive setup time. For the FFT processor, a few self-test sequences are included, at negligible hardware cost, to rapidly detect internal random errors at the normal output without involving an additional scan-chain interface, which is slower and more costly considering the size of our designs.

Despite not having the full programmability of an FPGA, our custom silicon incorporates reconfigurable architectures wherever it can in order to satisfy extended mission requirements. It is also interesting that many of these knobs and switches come for free with the reliability safeguards. For example, the common-mode levels and comparator thresholds of the ADC are all fully programmable. The system supports both external and internal clocks for different bandwidth requirements, with the internal clock frequency selectable through the programmable divider in an integrated PLL. Even the FFT size can be made reconfigurable, which is indeed the case for the first design in Chapter 3. This feature was quickly dropped in subsequent iterations, however, as the scientists would simply like to keep the maximal channel count for the best resolution. The next chapter examines these architectures and their implementations in detail.

## CHAPTER 3

# Integrated CMOS Spectrometers for Wideband Remote Sensing

## 3.1 System Overview

With the understanding developed in Chapter 2, a system level block diagram can be defined as in Fig. 3.1. For the main signal path, the ADC first digitizes the IF output into thermometer codes. The binary translation is done after a two-way de-multiplexor (DEMUX2) so that the decoder can be synthesized using library standard cells with relaxed timing requirements at half the sampling speed. A 16-way de-multiplexor (DEMUX16) immediately follows and further splits the data into a total of 32 parallel streams. The DSP core thus runs at a much lower speed (1/32 of the sampling speed), which opens up the design space for better energy-delay optimization. The two-stage de-multiplexor design, with the DEMUX2 implemented in the analog flow and the DEMUX16 in the digital flow, is deliberate. First, the decoder has the logic complexity of a single-bit full adder and therefore can operate at a much higher speed without consuming too much power. More importantly, the DEMUX2 output marks the mixed-signal interface that is the starting point for digital timing constraints. However, the input delay, defined as the time at which the input becomes stable after the clock transition and equivalent to the output latency of the preceding stages (analog comparators or buffers in this case), can vary significantly given their analog nature and their sensitivity to operating conditions. For this reason, extensive edge tuning is applied at this interface and it is only sensible to keep the number of edges to be matched to a minimum. Outside of the main path, the integrated PLL and frequency dividers at various locations complete the clock management system of the SoC. The replica ADC helps to provide important DC and common-mode information for main ADC calibration. All the monitoring and calibration rely on the communication between the chip and the external host, which is accomplished through the standard Serial Peripheral Interface (SPI).
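The data path described above can be mimicked behaviorally in a few lines. The sketch below is a software model only and does not represent the circuit implementation. Decoding a thermometer code by counting ones is one common choice and is assumed here, as are the 7-comparator (3-bit) code width, the stream ordering, and the 6 GS/s sampling rate used to derive the DSP clock.

```python
def thermometer_to_binary(thermometer_bits):
    """Decode a thermometer code by counting ones (an illustrative choice)."""
    return sum(thermometer_bits)

def demux(samples, ways):
    """Round-robin de-multiplexer: stream i receives samples i, i+ways, ..."""
    return [samples[i::ways] for i in range(ways)]

def two_stage_demux(thermometer_samples):
    """DEMUX2, then binary decoding at half rate, then DEMUX16 on each half."""
    streams = []
    for half_rate in demux(thermometer_samples, 2):
        decoded = [thermometer_to_binary(code) for code in half_rate]
        streams.extend(demux(decoded, 16))
    return streams

def to_thermometer(value, comparators=7):
    """Hypothetical helper that builds a 3-bit flash ADC output for testing."""
    return [1] * value + [0] * (comparators - value)

# 64 consecutive samples whose values cycle through 0..7.
adc_output = [to_thermometer(n % 8) for n in range(64)]
streams = two_stage_demux(adc_output)

dsp_clock_hz = 6.0e9 / len(streams)   # 1/32 of an assumed 6 GS/s sampling rate
print(len(streams), "streams, DSP clock =", dsp_clock_hz / 1e6, "MHz")
```

In this model, each of the 32 streams receives every 32nd sample, which is what allows the downstream logic to run at 1/32 of the sampling speed.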



Fig 3.1. System level block diagram of proposed spectrometer SoCs.

The above architecture is shared by the three generations of spectrometer SoCs developed as part of this dissertation, codenamed S-VI, S-VII, and S-VIII after a series of trials and errors. The S-VI is the proof of concept in 65nm CMOS featuring a programmable FFT with up to 1024 channels, a full-fledged four-tap polyphase filter bank (PFB4) with sinc weights, and a designed bandwidth of 1.5GHz. The S-VII doubles the sampling speed and system bandwidth by two-way interleaving the same ADC. The worst-case resolution is still improved as the size of the FFT, now fixed, is quadrupled to 4096. To offset the prohibitive cost of multi-tap implementation in the absence of smaller storage macros, the PFB is substituted with a conventional Hann window. The wider (in terms of bit width) accumulator, on the other hand, readily adopts the available SRAM macros and ends up saving quite a bit of area over the D flip-flop (DFF)-based implementation in the S-VI. Positioned to be the "endgame" for FFT spectrometers, the S-VIII keeps the same architecture but advances from the 65nm node to 28nm. The redesigned ADC reaches 12GS/s and the channel count doubles again to 8192. For this third installment, several types of foundry memory IPs were made available to us and, as a result, the high-performing PFB is reinstated with a more efficient design. At the time of this writing, both the S-VI and the S-VII have been field tested in real flight applications whereas the S-VIII is still in laboratory validation. For the ensuing discussions where it is easier to apply numerical values, the S-VII design will serve as the default subject unless otherwise indicated. Table 5 lists the main specifications for the three SoCs.

| Parameter                        | S-VI | S-VII | S-VIII |
|----------------------------------|------|-------|--------|
| Designed Bandwidth (GHz)         | 1.5  | 3.0   | 6.0    |
| ADC Sampling Speed (GS/s)        | 3.0  | 6.0   | 12.0   |
| ADC Quantization Depth           | 3    | 3     | 4      |
| FFT Input Bit Width              | 9    | 9     | 10     |
| FFT Channel Count (including DC) | 1024 | 4096  | 8192   |
| Windowing Function               | PFB4 | Hann  | PFB4   |
| Worst-case Resolution (MHz)      | 1.46 | 0.73  | 0.73   |

**Table 5.** System Specifications for Proposed Spectrometer SoCs
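As a quick arithmetic check on Table 5, the sketch below recomputes the listed resolution as the designed bandwidth divided by the channel count, and the DSP clock as the ADC sampling speed divided by the 32-way de-multiplexing (DEMUX2 followed by DEMUX16). The dictionary layout and function names are illustrative assumptions and not part of the original design flow.

```python
# Values copied from Table 5; the data layout and names are illustrative only.
SPECS = {
    "S-VI":   {"bandwidth_ghz": 1.5, "fs_gsps": 3.0,  "channels": 1024},
    "S-VII":  {"bandwidth_ghz": 3.0, "fs_gsps": 6.0,  "channels": 4096},
    "S-VIII": {"bandwidth_ghz": 6.0, "fs_gsps": 12.0, "channels": 8192},
}
DEMUX_WAYS = 2 * 16  # DEMUX2 followed by DEMUX16 gives 32 parallel streams


def resolution_mhz(bandwidth_ghz, channels):
    """Channel spacing in MHz: designed bandwidth divided by channel count."""
    return bandwidth_ghz * 1e3 / channels


def dsp_clock_mhz(fs_gsps, ways=DEMUX_WAYS):
    """DSP core clock in MHz after de-multiplexing the ADC output."""
    return fs_gsps * 1e3 / ways


for name, spec in SPECS.items():
    res = resolution_mhz(spec["bandwidth_ghz"], spec["channels"])
    clk = dsp_clock_mhz(spec["fs_gsps"])
    print(f"{name}: resolution {res:.2f} MHz, DSP clock {clk:.2f} MHz")
```

The S-VII, for example, gives 3000/4096 MHz per channel and a DSP clock of 6000/32 MHz.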

Before delving into block-level discussions, it is imperative to simulate at the system level to verify and visualize the spectroscopic sensing process, which has so far only been presented with intuitions and equations. Fig. 3.2 presents a signal flow view of the system level block diagram. The input is a superposition of three streams of uncorrelated white noise, two of which have undergone bandpass filtering to model the noise shaping effect of two imaginary molecules. The corresponding output at each stage is simulated and plotted using MATLAB, and it is clear that the shaped noise gets extracted and the SNR improves with more averaging. Note that the output spectrum has even symmetry thanks to the real-only input to the FFT.



Fig 3.2. The simulation view of the system level diagram and outputs at labelled locations.
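The signal flow described above can be approximated with a short NumPy sketch. It is not the MATLAB model used for Fig. 3.2: the FFT length, frame count, band centers, band widths, and the 0.5 amplitude of the shaped noise are all hypothetical values chosen for illustration, and the bandpass filter is an idealized brick-wall mask.

```python
import numpy as np


def bandpass_noise(total, f0, bw, rng):
    """White Gaussian noise restricted to |f - f0| <= bw/2 (f in cycles/sample)."""
    spec = np.fft.rfft(rng.standard_normal(total))
    f = np.fft.rfftfreq(total)
    spec[np.abs(f - f0) > bw / 2] = 0.0
    return np.fft.irfft(spec, n=total)


def averaged_power(x, n_fft, n_avg):
    """Real-input FFT per frame, then average |X|^2 over n_avg frames."""
    frames = x[: n_fft * n_avg].reshape(n_avg, n_fft)
    return (np.abs(np.fft.rfft(frames, axis=1)) ** 2).mean(axis=0)


rng = np.random.default_rng(0)
N_FFT, N_FRAMES = 1024, 400                      # hypothetical sizes
total = N_FFT * N_FRAMES
# One white background plus two shaped streams standing in for two molecules
x = (rng.standard_normal(total)
     + 0.5 * bandpass_noise(total, 0.10, 0.02, rng)
     + 0.5 * bandpass_noise(total, 0.20, 0.02, rng))

freqs = np.fft.rfftfreq(N_FFT)
background = (freqs > 0.30) & (freqs < 0.45)     # bins holding white noise only
line = np.abs(freqs - 0.10) < 0.008              # bins inside the first shaped band


def ripple(p):
    """Relative fluctuation of the background bins (std / mean)."""
    return p[background].std() / p[background].mean()


for n_avg in (1, 16, 400):
    p = averaged_power(x, N_FFT, n_avg)
    contrast = p[line].mean() / p[background].mean()
    print(n_avg, round(ripple(p), 3), round(contrast, 3))
```

With a single frame the background fluctuation is comparable to the background level itself, so the shaped bands are buried; the fluctuation shrinks roughly with the square root of the number of averaged frames while the line-to-background contrast stays fixed.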

The remainder of the chapter is organized as follows. Since the flash ADC and the charge-pump PLL use rather conventional designs, we do not repeat here what is already covered in the existing publications [31][44]. Instead, we start in Section 3.2 with an in-depth treatment of the potential timing problem at the mixed-signal interface and the extent to which this work goes to mitigate such a risk. The baseband techniques that are crucial to the success of the spectrometer SoCs are covered in Section 3.3. Considerations for algorithm selection and architectural-level optimization are discussed extensively. One of the key motivations is to be able to integrate, with ease, a large digital core with a limited number of routing layers in RF-oriented processes. Laboratory measurement results are presented in Section 3.4 along with a comparison to other reported spectrometer works as well as high performance FFT processors.

## 3.2 The Mixed-Signal Interface and Clock Management

To enable better energy-delay tradeoff for the digital blocks downstream, the ADC thermometer output at full speed is slowed down and initially split into two parallel data streams. Such a serial-to-parallel conversion can be easily achieved with DFFs clocked at lower speeds, but some offset typically remains between the instants at which the parallel outputs become available. As illustrated in Fig. 3.3, the ADC clock runs at 3GHz and the DEMUX clock (CK A and B) at 1.5GHz. The split data streams, DATA A and B, are offset by one full-rate cycle of 333.33ps and must be realigned for the subsequent synchronous decoder. It is clear that the retiming edge must land within the full-rate cycle immediately after DATA B becomes available in order to yield the correct output sequence. The situation is considerably more complicated when the ADC adopts interleaving and already runs on complementary clocks. Depicted in Fig. 3.4 is the case for the S-VII, which interleaves the 3GHz ADC by a factor of two. As a result of interleaving, the two sub-ADC outputs are already offset by half a full-rate cycle, or 166.67ps. The DEMUX DFFs split them further into four streams but the smallest offset between any two remains the same. If the same logic as in the non-interleaved case is followed here, the retiming edge must land in the indicated 166.67ps window, which is difficult to enforce as the shortest clock period in the system is now twice as long. Instead, a two-step retiming process is adopted as the solution. In the first step, the DEMUX CK D signal samples DATA A/B/C while DATA D is passed through a fixed delay that matches the first retiming operation. In the second step, the DEMUX CK C signal resamples DATA A/B/C as well as the passed-through DATA D to complete retiming.



Fig 3.3. Single-channel ADC (S-VI) output DEMUX2 and its timing diagram.



Fig 3.4. Interleaved ADC (S-VII) output DEMUX2 and its timing diagram.

The discussion on data realignment so far has only considered ideal clock waveforms. In reality, clock uncertainty such as skew and jitter, as well as propagation delays and their mismatch, which can depend on data and operating conditions, all eat into the sampling window. To guarantee proper realignment under all conditions, the retiming clocks are designed to be programmable with coarse- and fine-tune delay elements covering the entire 360° range at full rate. In fact, since clock mismatch impacts the signal-to-noise-and-distortion ratio (SNDR) in an interleaved ADC, extensive phase tuning is a standard part of the overall clock management strategy in our SoCs and it starts with the sub-ADC clocks using 8-bit coarse and 6-bit fine controls. Estimation of phase mismatches, based on which the external host asserts control, is obtained from XOR-based phase detectors (PD) at both the ADC and the DEMUX tuner outputs. Fig. 3.5 presents a simplified diagram of the clock management unit (CMU). One detail worth mentioning is the potential phase ambiguity at the outputs of the asynchronous dividers. The DEMUX CK A/B/C/D are generated but their order can be either A-C-B-D or A-D-B-C when the chip is first powered up. A simple phase arbitration unit is built with a DFF and a MUX and its operation is illustrated in Fig. 3.6.

Fig. 3.7 presents the delay cell designs and their simulations. The coarse delay unit cell consists of inverter chains, transmission gates (TGs), and capacitors. When the TGs are turned on, the extra capacitive loading adds delay. The smaller cross-coupled inverters form weak latches to minimize the timing skew between the pseudo-differential edges even if a large number of these cells are cascaded. The fine delay unit cell is designed as current-controlled buffers with bias controlled by an 8-bit R-2R DAC. The simulation shows a step of 8ps for the coarse delay cell and 600fs for the fine delay cell. According to the SNDR expression [45],

$$SNDR = \frac{\frac{V_0^2}{2}}{\frac{1}{12} \left(\frac{2V_0}{2^M}\right)^2 + \pi^2 \Delta T^2 f^2}$$
(3.1)

where  $V_0$  is the input signal amplitude (300mV by design), *M* the number of bits (*M* = 3 for S-VII and *M* = 4 for S-VIII), $\Delta T$ the timing skew, and *f* the input frequency (1.5GHz used in our calculation), a timing skew of less than 600fs, the fine delay step, confines the SNDR penalty of our interleaved ADC to within 1dB.
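To make the skew budget concrete, the sketch below evaluates Eq. (3.1) exactly as printed and reports the SNDR penalty relative to the zero-skew case. The function names are hypothetical, and the skew values tried are assumptions for illustration: the 600fs fine-delay step from the simulation above, and the 8ps coarse-delay step as a contrast.

```python
import math


def sndr_db(v0, m_bits, skew_s, f_in):
    """Evaluate Eq. (3.1) as printed and return the result in dB."""
    signal = v0 ** 2 / 2
    quantization = (1.0 / 12.0) * (2 * v0 / 2 ** m_bits) ** 2
    skew_term = (math.pi * skew_s * f_in) ** 2
    return 10 * math.log10(signal / (quantization + skew_term))


def skew_penalty_db(v0, m_bits, skew_s, f_in):
    """SNDR lost to timing skew, relative to the skew-free value."""
    return sndr_db(v0, m_bits, 0.0, f_in) - sndr_db(v0, m_bits, skew_s, f_in)


V0 = 0.3      # input amplitude in volts, from the text
F_IN = 1.5e9  # input frequency in Hz, from the text

for m_bits in (3, 4):
    for skew in (600e-15, 8e-12):  # fine step and coarse step
        penalty = skew_penalty_db(V0, m_bits, skew, F_IN)
        print(f"M={m_bits}, skew={skew * 1e12:.1f} ps: penalty {penalty:.2f} dB")
```

Under these assumptions a residual skew of one fine step costs well under 1dB for both bit depths, whereas a residual of one coarse step does not, which is consistent with having a fine tuning stage at all.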



Fig 3.5. A simplified block diagram of the clock management unit (CMU).



Fig 3.6. (a) Divide-by-2 (b) Phase ambiguity (c) DFF-based arbitration (d) Corrected output



Fig 3.7. Coarse- (top) and fine-delay unit cell schematics and their resolution simulations.

## 3.3 Area- and Energy-Efficient Digital Design

As previously mentioned in Chapter 1 and further justified in Chapter 2, the size and speed requirements of FFT hardware for spectroscopy far exceed those for other applications. For most architectures, a higher channel count naturally involves more logic gates and silicon real estate. The increased loading from the extra fanout and interconnect, more so in a sub-optimal design, severely impedes speed unless more power is consumed. Although there is no set power budget for the spectrometers, the interconnect parasitics eventually dominate the delay and at the same time the available silicon area is ultimately limited by cost. The digital core was considered a "tougher nut to crack" given our relatively longer heritage in RF and mixed-signal design.

### 3.3.1 FFT Hardware and Architectural Optimization based on Radix-2<sup>K</sup> Factorization

The FFT algorithm efficiently computes the *N*-point discrete Fourier transform (DFT) of an input sequence,

$$X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk}$$
(3.2)

where k = 0, 1, 2, ..., N - 1 and  $W_N = e^{-\frac{j2\pi}{N}}$  is known as the twiddle factor, by decomposing it into smaller-size DFTs. Such a divide-and-conquer approach can be applied recursively, as in the Cooley-Tukey algorithm [46], to partition a DFT of size  $N = M \times L$  into smaller DFTs of sizes M and L. For  $N = M \times L$ , the input indexed as x[l, m] = x[Lm + l] with  $0 \le l \le L - 1$  and  $0 \le m \le M - 1$ , and k = Mp + q, where  $0 \le p \le L - 1$  and  $0 \le q \le M - 1$ , the N-point FFT can be represented in a two-dimensional form as [4],

$$X[k] = \sum_{l=0}^{L-1} \left( \sum_{m=0}^{M-1} x[l,m] W_M^{mq} \cdot W_N^{lq} \right) W_L^{lp}$$
(3.3)

where  $W_N^{lq}$  is the inter-stage twiddle factor to compute the final *L*-point DFT from the *M*-point DFT inside the bracket. Due to decomposition, the number of complex multiplications is reduced from  $N^2$  to N(M + L + 1) and the number of additions is reduced from N(N - 1) to N(M + L - 2).
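The two-dimensional form in Eq. (3.3) can be checked numerically. The sketch below is a software illustration only, not the hardware datapath: it assumes the input index mapping x[l, m] = x[Lm + l], which is the mapping that makes the three twiddle exponents in Eq. (3.3) come out as written, and it uses NumPy's FFT as a stand-in for the small M-point and L-point DFTs. The function name is hypothetical.

```python
import numpy as np


def dft_two_dimensional(x, M, L):
    """N-point DFT (N = M*L): M-point DFTs, inter-stage twiddles, L-point DFTs."""
    x = np.asarray(x, dtype=complex)
    N = M * L
    assert x.size == N
    # Assumed input index map: x_lm[l, m] = x[L*m + l]
    x_lm = x.reshape(M, L).T                         # shape (L, M)
    inner = np.fft.fft(x_lm, axis=1)                 # sum over m, indexed [l, q]
    l = np.arange(L).reshape(L, 1)
    q = np.arange(M).reshape(1, M)
    inner = inner * np.exp(-2j * np.pi * l * q / N)  # inter-stage twiddle W_N^{lq}
    outer = np.fft.fft(inner, axis=0)                # sum over l, indexed [p, q]
    return outer.reshape(N)                          # output index k = M*p + q


rng = np.random.default_rng(1)
x = rng.standard_normal(8) + 1j * rng.standard_normal(8)
print(np.allclose(dft_two_dimensional(x, M=4, L=2), np.fft.fft(x)))
```

The same routine with M = 256 and L = 32 mirrors the size of the S-VII partition into 32 parallel 256-point sub-FFTs followed by a 32-point stage, although the hardware ordering of inputs and outputs is not modeled here.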

The most common adaptation of the Cooley-Tukey algorithm is to recursively divide the transform into two half-sized operations until it is a simple 2-point DFT (hence the term "Radix-2 factorization"). By limiting N to power-of-two numbers, the radix-2 algorithm not only exploits the symmetry  $(W_N^{k+N/2} = -W_N^k)$  and periodicity  $(W_N^{k+N} = W_N^k)$  of the twiddle factors, it also has the most regular signal flow structure. These advantages, along with other important details, can be better visualized through the simple example of an 8-point FFT. As shown in Fig. 3.8, the 8-point FFT is first decomposed into two 4-point FFTs (M = 4, L = 2) with the inter-stage twiddle factor  $W_8^{lq}$  (l = 0, 1 and q = 0, 1, 2, 3; the cases with l = 0 are the trivial multiply-by-unity and not shown, but the q = 0 case is). The 4-point FFT can be either computed with the indicated Radix-4 butterfly or be further decomposed into two 2-point FFTs with additional inter-stage twiddle factors  $W_4^{lq}$ (l = 0, 1 and q = 0, 1). In the latter case, the entire signal flow graph (SFG) consists of only Radix-2 butterflies. As an interesting detail, the output order differs for the two cases. The mixed-radix decomposition produces a partially bit-reversed output index when the input is naturally ordered. Exactly which two of the three bits (for indexing from 0 to 7) are reversed further depends on whether the 8-point FFT is decomposed into two 4-point FFTs or four 2-point FFTs. In comparison, the output index in the Radix-2 case is fully bit-reversed with no ambiguity. Such regularity is much appreciated for efficient hardware design and control.
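The fully bit-reversed output order of the Radix-2 case can be reproduced with a minimal in-place decimation-in-frequency (DIF) sketch. This is a generic textbook formulation written for illustration, not the SFG of Fig. 3.8, and the function names are hypothetical.

```python
import numpy as np


def fft_radix2_dif(x):
    """In-place radix-2 DIF FFT; the result is left in bit-reversed order."""
    a = np.array(x, dtype=complex)
    n = a.size
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    span = n
    while span > 1:
        half = span // 2
        w = np.exp(-2j * np.pi * np.arange(half) / span)  # twiddles W_span^k
        for start in range(0, n, span):
            top = a[start:start + half].copy()
            bottom = a[start + half:start + span].copy()
            a[start:start + half] = top + bottom           # butterfly sum
            a[start + half:start + span] = (top - bottom) * w  # difference, then twiddle
        span = half
    return a


def bit_reverse_indices(n):
    """Index permutation that reverses the log2(n)-bit binary index."""
    bits = n.bit_length() - 1
    return np.array([int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)])


x = np.arange(8, dtype=float)
scrambled = fft_radix2_dif(x)
natural = scrambled[bit_reverse_indices(8)]
print(np.allclose(natural, np.fft.fft(x)))
```

In this formulation the last stage has span 2, so its only twiddle is unity, in line with the trivial last-stage multiplications of the Radix-2 flow.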

Generally, a Radix-*r* FFT undergoes  $\log_r N$  stages of decomposition with N/r Radix-*r* butterflies per stage. Each butterfly requires *r* complex additions (subtraction counts as addition) and r - 1 complex multiplications. In the case of the example above, only eight, instead of  $(8/2) \cdot \log_2 8 = 12$ , multiplications are needed because the twiddle factors of the last stage are always equal to one thanks to r = 2. The SFG can be directly translated into hardware as the direct-mapped (DM) fully parallel architecture, which has the lowest execution time, or latency if each stage is pipelined. However, it requires all butterflies to be physically implemented and therefore demands the most area for logic and interconnect. The opposite approach is to reuse one single butterfly unit and rely on memory devices, such as RAM and ROM, to carefully schedule the inputs, the intermediate outputs, and the twiddle factors to be in the right place at the right time. Since RAMs and ROMs are typically very area-efficient, the memory-based time-multiplexed architecture, as shown in Fig. 3.9 (a), has the smallest area but suffers the longest execution time. It is ill-suited for our application considering the high channel count and the high accumulation efficiency requirement. A good middle ground between the two extremes is the single-path delay feedback (SDF) architecture. As illustrated in Fig. 3.9 (b) for the 8-point FFT, the architecture retains the three stages visible in the direct-mapped architecture but collapses each stage into one butterfly. Storage hardware is used for pipelining the input and output of each stage. By limiting time-multiplexing to only within a stage, not only is the latency improved from the  $O(N^2)$ dependence to O(N), the scheduling is also greatly simplified such that only clocked delays are needed. The interconnect gets further streamlined and localized as a result.



Fig 3.8. Two ways to decompose an 8-point FFT using Radix-4 and Radix-2 butterflies.



**Fig 3.9.** Diagram of (a) memory-based time-multiplexed architecture; (b) the SDF architecture and general tradeoffs between time-multiplexed and parallel implementations.

Despite its simplicity, the Radix-2 SDF architecture still requires  $(\log_2 N - 1)$  complex multipliers, which are resource-heavy especially for multi-bit operands. Following the design methodology in [4], it is recognized that the twiddle factors for small *N* contain only a handful of recurring numerical values that can be estimated using the canonical signed digit (CSD) representation. Multiplication with any of these values is implemented instead with constant multipliers that involve only shift-and-add operations. As indicated in Table 6, three such repeating values exist for *N* = 16 and none requires more than four adders. For *N* = 32, not only is the number of repeating values more than doubled but the representation of four new values necessitates more adders and deeper scaling. Since the twiddle factors are sets of *N* equally distributed points on the complex plane unit circle, it makes sense that more arithmetic operations are needed to differentiate between any two points for a larger *N*. In this work, the CSD-based constant multiplier technique is applied to FFTs with  $N \leq 32$ .

| Recurring Value | CSD Representation                             | Adders | Associated Twiddle Factors                        |
|-----------------|------------------------------------------------|--------|---------------------------------------------------|
| -j              | -                                              | 0      | $W_{16}^4, W_{32}^8$                              |
| 0.7071          | $2^{-1}+2^{-3}+2^{-4}+2^{-6}+2^{-8}$           | 4      | $W_{16}^2, W_{16}^6, W_{32}^4, W_{32}^{12}$       |
| 0.9238          | $2^{0}-2^{-3}+2^{-5}+2^{-6}+2^{-9}$            | 4      | $W_{16}^1, W_{16}^3, W_{16}^5, W_{16}^7$          |
| 0.3828          | $2^{-2}+2^{-3}+2^{-7}$                         | 2      | $W_{32}^2, W_{32}^6, W_{32}^{10}, W_{32}^{14}$    |
| 0.9807          | $2^{0}-2^{-6}-2^{-8}+2^{-12}$                  | 3      | $W_{32}^1, W_{32}^7, W_{32}^9, W_{32}^{15}$       |
| 0.1951          | $2^{-3}+2^{-4}+2^{-7}-2^{-12}$                 | 3      | $W_{32}^1, W_{32}^7, W_{32}^9, W_{32}^{15}$       |
| 0.8315          | $2^{0}-2^{-3}-2^{-5}-2^{-6}+2^{-8}-2^{-11}$    | 5      | $W_{32}^3, W_{32}^5, W_{32}^{11}, W_{32}^{13}$    |
| 0.5556          | $2^{-1}+2^{-4}-2^{-7}+2^{-10}$                 | 3      | $W_{32}^3, W_{32}^5, W_{32}^{11}, W_{32}^{13}$    |

Table 6. CSD Representation of Recurring Values for Select Twiddle Factors
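The shift-and-add idea behind the constant multipliers can be illustrated as follows. The signed power-of-two terms are chosen so that each sum reproduces the four-digit recurring value listed in Table 6; in particular, the leading signs and exponents used here for 0.9238 and 0.3828 are the ones that make the sums match and should be treated as assumptions. The helper names and the 12-bit fractional scaling are likewise illustrative.

```python
import math

# (label, exact value, signed power-of-two terms as (sign, exponent))
CSD_TABLE = [
    ("0.7071", math.cos(math.pi / 4),      [(+1, -1), (+1, -3), (+1, -4), (+1, -6), (+1, -8)]),
    ("0.9238", math.cos(math.pi / 8),      [(+1, 0), (-1, -3), (+1, -5), (+1, -6), (+1, -9)]),
    ("0.3828", math.sin(math.pi / 8),      [(+1, -2), (+1, -3), (+1, -7)]),
    ("0.9807", math.cos(math.pi / 16),     [(+1, 0), (-1, -6), (-1, -8), (+1, -12)]),
    ("0.1951", math.sin(math.pi / 16),     [(+1, -3), (+1, -4), (+1, -7), (-1, -12)]),
    ("0.8315", math.cos(3 * math.pi / 16), [(+1, 0), (-1, -3), (-1, -5), (-1, -6), (+1, -8), (-1, -11)]),
    ("0.5556", math.sin(3 * math.pi / 16), [(+1, -1), (+1, -4), (-1, -7), (+1, -10)]),
]


def csd_value(terms):
    """Numerical value of a signed power-of-two expansion."""
    return sum(sign * 2.0 ** exp for sign, exp in terms)


def csd_multiply(sample, terms, frac_bits=12):
    """Multiply an integer sample by the constant using shifts and adds only.

    The returned integer carries frac_bits fractional bits.
    """
    acc = 0
    for sign, exp in terms:
        acc += sign * (sample << (frac_bits + exp))
    return acc


for label, exact, terms in CSD_TABLE:
    approx = csd_value(terms)
    print(label, len(terms) - 1, "adders, error", f"{abs(approx - exact):.1e}")
```

A multiplier built this way needs one adder fewer than the number of terms, which is how the adder counts in the table arise.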

To minimize the number of complex multipliers for larger *N*, the CSD must be combined with a larger *r* so that the number of twiddle multiplications due to factorization is inherently smaller whereas the internal *r*-point FFTs can rely solely on constant multipliers. The Radix-2<sup>K</sup> algorithm [47] provides a convenient way to scale *r* from 2 to 2<sup>K</sup> while retaining the structural regularity and wiring simplicity of the basic Radix-2 algorithm. Going from Radix-2 to Radix-2<sup>K</sup> is straightforward. Reusing the previous 8-point FFT example, the SFG to the left of Fig. 3.10 uses the regular Radix-2 factorization. The twiddle factors are represented by the single-digit short-hand notation representing *k*. By the associative property of multiplication (addition in the exponent) and decimation-in-frequency (DIF), partial factors in the first stage twiddle factors can be shifted to the second stage following the butterfly arithmetic to form the Radix-2<sup>2</sup>-based (identical to Radix-2<sup>3</sup> in this case) SFG to the right. The first stage twiddle factors then become trivial ( $W_8^0 = 1$ ,  $W_8^2 = -j$ ) and reduce the number of arithmetic operations for many architectures. The twiddle factors are also regrouped, which is better illustrated with color in Fig. 3.11 for the 16-point Radix-2<sup>4</sup> case. Not only is the added regularity easily exploited for supporting reconfigurable input sizes [4][5], it also simplifies the multiplexor control governing all the constant multipliers at each stage in the SDF architecture. This point is illustrated with the Radix-2<sup>4</sup> SDF chain and its control timing diagram in Fig. 3.12.



Fig 3.10. Example Radix-2 to Radix-2<sup>2</sup> transformation in an 8-point DIF FFT.



**Fig 3.11.** Radix- $2^{K}$  (K = 1, 2, 3, 4) factorization with regrouped twiddle factors.



Fig 3.12. Multiplier controls in the Radix-2<sup>4</sup> SDF coincide better with an incrementing counter

The FFT in this work is naturally partitioned into 32 parallel sub-FFTs and a 32-point pipelined stage. While 32 is a high number for parallelism and may not yield the best area, it is necessary, except for the S-VI, to slow down the clock enough to accommodate the wide accumulator. Each sub-FFT adopts the Radix-2<sup>K</sup> SDF architecture. The choices of radices are set to minimize the number of stages, which makes the square root of the sub-FFT size a good starting point. Overall, only two full multipliers are needed per parallel stream, one for the intra-stage twiddle-factor multiplication within the SDF and the other for the inter-stage multiplication at the interface between parallel and pipeline stages, as shown in Fig. 3.13. The 32-point pipeline FFT employs Radix-2<sup>2</sup> factorization but the direct-mapped architecture benefits little in terms of hardware saving. Instead, the regrouping of trivial and non-trivial twiddle factors is exploited to balance the propagation delay between different pipeline stages. The non-trivial twiddle-factor multiplications, all implemented using constant multipliers and thus taking a few adder delays to complete, are each a pipeline stage. The two butterflies before and after trivial twiddle factors are grouped together for a minimum of two adder delays. Fig. 3.14 presents the SFG of the 32-point FFT and Table 7 summarizes the FFT design specifications for each generation of spectrometer.



Fig 3.13. The FFT architecture in this work highlighting the parallel SDF stages.



Fig 3.14. The SFG and pipeline partition of the 32-point Radix-2<sup>2</sup> FFT.

| Design | FFT Size | Architecture and Partition               | f<sub>clk-adc</sub> | f<sub>clk-dsp</sub> |
|--------|----------|------------------------------------------|---------------------|---------------------|
| S-VI   | 2048     | 32-way 64-pt (8-by-8) SDF + 32-pt DM     | 3GHz                | 93.75MHz            |
| S-VII  | 8192     | 32-way 256-pt (16-by-16) SDF + 32-pt DM  | 6GHz                | 187.5MHz            |
| S-VIII | 16384    | 32-way 512-pt (32-by-16) SDF + 32-pt DM  | 12GHz               | 375MHz              |

Table 7. FFT Specifications for the Proposed Spectrometer SoCs

### 3.3.2 Accumulation-Aware Real-Data FFT Algorithm

Most real-world spectrum analysis problems involve real-valued data. This is even more so for spectroscopy due to the complexity of the THz frontend. Many algorithms are available to directly exploit the spectral symmetry of real inputs in an attempt to reduce hardware cost by half. The ideal saving is always compromised by degradation in regularity compared to the complex FFT (CFFT) algorithm. However, it is the post-FFT steps inherent in the real FFT (RFFT) algorithms that render them unfit for our application.

One of the more commonly adopted approaches for RFFT is the real-from-complex strategy. Given a 2*N*-point real signal x[m], m = 0, 1, ..., 2N - 1, it is possible to compute the RFFT using an *N*-point CFFT. This technique is sometimes called the packing algorithm [49] because it takes the even- and odd-indexed inputs of x[m] and packs them into a new complex signal  $y[n] = x[2n] + j \cdot x[2n + 1]$ , n = 0, 1, ..., N - 1. Then the *N*-point CFFT is applied to obtain Y[k], k = 0, 1, ..., N - 1. Up to this point, all the optimization techniques discussed in the previous section apply and the hardware saving is exactly 50%. In order to go from Y[k] to X[k], an additional unpacking step is required according to the following equations,

$$\operatorname{Re}\{X[k]\} = \frac{1}{2}\operatorname{Re}\{Y[k] + Y[N-k]\} + \frac{1}{2}\cos\left(\frac{k\pi}{N}\right)\operatorname{Im}\{Y[k] + Y[N-k]\} - \frac{1}{2}\sin\left(\frac{k\pi}{N}\right)\operatorname{Re}\{Y[k] - Y[N-k]\}$$
$$\operatorname{Im}\{X[k]\} = \frac{1}{2}\operatorname{Im}\{Y[k] - Y[N-k]\} - \frac{1}{2}\sin\left(\frac{k\pi}{N}\right)\operatorname{Im}\{Y[k] + Y[N-k]\} - \frac{1}{2}\cos\left(\frac{k\pi}{N}\right)\operatorname{Re}\{Y[k] - Y[N-k]\}$$

If all the Y[k] become available at once, such as in the direct-mapped architecture, the unpacking process looks to be no more than another butterfly stage with base-*N* twiddle factor rotation. For any other pipelined implementation, especially those with parallelism applied for power efficiency, Y[k] must wait for Y[N - k] in order to compute X[k]. Unfortunately, there is no guarantee on how many cycles apart they are, as the output order of the FFT is inherently scrambled. The waiting in this case necessitates practically unbounded hardware resources for buffering, which takes either one DFF bank or memory block *per cycle*. Note that the unpacking arithmetic (including buffering) occurs at the end of the *N*-point CFFT, which is much more expensive than if it is done at the input for a system with tapered bit-width per step of calculation.
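The dependence of X[k] on both Y[k] and Y[N - k] is easy to see in software. The sketch below writes the unpacking step in compact complex form, $X[k] = \frac{1}{2}(Y[k] + Y^*[N-k]) - \frac{j}{2} W_{2N}^k (Y[k] - Y^*[N-k])$ with $W_{2N}^k = e^{-j\pi k/N}$, which follows from the twiddle-factor convention of Eq. (3.2). It is a NumPy illustration with a hypothetical function name, not the hardware scheduling discussed above.

```python
import numpy as np


def rfft_via_packing(x):
    """First N bins of the 2N-point DFT of real x, using one N-point complex FFT."""
    x = np.asarray(x, dtype=float)
    assert x.size % 2 == 0
    n = x.size // 2
    y = x[0::2] + 1j * x[1::2]            # pack even/odd samples: y[n] = x[2n] + j*x[2n+1]
    Y = np.fft.fft(y)
    k = np.arange(n)
    Y_mirror = np.conj(Y[(-k) % n])       # Y*[N-k], with index N wrapping to 0
    even = 0.5 * (Y + Y_mirror)           # DFT of the even-indexed samples
    odd = -0.5j * (Y - Y_mirror)          # DFT of the odd-indexed samples
    return even + np.exp(-1j * np.pi * k / n) * odd


rng = np.random.default_rng(2)
x = rng.standard_normal(64)
print(np.allclose(rfft_via_packing(x), np.fft.fft(x)[:32]))
```

In software the mirrored access `Y[(-k) % n]` is a free index lookup; in a pipelined datapath the same access is what forces Y[k] to be buffered until Y[N - k] arrives.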

A dedicated RFFT algorithm does not escape this problem. Fig. 3.15 shows the SFG of the 16-point DIF RFFT proposed in [49]. The algorithm starts with a CFFT but removes redundancy based on zero cancellation and conjugate symmetry. Although no unpacking is needed at the output, the real and imaginary parts of each output point are still separated unless the exact 4-point grouping is adopted. Since averaging or accumulation is applied to the power of the signal, which is the sum of the squared real and imaginary parts, not having them together again entails buffering with storage devices. In comparison, the regular pipelined CFFT always outputs the complete complex pair regardless of partitioning and therefore requires no buffering for subsequent operations. The accumulation can be done in place with a single block of memory and without any disturbance to the pipeline flow. For the 2N-point RFFT, this work simply pads zeros to the imaginary parts of the input signal. The redundant N-point output is left unconnected and the accumulation is only performed on the remaining N channels. From a strictly RTL point of view, the redundancy is confined to the low-bit-width (9- to 21-bit) FFT portion of the design whereas the wide-bit-width (48-bit) PSD calculation and accumulation is spared. In addition, modern synthesis tools are smart enough to identify and remove redundant hardware such as multiply-by-zero and unconnected outputs, alleviating the penalty of redundancy even more.




Fig 3.15. The SFG of the 16-point DIF RFFT in [49].

### 3.3.3 High Speed Accumulator with 48-bit Output

It is rather interesting that the simple act of addition not only holds the key to system SNR improvement but also drives the choice of FFT algorithm including parallel-pipeline partition. To maximize both the averaging count and efficiency as defined in Section 2.4.3, the bit-width of the accumulator is set to 48, the maximum we can assign to data given the 64-bit external host and the header/address bits reserved for bus communication. Each addition must also complete within one clock cycle to avoid pipeline stall. In terms of design, the S-VI uses gated DFFs to implement the unit accumulator. Although it has the shortest clock-to-output delay (t<sub>C2Q</sub>), each DFF bank occupies about 5X the area of a 48-bit-by-128-word single-port (SP) SRAM. With the channel count quadrupled in the S-VII and further doubled in the S-VIII, the DFF bank is replaced by two single-port SRAMs in a ping-pong configuration. The additional timing margin allocated for memory access does worsen the timing closure of this inherent feedback path, which forbids extra pipeline stages unless more SRAMs are cascaded for buffering. This is in contrast to places with no feedback, where long arithmetic operations can be broken down into multiple pipeline stages. One such example is the inter-stage twiddle-factor multiplication. Fig. 3.16 illustrates both the DFF-based and the SRAM-based unit accumulator designs. In the latter case, the pipelining register in the actual implementation is placed within the 48-bit adder, that is, the adder is manually and asymmetrically broken down into two shorter adders to better align with the often-unequal memory write and read times.



**Fig 3.16.** Unit accumulators and the critical path in the SRAM-based design.

### 3.3.4 Efficient Windowing with On-the-Fly Coefficient Generation

The direct application of FFT, which implies a rectangular window on a finite set of data, has two apparent drawbacks, namely spectrum leakage and scalloping loss. The former can be attributed to the strong sidelobes of the sinc function and the latter to the non-flat nature of the single-bin frequency response. A conventional window, such as the raised cosine function, has smaller sidelobes and therefore less leakage, but its broadened main lobe compromises frequency resolution. The polyphase filter bank (PFB) technique [51], on the other hand, not only provides excellent suppression of out-of-band signals with its passband filtering characteristics, but also produces a flat response across the entire passband for a sinc-weighted implementation. Fig. 3.17 plots and compares the frequency responses of the rectangular window, the Hann window, the Hann window with 4X input data, the 4-tap PFB, and the 8-tap PFB. As the number of taps goes up, the response increasingly resembles a brick wall, which is expected as a more complete sinc sequence in the time domain corresponds to a better pulse in the frequency domain.



Fig 3.17. Frequency responses of different windowing techniques around the center bin.

For an *N*-point FFT, a *P*-tap PFB works by applying a sinc window of length  $P \times N$  on data *P* times the FFT size and then superimposing the *N*-point segments together. Such an operation is also known as "weighted overlap-add (WOLA)" or "window-presum FFT" [51]. To implement this technique, at least  $(P - 1) \times N$  data points must be stored for the weighting process, which can be expensive in hardware terms even for a moderate number of taps. The 4-tap PFB is considered to give the best tradeoff between performance and cost. The S-VI is designed without memory macros and as a result large amounts of DFF-based delay buffers are crammed into a pre-allocated 2mm × 2mm floorplan. The minimal supported configuration for the SRAM in the 65nm process is 2-bit by 128-word, which is about the size needed for the unit PFB buffer in the S-VII (using two half-sized, 3-bit-by-128-word, SP SRAMs to realize the function of a single 3-bit-by-256-word dual-port SRAM). However, the area efficiency of a memory macro decreases for smaller capacities as the memory controller and peripheral circuits take up a larger proportion relative to the bit cells. For the 32-way parallel implementation with unit SRAM macros that each occupy  $0.1\text{mm} \times 0.1\text{mm}$ , the 4-tap PFB then demands a total area of  $0.01 \times 3 \times 32 = 0.96\text{mm}^2$ . Assuming a larger floorplan of 2.5mm × 2.5mm and a routing efficiency of 60%, which is optimistic with three routing layers, the memory macros alone take up more than 25% of the active floorplan (0.96 / (2.5 × 2.5 × 0.6) = 0.256). The S-VII instead adopts the conventional Hann window. With 4X channel count, the loss in resolution is tolerable as indicated in Fig. 3.17 but the channel flatness suffers compared to the S-VI. The PFB makes its comeback in the S-VIII with a compact, SRAM-based design in the 28nm technology.
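The weight-and-fold operation described above can be sketched in a few lines of NumPy. The prototype used here is a plain centered sinc with one zero crossing every *N* samples, which is an assumption made for illustration and may differ in indexing and scaling from the hardware coefficients; the function names, FFT size, and test bin are also hypothetical.

```python
import numpy as np


def pfb_frontend(x, n_fft, taps):
    """Weight taps*n_fft samples with a sinc and fold them into n_fft points."""
    length = taps * n_fft
    n = np.arange(length)
    # Assumed prototype: centered sinc, zero crossings spaced n_fft samples apart
    proto = np.sinc((n - (length - 1) / 2) / n_fft)
    weighted = np.asarray(x)[:length] * proto
    return weighted.reshape(taps, n_fft).sum(axis=0)   # superimpose the segments


def tone_response(offset_bins, n_fft=64, taps=4, k0=16):
    """Magnitude in bin k0 for a complex tone offset_bins away from its center."""
    n = np.arange(taps * n_fft)
    tone = np.exp(2j * np.pi * (k0 + offset_bins) * n / n_fft)
    pfb = np.abs(np.fft.fft(pfb_frontend(tone, n_fft, taps))[k0])
    rect = np.abs(np.fft.fft(tone[:n_fft])[k0])
    return pfb, rect


pfb_center, rect_center = tone_response(0.0)
pfb_leak, rect_leak = tone_response(1.5)   # tone midway between the next two bins
print("rectangular leakage:", rect_leak / rect_center)
print("4-tap PFB leakage:  ", pfb_leak / pfb_center)
```

Only the first `taps * n_fft` samples are consumed per output frame here; a streaming implementation would instead slide by `n_fft` samples per frame, which is where the $(P - 1) \times N$ storage requirement comes from.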

An integral part of the windowing process is the generation of the weighting coefficients. In the case of the 4-tap PFB, the sinc weights span 32768 points and are simultaneously applied at  $4 \times 32 = 128$  locations in our parallel architecture. Rather than resorting to 128 pieces of ROM, the coefficients are generated on-the-fly with adders and multiplexers through mathematical simplification. Based on the equation,

$$\omega[n] = \operatorname{sinc}\left(\frac{n}{N'-1}\right) = \frac{\sin\left(\frac{\pi n}{N'-1}\right)}{\frac{\pi n}{N'-1}}, n = 0, 1, 2, \dots, N', N' = P \times N$$

a sinc function can be expressed as the division between a sine and a linear function. The sine function can be estimated with ease using the first-order approximation detailed in [4] whereas the division can be implemented using the CORDIC algorithm with simple combinational logic. Symmetry in both the trigonometric functions and the sinc function is further exploited to reduce design effort. As illustrated in Fig. 3.18, to generate sine and cosine values across  $2\pi$ , only the argument in  $[0, \pi/4)$  is needed; all the relevant values in the other seven octants can be derived based on a permutation of the signs and/or the real and imaginary parts. The sinc function by itself has even symmetry but can be partitioned into four sections as a result of trigonometric symmetry. Only a quarter of the arithmetic operations need to be designed, with the rest simply involving a change of order on the control bits. The same technique is also applied to generate the raised cosine weights (the Hann window) as well as the intra- and inter-stage twiddle factors in the FFT, for both are derived from trigonometric functions as shown below.

$$\omega_{Hann}[n] = \frac{1}{2} \left[ 1 - \cos\left(\frac{2\pi n}{N-1}\right) \right], n = 0, 1, 2, \dots, N-1$$
$$W_N^{nk} = e^{-j\frac{2\pi nk}{N}} = \cos\left(\frac{2\pi nk}{N}\right) - j\sin(\frac{2\pi nk}{N})$$
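The octant symmetry can be demonstrated with a small lookup routine that only ever evaluates sine and cosine on the first octant and obtains the other seven by swapping and negating. The calls to `math.cos` and `math.sin` stand in for the hardware approximation, and the function name and table layout are assumptions for illustration.

```python
import math


def sincos_octant(index, n_points):
    """Return (cos, sin) of 2*pi*index/n_points from first-octant values only."""
    assert n_points % 8 == 0
    index %= n_points
    eighth = n_points // 8
    octant, rem = divmod(index, eighth)
    # Odd octants are mirrored: measure the angle back from the next boundary
    base = rem if octant % 2 == 0 else eighth - rem
    theta = 2 * math.pi * base / n_points      # always within [0, pi/4]
    c, s = math.cos(theta), math.sin(theta)    # stand-in for the approximation
    table = [
        (c, s),     # octant 0
        (s, c),     # octant 1
        (-s, c),    # octant 2
        (-c, s),    # octant 3
        (-c, -s),   # octant 4
        (-s, -c),   # octant 5
        (s, -c),    # octant 6
        (c, -s),    # octant 7
    ]
    return table[octant]


def twiddle(k, n_points):
    """W_N^k = cos(2*pi*k/N) - j*sin(2*pi*k/N), built from the octant lookup."""
    c, s = sincos_octant(k, n_points)
    return complex(c, -s)


print(twiddle(8, 32))   # W_32^8, expected to be close to -j
```

The three most significant bits of the index select the octant, so the sign and swap decisions reduce to simple control logic on those bits.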

Lastly, the impact of coefficient quantization on the passband frequency response is studied. Fig. 3.19 compares the frequency responses with the ideal (double precision floating point) coefficients, the ideally quantized (fixed-point truncation in MATLAB) coefficients, and the hardware generated coefficients for the case of 6-bit fractional precision. Negligible difference can be observed within the center bin. The reason that 6-bit is considered first is that the number of CORDIC stages that can fit within one clock cycle without upsizing the logic is six. A smaller bit-width also helps to reduce hardware cost for the full multipliers in the PFB. In addition, the 6-bit fractional precision serves as the baseline for the FFT computation, whose data path maintains an 8-bit fractional precision in the SDF stages, gradually tapers down in the pipeline stage, and eventually turns full integer in the accumulator. The combined integer and fraction bit-width in the pipelined stage and before PSD is 20. Unlike those in the windowing coefficients or the input data, the truncation errors in the twiddle factors can get amplified at the output. Higher fractional precisions, 11-bit and 15-bit respectively, are therefore assigned to the intra- and inter-stage twiddle factors. These values are in good agreement with the FFT SNR study in [5].
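A simplified version of the quantization comparison can be run on the Hann weights defined above. The sketch truncates the coefficients to 6 fractional bits and compares the normalized response across the center bin against the double-precision window. The window length and helper names are hypothetical, and the hardware-generated coefficients of Fig. 3.19 are not modeled.

```python
import numpy as np


def truncate_fraction(values, frac_bits):
    """Fixed-point truncation: keep frac_bits binary digits after the point."""
    scale = 2 ** frac_bits
    return np.floor(np.asarray(values, dtype=float) * scale) / scale


def response_db(window, offsets_bins):
    """Response to tones at the given bin offsets, normalized to the window sum."""
    n = np.arange(window.size)
    tones = np.exp(2j * np.pi * np.outer(offsets_bins, n) / window.size)
    return 20 * np.log10(np.abs(tones @ window) / np.sum(window))


N = 256                                            # hypothetical window length
n = np.arange(N)
hann = 0.5 * (1.0 - np.cos(2 * np.pi * n / (N - 1)))
hann_q6 = truncate_fraction(hann, 6)

offsets = np.linspace(-0.5, 0.5, 21)               # span of the center bin
difference = response_db(hann_q6, offsets) - response_db(hann, offsets)
print("max coefficient error:", np.max(hann - hann_q6))
print("max response change in center bin (dB):", np.max(np.abs(difference)))
```

The coefficient error is bounded by one least significant bit (2<sup>-6</sup>), and the resulting change in the center-bin response stays small in this example.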



**Fig 3.18.** Illustration of the symmetry in the trigonometric and sinc functions and the 6-stage CORDIC divider (bottom).



Fig 3.19. Frequency responses with different coefficient digitization schemes.

## 3.4 Laboratory Testing and Field Application

To validate our designs for spectroscopic observations, careful laboratory measurements were conducted first. For a size and weight comparison, the S-VI is wire-bonded to a PCB module with supporting peripherals such as supply regulators and USB controllers. The 3-D printed casing for the PCB measures  $10 \times 6 \times 2$  cm<sup>3</sup> and weighs about 120 grams. The module is then connected to a 500- to 600-GHz receiver developed at JPL. Fig. 3.20 shows the entire setup. The receiver output is coupled to a gas cell containing 1.0mTorr of water. The background of the gas cell is provided by a liquid nitrogen (L-N<sub>2</sub>) soaked absorber that emulates the cold background of space necessary to enhance the detection contrast. The receiver LO is tuned to 555.900GHz in order to observe the well-known H<sub>2</sub>O rotational response at 556.935GHz. Two measurements, one with water and the other without, were conducted in the so-called hot-and-cold configuration where the difference is the final output. Such a correlated double sampling technique is widely adopted for scientific measurement to remove instrument artifacts and other slow-changing fluctuations. Fig. 3.21 plots the spectrometer output, which aligns well with the expected line shape based on target abundance and temperature.



Fig 3.20. Laboratory measurement setup for the S-VI.



Fig 3.21. Measured and expected spectral feature of H<sub>2</sub>O at 556.935 GHz.

The S-VII measurements shown in Fig. 3.22 were performed with the 183GHz InP-CMOS hybrid receiver developed in house. Samples flow through the 2.5m-long flow cell at a reduced pressure of around 10mTorr, again with liquid nitrogen in the background. The LO frequency is set to 183.43GHz and 182.81GHz respectively for methyl cyanide (CH<sub>3</sub>CN) and water, with the IF at 500MHz. Unlike the hot-and-cold measurement with the S-VI, the frequency switching routine [16][52][53] is used here as the alternative correlated double sampling method to calibrate for receiver gain variation and other artifacts. A small dithering, around 2MHz, is applied to the LO frequency and the difference spectrum is taken as the output. The subtraction results in the distinctive up-and-down spikes where spectral features are expected. They are clearly observed in the measurement results in Fig. 3.22 (b) and (c). Spurious signals marked with "X" are interference from facility wireless and nearby cell towers.



**Fig 3.22.** (a) Diagram of measurement setup for the S-VII with the 180GHz receiver; (b) and (c) spectral features for CH<sub>3</sub>CN and H<sub>2</sub>O respectively with the frequency dithering technique.

Table 8 compares the integrated spectrometer SoCs developed in this work to a few reported prior attempts as well as one instance of a high-performance FFT processor [54]. Just as intended, our designs enjoy the highest level of integration. In terms of efficiency, a simple Figure-of-Merit that considers only the FFT size, Nyquist bandwidth, and power consumption is proposed. Note that such a simplified FoM should put our work at a disadvantage because the reported power consumption includes the integrated ADC, PLL, and accumulator. Even so, the S-VI achieves a comparable efficiency to [21], which contains only the PFB and the FFT. The lower FoM of both designs adopting the 4-tap PFB reflects just how expensive the technique is. Without this technique, the S-VII achieves an FoM that is even higher than that of the FFT processor alone. The huge difference in performance and efficiency between the S-VII and the S-II is a manifestation of the effectiveness of our optimization, from conception to implementation. The S-VII has been baselined for NASA's spectroscopic missions and has been included in various missions such as the SLS and the CAMLS. A picture of it flying in space during a balloon mission is included in Fig. 3.23.

| Parameter             | S-VII       | S-VI        | S-II [1]    | [54]  | [21] |
|-----------------------|-------------|-------------|-------------|-------|------|
| Bandwidth (GHz)       | 3.0         | 1.3         | 1.1         | 1.2   | 0.78 |
| CMOS Technology       | 65nm        | 65nm        | 65nm        | 90nm  | 90nm |
| Metal Stack           | 1P6M (3X)   | 1P6M (3X)   | 1P6M (3X)   | 1P9M  | 1P7M |
| SRAM Usage            | Yes         | No          | No          | Yes   | Yes  |
| Windowing             | Hann        | PFB4        | Hann        | N/A   | PFB4 |
| FFT Size              | 8192        | 2048        | 512         | 2048  | 4096 |
| Integrated ACC        | Yes         | Yes         | Yes         | No    | No   |
| Integrated ADC        | Yes (3-bit) | Yes (3-bit) | Yes (7-bit) | No    | No   |
| Integrated PLL        | Yes         | Yes         | No          | No    | No   |
| Total Power (mW)      | 1500        | 650         | 188         | 159   | 710  |
| FoM = (Size·BW)/Power | 16.38       | 4.10        | 2.95        | 15.46 | 4.50 |

Table 8. Comparison with Prior Arts
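As a quick cross-check of the last row of Table 8, the FoM can be recomputed from the tabulated FFT size, bandwidth (GHz), and total power (mW). The dictionary layout and function name below are illustrative only. With the rounded table inputs, the S-II entry evaluates to roughly 3.00 rather than the listed 2.95, a small difference that may come from rounding of the tabulated values.

```python
# (FFT size, bandwidth in GHz, total power in mW), copied from Table 8
DESIGNS = {
    "S-VII": (8192, 3.0, 1500.0),
    "S-VI": (2048, 1.3, 650.0),
    "S-II": (512, 1.1, 188.0),
    "[54]": (2048, 1.2, 159.0),
    "[21]": (4096, 0.78, 710.0),
}

def spectrometer_fom(fft_size, bandwidth_ghz, power_mw):
    """FoM = (Size * BW) / Power, in points * GHz per mW."""
    return fft_size * bandwidth_ghz / power_mw

if __name__ == "__main__":
    for name, params in DESIGNS.items():
        print(f"{name:6s} FoM = {spectrometer_fom(*params):.2f}")
```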



Fig 3.23. The S-VII-based spectroscopy instrument onboard the RECTANGLE flight.

# CHAPTER 4

# Low-Noise Inverter-Ring-based Wideband Millimeter-Wave Synthesis

## 4.1 Introduction

As mmW applications continue to proliferate, the notion of a wideband mmW system is also evolving. On the communication front, wireless transceivers [2, 55-57] built for standards such as the old IEEE 802.15.3c and the present IEEE 802.11ad/ay are well recognized wideband systems. They require LOs with roughly 15% tuning range (9 / 60 = 0.15) in order to capture the 9GHz unlicensed spectrum around 60GHz. In comparison, as previously indicated in Fig. 1.3, the more recent mmW 5G standard inspires high-purity signal sources with tuning range greater than 50% for the support of multi-band wireless and backhaul communication from 24GHz to well above 40GHz.

The pursuit of signal bandwidth tends to be more aggressive for sensing applications, and with good reason. The bandwidth, or equivalently the tuning range, usually translates to something more meaningful than just data rate. For example, modulation bandwidth,  $\Delta f/f$ , determines the range resolution of an FMCW radar, and an automotive system that can distinguish pedestrians from cars is undoubtedly more valuable than one that cannot. In THz spectroscopy, an ultra-wideband CMOS source bridges the coverage gap between such a low-SPaW alternative and the conventional Dielectric-Resonator-Oscillator-based solutions [6]. The added range at the beginning of the LO chain greatly simplifies frequency planning, which would otherwise involve multiple narrow-band LO chips. Fig. 4.1 illustrates how a 24-40GHz first stage can be repurposed for sensing in the 75-110GHz (the full W-band), 180GHz, 340GHz, and 560GHz bands with module-based wideband frequency multipliers. In addition, having access to an affordable and flexible mmW source helps to bring mmW sensing technology closer to our everyday life, with examples such as medical diagnosis [11] and microscopic detection [58][59].
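The frequency-planning argument can be made concrete with a small calculation. The sketch below lists the integer multiplication factors for which a 24-40GHz source covers a given target band; it is only an illustration, and the factors it returns are not necessarily those used in the multiplier chains of Fig. 4.1.

```python
def fractional_range(f_lo, f_hi):
    """Tuning range normalized to the center frequency."""
    return (f_hi - f_lo) / ((f_hi + f_lo) / 2.0)

def covering_multipliers(src_lo, src_hi, band_lo, band_hi, max_m=32):
    """Integer factors M such that [M*src_lo, M*src_hi] contains the target band."""
    return [m for m in range(1, max_m + 1)
            if m * src_lo <= band_lo and m * src_hi >= band_hi]

if __name__ == "__main__":
    print(f"24-40GHz tuning range: {fractional_range(24.0, 40.0):.0%}")
    targets = {"W-band": (75.0, 110.0), "180GHz": (180.0, 180.0),
               "340GHz": (340.0, 340.0), "560GHz": (560.0, 560.0)}
    for name, (lo, hi) in targets.items():
        print(name, covering_multipliers(24.0, 40.0, lo, hi))
```

Under these assumptions a single tripler reaches the full W-band, while each of the higher spot frequencies can be reached with several candidate factors, which is the flexibility a wide first-stage range provides.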



Fig 4.1. Frequency planning with a 24-40GHz source

Of course, affordable high performance mmW circuits are only possible with improved integration and better-performing devices in super-scaled CMOS processes. However, as feature size continues to shrink towards single-digit nanometers, the improvement of device  $f_T$  starts to level off while numerous factors, including lower breakdown voltage, thinning backend, higher device noise factor, and tightened design rules, all tend to adversely impact conventional RF and mmW designs. It is with this observation in mind that we embarked on this search for a scalable and digital-friendly architecture that better exploits technology scaling.

In terms of organization, Section 4.2 starts with a brief background on PLL-based frequency synthesizers, followed by discussions on recent design trends at a broader scope in Section 4.3 in an effort to fully justify our design choices at the architectural level. The design and analysis of the proposed inverter-ring-based multi-ratio VCO-and-divider pairing technique is given in Section 4.4. The technique is adopted in a cascaded PLL. Two prototypes, one with an additional reconfigurable frequency extender and the other without, are implemented and Section 4.5 presents the details. Section 4.6 concludes this chapter with measurement results and comparison with prior arts.

## 4.2 PLL-based Frequency Synthesizers

The phase-locked loop is a feedback system that, in its simplest form, consists of a stable frequency reference, a phase detector (PD), a loop filter (LF), and a VCO. The phase difference between the VCO and the reference, which implies negative feedback, is extracted by the phase detector, then averaged and converted by the loop filter to a control voltage,  $V_{CTRL}$ , so that, on average,  $f_{VCO} = f_{REF}$  in the locked state. In real-world applications, a piezoelectric resonator, commonly known as the crystal and available from a few kHz to several hundred MHz, is used as the frequency-setting component. To stabilize VCOs above this range, a frequency divider can be inserted in the feedback path to obtain  $f_{VCO} = N \cdot f_{REF}$ . Fig. 4.2 gives the block diagram of a basic PLL and its phase and noise SFG. The loop gain with the divider can then be calculated as,

$$G(s) = K_{PD}H(s)\left(\frac{K_{VCO}}{s}\right)\left(\frac{1}{N}\right) \tag{4.1}$$

where  $K_{PD}$  is the phase detector gain, H(s) the loop filter transfer function, and  $K_{VCO}$  the VCO sensitivity in rad/s/V.



Fig 4.2. Simplified PLL and its phase and noise transfer model

The closed-loop bandwidth,  $f_{BW}$ , as an indication of how fast the feedback responds, is an important parameter for both the loop filter design and the PLL performance [60]. Within this bandwidth, the VCO phase fluctuation, or noise, can be corrected by the clean reference through the feedback mechanism, which results in a high-pass noise transfer characteristic from the VCO to the PLL output. On the other hand, the input-referred noises from the crystal, phase detector, loop filter, and frequency divider show up at the PLL output low-passed but with a gain of *N*. The closed-loop noise transfer functions for the VCO and the input-referred noises assume the typical feedback-associated forms in Eq. 4.2 and 4.3, respectively.

$$\frac{\Phi_{OUT}}{\mathcal{L}_{VCO}} = \frac{1}{1+G(s)} \tag{4.2}$$

$$\frac{\Phi_{OUT}}{\mathcal{L}_{in}} = \frac{NG(s)}{1+G(s)} \tag{4.3}$$
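The high-pass and low-pass behavior of Eq. 4.2 and 4.3 can be checked numerically. The sketch below assumes the simplest possible case, a constant loop filter gain $H(s) = h$ (a type-I loop), and the parameter values in the example are hypothetical and chosen only to place the loop bandwidth at 1MHz.

```python
import math

def loop_gain(f_hz, k_pd, k_vco, n, h=1.0):
    """Eq. 4.1 evaluated at s = j*2*pi*f, assuming a constant loop filter gain h."""
    s = 2j * math.pi * f_hz
    return k_pd * h * (k_vco / s) / n

def vco_noise_transfer(f_hz, k_pd, k_vco, n, h=1.0):
    """Eq. 4.2: high-pass transfer from VCO noise to the output."""
    return 1.0 / (1.0 + loop_gain(f_hz, k_pd, k_vco, n, h))

def input_noise_transfer(f_hz, k_pd, k_vco, n, h=1.0):
    """Eq. 4.3: low-pass transfer from input-referred noise, with in-band gain N."""
    g = loop_gain(f_hz, k_pd, k_vco, n, h)
    return n * g / (1.0 + g)

if __name__ == "__main__":
    # Hypothetical values: K_PD = 0.5 V/rad, K_VCO = 2*pi*100 MHz/V, N = 50
    args = (0.5, 2.0 * math.pi * 100e6, 50)
    for f in (1e3, 1e6, 1e9):
        print(f, abs(vco_noise_transfer(f, *args)), abs(input_noise_transfer(f, *args)))
```

With these values the VCO noise is attenuated by about 60dB at a 1kHz offset and passes unchanged at 1GHz, while the input-referred noise sees a gain of about 50 in band and rolls off beyond the loop bandwidth.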

The total output phase noise, commonly referred to as the integrated phase noise or jitter, is now a convex function of the loop bandwidth. A clear minimum can be achieved when the in-band and out-of-band noises contribute equally [61]. The optimal bandwidth for a generic PLL,  $f_{OPT}$ , is derived in [62] and re-written here for convenience in Eq. 4.5 and 4.6. The derivation assumes a white input referred phase noise,  $PN_{in}$ , and a VCO with phase noise,

$$\mathcal{L}(\Delta f) = \frac{K_{\omega}}{2(Q\Delta f)^2} \tag{4.4}$$

$$\sigma_{OUT}^2(f_{BW}) = \int_0^\infty 2\mathcal{L}_{OUT}(\Delta f) d\Delta f = \frac{4K_\omega}{2Q^2 f_{BW}} + 4N^2 \cdot PN_{in} \cdot f_{BW} \tag{4.5}$$

$$f_{OPT} = \sqrt{\frac{K_{\omega}}{2Q^2 N^2 \cdot PN_{in}}} \tag{4.6}$$

where  $K_{\omega}$  is the noise coefficient determined by the device and Q is the oscillator quality factor. Note that Eq. 4.4 agrees with Eq. 2.6 but considers white noise sources only and breaks down n into physical parameters.
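Eq. 4.5 and 4.6 can be exercised with a few lines of code to confirm that the optimum sits where the two noise contributions are equal. The numeric values in the example are placeholders without physical significance; only the relationships between them matter.

```python
import math

def jitter_variance(f_bw, k_w, q, n, pn_in):
    """Eq. 4.5: out-of-band VCO term plus in-band input-referred term."""
    vco_term = 4.0 * k_w / (2.0 * q ** 2 * f_bw)
    input_term = 4.0 * n ** 2 * pn_in * f_bw
    return vco_term + input_term

def optimal_bandwidth(k_w, q, n, pn_in):
    """Eq. 4.6: bandwidth that minimizes Eq. 4.5."""
    return math.sqrt(k_w / (2.0 * q ** 2 * n ** 2 * pn_in))

if __name__ == "__main__":
    k_w, q, n, pn_in = 1.0, 10.0, 100, 1e-15  # placeholder values
    f_opt = optimal_bandwidth(k_w, q, n, pn_in)
    for scale in (0.5, 1.0, 2.0):
        print(scale, jitter_variance(scale * f_opt, k_w, q, n, pn_in))
    # Doubling N halves the optimal bandwidth; doubling Q does the same.
    print(optimal_bandwidth(k_w, q, 2 * n, pn_in) / f_opt)
```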

The two equations offer much insight into common design practices. For example, most commercial PLL products adopt relatively small loop bandwidths partially because a faster and lower-noise crystal is a lot more expensive. To maintain an overall low-jitter performance, a smaller  $f_{BW}$  calls for a better-performing VCO, whose noise is now less filtered. For PLLs with output frequency at mmW, the increased *N*, as a result of higher output frequency, also shrinks  $f_{OPT}$ . Unfortunately, improving oscillator *Q* is much harder at higher frequencies due to several physical limitations. For LC-VCOs, which are the default choice for low-noise RF applications, the capacitor *Q* naturally scales down with increasing frequency while the upscaling of inductor *Q* is hampered by more pronounced skin effect and substrate loss. Any tuning element, such as switches and varactors, contributes additional degradation and can easily dominate tank loss if not meticulously designed. Thus, the VCO becomes the first and the rather fundamental challenge to the efficient design of high-performance mmW PLLs, even with modest tuning range that is just enough to cover process variations.

Before any  $f_{OPT}$  can be adopted, there are more practical constraints to setting the bandwidth. Intuition should first establish that  $f_{BW}$  cannot exceed  $f_{REF}$ . In fact, for type-II topologies, where the loop contains two integrators including the VCO, the generally accepted  $f_{BW}$  for loop stability is around  $f_{REF}/10$  [63]. A type-I PLL, in comparison, has only one integrator and a maximal phase shift of 90° around the loop. It can tolerate a larger bandwidth, with up to  $f_{REF}/2$  reported [64], but exhibits a flatter, 1<sup>st</sup>-order filtering characteristic than the 2<sup>nd</sup>-order response with a type-II loop, as shown in Fig. 4.3. The stability-derived loop bandwidth is still generous before additional requirements such as spur suppression, spot noise mask, and frequency multiplexing are even considered. For integer-N frequency synthesizers where N is a programmable integer, a reference divider must precede the phase detector if the required step size is smaller than the crystal frequency. The equivalent reference frequency is now  $f_{REF}/R$ , a possibly very small number that severely limits  $f_{BW}$ . The fractional-N synthesizer circumvents this limitation by creating an effectively non-integer divider ratio through weighted averaging among several integer values. A delta-sigma modulator (DSM) is used to break up the periodicity in the weighting process in order to reduce output spurious tones. The tradeoff, however, is the additional quantization noise introduced by the DSM that has the same noise transfer function as the input-referred sources. A higher  $PN_{in}$  again suggests a smaller  $f_{OPT}$ , which leads us right back to the quest for a higher Q and smaller N (or equivalently a higher  $f_{REF}$ ). Practical optimization of frequency synthesizers involves complex and clever compromises, and more so at mmW.
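The weighted averaging behind fractional-N division can be sketched with a single accumulator that toggles the divider between two adjacent integers. This is the simplest first-order case and is shown only to illustrate the averaging; its strictly periodic output pattern is exactly what produces spurious tones, which is why practical designs rely on modulators that randomize the sequence. All names and values below are illustrative.

```python
def fractional_divider_sequence(n_int, frac_num, frac_den, cycles):
    """Divide by n_int + 1 whenever the accumulator overflows, otherwise by n_int."""
    acc = 0
    sequence = []
    for _ in range(cycles):
        acc += frac_num
        if acc >= frac_den:
            acc -= frac_den
            sequence.append(n_int + 1)
        else:
            sequence.append(n_int)
    return sequence

if __name__ == "__main__":
    seq = fractional_divider_sequence(100, 1, 4, 8)
    print(seq, sum(seq) / len(seq))  # average ratio of 100.25
```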




Fig 4.3. Example type-I and type-II PLLs and their VCO noise suppression characteristic.



Fig 4.4. (a) Switched-capacitor and (b) LC-VCO performance swept against frequency.

## 4.3 Architectural Considerations for Low Noise and Ultra-Wide Frequency Synthesis

In Fig. 4.4 (a), the simulated capacitance and quality factor for a switched-capacitor coarse-tune element in 28nm CMOS technology are plotted against frequency. The switch dimension is sized so that  $C_{ON}/C_{OFF} \approx 2$ . Fig. 4.4 (b) presents a survey of phase noise and Figure-of-Merit (FoM) of the state-of-the-art VCOs across a wide frequency range. The VCO FoM is defined as

$$FoM = 20 \log_{10} \left( \frac{f_{OUT}}{\Delta f} \right) - 10 \log_{10} \left( \frac{P_{DC}}{1 \text{mW}} \right) - \mathcal{L} \{ \Delta f \} \tag{4.7}$$
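For reference, Eq. 4.7 translates directly into a one-line helper. The operating point in the example (30GHz output, 1MHz offset, 10mW, -100dBc/Hz) is hypothetical and not taken from the survey in Fig. 4.4.

```python
import math

def vco_fom_db(f_out_hz, offset_hz, p_dc_mw, phase_noise_dbc_hz):
    """Eq. 4.7, with the phase noise given in dBc/Hz at the stated offset."""
    return (20.0 * math.log10(f_out_hz / offset_hz)
            - 10.0 * math.log10(p_dc_mw)
            - phase_noise_dbc_hz)

if __name__ == "__main__":
    print(f"{vco_fom_db(30e9, 1e6, 10.0, -100.0):.2f} dB")
```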

It is evident that VCOs at lower frequencies are both better-performing and easier to design. Combined with the fact that a programmable divider, or a multi-modulus one that is often required for fractional division, is much more viable at lower frequencies as well, two-stage architectures centered around a lower-frequency PLL have become popular remedies for the degrading Q at mmW. Depicted in Fig. 4.5, 4.6, and 4.7 are the three major cascaded topologies. The static frequency multiplier (Fig. 4.5) extracts higher-order harmonics of the lower-frequency PLL output through either device nonlinearity or harmonic mixing. In terms of noise, the M<sup>th</sup>-order harmonic simply scales up the phase noise of the fundamental by a factor of M, effectively retaining the low-frequency VCO Q for the harmonic output. The multiplier, being a driven circuit unlike the VCO, does not integrate noise and therefore contributes one copy of wideband noise that is less dependent on passive quality. Frequency multipliers have seen wide adoption with compound semiconductor devices thanks to their higher  $f_T$  and more pronounced nonlinearity [69]. For CMOS adaptations, the scaling down in device  $f_T$  and nonlinearity, as well as overall passive self-resonance-frequency (SRF) and quality, results in lower conversion gain and higher power consumption. The latest reincarnation of the technique embeds harmonic enhancement into the VCO, which is a nonlinear circuit by design, followed by a narrow-band buffer at the desired harmonic frequency [70-73]. Regardless of implementation, the diminishing harmonic power at higher frequencies and the limited intrinsic nonlinearity confine the single-stage multiplication factor to small numbers, with doublers (M = 2) and triplers (M = 3) being the most common.
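The statement that an ideal multiplier retains the quality of the low-frequency VCO can be illustrated numerically: the phase noise rises by $20\log_{10}M$ dB, which is exactly offset by the higher output frequency in the first term of Eq. 4.7. The sketch assumes an ideal, noiseless multiplier, and the numbers are hypothetical.

```python
import math

def multiplied_phase_noise_dbc(pn_fundamental_dbc, m):
    """Phase noise of the M-th harmonic, assuming an ideal noiseless multiplier."""
    return pn_fundamental_dbc + 20.0 * math.log10(m)

def frequency_normalized_noise_db(f_out_hz, offset_hz, pn_dbc):
    """The 20*log10(f/df) - L part of Eq. 4.7, with DC power left out."""
    return 20.0 * math.log10(f_out_hz / offset_hz) - pn_dbc

if __name__ == "__main__":
    f0, offset, pn0 = 10e9, 1e6, -110.0  # hypothetical fundamental
    for m in (2, 3):
        pn_m = multiplied_phase_noise_dbc(pn0, m)
        print(m, round(pn_m, 2), round(frequency_normalized_noise_db(m * f0, offset, pn_m), 2))
```

A doubler costs about 6.0dB and a tripler about 9.5dB of phase noise, while the frequency-normalized figure stays where it was at the fundamental.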



Fig 4.5. Cascaded LO generation with a static frequency multiplier.



Fig 4.6. Cascaded LO generation with an ILFM with an example floorplan in [77]

The injection-locked frequency multiplier (ILFM) in Fig. 4.6 improves upon the previous topology by eliminating the laborious harmonic extraction process. By injecting energy at  $M \cdot f_0$  into a VCO already free running in the vicinity, synchronization ensues where the VCO "locks onto" the injecting signal. The VCO phase noise in the locked state follows that of the injecting signal up to a certain offset frequency,  $\Delta f_L$ . If the free-running frequency is really close to the target frequency, a small injecting power would suffice and an injection at  $f_0$ , with its harmonics, can lock the VCO to  $M \cdot f_0$ , and thereby realize frequency multiplication. Within  $\Delta f_L'$  (different from  $\Delta f_L$ ), the output phase noise is again scaled up from the lower-frequency PLL, which enjoys a VCO with higher Q. Compared to the static multiplier, the single-stage multiplication factor can now take on larger values, with M = 114 reported in [74] but generally below 30 for mmW applications [75-78]. The larger selection in M allows more flexibility for frequency planning. For the purpose of minimizing jitter and power, a lower frequency is preferred for the first stage. Indeed, frequency synthesizers adopting this topology for single-band mmW 5G [76-78] have dominated the performance chart in recent years. It is worth noting that, as M grows larger, the diminishing harmonic power makes the "really close" requirement more exacting and any disturbance that pulls the free-running frequency away may interrupt the locking. As a result, some form of frequency tracking, such as a frequency-locked loop (FLL), must be included. A quantitative review of the injection locking mechanism will be given in Section 4.4.



Fig 4.7. The cascaded PLL architecture and its phase noise improvement.

The ILFM evolves from the cascaded PLL topology in Fig. 4.7, which is the original architecture proposed for high-frequency clock generation in response to the larger *N* and the lower *Q* [79]. Assuming that the quality of VCO<sub>1</sub> is higher than that of VCO<sub>2</sub>, and that the loop bandwidth of PLL<sub>2</sub> is larger than that of PLL<sub>1</sub>, the overall integrated output noise or jitter improves over that of the single-stage PLL with VCO<sub>2</sub>. Compared with the frequency-multiplier-based topologies, the cascaded PLL offers unparalleled flexibility in terms of *N* and *M* values as well as the loop and oscillator types. However, in terms of best jitter performance for single-band applications, where both VCOs are likely LC-based designs as in the ILFM, the cascade is at a disadvantage due to the additional noises from the PLL<sub>2</sub> loop components. These components, especially the feedback divider, also consume additional power and further degrade the overall FoM. As a counterexample, the cascaded synthesizer for the 28GHz 5G band in [80] intentionally avoids the high-frequency feedback divider by adopting the sub-sampling technique [61] in PLL<sub>2</sub>. The PLL jitter FoM is defined as

$$FoM = 20 \log(jitter) + 10 \log\left(\frac{P_{DC}}{1mW}\right) \tag{4.8}$$
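Eq. 4.8 can be evaluated the same way. The sketch assumes the jitter is expressed in seconds (i.e. normalized to 1s) and base-10 logarithms; the 100fs, 10mW operating point is hypothetical.

```python
import math

def pll_jitter_fom_db(jitter_s, p_dc_mw):
    """Eq. 4.8 with RMS jitter in seconds and DC power in mW; lower is better."""
    return 20.0 * math.log10(jitter_s) + 10.0 * math.log10(p_dc_mw)

if __name__ == "__main__":
    print(f"{pll_jitter_fom_db(100e-15, 10.0):.1f} dB")
```

Halving the jitter improves this FoM by about 6dB, whereas halving the power improves it by about 3dB.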

Given a practical crystal reference, the two-stage topologies achieve lower jitter and power consumption but require larger footprints. Not only are the lower-frequency LC tanks much bigger than their mmW counterparts, enough physical distance must also be maintained among inductive components to minimize magnetic coupling. Such calculated sacrifice on the precious silicon real estate is another indicator of the challenge and cost involved to obtain high-purity single-band mmW signal sources. To cover multiple bands or a continuous ultra-wide range, additional narrow-band second stages can be attached in parallel in the three cascaded topologies but at a greater abuse of chip area. A multiplexor is also needed for single-LO-port interface and FET switches at mmW frequencies exhibit significant loss and poor isolation. On the other hand, recent works on ultra-wide frequency multipliers and LC-VCOs for mmW 5G [81-83] do offer some promise for single-path implementation. By exploiting different modes in higher-order LC networks, switch loss can be mitigated for a better *Q*, compared to large switched-capacitor banks or switched-inductors. Since magnetic coupling is an integral part of mode switching, "coils" are placed closer together, which results in a smaller area than multiple separated LC tanks.

Despite the relaxed tradeoff between range and performance, a higher-order LC network still occupies a large area and is more susceptible to the aforementioned physical limitations on integrated passive design. In terms of energy storage, the maximally achievable *Q* for a multi-mode tank in each mode also cannot exceed that in a single-mode tank and thus more current is needed to achieve a comparable noise performance. At a higher level, having only the ultra-wide multiplier or VCO at mmW is inadequate for a PLL-based frequency synthesizer. Without a programmable feedback divider, any single-loop architecture will not work with a fixed crystal frequency. In the cascaded topologies presented so far, the fixed *M* merely shifts the ultra-wide tuning requirement to the first-stage VCO, yet the tuning range of a single-mode LC VCO is limited to around 30%, even at RF frequencies. More importantly, mmW technology is often reserved as an add-on option to traditional RF bands for particular line-of-sight scenarios. For example, the 60GHz 802.11ad/ay standards are meant to supplement the 2.4/5GHz standards for Wi-Fi and the mmW bands are envisioned to complement the sub-6GHz bands for 5G cellular. While a higher power consumption may be acceptable for something that is only used occasionally, any additional chip area is permanent and should ideally be minimized.

By adopting a wide-tuning ring oscillator in the second stage of a cascaded PLL while assuming the availability of a mmW programmable feedback divider, the proposed architecture, shown in Fig. 4.8, addresses the many downsides of the conventional passive-reliant design approach. Compared to an LC-VCO, the frequency of a ring oscillator is associated with the charging and discharging of a capacitor ( $f_{RO} \propto I/C$ ) rather than the inductor and capacitor sizes  $(f_{LC} \propto 1/\sqrt{LC})$ . As scaling current is much more convenient than changing L or C, tuning range spanning even decades can be easily achieved in a ring oscillator without entailing the area, coupling, and Q degradation of multi-order passives. Active devices also get to play a more prominent role in setting the overall performance, which allows this new mmW design methodology to better benefit from technology scaling. Similar to many prior arts exploiting ring oscillators for RF applications [64, 84], a large loop bandwidth is needed (in PLL<sub>2</sub>) to filter out the phase noise of the ring oscillator, which is orders of magnitude higher than that of an LC-VCO. The beauty of having the programmable feedback divider, then, is to be able to shrink the tuning range requirement for the first-stage LC-VCO, which is naturally geared towards a high-purity narrow-band output. The optimized noise performance of PLL<sub>1</sub>, serving as the reference to PLL<sub>2</sub>, also makes room for the in-band noise sources and eventually improves the output jitter across the entire range. The success of the proposed architecture hinges upon the design of a compact and efficient programmable feedback divider range-compatible with the mmW ring oscillator.
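The difference between the two frequency dependences can be put in numbers. The sketch below takes the best case for an LC tank, in which the entire tank capacitance scales by the switched ratio (an assumption; fixed parasitics would reduce the range further), and compares it with a ring oscillator whose frequency is taken to scale linearly with bias current.

```python
import math

def lc_frequency_ratio(c_ratio):
    """f_max / f_min for f proportional to 1/sqrt(LC) when C changes by c_ratio."""
    return math.sqrt(c_ratio)

def ring_frequency_ratio(i_ratio):
    """f_max / f_min for f proportional to I/C when the current changes by i_ratio."""
    return i_ratio

def fractional_tuning_range(f_ratio):
    """Tuning range normalized to the center frequency."""
    return 2.0 * (f_ratio - 1.0) / (f_ratio + 1.0)

if __name__ == "__main__":
    print(f"LC,   C x2 : {fractional_tuning_range(lc_frequency_ratio(2.0)):.1%}")
    print(f"Ring, I x2 : {fractional_tuning_range(ring_frequency_ratio(2.0)):.1%}")
    print(f"Ring, I x10: {fractional_tuning_range(ring_frequency_ratio(10.0)):.1%}")
```

A 2:1 capacitance change yields roughly 34% in this idealized LC case, in line with the 30% figure quoted for single-mode LC VCOs, while the same 2:1 change in current already gives the ring about 67%.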



Fig 4.8. Proposed Cascaded PLL with mmW ring oscillator and programmable dividers



Fig 4.9. Logic implementation of basic programmable digital dividers

## 4.4 Wideband Ring-Based VCO-ILFD Co-Design with Multi-Ratio Pairing

Unfortunately, the divider we desire goes beyond the existing arsenal for frequency division. As shown in Fig. 4.9, a conventional programmable divider requires a cascade of digital flip-flops and logic gates. At above 20GHz, current mode logic (CML) is the only viable option but requires a hefty power budget. Compared with a programmable design, a simple cascade of divide-by-two involves much smaller fanout at each stage while eliminating the combinational gate delays. Many mmW PLLs [7-8, 85] then proceed to adopt a fast CML-based fixed-ratio divider, also known as a prescaler, followed by a low-frequency programmable divider built with efficient static or dynamic CMOS logic styles. In [86], it is recognized that the limitation on power and maximum frequency of operation of digital frequency dividers is associated with their wide-band nature, that is, these dividers are supposed to work from DC to a certain frequency. Since traditional RF and mmW synthesizers all assume a narrow-band output, the wide-band characteristic is not needed and the injection-locked frequency divider (ILFD) emerges as the energy-efficient narrow-band alternative for prescaler design. Despite its many incarnations and the tremendous effort to widen the locking range for a specific division ratio, an ILFD has never been envisioned to be programmable. In this section, after reviewing injection locking and ILFDs in general, a calibration-free wide-locking ring-based VCO-ILFD co-design methodology is first proposed and then scaled to realize programmable division through a minimum-overhead multi-pairing technique.

### 4.4.1 Injection-Locking and the Superharmonic ILFDs

The synchronization of a free-running oscillator under the impression of an external signal is a well-studied phenomenon that applies to both LC- and ring-based oscillators [87-89]. Conceptually speaking, the external injection signal introduces an additional phase shift,  $\phi$ , in the positive-feedback loop. To compensate for  $\phi$  while still satisfying the Barkhausen criterion for sustained oscillation, an LC oscillator shifts its resonance frequency away from  $1/(2\pi\sqrt{LC})$  while each stage in a ring oscillator changes its phase shift (delay) so that, in both cases,  $f_{OSC} = f_{INJ}$ . The locking range, defined as the maximal deviation,  $\Delta f$ , from the free-running frequency,  $f_0$ , to  $f_{INJ}$ , is thus predetermined by the maximal phase shift the injection signal can impart. For a common *N*-stage (*N* being an odd number) inverter ring with transistors modelled as ideal  $G_m$  stages for enough loop gain, as shown in Fig. 4.10, each stage contributes a phase shift of  $\pi/N$ . The oscillation current in each stage can be visualized through the phasor diagram in Fig. 4.11. With the additional phase shift from the injection signal, the per-stage phase shift deviates by  $\theta$ , with the constraint that [89],

$$\left(\pi + \frac{\pi}{N} + \theta\right) \times N + \phi = 2k\pi \tag{4.9}$$

where k is an integer. Fig. 4.11(b) gives the phasor diagram under injection and we have,

$$\theta = -\frac{1}{N}\phi \tag{4.10}$$

$$\sin\phi = \left|\frac{I_{INJ}}{I_{OSC}}\right|\sin(\alpha - \phi) = \left|\frac{I_{INJ}}{I_{OSC}}\right|(\sin\alpha \cdot \cos\phi - \cos\alpha \cdot \sin\phi) \tag{4.11}$$

Assuming a small  $I_{INJ}$  compared to  $I_{OSC}$  and therefore a likely small  $\phi$  so that  $\sin \phi \approx \phi$ , Eq. 4.11 can be reduced to,

$$\phi = \frac{|I_{INJ}| \sin \alpha}{|I_{INJ}| \cos \alpha + |I_{OSC}|} \tag{4.12}$$



Fig 4.10. Model for analyzing injection-locked ring oscillator with the effect of injection in red.



Fig 4.11. Phasor diagrams for a free-running (left) and an injection-locked ring oscillator.

And the maximally achievable phase shift for a given injection can be found by setting,

$$\frac{\partial \phi}{\partial \alpha} = 0 \tag{4.13}$$

which gives,

$$\phi_{MAX} = \frac{|I_{INJ}|}{\sqrt{|I_{OSC}|^2 - |I_{INJ}|^2}} \tag{4.14}$$
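The step from Eq. 4.12 to Eq. 4.14 can be verified numerically by sweeping the injection angle $\alpha$ and comparing the largest phase shift found with the closed-form result. The sketch below writes both equations in terms of the ratio $r = |I_{INJ}|/|I_{OSC}|$; the value $r = 0.2$ is only an example.

```python
import math

def injection_phase(alpha, ratio):
    """Eq. 4.12 with ratio = |I_INJ| / |I_OSC|."""
    return ratio * math.sin(alpha) / (1.0 + ratio * math.cos(alpha))

def max_injection_phase(ratio):
    """Eq. 4.14 with ratio = |I_INJ| / |I_OSC| (requires ratio < 1)."""
    return ratio / math.sqrt(1.0 - ratio ** 2)

def sweep_max_phase(ratio, points=100001):
    """Brute-force maximum of Eq. 4.12 over alpha in [0, pi]."""
    alphas = [math.pi * i / (points - 1) for i in range(points)]
    best_alpha = max(alphas, key=lambda a: injection_phase(a, ratio))
    return best_alpha, injection_phase(best_alpha, ratio)

if __name__ == "__main__":
    alpha_opt, phi_sweep = sweep_max_phase(0.2)
    print(phi_sweep, max_injection_phase(0.2), math.cos(alpha_opt))
```

The sweep peaks where $\cos\alpha = -r$, which is the condition obtained by setting the derivative in Eq. 4.13 to zero.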

The phase shift contributed by the per-stage RC load near the free-running frequency is,

$$\tan\left(\frac{\pi}{N} + \theta\right) = \left(1 + \frac{\Delta f}{f_0}\right)\tan\frac{\pi}{N} \tag{4.15}$$

Using Taylor series expansion in the vicinity of  $f_0$ ,  $\theta$  can be approximated from Eq. 4.15 as,

$$\theta \approx \frac{\tan \frac{\pi}{N}}{1 + \tan^2 \frac{\pi}{N}} \cdot \frac{\Delta f}{f_0} \tag{4.16}$$

By substituting Eq. 4.14 and 4.16 into 4.10, the one-sided locking range can be obtained as,

$$\frac{\Delta f}{f_0} \le \frac{1}{N} \cdot \frac{1 + \tan^2\left(\frac{\pi}{N}\right)}{\tan\left(\frac{\pi}{N}\right)} \cdot \left|\frac{I_{INJ}}{I_{OSC}}\right| \cdot \left(1 - \left|\frac{I_{INJ}}{I_{OSC}}\right|^2\right)^{-\frac{1}{2}} \tag{4.17}$$

For an LC oscillator under injection, the analysis is similar with the main difference being the responding phase shift is given by the tank, instead of the *N*-stage RC cascade,

$$\tan(\phi) = 2Q \cdot \frac{\Delta f}{f_0} \tag{4.18}$$

And the one-sided locking range for an LC oscillator is,

$$\frac{\Delta f}{f_0} \le \frac{1}{2Q} \cdot \left| \frac{I_{INJ}}{I_{OSC}} \right| \cdot \left( 1 - \left| \frac{I_{INJ}}{I_{OSC}} \right|^2 \right)^{-\frac{1}{2}} \tag{4.19}$$
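To get a feel for the two bounds, Eq. 4.17 and 4.19 can be evaluated side by side. The sketch uses a three-stage ring, a tank with $Q = 10$, and an injection ratio of 0.2; all three values are chosen for illustration only.

```python
import math

def injection_strength(ratio):
    """Common factor r / sqrt(1 - r^2) with r = |I_INJ| / |I_OSC|."""
    return ratio / math.sqrt(1.0 - ratio ** 2)

def ring_locking_range(n_stages, ratio):
    """Eq. 4.17: one-sided fractional locking range of an N-stage ring."""
    t = math.tan(math.pi / n_stages)
    return (1.0 / n_stages) * ((1.0 + t * t) / t) * injection_strength(ratio)

def lc_locking_range(q, ratio):
    """Eq. 4.19: one-sided fractional locking range of an LC oscillator."""
    return injection_strength(ratio) / (2.0 * q)

if __name__ == "__main__":
    print(f"3-stage ring: {ring_locking_range(3, 0.2):.2%}")
    print(f"LC, Q = 10  : {lc_locking_range(10.0, 0.2):.2%}")
```

For the same injection ratio, the ring locks over a range more than an order of magnitude wider than the LC tank in this example, consistent with the much lower effective quality factor of the RC-loaded stages.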

It is worth noting that in both LC and ring cases, the maximum achievable phase shift based on Eq. 4.13 happens when  $I_{INJ}$  becomes orthogonal ( $\alpha - \phi = \pi/2$ ) to  $I_{LOAD}$ , which is the incident current at the point of injection and the vector sum of  $I_{INJ}$  and  $I_{OSC}$ . At this threshold point, the injection signal has no effect on the phase of the resultant oscillation and no phase noise correction is observed. Conversely, the phase noise reduction reaches a maximum for  $f_{INJ} = f_0$ , with the locked and free-running phase noise profiles meeting at the edges of the locking range [79].

Functionally and literally opposing the subharmonic ILFM, the superharmonic ILFD receives an injection signal at  $f_0$  but outputs a tone at  $f_0/N$ , with N being the division ratio. The injection-locked oscillator (ILO) in the ILFD is expected to have a free-running frequency around  $f_0/N$  and the exact injecting frequency is guaranteed through a negative feedback loop. As depicted in Fig. 4.12, assuming the ILO is already locked at  $f_0/N$ , the (N - 1)<sup>th</sup> harmonic of the ILO can be extracted and then mixed with  $f_0$  to produce, with the help of some band-pass filtering, the actual injection signal at,

$$f_0 - \frac{N-1}{N} \cdot f_0 = \frac{f_0}{N} \tag{4.20}$$
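The frequency bookkeeping in Eq. 4.20 can be checked with exact rational arithmetic. The sketch below is only an illustration; the function name and the 32GHz input frequency are hypothetical.

```python
from fractions import Fraction


def feedback_injection_frequency(f0, n):
    """Difference product of f0 mixed with the (N-1)th harmonic of an ILO locked at f0/N."""
    f_ilo = Fraction(f0) / n          # locked ILO output frequency
    harmonic = (n - 1) * f_ilo        # extracted (N-1)th harmonic
    return Fraction(f0) - harmonic    # lower mixing product kept by the band-pass filter


F0_GHZ = 32  # hypothetical input frequency in GHz
results = {n: feedback_injection_frequency(F0_GHZ, n) for n in (2, 3, 4, 5)}
for n, f_inj in results.items():
    print(f"N={n}: injection at {float(f_inj):.3f} GHz, ILO at {F0_GHZ / n:.3f} GHz")
```

For every division ratio the mixing product lands exactly on the ILO frequency, which is the self-consistency condition that the feedback loop relies on.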

The actual implementations of a simple injection-locked divide-by-2 (ILFD-2) with (a) an LC-based ILO and (b) a CML-ring-based ILO are given in Fig. 4.13. For N = 2, only the fundamental (N - 1 = 1) of the ILO frequency is needed in the feedback path, and the mixing is conveniently embedded in the differential-pair virtual ground node present in both topologies. In terms of the locking range for the ILFD, the conversion gain of the harmonic extraction and mixing process now matters, as it determines the final injection strength at $f_0/N$. For the divide-by-2 stage in Fig. 4.13(a), the locking range can be approximated as [88],

$$\frac{\Delta f}{f_0} \le \frac{1}{2Q} \cdot \frac{2}{\pi} \cdot \left| \frac{I_{INJ}}{I_{OSC}} \right| \tag{4.21}$$

with $2/\pi$ being the mixer conversion gain and the square-root term in Eq. 4.19 rounded to unity for $I_{INJ} \ll I_{OSC}$.



Fig 4.12. Behavior model for a superharmonic ILFD.



Fig 4.13. Example LC-VCO- and CML-ring-based ILFD-2 with embedded mixer.

At the risk of repetition, any LC-based circuit technique is inherently narrow-band. With the locking range inversely proportional to Q, as derived in Eq. 4.19, injection locking is no exception. For the subharmonic ILFM technique, where the injection power at the fundamental is bound to be small, [76-78] report less than 1% locking range (200MHz at 30GHz). In an ILFD, on the other hand, the fundamental injection power is typically stronger, but the working and locking ranges still rarely exceed 30% for single-tank implementations despite the use of techniques such as tank tuning [8] and multi-path injection [90]. The ILO, now resonating at the divided-down frequency, also occupies a much larger area. Additionally, the minimalistic construct of an LC oscillator with embedded mixing restricts the design options for harmonic extraction and injection, making it very difficult to realize larger (N > 3) fixed division ratios - not to mention programmable ones. We therefore consider the ring ILFDs for our application.

## 4.4.2 Optimal Ring-Based VCO-ILFD Co-Design

Assuming a 5-stage ring oscillator, an LC oscillator with Q = 10 (a modest value at multi-GHz range), and an identical injection efficiency, $I_{INJ}/I_{OSC}$, the scaling factor in Eq. 4.17 evaluates to 0.35 while the 1/2Q in Eq. 4.19 yields 0.05. The much wider locking range in a ring ILO directly benefits a ring ILFD. In fact, Eq. 4.17 remains a rather pessimistic estimate, as it is derived with single-path injection whereas the multi-phase construct of a ring oscillator invites range-widening techniques such as multi-phase and/or multi-path injection [89, 91-93]. For example, [92] reports a CML-ring-based divide-by-4 with dual-path-four-phase injection that locks from 1.2GHz to 20.7GHz. In a similar fashion, the wide tuning characteristic of a ring oscillator also trickles down to ring ILFDs, with [93] reporting a load-modulated CML-ring-based divide-by-4 for input frequencies from 4GHz to 120GHz.

In some sense, the wider locking and tuning range for ring-based ILOs is as much a requirement as it is a blessing. As the free-running frequency of a ring oscillator is much more susceptible to process, voltage, and temperature (PVT) variations than that of an LC oscillator, being easier to synchronize to the injection signal improves the functional robustness of the underlying application in the absence of a heavy-handed frequency calibration process. Using the same divide-by-4 in [92] as an example, the measured free-running frequency of the ILO is 3GHz, yet it is able to synchronize from 0.3GHz on the low side up to more than 5GHz. The tolerated large deviation from the free-running frequency, however, implies that the optimal phase noise suppression, which happens when $f_0 = f_{INJ}$, cannot be guaranteed. The misalignment reduces the noise suppression bandwidth from several GHz in the ideal case to a much smaller value, with the worst case being an ILFD as noisy as a free-running ring oscillator. Tuning the ILO in an attempt to match $f_0$ to $f_{INJ}$ is nearly impossible in the locked state, as it requires access to the ILO phase noise profile. Referring back to Eq. 4.3, the divider phase noise shows up at the PLL output within the loop bandwidth. If the loop bandwidth is small, the reduced phase noise suppression bandwidth (hopefully still in the MHz range) may not degrade PLL output jitter much. However, since a large loop bandwidth is critical for cleaning up the noisy ring in our proposed architecture, the ILFD not only has to be wide-tuning and wide-locking, it must also maintain a free-running frequency close to $f_0/N$ for all supported N values throughout the entire tuning range. In other words, the frequency relationship between the main VCO and the ILO in the free-running state should already follow that in a frequency divider, as depicted in Fig. 4.14.



Fig 4.14. Desired frequency relationship for best noise suppression in ILFD.

Fig 4.15. Ring oscillator scaling properties (panel title: Series Scaling).

As daunting as it appears, the answer is right under our nose: simple ring oscillators with scaled numbers of stages display linearly scaled free-running frequencies. With an ideal *M*-stage ring as the main VCO and an $M \times N$-stage ring as the ILO, the relationship in Fig. 4.14 is satisfied. Since rings with an even number of stages are adopted in this work for their various advantages, Fig. 4.15 illustrates this series scaling property using inverter rings with M = 4, N = 2, and $\tau$ the unit inverter delay. To turn the 4-stage and 8-stage rings into a robust ILFD-2, the only block that is still needed is the mixer. By intentionally not limiting the rings to conventional CML-based topologies, and thereby getting rid of the implicit mixers, two advantages emerge. Firstly, the proposed VCO-ILFD pair can be implemented with digital-friendly inverter rings for possibly lower noise and larger output power (depending on the technology $f_T$ and the targeted frequency range). Secondly, the entire feedback path from the ILO output (where harmonic extraction starts), via the mixer, to the injection nodes is now fully exposed. Considering that the embedded single-ended active mixer available in the CML differential pair offers no additional filtering of harmonic contents while incurring an unnecessary tradeoff between gain and noise [86], a better mixer, in addition to dedicated harmonic filtering and amplification circuits, can be explicitly installed in the exposed feedback path. As a result, only the injection power at $f_0/N$ is boosted while other harmonics are attenuated to alleviate both injection spurs and the possibility of false locking. This point is illustrated in Fig. 4.16 with one ILFD-2 implemented with a differential passive mixer and the other with a differential active mixer with RC filtering followed by inverter buffers. Notice that differential injection is easily adopted here to further widen the locking range, thanks to the (pseudo-) differential nature of ring oscillators with an even number of stages. Fig. 4.16 also plots the simulated locking range for the two ILFD-2s, and a clear improvement is observed for the one with the active mixer.
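The series scaling property can be sketched numerically. The snippet below assumes the textbook idealization $f = 1/(2 M \tau)$ for an *M*-stage inverter ring, which ignores layout parasitics and the loading of the cross-coupled inverters; the unit delay `TAU_S` is a hypothetical value picked to give round numbers.

```python
def ring_frequency_hz(stages, unit_delay_s):
    """Idealized free-running frequency of an inverter ring: f = 1 / (2 * M * tau)."""
    return 1.0 / (2.0 * stages * unit_delay_s)


TAU_S = 3.125e-12   # hypothetical unit inverter delay
M, N = 4, 2         # main-VCO stage count and division ratio, as in Fig. 4.15

f_vco = ring_frequency_hz(M, TAU_S)       # M-stage main VCO
f_ilo = ring_frequency_hz(M * N, TAU_S)   # (M x N)-stage ILO

print(f"VCO: {f_vco / 1e9:.1f} GHz, ILO: {f_ilo / 1e9:.1f} GHz, ratio: {f_vco / f_ilo:.1f}")
```

Since both rings share the same unit delay in this idealization, any change in $\tau$ moves both frequencies together and the ratio stays at N.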



Fig 4.16. ILFD-2 optimization with exposed feedback and simulated locking range increase.

## 4.4.3 Multi-Ratio Pairing with Oscillator Multiplexing

To implement a different division ratio using the proposed co-design methodology, one can simply scale the number of stages in the ILO and opt for a different harmonic extractor in the feedback path. For example, a divide-by-3 can be implemented with a 12-stage ring and its $2^{nd}$-order harmonic. Since 12 is an integer multiple of 4, quadrature phases are available among the 12 phases in the ring, and these serve as the inputs to a mixer-based frequency doubler. The ILFD-3 is shown as part of Fig. 4.17(b). According to Eq. 4.17, the increasing number of stages in the ILO, as *N* grows, tends to reduce the locking range. To overcome this potential drawback, the parallel scaling property, also shown in Fig. 4.15 but not mentioned thus far, proves rather valuable. It is based on the observation that parallelizing two identical rings, which is equivalent to doubling the width of the inverter transistors, does not affect the free-running frequency. This property is applied in reverse in the ILO to effectively scale down $I_{OSC}$ for a bigger $I_{INJ}/I_{OSC}$. As a result, the locking range is recovered and possibly boosted while ideally maintaining the free-running frequency relationship with respect to the main VCO.
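A rough numerical sketch of this trade-off, reusing Eq. 4.17, is given below. It assumes, beyond what the text states, that $I_{OSC}$ scales linearly with inverter width while $I_{INJ}$ is held fixed; the baseline injection ratio and the downsizing factor are illustrative only.

```python
import math


def ring_locking_range(n_stages, inj_ratio):
    """One-sided fractional locking range of an N-stage ring ILO (Eq. 4.17)."""
    t = math.tan(math.pi / n_stages)
    geometry = (1.0 / n_stages) * (1.0 + t * t) / t
    return geometry * inj_ratio / math.sqrt(1.0 - inj_ratio ** 2)


BASE_RATIO = 0.1   # assumed |I_INJ / I_OSC| before any downsizing
DOWNSIZE = 2.0     # assumed reduction of inverter width (and of I_OSC)

range_8 = ring_locking_range(8, BASE_RATIO)                        # ILFD-2 ring
range_12 = ring_locking_range(12, BASE_RATIO)                      # ILFD-3 ring, same sizing
range_12_small = ring_locking_range(12, BASE_RATIO * DOWNSIZE)     # ILFD-3 ring, downsized

print(f"8-stage: {range_8:.2%}, 12-stage: {range_12:.2%}, "
      f"12-stage downsized: {range_12_small:.2%}")
```

Under these assumptions the longer ring loses a little locking range from the stage count alone, and halving the inverter width more than makes up for it.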

It should be obvious that each VCO-ILFD pair optimizes both the locking range and the phase noise suppression bandwidth for a particular *N*. In order to preserve these properties for programmable division, this work leverages the compact area of ring ILFDs and proposes a multi-ratio pairing arrangement, with one main VCO corresponding to multiple ILFDs, to support all desired *N* values. However, instead of letting all ILOs run at the same time, an oscillator multiplexing technique unique to even-numbered-stage inverter rings is proposed to minimize the power consumption overhead as well as the possibility of the ILOs coupling through supply and substrate. As indicated in Fig. 4.17(a), an inverter ring with an even number of stages needs cross-coupled inverters to prevent latch-up. By modifying these cross-coupled inverters to be tri-stateable, the oscillation can be stopped and resumed gracefully and rapidly. Compared to the common approach of turning an oscillator on and off with a supply head switch, the intentional latch-up proves to be more effective, as a ring may continue to oscillate at very low supply levels.


Fig. 4.17(b) shows such multiplexing between ILFD-2 and ILFD-3. Additional tri-stateable buffers and switches to ground are inserted between the ILFD outputs and the mixer input ports. Such precautions are necessary to guarantee a complete mixer shut-off with zero DC current as the ILFD outputs may be both high when latched up.



Fig 4.17. (a) Latch-up in a 4-stage ring (b) multiplexing between ILFD-2 and ILFD-3

Since the VCO and ILFDs are packed so closely together, device, temperature, and process mismatches play a lesser role in the potential degradation of the scaled free-running frequency relationship. Rather, the down-sized and weaker inverters, combined with the larger loads from the tri-stateable inverters, slow down the ILOs more in the presence of layout parasitics. Given the relatively larger division ratios (N = 4 & 5) used in the final design, as well as the confidence in all the measures to widen the locking range, the greater retardation in the ILO is conveniently compensated by shortening the inverter chain. The modified scaling factor for the number of stages is N - 1 instead of the division ratio, N. The ILFD-4 and ILFD-5 are designed with 12- and 16-stage rings instead of 16 and 20 stages, respectively. In the ILFD-4 case, the 3<sup>rd</sup>-order harmonic is easily extracted from buffer non-linearity. On the other hand, the 4<sup>th</sup>-order harmonic needed for the divide-by-5 is a bit tricky. Fortunately, the 16-stage ring not only improves frequency alignment, it also makes it easier to extract the 4<sup>th</sup>-order harmonic by providing the eight evenly spaced phases (16 is an integer multiple of 8 while 20 is not) needed for a differential frequency quadrupler. Fig. 4.18 illustrates both designs. Fig. 4.19 plots the simulated locking behavior for the ILFD-4 among different process corners when the input frequency is swept from 18GHz to 48GHz, as well as the simulated single-point locking range at the highest supply setting (1.1V) for the ILFD-5.
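The stage-count bookkeeping above can be summarized in a short sketch. The helper names are hypothetical, and the phase-availability test simply encodes the integer-multiple criterion used in the text.

```python
MAIN_VCO_STAGES = 4  # M, the stage count of the main ring VCO


def ilo_stage_count(n, shortened):
    """ILO stages for division ratio N: M*N ideally, M*(N-1) with the shortened chain."""
    factor = n - 1 if shortened else n
    return MAIN_VCO_STAGES * factor


def has_evenly_spaced_phases(stages, phases_needed):
    """Integer-multiple criterion from the text for tapping evenly spaced phases."""
    return stages % phases_needed == 0


plan = {}
for n in (4, 5):
    plan[n] = {
        "ideal_stages": ilo_stage_count(n, shortened=False),
        "final_stages": ilo_stage_count(n, shortened=True),
        "harmonic_order": n - 1,  # harmonic mixed with f0 in the feedback path
    }
    print(f"ILFD-{n}: {plan[n]}")

# Eight phases feed the differential quadrupler of the ILFD-5
print("16-stage ring offers 8 phases:", has_evenly_spaced_phases(16, 8))
print("20-stage ring offers 8 phases:", has_evenly_spaced_phases(20, 8))
```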



Fig 4.18. Final functional design of (a) ILFD-4 and (b) ILFD-5.



**Fig 4.19.** Simulated input-output relationship under process variation for the ILFD-4 with main VCO (left) and the input-referred locking range for the ILFD-5

# 4.5 Prototype Implementation

#### 4.5.1 Frequency Planning and Ring Oscillator Design

The cascaded PLL prototypes aim to first cover the versatile 24-40GHz spectrum. Only two division ratios, 8 and 10, are needed in the second loop (PLL<sub>HB</sub>) in order to seamlessly shrink the 50% tuning range at mmW to less than 30% (3-4GHz, 28.6%) at RF. The dual ratio is implemented following the proposed methodology by having the ILFD-4 in parallel with the ILFD-5, with their multiplexed output going into a static divide-by-2. Only integer frequency steps are considered. With a 50MHz crystal reference, the final output step size is 400MHz in the 8x mode and 500MHz in the 10x mode, which is acceptable for the target sensing applications. If a finer frequency resolution is desired, fractional-N support can be easily incorporated into PLL<sub>LB</sub>. A detailed block diagram is shown in Fig. 4.20. The proposed cascaded architecture with a ring-based mmW stage minimizes design tradeoffs and amplifies the best features of each frequency component: the high-purity reference cleans up the narrow-band LC-VCO and then the wide-tuning ring-VCO, while the multi-ratio pairing extends frequency coverage with negligible area penalty. Note that the labelled frequencies are slightly lower because the exact reference is at 49.152MHz rather than 50MHz.
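The frequency plan above reduces to a few lines of arithmetic. The sketch below uses only numbers quoted in this section; the variable and function names are illustrative.

```python
F_REF_NOMINAL_MHZ = 50.0        # nominal crystal reference
F_REF_ACTUAL_MHZ = 49.152       # exact reference used in the prototype
PLL_LB_RANGE_GHZ = (3.0, 4.0)   # planned first-loop (RF) coverage
MULTIPLIERS = (8, 10)           # ILFD-4 or ILFD-5 followed by a static divide-by-2


def fractional_range(lo, hi):
    """Tuning range relative to the center frequency."""
    return (hi - lo) / ((hi + lo) / 2.0)


def output_band(multiplier, lb_range=PLL_LB_RANGE_GHZ):
    """mmW band reached when the second loop multiplies the RF band."""
    return (lb_range[0] * multiplier, lb_range[1] * multiplier)


bands = {m: output_band(m) for m in MULTIPLIERS}
# Coverage is continuous when the 8x band reaches the start of the 10x band
continuous = bands[8][1] >= bands[10][0]
total = (bands[8][0], bands[10][1])

print("bands (GHz):", bands, "continuous:", continuous)
print(f"RF range:  {fractional_range(*PLL_LB_RANGE_GHZ):.1%}")
print(f"mmW range: {fractional_range(*total):.1%}")
for m in MULTIPLIERS:
    print(f"{m}x step: {m * F_REF_NOMINAL_MHZ:.0f} MHz nominal, "
          f"{m * F_REF_ACTUAL_MHZ:.3f} MHz with the exact reference")
```

The 8x and 10x bands overlap between 30GHz and 32GHz, which is what allows the two ratios to stitch into one continuous range.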



Fig 4.20. Top-level block diagram of the proposed cascaded PLL.

To further exploit the convenience and area efficiency of ring oscillators, the main mmW VCO is designed to offer quadrature phases. With a simulated technology (NMOS) $f_T$ of just over 250GHz, the 4-stage 8-inverter ring requires minimum-length ultra-low-threshold devices to reach 99GHz under nominal conditions. The slim margin in speed aggravates the manifold challenge in frequency tuning. First off, any method that degrades the top oscillation speed, such as adding varactors or current starving, is automatically disqualified. The only remaining option seems to be adjusting the supply voltage, to which a ring is highly sensitive. The resulting excessive $K_{VCO}$ is undesirable as it typically leads to higher output noise and spurs, as well as potentially a much bigger loop filter. On the other hand, a small (< 10GHz/V) $K_{VCO}$ makes it impossible to cover more than 10GHz given the 0.9V core supply for the charge pump. The solution is to employ two-step tuning, much like in an LC-VCO with a switched-capacitor array. As can be seen in Fig. 4.20, the supply to all the rings (main VCO and divider ILOs) is made programmable through a high-resolution (8-bit) digital-to-analog converter (DAC), buffered with an LDO, to realize coarse tuning. For fine tuning, recall that a MOSFET is a 4-terminal device after all. Back-gate bias lightly modulates the series resistance of a transistor (Eq. 4.22), which is perfect for a smaller $K_{VCO}$. Since NMOS body bias requires not only deep-N-well isolation but also a negative control voltage, back-gate tuning is only applied to the naturally isolated PMOS devices in the signal-path inverters (not the cross-coupled ones) to keep the design clean and compact. Fortunately, both tuning knobs exert their influence on individual inverters and thus do not jeopardize either the series or the parallel scaling properties of ring oscillators. The inverter sizes in the ILFDs are scaled to about one-tenth of those in the main VCO in order to maximize locking range while minimizing power consumption. Fig. 4.21 shows the detailed implementation of the 4-stage ring along with its simulated tuning characteristics.



$$g_{mb} = \frac{\gamma}{2\sqrt{-2\phi_P - V_{BS}}} \cdot \frac{W}{L} \mu C_{ox} (V_{GS} - V_T) \tag{4.22}$$

**Fig 4.21.** Main ring VCO design and its simulated tuning with negative  $K_{VCO}$ .

## 4.5.2 Noise, Spur, and Power Considerations for the Cascaded PLL

With a reference frequency centered around 3.5GHz, PLL<sub>HB</sub> can afford a loop bandwidth up to a few hundred MHz with a type-II implementation. The 40dB per decade noise suppression characteristic is better appreciated when the VCO has a high flicker noise corner, which is exactly the case for a ring oscillator built with short-channel devices. A standard charge pump PLL is adopted. Since all loop components now operate at 3GHz and above, dynamic CMOS logic, such as the C<sup>2</sup>MOS-based divide-by-2 and the TSPC-based D-flip-flops (DFFs), is used extensively to improve energy efficiency. A switched filter is inserted before the loop filter to better isolate the VCO from charge pump settling, and it is controlled by a synchronous 25% duty-cycle clock derived from the differential outputs of the ILFDs. Since the charge pump output centers around VDD/2 in $PLL_{HB}$, a PMOS source follower is required to shift up the control voltage, meant for N-well biasing, to around VDD. Fig. 4.22 illustrates these details.



Fig 4.22. PLL<sub>HB</sub> implementation details.

The larger loop bandwidth of $PLL_{HB}$, on the other hand, also implies that the noise and spurs of $PLL_{LB}$ will present themselves at the final output not only unfiltered but also scaled up by a factor of 8 or 10. With the proposed architecture, it turns out to be more cost-effective to allocate more design resources, such as power and area, to optimize $PLL_{LB}$ for a better overall FoM. As suggested in Section 4.2, minimization of integrated phase noise involves both in-band and out-of-band components. As shown in Fig. 4.23, the sub-sampling technique is adopted here to reduce the noise contributions from many input-referred sources (PD/CP/LF) while eliminating the divider noise altogether [61]. However, the technique tends to introduce larger reference spurs due to VCO load modulation (BPSK effect) as well as charge sharing and injection from sampling switches [94]. Despite several low-power techniques reported so far [94-96], this work considers multi-stage buffering, with the second stage offering a low output impedance, the most effective at isolating the VCO in the presence of device mismatch and PVT variations. Simple dummy switches are included to compensate for the charge sharing and injection. For low out-of-band noise, the LC-VCO design follows the classic NMOS cross-coupled topology but adopts the implicit common-mode resonance technique [98] for the best possible noise performance without using a second inductor. PLL<sub>LB</sub> is an IO-voltage-based design except for the LC-VCO and its first-stage CML buffer.



Fig 4.23. Select PLL<sub>LB</sub> implementation details.

#### 4.5.3 Dual-Band Frequency Buffer for Tri-Band Extension

The convenient quadrature or multiple phases offered by ring oscillators at mmW open up new possibilities for frequency multiplier design. To demonstrate this point, a single-path reconfigurable buffer/doubler is implemented as a compact and modular add-on to the cascaded PLL as a potential candidate for a single-path tri-band (28/39/60GHz) synthesizer. The design takes its inspiration from the conventional mixer-based frequency doubler. Shown in Fig. 4.24(a) is a tail-less Gilbert cell. In the doubler mode, $\Phi_1 = \Phi_4$ and $\Phi_2 = \Phi_3$. For differential signals $\Phi_5$ and $\Phi_6$, although it is not necessary, it is easily proven [86] that being in quadrature with $\Phi_1$ and $\Phi_2$ (or $\Phi_3$ and $\Phi_4$) minimizes DC imbalance for the differential output. By simply switching the polarity of $\Phi_3$ and $\Phi_4$ (or $\Phi_1$ and $\Phi_2$) so that $\Phi_1 = \Phi_3$ and $\Phi_2 = \Phi_4$, the Gilbert cell can effectively be folded and turned into a CML buffer, with color-matched simulation results shown on the side. The simulation is run with a single-band 30GHz load, which leads to the diminished output waveform in the doubler mode. To boost the conversion gain, a compact dual-band load based on a 2<sup>nd</sup>-order LC network is designed with the Gilbert cell core. As can be seen in Fig. 4.24(b), the Q in the buffer mode is intentionally lower in order to best preserve the original wideband output.



Fig 4.24. Gilbert-cell-based buffer/doubler design and its dual-band load.

## 4.6 Measurement and Comparison

The prototypes are fabricated in the TSMC 28nm CMOS process with the ultra-thick top metal option. The cascaded PLL occupies an active area of 0.29mm<sup>2</sup>. PLL<sub>HB</sub> takes up less than 5% of it at 0.012mm<sup>2</sup>, including the decoupling capacitor for the LDO. The frequency extension unit adds an active area of 0.06mm<sup>2</sup> in a second chip. The maximum power consumption from both the IO and core voltage domains, excluding frequency extension, is 35mW. Fig. 4.25 shows the chip micrographs along with the power consumption breakdown.



Fig 4.25. Prototype chip micrographs and power consumption breakdown.

The output of PLL<sub>LB</sub> is buffered out and measured with a Keysight 5052A signal source analyzer. Among the six chips measured, the worst-case locking range is from 2.95GHz to 3.83GHz, about 3% below the planned coverage. Fig. 4.26 gives the spectrum and phase noise measurements at both frequencies. The implicit common-mode resonance frequency, ideally at  $2 \cdot f_{OSC}$ , heavily depends on the ratio of common-mode and differential mode capacitances in the tank. In [8], 1024 settings are swept to optimize the performance yet only 32 settings are included in this work. Comparing the out-of-band spot noise in Fig. 4.26 and the measurement in [97], further improvement of at least 6dB can be achieved through either finer adjustment or better electromagnetic simulation. The worst-case reference spur at the PLL<sub>LB</sub> output is -79dBc and is observed at the  $2 \cdot f_{REF}$  offset.



Fig 4.26. Spectrum and phase noise measurement of PLL<sub>LB</sub>.



Fig 4.27. High- and low-band phase noise measurement of PLL<sub>HB</sub>.

The final mmW output is measured directly through a GSG probe. An Agilent 8565EC spectrum analyzer is used for the initial testing. With help from our colleagues at National Chiao Tung University, a much more capable Keysight N9041B signal analyzer is used later to repeat select measurements for verification and better presentation. With the two ILFDs in the prototype, PLL<sub>HB</sub> achieves a worst-case continuous locking range of 47.5% from 23.59 to 38.34GHz (23.59-31.06GHz in the 8x mode and 29.50-38.34GHz in the 10x mode). Fig. 4.27 shows the phase noise measurements at both ends whereas Fig. 4.28 gives the detailed mid-band performance.

Comparing Fig. 4.27 and Fig. 4.26, $PLL_{HB}$ contributes an additional 1-2dB of noise at 1MHz offset. The estimated mid-band jitter FoM is -232dB when integrated from 1kHz to 100MHz.



Fig 4.28. PLL<sub>HB</sub> mid-band output phase noise measurement at 31.46GHz.

Table 9 compares this work against other reported wideband mmW frequency synthesizers that have tuning ranges above 20% and similar center frequencies. Both [7] and [8] adopt single-loop topologies but with specialized mmW hardware in their LC-VCOs, the digitally controlled artificial dielectric (DiCAD) [x-y], to maintain a good *Q* across the tuning ranges. In fact, both works come out of our group, with the former being the predecessor of this work and the latter developed for a 60GHz heterodyne transceiver. Our cascaded PLL prototype, with a single LC-VCO at RF, achieves comparable noise performance but covers a much wider frequency range with less power and area. For completeness, [99] and [100] are included in the comparison; they represent the two extremes in terms of trading area, power, and design complexity for performance. The proposed architecture and circuit techniques not only offer a much more practical middle ground but also allow mmW synthesizer design to evolve and benefit more from CMOS technology scaling.

| Parameter                                                              | This Work | [99]     | [7]     | [100]     | [8]     |
|------------------------------------------------------------------------|-----------|----------|---------|-----------|---------|
| CMOS Technology                                                        | 28nm      | 28nm     | 65nm    | 65nm      | 65nm    |
| Architecture                                                           | Cas. PLL  | Cas. PLL | PLL     | PLL+ILFM  | PLL     |
| Frequency (GHz)                                                        | 31.46     | 32       | 27.86   | 42        | 50.11   |
| Frequency Range (GHz)                                                  | 23.6-38.3 | 21-32*   | 28-34   | 20.6-48.2 | 42.1-53 |
| (Percentage %)                                                         | (47.6%)   | (41.5%)  | (19.4%) | (80.2%)   | (22%)   |
| Quadrature                                                             | Yes       | Yes      | No      | No        | No      |
| $f_{\it REF}$ (MHz)                                                    | 49        | 125      | 54      | 100       | 54      |
| PN @1MHz (dBc/Hz)                                                      | -97.7     | -91.2    | -96.7   | -108.1    | -97.5   |
| Power (mW)                                                             | 28.8      | 30       | N/A     | 148       | 72      |
| "Coil" Count                                                           | 1         | 0        | 1       | 8         | 3       |
| Active Area (mm<sup>2</sup>)                                           | 0.29      | 0.015    | N/A     | 2.1       | 0.37**  |
| PN FoM @1MHz                                                           | 173       | 166      | N/A     | 178       | 173     |
| Jitter FoM @1MHz                                                       | -232      | -222     | N/A     | N/A       | N/A     |

\*Fixed divider only, tuned by sweeping $f_{REF}$; \*\*off-chip loop filter.

Table 9. Wideband mmW PLL Comparison
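As a cross-check on the PN FoM row, the sketch below assumes the conventional phase-noise figure of merit, $FoM = -L(\Delta f) + 20\log_{10}(f_0/\Delta f) - 10\log_{10}(P/1\,\text{mW})$. This definition is an assumption here, since the FoM is not restated in this section, and the function name is hypothetical. The input values are taken from Table 9.

```python
import math


def pn_fom_db(pn_dbc_hz, f0_hz, offset_hz, power_mw):
    """Conventional phase-noise FoM (assumed definition), reported as a positive dB value."""
    return (-pn_dbc_hz
            + 20.0 * math.log10(f0_hz / offset_hz)
            - 10.0 * math.log10(power_mw / 1.0))


OFFSET_HZ = 1e6  # 1MHz offset, as in Table 9

# (phase noise in dBc/Hz, carrier frequency in Hz, power in mW) from Table 9
entries = {
    "This Work": (-97.7, 31.46e9, 28.8),
    "[8]": (-97.5, 50.11e9, 72.0),
}

foms = {name: pn_fom_db(pn, f0, OFFSET_HZ, p) for name, (pn, f0, p) in entries.items()}
for name, value in foms.items():
    print(f"{name}: PN FoM = {value:.1f} dB")
```

With this definition both entries come out close to the tabulated 173, which suggests the table follows the same convention to within rounding.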

# CHAPTER 5

# **Conclusion and Future Directions**

With spectroscopic remote sensing as the entry point, this dissertation represents a more refined and systematic foray into custom CMOS design for high-frequency scientific instrumentation. The working principle of RF spectroscopy is introduced first, and each component is then analyzed for its speed and noise requirements. While modern ultra-scaled CMOS technology still falls short in terms of noise for THz front ends, its unparalleled level of integration and ever-increasing speed prompted this investigation into wideband spectrometer SoCs and ring-based ultra-wide-range frequency synthesis for maximal improvement in instrument SWaP and overall operation efficiency.

The spectrometer SoCs integrate high-speed data conversion with bandwidth-compatible FFT-based backend processing. To ensure robust operation at the mixed-signal interface, on-chip clocking with extensive tuning capability is included. The efficient design of the high-throughput and high-channel-count FFT processors, including the costly PFB and the timing-critical accumulators, is enabled through parallel-pipelined partitioning, the adoption and extension of the SDF-based Radix-2<sup>K</sup> factorization, as well as linear approximation of trigonometric functions. Mathematical symmetry and bit-level hardware reduction are also exploited wherever possible. Three generations of deployable SoCs are developed as part of this work, with the latest achieving 6GHz Nyquist bandwidth with sub-MHz full-band resolution. Although area and routing efficiency is the intended goal for optimization, the SoCs, excluding the PFB, exhibit better energy efficiency than reported standalone designs of high-performance FFT processors.

The inverter-ring-based frequency synthesizer extends the benefit of CMOS technology scaling to mmW LO generation. The proposed VCO-ILFD co-design methodology marries the simple and elegant scaling properties of ring oscillators with the optimal working conditions of ILFDs. As a result, wide tuning and locking, as well as uncompromised noise suppression, are achieved simultaneously without excessive power consumption. Rings with an even number of stages further facilitate oscillator multiplexing through controlled latch-up, making the multi-ratio VCO-ILFD pairing technique ideal for expanding frequency coverage from a high-purity narrow-band source, such as an LC-VCO. A cascaded PLL prototype targeting 24-40GHz in 28nm CMOS verifies the proposed ideas by reporting noise performance comparable to that of its mmW-LC-VCO-based counterparts while more than doubling the tuning range. It is especially suited for power- and area-constrained sensing and instrumentation applications at Ka-band and beyond.

The substantial reduction in instrument SWaP opens up new observation scenarios such as large-scale multi-pixel-array-based spectroscopic sensing. New algorithms can be developed on top of the existing design to account for phase relations between different pixels. Hardware reconfigurability should then be added to select among single-, dual-, and multi-channel sensing modes. On the frequency synthesis side, the small footprint of ring oscillators and the convenient oscillator multiplexing technique prove rather powerful. For example, multiple $2^{nd}$-stages similar to the existing one can be attached in parallel to cover more range below the maximum oscillation frequency ( $f_{osc}$ ) allowed by the technology. Note that the proposed techniques are fully compatible with CML rings, which are theoretically faster than inverter rings thanks to the smaller swing. To get rid of the remaining LC-VCO, existing research on ring-based PLLs in the sub-10GHz range can be leveraged. Applying measurement results in [x] to the cascaded PLL, a spot noise performance of -106dBc/Hz at 1MHz offset can be projected. In addition, reconfigurable frequency multipliers with ratios beyond integer values (e.g. 1.5 and 2.5) can be developed with ease, given the large-swing multi-phase outputs, to seamlessly cover more range beyond the maximum $f_{osc}$.


# REFERENCES

- [1] F. Hsiao, A. Tang, Y. Kim, B. Drouin, G. Chattopadhyay and M. F. Chang, "A 2.2 GS/s 188mW spectrometer processor in 65nm CMOS for supporting low-power THz planetary instruments," *IEEE Custom Integrated Circuits Conference*, San Jose, CA, 2015, pp. 1-3.
- [2] A. Tang et al., "A low-overhead self-healing embedded system for ensuring high yield and long-term sustainability of 60GHz 4Gb/s radio-on-a-chip," *IEEE International Solid-State Circuits Conference*, San Francisco, CA, 2012, pp. 316-318.
- [3] A. Tang, R. Carey, G. Virbila, Y. Zhang, R. Huang and M. Frank Chang, "A Delay-Correlating Direct-Sequence Spread-Spectrum (DS/SS) Radar System-on-Chip Operating at 183–205 GHz in 28 nm CMOS," in *IEEE Transactions on Terahertz Science and Technology,* vol. 10, no. 2, pp. 212-220, March 2020.
- [4] C. Yang, T. Yu, and D. Markovic, "Power and Area Minimization of Reconfigurable FFT Processors: A 3GPP-LTE Example," in *IEEE Journal of Solid-State Circuits*, vol. 47, no. 3, pp. 757-768, March 2012.
- [5] A. Cortes, J. F. Sevillano, I. Velez and A. Irizar, "An FFT Core for DVB-T/DVB-H Receivers," IEEE International Conference on Electronics, Circuits and Systems, Nice, 2006, pp. 102-105.
- [6] A. Tang et al., "CMOS (Sub)-mm-Wave System-on-Chip for exploration of deep space and outer planetary systems," *Proceedings of the IEEE 2014 Custom Integrated Circuits Conference,* San Jose, CA, 2014, pp. 1-4.
- [7] Z. -. Chen et al., "A wide-band 65nm CMOS 28–34 GHz synthesizer module enabling low power heterodyne spectrometers for planetary exploration," *IEEE MTT-S International Microwave Symposium*, Phoenix, AZ, 2015, pp. 1-3.
- [8] D. Murphy et al., "A Low Phase Noise, Wideband and Compact CMOS PLL for Use in a Heterodyne 802.15.3c Transceiver," in *IEEE Journal of Solid-State Circuits*, vol. 46, no. 7, pp. 1606-1617, July 2011.
- [9] Y. Zhao et al., "A 0.56 THz Phase-Locked Frequency Synthesizer in 65 nm CMOS Technology," in *IEEE Journal of Solid-State Circuits*, vol. 51, no. 12, pp. 3005-3019, Dec. 2016.
- [10] S. Ek et al., "A 28-nm FD-SOI 115-fs Jitter PLL-Based LO System for 24–30-GHz Sliding-IF 5G Transceivers," in *IEEE Journal of Solid-State Circuits*, vol. 53, no. 7, pp. 1988-2000, July 2018.
- [11] P. H. Siegel, A. Tang, R. Kim, G. Virbila, F. Chang and V. Pikov, "Noninvasive in vivo millimeter-wave measurements of glucose: First results in human subjects," *International Conference on Infrared, Millimeter, and Terahertz Waves (IRMMW-THz),* Cancun, 2017, pp. 1-2.
- [12] R. Saracco, "6G will follow 5G, that much we know," *IEEE Future Directions,* June 5, 2019.
- [13] J. C. Webber, "The ALMA Telescope," IEEE MTT-S International Microwave Symposium Digest (MTT), Seattle, WA, 2013, pp. 1-3.
- [14] P. F. Goldsmith and D. C. Lis, "Early Science Results from the Heterodyne Instrument for the Far Infrared (HIFI) on the Herschel Space Observatory," in *IEEE Transactions on Terahertz Science and Technology*, vol. 2, no. 4, pp. 383-392, July 2012.
- [15] V. Jamnejad, "A dual band telescope for microwave instrument on Rosetta Orbiter (MIRO)," IEEE Aerospace Conference Proceedings (Cat. No.99TH8403), Snowmass at Aspen, CO, USA, 1999, pp. 265-269, vol.3.
- [16] S. Gulkis, et al., "MIRO: Microwave Instrument for Rosetta Orbiter," *Space Sci Rev* (2007) 128: 561.
- [17] Rolf Guesten, Paul Hartogh, Heinz-Wilhelm Huebers, Urs U. Graf, K. Jacobs, Hans-Peter Roeser, Frank Schaefer, Rudolf T. Schieder, Ronald Stark, Juergen Stutzki, Peter Van der Wal, Achim Wunsch, "GREAT: the first-generation German heterodyne receiver for SOFIA," *Proc. SPIE* 4014, Airborne Telescope Systems, 20 June 2000.

- [18] N. Uchida and N. Niizeki, "Acousto-optic deflection materials and techniques," in *Proceedings of the IEEE*, vol. 61, no. 8, pp. 1073-1092, Aug. 1973.
- [19] Gary Bennett, "Space Nuclear Power: Opening the Final Frontier," *4<sup>th</sup> International Energy Conversion Engineering Conference and Exhibit,* San Diego, CA, June 2006.
- [20] Dave Drachlis, "Advanced Space Transportation Program: Paving the Highway to Space," https://www.nasa.gov/centers/marshall/news/background/facts/astp.html
- [21] B. Richards, N. Nicolici, H. Chen, K. Chao, R. Abiad, D. Werthimer, B. Nikolic, "A 1.5GS/s 4096 spectrum analyzer for spaceborne applications," *IEEE Custom Integrated Circuits Conference*, San Jose, CA, 2009, pp. 499-502.
- [22] P. Racette and R. H. Lang, "Radiometer design analysis based upon measurement uncertainty," in *Radio Science*, vol. 40, no. 05, pp. 1-22, Oct. 2005.
- [23] R. H. Dicke, "The measurement of thermal radiation at microwave frequencies," in *Rev Sci Instrum*, vol. 17, pp. 268-275, 1946.
- [24] E. T. Schlecht et al., "Schottky Diode Based 1.2 THz Receivers Operating at Room-Temperature and Below for Planetary Atmospheric Sounding," in *IEEE Transactions on Terahertz Science and Technology*, vol. 4, no. 6, pp. 661-669, Nov. 2014.
- [25] B. Thomas et al., "A Broadband 835–900-GHz Fundamental Balanced Mixer Based on Monolithic GaAs Membrane Schottky Diodes," in *IEEE Transactions on Microwave Theory and Techniques*, vol. 58, no. 7, pp. 1917-1924, July 2010.
- [26] E. T. Schlecht, J. J. Gill, R. H. Lin, R. J. Dengler and I. Mehdi, "A 520–590 GHz Crossbar Balanced Fundamental Schottky Mixer," in *IEEE Microwave and Wireless Components Letters*, vol. 20, no. 7, pp. 387-389, July 2010.

- [27] K. M. K. H. Leong et al., "850 GHz Receiver and Transmitter Front-Ends Using InP HEMT," in IEEE Transactions on Terahertz Science and Technology, vol. 7, no. 4, pp. 466-475, July 2017.
- [28] J. W. Kooi et al., "Performance of the Caltech Submillimeter Observatory Dual-Color 180– 720 GHz Balanced SIS Receivers," in *IEEE Transactions on Terahertz Science and Technology*, vol. 4, no. 2, pp. 149-164, March 2014.
- [29] P. Kangaslahti et al., "Low noise amplifier receivers for millimeter wave atmospheric remote sensing," 2012 IEEE/MTT-S International Microwave Symposium Digest, Montreal, QC, 2012, pp. 1-3.
- [30] D. Murphy et al., "A blocker-tolerant wideband noise-cancelling receiver with a 2dB noise figure," *IEEE International Solid-State Circuits Conference,* San Francisco, CA, 2012, pp. 74-76.
- [31] Y. Kim et al., "A 183-GHz InP/CMOS-Hybrid Heterodyne-Spectrometer for Spaceborne Atmospheric Remote Sensing," in *IEEE Transactions on Terahertz Science and Technology*, vol. 9, no. 3, pp. 313-334, May 2019.
- [32] Kenneth S. Kundert, "Predicting the Phase Noise and Jitter of PLL-Based Frequency Synthesizers," in *Phase-Locking in High-Performance Systems: From Devices to Architectures*, IEEE, 2003, pp.46-69.
- [33] J. L. Besada, "Influence of Local Oscillator Phase Noise on the Resolution of Millimeter-Wave Spectral-Line Radiometers," in *IEEE Transactions on Instrumentation and Measurement*, vol. 28, no. 2, pp. 169-171, June 1979.
- [34] G. Chattopadhyay et al., "An all-solid-state broad-band frequency multiplier chain at 1500 GHz," in IEEE Transactions on Microwave Theory and Techniques, vol. 52, no. 5, pp. 1538-1547, May 2004.

- [35] H. Wang et al., "Monolithic power amplifiers covering 70-113 GHz," 2000 IEEE Radio Frequency Integrated Circuits (RFIC) Symposium Digest of Papers, Boston, MA, USA, 2000, pp. 39-42.
- [36] Z. Griffith, M. Urteaga, P. Rowell and R. Pierson, "71–95 GHz (23–40% PAE) and 96–120
  GHz (19–22% PAE) high efficiency 100–130 mW power amplifiers in InP HBT," *2016 IEEE MTT-S International Microwave Symposium (IMS),* San Francisco, CA, 2016, pp. 1-4.
- [37] Z. Griffith, M. Urteaga, and P. Rowell, "A 140-GHz 0.25-W PA and a 55-135 GHz 115-135 mW PA, High-Gain, Broadband Power Amplifier MMICs in 250-nm InP HBT," 2019 IEEE MTT-S International Microwave Symposium (IMS), Boston, MA, USA, 2019, pp. 1245-1248.
- [38] I. Beavers, "Understanding Spurious-Free Dynamic Range in Wideband GSPS ADCs," Analog Devices Inc., Norwood, MA, USA, Tech. Rep. MS2660, 2014.
- [39] A. R. Thompson, D. T. Emerson, and F. R. Schwab, "Convenient formulas for quantization efficiency," in *Radio Science*, vol. 42, no. 03, pp. 1-5, June 2007.
- [40] G. Chattopadhyay et al., "Silicon micromachined terahertz spectrometer instruments," 2016
  41st International Conference on Infrared, Millimeter, and Terahertz waves (IRMMW-THz),
  Copenhagen, 2016, pp. 1-2.
- [41] N. Livesey, G. Chattopadhyay, R. Jarnot, J. Kook, J. Kroz, T. Reck and R. Stachnik, "The Compact Adaptable Microwave Limb Sounder (CAMLS) Project," *American Meteorological Society Conference on Transition of Research to Operations,* Jan 2017.
- [42] J. W. Waters, "Microwave Limb Sounding," Atmospheric Remote Sensing by Microwave Radiometry, Hoboken, NJ, USA: Wiley, 1993, ch. 8.
- [43] S. Kathiah and S. Aniruddhan, "Replica bias scheme for efficient power utilization in high-frequency CMOS digital circuits," 2014 IEEE International Symposium on Circuits and Systems (ISCAS), Melbourne VIC, 2014, pp. 1002-1005.
- [44] Y. Zhang, Y. Kim, A. Tang, J. H. Kawamura, T. J. Reck and M. Frank Chang, "Integrated Wide-Band CMOS Spectrometer Systems for Spaceborne Telescopic Sensing," in *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 66, no. 5, pp. 1863-1873, May 2019.

- [45] B. Razavi, "Design Considerations for Interleaved ADCs," in IEEE Journal of Solid-State Circuits, vol. 48, no. 8, pp. 1806-1817, Aug. 2013.
- [46] J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," *Math. Comp.*, vol. 19, pp. 297-301, 1965.
- [47] T.-D. Chiueh and P.-Y. Tsai, OFDM Baseband Receiver Design for Wireless Communication. New York: Wiley, 2007.
- [48] Jones, Keith. The Regularized Fast Hartley Transform: Optimal Formulation of Real-data Fast Fourier Transform for Silicon-based Implementation in Resource-constrained Environments. Dordrecht: Springer, 2010.
- [49] M. Garrido, K. K. Parhi and J. Grajal, "A Pipelined FFT Architecture for Real-Valued Signals," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 12, pp. 2634-2643, Dec. 2009.
- [50] D. C. Price, "Spectrometers and Polyphase Filter Banks in Radio Astronomy," to be published in the WSPC Handbook of Astronomical Instrumentation.
- [51] J. Chennamangalam, "The Polyphase Filter Bank Technique," 2014 [Online]. Available: http://casper.ssl.berkeley.edu/wiki/The Polyphase Filter Bank Technique.
- [52] F. I. Shimabukuro, P. L. Smith, and W. J. Wilson, "Estimation of the ozone distribution from millimeter wavelength absorption measurements," *J. Geophys. Res.*, 80 (21), 2957–2959, 1975.
- [53] W. J. Wilson and P. R. Schwartz, "Diurnal variations of mesospheric ozone using millimeter-wave measurements," *J. Geophys. Res.*, 86 (C8), 7385-7388, 1981.
- [54] S. Tang, J. Tsai, and T. Chang, "A 2.4-GS/s FFT Processor for OFDM-Based WPAN Applications," in *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 57, no. 6, pp. 451-455, June 2010.

- [55] K. Okada et al., "A 60GHz 16QAM/8PSK/QPSK/BPSK direct-conversion transceiver for IEEE 802.15.3c," IEEE International Solid-State Circuits Conference, San Francisco, CA, 2011, pp. 160-162.
- [56] R. Wu et al., "64-QAM 60-GHz CMOS Transceivers for IEEE 802.11ad/ay," in IEEE Journal of Solid-State Circuits, vol. 52, no. 11, pp. 2871-2891, Nov. 2017.
- [57] B. Sadhu et al., "A 250-mW 60-GHz CMOS Transceiver SoC Integrated with a Four-Element AiP Providing Broad Angular Link Coverage," in *IEEE Journal of Solid-State Circuits*, vol. 55, no. 6, pp. 1516-1529, June 2020.
- [58] T. Mitsunaka et al., "CMOS Biosensor IC Focusing on Dielectric Relaxations of Biological Water With 120 and 60 GHz Oscillator Arrays," in *IEEE Journal of Solid-State Circuits*, vol. 51, no. 11, pp. 2534-2544, Nov. 2016.
- [59] M. Farina et al., "Imaging of sub-cellular structures and organelles by an STM-assisted Scanning Microwave Microscope at mm-Waves," *IEEE/MTT-S International Microwave Symposium*, Philadelphia, PA, 2018, pp. 111-114.
- [60] D. Banerjee, *PLL Performance Simulation and Design*, 4th Edition, National Semiconductor, 2006.
- [61] X. Gao, E. A. M. Klumperink, M. Bohsali and B. Nauta, "A Low Noise Sub-Sampling PLL in Which Divider Noise is Eliminated and PD/CP Noise is Not Multiplied by N<sup>2</sup>," in *IEEE Journal of Solid-State Circuits*, vol. 44, no. 12, pp. 3253-3263, Dec. 2009.
- [62] D. Yang, "A Multi-Loop Calibration-Free Phase-Locked Loop (PLL) for Wideband Clock Generation," *Ph.D. dissertation*, University of California, Los Angeles, Los Angeles, Spring 2019.
- [63] L. Kong and B. Razavi, "A 2.4 GHz 4 mW Integer-N Inductor-less RF Synthesizer," in *IEEE Journal of Solid-State Circuits*, vol. 51, no. 3, pp. 626-635, March 2016.

- [64] T. Seong, Y. Lee, S. Yoo and J. Choi, "A –242-dB FOM and –71-dBc Reference Spur Ring-VCO-based Ultra-Low-Jitter Switched-Loop-Filter PLL Using a Fast Phase-Error Correction Technique," *Symposium on VLSI Circuits*, Kyoto, 2017, pp. C186-C187.
- [65] M. Babaie and R. B. Staszewski, "A Class-F CMOS Oscillator," in IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3120-3133, Dec. 2013.
- [66] C. C. Lim, H. Ramiah, J. Yin, P. Mak and R. P. Martins, "An Inverse-Class-F CMOS Oscillator With Intrinsic-High-Q First Harmonic and Second Harmonic Resonances," in *IEEE Journal of Solid-State Circuits*, vol. 53, no. 12, pp. 3528-3539, Dec. 2018.
- [67] H. Guo, Y. Chen, P.-I. Mak, and R. P. Martins, "A 0.082mm<sup>2</sup> 24.5-to-28.3GHz Multi-LC-Tank Fully-Differential VCO Using Two Separate Single-Turn Inductors and a 1D-Tuning Capacitor Achieving 189.4dBc/Hz FOM and 200±50kHz 1/f<sup>3</sup> PN Corner," *Radio-Frequency Integrated Circuits Symposium*, Los Angeles, CA, 2020.
- [68] H. Guo, Y. Chen, P. Mak and R. P. Martins, "A 0.08mm<sup>2</sup> 25.5-to-29.9GHz Multi-Resonant-RLCM-Tank VCO Using a Single-Turn Multi-Tap Inductor and CM-Only Capacitors Achieving 191.6dBc/Hz FoM and 130kHz 1/f<sup>3</sup> PN Corner," *IEEE International Solid-State Circuits Conference - (ISSCC),* San Francisco, CA, USA, 2019, pp. 410-412.
- [69] E. Monaco, M. Pozzoni, F. Svelto and A. Mazzanti, "Injection-Locked CMOS Frequency Doublers for µ-Wave and mm-Wave Applications," in *IEEE Journal of Solid-State Circuits,* vol. 45, no. 8, pp. 1565-1574, Aug. 2010.
- [70] Y. Hu, T. Siriburanon and R. B. Staszewski, "A Low-Flicker-Noise 30-GHz Class-F23 Oscillator in 28-nm CMOS Using Implicit Resonance and Explicit Common-Mode Return Path," in *IEEE Journal of Solid-State Circuits*, vol. 53, no. 7, pp. 1977-1987, July 2018.
- [71] Z. Zong, M. Babaie and R. B. Staszewski, "A 60 GHz Frequency Generator Based on a 20 GHz Oscillator and an Implicit Multiplier," in *IEEE Journal of Solid-State Circuits*, vol. 51, no. 5, pp. 1261-1273, May 2016.

- [72] Y. Hu et al., "A 21.7-to-26.5GHz Charge-Sharing Locking Quadrature PLL with Implicit Digital Frequency-Tracking Loop Achieving 75fs Jitter and –250dB FoM," *IEEE International Solid-State Circuits Conference - (ISSCC)*, San Francisco, CA, USA, 2020, pp. 276-278.
- [73] C. Fan, J. Yin, C. Lim, P. Mak and R. P. Martins, "A 9mW 54.9-to-63.5GHz Current-Reuse LO Generator with a 186.7dBc/Hz FoM by Unifying a 20GHz 3rd-Harmonic-Rich Current-Output VCO, a Harmonic-Current Filter and a 60GHz TIA," *IEEE International Solid-State Circuits Conference - (ISSCC),* San Francisco, CA, USA, 2020, pp. 282-284.
- [74] S. Choi et al., "An Ultra-Low-Jitter 22.8-GHz Ring-LC-Hybrid Injection-Locked Clock Multiplier with a Multiplication Factor of 114," in *IEEE Journal of Solid-State Circuits*, vol. 54, no. 4, pp. 927-936, April 2019.
- [75] C. Jany, A. Siligaris, J. L. Gonzalez-Jimenez, P. Vincent, and P. Ferrari, "A Programmable Frequency Multiplier-by-29 Architecture for Millimeter Wave Applications," in *IEEE Journal of Solid-State Circuits*, vol. 50, no. 7, pp. 1669-1679, July 2015.
- [76] S. Yoo, S. Choi, J. Kim, H. Yoon, Y. Lee and J. Choi, "A PVT-robust –39dBc 1kHz-to-100MHz integrated-phase-noise 29GHz injection-locked frequency multiplier with a 600µW frequency-tracking loop using the averages of phase deviations for mm-band 5G transceivers," *IEEE International Solid-State Circuits Conference (ISSCC)*, San Francisco, CA, 2017, pp. 324-325.
- [77] H. Yoon et al., "A –31dBc integrated-phase-noise 29GHz fractional-N frequency synthesizer supporting multiple frequency bands for backward-compatible 5G using a frequency doubler and injection-locked frequency multipliers," *IEEE International Solid-State Circuits Conference - (ISSCC),* San Francisco, CA, 2018, pp. 366-368.
- [78] J. Kim et al., "A 76fsrms Jitter and –40dBc Integrated-Phase-Noise 28-to-31GHz Frequency Synthesizer Based on Digital Sub-Sampling PLL Using Optimally Spaced Voltage Comparators and Background Loop-Gain Optimization," *IEEE International Solid-State Circuits Conference - (ISSCC),* San Francisco, CA, USA, 2019, pp. 258-260.

- [79] B. Razavi, "The Role of PLLs in Future Wireline Transmitters," in *IEEE Transactions on Circuits and Systems I: Regular Papers,* vol. 56, no. 8, pp. 1786-1793, Aug. 2009.
- [80] W. El-Halwagy, A. Nag, P. Hisayasu, F. Aryanfar, P. Mousavi and M. Hossain, "A 28-GHz Quadrature Fractional-N Frequency Synthesizer for 5G Transceivers With Less Than 100fs Jitter Based on Cascaded PLL Architecture," in *IEEE Transactions on Microwave Theory and Techniques*, vol. 65, no. 2, pp. 396-413, Feb. 2017.
- [81] J. Zhang, H. Liu, C. Zhao and K. Kang, "A 22.8-to-43.2GHz tuning-less injection-locked frequency tripler using injection-current boosting with 76.4% locking range for multiband 5G applications," *IEEE International Solid-State Circuits Conference - (ISSCC),* San Francisco, CA, 2018, pp. 370-372.
- [82] A. Bhat and N. Krishnapura, "A 25-to-38GHz, 195dB FoM<sub>T</sub> LC QVCO in 65nm LP CMOS Using a 4-Port Dual-Mode Resonator for 5G Radios," *IEEE International Solid-State Circuits Conference - (ISSCC),* San Francisco, CA, USA, 2019, pp. 412-414.
- [83] Y. Shu, H. J. Qian and X. Luo, "A 18.6-to-40.1GHz 201.7dBc/Hz FoM<sub>T</sub> Multi-Core Oscillator Using E-M Mixed-Coupling Resonance Boosting," *IEEE International Solid-State Circuits Conference - (ISSCC),* San Francisco, CA, USA, 2020, pp. 272-274.
- [84] L. Kong and B. Razavi, "A 2.4 GHz 4 mW Integer-N Inductorless RF Synthesizer," in *IEEE Journal of Solid-State Circuits,* vol. 51, no. 3, pp. 626-635, March 2016.
- [85] O. Richard et al., "A 17.5-to-20.94GHz and 35-to-41.88GHz PLL in 65nm CMOS for wireless HD applications," *IEEE International Solid-State Circuits Conference - (ISSCC),* San Francisco, CA, 2010, pp. 252-253.
- [86] H. R. Rategh and T. H. Lee, "Superharmonic injection-locked frequency dividers," in IEEE Journal of Solid-State Circuits, vol. 34, no. 6, pp. 813-821, June 1999.
- [87] R. Adler, "A study of locking phenomena in oscillators," *Proceedings of I.R.E. and Waves and Electrons*, vol. 34, pp. 351-357, 1946.

- [88] B. Razavi, "A study of injection locking and pulling in oscillators," in IEEE Journal of Solid-State Circuits, vol. 39, no. 9, pp. 1415-1424, Sept. 2004.
- [89] J. Chien and L. Lu, "Analysis and Design of Wideband Injection-Locked Ring Oscillators with Multiple-Input Injection," in *IEEE Journal of Solid-State Circuits*, vol. 42, no. 9, pp. 1906-1915, Sept. 2007.
- [90] L. Grimaldi et al., "16.7 A 30GHz Digital Sub-Sampling Fractional-N PLL with 198fsrms Jitter in 65nm LP CMOS," IEEE International Solid-State Circuits Conference - (ISSCC), San Francisco, CA, USA, 2019, pp. 268-270.
- [91] A. Mirzaei, M. E. Heidari, R. Bagheri and A. A. Abidi, "Multi-Phase Injection Widens Lock Range of Ring-Oscillator-Based Frequency Dividers," in *IEEE Journal of Solid-State Circuits*, vol. 43, no. 3, pp. 656-671, March 2008.
- [92] S. Liu, Y. Zheng, W. M. Lim, and W. Yang, "Ring Oscillator Based Injection Locked Frequency Divider Using Dual Injection Paths," in *IEEE Microwave and Wireless Components Letters*, vol. 25, no. 5, pp. 322-324, May 2015.
- [93] M. Vigilante and P. Reynaert, "A 25-102GHz 2.81-5.64mW tunable divide-by-4 in 28nm CMOS," IEEE Asian Solid-State Circuits Conference (A-SSCC), Xiamen, 2015, pp. 1-4.
- [94] X. Gao, E. A. M. Klumperink, G. Socci, M. Bohsali and B. Nauta, "Spur Reduction Techniques for Phase-Locked Loops Exploiting A Sub-Sampling Phase Detector," in *IEEE Journal of Solid-State Circuits*, vol. 45, no. 9, pp. 1809-1821, Sept. 2010.
- [95] Z. Yang, Y. Chen, S. Yang, P. Mak and R. P. Martins, "16.8 A 25.4-to-29.5GHz 10.2mW Isolated Sub-Sampling PLL Achieving -252.9dB Jitter-Power FoM and -63dBc Reference Spur," *IEEE International Solid-State Circuits Conference - (ISSCC),* San Francisco, CA, USA, 2019, pp. 270-272.
- [96] A. Sharkia, S. Mirabbasi and S. Shekhar, "A Type-I Sub-Sampling PLL with a 100×100 μm<sup>2</sup> Footprint and -255-dB FOM," in *IEEE Journal of Solid-State Circuits*, vol. 53, no. 12, pp. 3553-3564, Dec. 2018.

- [97] D. Murphy, H. Darabi and H. Wu, "Implicit Common-Mode Resonance in LC Oscillators," in *IEEE Journal of Solid-State Circuits,* vol. 52, no. 3, pp. 812-821, March 2017.
- [98] S. Yuan and H. Schumacher, "Compact V band frequency doubler with true balanced differential output," 2013 IEEE Bipolar/BiCMOS Circuits and Technology Meeting (BCTM), Bordeaux, 2013, pp. 191-194.
- [99] G. Jeong, W. Kim, J. Park, T. Kim, H. Park and D. Jeong, "A 0.015-mm<sup>2</sup> Inductorless 32-GHz Clock Generator With Wide Frequency-Tuning Range in 28-nm CMOS Technology," in *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 64, no. 6, pp. 655-659, June 2017.
- [100] A. Li, S. Zheng, J. Yin, X. Luo and H. C. Luong, "A 21–48 GHz Subharmonic Injection-Locked Fractional-N Frequency Synthesizer for Multiband Point-to-Point Backhaul Communications," in *IEEE Journal of Solid-State Circuits*, vol. 49, no. 8, pp. 1785-1799, Aug. 2014.
- [101] Y. Lee, T. Seong, J. Lee, C. Hwang, H. Park and J. Choi, "A -240dB-FoMjitter and -115dBc/Hz PN @ 100kHz, 7.7GHz Ring-DCO-Based Digital PLL Using P/I-Gain Co-Optimization and Sequence-Rearranged Optimally Spaced TDC for Flicker-Noise Reduction," 2020 IEEE International Solid-State Circuits Conference - (ISSCC), San Francisco, CA, USA, 2020, pp. 266-268.