Data visualization is the process of taking high-dimensional data and representing it in two or three dimensions. Its main goal is to represent the data in such a way that a human observer can gain insight into the data’s structure or patterns. Many methods produce visually appealing representations on a variety of datasets, but there is limited theory explaining why some of these methods work. The purpose of this dissertation is to identify quality metrics of good visualizations and use them to create new methods.
Chapter 1 is an introduction to t-distributed stochastic neighbor embedding (t-SNE), one of the standard algorithms for data visualization. It is a non-convex method that represents the pairwise similarities between data points as probability distributions in high and low dimensions and then minimizes the ‘distance’ between these two distributions, measured by the Kullback-Leibler divergence. Despite its wide success, there is still very little mathematical understanding of the algorithm.
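As an illustration of the objective described above, the following is a minimal sketch and not the implementation studied in this dissertation. It only evaluates the Kullback-Leibler divergence for a given low-dimensional layout and performs no optimization. The function names (`high_dim_affinities`, `low_dim_affinities`, `kl_divergence`) and the single fixed bandwidth `sigma` are assumptions made for illustration; standard t-SNE instead calibrates a bandwidth per point from a perplexity parameter.

```python
import math


def squared_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def high_dim_affinities(X, sigma=1.0):
    """Symmetrized Gaussian affinities P over pairs of input points.

    Simplification (assumption): one fixed bandwidth `sigma` for all points,
    rather than a per-point bandwidth calibrated from a perplexity.
    """
    n = len(X)
    cond = []
    for i in range(n):
        row = [
            math.exp(-squared_dist(X[i], X[j]) / (2.0 * sigma ** 2)) if j != i else 0.0
            for j in range(n)
        ]
        total = sum(row)
        cond.append([v / total for v in row])  # conditional p_{j|i}
    # Symmetrize so that the entries of P sum to one.
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]


def low_dim_affinities(Y):
    """Student-t (one degree of freedom) affinities Q over pairs of embedded points."""
    n = len(Y)
    w = [
        [1.0 / (1.0 + squared_dist(Y[i], Y[j])) if j != i else 0.0 for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all pairs; `eps` guards against log(0)."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for row_p, row_q in zip(P, Q)
        for p, q in zip(row_p, row_q)
        if p > 0.0
    )


# Hypothetical toy data: two tight pairs of points in three dimensions.
X = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
Y_faithful = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]  # keeps the pairs together
Y_scrambled = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]  # splits the pairs

P = high_dim_affinities(X)
print("KL, faithful layout: ", kl_divergence(P, low_dim_affinities(Y_faithful)))
print("KL, scrambled layout:", kl_divergence(P, low_dim_affinities(Y_scrambled)))
```

In this toy setting the layout that keeps neighboring points together should give the smaller divergence, which is the quantity t-SNE seeks to minimize over all layouts.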
One of t-SNE’s more interesting properties is its tendency to preserve the local linear structure of data. For example, if t-SNE is given a dataset of multiple rings in high dimensions, its low-dimensional representation will include multiple rings rather than multiple clusters. This preservation of fine local structure sets it apart from other methods. Chapter 1 also defines and explores discrete curves, a new mathematical definition meant to represent these fine local structures in both high and low dimensions. Chapter 2 then rigorously proves that, given a 1D structure in high dimensions, t-SNE will visualize that structure in its output.
This dissertation not only proves that t-SNE preserves these discrete curves in theory, but also demonstrates that this knowledge can be applied successfully in practice, specifically for data integration of single-cell measurements. Chapter 3 introduces single-cell analysis, the study of human cells in the same population (liver cells, skin cells, etc.) that are genetically identical but behave differently. The differences between these cells can have important impacts on the health and function of the whole cell population. Single-cell analysis studies this cell-to-cell variation within a given cell population by looking at different properties of the cells’ genomes, such as gene expression and chromatin accessibility, which are referred to as single-cell measurements. It has been applied to the study of diseases, drug development, and in-depth analysis of stem cell differentiation.
One of the major challenges in single-cell analysis is processing the single-cell measurements. Due to technical limitations, it can be hard to obtain multiple types of measurements of the same cell. For example, for a small population of liver cells, a user may only have access to one dataset representing the gene expression of each cell and another dataset representing the chromatin accessibility. These datasets will not only be very high-dimensional and have local discrete structures, but will also differ in dimension, since they represent different properties. Thus there is no direct way to identify corresponding features across the two datasets. Chapter 4 discusses AVIDA, an algorithm that processes these datasets to produce a single dataset representing both single-cell measurements. AVIDA achieves this by using t-SNE and Optimal Transport methods not only to integrate the two datasets into the same domain, but also to generate a visualization that highlights the local underlying structures in the single-cell measurements.