Understanding mobility patterns and network flows provides fundamental knowledge for decision making in transportation planning and operations. These insights play a critical role in travel demand modeling, traffic management and control, and the development of robust and sustainable transportation systems. Estimation methods for understanding networked data have a long history. However, direct observations often suffer from several challenges: high dimensionality, limited coverage, and low fidelity. Furthermore, the existing literature tends to treat data, handled with data-driven methods, separately from domain knowledge, encoded in behavior-based and physics-informed models.
This dissertation focuses on statistical learning on high-dimensional networked data in transportation. We consider signals from direct observations and the interdependence among various components of the network, regularized by domain knowledge and structural information. The goal of the research is to develop a methodology that improves the quality, efficiency, and robustness of estimation for high-dimensional networked data in transportation systems. Based on the interdependence of data across network levels, the statistical learning frameworks target three objectives: 1) estimating unknown origin-destination (OD) demand from observable link flows in the network; 2) learning meaningful data representations in networks for critical information extraction and anomaly detection; and 3) statistical filtering for data on directed graphs.
In OD estimation, we present a modeling framework for estimating OD demand from observed traffic flow data in a transportation network, approached from the fresh angle of stochastic programming. The proposed two-stage stochastic programming method is flexible in incorporating various design principles and risk preferences into domain knowledge regarding travel behavior and physical rules. A further benefit comes from the scenario representation, which allows the point estimate to be combined with an estimate of a discrete approximation to the demand distribution. We demonstrate that, under the proposed framework, well-established theories and methods for stochastic programming, including epi-convergence and scenario decomposition, can be exploited to advance the analytical and computational capability of the estimation model. The applicability and efficiency of the proposed method are illustrated via numerical examples based on highway and transit networks of various sizes.
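To make the estimation problem concrete, the following minimal sketch illustrates the inverse problem that underlies OD estimation: recovering a demand vector x from link flows y ≈ A x when there are fewer links than OD pairs. It is not the two-stage stochastic program described above. It swaps in a plain ridge-regularized least-squares baseline, and the function name `estimate_od`, the assignment matrix `A`, the prior demand `x_prior`, and the weight `lam` are all hypothetical choices made for illustration.

```python
import numpy as np

def estimate_od(A, y, x_prior, lam=1.0):
    """Ridge-regularized least-squares OD estimate (illustrative baseline only).

    Solves  min_x ||A x - y||^2 + lam * ||x - x_prior||^2,
    whose closed form is (A^T A + lam I)^{-1} (A^T y + lam x_prior).
    Non-negativity of demand is NOT enforced in this sketch.
    """
    n_od = A.shape[1]
    lhs = A.T @ A + lam * np.eye(n_od)
    rhs = A.T @ y + lam * x_prior
    return np.linalg.solve(lhs, rhs)

# Hypothetical toy network: 2 observed links, 3 OD pairs.
# A[l, k] = 1 if OD pair k routes its flow over link l (assumed fixed routing).
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
x_true = np.array([100.0, 50.0, 80.0])   # unknown in practice
y = A @ x_true                           # observed link flows
x_prior = np.array([80.0, 60.0, 90.0])   # assumed historical demand

x_hat = estimate_od(A, y, x_prior, lam=1e-3)
print("estimated demand:", np.round(x_hat, 2))
print("reproduced link flows:", np.round(A @ x_hat, 2))
```

With a small `lam` the estimate reproduces the observed link flows almost exactly and uses the prior only to resolve the directions that the link counts cannot identify; with a large `lam` it stays close to the prior.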
In representation learning, we introduce a new perspective: the critical information in the data should reflect how the data is used in downstream applications. This design philosophy differs from the one adopted in most existing data representation learning methods. We propose an application-driven representation learning framework that incorporates the information loss for the downstream application into the data encoding-decoding process. The proposed approach is formulated as a Stiefel manifold optimization problem. The effectiveness of the proposed framework is demonstrated through three case studies: transportation network performance assessment, vehicular emission estimation, and anomaly detection of travel demand. Experiments show that our proposed approach performs better than classical representation learning approaches, especially in applications involving complex network interdependence.
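As an illustration of what an application-driven encoding-decoding problem on the Stiefel manifold can look like, the sketch below assumes a linear encoder/decoder W with orthonormal columns, a hypothetical linear downstream quantity a^T x, and a trade-off weight `mu`. None of these specifics come from the dissertation. The solver is Riemannian gradient descent with a QR retraction, a standard technique for Stiefel-constrained problems and not necessarily the one used in this work.

```python
import numpy as np

def loss(W, X, a, mu):
    """Reconstruction loss plus an (assumed) downstream application loss."""
    R = X - X @ W @ W.T                      # encode-decode residual
    return np.sum(R ** 2) + mu * np.sum((R @ a) ** 2)

def euclid_grad(W, X, a, mu):
    """Euclidean gradient of loss = tr(R G R^T), with G = I + mu * a a^T."""
    R = X - X @ W @ W.T
    G = np.eye(X.shape[1]) + mu * np.outer(a, a)
    return -2.0 * (X.T @ R @ G @ W + G @ R.T @ X @ W)

def qr_retract(Y):
    """Map a full-rank matrix back onto the Stiefel manifold via QR."""
    Q, Rm = np.linalg.qr(Y)
    signs = np.where(np.diag(Rm) < 0, -1.0, 1.0)
    return Q * signs                         # fix column signs for uniqueness

def fit(X, a, k, mu=1.0, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    W = qr_retract(rng.standard_normal((X.shape[1], k)))
    for _ in range(iters):
        g = euclid_grad(W, X, a, mu)
        # Project onto the tangent space at W (embedded metric).
        rg = g - W @ (0.5 * (W.T @ g + g.T @ W))
        step, f0 = 1.0, loss(W, X, a, mu)
        while step > 1e-12:                  # backtrack until the loss decreases
            W_new = qr_retract(W - step * rg)
            if loss(W_new, X, a, mu) < f0:
                W = W_new
                break
            step *= 0.5
        else:
            break                            # no improving step: stop early
    return W

# Hypothetical data: 40 observations of a 6-dimensional network signal.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 6))
a = np.ones(6)                               # assumed downstream weights
W = fit(X, a, k=2, mu=5.0)
print("orthonormality error:", np.linalg.norm(W.T @ W - np.eye(2)))
```

Setting `mu = 0` reduces the objective to ordinary reconstruction error, the PCA-like case; increasing `mu` biases the learned subspace toward directions that matter for the assumed downstream quantity.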
In graph signal filtering, we develop a continuous-time Markov chain-based filtering method on directed graphs using a nonparametric regression technique. By bridging stochastic processes and algebraic graph theory, we utilize a transition matrix basis that depends on the connectivity of points across regions of varying density. Compared with methods dedicated to undirected graphs and with spatial filtering methods, our approach is capable of capturing directional information flows and local asymmetric structure in data observations. The performance of our approach is shown through a series of synthetic and real-world case studies on network traffic flow. This work demonstrates the potential for incorporating heterogeneous structures for data defined on irregular domains.
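The idea of using a continuous-time Markov chain to smooth a signal on a directed graph can be sketched as follows. This is a generic heat-kernel style smoother under stated assumptions, not the transition matrix basis or the nonparametric regression procedure developed in this work: edge weights are treated as jump rates, the transition matrix P(t) = exp(tQ) is computed by uniformization, and the function names and the diffusion time `t` are hypothetical.

```python
import numpy as np

def ctmc_generator(A):
    """Generator Q of a CTMC on a directed graph; A[i, j] is the rate i -> j."""
    A = np.asarray(A, dtype=float)
    return A - np.diag(A.sum(axis=1))        # rows of Q sum to zero

def transition_matrix(Q, t, tol=1e-12, max_terms=10_000):
    """P(t) = exp(tQ) via uniformization (suitable for small rate * t)."""
    n = Q.shape[0]
    rate = max(np.max(-np.diag(Q)), 1e-12)   # at least the largest exit rate
    P_jump = np.eye(n) + Q / rate            # row-stochastic jump matrix
    weight = np.exp(-rate * t)               # Poisson weight for zero jumps
    term = np.eye(n)
    P = weight * term
    total = weight
    k = 0
    while 1.0 - total > tol and k < max_terms:
        k += 1
        term = term @ P_jump                 # k-step jump probabilities
        weight *= rate * t / k               # Poisson weight for k jumps
        P += weight * term
        total += weight
    return P / P.sum(axis=1, keepdims=True)  # remove truncation error

def ctmc_filter(A, x, t):
    """Smooth x: node i receives the expected value of x after time t from i."""
    return transition_matrix(ctmc_generator(A), t) @ x

# Hypothetical example: directed 4-cycle 0 -> 1 -> 2 -> 3 -> 0, unit rates.
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])           # spike at node 0
print(np.round(ctmc_filter(A, x, t=1.0), 3))
```

Because the graph is directed, P(t) is not symmetric: node 3, which reaches node 0 in one hop, receives more of the spike than node 1, which needs three hops. A filter built on a symmetrized graph would treat the two nodes identically.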