Dialogue systems, also known as conversational agents, are intelligent computer systems that converse with humans in natural language (text or speech). Based on their applications, they generally fall into two categories: chit-chat dialogue systems (chatbots) and task-oriented dialogue systems. The former interact with humans to provide reasonable responses and entertainment in open domains, whereas the latter aim to help users complete a dedicated task in a specific domain, for example, inquiring about the weather, reserving restaurants, or booking flights. This thesis focuses primarily on task-oriented dialogue systems.
The earliest dialogue systems relied heavily on complicated, hand-crafted rules and logic, which were costly to build and difficult to scale. With the recent unprecedented progress in natural language processing propelled by deep learning, statistical dialogue systems have gradually gained adoption because they reduce cost and provide robustness. Traditionally, these systems were highly modularized in a pipelined manner, comprising spoken language understanding, dialogue state tracking, dialogue management, and response generation. However, such a modularized design typically requires expensive human labeling and easily leads to error propagation due to module dependencies. In contrast, end-to-end neural models learn hidden dialogue representations automatically and generate system responses directly, thereby requiring much less human effort and scaling more easily to new domains. Although recurrent neural network (RNN) based models show promising results, they still suffer from several major drawbacks. First, dominant RNNs are inherently unstable over long sequences, as they tend to favor short-term memories and forcefully compress the entire history into a single hidden state vector \citep{weston2014memory}. Second, RNNs focus primarily on modeling sequential dependencies, so the rich graph-structured information hidden in the dialogue context is completely ignored. Lastly, effectively incorporating external knowledge into end-to-end task-oriented dialogue systems remains a challenge.
In this thesis, we are dedicated to addressing these limitations of conventional neural models and propose end-to-end learning frameworks that model long-term dialogue context and learn dialogue graph structures. To address the weakness of RNNs in processing long sequences, we propose a novel approach that models long-term slot context and fully exploits the semantic correlation between slots and intents. We adopt a key-value memory network to model slot context dynamically and to track the more important slot tags decoded so far, which are then fed into our decoder for slot tagging. Furthermore, gated memory information is used to perform intent detection, so that both tasks mutually improve each other through global optimization. We empirically show that our key-value memory network effectively tracks long-term dialogue context and improves spoken language understanding by a large margin.
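To make the mechanism concrete, the following is a minimal sketch, in PyTorch, of a key-value memory read whose summary is gated before being reused by a downstream slot decoder and intent classifier. The module name, dimensions, and the exact gating form are illustrative assumptions rather than the thesis implementation.

```python
# Minimal sketch (not the thesis implementation) of a key-value memory read
# with a gated memory summary shared between slot tagging and intent detection.
# Dimensions, names, and the gating form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValueMemoryReader(nn.Module):
    def __init__(self, hidden_dim: int, mem_slots: int):
        super().__init__()
        # Keys index previously decoded slot tags; values store their representations.
        self.keys = nn.Parameter(torch.randn(mem_slots, hidden_dim))
        self.values = nn.Parameter(torch.randn(mem_slots, hidden_dim))
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, query: torch.Tensor):
        # query: (batch, hidden_dim) -- current decoder state.
        attn = F.softmax(query @ self.keys.t(), dim=-1)   # (batch, mem_slots)
        read = attn @ self.values                          # (batch, hidden_dim)
        # The gate controls how much memory context flows into the slot decoder
        # and into the shared intent representation.
        g = torch.sigmoid(self.gate(torch.cat([query, read], dim=-1)))
        fused = g * read + (1.0 - g) * query
        return fused, attn

if __name__ == "__main__":
    reader = KeyValueMemoryReader(hidden_dim=64, mem_slots=10)
    fused, attn = reader(torch.randn(4, 64))
    print(fused.shape, attn.shape)  # torch.Size([4, 64]) torch.Size([4, 10])
```

In this sketch the same gated summary would be fed to both the slot decoder and the intent classifier, which is one simple way to realize the joint optimization described above.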
To model the rich graph structures in dialogue utterances, we introduce a new Graph Convolutional LSTM (GC-LSTM) that learns the semantics contained in graph-structured dialogues by incorporating a powerful graph convolutional operator. The proposed GC-LSTM not only captures the spatio-temporal semantic features in a dialogue, but also learns the co-occurrence relationship between intent detection and slot filling. Furthermore, we propose a Graph-to-Sequence learning framework that pushes the performance of spoken language understanding to a new state of the art.
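As an illustration of how a graph convolution can be folded into recurrent gating, the sketch below uses the simple normalized-adjacency propagation rule inside each LSTM gate. The cell structure, gate layout, and operator choice are assumptions for exposition and may differ from the GC-LSTM design in the thesis.

```python
# Minimal sketch of a graph-convolutional LSTM cell: features are propagated
# over a normalized adjacency matrix before the usual LSTM gating is applied.
# The exact operator and gate layout are illustrative assumptions.
import torch
import torch.nn as nn

class GraphConvLSTMCell(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        # One linear map per gate, applied after graph propagation.
        self.x2g = nn.Linear(in_dim, 4 * hidden_dim)
        self.h2g = nn.Linear(hidden_dim, 4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, x, h, c, adj):
        # x: (nodes, in_dim); h, c: (nodes, hidden_dim); adj: normalized (nodes, nodes).
        gx = self.x2g(adj @ x)   # graph-convolved input features
        gh = self.h2g(adj @ h)   # graph-convolved hidden state
        i, f, o, g = torch.chunk(gx + gh, 4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_next = f * c + i * torch.tanh(g)
        h_next = o * torch.tanh(c_next)
        return h_next, c_next

if __name__ == "__main__":
    cell = GraphConvLSTMCell(in_dim=32, hidden_dim=64)
    n = 5
    adj = torch.eye(n)  # placeholder normalized adjacency
    h, c = torch.zeros(n, 64), torch.zeros(n, 64)
    h, c = cell(torch.randn(n, 32), h, c, adj)
    print(h.shape)  # torch.Size([5, 64])
```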
To integrate knowledge effectively into dialogue generation systems, we propose a novel end-to-end learning framework that incorporates an external knowledge base (KB) and captures the intrinsic graph semantics of the dialogue history. Specifically, we propose a Graph Memory Network (GMN) based sequence-to-sequence model, GraphMemDialog, to effectively learn the inherent structural information hidden in the dialogue history and to model the dynamic interaction between the dialogue history and the KB. We adopt a modified graph attention network to learn rich structural representations of the dialogue history, whereas context-aware representations of KB entities are learned by our novel GMN. To fully exploit this dynamic interaction, we design a learnable memory controller coupled with external KB entity memories that recurrently incorporates dialogue history context into KB entities through a multi-hop reasoning mechanism. Experiments on three public datasets show that our GraphMemDialog model achieves state-of-the-art performance and outperforms strong baselines by a large margin, especially on datasets with more complicated KB information.
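To illustrate the multi-hop interaction between a dialogue-history representation and KB entity memories, the following sketch iteratively attends over entity memories and updates a controller state from what is read at each hop. The controller update rule, the number of hops, and all names are illustrative assumptions rather than the exact GMN design.

```python
# Minimal sketch of multi-hop reading over KB entity memories: a controller
# state initialized from the dialogue-history encoding is refined at each hop.
# Names and the update rule are illustrative assumptions, not the exact GMN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHopKBMemory(nn.Module):
    def __init__(self, dim: int, hops: int = 3):
        super().__init__()
        self.hops = hops
        self.controller_update = nn.Linear(2 * dim, dim)

    def forward(self, dialogue_state, kb_entities):
        # dialogue_state: (batch, dim); kb_entities: (batch, num_entities, dim).
        q = dialogue_state
        for _ in range(self.hops):
            scores = torch.bmm(kb_entities, q.unsqueeze(-1)).squeeze(-1)  # (batch, num_entities)
            attn = F.softmax(scores, dim=-1)
            read = torch.bmm(attn.unsqueeze(1), kb_entities).squeeze(1)   # (batch, dim)
            # The controller fuses the previous query with what was read from the KB.
            q = torch.tanh(self.controller_update(torch.cat([q, read], dim=-1)))
        return q, attn  # final controller state and last-hop attention over entities

if __name__ == "__main__":
    mem = MultiHopKBMemory(dim=64, hops=3)
    q, attn = mem(torch.randn(2, 64), torch.randn(2, 8, 64))
    print(q.shape, attn.shape)  # torch.Size([2, 64]) torch.Size([2, 8])
```

In a full model along these lines, the final controller state and entity attention would condition the response decoder, so that generated responses can copy or mention the KB entities most relevant to the dialogue context.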