The goal of a machine translation (MT) system is to automatically translate a document written in some human input language (e.g., Mandarin Chinese) into an equivalent document written in an output language (e.g., English). This task---so simple in its specification, and yet so rich in its complexities---has challenged computer science researchers for 60 years. While MT systems are in wide use today, the problem of producing human-quality translations remains unsolved.
Statistical approaches have substantially improved the quality of MT systems by effectively exploiting parallel corpora: large collections of documents that have been translated by people, and therefore naturally occur in both the input and output languages. Broadly characterized, statistical MT systems translate an input document by matching fragments of its contents to examples in a parallel corpus, and then stitching together the translations of those fragments into a coherent document in an output language.
The central challenge of this approach is to distill example translations into reusable parts: fragments of sentences that we know how to translate robustly and are likely to recur. Individual words are certainly common enough to recur, but they often cannot be translated correctly in isolation. At the other extreme, whole sentences can be translated without much context, but rarely repeat, and so cannot be recycled to build new translations.
This thesis focuses on acquiring translations of phrases: contiguous sequences of a few words that encapsulate enough context to be translatable, but recur frequently in large corpora. We automatically identify phrase-level translations that are contained within human-translated sentences by partitioning each sentence into phrases and aligning phrases across languages. This alignment-based approach to acquiring phrasal translations gives rise to statistical models of phrase alignment.
A statistical phrase alignment model assigns a score to each possible analysis of a sentence-level translation, where an analysis describes which phrases within that sentence can be translated and how to translate them. If the model assigns a high score to a particular phrasal translation, we should be willing to reuse that translation in new sentences that contain the same phrase. Chapter 1 provides a non-technical introduction to phrase alignment models and machine translation. Chapter 2 describes a complete state-of-the-art phrase-based translation system to clarify the role of phrase alignment models. The remainder of this thesis presents a series of novel models, analyses, and experimental results that together constitute a thorough investigation of phrase alignment models for statistical machine translation.
Chapter 3 presents the formal properties of the class of phrase alignment models, including inference algorithms and tractability results. We present two specific models, along with statistical learning techniques to fit their parameters to data. Our experimental evaluation identifies two primary challenges to training and employing phrase alignment models, and we address each of these in turn.
The first broad challenge is that generative phrase models are structured to prefer very long, rare phrases. These models require external pressure to explain observed translations using small, reusable phrases rather than large, unique ones. Chapter 4 describes three Bayesian models and a corresponding Gibbs sampler to address this challenge. These models outperform the word-level models that are widely employed in research and production MT systems.
The second broad challenge is structural: there are many consistent and coherent ways of analyzing a translated sentence using phrases. Long phrases, short phrases, and overlapping phrases can all simultaneously express correct, translatable units. However, no previous phrase alignment models have leveraged this rich structure to predict alignments. We describe a discriminative model of multi-scale, overlapping phrases that outperforms all previously proposed models.
The cumulative result of this thesis is to establish model-based phrase alignment as the most effective approach to acquiring phrasal translations. Only phrase alignment models are able to incorporate statistical signals about multi-word constructions into alignment decisions and score coherent phrasal analyses of full sentence pairs. As a result, phrase alignment models outperform classical word-level models in both generative and discriminative settings. This result is fundamental to the field: the models proposed in this thesis address a general, language-independent alignment problem that arises in all state-of-the-art statistical machine translation systems in use today.