State of the art

Collection, annotation and exploitation of multilingual corpora

As Koehn (2003) and Och (2005) stated, more data yields better translations. Our first challenge is therefore to develop multilingual resources (collecting large monolingual, parallel and comparable corpora) and a general framework to represent them.

Building large and representative monolingual corpora for open or restricted domains helps improve the performance of statistical MT systems. Och (2005) shows that more monolingual data (the source from which the language model is inferred) implies better translations. He also shows a log-scale improvement in BLEU: each doubling of the training data gives a constant improvement (+0.5% BLEU). However, building a monolingual corpus in an assisted way is a difficult task, mainly because of the effort needed to compile documents of different formats and sources. This problem is especially relevant for non-central languages. To address it, SIGWAC (the Special Interest Group of the ACL on Web as Corpus) proposes using the Internet as a large source of documents (Kilgarriff and Grefenstette, 2003). However, new problems arise with this approach: boilerplate removal (CLEANEVAL), duplicate detection (Broder, 2000), topic filtering (Leturia et al., 2008), etc.
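
The duplicate-detection problem mentioned above can be illustrated with a minimal sketch based on word shingles and set resemblance (Jaccard overlap). This is only an illustration under stated assumptions: the function names, the shingle width `w` and the `threshold` value are hypothetical, and the exact pairwise comparison used here would need to be replaced by sampled sketches to scale to web-sized collections.

```python
def shingles(text, w=3):
    """Return the set of contiguous w-word shingles of a text (lowercased)."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Short texts are represented by a single shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard overlap between the shingle sets of two texts."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def deduplicate(docs, threshold=0.8, w=3):
    """Keep a document only if it is not a near-duplicate of one already kept.

    The threshold is a hypothetical value chosen for illustration.
    """
    kept = []
    for doc in docs:
        if all(resemblance(doc, k, w) < threshold for k in kept):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    pages = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "a completely different sentence about machine translation",
    ]
    print(len(deduplicate(pages)))
```

In this toy run the second page is an exact copy of the first and is dropped, so two pages remain.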

The other type of corpora widely used in SMT is parallel corpora. Specifically, these corpora are used to train translation models, so their size is extremely important. Koehn (2003) shows how the BLEU score rises as the size of the bilingual parallel corpus increases. Again, log-scale improvements in BLEU are shown for all languages: here, doubling the training data gives a constant improvement of +1% BLEU. This experiment was performed using a statistical system and the Europarl parallel corpus (30 million words in 11 languages). Parallel corpora can also be exploited for terminology extraction and for extending dictionaries used in RBMT. They are often derived from translation memories. Unfortunately, this kind of corpus is scarce. As an alternative, some authors propose using multilingual web sites as a source to build parallel corpora automatically (Resnik and Smith, 2003; Zhang et al., 2006; Cheng and Nie, 2000).
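
The log-scale relationship reported for both monolingual and parallel data can be made concrete with a small sketch. It assumes a simple log-linear model, in which each doubling of the corpus adds a fixed gain; the baseline score and corpus sizes below are hypothetical values chosen only for illustration, not figures from the cited studies.

```python
import math


def projected_bleu(base_bleu, base_size, new_size, gain_per_doubling):
    """Project BLEU assuming each doubling of the data adds a constant gain."""
    if base_size <= 0 or new_size <= 0:
        raise ValueError("corpus sizes must be positive")
    doublings = math.log2(new_size / base_size)
    return base_bleu + gain_per_doubling * doublings


if __name__ == "__main__":
    # Hypothetical baseline: 20.0 BLEU with a 1M-word parallel corpus.
    # Growing the corpus to 8M words is three doublings.
    print(projected_bleu(20.0, 1_000_000, 8_000_000, gain_per_doubling=1.0))
```

Under this model, going from 1 to 8 million words of parallel data adds three times the per-doubling gain, which shows why each additional BLEU point requires twice as much data as the previous one.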

Nevertheless, parallel corpora are still scarce for unusual language pairs or specific domains. Thus, some authors propose exploiting comparable corpora to extract translation knowledge. They focus mainly on extracting bilingual terminology (Rapp, 1995; Fung, 1996), which can be very valuable for updating and extending the bilingual dictionaries used in RBMT. Other authors have improved the performance of SMT systems by integrating comparable corpora exploitation (Talbot, 2003; Munteanu et al., 2004; Hewavitharana and Vogel, 2008; Snover, 2008).

As with the compilation of parallel and monolingual corpora, different approaches have been proposed to compile comparable corpora: assisted compilation (e.g., BNC), RSS feeds (Fairon et al., 2008), web crawling (Chakrabarti et al., 1999), and search APIs (Baroni and Bernardini, 2004). Nevertheless, the collection of comparable corpora (Talvensaari et al., 2008) is a relatively new research topic, and some problems have not been resolved yet. For example, how to obtain a high degree of comparability, which parameters determine it, and how it affects the performance of the exploitation are still open questions.

However, the precision and recall of terminology extraction from comparable corpora are still far from those achieved with parallel corpora (Melamed, 2001; Och, 2002). While studies report around 80% precision (taking into account the first 10-20 candidates) with comparable corpora (Fung, 1995; Rapp, 1995), the precision that can be achieved with parallel corpora is above 95% taking into account only the top candidates (Tiedemann, 1998). This is mainly because the knowledge to be inferred is more implicit in comparable corpora. The most widely used paradigms are context similarity (Fung, 1995; Rapp, 1995) and string similarity (Al-Onaizan and Knight, 2002). Several works discuss how to determine which words make up the context of a word and how relevant they are with respect to the word in question (Petkar et al., 2007; Gamallo, 2008). Another way of representing contexts is by using language models (Shao et al., 2004; Saralegi et al., 2008). Another important issue is the treatment of Multiword Terms (MWT). There are not many works dealing with MWT (Daille and Morin, 2005; Morin et al., 2007), but they represent an essential lexicon for adapting an RBMT system to a specific domain.
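
The context-similarity paradigm mentioned above can be sketched as follows: build a context vector for each word, project the source vector into the target vocabulary through a seed dictionary, and rank target words by cosine similarity. This is a minimal sketch under stated assumptions, not the method of any cited work: the window size, the raw co-occurrence counts (real systems usually apply an association measure), the function names and the toy Spanish-English data are all hypothetical.

```python
import math
from collections import Counter


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = vectors.setdefault(word, Counter())
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def translate_vector(vector, seed_dict):
    """Project a source context vector into the target vocabulary."""
    projected = Counter()
    for word, count in vector.items():
        if word in seed_dict:
            projected[seed_dict[word]] += count
    return projected


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(source_word, src_vecs, tgt_vecs, seed_dict, top_n=10):
    """Return the top_n target words ranked by context similarity."""
    projected = translate_vector(src_vecs.get(source_word, Counter()), seed_dict)
    scored = [(cosine(projected, vec), tgt) for tgt, vec in tgt_vecs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(tgt, score) for score, tgt in scored[:top_n]]


if __name__ == "__main__":
    # Toy, made-up comparable data (not aligned sentence by sentence in practice).
    src = [["el", "perro", "come", "carne"], ["el", "gato", "bebe", "leche"]]
    tgt = [["the", "dog", "eats", "meat"], ["the", "cat", "drinks", "milk"]]
    seed = {"el": "the", "come": "eats", "carne": "meat",
            "bebe": "drinks", "leche": "milk"}
    ranking = rank_candidates("perro", context_vectors(src),
                              context_vectors(tgt), seed, top_n=3)
    print(ranking[0][0])
```

Returning a ranked list rather than a single answer mirrors the way precision is reported for comparable corpora, over the first 10-20 candidates.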

MT combined/hybrid systems

Traditionally, MT was performed via rule-based machine translation (RBMT) systems that worked by applying a set of linguistic rules in three phases: analysis, transfer and generation. Since the late 1980s there has been much interest in exploring new techniques in corpus-based approaches: statistical text analysis (alignment, etc.), example-based machine translation (EBMT) and statistical machine translation (SMT). Nowadays, research is oriented towards hybrid systems, which combine traditional linguistic rules and corpus-based approaches.

EBMT and SMT are the two main models in corpus-based MT. Both need a set of sentences in one language aligned with their translation in another. GIZA++ is a well-known tool to perform this alignment. The two models induce translation knowledge from sentence-aligned corpora, but there are significant differences regarding both the type of information learnt and how this is brought to bear in dealing with new input.

In SMT, essentially, the translation model (obtained from a parallel corpus) establishes the set of target-language words (and, more recently, phrases) that are most likely to be useful in translating the source string, while the language model (obtained from a monolingual corpus) tries to assemble these words (and phrases) in the most probable target word order. Nowadays, SMT practitioners also get their systems to learn phrasal as well as lexical alignments (e.g., Koehn et al., 2003; Och, 2003). Novel approaches to reordering in phrase-based statistical MT have also been proposed (Kanthak et al., 2005).
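
The division of labour between the two models can be illustrated with a minimal sketch that rescores a fixed list of candidate translations by adding a translation-model log-probability to a bigram language-model log-probability. Everything here is an assumption made for illustration: the candidate list, the probabilities, the `floor` value for unseen bigrams and the function names are hypothetical, and a real decoder searches a very large hypothesis space instead of scoring a handful of candidates.

```python
import math


def lm_logprob(tokens, bigram_probs, floor=1e-6):
    """Log-probability of a token sequence under a bigram language model.

    Unseen bigrams receive a small floor probability (a crude stand-in
    for proper smoothing).
    """
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return sum(
        math.log(bigram_probs.get((a, b), floor))
        for a, b in zip(padded, padded[1:])
    )


def best_translation(candidates, bigram_probs):
    """Pick the candidate maximizing log P(source | target) + log P(target).

    Each candidate is a (tokens, translation_model_logprob) pair.
    """
    def score(candidate):
        tokens, tm_logprob = candidate
        return tm_logprob + lm_logprob(tokens, bigram_probs)

    return max(candidates, key=score)


if __name__ == "__main__":
    # Made-up bigram probabilities for a tiny English language model.
    bigrams = {
        ("<s>", "the"): 0.5,
        ("the", "red"): 0.2,
        ("red", "house"): 0.3,
        ("house", "</s>"): 0.4,
        ("the", "house"): 0.3,
    }
    # Both candidates contain the same words, so the translation model
    # scores them equally; only the language model separates them.
    candidates = [
        (["the", "house", "red"], math.log(0.1)),
        (["the", "red", "house"], math.log(0.1)),
    ]
    print(" ".join(best_translation(candidates, bigrams)[0]))
```

In this toy case the translation model cannot distinguish the two word orders, and the language model selects the fluent one.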

Recently, several possible approaches have been developed to combine the RBMT, EBMT and SMT engines (some of them are being explored in our current OpenMT project):

  • Combining MT paradigms: Multi-Engine MT. van Zaanen and Somers (2005), Matusov et al. (2006), and Macherey and Och (2007) review a set of references about MEMT (Multi-Engine MT), including the first attempt by Frederking and Nirenburg (1994). Recent works (Chen et al., 2007; Matusov et al., 2006; Macherey and Och, 2007; Mellebeek et al., 2006; Huang and Papineni, 2007; Rosti et al., 2007) report improvements of up to 18% in BLEU score.

  • Combining MT paradigms: Statistical post-edition (SPE) on the RBMT output. In (Simard et al., 2007a) and (Isabelle et al., 2007) the SPE task is viewed as translation from the language of RBMT outputs into the language of their manually post-edited counterparts, so they do not use a parallel corpus created for human translation. They report an improvement of 5 BLEU points over the direct SMT approach, and conclude that such an RBMT+SPE system appears to be an excellent way to improve the output of a vanilla RBMT system and constitutes a worthwhile alternative to costly manual adaptation efforts for such systems. An SPE system using a corpus with no more than 100,000 words of post-edited translations is thus enough to outperform an expensive lexicon-enriched baseline RBMT system. The same group (Simard et al., 2007b) shows that while post-editing is more effective when little training data is available, it remains competitive with SMT translation even with larger amounts of data. Similar approaches are reported by Dugast et al. (2007) and Elming (2006). Within the OpenMT project, a post-edition system has achieved an improvement greater than 40% in the BLEU metric (Diaz de Ilarraza et al., 2008).

  • Hybridizing MT paradigms. Fewer works in the recent literature present a truly internal hybridization of several MT paradigms (i.e., where the output is constructed by taking into account simultaneous information from different paradigms). Two strategies to be explored are the following. 1) Introducing statistical knowledge to resolve ambiguities in RBMT systems. For instance, most rule-based systems lack a mechanism for deciding which translation rule must be applied when several rules are equally applicable in the same context to the same sentence span. Discriminative lexical selection techniques could provide an effective solution to this problem. Chan and Chiang (2007) have applied similar ideas to their hierarchical statistical MT system. In this system, however, rules are not manually defined but automatically induced from word alignments. 2) A second and more general approach would be to devise an inference procedure that is able to deal with multiple proposed fragment translations coming from different sources (translation tables from SMT, fragments translated by RBMT, examples from EBMT, etc.). The search for the optimal translation should coherently combine the translation candidates (of different granularities), also taking into account constraints regarding the target language. The work by Groves and Way (2005), which introduces EBMT translation pairs into an SMT system, falls into this category.

Integration of advanced linguistic knowledge in shallow machine translation

  • Integration of syntax into statistical machine translation

A limitation of standard phrase-based SMT systems is that reordering models are very simple. For instance, non-contiguous phrases are not allowed, long-distance dependencies are not modeled, and syntactic transformations are not captured. Syntax-based approaches seek to remedy these deficiencies by explicitly taking into account syntactic knowledge. Approaches to syntax-based MT differ in several aspects: (i) side of parsing (source, target, or both sides), (ii) type of parsing (dependencies vs. constituents), (iii) modeling of probabilities (generative vs. discriminative), (iv) core (structured predictions vs. transformation rules), and (v) type of decoding (standard phrase-based, modeled by transducers, based on parsing, graph-based). Approaches to syntax-based SMT may be grouped into three different families:

  • Bilingual Parsing (Wu, 1997; 2000; Alshawi, 1996; Alshawi et al., 2000; Melamed, 2004; Melamed et al., 2005; Chiang, 2005; 2007).

  • Tree-to-String, String-to-Tree and Tree-to-Tree Models (Yamada & Knight, 2001; Yamada, 2002; Gildea, 2003; Lin, 2004; Quirk et al., 2005; Cowan et al., 2006; Galley et al., 2006; Marcu et al., 2006).

  • Source Reordering (Collins et al., 2005; Crego et al., 2006; Li et al., 2007). Significant improvements have been reported using this technique.

  • Integration of semantic knowledge into machine translation

One natural and appealing extension of the current RBMT and SMT paradigms for machine translation is the use of richer sources of linguistic knowledge, e.g., semantics. Such an extension would presumably lead to a qualitative improvement of state-of-the-art performance. However, reasoning with explicit semantics is a really hard goal for open-text NLP, and it has generally been ignored in the development of practical systems. Our efforts in this respect will concentrate on two concrete and feasible aspects, which are described below:

  • Word Sense Disambiguation in word translation

One of the challenges in MT is that of lexical choice (or word selection) in the case of semantic ambiguity, i.e., choosing the most appropriate target-language word for a polysemous source-language word when the target language offers more than one translation option and these options have different meanings. The area that deals with this general disambiguation problem is referred to as word sense disambiguation (WSD).

Recently, there has been a growing interest in the application of discriminative learning models to word selection, and, more generally, phrase selection in the context of SMT. Discriminative models allow for taking into account a richer feature context, and probability estimates are more informed than the simple frequency counts used in SMT translation models.

Brown et al. (1991a; 1991b) were the first to suggest using dedicated WSD models in SMT. They integrated a WSD system based on mutual information into their French-to-English word-based SMT system. Some years passed until these ideas were revisited by Carpuat and Wu (2005b), who suggested integrating WSD predictions into a phrase-based SMT system. In a later work, Carpuat and Wu (2005a) analyzed the converse question, i.e., they measured the WSD performance of SMT systems. They showed that dedicated WSD models significantly outperform the WSD ability of current state-of-the-art SMT models. Simultaneously, Vickrey et al. (2005) also studied the application of context-aware discriminative word selection models based on WSD to SMT. They did not approach the full translation task but limited themselves to the blank-filling task.

Following similar approaches, Cabezas and Resnik (2005) and Carpuat et al. (2006) used WSD-based models in the context of the full translation task to aid a phrase-based SMT system. They reported a small improvement in terms of BLEU score. More recently, other authors, including ourselves, have extended these works by moving from words to phrases and allowing discriminative models to cooperate with other phrase translation models. Moderate improvements have been reported (Bangalore et al., 2007; Carpuat & Wu, 2007b; Carpuat & Wu, 2007a; Giménez & Màrquez, 2007a; Giménez & Màrquez, 2008a; Stroppa et al., 2007; Venkatapathy & Bangalore, 2007). One interesting observation by Giménez and Màrquez (2008a) is that the improvement of WSD-based phrase selection models is mainly related to the adequacy dimension, whereas for fluency there is a slight decrease. These results reveal a problem of integration: phrase selection classifiers have been trained locally, i.e., so as to maximize local phrase translation accuracy. In order to test this hypothesis and further improve the system, we plan to work on global classifiers directed towards maximizing overall translation quality instead.

Other integration strategies have been tried. Specia et al. (2008) used dedicated predictions for the reranking of n-best translations, using models based on Inductive Logic Programming (Specia et al., 2007). Chan et al. (2007) used a WSD system to provide additional features for the hierarchical phrase-based SMT system based on bilingual parsing developed by Chiang (2005; 2007). Sánchez-Martínez et al. (2007) integrated a simple lexical selector, based on source lemma co-occurrences in a very local scope, into their hybrid corpus-based/rule-based MT system.

  • Semantic Role Labeling in machine translation

In the past few years there has been an increasing interest in Semantic Role Labeling (SRL), which is becoming an important component in many NLP applications. SRL is a well-defined task with a substantial body of work and comparative evaluation. Given a sentence, the task consists of detecting basic event structures such as "who" did "what" to "whom", "when" and "where". From a linguistic point of view, this corresponds to identifying the semantic arguments filling the roles of the sentence predicates.

The identification of such event frames might have a significant impact in many Natural Language Processing (NLP) applications, including machine translation. Although the use of SRL systems in real-world applications has so far been limited, we think the potential is very high and we expect a spread of this type of analysis to all applications requiring some level of semantic interpretation. The UPC team has currently developed SRL prototypes for English, Spanish and Catalan (Surdeanu et al., 2007a; Màrquez et al., 2007; Surdeanu et al., 2008).

Pre-edition and Post-edition

At the last AMTA conference (2008) we detected a special interest in post-editing. A large number of papers mentioned this process, and two very interesting papers were devoted to it (Doyon et al., 2008; Schütz, 2008). Google is also trying to collect this kind of corpus through its public MT service.

Our experience in providing MT services shows us two main sources of translation errors that could be handled via pre-edition: spelling errors and overly long sentences in source texts. The spelling checker Xuxen (Aduriz et al., 1997) is a very effective tool in this kind of situation.

In the same way, our research groups at EHU and UPC have carried out extensive work in syntax, and especially in Clause Identification (Alegria et al., 2008; Carreras et al., 2002). Pre-edition is an interesting case in which to apply those techniques, and our previous work can be adapted to the pre-edition of texts to be automatically translated.

A post-edition system applying SMT to our RBMT output has achieved an improvement greater than 40% in the BLEU metric (Diaz de Ilarraza et al., 2008). These results suggest new lines of investigation.

However, we could not work with a large manually post-edited corpus as Simard et al. (2007a) and Isabelle et al. (2007) did. They report that a Statistical Post-Editing system using a corpus with no more than 100,000 words of post-edited translations was enough to outperform an expensive lexicon-enriched baseline RBMT system. We could not do the same because no such corpus exists for Basque or Catalan. In the new OpenMT-2 project we are planning to collect such a manually post-edited corpus. We have signed a collaboration agreement with the Wikipedia community for Basque, and we will then be able to exploit this corpus.

Advanced Evaluation

As human evaluation is very costly, MT researchers developed several automatic evaluation metrics. The commonly accepted criterion to define a plausible metric is that it must correlate well with human evaluators.

The use of N-gram based metrics has represented a significant advance in MT research in the last decade. A number of evaluation metrics have been suggested. The most successful ones have been word error rate (WER), position-independent word error rate (PER), bilingual evaluation understudy (BLEU) (Papineni et al., 2001), an improved version of BLEU by the National Institute of Standards and Technology (NIST) (Doddington, 2002), the F-measure provided by the General Text Matcher (GTM) (Melamed et al., 2003), and ROUGE (Lin & Och, 2004).
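
The difference between the two error rates named above can be shown with a short sketch for a single reference. WER is computed here as the word-level edit distance divided by the reference length. For PER, the sketch assumes one common formulation that compares the two sentences as bags of words; other variants exist, and the function names are hypothetical.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate over bags of words (one common variant)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    reordered = "on the mat the cat sat".split()
    # Same words in a different order: PER ignores the reordering, WER does not.
    print(wer(reference, reordered), per(reference, reordered))
```

Because PER ignores word order, it never exceeds WER for the same sentence pair; a large gap between the two points to reordering errors instead of lexical ones.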

We suggest a 'divide and conquer' strategy: build a set of specialized metrics, each devoted to the evaluation of a concrete partial aspect of MT quality, and then combine them into a single measure of MT quality. In OpenMT we have used the IQmt package (Giménez et al., 2005), which permits metric combination, with the singularity that there is no need to perform any training or adjustment of parameters. Besides considering the similarity of automatic translations to human references, IQmt additionally considers the distribution of similarities among human references.

In the last few years, several approaches have been suggested. Some of them are based on extending the reference lexicon. ROUGE and METEOR allow for morphological variations by applying stemming. Additionally, METEOR may perform a lookup for synonymy in WordNet (Fellbaum, 1998). Others have suggested taking advantage of paraphrasing support (Russo-Lassner et al., 2005; Zhou et al., 2006; Kauchak & Barzilay, 2006; Owczarzak et al., 2006). These are still attempts at the lexical level. At a deeper linguistic level, we may find, for instance, the work by Liu and Gildea (2005), who introduced a series of syntax-based metrics. They developed the Syntactic Tree Matching (STM) metric, based on constituency parsing, and the Head-Word Chain Matching (HWCM) metric, based on dependency parsing. Also based on syntax, Mehay and Brew (2007) suggested flattening syntactic dependencies only in the reference translations so as to compute string-based similarities without requiring syntactic parsing of the possibly ill-formed automatic candidate translations. We may find as well the work by Owczarzak et al. (2007a; 2007b), who presented a metric that compares dependency structures according to a probabilistic Lexical-Functional Grammar. Popovic and Ney (2007) proposed novel translation error measures based on WER and PER computed over different parts of speech. The "NEE" metric defined by Reeder et al. (2001) was devoted to measuring MT quality over named entities. The metrics developed by Giménez and Màrquez (2007), as part of the OpenMT project, operate over semantic role structures and discourse representations.

Apart from incorporating linguistic information, another promising research direction suggested in the last years is based on combining the scores conferred by different metrics into a single measure of quality (Corston-Oliver et al., 2001; Kulesza & Shieber, 2004; Quirk, 2004; Gamon et al., 2005; Liu & Gildea, 2007; Albrecht & Hwa, 2007a; Albrecht & Hwa, 2007b; Paul et al., 2007; Giménez & Màrquez, 2008a; Giménez & Màrquez, 2008c).

The most relevant Spanish research groups on MT

Transducens Group, Universitat d'Alacant; Speech and Language Applications and Technology (TALP), Universitat Politècnica de Catalunya; Pattern Recognition and Human Language Technology Group (PRHLT), Universitat Politècnica de València; Department of Translation and Philology, Universitat Pompeu Fabra; and Deli Group, University of Deusto.

The most relevant European research groups on MT

Statistical Machine Translation Group, University of Edinburgh, UK; National Centre for Language Technology / Centre for Next Generation Localisation (NCLT-MT), Dublin City University, Ireland; Language Technology and Computational Linguistics (STP), Uppsala Universitet, Sweden; Institute of Formal and Applied Linguistics (UFAL), Univerzita Karlova v Praze, Czech Republic; Human Language Technology and Pattern Recognition, RWTH Aachen; Human Language Technology (HLT), Fondazione Bruno Kessler, Trento, Italy; Groupe d'Étude pour la Traduction Automatique (GETA), Laboratoire d'Informatique de Grenoble, France; Dept. of Computational Linguistics and Phonetics (COLI), Universität des Saarlandes, Germany; and Centre for Language Technology (CST), Københavns Universitet, Denmark.