State of the art / Background and current state of scientific and technical knowledge

The state of the art is described below for the main innovative points of the project, in order to show how the project will advance beyond the current situation:

Collection, annotation and exploitation of multilingual corpora

As Koehn (2003) and Och (2005) stated, more data yields better translations. Our first challenge is therefore to develop multilingual resources:

  1. collecting large parallel corpora for most language pairs and in various domains,

  2. using linguistic tools to analyse and score those parallel corpora with syntactic and semantic information,

  3. collecting and experimenting with comparable corpora,

  4. collecting and applying large monolingual corpora.

Building large and representative monolingual corpora for open or restricted domains benefits the performance of the statistical translator. The figure from (Och, 2005) shows that more monolingual data (the text used to train the language model) also brings better translations. The improvements in BLEU are log-scale: doubling the training data gives a constant improvement (+0.5 %BLEU). The last addition is 218 billion words of out-of-domain web data.

However, building a monolingual corpus in an assisted way is a difficult task, mainly because of the effort needed to compile documents of different formats and sources. This problem is especially marked for non-central languages. To address it, SIGWAC (the Special Interest Group of the Association for Computational Linguistics on Web as Corpus) proposes using the Internet as a large source of documents (A. Kilgarriff, G. Grefenstette, 2003). However, new problems arise with this approach: boilerplate removal (CLEANEVAL), duplicate detection (A.Z. Broder, 2000), topic filtering (I. Leturia et al. 2008), and so on.

The other kind of corpora widely used in SMT are parallel corpora. Specifically, these corpora are used to train translation models, so their size is extremely important. The figure from (Koehn, 2003) shows how the BLEU score rises as the size of the bilingual parallel corpus grows. The improvements in BLEU are log-scale for all languages: doubling the training data gives a constant improvement (+1 %BLEU). This experiment was performed using a statistical system and the Europarl parallel corpus, 30 million words in 11 languages.
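
The log-scale behaviour described above can be illustrated with a small calculation. This is only a sketch: the function name and the assumption that the gain per doubling stays constant over the whole range are ours, and the numbers are illustrative rather than taken from the cited experiments.

```python
import math


def expected_bleu_gain(old_words: float, new_words: float,
                       gain_per_doubling: float) -> float:
    """Estimate the BLEU gain when a corpus grows from old_words to new_words.

    Assumes (hypothetically) that every doubling of the training data adds
    the same number of BLEU points, i.e. BLEU grows linearly in log2(size).
    """
    if old_words <= 0 or new_words <= 0:
        raise ValueError("corpus sizes must be positive")
    doublings = math.log2(new_words / old_words)
    return gain_per_doubling * doublings


# Growing a 30-million-word parallel corpus to 120 million words is two
# doublings, so at +1 %BLEU per doubling the estimate is about 2 points.
print(expected_bleu_gain(30e6, 120e6, gain_per_doubling=1.0))
```

The practical consequence is that each additional BLEU point costs twice as much data as the previous one, which is why the alternative sources of data discussed in this section matter.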

Parallel corpora can also be exploited for terminology extraction and for extending the dictionaries used in RBMT. They are often derived from translation memories. Unfortunately, this kind of corpus is scarce. As an alternative, some authors propose using multilingual web sites as a source for building parallel corpora automatically (P. Resnik and N.A. Smith 2003, Zhang et al 2006, Nie et al. 1999).

Nevertheless, parallel corpora are still scarce for unusual language pairs or specific domains. Thus, some authors propose exploiting comparable corpora to extract translation knowledge. They mainly focus on extracting bilingual terminology (Rapp, 1995), (Fung, 1996), which can be very valuable for updating and extending the bilingual dictionaries used in RBMT. Other authors have improved the performance of SMT systems by integrating the exploitation of comparable corpora (David Talbot 2003, Munteanu and Marcu 2003, Hewavitharana and Vogel 2008).

As with parallel and monolingual corpora, different approaches have been proposed for compiling comparable corpora: the assisted way (e.g., BNC), RSS feeds (Fairon et al. 2008), web crawling (Chakrabarti et al., 1999), and search APIs (Baroni and Bernardini, 2004). Nevertheless, the collection of comparable corpora (Talvensaari et al. 2008) is a relatively new research topic, and some problems have not been resolved yet. For example, how to obtain a high degree of comparability, which parameters determine it, and how it affects the performance of the exploitation are still open questions.

As mentioned before, terminology extraction from comparable corpora is very attractive for many reasons:

* Comparable corpora can be easily obtained, unlike parallel corpora.

* Comparable corpora are easily updated, so new terminology will be detected.

However, the precision and recall of terminology extraction from comparable corpora are still far from those achieved with parallel corpora (Melamed, 2001; Och, 2002). While studies report a precision of around 80% (taking into account the first 10-20 candidates) with comparable corpora (Fung, 1995; Rapp, 1995), the precision that can be achieved with parallel corpora is above 95% taking into account only the top candidates (Tiedemann, 1998). This is mainly because the knowledge to be inferred is more implicit. The most widely used paradigms are context similarity (Fung, 1995; Rapp, 1995) and string similarity (Al-Onaizan & Knight, 2002).

The standard algorithm for recognizing context similarity consists of three steps: modeling the contexts, translating the source contexts using a seed bilingual lexicon, and calculating the degree of similarity. The majority of approaches follow the "bag-of-words" paradigm, in which contexts are represented by weighted collections of words. Several works discuss how to determine which words make up the context of a word and how relevant they are with respect to the word in question (Petkar et al., 2007) (Gamallo 2008). Another way of representing the contexts is by using language models (Shao et al. 2004; Saralegi et al., 2008).
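
The three steps of the context-similarity algorithm can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation of any of the cited works: contexts are fixed-size word windows, raw co-occurrence counts stand in for the association weights that real systems use, similarity is the cosine, and the Spanish-English toy data and all function names are hypothetical.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=2):
    """Step 1: bag-of-words context of `target` (words within `window`)."""
    ctx = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return ctx


def translate_context(ctx, seed_lexicon):
    """Step 2: map source context words to the target language.

    Words missing from the seed lexicon are simply dropped.
    """
    translated = Counter()
    for word, weight in ctx.items():
        if word in seed_lexicon:
            translated[seed_lexicon[word]] += weight
    return translated


def cosine(a, b):
    """Step 3: cosine similarity between two weighted bags of words."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(src_tokens, src_term, tgt_tokens, candidates,
                    seed_lexicon, window=2):
    """Rank target candidates by similarity to the translated source context."""
    src_ctx = translate_context(
        context_vector(src_tokens, src_term, window), seed_lexicon)
    scored = [(cosine(src_ctx, context_vector(tgt_tokens, c, window)), c)
              for c in candidates]
    return sorted(scored, reverse=True)


# Toy comparable "corpora" (one sentence each) and a tiny seed lexicon.
source = "el banco concede un préstamo al cliente".split()
target = ("the bank grants a loan to the customer "
          "while the river flows past the tree").split()
seed = {"concede": "grants", "préstamo": "loan", "cliente": "customer"}

ranking = rank_candidates(source, "banco", target, ["bank", "tree"],
                          seed, window=3)
print(ranking)  # "bank" is ranked above "tree"
```

The sketch makes the dependence on the seed lexicon visible: context words without an entry contribute nothing, which is one reason why extraction quality from comparable corpora lags behind parallel corpora.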

Another important issue is the treatment of multiword terms (MWT). There are not many works dealing with MWT (Daille & Morin, 2005) (Morin et al. 2007), but they form an essential part of the lexicon needed to adapt an RBMT system to a specific domain.

General framework to represent annotated parallel corpus.

The Natural Language Processing development community has produced many tools for managing linguistic data in corpora, such as the data produced by tokenizers, morphosyntactic analyzers, lemmatizers, and so on. The integration of these tools has to deal with several types of incompatibilities. These problems are more prominent when we need to integrate tools for different languages developed by different groups.

A few years ago the language processing community defined a standard for effective Language Resource Management (ISO TC37/SC4), whose goal is to provide a framework for the creation, annotation and manipulation of linguistic resources and processing software [Ide & Romary 2004]. Adopting this formalism is not trivial, and there have been different attempts to develop such a framework. For example, ALEP [Simkins 1994] can be considered the first integration environment for NLP design, and GATE [Cunningham et al. 2002] is perhaps the most influential system in the area. The ATLAS system [Bird et al. 2000] and the Annotation Graphs Toolkit (AGTK) [HaeJoong et al. 2002] are implementations of the Annotation Graphs formalism and provide an architecture that facilitates the development of linguistic annotation applications. They exhibit problems when encoding some linguistic structures because they do not allow the separation of information into layers. GATE uses the TIPSTER architecture for annotation and also has some disadvantages for encoding non-continuous multiword lexical units. The NLTK framework is not able to represent ambiguities. In the Emdros framework [Ulrik 2004] it is not possible to properly represent some types of relations between elements, or classification ambiguity. We have adopted our own approach because our representation requirements are not completely fulfilled by the annotation schemes proposed in these systems.

We have developed an environment called Eulibeltz (Casillas et al., 2006), an extension of Eulia [Artola et al. 2004], which is an interface for monolingual corpora. Eulibeltz follows [Ide & Romary 2004] and the stand-off markup approach inspired by the TEI guidelines [Sperberg-McQueen & Burnard 2002] to represent the linguistic information and translation units obtained by several NLP tools. It also provides a way to represent ambiguities, non-continuous multiword lexical units, and relations between translation units of bilingual documents. Unlike the other environments, Eulibeltz lets human experts modify the annotations produced by the automatic process when they are incorrect. Eulibeltz eases the integration of linguistic tools developed for two languages, Spanish and Basque, and its interface also facilitates access to, and exploration of, the annotated documents. Our main contribution is that our annotation proposal deals with both monolingual and bilingual parallel documents, and that the software allows the automatically generated markup to be edited. At the end of the data flow it is possible to generate linguistic resources such as translation memories that contain different types of translation units and translation patterns.

MT combined/hybrid systems

Traditionally, MT was performed by rule-based machine translation (RBMT) systems, which apply a set of linguistic rules in three phases: analysis, transfer and generation. Since the late 1980s there has been much interest in exploring new corpus-based techniques: statistical text analysis (alignment, etc.), example-based machine translation (EBMT) and statistical machine translation (SMT). Nowadays, research is oriented towards hybrid systems that combine traditional linguistic rules and corpus-based approaches.

EBMT and SMT are the two main models in corpus-based MT. Both need a set of sentences in one language aligned with their translation in another. GIZA++ is a well-known tool to perform this alignment. The two models induce translation knowledge from sentence-aligned corpora, but there are significant differences regarding both the type of information learnt and how this is brought to bear in dealing with new input.

In SMT, essentially, the translation model (obtained from a parallel corpus) establishes the set of target language words (and, more recently, phrases) that are most likely to be useful in translating the source string, while the language model (obtained from a monolingual corpus) tries to assemble these words (and phrases) in the most likely target word order. Nowadays, SMT practitioners also get their systems to learn phrasal as well as lexical alignments (e.g. (Koehn et al., 2003); (Och, 2003)). Novel approaches to reordering in phrase-based statistical MT have been proposed (Kanthak et al. 2005).
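
The division of labour between the two models is commonly written as the noisy-channel decision rule. The notation is ours and does not appear in the cited works in this form: f stands for the source sentence, e for a candidate target sentence, and ê for the chosen translation.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e}\;
        \underbrace{P(f \mid e)}_{\text{translation model}}\;
        \underbrace{P(e)}_{\text{language model}}
```

The second equality follows from Bayes' rule, since P(f) does not depend on e. The translation model is estimated from the parallel corpus and the language model from the monolingual corpus, which is why both kinds of data influence translation quality.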

Moreover, in the last decade several approaches to introducing syntactic knowledge into SMT have been proposed. At the shallow parsing level, Koehn and Knight (2002), Schafer and Yarowsky (2003), and Giménez and Màrquez (2005) have proposed systems that integrate linguistic concepts such as morphosyntactic analysis (part-of-speech tags), lemmatization and shallow parsing (chunks) in the framework of SMT. Moving on to full parsing, Yamada and Knight (2001, 2002) presented a syntax-based tree-to-string probability model in which tree constituents are aligned to strings. Gildea (2003, 2004) followed and improved on this idea by working on constituency/dependency tree-to-tree alignments. Finally, others have suggested applying bilingual parsing to phrasal alignment: Wu (1997) presented a novel stochastic inversion transduction grammar formalism for bilingual language modeling of sentence pairs, and, more recently, Melamed (2004) suggested a similar approach based on multitext grammars.

In EBMT, as Somers (2003) and Hutchins (2005) stated, the essence is the matching of source language (SL) fragments from an input text against SL fragments in a database, and the extraction of the equivalent target language (TL) fragments as potential partial translations. In this light, whether the matching involves pre-compiled fragments (templates derived from the corpus), whether the fragments are derived at run-time, and whether the fragments (chunks) contain variables or not are all secondary factors, however useful they may be in distinguishing EBMT subtypes (as in the collection by Carl and Way (2003)). Input sentences may be treated as wholes, divided into fragments or even analysed as tree structures; what matters is that in transfer (matching/extraction) there is reference to the example database and not, as in RBMT, the application of rules and features for the transduction of SL structures into TL structures. Groves and Way (2005) used a set of closed-class words to segment aligned source sentences and to derive an additional set of lexical and phrasal resources. Their hybrid example-based SMT system improved on the results of pure SMT or EBMT systems.

Recently, several possible approaches have been developed to combine the RBMT, EBMT and SMT engines. Some of them will be explored in our project. In fact the third challenge we face in this project is the combination or effective hybridization of these three single paradigms.

  • Combining MT paradigms: Multi-Engine MT.

(van Zaanen and Somers, 2005), (Matusov et al., 2006) and (Macherey and Och, 2007) review a set of references on MEMT (Multi-Engine MT), including the first attempt by (Frederking and Nirenburg, 1994). All the papers on MEMT reach the same conclusion: combining the outputs results in a better translation. Most of the approaches generate a new consensus translation by combining different SMT systems that use different language models, in some cases also combining them with RBMT systems. Some of the approaches require confidence scores for each of the outputs. The reported improvement in translation quality never exceeds an 18% relative increase in BLEU score.

(Chen et al., 2007) report an 18% relative increase for in-domain evaluation and 8% for out-of-domain evaluation. They incorporate phrases (extracted from alignments of the output of one or more RBMT systems with the source texts) into the phrase table of the SMT system, and use the open-source decoder Moses to find good combinations of phrases from the SMT training data with the phrases derived from RBMT.

(Matusov et al., 2006) report a 15% relative increase in BLEU score using a consensus translation computed by voting on a confusion network. Pairwise word alignments of the original translation hypotheses were estimated for an enhanced statistical alignment model in order to capture reordering explicitly.

(Macherey and Och, 2007) presented an empirical study of how different selections of translation outputs affect translation quality in system combination. Composite translations were computed using (i) a candidate selection, (ii) a ROVER-like combination scheme, and (iii) a novel two-pass search algorithm which determines and re-orders bags of words that build the constituents of the final consensus hypothesis. All gave statistically significant relative improvements of up to 10% in BLEU score. They combined large numbers of different research systems.

(Mellebeek et al., 2006) report improvements of up to 9% in BLEU score. Their experiment is based on the recursive decomposition of the input sentence into smaller chunks, and on a selection procedure based on majority voting that finds the best translation hypothesis for each input chunk using a language model score and a confidence score assigned to each MT engine.

(Huang and Papineni, 2007) and (Rosti et al., 2007) combine the output of multiple MT systems at the word, phrase and sentence levels. They report improvements of up to 10% in BLEU score.

In the OpenMT project we made a successful first attempt at Spanish-Basque multi-engine MT (Alegria et al., 2008). We applied the system to a restricted domain (translation in public administration). We built a hierarchical strategy for combining MT engines: first EBMT (translation patterns), then SMT (if its confidence score was greater than a threshold), and then RBMT. The results of the initial automatic evaluation showed very significant improvements: a 193.55% relative increase in BLEU when comparing EBMT+SMT with the single SMT system, and a 15.08% relative increase in BLEU when comparing EBMT+SMT with the single EBMT system. However, these results were obtained using automatic metrics with only one reference. Since RBMT systems tend to be undervalued by such evaluations, a deeper evaluation is necessary (more than one reference with BLEU and NIST, and human evaluation with HTER), and we expect that the results will be even better.
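
The hierarchical strategy described above can be sketched as a simple cascade. This is a minimal illustration, not the OpenMT implementation: the engine interfaces, the function names, the toy engines and the threshold value are all assumptions made for the example.

```python
from typing import Callable, Optional, Tuple


def combine_hierarchically(
    sentence: str,
    ebmt: Callable[[str], Optional[str]],
    smt: Callable[[str], Tuple[str, float]],
    rbmt: Callable[[str], str],
    smt_threshold: float,
) -> Tuple[str, str]:
    """Return (engine name, translation) following EBMT -> SMT -> RBMT.

    Assumed interfaces (hypothetical):
      ebmt  returns a translation, or None if no translation pattern matches;
      smt   returns a translation together with a confidence score;
      rbmt  always returns a translation and acts as the fallback.
    """
    ebmt_output = ebmt(sentence)
    if ebmt_output is not None:
        return "EBMT", ebmt_output

    smt_output, confidence = smt(sentence)
    if confidence > smt_threshold:
        return "SMT", smt_output

    return "RBMT", rbmt(sentence)


# Toy stand-ins for the three engines, for illustration only.
PATTERNS = {"pattern sentence": "<EBMT output>"}


def toy_ebmt(sentence: str) -> Optional[str]:
    return PATTERNS.get(sentence)


def toy_smt(sentence: str) -> Tuple[str, float]:
    # Pretend the SMT engine is only confident about short inputs.
    confidence = 0.9 if len(sentence.split()) <= 3 else 0.2
    return "<SMT output>", confidence


def toy_rbmt(sentence: str) -> str:
    return "<RBMT output>"


for text in ["pattern sentence", "short input", "a much longer input sentence"]:
    print(text, "->", combine_hierarchically(text, toy_ebmt, toy_smt, toy_rbmt, 0.5))
```

In such a cascade the threshold controls how often the system falls back to RBMT, so in practice it would have to be tuned on held-out data for the domain at hand.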

These initial successful results encourage us to continue research along this line.

  • Combining MT paradigms: Statistical post-editing of the RBMT output.

In the experiments reported by (Simard et al., 2007a) and (Isabelle et al., 2007), the statistical post-editing (SPE) task is viewed as translation from the language of RBMT outputs into the language of their manually post-edited counterparts, so they do not use a parallel corpus created by human translation. Their RBMT system is SYSTRAN and their SMT system is PORTAGE. (Simard et al., 2007a) report a reduction in post-editing effort of up to a third compared to the output of the rule-based system (i.e., the input to the SPE), and an improvement of as much as 5 BLEU points over the direct SMT approach. (Isabelle et al., 2007) conclude that such an RBMT+SPE system appears to be an excellent way to improve the output of a vanilla RBMT system and constitutes a worthwhile alternative to costly manual adaptation efforts for such systems. Thus, an SPE system using a corpus of no more than 100,000 words of post-edited translations is enough to outperform an expensive lexicon-enriched baseline RBMT system.

The same group recognizes (Simard et al., 2007b) that this sort of training data is seldom available, and they conclude that the training data for the post-editing component does not need to be manually post-edited translations; it can even be generated from standard parallel corpora. Their new RBMT+SPE system again outperforms both the RBMT and SMT systems. The experiments show that while post-editing is more effective when little training data is available, it remains competitive with SMT translation even when larger amounts of data are available. After a linguistic analysis they conclude that the main improvement is due to lexical selection.

In (Dugast et al., 2007), the authors of SYSTRAN's RBMT system present a large improvement in BLEU score for an SPE system compared to the raw translation output. They obtain an improvement of around 10 BLEU points for German-English using the Europarl test set of WMT2007.

(Ehara, 2007) presents two experiments to compare RBMT and RBMT+SPE systems. Two different corpora are used: one is the reference translation (PAJ, Patent Abstracts of Japan), and the other is a large-scale target language corpus. In the former case RBMT+SPE wins; in the latter case RBMT wins. Evaluation is performed using NIST scores and a new evaluation measure, NMG, that counts the number of words in the longest sequence matched between the test sentence and the target language reference corpus.
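
One possible reading of the NMG description above is the length, in words, of the longest contiguous word sequence that the test sentence shares with any sentence of the reference corpus. The sketch below implements only that reading; it is not Ehara's exact definition, and the function name, the whitespace tokenization and the toy sentences are assumptions.

```python
def longest_match_length(test_sentence, corpus_sentences):
    """Length in words of the longest contiguous word sequence shared by
    the test sentence and any single sentence of the corpus.

    Uses the classic dynamic programme for the longest common substring,
    applied to word tokens instead of characters.
    """
    test = test_sentence.split()
    best = 0
    for reference in corpus_sentences:
        ref = reference.split()
        # previous[j] = length of the common run ending at the previous
        # test word and at reference word j (1-based).
        previous = [0] * (len(ref) + 1)
        for test_word in test:
            current = [0] * (len(ref) + 1)
            for j, ref_word in enumerate(ref, start=1):
                if test_word == ref_word:
                    current[j] = previous[j - 1] + 1
                    best = max(best, current[j])
            previous = current
    return best


corpus = ["a rotating shaft is provided", "the device includes sensors"]
print(longest_match_length("the device includes a rotating shaft", corpus))
```

A measure of this kind needs no reference translation of the test sentence, only a large target language corpus, which matches the second experimental setting described above.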

Finally, (Elming, 2006) works in the more general field called Automatic Post-Processing (APE). He uses transformation-based learning (TBL), a learning algorithm for extracting rules to correct MT output by means of a post-processing module. The algorithm learns from a parallel corpus of MT output and human-corrected versions of this output. The machine translations are provided by a commercial MT system, PaTrans, which is based on Eurotra. Elming reports a 4.6 point increase in BLEU score.

Within the OpenMT project, a post-editing system applying SMT to the RBMT output achieved an improvement of more than 40% in the BLEU metric (Diaz de Ilarraza et al., 2008), which suggests new lines of investigation. We performed two experiments to verify whether the improvement obtained for other languages with statistical post-editing also holds in our setting. Our experiments differ from similar works because we used a morphological component in both the RBMT and SMT translations, and because the available corpora are small. Our results are consistent with the large improvements reported for the RBMT+SPE approach on a restricted domain by (Dugast et al., 2007; Ehara, 2007; Simard et al., 2007b). We obtain a 200% improvement in BLEU score for an RBMT+SPE system built on the Matxin RBMT system when compared to the raw translation output, and 40% when compared to the SMT system. Our results are also consistent with the smaller improvement reported for more general corpora by (Ehara, 2007; Simard et al., 2007b). Thus, even with small corpora, the results of this paradigm combination were satisfactory.

  • Hybridizing MT paradigms.

Fewer works in the recent literature present a truly internal hybridization of several MT paradigms (i.e., one where the output is constructed by taking into account simultaneous information from different paradigms). Two strategies will be explored. The first is to introduce statistical knowledge to resolve ambiguities in RBMT systems. For instance, most rule-based systems lack a mechanism for deciding which translation rule must be applied when several rules are equally applicable in the same context to the same sentence span. Discriminative lexical selection techniques could provide an effective solution to this problem. Chan and Chiang (2007) have applied similar ideas to their hierarchical statistical MT system. In that system, however, rules are not manually defined but automatically induced from word alignments. The second and more general approach is to devise an inference procedure that is able to deal with multiple proposed fragment translations coming from different sources (translation tables from SMT, fragments translated by RBMT, examples from EBMT, etc.). The search for the optimal translation should coherently combine the translation candidates (of different granularities), also taking into account constraints regarding the target language. The work by Groves and Way (2005), which introduces EBMT translation pairs into an SMT system, falls into this category.

Our ambitious proposal is to investigate hybridization not as an external combination of systems, but as a genuine combination of the three paradigms within the translation architecture itself.

Integration of advanced linguistic knowledge in shallow machine translation

  • Integration of syntax into statistical machine translation

A limitation of standard phrase-based SMT systems is that reordering models are very simple. For instance, non-contiguous phrases are not allowed, long distance dependencies are not modeled, and syntactic transformations are not captured. Syntax-based approaches seek to remedy these deficiencies by explicitly taking into account syntactic knowledge. Approaches to syntax-based MT differ in several aspects: (i) side of parsing (source, target, or both sides), (ii) type of parsing (dependencies vs. constituents), (iii) modeling of probabilities (generative vs. discriminative), (iv) core (structured predictions vs. transformation rules), and (v) type of decoding (standard phrase-based, modeled by transducers, based on parsing, graph-based). Approaches to syntax-based SMT may be grouped in three different families:

  • Bilingual Parsing. The translation process is approached as a case of synchronous bilingual parsing. Derivation rules are automatically learned from parallel corpora, either annotated or unannotated (Wu, 1997; Wu, 2000; Alshawi, 1996; Alshawi et al., 2000; Melamed, 2004; Melamed et al., 2005; Chiang, 2005; Chiang, 2007).

  • Tree-to-String, String-to-Tree and Tree-to-Tree Models. These models exploit syntactic annotation, either in the source or target language or both, to estimate more informed translation and reordering models or translation rules (Yamada & Knight, 2001; Yamada, 2002; Gildea, 2003; Lin, 2004; Quirk et al., 2005; Cowan et al., 2006; Galley et al., 2006; Marcu et al., 2006).

  • Source Reordering. Another interesting approach consists in reordering the source text prior to translation using syntactic information, so that it matches the word order of the target language (Collins et al., 2005; Crego et al., 2006; Li et al., 2007). Significant improvements have been reported using this technique.

  • Integration of semantic knowledge into machine translation

One natural and appealing extension of the current RBMT and SMT paradigms for machine translation is the use of richer sources of linguistic knowledge, e.g., semantics. Such an extension would presumably lead to a qualitative improvement over state-of-the-art performance. However, reasoning with explicit semantics is a very hard goal for open-text NLP, and it has generally been ignored in the development of practical systems. We will concentrate our efforts on the two concrete and feasible aspects described below:

Word Sense Disambiguation in word translation

One of the challenges in MT is that of lexical choice (or word selection) in the case of semantic ambiguity, i.e., the choice for the most appropriate word in the target language for a polysemous word in the source language when the target language offers more than one option for the translation and these options have different meanings. The area that deals with this general disambiguation problem is referred to as word sense disambiguation (WSD). Note that, as emphasized by Hutchins and Somers (1992), monolingual WSD is different from the multilingual task, since the latter is concerned only with the ambiguities that come along in the translation from one language to another.

Recently, there has been a growing interest in the application of discriminative learning models to word selection, and, more generally, phrase selection in the context of SMT. Discriminative models allow for taking into account a richer feature context, and probability estimates are more informed than the simple frequency counts used in SMT translation models. In these systems lexical selection is addressed as a classification task. For each possible source word (or phrase) according to a given bilingual lexical inventory (e.g., the translation model), a distinct classifier is trained to predict lexical correspondences based on local context. Thus, during decoding, for every distinct instance of every source phrase a distinct context-aware translation probability distribution is potentially available.
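
The idea of training one context-aware classifier per source word can be sketched as follows. This is a toy illustration under our own assumptions, not the method of any of the cited works: the learner is a plain multiclass perceptron (chosen only because it is a simple discriminative model), the features are the neighbouring words within a small window, and the Spanish-English examples, class names and function names are hypothetical.

```python
from collections import defaultdict


class PerceptronSelector:
    """Multiclass perceptron that picks a translation for ONE source word."""

    def __init__(self, translations):
        self.translations = list(translations)
        # One weight vector (feature -> weight) per candidate translation.
        self.weights = {t: defaultdict(float) for t in self.translations}

    def score(self, features, translation):
        w = self.weights[translation]
        return sum(w[f] for f in features)

    def predict(self, features):
        # Ties are broken in favour of the first listed translation.
        return max(self.translations, key=lambda t: self.score(features, t))

    def train(self, examples, epochs=5):
        for _ in range(epochs):
            for features, gold in examples:
                guess = self.predict(features)
                if guess != gold:
                    # Standard perceptron update: reward gold, punish guess.
                    for f in features:
                        self.weights[gold][f] += 1.0
                        self.weights[guess][f] -= 1.0


def context_features(tokens, index, window=2):
    """Positional features for the words around tokens[index]."""
    features = []
    for offset in range(-window, window + 1):
        j = index + offset
        if offset != 0 and 0 <= j < len(tokens):
            features.append(f"w[{offset:+d}]={tokens[j]}")
    return features


def example(sentence, word, translation):
    tokens = sentence.split()
    return context_features(tokens, tokens.index(word)), translation


# One classifier per ambiguous source word; here only "banco".
selectors = {"banco": PerceptronSelector(["bank", "bench"])}
selectors["banco"].train([
    example("deposité dinero en el banco central", "banco", "bank"),
    example("me senté en el banco del parque", "banco", "bench"),
    example("el banco concede préstamos", "banco", "bank"),
    example("un banco de madera en el jardín", "banco", "bench"),
])

for sentence in ["el banco concede créditos", "un banco de piedra"]:
    tokens = sentence.split()
    features = context_features(tokens, tokens.index("banco"))
    print(sentence, "->", selectors["banco"].predict(features))
```

In an SMT setting, the classifier scores would not replace the decoder's choice but would be turned into a context-dependent translation probability and added as one more feature.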

Brown et al. (1991a; 1991b) were the first to suggest using dedicated WSD models in SMT. In a pilot experiment, they integrated a WSD system based on mutual information into their French-to-English word-based SMT system. Results were limited to the case of binary disambiguation and to a reduced set of very common words. Some years passed until these ideas were revived by Carpuat and Wu (2005b), who suggested integrating WSD predictions into a phrase-based SMT system. In a first approach, they did so in a hard manner, either for decoding, by constraining the set of acceptable word translation candidates, or for post-processing the SMT system output, by directly replacing the translation of each selected word with the WSD system prediction. However, they did not manage to improve MT quality. They encountered several problems inherent to the SMT architecture. In particular, they described what they called the language model effect in SMT: "The lexical choices are made in a way that heavily prefers phrasal cohesion in the output target sentence, as scored by the language model". In a later work, Carpuat and Wu (2005a) analyzed the converse question, i.e., they measured the WSD performance of SMT systems. They showed that dedicated WSD models significantly outperform the WSD ability of current state-of-the-art SMT models. Consequently, SMT should benefit from WSD predictions. Simultaneously, Vickrey et al. (2005) also studied the application of context-aware discriminative word selection models based on WSD to SMT. They did not encounter the language model effect because they approached the task in a soft way, i.e., allowing WSD-based probabilities to interact with other models during decoding. However, they did not approach the full translation task but limited themselves to the blank-filling task, a simplified version in which the target context surrounding the word translation is available.

Following similar approaches, Cabezas and Resnik (2005) and Carpuat et al. (2006) used WSD-based models in the context of the full translation task to aid a phrase-based SMT system. They reported a small improvement in terms of BLEU score, possibly because they did not work with phrases but limited themselves to single words. Besides, they did not allow WSD-based predictions to interact with other translation probabilities. More recently, other authors, including ourselves, have extended these works by moving from words to phrases and allowing discriminative models to cooperate with other phrase translation models as an additional feature. Moderate improvements have been reported (Bangalore et al., 2007; Carpuat & Wu, 2007b; Carpuat & Wu, 2007a; Giménez & Màrquez, 2007a; Giménez & Màrquez, 2008a; Stroppa et al., 2007; Venkatapathy & Bangalore, 2007). All these works were developed at the same time and were presented at around the same dates with very similar conclusions. One interesting observation by Giménez and Màrquez (2008a) is that the improvement of WSD-based phrase selection models is mainly related to the adequacy dimension, whereas for fluency there is a slight decrease. These results reveal a problem of integration: phrase selection classifiers have been trained locally, i.e., so as to maximize local phrase translation accuracy. In order to test this hypothesis and further improve the system, we plan to work on global classifiers directed towards maximizing overall translation quality instead.

Other integration strategies have been tried. For instance, Specia et al. (2008) used dedicated predictions for the reranking of n-best translations. Their models were based on Inductive Logic Programming (Specia et al., 2007). They limited themselves to a small set of words from different grammatical categories. A very significant BLEU improvement was reported. In a different approach, Chan et al. (2007) used a WSD system to provide additional features for the hierarchical phrase-based SMT system based on bilingual parsing developed by Chiang (2005; 2007). These features were intended to give a bigger weight to the application of rules that are consistent with WSD predictions. A moderate but significant BLEU improvement was reported. Finally, Sánchez-Martínez et al. (2007) integrated a simple lexical selector, based on source lemma co-occurrences in a very local scope, into their hybrid corpus-based/rule-based MT system.

Semantic Role Labeling in machine translation

In the past few years there has been an increasing interest in Semantic Role Labeling (SRL), which is becoming an important component in many NLP applications. SRL is a well-defined task with a substantial body of work and comparative evaluation. Given a sentence, the task consists of detecting basic event structures such as "who" did "what" to "whom", "when" and "where". From a linguistic point of view, this corresponds to identifying the semantic arguments filling the roles of the sentence predicates.

The identification of such event frames might have a significant impact in many Natural Language Processing (NLP) applications, including machine translation. Although the use of SRL systems in real-world applications has so far been limited, we think the potential is very high and we expect a spread of this type of analysis to all applications requiring some level of semantic interpretation.

OpenMT-2 will take advantage of the previous work on SRL carried out by the UPC team researchers. They conducted two international evaluation exercises for SRL in the context of the CoNLL-2004 and 2005 shared tasks (Carreras and Màrquez, 2004; 2005) and, more recently, two additional evaluation exercises at CoNLL-2008 and 2009 on the joint extraction of syntactic and semantic dependencies for multiple languages (Surdeanu et al., 2008), that is, a combination of syntactic dependency parsing and SRL. The second evaluation is currently underway. The UPC team has developed SRL prototypes for English, Spanish and Catalan (Surdeanu et al., 2007a; Màrquez et al., 2007; Surdeanu et al., 2008). Additionally, as explained in the following section, the UPC team has worked on MT evaluation metrics based on semantic roles (Giménez 2008). See the tasks in WP3 for an outlook on the possibilities of applying SRL technology in machine translation.

  • [I have not checked the references below]

Pre-edition and Post-edition

Our experience in providing MT services has shown us two main sources of translation errors that could be handled via pre-edition: spelling errors and overly long sentences in source texts. We plan to draw on the partners' know-how in spelling correction and syntactic parsing to obtain better results in automatic translation. Because of the late standardization of Basque, and because today's adult speakers did not learn it at school, the number of misspelled words in MT input texts is relatively high. The spelling checker Xuxen (Aduriz et al., 1997) is a very effective tool in this kind of situation, giving people more confidence in the text they are writing. We think it will be very useful to integrate it into a pre-edition module of the MT system.

In the same way, we are designing a syntax component for the pre-edition module that will generate, or help the writer to generate, shorter sentences, because translation quality drops significantly as the number of words increases. This task has to be based on syntax. Our research groups at EHU and UPC have carried out extensive work on syntax, and especially on Clause Identification (Alegria et al., 2008; Carreras et al., 2002). The task is an interesting case for applying those techniques, and our previous work can be adapted to the pre-edition of texts to be automatically translated.

On the other hand, at the last AMTA conference (2008) we detected a special interest in post-editing. A large number of papers mentioned this process, and two very interesting papers were devoted to it (Doyon et al., 2008; Schütz, 2008). Google is also trying to collect this kind of corpus through its MT service.

Based on the collaborative philosophy of Web 2.0, we want to create a network community that involves human translators in the creation of a significant post-edition corpus. We believe that work along these new lines may lead to a qualitative jump in the quality of machine translation, even for languages without huge parallel corpora.

A post-edition system applying SMT to the RBMT output has achieved an improvement of more than 40% in the BLEU metric (Diaz de Ilarraza et al., 2008). These results suggest new lines of investigation. We performed two experiments to verify the improvement obtained for other languages by using statistical post-editing.

However, unlike Simard et al. (2007a) and Isabelle et al. (2007), we could not work with manually post-edited corpora. In their work, a Statistical Post-Editing system using a corpus of no more than 100,000 words of post-edited translations was enough to outperform an expensive lexicon-enriched baseline RBMT system. We could not do the same because no corpus of that size exists for Basque or Catalan. In the new OpenMT-2 project we therefore plan to collect and exploit such a manually post-edited corpus. We have signed a collaboration agreement with the Wikipedia community for Basque: we will provide them with our on-line MT systems, enhanced with post-editing facilities, and they will promote the use of these translation and post-editing tools among their collaborators when creating Wikipedia content. At the same time, this will automatically enrich the post-edition corpora we need. This alliance with the Wikipedia community is very significant for further uses because, for each post-edited translation, we will have not only the source text, the automatically translated text and the post-edited text, but also its restricted domain, which will be deduced automatically from the corresponding Wikipedia entry.
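
As a rough illustration of what one entry of such a post-edition corpus could look like, the sketch below stores the four pieces of information mentioned above and derives training pairs for a statistical post-editing system (MT output as input, post-edited text as target). The class name PostEditRecord, its field names, the helper spe_training_pairs and the placeholder strings are assumptions for illustration; the project's actual data format is not specified here.

```python
# Illustrative sketch only: a hypothetical record layout for a
# post-edition corpus and a helper that extracts training pairs for
# statistical post-editing. Names and sample strings are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class PostEditRecord:
    source: str        # original source-language text
    mt_output: str     # automatic translation shown to the editor
    post_edited: str   # translation after manual correction
    domain: str        # restricted domain, e.g. derived from the wiki entry


def spe_training_pairs(records, domain=None):
    """Return (mt_output, post_edited) pairs, optionally for one domain."""
    return [
        (r.mt_output, r.post_edited)
        for r in records
        if domain is None or r.domain == domain
    ]


records = [
    PostEditRecord("source text 1", "raw MT 1", "corrected MT 1", "biology"),
    PostEditRecord("source text 2", "raw MT 2", "corrected MT 2", "history"),
]
print(spe_training_pairs(records, domain="biology"))
```

Keeping the domain with each record would make it possible to train or adapt a post-editing model on a single restricted domain.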

Advanced Evaluation

Because human evaluation is very costly, MT researchers have developed several automatic evaluation metrics. The commonly accepted criterion that defines a plausible evaluation metric is that it must correlate well with human evaluators.

The use of N-gram based metrics in the context of system development has represented a significant advance in MT research in the last decade. A number of evaluation metrics have been suggested. The most successful have been word error rate (WER), position-independent word error rate (PER), bilingual evaluation understudy (BLEU) (Papineni et al., 2001), an improved version of BLEU by the National Institute of Standards and Technology (NIST) (Doddington, 2002), the F-measure provided by the General Text Matcher (GTM) (Melamed et al., 2003), and ROUGE (Lin & Och, 2004).
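
The two simplest metrics in this list can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: a single reference, whitespace tokenization, WER computed as word-level edit distance divided by reference length, and PER computed with one common bag-of-words formulation. The function names are our own and do not correspond to any particular evaluation toolkit.

```python
# Illustrative sketch only: single-reference WER and PER with whitespace
# tokenization. The PER formula is one common formulation, assumed here.
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: word order is ignored."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


reference = "the cat sat on the mat"
print(wer(reference, "the cat sat on a mat"), per(reference, "the cat sat on a mat"))
```

A hypothesis that contains exactly the reference words in a different order gets a PER of zero but a non-zero WER, which is the difference the two metrics are meant to capture.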

Without a doubt, building a metric that is able to capture all the linguistic aspects that distinguish `correct' translations from `incorrect' ones is a very difficult task. We approach this challenge by following a `divide and conquer' strategy. We propose building a set of specialized metrics, each one devoted to the evaluation of a concrete partial aspect of MT quality. The question then is how to combine a set of metrics into a single measure of MT quality.

In OpenMT we have used the IQMT package (Giménez et al., 2005), which permits metric combination, with the particularity that there is no need to perform any training or adjustment of parameters. Besides considering the similarity of automatic translations to human references, IQMT additionally considers the distribution of similarities among human references.

However, having reached a certain degree of maturity, current MT technology now requires more sophisticated metrics. In the last few years, several approaches have been suggested. Some of them are based on extending the reference lexicon. For instance, ROUGE and METEOR allow for morphological variations by applying stemming. Additionally, METEOR may perform a lookup for synonymy in WordNet (Fellbaum, 1998). Others have suggested taking advantage of paraphrasing support (Russo-Lassner et al., 2005; Zhou et al., 2006; Kauchak & Barzilay, 2006; Owczarzak et al., 2006). These are still attempts at the lexical level. At a deeper linguistic level, we find, for instance, the work by Liu and Gildea (2005), who introduced a series of syntax-based metrics. They developed the Syntactic Tree Matching (STM) metric, based on constituency parsing, and the Head-Word Chain Matching (HWCM) metric, based on dependency parsing. Also based on syntax, Mehay and Brew (2007) suggested flattening syntactic dependencies only in the reference translations so as to compute string-based similarities without requiring syntactic parsing of the possibly ill-formed automatic candidate translations. There is also the work by Owczarzak et al. (2007a; 2007b), who presented a metric that compares dependency structures according to a probabilistic Lexical-Functional Grammar. They used paraphrases as well. Their metric obtains very competitive results, especially as a fluency predictor. Other authors have designed metrics based on shallow-syntactic information. For instance, Popovic and Ney (2007) proposed a novel approach to analyzing translation errors based on WER and PER measures computed over different parts of speech. At the semantic level, we find the 'NEE' metric defined by Reeder et al. (2001), which was devoted to measuring MT quality over named entities. We also find the metrics developed by Giménez and Màrquez (2007), as part of the OpenMT project, which operate over semantic role structures and discourse representations.

Apart from incorporating linguistic information, another promising research direction suggested in recent years is based on combining the scores conferred by different metrics into a single measure of quality (Corston-Oliver et al., 2001; Kulesza & Shieber, 2004; Quirk, 2004; Gamon et al., 2005; Liu & Gildea, 2007; Albrecht & Hwa, 2007a; Albrecht & Hwa, 2007b; Paul et al., 2007; Giménez & Màrquez, 2008a; Giménez & Màrquez, 2008c). This solution requires two important ingredients. First, the combination scheme, i.e., how to combine several metric scores into a single score. Second, the meta-evaluation criterion, i.e., how to evaluate the quality of a metric combination.
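
The two ingredients can be illustrated with a deliberately simple sketch. Here the combination scheme is a uniform average of metric scores (assumed to lie in [0, 1], higher meaning better), and the meta-evaluation criterion is the Pearson correlation between combined scores and human assessments. Both choices, the function names and the toy figures are assumptions for illustration; they are not the scheme used by IQMT or by any of the cited works.

```python
# Illustrative sketch only: uniform metric combination (an assumed
# combination scheme) and Pearson correlation with human scores (an
# assumed meta-evaluation criterion). All data below is invented.
from math import sqrt


def combine_uniform(metric_scores):
    """Average the scores of several metrics for one translation."""
    return sum(metric_scores.values()) / len(metric_scores)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data: per-translation metric scores and human adequacy judgments.
metric_scores = [
    {"lexical": 0.2, "syntactic": 0.4},
    {"lexical": 0.4, "syntactic": 0.6},
    {"lexical": 0.6, "syntactic": 0.8},
]
human_scores = [1.0, 2.0, 3.0]

combined = [combine_uniform(scores) for scores in metric_scores]
print(pearson(combined, human_scores))
```

Under this criterion, one combination scheme would be preferred over another if its combined scores correlate more strongly with the human assessments on held-out data.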

Main Spanish research groups working on MT

Traducens Group, Universitat d'Alacant; Speech and Language Applications and Technology (TALP), Universitat Politècnica de Catalunya; Pattern Recognition and Human Language Technology Group (PRHLT), Universitat Politècnica de València; Department of Translation and Philology, Universitat Pompeu Fabra (UPF); and Deli Group, University of Deusto.

Other main European research groups working on MT

Statistical Machine Translation Group, University of Edinburgh, UK; National Centre for Language Technology / Centre for Next Generation Localisation (NCLT-MT), Dublin City University, Ireland; Language Technology and Computational Linguistics (STP), Uppsala Universitet, Sweden; Institute of Formal and Applied Linguistics (UFAL), Univerzita Karlova v Praze, Czech Republic; Human Language Technology and Pattern Recognition, RWTH Aachen; Human Language Technology (HLT), Fondazione Bruno Kessler, Trento, Italy; Groupe d'Étude pour la Traduction Automatique (GETA), Laboratoire d'Informatique de Grenoble, France; Dept. of Computational Linguistics and Phonetics (COLI), Universität des Saarlandes, Germany; and Centre for Language Technology (CST), Københavns Universitet, Denmark.