State of the art

Collection, annotation and exploitation of multilingual corpora

As Koehn (2003) and Och (2005) stated, more data carries better translations. So our first challenge is to develop multilingual resources (collecting large monolingual, parallel and comparable corpora) and a general framework to represent them. Building large and representative monolingual corpora for open or restricted domains helps improve the performance of statistical MT systems. Och (2005) shows that bigger monolingual data (the source used to infer the language model) implies better translations. He also shows a log-scale improvement on BLEU: doubling the training data gives a constant improvement (+0.5% BLEU). However, building a monolingual corpus in an assisted way is a difficult task, mainly due to the effort needed to compile documents of different formats and sources. This problem is especially relevant for non-central languages. To face this problem, SIGWAC (the Special Interest Group of the ACL on Web as Corpus) proposes to use the Internet as a big source of documents (Kilgarriff and Grefenstette, 2003). However, new problems arise in this approach: boilerplate removal (CLEANEVAL), duplicate detection (Broder, 2000), topic filtering (Leturia et al., 2008), etc.

The other type of corpora widely used in SMT are parallel corpora. Specifically, these corpora are used to train translation models, so their size is extremely important. Koehn (2003) shows how the BLEU score rises when the size of the bilingual parallel corpus increases. Again, log-scale improvements on BLEU are shown for all languages; here, doubling the training data gives a constant improvement of +1% BLEU. This experiment was performed using a statistical system and the Europarl parallel corpus (30 million words in 11 languages, http://www.statmt.org/europarl/). Parallel corpora can also be exploited for terminology extraction and for extending dictionaries used in RBMT. They are often derived from translation memories. Unfortunately, this kind of corpus is scarce. As an alternative, some authors propose to use multilingual web sites as a source to build parallel corpora automatically (Resnik and Smith, 2003; Zhang et al., 2006; Cheng and Nie, 2000).

Nevertheless, parallel corpora are still scarce for unusual pairs of languages or specific domains. Thus, some authors propose to exploit comparable corpora to extract translation knowledge. They are mainly focused on extracting bilingual terminology (Rapp, 1995; Fung, 1996), which can be very valuable for updating and extending the bilingual dictionaries used in RBMT. Other authors have improved SMT systems' performance by integrating comparable corpora exploitation (Talbot, 2003; Munteanu et al., 2004; Hewavitharana and Vogel, 2008; Snover, 2008). Similarly to the problem of compiling parallel and monolingual corpora, different approaches have been proposed to compile comparable corpora: assisted compilation (e.g., BNC), RSS feeds (Fairon et al., 2008), web crawling (Chakrabarti et al., 1999), and search APIs (Baroni and Bernardini, 2004). Nevertheless, the collection of comparable corpora (Talvensaari et al., 2008) is a relatively new research topic, and some problems have not been resolved yet. For example, obtaining a high degree of comparability, the parameters to determine it, and its effect on the performance of the exploitation are still open questions.
However, it is important to say that the precision and recall of terminology extraction from comparable corpora are still far from those achieved with parallel corpora (Melamed, 2001; Och, 2002). While studies report around 80% precision (taking into account the first 10-20 candidates) with comparable corpora (Fung, 1995; Rapp, 1995), the precision that can be achieved with parallel corpora is above 95% taking into account only the top candidates (Tiedemann, 1998). This is mainly due to the more implicit nature of the knowledge to be inferred. The most used paradigms are context similarity (Fung, 1995; Rapp, 1995) and string similarity (Al-Onaizan and Knight, 2002). Several works discuss how to determine which words make up the context of a word and what relevance they have with respect to the word in question (Petkar et al., 2007; Gamallo, 2008). Another way of representing the contexts is by using language models (Shao et al., 2004; Saralegi et al., 2008). Another important issue is the treatment of Multiword Terms (MWT). There are not many works dealing with MWT (Daille and Morin, 2005; Morin et al., 2007), but they represent an essential part of the lexicon when adapting an RBMT system to a specific domain.
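As an illustration of the context-similarity paradigm referred to above, the sketch below (a toy with an invented seed dictionary and invented corpora, not any of the cited systems) builds the context vector of a source term, translates it word by word with the seed dictionary, and ranks candidate target terms by cosine similarity, roughly in the spirit of Rapp (1995) and Fung (1995).

```python
import math
from collections import Counter

# Toy sketch of context-similarity bilingual term extraction from comparable
# corpora: the context vector of a source term is translated with a small
# seed dictionary and compared, via cosine similarity, with the context
# vectors of candidate target terms. Corpora and dictionary are invented.

def context_vector(corpus_sentences, term, window=3):
    """Count the words co-occurring with `term` within a small window."""
    vector = Counter()
    for sentence in corpus_sentences:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == term:
                for c in words[max(0, i - window):i] + words[i + 1:i + 1 + window]:
                    vector[c] += 1
    return vector

def cosine(v1, v2):
    dot = sum(v1[k] * v2[k] for k in v1)
    norm = math.sqrt(sum(x * x for x in v1.values())) * \
           math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

def rank_translations(src_vector, seed_dict, tgt_corpus, candidates):
    """Translate the source context vector word by word and rank candidates."""
    translated = Counter()
    for word, count in src_vector.items():
        if word in seed_dict:
            translated[seed_dict[word]] += count
    scored = [(cosine(translated, context_vector(tgt_corpus, c)), c)
              for c in candidates]
    return sorted(scored, reverse=True)
```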
MT combined/hybrid systems

Traditionally, MT was performed via rule-based machine translation (RBMT) systems that worked by applying a set of linguistic rules in three phases: analysis, transfer and generation. Since the late 80s there has been much interest in exploring new techniques in corpus-based approaches: statistical text analysis (alignment, etc.), example-based machine translation (EBMT) and statistical machine translation (SMT). Nowadays, research is oriented towards hybrid systems, which combine traditional linguistic rules and corpus-based techniques.

EBMT and SMT are the two main models in corpus-based MT.
Both need a set of sentences in one language aligned with their
translation in another; GIZA++ is a well-known tool for performing
word alignment over such corpora. The two models induce translation knowledge from
sentence-aligned corpora, but there are significant differences
regarding both the type of information learnt and how this is brought
to bear in dealing with new input. In
SMT, essentially, the translation model (obtained from parallel
corpus) establishes the set of target language words (and more
recently, phrases), which are most likely to be useful in translating
the source string, while the language model (obtained from
monolingual corpus) tries to assemble these words (and phrases) in
the most probable target word order. Nowadays, however, SMT
practitioners also get their systems to learn phrasal as well as
lexical alignments, e.g., (Koehn et al., 2003; Och, 2003). Novel
approaches to reordering in phrase-based statistical MT have been
proposed (Kanthak et al., 2005).
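To make the division of labour between the two models concrete, the following toy sketch (invented probabilities, not real estimates) scores candidate translations as a weighted combination of a translation-model log-probability and a bigram language-model log-probability, so that the language model prefers the more fluent target word order.

```python
import math

# Minimal sketch of phrase-based SMT scoring: a translation-model score says
# how likely the target words/phrases are given the source, and a bigram
# language model prefers a fluent target word order. All numbers are toy values.

def lm_logprob(words, bigram_lm, floor=1e-6):
    """Log-probability of a word sequence under a toy bigram model."""
    total = 0.0
    for prev, cur in zip(["<s>"] + words, words + ["</s>"]):
        total += math.log(bigram_lm.get((prev, cur), floor))
    return total

def score(candidate, tm_logprob, bigram_lm, w_tm=1.0, w_lm=1.0):
    return w_tm * tm_logprob + w_lm * lm_logprob(candidate.split(), bigram_lm)

# Hypothetical bigram probabilities for the target language.
bigram_lm = {
    ("<s>", "the"): 0.3, ("the", "white"): 0.1, ("white", "house"): 0.2,
    ("house", "is"): 0.1, ("is", "white"): 0.15, ("the", "house"): 0.2,
    ("is", "</s>"): 0.2, ("white", "</s>"): 0.2,
}

# Two candidates with the same (toy) translation-model score; the LM breaks the tie.
candidates = {"the white house is": math.log(0.4),
              "the house is white": math.log(0.4)}
best = max(candidates, key=lambda c: score(c, candidates[c], bigram_lm))
print(best)  # with these toy numbers the LM favours "the house is white"
```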
Recently, several possible approaches have been developed to combine RBMT, EBMT and SMT engines (some of them are being explored in our current OpenMT project):

Combining MT paradigms: Multi-Engine MT. van Zaanen and Somers (2005), Matusov et al. (2006), and Macherey and Och (2007) review a set of references about MEMT (Multi-Engine MT), including the first attempt by Frederking and Nirenburg (1994). Recent works (Chen et al., 2007; Matusov et al., 2006; Macherey and Och, 2007; Mellebeek et al., 2006; Huang and Papineni, 2007; Rosti et al., 2007) report improvements of up to 18% in BLEU score.
Combining MT paradigms: Statistical post-edition (SPE) on the RBMT output. In (Simard et al., 2007a) and (Isabelle et al., 2007) the SPE task is viewed as translation from the language of RBMT outputs into the language of their manually post-edited counterparts, so they do not use a parallel corpus created from human translation. They report a 5 BLEU point improvement over the direct SMT approach, and conclude that such an RBMT+SPE system appears to be an excellent way to improve the output of a vanilla RBMT system and constitutes a worthwhile alternative to costly manual adaptation efforts for such systems. An SPE system using a corpus with no more than 100,000 words of post-edited translations is enough to outperform an expensive lexicon-enriched baseline RBMT system. The same group (Simard et al., 2007b) show that while post-editing is more effective when little training data is available, it remains competitive with SMT translation even with larger amounts of data. Similar approaches are reported by Dugast et al. (2007) and Elming (2006). Within the OpenMT project, a post-edition system has achieved an improvement of more than 40% in the BLEU metric (Díaz de Ilarraza et al., 2008).
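A minimal sketch of how an SPE training corpus is assembled follows; the file names and the rbmt_translate function are hypothetical, and the resulting parallel file pair would then be fed to a standard phrase-based SMT training pipeline.

```python
# Minimal sketch (hypothetical file names and RBMT interface) of assembling a
# statistical post-edition (SPE) training corpus: the "source" side is the raw
# RBMT output and the "target" side is its human post-edition.

def build_spe_corpus(source_sentences, post_edited, rbmt_translate,
                     src_path="spe.rbmt", tgt_path="spe.postedited"):
    """Write the RBMT-output / post-edition parallel corpus to disk."""
    assert len(source_sentences) == len(post_edited)
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for sentence, reference in zip(source_sentences, post_edited):
            src.write(rbmt_translate(sentence) + "\n")  # raw RBMT output
            tgt.write(reference + "\n")                 # manually post-edited text
```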
Hybridizing MT paradigms. Fewer works in the recent literature present a truly internal hybridization of several MT paradigms (i.e., where the output is constructed by taking into account simultaneous information from different paradigms). A couple of strategies to be explored are: 1) introducing statistical knowledge to resolve ambiguities in RBMT systems. For instance, most rule-based systems lack a mechanism for deciding which translation rule must be applied when several rules are equally applicable in the same context to the same sentence span. Discriminative lexical selection techniques could provide an effective solution to this problem. Chan and Chiang (2007) have applied similar ideas to their hierarchical statistical MT system; in this system, however, rules are not manually defined but automatically induced from word alignments. 2) A second and more general approach would be to devise an inference procedure able to deal with multiple proposed fragment translations coming from different sources (translation tables from SMT, fragments translated by RBMT, examples from EBMT, etc.). The search for the optimal translation should coherently combine the translation candidates (of different granularities), also taking into account constraints regarding the target language, as in the sketch below. The works by Groves and Way (2005) introducing EBMT translation pairs into an SMT system fall into this category.
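The following sketch (invented fragments and scores) illustrates the second strategy in its simplest form: candidates for contiguous source spans, coming from different paradigms, are combined by a dynamic program that selects the best-scoring sequence of fragments covering the whole source sentence; a full system would additionally score the concatenated output with a target language model.

```python
# Minimal sketch of multi-source fragment combination: candidate translations
# for contiguous source spans may come from an SMT phrase table, RBMT
# fragments or EBMT examples; a simple dynamic program picks the best
# sequence of fragments covering the whole source sentence.

def combine_fragments(n_words, candidates):
    """candidates: dict mapping (start, end) spans to a list of
    (target_string, score) pairs from any MT paradigm. Returns the best
    scoring translation covering source positions 0..n_words."""
    best = {0: (0.0, [])}                       # position -> (score, fragments)
    for end in range(1, n_words + 1):
        for start in range(end):
            if start not in best or (start, end) not in candidates:
                continue
            prev_score, prev_frags = best[start]
            for target, score in candidates[(start, end)]:
                total = prev_score + score
                if end not in best or total > best[end][0]:
                    best[end] = (total, prev_frags + [target])
    return " ".join(best[n_words][1]) if n_words in best else None

# Hypothetical candidates for a 3-word source sentence, mixing paradigms.
candidates = {
    (0, 2): [("the white", 1.2)],          # from the SMT phrase table
    (0, 3): [("white house", 0.9)],        # from an EBMT example
    (2, 3): [("house", 0.8)],              # from the RBMT lexicon
}
print(combine_fragments(3, candidates))    # "the white house"
```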
Integration of advanced linguistic knowledge in shallow machine translation

Integration of syntax into statistical machine translation
A
limitation of standard phrase-based SMT systems is that reordering
models are very simple. For instance, non-contiguous phrases are not
allowed, long distance dependencies are not modeled, and syntactic
transformations are not captured. Syntax-based approaches seek to
remedy these deficiencies by explicitly taking into account syntactic
knowledge. Approaches to syntax-based MT differ in several aspects:
(i) side of parsing (source, target, or both sides), (ii) type of
parsing (dependencies vs. constituents), (iii) modeling of
probabilities (generative vs. discriminative), (iv) core (structured
predictions vs. transformation rules), and (v) type of decoding
(standard phrase-based, modeled by transducers, based on parsing,
graph-based). Approaches to syntax-based SMT may be grouped into three different families: Bilingual Parsing (Wu, 1997; 2000; Alshawi, 1996; Alshawi et al., 2000; Melamed, 2004; Melamed et al., 2005; Chiang, 2005; 2007); Tree-to-String, String-to-Tree and Tree-to-Tree Models (Yamada & Knight, 2001; Yamada, 2002; Gildea, 2003; Lin, 2004; Quirk et al., 2005; Cowan et al., 2006; Galley et al., 2006; Marcu et al., 2006); and Source Reordering (Collins et al., 2005; Crego et al., 2006; Li et al., 2007). Significant improvements have been reported using this last technique; a simple illustration of source reordering is sketched below.
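As a simplified illustration of the source reordering family, the sketch below applies a single hand-written rule over POS-tagged source tokens so that the source sentence resembles target-language word order before being passed to a standard phrase-based decoder; real systems derive such rules from syntactic parses or learn them automatically, and the tags, rule and example (English glosses standing in for source words) are purely illustrative.

```python
# Toy source-reordering rule: move a clause-final verb right after the subject
# (crude SOV -> SVO transformation) before handing the sentence to the decoder.

def reorder_sov_to_svo(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs for one source clause."""
    if not tagged_tokens or tagged_tokens[-1][1] != "VERB":
        return [w for w, _ in tagged_tokens]
    verb = tagged_tokens[-1][0]
    rest = [w for w, _ in tagged_tokens[:-1]]
    # Insert the verb after the first noun (a crude guess at the subject head).
    for i, (_, pos) in enumerate(tagged_tokens[:-1]):
        if pos == "NOUN":
            return rest[:i + 1] + [verb] + rest[i + 1:]
    return rest + [verb]

# Toy SOV clause glossed in English: "the child an apple eats"
clause = [("the", "DET"), ("child", "NOUN"), ("an", "DET"),
          ("apple", "NOUN"), ("eats", "VERB")]
print(" ".join(reorder_sov_to_svo(clause)))  # "the child eats an apple"
```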
Integration of semantic knowledge into machine translation
One
natural and appealing extension of the current RBMT and SMT paradigms
for machine translation is the use of richer sources of linguistic
knowledge, e.g., semantics. Such an extension would presumably lead
to a qualitative improvement of state-of-the-art performance.
However, reasoning with explicit semantics is a really hard goal for open-text NLP, and it has been generally ignored in the development of practical systems. Two concrete and feasible aspects, described below, will concentrate our efforts in this respect:
Word Sense Disambiguation in word translation
One of the challenges in MT is that of lexical choice
(or word selection) in the case of semantic ambiguity, i.e., the
choice for the most appropriate word in the target language for a
polysemous word in the source language when the target language
offers more than one option for the translation and these options
have different meanings. The area that deals with this general
disambiguation problem is referred to as word sense disambiguation
(WSD).
Recently, there has been a growing interest in the
application of discriminative learning models to word selection, and,
more generally, phrase selection in the context of SMT.
Discriminative models allow for taking into account a richer feature
context, and probability estimates are more informed than the simple
frequency counts used in SMT translation models.
Brown et al. (1991a; 1991b) were the first to suggest
using dedicated WSD models in SMT. They integrated a WSD system based
on mutual information into their French-to-English word-based SMT
system. Some years passed until these ideas were recovered by Carpuat
and Wu (2005b), who suggested integrating WSD predictions into a
phrase-based SMT system. In a later work, Carpuat and Wu (2005a)
analyzed the converse question, i.e., they measured the WSD
performance of SMT systems. They showed that dedicated WSD models
significantly outperform the WSD ability of current state-of-the-art
SMT models. Simultaneously, Vickrey et al. (2005) also studied the
application of context-aware discriminative word selection models
based on WSD to SMT. They did not approach the full translation task but limited themselves to the blank-filling task.
Following similar approaches, Cabezas and Resnik (2005)
and Carpuat et al. (2006) used WSD-based models in the context of the
full translation task to aid a phrase-based SMT system. They reported
a small improvement in terms of BLEU score. More recently, other
authors, including ourselves, have extended these works by moving
from words to phrases and allowing discriminative models to cooperate
with other phrase translation models. Moderate improvements have been
reported (Bangalore et al., 2007; Carpuat & Wu, 2007b; Carpuat &
Wu, 2007a; Giménez & Màrquez, 2007a; Giménez & Màrquez,
2008a; Stroppa et al., 2007; Venkatapathy & Bangalore, 2007). One
interesting observation by Giménez and Màrquez (2008a) is that the
improvement of WSD-based phrase selection models is mainly related to
the adequacy dimension, whereas for fluency there is a slight
decrease. These results reveal a problem of integration: phrase
selection classifiers have been trained locally, i.e., so as to
maximize local phrase translation accuracy. In order to test this
hypothesis and further improve the system, we plan to work on global
classifiers directed towards maximizing overall translation quality
instead.
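The following toy sketch (invented training examples for a hypothetical ambiguous Spanish word) shows the basic shape of such a discriminative lexical selection component: a classifier trained on source-side context features chooses among the possible target translations, rather than relying on phrase-table frequencies alone.

```python
# Small sketch of discriminative lexical selection: a classifier trained on
# the source-side context chooses among possible target translations of an
# ambiguous source word. Training examples below are invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Source contexts of the ambiguous word "banco", labelled with the English
# translation chosen by a human translator.
contexts = [
    "pedir un prestamo en el banco",        # financial institution
    "el banco subio los tipos de interes",
    "sentarse en un banco del parque",      # bench
    "un banco de madera junto al rio",
]
labels = ["bank", "bank", "bench", "bench"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(contexts, labels)
# The surrounding words steer the choice of translation for the new context.
print(clf.predict(["deposito el dinero en el banco"]))
```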
Other integration strategies have been tried. Specia et
al. (2008) used dedicated predictions for the reranking of n-best
translations, using models based on Inductive Logic Programming
(Specia et al., 2007). Chan et al. (2007) used a WSD system to
provide additional features for the hierarchical phrase-based SMT
system based on bilingual parsing developed by Chiang (2005; 2007).
Sánchez-Martínez et al. (2007) integrated a simple lexical
selector, based on source lemma co-occurrences in a very local scope,
into their hybrid corpus-based/rule-based MT system.
Semantic
Role Labeling in machine translation
In the past few years there has been an increasing
interest in Semantic Role Labeling (SRL), which is becoming an
important component in many NLP applications. SRL is a well-defined
task with a substantial body of work and comparative evaluation.
Given a sentence, the task consists of detecting basic event
structures such as "who" did "what" to "whom",
"when" and "where". From a linguistic point of
view, this corresponds to identifying the semantic arguments filling
the roles of the sentence predicates.
The
identification of such event frames might have a significant impact
in many Natural Language Processing (NLP) applications, including
machine translation. Although the use of SRL systems in real-world
applications has so far been limited, we think the potential is very
high and we expect a spread of this type of analysis to all
applications requiring some level of semantic interpretation. The
UPC team has already developed SRL prototypes for English, Spanish
and Catalan (Surdeanu et al., 2007a; Màrquez et al., 2007; Surdeanu
et al., 2008).
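Purely as an illustration of the kind of analysis SRL provides, the snippet below shows a hand-written role analysis for one English sentence, using the usual PropBank-style labels (A0 = agent, A1 = patient, AM-TMP = temporal adjunct).

```python
# Illustrative only: a hand-written semantic role analysis for one sentence,
# in the spirit of "who did what to whom, when and where".
srl_analysis = {
    "sentence": "The company bought the factory last year",
    "predicate": "bought",
    "arguments": {
        "A0": "The company",    # who
        "A1": "the factory",    # what/whom
        "AM-TMP": "last year",  # when
    },
}
for role, span in srl_analysis["arguments"].items():
    print(f"{role}: {span}")
```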
Pre-edition and Post-edition

At the last AMTA conference (2008) we detected a special interest in post-editing. A large number of papers mentioned this process, and two very interesting papers were devoted to post-editing (Doyon et al., 2008; Schütz, 2008). Google is also trying to collect this kind of corpus through its public MT service.
Our experience in providing MT services
(http://www.opentrad.org/) shows us two main sources of translation
errors that could be handled via pre-edition: spelling errors and the use of overly long sentences in source texts. The spelling checker Xuxen (Aduriz et al., 1997) is a very effective tool in this kind of situation.
In the same way, in our research groups at EHU and UPC
extensive work has been carried out in syntax and especially in
Clause Identification (Alegria et al., 2008; Carreras et al., 2002). This task is an interesting case for applying those techniques, and our previous work can be adapted to the pre-edition of texts to be automatically translated, as sketched below.
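A minimal sketch of this kind of pre-edition follows; the clause identifier itself is assumed (not implemented here), and the length threshold is an arbitrary illustrative value.

```python
# Minimal pre-edition sketch: overly long source sentences are split at
# detected clause boundaries before being sent to the MT engine, since long
# sentences are one of the main sources of translation errors mentioned above.

MAX_WORDS = 10  # hypothetical length threshold

def pre_edit(sentence, clause_boundaries):
    """sentence: list of tokens; clause_boundaries: token indices where a
    clause identifier says a new clause starts. Returns the segments to be
    translated independently."""
    if len(sentence) <= MAX_WORDS:
        return [" ".join(sentence)]
    cuts = [0] + sorted(clause_boundaries) + [len(sentence)]
    return [" ".join(sentence[a:b]) for a, b in zip(cuts, cuts[1:]) if a < b]

tokens = ("although the committee approved the proposal , "
          "the board rejected it after a long discussion").split()
print(pre_edit(tokens, clause_boundaries=[7]))  # two shorter segments
```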
A post-edition system applying SMT to our RBMT output has achieved an improvement of more than 40% in the BLEU metric (Díaz de Ilarraza et al., 2008). These results suggest new lines of investigation.

However, we could not work with a big manually post-edited corpus as (Simard et al., 2007a) and (Isabelle et al., 2007) did. They report that such a Statistical Post-Editing system using a corpus with no more than 100,000 words of post-edited translations was enough to outperform an expensive lexicon-enriched baseline RBMT system. Obviously, we could not do the same because no such corpus exists for Basque or Catalan. In this new project, OpenMT-2, we are planning to collect such a manually post-edited corpus. We have signed a collaboration agreement with the Wikipedia community for Basque, and we will then be able to exploit this corpus.
Advanced Evaluation
As human evaluation is very costly, MT researchers
developed several automatic evaluation metrics. The commonly accepted
criterion to define a plausible metric is that it must correlate well
with human evaluators.
The use of N-gram based metrics has represented a
significant advance in MT research in the last decade. A number of
evaluation metrics were suggested. The most successful evaluation
metrics have been word error rate (WER), position independent word
error rate (PER), bilingual evaluation understudy (BLEU) (Papineni et
al., 2001), an improved version of BLEU by National Institute of
Standards and Technology (NIST) (Doddington, 2002), the F-measure
provided by the General Text Matcher (GTM) (Melamed et al., 2003),
and ROUGE (Lin & Och, 2004).
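To recall the core idea behind these N-gram based metrics, the toy sketch below computes clipped n-gram precision of a candidate against a single reference; real BLEU combines several n-gram orders, multiple references and a brevity penalty.

```python
from collections import Counter

# Minimal sketch of the idea behind n-gram based MT evaluation metrics:
# clipped n-gram precision of a candidate translation against a reference.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n=2):
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

print(clipped_precision("the cat sat on the mat", "the cat is on the mat"))  # 0.6
```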
We suggest a 'divide and conquer' strategy to build a
set of specialized metrics each one devoted to the evaluation of a
concrete partial aspect of MT quality, and then to combine a set of
metrics into a single measure of MT quality. In OpenMT we have used
the IQmt package (Giménez et al., 2005), which permits metric
combination, with the singularity that there is no need to perform
any training or adjustment of parameters. Besides considering the
similarity of automatic translations to human references, IQmt
additionally considers the distribution of similarities among human
references.
In the last few years, several approaches have been
suggested. Some of them are based on extending the reference lexicon.
ROUGE and METEOR allow for morphological variations by applying
stemming. Additionally, METEOR may perform a lookup for synonymy in
WordNet (Fellbaum, 1998). Others have suggested taking advantage of
paraphrasing support (Russo-Lassner et al., 2005; Zhou et al., 2006;
Kauchak & Barzilay, 2006; Owczarzak et al., 2006). These are
still attempts at the lexical level. At a deeper linguistic level, we
may find, for instance, the work by Liu and Gildea (2005) who
introduced a series of syntax-based metrics. They developed the
Syntactic Tree Matching (STM) metric based on constituency parsing,
and the Head-Word Chain Matching (HWCM) metric based on dependency
parsing. Also based on syntax, Mehay and Brew (2007) suggested
flattening syntactic dependencies only in the reference translations
so as to compute string-based similarities without requiring
syntactic parsing of the possibly ill-formed automatic candidate
translations. We may find as well the work by Owczarzak et al.
(2007a; 2007b) who presented a metric that compares dependency
structures according to a probabilistic Lexical-Functional Grammar.
Popovic and Ney (2007) proposed a novel analysis of translation errors based on WER and PER measures computed over different parts-of-speech. The 'NEE' metric defined by Reeder et al. (2001) was devoted to measuring MT quality over named entities. The metrics developed by Giménez and Màrquez (2007), as part of the OpenMT project, operate over semantic role structures and discourse representations.
Apart from incorporating linguistic information, another
promising research direction suggested in the last years is based on
combining the scores conferred by different metrics into a single
measure of quality (Corston-Oliver et al., 2001; Kulesza &
Shieber, 2004; Quirk, 2004; Gamon et al., 2005; Liu & Gildea,
2007; Albrecht & Hwa, 2007a; Albrecht & Hwa, 2007b; Paul et
al., 2007; Giménez & Màrquez, 2008a; Giménez & Màrquez,
2008c).
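A minimal sketch of the simplest form of metric combination follows (invented metric scores): individual metric scores are normalized and averaged into a single quality measure, whereas the works cited above typically learn the combination from human judgements.

```python
# Minimal sketch of metric combination: several individual metric scores for
# each candidate system are normalized and averaged into one quality measure.

def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}

def combined_quality(per_metric_scores):
    """per_metric_scores: {metric_name: {system_name: score}}."""
    systems = next(iter(per_metric_scores.values())).keys()
    norm = {m: normalize(s) for m, s in per_metric_scores.items()}
    return {sys: sum(norm[m][sys] for m in norm) / len(norm) for sys in systems}

# Invented scores for two hypothetical systems.
scores = {
    "BLEU":  {"sysA": 0.31, "sysB": 0.28},
    "GTM":   {"sysA": 0.55, "sysB": 0.57},
    "ROUGE": {"sysA": 0.49, "sysB": 0.44},
}
print(combined_quality(scores))  # one combined score per system
```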
The most relevant Spanish research groups on MT

Transducens Group. Universitat d'Alacant (http://transducens.dlsi.ua.es/); Speech and Language Applications and Technology (TALP). Universitat Politècnica de Catalunya (http://www.talp.cat/talp/index.php); Pattern Recognition and Human Language Technology Group (PRHLT). Universitat Politècnica de València (http://prhlt.iti.es/); Department of Translation and Philology. Universitat Pompeu Fabra (http://www.upf.edu/dtf/); and Deli Group. University of Deusto (http://www.deli.deusto.es).
The most relevant European research groups on MT

Statistical Machine Translation Group. University of Edinburgh, UK (http://www.statmt.org/ued); National Centre for Language Technology / Centre for Next Generation Localisation (NCLT-MT). Dublin City University, Ireland (http://nclt.dcu.ie/mt); Language Technology and Computational Linguistics (STP). Uppsala Universitet, Sweden (http://stp.lingfil.uu.se/english/); Institute of Formal and Applied Linguistics (UFAL). Univerzita Karlova v Praze, Czech Republic (http://ufal.mff.cuni.cz/); Human Language Technology and Pattern Recognition. RWTH Aachen (http://www-i6.informatik.rwth-aachen.de/); Human Language Technology (HLT). Fondazione Bruno Kessler, Trento, Italy (http://hlt.fbk.eu/); Groupe d'Étude pour la Traduction Automatique (GETA). Laboratoire d'Informatique de Grenoble, France (http://www-clips.imag.fr/geta/); Dept. of Computational Linguistics and Phonetics (COLI). Universität des Saarlandes, Germany (http://www.coli.uni-saarland.de/); and Centre for Language Technology (CST). Københavns Universitet, Denmark (http://www.cst.ku.dk/).