State of the art
The state of the art is described for the main innovative points of the project, in order to show how the project will advance beyond the current situation:
Collection, annotation and exploitation of multilingual corpora
As Koehn (2003) and Och (2005) stated, more data yields better translations. Our first challenge, therefore, is to develop multilingual resources:
Building large and representative monolingual corpora for open or restricted domains benefits the performance of the statistical translator. The figure below, from (Och, 2005), shows that larger monolingual data (the text used to build the language model) also yields better translations. The improvements on BLEU follow a log scale: doubling the training data gives a constant improvement (+0.5 BLEU); the last addition is 218 billion words of out-of-domain web data.
However, building a monolingual corpus in an assisted way is a difficult task, mainly because of the effort needed to compile documents from different formats and sources. This problem is especially marked for non-central languages. To address it, SIGWAC (the Special Interest Group of the Association for Computational Linguistics on Web as Corpus) proposes using the Internet as a large source of documents (A. Kilgarriff, G. Grefenstette, 2003). However, new problems arise in this approach: boilerplate removal (CLEANEVAL), duplicate detection (A.Z. Broder, 2000), topic filtering (I. Leturia et al. 2008), etc.
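The duplicate-detection problem mentioned above is usually attacked with shingle-based resemblance. The following is a minimal sketch in that spirit; the shingle size and threshold are illustrative choices, not values from the cited work.

```python
# Minimal sketch of near-duplicate detection via word shingles
# (in the spirit of Broder's resemblance measure).

def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    """Jaccard overlap between the shingle sets of two documents."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
print(resemblance(doc1, doc2) > 0.3)  # True: near-duplicates score high
```

Documents whose resemblance exceeds a tuned threshold would be discarded from the crawled collection.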
The other kind of corpus widely used in SMT is the parallel corpus. Specifically, these corpora are used to train translation models, so their size is extremely important. The figure below, from (Koehn, 2003), shows how the BLEU score rises as the size of the bilingual parallel corpus grows. The improvements on BLEU follow a log scale for all languages: doubling the training data gives a constant improvement (+1 BLEU). This experiment was performed with a statistical system on the Europarl parallel corpus, 30 million words in 11 languages (http://www.statmt.org/europarl/).
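The "constant gain per doubling" observation amounts to logarithmic growth of BLEU with corpus size. A back-of-envelope extrapolation under that rule of thumb (illustrative only, not a result from the cited papers) can be written as:

```python
import math

def projected_bleu(base_bleu, base_size, new_size, gain_per_doubling=1.0):
    """Extrapolate BLEU under the 'constant gain per doubling' observation."""
    return base_bleu + gain_per_doubling * math.log2(new_size / base_size)

# e.g. quadrupling a parallel corpus ~ +2 BLEU under this rule of thumb
print(projected_bleu(20.0, 10_000_000, 40_000_000))  # 22.0
```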
Parallel corpora can also be exploited for terminology extraction and for extending the dictionaries used in RBMT. They are often derived from translation memories. Unfortunately, this kind of corpus is scarce. As an alternative, some authors propose using multilingual web sites as a source for building parallel corpora automatically (P. Resnik and N.A. Smith 2003, Zhang et al. 2006, Nie et al. 1999).
Nevertheless, parallel corpora are still scarce for unusual language pairs or specific domains. Thus, some authors propose exploiting comparable corpora to extract translation knowledge. They focus mainly on extracting bilingual terminology (Rapp, 1995), (Fung, 1996), which can be very valuable for updating and extending the bilingual dictionaries used in RBMT. Other authors have improved the performance of SMT systems by integrating the exploitation of comparable corpora (David Talbot 2003, Munteanu and Marcu 2003, Hewavitharana and Vogel 2008).
As with the compilation of parallel and monolingual corpora, different approaches have been proposed to compile comparable corpora: the assisted way (e.g., BNC), RSS feeds (Fairon et al. 2008), web crawling (Chakrabarti et al., 1999), and search APIs (Baroni and Bernardini, 2004). Nevertheless, the collection of comparable corpora (Talvensaari et al. 2008) is a relatively new research topic, and some problems remain unresolved. For example, how to obtain a high degree of comparability, which parameters determine it, and what its effect is on the performance of the exploitation are still open questions.
As mentioned before, terminology extraction from comparable corpora is very attractive for several reasons:
* Comparable corpora can be easily obtained, unlike parallel corpora.
* Comparable corpora are easily updated, so new terminology will be detected.
However, it must be said that the precision and recall of terminology extraction from comparable corpora are still far from those achieved with parallel corpora (Melamed, 2001; Och, 2002). While studies report around 80% precision (taking into account the first 10-20 candidates) with comparable corpora (Fung, 1995; Rapp, 1995), the precision that can be achieved with parallel corpora is above 95% taking into account only the top candidates (Tiedemann, 1998). This is mainly due to the greater implicitness of the knowledge to be inferred. The most used paradigms are context similarity (Fung, 1995; Rapp, 1995) and string similarity (Al-Onaizan & Knight, 2002). The standard algorithm to recognize context similarity consists of three steps: modeling of the contexts, translation of the source contexts using a seed bilingual lexicon, and calculation of the degree of similarity. The majority of approaches follow the "bag-of-words" paradigm; thus, the contexts are represented by weighted collections of words. Several works discuss how to determine which words make up the context of a word and what relevance they have with respect to the word in question (Petkar et al., 2007) (Gamallo 2008). Another way of representing the contexts is by using language models (Shao et al. 2004; Saralegi et al., 2008). Another important issue is the treatment of multi-word terms (MWT). Not many works deal with MWT (Daille & Morin, 2005) (Morin et al. 2007), but they represent an essential lexicon when adapting an RBMT system to a specific domain.
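The three-step context-similarity algorithm can be sketched as follows. The toy corpora, the seed lexicon and the window size are invented for illustration; real systems use large corpora, weighting schemes such as log-likelihood, and much larger seed dictionaries.

```python
from collections import Counter
from math import sqrt

# (1) model contexts as bags of words, (2) translate the source context
# with a seed bilingual lexicon, (3) rank target candidates by cosine.

def context_vector(corpus, word, window=2):
    ctx = Counter()
    for sent in corpus:
        toks = sent.split()
        for i, t in enumerate(toks):
            if t == word:
                ctx.update(toks[max(0, i - window):i] + toks[i + 1:i + 1 + window])
    return ctx

def translate_vector(vec, seed_lexicon):
    out = Counter()
    for w, c in vec.items():
        if w in seed_lexicon:
            out[seed_lexicon[w]] += c
    return out

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

src_corpus = ["the virus infects the cell", "a virus attacks the cell"]
tgt_corpus = ["el virus infecta la celula", "el virus ataca la celula"]
seed = {"the": "el", "cell": "celula", "a": "el"}

src_vec = translate_vector(context_vector(src_corpus, "virus"), seed)
candidates = ["virus", "celula"]
best = max(candidates, key=lambda w: cosine(src_vec, context_vector(tgt_corpus, w)))
print(best)  # "virus"
```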
General framework to represent annotated parallel corpora
Nowadays the Natural Language Processing community has produced many tools to manage linguistic data in corpora, such as those resulting from tokenizers, morphosyntactic analyzers, lemmatizers, and so on. The integration of these tools has to deal with several types of incompatibilities. These problems are more prominent when we need to integrate tools for different languages developed by different groups.
A few years ago the language processing community defined a standard for effective Language Resource Management (ISO TC37/SC4), whose goal is to provide a framework for the creation, annotation and manipulation of linguistic resources and processing software [Ide & Romary 2004]. Adopting this formalism to represent this information is not trivial, and different attempts to develop such a framework have been made. For example, ALEP [Simkins 1994] can be considered the first integration environment for NLP design; GATE [Cunningham et al. 2002] is perhaps the most influential system in the area. The ATLAS system [Bird et al. 2000] and the Annotation Graphs Toolkit (AGTK) [HaeJoong et al. 2002] are implementations of the Annotation Graphs formalism and provide an architecture that facilitates the development of linguistic annotation applications; however, they exhibit problems when encoding some linguistic structures because they do not allow the separation of information into layers. GATE uses the TIPSTER architecture for annotation and also presents some disadvantages for encoding non-continuous multiword lexical units. The NLTK framework (nltk.sourceforge.net) is not able to represent ambiguities, and in the Emdros framework [Ulrik 2004] it is not possible to properly represent some types of relations between elements or classification ambiguity. We have adopted our own approach because our representation requirements are not completely fulfilled by the annotation schemes proposed in these systems.
We have developed an environment called Eulibeltz (Casillas et al., 2006), an extension of Eulia [Artola et al. 2004], an interface for monolingual corpora. Eulibeltz follows [Ide & Romary 2004] and the stand-off markup approach inspired by the TEI guidelines [Sperberg-McQueen & Burnard 2002] to represent linguistic information and the translation units obtained by several NLP tools. It also provides a way to represent ambiguities, non-continuous multiword lexical units and relations between translation units of bilingual documents. Unlike the other environments, Eulibeltz lets human experts modify the annotations incorporated by the automatic process when these annotations are incorrect. Eulibeltz eases the integration of linguistic tools developed for the treatment of two languages, Spanish and Basque, and also facilitates access to the annotated documents. Our main contribution is that our annotation proposal deals with both monolingual and bilingual parallel documents, and the software allows editing the automatically generated markup and exploring the annotated documents. At the end of the data flow it is possible to generate linguistic resources such as translation memories that contain different types of translation units and translation patterns.
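The stand-off idea of keeping annotation layers separate from the text can be illustrated with a toy example. The element and attribute names below are invented for illustration; they are not Eulibeltz's actual schema.

```python
import xml.etree.ElementTree as ET

# Stand-off markup: the analysis layer points to token ids instead of
# embedding annotations in the text, so layers can be added or edited
# independently of each other.

tokens = ET.Element("tokens")
for i, w in enumerate(["echo", "mucho", "de", "menos"], start=1):
    ET.SubElement(tokens, "tok", id=f"w{i}").text = w

# A non-continuous multiword unit can reference any subset of tokens:
# here "echo ... de menos" skips the intervening adverb.
analysis = ET.Element("analysis")
mwu = ET.SubElement(analysis, "mwLemma", lemma="echar de menos")
mwu.set("spans", "w1 w3 w4")

print(ET.tostring(tokens, encoding="unicode"))
```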
MT combined/hybrid systems
Traditionally,
MT was performed via rule-based machine translation systems (RBMT)
that worked by applying a set of linguistic rules in three phases:
analysis, transfer and generation. Since late 80s there is much
interest in exploring new techniques in corpus-based approaches:
statistical text analysis (alignment, etc.), example -based machine
translation (EBMT) and statistical machine translation (SMT).
Nowadays, research is oriented towards hybrid systems which combine traditional linguistic rules and corpus-based methods.
EBMT
and SMT are the two main models in corpus-based MT. Both need a set
of sentences in one language aligned with their translation in
another. GIZA++ is a well-known tool to perform this alignment. The
two models induce translation knowledge from sentence-aligned
corpora, but there are significant differences regarding both the
type of information learnt and how this is brought to bear in dealing
with new input. In
SMT, essentially, the translation model (obtained from parallel
corpus) establishes the set of target language words (and more
recently, phrases), which are most likely to be useful in translating
the source string, while the language model (obtained from
monolingual corpus) tries to assemble these words (and phrases) in
the most likely target word order. Nowadays, however, SMT
practitioners also get their systems to learn phrasal as well as
lexical alignments (e.g. (Koehn et al., 2003); (Och, 2003)). Novel
approaches to reordering in phrase-based statistical MT have been
proposed (Kanthak et al., 2005).
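The translation-model / language-model decomposition described above can be sketched with a toy phrase table and bigram language model. All phrases and probabilities here are invented for illustration; real decoders search over many segmentations and reorderings.

```python
import math

# Toy SMT scoring: the translation model proposes target phrases,
# and the language model scores their ordering.

phrase_table = {          # P(target_phrase | source_phrase)
    "la casa": {"the house": 0.7, "house the": 0.3},
    "roja": {"red": 0.9, "rose": 0.1},
}
lm_bigrams = {("<s>", "the"): 0.4, ("the", "house"): 0.5,
              ("house", "red"): 0.3, ("the", "red"): 0.2,
              ("<s>", "house"): 0.1}

def lm_score(words):
    score, prev = 0.0, "<s>"
    for w in words:
        score += math.log(lm_bigrams.get((prev, w), 1e-6))  # backoff floor
        prev = w
    return score

def score(candidate, source_phrases):
    tm = sum(math.log(phrase_table[s][t]) for s, t in source_phrases)
    return tm + lm_score(candidate.split())

# two candidate translations of "la casa roja" with their phrase derivations
cands = {
    "the house red": [("la casa", "the house"), ("roja", "red")],
    "house the red": [("la casa", "house the"), ("roja", "red")],
}
best = max(cands, key=lambda c: score(c, cands[c]))
print(best)  # "the house red"
```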
Moreover, in the
last decade several approaches to introducing syntactic knowledge
into SMT have been proposed. At the shallow parsing level, Koehn and
Knight (2002), Schafer and Yarowsky (2003), and Giménez and
Màrquez (2005) have proposed systems that integrate linguistic
concepts such as morphosyntactic analysis (part-of-speech tags),
lemmatization and shallow parsing (chunks) in the frame of SMT.
Moving onto full parsing, Yamada and Knight (2001, 2002) presented a
syntax based tree-to-string probability model in which tree
constituents are aligned to strings. Gildea (2003, 2004) followed and
improved this same idea by working on constituency/dependency
tree-to-tree alignments. Finally, others have suggested the idea of
bilingual parsing applied to phrasal alignment: Wu (1997) presented a
novel stochastic inversion transduction grammar formalism for
bilingual language modeling of sentence-pairs, and, recently, Melamed
(2004) suggested a similar approach based on multitext grammars. In
EBMT, as Somers (2003) and Hutchins (2005) recently stated, the
essence is the matching of SL fragments (from an input text) against
source language fragments (in a database) and the extraction of the
equivalent TL fragments (as potential partial translations). In this
light, whether the matching involves pre-compiled fragments
(templates derived from the corpus), whether the fragments are
derived at run-time, and whether the fragments (chunks) contain
variables or not are all secondary factors, however useful they may be in distinguishing EBMT subtypes (as Carl and Way (2003) show in their collection). Input sentences may be treated as wholes, divided into
fragments or even analysed as tree structures; what matters is that
in transfer (matching/extraction) there is reference to the example
database and not, as in RBMT, the application of rules and features
for the transduction of SL structures into TL structures. Groves and Way (2005) developed a hybrid system that used a set of closed-class words to segment aligned source sentences and to derive an additional set of lexical and phrasal resources.
This hybrid example-based SMT system improved the results of pure SMT
or EBMT systems.
Recently,
several possible approaches have been developed to combine the RBMT,
EBMT and SMT engines. Some of them will be explored in our project.
In fact the third challenge we face in this project is the
combination or effective hybridization of these three single
paradigms.
Combining
MT paradigms: Multi-Engine MT. (van
Zaanen and Somers, 2005), (Matusov et al.,2006) and (Macherey and
Och, 2007) review a set of references about MEMT (Multi-Engine MT)
including the first attempt by (Frederking and Nirenburg, 1994). All
the papers on MEMT reach the same conclusion: combining the outputs
results in a better translation. Most of the approaches generate a
new consensus translation combining different SMT systems using
different language models and in some cases combining also with RBMT
systems. Some of the approaches require confidence scores for each of
the outputs. The improvement in translation quality is always below an 18% relative increase in BLEU score. (Chen et al., 2007) reports an 18% relative increase for in-domain evaluation and 8% for out-of-domain, by incorporating phrases (extracted from
alignments from one or more RBMT systems with the source texts) into
the phrase table of the SMT system and using the open-source decoder Moses to find good combinations of phrases from the SMT training data with the phrases derived from RBMT. (Matusov
et al., 2006) reports a 15% relative increase in BLEU score using
consensus translation computed by voting on a confusion network.
Pairwise word alignments of the original translation hypotheses were
estimated for an enhanced statistical alignment model in order to
explicitly capture reordering. (Macherey
and Och, 2007) presented an empirical study on how different
selections of translation outputs affect translation quality in
system combination. Composite translations were computed using (i) a candidate selection scheme, (ii) a ROVER-like combination scheme, and (iii) a novel two-pass
search algorithm which determines and re-orders bags of words that
build the constituents of the final consensus hypothesis. All
gave statistically significant relative improvements of up to 10%
BLEU score. They combine large numbers of different research systems. (Mellebeek
et al., 2006) reports improvements of up to 9% BLEU score. Their
experiment is based on the recursive decomposition of the input
sentence into smaller chunks, and a selection procedure based on
majority voting that finds the best translation hypothesis for each
input chunk using a language model score and a confidence score
assigned to each MT engine. (Huang
and Papineni, 2007) and (Rosti et al., 2007) combine the output of multiple MT systems at the word, phrase and sentence levels. They report
improvements of up to 10% BLEU score.
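The sentence-level selection step shared by these MEMT approaches can be sketched as follows. The engines, scores, weights and the stand-in language model are all invented; real systems use trained confidence estimators and proper language models.

```python
# Minimal sketch of sentence-level MEMT output selection: each engine
# returns a hypothesis with a confidence score, and a language-model
# score is combined with it.

def select_output(hypotheses, lm_score, w_conf=0.7, w_lm=0.3):
    """hypotheses: list of (engine_name, translation, confidence in [0,1])."""
    def combined(h):
        _, text, conf = h
        return w_conf * conf + w_lm * lm_score(text)
    return max(hypotheses, key=combined)

# toy LM: fraction of known frequent words (stand-in for a real LM)
common = {"the", "house", "is", "red"}
def toy_lm(text):
    words = text.lower().split()
    return sum(w in common for w in words) / len(words)

hyps = [("RBMT", "the house is red", 0.5),
        ("SMT", "house rooj is", 0.6),
        ("EBMT", "the red house", 0.4)]
print(select_output(hyps, toy_lm)[0])  # "RBMT": LM penalizes the SMT output
```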
In the OpenMT project we performed a successful first attempt at Spanish-Basque multi-engine MT (Alegria et al., 2008). We applied the system to a restricted domain (translation in public administration). We built a hierarchical strategy for combining MT engines: first EBMT (translation patterns), then SMT (if its confidence score was greater than a threshold), and then RBMT. The initial automatic evaluation showed very significant improvements: a 193.55% relative increase in BLEU comparing EBMT+SMT with the SMT single system, and a 15.08% relative increase in BLEU comparing EBMT+SMT with the EBMT
single system. However, these results were obtained using automatic metrics with only one reference. As RBMT systems tend to be undervalued by such evaluations, a deeper evaluation is necessary (more than one reference with BLEU and NIST, plus human evaluation with HTER), and we expect the results to be even better. These initial successful results encourage us to continue investigating along this line.
Combining MT paradigms: Statistical post-edition on the RBMT output. In
the experiments reported by (Simard et al., 2007a) and (Isabelle et al., 2007), the SPE task is viewed as translation from the language of RBMT outputs into the language of their manually post-edited counterparts, so they do not use a parallel corpus created by human translation. Their RBMT system is SYSTRAN and their SMT system is
PORTAGE. (Simard et al., 2007a) reports a reduction in post-editing
effort of up to a third when compared to the output of the rule-based
system, i.e., the input to the SPE, and as much as 5 BLEU points
improvement over the direct SMT approach. (Isabelle et al., 2007)
concludes that such a RBMT+SPE system appears to be an excellent way
to improve the output of a vanilla RBMT system and constitutes a
worthwhile alternative to costly manual adaptation efforts for such
systems. Thus, an SPE system using a corpus with no more than 100,000 words of post-edited translations is enough to outperform an expensive lexicon-enriched baseline RBMT system. The
same group recognizes (Simard et al., 2007b) that this sort of
training data is seldom available, and they conclude that the training data for the post-editing component does not need to be manually post-edited translations but can be generated even from standard parallel corpora. Their new RBMT+SPE system outperforms both the RBMT and SMT systems again. The experiments show that while post-editing is most effective when little training data is available, it remains competitive with SMT even when larger amounts of data are used. After a linguistic analysis they conclude
that the main improvement is due to lexical selection. In
(Dugast et al., 2007), the authors of SYSTRAN's RBMT system present a huge improvement in BLEU score for an SPE system compared with the raw translation output. They get an improvement of around 10 BLEU
points for German-English using the Europarl test set of WMT2007.
(Ehara,
2007) presents two experiments to compare RBMT and RBMT+SPE systems.
Two different corpora are used: one is the reference translation (PAJ, Patent Abstracts of Japan); the other is a large-scale target language corpus. In the former case RBMT+SPE wins; in the latter case RBMT wins. Evaluation is performed using NIST scores and a new
evaluation measure NMG that counts the number of words in the longest
sequence matched between the test sentence and the target language
reference corpus. Finally,
(Elming, 2006) works in the more general field of Automatic Post-Editing (APE). They use transformation-based learning (TBL),
a learning algorithm for extracting rules to correct MT output by
means of a post-processing module. The algorithm learns from a
parallel corpus of MT output and human-corrected versions of this
output. The machine translations are provided by a commercial MT
system, PaTrans, which is based on Eurotra. Elming reports a 4.6
point increase in BLEU score.
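The core idea of learning correction rules from pairs of MT output and human-corrected output can be sketched with a deliberately naive rule learner (most frequent word-level substitution); this is a much-simplified stand-in for full TBL, and the example pairs are invented.

```python
from collections import Counter

# Naive sketch of learning word-substitution post-edit rules from
# (MT output, corrected output) pairs, then applying them to new output.

def learn_rules(pairs, min_count=2):
    subs = Counter()
    for mt, ref in pairs:
        for a, b in zip(mt.split(), ref.split()):
            if a != b:
                subs[(a, b)] += 1
    # keep only substitutions seen often enough to trust
    return {a: b for (a, b), c in subs.items() if c >= min_count}

def post_edit(sentence, rules):
    return " ".join(rules.get(w, w) for w in sentence.split())

pairs = [("the casa is red", "the house is red"),
         ("a casa was built", "a house was built")]
rules = learn_rules(pairs)
print(post_edit("my casa is big", rules))  # "my house is big"
```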
Inside the OpenMT project, a post-edition system applying SMT to the RBMT output achieved an improvement greater than 40% in the BLEU metric (Diaz de Ilarraza et al., 2008). These results suggest new lines of investigation. In the OpenMT project we performed two experiments to verify the improvement obtained for other languages by using statistical post-edition. Our experiments differ from other similar works because we used a morphological component in both the RBMT and SMT translations, and because the size of the available corpora is small. Our results are consistent with the large improvements obtained with an RBMT+SPE approach on a restricted domain reported by (Dugast et al., 2007; Ehara, 2007; Simard et al., 2007b): we obtain a 200% improvement in the BLEU score for an RBMT+SPE system built on the Matxin RBMT system, compared with the raw translation output, and 40% compared with the SMT system. Our results are also consistent with the smaller improvement obtained when using more general corpora, as reported by (Ehara, 2007; Simard et al., 2007b). Thus, even dealing with small corpora, the results of this paradigm combination were satisfactory.
Hybridizing MT paradigms. Fewer works exist in the recent literature that present a truly internal hybridization of several MT paradigms (i.e., where the output is
constructed by taking into account simultaneous information from
different paradigms). A couple of strategies to be explored follow. One is to introduce statistical knowledge to resolve ambiguities in RBMT systems; for instance, most rule-based systems lack a mechanism for deciding which translation rule must be applied when several rules are equally applicable in the same context to the same sentence span.
Discriminative lexical selection techniques could provide an
effective solution to this problem. Chan and Chiang (2007) have
applied similar ideas to their hierarchical statistical MT system. In
this system, however, rules are not manually defined but
automatically induced from word alignments. A second and more general
approach would be to devise an inference procedure which is able to
deal with multiple proposed fragment translations, coming from
different sources (translation tables from SMT, fragments translated
by RBMT, examples from EBMT, etc.). The search for the optimal
translation should coherently combine the translation candidates
(from different granularities), taking also into account constraints
regarding the target language. The works by Groves and Way (2005)
introducing EBMT translation pairs into a SMT system fall in this
category.
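The second strategy, pooling fragment translations from several paradigms for a joint search, can be sketched as a merged translation table that keeps provenance as a feature. All entries below are invented examples; a real system would feed this table to a decoder with tuned per-paradigm weights.

```python
# Sketch of pooling fragment translations from several paradigms into
# one table, keeping provenance so a decoder could weight each source.

def merge_tables(sources):
    """sources: dict paradigm_name -> {src_fragment: [(tgt, score), ...]}"""
    merged = {}
    for paradigm, table in sources.items():
        for src, options in table.items():
            for tgt, score in options:
                merged.setdefault(src, []).append((tgt, score, paradigm))
    # keep candidates sorted by score for each fragment
    for src in merged:
        merged[src].sort(key=lambda x: -x[1])
    return merged

tables = {
    "SMT":  {"etxe": [("house", 0.6), ("home", 0.3)]},
    "RBMT": {"etxe": [("house", 1.0)]},
    "EBMT": {"etxe zuria": [("the white house", 0.8)]},
}
merged = merge_tables(tables)
print(merged["etxe"][0])  # ('house', 1.0, 'RBMT')
```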
Our ambitious proposal is to pursue not a hybrid combination of external systems, but an architecture that actually integrates the three paradigms within the translation process itself.
Integration of advanced linguistic knowledge in shallow machine translation
Integration of syntax into statistical machine translation
A
limitation of standard phrase-based SMT systems is that reordering
models are very simple. For instance, non-contiguous phrases are not
allowed, long distance dependencies are not modeled, and syntactic
transformations are not captured. Syntax-based approaches seek to
remedy these deficiencies by explicitly taking into account syntactic
knowledge. Approaches to syntax-based MT differ in several aspects:
(i) side of parsing (source, target, or both sides), (ii) type of
parsing (dependencies vs. constituents), (iii) modeling of
probabilities (generative vs. discriminative), (iv) core (structured
predictions vs. transformation rules), and (v) type of decoding
(standard phrase-based, modeled by transducers, based on parsing,
graph-based). Approaches to syntax-based SMT may be grouped into three different families:
Bilingual Parsing.
The translation process is approached as a case of synchronous
bilingual parsing. Derivation rules are automatically learned from
parallel corpora, either annotated or unannotated (Wu, 1997; Wu,
2000; Alshawi, 1996; Alshawi et al., 2000; Melamed, 2004; Melamed et
al., 2005; Chiang, 2005; Chiang, 2007). Tree-to-String,
String-to-Tree and Tree-to-Tree Models.
These models exploit syntactic annotation, either in the source or
target language or both, to estimate more informed translation and
reordering models or translation rules (Yamada & Knight, 2001;
Yamada, 2002; Gildea, 2003; Lin, 2004; Quirk et al., 2005; Cowan et
al., 2006; Galley et al., 2006; Marcu et al., 2006).
Source Reordering.
Another interesting approach consists in reordering the source text prior to translation, using syntactic information, so that it conforms to the appropriate word order of the target language (Collins et al., 2005; Crego et al., 2006; Li et al., 2007). Significant improvements have been reported using this technique.
Integration of semantic knowledge into machine translation
One
natural and appealing extension of the current RBMT and SMT paradigms
for machine translation is the use of richer sources of linguistic
knowledge, e.g., semantics. Such an extension would presumably lead
to a qualitative improvement of state-of-the-art performance.
However, reasoning with explicit semantics is a really hard goal for open-text NLP, one that has generally been ignored in the development of practical systems. Our efforts in this respect will concentrate on the two concrete and feasible aspects described below:
Word Sense Disambiguation in word translation
One
of the challenges in MT is that of lexical choice (or word selection)
in the case of semantic ambiguity, i.e., the choice for the most
appropriate word in the target language for a polysemous word in the
source language when the target language offers more than one option
for the translation and these options have different meanings. The
area that deals with this general disambiguation problem is referred
to as word sense disambiguation (WSD). Note that, as emphasized by
Hutchins and Somers (1992), monolingual WSD is different from the
multilingual task, since the latter is concerned only with the
ambiguities that come along in the translation from one language to
another.
Recently,
there has been a growing interest in the application of
discriminative learning models to word selection, and, more
generally, phrase
selection
in the context of SMT. Discriminative models allow for taking into
account a richer feature context, and probability estimates are more
informed than the simple frequency counts used in SMT translation
models. In these systems lexical selection is addressed as a
classification task. For each possible source word (or phrase)
according to a given bilingual lexical inventory (e.g., the
translation model), a distinct classifier is trained to predict
lexical correspondences based on local context. Thus, during
decoding, for every distinct instance of every source phrase a
distinct context-aware translation probability distribution is
potentially available. Brown
et al. (1991a; 1991b) were the first to suggest using dedicated WSD
models in SMT. In a pilot experiment, they integrated a WSD system
based on mutual information into their French-to-English word-based
SMT system. Results were limited to the case of binary disambiguation
and to a reduced set of very common words. Some years passed until these ideas were taken up again by Carpuat and Wu (2005b), who suggested integrating WSD predictions into a phrase-based SMT system. In a
first approach, they did so in a hard manner, either for decoding, by
constraining the set of acceptable word translation candidates, or
for post-processing the SMT system output, by directly replacing the
translation of each selected word with the WSD system prediction.
However, they did not manage to improve MT quality. They encountered
several problems inherent to the SMT architecture. In particular,
they described what they called the language model effect in SMT:
"The lexical choices are made in a way that heavily prefers phrasal
cohesion in the output target sentence, as scored by the language
model". In a later work, Carpuat and Wu (2005a) analyzed the
converse question, i.e., they measured the WSD performance of SMT
systems. They showed that dedicated WSD models significantly
outperform the WSD ability of current state-of-the-art SMT models.
Consequently, SMT should benefit from WSD predictions.
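The classifier-per-source-word scheme underlying these approaches can be sketched as follows. A trivial count-based model stands in for a real discriminative learner, and the training examples and words are invented.

```python
from collections import Counter, defaultdict

# Sketch of WSD-style lexical selection: one model per ambiguous source
# word predicts its translation from the surrounding context words.

def train(examples):
    """examples: list of (source_word, context_words, translation)."""
    models = defaultdict(lambda: defaultdict(Counter))
    for word, ctx, trans in examples:
        for c in ctx:
            models[word][trans][c] += 1
    return models

def select(models, word, ctx):
    # score each candidate translation by how well its context profile
    # matches the observed context
    scores = {t: sum(feats[c] for c in ctx)
              for t, feats in models[word].items()}
    return max(scores, key=scores.get)

examples = [
    ("banco", ["dinero", "cuenta"], "bank"),
    ("banco", ["sentarse", "parque"], "bench"),
    ("banco", ["prestamo", "dinero"], "bank"),
]
models = train(examples)
print(select(models, "banco", ["parque", "arbol"]))  # "bench"
```

In the soft-integration setting discussed above, such per-instance scores would enter the decoder as an additional feature rather than a hard choice.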
Simultaneously, Vickrey et al. (2005) also studied the application of
context-aware discriminative word selection models based on WSD to
SMT. They did not encounter the language model effect because they
approached the task in a soft way, i.e., allowing WSD-based
probabilities to interact with other models during decoding. However,
they did not approach the full translation task but limited themselves to the blank-filling task, a simplified version in which the target context surrounding the word translation is available. Following
similar approaches, Cabezas and Resnik (2005) and Carpuat et al.
(2006) used WSD-based models in the context of the full translation
task to aid a phrase-based SMT system. They reported a small
improvement in terms of BLEU score, possibly because they did not work with phrases but were limited to single words. Besides, they did not allow WSD-based predictions to interact with other translation probabilities. More recently, other authors, including ourselves,
have extended these works by moving from words to phrases and
allowing discriminative models to cooperate with other phrase
translation models as an additional feature. Moderate improvements
have been reported (Bangalore et al., 2007; Carpuat & Wu, 2007b;
Carpuat & Wu, 2007a; Giménez & Màrquez, 2007a;
Giménez & Màrquez, 2008a; Stroppa et al., 2007;
Venkatapathy & Bangalore, 2007). All these works were elaborated at the same time and presented at very close dates, with very similar conclusions. One interesting observation by Giménez
and Màrquez (2008a) is that the improvement of WSD-based
phrase selection models is mainly related to the adequacy dimension,
whereas for fluency there is a slight decrease. These results reveal
a problem of integration: phrase selection classifiers have been
trained locally, i.e., so as to maximize local phrase translation
accuracy. In order to test this hypothesis and further improve the
system, we plan to work on global classifiers directed towards
maximizing overall translation quality instead. Other
integration strategies have been tried. For instance, Specia et al.
(2008) used dedicated predictions for the reranking of n-best
translations. Their models were based on Inductive Logic Programming
(Specia et al., 2007). They limited themselves to a small set of words from
different grammatical categories. A very significant BLEU improvement
was reported. In a different approach, Chan et al. (2007) used a WSD
system to provide additional features for the hierarchical
phrase-based SMT system based on bilingual parsing developed by
Chiang (2005; 2007). These features were intended to give a bigger
weight to the application of rules that are consistent with WSD
predictions. A moderate but significant BLEU improvement was
reported. Finally, Sánchez-Martínez et al. (2007)
integrated a simple lexical selector, based on source lemma
co-occurrences in a very local scope, into their hybrid
corpus-based/rule-based MT system.
Semantic Role Labeling in machine translation
In
the past few years there has been an increasing interest in Semantic
Role Labeling (SRL), which is becoming an important component in many
NLP applications. SRL is a well-defined task with a substantial body
of work and comparative evaluation. Given a sentence, the task
consists of detecting basic event structures such as "who"
did "what" to "whom", "when" and
"where". From a linguistic point of view, this corresponds
to identifying the semantic arguments filling the roles of the
sentence predicates.
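The "who did what to whom, when" structure can be illustrated as data. The role labels follow PropBank-style naming (A0, A1, AM-TMP), and the sentence and spans are an invented example.

```python
# Toy illustration of the event structure SRL recovers for one predicate.

sentence = "Mary sold the car yesterday"
frame = {
    "predicate": "sold",
    "A0":     "Mary",       # who (agent)
    "A1":     "the car",    # what/whom (patient)
    "AM-TMP": "yesterday",  # when
}

def answer(role):
    """Look up the argument filling a given semantic role."""
    return frame.get(role, "unknown")

print(answer("A0"))  # "Mary"
```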
The
identification of such event frames might have a significant impact
in many Natural Language Processing (NLP) applications, including
machine translation. Although the use of SRL systems in real-world
applications has so far been limited, we think the potential is very
high and we expect a spread of this type of analysis to all
applications requiring some level of semantic interpretation.
OpenMT-2
will take advantage of the previous works on SRL carried out by the
UPC team researchers. They conducted two international evaluation
exercises for SRL in the context of the CoNLL-2004 and 2005 (Carreras
and Màrquez, 2004; 2005) shared tasks, and, more recently two
additional evaluation exercises at CoNLL-2008 and 2009 on joint
extraction of syntactic and semantic dependencies for multiple
languages (Surdeanu et al., 2008), that is, a combination of
syntactic dependency parsing and SRL. The second evaluation is
currently underway. The UPC team has developed SRL
prototypes for English, Spanish and Catalan (Surdeanu et al., 2007a;
Màrquez et al., 2007; Surdeanu et al., 2008). Additionally, as
explained in the following section, the UPC team has worked on MT
evaluation metrics based on Semantic Roles (Giménez 2008). See
tasks in WP3 for a prospective on the possibilities of applying the
SRL technology in machine translation.
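The intuition behind an SRL-based evaluation metric can be sketched as frame overlap between the candidate and reference translations. This is a toy illustration of the general idea only, not the actual metric of Giménez (2008); all analyses are invented:

```python
def frame_overlap(candidate_frames, reference_frames):
    """Fraction of reference (predicate, role, filler) triples that the
    candidate translation's SRL analysis also contains."""
    def triples(frames):
        return {(f["predicate"], role, filler)
                for f in frames
                for role, filler in f["arguments"].items()}
    ref = triples(reference_frames)
    return len(triples(candidate_frames) & ref) / len(ref)

# Invented analyses: the candidate gets agent and theme right but drops
# the temporal adjunct "yesterday".
reference = [{"predicate": "gave",
              "arguments": {"A0": "Mary", "A1": "a book", "AM-TMP": "yesterday"}}]
candidate = [{"predicate": "gave",
              "arguments": {"A0": "Mary", "A1": "a book"}}]
print(frame_overlap(candidate, reference))  # 2 of 3 triples match
```

A usable metric would of course have to cope with SRL errors on ill-formed candidate translations, which is part of what makes this direction challenging.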
Pre-edition and Post-edition
Our
experience in providing MT services (http://www.opentrad.org) shows
us two main sources of translation errors that could be handled via
pre-edition: spelling errors and overly long sentences in source texts. We plan to profit from the partners' know-how on spelling correction and syntax parsing to get better results in automatic translation. Because of the late standardization of Basque, and because many of today's adult speakers did not learn it at school, the number of misspelled words is relatively high in MT input texts. The
spelling checker Xuxen (Aduriz et al., 1997) is a very effective tool
in this kind of situation, giving people more confidence in the text
they are writing. We think it will be very useful to integrate it into a pre-edition module for the MT system.
In
the same way, we are designing a syntax component of the pre-edition module to generate, or to help the writer generate, shorter sentences, since translation quality drops significantly as the number of words grows. This task has to be based on syntax. Our research groups at EHU and UPC have carried out broad work on syntax, and especially on Clause Identification (Alegria et al., 2008; Carreras et al., 2002). The task is an interesting case for applying those techniques, and our previous work can be adapted to the pre-edition of texts to be automatically translated.
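The entry point of such a module can be sketched as a simple length check; the 25-word threshold below is an invented illustration, not a project parameter, and clause-level splitting would then be offered for the flagged sentences:

```python
def flag_long_sentences(sentences, max_words=25):
    """Return the sentences a writer would be asked to shorten or split
    before sending the text to the MT system."""
    return [s for s in sentences if len(s.split()) > max_words]

text = [
    "This sentence is short enough.",
    " ".join(["word"] * 30),  # a 30-word sentence that should be flagged
]
print(len(flag_long_sentences(text)))  # 1
```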
On
the other hand, at the last AMTA conference (2008) we detected a special interest in post-editing. A large number of papers mentioned this process, and two very interesting papers were devoted to post-editing (Doyon et al., 2008; Schütz, 2008). Google is also trying to collect this kind of corpus through its MT service
(http://www.google.com/intl/en/help/faq_translation.html#usefeedback).
Based
on the collaborative philosophy of Web 2.0, we want to create a network community to involve human translators in the creation of a significant post-edition corpus. We believe that these new lines of work may lead to a qualitative jump in the quality of machine translation, even for languages without large parallel corpora.
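Each contribution collected through such a community could be stored as a simple record along the following lines; the field names are hypothetical, and the domain field anticipates deducing a topic from the source document:

```python
from dataclasses import dataclass

@dataclass
class PostEditRecord:
    """One unit of the post-edition corpus (field names are illustrative)."""
    source: str       # original sentence
    mt_output: str    # raw machine translation
    post_edited: str  # human-corrected translation
    domain: str       # e.g. deduced from the source article's topic

record = PostEditRecord(
    source="Euskara hizkuntza bat da.",
    mt_output="The Basque is one language.",
    post_edited="Basque is a language.",
    domain="linguistics",
)
print(record.post_edited)
```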
A
post-edition system applying SMT to the RBMT output achieved an improvement of more than 40% in the BLEU metric (Diaz de Ilarraza et al., 2008). These results suggest new lines of investigation. We performed two experiments to verify the improvement obtained for other languages by using statistical post-editing, but, unlike Simard et al. (2007a) and Isabelle et al. (2007), we could not work with manually post-edited corpora. In their work, a Statistical Post-Editing system trained on no more than 100,000 words of post-edited translations was enough to outperform an expensive lexicon-enriched baseline RBMT system; we could not replicate this because no corpus of that size exists for Basque or Catalan. So in
this new project OpenMT-2 we are planning to collect and exploit such
a manually post-edited corpus. We have signed a collaboration agreement with the Basque Wikipedia community: we will provide them with our on-line MT systems enhanced with post-editing facilities, and they will promote the use of these translation and post-editing tools among their collaborators when creating Wikipedia contents, automatically enriching at the same time the post-edition corpus we need. This alliance with the Wikipedia community is very significant for further uses, since for each post-edited translation we will have not just the source text, the automatically translated text and the post-edited text, but also its restricted domain (automatically deduced from the corresponding Wikipedia entry).
Advanced Evaluation
Because
human evaluation is very costly, MT researchers have developed
several automatic evaluation metrics. The commonly accepted criterion
that defines a plausible evaluation metric is that it must correlate
well with human evaluators. The
use of N-gram-based metrics in the context of system development has
represented a significant advance in MT research in the last decade.
A number of evaluation metrics have been suggested. The most
successful evaluation metrics have been word error rate (WER),
position independent word error rate (PER), bilingual evaluation
understudy (BLEU) (Papineni et al., 2001), an improved version of
BLEU by National Institute of Standards and Technology (NIST)
(Doddington, 2002), the F-measure provided by the General Text
Matcher (GTM) (Melamed et al., 2003), and ROUGE (Lin & Och, 2004), among others.
Without
a doubt, the construction of a metric that is able to capture
all the linguistic aspects that distinguish `correct' translations
from `incorrect' ones is a very difficult task. We approach this challenge by following a `divide and conquer' strategy: we propose building a set of specialized metrics, each one devoted to the evaluation of a concrete partial aspect of MT quality. The question then is how to combine a set of metrics into a single measure of MT
quality. In
OpenMT we have used the IQMT package (Giménez et al., 2005),
which permits metric combination, with the singularity that there is
no need to perform any training or adjustment of parameters. Besides
considering the similarity of automatic translations to human
references, IQMT additionally considers the distribution of
similarities among human references.
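The core idea of metric combination can be illustrated with the simplest possible scheme, a uniform average of min-max-normalized scores across systems. This is a toy illustration of the general idea only, not the actual IQMT procedure, and all scores are invented:

```python
def combine(metric_scores):
    """Uniformly average metric scores after min-max normalization across
    systems, so that no metric dominates by scale alone.
    metric_scores: {metric_name: {system_name: score}}."""
    systems = next(iter(metric_scores.values())).keys()
    combined = {s: 0.0 for s in systems}
    for scores in metric_scores.values():
        lo, hi = min(scores.values()), max(scores.values())
        for s in systems:
            norm = (scores[s] - lo) / (hi - lo) if hi > lo else 0.0
            combined[s] += norm / len(metric_scores)
    return combined

# Invented scores for two hypothetical systems under three metrics.
scores = {
    "BLEU": {"sysA": 0.30, "sysB": 0.25},
    "NIST": {"sysA": 7.1, "sysB": 7.5},
    "GTM":  {"sysA": 0.60, "sysB": 0.55},
}
ranking = combine(scores)
print(max(ranking, key=ranking.get))  # sysA wins on 2 of 3 metrics
```

More elaborate schemes weight the metrics, which immediately raises the meta-evaluation question discussed below: how to judge whether one combination is better than another.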
However,
having reached a certain degree of maturity, current MT technology now requires more sophisticated metrics. In the
last few years, several approaches have been suggested. Some of them
are based on extending the reference lexicon. For instance, ROUGE and
METEOR allow for morphological variations by applying stemming.
Additionally, METEOR may perform a lookup for synonymy in WordNet
(Fellbaum, 1998). Others have suggested taking advantage of
paraphrasing support (Russo-Lassner et al., 2005; Zhou et al., 2006;
Kauchak & Barzilay, 2006; Owczarzak et al., 2006). These are
still attempts at the lexical level. At a deeper linguistic level, we
may find, for instance, the work by Liu and Gildea (2005) who
introduced a series of syntax-based metrics. They developed the
Syntactic Tree Matching (STM) metric based on constituency parsing,
and the Head-Word Chain Matching (HWCM) metric based on dependency
parsing. Also based on syntax, Mehay and Brew (2007) suggested
flattening syntactic dependencies only in the reference translations
so as to compute string-based similarities without requiring
syntactic parsing of the possibly ill-formed automatic candidate
translations. We may find as well the work by Owczarzak et al.
(2007a; 2007b) who presented a metric which compares dependency
structures according to a probabilistic Lexical-Functional Grammar.
They used paraphrases as well. Their metric obtains very competitive
results, especially as a fluency predictor. Other authors have
designed metrics based on shallow-syntactic information. For
instance, Popovic and Ney (2007) proposed a novel way of analyzing translation errors based on WER and PER measures computed
over different parts-of-speech. At the semantic level, we may find
the 'NEE' metric defined by Reeder et al. (2001), which was
devoted to measuring MT quality over named entities. We may also find
the metrics developed by Giménez and Màrquez (2007), as
part of the OpenMT project, which operate over semantic role
structures and discourse representations. Apart
from incorporating linguistic information, another promising research
direction suggested in the last years is based on combining the
scores conferred by different metrics into a single measure of
quality (Corston-Oliver et al., 2001; Kulesza & Shieber, 2004;
Quirk, 2004; Gamon et al., 2005; Liu & Gildea, 2007; Albrecht &
Hwa, 2007a; Albrecht & Hwa, 2007b; Paul et al., 2007; Giménez
& Màrquez, 2008a; Giménez & Màrquez,
2008c). This solution requires two important ingredients. First, the
combination scheme, i.e., how to combine several metric scores into a
single score. Second, the meta-evaluation criterion, i.e., how to evaluate the quality of a metric combination.
Main Spanish research groups working on MT
Transducens Group. Universitat
d'Alacant <http://transducens.dlsi.ua.es/>; Speech and Language Applications and Technology (TALP), Universitat Politècnica de Catalunya <http://www.talp.cat/talp/index.php>; Pattern Recognition and Human Language Technology Group (PRHLT), Universitat Politècnica de València <http://prhlt.iti.es/>; Department of Translation and Philology, Universitat Pompeu Fabra (UPF) <http://www.upf.edu/dtf/>; and Deli Group, University of Deusto <http://www.deli.deusto.es>.
Other main European research groups working on MT
Statistical Machine Translation Group, University of Edinburgh, UK <http://www.statmt.org/ued>; National Centre for Language Technology / Centre for Next Generation Localisation (NCLT-MT), Dublin City University, Ireland <http://nclt.dcu.ie/mt>; Language Technology and Computational Linguistics (STP), Uppsala Universitet, Sweden <http://stp.lingfil.uu.se/english/>; Institute of Formal and Applied Linguistics (UFAL), Univerzita Karlova v Praze, Czech Republic <http://ufal.mff.cuni.cz/>; Human Language Technology and Pattern Recognition, RWTH Aachen <http://www-i6.informatik.rwth-aachen.de/>; Human Language Technology (HLT), Fondazione Bruno Kessler, Trento, Italy <http://hlt.fbk.eu/>; Groupe d'Étude pour la Traduction Automatique (GETA), Laboratoire d'Informatique de Grenoble, France <http://www-clips.imag.fr/geta/>; Dept. of Computational Linguistics and Phonetics (COLI), Universität des Saarlandes, Germany <http://www.coli.uni-saarland.de/>; and Centre for Language Technology (CST), Københavns Universitet, Denmark <http://www.cst.ku.dk/>.