NLP resources
In addition to the freely available and open-source resources and tools mentioned above, the groups have developed a long list of resources and tools that will be very useful for the project, some of them as a result of the OpenMT project.
BILINGUAL CORPORA:
3 million words from software manuals (Linux, OpenOffice, Office, Windows, ...) (Elhuyar). The open-source manuals (Linux and OpenOffice) can easily be extended to Spanish and Catalan.
2 million words from recently translated masterpieces in the classic humanities (Darwin, Hume, Locke...) (EHU)
3 million words from Administration (official documents) (EHU)
12 million words from the translation memories of Elhuyar; domains: telecommunications, environment, finances, science and technology...
1 million words from Administration (IVAP)
300K words from journalism (EITB)
1 million words from popular journalism (Consumer)
4 million words from environment (IHOBE)
30 million words from the Europarl Corpus (European Parliament speech transcriptions) (UPC)
10 million words from the bilingual edition of El Periódico (UPC)
LINGUISTIC PROCESSORS
(TO BE UPDATED)
Lemmatization for Basque (Alegria et al., 1996) (EHU)
Lemmatization for Spanish and English: We use the Freeling Suite of Language Analyzers (Carreras et al., 2004), which may be downloaded at http://www.lsi.upc.es/~nlp/freeling/ (UPC)
PoS Tagging for Basque: Eustagger (Alegria et al., 2002) (EHU)
PoS Tagging: SVMTool (Giménez and Màrquez, 2004), which may be freely downloaded at http://www.lsi.upc.es/~nlp/SVMTool (UPC)
Shallow Parsing: We use the Phreco software (Carreras et al., 2005) (UPC)
Dependency analysis for Basque, using either of two paradigms, rule-based or statistical (Bengoetxea and Gojenola, 2008) (EHU)
Dependency analysis for Spanish and English: FreeLing (www.lsi.upc.edu/~nlp/freeling/)
Clause Splitting: We use the prototype for English developed by Carreras et al. (2005). Spanish and Catalan clause splitters are under development (UPC). For Basque, a prototype is ready (Alegria et al., 2008) (EHU).
Monolingual terminology extraction (eu): Erauzterm (Alegria et al., 2004) (Elhuyar)
Extraction of bilingual terminology (es-eu): Elexbi (Alegria et al., 2005) (Elhuyar)
Semantic Role Labelling: There is a prototype for English developed by Màrquez et al. (2005) (UPC)
Topic signatures for all WordNet nominal senses (Agirre et al., 2004c) (EHU)
Word Sense Disambiguation: We may use the all-words WSD system for English developed by Villarejo et al. (2004) (UPC) and others for Basque and Spanish (EHU)
MT ENGINES AND TOOLS
OpenTrad systems (open source), developed by EHU, UPC, Elhuyar and other partners, covering ca, gl, en, es and eu: www.opentrad.org
IQMT Framework for MT Evaluation: http://www.lsi.upc.edu/~nlp/IQMT/