NLP resources
In addition to the freely available, open-source resources and tools for MT,
the groups have developed a long list of resources and tools that
will be very useful for the project, some of them as a result of the
OpenMT project.
BILINGUAL CORPORA:
Basque-English ParDeepBank (QTLeap project)
QTLeap corpus (QTLeap Project)
QTLeap WDS/NED corpus (QTLeap Project)
3 million words from software manuals (Linux, OpenOffice, Office, Windows, ...) (Elhuyar). The open-source ones (Linux and OpenOffice) can easily be extended to Spanish and Catalan.
2 million words from recently translated masterpieces in classic.humanities (Darwin, Hume, Locke...) (EHU)
TweetMT corpus
3 million words from Administration (official documents) (EHU)
12 million words from the translation memories of Elhuyar; domains: telecommunications, environment, finances, science and technology...
1 million words from Administration (IVAP)
300K words from journalism (EITB)
1 million words from popular journalism (Consumer)
4 million words from environment (IHOBE)
QTLeap corpus (QTLeap Project)
Europarl-QTLeap WDS/NED corpus (QTLeap Project)
QTLeap WDS/NED corpus (QTLeap Project)
30 million words from the Europarl Corpus (European Parliament speech transcriptions) (UPC)
10 million words from the El periodico bilingual edition (UPC)
MONOLINGUAL CORPORA:
TWEET-NORM_2013 corpus (TWEET-NORM_2013 Workshop)
TWEET-LID_2014 corpus (TWEET-LID_2014 Workshop)
LINGUISTIC PROCESSORS
(For more information, see Products on the Ixa Group website)
ixa-pipe-coref-eu: coreference for Basque (http://ixa2.si.ehu.es/ixa-pipes/ and also http://metashare.tilde.com/repository/search/?q=ixa)
ixa-pipe-ned-ukb: Named Entity Disambiguation (http://ixa2.si.ehu.es/ixa-pipes/ and also http://metashare.tilde.com/repository/search/?q=ixa)
ixa-pipe-wsd-ukb: Word sense disambiguation
Interset driver for Basque tagset
ixa-pipe-dep-eu: Dependency analysis for Basque (http://ixa2.si.ehu.es/ixa-pipes/ and also http://metashare.tilde.com/repository/search/?q=ixa)
ixa-pipe-pos: Generic part of speech tagger
ixa-pipe-pos-eu: Part of speech tagger for Basque (http://ixa2.si.ehu.es/ixa-pipes/ and also http://metashare.tilde.com/repository/search/?q=ixa)
ixa-pipe-srl: Semantic role labelling
Lemmatization for Basque (Alegria et al., 1996) (EHU)
Lemmatization for Spanish and English: We use the FreeLing suite of language analyzers (Carreras et al., 2004), which may be downloaded at http://www.lsi.upc.es/~nlp/freeling/ (UPC)
PoS Tagging for Basque: Eustagger (Alegria et al., 2002) (EHU)
PoS Tagging: SVMTool (Giménez and Màrquez, 2004), which may be freely downloaded at http://www.lsi.upc.es/~nlp/SVMTool (UPC)
Shallow Parsing: We use the Phreco software (Carreras et al., 2005) (UPC)
Dependency analysis for Basque using either of two paradigms, rules or statistics (Bengoetxea & Gojenola, 2008) (EHU)
Dependency analysis for Spanish and English: FreeLing (www.lsi.upc.edu/~nlp/freeling/)
Clause Splitting: We use the prototype for English developed by Carreras et al. (2005). Spanish and Catalan clause splitters are under development (UPC). For Basque, a prototype is ready (EHU) (Alegria et al., 2008).
Monolingual terminology extraction (eu): Erauzterm (Alegria et al., 2004) (Elhuyar)
Extraction of bilingual terminology (es-eu): Elexbi (Alegria et al., 2005) (Elhuyar)
Semantic Role Labelling: There is a prototype for English developed by Màrquez et al. (2005) (UPC)
Topic signatures for all WordNet nominal senses (Agirre et al., 2004c) (EHU)
Word Sense Disambiguation: We may use the all-words WSD system for English developed by Villarejo et al. (2004) (UPC) and others for Basque and Spanish (EHU)
MT ENGINES AND TOOLS
OpenTrad systems, developed by EHU, UPC, Elhuyar and other partners (open source); ca, gl, en, es, eu: www.opentrad.org
Asiya, an open-source system for MT evaluation: http://www.lsi.upc.edu/~nlp/Asiya/