NLP resources

In addition to the freely available and open source resources and tools mentioned above, the groups have developed a long list of resources and tools that will be very useful for the project, some of them as a result of the OpenMT project.

BILINGUAL CORPORA:

  • en-eu corpora:

  1. 3 million words from software manuals (Linux, OpenOffice, Office, Windows, ...) (Elhuyar)

For the open source software (Linux and OpenOffice), this corpus can easily be extended to Spanish and Catalan.

  2. 2 million words from recently translated masterpieces in the classic humanities (Darwin, Hume, Locke...) (EHU)

  • es-eu corpora:

  1. 3 million words from Administration (official documents) (EHU)

  2. 12 million words from the translation memories of Elhuyar; domains: telecommunications, environment, finances, science and technology...

  3. 1 million words from Administration (IVAP)

  4. 300K words from journalism (EITB)

  5. 1 million words from popular journalism (Consumer)

  6. 4 million words from environment (IHOBE)

  • en-es corpora:

  1. 30 million words from the Europarl Corpus (European Parliament speech transcriptions) (UPC)

  • es-ca corpora:

  1. 10 million words from the bilingual edition of El Periódico (UPC)

LINGUISTIC PROCESSORS

TO BE UPDATED

  • Lemmatization for Basque (Alegria et al., 1996) (EHU)

  • Lemmatization for Spanish and English: We use the FreeLing Suite of Language Analyzers (Carreras et al., 2004), which may be downloaded at http://www.lsi.upc.es/~nlp/freeling/ (UPC)

  • PoS Tagging for Basque: Eustagger (Alegria et al., 2002) (EHU)

  • PoS Tagging: SVMTool (Giménez and Màrquez, 2004), which may be freely downloaded at http://www.lsi.upc.es/~nlp/SVMTool (UPC)

  • Shallow Parsing: We use the Phreco software (Carreras et al., 2005) (UPC)

  • Dependency analysis for Basque, using either of two paradigms, rule-based or statistical (Bengoetxea and Gojenola, 2008) (EHU)

  • Dependency analysis for Spanish and English: FreeLing (www.lsi.upc.edu/~nlp/freeling/)

  • Clause Splitting: We use the prototype for English developed by Carreras et al. (2005). Spanish and Catalan clause splitters are under development (UPC). For Basque a prototype is ready (EHU) (Alegria et al., 2008).

  • Monolingual terminology extraction (eu): Erauzterm (Alegria et al., 2004) (Elhuyar)

  • Bilingual terminology extraction (es-eu): Elexbi (Alegria et al., 2005) (Elhuyar)

  • Semantic Role Labelling: There is a prototype for English developed by Màrquez et al. (2005) (UPC)

  • Topic signatures for all WordNet nominal senses (Agirre et al., 2004c) (EHU)

  • Word Sense Disambiguation: We may use the all-words WSD system for English developed by Villarejo et al. (2004) (UPC) and others for Basque and Spanish (EHU)

MT ENGINES AND TOOLS

  • MT systems for Basque, using different technologies: http://ixa2.si.ehu.es/openmt-demo


  • OpenTrad systems (open source), developed by EHU, UPC, Elhuyar and other partners, for ca, gl, en, es and eu: www.opentrad.org

  • IQMT Framework for MT Evaluation: http://www.lsi.upc.edu/~nlp/IQMT/