Resources and tools


NLP resources

In addition to the freely available and open source resources and tools mentioned above, the groups have developed a long list of resources and tools that will be very useful for the project, some of them as a result of the OpenMT project.


  • en-eu corpora:

  1. 3 million words from software manuals (Linux, OpenOffice, Office, Windows, ...) (Elhuyar). The open source manuals (Linux and OpenOffice) can easily be extended to Spanish and Catalan.

  2. 2 million words from recently translated masterpieces in the classic humanities (Darwin, Hume, Locke...) (EHU)

  • es-eu corpora:

  1. 3 million words from Administration (official documents) (EHU)

  2. 12 million words from the translation memories of Elhuyar; domains: telecommunications, environment, finances, science and technology...

  3. 1 million words from Administration (IVAP)

  4. 300K words from journalism (EITB)

  5. 1 million words from popular journalism (Consumer)

  6. 4 million words from environment (IHOBE)

  • en-es corpora:

  1. 30 million words from the Europarl Corpus (European Parliament speech transcriptions) (UPC)

  • es-ca corpora:

  1. 10 million words from the bilingual edition of El Periódico (UPC)



  • Lemmatization for Basque (Alegria et al., 1996) (EHU)

  • Lemmatization for Spanish and English: We use the FreeLing suite of language analyzers (Carreras et al., 2004), which is available for download (UPC)

  • PoS Tagging for Basque: Eustagger (Alegria et al., 2002) (EHU)

  • PoS Tagging: SVMTool (Giménez and Màrquez, 2004), which is freely available for download (UPC)

  • Shallow Parsing: We use the Phreco software (Carreras et al., 2005) (UPC)

  • Dependency analysis for Basque, using either of two paradigms, rule-based or statistical (Bengoetxea and Gojenola, 2008) (EHU)

  • Dependency analysis for Spanish and English: FreeLing

  • Clause Splitting: We use the prototype for English developed by Carreras et al. (2005). Spanish and Catalan clause splitters are under development (UPC). For Basque, a prototype is ready (Alegria et al., 2008) (EHU).

  • Monolingual terminology extraction (eu): Erauzterm (Alegria et al., 2004) (Elhuyar)

  • Bilingual terminology extraction (es-eu): Elexbi (Alegria et al., 2005) (Elhuyar)

  • Semantic Role Labelling: There is a prototype for English developed by Màrquez et al. (2005) (UPC)

  • Topic signatures for all WordNet nominal senses (Agirre et al., 2004c) (EHU)

  • Word Sense Disambiguation: We may use the all-words WSD system for English developed by Villarejo et al. (2004) (UPC) and others for Basque and Spanish (EHU)


  • MT systems for Basque, using different technologies:

  • OpenTrad systems (open source) for ca, gl, en, es and eu, developed by EHU, UPC, Elhuyar and other partners

  • IQMT Framework for MT Evaluation