NLP resources


In addition to the freely available, open-source resources and tools for MT, the groups have developed a long list of resources and tools that will be very useful for the project, some of them as a result of the OpenMT project.

BILINGUAL CORPORA:

  • en-eu corpora:

  1. 3 million words from software manuals (Linux, OpenOffice, Office, Windows, ...) (Elhuyar)

The open-source portions (Linux and OpenOffice) can easily be extended to Spanish and Catalan.

  2. 2 million words from recently translated masterpieces in the classic humanities (Darwin, Hume, Locke...) (EHU)

  • es-eu corpora:

  1. 3 million words from Administration (official documents) (EHU)

  2. 12 million words from the translation memories of Elhuyar; domains: telecommunications, environment, finances, science and technology...

  3. 1 million words from Administration (IVAP)

  4. 300K words from journalism (EITB)

  5. 1 million words from popular journalism (Consumer)

  6. 4 million words from environment (IHOBE)

  • en-es corpora:

  1. 30 million words from Europarl Corpus (European Parliament speech transcriptions) (UPC)

  • es-ca corpora:

  1. 10 million words from the bilingual edition of El Periódico (UPC)

LINGUISTIC PROCESSORS

(For more information, see the Products section of the Ixa Group website.)

  • Lemmatization for Basque (Alegria et al., 1996) (EHU)

  • Lemmatization for Spanish and English: We use the Freeling Suite of Language Analyzers (Carreras et al., 2004), which may be downloaded at http://www.lsi.upc.es/~nlp/freeling/ (UPC)

  • PoS Tagging for Basque: Eustagger (Alegria et al., 2002) (EHU)

  • PoS Tagging: SVMTool (Giménez and Màrquez, 2004), which may be freely downloaded at http://www.lsi.upc.es/~nlp/SVMTool (UPC)

  • Shallow Parsing: We use the Phreco software (Carreras et al., 2005) (UPC)

  • Dependency analysis for Basque, using either of two paradigms, rule-based or statistical (Bengoetxea & Gojenola, 2008) (EHU)

  • Dependency analysis for Spanish and English: FreeLing (www.lsi.upc.edu/~nlp/freeling/)

  • Clause Splitting: We use the prototype for English developed by Carreras et al. (2005). Spanish and Catalan clause splitters are under development (UPC). For Basque a prototype is ready (EHU) (Alegria et al., 2008).

  • Monolingual terminology extraction (eu): Erauzterm (Alegria et al., 2004) (Elhuyar)

  • Bilingual terminology extraction (es-eu): Elexbi (Alegria et al., 2005) (Elhuyar)

  • Semantic Role Labelling: There is a prototype for English developed by Màrquez et al. (2005) (UPC)

  • Topic signatures for all WordNet nominal senses (Agirre et al., 2004c) (EHU)

  • Word Sense Disambiguation: We may use the all-words WSD system for English developed by Villarejo et al. (2004) (UPC) and others for Basque and Spanish (EHU)

MT ENGINES AND TOOLS

  • MT systems for Basque, using different technologies: http://ixa2.si.ehu.es/openmt-demo

  • OpenTrad systems (open source), developed by EHU, UPC, Elhuyar and other partners, for ca, gl, en, es and eu: www.opentrad.org

  • IQMT Framework for MT Evaluation: http://www.lsi.upc.edu/~nlp/IQMT/

  • Asiya open-source system for MT evaluation: http://www.lsi.upc.edu/~nlp/Asiya/