OpenMT project



NLP resources

In addition to the freely available, open-source resources and tools mentioned above, the groups have developed a long list of resources and tools that will be very useful for the project, some of them as a result of the OpenMT project itself.

BILINGUAL CORPORA:

en-eu corpora:
  3 million words from software manuals (Linux, OpenOffice, Office, Windows, ...) (Elhuyar). The open-source portions (Linux and OpenOffice) can easily be extended to Spanish and Catalan.
  2 million words from recently translated masterpieces in the classic humanities (Darwin, Hume, Locke...) (EHU)

es-eu corpora:
  3 million words from administration (official documents) (EHU)
  12 million words from the translation memories of Elhuyar; domains: telecommunications, environment, finances, science and technology...
  1 million words from administration (IVAP)
  300K words from journalism (EITB)
  1 million words from popular journalism (Consumer)
  4 million words from environment (IHOBE)

en-es corpora:
  30 million words from the Europarl Corpus (European Parliament speech transcriptions) (UPC)

es-ca corpora:
  10 million words from the bilingual edition of El Periódico (UPC)

LINGUISTIC PROCESSORS

Lemmatization for Basque (Alegria et al., 1996) (EHU)
Lemmatization for Spanish and English: We use the FreeLing suite of language analyzers (Carreras et al., 2004), which may be downloaded at http://www.lsi.upc.es/~nlp/freeling/ (UPC)
PoS Tagging for Basque: Eustagger (Alegria et al., 2002) (EHU)
PoS Tagging: SVMTool (Giménez and Màrquez, 2004), which may be freely downloaded at http://www.lsi.upc.es/~nlp/SVMTool (UPC)
Shallow Parsing: We use the Phreco software (Carreras et al., 2005) (UPC)
Dependency analysis for Basque, using either of two paradigms, rule-based or statistical (Bengoetxea & Gojenola, 2008) (EHU)
Dependency analysis for Spanish and English: FreeLing (www.lsi.upc.edu/~nlp/freeling/)
Clause Splitting: We use the prototype for English developed by Carreras et al. (2005). Spanish and Catalan clause splitters are under development (UPC). For Basque, a prototype is ready (EHU) (Alegria et al., 2008).
Monolingual terminology extraction (eu): Erauzterm (Alegria et al., 2004) (Elhuyar)
Extraction of bilingual terminology (es-eu): Elexbi (Alegria et al., 2005) (Elhuyar)
Semantic Role Labelling: There is a prototype for English developed by Màrquez et al. (2005) (UPC)
Topic signatures for all WordNet nominal senses (Agirre et al., 2004c) (EHU)
Word Sense Disambiguation: We may use the all-words WSD system for English developed by Villarejo et al. (2004) (UPC) and others for Basque and Spanish (EHU)

MT ENGINES AND TOOLS

MT systems for Basque, using different technologies: http://ixa2.si.ehu.es/openmt-demo
OpenTrad systems (open source), developed by EHU, UPC, Elhuyar and other partners, for ca, gl, en, es, eu: www.opentrad.org
IQMT Framework for MT Evaluation: http://www.lsi.upc.edu/~nlp/IQMT/
Website repository of Translation Memories: http://beam.to/tm-m
