AnCora

The AnCora corpus brings together tagged texts in Basque (EPEC-EU), Spanish (AnCora-CAS) and Catalan (AnCora-CAT). The Spanish and Catalan texts include about 500K words; the Basque texts comprise around 155K words.
The tagged Basque AnCora corpus is based on EPEC (Euskararen Prozesamendurako Erreferentzia Corpusa, the Reference Corpus for the Processing of Basque). One third of the data was obtained from the Statistical Corpus of 20th Century Basque (www.euskaracorpusa.net); the other two thirds were extracted from the Basque newspaper Euskaldunon Egunkaria. We selected 155K words to build the AnCora corpus. These are part of the syntactically tagged corpus developed under the CESS-ECE project (HUM2004-21127-E). The texts were then morphologically tagged and transformed from a dependency model to a constituent model. 25% of the corpus is comparable to the Catalan and Spanish corpora, as it consists of news from the same periods.

Publications (theses):

Dependentzia-ereduan oinarritutako baliabide sintaktikoak: zuhaitz-bankua eta gramatika konputazionala

Publications (articles):

Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing

Corpusen etiketatze linguistikoa

Construcción de un corpus etiquetado sintácticamente para el euskera

Construction of a Basque Dependency Treebank

Abar-Hitz: An Annotation Tool for the Basque Dependency Treebank

3LB: Construcción de una base de árboles sintáctico-semánticos para el catalán, euskera y castellano

Theoretical and Methodological issues of tagging Noun Phrase Structures following Dependency Grammar Formalism

From Dependencies to Constituents in the Reference Corpus for the Processing of Basque

Evaluation of the Syntactic Annotation in EPEC, the Reference Corpus for the Processing of Basque