AnCora

Short description: 
AnCora consists of a Basque corpus (EPEC-EU), a Spanish corpus (AnCora-CAS) and a Catalan corpus (AnCora-CAT).
Contact: 
izaskun.aldezabal[abildua/at]ehu.es
Description: 
The AnCora corpus brings together tagged texts in Basque (EPEC-EU), Spanish (AnCora-CAS) and Catalan (AnCora-CAT). The Spanish and Catalan texts include about 500K words. The Basque texts comprise around 155K words.
The basis for the tagged Basque AnCora corpus is EPEC (Euskararen Prozesamendurako Erreferentzia Corpusa, the Reference Corpus for the Processing of Basque). A third of the data was obtained from the Statistical Corpus of 20th Century Basque (www.euskaracorpusa.net); the other two thirds were extracted from the Basque newspaper Euskaldunon Egunkaria. We selected 155K words to build the AnCora corpus. These are part of the syntactically tagged corpus developed under the CESS-ECE project (HUM2004-21127-E). The texts were then morphologically tagged and transformed from a dependency model to a constituent model. 25% of the corpus is comparable to the Catalan and Spanish corpora, as it consists of news from the same periods.