ZT Corpusa

Short description: 
Morphosyntactically-tagged Science and Technology corpus.
Authors (non-IXA members): 
Nerea Areta, Antton Gurrutxaga, Igor Leturia
Contact: 
xabier.artola[abildua|at]@ehu.es
Description: 
The ZT Corpus (Basque Corpus of Science and Technology) is a tagged collection of specialised texts in Basque, which aims to be a major resource in research and development with respect to written technical Basque: terminology, syntax and style.
It is composed of two parts: a 1.6 million-word balanced part, whose annotation has been revised by hand, and a 6 million-word part that has been tagged automatically.
We built new tools to help in building the ZT Corpus (ZTC): tools for corpus compilation and corpus annotation, and a specific interface for advanced queries.
It was released in December 2006 and has been developed by Elhuyar and the Ixa NLP Research Group (On-line consultation: http://www.ZTcorpusa.eus).
The ZT Corpus stands out among other Basque corpora for several reasons: it is the first specialised corpus in Basque; it has been designed to be a methodological and functional reference for new projects in the future (i.e. a national corpus for Basque); it is the first corpus in Basque annotated using a TEI-P4 compliant XML format; it is the first written corpus in Basque to be distributed by ELDA; and it has a user-friendly yet sophisticated query interface. The corpus has two kinds of annotation: a structural annotation and a stand-off linguistic annotation.
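To illustrate what stand-off annotation means in general, the following minimal sketch keeps the text and its linguistic analysis in two separate XML fragments and joins them by identifier. The element and attribute names (`w`, `ana`, `id`, `target`, `lemma`, `pos`), the tag values, and the helper `join_standoff` are all hypothetical and chosen for illustration; they are not the actual ZT Corpus schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical structural layer: the text itself, with one id per word.
TEXT_XML = """<text>
  <w id="w1">iturriak</w>
  <w id="w2">aztertu</w>
</text>"""

# Hypothetical stand-off layer: analyses that point back at word ids.
ANNOTATION_XML = """<annotations>
  <ana target="w1" lemma="iturri" pos="NOUN"/>
  <ana target="w2" lemma="aztertu" pos="VERB"/>
</annotations>"""


def join_standoff(text_xml, annotation_xml):
    """Return (word form, lemma, POS) triples by resolving each target id."""
    words = {w.get("id"): w.text for w in ET.fromstring(text_xml).iter("w")}
    triples = []
    for ana in ET.fromstring(annotation_xml).iter("ana"):
        triples.append((words[ana.get("target")], ana.get("lemma"), ana.get("pos")))
    return triples


if __name__ == "__main__":
    for triple in join_standoff(TEXT_XML, ANNOTATION_XML):
        print(triple)
```

Keeping the analysis outside the text means the linguistic layer can be revised or replaced without touching the structural markup.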
Functionality: 
The ZT Corpus can be queried online through a web interface. The interface is easy to use in its normal mode, and it offers more sophisticated options in its advanced mode. Many of the query options of the ZT Corpus are new in Basque corpora.
The user can query for up to three words, which can be at a distance of up to four words, either forwards or backwards, from each other. For each word, the user can choose to query for the lemma or for a specific word form, and can ask for the complete word, its beginning, or its ending. Optionally, the query for a word can be restricted to a particular part of speech (POS); in a multi-word query, it is also possible to ask for the POS alone.
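The query options described above can be pictured with a small sketch. It is not the implementation behind the web interface: the class and function names (`Token`, `Term`, `search`), the POS labels, and the toy sentence are invented for illustration, and the sketch assumes that "distance" means each word lies within four positions of the previous query word.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_TERMS = 3      # up to three words per query
MAX_DISTANCE = 4   # up to four words apart, forwards or backwards


@dataclass
class Token:
    form: str
    lemma: str
    pos: str


@dataclass
class Term:
    text: Optional[str] = None   # None means "POS alone"
    field: str = "lemma"         # "lemma" or "form"
    mode: str = "complete"       # "complete", "beginning" or "ending"
    pos: Optional[str] = None    # optional POS restriction

    def matches(self, tok: Token) -> bool:
        if self.pos is not None and tok.pos != self.pos:
            return False
        if self.text is None:
            return self.pos is not None
        value = (tok.lemma if self.field == "lemma" else tok.form).lower()
        text = self.text.lower()
        if self.mode == "beginning":
            return value.startswith(text)
        if self.mode == "ending":
            return value.endswith(text)
        return value == text


def search(tokens: List[Token], terms: List[Term],
           max_distance: int = MAX_DISTANCE) -> List[Tuple[int, ...]]:
    """Return one tuple of token indices per hit, in query-term order."""
    if not 1 <= len(terms) <= MAX_TERMS:
        raise ValueError("a query takes between one and three terms")
    if not 1 <= max_distance <= MAX_DISTANCE:
        raise ValueError("distance must be between one and four words")
    hits: List[Tuple[int, ...]] = []

    def extend(chosen: List[int]) -> None:
        if len(chosen) == len(terms):
            hits.append(tuple(chosen))
            return
        term = terms[len(chosen)]
        low = max(0, chosen[-1] - max_distance)
        high = min(len(tokens) - 1, chosen[-1] + max_distance)
        for i in range(low, high + 1):
            if i not in chosen and term.matches(tokens[i]):
                extend(chosen + [i])

    for i, tok in enumerate(tokens):
        if terms[0].matches(tok):
            extend([i])
    return hits


# Toy tagged sentence; the tags are invented for this example.
TOKENS = [
    Token("energia", "energia", "NOUN"),
    Token("berriztagarrien", "berriztagarri", "ADJ"),
    Token("iturriak", "iturri", "NOUN"),
    Token("aztertu", "aztertu", "VERB"),
    Token("dira", "izan", "AUX"),
]

if __name__ == "__main__":
    # Lemma "iturri" with any adjective within four words of it.
    print(search(TOKENS, [Term(text="iturri"), Term(pos="ADJ")]))
    # Word forms beginning with "energ".
    print(search(TOKENS, [Term(text="energ", field="form", mode="beginning")]))
```

In this sketch the first query pairs the noun at position 2 with the adjective one word before it, which shows the backwards direction of the distance window.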
Technology: 
XML, XSLT, finite-state morphology.
Modules: 
Corpusgile, Eustagger, Eulia.
Notes: 
There are 8.5 million words, of which 1.9 million words have been revised by hand.
