Euskal RST Treebank

Short description: 
Basque RST relation bank and treebank
Contact: 
mikel.iruskieta(abildua)ehu.es
Description: 
The RST Basque Treebank was annotated at the subsentential level following
Tofiloski et al. (2009), using the extended classification of
discourse relations of Rhetorical Structure Theory (RST) by
Mann and Thompson (1988). The annotated corpus contains 60 abstract
texts from three different domains: medical, terminological and
scientific. RSTTool (O'Donnell 2000), an annotation interface for RST,
was used to annotate this corpus, and RhetDataBase was used to annotate
the signals of the rhetorical relations.
On this website the user can look up:
  1. all the occurrences of any relation in the corpus,
  2. the relations of a chosen text,
  3. the linear segmentation of a text,
  4. the rhetorical relations that are linked to the central unit in
    the discourse structure,
  5. the signals of the rhetorical relations, and
  6. any information in the corpus based on part of speech.
Functionality: 
We plan to use this corpus for:
  1. The automatic detection of rhetorical relations.
  2. Discourse segmentation (EusEduSeg).
  3. Automatic summarization.
  4. Sentiment analysis.
  5. Question answering.
Technology: 
The programs used are: RSTTool, Rhetorical DataBase, and several programs developed by the IXA group.
Innovation: 
This is the first corpus annotated in Basque under Rhetorical Structure Theory (RST).
Development: 
The corpus is still under development.
Examples: 
Figure 1: http://ixa2.si.ehu.es/diskurtsoa/diskurtsoa_jpg/GMB0401-GS.jpg

In Figure 1 (http://ixa2.si.ehu.es/diskurtsoa/diskurtsoa_jpg/GMB0401-GS.jpg),
units below straight vertical lines represent the nuclei
of hypotactic relations (2-2, 2-3, 7-7, 6-7, 6-10, 2-10 and 9-10) while
those units found underneath diagonal lines are the nuclei of
paratactic relations (4-4, 5-5, 9-9, and 10-10). Other elements are
satellites of hypotactic relations (1-1, 2-5, 3-3, 4-5, 6-6, 8-8, and
8-10). The span which covers the entire text (1-10) cannot be related
to any other span, and consequently, has no nuclearity.

Relations between segments are represented using arrows extending from
the satellite towards the nucleus; for example, the BACKGROUND relation
connects satellite segment 2-5 to its nucleus, 6-10.
As such, annotators interpret which units are most important for
understanding the text. The main concept, that is, the most
important unit of the tree structure (Mann and Thompson 1987), is
found under a straight vertical line if it belongs to a hypotactic
relation, or under diagonal lines if it belongs to a paratactic relation.
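
The satellite-to-nucleus arrows described above are written on this page in a bracket notation, for example [2-5 > 6-10] for a satellite pointing at its nucleus and [4-4 = 5-5] for the two nuclei of a paratactic relation. The following Python sketch shows one way such notation could be read into a small data structure. It is only an illustration: the names Span, Relation and parse_relation are hypothetical and are not part of RSTTool, RhetDataBase or the website, and only the two operators that appear intact on this page (">" and "=") are handled.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Span:
    """A span of discourse units, e.g. 2-5 (hypothetical helper type)."""
    start: int
    end: int


@dataclass(frozen=True)
class Relation:
    """A rhetorical relation with its nuclei and satellites (hypothetical)."""
    name: str
    nuclei: Tuple[Span, ...]
    satellites: Tuple[Span, ...]


# Matches strings such as "[2-5 > 6-10]" or "[4-4 = 5-5]".
_NOTATION = re.compile(r"\[\s*(\d+)-(\d+)\s*([>=])\s*(\d+)-(\d+)\s*\]")


def parse_relation(name: str, notation: str) -> Relation:
    """Read the bracket notation used on this page.

    Assumption: "A > B" means A is the satellite and B the nucleus
    (hypotactic relation); "A = B" means both are nuclei (paratactic).
    """
    match = _NOTATION.fullmatch(notation.strip())
    if match is None:
        raise ValueError(f"unrecognised relation notation: {notation!r}")
    a, b, op, c, d = match.groups()
    left, right = Span(int(a), int(b)), Span(int(c), int(d))
    if op == ">":
        return Relation(name, nuclei=(right,), satellites=(left,))
    return Relation(name, nuclei=(left, right), satellites=())


background = parse_relation("BACKGROUND", "[2-5 > 6-10]")
conjunction = parse_relation("CONJUNCTION", "[4-4 = 5-5]")
print(background)
print(conjunction)
```

Keeping nuclei and satellites as separate tuples lets the same type describe both hypotactic relations (one nucleus, one satellite) and paratactic relations (several nuclei, no satellite).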

In our example, unit 7-7 is the main unit of the rhetorical
structure. There are eighteen cases of nuclearity in this example:

  1. eight units function as satellites: 1-1, 2-5, 3-3, 4-5, 6-6,
    8-8, 8-10 and 10-10,
  2. and the other ten units function as nuclei: 2-2, 2-3, 4-4,
    5-5, 7-7, 6-7, 6-10, 2-10, 9-9 and 9-10.
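
The count above can be checked mechanically. The Python sketch below copies the two lists exactly as they are given in this enumeration and tallies them. The helper names parse_span and tally_nuclearity are hypothetical and serve only this illustration.

```python
# Span labels copied from the enumeration above.
SATELLITES = ["1-1", "2-5", "3-3", "4-5", "6-6", "8-8", "8-10", "10-10"]
NUCLEI = ["2-2", "2-3", "4-4", "5-5", "7-7", "6-7", "6-10", "2-10", "9-9", "9-10"]


def parse_span(label):
    """Turn a label such as "8-10" into the tuple (8, 10)."""
    start, end = label.split("-")
    return int(start), int(end)


def tally_nuclearity(satellites, nuclei):
    """Count satellites and nuclei, refusing spans listed in both roles."""
    sat = {parse_span(s) for s in satellites}
    nuc = {parse_span(n) for n in nuclei}
    both = sat & nuc
    if both:
        raise ValueError(f"spans listed as both nucleus and satellite: {sorted(both)}")
    return {"satellites": len(sat), "nuclei": len(nuc), "total": len(sat) + len(nuc)}


counts = tally_nuclearity(SATELLITES, NUCLEI)
print(counts)  # {'satellites': 8, 'nuclei': 10, 'total': 18}

# The span covering the whole text (1-10) has no nuclearity, so it is in neither list.
root_has_role = "1-10" in SATELLITES or "1-10" in NUCLEI
print(root_has_role)  # False
```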

In this example, the annotator interpreted the rhetorical relations
presented in Figure 1 as follows:
  1. PREPARATION for the article, by means of the title ([1-1 >
    2-10]);
  2. laying out the BACKGROUND of the issue to be considered: the
    profile of users using the emergency services ([2-5 > 6-10]);
  3. demonstrating why the study is interesting using the
    MOTIVATION relation ([6-6 > 7-7]), and
  4. highlighting the RESULTS ([6-7

Within the BACKGROUND relation there are three other relations
explaining how the number of urgent medical visits has risen: two
ELABORATIONS ([2-2 CONJUNCTION relation ([4-4 = 5-5]).
Similarly, the RESULT relation subsumes the PREPARATION relation ([8-8
> 9-10]) and the ELABORATION relation ([9-9



REFERENCES:

Mann, W. C. and Thompson, S. A. 1987. Rhetorical Structure Theory: A Theory of Text Organization. Text 8(3):243-281.
Ownership: 
Ixa group
Notes: 
Contact: mikel.iruskieta[abildua]ehu.es