Statistical Methods and Corpus Linguistics

The course is organized as two separate modules:

i) An introduction to the basic statistical principles and techniques used in NLP. In this part we will cover the main topics of descriptive and inferential statistics. We will also introduce machine learning, including basic data processing and the main learning algorithms.

ii) An introduction to corpus linguistics. We will start with a brief introduction to textual corpora, including linguistic annotation and representation schemas. We will then turn to extracting relevant information from corpora, for example collocations and keywords, using statistical and distributional techniques. Finally, we will learn the XML markup language. Throughout the module we will introduce several corpora in various languages (English, Spanish, Basque, etc.).

1. Basic statistical measures: Mean, Standard deviation, Chi-square, Mutual information, Kappa, etc.
2. Introduction to hypothesis testing: Independence test, McNemar test.
3. Introduction to Machine Learning in NLP
4. Basic Machine Learning algorithms in WEKA: Naive Bayes, K-NN, Decision trees, Rules
5. Evaluation in supervised learning
6. Introduction to Corpus Linguistics
7. Corpus characteristics and types
   - Corpus examples
8. Corpus annotation
   - Usual marks and analysis levels
   - Standards for linguistic representation (TEI, NAF, AWA)
9. XML

- Laboratories on:
  - Calculation of statistics
  - Classification tasks
  - Unix tools
  - Word frequencies and Zipf's law
  - Collocations
  - Keyword extraction
  - XML and XPath

6 D
1st semester

