TU Wien:Natural Language Processing and Information Extraction VU (Hanbury)


Data

Lecturers: Allan Hanbury, Florina Piroi, Gábor Recski
ECTS: 3.0
Last held: 2023W
Language: English
Mattermost: natural-language-processing-and-information-extraction
Links: tiss:194093
Assigned to: Master's programme Data Science, module VAST/EX - Visual Analytics and Semantic Technologies - Extension


Content

Introduction to various aspects of Natural Language Processing and a bit of (computational) linguistics.

  • Regex
  • N-Grams
  • Probabilistic Language Models
  • POS-tagging, Hidden Markov Models (HMM) and Viterbi algorithm
  • Syntax
  • Semantics
  • Neural Networks for language processing
  • Information extraction from text
  • Annotation of language corpora
  • Evaluation of language processing systems
  • Summarization of documents
  • Question Answering and Chatbots

Course Structure

Weekly lectures, each mostly focused on one topic. Slides are provided, with occasional references to a main textbook that the class follows, as well as to papers and other books. 2023W: one exercise. Previous years: three exercises; see the dedicated section below. There is no exam.

Required/Recommended Prior Knowledge

Some interest in language(s) and in processing text data with computers. Python skills and Jupyter notebooks (running on a JupyterHub at TU Wien). Basics of machine learning methods (classification) and of experiment design (for the group project / system evaluation). Basic knowledge of neural networks is definitely helpful, as is basic knowledge of sequence models (such as Hidden Markov Models, HMMs) - but these are of course also covered in the lectures. Basic PyTorch skills can also be helpful.

Lectures

still open

Exercises

2023W: Only one exercise for the whole course, done in groups of 4. The exercise has three deadlines (Milestone 1, Milestone 2, Final Solution) and a 20-minute presentation.

The grading is done by Varvara and is very generous: you can get the highest grade without following the instructions perfectly, as long as your work is neat and more or less covers the important requirements.

It's good if at least one group member has solid deep learning skills (PyTorch, Keras, etc.), because a DL implementation is required for the exercise.


Previous years:

  • Exercise 1 (individual): calculation of word frequencies and other figures and statistics for a given text corpus, as well as n-gram counts and an implementation of the Viterbi algorithm for text processing. To be done in JupyterHub / Python. (See the n-gram / Viterbi sketch after this list.)
  • Exercise 2 (individual):
    • Part 1: Word embeddings. TF-IDF scores, vector space embeddings (e.g. GloVe). Visualization of words in an embedding space (or rather, a projection of it). To be done in JupyterHub / Python. (See the TF-IDF sketch after this list.)
    • Part 2: Neural networks (NN) for text processing. Loading data, wrapping it into appropriate loaders / iterators for use with NNs, preprocessing. Implementation of a feed-forward NN (FFNN) as well as an LSTM (Long Short-Term Memory) NN for text classification. To be done in JupyterHub / Python. (See the LSTM sketch after this list.)
  • Exercise 3 (group work, in groups of 4): Choose a project from a list of topics (or bring your own, similar topic). Implement a language processing system that handles a certain task, e.g. automatic summarization of texts. The deliverables are a group presentation of the project (15 min.), a 2-page report (management summary), and the code.
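
A minimal sketch of the kind of code Exercise 1 asked for, assuming a plain-text corpus and an HMM whose parameters are given in log space; the toy data and all names are illustrative, not the actual assignment template:

  import math
  from collections import Counter

  def ngrams(tokens, n):
      """All n-grams of a token list, as tuples."""
      return list(zip(*(tokens[i:] for i in range(n))))

  # Toy corpus; the exercise provided a real corpus instead.
  tokens = "the cat sat on the mat the cat slept".split()
  word_freq = Counter(tokens)               # word frequencies
  bigram_freq = Counter(ngrams(tokens, 2))  # n-gram counts (here: bigrams)

  def viterbi(obs, states, log_start, log_trans, log_emit):
      """Most likely HMM state sequence for obs; parameters in log space."""
      V = [{s: log_start[s] + log_emit[s].get(obs[0], -math.inf) for s in states}]
      back = []
      for o in obs[1:]:
          col, ptr = {}, {}
          for s in states:
              prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
              col[s] = V[-1][prev] + log_trans[prev][s] + log_emit[s].get(o, -math.inf)
              ptr[s] = prev
          V.append(col)
          back.append(ptr)
      path = [max(V[-1], key=V[-1].get)]
      for ptr in reversed(back):
          path.append(ptr[path[-1]])
      return path[::-1]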
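
A minimal sketch for Exercise 2, Part 1, assuming scikit-learn and matplotlib are available on the Hub; the exercise visualized pretrained word embeddings such as GloVe, while here a PCA projection of TF-IDF document vectors stands in to keep the example self-contained:

  import matplotlib.pyplot as plt
  from sklearn.decomposition import PCA
  from sklearn.feature_extraction.text import TfidfVectorizer

  docs = ["the cat sat on the mat",
          "dogs and cats are pets",
          "stock markets fell sharply today"]

  # TF-IDF scores for each document.
  X = TfidfVectorizer().fit_transform(docs).toarray()

  # 2-D projection of the vector space for plotting.
  coords = PCA(n_components=2).fit_transform(X)
  for (x, y), label in zip(coords, ["doc0", "doc1", "doc2"]):
      plt.scatter(x, y)
      plt.annotate(label, (x, y))
  plt.show()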
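
A minimal PyTorch sketch for Exercise 2, Part 2 (the LSTM variant; a FFNN classifier differs only in the model class), with dummy data and illustrative hyperparameters standing in for the actual assignment:

  import torch
  import torch.nn as nn
  from torch.utils.data import DataLoader, TensorDataset

  class LSTMClassifier(nn.Module):
      def __init__(self, vocab_size, embed_dim=64, hidden_dim=64, num_classes=2):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
          self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
          self.out = nn.Linear(hidden_dim, num_classes)

      def forward(self, token_ids):          # (batch, seq_len) integer ids
          embedded = self.embed(token_ids)   # (batch, seq_len, embed_dim)
          _, (h_n, _) = self.lstm(embedded)  # h_n: (1, batch, hidden_dim)
          return self.out(h_n[-1])           # (batch, num_classes) logits

  # Dummy data standing in for the preprocessed, tokenized corpus.
  X = torch.randint(1, 100, (32, 10))   # 32 sequences of length 10
  y = torch.randint(0, 2, (32,))        # binary labels
  loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=True)

  model = LSTMClassifier(vocab_size=100)
  optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
  loss_fn = nn.CrossEntropyLoss()
  for epoch in range(3):
      for xb, yb in loader:
          optimizer.zero_grad()
          loss = loss_fn(model(xb), yb)
          loss.backward()
          optimizer.step()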

The exercise points are weighted according to a defined scheme.

Note on Exercise 2, WS2020 (when reading this, please keep in mind that this course was held for the first time in WS2020): Exercise 2 was rather disastrous. It was delayed, and the template for part 1 had a ridiculous number of errors: incorrect tests, file paths that needed to be replaced by the student but were put in non-editable cells, a required file too large for student accounts to upload (so the notebook could not be run remotely on JupyterHub), and a few others. All in all, it felt like 30% of the cells in the template had some kind of problem. Part 2 also had to be run on Google Colab (because it provides free GPU access).

All of this further increased the time needed to complete the exercise (which, imo, was estimated way too low to begin with). Due to the delay, the deadline was extended by a few days; however, grading and feedback took more than a month (they arrived after the deadline of Exercise 3).

Exam, Grading

There is no dedicated exam, just the exercises.

Time until the Certificate Is Issued

still open

Time Required

2023W:

The workload is probably more than 3 ECTS unless you are super efficient, focus on the bare minimum, and know exactly what to do.

Previous years:

If you have experience with neural networks (or NLP in general), 3 ECTS may just be realistic (probably not, though); if you don't, expect quite a bit more than 3 ECTS worth of effort.

The time estimates for the exercises are pretty unrealistic (16 hours in total for Exercises 1 and 2, 35 hours for the big group project that is Exercise 3).

Materials

Slides are provided in different formats (PDF, Jupyter notebooks / HTML).

Main textbook: Jurafsky & Martin, Speech and Language Processing; PDFs of the 3rd-edition drafts are available at https://web.stanford.edu/~jurafsky/slp3/

Tips

2023W:

The workload is badly distributed among the milestones: in theory it is 8 hours of work for Milestone 1, 8 hours for Milestone 2, and 35 hours for the final solution (per student). In practice, Milestone 1 can take a bit longer, and Milestone 2 takes very long, probably longer than the final solution. The upside is that if you plan well, you can easily reuse most of your Milestone 2 work for the final solution. There is no distinction in your presentation between which work corresponds to which milestone; you just present everything you have done. The final grading is based on the GitHub repository, which does not have to be explicitly structured into Milestone 1, Milestone 2, and final solution.

Previous years:

  • For the 2nd part of Exercise 2, training the FFNN (and even more so the LSTM NN) was quite time-consuming on the Hub, with one epoch of training taking on the order of several minutes. It can be very helpful to move to a machine with a GPU, if you happen to have one available (and set up), or to run your notebook / code on e.g. Google Colab (using a GPU). This can significantly speed up training and thus the development process; see the sketch below. (It would be nice if they could provide this infrastructure, but that probably isn't going to happen.)
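
A minimal sketch of the standard PyTorch device pattern behind this tip (model and loader as in the LSTM sketch above; the pattern itself is generic):

  import torch

  # Use a GPU when one is available (e.g. on Colab), otherwise fall back to CPU.
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model.to(device)                # move the network's parameters once
  for xb, yb in loader:
      xb, yb = xb.to(device), yb.to(device)  # move each batch in the loop
      logits = model(xb)          # the forward pass now runs on the GPU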

Highlights / Praise

still open

Suggestions for Improvement / Criticism

The workload is badly distributed; the lectures are very good; sadly, the grading adds no educational value. You don't get much useful feedback overall.