TU Wien:Natural Language Processing and Information Extraction VU (Hanbury)


Data

Lecturers: Allan Hanbury, Gábor Recski, Florina Mihaela Piroi
ECTS: 3
Department: Information Systems Engineering
When: winter semester
Language: English
Links: tiss:194093, Mattermost channel
Curriculum assignments:
Master Data Science: elective module VAST/EX - Visual Analytics and Semantic Technologies - Extension


Content

An introduction to different aspects of Natural Language Processing, plus a bit of (computational) linguistics.

  • Regex
  • N-Grams
  • Probabilistic Language Models
  • POS-tagging, Hidden Markov Models (HMM) and Viterbi algorithm
  • Syntax
  • Semantics
  • Neural Networks for language processing
  • Information extraction from text
  • Annotation of language corpora
  • Evaluation of language processing systems
  • Summarization of documents
  • Question Answering and Chatbots
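
To give a feel for the n-gram and probabilistic language model topics, here is a minimal sketch (not taken from the course material) that counts n-grams and computes an unsmoothed maximum-likelihood bigram probability. The function names and the toy sentence are assumptions made for illustration only.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bigram_probability(tokens, prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word), no smoothing."""
    unigrams = ngram_counts(tokens, 1)
    bigrams = ngram_counts(tokens, 2)
    if unigrams[(prev_word,)] == 0:
        return 0.0  # prev_word never seen: no estimate possible without smoothing
    return bigrams[(prev_word, word)] / unigrams[(prev_word,)]


# Toy corpus (hypothetical): "the" occurs twice, once followed by "cat"
tokens = "the cat sat on the mat".split()
print(ngram_counts(tokens, 2)[("the", "cat")])
print(bigram_probability(tokens, "the", "cat"))
```

Real language models add smoothing and sentence boundary markers on top of counts like these.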

Course format

Weekly lectures, each mostly focused on one topic. Slides are provided, with occasional references to the main textbook that the class follows, as well as to papers and other books. There are three exercises (see the exercises section). There is no exam.

Required/recommended prior knowledge

Some interest in language(s) and in how to process text data using computers. Python skills and Jupyter notebooks (running on a JupyterHub at TU Wien). Basics of machine learning methods (classification) and experiment design (for the group project / system evaluation). Basic knowledge of neural networks is definitely helpful, as is basic knowledge of sequence models such as Hidden Markov Models (HMMs), but these are covered in the lectures as well. Basic PyTorch skills can also be helpful.

Lecture

Still open.

Exercises

  • Exercise 1 (individual): calculation of word frequencies and other statistics for a given text corpus, as well as n-gram counts and an implementation of the Viterbi algorithm for text processing. To be done in JupyterHub / Python.
  • Exercise 2 (individual):
    • Part 1: Word embeddings. TF-IDF scores, vector space embeddings (e.g. GloVe). Visualization of words in an embedding space (or rather, a projection of it). To be done in JupyterHub / Python.
    • Part 2: Neural networks (NN) for text processing. Loading data, wrapping it into appropriate loaders / iterators for use with an NN, preprocessing. Implement a feed-forward NN (FFNN) as well as an LSTM (Long Short-Term Memory) NN for text classification. To be done in JupyterHub / Python.
  • Exercise 3 (group work, in groups of 4): Choose a project from a list of topics (or bring your own, similar topic). Implement a language processing system that handles a certain task, e.g. automatic summarization of texts. The deliverables are a group presentation of the project (15 min.), a 2-page report (management summary), and the code.
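
For readers unfamiliar with the Viterbi algorithm mentioned in Exercise 1, the following is a minimal sketch of it for a toy HMM tagger. It is not the exercise template or its solution: the function signature, the two tags, and all probabilities are made up for illustration, and the code assumes a non-empty observation sequence.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state sequence, its probability) for an HMM."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability into s
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical toy model with two tags
states = ("NOUN", "VERB")
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"dogs": 0.6, "bark": 0.1},
          "VERB": {"dogs": 0.1, "bark": 0.9}}

path, prob = viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p)
print(path, prob)
```

For longer sequences, the products of probabilities underflow quickly, so implementations usually sum log-probabilities instead.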

The exercise points are weighted according to a defined scheme.

Note on Exercise 2, WS2020: When reading this, please keep in mind that this course was held for the first time in WS2020. Exercise 2 was rather disastrous. It was delayed, and the template for part 1 had a ridiculous number of errors: incorrect tests; file paths that the student had to replace but that were placed in non-editable cells; a required file that was too large for student accounts to upload, so the notebook could not be run remotely on JupyterHub; and a few others. All in all, it felt like 30% of the cells in the template had some kind of problem. Part 2 also had to be run on Google Colab, because it provides free GPU access.

All of this further increased the time needed to complete the exercise (which, in my opinion, was estimated far too low to begin with). Because of the delay, the deadline was extended by a few days. Grading and feedback, however, took more than a month and arrived only after the deadline of Exercise 3.

Exam, grading

There is no dedicated exam, just the exercises.

Time until the certificate is issued

Still open.

Time required

If you have experience with neural networks (or NLP in general), 3 ECTS may just about be realistic, though probably not. If you don't, expect considerably more than 3 ECTS worth of effort.

The official time estimates for the exercises are pretty unrealistic: 16 hours in total for Exercises 1 and 2, and 35 hours for the big group project that is Exercise 3.

Materials

Slides are provided in different formats (PDF, Jupyter notebooks / HTML).

Main textbook: Jurafsky & Martin, Speech and Language Processing. PDFs of the drafts of the 3rd edition are available at https://web.stanford.edu/~jurafsky/slp3/

Tips

  • For part 2 of Exercise 2, training the FFNN (and even more so the LSTM) was quite time-consuming on the JupyterHub: one training epoch took on the order of several minutes. It can help a lot to move to a machine with a GPU, if you happen to have one available and set up, or to run your notebook on e.g. Google Colab with a GPU. This can significantly speed up training and thus development. (It would be nice if the course provided this infrastructure, but it is not clear that this will happen.)
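
When switching between the Hub and a GPU machine or Colab, it helps if the notebook picks the device automatically. The following is a minimal sketch, assuming PyTorch is the framework in use (the helper name `pick_device` is made up here). It falls back to the CPU when no GPU is visible or PyTorch is not installed.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed: nothing to accelerate
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training would run on: {device}")
# In the notebook, the model and each batch are then moved with .to(device)
```

On Colab, a GPU runtime has to be selected in the runtime settings first; otherwise this still reports the CPU.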

Suggestions for improvement / criticism

Still open.

