TU Wien:Natural Language Processing and Information Extraction VU (Hanbury)


Data

Lecturers Allan Hanbury, Gábor Recski, Florina Mihaela Piroi
ECTS 3
Department Information Systems Engineering
When winter semester
Language English
Links tiss:194093, Mattermost-Channel
Curriculum assignments
Master Data Science: elective module VAST/EX - Visual Analytics and Semantic Technologies - Extension


Content

Introduction to different aspects of Natural Language Processing and a bit of (computational) linguistics.

  • Regex
  • N-grams
  • Probabilistic Language Models
  • POS-tagging, Hidden Markov Models (HMM) and Viterbi algorithm
  • Syntax
  • Semantics
  • Neural Networks for language processing
  • Information extraction from text
  • Annotation of language corpora
  • Evaluation of language processing systems
  • Summarization of documents
  • Question Answering and Chatbots

Course structure

Weekly lectures, each mostly focused on one topic. Slides are provided, with occasional references to the main textbook the class follows, as well as to papers and other books. There are 3 exercises (see the Exercises section below). There is no exam.

Required/Recommended prior knowledge

Some interest in language(s) and in how to process text data with computers. Python skills and familiarity with Jupyter notebooks (running on a JupyterHub at TU Wien). Basics of machine learning methods (classification) and of experiment design (for the group project / system evaluation). Basic knowledge of neural networks is definitely helpful, as is basic knowledge of sequence models (such as Hidden Markov Models, HMMs), but these are of course covered in the lectures as well. Basic PyTorch skills can also be helpful.

Lecture

still open

Exercises

  • Exercise 1 (individual): computation of word frequencies and other figures and statistics for a given text corpus, as well as n-gram counts and an implementation of the Viterbi algorithm for text processing (see the counting/Viterbi sketch after this list). To be done in JupyterHub / Python.
  • Exercise 2 (individual):
    • Part 1: Word embeddings. TF-IDF scores, vector space embeddings (e.g. GloVe). Visualization of words in an embedding space (or rather, in a 2-D projection of it); see the TF-IDF/projection sketch after this list. To be done in JupyterHub / Python.
    • Part 2: Neural networks (NNs) for text processing. Loading data, wrapping it into appropriate loaders / iterators for use with NNs, preprocessing. Implement a feed-forward NN (FFNN) as well as an LSTM (Long Short-Term Memory) NN for text classification; see the LSTM sketch after this list. To be done in JupyterHub / Python.
  • Exercise 3 (group work, in groups of 4): choose a project topic from a list (or bring your own, similar topic). Implement a language processing system that handles a certain task, e.g. automatic summarization of texts. The deliverables are a group presentation of the project (15 min.), a 2-page report (management summary), and the code.
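
A minimal sketch of the kind of code Exercise 1 asks for. The toy corpus and the two-state HMM below are made-up examples, not the assignment's data or model:

    import math
    from collections import Counter

    # Toy corpus; the real corpus comes with the assignment.
    tokens = "the cat sat on the mat the cat slept".split()

    def ngram_counts(tokens, n):
        """Count n-grams as a Counter over token tuples."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    print(Counter(tokens).most_common(2))          # word frequencies
    print(ngram_counts(tokens, 2).most_common(2))  # bigram counts

    def viterbi(obs, states, start_p, trans_p, emit_p):
        """Return (log-probability, best state path) for an observation sequence."""
        # trellis[t][s] = (best log-prob of a path ending in s at step t, that path)
        trellis = [{s: (math.log(start_p[s] * emit_p[s][obs[0]]), [s]) for s in states}]
        for t in range(1, len(obs)):
            column = {}
            for s in states:
                score, path = max(
                    ((trellis[-1][p][0] + math.log(trans_p[p][s] * emit_p[s][obs[t]]),
                      trellis[-1][p][1]) for p in states),
                    key=lambda x: x[0])
                column[s] = (score, path + [s])
            trellis.append(column)
        return max(trellis[-1].values(), key=lambda x: x[0])

    # Made-up two-state HMM (noun-ish vs. verb-ish states), just for illustration.
    states = ["N", "V"]
    start_p = {"N": 0.7, "V": 0.3}
    trans_p = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
    emit_p = {"N": {"cats": 0.6, "sleep": 0.1}, "V": {"cats": 0.1, "sleep": 0.7}}
    print(viterbi(["cats", "sleep"], states, start_p, trans_p, emit_p))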
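
For part 1 of Exercise 2, a minimal TF-IDF and projection sketch using scikit-learn. The documents are made up, and the assignment specifies the actual data and plotting requirements:

    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Made-up documents; the assignment specifies the actual corpus.
    docs = ["the cat sat on the mat",
            "dogs chase cats",
            "stock markets fell today"]

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)  # documents x vocabulary, TF-IDF weighted
    print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))

    # 2-D projection for plotting; for word vectors you would project the
    # embedding matrix (e.g. pretrained GloVe vectors) the same way.
    coords = PCA(n_components=2).fit_transform(X.toarray())
    print(coords)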
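
For part 2 of Exercise 2, a minimal PyTorch sketch of an LSTM text classifier of the kind the exercise asks for. The vocabulary size, dimensions, and the dummy batch are made-up placeholders; the FFNN variant would replace the LSTM with linear layers over pooled embeddings:

    import torch
    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        """Embed token ids, run an LSTM, classify from the final hidden state."""
        def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, num_classes)

        def forward(self, token_ids):          # (batch, seq_len)
            embedded = self.embed(token_ids)   # (batch, seq_len, embed_dim)
            _, (h_n, _) = self.lstm(embedded)  # h_n: (1, batch, hidden_dim)
            return self.out(h_n[-1])           # (batch, num_classes)

    # Made-up sizes and a dummy batch of token ids, just to show the shapes.
    model = LSTMClassifier(vocab_size=5000, embed_dim=100, hidden_dim=64, num_classes=2)
    batch = torch.randint(1, 5000, (8, 20))
    loss = nn.CrossEntropyLoss()(model(batch), torch.randint(0, 2, (8,)))
    loss.backward()  # an optimizer step would follow in the training loop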

The exercise points are weighted according to a defined scheme.

Exam, Grading

There is no dedicated exam; the grade is based entirely on the exercises.

Time until the certificate is issued

still open

Workload

still open

Materials

Slides are provided in different formats (PDF, Jupyter notebooks / HTML).

Main textbook: Jurafsky & Martin, Speech and Language Processing; PDFs of the drafts of the 3rd edition are available at https://web.stanford.edu/~jurafsky/slp3/

Tips

  • For the 2nd part of Exercise 2, training the FFNN (and even more so the LSTM) was quite time-consuming on the Hub, with one epoch of training taking on the order of several minutes. It can be very helpful to move either to a machine with a GPU, if you happen to have one available (and set up), or to e.g. Google Colab to run your notebook / code there using a GPU; see the device-selection sketch below. This can significantly speed up training and thus the development process. (It would be nice if this infrastructure were provided, but it is not clear that will happen.)
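
A minimal sketch of the standard PyTorch device-selection pattern mentioned above. The linear layer is just a hypothetical stand-in for the exercise model; model and batches have to live on the same device:

    import torch
    import torch.nn as nn

    # Use a GPU when one is available (e.g. a Colab GPU runtime), else the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("training on", device)

    model = nn.Linear(100, 2).to(device)        # stand-in for the exercise model
    batch = torch.randn(8, 100, device=device)  # inputs must be on the same device
    logits = model(batch)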

Suggestions for improvement / Criticism

still open

