TU Wien:Natural Language Processing and Information Extraction VU (Hanbury)
Data

| Lecturers | Allan Hanbury, Gábor Recski, Florina Mihaela Piroi |
| --- | --- |
| ECTS | 3 |
| Department | Information Systems Engineering |
| When | winter semester |
| Language | English |
| Links | tiss:194093, Mattermost channel |
| Master Data Science | Elective module VAST/EX - Visual Analytics and Semantic Technologies - Extension |
Content
Introduction to different aspects of Natural Language Processing and a bit of (computational) linguistics.
- Regex
- N-Grams
- Probabilistic Language Models
- POS tagging, Hidden Markov Models (HMMs) and the Viterbi algorithm
- Syntax
- Semantics
- Neural Networks for language processing
- Information extraction from text
- Annotation of language corpora
- Evaluation of language processing systems
- Summarization of documents
- Question Answering and Chatbots
Course structure
Weekly lectures, mostly each focused on one topic. Slides are provided, with occasional references to the main textbook that the class follows, as well as to papers and other books. There are three exercises (see the dedicated section below) and no exam.
Required/Recommended previous knowledge
Some interest in language(s) and in processing text data with computers. Python skills and familiarity with Jupyter notebooks (the exercises run on a JupyterHub at TU Wien). Basics of machine learning methods (classification) and of experiment design (for the group project / system evaluation). Basic knowledge of neural networks is definitely helpful, as is basic knowledge of sequence models such as Hidden Markov Models (HMMs), but both are also covered in the lectures. Basic PyTorch skills can help as well.
Lecture
still open
Exercises
- Exercise 1 (individual): Computation of word frequencies and other statistics for a given text corpus, as well as n-gram counts, plus an implementation of the Viterbi algorithm for text processing (a small sketch follows this list). To be done in JupyterHub / Python.
- Exercise 2 (individual):
  - Part 1: Word embeddings: TF-IDF scores and vector space embeddings (e.g. GloVe), including visualization of words in an embedding space (or rather, a projection of it). To be done in JupyterHub / Python.
  - Part 2: Neural networks (NNs) for text processing: loading data, wrapping it into appropriate loaders/iterators for use with an NN, and preprocessing. Implement a feed-forward NN (FFNN) as well as an LSTM (Long Short-Term Memory) network for text classification (see the PyTorch sketch below). To be done in JupyterHub / Python.
- Exercise 3 (group work, in groups of 4): Choose a project from a list of topics (or bring your own, similar topic). Implement a language processing system that handles a given task, e.g. automatic summarization of texts. Deliverables are a group presentation of the project (15 min.), a 2-page report (management summary), and the code.
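A minimal sketch of the two core pieces of Exercise 1, counting and Viterbi decoding; the toy corpus, tag set and all probabilities below are made-up illustrations, not the actual exercise data:

```python
from collections import Counter

# Word and n-gram frequencies over a toy corpus (illustrative data only).
tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(unigrams.most_common(2))   # [('the', 3), ('cat', 2)]
print(bigrams[("the", "cat")])   # 2

# Viterbi decoding for a tiny HMM POS tagger (made-up probabilities).
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {"DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
           "NOUN": {"DET": 0.1,  "NOUN": 0.3, "VERB": 0.6},
           "VERB": {"DET": 0.5,  "NOUN": 0.3, "VERB": 0.2}}
emit_p = {"DET":  {"the": 0.9},
          "NOUN": {"cat": 0.8, "sat": 0.2},
          "VERB": {"cat": 0.1, "sat": 0.9}}

def viterbi(obs):
    """Return the most likely state sequence for the observed words."""
    # Each cell stores (probability of best path ending here, that path).
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), [s]) for s in states}]
    for word in obs[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(word, 0.0),
                 V[-1][prev][1] + [s])
                for prev in states)
            row[s] = (prob, path)
        V.append(row)
    return max(V[-1].values())[1]

print(viterbi(["the", "cat", "sat"]))  # ['DET', 'NOUN', 'VERB']
```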
The exercise points are weighted according to a defined scheme.
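For part 2 of Exercise 2, here is a minimal, self-contained PyTorch sketch of an LSTM text classifier; the random toy data, dimensions and hyperparameters are illustrative assumptions, not the exercise template (an FFNN variant would replace the recurrent layer with linear layers over, e.g., averaged embeddings):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy binary classification data: batches of token-id sequences (made up;
# the real exercise provides its own corpus and preprocessing pipeline).
vocab_size, seq_len, num_classes = 100, 8, 2
X = torch.randint(0, vocab_size, (64, seq_len))
y = torch.randint(0, num_classes, (64,))
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                 # x: (batch, seq_len) token ids
        _, (h_n, _) = self.lstm(self.embed(x))
        return self.fc(h_n[-1])           # logits from the last hidden state

model = LSTMClassifier(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):                    # short demo run
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```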
Exam, Grading
There is no dedicated exam; the grade is based solely on the exercises.
Time until certificate is issued
still open
Time required
still open
Materials
Slides are provided in different formats (PDF, Jupyter notebooks / HTML).
Main textbook: Jurafsky & Martin: Speech and Language Processing; PDF drafts of the 3rd edition are available at https://web.stanford.edu/~jurafsky/slp3/
Tips
- For the second part of Exercise 2, training the FFNN (and even more so the LSTM) was quite time-consuming on the Hub, with one epoch of training taking on the order of several minutes. It can be very helpful to move to a machine with a GPU, if you happen to have one available (and set up), or to run your notebook on e.g. Google Colab with a GPU; this can significantly speed up training and thus the development process (see the snippet below). It would be nice if the course provided this infrastructure, but that does not seem likely to happen.
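One way to make the notebook device-agnostic, so the same code runs on the Hub's CPU and picks up a GPU on Colab without changes; the tiny linear model and random batch are stand-ins for your actual Exercise 2 code:

```python
import torch
import torch.nn as nn

# Use a GPU when one is available (e.g. on Colab), otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"training on {device}")

model = nn.Linear(10, 2).to(device)      # stand-in for your Exercise 2 network
xb = torch.randn(16, 10, device=device)  # stand-in batch, created on the device
logits = model(xb)                       # runs on the GPU if one was found
```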
Suggestions for improvement / criticism
still open