TU Wien:Experiment Design for Data Science VU (Knees)


Data

Lecturers: Allan Hanbury, Peter Knees, Andreas Rauber, Alexander Schindler
Last held: 2020W
Language: English
Mattermost: experiment-design-for-data-science
Links: tiss:188992, eLearning
Master's program: Data Science
Master's program: Computational Science and Engineering
Master's program: Business Informatics
Catalog: Free electives

Content

Data privacy and ethics (Hanbury), statistical testing (Knees), and productivity and data management (Rauber). Essentially, one of the "bread and butter" courses of data science.

Course format

Different lecturers cover different parts of the material. Lectures do not take place every week. Attendance is not mandatory (not checked). There are 2 practical assignments to be done during the semester. The first assignment is individual and requires essentially no coding except for very basic dataset exploration and visualisation. The second assignment is group work and is far more demanding in terms of coding and time needed.

Required/recommended prior knowledge

Basic knowledge of statistical testing is certainly helpful, as is basic knowledge of how to set up a machine learning experiment (but this is also briefly presented in the lectures). For reproducing the chosen paper's experiment, some programming skills come in handy. Here, some knowledge of Python (scikit-learn, pandas, numpy; visualization with matplotlib or seaborn can be helpful) is necessary, possibly also some R if you prefer that (this may depend on the paper chosen for reproduction, see assignments). Weka could also be needed.

Lectures

  • A. Hanbury: clear and concise, on-topic presentation of relevant thoughts and questions surrounding Data Science (Comment WS22: While the talk is very good and entertaining, the slides are really just a bunch of screenshots with barely any text, and since Hanbury does not record his lectures, "studying" his part for the exam is a bit weird)
  • P. Knees: lectures about experiment design, forming hypotheses and statistical evaluation (Comment WS22: Recordings provided)
  • A. Rauber: lectures about reproducibility of papers and experiments, problems with code, libraries, etc. (Comment WS22: No recordings provided)

Exercises

2 Assignments:

  • Forming hypotheses for (machine learning) experiments: given a data set, explore it and describe some interesting details. Then, from the insights gained, form three different hypotheses that can be tested in a machine learning experiment. Describe the dependent and independent variables, and how you would conduct the respective experiments. No actual programming required. To be done individually (not group work).
  • From a list of 3 short papers (around 3-4 pages), select one and reproduce its experiments and results. Check how well that is possible, i.e. whether all data, descriptions of methods, etc. needed for the reproduction are available. This is to be done in groups of 3 students. Groups give a short presentation on their progress during the semester, which is helpful for getting feedback. Deliverables are code and a report.

-> in 2020W: Professor Knees decided to ask the students to reproduce the experiments of papers that they pick for themselves. He offers a list of possible links that one can look through, and the group decides which short paper to reproduce.
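
As an illustration of the kind of statistical evaluation that the first assignment asks you to plan (this is not official course material): a minimal Python sketch that compares two models on paired per-fold scores with an exact sign-flip permutation test. Here the independent variable is the model and the dependent variable is the score. The function name `paired_permutation_test` and all scores are hypothetical, and the choice of this particular test is an assumption made for the example; the lecture may recommend a different one.

```python
from itertools import product
from statistics import mean


def paired_permutation_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired scores.

    Null hypothesis: the two models perform equally, so the sign of each
    per-fold difference is arbitrary. Returns the p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))

    extreme = 0
    total = 0
    # Enumerate every possible assignment of signs to the differences.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(mean([s * d for s, d in zip(signs, diffs)]))
        # Small tolerance guards against floating-point rounding.
        if stat >= observed - 1e-12:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    # Hypothetical accuracies of two models on the same 5 cross-validation folds.
    model_a = [0.82, 0.85, 0.81, 0.86, 0.88]
    model_b = [0.80, 0.82, 0.80, 0.82, 0.83]
    print(paired_permutation_test(model_a, model_b))
```

Note that with only five folds, even if one model wins on every fold, the smallest possible two-sided p-value is 2/32 = 0.0625, so the number of folds or repetitions is itself part of the experiment design.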

Exam, grading

The exam contains 4 relatively open-ended questions, each worth 25% of the points (see materials).

Time until the grade is issued

still open

Workload

still open

Materials

2019WS: To simplify studying for the exam, I created a minimal set of slides containing just the "relevant" parts: Minimal_2019_all_blocks.pdf

Tips

  • Assignment 2: Form a group of motivated people, start early, check which paper option suits you best (which one seems easy / aligned with your skills), and read some extra literature around the chosen paper if necessary.

I had no prior knowledge of statistical testing but still managed to get a 2. For assignment 1, it really helps if you know Python, Excel, or any other program or programming language for analyzing data sets. For assignment 2, I completely agree with the person who wrote above me, but would like to add: before picking the paper, check the datasets, because sometimes your computer might not be powerful enough to reproduce the experiment due to the datasets' massive size.
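
The dataset check mentioned above can be sketched roughly as follows in Python. This is only a rough estimate under stated assumptions: the helper names `dataset_size_bytes` and `fits_in_memory` are hypothetical, and the default `expansion_factor` of 5 is an assumed rule of thumb for how much larger data can become once loaded into memory, not a measured value. You have to supply your own RAM size.

```python
from pathlib import Path


def dataset_size_bytes(root):
    """Total size in bytes of a file, or of all files below a directory."""
    root = Path(root)
    if root.is_file():
        return root.stat().st_size
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def fits_in_memory(size_bytes, ram_bytes, expansion_factor=5.0):
    """Rough check whether data of the given on-disk size fits into RAM.

    expansion_factor is an assumed safety margin (hypothetical default),
    since loaded data usually needs more memory than it does on disk.
    """
    return size_bytes * expansion_factor <= ram_bytes


if __name__ == "__main__":
    GIB = 1024 ** 3
    size = dataset_size_bytes(".")  # replace "." with the dataset folder
    print(f"dataset size: {size / GIB:.2f} GiB")
    print("fits into 16 GiB RAM:", fits_in_memory(size, 16 * GIB))
```

This says nothing about training time or GPU memory, which also have to be judged from the paper itself.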

Suggestions for improvement / criticism

As stated above, for Exercise 2 students are supposed to pick papers (3 candidates, from which 1 is eventually chosen) from the last 2 years from 4 possible conference series. They then need to check the experiment setup of the chosen paper for problems and see whether the results reported in the paper are reproducible (it makes sense to pick papers that publish their entire code). This doesn't sound too bad, but it becomes problematic when you consider that this course is placed in the first semester of the Data Science Master curriculum. For a student without much prior knowledge of data science, just understanding what the paper is about (papers from the last 2 years on topics such as Information Retrieval and Recommender Systems) would probably already be difficult. In addition, such a student will likely also have trouble evaluating whether they have the computational resources to run the paper's experiments, as the models and datasets used in recent papers can be very big, making it difficult or impossible to run the experiments on a local machine. The exercise description mentions using Google Colab to solve that, but anyone who is familiar with Colab will know that the GPU it provides usually isn't stronger than your average gaming GPU, so this suggestion is kind of a pseudo-solution to the problem. It would probably make sense to move this course to the second or third semester in the curriculum.