TU Wien:Data-intensive Computing VU (Knees)

From VoWi
Revision as of 02:11, 15 September 2021 by Somebot (talk | contribs) (vowi_import_courses.py)

Data

Lecturers Peter Knees, Ivona Brandic, Dietmar Winkler
Alias Data-intensive Computing (en)
Department Information Systems Engineering
When summer semester
Last iteration 2021SS
Language English
Links tiss:194048
Master Data Science Mandatory module BDHPC/FD - Big Data and High Performance Computing - Foundations

Mattermost: Channel "data-intensive-computing"

Content

  • MapReduce (Java)
  • Spark (either Scala or Python): RDDs and DataFrames/Datasets, MLlib for machine learning
  • Text processing: pre-processing, feature selection, text classification

Course structure

Still open

Required/recommended prior knowledge

Recommended or necessary: Java, Python (for PySpark), and/or Scala (for Spark). Some basics of functional programming (simple lambda expressions) are useful as well. Having used MapReduce or Spark before, or knowing some of the concepts, is certainly helpful. You should also be comfortable with basic shell scripting and with navigating file systems (OS and Hadoop) from the shell. Helpful background includes basic machine learning concepts (e.g. train/validation/test splits, cross-validation, parameter tuning) and text processing / information retrieval concepts (e.g. TF-IDF, feature selection using chi-square values).
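As a refresher on TF-IDF, here is a minimal pure-Python sketch. It assumes one common variant (tf = raw term count, idf = log(N / df)); the course may use a different weighting, and the function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = raw count in the document,
    idf = log(number of documents / number of documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: count each term at most once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [["spark", "rdd", "spark"], ["hadoop", "mapreduce"], ["spark", "hadoop"]]
weights = tf_idf(docs)
print(weights[0])
```

In this variant, a term that occurs in every document gets weight 0, because log(N / N) = 0.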

Some of the material is also covered in the class "Advanced Database Systems". Check the material for that class if possible; it is partially more in-depth.

Lecture

Still open

Exercises

Entry exercise, 2 assignments (individual work) + group project (groups of 4)

  • Entry exercise: Word count example using MapReduce on Hadoop cluster
  • Assignment 1: MapReduce, text pre-processing, chi-square feature selection. Write a MapReduce implementation.
  • Assignment 2:
  - Part 1: Spark RDDs
  - Part 2: Spark DataFrames, pipelines for MLlib
  - Part 3: ML pipeline for text classification
  • Assignment 3: RecSys challenge for the given year
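The word count from the entry exercise follows the map, shuffle, reduce pattern. The sketch below imitates that pattern in plain Python, without Hadoop, just to show the data flow; the actual exercise is written against the Hadoop API, and all function names here are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["to be or not to be", "To err is human"])
print(counts["to"])  # 3
```

On a cluster, the mapper and reducer run on different machines and the shuffle is handled by the framework; here everything happens in one process.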

Exam, grading

Still open

Time until the certificate is issued

Still open

Time effort

Still open

Materials

Still open

Tips

  • Start early: the problems may be challenging if you are not somewhat fluent in Java, Python, and/or Scala.
  • Start early: the cluster may be heavily used and is sometimes not working.
  • Set up a Spark environment on your own machine if possible, and start developing locally with smaller subsets of the data.
  • Test on the cluster at some point to make sure everything runs there (file I/O using Hadoop, starting a job from a shell script, etc.).

Assignment 1 - Tips for Hadoop:

  • You can run multiple MapReduce jobs.
  • Pass simple values between jobs via context counters (Counter).
  • Pass more complex data between jobs via (cached) files.
  • Check out the setup and cleanup functions of mappers and reducers.
  • Although they mention Avro, it created unnecessary overhead for me and did not work in the end. Going with a text-based solution is easier, in my opinion.
  • Check this link for the chi-square calculation: http://www.learn4master.com/algorithms/chi-square-test-for-feature-selection
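For orientation, the chi-square score of a term for a category is usually computed from a 2x2 contingency table of document counts. The sketch below uses the common closed form N(AD - BC)^2 / ((A+B)(A+C)(B+D)(C+D)); the parameter names and the zero-denominator handling are assumptions, so check them against the assignment sheet.

```python
def chi_square(a, b, c, d):
    """Chi-square score for one (term, category) pair.

    Assumed 2x2 layout of document counts:
      a: in the category, contains the term
      b: not in the category, contains the term
      c: in the category, does not contain the term
      d: not in the category, does not contain the term
    """
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        # Degenerate table (an empty row or column): treat as "no signal".
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts for one term and one category.
score = chi_square(30, 10, 20, 40)
print(round(score, 2))  # 16.67
```

In a MapReduce setting, the four counts would typically come out of one or more counting jobs, with the score computed in a later step.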

Assignment 2 - Tips:

Check out accumulators and broadcast variables.

Suggestions for improvement / criticism

Still open

