TU Wien:Machine Learning VU (Musliu)/Exam 2021-06-24

Multiple Choice (Answer: True/False)

  • A k-d tree can be used as a search space optimisation for k-NN
  • Random Forests is a boosting ensemble technique - F
  • Back propagation is a method for training Multi-Layer Perceptrons - T
  • Ordinal data does not allow distances to be computed between data points
  • In AdaBoost, the weights are uniformly initialised
  • Suppose we have a neural network with ReLU activation functions. Say we replace the ReLU activations by linear activations. Would this new neural network be able to approximate an XOR function? (Note: with ReLU activations, the network was able to approximate the XOR function.) See the sketch after this list.
  • The entropy of a data set is based solely on the relative frequencies of the data distribution, not on the absolute number of data points present
  • k-nearest neighbors is based on a supervised learning paradigm
  • Support Vector Machines with a linear kernel are particularly suitable for classification of very high-dimensional, sparse data - T
  • Support Vector Machines can by default only solve binary classification problems
  • Naive Bayes usually gives good results for regression data sets
  • Learning the structure of Bayesian networks is usually simpler than learning the probabilities
  • Learning the structure of Bayesian networks is usually more complicated than learning the probabilities
  • The mean absolute error (a performance metric used for regression) is less sensitive to outliers than MSE
  • Chain Rule simplifies calculation of probabilities in Bayesian Networks
  • "Number of attributes of data set" is not a model based features that is used for metalearning - T
  • Kernel projections can only be used in conjunction with support vector machines
  • Suppose a convolutional neural network is trained on ImageNet dataset. This trained model is then given a completely white image as an input. The output probabilities for this input would be equal for all classes.
  • When learning an SVM with gradient descent, it is guaranteed to find the globally optimal hyper plane. - F
  • Usually, state-of-the-art AutoML systems use grid search to find the best hyperparameters - T
  • Linear regression converges when performed on linearly separable data - F
  • Linear regression converges when performed on data that is not linearly separable - T
  • Laplace Corrector must be used when using Naive Bayes - F
  • Gradient boosting minimizes the residuals of the previous classifiers - T
  • Decision trees built using the error rate vs. entropy as splitting criterion lead to different results - T
  • The depth of a decision tree can be larger than the number of training samples used to create the tree - F (the depth of the tree cannot exceed the number of samples)
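
The ReLU-vs-linear-activation question above can be checked empirically. The sketch below is illustrative only (the one-hidden-layer architecture, the lbfgs solver and scikit-learn itself are assumptions, not part of the exam): with activation='identity' every layer is linear, so the whole network collapses to a single linear map and cannot represent XOR, while the same architecture with ReLU can.

  # Illustrative sketch (assumed setup, not from the exam): ReLU vs. purely
  # linear activations on the XOR problem, using scikit-learn's MLPClassifier.
  import numpy as np
  from sklearn.neural_network import MLPClassifier

  X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
  y = np.array([0, 1, 1, 0])  # XOR labels

  for activation in ["relu", "identity"]:  # "identity" = linear activation
      clf = MLPClassifier(hidden_layer_sizes=(8,), activation=activation,
                          solver="lbfgs", max_iter=5000, random_state=0)
      clf.fit(X, y)
      print(activation, "training accuracy:", clf.score(X, y))

  # The ReLU network can reach training accuracy 1.0 (possibly after trying a
  # different random_state); the "identity" network cannot, because a stack of
  # linear layers is still one linear map and XOR is not linearly separable.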

Free Text

  • Consider the following 2D data set. Which classifier(s) will achieve zero training error on this data set?
    o +
    + o
    • Perceptron
    • SVM with a linear kernel
    • Decision tree (T)
    • 1-NN classifier
  • Describe at least three methods that are used for hyperparameter optimization (see the sketch after this list)
  • How can we automatically select the most promising machine learning algorithm for a particular data set?
    • Describe Rice's framework from the AutoML lecture
  • Why can a general Bayesian network give better results than Naive Bayes?
  • What is overfitting, and when & why is it a problem? Explain measures against overfitting on an algorithm discussed in the lecture
  • What is the difference between micro and macro averaged performance measures?
  • What are the important issues to consider when applying Rice's framework for automated selection of machine learning algorithms?
  • How can we avoid overfitting for polynomial regression?
  • Which types of features are used in metalearning? What are landmarking features?
  • Something like: explain three regression performance evaluation methods.
  • Something like: how is 1-R related to decision trees?
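
For the hyperparameter optimisation question above, here is a minimal sketch of two of the usual methods (grid search and random search; Bayesian optimisation would be a third). The model, dataset and parameter ranges are made up purely for illustration and are not from the exam.

  # Illustrative sketch (assumed setup): grid search vs. random search for
  # tuning an SVM's hyperparameters with cross-validation.
  from scipy.stats import loguniform
  from sklearn.datasets import load_iris
  from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
  from sklearn.svm import SVC

  X, y = load_iris(return_X_y=True)

  # Grid search: exhaustively evaluates every combination on a fixed grid.
  grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
  grid.fit(X, y)

  # Random search: samples a fixed budget of configurations from distributions.
  rand = RandomizedSearchCV(SVC(), {"C": loguniform(1e-2, 1e2),
                                    "gamma": loguniform(1e-3, 1e1)},
                            n_iter=20, cv=5, random_state=0)
  rand.fit(X, y)

  print("grid search best:  ", grid.best_params_)
  print("random search best:", rand.best_params_)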

NOT EXACT, BUT SOMETHING LIKE THIS:

  • In which order should the following steps be performed when training a neural network with gradient descent? (Five options were listed and had to be placed in the correct order.)
    • Initialize weights and bias
    • Pass the input through the NN to get the output
    • Compute the error (compare the expected output to the actual output, or something like that)
    • Adjust weights
    • Reiterate until the best weights are in place
  • Compare ridge and lasso regression (a small sketch follows at the end of this section)
  • Describe the goal and setting of classification. Which other machine learning tasks does it relate to, and how does it differ from them?
  • When are two nodes in a Bayesian network considered to be d-separated?
  • Can a kernel be used in a perceptron?
  • How can we automatically pick the best algorithm for a specific dataset?
  • How can you learn the structure of Bayesian Networks?
  • Explain how we can deal with missing values or zero frequency problem in Naive Bayes.
    • Ignore the missing values / apply Laplace correction
  • What is Deep Learning? Describe how it differs from "traditional" Machine Learning approaches? Name two application scenarios where Deep Learning has shown great advances over previous methods.
  • Describe the goal and setting of classification. How does that relate and differ from other techniques in machine learning?
    • compare it to unsupervised, name regression as a technique, etc.
  • What is the randomness in random forests? Describe where in the algorithm randomness plays a role
  • Describe 2 AutoML systems
  • Describe in detail the algorithm to compute random forest
  • Given are 1000 observations, from which you want to train a decision tree. As pre-pruning, the following parameters are set:
    • The minimum number of observations required to split a node is set to 200
    • The minimum leaf size (number of obs.) to 300

Then, what would be the maximum depth the decision tree can take (not counting the root node)? Explain your answer!
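
One possible line of reasoning (not an official solution): a node can only be split if it contains at least 200 observations and each of its two children would contain at least 300 observations, so in practice a node needs at least 600 observations to be split at all. The root with 1000 observations can be split (e.g. into 700 and 300), the 700-observation child can be split once more (e.g. into 400 and 300), and after that no node holds 600 or more observations, so the maximum depth is 2.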
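
For the ridge-vs-lasso question above, a minimal sketch of the practical difference (the synthetic data and the regularisation strengths are assumptions for illustration): ridge (L2 penalty) shrinks all coefficients towards zero but rarely makes them exactly zero, while lasso (L1 penalty) drives the coefficients of uninformative features exactly to zero and thus performs feature selection.

  # Illustrative sketch (assumed data): ridge (L2) vs. lasso (L1) regularisation.
  import numpy as np
  from sklearn.linear_model import Lasso, Ridge

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 10))
  # Only the first two features are informative; the other eight are noise.
  y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

  ridge = Ridge(alpha=1.0).fit(X, y)
  lasso = Lasso(alpha=0.1).fit(X, y)

  # Ridge: all ten coefficients are shrunk but remain non-zero.
  # Lasso: the coefficients of the noise features are (typically) exactly zero.
  print("ridge:", np.round(ridge.coef_, 2))
  print("lasso:", np.round(lasso.coef_, 2))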


See image for more:

https://vowi.fsinf.at/wiki/Datei:TU_Wien-Machine_Learning_VU_(Mayer,_Musliu)_-_Exam_24062021.png