TU Wien:Machine Learning VU (Musliu)/Exam 2021-06-24
Multiple Choice (Answer: True/False)
- A k-d tree can be used as a search space optimisation for k-NN
- Random Forests is a boosting ensemble technique - F
- Back propagation is a method for training Multi-Layer Perceptrons - T
- Ordinal data does not allow distances to be computed between data points
- In AdaBoost, the weights are uniformly initialised
- Suppose we have a neural network with ReLU activation functions. Say we replace the ReLU activations by linear activations. Would this new neural network be able to approximate an XOR function? (Note: with the ReLU activations, the network was able to approximate the XOR function)
- The entropy of a data set is based solely on the relative frequencies of the data distribution, not on the absolute number of data points present (see the sketch after this list)
- k-nearest neighbors is based on a supervised learning paradigm
- Support Vector Machines with a linear kernel are particularly suitable for classification of very high dimensional, sparse data - T
- Support Vector Machines can by default only solve binary classification problems
- Naive Bayes usually gives good results for regression data sets
- Learning the structure of Bayesian networks is usually simpler than learning the probabilities
- Learning the structure of Bayesian networks is usually more complicated than learning the probabilities
- The mean absolute error (a performance metric used for regression) is less sensitive to outliers than MSE
- Chain Rule simplifies calculation of probabilities in Bayesian Networks
- "Number of attributes of data set" is not a model based features that is used for metalearning - T
- Kernel projections can only be used in conjunction with support vector machines
- Suppose a convolutional neural network is trained on the ImageNet dataset. This trained model is then given a completely white image as an input. The output probabilities for this input would be equal for all classes.
- When learning an SVM with gradient descent, it is guaranteed to find the globally optimal hyperplane. - F
- Usually, state-of-the-art AutoML systems use grid search to find the best hyperparameters - T
- Linear regression converges when performed on linearly separable data - F
- Linear regression converges when performed on data that is not linearly separable - T
- Laplace Corrector must be used when using Naive Bayes - F
- Gradient boosting minimizes residual of previous classifiers - T
- Decision trees using error rate vs. entropy as the splitting criterion lead to different results - T
- Depth of decision tree can be larger than the number of training samples used to create a tree - F (depth of tree not larger than number of samples)
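Regarding the entropy statement above, a minimal sketch (toy labels chosen purely for illustration) showing that the entropy of a data set depends only on the class proportions, not on the absolute number of data points:

<syntaxhighlight lang="python">
# Sketch: entropy computed from relative frequencies is unchanged when the
# data set is duplicated, because the proportions stay the same.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

small = ["yes"] * 3 + ["no"]      # proportions 0.75 / 0.25
large = small * 100               # same proportions, 100 times as many points
print(entropy(small), entropy(large))   # both ~0.811
</syntaxhighlight>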
Free Text
- Consider the following 2D data set. Which classifier(s) will achieve zero training error on this data set? (see the quick check after this list)
- o +
- + o
- Perceptron
- SVM with a linear kernel
- Decision tree (T)
- 1-NN classifier
- Describe at least three methods that are used for hyperparameter optimization
- How can we automatically select the most promising machine learning algorithm for a particular data set?
- Describe Rice's framework from the AutoML lecture
- Why can a general Bayesian network give better results than Naive Bayes?
- What is overfitting, and when & why is it a problem? Explain measures against overfitting on an algorithm discussed in the lecture
- What is the difference between micro and macro averaged performance measures?
- What are the important issues to consider when applying Rice's framework for automated selection of machine learning algorithms?
- How can we avoid overfitting for polynomial regression?
- Which types of features are used in metalearning? What are landmarking features?
- something like: explain 3 regression performance evaluation methods
- something like: how is 1R related to decision trees?
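For the 2D data set question above: the pattern looks like XOR, which is not linearly separable. A quick check with scikit-learn, assuming the grid corresponds to the four XOR points (coordinates and label encoding are illustrative assumptions):

<syntaxhighlight lang="python">
# Sketch: fit each candidate classifier on a 4-point XOR-like data set and
# report its training accuracy; only models that can represent XOR reach 1.0.
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])   # the o + / + o grid as coordinates
y = np.array([0, 1, 1, 0])                       # o -> 0, + -> 1 (assumed encoding)

for clf in (Perceptron(), SVC(kernel="linear"),
            DecisionTreeClassifier(), KNeighborsClassifier(n_neighbors=1)):
    print(type(clf).__name__, clf.fit(X, y).score(X, y))
# The decision tree and the 1-NN classifier reach a training accuracy of 1.0;
# the two linear models cannot, since the classes are not linearly separable.
</syntaxhighlight>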
NOT EXACT, BUT SOMETHING LIKE THIS:
- In which order should the steps be performed when training a neural network with gradient descent? (5 options listed, to be placed in the correct order; see the sketch after this list):
- Initialize weights and bias
- Pass the input through the NN to get the output
- Compute the error (compare the expected output with the actual output)
- Adjust weights
- Reiterate until the best weights are in place
- Compare ridge and lasso regression
- Goal and setting of classification. To which tasks in machine learning does it relate, and from which does it differ?
- When are two nodes in a Bayesian network considered to be d-separated?
- Can a kernel be used in a perceptron?
- How can we automatically pick the best algorithm for a specific dataset?
- How can you learn the structure of Bayesian Networks?
- Explain how we can deal with missing values or the zero-frequency problem in Naive Bayes (see the sketch after this list).
- Ignore the missing values / apply Laplace correction
- What is Deep Learning? How does it differ from "traditional" Machine Learning approaches? Name two application scenarios where Deep Learning has shown great advances over previous methods.
- Describe the goal and setting of classification. How does that relate and differ from other techniques in machine learning?
- compare it to unsupervised, name regression as a technique, etc.
- What is the randomness in random forests? Describe where in the algorithm randomness plays a role
- Describe 2 AutoML systems
- Describe in detail the algorithm for constructing a random forest
- Given are 1000 observations, from which you want to train a decision tree. For pre-pruning, the following parameters are set:
- The minimum number of observations required to split a node is set to 200
- The minimum leaf size (number of obs.) to 300
Then, what would be the maximum depth the decision tree can take (not counting the root node)? Explain your answer! (a possible worked answer is sketched below)
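For the last question, a possible line of reasoning (assuming binary splits and that both pre-pruning conditions must hold): the root holds 1000 observations and may be split, since 1000 ≥ 200; its children must each contain at least 300 observations. A node can only be split further if both of its children reach the 300-observation minimum, i.e. if it contains at least 600 observations. At depth 1 this is still possible (e.g. a 700/300 split of the root, then 700 → 400/300), but a node at depth 2 can hold at most 1000 − 300 − 300 = 400 < 600 observations, so no further split is allowed and the maximum depth is 2 (not counting the root).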
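For the question above about the order of the gradient-descent training steps, a minimal sketch: a single sigmoid neuron learning the AND function (the data set, learning rate and number of epochs are illustrative assumptions, not from the lecture):

<syntaxhighlight lang="python">
# Sketch of the training loop: initialise, forward pass, compute error,
# adjust weights, repeat.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)           # AND function as a toy target

rng = np.random.default_rng(0)
w = rng.normal(size=2)                             # 1. initialise weights ...
b = 0.0                                            #    ... and bias

for epoch in range(5000):                          # 5. reiterate until the weights fit
    out = 1 / (1 + np.exp(-(X @ w + b)))           # 2. pass the input through the network
    err = out - y                                  # 3. compare output with expected output
    w -= 0.5 * X.T @ err / len(y)                  # 4. adjust weights and bias along
    b -= 0.5 * err.mean()                          #    the negative gradient

print(np.round(out))                               # should be close to [0, 0, 0, 1]
</syntaxhighlight>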
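For the zero-frequency question above, a small sketch of the Laplace (add-one) correction for a Naive Bayes conditional probability estimate (function name and numbers are illustrative, not from the lecture):

<syntaxhighlight lang="python">
# Sketch: add-alpha smoothing so that an attribute value never seen together
# with a class still gets a small non-zero conditional probability.
def laplace_estimate(count_value_and_class, count_class, n_values, alpha=1.0):
    """Estimate P(attribute = value | class) with Laplace correction."""
    return (count_value_and_class + alpha) / (count_class + alpha * n_values)

# Value never observed with this class (0 of 8 examples, attribute has 3 values):
print(laplace_estimate(0, 8, 3))   # 1/11 ~ 0.09 instead of 0
</syntaxhighlight>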
See image for more:
https://vowi.fsinf.at/wiki/Datei:TU_Wien-Machine_Learning_VU_(Mayer,_Musliu)_-_Exam_24062021.png