TU Wien:Business Intelligence VU (Tjoa)/2. Test - Data Mining - old tests
Test 2012
Q1: Single Choice
1. In k-fold cross-validation, each data sample is used k-1 times for testing.
2. Ordinal data can only be tested for equality/inequality.
3. In a decision tree, each node denotes a test on an attribute value and each branch represents an outcome of the test.
4. The result of k-means depends on the initial guesses of the seeds.
5. Partitional methods allow clusters to be found at different levels of granularity.
6. Support Vector Machines can only be used for linearly separable data.
7. Support Vector Machines with a linear kernel work particularly well with high-dimensional data.
8. 1-n coding doubles the dimension of the feature space.
9. Binning turns ratio quantity attributes into ordinal attributes.
10. High sparsity in feature spaces denotes the concept of an insufficient number of training/test data.
11. Decision tree pruning is necessary to keep the computational load in a manageable order of magnitude.
12. K-means is based on a supervised learning paradigm.
13. SOM provides a topology-preserving mapping.
14. In k-means, k has to be an even number to guarantee a decision can be made under arbitrary dimensionality of the feature space.
15. K-anonymity is achieved by using a high value for k in k-nearest-neighbors classification.
Q2
Visualize the different concepts of computing distances between clusters in the given figures (as in the slides: MIN, MAX, ...).
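For reference, the standard inter-cluster distance definitions (a sketch of the usual textbook formulas; the notation is assumed here, not copied from the slides):

$$d_{\min}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y) \quad \text{(single linkage, MIN)}$$
$$d_{\max}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y) \quad \text{(complete linkage, MAX)}$$
$$d_{\text{avg}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y) \quad \text{(average linkage)}$$
$$d_{\text{cent}}(C_i, C_j) = d(\mu_i, \mu_j) \quad \text{(centroid distance, with cluster means } \mu_i, \mu_j\text{)}$$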
Q3
List 3 different types of scaling and describe their characteristics, specifically when they lead to drastically different results and in which cases you would apply which type of scaling.
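A minimal sketch of three commonly taught scaling methods (the selection of these three and the example values are assumptions, not taken from the exam):

```python
import numpy as np

def min_max_scale(x):
    # Maps values linearly to [0, 1]; very sensitive to outliers, since a
    # single extreme value compresses all remaining values near one end.
    return (x - x.min()) / (x.max() - x.min())

def z_score_scale(x):
    # Standardization: zero mean, unit variance; less outlier-sensitive,
    # appropriate when values are roughly normally distributed.
    return (x - x.mean()) / x.std()

def decimal_scale(x):
    # Decimal scaling: divide by the smallest power of 10 such that all
    # absolute values fall below 1; preserves the relative ratios exactly.
    j = int(np.floor(np.log10(np.abs(x).max()))) + 1
    return x / 10**j

x = np.array([1.0, 2.0, 3.0, 100.0])  # the outlier 100 makes the methods diverge
print(min_max_scale(x))   # [0.     0.0101 0.0202 1.    ] -- bulk squeezed near 0
print(z_score_scale(x))   # outlier dominates mean/std too, but less extremely
print(decimal_scale(x))   # [0.001 0.002 0.003 0.1  ] -- ratios preserved
```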
Q4
The following table presents weather conditions and whether one should play golf under these conditions. Based on this training data, classify the given test data using the Naïve Bayes classifier.
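The table is not reproduced in this writeup, so here is a minimal Naïve Bayes sketch on a small hypothetical stand-in dataset (the rows, feature names, and test sample below are made up for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical training data (a made-up stand-in for the missing table):
# each row is (outlook, wind, play_golf)
train = [
    ("sunny",    "weak",   "yes"),
    ("sunny",    "strong", "no"),
    ("overcast", "weak",   "yes"),
    ("rainy",    "strong", "no"),
    ("rainy",    "weak",   "yes"),
    ("sunny",    "weak",   "yes"),
]

class_counts = Counter(row[-1] for row in train)
n_features = len(train[0]) - 1

# cond[(feature_index, value, class)] = how often value occurs for that class
cond = defaultdict(int)
for row in train:
    for i in range(n_features):
        cond[(i, row[i], row[-1])] += 1

def score(sample, c):
    """Unnormalized posterior: P(c) * prod_i P(sample[i] | c), Laplace-smoothed."""
    p = class_counts[c] / len(train)
    for i, value in enumerate(sample):
        n_values = len({row[i] for row in train})  # size of feature i's domain
        p *= (cond[(i, value, c)] + 1) / (class_counts[c] + n_values)
    return p

test = ("sunny", "strong")
scores = {c: score(test, c) for c in class_counts}
print(scores, "->", max(scores, key=scores.get))  # the highest score wins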
Test 2013
Q1: Single Choice
1. A lazy learner creates a complex model of the decision space to speed up classification of new data samples.
2. 1-n coding doubles the dimensionality of the feature space.
3. In a decision tree, each node denotes a test on an attribute value and each branch represents an outcome of the test.
4. K-anonymity requires at least k attributes to uniquely identify a specific data item.
5. High sparsity in feature spaces denotes the concept of an insufficient number of training/test data.
6. Support Vector Machines can only be used for linearly separable data.
7. Support Vector Machines with a linear kernel work particularly well with high-dimensional data.
8. Ordinal data can only be tested for equality/inequality.
9. Binning turns ratio quantity attributes into ordinal attributes.
10. Partitional methods for clustering allow clusters to be found at different levels of granularity.
11. K-means is based on a supervised learning paradigm.
12. Decision tree pruning is necessary to keep the computational load in a manageable order of magnitude.
13. The Self-Organizing Map (SOM) is based on a supervised learning paradigm.
14. In k-fold cross-validation, each data sample is used k-1 times for testing.
15. The result of the k-means method depends on the initial guesses of the seeds.
Q2
Visualize the different concepts of computing distances between clusters in the given figures (as in the slides: MIN, MAX, ...).
Q3
List 3 different types of scaling and describe their characteristics, specifically when they lead to drastically different results and in which cases you would apply which type of scaling.
Q4
The following table presents weather conditions and whether one should play golf under these conditions. Based on this training data, classify the given test data using the Naïve Bayes classifier.
Test 2014
Q1: Single Choice
1. Binning describes the process of grouping class labels in multiple layers for hierarchical classification.
2. With 1-n coding of nominal attributes, min/max scaling is required so that proper Euclidean distances can be calculated.
3. To minimize impact, missing values can be replaced by ‘0’ when training a Naïve Bayes classifier.
4. The test set is used for selecting the optimal parameter in classifier training.
5. Support Vector Machines cannot be overtrained, i.e. they do not need a parameter to ensure generalization.
6. In leave-one-out validation, only 1 data sample is used for training in each iteration of classifier training.
7. In k-means, k has to be an even number to guarantee a decision can be made under arbitrary dimensionality of the feature space.
8. In a decision tree, the entropy of the data represented by each branch decreases as we move towards the branches of the tree.
9. Regression is a machine learning task where the target attribute is numeric.
10. K-nearest neighbors is based on a supervised learning paradigm.
11. In a decision tree, each node denotes a test on an attribute value and each branch represents an outcome of the test.
12. Ordinal data allows distances to be computed.
13. Association rules are based on an unsupervised learning paradigm.
Q2
Suppose that we want to determine if someone has an illness ‘C’ based on his symptoms and the given training data in the table below. Each row of the table represents one person, and the features f1, ..., f4 indicate whether the person has a particular symptom or not. The information for the illness is given in column C. You are given these symptoms of a new person <f1,f2,f3,f4>: <0,0,1,0>. Determine whether this person has the illness ‘C’ based on k-nearest neighbors with k=3 (distance metric: Hamming or Euclidean distance, or any other well-known distance metric of your choice). (The table is missing, but the only values are 0 and 1 for f1 to f4.)
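Since the table is missing, here is a minimal k-NN sketch with Hamming distance on hypothetical binary rows (the training rows and labels below are invented, so the printed prediction is only illustrative):

```python
# Hypothetical binary training data: ([f1, f2, f3, f4], has_illness_C)
train = [
    ([0, 0, 1, 1], 1),
    ([1, 0, 1, 0], 1),
    ([0, 1, 0, 0], 0),
    ([1, 1, 0, 1], 0),
    ([0, 0, 0, 0], 0),
]

def hamming(a, b):
    """Number of positions in which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(query, k=3):
    # Sort the training samples by Hamming distance to the query,
    # then take a majority vote over the labels of the k nearest ones.
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

print(knn_predict([0, 0, 1, 0]))  # the query vector from the exam question
```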
Q3
Explain the concept of Data Management Plans and what they are needed for. List the elements they consist of and explain the motivation for moving towards data management plans.
Q4
What is the difference between micro-averaging and macro-averaging in classifier evaluation, and under which conditions is it important to choose one over the other?
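A minimal numeric sketch of the difference, using per-class precision on hypothetical counts (the numbers are made up to show the effect of class imbalance):

```python
# Hypothetical per-class confusion counts: class -> (true positives, false positives)
counts = {
    "A": (90, 10),  # large, well-predicted class
    "B": (1, 9),    # small class with poor precision
}

# Micro-averaging: pool all TP/FP counts first, then compute a single precision.
# Large classes dominate the result.
total_tp = sum(tp for tp, fp in counts.values())
total_fp = sum(fp for tp, fp in counts.values())
micro = total_tp / (total_tp + total_fp)

# Macro-averaging: compute precision per class, then take the unweighted mean.
# Every class counts equally, so poor performance on rare classes shows up.
macro = sum(tp / (tp + fp) for tp, fp in counts.values()) / len(counts)

print(f"micro = {micro:.3f}, macro = {macro:.3f}")  # micro ≈ 0.827, macro = 0.500
```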
Q5
The following artificial data set represents the characteristics of particular meals and whether they are healthy or not. Based on this training data, classify the given test data using the Naïve Bayes classifier. The rationale for your decision, i.e. the calculation leading to the result, must be provided.
Test 2018
Q1: Single Choice
1. K-nearest neighbors is based on a supervised learning paradigm.
2. The entropy is at its maximum when the cardinalities of the classes are equal (see the entropy sketch after this list).
3. Decision tree pruning helps against overfitting.
4. With 1-n coding of nominal attributes, min/max scaling is required so that proper Euclidean distances can be calculated.
5. Support Vector Machines cannot be overtrained, i.e. they do not need a parameter to ensure generalization.
6. Regression is a machine learning task where the target attribute is numeric.
7. Ordinal data allows distances to be computed.
8. A decision tree is sensitive to new data points (stability).
9. CRISP-DM starts with pre-processing and finishes with evaluation.
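A quick check of item 2 (a sketch using the base-2 Shannon entropy H = -sum_i p_i * log2(p_i), which is maximal for a uniform class distribution):

```python
import math

def entropy(class_counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(class_counts)
    return sum(-(c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(entropy([5, 5]))   # balanced classes -> 1.0 (the maximum for two classes)
print(entropy([9, 1]))   # imbalanced      -> ~0.469
print(entropy([10, 0]))  # pure            -> 0.0
```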
Q2
What is the ACM Statement on Algorithmic Transparency and Accountability? Describe it and explain why it is important for Data Mining. Which parts does it consist of?
Q3
What is binning? What is 1-n coding? Describe each and give an example. In which cases should they be used?
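A minimal illustration of both concepts (the attribute names, bin boundaries, and value domain are made up):

```python
# Binning: a ratio-scaled attribute (e.g. temperature in °C) becomes ordinal.
def bin_temperature(t):
    if t < 10:
        return "cold"
    elif t < 25:
        return "mild"
    return "hot"

print([bin_temperature(t) for t in (3, 18, 30)])  # ['cold', 'mild', 'hot']

# 1-n coding (one-hot coding): a nominal attribute with n distinct values
# becomes n binary columns, so no artificial order is imposed on the values.
OUTLOOKS = ("sunny", "overcast", "rainy")

def one_hot(value, domain=OUTLOOKS):
    return [1 if value == v else 0 for v in domain]

print(one_hot("overcast"))  # [0, 1, 0]
```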
Q4
Naive Bayes: Calculation
Test 2019
2018W: Test B - 07.01.2019
Q1: Multiple Choice
5 questions with 4 answer options each, "all-or-nothing" grading => 25 points
- About Spark, MapReduce, MLlib, ...
Q2: Single Choice
15 True/False questions, 2 points for right answer, -1 for wrong answer, 0 for no answer given => 30 points
- Binning turns nominal attributes into ordinal attributes
- Ordinal data allows distances to be computed
- Ordinal data does not allow distances to be computed
- The remaining questions were similar to those in previous exams.
Q3
Open question => 20 points
- Give at least 4 of the 7 principles of the ACM Statement on Algorithmic Transparency and Accountability. Describe them and explain why they are important for Data Mining.
- See for example: https://www.acm.org/binaries/content/assets/public-policy/2017_usacm_statement_algorithms.pdf
Q4
Open question => 25 points
- Naive Bayes: Calculation
- Example given: predict if someone will be playing some sport under the given features.
- Features: Outlook, Temperature, Humidity, Wind
- Comparable to this example: https://www.youtube.com/watch?v=CPqOCI0ahss
2018W: Test A - 21.03.2019
Q1
5 all-or-nothing questions about Big Data analysis
Q2
True/False questions about ML/DM, 2 points for right answer, -1 for wrong answer, 0 for no answer given
Q3
Given a matrix with distances between points, perform single-linkage and complete-linkage clustering. (You had to do the first 3 steps for each.)
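A minimal sketch of the merge procedure on a hypothetical distance matrix (the points A-D and all distance values are made up; the exam expects these steps done by hand):

```python
import itertools

# Hypothetical symmetric distance matrix over points A-D (made-up values).
dist = {
    frozenset("AB"): 2.0, frozenset("AC"): 6.0, frozenset("AD"): 10.0,
    frozenset("BC"): 3.0, frozenset("BD"): 9.0, frozenset("CD"): 7.0,
}

def cluster_distance(c1, c2, mode):
    # Single linkage = MIN over all point pairs, complete linkage = MAX.
    pair_dists = [dist[frozenset((p, q))] for p in c1 for q in c2]
    return min(pair_dists) if mode == "single" else max(pair_dists)

def agglomerate(mode, steps=3):
    clusters = [{p} for p in "ABCD"]
    for _ in range(steps):
        if len(clusters) < 2:
            break
        # Merge the closest pair of clusters under the chosen linkage.
        c1, c2 = min(itertools.combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair, mode))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)
        print(mode, "merged", sorted(c1 | c2), "->", [sorted(c) for c in clusters])

agglomerate("single")
agglomerate("complete")
```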
Q4
k-NN: Given samples of persons with various attributes and a class label, predict the class label of a new sample.