TU Wien:Advanced Methods for Regression and Classification VU (Filzmoser)/Possible Exam Questions and Answers


Notes from lecture: Questions

List of questions given in last lecture:

  1. Some formulas should be known: (1.1), (2.1), F-test, comparing RSS for model selection / variable selection
  2. <math>R^2</math>: fit measure, for the fit of the model to the training data; adjusted <math>R^2</math>: possibility for model selection
  3. Information criteria: derivation not important, but the result. AIC vs. BIC: different penalty, <math>2p</math> vs. <math>\log(n)\,p</math>
  4. Resampling: CV: 5-, 10-fold. Bootstrapping
  5. Least Squares: linear model, LS estimator <math>\hat{\beta} = (X^\top X)^{-1} X^\top y</math>. When problematic? Near singularity of <math>X^\top X</math>
  6. How to deal with this? Variable selection: AIC, BIC; forward, backward, both; best subset regression
  7. PCR, principal component regression. What and why?
  8. Partial Least Squares. Involves also the response variable, unlike PCR!
  9. Shrinkage methods: when is the sum of the <math>\hat{\beta}_j^2</math> large? In case of near singularity of <math>X^\top X</math>, some <math>\hat{\beta}_j</math> may be very large, and we don’t want that.
  10. The formula <math>\hat{\beta}_{\mathrm{Ridge}} = (X^\top X + \lambda I)^{-1} X^\top y</math> should be known: regularization of the solution!
  11. Lasso: shrinkage of the <math>\beta_j</math>. No explicit formula for the solution. Why do we get zeros for some of the <math>\hat{\beta}_j</math>? (This aims at the figure with the norm balls for the <math>L_1</math> norm, where the LS solution is more likely to hit the corners, where some <math>\hat{\beta}_j</math> are zero.)
  12. Classification: basic formulation using Bayes' theorem; what are posterior, prior probability, likelihood? Assumptions on distributions? Mostly Gaussians
  13. LDA vs. QDA: assumptions on the covariance matrices; discriminant functions
  14. RDA: compare to LDA, QDA
  15. Logistic regression: formula. No distribution assumptions, we get a weighted LS problem (Newton-Raphson algorithm)
  16. Nonlinear methods: splines, what is that? Definition on p. 67 (bottom)
  17. Problems/parameters: order of splines, number and placement of knots
  18. Natural cubic splines, linear extrapolation outside the data
  19. Smoothing splines; (7.1) formula for the penalized RSS, penalization of curvature. How to select the degrees of freedom? What is the role of <math>\lambda</math>?
  20. GAM, GLM: what is that? Formula for PRSS(...) = ....
  21. Tree-based methods: regression. p. 88: objective function, <math>\hat{c}_m</math>, <math>Q_m(T)</math>
  22. How to tune between under- and overfitting
  23. Classification: Gini, deviance measure?
  24. Random forests: what are they doing?
  25. Bootstrapped data, out-of-bag data, variable importance?
  26. SVM: objective function, constraints
  27. What is the kernel trick?

Answers

Some Formulae

Classical linear model with the linear function (1.1):
<math>y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon</math>

In matrix-vector notation: <math>y = X\beta + \varepsilon</math>, with <math>X</math> the <math>n \times (p+1)</math> data matrix (including a column of ones for the intercept).

[p. 14f]

The residuals are the difference between the dependent (target) variable <math>y_i</math> and the model prediction <math>\hat{y}_i</math>, for every data sample <math>i</math>: <math>r_i = y_i - \hat{y}_i</math>.

This is the cost function that is to be minimized to get an optimal linear fit (in the least-squares sense) to the data points:
<math>\mathrm{RSS}(\beta) = \sum_{i=1}^n \big(y_i - x_i^\top \beta\big)^2</math>

The F-test, comparing the RSS of two nested models, is one method for model selection.

Testing several of the model coefficients to be zero: we have two different models <math>M_0</math> and <math>M_1</math> with <math>p_0 < p_1</math> explanatory variables, where <math>M_0</math> is nested in <math>M_1</math>. We have the <math>H_0</math>: “the smaller model is true”, i.e. the additional coefficients introduced in the larger model are not significantly different from 0. This yields the following test statistic (2.1):
<math>F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(n - p_1 - 1)}</math>
Under the assumed <math>H_0</math> and Gaussian residuals (<math>\varepsilon_i \sim N(0, \sigma^2)</math>), the statistic above follows the <math>F</math>-distribution <math>F_{p_1 - p_0,\, n - p_1 - 1}</math>, with degrees of freedom <math>p_1 - p_0</math> (numerator) and <math>n - p_1 - 1</math> (denominator), respectively. If we cannot reject the <math>H_0</math>, <math>M_1</math> is not explaining the data significantly better; if we can reject the <math>H_0</math>, <math>M_1</math> does explain the data significantly better.
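
A minimal numerical sketch (hypothetical helper, not from the lecture notes) of how the test statistic and its p-value could be computed once both models are fitted and their RSS values are known:

<syntaxhighlight lang="python">
from scipy import stats

def f_test_nested(rss0, rss1, p0, p1, n):
    """F-test (2.1): smaller model M0 with p0 variables vs. a larger nested
    model M1 with p1 variables, both fitted on n observations."""
    df1 = p1 - p0                 # numerator degrees of freedom
    df2 = n - p1 - 1              # denominator degrees of freedom
    F = ((rss0 - rss1) / df1) / (rss1 / df2)
    p_value = stats.f.sf(F, df1, df2)   # P(F_{df1, df2} > F)
    return F, p_value

# Hypothetical RSS values: a small p-value means we reject H0 ("smaller model is true").
print(f_test_nested(rss0=120.0, rss1=95.0, p0=2, p1=4, n=50))
</syntaxhighlight>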

<math>R^2</math> as fit measure

This is the multiple <math>R^2</math>, the coefficient of determination; it describes the amount of variance explained by the model. The formula is given as:
<math>R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = \mathrm{Cor}(y, \hat{y})^2</math>

The last term is the square of the correlation between the data <math>y</math> and <math>\hat{y}</math>, that is the fitted values (the predicted values on the training set observations); note that <math>0 \le R^2 \le 1</math>.

The multiple <math>R^2</math> grows with the size of the model. Therefore, we use the adjusted <math>R^2</math>, which prevents this effect and is thus more suitable for comparing the fit of models of different sizes (number of variables). This is achieved by including the degrees of freedom in the calculation:
<math>R^2_{\mathrm{adj}} = 1 - \frac{n - 1}{n - p - 1}\,(1 - R^2)</math>
Remember: <math>n</math> is the number of observations, <math>p</math> the number of variables (not including the intercept).

<math>R^2</math> is a measure mostly for evaluating the fit to the training data (PF’s comment in lecture recap).
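
A small sketch (own helper, not from the notes) showing how both measures can be computed directly from the fitted values:

<syntaxhighlight lang="python">
import numpy as np

def r2_and_adjusted(y, y_hat, p):
    """Coefficient of determination and adjusted R^2 for a model with p
    predictors (intercept not counted) and fitted values y_hat."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)
    return r2, r2_adj
</syntaxhighlight>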

Information criteria AIC, BIC for model selection

Used for model selection; strictly only valid for nested models (nested: the variables of the smaller model are a subset of the variables of the larger model).

The value itself is meaningless; it is only useful for comparing different models.

Based on the maximized log-likelihood function of the model parameters given the data; the criterion to be minimized combines <math>-2\log L</math> with a penalty for the number of parameters. In the lecture notes, there is a derivation for the Gaussian likelihood function.

If the residual variance <math>\sigma^2</math> is unknown, it can be estimated using the ML method: <math>\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \mathrm{RSS}/n</math>.

The (maximized Gaussian) log-likelihood function can then be written as:
<math>\log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat{\sigma}^2) - \frac{n}{2}</math>

AIC and BIC thus look similar, but have a slightly different penalization of the number of variables <math>p</math>; BIC also includes the log of the number of data points <math>n</math>:
<math>\mathrm{AIC} = -2\log L + 2p \qquad \mathrm{BIC} = -2\log L + \log(n)\,p</math>

For <math>\log(n) > 2</math>, i.e. <math>n \ge 8</math>, BIC penalizes more strongly and thus leads to smaller models.
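
A rough sketch of how both criteria could be computed from the RSS under the Gaussian assumption; the additive constants and the exact parameter count may differ from the lecture notes, but they cancel when comparing models fitted to the same data:

<syntaxhighlight lang="python">
import numpy as np

def aic_bic_gaussian(rss, n, p):
    """AIC and BIC for a linear model with Gaussian errors, computed from the
    residual sum of squares; p counts the estimated coefficients."""
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = -2 * loglik + 2 * p
    bic = -2 * loglik + np.log(n) * p
    return aic, bic
</syntaxhighlight>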

Resampling: k-fold Cross Validation, Bootstrapping

Use a loss function <math>L(y, \hat{f}(x))</math>, typically the squared error (MSE).

Used instead of test set, to measure performance of model on unseen data (data it was not trained on).

The given <math>n</math> data samples are split into <math>k</math> parts. One part is chosen to be the test set, the other <math>k-1</math> parts are used as training data. For the <math>j</math>-th part, we calculate the prediction error, e.g. the MSE. So we basically get <math>k</math> different models (estimates of the modelled function <math>f</math>), and thus <math>k</math> different test set prediction errors.

Note that we get for each observation <math>i</math> exactly one predicted value! From these we calculate the CV error as the average of the errors over all examples:
<math>\mathrm{CV} = \frac{1}{n}\sum_{i=1}^n L\big(y_i, \hat{f}^{-\kappa(i)}(x_i)\big)</math>
where <math>\hat{f}^{-\kappa(i)}</math> denotes the model fitted without the part containing observation <math>i</math>.

Types:

  • <math>k = n</math>: leave-one-out CV (LOOCV). High computational effort, and high variance due to the training sets being very similar
  • <math>k</math>-fold CV with <math>k = 5</math> or <math>k = 10</math> (see the sketch below).
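
A minimal sketch of k-fold CV for a plain least-squares fit (hypothetical helper, numpy only; it assumes X already contains an intercept column):

<syntaxhighlight lang="python">
import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    """k-fold CV error (MSE) for an ordinary least-squares fit."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for j in range(k):
        test = folds[j]
        train = np.concatenate([folds[m] for m in range(k) if m != j])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return np.mean(errors)
</syntaxhighlight>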

Randomly draw, with replacement, data sets from the original data. The bootstrap samples have the same size <math>n</math> as the original data set. This is done <math>B</math> times, i.e. we get <math>B</math> bootstrap samples. To each of the bootstrap samples a model is fit and evaluated with the loss function. I.e., in contrast to CV, there is no (simulated) test set here? We simply evaluate the fit on the bootstrap (training) data?

Bootstrap error is usually too optimistic.

There is also leave-one-out bootstrap, see lecture notes section 2.4.2, "Bootstrap", page 11 (lecture notes version WS 2018, version from January 2019).

Least Squares estimation, Linear model

In matrix-vector notation: <math>y = X\beta + \varepsilon</math>. LS estimator for the coefficient vector: <math>\hat{\beta} = (X^\top X)^{-1} X^\top y</math>. Fitted values: <math>\hat{y} = X\hat{\beta} = X(X^\top X)^{-1} X^\top y</math>.
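
A small synthetic example (invented data) computing the LS estimator via the normal equations and, numerically preferable, via a least-squares solver:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + predictors
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # normal equations
beta_hat_qr = np.linalg.lstsq(X, y, rcond=None)[0]  # numerically more stable
y_hat = X @ beta_hat
print(beta_hat, beta_hat_qr)
</syntaxhighlight>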

How to deal with problems in ordinary LS, near singularity of <math>X^\top X</math>?

[p. 20]

For (highly) correlated variables, <math>X^\top X</math> becomes almost singular, the inverse <math>(X^\top X)^{-1}</math> will have very large entries, and the statistical tests for the coefficients being zero become unreliable. We can thus not make reliable inference, and the coefficients are not well determined (numerically, there is more than one almost equivalent solution).

Note that the inverse <math>(X^\top X)^{-1}</math> shows up in the covariance matrix of the coefficient estimator's distribution; thus, if its entries get very large near singularity, the covariance will blow up:
<math>\hat{\beta} \sim N_{p+1}\big(\beta, \sigma^2 (X^\top X)^{-1}\big)</math>
This is valid for our usual assumptions about the residuals, being iid zero-mean Gaussian variables.

Variable selection can get around that problem, by selecting variables that are not correlated. We can use methods that take derived values of the input variables instead of the variables themselves, see PCR, PLS (and continuum regression); furthermore, regularization methods can be used (see Lasso, Ridge).

Principal Component Regression (PCR)

Basic idea: decompose the data matrix into its principal components, and use only the subset explaining a certain amount of variance. The regression is then done on (a subset of) the principal components instead of the original data. This will eliminate from the estimation procedure the contributions of variables that are (highly) correlated. The degree of lost “information”, or rather explained variance of the data, can be tuned by selecting more or less principal components. For predicting on the original data space, the model needs to be transformed back. As PCA is not scale invariant, the data should be centered and scaled beforehand (the centers and standard deviations should be kept for re-scaling and adding the means back after transforming back to the original data space, if desired).

In PCR, we only look at the data <math>X</math>, but not at the response <math>y</math>, i.e. not at whether the chosen components actually help to get better predictions.

Eigenvalue/eigenvector decomposition of the covariance matrix <math>\Sigma</math> of the data:
<math>\Sigma = V A V^\top</math>
where <math>V</math> is the orthogonal (<math>p \times p</math>) matrix of the eigenvectors and <math>A = \mathrm{diag}(a_1, \dots, a_p)</math> is the diagonal matrix of the eigenvalues of the covariance matrix. The eigenvalues <math>a_j</math> are also the variances of the components <math>z_j</math> of the scores <math>Z = XV</math> (for centered <math>X</math>).

We can select only <math>q < p</math> principal components for the regression, e.g. only the components adding up to a certain total explained variance.

The model looks like:
<math>y = Z\theta + \varepsilon, \qquad Z = XV</math>

The model using only <math>q</math> components:
<math>y = Z_q \theta_q + \varepsilon</math>
where <math>Z_q</math> contains the first <math>q</math> columns of <math>Z</math> and <math>V_q</math> the corresponding eigenvectors.

LS estimate:
<math>\hat{\theta}_q = (Z_q^\top Z_q)^{-1} Z_q^\top y</math>

Transforming back to the original variables:
<math>\hat{\beta}_{\mathrm{PCR}} = V_q \hat{\theta}_q</math>

TO DO: CHECK and reformulate if necessary.
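
A sketch of the whole PCR procedure as described above (centering/scaling, eigendecomposition of the covariance, regression on the first q scores, back-transformation); the helper name and the scaling conventions are my own:

<syntaxhighlight lang="python">
import numpy as np

def pcr_fit(X, y, q):
    """Principal component regression: regress y on the first q
    principal components of the centered and scaled data X."""
    x_mean, x_sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    y_mean = y.mean()
    Xs = (X - x_mean) / x_sd
    eigval, eigvec = np.linalg.eigh(np.cov(Xs, rowvar=False))
    order = np.argsort(eigval)[::-1]            # sort components by variance
    V_q = eigvec[:, order[:q]]
    Z_q = Xs @ V_q                              # scores of the first q components
    theta_q = np.linalg.lstsq(Z_q, y - y_mean, rcond=None)[0]
    beta = V_q @ theta_q / x_sd                 # back-transform to the original scale
    intercept = y_mean - (x_mean * beta).sum()
    return intercept, beta
</syntaxhighlight>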

Partial Least Squares (PLS)

Data and targets should be centered. (Why? Centering removes the intercept from the model, so that the covariances below can be written as simple cross products.)

In contrast to PCA, we try to find projections that maximize (besides the variance of the components?) the covariance between the projections and the predicted variable, but keep different projections uncorrelated with each other.

For <math>k = 1, \dots, q</math>:
<math>w_k = \arg\max_{w} \mathrm{Cov}(Xw, y)</math>
under the constraints <math>\|w_k\| = 1</math> and <math>\mathrm{Cov}(Xw_k, Xw_j) = 0</math> for <math>j = 1, \dots, k-1</math> (i.e. the scores <math>z_k = Xw_k</math> are uncorrelated with the previous ones).

Each <math>w_k</math> makes a weighted combination of the columns of <math>X</math> into a single score vector <math>z_k = Xw_k</math> that is highly correlated with <math>y</math>. The scores are then stacked into a matrix <math>Z = (z_1, \dots, z_q)</math>.

The model looks like:
<math>y = Z\theta + \varepsilon</math>
and is again fitted by LS on the scores, analogously to PCR.

Shrinkage Methods: Lasso, Ridge

Ridge and Lasso are methods for regularization, i.e., stabilizing LS methods for estimating regression coefficients in cases of (highly) correlated variables and thus cases where <math>X^\top X</math> is close to singularity (see also above).

We penalize the size of the coefficients <math>\beta_j</math>. Why do we do that, why are we concerned about large values of the <math>\beta_j</math> within <math>\beta</math>? Because we get exactly such inflated coefficients if we are near singularity of <math>X^\top X</math>.

Ridge Regression, cost function

[p. 23f]

The cost function looks similar to the classical RSS, but has as an important difference an additive regularization term (3.5):
<math>\hat{\beta}_{\mathrm{Ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\}</math>
<math>\lambda \ge 0</math> is called the complexity parameter and controls the amount of shrinkage. Larger <math>\lambda</math> means more shrinkage. Coefficients are shrunk towards zero and thus towards each other. The regularization term penalizes the sum of the squared coefficients, i.e. the squared <math>L_2</math>-norm of the coefficient vector. (Note: the intercept is not included in the penalization term! So it is a slight abuse of notation to speak of the norm of the whole coefficient vector here.)

It can be re-written as the classical RSS cost function to be solved under the constraint <math>\sum_{j=1}^p \beta_j^2 \le s</math>, with some finite positive <math>s</math>.

More important is probably the solution to the cost function, written in matrix-vector notation:
<math>\hat{\beta}_{\mathrm{Ridge}} = (X^\top X + \lambda I)^{-1} X^\top y</math>
This makes evident the addition of <math>\lambda</math> to the diagonal of <math>X^\top X</math> before inverting, which ensures non-singularity.

  • Ridge solution is not scale invariant, so standardize input.
  • Intercept not included in penalization term.
  • Both points above mean that the Ridge regression estimator is not affine equivariant (not sure this was discussed in this class, see e.g. Multivariate Statistik).

There is a relation between Ridge regression and PCR.

TO DO: how to choose <math>\lambda</math>, using CV or something? Check notes.
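
A sketch of the closed-form Ridge solution with standardized predictors and an unpenalized intercept (the helper name and scaling conventions are my own):

<syntaxhighlight lang="python">
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate; predictors are centered/scaled so the intercept is not
    penalized. Returns intercept and coefficients on the original scale."""
    x_mean, x_sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    Xs = (X - x_mean) / x_sd
    yc = y - y.mean()
    p = Xs.shape[1]
    beta_s = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    beta = beta_s / x_sd
    intercept = y.mean() - (x_mean * beta).sum()
    return intercept, beta
</syntaxhighlight>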

Lasso Regression

The cost function is again the classical RSS with a side constraint (x.x):
<math>\hat{\beta}_{\mathrm{Lasso}} = \arg\min_{\beta} \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le s</math>
Note that again the intercept is not included in the penalization term. Here the sum of the absolute values is constrained, i.e. the <math>L_1</math>-norm of the coefficient vector (without intercept). This is the difference to the Ridge regression case (in terms of the cost function).

We get a sparse solution, i.e. we get zeros in the coefficient vector <math>\hat{\beta}</math>. Zeros in the coefficient vector basically mean variable selection, i.e. a smaller model.

But, in contrast to Ridge, we get no closed-form solution and have to resort to iterative solvers, e.g. coordinate descent (a.k.a. the shooting algorithm).

TO DO: how to select a good <math>s</math> (or <math>\lambda</math> in the penalized form), and why the <math>L_1</math> constraint sets some coefficients to 0 more readily (norm balls with corners on the axes).
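
As an illustration with scikit-learn (which works with the penalized form of the problem and, in LassoCV, selects the penalty by cross-validation); note how many coefficients come out exactly zero. The data are invented:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)  # only 2 relevant variables

Xs = StandardScaler().fit_transform(X)
model = LassoCV(cv=5).fit(Xs, y)   # penalty parameter chosen by cross-validation
print(model.alpha_)                # chosen penalty
print(model.coef_)                 # sparse coefficient vector
</syntaxhighlight>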

Classification, Bayes Theorem

Q: Name the involved parts, i.e. posterior, prior probability, likelihood function. Assumptions on the involved (family of) distributions?

A quick refresher of (one specific incarnation of) Bayes' theorem. Here, <math>f</math> is a probability density function (pdf), <math>P</math> is a probability, <math>A_j</math> is some event and <math>x</math> is a <math>p</math>-dimensional observation:
<math>P(A_j \mid x) = \frac{f(x \mid A_j)\, P(A_j)}{\sum_{k} f(x \mid A_k)\, P(A_k)}</math>
<math>P(A_j)</math> is the prior probability of the event <math>A_j</math>. <math>f(x \mid A_j)</math> is a conditional pdf of <math>x</math>, conditioned on the event <math>A_j</math>; it is termed likelihood function in the context of Bayesian methods. <math>P(A_j \mid x)</math> is the posterior (or a-posteriori) probability of the event <math>A_j</math> given the observation <math>x</math>. The denominator follows from the law of total probability and normalizes the numerator to the range of a probability, i.e. <math>[0, 1]</math> (the <math>A_k</math> have to be mutually disjoint and their union must cover the whole probability space <math>\Omega</math>; the event from the numerator is of course included among the <math>A_k</math>). We will see that the denominator is not important here, so don’t worry too much about it.

Let’s specialize this formula to the notation given in the lecture notes [p. 48]: Our event here takes the form of the statement “the class (or group) is <math>j</math>”, where <math>j \in \{1, \dots, G\}</math> simply labels one of the possible groups that the observations belong to. Now, <math>P(j \mid x)</math> is the (posterior) probability that, given a specific <math>x</math>, the group label is some number <math>j</math>. <math>f_j(x)</math> is the likelihood function and simply uses a different notation; think of this as simplifying the notation like so: <math>f_j(x) = f(x \mid j)</math>. This notation no longer makes it clear that this is a conditional pdf, but it is. For the prior probability, the notation is changed to <math>\pi_j = P(j)</math>.

Here, we assume the involved probability distributions to be multivariate Gaussians (note that a likelihood function, being a conditional pdf, has all the properties of a pdf). We denote the multivariate Gaussian pdf, fully parameterized by a mean vector <math>\mu</math> and covariance matrix <math>\Sigma</math>, as <math>\varphi(x; \mu, \Sigma)</math>. Note that, as this is a conditional pdf conditioned on a certain group <math>j</math>, we index the pdf and the parameters accordingly (<math>f_j</math>, <math>\mu_j</math>, <math>\Sigma_j</math>):
<math>f_j(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_j|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_j)^\top \Sigma_j^{-1}(x - \mu_j)\right)</math>
where <math>|\Sigma_j|</math> is the determinant of <math>\Sigma_j</math>.

This is the basis we need for formulating Linear and Quadratic Discriminant Analysis (LDA, QDA). For the derivations, different assumptions on the covariance matrices are made, see below. In general, the necessary parameters <math>\mu_j</math>, <math>\Sigma_j</math> and <math>\pi_j</math> are estimated from a set of training examples (<math>n_j</math> many per group <math>j</math>) that are labelled with their corresponding group (we need these parameters for each group <math>j</math>).

As a teaser, we will see:

  • LDA: all covariance matrices are assumed equal, i.e. <math>\Sigma_j = \Sigma</math> for all <math>j</math>.
  • QDA: covariance matrices unequal for the groups: <math>\Sigma_j \ne \Sigma_k</math> for <math>j \ne k</math>.

LDA vs. QDA

The main assumption behind LDA (besides Gaussian distributions) is that the covariance matrix is identical for all groups, <math>\Sigma_j = \Sigma</math>. This means that we assume all groups (clusters) in the data to have the same shape (remember how the covariance matrix defines the shape and axis directions of the Gaussian pdf). This may limit the capacity to model the data, but we need to estimate far fewer parameters in comparison to QDA (more on that later). There is a specific way to estimate the overall pooled covariance matrix from the training data (it is not group specific and thus has no index <math>j</math>):
<math>\hat{\Sigma} = \frac{1}{n - G} \sum_{j=1}^{G} \sum_{i \in I_j} (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^\top</math>
where <math>n</math> is the number of training examples, <math>G</math> the number of groups, and <math>I_j</math> the set of indices of the training examples belonging to group <math>j</math>. Here, the inner sum estimates a covariance matrix for the respective group over the training examples belonging to that group (up to scaling), and the outer sum pools all these per-group covariance matrices into one matrix.

The estimated covariance matrix can also be computed by first centering the columns of the data matrix within each group (i.e., subtracting the estimated mean of each column per group), and then arranging the data points belonging to each group into a distinct, centered data matrix <math>X_j</math> (there will be <math>G</math> of those). Then the common pooled covariance matrix can be computed by pooling these cross products:
<math>\hat{\Sigma} = \frac{1}{n - G}\sum_{j=1}^G X_j^\top X_j</math>
Comment: I took this from the lecture notes for Multivariate Statistik, if I remember correctly. It is presented there as a method for estimating the covariance matrix, and was adapted here to calculate the pooled common covariance matrix.

For a derivation of the linear discriminant function (it is linear in the observations <math>x</math>) see the lecture notes. It has the form
<math>\delta_j(x) = x^\top \Sigma^{-1}\mu_j - \tfrac{1}{2}\mu_j^\top \Sigma^{-1}\mu_j + \log \pi_j</math>
and involves the group mean, the pooled covariance matrix, the prior probability, and the respective observation <math>x</math> that we now want to classify, i.e. assign the most probable class label to. We do this by plugging our specific <math>x</math> into the <math>\delta_j(x)</math> for all different <math>j</math> and assigning the <math>j</math> that yields the maximum value of all the <math>\delta_j(x)</math>. This can also be described formally as
<math>\hat{G}(x) = \arg\max_j \delta_j(x)</math>
where <math>\hat{G}(x)</math> is then the <math>j</math> yielding the largest <math>\delta_j(x)</math>. (This is a maximum a-posteriori (MAP) decision rule.)

For QDA, we do not assume the covariance matrices of the different groups (clusters) to be identical; thus we have one specific covariance matrix <math>\Sigma_j</math> per group <math>j</math>. Clearly, this means we need to estimate more parameters than in the LDA case, but the model is more flexible, as we can have different shapes of the groups.

The discriminant function is more complicated than in the LDA case, and is first and foremost no longer linear in the observations <math>x</math> (note the quadratic form involving <math>\Sigma_j^{-1}</math>, which is basically the squared Mahalanobis distance of <math>x</math> and <math>\mu_j</math>):
<math>\delta_j(x) = -\tfrac{1}{2}\log|\Sigma_j| - \tfrac{1}{2}(x - \mu_j)^\top \Sigma_j^{-1}(x - \mu_j) + \log \pi_j</math>
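
A sketch of LDA estimation and the MAP classification rule following the formulas above (hypothetical helpers; QDA would additionally keep one covariance matrix per group):

<syntaxhighlight lang="python">
import numpy as np

def fit_lda(X, labels):
    """Estimate LDA parameters: group means, priors, and the pooled covariance."""
    groups = np.unique(labels)
    n, G = len(labels), len(groups)
    means = {j: X[labels == j].mean(axis=0) for j in groups}
    priors = {j: np.mean(labels == j) for j in groups}
    pooled = sum(
        (X[labels == j] - means[j]).T @ (X[labels == j] - means[j]) for j in groups
    ) / (n - G)
    return means, priors, np.linalg.inv(pooled)

def lda_predict(x, means, priors, pooled_inv):
    """Assign x to the group with the largest linear discriminant value."""
    scores = {
        j: x @ pooled_inv @ means[j]
           - 0.5 * means[j] @ pooled_inv @ means[j]
           + np.log(priors[j])
        for j in means
    }
    return max(scores, key=scores.get)
</syntaxhighlight>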

Regularized Discriminant Analysis (RDA) vs. LDA, QDA

A compromise between LDA and QDA. It uses a different version of the pooled common covariance matrix, created as a weighted average of the groups' distinct covariance matrices, for groups <math>j = 1, \dots, G</math> (note that these are the matrices estimated from the data):
<math>\hat{\Sigma} = \frac{\sum_{j=1}^G n_j \hat{\Sigma}_j}{\sum_{j=1}^G n_j}</math>
where <math>n_j</math> is the number of examples in the respective group.

Then, for each group <math>j</math>, a regularized (estimated) covariance matrix can be calculated:
<math>\hat{\Sigma}_j(\alpha) = \alpha\, \hat{\Sigma}_j + (1 - \alpha)\, \hat{\Sigma}</math>
where <math>\alpha \in [0, 1]</math> balances between LDA (<math>\alpha = 0</math>: only the pooled common covariance matrix; note that this matrix, calculated as stated above, is slightly different as it uses weighting in the averaging) and QDA (<math>\alpha = 1</math>: only the individual covariance matrices). A value for the hyperparameter <math>\alpha</math> can be estimated using cross-validation.

Logistic Regression

As we formulate this as a likelihood problem, we don’t make distributional assumptions (maybe check this).

Supervised method for classification, i.e. we have a labeled (with groundtruth) dataset available. We want to find a linear function that separates the groups (classes).

The posterior probability for a group (class) label, given a data point <math>x</math>, is <math>P(j \mid x)</math>, for a group label <math>j</math>. Further down, the notation of the posterior will change to <math>p_j(x; \beta)</math>, more on that later. The subscript <math>j</math> is the group label for groups 1, 2 (in the two-class case), for each data point <math>x_i</math>, where <math>i = 1, \dots, n</math>.

The log-ratios of posterior probabilities for different groups (for each specific data point) are modeled using affine functions (see lecture notes). For the case of <math>G = 2</math> groups (classes), we get
<math>\log \frac{P(1 \mid x)}{P(2 \mid x)} = \beta_0 + \beta^\top x</math>
The posteriors must sum to 1, which gives
<math>P(1 \mid x) = \frac{\exp(\beta_0 + \beta^\top x)}{1 + \exp(\beta_0 + \beta^\top x)}, \qquad P(2 \mid x) = \frac{1}{1 + \exp(\beta_0 + \beta^\top x)}</math>

For estimating the parameters <math>\beta_0</math>, <math>\beta</math>, we can use the maximum likelihood (ML) method. As stated above, the notation in the lecture notes for the posterior probabilities changes at this point like so: <math>P(j \mid x) = p_j(x; \beta)</math>. Furthermore, the vector <math>\beta</math> is now assumed to include the intercept coefficient and is thus of dimension <math>p + 1</math>. The data points <math>x_i</math> are then of course also supposed to contain the intercept (have a 1 inserted as the first entry), and have the same dimension as the coefficient vector.

The log-likelihood function is now written as
<math>l(\beta) = \sum_{i=1}^n \log p_{g_i}(x_i; \beta)</math>
where <math>g_i</math> is the group of observation <math>i</math>. This can be brought to an easier form (for the two-class case with 0/1 coding of the response), see lecture notes.

We seek to maximize the log-likelihood, i.e. to find the coefficients <math>\beta</math> that achieve this. This involves taking the derivative of the log-likelihood and setting it to zero.

An iterative solution using the Newton-Raphson algorithm can be found, this involves also the second derivative, the so called Hessian matrix.

The update equation for the coefficients can also be written using matrices and vectors directly. The algorithm used for solving is also called iteratively reweighted least squares (IRLS). Update equation:
<math>\beta^{\mathrm{new}} = (X^\top W X)^{-1} X^\top W z, \qquad z = X\beta^{\mathrm{old}} + W^{-1}(y - p)</math>
where <math>p</math> is the vector of fitted probabilities <math>p_i = p_1(x_i; \beta^{\mathrm{old}})</math> and <math>W</math> is a diagonal matrix with entries <math>W_{ii} = p_i(1 - p_i)</math>.

I.e., we get a weighted LS solution to the problem.

Note that the target values that appear in the LS equation are simply indicators for the class membership, i.e., there is for each row in the data matrix a corresponding indicator <math>y_i</math>. For the two-class case, the indicators are 1 or 0, accordingly.
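
A sketch of the Newton-Raphson/IRLS iteration for the two-class case, following the update equation above (the helper name is my own; no safeguards against weights becoming exactly zero):

<syntaxhighlight lang="python">
import numpy as np

def logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Two-class logistic regression via IRLS.
    X should contain a leading column of ones; y holds 0/1 class indicators."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))       # fitted probabilities
        W = p * (1.0 - p)                    # diagonal of the weight matrix
        z = eta + (y - p) / W                # working response
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
</syntaxhighlight>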

The lecture notes say that logistic regression is used for modeling and inference. “The goal is to understand the role of the input variables in explaining the outcome”.

TO DO: Check how the inference works; what kind of information is given by the method about the variables in <math>X</math>?

It turns out that LDA also has a linear function modeling the log-ratio of the posterior probabilities. LDA achieves this by the assumption of equal covariance matrices for all groups, and the assumption of normally distributed data in the groups. However, the ways of estimating the coefficients are different: logistic regression makes no assumptions about the (marginal) distribution of the data points <math>x</math>, whereas LDA assumes a (multivariate) Gaussian mixture model.

Nonlinear Methods: Splines

Splines are piecewise polynomial functions. For order <math>M</math>, a spline has continuous derivatives up to order <math>M - 2</math>. So, for up to the 2nd derivative being continuous, we need <math>M = 4</math>, so-called cubic splines.

There are different (formulations of?) bases available for splines, the B-splines are numerically more stable.

Parameters for Spline Regression

Parameters of splines:

  • the order of the splines, typically <math>M = 4</math> (cubic splines)
  • number of knots
  • placement of the knots. Must (?) be manually chosen, e.g. at appropriate percentiles of <math>x</math>

Natural cubic splines

Have the additional constraint that they are linear beyond the boundary knots (outermost knots). Thereby, they avoid the possibly erratic continuation of the splines beyond the outermost knots (see figures in lecture notes).

Smoothing splines

From all functions <math>f</math> with two continuous derivatives, we seek the one that minimizes the penalized RSS (7.1):
<math>\mathrm{RSS}(f, \lambda) = \sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda \int \big(f''(t)\big)^2\, dt</math>
Here, <math>f''</math> is the (continuous) second derivative of the function <math>f</math>. <math>\lambda</math> is the smoothing parameter that balances between the influence of the “classical” RSS (which measures the fit of the function to the data) and the penalization term, which penalizes the function for its curvature. Two special cases:

  • <math>\lambda = 0</math>: <math>f</math> can be any function that interpolates the data (so, it could be very curvy).
  • <math>\lambda \to \infty</math>: LS fit, no second derivative of the function is tolerated (i.e., only an affine function is possible, simple linear regression).

The unique minimizing solution of the penalized RSS above is a natural cubic spline with knots at the (unique) <math>x_i</math>-values of the data. The data here are collections of pairs <math>(x_i, y_i)</math>, <math>i = 1, \dots, n</math>.

We use the following setup for our function (natural cubic spline basis expansion):
<math>f(x) = \sum_{j=1}^n N_j(x)\,\theta_j</math>
i.e., we have <math>n</math> basis spline functions <math>N_j</math>, each one with its own coefficient <math>\theta_j</math>.

The cost function can be rewritten in matrix-vector notation:
<math>\mathrm{RSS}(\theta, \lambda) = (y - N\theta)^\top (y - N\theta) + \lambda\, \theta^\top \Omega_N\, \theta</math>
Here <math>\{N\}_{ij} = N_j(x_i)</math>, i.e., each of the basis functions is evaluated at each of the data points (= knots) <math>x_i</math>; thus <math>N</math> has dimension <math>n \times n</math>. Furthermore, we have <math>\{\Omega_N\}_{jk} = \int N_j''(t)\, N_k''(t)\, dt</math>; note that each integral yields a scalar, and <math>\Omega_N</math> has dimension <math>n \times n</math> as well. Finally, <math>\theta</math> has dimension <math>n</math>.

The solution is then
<math>\hat{\theta} = (N^\top N + \lambda \Omega_N)^{-1} N^\top y</math>
This should be compared to the solution of e.g. ridge regression (see lecture notes). Note that the matrix <math>N</math>, the encoding of the data values into the spline basis functions, takes the place of the data matrix <math>X</math>, and the regularization term involving <math>\Omega_N</math>, weighted by <math>\lambda</math>, is added before inverting.

Choosing the value for <math>\lambda</math> (equivalently, the effective degrees of freedom) can be done using cross-validation.

CHECK: <math>\theta</math> plays the role of the <math>\beta</math> here? Check what the other equations here look like, etc.
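
The structure of the solution can be sketched generically; constructing the natural spline basis N and the penalty matrix Ω_N is omitted here. In practice a library routine (e.g. R's smooth.spline or scipy's smoothing-spline helpers) would build the basis and select <math>\lambda</math>:

<syntaxhighlight lang="python">
import numpy as np

def penalized_ls(N, y, Omega, lam):
    """Generic penalized least squares: theta = (N'N + lam*Omega)^{-1} N'y.
    N[i, j] = N_j(x_i) is the basis matrix, Omega the curvature penalty matrix."""
    return np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
</syntaxhighlight>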

Generalized Additive and Linear Models (GAM, GLM)

GLM Generalized Linear Models
GAM Generalized Additive Models
For GLMs, the right-hand side is a linear combination of the input variables (of a <math>p</math>-dimensional vector <math>x</math>). For GAMs, the right-hand side is an additive combination of (possibly nonlinear) functions of the input variables.

We have the conditional expectation of the target value <math>y</math>, given the corresponding data <math>x</math>: <math>\mu(x) = E(y \mid x_1, \dots, x_p)</math>. The GAM model looks like this, for a so-called link function <math>g</math>:
<math>g\big(\mu(x)\big) = \alpha + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)</math>
There are different link functions available for different purposes, see lecture notes.

For regression (with Gaussian residuals), the link function is the identity, i.e. <math>g(\mu) = \mu</math>.

Cost function: penalized residual sum of squares (PRSS), for pairs <math>(x_i, y_i)</math>:
<math>\mathrm{PRSS}(\alpha, f_1, \dots, f_p) = \sum_{i=1}^n \Big(y_i - \alpha - \sum_{j=1}^p f_j(x_{ij})\Big)^2 + \sum_{j=1}^p \lambda_j \int \big(f_j''(t_j)\big)^2\, dt_j</math>
Larger <math>\lambda_j</math> lead to smoother functions <math>f_j</math>.

Independent of the choice of the <math>\lambda_j</math>, an additive model with cubic splines minimizes the PRSS above.

A unique solution needs the following constraints:

  • <math>\sum_{i=1}^n f_j(x_{ij}) = 0</math> for all <math>j</math> (otherwise the constant <math>\alpha</math> is not identifiable)
  • non-singularity of the data matrix <math>X</math>

The solution can be found using the backfitting algorithm. General idea: initialize <math>\alpha = \bar{y}</math> and <math>f_j \equiv 0</math>; then repeatedly cycle over <math>j = 1, \dots, p</math>, fitting <math>f_j</math> by applying a (cubic smoothing spline) smoother to the partial residuals <math>y_i - \alpha - \sum_{k \ne j} f_k(x_{ik})</math> and re-centering <math>f_j</math>, until the functions stabilize.

Tree-based methods

Trees for regression and for classification.

We have data <math>(x_i, y_i)</math>, <math>i = 1, \dots, n</math>, for <math>p</math>-dimensional vectors <math>x_i</math>.

Regression Trees

Partitioning the space of the <math>x</math>-variables into <math>M</math> regions <math>R_1, \dots, R_M</math>. The response variable is modelled as a constant <math>c_m</math> in each region:
<math>\hat{f}(x) = \sum_{m=1}^M c_m\, I(x \in R_m)</math>
where <math>I</math> is the indicator function.

A region <math>R_m</math> with <math>N_m</math> observations inside is also referred to as the node <math>m</math>.

A tree can be grown to a certain size <math>T_0</math>, where the splitting is stopped when a minimum node size is reached. By pruning, i.e. collapsing internal nodes (an internal node represents a split, not a terminal position in the tree), one can get sub-trees <math>T \subset T_0</math>. The number of terminal nodes (regions) in <math>T</math> is denoted as <math>|T|</math>.

Estimating the constant approximating the response in each region <math>R_m</math>: <math>\hat{c}_m = \frac{1}{N_m}\sum_{x_i \in R_m} y_i</math> (the mean of the responses in the region). Node impurity at node <math>m</math>: <math>Q_m(T) = \frac{1}{N_m}\sum_{x_i \in R_m} (y_i - \hat{c}_m)^2</math>. Cost-complexity measure, to be minimized over all sub-trees <math>T \subset T_0</math> for a given <math>\alpha</math>:
<math>C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|</math>
For every <math>\alpha</math> there exists a unique smallest sub-tree <math>T_\alpha</math> that minimizes <math>C_\alpha(T)</math>. The optimal <math>\alpha</math> can be found with cross-validation. The actual tree for a chosen <math>\alpha</math> is then found by weakest-link pruning, i.e. successively collapsing the internal node that produces the smallest increase in <math>\sum_m N_m Q_m(T)</math>.
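
As an illustration with scikit-learn, which exposes cost-complexity pruning via the ccp_alpha parameter; the data are invented:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Grow a large tree, then inspect the cost-complexity pruning path (candidate alphas)
tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
path = tree.cost_complexity_pruning_path(X, y)

# Choose alpha by cross-validation over the candidates from the path
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y,
                    cv=5, scoring="neg_mean_squared_error").mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
</syntaxhighlight>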

Classification Trees

Partition the (space of the) <math>x</math>-variables into regions, and assign to each region one of the <math>K</math> levels (classes) of the output variable.

Proportion of examples belonging to class <math>k</math> in node <math>m</math>:
<math>\hat{p}_{mk} = \frac{1}{N_m}\sum_{x_i \in R_m} I(y_i = k)</math>

The class for the node <math>m</math>, denoted as <math>k(m)</math>, is then assigned as follows: <math>k(m) = \arg\max_k \hat{p}_{mk}</math> (the majority class in the node).

Measures of node impurity (also denoted <math>Q_m(T)</math>):

  • Gini index: <math>\sum_{k=1}^K \hat{p}_{mk}(1 - \hat{p}_{mk})</math>
  • Deviance or cross-entropy: <math>-\sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk}</math>
  • Misclassification error: <math>1 - \hat{p}_{m\,k(m)}</math>

Cross-entropy and the Gini index are differentiable.

Random Forests

Consist of many trees (hence forest). The randomness comes from using bootstrap samples instead of the whole data set, and from considering at each node only a random selection of the variables for splitting, so that every tree is different.

A bootstrap sample contains about 2/3 of the distinct observations; the rest is the so-called out-of-bag (OOB) data.

For the OOB data of each tree, compute the prediction error; then permute, for each variable, its values in the OOB data and compute the error again; use the (averaged) difference as a measure of variable importance.

For classification, use majority voting over the trees to reach a decision. For regression, use the average of the trees' predictions.
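
An illustration with scikit-learn on invented data; note that its permutation importance is computed on a supplied data set rather than tree-wise on the OOB samples, so it only approximates the procedure described above:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",   # random subset of variables considered at each split
    oob_score=True,        # evaluate on the out-of-bag samples
    random_state=0,
).fit(X, y)

print(rf.oob_score_)       # OOB estimate of the accuracy

# Permutation-based variable importance
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(imp.importances_mean)
</syntaxhighlight>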

SVM (Support Vector Machines)

The perceptron algorithm gives a separating hyperplane in case the data are linearly separable, but only one of (in this case) infinitely many possible solutions. Support Vector Machines (SVM) can give an optimal solution for such data.

Distance: the margin <math>M</math> is the minimum distance of any observation to the separating hyperplane <math>\{x : \beta_0 + x^\top \beta = 0\}</math>, where the signed distance of observation <math>i</math> is <math>\frac{1}{\|\beta\|}\, y_i (\beta_0 + x_i^\top \beta)</math>, with class label <math>y_i \in \{-1, +1\}</math> of observation <math>i</math>.

The margin is to be maximized, under side constraints:
<math>\max_{\beta_0,\, \beta,\, \|\beta\| = 1} M \quad \text{subject to} \quad y_i(\beta_0 + x_i^\top \beta) \ge M, \quad i = 1, \dots, n</math>

This means we seek the parameters <math>\beta_0</math>, <math>\beta</math>, where <math>\|\beta\| = 1</math>, that maximize <math>M</math> such that all (training) examples are correctly classified and lie at least at distance <math>M</math> from the hyperplane (this is what the constraint says). SVMs are sometimes called maximum-margin classifiers, as their cost function seeks to maximize the margin.

For data that are not linearly separable, there is the idea of soft-margin classifiers, that introduces slack variables in the optimization scheme.

TO DO: re-formulating the optimization problem.

The Kernel Trick

Transforming the data to a higher dimensional space, where they can be separated linearly.

Basis expansion of the data: <math>x \mapsto h(x)</math>; the function defining the hyperplane then becomes <math>f(x) = \beta_0 + h(x)^\top \beta</math>.

The inner product of the transformed data is replaced by a kernel function <math>K</math>:
<math>K(x, x') = \langle h(x), h(x') \rangle</math>

Examples: Linear kernel: <math>K(x, x') = \langle x, x' \rangle</math>. Radial basis function (RBF) kernel: <math>K(x, x') = \exp(-\gamma \|x - x'\|^2)</math>.
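
A small illustration with scikit-learn on invented data: for data that are not linearly separable, the RBF kernel lets the SVM find a nonlinear decision boundary without ever computing h(x) explicitly:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # not linearly separable

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)       # kernel trick: no explicit h(x)

print(linear_svm.score(X, y), rbf_svm.score(X, y))            # RBF separates the moons better
print(rbf_svm.support_vectors_.shape)                         # the support vectors
</syntaxhighlight>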