1 Homework 6: Multivariate Regression
1.1 Purpose
Homework 6 is meant to give you some practice on understanding what can go
wrong with multivariate regression.
1.2 What needs to be returned?
Please upload a typed-out solution to the following questions to CourseWorks
before class starts.
1.3 Math to Code
1.3.1 Q1
Define a random vector with 3 random variables:
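$X \sim \text{Normal}(0, 10)$
$Y \sim \text{Exp}(\lambda = 0.1)$
$Z = Y + 2 X + \epsilon$, where $\epsilon \sim \text{Unif}[-5, 5]$
Please assume that $X$, $Y$, and $\epsilon$ are all independent of one another.
Please calculate the theoretical values of the covariance matrix
$\text{Cov}\big((X, Y, Z)^T\big)$.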
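One way to sanity-check a hand calculation is to write (X, Y, Z) as a linear map of the independent vector (X, Y, ε) and use Cov(Av) = A Cov(v) Aᵀ. The sketch below rests on several assumptions that the handout does not confirm: that the second argument of Normal(0, 10) is a variance, that 0.1 is the rate of the exponential, that the noise is Unif[-5, 5], and that `2 X` denotes the product 2X. The names `D`, `A` and `sigma` are illustrative only.

```python
import numpy as np

# Assumed variances of the independent components (X, Y, eps).
var_x = 10.0                         # assumes Normal(0, 10) lists the variance
var_y = 1.0 / 0.1**2                 # Exp(rate=0.1) has variance 1 / rate^2
var_eps = (5.0 - (-5.0))**2 / 12.0   # Unif[a, b] has variance (b - a)^2 / 12

D = np.diag([var_x, var_y, var_eps])

# (X, Y, Z) = A @ (X, Y, eps), assuming Z = Y + 2X + eps.
A = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [2.0, 1.0, 1.0],
])

sigma = A @ D @ A.T  # Cov(A v) = A Cov(v) A^T
print(sigma)
```

If Normal(0, 10) is instead meant to give a standard deviation, only `var_x` changes (to 100).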
1.3.2 Q2, numerically approximating covariances
Please test out 2 sample sizes, 100 and 10000 to numerically approximate the 3
by 3 covariance matrix from Q1 via simulation.
You should use the sample covariance to approximate the theoretical covariance
matrix, i.e. use $\hat{\Sigma}_n$ to approximate $\Sigma$, where $\hat{\Sigma}_n$ is the sample
covariance based on sample size $n$.
Create 200 simulations for each sample size to approximate the covariance
matrix above (i.e. you would have 200 covariance estimates for each sample size).
If the theoretical covariance matrix is $\Sigma$ and the estimate is $\hat{\Sigma}$, define the
error as
$\|\Sigma - \hat{\Sigma}\|_2 = \|D\|_2 = \sqrt{\sum_{i=1}^{3} \sum_{j=1}^{3} D_{i,j}^2}$.
This is called the Frobenius norm for the matrix $D$. Please report the 2.5 to
97.5 percentile values of the Frobenius norm across the 200 simulations for each
sample size. Please comment on the sample size's effect on the accuracy of the
numerical approximation of $\Sigma$.
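The simulation loop described above might be organised as in the following sketch. It assumes the same conventions as a variance-parameterised Normal(0, 10), rate 0.1 for the exponential, Unif[-5, 5] noise, and Z = Y + 2X + ε; the hard-coded `sigma` is the theoretical matrix under those assumptions only. The function names `simulate_xyz` and `frobenius_errors` are hypothetical.

```python
import numpy as np

def simulate_xyz(n, rng):
    """Draw n samples of (X, Y, Z) under the assumed parameterisation."""
    x = rng.normal(0.0, np.sqrt(10.0), size=n)      # assumes variance 10
    y = rng.exponential(scale=1.0 / 0.1, size=n)    # rate 0.1 -> scale 10
    eps = rng.uniform(-5.0, 5.0, size=n)
    z = y + 2.0 * x + eps                           # assumes Z = Y + 2X + eps
    return np.column_stack([x, y, z])

def frobenius_errors(n, sigma, n_sims=200, seed=0):
    """Frobenius norm of (sigma - sample covariance) for each simulation."""
    rng = np.random.default_rng(seed)
    errors = np.empty(n_sims)
    for s in range(n_sims):
        sigma_hat = np.cov(simulate_xyz(n, rng), rowvar=False)
        errors[s] = np.linalg.norm(sigma - sigma_hat, ord="fro")
    return errors

# Theoretical covariance under the assumptions stated above.
sigma = np.array([
    [10.0, 0.0, 20.0],
    [0.0, 100.0, 100.0],
    [20.0, 100.0, 140.0 + 100.0 / 12.0],
])

errors = {n: frobenius_errors(n, sigma) for n in (100, 10_000)}
for n, err in errors.items():
    lo, hi = np.percentile(err, [2.5, 97.5])
    print(f"n={n}: 2.5%={lo:.2f}, 97.5%={hi:.2f}")
```

Fixing the seed keeps the reported percentiles reproducible; a different seed will shift them slightly.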
1.3.3 Q3, a common abuse of the term "sample size"
In this question, the term "sample size" is used in 2 different ways, a usage
that is common and confusing.
In the regression setting, we often describe $Y = X\beta + \epsilon$, where
$\text{Cov}(\epsilon|X) = \sigma^2 I$ as an $n \times n$ matrix. This $\sigma^2 I$ is the
theoretical covariance.
In the case where $n = 20$, i.e. the sample size of the regression is 20, please
write the code that would numerically approximate the covariance matrix
$\sigma^2 I$, using the sample size with the smaller error from Q2, by simulating
different $\epsilon$ vectors. Please set $\sigma^2 = 4$. I'm intentionally not
prescribing the distribution of $\epsilon$; choose your favorite distribution. :)
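To make the two meanings concrete, the sketch below treats each simulated noise vector of length 20 as one observation and uses 10,000 such vectors to estimate the 20 by 20 covariance. Two choices here are assumptions: that 10,000 is the sample size with the smaller error in Q2, and that the noise is Gaussian (any distribution with variance 4 would serve).

```python
import numpy as np

n = 20             # regression sample size: length of each noise vector
n_draws = 10_000   # number of simulated noise vectors (assumed choice from Q2)
sigma2 = 4.0

rng = np.random.default_rng(0)
# Each row is one simulated noise vector with i.i.d. Normal(0, sigma2) entries.
eps = rng.normal(0.0, np.sqrt(sigma2), size=(n_draws, n))

cov_hat = np.cov(eps, rowvar=False)   # 20 x 20 sample covariance
error = np.linalg.norm(cov_hat - sigma2 * np.eye(n), ord="fro")
print(cov_hat.shape, round(float(error), 3))
```

Here `n` plays the role of the dimension of the random vector, while `n_draws` is the sample size in the Q2 sense.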
1.3.4 Q4
Let there be 20 samples. Let $X_1 \sim \text{Bernoulli}(0.3)$, let $X_2 \sim \text{Unif}[-10, 10]$,
let $X_3 \sim \text{Normal}(100, 10)$, let $X_4 = 2 X_1 X_2 + 0.3 X_3$, let $X_5 = 1 - X_1$,
and finally define $X_0$ as the constant feature of 1s. Please examine the
eigenvalues of the matrix $X^T X$ with the following definitions of $X$ and report
whether $(X^T X)^{-1}$ exists.
Note, the notation below indicates combining the vectors by columns
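As a sketch of the eigenvalue check, the code below uses two hypothetical column combinations chosen only for illustration; they are not necessarily among the definitions of X this question asks about. The helper names `gram_eigenvalues` and `is_invertible`, and the relative tolerance, are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x0 = np.ones(n)
x1 = rng.binomial(1, 0.3, size=n).astype(float)
x2 = rng.uniform(-10.0, 10.0, size=n)
x5 = 1.0 - x1

def gram_eigenvalues(*columns):
    """Eigenvalues of X^T X, where X stacks the given vectors as columns."""
    X = np.column_stack(columns)
    return np.linalg.eigvalsh(X.T @ X)   # X^T X is symmetric, so eigvalsh applies

def is_invertible(eigenvalues, rel_tol=1e-10):
    # Treat eigenvalues below rel_tol * (largest eigenvalue) as numerically zero.
    return bool(eigenvalues.min() > rel_tol * eigenvalues.max())

eig_a = gram_eigenvalues(x0, x1, x2)   # hypothetical design without exact collinearity
eig_b = gram_eigenvalues(x0, x1, x5)   # hypothetical design where x0 = x1 + x5
print(eig_a, is_invertible(eig_a))
print(eig_b, is_invertible(eig_b))
```

An eigenvalue that is zero up to floating-point error signals that the columns are linearly dependent, so the inverse does not exist.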
1.4 Simultaneous Inference
1.4.1 Q5
Imagine $Z \sim \text{Binomial}(n = 100, p = 0.05)$; please report a 95% prediction
interval for $Z$. Note, by convention, prediction intervals are centered at the
expected value, but this is not technically required.
A 95% prediction interval is any interval that, when predicting the value of
$Z$, has a 95% chance of containing $Z$.
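One possible construction, following the centering convention, widens a symmetric integer interval around the mean until it covers at least 95% of the probability mass. Because Z is discrete, the coverage generally exceeds 95% rather than equalling it. The function names are hypothetical, and the sketch assumes n * p is an integer so the interval can be centered exactly.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Z = k) for Z ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def centred_prediction_interval(n, p, level=0.95):
    """Smallest interval [mean - k, mean + k] with coverage >= level."""
    centre = round(n * p)
    for k in range(n + 1):
        lo, hi = max(centre - k, 0), min(centre + k, n)
        coverage = sum(binom_pmf(j, n, p) for j in range(lo, hi + 1))
        if coverage >= level:
            return lo, hi, coverage
    return 0, n, 1.0

lo, hi, coverage = centred_prediction_interval(100, 0.05)
print(lo, hi, round(coverage, 4))
```

Other valid answers exist, for example a one-sided interval starting at 0, since the definition only requires 95% coverage.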
1.4.2 Q6
Let the sample size be 1000 and $Y \sim \text{Normal}(0, 10)$, then create 99 random
features that are completely uncorrelated with $Y$. Please regress $Y$ on these
features and report the number of significant features using pointwise
hypothesis tests, i.e. $|\hat{\beta}_i / \widehat{SE}(\hat{\beta}_i)| \geq t(n - p, 97.5)$ would identify a
significant feature.
Recall that $\widehat{SE}(\hat{\beta}_i) = \sqrt{[\text{Cov}(\hat{\beta}|X)]_{i,i}}$, where we use $\hat{\sigma}^2$ to
approximate $\sigma^2$. Since you have the intercept, there should be a total of 100
tests being performed.
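A minimal sketch of the pointwise tests follows. It assumes Normal(0, 10) lists a variance and draws the features as standard normals independent of Y (uncorrelated in the population, though not exactly so in any finite sample). To stay within the standard library plus NumPy, the t quantile is approximated with a first-order Cornish-Fisher correction to the normal quantile; this is a swapped-in approximation, and `scipy.stats.t.ppf` gives the exact value if SciPy is available. All function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def t_quantile_approx(q, df):
    """First-order Cornish-Fisher approximation to the Student-t quantile."""
    z = NormalDist().inv_cdf(q)
    return z + (z**3 + z) / (4.0 * df)

def pointwise_t_stats(rng, n=1000, n_features=99):
    """Regress pure-noise Y on an intercept plus noise features; return t-stats."""
    y = rng.normal(0.0, np.sqrt(10.0), size=n)      # assumes variance 10
    features = rng.normal(size=(n, n_features))     # generated independently of y
    X = np.column_stack([np.ones(n), features])     # intercept + 99 features
    p = X.shape[1]
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)            # plug-in estimate of sigma^2
    se = np.sqrt(sigma2_hat * np.diag(xtx_inv))     # sqrt of diagonal of Cov(beta_hat | X)
    return beta_hat / se, n - p

t_stats, df = pointwise_t_stats(np.random.default_rng(0))
n_significant = int(np.sum(np.abs(t_stats) >= t_quantile_approx(0.975, df)))
print(n_significant, "of", t_stats.size, "tests flagged")
```

Since every true coefficient is 0, each flagged feature is a false positive; roughly 5 of the 100 tests would be expected to reject on average.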
1.4.3 Q7
Continuing from Q6, let us adjust the problem by using the Bonferroni correction
to perform simultaneous inference. Please write the code that would
numerically show that the false positive rate from Bonferroni is at most 5%
over 1000 simulations. In other words, the probability of calling at least one
feature significant when all coefficients are 0 is upper bounded by 5%.
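The simultaneous version can be sketched by dividing the 5% level across all 100 two-sided tests and recording whether any test rejects in each simulated data set. As assumptions: Normal(0, 10) is read as variance 10, the features are independent standard normals, and the t quantile is approximated by a first-order Cornish-Fisher correction rather than computed exactly (`scipy.stats.t.ppf` would give the exact cutoff). Function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def t_quantile_approx(q, df):
    """First-order Cornish-Fisher approximation to the Student-t quantile."""
    z = NormalDist().inv_cdf(q)
    return z + (z**3 + z) / (4.0 * df)

def any_bonferroni_rejection(rng, n=1000, n_features=99, alpha=0.05):
    """True if at least one of the p tests rejects at level alpha / p."""
    y = rng.normal(0.0, np.sqrt(10.0), size=n)      # assumes variance 10
    X = np.column_stack([np.ones(n), rng.normal(size=(n, n_features))])
    p = X.shape[1]                                  # number of simultaneous tests
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    resid = y - X @ beta_hat
    se = np.sqrt(resid @ resid / (n - p) * np.diag(xtx_inv))
    cutoff = t_quantile_approx(1.0 - alpha / (2 * p), n - p)
    return bool(np.any(np.abs(beta_hat / se) >= cutoff))

def familywise_error_rate(n_sims=1000, seed=0):
    """Share of simulations with at least one false positive."""
    rng = np.random.default_rng(seed)
    return sum(any_bonferroni_rejection(rng) for _ in range(n_sims)) / n_sims

rate = familywise_error_rate()
print(f"estimated familywise false positive rate: {rate:.3f}")
```

With only 1000 simulations the estimate carries Monte Carlo noise of roughly 0.7 percentage points, so a single run can land slightly above 5% even though the bound holds in expectation.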
1.5 Interpreting your model
1.5.1 Q8
Let the sample size be 200. Define $X \sim \text{Normal}(10, 10)$, let $Z \sim \text{Bernoulli}(0.4)$,
and let $Y = 5 X + 2 Z + 3 X Z + \epsilon$ where $\epsilon \sim \text{Unif}[-3, 3]$. Please run
the regression using all the data that includes the interaction effect and report
the coefficients; let's call these $\hat{\beta}$. You should have 4 coefficients.
Side note, this is an example of how you can imagine data to be generated
from different groups that have different intercepts and slopes.
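A sketch of the interaction regression is given below. It assumes that the juxtaposed terms in the definition of Y denote products (Y = 5X + 2Z + 3XZ + ε), that Normal(10, 10) lists a variance, and that the noise is Unif[-3, 3]; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(10.0, np.sqrt(10.0), size=n)     # assumes variance 10
z = rng.binomial(1, 0.4, size=n).astype(float)
eps = rng.uniform(-3.0, 3.0, size=n)
y = 5.0 * x + 2.0 * z + 3.0 * x * z + eps       # assumes juxtaposition means product

# Columns: intercept, X, Z, and the interaction X * Z.
design = np.column_stack([np.ones(n), x, z, x * z])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

for name, value in zip(["intercept", "x", "z", "x:z"], beta_hat):
    print(f"{name}: {value:.3f}")
```

The interaction column lets the slope on X differ between the Z = 0 and Z = 1 groups, while the Z column lets the intercepts differ.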
1.5.2 Q9
Please regress $Y$ on $X$ using only the values where $Z = 0$. Repeat this
regression using only the values where $Z = 1$. Report those coefficients; let's
call them $\hat{\beta}_0$ and $\hat{\beta}_1$ respectively.
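The two group-wise fits can be sketched as follows, under the same assumptions about the data-generating process (products for juxtaposed terms, variance 10 for X, Unif[-3, 3] noise). The helper names `ols` and `fit_group` are hypothetical.

```python
import numpy as np

def ols(design, y):
    """Least-squares coefficients for the given design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 200
x = rng.normal(10.0, np.sqrt(10.0), size=n)     # assumes variance 10
z = rng.binomial(1, 0.4, size=n)
y = 5.0 * x + 2.0 * z + 3.0 * x * z + rng.uniform(-3.0, 3.0, size=n)

def fit_group(group):
    """Regress Y on an intercept and X using only rows with Z == group."""
    mask = z == group
    return ols(np.column_stack([np.ones(mask.sum()), x[mask]]), y[mask])

beta0_hat = fit_group(0)   # [intercept, slope] from rows with Z = 0
beta1_hat = fit_group(1)   # [intercept, slope] from rows with Z = 1
print(beta0_hat.round(3), beta1_hat.round(3))
```

Each fit has only 2 coefficients because Z is constant within a group and so cannot enter as a feature.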
1.5.3 Q10
Assume the parameters below refer to the coefficients from $\hat{\beta}$, $\hat{\beta}_0$ and
$\hat{\beta}_1$. Please answer the following:
The intercept for $\hat{\beta}$ equals which other parameter?
The intercept for $\hat{\beta}_1$ is the sum of which 2 parameters?
The slope for $\hat{\beta}_0$ is the same as which other parameter?
The slope for $\hat{\beta}_1$ equals a linear combination of which other parameters?
1.5.4 Q11
Q8-Q10 show a case where we can obtain identical regression estimates by
regressing with interactions or by training 2 separate regression models. Are
the standard errors for these estimates the same, yes/no?
A thought you should have: which method would you choose if someone
asked you to choose? (No need to answer this question for Q11.)