This is the first graduate course in Econometrics. It requires familiarity with matrix algebra, the foundations of probability theory, and a willingness to spend a lot of time in front of a computer.
Course outline:
This link takes you to the class outline as a PDF file. For the same material in html, follow this link.
Class Notices:
The final exam is now available. Follow this link to see or download it. The exam is due by 13.00 on Thursday December 12.
Class meets on Tuesdays and Thursdays, in Leacock 424, from 10.00 until 11.30.
Our TA this year is Raphaël Langevin.
Today, Thursday November 28, Raphaël will have office hours at 16.00 in Leacock 424. He suggests that students would do well to attend, as many did not do very well on Assignment 4, and he would like to explain why.
Mercury Course Evaluation is now open and will be so until December 4. Please give your evaluation of this course, as it may help me to do better in future.
The principal reference for this course used to be the textbook Econometric Theory and Methods (ETM), Oxford University Press, ISBN 0-19-512372-7, by James MacKinnon and me. Some of you may be interested to know that the book has been translated into Chinese and Russian. However, it is now much better to rely mainly on the new ebooks Foundations of Econometrics Part 1 and Part 2. These books are partially based on ETM, but are shorter, since they contain little more than the material needed for this course.
The ETM textbook, the new ebooks, and also an older textbook, Estimation and Inference in Econometrics (EIE), Oxford University Press, ISBN 0-19-506011-3, also by James MacKinnon and me, are available at this link as PDF files, and can be accessed or downloaded completely free of charge. We ask you to respect the copyright, which has been held by us, the two authors, since 2021.
Be warned that the book referred to as EIE treats things at a more advanced level than is needed for this course. It may well still serve as a useful reference for certain things.
If you have, or can find, a hard copy of either book, please note that both of them have undergone a number of printings, with a few corrections incorporated each time. Even if you have the first printing of either book, that would serve perfectly well, since all corrections, right from the beginning, are available on the book homepage, at
http://qed.econ.queensu.ca/pub/dm-book/
The versions available on the website are of course the most up to date.
Some people seem to think that good empirical practice is all there is to econometrics. In one sense, that is so, but my experience has shown me that there is no good empirical practice without a good mastery of the underlying theory. It can be tempting to think of econometrics as a set of cookbook recipes, especially as so many of these recipes are made available by modern software. But it is all too easy to apply recipes wrongly if you do not understand the theory behind them. (This remark also applies to the cooking of food!) Thus the second vital aspect of econometric practice is understanding what data are telling you. Although I can make you do exercises that should make you competent in the implementation of a number of procedures, no one can (directly) teach you how to interpret the results of these procedures. Such interpretation is more an art than a science, and can therefore be taught best by example. Unfortunately, we do not have too much time for that. But even if some of the exercises you will be given in the assignments use simulated rather than real data, I will try to make you think of how your results can be interpreted. Making a practice of that may well save you from purely formal errors in the exercises.
Log of material covered:
During the first class on August 29, we discussed the contents of the first chapter of Foundations. The main claim in this chapter is that scientific models can be interpreted as virtual reality. Discussion of causality, both necessary and sufficient, led to the formal definition of a data-generating process, or DGP. We considered the role of randomness in econometric models, and looked at the properties of deterministic random-number generators, or RNGs. The role of counterfactuals in any analysis of causality was stressed, especially in the context of randomised trials and of natural experiments.
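The deterministic nature of an RNG is easy to demonstrate: the same seed always reproduces the same "random" sequence, which is exactly what makes simulation experiments reproducible. A minimal sketch using NumPy's generator interface:

```python
import numpy as np

# A deterministic RNG: the same seed always yields the same "random" draws.
rng1 = np.random.default_rng(seed=42)
rng2 = np.random.default_rng(seed=42)

draws1 = rng1.normal(size=5)
draws2 = rng2.normal(size=5)

# Identical seeds give identical sequences, so any simulation can be
# reproduced exactly by anyone who knows the seed.
print(np.allclose(draws1, draws2))  # True
```

Recording the seed alongside simulation results is good practice for every assignment in this course.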
After that, we started on Chapter 2, on the linear regression model. The first section looks only at the simple regression model, with a constant term and only one non-constant regressor. There are two parameters in the regression function, the intercept, and the slope coefficient. This was illustrated by the very simple model of the consumption function.
The next class was on Tuesday September 3. We began with section 2.2, with material that should be mostly familiar, on probability, random variables, distributions, expectations, moments, and such things. In particular, conditional versions of all of these things were introduced. Two special distributions were defined: the standard normal, denoted N(0,1), and the uniform distribution U(0,1). The main point of working through this section is to establish terminology and notation.
What is meant by the specification of a linear regression model came next, in section 2.3. There are two parts to the specification, the deterministic and the stochastic. A model, thought of as a set of DGPs, is said to be correctly specified if it contains the DGP that corresponds to the true DGP in external reality. The DGPs of the model are specified conditional on some explanatory variables, or, more generally, on an information set Ωt.
On September 5, we continued the discussion of the specification of a regression model, making a distinction between linear and nonlinear regression models. The linearity is with respect to the parameters, not the regressors. This leads on to how a linear regression model can be simulated. The important distinction was made between endogenous and exogenous variables. The latter constitute the information set on which the variables in the model are conditioned.
After that, the only part of section 2.4 we looked at was on partitioned matrices.
We made a start on section 2.5, on techniques of estimation, and introduced the method of moments. Frequently used terminology speaks of the sample and the population. The latter is for us a metaphor for the data-generating process. The twin concepts of estimate and estimator were introduced, the former being a realisation of the latter. An estimating equation sets equal to zero an estimating function, which is a function of data and parameters. Here the word data can be used in two ways: actual observations of numbers, and random variables generated by the DGP.
There was unfortunately no class on Tuesday September 10, since I was sick.
On September 12, we finished the remaining material in Chapter 2. Estimating functions are zero functions, that is, functions of both data and parameters of which the expectations are zero when the parameters are those of the true DGP. Estimating functions are usually linear combinations of elementary zero functions, which depend on the data of just one single observation. For a regression model, the obvious elementary zero functions are the residuals. If we choose the elements of the explanatory variables as the coefficients of the linear combinations, we end up with the ordinary least-squares, or OLS estimator. It can be defined as an M-estimator, where the estimator is given by minimising the sum of squared residuals, or as a Z-estimator given by the solution of estimating equations.
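The equivalence of the two definitions of OLS is easy to verify numerically. Below is a small sketch with simulated data: the Z-estimator solves the estimating equations X'(y − Xβ) = 0 (the normal equations), while the M-estimator minimises the sum of squared residuals; both give the same answer.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

# Z-estimator: solve the estimating equations X'(y - X beta) = 0,
# i.e. the normal equations X'X beta = X'y.
beta_z = np.linalg.solve(X.T @ X, X.T @ y)

# M-estimator: minimise the sum of squared residuals.
beta_m, *_ = np.linalg.lstsq(X, y, rcond=None)

# The two characterisations give the same OLS estimate.
print(np.allclose(beta_z, beta_m))  # True
```
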
Chapter 3 deals with the Geometry of least squares. After reminding ourselves of Pythagoras' Theorem, we developed geometric representations of the operations of linear algebra, that is, addition of vectors and scalar multiplication of vectors. If a linear, or vector, space also admits the scalar product, it becomes a Euclidean space, in which the angles between vectors can be defined. A subspace of a Euclidean space is defined as the set of linear combinations of a set of vectors in the original space. These can sometimes be represented in two or three dimensions geometrically.
We started with the important concept of linear dependence on September 17. We normally require the explanatory variables in a linear regression to be linearly independent, since, if the columns of the matrix X are linearly dependent, the matrix XTX is singular, meaning that its inverse does not exist.
The Geometry of OLS estimation constitutes the material of section 3.3 and much of section 3.4. The result of running a linear regression by OLS was represented geometrically, and this led naturally to the algebraic notion of orthogonal projections, characterised by idempotent symmetric matrices. We saw how the orthogonal projection on to the space spanned by a set of regressors is unchanged if the regressors are replaced by a linearly independent set of linear combinations of them, and illustrated this by switching between Fahrenheit and Celsius temperatures in a regression. The topic of section 3.4 is the Frisch-Waugh-Lovell (FWL) theorem. It is approached via two preliminary results. The first is illustrated by considering deviations from the mean. More generally, in a regression in which the regressors split into two subgroups, one may add linear combinations of those in one group to those of the other, without changing the coefficient estimates for the latter group.
The second preliminary result was the first topic on September 19. It is that, if the two groups of regressors are mutually orthogonal, leaving one group out does not change the estimates for the other group. The two preliminary results can now be combined so as to get the FWL theorem. An algebraic proof of the theorem is quite easy, but by itself doesn't give the intuitions that follow from the preliminary results.
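The FWL theorem itself can be checked numerically in a few lines. In this sketch with simulated data, the coefficients on the second group of regressors from the full regression coincide with those obtained by first projecting y and X2 off X1 and then regressing the residuals on the residuals.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])  # first group of regressors
X2 = rng.normal(size=(n, 2))                            # second group
y = X1 @ [1.0, 0.5] + X2 @ [2.0, -1.0] + rng.normal(size=n)

# Full regression of y on both groups.
X = np.hstack([X1, X2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# FWL: project y and X2 off X1 (M1 is the annihilator of X1), then
# regress the residuals on the residuals.
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
beta_fwl = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)[0]

# The coefficients on X2 agree exactly with those from the full regression.
print(np.allclose(beta_full[2:], beta_fwl))  # True
```
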
After that came the applications of the theorem in section 3.5, starting with seasonal variation. A method called seasonal adjustment by regression was presented, and contrasted with seasonal adjustment as practised by Statistics Canada and statistical institutions in other countries. We made a distinction between seasonal adjustment and taking account of seasonality in the context of an econometric model. If seasonal adjustment by regression is called deseasonalising a time series, we can detrend a series in a similar way, by incorporating a time trend in a linear regression. Both of these are specific cases of the fixed-effects model, the properties of which were treated.
Section 3.6 deals with the phenomena called leverage and influential observations, and we started with this on September 24. The influence of a single observation in a sample can be measured by the effect on the parameter estimates of leaving it out. This can be done by use of unit basis vectors, which are indicator variables for just one observation. The FWL theorem then lets us obtain an algebraic formula for the difference between the estimated parameter vector for the full sample and that for the sample without one observation. Next, we will interpret this formula, and derive some properties, whereby we can detect potentially influential observations, and measure their actual influence.
The potential influence of observation t in a linear regression model is determined by the quantity ht, the tth diagonal element of the projection matrix PX. Whether or not the potential influence is realised depends on the residual ût for this observation.
The quantities ht must lie between zero and one. If the regression has a constant, they must be no less than 1/n, where n is the sample size. Because of a property of the trace of a product of matrices, the sum of the ht is equal to k, the number of regressors. This brought us to the end of Chapter 3.
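These three properties of the leverages are straightforward to verify by computing the diagonal of the projection matrix PX for a simulated regressor matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# The diagonal of the orthogonal projection matrix P_X gives the leverages h_t.
PX = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(PX)

print(h.min() >= 1.0 / n)      # with a constant term, h_t >= 1/n
print(h.max() <= 1.0)          # h_t <= 1 always
print(np.isclose(h.sum(), k))  # trace property: the h_t sum to k
```
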
In Chapter 4, we defined linear regression models, and in particular the classical normal linear model, in which regressors must be exogenous. Models are defined as sets of DGPs, while a DGP can be thought of as a unique recipe for simulation. In parametric models, there is a parameter-defining mapping, which associates a vector of parameters to each DGP contained in the model.
An estimator is a deterministic function of the data in a dataset, these data being generated by a DGP, so that the estimator is a random variable. Its realisations are called estimates. In the context of a parametric model, an estimator may be biased or unbiased. As most estimators are defined as the solutions of estimating equations, estimating equations can also be biased or unbiased. Unfortunately, use of an unbiased estimating equation does not guarantee an unbiased estimator. We saw this in the case of a linear regression model with a lagged dependent variable as a regressor. This breaks the assumption of exogeneity, although the regressor may be predetermined and the disturbance an innovation.
As soon as the exogeneity assumption is violated, there are hardly any exact results available. Instead, recourse may be had to asymptotic theory to provide approximate results. But asymptotic theory relies on an asymptotic construction, which in principle may be chosen arbitrarily. The arbitrariness of an asymptotic construction was illustrated by use of a silly model, for which two quite different asymptotic constructions naturally suggest themselves.
On September 26, we began with the topic of stochastic convergence, that is, the convergence of sequences of random variables. The first type of stochastic convergence we looked at is almost sure convergence. This is the kind of convergence given by the Strong Law of Large Numbers. Next came convergence in probability, which is implied by almost sure convergence, but does not imply it. A quite different type of convergence is convergence in distribution, where there is in general no limiting random variable, but rather a limiting distribution. It was then necessary to introduce the big-O notation for the same-order relation, both for non-random quantities and for random variables.
We were then able to give a definition of the consistency of an estimator, and to see how this definition is similar to that of unbiasedness. However, unbiasedness and consistency are two distinct properties, and a sequence may well satisfy one but not the other, as shown by some pathological examples.
In Section 4.4, there are definitions of covariance matrices, correlation matrices, and positive definite and positive semidefinite matrices. The algebraic properties of these were examined. We used all these preliminaries to study the covariance matrix, and also the precision matrix, of the OLS estimator. The properties of these matrices depend on the covariance matrix of the disturbances, usually denoted as Ω. If it is a scalar matrix, we get the usual expression for the covariance matrix of the OLS estimator. However, we noted the possibilities of heteroskedasticity and autocorrelation.
The main focus of the class on October 1st was the Gauss-Markov Theorem, according to which the OLS estimator is BLUE (for Best Linear Unbiased Estimator). The theorem requires the exogeneity of the regressors and white-noise disturbances (homoskedastic and serially uncorrelated). The criterion used to assess efficiency is based on the difference of covariance matrices, or, alternatively, of precision matrices.
While setting the stage for the theorem, we interrupted ourselves to look at the errors of forecasts based on linear regression, and we saw that, in addition to the error induced by the random disturbance, there is also parameter uncertainty, caused by the fact that the parameters used in the forecast are just estimates, and model uncertainty, caused by the very likely inadequacy of the regression model used for forecasting.
In Section 4.7, the statistical properties of the OLS residuals are presented, under the assumptions of the Gauss-Markov theorem. Their covariance matrix is proportional to the orthogonal projection matrix MX. This means that the mean of the squared residuals is an underestimate of the true variance of the disturbances, but it is easy to construct an unbiased estimator, called s2.
The penultimate section in Chapter 4 deals with the misspecification of a model. Overspecification, whereby a regression model contains regressors with no explanatory power, is not misspecification, but it is worth considering, because it shows how it is necessary to specify a model before invoking the Gauss-Markov theorem. An underspecified model arises if relevant regressors are omitted. This is misspecification, and leads to the definition of a pseudo-true value. With misspecification, we can see the tradeoff between bias and variance. In particular, a biased estimator may have a smaller mean squared error than an unbiased one.
The final section of Chapter 4 introduces the coefficient of determination, usually known as the R2. It can be uncentred, centred, or adjusted, and it is widely used, and misused, as a measure of the goodness of fit of a regression.
On October 3, we moved on to Chapter 5, on hypothesis testing. We started with a long list of definitions. A hypothesis, called a null hypothesis, corresponds to a model, and the hypothesis is that the model is correctly specified. The alternative hypothesis serves as a framework for the null, in that the null is usually a restricted version of the alternative. A test statistic is a deterministic function of data, and it is usually chosen to satisfy two requirements: first that its distribution, as a random variable, is known and tractable if the data are generated by a DGP contained in the null model; second that its distribution is as different as possible when the DGP is in the alternative model but not in the null model.
A test is based on a test statistic, and it can either reject the null or not. The rejection rule usually takes the form of calling for rejection if the value of the test statistic is in a rejection region. Equal-tailed, two-tailed, and one-tailed tests can define quite different rejection regions for one and the same statistic. Usually such a region is defined by a critical value.
A Type I error is committed if a true null is rejected; a Type II error when a false null is not rejected. The significance level of a test is the desired probability of Type I error, while the power is one minus the probability of Type II error. An important concept is the P value, or marginal significance level, that is, the level at which one is at the margin of rejection and non-rejection. A P value is a useful way of transcending the binary nature of a test, as it allows each investigator to choose a personal significance level.
In Section 5.3, there are descriptions and recipes for simulation for various commonly used distributions. The first of these is the Multivariate Normal distribution, which can be constructed on the basis of a set of independent standard normal variables.
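A standard recipe for this construction, sketched below, uses the Cholesky factor of the target covariance matrix: if Σ = LL' and z is a vector of independent N(0,1) variables, then Lz has covariance matrix Σ. The particular Σ here is just an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(4)

# An illustrative target covariance matrix (must be positive definite).
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
L = np.linalg.cholesky(Sigma)  # Sigma = L L'

# Independent N(0,1) draws, transformed so that each row has covariance Sigma.
z = rng.normal(size=(100_000, 2))
x = z @ L.T

# The sample covariance matrix should be close to Sigma.
print(np.round(np.cov(x, rowvar=False), 2))
```
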
Class on October 8 was unusual. As on-campus activities were discouraged, only some of us met in the classroom, while the lecture was being recorded and then made available on myCourses.
After a quick review of the multivariate normal distribution, we moved on to the chi-squared (χ2) distribution, with a positive integer as the degrees-of-freedom number. The connection with the multivariate normal distribution was emphasised. After that, we looked at Student's t distribution, and the F distribution.
Section 5.4 embarks on exact tests in the context of the classical normal linear model. First, we saw how to construct a test statistic for one single restriction on the regression parameters of a linear regression, and showed that its distribution under the null is Student's t. For more than one restriction, a test statistic can be constructed such that its distribution under the null is the F distribution.
We then continued our study of the F statistic, developing several algebraic representations of the statistic, which was initially introduced as a function of the sums of squared residuals from the restricted and unrestricted models. One such representation led to a quick proof that that statistic does indeed have an F distribution under the null hypothesis. Next, we showed how the dependent variable is implicitly made the subject of a threefold orthogonal decomposition when the null and alternative regressions are run.
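The SSR form of the F statistic, and its exact null distribution in the classical normal linear model, can be illustrated by simulation: with normal disturbances and exogenous regressors, a 5%-level F test should reject a true null in very close to 5% of samples. A small sketch, testing the joint nullity of two slope coefficients:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 40, 2000
crit = stats.f.ppf(0.95, dfn=2, dfd=n - 3)  # 2 restrictions, k = 3 regressors

rejections = 0
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = 1.0 + rng.normal(size=n)  # null is true: both slope coefficients are zero
    # Unrestricted model: full X; restricted model: constant only.
    ssr_u = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    ssr_r = np.sum((y - y.mean()) ** 2)
    F = ((ssr_r - ssr_u) / 2) / (ssr_u / (n - 3))
    rejections += F > crit

print(rejections / reps)  # should be close to the nominal 5% level
```
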
The three components of the orthogonal decomposition were examined at the start of class on October 10. The first is not used by the F statistic, while the statistic itself is the ratio of the squared norms of the second and third components, each divided by its degrees-of-freedom number. Study of the numerator of the ratio shows clearly how the distribution of the statistic differs under the null and the alternative, thus permitting the statistic to perform the desired statistical discrimination of the alternative from the null.
An important application of the F test is given by the Chow test, developed by Gregory Chow. This test has a null hypothesis according to which the regression parameters are the same for two (or more) subsamples of the overall dataset, the alternative being that the parameters differ across the subsamples. The implementation of the test is quite subtle, although easy enough, and proceeds as a straightforward F test. Since all the work on the t and F tests is specific to the classical normal linear model, logically the next topic has to be the asymptotic theory behind use of these tests in more general linear regression models.
The two main tools of asymptotic theory are laws of large numbers and central limit theorems. The former leads to a deterministic or degenerate probability limit, while the second yields only convergence in distribution. In Section 5.5, we used these two tools to investigate the asymptotic normality and root-n consistency of the OLS estimator where the disturbances are not necessarily normal white noise, and the regressors are predetermined.
In section 5.6, with which we started class on October 22, the asymptotic properties of t and F test statistics are studied, and the result is that the exact results obtained for the classical normal linear model hold asymptotically under the weaker conditions used here. The last subsection introduces Wald tests, here applied to testing linear restrictions on the parameters of a linear regression, but much more generally applicable. For the linear setup, the Wald statistic is asymptotically equal to a multiple of an F statistic.
Section 5.7 deals with multiple testing, and it warns of many potential traps. Consideration of the family-wise error rate shows that using a set of tests with one or a small number of degrees of freedom rather than one joint test leads to loss of control of the probability of Type I error. The Bonferroni bound gives a way to recover control, but at the cost of a very conservative test. Other procedures, due to Simes, and to Benjamini and co-authors, alleviate the problem, but have no rigorous justification.
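The loss of control of the Type I error, and the Bonferroni remedy, can be seen in a two-line calculation. With m independent tests each at level α, the family-wise error rate is 1 − (1 − α)^m, which grows rapidly with m; testing each hypothesis at level α/m restores control, at the cost of conservatism.

```python
import numpy as np

# Family-wise error rate with m independent 5%-level tests,
# and the Bonferroni remedy of testing each at level alpha/m.
alpha, m = 0.05, 10

fwer_naive = 1 - (1 - alpha) ** m       # P(at least one Type I error)
fwer_bonf = 1 - (1 - alpha / m) ** m    # with the Bonferroni adjustment

print(round(fwer_naive, 3))  # about 0.401: far above the nominal 5%
print(round(fwer_bonf, 3))   # about 0.049: back under control
```
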
The next section, section 5.8, investigates the power of hypothesis tests, and introduces various non-central distributions, characterised by non-centrality parameters (NCP), which determine the power of the corresponding tests. We defined the non-central χ2 distribution, the non-central F distribution, and the non-central t distribution. A difficult result due to Das Gupta and Perlman says that, for a given NCP, the power of a test is a decreasing function of the degrees-of-freedom number.
Section 5.9 deals with pretesting. We defined what is meant by this term, and will show that it is usually a bad idea.
On October 24, we resumed the study of pretesting, normally a bad thing, although it has been used quite extensively in the past. However, it can sometimes be admissible, if a researcher is willing to tolerate some bias in order to have a reduction in variance. This tradeoff recurs often in econometrics, and in statistics more generally.
The first few sections of Chapter 6 deal with confidence intervals and confidence sets, and the important notion of coverage. A confidence set for a scalar parameter or a vector of parameters contains all values of the parameter(s) θ0 for which a test of the hypothesis that the true θ is equal to θ0 is not rejected at significance level α. We began with symmetrical equal-tail confidence intervals. This entailed a digression on quantiles. In fact, the critical values used to determine the limits of the confidence interval are quantiles of the null distribution of the test statistic, such as, for instance a t statistic.
A confidence set can be defined only in the context of a model. In order for a confidence set to have exact coverage, the test statistic must be a pivot, or be pivotal, for the model. Another way to say this is that the statistic must be a pivotal function, that is, a function of data and parameters such that, when it is evaluated at the true parameter values for some DGP contained in the model, its distribution under that DGP does not depend on the particular DGP in the model. If the statistic is merely an asymptotic pivot, the interval has only approximate coverage.
There are various counter-intuitive aspects to constructing confidence intervals if they are unbounded, or if they are asymmetrical, as we saw on October 29. Of these, the strangest is the fact that the upper quantile of the distribution of the test statistic used to construct the confidence interval determines the lower limit of the interval, while the lower quantile determines the upper limit. When more than one parameter is involved, the confidence region no longer suffers from these counter-intuitive problems, because it depends on a test statistic which is a quadratic form, so that the information in the sign bit is lost. For both exact confidence regions and asymptotic ones, the shape of the region is elliptical in two dimensions and ellipsoidal in higher dimensions.
One-dimensional confidence intervals can be based on a higher-dimensional confidence region by projection. This leads to conservative intervals, for which the coverage probability is greater than nominal.
Section 6.4 is the first in a series of sections that develop robust covariance matrix estimates, which all take the form of sandwich covariance matrices. This section looks at various types of HCCME, or Heteroskedasticity-Consistent Covariance Matrix Estimator. Although with a sample size of n, there are n different variances, which cannot all be estimated consistently, this does not stop us from getting a consistent estimate of the sandwich covariance matrix, which depends on all these variances only through a k x k matrix.
The basic HCCME is called HC0. Better versions are denoted as HC1, HC2, and HC3. Of these HC2 is (almost) uniformly better than HC0 or HC1. Between HC2 and HC3 there is no uniform ranking, but HC2 tends to do better when the heteroskedasticity is not very strong, and HC3 when it is stronger. It is pointed out that heteroskedasticity does not matter if all that is done is to compute a standard error for a sample mean.
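The four variants differ only in how the squared residuals in the middle of the sandwich are rescaled, as this sketch with simulated heteroskedastic data makes explicit. The particular form of heteroskedasticity here is just an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# Heteroskedastic disturbances: the variance grows with the regressor.
u = rng.normal(size=n) * np.exp(0.5 * X[:, 1])
y = X @ [1.0, 2.0] + u

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
uhat = y - X @ beta
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # leverages h_t

def sandwich(omega):
    """Sandwich estimator (X'X)^{-1} X' diag(omega) X (X'X)^{-1}."""
    middle = X.T @ (omega[:, None] * X)
    return XtX_inv @ middle @ XtX_inv

k = X.shape[1]
HC0 = sandwich(uhat ** 2)
HC1 = HC0 * n / (n - k)
HC2 = sandwich(uhat ** 2 / (1 - h))
HC3 = sandwich(uhat ** 2 / (1 - h) ** 2)

# Each HCCME yields a heteroskedasticity-robust standard error for the slope.
print([round(float(np.sqrt(V[1, 1])), 4) for V in (HC0, HC1, HC2, HC3)])
```
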
Only a few of us gathered for class on October 31, probably the warmest Hallowe'en on record - last year it snowed. Others were hard at work completing the midterm.
We started with the study of Heteroskedasticity and Autocorrelation Consistent (HAC) covariance matrix estimators. These involve double sums over the elements of the covariance matrix, and it turns out to be useful to define the set of autocovariance matrices, which are sums computed over the elements of a diagonal of the matrix. If subsequently sums are computed of the autocovariance matrices, we end up with the complete double sum. Practical HAC estimators make use of a lag truncation parameter in order to reduce the number of terms in the estimator. The Hansen-White estimator has the disadvantage of not necessarily being positive definite, but the Newey-West estimator does not share this disadvantage, and so is the most widely used HAC estimator.
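A minimal sketch of the Newey-West estimator follows, with AR(1) disturbances as an illustrative source of autocorrelation. The Bartlett weights 1 − j/(L+1) applied to the autocovariance matrices are what guarantee a positive semidefinite result; the lag-truncation parameter L is chosen arbitrarily here.

```python
import numpy as np

rng = np.random.default_rng(7)
n, L = 300, 4  # L is the lag-truncation parameter

X = np.column_stack([np.ones(n), rng.normal(size=n)])
# AR(1) disturbances, so the score contributions are autocorrelated.
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.normal()
y = X @ [1.0, 2.0] + u

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
s = X * (y - X @ beta)[:, None]  # score contributions x_t * uhat_t

# Newey-West: sum of autocovariance matrices with Bartlett weights,
# which guarantees a positive semidefinite middle matrix.
S = s.T @ s  # lag-0 term
for j in range(1, L + 1):
    w = 1 - j / (L + 1)
    Gamma_j = s[j:].T @ s[:-j]  # j-th autocovariance matrix
    S += w * (Gamma_j + Gamma_j.T)

V_nw = XtX_inv @ S @ XtX_inv
print(np.sqrt(np.diag(V_nw)))  # HAC standard errors
```
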
Next came the Cluster Robust Variance Estimator (CRVE), used in the presence of clustering, which breaks a sample down into non-intersecting subsamples, or clusters. The overall disturbance covariance matrix is then assumed to have a block-diagonal structure. Use of a simple error-components model allows one to see why clustering matters, since, if it is not accounted for, variance can be very severely underestimated. The CRVE uses a sum over all the clusters, and it has rather unusual properties. It may not be easy to see how best to split a sample into clusters, and if one or two clusters contain many more observations than the others, statistical inference becomes very difficult.
Then we embarked on section 6.7, on difference in differences (DiD). This is a widely used technique currently, and it tries to shed light on the effect of some change in one subsample, when there is another subsample in which the change did not occur. It relies on double differencing (hence the name) in order to construct a counterfactual model, which allows one to measure the causal effect of the change. A more general formulation of the diff-in-diff approach followed, allowing for many jurisdictions and many time periods.
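In the simplest two-group, two-period case, the DiD estimate is just the double difference of the four group-period means, and it coincides exactly with the interaction coefficient in a regression on group, period, and their product. A sketch with a simulated treatment effect of 3.0:

```python
import numpy as np

rng = np.random.default_rng(8)
n_per = 500  # observations per group-period cell

# Two groups x two periods; treatment raises the treated group's
# outcome by 3.0 in the second period only.
group = np.repeat([0, 1], 2 * n_per)           # 0 = control, 1 = treated
period = np.tile(np.repeat([0, 1], n_per), 2)  # 0 = before, 1 = after
y = (1.0 + 0.5 * group + 0.8 * period
     + 3.0 * group * period + rng.normal(size=4 * n_per))

# The DiD estimate is the double difference of cell means ...
did = ((y[(group == 1) & (period == 1)].mean()
        - y[(group == 1) & (period == 0)].mean())
       - (y[(group == 0) & (period == 1)].mean()
          - y[(group == 0) & (period == 0)].mean()))

# ... and equals the interaction coefficient in the regression.
X = np.column_stack([np.ones(4 * n_per), group, period, group * period])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(round(did, 2), round(float(beta[3]), 2))  # both close to 3.0
```
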
We completed the study of Chapter 6 on November 5. We looked at the last substantive section of Chapter 6, on the Delta Method. This uses the mean-value theorem, a special case of Taylor's Theorem, in order to linearise one or more nonlinear functions. This provides a way to do inference on nonlinear functions of estimated parameters. There was more emphasis on the scalar case, in particular, on two different ways of constructing a confidence interval for a nonlinear function of an estimated parameter.
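For the scalar case, the delta method approximates the variance of g(θ̂) by g'(θ̂)² times the variance of θ̂. A sketch with the illustrative choice g(θ) = exp(θ), checked against simulation:

```python
import numpy as np

rng = np.random.default_rng(9)

# Suppose theta_hat is approximately N(theta, v), and we want a standard
# error for g(theta) = exp(theta).  The delta method gives
#   Var(g(theta_hat)) ~ g'(theta_hat)^2 * v.
theta_hat, v = 1.2, 0.04  # illustrative values
se_delta = np.exp(theta_hat) * np.sqrt(v)  # g'(theta) = exp(theta)

# Check the linearisation against brute-force simulation.
draws = np.exp(rng.normal(theta_hat, np.sqrt(v), size=200_000))
print(round(float(se_delta), 3), round(float(draws.std()), 3))
```

The two numbers agree closely here because v is small; the approximation deteriorates as the variance of θ̂ grows, since the linearisation then becomes less accurate.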
Then we embarked on Chapter 7, on the Bootstrap. Despite the rather ridiculous name, this provides an astonishingly effective technique for statistical inference. The bootstrap principle says that one can estimate any property of a DGP if the true, unknown, DGP can be estimated. One uses the same property of the estimated DGP in order to estimate that property of the true DGP.
We began by seeing how to calculate bootstrap P values based on the empirical distribution of a set of simulated statistics. We looked at a bootstrap test based on a pivotal test statistic. It is then called a Monte Carlo test. It gives exact inference, in the sense that the rejection probability when the null is true is exactly equal to the significance level, if and only if the condition that α(B+1) is equal to an integer is satisfied, where α is the significance level and B is the number of bootstrap statistics. Even when the statistic is not pivotal, it is desirable to respect the rule that α(B+1) should be an integer.
In the context of linear regression models, the bootstrap can be used for testing even when there are lagged dependent variables in the set of regressors. Class on November 7 started with this problem. With lagged dependent variables as regressors, the bootstrap DGP must implement the solution of a recurrence relation. We looked first at the parametric bootstrap, where the bootstrap DGP is completely characterised by a finite set of parameters, and then at the resampling bootstrap, where the distribution of the disturbances is given by the empirical distribution of the residuals.
We briefly discussed the question of when to stop generating bootstrap repetitions. The idea is that when it is clear that the decision to reject the null or not is very unlikely to be changed on account of further repetitions, one should stop.
Next, the influence of B on the power of a t test was examined graphically, and it was seen that B = 99 can improve power quite significantly compared with B = 19. This is related to the fact that the exact inference for small B is achieved using randomness from the random number generator as much as from the randomness in the real sample, and so it is possible to reject a null one day and fail to reject it the next.
With a decent choice of B, however, results of an experiment that compares rejection probabilities of an asymptotic test with those of a bootstrap test show very clearly that a correctly designed bootstrap test achieves inference much closer to being exact than does an asymptotic test. This led on to the Golden Rules of bootstrapping, which define what a well designed bootstrap DGP should be. The first rule simply says that the bootstrap DGP must satisfy the null hypothesis under test. The second is a bit more subtle, and wants the bootstrap DGP to be as good an estimate as possible of the true unknown DGP, under the assumption that the latter also satisfies the null hypothesis.
Next, we considered how a bootstrap DGP might take account of heteroskedasticity. We looked briefly at Freedman's pairs bootstrap and Flachaire's improved version of it, while noting that, although Flachaire's version satisfies Golden Rule 1, neither satisfies Golden Rule 2.
On November 12, we saw that the best ways to set up a bootstrap DGP when disturbances are heteroskedastic are the various versions of the wild bootstrap. The bootstrap conditions on a set of residuals, preferably the restricted residuals, but the unrestricted ones will also work. Each residual is multiplied by a random variable, from the computer RNG, with expectation zero and variance one.
Mammen suggests a two-point distribution, with only two possible realisations, chosen, along with their probabilities, so that the third moment of these variables, the s*t, is equal to one. However, experience has shown that the Rademacher distribution is almost always better: this distribution just amounts to a random sign.
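A minimal sketch of the Rademacher wild bootstrap (the residuals here are invented; in practice they would preferably be the restricted residuals):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up heteroskedastic residuals standing in for restricted residuals.
n = 40
resid = rng.standard_normal(n) * np.linspace(0.5, 2.0, n)

# Rademacher wild bootstrap: each residual is multiplied by an independent
# random sign, which has expectation zero and variance one.
s_star = rng.choice([-1.0, 1.0], size=n)
u_star = resid * s_star

# Each u*_t has expectation zero and second moment equal to resid_t**2,
# so the pattern of heteroskedasticity is preserved.
```

Mammen's two-point distribution would replace the signs by draws from its two realisations with the appropriate probabilities; only the distribution of s_star changes.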
With autocorrelated disturbances, nothing so far invented works as well as the wild bootstrap does for heteroskedasticity. The most frequently used bootstrap DGP makes use of the moving-block bootstrap, where blocks of consecutive observations are resampled. As with the pairs bootstrap, regressors are resampled along with the dependent variable. The blocks are chosen to be overlapping, as, once again, experience has shown that this is better than using non-overlapping blocks.
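A sketch of the overlapping moving-block bootstrap, with invented data and an arbitrary block length:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data matrix: each row holds (y_t, x_t) for one observation.
n, b = 60, 5
data = rng.standard_normal((n, 2))

# With overlapping blocks of length b, block j is observations j..j+b-1,
# so there are n - b + 1 admissible starting points.
n_starts = n - b + 1
n_draws = n // b                       # blocks needed to rebuild a sample
starts = rng.integers(0, n_starts, size=n_draws)

# Resample whole blocks of rows, regressand and regressors together,
# as with the pairs bootstrap, and concatenate them.
boot_sample = np.concatenate([data[s:s + b] for s in starts])
```

Dependence within each block of b consecutive observations is preserved; it is broken only at the joins between blocks.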
The sieve bootstrap is an example of a semi-non-parametric procedure. One seeks an autoregressive, moving-average, or ARMA model for the disturbances which makes the residuals as close as possible to white noise. The bootstrap DGP then solves the recurrence defined by the AR or ARMA process, using disturbances that could be resampled residuals or residuals multiplied by s*t variables, as in the wild bootstrap.
Bootstrap confidence sets, as they are usually computed, make use of Golden Rule 1, turned on its head. The bootstrap DGP is defined using the estimated parameters, but the bootstrap statistic tests the hypothesis about the parameter of interest that is true of that bootstrap DGP. This leads to an estimate of the distribution of the statistic, if the tested parameter value is the true parameter for the bootstrap DGP, and the quantiles of this distribution are used, both for confidence intervals and higher-dimensional confidence sets, exactly as for asymptotic confidence sets. It is necessary to choose particular versions of the quantiles of the discrete empirical distribution, and it is seen that specific order statistics provide the best choice.
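For a concrete (invented) example, an equal-tail bootstrap confidence interval based on order statistics of simulated t statistics might be computed like this, with B chosen so that (α/2)(B+1) is an integer:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented bootstrap t statistics; in practice each one would come from
# re-estimating the model on a sample drawn from a bootstrap DGP that
# uses the estimated parameters as the true ones.
B, alpha = 999, 0.05
t_star = np.sort(rng.standard_normal(B))

# Order statistics with ranks (alpha/2)(B+1) and (1 - alpha/2)(B+1);
# with B = 999 and alpha = 0.05 these are the integers 25 and 975.
lo_rank = round(alpha / 2 * (B + 1))
hi_rank = round((1 - alpha / 2) * (B + 1))
q_lo, q_hi = t_star[lo_rank - 1], t_star[hi_rank - 1]  # 1-based ranks

# Equal-tail interval for a parameter with (made-up) estimate and
# standard error, using bootstrap quantiles in place of normal ones.
b_hat, se = 2.0, 0.4
ci = (b_hat - se * q_hi, b_hat - se * q_lo)
```

Note the inversion: the upper bootstrap quantile determines the lower limit of the interval, and vice versa.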
Chapter 7 concludes with some warnings about poorly designed bootstrap techniques, which are based on bootstrapping quantities that are not even approximately pivotal.
After a quick look at bootstrap confidence regions, we embarked on November 14 on Chapter 8, on instrumental variables (IV) estimation. Reasons for needing instrumental variables include the classical errors-in-variables problem, but probably the most important arises when more than one endogenous variable is determined simultaneously. The simplest example of this is given by a partial-equilibrium model, where the equilibrium price and quantity are determined jointly by the condition that quantity demanded equals quantity supplied.
In section 8.3, the main properties of IV estimation are established. If the instruments are exogenous, it is seen that the estimating equations are unbiased, although the estimator itself is biased. It can be shown to be consistent with any sensible asymptotic construction, and it has an asymptotic covariance matrix that is the same, algebraically, as the one derived in connection with the Gauss-Markov theorem.
We were able to distinguish the simple IV estimator, where there are exactly as many instruments as regressors, from the generalised IV estimator, which applies not only in the just-identified case but also in the over-identified case. The identification condition is that the matrix XᵀPWX is nonsingular.
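A numpy sketch of the generalised IV estimator on simulated data (all parameter values invented), writing the estimator as (XᵀPWX)⁻¹XᵀPWy:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data (illustrative only): three instruments, two regressors,
# so the model is over-identified by one.
n = 200
W = rng.standard_normal((n, 3))
Pi = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, 0.3]])
X = W @ Pi + rng.standard_normal((n, 2))
y = X @ np.array([1.0, -2.0]) + rng.standard_normal(n)

# Generalised IV estimator, computed without ever forming the n x n
# projection matrix P_W explicitly: P_W X is the OLS fit of X on W.
PwX = W @ np.linalg.lstsq(W, X, rcond=None)[0]     # P_W X
beta_iv = np.linalg.solve(PwX.T @ X, PwX.T @ y)

# Identification condition: X' P_W X must be nonsingular (full rank 2).
ident_ok = np.linalg.matrix_rank(PwX.T @ X) == 2
```

In the just-identified case, with W square in the relevant sense, the same formula collapses to the simple IV estimator.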
Consideration of efficiency, by the criterion of the asymptotic covariance matrix, shows that the optimal instruments are given by the expectations of the regressors, conditional on exogenous and predetermined information. Since these optimal instruments are infeasible, what can be done in practice is to use the orthogonal projections of the regressors on the space spanned by the instruments.
Next came the asymptotic properties of IV estimation. The main one is (asymptotic) identification, which in turn implies consistency. Then root-n consistency follows with a little more regularity, and estimation of the asymptotic covariance matrix is straightforward.
We took a quick look at the venerable procedure called two-stage least squares on November 19. This is a way of computing the IV estimator, but it needs a bit more work to get the covariance matrix estimator.
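A sketch of the two stages, and of the extra work needed for the covariance matrix, on simulated data (illustrative throughout):

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative data, as before: three instruments, two regressors.
n = 200
W = rng.standard_normal((n, 3))
X = W @ rng.standard_normal((3, 2)) + rng.standard_normal((n, 2))
y = X @ np.array([1.0, -2.0]) + rng.standard_normal(n)

# Stage 1: fitted values P_W X from OLS of each column of X on W.
X_hat = W @ np.linalg.lstsq(W, X, rcond=None)[0]
# Stage 2: OLS of y on the fitted values yields the IV estimates.
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# The "bit more work": a valid estimate of sigma^2 must use residuals
# computed with the actual regressors X, not the fitted values X_hat.
u_hat = y - X @ beta_2sls
sigma2 = u_hat @ u_hat / n
cov_2sls = sigma2 * np.linalg.inv(X_hat.T @ X_hat)
```

The naive second-stage OLS standard errors, based on y minus X_hat times the estimates, would be wrong; only the residual recomputation above fixes them.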
In this course, we don't pay much attention to the finite-sample properties of the IV estimators. However, section 8.4 gives a brief treatment of this topic. A particularly simple example, with only one regressor and one instrument, shows that the unconditional expectation of the estimator does not exist. The more general result was stated: moments of the estimator exist only up to the degree of over-identification, that is, the number of instruments minus the number of regressors.
Section 8.5 deals with hypothesis testing with linear regressions estimated with instrumental variables. The main result is that there is an artificial regression that lets almost everything be done in the same way as with OLS. We can construct t statistics and F statistics, using non-robust covariance matrix estimators, HCCMEs, or HAC estimators, all in the usual way. What we cannot do is use an F statistic calculated with the sums of squared residuals from IV estimation, because IV makes use of an oblique projection, so that Pythagoras' theorem no longer applies.
What is analogous to the F statistic is a statistic based on the difference between the unconstrained and constrained IV criterion functions. We still have to get a consistent estimate of the variance of the disturbances, σ², but that is easy.
Next we looked at two important special cases of hypothesis testing. The first is the test of the overidentifying restrictions. This makes sense only if the degree of overidentification, that is, the difference between the number of instruments and the number of regressors, is greater than zero. This is a good example of a diagnostic test. If it rejects the null hypothesis, there is ambiguity about why. It may be that some instruments are not valid, or it could be that some of the instruments that are not included in the set of regressors should be included, as they have explanatory power in their own right.
The second test is the Durbin-Wu-Hausman (DWH) test. The null hypothesis is that all the regressors are in fact exogenous or predetermined, against the alternative that some are not, and thus need to be replaced by instruments for estimation and testing. If the null is rejected, there is an ambiguity as to why similar to that with the test of overidentifying restrictions.
On November 21, we began with the rest of Chapter 8, about how to bootstrap linear regression models estimated with instrumental variables. Since there are several endogenous variables, namely the left-hand-side variable of the structural equation and the endogenous explanatory variables, the bootstrap DGP must be capable of generating them all. Thus the reduced-form equations have to be appended to the structural equation, so as to have as many equations as there are endogenous variables to be generated.
The structural equation for the null hypothesis model can be estimated by IV in order to get restricted parameter estimates and restricted residuals. The reduced-form equations can be estimated by OLS, and this gives parameter estimates and residuals. The residuals can be grouped in an n × (k2+1) matrix, where k2 is the number of endogenous explanatory variables. Then a resampling bootstrap resamples entire rows of this matrix to provide all the bootstrap disturbances. For a parametric bootstrap that assumes multivariate normality, the covariance matrix of the disturbances is estimated by the product of the transpose of the matrix of residuals with the matrix itself, divided by n. We can then generate bootstrap disturbances from a multivariate normal distribution with this estimated covariance matrix. A better idea is to use the wild bootstrap, by which the bootstrap disturbances for observation t are the set of all the residuals for that observation, multiplied by a single random sign.
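The three bootstrap DGPs for the disturbances can be sketched side by side, starting from an invented residual matrix:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical residual matrix: column 0 from the structural equation,
# columns 1..k2 from the reduced-form equations (values invented).
n, k2 = 100, 2
U = rng.standard_normal((n, k2 + 1))

# Resampling bootstrap: draw entire rows, preserving the contemporaneous
# correlation between structural and reduced-form disturbances.
rows = rng.integers(0, n, size=n)
U_star_resample = U[rows]

# Wild bootstrap: one Rademacher sign per observation multiplies the
# whole row of residuals, again preserving cross-equation correlation.
signs = rng.choice([-1.0, 1.0], size=n)
U_star_wild = U * signs[:, None]

# Parametric bootstrap under normality: Sigma_hat = U'U / n, then draw
# multivariate normal disturbances with this covariance matrix.
Sigma_hat = U.T @ U / n
U_star_param = rng.multivariate_normal(np.zeros(k2 + 1), Sigma_hat, size=n)
```

All three keep the disturbances of the several equations tied together observation by observation, which is the essential requirement here.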
The wild cluster bootstrap works similarly. The residuals of all the observations in a given cluster are multiplied by the same random sign, the random signs being independent across clusters. Clustering, cluster-robust inference, and the wild cluster bootstrap, are still active fields of research.
Next came Chapter 9, in which the first topic is Generalised Least Squares (GLS). If the covariance matrix Ω of the disturbances is known, it can be used to transform all the variables in a linear regression so as to end up with disturbances with the identity matrix as the covariance matrix. The Gauss-Markov Theorem then ensures the efficiency of the GLS estimator.
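A minimal sketch of the simplest special case, weighted least squares, where Ω is diagonal and here treated as known; all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative heteroskedastic linear model with known diagonal Omega.
n = 80
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
omega_diag = np.linspace(0.5, 3.0, n)       # known disturbance variances
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n) * np.sqrt(omega_diag)

# GLS transformation: scale each observation by 1/sqrt(omega_t), so the
# transformed disturbances have the identity covariance matrix; then OLS
# on the transformed variables is the GLS estimator.
w = 1.0 / np.sqrt(omega_diag)
beta_gls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
```

With a non-diagonal Ω, the scaling would be replaced by multiplication by a matrix square root of Ω⁻¹, but the logic is the same.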
Feasible GLS can be used when the covariance matrix is known only up to some parameters that can be estimated consistently. Under weak regularity conditions, asymptotic properties are unchanged between GLS and feasible GLS. A special case is weighted least squares, where Ω is diagonal. Another case we considered was where there is an explicit skedastic function.
On November 26, we looked at tests for heteroskedasticity in section 9.5. The null is homoskedasticity, but there are a great many possibilities for the heteroskedastic alternative. One way to choose among them is to have a set of variables that appear as explanatory variables in the skedastic function, and to run an auxiliary testing regression with the squared OLS residuals as regressand and a constant and these variables as regressors. The test statistic just tests whether all the parameters in the auxiliary regression are zero except for the constant.
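The auxiliary testing regression might be sketched as follows, on simulated data in which the variance really does depend on the candidate variable:

```python
import numpy as np

rng = np.random.default_rng(8)

# Simulated model in which the disturbance variance depends on z, so the
# null of homoskedasticity is false in this sample (all values invented).
n = 200
x = rng.standard_normal(n)
z = rng.standard_normal(n)
u = rng.standard_normal(n) * np.exp(0.5 * z)
y = 1.0 + 2.0 * x + u

# OLS residuals from the original regression.
X = np.column_stack([np.ones(n), x])
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Auxiliary regression: squared residuals on a constant and z.
Z = np.column_stack([np.ones(n), z])
coef = np.linalg.lstsq(Z, e**2, rcond=None)[0]
fitted = Z @ coef
resid_sq = e**2
R2 = 1.0 - np.sum((resid_sq - fitted)**2) / np.sum((resid_sq - resid_sq.mean())**2)

# n * R^2 is asymptotically chi-squared, with degrees of freedom equal to
# the number of non-constant regressors in the auxiliary regression,
# under the null that all their coefficients are zero.
stat = n * R2
```

Here there is one such regressor, so the statistic would be compared with the chi-squared(1) distribution.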
Next came autocorrelation. We started with the AR(1) process, and discussed various conditions for stationarity. A necessary condition is that the autoregressive parameter ρ must be less than one in absolute value. By itself this condition guarantees asymptotic stationarity, but an initialisation condition is needed for sufficiency.
Then came higher-order AR processes where the AR(p) process uses p lags, the MA(1) process, and the MA(q) process. The concept of a polynomial in the lag operator was useful for providing a compact notation for AR, MA, and ARMA processes, and for expressing the necessary condition for stationarity as a condition on the (complex) roots of a polynomial.
Testing for autocorrelation was logically the next step. The AR(1) alternative gives rise to a nonlinear regression model, of interest in its own right, but, for testing purposes, it can be linearised to yield a linear testing regression. This sort of thing can also be done for AR(p) processes and MA(q) processes as well, and it turns out that an AR(p) process and an MA(q) process are locally equivalent. The testing regressions all regress the OLS residuals on the regressors in the model and some number of lags of the residuals.
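A sketch of the linearised testing regression against an AR(1) alternative, on simulated data with genuinely autocorrelated disturbances (all values invented):

```python
import numpy as np

rng = np.random.default_rng(9)

# Simulated regression with AR(1) disturbances, rho = 0.6.
n, rho = 200, 0.6
x = rng.standard_normal(n)
u = np.empty(n)
u[0] = rng.standard_normal() / np.sqrt(1 - rho**2)  # stationary start
for t in range(1, n):
    u[t] = rho * u[t - 1] + rng.standard_normal()
y = 1.0 + 2.0 * x + u

# OLS residuals from the original regression.
X = np.column_stack([np.ones(n), x])
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Testing regression: residuals on the regressors and one lagged
# residual; the t statistic on the lag tests the null of no
# autocorrelation, and more lags would test against AR(p) or MA(q).
Xt = np.column_stack([X[1:], e[:-1]])
b = np.linalg.lstsq(Xt, e[1:], rcond=None)[0]
resid = e[1:] - Xt @ b
s2 = resid @ resid / (n - 1 - Xt.shape[1])
se = np.sqrt(s2 * np.linalg.inv(Xt.T @ Xt)[-1, -1])
t_stat = b[-1] / se
```

Because an AR(p) and an MA(q) process are locally equivalent, the same regression with q lags serves for both alternatives.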
We skipped over the subsection on the now outmoded Durbin-Watson test. Although it is unsatisfactory from most points of view, it can give correct inference if bootstrapped.
Models with serially correlated disturbances can be estimated by feasible GLS. For AR(1), the autoregressive parameter, usually denoted ρ, has to be estimated consistently, but this can be done in various ways. Although we worked out the full covariance matrix Ω for the AR(1) process, it turned out to be much easier to get the matrix Ψᵀ that is used in the GLS transformation: the white-noise disturbances can easily be expressed as linear combinations of the serially correlated disturbances. It was pointed out that it seldom makes sense to estimate a model like this by feasible GLS, because various kinds of misspecification produce the appearance of serial correlation, due only to the misspecification and not to actual serial correlation. Another disadvantage of formulating a time-series model in this way is that the methods fail if there are lagged dependent variables in the regression function.
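The AR(1) transformation can be sketched directly; here ρ is treated as known, where feasible GLS would plug in a consistent estimate:

```python
import numpy as np

def ar1_gls_transform(v, rho):
    """Apply the AR(1) GLS transformation that recovers white noise from
    serially correlated disturbances: the first observation is scaled by
    sqrt(1 - rho^2), and observation t becomes v_t - rho * v_{t-1}."""
    out = np.empty_like(v, dtype=float)
    out[0] = np.sqrt(1.0 - rho**2) * v[0]
    out[1:] = v[1:] - rho * v[:-1]
    return out

# Check on simulated stationary AR(1) disturbances (values invented).
rng = np.random.default_rng(10)
n, rho = 5000, 0.7
u = np.empty(n)
u[0] = rng.standard_normal() / np.sqrt(1 - rho**2)  # stationary start
for t in range(1, n):
    u[t] = rho * u[t - 1] + rng.standard_normal()
eps = ar1_gls_transform(u, rho)   # should be white noise with variance 1
```

Applying the same transformation to the regressand and the regressors, and then running OLS, gives the GLS estimator for this model.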
The last substantive section in Chapter 9 is on panel-data models. We began with that on November 28. We looked very briefly at the fixed-effects model and the random-effects model. Of these, the former can be estimated by the least-squares dummy-variables (LSDV) regression, while the latter can be estimated by feasible GLS. Panels are often unbalanced in empirical work. This matters little for the fixed-effects model, but it takes a little cleverness to handle with random effects.
Then we moved on to the second volume of the somewhat revised textbook. Chapter 1 of this new book deals with Nonlinear Regression. For a regression to be nonlinear, the regression function has to be nonlinear with respect to the parameters. One example we had already seen is the nonlinear dynamic model that results from modifying a linear model with AR(1) disturbances so that the transformed disturbances are white noise. This nonlinear regression, like many others, can be regarded as a linear regression subject to nonlinear restrictions on the parameters.
Nonlinear regressions can be estimated using instrumental variables, and the estimating equations are very simple to write down. If asymptotic theory is brought to bear on these equations, it is seen that consistency is a consequence of asymptotic identification, a property that says that the limiting estimating equations have a unique solution for the model parameters. If this condition is slightly strengthened, we are led to root-n consistency and asymptotic normality.
The last class was on December 3. We resumed study of Chapter 1 of the second volume of the textbook, and succeeded (just!) in completing it.
Asymptotic normality allows us to compare limiting covariance matrices or
precision matrices determined by the choice of instrumental variables. A result
that falls out quickly is that the optimal
instruments are the columns
of the Jacobian matrix of the regression functions with respect to the
parameters, evaluated at the true parameter values. But, since the true
parameters are not known, this choice of instruments is not feasible.
A solution to this difficulty is to estimate the instruments simultaneously with the parameters, and this approach leads to estimating equations that are the first-order conditions for minimising the sum of squared residuals, a procedure called, naturally enough, nonlinear least squares (NLS). There are a few asymptotic niceties involved in the simultaneous estimation of parameters and instruments, but they do not change the main results.
The facts that the estimating equations of NLS are nonlinear, and that the sum-of-squared-residuals criterion function is not quadratic in the parameters, imply that an iterative procedure has to be used for estimation. Newton's method is one way of approaching this. It makes use of both the gradient and the Hessian matrix of the criterion function. It amounts to using the quadratic approximation of this function given by Taylor's theorem, and solving the first-order conditions for minimising it. At each step, the gradient and Hessian are re-computed, leading to what one hopes are progressively better approximations to the function, and to solutions of the approximate first-order conditions that converge to the actual NLS estimates.
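A generic sketch of Newton's method on a toy criterion function (not one from the course): at each step the first-order conditions of the local quadratic approximation are solved.

```python
import numpy as np

def newton_minimise(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method for minimising a smooth criterion function:
    at each step, solve the first-order conditions of the quadratic
    Taylor approximation, i.e. x <- x - H(x)^{-1} g(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)
    return x

# Toy criterion: Q(x) = (x0 - 1)^4 + (x1 + 2)^2, minimised at (1, -2).
grad = lambda x: np.array([4 * (x[0] - 1)**3, 2 * (x[1] + 2)])
hess = lambda x: np.array([[12 * (x[0] - 1)**2, 0.0], [0.0, 2.0]])
x_min = newton_minimise(grad, hess, [2.0, 0.0])
```

The quadratic coordinate is solved exactly in one step; the quartic one converges more slowly, which illustrates why convergence of Newton's method depends on how well the quadratic approximation fits.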
Newton's method may or may not converge, and it may converge to something other than the global minimum of the criterion function. It may be computationally demanding on account of the need to evaluate second-order derivatives. For these reasons, some quasi-Newton methods have been developed, where the Hessian matrix is replaced by a matrix that, unlike the Hessian, is guaranteed by its construction to be positive definite. Of these methods, the best known and most widely used is the Gauss-Newton regression (GNR).
We saw why the Gauss-Newton method, considered as a quasi-Newton method, is especially well adapted to nonlinear regression. It has at least three possible uses. The first is just to check whether a particular parameter vector does indeed satisfy the estimating equations for NLS, that is, the first-order conditions for the minimisation of the sum of squared residuals. Note that the sufficient second-order condition for a minimum is automatically satisfied at such a parameter vector.
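Both the quasi-Newton use and the first-order-condition check can be sketched on a toy nonlinear regression; the model and all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy nonlinear regression y = b1 * exp(b2 * z) + u (values invented).
n = 100
z = rng.uniform(0.0, 1.0, n)
y = 2.0 * np.exp(0.5 * z) + 0.1 * rng.standard_normal(n)

def gnr_coeffs(beta):
    """Coefficients of the GNR: regress the current residuals on the
    Jacobian of the regression function with respect to the parameters."""
    b1, b2 = beta
    fitted = b1 * np.exp(b2 * z)
    J = np.column_stack([np.exp(b2 * z), b1 * z * np.exp(b2 * z)])
    return np.linalg.lstsq(J, y - fitted, rcond=None)[0]

# Use as a quasi-Newton method: iterate, adding the GNR coefficients to
# the current parameter vector, to approach the NLS estimates.
beta = np.array([1.0, 0.0])
for _ in range(50):
    beta = beta + gnr_coeffs(beta)

# Check of the first-order conditions: at the NLS estimates, the GNR
# coefficients are (numerically) zero.
check = np.linalg.norm(gnr_coeffs(beta))
```

The cross-product matrix of the Jacobian in this regression is positive definite by construction, which is what makes the GNR a well-behaved quasi-Newton method.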
The second use for the GNR is to estimate the covariance matrix of the NLS estimates. Although this is much less necessary nowadays than some decades ago, it is still useful, especially for bootstrapping. It is also useful because the GNR can provide an HCCME valid for NLS by use of the usual formula. The third property, that of one-step efficient estimation, is also no longer of much interest in itself, but it justifies the use of the GNR for hypothesis testing.
These days, the GNR is mainly used for hypothesis testing. It allows test statistics, asymptotic t or F, and Wald statistics, to be computed in the usual way, since the GNR, like all artificial regressions, is a linear regression, subject to all of the purely formal properties of OLS. Like OLS, it also allows for the use of HCCMEs, or perhaps even HAC estimators, when needed. We took a swift retrospective look again at the model with a linear regression with AR disturbances. It provides a good example of a GNR-based test. The GNR can even be adapted to estimation using instrumental variables, in which case it is called the IVGNR.
Midterm exam:
Follow this link for the midterm exam. You have 48 hours to submit it. For the data file, follow this link.
Assignments:
Data:
All data files needed for the assignments can be obtained by following this link.
Ancillary Readings:
This link takes you to Efron's original 1979 paper in which he introduced the bootstrap.
To send me email, click here or write directly to russell.davidson@mcgill.ca.
URL:
http://russell-davidson.research.mcgill.ca/e662