This is the first graduate course in Econometrics. It requires familiarity with matrix algebra, the foundations of probability theory, and a willingness to spend a lot of time in front of a computer.
Course outline:
This link takes you to the class outline as a PDF file. For the same material in html, follow this link.
Class Notices:
The principal reference for this course used to be the textbook Econometric Theory and Methods (ETM), Oxford University Press, ISBN 0-19-512372-7, by James MacKinnon and me. Some of you may be interested to know that the book has been translated into Chinese and Russian. However, it is now much better to rely mainly on the new ebooks Foundations of Econometrics Part 1 and Part 2. These books are partially based on ETM, but are shorter, since they contain little more than the material needed for this course.
The ETM textbook, the new ebooks, and also an older textbook, Estimation and Inference in Econometrics (EIE), Oxford University Press, ISBN 0-19-506011-3, also by James MacKinnon and me, are available at this link as PDF files, and can be accessed or downloaded completely free of charge. We ask you to respect the copyright, which we, the two authors, have held since 2021.
Be warned that the book referred to as EIE treats things at a more advanced level than is needed for this course. It may well still serve as a useful reference for certain things.
If you have, or can find, a hard copy of either book, please note that both of them have undergone a number of printings, with a few corrections incorporated each time. Even if you have the first printing of either book, that would serve perfectly well, since all corrections, right from the beginning, are available on the book homepage, at
http://qed.econ.queensu.ca/pub/dm-book/
The versions available on the website are of course the most up to date.
Some people seem to think that good empirical practice is all there is to econometrics. In one sense, that is so, but my experience has shown me that there is no good empirical practice without a good mastery of the underlying theory. It can be tempting to think of econometrics as a set of cookbook recipes, especially as so many of these recipes are made available by modern software. But it is all too easy to apply recipes wrongly if you do not understand the theory behind them. (This remark also applies to the cooking of food!) Thus the second vital aspect of econometric practice is understanding what data are telling you. Although I can make you do exercises that should make you competent in the implementation of a number of procedures, no one can (directly) teach you how to interpret the results of these procedures. Such interpretation is more an art than a science, and can therefore be taught best by example. Unfortunately, we do not have too much time for that. But even if some of the exercises you will be given in the assignments use simulated rather than real data, I will try to make you think of how your results can be interpreted. Making a practice of that may well save you from purely formal errors in the exercises.
Log of material covered:
The first class on August 28 was marred by the misbehaviour of the screen in Leacock 424. Nonetheless, we discussed the contents of the first chapter of Foundations. The main claim in this chapter is that scientific models can be interpreted as virtual reality. Discussion of causality led to the formal definition of a data-generating process, or DGP. We considered the role of randomness in econometric models, and looked at the properties of deterministic random-number generators, or RNGs.
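By way of illustration (my addition, not part of the course materials), the determinism of an RNG is easy to see in code. Here is a minimal Python sketch using NumPy; the seed value 42 is arbitrary:

    import numpy as np

    # A "random" number generator is in fact deterministic: the same seed
    # always produces exactly the same sequence of draws.
    rng1 = np.random.default_rng(seed=42)
    rng2 = np.random.default_rng(seed=42)
    print(rng1.standard_normal(3))
    print(rng2.standard_normal(3))   # identical to the previous line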
After that, we started on Chapter 2, on the linear regression model. We spent some time on notation and terminology.
On September 2, we continued with Chapter 2, on the linear regression model. The first section looks only at the simple regression model, with a constant term and only one non-constant regressor. There are two parameters in the regression function, the intercept and the slope coefficient. This was illustrated by the very simple model of the consumption function.
Next came section 2.2, with material that should be mostly familiar, on probability, random variables, distributions, expectations, moments, and such things. In particular, conditional versions of all of these things were introduced. One special distribution was defined: the standard normal, denoted N(0,1). The main point of working through this section is to establish terminology and notation.
What is meant by the specification of a linear regression model came next, in section 2.3, on which we embarked on September 4. There are two parts to the specification, the deterministic and the stochastic. A model, thought of as a set of DGPs, is said to be correctly specified if it contains the true DGP, that is, the one that corresponds to external reality. The DGPs of the model are specified conditional on some explanatory variables, or, more generally, on an information set Ωt. Next, we made a distinction between linear and nonlinear regression models. The linearity is with respect to the parameters, not the regressors. This leads on to how a linear regression model can be simulated. The important distinction was made between endogenous and exogenous variables. The latter constitute the information set on which the variables in the model are conditioned.
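As a concrete, purely illustrative example of such a simulation, the following Python sketch generates data from one DGP in a simple linear regression model; the parameter values and the choice of a uniform exogenous regressor are arbitrary assumptions of mine:

    import numpy as np

    rng = np.random.default_rng(12345)
    n = 100
    x = rng.uniform(0.0, 10.0, size=n)      # exogenous regressor, part of the information set
    beta1, beta2, sigma = 1.0, 0.8, 2.0     # picking one DGP out of the model
    u = sigma * rng.standard_normal(n)      # stochastic specification: NID(0, sigma^2) disturbances
    y = beta1 + beta2 * x + u               # deterministic specification plus disturbance: y is endogenous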
After that, the only part of section 2.4 we looked at was on partitioned matrices.
We made a start on section 2.5, on techniques of estimation, and introduced the method of moments. Frequently used terminology speaks of the sample and the population. The latter is for us a metaphor for the data-generating process.
On September 9, we finished the remaining material in Chapter 2. Estimating functions are zero functions, that is, functions of both data and parameters of which the expectations are zero when the parameters are those of the true DGP. Estimating functions are usually linear combinations of elementary zero functions, which depend on the data of just one single observation. For a regression model, the obvious elementary zero functions are the residuals. If we choose the elements of the explanatory variables as the coefficients of the linear combinations, we end up with the ordinary least-squares, or OLS, estimator. It can be defined as an M-estimator, where the estimator is given by minimising the sum of squared residuals, or as a Z-estimator given by the solution of estimating equations.
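To make the equivalence concrete, here is a small numerical check (mine, not from the book) that solving the OLS estimating equations Xᵀ(y − Xb) = 0 gives the same answer as minimising the sum of squared residuals; the simulated data are arbitrary:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])   # constant plus one regressor
    y = X @ np.array([1.0, 0.5]) + rng.standard_normal(n)

    # Z-estimator: solve the estimating equations X'(y - Xb) = 0 ...
    beta_z = np.linalg.solve(X.T @ X, X.T @ y)

    # ... M-estimator: minimise the sum of squared residuals (least squares).
    beta_m = np.linalg.lstsq(X, y, rcond=None)[0]

    print(np.allclose(beta_z, beta_m))   # True: the two definitions coincide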
Chapter 3 deals with the Geometry of least squares. After reminding ourselves of Pythagoras' Theorem, we developed geometric representations of the operations of linear algebra, that is, addition of vectors and scalar multiplication of vectors. If a linear, or vector, space also admits the scalar product, it becomes a Euclidean space, in which the angles between vectors can be defined. A subspace of a Euclidean space is defined as the set of linear combinations of a set of vectors in the original space. These can sometimes be represented in two or three dimensions geometrically.
On September 11, we took up the important concept of linear dependence. We normally require the explanatory variables in a linear regression to be linearly independent, since, if the columns of the matrix X are linearly dependent, the matrix XᵀX is singular, meaning that its inverse does not exist.
The Geometry of OLS estimation constitutes the material of section 3.3 and much of section 3.4. The result of running a linear regression by OLS was represented geometrically, and this led naturally to the algebraic notion of orthogonal projections, characterised by idempotent symmetric matrices. We saw how the orthogonal projection on to the space spanned by a set of regressors is unchanged if the regressors are replaced by a linearly independent set of linear combinations of them, and illustrated this by switching between Fahrenheit and Celsius temperatures in a regression.
The topic of section 3.4 is the Frisch-Waugh-Lovell (FWL) theorem. It is approached by way of two preliminary results. The first is illustrated by considering deviations from the mean. More generally, in a regression in which the regressors split into two subgroups, one may add linear combinations of those in one group to those of the other, without changing the coefficient estimates for the latter group.
The second preliminary result is that, if the two groups of regressors are mutually orthogonal, leaving one group out does not change the estimates for the other group. The two preliminary results can now be combined so as to get the FWL theorem.
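The theorem is easy to verify numerically. The sketch below (an illustration of mine, with arbitrary simulated data) compares the coefficients on one group of regressors from the full regression with those from the FWL regression of residuals on residuals:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 150
    X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])   # first group of regressors
    X2 = rng.standard_normal((n, 2))                             # second group of regressors
    y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([0.5, -0.3]) + rng.standard_normal(n)

    X = np.column_stack([X1, X2])
    beta2_full = np.linalg.lstsq(X, y, rcond=None)[0][2:]        # coefficients on X2, full regression

    # Project y and X2 off the span of X1, then regress residuals on residuals.
    M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
    beta2_fwl = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)[0]

    print(np.allclose(beta2_full, beta2_fwl))   # True, as the FWL theorem asserts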
One application of the FWL theorem is deseasonalising a time series; we can detrend a series in a similar way, by incorporating a time trend in a linear regression. Both of these are specific cases of the fixed-effects model, the properties of which were treated.
Section 3.6 deals with the phenomena called leverage and influential observations. The influence of a single observation in a sample can be measured by the effect on the parameter estimates of leaving it out. This can be done by use of unit basis vectors, which are indicator variables for just one observation. The FWL theorem then lets us obtain an algebraic formula for the difference between the estimated parameter vector for the full sample and that for the sample without one observation. This formula can be interpreted, and it leads to ways whereby we can detect potentially influential observations, and measure their actual influence.
The potential influence of observation t in a linear regression model is determined by the quantity ht, the tth diagonal element of the projection matrix PX. Whether or not the potential influence is realised depends on the residual ût for this observation.
The quantities ht must lie between zero and one. If the regression has a constant, they must be no less than 1/n, where n is the sample size. Because of a property of the trace of a product of matrices, the sum of the ht is equal to k, the number of regressors. This brought us to the end of Chapter 3.
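These properties, and the leave-one-out formula from Section 3.6, can be checked by simulation. In the sketch below (my illustration; the simulated data are arbitrary), h is the vector of diagonal elements of PX:

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 50, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([1.0, 0.5, -0.2]) + rng.standard_normal(n)

    P = X @ np.linalg.solve(X.T @ X, X.T)           # orthogonal projection on to the span of X
    h = np.diag(P)                                  # leverages h_t
    print(h.min() >= 1.0 / n, h.max() <= 1.0)       # bounds when a constant is included
    print(np.isclose(h.sum(), k))                   # trace property: the h_t sum to k

    # Change in the OLS estimates from dropping observation t, computed two ways:
    # by rerunning the regression, and by the algebraic formula mentioned above.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    u_hat = y - X @ beta
    t = 7
    beta_wo = np.linalg.lstsq(np.delete(X, t, axis=0), np.delete(y, t), rcond=None)[0]
    formula = np.linalg.solve(X.T @ X, X[t]) * u_hat[t] / (1.0 - h[t])
    print(np.allclose(beta - beta_wo, formula))     # True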
We started on Chapter 4 on September 23. We defined linear regression models, and in particular the classical normal linear model, in which regressors must be exogenous. Models are defined as sets of DGPs, while a DGP can be thought of as a unique recipe for simulation. In parametric models, there is a parameter-defining mapping, which associates a vector of parameters to each DGP contained in the model.
An estimator is a deterministic function of the data in a dataset, these data being generated by a DGP, so that the estimator is a random variable. Its realisations are called estimates. In the context of a parametric model, an estimator may be biased or unbiased. As most estimators are defined as the solutions of estimating equations, estimating equations can also be biased or unbiased. Unfortunately, use of an unbiased estimating equation does not guarantee an unbiased estimator. We saw this in the case of a linear regression model with a lagged dependent variable as a regressor. This breaks the assumption of exogeneity, although the regressor may be predetermined and the disturbance an innovation.
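A small Monte Carlo experiment (my illustration, not an assignment) makes the point: with a lagged dependent variable, the OLS estimator is biased in finite samples even though the estimating equations are unbiased. The autoregressive parameter 0.9 and the sample size of 25 are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(2024)
    beta, n, reps = 0.9, 25, 10000
    estimates = np.empty(reps)
    for r in range(reps):
        u = rng.standard_normal(n)
        y = np.zeros(n)
        for t in range(1, n):
            y[t] = beta * y[t - 1] + u[t]              # y_{t-1} is predetermined, u_t an innovation
        ylag, ynow = y[:-1], y[1:]
        estimates[r] = (ylag @ ynow) / (ylag @ ylag)   # OLS without a constant
    print(estimates.mean())                            # noticeably below 0.9: downward bias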
As soon as the exogeneity assumption is violated, there are hardly any exact results available. Instead, recourse may be had to asymptotic theory to provide approximate results. But asymptotic theory relies on an asymptotic construction, which in principle may be chosen arbitrarily.
To make use of asymptotic theory, we need to study the topic of stochastic convergence, that is, the convergence of sequences of random variables. The most useful sort of stochastic convergence for econometrics is convergence in probability. A quite different type of convergence is convergence in distribution, where there is in general no limiting random variable, but rather a limiting distribution. It was then necessary to introduce the big-O notation for the same-order relation, both for non-random quantities and for random variables.
We were then able to give a definition of the consistency of an estimator, and to see how this definition is similar to that of unbiasedness. However, unbiasedness and consistency are two distinct properties, and a sequence may well satisfy one but not the other.
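A standard pair of examples, not necessarily those used in class, may help fix ideas: when estimating a population mean, the estimator that simply reports the first observation is unbiased but not consistent, while the sample mean plus 1/n is biased in every finite sample but consistent.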
We began on September 25 by looking at some pathological examples in which unbiasedness and consistency are not the same. Then, in Section 4.4, there are definitions of covariance matrices, correlation matrices, and positive definite and positive semidefinite matrices. The algebraic properties of these were examined. We used all these preliminaries to study the covariance matrix, and also the precision matrix, of the OLS estimator. The properties of these matrices depend on the covariance matrix of the disturbances, usually denoted as Ω. If it is a scalar matrix, we get the usual expression for the covariance matrix of the OLS estimator. However, we noted the possibilities of heteroskedasticity and autocorrelation.
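Spelled out: since the OLS estimator minus the true parameter vector equals (XᵀX)⁻¹Xᵀu, its covariance matrix is (XᵀX)⁻¹XᵀΩX(XᵀX)⁻¹, which collapses to the familiar σ²(XᵀX)⁻¹ when Ω is the scalar matrix σ²I; heteroskedasticity or autocorrelation makes Ω non-scalar, and the simple formula then no longer applies.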
While setting the stage for the Gauss-Markov Theorem, we interrupted ourselves to look at the errors of forecasts based on linear regression, and we saw that, in addition to the error induced by the random disturbance, there is also parameter uncertainty, caused by the fact that the parameters used in the forecast are just estimates, and model uncertainty, caused by the very likely inadequacy of the regression model used for forecasting.
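In symbols, abstracting from model uncertainty: if the forecast of y* = x*ᵀβ + u* is x*ᵀβ̂, the forecast error is u* − x*ᵀ(β̂ − β), and, with homoskedastic disturbances and a new disturbance independent of the estimation sample, its variance is σ² + σ²x*ᵀ(XᵀX)⁻¹x*, the second term reflecting parameter uncertainty.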
The Gauss-Markov Theorem states that the OLS estimator is BLUE (for Best Linear Unbiased Estimator). The theorem requires the exogeneity of the regressors and white-noise disturbances (homoskedastic and serially uncorrelated). The criterion used to assess efficiency is based on the difference of covariance matrices, or, alternatively, of precision matrices.
We looked at just one proof out of many for the theorem. It works by showing that the covariance matrix of an arbitrary linear unbiased estimator is equal to that of the OLS estimator plus another covariance matrix, which is necessarily positive semidefinite.
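In outline: any linear unbiased estimator can be written as β̃ = Ay with AX = I. Writing A = (XᵀX)⁻¹Xᵀ + C, the condition AX = I forces CX = O, and with white-noise disturbances Var(β̃) = σ²AAᵀ = σ²(XᵀX)⁻¹ + σ²CCᵀ, where the second term is positive semidefinite and vanishes only for OLS itself.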
September 30: in Section 4.7, the statistical properties of the OLS residuals are presented, under the assumptions of the Gauss-Markov theorem. Their covariance matrix is proportional to the orthogonal projection matrix MX. This means that the mean of the squared residuals is an underestimate of the true variance of the disturbances, but it is easy to construct an unbiased estimator, called s2.
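Explicitly, since the residual vector is MXu, the expectation of the sum of squared residuals is σ² times the trace of MX, namely σ²(n − k), and so s² = SSR/(n − k) is unbiased for σ².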
The penultimate section in Chapter 4 deals with the misspecification of a model. Overspecification, whereby a regression model contains regressors with no explanatory power, is not misspecification, but it is worth considering, because it shows how it is necessary to specify a model before invoking the Gauss-Markov theorem. An underspecified model arises if relevant regressors are omitted. This is misspecification, and leads to the definition of a pseudo-true value. With misspecification, we can see the tradeoff between bias and variance. In particular, a biased estimator may have a smaller mean squared error than an unbiased one.
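Recall that, for a scalar parameter, the mean squared error decomposes as the variance plus the square of the bias, which is what makes such a tradeoff possible.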
The final section of Chapter 4 introduces the coefficient of determination, usually known as the R2. It can be uncentred, centred, or adjusted, and it is widely used, and misused, as a measure of the goodness of fit of a regression.
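For reference: the uncentred R² is the ratio of the squared length of PXy to that of y itself; the centred version applies the same idea to deviations from the sample mean, and is meaningful only if a constant, or its equivalent, is among the regressors; the adjusted R² equals 1 − (1 − R²)(n − 1)/(n − k), which penalises the addition of regressors.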
On to Chapter 5, on hypothesis testing. We started with a long list of definitions. A hypothesis, called a null hypothesis, corresponds to a model, and the hypothesis is that the model is correctly specified. The alternative hypothesis serves as a framework for the null, in that the null is usually a restricted version of the alternative. A test statistic is a deterministic function of data, and it is usually chosen to satisfy two requirements: first that its distribution, as a random variable, is known and tractable if the data are generated by a DGP contained in the null model; second that its distribution is as different as possible when the DGP is in the alternative model but not in the null model.
A test is based on a test statistic, and it can either reject the null or not. The rejection rule usually takes the form of calling for rejection if the value of the test statistic is in a rejection region. Equal-tailed, two-tailed, and one-tailed tests can define quite different rejection regions for one and the same statistic. Usually such a region is defined by a critical value.
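As a concrete example: for a statistic distributed as N(0,1) under the null, a two-tailed test at the 5% level rejects when the absolute value of the statistic exceeds the critical value 1.96, whereas a one-tailed test against positive alternatives uses the critical value 1.645.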
A Type I error is committed if a true null is rejected; a Type II error when a false null is not rejected. The significance level of a test is the desired probability of Type I error.
Work on October 2 pursued the study of hypothesis testing. Analogously to the level, we define the power of a test as one minus the probability of Type II error. An important concept is the P value, or marginal significance level, that is, the level at which one is on the margin between rejection and non-rejection. A P value is a useful way of transcending the binary nature of a test, as it allows each investigator to choose a personal significance level.
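For a statistic z that is N(0,1) under the null, for instance, the two-tailed P value is 2(1 − Φ(|z|)), where Φ is the standard normal CDF; the null is rejected at level α whenever this P value is less than α.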
In Section 5.3, there are descriptions and recipes for simulation for various commonly used distributions. The first of these is the Multivariate Normal distribution, which can be constructed on the basis of a set of independent standard normal variables. Next came the chi-squared (χ2) distribution, with a positive integer as the degrees-of-freedom number. The connection with the multivariate normal distribution was emphasised. After that, we looked at Student's t distribution, and the F distribution.
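The recipes translate almost line by line into code. Here is a minimal Python sketch (my illustration; the parameter values are arbitrary) that generates one drawing from each distribution starting from independent standard normals:

    import numpy as np

    rng = np.random.default_rng(99)

    # Multivariate normal N(mu, Sigma): premultiply independent N(0,1) variables
    # by a Cholesky factor of Sigma and add the mean vector.
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
    x = mu + np.linalg.cholesky(Sigma) @ rng.standard_normal(2)

    # Chi-squared with m degrees of freedom: sum of m squared independent N(0,1) variables.
    m = 5
    chi2 = np.sum(rng.standard_normal(m) ** 2)

    # Student's t(m): N(0,1) divided by the square root of an independent chi-squared(m) over m.
    t = rng.standard_normal() / np.sqrt(np.sum(rng.standard_normal(m) ** 2) / m)

    # F(m1, m2): ratio of two independent chi-squareds, each divided by its degrees of freedom.
    m1, m2 = 3, 20
    F = (np.sum(rng.standard_normal(m1) ** 2) / m1) / (np.sum(rng.standard_normal(m2) ** 2) / m2)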
We just started on section 5.4 on exact tests in the context of the classical normal linear model.
Midterm exam:
Assignments:
Data:
All data files needed for the assignments can be obtained by following this link.
Ancillary Readings:
This link takes you to Efron's original 1979 paper in which he introduced the bootstrap.
To send me email, click here or write directly to russell.davidson@mcgill.ca.
URL:
http://russell-davidson.research.mcgill.ca/e662