This is the first graduate course in Econometrics. It requires familiarity with matrix algebra, the foundations of probability theory, and a willingness to spend a lot of time in front of a computer.
Course outline:
This link takes you to the class outline as a PDF file. For the same material in html, follow this link.
Class Notices:
The principal reference for this course used to be the textbook Econometric Theory and Methods (ETM), Oxford University Press, ISBN 0-19-512372-7, by James MacKinnon and me. Some of you may be interested to know that the book has been translated into Chinese and Russian. However, it is now much better to rely mainly on the new ebooks Foundations of Econometrics Part 1 and Part 2. These books are partially based on ETM, but are shorter, since they contain little more than the material needed for this course.
The ETM textbook, the new ebooks, and also an older textbook, Estimation and Inference in Econometrics (EIE), Oxford University Press, ISBN 0-19-506011-3, also by James MacKinnon and me, are available at this link as PDF files, and can be accessed or downloaded completely free of charge. We ask you to respect the copyright, which we, the two authors, have held since 2021.
Be warned that the book referred to as EIE treats things at a more advanced level than is needed for this course. It may well still serve as a useful reference for certain things.
If you have, or can find, a hard copy of either book, please note that both of them have undergone a number of printings, with a few corrections incorporated each time. Even if you have the first printing of either book, that would serve perfectly well, since all corrections, right from the beginning, are available on the book homepage, at
http://qed.econ.queensu.ca/pub/dm-book/
The versions available on the website are of course the most up to date.
Some people seem to think that good empirical practice is all there is to econometrics. In one sense, that is so, but my experience has shown me that there is no good empirical practice without a good mastery of the underlying theory. It can be tempting to think of econometrics as a set of cookbook recipes, especially as so many of these recipes are made available by modern software. But it is all too easy to apply recipes wrongly if you do not understand the theory behind them. (This remark also applies to the cooking of food!) Thus the second vital aspect of econometric practice is understanding what data are telling you. Although I can make you do exercises that should make you competent in the implementation of a number of procedures, no one can (directly) teach you how to interpret the results of these procedures. Such interpretation is more an art than a science, and can therefore be taught best by example. Unfortunately, we do not have too much time for that. But even if some of the exercises you will be given in the assignments use simulated rather than real data, I will try to make you think of how your results can be interpreted. Making a practice of that may well save you from purely formal errors in the exercises.
Log of material covered:
The first class on August 28 was marred by the misbehaviour of the screen in Leacock 424. Nonetheless, we discussed the contents of the first chapter of Foundations. The main claim in this chapter is that scientific models can be interpreted as virtual reality. Discussion of causality led to the formal definition of a data-generating process, or DGP. We considered the role of randomness in econometric models, and looked at the properties of deterministic random-number generators, or RNGs.
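By way of illustration (my addition, not part of the course materials), the determinism of an RNG is easy to see in code. Here is a minimal Python sketch using NumPy; the seed value 42 is arbitrary:

    import numpy as np

    # A "random" number generator is in fact deterministic: the same seed
    # always produces exactly the same sequence of draws.
    rng1 = np.random.default_rng(seed=42)
    rng2 = np.random.default_rng(seed=42)
    print(rng1.standard_normal(3))
    print(rng2.standard_normal(3))   # identical to the previous line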
After that, we started on Chapter 2, on the linear regression model. We spent some time on notation and terminology.
On September 2, we continued with Chapter 2, on the linear regression model. The first section looks only at the simple regression model, with a constant term and only one non-constant regressor. There are two parameters in the regression function, the intercept and the slope coefficient. This was illustrated by the very simple model of the consumption function.
Next came section 2.2, with material that should be mostly familiar, on probability, random variables, distributions, expectations, moments, and such things. In particular, conditional versions of all of these things were introduced. One special distribution was defined: the standard normal, denoted N(0,1). The main point of working through this section is to establish terminology and notation.
What is meant by the specification of a linear regression model came next, in section 2.3, on which we embarked on September 4. There are two parts to the specification, the deterministic and the stochastic. A model, thought of as a set of DGPs, is said to be correctly specified if it contains the true DGP, that is, the one that corresponds to external reality. The DGPs of the model are specified conditional on some explanatory variables, or, more generally, on an information set Ωt. Next, we made a distinction between linear and nonlinear regression models. The linearity is with respect to the parameters, not the regressors. This leads on to how a linear regression model can be simulated. The important distinction was made between endogenous and exogenous variables. The latter constitute the information set on which the variables in the model are conditioned.
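As a concrete, purely illustrative example of such a simulation, the following Python sketch generates data from one DGP in a simple linear regression model; the parameter values and the choice of a uniform exogenous regressor are arbitrary assumptions of mine:

    import numpy as np

    rng = np.random.default_rng(12345)
    n = 100
    x = rng.uniform(0.0, 10.0, size=n)      # exogenous regressor, part of the information set
    beta1, beta2, sigma = 1.0, 0.8, 2.0     # picking one DGP out of the model
    u = sigma * rng.standard_normal(n)      # stochastic specification: NID(0, sigma^2) disturbances
    y = beta1 + beta2 * x + u               # deterministic specification plus disturbance: y is endogenous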
After that, the only part of section 2.4 we looked at was on partitioned matrices.
We made a start on section 2.5, on techniques of estimation, and introduced the method of moments. Frequently used terminology speaks of the sample and the population. The latter is for us a metaphor for the data-generating process.
On September 9, we finished the remaining material in Chapter 2. Estimating functions are zero functions, that is, functions of both data and parameters of which the expectations are zero when the parameters are those of the true DGP. Estimating functions are usually linear combinations of elementary zero functions, which depend on the data of just one single observation. For a regression model, the obvious elementary zero functions are the residuals. If we choose the elements of the explanatory variables as the coefficients of the linear combinations, we end up with the ordinary least-squares, or OLS, estimator. It can be defined as an M-estimator, where the estimator is given by minimising the sum of squared residuals, or as a Z-estimator given by the solution of estimating equations.
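To make the equivalence concrete, here is a small numerical check (mine, not from the book) that solving the OLS estimating equations Xᵀ(y − Xb) = 0 gives the same answer as minimising the sum of squared residuals; the simulated data are arbitrary:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])   # constant plus one regressor
    y = X @ np.array([1.0, 0.5]) + rng.standard_normal(n)

    # Z-estimator: solve the estimating equations X'(y - Xb) = 0 ...
    beta_z = np.linalg.solve(X.T @ X, X.T @ y)

    # ... M-estimator: minimise the sum of squared residuals (least squares).
    beta_m = np.linalg.lstsq(X, y, rcond=None)[0]

    print(np.allclose(beta_z, beta_m))   # True: the two definitions coincide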
Chapter 3 deals with the Geometry of least squares. After reminding ourselves of Pythagoras' Theorem, we developed geometric representations of the operations of linear algebra, that is, addition of vectors and scalar multiplication of vectors. If a linear, or vector, space also admits the scalar product, it becomes a Euclidean space, in which the angles between vectors can be defined. A subspace of a Euclidean space is defined as the set of linear combinations of a set of vectors in the original space. These can sometimes be represented in two or three dimensions geometrically.
On September 11, we took up the important concept of linear dependence. We normally require the explanatory variables in a linear regression to be linearly independent, since, if the columns of the matrix X are linearly dependent, the matrix XᵀX is singular, meaning that its inverse does not exist.
The Geometry of OLS estimation constitutes the material of section 3.3 and much of section 3.4. The result of running a linear regression by OLS was represented geometrically, and this led naturally to the algebraic notion of orthogonal projections, characterised by idempotent symmetric matrices. We saw how the orthogonal projection on to the space spanned by a set of regressors is unchanged if the regressors are replaced by a linearly independent set of linear combinations of them, and illustrated this by switching between Fahrenheit and Celsius temperatures in a regression.
The topic of section 3.4 is the Frisch-Waugh-Lovell (FWL) theorem. It is approached by way of two preliminary results. The first is illustrated by considering deviations from the mean. More generally, in a regression in which the regressors split into two subgroups, one may add linear combinations of those in one group to those of the other, without changing the coefficient estimates for the latter group.
The second preliminary result is that, if the two groups of regressors are mutually orthogonal, leaving one group out does not change the estimates for the other group. The two preliminary results can now be combined so as to get the FWL theorem.
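The theorem is easy to verify numerically. The sketch below (an illustration of mine, with arbitrary simulated data) compares the coefficients on one group of regressors from the full regression with those from the FWL regression of residuals on residuals:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 150
    X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])   # first group of regressors
    X2 = rng.standard_normal((n, 2))                             # second group of regressors
    y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([0.5, -0.3]) + rng.standard_normal(n)

    X = np.column_stack([X1, X2])
    beta2_full = np.linalg.lstsq(X, y, rcond=None)[0][2:]        # coefficients on X2, full regression

    # Project y and X2 off the span of X1, then regress residuals on residuals.
    M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
    beta2_fwl = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)[0]

    print(np.allclose(beta2_full, beta2_fwl))   # True, as the FWL theorem asserts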
One application of the FWL theorem is deseasonalising a time series; we can detrend a series in a similar way, by incorporating a time trend in a linear regression. Both of these are specific cases of the fixed-effects model, the properties of which were treated.
Section 3.6 deals with the phenomena called leverage and influential observations. The influence of a single observation in a sample can be measured by the effect on the parameter estimates of leaving it out. This can be done by use of unit basis vectors, which are indicator variables for just one observation. The FWL theorem then lets us obtain an algebraic formula for the difference between the estimated parameter vector for the full sample and that for the sample without one observation. This formula can be interpreted, and it leads to ways whereby we can detect potentially influential observations, and measure their actual influence.
The potential influence of observation t in a linear regression model is determined by the quantity ht, the tth diagonal element of the projection matrix PX. Whether or not the potential influence is realised depends on the residual ût for this observation.
The quantities ht must lie between zero and one. If the regression has a constant, they must be no less than 1/n, where n is the sample size. Because of a property of the trace of a product of matrices, the sum of the ht is equal to k, the number of regressors. This brought us to the end of Chapter 3.
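These properties, and the leave-one-out formula from Section 3.6, can be checked by simulation. In the sketch below (my illustration; the simulated data are arbitrary), h is the vector of diagonal elements of PX:

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 50, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = X @ np.array([1.0, 0.5, -0.2]) + rng.standard_normal(n)

    P = X @ np.linalg.solve(X.T @ X, X.T)           # orthogonal projection on to the span of X
    h = np.diag(P)                                  # leverages h_t
    print(h.min() >= 1.0 / n, h.max() <= 1.0)       # bounds when a constant is included
    print(np.isclose(h.sum(), k))                   # trace property: the h_t sum to k

    # Change in the OLS estimates from dropping observation t, computed two ways:
    # by rerunning the regression, and by the algebraic formula mentioned above.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    u_hat = y - X @ beta
    t = 7
    beta_wo = np.linalg.lstsq(np.delete(X, t, axis=0), np.delete(y, t), rcond=None)[0]
    formula = np.linalg.solve(X.T @ X, X[t]) * u_hat[t] / (1.0 - h[t])
    print(np.allclose(beta - beta_wo, formula))     # True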
We started on Chapter 4 on September 23. We defined linear regression models, and in particular the classical normal linear model, in which regressors must be exogenous. Models are defined as sets of DGPs, while a DGP can be thought of as a unique recipe for simulation. In parametric models, there is a parameter-defining mapping, which associates a vector of parameters to each DGP contained in the model.
An estimator is a deterministic function of the data in a dataset, these data being generated by a DGP, so that the estimator is a random variable. Its realisations are called estimates. In the context of a parametric model, an estimator may be biased or unbiased. As most estimators are defined as the solutions of estimating equations, estimating equations can also be biased or unbiased. Unfortunately, use of an unbiased estimating equation does not guarantee an unbiased estimator. We saw this in the case of a linear regression model with a lagged dependent variable as a regressor. This breaks the assumption of exogeneity, although the regressor may be predetermined and the disturbance an innovation.
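A small Monte Carlo experiment (my illustration, not an assignment) makes the point: with a lagged dependent variable, the OLS estimator is biased in finite samples even though the estimating equations are unbiased. The autoregressive parameter 0.9 and the sample size of 25 are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(2024)
    beta, n, reps = 0.9, 25, 10000
    estimates = np.empty(reps)
    for r in range(reps):
        u = rng.standard_normal(n)
        y = np.zeros(n)
        for t in range(1, n):
            y[t] = beta * y[t - 1] + u[t]              # y_{t-1} is predetermined, u_t an innovation
        ylag, ynow = y[:-1], y[1:]
        estimates[r] = (ylag @ ynow) / (ylag @ ylag)   # OLS without a constant
    print(estimates.mean())                            # noticeably below 0.9: downward bias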
As soon as the exogeneity assumption is violated, there are hardly any exact results available. Instead, recourse may be had to asymptotic theory to provide approximate results. But asymptotic theory relies on an asymptotic construction, which in principle may be chosen arbitrarily.
To make use of asymptotic theory, we need to study the topic of stochastic convergence, that is, the convergence of sequences of random variables. The most useful sort of stochastic convergence for econometrics is convergence in probability. A quite different type of convergence is convergence in distribution, where there is in general no limiting random variable, but rather a limiting distribution. It was then necessary to introduce the big-O notation for the same-order relation, both for non-random quantities and for random variables.
We were then able to give a definition of the consistency of an estimator, and to see how this definition is similar to that of unbiasedness. However, unbiasedness and consistency are two distinct properties, and a sequence may well satisfy one but not the other.
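A standard pair of examples, not necessarily those used in class, may help fix ideas: when estimating a population mean, the estimator that simply reports the first observation is unbiased but not consistent, while the sample mean plus 1/n is biased in every finite sample but consistent.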
We began on September 25 by looking at some pathological examples in which unbiasedness and consistency are not the same. Then, in Section 4.4, there are definitions of covariance matrices, correlation matrices, and positive definite and positive semidefinite matrices. The algebraic properties of these were examined. We used all these preliminaries to study the covariance matrix, and also the precision matrix, of the OLS estimator. The properties of these matrices depend on the covariance matrix of the disturbances, usually denoted as Ω. If it is a scalar matrix, we get the usual expression for the covariance matrix of the OLS estimator. However, we noted the possibilities of heteroskedasticity and autocorrelation.
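Spelled out: since the OLS estimator minus the true parameter vector equals (XᵀX)⁻¹Xᵀu, its covariance matrix is (XᵀX)⁻¹XᵀΩX(XᵀX)⁻¹, which collapses to the familiar σ²(XᵀX)⁻¹ when Ω is the scalar matrix σ²I; heteroskedasticity or autocorrelation makes Ω non-scalar, and the simple formula then no longer applies.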
While setting the stage for the Gauss-Markov Theorem, we interrupted ourselves to look at the errors of forecasts based on linear regression, and we saw that, in addition to the error induced by the random disturbance, there is also parameter uncertainty, caused by the fact that the parameters used in the forecast are just estimates, and model uncertainty, caused by the very likely inadequacy of the regression model used for forecasting.
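In symbols, abstracting from model uncertainty: if the forecast of y* = x*ᵀβ + u* is x*ᵀβ̂, the forecast error is u* − x*ᵀ(β̂ − β), and, with homoskedastic disturbances and a new disturbance independent of the estimation sample, its variance is σ² + σ²x*ᵀ(XᵀX)⁻¹x*, the second term reflecting parameter uncertainty.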
The Gauss-Markov Theorem states that the OLS estimator is BLUE (for Best Linear Unbiased Estimator). The theorem requires the exogeneity of the regressors and white-noise disturbances (homoskedastic and serially uncorrelated). The criterion used to assess efficiency is based on the difference of covariance matrices, or, alternatively, of precision matrices.
We looked at just one proof out of many for the theorem. It works by showing that the covariance matrix of an arbitrary linear unbiased estimator is equal to that of the OLS estimator plus another covariance matrix, which is necessarily positive semidefinite.
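In outline: any linear unbiased estimator can be written as β̃ = Ay with AX = I. Writing A = (XᵀX)⁻¹Xᵀ + C, the condition AX = I forces CX = O, and with white-noise disturbances Var(β̃) = σ²AAᵀ = σ²(XᵀX)⁻¹ + σ²CCᵀ, where the second term is positive semidefinite and vanishes only for OLS itself.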
September 30: in Section 4.7, the statistical properties of the OLS residuals are presented, under the assumptions of the Gauss-Markov theorem. Their covariance matrix is proportional to the orthogonal projection matrix MX. This means that the mean of the squared residuals is an underestimate of the true variance of the disturbances, but it is easy to construct an unbiased estimator, called s2.
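Explicitly, since the residual vector is MXu, the expectation of the sum of squared residuals is σ² times the trace of MX, namely σ²(n − k), and so s² = SSR/(n − k) is unbiased for σ².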
The penultimate section in Chapter 4 deals with the misspecification of a model. Overspecification, whereby a regression model contains regressors with no explanatory power, is not misspecification, but it is worth considering, because it shows how it is necessary to specify a model before invoking the Gauss-Markov theorem. An underspecified model arises if relevant regressors are omitted. This is misspecification, and leads to the definition of a pseudo-true value. With misspecification, we can see the tradeoff between bias and variance. In particular, a biased estimator may have a smaller mean squared error than an unbiased one.
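Recall that, for a scalar parameter, the mean squared error decomposes as the variance plus the square of the bias, which is what makes such a tradeoff possible.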
The final section of Chapter 4 introduces the coefficient of determination, usually known as the R2. It can be uncentred, centred, or adjusted, and it is widely used, and misused, as a measure of the goodness of fit of a regression.
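For reference: the uncentred R² is the ratio of the squared length of PXy to that of y itself; the centred version applies the same idea to deviations from the sample mean, and is meaningful only if a constant, or its equivalent, is among the regressors; the adjusted R² equals 1 − (1 − R²)(n − 1)/(n − k), which penalises the addition of regressors.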
On to Chapter 5, on hypothesis testing. We started with a long list of definitions. A hypothesis, called a null hypothesis, corresponds to a model, and the hypothesis is that the model is correctly specified. The alternative hypothesis serves as a framework for the null, in that the null is usually a restricted version of the alternative. A test statistic is a deterministic function of data, and it is usually chosen to satisfy two requirements: first that its distribution, as a random variable, is known and tractable if the data are generated by a DGP contained in the null model; second that its distribution is as different as possible when the DGP is in the alternative model but not in the null model.
A test is based on a test statistic, and it can either reject the null or not. The rejection rule usually takes the form of calling for rejection if the value of the test statistic is in a rejection region. Equal-tailed, two-tailed, and one-tailed tests can define quite different rejection regions for one and the same statistic. Usually such a region is defined by a critical value.
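As a concrete example: for a statistic distributed as N(0,1) under the null, a two-tailed test at the 5% level rejects when the absolute value of the statistic exceeds the critical value 1.96, whereas a one-tailed test against positive alternatives uses the critical value 1.645.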
A Type I error is committed if a true null is rejected; a Type II error when a false null is not rejected. The significance level of a test is the desired probability of Type I error.
Work on October 2 pursued the study of hypothesis testing. Analogously to the level, we define the power of a test as one minus the probability of Type II error. An important concept is the P value, or marginal significance level, that is, the level at which one is on the margin between rejection and non-rejection. A P value is a useful way of transcending the binary nature of a test, as it allows each investigator to choose a personal significance level.
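For a statistic z that is N(0,1) under the null, for instance, the two-tailed P value is 2(1 − Φ(|z|)), where Φ is the standard normal CDF; the null is rejected at level α whenever this P value is less than α.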
In Section 5.3, there are descriptions and recipes for simulation for various commonly used distributions. The first of these is the Multivariate Normal distribution, which can be constructed on the basis of a set of independent standard normal variables. Next came the chi-squared (χ2) distribution, with a positive integer as the degrees-of-freedom number. The connection with the multivariate normal distribution was emphasised. After that, we looked at Student's t distribution, and the F distribution.
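The recipes translate almost line by line into code. Here is a minimal Python sketch (my illustration; the parameter values are arbitrary) that generates one drawing from each distribution starting from independent standard normals:

    import numpy as np

    rng = np.random.default_rng(99)

    # Multivariate normal N(mu, Sigma): premultiply independent N(0,1) variables
    # by a Cholesky factor of Sigma and add the mean vector.
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
    x = mu + np.linalg.cholesky(Sigma) @ rng.standard_normal(2)

    # Chi-squared with m degrees of freedom: sum of m squared independent N(0,1) variables.
    m = 5
    chi2 = np.sum(rng.standard_normal(m) ** 2)

    # Student's t(m): N(0,1) divided by the square root of an independent chi-squared(m) over m.
    t = rng.standard_normal() / np.sqrt(np.sum(rng.standard_normal(m) ** 2) / m)

    # F(m1, m2): ratio of two independent chi-squareds, each divided by its degrees of freedom.
    m1, m2 = 3, 20
    F = (np.sum(rng.standard_normal(m1) ** 2) / m1) / (np.sum(rng.standard_normal(m2) ** 2) / m2)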
We just started on section 5.4 on exact tests in the context of the classical normal linear model.
Midterm exam:
Assignments:
Data:
All data files needed for the assignments can be obtained by following this link.
Ancillary Readings:
This link takes you to Efron's original 1979 paper in which he introduced the bootstrap.
To send me email, click here or write directly to russell.davidson@mcgill.ca.
URL:
http://russell-davidson.research.mcgill.ca/e662