|
|
|
This course in statistics is intended for all Honours students in Economics. The aim is for students to understand what the discipline of statistics is, why it is important, and how to use it, especially with economic data. The specialised application of statistics to economics is called econometrics, which is the topic of later courses in the Honours program, although Majors and even Minors students often take those courses.
The mathematical requirements for the course are not very heavy, but students should have a reasonable knowledge of, and ability to work with, both the differential and the integral calculus. Some acquaintance with linear algebra, in particular matrix algebra, is also desirable.
Course Outline:
The course outline can be found by following this link for the PDF version; it is alternatively available from this link in HTML. Although the outline has official status regarding administrative matters, this website is the most important resource: it will be updated regularly with information, assignments, and so forth.
Announcements:
This message is for students switching into ECON 227.
Your grade will be determined entirely by your results in 227, maintaining the relative weights of assignments and examinations. If for example you switch into 227 in January and your grade in the second term of 227 is 75%, then your grade for the entire year in 227 will be 75%.
Switching into 227 late in the first term carries the disadvantage that you will be writing a December examination in a course for which you will have attended only a small proportion of the lectures. So: you might find it easier to switch in the first week of January. That will be your only window to switch in the second term, however; you cannot switch after the January add-drop period. Please see a majors program advisor if you have any questions about the administrative details or deadlines.
Mercury Course Evaluation is now open and will remain so until December 3. Please give your evaluation of this course, as it may help me to do better in future.
The Christmas exam has been scheduled. It is to take place on December 9, 18.30-21.30, in ENGTR 0100. The exam now appears on the official list: check here. One cheat sheet, letter size, two-sided, will be allowed. The exam will cover all the material studied this term, but with an emphasis on what was covered since the midterm.
This is just a reminder that this webpage covers the first term only of Economics 257. The evaluation for the complete full-year course will not be complete until after the exam session at the end of the winter term.
The midterm exam is scheduled during class time, 08.30-10.00, on Monday October 20.
In order to answer various questions received by email, here is an extract from the instructions at the top of the exam:
All mobile phones, smart phones, smart watches and web-accessible electronic devices must be turned off and must not be in the student's possession during the exam.
For examinations requiring the use of a calculator, unless otherwise specified by the examiner, only non-programmable, non-text storing calculators are permitted.
The exam is closed-book. No cheat-sheets are permitted.
Our TA for this term is Miroslav Zhao. His email is miroslav.zhao@mail.mcgill.ca. His office hours are on Wednesdays 16.30-17.30 in Leacock 112.
My own office hours, in Leacock 321C, are on Tuesdays and Thursdays from a little after 10.00 until a little before 13.00.
Textbooks:
There is no single textbook required for the course. Two that are suitable are as follows.
Exercises:
In response to requests for exercises that let you understand statistics better and prepare for assignments and exams, here are some sources.
Software:
I have no specific instructions or recommendations about appropriate software, for assignments or other uses. If you have no preferred software of your own, or if you are having trouble with available software for running regressions, simulations, and so on, you might like to try my own software, Ects. The documentation is available: most, though not all, of it in English, and all of it in French. For convenience, you can find the first volume here (in English), and the second volume here.
The paper reached by following this link is now a bit old (2009), but it contains a pretty comprehensive list of software available for econometrics and statistics.
Log of material covered:
Our first class was held on September 3. After the usual preliminaries, we embarked on Chapter 1 of Galbraith's textbook, entitled Statistical Reasoning. A set of examples is presented, in which various circumstances are described where statistical reasoning is useful.
The first, on gambling and lotteries, allowed us to think about how notions of probability and statistics were originally developed, by people who hoped to make money by gambling. They were of course disappointed. A much subtler example came next, in which the idea of information in uncertain situations was introduced. Giving qualitative and quantitative descriptions of numerical data sets is obviously important, and an example led to the concept of the density of a sample. The distinction between a population and a sample drawn from it was made, and we were led to formulate the question of how information about the population can be inferred from a sample.
On September 8, we continued and finished Chapter 1 of the textbook. First, we looked at the example on memory in random processes, distinguishing successions of coin tosses, without memory, from weather forecasts, or predictions about the stock market, where memory of various sorts may have an influence.
The next two examples dealt with association and conditional association. The concept of covariates was introduced, and it was pointed out that, for statistical conclusions to have any validity, it is necessary to control for one, or more likely many, covariates. In an experimental situation this is easy to do, but, when one has to rely on observations that one cannot control, things are typically much harder. What is considered the best way to proceed, if possible, is to run an RCT, or randomised controlled trial, with a test group and a control group.
The last example in the chapter was about prediction, and forecasting. A couple of examples from machine learning illustrated this.
In Chapter 2, different types of economic or financial data are discussed. In macroeconometrics, a very common data type is time series. An economic variable like GDP, or the inflation rate, is observed at different points in time, and the results grouped into an ordered set, with each observation carrying a time stamp. At the other extreme, the data in a cross-section are all collected at the same time from different entities, like households, firms, provinces, etc. Panel data are collections of observations on a set of cross-sectional units that are observed through time, so that each observation carries two indices, the time, and one that corresponds to the cross-sectional unit.
We saw some examples of graphical presentations of these data types, noting that three dimensions are needed for panel data. Some of these presentations are hard to interpret, unless the data set is ordered in some way.
We completed the study of Chapter 2 on September 10. Most of our time was spent on transformations of data series. One such is motivated by the fact that a particular series (US GDP seasonally adjusted) is closely matched by an exponential growth function. The actual transformation is to replace the raw data by their logarithms. After this, exponential growth looks more like a straight line.
Another transformation replaces nominal sums of money, or growth rates, by real values, which take account of inflation. A base period has to be selected, in which real and nominal coincide. Seasonally adjusted data are made available by statistical agencies in an attempt to separate seasonal variation from variation caused by other economic activity.
It is often desirable to transform a time series in levels into one of proportionate changes, as this can uncover features of the series that are not easily visible in the levels. Sometimes the reverse is the case.
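As an illustration of these transformations, here is a minimal sketch of my own (assuming a Python environment with NumPy, and an invented series rather than actual US GDP data): taking logs turns exponential growth into a straight line, and differences of the logs approximate proportionate changes.

import numpy as np

t = np.arange(40)                    # 40 hypothetical periods
gdp = 100.0 * 1.02 ** t              # exponential growth at 2% per period

log_gdp = np.log(gdp)                # after taking logs, the series is linear in t
growth = np.diff(gdp) / gdp[:-1]     # proportionate changes (roughly 0.02 each period here)
log_diff = np.diff(log_gdp)          # log-differences approximate the growth rate

print(growth[:3], log_diff[:3])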
Financial data are very different from macroeconomic data, and are much more precise. This means that we often need different techniques to deal with them.
The mathematical material on the exponential and logarithmic functions started us off on September 15. A few corrections were made to the Appendix to Chapter 2 of the textbook.
Chapter 3 deals with various summary statistics that can be used to describe, or characterise, a data set. First come measures of central tendency: these include the sample mean, median, and mode. The quantiles of a distribution are defined in terms of the order statistics, and they include quartiles, quintiles, deciles, vigintiles, and percentiles. Then came measures of dispersion: the variance and its square root, the standard deviation, along with the range and the inter-quartile range. Box-whisker plots were introduced at this point, after which came the coefficients of skewness and kurtosis.
It was pointed out on September 17 that the coefficients of skewness and kurtosis are dimensionless as defined. This is not the case for the variance, or for the covariance of two variables. For the covariance, this can be corrected by using instead the correlation, which is dimensionless and is also restricted to the [-1,1] interval.
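For concreteness, here is a minimal sketch (my own, assuming Python with NumPy; the two series x and y are invented) that computes the summary statistics mentioned above.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

mean, median = x.mean(), np.median(x)
q1, q3 = np.quantile(x, [0.25, 0.75])
iqr = q3 - q1
var = x.var(ddof=1)                 # sample variance, denominator n-1
z = (x - mean) / x.std()            # standardised data
skewness = np.mean(z ** 3)          # dimensionless
kurtosis = np.mean(z ** 4)          # close to 3 for a Normal sample
corr = np.corrcoef(x, y)[0, 1]      # always in the [-1,1] interval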
At the end of Chapter 3, we are sternly warned that correlation in no way implies causation. Causation has no sense outside of a model that tries to account for empirical reality. Before one can infer causation, it is necessary to recognise some mechanism whereby the cause can give rise to the effect.
Next came Chapter 4, which is of a philosophical nature. The main themes come from the work of Karl Popper. It is impossible to prove that some theory is true, but one single example can show that it is false. For any finite set of empirical facts, it is in principle possible to find an infinite number of theories that are compatible with these facts. Scientific progress is thus achieved when theories are falsified, which can be achieved with a small amount of empirical evidence. A notable example is how Newton's laws were falsified in the early twentieth century by observations that were compatible with Einstein's theories of relativity.
A problem for philosophers is that of induction. This term refers to the way humans like to generalise from some specific examples to formulating theories or hypotheses meant to apply generally. Since a theory can never be proved, induction is not a foolproof method, and it may well lead to a theory that is later falsified. We started with this, as discussed at the end of Chapter 4, on September 22.
We then embarked on Part II of the textbook, beginning with Chapter 5. The topic of the chapter is Probability Theory. Mathematical probability is a formalisation of the idea of frequency - how many times does a coin come down heads in a large number of tosses? These tosses constitute a random experiment, of which the outcome is not known in advance with certainty. We may have a notion that the coin is fair, so that the probability of heads is one half. But if we carry out the experiment, we may find that far more than half of the tosses are heads, or perhaps far fewer. This would lead us to update our notion of the probability, and prefer an a posteriori probability. This is how we can learn from an experiment.
In order to proceed with mathematical probability, we need some ideas from set theory. There are two binary operations defined on sets: intersection and union. There is also a unary operation, which defines the complement of a set. The two binary operations satisfy the properties of commutativity, associativity, and distributivity, and, when combined with the complement operation, they satisfy the de Morgan laws.
As we saw on September 24, the operations of set theory can be illustrated using Venn diagrams. Much of axiomatic probability can be so illustrated. There is a fine distinction between axioms and definitions. For probability, we define a probability space as a triple. The first element is the outcome space, usually denoted Ω. Subsets of the outcome space are called events, and the set of events has to satisfy the axioms of a sigma algebra (σ-algebra), denoted 𝓕, which means that it is closed under complementation and under countable unions and intersections. The σ-algebra 𝓕 is the second element of the triple. If we have just (Ω, 𝓕), this constitutes a measurable space.
Probabilities may be assigned to events, elements of 𝓕. A probability measure must also satisfy some axioms, and these allow us to prove various identities involving the probabilities of events. If the probability measure is denoted P, then we have the probability space as the triple (Ω, 𝓕, P).
It is often necessary to count the number of ways in which some things can be done. Functions that are very useful in this context are the factorial, and the combination and permutation functions. Their use was illustrated by counting the number of hands that can be dealt using playing cards.
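For instance (a small illustration of my own, using only Python's standard library), the number of five-card hands that can be dealt from a 52-card deck is the combination C(52,5).

from math import comb, perm, factorial

hands = comb(52, 5)        # order does not matter: 2,598,960 distinct hands
ordered = perm(52, 5)      # order matters: 52!/(52-5)! ordered deals
assert ordered == hands * factorial(5)   # each hand can be ordered in 5! ways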
Conditional probability was the main theme of the class on September 29. From the definition of the probability of an event conditional on another event one can derive Bayes' Theorem, which follows from the fact that the operation of intersection is commutative. The theorem can be stated in several different ways, and numerous results follow from it in combination with the general properties of a probability measure.
A somewhat counter-intuitive result was examined, where we computed the probability that an individual had a rather rare condition when a diagnostic test that is not infallible gave a positive result. Although the test has very low error rates, the computed probability was very low.
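The calculation takes only a few lines; the numbers below are invented for illustration and are not necessarily those used in class.

# P(condition | positive test) by Bayes' Theorem, with made-up rates.
prevalence = 0.001          # rare condition
sensitivity = 0.99          # P(positive | condition)
false_positive = 0.05       # P(positive | no condition)

p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)
p_condition_given_positive = sensitivity * prevalence / p_positive
print(p_condition_given_positive)   # about 0.019: low, despite the accurate test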
In Chapter 6, distributions of random variables are the principal topic. A real-valued random variable is a mapping from the outcome space Ω to the real line, and, as such, can have different probability measures superimposed on it. The most straightforward way of doing so is to postulate a cumulative distribution function, or CDF. This function has various essential properties, and is sufficient to characterise the distribution of the random variable completely.
On October 2nd we went on to look at other properties of CDFs, and introduced EDFs, for empirical distribution functions, of samples. These are necessarily discrete, since a sample must be of finite size. But there are other discrete distributions, where the number of discrete points that can be realised is infinite. For a discrete distribution, we saw that another form of complete characterisation of the distribution is the probability mass function, or PMF, which specifies a positive probability for each of the possibly infinite number of points of the distribution.
The concept of the support of a distribution was introduced. It is the set of all points that are possible realisations of, or drawings from, the distribution. It is sometimes necessary, with continuous distributions, to include in the support any points that can be reached as limits of a sequence of points in the support. This can also be stated by saying that the support is a closed set.
When a distribution is continuous, the probability that some single value is realised is zero. Thus we prefer to argue in terms of the probability of intervals. The density of a continuous distribution is a function that, when integrated over an interval, gives the probability of that interval. When a density is integrated over the whole real line, from minus infinity to plus infinity, the answer must be one. The density is the derivative of the CDF.
A graphical way that can give a summary description of a distribution, continuous or discrete, is the histogram. Separate intervals, or cells, of the support are defined, and the histogram shows the probabilities of these intervals.
The family of Normal distributions is what is called a location-scale family. A Normal distribution is completely characterised by two parameters, the expectation and the variance. The standard Normal distribution has expectation zero and variance one, and any other Normal distribution can be generated from the standard Normal. The Normal density is a bit complicated, but should be remembered. It is an example of a distribution that is symmetric about its expectation.
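For the record, the density in question is the standard result (my own statement in LaTeX notation, not a quotation from the textbook):

\[
  f(x;\mu,\sigma^2) \;=\; \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
  \qquad X = \mu + \sigma Z, \quad Z \sim N(0,1),
\]

which makes explicit how any member of the family is generated from the standard Normal by a change of location and scale.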
We completed Chapter 6 on October 6. This involved looking at graphs of the CDF and density of prices of paintings by Canadian artists. There were a few very large sums of money that distorted the graphs. Better information came from restricting attention to prices less than $10,000. But that revealed the fact that most prices were round numbers, with nothing in between. The CDF was thus discontinuous, and so, strictly speaking, the density does not exist. But a smooth density was nonetheless produced by an algorithm that smooths out the discontinuities.
In Chapter 7, we had formal definitions, for both discrete and continuous distributions, of the expectation of a random variable, or rather of its distribution. We then discussed higher moments, central moments, absolute moments, and absolute central moments. Definitions of the coefficients of skewness and kurtosis were given, and we noted that they are dimensionless quantities, unchanged by changes in location or scale.
Among properties of the expectation, we noted Jensen's inequality for convex functions of random variables. Then two inequalities were stated, the first ascribed to Chebychev. We looked quickly at the proof of the second inequality, involving the use of indicator functions.
The proofs of the two theorems given at the end of Chapter 7 were gone over at the start of class on October 8. These both allow us to bound the probability mass in the tail or tails of a distribution, under certain regularity conditions. The proofs suppose that the distributions are continuous, but it is a good exercise to reformulate them for discrete random variables with given probability mass functions.
Then we moved on to Chapter 8, which deals with joint and conditional distributions. The first definition is of the joint CDF of a set of random variables. From the joint CDF it is possible to obtain the marginal CDFs of the individual variables, by letting the arguments that correspond to the other variables tend to infinity. Then it is possible to define conditional CDFs using the formulas developed earlier for the probability of an event conditional on another event.
The approach began with discrete distributions, but it can be extended to continuous distributions that possess a density. It is of interest to note that a conditional CDF, probability mass function, or density function can be thought of as a deterministic function of a random variable, and so itself a random variable.
October 22 was the first class after reading week, since October 20 was the midterm exam. We resumed work on Chapter 8, beginning with conditional probability mass functions and conditional densities, which, we recalled, are deterministic functions of the conditioning variable or variables. Conditional densities let us get a formal definition of independence of random variables and events. The definition says that independence is equivalent to factorising the joint mass function or density function into the product of the marginal quantities.
When variables are not independent, it is of interest to define their covariance and correlation. The covariance of a variable with itself is just its variance. The correlation has the advantage of being dimensionless and its values are limited to the [-1,1] range. There was a quite long discussion of how correlation does not imply causation. Causation is a concept that makes sense only in the context of a model, and for causation to make sense, a mechanism must be specified in the model that leads from cause to effect. Correlation can arise from phenomena that are not at all causal.
The next topic was the conditional expectation. It too is a deterministic function of the conditioning variable(s). Regression models specify the expectation of a dependent variable conditional on a set of explanatory variables.
The last section of the chapter is on the bivariate Normal distribution, which is a special case of the multivariate Normal distribution. A set of multivariate Normal variables is a set of linear combinations of mutually independent standard Normal N(0,1) variables, to which possibly nonzero expectations are added.
The bivariate Normal distribution was the first topic treated on October 27. The expression for the density is rather complicated, and it depends on 5 parameters: two expectations, two standard deviations, and a correlation. We briefly discussed how it is possible to derive the expression given in the text from the known expression for the standard Normal density. The chapter concludes with some perspective drawings of the three-dimensional density of the bivariate Normal distribution, which show the effects of the five parameters.
In Chapter 9, some standard distributions are discussed, starting with three discrete distributions: the uniform distribution on a set of integers, the binomial distribution, which gives the probabilities of the number of successes in n independent trials that can each either succeed or fail, and the Poisson distribution, suitable for working with count data.
Of the numerous continuous distributions described in this chapter, we had time only for the uniform and Normal densities. It was pointed out that a probability distribution can be characterised in two ways: by an analytical formula, or by a recipe for simulation. Such recipes were provided for the distributions we looked at.
The study of some standard distributions was continued on October 29. For the discrete uniform distribution, and for the family of univariate Normal distributions, recipes for simulation were given.
The main effort was on continuous distributions. Logically, the first of these is the chi-squared (χ²) distribution. It depends on a parameter called the degrees of freedom (d.f.). The density is analytically rather complicated, as it depends on the Gamma function, which is related to the factorial when the argument is a positive integer. But the recipe for simulation is simple: the χ² variable with ν degrees of freedom is the sum of the squares of ν IID standard Normal variables.
Next came Student's t-distribution, also characterised by a degrees-of-freedom parameter ν. It is the ratio of a standard Normal variable to the square root of a χ² variable with ν d.f. divided by ν. The F-distribution has two degrees-of-freedom parameters, one each for the numerator and the denominator.
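These recipes are easy to turn into a simulation. Here is a minimal sketch of my own (assuming NumPy; the choice of ν and the number of draws are arbitrary):

import numpy as np

rng = np.random.default_rng(42)
nu = 5
n_draws = 100_000

z = rng.standard_normal((n_draws, nu))
chi2 = (z ** 2).sum(axis=1)                   # chi-squared with nu degrees of freedom

numerator = rng.standard_normal(n_draws)
student_t = numerator / np.sqrt(chi2 / nu)    # Student's t with nu degrees of freedom

print(chi2.mean())                            # should be close to nu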
Chapter 9 is rounded out by two more continuous distributions: the exponential, with one parameter, which is just the expectation, and the log-normal, the logarithm of which is Normal.
Chapter 10 is the first in Part III of the textbook. It deals with sampling and sampling distributions. There was a long discussion of the relation between a population and a random sample, where, in order to estimate a property of the population, one can use the same property of the sample.
We covered the first three sections of Chapter 10 on November 3. An essential concept is that of the sampling distribution, that is, the probability distribution of a random sample. Ideally, we want to have an IID sample, but against this there is sometimes a need for stratified sampling.
A number of specific examples were considered. One, rather trivial, involved tossing a fair coin. Less trivial was an example where it was necessary to use a random number generator (RNG). Many drawings were made from two distributions, the U(0,2) and the χ² with 1 degree of freedom. What was observed is that the sampling distribution of the sample mean became more concentrated around the population mean as the sample size increased. We were led to define the notion of root-n convergence after considering an IID sample.
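A sketch of this kind of experiment (my own parameter choices, not necessarily those used in class): both the U(0,2) and the χ² with 1 d.f. have population mean 1, and the spread of the simulated sample means shrinks roughly like one over the square root of n.

import numpy as np

rng = np.random.default_rng(1)
for n in (10, 100, 1000):
    means_u = rng.uniform(0, 2, size=(10_000, n)).mean(axis=1)    # U(0,2) samples
    means_c = rng.chisquare(1, size=(10_000, n)).mean(axis=1)     # chi-squared(1) samples
    print(n, means_u.std(), means_c.std())   # standard deviations fall roughly as 1/sqrt(n)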
When an assumption of Normality is made, we can be more specific, and compute algebraically the expectation and variance of the sampling distribution from an IID sample. This led us to introduce the notion of centring and standardising the sample mean.
The first topic on November 5 was confidence intervals. If we know, or can approximate, the distribution of a test statistic that tests whether a parameter takes on a particular value, we can make a probabilistic statement, based on the realised value of the statistic, about the true value of the parameter. This statement can be interpreted as saying that the random confidence interval includes the non-random true parameter value with a certain probability, often 95% or 99%.
The next topic consumed the rest of the time of that class. It was the demonstration that the sample variance, denoted s², has a distribution proportional to a χ² with n-1 degrees of freedom, where n is the sample size. It is important to note that the sample variance is not the variance of the empirical distribution of the sample: it has a denominator of n-1 instead of n. The intuition is that a degree of freedom is used up in estimating the mean, leaving only n-1 for estimating the variance.
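In symbols (my own compact summary, in LaTeX notation):

\[
  s^2 \;=\; \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2,
  \qquad
  \frac{(n-1)\,s^2}{\sigma^2} \;\sim\; \chi^2_{n-1},
\]

the second relation holding for an IID sample drawn from a Normal population with variance σ².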
The proof of the theorem needs to be updated. Watch the updates of the textbook for this.
On November 10, it was stated that the proof of Theorem 10.3, especially the paragraph immediately following the proof, had been updated, and was now (presumably) correct.
Chapter 11 begins with definitions of convergence in probability and convergence in distribution. The latter is not really about convergence of sequences of random variables; rather it is about convergence of their distribution functions. The Weak Law of Large Numbers (WLLN) was stated, but its proof was postponed until the Appendix to the Chapter.
A simple version of the Central Limit Theorem was stated, this time without proof. It is important to note that the theorem concerns, not the data themselves, but sample means of functions of the data. Graphical examples were given of the densities of sample means where the underlying distribution was the χ² with 1 degree of freedom, thus heavily skewed. But, even for quite small sample sizes, the distribution of the sample mean can be seen to approach the symmetric Normal distribution.
A specific example was considered, with a survey in which people were asked whether they planned to vote yes (1) or no (0) in an upcoming referendum. The aim of the survey was to provide statistical information about the proportion of people in the population intending to vote yes. The distribution for any surveyed individual is a Bernoulli distribution with parameter p, where p is that population proportion. This leads to a confidence interval for p.
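A minimal numerical sketch of the resulting approximate 95% confidence interval (the survey figures below are invented):

import numpy as np

n, yes = 1000, 520                          # hypothetical survey results
p_hat = yes / n
se = np.sqrt(p_hat * (1 - p_hat) / n)       # estimated Bernoulli standard error
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(lower, upper)                         # roughly (0.489, 0.551)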
This brought us to the Appendix to Chapter 11, in which a proof of the WLLN is given. It relies on the Markov inequality stated and proved in Chapter 7.
Chapter 12 begins with a classical problem, in which there are random samples drawn from two populations, and one wishes to test the hypothesis that the two population means are equal. We saw a way to construct a confidence interval for the difference in the population means.
We managed to cover all of Chapter 12 on November 12. The main topic of interest in this Chapter is the concept of a confidence interval. If we know the distribution of a particular statistic that depends on a parameter like a population mean, or if we know it approximately, as for instance by use of a Central Limit Theorem, then we can make a probability statement about quantities determined by a data set and the parameter. The principal application of this idea is when the statistic is a sample mean, centred and standardised. We can represent this by the expression (x̅ - μ)/√(s²/n). The CLT then tells us that, as n tends to ∞, the distribution of the statistic tends to the standard Normal, N(0,1). It is then possible to construct a confidence interval using the quantiles of the standard Normal distribution.
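Written out (again my own compact summary in LaTeX notation rather than a quotation), the argument is that

\[
  \frac{\bar{x}-\mu}{\sqrt{s^2/n}} \;\xrightarrow{d}\; N(0,1)
  \quad\text{as } n \to \infty,
  \qquad\text{so that}\qquad
  \bar{x} \;\pm\; z_{1-\alpha/2}\sqrt{s^2/n}
\]

covers the true μ with probability approximately 1-α, where z_{1-α/2} is the corresponding quantile of the standard Normal distribution.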
We discussed getting a confidence interval (a random interval) for the difference in the population means (a non-random quantity) of two distributions from each of which random samples had been drawn. There were two cases: paired samples and independent samples.
There was then an extended discussion of how the statistics and corresponding confidence intervals could be studied by simulation. It is necessary to choose a data-generating process or DGP, which can be thought of as a recipe for simulation. The results of some simulation experiments based on the CLT were shown graphically for the statistics used to compute confidence intervals.
We embarked on Chapter 13 on November 17. The title of the chapter is Point estimators. The distinction was made between an estimator, which is a random variable, usually a deterministic function of a random sample, and an estimate, which is a realisation of an estimator. The two are often confused, but the context normally makes clear whether a random variable or a realisation is meant. For a parameter θ, both the estimator and an estimate of it can be denoted θ̂.
Properties of estimators came next. The bias of an estimator is the expectation of the estimation error, that is, the difference between the estimator and the estimand, which is the quantity to be estimated. Regarding the second moment of an estimator, one estimator is said to be more efficient, or precise, than another if its variance is smaller.
The mean squared error of an estimator combines bias and variance, and illustrates the tradeoff between them. Indeed, the mean squared error of an estimator is the sum of the square of its bias and its variance.
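Explicitly (in LaTeX notation):

\[
  \mathrm{MSE}(\hat\theta) \;=\; \mathrm{E}\big[(\hat\theta-\theta)^2\big]
  \;=\; \mathrm{Var}(\hat\theta) \;+\; \big(\mathrm{Bias}(\hat\theta)\big)^2 .
\]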
A large class of estimators consists of those defined by minimising a loss function, which is a function of both data and parameters. For a scalar parameter θ with estimator θ̂, a loss function would look like ℓ(θ̂, θ). The expectation of a loss function is the corresponding risk function.
We often have to resort to asymptotic theory when deriving the properties of an estimator. An estimator θ̂ is consistent for θ if the plim of the estimation error is zero. An estimator is asymptotically normal if the limiting distribution as n tends to infinity of the estimation error scaled up by multiplying by the square root of n is Normal.
We discussed the least-squares (LS) estimator in some detail, and likened it to an approximate solution to a set of linear simultaneous equations which have no joint solution. The least absolute deviation (LAD) estimator replaces the squares of the errors by their absolute values. In general, the LS estimator is the more efficient of the two, while the LAD estimator is the more robust, that is, valid under weaker conditions than those needed for the validity of LS.
We finished Chapter 13 on November 19. This was an old-fashioned class, with me writing with chalk on the blackboard, since the connection to the classroom computer was not working.
Much of what we did was revision of earlier material in Chapter 13, especially regarding least-squares and least-absolute-deviation estimation. But we also discussed estimation by the method of moments, and showed how it could be used for a simple regression model. The final estimation method was maximum likelihood. This method requires more information than the other methods discussed in this chapter, but it gives an efficient estimator. If we want a loss function, it has to be the negative of the likelihood function, which is a function of data and parameters. If the parameters are fixed, the function becomes a density function, but, in use, the data are fixed, and the function maximised with respect to the parameters.
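A small sketch of my own (simulated Normal data, not an example from the textbook): for an IID Normal sample, the method-of-moments estimates coincide with the maximum-likelihood estimates, and the variance estimate has denominator n rather than n-1.

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=2.0, scale=3.0, size=500)
n = len(x)

# Method of moments: match the first two sample moments to mu and sigma^2.
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()     # denominator n, not n-1

# The Normal log-likelihood, a function of both data and parameters.
def loglik(mu, sigma2):
    return -0.5 * n * np.log(2 * np.pi * sigma2) - ((x - mu) ** 2).sum() / (2 * sigma2)

# The log-likelihood is maximised at (mu_hat, sigma2_hat): nearby points do no better.
print(loglik(mu_hat, sigma2_hat) >= loglik(mu_hat + 0.1, sigma2_hat))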
We rushed through Chapter 14 on November 24, since it serves mainly to put together, and give formal definitions about, interval estimators, as opposed to point estimators, and confidence intervals in particular.
We covered all of Chapter 15, on hypothesis testing, on November 24. The null hypothesis, usually denoted H0, can be thought of as a model: the null is true if the true DGP belongs to that model. It is contained in a larger model, called the alternative hypothesis, denoted H1.
The simplest example was a hypothesis about the population mean, to be tested on the basis of an IID sample drawn from the population. The test is based on a test statistic, denoted τ. This statistic has some distribution, perhaps an approximate one given by asymptotic theory, under the null hypothesis, and a different one for DGPs in the alternative that are not in the null. The difference between the distributions under the null and under the alternative is what lets the statistic discriminate between the two hypotheses.
A test is binary: it leads one to reject the null or to fail to reject. In no circumstances can it confirm a hypothesis - this is another manifestation of Popper's discussion of falsifiable hypotheses. But even rejection by a statistical test cannot be definitive. It depends on the significance level of the test, which can be interpreted as tolerance for Type I error, that is, rejecting a null when it is true. Type II error is committed when a test fails to reject a false null. The power of a test is the complement of the probability of Type II error, that is, the probability of rejecting a false null. The power in the simple case considered is a function of the non-centrality parameter.
A test has a rejection region, such that when the statistic τ falls into this region, the test rejects. The rejection region is defined by one or more critical values, which are quantiles of the null distribution of τ. There are one-tailed and two-tailed tests, and the latter may or may not be equal-tailed.
An important concept is the P-value, or marginal significance level. It is a deterministic function of τ and it lets a researcher reject or not based on his or her subjective significance level. A small P-value may lead to rejection, but a large P-value means that there is no significant evidence against the null.
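A minimal sketch of such a test (my own example with hypothetical data, assuming NumPy): a two-tailed test that the population mean equals μ0, based on the asymptotic N(0,1) distribution of the statistic.

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)
x = rng.normal(loc=0.2, scale=1.0, size=100)   # invented sample
mu0 = 0.0                                      # null hypothesis: population mean is 0

tau = (x.mean() - mu0) / np.sqrt(x.var(ddof=1) / len(x))
phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard Normal CDF
p_value = 2 * (1 - phi(abs(tau)))              # two-tailed marginal significance level
reject = p_value < 0.05                        # decision at the 5% significance level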
Matrix algebra is the topic of Chapter 16, on which we embarked on November 26. We covered all of this chapter, and started on the next chapter, on linear regression. The operations on matrices that we looked at include a unary operation, namely transposition, and the binary operations of addition, subtraction, and multiplication. Matrix multiplication leads to a product matrix, all the elements of which are scalar products of the rows and columns of the two factor matrices.
Although a matrix is in general a rectangular array of real numbers, a column vector is a matrix with only one column, and a row vector has only one row. Some operations are restricted to square matrices, in particular matrix inversion. An important square matrix is the identity matrix.
Other square matrices with special properties include symmetric matrices, diagonal matrices, and triangular matrices.
A regression model expresses the expectation of the dependent variable conditional on a set of explanatory variables. If the conditional expectation is a linear function of the explanatory variables, we have a linear regression.
December 1 was the last class of the term except for the review session programmed for December 3. We discussed linear regression at some length. The starting point was a regression with just one explanatory variable, possibly along with a constant. The estimation method was least squares, which depends on minimising the sum of squared residuals. The first-order conditions for the minimisation are the estimating equations, and solving them means solving a pair of linear simultaneous equations.
It is possible to handle an almost arbitrary number of explanatory variables if one makes use of matrix notation and matrix algebra. It emerges that solving a system of linear simultaneous equations is equivalent to inverting a square matrix, a task that has long been studied, and for which efficient computer algorithms exist. The resulting estimator is called the ordinary least squares (OLS) estimator.
If we are prepared to make some assumptions about the distributions of the variables in a linear regression model, we can derive an expression for the covariance matrix of the OLS parameter estimates, and from that standard errors for them. The variance parameter, σ², can be estimated by the minimised sum of squared residuals, divided by n-k, where n is the sample size, and k is the number of explanatory variables, including the constant. This estimator is denoted as s².
A measure of goodness of fit is provided by the coefficient of determination, denoted R². It is the ratio of the explained sum of squares to the total sum of squares. It has some advantages, but also some disadvantages, notably that it never decreases when additional explanatory variables are included in the model, even if they have little or no explanatory power.
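The whole chain of calculations fits in a few lines of matrix algebra. Here is a self-contained sketch with invented data (my own illustration, assuming NumPy; it is not course-issued code):

import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])       # constant plus two regressors, so k = 3
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # solves the OLS estimating equations

residuals = y - X @ beta_hat
k = X.shape[1]
s2 = residuals @ residuals / (n - k)            # estimate of the variance parameter
cov_beta = s2 * np.linalg.inv(X.T @ X)          # covariance matrix under the classical assumptions
std_errors = np.sqrt(np.diag(cov_beta))

tss = (y - y.mean()) @ (y - y.mean())
r_squared = 1 - residuals @ residuals / tss     # equals ESS/TSS when a constant is included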
Practice Christmas exam:
Follow this link to see a practice exam that is similar to what will be on the Christmas final exam.
Assignments:
When an assignment is due on a certain date, that means that, if it is submitted on myCourses before midnight of that day, it is considered to be on time.
The first assignment, dated September 22, can be found by following this link. It is due on Tuesday September 30.
The second assignment, dated October 6, can be found by following this link. It is due on Monday October 13.
The third assignment, dated November 17, can be found by following this link. It is due on November 24.
The fourth assignment, dated November 24, can be found by following this link. It is due on December 3.
In order to encourage the use of the Linux operating system, here is a link to an article by James MacKinnon, in which he gives valuable information about what software is appropriate for the various tasks econometricians and statisticians wish to undertake.
To send me email, click here or write directly to russell.davidson@mcgill.ca.
URL: http://russell-davidson.research.mcgill.ca/e257