Multiple regression in property analysis

Regression analysis is central to much property-related research these days. Adarkwah Antwi offers a basic explanation of some terms and statistics that are necessary for understanding the regression process.

Statistical concepts are steadily finding their way into property research and analysis. One can hardly open an article in a property journal without coming across concepts such as the coefficient of correlation, standard error and statistical significance. More particularly, regression analysis is central to most property-related research these days.

With the advent of user-friendly computer software, it is indeed very easy to undertake these analyses. However, an understanding of the processes and concepts involved is a prerequisite. Surveying practitioners therefore need to have a working knowledge of these concepts to be able to follow the current thinking and developments in the profession.

It must be stated that, while the aim of this article is to provide an understanding of regression analysis, it stops short of providing proofs of mathematical derivations.

What is it?

Regression analysis is a formal technique for quantifying or establishing a relationship between different sets of data. We may say, for example, that the movement of house prices is strongly influenced by the movement of interest rates. This is generally true. But by how much are house prices affected if interest rates change by one percentage point? Do prices of all house types in all regions of the UK respond in the same way to interest rate movements? Questions of this nature can be answered by employing regression analysis.

The analysis allows a mathematical relationship linking house prices and interest rates to be established using historical data. The technique also enables predictions to be made for a new sample of observations on the basis of the findings from a previous sample. Simple regression relates one independent variable (interest rates in the above example) to the dependent variable (house prices). Multiple regression analysis (henceforth MRA) allows the determination of whether a relationship exists between several independent variables and a dependent variable. The functional form of MRA is:

Y = f(X1, X2, …, Xn) + e

where Y is the dependent variable or regressand, that is, what is to be forecast or explained; the Xi (i = 1, …, n) are the independent or explanatory variables (the regressors); and e is the error term.
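
By way of illustration only (this sketch is not part of the original article), the following Python fragment estimates a multiple regression of this form on made-up data; the variable names and figures are hypothetical.

```python
import numpy as np

# Hypothetical data: 40 observations of a rent index (Y) explained by
# an interest-rate series and a construction-cost index (the Xs).
rng = np.random.default_rng(0)
n = 40
interest = rng.normal(8.0, 1.5, n)            # hypothetical interest rates (%)
cost = rng.normal(100.0, 10.0, n)             # hypothetical cost index
rent = 50 - 2.0 * interest + 0.8 * cost + rng.normal(0, 3, n)

# Design matrix with a constant term: Y = b0 + b1*X1 + b2*X2 + e
X = np.column_stack([np.ones(n), interest, cost])
b, *_ = np.linalg.lstsq(X, rent, rcond=None)  # least-squares fit
print("Estimated coefficients (b0, b1, b2):", b)
```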

The functional form

The manner in which the dependent and independent variable(s) are related is technically termed the “functional form”. The simplest form the equation can take, the linear form, can be represented as:

Y = b0 + b1X1 + b2X2 + … + bnXn + e

A linear model implies that the dependent variable is an additive result of the changes in the independent variables.

In economics, the relationship between the dependent and independent variables often tends to be more complex than linear. An instance is where the growth of rental value, say, is a multiplicative response to the independent variables (construction cost etc), rather than an additive one. The functional form of such a relationship can be very complicated. However, most non-linear functional relations can be mathematically transformed into linear versions to facilitate estimation.
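
To illustrate the transformation (a sketch, not from the article): a multiplicative model such as Y = a·X1^b1·X2^b2·e becomes linear when logs are taken, and can then be estimated as before. All figures below are made up.

```python
import numpy as np

# Multiplicative model: Y = a * X1^b1 * X2^b2 * e.
# Taking logs gives a linear form: ln(Y) = ln(a) + b1*ln(X1) + b2*ln(X2) + ln(e)
rng = np.random.default_rng(1)
n = 40
x1 = rng.uniform(50, 150, n)                  # eg a construction-cost index
x2 = rng.uniform(1, 10, n)
y = 3.0 * x1**0.7 * x2**-0.3 * np.exp(rng.normal(0, 0.05, n))

X = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
b, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print("Estimates of ln(a), b1, b2:", b)       # close to ln(3), 0.7, -0.3
```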

Estimation

To verify the specific relationship between the variables, a specified model is fitted to historical data. The data are normally of one of two distinct types – time series or cross-sectional. Time series data are historical data collected over time (eg the rental behaviour of chosen properties over the past 20 years), while cross-sectional data are collected at a given point in time across a geographical area (eg rental values of office properties in major UK cities in 1992).

Statistical methods for estimation are many and various, but the most commonly used, at least in applied property research, is the ordinary least squares (OLS) method. Many statistical packages are available on the market for estimating regressions (Microfit is one example). Also, depending on the level of accuracy required, other non-statistical packages (eg spreadsheets) can be used to estimate the regression. By its nature, the regression process is one of trial and error: the researcher experiments with the data to achieve the equation which best describes the system under investigation. Consequently, there are various tests for guidance.
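
The article's own estimations were run in Microfit; purely as an illustration, the sketch below uses the Python statsmodels package (an assumption of this example, not a tool mentioned in the article), whose summary output reports most of the test statistics discussed below.

```python
import numpy as np
import statsmodels.api as sm

# Fit an OLS regression on made-up data; the summary reports R2, t-statistics,
# standard errors, the F-statistic and the Durbin-Watson statistic.
rng = np.random.default_rng(2)
n = 40
X = rng.normal(size=(n, 2))                   # two hypothetical regressors
y = 1.0 + 0.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 0.5, n)

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.summary())
```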

The decision as to the best equation to employ after a series of trials relies on a number of statistical tests, which include the coefficient of correlation (r), the coefficient of determination (R2), the t-statistic, the standard error of the estimate and the F-statistic. Also considered are the following supplemental checks – multicollinearity, the Durbin-Watson test and the stability of the equation. These assist in indicating the strength of the relationships in the regression equation and provide the basis for choosing the most relevant equation.

The statistical tests

Coefficient of correlation
The coefficient of correlation, r, is a relative measure of the relationship between any two variables (ie the Y and X variables). The value of r ranges between -1 and +1, with -1 or +1 indicating a perfect correlation. A positive value indicates that Y increases as X increases; a negative value indicates that Y decreases as X increases, and vice versa. A value of zero means that there is little or no linear relationship between the variables.
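
A quick numerical sketch (made-up figures, not from the article):

```python
import numpy as np

# Pearson's coefficient of correlation, r, for two short made-up series.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")                         # close to +1: y rises with x
```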

Coefficient of determination (R2)
The coefficient of determination, R2, is the square of the coefficient of correlation and is the ratio of explained variation to total variation. It represents the proportion of the movement of the dependent variable that can be explained by movement in the independent variable(s), or how well changes in the independent variables explain the change in the dependent variable. Thus an R2 of 0.9 means that 90% of the historic movement of the dependent variable can be explained by the regression equation.

R2 lies within the range 0 to 1. If the data scatter points are close to the line of best fit, there is a high degree of explained variance and hence R2 is high. For example, if all the data points lie exactly on the line, R2 equals 1 – an improbable event. Generally, a large R2 indicates high explanatory power; that is, the MRA equation explains changes in the dependent variable well on the basis of the specified independent variables.
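
As a sketch of the calculation (using the same made-up series as above):

```python
import numpy as np

# R2 as explained variation over total variation for a simple fitted line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)                  # line of best fit
fitted = b0 + b1 * x
ss_res = np.sum((y - fitted) ** 2)            # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)          # total variation
print(f"R2 = {1 - ss_res / ss_tot:.3f}")      # equals r squared in simple regression
```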

T-statistic
The t-statistic, also known as the t-test or t-ratio, is used to determine the significance of each coefficient of the independent variables in predicting the dependent variable. To determine whether an individual coefficient is significantly different from zero, its t-value is computed and compared with a critical value of t from a t-table, given the confidence level at which one wants to test the coefficient and the appropriate degrees of freedom (df). The degrees of freedom are obtained by subtracting the number of independent variables in the regression equation plus the constant term from the total number of observations.

In general, at the 95% confidence level, if the absolute value of the t-statistic is equal to or greater than two, the coefficient is probably statistically significant. On the other hand, if the t-statistic is less than the critical value in the table, or more roughly less than two, then the coefficient is not significantly different from zero. The implication of such a case is that the variable's apparent contribution may be fortuitous and its coefficient could just as well be zero without affecting the variation of the dependent variable. No reliance could therefore be placed on such a variable as contributing to or explaining the variation of the dependent variable.
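
A sketch of the comparison, using hypothetical numbers and the scipy t-distribution in place of a printed t-table:

```python
from scipy import stats

# t = coefficient / standard error, compared with the two-tailed 95% critical t.
coef, se = 1.8, 0.6                           # hypothetical estimate and its SE
n_obs, k = 30, 3                              # observations and regressors
df = n_obs - (k + 1)                          # subtract regressors plus constant
t_stat = coef / se
t_crit = stats.t.ppf(0.975, df)               # two-tailed 95% critical value
print(f"t = {t_stat:.2f}, critical t = {t_crit:.2f}")
print("significant" if abs(t_stat) >= t_crit else "not significant")
```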

Standard error of the estimate
The standard error of the estimate (SEE), or the standard error of the regression, indicates the standard deviation of the dependent variable given a specific set of values of the independent variables. Under ordinary least squares (OLS) assumptions, the total effect of changes in the independent variables does not explain all the variation in the dependent variable, hence the error term. The SEE gives a standard measure of the deviation of the calculated value of Y from the true Y. The calculated SEE can therefore be used to place a band around the regression equation. Although there is no general critical value or benchmark for the SEE, a lower SEE suggests greater significance because the dependent variable is banded by a tighter range. By definition, R2 and the SEE are inversely related, since the better the fit the lower the SEE should be.
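
A minimal sketch of the calculation, with hypothetical observed and fitted values:

```python
import numpy as np

# SEE: residual standard deviation, adjusted for degrees of freedom.
y = np.array([10.0, 12.0, 15.0, 14.0, 18.0, 21.0])       # observed Y
fitted = np.array([10.5, 12.1, 14.0, 15.2, 17.8, 20.4])  # Y from the equation
k = 1                                          # regressors (plus a constant)
see = np.sqrt(np.sum((y - fitted) ** 2) / (len(y) - k - 1))
print(f"SEE = {see:.3f}")                      # gives a +/- band around the fit
```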

F-statistic
The F-statistic, or F-test, is a measure used to determine whether the equation as a whole is significant in predicting the dependent variable. Under the classical assumptions of OLS, the F-statistic can be used to test the joint significance of the regressors included in the equation other than the intercept term. In practical terms, the F-statistic can be used to distinguish between statistically significant correlations and correlations resulting from sampling error or chance. That is to say, a high R2 per se might not indicate statistical significance of the regressors, but may be a result of the way the samples were chosen (biased sampling, for example) or of sheer chance. The F-test helps to eliminate any such doubts. To decide the statistical significance of the regression equation, the calculated F-statistic is compared with a critical F-value (from F-distribution tables) at a level of confidence determined by the analyst (90%, 95%, 99% etc).

Once again the selection of the critical value from tables depends on the required level of confidence and the degrees of freedom (df) associated with it (two different degrees of freedom in this case). If the calculated F-value is greater than the critical value from the tables, the regression equation is said to be statistically significant at the chosen level of confidence. Thus the analyst can be approximately 95% certain that there is an overall significant relationship between the dependent variable Y and the independent variables if the calculated F-value is greater than the critical value at the 95% confidence level.

In general, if there are six to 10 observations, the regression equation is likely to be significant at the 95% confidence level if the calculated F-value is greater than or equal to six. For more than 10 observations the F-value should be greater than or equal to five to be significant at the 95% level of confidence (see Murphy, 1989). If the calculated F-statistic is below the critical F-value, then any correlation or relationship could be due to chance or sampling error.
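
A sketch of the overall test, computing F from R2 (a standard identity) and taking the critical value from the scipy F-distribution rather than printed tables; all figures are hypothetical:

```python
from scipy import stats

# F = (R2/k) / ((1 - R2)/(n - k - 1)), with k and n - k - 1 degrees of freedom.
r2, n_obs, k = 0.90, 20, 3                    # hypothetical R2, observations, regressors
f_stat = (r2 / k) / ((1 - r2) / (n_obs - k - 1))
f_crit = stats.f.ppf(0.95, k, n_obs - k - 1)  # 95% critical value
print(f"F = {f_stat:.1f}, critical F = {f_crit:.2f}")
print("equation significant" if f_stat > f_crit else "not significant")
```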

Durbin-Watson test
One classical assumption of regression analysis is that the error terms (denoted by e) are independently and identically distributed. That is, the error in the rent figure for one quarter, say, is independent of the error in the rent for the next quarter. This assumption is particularly important for time series analysis, and the test is especially relevant for property-related work, given that rental value determination in the UK depends so much on comparable rents from previous periods. The Durbin-Watson test is used to indicate whether the unexplained individual errors are random. When they are not independent, autocorrelation (also referred to as serial correlation) is likely to be present. This can mean that the estimated standard errors are biased, and hence the t-statistics unreliable. It can also mean that an important part of the dependent variable Y has not been explained.
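
The statistic itself is simple to compute from the residuals (a sketch with made-up residuals; values near 2 indicate randomness, values near 0 or 4 indicate autocorrelation):

```python
import numpy as np

# Durbin-Watson: sum of squared successive differences over sum of squares.
resid = np.array([0.5, 0.4, 0.6, 0.3, -0.2, -0.4, -0.5, -0.3, 0.1, 0.4])
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(f"DW = {dw:.2f}")                       # well below 2: positive autocorrelation
```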

Multicollinearity
A multicollinearity problem occurs when two or more of the independent (explanatory) variables are highly correlated with one another. If this occurs in a regression equation, the R2 value is no longer reliable, since the statistic is distorted by the correlation between the independent variables. A practical way of avoiding the problem is to investigate the correlation between the independent variables (ie by producing a correlation matrix) and to drop one or more of any set of variables that have disturbingly high correlation coefficients.

In general, a correlation greater than or equal to 0.70 in absolute value suggests a high degree of interrelationship and the possibility of multicollinearity. Generally this problem is less important in models whose sole aim is to provide forecasts of the dependent variable. But where the interrelationship among the collinear variables is not stable, the model might produce spurious forecasts, since forecasting from an econometric model assumes that the parameters upon which the forecast is based will hold into the future.
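
A sketch of the screening step, with one regressor deliberately constructed as a near copy of another (all data made up):

```python
import numpy as np

# Correlation matrix of the regressors; off-diagonal entries near +/-1
# warn of multicollinearity.
rng = np.random.default_rng(3)
n = 40
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)    # nearly a copy of x1
x3 = rng.normal(size=n)
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(np.round(corr, 2))                      # the x1-x2 entry is close to 1
```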

Stability of the equation
This is a test to ensure that the coefficients of the equation are stable and do not change by significant proportions when new observations are added to the data or when some data are dropped. If, as a result of new observations, the bs vary significantly, the equation is said not to be stable and cannot be relied on for inference, since the results might be sensitive to a small number of observations. A practical way of testing for stability is to drop the last one or two observations and rerun the regression to determine whether the bs change by a substantial margin. If the variation is not substantial, this is an indication that the regression equation is stable.
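
A sketch of that practical check on made-up data: refit with the last two observations dropped and compare the bs.

```python
import numpy as np

# Stability check: do the coefficients move much when observations are dropped?
rng = np.random.default_rng(4)
n = 40
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, n)
X = np.column_stack([np.ones(n), x])

b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
b_trim, *_ = np.linalg.lstsq(X[:-2], y[:-2], rcond=None)
print("full sample   :", np.round(b_full, 3))
print("last 2 dropped:", np.round(b_trim, 3))  # similar values suggest stability
```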

Dummy variables
In regression analysis a problem is sometimes encountered where the dependent variable is influenced by factors that are not readily quantifiable. For example, holding all other factors constant, female surveyors with similar qualifications have been found to earn less than their male counterparts. Such qualitative information can be included among the explanatory variables in a regression by way of dummy variables. Since dummy variables normally indicate the presence or absence of a “quality”, the attribute can be artificially constructed on a binary basis to take the value 1 when the quality occurs (in our example, when a surveyor is female) and 0 when it is absent (a male surveyor). The inclusion of these dummy variables as regressors helps to isolate the qualitative effect.
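
A sketch of the surveyor-pay example with made-up figures; the variable names and coefficients are hypothetical.

```python
import numpy as np

# Dummy variable: 1 for a female surveyor, 0 for a male surveyor; its
# coefficient isolates the qualitative effect on salary.
rng = np.random.default_rng(5)
n = 50
experience = rng.uniform(1, 20, n)            # years of experience (hypothetical)
female = rng.integers(0, 2, n)                # the binary dummy
salary = 20 + 1.2 * experience - 3.0 * female + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), experience, female])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)
print("constant, experience, dummy:", np.round(b, 2))  # dummy near -3
```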

Estimation undertaken in Microfit

[Estimated office rent equation and Microfit output omitted.]

The implications of the equation are as follows. Employing the t-test, all the variables are statistically significant at the 99% confidence level, ie we can be 99% confident that the coefficients of the independent variables are not zero. The R2 value of 0.98 means that 98% of the variation in the office rent index can be explained by the variables included in the equation. All the signs of the variables are consistent with a priori expectations: the lagged rent and demand variables have a positive relationship, and the supply variable a negative relationship, with the dependent variable (as expected from economic theory). The coefficients tell us how changes in the independent variables affect the dependent variable. For example, a 1% increase in the office rent index will lead to a 2.1% increase in the rent index in the following quarter if the economy is expanding and a 1% increase if it is contracting (see the coefficients of ORENT1 and ORENT2). The F-statistic suggests that the equation as a whole is significant at all the conventional confidence levels.

Conclusion

It is always very difficult to obtain an ideal regression equation that satisfies all the above tests in full. However, efforts should always be made to achieve, within data constraints, the best possible equation, one which provides reasonable satisfaction of most of the tests. In practice, depending on the objective of the analyst, certain tests are observed more stringently than others; there is always a trade-off between satisfying some of the tests in preference to others. For example, if the aim is to provide a forecast of the dependent variable, it might be sensible to concentrate on increasing the R2 value without paying much attention to the coefficients of the independent variables, since a high R2 is likely to lead to high forecasting accuracy. In such a case, the t-test is not adhered to stringently. Such a regression equation, accurate though it might be in forecasting the dependent variable, cannot be relied upon to explain its behaviour. There is also the danger that the forecast results will be fortuitous and sensitive to the particular data used. Care should always be taken not to neglect the theory and embark on manipulating the data to achieve the best statistical results – a practice termed “data mining” in econometric jargon.

Adarkwah Antwi is lecturer/research associate in the Centre for Property Research, School of Urban and Regional Studies, Sheffield Hallam University.

References and suggested reading
Antwi, A (1993), “An Econometric Approach to Rental Forecasting”, unpublished MA thesis, City University Business School, London.
Murphy, L T (1989), “Determining the Appropriate Equation in Multiple Regression Analysis”, The Appraisal Journal, October 1989, pp 498-517.
