L1.02: Section 1
Section 1: Choosing a numerical indicator of how well a model fits a dataset
In an earlier topic we computed a numerical indicator of the noise of a measurement process, the standard deviation around the average measurement. If we had a similar indicator of the scatter of the data around a model, we could use it to guide our search for the best set of model parameters. Even better, we might be able to get a computer program to help in the search since computers are good at working with numbers. Our original method of judging a model, visually comparing the data and model graphs, is easy for people but beyond the current capabilities of computers. While scatter indicators of this kind as often referred to as measuring “goodness of fit”, keep in mind that they are showing the amount of deviation, so smaller is better (these are actually badness-of-fit indicators). We will find the best model by adjusting the parameters to make the indicator as close to zero as possible. We want to minimize both positive and negative deviations, since here we care about how close the data is to the model, but not whether it is higher or lower. In this situation, the most productive way to treat positive and negative deviations the same is to square each deviation (i.e., multiply it by itself, a process that produces positive results for both positive and negative numbers). We will then minimize some indicator based on these squared deviations. In spreadsheets that arranged similarly to Models.xls (input data in column A, output data in column B, model prediction in column C, and data-model deviation in column D), it is natural to put these squared deviations in column E. This can be done by putting the formula “=D3^2” into cell E3, then spreading that formula down all the data rows. Example 1: Add a squared-deviations row to the US-population spreadsheet made in an earlier topic| A | B | C | D | E | F | G | H | I | |
| 1 | x | Y | y | Residual | Squared | Exponential model: y = a * (1+r)^x | |||
| 2 | Year-1780 | Population | Model | values | Deviations | y = 2.8 * (1.03^x) | |||
| 3 | 0 | 2.8 | 2.80 | 0.0000 | 0.000000 | 2.8 | a: Initial value at x=0 | ||
| 4 | 10 | 3.9 | 3.76 | 0.1370 | 0.018778 | 0.03 | r: Growth rate | ||
| 5 | 20 | 5.3 | 5.06 | 0.2429 | 0.058995 | ||||
| 6 | 30 | 7.2 | 6.80 | 0.4037 | 0.162945 | ||||
| 7 | 40 | 9.6 | 9.13 | 0.4663 | 0.217430 | ||||
| 8 | 50 | 12.9 | 12.27 | 0.6251 | 0.390704 | ||||
| 9 | 60 | 17.1 | 16.50 | 0.6035 | 0.364226 | ||||
| 10 | 70 | 23.2 | 22.17 | 1.0301 | 1.061103 | ||||
| 11 | 80 | 31.4 | 29.79 | 1.6055 | 2.577651 | ||||
| 12 | 90 | 39.8 | 40.04 | -0.2413 | 0.058230 | ||||
Column E now contains the formulas
“=D3^2” in E3
“=D4^2” in E4
“=D5^2” in E5
et cetera, for each data row
What number will be used to represent the squared-deviation values?
If we are going to make decisions about parameter settings by looking at a numerical goodness-of-fit indicator, that indicator must be a single number so we can decide which settings are better by choosing those that give the lowest indicator value. What indicator should be used to summarize this column of squared deviations? One obvious possibility, which turns out to be the most generally useful, is the sum of the squared deviations. This can be computed by applying the spreadsheet SUM function to column E, as by putting the formula “=SUM(E3:E12)” into some empty cell such as H8. An alternative indicator that is useful in a few situations is simply the largest squared deviation (which comes from the largest deviation, whether positive or negative). This can be computed with the spreadsheet MAX function, using the formula “=MAX(E3:E12)” in this case. Other indicators are sometimes constructed that take into account the economic cost in a particular situation of different deviation sizes and/or directions. But the sum of the squared deviations is by far the most commonly used indicator for fitting models to data, and we will use it in this course except where another indicator is specifically asked for.Example 2: Improve the earlier population model by using the sum-of-squared-deviations indicator.
The exponential-model parameter settings a = 2.8 and r = 0.03 that were found earlier fit the data pretty well, but of course other nearby settings of the parameters, such as would change the graph so little we could not tell by looking whether one of these models is slightly better than the other. But the numerical sum-of-squared-deviations calculation in H8 provides a way to choose among nearby model settings to get the very best fit. [a] Here is the worksheet result from the initial model, which was chosen by setting the first parameter of the model to 2.8 (the data value at x = 0), then adjusting the growth rate until the graphs were close.| A | B | C | D | E | F | G | H | I | |
| 1 | x | y | y | Residual | Squared | Exponential model: y = a * (1+r)^x | |||
| 2 | Year-1780 | Population | Model | values | deviations | y = 2.8 * (1.03^x) | |||
| 3 | 0 | 2.8 | 2.80 | 0.0000 | 0.000000 | 2.8 | a: Initial value at x=0 | ||
| 4 | 10 | 3.9 | 3.76 | 0.1370 | 0.018778 | 0.03 | r: Growth rate | ||
| 5 | 20 | 5.3 | 5.06 | 0.2429 | 0.058995 | ||||
| 6 | 30 | 7.2 | 6.80 | 0.4037 | 0.162945 | ||||
| 7 | 40 | 9.6 | 9.13 | 0.4663 | 0.217430 | Goodness of fit for these settings | |||
| 8 | 50 | 12.9 | 12.27 | 0.6251 | 0.390704 | Sum of sq. | 4.910062 | ||
| 9 | 60 | 17.1 | 16.50 | 0.6035 | 0.364226 | ||||
| 10 | 70 | 23.2 | 22.17 | 1.0301 | 1.061103 | ||||
| 11 | 80 | 31.4 | 29.79 | 1.6055 | 2.577651 | ||||
| 12 | 90 | 39.8 | 40.04 | -0.2413 | 0.058230 | ||||
| A | B | C | D | E | F | G | H | I | |
| 1 | x | y | y | Residual | Squared | Exponential model: y = a * (1+r)^x | |||
| 2 | Year-1780 | Population | Model | values | deviations | y = 2.9 * (1.03^x) | |||
| 3 | 0 | 2.8 | 2.90 | -0.10 | 0.0100 | 2.9 | a: Initial value at x=0 | ||
| 4 | 10 | 3.9 | 3.90 | 0.00 | 0.0000 | 0.03 | r: Growth rate | ||
| 5 | 20 | 5.3 | 5.24 | 0.06 | 0.0039 | ||||
| 6 | 30 | 7.2 | 7.04 | 0.16 | 0.0259 | ||||
| 7 | 40 | 9.6 | 9.46 | 0.14 | 0.0196 | Goodness of fit for these settings | |||
| 8 | 50 | 12.9 | 12.71 | 0.19 | 0.0348 | Sum of sq. | 3.237818 | ||
| 9 | 60 | 17.1 | 17.09 | 0.01 | 0.0002 | ||||
| 10 | 70 | 23.2 | 22.96 | 0.24 | 0.0568 | ||||
| 11 | 80 | 31.4 | 30.86 | 0.54 | 0.2931 | ||||
| 12 | 90 | 39.8 | 41.47 | -1.67 | 2.7934 | ||||
Licenses & Attributions
CC licensed content, Shared previously
- Mathematics for Modeling. Authored by: Mary Parker and Hunter Ellinger. License: CC BY: Attribution.