Scatter Plot Smoothing
The following data from the Connecticut Tumor Registry presents age-adjusted numbers of melanoma incidences per 100,000 people for the 37 years from 1936 to 1972 (Houghton, Flannery, and Viola; 1980).
data Melanoma;
   input Year Incidences @@;
   format Year d4.0;
   datalines;
1936 0.9 1937 0.8 1938 0.8 1939 1.3 1940 1.4
1941 1.2 1942 1.7 1943 1.8 1944 1.6 1945 1.5
1946 1.5 1947 2.0 1948 2.5 1949 2.7 1950 2.9
1951 2.5 1952 3.1 1953 2.4 1954 2.2 1955 2.9
1956 2.5 1957 2.6 1958 3.2 1959 3.8 1960 4.2
1961 3.9 1962 3.7 1963 3.3 1964 3.7 1965 3.9
1966 4.1 1967 3.8 1968 4.7 1969 4.4 1970 4.8
1971 4.8 1972 4.8
;
The following PROC SGPLOT statements produce the simple scatter plot of these data displayed in Figure 52.1.
proc sgplot data=Melanoma;
   scatter y=Incidences x=Year;
run;
Suppose that you want to smooth the response variable Incidences as a function of the variable Year. The following PROC LOESS statements request this analysis with the default settings:
ods graphics on;

proc loess data=Melanoma;
   model Incidences=Year;
run;
You use the PROC LOESS statement to invoke the procedure and specify the data set. The MODEL statement names the dependent and independent variables.
When ODS Graphics is enabled, PROC LOESS produces several default plots. Figure 52.2 shows the "Fit Plot," which overlays the loess fit on a scatter plot of the data. You can see that the loess fit captures both the increasing trend in the data and the periodic pattern, which is related to an 11-year sunspot activity cycle.
Fit Summary | |
---|---|
Fit Method | kd Tree |
Blending | Linear |
Number of Observations | 37 |
Number of Fitting Points | 37 |
kd Tree Bucket Size | 1 |
Degree of Local Polynomials | 1 |
Smoothing Parameter | 0.25676 |
Points in Local Neighborhood | 9 |
Residual Sum of Squares | 2.03105 |
Trace[L] | 8.62243 |
GCV | 0.00252 |
AICC | -1.17277 |
Figure 52.3 shows the "Fit Summary" table. This table details the settings used and provides statistics about the fit that is produced. You can see that the smoothing parameter value for this loess fit is 0.25676. This smoothing parameter determines the fraction of the data in each local neighborhood. In this example, there are 37 data points, and so the smoothing parameter value of 0.25676 yields local neighborhoods containing 9 observations.
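As a quick check of this arithmetic, the following DATA step multiplies the smoothing parameter by the number of observations and keeps the integer part. The use of the FLOOR function is an assumption inferred from the tabulated values, not a statement of the rule that PROC LOESS applies internally, and the variable names are illustrative.

```sas
data _null_;
   n      = 37;        /* number of observations (Fit Summary)       */
   smooth = 0.25676;   /* selected smoothing parameter (Fit Summary) */

   /* Assumption: neighborhood size is the integer part of n*smooth */
   localPoints = floor(n * smooth);

   put n= smooth= localPoints=;
run;
```

The product is about 9.5, so the integer part is 9, which agrees with the "Points in Local Neighborhood" entry in the "Fit Summary" table.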
Optimal Smoothing Criterion | |
---|---|
AICC | Smoothing Parameter |
-1.17277 | 0.25676 |
The "Smoothing Criterion" table provides information about how this smoothing parameter value is selected. The default method implemented in PROC LOESS chooses the smoothing parameter that minimizes the AICC criterion (Hurvich, Simonoff, and Tsai; 1998), which strikes a balance between the residual sum of squares and the complexity of the fit.
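The AICC and GCV entries in the "Fit Summary" table can be checked from the residual sum of squares and Trace[L]. The sketch below assumes the forms AICC = log(RSS/n) + 1 + 2(Trace[L] + 1)/(n - Trace[L] - 2) and GCV = RSS/(n - Trace[L])**2. These formulas are stated here as assumptions that happen to match the tabulated values; they are not quoted from this section.

```sas
data _null_;
   n      = 37;        /* Number of Observations  */
   rss    = 2.03105;   /* Residual Sum of Squares */
   traceL = 8.62243;   /* Trace[L]                */

   sigma2 = rss / n;   /* assumed variance estimate: RSS / n */

   /* Assumed AICC form: fit term plus a penalty that grows with Trace[L] */
   aicc = log(sigma2) + 1 + 2 * (traceL + 1) / (n - traceL - 2);

   /* Assumed GCV form: n * sigma2 / (n - Trace[L])**2 */
   gcv = n * sigma2 / (n - traceL)**2;

   put aicc= gcv=;
run;
```

With these inputs the computed values agree with the tabulated AICC and GCV up to rounding of the inputs. The penalty term shows why a smaller smoothing parameter, which increases Trace[L], is not always preferred even though it lowers the residual sum of squares.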
You use options in the MODEL statement to change the default settings and request optionally displayed tables. For example, the following statements request that the "Model Summary" and "Output Statistics" tables be included in the displayed output. By default, these tables are not displayed.
proc loess data=Melanoma;
   model Incidences=Year / details(ModelSummary OutputStatistics);
run;
Model Summary | ||||
---|---|---|---|---|
Smoothing Parameter | Local Points | Residual SS | GCV | AICC |
0.41892 | 15 | 3.42229 | 0.00339 | -0.96252 |
0.68919 | 25 | 4.05838 | 0.00359 | -0.93459 |
0.31081 | 11 | 2.51054 | 0.00279 | -1.12034 |
0.20270 | 7 | 1.58513 | 0.00239 | -1.12221 |
0.17568 | 6 | 1.56896 | 0.00241 | -1.09706 |
0.28378 | 10 | 2.50487 | 0.00282 | -1.10402 |
0.20270 | 7 | 1.58513 | 0.00239 | -1.12221 |
0.25676 | 9 | 2.03105 | 0.00252 | -1.17277 |
0.22973 | 8 | 2.02965 | 0.00256 | -1.15145 |
0.25676 | 9 | 2.03105 | 0.00252 | -1.17277 |
The "Model Summary" table shown in Figure 52.5 provides information about all the models that PROC LOESS evaluated in choosing the smoothing parameter value.
Figure 52.6 shows the "Criterion Plot" that provides a graphical display of the smoothing parameter selection process.
Output Statistics | ||||
---|---|---|---|---|
Obs | Year | Incidences | Predicted Incidences | Residual |
1 | 1936 | 0.90000 | 0.76235 | 0.13765 |
2 | 1937 | 0.80000 | 0.88992 | -0.08992 |
3 | 1938 | 0.80000 | 1.01764 | -0.21764 |
4 | 1939 | 1.30000 | 1.14303 | 0.15697 |
5 | 1940 | 1.40000 | 1.28654 | 0.11346 |
6 | 1941 | 1.20000 | 1.44528 | -0.24528 |
7 | 1942 | 1.70000 | 1.53482 | 0.16518 |
8 | 1943 | 1.80000 | 1.57895 | 0.22105 |
9 | 1944 | 1.60000 | 1.62058 | -0.02058 |
10 | 1945 | 1.50000 | 1.68627 | -0.18627 |
11 | 1946 | 1.50000 | 1.82449 | -0.32449 |
12 | 1947 | 2.00000 | 2.04976 | -0.04976 |
13 | 1948 | 2.50000 | 2.30981 | 0.19019 |
14 | 1949 | 2.70000 | 2.53653 | 0.16347 |
15 | 1950 | 2.90000 | 2.68921 | 0.21079 |
16 | 1951 | 2.50000 | 2.70779 | -0.20779 |
17 | 1952 | 3.10000 | 2.64837 | 0.45163 |
18 | 1953 | 2.40000 | 2.61468 | -0.21468 |
19 | 1954 | 2.20000 | 2.58792 | -0.38792 |
20 | 1955 | 2.90000 | 2.57877 | 0.32123 |
21 | 1956 | 2.50000 | 2.71078 | -0.21078 |
22 | 1957 | 2.60000 | 2.96981 | -0.36981 |
23 | 1958 | 3.20000 | 3.26005 | -0.06005 |
24 | 1959 | 3.80000 | 3.54143 | 0.25857 |
25 | 1960 | 4.20000 | 3.73482 | 0.46518 |
26 | 1961 | 3.90000 | 3.78186 | 0.11814 |
27 | 1962 | 3.70000 | 3.74362 | -0.04362 |
28 | 1963 | 3.30000 | 3.70904 | -0.40904 |
29 | 1964 | 3.70000 | 3.72917 | -0.02917 |
30 | 1965 | 3.90000 | 3.82382 | 0.07618 |
31 | 1966 | 4.10000 | 4.00515 | 0.09485 |
32 | 1967 | 3.80000 | 4.18573 | -0.38573 |
33 | 1968 | 4.70000 | 4.35152 | 0.34848 |
34 | 1969 | 4.40000 | 4.50284 | -0.10284 |
35 | 1970 | 4.80000 | 4.64413 | 0.15587 |
36 | 1971 | 4.80000 | 4.78291 | 0.01709 |
37 | 1972 | 4.80000 | 4.91602 | -0.11602 |
Figure 52.7 shows the "Output Statistics" table, which contains the predicted loess fit value at each observation in the input data set.
Although the default method for selecting the smoothing parameter value is often satisfactory, it is good practice to examine how the loess fit varies with the smoothing parameter. In some cases, fits with different smoothing parameters might reveal important features of the data that cannot be discerned by looking at a fit with just a single "best" smoothing parameter. Example 52.4 illustrates such a case. You can produce the loess fits for a range of smoothing parameters by using the SMOOTH= option in the MODEL statement as follows:
proc loess data=Melanoma;
   model Incidences=Year / smooth=0.1 0.25 0.4 0.6 residual;
   ods output OutputStatistics=Results;
run;
The RESIDUAL option causes the residuals to be added to the "Output Statistics" table. Note that, even if you do not specify the DETAILS option in the MODEL statement to request the display of the "Output Statistics" table, you can use an ODS OUTPUT statement to output this and other optionally displayed tables as data sets.
PROC PRINT displays the first five observations of the Results data set:
proc print data=Results(obs=5);
   id obs;
run;
Obs | SmoothingParameter | Year | DepVar | Pred | Residual |
---|---|---|---|---|---|
1 | 0.1 | 1936 | 0.9 | 0.90000 | 0 |
2 | 0.1 | 1937 | 0.8 | 0.80000 | 0 |
3 | 0.1 | 1938 | 0.8 | 0.80000 | 0 |
4 | 0.1 | 1939 | 1.3 | 1.30000 | 0 |
5 | 0.1 | 1940 | 1.4 | 1.40000 | 0 |
Note that the fits for all the smoothing parameters are placed in a single data set. This data set includes a variable named SmoothingParameter that you use to distinguish the fits.
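Because all fits share one data set, the SmoothingParameter variable can serve as a grouping or classification variable in later steps. The following sketch assumes that the Results data set has been created as above with the variables shown in the PROC PRINT output. It overlays the fitted curves on the data and then summarizes the residuals for each fit; the choice of these two summaries is illustrative only.

```sas
/* Overlay the fitted curves, one line per smoothing parameter value */
proc sgplot data=Results;
   scatter x=Year y=DepVar;
   series  x=Year y=Pred / group=SmoothingParameter;
run;

/* Count and uncorrected sum of squares of the residuals for each fit */
proc means data=Results n uss;
   class SmoothingParameter;
   var Residual;
run;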
When you specify a list of smoothing parameters for a model and ODS Graphics is enabled, PROC LOESS produces a panel containing up to six plots that show the fit obtained for each value of the smoothing parameter that you specify. If you specify more than six smoothing values, then multiple panels are produced. For each regressor, PROC LOESS also produces panels of the residuals versus each regressor by the smoothing parameters that you specify.
If you examine the plots in Figure 52.9, you see that a visually reasonable fit is obtained with a smoothing parameter value of 0.25. With smoothing parameter value 0.1, there is gross overfitting in the sense that the original data are exactly interpolated. When the smoothing parameter value is 0.4, you obtain an overly smooth fit where the contribution of the sunspot cycle has been mostly averaged away. At smoothing parameter value 0.6, the fit shows just the increasing trend in the data.
It is also instructive to look at scatter plots of the residuals for each of the fits. These are also produced by default by PROC LOESS when ODS Graphics is enabled.
Figure 52.10 shows a scatter plot of the residuals by year for each smoothing parameter value. One way to discern patterns in these residuals is to superimpose a loess fit on each plot in the panel. You request loess fits on the residual plots in this panel by specifying the SMOOTH= suboption of the PLOTS=RESIDUALSBYSMOOTH option in the PROC LOESS statement. Note that the loess fits that are displayed on each of the residual plots are obtained independently of the loess fit that produces these residuals. The following statements show how you do this for the Melanoma data.
proc loess data=Melanoma plots=ResidualsBySmooth(smooth);
   model Incidences=Year / smooth=0.1 0.25 0.4 0.6;
run;
The loess fits shown on the plots in Figure 52.11 help confirm the conclusions obtained when you look at Figure 52.9. Note that the residuals for smoothing parameter value 0.25 do not exhibit any pattern, confirming that at this value the loess fit of the melanoma data has successfully modeled the variation in these data. By contrast, the residuals for the fit with smoothing parameter 0.6 retain the variation caused by the sunspot cycle.
The examination of the fits and residuals obtained with a range of smoothing parameter values confirms that the value of 0.25676 that PROC LOESS selects automatically is appropriate for these data. The next step in this analysis is to examine fit diagnostics and produce confidence limits for the fit. If ODS Graphics is enabled, then a panel of fit diagnostics is produced. Furthermore, you can request prediction confidence limits by adding the CLM option in the MODEL statement. By default, 95% limits are produced, but you can use the ALPHA= option in the MODEL statement to change the significance level. The following statements request 90% confidence limits.
proc loess data=Melanoma;
   model Incidences=Year / clm alpha=0.1;
run;

ods graphics off;
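If the limits are needed as data rather than only as a plot, one option is to capture the "Output Statistics" table with an ODS OUTPUT statement, in the same way as the residuals were captured earlier. In this sketch the data set name Limits is hypothetical, and the names of the confidence limit columns are not assumed; the PROC PRINT step simply lists whatever columns the table contains.

```sas
proc loess data=Melanoma;
   model Incidences=Year / clm alpha=0.1;
   /* Limits is an illustrative data set name */
   ods output OutputStatistics=Limits;
run;

/* Inspect the first few rows to see which columns hold the limits */
proc print data=Limits(obs=5);
run;
```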
Figure 52.12 shows the fit diagnostics panel. The histogram of the residuals with overlaid normal density estimator and the normal quantile plot show that the residuals exhibit a small departure from normality. The "Residual-Fit" spread plot shows that the spread in the centered fit is much wider than the spread in the residuals. This indicates that the fit has accounted for most of the variation in the incidences of melanoma in these data. This conclusion is supported by the absence of any clear pattern in the scatter plot of residuals by predicted values and the closeness of the points to the 45-degree reference line in the plot of observed by predicted values.
Finally, Figure 52.13 shows the selected loess fit with 90% confidence limits.