THE AUSTRALIAN NATIONAL UNIVERSITY

FENNER SCHOOL OF ENVIRONMENT AND SOCIETY

CANBERRA

ANUSPLIN VERSION 4.4

USER GUIDE

Michael F. Hutchinson and Tingbao Xu

The ANUSPLIN package contains FORTRAN programs for fitting surfaces to noisy data as functions of one or more independent variables. The package includes programs for interrogating the fitted surfaces in both point and grid form. Procedures for calculating standard error surfaces have also been developed.

The programs are normally distributed as binary executables for:

All Microsoft Windows operating systems on 64-bit or 32-bit hardware.

Linux on Intel or AMD hardware.

Last revision to this document: 23 August 2013

The publishing program of the Fenner School of Environment and Society at the Australian National University is designed to present the results of the School’s research and the proceedings of conferences and workshops. Views expressed in Fenner School publications are the views of the authors and are not necessarily those of the School or any associated institution.

Director: Prof Stephen Dovers

Executive Officer: Suzanne Mendes

© FSES 2013

This book is copyright. Apart from any fair dealing for the purposes of study, research, criticism or review as permitted under the Copyright Act, no part may be reproduced by any process without permission. Enquiries should be made to the publisher.

All FSES publications are available from:

Publications Section

Fenner School of Environment and Society

Australian National University

Canberra ACT 0200

Tel. +61 2 6125 2579

Fax +61 2 6125 0746


TABLE OF CONTENTS

INTRODUCTION

PROGRAM SUMMARY

SPLINE

    Program Inputs

    Program Outputs

    Knot Selection

    Interpretation of Output Statistics

    Calculation of Standard Errors

    Dependent Variable Transformations

    Fitting Climate Surfaces

    SPLINE User Directives

GCVGML User Directives

SELNOT User Directives

ADDNOT User Directives

LAPPNT User Directives

LAPGRD User Directives

ANNOTATED EXAMPLES

    Spline smoothing of uni-variate data

    Partial spline smoothing of monthly mean temperature data

    Tri-variate spline smoothing of monthly mean precipitation data using knots and the square root transformation

    Bi-variate and tri-variate spline smoothing of monthly mean solar radiation data using surface independent variables

REFERENCES

INTRODUCTION

The aim of the ANUSPLIN package is to provide a facility for transparent analysis and interpolation of noisy multi-variate data using thin plate smoothing splines. The package supports this process by providing comprehensive statistical analyses, data diagnostics and spatially distributed standard errors. It also supports flexible data input and surface interrogation procedures.

The original thin plate (formerly Laplacian) smoothing spline surface fitting technique was described by Wahba (1979), with modifications for larger data sets due to Bates and Wahba (1982), Elden (1984), Hutchinson (1984) and Hutchinson and de Hoog (1985). The package also supports the extension to partial thin plate splines based on Bates et al. (1987). This allows for the incorporation of parametric linear sub-models (or covariates), in addition to the independent spline variables. This is a robust way of allowing for additional dependencies, provided a parametric form for these dependencies can be determined. In the limiting case of no independent spline variables (not currently permitted), the procedure would become simple multi-variate linear regression.

Thin plate smoothing splines can be viewed as a generalisation of standard multi-variate linear regression, in which the parametric model is replaced by a suitably smooth non-parametric function. The degree of smoothness, or inversely the degree of complexity, of the fitted function is usually determined automatically from the data by minimising a measure of predictive error of the fitted surface given by the generalised cross validation (GCV). Theoretical justification of the GCV and demonstration of its performance on simulated data have been given by Craven and Wahba (1979). An alternative criterion is to minimise the generalised maximum likelihood (GML) developed by Wahba (1985,1990). This is based on a Bayesian formulation for the thin plate smoothing spline model and has been found to be superior to GCV in some cases (Kohn et al. 1991). Both criteria are offered in this version of ANUSPLIN.

A comprehensive introduction to the technique of thin plate smoothing splines, with various extensions, is given in Wahba (1990). A brief overview of the basic theory and applications to spatial interpolation of monthly mean climate is given in Hutchinson (1991a). These interpolated monthly mean climate surfaces have provided critical underpinning for bioclimatic analyses and natural resource modelling more generally (Booth et al. 2013, Xu and Hutchinson 2011, 2013). More comprehensive discussion of the algorithms and associated statistical analyses, and comparisons with kriging, are given in Hutchinson (1993) and Hutchinson and Gessler (1994). Applications to annual, monthly and daily climate data have been described by Hutchinson (1995, 1998ab), Price et al. (2000), Hutchinson et al. (2009) and McKenney et al. (2011). The book by Schimek (2000) provides a good overview of the subject of smoothing and non-parametric regression with extensive references.

It is often convenient, particularly when processing climate data, to process several surfaces simultaneously. If the independent variables and the relative weightings of the data are the same for each surface, and there are no missing data values, then many surfaces can be calculated for little more computation than one surface. ANUSPLIN allows for arbitrarily many such surfaces with significant savings in computation. ANUSPLIN also introduces the concept of "surface independent variables", to accommodate independent variables that change systematically from surface to surface. ANUSPLIN permits systematic interrogation of these surfaces, and their standard errors, in both point and grid form.

ANUSPLIN also permits transformations of both independent and dependent variables and permits processing of data sets with missing data values. When a transformation is applied to the dependent variable, ANUSPLIN permits back-transformation of the fitted surfaces, calculates the corresponding standard errors, and corrects for the small bias that these transformations induce. This has been found to be particularly convenient when fitting surfaces to precipitation data and other data that are naturally positive or non-negative.

A summary of the six programs that make up the ANUSPLIN package is tabulated in the following section, accompanied by a flow chart showing the main connections between the programs. This is followed by detailed documentation for each program in the package. The User Guide concludes with a comprehensive discussion of example smoothing spline analyses of uni-variate data and multi-variate climate data. The data supporting these analyses are supplied with the package. These analyses can be used as a tutorial on the basic concepts of data smoothing, with particular applications to the spatial interpolation of climate.

PROGRAM SUMMARY

Table 1. The six programs making up the ANUSPLIN package.

SPLINE
Fits an arbitrary number of (partial) thin plate smoothing spline functions of one or more independent variables. Suitable for data sets with up to about 10,000 points, although data sets can have arbitrarily many points. It uses knots either determined directly by SPLINE itself or read from the output of either SELNOT or ADDNOT. The knots are chosen from the data points to match the complexity of the fitted surface. There should normally be no more than about 2000 to 3000 knots, although arbitrarily large knot sets are permitted. The degree of data smoothing is normally determined by minimising the generalised cross validation (GCV) or the generalised maximum likelihood (GML) of the fitted surface.

SELNOT
Selects an initial set of knots for use by SPLINE. Now rarely used. It can be useful for specifying a single knot set for a very large data set that is to be processed by SPLINE in overlapping tiles. It can also be used to select a spatially representative subset of a data set for spatially unbiased withheld data assessment of surface accuracy.

ADDNOT
Updates a knot index file when additional knots are selected from the ranked residual list produced by SPLINE.

GCVGML
Calculates the GCV or GML for each surface, and the average GCV or GML over all surfaces, for a range of values of the smoothing parameter. It can be applied to optimisation parameters produced by SPLINE. The GCV or GML values are written to a file for inspection and plotting.

LAPPNT
Calculates values and Bayesian standard error estimates of partial thin plate smoothing spline surfaces at points supplied in a file.

LAPGRD
Calculates values and Bayesian standard error estimates of partial thin plate smoothing spline surfaces on a regular rectangular grid.

The flow chart in Figure 1 shows the main data flows through the programs described in Table 1. The overall analysis proceeds from point data to output point and grid files suitable for analysis and display by a geographic information system (GIS) and other analysis packages. The analyses by SPLINE produce output files that provide statistical analyses, support detection of data errors, an important phase of the analysis, and facilitate determination of additional knots by ADDNOT. The output surface coefficients and error covariance matrices enable systematic interrogation of the fitted surfaces by LAPPNT and LAPGRD. The GCV or GML files output by GCVGML can also assist detection of data errors and revision of the specifications of the spline model.

Figure 1. Main data flows through the ANUSPLIN package. The procedure for choosing and updating knot sets is described in a following section. Knot selection is also demonstrated in the provided annotated examples.

SPLINE

SPLINE is a FORTRAN 95 program that fits partial thin plate smoothing spline surfaces, constructed from a set of knots, to multi-variate noisy data. The knots are chosen from the data points, either by the SPLINE program itself or by SELNOT. It is important to note that knots are only used to limit the complexity of the fitted surface. No matter what size the knot set, all data points are used to calculate the fitted surface. Knot sets can also be augmented by the ADDNOT program, which can choose additional knots from the largest residuals of an earlier run of SPLINE. The computation time of SPLINE is proportional to the cube of the number of knots, so it is normally beneficial to limit the number of knots to the minimum needed to match the complexity of the fitted surface. Limited knot sets can also enable robust analyses of poor quality data. Further advice on knot selection is given in a later section and is demonstrated in the annotated examples.

User directives for the program are read from standard input and output statistics are written to standard output. Users are strongly advised either to use the Menu Interface provided with the ANUSPLIN package, or to use a command file for the user directives, so that program output can be saved in an output log file. The log file provides a record of the directives supplied to the program and provides essential statistical analyses of the fitted surfaces, standard error estimates and a sorted list of the largest residuals. The Menu Interface also permits log files to be saved.

To run the program from a command-line shell, type, for example:

spline < job.cmt > job.log

where job.cmt is an input command text file and job.log is the output log file.

Program Inputs

These include the numbers of independent spline variables and covariates, the lower and upper limits for each independent variable, optional transformations of each independent variable and of the dependent variable, the order of derivative to be minimised, the number of surfaces, and the method to be used to determine the amount of data smoothing for each surface. Input and output file specifications are also required. Data points at positions that lie outside the user supplied independent variable limits are ignored. These limits can be used to fit a surface to a subset of the data without having to create a separate data file. These limits may include margins to allow for the creation of overlapping surface patches, as can be required for very large data sets. The user-supplied limits also give a simple check on the specified data format and the order of the independent variables in the data file. An error in these specifications would be indicated if fewer than the expected number of data points were selected. See the annotated examples for further discussion of program inputs.

With the incorporation of standard FORTRAN 95, ANUSPLIN dynamically allocates memory for most data and working arrays. Accordingly, SPLINE can accommodate arbitrarily many surfaces fitted to arbitrary numbers of data points. However, it is advisable to limit the number of data points to no more than about 10,000 and the number of knots to no more than about 2000 to 3000, provided the number of knots is sufficient to adequately approximate the fitted spline function. The main storage requirement for SPLINE is proportional to the square of the number of knots and the processing time is proportional to the cube of the number of knots. The latter is required to perform a tri-diagonal decomposition of a matrix whose order is the number of knots. The required linear algebra routines are contained within the double precision LINPACK library (Dongarra et al., 1979), with amendments to incorporate standard vector arithmetic routines in FORTRAN 95.

The SPLINE program permits data sets containing coincident data points and data sets with missing data values.

Program Outputs

Summary statistics and a list of the largest residuals, ranked in descending order, are always written to standard output, which should always be saved in an output log file. The ranked list of largest residuals is particularly useful in detecting data errors, especially when fitting a surface to a new data set. A list of the data and fitted values with Bayesian standard error estimates may also be written to an output list file. This can also assist in the detection of data errors. Optimisation parameters, which are used to determine the optimum smoothing parameter, may also be written for input to GCVGML. The program can also provide cross validation residuals for the data points and summary statistics for points in a test data file. These permit detailed assessment of predictive errors and overall quality of the fitted surfaces. See the annotated examples for further discussion of program outputs.

Files containing the coefficients of the fitted surfaces and the Bayesian error covariance matrices of the fitted surface coefficients may also be written. These surface coefficients and error covariance matrices are used to calculate values and standard errors of the fitted surfaces by LAPGRD and LAPPNT.

Knot Selection

For data sets with no more than a few hundred data points it is normally recommended to select every data point as a knot. This can be done by simply setting the number of knots to be calculated by SPLINE to a number at least as large as the number of data points. However, for larger data sets, and for data sets with poor quality data, it is normally recommended to choose the knots as a distinct subset of the data. This can significantly reduce computation time for larger data sets, and provide a robust analysis in the presence of poor quality data.

It should be noted that the degree of data smoothing is normally optimised to minimise the predictive error of the fitted surfaces. This becomes independent of the number of knots once the number of knots is sufficient to capture the information in the data. Thus, knot sets cannot be increased in size indefinitely to “improve” the fit of the surfaces. This is illustrated in command 7 of the first set of annotated examples (sine20.cmt). In this example a spline curve calculated using just 20 knots is virtually indistinguishable from the spline curve calculated using all 101 data points as knots.

The following, somewhat heuristic, procedure for knot selection has been found to work well in practice:

1. Specify an initial number of knots in the SPLINE run itself, or, less commonly, use SELNOT to select an initial knot set for a very large data set. The knots are selected to equi-sample the independent spline variable space covered by the data points. When choosing knots as a subset of the data points, a typical initial set of knots may be around 1/4 to 1/3 of the size of the data set. However, the number of knots required really depends on the spatial complexity of the data being fitted, with more knots required for more complex surfaces. If the signal of the fitted surface is found to be within 10 to 20% of the number of knots (the maximum possible signal), then the process should be re-started with a larger initial knot set. The process should also be re-started with a larger initial knot set if the ranked large residual list indicates large outliers for apparently valid data points for more than around 5% of the number of knots. These actions comprise the left hand option illustrated in Figure 1.

2. Run SPLINE, with the output list of data and fitted values, and examine the largest residuals for data errors. Re-fit the surface if necessary after data errors have been corrected. If there is a moderate number of remaining large data residuals that appear to be associated with valid data, typically less than 5% of the number of knots, the ADDNOT program can be used to add to the knot index file the indices of the largest aberrant residuals that are not already knots. The surface can then be re-fitted using these additional knots. This is the right hand option illustrated in Figure 1. These indices may be read by ADDNOT from the large residual list output by SPLINE. The number of additional knots selected by ADDNOT should normally be no more than around 1-2% of the number of knots. Knot indices may also be added to the knot index file by supplying knot indices directly to ADDNOT, but this is not generally recommended. Residuals that are already associated with a knot are identified by a minus sign, both in the output ranked residual list in the SPLINE log file and in the output list file of data and fitted values. ADDNOT ignores residuals already associated with a knot when adding new knots.

3. Repeat the procedure of adding to the knot list the indices of the largest aberrant residuals and re-fitting the surface until the solution stabilises or the variance estimates output by the program are in approximate agreement with a priori estimates. This should normally be done only once or twice, since there is a risk of overfitting to erroneous data if it is done too many times, especially if there is short range correlation in the data. For large data sets, where it becomes critical to choose knots carefully, two successive additions of 1% of the number of knots can make an effective choice of the additional knots.
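As a worked illustration of these guidelines, consider a hypothetical data set of 6000 points. An initial knot set of around 1500 to 2000 knots (1/4 to 1/3 of the data points) would be a reasonable starting point. With 2000 knots, a fitted signal exceeding about 1600 to 1800 (within 10 to 20% of the maximum possible signal of 2000) would indicate that the analysis should be re-started with a larger initial knot set, while each application of ADDNOT should add no more than about 20 to 40 knots (1 to 2% of the number of knots).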

Interpretation of Output Statistics

The output statistics are best interpreted in relation to the partial spline model for N observed data values z_i given by

z_i = f(x_i) + b^T y_i + e_i    (i = 1, ..., N)    (1)

where each x_i is a d-dimensional vector of spline independent variables, f is an unknown smooth function of the x_i, each y_i is a p-dimensional vector of independent covariates, b is an unknown p-dimensional vector of coefficients of the y_i, and each e_i is an independent, zero mean error term with variance w_i σ², where w_i is termed the relative error variance (known) and σ² is the error variance, which is constant across all data points but normally unknown (Hutchinson, 1991a). The model reduces, on the one hand, to an ordinary thin plate spline model when there are no covariates (p = 0) and, on the other hand, to a simple multivariate linear regression model when f(x_i) is absent. The latter possibility is not currently permitted by ANUSPLIN.

The function f and the coefficient vector b are determined by minimising

Σ_{i=1}^{N} (z_i − f(x_i) − b^T y_i)² / w_i + ρ J_m(f)    (2)

where J_m(f) is a measure of the complexity of f, the "roughness penalty" defined in terms of an integral of m-th order partial derivatives of f, and ρ is a positive number called the smoothing parameter. As ρ approaches zero, the fitted function approaches an exact interpolant. As ρ approaches infinity, the function f approaches a least squares polynomial, with order depending on the order m of the roughness penalty. The value of the smoothing parameter is normally determined by minimising a measure of predictive error of the fitted surface given by the generalised cross validation (GCV).

The vector ẑ of fitted function values can be written

ẑ = A z    (3)

where A is an N × N matrix called the influence matrix. By analogy with linear regression (Wahba 1990), the number of degrees of freedom of the fitted spline, or the effective number of parameters, is given by

SIGNAL = trace(A).    (4)

The number of degrees of freedom of the weighted residual sum of squares, the first term of equation (2), is given by

ERROR = trace(I − A) = N − trace(A).    (5)

The weighted mean residual sum of squares is given by

MSR = ||W^{−1/2}(I − A)z||² / N    (6)

where W is the diagonal matrix given by

W = diag(w_1, ..., w_N).    (7)

The SIGNAL degrees of freedom and the ERROR degrees of freedom for each surface add up to N (the number of data points).

The GCV is calculated for each value of the smoothing parameter ρ by implicitly removing each data point and calculating the residual from the omitted data point of a surface fitted to all other data points using the same value of ρ. The GCV is then a suitably weighted sum of the squares of these residuals (Craven and Wahba 1979, Wahba 1990). The GCV is actually calculated by the formula

GCV = (||W^{−1/2}(I − A)z||² / N) / (trace(I − A) / N)².    (8)

The surface fitting procedure is normally considered to have failed to find a genuine optimum value of the smoothing parameter if either the smoothing parameter is very small and the signal is the maximum possible (equal to the number of knots), or the smoothing parameter is very large and the signal is the minimum possible (a number which depends on the number of independent variables and the order of the roughness penalty). Both of these conditions are flagged by an asterisk in the output log file. Hutchinson (1993) and Hutchinson and Gessler (1994) recommend that the signal should not exceed around half the number of data points. Signals larger than this can indicate insufficient data or positive correlation in data errors.

The variance σ² of the data error e_i in equation (1) is estimated by

VAR = ||W^{−1/2}(I − A)z||² / trace(I − A).    (9)

If σ² is known, or estimated, an unbiased estimate of the "true" mean square error of the fitted function across the data points is given by

MSE = ||W^{−1/2}(I − A)z||² / N − 2σ² trace(I − A) / N + σ².    (10)

Craven and Wahba (1979) have shown that under suitable conditions the formula

GCV = σ² + MSE    (11)

holds approximately. Thus minimising GCV, which does not depend on knowing σ², is approximately equivalent to minimising MSE, the true mean square error.
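The following minimal Python sketch, a direct transcription of equations (4) to (10) rather than of the package's knot-based FORTRAN algorithm, illustrates how the output statistics are related. The function name and arguments are invented for illustration, with A the influence matrix, w the relative error variances and z the data values.

import numpy as np

def spline_statistics(A, w, z, sigma=None):
    # Direct transcription of equations (4)-(10); illustrative only.
    N = len(z)
    resid = z - A @ z                        # (I - A)z, the residual vector
    wrss = np.sum(resid**2 / w)              # ||W^(-1/2)(I - A)z||^2
    signal = np.trace(A)                     # eq (4): effective number of parameters
    error = N - signal                       # eq (5): residual degrees of freedom
    msr = wrss / N                           # eq (6): weighted mean square residual
    gcv = msr / (error / N)**2               # eq (8): generalised cross validation
    var = wrss / error                       # eq (9): estimate of the error variance
    stats = {"SIGNAL": signal, "ERROR": error, "MSR": msr, "GCV": gcv, "VAR": var,
             "RTMSR": msr**0.5, "RTGCV": gcv**0.5, "RTVAR": var**0.5}
    if sigma is not None:                    # eq (10): unbiased estimate of the
        stats["MSE"] = msr - 2*sigma**2*error/N + sigma**2   # true mean square error
    return stats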

The generalised cross validation (GCV), mean square residual (MSR) and the data error variance estimate (VAR) are written to the output log file together with their square roots (RTGCV, RTMSR, RTVAR), which are in the units of the data values. VAR is the estimate of σ² given by equation (9). The mean square residual given by equation (6) is weighted according to the relative variance estimates w_i as provided in the data file. For the GCV calculation these relative variances are rescaled to have average value 1 in order to facilitate comparisons of GCV values across different models. If the relative variance estimates are actual estimates of the absolute value of the error variance (so that σ² = 1), then VAR and RTVAR should be approximately 1.

The goodness of fit of the fitted model may be checked by comparing the scaled residual sum of squares (N·MSR/σ², where N is the number of data points) with the critical points of a chi-square variable with df degrees of freedom, where df is the error degrees of freedom, given by equation (5), as output by the program, and σ² is an a priori estimate of the error variance.

This variance corresponds to the "nugget" in standard kriging analyses. It is rarely known a priori, since it includes two distinct components. The first of these is error inherent in the data, such as measurement error. This may be known or reasonably estimated beforehand. However, the second component is the error in the underlying spline function. This error is essentially unknown, and decreases as the number of data points increases. In different situations one of these components can be dominant, or they can be equally important, as is often the case when interpolating climate statistics (Hutchinson 1995).
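As a sketch of this goodness of fit check, assuming an a priori error variance estimate is available (the function name is invented; scipy supplies the chi-square critical points):

from scipy.stats import chi2

def goodness_of_fit(N, msr, error_df, sigma2, alpha=0.05):
    # Compare the scaled residual sum of squares N.MSR/sigma^2 with the
    # two-sided critical points of a chi-square with ERROR degrees of freedom.
    stat = N * msr / sigma2
    lo, hi = chi2.ppf(alpha / 2, error_df), chi2.ppf(1 - alpha / 2, error_df)
    return lo <= stat <= hi, stat, lo, hi

A value of the scaled statistic outside the interval [lo, hi] suggests that the fitted model is inconsistent with the assumed error variance.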

When an estimate of σ² is available, an alternative strategy is to provide the corresponding standard deviation estimate σ to the program. The program then minimises an unbiased estimate of the true mean square error, MSE, given by equation (10), instead of the GCV. This is not normally recommended since it depends on having a reasonably accurate estimate of σ². It is generally preferable to minimise GCV, since this appears to be more robust and does not depend on knowing σ². An a priori estimate of σ² can be better used to check the goodness of fit of the model as described above. On the other hand, specifying the error standard deviation may be preferable when there is no local minimum of the GCV, as can happen when fitting surfaces to very small data sets (less than about 20-30 data points).

SPLINE provides in the output log file the coefficients of any covariates, as well as the estimate of the mean square error of the smoothed data values (MSE). This estimate depends on the value of the error variance (VAR) as estimated by equation (9), or on the input error standard deviation estimate when this has been provided by the user.

Calculation of Standard Errors

Using a Bayesian argument, Wahba (1983) and Silverman (1985) have adopted appropriate multi-variate Gaussian prior distributions for the vector z of data values, so that the error covariance matrix of the vector ẑ of fitted values is given by

A W σ²    (12)

where A is the influence matrix described in equation (3) and W σ² is the assumed error covariance matrix of the data errors. Here W is described by equation (7) and σ² is estimated by equation (9).

Spatially distributed standard errors for surfaces fitted by SPLINE are calculated using the method described by Hutchinson (1993). SPLINE calculates the error covariance matrix of the coefficients of the fitted spline surface by expressing the surface coefficients as a linear transformation of the vector ẑ of fitted values. This includes the error covariance matrix of the coefficients of any covariates, from which standard error estimates of the coefficients of the covariates may be directly calculated. The error covariance matrices of the fitted surfaces are written by SPLINE to a separate binary file, as shown in Figure 1.

The value z_x of a spline surface at an arbitrary position x can be written

z_x = a_x^T c    (13)

where a_x is a vector depending on x and c is the vector of fitted surface coefficients. The standard error estimate of the surface value z_x is then calculated by LAPPNT and LAPGRD using the formula

(a_x^T V a_x)^{1/2}    (14)

where V is the error covariance matrix of the surface coefficients calculated by SPLINE. This standard error is called the model standard error, since it relates to the error in estimating the model given by equation (1). The prediction standard error is calculated by LAPPNT and LAPGRD using the formula

(a_x^T V a_x + σ²)^{1/2}    (15)

where σ² is the variance of the data error. This estimate is only applicable if the values being predicted have a uniform variance of σ² about the fitted spline function. This normally occurs when W is the identity matrix. Non-uniform error variances, such as those for the model discussed by Hutchinson (1995), must be accommodated using a separate calculation. Alternatively, non-uniform error variances may be directly accommodated in LAPPNT and LAPGRD using one of the transformations of the dependent variable described in the following section.

Confidence intervals of the calculated spline values are estimated by multiplying either the model standard error or the prediction standard error by 1.96, corresponding to the 95 percent two-sided confidence interval of the standard normal distribution.

The mean of an arbitrary number of fitted surface values is a linear function of the fitted surface coefficients. It can be expressed in the form

ā^T c    (16)

where ā is the mean of the vectors a_x in equation (13). The standard error of the mean is therefore given by

(ā^T V ā)^{1/2}.    (17)

This formula is used by LAPPNT and LAPGRD to calculate the standard error of the mean of the surface values when there is no dependent variable transformation. It is not the mean of the standard errors of the individual surface values.
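A minimal numerical sketch of equations (14), (15) and (17), assuming the vectors a_x and the error covariance matrix V of the surface coefficients are available. In the package these calculations are performed internally by LAPPNT and LAPGRD; the Python function names here are invented for illustration.

import numpy as np

def model_standard_error(a_x, V):
    # Eq (14): model standard error of a single surface value.
    return float(np.sqrt(a_x @ V @ a_x))

def prediction_standard_error(a_x, V, sigma2):
    # Eq (15): prediction standard error, applicable for a uniform
    # data error variance sigma2 (W equal to the identity matrix).
    return float(np.sqrt(a_x @ V @ a_x + sigma2))

def mean_standard_error(a_vectors, V):
    # Eq (17): standard error of the mean of several surface values,
    # computed from the mean of the a_x vectors, not from the mean
    # of the individual standard errors.
    a_bar = np.mean(a_vectors, axis=0)
    return float(np.sqrt(a_bar @ V @ a_bar))

Multiplying either standard error by 1.96 gives the 95 percent confidence interval described above.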

Dependent Variable Transformations

Three dependent variable transformations, the square root, the natural logarithm and an occurrence transformation, are currently permitted by ANUSPLIN. Any of these transformations may be applied by SPLINE to the data before a spline surface is fitted. The square root and the natural logarithm transformations can reduce positive skew in measured values, as can arise with data that are naturally non-negative or positive. The occurrence transformation is defined by setting all positive data values to 1.0 and ignoring all negative data values.

These transformations are automatically coded into the fitted surface coefficients file so that LAPPNT and LAPGRD can calculate either transformed surface values or back-transformed values. For the square root and natural logarithm transformations, these are obtained by applying the inverse dependent variable transformation (square or exponential) to the calculated surface values. When either of these inverse transformations is applied a correction for bias is made. Hutchinson (1998a) has found that applying the square root transformation to daily rainfall data, before fitting a thin plate smoothing spline, could reduce interpolation error by about 10 percent.

For the occurrence transformation, the back-transformation consists of setting output spline values to 0.0 or 1.0 depending on whether the fitted spline values are less than or greater than the threshold value of 0.5, respectively. Standard errors are not available for the back-transformed occurrence values.

If the surface values are chosen to be back-transformed using the inverse of the square root or natural logarithm transformations, then standard errors are calculated by LAPPNT and LAPGRD accordingly. Formulae appropriate for the square root transformation have been demonstrated by Hutchinson (1998a). If the interpolated square root value is given by X, with standard error s, then an estimate of the standard error of X² is given by

SE(X²) = 2s(X² + s²/2)^{1/2}.    (18)

This can be applied with s as either the model standard error or the prediction standard error, as defined in the preceding section. The second term in this expression is negligible except when X is close to zero or s² is relatively large. Relative errors are thus given approximately by the formula

RE(X²) = 2s/X,    (19)

that is, twice the relative error in the square root surface values.

For the square root transformation, absolute standard error estimates are calculated. It can be seen from these two formulae that smaller surface values will be estimated with smaller absolute standard error, while larger surface values will be estimated with smaller relative error. Approximate confidence intervals are calculated in this case by assuming that the errors of the interpolated square root values are distributed according to a normal distribution. It follows that the 95 percent confidence interval for the squared values is given by

[X² − CI, X² + CI]    (20)

where

CI = 2 × 1.96 s(X² + s²/2)^{1/2}.    (21)

Analogous standard error estimates are calculated by LAPPNT and LAPGRD when the natural logarithm has been applied to the data values and the exponential transformation is applied to the interpolated values. If the interpolated logarithmic value is given by X, with standard error s, then LAPPNT and LAPGRD calculate the standard error in the value exp(X) using the formula

SE(exp(X)) = exp(X + s²/2)(exp(s²) − 1)^{1/2}.    (22)

Relative confidence intervals, which must be applied multiplicatively, are calculated in this case by assuming that the errors of the interpolated logarithmic values are distributed according to a normal distribution. The two-sided 95 percent confidence interval is then given by

[exp(X)/CI, exp(X)·CI]    (23)

where

CI = exp(1.96 s).    (24)

LAPPNT and LAPGRD provide the absolute standard error estimate given by equation (22) and the relative confidence interval given by equation (24).
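The following sketch collects the back-transformed standard error and confidence interval formulae of equations (18) to (24), with X the interpolated square root or logarithmic value and s its standard error. The bias corrections that LAPPNT and LAPGRD apply to the back-transformed values themselves are not reproduced here, and equation (21) is applied in the equivalent form CI = 1.96 SE(X²).

import numpy as np

def sqrt_back_transform(X, s):
    # Square root transformation: equations (18)-(21).
    se = 2 * s * np.sqrt(X**2 + s**2 / 2)              # eq (18)
    re = 2 * s / X                                     # eq (19): relative error
    ci = 1.96 * se                                     # eq (21): CI half-width
    return X**2, se, re, (X**2 - ci, X**2 + ci)        # value, SE, RE, eq (20)

def log_back_transform(X, s):
    # Natural logarithm transformation: equations (22)-(24).
    se = np.exp(X + s**2 / 2) * np.sqrt(np.expm1(s**2))      # eq (22)
    ci = np.exp(1.96 * s)                                    # eq (24): multiplicative CI
    return np.exp(X), se, (np.exp(X) / ci, np.exp(X) * ci)   # value, SE, eq (23)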

Fitting Climate Surfaces

The ANUSPLIN package was primarily developed for this task. There are normally at least two independent spline variables, longitude and latitude, in this order and in units of decimal degrees. A third independent variable, elevation above sea-level, is normally appropriate when fitting surfaces to temperature or precipitation. This is normally included as a third independent spline variable, in which case it should be scaled to be in units of kilometres. Minor improvements can sometimes be had by slightly altering this scaling of elevation. This scaling was originally determined by Hutchinson and Bischof (1983) and has been verified by Hutchinson (1995, 1998b).

Over restricted areas, superior performance can sometimes be achieved by including elevation not as an independent spline variable but as an independent covariate. Thus, in the case of fitting a temperature surface, the coefficient of an elevation covariate would be an empirically determined temperature lapse rate (Hutchinson, 1991a). Other factors that influence the climate variable may be included as additional covariates if appropriate parameterizations can be determined and the relevant data are available. These might include, for example, topographic effects other than elevation above sea-level. Other applications to climate interpolation have been described by Hutchinson et al. (1984ab, 1996a, 2009), Hutchinson (1989a, 1991ab) and McKenney et al. (2011). Applications of fitted spline climate surfaces to global agroclimatic classifications and to the assessment of biodiversity are described by Hutchinson et al. (1992, 1996b, 2005). They have also been used to develop spatially detailed climate change scenarios (Houser et al. 2004).

To fit multi-variate climate surfaces, the values of the independent variables are needed only at the data points. Thus meteorological stations should be accurately located in position and elevation. Errors in these locations are often indicated by large values in the output ranked residual list. Recent applications have examined the utility of using elevation and variables related to slope and aspect obtained from digital elevation models at various horizontal resolutions (Hutchinson 1995, 1998b). Thin plate spline interpolation of monthly mean precipitation and temperature has been favourably compared with other methods by Price et al. (2000) and Hutchinson et al. (2009).

The LAPGRD program can be used to calculate regular grids of fitted climate values and their standard errors, for mapping and other purposes, provided a regular grid of values of each independent variable, additional to longitude and latitude, is supplied. This usually means that a regular grid digital elevation model (DEM) is required. A technique for calculating such DEMs from elevation and stream line data has been described by Hutchinson (1988, 1989b, 1996, 2001).

SPLINE User Directives

Each directive is listed below in the order read by the program, with its type in parentheses, followed by its description.

Title of fitted surfaces (60 characters). Title recorded in the surface coefficient file to document the surface.

Surface value units code and optional missing data value (0-8, real number). The units codes are:
0 - undefined
1 - metres
2 - feet
3 - kilometres
4 - miles
5 - degrees
6 - radians
7 - millimetres
8 - megajoules
Data values less than or equal to the missing data value are removed from the analysis. If a dependent variable transformation is specified then data values outside the natural domain of the transformation are automatically removed. Thus negative data values are automatically removed if the square root dependent variable transformation is specified.

Number of independent spline variables (non-negative integer). May not exceed the specified limit (currently 10).

Number of independent covariates (non-negative integer). The limit depends on the number of spline variables.

Number of surface independent spline variables (non-negative integer). Surface independent variables take different values for each surface.

Number of surface independent covariates (non-negative integer). Surface independent variables take different values for each surface.

Independent variable lower and upper limits, transformation code, units code, optional margins (two real numbers, two non-negative integers (0-8), up to two real numbers, for each independent variable). The lower limit precedes the upper limit. Data points outside these limits, augmented by margins, are ignored. One or both margins may be omitted. If one margin is supplied it is used as the common lower and upper margin. If both margins are omitted the transformation code and units code may also be omitted. The units code is as for the surface value units code.

Independent variable transformation parameters (one or two real numbers a, b). Required for each independent variable for which the transformation code is positive. The possible transformations of each independent variable x are:
0 - no transformation
1 - x/a
2 - ax
3 - a log(x + b)
4 - (x/b)^a
5 - a exp(x/b)
6 - a tanh(x/b)
7 - anisotropy angle in degrees
8 - anisotropy factor, in the direction specified by the anisotropy angle

Dependent variable transformation (0, 1, 2 or 5):
0 - no transformation
1 - fit surface to the natural logarithm of the data values
2 - fit surface to the square root of the data values
5 - occurrence: transform the data values by setting all positive values to 1.0 and ignoring all negative values

Order of spline (positive integer). Usually 2. The lower limit is specified by the program.

Number of surfaces (positive integer). Any positive number of surfaces is permitted.

Number of relative error variances (non-negative integer):
0 - data points uniformly weighted for each surface
1 - the same weighting is applied to each surface
number of surfaces - a different weighting is applied to each surface

Optimization directive (0-2):
0 - common smoothing parameter for all surfaces
1 - common smoothing directive for all surfaces (default)
2 - different smoothing directive for each surface

Smoothing directive for each surface (0-4):
0 - fixed smoothing parameter; supply value
1 - minimise GCV (default)
2 - minimise true mean square error using a supplied error standard deviation estimate
3 - fixed signal; supply value
4 - minimise GML

Data file name (255 characters). Must be supplied.

Maximum number of data points (positive integer). Used to allocate memory for data and working arrays. The number should be at least as large as the number of data points.

Number of characters in site label (0-20). If positive, an alphanumeric site label is expected for each data point in the data file. These labels are printed in the output data list and large residual files. Labels with embedded blanks are permitted provided the data are read with a format statement.

Data file format (255 characters). If non-blank, the provided FORTRAN format statement is used to read, in order: the site label (if the number of characters in the site label is positive), the independent variables (spline variables before covariates), the surface independent variables (spline variables before covariates), the data values and the relative variances as specified above. A uniform weighting of 1 for each data point may be specified by having zero relative variances. If the format is blank, the data file is read in list directed free format in the same order as for formatted reads. A blank format is not permitted if the site labels have embedded blanks.

Number of knots calculated by SPLINE (non-negative integer). If positive then SPLINE selects the specified number of knots from the data. If this number exceeds the number of data points then all data points are selected as knots. The selected knots can optionally be written to the output knot file. If zero then the knots must be read from the supplied input knot index file.

Knot index file (255 characters, optional, input/output). Optional output file if the number of knots calculated by SPLINE is positive; blank if not required. Required input file if the number of knots calculated by SPLINE is set to zero.

Input bad data flag file (255 characters). Blank if not required. File used to remove particular data values from the analysis. Each record has a site label followed by a binary number (0 or 1) for each surface, with each 1 indicating a corresponding data value to be removed. This permits removal of suspicious data values without altering the data file.

Output bad data flag file (255 characters). Blank if not required. File containing all bad data flags from the input bad data flag file, augmented by a flag for each data value that differs from the corresponding fitted surface value by more than 3.6 standard deviations. This file can be used as an input bad data flag file for a subsequent run of SPLINE after inspection and possible changes by the user.

Output large residual file name (255 characters). Blank if not required. Used to check for data errors. May be read directly by ADDNOT to add knots to an existing knot file.

Output large cross validation residual file name (255 characters). Blank if not required. Used to check for data errors. Can help to identify spatially isolated points with bad data values.

Output optimisation parameters file (255 characters). Blank if not required. File containing the parameters used to calculate the optimum smoothing parameter(s). This file can be used with GCVGML to calculate GCV or GML values as a function of the smoothing parameter.

Output surface coefficients file (255 characters). Normally required, but may be blank if surface coefficients are not required. Contains the coefficients defining the fitted surfaces. These are used to calculate values of the surfaces by LAPPNT and LAPGRD.

Output error covariance file name (255 characters). Blank if not required. Error covariance matrices of the fitted surface coefficients. Used by LAPPNT and LAPGRD to calculate spatially distributed standard error estimates of the fitted surfaces.

Output data list file name (255 characters). Blank if not required. List of data and fitted values with Bayesian standard error estimates. Useful for checking for data errors.
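The following is a hypothetical SPLINE command file, following the directive order tabulated above, for fitting twelve monthly mean temperature surfaces with longitude, latitude and elevation as independent spline variables. All values and file names are invented for illustration, and the bracketed annotations are not part of the file (ANUSPLIN command files contain the bare values only, with blank lines for files that are not required); the annotated examples supplied with the package give authoritative layouts.

Monthly mean temperature          [title of fitted surfaces]
5                                 [surface value units code: degrees]
3                                 [number of independent spline variables]
0                                 [number of independent covariates]
0                                 [number of surface independent spline variables]
0                                 [number of surface independent covariates]
141.0 150.0 0 5                   [longitude limits, no transformation, degrees]
-39.0 -33.0 0 5                   [latitude limits, no transformation, degrees]
0.0 2500.0 1 1                    [elevation limits in metres, transformation code 1]
1000.0                            [transformation parameter a: elevation/1000, giving kilometres]
0                                 [no dependent variable transformation]
2                                 [order of spline]
12                                [number of surfaces, one per month]
0                                 [data points uniformly weighted]
1                                 [common smoothing directive for all surfaces]
1                                 [minimise GCV]
temp.dat                          [data file name]
500                               [maximum number of data points]
5                                 [number of characters in site label]
(a5,2f8.2,f8.1,12f7.1)            [FORTRAN data file format]
500                               [number of knots: at least the number of data points, so all points become knots]
                                  [knot index file: blank, not required]
                                  [input bad data flag file: blank]
                                  [output bad data flag file: blank]
temp.res                          [output large residual file]
                                  [output large cross validation residual file: blank]
temp.opt                          [output optimisation parameters file]
temp.sur                          [output surface coefficients file]
temp.cov                          [output error covariance file]
temp.lis                          [output data list file]

The file would then be run as described earlier, for example: spline < temp.cmt > temp.log.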
