## Water SA

*On-line version* ISSN 1816-7950

*Print version* ISSN 0378-4738

### Water SA vol.35 n.3 Pretoria Apr. 2009

**On the determination of trends in rainfall**

**Philip Lloyd ^{*}**

Energy Research Centre, University of Cape Town, Private Bag, Rondebosch 7701, South Africa

**ABSTRACT**

The question of how to assess trends in rainfall data is very relevant to that of climate change. A short review of prior work revealed that there was little consensus on the methodology to be adopted. Many methods had been tried and abandoned. Some methods had found comparatively wide acceptance although they employed a statistical software package that is not readily available, and which does not have tools common to other more widely available packages. In the light of this review, it was decided to start from the inherent distribution of rainfall and develop a method for determining temporal trends based on the underlying distribution. Data sets from 4 different locations and covering sample periods from every 5 min to every week were assessed. In each case it was found that the data could be represented extremely well by a log-normal distribution, which meant that normal statistics could be applied to the transformed data. When it was so applied, clear trends emerged, the significance of which could be readily judged via an F-test or t-test. Some worked examples are provided. Attention is drawn to the possibility of estimating the likelihood of extreme events by this method. It is also noted that the usual method of reporting rainfall as an arithmetic average overstates the precipitation, and that on statistical grounds use of a geometric mean is to be preferred.

**Keywords:** rainfall, trend analysis, statistical analysis, extreme events

**Introduction**

The problem of estimating trends emerged during an assessment of precipitation chemistry using data from the US National Atmospheric Deposition Program [NADP]. NADP has established some 200 monitoring stations across the United States (NADP, 2007a). Some are no longer operational; a new station nearby usually replaces one no longer active. Composite samples are collected once a week, comprising all the precipitation collected over the week. The samples are sealed and sent to central laboratories for analysis. The raw data are available from the NADP website (NADP, 2007b), and are usually complete up to about 6 months before present.

The problem was to find a reliable means of determining the trend in rainfall with time. There are many methods published in the literature, some of which are reviewed below. But there is no general consensus of a methodology for what would, at first sight, appear to be a simple task.

The NADP data have been extensively assessed by NADP staff. Long-term trends in the data have been evaluated (Lehmann et al., 2005; 2007) using the Seasonal Kendall Trend (SKT) test. The SKT test is a non-parametric test for trend which is insensitive to the existence of seasonality (Hirsch et al., 1982). The difficulty of this approach is apparent from the description of the methodology, as applied to the flux of chemical species (i.e. rainfall × concentration) rather than rainfall *per se*:

"Statistical seasons were based on meteorological seasons (December- February = winter, etc.) stratified into high and low precipitation sample volume classes. The high sample volume class included samples with volumes at or above the seasonal median sample volume. The low sample volume class included samples with volumes below the seasonal median sample volume. This subdivision by seasonal median sample volume resulted in eight statistical seasons in each year (e.g. high volume spring, low volume spring, etc.). To compare the statistical ability of this proposed method, trend analyses were also performed using one statistical season per year (annual averages) and four statistical seasons per year (seasonal averages).

The SKT analysis was performed using the EnvironmentalStats Version 2.0 package of S-PLUS 6.1 (Millard and Neerchal, 2000). The null hypotheses were that the trend is zero (Kendall's tau statistic = 0), and that the seasonal taus were homogeneous (Kendall's tau values were equivalent) over all statistical seasons. An example of a homogeneous trend is one in which the trends are of equivalent magnitude and direction in all statistical seasons.

The magnitude of the trend slope was determined by taking the natural log of the concentration data (as meq/ℓ) and determining the Sen's median estimator (Gilbert, 1987; Helsel and Hirsch, 1992; Millard and Neerchal, 2000). The Sen's median estimator of log-transformed concentrations provides a non-parametric estimate of the percent change of concentration over the period. Equation (1) gives the trend magnitude in per cent change over the analysis period:

ΔC = 100 (e^{St} - 1)        (1)

where:

ΔC is the per cent change in concentration over the period

S is Sen's median estimator of the trend slope

t is the length of the trend period (years)

Weekly samples whose analytical concentrations fell below reporting limit were handled as follows as per Helsel and Hirsch (1992). For the trend analysis, statistical seasonal averages that fell below reporting limits were set to zero. For the Sen's median estimator calculations, statistical seasonal averages that fell below reporting limits were removed from the data set."

First, it has to be asked whether there is merit in reducing weekly data to seasonal data. Much information is discarded in the process. Moreover, in relatively arid areas, seasons are anything but homogeneous. Even in wet areas, the start and finish of a wet season can vary by several months. This approach therefore discarded much data merely because it was non-homogeneous.

Then there is the use of a module (EnvironmentalStats) of the 'Enterprise developer' S-PLUS. Such software is not widely available, and there is always the fear that packages of this sort will lure the user into 'plug-and-play' statistics, feeding numbers in and getting numbers back with no means of checking the correctness of the answers. For this work, the SKT test was initially implemented in the widely available Excel spreadsheet. However, McLeod et al. (1991) developed a Spearman partial rank correlation (SPRC) test and, on comparison, found it more powerful than the SKT test. It therefore has to be asked why the SKT test was employed at all.
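The core of the SKT test is the Mann-Kendall statistic, applied season by season and summed. A minimal sketch of the non-seasonal form is given below; it is illustrative only (not the EnvironmentalStats implementation), and the variance formula assumes no tied values:

```python
import math

def mann_kendall(x):
    """Mann-Kendall trend test: S statistic, its variance, and a z-score.

    S counts concordant minus discordant pairs; under the null hypothesis
    of no trend, S is approximately normal with mean zero.
    """
    n = len(x)
    s = sum((x[j] > x[i]) - (x[j] < x[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0  # no-ties variance
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, var_s, z
```

A |z| above 1.96 rejects the no-trend null at the 5% level; the seasonal variant simply computes S and its variance within each statistical season and sums them before forming z.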

Sen's median estimator (Sen, 1968) is a methodology for determining the equivalent of a regression coefficient in cases where the underlying distribution is not normal. In the present case, it is unclear why it was adopted, because no evidence is presented that the log-transformed data are not normally distributed; in fact, as is shown later, they are sufficiently close to normal that the regression coefficient and its associated confidence interval are most unlikely to be affected by non-normality.
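Sen's estimator itself is simple: the median of all pairwise slopes. A minimal sketch follows, together with the conversion of a slope on natural-log data into a per cent change over the period; the exponential form of that conversion is assumed here, not quoted from the methodology above:

```python
import math
from statistics import median

def sens_slope(t, y):
    """Sen's median estimator: the median of all pairwise slopes.

    Robust to non-normal residuals; y is assumed already
    natural-log-transformed, t in years.
    """
    slopes = [(y[j] - y[i]) / (t[j] - t[i])
              for i in range(len(t) - 1) for j in range(i + 1, len(t))
              if t[j] != t[i]]
    return median(slopes)

def percent_change(slope, years):
    """Per cent change over the trend period implied by a slope
    on natural-log-transformed data (assumed exponential form)."""
    return 100.0 * (math.exp(slope * years) - 1.0)
```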

A search was therefore initiated for alternative methods. Park et al. (2000) used both simple correlation methods and Brillinger's (1989) nonparametric method to study rainfall trends in Korea over 220 years. Because the early data had been obtained with an instrument that did not collect snow and had a minimum reading of 2 mm, they found it necessary to use 2-month periods during late-spring to early-autumn. Nevertheless, over such a long baseline, clear trends were evident for several such 2-month periods, significant at the 5% level. The problem with linear correlation is that any trend can extend to negative values, which is physically impossible. Brillinger's method gave the existence of a trend but did not quantify it.

Widmann and Schar (1997) performed a principal component analysis on up to 90 years of continuous records of daily precipitation from Switzerland. Empirical orthogonal functions were used to transform the data into a few key variables. Again, they employed seasonal averages, which worked well in the very seasonal conditions experienced in Switzerland. However, principal component analysis is something of a brute-force approach. The answers it provides are unique and independent of any hypothesis about the distribution of the data, but, being non-parametric, the method cannot incorporate prior knowledge. If the data are reasonably normally distributed, the answers it provides are robust; if they are not, even after normalisation to the mean (as is often employed), the answers can be misleading. It was therefore decided not to look further into this approach until more was known of the data distribution.

Cheng et al. (2004) employed a non-parametric test derived by Pettitt (1979) to determine whether a change had occurred and if so what the significance of the estimate was. They also employed a rank correlation method to check the non-parametric test. There were slight and unexplained differences between the 2 approaches, and in this light it was decided not to pursue this approach.

Reinsel and Tiao have suggested using linear regression models to estimate trends. They assessed the serial dependence in the noise via auto-regression (Reinsel et al., 1981; Tiao, 1983; Reinsel and Tiao, 1987; Reinsel, 1989). They allowed for variables other than time. For instance, they estimated seasonal effects, using sinusoidal curves and their harmonics. This approach suffered from the deficiencies identified earlier and was not pursued.

Similarly, the study by Civerolo (2001) looked at monthly data in which the measured weekly concentrations were multiplied by the weekly precipitation and summed over the month. One of the problems with this approach is that there are many weeks when there is no precipitation, so that the monthly data can be misleading. Moreover, the SO_{4}^{2-} concentration data were first log-transformed to stabilise the variance of the time series and to reduce the non-linear effects before being subjected to filtering using a Kolmogorov-Zurbenko filter (Rao, 1994).

This approach was criticised by Hess et al. (2001), who found the Kolmogorov-Zurbenko filter had a very high probability of detecting a trend when there was no trend present. Hess et al. (2001) found that ordinary linear regression had a similar power to other tests they evaluated, but that when there was definite seasonality, the SKT test and a t-test adjusted for seasonality were both stronger than ordinary linear regression.

As this brief survey shows, there appears to be no agreed methodology for determining whether or not there is a trend in rainfall data. Many methods have been presented, but all seem to suffer from identifiable deficiencies. One of the problems is that none started from the underlying distribution of the data, which, classically, should be the starting point for all statistical analyses. Accordingly it was decided to try to develop a methodology based on the properties of the distribution of rainfall.

**Data sources**

The 1^{st} data set was collected by Nel and Sumner (2008) and kindly provided by Nel (2008). A typical set comprised nearly 8 000 records obtained using an automated tipping rain gauge with a cycle time of 5 min and a resolution of 0.2 mm of rain. At Royal Natal National Park, (28.68°S, 28.95°E, and 1 392 m a.m.s.l.), between 21 Nov 2001 and 10 January 2005, there were 1 094 d when no rain was observed; 5 664 periods each of 5 min when 0.2 mm of rain were recorded; 1 127 similar periods when 0.4 mm of rain were recorded, and so on up to 2 x 5 min periods when 10.0 mm was recorded. The total number of non-zero records was 7 931.

The 2^{nd} data set was collected daily over a period of 30 years between October 1978 and January 2008 at the NADP site FL03 Bradford Forest, (29.9748ºN, -82.1978ºW, and 44 m a.m.s.l.). This was typical of the daily data collected by the NADP (NADP, 2007b). There were 2 597 validated non-zero days of rainfall during this period and 7 393 d of zero recorded rain. The records were essentially continuous except for a gap of 63 d between 17 Jan and 21 March 2006. Two samples were taken every 7 d, one typically from 05:00 to 18:00 and the next from 18:00 to 05:00 the following day. The 2 samples were therefore pooled so that all samples represented a full day, generally from 05:00 to 05:00.

The 3^{rd} data set used to examine rainfall distribution was collected weekly over the period between October 1979 and January 2008 at the NADP site CA45 Hopland (39.0045ºN, -123.086ºW, and 253 m a.m.s.l.). There were 744 validated non-zero records during this period.

The final data set was also collected weekly from December 1981 onwards at the NADP Site CA99 Yosemite, 37.7961ºN, -119.8581ºW and 1 393 m a.m.s.l. There were 681 validated non-zero records during this period.

The distributions of each set of data were determined by sorting into ascending order and straightforward decomposition into classes.
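This decomposition can be sketched as follows, assuming classes of fixed width labelled by their upper bounds (the convention used for the distributions in this paper); the function name and tolerance are illustrative:

```python
import math
from collections import Counter

def class_frequencies(records, width):
    """Sort non-zero rainfall records into classes of fixed width.

    Classes are labelled by their upper bound: a class labelled
    2.0 mm covers the range >1.8 to <=2.0 mm.
    """
    counts = Counter()
    for r in sorted(records):
        if r <= 0:
            continue  # zero-rain periods are excluded from the distribution
        # small tolerance so exact multiples of the width land in their own class
        label = round(math.ceil(r / width - 1e-9) * width, 6)
        counts[label] += 1
    return dict(counts)
```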

**The distribution of rainfall**

The instantaneous (5 min) data distribution is shown in Fig. 1. The records for 0.2 to 0.6 mm are omitted as the abscissa scale then becomes so large that the form of the distribution is hidden. Note that the omission is purely for illustrative purposes; the ~6 000 records in the 0.2 to 0.6 mm range were included in the later analysis. Note also that while points are shown, the numbers on the ordinate are in fact the upper bounds of ranges, so a point shown as 2.0 mm rain in 5 min is strictly the range >1.8 to ≤2.0 mm. Again, for clarity, the range is omitted. The same practice is followed in most distributions shown in the rest of this paper.

When the data are re-plotted on logarithmic co-ordinates, as shown in Fig. 2, a very clear relationship between 5 min rainfall and frequency emerges. Figure 2 includes the data which had been omitted for clarity in Fig. 1.

Over nearly 2 orders of magnitude of measured rainfall, and 4 orders of magnitude of frequency, a linear log-log relationship is observed with a regression coefficient of 0.9625, significant at the 0.1% level.

In Fig. 2, the data ranges are selected from the untransformed data. When the data ranges are selected from the log-transformed data, the distribution approximates that of the familiar normal distribution, as illustrated in Fig. 3.
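The check applied here can be sketched in outline: log-transform the non-zero records and confirm that the transformed values look normal, for instance by their near-zero skewness. The synthetic log-normal sample below stands in for real records purely for illustration:

```python
import math
import random

def log10_moments(data):
    """Mean, standard deviation and skewness of log10-transformed data.

    For log-normally distributed rainfall, the transformed values should
    be roughly symmetric, i.e. skewness near zero.
    """
    logs = [math.log10(x) for x in data if x > 0]
    n = len(logs)
    mean = sum(logs) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in logs) / n)
    skew = sum((v - mean) ** 3 for v in logs) / (n * sd ** 3)
    return mean, sd, skew

# Synthetic sample: ln-values drawn from N(0, 1), so the log10-values
# should have mean ~0, s.d. ~1/ln(10) = 0.4343, and skewness ~0
random.seed(1)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]
m, sd, sk = log10_moments(sample)
```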

The distribution of the daily rainfall in Florida is shown in Fig. 4. The distribution is clearly not a normal distribution.

However, the frequency of the **logarithm** of the daily rainfall is again linear with a significant correlation coefficient, as shown in Fig. 5.

The 3^{rd} data set, that from Site CA45, Hopland in California, plotted in a manner similar to that in Fig. 2, also gave a linear log-log relationship between volume and frequency, as shown in Fig. 6. In this case the correlation is poorer than for the 5 min measurements of Fig. 2, but this may well be because there are far fewer weekly data than there are 5 min samples.

The final data set, that also of weekly rainfall, from Site CA99, Yosemite, showed a very similar distribution, linear in the log domain (Fig. 7).

It is therefore clear that rainfall data are, to a close approximation, log-normally distributed, and that any trends should be determined using log-transformed data. It is also apparent that, once transformed, normal statistics may be employed to evaluate any trends and to determine the confidence limits on those trends. The frequency plots also give a reasonable estimate of the probability of very high precipitation events. Figure 2, for instance, shows a real likelihood of 10 mm in 5 min; Fig. 5 shows that over 200 mm/d is likely in Florida; and Figs. 6 and 7 indicate that events of over 400 mm/week are perfectly possible in California.
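Once log-normal parameters have been fitted, the probability of exceeding any given precipitation follows directly from the normal tail. A sketch, where μ and σ are the mean and standard deviation of the natural-log-transformed data:

```python
import math

def exceedance_probability(x, mu, sigma):
    """P(rainfall > x) under a log-normal fit.

    mu and sigma are the mean and standard deviation of ln(rainfall);
    the survival function of the normal is 0.5*erfc(z/sqrt(2)).
    """
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)
```

Multiplying such a probability by the number of sampling periods per year gives the expected annual count of events beyond the chosen threshold.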

**The estimation of temporal trends**

The distribution of daily rainfall records from Florida as a function of time is shown in Fig. 8. Also shown are the trends determined on the entire data set by regression on the untransformed data ('Raw data') and on the log-transformed data ('Linear (Log10)').

In this figure one can clearly see a discrete rather than continuous distribution at the lower end (<0.1 mm) of rainfall, which reflects measurement to an accuracy of 0.01 mm. A logarithmic model is therefore by no means perfect but, as was seen in the previous section, it represents the data to a very good approximation.

There is a definite difference between the regressions in the linear and logarithmic domains. The regression in the linear domain shows a slight downward trend over the period, but it is clearly incorrect to use linear regression on the untransformed data, for the very reasons clearly identified by Sen (1968). In contrast, the regression in the logarithmic domain reveals an obvious trend. The question is what exactly this trend means.

Clearly the noise in the data is such that, at any one point in time, the regression is a very poor estimator of the value at that point in time. But trends are not concerned with values at points in time, but with the change in the mean value over time.

In essence, one is seeking a measure of the change in the sample mean, the variance of which, *s_{m}^{2}*, will be given by:

s_{m}^{2} = s^{2}/n

where *s^{2}* is the variance of the individual samples and *n* the number of samples.

If the degrees of freedom, n, are large, then the variance of the sample mean will be far smaller than the variance of the individual sample. Thus the slope of the regression line can indeed estimate how the sample mean changes locally. And it is the slope which gives the desired trend. There is no need to filter the data in any way; no need to be concerned with a poor level of correlation between seasons; and even quite large gaps in the data can be accommodated without seriously affecting the significance of the slope so determined.
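The regression and the standard error of its slope can be sketched as follows; the helper is illustrative rather than the spreadsheet calculation actually used:

```python
import math

def trend_regression(days, log_rain):
    """Ordinary least-squares fit of log10(rain) against time in days.

    Returns the slope, intercept, and standard error of the slope.
    Because the variance of the mean shrinks as s^2/n, the fitted line
    tracks the moving mean closely even though individual log-rainfall
    values scatter widely about it.
    """
    n = len(days)
    mx = sum(days) / n
    my = sum(log_rain) / n
    sxx = sum((d - mx) ** 2 for d in days)
    sxy = sum((d - mx) * (y - my) for d, y in zip(days, log_rain))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid_ss = sum((y - (intercept + slope * d)) ** 2
                   for d, y in zip(days, log_rain))
    se_slope = math.sqrt(resid_ss / (n - 2) / sxx)
    return slope, intercept, se_slope
```

Gaps in the record simply drop out of the sums, which is why missing periods do little harm to the estimated slope.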

There are 2 possible tests for the significance of the regression, namely the F-test and the t-test. Table 1 gives the calculation of F.

The value of F is so high, and the significance so low, that there can be no doubt that the trend in the data is real.
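For a simple regression, F is the regression mean square over the residual mean square, on (1, n-2) degrees of freedom. A sketch of the calculation underlying a table such as Table 1:

```python
def regression_f(days, log_rain):
    """F statistic for an OLS regression of log-rainfall on time.

    F = (regression sum of squares / 1) / (residual sum of squares / (n-2)),
    large values indicating that the slope is unlikely to be zero.
    """
    n = len(days)
    mx = sum(days) / n
    my = sum(log_rain) / n
    sxx = sum((d - mx) ** 2 for d in days)
    sxy = sum((d - mx) * (y - my) for d, y in zip(days, log_rain))
    syy = sum((y - my) ** 2 for y in log_rain)
    reg_ss = sxy * sxy / sxx            # sum of squares explained by the trend
    resid_ss = syy - reg_ss             # scatter about the fitted line
    return reg_ss / (resid_ss / (n - 2))
```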

The t-test, comparing the 1^{st} and 2^{nd} halves of the data, gives an even greater significance:
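The comparison of the two halves of the record can be sketched as a pooled two-sample t statistic on the log-transformed data:

```python
import math

def halves_t(log_rain):
    """Pooled two-sample t statistic comparing the first and second
    halves of a log-transformed rainfall record."""
    h = len(log_rain) // 2
    a, b = log_rain[:h], log_rain[h:]

    def mean_var(x):
        m = sum(x) / len(x)
        v = sum((u - m) ** 2 for u in x) / (len(x) - 1)  # sample variance
        return m, v

    ma, va = mean_var(a)
    mb, vb = mean_var(b)
    sp2 = ((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / len(a) + 1 / len(b)))
```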

The regression line is of the form:

log_{10}(rain) = 0.16482 - (2.4955 x 10^{-5}).D

where:

D is the time in days after 1 Jan 1900

For the Florida data, the change in rain over the period sampled is given in Table 3.

Obviously, there are large errors in these estimates of the average rain, but the trend is obvious – there has been a significant drop at this station. Other stations in Florida have shown an equally significant increase in rainfall.
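The scale of the drop follows directly from the fitted line. The sketch below evaluates it at approximate start and end dates of the record (the exact dates used in Table 3 may differ); the value returned is the geometric-mean daily rainfall implied by the trend:

```python
from datetime import date

# Fitted trend line from the text: log10(rain) = 0.16482 - 2.4955e-5 * D,
# with D the time in days after 1 Jan 1900
INTERCEPT, SLOPE = 0.16482, -2.4955e-5

def trend_rain(d):
    """Geometric-mean daily rainfall (mm) implied by the fitted line on date d."""
    days = (d - date(1900, 1, 1)).days
    return 10.0 ** (INTERCEPT + SLOPE * days)

# Approximate span of the Bradford Forest record (illustrative dates)
start = trend_rain(date(1978, 10, 1))
end = trend_rain(date(2008, 1, 1))
drop_percent = 100.0 * (1.0 - end / start)
```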

Several questions may be asked. For instance, is this an artefact of the choice of end dates? The answer is 'no.' It was quite simple to repeat the analysis with subsets of the data, with start dates and end dates at different times of the year. While there is limited seasonality at this station, on average winter tends to be drier than summer, but a significant decreasing trend over the sample period was shown whether the period started in winter and ended in summer, or vice versa. Was there any impact from the missing 3 months? Changing the end date by over 2 years, to October 2005, dropped the value of F to 29.3434, with a slight drop in the significance to 6.66E-08, both of which are negligible considering that about 10% of the data were ignored.
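Such a robustness check is easy to script: re-fit the trend with roughly 10% of the record trimmed from either end and confirm that the sign of the slope is unchanged. The series below is synthetic, purely to illustrate the check:

```python
import random

def ols_slope(x, y):
    """Plain OLS slope, used to re-estimate the trend on subsets."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((u - mx) ** 2 for u in x)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    return sxy / sxx

def end_point_check(days, log_rain, trims=(26, 52)):
    """Re-fit the trend with the first or last `trims` records removed.

    A real trend should keep its sign whatever the end points chosen.
    """
    slopes = [ols_slope(days, log_rain)]
    for t in trims:
        slopes.append(ols_slope(days[t:], log_rain[t:]))
        slopes.append(ols_slope(days[:-t], log_rain[:-t]))
    return slopes

# Synthetic declining log-rainfall series (hypothetical, for illustration)
random.seed(7)
days = list(range(500))
log_rain = [-0.001 * d + random.gauss(0.0, 0.3) for d in days]
slopes = end_point_check(days, log_rain)
```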

The 3^{rd} set of data was the weekly NADP data from the Californian Site CA45, Hopland. The linear data, smoothed by averaging over 7 successive weeks, are shown in Fig. 9. It is clear that this has a marked seasonality, with winter rainfall sharply higher than that in summer.
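The smoothing is a simple centred moving average over 7 successive weekly values; a sketch:

```python
def moving_average(values, window=7):
    """Centred simple moving average: each output value is the mean of
    `window` successive inputs, so the smoothed series is shorter by
    window - 1 points."""
    half = window // 2
    return [sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)]
```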

Some idea of the difficulty of accounting for seasonal effects may be gained from the fact that the peak rainfall, 39 mm, in February 1994, was close to the minimum, 34 mm, in June 2005. While the peaks look sharp, they occur as early as October and as late as March. It is for these reasons that attempts to correct for seasonality are plagued with difficulties.

Repeating the exercise with a logarithmic plot of the data for Hopland gives the picture shown in Fig. 10, with a very clear trend of increasing rainfall.

The F-test of this data shows that this trend is highly significant, as