
Journal of Energy in Southern Africa

On-line version ISSN 2413-3051
Print version ISSN 1021-447X

J. energy South. Afr. vol.34 no.1 Cape Town Mar. 2023

http://dx.doi.org/10.17159/2413-3051/2022/v33i4a13819 

ARTICLES

 

Modelling emissions from Eskom's coal fired power stations using Generalised Linear Models

 

 

D. Chikobvu (I, *); M. Mamba (II)

(I) Department of Mathematical Statistics and Actuarial Science, University of the Free State, Mangaung, South Africa
(II) Department of Mathematical and Physical Sciences, Central University of Technology, Mangaung, South Africa

 

 


ABSTRACT

The aim of this paper is to determine whether a Generalised Linear Model (GLM) is a better model than the traditional simple linear regression when fitted to nitrogen dioxide (NO2) emitted into the atmosphere during the production of electricity from 13 Eskom coal-fuelled power stations. GLMs allow the variance to vary as a function of the mean (non-constant variance) and have the advantage of keeping the data in its original scale. Unlike regression, the models do not assume a linear relationship between the response variable and the explanatory variables; instead, a link function is used. The data also need not be Normally distributed. The group-lasso interaction network (glinternet) was used for variable selection in the GLM models. A similar model using regression analysis was fitted for comparison. The results show that a GLM can be used to predict and explain NO2 emissions from coal-fired electricity stations in South Africa. The Lognormal model was found to be the better model by diagnostic measures, including plots that showed improved variance behaviour in the residuals. Variables such as the amount of electricity sent out (in GWhs), age of power station (in years) and power station used, together with interaction terms such as electricity and station, and age and station, can be used in describing and predicting NO2 emissions (in tons) from Eskom's coal-fuelled power stations.

Keywords: Eskom; generalised linear model(s) (GLM); linear regression; lognormal distribution; nitrogen dioxide (NO2) emissions.


 

 

Introduction

Coal is the primary source of energy in South Africa and its use has increased significantly over the years. This is as a result of an increased demand of electricity in South Africa. This has given rise to more emission of pollutants including nitrogen dioxide (NO2) from the coal fired electricity power stations (Eskom, 2016). Exposure to this emission impacts on human health (Anand, Varma and Srimurali, 2013; World Health Organization, 2013; Wellenius, Schwartz and Mittleman, 2015). In order to control NO2, and sulphur dioxide (SO2) emissions and other pollutants from the electricity industry, minimum emission standards were published in terms of the National Environmental Management: Air Quality Act in 2010, requiring Eskom to install many retrofits of abatement technologies in order to comply with the emission standards (Eskom, 2011).

The selection of the right statistical probability distribution for describing or modelling environmental pollution data is an important step. These probability models have become the basis for quantifying emissions to meet the evolving information needs of environmental quality management (Singh et al., 2001).

Georgopoulos and Seinfeld (1982) concluded that air pollutant concentrations are inherently random variables because of their dependence on the fluctuations of a variety of meteorological and emission variables. They also concluded that there is no single statistical distribution which gives the best fit to air quality/emission at all time periods. The choice of a statistical distribution generally depends on the pollutant, the time period of interest, the average time of the data, the location and other factors.

Popular statistical probability density functions for representing atmospheric concentrations and emissions include the two-parameter distributions (namely, the Lognormal, the Weibull and the Gamma), three-parameter distributions (namely, the 3 parameter Lognormal, the 3 parameter Gamma, the 3 parameter Weibull and 3 parameter Beta distributions) and four-parameter distributions (e.g. the four parameter Beta distribution) (Georgopoulos and Seinfeld, 1982). The distributions are useful because they are right skewed, allowing for the modelling of higher emissions.

Statement of the problem

The aim of this study is to determine whether Generalised Linear Models (GLMs) have an advantage over, or give a better model fit than, the traditional linear regression model when fitted to the NO2 emission data. The study also aims to determine which variables contribute significantly to the amount of NO2 emitted into the atmosphere during the production of electricity from 13 Eskom coal-fuelled power stations.

Justification of the study

The identification of input variables that contribute to NO2 emission is important for monitoring and combating high emission volumes into the atmosphere, in order to find ways to decrease such emissions, meet statutory regulations and lower the risk associated with electricity production emissions. The flexibility of GLMs, compared to models based on regression analysis, can be useful in the determination of these input variables. This flexibility includes allowing the variance to vary as a function of the mean (non-constant variance), and allowing the response variable to have a distribution other than the Normal distribution. GLMs also provide the advantage of keeping the data in its original scale by making use of link functions. In the South African context, there is not sufficient literature to suggest a wide use of GLMs in the modelling of emissions, especially of the NO2 pollutant. The study will try to reduce this gap.

Objectives of the study

In this study, the objectives are:

To check if the Lognormal distribution based GLM is a better model over the traditional simple linear regression (the Normal distribution based GLM with identity link function) when fitted on the response NO2 emission data.

Determining if the variables electricity sent out (GWhs), age of power station (years), power station, abatement technology and month can be used to predict the emission of NO2 (tons).

To rank the Eskom power plants in terms of NO2 emission efficiency.

Contribution of the study

With the aging of the power stations and high demand of electricity, NO2 emissions are projected to increase from coal fired electricity stations in South Africa (Pretorius et al., 2015). There is therefore a need to model NO2 emissions from these stations. This will provide information to monitor and manage emissions to meet the regulations and thus minimise the exposure of high emissions to humans and the environment.

The rest of the paper is organised as follows: section 2 gives the literature review, section 3 gives the methodology, section 4 gives the results and section 5 concludes.

Literature review

This section reviews some of the literature, including models used in modelling emissions.

Perez and Trier (2001) used predictions to compare linear regression and multilayer neural networks to find a method of predicting NO and NO2 concentrations. A feed forward neural network was chosen as the convenient method of prediction over the linear regression since this method had reasonable control over the adjustment of parameters.

In studies by Nagendra and Khare (2006), Perez and Trier (2001) and many others, such as those by Kukkonen et al. (2003) and Capilla (2014), there was a strong non-linear dependency between NO2 emissions (concentrations) and the selected input variables. Simple linear models, multiple regression models, feed-forward multilayer perceptron networks etc. were compared in modelling NO2 concentrations. Pollutant concentrations rarely follow a Normal distribution. NO2 is no different from the other pollutants, but it can also be modelled using the statistical distributions from the flexible exponential family distribution and it also shares the statistical characteristics found in other pollutants. The exponential family distributions give the much needed flexibility in the construction of such models (Nelder and Wedderburn, 1972).

In this study, the GLM is used to model NO2 emissions at Eskom's coal-fuelled power plants.

 

Methodology

The linear regression and the GLM models are discussed in this section.

Linear regression

This section focuses on models to be used in regression under the Normality assumption of the response variable NO2. This assumption implies that the emission data is symmetric.

The following model will be fitted to the NO2 emission data initially. Analysis of Covariance (ANCOVA) is applicable, since the explanatory variables are both continuous and categorical and the response variable is continuous.

$$Y_{pqt} = \beta_0 + \beta_1 x_{pqt} + \beta_2 \mathrm{Age}_t + \gamma_p + \delta_q + \tau_s + \varepsilon_{pqt} \qquad (1)$$

where,

$Y_{pqt}$ is the response variable (NO2 emitted in tons by plant p with abatement filter q at time t (in years))

$\beta_0$ is the intercept

$\beta_1$ is the coefficient of the electricity sent out in Gigawatt-hours

$\beta_2$ is the coefficient of the age of the power station in years

$\mathrm{Age}_t$ is the age of the power plant in years at time t

$x_{pqt}$ is the amount of electricity sent out in Gigawatt-hours by plant p with filter q at age t

$\gamma_p$ is the pth plant effect

$\delta_q$ is the qth filter effect

$\tau_s$ is the sth month effect

$\varepsilon_{pqt} \sim N(0, \sigma^2)$

The model includes all the variables recorded in the study. The group-lasso interaction network variable selection method is then used to try to find a competing model that also considers interaction terms between the variables mentioned above.

Model selection

Various variable selection methods exist, such as, among others, subset selection (namely, Best-subset selection, Forward- and Backward-stepwise selection), shrinkage (namely, Ridge Regression, Lasso and Least angle regression) and methods using derived input directions (namely, Principal Components Regression and Partial Least Squares). The group-lasso interaction network (glinternet) is used in this paper to select significant variables (Lim and Hastie, 2015). Lasso regression is used since it performs both variable selection and regularisation (shrinkage that reduces model variance) to enhance prediction accuracy. The method used is an extension of the lasso (least absolute shrinkage and selection operator) variable selection technique (Tibshirani, 1996) and uses a version of the group-lasso to select pairwise interactions and enforce hierarchy (Yuan and Lin, 2006; Bien, Taylor and Tibshirani, 2013). It automatically selects and adds pairwise interactions into the Lasso model. The model selection procedure implies that not all variables may be used in the final model.
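The way the lasso both shrinks and selects can be illustrated with its one-coefficient building block, the soft-thresholding operator. The sketch below is not glinternet (an R package that additionally handles grouped interaction terms and hierarchy); it assumes an orthonormal design, and the function name `soft_threshold`, the penalty value and the coefficient values are illustrative assumptions rather than anything taken from the paper.

```python
def soft_threshold(z, penalty):
    """Lasso solution for one coefficient under an orthonormal design.

    z is the unpenalised least-squares estimate and penalty >= 0 is the
    lasso tuning parameter (both are illustrative inputs).
    """
    if z > penalty:
        return z - penalty   # large positive effects are shrunk towards zero
    if z < -penalty:
        return z + penalty   # large negative effects are shrunk towards zero
    return 0.0               # small effects are set exactly to zero (selection)


# Hypothetical unpenalised estimates for four candidate terms.
ols_estimates = [2.5, -0.3, 0.8, -1.7]
lasso_estimates = [soft_threshold(z, 1.0) for z in ols_estimates]
print(lasso_estimates)
```

Terms whose unpenalised estimate is smaller in magnitude than the penalty drop out of the model, which is why not all candidate variables survive selection.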

A model with fewer variables - no interaction terms

The glinternet variable selection approach, without interaction terms (only the main effects), is used to select a few significant variables from equation (1) above.

The residuals plot is also used to determine the best fitting model. A constant (homogeneous) residual pattern (constant variance) over the predicted values suggests a good fit or an improvement in the model of interest.

To check if the assumption of normality of data and residuals holds, the box plot, histogram, Kolmogorov-Smirnov test and the quantile-quantile (QQ) plot are used in this study. A symmetric bell-shaped histogram would suggest the data is Normally distributed. The best model is found in the case of the Normal and Lognormal distributions and all with identity link functions as discussed later.

A model with more terms - including interaction terms

In this section a more complex model is presented using glinternet and allowing for the interaction of variables.

For this model, the selection process will consider all the explanatory variables and all pairwise interaction terms, where the star (*) denotes an interaction term. The full model is given as:

where, for instance,

$x_{tpqs} * \mathrm{Age}_t$ is the joint effect of electricity sent out in Gigawatt-hours (by filter q in plant p at a given age at time t in month s) and $\mathrm{Age}_t$ of the power plant at time t (in years).

Other interaction parameters can be interpreted similarly.

Generalised linear models

In a classical regression model with data being Normally distributed, the variance $\mathrm{Var}(Y) = \sigma^2$ is assumed constant. However, in practice, it is common to find data in the form of continuous measurements where the variance increases with the mean (McCullagh and Nelder, 1989). The Lognormal model is one such model.

The Lognormal distribution

If a random variable $X$ is such that $X \sim N(\mu, \sigma^2)$, then under the transformation $Y = e^X$ we have $Y \sim \mathrm{Lognormal}(\mu, \sigma^2) \iff \ln(Y) \sim N(\mu, \sigma^2)$. To fit a Lognormal distribution to a data set, one can first log transform the data and then fit a Normal distribution to it.

If a random variable $Y$ has a variance that increases with the mean, that is, $\mathrm{Var}(Y) = \sigma^2[E(Y)]^2 = \mu^2\sigma^2$, then for small $\sigma$ the appropriate variance stabilising transformation would be the logarithm.

The Lognormal model has a variance which increases with the mean, in such a way that the coefficient of variation is constant. Lognormal modelling can be used to compensate for such increases with the mean. Also, for small $\sigma$, the log-transformed variable $\ln(Y)$ has approximate mean and variance given by

$$E[\ln(Y)] \approx \ln(\mu) - \frac{\sigma^2}{2}, \qquad \mathrm{Var}[\ln(Y)] \approx \sigma^2.$$

The log transformed variable has variance given as

$$\mathrm{Var}[\ln(Y)] \approx \frac{\mathrm{Var}(Y)}{[E(Y)]^2} = \frac{\mu^2\sigma^2}{\mu^2} = \sigma^2,$$

since the variance of the Lognormal distribution can be written as $\mathrm{Var}(Y) = \mu^2\sigma^2$.

The logarithm of the data has a constant variance. Log transforming the data should result in homoscedasticity. Therefore, the GLM Lognormal distribution model will be used to compensate for increases in variance of the emission with increases in mean emissions when such an effect is present in the data.
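The variance-stabilising effect of the logarithm can be checked with a small simulation. The sketch below uses only simulated data; the seed, the log-scale standard deviation of 0.2, the two log-scale means and the sample size are all assumptions chosen for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)   # fixed seed so the sketch is reproducible
sigma = 0.2              # assumed log-scale standard deviation
n = 20000                # assumed simulation size

results = {}
for mu in (5.0, 8.0):    # two assumed log-scale means
    y = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    log_y = [math.log(v) for v in y]
    results[mu] = {
        "mean_y": statistics.fmean(y),
        "var_y": statistics.variance(y),                    # grows with the mean
        "cv_y": statistics.stdev(y) / statistics.fmean(y),  # roughly constant
        "var_log_y": statistics.variance(log_y),            # close to sigma**2
    }

for mu, summary in results.items():
    print(mu, summary)
```

On the original scale the variance differs greatly between the two groups while the coefficient of variation is nearly the same; on the log scale both groups have a variance close to `sigma**2`.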

The exponential family and canonical form

Consider a random variable $Y$ with a distribution in the exponential family and pdf $f(y; \mu)$ in the standard form:

$$f(y; \mu) = \exp[a(y)b(\mu) + c(\mu) + d(y)]$$

When $a(y) = y$, the distribution is said to be in canonical form.

For a model to be a GLM, it must have three components, namely the error distribution, the linear predictor and the link function (Dobson and Barnett, 2008). The Lognormal distribution, which is in the exponential family, has:

Error Distribution

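The Lognormal distribution has independent response variables $Y_1, Y_2, \ldots, Y_n$ with $Y_i \sim \mathrm{Lognormal}(\mu, \sigma^2)$ and pdf given as

$$f(y_i; \mu, \sigma^2) = \frac{1}{y_i \sigma \sqrt{2\pi}} \exp\left[-\frac{(\ln y_i - \mu)^2}{2\sigma^2}\right], \qquad y_i > 0$$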

The distribution is not in canonical form since a(y)=ln(y).
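To see why $a(y) = \ln(y)$, the Lognormal pdf can be rearranged into exponential-family form. The derivation below is a sketch that treats $\sigma^2$ as a known nuisance parameter and assumes the standard form $f(y; \mu) = \exp[a(y)b(\mu) + c(\mu) + d(y)]$ of Dobson and Barnett (2008); the grouping of terms into $b$, $c$ and $d$ is one possible labelling.

```latex
\begin{aligned}
f(y;\mu) &= \frac{1}{y\sigma\sqrt{2\pi}}
            \exp\left[-\frac{(\ln y-\mu)^2}{2\sigma^2}\right] \\
         &= \exp\left[\ln(y)\,\frac{\mu}{\sigma^2}
            - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)
            - \ln(y) - \frac{(\ln y)^2}{2\sigma^2}\right] \\[4pt]
a(y) &= \ln(y), \qquad b(\mu) = \frac{\mu}{\sigma^2}, \\
c(\mu) &= -\frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2), \qquad
d(y) = -\ln(y) - \frac{(\ln y)^2}{2\sigma^2}
\end{aligned}
```

Because the data enter through $\ln(y)$ rather than $y$ itself, the form is not canonical in $y$, although it is canonical in the log-transformed response.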

Linear Predictor

The linear predictor is chosen as, for instance, $\eta_i = X_i^T\beta$, where $\beta$ is the vector of parameters and $X_i$ is the vector of explanatory variables.

Link function

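A flexible family of transformations, the power transformations, was introduced by Box and Cox (1964). For a given parameter λ, the transformation is defined by

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(y), & \lambda = 0 \end{cases}$$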

The Box-Cox approach is used to estimate the value of λ that will help determine the best link function.

According to Myers et al. (2010) the natural values for λ are as follows:

When λ=0 then Log link function

When λ=1 then Identity link function

When λ=1/2 then Square root link function

When λ=-1 then Inverse link function
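The Box-Cox family and its natural values of λ can be sketched in a few lines. The function name `box_cox` and the example value of 1000 are illustrative assumptions; estimating λ from data (for example by profile likelihood) is not shown here.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transformation of a single positive value y."""
    if y <= 0:
        raise ValueError("the Box-Cox transformation requires y > 0")
    if lam == 0:
        return math.log(y)            # limiting case, matches the log link
    return (y ** lam - 1.0) / lam


# Hypothetical emission value, used only to show the natural lambda values.
y = 1000.0
for lam, link in [(0, "log"), (1, "identity"), (0.5, "square root"), (-1, "inverse")]:
    print(link, box_cox(y, lam))
```

As λ approaches 0 the power form converges to `math.log(y)`, which is why the log link is assigned to λ = 0.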

For the NO2 data, there exists a monotone link function $g$ such that

$$g[E(Y_i)] = X_i^T\beta$$

The choice of a link function can be based on the nature of the data available for the study. The response variable being continuous and positive, the link function is chosen from these

Linear regression is a GLM with an identity link.

Model selection

Similarly to the linear regression, the group-lasso interaction network will be considered in determining models without and with interaction terms, respectively.

Maximum likelihood (ML) is the principal method of estimation used for all GLMs (McCullagh and Nelder, 1989).

In a ML approach, a standard assessment is to compare the fitted model with a fully specified (saturated) model (Hardin and Hilbe, 2007). Let $\beta_{max}$ be the parameter vector of the saturated model and $b_{max}$ be the ML estimator of $\beta_{max}$. The likelihood function of the saturated model evaluated at $b_{max}$ is $L(b_{max}; y)$. For the maximum value $L(b; y)$ of the likelihood function of the model of interest, we have $l(b_{max}; y)$ and $l(b; y)$ as the associated log-likelihoods, such that

$$D = 2[l(b_{max}; y) - l(b; y)]$$

is the deviance. The deviance for the Lognormal distribution model is given by

A likelihood ratio test (LRT) can be used to perform a hypothesis test on the parameters of interest. To define this test, let $M_1$ be a GLM with deviance $D_1$ and $p$ parameters $\beta_1, \ldots, \beta_p$, and let $M_2$ be a GLM with deviance $D_2$ and $q < p$ parameters $\beta_1, \ldots, \beta_q$. Let $\beta$ be partitioned as $\beta = [\beta^{(1)}, \beta^{(2)}]'$ where $\beta^{(1)} = (\beta_1, \ldots, \beta_q)$ and $\beta^{(2)} = (\beta_{q+1}, \ldots, \beta_p)$. The null hypothesis is

$$H_0: \beta^{(2)} = 0$$

Let $l(\hat{\beta}; y)$ be the maximum value of the log-likelihood function for $M_1$ and let $l(\hat{\beta}^{(1)}; y)$ be the maximum value of the log-likelihood function for $M_2$. The difference of deviances

$$\Delta D = D_2 - D_1 = 2[l(\hat{\beta}; y) - l(\hat{\beta}^{(1)}; y)]$$

has an approximate $\chi^2$ distribution with $p - q$ degrees of freedom and is known as the Likelihood Ratio Test statistic of the null hypothesis.
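The test can be sketched as follows. The two deviances and the number of extra parameters are hypothetical and are not values from the paper; the deviances are assumed to be scaled deviances, and `chi2_sf` is a small helper written here (a series for the regularised incomplete gamma function) so that no external library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the power series of the regularised lower incomplete gamma
    function, which is adequate for the moderate x and df used here.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= z / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def likelihood_ratio_test(deviance_reduced, deviance_full, extra_params):
    """Return the deviance difference and its approximate chi-square p-value."""
    statistic = deviance_reduced - deviance_full
    return statistic, chi2_sf(statistic, extra_params)


# Hypothetical deviances: a reduced model (M2) and a full model (M1)
# that has 12 additional parameters.
statistic, p_value = likelihood_ratio_test(120.0, 95.0, 12)
print(statistic, p_value)
```

A p-value below the chosen significance level would lead to rejecting the null hypothesis that the extra parameters are all zero.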

 

Results

The data used in this paper is monthly NO2 emissions per station, from Eskom, for a maximum period of 108 months (between 2005 and 2014).

Exploratory data analysis

Before any data analysis can be performed, it is important to explore the data in order to understand how it is distributed. The data are displayed graphically using the histogram, box plot and QQ plot for the NO2 emission (in tons). From Figure 1 below, the histogram looks symmetric but is bimodal and hence the data is not Normally distributed (Kolmogorov-Smirnov p-value < 0.01). The box plot shows that NO2 emission (in tons) has skewness and kurtosis (skewness = -0.11 and kurtosis = -0.94). The quantile-quantile plot suggests that NO2 emission (in tons) is not Normally distributed, since data points deviate from the 45° line towards the extremities of the graph.
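Skewness and kurtosis of the kind quoted above can be computed from central moments. The sketch below assumes simple moment-based definitions with kurtosis reported as excess kurtosis; statistical packages often apply small-sample adjustments, so their values can differ slightly. The toy data are made up to mimic a symmetric, bimodal sample and are not the Eskom data.

```python
def central_moment(data, k):
    """k-th central moment of a sample (divisor n)."""
    mean = sum(data) / len(data)
    return sum((v - mean) ** k for v in data) / len(data)


def skewness(data):
    """Moment-based skewness; zero for a perfectly symmetric sample."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so that a Normal sample gives about zero."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


# Made-up symmetric, bimodal sample (two clusters around 2 and 8).
toy = [1, 2, 2, 3, 7, 8, 8, 9]
print(skewness(toy), excess_kurtosis(toy))
```

A symmetric bimodal sample gives skewness near zero together with a negative excess kurtosis, the same qualitative pattern as the values reported for the NO2 data.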

 

 

Efficiency of power stations

Summary statistics on all the power stations used in modelling NO2 emission (in tons per month) are presented in Figure 2 below

 

 

The power station with the lowest average NO2 emission is Komati with 1422.23 tons per month and the highest is Majuba with 10433.49 tons per month.

Komati power station produced the lowest amount of electricity sent-out (in GWhs) on average per month and Matimba power station produced the highest. Hendrina is the oldest power station with an age of 44 years and Majuba is the youngest power station with an age of 18 years in year 2014.

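Since the efficiency of a power station cannot be measured by observing the amount of NO2 emission (in tons) alone, the relative nitrogen dioxide emission (tons/Gigawatt-hours) was calculated as follows

$$\text{Relative NO2 emission (tons/GWh)} = \frac{\text{NO2 emission (tons)}}{\text{Electricity sent out (GWh)}}$$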

The power station with the lowest average relative NO2 emission (tons/GWhs) was taken to be the most efficient of the 13 power stations. Figure 2 shows Matimba with 2.4177 tons/Gigawatt-hours to be the most efficient power station. This suggests that Matimba produces the highest amount of electricity sent out. Kriel is the least efficient power station with 5.96708 (tons/GWhs).
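The ranking by relative emission can be sketched as below. The station names and figures are entirely hypothetical (they are not Eskom data); the point is only that the station with the lowest tons per GWh ranks as most efficient, regardless of its total emissions.

```python
# Hypothetical monthly averages: (NO2 emitted in tons, electricity sent out in GWh).
stations = {
    "Station A": (6000.0, 2400.0),
    "Station B": (3000.0, 600.0),
    "Station C": (4500.0, 1000.0),
}

# Relative emission in tons per GWh for each station.
relative = {name: tons / gwh for name, (tons, gwh) in stations.items()}

# Most efficient first (lowest tons per GWh).
ranking = sorted(relative, key=relative.get)
for name in ranking:
    print(name, relative[name])
```

In this made-up example Station A emits the most NO2 in total yet ranks as the most efficient, because it also sends out far more electricity.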

The most efficient month was July with 4.47572 of average relative NO2 emission (tons/Gigawatt-hours) and January being the least efficient with 4.647 tons/Gigawatt-hours. The month differences are however minimal.

The combination of fabric filter, electrostatic precipitators and flue gas conditioning was associated with the highest efficiency, with 4.27707 tons/GWhs, and electrostatic precipitators were associated with the lowest efficiency, with an emission of 4.76371 tons/GWhs.

 

Figure 3

 

Variable Selection

One of the aims of the paper is to find/select explanatory variables with a significant effect on NO2 emission at Eskoms power plants.

Test for collinearity (dependence)

It is important to check for collinearity between paired continuous explanatory variables before fitting the data to a regression model. The presence of such a relationship would mean that having information about one variable implies that we can predict the other. Thus, both would be trying to explain the same variability in the one response variable. The variance inflation factors,

$$VIF_i = \frac{1}{1 - R_i^2},$$

of the explanatory variables will be used to check for collinearity. A value of $VIF_i > 10$ raises concern. $R_i^2$ is the coefficient of determination obtained when the ith explanatory variable is regressed on the remaining explanatory variables.

As an example, for the two variables age (in years) and electricity sent out (in GWhs), we have $VIF_i = 1.71897 < 10$ for each, which means there is no significant dependence between the two explanatory variables.
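With only two explanatory variables, the variance inflation factor reduces to 1 / (1 - r²), where r is their Pearson correlation. The sketch below assumes that standard definition; the toy vectors are made up, and only the figure 1.71897 comes from the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when there are exactly two of them."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Made-up predictor values, used only to exercise the function.
toy_vif = vif_two_predictors([1, 2, 3, 4], [2, 1, 4, 3])

# Working backwards from the VIF reported in the text.
reported_vif = 1.71897
implied_r_squared = 1.0 - 1.0 / reported_vif   # about 0.418
print(toy_vif, implied_r_squared)
```

Read this way, the reported VIF corresponds to age and electricity sent out sharing roughly 42% of their variance, well short of the R² of 0.9 that a VIF of 10 would imply.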

The Lasso via hierarchical interactions variable selection

Since no collinearity between variables in the dataset exists, one can start to select a model which includes only the explanatory variables which are significant in determining NO2 emission (in tons). In determining this, the Lasso (with hierarchical interactions) is used. The information is summarised in Table 1 below.

Table 1 shows all the coefficients generated by the variable selection process. The table includes the main effects and interaction effects. The first column shows the coefficients of the main effect, and the rest of the columns show the interaction effects. However, not all terms have interaction effects, a 0 indicates such a pair with no interaction effect. The variables amount of electricity sent out (in GWhs), power station used, age of power station (in years), and interaction terms electricity and station, age and station, and station and filter were selected and will be used to produce GLM models for this paper. In determining the GLMs, a model consisting of only the main effects and without interaction terms will first be considered. The model will be referred to as model I, and is given by

The second model, with both the main and interaction effects, will also be considered and is given by

Generalised Linear Models

Since the results in Figure 1 suggest that NO2 emission (in tons) is not Normally distributed, and it is common to find data in the form of continuous measurements where the variance increases with the mean, the Lognormal GLM under model I (the model without interaction terms) will be fitted. Similarly to the model in equation 2 above, the final model is given by the linear predictor:

however, the explanatory variable, installed filter, will not be included in this model since it produced parameters with zero values. The model can thus be given as

The plots of residuals versus predicted values, and of observed values versus predicted values, are given in Figures 4 and 5 below, respectively, for the Lognormal distribution model. Also included in the figures are the plots for the Normal distribution model with identity link function. The plots help in assessing the goodness of fit of the models. The first plots are of residuals versus predicted values (see Figure 4 below).

 

 

 

 

Below are the plots of the observed versus predicted values (see Figure 5)

A plot of observed against predicted values again shows that the Normal distribution model has a variance that increases with the predicted values, and hence the model is not very good. On the other hand, the Lognormal model seems to tame the variance behaviour and hence gives the better fit.

A model with more terms - including interaction terms

In the current section, a model with interaction terms is considered. The resultant model is called Model II and corresponds to the model in equation 2 above.

The Normal model

The final model includes the interaction effects between Electricity and Age, Electricity and Station and Age and station, and explanatory variables electricity sent out (in GWhs), age of power station (in years) and power station used.

The Lognormal model

Similarly to the Normal model above, the final model includes the interaction effects between electricity and age, electricity and station, and age and station, and the explanatory variables electricity sent out (in GWhs), age of power station (in years) and power station used.

For the two models above, the age of the power station is included because of the inclusion of the higher-order interaction term age*station. Also, the interaction term between station and filter, and the explanatory variable filter, are not included in the final model since they produced coefficients with values of zero.

Thus Model II for the two distributions is given as:

Below (in Figure 6) are the plots of residuals against predicted values for the Normal and Lognormal distributions under Model II.

 

 

When the residuals are plotted against predicted values, the Normal model shows an increasing variance with predicted values and hence the model with these interaction terms is also not good. The Lognormal model seems to tame the variance behaviour and hence gives the better fit. The results of the actual against predicted plots in Figure 7 below also confirm this observation.

 

 

Link functions and the deviance

In order to check for a good fit, the deviance was compared to the degrees of freedom. Below are the tables showing the model used, the deviance, degrees of freedom and the associated link functions for the Normal and Lognormal distributions models.

 

 

Normal distribution

The degrees of freedom for models I and II above are very small compared to their corresponding deviances, that is, $D_{1i} \gg DF_1$ and $D_{2i} \gg DF_2$, where

$D_{1i}$ and $D_{2i}$ are the deviances for model I and model II, respectively (with i = 1 and 2 representing the identity and log link functions, respectively).

$DF_1$ and $DF_2$ are the degrees of freedom for model I and model II, respectively. This observation suggests that the Normal distribution is not a good fit in modelling NO2 emissions from Eskom's coal-fuelled power stations. This observation was checked and confirmed by the use of residual plots and actual versus predicted plots. The identity link function gave the lowest deviance and was hence used.

Lognormal model

 

 

All the models from the Lognormal distribution show a good fit to the data since the deviance for each link function is smaller than the degrees of freedom, that is, $D_{1i} < DF_1$ and $D_{2i} < DF_2$,

where,

$D_{1i}$ and $D_{2i}$ are the deviances for model I and model II, respectively (with i = 1 and 2 representing the identity and log link functions, respectively).

$DF_1$ and $DF_2$ are the degrees of freedom for model I and model II, respectively.

Under the Lognormal model (for both model I and model II), the best fit is with the identity link function since it has the smallest deviance value of the three link functions.

Parameter estimation

Parameters were estimated using ML estimation with Matimba as the basis for comparison since it produced the lowest volumes of average relative NO2 emissions and hence was the most efficient.

The Lognormal distribution with identity link function (model I detailed results)

Model I: Model with explanatory variables electricity sent out (in GWhs), age of power station (in years) and power station used. Table 4 gives the parameter estimates of the best fitting model I using the Lognormal model as discussed above.

 

 

Table 4 shows the ML parameter estimate of electricity sent out (in GWhs) of 0.0008. This means that an increase in electricity sent out of 1 Gigawatt-hour will increase the log NO2 emission by 0.0008 log tons (equivalent to multiplying the emission in tons by exp(0.0008) ≈ 1.0008). Other log tons estimates are interpreted similarly.

A positive value of a plant coefficient means that the associated power station produces log NO2 emissions exceeding those of the basis, Matimba, by the estimated value. A negative value means that the log NO2 emission of the basis (Matimba) exceeds that of the associated power station by the value of the estimate. The lowest plant coefficient implies the lowest impact on emission (in log tons of NO2), having taken into account the other variables in the model. The highest plant coefficient implies the highest log NO2 emission impact.

Komati, Grootvlei and Camden produce less electricity and hence are expected to produce less NO2 emissions.

According to the power plant parameter estimates in Table 4, Komati (with log emission level of 0.1939 log tons less than Matimba) has the least impact of the 13 power stations. It has the lowest parameter estimate (and the only estimate with a negative value). Majuba (with 1.1764 log tons more than Matimba) has the greatest impact in increasing emissions. The parameters are interpreted in the presence of other variables in Model I.
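One way to read estimates expressed in log tons is to exponentiate them, which turns an additive effect on ln(NO2) into a multiplicative factor on NO2 in tons. The sketch below assumes, as the "log tons" wording suggests, that the Model I coefficients act additively on the log scale; the three estimates are the ones quoted in the text, and the dictionary labels are illustrative.

```python
import math

# Model I estimates quoted in the text (log tons).
estimates = {
    "electricity sent out, per GWh": 0.0008,
    "Komati vs Matimba": -0.1939,
    "Majuba vs Matimba": 1.1764,
}

for label, log_effect in estimates.items():
    factor = math.exp(log_effect)            # multiplicative effect on tons
    percent_change = (factor - 1.0) * 100.0  # same effect as a percentage
    print(f"{label}: x{factor:.4f} ({percent_change:+.2f}%)")
```

On this reading, a coefficient near zero corresponds to a percentage change of roughly 100 times the coefficient, while larger coefficients such as the Majuba effect must be exponentiated to be interpreted correctly.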

The Lognormal distribution with Identity link function: Model II (model with interaction terms)

The parameter estimates for the best Model II are given in Table 5. This model consists of the explanatory variables electricity sent out (in GWhs), age of power station (in years) and power station used, and the interaction terms electricity*age, electricity*station and age*station.

In Table 5, the ML coefficient of electricity sent out (in GWhs) is 0.0007. This means that an increase in electricity sent out of 1 Gigawatt-hour will increase the log NO2 emission by 0.0007 log tons (equivalent to multiplying the emission in tons by exp(0.0007) ≈ 1.0007). On the other hand, an increase in age of one year will increase the log NO2 emission by 0.0298 log tons (equivalent to multiplying the emission in tons by exp(0.0298) ≈ 1.0302).

Table 5 gives the power station effect in the presence of the other variables in the Lognormal model. According to the Lognormal Model II, the power stations Arnot, Hendrina, Camden, Grootvlei, Tutuka, Komati and Kriel had less effect on emissions than Matimba, since these had negative parameter estimates. This occurs when interaction effects are allowed for. Arnot (with 0.9759 log tons less than Matimba) had the least effect of the 13 power stations, followed by Hendrina (with 0.7919 log tons less than Matimba). Duvha, Matla, Majuba, Lethabo and Kendal had a greater effect in increasing emissions than Matimba, with Kendal (an emission level of 1.6753 log tons more than Matimba) contributing the greatest effect on emissions of all the 13 power stations.

 

 

Since the interaction of electricity sent out (in GWhs) and age produced a very small value of the estimate such that the software package used cannot display it but its sign only, we can only conclude that the joint increase in electricity sent out by 1 Gigawatt-hour and increase in age by a year will decrease the log NO2 emission in log tons by a value less than 0.0001 units.

Taking a closer look at the interaction term: electricity*station

For the interaction term electricity*station, the least effect of the 13 power stations comes from the interaction term electricity*Kendal (with only 0.0001 log tons less than electricity*Matimba), and the interaction of the electricity variable with Komati power station has the greatest effect in increasing emissions (with 0.0051 more log tons when compared to electricity*Matimba). Komati, Grootvlei and Camden produce less electricity and hence are expected to produce less NO2 emissions. However, their emissions are disproportionately higher.

For the effect age*station, Komati, Kendal, Camden, Grootvlei, Lethabo, Hendrina, Duvha and Majuba have interactions with age that reduce the emission impact, since they all have negative interaction coefficients when compared with the basis, age*Matimba. The age interactions with Kriel, Arnot, Matla and Tutuka contribute to increasing emissions since their coefficients are all positive. For the interaction term age*station, Komati (with 0.0575 log tons less than age*Matimba) has the least impact on emission. The age interaction with Tutuka leads to the greatest emission impact (with 0.0404 more log tons compared to age*Matimba). Generally, the older plants give more emissions. Tutuka produces more emissions than expected given its age.

Criteria for assessing goodness of fit: Selecting the best model.

One can now determine whether the addition of interaction terms produced a better fit than the model with fewer terms (no interaction effects). For the Lognormal model with identity link function, let D1 and D2 be the deviances for Models I and II, respectively. The two nested models are compared through the difference in their deviances, D1 - D2.

This suggests that the null hypothesis is rejected at α = 0.05, and we can conclude that the addition of the interaction terms is significant in predicting NO2 emissions. Model II can therefore be used to predict NO2 emissions and can be regarded as the better fit of the two.
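
As a hedged reminder of the standard comparison of nested GLMs, the test sets the drop in deviance against a chi-squared distribution. The symbols p_1 and p_2 (numbers of parameters in Models I and II) and \hat{\phi} (estimated dispersion) are notation assumed for this sketch, and no numerical values are implied.

```latex
\begin{align*}
H_0 &: \text{all interaction coefficients are zero (Model I is adequate)} \\
\Delta D &= D_1 - D_2, \qquad
\frac{\Delta D}{\hat{\phi}} \overset{\text{approx.}}{\sim} \chi^2_{p_2 - p_1}
\quad \text{under } H_0 \\
\text{Reject } H_0 \text{ at } \alpha = 0.05 &\iff
\frac{\Delta D}{\hat{\phi}} > \chi^2_{p_2 - p_1;\,0.95}
\end{align*}
```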

Evaluating the predictive models (RMSE, MAPE and MAE)

In addition to the residual plots above, prediction evaluation metrics are presented to confirm the fitted model. Table 6 shows the root mean squared error (RMSE), mean absolute percentage error (MAPE) and mean absolute error (MAE) for the two models, Normal and Lognormal. The MAPE for the Lognormal model (0.86%) is lower than that of the Normal model (5.34%). This suggests that the Lognormal model is a better fit than the Normal model. This is supported by the RMSE and MAE: for the Lognormal Model II, the MAE is 0.0653 log tons (a multiplicative factor of exp(0.0653) ≈ 1.0675 on the original scale), compared with 296.1763 tons for the Normal Model II.
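
The three metrics can be sketched as follows, using their usual textbook definitions. The function names and the small data set are made up for illustration and are not the paper's data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.
    Every actual value must be non-zero."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


# Made-up emissions (tons) and predictions, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

print(f"RMSE = {rmse(actual, predicted):.4f}")
print(f"MAE  = {mae(actual, predicted):.4f}")
print(f"MAPE = {mape(actual, predicted):.2f}%")
```

RMSE and MAE carry the units of the response (tons or log tons), which is worth keeping in mind when reading values from models fitted on different scales.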

 

 

Discussion

In a classical regression model, the variance is assumed to be constant and the data are assumed to be Normally distributed. However, in practice, it is common to find continuous measurements where the variance increases with the mean (McCullagh and Nelder, 1989). In such cases, a Lognormal GLM could be used. Diagnostic plots suggest that the variance increases with the mean for this data set, and the data set obeys the constant coefficient of variation assumption. The results of the linear regression model suggest that the NO2 emission data are not Normally distributed; this is supported by the histogram and box plot. Lognormal distribution models were also fitted to the data. The best link function is the identity link, as evidenced by its smaller deviance compared with the log and inverse link functions. Comparisons of the Lognormal model with identity link function and the linear regression model, using the residual plots and the actual versus predicted plots, indicate that the Lognormal model is better, as its plots show improved variance behaviour that is now constant. It can be concluded that the GLM is a better model than the linear regression model in explaining and predicting NO2 emission data from Eskom's coal-fuelled power stations.
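
The constant coefficient of variation mentioned above can be illustrated with the closed-form moments of a Lognormal distribution: if log(Y) is Normal with mean mu and standard deviation sigma, the standard deviation of Y grows in proportion to its mean. The parameter values below are arbitrary assumptions chosen only for illustration.

```python
import math


def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of Y when log(Y) ~ Normal(mu, sigma^2)."""
    mean = math.exp(mu + sigma ** 2 / 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1)
    return mean, sd


SIGMA = 0.3  # arbitrary log-scale standard deviation (assumption)

cvs = []
for mu in (2.0, 4.0, 6.0):  # arbitrary log-scale means (assumption)
    mean, sd = lognormal_mean_sd(mu, SIGMA)
    cvs.append(sd / mean)
    # The spread rises with the mean, but sd / mean stays the same.
    print(f"mean = {mean:10.2f}  sd = {sd:9.2f}  CV = {sd / mean:.4f}")
```

On the log scale the same data have constant variance sigma^2, which is why modelling log emissions stabilises the residual spread.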

The identification of significant variables contributing to high emissions is essential in monitoring and managing emissions. The variables electricity sent out (in GWhs), age of power station (in years) and power station used, together with the interaction terms electricity*station and age*station, can be used in describing and predicting NO2 emissions from Eskom's coal-fuelled power stations.

To enhance research on NO2 emissions from Eskom's coal-fuelled power stations, it would be beneficial to add the amount and quality of coal used in the generation of electricity as explanatory variables. In future studies, the researchers would like to compare two GLM distribution models that obey the constant coefficient of variation assumption, namely the Lognormal and Gamma models.

 

Conclusion

This paper discusses the use of GLMs in modelling emission data from Eskom's 13 coal-fuelled power stations. Two GLM distribution models, the Normal and the Lognormal, were constructed and compared. Each distribution model was fitted in two forms, one without interaction terms (Model I) and one with interaction terms (Model II), using the group-lasso interaction network (glinternet) variable selection method. This was done to determine whether the addition of interaction effects to the models is significant. The deviance was then used to determine the best link function among the identity, log and inverse links for Model I and Model II. The identity link function was deemed the most appropriate for the given dataset. For the Normal GLM models, the deviances were very large compared with their corresponding degrees of freedom, suggesting that the Normal distribution models (and thus the linear regression models) are not a good fit for the data. This is expected, as it is common for continuous data, including emission data, not to obey the Normality assumption (McCullagh and Nelder, 1989). We can therefore conclude that the linear regression model is not a good fit for the NO2 emission data. For the Lognormal distribution model, the addition of interaction terms was significant. The main contribution of this paper is to demonstrate the flexibility that GLMs offer through the link function, compared with the more limited classical linear regression, when modelling NO2 emission data (Nelder and Wedderburn, 1972). The modelling helps in developing better models to explain Eskom emission data, such as the NO2 emission data. The study is useful to power utilities such as Eskom in monitoring and managing emissions to meet regulations, and thus in minimising the exposure of humans and the environment to high NO2 emissions.

 

References

Anand, S., Varma, K. and Srimurali, M. (2013) Concentration of Nitrogen Dioxide Estimation from Modeled NOX of a Thermal Power Plant, Journal of Environmental Science, Toxicology and Food Technology, 6(3), pp. 08-11.

Bien, J., Taylor, J. and Tibshirani, R. (2013) A lasso for hierarchical interactions, The Annals of Statistics, 41(3). doi:10.1214/13-AOS1096.

Box, G.E.P. and Cox, D.R. (1964) An Analysis of Transformations, Journal of the Royal Statistical Society: Series B (Methodological), 26(2), pp. 211-243. doi:10.1111/j.2517-6161.1964.tb00553.x.

Capilla, C. (2014) Multilayer perceptron and regression modelling to forecast hourly nitrogen dioxide concentrations, WIT Transactions on Ecology and The Environment, 183, pp. 39-48.

Dobson, A.J. and Barnett, A.G. (2008) An introduction to generalized linear models. 3rd edn. Edited by B.P. Carlin et al. Chapman & Hall/CRC (Texts in Statistical Science Series).

Eskom (2011) COP17 fact sheet: Air quality and climate change. Available at: http://www.eskom.co.za (Accessed: 10 December 2015).

Eskom (2016) 2014's best performing return to service project globally - Camden. Available at: http://www.eskom.co.za/news/Pages/Febll.aspx (Accessed: 21 December 2017).

Georgopoulos, P.G. and Seinfeld, J.H. (1982) Statistical distributions of air pollutant concentrations, Environmental Science & Technology, 16(7), pp. 401A-416A. doi:10.1021/es00101a727.

Hardin, J. and Hilbe, J. (2007) Generalized linear models and extensions. 2nd edn. StataCorp LP.

Kukkonen, J. (2003) Extensive evaluation of neural network models for the prediction of NO2 and PM10 concentrations, compared with a deterministic modelling system and measurements in central Helsinki, Atmospheric Environment, 37(32), pp. 4539-4550. doi:10.1016/S1352-2310(03)00583-1.

Lim, M. and Hastie, T. (2015) Learning Interactions via Hierarchical Group-Lasso Regularization, Journal of Computational and Graphical Statistics, 24(3), pp. 627-654. doi:10.1080/10618600.2014.938812.

McCullagh, P. and Nelder, J.A. (1989) Generalized linear models. 2nd edn. London: Chapman and Hall.

Myers, R.H. et al. (2010) Generalized linear models: with applications in engineering and the sciences. John Wiley & Sons.

Nelder, J.A. and Wedderburn, R.W.M. (1972) Generalized Linear Models, Journal of the Royal Statistical Society. Series A (General), 135(3), p. 370. doi:10.2307/2344614.

Perez, P. and Trier, A. (2001a) Prediction of NO and NO2 concentrations near a street with heavy traffic in Santiago, Chile, Atmospheric Environment, 35(10), pp. 1783-1789. doi:10.1016/S1352-2310(00)00288-0.

Perez, P. and Trier, A. (2001b) Prediction of NO and NO2 concentrations near a street with heavy traffic in Santiago, Chile, Atmospheric Environment, 35(10), pp. 1783-1789. doi:10.1016/S1352-2310(00)00288-0.

Pretorius, I. et al. (2015) A perspective on South African coal fired power station emissions, Journal of Energy in Southern Africa, 26(3), pp. 27-40.

Singh, K.P. et al. (2001) Mathematical modeling of environmental data, Mathematical and Computer Modelling, 33(6-7), pp. 793-800. doi:10.1016/S0895-7177(00)00281-8.

Tibshirani, R. (1996) Regression Shrinkage and Selection Via the Lasso, Journal of the Royal Statistical Society: Series B (Methodological), 58(1), pp. 267-288. doi:10.1111/j.2517-6161.1996.tb02080.x.

Wellenius, G.A., Schwartz, J. and Mittleman, M.A. (2015) Health and the environment: addressing the health impact of air pollution, Draft resolution proposed by the delegations of Albania, Chile, Colombia, France, Germany, Monaco, Norway, Panama, Sweden, Switzerland, Ukraine, United States of America, Uruguay and Zambia. Sixty-Eighth World Health Assembly. Agenda item, 14, p. A68.

World Health Organization (2013) Health Effects of Particulate Matter: Policy implications for countries in eastern Europe, Caucasus and central Asia. Available at: https://apps.who.int/iris/handle/10665/344854.

Yuan, M. and Lin, Y. (2006) Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), pp. 49-67. doi:10.1111/j.1467-9868.2005.00532.x.

 

 

* Corresponding author: Email: Chikobvu@ufs.ac.za

All content of this journal, except where otherwise noted, is licensed under a Creative Commons License.