## South African Journal of Economic and Management Sciences

*On-line version* ISSN 2222-3436

*Print version* ISSN 1015-8812

### S. Afr. j. econ. manag. sci. vol.15 n.1 Pretoria Jan. 2012

**ARTICLES**

**On employees' performance appraisal: the impact and treatment of the raters' effect**

**Temesgen Zewotir**

School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal

**ABSTRACT**

By putting in place a performance appraisal scheme, employees who improve their work efficiency can be rewarded, whereas corrective action can be taken against those who do not. The aim of this paper is to develop a technique that helps to measure the subjective effect that a given rater's assessment will have on the performance appraisal of a given employee, assuming that an assessment of one's work performance will have to be undertaken by a rater and that this rating is essentially a subjective one. In particular, a linear mixed modelling approach will be applied to data from a South African company which has 214 employees and where an annual performance evaluation has been run. One of the main conclusions drawn from this study is that there is a very significant raters' effect that needs to be properly accounted for when rewarding employees. Without this adjustment, any incentive scheme, whether its motive is reward based or penalty based, will ultimately fail in its intended purpose of improving employees' overall performance.

**Key words:** raters' effect; performance appraisal; model diagnostics; mixed model; fixed effect; best linear unbiased predictor

**JEL: C210, 49, M49**

**1 Introduction**

Yearly performance reviews are seen as critically important for ensuring the success of public entities and private companies (Saxena, 2010). Their aim is to induce workers to become more efficient and effective (Kondrasuk, 2011), and help supervisors to become more transparent in the way they interact with their workers. As a result, workers begin to have a better understanding of their supervisors' expectations, leading to a greater sense of ownership of their duties and thus improved work performance. Ignoring these performance issues will ultimately decrease morale, which in turn will lead to a drop-off in the company's overall level of performance as management wastes time rectifying what isn't being done properly (Grote, 1996). Thus an effective performance appraisal can provide huge benefits for the employer in terms of increased staff productivity, knowledge, loyalty and participation (Margrave & Gorden, 2001).

How best to measure an employee's performance, however, can be significantly affected by what has become known as the horns and halo effect, in which one person's judgment of another is unduly influenced by a first impression. It is a selective perception problem: the term 'horns' refers to an unfavourable first impression, while the term 'halo' refers to a favourable one. Ideally one would like to minimise the effect that a first impression has on a final rating, but this selective perception bias has been observed in the behaviour of all raters, and is therefore known as the raters' effect (Wolfe, 2004).

Due to the complexity of job performance and of interpersonal relations at work, much of the existing research indicates that raters account for significant proportions of the variance in employees' true performance (Woehr et al., 2005; Hoffman & Woehr, 2009; Hoffman et al., 2010). It is therefore in the interests of both the organisation and the individual to maximise the effectiveness of performance appraisal by reducing rater errors (see, for example, Aguinis & Pierce, 2008; Uggerslev & Sulsky, 2008; Ferris, 2008; Ogunfowora, 2010). Most of these studies focus on strategies applied before the rating takes place rather than on the rating outcomes.

Therefore, the purpose of this study is to introduce a statistical method to (i) demonstrate the plausibility of rater source factors in the performance appraisal; (ii) identify (and adjust for) the magnitude of the raters' effect and thereby rank the 'best' and 'worst' performers; and (iii) identify deviant ratings. Hence, this study contributes to the literature by attempting to clarify the structure of the raters' effect, its existence and nature, and the relative proportion of variance accounted for by the raters' influence on performance ratings.

**2 The data and purpose of the analysis**

The South African based company^{1} has 214 employees. All were included in the study, as each employee was part of an annual performance appraisal scheme. For each project (or activity) in which he/she was involved, the employee was given a rating on a continuous scale ranging from 0 to 25, with a higher rating showing a better performance. The ratings were performed by 85 evaluators. The complexity of the tasks that the employees were asked to perform was also taken into consideration by the evaluators when rating.

To help mitigate the effect of using different raters, all 85 raters received some form of training (i) to familiarise themselves with the measures that they would be working with, (ii) to ensure that they understood the sequence of steps that they would have to follow in their assessment and (iii) to explain how they should interpret any normative data that they would be given. More details about the data can be obtained from Zewotir (2001).

If one were able to use all 85 raters to rate each and every employee in the firm, raters' training would minimise rater effects, as the effects would be the same (Pulakos, 1986; Houston et al., 1991). No single employee would run the risk of having a lower or higher overall rating, as all the employees would receive the same benefit or penalty from the raters' subjective leniency or harshness. In the firm that we studied, however, not every employee was able to be rated by the same set of raters. In particular, Table 1 shows how some raters evaluated several employees whereas others only rated a few employees. It should be noted that in Table 1 there are 340 ratings of 214 employees because some employees were involved in a number of projects (or activities) and accordingly had multiple raters.

The difference between the rating that will be assigned by a single rater and the average rating that will be assigned by all 85 raters is called the 'raters' effect'. Clearly, if this raters' effect is non-zero, then employees who have been evaluated by different sets of raters may receive an unfair (i.e. biased) score, primarily because they have faced a relatively lenient or relatively harsh set of judges when compared with the other employees in the firm. In this case, an adjustment should be made to a given employee's average score to take into account the potential bias that arises because a different set of raters has been used. Simply averaging the scores given by each rater to an employee will not adjust for this raters' effect. In the next section we will develop a method that attempts to account for the raters' effect. Once this has been done, we can then separate 'good' performers from 'poor' performers and reward them accordingly.
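The bias described above can be illustrated with a minimal sketch. The employees, raters, true scores and leniency values below are entirely hypothetical and noise-free; they are not taken from the company's data. Two employees with identical true performance end up with different raw averages simply because they faced different raters.

```python
from statistics import mean

# Hypothetical toy values (assumptions for illustration only).
true_score = {"emp_A": 15.0, "emp_B": 15.0}
# Each rater's deviation from the all-rater average (sums to zero).
leniency = {"rater_1": 3.0, "rater_2": -3.0, "rater_3": 0.0}
# Unbalanced design: the employees do not share the same set of raters.
assignments = {
    "emp_A": ["rater_1", "rater_3"],
    "emp_B": ["rater_2", "rater_3"],
}


def raw_mean_scores(true_score, leniency, assignments):
    """Average the (noise-free) ratings that each employee received."""
    return {
        emp: mean(true_score[emp] + leniency[rater] for rater in raters)
        for emp, raters in assignments.items()
    }


raw = raw_mean_scores(true_score, leniency, assignments)
print(raw)  # emp_A is inflated by the lenient rater, emp_B deflated by the harsh one
```

In this toy setting the raw averages differ by three points even though the two employees perform identically, which is the kind of unfairness an adjustment for the raters' effect is meant to remove.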

**3 Formulation of the model**

A classical example of testing for inter-rater reliability is described by Fleiss (1986) in the context of a medical situation where depressive patients are being rated by several psychiatrists, and there is a restriction on the number of examinations that a patient can undergo. However, this method cannot be used in our context of performance appraisal because the rater who is evaluating a given employee is someone who has a detailed knowledge of that person's performance, i.e. the random assignment of employees to any given evaluator is not possible in our context. Furthermore, one is not necessarily able to restrict the number of employees that each rater sees, or vice versa.

Some researchers have suggested that one calculate a mean performance score for each employee and then rank the employees based on their mean performance. As has already been noted, because the set of raters differs from one employee to the next, simply ranking the mean performance scores of the employees will not remove the rater bias (Russell, 2000). Other researchers have attempted to develop an analysis of variance-based raw scores (Braun, 1988; de Gruijter, 1984; Houston et al., 1991) or a multifaceted Rasch model (Wolfe et al., 2001; Wolfe, 2004). Such a model, however, requires that one make use of a Likert scale when rating an employee's performance (such as Excellent, Very good, Good, Fair, Poor).

In our modelling context the rating that is given is not based on a Likert scale. In order to develop a performance score for a given employee and to correct this score for a possible rater's effect, we will use a linear mixed model i.e.

y_{ij} = µ + α_{i} + β_{j} + ε_{ij}

where y_{ij} denotes the appraisal score that rater j has given to the i^{th} employee, µ denotes an overall mean score, α_{i} denotes the deviation of employee i from this overall mean score, β_{j} denotes the j^{th} rater's effect and ε_{ij} is an error term. In particular, we will assume that the α_{i}s are independent identically distributed normal random variables with mean 0 and variance σ_{1}^{2}, and that the ε_{ij}s are independent identically distributed normal random error terms with mean 0 and variance σ_{0}^{2}. Turning to the model parameter β_{j}, some of the management group may want to draw conclusions only about these 85 raters, in which case the raters' effect β_{j} should be treated as a fixed effect. On the other hand, some may argue that the 85 evaluators are representative of a population of raters, in which case the raters' effect should be treated as a random effect.

Instead of arguing about whether this raters' effect should be fixed or random, we will construct two models: one with a raters' effect that is fixed and another where we treat the raters' effect β_{j} as an independent identically distributed normal random variable with mean 0 and variance σ_{2}^{2}. We will also assume that α_{i}, β_{j} and ε_{ij} are distributed independently of each other. The resulting model then becomes a linear random effects model. A detailed discussion of linear random effects models can be found in, among others, Harville (1990), Robinson (1991), Searle et al. (2006) and SAS Institute (1992). The main focus of interest in this model is the variance of the raters' effect, σ_{2}^{2}. If the hypothesis σ_{2}^{2} = 0 is supported by the data, then the raters' effect is constant across raters. In other words, employees receive an identical bias from any rater that is assigned by the company, implying that there is no need to adjust the employee's score with respect to a raters' effect. On the other hand, if the hypothesis σ_{2}^{2} = 0 is not supported by the data, then different raters employ different levels of leniency or severity when judging an employee's performance, and thus the employee's score should be adjusted to account for this effect.
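For reference, the random raters' effect model and the second moments that follow from the stated independence assumptions can be written out as below. The one-sided alternative H_{1} is our framing of the test, chosen because a variance cannot be negative; it is not quoted from a formula in the text.

```latex
\begin{align*}
  y_{ij} &= \mu + \alpha_i + \beta_j + \varepsilon_{ij}, \\
  \alpha_i &\sim N(0, \sigma_1^2), \qquad
  \beta_j \sim N(0, \sigma_2^2), \qquad
  \varepsilon_{ij} \sim N(0, \sigma_0^2), \\[4pt]
  \operatorname{Var}(y_{ij}) &= \sigma_1^2 + \sigma_2^2 + \sigma_0^2, \\
  \operatorname{Cov}(y_{ij}, y_{ij'}) &= \sigma_1^2 \quad (j \neq j')
    \quad \text{(same employee, different raters)}, \\
  \operatorname{Cov}(y_{ij}, y_{i'j}) &= \sigma_2^2 \quad (i \neq i')
    \quad \text{(same rater, different employees)}, \\[4pt]
  H_0 &: \sigma_2^2 = 0 \quad \text{versus} \quad H_1 : \sigma_2^2 > 0 .
\end{align*}
```

The second covariance line makes the role of σ_{2}^{2} explicit: it is the component that makes scores from the same rater move together, so a value of zero means that sharing a rater carries no common leniency or severity.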

In the fixed effects model our main interest is whether the β_{j}s are identical for all j = 1, 2, ..., 85. Because the employee effects α_{i} remain random while the rater effects are fixed, such a model is known as a two-way mixed effects model (see, for example, Little et al., 2000; Skrondal & Rabe-Hesketh, 2004; McCulloch et al., 2008). If the data support the hypothesis H_{0}: β_{1}=β_{2}=...=β_{85}, then the employees receive an identical bias from all 85 raters, so that there is no need to adjust the employee's score for the raters' effect.
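To give a feel for what adjusting for a fixed raters' effect involves, the following is a simplified sketch and not the estimation procedure used in this paper. It treats both employee and rater effects as fixed and fits the additive layout by alternating group means (backfitting), with the rater effects constrained to sum to zero. The function name `fit_additive` and the toy ratings are hypothetical.

```python
from collections import defaultdict


def fit_additive(ratings, n_iter=200):
    """Fit score = mu + employee_effect + rater_effect by alternating means.

    ratings: list of (employee, rater, score) tuples.
    Returns (mu, employee_effects, rater_effects); rater effects sum to zero.
    Simplification: both sets of effects are treated as fixed.
    """
    mu = sum(score for _, _, score in ratings) / len(ratings)
    emp_eff = defaultdict(float)
    rat_eff = defaultdict(float)

    def group_means(key_index, other_index, other_eff):
        # Mean residual per group after removing mu and the other factor.
        sums, counts = defaultdict(float), defaultdict(int)
        for row in ratings:
            key, other, score = row[key_index], row[other_index], row[2]
            sums[key] += score - mu - other_eff[other]
            counts[key] += 1
        return {k: sums[k] / counts[k] for k in sums}

    for _ in range(n_iter):
        emp_eff = group_means(0, 1, rat_eff)
        rat_eff = group_means(1, 0, emp_eff)
        # Centre the rater effects so that they are identifiable.
        shift = sum(rat_eff.values()) / len(rat_eff)
        rat_eff = {r: v - shift for r, v in rat_eff.items()}
    return mu, emp_eff, rat_eff


# Hypothetical noise-free ratings: both employees truly score 15,
# rater_1 is lenient (+3), rater_2 is harsh (-3), rater_3 is neutral.
ratings = [
    ("emp_A", "rater_1", 18.0),
    ("emp_A", "rater_3", 15.0),
    ("emp_B", "rater_2", 12.0),
    ("emp_B", "rater_3", 15.0),
]
mu, emp_eff, rat_eff = fit_additive(ratings)
adjusted = {emp: mu + eff for emp, eff in emp_eff.items()}
print(adjusted)
print(rat_eff)
```

On this toy layout the adjusted scores of the two employees coincide, whereas their raw averages would not. The sketch relies on the design being connected (here both employees share rater_3); without such overlap, rater and employee effects cannot be separated.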

An important component of this model is a measure of its reliability. This measure, sometimes called the intra-class correlation (ICC) coefficient, ρ, can be defined as the proportion of the total variance of the scores that can be attributed to the true performance score.
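As a hedged illustration of this definition, one natural reading is ρ = σ_{1}^{2} / (σ_{1}^{2} + σ_{2}^{2} + σ_{0}^{2}) when the raters' effect is random, and ρ = σ_{1}^{2} / (σ_{1}^{2} + σ_{0}^{2}) when the raters' effect is fixed and so contributes no variance component. These expressions are our assumption, based on the verbal definition above, and the numbers in the example are made up.

```python
def intraclass_correlation(var_employee, var_error, var_rater=0.0):
    """Share of total score variance attributed to true performance.

    var_employee: variance of the employee effects (sigma_1^2)
    var_error:    error variance (sigma_0^2)
    var_rater:    variance of the raters' effect (sigma_2^2); leave at 0.0
                  when the raters' effect is modelled as fixed.
    """
    total = var_employee + var_rater + var_error
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_employee / total


# Illustrative (made-up) variance components.
rho_random = intraclass_correlation(4.0, 4.0, var_rater=2.0)
rho_fixed = intraclass_correlation(4.0, 4.0)
print(rho_random, rho_fixed)
```

Under this reading, a larger raters' variance lowers ρ, since more of the spread in the observed scores is then due to who did the rating rather than to the employee's own performance.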

The estimation of the employee based effects α_{i} will make use of a technique known as Best Linear Unbiased Prediction (BLUP). BLUP is a class of statistical tools that has some desirable properties (Robinson, 1991; SAS, 1992; Searle et al., 1996; McCulloch et al., 2008). The term 'best' in the acronym BLUP describes the property that, given the available data on an employee, the predicted true performance will be as error-free as possible. The term 'linear' simply means that the prediction is a linear function of the data, which have not been transformed to some other scale, for example by squaring. 'Unbiasedness' means that, on average, the predicted true performance will be the same as the employee's true performance. 'Prediction' refers to the task at hand: trying to predict true performance.
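To convey how a BLUP behaves, the sketch below uses a deliberately simplified setting: a model with employee effects only (no raters' effect), with the overall mean and both variance components assumed known. In that setting the predictor of α_{i} is the employee's mean deviation from µ, shrunk towards zero by the factor n_{i}σ_{1}^{2} / (n_{i}σ_{1}^{2} + σ_{0}^{2}), where n_{i} is the number of ratings. The function name and the numbers are hypothetical, and the full model in this paper would require mixed model software.

```python
def blup_employee_effect(scores, mu, var_employee, var_error):
    """Predict an employee's deviation from mu in a one-way random model.

    Simplifying assumptions: no raters' effect, and mu, var_employee
    (sigma_1^2) and var_error (sigma_0^2) are treated as known.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("at least one score is required")
    shrinkage = n * var_employee / (n * var_employee + var_error)
    return shrinkage * (sum(scores) / n - mu)


# Made-up example: the same mean rating of 19 against an overall mean of 15.
one_rating = blup_employee_effect([19.0], 15.0, 4.0, 4.0)
two_ratings = blup_employee_effect([18.0, 20.0], 15.0, 4.0, 4.0)
print(one_rating, two_ratings)
```

The employee with two ratings is shrunk less than the employee with one, reflecting that more data on an employee gives a more reliable prediction of true performance.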

Once a BLUP has been obtained for each of the employee based parameters, a hypothesis test can be constructed by noting that the standardised BLUPs follow a Student's t-distribution with degrees of freedom equal to the denominator degrees of freedom (ddf). One can then pinpoint the i^{th} employee as being a significantly good or bad performer if the absolute value of the standardised BLUP is greater than t(1-α/2; ddf), where t(1-α/2; ddf) is the 1-α/2 quantile of Student's t-distribution with ddf degrees of freedom and α here denotes the significance level of the test rather than an employee effect α_{i}. For exceptionally good performers, the estimate will be positive valued and for bad performers it will be negative valued.
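The decision rule can be sketched as follows. The function name, the BLUP values and the standard errors are hypothetical, and the critical value is passed in as an argument rather than computed; the value 1.97 used below is only an illustrative figure close to the two-sided 5% point for a couple of hundred degrees of freedom, and in practice it should be taken from tables or software for the actual ddf.

```python
def classify_employees(blups, std_errors, t_crit):
    """Label each employee from the standardised BLUP and a critical value.

    blups:      dict of employee -> predicted effect
    std_errors: dict of employee -> standard error of that prediction
    t_crit:     the 1 - alpha/2 quantile of Student's t for the relevant ddf
    """
    labels = {}
    for emp, estimate in blups.items():
        t_stat = estimate / std_errors[emp]
        if abs(t_stat) <= t_crit:
            labels[emp] = "not significant"
        elif estimate > 0:
            labels[emp] = "significantly good"
        else:
            labels[emp] = "significantly poor"
    return labels


# Made-up predictions and standard errors for three employees.
blups = {"emp_A": 2.4, "emp_B": -3.0, "emp_C": 0.5}
std_errors = {"emp_A": 1.0, "emp_B": 1.0, "emp_C": 1.0}
labels = classify_employees(blups, std_errors, t_crit=1.97)
print(labels)
```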

Model diagnostics also form an important part of statistical modelling. Zewotir and Galpin (2004, 2005 and 2007) have outlined some formal and informal procedures that can be used to help detect outliers, influential points and specific departures from underlying assumptions in the linear mixed models. These procedures will also be employed in this paper.

**4 Results and discussions**

**4.1 Without an adjustment for the raters' effect**

One can perform an analysis without adjusting for the raters' effect, by simply using the average score that has been assigned by all the raters to a given employee. Using this approach, the best and worst performers are presented in Table 2.

**4.2 Adjusted model 1: Including a raters' effect as a fixed effect**

Results for the rater fixed effects model are given in Table 3. The rater row of Table 3 tests whether the rater effect parameters are significantly different from zero. The very small p-value obtained (p = 0.0001) indicates that the hypothesis H_{0}: β_{1}=β_{2}=...=β_{85} = 0 can be rejected. This clearly shows the existence of a rater bias in the scores given to different employees of the firm.

The variance parameter estimate for σ