<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>0038-223X</journal-id>
<journal-title><![CDATA[Journal of the Southern African Institute of Mining and Metallurgy]]></journal-title>
<abbrev-journal-title><![CDATA[J. S. Afr. Inst. Min. Metall.]]></abbrev-journal-title>
<issn>0038-223X</issn>
<publisher>
<publisher-name><![CDATA[The Southern African Institute of Mining and Metallurgy]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S0038-223X2012000700005</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Comparing two mass size distributions]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Lombard]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Lyman]]></surname>
<given-names><![CDATA[G.J.]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[North-West University Centre for Business Mathematics and Informatics]]></institution>
<addr-line><![CDATA[Potchefstroom ]]></addr-line>
<country>South Africa</country>
</aff>
<aff id="A02">
<institution><![CDATA[Materials Sampling and Consulting Pty Ltd]]></institution>
<addr-line><![CDATA[Southport ]]></addr-line>
<country>Australia</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>07</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>07</month>
<year>2012</year>
</pub-date>
<volume>112</volume>
<numero>7</numero>
<fpage>613</fpage>
<lpage>619</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.za/scielo.php?script=sci_arttext&amp;pid=S0038-223X2012000700005&amp;lng=en&amp;nrm=iso&amp;tlng=en"></self-uri><self-uri xlink:href="http://www.scielo.org.za/scielo.php?script=sci_abstract&amp;pid=S0038-223X2012000700005&amp;lng=en&amp;nrm=iso&amp;tlng=en"></self-uri><self-uri xlink:href="http://www.scielo.org.za/scielo.php?script=sci_pdf&amp;pid=S0038-223X2012000700005&amp;lng=en&amp;nrm=iso&amp;tlng=en"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[We consider in this paper the use of a modified version of Hotelling's statistic in the analysis of particle size distributions. The statistic can be adversely affected by the presence of outliers among the data. We propose a competitor to the statistic that is based on ranks, and hence is less sensitive to outlier effects. The results of a Monte Carlo study suggest that the rank test is highly competitive with the Hotelling test in its ability to detect differences between two mass size distributions. The calculation of the rank statistic is explained in detail and its application is illustrated on two sets of data.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[mass size distributions]]></kwd>
<kwd lng="en"><![CDATA[bias testing]]></kwd>
<kwd lng="en"><![CDATA[rank statistic]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[ <p align="right"><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b>TRANSACTION    PAPER</b></font></p>     <p>&nbsp;</p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="4"><b><a name="top"></a>Comparing    two mass size distributions</b></font></p>     <p>&nbsp;</p>     <p>&nbsp;</p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b>F. Lombard<sup>I</sup>;    G.J. Lyman</b><sup><b>II</b></sup></font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><sup>I</sup>Centre    for Business Mathematics and Informatics, North-West University, Potchefstroom,    South Africa    <br>   <sup>II</sup>Materials Sampling and Consulting Pty Ltd, Southport, Australia</font></p>     <p>&nbsp;</p>     <p>&nbsp;</p> <hr size="1" noshade>     ]]></body>
<body><![CDATA[<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b>SYNOPSIS</b></font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">We consider in    this paper the use of a modified version of Hotelling's statistic in the analysis    of particle size distributions. The statistic can be adversely affected by the    presence of outliers among the data. We propose a competitor to the statistic    that is based on ranks, and hence is less sensitive to outlier effects. The    results of a Monte Carlo study suggest that the rank test is highly competitive    with the Hotelling test in its ability to detect differences between two mass    size distributions. The calculation of the rank statistic is explained in detail    and its application is illustrated on two sets of data.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b>Keywords:</b>    mass size distributions, bias testing, rank statistic.</font></p> <hr size="1" noshade>     <p>&nbsp;</p>     <p>&nbsp;</p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="3"><b>Introduction</b></font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">Lyman, <i>et al.</i>    (2010) reported in this Journal the results of research, both applied and theoretical,    on the analysis of particle size distributions. The Hotelling <i>T<sup>2</sup></i>    statistic that was proposed can be adversely affected by the presence of outliers    among the data. The paper promised further research into an improved methodology.    In this paper we try to fulfil the promise. We propose a robust test procedure    based on ranks as a competitor to the <i>T<sup>2</sup></i> statistic. 
In a typical sizing procedure, a mass of <i>M</i> kg of raw material is sorted into <i>p</i> + 1 size intervals (in millimetres)</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s01.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"> and the mass of material, <i>M</i><sub>1</sub>,..., <i>M</i><sub>p+1</sub>, in each of the size intervals is recorded. Thus <i>M</i> = <i>M</i><sub>1</sub>+...+<i>M</i><sub>p+1</sub>. Often the value of <i>M</i> that is needed to achieve a target precision in sizing is suggested by an industrial standard (e.g. an ISO document) and is supposed to remain fixed over successive sieve analyses. In practice it is hardly possible to keep <i>M</i> constant, and deviations of a greater or lesser magnitude from the prescribed value occur as a matter of course. Consequently, the results of a sieve analysis are invariably reported as a vector of proportions <b>x</b> = (x<sub>1</sub>,...,x<sub>p+1</sub>), where</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05x01.jpg"></p>     ]]></body>
<body><![CDATA[<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">so that</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05x02.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">As a consequence, the covariance matrix of the multivariate data set that results upon making <i>n</i> &gt; <i>d</i> independent observations on the random vector <b>x</b> is singular.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">In order to develop a statistical method to analyse such data, one thinks of the observation <b>x</b> = (x<sub>1</sub>,...,x<sub>p+1</sub>) as a corrupted realization of a 'true' underlying size distribution <b>q</b> = (<i>q</i><sub>1</sub>,...,<i>q</i><sub>p+1</sub>), with &#931;<i>q<sub>k</sub></i> = 1. The principal sources of corruption are mechanical sampling error and laboratory analysis error. Here <i>q<sub>k</sub></i> denotes the true, but unknowable, proportion of the total mass of a large (conceptually infinite) amount of material that falls in the <i>k</i>th size interval. In this paper we consider tests of the hypothesis of equality of two such underlying size distributions. In the first of the two applications to be considered, two samples were obtained from the same batch of material, one by each of two sampling methods, each sample having been sized using a set of <i>p</i> = 10 sieves. This pairwise collection and sizing of samples was repeated on <i>n</i> = 28 independent batches of material. Thus, we have 28 pairs</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05x03.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"> of observed sizings. The question is whether or not it is reasonable to assume that the distributions <b>q</b> underlying the <b>x</b> results are the same as the distributions <b>q</b>* that underlie the <b>y</b> results. 
We emphasize that the underlying size distributions typically vary from sample to sample because of the timewise variability in the supply of material being sampled. Thus we do not assume that the observation pairs (<b>x</b><sub>i</sub>, <b>y</b><sub>i</sub>), <i>i</i> = 1,...,28 are realizations of a fixed pair (<b>q</b>,<b>q</b>*) of underlying distributions. Accordingly, the analysis should be based on the 28 sets of differences</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05x04.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><a href="#f1">Figure 1</a> shows for this data set boxplots of the 28 differences within each of the 10 size classes. Comparing the medians with the zero line, the overall visual impression is that the two size distributions differ somewhat. The sampler seems to be producing less coarse material, and consequently more of the finer material, than the stopped belt method. On the other hand, the Hotelling <i>T<sup>2</sup></i> gives a non-significant result (<i>p</i>-value = 0.124), indicating, contrary to what we see in <a href="#f1">Figure 1</a>, that the observed differences are nothing out of the ordinary. However, we also see that the data abound in outliers (the + signs in <a href="#f1">Figure 1</a>) and that any real differences between the size distributions may have been masked by the large amount of variability in the data.</font></p>     <p><a name="f1"></a></p>     <p>&nbsp;</p>     ]]></body>
<body><![CDATA[<p align="center"><img src="/img/revistas/jsaimm/v112n7/05f01.jpg"></p>     <p>&nbsp;</p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The main objective    of the present paper is to demonstrate how outlier effects associated with the    use of the <i>T<sup>2</sup></i> statistic can be overcome. We will show that    replacing the differences (Equation &#91;4&#93;) by appropriate rank scores    takes care of the outlier problem without sacrificing much, if anything, in    the ability to detect true differences (statistical power). A secondary objective    of the paper is to show that the rank score test is, in fact, often more adept    than Hotelling's <i>T<sup>2</sup></i> at detecting real differences between    commonly encountered size distributions.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The paper is structured    as follows. We first review briefly the calculation and properties of the <i>T<sup>2</sup></i>    statistic in its application to sizing data. We then describe the rank score    version of the statistic, illustrate its calculation on a small set of artificial    data, and apply it to two real data sets. Following this, we use a Monte Carlo    simulation method to compare the abilities of the tests to detect substantive    differences between underlying size distributions. The results of the Monte    Carlo study suggest that the proposed rank test is highly competitive with the    Hotelling <i>T<sup>2</sup></i> test. 
Finally, we make some concluding remarks    and summarize our main results.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b>Hotelling's    <i>T</i><sup>2</sup> statistic</b></font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">Set </font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05x05.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">and form the data    matrix of differences</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s02.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"> and the (column)    vector of means</font></p>     ]]></body>
<body><![CDATA[<p align="center"><img src="/img/revistas/jsaimm/v112n7/05s03.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"> where <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><sub><i>k</i></sub> is the average of the <i>n</i> elements in row <i>k</i> of <b>D</b>. Lyman <i>et al.</i> (2010) proposed a procedure based on Hotelling's <i>T<sup>2</sup></i> statistic to test the statistical significance of the observed differences <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><sub><i>ki</i></sub> by deleting any one of the rows of <b>D</b>, say the last one <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><sub>p+1</sub>, and computing the <i>T<sup>2</sup></i> statistic on the remaining rows <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><sub>1</sub>,...,<img src="/img/revistas/jsaimm/v112n7/05s19.jpg"><sub>p</sub>. The <i>T<sup>2</sup></i> statistic is</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05x06.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">where</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05x07.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"> denotes the covariance matrix. The reason for eliminating one row of <b>D</b> from consideration is that the constraint (Equation &#91;2&#93;), which also applies to the <b>y</b>-data, together with Equation &#91;5&#93; implies that</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05x08.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"> so that the covariance matrix <b>C</b> is singular. 
This precludes calculation of <i>T<sup>2</sup></i> on the full data matrix <b>D</b>. It is important to note that the numerical value of <i>T<sup>2</sup></i> computed in the manner suggested by Lyman <i>et al.</i> does not depend upon which one of the <i>p</i> + 1 rows is eliminated from the data matrix. This fact is intuitively rather obvious because any <i>p</i> of the <i>p</i> + 1 rows contain the same information as does the full set of <i>p</i> + 1 rows: the missing row can be reconstructed exactly by applying the constraint (Equation &#91;8&#93;). Thus, the outcome of the statistical test is uniquely determined, no matter which one of the rows is eliminated from consideration.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The following equally important fact will also be used in this paper: for any data matrix, not necessarily constrained as in Equation &#91;2&#93;, and any row <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><sub><i>l</i></sub>, the numerical value of <i>T<sup>2</sup></i> computed on the <i>p</i> nonzero row differences <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><sub><i>k</i></sub> - <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><sub><i>l</i></sub>, <i>k</i> &#8800; <i>l</i>, does not depend upon <i>l</i>. For example, the numerical values of <i>T<sup>2</sup></i> computed on the vectors </font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s04.jpg"></p>     ]]></body>
<body><![CDATA[<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">will be exactly the same. This follows from the fact that the numerical value of <i>T<sup>2</sup></i> does not depend upon the ordering of rows in the data matrix <b>D</b>.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b><i>A rank test</i></b></font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">A version of the <i>T<sup>2</sup></i> statistic that is robust to outliers is obtained upon replacing the differences <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><i><sub>ki</sub></i> in row <i>k</i> of the data matrix by a rank score which is more or less immune to outlier effects. Towards this, notice that each <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><i><sub>ki</sub></i> can be expressed as the product of its absolute value and its sign (+1 or -1),</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05x09.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">In order to negate outlier effects one replaces |<img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><i><sub>ki</sub></i>| by its rank among the absolute values in row <i>k</i>,</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s05.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">Then, for example, the two largest absolute values in the row receive the ranks <i>n</i> - 1 and <i>n</i>, no matter how large they are compared to each other and to the other absolute values in the row. 
Appending the sign of <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><i><sub>ki</sub></i> to its rank gives the rank score</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05x10.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">Comparing Equations &#91;9&#93; and &#91;10&#93; we see that robustness is effected by the replacement of the numerical value of |<img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><i><sub>ki</sub></i>|, which may be an outlier relative to the other absolute values in the row, with its rank, which is not an outlier relative to the other ranks in the row. In this way, each row <b>d</b><sub><i>k</i></sub> = (<img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><sub><i>k</i>1</sub>,...,<img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><sub><i>kn</i></sub>) of the data matrix <b>D</b> is replaced by a row of rank scores <b>s</b><sub><i>k</i></sub> = (s<sub>k1</sub>,...,s<sub>kn</sub>) to form a new data matrix</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s07.jpg"></p>     ]]></body>
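<body><![CDATA[<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The construction of the rank scores for one row of differences can be sketched in Python as follows. This is an illustrative sketch only: the function name <i>signed_rank_scores</i> and the example row are hypothetical, and the use of average ranks for tied absolute values is an assumption, since ties are not discussed in the text.</font></p>
<pre>
```python
def signed_rank_scores(row):
    """Return sign(d_i) * rank(|d_i|) for each difference d_i in one row.

    Assumption (not from the paper): tied absolute values share the
    average of the ranks they occupy.
    """
    n = len(row)
    abs_vals = [abs(d) for d in row]
    order = sorted(range(n), key=lambda i: abs_vals[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the block order[i..j] of equal absolute values.
        j = i
        while j + 1 < n and abs_vals[order[j + 1]] == abs_vals[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the ranks i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    # Sign is +1, -1, or 0 for a zero difference.
    signs = [(d > 0) - (d < 0) for d in row]
    return [s * r for s, r in zip(signs, ranks)]


# Hypothetical row of five differences containing one gross outlier.
print(signed_rank_scores([0.02, -0.50, 0.01, -0.03, 0.04]))
# prints [2.0, -5.0, 1.0, -3.0, 4.0]
```
</pre>
<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">Replacing -0.50 by -50 in the example row leaves the scores unchanged, which is the robustness property described above.</font></p>]]></body>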
<body><![CDATA[<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The proposed rank statistic, denoted by <i>T</i><sup>2</sup><i><sub>W</sub></i>, is obtained by choosing any one of the rows, say <b>s</b><sub>p+1</sub>, as a 'reference row' and calculating <i>T</i><sup>2</sup> in Equation &#91;6&#93; on the reduced rank score matrix</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s08.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The subscript <i>W</i> in <i>T</i><sup>2</sup><i><sub>W</sub></i> serves to indicate that the rank scores are, in fact, those upon which the well-known Wilcoxon symmetry test is based. Notice that, in contrast to the <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><sub>ki</sub>, the rank scores generally are not subject to the constraint <img src="/img/revistas/jsaimm/v112n7/05s09.jpg" align="absmiddle">. Hence, it would be possible in principle to compute <i>T</i><sup>2</sup><i><sub>W</sub></i> on the full rank score matrix <b>S</b>. This is not advisable, though, because near-singularity of the corresponding covariance matrix is a frequent occurrence and gives rise to unnecessarily variable <i>T</i><sup>2</sup><i><sub>W</sub></i> values. Eliminating one of the rank score row vectors from the calculation is also not advisable.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">This is because, in the absence of the constraint <img src="/img/revistas/jsaimm/v112n7/05s09.jpg" align="absmiddle"> <i>s<sub>ki</sub></i> = 0, the value of <i>T</i><sup>2</sup><i><sub>W</sub></i> will then depend upon which score vector is eliminated. 
On the other hand, as was pointed out in the preceding section, basing the calculation on the reduced rank score matrix <img src="/img/revistas/jsaimm/v112n7/05s20.jpg" align="absmiddle"> ensures a unique value of <i>T</i><sup>2</sup><i><sub>W</sub></i>.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b><i>Numerical example</i></b></font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">We illustrate the calculation of the rank scores on the following small set of artificial data:</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s10.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">Here, <i>p</i> = 2 and <i>n</i> = 5, that is, we have five paired sets of observations on each of three size fractions. The matrix of differences is</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s12.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">From this we derive the matrix of signs</font></p>     ]]></body>
<body><![CDATA[<p align="center"><img src="/img/revistas/jsaimm/v112n7/05x11.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">the matrix of absolute    values</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s12.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">and the matrix    of ranks</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05x12.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The rank score    vectors are now found by multiplying corresponding elements of the matrices    in Equations &#91;11&#93; and &#91;12&#93;,</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s14.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">and the Hotelling    statistic is computed on the matrix</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s15.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">or on the matrix</font></p>     ]]></body>
<body><![CDATA[<p align="center"><img src="/img/revistas/jsaimm/v112n7/05s16.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">or on the matrix</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s17.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The value of <i>T</i><sup>2</sup><i><sub>W</sub></i> is 0.915 in all three cases.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">A Matlab (Mathworks Inc., 2007) program that does the required calculations for data sets of realistic sizes is available from either of the authors upon request.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b><i>Application to sizing data</i></b></font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">For the data set exhibited in <a href="#f1">Figure 1</a> the observed value of the <i>T</i><sup>2</sup><i><sub>W</sub></i> statistic is 2.980 with a <i>p</i>-value of 0.022 found from the <i>F</i> distribution with 9 and 19 degrees of freedom. (The <i>p</i>-value is the probability that the <i>T</i><sup>2</sup><i><sub>W</sub></i> statistic exceeds the value 2.98 given that it has the indicated <i>F</i> distribution.) This <i>p</i>-value of 0.022 is quite different from the <i>p</i>-value of 0.124 produced by the <i>T</i><sup>2</sup> statistic on these data. Thus, the rank test suggests strongly that the differences seen in <a href="#f1">Figure 1</a> are, in fact, real and not merely due to chance. Given this conclusion, an obvious question that arises is 'where do the differences mainly occur?' It was pointed out by Lyman <i>et al.</i> that some care should be exercised when attempting to answer this question.    
This is because a change in the mass fraction reported by any given size class    must of necessity be accompanied by reported changes in one or more of the remaining    size classes: the total of the reported fractions must be 1.00 under all circumstances.    A glance at <a href="#f1">Figure 1</a> suggests that the reported proportions    of coarser material (size fractions 7-10) have decreased while the proportions    of finer material (size fractions 1-6) have increased correspondingly. This    becomes clearer when we look at <a href="#f2">Figure 2</a>, derived from <a href="#f1">Figure    1</a>, which shows a plot of the medians of the 28 differences within each size    class. Notwithstanding that the differences are relatively small, the pattern    is clear.</font></p>     <p><a name="f2"></a></p>     <p>&nbsp;</p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05f02.jpg"></p>     ]]></body>
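<body><![CDATA[<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The way in which a statistic of the kind quoted above is obtained from a matrix of rank scores (or of raw differences) can be sketched in Python as follows. The sketch rests on assumptions that are not confirmed by the text: that Equation &#91;6&#93; has the standard one-sample form <i>n</i> times the quadratic form of the mean vector in the inverse covariance matrix, scaled by (<i>n</i> - <i>p</i>)/(<i>p</i>(<i>n</i> - 1)) so that it can be referred to the <i>F</i> distribution with <i>p</i> and <i>n</i> - <i>p</i> degrees of freedom, and that the covariance matrix uses the divisor <i>n</i> - 1. The function names <i>hotelling_f</i> and <i>reduced</i> and the small score matrix are hypothetical.</font></p>
<pre>
```python
def hotelling_f(rows):
    """F-scaled one-sample Hotelling statistic for a p x n matrix of rows.

    Assumed form: T2 = n * mean' * inv(C) * mean with C the sample
    covariance (divisor n - 1), returned as (n - p) / (p * (n - 1)) * T2.
    """
    p, n = len(rows), len(rows[0])
    means = [sum(r) / n for r in rows]
    cov = [[sum((rows[a][i] - means[a]) * (rows[b][i] - means[b])
                for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    # Solve cov * z = means by Gaussian elimination with partial pivoting
    # (a singular covariance matrix would cause a division by zero here).
    aug = [cov[a][:] + [means[a]] for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, p):
            f = aug[r][c] / aug[c][c]
            for k in range(c, p + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * p
    for r in reversed(range(p)):
        tail = sum(aug[r][k] * z[k] for k in range(r + 1, p))
        z[r] = (aug[r][p] - tail) / aug[r][r]
    t2 = n * sum(m * zi for m, zi in zip(means, z))
    return (n - p) / (p * (n - 1)) * t2


def reduced(rows, ref):
    """Subtract reference row `ref` from every other row, then drop it."""
    return [[a - b for a, b in zip(row, rows[ref])]
            for k, row in enumerate(rows) if k != ref]


# Hypothetical 3 x 5 rank score matrix (made up for illustration only).
scores = [[2, -5, 1, -3, 4], [1, 2, -3, 4, 5], [-1, 3, 2, -5, 4]]
for ref in range(3):
    print(ref, hotelling_f(reduced(scores, ref)))
```
</pre>
<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">Up to rounding error, the three printed values agree, which illustrates that the choice of reference row does not affect the result.</font></p>]]></body>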
<body><![CDATA[<p>&nbsp;</p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">Next, we turn to the data analyzed by Lyman <i>et al.</i> (2010). This involved a comparison of sizings from samples of iron ore obtained with two different cross-belt samplers, a Vezin-type belt-end sampler, and stopped belt sampling. A total of 14 pairs of sizings into 8 intervals from each procedure were available for analysis. Belt sampling was carried out by highly experienced staff and sample analysis (sizing) by a well-accredited laboratory that was closely supervised. The reproducibility of the sizing protocol was established prior to processing the test samples. <a href="#f3">Figure 3</a> (which is a reproduction of part of the top left-hand panel of Figure 3 of Lyman <i>et al.</i> (2010), except that the size fractions are now arranged left to right from finest to coarsest) shows boxplots of the 14 differences in each of the 8 size intervals for the belt-end sampler and the stopped belt samples. Student <i>t</i>-tests comparing each of the size fractions indicated a significant difference at only the finest size fraction. (We point out here that in Table 1 of Lyman <i>et al.</i> (2010) the numbers 0.160, 0.25, 2.68, and 0.64 in the last row of the belt-end block should be replaced by 0.887, 0.162, 14.6, and 5.48 respectively.) The Hotelling <i>T</i><sup>2</sup> also indicated an overall significant difference (<i>p</i>-value = 0.05). Since a bias from cross-stream (Vezin) sampling was somewhat unexpected, the question was put whether outliers (indicated by a + in <a href="#f3">Figure 3</a>) could have been the cause of the significance of the values of the <i>t</i> and <i>T</i><sup>2</sup> statistics. The answer seems to be that the observed differences are real: the rank test also gives a highly significant result for these data (<i>p</i>-value = 0.003). 
Some confirmation that the highly significant <i>T</i><sup>2</sup> and <i>T</i><sup>2</sup><i><sub>W</sub></i> values are at least in part due to the observed differences in the finest size fraction follows upon analysing the subcompositions (<i>x</i><sub>2i</sub>,...,<i>x</i><sub>8i</sub>) / (1 - <i>x</i><sub>1i</sub>) and (<i>y</i><sub>2i</sub>,...,<i>y</i><sub>8i</sub>) / (1 - <i>y</i><sub>1i</sub>) consisting of the size fractions in intervals 1 through 7 only. Then neither <i>T</i><sup>2</sup> nor <i>T</i><sup>2</sup><i><sub>W</sub></i> is significant (<i>p</i>-values of 0.54 and 0.31 respectively).</font></p>     <p><a name="t1"></a></p>     <p>&nbsp;</p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05t01.jpg"></p>     <p>&nbsp;</p>     <p><a name="f3"></a></p>     <p>&nbsp;</p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05f03.jpg"></p>     <p>&nbsp;</p>     ]]></body>
<body><![CDATA[<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b><i>Technical    note</i></b></font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The <i>F</i>-distribution    with <i>p</i> and <i>n-p</i> degrees of freedom attributed to the <i>T<sup>2</sup></i>    statistic and which is used to determine significance levels, requires a formal    assumption that the matrix of differences originates from an underlying multivariate    normal distribution. This assumption is hardly likely to be satisfied for sizing    data, particularly since the entries in the vectors <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><i><sub>k</sub></i>    must all lie between -1 and +1. Nonetheless, extensive simulation results, some    of which are reported in the next section, indicate that the <i>F</i> -distribution    can be safely used. This can be explained partially by noting that the <i>T<sup>2</sup></i>    statistic depends on the raw data only through the mean vector <img src="/img/revistas/jsaimm/v112n7/05s21.jpg" align="absmiddle">    and the covariance matrix C, both of which are averages computed from the differences;    see Equations &#91;5&#93; and &#91;7&#93;. Now, the central limit theorem guarantees    approximate normality of averages, especially for difference data such as <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><i>k</i>    - <img src="/img/revistas/jsaimm/v112n7/05s19.jpg" align="absmiddle"><i>l</i>    that are constrained to lie in a finite interval and that do not exhibit an    excessively skewed distribution. 
Thus, it is perhaps not so surprising that    the <i>F</i> -distribution is applicable to the data that we are dealing with,    even when only relatively small amounts of such data are available.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><b>Monte Carlo    simulations</b></font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">In this section    of the paper we present the results of some Monte Carlo simulations with a view    to (i) demonstrating the applicability of the <i>F</i> distribution when determining    significance levels and (ii) comparing the statistical power of the rank test    with that of the Hotelling test. The power of a statistical test (or a test    statistic) is defined as the probability that the test will succeed in identifying    a difference that is indeed present. A test with a power of 1 (or 100%) 'never    fails'.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">One approach towards    these goals is to generate artificial data sets based on models that attempt    to simulate the mechanisms that lead to an observed size distribution, such    as particle breakage and particle sampling error (Brown and Wohletz (1995);    Dacey and Krumbein (1979); Gy (1982); Lyman (1986)). However, since we have    a number of real data sets at our disposal, we prefer to base our simulations    directly on these.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">In order to judge    the applicability of the <i>F</i> distribution we require pairs of observed    sizings that are generated from the same underlying size distribution and which    differ only in respect of uncertainties added in the sampling and sieving operations.    
Given a data set consisting of <i>n</i> pairs of observed sizings (columns) (<b>x</b><sub>i</sub>, <b>y</b><sub>i</sub>), <i>i</i> = 1,...,<i>n</i>, data sets with the required structure can be generated by interchanging at random the roles of <b>x</b><sub>i</sub> and <b>y</b><sub>i</sub> in the data matrix. That is, we flip an unbiased coin and if it shows heads we replace the pair (<b>x</b><sub>i</sub>, <b>y</b><sub>i</sub>) by (<b>y</b><sub>i</sub>, <b>x</b><sub>i</sub>) in the data matrix, otherwise leaving it unchanged. This means that we are sampling at random from a large collection of 2<sup>n</sup> pairs of data sets of which any pair of columns is generated from identical underlying size distributions (the size distributions may vary from column pair to column pair). In the numerical example given earlier (<i>n</i> = 5), suppose five coin flips resulted in the outcome <i>H, T, T, H, T</i>.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">Then the corresponding randomized data matrix would be and the Hotelling and rank statistics would be computed on this data matrix in the manner indicated in the numerical example given earlier. This generation of randomized data sets is repeated a large number of times, say <i>N</i> times. We then compute the observed fractions of these <i>N</i> times in which each of the two statistics exceeded the upper 100(1 - &#945;)% percentile <i>c</i><sub>&#945;</sub> of the <i>F</i> distribution with <i>p</i> and <i>n</i> - <i>p</i> degrees of freedom. If the <i>F</i> distribution is applicable, the fractions should be close to &#945;. <a href="#t1">Table I</a> shows the results obtained when applying this method to the data in <a href="#f1">Figure 1</a> (<i>n</i> = 28) and <a href="#f3">Figure 3</a> (<i>n</i> = 14) using in each instance <i>N</i> = 50 000 trials and &#945; = 0.10, 0.05, and 0.01. 
Clearly, the empirical exceedance probabilities of the rank test are quite close to the nominal values in all cases, while the Hotelling statistic seems to undershoot the mark slightly.</font></p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05s18.jpg"></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">To gauge the effect of outliers on the tests we again use the data from <a href="#f1">Figures 1</a> and <a href="#f3">3</a>, but now in a different manner. In each size interval we replace those observed size fraction differences <i>d<sub>ki</sub></i> that were identified as outliers by pseudo observations constructed as follows. Denote by <i>iqr</i> the interquartile range of the data in row <i>k</i> and denote by <i>m</i> their median. A Gaussian distribution of differences without outliers in row <i>k</i> would have a mean equal to <i>m</i> and a standard deviation equal to 0.69 &times; <i>iqr</i>. Notice that 0.69 is the ratio between the 68th (median + one standard deviation) and 75th percentiles of a Gaussian distribution. Each outlier in row <i>k</i> is now replaced by an observation from a Gaussian distribution which has mean <i>m</i> and standard deviation &#1060; &times; 0.69 &times; <i>iqr</i>. Here, &#1060; is a variance inflation factor. The data set thus reconstructed has exactly the same configuration of median differences as the original. When &#1060; = 1, all outliers have essentially been removed. If &#1060; is increased, more outliers are artificially introduced.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">We illustrate the procedure using the data in the eighth size interval in <a href="#f1">Figure 1</a>. There are five outliers, namely -0.046, -0.042, 0.044, 0.056, and 0.068. The median and <i>iqr</i> of the differences in row 8 are -0.002 and 0.019 respectively. 
A Gaussian distribution with this median and <i>iqr</i> would have a mean of -0.002 and a standard deviation of 0.69 &times; 0.019 = 0.013. Accordingly, with &#1060; = 1 we generate five such Gaussian random numbers and insert them into row 8 in place of the original five outliers. If &#1060; = 2, we replace the outliers by five random numbers from a Gaussian distribution with mean -0.002 and standard deviation 2 &times; 0.013 = 0.026, and so on. <a href="#f4">Figure 4</a> shows boxplots of the original data and those that resulted from one application of the above procedure at &#1060; = 1 (the pseudo observations were -0.022, -0.011, 0.010, 0.017, 0.018) and &#1060; = 2 (the pseudo observations were -0.050, -0.022, 0.008, 0.031, 0.057). We see that there are now no outliers at &#1060; = 1 and just two at &#1060; = 2.</font></p>     ]]></body>
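<body><![CDATA[<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The outlier replacement step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the simulations reported here: the function names and the <i>outlier_positions</i> argument are hypothetical, the quartiles follow the convention of the Python standard library (other conventions give slightly different <i>iqr</i> values), and the seed is arbitrary. The factor 0.69 is taken from the text as given and is kept as a parameter, since other Gaussian conversion factors are in common use (for instance 1/1.349, or about 0.741).</font></p>
<pre>
```python
import random
import statistics

GAUSS_FACTOR = 0.69  # constant quoted in the text; exposed so it can be changed


def pseudo_sd(iqr, phi, factor=GAUSS_FACTOR):
    """Standard deviation of the replacement distribution: phi * factor * iqr."""
    return phi * factor * iqr


def pseudo_observations(median, iqr, phi, count, rng):
    """Draw `count` Gaussian pseudo observations with mean `median`."""
    sd = pseudo_sd(iqr, phi)
    return [rng.gauss(median, sd) for _ in range(count)]


def replace_outliers(row, outlier_positions, phi, rng):
    """Return a copy of one row of differences with the flagged entries replaced.

    `outlier_positions` (indices flagged beforehand, e.g. from a boxplot) is a
    hypothetical input; the rule used to flag outliers is not implemented here.
    """
    q1, _, q3 = statistics.quantiles(row, n=4)  # one of several quartile conventions
    m = statistics.median(row)
    draws = pseudo_observations(m, q3 - q1, phi, len(outlier_positions), rng)
    new_row = list(row)
    for pos, value in zip(outlier_positions, draws):
        new_row[pos] = value
    return new_row


# Worked example from the text: median -0.002, iqr 0.019, five outliers.
rng = random.Random(2012)  # arbitrary seed, for reproducibility only
sd_phi1 = pseudo_sd(0.019, phi=1)  # about 0.013
sd_phi2 = pseudo_sd(0.019, phi=2)  # about 0.026
draws_phi1 = pseudo_observations(-0.002, 0.019, 1, 5, rng)
draws_phi2 = pseudo_observations(-0.002, 0.019, 2, 5, rng)
```
</pre>
<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The random draws will differ from the pseudo observations quoted above, which came from a different random number generator. In a full simulation, <i>replace_outliers</i> would be applied to every row of differences and the columns re-centred before the test statistics are recomputed.</font></p>     ]]></body>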
<body><![CDATA[<p><a name="f4"></a></p>     <p>&nbsp;</p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05f04.jpg"></p>     <p>&nbsp;</p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">As <i>&#1060;</i> increases, so does the number of artificially generated outliers. Of course, after adding the perturbed values, we subtract from each column of differences its average in order that the necessary constraint <img src="/img/revistas/jsaimm/v112n7/05s09.jpg" align="absmiddle"> be satisfied for the new data. This process of replacing the original outliers is repeated <i>N</i> times for each of a range of values of &#1060;. Then we find the fractions of the <i>N</i> times in which each of the statistics exceeds <i>c</i><sub>&#945;</sub>. These fractions are our estimates of the statistical powers of the tests. (Needless to say, this will be a fruitless exercise if neither of the tests gave a significant result on the original data set.) <a href="#t2">Table II</a> shows the estimated powers of the tests when this methodology is applied to the data from <a href="#f1">Figures 1</a> and <a href="#f3">3</a> with &#945; = 0.05 and <i>N</i> = 10 000. <a href="#f5">Figure 5</a> shows for each value of &#1060; in the first line of <a href="#t2">Table II</a> (data from <a href="#f1">Figure 1</a>) a boxplot of one randomly generated set of pseudo differences. <a href="#f6">Figure 6</a> shows the same in respect of the &#1060; values in the fourth line of the table (data from <a href="#f3">Figure 3</a>). A general conclusion from these simulation results is that the Hotelling and rank statistics have similar powers when the numbers and sizes of outliers are not extensive, but that the Hotelling statistic loses power much more rapidly than the rank statistic as the numbers and sizes of outliers increase. 
Overall, it would seem that the rank test is    to be preferred because of its greater power robustness.</font></p>     <p><a name="T2"></a></p>     <p>&nbsp;</p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05t02.jpg"></p>     <p>&nbsp;</p>     <p><a name="f5"></a></p>     ]]></body>
<body><![CDATA[<p>&nbsp;</p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05f05.jpg"></p>     <p>&nbsp;</p>     <p><a name="f6"></a></p>     <p>&nbsp;</p>     <p align="center"><img src="/img/revistas/jsaimm/v112n7/05f06.jpg"></p>     <p>&nbsp;</p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="3"><b>Concluding remarks</b></font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The data used in this study to illustrate the virtues of the ranked score version of the Hotelling procedure by comparison with those of the conventional Hotelling test are real, and reflect the realities of bias testing using size distribution data. This contribution to the tools for making bias tests is motivated by the need to ensure that such tests are evaluated using the most powerful statistical tools.</font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">However, it is critical in bias testing that the magnitude of the detectable bias be economically relevant. That is, the testing protocol, which includes the sample masses collected, the precision of the particle sizing process, the number of paired samples collected, and the statistical analysis procedure, must be adequate to detect a difference between the test samples and the reference samples that is economically significant. Too often, a bias test is carried out using the guidelines provided by a standard, but the accuracy of the work carried out is not specifically assessed and so the extent of the detectable bias is not quantified. A bias test which is not qualified by the extent of the detectable bias is no test at all, and may give comfort where none is to be had. Bias tests must be designed to detect a certain level of bias that has been based on an economic evaluation of the critical level of bias to be detected for all analytes of interest. 
Consideration of the size-by-size    analyses of the material sampled for all economically important analytes, coupled    with a critical consideration of the contractual tolerances on analytes, can    be used to arrive at the critical level of bias to be detected in a test.</font></p>     ]]></body>
<body><![CDATA[<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">The authors continue to work towards a practical methodology for accurate assessment of the detectable bias over the full sizing distribution.</font></p>     <p>&nbsp;</p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="3"><b>Summary</b></font></p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">In this paper we propose a robust procedure based on ranks to test for sizing bias in a mechanical sampling system. The data consist of observed mass size distributions and our results are applicable to such data in general. The proposed rank test has been applied to two data sets and its advantages over Hotelling's <i>T<sup>2</sup></i> statistic have been illustrated in a Monte Carlo simulation study using a realistic method of injecting outliers into the data set. In particular, and in contrast to the <i>T<sup>2</sup></i> statistic, the power of the rank test is not unduly affected by the presence of outliers. Our simulation results indicate that use of the <i>F</i>-distribution to compute <i>p</i>-values is permissible, even if relatively few sizing pairs are available. The testing carried out in this paper clears the way for the ranked Hotelling test to become a reliable standard tool in size distribution comparison, which is itself one of the best tools for bias testing of sampling systems.</font></p>     <p>&nbsp;</p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="3"><b>References</b></font></p>     <!-- ref --><p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">1. BROWN, W.K. and WOHLETZ, K.H. 1995. Derivation of the Weibull distribution based on physical principles and its connection to the Rosin-Rammler and lognormal distributions. <i>Journal of Applied Physics,</i> vol. 78, no. 4. pp. 
2758-2763.</font></p><!-- end-ref --><!-- ref --><p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">2. DACEY, M.F. and KRUMBEIN, W.C. 1979. Models of breakage and selection for particle size distributions. <i>Mathematical Geology,</i> vol. 11. pp. 193-222.</font></p><!-- end-ref --><!-- ref --><p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">3. GY, P.M. 1982. <i>Sampling of Particulate Materials-Theory and Practice.</i> Elsevier, New York.</font></p><!-- end-ref --><!-- ref --><p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">4. LYMAN, G.J., NEL, M., LOMBARD, F., STEINHAUS, R., and BARTLETT, H. 2010. Bias testing of cross-belt samplers. <i>Journal of the Southern African Institute of Mining and Metallurgy,</i> vol. 110, no. 6. pp. 289-298.</font></p><!-- end-ref --><!-- ref --><p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">5. LYMAN, G.J. 1986. Application of Gy's sampling theory to coal: A simplified explanation and illustration of some basic aspects. <i>International Journal of Mineral Processing,</i> vol. 17. pp. 1-22.</font></p><!-- end-ref --><!-- ref --><p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">6. MATHWORKS INC. 2007. Matlab, Version 7.5 (release 2007b). </font><font  size="2">&#9830;</font></p><!-- end-ref -->     ]]></body>
<body><![CDATA[<p>&nbsp;</p>     <p>&nbsp;</p>     <p><font face="Verdana, Arial, Helvetica, sans-serif" size="2">Paper received    Jul. 2011    <br>   Revised paper received Nov. 2011</font></p>      ]]></body>
<REFERENCES></REFERENCES><back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[BROWN]]></surname>
<given-names><![CDATA[W.K.]]></given-names>
</name>
<name>
<surname><![CDATA[WOHLETZ]]></surname>
<given-names><![CDATA[K.H.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Derivation of the Weibull distribution based on physical principles and its connection to the Rosin-Rammler and lognormal distributions.]]></article-title>
<source><![CDATA[Journal of Applied Physics]]></source>
<year>1995</year>
<volume>78</volume>
<numero>4</numero>
<issue>4</issue>
<page-range>2758-2763</page-range></nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[DACEY]]></surname>
<given-names><![CDATA[M.F.]]></given-names>
</name>
<name>
<surname><![CDATA[KRUMBEIN]]></surname>
<given-names><![CDATA[W.C.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Models of breakage and selection for particle size distributions.]]></article-title>
<source><![CDATA[Mathematical Geology]]></source>
<year>1979</year>
<volume>11</volume>
<page-range>193-222</page-range></nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gy]]></surname>
<given-names><![CDATA[P.M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Sampling of Particulate Materials-Theory and Practice]]></source>
<year>1982</year>
<publisher-loc><![CDATA[New York]]></publisher-loc>
<publisher-name><![CDATA[Elsevier]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[LYMAN]]></surname>
<given-names><![CDATA[G.J.]]></given-names>
</name>
<name>
<surname><![CDATA[NEL]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[LOMBARD]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[STEINHAUS]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[BARTLETT]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Bias testing of cross-belt samplers.]]></article-title>
<source><![CDATA[Journal of the Southern African Institute of Mining and Metallurgy]]></source>
<year>2010</year>
<volume>110</volume>
<numero>6</numero>
<issue>6</issue>
<page-range>289-298</page-range></nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Lyman]]></surname>
<given-names><![CDATA[G.J.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Application of Gy's sampling theory to coal: A simplified explanation and illustration of some basic aspects.]]></article-title>
<source><![CDATA[International Journal of Mineral Processing]]></source>
<year>1986</year>
<volume>17</volume>
<page-range>1-22</page-range>
</nlm-citation>
</ref>
</ref-list>
</back>
</article>
