Page 169 - Computational Retinal Image Analysis

164    CHAPTER 9  Validation




                         3.2.4   Bland-Altman graphs
                         Bland-Altman graphs [20, 21] are widely used in the medical literature to visualize
                         the agreement between two sets of paired numerical measurements, for instance the
                         true and estimated values of, say, the diameter of the optic disc, or systolic
                         blood pressure and fractal dimension of the retinal vasculature [22]. They plot the
                         differences $d_k$, $k = 1, \ldots, N$, of $N$ paired values against the means of the
                         pairs, so you can think of them as scattergrams of the pairs’ differences against the
                         pairs’ means. A Bland-Altman graph also plots two horizontal lines defining limits of
                         agreement, e.g., at $d = d_m \pm \sigma$, where $d_m$ and $\sigma$ are, respectively,
                         the mean and standard deviation of the differences, or at the bounds of a confidence
                         interval (e.g., 95%). If the measurements in all pairs are very close to each other,
                         all differences are very small, and the points lie close to the horizontal line
                         $d = 0$. A bias is immediately visualized by plotting the line $d = d_m$.
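
The quantities described above can be computed in a few lines. The following is a minimal Python sketch, not the authors' code: the function name `bland_altman`, the parameter `z`, and the sample values are hypothetical, and the use of the sample standard deviation (denominator N − 1) is an assumption.

```python
from statistics import mean, stdev


def bland_altman(first, second, z=1.96):
    """Summarize a Bland-Altman plot for two paired measurement sequences.

    Assumption: z = 1.96 gives the usual 95% limits when the differences
    are roughly normal; z = 1 gives the lines d_m - sigma and d_m + sigma.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long sequences with at least two pairs")
    diffs = [a - b for a, b in zip(first, second)]        # d_k, vertical axis
    means = [(a + b) / 2 for a, b in zip(first, second)]  # pair means, horizontal axis
    bias = mean(diffs)    # d_m, the mean difference
    sigma = stdev(diffs)  # sample standard deviation of the differences
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": bias - z * sigma,
        "upper": bias + z * sigma,
    }


# Hypothetical paired values, e.g. estimated vs. reference diameters.
estimated = [10.0, 12.0, 14.0, 16.0]
reference = [11.0, 12.0, 15.0, 18.0]
result = bland_altman(estimated, reference)
print(f"bias = {result['bias']:.2f}")
print(f"limits of agreement = [{result['lower']:.2f}, {result['upper']:.2f}]")
```

Plotting `result["diffs"]` against `result["means"]`, with horizontal lines at `bias`, `lower` and `upper`, would give the graph itself.
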
                            The motivation behind Bland-Altman graphs, introduced by the authors in 1986
                         [21], was that the much-used correlation does not indicate the agreement of two
                         sets of paired measures, but their degree of linear dependence (a particular relation
                         between the variables). The authors give a cogent example of the difference between
                         agreement and correlation: the correlation between a set of caliper measurements and
                         their respective halves is 1 (perfect linear dependence), but the two sets do not agree.
                         An example is given in Fig. 1 [23]. The agreement zone is delimited by the two dashed
                         lines at the 95% limits of agreement (LOA), defined as $d_m \pm 1.96\sigma$. The
                         plot shows reasonable agreement, with only one point outside the LOA zone, and
                         very few close to its borders. What level of agreement is acceptable depends on the
                         specific application. Notice the negative bias in the graph (−0.58).
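
The caliper example can be checked numerically. The sketch below uses made-up caliper readings and a hand-written `pearson` helper (both hypothetical) to show that a set of measurements and their halves have correlation 1 while their mean difference is far from zero.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical caliper readings and their halves.
caliper = [2.0, 4.0, 6.0, 8.0, 10.0]
halves = [v / 2 for v in caliper]

r = pearson(caliper, halves)                          # linear dependence
bias = mean(a - b for a, b in zip(caliper, halves))   # mean difference

print(f"correlation = {r:.3f}, mean difference = {bias:.3f}")
```

The correlation is perfect because one set is a linear function of the other, yet every pair differs by half the reading, so the two sets do not agree.
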

                         3.2.5   Cohen’s kappa and related measures
                         Cohen’s kappa [24] estimates the agreement between two sets of paired categorical
                         decisions, accounting for the possibility that the agreement may occur by chance. Its
                         definition is
                                                          p −  p
                                                      K =  0  e  ,
                                                          1 −  p e
                         where $p_o$ is the observed agreement and $p_e$ the probability of random agreement,
                         estimated from the data (contingency tables). There are no ultimate guidelines on
                         what values constitute good or bad agreement; indicatively, values of K above ~0.7
                         indicate very good to excellent agreement, between ~0.4 and ~0.7 good agreement,
                         and below ~0.4 poor agreement. Such definitions must be used with care.
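
As an illustration of the definition, the following Python sketch computes kappa from two sequences of categorical decisions. The function name `cohen_kappa` and the grader labels are hypothetical. The chance agreement is estimated from the marginal label frequencies of each set, which is the standard contingency-table estimate and is assumed here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical decisions."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of pairs with identical decisions.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal frequencies, summed over categories.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary decisions (1 = disease present) from two graders.
grader_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
grader_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(grader_1, grader_2)
print(f"kappa = {kappa:.3f}")
```

Here the graders agree on 8 of 10 cases (observed agreement 0.8), but chance agreement is 0.52, so kappa is about 0.58, noticeably lower than the raw agreement.
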
                            If there are more than two sets, Fleiss’s kappa is used [25]. The weighted kappa
                         [26] allows disagreements to be weighted differently, for example by their severity.

                         3.2.6   Error histograms
                         It is often useful to visualize error histograms for each variable measured, to get a
                         feeling for the underlying error distribution. To create a meaningful histogram,
                         attention must be given to two factors: eliminating outliers and choosing an
                         appropriate number of bins.
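
One possible way to handle both factors is sketched below in Python. The text does not prescribe specific rules, so the choices here are assumptions: outliers are dropped with the common 1.5 × IQR fence, and the number of bins follows the Freedman-Diaconis rule. The function name `error_histogram` and the error values are hypothetical.

```python
from math import ceil
from statistics import quantiles


def error_histogram(errors, fence=1.5):
    """Drop outliers with an IQR fence, then bin using the Freedman-Diaconis rule."""
    if len(errors) < 4:
        raise ValueError("need at least four error values")
    # Step 1 (assumed rule): keep values within fence * IQR of the quartiles.
    q1, _, q3 = quantiles(errors, n=4)
    iqr = q3 - q1
    kept = [e for e in errors if q1 - fence * iqr <= e <= q3 + fence * iqr]
    if len(kept) < 2:
        raise ValueError("too few values left after outlier removal")
    # Step 2 (assumed rule): bin width = 2 * IQR / n^(1/3) on the kept values.
    k1, _, k3 = quantiles(kept, n=4)
    width = 2 * (k3 - k1) / len(kept) ** (1 / 3)
    lo, hi = min(kept), max(kept)
    n_bins = max(1, ceil((hi - lo) / width)) if width > 0 else 1
    # Step 3: count values per bin; the maximum falls in the last bin.
    step = (hi - lo) / n_bins
    counts = [0] * n_bins
    for e in kept:
        idx = min(int((e - lo) / step), n_bins - 1) if step > 0 else 0
        counts[idx] += 1
    return kept, counts


# Hypothetical measurement errors with one gross outlier (9.0).
errors = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 9.0]
kept, counts = error_histogram(errors)
print(f"{len(errors) - len(kept)} outlier(s) removed, bin counts = {counts}")
```

Without the outlier-removal step, the single value 9.0 would stretch the histogram range and squeeze all other errors into one or two bins, which is why the two factors are best handled together.
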