3.2.4 Bland-Altman graphs
Bland-Altman graphs [20, 21] are a widely used way in the medical literature to visualize the agreement between two sets of paired numerical measurements, for instance the true and estimated values of, say, the diameter of the optic disc, or systolic blood pressure and the fractal dimension of the retinal vasculature [22]. They plot the differences $d_k$, $k = 1, \ldots, N$, of the $N$ paired values against the pairs' means; indeed, you can think of them as scattergrams of the pairs' differences against the pairs' means. A Bland-Altman graph also plots two horizontal lines defining limits of agreement, e.g., at $d = d_m \pm \sigma$, where $d_m$ and $\sigma$ are, respectively, the mean and standard deviation of the differences, or at the limits of a confidence interval (e.g., 95%). Hence, if the measurements in all pairs are very close to each other, all differences are very small, and the points lie close to the horizontal line at $d = 0$. Any bias is immediately visualized by plotting the line $d = d_m$.
The motivation behind Bland-Altman graphs, introduced by the authors in 1986 [21], was that the widely used correlation coefficient does not measure the agreement of two sets of paired measures, but only their degree of linear dependence (a particular relation
between the variables). The authors give a cogent example of the difference between
agreement and correlation: the correlation between a set of caliper measurements and
their respective halves is 1 (perfect linear dependence), but the two sets do not agree.
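The point is easy to verify numerically; the short sketch below uses hypothetical caliper readings (the values are made up for illustration) to show a correlation of 1 alongside a large systematic difference.

```python
import numpy as np

# Hypothetical caliper readings (mm) and their halves: perfectly correlated,
# yet clearly not in agreement.
readings = np.array([10.2, 11.5, 9.8, 12.1, 10.7])
halves = readings / 2.0

r = np.corrcoef(readings, halves)[0, 1]          # Pearson correlation
mean_diff = np.mean(readings - halves)           # systematic difference (bias)

print(f"correlation:     {r:.3f}")               # 1.000 (perfect linear dependence)
print(f"mean difference: {mean_diff:.2f} mm")    # ~5.43 mm: the sets do not agree
```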
An example is given in Fig. 1 [23]. The agreement zone is delimited by the two dashed lines at the 95% limits of agreement (LOA), defined as the two lines $d_m \pm 1.96\sigma$. The
plot shows reasonable agreement, with only one point outside the LOA zone, and
very few close to its borders. What level of agreement is acceptable depends on the
specific application. Notice the negative bias in the graph (−0.58).
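As a minimal sketch of how such a graph can be produced, the Python function below (bland_altman_plot, a hypothetical helper written for this illustration) assumes two arrays of paired measurements, e.g., estimated and true optic disc diameters, and draws the bias line and the 95% limits of agreement.

```python
import numpy as np
import matplotlib.pyplot as plt

def bland_altman_plot(measured, reference, loa_factor=1.96):
    """Plot paired differences against paired means, with bias and 95% LOA lines."""
    measured = np.asarray(measured, dtype=float)
    reference = np.asarray(reference, dtype=float)

    means = (measured + reference) / 2.0   # x-axis: mean of each pair
    diffs = measured - reference           # y-axis: difference of each pair
    d_m = diffs.mean()                     # bias (mean difference)
    sigma = diffs.std(ddof=1)              # standard deviation of the differences

    fig, ax = plt.subplots()
    ax.scatter(means, diffs, s=20)
    ax.axhline(0.0, color="gray", linewidth=0.8)                   # perfect agreement
    ax.axhline(d_m, color="red", linestyle="--",
               label=f"bias = {d_m:.2f}")
    ax.axhline(d_m + loa_factor * sigma, color="black", linestyle=":",
               label="95% limits of agreement")
    ax.axhline(d_m - loa_factor * sigma, color="black", linestyle=":")
    ax.set_xlabel("Mean of the two measurements")
    ax.set_ylabel("Difference between measurements")
    ax.legend()
    return fig, ax
```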
3.2.5 Cohen’s kappa and related measures
Cohen’s kappa [24] estimates the agreement between two sets of paired categorical
decisions, accounting for the possibility that the agreement may occur by chance. Its
definition is
$$K = \frac{p_o - p_e}{1 - p_e},$$
where $p_o$ is the observed agreement and $p_e$ the probability of random agreement, estimated from the data (contingency tables). There are no definitive guidelines on what values constitute good or bad agreement; indicatively, values of $K$ above ~0.7 indicate very good to excellent agreement, values between ~0.4 and ~0.7 good agreement, and values below ~0.4 poor agreement. Such thresholds must be used with care.
If there are more than two raters, Fleiss's kappa is used [25]. The weighted kappa [26] allows different degrees of disagreement to be weighted differently.
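As an illustrative sketch of the computation above, the helper below (cohens_kappa, a hypothetical name) builds the contingency table for two raters and applies the formula; scikit-learn's cohen_kappa_score computes the same quantity and also offers a weighted variant.

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two sets of paired categorical decisions (illustrative helper)."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    categories = np.union1d(labels_a, labels_b)
    n = len(labels_a)

    # Contingency table: rows index rater A's categories, columns rater B's.
    table = np.array([[np.sum((labels_a == ca) & (labels_b == cb))
                       for cb in categories]
                      for ca in categories], dtype=float)

    p_o = np.trace(table) / n                                    # observed agreement
    p_e = np.sum(table.sum(axis=1) * table.sum(axis=0)) / n**2   # chance agreement from marginals
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical binary decisions by two graders on the same 10 images.
grader_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
grader_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(round(cohens_kappa(grader_1, grader_2), 3))  # 0.583
```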
3.2.6 Error histograms
It is often useful to visualize error histograms for each variable measured, to get a sense of the underlying error distribution. To create a meaningful histogram, attention must be given to two factors: eliminating outliers and choosing an appropriate number of bins.
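As a sketch of these two steps, the snippet below rejects outliers with a simple 3σ rule and lets NumPy's automatic binning choose the number of bins; both choices are assumptions made for the example, not prescriptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_error_histogram(errors, outlier_sigmas=3.0):
    """Histogram of per-sample errors with a simple (illustrative) outlier rejection rule."""
    errors = np.asarray(errors, dtype=float)
    mu, sigma = errors.mean(), errors.std(ddof=1)

    # Keep only errors within `outlier_sigmas` standard deviations of the mean.
    kept = errors[np.abs(errors - mu) <= outlier_sigmas * sigma]

    fig, ax = plt.subplots()
    ax.hist(kept, bins="auto", edgecolor="black")  # bin count chosen by NumPy's heuristics
    ax.set_xlabel("Error")
    ax.set_ylabel("Count")
    return fig, ax
```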