Mission accomplished then? Hardly. Expanding the definition of “validation”
above even just a little more [6], we realize that, for a well-specified image analysis
problem (see above), one must:
(i) procure clinically relevant, well-characterized data sets (of sufficient size);
(ii) procure a sufficient quantity of annotations from well-characterized experts;
(iii) compute automatic measurements;
(iv) statistically compare the experts' annotations with the automatic results (a
minimal sketch of this step follows the list).
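As a concrete illustration of step (iv), the snippet below compares one expert's diabetic retinopathy grades with an algorithm's output using a quadratically weighted Cohen's kappa and a confusion matrix. This is a minimal sketch, assuming one integer grade per image; the data, the choice of statistic, and the scikit-learn dependency are illustrative assumptions, not prescriptions from this chapter.

```python
# Minimal sketch of step (iv): statistically comparing expert grades with
# automatic grades. Assumes one integer DR grade (0-4) per image from a
# single expert and from the algorithm; all values are illustrative.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical per-image grades (0 = no DR ... 4 = proliferative DR).
expert_grades = np.array([0, 1, 2, 2, 3, 0, 4, 1, 2, 0])
algorithm_grades = np.array([0, 1, 2, 3, 3, 0, 4, 2, 2, 1])

# Quadratically weighted kappa penalizes large grade disagreements more,
# a common choice for ordinal scales such as DR severity.
kappa = cohen_kappa_score(expert_grades, algorithm_grades,
                          weights="quadratic")
print(f"Quadratically weighted kappa: {kappa:.3f}")

# Confusion matrix: rows = expert grade, columns = algorithm grade.
print(confusion_matrix(expert_grades, algorithm_grades))
```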
Even this limited expansion exposes some nontrivial questions. For instance,
when exactly is a data set clinically relevant? What do we need to know to declare
experts and data sets well characterized? How do we reconcile different annotations
for the same images? What should we do, if anything, when different annotators
disagree (the usual situation) before we compare their annotations with automatic
results? What does sufficient quantity mean in practice?
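The disagreement question in particular admits many answers. The following is a minimal sketch of one common (but by no means the only) reconciliation strategy, majority voting with a fallback to adjudication; the three-grader setup, image names, and grades are invented for illustration.

```python
# Minimal sketch of reconciling disagreeing annotators by majority vote.
# Ties are flagged for adjudication (e.g., by a senior grader) rather
# than resolved automatically.
from collections import Counter

def consensus_grade(grades):
    """Return the strict-majority grade, or None if no majority exists."""
    grade, votes = Counter(grades).most_common(1)[0]
    return grade if votes > len(grades) / 2 else None

# Per-image grades from three hypothetical graders.
per_image_grades = {
    "img_001": [2, 2, 3],   # majority -> consensus grade 2
    "img_002": [1, 2, 3],   # no majority -> adjudicate
}

for image, grades in per_image_grades.items():
    grade = consensus_grade(grades)
    if grade is None:
        print(f"{image}: no majority in {grades}, send to adjudication")
    else:
        print(f"{image}: consensus grade {grade}")
```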
If we think a little more broadly, further challenges appear. For example, the
answers to the questions above change depending on whether one is considering a
proof-of-concept validation of a novel algorithm, say one suitable for publication
in a research journal, or the actual translation of the technology into healthcare:
the latter requires, among other things, much larger patient cohorts selected more
carefully from a clinical point of view, replication in multiple, independent
cohorts, and conformity with the rules of regulatory bodies like the FDA in the
United States or the EMA in Europe.
This chapter opens with a concise discussion of the issues making validation a
serious challenge (Section 2). It then reviews tools and techniques that we regard as
good practice, including data selection and evaluation criteria (Section 3). Further
discussion is devoted to the important issue of designing annotation protocols for
annotators (Section 4). The chapter closes with a summary and ideas for spreading
good practice internationally (Section 5).
2 Challenges
The gross national product measures everything, except what makes life
worthwhile.
Robert Kennedy
2.1 Annotations are expensive
Validating and training contemporary computational systems like deep learning
systems requires larger and larger volumes of annotations [7–9], but annotating images
is time consuming, hence expensive in terms of time and money. The time of clinical
practitioners is normally at a premium. The cost of an annotation task depends on
what and how much must be annotated: for example, assigning a grade to a fundus
image for diabetic retinopathy is quicker than tracing blood vessels with a software
tool in the same image.
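As a rough illustration of how these costs scale, consider the back-of-envelope arithmetic below; the per-image times, cohort size, and hourly rate are invented for the sake of the example, not figures from this chapter.

```python
# Back-of-envelope annotation cost, with invented illustrative numbers:
# grading a fundus image is assumed to take ~1 minute, tracing its
# vessels ~30 minutes, and clinician time is costed at 100/hour.
N_IMAGES = 1_000
HOURLY_RATE = 100.0          # currency units per clinician-hour (assumed)
MINUTES_PER_GRADE = 1.0      # assumed time to assign a DR grade
MINUTES_PER_TRACING = 30.0   # assumed time to trace vessels with a tool

for task, minutes in [("DR grading", MINUTES_PER_GRADE),
                      ("vessel tracing", MINUTES_PER_TRACING)]:
    hours = N_IMAGES * minutes / 60.0
    print(f"{task}: {hours:.0f} clinician-hours, "
          f"~{hours * HOURLY_RATE:,.0f} currency units")
```

Even with these invented numbers, the two tasks differ by a factor of thirty in clinician time over the same image set, which is why the choice of what to annotate dominates the budget.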