Page 171 - Computational Retinal Image Analysis
166 CHAPTER 9 Validation
3.3 Validation on outcome: Focus on the clinical task
Validation on outcome aims to validate an algorithm or software tool within the
system of which it is a part. Consider, for instance, an algorithm detecting
microaneurysms in retinal fundus images, meant as a component of an automatic
refer/do not refer system [33]. Direct validation requires annotations of individual
microaneurysms. Validation on outcome requires only the referral decision. All other
conditions being equal during testing, the microaneurysm detection module is
validated successfully when the automatic referral decisions achieve the desired
accuracy.
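The outcome-level check described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular system: the function names `referral_metrics` and `outcome_validated`, the boolean encoding (True = refer), and the threshold parameter `min_accuracy` are all assumptions made for the example. Sensitivity and specificity are reported alongside accuracy as an addition of ours; the text itself speaks only of the accuracy desired.

```python
from typing import Dict, Sequence


def referral_metrics(predicted: Sequence[bool],
                     reference: Sequence[bool]) -> Dict[str, float]:
    """Compare automatic refer/do-not-refer decisions with reference decisions.

    `predicted` holds the system's decisions, `reference` the decisions taken
    in normal clinical practice (True = refer). Both names are hypothetical.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if len(reference) == 0:
        raise ValueError("at least one case is required")

    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)          # referred by both
    tn = sum(1 for p, r in pairs if not p and not r)  # referred by neither
    fp = sum(1 for p, r in pairs if p and not r)      # referred by system only
    fn = sum(1 for p, r in pairs if not p and r)      # missed referral

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Undefined when a class is absent from the reference decisions.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


def outcome_validated(predicted: Sequence[bool],
                      reference: Sequence[bool],
                      min_accuracy: float) -> bool:
    """Return True when referral accuracy reaches the desired level."""
    return referral_metrics(predicted, reference)["accuracy"] >= min_accuracy
```

Note that no lesion-level annotation enters the computation: only the referral decisions already produced in normal practice are needed.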
An important advantage of this approach is that it avoids creating additional tasks
for doctors providing annotations. In our example, referral decisions are generated in
normal practice, but detailed annotations of lesions on images are not. A challenge is
that what constitutes the “outcome” is not always obvious [5].
4 Annotations and data, annotations as data
In God we trust; others must provide data
Edwin R. Fisher
Further important elements involved in validation emerge if we stand back from
the discussion so far and attempt to look at validation in all its aspects. We briefly
discuss a few in this section.
4.1 Annotation protocols and their importance
The collection of ground truth to validate RIA and MIA systems requires the
development of a protocol for annotating images or videos, in itself a complex task.
Several activities are involved; we summarize the main ones below.
• Protocol design. The protocol must be designed jointly by the clinical and
technical (MIA) team. Multiple clinicians ought to be involved [2, 3, 34].
A pilot study can, in our experience, help to identify key parameters of the
protocol. For instance, if an ordinal grading scale is involved (e.g., scoring
tortuosity, or the severity of a lesion), the optimal number of levels may be
identified on the basis not only of current clinical practice, but also of pilot
experiments suggesting the number that yields the most accurate results with
an automatic system. The final number is then obtained by discussion, as a
compromise between the original one (clinical practice) and the result of the
pilot study.
• Ground truth type. Once a protocol is agreed, the designers may simply decide
to output a set of measurements for each annotator, or also define summative
ones capturing some form of consensus among annotators to reconcile
differences in measurements. Note that we use “consensus” in a general sense:
generating a single value from a set of differing ones (e.g., the tortuosity level of
an artery given the different estimates of, say, three annotators).
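To make the general sense of “consensus” concrete, the sketch below derives a single ordinal grade from the grades of several annotators. It is an illustration under stated assumptions, not a recommended protocol: the two rules shown (median and majority vote), the function names `consensus_median` and `consensus_majority`, and the integer encoding of tortuosity levels are all hypothetical choices made for the example.

```python
from collections import Counter
from statistics import median_low
from typing import Optional, Sequence


def consensus_median(grades: Sequence[int]) -> int:
    """Median of the annotators' ordinal grades.

    median_low always returns one of the submitted grades, so the result
    stays on the grading scale even with an even number of annotators.
    """
    if not grades:
        raise ValueError("at least one grade is required")
    return median_low(grades)


def consensus_majority(grades: Sequence[int]) -> Optional[int]:
    """Most frequent grade, or None when the top count is tied.

    Returning None signals that the annotators need to reconcile the
    case by other means (an assumption of this sketch).
    """
    if not grades:
        raise ValueError("at least one grade is required")
    counts = Counter(grades).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


if __name__ == "__main__":
    # Hypothetical tortuosity grades from three annotators for two arteries.
    annotations = {"artery_1": [2, 3, 3], "artery_2": [1, 2, 4]}
    for vessel, grades in annotations.items():
        print(vessel, consensus_median(grades), consensus_majority(grades))
```

The two rules can disagree: when all three annotators differ, the median still yields a value, whereas majority vote yields none, which is one reason the choice of rule belongs in the protocol itself.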