The difference between deciding the consensus type during protocol design and
leaving it to the MIA team (i.e., after annotations are provided) is that the former
encapsulates consensus-achieving procedures in the protocol design stage, in which
all decisions regarding the generation of the ground truth take place, promoting
consistency and awareness across the whole interdisciplinary team. Consensus
measurements are best obtained by discussion among the annotators, typically in
cases where the disagreement is above a level defined as acceptable (a problem-
specific decision). Other, simple consensus measurements include the average (for
numerical values) and majority (for categorical or ordinal labels, given at least
three annotators). Descriptive statistics characterizing the disagreement among
annotators may also be provided, in the form of basic statistics (mean and standard
deviation of the signed and absolute differences), histograms, or other summaries.
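As a concrete illustration, the Python sketch below computes the simple consensus measurements and disagreement statistics just described. The function names and the sample vessel-width values are ours, purely for illustration; they are not part of any specific protocol.

import numpy as np
from collections import Counter

def consensus_numeric(values):
    # Consensus for a numerical measurement: the average across annotators.
    return float(np.mean(values))

def consensus_majority(labels):
    # Consensus for categorical/ordinal labels (at least three annotators):
    # the majority vote.
    return Counter(labels).most_common(1)[0][0]

def disagreement_stats(a, b):
    # Mean and standard deviation of the signed and absolute differences
    # between two annotators' numerical measurements.
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return {"signed_mean": d.mean(), "signed_std": d.std(ddof=1),
            "abs_mean": np.abs(d).mean(), "abs_std": np.abs(d).std(ddof=1)}

# Three annotators measure the same vessel width (pixels):
print(consensus_numeric([4.1, 4.3, 4.0]))

# Two annotators measure four vessels each:
widths_a = [4.1, 3.8, 5.0, 4.4]
widths_b = [4.3, 3.6, 5.2, 4.1]
print(disagreement_stats(widths_a, widths_b))

# Three annotators grade the same image:
print(consensus_majority(["mild", "moderate", "mild"]))  # -> "mild"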
• Software tool. An appropriate software tool must be provided to the annotators,
selected or designed and developed to make the annotation task as efficient and
unambiguous as possible. It is again important to involve clinicians in the choice
or design of the software annotation tool.
• Training sessions. Once a protocol has been agreed and a software tool
identified or created, the technical team should run training sessions to ensure
that the annotators follow the protocol consistently. Experience indicates that
such training sessions are valuable to avoid inconsistencies in the data which
may weaken the subsequent validation of the MIA algorithm.
• How much detail? An annotation protocol must support the consistent
generation of a set of measurements by different annotators. We stress that it is
the procedure that must be consistent, not the measurements: there is important
information in the variability among annotators, assuming that they followed
the same procedure. If annotators make independent decisions or depart
from the protocol in various ways, random variations not related to the target
measurements are introduced in their annotations, weakening the validation of
MIA algorithms. It is critical to discuss these aspects with the clinical team; a
simple way to quantify inter-annotator agreement is sketched after this list.
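Beyond descriptive statistics, agreement between annotators on categorical or ordinal labels is often summarized with chance-corrected statistics such as Cohen's kappa. The sketch below is our illustration, not a measure prescribed by the text: it assumes scikit-learn, and the diabetic-retinopathy grades are invented. The quadratically weighted variant is common for ordinal grades, since it penalizes large disagreements more heavily.

from sklearn.metrics import cohen_kappa_score

# Hypothetical diabetic-retinopathy grades (0-4) assigned by two
# annotators to the same ten images.
grader_1 = [0, 1, 2, 2, 3, 0, 4, 1, 2, 0]
grader_2 = [0, 1, 2, 3, 3, 0, 4, 1, 1, 0]

# Cohen's kappa corrects raw percent agreement for the agreement
# expected by chance alone.
print(cohen_kappa_score(grader_1, grader_2))
print(cohen_kappa_score(grader_1, grader_2, weights="quadratic"))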
4.2 Reducing the need for manual annotations
An important trend in contemporary research addresses techniques for limiting the
volume of annotations needed to validate a medical image analysis system while
maintaining its accuracy and related performance parameters. Research aimed at
reducing the number of annotations needed is particularly important for achieving
all-round automation on a large scale, given the unabating proliferation of deep
learning systems (artificial intelligence), in which a typical network must learn
millions of parameters. We
refer the reader to recent papers [35–38] and to the related literature on automatic
annotations in computer vision [39, 40].
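One illustrative family of such techniques is active learning, in which the system itself selects the most informative images to be annotated next. The sketch below is ours and does not reproduce the method of any specific paper cited above; it uses synthetic scikit-learn data in place of images and implements simple uncertainty sampling with a logistic-regression classifier.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Start from a small seed set of annotated samples.
labeled = list(rng.choice(len(X), size=20, replace=False))
unlabeled = [i for i in range(len(X)) if i not in labeled]

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    # Uncertainty sampling: request annotations for the samples whose
    # predicted class probability is closest to 0.5.
    proba = model.predict_proba(X[unlabeled])[:, 1]
    query = [unlabeled[i] for i in np.argsort(np.abs(proba - 0.5))[:10]]
    labeled += query                      # simulated annotation: reuse y
    unlabeled = [i for i in unlabeled if i not in query]
    print(f"round {round_}: {len(labeled)} labels, "
          f"accuracy {model.score(X, y):.3f}")

In practice, the queried images would be presented to clinical annotators following the agreed protocol, rather than labeled automatically as in this simulation.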
We note that validation on outcome (Section 3.3) can be regarded as a paradigm
for limiting the annotation burden, and to some extent the annotation volume, as it
aims to use information recorded anyway when seeing patients instead of asking
clinicians for additional work such as tracing contours on images.