As shown in Table 1, the number of images in each dataset ranges from 8 to 143. The typical dataset size is thus several orders of magnitude smaller than that of the popular image detection/segmentation benchmarks used in the broader computer vision community, such as ImageNet [39] and COCO [40], which typically have 20–200 K images or more in their training and validation sets. This scarcity of annotated data makes the practical situation challenging. The issue is especially pronounced when deep learning methods such as convolutional neural networks (CNNs) are engaged, as CNN models often require access to large and well-annotated training sets. CNN models trained on these retinal datasets thus tend to be less robust, and more vulnerable to small perturbations of the input images or adversarial attacks at test time.
2.2 Evaluation metrics
With the increasing amount of activity in vessel segmentation and tracing, there is naturally a demand for proper evaluation metrics by which different methods can be compared quantitatively and objectively on the same ground. The typical metrics are usually individual-pixel based, including, for example, the precision-recall curve and the F1 score as a single-value performance indicator. The sensitivity and specificity pair is another popular choice.
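As a concrete illustration, the following minimal sketch computes these pixel-based scores from a binary prediction mask and a binary reference mask; both are assumed to be NumPy arrays of the same shape, and the helper name pixel_metrics is ours.

```python
import numpy as np

def pixel_metrics(pred, ref):
    """Pixel-wise precision, recall (sensitivity), specificity, and F1
    for binary prediction and reference masks of the same shape."""
    pred = pred.astype(bool)
    ref = ref.astype(bool)

    tp = np.sum(pred & ref)    # vessel pixels correctly detected
    fp = np.sum(pred & ~ref)   # background pixels labeled as vessel
    fn = np.sum(~pred & ref)   # vessel pixels that were missed
    tn = np.sum(~pred & ~ref)  # background pixels correctly rejected

    precision = tp / max(tp + fp, 1)   # max(..., 1) avoids division by zero
    recall = tp / max(tp + fn, 1)      # recall equals sensitivity
    specificity = tn / max(tn + fp, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, specificity, f1
```

Sweeping a threshold over a soft vessel-probability map and recording the (precision, recall) pair at each threshold traces out the precision-recall curve.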
It is worth noting that the commonly used receiver operating characteristic (ROC) curve may not be suitable in this setting, since the positive vessel pixels and the negative background pixels are severely imbalanced. While pixel-based metrics are a well-established means of globally quantifying pixel-wise deviations, their major drawback is that the geometric information of the vasculature is not well preserved in the evaluation.
This motivates the construction of structural metrics [41–43] that better account for differences from the perspective of vasculature geometry. The metric considered in Ref. [41] emphasizes both detection rate and detection accuracy: the vascular structures of the prediction and the reference are optimally matched by solving the induced maximum-cardinality minimum-cost graph matching problem.
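The following toy sketch conveys only the matching idea, not the formulation of Ref. [41]: it pairs predicted vessel segments with reference segments by minimizing the total distance between segment centroids, using the Hungarian solver from SciPy. The centroid representation and the helper name match_segments are our simplifying assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_segments(pred_centroids, ref_centroids):
    """Pair predicted and reference segments, given as (n, 2) and (m, 2)
    centroid arrays, so that the total matching cost is minimized."""
    # Cost matrix: Euclidean distance between every (pred, ref) pair.
    cost = np.linalg.norm(
        pred_centroids[:, None, :] - ref_centroids[None, :, :], axis=-1
    )
    # Optimal one-to-one assignment; with a rectangular cost matrix this
    # matches min(n, m) pairs, echoing the maximum-cardinality flavor of
    # the original graph matching formulation.
    rows, cols = linear_sum_assignment(cost)
    return rows, cols, cost[rows, cols].sum()
```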
Meanwhile, the structural metric proposed by Gegundez-Arias et al. [42] compares three aspects of the predicted vessel segmentation and the corresponding reference annotation, namely number, overlap, and length. Number compares the number of connected segments present in the prediction and in the annotation images. Overlap assesses the amount of overlap between the predicted segmentation and the reference annotation. Length examines the similarity in length between the predicted and the annotated vessel skeletons.
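The sketch below expresses these three comparisons in a simplified form for binary vessel masks, using scikit-image for connected components and skeletons; the ratio-based scores are our illustrative choices rather than the exact definitions of Ref. [42].

```python
import numpy as np
from skimage.measure import label
from skimage.morphology import skeletonize

def structural_scores(pred, ref):
    """Simplified number/overlap/length comparison for binary masks."""
    pred = pred.astype(bool)
    ref = ref.astype(bool)

    # Number: agreement between connected-segment counts (8-connectivity).
    _, n_pred = label(pred, connectivity=2, return_num=True)
    _, n_ref = label(ref, connectivity=2, return_num=True)
    number = min(n_pred, n_ref) / max(n_pred, n_ref, 1)

    # Overlap: Jaccard index between predicted and reference masks.
    overlap = np.sum(pred & ref) / max(np.sum(pred | ref), 1)

    # Length: similarity of vessel skeleton lengths (in pixels).
    len_pred = int(skeletonize(pred).sum())
    len_ref = int(skeletonize(ref).sum())
    length = min(len_pred, len_ref) / max(len_pred, len_ref, 1)

    return number, overlap, length
```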
In a more recent attempt [43], a structural similarity score is designed to incorporate both the location and thickness differences by which the segmentation departs from the reference vessel trees. A very detailed discussion on the validation of retinal image analysis can be found in Ref. [44]. See also Chapter 9 in this book for a discussion of validation techniques.
In addition to the aforementioned situation, where full annotations exist in the dataset for performance evaluation, there are often practical vessel segmentation