Page 108 - Computational Retinal Image Analysis

2  Benchmark datasets and evaluation metrics




As shown in Table 1, the number of images in each dataset ranges from 8 to 143. Typical dataset sizes are thus several orders of magnitude smaller than those of the popular image detection/segmentation benchmarks used in the broader computer vision community, such as ImageNet [39] and COCO [40], which typically have 20–200K images or more in their training and validation sets. In a sense, this characteristic makes the practical situation less appealing. The situation is especially pronounced when deep learning methods such as convolutional neural networks (CNNs) are engaged, as CNN models often require access to large, well-annotated training sets. CNN models trained on these retinal datasets thus tend to be less robust, and more vulnerable to small perturbations of input images or to adversarial attacks at test time.

                  2.2  Evaluation metrics
With the increasing amount of activity on vessel segmentation and tracing, there is naturally a demand for proper evaluation metrics, so that different methods can be compared quantitatively and objectively on the same ground. The typical metrics are individual pixel-based, including, for example, the precision-recall curve and the F1 score as a single-value performance indicator; the sensitivity and specificity pair is another popular choice. It is worth noting that the commonly used receiver operating characteristic (ROC) curve may not be suitable in our situation here, since the positive vessel pixels and negative background pixels are severely imbalanced. Although pixel-based metrics are a well-established means of globally quantifying deviations, their major drawback is that the geometric information of the vasculature is not well preserved in the evaluation.
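A minimal sketch of these pixel-based metrics, assuming the prediction and reference annotation are given as binary NumPy masks; the function and variable names are illustrative, not from any particular benchmark toolkit:

```python
import numpy as np

def pixel_metrics(pred, ref):
    """Pixel-wise evaluation of a binary vessel segmentation.

    pred, ref: arrays of the same shape, where True marks a vessel
    pixel in the prediction and reference annotation, respectively.
    """
    pred = pred.astype(bool)
    ref = ref.astype(bool)
    tp = np.sum(pred & ref)    # vessel pixels correctly detected
    fp = np.sum(pred & ~ref)   # background labelled as vessel
    fn = np.sum(~pred & ref)   # vessel pixels missed
    tn = np.sum(~pred & ~ref)  # background correctly rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # a.k.a. sensitivity
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity}

# Toy 4x4 example: vessel pixels are scarce, so specificity is
# dominated by the abundant background class -- the imbalance that
# makes ROC analysis misleading here.
ref = np.zeros((4, 4), bool); ref[1, :] = True      # 4 vessel pixels
pred = np.zeros((4, 4), bool); pred[1, :3] = True   # 3 of them detected
m = pixel_metrics(pred, ref)
```

Even in this toy case, specificity is a perfect 1.0 while a quarter of the vessel pixels are missed, illustrating why the precision-recall view is more informative for thin structures.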
This motivates the construction of structural metrics [41–43] that better account for differences from the perspective of vasculature geometry. The metric considered in Ref. [41] emphasizes both detection rate and detection accuracy, where the vascular structures of the prediction and the reference are optimally matched by solving the induced maximum-cardinality minimum-cost graph matching problem. Meanwhile, the structural metric proposed by Gegundez-Arias et al. [42] compares three aspects of the predicted vessel segmentation and the corresponding reference annotation, namely number, overlap, and length. Number refers to comparing the number of connected segments present in the prediction and in the annotation images. Overlap assesses the amount of overlap between the predicted segmentation and the reference annotation. Length examines the similarity in length between the predicted and annotated vessel skeletons. In a more recent attempt [43], a structural similarity score is designed to incorporate both the location and thickness differences by which the segmentation departs from the reference vessel trees. A very detailed discussion on validation of retinal image analysis can be found in Ref. [44]; see also Chapter 9 of this book for a discussion of validation techniques.
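The number and overlap aspects above can be sketched as follows, assuming binary NumPy masks and using `scipy.ndimage.label` for connected components. This is a simplified illustration, not the actual metric of Ref. [42]: the skeleton-length comparison is omitted, and the function name and the choice of the Jaccard index for overlap are our own assumptions:

```python
import numpy as np
from scipy import ndimage

def structural_comparison(pred, ref):
    """Compare two binary vessel masks by segment count and overlap."""
    pred = pred.astype(bool)
    ref = ref.astype(bool)
    # "Number": connected vessel segments in each image, using
    # 8-connectivity since vessels are thin, often diagonal structures.
    eight_conn = np.ones((3, 3), int)
    _, n_pred = ndimage.label(pred, structure=eight_conn)
    _, n_ref = ndimage.label(ref, structure=eight_conn)
    # "Overlap": Jaccard index between the two pixel sets.
    inter = np.sum(pred & ref)
    union = np.sum(pred | ref)
    jaccard = inter / union if union else 1.0
    return {"n_pred": n_pred, "n_ref": n_ref, "jaccard": jaccard}

# Toy example: the reference contains two vessel segments, but the
# prediction detects only one of them.
ref = np.zeros((5, 5), bool)
ref[0, :4] = True
ref[4, :4] = True
pred = np.zeros((5, 5), bool)
pred[0, :4] = True
s = structural_comparison(pred, ref)
```

A mismatch in segment count exposes broken or missed vessel branches that a pixel-wise score can mask, which is precisely the motivation for structural metrics.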
In addition to the aforementioned situation, where full annotations exist in the dataset for performance evaluation, there are often practical vessel segmentation