machine learning requires a set of handcrafted features that are "learned" with the aid of ground truth (supervised learning), clustered based on intrinsic properties (unsupervised learning), or handled by some combination of the two. Given enough training examples, deep networks learn their own features by convolving the input with sets of filters, generally at different scales and levels of abstraction of the input data. If large amounts of labeled data are available along with sufficient computing power for training, one is likely to see performance gains over traditional methods on the same data.
Deep CNNs are best known for their classification abilities: given an input image, predict a level of disease, such as a grade of diabetic retinopathy. They can also be used for regression tasks. In this case, the training labels are the OD and fovea center points marked by two graders on the Messidor and Kaggle datasets. The first step is to preprocess the data. Remarkably, the only preprocessing done on the images is to convert them to grayscale, resize them to 256 × 256 pixels, and perform contrast-limited adaptive histogram equalization (CLAHE) [24], as sketched below.
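A minimal sketch of this preprocessing step, assuming OpenCV is used, might look as follows; the clip limit and tile grid size are illustrative assumptions, since the text specifies only the three operations themselves:

```python
# Sketch of the described preprocessing; clipLimit and tileGridSize are
# illustrative assumptions, not values specified in the text.
import cv2

def preprocess(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # grayscale conversion
    gray = cv2.resize(gray, (256, 256))                  # resize to 256 x 256
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray)                             # CLAHE
```

The deep CNN is built from a combination of standard layers: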
Convolutional layer—sets of filters convolved with the input to produce feature
maps.
Pooling layer—a form of subsampling of the feature maps produced by the convolutional layers. In this case max-pooling is used, where a window slides over a feature map and the maximum value in that window is selected.
Dropout layer—randomly ignores the outputs of certain hidden units with a fixed probability during training, to prevent overfitting.
Fully connected layer—each node in this layer is connected to every node in the
previous layer. This is usually the final layer of a deep CNN.
All layers except the output layer use a Rectified Linear Unit (ReLU) [25] as the activation function, defined as:

ϕ: x → max(0, x)    (10)

so that the output of any particular node is 0 when its input is negative and equal to the input otherwise. The output layer uses a linear function to combine the activations of the final hidden layer, as is appropriate for regressing coordinates. The layers used in this architecture are visualized in Fig. 6; a sketch of a comparable stack is given below.
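For concreteness, a small Keras model combining these layer types might be assembled as follows. The filter counts, kernel sizes, and depth here are assumptions for illustration, not the exact architecture of the cited work:

```python
# Illustrative layer stack only: filter counts, kernel sizes, and depth
# are assumptions, not the exact architecture of the cited work.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_landmark_cnn(input_shape=(256, 256, 1)):
    model = models.Sequential([
        tf.keras.Input(shape=input_shape),
        # Convolutional layers: filter banks producing feature maps (ReLU)
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),   # max-pooling: subsample each feature map
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),  # fully connected layer
        layers.Dropout(0.5),                   # drop hidden units during training
        layers.Dense(2, activation="linear"),  # linear output: (x, y) coordinates
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```

The mean-squared-error loss pairs naturally with the linear output when the targets are (x, y) landmark coordinates.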
There are two steps to detecting the fovea and OD. The first step runs the preprocessed image through the network, and the detected areas become regions of interest. These regions are then run through additional networks to fine-tune the locations (Fig. 6).
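The coarse-to-fine procedure might be wired together as in the following schematic, where `coarse_net` and `refine_net` stand in for trained models and the 64-pixel crop size is an assumption:

```python
# Schematic of the two-stage, coarse-to-fine localization. `coarse_net`
# and `refine_net` stand in for trained models; the crop size is an
# assumption.
import numpy as np

def detect_landmark(image, coarse_net, refine_net, crop=64):
    h, w = image.shape
    # Stage 1: coarse prediction on the full 256 x 256 preprocessed image.
    x, y = coarse_net.predict(image[None, ..., None])[0]
    # Stage 2: crop a region of interest around the coarse estimate...
    x0 = int(np.clip(x - crop // 2, 0, w - crop))
    y0 = int(np.clip(y - crop // 2, 0, h - crop))
    patch = image[y0:y0 + crop, x0:x0 + crop]
    # ...and refine the location within that region.
    dx, dy = refine_net.predict(patch[None, ..., None])[0]
    return x0 + dx, y0 + dy
```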
10,000 images from the Kaggle dataset were used for training and testing (7000 for training and validation, 3000 for testing), and 1200 images from the Messidor dataset were also used for testing. Results were reported as the percentage of images with the OD and fovea found within 1 disc radius, 0.5 disc radii, and 0.25 disc radii. Under the 1 disc radius criterion, the OD and fovea were found in 97% and 96.6% of images, respectively, in the Messidor dataset, and in 96.7% and 95.6% in the Kaggle dataset. Moreover, although training the model carries considerable computational overhead, inference runs almost instantaneously (0.007 s).
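The success criterion itself is simple to compute. A sketch, assuming ground-truth points and per-image disc radii are available as arrays (all names here are illustrative):

```python
# Sketch of the reported criterion: a detection succeeds when the
# predicted point lies within a fraction of one disc radius of the
# ground truth. All names here are illustrative.
import numpy as np

def success_rate(pred, truth, disc_radius, fraction=1.0):
    """pred, truth: (N, 2) arrays of (x, y); disc_radius: (N,) array."""
    dist = np.linalg.norm(pred - truth, axis=1)
    return float(np.mean(dist <= fraction * disc_radius))
```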