Page 180 - Computational Retinal Image Analysis
2 Data classification, data capture and data management 175
of analysis. If observations come from different individuals they may be regarded as
independent. If, however, there is a relationship between observations, for example
the intraocular pressure in a glaucomatous eye pre- and post-delivery of eye drops, or
indeed the intraocular pressure of the right and left eyes of the same individual, the data are
not independent. Within imaging data there may be several thousand different values
measured on a single eye, yielding data that are not independent. It is therefore important
that the statistical technique used to explore such data accounts for this potential
nonindependence and for the unit of analysis [12].
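One simple way to respect the unit of analysis, sketched here under stated assumptions, is to reduce eye-level values to a single value per individual before applying a method that assumes independence. The records, field names and the choice of averaging are hypothetical illustrations and are not taken from the text; approaches that model the pairing directly may retain more information.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical eye-level records: (patient_id, eye, intraocular pressure in mmHg).
# The values are invented purely for illustration.
records = [
    ("P01", "R", 21.0),
    ("P01", "L", 23.0),
    ("P02", "R", 15.0),
    ("P02", "L", 17.0),
    ("P03", "R", 18.0),  # only one eye measured for this patient
]

def per_patient_mean(rows):
    """Collapse eye-level values to one value per patient."""
    by_patient = defaultdict(list)
    for patient_id, _eye, iop in rows:
        by_patient[patient_id].append(iop)
    return {pid: mean(vals) for pid, vals in by_patient.items()}

patient_iop = per_patient_mean(records)
print(len(records), "eye-level values ->", len(patient_iop), "independent units")
print(patient_iop)
```

Here five eye-level values come from only three individuals, so an analysis that treats observations as independent should be run on three units, not five.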
A further issue to consider is whether a value being analyzed is an actual
measurement or a summary score that represents some
pre-processing of data. If the latter, it is necessary to know how the
pre-processing has been done. Failure to do so may result in spurious associations
between variables being seen (see, e.g., ocular perfusion and intraocular pressure in
Refs. [5, 6]). Measuring devices often pre-process the data, a point that is
often forgotten.
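To see how a derived score can manufacture an association, consider the following simulation sketch. It assumes the commonly quoted definition of mean ocular perfusion pressure as two-thirds of mean arterial pressure minus intraocular pressure. That formula, the distribution parameters and all variable names are assumptions introduced here rather than taken from the text, and the data are simulated.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 2000

# Simulated, mutually independent raw measurements (parameters are arbitrary).
map_mmhg = [rng.gauss(93, 10) for _ in range(n)]  # mean arterial pressure
iop_mmhg = [rng.gauss(16, 3) for _ in range(n)]   # intraocular pressure

# Derived summary score (assumed formula): OPP = 2/3 * MAP - IOP.
opp_mmhg = [2 / 3 * m - i for m, i in zip(map_mmhg, iop_mmhg)]

r_raw = pearson(map_mmhg, iop_mmhg)      # raw inputs were generated independently
r_derived = pearson(opp_mmhg, iop_mmhg)  # IOP is itself a term in OPP
print(f"MAP vs IOP: r = {r_raw:.2f}")
print(f"OPP vs IOP: r = {r_derived:.2f}")
```

Because intraocular pressure enters the assumed formula for the derived score, the two are negatively correlated by construction, even though no relationship between the raw measurements was simulated.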
2.2 Data collection and management
Many statistical textbooks and courses on statistics begin with a clean data set.
Unfortunately, in the real world researchers are often faced with something that
is very different from a clean data set. They are presented with data sets that may
have missing values for some patients, values recorded that are not
feasible, dates captured in varying forms (day/month/year or month/day/year)
and variables captured as text fields. Below are two tables from
spreadsheets (both fictitious). One would require considerable modification prior
to data analysis (Table 1) while the other would not (Table 2). An example of the
modification that would be needed is converting all captured weights so that
they are in the same units, rather than alternating between kg and stones and pounds. If
weights in differing units were read as a single variable, then summarizing
such data would be meaningless. In the dirty spreadsheet Ethnicity has been
captured as free text. A variety of entries have been made for this variable but if
we consider the category White there are three terms (White, W and w) that have
been used within this column to indicate that the subject was white. Prior to data
analysis these need to be converted into the same term so that when the categories
are summed, the correct totals are provided rather than having to tally several
subtotals. The example (Tables 1 and 2) illustrates a very small data set, but consider
this amplified by several tens, hundreds or thousands. While code can be written
to facilitate the data conversion, writing such code can be time-consuming and
may introduce error. This can be avoided by carefully considering how to capture
data correctly in the first place. Time spent planning data capture (avoiding free
text and using standard coding systems where possible, such as ICD coding for
capturing disease) means that data analysis can be conducted efficiently and results
delivered in a timely fashion.
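The kind of conversion code mentioned above might look like the following minimal sketch. The raw strings, the accepted unit spellings and the synonym map are hypothetical, since the actual contents of Table 1 are not reproduced here; real data would likely need a fuller set of rules.

```python
import re

KG_PER_POUND = 0.45359237  # international avoirdupois pound, in kilograms
POUNDS_PER_STONE = 14

# Hypothetical synonym map; extend it as new spellings are found.
ETHNICITY_TERMS = {"white": "White", "w": "White"}

def weight_to_kg(raw):
    """Convert strings such as '70 kg', '11 st 4 lb' or '11st' to kilograms.

    Returns None when the entry cannot be interpreted, so that it can be
    queried with the data collector rather than guessed.
    """
    text = raw.strip().lower()
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*kg", text)
    if m:
        return float(m.group(1))
    m = re.fullmatch(
        r"(\d+)\s*st(?:ones?)?(?:\s*(\d+(?:\.\d+)?)\s*(?:lbs?|pounds?))?", text
    )
    if m:
        pounds = int(m.group(1)) * POUNDS_PER_STONE + float(m.group(2) or 0)
        return round(pounds * KG_PER_POUND, 1)
    return None

def clean_ethnicity(raw):
    """Map free-text variants to one category label; unknown terms pass through."""
    key = raw.strip().lower()
    return ETHNICITY_TERMS.get(key, raw.strip())

print(weight_to_kg("70 kg"))
print(weight_to_kg("11 st 4 lb"))
print([clean_ethnicity(v) for v in ["White", "W", "w"]])
```

Returning None for uninterpretable entries, rather than a guessed value, keeps the conversion step from silently introducing the errors the text warns about.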