The use of new technology clearly means that many measurements are now captured and exported automatically. Such automation should reduce error, but it is important to acknowledge that it can also introduce new errors, for example if an algorithm has not been correctly programmed. Even if the data are captured robustly, there will still be a need for careful indexing of such exports so that the correct export is assigned to the correct patient. Within big data, a new term has emerged: provenance. The term data provenance refers to a record trail that accounts for the origin of a piece of data (in a database, document, or repository) together with an explanation of how and why it reached its present place. It should be noted that many statisticians would simply describe this as good statistical practice! Formalizing the process, however, may well reduce errors that arise because of a lack of involvement with the statistical community.
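As a concrete illustration, a provenance record for an automated export might be represented as follows. This is only a minimal sketch with hypothetical field names; in practice the record would follow whatever metadata conventions the database or repository imposes.

# A minimal sketch of the kind of record trail that data provenance implies,
# using hypothetical field names. Each exported measurement would carry a
# record of where it came from, how it was produced, and how it was linked
# to the correct patient.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ProvenanceRecord:
    patient_id: str          # identifier linking the export to the patient
    source_device: str       # instrument or software that produced the export
    algorithm_version: str   # version of the algorithm that computed the value
    export_file: str         # file (or repository entry) the value was taken from
    exported_at: datetime    # when the export was generated
    note: str = ""           # how and why the data reached its present place

record = ProvenanceRecord(
    patient_id="P-0001",
    source_device="OCT scanner",
    algorithm_version="1.4.2",
    export_file="exports/2019-06-01/P-0001.csv",
    exported_at=datetime(2019, 6, 1, 9, 30),
    note="Routine clinic export combined into research database",
)
print(record)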
2.3 Words of caution about data collection in the current era of big data
Big data may mean that researchers have little control over data collection, since they are combining datasets that have been compiled by others. It is nevertheless important for all involved in research to understand the need for rigor in data collection and management.
When data are first imported into a statistical program, a statistician conducts a descriptive analysis of the data. For continuous data this typically involves construction of histograms and scatter plots. These allow detection of values that are inconsistent with the rest of the data, so-called outliers. Different statistical packages may identify different values as outliers because they use different definitions. This can cause confusion for those new to statistical analysis, but in reality the idea is simply to assess whether there are atypical values. If these are found, it is important to go back to the laboratory to assess whether or not they are errors. Most statisticians would not routinely advise dropping outliers. They may, however, advise that an analysis is conducted with and without the unusual observations to assess whether or not their inclusion affects the results.
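The sketch below illustrates these descriptive checks in Python, assuming a hypothetical export file retinal_measurements.csv with a column macular_thickness. The 1.5 * IQR rule used here to flag outliers is only one of the many definitions that packages implement, and the final lines show the kind of with-and-without comparison described above.

# A minimal sketch of the descriptive checks described above, assuming a
# hypothetical CSV of retinal measurements with columns "patient_id" and
# "macular_thickness". The 1.5 * IQR rule is only one common outlier
# definition; other packages may flag different values.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("retinal_measurements.csv")  # hypothetical export

# Descriptive plots: histogram and scatter plot against observation index
df["macular_thickness"].hist(bins=30)
plt.xlabel("Macular thickness (um)")
plt.savefig("histogram.png")
plt.close()

plt.scatter(range(len(df)), df["macular_thickness"])
plt.xlabel("Observation index")
plt.ylabel("Macular thickness (um)")
plt.savefig("scatter.png")
plt.close()

# Flag outliers with the 1.5 * IQR rule (one definition among several)
q1, q3 = df["macular_thickness"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["macular_thickness"] < q1 - 1.5 * iqr) | (
    df["macular_thickness"] > q3 + 1.5 * iqr
)

# Sensitivity check: summarize with and without the flagged observations
print("Mean (all observations):  ", df["macular_thickness"].mean())
print("Mean (outliers excluded): ", df.loc[~is_outlier, "macular_thickness"].mean())
print("Flagged observations:\n", df.loc[is_outlier])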
In the past, the scientific community typically analyzed the data that they had. If there were missing observations for a few subjects, it became challenging to know what to use as the denominator for percentages (the total number of subjects or the number of subjects reporting that value), but beyond this, missing data were not really considered. In 1987, however, a revolution took place in thinking about missing data. This came about following two highly influential books and the development of powerful personal computing [7, 8]. Researchers started to acknowledge that data that were missing might systematically differ from data that were not missing, and that analyzing only available data had the potential to distort results and mislead. New concepts were introduced: missing at random, missing completely at random, and missing not at random. Researchers were strongly advised to document the degree of missing data and how it might affect results (see more about data missingness in Section 5).
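As a brief illustration, the degree of missingness can be documented with a few lines of code. The sketch below assumes the same hypothetical data frame as earlier, with missing values recorded as NaN; it only summarizes how much is missing and contrasts available-case and complete-case means, and does not by itself distinguish missing completely at random, missing at random, and missing not at random.

# A minimal sketch of documenting missingness, assuming the same hypothetical
# export as above with missing values stored as NaN.
import pandas as pd

df = pd.read_csv("retinal_measurements.csv")  # hypothetical export

# Proportion of missing values per variable
missing_summary = df.isna().mean().rename("proportion_missing")
print(missing_summary)

# Available-case vs complete-case summaries for one variable
available_mean = df["macular_thickness"].mean()           # ignores NaN in this column only
complete_mean = df.dropna()["macular_thickness"].mean()   # drops any row with a missing value
print(f"Available-case mean: {available_mean:.1f}")
print(f"Complete-case mean:  {complete_mean:.1f}")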