The use of new technology clearly means that many measurements are now captured and exported automatically. Such automation should reduce error, but it is important to acknowledge that it can also introduce new errors, for example if an algorithm has not been correctly programmed. Even if the data are captured robustly, there will still be a need for careful indexing of such exports so that the correct export is assigned to the correct patient. Within big data, a new term has evolved: provenance. Data provenance refers to a record trail that accounts for the origin of a piece of data (in a database, document, or repository) together with an explanation of how and why it reached its present place. It should be noted that many statisticians would simply describe this as good statistical practice! Formalizing the process, however, may well reduce errors that arise from a lack of involvement with the statistical community.
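Although the notion is largely conceptual, a provenance record can be made concrete. The following is a minimal sketch, assuming a simple dictionary-based record written to JSON; all field names and values are illustrative, not part of any standard:

import json
from datetime import datetime, timezone

# Illustrative provenance record: the origin of a piece of data plus an
# explanation of how and why it reached its present place.
provenance_record = {
    "dataset": "retinal_oct_exports",            # hypothetical dataset name
    "origin": "OCT device export, clinic A",     # where the data came from
    "captured": datetime.now(timezone.utc).isoformat(),
    "transformations": [
        "exported automatically by device firmware v2.1",
        "linked to patient record via admission register",
    ],
    "reason": "routine capture for a screening programme",
}

# Store the record alongside the data so the trail travels with it.
with open("provenance.json", "w") as fh:
    json.dump(provenance_record, fh, indent=2)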

2.3  Words of caution about data collection in the current era of big data
Big data may mean that researchers have little control over data collection, since they are combining datasets that have been compiled by others. It is nevertheless important for all involved in research to understand the need for rigor in data collection and management.
When data are first imported into a statistical program, a statistician conducts a descriptive analysis of the data. For continuous data this typically involves constructing histograms and scatter plots. These allow detection of values that are inconsistent with the rest of the data, so-called outliers. Different statistical packages may identify different values as outliers because they use different definitions. This can cause confusion for those new to statistical analysis, but in reality the idea is simply to assess whether there are atypical values. If such values are found, it is important to go back to the laboratory to assess whether or not they are errors. Most statisticians would not routinely advise dropping outliers. They may, however, advise that an analysis is conducted with and without the unusual observations to assess whether or not their inclusion affects the results; a sketch of such an initial check is given below.
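As a minimal sketch of this initial descriptive analysis, assume the measurements sit in a pandas DataFrame; the file name and column names below are hypothetical, and the interquartile-range rule shown is only one of the definitions a package might use:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("measurements.csv")      # hypothetical export file

# First look at the distribution and at pairwise relationships.
df["iop_mmHg"].plot.hist(bins=30)
plt.xlabel("Intraocular pressure (mmHg)")
plt.show()
df.plot.scatter(x="age_years", y="iop_mmHg")
plt.show()

# One common (but not universal) definition of an outlier: values more
# than 1.5 interquartile ranges beyond the quartiles (Tukey's fences).
q1, q3 = df["iop_mmHg"].quantile([0.25, 0.75])
iqr = q3 - q1
flagged = df[(df["iop_mmHg"] < q1 - 1.5 * iqr) |
             (df["iop_mmHg"] > q3 + 1.5 * iqr)]
print(flagged)  # candidates to check against the laboratory records

# Rather than dropping outliers, compare results with and without them.
print("mean, all data:        ", df["iop_mmHg"].mean())
print("mean, flagged removed: ", df.drop(flagged.index)["iop_mmHg"].mean())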
In the past, the scientific community typically analyzed the data that they had. If there were missing observations on a few subjects, it became challenging to know what to use as the denominator for percentages (the total number of subjects or the number of subjects reporting that value), but beyond this, missing data were not really considered. In 1987, however, a revolution took place in thinking about missing data, following two highly influential books and the development of powerful personal computing [7, 8]. Researchers started to acknowledge that data that were missing might differ systematically from data that were not missing, and that analyzing only the available data had the potential to distort results and mislead. New concepts were introduced (missing at random, missing completely at random, and missing not at random), and researchers were strongly advised to document the degree of missing data and how it might impact upon results, as sketched below (see more about data missingness in Section 5).
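A minimal sketch of documenting the degree of missingness, again assuming a pandas DataFrame with a hypothetical file name, might look as follows; note that such a table records how much is missing, not why:

import pandas as pd

df = pd.read_csv("measurements.csv")      # hypothetical export file

n = len(df)
missing = df.isna().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (100 * missing / n).round(1),
})
print(summary.sort_values("pct_missing", ascending=False))

# Distinguishing the mechanisms (missing completely at random, missing at
# random, missing not at random) cannot be read off such a table; it
# requires knowledge of how the data were collected.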