72 CHAPTER 4 Statistical analysis
manually by participants, may contain errors or may be presented in inconsistent
formats. If those errors or inconsistencies are not filtered out or fixed, they may
contaminate the entire data set. Second, the original data collected may be too primitive,
and higher-level coding may be necessary to help identify the underlying themes.
Third, the specific statistical analysis method or software may require the data to be
organized in a predefined layout or format so that they can be processed (Delwiche
and Slaughter, 2008).
4.1.1 CLEANING UP DATA
The first thing that you need to do after data collection is to screen the data for
possible errors. This step is necessary for any type of data collected, but is
particularly important for data entered manually by participants. To err is human.
All people make mistakes (Norman, 1988). Although it is not possible to identify
every error, you want to catch as many as possible to minimize their negative
impact. There are various ways to identify errors, depending on the nature of the
data collected.
Sometimes you can identify errors by conducting a reasonableness check. For
instance, if the age of a participant is entered as “223,” you can easily conclude that
there is something wrong. Your participant might have accidentally pushed the
number “2” button twice, in which case the correct age should be 23, or might have
accidentally hit the number “3” button after the correct age, 22, had been entered.
Sometimes you need to check multiple data fields in order to identify possible
errors. For example, you may compare the participant's “age” and “years of computing
experience” to check whether there is an unreasonable entry, such as more years of
experience than years of age.
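The reasonableness check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `age` and `years_computing`, the age limits, and the assumed earliest age of computer use are hypothetical and should be adapted to the study at hand.

```python
def check_participant(record, min_age=18, max_age=100, earliest_use_age=3):
    """Return a list of problems found in one participant record.

    The field names and thresholds are illustrative assumptions,
    not values prescribed by the text.
    """
    problems = []
    age = record.get("age")
    years = record.get("years_computing")

    # Single-field check: is the age inside a plausible range?
    if age is None or not (min_age <= age <= max_age):
        problems.append(f"age out of range: {age!r}")

    # Single-field check: experience cannot be negative.
    if years is not None and years < 0:
        problems.append(f"negative computing experience: {years!r}")

    # Cross-field check: experience cannot exceed the years available
    # since the assumed earliest age of computer use.
    if age is not None and years is not None and years > age - earliest_use_age:
        problems.append("computing experience inconsistent with age")

    return problems


if __name__ == "__main__":
    for rec in [{"age": 223, "years_computing": 5},
                {"age": 23, "years_computing": 30}]:
        print(rec, check_participant(rec))
```

A check like this only flags suspicious entries. Deciding whether “223” should have been 22 or 23 still requires going back to the participant or the original form.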
For automatically collected data, error checking usually boils down to time
consistency issues or whether the performance is within a reasonable range. Something
is obviously wrong if the logged start time of an event is later than the logged end
time of the same event. You should also be on the alert if any unreasonably high or low
performance levels are documented.
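For logged data, the two checks just mentioned (time consistency and a plausible performance range) might look like the following sketch. It assumes each event is stored as a dictionary with ISO 8601 `start` and `end` timestamps. Those key names and the duration limits are hypothetical and would depend on the logging software and the task.

```python
from datetime import datetime


def check_event(event, min_seconds=0.2, max_seconds=600.0):
    """Return a description of the problem with a logged event, or None.

    `event` is assumed to hold ISO 8601 strings under "start" and "end".
    The duration limits are illustrative, not recommended values.
    """
    start = datetime.fromisoformat(event["start"])
    end = datetime.fromisoformat(event["end"])
    duration = (end - start).total_seconds()

    # Time consistency: an event cannot end before it starts.
    if duration < 0:
        return "end time precedes start time"

    # Reasonable range: flag implausibly fast or slow performance.
    if duration < min_seconds or duration > max_seconds:
        return f"duration {duration:.1f}s outside expected range"

    return None


if __name__ == "__main__":
    log = [
        {"start": "2024-03-01T10:00:05", "end": "2024-03-01T10:00:47"},
        {"start": "2024-03-01T10:05:00", "end": "2024-03-01T10:04:10"},
    ]
    for event in log:
        print(event, "->", check_event(event))
```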
In many studies, data about the same participant are collected from multiple
channels. For example, in a study investigating multiple data-entry techniques, the
performance data (such as time and number of keystrokes) might be automatically
logged by data-logging software. The participants' subjective preference and
satisfaction data might be manually collected via paper-based questionnaires. In this
case, you need to make sure that all the data about the same participant are correctly
grouped together. The result will be invalid if the performance data of one participant
are grouped with the subjective data of another participant.
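One way to guard against mismatched records is to key every channel on a shared participant ID and merge on that key rather than on row order. The sketch below assumes such an ID exists in both sources. The function name and the example fields are hypothetical.

```python
def merge_by_participant(performance, questionnaire):
    """Combine two data channels keyed by participant ID.

    Both arguments are assumed to be dictionaries that map a participant
    ID to a dictionary of fields. Returns the merged records and a sorted
    list of IDs that appear in only one channel.
    """
    perf_ids = set(performance)
    quest_ids = set(questionnaire)

    # Merge only participants present in both channels.
    merged = {
        pid: {**performance[pid], **questionnaire[pid]}
        for pid in sorted(perf_ids & quest_ids)
    }

    # IDs found in just one channel need manual follow-up.
    unmatched = sorted(perf_ids ^ quest_ids)
    return merged, unmatched


if __name__ == "__main__":
    logged = {"P01": {"time_s": 41.2, "keystrokes": 310},
              "P02": {"time_s": 55.0, "keystrokes": 402}}
    survey = {"P01": {"satisfaction": 4},
              "P03": {"satisfaction": 2}}
    merged, unmatched = merge_by_participant(logged, survey)
    print(merged)
    print("needs review:", unmatched)
```

Reporting the unmatched IDs, instead of silently dropping them, makes it easier to spot a mistyped ID on a paper questionnaire.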
After errors are identified, how should you deal with them? Ideally, you would
fix every error and replace it with accurate data. This is possible
in some cases. If the age of a participant is incorrect, you can contact that
participant and find out the accurate information. In many cases, however, fixing
errors in the preprocessing stage is impossible. In many online studies or studies in which the
participant remains anonymous, you may have no means of reaching participants