domain knowledge of both the software being studied and user behavior to identify
patterns that would be representative of defined tasks (Hammontree et al., 1992;
Ivory and Hearst, 2001). These inferential efforts face many challenges. For ex-
ample, applications that provide multiple methods for accessing a given piece of
functionality (such as both a menu choice and a toolbar button for Print) may generate
log files containing entries for any of these methods. However, log entry analysis approaches may not
recognize these multiple paths as leading to a common goal. Establishing appropri-
ate contextual information may also be difficult: log file entries that indicate a button
was pressed are less informative than those that indicate which button was pressed
(Hilbert and Redmiles, 2000).
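To make the problem concrete, the following sketch (in Python, with invented event names; real log formats will differ) shows one way an analyst might map several logged paths back onto a common goal:

```python
# Invented event names for illustration; real log formats will differ.
CANONICAL_ACTIONS = {
    "menu.file.print": "print",   # Print via the File menu
    "toolbar.print": "print",     # Print via the toolbar button
    "shortcut.ctrl_p": "print",   # Print via the keyboard shortcut
    "menu.file.save": "save",
    "toolbar.save": "save",
}

def canonicalize(events):
    """Map raw UI events onto the user-level goals they represent."""
    return [CANONICAL_ACTIONS.get(event, event) for event in events]

# All three of these entries reflect the same goal: printing.
print(canonicalize(["toolbar.print", "shortcut.ctrl_p", "menu.file.print"]))
# -> ['print', 'print', 'print']
```

Building such a mapping requires exactly the domain knowledge described above: someone must know, in advance, which low-level events correspond to which user goals.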
Analysis challenges are particularly pronounced in the analysis of web server logs,
which may contain interleaved requests from dozens of different users. Statistical
analyses and visualization tools have been used to try to identify individual user ses-
sions from log files (Pirolli and Pitkow, 1999; Hochheiser and Shneiderman, 2001;
Heer and Chi, 2002), but these tools are imperfect at best. If a web browser coming
from a given Internet address accesses a page on your site and then accesses a second
page 10 minutes later, does that count as one session or two? Your log file cannot tell
you if the user was reading the page between those two requests or if she was talking
on the telephone. Those requests may not have come from the same person—for all
you know, it is a shared computer in a library or classroom that is used by dozens of
individuals on any given day.
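In practice, analysts often fall back on a timeout heuristic: a visitor's requests are split into separate sessions whenever the gap between them exceeds some threshold, commonly 30 minutes. The sketch below assumes the log has already been parsed into (address, timestamp) pairs; the threshold, and the assumption that one address equals one person, are exactly the guesses the paragraph above warns about:

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # 30-minute gap heuristic; the choice is a guess

def sessionize(requests):
    """Group (ip_address, unix_timestamp) pairs into per-address sessions.

    Assumes requests have already been parsed out of the server log; an
    address shared by many people (a library terminal) still looks like
    one user here.
    """
    by_address = defaultdict(list)
    for address, timestamp in sorted(requests, key=lambda r: r[1]):
        by_address[address].append(timestamp)

    sessions = []
    for address, times in by_address.items():
        current = [times[0]]
        for t in times[1:]:
            if t - current[-1] > SESSION_TIMEOUT:
                sessions.append((address, current))
                current = []
            current.append(t)
        sessions.append((address, current))
    return sessions

# Two requests 10 minutes apart count as one session under this heuristic;
# nothing in the log confirms that reading (not a phone call) filled the gap.
print(sessionize([("10.0.0.5", 0), ("10.0.0.5", 600), ("10.0.0.5", 4000)]))
```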
Custom-built or instrumented software may alleviate some data-granularity prob-
lems by providing you with complete control over the data that is collected, at the
expense of the time and effort required to develop the data collection tools. If you are
willing and able to commit the resources necessary for software customization, you
can configure the software to capture all of the data that you think might be interest-
ing: nothing more, nothing less.
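As a hypothetical illustration of that trade-off, instrumentation code might filter events against a list chosen before the study begins; the event names and fields here are invented for the example:

```python
# Hypothetical sketch: record only the events you decided were interesting.
import time

INTERESTING_EVENTS = {"print", "save", "search"}  # chosen before the study

def log_event(log, name, **details):
    """Record an event, with whatever context you chose to capture."""
    if name in INTERESTING_EVENTS:
        log.append({"time": time.time(), "event": name, **details})

log = []
log_event(log, "print", button="toolbar")  # captured, with context
log_event(log, "mouse_move", x=10, y=20)   # ignored: not on the list
```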
Unfortunately, matters are rarely so clean-cut. There may be a vast difference be-
tween what you think you need before you start large-scale data collection and what
you may wish you had collected once you begin analyzing the data. The expense of
running experiments—particularly those that involve substantial effort in participant
recruitment—creates a tendency toward collecting as much data as possible. “It's
easy to collect this information,” the thinking goes, “so we may as well. After all,
storage is inexpensive, and these details may prove useful later on.”
Although there is a certain logic to this defensive strategy of collecting as much
data as possible, it has its limits. As anyone who has sifted
through megabytes of event logs can tell you, collecting lots of data may simply
leave you with a mass of uninformative junk to sift through. Even with software
tools, the identification of meaningful patterns (as opposed to random coincidences)
can be difficult. Lower-resolution data may be somewhat easier to analyze.
If your data collection tools can clearly distinguish between coarse-grained and
fine-grained events, you might be able to have your cake and eat it too. Data collec-
tion tools might mark each event with an indication of the level of granularity that