360    CHAPTER 12  Automated data collection methods




                         domain knowledge of both the software being studied and user behavior to identify
                         patterns that would be representative of defined tasks (Hammontree et al., 1992;
                         Ivory and Hearst, 2001). These inferential efforts face many challenges. For ex-
                         ample, applications that provide multiple methods for accessing given functionality
                         (such as both a menu choice and a toolbar button for Print) may generate log files
                         that contain all of these methods. However, log entry analysis approaches may not
                         recognize these multiple paths as leading to a common goal. Establishing appropri-
                         ate contextual information may also be difficult: log file entries that indicate a button
                         was pressed are less informative than those that indicate which button was pressed
                         (Hilbert and Redmiles, 2000).
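One way to address the multiple-paths problem described above is to normalize raw log entries into abstract actions before analysis, while keeping the identity of the widget that was pressed so the contextual detail is not lost. The sketch below is a minimal illustration; the event names and the mapping table are hypothetical, not from any particular logging tool:

```python
# Map several concrete UI events onto one abstract user action,
# preserving the original widget identifier as context.
# Event names and the mapping table are hypothetical examples.
EVENT_TO_ACTION = {
    "menu:File>Print": "print",
    "toolbar:print_button": "print",
    "shortcut:Ctrl+P": "print",
    "menu:File>Save": "save",
}

def normalize(log_entries):
    """Translate raw log entries into (action, source_widget) pairs."""
    actions = []
    for entry in log_entries:
        action = EVENT_TO_ACTION.get(entry, "unknown")
        actions.append((action, entry))
    return actions

raw = ["menu:File>Print", "toolbar:print_button", "menu:File>Save"]
print(normalize(raw))
```

With such a table in place, the menu choice and the toolbar button both count toward the same "print" goal, yet an analyst can still ask which path users actually took.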
                            Analysis challenges are particularly pronounced in the analysis of web server logs,
                         which may contain interleaved requests from dozens of different users. Statistical
                         analyses and visualization tools have been used to try to identify individual user ses-
                         sions from log files (Pirolli and Pitkow, 1999; Hochheiser and Shneiderman, 2001;
                         Heer and Chi, 2002), but these tools are imperfect at best. If a web browser coming
                         from a given Internet address accesses a page on your site and then accesses a second
                         page 10 minutes later, does that count as one session or two? Your log file cannot tell
                         you if the user was reading the page between those two requests or if she was talking
                         on the telephone. Those requests may not have come from the same person—for all
                         you know, it is a shared computer in a library or classroom that is used by dozens of
                         individuals on any given day.
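The usual heuristic for carving a stream of requests into sessions is a fixed inactivity timeout: two requests from the same address separated by less than the timeout are assumed to belong to one session. The sketch below assumes a 30-minute cutoff, a conventional but arbitrary choice, and it inherits all the ambiguities just described: it cannot tell a reader pausing on a page from a phone call or from a different person at a shared machine:

```python
from datetime import datetime, timedelta

# Split one visitor's request timestamps into sessions using a fixed
# inactivity timeout -- a common but imperfect heuristic.
TIMEOUT = timedelta(minutes=30)  # conventional, but arbitrary

def sessionize(timestamps):
    """Group sorted request timestamps into lists of sessions."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= TIMEOUT:
            sessions[-1].append(ts)   # within timeout: same session
        else:
            sessions.append([ts])     # long gap: start a new session
    return sessions

requests = [
    datetime(2011, 5, 3, 9, 0),
    datetime(2011, 5, 3, 9, 10),  # 10 minutes later: same session here
    datetime(2011, 5, 3, 11, 0),  # long gap: treated as a new session
]
print(len(sessionize(requests)))  # → 2 with a 30-minute timeout
```

Note that the 10-minute gap from the example in the text falls inside this timeout and is counted as one session, but halving the timeout would not change that, while a 5-minute timeout would split it in two: the answer is an artifact of the threshold, not a fact about the user.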
                            Custom-built or instrumented software may alleviate some data-granularity prob-
                         lems by providing you with complete control over the data that is collected, at the
                         expense of the time and effort required to develop the data collection tools. If you are
                         willing and able to commit the resources necessary for software customization, you
                         can configure the software to capture all of the data that you think might be interest-
                         ing: nothing more, nothing less.
                            Unfortunately, matters are rarely so clean-cut. There may be a vast difference be-
                         tween what you think you need before you start large-scale data collection and what
                         you may wish you had collected once you begin analyzing the data. The expense of
                         running experiments—particularly those that involve substantial effort in participant
                         recruitment—creates a tendency toward collecting as much data as possible. “It's
                         easy to collect this information,” the thinking goes, “so we may as well. After all,
                         storage is inexpensive, and these details may prove useful later on.”
                             Although there is a certain logic to this defensive approach of collecting as
                          much data as possible, it has limits. As anyone who has sifted through megabytes
                          of event logs can tell you, collecting lots of data may simply leave you with lots
                          of uninformative junk to sift through. Even with software tools, identifying
                          meaningful patterns (as opposed to random coincidences) can be difficult.
                          Lower-resolution data may be somewhat easier to analyze.
                             If your data collection tools can clearly distinguish between coarse-grained and
                          fine-grained events, you might be able to have your cake and eat it too. Data
                          collection tools might mark each event with an indication of the level of
                          granularity that it represents.
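Such granularity tagging might be sketched as follows. The event records and the three-level scheme here are hypothetical illustrations, not a specific tool's format; the point is that each log entry carries a granularity field, so analysts can start from a coarse task-level overview and drill into fine-grained detail only where needed:

```python
# Tag each logged event with a granularity level so that analysis can
# begin with coarse task-level events and drill down selectively.
# The levels and event records are hypothetical illustrations.
COARSE, MEDIUM, FINE = 1, 2, 3

log = [
    {"event": "task:edit_document",  "granularity": COARSE},
    {"event": "command:print",       "granularity": MEDIUM},
    {"event": "mouse:click(412,88)", "granularity": FINE},
    {"event": "key:Ctrl+P",          "granularity": FINE},
]

def at_most(entries, level):
    """Keep only events at the given granularity or coarser."""
    return [e["event"] for e in entries if e["granularity"] <= level]

print(at_most(log, COARSE))  # coarse task-level overview only
print(at_most(log, MEDIUM))  # tasks plus command-level events
```

Filtering at analysis time, rather than at collection time, preserves the option of revisiting the fine-grained record if an unexpected question arises later.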