334 CHAPTER 12 Automated data collection methods
including keywords extracted from visited web pages or URLs and page view time can
provide increased accuracy in characterizing user sessions (Heer and Chi, 2002).
As a stand-alone tool, web log analysis is limited by a lack of contextual
knowledge about user goals and actions. Even if we are able to extract individual user paths
from log files, these paths do not tell us how the path taken relates to the user's goals.
In some cases, we might be able to make educated guesses: a path consisting of
repeated cycling between “help” and “search” pages is most likely an indication of a
task not successfully completed. Other session paths may be more ambiguous: long
intervals between page requests might indicate that the user was carefully reading
web content, but they can also arise from distractions and other activity not related to
the website under consideration. Additional information, such as direct observation
through controlled studies or interviews, may be necessary to provide appropriate
context (Hochheiser and Shneiderman, 2001).
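The kind of path extraction and educated guessing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a standard method: the sample records, the "/help" and "/search" paths, the 30-minute session cutoff, and the 5-minute "long pause" threshold are all hypothetical values chosen for the example. The functions only flag patterns; they cannot say why a pattern occurred.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, already-parsed log records: (client id, timestamp, requested path).
REQUESTS = [
    ("10.0.0.1", "2024-01-01 10:00:00", "/search"),
    ("10.0.0.1", "2024-01-01 10:00:20", "/help"),
    ("10.0.0.1", "2024-01-01 10:00:45", "/search"),
    ("10.0.0.1", "2024-01-01 10:01:10", "/help"),
    ("10.0.0.1", "2024-01-01 10:01:30", "/search"),
    ("10.0.0.2", "2024-01-01 10:00:00", "/"),
    ("10.0.0.2", "2024-01-01 10:12:00", "/article"),
    ("10.0.0.2", "2024-01-01 11:30:00", "/"),
]

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed cutoff between sessions
LONG_GAP = timedelta(minutes=5)          # assumed threshold for an ambiguous pause


def build_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group requests per client, splitting wherever a gap exceeds `timeout`."""
    by_client = defaultdict(list)
    for client, stamp, path in requests:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        by_client[client].append((when, path))

    sessions = []
    for client, hits in by_client.items():
        hits.sort()
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > timeout:
                sessions.append((client, current))
                current = []
            current.append(hit)
        sessions.append((client, current))
    return sessions


def help_search_cycles(session_hits):
    """Count direct transitions between /help and /search, in either direction."""
    paths = [path for _, path in session_hits]
    return sum(1 for a, b in zip(paths, paths[1:]) if {a, b} == {"/help", "/search"})


def long_gaps(session_hits, threshold=LONG_GAP):
    """Return pauses longer than `threshold`; the log cannot explain their cause."""
    times = [when for when, _ in session_hits]
    return [b - a for a, b in zip(times, times[1:]) if b - a > threshold]
```

A session with several help/search transitions is a candidate for follow-up, and a long pause is simply marked as ambiguous. Deciding whether the user was reading carefully or had walked away still requires the observational or interview data discussed above.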
Complex web applications can be designed to generate and store additional data
that may be useful for understanding user activity. Database-driven websites can
track views of various pages, along with other actions such as user comments, blog
posts, or searches. Web applications that store this additional data are very similar to
“instrumented” applications—programs designed to capture detailed records of user
interactions and other relevant activities (Section 12.4.1).
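As a rough illustration of how a database-driven site might record such actions, the following sketch stores one row per user action in SQLite using Python's standard sqlite3 module. The table name, column names, and action labels ("view", "comment", "search") are assumptions made for this example, not a schema prescribed by any particular web framework.

```python
import sqlite3


def open_event_store(path=":memory:"):
    """Open (or create) a small event table; the schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY,"
        " user_id TEXT NOT NULL,"
        " action TEXT NOT NULL,"   # e.g. 'view', 'comment', 'search'
        " target TEXT NOT NULL,"   # page path, post id, or query text
        " created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def record_event(conn, user_id, action, target):
    """Append one user action; called from the page or form handler."""
    conn.execute(
        "INSERT INTO events (user_id, action, target) VALUES (?, ?, ?)",
        (user_id, action, target),
    )
    conn.commit()


def counts_by_action(conn):
    """Summarize how often each kind of action occurred."""
    rows = conn.execute(
        "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    store = open_event_store()
    record_event(store, "u1", "view", "/articles/42")
    record_event(store, "u1", "comment", "/articles/42")
    record_event(store, "u2", "search", "log analysis")
    print(counts_by_action(store))
```

Because each row ties an action to a user identifier, a store like this captures more context than a raw server log, which is also why it raises the privacy questions discussed next.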
The analysis of web log information presents some privacy challenges that must
be handled appropriately. IP addresses that identify computers can be used to track
web requests to a specific computer, which may be used by a single person. Analyses
that track blog posts, comments, purchases, or other activity associated with a user
login can also be used to collect a great deal of potentially sensitive information.
Before collecting any such data, you should make sure that your websites have
privacy policies and other information explaining the data that you are collecting and
how you will use it. Additional steps that you might take to protect user privacy
include taking careful control of the logs and other repositories of this data,
reporting information only in aggregate form (instead of in a form that could identify
individuals), and destroying the data when your analysis is complete. As these privacy
questions may raise concerns regarding informed consent and appropriate treatment
of research participants, some web log analyses might require approval from your
institutional review board (see Chapter 15).
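One way to approach aggregate-only reporting is sketched below in Python. It replaces each IP address with a keyed hash before counting and suppresses pages seen by fewer than a minimum number of distinct visitors. The function names, the 16-character token length, and the `min_count` threshold are assumptions for this example. Keyed hashing is pseudonymization rather than true anonymization, so the key and the raw logs still need the careful control and eventual destruction described above.

```python
import hashlib
import hmac
import secrets
from collections import Counter


def pseudonymize(ip, key):
    """Replace an IP address with a keyed hash so raw addresses are not reported."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def aggregate_page_views(records, key, min_count=2):
    """Report per-page totals only, dropping pages with too few distinct visitors.

    `records` is an iterable of (ip, path) pairs; `min_count` is an assumed
    suppression threshold, not an established standard.
    """
    views = Counter()
    visitors = {}
    for ip, path in records:
        views[path] += 1
        visitors.setdefault(path, set()).add(pseudonymize(ip, key))
    return {
        path: {"views": total, "visitors": len(visitors[path])}
        for path, total in views.items()
        if len(visitors[path]) >= min_count
    }


if __name__ == "__main__":
    # A fresh random key per analysis; discarding it afterwards prevents
    # anyone from recomputing the same tokens from known addresses.
    key = secrets.token_bytes(32)
    sample = [
        ("192.0.2.1", "/home"),
        ("192.0.2.2", "/home"),
        ("192.0.2.1", "/home"),
        ("192.0.2.3", "/account"),
    ]
    print(aggregate_page_views(sample, key))
```

In the sample data, "/account" is omitted from the report because only one visitor requested it, which reduces the chance that a published count can be traced back to an individual.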
Web server logs have been the subject of many research studies over the years.
The development of visualization tools to interpret these logs has been a recurring
theme from the 1990s through more recent work (Pirolli and Pitkow,
1999; Hochheiser and Shneiderman, 2001; Malik and Koh, 2016). Web search logs,
especially those from search engines, have proven to be a fruitful data source
for studying how users conduct searches and interpret results (White, 2013; White
and Hassan, 2014), including for specific tasks such as searching for medical
information (White and Horvitz, 2009). For more on the use of web search logs to study
user behavior, see Chapter 14. Web log analysis studies often
use multiple complementary datasets to confirm and supplement log data. A study
of the social network Google+ combined log analysis with surveys and interviews to