10.55.10.14 - - [13/Jul/2007:13:42:10 -0400] "GET /homepage/classes/spring07/686/index.html HTTP/1.1" 200 8623
10.55.10.14 - - [13/Jul/2007:13:48:32 -0400] "GET /homepage/classes/spring07/686/schedule.html HTTP/1.1" 200 16095
10.55.10.14 - - [13/Jul/2007:13:48:33 -0400] "GET /homepage/classes/spring07/686/readings.html HTTP/1.1" 200 14652

FIGURE 12.2
Log file entries, containing host IP address, timestamp, request, status code, and number of bytes.
10.55.10.14 - - [13/Jul/2007:13:48:33 -0400] "GET /homepage/classes/spring07/686/readings.html HTTP/1.1" 200 14652 "http://10.55.10.128/homepage/classes/spring07/686/schedule.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20051202 Fedora/1.5-0.fc4 Firefox/1.5"

FIGURE 12.3
A detailed version of the last entry from Figure 12.2, including the referrer and the user agent.
changes can be made regarding the recording of the referrer, the user agent, or other
fields. For many studies, it may be useful to create a special-purpose log in parallel
with a traditional access log. The customized log file provides the information needed
for your study, without interfering with access logs that might be used for ongoing
website maintenance. Customized log file formats may require corresponding changes to the web server configuration or to the log analysis tools, but these changes are generally not hard to make.
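As a rough sketch of how little is involved, a few lines of Python can split entries like those in Figures 12.2 and 12.3 into named fields. The regular expression and field names below are illustrative choices of our own, not part of any particular server or analysis tool:

import re

# Illustrative pattern for entries like those in Figures 12.2 and 12.3;
# the field names are our own choices, not a standard.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dictionary of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

entry = parse_line('10.55.10.14 - - [13/Jul/2007:13:42:10 -0400] '
                   '"GET /homepage/classes/spring07/686/index.html HTTP/1.1" 200 8623')
print(entry["host"], entry["request"], entry["status"])

Once each entry has been reduced to named fields like these, tallying visits per page, per host, or per day is a matter of a simple loop over the file.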
Most web servers generate error logs in addition to access logs. The list of re-
quests that generated server errors can be useful for identifying problems with a site
design, such as links to nonexistent pages or resources. Check your server documen-
tation for details.
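Requests for nonexistent pages also show up in the access log itself as entries with a 404 ("not found") status code. A short sketch in the same spirit, reusing the illustrative parse_line function above and a stand-in filename of access.log, might tally them as follows:

from collections import Counter

# Count requests that returned status 404 (Not Found); "access.log" is an
# illustrative filename, and parse_line comes from the earlier sketch.
not_found = Counter()
with open("access.log") as log:
    for line in log:
        entry = parse_line(line)
        if entry and entry["status"] == "404":
            not_found[entry["request"]] += 1

for request, count in not_found.most_common(10):
    print(count, request)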
As web logs can become quite voluminous, careful storage and management are important. Numerous software tools extract information from log files for static reports or interactive analysis; several approaches to this analysis are described in this chapter.
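As a minimal example of the kind of summary found in such static reports, the same parsing approach can be used to count requests per day (again assuming the illustrative parse_line function and access.log filename from the earlier sketches):

from collections import Counter

# Requests per day, taken from the date portion of each timestamp
# (e.g. "13/Jul/2007:13:42:10 -0400" -> "13/Jul/2007").
daily_requests = Counter()
with open("access.log") as log:
    for line in log:
        entry = parse_line(line)
        if entry:
            day = entry["time"].split(":", 1)[0]
            daily_requests[day] += 1

for day, count in daily_requests.most_common():
    print(count, day)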
Logs from publicly accessible sites may include regular and repeated visits from web robots: automated programs used by search engines and other services to retrieve web pages, follow links, and analyze web content. Before using the logs of your publicly accessible site
for research purposes, you might consider using the robot exclusion protocol (Koster,
2007) to discourage these automated tools. This protocol is very straightforward: all
you need to do is to place one simple file in the root directory of your server. Polite
bots will not make further requests once they see this file. As a result, the proportion
of your log entries generated by these crawlers will be reduced, leaving you with
more of the good stuff—visits from human users. As this step may have the (possibly
undesirable) effect of reducing your site's visibility to search engines, you may wish
to exclude robots for short periods of time while you collect data. Once your data
collection is complete, you can disable your robot exclusion measures, thus allowing
search engines to index your site and maintain your visibility.
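The file in question is named robots.txt, and a minimal version that asks all compliant crawlers to stay away from the entire site contains just two lines:

User-agent: *
Disallow: /

Removing the file (or the Disallow rule) once data collection is complete restores normal crawling and indexing.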
Note that web requests—and therefore web logs—are not limited solely to re-
cording clicks in web browsers. Many web sites provide Application Programming