10.55.10.14 - - [13/Jul/2007:13:42:10 -0400] "GET /homepage/classes/spring07/686/index.html HTTP/1.1" 200 8623
10.55.10.14 - - [13/Jul/2007:13:48:32 -0400] "GET /homepage/classes/spring07/686/schedule.html HTTP/1.1" 200 16095
10.55.10.14 - - [13/Jul/2007:13:48:33 -0400] "GET /homepage/classes/spring07/686/readings.html HTTP/1.1" 200 14652

FIGURE 12.2
Log file entries, containing host IP address, timestamp, request, status code, and number of bytes.

10.55.10.14 - - [13/Jul/2007:13:48:33 -0400] "GET /homepage/classes/spring07/686/readings.html HTTP/1.1" 200 14652 "http://10.55.10.128/homepage/classes/spring07/686/schedule.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20051202 Fedora/1.5-0.fc4 Firefox/1.5"

FIGURE 12.3
A detailed version of the last entry from Figure 12.2, including the referrer and the user agent.

changes can be made regarding the recording of the referrer, the user agent, or other fields. For many studies, it may be useful to create a special-purpose log in parallel with a traditional access log. The customized log provides the information needed for your study without interfering with access logs that might be used for ongoing website maintenance. Customized log formats may require configuring the web server software or the log analysis tools, but this is generally straightforward.
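For instance, on a web server that accepts Apache-style configuration directives (an assumption here; other servers use different mechanisms), a parallel special-purpose log can be declared with a format nickname and file name of your choosing; the ones below are purely illustrative:

    # Record referrer and user agent in a separate log used only for the study.
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" study_format
    CustomLog logs/study_access.log study_format

Any access log configured elsewhere in the server continues to be written unchanged, so routine maintenance reporting is unaffected.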
   Most web servers generate error logs in addition to access logs. The list of requests that generated server errors can be useful for identifying problems with a site design, such as links to nonexistent pages or resources. Check your server documentation for details.
   As web logs can become quite voluminous, careful storage and management are important. Numerous software tools extract information from log files for static reports or interactive analysis; several approaches to this analysis are described in this chapter.
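As a minimal sketch of this kind of extraction (assuming Python and combined-format entries like those in Figures 12.2 and 12.3; the log file name is hypothetical), the following script parses each entry with a regular expression and tallies status codes:

    import re
    from collections import Counter

    # Fields of an Apache-style combined entry:
    # host identd user [timestamp] "request" status bytes "referrer" "user agent"
    LOG_PATTERN = re.compile(
        r'(?P<host>\S+) (?P<identd>\S+) (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
        r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
    )

    def parse_line(line):
        """Return a dictionary of fields for one entry, or None if it does not match."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    status_counts = Counter()
    with open("access.log") as log_file:   # hypothetical file name
        for line in log_file:
            entry = parse_line(line)
            if entry is not None:
                status_counts[entry["status"]] += 1

    # Many 404 responses, for example, suggest links to nonexistent pages.
    print(status_counts.most_common())

A tally like this is the simplest kind of static report; the same parsed fields can feed the more detailed analyses described later in the chapter.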
   Logs from publicly accessible sites may include regular and repeated visits from web robots: automated programs that search engines and other services use to retrieve web pages, follow links, and analyze web content. Before using the logs of your publicly accessible site for research purposes, you might consider using the robot exclusion protocol (Koster, 2007) to discourage these automated visitors. The protocol is very straightforward: all you need to do is place one simple file, named robots.txt, in the root directory of your server. Polite robots will not make further requests once they see this file. As a result, the proportion of your log entries generated by these crawlers will be reduced, leaving you with more of the good stuff: visits from human users. As this step may have the (possibly undesirable) effect of reducing your site's visibility to search engines, you may wish to exclude robots only for short periods while you collect data. Once your data collection is complete, you can disable your robot exclusion measures, allowing search engines to index your site again and maintain your visibility.
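A minimal robots.txt that asks all compliant robots to avoid the entire site looks like the following; it can be relaxed or removed once data collection ends:

    # robots.txt, served from the root of the site (e.g., http://www.example.com/robots.txt)
    # "User-agent: *" addresses every robot; "Disallow: /" asks it to avoid all paths.
    User-agent: *
    Disallow: /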
   Note that web requests, and therefore web logs, are not limited to recording clicks in web browsers. Many web sites provide Application Programming