
14.2 Online research




                  Twitter, Flickr, and other sites and tracked the utilization and dissemination of content
                  over time as a means of examining the impact of their efforts (Winandy et al.,
                  2016). Such focused efforts have the advantage of generally being feasible with
                  information available to account holders on these sites. Other similarly small studies
                  can be conducted through standard interactions, as in a study of YouTube video blogs
                  for illness support: researchers manually searched YouTube to identify videos of
                  interest and reviewed transcripts and comments on those videos to see how they were
                  used for social support (Huh et al., 2014).
                     For larger studies, APIs provided by vendors are often the most effective
                  means of capturing data. Twitter APIs have been used to access data for many
                  studies, including investigation of spammers' social networks (Yang et al., 2012),
                  extraction of sporting event summaries from Tweets (Nichols et al., 2012), and
                  understanding the spread of information during times of social upheaval (Starbird
                  and Palen, 2012). Twitter data has been used to explore patterns of discussion
                  during emergency situations (Cassa et al., 2013), smoking behavior (Myslín et al.,
                  2013), and many other health-related topics. Facebook has also been the subject of
                  significant research interest, including studies of strengths of relationships (Xiang
                  et al., 2010), relationships between social network use and well-being (Burke
                  et al., 2010), and information diffusion (Bakshy et al., 2012), to name just a few.
                  However, as for-profit businesses, Twitter and Facebook consider their data to be
                  valuable, making only a subset available through APIs, with access to larger data
                  sets possibly available for a fee (Finley, 2014). Twitter has also made limited access
                  to their archives of historical content available to researchers through a data
                  grant program (Kirkorian, 2014). Largely as a result of restrictions on data availability,
                  this research is often conducted by researchers employed by the social
                  networking sites being studied (Xiang et al., 2010; Burke et al., 2010; Bakshy
                  et al., 2012).
                     Bulk datasets often make good data sources for studies of interaction patterns.
                  Studies of Wikipedia trends have relied on bulk data downloads providing snapshots
                  of site content at specific points in time (Viégas et al., 2007b). Such datasets can be
                  invaluable when available, but the volume of content can also be daunting. Sampling
                  of a smaller subset, either randomly, by time, or by content, can be an appropriate
                  means of identifying a more manageable dataset. The Enron corpus, a database of
                  several hundred thousand email messages from the failed energy company, provides
                  an uncommon view into the electronic communications in a large company. This
                  dataset has been analyzed in dozens of studies, addressing questions such as the
                  identification of words and phrases used to indicate power relations in the corporate
                  structure (Gilbert, 2012).
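                     The three sampling strategies mentioned above (random, by time, or by content)
                  might be sketched as follows. This is a minimal illustration, not a method taken
                  from any of the cited studies: the record layout (a list of dictionaries with
                  hypothetical "timestamp" and "text" fields), the function names, and the tiny
                  example dataset are all assumptions made for the example.

```python
import random
from datetime import datetime


def sample_random(records, k, seed=0):
    """Return up to k records chosen uniformly at random (seeded so the sample is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def sample_by_time(records, start, end):
    """Return records whose timestamp falls in the half-open window [start, end)."""
    return [r for r in records if start <= r["timestamp"] < end]


def sample_by_content(records, keyword):
    """Return records whose text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [r for r in records if needle in r["text"].lower()]


# Hypothetical records standing in for a much larger bulk dataset.
records = [
    {"timestamp": datetime(2001, 1, 5), "text": "Quarterly budget review attached"},
    {"timestamp": datetime(2001, 1, 20), "text": "Lunch on Friday?"},
    {"timestamp": datetime(2001, 2, 2), "text": "Revised budget numbers"},
    {"timestamp": datetime(2001, 3, 15), "text": "Meeting moved to 3pm"},
]

random_subset = sample_random(records, 2)
january_subset = sample_by_time(records, datetime(2001, 1, 1), datetime(2001, 2, 1))
budget_subset = sample_by_content(records, "budget")
```

                  Whichever strategy is used, recording how the subset was drawn (the seed, the
                  time window, or the keyword) makes it easier for others to reproduce the sample.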
                     As with social network data, search engine research is perhaps most easily conducted
                  by scientists working in the research labs of prominent search engine firms
                  like Google (Ginsberg et al., 2009) and Microsoft (Huang et al., 2011, 2012; White
                  and Horvitz, 2009; White, 2013; White et al., 2013; White and Hassan, 2014). See
                  the “Google Flu” Sidebar for a discussion of the promises and challenges of log
                  analysis, as illustrated by the high-profile case of Google's Flu prediction analysis.