Twitter, Flickr, and other sites and tracked the utilization and dissemination of
content over time as a means of examining the impact of their efforts (Winandy et al.,
2016). Such focused efforts have the advantage of generally being feasible with
information available to account holders on these sites. Other similarly small studies
can be conducted through standard interactions, as in a study of YouTube video blogs
for illness support: researchers manually searched YouTube to identify videos of
interest and reviewed transcripts and comments on those videos to see how they were
used for social support (Huh et al., 2014).
For larger studies, APIs provided by vendors are often the most effective
means of capturing data. Twitter APIs have been used to access data for many
studies, including investigation of spammers' social networks (Yang et al., 2012),
extraction of sporting event summaries from Tweets (Nichols et al., 2012), and
understanding the spread of information during times of social upheaval (Starbird
and Palen, 2012). Twitter data has been used to explore patterns of discussion
during emergency situations (Cassa et al., 2013), smoking behavior (Myslín et al.,
2013), and many other health-related topics. Facebook has also been the subject of
significant research interest, including studies of strengths of relationships (Xiang
et al., 2010), relationships between social network use and well-being (Burke
et al., 2010), and information diffusion (Bakshy et al., 2012), to name just a few.
However, as for-profit businesses, Twitter and Facebook consider their data to be
valuable, making only a subset available through APIs, with access to larger
datasets possibly available for a fee (Finley, 2014). Twitter has also made limited
access to its archives of historical content available to researchers through a data
grant program (Kirkorian, 2014). Largely as a result of these restrictions on data
availability, this research is often conducted by researchers employed by the social
networking sites being studied (Xiang et al., 2010; Burke et al., 2010; Bakshy
et al., 2012).
Bulk datasets often make good data sources for studies of interaction patterns.
Studies of Wikipedia trends have relied on bulk data downloads providing snapshots
of site content at specific points in time (Viégas et al., 2007b). Such datasets can be
invaluable when available, but the volume of content can also be daunting. Sampling
a smaller subset, whether randomly, by time, or by content, can be an appropriate
means of identifying a more manageable dataset. The Enron corpus, a database of
several hundred thousand email messages from the failed energy company, provides
an uncommon view into the electronic communications in a large company. This
dataset has been analyzed in dozens of studies, addressing questions such as the
identification of words and phrases used to indicate power relations in the corporate
structure (Gilbert, 2012).
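The three sampling strategies mentioned above (random, by time, and by content) can be illustrated with a minimal Python sketch. The records, field names (`id`, `timestamp`, `text`), and helper functions here are hypothetical stand-ins for items drawn from a bulk download; they do not come from any of the studies cited, and a real dataset would need its own parsing and selection criteria.

```python
import random
from datetime import datetime

# Hypothetical records standing in for items from a bulk dataset
# (for example, page edits or email messages). Field names are assumptions.
records = [
    {
        "id": i,
        "timestamp": datetime(2007, 1 + i % 12, 1),
        "text": "edit about vandalism" if i % 5 == 0 else "routine edit",
    }
    for i in range(1000)
]


def sample_randomly(items, k, seed=0):
    """Draw k items uniformly at random without replacement.

    A fixed seed makes the sample reproducible, so the same subset
    can be regenerated when the analysis is repeated.
    """
    rng = random.Random(seed)
    return rng.sample(items, k)


def sample_by_time(items, start, end):
    """Keep items whose timestamp falls in the half-open window [start, end)."""
    return [r for r in items if start <= r["timestamp"] < end]


def sample_by_content(items, keyword):
    """Keep items whose text contains the keyword, ignoring case."""
    keyword = keyword.lower()
    return [r for r in items if keyword in r["text"].lower()]


random_subset = sample_randomly(records, 50)
march_subset = sample_by_time(records, datetime(2007, 3, 1), datetime(2007, 4, 1))
vandalism_subset = sample_by_content(records, "vandalism")

print(len(random_subset), len(march_subset), len(vandalism_subset))
```

Each strategy carries its own trade-off. A random sample tends to preserve the overall mix of the full dataset, while time-based and content-based subsets are deliberately selective, so conclusions drawn from them may apply only to the chosen period or topic.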
As with social network data, search engine research is perhaps most easily
conducted by scientists working in the research labs of prominent search engine firms
like Google (Ginsberg et al., 2009) and Microsoft (Huang et al., 2011, 2012; White
and Horvitz, 2009; White, 2013; White et al., 2013; White and Hassan, 2014). See
the “Google Flu” Sidebar for a discussion of the promises and challenges of log
analysis, as illustrated by the high-profile case of Google's Flu prediction analysis.