14.3 Human computation
Table 14.1 Quality Control Measures for Crowdsourcing Studies (strategies and proposed approaches)

Question design
- Include questions with known answers (Kittur et al., 2008)
- Make accurate answers easy to provide (Kittur et al., 2008)

Study design
- Develop predictive models based on question types to determine how many responses are needed to ensure high-quality answers for each question type (Barowy et al., 2016)
- Use micro-diversions or other distractors to offset declines in response quality as users get bored or tired (Dai et al., 2015)

Task performance data analysis
- Look for patterns indicating answers that might have been faked or rushed, including repeated free text or questions answered too quickly (Kittur et al., 2008)
- Use task completion metadata to develop predictive models of individual workers (Ipeirotis et al., 2010) and tasks (Rzeszotarski and Kittur, 2011; Zhu et al., 2012)
Aniket Kittur, Ed Chi, and Bongwon Suh (Kittur et al., 2008) made three suggestions for designing high-quality crowdsourcing tasks. (1) Each task should include questions with known answers that can be easily checked. Asking participants to count the number of images on a page, or to answer a simple question based on the text of the page, can help determine whether they are answering seriously or simply rushing through. (2) Accurate answers should be no harder to provide than rushed, inaccurate answers. For example, a task asking users to summarize a site might be easily subverted by short, one-word answers, but an explicit requirement that users provide a certain number of keywords describing the content might be easier to complete accurately. (3) Look for other ways to identify low-quality answers, such as tasks that are completed too quickly or answers that are repeated across multiple tasks (Kittur et al., 2008). Having multiple users complete each task and using agreement between their results as a measure of quality (just as described earlier for CAPTCHA) is another possibility, but redundancy can be expensive (Ipeirotis et al., 2010). Alternatively, models of the complexity of different response types (checkboxes, radio buttons, free text) can be used to predict the number of responses needed to reach a desired quality level with high confidence (Barowy et al., 2016). “Micro-diversions,” games or other entertaining distractions designed to disrupt the monotony of performing many repeated tasks over long periods of time, might also help improve response quality (Dai et al., 2015).
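
To make these checks concrete, the sketch below implements a few of the heuristics just described: a known-answer check, a completion-time floor, a repeated-text check, and a simple majority vote across redundant workers. The field names (gold_question_id, seconds_taken, free_text, and so on) are hypothetical placeholders for whatever a particular platform's task export actually contains, and the thresholds are illustrative assumptions.

```python
# A minimal sketch of Kittur-style quality checks on crowdsourced responses.
# Field names and thresholds are assumptions, not any platform's real schema.
from collections import Counter

MIN_SECONDS = 20          # assumed floor for a serious attempt at the task
MIN_DISTINCT_WORDS = 3    # guards against one-word or copy-pasted free text

def looks_suspect(response, gold_answers):
    """Flag a single response that fails any of the basic checks."""
    # (1) Verifiable question with a known answer.
    gold_q = response["gold_question_id"]
    if response["gold_answer"] != gold_answers[gold_q]:
        return True
    # (3) Answered too quickly to have been read carefully.
    if response["seconds_taken"] < MIN_SECONDS:
        return True
    # (3) Free text that is too short or repeats a single token.
    words = response["free_text"].lower().split()
    if len(set(words)) < MIN_DISTINCT_WORDS:
        return True
    return False

def majority_label(responses):
    """Redundancy check: accept a label only if most workers agree on it."""
    counts = Counter(r["answer"] for r in responses)
    label, votes = counts.most_common(1)[0]
    return label if votes > len(responses) / 2 else None
```

In practice, the thresholds would be tuned against a hand-checked sample of responses rather than fixed in advance, and failing responses might be reviewed rather than discarded outright.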
Other studies have used task completion metadata to develop predictive models suitable for identifying invalid answers. Noting that Mechanical Turk collects detailed data on each task, including start and end times, Zhu and colleagues built predictive models based on initial estimates of task performance and data from actual tasks, and then used these models to classify subsequent responses as either valid or invalid (Zhu et al., 2012). Other efforts have explored building models of individual workers (Ipeirotis et al., 2010) and using JavaScript to capture fine-grained interaction data, such as mouse movements, scrolling, and key presses, as the basis for task-level predictive models (Rzeszotarski and Kittur, 2011).
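
As a rough illustration of this kind of metadata-based screening, the sketch below trains a generic classifier on a few hand-labeled responses described only by their completion metadata and then applies it to new responses. It is a stand-in for the general approach, not a reimplementation of the models in Zhu et al. (2012) or Ipeirotis et al. (2010); the features (time on task, click and keypress counts, text length) are assumptions about what such logs might contain.

```python
# A rough sketch of screening responses with a classifier trained on
# task-completion metadata. Features and labels here are invented examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [seconds on task, clicks, keypresses, characters of free text];
# labels come from an initial hand-checked batch (1 = valid, 0 = invalid).
X_train = np.array([
    [95, 14, 220, 180],
    [110, 9, 310, 240],
    [12, 2, 5, 3],
    [8, 1, 0, 0],
], dtype=float)
y_train = np.array([1, 1, 0, 0])

model = LogisticRegression().fit(X_train, y_train)

# Classify a new batch of responses from their completion metadata alone.
X_new = np.array([[7, 1, 4, 2], [102, 11, 250, 200]], dtype=float)
print(model.predict(X_new))  # flags the first as invalid, the second as valid
```

A model like this is only as good as its labeled seed data, which is why the approaches above pair it with an initial, carefully checked sample of responses.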