Notes

1 There are a number of new and emerging academic disciplines developing in this area, most notably computational sociology and digital anthropology.
2 It may also partly be a reflection of the network effect of social networks. For example, given the high proportion of English on Twitter, non-English speakers may feel compelled to use English as well in order to take part in conversations on the network.
3 Attitudinal research itself can often change the context of what is said, and in doing so introduce ‘observation’ or ‘measurement’ effects. This is ‘reactivity’ – the phenomenon that occurs when individuals alter their behaviour because they are aware that they are being observed. People involved in a poll often change their behaviour in consistent ways: to appear more acceptable in general, more acceptable to the researcher specifically, or to meet what they believe are the expectations of the observers. See PP Heppner, BE Wampold and DM Kivlighan, Research Design in Counseling, Thomson, 2008, p 331.
4 See BG Glaser and AL Strauss, The Discovery of Grounded Theory, New Brunswick: AldineTransaction, 1967.
5 These are the six principles: research should be designed, reviewed and undertaken to ensure integrity, quality and transparency; research staff and participants must normally be informed fully about the purpose, methods and intended possible uses of the research, what their participation in the research entails and what risks, if any, are involved; the confidentiality of information supplied by research participants and the anonymity of respondents must be respected; research participants must take part voluntarily, free from any coercion; harm to research participants and researchers must be avoided in all instances; and the independence of research must be clear, and any conflicts of interest or partiality must be explicit. See ESRC, ‘Framework for Research Ethics’, latest version, Economic and Social Research Council Sep 2012,
www.esrc.ac.uk/about-esrc/information/research-ethics.aspx (accessed 13 Apr 2014).
6 However, a growing group of internet researchers has issued various types of guidance of its own. See AoIR, Ethical Decision-Making and Internet Research: Recommendations from the AoIR Ethics Working Committee (Version 2.0), Association of Internet Researchers, 2012, p 2.
7 European Commission, Eurobarometer survey on trust in institutions, Nov 2013,
http://ec.europa.eu/public_opinion/cf/showchart_column.cfm?keyID=2189&nationID=6,3,15,&startdate=2012.05&enddate=2013.11 (accessed 24 Apr 2014); I van Biezen, P Mair and T Poguntke, ‘Going, going… gone? The decline of party membership in contemporary Europe’, European Journal of Political Research 51, no 1, 2012, pp 24–56.
8 J Birdwell, F Farook and S Jones, Trust in Practice, London: Demos, 2009.
9 European Commission, ‘Public opinion in the European Union: first results’, Standard Eurobarometer 78, Dec 2012,
http://ec.europa.eu/public_opinion/archives/eb/eb78/eb78_first_en.pdf (accessed 10 Apr 2014).
10 Pew Research Center, ‘The new sick man of Europe: the European Union’, 13 May 2013,
www.pewglobal.org/2013/05/13/the-new-sick-man-of-europe-the-european-union/ (accessed 10 Apr 2014).
11 European Commission, ‘Two years to go to the 2014 European elections’, Eurobarometer 77, no 4, 2012, www.europarl.europa.eu/pdf/eurobarometre/2012/election_2012/eb77_4_ee2014_synthese_analytique_en.pdf (accessed 11 Apr 2014).
12 P Huyst, ‘The Europeans of tomorrow: researching European identity among young Europeans’, Centre for EU Studies, Ghent University, nd,
http://aei.pitt.edu/33069/1/huyst._petra.pdf (accessed 11 Apr 2014).
13 M Henn and N Foard, ‘Young people, political participation and trust in Britain’, Parliamentary Affairs 65, no 1, 2012.
14 Eg J Sloam, ‘Rebooting democracy: youth participation in politics in the UK’, Parliamentary Affairs, 60, 2007.
15 D Zeng et al, ‘Social media analytics and intelligence: guest editors’ introduction’, in Proceedings of the IEEE Computer Society, Nov–Dec 2010, p 13.
16 Emarketer, ‘Where in the world are the hottest social networking countries?’, 29 Feb 2012,
www.emarketer.com/Article/Where-World-Hottest-Social-Networking-Countries/1008870 (accessed 11 Apr 2014).
17 Social-media-prism, ‘The conversation’, nd,
http://spirdesign.no/wp-content/uploads/2010/11/social-media-prism.jpg (accessed 11 Apr 2014).
18 F Ginn, ‘Global social network stats confirm Facebook as largest in US & Europe (with 3 times the usage of 2nd place)’, Search Engine Land, 17 Oct 2011, http://searchengineland.com/global-social-network-stats-confirm-facebook-as-largest-in-u-s-europe-with-3-times-the-usage-of-2nd-place-97337 (accessed 11 Apr 2014).
19 Emarketer, ‘Twitter is widely known in France, but garners few regular users’, 30 Apr 2013,
www.emarketer.com/Article/Twitter-Widely-Known-France-Garners-Few-Regular-Users/1009851 (accessed 11 Apr 2014).
20 For a map of current Twitter languages and demographic data, see E Fischer, ‘Language communities of Twitter’, 24 Oct 2011,
www.flickr.com/photos/walkingsf/6277163176/in/photostream/lightbox/ (accessed 10 Apr 2014); DMR, ‘(March 2014) by the numbers: 138 amazing Twitter statistics’, Digital Market Ramblings, 23 Mar 2014, http://expandedramblings.com/index.php/march-2013-by-the-numbers-a-few-amazing-twitter-stats/ (accessed 10 Apr 2014).
21 Slideshare, ‘Media measurement: social media trends by age and country’, 2011,
www.slideshare.net/MML_Annabel/media-measurement-social-media-trends-by-country-and-age (accessed 11 Apr 2014).
22 Emarketer, ‘Twitter grows stronger in Mexico’, 24 Sep 2012,
www.emarketer.com/Article/Twitter-Grows-Stronger-Mexico/1009370 (accessed 10 Apr 2014); Inforrm’s Blog, ‘Social media: how many people use Twitter and what do we think about it?’, International Forum for Responsible Media Blog, 16 Jun 2013,
http://inforrm.wordpress.com/2013/06/16/social-media-how-many-people-use-twitter-and-what-do-we-think-about-it/ (accessed 11 Apr 2014).
23 Eg M Bamburic, ‘Twitter: 500 million accounts, billions of tweets, and less than one per cent use their location’, 2012,
http://betanews.com/2012/07/31/twitter- ... naccounts-billions-of-tweets-and-less-than-one-per-cent-use-their-location/ (accessed 11 Apr 2014).
24 Beevolve, ‘Global heatmap of Twitter users’, 2012, www.beevolve.com/twitter-statistics/#a3 (accessed 11 Apr 2014).
25 European Commission, ‘Political participation and EU citizenship: perceptions and behaviours of young people’, nd,
http://eacea.ec.europa.eu/youth/tools/documents/perception-behaviours.pdf (accessed 11 Apr 2014).
26 S Creasey, ‘Perpetual engagement: the potential and pitfalls of using social media for political campaigning’, London School of Economics, 2011,
http://blogs.lse.ac.uk/polis/files/2011/06/PERPETUAL-ENGAGEMENT-THE-POTENTIAL-AND-PITFALLS-OF-USING-SOCIAL-MEDIA-FOR-POLITICAL-CAMPAIGNING.pdf (accessed 29 Apr 2014).
27 WH Dutton and G Blank, Next Generation Users: The internet in Britain, Oxford Internet Survey 2011 report, 2011,
www.oii.ox.ac.uk/publications/oxis2011_report.pdf (accessed 3 Apr 2013).
28 Ibid.
29 J Bartlett et al, Virtually Members: The Facebook and Twitter followers of UK political parties, London: Demos, 2013.
30 J Bartlett et al, New Political Actors in Europe: Beppe Grillo and the M5S, London: Demos, 2012; J Birdwell and J Bartlett, Populism in Europe: CasaPound, London: Demos, 2012; J Bartlett, J Birdwell and M Littler, The New Face of Digital Populism, London: Demos, 2011.
31 C McPhedran, ‘Pirate Party makes noise in German politics’, Washington Times, 10 May 2012,
www.washingtontimes.com/news/2012/may/10/upstart-party-making-noise-in-german-politics/?page=all (accessed 11 Apr 2014).
32 T Postmes and S Brunsting, ‘Collective action in the age of the internet: mass communication and online mobilization’, Social Science Computer Review 20, issue 3, 2002; M Castells, ‘The mobile civil society: social movements, political power and communication networks’ in M Castells et al, Mobile Communication and Society: A global perspective, Cambridge MA: MIT Press, 2007.
33 G Blakeley, ‘Los Indignados: a movement that is here to stay’, Open Democracy, 5 Oct 2012,
www.opendemocracy.net/georgina-blakeley ... smovement-that-is-here-to-stay (accessed 11 Apr 2014).
34 N Vallina-Rodriguez et al, ‘Los Twindignados: the rise of the Indignados Movement on Twitter’, in Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on Social Computing (SocialCom),
www.cl.cam.ac.uk/~nv240/papers/twindignados.pdf (accessed 11 Apr 2014).
35 GT Madonna and M Young, ‘The first political poll’, Politically Uncorrected, 18 Jun 2002,
www.fandm.edu/politics/politically-uncorrected-column/2002-politically-uncorrected/the-first-political-poll (accessed 11 Apr 2014).
36 For example, are federal expenditures for relief and recovery too great, too little, or about right? Responses were as follows: 60 per cent too great; 9 per cent too little; 31 per cent about right. See ‘75 years ago, the first Gallup Poll’, Polling Matters, 20 Oct 2010,
http://pollingmatters.gallup.com/2010/10/75-years-ago-first-gallup-poll.html (accessed 11 Apr 2014).
37 Thereby avoiding a number of measurement biases often present during direct solicitation of social information, including memory bias, questioner bias and social acceptability bias. Social media, by contrast, is often a completely unmediated spectacle.
38 V Mayer-Schönberger and K Cukier, Big Data, London: John Murray, 2013.
39 Early and emerging examples of Twitterology were presented at the International Conference on Web Search and Data Mining 2008. It is important to note that there is a large difference between current capabilities and published capabilities. We do not have access to a great many use cases – including novel techniques, novel applications of techniques or substantive findings – that are either under development or extant but unpublished. Academic peer-reviewed publishing can take anywhere from six months to two years, while many commercial capabilities are proprietary. Furthermore, much social media research is conducted either by or on behalf of the social media platforms themselves, and never made public. The growing distance between development and publishing, and the increasing role of proprietary methodologies and private sector ownership and exploitation of focal data sets, are important characteristics of the social media research environment. Good examples include P Carvalho et al, ‘Liars and saviors in a sentiment annotated corpus of comments to political debates’ in Proceedings of the Association for Computational Linguistics, 2011, pp 564–68; N Diakopoulos and D Shamma, ‘Characterising debate performance via aggregated Twitter sentiment’ in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2010, pp 1195–8; S Gonzalez-Bailon, R Banchs and A Kaltenbrunner, ‘Emotional reactions and the pulse of public opinion: measuring the impact of political events on the sentiment of online discussions’, ArXiv e-prints, 2010, arXiv 1009.4019; J Huang et al, ‘Conversational tagging in Twitter’ in Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, 2010, pp 173–8; M Marchetti-Bowick and N Chambers, ‘Learning for microblogs with distant supervision: political forecasting with Twitter’ in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp 603–12; B O’Connor et al, ‘From tweets to polls: linking text sentiment to public opinion time series’ in Proceedings of the AAAI Conference on Weblogs and Social Media, 2010, pp 122–9; A Pak and P Paroubek, ‘Twitter as a corpus for sentiment analysis and opinion mining’ in Proceedings of the Seventh International Conference on Language Resources and Evaluation, 2010; C Tan et al, ‘User-level sentiment analysis incorporating social networks’ in Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2011; A Tumasjan et al, ‘Election forecasts with Twitter: how 140 characters reflect the political landscape’, Social Science Computer Review, 2010. See also RE Wilson, SD Gosling and LT Graham, ‘A review of Facebook research in the social sciences’, Perspectives on Psychological Science 7, no 3, 2012, pp 203–20.
40 Early and emerging examples of Twitterology were presented at the International Conference on Web Search and Data Mining, 2008.
41 European Commission, ‘Europeans and their languages’, Special Eurobarometer 243, Feb 2006,
http://ec.europa.eu/public_opinion/archives/ebs/ebs_243_sum_en.pdf (accessed 11 Apr 2014).
42 It is also possible to acquire a large amount of social media data via licensed data providers. These are often third party resellers.
43 Some APIs can deliver historical data, stretching back months or years, while others only deliver very recent content. Some deliver a random selection of social media data taken from the platform; others deliver all data that match the queries – usually keywords selected by the analyst to be present in the post or tweet – provided by the researcher. In general, all APIs produce data in a consistent, ‘structured’ format, in large quantities.
44 Twitter has three different APIs available to researchers. The search API returns a collection of relevant tweets matching a specified query (word match) from an index that extends up to roughly a week into the past. The filter API continually delivers to the researcher, in real time, tweets that contain one of a number of specified keywords, as they are posted. The sample API returns a random sample of a fixed percentage of all public tweets in real time. Each of these APIs (consistent with the vast majority of social media platform APIs) is constrained in the amount of data it will return. A public, free ‘spritzer’ account caps the search API at 180 calls every 15 minutes with up to 100 tweets returned per call; the filter API caps the number of matching tweets returned to no more than 1 per cent of the total stream in any given second; and the sample API returns a random 1 per cent of the tweet stream. Others use white-listed research accounts (known informally as ‘the garden hose’), which have 10 per cent rather than 1 per cent caps on the filter and sample APIs, while still others use the commercially available ‘firehose’ of 100 per cent of daily tweets. With daily tweet volumes averaging roughly 400 million, many researchers do not find that the spritzer account restrictions limit the number of tweets they collect (or need) on any particular topic.
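To make the filter/sample distinction concrete, the following is a minimal sketch of how a researcher might connect to these streaming endpoints using the tweepy Python library (its pre-4.0 streaming interface); the credentials and keywords are placeholders rather than the scraper terms used in this project.

import tweepy

# Placeholder credentials obtained from a Twitter developer account.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class CollectListener(tweepy.StreamListener):
    def on_status(self, status):
        # Each tweet arrives in real time as a structured object;
        # store or forward it for later analysis.
        print(status.id_str, status.text)

    def on_error(self, status_code):
        # HTTP 420 signals rate limiting on the public stream; returning
        # False disconnects so the collector can back off.
        return status_code != 420

stream = tweepy.Stream(auth=auth, listener=CollectListener())

# Filter API: only tweets matching the researcher's keywords, capped at
# roughly 1 per cent of the full stream for a standard ('spritzer') account.
stream.filter(track=["European Parliament", "#EP2014"])

# Sample API (alternative): a random ~1 per cent of all public tweets.
# stream.sample()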
45 S Fodden, ‘Anatomy of a tweet: metadata on Twitter’, Slaw, 17 Nov 2011,
www.slaw.ca/2011/11/17/the-anatomy-of-a-tweet-metadata-on-twitter/ (accessed 11 Apr 2014); R Krikorian, ‘Map of a Twitter status object’, 18 Apr 2010,
www.slaw.ca/wp-content/uploads/2011/11/map-of-a-tweet-copy.pdf (accessed 11 Apr 2014).
46 Acquiring data from Twitter on a particular topic involves a trade-off between precision and comprehensiveness. A precise data collection strategy returns only tweets that are on-topic, but is likely to miss some. A comprehensive data collection strategy collects all the tweets that are on-topic, but is likely to include some that are off-topic. Individual keywords can lean towards either precision or comprehensiveness, depending on how and when they are used.
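As a purely illustrative sketch of this trade-off (the tweets and keyword lists below are invented, not drawn from the project’s streams), a narrow and a broad keyword set can be applied to the same text:

# Toy illustration of precision versus comprehensiveness in keyword collection.
tweets = [
    "The European Parliament voted on the budget today",   # on-topic
    "MEPs clash over new data protection rules",            # on-topic, no exact phrase
    "Booked my flights to the EU for the summer holidays",  # off-topic
]

precise_terms = ["european parliament"]                       # rarely wrong, misses the second tweet
comprehensive_terms = ["european parliament", "meps", "eu"]   # catches the second, but also the third

def matches(tweet, keywords):
    # Case-insensitive substring match, as a simple stand-in for a scraper query.
    text = tweet.lower()
    return any(k in text for k in keywords)

print([t for t in tweets if matches(t, precise_terms)])        # precise: 1 on-topic tweet
print([t for t in tweets if matches(t, comprehensive_terms)])  # comprehensive: all 3, including the off-topic one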
47 Ibid.
48 The choice of these keywords and hashtags for each topic in each language was made in a quick manual review of the data collected in the early stages of the project. The inclusion of these terms was meant to bring in conversations that were relevant to the stream but did not explicitly reference the topic by its full name, without overwhelming the streams with irrelevant data. For a full list of scraper terms used per stream, see the annex.
49 AoIR, Ethical Decision-Making and Internet Research; J Bartlett and C Miller, ‘How to measure and manage harms to privacy when accessing and using communications data’, submission by the Centre for the Analysis of Social Media, as requested by the Joint Parliamentary Select Committee on the Draft Communications Data Bill, Oct 2012,
www.demos.co.uk/files/Demos%20CASM%20submission%20on%20Draft%20Communications%20Data%20bill.pdf (accessed 11 Apr 2014).
50 It may also partly be a reflection of the network effect of social networks. For example, given the high proportion of tweets in English on Twitter, non-English speakers may feel compelled to use English as well in order to take part in conversations on the network.
51 Emarketer, ‘Twitter grows stronger in Mexico’; Inforrm’s Blog, ‘Social media’.
52 Given the historical nature of our data set, each twitcident was identified from a single data stream, rather than across Twitter as a whole (which would be a far better way of collecting data relating to an event). See discussion in chapter 4.
53 ‘Dix milliards d’euros pour sauver Chypre’ (‘Ten billion euros to save Cyprus’), Libération, 16 Mar 2013,
www.liberation.fr/economie/2013/03/16/chypre-cinquieme-pays-de-la-zone-euro-a-beneficier-de-l-aide-internationale_889016 (accessed 11 Apr 2014); I de Foucaud, ‘Chypre: un sauvetage inédit à 10 milliards d’euros’ (‘Cyprus: an unprecedented €10 billion rescue’), Le Figaro, 16 Mar 2013,
www.lefigaro.fr/conjoncture/2013/03/16/20002-20130316ARTFIG00293-chypre-un-sauvetage-inedit-a-10-milliards-d-euros.php (accessed 11 Apr 2014); ‘A Chypre, la population sous le choc, le président justifie les sacrifices’ (‘In Cyprus, the population in shock as the president justifies the sacrifices’), Le Monde, 17 Mar 2013,
www.lemonde.fr/europe/article/2013/03/16/a-chypre-la-population-dans-l-incertitude-apres-l-annonce-du-plan-de-sauvetage_1849491_3214.html (accessed 11 Apr 2014).
54 ‘Hitting the savers: Eurozone reaches deal on Cyprus bailout’, Spiegel International, 16 Mar 2013,
www.spiegel.de/international/europe/savers-will-be-hit-as-part-of-deal-to-bail-out-cyprus-a-889252.html (accessed 11 Apr 2014).
55 This echoed much of the early press coverage, especially in Germany, with the Frankfurter Allgemeine stating ‘Zyperns Rettung: Diesmal bluten die Sparer’ (‘Cyprus’s rescue: this time the savers bleed’).
56 ‘Does the bailout deal mean the worst is over for Cyprus? – poll’, Guardian, 25 Mar 2013,
www.theguardian.com/business/poll/2013/mar/25/bailout-deal-worst-over-cyprus-poll (accessed 11 Apr 2014).
57 Pew Research Center, The New Sick Man of Europe: The European Union, 2013,
www.pewglobal.org/files/2013/05/Pew-Research-Center-Global-Attitudes-Project-European-Union-Report-FINAL-FOR-PRINT-May-13-2013.pdf (accessed 11 Apr 2014).
58 YouGov survey results, fieldwork 21–27 Mar 2013,
http://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/eh65gpse1v/YG-Archive_Eurotrack-March-Cyprus-EU-representatives-Easter.pdf (accessed 11 Apr 2014).
59 From a background level of 117 tweets on 9 March 2013, 141 on 10 March and 288 on 11 March, there is an increase to 391 on 12 March, 844 on 13 March and a peak of 1,786 on 14 March.
60 See the section ‘classifier performance’ in the annex for a discussion of its accuracy.
61 ‘Affichette “casse-toi pov’ con”: la France condamnée par la CEDH’ (‘“Casse-toi pov’ con” placard: France condemned by the European Court of Human Rights’), Le Monde, 14 Mar 2013,
www.lemonde.fr/societe/article/2013/03/14/affichette-casse-toi-pov-con-la-france-condamnee-par-la-cedh_1847686_3224.html (accessed 11 Apr 2014).
62 European Court of Human Rights, ‘Affaire Eon c. France’, requête 26118/10, 14 Mar 2013,
http://hudoc.echr.coe.int/sites/fra/pages/search.aspx?i=001-117137#{‘itemid’:[‘001-117137’]} (accessed 11 Apr 2014).
63 W Jordan, ‘Public: ignore courts and deport Qatada’, YouGov, 26 Apr 2013,
http://yougov.co.uk/news/2013/04/26/brits-ignore-courts-and-deport-qatada/ (accessed 24 Apr 2014).
64 Ipsos MORI, ‘Public blamed ECHR over the Home Secretary for Qatada delays’, 26 Apr 2013,
www.ipsos-mori.com/researchpublications ... charchive/2964/Public-blamed-ECHR-over-the-Home-Secretary-for-Abu-Qatada-delays.aspx (accessed 24 Apr 2014).
65 Average calculated across March, April and May.
66 ‘Récession: “La situation est grave”, juge Hollande’ (‘Recession: “The situation is serious”, says Hollande’).
67 ‘Barroso: “la France doit présenter des réformes crédibles”’ (‘Barroso: “France must present credible reforms”’).
68 ‘José Manuel Barroso: “Être contre la mondialisation, c’est cracher contre le vent”’ (‘José Manuel Barroso: “To be against globalisation is to spit into the wind”’).
69 ‘Hollande ne va pas passer un “examen” à Bruxelles, souligne Barroso’ (‘Hollande is not going to sit an “exam” in Brussels, Barroso stresses’).
70 ‘François Hollande au révélateur de la Commission européenne: le président de la République a rencontré les 27 commissaires européens à Bruxelles pour évoquer les réformes structurelles réclamées à la France’ (‘François Hollande put to the test by the European Commission: the President of the Republic met the 27 European commissioners in Brussels to discuss the structural reforms demanded of France’).
71 This was the highest-performing classifier trained during the project – with far higher accuracy than the generic attitudinal classifiers, which attempted to make broader decisions over a longer period.
72 See, for instance, O’Connor et al, ‘From tweets to polls’. The authors collected their sample using just a few keyword searches. Some more promising, methodical approaches also exist: see J Leskovec, J Kleinberg and C Faloutsos, ‘Graph evolution: densification and shrinking diameters’, ACM Transactions on Knowledge Discovery from Data 1, no 1, Mar 2007,
www.cs.cmu.edu/~jure/pubs/powergrowth-tkdd.pdf (accessed 16 Apr 2012); J Leskovec and C Faloutsos, ‘Sampling from large graphs’ in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006,
www.stat.cmu.edu/~fienberg/Stat36-835/L ... plingkdd06.pdf (accessed 17 Apr 2012); P Rusmevichientong et al, ‘Methods for sampling pages uniformly from the world wide web’ in Proceedings of the AAAI Fall Symposium on Using Uncertainty Within Computation, 2001, pp 121–8.
73 D Singer, ‘Forget the 80/20 principle, with Twitter it is 79/7’, Social Media Today, 25 Feb 2010,
http://socialmediatoday.com/index.php?q=SMC/177538 (accessed 11 Apr 2014).
74 European Union, ‘Twitter accounts’, nd,
http://europa.eu/contact/take-part/twitter/index_en.htm (accessed 11 Apr 2014).
75 S Bennett, ‘Who uses Twitter? Young, affluent, educated non-white males, suggests data [study]’, All Twitter, 6 Aug 2013,
www.mediabistro.com/alltwitter/twitter-users-2013_b47437 (accessed 11 Apr 2014).
76 M Bulmer, ‘Facts, concepts, theories and problems’ in M Bulmer (ed.), Sociological Research Methods: An introduction, London: Macmillan, 1984.
77 Surveys often tap attitudes by using a sophisticated barrage of different indicators and different ways of measuring them. The Likert scale gauges the intensity of feeling (usually on a scale from 1 to 5) on a number of specific questions in order to measure an underlying attitude. A body of work around question design has produced settled dos and don’ts aimed at avoiding the unreliable measurement of attitudinal indicators. Questions are avoided if they are too long, ambiguous, leading, general, technical or unbalanced, and many surveys use specific wordings of questions drawn from ‘question banks’ designed to best-practice standards for use by major surveys.
78 S Jeffares, ‘Coding policy tweets’, paper presented to the social text analysis workshop, University of Birmingham, 28 Mar 2012.
79 S Wibberley and C Miller, ‘Detecting events from Twitter: situational awareness in the age of social media’ in C Hobbs, M Moran and D Salisbury (eds), Open Source Intelligence in the Twenty-first Century: New approaches and opportunities, Palgrave Macmillan, forthcoming 2014.
80 Glaser and Strauss, The Discovery of Grounded Theory.
81 COSMOS platform.
82 Open Knowledge Foundation, ‘Open data – an introduction’, nd,
http://okfn.org/opendata/ (accessed 11 Apr 2014).
83 The choice of these keywords and hashtags for each topic in each language was made on the basis of a quick manual review of the data that were collected in the early stages of the project. The inclusion of these terms was meant to bring in conversations that were relevant to the stream but did not explicitly reference the topic by its full name, without overwhelming the streams with irrelevant data. For a full list of scraper terms used per stream see annex.
84 Marchetti-Bowick and Chambers, ‘Learning for microblogs with distant supervision’; O’Connor et al, ‘From tweets to polls’.
85 Method51 is a software suite developed by the project team over the last 18 months. It is based on an open source project called DUALIST. See B Settles, ‘Closing the loop: fast, interactive semi-supervised annotation with queries on features and instances’, Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011, pp 1467–78. Method51 enables non-technical analysts to build machine-learning classifiers. Its most important feature is the speed with which accurate classifiers can be built. Classically, an NLP algorithm would require many thousands of examples of ‘marked-up’ tweets to achieve reasonable accuracy, which is expensive and takes days to complete. However, DUALIST innovatively uses ‘active learning’ (an application of information theory that can identify the pieces of text the NLP algorithm would learn most from) and semi-supervised learning (an approach that not only learns from manually labelled data, but also exploits patterns in large unlabelled data sets). This radically reduces the number of marked-up examples needed from many thousands to a few hundred. Overall, in allowing social scientists to build and evaluate classifiers quickly, and therefore to engage directly with big social media data sets, the AAF makes possible the methodology used in this project.
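The active-learning idea can be sketched in a few lines. The following is an illustrative toy example using scikit-learn’s Naive Bayes classifier with uncertainty sampling; it is not Method51 or DUALIST code, and the tweets, seed labels and loop sizes are invented placeholders standing in for the analyst’s interaction with a labelling interface.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy pool of unlabelled tweets (placeholders, not project data).
pool = [
    "the european parliament vote today was a disgrace",
    "great decision by MEPs on data protection",
    "car sales in the EU fell sharply this quarter",
    "brussels bureaucrats strike again",
    "loving the sunshine in madrid today",
    "the commission's new proposal looks promising",
]
labels = {0: "relevant", 4: "irrelevant"}  # two seed labels supplied by the analyst

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pool)

def most_uncertain(clf, X, labelled, k=2):
    # Uncertainty sampling: pick the unlabelled tweets whose predicted class
    # probabilities are closest together, i.e. where the classifier is least sure.
    probs = clf.predict_proba(X)
    margin = np.abs(probs[:, 0] - probs[:, 1])
    return [i for i in np.argsort(margin) if i not in labelled][:k]

for _ in range(2):  # each round: train, query the analyst, retrain
    idx = sorted(labels)
    clf = MultinomialNB().fit(X[idx], [labels[i] for i in idx])
    for i in most_uncertain(clf, X, set(idx)):
        print("Please label:", pool[i])
        labels[i] = "relevant"  # stand-in for the analyst's decision in a UI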
86 On the one hand, we have been fairly inclusive at the relevancy level, in that all discussions of something directly related to the topic were usually included as relevant. For example, for the European Parliament stream, all tweets about individual MEPs were considered relevant, as were tweets about individual commissioners for the European Commission stream. Similarly, for the European Court of Human Rights stream, tweets about the European Convention on Human Rights, on which the Court’s jurisdiction is based, were included. Anything about the management of the euro by the Eurozone countries and the European Central Bank, as well as euro-induced austerity, was considered relevant for the euro stream. On the other hand, some tweets that directly referred to the stream topic were considered irrelevant, because they did not match our criteria of interest in the six streams as they relate to the European project. For example, tweets that referred to the European Union purely as a geographical area, as shorthand for a group of countries, without referring in any sense to this group of countries as belonging to a political union, were marked as irrelevant (eg ‘Car sales in the EU have gone down 20 per cent’). Similarly, tweets referring to the euro from a purely financial perspective, quoting solely the price of things in euros or exchange rates, were irrelevant.
87 For example, for the European Parliament stream, tweets that expressed an opinion about its decisions, discussions taking place in the Parliament, individual MEPs and ‘lobbying’ directed at it (eg ‘@EP: please outlaw pesticides and save the bees!’) were considered attitudinal.
88 For the Parliament and Commission streams, positive or negative comments on individual MEPs and commissioners and specific decisions taken by each institution were marked as such.
89 The harmonic mean of p and r is equal to 2pr / (p + r).
90 Wilson et al, ‘A review of Facebook research in the social sciences’.
91 M Madden, ‘Privacy management on social media sites’, Pew Research Center, 2012,
www.pewinternet.org/~/media//Files/Reports/2012/PIP_Privacy_management_on_social_media_sites_022412.pdf (accessed 11 Apr 2014).
92 J Bartlett, The Data Dialogue, London: Demos, 2012. Also see J Bartlett and C Miller, Demos CASM Submission to the Joint Committee on the Draft Communications Data Bill, Demos, 2012.
93 Bartlett, The Data Dialogue. This is based on a representative population-level poll of circa 5,000 people. See also Bartlett and Miller, ‘How to measure and manage harms to privacy when accessing and using communications data’.
94 Twitter, ‘Terms of Service’, 2012,
www.twitter.com/tos (accessed 11 Apr 2014); Twitter, ‘Twitter Privacy Policy’, 2013,
www.twitter.com/privacy (accessed 11 Apr 2014).
95 Twitter, ‘Terms of Service’.