Representing the Twittersphere: Archiving a representative sample of Twitter data under resource constraints

Airo Hino, Robert Andrew Fahey

Research output: Contribution to journal › Article

Abstract

The rising popularity of social media posts, most notably Twitter posts, as a data source for social science research poses significant problems with regard to access to representative, high-quality data for analysis. Cheap, publicly available data such as that obtained from Twitter's public application programming interfaces is often of low quality, while high-quality data is expensive both financially and computationally. Moreover, data is often available only in real-time, making post-hoc analysis difficult or impossible. We propose and test a methodology for inexpensively creating an archive of Twitter data through population sampling, yielding a database that is highly representative of the targeted user population (in this test case, the entire population of Japanese-language Twitter users). Comparing the tweet volume, keywords, and topics found in our sample data set with the ground truth of Twitter's full data feed confirmed a very high degree of representativeness in the sample. We conclude that this approach yields a data set that is suitable for a wide range of post-hoc analyses, while remaining cost effective and accessible to a wide range of researchers.
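As a rough illustration of the kind of check the abstract describes, the sketch below draws a simple random sample of users and compares daily tweet volume in the sample against the full data. It is not the authors' pipeline: the data is synthetic, simple random sampling is an assumption about what "population sampling" could mean, and every name (`sample_users`, `daily_volume`, `pearson`) is hypothetical.

```python
import math
import random
from collections import Counter


def sample_users(user_ids, k, seed=0):
    """Draw a simple random sample of k users from the user population."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(user_ids), k))


def daily_volume(tweets, users=None):
    """Count tweets per day, optionally keeping only tweets by `users`."""
    counts = Counter()
    for user_id, day in tweets:
        if users is None or user_id in users:
            counts[day] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic stand-in for a full feed: (user_id, day) pairs with a weekly
# activity cycle. Real data would come from an archive, not a generator.
rng = random.Random(42)
population = range(2000)
tweets = []
for day in range(28):
    p = 0.1 + 0.5 * (day % 7) / 6  # assumed daily posting probability
    for user_id in population:
        if rng.random() < p:
            tweets.append((user_id, day))

sampled = sample_users(population, k=200, seed=1)
full_counts = daily_volume(tweets)
sample_counts = daily_volume(tweets, sampled)

# Compare the shape of the two daily volume series.
days = sorted(full_counts)
r = pearson([full_counts[d] for d in days], [sample_counts[d] for d in days])
print(f"sampled users: {len(sampled)}, volume correlation: {r:.3f}")
```

A correlation near 1 indicates that the sample tracks the rises and falls of the full feed. The same comparison could in principle be repeated for keyword or topic frequencies, which the abstract also mentions.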

Original language: English
Pages (from-to): 175-184
Number of pages: 10
Journal: International Journal of Information Management
Volume: 48
DOI: 10.1016/j.ijinfomgt.2019.01.019
Publication status: Published - 2019 Oct 1

Fingerprint

Social sciences
twitter
Application programming interfaces (API)
Sampling
resources
Costs
data quality
social media
popularity
social science
programming
methodology
costs
language

Keywords

  • Data collection
  • Representativeness
  • Sampling
  • Social media
  • Twitter

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Library and Information Sciences

Cite this

Representing the Twittersphere: Archiving a representative sample of Twitter data under resource constraints. / Hino, Airo; Fahey, Robert Andrew.

In: International Journal of Information Management, Vol. 48, 01.10.2019, p. 175-184.

@article{3a26c164761449d4a15895e8d79b958e,
title = "Representing the Twittersphere: Archiving a representative sample of Twitter data under resource constraints",
abstract = "The rising popularity of social media posts, most notably Twitter posts, as a data source for social science research poses significant problems with regard to access to representative, high-quality data for analysis. Cheap, publicly available data such as that obtained from Twitter's public application programming interfaces is often of low quality, while high-quality data is expensive both financially and computationally. Moreover, data is often available only in real-time, making post-hoc analysis difficult or impossible. We propose and test a methodology for inexpensively creating an archive of Twitter data through population sampling, yielding a database that is highly representative of the targeted user population (in this test case, the entire population of Japanese-language Twitter users). Comparing the tweet volume, keywords, and topics found in our sample data set with the ground truth of Twitter's full data feed confirmed a very high degree of representativeness in the sample. We conclude that this approach yields a data set that is suitable for a wide range of post-hoc analyses, while remaining cost effective and accessible to a wide range of researchers.",
keywords = "Data collection, Representativeness, Sampling, Social media, Twitter",
author = "Hino, Airo and Fahey, {Robert Andrew}",
year = "2019",
month = "10",
day = "1",
doi = "10.1016/j.ijinfomgt.2019.01.019",
language = "English",
volume = "48",
pages = "175--184",
journal = "International Journal of Information Management",
issn = "0268-4012",
publisher = "Elsevier Limited",

}

TY  - JOUR
T1  - Representing the Twittersphere
T2  - Archiving a representative sample of Twitter data under resource constraints
AU  - Hino, Airo
AU  - Fahey, Robert Andrew
PY  - 2019/10/1
Y1  - 2019/10/1
N2  - The rising popularity of social media posts, most notably Twitter posts, as a data source for social science research poses significant problems with regard to access to representative, high-quality data for analysis. Cheap, publicly available data such as that obtained from Twitter's public application programming interfaces is often of low quality, while high-quality data is expensive both financially and computationally. Moreover, data is often available only in real-time, making post-hoc analysis difficult or impossible. We propose and test a methodology for inexpensively creating an archive of Twitter data through population sampling, yielding a database that is highly representative of the targeted user population (in this test case, the entire population of Japanese-language Twitter users). Comparing the tweet volume, keywords, and topics found in our sample data set with the ground truth of Twitter's full data feed confirmed a very high degree of representativeness in the sample. We conclude that this approach yields a data set that is suitable for a wide range of post-hoc analyses, while remaining cost effective and accessible to a wide range of researchers.
KW  - Data collection
KW  - Representativeness
KW  - Sampling
KW  - Social media
KW  - Twitter
UR  - http://www.scopus.com/inward/record.url?scp=85063272313&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85063272313&partnerID=8YFLogxK
U2  - 10.1016/j.ijinfomgt.2019.01.019
DO  - 10.1016/j.ijinfomgt.2019.01.019
M3  - Article
VL  - 48
SP  - 175
EP  - 184
JO  - International Journal of Information Management
JF  - International Journal of Information Management
SN  - 0268-4012
ER  -