Abstract
The rising popularity of social media posts, most notably Twitter posts, as a data source for social science research poses significant problems with regard to access to representative, high-quality data for analysis. Cheap, publicly available data such as that obtained from Twitter's public application programming interfaces is often of low quality, while high-quality data is expensive both financially and computationally. Moreover, data is often available only in real time, making post-hoc analysis difficult or impossible. We propose and test a methodology for inexpensively creating an archive of Twitter data through population sampling, yielding a database that is highly representative of the targeted user population (in this test case, the entire population of Japanese-language Twitter users). Comparing the tweet volume, keywords, and topics found in our sample data set with the ground truth of Twitter's full data feed confirmed a very high degree of representativeness in the sample. We conclude that this approach yields a data set that is suitable for a wide range of post-hoc analyses, while remaining cost-effective and accessible to a wide range of researchers.
Original language | English |
---|---|
Pages (from-to) | 175-184 |
Number of pages | 10 |
Journal | International Journal of Information Management |
Volume | 48 |
DOIs | 10.1016/j.ijinfomgt.2019.01.019 |
Publication status | Published - 2019 Oct 1 |
Keywords
- Data collection
- Representativeness
- Sampling
- Social media
ASJC Scopus subject areas
- Information Systems
- Computer Networks and Communications
- Library and Information Sciences
Cite this
Representing the Twittersphere: Archiving a representative sample of Twitter data under resource constraints. / Hino, Airo; Fahey, Robert Andrew.
In: International Journal of Information Management, Vol. 48, 01.10.2019, p. 175-184.
Research output: Contribution to journal › Article
TY - JOUR
T1 - Representing the Twittersphere
T2 - Archiving a representative sample of Twitter data under resource constraints
AU - Hino, Airo
AU - Fahey, Robert Andrew
PY - 2019/10/1
Y1 - 2019/10/1
KW - Data collection
KW - Representativeness
KW - Sampling
KW - Social media
KW - Twitter
UR - http://www.scopus.com/inward/record.url?scp=85063272313&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063272313&partnerID=8YFLogxK
U2 - 10.1016/j.ijinfomgt.2019.01.019
DO - 10.1016/j.ijinfomgt.2019.01.019
M3 - Article
AN - SCOPUS:85063272313
VL - 48
SP - 175
EP - 184
JO - International Journal of Information Management
JF - International Journal of Information Management
SN - 0268-4012
ER -