Python tweet archiver

12/13/2023

Public health and social science increasingly use Twitter for behavioral and marketing surveillance. However, few studies provide sufficient detail about Twitter data collection to allow direct comparisons between studies or to support replication. The three primary application programming interfaces (APIs) for Twitter data are Streaming, Search, and Firehose. To date, no clear guidance exists about the advantages and limitations of each API, or about the comparability of the amount, content, and user accounts of the tweets retrieved from each API. Such information is crucial to the validity, interpretation, and replicability of research findings. This study examines whether tweets collected using the same search filters over the same time period, but calling different APIs, would yield comparable datasets. We collected tweets about anti-smoking, e-cigarettes, and tobacco using the three APIs. The retrieved tweets largely overlapped among the three APIs, but each API also retrieved unique tweets, and the extent of overlap varied over time and by topic, resulting in different trends and potentially supporting diverging inferences.

Researchers need to understand how different data sources can influence the amount, content, and user accounts of the data they retrieve from social media, in order to assess the implications of their choice of data source. Health and social research using social media data is increasing rapidly. Twitter is the most widely used source because of its public-facing nature and relatively straightforward access to data through public APIs. Twitter data have been used for infodemiology/infoveillance studies, tracking health attitudes and behaviors, and measuring community-level environments related to health outcomes. There are multiple ways to access Twitter data. Researchers likely choose one source over another because of its accessibility or affordability. However, there is no systematic guideline to help researchers evaluate the advantages and limitations of each data source for their research question. This problem is not limited to Twitter data; other social media platforms and data vendors also provide insufficient technical guidance to make informed or transparent decisions. The interpretation, validity, and replicability of this study’s findings are directly related to data sources and their credibility.

Our interest in the sources of social media data stemmed from our first experience with the Twitter streaming API data. Since we began our social media research in 2012, our main Twitter data source has been PowerTrack, the historical archive of the Firehose, which provides access to 100% of the public posts that match the search filter criteria and supports retrospective inquiry. For broad behavioral and public opinion research, idiosyncrasies of slang and regional dialects, as well as unanticipated marketing or policy events, make it challenging to anticipate all potentially relevant search terms ahead of time. Thus, we weighed the cost of the Firehose against the security of complete coverage and the opportunity to go back and retrieve relevant posts missed by our initial search filters. We decided that the Firehose offered us the best opportunity to capture relevant data for our research agenda. Yet, once we had developed a robust set of keyword search filters, we wondered whether the ‘free’ API could provide a comparable sample of data, one sufficiently generalizable for our research questions, without the ongoing subscription cost. There was virtually no technical documentation of how the public stream was generated.

Thus, we decided to undertake an experiment: a direct comparison of the amount, content, and data quality for each data source. In this early and rough experiment, we retrieved e-cigarette-related tweets in a two-stage process. We first collected a broad archive of tobacco-related tweets (covering various tobacco and e-cigarette/vaping products and related attitudes, behaviors, and policy) using hundreds of keyword-based search rules via PowerTrack. Then, from this broad tobacco archive, we filtered for tweets that matched our e-cigarette search filter (N = 82,205). We then took a random sample of 6000 tweets from the tobacco archive that did not match the e-cigarette search filter and manually labeled them to count the number of e-cigarette-relevant tweets missed by our search filter; 20 relevant tweets were found among the unmatched sample.
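
The manual check of the e-cigarette search filter described above (20 relevant tweets found in a random sample of 6000 unmatched tweets) can be read as a rough estimate of how often the filter misses relevant tweets. The Python sketch below is only an illustration and not the authors' procedure: the function name is hypothetical, and the Wilson score interval with a 95% level (z = 1.96) is an assumed choice for expressing the sampling uncertainty.

```python
import math


def miss_rate_with_wilson_interval(relevant_found, sample_size, z=1.96):
    """Return (rate, low, high) for the share of relevant tweets in a sample.

    Hypothetical helper for illustration. The Wilson score interval and the
    default z = 1.96 (about 95% confidence) are assumptions, not choices
    stated in the text above.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= relevant_found <= sample_size:
        raise ValueError("relevant_found must be between 0 and sample_size")

    p = relevant_found / sample_size
    denom = 1 + z ** 2 / sample_size
    center = (p + z ** 2 / (2 * sample_size)) / denom
    half_width = (
        z
        * math.sqrt(
            p * (1 - p) / sample_size + z ** 2 / (4 * sample_size ** 2)
        )
        / denom
    )
    return p, max(0.0, center - half_width), min(1.0, center + half_width)


# Figures taken from the text: 20 relevant tweets in 6000 sampled unmatched tweets.
rate, low, high = miss_rate_with_wilson_interval(20, 6000)
print(f"Estimated miss rate among unmatched tweets: {rate:.2%}")
print(f"Approximate 95% interval: {low:.2%} to {high:.2%}")
```

The point estimate is about 0.33% of unmatched tweets. Converting that into a count of missed tweets would require the total number of unmatched tweets in the tobacco archive, which is not given here.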
