
Twitter Tweet Quality Assessment Criteria

The emergence of social media platforms has brought about a significant transformation in global patterns of information dissemination and consumption, as a large number of internet users now rely on these channels as their primary sources of information acquisition. The rapid growth in social media membership, and consequently of digital traces circulating in these platforms, has been accompanied by a progressive rise in the importance of artificial intelligence (AI) based recommender systems - content pre-selection, ranking and suggestion systems used to customise users' online experiences.

The integration of AI-based recommender systems into social media platforms has led to a fundamental shift in the way users consume and interact with online information, significantly increasing the level of automated content curation while limiting users’ freedom of independent content discovery. This paradigm shift towards the machine-learning based hyper-personalisation of social media content raises concerns regarding potential impacts on the quality and diversity of information available to users, with clear implications for the integrity of knowledge acquisition processes.

Several recent studies have analysed these risks, concluding that engagement-based recommender systems - which form the majority of recommendation engines currently deployed within social media platforms - may be prone to bias, user-manipulating behaviour, the creation of echo chambers, and to the amplification of false or misleading content.

Despite their status as critical infrastructure of social media platforms - and arguably of information circulation at a societal level - the internal architectures and practical functioning of recommender systems remain only superficially understood. While several platforms have previously released white papers with information on their functioning, limited evidence exists on the characteristics that guide their deployment. Twitter's recent public release of parts of its recommendation code does provide new information on the system's architecture, but perhaps the most central part of the system - the 'heavy ranker', a deep neural network used to make recommendation predictions - cannot be replicated with currently available information, limiting the possibility of testing the behaviour of this recommender system.

This lack of evidence is a clear obstacle to evaluating the magnitude of any form of algorithmic bias in content suggestion, and in particular to understanding whether, in their drive to maximise user engagement with a platform, recommender systems are acting as significant drivers of the diffusion of online disinformation and misinformation. By comparing the impressions received by otherwise similar tweets, it is possible to estimate whether tweets sharing information from low-credibility domains generate exceptional impressions, which may point towards a recommendation bias towards low-credibility content, as well as a general lack of functioning integrity signals.

The data was collected at regular intervals, ensuring similar uptime for each day of publication, and the resulting dataset comprises a total of ≈ 2.1m original tweets - hence excluding retweets - on COVID-19, and ≈ 600k original tweets on climate change. While several domain-credibility datasets exist, such as NewsGuard and IffyNews, this study uses aggregate reliability scores produced through principal component analysis of the major available rating sets. This dataset provides credibility scores for 11,520 domains, where 0 represents the lowest credibility and 1 represents the highest credibility.

In line with the values used by major credibility ratings providers such as NewsGuard, we then consider tweets with a credibility score lower than or equal to 0.4 to be low-credibility, and tweets with a credibility score equal to or higher than 0.6 to be high-credibility. This process results in a total of 87,769 tweets from low-credibility sources and 187,643 tweets from high-credibility sources, with low-credibility domains present in 3.77% of tweets on COVID-19 and 1.69% of tweets on climate change.
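The thresholding step described above can be sketched as follows. This is a minimal illustration with made-up scores; the column names and the tiny dataset are hypothetical, but the 0.4/0.6 cut-offs are those stated in the text.

```python
import pandas as pd

# Hypothetical miniature dataset; "credibility" is the aggregate
# domain score in [0, 1] attached to each tweet.
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "credibility": [0.15, 0.40, 0.55, 0.82],
})

# Thresholds from the study: <= 0.4 -> low-credibility,
# >= 0.6 -> high-credibility; mid-range scores are excluded.
def label_credibility(score):
    if score <= 0.4:
        return "low"
    if score >= 0.6:
        return "high"
    return None

tweets["label"] = tweets["credibility"].map(label_credibility)
labelled = tweets.dropna(subset=["label"])
```

Tweets falling in the ambiguous 0.4-0.6 band are simply dropped, so the two resulting groups are clearly separated on the credibility scale.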

The most common low-credibility and high-credibility domains for both datasets are shown in Table 1, while their distribution split is shown in Fig. 1.


Figure 1: Distribution of tweets with low-credibility and high-credibility domains as a percentage of the full data in each dataset under analysis


Table 1 shows the distribution of the five most prevalent high-credibility and low-credibility domains for both datasets.

Table 1: Distribution of the five most prevalent high-credibility and low-credibility domains for both datasets
Dataset         Credibility   Domain
COVID-19        High          example.com
COVID-19        High          another.com
COVID-19        Low           fake.news
COVID-19        Low           unreliable.com
Climate Change  High          science.org
Climate Change  High          environment.gov
Climate Change  Low           denier.net
Climate Change  Low           skeptic.org

Measuring Amplification

Measuring recommender-driven amplification is a notoriously difficult task, which requires clearly defined objectives and robust benchmarks to identify potential patterns of amplification. In this study, amplification is defined as a condition where tweets with similar characteristics drawn from two different groups - low-credibility and high-credibility - exhibit a significant difference in the outcome variable, which is the number of impressions obtained.

Impressions were selected as the outcome variable for two reasons. First, impressions directly measure the visibility of a tweet, providing a direct way to assess how often content is organically displayed in users' feeds, a characteristic that is crucial to understanding the behaviour of recommender systems. Second, unlike metrics such as likes, retweets, or comments, impressions are a passive measure of exposure, independent of user engagement, and as such are expected to be more effective than alternative metrics in characterising the behaviour of recommender systems.

Baseline Amplification Benchmark

To produce a clear measure of amplification, it is therefore important to establish a robust benchmarking procedure to compare the two samples under analysis. For this purpose, this study compares the two previously described samples of high-credibility and low-credibility tweets through bias-corrected and accelerated bootstrapping (BCa), where the mean difference between the two samples is measured across 1000 randomly resampled iterations.

BCa enhances traditional bootstrapping by introducing two key modifications: a bias-correction factor and an acceleration factor. The bias-correction factor adjusts for the bias in the bootstrap distribution, ensuring that the resampling process more accurately reflects the true nature of the data, while the acceleration factor corrects for the skewness of the bootstrap distribution, which is particularly important in data with asymmetrical distributions. The substantial number of iterations used in the BCa method further enhances the robustness of this approach, providing confidence that the effect measured reflects a real divergence between the two samples. Lastly, as a non-parametric statistical approach, BCa requires fewer assumptions about the distribution of the data to hold, which makes this approach particularly suited for the study of social media data.
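A BCa bootstrap of a two-sample mean difference is available off the shelf in SciPy. The sketch below uses synthetic log-normal "impression" counts (social media impressions are heavily right-skewed); the sample sizes and distribution parameters are illustrative, not the study's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic impression counts for the two groups; in the study these
# would be the low-credibility and high-credibility tweet samples.
low_cred = rng.lognormal(mean=5.2, sigma=1.0, size=500)
high_cred = rng.lognormal(mean=5.0, sigma=1.0, size=500)

def mean_difference(x, y):
    return np.mean(x) - np.mean(y)

# BCa bootstrap of the mean difference, 1000 resamples as in the study.
res = stats.bootstrap(
    (low_cred, high_cred),
    statistic=mean_difference,
    n_resamples=1000,
    method="BCa",
    confidence_level=0.95,
    random_state=0,
)
```

`res.confidence_interval` then gives a bias-corrected and accelerated interval for the mean difference, which is more reliable than a percentile interval when the bootstrap distribution is skewed.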

However, given the existing limitations of impressions data, a simple bootstrapping benchmark is not sufficient to reliably determine whether a sample consistently received more impressions than the other, as it neglects potential user-level and tweet-level factors that could influence the number of impressions obtained by a tweet. To remedy this shortcoming, the baseline benchmark bootstrap comparison is performed with two stratifications, resampling the data by a tweet’s engagement level and the user’s number of followers.

Engagement was selected as a baseline stratification variable as engagement-based recommender systems are known to highly value a tweet’s engagement performance, and high-engagement tweets are likely to be shown more than low-engagement tweets. Followers count was selected as a baseline stratification variable because, within any networked recommender system, the number of followers a tweet creator has will likely have a significant impact on how many people are exposed to that tweet.

While these two variables alone may not account for the entirety of tweet-level and user-level factors that are likely to influence impressions, adding an excessive number of stratification variables with limited explanatory power is likely to be counterproductive, as it would significantly reduce the number of matched samples. Considering these limitations, stratifying the baseline benchmark by levels of engagement and followers appears the most effective strategy to maximise the accuracy and validity of the results.

As both engagement and followers count are discrete variables with a large range of values, the complexity of these variables is reduced by assigning the data to discrete clusters using quantile-based discretization, an approach that groups the data into similar-sized buckets based on quantile rankings. This approach was tested alongside more traditional clustering approaches such as HDBSCAN and k-means clustering, and consistently provided a more effective grouping of the data.

Following an exploratory analysis of the distribution of both variables, the number of discrete groups to be identified is set to 4, a number that preserves the original variability of the data without placing undue restrictions on the bootstrapping process, producing a total of 16 combinations of strata of engagement and followers clusters. To guarantee consistent results in the bootstrapping stage, quantile-based discretization is applied to the combined datasets of low- and high-credibility data for each distinct dataset under investigation, and the values of both engagement and followers data were log-scaled in the process.

Additional Stratification Variables

After developing a method to compute the baseline level of amplification across the two datasets, we can add further individual stratification variables to test the influence of additional grouping variables across subgroups. For this purpose, each additional stratum is separately added to the baseline benchmark, computing any difference between the baseline amplification and the amplification measured after the addition of the new stratification variable. At this stage, we test amplification across three additional stratification variables: toxicity scores, political bias and verified status.

Toxicity scores are obtained through the Perspective API by Jigsaw, which leverages a machine learning model trained on millions of Wikipedia comments to predict how likely it is that an input text will be perceived as toxic by a reader. The Perspective API model produces a toxicity score ranging from 0 to 1 for each input tweet, with scores close to 0 indicating a very low probability of the text being perceived as toxic, and scores close to 1 a high probability.

To avoid creating an excessive number of categories during the stratification process, the toxicity scores obtained from the Perspective API are used to create 3 clusters of toxicity levels with k-means clustering, allowing for the stratification of our data according to the degree of language toxicity.
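The clustering of toxicity scores can be sketched with scikit-learn's k-means. The scores below are synthetic stand-ins for Perspective API output, and the relabelling step (so that level 0 is least toxic) is an illustrative convenience, not necessarily the study's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Hypothetical Perspective API toxicity scores in [0, 1];
# real scores would come from the API, one per tweet.
toxicity = rng.beta(2, 8, size=1000).reshape(-1, 1)

# Group the scores into 3 toxicity levels for stratification.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toxicity)

# Relabel clusters so that level 0 has the lowest mean toxicity
# and level 2 the highest.
order = np.argsort(km.cluster_centers_.ravel())
relabel = {int(old): new for new, old in enumerate(order)}
levels = np.array([relabel[int(c)] for c in km.labels_])
```

The resulting `levels` array (0 = low, 1 = medium, 2 = high toxicity) can then be used directly as an additional stratification column during bootstrapping.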

Further, the political bias of the URL domains under analysis is obtained by annotating data through a zero-shot classifier leveraging the GPT-4 API. While the use of large language models in data annotation tasks is a recent development, recent literature has extensively analysed the performance of GPT-3.5 and GPT-4 in data labelling tasks, including political stance identification, with both models exhibiting high accuracy.

To maximise the usability and interpretability of the data, for this task, the model is asked to classify the political bias of the input domain into one of five categories: far-left, left, no bias, right and far-right. The model is also prompted to return a value of −1 whenever it does not have information on a domain, or if the domain is non-political.
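The exact prompt used in the study is not public, so the sketch below only illustrates the shape of such a zero-shot labelling step: building a prompt over the five categories plus the −1 fallback, and parsing the model's free-text reply into one of those values. The wording and parser are assumptions for illustration.

```python
# Five bias categories plus -1 for unknown/non-political domains,
# as described in the text; the prompt phrasing is hypothetical.
CATEGORIES = ["far-left", "left", "no bias", "right", "far-right"]

def build_prompt(domain):
    return (
        f"Classify the political bias of the news domain '{domain}' "
        "into one of the following categories: " + ", ".join(CATEGORIES) + ". "
        "If you have no information on the domain, or the domain is "
        "non-political, answer -1."
    )

def parse_label(response):
    text = response.strip().lower()
    if text == "-1":
        return -1
    return text if text in CATEGORIES else None  # flag unparseable answers

```

Constraining the model to a closed label set and normalising its reply makes the annotations directly usable as a categorical stratification variable, with unparseable answers flagged for manual review.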

To validate annotations obtained from GPT-4, the labels obtained from the 20 most common domains (covering more than 100k tweets) are compared with static labels of political bias obtained from Media Bias and Fact Check, where there is a 95% agreement in macro political areas between the two datasets.

Through this process, we can identify 5596 political domains - around 10% of all domains in the data - which are then used as a stratification variable during bootstrapping to assess whether political bias has an influence on the amplification of low-credibility content. Lastly, for the verified status variable, it should be noted that users labeled as verified in the data are only those verified before November 2022, prior to the changes introduced to Twitter's verification system.

Here, results reveal that on average, across 1000 stratified bootstrapping samples stratified by engagement level and followers count, samples of low-credibility tweets generate more impressions than high-credibility samples across both datasets, with low-credibility tweets on COVID-19 receiving a baseline impressions amplification of +19.2% (median +17.3%) and low-credibility tweets on climate change generating +95.8% impressions (median = +90.1%). In absolute values, this amounts to a mean difference of +113.7 impressions (median = +111.4) for COVID-19 tweets, and +474.6 impressions (median = +447.2) for climate change tweets.

Results here also show that this behaviour is observed quite consistently, as 84.4% of COVID-19 samples have a positive mean difference, and 97.9% of climate change samples have a positive mean difference, suggesting that it is very rare that low-credibility samples will outperform high-credibility samples in impressions counts.
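The consistency statistic reported above is simply the share of bootstrap resamples with a positive mean difference. The sketch below computes it over synthetic per-resample differences; the distribution parameters are made up and only mimic the COVID-19 figures in shape.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-resample mean differences in impressions
# (low-credibility minus high-credibility), 1000 bootstrap samples.
mean_diffs = rng.normal(loc=110.0, scale=100.0, size=1000)

# Consistency of the effect: share of resamples in which the
# low-credibility sample out-performs the high-credibility one.
positive_share = float(np.mean(mean_diffs > 0) * 100)
summary = {
    "mean_diff": float(np.mean(mean_diffs)),
    "median_diff": float(np.median(mean_diffs)),
    "positive_share_pct": positive_share,
}
```

Reporting the mean, the median and the positive share together guards against a few extreme resamples dominating the headline amplification figure.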

Average percentage difference in impressions

Figure 2: Raincloud plot illustrating the average percentage difference in impressions between high-credibility and low-credibility tweets, based on 1000 resamples from each dataset under study

However, when dealing with skewed distributions such as those of social media impressions, looking at the aggregate mean may not be sufficient to fully explain an amplification effect. Rather, we must also assess inter-stratum breakdowns of variability, which are shown in Fig. 3, containing heatmaps of the mean differences in impressions across all 16 strata combinations, as well as the size of each stratum. This step of the analysis delivers a more nuanced understanding of the results, showing that within bootstrapped samples, the observed difference in impressions is primarily generated by a difference in the highest-engagement and highest-followers stratum (3,3).
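A per-stratum breakdown of this kind reduces to a grouped aggregation. The sketch below builds it with a pandas groupby over synthetic data; the column names and random values are hypothetical, but the 4x4 stratum grid and the low-minus-high difference mirror the analysis described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical matched tweets: stratum labels from the 4x4 grid of
# engagement and followers bins, plus credibility group and impressions.
n = 4000
df = pd.DataFrame({
    "engagement_bin": rng.integers(0, 4, n),
    "followers_bin": rng.integers(0, 4, n),
    "credibility": rng.choice(["low", "high"], n),
    "impressions": rng.lognormal(4.0, 1.5, n).astype(int),
})

# Mean impressions per stratum and group, then the low-minus-high
# difference underlying the per-stratum heatmaps.
per_stratum = (
    df.groupby(["engagement_bin", "followers_bin", "credibility"])["impressions"]
      .mean()
      .unstack("credibility")
)
per_stratum["mean_diff"] = per_stratum["low"] - per_stratum["high"]
```

The resulting 16-row table exposes exactly which strata drive the aggregate difference, which is how a concentration of the effect in stratum (3,3) becomes visible.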

For COVID-19, this amounts to a mean difference of +3148 impressions between low-credibility and high-credibility tweets, while for the climate change dataset, this amounts to +9197. While amplification within this stratum appears extensive in absolute terms, it is more moderate in relative terms - showing +30.2% amplification for the COVID-19 stratum (3,3) and +129% for the climate change data within the same stratum.