Social media is great for networking and sharing insights with your friends and contacts. But this wonderful technology has an unfortunate side effect: fake news, which spreads like wildfire and can fuel hatred and even violence. We have all seen such fake posts flooding our walls, and we have all wished there were a good way to counter them.
Researchers have worked on the problem and have come up with a way to train a supervised learning model to identify such fake news so that it can be suppressed. A recent paper, Supervised Learning for Fake News Detection by Julio C. S. Reis et al., published in the IEEE magazine Computing Edge, describes one such solution.
Features for Fake News
The most fundamental requirement for training a supervised learning model is to identify the input features that can potentially impact and define the required output. Naturally, the problem starts with identifying such a set of input features.
At a high level, these features can be classified as:
- News Content
- News Source
- News Environment
Let's check out each of these:
News Content
This covers the textual features of the post. It can include sentence-level features such as bag-of-words representations, n-grams, part-of-speech tags (noun, verb, adjective and so on), and the number of words and syllables per sentence, as well as features based on text readability that measure the writer's style.
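As a minimal sketch of the bag-of-words and n-gram idea, the snippet below counts unigrams and bigrams using only the Python standard library. The tokenizer and the function names (`tokenize`, `ngrams`, `ngram_counts`) are illustrative assumptions, not the feature extraction used in the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n=1):
    """Bag of n-grams: map each n-gram to the number of times it occurs."""
    return Counter(ngrams(tokenize(text), n))


# Made-up headline, for illustration only.
headline = "You will not believe what the mayor said. The mayor said nothing."
unigrams = ngram_counts(headline, 1)
bigrams = ngram_counts(headline, 2)

print(unigrams[("mayor",)])       # occurrences of the word "mayor"
print(bigrams[("the", "mayor")])  # occurrences of the pair "the mayor"
```

Each distinct n-gram becomes one input column, and its count (or a weighted variant of it) becomes the feature value for that post.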
Lexical features include character and word-level signals, such as the number of unique words and their frequency in the text. We can use linguistic features, including the number of words, first-person pronouns, demonstrative pronouns, verbs, hashtags, all punctuation counts, etc.
Along with that, we can consider some psycholinguistic features. Linguistic Inquiry and Word Count (LIWC) is a dictionary-based text mining tool whose output has been explored in many classification tasks, including fake news detection. We can use it to extract features that capture additional signals of persuasive and biased language.
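A few of these lexical counts can be sketched with the standard library alone. The pronoun list and the function name `lexical_features` are assumptions made for this example; a real pipeline would likely use a proper tokenizer and part-of-speech tagger.

```python
import re
import string

# Assumed (not exhaustive) list of English first-person pronouns.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}


def lexical_features(text):
    """Compute a handful of simple word- and character-level counts."""
    words = re.findall(r"[A-Za-z']+", text)  # hashtag words are counted too
    lowered = [w.lower() for w in words]
    return {
        "num_words": len(words),
        "num_unique_words": len(set(lowered)),
        "num_first_person": sum(w in FIRST_PERSON for w in lowered),
        "num_hashtags": len(re.findall(r"#\w+", text)),
        # string.punctuation includes "#" and "'", so those are counted here.
        "num_punctuation": sum(ch in string.punctuation for ch in text),
    }


# Made-up post, for illustration only.
post = "We can't believe this! Share it with us now!!! #shocking #truth"
features = lexical_features(post)
print(features)
```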
Next, we can consider some semantic features that capture the semantic aspects of a text. They are useful to infer patterns of meaning from data. As part of this set of features, we can consider the toxicity score obtained from Google’s Perspective API. This API uses machine learning models to quantify the extent to which a text (or comment, for instance) can be perceived as "toxic".
TextBlob's API can also help us compute subjectivity and sentiment scores of a text.
News Source
This consists of information about the publisher of the news article.
To extract these features, we need to identify the credibility of all different news sources along with the domain information. Based on this, we can extract indicators of political bias, credibility and source trustworthiness.
Bias, or political polarization, is closely related to the spread of misinformation, so the political bias of the news source is an important feature for identifying fake news.
The next task is to identify the credibility and trustworthiness of the news source. We can collect user engagement metrics of the Facebook page that published the news article, using Facebook's API. We can use Alexa’s API to get the relative position of the news domain in the Alexa ranking. Furthermore, using this same API, we can collect Alexa's top newspapers. Some unreliable domains may try to disguise themselves using domain names similar to those of well-known newspapers. To detect this, we can define the dissimilarity between the domains from the Alexa ranking and the news domains in our dataset as a feature.
Some cities are known to have residents who create and disseminate fake news, so location features of the news source can also prove useful in identifying fake news.
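The text does not say which dissimilarity measure is used, so the sketch below assumes a normalized Levenshtein (edit) distance and takes the minimum over a list of reference domains. The domain names and the helper `min_dissimilarity` are hypothetical and serve only to illustrate the idea.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def min_dissimilarity(domain, reference_domains):
    """Smallest normalized edit distance from domain to any reference domain."""
    return min(
        edit_distance(domain, ref) / max(len(domain), len(ref))
        for ref in reference_domains
    )


# Hypothetical stand-ins for a list of top newspaper domains.
TOP_NEWSPAPERS = ["example-times.com", "dailyexample.com"]

exact = min_dissimilarity("example-times.com", TOP_NEWSPAPERS)
lookalike = min_dissimilarity("examp1e-times.com", TOP_NEWSPAPERS)
print(exact, lookalike)
```

Under this assumption, a value of zero means the domain is itself on the reference list, while a small non-zero value flags a possible look-alike that is worth a closer inspection.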
Environment Features
These consist of statistics of user engagement and temporal patterns from social media (i.e., Facebook).
We consider the number of likes, shares, and comments from Facebook users. Moreover, we compute the number of comments within intervals from publication time (900, 1800, 2700, 3600, 7200, 14400, 28800, 57600 and 86400 seconds).
To capture the temporal patterns of user commenting activity, we also compute the rate at which comments are posted for the same time windows.
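The windowed counts and rates can be sketched as follows. Two assumptions are made here that the text does not confirm: each window is cumulative (it starts at publication time), and the rate is the comment count divided by the window length in seconds. The function name and the timestamps are made up for illustration.

```python
from bisect import bisect_right

# Window lengths in seconds, as listed in the text.
WINDOWS = [900, 1800, 2700, 3600, 7200, 14400, 28800, 57600, 86400]


def comment_window_features(publication_time, comment_times):
    """Count comments in each window after publication and derive a rate.

    Times are Unix timestamps in seconds. Comments dated before the
    publication time are ignored.
    """
    offsets = sorted(t - publication_time
                     for t in comment_times if t >= publication_time)
    features = {}
    for window in WINDOWS:
        count = bisect_right(offsets, window)  # comments with offset <= window
        features[f"comments_{window}s"] = count
        features[f"rate_{window}s"] = count / window  # comments per second
    return features


# Made-up timestamps: five comments posted after publication at t=1000.
features = comment_window_features(1000, [1100, 1500, 2000, 3000, 9000])
print(features["comments_900s"], features["comments_14400s"])
```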
Training the Model
Based on these features, we can train a supervised model. We have different contenders for supervised learning algorithms that can be used here. Among them, Random Forest and XGBoost algorithms work better than the others. Since we have handcrafted features, neural networks may not add much value.
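As a rough sketch of this step, the snippet below trains a Random Forest with scikit-learn on purely synthetic data. The three feature columns, the way the data is generated, and the hyperparameters are assumptions made for illustration. They are not the paper's dataset or configuration, and the resulting accuracy says nothing about real fake news detection.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = random.Random(0)


def synthetic_example(is_fake):
    """Made-up feature vector: [toxicity, source_score, comments_900s]."""
    return [
        rng.random() * 0.7 + (0.3 if is_fake else 0.0),  # toxicity-like score
        rng.random() * 0.7 + (0.0 if is_fake else 0.3),  # credibility-like score
        rng.randint(0, 50) + (30 if is_fake else 0),     # early comment count
    ]


# Balanced synthetic dataset: label 1 = fake, 0 = not fake.
labels = [i % 2 for i in range(400)]
X = [synthetic_example(bool(y)) for y in labels]

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy on synthetic data: {accuracy:.2f}")
print("feature importances:", model.feature_importances_)
```

Tree ensembles like this one accept handcrafted numeric features directly, without scaling, and expose feature importances that hint at which signals the model relies on.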
With such features, a model can reach an accuracy that is good enough for research purposes, but that is still far too low for a foolproof system. Fact checking by humans remains necessary today. Still, such a model can certainly assist a human fact checker on the job.