Fake News Detection in Python using Natural Language Processing – Can applied computing help journalists with automated fact-checking?

As the US has elected a new president, I found it interesting to write an article on fake news, a real sign of the times in the Trump era.

In the recent past, I already ran an experiment to help journalists with fact-checking. That post is available on this blog, and I also released the code of a very basic WordPress plugin that helps with fact-checking. You can read it at: ClaimReview, Fact Check – Using ClaimReview structured data element to Fact Check articles and build a WordPress plugin for it
http://flaven.fr/2018/12/claimreview-fact-check-using-claimreview-structured-data-element-to-fact-check-articles-and-build-a-wordpress-plugin-for-it/

But given the growing number of fake news stories, manually fact-checking every news item is preposterous and seems as obsolete as the sundial. It is like “emptying the sea with a spoon”. That reminds me of the old controversy between the search engine (Google) and the link directory (Yahoo) in the 2000s, when the inflation in the number of links ruined Yahoo, at least, and made Google’s fortune. You cannot defeat numbers.

I guess the fake news phenomenon is now so large that we have to find a way to delegate the indexing and analysis of fake news to computers rather than to humans.

As I am learning NLP (Natural Language Processing), I asked myself: “Can NLP come to the rescue and unmask the “mother fakers” behind this modern plague?” Apparently yes: I found a set of articles dealing with that issue.

Unless you have been living on Mars recently, you cannot ignore the conspiracy theorists and paranoiacs of all kinds who produce and consume fake news. These are people who truly believe that even the tiniest things have great political significance, and that everything they suspect might be true actually is! That’s freaky!
It is not just righteous, arrogant and condescending to tell people how they ought to behave, that they should stop being so credulous and stop thinking so much against their own interests!

Anyway, investigating this issue gave me the chance to come across some funny comments, like this one: “Too much Trump in the dataset” 🙂

I selected two almost identical posts and decided, more or less at random, to go with the code of the first one.

What amused me is that with a few lines of code you can dig in quickly and get some results that show how NLP (Natural Language Processing) can help predict fake news from news headlines! I didn’t dare try this on a news dataset from the media outlets I work with! I’m probably too wimpy or too cautious to do so, but that’s another question.

  1. Fake News Detection Using NLP
    https://medium.com/swlh/fake-news-detection-using-nlp-e744a6909276
  2. FAKE NEWS DETECTION USING NLP AND MACHINE LEARNING IN PYTHON
    https://wisdomml.com/fake-news-detection-using-nlp-and-machine-learning/

The dataset can be found at https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset
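
To make the idea of predicting fake news from headlines more concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier that uses only the Python standard library. It is not the model used in the two posts listed above, just a swapped-in technique that fits in a few lines. The six headlines are invented toy examples, not rows from the Kaggle dataset, and the names TRAIN, tokenize, train and predict are placeholders chosen for this sketch.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy headlines, invented for illustration only.
TRAIN = [
    ("senate passes budget bill after long debate", "real"),
    ("central bank keeps interest rates unchanged", "real"),
    ("court rules on state election law", "real"),
    ("shocking secret they do not want you to know", "fake"),
    ("you will not believe what this celebrity said", "fake"),
    ("secret cure doctors do not want you to know", "fake"),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(samples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest Naive Bayes log-score."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior: share of training headlines carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(predict(model, "secret they do not want you to know"))
print(predict(model, "senate debate on budget bill"))
```

With so little data the result only shows the mechanics: the classifier picks the label whose training vocabulary overlaps most with the headline. On the real dataset you would train on thousands of titles and keep a held-out part to measure accuracy.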

Thanks to Chase Thompson for sharing this post. Connect with Chase Thompson on LinkedIn: https://www.linkedin.com/in/wchasethompson

All the files can be found on my GitHub account at https://github.com/bflaven/BlogArticlesExamples/tree/master/fake_news_nlp_detection

Fake news detection with python – the environment

In practice, fake news detection requires installing a large number of Python libraries, such as nltk, seaborn, wordcloud, PIL, BeautifulSoup, sklearn, keras, tensorflow, etc.

In several previous blog posts, I explained how to perform these installations with either Anaconda or pip. What worked for me was mainly pip, and that’s all. Below is a quick reminder of how to install them from the Mac terminal.

The commands to install the libraries

# install tensorflow and keras
# be careful: tensorflow and keras require a lot of disk space
pip install tensorflow
pip install keras
 
# The other libraries are installed the same way with pip install [library_name]:
# nltk, seaborn, wordcloud... etc.
# Note that a few pip package names differ from the import names:
# PIL is installed as Pillow, BeautifulSoup as beautifulsoup4, sklearn as scikit-learn.
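
Before launching the script, it can help to check which of these libraries are still missing. The sketch below is only a suggestion: the PACKAGES mapping and the missing_packages helper are names made up for this example. It maps each import name to the name expected by pip and prints a ready-to-paste pip command. pandas and matplotlib are added because the snippets further down use pd and plt.

```python
import importlib.util

# Import name -> name to pass to "pip install".
PACKAGES = {
    "nltk": "nltk",
    "seaborn": "seaborn",
    "wordcloud": "wordcloud",
    "PIL": "Pillow",
    "bs4": "beautifulsoup4",
    "sklearn": "scikit-learn",
    "keras": "keras",
    "tensorflow": "tensorflow",
    "pandas": "pandas",          # used as pd in the snippets of this post
    "matplotlib": "matplotlib",  # used as plt in the snippets of this post
}


def missing_packages(packages=PACKAGES):
    """Return the pip names of the packages whose import name is not found."""
    return [
        pip_name
        for import_name, pip_name in packages.items()
        if importlib.util.find_spec(import_name) is None
    ]


missing = missing_packages()
if missing:
    print("pip install " + " ".join(missing))
else:
    print("All libraries are already installed.")
```

The check only looks up each module without importing it, so it stays fast even when tensorflow is present.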

Not a Jupyter notebook, just a simple Python script with comments

I am not really fond of Jupyter notebooks; I’d rather use a standard Python file to explore and run a script. I know Jupyter notebooks are made for educational purposes, but I prefer the logic of the code laid out in a script! So for the two articles I selected, which are almost identical by the way, I converted the Jupyter notebook (.ipynb) file into a standard Python script (.py) and slightly rewrote it to highlight the main steps. Anyway, for the record, I keep a note of how to launch a Jupyter Notebook.
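
The post does not say how the conversion was done. The usual tool is the command jupyter nbconvert --to script followed by the notebook file name. As an alternative, here is a minimal sketch that relies on the fact that a notebook in nbformat 4 is a JSON file with a "cells" list. The function name notebook_to_script and the "# --- cell N" markers are assumptions made for this example.

```python
import json
from pathlib import Path


def notebook_to_script(ipynb_path, py_path):
    """Write the code cells of a notebook (nbformat 4 JSON) to a .py file.

    Returns the number of code cells written.
    """
    notebook = json.loads(Path(ipynb_path).read_text(encoding="utf-8"))
    chunks = []
    for index, cell in enumerate(notebook.get("cells", []), start=1):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # "source" is stored either as a list of lines or as one string.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(f"# --- cell {index}\n{source.rstrip()}\n")
    Path(py_path).write_text("\n".join(chunks), encoding="utf-8")
    return len(chunks)
```

For example, notebook_to_script("fake-news-nlp-master_2.ipynb", "fake_news_nlp.py") would produce one script with a comment marker before each former cell; the output file name here is hypothetical.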

How to launch the Jupyter Notebook App on a Mac

  1. Click on Spotlight and type terminal to open a terminal window.
  2. Enter the startup folder by typing cd /some_folder_name.
    cd /Users/brunoflaven/Documents/02_copy/_000_IA_bruno_light/article_bert_detecting_fake_news_1/fake-news-nlp-master/
  3. Type jupyter notebook to launch the Jupyter Notebook App. The notebook interface will appear in a new browser window or tab; then just click on the *.ipynb file, e.g. fake-news-nlp-master_2.ipynb.
    jupyter notebook

Source: https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/execute.html

Light changes in the Python script

Here are the light changes I made to make the script more understandable and easier to use:

1. I reduced the sample size, simply because my Mac almost dies when I load the full dataset.

# Import our data
import pandas as pd
 
# a tiny dataset with 3 records
# true = pd.read_csv("TrueSmallSample.csv")
# fake = pd.read_csv("FakeSmallSample.csv")
 
# a medium dataset with 90 records
true = pd.read_csv("TrueMediumSample.csv")
fake = pd.read_csv("FakeMediumSample.csv")
 
# the complete dataset
# true = pd.read_csv("TrueBigSample.csv")
# fake = pd.read_csv("FakeBigSample.csv")

2. I’d rather save image files than show them. Why? First, to spare my Mac: I am working on it at the same time, so I don’t want it to turn into a zombie. Second, I don’t like the script flow being interrupted by image generation.

# PICTURE_3 OUTPUT
""" Article Subjects By Type """
plt.figure(figsize=(16, 9))
sns.countplot(x='subject', hue='category', data=df)
plt.title('Article Subjects By Type', fontsize=24)
plt.ylabel('Total', fontsize=16)
plt.xlabel('')
plt.xticks(fontsize=12)
plt.legend(['Fake', 'Real'])
# plt.show()
plt.savefig('picture_3_article_subjects_by_type.png')
print("\n--- # PICTURE_3 OUTPUT ")

3. I print the significant results in the console whenever it is meaningful, a bit like in a Jupyter notebook, e.g. # PICTURE_3 OUTPUT. I know it may sound contradictory, but Jupyter Notebook is too CPU-intensive.

# Let's explore the data at a base level
 
sample_true = true.sample(20)
sample_fake = fake.sample(20)
 
print("\n--- sample_true only 20")
print(sample_true)
print("\n--- sample_fake only 20")
print(sample_fake)
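
For the first change listed above (the reduced sample files), the post does not show how the small and medium CSV files were produced. One possible way, sketched here with the standard library only, is reservoir sampling: it reads the big file row by row and never holds more than the requested number of records in memory, which suits a machine that struggles with the full dataset. The function name sample_csv and the default seed are assumptions made for this example.

```python
import csv
import random


def sample_csv(src_path, dst_path, n_rows, seed=42):
    """Copy the header plus up to n_rows randomly chosen records to dst_path.

    Returns the number of records written.
    """
    # Article bodies can be longer than the csv module's default field limit.
    csv.field_size_limit(10**9)
    rng = random.Random(seed)
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        reservoir = []
        for index, row in enumerate(reader):
            if index < n_rows:
                reservoir.append(row)
            else:
                # Reservoir sampling: a later row replaces a kept one
                # with probability n_rows / (index + 1).
                slot = rng.randint(0, index)
                if slot < n_rows:
                    reservoir[slot] = row
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

With the file names used in this post, sample_csv("TrueBigSample.csv", "TrueMediumSample.csv", 90) would write a 90-record sample, and the same call on the fake file gives its counterpart. The fixed seed makes the sample reproducible from one run to the next.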

Conclusion:

Well, with only a few searches and a very fresh understanding of NLP, I was able to find some basic scripts that are supposed to help me detect fake news. Indeed, fake news blossoms everywhere and constantly in digital media and on social networks. It is one of the symptoms of the scary post-truth phenomenon. Just think for a minute about the “bullshit” news about COVID-19, the vaccine or some social issues.

For sure, detecting fake news may become increasingly necessary, and given the volume, why not apply Artificial Intelligence (AI) and, more specifically, Natural Language Processing (NLP)? I guess future journalists will be required to handle more advanced tools to detect fake news, just as they already learn how to shoot video with a mobile phone for social networks!

Well, OK, fine, another tech answer to a human problem! But with fake news there are backlashes with even more pernicious consequences: journalists’ self-censorship, or the fact that news is no longer about true or fake but more a matter of belief than anything else!
