Building a Fake News Classifier

Introduction

The internet era is changing the way individuals consume and share news. We are now just a click away from the information we want. But can we be sure that the news we are reading is not fake?

Fake news is misleading or false content that is spread intentionally or unintentionally. Advances in technology have made it easy to spread. Fake news can be distributed to harm the reputation of an individual or an organization, or to gain political influence or monetary benefit.

Photo by Joshua Miranda

Why is Fake News an Important Problem to Solve?

➔ Fake news detection can save businesses from financial damage. If fake news spreads about a business, it can erode consumer confidence, and less confident consumers are less likely to spend money on its products or services.

➔ It can protect a brand’s reputation, which in turn affects customers’ buying behavior.

➔ Timely detection of fake news can protect businesses from falling valuations. In 2017, fake news spread that the co-founder of Ethereum, Vitalik Buterin, had died in a car accident. He later publicly confirmed that he was still alive, but by then Ethereum’s overall market value had plummeted by $4 billion.

Objectives

  • Predict whether a news item entered by a user is fake or not
  • Perform a thorough exploratory data analysis
  • Select the best machine learning model for the classifier
  • Create a web app and deploy it on Heroku

Dataset

Here is the link to the dataset that we are going to use.

For the detailed code and analysis, do check this out.

Let’s start by importing necessary libraries and packages.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stop-word lists used by nltk.corpus.stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import re
from bs4 import BeautifulSoup
from tqdm import tqdm

Let’s import the dataset and see what it contains.

path = "provide the path to the dataset where it is located"
news = pd.read_csv(path)

news.head()

Dataset Overview

The dataset has the following six features

● “Unnamed: 0”: Unique id of the news item

● Title: Contains the news headline

● Text: Contains the news content/article (could be incomplete)

● Subject: The subject the news belongs to, e.g.

      ○ politicsNews
      ○ worldNews
      ○ news
      ○ politics
      ○ left-news
      ○ Government News
      ○ US_News
      ○ Middle-east

● Date: The date on which the news was posted

● Labels: The target variable that we have to predict

      1 : News is fake
      0 : News is real

All the features are categorical except “Unnamed: 0”.

Let’s rename the first column and check whether the dataset contains any missing values.

news.rename(columns={'Unnamed: 0': 'Id'}, inplace=True)  # renaming the first column
news.isnull().values.sum()  # counting NaN values across the whole dataset
# output: 0, which means there are no NaN values in the entire dataset

Data Preprocessing

Here I will focus only on the “Text” feature. If you want the univariate and multivariate analysis, you can check out my GitHub.

We follow the standard NLP text preprocessing techniques to make the Text column ready for featurization.

  • The following call returns the preprocessed “text”:
preprocessed_news = news.copy()  # assumption: work on a copy of the raw DataFrame
preprocessed_news['text'] = preprocessing_data(news['text'])
# it took around 5 minutes to execute this cell on Google Colab
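The body of preprocessing_data is not shown in this article, so here is a minimal sketch of what such a function might look like. The cleaning steps (strip HTML tags, drop URLs, keep letters only, lowercase, remove stop words) and the tiny STOP_WORDS set are my assumptions for illustration; the version used for the results in this post may differ, for example by using NLTK’s stop-word list.

```python
import re

import pandas as pd

# Assumption: a tiny illustrative stop-word list. In practice
# nltk.corpus.stopwords.words("english") would be a more complete choice.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "to", "and", "it"}


def clean_text(raw):
    """Clean one article: strip tags and URLs, keep letters, lowercase, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw)                  # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^A-Za-z]+", " ", text)              # keep letters only
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def preprocessing_data(texts):
    """Apply clean_text to every entry of a pandas Series of raw articles."""
    return texts.apply(clean_text)


sample = pd.Series(["<p>The Senate is VOTING on the bill: https://example.com/x</p>"])
cleaned = preprocessing_data(sample)
print(cleaned.iloc[0])
```

On a full dataset, wrapping the loop in tqdm gives a progress bar, which is handy for a cell that runs for several minutes.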

Let’s convert our target/output variable to integers.

def str_to_int(label):
    # changing Fake to 1 and Real to 0
    if label == 'Fake':
        return 1
    return 0

values = preprocessed_news['Labels']
preprocessed_news['Labels'] = values.apply(str_to_int)

Now let’s check whether the dataset is balanced.

The count of fake news is higher than that of true news, but the difference is small, so it is reasonable to treat the dataset as balanced.
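A quick way to run this check is value_counts on the label column. The sketch below uses a tiny made-up DataFrame in place of preprocessed_news, so the counts are illustrative and not the real class counts.

```python
import pandas as pd

# Toy stand-in for preprocessed_news (1 = fake, 0 = real); not the real data.
preprocessed_news = pd.DataFrame({"Labels": [1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1]})

counts = preprocessed_news["Labels"].value_counts()                # absolute counts per class
shares = preprocessed_news["Labels"].value_counts(normalize=True)  # fraction per class
imbalance_ratio = counts.max() / counts.min()                      # 1.0 means perfectly balanced

print(counts)
print(shares.round(3))
print("majority/minority ratio:", round(imbalance_ratio, 2))
```

A ratio close to 1 supports treating the data as balanced; a much larger ratio would call for stratified splits at least, and possibly resampling or class weights.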

TEXT Featurization

Let’s start by splitting the data into train, cv and test datasets.
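The split code is not shown here, so the following is a sketch under assumptions. The matrix shapes printed later in this post (20149 / 9925 / 14814 rows) are consistent with applying train_test_split twice with test_size=0.33, so that is what I use below; the stratify and random_state arguments are my own additions. Dummy arrays stand in for the text and label columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 44888                 # 20149 + 9925 + 14814, the row counts reported later in the post
X = np.arange(n_rows)          # dummy stand-in for preprocessed_news['text']
Y = np.arange(n_rows) % 2      # dummy stand-in for preprocessed_news['Labels']

# First split off the test set, then carve a cv set out of the remainder.
X_rest, X_test, Y_rest, Y_test = train_test_split(
    X, Y, test_size=0.33, stratify=Y, random_state=42
)
X_train, X_cv, Y_train, Y_cv = train_test_split(
    X_rest, Y_rest, test_size=0.33, stratify=Y_rest, random_state=42
)

print(len(X_train), len(X_cv), len(X_test))
```

Splitting before vectorizing matters: the vectorizer should learn its vocabulary from the training set only, so that nothing from the cv or test sets leaks into the features.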

Here we will use the TF-IDF vectorizer for featurization.

TF-IDF is an abbreviation of Term Frequency-Inverse Document Frequency. It is composed of two terms: term frequency and inverse document frequency.

**Term Frequency**

Term frequency is the number of times a word appears in a document, normalized by the length of the document. It can also be thought of as the probability of finding the word in the document, and it indicates how important a word is within that document.

term frequency = number of times a term appears in the document / total number of terms in the document

**Document Frequency**: Document frequency is the number of documents containing a specific term. It indicates how common the term is.

**Inverse Document Frequency**: Inverse document frequency measures how uncommon a word is across the corpus. It is the log-scaled inverse of the document frequency, so it gives more importance to rarer words.

inverse document frequency = log(number of documents / number of documents containing the term)

Combining the two terms mathematically, we get

tf-idf = tf * idf

Here tf gives more weight to words that are frequent within a document, while idf gives more weight to words that are rare across the corpus and less weight to words that appear almost everywhere. The two therefore balance each other.
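To make the formulas concrete, here is a small worked example that computes tf, idf and tf-idf by hand, exactly as defined above. The three-document corpus is made up for illustration.

```python
import math

# Toy corpus: each document is a list of tokens (made-up example data).
docs = [
    "the senate passed the bill".split(),
    "the president signed the bill".split(),
    "aliens signed the treaty".split(),
]


def term_frequency(term, doc):
    # number of times the term appears / total number of terms in the document
    return doc.count(term) / len(doc)


def inverse_document_frequency(term, corpus):
    # log(number of documents / number of documents containing the term)
    # Assumes the term occurs in at least one document of the corpus.
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / df)


def tf_idf(term, doc, corpus):
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)


for term in ["the", "bill", "senate"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

“the” appears in every document, so its idf is log(1) = 0 and its tf-idf is 0 even though it is the most frequent word in the first document. “senate” appears in only one document and gets the highest score of the three.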

Advantages

* TF-IDF is based on the Bag of Words model, which makes it well suited to lexical-level features.

* TF-IDF is very simple to calculate and is computationally cheap.

* We can use TF-IDF to discover the top terms related to a document or a group of documents.

Limitations

* TF-IDF ignores word order: the sequence of the terms is lost.

* The terms with the highest TF-IDF scores may not reflect the topic of the document, since IDF assigns a high weight to any term whose document frequency is low.

* It does not capture semantics.
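The vectorizer setup is not shown in this post, so here is a sketch of how the TF-IDF matrices might be built with scikit-learn. The ngram_range=(1, 2) and max_features=10000 settings are assumptions inferred from the output further down (a bigram such as 'abc news' among the features, and 10000 columns); the toy sentences stand in for the real train, cv and test texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the preprocessed train / cv / test texts.
X_train = [
    "senate passes budget bill",
    "president signs budget bill",
    "aliens endorse candidate",
]
X_cv = ["senate rejects bill"]
X_test = ["president meets aliens"]

# Assumed settings: unigrams + bigrams, vocabulary capped at 10000 terms.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10000)

# Learn the vocabulary and idf weights on the training set only ...
tf_idf_X_train = vectorizer.fit_transform(X_train)
# ... and reuse them to transform the cv and test sets.
tf_idf_X_cv = vectorizer.transform(X_cv)
tf_idf_X_test = vectorizer.transform(X_test)

print(tf_idf_X_train.shape, tf_idf_X_cv.shape, tf_idf_X_test.shape)
```

Note that scikit-learn’s TfidfVectorizer does not use the textbook formula verbatim: by default it applies a smoothed idf, ln((1 + n) / (1 + df)) + 1, and L2-normalizes each row, so its numbers will differ slightly from a hand calculation.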

print("some feature names:--", vectorizer.get_feature_names_out()[:10])
print("tf_idf_X_train", tf_idf_X_train.shape)
print("tf_idf_X_cv   ", tf_idf_X_cv.shape)
print("tf_idf_X_test ", tf_idf_X_test.shape)
print(type(tf_idf_X_train))

OUTPUT --
some feature names:-- ['aaron' 'abadi' 'abandon' 'abandoned' 'abandoning' 'abbas' 'abbott' 'abc' 'abc news' 'abdel']
tf_idf_X_train (20149, 10000)
tf_idf_X_cv    (9925, 10000)
tf_idf_X_test  (14814, 10000)
<class 'scipy.sparse.csr.csr_matrix'>

Modeling

I have tried various models using the following featurizations:

Bag of words

TF-IDF vectorizer

TF-IDF weighted word2vec

BERT embeddings

The best performing model was Logistic Regression on TF-IDF features, so here I will discuss only that model. If you want to go in depth, just follow this link; I have tried to explain every line of code in an easy-to-understand manner.

Performance Metric

Before moving on to Logistic Regression, we should know the performance metric on which the model will be evaluated.

Recall

Recall measures the fraction of actual positives (here, fake news articles) that the model correctly identifies. It is also known as sensitivity or the true positive rate.

Sensitivity = (True positives) / (Actual positives)

Sensitivity = (True positives) / (True positives + False negatives)

Why was sensitivity used? The cost of a false negative is fairly high in fake news detection: as we saw, fake news can send stock prices plummeting in just a couple of days. Imagine a scenario where our model predicts a fake news item to be true news; companies may incur huge tangible and intangible losses as a result.
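As a quick illustration of the formula, the sketch below computes recall from confusion-matrix counts and checks it against scikit-learn’s recall_score. The labels are made up (1 = fake, 0 = real).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = fake news (positive class), 0 = real news.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_manual = tp / (tp + fn)            # True positives / (True positives + False negatives)
recall_sklearn = recall_score(y_true, y_pred)

print("TP:", tp, "FN:", fn)
print("recall:", recall_manual, recall_sklearn)
```

Here one of the five fake articles slips through as real (a false negative), so recall is 4 / 5 = 0.8.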

Alternate Metrics

F1 score

This metric is the harmonic mean of precision and recall, so it balances the two.

f1 score = 2 * (precision) * (recall) / (precision + recall)

This metric is used when both false negatives and false positives matter. We will not use it to evaluate our model, because our priority is to reduce false negatives.
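To see how F1 balances the two, consider an extreme, made-up case: a model that labels every article as fake. The sketch below computes precision, recall and F1 for it with scikit-learn and checks F1 against the formula above.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up data: 3 fake (1) and 7 real (0) articles.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# A degenerate "model" that flags everything as fake.
y_pred = [1] * len(y_true)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
f1_manual = 2 * precision * recall / (precision + recall)

print("precision:", round(precision, 3))
print("recall:   ", round(recall, 3))
print("f1:       ", round(f1, 3), round(f1_manual, 3))
```

Recall is perfect because no fake article is missed, but precision is only 0.3 and F1 ends up below 0.5. This is the trade-off to keep in mind when optimizing for recall alone.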

Pros of Sensitivity

● Self-explanatory evaluation metric

● Recall penalizes the model for every false negative

● Sensitivity is used extensively in

      ○ Email spam classifiers
      ○ Credit card fraud detection
      ○ Diabetes prediction

Cons of Sensitivity

● This metric does not penalize false positives, so it fails when they matter.

● Most of the time, achieving high sensitivity means compromising on the model’s precision.

Logistic Regression

Contrary to its name, Logistic Regression is a classification model rather than a regression model. It is one of the most basic classification models and yields its best results when the data is linearly separable. Logistic Regression is trained by minimizing the logistic loss (log loss).

PROS

Easy to implement and very efficient to train.

Its coefficients can be read as feature importance, provided the features are not collinear.

It does not overfit easily, and when the dimensionality of the data is high we can use L1 or L2 regularization to keep it in check.

A common choice when the classes are linearly separable.

CONS

Non-linear problems cannot be solved with plain Logistic Regression, because its decision boundary is linear.

Feature importance is reliable only when the features are not collinear.

It cannot handle missing data as gracefully as tree-based techniques like Decision Tree and Random Forest.

Let’s write the code for the confusion matrix before moving on to the logistic regression part.
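The original confusion-matrix code is not shown here, so below is a minimal sketch of a helper that could fill that role. The function name labelled_confusion_matrix and the row/column labels are my own choices for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix


def labelled_confusion_matrix(y_true, y_pred):
    """Return the 2x2 confusion matrix as a DataFrame with readable labels.

    Assumes the encoding used in this post: 0 = real news, 1 = fake news.
    """
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return pd.DataFrame(
        cm,
        index=["actual real", "actual fake"],
        columns=["predicted real", "predicted fake"],
    )


# Made-up labels for a quick check.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
cm_df = labelled_confusion_matrix(y_true, y_pred)
print(cm_df)
```

The cell to watch for this problem is row “actual fake”, column “predicted real”: those are the false negatives that recall penalizes. The DataFrame can also be passed to a heatmap function such as seaborn’s for a visual version.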

HyperParameter Tuning

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

clf = LogisticRegression(solver='liblinear')  # liblinear supports both l1 and l2 penalties
grid_values = {'penalty': ['l1', 'l2'], 'C': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]}
# 3-fold cross-validated grid search that optimizes recall
search = GridSearchCV(clf, param_grid=grid_values, scoring='recall', verbose=3, cv=3)
result = search.fit(tf_idf_X_train, Y_train)
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

--output--
Best Score: 0.9926325538006674
Best Hyperparameters: {'C': 1, 'penalty': 'l1'}
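Once the grid search has picked its hyperparameters, a final model can be refit with them and checked on held-out data. The sketch below does this with C=1 and the l1 penalty reported above, but on synthetic features from make_classification as a stand-in for the TF-IDF matrices, so the numbers it prints are not the results of this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real inputs would be tf_idf_X_train / tf_idf_X_test.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

# Refit with the best hyperparameters found by the grid search.
best_model = LogisticRegression(solver="liblinear", penalty="l1", C=1)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because GridSearchCV refits the best estimator by default, result.best_estimator_ from the search above could be used directly instead of constructing a new model.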

Let’s look at the output.

Logistic Regression has produced excellent results, be it accuracy, recall or precision. Only 14 fake news articles were misclassified as true news, which is very good given how costly such misclassifications are.
