Here we build a fake news classification system in Python that represents text with Word2Vec-style embeddings (spaCy's 300-dimensional word vectors) and uses a machine learning classifier to decide whether a news article is real or fake.
Step-by-Step Implementation
The provided code will be explained step-by-step to help you understand the entire workflow.
Step 1: Installing and Loading Necessary Libraries
We import various libraries required for data processing, NLP and machine learning.
import re
import nltk
import spacy
import warnings
import numpy as np
import pandas as pd
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Downloading Necessary NLTK Data and Loading SpaCy Model
We download the required NLTK tokenizer and stopword data and load the spaCy language model with pretrained word vectors:
!python -m spacy download en_core_web_lg
nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_lg')
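As a quick sanity check, you can confirm that the vectors loaded correctly; en_core_web_lg ships 300-dimensional word vectors, so every document vector should have shape (300,). A minimal check, assuming the nlp object loaded above:
# en_core_web_lg provides 300-dimensional word vectors
print(nlp.vocab.vectors.shape)            # (number of vector keys, 300)
print(nlp("breaking news").vector.shape)  # (300,)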
Step 3: Loading the Dataset
Next, we load the true and fake news datasets and label them accordingly:
!pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Load the WELFake dataset straight into a pandas DataFrame
file_path = "WELFake_Dataset.csv"
df = kagglehub.dataset_load(KaggleDatasetAdapter.PANDAS, "saurabhshahane/fake-news-classification", file_path)
print(df.head())
Output:
(the first five rows of the dataset)
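If you already have the CSV on disk (or prefer not to use kagglehub), the same DataFrame can be loaded directly with pandas; the local path below is an assumption, so adjust it to wherever you saved the file:
# Alternative: read a locally downloaded copy of the dataset
df = pd.read_csv("WELFake_Dataset.csv")  # hypothetical local path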
Step 4: Dropping Unnecessary Columns and Shuffling
We then drop the leftover index column, remove rows with missing values and shuffle the dataset:
df = df.drop(columns=['Unnamed: 0'])           # drop the leftover index column
df = df.dropna()                               # remove rows with missing values
df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows
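Before moving on it is worth checking the class balance, since a heavily skewed label column would make plain accuracy misleading. A quick look at the cleaned, shuffled data:
# Inspect the class balance and remaining row count
print(df['label'].value_counts())
print(df.shape)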
Step 5: Preprocessing & Vectorization
We preprocess the news titles and convert each one into a 300-dimensional vector using spaCy's pretrained word embeddings.
Preprocessing steps:
- Remove special characters
- Convert to lowercase
- Tokenize words
- Remove stopwords
- Apply stemming
- Convert the cleaned text to a 300-dimensional vector with spaCy's word embeddings
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stopword set once, not per title
titles = np.array(df['title'])
corpus = []
for i in tqdm(range(len(titles))):
    news = re.sub(r'[^a-zA-Z]', ' ', titles[i])  # keep letters only
    news = word_tokenize(news.lower())           # lowercase and tokenize
    news = [ps.stem(word) for word in news if word not in stop_words]
    news = ' '.join(news)
    vector = nlp(news).vector                    # 300-dim document vector
    corpus.append(vector)
X = np.array(corpus)
Output:
(a tqdm progress bar tracking the titles as they are vectorized)
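Under the hood, spaCy's Doc.vector defaults to the average of the token vectors, which is why every title collapses to a single fixed-length 300-dimensional row regardless of how many words it contains. A small illustration (the sample phrase is arbitrary):
# Doc.vector is the mean of the individual token vectors
doc = nlp("stock market crash")
manual_mean = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, manual_mean))  # True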
Step 6: Splitting the Data
We split the data into training and testing sets.
y = df['label'].values
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
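This positional 80/20 split is safe only because the rows were shuffled in Step 4. An equivalent, slightly more robust alternative is scikit-learn's train_test_split, which can also stratify on the label so both sets keep the same class ratio (a sketch, not part of the original pipeline):
from sklearn.model_selection import train_test_split

# Stratified 80/20 split preserves the label distribution in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)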
Step 7: Training the Model
We train a logistic regression model on the training data:
classifier = LogisticRegression(random_state=1, max_iter=100)
classifier.fit(X_train, y_train)
Output:
(the fitted LogisticRegression estimator)
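Note that max_iter=100 is scikit-learn's default, and on dense 300-dimensional inputs the solver can stop early and raise a ConvergenceWarning. If that happens, allowing more iterations usually resolves it (an optional tweak, only needed if you see the warning):
# If fit() warns about non-convergence, give the solver more iterations
classifier = LogisticRegression(random_state=1, max_iter=1000)
classifier.fit(X_train, y_train)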
Step 8: Making Predictions and Evaluating the Model
We evaluate the trained model using accuracy, confusion matrix and classification report.
y_pred = classifier.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred) * 100, 2))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Output:
(the accuracy score, the classification report and the confusion matrix heatmap)
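Finally, to classify a brand-new headline, the exact preprocessing and vectorization used at training time must be reapplied before calling predict. A minimal helper sketch (the sample headline is invented for illustration, and the predicted value follows the WELFake label convention):
def predict_title(title):
    # Mirror the training pipeline: clean, lowercase, tokenize,
    # remove stopwords, stem, then embed with spaCy
    text = re.sub(r'[^a-zA-Z]', ' ', title)
    tokens = word_tokenize(text.lower())
    tokens = [ps.stem(w) for w in tokens if w not in stop_words]
    vector = nlp(' '.join(tokens)).vector
    return classifier.predict(vector.reshape(1, -1))[0]

print(predict_title("Scientists discover water on Mars"))  # invented headline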