Fake News Classification with Word2Vec

Last Updated : 14 Feb, 2026

In this article, we build a fake news classification system in Python that uses Word2Vec embeddings to represent text and a machine learning classifier to decide whether a news article is real or fake.

Step By Step Implementation

The code is explained step by step so you can follow the entire workflow.

Step 1: Importing Necessary Libraries

We import the libraries required for data processing, NLP and machine learning:

Python
import re 
import nltk
import spacy 
import warnings  
import numpy as np  
import pandas as pd 
from tqdm import tqdm 

import seaborn as sns  
import matplotlib.pyplot as plt

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 2: Downloading Necessary NLTK Data and Loading SpaCy Model

We download the required NLTK data and load the SpaCy language model:

Python
!python -m spacy download en_core_web_lg
nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_lg')

Step 3: Loading the Dataset

Next, we load the WELFake dataset from Kaggle; it already contains a label column marking each article as real or fake:

Python
!pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

file_path = "WELFake_Dataset.csv"
df = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "saurabhshahane/fake-news-classification",
    file_path
)
print(df.head())

Output:

dataset loaded from Kaggle

Step 4: Dropping Unnecessary Columns and Shuffling

We drop the unnecessary index column, remove rows with missing values and shuffle the dataset:

Python
df = df.drop(columns=['Unnamed: 0'])
df = df.dropna()
df = df.sample(frac=1).reset_index(drop=True) 
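
The shuffle above is unseeded, so every run produces a different order. Passing `random_state` makes it reproducible; a minimal sketch on a hypothetical toy frame (the column names here are illustrative):

```python
import pandas as pd

# Toy frame standing in for the news DataFrame
toy = pd.DataFrame({"title": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})

# frac=1 returns all rows in a random order; random_state pins the permutation
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(len(shuffled))               # row count is unchanged
print(sorted(shuffled["title"]))   # same rows, new order
```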

Step 5: Preprocessing & Vectorization

We preprocess the news titles and convert them into 300-dimensional Word2Vec vectors.

Preprocessing steps:

  • Remove special characters
  • Convert to lowercase
  • Tokenize words
  • Remove stopwords
  • Apply stemming
  • Convert text to Word2Vec vector using spaCy
Python
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not on every iteration
titles = np.array(df['title'])
corpus = []

for i in tqdm(range(len(titles))):
    news = re.sub(r'[^a-zA-Z]', ' ', titles[i])   # keep letters only
    news = word_tokenize(news.lower())            # lowercase and tokenize
    news = [ps.stem(word) for word in news if word not in stop_words]
    news = ' '.join(news)
    vector = nlp(news).vector                     # 300-dimensional document vector
    corpus.append(vector)

X = np.array(corpus)

Output:

preprocessing & vectorizing the whole corpus using spaCy
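
The pipeline above depends on NLTK downloads and the spaCy model. As an illustration of the cleaning steps alone, here is a self-contained sketch that uses a tiny hard-coded stopword list in place of NLTK's full English list and skips stemming and vectorization:

```python
import re

# Tiny stand-in for NLTK's English stopword list (illustration only)
STOPWORDS = {"the", "is", "a", "in", "of"}

def preprocess(title):
    # Keep letters only, lowercase, then whitespace-tokenize
    tokens = re.sub(r'[^a-zA-Z]', ' ', title).lower().split()
    # Drop stopwords (the article additionally applies PorterStemmer here)
    return ' '.join(t for t in tokens if t not in STOPWORDS)

cleaned = preprocess("Breaking: The Markets Rally in 2024!")
print(cleaned)  # "breaking markets rally"
```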

Step 6: Splitting the Data

We split the data into training and testing sets.

Python
y = df['label'].values
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
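
The slice above takes the last 20% of rows as the test set and relies on the earlier shuffle for randomness; it does not stratify by class. An alternative sketch using scikit-learn's `train_test_split` on hypothetical toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_toy = np.array([0] * 5 + [1] * 5)

# stratify keeps the 0/1 class ratio identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```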

Step 7: Training the Model

We train a logistic regression model on the training data:

Python
classifier = LogisticRegression(random_state=1, max_iter=1000)  # raised max_iter so the solver converges on 300-dim features
classifier.fit(X_train, y_train)

Output:

the fitted Logistic Regression model
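
Besides hard labels, `LogisticRegression` also exposes class probabilities through `predict_proba`, which is useful if you want to tune the decision threshold. A minimal sketch on hypothetical, cleanly separable toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data: class 0 at small values, class 1 at large ones
X_toy = np.array([[0.], [1.], [2.], [8.], [9.], [10.]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(random_state=1).fit(X_toy, y_toy)

print(clf.predict([[1.5], [9.5]]))    # hard 0/1 labels
print(clf.predict_proba([[1.5]])[0])  # [P(class 0), P(class 1)]
```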

Step 8: Making Predictions and Evaluating the Model

We evaluate the trained model using accuracy, confusion matrix and classification report.

Python
y_pred = classifier.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred) * 100, 2))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

Output:

classification report & confusion matrix
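
Accuracy, precision and recall can also be read straight off the confusion matrix. A sketch with a hypothetical 2x2 matrix (rows are actual labels, columns are predicted labels):

```python
import numpy as np

# Hypothetical binary confusion matrix: rows = actual, columns = predicted
cm = np.array([[50, 10],
               [5, 35]])

tn, fp, fn, tp = cm.ravel()           # sklearn's binary layout: tn, fp, fn, tp
accuracy  = (tp + tn) / cm.sum()      # correct predictions over all predictions
precision = tp / (tp + fp)            # of predicted positives, how many are right
recall    = tp / (tp + fn)            # of actual positives, how many are found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```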

