Here we build a fake news classification system in Python that represents text with Word2Vec-style embeddings (spaCy's 300-dimensional word vectors) and uses a machine learning classifier to decide whether a news article is real or fake.
Step-by-Step Implementation
The provided code will be explained step-by-step to help you understand the entire workflow.
Step 1: Installing and Loading Necessary Libraries
We import various libraries required for data processing, NLP and machine learning.
import re
import nltk
import spacy
import warnings
import numpy as np
import pandas as pd
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Downloading Necessary NLTK Data and Loading SpaCy Model
We download the required NLTK tokenizer and stopword data and load the spaCy language model with pretrained word vectors:
!python -m spacy download en_core_web_lg
nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load('en_core_web_lg')
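As a quick sanity check, you can confirm that the vectors loaded correctly; en_core_web_lg ships 300-dimensional word vectors, so every document vector should have shape (300,). A minimal check, assuming the nlp object loaded above:
# en_core_web_lg provides 300-dimensional word vectors
print(nlp.vocab.vectors.shape)            # (number of vector keys, 300)
print(nlp("breaking news").vector.shape)  # (300,)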
Step 3: Loading the Dataset
Next, we load the true and fake news datasets and label them accordingly:
!pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Load the WELFake dataset straight into a pandas DataFrame
file_path = "WELFake_Dataset.csv"
df = kagglehub.dataset_load(KaggleDatasetAdapter.PANDAS, "saurabhshahane/fake-news-classification", file_path)
print(df.head())
Output:
(the first five rows of the dataset)
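If you already have the CSV on disk (or prefer not to use kagglehub), the same DataFrame can be loaded directly with pandas; the local path below is an assumption, so adjust it to wherever you saved the file:
# Alternative: read a locally downloaded copy of the dataset
df = pd.read_csv("WELFake_Dataset.csv")  # hypothetical local path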
Step 4: Dropping Unnecessary Columns and Shuffling
We then drop the leftover index column, remove rows with missing values and shuffle the dataset:
df = df.drop(columns=['Unnamed: 0'])           # drop the leftover index column
df = df.dropna()                               # remove rows with missing values
df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows
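Before moving on it is worth checking the class balance, since a heavily skewed label column would make plain accuracy misleading. A quick look at the cleaned, shuffled data:
# Inspect the class balance and remaining row count
print(df['label'].value_counts())
print(df.shape)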
Step 5: Preprocessing & Vectorization
We preprocess the news titles and convert each one into a 300-dimensional vector using spaCy's pretrained word embeddings.
Preprocessing steps:
- Remove special characters
- Convert to lowercase
- Tokenize words
- Remove stopwords
- Apply stemming
- Convert the cleaned text to a 300-dimensional vector with spaCy's word embeddings
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stopword set once, not per title
titles = np.array(df['title'])
corpus = []
for i in tqdm(range(len(titles))):
    news = re.sub(r'[^a-zA-Z]', ' ', titles[i])  # keep letters only
    news = word_tokenize(news.lower())           # lowercase and tokenize
    news = [ps.stem(word) for word in news if word not in stop_words]
    news = ' '.join(news)
    vector = nlp(news).vector                    # 300-dim document vector
    corpus.append(vector)
X = np.array(corpus)
Output:
(a tqdm progress bar tracking the titles as they are vectorized)
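Under the hood, spaCy's Doc.vector defaults to the average of the token vectors, which is why every title collapses to a single fixed-length 300-dimensional row regardless of how many words it contains. A small illustration (the sample phrase is arbitrary):
# Doc.vector is the mean of the individual token vectors
doc = nlp("stock market crash")
manual_mean = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, manual_mean))  # True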
Step 6: Splitting the Data
We split the data into training and testing sets.
y = df['label'].values
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
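This positional 80/20 split is safe only because the rows were shuffled in Step 4. An equivalent, slightly more robust alternative is scikit-learn's train_test_split, which can also stratify on the label so both sets keep the same class ratio (a sketch, not part of the original pipeline):
from sklearn.model_selection import train_test_split

# Stratified 80/20 split preserves the label distribution in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)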
Step 7: Training the Model
We train a logistic regression model on the training data:
classifier = LogisticRegression(random_state=1, max_iter=100)
classifier.fit(X_train, y_train)
Output:
(the fitted LogisticRegression estimator)
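Note that max_iter=100 is scikit-learn's default, and on dense 300-dimensional inputs the solver can stop early and raise a ConvergenceWarning. If that happens, allowing more iterations usually resolves it (an optional tweak, only needed if you see the warning):
# If fit() warns about non-convergence, give the solver more iterations
classifier = LogisticRegression(random_state=1, max_iter=1000)
classifier.fit(X_train, y_train)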
Step 8: Making Predictions and Evaluating the Model
We evaluate the trained model using accuracy, confusion matrix and classification report.
y_pred = classifier.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred) * 100, 2))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Output:
(the accuracy score, the classification report and the confusion matrix heatmap)
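Finally, to classify a brand-new headline, the exact preprocessing and vectorization used at training time must be reapplied before calling predict. A minimal helper sketch (the sample headline is invented for illustration, and the predicted value follows the WELFake label convention):
def predict_title(title):
    # Mirror the training pipeline: clean, lowercase, tokenize,
    # remove stopwords, stem, then embed with spaCy
    text = re.sub(r'[^a-zA-Z]', ' ', title)
    tokens = word_tokenize(text.lower())
    tokens = [ps.stem(w) for w in tokens if w not in stop_words]
    vector = nlp(' '.join(tokens)).vector
    return classifier.predict(vector.reshape(1, -1))[0]

print(predict_title("Scientists discover water on Mars"))  # invented headline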