Graph‑Based Semi‑Supervised Learning is a machine learning approach that uses both labelled and unlabelled data by modelling their relationships as a graph. Here, data points are treated as nodes and edges represent similarities, allowing labels to propagate through the graph based on structural connections.
- Makes effective use of limited labelled data
- Exploits relationships and similarity between data points
- Spreads labels through the graph from labelled to unlabelled nodes
- Commonly applied in text classification, image recognition and recommendation systems

Need for Graph-Based Learning
In many real-world scenarios, labelling data is expensive and time consuming, while unlabelled data is often abundant. Graph-based methods offer a useful way to combine both:
- Captures relationships between data points
- Uses unlabelled data to improve accuracy
- Reduces reliance on large labelled datasets
How It Works
In Graph-Based Semi-Supervised Learning, the dataset is first represented as a graph where each data point becomes a node. Nodes that are similar to each other are connected through edges. These connections allow the model to use the structure of the data to spread label information from a small set of labelled nodes to many unlabelled ones.
The learning process mainly happens in two stages:
1. Graph Construction
In this step, the graph structure is created to represent relationships among data points.
- Each data point is represented as a node in the graph.
- Edges are added between nodes that are similar to each other.
- Similarity can be measured using distance metrics, k-nearest neighbours or other similarity functions.
- Edge weights indicate how strong the relationship between two nodes is.
- Stronger connections mean higher influence during label propagation.
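The construction steps above can be sketched with scikit-learn's kneighbors_graph. The five sample points and the Gaussian bandwidth sigma below are hypothetical choices for illustration, not part of any fixed recipe:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Hypothetical 2-D points: three near the origin, two near (5, 5)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9]])

# k-NN graph storing raw Euclidean distances on the edges
D = kneighbors_graph(X, n_neighbors=2, mode='distance', include_self=False)

# Turn distances into similarity weights with a Gaussian (RBF) kernel:
# near-identical points get weights close to 1, distant ones close to 0
sigma = 1.0
W = D.copy()
W.data = np.exp(-W.data ** 2 / (2 * sigma ** 2))

print(W.toarray().round(2))
```

Here stronger edges (weights near 1) will dominate during propagation, matching the intuition that close neighbours should influence each other most.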

2. Label Propagation
Once the graph is built, labels are spread across it using the connections.
- The process starts with a small number of labelled nodes.
- Labels of these nodes are kept fixed throughout learning.
- Unlabelled nodes receive label information from their neighbouring nodes.
- Nodes with stronger connections have a greater influence on label assignment.
- This process is repeated iteratively until the labels stabilize and no longer change.
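The iterative procedure above can be sketched from scratch with NumPy. The tiny 4-node adjacency matrix and the two clamped labels below are made up for illustration:

```python
import numpy as np

# Hypothetical graph: nodes 0-1 strongly connected, nodes 2-3 strongly
# connected, and a weak bridge between 1 and 2. Node 0 is labelled
# class 0 and node 3 is labelled class 1; the rest are unlabelled.
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
labelled = {0: 0, 3: 1}   # node -> class
n_classes = 2

# Label matrix F: labelled rows are one-hot, unlabelled rows start uniform
F = np.full((4, n_classes), 0.5)
for node, cls in labelled.items():
    F[node] = np.eye(n_classes)[cls]

# Row-normalised transition matrix: each node averages its neighbours
P = W / W.sum(axis=1, keepdims=True)

for _ in range(100):
    F = P @ F
    for node, cls in labelled.items():   # keep labelled nodes fixed (clamped)
        F[node] = np.eye(n_classes)[cls]

pred = F.argmax(axis=1)
print(pred)   # [0 0 1 1]
```

Node 1 ends up class 0 and node 2 class 1 because each is pulled far more strongly by its heavy edge than by the weak bridge between them.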

Types of Graph-Based Semi-Supervised Learning Methods
Graph-based semi-supervised learning methods fall into a few broad families:
Graph Regularization Methods
These methods enforce smoothness over the graph.
- Assume connected or similar nodes should have similar labels.
- Use smoothness constraints to prevent sudden label changes across neighbours.
- Often based on graph Laplacian or energy minimization.
- Common examples include label propagation and label spreading.
- They are simple, effective and work well when data naturally forms clusters.
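The smoothness idea can be made concrete with the graph Laplacian L = D - W: the quadratic form f^T L f sums w_ij (f_i - f_j)^2 over the edges, so label assignments that change across strong edges incur a high penalty. A minimal sketch on a hypothetical 4-node chain graph:

```python
import numpy as np

# Adjacency matrix for a chain graph: 0-1-2-3, all edge weights 1
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # unnormalised graph Laplacian

# f^T L f equals the sum of w_ij * (f_i - f_j)^2 over edges (each edge once)
smooth = np.array([0.0, 0.0, 1.0, 1.0])   # labels change across only one edge
rough  = np.array([0.0, 1.0, 0.0, 1.0])   # labels flip across every edge

print(smooth @ L @ smooth)   # 1.0 -- low energy, preferred
print(rough @ L @ rough)     # 3.0 -- high energy, penalised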
Graph Embedding Methods
These methods convert nodes into vector representations.
- Learn embeddings that capture graph structure and similarity.
- Nodes with similar embeddings are likely to share labels.
- Embeddings can be used with traditional classifiers or deep learning models.
- They help reduce dimensionality.
- They make graph data easier to use in downstream tasks.
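As a small sketch of this family, scikit-learn's SpectralEmbedding learns vector representations from the leading eigenvectors of the Laplacian of an affinity graph built over the data. The two synthetic clusters below are made up for illustration:

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

# Two hypothetical clusters of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(5, 0.3, (10, 2))])

# Build an RBF affinity graph internally and embed each node using
# eigenvectors of the graph Laplacian
emb = SpectralEmbedding(n_components=2, affinity='rbf', random_state=0)
Z = emb.fit_transform(X)

# Each point now has a 2-D embedding a standard classifier can consume
print(Z.shape)   # (20, 2)
```

Nodes that sit in the same region of the graph land near each other in embedding space, so any conventional classifier trained on the few labelled embeddings can label the rest.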
Implementation
Let's implement Graph-Based Semi-Supervised Learning in Python:
Step 1: Import Required Libraries
We will import the necessary libraries: NumPy and Matplotlib for data handling and plotting, scikit-learn for LabelPropagation and kneighbors_graph, and NetworkX for graph visualization.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.semi_supervised import LabelPropagation
from sklearn.neighbors import kneighbors_graph
import networkx as nx
Step 2: Create Sample Data
This block defines the input data and labels. The data forms two clusters:
- One near (2, 2): class 0
- Another near (8, 8): class 1
- X holds the feature data (each row is a data point)
- y holds the labels, where -1 marks an unlabelled sample
X = np.array([
    [1, 2],
    [1, 3],
    [2, 2],
    [2, 3],
    [3, 2],
    [4, 4],
    [5, 5],
    [6, 6],
    [8, 8],
    [8, 9],
    [9, 8],
    [9, 9]
])
y = np.array([
    0, -1, 0, -1, -1,
    -1, -1, -1,
    1, -1, 1, -1
])
Step 3: Construct Graph Using Nearest Neighbours
This block converts the dataset into a graph.
- Each data point becomes a node.
- Edges connect points to their 3 nearest neighbours.
- The adjacency matrix (A) stores connectivity information.
- NetworkX converts this into a visual graph structure.
A = kneighbors_graph(X, n_neighbors=3, mode='connectivity', include_self=False)
G = nx.from_scipy_sparse_array(A)
pos = {i: X[i] for i in range(len(X))}
Step 4: Visualize Graph Before Label Propagation
This visualization shows the initial state of the graph.
- Red nodes: class 0
- Blue nodes: class 1
- Gray nodes: unlabelled
plt.figure(figsize=(6, 6))
before_colors = []
for label in y:
    if label == 0:
        before_colors.append("red")
    elif label == 1:
        before_colors.append("blue")
    else:
        before_colors.append("gray")
nx.draw(G, pos, node_color=before_colors, with_labels=True, node_size=500)
plt.title("Before Label Propagation (Gray = Unlabeled)")
plt.show()
Output: a plot of the 3-nearest-neighbour graph, with red nodes (class 0), blue nodes (class 1) and gray nodes (unlabelled).
Step 5: Apply Label Propagation Model
This block performs semi-supervised learning.
- The algorithm uses graph connections to spread labels.
- Labelled nodes act as sources of information.
- Nearby unlabelled nodes receive labels based on similarity.
model = LabelPropagation(kernel='knn', n_neighbors=3)
model.fit(X, y)
predicted_labels = model.transduction_
print("Predicted labels:", predicted_labels)
Output: Predicted labels: [0 0 0 0 0 0 0 0 1 1 1 1]
Step 6: Visualize Graph After Label Propagation
This shows the final result after learning.
- Previously gray nodes now receive predicted labels.
- The graph visually demonstrates how labels spread across connected regions.
- Comparing before vs after makes the learning process intuitive.
plt.figure(figsize=(6,6))
after_colors = ["red" if label == 0 else "blue" for label in predicted_labels]
nx.draw(G, pos, node_color=after_colors, with_labels=True, node_size=500)
plt.title("After Label Propagation")
plt.show()
Output: the same graph after propagation, with every node coloured red (class 0) or blue (class 1).
Applications
Graph-based semi-supervised learning is widely used in areas where data points are naturally related or connected, such as:
- Text Classification: Documents are linked based on similarity, allowing labels to spread to unlabelled text such as articles or emails.
- Image Recognition: Similar images are connected in a graph, helping label images even when only a few are manually annotated.
- Social Network Analysis: Users are treated as nodes and relationships help predict interests, communities or behaviours.
- Bioinformatics: Genes or proteins are connected based on biological similarity to assist in function prediction.
- Recommendation Systems: Users and items are linked through interactions, enabling better recommendations using limited labelled data.
Limitations
Graph-Based Semi-Supervised Learning still has some limitations:
- Graph Construction: Creating graphs that correctly capture real relationships between data points is still difficult.
- Label Propagation Accuracy: Making label spreading more stable and reliable, especially with noisy data, needs more work.
- Integration with Deep Learning: Combining graph-based methods with neural networks is an active research area.
- Scalability and New Domains: Applying GSSL to very large datasets and new fields remains challenging.