Scikit-learn Library in Python

Last Updated : 28 Apr, 2026

Scikit-learn (also written as sklearn) is a popular open-source Python library for machine learning. Built on top of NumPy, SciPy and Matplotlib, it provides efficient tools for data analysis and predictive modeling. It offers simple and reusable functions to build models for tasks such as classification, regression, clustering and dimensionality reduction.

Key Features of Scikit-learn

Below are several key features of Scikit-learn that make data preparation, modeling and evaluation simple and efficient.

1. Data Preprocessing: Preparing data is an important step in any machine learning project. Scikit-learn simplifies this process with built-in tools for scaling, normalization and handling missing values.

2. Model Evaluation: Scikit-learn helps you check how well a machine learning model performs and how accurate its predictions are.

  • Metrics: Evaluate model performance (accuracy, precision, recall and F1-score).
  • Model Selection: Tools for selecting the best model hyperparameters through techniques like grid search and randomized search.

3. Pipeline Support: Combine preprocessing and modeling steps efficiently.

4. Integration: Works seamlessly with Python libraries like NumPy, Pandas and Matplotlib.

5. Ease of Use: Simple, consistent and user-friendly API for all tasks.
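
As a minimal sketch of how features 1 to 3 fit together, the following example chains an imputer, a scaler and a classifier in a Pipeline and tunes one hyperparameter with grid search. The Iris dataset, the choice of LogisticRegression, the step names and the candidate values for C are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Chain preprocessing and the model so that every step is fitted
# on the training data only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # Iris has no missing values; shown for illustration
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid search over the regularization strength C with 5-fold cross-validation.
# "clf__C" means: parameter C of the pipeline step named "clf".
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
# Precision, recall and F1-score per class on the held-out test set
print(classification_report(y_test, search.predict(X_test)))
```

Because the scaler sits inside the pipeline, it is refitted on each training fold during the search, so no information from the validation fold leaks into preprocessing.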

Installing and Importing Scikit-learn

To install Scikit-learn, use Python's package manager pip with the following command:

pip install scikit-learn

Once installed, import Scikit-learn modules into a Python script or environment using the import statement. For example:

import sklearn
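
Note that the package is installed as scikit-learn but imported as sklearn. A quick way to confirm the installation, sketched here as one possible check, is to print the library version:

```python
import sklearn

# Print the installed version to confirm that the import works.
print("scikit-learn version:", sklearn.__version__)
```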

Machine Learning Techniques Supported by Scikit-learn

1. Supervised Learning

Supervised learning involves training models using labeled data, where the correct output is already known.

Example: Logistic Regression Algorithm

Logistic Regression is a supervised machine learning algorithm used to predict categories (yes/no, spam/not spam, disease/no disease). It works by estimating the probability of each outcome and is simple, easy to interpret and effective for problems where the classes are close to linearly separable. It also extends to more than two classes, as in the three-class Iris example below.

This example uses Logistic Regression to classify flowers in the Iris dataset and check how accurately the model predicts their types.

Python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset: 150 samples, 4 features, 3 classes
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the classifier
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predict on the test set and compute the accuracy
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output

Accuracy: 1.0
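
Since Logistic Regression works by estimating probabilities, it can help to look at those estimates directly. The sketch below repeats the same setup and calls predict_proba on the first three test samples; inspecting only three samples and rounding to three decimals are choices made here for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# predict_proba returns one row per sample and one column per class;
# each row sums to 1.
proba = log_reg.predict_proba(X_test[:3])
for row in proba:
    print({str(name): round(float(p), 3) for name, p in zip(iris.target_names, row)})

# predict returns the class with the highest estimated probability.
print("Predicted classes:", log_reg.predict(X_test[:3]))
```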

2. Unsupervised Learning

Unsupervised learning works with unlabeled data to discover patterns and structure.

  • Clustering: Scikit-learn offers clustering techniques to group similar data points, including K-means clustering, DBSCAN and hierarchical clustering.
  • Dimensionality Reduction: To handle high-dimensional data efficiently, Scikit-learn provides techniques like principal component analysis (PCA).
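
As a rough illustration of dimensionality reduction, the following sketch projects the four Iris features onto two principal components. Standardizing the features first and keeping two components are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # 150 samples, 4 features

# Standardize first so that no single feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

# Keep the two directions that capture the most variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Reduced shape:", X_2d.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The explained variance ratio shows how much of the original variance each component retains, which helps when deciding how many components to keep.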

Example: KMeans Algorithm

KMeans groups data into k clusters by assigning each point to the nearest cluster center (centroid). It is an unsupervised learning algorithm used for tasks like customer segmentation, image compression and anomaly detection, especially when the underlying data structure is unknown.

This program demonstrates how to use KMeans clustering from Scikit-learn to group the Iris dataset into three clusters based on feature similarity.

Python
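from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Load the features only; KMeans does not use the class labels
iris = load_iris()
kmeans = KMeans(n_clusters=3)

# Fit the model and read the cluster assigned to each sample
kmeans.fit(iris.data)
cluster_labels = kmeans.labels_

# Cluster numbers are arbitrary and may differ between runs
print("Cluster Labels:", cluster_labels)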

Output

Cluster Labels: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
2 0]
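
Because cluster numbers are arbitrary, they cannot be compared to the true species labels with plain accuracy. One option, sketched below, is the adjusted Rand index, which scores agreement between two groupings regardless of how the groups are numbered. The random_state and n_init values are assumptions added here to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()

# random_state fixes the centroid initialization;
# n_init=10 runs the algorithm 10 times and keeps the best result.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(iris.data)

# 1.0 means identical groupings; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(iris.target, cluster_labels)
print("Adjusted Rand index:", round(ari, 3))
```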

Advantages

  • Simple and user-friendly interface for machine learning tasks.
  • Offers a wide range of algorithms for various tasks like classification, regression, clustering and more.
  • Provides tools for data preprocessing, including scaling, normalization and handling missing values.
  • Offers metrics for evaluating model performance and techniques like cross-validation for robust assessment.
  • Integrates well with other Python libraries like NumPy, Pandas and Matplotlib.
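
The cross-validation mentioned above can be sketched in a few lines. This example assumes the Iris dataset, a scaled Logistic Regression model and 5 folds, all chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling is part of the model, so it is refitted on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression())

# For a classifier and an integer cv, scikit-learn uses stratified folds.
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Averaging over several folds gives a more stable estimate than a single train/test split, at the cost of training the model once per fold.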