Working with Excel files using Pandas

Excel files store data in rows and columns, making them useful for managing structured datasets.

To work with Excel files, we use the Pandas library, which lets us read, modify and analyze Excel data as a DataFrame.
First, we install and import Pandas, then call the read_excel() function to load the Excel data into Python for processing.

In the code below, we load an Excel file named students.xlsx, which contains student data.

Python

import pandas as pd
df = pd.read_excel('students.xlsx')
print(df)

Output

Roll No. English Maths Science
0 1 19 13 17
1 2 14 20 18
2 3 15 18 19
3 4 13 14 14
4 5 17 16 20
5 6 19 13 17
6 7 14 20 18
7 8 15 18 19
8 9 13 14 14
9 10 17 16 20

Note: Reading .xlsx files requires the openpyxl package. If it is missing, install it with pip install openpyxl.

Loading Multiple Sheets using concat()

By default, read_excel() loads only the first sheet of an Excel workbook. If your file contains multiple sheets, you can read each sheet separately and then combine them into a single DataFrame using pd.concat(). The read_excel() function provides useful arguments to control how data is loaded:

sheet_name: Selects the sheet to load, given either as a name (string) or as a zero-based position (integer).
index_col: Defines the column to be used as the index (row labels) of the DataFrame.

Example: Here we read the first two sheets by position, using the first column of each as the index, and combine them into a single DataFrame with concat(). Printing the result shows the complete combined DataFrame:

Python

file = 'students.xlsx'
sheet1 = pd.read_excel(file, 
                        sheet_name = 0, 
                        index_col = 0)

sheet2 = pd.read_excel(file, 
                        sheet_name = 1, 
                        index_col = 0)

newData = pd.concat([sheet1, sheet2])
print(newData)

Output

Roll No. English Maths Science
1 19 13 17
2 14 20 18
3 15 18 19
4 13 14 14
5 17 16 20
6 19 13 17
7 14 20 18
8 15 18 19
9 13 14 14
10 17 16 20
1 14 18 20
2 11 19 18
3 12 18 16
4 15 18 19
5 13 14 14
6 14 18 20
7 11 19 18
8 12 18 16
9 15 18 19
10 13 14 14
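Because both sheets use the same roll numbers, the combined index above contains duplicate labels. As a minimal sketch of one way to handle this, sheet_name=None makes read_excel() return a dictionary of all sheets, and passing that dictionary to concat() adds the sheet name as an outer index level. The workbook, file name and data below are hypothetical stand-ins created only so the example runs on its own; writing them requires openpyxl.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in data so the sketch runs without students.xlsx.
sheet_a = pd.DataFrame({"Roll No.": [1, 2], "English": [19, 14]})
sheet_b = pd.DataFrame({"Roll No.": [1, 2], "English": [14, 11]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_students.xlsx")  # hypothetical file name

    # Write a small two-sheet workbook to read back (needs openpyxl).
    with pd.ExcelWriter(path) as writer:
        sheet_a.to_excel(writer, sheet_name="Sheet1", index=False)
        sheet_b.to_excel(writer, sheet_name="Sheet2", index=False)

    # sheet_name=None returns a dict that maps sheet names to DataFrames.
    all_sheets = pd.read_excel(path, sheet_name=None, index_col=0)

# Concatenating a dict uses its keys as an outer index level,
# so every row keeps a record of the sheet it came from.
combined = pd.concat(all_sheets, names=["Sheet", "Roll No."])
print(combined)
```

A row can then be selected by sheet and roll number together, for example combined.loc[("Sheet2", 1)].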

head() and tail() methods

The head() and tail() methods are used to quickly preview data in a DataFrame. They help you inspect the top or bottom rows without printing the entire dataset. You can pass a number inside the parentheses to specify how many rows you want to see.

head(): Displays the first 5 rows by default.
tail(): Displays the last 5 rows by default.

Python

print(newData.head())
print(newData.tail())

Output

Roll No. English Maths Science
1 19 13 17
2 14 20 18
3 15 18 19
4 13 14 14
5 17 16 20
Roll No. English Maths Science
6 14 18 20
7 11 19 18
8 12 18 16
9 15 18 19
10 13 14 14
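To illustrate passing a row count, here is a small sketch. The marks DataFrame is a hypothetical stand-in for the combined data, built in memory so the example does not depend on an Excel file.

```python
import pandas as pd

# Hypothetical marks table standing in for the combined sheet data.
marks = pd.DataFrame(
    {"English": [19, 14, 15, 13, 17, 19], "Maths": [13, 20, 18, 14, 16, 13]},
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Roll No."),
)

first_two = marks.head(2)      # first 2 rows
last_three = marks.tail(3)     # last 3 rows
all_but_last = marks.head(-1)  # a negative n drops that many rows from the end

print(first_two)
print(last_three)
```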

shape attribute

The shape attribute is used to check the dimensions of a DataFrame. It is an attribute, not a method, so it is accessed without parentheses. It returns a tuple showing the total number of rows and columns.

The first value represents the number of rows.
The second value represents the number of columns.

Python

newData.shape

Output

(20, 3)
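The result shows 3 columns rather than 4 because the index (here Roll No.) is not counted as a column. The sketch below, using a hypothetical in-memory table, unpacks the tuple and shows how the count changes once the index is turned back into a regular column.

```python
import pandas as pd

# Hypothetical marks table with "Roll No." as the index.
marks = pd.DataFrame(
    {"English": [19, 14, 15, 13], "Maths": [13, 20, 18, 14], "Science": [17, 18, 19, 14]},
    index=pd.Index([1, 2, 3, 4], name="Roll No."),
)

rows, cols = marks.shape  # unpack the (rows, columns) tuple
print(f"{rows} rows, {cols} columns")

# reset_index() moves the index into a regular column, adding one to the column count.
flat_shape = marks.reset_index().shape
print(flat_shape)
```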

sort_values() method

The sort_values() method sorts a DataFrame by the values of one or more columns and returns a new, sorted DataFrame. It is especially useful when working with numerical data, but it can also sort text data.

By default, it sorts values in ascending order.
To sort in descending order, use ascending=False.

Python

sorted_column = newData.sort_values(['English'], ascending = False)

Now suppose we want the 5 rows with the highest English marks. We can call head() on the sorted DataFrame:

Python

sorted_column.head(5)

Output

Roll No. English Maths Science
1 19 13 17
6 19 13 17
5 17 16 20
10 17 16 20
3 15 18 19

head() also works on a single column. Selecting a column returns a Series, which can be previewed in the same way:

Python

newData['Maths'].head()

Output

Roll No.
1 13
2 20
3 18
4 14
5 16
Name: Maths, dtype: int64
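Sorting is not limited to a single column. As a small sketch with a hypothetical in-memory table, the example below sorts by English in descending order and breaks ties using Maths in ascending order by passing one flag per column to ascending.

```python
import pandas as pd

# Hypothetical marks table; two students share the top English mark.
marks = pd.DataFrame(
    {"English": [19, 14, 19, 13], "Maths": [13, 20, 16, 14]},
    index=pd.Index([1, 2, 3, 4], name="Roll No."),
)

# English descending first; ties broken by Maths ascending.
by_two = marks.sort_values(["English", "Maths"], ascending=[False, True])
print(by_two)

# sort_values() returns a new DataFrame; the original order is unchanged.
print(marks.index.tolist())
```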

describe() method

When your dataset contains numerical data, the describe() method provides a quick statistical summary of the DataFrame. It includes the count (number of non-null values), mean, standard deviation, minimum and maximum values, and percentiles (25%, 50%, 75%).

Python

newData.describe()

Output

English Maths Science
count 20.00000 20.000000 20.000000
mean 14.30000 16.800000 17.500000
std 2.29645 2.330575 2.164304
min 11.00000 13.000000 14.000000
25% 13.00000 14.000000 16.000000
50% 14.00000 18.000000 18.000000
75% 15.00000 18.000000 19.000000
max 19.00000 20.000000 20.000000
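The summary returned by describe() is itself a DataFrame, so individual statistics can be looked up by row and column label. A minimal sketch, using hypothetical in-memory data:

```python
import pandas as pd

# Hypothetical marks table.
marks = pd.DataFrame(
    {"English": [19, 14, 15, 13, 17], "Maths": [13, 20, 18, 14, 16]}
)

stats = marks.describe()

# Rows are statistic names, columns are the original column names.
english_median = stats.loc["50%", "English"]
maths_max = stats.loc["max", "Maths"]
print(english_median, maths_max)
```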

Pandas also provides individual statistical methods such as mean(), sum(), min() and max() to calculate specific values. They can be called on a single column, as in the following command, or on the whole DataFrame to get one result per numerical column:

Python

newData['English'].mean()

Output

np.float64(14.3)
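The same methods can be applied to every numerical column at once. The sketch below uses a hypothetical in-memory table and also shows agg(), which computes several statistics in one call, and float(), which converts the NumPy scalar to a plain Python number.

```python
import pandas as pd

# Hypothetical marks table.
marks = pd.DataFrame(
    {"English": [19, 14, 15], "Maths": [13, 20, 18], "Science": [17, 18, 19]}
)

means = marks.mean()                       # one mean per column, as a Series
summary = marks.agg(["min", "max", "sum"])  # several statistics at once
english_mean = float(marks["English"].mean())  # plain Python float

print(means)
print(summary)
print(english_mean)
```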

You can also create calculated columns, just like Excel formulas, by performing operations on existing columns.

Python

newData['Total Marks'] = newData["English"] + newData["Maths"] + newData["Science"]
newData['Total Marks'].head()

Output

Roll No.
1 49
2 52
3 52
4 41
5 53
Name: Total Marks, dtype: int64
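When there are many subject columns, adding them one by one becomes tedious. As a sketch of an alternative, sum(axis=1) and mean(axis=1) operate across each row. The subjects list and the Average column below are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical marks table.
marks = pd.DataFrame(
    {"English": [19, 14], "Maths": [13, 20], "Science": [17, 18]},
    index=pd.Index([1, 2], name="Roll No."),
)

# Fix the subject list first so new columns are not included in later calculations.
subjects = ["English", "Maths", "Science"]

marks["Total Marks"] = marks[subjects].sum(axis=1)        # row-wise total
marks["Average"] = marks[subjects].mean(axis=1).round(2)  # row-wise mean

print(marks)
```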

After working with the data in the DataFrame, we can export it back to an Excel file using the to_excel() method. We pass the name of the output Excel file where the transformed data should be written, as shown below:

Python

newData.to_excel('Output File.xlsx')

Output

It creates a new Excel file if one doesn’t exist.
It overwrites the file if one with the same name already exists.
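To check an export, the file can be read straight back in. The following is a minimal round-trip sketch under a few assumptions: the data, the file name marks_out.xlsx and the sheet name Marks are hypothetical, the file is written to a temporary folder, and openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical marks table with "Roll No." as the index.
marks = pd.DataFrame(
    {"English": [19, 14], "Maths": [13, 20]},
    index=pd.Index([1, 2], name="Roll No."),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "marks_out.xlsx")  # hypothetical file name

    # sheet_name sets the worksheet name; the index is written as the first column.
    marks.to_excel(path, sheet_name="Marks")

    # Read the same sheet back, restoring the first column as the index.
    restored = pd.read_excel(path, sheet_name="Marks", index_col=0)

print(restored)
```

Passing index=False to to_excel() leaves the index out of the file, which is useful when the index is just the default 0, 1, 2, ... numbering.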
