Chapter 03 - Data Handling with Pandas II

CBSE Class 12 Informatics Practices

3.1 Introduction

3.1.1 Context from Chapter 02 (NCERT Continuity)

NCERT begins this chapter by explicitly linking it with Chapter 02: Data Handling with Pandas I.

In the previous chapter, you learnt:

What Pandas is
How to work with:
- Series
- DataFrame
How to:
- Create DataFrames
- Access rows and columns
- Import data from CSV files
- Export data back to CSV

At that stage, the focus was mainly on:

Creating and viewing data
Basic manipulation

However, real-world data handling does not stop there.

3.1.2 Why Chapter 03 is Needed (NCERT Motivation)

NCERT now shifts the focus from basic data handling to data analysis.

In practical situations:

Data is often very large
Raw data by itself is not useful
We need to:
- Summarise data
- Identify patterns
- Compare groups
- Prepare reports

For example:

What is the average score of students?
Which city has the highest sales?
How many records belong to each category?
Which values are missing or incorrect?

Such questions require advanced DataFrame operations, which are introduced in this chapter.

3.1.3 Nature of Data Handled in This Chapter

NCERT clarifies that this chapter deals with:

Structured data
Data stored in rows and columns
Data that can be:
- Numeric
- Categorical
- Textual

The emphasis is on analysing data, not merely displaying it.

3.1.4 Major Concepts Introduced in This Chapter

NCERT outlines that this chapter will focus on the following capabilities of Pandas:

Calculating descriptive statistics
Performing data aggregation
Sorting data based on:
- Values
- Index
Grouping data using GROUP BY
Modifying the index
Performing additional DataFrame operations
Handling missing data
Exchanging data between:
- Pandas
- MySQL database

Each of these topics builds on the DataFrame knowledge gained earlier.

3.1.5 Importance of This Chapter in Data Analysis

NCERT implicitly highlights that:

This chapter introduces operations used in:
- Business analytics
- Data science
- Research
The techniques taught here are:
- Practical
- Frequently used
- Exam-relevant

Students are expected to:

Apply logical thinking
Interpret results, not just write code
Understand the meaning of outputs

3.1.6 How This Chapter Differs from Chapter 02

Chapter 02	Chapter 03
Creating data	Analysing data
Structure of DataFrame	Behaviour of DataFrame
Reading / writing data	Summarising / grouping data
Basic operations	Advanced operations

This distinction is important for conceptual clarity.

3.1.7 Learning Outcome of Section 3.1

By the end of this introductory section, NCERT expects students to:

Understand the scope of the chapter
Recognise the shift from data creation to data analysis
Appreciate the importance of Pandas in real-world data handling
Be prepared for advanced operations on DataFrames

Section 3.1 Completed

We have now fully and sequentially covered:

Why Chapter 03 exists
How it builds on Chapter 02
What types of problems it addresses
What topics will be studied next

3.2 Descriptive Statistics

3.2.1 What are Descriptive Statistics? (NCERT Explanation)

NCERT introduces descriptive statistics as methods used to:

Summarise large datasets
Describe the central tendency
Measure the spread or variability of data

Instead of examining every value individually, descriptive statistics provide:

Single numerical values
Quick understanding of data behaviour

This is essential when working with large DataFrames.

3.2.2 Sample Data Used for Explanation

NCERT explains these concepts using a DataFrame.

We will use a similar structure.

import pandas as pd

data = {
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
    'Maths': [85, 90, 78, 92, 88],
    'Science': [80, 85, 75, 89, 84],
    'English': [78, 82, 74, 88, 80]
}

df = pd.DataFrame(data)
print(df)

Output

    Name  Maths  Science  English
0   Amit     85       80       78
1   Neha     90       85       82
2  Rahul     78       75       74
3  Pooja     92       89       88
4  Karan     88       84       80

3.2.3 The `describe()` Function (Core NCERT Method)

NCERT introduces describe() as the most important descriptive statistics function.

print(df.describe())

Output

           Maths    Science    English
count   5.000000   5.000000   5.000000
mean   86.600000  82.600000  80.400000
std     5.941380   5.941380   5.263079
min    78.000000  75.000000  74.000000
25%    85.000000  80.000000  78.000000
50%    88.000000  84.000000  80.000000
75%    90.000000  85.000000  82.000000
max    92.000000  89.000000  88.000000

3.2.4 Interpretation of `describe()` Output (Very Important)

NCERT expects students to understand, not memorise, these values.

count → Number of non-null values
mean → Average value
std → Standard deviation (spread of data)
min → Minimum value
25% → First quartile
50% → Median
75% → Third quartile
max → Maximum value

This output gives a complete statistical summary.

3.2.5 Calculating Mean Using `mean()`

Instead of using describe(), NCERT shows individual functions.

print(df['Maths'].mean())

Output:

86.6

3.2.6 Calculating Median Using `median()`

print(df['Science'].median())

Output:

84.0

Explanation:

Median is the middle value after sorting

3.2.7 Calculating Mode Using `mode()`

NCERT explains that:

Mode is the most frequently occurring value
A dataset can have:
- One mode
- Multiple modes
- No mode
```
print(df['English'].mode())
```

Output:

0    78
1    82
2    74
3    88
4    80
dtype: int64

Explanation:

All values appear only once
Hence, Pandas returns all values

3.2.8 Calculating Minimum and Maximum Values

print(df['Maths'].min())
print(df['Maths'].max())

Output:

78
92

These indicate:

Lowest and highest scores

3.2.9 Calculating Standard Deviation Using `std()`

print(df['Science'].std())

Output:

5.941380

Explanation:

Measures dispersion
Higher value → more spread

NCERT does not require mathematical formula, only interpretation.

3.2.10 Calculating Variance Using `var()`

print(df['English'].var())

Output:

27.7

Variance is the square of standard deviation.

3.2.11 Counting Values Using `count()`

print(df['Maths'].count())

Output:

Counts non-null values only.

3.2.12 Descriptive Statistics Across Rows

NCERT mentions that:

Functions operate column-wise by default

To operate row-wise, use axis=1

print(df[['Maths', 'Science', 'English']].mean(axis=1))

Output:

0    81.0
1    85.7
2    75.7
3    89.7
4    84.0
dtype: float64

This gives:

Average marks per student

3.2.13 Descriptive Statistics on Entire DataFrame

print(df.mean(numeric_only=True))

Output:

Maths      86.6
Science    82.6
English    80.4
dtype: float64

Explanation:

Only numeric columns are considered

3.2.14 Handling Missing Values in Statistics (NCERT Concept)

If a column contains NaN:

Descriptive functions ignore NaN by default

Example:

df.loc[2, 'Science'] = None
print(df['Science'].mean())

Explanation:

Mean is calculated using remaining values
Missing values do not cause errors

3.2.15 Key Learning Outcomes of Section 3.2

NCERT expects students to be able to:

Define descriptive statistics
Use describe() to summarise data
Calculate:
- Mean
- Median
- Mode
- Min, Max
- Standard deviation
- Variance
Interpret statistical results
Apply statistics row-wise and column-wise

Section 3.2 Completed

We have now fully and sequentially covered:

Concept of descriptive statistics
describe() function
Individual statistical functions
Interpretation of results
NCERT-aligned examples with code

3.3 Data Aggregations

3.3.1 Meaning of Data Aggregation (NCERT Explanation)

NCERT introduces data aggregation as the process of:

Combining multiple data values
Producing a single summary value
Performing operations such as:
- Sum
- Average
- Minimum
- Maximum
- Count

In simple words:

Data aggregation means summarising data to extract meaningful information.

Aggregation is a core activity in data analysis, especially when dealing with large datasets.

3.3.2 Difference Between Descriptive Statistics and Aggregation

NCERT implicitly distinguishes the two:

Descriptive statistics → Describe data distribution (mean, median, std, etc.)
Data aggregation → Combine values to produce summaries (sum, count, min, max)

Both are related, but aggregation focuses more on combining data.

3.3.3 Sample DataFrame Used for Aggregation

We continue with a marks dataset, similar to NCERT examples.

import pandas as pd

data = {
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
    'Maths': [85, 90, 78, 92, 88],
    'Science': [80, 85, 75, 89, 84],
    'English': [78, 82, 74, 88, 80]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Maths  Science  English
0   Amit     85       80       78
1   Neha     90       85       82
2  Rahul     78       75       74
3  Pooja     92       89       88
4  Karan     88       84       80

3.3.4 Using `sum()` for Aggregation

Column-wise sum (default behaviour)

print(df[['Maths', 'Science', 'English']].sum())

Output:

Maths      433
Science    413
English    402
dtype: int64

Explanation:

Values are added column-wise
Default axis = 0 (rows)

Row-wise sum

print(df[['Maths', 'Science', 'English']].sum(axis=1))

Output:

0    243
1    257
2    227
3    269
4    252
dtype: int64

Explanation:

Each student’s total marks are calculated
axis=1 means across columns

3.3.5 Using `mean()` for Aggregation

print(df[['Maths', 'Science', 'English']].mean())

Output:

Maths      86.6
Science    82.6
English    80.4
dtype: float64

Explanation:

Returns average marks per subject

3.3.6 Using `min()` and `max()`

print(df[['Maths', 'Science', 'English']].min())
print(df[['Maths', 'Science', 'English']].max())

Output:

Maths      78
Science    75
English    74
dtype: int64

Maths      92
Science    89
English    88
dtype: int64

Explanation:

Helps identify:
- Lowest score
- Highest score

3.3.7 Using `count()` in Aggregation

print(df[['Maths', 'Science', 'English']].count())

Output:

Maths      5
Science    5
English    5
dtype: int64

Explanation:

Counts non-null values
Useful when data may have missing values

3.3.8 Aggregation Using `agg()` Function (Very Important)

NCERT introduces agg() as a powerful aggregation method.

It allows:

Applying multiple aggregation functions at once

Example: Multiple aggregations on columns

print(df[['Maths', 'Science', 'English']].agg(['sum', 'mean', 'min', 'max']))

Output:

        Maths  Science  English
sum       433      413      402
mean     86.6     82.6     80.4
min        78       75       74
max        92       89       88

Explanation

Rows represent aggregation functions
Columns represent subjects
This gives a compact statistical summary

3.3.9 Different Aggregations for Different Columns

NCERT also highlights that:

Different columns may need different aggregations

print(df.agg({
'Maths': ['mean', 'max'],
'Science': ['min', 'mean'],
'English': ['sum']
}))

Output:

         Maths  Science  English
mean     86.6     82.6      NaN
max      92.0      NaN      NaN
min       NaN     75.0      NaN
sum       NaN      NaN    402.0

Explanation:

Each column has its own aggregation logic
Missing combinations appear as NaN

3.3.10 Aggregation on Entire DataFrame

print(df.agg(['sum', 'mean'], numeric_only=True))

Explanation:

Only numeric columns are aggregated
Text columns are ignored

3.3.11 Aggregation with Missing Values (NCERT Concept)

NCERT notes that:

Aggregation functions ignore NaN by default

Example:

df.loc[1, 'Science'] = None
print(df['Science'].mean())

Explanation:

Mean is computed using remaining values
No error is raised

3.3.12 Key Differences: `describe()` vs `agg()`

describe()	agg()
Fixed set of statistics	Customisable
Automatic summary	User-defined
Quick overview	Detailed control

NCERT expects students to know when to use each.

3.3.13 Key Learning Outcomes of Section 3.3

By the end of this section, NCERT expects students to:

Understand data aggregation
Use aggregation functions:
- sum, mean, min, max, count
Apply aggregation row-wise and column-wise
Use agg() for:
- Multiple aggregations
- Column-specific aggregations
Handle missing values during aggregation

Section 3.3 Completed

We have now fully and sequentially covered:

Concept of data aggregation
Built-in aggregation functions
Axis-based aggregation
agg() function with examples
NCERT-aligned interpretation

3.4 Sorting a DataFrame

3.4.1 Meaning of Sorting (NCERT Explanation)

NCERT introduces sorting as the process of:

Arranging data in a specific order
Based on:
- Values in one or more columns, or
- Index labels

Sorting helps to:

Analyse trends
Identify highest or lowest values
Present data in a meaningful order

Pandas provides built-in functions to sort a DataFrame efficiently.

3.4.2 Sample DataFrame Used for Sorting

We will continue with a familiar dataset.

import pandas as pd

data = {
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
    'Maths': [85, 90, 78, 92, 88],
    'Science': [80, 85, 75, 89, 84],
    'English': [78, 82, 74, 88, 80]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Maths  Science  English
0   Amit     85       80       78
1   Neha     90       85       82
2  Rahul     78       75       74
3  Pooja     92       89       88
4  Karan     88       84       80

3.4.3 Sorting a DataFrame by Index

NCERT first introduces sorting by index.

Using `sort_index()`

df_sorted_index = df.sort_index()
print(df_sorted_index)

Explanation:

Default sorting is ascending
Since the index is already in order (0,1,2,3,4), output remains unchanged

3.4.4 Sorting Index in Descending Order

df_sorted_index_desc = df.sort_index(ascending=False)
print(df_sorted_index_desc)

Output:

    Name  Maths  Science  English
4  Karan     88       84       80
3  Pooja     92       89       88
2  Rahul     78       75       74
1   Neha     90       85       82
0   Amit     85       80       78

Explanation:

Rows are rearranged based on index labels
Highest index appears first

3.4.5 Sorting a DataFrame by Column Values

NCERT now introduces sorting by column values, which is more common.

Using `sort_values()`

df_sorted_marks = df.sort_values(by='Maths')
print(df_sorted_marks)

Output:

    Name  Maths  Science  English
2  Rahul     78       75       74
0   Amit     85       80       78
4  Karan     88       84       80
1   Neha     90       85       82
3  Pooja     92       89       88

Explanation:

Rows are sorted based on Maths marks
Default order is ascending

3.4.6 Sorting by Column in Descending Order

df_sorted_desc = df.sort_values(by='Maths', ascending=False)
print(df_sorted_desc)

Output:

    Name  Maths  Science  English
3  Pooja     92       89       88
1   Neha     90       85       82
4  Karan     88       84       80
0   Amit     85       80       78
2  Rahul     78       75       74

Explanation:

Useful for:
- Ranking
- Finding top performers

3.4.7 Sorting by Multiple Columns

NCERT explains that sometimes sorting by one column is not sufficient.

Example scenario:

Sort by Maths marks

If Maths marks are equal, sort by Science marks

df_sorted_multi = df.sort_values(by=['Maths', 'Science'])
print(df_sorted_multi)

Explanation:

First sorting key: Maths
Second sorting key: Science

3.4.8 Sorting with Different Orders for Different Columns

df_sorted_multi_order = df.sort_values(
    by=['Maths', 'Science'],
    ascending=[False, True]
)
print(df_sorted_multi_order)

Explanation:

Maths sorted in descending order
Science sorted in ascending order

This gives fine control over sorting behaviour.

3.4.9 Sorting Based on String Columns

NCERT clarifies that:

Sorting also works on text data

df_sorted_name = df.sort_values(by='Name')
print(df_sorted_name)

Output:

    Name  Maths  Science  English
0   Amit     85       80       78
4  Karan     88       84       80
1   Neha     90       85       82
3  Pooja     92       89       88
2  Rahul     78       75       74

Explanation:

Sorting is done in alphabetical order

3.4.10 In-place Sorting (Modifying Original DataFrame)

NCERT introduces inplace=True.

df.sort_values(by='Maths', inplace=True)
print(df)

Explanation:

Original DataFrame is modified
No new DataFrame is created
Use carefully, as changes are permanent

3.4.11 Sorting with Missing Values (NCERT Concept)

If a column contains NaN:

df.loc[2, 'Maths'] = None
print(df.sort_values(by='Maths'))

Explanation:

By default, NaN values appear at the end
Pandas handles missing values automatically

3.4.12 Difference Between `sort_index()` and `sort_values()`

sort_index()	sort_values()
Sorts by index labels	Sorts by column values
Index-based	Data-based
Used for row order	Used for ranking and analysis

NCERT expects students to clearly distinguish these two.

3.4.13 Key Learning Outcomes of Section 3.4

By the end of this section, NCERT expects students to:

Sort DataFrames by:
- Index
- Column values
Sort in:
- Ascending order
- Descending order
Sort by:
- Single column
- Multiple columns
Use inplace option correctly
Understand how missing values affect sorting

Section 3.4 Completed

We have now fully and sequentially covered:

Concept of sorting
Sorting by index
Sorting by values
Multiple-column sorting
Ascending/descending order
NCERT-aligned examples

3.5 GROUP BY Functions

3.5.1 Why GROUP BY is Required in Pandas (NCERT Motivation)

NCERT introduces GROUP BY in Pandas for the same reason it exists in SQL.

So far, we have:

Calculated statistics on entire columns
Aggregated data without categories

However, real-world questions are usually group-based, such as:

Average marks class-wise
Total sales city-wise
Number of students stream-wise

To answer such questions, data must be:

Divided into groups
Aggregated within each group

This is exactly what GROUP BY functions do.

3.5.2 Concept of “Split–Apply–Combine” (NCERT Explanation)

NCERT explains GROUP BY using the idea of:

Split → Apply → Combine

Split Data is divided into groups based on a column
Apply An aggregation function is applied to each group
Combine Results are combined into a new DataFrame or Series

Understanding this concept is more important than memorising syntax.

3.5.3 Sample DataFrame Used for GROUP BY

We now use a slightly richer dataset, similar to NCERT’s examples.

import pandas as pd

data = {
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan', 'Riya'],
    'Class': ['XII', 'XII', 'XI', 'XI', 'XII', 'XI'],
    'Subject': ['Maths', 'Maths', 'Maths', 'Science', 'Science', 'Science'],
    'Marks': [85, 90, 78, 88, 92, 80]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name Class  Subject   Marks
0   Amit   XII    Maths      85
1   Neha   XII    Maths      90
2  Rahul    XI    Maths      78
3  Pooja    XI  Science      88
4  Karan   XII  Science      92
5   Riya    XI  Science      80

3.5.4 Creating a GROUP BY Object

NCERT shows that groupby() does not immediately compute results.

grp = df.groupby('Class')
print(grp)

Explanation:

This creates a GroupBy object
No output is displayed yet
Actual computation happens only when an aggregation function is applied

3.5.5 GROUP BY with `mean()`

Example: Average marks class-wise

print(df.groupby('Class')['Marks'].mean())

Output:

Class
XI     82.0
XII    89.0
Name: Marks, dtype: float64

Explanation

Data is split into two groups: XI and XII
Mean is calculated separately for each group
Result is a Series indexed by Class

This is conceptually identical to SQL GROUP BY Class.

3.5.6 GROUP BY with `sum()`

Example: Total marks class-wise

print(df.groupby('Class')['Marks'].sum())

Output:

Class
XI     246
XII    267
Name: Marks, dtype: int64

3.5.7 GROUP BY with `count()`

Example: Number of students in each class

print(df.groupby('Class')['Name'].count())

Output:

Class
XI     3
XII    3
Name: Name, dtype: int64

Explanation:

Counts non-null values
Equivalent to SQL COUNT(*)

3.5.8 GROUP BY on Multiple Columns

NCERT highlights that:

Grouping can be done on more than one column

Example: Average marks class-wise and subject-wise

print(df.groupby(['Class', 'Subject'])['Marks'].mean())

Output:

Class  Subject
XI     Maths       78.0
       Science     84.0
XII    Maths       87.5
       Science     92.0
Name: Marks, dtype: float64

Explanation

Data is first grouped by Class
Then sub-grouped by Subject
Result uses a MultiIndex

This is very important conceptually.

3.5.9 Using `agg()` with GROUP BY

NCERT shows that multiple aggregations can be applied to groups.

Example: Multiple aggregations on Marks

print(df.groupby('Class')['Marks'].agg(['mean', 'max', 'min']))

Output:

       mean  max  min
Class                
XI     82.0   88   78
XII    89.0   92   85

Explanation

Each group produces multiple summary values
Output is a DataFrame

3.5.10 GROUP BY on Entire DataFrame

NCERT also allows aggregation without selecting a specific column.

print(df.groupby('Class').mean(numeric_only=True))

Explanation:

Aggregation applied to all numeric columns
Non-numeric columns are ignored

3.5.11 Iterating Through Groups (Conceptual Awareness)

NCERT briefly introduces the idea that groups can be iterated.

for name, group in df.groupby('Class'):
    print("Group:", name)
    print(group)

Explanation:

name → group label
group → DataFrame for that group

This helps in understanding how grouping works internally.

3.5.12 Difference Between Pandas GROUP BY and SQL GROUP BY

NCERT expects students to relate this with SQL knowledge.

Pandas GROUP BY	SQL GROUP BY
groupby() function	GROUP BY clause
Works on DataFrame	Works on table
Split–Apply–Combine	Group–Aggregate
Supports multiple aggregations easily	More rigid syntax

3.5.13 Key Learning Outcomes of Section 3.5

By the end of this section, NCERT expects students to:

Understand the concept of grouping
Use groupby() on one or more columns
Apply aggregation functions to groups
Use agg() with GROUP BY
Interpret grouped results
Relate Pandas GROUP BY to SQL GROUP BY

Section 3.5 Completed

We have now fully and sequentially covered:

GROUP BY concept
Split–Apply–Combine
Single-column grouping
Multi-column grouping
Aggregations with GROUP BY
NCERT-aligned examples

3.6 Altering the Index

3.6.1 Meaning of Index in a DataFrame (NCERT Context)

NCERT reminds students that:

Every Pandas DataFrame has an index
The index is used to:
- Label rows
- Identify data
- Access records efficiently

By default:

Pandas assigns an integer index starting from 0

However, in real datasets:

A meaningful column (like RollNo, EmpID, Date) is often better suited as an index

This section explains how to modify the index of a DataFrame.

3.6.2 Sample DataFrame Used for Index Operations

import pandas as pd

data = {
    'RollNo': [101, 102, 103, 104],
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja'],
    'Marks': [85, 90, 78, 92]
}

df = pd.DataFrame(data)
print(df)

Output:

   RollNo   Name  Marks
0     101   Amit     85
1     102   Neha     90
2     103  Rahul     78
3     104  Pooja     92

Here:

Index is 0, 1, 2, 3
RollNo is just a normal column

3.6.3 Setting a Column as Index using `set_index()`

NCERT introduces set_index() to make a column the index.

Example: Set RollNo as index

df1 = df.set_index('RollNo')
print(df1)

Output:

        Name  Marks
RollNo             
101     Amit     85
102     Neha     90
103    Rahul     78
104    Pooja     92

Explanation

RollNo column is removed from data
It becomes the row index
Index values are now meaningful identifiers

This is very common in real datasets.

3.6.4 Setting Index Without Losing the Column

NCERT explains that sometimes:

We want to keep the column as well

df2 = df.set_index('RollNo', drop=False)
print(df2)

Explanation:

drop=False keeps the column
RollNo exists both as:
- Column
- Index

3.6.5 Resetting the Index using `reset_index()`

NCERT now explains how to restore the default index.

df_reset = df1.reset_index()
print(df_reset)

Output:

   RollNo   Name  Marks
0     101   Amit     85
1     102   Neha     90
2     103  Rahul     78
3     104  Pooja     92

Explanation

Index is converted back into a column
Default integer index is restored

3.6.6 Resetting Index and Dropping Old Index

If we do not want the old index as a column:

df_reset_drop = df1.reset_index(drop=True)
print(df_reset_drop)

Explanation:

Old index values are discarded
Clean new index is created

3.6.7 Renaming Index Labels

NCERT shows that index labels can be renamed.

Example: Rename index values

df1.index = ['S1', 'S2', 'S3', 'S4']
print(df1)

Output:

     Name  Marks
S1   Amit     85
S2   Neha     90
S3  Rahul     78
S4  Pooja     92

Explanation

Index values are directly reassigned
Length must match number of rows

3.6.8 Naming the Index (Index Name Attribute)

NCERT introduces index.name.

df1.index.name = 'StudentID'
print(df1)

Output shows:

           Name  Marks
StudentID             
S1         Amit     85
S2         Neha     90
S3        Rahul     78
S4        Pooja     92

Explanation:

Index itself now has a name
Useful when exporting or displaying data

Although not strictly index values, NCERT connects this logically.

df1.columns = ['StudentName', 'TotalMarks']
print(df1)

Explanation:

Column labels can also be modified
Improves readability and clarity

3.6.10 Setting Multiple Columns as Index (Conceptual Awareness)

NCERT briefly mentions that:

More than one column can be used as index

df_multi = df.set_index(['RollNo', 'Name'])
print(df_multi)

Explanation:

Creates a MultiIndex
Useful for hierarchical data
Detailed usage is beyond this level, but concept is important

3.6.11 Why Index Alteration is Important (NCERT Intent)

NCERT wants students to understand that:

Index controls:
- How data is accessed
- How data is aligned
- How data is grouped
Proper indexing:
- Improves readability
- Makes analysis easier
- Helps in merging and grouping

3.6.12 Key Learning Outcomes of Section 3.6

By the end of this section, NCERT expects students to:

Understand the role of index in a DataFrame
Use set_index() to change index
Use reset_index() to restore default index
Rename index labels
Name the index
Understand the idea of multi-level index

3.7 Other DataFrame Operations

3.7.1 Purpose of This Section (NCERT Context)

NCERT introduces this section to familiarise students with day-to-day operations performed on a DataFrame, such as:

Adding rows and columns
Deleting rows and columns
Applying functions on data
Selecting data based on conditions

These operations are essential for:

Cleaning data
Preparing data for analysis
Modifying existing datasets

3.7.2 Sample DataFrame Used

We continue with a student dataset.

import pandas as pd

data = {
    'RollNo': [101, 102, 103, 104],
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja'],
    'Marks': [85, 90, 78, 92]
}

df = pd.DataFrame(data)
print(df)

Output:

   RollNo   Name  Marks
0     101   Amit     85
1     102   Neha     90
2     103  Rahul     78
3     104  Pooja     92

3.7.3 Adding a New Column to a DataFrame

NCERT explains that a new column can be added by simple assignment.

Example: Add Grade column

df['Grade'] = ['B', 'A', 'C', 'A']
print(df)

Output:

   RollNo   Name  Marks Grade
0     101   Amit     85     B
1     102   Neha     90     A
2     103  Rahul     78     C
3     104  Pooja     92     A

Explanation:

Column is added to the right
Length of data must match number of rows

3.7.4 Adding a Column Using Computation

df['MarksPlus5'] = df['Marks'] + 5
print(df)

Explanation:

Vectorised operation
No loop required

3.7.5 Deleting a Column using `drop()`

NCERT introduces drop() for removal.

df1 = df.drop('MarksPlus5', axis=1)
print(df1)

Explanation:

axis=1 → column
Original DataFrame remains unchanged unless inplace=True is used

3.7.6 Deleting a Column Permanently

df.drop('Grade', axis=1, inplace=True)
print(df)

Explanation:

Column removed permanently from df

3.7.7 Adding a New Row to a DataFrame

NCERT introduces row addition using loc.

Example: Add a new student

df.loc[4] = [105, 'Karan', 88]
print(df)

Output:

   RollNo   Name  Marks
0     101   Amit     85
1     102   Neha     90
2     103  Rahul     78
3     104  Pooja     92
4     105  Karan     88

Explanation:

Index 4 did not exist earlier
Pandas adds a new row

3.7.8 Deleting Rows from a DataFrame

Delete row by index

df2 = df.drop(2)
print(df2)

Explanation:

Row with index 2 (Rahul) is removed
Original DataFrame is unchanged

3.7.9 Deleting Multiple Rows

df3 = df.drop([0, 3])
print(df3)

Explanation:

Rows with index 0 and 3 are removed

3.7.10 Selecting Data Based on Condition (Filtering)

NCERT introduces conditional selection.

Example: Students scoring more than 85

high_scorers = df[df['Marks'] > 85]
print(high_scorers)

Output:

   RollNo   Name  Marks
1     102   Neha     90
3     104  Pooja     92
4     105  Karan     88

Explanation:

Boolean condition inside square brackets
Very common in data analysis

3.7.11 Selecting Specific Columns with Condition

print(df.loc[df['Marks'] > 85, ['Name', 'Marks']])

Explanation:

Combines row filtering with column selection

3.7.12 Using `apply()` on a DataFrame Column

NCERT introduces apply() to apply a function to each element.

Example: Assign Pass/Fail status

def result(m):
    if m >= 40:
        return 'Pass'
    else:
        return 'Fail'

df['Result'] = df['Marks'].apply(result)
print(df)

Explanation:

Function applied to each value in Marks
New column created based on logic

3.7.13 Using Lambda Function with `apply()`

df['Status'] = df['Marks'].apply(lambda x: 'Distinction' if x >= 90 else 'Normal')
print(df)

Explanation:

Lambda functions are short, anonymous functions
Commonly used with apply()

3.7.14 Replacing Values in a DataFrame

NCERT shows the use of replace().

df['Status'] = df['Status'].replace('Normal', 'Regular')
print(df)

Explanation:

Useful for cleaning categorical data

3.7.15 Checking DataFrame Information

print(df.info())

Explanation:

Displays:
- Column names
- Data types
- Non-null counts
Very useful for quick inspection

3.7.16 Checking for Duplicate Rows (Conceptual)

NCERT briefly mentions duplicate detection.

print(df.duplicated())

Explanation:

Returns True/False for each row
Duplicate handling is part of data cleaning

3.7.17 Key Learning Outcomes of Section 3.7

By the end of this section, NCERT expects students to:

Add and delete rows and columns
Modify data using operations
Filter DataFrame based on conditions
Use apply() and lambda functions
Replace values
Inspect DataFrame structure

3.8 Handling Missing Values

3.8.1 What are Missing Values? (NCERT Explanation)

NCERT explains that missing values occur when:

Data is not available for some records
Data is lost during collection or transfer
Certain fields are left blank intentionally or unintentionally

In datasets, missing values are common and cannot be ignored, because they can:

Affect calculations
Produce incorrect analysis
Lead to wrong conclusions

3.8.2 Representation of Missing Values in Pandas

NCERT introduces the special value:

NaN

NaN stands for Not a Number
Pandas uses NaN to represent missing data
It applies to both:
- Numeric data
- Non-numeric data

3.8.3 Sample DataFrame with Missing Values

We use a dataset similar to NCERT examples.

import pandas as pd
import numpy as np

data = {
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
    'Maths': [85, 90, np.nan, 92, 88],
    'Science': [80, np.nan, 75, 89, 84],
    'English': [78, 82, 74, np.nan, 80]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Maths  Science  English
0   Amit   85.0     80.0     78.0
1   Neha   90.0      NaN     82.0
2  Rahul    NaN     75.0     74.0
3  Pooja   92.0     89.0      NaN
4  Karan   88.0     84.0     80.0

3.8.4 Identifying Missing Values using `isnull()`

NCERT introduces isnull() to detect missing values.

print(df.isnull())

Output:

    Name  Maths  Science  English
0  False  False    False    False
1  False  False     True    False
2  False   True    False    False
3  False  False    False     True
4  False  False    False    False

Explanation:

True → missing value
False → valid value

3.8.5 Using `notnull()`

The opposite of isnull() is notnull().

print(df.notnull())

Explanation:

True → value exists
False → missing value

3.8.6 Counting Missing Values

NCERT explains that we can count missing values column-wise.

print(df.isnull().sum())

Output:

Name       0
Maths      1
Science    1
English    1
dtype: int64

Explanation:

Helps identify which columns have missing data
Useful before cleaning data

3.8.7 Dropping Rows with Missing Values using `dropna()`

NCERT introduces dropna() to remove records containing missing values.

Drop rows with any missing value

df_drop = df.dropna()
print(df_drop)

Output:

    Name  Maths  Science  English
0   Amit   85.0     80.0     78.0
4  Karan   88.0     84.0     80.0

Explanation:

Rows containing even one NaN are removed
Use with caution, as data loss can occur

3.8.8 Dropping Columns with Missing Values

df_drop_col = df.dropna(axis=1)
print(df_drop_col)

Explanation:

axis=1 → columns
Columns with any missing value are removed

3.8.9 Dropping Rows Based on Condition

NCERT mentions selective dropping.

df_drop_cond = df.dropna(thresh=3)
print(df_drop_cond)

Explanation:

thresh=3 means:
- Keep rows with at least 3 non-null values

3.8.10 Filling Missing Values using `fillna()`

Instead of deleting data, NCERT recommends filling missing values.

Fill missing values with a constant

df_filled = df.fillna(0)
print(df_filled)

Explanation:

All missing values replaced with 0
Simple but may not be meaningful in all cases

3.8.11 Filling Missing Values with Column Mean

NCERT highlights this as a common practice.

df['Maths'].fillna(df['Maths'].mean(), inplace=True)
df['Science'].fillna(df['Science'].mean(), inplace=True)
df['English'].fillna(df['English'].mean(), inplace=True)

print(df)

Explanation:

Missing values replaced with average of column
Preserves overall data trend

3.8.12 Forward Fill and Backward Fill

NCERT introduces propagation-based filling.

Forward Fill (`ffill`)

df_ffill = df.fillna(method='ffill')
print(df_ffill)

Explanation:

Missing value filled using previous valid value

Backward Fill (`bfill`)

df_bfill = df.fillna(method='bfill')
print(df_bfill)

Explanation:

Missing value filled using next valid value

3.8.13 Replacing Missing Values using `replace()`

NCERT mentions that replace() can also be used.

df_replace = df.replace(np.nan, 0)
print(df_replace)

Explanation:

Replaces NaN with specified value
Similar effect to fillna()

3.8.14 Effect of Missing Values on Calculations

NCERT stresses an important point:

Pandas ignores NaN values by default in:
- sum()
- mean()
- min()
- max()

Example:

print(df['Maths'].mean())

Explanation:

Mean is calculated using available values only
Missing values do not cause errors

3.8.15 Choosing the Right Method (NCERT Guidance)

NCERT expects students to understand that:

Dropping data may cause loss of information
Filling data must be logically justified
There is no single correct method
Choice depends on:
- Nature of data
- Purpose of analysis

3.8.16 Key Learning Outcomes of Section 3.8

By the end of this section, NCERT expects students to:

Identify missing values using isnull() and notnull()
Count missing values
Drop rows or columns with missing values
Fill missing values using:
- Constant values
- Mean
- Forward fill
- Backward fill
Understand the impact of missing values on analysis

3.9 Import and Export of Data between Pandas and MySQL

3.9.1 Why Connect Pandas with MySQL? (NCERT Context)

NCERT introduces this section to show how data analysis tools (Pandas) work with databases (MySQL).

So far, you have worked with:

Data entered manually
Data stored in CSV files

However, in real-world applications:

Data is often stored in relational databases
MySQL is one of the most widely used database systems
Data analysts frequently need to:
- Read data from database tables
- Analyse it using Pandas
- Store processed data back into the database

This section demonstrates basic interaction between Pandas and MySQL.

3.9.2 Tools Required for Pandas–MySQL Connectivity

NCERT explains that to connect Pandas with MySQL, we need:

MySQL Server
MySQL Connector for Python
Pandas library

The Python module used is:

mysql.connector

This module allows Python programs to:

Establish connection with MySQL
Execute SQL queries
Fetch and insert data

3.9.3 Importing Required Libraries

Before connecting to MySQL, required libraries must be imported.

import pandas as pd
import mysql.connector

Explanation:

pandas → data handling and analysis
mysql.connector → database connectivity

3.9.4 Establishing Connection with MySQL

NCERT shows how to create a connection object.

mycon = mysql.connector.connect(
    host="localhost",
    user="root",
    password="password",
    database="school"
)

Explanation:

host → location of MySQL server
user → MySQL username
password → MySQL password
database → database to be used

If the connection is successful:

mycon becomes an active connection object

3.9.5 Reading Data from MySQL into a DataFrame

NCERT demonstrates how to:

Fetch records from a MySQL table
Load them into a Pandas DataFrame

Example: Read STUDENT table from MySQL

Assume MySQL table:

STUDENT
------------------------
RollNo | Name  | Marks

query = "SELECT * FROM STUDENT"

df = pd.read_sql(query, mycon)
print(df)

Explanation:

read_sql() executes the SQL query
Result is directly converted into a DataFrame
Column names are preserved

This is the most important function in this section.

3.9.6 Reading Selected Columns from MySQL

query = "SELECT RollNo, Name FROM STUDENT"
df = pd.read_sql(query, mycon)
print(df)

Explanation:

Only required columns are fetched
Reduces data transfer overhead

3.9.7 Applying Pandas Operations on Database Data

NCERT stresses that:

Once data is in a DataFrame, all Pandas operations can be applied.

Example:

print(df.describe())

high_scorers = df[df['Marks'] > 85]
print(high_scorers)

This shows the power of combining databases with Pandas.

3.9.8 Writing Data from DataFrame to MySQL Table

NCERT also explains the reverse operation:

Storing processed data back into MySQL.

Steps involved:

Create a cursor
Execute INSERT queries using DataFrame values

Example: Insert DataFrame rows into MySQL

cursor = mycon.cursor()

for i in range(len(df)):
    sql = "INSERT INTO STUDENT VALUES (%s, %s, %s)"
    values = (
        int(df.loc[i, 'RollNo']),
        df.loc[i, 'Name'],
        int(df.loc[i, 'Marks'])
    )
    cursor.execute(sql, values)

mycon.commit()

Explanation:

Each row of DataFrame is inserted into MySQL table
commit() saves the changes permanently

3.9.9 Exporting DataFrame to MySQL (Conceptual Note)

NCERT does not require students to use advanced tools like SQLAlchemy.

At CBSE level:

Understanding logic and flow is sufficient
Focus is on:
- Reading from MySQL
- Writing back using basic SQL commands

3.9.10 Closing the Database Connection

NCERT highlights good programming practice.

cursor.close()
mycon.close()

Explanation:

Frees system resources
Prevents connection leaks

3.9.11 Common Errors and Precautions (NCERT Awareness)

NCERT expects conceptual awareness of:

Wrong username/password → connection failure
Database/table not found → SQL error
Forgetting commit() → data not saved
Mismatch between table structure and DataFrame → insertion error

Students are not expected to debug deeply, only understand causes.

3.9.12 Key Learning Outcomes of Section 3.9

By the end of this section, NCERT expects students to:

Understand the need to connect Pandas with MySQL
Import required libraries for database connectivity
Establish connection with MySQL
Read database tables into a DataFrame using read_sql()
Perform Pandas operations on database data
Insert DataFrame data into MySQL tables
Close database connections properly

Chapter 03 Completed Fully

We have now fully and sequentially completed Chapter 03:

3.1 Introduction
3.2 Descriptive Statistics
3.3 Data Aggregations
3.4 Sorting a DataFrame
3.5 GROUP BY Functions
3.6 Altering the Index
3.7 Other DataFrame Operations
3.8 Handling Missing Values
3.9 Pandas ↔ MySQL

Page last updated: 17-Mar-2026 18:38 IST

Document Version: v17

Chapter 03 - Data Handling with Pandas II

3.1 Introduction

3.1.1 Context from Chapter 02 (NCERT Continuity)

3.1.2 Why Chapter 03 is Needed (NCERT Motivation)

3.1.3 Nature of Data Handled in This Chapter

3.1.4 Major Concepts Introduced in This Chapter

3.1.5 Importance of This Chapter in Data Analysis

3.1.6 How This Chapter Differs from Chapter 02

3.1.7 Learning Outcome of Section 3.1

Section 3.1 Completed

3.2 Descriptive Statistics

3.2.1 What are Descriptive Statistics? (NCERT Explanation)

3.2.2 Sample Data Used for Explanation

Output

3.2.3 The describe() Function (Core NCERT Method)

Output

3.2.4 Interpretation of describe() Output (Very Important)

3.2.5 Calculating Mean Using mean()

3.2.6 Calculating Median Using median()

3.2.7 Calculating Mode Using mode()

3.2.8 Calculating Minimum and Maximum Values

3.2.9 Calculating Standard Deviation Using std()

3.2.10 Calculating Variance Using var()

3.2.11 Counting Values Using count()

3.2.12 Descriptive Statistics Across Rows

3.2.13 Descriptive Statistics on Entire DataFrame

3.2.14 Handling Missing Values in Statistics (NCERT Concept)

3.2.15 Key Learning Outcomes of Section 3.2

Section 3.2 Completed

3.3 Data Aggregations

3.3.1 Meaning of Data Aggregation (NCERT Explanation)

3.3.2 Difference Between Descriptive Statistics and Aggregation

3.3.3 Sample DataFrame Used for Aggregation

3.3.4 Using sum() for Aggregation

Column-wise sum (default behaviour)

Row-wise sum

3.3.5 Using mean() for Aggregation

3.3.6 Using min() and max()

3.3.7 Using count() in Aggregation

3.3.8 Aggregation Using agg() Function (Very Important)

Example: Multiple aggregations on columns

Explanation

3.3.9 Different Aggregations for Different Columns

3.3.10 Aggregation on Entire DataFrame

3.3.11 Aggregation with Missing Values (NCERT Concept)

3.3.12 Key Differences: describe() vs agg()

3.3.13 Key Learning Outcomes of Section 3.3

Section 3.3 Completed

3.4 Sorting a DataFrame

3.4.1 Meaning of Sorting (NCERT Explanation)

3.4.2 Sample DataFrame Used for Sorting

3.4.3 Sorting a DataFrame by Index

Using sort_index()

3.4.4 Sorting Index in Descending Order

3.4.5 Sorting a DataFrame by Column Values

Using sort_values()

3.4.6 Sorting by Column in Descending Order

3.4.7 Sorting by Multiple Columns

3.4.8 Sorting with Different Orders for Different Columns

3.4.9 Sorting Based on String Columns

3.4.10 In-place Sorting (Modifying Original DataFrame)

3.4.11 Sorting with Missing Values (NCERT Concept)

3.4.12 Difference Between sort_index() and sort_values()

3.4.13 Key Learning Outcomes of Section 3.4

Section 3.4 Completed

3.5 GROUP BY Functions

3.5.1 Why GROUP BY is Required in Pandas (NCERT Motivation)

3.5.2 Concept of “Split–Apply–Combine” (NCERT Explanation)

3.5.3 Sample DataFrame Used for GROUP BY

3.5.4 Creating a GROUP BY Object

3.5.5 GROUP BY with mean()

Example: Average marks class-wise

Explanation

3.5.6 GROUP BY with sum()

Example: Total marks class-wise

3.5.7 GROUP BY with count()

Example: Number of students in each class

3.5.8 GROUP BY on Multiple Columns

Example: Average marks class-wise and subject-wise

Explanation

3.2.3 The `describe()` Function (Core NCERT Method)

3.2.4 Interpretation of `describe()` Output (Very Important)

3.2.5 Calculating Mean Using `mean()`

3.2.6 Calculating Median Using `median()`

3.2.7 Calculating Mode Using `mode()`

3.2.9 Calculating Standard Deviation Using `std()`

3.2.10 Calculating Variance Using `var()`

3.2.11 Counting Values Using `count()`

3.3.4 Using `sum()` for Aggregation

3.3.5 Using `mean()` for Aggregation

3.3.6 Using `min()` and `max()`

3.3.7 Using `count()` in Aggregation

3.3.8 Aggregation Using `agg()` Function (Very Important)

3.3.12 Key Differences: `describe()` vs `agg()`

Using `sort_index()`

Using `sort_values()`

3.4.12 Difference Between `sort_index()` and `sort_values()`

3.5.5 GROUP BY with `mean()`

3.5.6 GROUP BY with `sum()`

3.5.7 GROUP BY with `count()`

3.5.9 Using `agg()` with GROUP BY

3.6.3 Setting a Column as Index using `set_index()`

3.6.5 Resetting the Index using `reset_index()`

3.7.5 Deleting a Column using `drop()`

3.7.12 Using `apply()` on a DataFrame Column

3.7.13 Using Lambda Function with `apply()`

3.8.4 Identifying Missing Values using `isnull()`

3.8.5 Using `notnull()`

3.8.7 Dropping Rows with Missing Values using `dropna()`

3.8.10 Filling Missing Values using `fillna()`

Forward Fill (`ffill`)

Backward Fill (`bfill`)

3.8.13 Replacing Missing Values using `replace()`