Chapter 03 - Data Handling with Pandas II

CBSE Class 12 Informatics Practices

3.1 Introduction

3.1.1 Context from Chapter 02 (NCERT Continuity)

NCERT begins this chapter by explicitly linking it with Chapter 02: Data Handling with Pandas I.

In the previous chapter, you learnt:

  • What Pandas is
  • How to work with:

    • Series
    • DataFrame
  • How to:

    • Create DataFrames
    • Access rows and columns
    • Import data from CSV files
    • Export data back to CSV

At that stage, the focus was mainly on:

  • Creating and viewing data
  • Basic manipulation

However, real-world data handling does not stop there.


3.1.2 Why Chapter 03 is Needed (NCERT Motivation)

NCERT now shifts the focus from basic data handling to data analysis.

In practical situations:

  • Data is often very large
  • Raw data by itself is not useful
  • We need to:

    • Summarise data
    • Identify patterns
    • Compare groups
    • Prepare reports

For example:

  • What is the average score of students?
  • Which city has the highest sales?
  • How many records belong to each category?
  • Which values are missing or incorrect?

Such questions require advanced DataFrame operations, which are introduced in this chapter.


3.1.3 Nature of Data Handled in This Chapter

NCERT clarifies that this chapter deals with:

  • Structured data
  • Data stored in rows and columns
  • Data that can be:

    • Numeric
    • Categorical
    • Textual

The emphasis is on analysing data, not merely displaying it.


3.1.4 Major Concepts Introduced in This Chapter

NCERT outlines that this chapter will focus on the following capabilities of Pandas:

  • Calculating descriptive statistics
  • Performing data aggregation
  • Sorting data based on:

    • Values
    • Index
  • Grouping data using GROUP BY

  • Modifying the index

  • Performing additional DataFrame operations

  • Handling missing data

  • Exchanging data between:

    • Pandas
    • MySQL database

Each of these topics builds on the DataFrame knowledge gained earlier.


3.1.5 Importance of This Chapter in Data Analysis

NCERT implicitly highlights that:

  • This chapter introduces operations used in:

    • Business analytics
    • Data science
    • Research
  • The techniques taught here are:

    • Practical
    • Frequently used
    • Exam-relevant

Students are expected to:

  • Apply logical thinking
  • Interpret results, not just write code
  • Understand the meaning of outputs

3.1.6 How This Chapter Differs from Chapter 02

Chapter 02 Chapter 03
Creating data Analysing data
Structure of DataFrame Behaviour of DataFrame
Reading / writing data Summarising / grouping data
Basic operations Advanced operations

This distinction is important for conceptual clarity.


3.1.7 Learning Outcome of Section 3.1

By the end of this introductory section, NCERT expects students to:

  • Understand the scope of the chapter
  • Recognise the shift from data creation to data analysis
  • Appreciate the importance of Pandas in real-world data handling
  • Be prepared for advanced operations on DataFrames

Section 3.1 Completed

We have now fully and sequentially covered:

  • Why Chapter 03 exists
  • How it builds on Chapter 02
  • What types of problems it addresses
  • What topics will be studied next

3.2 Descriptive Statistics

3.2.1 What are Descriptive Statistics? (NCERT Explanation)

NCERT introduces descriptive statistics as methods used to:

  • Summarise large datasets
  • Describe the central tendency
  • Measure the spread or variability of data

Instead of examining every value individually, descriptive statistics provide:

  • Single numerical values
  • Quick understanding of data behaviour

This is essential when working with large DataFrames.


3.2.2 Sample Data Used for Explanation

NCERT explains these concepts using a DataFrame.

We will use a similar structure.

import pandas as pd

data = {
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
    'Maths': [85, 90, 78, 92, 88],
    'Science': [80, 85, 75, 89, 84],
    'English': [78, 82, 74, 88, 80]
}

df = pd.DataFrame(data)
print(df)

Output

    Name  Maths  Science  English
0   Amit     85       80       78
1   Neha     90       85       82
2  Rahul     78       75       74
3  Pooja     92       89       88
4  Karan     88       84       80

3.2.3 The describe() Function (Core NCERT Method)

NCERT introduces describe() as the most important descriptive statistics function.

print(df.describe())

Output

           Maths    Science    English
count   5.000000   5.000000   5.000000
mean   86.600000  82.600000  80.400000
std     5.941380   5.941380   5.263079
min    78.000000  75.000000  74.000000
25%    85.000000  80.000000  78.000000
50%    88.000000  84.000000  80.000000
75%    90.000000  85.000000  82.000000
max    92.000000  89.000000  88.000000

3.2.4 Interpretation of describe() Output (Very Important)

NCERT expects students to understand, not memorise, these values.

  • count → Number of non-null values
  • mean → Average value
  • std → Standard deviation (spread of data)
  • min → Minimum value
  • 25% → First quartile
  • 50% → Median
  • 75% → Third quartile
  • max → Maximum value

This output gives a complete statistical summary.


3.2.5 Calculating Mean Using mean()

Instead of using describe(), NCERT shows individual functions.

print(df['Maths'].mean())

Output:

86.6

3.2.6 Calculating Median Using median()

print(df['Science'].median())

Output:

84.0

Explanation:

  • Median is the middle value after sorting

3.2.7 Calculating Mode Using mode()

NCERT explains that:

  • Mode is the most frequently occurring value
  • A dataset can have:

    • One mode
    • Multiple modes
    • No mode

      print(df['English'].mode())
      

Output:

0    78
1    82
2    74
3    88
4    80
dtype: int64

Explanation:

  • All values appear only once
  • Hence, Pandas returns all values

3.2.8 Calculating Minimum and Maximum Values

print(df['Maths'].min())
print(df['Maths'].max())

Output:

78
92

These indicate:

  • Lowest and highest scores

3.2.9 Calculating Standard Deviation Using std()

print(df['Science'].std())

Output:

5.941380

Explanation:

  • Measures dispersion
  • Higher value → more spread

NCERT does not require mathematical formula, only interpretation.


3.2.10 Calculating Variance Using var()

print(df['English'].var())

Output:

27.7

Variance is the square of standard deviation.


3.2.11 Counting Values Using count()

print(df['Maths'].count())

Output:

5

Counts non-null values only.


3.2.12 Descriptive Statistics Across Rows

NCERT mentions that:

  • Functions operate column-wise by default
  • To operate row-wise, use axis=1

    print(df[['Maths', 'Science', 'English']].mean(axis=1))
    

Output:

0    81.0
1    85.7
2    75.7
3    89.7
4    84.0
dtype: float64

This gives:

  • Average marks per student

3.2.13 Descriptive Statistics on Entire DataFrame

print(df.mean(numeric_only=True))

Output:

Maths      86.6
Science    82.6
English    80.4
dtype: float64

Explanation:

  • Only numeric columns are considered

3.2.14 Handling Missing Values in Statistics (NCERT Concept)

If a column contains NaN:

  • Descriptive functions ignore NaN by default

Example:

df.loc[2, 'Science'] = None
print(df['Science'].mean())

Explanation:

  • Mean is calculated using remaining values
  • Missing values do not cause errors

3.2.15 Key Learning Outcomes of Section 3.2

NCERT expects students to be able to:

  • Define descriptive statistics
  • Use describe() to summarise data
  • Calculate:

    • Mean
    • Median
    • Mode
    • Min, Max
    • Standard deviation
    • Variance
  • Interpret statistical results

  • Apply statistics row-wise and column-wise


Section 3.2 Completed

We have now fully and sequentially covered:

  • Concept of descriptive statistics
  • describe() function
  • Individual statistical functions
  • Interpretation of results
  • NCERT-aligned examples with code

3.3 Data Aggregations

3.3.1 Meaning of Data Aggregation (NCERT Explanation)

NCERT introduces data aggregation as the process of:

  • Combining multiple data values
  • Producing a single summary value
  • Performing operations such as:

    • Sum
    • Average
    • Minimum
    • Maximum
    • Count

In simple words:

Data aggregation means summarising data to extract meaningful information.

Aggregation is a core activity in data analysis, especially when dealing with large datasets.


3.3.2 Difference Between Descriptive Statistics and Aggregation

NCERT implicitly distinguishes the two:

  • Descriptive statistics → Describe data distribution (mean, median, std, etc.)

  • Data aggregation → Combine values to produce summaries (sum, count, min, max)

Both are related, but aggregation focuses more on combining data.


3.3.3 Sample DataFrame Used for Aggregation

We continue with a marks dataset, similar to NCERT examples.

import pandas as pd

data = {
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
    'Maths': [85, 90, 78, 92, 88],
    'Science': [80, 85, 75, 89, 84],
    'English': [78, 82, 74, 88, 80]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Maths  Science  English
0   Amit     85       80       78
1   Neha     90       85       82
2  Rahul     78       75       74
3  Pooja     92       89       88
4  Karan     88       84       80

3.3.4 Using sum() for Aggregation

Column-wise sum (default behaviour)

print(df[['Maths', 'Science', 'English']].sum())

Output:

Maths      433
Science    413
English    402
dtype: int64

Explanation:

  • Values are added column-wise
  • Default axis = 0 (rows)

Row-wise sum

print(df[['Maths', 'Science', 'English']].sum(axis=1))

Output:

0    243
1    257
2    227
3    269
4    252
dtype: int64

Explanation:

  • Each student’s total marks are calculated
  • axis=1 means across columns

3.3.5 Using mean() for Aggregation

print(df[['Maths', 'Science', 'English']].mean())

Output:

Maths      86.6
Science    82.6
English    80.4
dtype: float64

Explanation:

  • Returns average marks per subject

3.3.6 Using min() and max()

print(df[['Maths', 'Science', 'English']].min())
print(df[['Maths', 'Science', 'English']].max())

Output:

Maths      78
Science    75
English    74
dtype: int64

Maths      92
Science    89
English    88
dtype: int64

Explanation:

  • Helps identify:

    • Lowest score
    • Highest score

3.3.7 Using count() in Aggregation

print(df[['Maths', 'Science', 'English']].count())

Output:

Maths      5
Science    5
English    5
dtype: int64

Explanation:

  • Counts non-null values
  • Useful when data may have missing values

3.3.8 Aggregation Using agg() Function (Very Important)

NCERT introduces agg() as a powerful aggregation method.

It allows:

  • Applying multiple aggregation functions at once

Example: Multiple aggregations on columns

print(df[['Maths', 'Science', 'English']].agg(['sum', 'mean', 'min', 'max']))

Output:

        Maths  Science  English
sum       433      413      402
mean     86.6     82.6     80.4
min        78       75       74
max        92       89       88

Explanation

  • Rows represent aggregation functions
  • Columns represent subjects
  • This gives a compact statistical summary

3.3.9 Different Aggregations for Different Columns

NCERT also highlights that:

  • Different columns may need different aggregations

    print(df.agg({
    'Maths': ['mean', 'max'],
    'Science': ['min', 'mean'],
    'English': ['sum']
    }))
    

Output:

         Maths  Science  English
mean     86.6     82.6      NaN
max      92.0      NaN      NaN
min       NaN     75.0      NaN
sum       NaN      NaN    402.0

Explanation:

  • Each column has its own aggregation logic
  • Missing combinations appear as NaN

3.3.10 Aggregation on Entire DataFrame

print(df.agg(['sum', 'mean'], numeric_only=True))

Explanation:

  • Only numeric columns are aggregated
  • Text columns are ignored

3.3.11 Aggregation with Missing Values (NCERT Concept)

NCERT notes that:

  • Aggregation functions ignore NaN by default

Example:

df.loc[1, 'Science'] = None
print(df['Science'].mean())

Explanation:

  • Mean is computed using remaining values
  • No error is raised

3.3.12 Key Differences: describe() vs agg()

describe() agg()
Fixed set of statistics Customisable
Automatic summary User-defined
Quick overview Detailed control

NCERT expects students to know when to use each.


3.3.13 Key Learning Outcomes of Section 3.3

By the end of this section, NCERT expects students to:

  • Understand data aggregation
  • Use aggregation functions:

    • sum, mean, min, max, count
  • Apply aggregation row-wise and column-wise

  • Use agg() for:

    • Multiple aggregations
    • Column-specific aggregations
  • Handle missing values during aggregation


Section 3.3 Completed

We have now fully and sequentially covered:

  • Concept of data aggregation
  • Built-in aggregation functions
  • Axis-based aggregation
  • agg() function with examples
  • NCERT-aligned interpretation

3.4 Sorting a DataFrame

3.4.1 Meaning of Sorting (NCERT Explanation)

NCERT introduces sorting as the process of:

  • Arranging data in a specific order
  • Based on:

    • Values in one or more columns, or
    • Index labels

Sorting helps to:

  • Analyse trends
  • Identify highest or lowest values
  • Present data in a meaningful order

Pandas provides built-in functions to sort a DataFrame efficiently.


3.4.2 Sample DataFrame Used for Sorting

We will continue with a familiar dataset.

import pandas as pd

data = {
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
    'Maths': [85, 90, 78, 92, 88],
    'Science': [80, 85, 75, 89, 84],
    'English': [78, 82, 74, 88, 80]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Maths  Science  English
0   Amit     85       80       78
1   Neha     90       85       82
2  Rahul     78       75       74
3  Pooja     92       89       88
4  Karan     88       84       80

3.4.3 Sorting a DataFrame by Index

NCERT first introduces sorting by index.

Using sort_index()

df_sorted_index = df.sort_index()
print(df_sorted_index)

Explanation:

  • Default sorting is ascending
  • Since the index is already in order (0,1,2,3,4), output remains unchanged

3.4.4 Sorting Index in Descending Order

df_sorted_index_desc = df.sort_index(ascending=False)
print(df_sorted_index_desc)

Output:

    Name  Maths  Science  English
4  Karan     88       84       80
3  Pooja     92       89       88
2  Rahul     78       75       74
1   Neha     90       85       82
0   Amit     85       80       78

Explanation:

  • Rows are rearranged based on index labels
  • Highest index appears first

3.4.5 Sorting a DataFrame by Column Values

NCERT now introduces sorting by column values, which is more common.

Using sort_values()

df_sorted_marks = df.sort_values(by='Maths')
print(df_sorted_marks)

Output:

    Name  Maths  Science  English
2  Rahul     78       75       74
0   Amit     85       80       78
4  Karan     88       84       80
1   Neha     90       85       82
3  Pooja     92       89       88

Explanation:

  • Rows are sorted based on Maths marks
  • Default order is ascending

3.4.6 Sorting by Column in Descending Order

df_sorted_desc = df.sort_values(by='Maths', ascending=False)
print(df_sorted_desc)

Output:

    Name  Maths  Science  English
3  Pooja     92       89       88
1   Neha     90       85       82
4  Karan     88       84       80
0   Amit     85       80       78
2  Rahul     78       75       74

Explanation:

  • Useful for:

    • Ranking
    • Finding top performers

3.4.7 Sorting by Multiple Columns

NCERT explains that sometimes sorting by one column is not sufficient.

Example scenario:

  • Sort by Maths marks
  • If Maths marks are equal, sort by Science marks

    df_sorted_multi = df.sort_values(by=['Maths', 'Science'])
    print(df_sorted_multi)
    

Explanation:

  • First sorting key: Maths
  • Second sorting key: Science

3.4.8 Sorting with Different Orders for Different Columns

df_sorted_multi_order = df.sort_values(
    by=['Maths', 'Science'],
    ascending=[False, True]
)
print(df_sorted_multi_order)

Explanation:

  • Maths sorted in descending order
  • Science sorted in ascending order

This gives fine control over sorting behaviour.


3.4.9 Sorting Based on String Columns

NCERT clarifies that:

  • Sorting also works on text data

    df_sorted_name = df.sort_values(by='Name')
    print(df_sorted_name)
    

Output:

    Name  Maths  Science  English
0   Amit     85       80       78
4  Karan     88       84       80
1   Neha     90       85       82
3  Pooja     92       89       88
2  Rahul     78       75       74

Explanation:

  • Sorting is done in alphabetical order

3.4.10 In-place Sorting (Modifying Original DataFrame)

NCERT introduces inplace=True.

df.sort_values(by='Maths', inplace=True)
print(df)

Explanation:

  • Original DataFrame is modified
  • No new DataFrame is created
  • Use carefully, as changes are permanent

3.4.11 Sorting with Missing Values (NCERT Concept)

If a column contains NaN:

df.loc[2, 'Maths'] = None
print(df.sort_values(by='Maths'))

Explanation:

  • By default, NaN values appear at the end
  • Pandas handles missing values automatically

3.4.12 Difference Between sort_index() and sort_values()

sort_index() sort_values()
Sorts by index labels Sorts by column values
Index-based Data-based
Used for row order Used for ranking and analysis

NCERT expects students to clearly distinguish these two.


3.4.13 Key Learning Outcomes of Section 3.4

By the end of this section, NCERT expects students to:

  • Sort DataFrames by:

    • Index
    • Column values
  • Sort in:

    • Ascending order
    • Descending order
  • Sort by:

    • Single column
    • Multiple columns
  • Use inplace option correctly

  • Understand how missing values affect sorting


Section 3.4 Completed

We have now fully and sequentially covered:

  • Concept of sorting
  • Sorting by index
  • Sorting by values
  • Multiple-column sorting
  • Ascending/descending order
  • NCERT-aligned examples

3.5 GROUP BY Functions

3.5.1 Why GROUP BY is Required in Pandas (NCERT Motivation)

NCERT introduces GROUP BY in Pandas for the same reason it exists in SQL.

So far, we have:

  • Calculated statistics on entire columns
  • Aggregated data without categories

However, real-world questions are usually group-based, such as:

  • Average marks class-wise
  • Total sales city-wise
  • Number of students stream-wise

To answer such questions, data must be:

  1. Divided into groups
  2. Aggregated within each group

This is exactly what GROUP BY functions do.


3.5.2 Concept of “Split–Apply–Combine” (NCERT Explanation)

NCERT explains GROUP BY using the idea of:

Split → Apply → Combine

  1. Split Data is divided into groups based on a column

  2. Apply An aggregation function is applied to each group

  3. Combine Results are combined into a new DataFrame or Series

Understanding this concept is more important than memorising syntax.


3.5.3 Sample DataFrame Used for GROUP BY

We now use a slightly richer dataset, similar to NCERT’s examples.

import pandas as pd

data = {
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan', 'Riya'],
    'Class': ['XII', 'XII', 'XI', 'XI', 'XII', 'XI'],
    'Subject': ['Maths', 'Maths', 'Maths', 'Science', 'Science', 'Science'],
    'Marks': [85, 90, 78, 88, 92, 80]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name Class  Subject   Marks
0   Amit   XII    Maths      85
1   Neha   XII    Maths      90
2  Rahul    XI    Maths      78
3  Pooja    XI  Science      88
4  Karan   XII  Science      92
5   Riya    XI  Science      80

3.5.4 Creating a GROUP BY Object

NCERT shows that groupby() does not immediately compute results.

grp = df.groupby('Class')
print(grp)

Explanation:

  • This creates a GroupBy object
  • No output is displayed yet
  • Actual computation happens only when an aggregation function is applied

3.5.5 GROUP BY with mean()

Example: Average marks class-wise

print(df.groupby('Class')['Marks'].mean())

Output:

Class
XI     82.0
XII    89.0
Name: Marks, dtype: float64

Explanation

  • Data is split into two groups: XI and XII
  • Mean is calculated separately for each group
  • Result is a Series indexed by Class

This is conceptually identical to SQL GROUP BY Class.


3.5.6 GROUP BY with sum()

Example: Total marks class-wise

print(df.groupby('Class')['Marks'].sum())

Output:

Class
XI     246
XII    267
Name: Marks, dtype: int64

3.5.7 GROUP BY with count()

Example: Number of students in each class

print(df.groupby('Class')['Name'].count())

Output:

Class
XI     3
XII    3
Name: Name, dtype: int64

Explanation:

  • Counts non-null values
  • Equivalent to SQL COUNT(*)

3.5.8 GROUP BY on Multiple Columns

NCERT highlights that:

  • Grouping can be done on more than one column

Example: Average marks class-wise and subject-wise

print(df.groupby(['Class', 'Subject'])['Marks'].mean())

Output:

Class  Subject
XI     Maths       78.0
       Science     84.0
XII    Maths       87.5
       Science     92.0
Name: Marks, dtype: float64

Explanation

  • Data is first grouped by Class
  • Then sub-grouped by Subject
  • Result uses a MultiIndex

This is very important conceptually.


3.5.9 Using agg() with GROUP BY

NCERT shows that multiple aggregations can be applied to groups.

Example: Multiple aggregations on Marks

print(df.groupby('Class')['Marks'].agg(['mean', 'max', 'min']))

Output:

       mean  max  min
Class                
XI     82.0   88   78
XII    89.0   92   85

Explanation

  • Each group produces multiple summary values
  • Output is a DataFrame

3.5.10 GROUP BY on Entire DataFrame

NCERT also allows aggregation without selecting a specific column.

print(df.groupby('Class').mean(numeric_only=True))

Explanation:

  • Aggregation applied to all numeric columns
  • Non-numeric columns are ignored

3.5.11 Iterating Through Groups (Conceptual Awareness)

NCERT briefly introduces the idea that groups can be iterated.

for name, group in df.groupby('Class'):
    print("Group:", name)
    print(group)

Explanation:

  • name → group label
  • group → DataFrame for that group

This helps in understanding how grouping works internally.


3.5.12 Difference Between Pandas GROUP BY and SQL GROUP BY

NCERT expects students to relate this with SQL knowledge.

Pandas GROUP BY SQL GROUP BY
groupby() function GROUP BY clause
Works on DataFrame Works on table
Split–Apply–Combine Group–Aggregate
Supports multiple aggregations easily More rigid syntax

3.5.13 Key Learning Outcomes of Section 3.5

By the end of this section, NCERT expects students to:

  • Understand the concept of grouping
  • Use groupby() on one or more columns
  • Apply aggregation functions to groups
  • Use agg() with GROUP BY
  • Interpret grouped results
  • Relate Pandas GROUP BY to SQL GROUP BY

Section 3.5 Completed

We have now fully and sequentially covered:

  • GROUP BY concept
  • Split–Apply–Combine
  • Single-column grouping
  • Multi-column grouping
  • Aggregations with GROUP BY
  • NCERT-aligned examples

3.6 Altering the Index

3.6.1 Meaning of Index in a DataFrame (NCERT Context)

NCERT reminds students that:

  • Every Pandas DataFrame has an index
  • The index is used to:

    • Label rows
    • Identify data
    • Access records efficiently

By default:

  • Pandas assigns an integer index starting from 0

However, in real datasets:

  • A meaningful column (like RollNo, EmpID, Date) is often better suited as an index

This section explains how to modify the index of a DataFrame.


3.6.2 Sample DataFrame Used for Index Operations

import pandas as pd

data = {
    'RollNo': [101, 102, 103, 104],
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja'],
    'Marks': [85, 90, 78, 92]
}

df = pd.DataFrame(data)
print(df)

Output:

   RollNo   Name  Marks
0     101   Amit     85
1     102   Neha     90
2     103  Rahul     78
3     104  Pooja     92

Here:

  • Index is 0, 1, 2, 3
  • RollNo is just a normal column

3.6.3 Setting a Column as Index using set_index()

NCERT introduces set_index() to make a column the index.

Example: Set RollNo as index

df1 = df.set_index('RollNo')
print(df1)

Output:

        Name  Marks
RollNo             
101     Amit     85
102     Neha     90
103    Rahul     78
104    Pooja     92

Explanation

  • RollNo column is removed from data
  • It becomes the row index
  • Index values are now meaningful identifiers

This is very common in real datasets.


3.6.4 Setting Index Without Losing the Column

NCERT explains that sometimes:

  • We want to keep the column as well

    df2 = df.set_index('RollNo', drop=False)
    print(df2)
    

Explanation:

  • drop=False keeps the column
  • RollNo exists both as:

    • Column
    • Index

3.6.5 Resetting the Index using reset_index()

NCERT now explains how to restore the default index.

df_reset = df1.reset_index()
print(df_reset)

Output:

   RollNo   Name  Marks
0     101   Amit     85
1     102   Neha     90
2     103  Rahul     78
3     104  Pooja     92

Explanation

  • Index is converted back into a column
  • Default integer index is restored

3.6.6 Resetting Index and Dropping Old Index

If we do not want the old index as a column:

df_reset_drop = df1.reset_index(drop=True)
print(df_reset_drop)

Explanation:

  • Old index values are discarded
  • Clean new index is created

3.6.7 Renaming Index Labels

NCERT shows that index labels can be renamed.

Example: Rename index values

df1.index = ['S1', 'S2', 'S3', 'S4']
print(df1)

Output:

     Name  Marks
S1   Amit     85
S2   Neha     90
S3  Rahul     78
S4  Pooja     92

Explanation

  • Index values are directly reassigned
  • Length must match number of rows

3.6.8 Naming the Index (Index Name Attribute)

NCERT introduces index.name.

df1.index.name = 'StudentID'
print(df1)

Output shows:

           Name  Marks
StudentID             
S1         Amit     85
S2         Neha     90
S3        Rahul     78
S4        Pooja     92

Explanation:

  • Index itself now has a name
  • Useful when exporting or displaying data

Although not strictly index values, NCERT connects this logically.

df1.columns = ['StudentName', 'TotalMarks']
print(df1)

Explanation:

  • Column labels can also be modified
  • Improves readability and clarity

3.6.10 Setting Multiple Columns as Index (Conceptual Awareness)

NCERT briefly mentions that:

  • More than one column can be used as index

    df_multi = df.set_index(['RollNo', 'Name'])
    print(df_multi)
    

Explanation:

  • Creates a MultiIndex
  • Useful for hierarchical data
  • Detailed usage is beyond this level, but concept is important

3.6.11 Why Index Alteration is Important (NCERT Intent)

NCERT wants students to understand that:

  • Index controls:

    • How data is accessed
    • How data is aligned
    • How data is grouped
  • Proper indexing:

    • Improves readability
    • Makes analysis easier
    • Helps in merging and grouping

3.6.12 Key Learning Outcomes of Section 3.6

By the end of this section, NCERT expects students to:

  • Understand the role of index in a DataFrame
  • Use set_index() to change index
  • Use reset_index() to restore default index
  • Rename index labels
  • Name the index
  • Understand the idea of multi-level index

3.7 Other DataFrame Operations

3.7.1 Purpose of This Section (NCERT Context)

NCERT introduces this section to familiarise students with day-to-day operations performed on a DataFrame, such as:

  • Adding rows and columns
  • Deleting rows and columns
  • Applying functions on data
  • Selecting data based on conditions

These operations are essential for:

  • Cleaning data
  • Preparing data for analysis
  • Modifying existing datasets

3.7.2 Sample DataFrame Used

We continue with a student dataset.

import pandas as pd

data = {
    'RollNo': [101, 102, 103, 104],
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja'],
    'Marks': [85, 90, 78, 92]
}

df = pd.DataFrame(data)
print(df)

Output:

   RollNo   Name  Marks
0     101   Amit     85
1     102   Neha     90
2     103  Rahul     78
3     104  Pooja     92

3.7.3 Adding a New Column to a DataFrame

NCERT explains that a new column can be added by simple assignment.

Example: Add Grade column

df['Grade'] = ['B', 'A', 'C', 'A']
print(df)

Output:

   RollNo   Name  Marks Grade
0     101   Amit     85     B
1     102   Neha     90     A
2     103  Rahul     78     C
3     104  Pooja     92     A

Explanation:

  • Column is added to the right
  • Length of data must match number of rows

3.7.4 Adding a Column Using Computation

df['MarksPlus5'] = df['Marks'] + 5
print(df)

Explanation:

  • Vectorised operation
  • No loop required

3.7.5 Deleting a Column using drop()

NCERT introduces drop() for removal.

df1 = df.drop('MarksPlus5', axis=1)
print(df1)

Explanation:

  • axis=1 → column
  • Original DataFrame remains unchanged unless inplace=True is used

3.7.6 Deleting a Column Permanently

df.drop('Grade', axis=1, inplace=True)
print(df)

Explanation:

  • Column removed permanently from df

3.7.7 Adding a New Row to a DataFrame

NCERT introduces row addition using loc.

Example: Add a new student

df.loc[4] = [105, 'Karan', 88]
print(df)

Output:

   RollNo   Name  Marks
0     101   Amit     85
1     102   Neha     90
2     103  Rahul     78
3     104  Pooja     92
4     105  Karan     88

Explanation:

  • Index 4 did not exist earlier
  • Pandas adds a new row

3.7.8 Deleting Rows from a DataFrame

Delete row by index

df2 = df.drop(2)
print(df2)

Explanation:

  • Row with index 2 (Rahul) is removed
  • Original DataFrame is unchanged

3.7.9 Deleting Multiple Rows

df3 = df.drop([0, 3])
print(df3)

Explanation:

  • Rows with index 0 and 3 are removed

3.7.10 Selecting Data Based on Condition (Filtering)

NCERT introduces conditional selection.

Example: Students scoring more than 85

high_scorers = df[df['Marks'] > 85]
print(high_scorers)

Output:

   RollNo   Name  Marks
1     102   Neha     90
3     104  Pooja     92
4     105  Karan     88

Explanation:

  • Boolean condition inside square brackets
  • Very common in data analysis

3.7.11 Selecting Specific Columns with Condition

print(df.loc[df['Marks'] > 85, ['Name', 'Marks']])

Explanation:

  • Combines row filtering with column selection

3.7.12 Using apply() on a DataFrame Column

NCERT introduces apply() to apply a function to each element.

Example: Assign Pass/Fail status

def result(m):
    if m >= 40:
        return 'Pass'
    else:
        return 'Fail'

df['Result'] = df['Marks'].apply(result)
print(df)

Explanation:

  • Function applied to each value in Marks
  • New column created based on logic

3.7.13 Using Lambda Function with apply()

df['Status'] = df['Marks'].apply(lambda x: 'Distinction' if x >= 90 else 'Normal')
print(df)

Explanation:

  • Lambda functions are short, anonymous functions
  • Commonly used with apply()

3.7.14 Replacing Values in a DataFrame

NCERT shows the use of replace().

df['Status'] = df['Status'].replace('Normal', 'Regular')
print(df)

Explanation:

  • Useful for cleaning categorical data

3.7.15 Checking DataFrame Information

print(df.info())

Explanation:

  • Displays:

    • Column names
    • Data types
    • Non-null counts
  • Very useful for quick inspection


3.7.16 Checking for Duplicate Rows (Conceptual)

NCERT briefly mentions duplicate detection.

print(df.duplicated())

Explanation:

  • Returns True/False for each row
  • Duplicate handling is part of data cleaning

3.7.17 Key Learning Outcomes of Section 3.7

By the end of this section, NCERT expects students to:

  • Add and delete rows and columns
  • Modify data using operations
  • Filter DataFrame based on conditions
  • Use apply() and lambda functions
  • Replace values
  • Inspect DataFrame structure

3.8 Handling Missing Values

3.8.1 What are Missing Values? (NCERT Explanation)

NCERT explains that missing values occur when:

  • Data is not available for some records
  • Data is lost during collection or transfer
  • Certain fields are left blank intentionally or unintentionally

In datasets, missing values are common and cannot be ignored, because they can:

  • Affect calculations
  • Produce incorrect analysis
  • Lead to wrong conclusions

3.8.2 Representation of Missing Values in Pandas

NCERT introduces the special value:

NaN
  • NaN stands for Not a Number
  • Pandas uses NaN to represent missing data
  • It applies to both:

    • Numeric data
    • Non-numeric data

3.8.3 Sample DataFrame with Missing Values

We use a dataset similar to NCERT examples.

import pandas as pd
import numpy as np

data = {
    'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
    'Maths': [85, 90, np.nan, 92, 88],
    'Science': [80, np.nan, 75, 89, 84],
    'English': [78, 82, 74, np.nan, 80]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Maths  Science  English
0   Amit   85.0     80.0     78.0
1   Neha   90.0      NaN     82.0
2  Rahul    NaN     75.0     74.0
3  Pooja   92.0     89.0      NaN
4  Karan   88.0     84.0     80.0

3.8.4 Identifying Missing Values using isnull()

NCERT introduces isnull() to detect missing values.

print(df.isnull())

Output:

    Name  Maths  Science  English
0  False  False    False    False
1  False  False     True    False
2  False   True    False    False
3  False  False    False     True
4  False  False    False    False

Explanation:

  • True → missing value
  • False → valid value

3.8.5 Using notnull()

The opposite of isnull() is notnull().

print(df.notnull())

Explanation:

  • True → value exists
  • False → missing value

3.8.6 Counting Missing Values

NCERT explains that we can count missing values column-wise.

print(df.isnull().sum())

Output:

Name       0
Maths      1
Science    1
English    1
dtype: int64

Explanation:

  • Helps identify which columns have missing data
  • Useful before cleaning data

3.8.7 Dropping Rows with Missing Values using dropna()

NCERT introduces dropna() to remove records containing missing values.

Drop rows with any missing value

df_drop = df.dropna()
print(df_drop)

Output:

    Name  Maths  Science  English
0   Amit   85.0     80.0     78.0
4  Karan   88.0     84.0     80.0

Explanation:

  • Rows containing even one NaN are removed
  • Use with caution, as data loss can occur

3.8.8 Dropping Columns with Missing Values

df_drop_col = df.dropna(axis=1)
print(df_drop_col)

Explanation:

  • axis=1 → columns
  • Columns with any missing value are removed

3.8.9 Dropping Rows Based on Condition

NCERT mentions selective dropping.

df_drop_cond = df.dropna(thresh=3)
print(df_drop_cond)

Explanation:

  • thresh=3 means:

    • Keep rows with at least 3 non-null values

3.8.10 Filling Missing Values using fillna()

Instead of deleting data, NCERT recommends filling missing values.


Fill missing values with a constant

df_filled = df.fillna(0)
print(df_filled)

Explanation:

  • All missing values replaced with 0
  • Simple but may not be meaningful in all cases

3.8.11 Filling Missing Values with Column Mean

NCERT highlights this as a common practice.

df['Maths'].fillna(df['Maths'].mean(), inplace=True)
df['Science'].fillna(df['Science'].mean(), inplace=True)
df['English'].fillna(df['English'].mean(), inplace=True)

print(df)

Explanation:

  • Missing values replaced with average of column
  • Preserves overall data trend

3.8.12 Forward Fill and Backward Fill

NCERT introduces propagation-based filling.

Forward Fill (ffill)

df_ffill = df.fillna(method='ffill')
print(df_ffill)

Explanation:

  • Missing value filled using previous valid value

Backward Fill (bfill)

df_bfill = df.fillna(method='bfill')
print(df_bfill)

Explanation:

  • Missing value filled using next valid value

3.8.13 Replacing Missing Values using replace()

NCERT mentions that replace() can also be used.

df_replace = df.replace(np.nan, 0)
print(df_replace)

Explanation:

  • Replaces NaN with specified value
  • Similar effect to fillna()

3.8.14 Effect of Missing Values on Calculations

NCERT stresses an important point:

  • Pandas ignores NaN values by default in:

    • sum()
    • mean()
    • min()
    • max()

Example:

print(df['Maths'].mean())

Explanation:

  • Mean is calculated using available values only
  • Missing values do not cause errors

3.8.15 Choosing the Right Method (NCERT Guidance)

NCERT expects students to understand that:

  • Dropping data may cause loss of information
  • Filling data must be logically justified
  • There is no single correct method
  • Choice depends on:

    • Nature of data
    • Purpose of analysis

3.8.16 Key Learning Outcomes of Section 3.8

By the end of this section, NCERT expects students to:

  • Identify missing values using isnull() and notnull()
  • Count missing values
  • Drop rows or columns with missing values
  • Fill missing values using:

    • Constant values
    • Mean
    • Forward fill
    • Backward fill
  • Understand the impact of missing values on analysis


3.9 Import and Export of Data between Pandas and MySQL

3.9.1 Why Connect Pandas with MySQL? (NCERT Context)

NCERT introduces this section to show how data analysis tools (Pandas) work with databases (MySQL).

So far, you have worked with:

  • Data entered manually
  • Data stored in CSV files

However, in real-world applications:

  • Data is often stored in relational databases
  • MySQL is one of the most widely used database systems
  • Data analysts frequently need to:

    • Read data from database tables
    • Analyse it using Pandas
    • Store processed data back into the database

This section demonstrates basic interaction between Pandas and MySQL.


3.9.2 Tools Required for Pandas–MySQL Connectivity

NCERT explains that to connect Pandas with MySQL, we need:

  1. MySQL Server
  2. MySQL Connector for Python
  3. Pandas library

The Python module used is:

mysql.connector

This module allows Python programs to:

  • Establish connection with MySQL
  • Execute SQL queries
  • Fetch and insert data

3.9.3 Importing Required Libraries

Before connecting to MySQL, required libraries must be imported.

import pandas as pd
import mysql.connector

Explanation:

  • pandas → data handling and analysis
  • mysql.connector → database connectivity

3.9.4 Establishing Connection with MySQL

NCERT shows how to create a connection object.

mycon = mysql.connector.connect(
    host="localhost",
    user="root",
    password="password",
    database="school"
)

Explanation:

  • host → location of MySQL server
  • user → MySQL username
  • password → MySQL password
  • database → database to be used

If the connection is successful:

  • mycon becomes an active connection object

3.9.5 Reading Data from MySQL into a DataFrame

NCERT demonstrates how to:

  • Fetch records from a MySQL table
  • Load them into a Pandas DataFrame

Example: Read STUDENT table from MySQL

Assume MySQL table:

STUDENT
------------------------
RollNo | Name  | Marks
query = "SELECT * FROM STUDENT"

df = pd.read_sql(query, mycon)
print(df)

Explanation:

  • read_sql() executes the SQL query
  • Result is directly converted into a DataFrame
  • Column names are preserved

This is the most important function in this section.


3.9.6 Reading Selected Columns from MySQL

query = "SELECT RollNo, Name FROM STUDENT"
df = pd.read_sql(query, mycon)
print(df)

Explanation:

  • Only required columns are fetched
  • Reduces data transfer overhead

3.9.7 Applying Pandas Operations on Database Data

NCERT stresses that:

Once data is in a DataFrame, all Pandas operations can be applied.

Example:

print(df.describe())

or

high_scorers = df[df['Marks'] > 85]
print(high_scorers)

This shows the power of combining databases with Pandas.


3.9.8 Writing Data from DataFrame to MySQL Table

NCERT also explains the reverse operation:

Storing processed data back into MySQL.

Steps involved:

  1. Create a cursor
  2. Execute INSERT queries using DataFrame values

Example: Insert DataFrame rows into MySQL

cursor = mycon.cursor()

for i in range(len(df)):
    sql = "INSERT INTO STUDENT VALUES (%s, %s, %s)"
    values = (
        int(df.loc[i, 'RollNo']),
        df.loc[i, 'Name'],
        int(df.loc[i, 'Marks'])
    )
    cursor.execute(sql, values)

mycon.commit()

Explanation:

  • Each row of DataFrame is inserted into MySQL table
  • commit() saves the changes permanently

3.9.9 Exporting DataFrame to MySQL (Conceptual Note)

NCERT does not require students to use advanced tools like SQLAlchemy.

At CBSE level:

  • Understanding logic and flow is sufficient
  • Focus is on:

    • Reading from MySQL
    • Writing back using basic SQL commands

3.9.10 Closing the Database Connection

NCERT highlights good programming practice.

cursor.close()
mycon.close()

Explanation:

  • Frees system resources
  • Prevents connection leaks

3.9.11 Common Errors and Precautions (NCERT Awareness)

NCERT expects conceptual awareness of:

  • Wrong username/password → connection failure
  • Database/table not found → SQL error
  • Forgetting commit() → data not saved
  • Mismatch between table structure and DataFrame → insertion error

Students are not expected to debug deeply, only understand causes.


3.9.12 Key Learning Outcomes of Section 3.9

By the end of this section, NCERT expects students to:

  • Understand the need to connect Pandas with MySQL
  • Import required libraries for database connectivity
  • Establish connection with MySQL
  • Read database tables into a DataFrame using read_sql()
  • Perform Pandas operations on database data
  • Insert DataFrame data into MySQL tables
  • Close database connections properly

Chapter 03 Completed Fully

We have now fully and sequentially completed Chapter 03:

  • 3.1 Introduction
  • 3.2 Descriptive Statistics
  • 3.3 Data Aggregations
  • 3.4 Sorting a DataFrame
  • 3.5 GROUP BY Functions
  • 3.6 Altering the Index
  • 3.7 Other DataFrame Operations
  • 3.8 Handling Missing Values
  • 3.9 Pandas ↔ MySQL