Chapter 03 - Data Handling with Pandas II
3.1 Introduction
3.1.1 Context from Chapter 02 (NCERT Continuity)
NCERT begins this chapter by explicitly linking it with Chapter 02: Data Handling with Pandas I.
In the previous chapter, you learnt:
- What Pandas is
How to work with:
- Series
- DataFrame
How to:
- Create DataFrames
- Access rows and columns
- Import data from CSV files
- Export data back to CSV
At that stage, the focus was mainly on:
- Creating and viewing data
- Basic manipulation
However, real-world data handling does not stop there.
3.1.2 Why Chapter 03 is Needed (NCERT Motivation)
NCERT now shifts the focus from basic data handling to data analysis.
In practical situations:
- Data is often very large
- Raw data by itself is not useful
We need to:
- Summarise data
- Identify patterns
- Compare groups
- Prepare reports
For example:
- What is the average score of students?
- Which city has the highest sales?
- How many records belong to each category?
- Which values are missing or incorrect?
Such questions require advanced DataFrame operations, which are introduced in this chapter.
3.1.3 Nature of Data Handled in This Chapter
NCERT clarifies that this chapter deals with:
- Structured data
- Data stored in rows and columns
Data that can be:
- Numeric
- Categorical
- Textual
The emphasis is on analysing data, not merely displaying it.
3.1.4 Major Concepts Introduced in This Chapter
NCERT outlines that this chapter will focus on the following capabilities of Pandas:
- Calculating descriptive statistics
- Performing data aggregation
Sorting data based on:
- Values
- Index
Grouping data using GROUP BY
Modifying the index
Performing additional DataFrame operations
Handling missing data
Exchanging data between:
- Pandas
- MySQL database
Each of these topics builds on the DataFrame knowledge gained earlier.
3.1.5 Importance of This Chapter in Data Analysis
NCERT implicitly highlights that:
This chapter introduces operations used in:
- Business analytics
- Data science
- Research
The techniques taught here are:
- Practical
- Frequently used
- Exam-relevant
Students are expected to:
- Apply logical thinking
- Interpret results, not just write code
- Understand the meaning of outputs
3.1.6 How This Chapter Differs from Chapter 02
| Chapter 02 | Chapter 03 |
|---|---|
| Creating data | Analysing data |
| Structure of DataFrame | Behaviour of DataFrame |
| Reading / writing data | Summarising / grouping data |
| Basic operations | Advanced operations |
This distinction is important for conceptual clarity.
3.1.7 Learning Outcome of Section 3.1
By the end of this introductory section, NCERT expects students to:
- Understand the scope of the chapter
- Recognise the shift from data creation to data analysis
- Appreciate the importance of Pandas in real-world data handling
- Be prepared for advanced operations on DataFrames
Section 3.1 Completed
We have now fully and sequentially covered:
- Why Chapter 03 exists
- How it builds on Chapter 02
- What types of problems it addresses
- What topics will be studied next
3.2 Descriptive Statistics
3.2.1 What are Descriptive Statistics? (NCERT Explanation)
NCERT introduces descriptive statistics as methods used to:
- Summarise large datasets
- Describe the central tendency
- Measure the spread or variability of data
Instead of examining every value individually, descriptive statistics provide:
- Single numerical values
- Quick understanding of data behaviour
This is essential when working with large DataFrames.
3.2.2 Sample Data Used for Explanation
NCERT explains these concepts using a DataFrame.
We will use a similar structure.
import pandas as pd
data = {
'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
'Maths': [85, 90, 78, 92, 88],
'Science': [80, 85, 75, 89, 84],
'English': [78, 82, 74, 88, 80]
}
df = pd.DataFrame(data)
print(df)
Output
Name Maths Science English
0 Amit 85 80 78
1 Neha 90 85 82
2 Rahul 78 75 74
3 Pooja 92 89 88
4 Karan 88 84 80
3.2.3 The describe() Function (Core NCERT Method)
NCERT introduces describe() as the most important descriptive statistics function.
print(df.describe())
Output
Maths Science English
count 5.000000 5.000000 5.000000
mean 86.600000 82.600000 80.400000
std 5.941380 5.941380 5.263079
min 78.000000 75.000000 74.000000
25% 85.000000 80.000000 78.000000
50% 88.000000 84.000000 80.000000
75% 90.000000 85.000000 82.000000
max 92.000000 89.000000 88.000000
3.2.4 Interpretation of describe() Output (Very Important)
NCERT expects students to understand, not memorise, these values.
count→ Number of non-null valuesmean→ Average valuestd→ Standard deviation (spread of data)min→ Minimum value25%→ First quartile50%→ Median75%→ Third quartilemax→ Maximum value
This output gives a complete statistical summary.
3.2.5 Calculating Mean Using mean()
Instead of using describe(), NCERT shows individual functions.
print(df['Maths'].mean())
Output:
86.6
3.2.6 Calculating Median Using median()
print(df['Science'].median())
Output:
84.0
Explanation:
- Median is the middle value after sorting
3.2.7 Calculating Mode Using mode()
NCERT explains that:
- Mode is the most frequently occurring value
A dataset can have:
- One mode
- Multiple modes
No mode
print(df['English'].mode())
Output:
0 78
1 82
2 74
3 88
4 80
dtype: int64
Explanation:
- All values appear only once
- Hence, Pandas returns all values
3.2.8 Calculating Minimum and Maximum Values
print(df['Maths'].min())
print(df['Maths'].max())
Output:
78
92
These indicate:
- Lowest and highest scores
3.2.9 Calculating Standard Deviation Using std()
print(df['Science'].std())
Output:
5.941380
Explanation:
- Measures dispersion
- Higher value → more spread
NCERT does not require mathematical formula, only interpretation.
3.2.10 Calculating Variance Using var()
print(df['English'].var())
Output:
27.7
Variance is the square of standard deviation.
3.2.11 Counting Values Using count()
print(df['Maths'].count())
Output:
5
Counts non-null values only.
3.2.12 Descriptive Statistics Across Rows
NCERT mentions that:
- Functions operate column-wise by default
To operate row-wise, use
axis=1print(df[['Maths', 'Science', 'English']].mean(axis=1))
Output:
0 81.0
1 85.7
2 75.7
3 89.7
4 84.0
dtype: float64
This gives:
- Average marks per student
3.2.13 Descriptive Statistics on Entire DataFrame
print(df.mean(numeric_only=True))
Output:
Maths 86.6
Science 82.6
English 80.4
dtype: float64
Explanation:
- Only numeric columns are considered
3.2.14 Handling Missing Values in Statistics (NCERT Concept)
If a column contains NaN:
- Descriptive functions ignore NaN by default
Example:
df.loc[2, 'Science'] = None
print(df['Science'].mean())
Explanation:
- Mean is calculated using remaining values
- Missing values do not cause errors
3.2.15 Key Learning Outcomes of Section 3.2
NCERT expects students to be able to:
- Define descriptive statistics
- Use
describe()to summarise data Calculate:
- Mean
- Median
- Mode
- Min, Max
- Standard deviation
- Variance
Interpret statistical results
Apply statistics row-wise and column-wise
Section 3.2 Completed
We have now fully and sequentially covered:
- Concept of descriptive statistics
describe()function- Individual statistical functions
- Interpretation of results
- NCERT-aligned examples with code
3.3 Data Aggregations
3.3.1 Meaning of Data Aggregation (NCERT Explanation)
NCERT introduces data aggregation as the process of:
- Combining multiple data values
- Producing a single summary value
Performing operations such as:
- Sum
- Average
- Minimum
- Maximum
- Count
In simple words:
Data aggregation means summarising data to extract meaningful information.
Aggregation is a core activity in data analysis, especially when dealing with large datasets.
3.3.2 Difference Between Descriptive Statistics and Aggregation
NCERT implicitly distinguishes the two:
Descriptive statistics → Describe data distribution (mean, median, std, etc.)
Data aggregation → Combine values to produce summaries (sum, count, min, max)
Both are related, but aggregation focuses more on combining data.
3.3.3 Sample DataFrame Used for Aggregation
We continue with a marks dataset, similar to NCERT examples.
import pandas as pd
data = {
'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
'Maths': [85, 90, 78, 92, 88],
'Science': [80, 85, 75, 89, 84],
'English': [78, 82, 74, 88, 80]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Maths Science English
0 Amit 85 80 78
1 Neha 90 85 82
2 Rahul 78 75 74
3 Pooja 92 89 88
4 Karan 88 84 80
3.3.4 Using sum() for Aggregation
Column-wise sum (default behaviour)
print(df[['Maths', 'Science', 'English']].sum())
Output:
Maths 433
Science 413
English 402
dtype: int64
Explanation:
- Values are added column-wise
- Default axis = 0 (rows)
Row-wise sum
print(df[['Maths', 'Science', 'English']].sum(axis=1))
Output:
0 243
1 257
2 227
3 269
4 252
dtype: int64
Explanation:
- Each student’s total marks are calculated
axis=1means across columns
3.3.5 Using mean() for Aggregation
print(df[['Maths', 'Science', 'English']].mean())
Output:
Maths 86.6
Science 82.6
English 80.4
dtype: float64
Explanation:
- Returns average marks per subject
3.3.6 Using min() and max()
print(df[['Maths', 'Science', 'English']].min())
print(df[['Maths', 'Science', 'English']].max())
Output:
Maths 78
Science 75
English 74
dtype: int64
Maths 92
Science 89
English 88
dtype: int64
Explanation:
Helps identify:
- Lowest score
- Highest score
3.3.7 Using count() in Aggregation
print(df[['Maths', 'Science', 'English']].count())
Output:
Maths 5
Science 5
English 5
dtype: int64
Explanation:
- Counts non-null values
- Useful when data may have missing values
3.3.8 Aggregation Using agg() Function (Very Important)
NCERT introduces agg() as a powerful aggregation method.
It allows:
- Applying multiple aggregation functions at once
Example: Multiple aggregations on columns
print(df[['Maths', 'Science', 'English']].agg(['sum', 'mean', 'min', 'max']))
Output:
Maths Science English
sum 433 413 402
mean 86.6 82.6 80.4
min 78 75 74
max 92 89 88
Explanation
- Rows represent aggregation functions
- Columns represent subjects
- This gives a compact statistical summary
3.3.9 Different Aggregations for Different Columns
NCERT also highlights that:
Different columns may need different aggregations
print(df.agg({ 'Maths': ['mean', 'max'], 'Science': ['min', 'mean'], 'English': ['sum'] }))
Output:
Maths Science English
mean 86.6 82.6 NaN
max 92.0 NaN NaN
min NaN 75.0 NaN
sum NaN NaN 402.0
Explanation:
- Each column has its own aggregation logic
- Missing combinations appear as
NaN
3.3.10 Aggregation on Entire DataFrame
print(df.agg(['sum', 'mean'], numeric_only=True))
Explanation:
- Only numeric columns are aggregated
- Text columns are ignored
3.3.11 Aggregation with Missing Values (NCERT Concept)
NCERT notes that:
- Aggregation functions ignore NaN by default
Example:
df.loc[1, 'Science'] = None
print(df['Science'].mean())
Explanation:
- Mean is computed using remaining values
- No error is raised
3.3.12 Key Differences: describe() vs agg()
| describe() | agg() |
|---|---|
| Fixed set of statistics | Customisable |
| Automatic summary | User-defined |
| Quick overview | Detailed control |
NCERT expects students to know when to use each.
3.3.13 Key Learning Outcomes of Section 3.3
By the end of this section, NCERT expects students to:
- Understand data aggregation
Use aggregation functions:
- sum, mean, min, max, count
Apply aggregation row-wise and column-wise
Use
agg()for:- Multiple aggregations
- Column-specific aggregations
Handle missing values during aggregation
Section 3.3 Completed
We have now fully and sequentially covered:
- Concept of data aggregation
- Built-in aggregation functions
- Axis-based aggregation
agg()function with examples- NCERT-aligned interpretation
3.4 Sorting a DataFrame
3.4.1 Meaning of Sorting (NCERT Explanation)
NCERT introduces sorting as the process of:
- Arranging data in a specific order
Based on:
- Values in one or more columns, or
- Index labels
Sorting helps to:
- Analyse trends
- Identify highest or lowest values
- Present data in a meaningful order
Pandas provides built-in functions to sort a DataFrame efficiently.
3.4.2 Sample DataFrame Used for Sorting
We will continue with a familiar dataset.
import pandas as pd
data = {
'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
'Maths': [85, 90, 78, 92, 88],
'Science': [80, 85, 75, 89, 84],
'English': [78, 82, 74, 88, 80]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Maths Science English
0 Amit 85 80 78
1 Neha 90 85 82
2 Rahul 78 75 74
3 Pooja 92 89 88
4 Karan 88 84 80
3.4.3 Sorting a DataFrame by Index
NCERT first introduces sorting by index.
Using sort_index()
df_sorted_index = df.sort_index()
print(df_sorted_index)
Explanation:
- Default sorting is ascending
- Since the index is already in order (0,1,2,3,4), output remains unchanged
3.4.4 Sorting Index in Descending Order
df_sorted_index_desc = df.sort_index(ascending=False)
print(df_sorted_index_desc)
Output:
Name Maths Science English
4 Karan 88 84 80
3 Pooja 92 89 88
2 Rahul 78 75 74
1 Neha 90 85 82
0 Amit 85 80 78
Explanation:
- Rows are rearranged based on index labels
- Highest index appears first
3.4.5 Sorting a DataFrame by Column Values
NCERT now introduces sorting by column values, which is more common.
Using sort_values()
df_sorted_marks = df.sort_values(by='Maths')
print(df_sorted_marks)
Output:
Name Maths Science English
2 Rahul 78 75 74
0 Amit 85 80 78
4 Karan 88 84 80
1 Neha 90 85 82
3 Pooja 92 89 88
Explanation:
- Rows are sorted based on Maths marks
- Default order is ascending
3.4.6 Sorting by Column in Descending Order
df_sorted_desc = df.sort_values(by='Maths', ascending=False)
print(df_sorted_desc)
Output:
Name Maths Science English
3 Pooja 92 89 88
1 Neha 90 85 82
4 Karan 88 84 80
0 Amit 85 80 78
2 Rahul 78 75 74
Explanation:
Useful for:
- Ranking
- Finding top performers
3.4.7 Sorting by Multiple Columns
NCERT explains that sometimes sorting by one column is not sufficient.
Example scenario:
- Sort by Maths marks
If Maths marks are equal, sort by Science marks
df_sorted_multi = df.sort_values(by=['Maths', 'Science']) print(df_sorted_multi)
Explanation:
- First sorting key: Maths
- Second sorting key: Science
3.4.8 Sorting with Different Orders for Different Columns
df_sorted_multi_order = df.sort_values(
by=['Maths', 'Science'],
ascending=[False, True]
)
print(df_sorted_multi_order)
Explanation:
- Maths sorted in descending order
- Science sorted in ascending order
This gives fine control over sorting behaviour.
3.4.9 Sorting Based on String Columns
NCERT clarifies that:
Sorting also works on text data
df_sorted_name = df.sort_values(by='Name') print(df_sorted_name)
Output:
Name Maths Science English
0 Amit 85 80 78
4 Karan 88 84 80
1 Neha 90 85 82
3 Pooja 92 89 88
2 Rahul 78 75 74
Explanation:
- Sorting is done in alphabetical order
3.4.10 In-place Sorting (Modifying Original DataFrame)
NCERT introduces inplace=True.
df.sort_values(by='Maths', inplace=True)
print(df)
Explanation:
- Original DataFrame is modified
- No new DataFrame is created
- Use carefully, as changes are permanent
3.4.11 Sorting with Missing Values (NCERT Concept)
If a column contains NaN:
df.loc[2, 'Maths'] = None
print(df.sort_values(by='Maths'))
Explanation:
- By default,
NaNvalues appear at the end - Pandas handles missing values automatically
3.4.12 Difference Between sort_index() and sort_values()
| sort_index() | sort_values() |
|---|---|
| Sorts by index labels | Sorts by column values |
| Index-based | Data-based |
| Used for row order | Used for ranking and analysis |
NCERT expects students to clearly distinguish these two.
3.4.13 Key Learning Outcomes of Section 3.4
By the end of this section, NCERT expects students to:
Sort DataFrames by:
- Index
- Column values
Sort in:
- Ascending order
- Descending order
Sort by:
- Single column
- Multiple columns
Use
inplaceoption correctlyUnderstand how missing values affect sorting
Section 3.4 Completed
We have now fully and sequentially covered:
- Concept of sorting
- Sorting by index
- Sorting by values
- Multiple-column sorting
- Ascending/descending order
- NCERT-aligned examples
3.5 GROUP BY Functions
3.5.1 Why GROUP BY is Required in Pandas (NCERT Motivation)
NCERT introduces GROUP BY in Pandas for the same reason it exists in SQL.
So far, we have:
- Calculated statistics on entire columns
- Aggregated data without categories
However, real-world questions are usually group-based, such as:
- Average marks class-wise
- Total sales city-wise
- Number of students stream-wise
To answer such questions, data must be:
- Divided into groups
- Aggregated within each group
This is exactly what GROUP BY functions do.
3.5.2 Concept of “Split–Apply–Combine” (NCERT Explanation)
NCERT explains GROUP BY using the idea of:
Split → Apply → Combine
Split Data is divided into groups based on a column
Apply An aggregation function is applied to each group
Combine Results are combined into a new DataFrame or Series
Understanding this concept is more important than memorising syntax.
3.5.3 Sample DataFrame Used for GROUP BY
We now use a slightly richer dataset, similar to NCERT’s examples.
import pandas as pd
data = {
'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan', 'Riya'],
'Class': ['XII', 'XII', 'XI', 'XI', 'XII', 'XI'],
'Subject': ['Maths', 'Maths', 'Maths', 'Science', 'Science', 'Science'],
'Marks': [85, 90, 78, 88, 92, 80]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Class Subject Marks
0 Amit XII Maths 85
1 Neha XII Maths 90
2 Rahul XI Maths 78
3 Pooja XI Science 88
4 Karan XII Science 92
5 Riya XI Science 80
3.5.4 Creating a GROUP BY Object
NCERT shows that groupby() does not immediately compute results.
grp = df.groupby('Class')
print(grp)
Explanation:
- This creates a GroupBy object
- No output is displayed yet
- Actual computation happens only when an aggregation function is applied
3.5.5 GROUP BY with mean()
Example: Average marks class-wise
print(df.groupby('Class')['Marks'].mean())
Output:
Class
XI 82.0
XII 89.0
Name: Marks, dtype: float64
Explanation
- Data is split into two groups: XI and XII
- Mean is calculated separately for each group
- Result is a Series indexed by Class
This is conceptually identical to SQL GROUP BY Class.
3.5.6 GROUP BY with sum()
Example: Total marks class-wise
print(df.groupby('Class')['Marks'].sum())
Output:
Class
XI 246
XII 267
Name: Marks, dtype: int64
3.5.7 GROUP BY with count()
Example: Number of students in each class
print(df.groupby('Class')['Name'].count())
Output:
Class
XI 3
XII 3
Name: Name, dtype: int64
Explanation:
- Counts non-null values
- Equivalent to SQL
COUNT(*)
3.5.8 GROUP BY on Multiple Columns
NCERT highlights that:
- Grouping can be done on more than one column
Example: Average marks class-wise and subject-wise
print(df.groupby(['Class', 'Subject'])['Marks'].mean())
Output:
Class Subject
XI Maths 78.0
Science 84.0
XII Maths 87.5
Science 92.0
Name: Marks, dtype: float64
Explanation
- Data is first grouped by Class
- Then sub-grouped by Subject
- Result uses a MultiIndex
This is very important conceptually.
3.5.9 Using agg() with GROUP BY
NCERT shows that multiple aggregations can be applied to groups.
Example: Multiple aggregations on Marks
print(df.groupby('Class')['Marks'].agg(['mean', 'max', 'min']))
Output:
mean max min
Class
XI 82.0 88 78
XII 89.0 92 85
Explanation
- Each group produces multiple summary values
- Output is a DataFrame
3.5.10 GROUP BY on Entire DataFrame
NCERT also allows aggregation without selecting a specific column.
print(df.groupby('Class').mean(numeric_only=True))
Explanation:
- Aggregation applied to all numeric columns
- Non-numeric columns are ignored
3.5.11 Iterating Through Groups (Conceptual Awareness)
NCERT briefly introduces the idea that groups can be iterated.
for name, group in df.groupby('Class'):
print("Group:", name)
print(group)
Explanation:
name→ group labelgroup→ DataFrame for that group
This helps in understanding how grouping works internally.
3.5.12 Difference Between Pandas GROUP BY and SQL GROUP BY
NCERT expects students to relate this with SQL knowledge.
| Pandas GROUP BY | SQL GROUP BY |
|---|---|
| groupby() function | GROUP BY clause |
| Works on DataFrame | Works on table |
| Split–Apply–Combine | Group–Aggregate |
| Supports multiple aggregations easily | More rigid syntax |
3.5.13 Key Learning Outcomes of Section 3.5
By the end of this section, NCERT expects students to:
- Understand the concept of grouping
- Use
groupby()on one or more columns - Apply aggregation functions to groups
- Use
agg()with GROUP BY - Interpret grouped results
- Relate Pandas GROUP BY to SQL GROUP BY
Section 3.5 Completed
We have now fully and sequentially covered:
- GROUP BY concept
- Split–Apply–Combine
- Single-column grouping
- Multi-column grouping
- Aggregations with GROUP BY
- NCERT-aligned examples
3.6 Altering the Index
3.6.1 Meaning of Index in a DataFrame (NCERT Context)
NCERT reminds students that:
- Every Pandas DataFrame has an index
The index is used to:
- Label rows
- Identify data
- Access records efficiently
By default:
- Pandas assigns an integer index starting from 0
However, in real datasets:
- A meaningful column (like RollNo, EmpID, Date) is often better suited as an index
This section explains how to modify the index of a DataFrame.
3.6.2 Sample DataFrame Used for Index Operations
import pandas as pd
data = {
'RollNo': [101, 102, 103, 104],
'Name': ['Amit', 'Neha', 'Rahul', 'Pooja'],
'Marks': [85, 90, 78, 92]
}
df = pd.DataFrame(data)
print(df)
Output:
RollNo Name Marks
0 101 Amit 85
1 102 Neha 90
2 103 Rahul 78
3 104 Pooja 92
Here:
- Index is 0, 1, 2, 3
- RollNo is just a normal column
3.6.3 Setting a Column as Index using set_index()
NCERT introduces set_index() to make a column the index.
Example: Set RollNo as index
df1 = df.set_index('RollNo')
print(df1)
Output:
Name Marks
RollNo
101 Amit 85
102 Neha 90
103 Rahul 78
104 Pooja 92
Explanation
RollNocolumn is removed from data- It becomes the row index
- Index values are now meaningful identifiers
This is very common in real datasets.
3.6.4 Setting Index Without Losing the Column
NCERT explains that sometimes:
We want to keep the column as well
df2 = df.set_index('RollNo', drop=False) print(df2)
Explanation:
drop=Falsekeeps the columnRollNo exists both as:
- Column
- Index
3.6.5 Resetting the Index using reset_index()
NCERT now explains how to restore the default index.
df_reset = df1.reset_index()
print(df_reset)
Output:
RollNo Name Marks
0 101 Amit 85
1 102 Neha 90
2 103 Rahul 78
3 104 Pooja 92
Explanation
- Index is converted back into a column
- Default integer index is restored
3.6.6 Resetting Index and Dropping Old Index
If we do not want the old index as a column:
df_reset_drop = df1.reset_index(drop=True)
print(df_reset_drop)
Explanation:
- Old index values are discarded
- Clean new index is created
3.6.7 Renaming Index Labels
NCERT shows that index labels can be renamed.
Example: Rename index values
df1.index = ['S1', 'S2', 'S3', 'S4']
print(df1)
Output:
Name Marks
S1 Amit 85
S2 Neha 90
S3 Rahul 78
S4 Pooja 92
Explanation
- Index values are directly reassigned
- Length must match number of rows
3.6.8 Naming the Index (Index Name Attribute)
NCERT introduces index.name.
df1.index.name = 'StudentID'
print(df1)
Output shows:
Name Marks
StudentID
S1 Amit 85
S2 Neha 90
S3 Rahul 78
S4 Pooja 92
Explanation:
- Index itself now has a name
- Useful when exporting or displaying data
3.6.9 Changing Column Names (Related Operation)
Although not strictly index values, NCERT connects this logically.
df1.columns = ['StudentName', 'TotalMarks']
print(df1)
Explanation:
- Column labels can also be modified
- Improves readability and clarity
3.6.10 Setting Multiple Columns as Index (Conceptual Awareness)
NCERT briefly mentions that:
More than one column can be used as index
df_multi = df.set_index(['RollNo', 'Name']) print(df_multi)
Explanation:
- Creates a MultiIndex
- Useful for hierarchical data
- Detailed usage is beyond this level, but concept is important
3.6.11 Why Index Alteration is Important (NCERT Intent)
NCERT wants students to understand that:
Index controls:
- How data is accessed
- How data is aligned
- How data is grouped
Proper indexing:
- Improves readability
- Makes analysis easier
- Helps in merging and grouping
3.6.12 Key Learning Outcomes of Section 3.6
By the end of this section, NCERT expects students to:
- Understand the role of index in a DataFrame
- Use
set_index()to change index - Use
reset_index()to restore default index - Rename index labels
- Name the index
- Understand the idea of multi-level index
3.7 Other DataFrame Operations
3.7.1 Purpose of This Section (NCERT Context)
NCERT introduces this section to familiarise students with day-to-day operations performed on a DataFrame, such as:
- Adding rows and columns
- Deleting rows and columns
- Applying functions on data
- Selecting data based on conditions
These operations are essential for:
- Cleaning data
- Preparing data for analysis
- Modifying existing datasets
3.7.2 Sample DataFrame Used
We continue with a student dataset.
import pandas as pd
data = {
'RollNo': [101, 102, 103, 104],
'Name': ['Amit', 'Neha', 'Rahul', 'Pooja'],
'Marks': [85, 90, 78, 92]
}
df = pd.DataFrame(data)
print(df)
Output:
RollNo Name Marks
0 101 Amit 85
1 102 Neha 90
2 103 Rahul 78
3 104 Pooja 92
3.7.3 Adding a New Column to a DataFrame
NCERT explains that a new column can be added by simple assignment.
Example: Add Grade column
df['Grade'] = ['B', 'A', 'C', 'A']
print(df)
Output:
RollNo Name Marks Grade
0 101 Amit 85 B
1 102 Neha 90 A
2 103 Rahul 78 C
3 104 Pooja 92 A
Explanation:
- Column is added to the right
- Length of data must match number of rows
3.7.4 Adding a Column Using Computation
df['MarksPlus5'] = df['Marks'] + 5
print(df)
Explanation:
- Vectorised operation
- No loop required
3.7.5 Deleting a Column using drop()
NCERT introduces drop() for removal.
df1 = df.drop('MarksPlus5', axis=1)
print(df1)
Explanation:
axis=1→ column- Original DataFrame remains unchanged unless
inplace=Trueis used
3.7.6 Deleting a Column Permanently
df.drop('Grade', axis=1, inplace=True)
print(df)
Explanation:
- Column removed permanently from
df
3.7.7 Adding a New Row to a DataFrame
NCERT introduces row addition using loc.
Example: Add a new student
df.loc[4] = [105, 'Karan', 88]
print(df)
Output:
RollNo Name Marks
0 101 Amit 85
1 102 Neha 90
2 103 Rahul 78
3 104 Pooja 92
4 105 Karan 88
Explanation:
- Index
4did not exist earlier - Pandas adds a new row
3.7.8 Deleting Rows from a DataFrame
Delete row by index
df2 = df.drop(2)
print(df2)
Explanation:
- Row with index
2(Rahul) is removed - Original DataFrame is unchanged
3.7.9 Deleting Multiple Rows
df3 = df.drop([0, 3])
print(df3)
Explanation:
- Rows with index 0 and 3 are removed
3.7.10 Selecting Data Based on Condition (Filtering)
NCERT introduces conditional selection.
Example: Students scoring more than 85
high_scorers = df[df['Marks'] > 85]
print(high_scorers)
Output:
RollNo Name Marks
1 102 Neha 90
3 104 Pooja 92
4 105 Karan 88
Explanation:
- Boolean condition inside square brackets
- Very common in data analysis
3.7.11 Selecting Specific Columns with Condition
print(df.loc[df['Marks'] > 85, ['Name', 'Marks']])
Explanation:
- Combines row filtering with column selection
3.7.12 Using apply() on a DataFrame Column
NCERT introduces apply() to apply a function to each element.
Example: Assign Pass/Fail status
def result(m):
if m >= 40:
return 'Pass'
else:
return 'Fail'
df['Result'] = df['Marks'].apply(result)
print(df)
Explanation:
- Function applied to each value in
Marks - New column created based on logic
3.7.13 Using Lambda Function with apply()
df['Status'] = df['Marks'].apply(lambda x: 'Distinction' if x >= 90 else 'Normal')
print(df)
Explanation:
- Lambda functions are short, anonymous functions
- Commonly used with
apply()
3.7.14 Replacing Values in a DataFrame
NCERT shows the use of replace().
df['Status'] = df['Status'].replace('Normal', 'Regular')
print(df)
Explanation:
- Useful for cleaning categorical data
3.7.15 Checking DataFrame Information
print(df.info())
Explanation:
Displays:
- Column names
- Data types
- Non-null counts
Very useful for quick inspection
3.7.16 Checking for Duplicate Rows (Conceptual)
NCERT briefly mentions duplicate detection.
print(df.duplicated())
Explanation:
- Returns True/False for each row
- Duplicate handling is part of data cleaning
3.7.17 Key Learning Outcomes of Section 3.7
By the end of this section, NCERT expects students to:
- Add and delete rows and columns
- Modify data using operations
- Filter DataFrame based on conditions
- Use
apply()and lambda functions - Replace values
- Inspect DataFrame structure
3.8 Handling Missing Values
3.8.1 What are Missing Values? (NCERT Explanation)
NCERT explains that missing values occur when:
- Data is not available for some records
- Data is lost during collection or transfer
- Certain fields are left blank intentionally or unintentionally
In datasets, missing values are common and cannot be ignored, because they can:
- Affect calculations
- Produce incorrect analysis
- Lead to wrong conclusions
3.8.2 Representation of Missing Values in Pandas
NCERT introduces the special value:
NaN
NaNstands for Not a Number- Pandas uses
NaNto represent missing data It applies to both:
- Numeric data
- Non-numeric data
3.8.3 Sample DataFrame with Missing Values
We use a dataset similar to NCERT examples.
import pandas as pd
import numpy as np
data = {
'Name': ['Amit', 'Neha', 'Rahul', 'Pooja', 'Karan'],
'Maths': [85, 90, np.nan, 92, 88],
'Science': [80, np.nan, 75, 89, 84],
'English': [78, 82, 74, np.nan, 80]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Maths Science English
0 Amit 85.0 80.0 78.0
1 Neha 90.0 NaN 82.0
2 Rahul NaN 75.0 74.0
3 Pooja 92.0 89.0 NaN
4 Karan 88.0 84.0 80.0
3.8.4 Identifying Missing Values using isnull()
NCERT introduces isnull() to detect missing values.
print(df.isnull())
Output:
Name Maths Science English
0 False False False False
1 False False True False
2 False True False False
3 False False False True
4 False False False False
Explanation:
True→ missing valueFalse→ valid value
3.8.5 Using notnull()
The opposite of isnull() is notnull().
print(df.notnull())
Explanation:
True→ value existsFalse→ missing value
3.8.6 Counting Missing Values
NCERT explains that we can count missing values column-wise.
print(df.isnull().sum())
Output:
Name 0
Maths 1
Science 1
English 1
dtype: int64
Explanation:
- Helps identify which columns have missing data
- Useful before cleaning data
3.8.7 Dropping Rows with Missing Values using dropna()
NCERT introduces dropna() to remove records containing missing values.
Drop rows with any missing value
df_drop = df.dropna()
print(df_drop)
Output:
Name Maths Science English
0 Amit 85.0 80.0 78.0
4 Karan 88.0 84.0 80.0
Explanation:
- Rows containing even one NaN are removed
- Use with caution, as data loss can occur
3.8.8 Dropping Columns with Missing Values
df_drop_col = df.dropna(axis=1)
print(df_drop_col)
Explanation:
axis=1→ columns- Columns with any missing value are removed
3.8.9 Dropping Rows Based on Condition
NCERT mentions selective dropping.
df_drop_cond = df.dropna(thresh=3)
print(df_drop_cond)
Explanation:
thresh=3means:- Keep rows with at least 3 non-null values
3.8.10 Filling Missing Values using fillna()
Instead of deleting data, NCERT recommends filling missing values.
Fill missing values with a constant
df_filled = df.fillna(0)
print(df_filled)
Explanation:
- All missing values replaced with 0
- Simple but may not be meaningful in all cases
3.8.11 Filling Missing Values with Column Mean
NCERT highlights this as a common practice.
df['Maths'].fillna(df['Maths'].mean(), inplace=True)
df['Science'].fillna(df['Science'].mean(), inplace=True)
df['English'].fillna(df['English'].mean(), inplace=True)
print(df)
Explanation:
- Missing values replaced with average of column
- Preserves overall data trend
3.8.12 Forward Fill and Backward Fill
NCERT introduces propagation-based filling.
Forward Fill (ffill)
df_ffill = df.fillna(method='ffill')
print(df_ffill)
Explanation:
- Missing value filled using previous valid value
Backward Fill (bfill)
df_bfill = df.fillna(method='bfill')
print(df_bfill)
Explanation:
- Missing value filled using next valid value
3.8.13 Replacing Missing Values using replace()
NCERT mentions that replace() can also be used.
df_replace = df.replace(np.nan, 0)
print(df_replace)
Explanation:
- Replaces
NaNwith specified value - Similar effect to
fillna()
3.8.14 Effect of Missing Values on Calculations
NCERT stresses an important point:
Pandas ignores NaN values by default in:
- sum()
- mean()
- min()
- max()
Example:
print(df['Maths'].mean())
Explanation:
- Mean is calculated using available values only
- Missing values do not cause errors
3.8.15 Choosing the Right Method (NCERT Guidance)
NCERT expects students to understand that:
- Dropping data may cause loss of information
- Filling data must be logically justified
- There is no single correct method
Choice depends on:
- Nature of data
- Purpose of analysis
3.8.16 Key Learning Outcomes of Section 3.8
By the end of this section, NCERT expects students to:
- Identify missing values using
isnull()andnotnull() - Count missing values
- Drop rows or columns with missing values
Fill missing values using:
- Constant values
- Mean
- Forward fill
- Backward fill
Understand the impact of missing values on analysis
3.9 Import and Export of Data between Pandas and MySQL
3.9.1 Why Connect Pandas with MySQL? (NCERT Context)
NCERT introduces this section to show how data analysis tools (Pandas) work with databases (MySQL).
So far, you have worked with:
- Data entered manually
- Data stored in CSV files
However, in real-world applications:
- Data is often stored in relational databases
- MySQL is one of the most widely used database systems
Data analysts frequently need to:
- Read data from database tables
- Analyse it using Pandas
- Store processed data back into the database
This section demonstrates basic interaction between Pandas and MySQL.
3.9.2 Tools Required for Pandas–MySQL Connectivity
NCERT explains that to connect Pandas with MySQL, we need:
- MySQL Server
- MySQL Connector for Python
- Pandas library
The Python module used is:
mysql.connector
This module allows Python programs to:
- Establish connection with MySQL
- Execute SQL queries
- Fetch and insert data
3.9.3 Importing Required Libraries
Before connecting to MySQL, required libraries must be imported.
import pandas as pd
import mysql.connector
Explanation:
pandas→ data handling and analysismysql.connector→ database connectivity
3.9.4 Establishing Connection with MySQL
NCERT shows how to create a connection object.
mycon = mysql.connector.connect(
host="localhost",
user="root",
password="password",
database="school"
)
Explanation:
host→ location of MySQL serveruser→ MySQL usernamepassword→ MySQL passworddatabase→ database to be used
If the connection is successful:
myconbecomes an active connection object
3.9.5 Reading Data from MySQL into a DataFrame
NCERT demonstrates how to:
- Fetch records from a MySQL table
- Load them into a Pandas DataFrame
Example: Read STUDENT table from MySQL
Assume MySQL table:
STUDENT
------------------------
RollNo | Name | Marks
query = "SELECT * FROM STUDENT"
df = pd.read_sql(query, mycon)
print(df)
Explanation:
read_sql()executes the SQL query- Result is directly converted into a DataFrame
- Column names are preserved
This is the most important function in this section.
3.9.6 Reading Selected Columns from MySQL
query = "SELECT RollNo, Name FROM STUDENT"
df = pd.read_sql(query, mycon)
print(df)
Explanation:
- Only required columns are fetched
- Reduces data transfer overhead
3.9.7 Applying Pandas Operations on Database Data
NCERT stresses that:
Once data is in a DataFrame, all Pandas operations can be applied.
Example:
print(df.describe())
or
high_scorers = df[df['Marks'] > 85]
print(high_scorers)
This shows the power of combining databases with Pandas.
3.9.8 Writing Data from DataFrame to MySQL Table
NCERT also explains the reverse operation:
Storing processed data back into MySQL.
Steps involved:
- Create a cursor
- Execute INSERT queries using DataFrame values
Example: Insert DataFrame rows into MySQL
cursor = mycon.cursor()
for i in range(len(df)):
sql = "INSERT INTO STUDENT VALUES (%s, %s, %s)"
values = (
int(df.loc[i, 'RollNo']),
df.loc[i, 'Name'],
int(df.loc[i, 'Marks'])
)
cursor.execute(sql, values)
mycon.commit()
Explanation:
- Each row of DataFrame is inserted into MySQL table
commit()saves the changes permanently
3.9.9 Exporting DataFrame to MySQL (Conceptual Note)
NCERT does not require students to use advanced tools like SQLAlchemy.
At CBSE level:
- Understanding logic and flow is sufficient
Focus is on:
- Reading from MySQL
- Writing back using basic SQL commands
3.9.10 Closing the Database Connection
NCERT highlights good programming practice.
cursor.close()
mycon.close()
Explanation:
- Frees system resources
- Prevents connection leaks
3.9.11 Common Errors and Precautions (NCERT Awareness)
NCERT expects conceptual awareness of:
- Wrong username/password → connection failure
- Database/table not found → SQL error
- Forgetting
commit()→ data not saved - Mismatch between table structure and DataFrame → insertion error
Students are not expected to debug deeply, only understand causes.
3.9.12 Key Learning Outcomes of Section 3.9
By the end of this section, NCERT expects students to:
- Understand the need to connect Pandas with MySQL
- Import required libraries for database connectivity
- Establish connection with MySQL
- Read database tables into a DataFrame using
read_sql() - Perform Pandas operations on database data
- Insert DataFrame data into MySQL tables
- Close database connections properly
Chapter 03 Completed Fully
We have now fully and sequentially completed Chapter 03:
- 3.1 Introduction
- 3.2 Descriptive Statistics
- 3.3 Data Aggregations
- 3.4 Sorting a DataFrame
- 3.5 GROUP BY Functions
- 3.6 Altering the Index
- 3.7 Other DataFrame Operations
- 3.8 Handling Missing Values
- 3.9 Pandas ↔ MySQL