Explore Pandas in Python to Analyze and Manipulate Tabular Data

By:   |   Updated: 2024-10-15   |   Comments (2)   |   Related: > Python


Problem

Have you heard about Pandas in Python? It is widely used open-source library for analyzing and manipulating data in the Python programming language. Let's explore it with use cases and examples.

Solution

Pandas is a revolutionary, open-source library for analyzing and manipulating tabular data in Python. Pandas is built on the NumPy library and is used for data cleaning, data manipulation, data analysis, data visualization, and time series data.

A few key features of Pandas library are:

  • Offers simpler syntax with rich functionality to deal with data structures in a straightforward manner valuable for data science and machine learning.
  • Can integrate with various data formats, such as Excel files, CSV files, and SQL databases, with multiple data sources increasing flexibility.
  • Provides rich support for data cleansing and transformations, such as finding invalid and missing values, duplicates in additionl to group by and sorting data logic.
  • Works with incomplete datasets with data handling and alignment with existing values.
  • Integrates with Python libraries for advanced calculations, such as MatPlotlib and Scikit-Learn.

Let's start exploring Pandas in Python with various use cases valuable for beginners and experienced developers.

Installing Pandas Module

Run the command below and it will install the latest version of Pandas and its dependencies.

pip install pandas

To verify the Pandas installation, import it and check the version using the code below. If it is not installed, an error will be returned.

import pandas as pd
print("Pandas version", pd.__version__)
verify Pandas installation

Pandas Series

This one-dimensional array can hold any data type, such as integer, float, or string. We can use the function pd.series to create the series().

mydata = ['Apple','Banana','Orange']
s = pd.Series(mydata)
print(s)
create a series

Pandas data frame default indexing starts with 0,1,2, and so on. You can create a Pandas series with a custom index as well.

mydata = ['Apple','Banana','Orange']
s = pd.Series(mydata,index=['Fruit1','Fruit2','Fruit3'])
print(s)
create series with custom index

Arithmetic Operations

You can perform the arithmetic operations and filter on the series.

First, let's create two series, s1 and s2, as shown below.

s1 = pd.Series([100, 200 , 300 , 400, 500])
s2 = pd.Series([5, 10, 15, 25, 50])

Addition

print(s1 + s2) 
addition

Subtraction

print(s1 - s2) 
subtraction

Multiplication

print(s1 * s2) 
multiplication

Division

print(s1 / s2) 
division

Similarly, the aggregations, such as sum, mean, median, and standard deviation, are performed in a series as follows.

Aggregation Code Output
Sum print(s1.sum()) 1500
Mean print(s1.mean()) 300.0
Median print(s1.median()) 300.0
Standard deviation print(s1.std()) 158.11388300841898
Min print(s1.min()) 100
Max print(s1.max()) 500

Indexing and Slicing

The Pandas series elements start with index position 0. You can access an element based on its indexing position.

Aggregation Code Output
First element print(s1[0]) 100
Second element print(s1[1]) 200

Pandas Dataframe

Organized into rows and columns, a dataframe consists of data that is:

  • Two-dimensional,
  • Mutable (size can be changed or modified after it is created) and
  • Heterogeneous (can hold different types of data in the same structure).

You can create a dataframe from a list, dictionary, or NumPy array.

For example, the code below shows a list of strings and uses the pd.dataframe() function to construct a Pandas dataframe.

# list of strings
list1 = ['Apple', 'Banana', 'Orange', 'Guava', 'Grapefruit','Papaya']
 
# Calling DataFrame constructor on list
df = pd.DataFrame(list1)
print(df)
dataframe

Similarly, the example below creates a dataframe from a list of lists. The pd.dataframe() also contains the column names for the mapping.

data = [
    ['Raj', 31, 'India'],
    ['John', 25, 'New Jersey'],
    ['Peter', 29, 'Chicago']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

You can create a dataframe from the dictionary, as shown in the following code.

emp = {'Name':['Raj', 'nick', 'John', 'Vansh'],
        'Age':[18, 21, 36, 29]}
 
# Create and print DataFrame
df = pd.DataFrame(emp)
print(df)
create dataframe from the dictionary

Let's explore a few useful functions to work with the Pandas dataframe.

df.head(n)

This function returns the first n number of rows from the dataframe. For example, specify df.head(1) to get the first row. If we do not specify a value, it returns the first five rows from the dataframe by default.

df.head(n)

Similarly, to get the first two rows, specify df.head(2).

df.head(2)

df.tail(n)

The next function returns the n number of rows from the bottom. For example, df.tail(n) returns the last two rows from the dataframe. It returns the last five rows if no value is specified in the tail() function.

• df.tail(n):

Get a Specific Row from the Dataframe

The df.head() and df.tail() returns rows from the top or bottom of the dataframe. However, suppose you need a specific row based on its index position. You can use the df.loc[index] function. For example, suppose we need to get the second row from the dataframe:

print(df.loc[1])
get a specific row from the dataframe

Boolean Indexing

Suppose we have a large employee dataframe, but we are interested in only those employees whose age is greater than 30 years. In this case, we can use the Boolean indexing shown below.

print(df[df['Age']>30])
Boolean indexing

df.describe()

This function returns helpful information about the data in the dataframe. For example, as seen below, it returns count, mean, standard deviation, minimum, 25th percentile,50th percentile,75th percentile, and maximum value.

df.describe()

df.info()

This is another function, and it returns data frame details, such as the data types, size, total number of columns, count of non-null values for each column, index range of the dataframe, showing the starting and ending index values, and memory usage.

df.info()

Selecting a Specific Column or Multiple Columns

We can specify the columns required in the output to fetch those columns instead of all the columns. For example, if your dataframe consists of 50 columns, and you need three columns in the output, you can specify those columns by name.

  • df['Name'] returns only the Name column from the data frame.
df['Name']
  • df[['Name','City']] returns the name and city columns from the data frame.
df[['Name','City']]

Add a New Column in the Existing Dataframe

Let's add a new column, department, in the existing dataframe. We can use the df.insert() function with the following arguments:

  • Column position
    • It is the index position in the dataframe where we want to add the new column. So, my dataframe consists of three columns, and to add the new column in the last position, I need to specify the index position as 3.
  • Column name
  • Values
df.insert(3,"department",['IT','Admin','Finance'])
print(df)
add a new column in the existing data frame

Alternatively, you can add a new column using the list below.

department=['IT','Admin','Finance']
df['department']=department
print(df)
add a new column in the existing data frame

Drop the Column

We can use the function df.drop('column') to drop one or multiple columns from the Pandas dataframe. Specify the parameter inplace=true to do the operation and return the modified data frame.

df.drop('department', axis=1, inplace=True)
print(df)
Drop the column

Rename a Column

To rename a dataframe column, use the function df.rename() like so:

df.rename(columns={'old_name': 'new_name'}, inplace=True)

Let's rename the column Name to FirstName.

df.rename(columns={'Name':'First_Name'},inplace=True)
print(df)
Rename a column

Modify an Existing Value

Suppose Peter is transferred to New York, and you need to update the value in the dataframe. To update the value, use the replace function as below.

df['City'] = df['City'].replace('Chicago', 'New York')
print(df)
Modify an existing value

Insert New Data in the Existing Dataframe

We can use the concat() function to insert a new row in the Pandas dataframe.

new_row = pd.DataFrame({'Name': ['David'], 'Age': [40], 'City': ['Houston']})

# Appending the new row to the DataFrame using pd.concat
df = pd.concat([df, new_row], ignore_index=True)
print(df)
Insert new data in the existing dataframe

We can also use the loc indexer to insert a new row.

# Adding a new row using loc
df.loc[len(df)] = ['Jen', 45, 'Sydney']
print(df)
use the loc indexer to insert a new row

Delete a Row

We can use df.drop() function to delete a row from the dataframe. Let's drop the row for Jen with index position 4.

df = df.drop(4) 
print(df)
Delete a row

Or use the following method to delete a row based on its value:

  • df['Name'] != 'Peter' creates a boolean series for all rows where the Name is not 'Peter'.
  • df[df['Name'] != 'Peter'] filters the dataframe to keep only those rows where the Boolean series is True.
df = df[df['Name'] != 'Peter']
print(df)

Suppose you have the below dataframe. Here, there are two Peters, one from Chicago and the other from Sydney.

example dataframe

To delete Chicago Peter's record, specify the condition with the Boolean operator. The relevant row is removed from the dataframe if the condition is satisfied.

df = df[~((df['Name'] == 'Peter') & (df['City'] == 'Chicago'))]
print(df)
The relevant row is removed from the dataframe if the condition is satisfied.
Next Steps
  • We will identify and explore more cases of Pandas dataframe use. Stay tuned for more Python tutorials.
  • Explore more Python tips on MSSQLTips.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Rajendra Gupta Rajendra Gupta is a Consultant DBA with 14+ years of extensive experience in database administration including large critical OLAP, OLTP, Reporting and SharePoint databases.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2024-10-15

Comments For This Article




Thursday, October 17, 2024 - 11:28:42 AM - rajendra Back To Top (92578)
Thanks @steve for the comment.

Thursday, October 17, 2024 - 5:15:02 AM - Steve Anderson Back To Top (92574)
Ha... I've spent the last 2 weeks on a udemy course and you've pretty much summarised 20 hours of training to a concise list. Nice and thanks.














get free sql tips
agree to terms