Converting `pandasql` Queries to Pure Pandas: A Step-by-Step Guide

2 min read4 days ago

Introduction

pandasql is a convenient tool that allows users to write SQL queries to manipulate pandas DataFrames. However, there are situations where you might want to convert these SQL queries into pure pandas code for better integration with the pandas ecosystem or to avoid the overhead of using an additional library. In this article, we will walk through the process of converting a pandasql query into equivalent pandas code, using a specific example.

The `pandasql` Query

Let’s consider the following pandasql query, which involves two common SQL operations: filtering and aggregation.

Example Data

Suppose we have a DataFrame sales_df with the following columns:

transaction_id: Unique identifier for each transaction
customer_id: Unique identifier for each customer
transaction_date: Date of the transaction
amount: Amount of the transaction
region: Region where the transaction took place

SQL Query

Here’s a pandasql query that calculates the total transaction amount for each customer in a specific region over the last 30 days, and filters customers with a total transaction amount greater than a certain threshold.

1WITH recent_transactions AS (
2    SELECT
3        customer_id,
4        SUM(amount) AS total_amount
5    FROM
6        sales_df
7    WHERE
8        region = 'North' AND
9        transaction_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
10    GROUP BY
11        customer_id
12),
13high_value_customers AS (
14    SELECT
15        customer_id
16    FROM
17        recent_transactions
18    WHERE
19        total_amount > 1000
20)
21SELECT * FROM high_value_customers

Converting to Pure Pandas

Step 1: Filter the DataFrame

First, filter the DataFrame sales_df based on the specified conditions.

1import pandas as pd
2from datetime import datetime, timedelta
3
4# Sample DataFrame
5sales_df = pd.DataFrame({
6    'transaction_id': [1, 2, 3, 4, 5],
7    'customer_id': ['C001', 'C002', 'C001', 'C003', 'C002'],
8    'transaction_date': [
9        datetime.now() - timedelta(days=5),
10        datetime.now() - timedelta(days=15),
11        datetime.now() - timedelta(days=25),
12        datetime.now() - timedelta(days=35),
13        datetime.now() - timedelta(days=10)
14    ],
15    'amount': [500, 700, 300, 400, 600],
16    'region': ['North', 'North', 'North', 'South', 'North']
17})
18
19# Filter for recent transactions in the 'North' region
20filtered_sales = sales_df[
21    (sales_df['region'] == 'North') &
22    (sales_df['transaction_date'] >= datetime.now() - timedelta(days=30))
23]

Step 2: Group and Aggregate

Perform the aggregation to calculate the total transaction amount for each customer.

1recent_transactions = (
2    filtered_sales
3    .groupby('customer_id')
4    .agg(total_amount=('amount', 'sum'))
5    .reset_index()
6)

Step 3: Filter High-Value Customers

Filter customers with a total transaction amount greater than the threshold.

1high_value_customers = recent_transactions[recent_transactions['total_amount'] > 1000]

Step 4: Select the Final Result

Select the customer_id column as the final result.

1result = high_value_customers[['customer_id']]

Conclusion

By following these steps, you can convert a pandasql query into pure pandas code while ensuring that your data remains safe and private. This approach allows you to leverage pandas' powerful data manipulation capabilities without exposing sensitive information.

Converting `pandasql` Queries to Pure Pandas: A Step-by-Step Guide

Introduction

The `pandasql` Query

Example Data

SQL Query

Converting to Pure Pandas

Step 1: Filter the DataFrame

Step 2: Group and Aggregate

Step 3: Filter High-Value Customers

Step 4: Select the Final Result

Conclusion

Written by Padmajeet Mhaske

No responses yet

Converting pandasql Queries to Pure Pandas: A Step-by-Step Guide

Introduction

The pandasql Query

Example Data

SQL Query

Converting to Pure Pandas

Step 1: Filter the DataFrame

Step 2: Group and Aggregate

Step 3: Filter High-Value Customers

Step 4: Select the Final Result

Conclusion

Written by Padmajeet Mhaske

No responses yet

Converting `pandasql` Queries to Pure Pandas: A Step-by-Step Guide

The `pandasql` Query