Converting pandasql Queries to Pure Pandas: A Step-by-Step Guide

Padmajeet Mhaske
2 min read4 days ago

--

Introduction

pandasql is a convenient tool that allows users to write SQL queries to manipulate pandas DataFrames. However, there are situations where you might want to convert these SQL queries into pure pandas code for better integration with the pandas ecosystem or to avoid the overhead of using an additional library. In this article, we will walk through the process of converting a pandasql query into equivalent pandas code, using a specific example.

The pandasql Query

Let’s consider the following pandasql query, which involves two common SQL operations: filtering and aggregation.

Example Data

Suppose we have a DataFrame sales_df with the following columns:

  • transaction_id: Unique identifier for each transaction
  • customer_id: Unique identifier for each customer
  • transaction_date: Date of the transaction
  • amount: Amount of the transaction
  • region: Region where the transaction took place

SQL Query

Here’s a pandasql query that calculates the total transaction amount for each customer in a specific region over the last 30 days, and filters customers with a total transaction amount greater than a certain threshold.

1WITH recent_transactions AS (
2 SELECT
3 customer_id,
4 SUM(amount) AS total_amount
5 FROM
6 sales_df
7 WHERE
8 region = 'North' AND
9 transaction_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
10 GROUP BY
11 customer_id
12),
13high_value_customers AS (
14 SELECT
15 customer_id
16 FROM
17 recent_transactions
18 WHERE
19 total_amount > 1000
20)
21SELECT * FROM high_value_customers

Converting to Pure Pandas

Step 1: Filter the DataFrame

First, filter the DataFrame sales_df based on the specified conditions.

1import pandas as pd
2from datetime import datetime, timedelta
3
4# Sample DataFrame
5sales_df = pd.DataFrame({
6 'transaction_id': [1, 2, 3, 4, 5],
7 'customer_id': ['C001', 'C002', 'C001', 'C003', 'C002'],
8 'transaction_date': [
9 datetime.now() - timedelta(days=5),
10 datetime.now() - timedelta(days=15),
11 datetime.now() - timedelta(days=25),
12 datetime.now() - timedelta(days=35),
13 datetime.now() - timedelta(days=10)
14 ],
15 'amount': [500, 700, 300, 400, 600],
16 'region': ['North', 'North', 'North', 'South', 'North']
17})
18
19# Filter for recent transactions in the 'North' region
20filtered_sales = sales_df[
21 (sales_df['region'] == 'North') &
22 (sales_df['transaction_date'] >= datetime.now() - timedelta(days=30))
23]

Step 2: Group and Aggregate

Perform the aggregation to calculate the total transaction amount for each customer.

1recent_transactions = (
2 filtered_sales
3 .groupby('customer_id')
4 .agg(total_amount=('amount', 'sum'))
5 .reset_index()
6)

Step 3: Filter High-Value Customers

Filter customers with a total transaction amount greater than the threshold.

1high_value_customers = recent_transactions[recent_transactions['total_amount'] > 1000]

Step 4: Select the Final Result

Select the customer_id column as the final result.

1result = high_value_customers[['customer_id']]

Conclusion

By following these steps, you can convert a pandasql query into pure pandas code while ensuring that your data remains safe and private. This approach allows you to leverage pandas' powerful data manipulation capabilities without exposing sensitive information.

--

--

Padmajeet Mhaske
Padmajeet Mhaske

Written by Padmajeet Mhaske

Padmajeet is a seasoned leader in artificial intelligence and machine learning, currently serving as the VP and AI/ML Application Architect at JPMorgan Chase.

No responses yet