framequery - SQL on dataframes

Pandas is a superb tool for data handling and the de facto standard for data science tasks in Python. Great as it is already, I always was a bit surprised by its lack of an SQL interface. In Spark, for example, you can easily reuse existing SQL statements on top of dataframes without having to resort to a database.

A couple of days ago, I sat down and started to implement an SQL interpreter for Pandas, called framequery. It relies heavily on the many SQL-like operations implemented by Pandas already, but adds a SQL parser and interpreter on top. While this package is similar in spirit to pandasql by yhat, its implementation is quite different. With framequery, data is never transferred between Pandas and a database, but always maintained as a dataframe. Most directly, this choice avoids conversions. In a more indirecty way, it also offers the possiblity of using framequery with alternative implementations of the dataframe API, such as dask.

To test the current state, install the package via pip install framequery and use the select function as in

import pandas as pd
import framequery as fq

sales = pd.read_csv("sales.csv")
products = pd.read_csv("products.csv")

yearly_sales = fq.select("""
    SELECT year, sum(sales) as sales

    FROM sales

    JOIN products
    ON sales.product_id = products.id

    GROUP BY year
""")

Currently, the implementation has a number of shortcomings and quite some features are still missing, but the core functionality is already usable. If you find this package useful, I would love your feedback. Similary, contributions are also very welcome.

For details, please check the GitHub page.