Partition By vs Group By

Compare the performance of polars.DataFrame.partition_by and polars.DataFrame.group_by when splitting a DataFrame into sub-DataFrames by partition keys.

test_partition_by_vs_group_by.py
# -*- coding: utf-8 -*-

import polars as pl
import numpy as np
from dbsnaplake.vendor.timer import DateTimeTimer

# n_row = 1_000
n_row = 1_000_000
# n_row = 100_000_000
df = pl.DataFrame(
    {
        # np.random.randint excludes the upper bound
        "year": np.random.randint(2001, 2010, n_row),  # 2001..2009
        "month": np.random.randint(1, 13, n_row),  # 1..12
        "value": np.random.randn(n_row),
    },
)
pkeys = ["year", "month"]

# Method 1: partition_by returns a list of sub-DataFrames;
# the key values have to be read back from each one.
with DateTimeTimer("Method 1") as timer:
    for sub_df in df.partition_by(by=pkeys):
        kvs = sub_df.select(pkeys).head(1).to_dicts()[0]
        sub_df = sub_df.drop(pkeys)
        # sub_df.count()
        # print(kvs, sub_df)

# Method 2: iterating over group_by yields (key values, sub-DataFrame) pairs.
with DateTimeTimer("Method 2") as timer:
    for pvalues, sub_df in df.group_by(pkeys):
        kvs = dict(zip(pkeys, pvalues))
        sub_df.count()
        # print(kvs, sub_df)
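
The DateTimeTimer used above is imported from dbsnaplake.vendor.timer. If that module is not available, a rough stand-in built on the standard library might look like the sketch below. The class name SimpleTimer and its elapsed attribute are hypothetical and are not part of dbsnaplake or polars; the real DateTimeTimer may behave differently.

```python
import time


class SimpleTimer:
    """Hypothetical stand-in for DateTimeTimer: times the body of a with-block."""

    def __init__(self, title: str):
        self.title = title
        self.elapsed = None  # seconds, set when the with-block exits

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.title}: {self.elapsed:.6f} s")
        return False  # do not swallow exceptions raised in the block


# Usage: same shape as the timed blocks in the test script.
with SimpleTimer("demo") as timer:
    total = sum(range(100_000))
```

time.perf_counter is a monotonic, high-resolution clock, which makes it a reasonable choice for short benchmarks like this one.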

Conclusion

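If you need the partition key values and do not run any further operation on each sub-DataFrame, group_by is faster. Iterating over group_by yields the key values together with each sub-DataFrame, whereas with partition_by they have to be read back from the sub-DataFrame.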

If you only want to split the DataFrame into sub-DataFrames and do not need the partition key values, use partition_by.