Partition By vs Group By
Compare the performance of polars.DataFrame.partition_by and polars.DataFrame.group_by when splitting a DataFrame into one sub DataFrame per combination of partition key values.
test_partition_by_vs_group_by.py
# -*- coding: utf-8 -*-

import polars as pl
import numpy as np
from dbsnaplake.vendor.timer import DateTimeTimer

# n_row = 1_000
n_row = 1_000_000
# n_row = 100_000_000
df = pl.DataFrame(
    {
        # np.random.randint excludes the upper bound
        "year": np.random.randint(2001, 2010, n_row),  # 2001..2009
        "month": np.random.randint(1, 13, n_row),  # 1..12
        "value": np.random.randn(n_row),
    },
)
pkeys = ["year", "month"]

# Method 1: partition_by returns a list of sub DataFrames; the key values
# have to be read back from the first row of each one.
with DateTimeTimer("Method 1") as timer:
    for sub_df in df.partition_by(by=pkeys):
        kvs = sub_df.select(pkeys).head(1).to_dicts()[0]
        sub_df = sub_df.drop(pkeys)
        # sub_df.count()
        # print(kvs, sub_df)

# Method 2: iterating over group_by yields (key values, sub DataFrame) pairs.
with DateTimeTimer("Method 2") as timer:
    for pvalues, sub_df in df.group_by(pkeys):
        kvs = dict(zip(pkeys, pvalues))
        sub_df.count()
        # print(kvs, sub_df)
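DateTimeTimer comes from a vendored module inside dbsnaplake, so the script above will not run without that package. If it is unavailable, a small context manager built on the standard library can take its place. The sketch below is a hypothetical stand-in (the name SimpleTimer and its output format are not part of dbsnaplake), assuming the original only times the with-block and reports the elapsed time.

```python
import time


class SimpleTimer:
    """Hypothetical stand-in for DateTimeTimer: times a with-block."""

    def __init__(self, title: str):
        self.title = title
        self.elapsed = None  # seconds, set when the block exits

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.title}: {self.elapsed:.6f} seconds")
        return False  # never swallow exceptions raised in the block


# Usage mirrors the benchmark: wrap the code to be measured.
with SimpleTimer("Example") as timer:
    total = sum(range(100_000))
```

Replacing `DateTimeTimer("Method 1")` with `SimpleTimer("Method 1")` should be enough to run the comparison, though the printed format will differ from the original helper.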
Conclusion
If you need the partition key values and do no further work on each sub DataFrame, group_by is faster.
If you only want to split the DataFrame into sub DataFrames and do not need the partition key values, use partition_by.