partition

Datalake partition utilities.

This module provides functions and classes for managing and manipulating partitions in a data lake stored on S3. It includes utilities for extracting partition information, encoding partition data, and listing partitions.

class dbsnaplake.partition.Partition(uri: str, data: Dict[str, str])

Represents a partition in an S3-based data lake.

A partition is a directory in S3 that contains data files but no subdirectories. It typically follows a hierarchical structure based on partition keys.

For example, in the following S3 directory structure:

s3://bucket/folder/year=2021/month=01/day=01/data.json
s3://bucket/folder/year=2021/month=01/day=02/data.json
s3://bucket/folder/year=2021/month=02/day=01/data.json
s3://bucket/folder/year=2021/month=02/day=02/data.json

Then:

  • s3://bucket/folder/year=2021/month=01/day=01/ is a partition.

  • s3://bucket/folder/year=2021/month=01/ is NOT a partition.

  • s3://bucket/folder/year=2021/ is NOT a partition.

Parameters:
  • uri – The S3 URI of the partition directory. For example: s3://bucket/folder/year=2021/month=01/day=01/

  • data – A dictionary of partition data. Note that the value is always a string, even if it represents a number. For example: {"year": "2021", "month": "01", "day": "01"}

property s3dir: S3Path

The S3 directory path of the partition.

classmethod from_uri(s3uri: str, s3uri_root: str)

Construct a Partition object from S3 URIs.

Parameters:
  • s3uri – The S3 URI of the partition.

  • s3uri_root – The S3 URI of the root directory.

Returns:

A new Partition object.
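
As a rough illustration of what from_uri might do, the sketch below parses the key=value segments that follow the root URI. PartitionSketch is a hypothetical stand-in, not the library's class, and the parsing logic is an assumption based on the Hive-style layout shown above rather than the actual implementation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PartitionSketch:
    """Hypothetical stand-in for Partition; not the dbsnaplake class."""

    uri: str
    data: Dict[str, str]

    @classmethod
    def from_uri(cls, s3uri: str, s3uri_root: str) -> "PartitionSketch":
        # Normalize both URIs so they end with a single "/".
        root = s3uri_root.rstrip("/") + "/"
        uri = s3uri.rstrip("/") + "/"
        if not uri.startswith(root):
            raise ValueError(f"{s3uri!r} is not under {s3uri_root!r}")
        # Everything after the root is assumed to be key=value segments.
        relative = uri[len(root):].strip("/")
        data: Dict[str, str] = {}
        for segment in relative.split("/"):
            if not segment:
                continue
            key, sep, value = segment.partition("=")
            if not sep:
                raise ValueError(f"segment {segment!r} is not key=value")
            data[key] = value  # values stay strings, e.g. "01" rather than 1
        return cls(uri=uri, data=data)


partition = PartitionSketch.from_uri(
    "s3://bucket/folder/year=2021/month=01/day=01/",
    "s3://bucket/folder/",
)
print(partition.data)
```

Rejecting URIs outside the root and segments without "=" is a design choice of this sketch; the library may handle those cases differently.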

list_parquet_files(s3_client: S3Client, ext: str = '.parquet') → List[S3Path]

List all Parquet files in the partition.

Parameters:
  • s3_client – The S3 client used to list objects.

  • ext – File extension to filter. Defaults to “.parquet”.

Returns:

A list of S3Path objects representing Parquet files.
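
The extension filter can be pictured as a suffix match over the keys found in the partition directory. The sketch below works on plain key strings instead of real S3 listings; filter_by_ext is a hypothetical helper, and whether the library matches case-sensitively is an assumption.

```python
from typing import Iterable, List


def filter_by_ext(keys: Iterable[str], ext: str = ".parquet") -> List[str]:
    """Keep only keys ending with ext (hypothetical helper, case-sensitive)."""
    return [key for key in keys if key.endswith(ext)]


keys = [
    "folder/year=2021/month=01/day=01/part-0.parquet",
    "folder/year=2021/month=01/day=01/part-1.parquet",
    "folder/year=2021/month=01/day=01/_SUCCESS",
]
parquet_keys = filter_by_ext(keys)
print(parquet_keys)
```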

dbsnaplake.partition.extract_partition_data(s3dir_root: S3Path, s3dir_partition: S3Path) → Dict[str, str]

Extract partition data from the S3 directory path.

Parameters:
  • s3dir_root – The root S3 directory.

  • s3dir_partition – The partition S3 directory.

Example

>>> s3dir_root = S3Path("s3://bucket/folder/")
>>> s3dir_partition = S3Path("s3://bucket/folder/year=2021/month=01/day=15/")
>>> extract_partition_data(s3dir_root, s3dir_partition)
{'year': '2021', 'month': '01', 'day': '15'}

dbsnaplake.partition.encode_hive_partition(kvs: Dict[str, str]) → str

Encode partition data into a Hive-style partition string.

Parameters:

kvs – A dictionary of partition key-value pairs.

For example:

>>> encode_hive_partition({"year": "2021", "month": "01", "day": "01"})
'year=2021/month=01/day=01'

dbsnaplake.partition.get_s3dir_partition(s3dir_root: S3Path, kvs: Dict[str, str]) → S3Path

Get the S3 directory path of the partition.

Parameters:
  • s3dir_root – The root S3 directory.

  • kvs – A dictionary of partition key-value pairs.

Example

>>> s3dir_root = S3Path("s3://bucket/folder/")
>>> s3dir_partition = get_s3dir_partition(s3dir_root, {"year": "2021", "month": "01", "day": "01"})
>>> s3dir_partition.uri
's3://bucket/folder/year=2021/month=01/day=01/'
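
A minimal sketch of how encode_hive_partition() and get_s3dir_partition() could compose, using plain strings in place of S3Path. The function names carry a _sketch suffix to mark them as illustrations, and the idea that key order follows dictionary insertion order is an assumption, not something stated here.

```python
from typing import Dict


def encode_hive_partition_sketch(kvs: Dict[str, str]) -> str:
    # Join key=value pairs in dictionary insertion order (assumed).
    return "/".join(f"{key}={value}" for key, value in kvs.items())


def get_partition_uri_sketch(root_uri: str, kvs: Dict[str, str]) -> str:
    # Directory URIs end with "/", matching the example above.
    return root_uri.rstrip("/") + "/" + encode_hive_partition_sketch(kvs) + "/"


uri = get_partition_uri_sketch(
    "s3://bucket/folder/",
    {"year": "2021", "month": "01", "day": "01"},
)
print(uri)
```

Because the order of the keys determines the directory hierarchy, callers of a function like this would need to pass the keys in a consistent order.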

dbsnaplake.partition.get_partitions_v2(s3_client: S3Client, s3dir_root: S3Path, _s3dir_partition: Optional[S3Path] = None, _partitions: List[Partition] = None) → List[Partition]

Recursively scan the S3 directory and return a list of partitions.

For example, for the following S3 structure:

s3://bucket/folder/year=2021/month=01/day=01/data.json
s3://bucket/folder/year=2021/month=01/day=02/data.json
s3://bucket/folder/year=2021/month=02/day=01/data.json
s3://bucket/folder/year=2021/month=02/day=02/data.json

The function will return partitions:

s3://bucket/folder/year=2021/month=01/day=01/
s3://bucket/folder/year=2021/month=01/day=02/
s3://bucket/folder/year=2021/month=02/day=01/
s3://bucket/folder/year=2021/month=02/day=02/

Note

This implementation recursively scans every S3 folder, so it is slower than get_partitions(). The code is intentionally kept for reference.
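
To make the cost of the recursive approach concrete, the sketch below walks a directory tree one level at a time. list_children is a hypothetical stand-in for a delimiter-based S3 listing and reads from an in-memory list of keys, so the call counter only approximates how many list requests a real scan might issue.

```python
from typing import List, Set, Tuple

KEYS = [
    "folder/year=2021/month=01/day=01/data.json",
    "folder/year=2021/month=01/day=02/data.json",
    "folder/year=2021/month=02/day=01/data.json",
    "folder/year=2021/month=02/day=02/data.json",
]
calls = 0  # counts simulated list requests


def list_children(prefix: str) -> Tuple[List[str], List[str]]:
    """Hypothetical one-level listing: returns (subdirectories, files)."""
    global calls
    calls += 1
    subdirs: Set[str] = set()
    files: List[str] = []
    for key in KEYS:
        if not key.startswith(prefix):
            continue
        head, sep, _ = key[len(prefix):].partition("/")
        if sep:
            subdirs.add(prefix + head + "/")
        else:
            files.append(key)
    return sorted(subdirs), files


def scan_recursively(prefix: str) -> List[str]:
    subdirs, files = list_children(prefix)
    if not subdirs:
        # Leaf directory: it is a partition only if it holds files.
        return [prefix] if files else []
    found: List[str] = []
    for subdir in subdirs:
        found.extend(scan_recursively(subdir))
    return found


partitions = scan_recursively("folder/")
print(partitions)
print("simulated list calls:", calls)
```

In this sketch every directory at every level costs one listing call, which is why the number of calls grows with the depth and breadth of the tree.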

dbsnaplake.partition.get_partitions(s3_client: S3Client, s3dir_root: S3Path) → List[Partition]

Efficiently scan the S3 directory and return a list of partitions.

For example, for the following S3 structure:

s3://bucket/folder/year=2021/month=01/day=01/data.json
s3://bucket/folder/year=2021/month=01/day=02/data.json
s3://bucket/folder/year=2021/month=02/day=01/data.json
s3://bucket/folder/year=2021/month=02/day=02/data.json

The function will return partitions:

s3://bucket/folder/year=2021/month=01/day=01/
s3://bucket/folder/year=2021/month=01/day=02/
s3://bucket/folder/year=2021/month=02/day=01/
s3://bucket/folder/year=2021/month=02/day=02/

Note

This implementation is faster than get_partitions_v2() because it avoids recursive S3 API calls.
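
One way to avoid per-directory requests, sketched below, is to list every key under the root once and take the parent directory of each key. This is an assumption about the general technique, not a description of how get_partitions() is actually implemented; partitions_from_keys is a hypothetical helper that works on plain key strings.

```python
from typing import Iterable, List


def partitions_from_keys(keys: Iterable[str]) -> List[str]:
    """Derive partition directories from a flat key listing (sketch)."""
    parents = {key.rsplit("/", 1)[0] + "/" for key in keys if "/" in key}
    # Drop any directory that has a subdirectory holding files,
    # since a partition must not contain subdirectories.
    leaves = [
        d for d in parents
        if not any(other != d and other.startswith(d) for other in parents)
    ]
    return sorted(leaves)


flat_keys = [
    "folder/year=2021/month=01/day=01/data.json",
    "folder/year=2021/month=01/day=02/data.json",
    "folder/year=2021/month=02/day=01/data.json",
    "folder/year=2021/month=02/day=02/data.json",
]
flat_partitions = partitions_from_keys(flat_keys)
print(flat_partitions)
```

The trade-off in this sketch is that the single listing returns every object under the root, so it suits layouts where listing all keys is cheaper than issuing one request per directory.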