partition
Datalake partition utilities.
This module provides functions and classes for managing and manipulating partitions in a data lake stored on S3. It includes utilities for extracting partition information, encoding partition data, and listing partitions.
- class dbsnaplake.partition.Partition(uri: str, data: Dict[str, str])
Represents a partition in an S3-based data lake.
A partition is a directory in S3 that contains data files but no subdirectories. It typically follows a hierarchical structure based on partition keys.
For example, in the following S3 directory structure:
s3://bucket/folder/year=2021/month=01/day=01/data.json
s3://bucket/folder/year=2021/month=01/day=02/data.json
s3://bucket/folder/year=2021/month=02/day=01/data.json
s3://bucket/folder/year=2021/month=02/day=02/data.json
Then:
s3://bucket/folder/year=2021/month=01/day=01/ is a partition.
s3://bucket/folder/year=2021/month=01/ is NOT a partition.
s3://bucket/folder/year=2021/ is NOT a partition.
- Parameters:
uri – The S3 URI of the partition. For example:
s3://bucket/folder/year=2021/month=01/day=01/data.json
data – A dictionary of partition data. Note that the value is always a string, even if it represents a number. For example:
{"year": "2021", "month": "01", "day": "01"}
- property s3dir: S3Path
The S3 directory path of the partition.
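The shape of a partition record can be illustrated with a small stand-in. This is a minimal sketch, not the library class: `PartitionSketch` is a hypothetical name, and it models only the two constructor arguments documented above. It mainly shows that partition values stay strings and must be converted explicitly when a number is needed.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PartitionSketch:
    """Hypothetical stand-in for Partition, not the library class."""

    uri: str
    data: Dict[str, str]


partition = PartitionSketch(
    uri="s3://bucket/folder/year=2021/month=01/day=01/",
    data={"year": "2021", "month": "01", "day": "01"},
)

# Values stay strings, so zero padding such as "01" is preserved.
# Convert explicitly when a number is needed.
year = int(partition.data["year"])
month = int(partition.data["month"])
print(year, month)
```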
- dbsnaplake.partition.extract_partition_data(s3dir_root: S3Path, s3dir_partition: S3Path) -> Dict[str, str]
Extract partition data from the S3 directory path.
- Parameters:
s3dir_root – The root S3 directory.
s3dir_partition – The partition S3 directory.
Example
>>> s3dir_root = S3Path("s3://bucket/folder/")
>>> s3dir_partition = S3Path("s3://bucket/folder/year=2021/month=01/day=15/")
>>> extract_partition_data(s3dir_root, s3dir_partition)
{'year': '2021', 'month': '01', 'day': '15'}
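The extraction step can be sketched with plain strings instead of `S3Path` objects. The helper below is a hypothetical illustration of the idea (strip the root prefix, then split each remaining `key=value` segment); it is not the library's implementation, and its error handling is an assumption.

```python
from typing import Dict


def extract_partition_data_sketch(root_uri: str, partition_uri: str) -> Dict[str, str]:
    """Hypothetical string-based equivalent, not the library's code."""
    # Normalize the root so that "folder" does not match "folder2".
    root = root_uri.rstrip("/") + "/"
    if not partition_uri.startswith(root):
        raise ValueError(f"{partition_uri!r} is not under {root!r}")
    relative = partition_uri[len(root):].strip("/")
    data: Dict[str, str] = {}
    for segment in relative.split("/"):
        if not segment:
            continue  # the partition is the root itself
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"segment {segment!r} is not in key=value form")
        data[key] = value
    return data


print(
    extract_partition_data_sketch(
        "s3://bucket/folder/",
        "s3://bucket/folder/year=2021/month=01/day=15/",
    )
)
```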
- dbsnaplake.partition.encode_hive_partition(kvs: Dict[str, str]) -> str
Encode partition data into a Hive-style partition string.
- Parameters:
kvs – A dictionary of partition key-value pairs.
For example:
>>> encode_hive_partition({"year": "2021", "month": "01", "day": "01"})
'year=2021/month=01/day=01'
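A Hive-style encoding of this kind can be sketched in a single expression. The function name below is hypothetical, and the sketch assumes that the key order of the dictionary (insertion order in Python 3.7+) determines the order of the path segments; the library may handle ordering differently.

```python
from typing import Dict


def encode_hive_partition_sketch(kvs: Dict[str, str]) -> str:
    """Hypothetical equivalent: join key=value pairs with '/'."""
    # Dictionaries preserve insertion order, so the caller controls
    # the hierarchy (year before month before day, and so on).
    return "/".join(f"{key}={value}" for key, value in kvs.items())


print(encode_hive_partition_sketch({"year": "2021", "month": "01", "day": "01"}))
```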
- dbsnaplake.partition.get_s3dir_partition(s3dir_root: S3Path, kvs: Dict[str, str]) -> S3Path
Get the S3 directory path of the partition.
- Parameters:
s3dir_root – The root S3 directory.
kvs – A dictionary of partition key-value pairs.
Example
>>> s3dir_root = S3Path("s3://bucket/folder/")
>>> s3dir_partition = get_s3dir_partition(s3dir_root, {"year": "2021", "month": "01", "day": "01"})
>>> s3dir_partition.uri
's3://bucket/folder/year=2021/month=01/day=01/'
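Building the partition directory amounts to appending the encoded key-value pairs to the root. The sketch below works on plain URI strings rather than `S3Path`, and the function name is hypothetical; it only illustrates the path arithmetic, including the trailing slash that marks a directory.

```python
from typing import Dict


def get_partition_uri_sketch(root_uri: str, kvs: Dict[str, str]) -> str:
    """Hypothetical string-based equivalent, not the library's code."""
    root = root_uri.rstrip("/") + "/"
    encoded = "/".join(f"{key}={value}" for key, value in kvs.items())
    if not encoded:
        return root  # no partition keys: the root is the directory
    return root + encoded + "/"


print(
    get_partition_uri_sketch(
        "s3://bucket/folder/",
        {"year": "2021", "month": "01", "day": "01"},
    )
)
```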
- dbsnaplake.partition.get_partitions_v2(s3_client: S3Client, s3dir_root: S3Path, _s3dir_partition: Optional[S3Path] = None, _partitions: List[Partition] = None) -> List[Partition]
Recursively scan the S3 directory and return a list of partitions.
For example, for the following S3 structure:
s3://bucket/folder/year=2021/month=01/day=01/data.json
s3://bucket/folder/year=2021/month=01/day=02/data.json
s3://bucket/folder/year=2021/month=02/day=01/data.json
s3://bucket/folder/year=2021/month=02/day=02/data.json
The function will return partitions:
s3://bucket/folder/year=2021/month=01/day=01/
s3://bucket/folder/year=2021/month=01/day=02/
s3://bucket/folder/year=2021/month=02/day=01/
s3://bucket/folder/year=2021/month=02/day=02/
Note
This implementation recursively scans every S3 folder, so it is slower than
get_partitions(). It is intentionally kept here for reference.
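To see why a recursive scan is costly, the sketch below simulates it against an in-memory list of object URIs. Everything here is hypothetical: `list_one_level` stands in for one folder-level S3 listing request, and `listing_calls` exists only to count those requests. It is not the library's code, only an illustration of one request per visited folder.

```python
from typing import List, Optional, Tuple

# Object URIs from the example above, standing in for a real bucket.
OBJECT_URIS = [
    "s3://bucket/folder/year=2021/month=01/day=01/data.json",
    "s3://bucket/folder/year=2021/month=01/day=02/data.json",
    "s3://bucket/folder/year=2021/month=02/day=01/data.json",
    "s3://bucket/folder/year=2021/month=02/day=02/data.json",
]

# Records every simulated listing request so the cost is visible.
listing_calls: List[str] = []


def list_one_level(prefix: str) -> Tuple[List[str], bool]:
    """Simulate one folder-level listing: the immediate subfolders, plus
    whether any object sits directly under the prefix."""
    listing_calls.append(prefix)
    subdirs = set()
    has_files = False
    for uri in OBJECT_URIS:
        if not uri.startswith(prefix):
            continue
        rest = uri[len(prefix):]
        if "/" in rest:
            subdirs.add(prefix + rest.split("/", 1)[0] + "/")
        elif rest:
            has_files = True
    return sorted(subdirs), has_files


def scan_recursively(prefix: str, found: Optional[List[str]] = None) -> List[str]:
    """Collect folders that hold objects but have no subfolders."""
    if found is None:
        found = []
    subdirs, has_files = list_one_level(prefix)
    if subdirs:
        for subdir in subdirs:
            scan_recursively(subdir, found)
    elif has_files:
        found.append(prefix)
    return found


partitions = scan_recursively("s3://bucket/folder/")
print(partitions)
print(len(listing_calls), "simulated listing requests")
```

The number of simulated requests grows with the number of folders at every level, not just with the number of partitions.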
- dbsnaplake.partition.get_partitions(s3_client: S3Client, s3dir_root: S3Path) -> List[Partition]
Efficiently scan the S3 directory and return a list of partitions.
For example, for the following S3 structure:
s3://bucket/folder/year=2021/month=01/day=01/data.json
s3://bucket/folder/year=2021/month=01/day=02/data.json
s3://bucket/folder/year=2021/month=02/day=01/data.json
s3://bucket/folder/year=2021/month=02/day=02/data.json
The function will return partitions:
s3://bucket/folder/year=2021/month=01/day=01/
s3://bucket/folder/year=2021/month=01/day=02/
s3://bucket/folder/year=2021/month=02/day=01/
s3://bucket/folder/year=2021/month=02/day=02/
Note
This implementation performs better than
get_partitions_v2() because it avoids recursive S3 API calls.
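One way to avoid recursion is to derive partitions from a single flat listing of object URIs. The sketch below assumes such a list is already available (for example, gathered from a paginated listing under the root); the function name is hypothetical and the approach is an illustration, not necessarily how the library does it. A folder counts as a partition when it directly holds at least one object and no other object-holding folder is nested beneath it.

```python
from typing import Iterable, List


def partitions_from_flat_listing(object_uris: Iterable[str]) -> List[str]:
    """Hypothetical sketch: find leaf folders from a flat URI listing."""
    # Parent folder of every object; skip folder placeholder keys ("x/").
    parents = sorted(
        {uri.rsplit("/", 1)[0] + "/" for uri in object_uris if not uri.endswith("/")}
    )
    result: List[str] = []
    for index, folder in enumerate(parents):
        # In sorted order, any folder nested under `folder` follows it
        # immediately, so checking the next entry is enough.
        is_last = index + 1 == len(parents)
        if not is_last and parents[index + 1].startswith(folder):
            continue  # `folder` has a subfolder, so it is not a partition
        result.append(folder)
    return result


uris = [
    "s3://bucket/folder/year=2021/month=01/day=01/data.json",
    "s3://bucket/folder/year=2021/month=01/day=02/data.json",
    "s3://bucket/folder/year=2021/month=02/day=01/data.json",
    "s3://bucket/folder/year=2021/month=02/day=02/data.json",
]
for partition_uri in partitions_from_flat_listing(uris):
    print(partition_uri)
```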