s3_loc

This module provides a structured way to manage and interact with S3 locations in the ETL pipeline. It defines the S3Location class, which encapsulates the logic for constructing and accessing various S3 paths used throughout the ETL process.

The S3Location class handles two main areas:

  1. Staging area: Temporary storage for processed data and manifests.

  2. Data lake: Final storage location for optimized data.

Centralizing these paths in one class keeps the S3 layout consistent across every step of the pipeline, so each step reads from and writes to a predictable location.

class dbsnaplake.s3_loc.S3Location(s3uri_staging: str, s3uri_datalake: str)

A central namespace class that exposes all important S3 paths used in the ETL pipeline.

Example:
>>> s3_loc = S3Location(
...     s3uri_staging="s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/",
...     s3uri_datalake="s3://bucket/prefix/datalake/mydatabase/mytable_2021_01_01_08_30_00/"
... )
>>> s3_loc.s3dir_staging
...
>>> s3_loc.s3dir_datalake
...
>>> s3_loc.s3dir_staging_manifest
...
>>> s3_loc.s3dir_snapshot_file_group_manifest
...
>>> s3_loc.s3dir_snapshot_file_group_manifest_summary
...
>>> s3_loc.s3dir_snapshot_file_group_manifest_data
...
>>> s3_loc.s3dir_staging_file_group_manifest
...
>>> s3_loc.s3dir_staging_file_group_manifest_summary
...
>>> s3_loc.s3dir_staging_file_group_manifest_data
...
>>> s3_loc.s3dir_partition_file_group_manifest
...
>>> s3_loc.s3dir_partition_file_group_manifest_summary
...
>>> s3_loc.s3dir_partition_file_group_manifest_data
...
>>> s3_loc.s3dir_staging_datalake
...
property s3dir_staging: S3Path

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/
property s3dir_datalake: S3Path

Example:

s3://bucket/prefix/datalake/mydatabase/mytable_2021_01_01_08_30_00/
property s3dir_staging_manifest: S3Path

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/manifests/
property s3dir_snapshot_file_group_manifest: S3Path

Directory where snapshot file group manifest files are stored.

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/manifests/snapshot-file-groups/
property s3dir_snapshot_file_group_manifest_summary: S3Path

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/manifests/snapshot-file-groups/manifest-summary/
property s3dir_snapshot_file_group_manifest_data: S3Path

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/manifests/snapshot-file-groups/manifest-data/
property s3dir_staging_file_group_manifest: S3Path

Directory where staging file group manifest files are stored.

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/manifests/staging-file-groups/
property s3dir_staging_file_group_manifest_summary: S3Path

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/manifests/staging-file-groups/manifest-summary/
property s3dir_staging_file_group_manifest_data: S3Path

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/manifests/staging-file-groups/manifest-data/
property s3dir_partition_file_group_manifest: S3Path

Directory where partition file group manifest files are stored.

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/manifests/partition-file-groups/
property s3dir_partition_file_group_manifest_summary: S3Path

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/manifests/partition-file-groups/manifest-summary/
property s3dir_partition_file_group_manifest_data: S3Path

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/manifests/partition-file-groups/manifest-data/
property s3dir_staging_datalake: S3Path

Directory where staged data is stored, laid out with a folder structure similar to that of the final datalake.

Example:

s3://bucket/prefix/staging/mydatabase/mytable/snapshot=2021-01-01T08:30:00Z/datalake/
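The staging properties above follow a regular pattern: three manifest groups (snapshot, staging, partition), each with a `manifest-summary/` and a `manifest-data/` child, plus a `datalake/` folder. The sketch below reproduces that pattern with plain strings so the layout can be seen in one place. The function name `staging_layout` is hypothetical and is not part of `dbsnaplake`; the real class returns `S3Path` objects and may build them differently.

```python
def staging_layout(s3uri_staging: str) -> dict:
    """Return the documented staging sub-directories as plain URI strings.

    Hypothetical helper for illustration only; not the library's implementation.
    """
    # Normalize so every directory URI ends with a single trailing slash.
    root = s3uri_staging if s3uri_staging.endswith("/") else s3uri_staging + "/"
    layout = {
        "s3dir_staging": root,
        "s3dir_staging_manifest": f"{root}manifests/",
        "s3dir_staging_datalake": f"{root}datalake/",
    }
    # Each manifest group has the same two children.
    for group in ("snapshot", "staging", "partition"):
        base = f"{root}manifests/{group}-file-groups/"
        layout[f"s3dir_{group}_file_group_manifest"] = base
        layout[f"s3dir_{group}_file_group_manifest_summary"] = f"{base}manifest-summary/"
        layout[f"s3dir_{group}_file_group_manifest_data"] = f"{base}manifest-data/"
    return layout
```

Calling it with the staging URI from the class example yields the same twelve directory URIs shown in the property examples above.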
iter_staging_datalake_partition(s3_client: S3Client) → List[Partition]

List all partitions in the staging datalake folder, so that you can plan how to compact the small files in each partition into bigger files and move them to the final datalake folder.
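To illustrate the planning step, here is a minimal sketch that groups object keys by their partition directory and totals their sizes. It works on an in-memory list of `(key, size)` pairs, which in practice might come from an S3 listing call; it does not use the library's `Partition` type. The function name `group_files_by_partition` and the `key=value` folder names in the usage note are assumptions for illustration, not the documented behavior of this method.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_files_by_partition(
    root_prefix: str,
    objects: Iterable[Tuple[str, int]],
) -> Dict[str, List[Tuple[str, int]]]:
    """Group (key, size) pairs by the directory they sit in under root_prefix.

    Hypothetical helper for illustration only; not part of dbsnaplake.
    """
    groups: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for key, size in objects:
        if not key.startswith(root_prefix):
            continue  # ignore objects outside the staging datalake folder
        relative = key[len(root_prefix):]
        # Everything before the last "/" is the partition directory.
        partition, _, filename = relative.rpartition("/")
        if filename:  # skip keys ending in "/" (directory markers)
            groups[partition].append((key, size))
    return dict(groups)
```

For example, keys under `year=2021/month=01/` and `year=2021/month=02/` end up in two groups, and the summed size of each group indicates how many compacted output files that partition is likely to need.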

get_s3dir_staging_datalake_partition(kvs: Dict[str, str]) → S3Path

Get the S3 path for a specific partition in the staging datalake folder.

get_s3dir_datalake_partition(kvs: Dict[str, str]) → S3Path

Get the S3 path for a specific partition in the datalake folder.
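The two partition getters take a `kvs` mapping of partition keys to values. The sketch below shows one plausible way such a mapping could be turned into a directory URI, assuming Hive-style `key=value/` folder names. That naming scheme is an assumption (the source does not state it), and `partition_dir` is a hypothetical name that returns a plain string rather than an `S3Path`.

```python
from typing import Dict


def partition_dir(s3dir_root: str, kvs: Dict[str, str]) -> str:
    """Append one `key=value/` segment per entry in kvs to a root directory URI.

    Hypothetical helper assuming Hive-style partition folders; not the
    library's implementation.
    """
    root = s3dir_root if s3dir_root.endswith("/") else s3dir_root + "/"
    # Dicts preserve insertion order (Python 3.7+), so the caller controls
    # the order of the partition levels.
    return root + "".join(f"{key}={value}/" for key, value in kvs.items())
```

With the datalake root from the class example and `{"year": "2021", "month": "01"}`, this returns the root followed by `year=2021/month=01/`. An empty mapping returns the root unchanged.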

property s3path_validate_datalake_result: S3Path

todo: docstring