validate_datalake

Data Lake Validation Module

This module provides utilities for validating and analyzing the contents of a data lake. It includes classes for representing partitions and validation results, as well as a function to perform the actual validation process.

class dbsnaplake.validate_datalake.Partition(data: Dict[str, str], n_files: int, total_size: int, total_size_4_human: str, total_n_record: Optional[int])
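
The signature above pairs a raw byte count (total_size) with a display string (total_size_4_human). The format the library uses for that string is not documented on this page. As a rough illustration only, the sketch below builds a stand-in record with the same field names and derives the display string using binary units. `PartitionSketch`, `human_size`, and `make_partition` are hypothetical names, not part of dbsnaplake.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def human_size(n_bytes: int) -> str:
    """Format a byte count with binary units (hypothetical helper)."""
    units = ("B", "KiB", "MiB", "GiB", "TiB")
    size = float(n_bytes)
    index = 0
    # Divide by 1024 until the value fits the unit or units run out.
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f"{size:.2f} {units[index]}"


@dataclass
class PartitionSketch:
    """Stand-in mirroring the documented Partition fields."""

    data: Dict[str, str]           # partition key/value pairs (assumed meaning)
    n_files: int
    total_size: int                # bytes
    total_size_4_human: str
    total_n_record: Optional[int]  # None when records were not counted


def make_partition(
    data: Dict[str, str],
    n_files: int,
    total_size: int,
    total_n_record: Optional[int] = None,
) -> PartitionSketch:
    """Build a stand-in partition, deriving the display size from the bytes."""
    return PartitionSketch(
        data=data,
        n_files=n_files,
        total_size=total_size,
        total_size_4_human=human_size(total_size),
        total_n_record=total_n_record,
    )
```

Deriving the display string from the byte count in one place keeps the two fields from drifting apart.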
class dbsnaplake.validate_datalake.ValidateDatalakeResult(before_n_files: int, before_total_size: int, before_total_size_4_human: str, before_total_n_record: int, after_n_files: int, after_total_size: int, after_total_size_4_human: str, after_total_n_record: Optional[int], n_partition: int, partitions: List[Partition])

Encapsulates the results of a data lake validation process.

This class stores comparison data between the original snapshot and the processed data lake, including file counts, sizes, and record counts.

Parameters:
  • before_n_files – Number of files in the original snapshot.

  • before_total_size – Total size of the original snapshot in bytes.

  • before_total_size_4_human – Human-readable representation of the original snapshot size.

  • before_total_n_record – Total number of records in the original snapshot.

  • after_n_files – Number of files in the processed data lake.

  • after_total_size – Total size of the processed data lake in bytes.

  • after_total_size_4_human – Human-readable representation of the processed data lake size.

  • after_total_n_record – Total number of records in the processed data lake.

  • n_partition – Number of partitions in the processed data lake.

  • partitions – List of Partition objects representing each partition in the data lake.
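
The before/after fields listed above lend themselves to a simple consistency check. The following is a minimal sketch, not the library's own logic: `ResultSketch` is a hypothetical stand-in carrying a subset of the documented fields, and `summarize` is an illustrative helper. It assumes that record counts are expected to match, while file counts and sizes may legitimately differ after processing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResultSketch:
    """Stand-in with a subset of the documented ValidateDatalakeResult fields."""

    before_n_files: int
    before_total_size: int
    before_total_n_record: int
    after_n_files: int
    after_total_size: int
    after_total_n_record: Optional[int]  # None when no record count was taken
    n_partition: int


def summarize(result: ResultSketch) -> List[str]:
    """Return short findings from a before/after comparison."""
    findings: List[str] = []
    if result.after_total_n_record is None:
        findings.append("record count skipped")
    elif result.after_total_n_record == result.before_total_n_record:
        findings.append("record counts match")
    else:
        diff = result.after_total_n_record - result.before_total_n_record
        findings.append(f"record count differs by {diff:+d}")
    # Guard against an empty snapshot to avoid dividing by zero.
    if result.before_total_size:
        ratio = result.after_total_size / result.before_total_size
        findings.append(f"size ratio after/before: {ratio:.2f}")
    return findings
```

Because after_total_n_record is Optional, the None case is handled before any arithmetic on it.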

dbsnaplake.validate_datalake.count_records(polars_writer: Writer, count_column: str, s3path_list: List[S3Path]) → int

Count the total number of records across a list of S3 paths within an S3 partition folder.

Parameters:
  • polars_writer – Writer object used to write the data to S3.

  • count_column – Name of the column used to count the number of records.

  • s3path_list – List of S3 paths to scan for records.

Returns:

Total number of records in the S3 partition folder.

dbsnaplake.validate_datalake.validate_datalake(s3_client: S3Client, s3_loc: S3Location, db_snapshot_manifest_file: DBSnapshotManifestFile, polars_writer: Optional[Writer] = None, count_column: Optional[str] = None, logger=<dbsnaplake.logger.DummyLogger object>) → ValidateDatalakeResult

Validates the data lake by scanning its contents and collecting statistics.

This function compares the original database snapshot with the processed data lake, providing detailed information about file counts, sizes, and record counts.

Parameters:
  • s3_client – An initialized boto3 S3 client for S3 operations.

  • s3_loc – S3 location information for the data lake.

  • db_snapshot_manifest_file – Manifest file of the original database snapshot.

  • polars_writer – polars_writer.Writer object.

  • count_column – Name of the column used to count the number of records. This column must exist in all rows. If not provided, the validation result does not include record counts.

Note

Validation uses only the current snapshot's data. Manifest data from previous snapshots is not used.

Note

The record-count feature is not available in unit tests, because the polars.scan_xyz methods do not work well with moto (the AWS mocking library).