CLI
Tidylake's CLI helps you manage and automate your project; as an alternative, you can also interact with it directly from Python using the SDK.
Basic commands
These commands must be launched from a tidylake project: first create a valid context file that points to one or more data product manifest files.
Every command looks for a tidylake.yml file in the root folder of your project, but you can override this path with the --file argument.
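For example, using the `uv run` convention from the examples below (the `configs/` folder here is a hypothetical layout, not a tidylake requirement):

```shell
# Uses tidylake.yml from the project root by default
uv run tidylake list

# Point the command at a context file in a non-default location
uv run tidylake list --file configs/tidylake.yml
```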
list
Lists the defined data products following the internal DAG ordering. Several options can be configured to obtain a richer representation of the graph, improve visualization, or produce output to include in documentation.
uv run tidylake list --help
Usage: tidylake list [OPTIONS]
Display pipeline structure and visualization.
Args:
  file: Path to the configuration YAML file.
  mermaid: Generate Mermaid diagram output.
  textual: Launch interactive terminal UI viewer.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --file TEXT Path to the YAML file defining processes [default: tidylake.yml] │
│ --mermaid Generate Mermaid diagram │
│ --textual Open Textual viewer │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
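A few representative invocations, using only the flags shown above:

```shell
# Print the data products in DAG order
uv run tidylake list

# Emit a Mermaid diagram, e.g. to paste into documentation
uv run tidylake list --mermaid

# Explore the graph interactively in the terminal
uv run tidylake list --textual
```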
run
Runs data product scripts in order. This is meant for testing with smaller sets of data or for managing very simple projects that do not require additional orchestration. It can be configured to start from and filter subsets of the graph, so you can test only the required parts of the process.
This package is not an orchestration tool
We didn't design tidylake as a replacement for orchestration. There are many important capabilities that the base package will never cover; these should be provided by specialized tools via plugins or by integrating the project with the SDK. Use this command to test and automate early-stage projects, but plan to integrate an orchestration tool when you need to scale up.
uv run tidylake run --help
Usage: tidylake run [OPTIONS]
Execute pipeline or individual data product.
Args:
  file: Path to the configuration YAML file.
  data_product: Name of specific data product to execute. If None, runs entire pipeline.
  continue_run: Continue execution from the last completed data product.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --file TEXT Path to the YAML file defining processes [default: tidylake.yml] │
│ --data-product TEXT Specific data product to run │
│ --continue-run --no-continue-run Continue from the last data product [default: no-continue-run] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
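Typical invocations might look like the following (the data product name is hypothetical):

```shell
# Run the entire pipeline in DAG order
uv run tidylake run

# Run a single data product
uv run tidylake run --data-product customers_clean

# Resume from the last completed data product after a failure
uv run tidylake run --continue-run
```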
init
Creates a scaffold project by seeding base files and configuration. It comes in different flavours, which can be easily adapted to other engines.
uv run tidylake init --help
Usage: tidylake init [OPTIONS]
Initialize a new project from demo template.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --name TEXT Project name [default: my_tidylake_project] │
│ --engine TEXT Engine type: pandas, spark, iceberg [default: pandas] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
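For example, to scaffold projects with the default and an alternative engine (project names here are placeholders):

```shell
# Scaffold a project with the default pandas engine
uv run tidylake init --name my_tidylake_project

# Scaffold a Spark flavour instead
uv run tidylake init --name analytics_lake --engine spark
```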
Compute Engine commands
These commands are only available when a compute engine has been correctly configured, as they interact with catalog and storage systems that are not native to the library. They are meant to encapsulate, under a common framework, tasks that are usually repetitive in the management of data lakes.
schema diff
Prints the differences between the schema in the manifest file and the one in the configured catalog. It uses an intermediate abstraction for data types, so your workflow stays invariant across projects.
uv run tidylake schema diff --help
Usage: tidylake schema diff [OPTIONS]
Show schema differences between defined and catalog schemas.
Args:
  file: Path to the configuration YAML file.
  data_product: Name of specific data product to check. If None, checks all data products.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --file TEXT Path to the YAML file defining processes [default: tidylake.yml] │
│ --data-product TEXT Specific data product to check │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
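For example (the data product name is hypothetical):

```shell
# Compare every data product's manifest schema with the catalog
uv run tidylake schema diff

# Check a single data product
uv run tidylake schema diff --data-product customers_clean
```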
schema update
Runs schema modifications based on the differences between the manifest file and the schema already defined in your catalog. Use the command in dry-run mode first and avoid passing --commit unintentionally, as the results can be destructive depending on your compute and storage engine.
uv run tidylake schema update --help
Usage: tidylake schema update [OPTIONS]
Update schema definitions to match catalog schemas.
Args:
  file: Path to the configuration YAML file.
  data_product: Name of specific data product to update. If None, updates all data products.
  commit: Apply changes to schema files. If False, runs in dry-run mode.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --file TEXT Path to the YAML file defining processes [default: tidylake.yml] │
│ --data-product TEXT Specific data product to update │
│ --commit --no-commit Commit changes [default: no-commit] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
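A safe workflow is to inspect the dry-run output before committing:

```shell
# Dry run (the default): show what would change without applying anything
uv run tidylake schema update

# Apply the changes once the dry-run output looks right
uv run tidylake schema update --commit
```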
peek
Renders a sample of the real dataset by accessing the underlying compute engine. Implementing this command is optional in a plugin definition, as it can pose a data security risk in some scenarios.
uv run tidylake peek --help
Usage: tidylake peek [OPTIONS]
Preview data product output data without executing the pipeline.
Args:
  file: Path to the configuration YAML file.
  data_product: Name of specific data product to preview.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --file TEXT Path to the YAML file defining processes [default: tidylake.yml] │
│    --data-product        TEXT  Specific data product to preview                                                                                                    │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
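For example (the data product name is hypothetical):

```shell
# Preview a sample of a data product's output without running the pipeline
uv run tidylake peek --data-product customers_clean
```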