tidylake
Managing your data lakehouse assets
The Data Platform Balancing Act
Building a data platform is a constant struggle to balance three types of complexity:
- Technical Stack: Managing storage, compute, and evolving tools.
- Process: Handling orchestration, monitoring, and deployments.
- People: Coordinating a diverse team—from Data Engineers and DevOps to Data Scientists and Governance leads.
Why traditional approaches fail:
- Too Engineering-Heavy: If you focus solely on complex orchestration frameworks, less technical users (Governance, Business Analysts) can't contribute. This creates an engineering bottleneck that doesn't scale.
- Too Governance-Heavy: If you rely on standalone documentation tools, they quickly become detached from the actual code. Syncing the two becomes an impossible manual task.
tidylake provides the "common ground" for your data platform. Instead of choosing between code and documentation, tidylake connects them.
It bridges the gap between governance documentation and process execution without forcing you to switch away from your favorite tools, and it streamlines how you define assets, ensuring your metadata and your pipelines stay in sync.
Your first project
Follow this minimal example to understand the core mechanics of a tidylake project.
In this guide, we will build a single data product, the fundamental building block of tidylake. It is an abstraction that allows you to define where your data comes from and what it should look like (metadata), then "wire it up" to your actual processing logic using our helpers.
The Philosophy: Organization over Interference
tidylake is an unopinionated framework. By structuring your scripts and defining your data products, you unlock powerful tooling and automation. However, tidylake does not modify your data or add features to your processing engine (like pandas in this example) or filesystem. It is designed for organization and governance, leaving the data processing entirely to you. Our plugin system allows for more flexibility and integration with your chosen framework.
The Manifest File
Every data product begins with a manifest. This is a standardized YAML schema that acts as the "source of truth" for your metadata.
```yaml
data_product:
  name: 'silver_customers' # (1)
  description: Customer profile from the CRM database
  script: 'silver_customers' # (2)
  schema: # (3)
    properties:
      customer_id:
        type: 'string'
        desc: 'Unique identifier for the customer'
      customer_name:
        type: 'string'
        desc: 'Full name of the customer'
      customer_active:
        type: 'boolean'
        desc: 'Flag indicating if the customer account is active'
      customer_city:
        type: 'string'
        desc: 'City where the customer is located'
```
1. Identity: Reusable metadata used for documentation and CLI commands.
2. Binding: Links this metadata directly to the Python logic.
3. Contract: Used for data quality validation and automating table definitions in your data warehouse.
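To make the "contract" role concrete, here is a minimal sketch of how a manifest's schema could be checked against a pandas DataFrame. The `SCHEMA` dict mirrors the YAML above; the type mapping and `validate` helper are illustrative assumptions, not part of tidylake's API.

```python
import pandas as pd

# Mirrors the schema.properties block of the manifest above.
SCHEMA = {
    "customer_id": "string",
    "customer_name": "string",
    "customer_active": "boolean",
    "customer_city": "string",
}

# Assumed mapping from manifest types to pandas dtypes (illustrative only).
PANDAS_DTYPES = {"string": "object", "boolean": "bool"}


def validate(df: pd.DataFrame, schema: dict) -> list[str]:
    """Return a list of contract violations (empty means the frame conforms)."""
    errors = []
    for column, declared in schema.items():
        if column not in df.columns:
            errors.append(f"missing column: {column}")
        elif df[column].dtype != PANDAS_DTYPES[declared]:
            errors.append(f"{column}: expected {declared}, got {df[column].dtype}")
    return errors


df = pd.DataFrame({
    "customer_id": ["c1"],
    "customer_name": ["Ada"],
    "customer_active": [True],
    "customer_city": ["PARIS"],
})
print(validate(df, SCHEMA))  # []
```

The same manifest can thus drive both documentation and a runtime check, which is what keeps metadata and pipelines from drifting apart.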
The Context File
Tidylake uses a global context file to discover which data products belong to your project. The following is a minimal example including the previous data product.
```yaml
tidylake:
  name: Hello World
  include_data_products:
    - silver_customers.yml
```
The Script File
This is where you implement your logic (reading, transforming, and storing data). Tidylake stays out of your way, letting you use standard Python.
```python
# %%
# (1)!
import pandas as pd

from tidylake import get_or_create_context

data_product = get_or_create_context().get_data_product("silver_customers")  # (2)!

# %%
@data_product.add_input()  # (3)!
def bronze_customers():
    return pd.read_parquet("/tmp/bronze_customers")


df_bronze_customers = bronze_customers()

# %%
df_silver_customers = (
    df_bronze_customers.loc[lambda df: df["customer_active"]]
    .assign(
        customer_city=lambda df: df["customer_city"].str.upper(),
        customer_name=lambda df: df["customer_name"].str.title(),
    )
)  # (4)!

# %%
@data_product.set_sink()  # (5)!
def write_silver_customers():
    return df_silver_customers.to_parquet("/tmp/silver_customers", index=False)
```
1. Context Awareness: Works in both interactive (Notebooks) and batch (Production) modes.
2. Lineage: Automatically tracks where data comes from.
3. More Lineage: Automatically tracks dependencies with other data products.
4. Freedom: Tidylake doesn't change how you write Pandas/Spark code.
5. Safety: Sinks can be disabled in development to prevent accidental data overwrites.
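To see why decorating functions is enough to get lineage and safety, here is a simplified sketch of the mechanics behind decorators like `add_input` and `set_sink`. The `DataProductSketch` class is a hypothetical stand-in for illustration, not tidylake's implementation:

```python
class DataProductSketch:
    """Illustrative stand-in for tidylake's data product object (not the real API)."""

    def __init__(self, name: str, interactive: bool):
        self.name = name
        self.interactive = interactive
        self.inputs: list[str] = []  # lineage: names of upstream products

    def add_input(self):
        def decorator(func):
            # Registering the function is enough to record lineage;
            # the data it returns is never touched.
            self.inputs.append(func.__name__)
            return func
        return decorator

    def set_sink(self):
        def decorator(func):
            def wrapper(*args, **kwargs):
                if self.interactive:
                    # Safety: never write during interactive development.
                    print(f"[{self.name}] sink skipped (interactive mode)")
                    return None
                return func(*args, **kwargs)
            return wrapper
        return decorator


dp = DataProductSketch("silver_customers", interactive=True)


@dp.add_input()
def bronze_customers():
    return [{"customer_id": "c1"}]


@dp.set_sink()
def write_silver_customers():
    raise RuntimeError("would write to storage")


write_silver_customers()  # sink is bypassed, nothing is raised
print(dp.inputs)          # ['bronze_customers']
```

The key point: your functions stay plain Python, and the framework only observes which ones you register.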
Repeat!
Scaling your project is simply a matter of repeating this three-step workflow for every new asset:
- Define the schema and metadata in a manifest file.
- Implement the logic in a script, linking it to the manifest.
- Register the product in your context file.
Run the complete example yourself
The snippets above are simplified for clarity. You can find the full, runnable project, including the sample data and all five data products, here.
The Medallion Architecture
In the following examples, you will see us using terms like Bronze, Silver, and Gold. While tidylake is architecture-agnostic, we recommend adhering to a methodology such as the medallion architecture; it is used throughout our documentation because it is a proven standard for organizing data products.
Using tidylake's Tooling
Once your project is structured, tidylake unlocks powerful automation and safety features via its Command Line Interface (CLI).
Introspect the Lineage Graph
Tidylake parses your scripts to build an internal Directed Acyclic Graph (DAG). It automatically detects dependencies by looking at which products are used as inputs for others.
Use the tidylake list command to see your data products in their correct execution order:
```
$ uv run tidylake list
01. bronze_customers
02. bronze_profile
03. silver_customers
04. gold_customers
```
Automatic DAG Validation
The order above is determined by script dependencies, not the order in your configuration file. Tidylake ensures the graph is valid and will fail early if it detects circular dependencies. Learn more about detecting data product dependencies.
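The ordering and cycle check described above can be sketched with Python's standard-library `graphlib`. The edge list below is taken from the example project; the code is an illustration of the technique, not tidylake's internals:

```python
from graphlib import CycleError, TopologicalSorter

# product -> set of upstream products, as inferred from the scripts
deps = {
    "silver_customers": {"bronze_customers", "bronze_profile"},
    "gold_customers": {"silver_customers"},
}

# static_order() yields a valid execution order for the DAG.
order = list(TopologicalSorter(deps).static_order())
print(order)

# A circular dependency fails early with a CycleError:
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
except CycleError as err:
    print("cycle detected:", err.args[1])
```

This is why the configuration-file order is irrelevant: the execution order falls out of the graph itself.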
You can also generate a visual representation of your pipeline using Mermaid syntax:
```
$ uv run tidylake list --mermaid
flowchart TD
    bronze_customers["Bronze Customers"]
    bronze_profile["Bronze Profile"]
    silver_customers["Silver Customers"]
    gold_customers["Gold Customers"]
    bronze_customers --> silver_customers
    bronze_profile --> silver_customers
    silver_customers --> gold_customers
```
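Generating such a diagram from a dependency graph takes only a few lines. The `to_mermaid` helper below is a hypothetical sketch (the edge list matches the graph above; the title-casing mimics how node labels appear in the output):

```python
# Edges from the example project: (upstream, downstream)
edges = [
    ("bronze_customers", "silver_customers"),
    ("bronze_profile", "silver_customers"),
    ("silver_customers", "gold_customers"),
]


def to_mermaid(edges: list[tuple[str, str]]) -> str:
    """Render an edge list as a Mermaid flowchart (illustrative helper)."""
    nodes = sorted({n for edge in edges for n in edge})
    lines = ["flowchart TD"]
    # Declare each node with a human-readable label, e.g. "Bronze Customers".
    lines += [f'    {n}["{n.replace("_", " ").title()}"]' for n in nodes]
    lines += [f"    {src} --> {dst}" for src, dst in edges]
    return "\n".join(lines)


print(to_mermaid(edges))
```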
Execution in Batch Mode
During testing, in production, or in CI/CD environments, you will run the project in batch mode. This tells tidylake to execute the scripts in order and activates all Sinks to write data to your storage layer.
Use the tidylake run command to run the entire project or a specific subset:
```
$ tidylake run
⚡️ Running data product: bronze_customers
⚡️ Running data product: bronze_profile
⚡️ Running data product: silver_customers
⚡️ Running data product: gold_customers
```
Development in Interactive Mode
One of the most powerful features of tidylake is its "context awareness." You can run the exact same scripts interactively (using IPython kernels, Jupyter notebooks, or VS Code interactive windows) without modification.
When tidylake detects an interactive session, it shifts into safety mode:
- Disabled Sinks: The sink functions are bypassed. You can run your script as many times as you like without accidentally overwriting production data or creating "spurious" files.
- Quiet Logging: Production-level logging is silenced to keep your console clean for development.
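Detecting an interactive session can be done with a simple heuristic. The function below is an assumption for illustration (tidylake's actual detection logic may differ): an IPython/Jupyter kernel defines `get_ipython()`, while a plain Python REPL sets `sys.ps1`.

```python
import sys


def in_interactive_session() -> bool:
    """Heuristic sketch, not tidylake's actual detection logic."""
    try:
        get_ipython  # noqa: B018 -- defined only inside IPython/Jupyter
        return True
    except NameError:
        # Plain REPL sessions set sys.ps1; batch scripts do not.
        return hasattr(sys, "ps1")


print(in_interactive_session())  # False when run as a batch script
```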
This allows you to develop logic interactively while keeping the exact same code that will eventually run in production as a scheduled batch job. Learn more by reading the [notebook vs script dilemma](#batch-vs-interactive) design principle.
Agnostic to your environment
While we use .py files with VS Code Jupyter code cells in this guide, tidylake works with any interactive Python environment. Remember, it's about the methodology, not the IDE.
Next Steps: Leveling Up Your Workflow
You have now seen the core mechanics of tidylake: linking metadata to logic and managing execution contexts.
However, there is much more to explore. You can continue through the guide in order (start with the core concepts guide), or jump straight into the features that interest you most:
- Go beyond basic pandas: learn how to create a compute engine plugin to encapsulate I/O operations, enable schema automation for your data lake, or generate synthetic data directly from your manifests for rapid testing.
- (WIP) Discover how to implement and log data quality tests as a native part of your product lifecycle.
- (WIP) See how to export your tidylake DAG and integrate it seamlessly with orchestration engines like Airflow, Dagster, or Prefect.
You can also go straight to the demos to see examples of complete projects for a variety of frameworks.