Design principles
As established in our getting started guide, the primary objective of tidylake is to solve the Data Platform Balancing Act.
We built tidylake based on lessons learned from countless data pipeline projects and after testing a myriad of other frameworks. This methodology isn't just about a specific tool; it’s a set of principles for building and maintaining a sustainable Data Lake (or Lakehouse), regardless of your chosen framework or infrastructure.
Encapsulating the data product lifecycle
The term data product has been defined extensively and is often categorized into three distinct "waves" of meaning:
- The "Functional" Definition (DJ Patil, 2012): In his seminal O’Reilly book Data Jujitsu, Patil defined it as "a product that facilitates an end goal through the use of data."
- The "Architectural" Definition (Zhamak Dehghani, 2019): In the original Data Mesh principles, Dehghani treats it as an "architectural quantum." She defines it not just as a dataset, but as a self-contained unit that includes the data, the code (pipelines/transformation), the metadata, and the infrastructure (access ports).
- The "Operational" Definition (Industry Standard): Modern analyst firms like Gartner or McKinsey often define a data product as a "curated set of data, metadata, and semantics" designed to be immediately useful for specific business objectives. Among practitioners focusing on reliability, a data product is defined by its contracts and observability. Here, a product isn't a product unless it has a designated owner, a clear consumer base, and measurable "health" metrics.
When building tidylake, we chose this term intentionally. We aren't just building datasets; we are managing the full lifecycle of data by unifying:
- Code: Logic to transform raw inputs into refined outputs.
- Metadata: Descriptions of what the data means and how it should be used.
- Contracts: Explicit definitions that combine strict typing, semantics, and content restrictions.
By capturing this information in two compact files (the manifest and the script), the framework can agnostically automate almost every step of the lifecycle:
- Execution: Running the actual ETL/ELT processes.
- Discovery: Syncing technical and business metadata into data catalogs and governance platforms.
- Observability: Automatically generating lineage, provenance, and quality checks.
- Development: Creating synthetic samples for testing and early-stage modeling.
This approach provides data experts with a clear, unified view of their contracts and resources without the usual manual overhead.
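To make the manifest-plus-script pairing concrete, here is a minimal sketch of what a manifest for one data product might capture, expressed as a plain Python dict. The field names (`name`, `owner`, `script`, `schema`) are illustrative assumptions, not tidylake's actual manifest format:

```python
# Illustrative only: a minimal "manifest" for one data product.
# Real tidylake manifests may use a different format and field names.
manifest = {
    "name": "orders_clean",
    "description": "Deduplicated orders, one row per order id.",
    "owner": "sales-analytics",
    "script": "orders_clean.py",  # the paired transformation script
    "schema": {
        "order_id": {"type": "str", "nullable": False},
        "amount":   {"type": "float", "nullable": False},
        "country":  {"type": "str", "nullable": True},
    },
}

def column_names(m: dict) -> list[str]:
    """Derive the column list from the manifest (single source of truth)."""
    return list(m["schema"])
```

Because every downstream concern (execution, discovery, observability, development) reads from this one structure, the schema never has to be restated in catalog configs, test fixtures, or documentation.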
Being framework agnostic
We built tidylake to be flexible enough to accommodate the unique workflows and requirements of any data architecture or project.
To achieve this, we intentionally avoid hard-coding framework integrations. Instead, we focus on providing the metadata structure and helpers needed to encapsulate your project, leaving the specific implementation details to your own code and extensions.
We will now delve deeper into the core challenges tidylake addresses: bridging the gap between different personas and managing the complex lifecycle of a modern data lake.
The Notebook vs. Script Dilemma: Balancing Engineering and Analysis
Data projects fail when they can't accommodate the different ways people work. Striking a balance between "production-ready" and "research-friendly" is notoriously difficult:
- The Engineering View: DataOps and Data Engineers focus on productionizing complex workflows. They prioritize bug-free transformations, containers, and CI/CD pipelines. For them, scripts and version control are non-negotiable.
- The Analyst View: Business-oriented analysts and scientists need to wrangle data to find value. They prioritize documentation, discovery, and fast iteration. For them, notebooks and interactive sessions are the natural habitat.
These two approaches are often antagonistic. While other tools attempt to bridge this gap, they often impose rigid structures that are difficult for less technical users to adopt.
tidylake overcomes this dilemma with three key principles:
- Uniform Code: The exact same code is used for both batch and interactive runs. The framework handles the environment-specific outcomes automatically.
- Version Control First: Code is stored in plain formats that are easily managed in Git.
- Literate Programming: We keep logic plain and linear, mirroring the notebook experience. By avoiding deep nesting or complex class structures, we preserve the "beauty" and readability of the transformation logic.
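An environment switch like this can be implemented with a simple heuristic. The sketch below is not tidylake's actual implementation; it is one common way to detect whether code is running in an IPython/Jupyter session versus a batch process:

```python
import sys

def is_interactive_session() -> bool:
    """Best-effort check for an interactive (notebook/REPL) environment."""
    # IPython and Jupyter load the IPython package into the process.
    if "IPython" in sys.modules:
        try:
            from IPython import get_ipython
            if get_ipython() is not None:
                return True
        except ImportError:
            pass
    # Plain `python -i` sets the interpreter's interactive flag.
    return bool(sys.flags.interactive)

def save_or_preview(df, path: str) -> str:
    """Same code path everywhere: preview interactively, write in batch."""
    if is_interactive_session():
        return f"previewing {path}"  # e.g. display(df) in a notebook
    return f"writing {path}"         # e.g. df.to_parquet(path) in a pipeline
```

The point of the pattern is that the transformation script never branches on "am I in production?"; a single helper owns that decision, so the code a user writes is identical in both habitats.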
Check out how script files can be configured to see how these principles work in practice:
- Helpers automate repetitive I/O while keeping code testable.
- Transformations remain agnostic and free-form.
- The Automatic Switch ensures your code "just works" whether you are in a production pipeline or a VS Code interactive window.
Managing Data, Metadata, and Code During the Lake's Lifecycle
Most projects begin with a simple "read-transform-write" script. The real pain starts when you transition to a managed orchestration platform:
- The "Zero Day" Problem: How do you develop when you can't access the original production datasets?
- Schema Evolution: How do you safely branch your data lake to prepare for upcoming upstream changes?
- Testing: Do you have a way to unit test your logic without hitting your real infrastructure?
- Coupling: How tightly is your business logic tied to your orchestration tool (Airflow, Dagster, etc.)?
tidylake addresses these "DataOps" challenges by providing patterns that decouple your logic from your frameworks and infrastructure.
Centralized Metadata Management
In a modern lakehouse, metadata (names, types, descriptions) is everywhere: in your code, parquet files, SQL catalogs, BI tools, governance platforms and even in ML tools such as feature stores. Keeping these in sync manually is a recipe for disaster.
We use a central manifest as the single source of truth. This single file serves as a contract to automate several critical processes:
Schema Enforcement & Quality
Define your schema once. tidylake can then inject validation and quality checks directly into your workflow, verifying data against the contract before it is ever written.
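As a sketch of this pattern (in plain Python rather than tidylake's own API), a pre-write check derived from the manifest might look like this. The `SCHEMA` dict and `validate` helper are hypothetical names used for illustration:

```python
# Defined once, e.g. derived from the manifest's schema section.
SCHEMA = {"order_id": str, "amount": float}

def validate(rows: list[dict]) -> list[str]:
    """Return a list of contract violations instead of writing bad data."""
    errors = []
    for i, row in enumerate(rows):
        for col, typ in SCHEMA.items():
            if col not in row:
                errors.append(f"row {i}: missing column {col!r}")
            elif not isinstance(row[col], typ):
                errors.append(f"row {i}: {col!r} is not {typ.__name__}")
    return errors
```

A pipeline would call `validate` just before the write step and fail fast on a non-empty result, keeping malformed records out of the lake entirely.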
Automated Catalog Maintenance
Using a compute engine plugin, the CLI can automate table creation and schema evolution in your metastore. This separates your data lifecycle from your catalog maintenance—essential for modern formats like Delta Lake or Apache Iceberg.
Testing with Synthetic Data
Testing shouldn't require production data. tidylake can generate synthetic datasets directly from your manifest metadata, allowing you to run and test your transformation pipelines in a completely isolated environment.
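The generation step can be sketched roughly as follows. This is not tidylake's generator, just a minimal illustration of deriving fake rows from schema metadata; the type-name strings and `synthesize` helper are assumptions:

```python
import random
import string

def synthesize(schema: dict, n: int, seed: int = 0) -> list[dict]:
    """Generate fake rows that match a manifest schema (illustrative only)."""
    rng = random.Random(seed)  # seeded for reproducible test fixtures
    generators = {
        "str": lambda: "".join(rng.choices(string.ascii_lowercase, k=8)),
        "float": lambda: round(rng.uniform(0, 1000), 2),
        "int": lambda: rng.randint(0, 1000),
    }
    return [{col: generators[typ]() for col, typ in schema.items()}
            for _ in range(n)]
```

Because the fake data is derived from the same schema that production enforces, a pipeline that passes its unit tests on synthetic rows is exercising the real column names and types, not a hand-maintained fixture that can drift.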
Orchestration Agnosticism
tidylake provides a clean, graph-based SDK to invoke your data products. You can call your scripts directly from your orchestration code or via the CLI, ensuring your business logic remains independent of your scheduler.
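The underlying idea is that business logic lives in plain, scheduler-free functions, and each orchestrator gets only a thin adapter. A minimal sketch of the pattern (the function name is illustrative, not part of tidylake's SDK):

```python
def build_orders_clean(raw: list[dict]) -> list[dict]:
    """Pure business logic: no scheduler imports, trivially unit-testable."""
    seen, out = set(), []
    for row in raw:
        if row["order_id"] not in seen:  # keep first occurrence per order id
            seen.add(row["order_id"])
            out.append(row)
    return out

# Any orchestrator can call the same function through a thin adapter:
#   - Airflow:  a PythonOperator whose callable delegates to build_orders_clean
#   - Dagster:  an @op whose body delegates to build_orders_clean
#   - CLI:      an entrypoint that loads inputs and calls build_orders_clean
```

Swapping schedulers then means rewriting a few adapter lines, not the transformations themselves.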
Alternatives
While we have used and enjoyed many of the following tools, we found that they often solve specific technical hurdles rather than the entire data product lifecycle.
- jupytext: Excellent for pairing notebooks with version-control-friendly .py or .md files. However, it doesn't provide the metadata framework or the "Data Product" abstraction needed for governance.
- papermill: A powerful tool for parameterizing and running notebooks as batch jobs (a core philosophy of platforms like Databricks). While great for execution, it lacks a native way to manage schemas, lineage, or documentation outside of the notebook cells.
- nbdev: A sophisticated environment for developing Python packages entirely within notebooks, including tests and documentation. It is highly opinionated and powerful, but it can be a steep learning curve for teams that want to keep their metadata and transformation logic cleanly separated.
Why tidylake?
The main difference is focus. Most alternatives focus on how you write code (Notebook vs. Script). tidylake focuses on what you are building (the Data Product).
By putting the metadata manifest at the center, we ensure that documentation, data quality, and schema evolution aren't "afterthoughts" added to a notebook—they are the foundation that the code is built upon.