
Building a Snowflake data warehouse from scratch using Soda Core

11 min read

My name is Pavel, and I am a Data Engineer in the RingCentral Bulgaria office. Ensuring data quality in our Snowflake warehouse used to be an important part of my role. In this piece, I will share how our team tackled the challenge from scratch using Soda Core. It’s important to note that Snowflake’s primary and foreign key constraints are declarative only: you can define them, but they are not enforced, so the rules they describe have to be validated separately. I won’t delve into every aspect of data quality assurance or final reporting; instead, I will focus on the challenges we faced and how we addressed them with Soda Core.

Ensuring data quality from the early stages of building a data warehouse is crucial. In modern data pipelines, errors in the initial stages can propagate, leading to significant issues later on. Therefore, building quality checks early on can help avoid problems as data volumes and complexity grow.

Soda Core is an open-source tool that makes it easy to write data quality checks and run them as SQL against the warehouse. It connects to databases and can be driven from Python, which makes validating data more flexible and user-friendly.


The Challenges of Modern Data Stacks

The modern data stack is often more agile in its approach to database constraints, focusing more on scalability and speed than enforcing rigid data rules upfront. While this brings certain advantages, such as faster iteration and development cycles, it also means that monitoring and validating data quality becomes essential. Without rigorous checks, data integrity may be compromised over time.

This flexibility presents a unique challenge for data quality assurance teams. Modern data warehouse architectures, often built on cloud platforms like Snowflake, BigQuery, or Redshift, prioritize scale and ease of use over strict data governance. For data quality teams, this becomes a balancing act: how to ensure high data quality without compromising the flexibility and performance that modern architectures offer.

This article will explore how we navigated these challenges, particularly focusing on the use of Soda Core for building a robust data quality monitoring system.

What We Had

Snowflake and the Need for Monitoring Constraints

We were working with Snowflake, which required us to actively monitor data constraints. Unlike traditional databases, modern data stacks like Snowflake don’t enforce strict rules such as Primary Keys (PK) or Foreign Keys (FK). This flexibility can be an advantage, but it also places the burden of enforcing and monitoring these constraints on the data quality team. Ensuring that data meets these constraints becomes a critical part of the data quality process.

Data Vault Architecture and the Complexity of Monitoring Multiple Tables

Our architecture was based on Data Vault, which gave the team the option to quickly connect and disconnect data sources. Data Vault’s design allows seamless changes in the sources without having to fully rebuild the data warehouse’s layers. For us, this meant the data quality process had to be equally adjustable, enabling the quick deployment of tests for new sources. With this architecture, it’s not just about defining constraints for the incoming data, but also leveraging the warehouse’s metadata to streamline other aspects of data testing and monitoring.

The goal was to maintain simplicity and efficiency in deploying data quality checks for new sources. We realized the value of using metadata not only to define constraints but also to automate broader data quality monitoring tasks, improving both speed and adjustability.

Limited Resources and the Search for Automation

Given our limited resources, we needed a tool that could automate as much of the process as possible. Writing SQL queries based on metadata and executing them directly in the database might seem straightforward—and often it is. However, we wanted a solution that minimized manual intervention and allowed us to defer complex tasks without sacrificing data quality. Automation was key to keeping our processes efficient and scalable.

Team Skillset: SQL Over Python

Our team consisted of members who were far more proficient in SQL than Python. Based on this, we chose to follow a “SQL-first” approach. Although we still used Python where necessary, all critical checks and validations were written in SQL. This ensured that our data quality checks were understandable and accessible to everyone on the team, including business analysts and testers. A Python-first approach, or a tool like Great Expectations, would have limited that accessibility.

Dealing with Uncertainty: Defining Data Quality Metrics

We faced a lot of uncertainty regarding which data quality metrics would ultimately prove valuable. While data freshness was a top priority, we were uncertain about how to measure other important metrics like accuracy or detect anomalies over time. This uncertainty meant we had to remain flexible and ready to experiment, adjusting our metrics as our understanding of the data and business needs evolved.

Working Within the Existing Tech Stack

Another consideration was integrating with our existing tech stack. Our data warehouse was populated using Airflow, so it made sense to leverage Airflow for data quality monitoring as well. This approach allowed us to maintain a unified, coherent process and avoid introducing unnecessary complexity by adding new tools that didn’t align with our existing workflow.
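To make this concrete, a Soda Core scan can be wrapped in a regular Airflow task so that failed checks fail the task. A minimal sketch, assuming Airflow 2.x; the DAG id, file paths, and data source name are illustrative placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from soda.scan import Scan

def run_soda_scan():
    scan = Scan()
    scan.set_scan_definition_name("dwh_daily_checks")
    scan.set_data_source_name("snowflake_dwh")                # declared in configuration.yml
    scan.add_configuration_yaml_file(file_path="soda/configuration.yml")
    scan.add_sodacl_yaml_files("soda/checks.yml")             # SodaCL file with the checks
    scan.execute()
    scan.assert_no_checks_fail()                              # raise so Airflow marks the task as failed

with DAG(
    dag_id="data_quality_checks",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_soda_scan", python_callable=run_soda_scan)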

If you’re working in a similar environment, this article might provide insights that are directly relevant to your own challenges.

What is Soda and Why We Used It

The Leading Competitor: Great Expectations

In the realm of data quality, Great Expectations stands out as a feature-rich framework. It provides a wide range of capabilities for data validation and monitoring, including:

  • Extensive built-in checks: Great Expectations offers a robust library of predefined validations, ranging from schema checks to anomaly detection and value testing.
  • Automated documentation and visualization: It generates documentation and detailed reports for every data check, which simplifies tracking and communication about data quality across teams.
  • Multiple integrations: Great Expectations supports a wide array of data sources — databases, CSV files, Parquet, and more — which makes it a versatile tool for almost any data pipeline.

However, while Great Expectations is undeniably powerful, we chose Soda Core for a few key reasons that made it a better fit for our use case.

What Soda Core Brings to the Table

While Great Expectations is a powerful and versatile tool, Soda Core offers a number of features that made it the right choice for our project, providing almost all the necessary functionality with a SQL-first approach. Here’s why we chose Soda Core:

SQL-First Approach and Automation

One of Soda Core’s biggest advantages is its SQL-first design. It allows users to define and execute data quality checks directly in SQL, which simplifies the validation process, especially for teams more comfortable with SQL. Additionally, Soda Core can be run as a Python script, automatically generating SQL queries. This makes it easy to automate the process of data validation.

For example, a simple check can be initiated like this:

from soda.scan import Scan

scan = Scan()
scan.set_scan_definition_name("example_scan")
scan.set_data_source_name("snowflake_dwh")  # a data source declared in configuration.yml
scan.add_configuration_yaml_file(file_path="configuration.yml")
scan.add_sodacl_yaml_str("""
checks for my_table:
  - missing_count(value) = 0
""")
scan.execute()

Support for Custom SQL Queries

Custom SQL checks are another key feature. Soda Core enables users to define highly specific and complex checks, tailored to business rules or unique data patterns. This flexibility was crucial for us, enabling the direct handling of complex business logic through custom SQL queries. This flexibility ensured that even the most nuanced data quality requirements were covered.
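As an illustration, such a rule can be added to the scan from the snippet above (before calling execute()) as a user-defined metric backed by arbitrary SQL; the table, column, and metric names here are placeholders:

# A custom SQL check expressed as a SodaCL user-defined metric
scan.add_sodacl_yaml_str("""
checks for orders:
  - orders_with_negative_total = 0:
      orders_with_negative_total query: |
        SELECT COUNT(*)
        FROM orders
        WHERE total_amount < 0
""")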

Wide Database Support, Including Snowflake

Soda Core supports a wide range of relational databases, including Snowflake, making it versatile for different infrastructures. This flexibility allowed us to easily integrate it into our existing system, regardless of the database technology in use. Soda Core’s database connectors enable seamless connectivity across modern data warehouses such as Snowflake, Redshift, BigQuery, Spark, Vertica, ClickHouse, and others.

Optimized SQL Execution

The SQL queries generated by Soda Core are optimized for performance, ensuring that even on large datasets, the checks run efficiently. This is critical for large-scale data warehouses where performance is a top priority. Optimized queries mean faster checks without compromising accuracy or completeness.

Metadata Extraction from Checks

Soda Core enables users to extract key metadata from validation results, such as the number of records checked, errors found, and other important metrics. This helps teams monitor data quality over time and make informed decisions based on comprehensive data.

For example:

results = scan.get_scan_results()  # a plain dict with a "checks" list
for check in results["checks"]:
    print(f"Check: {check['name']}, outcome: {check['outcome']}")

Switching Between Warehouses and Databases

Soda Core supports switching between different warehouses and databases within the same project: each connection is declared as a named data source in the configuration YAML, and a scan simply selects which one to run against. This is particularly useful in environments where data resides in multiple isolated systems or stages of development.

Example:

scan.set_data_source_name("prod_analytics")  # a data source defined in the configuration YAML

Open Source and Accessibility

Soda Core is part of the larger Soda ecosystem and is an open-source solution, which means it can be used without licensing costs or additional fees. This makes it accessible to teams of all sizes and budgets, offering a high degree of flexibility without financial constraints.

Customizable Severity Levels for Checks

Soda Core allows setting a severity level for each check through warn and fail conditions, which helps prioritize issues. For instance, a fail outcome can be used to stop the ETL process, while a warn simply triggers an alert or shows up in reports.

checks for my_table:
  - row_count:
      warn: when < 1000
      fail: when = 0
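When a check does hit its fail condition, the scan outcome can be used to stop the pipeline. A minimal sketch, reusing the scan object from the earlier example; notify_team is a hypothetical alerting helper:

scan.execute()
if scan.has_check_fails():                 # at least one check hit its fail condition
    notify_team(scan.get_logs_text())      # notify_team is a hypothetical alerting helper
scan.assert_no_checks_fail()               # raises, so an orchestrator marks the task as failed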

Schema, Table, and Column-Level Checks

Soda Core supports validations at multiple data levels—schema, table, and column. This allows for fine-grained control over data quality checks depending on the requirements of the project. You can validate specific data points or enforce broader constraints on entire datasets.

Example of a YAML configuration:

checks for my_table:
  - schema:
      fail:
        when required column missing: [my_column]
  - row_count > 0
  - missing_percent(my_column) < 5%

Final Thoughts

Great Expectations remains an excellent choice for projects that demand extensive out-of-the-box data quality checks and where Python is the preferred language. But for those seeking a more streamlined, SQL-centric solution, Soda Core offers almost all of the same key features with added simplicity. It’s a lightweight, highly adaptable framework that integrates smoothly with modern data stacks while offering powerful data validation capabilities.

How We Used Soda Core for Data Quality

In our project, we structured data quality checks into three primary categories: constraints checks, technical checks, and business checks. This helped us manage different types of validations efficiently, using a combination of automated metadata-driven queries and custom scripts developed with business analysts.

Types of Checks

1. Constraint Checks

Constraint checks focus on ensuring that the data adheres to specific rules and database-level restrictions, such as uniqueness, references, data formats, and required fields (e.g., NOT NULL). These checks help maintain the structural integrity of the data. We were able to generate constraint checks from the database metadata, automating the validation process and making monitoring more efficient.

For example, if a table must contain unique identifiers, we could set up a query that automatically counts duplicate keys and flags any violations.

Example SQL Query for Constraint Check:

SELECT COUNT(*)
FROM (
    SELECT id
    FROM my_schema.my_table
    GROUP BY id
    HAVING COUNT(*) > 1
) duplicates

2. Technical Checks

Technical checks ensure the stability and correctness of the data infrastructure. These include checks for data freshness, timeliness of updates, data volumes, and the smooth operation of ETL pipelines. These checks guarantee that data is ingested on time and error-free, which is crucial for maintaining the integrity of the data warehouse.
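In SodaCL, the bread-and-butter technical checks are row counts and freshness; a short sketch with placeholder table and column names, registered on the scan in the same way as the earlier snippets:

scan.add_sodacl_yaml_str("""
checks for stg_orders:
  - row_count > 0                # the load delivered data at all
  - freshness(loaded_at) < 1d    # data is no more than a day old
""")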

3. Business Checks

Business checks validate data against specific business logic or requirements, which are defined in collaboration with system and business analysts. These checks are unique to each project, often validating data correctness according to the specific calculations, ranges, or logical conditions relevant to the business.

Business checks are custom-built and stored separately from the general constraint or technical checks, allowing us to manage them flexibly and adapt quickly to changes in business processes.

Example Business Check:

SELECT COUNT(*)
FROM sales
WHERE revenue < cost

This counts the rows in the sales table that violate the business rule that revenue should never fall below cost.
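The same rule can also be run through Soda Core as a failed-rows check, so the scan counts (and can sample) the offending rows; a sketch reusing the sales table from above:

scan.add_sodacl_yaml_str("""
checks for sales:
  - failed rows:
      name: Revenue must not be lower than cost
      fail query: |
        SELECT *
        FROM sales
        WHERE revenue < cost
""")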

Automated Check Generation from Metadata

We automated the generation of constraint checks by extracting relevant information directly from the database metadata. By accessing details about table relationships, fields, and schemas, we could dynamically create SQL queries to validate these constraints.

For example, to validate foreign key relationships between layers of the warehouse, we generated SQL queries from metadata like this:

Example Foreign Key Check:

checks for fk_table:
  - values in (fk_column) must exist in pk_table (pk_column)
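Conceptually, the generation step is just a loop over the metadata that renders one reference check per relationship and hands the result to the scan. A simplified sketch, where fk_metadata stands in for whatever the metadata query returns; the table and column names are illustrative:

# (fk table, fk column, pk table, pk column) tuples extracted from warehouse metadata
fk_metadata = [
    ("sat_orders", "customer_hk", "hub_customer", "customer_hk"),
    ("link_order_product", "product_hk", "hub_product", "product_hk"),
]

sodacl_blocks = []
for fk_table, fk_column, pk_table, pk_column in fk_metadata:
    sodacl_blocks.append(
        f"checks for {fk_table}:\n"
        f"  - values in ({fk_column}) must exist in {pk_table} ({pk_column})"
    )

scan.add_sodacl_yaml_str("\n".join(sodacl_blocks))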

However, one limitation of Soda Core’s standard package is the inability to check foreign key constraints across different layers of the warehouse. To address this, we generated custom queries for such cases:

SELECT COUNT(SOURCE.fk_column_name, SOURCE.fk_column_name2)
FROM database_name.schema_name.FK_NAME SOURCE
LEFT JOIN database_name.schema_name.PK_NAME TARGET
  ON SOURCE.fk_column_name = TARGET.pk_column_name
  AND SOURCE.fk_column_name2 = TARGET.pk_column_name2
WHERE (SOURCE.fk_column_name IS NOT NULL AND TARGET.pk_column_name IS NULL)
   OR (SOURCE.fk_column_name2 IS NOT NULL AND TARGET.pk_column_name2 IS NULL)
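One way to run such a hand-written query through the same scan is to wrap it in a user-defined metric check and assert that the orphan count is zero; a sketch in which fk_orphans is a made-up metric name and the object names are carried over from the query above:

scan.add_sodacl_yaml_str("""
checks for FK_NAME:
  - fk_orphans = 0:
      fk_orphans query: |
        SELECT COUNT(SOURCE.fk_column_name, SOURCE.fk_column_name2)
        FROM database_name.schema_name.FK_NAME SOURCE
        LEFT JOIN database_name.schema_name.PK_NAME TARGET
          ON SOURCE.fk_column_name = TARGET.pk_column_name
          AND SOURCE.fk_column_name2 = TARGET.pk_column_name2
        WHERE (SOURCE.fk_column_name IS NOT NULL AND TARGET.pk_column_name IS NULL)
           OR (SOURCE.fk_column_name2 IS NOT NULL AND TARGET.pk_column_name2 IS NULL)
""")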

Profiling for Technical Checks

For technical checks, we performed data profiling depending on the attribute types. Profiling results were used to drive validation at the warehouse level. The checks differed based on whether the attribute was numeric, text, or date-based:

Numeric Columns:

  • Minimum value
  • Maximum value
  • Five smallest/largest values
  • Most frequent values
  • Average, sum, standard deviation, variance
  • Count of distinct and missing values
  • Histogram of values

Text Columns:

  • Most frequent values
  • Count of distinct and missing values
  • Average, minimum, and maximum length

Dates were not profiled, but technical checks on timeliness and freshness were applied directly to date fields.
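Soda Core also ships a column-profiling configuration in SodaCL that produces a similar set of numeric and text statistics; a minimal sketch with a placeholder table name (the profiling output ends up in the scan results, or in Soda Cloud if one is connected):

scan.add_sodacl_yaml_str("""
profile columns:
  columns:
    - my_table.%    # profile every column of this table
""")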

Storing Results for Iterative Improvement

We stored the results of these checks in a separate database. This allowed us to maintain a history of validation results and continuously improve our data quality approach over time. By keeping a record of checks, we could apply them retrospectively to historical data, which is especially important for evolving time-series validation.
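In practice this can be as simple as flattening the scan results into one row per check and inserting them into a history table. A sketch, where save_check_results is a hypothetical helper that writes rows to our results database:

import json
from datetime import datetime, timezone

results = scan.get_scan_results()  # plain dict with checks, metrics, and logs
rows = [
    {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "check_name": check.get("name"),
        "dataset": check.get("table"),
        "outcome": check.get("outcome"),  # pass / warn / fail
        "diagnostics": json.dumps(check.get("diagnostics") or {}),
    }
    for check in results.get("checks", [])
]
save_check_results(rows)  # hypothetical helper: INSERT into a history table in a separate database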

Final Thoughts on Our Approach

This structured approach—separating checks by type, generating them from metadata, and leveraging profiling—allowed us to comprehensively monitor and ensure data quality across both technical and business dimensions. As our data quality needs evolved, we could easily extend the system to adapt, all while maintaining clear visibility into the performance of each type of check.

Next Steps

  • Integrate Additional Data Warehouses: We can expand our process to include multiple data warehouses by adding connectors. Such adjustability allows us to monitor and validate data across different environments seamlessly.
  • Build and Visualize Data Quality Dimensions: For deeper analysis of data quality, it’s essential to define and visualize data quality dimensions such as freshness, completeness, accuracy, and consistency. These dimensions provide a clearer understanding of data health and highlight areas that require attention. By integrating these dimensions into visualizations using Business Intelligence (BI) tools, we can offer intuitive representations of data quality, making interpretation easier.
  • Implement Real-Time Checks During Data Ingestion: We plan to perform validations as data is ingested into the warehouse, with the ability to halt the data-loading process in the event of validation failures. This proactive approach helps maintain data integrity right from the outset.

Summary

Soda Core has proven to be particularly user-friendly for SQL developers, utilizing Python primarily for executing tasks rather than for writing checks. This design streamlines workflows, enabling developers to focus on familiar SQL for data validation while avoiding the complexities of Python programming.

One of our key innovations was generating SQL checks from warehouse metadata, a capability we built on top of Soda Core. This approach provides significant flexibility, allowing us to easily adapt our checks as data structures and requirements change. As a result, our data quality framework has become more dynamic and responsive, ensuring we can maintain high standards even as our data landscape evolves.

Originally published Feb 03, 2025
