Tools like Snowflake have revolutionized access to data. With just a few clicks and a basic understanding of SQL, almost anyone can retrieve and manipulate data. No installation, no scaling problems, pay-per-compute—it just works and makes sense. This fuels innovation, scalability, and velocity in an organization, but it also introduces a new type of complexity and drives up data management costs.
Optimizing Snowflake for cost and performance stands as a critical, yet complex, task. Balancing fast, reliable data delivery against cost-effective resource utilization requires deep technical expertise—from data modeling and partitioning to workload management. The days when a dedicated database administrator (DBA) would handle these challenges is long gone. Today's data leaders are left to grapple with both the technical aspects of data management and the strategic integration of these technologies into broader business objectives.
Given this skill gap, some companies turn to cost observability dashboards (which in reality only add to the cognitive load). Some have unfortunately normalized the escalating complexity and cost as just "part of doing business."
The challenge of balancing cost, performance, and accessibility
Why is optimizing Snowflake so difficult? It involves juggling multiple priorities: cost, performance, and accessibility. These challenges are compounded by the complexity of today’s cloud data environments, where hundreds of users may run queries across numerous warehouses from multiple tools. Traditional cost tracking tools, often no more than sophisticated dashboards, frequently fail to provide actionable insights, especially at scale.
Organizational complexity adds another layer of difficulty. Different teams within a company—data engineering focusing on query performance and data quality, finance on costs, and business intelligence on dashboard integrity—often have competing and conflicting priorities. Without a clear strategy for optimization, these divergent goals can lead to suboptimal resource usage and wasteful spending.
Plus, optimization decisions almost always involve trade-offs. How much latency might you introduce for downstream teams by reconfiguring settings like max clusters (on a multi-cluster warehouse)? Is enabling query acceleration worth the extra cost? What's the right balance between cost savings and performance impact? To make a good decision, data leaders need a clear-eyed view of the downstream effects, something that is lacking in traditional tools.
Obvious optimizations
Luckily despite these challenges, there are clear-cut opportunities for optimization within Snowflake environments that can yield immediate benefits. Here are a few:
Adjusting auto-suspend settings to some relatively low value (1-5 minutes, depending on the type of workloads and downstream application sensitivity to latency) has immense cost impact, especially on low-utilized warehouses
Consolidating underutilized warehouses to improve efficiency
Setting up resource monitors to alert you to overspending, flag highly-inefficient queries, and avoid remote spillage
Right-sizing warehouses during periods of low activity
These are all straightforward strategies that can significantly enhance efficiency. While these quick fixes offer immediate savings, to achieve ongoing data optimization, practitioners must delve deeper into the intricacies of data operations to uncover less obvious inefficiencies. This requires more sophisticated analysis than the standard dashboard can provide.
Here are 3 examples of ML-powered insights and impact analysis done using 8knots:
Example 1 - Cost savings without noticeable tradeoff
First, let’s look at a warehouse used for machine learning workloads (typically driven by scheduled jobs that prepare some ML inputs). Below is an hourly activity chart over the course of a typical week:
The blue bars show time (in minutes) that the warehouse spent actually executing queries every hour, while the green line shows the hourly compute cost. What’s going on? What explains the ‘hidden cost’? The answer: idle time and provisioning time—time that you pay for, but don’t get any benefit from.
8knots shows you that ‘wasted time’ as red bars stacked on top of the blue bars:
What is causing all this waste? To find out, let’s look at some of the warehouse settings:
Our first suspect, the auto-suspend timeout, is set to 600 seconds (10 minutes). Snowflake’s documentation clearly presents the trade-offs of adjusting this parameter, offering various rules of thumbs depending on your business needs. It boils down to a trade-off between cost (wasting time on idling, a minimum resumption cost of 60 seconds) and latency (caching benefits, re-provisioning overhead).
But how is one supposed to make that sort of tradeoff decision without looking at concrete numbers? This is where the 8knots Impact Analyzer comes to the rescue. With the click of a button we can examine the impact of changing the auto-suspend timeout from 10 minutes to 1 minute on the workloads submitted within that week:
We see that this change alone causes spending to drop by $548/week (or $28.5K annually, a whopping 75% of the weekly cost of this warehouse). And the best part is that the model suggests very minimal impact on query latency; only a few queries with latency greater than 10 seconds will be impacted. What we see here is a huge gain with only a minor downside. A clear win! In fact, 8knots will generate an insight (and an alert) to the user, recommending that they apply this change:
Example 2 - How can I improve query latency of my Looker users?
Let’s look at a typical week of activity in a warehouse serving queries coming from Looker dashboards:
A few users have been complaining about “slow queries” from their dashboards. Is there anything we can do at the warehouse setting level to help reduce this latency?
First, let’s examine the impact of changing two parameters that impact latency: auto-suspend timeout and max clusters (this is a multi-cluster warehouse, but we are currently not taking advantage of it as the max clusters parameter is set to 1).
Here’s what it looks like when we raise the auto-suspend timeout from 60 to 120 seconds:
And here’s what happens when we raise the max clusters setting from 1 to 2:
We can clearly see that the second approach is preferable in terms of cost-performance trade-offs, adding only $403 annually (versus $1272 when we adjust the auto-suspend timeout), while significantly reducing latency (especially on queries greater than 10 seconds, where the typical Looker user starts getting annoyed).
Is paying $400 a year for a better Looker experience worth it? That’s up to the user. But at least they are now aware of the options and can make the smart and cost-effective decision.
Example 3 - Is resizing worth it?
Let’s continue examining the options with the example above.
We forgot about another parameter other than latency that affects the user's perception of ‘slow’: execution time. The main parameter that affects execution time is warehouse size (there are others like Query Acceleration Service which we will discuss in a future blog post). But changing the warehouse size does not affect all queries equally.
For example, queries running for just a few seconds on an existing warehouse typically have sufficient CPU/RAM resources with the existing machine size, so they will likely gain very little (if anything) from running on a larger machine. However, queries consuming more CPU/RAM resources, especially those spilling to local or remote storage, will greatly benefit from larger machines (with more RAM, SSD, and CPU power). The catch? Increasing the size doubles the cost (let’s ignore SPO warehouses for now).
Is it worth bumping up the size from Small to Medium for this particular warehouse? Let’s use the Impact Analyzer to see what happens:
The histogram on the right shows how much speed is gained from bumping the size from Small to Medium across three different runtime buckets. Overall, the average speed gain is only 2.4%, which is insignificant, while the cost impact is estimated to be $3.1K (essentially doubling the annual compute cost of this warehouse). While it does help with latency as well, the impact is just on a handful of queries.
So the question becomes: is increasing the size of the warehouse worth the $3.1K annual cost for a very modest performance gain? Like virtually every DataOps trade-off, this is a business decision, but the Impact Analyzer clearly shows us that the alternative options are a lot more cost effective.
While the 8knots Impact Analyzer offers detailed insights into optimizing your Snowflake environment, not every recommended change may align with your business's specific needs or strategy. 8knots is deliberately designed to respect the nuances of your business context; it provides recommendations without taking control, empowering you to decide which optimizations to implement based on your organizational goals and constraints.
In conclusion, optimizing Snowflake effectively requires more than technical adjustments; it demands a thorough understanding of both technical possibilities and their business implications. 8knots equips data leaders with the tools to make informed decisions that optimize performance and cost while respecting the unique strategic needs of their organizations. This balanced approach ensures that optimizations contribute positively to both the technical and business aspects of your operations.
Comentários