Anomaly detection using dynamic thresholds and two-year-long alerts in Cloud Monitoring

June 30, 2026
Lee Yanco

Senior Product Manager

Daniel Koss

Staff Software Engineer

Choosing the threshold of an alert policy can be a headache. You have to analyze historical data, aggregate it into semantically meaningful time series, and choose a threshold that matters. If the workload grows, your previously set static threshold might become too low, and your alert might fire too frequently. New workloads might require setting new thresholds, and setting separate thresholds for separate workloads requires creating separate policies, resulting in the annoyance of managing a fleet of mostly similar policies.

Not to mention, some metrics can’t even be alerted on using static thresholds. If your metric varies by time of day, like many e-commerce metrics do, then no single threshold will work. For example, what do you do if your metric looks like this:

https://storage.googleapis.com/gweb-cloudblog-publish/images/qtfse9nqWC88b92.max-2200x2200.png

Clearly something went wrong in the middle of that chart… but because the anomalous value is within the normal range of the daily data, no static value threshold can ever catch it.

Introducing long lookbacks and dynamic thresholding

We are pleased to announce that this problem is now solvable for users of Cloud Monitoring alerts with the launch of long-lookback alert policies for PromQL, currently in preview. This highly requested feature update now lets you configure PromQL alert policies to run over two years of metric data stored in Cloud Monitoring, supporting year-over-year and quarter-over-quarter analysis. 

One major use case unlocked by two-year lookback horizons in PromQL is dynamic thresholding, that is, policies where the threshold refers to the metric’s history. A simple example is an alert policy that says “alert me if the average over the last 5 minutes is more than 2x the average over the last week.” Instead of setting a static number as your threshold, you set how far each time series must deviate from its own historical data before generating an alert. This allows flexibility in policies, supports naturally changing baselines caused by growth in workloads, and provides a single threshold that works for all workloads. You don’t have to analyze every time series to set alerts properly; just set a factor that signals “anomalous” to you.

Take the above example: To catch that anomaly, you might create a policy that says “alert me if the value over the last 5 minutes is lower than 70% of the value from the same 5-minute span one week ago.” Such a policy would create a threshold that varies by the time of day, and you would catch the anomalous drop:

https://storage.googleapis.com/gweb-cloudblog-publish/images/8JX8WREHZPq68Fc.max-2100x2100.png

Dynamic threshold algorithms

Choosing the right dynamic threshold algorithm in PromQL depends on the shape of your source data. Metrics that vary by time of day need a different algorithm than metrics that have little variation. 

You can rewrite the examples below to use the historical data query as your threshold (putting a metric after the < or >), but if you do so, you can’t easily visualize the threshold.

Because these algorithms rely on historical data, granular alert policies that trigger on individual workloads instead of aggregates might be flaky when you spin up new workloads, which have no history yet. This issue resolves itself as you accrue historical data. You can also avoid it by running dynamic threshold alerts only on aggregates.

Moving averages
In the simplest of the algorithms, alerts trigger when the recent trend of the data deviates from a moving average of data over a long period of time. This is good for catching anomalies in relatively stable data. 

Here’s some example PromQL, comparing the last 5 minutes to a one-week baseline and alerting if it’s 30% higher or lower than average:
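
A minimal sketch of such a query, not necessarily the authors' exact expression. It assumes a hypothetical gauge metric named request_rate; substitute your own metric and tune the 0.3 factor:

```promql
# request_rate is a hypothetical gauge metric used for illustration.
# Relative deviation of the 5-minute average from the one-week average.
abs(
  avg_over_time(request_rate[5m]) - avg_over_time(request_rate[1w])
)
/
avg_over_time(request_rate[1w])
> 0.3
```

Taking the absolute value lets a single policy catch deviations in either direction. The division assumes the weekly average is positive; a metric that hovers around zero would need a different normalization.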

https://storage.googleapis.com/gweb-cloudblog-publish/images/4v9HQ8snDJbP2oR.max-1400x1400.png

You can also write this as a direct comparison, which might be more understandable. The following says “alert me if the average of the most recent 5 minutes of data is more than 1.3x the weekly average”:
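
One possible form of that comparison, shown as a sketch under the assumption of a hypothetical gauge metric named request_rate:

```promql
# request_rate is a hypothetical gauge metric used for illustration.
# Fires when the 5-minute average exceeds 1.3x the one-week average.
avg_over_time(request_rate[5m]) > 1.3 * avg_over_time(request_rate[1w])
```

This form only catches spikes. Catching drops as well would need a second condition, such as the 5-minute average falling below 0.7x the weekly average.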

Z-score (standard deviation)
Use this algorithm to identify anomalies based on the average and standard deviation of your data. A z-score measures how many standard deviations your recent data is from the historical average; a common rule of thumb treats a z-score above three or below negative three as anomalous. This compares the current deviation of your data to its usual noisiness, and it works best with data that has a stable average and moderate volatility.

Example PromQL, comparing the last 5 minutes to the one-week average and standard deviation:
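
A z-score condition could be sketched as follows. This is an illustration rather than the authors' exact query, and request_rate is a hypothetical gauge metric:

```promql
# request_rate is a hypothetical gauge metric used for illustration.
# z = (recent average - weekly average) / weekly standard deviation
abs(
  (avg_over_time(request_rate[5m]) - avg_over_time(request_rate[1w]))
  /
  stddev_over_time(request_rate[1w])
) > 3
```

The threshold of 3 follows the common rule of thumb for z-scores. A perfectly flat series has a standard deviation of zero, so the ratio is not meaningful there; this approach suits metrics with some natural noise.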

Example z-score signal and the resulting anomaly detection threshold:

https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_VFrHPBv.max-1700x1700.png

Seasonal decomposition (time offset comparison)
This is a simple time-offset algorithm that compares time-series data in a given period to the same period from the previous day or week. This is ideal for metrics with recurring time-based patterns, such as website visits that vary by time of day and day of week. Holidays and other factors that might cause a given day to be lower than expected can be smoothed out by averaging more than one historical period (e.g., average one week ago, two weeks ago, and three weeks ago, then compare that average to today).

Example PromQL, comparing the last 5 minutes to the same time period yesterday, alerting if the recent data is more than 50% lower than the one-day offset data:
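
A sketch of this kind of query, assuming a hypothetical gauge metric named request_rate and using the offset modifier to read the same 5-minute window from one day earlier:

```promql
# request_rate is a hypothetical gauge metric used for illustration.
# Relative change versus the same 5-minute window yesterday.
(
  avg_over_time(request_rate[5m])
  - avg_over_time(request_rate[5m] offset 1d)
)
/
avg_over_time(request_rate[5m] offset 1d)
< -0.5
```

A result of -0.5 means the recent average is half of yesterday's value for the same window, so the condition fires only on drops of more than 50%.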

This can be algebraically rewritten as:
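
In sketch form, with request_rate as a hypothetical gauge metric, a drop of more than 50% relative to yesterday is the same as the recent average being less than half of yesterday's value:

```promql
# request_rate is a hypothetical gauge metric used for illustration.
# (x - y) / y < -0.5  is equivalent to  x < 0.5 * y  when y > 0.
avg_over_time(request_rate[5m])
< 0.5 * avg_over_time(request_rate[5m] offset 1d)
```

The equivalence relies on yesterday's value being positive, since multiplying both sides of an inequality by a negative number would flip its direction.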

https://storage.googleapis.com/gweb-cloudblog-publish/images/Bkby4f9LySHuz75.max-1500x1500.png

In production, you might want to compare to the same period one week ago, or compare to an average of the same period one and seven days ago, to avoid triggering on naturally lower days such as weekends and holidays:
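
One way to average the two baselines, sketched with a hypothetical gauge metric named request_rate and the same 50% drop factor as an assumed starting point:

```promql
# request_rate is a hypothetical gauge metric used for illustration.
# Baseline = mean of the same 5-minute window one day and seven days ago.
avg_over_time(request_rate[5m])
<
0.5 * (
  avg_over_time(request_rate[5m] offset 1d)
  + avg_over_time(request_rate[5m] offset 7d)
) / 2
```

More offsets (for example 14d and 21d) can be added to the sum, with the divisor adjusted to match, to further smooth out unusual days in the baseline.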

When using time offsets, you can only reliably trigger on either drops or spikes, as triggering on both sudden drops and sudden spikes in a single policy may cause your alerts to fire twice.

Think of it this way: If traffic drops steeply today, your alert will trigger immediately. However, exactly 24 hours later, today's anomalous drop becomes tomorrow's historical baseline. If your policy triggers on any anomalous difference (higher or lower), the sudden "return to normal" tomorrow will look like a massive spike relative to yesterday's dip, and you will get a false alert for a phantom anomaly. You can see this in the above chart — the dip in the signal (blue line) reappears as its reciprocal exactly 24 hours later.

To prevent this, you should only track either drops or spikes when monitoring any given metric.

Control runaway costs using dynamic thresholds

Once you can trigger an alert based on deviations from a historical baseline, many interesting use cases open up. For example, you can use dynamic thresholding to prevent overspend for any Google Cloud service that offers a metric that roughly tracks spend.

Say you are concerned about runaway AI token costs. You could do the following:

  • Configure a dynamic threshold alert that triggers if the input/output token usage rate over the most recent 10 minutes is more than 25x the one-week historical average rate, which should only catch extreme anomalous scenarios (such as leaked API keys) that will definitely result in overspend:

      sum(rate({"__name__"="aiplatform.googleapis.com/publisher/online_serving/token_count"}[10m]))
        > 25 * sum(rate({"__name__"="aiplatform.googleapis.com/publisher/online_serving/token_count"}[1w]))

  • Configure the alert to notify a Pub/Sub notification channel that pushes notifications to a Cloud Run function.

  • That Cloud Run function then runs a workflow that uses the Cloud Quotas API to lower your Token Usage quota to 0, which immediately stops the overspend. Note that legitimate use of tokens will be paused until you can fix the problem… but at least you’ll stop the bleeding.

Sign up to be a design partner

We are working on productizing anomaly detection using dynamic thresholds so that these policies are easier to write. We’re also working on more complex anomaly detection algorithms in Cloud Monitoring alerting that use AI models specifically trained on time-series data.

If you’re interested in sharing your thoughts and being an early adopter of what we’re building in this space, sign up to be a preview partner. We’d love to have you!