How often should you check uptime: the honest answer

There is no universal answer. Match the check interval to the failure mode you care about, not to the highest tier your monitoring vendor sells. For most public websites, every 1 to 5 minutes is plenty. For revenue-critical paths, 30 to 60 seconds buys real value. For certificates, DNS records, and domain expiry, once a day is more than enough.

Why "what interval" is the wrong question

Ask any monitoring vendor how often you should check uptime and you get a version of the same answer: as fast as your current plan allows, and faster on a higher plan. That is a sales pitch, not an engineering recommendation.

The engineering recommendation starts somewhere else. What are you actually monitoring, and how fast does someone need to know when it breaks? A marketing page returning 404 for ten minutes is annoying. A checkout flow returning 500s for ten minutes is a Monday-morning incident review. A DNS record drifting silently can take days to surface, and you would not catch it with a faster HTTP check anyway, because nothing about HTTP would fail.

Each failure mode has its own honest cadence. "Every 30 seconds" is not the right default answer for most of them. It is the right answer for a specific subset, covered below under "When 30 seconds earns its money".

Failure modes and their honest cadences

A useful way to think about this: write down the failure mode you actually fear, estimate the time between that failure and meaningful customer impact, then pick a cadence well inside that window. Some examples:

| Failure mode | Time to customer impact | Reasonable cadence | Why |
| --- | --- | --- | --- |
| Marketing or blog page errors | Minutes to hours | 5 to 15 min | Visitor leaves, no transactional damage |
| API serving a paying customer SLA | Seconds | 30 to 60 sec | You owe an SLA; latency-to-alert is the budget |
| Checkout or payment flow | Seconds to minutes | 30 to 60 sec | Every dropped session is lost revenue |
| Public status endpoint | 1 to 2 min | 1 min | You made a public promise to be honest about it |
| Cron / job heartbeat | Up to the job interval | 5 to 15 min | A nightly job will not finish faster than once a day |
| SSL certificate expiry | Weeks of lead time | Once a day | Certificates do not expire in 30 seconds |
| DNS records | Up to record TTL | Once a day | The interesting failures are deliberate changes |
| Domain registration | 30+ days of lead time | Once a day or less | Registrars give you weeks of warning |
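
One way to keep these cadences honest is to store them as data and review each configured monitor against them. The sketch below is only an illustration: the key names and the `review_interval` helper are hypothetical, and the ranges are copied from the table above rather than from any vendor's defaults.

```python
MINUTE = 60
DAY = 24 * 60 * MINUTE

# (shortest, longest) reasonable check interval in seconds, taken from the
# table above. The keys are illustrative names, not any vendor's check types.
REASONABLE_CADENCE = {
    "marketing_page": (5 * MINUTE, 15 * MINUTE),
    "sla_api": (30, 60),
    "checkout_flow": (30, 60),
    "public_status_endpoint": (MINUTE, MINUTE),
    "cron_heartbeat": (5 * MINUTE, 15 * MINUTE),
    "ssl_certificate_expiry": (DAY, DAY),
    "dns_records": (DAY, DAY),
    "domain_registration": (DAY, None),  # "once a day or less": no upper bound
}


def review_interval(failure_mode: str, interval_seconds: int) -> str:
    """Classify a configured interval against the table's suggested range."""
    shortest, longest = REASONABLE_CADENCE[failure_mode]
    if interval_seconds < shortest:
        return "faster than needed"
    if longest is not None and interval_seconds > longest:
        return "slower than suggested"
    return "within range"


print(review_interval("ssl_certificate_expiry", 30))  # faster than needed
print(review_interval("checkout_flow", 5 * MINUTE))   # slower than suggested
```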

Two things are worth saying out loud here. First, "uptime monitoring" at most vendors means HTTP reachability. It does not mean certificate health, DNS drift, or domain expiry; those are separate checks with separate cadences, and conflating them is how teams end up paying for 30-second certificate checks that will never matter. Second, heartbeat-style monitoring (did my cron job actually run?) is a different problem from reachability, and the right check interval is bounded by how often the job runs, not by what your monitoring tool can do.
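
To make the heartbeat case concrete, here is a minimal sketch of a staleness check. It assumes the job records a timestamp every time it finishes; the function name, the grace period, and the example times are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_overdue(
    last_ping: datetime,
    job_interval: timedelta,
    grace: timedelta,
    now: datetime,
) -> bool:
    """Return True when the job has missed its expected ping.

    The deadline is one full job interval plus a grace period after the last
    ping, so a job that merely runs long does not raise an alert.
    """
    return now > last_ping + job_interval + grace


# Hypothetical nightly job that last pinged at 02:00 UTC.
last_ping = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
nightly = timedelta(days=1)
grace = timedelta(minutes=30)

on_time = datetime(2024, 1, 2, 2, 10, tzinfo=timezone.utc)
late = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)

print(heartbeat_is_overdue(last_ping, nightly, grace, on_time))  # False
print(heartbeat_is_overdue(last_ping, nightly, grace, late))     # True
```

Nothing in this check gets more accurate by running it every 30 seconds. The deadline moves only when the job pings, which is why the job's own schedule bounds the useful cadence.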

The alert latency math vendors skip

Marketing materials quote the check interval as if it were the time-to-alert. It is not. A monitoring tool that paged you on the very first failed check would page you every time a packet got dropped, a CDN routed weirdly, or a database paused for a brief vacuum. So every serious uptime tool layers confirmation on top of the raw interval. Two confirmation steps are standard:

  • Retries with exponential backoff before treating a failure as confirmed.
  • Verification from a second monitoring location, so a local network blip in one region does not open an incident.

Both of those are good engineering. They are also why the marketed interval is not the alert latency. Worked example for a 60-second check with four retries on exponential backoff and one cross-location verification:

Check interval:                60s
First failed check at:          0s
Retry 1 (+10s backoff):        10s
Retry 2 (+20s):                30s
Retry 3 (+40s):                70s
Retry 4 (+80s):               150s
Second-location verification: ~30s
-------------------------------------
Earliest possible alert:     ~180s (3 min)

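Three minutes, not sixty seconds. And that clock starts at the first failed check; the incident itself may have begun up to one full interval earlier. Bumping the check interval from 60s to 30s trims that initial wait by at most 30 seconds; it does not halve the total. The retry tail dominates, and that tail exists for a reason: it is what stops your pager from going off at 3am for a routing flicker that resolved itself in 90 seconds.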
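
The arithmetic in the worked example can be sketched in a few lines. This is a simplified model, not any vendor's algorithm: it assumes the backoff doubles from a 10-second start, that each check returns instantly, and that the incident can begin anywhere between two scheduled checks. The function names are made up for illustration.

```python
def confirmation_tail(first_backoff, retries, verification):
    """Seconds from the first failed check to the alert.

    Backoff doubles on every retry: first_backoff, 2x, 4x, ...
    """
    retry_wait = sum(first_backoff * 2**i for i in range(retries))
    return retry_wait + verification


def alert_latency(interval, first_backoff=10, retries=4, verification=30):
    """Return (best, worst) seconds from incident start to alert.

    Best case: the incident starts just before a scheduled check.
    Worst case: it starts just after one, so a full interval passes first.
    """
    tail = confirmation_tail(first_backoff, retries, verification)
    return tail, interval + tail


for interval in (60, 30):
    best, worst = alert_latency(interval)
    print(f"{interval}s interval: best {best}s, worst {worst}s")
# 60s interval: best 180s, worst 240s
# 30s interval: best 180s, worst 210s
```

Under these assumptions, halving the interval moves the worst case from 240 to 210 seconds and leaves the best case untouched.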

Any vendor whose marketing says "alert in seconds" is either disabling retry confirmation (bad) or counting from the start of the last failed retry instead of the start of the incident (misleading). Ask which one. The answer matters for any contract that has a detection-time clause attached.

When 30 seconds earns its money

There are real cases where the fastest intervals are the right call. The pattern: customer impact accumulates faster than a slower cadence, with its confirmation delay on top, can surface it.

  • Revenue-critical checkout paths, where every minute of outage is a measurable count of abandoned orders.
  • API endpoints with a paying-customer SLA that bakes in a detection budget (for example, "99.95% with sub-five-minute detection").
  • Services whose downstream consumers retry aggressively, so an outage burns through quota or rate limits in seconds.
  • Customer-facing status endpoints feeding a public status page, where slow updates damage trust on their own.

Outside of cases like these, paying for 30-second intervals is a tax on resilience theater. With a confirmation policy like the one in the worked example, the alert still arrives about three minutes after the incident starts, and your team is no better positioned to act on it than they would be at a 60-second cadence.

What faster intervals do not fix

Detection is the first few minutes of an incident. The rest is human and process. A faster check interval has no effect on any of the following:

  • Broken alert delivery. If the Slack channel is muted, the email is filtered, or the on-call pager is pointed at someone on PTO, your 30-second detection lands nowhere.
  • Mean time to recovery. Detection cuts the first sliver off MTTR; the rest is runbook, paged engineer, deployment rollback, post-incident write-up.
  • Alert fatigue. Faster cadences with weak confirmation logic page humans more often for nothing, which trains them to ignore the pager when it actually matters.
  • Geographic blind spots. Ten checks per minute from a single region do not help if a CDN outage only hits another region. Multi-location verification beats interval every time.

That last point is also where the check interval meets the rest of your monitoring setup. For most teams, the right move is fewer checks from more places, with retry and cross-location confirmation tuned tight enough to suppress false positives without smothering real outages. That is the model WebPixie uses for its uptime monitoring, and any serious tool should describe its retry and verification logic explicitly instead of hiding behind an advertised interval.
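
For readers who want to see what "retry and cross-location confirmation" might look like in code, here is a minimal sketch. It is not WebPixie's implementation or any other vendor's: `confirm_outage`, the `Probe` type, and the backoff schedule are assumptions, and the probes are plain callables standing in for real HTTP checks from two locations.

```python
import time
from typing import Callable, Sequence

Probe = Callable[[], bool]  # returns True when the target looks healthy


def confirm_outage(
    primary: Probe,
    secondary: Probe,
    backoffs: Sequence[float] = (10, 20, 40, 80),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the failure survives retries and a second location.

    Call this after the first scheduled check has already failed.
    """
    for delay in backoffs:
        sleep(delay)
        if primary():
            return False  # recovered during the retry window: transient blip
    # Every retry failed from the first location; ask a second one.
    return not secondary()


no_wait = lambda seconds: None  # skip real sleeping in this demo
print(confirm_outage(lambda: False, lambda: False, sleep=no_wait))  # True
print(confirm_outage(lambda: False, lambda: True, sleep=no_wait))   # False
```

The second call is the single-region blip: the first location keeps failing, the second location sees a healthy target, and no incident opens.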

Picking your interval in four questions

When you sit down to set this for a specific monitor, four questions get you to a defensible answer:

  • What is the failure mode? Reachability, certificate validity, DNS drift, or heartbeat absence are distinct problems with distinct cadences. Pick one per monitor.
  • How fast does the world need to know? Walk forward from the moment of failure to the moment of measurable harm. The cadence has to live well inside that window.
  • How tolerant are you of false positives at this cadence? Faster checks need stronger confirmation logic to avoid pager fatigue. If you cannot describe your retry policy, you are not ready to drop the interval.
  • Is your alert chain ready to act in that timeframe? A 30-second detection time means nothing if the response chain assumes a human will respond on Monday morning. Match cadence to actual on-call coverage.

A monitor that gets all four right at 5 minutes is more useful than one that gets none of them right at 30 seconds. The interval is a derived value, not a starting point.
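
As a rough illustration of "derived value", the interval can be worked backwards from the harm window and the time that retries and verification add (about 180 seconds in the earlier worked example). The function name and the `safety_factor` default are assumptions made up for this sketch, not an established formula.

```python
def max_check_interval(time_to_harm, confirmation_tail, safety_factor=0.5):
    """Largest interval that keeps worst-case detection inside the window.

    Worst-case detection is interval + confirmation_tail. safety_factor
    shrinks the window so the alert lands well inside it; 0.5 is an
    arbitrary illustrative default.
    """
    budget = time_to_harm * safety_factor - confirmation_tail
    if budget <= 0:
        raise ValueError(
            "confirmation alone exceeds the window; tune retries before the interval"
        )
    return budget


# Harm becomes measurable after ten minutes; confirmation takes 180 seconds.
print(max_check_interval(600, 180))  # 120.0
```

With a five-minute harm window and the same 180-second confirmation delay, the function raises instead of returning a number. In that situation no interval is fast enough, and the retry policy is the thing to revisit.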