MySQL monitoring maturity model: from survival to expert

Most MySQL monitoring stacks grow reactively. A team starts with a liveness check and disk alerting. An incident exposes a gap, a new signal gets added, and the cycle repeats. The result is uneven coverage: strong in areas that caused past outages, blind in areas that will cause the next one.

This article maps MySQL monitoring into four capability levels. Use it as a self-assessment, not a checklist to max out. Every level includes signals the playbook identifies as operationally important, along with the failure modes they catch. If you are missing an entire category at your current level, that is your highest-priority gap.

The levels are cumulative. Level 2 assumes Level 1 is in place. Level 3 assumes Level 2. Do not skip ahead to expert signals while missing operational fundamentals.

A broader mental model of how MySQL works in production, including the subsystems referenced below, is covered in how MySQL works in production.

flowchart TD
    L1["Level 1: Survival
Is it alive?"] --> L2["Level 2: Operational
Is it healthy?"] L2 --> L3["Level 3: Mature
Are cliff edges approaching?"] L3 --> L4["Level 4: Expert
Where exactly is the contention?"]

Level 1: survival

The minimum to know MySQL is alive and accepting work. Teams at this level know when MySQL is completely down but have no warning before failures and limited diagnostics during incidents.

What this level catches:

  • Server crashes, OOM kills, and restart loops.
  • Connection slot exhaustion (hard refusal).
  • Replication thread failures (broken I/O or SQL thread).
  • Disk full on the data volume.

What this level misses:

  • Gradual performance degradation (slow queries, buffer pool erosion).
  • Lock contention and metadata lock cascades.
  • Replication lag (threads running does not mean caught up).
  • Write-path cliff edges (checkpoint stalls, purge lag).

Signals at this level

SignalSourceWhat it tells you
Server availabilitySELECT 1 + SHOW GLOBAL STATUS LIKE 'Uptime'MySQL is up and can answer a query. Uptime reset means a crash or restart.
Connection utilizationThreads_connected / max_connectionsHow close you are to refusing new connections.
Replication thread stateSHOW REPLICA STATUS\G (Replica_IO_Running, Replica_SQL_Running)Whether replication threads are alive. Does not tell you if the replica is caught up.
Disk space freeOS-level (df) on data volumeWill writes fail soon?
Error log [ERROR] lines/var/log/mysql/error.logCrash signatures, corruption, replication errors with context.

Survival-level checklist

  • Liveness check responds within timeout. A TCP connect succeeding does not mean MySQL is ready. InnoDB crash recovery can block connections for minutes. The check must execute a query, not just open a socket.
  • Connection ratio below 80% at peak. Express as Threads_connected / max_connections, never as an absolute count. Peak headroom of 20-30% covers spikes and failover storms.
  • Both replication threads report Yes. Either thread stopping means lag grows unboundedly. Note: Seconds_Behind_Source at this level is a secondary signal and is unreliable in several scenarios (see Level 2).
  • Error log is collected and alertable. MySQL writes crash signatures, assertion failures, and corruption warnings here that appear nowhere in status variables.

What survival does not cover

A server passing SELECT 1 can still be in deep trouble. Metadata lock cascades, purge lag, checkpoint stalls, and buffer pool cliff edges all produce a server that answers liveness checks while serving queries at a fraction of normal throughput. If your monitoring stops here, your first indication of these problems will be application error rates.

Level 2: operational

What a competent team running MySQL in production monitors. Missing any of these would be considered a gap in peer review. This level adds throughput, active concurrency, query quality, and contention signals.

What this level adds:

  • Active load and throughput baselines.
  • Buffer pool effectiveness.
  • Query quality indicators (scans, joins without indexes, temp table spills).
  • Lock contention and deadlock detection.
  • Connection failure breakdown.

Signals at this level

SignalSourceWhy it matters
Query throughput (Questions rate)SHOW GLOBAL STATUS LIKE 'Questions'A sudden drop with stable connections means queries are stuck (locks, I/O saturation, metadata locks).
Active threads (Threads_running)SHOW GLOBAL STATUS LIKE 'Threads_running'The definitive load gauge. If consistently higher than CPU cores, the database is oversubscribed. Rising while Questions is flat means queries are queuing.
Slow query rateSHOW GLOBAL STATUS LIKE 'Slow_queries' with long_query_time <= 1 secondThe default 10-second threshold misses most operationally relevant slowdowns. Express as Slow_queries / Questions ratio.
Buffer pool hit ratio1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)Below 99% for OLTP is concerning. Below 95% is a crisis. The degradation curve is non-linear.
Row lock waitsSHOW GLOBAL STATUS LIKE 'Innodb_row_lock%'Innodb_row_lock_current_waits > 0 means transactions are actively blocked right now.
DeadlocksINFORMATION_SCHEMA.INNODB_METRICS metric lock_deadlocksSustained high rate indicates application design issues. Note: not available as a SHOW STATUS variable.
Aborted connectionsSHOW GLOBAL STATUS LIKE 'Aborted_connects', Aborted_clientsAuth failures, network drops, credential rotation problems.
Connection rejectionsSHOW GLOBAL STATUS LIKE 'Connection_errors_max_connections'Actual rejections at the limit, not just high utilization.
Replication lagSeconds_Behind_Source from SHOW REPLICA STATUSData freshness. Unreliable in several cases (see below), but better than nothing.
Temp table disk ratioCreated_tmp_disk_tables / Created_tmp_tablesAbove 25% means queries are spilling to disk.
Read/write mixCom_select, Com_insert, Com_update, Com_deleteA shift in ratio often precedes or explains performance changes.

Operational-level checklist

  • Threads_running relative to CPU cores. Sustained above core count means context-switching overhead. Sustained at 3x cores with dropping throughput means contention or stall.
  • Hit ratio computed over intervals, not cumulative. After a year of uptime, cumulative hit ratio masks current degradation. Compute from deltas over 1-minute windows.
  • long_query_time set to 1 second or lower for OLTP. The 10-second default lets 9.9-second queries pass as “normal.” The counter increments even if the slow query log is disabled.
  • Replication lag validated. Seconds_Behind_Source shows 0 when the I/O thread has not fetched latest events but the SQL thread caught up with a stale relay log. It shows NULL when broken. For critical decisions, add heartbeat-based lag (for example pt-heartbeat) or GTID comparison at Level 4.
  • Deadlock count comes from INNODB_METRICS, not SHOW STATUS. The variable Innodb_deadlocks does not exist in upstream SHOW GLOBAL STATUS. Use SELECT COUNT FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE NAME = 'lock_deadlocks'.

Common operational mistakes

  • Alerting on Threads_connected instead of Threads_running. High connections with low running is a connection leak. High running is a crisis. These require different responses.
  • Treating max_connections as a tuning knob. Increasing it trades “connection refused” for potential OOM. Each connection consumes memory for per-thread buffers. The right response is investigating why so many connections are needed.
  • Relying solely on Seconds_Behind_Source. This metric is timestamp-based, not real-time. A burst after source idle time makes it jump to 3600 with no real lag. Use it as a secondary signal, not the primary one.

Level 3: mature

Full coverage including internals, leading indicators, and composite failure pattern detection. Teams at this level catch problems before they become incidents.

What this level adds:

  • Write-path cliff edge detection (checkpoint age, redo log capacity).
  • MVCC debt tracking (history list length).
  • Metadata lock detection before cascade.
  • Transaction age and open transaction monitoring.
  • Binlog growth and table cache efficiency.

Signals at this level

SignalSourceWhy it matters
Checkpoint age / redo log capacityMySQL 8.0.30+: Innodb_redo_log_current_lsn - Innodb_redo_log_checkpoint_lsn. Pre-8.0.30: INNODB_METRICS metric log_lsn_checkpoint_age (disabled by default). All versions: parse SHOW ENGINE INNODB STATUS LOG section.Above 75% triggers aggressive flushing. Above 90% with throughput collapse means synchronous flush stall. This is the most common “mysterious slowdown” in production MySQL.
History list lengthINFORMATION_SCHEMA.INNODB_METRICS metric trx_rseg_history_len. Also: parse SHOW ENGINE INNODB STATUS.Above 100,000 is concerning. Above 1,000,000 means a long-running transaction is blocking purge. Above 10,000,000 means severe degradation across all MVCC reads. Not available as a SHOW STATUS variable in MySQL 8.0.
Open transaction ageinformation_schema.INNODB_TRX ordered by trx_startedAny OLTP transaction older than 5 minutes is suspicious. trx_query = NULL with modifications means a forgotten transaction holding locks and a read view.
Metadata lock waitsperformance_schema.metadata_locks where LOCK_STATUS = 'PENDING'More than 3 sessions waiting on the same object means a cascade is developing. Metadata locks are a separate system from InnoDB row locks and do not appear in data_lock_waits.
Binary log growthSHOW BINARY LOGS summed; binlog_expire_logs_secondsMySQL 5.7 default expire_logs_days = 0 means binlogs accumulate forever. Lagging replicas block purge on the source.
Table cache pressureOpened_tables rate; Open_tables / table_open_cache ratioHigh miss rate adds file descriptor overhead and latency. Cross-check against OS ulimit -n.
Buffer pool wait freeSHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free'Any nonzero sustained rate means queries are blocked waiting for dirty page flush. More reliable than hit ratio for detecting acute memory pressure.
Log buffer waitsSHOW GLOBAL STATUS LIKE 'Innodb_log_waits'Any nonzero rate means innodb_log_buffer_size is undersized. Distinct from checkpoint stalls (which are about redo log capacity, not buffer size).
Thread cache efficiencyThreads_created / Connections ratioHigh ratio means connection pooling is ineffective or thread_cache_size is too small.
Query digest latencyperformance_schema.events_statements_summary_by_digestPer-query-pattern latency catches individual regressions invisible in aggregate averages. Track per-digest latency over time, not absolute values.
Dirty page ratioInnodb_buffer_pool_pages_dirty / Innodb_buffer_pool_pages_totalApproaching innodb_max_dirty_pages_pct triggers aggressive flushing, consuming I/O bandwidth.

Mature-level checklist

  • Checkpoint age expressed as ratio, not absolute LSN. Compare to redo log capacity (innodb_redo_log_capacity in 8.0.30+, or innodb_log_file_size * innodb_log_files_in_group pre-8.0.30). Below 50% is comfortable. Above 75% is the aggressive flushing zone.
  • History list from INNODB_METRICS, not SHOW STATUS. trx_rseg_history_len behaves as a gauge despite being labeled status_counter. In upstream MySQL it is not a status variable. Percona Server exposes it as Innodb_history_list_length.
  • Metadata lock instrument enabled. performance_schema.metadata_locks is available by default in MySQL 5.7+, but the mdl instrument must be enabled. Without it, metadata lock cascades are invisible until connection pools fill.
  • Binlog expiry configured. MySQL 8.0 defaults to 30 days (binlog_expire_logs_seconds = 2592000). MySQL 5.7 defaults to never (expire_logs_days = 0). Verify your version.
  • Composite pattern detection in place. Mature monitoring correlates multiple signals: metadata lock cascade (pending MDL + throughput drop), buffer pool cliff (hit ratio drop + wait_free nonzero + disk read saturation), checkpoint stall (checkpoint age > 90% + commit rate collapse).

Common mature-level mistakes

  • Confusing Innodb_log_waits with checkpoint stalls. Innodb_log_waits > 0 means the log buffer was too small. It does not confirm a checkpoint stall. Use throughput or commit rate collapse as the confirmer for checkpoint stalls.
  • Using cumulative hit ratio. After long uptime, cumulative counters mask current state. Always compute over recent intervals.
  • Not finding the blocking transaction. When history list grows, the blocker is found via SELECT * FROM information_schema.INNODB_TRX ORDER BY trx_started ASC LIMIT 1. The transaction may show trx_query = NULL because it is idle between statements, but it is still holding its read view.

Level 4: expert

Deep signals that experienced operators add after repeated incidents. These require Performance Schema instrumentation, structured parsing of SHOW ENGINE INNODB STATUS, and often custom collection logic.

What this level adds:

  • Per-index and per-table access statistics.
  • Lock-wait chain reconstruction.
  • Internal mutex and latch contention analysis.
  • GTID-based replication divergence detection.
  • Page cleaner and flush path efficiency.

Signals at this level

SignalSourceWhy it matters
Per-index access statisticsperformance_schema.table_io_waits_summary_by_index_usageINDEX_NAME = NULL rows indicate table scans. Identifies unused indexes and missing indexes on hot tables.
Lock-wait chainsperformance_schema.data_lock_waits joined with information_schema.innodb_trxShows exactly who is blocking whom and on which rows. In MySQL 8.0, this is the single most useful lock diagnostics tool.
GTID divergenceGTID_SUBTRACT(replica_executed_gtid_set, source_gtid_executed)Errant transactions on a replica (transactions the source does not have) indicate data divergence. The next failover will propagate conflicts. Ordinary lag (GTID_SUBTRACT(source, replica)) is not divergence.
Page cleaner efficiencyInnodb_buffer_pool_pages_flushed rate vs dirty ratio; innodb_page_cleaners countWhether page cleaners are keeping up, independent of dirty page count. Capped by innodb_buffer_pool_instances.
Semaphore waitsSHOW ENGINE INNODB STATUS SEMAPHORES sectionMutex and rw-lock contention that no status variable reveals. btr_search waits indicate adaptive hash index latch contention.
AHI efficiencyINFORMATION_SCHEMA.INNODB_METRICS: adaptive_hash_searches vs adaptive_hash_searches_btreeHit ratio below 50% means AHI consumes buffer pool memory without helping. Many high-performance operators disable AHI entirely.
Mutex spin/wait ratiosINFORMATION_SCHEMA.INNODB_METRICSHigh spin counts with low throughput indicate CPU wasted on lock spinning. Note: Performance Schema mutex instrumentation excludes buffer pool internal mutexes. Use SHOW ENGINE INNODB MUTEX for those.
Sort merge pass rateSort_merge_passes / (Sort_scan + Sort_range)Ratio above 1 means sort_buffer_size is too small for the workload. Increasing the buffer past 256KB-2MB has diminishing returns.
Binlog cache spillBinlog_cache_use vs Binlog_cache_disk_useDisk spill means transactions are too large for the binlog cache (binlog_cache_size).
Per-file I/O statisticsperformance_schema.file_summary_by_instanceReveals when a specific tablespace or log file has become a hotspot.
Prepared statement cacheCom_stmt_prepare vs Com_stmt_reprepareHigh reprepare rate indicates table statistics changes invalidating prepared statements frequently.

Expert-level checklist

  • Errant GTID transactions trigger immediate investigation. GTID_SUBTRACT(replica_set, source_set) returning non-empty is data divergence. It cannot self-resolve. The next failover will propagate conflicting data.
  • Lock-wait chains reconstructed on demand. During contention incidents, joining data_lock_waits with innodb_trx and processlist identifies the blocker in seconds. MySQL 8.0.40 reduced the cost of querying these tables under contention.
  • AHI latch contention checked before disabling. Confirm btr_search waits in the SEMAPHORES section with spin rounds above 100K. Disabling AHI (SET GLOBAL innodb_adaptive_hash_index = OFF) is safe and immediate but clears the entire structure with a brief performance dip.
  • Page cleaner count adequate. innodb_page_cleaners defaults to 4, capped by innodb_buffer_pool_instances. If dirty pages accumulate with adequate I/O capacity, increase page cleaners.
  • Unused indexes identified periodically. Query table_io_waits_summary_by_index_usage for indexes with zero or negligible reads. Removing unused indexes reduces write amplification and storage.

How to use this model

Start by identifying your current level. If you are missing signals at your own level, fill those gaps first. A team at Level 2 with no buffer pool hit ratio monitoring has a more urgent problem than a team at Level 3 considering GTID divergence tracking.

The most common blind spots across all levels:

  1. Checkpoint age (Level 3). The most common “mysterious slowdown” and the least monitored cliff edge. The Innodb_checkpoint_age status variable does not exist. You need INNODB_METRICS or 8.0.30+ redo log status variables.
  2. History list length (Level 3). Silent degradation from idle transactions. Not a SHOW STATUS variable in 8.0.
  3. Metadata locks (Level 3). Separate system from InnoDB row locks. Invisible without performance_schema.metadata_locks.
  4. Threads_running vs Threads_connected (Level 2). The single most important real-time health signal, frequently absent or misunderstood.
  5. GTID errant transactions (Level 4). Invisible divergences that cause failover failures.

For deeper coverage of specific failure patterns, see the related guides below.

How Netdata helps

Netdata collects MySQL status variables, Performance Schema tables, and InnoDB internals through its MySQL collector. The signals across all four levels are available as predefined charts and alarms.

  • Correlates Threads_running with Questions rate and CPU. When active concurrency rises while throughput drops, Netdata shows the divergence in a single view, distinguishing CPU saturation from lock contention.
  • Tracks buffer pool hit ratio as interval-computed, not cumulative. The cliff edge from 99% to 95% is visible as it happens, not masked by historical averages.
  • Monitors connection utilization and rejection rate together. Threads_connected / max_connections and Connection_errors_max_connections are charted alongside each other, making the difference between “approaching” and “refusing” immediately clear.
  • Exposes checkpoint age as a derived metric. Netdata computes the ratio against redo log capacity, eliminating the need to parse SHOW ENGINE INNODB STATUS manually or enable INNODB_METRICS.
  • Correlates replication signals across levels. Thread state, lag, relay log space, and GTID position are collected together, so I/O thread disconnects are not masked by a Seconds_Behind_Source of 0.
  • History list and open transaction age as first-class metrics. Purge lag from idle transactions is charted continuously, not discovered during an incident.