MySQL monitoring maturity model: from survival to expert
Most MySQL monitoring stacks grow reactively. A team starts with a liveness check and disk alerting. An incident exposes a gap, a new signal gets added, and the cycle repeats. The result is uneven coverage: strong in areas that caused past outages, blind in areas that will cause the next one.
This article maps MySQL monitoring into four capability levels. Use it as a self-assessment, not a checklist to max out. Every level includes signals the playbook identifies as operationally important, along with the failure modes they catch. If you are missing an entire category at your current level, that is your highest-priority gap.
The levels are cumulative. Level 2 assumes Level 1 is in place. Level 3 assumes Level 2. Do not skip ahead to expert signals while missing operational fundamentals.
A broader mental model of how MySQL works in production, including the subsystems referenced below, is covered in how MySQL works in production.
flowchart TD
L1["Level 1: Survival
Is it alive?"] --> L2["Level 2: Operational
Is it healthy?"]
L2 --> L3["Level 3: Mature
Are cliff edges approaching?"]
L3 --> L4["Level 4: Expert
Where exactly is the contention?"]Level 1: survival
The minimum to know MySQL is alive and accepting work. Teams at this level know when MySQL is completely down but have no warning before failures and limited diagnostics during incidents.
What this level catches:
- Server crashes, OOM kills, and restart loops.
- Connection slot exhaustion (hard refusal).
- Replication thread failures (broken I/O or SQL thread).
- Disk full on the data volume.
What this level misses:
- Gradual performance degradation (slow queries, buffer pool erosion).
- Lock contention and metadata lock cascades.
- Replication lag (threads running does not mean caught up).
- Write-path cliff edges (checkpoint stalls, purge lag).
Signals at this level
| Signal | Source | What it tells you |
|---|---|---|
| Server availability | SELECT 1 + SHOW GLOBAL STATUS LIKE 'Uptime' | MySQL is up and can answer a query. Uptime reset means a crash or restart. |
| Connection utilization | Threads_connected / max_connections | How close you are to refusing new connections. |
| Replication thread state | SHOW REPLICA STATUS\G (Replica_IO_Running, Replica_SQL_Running) | Whether replication threads are alive. Does not tell you if the replica is caught up. |
| Disk space free | OS-level (df) on data volume | Will writes fail soon? |
Error log [ERROR] lines | /var/log/mysql/error.log | Crash signatures, corruption, replication errors with context. |
Survival-level checklist
- Liveness check responds within timeout. A TCP connect succeeding does not mean MySQL is ready. InnoDB crash recovery can block connections for minutes. The check must execute a query, not just open a socket.
- Connection ratio below 80% at peak. Express as
Threads_connected / max_connections, never as an absolute count. Peak headroom of 20-30% covers spikes and failover storms. - Both replication threads report Yes. Either thread stopping means lag grows unboundedly. Note:
Seconds_Behind_Sourceat this level is a secondary signal and is unreliable in several scenarios (see Level 2). - Error log is collected and alertable. MySQL writes crash signatures, assertion failures, and corruption warnings here that appear nowhere in status variables.
What survival does not cover
A server passing SELECT 1 can still be in deep trouble. Metadata lock cascades, purge lag, checkpoint stalls, and buffer pool cliff edges all produce a server that answers liveness checks while serving queries at a fraction of normal throughput. If your monitoring stops here, your first indication of these problems will be application error rates.
Level 2: operational
What a competent team running MySQL in production monitors. Missing any of these would be considered a gap in peer review. This level adds throughput, active concurrency, query quality, and contention signals.
What this level adds:
- Active load and throughput baselines.
- Buffer pool effectiveness.
- Query quality indicators (scans, joins without indexes, temp table spills).
- Lock contention and deadlock detection.
- Connection failure breakdown.
Signals at this level
| Signal | Source | Why it matters |
|---|---|---|
Query throughput (Questions rate) | SHOW GLOBAL STATUS LIKE 'Questions' | A sudden drop with stable connections means queries are stuck (locks, I/O saturation, metadata locks). |
Active threads (Threads_running) | SHOW GLOBAL STATUS LIKE 'Threads_running' | The definitive load gauge. If consistently higher than CPU cores, the database is oversubscribed. Rising while Questions is flat means queries are queuing. |
| Slow query rate | SHOW GLOBAL STATUS LIKE 'Slow_queries' with long_query_time <= 1 second | The default 10-second threshold misses most operationally relevant slowdowns. Express as Slow_queries / Questions ratio. |
| Buffer pool hit ratio | 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests) | Below 99% for OLTP is concerning. Below 95% is a crisis. The degradation curve is non-linear. |
| Row lock waits | SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%' | Innodb_row_lock_current_waits > 0 means transactions are actively blocked right now. |
| Deadlocks | INFORMATION_SCHEMA.INNODB_METRICS metric lock_deadlocks | Sustained high rate indicates application design issues. Note: not available as a SHOW STATUS variable. |
| Aborted connections | SHOW GLOBAL STATUS LIKE 'Aborted_connects', Aborted_clients | Auth failures, network drops, credential rotation problems. |
| Connection rejections | SHOW GLOBAL STATUS LIKE 'Connection_errors_max_connections' | Actual rejections at the limit, not just high utilization. |
| Replication lag | Seconds_Behind_Source from SHOW REPLICA STATUS | Data freshness. Unreliable in several cases (see below), but better than nothing. |
| Temp table disk ratio | Created_tmp_disk_tables / Created_tmp_tables | Above 25% means queries are spilling to disk. |
| Read/write mix | Com_select, Com_insert, Com_update, Com_delete | A shift in ratio often precedes or explains performance changes. |
Operational-level checklist
- Threads_running relative to CPU cores. Sustained above core count means context-switching overhead. Sustained at 3x cores with dropping throughput means contention or stall.
- Hit ratio computed over intervals, not cumulative. After a year of uptime, cumulative hit ratio masks current degradation. Compute from deltas over 1-minute windows.
- long_query_time set to 1 second or lower for OLTP. The 10-second default lets 9.9-second queries pass as “normal.” The counter increments even if the slow query log is disabled.
- Replication lag validated.
Seconds_Behind_Sourceshows 0 when the I/O thread has not fetched latest events but the SQL thread caught up with a stale relay log. It shows NULL when broken. For critical decisions, add heartbeat-based lag (for examplept-heartbeat) or GTID comparison at Level 4. - Deadlock count comes from INNODB_METRICS, not SHOW STATUS. The variable
Innodb_deadlocksdoes not exist in upstreamSHOW GLOBAL STATUS. UseSELECT COUNT FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE NAME = 'lock_deadlocks'.
Common operational mistakes
- Alerting on Threads_connected instead of Threads_running. High connections with low running is a connection leak. High running is a crisis. These require different responses.
- Treating max_connections as a tuning knob. Increasing it trades “connection refused” for potential OOM. Each connection consumes memory for per-thread buffers. The right response is investigating why so many connections are needed.
- Relying solely on Seconds_Behind_Source. This metric is timestamp-based, not real-time. A burst after source idle time makes it jump to 3600 with no real lag. Use it as a secondary signal, not the primary one.
Level 3: mature
Full coverage including internals, leading indicators, and composite failure pattern detection. Teams at this level catch problems before they become incidents.
What this level adds:
- Write-path cliff edge detection (checkpoint age, redo log capacity).
- MVCC debt tracking (history list length).
- Metadata lock detection before cascade.
- Transaction age and open transaction monitoring.
- Binlog growth and table cache efficiency.
Signals at this level
| Signal | Source | Why it matters |
|---|---|---|
| Checkpoint age / redo log capacity | MySQL 8.0.30+: Innodb_redo_log_current_lsn - Innodb_redo_log_checkpoint_lsn. Pre-8.0.30: INNODB_METRICS metric log_lsn_checkpoint_age (disabled by default). All versions: parse SHOW ENGINE INNODB STATUS LOG section. | Above 75% triggers aggressive flushing. Above 90% with throughput collapse means synchronous flush stall. This is the most common “mysterious slowdown” in production MySQL. |
| History list length | INFORMATION_SCHEMA.INNODB_METRICS metric trx_rseg_history_len. Also: parse SHOW ENGINE INNODB STATUS. | Above 100,000 is concerning. Above 1,000,000 means a long-running transaction is blocking purge. Above 10,000,000 means severe degradation across all MVCC reads. Not available as a SHOW STATUS variable in MySQL 8.0. |
| Open transaction age | information_schema.INNODB_TRX ordered by trx_started | Any OLTP transaction older than 5 minutes is suspicious. trx_query = NULL with modifications means a forgotten transaction holding locks and a read view. |
| Metadata lock waits | performance_schema.metadata_locks where LOCK_STATUS = 'PENDING' | More than 3 sessions waiting on the same object means a cascade is developing. Metadata locks are a separate system from InnoDB row locks and do not appear in data_lock_waits. |
| Binary log growth | SHOW BINARY LOGS summed; binlog_expire_logs_seconds | MySQL 5.7 default expire_logs_days = 0 means binlogs accumulate forever. Lagging replicas block purge on the source. |
| Table cache pressure | Opened_tables rate; Open_tables / table_open_cache ratio | High miss rate adds file descriptor overhead and latency. Cross-check against OS ulimit -n. |
| Buffer pool wait free | SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free' | Any nonzero sustained rate means queries are blocked waiting for dirty page flush. More reliable than hit ratio for detecting acute memory pressure. |
| Log buffer waits | SHOW GLOBAL STATUS LIKE 'Innodb_log_waits' | Any nonzero rate means innodb_log_buffer_size is undersized. Distinct from checkpoint stalls (which are about redo log capacity, not buffer size). |
| Thread cache efficiency | Threads_created / Connections ratio | High ratio means connection pooling is ineffective or thread_cache_size is too small. |
| Query digest latency | performance_schema.events_statements_summary_by_digest | Per-query-pattern latency catches individual regressions invisible in aggregate averages. Track per-digest latency over time, not absolute values. |
| Dirty page ratio | Innodb_buffer_pool_pages_dirty / Innodb_buffer_pool_pages_total | Approaching innodb_max_dirty_pages_pct triggers aggressive flushing, consuming I/O bandwidth. |
Mature-level checklist
- Checkpoint age expressed as ratio, not absolute LSN. Compare to redo log capacity (
innodb_redo_log_capacityin 8.0.30+, orinnodb_log_file_size * innodb_log_files_in_grouppre-8.0.30). Below 50% is comfortable. Above 75% is the aggressive flushing zone. - History list from INNODB_METRICS, not SHOW STATUS.
trx_rseg_history_lenbehaves as a gauge despite being labeledstatus_counter. In upstream MySQL it is not a status variable. Percona Server exposes it asInnodb_history_list_length. - Metadata lock instrument enabled.
performance_schema.metadata_locksis available by default in MySQL 5.7+, but themdlinstrument must be enabled. Without it, metadata lock cascades are invisible until connection pools fill. - Binlog expiry configured. MySQL 8.0 defaults to 30 days (
binlog_expire_logs_seconds = 2592000). MySQL 5.7 defaults to never (expire_logs_days = 0). Verify your version. - Composite pattern detection in place. Mature monitoring correlates multiple signals: metadata lock cascade (pending MDL + throughput drop), buffer pool cliff (hit ratio drop + wait_free nonzero + disk read saturation), checkpoint stall (checkpoint age > 90% + commit rate collapse).
Common mature-level mistakes
- Confusing Innodb_log_waits with checkpoint stalls.
Innodb_log_waits > 0means the log buffer was too small. It does not confirm a checkpoint stall. Use throughput or commit rate collapse as the confirmer for checkpoint stalls. - Using cumulative hit ratio. After long uptime, cumulative counters mask current state. Always compute over recent intervals.
- Not finding the blocking transaction. When history list grows, the blocker is found via
SELECT * FROM information_schema.INNODB_TRX ORDER BY trx_started ASC LIMIT 1. The transaction may showtrx_query = NULLbecause it is idle between statements, but it is still holding its read view.
Level 4: expert
Deep signals that experienced operators add after repeated incidents. These require Performance Schema instrumentation, structured parsing of SHOW ENGINE INNODB STATUS, and often custom collection logic.
What this level adds:
- Per-index and per-table access statistics.
- Lock-wait chain reconstruction.
- Internal mutex and latch contention analysis.
- GTID-based replication divergence detection.
- Page cleaner and flush path efficiency.
Signals at this level
| Signal | Source | Why it matters |
|---|---|---|
| Per-index access statistics | performance_schema.table_io_waits_summary_by_index_usage | INDEX_NAME = NULL rows indicate table scans. Identifies unused indexes and missing indexes on hot tables. |
| Lock-wait chains | performance_schema.data_lock_waits joined with information_schema.innodb_trx | Shows exactly who is blocking whom and on which rows. In MySQL 8.0, this is the single most useful lock diagnostics tool. |
| GTID divergence | GTID_SUBTRACT(replica_executed_gtid_set, source_gtid_executed) | Errant transactions on a replica (transactions the source does not have) indicate data divergence. The next failover will propagate conflicts. Ordinary lag (GTID_SUBTRACT(source, replica)) is not divergence. |
| Page cleaner efficiency | Innodb_buffer_pool_pages_flushed rate vs dirty ratio; innodb_page_cleaners count | Whether page cleaners are keeping up, independent of dirty page count. Capped by innodb_buffer_pool_instances. |
| Semaphore waits | SHOW ENGINE INNODB STATUS SEMAPHORES section | Mutex and rw-lock contention that no status variable reveals. btr_search waits indicate adaptive hash index latch contention. |
| AHI efficiency | INFORMATION_SCHEMA.INNODB_METRICS: adaptive_hash_searches vs adaptive_hash_searches_btree | Hit ratio below 50% means AHI consumes buffer pool memory without helping. Many high-performance operators disable AHI entirely. |
| Mutex spin/wait ratios | INFORMATION_SCHEMA.INNODB_METRICS | High spin counts with low throughput indicate CPU wasted on lock spinning. Note: Performance Schema mutex instrumentation excludes buffer pool internal mutexes. Use SHOW ENGINE INNODB MUTEX for those. |
| Sort merge pass rate | Sort_merge_passes / (Sort_scan + Sort_range) | Ratio above 1 means sort_buffer_size is too small for the workload. Increasing the buffer past 256KB-2MB has diminishing returns. |
| Binlog cache spill | Binlog_cache_use vs Binlog_cache_disk_use | Disk spill means transactions are too large for the binlog cache (binlog_cache_size). |
| Per-file I/O statistics | performance_schema.file_summary_by_instance | Reveals when a specific tablespace or log file has become a hotspot. |
| Prepared statement cache | Com_stmt_prepare vs Com_stmt_reprepare | High reprepare rate indicates table statistics changes invalidating prepared statements frequently. |
Expert-level checklist
- Errant GTID transactions trigger immediate investigation.
GTID_SUBTRACT(replica_set, source_set)returning non-empty is data divergence. It cannot self-resolve. The next failover will propagate conflicting data. - Lock-wait chains reconstructed on demand. During contention incidents, joining
data_lock_waitswithinnodb_trxandprocesslistidentifies the blocker in seconds. MySQL 8.0.40 reduced the cost of querying these tables under contention. - AHI latch contention checked before disabling. Confirm
btr_searchwaits in the SEMAPHORES section with spin rounds above 100K. Disabling AHI (SET GLOBAL innodb_adaptive_hash_index = OFF) is safe and immediate but clears the entire structure with a brief performance dip. - Page cleaner count adequate.
innodb_page_cleanersdefaults to 4, capped byinnodb_buffer_pool_instances. If dirty pages accumulate with adequate I/O capacity, increase page cleaners. - Unused indexes identified periodically. Query
table_io_waits_summary_by_index_usagefor indexes with zero or negligible reads. Removing unused indexes reduces write amplification and storage.
How to use this model
Start by identifying your current level. If you are missing signals at your own level, fill those gaps first. A team at Level 2 with no buffer pool hit ratio monitoring has a more urgent problem than a team at Level 3 considering GTID divergence tracking.
The most common blind spots across all levels:
- Checkpoint age (Level 3). The most common “mysterious slowdown” and the least monitored cliff edge. The
Innodb_checkpoint_agestatus variable does not exist. You need INNODB_METRICS or 8.0.30+ redo log status variables. - History list length (Level 3). Silent degradation from idle transactions. Not a SHOW STATUS variable in 8.0.
- Metadata locks (Level 3). Separate system from InnoDB row locks. Invisible without
performance_schema.metadata_locks. - Threads_running vs Threads_connected (Level 2). The single most important real-time health signal, frequently absent or misunderstood.
- GTID errant transactions (Level 4). Invisible divergences that cause failover failures.
For deeper coverage of specific failure patterns, see the related guides below.
How Netdata helps
Netdata collects MySQL status variables, Performance Schema tables, and InnoDB internals through its MySQL collector. The signals across all four levels are available as predefined charts and alarms.
- Correlates Threads_running with Questions rate and CPU. When active concurrency rises while throughput drops, Netdata shows the divergence in a single view, distinguishing CPU saturation from lock contention.
- Tracks buffer pool hit ratio as interval-computed, not cumulative. The cliff edge from 99% to 95% is visible as it happens, not masked by historical averages.
- Monitors connection utilization and rejection rate together.
Threads_connected / max_connectionsandConnection_errors_max_connectionsare charted alongside each other, making the difference between “approaching” and “refusing” immediately clear. - Exposes checkpoint age as a derived metric. Netdata computes the ratio against redo log capacity, eliminating the need to parse
SHOW ENGINE INNODB STATUSmanually or enable INNODB_METRICS. - Correlates replication signals across levels. Thread state, lag, relay log space, and GTID position are collected together, so I/O thread disconnects are not masked by a
Seconds_Behind_Sourceof 0. - History list and open transaction age as first-class metrics. Purge lag from idle transactions is charted continuously, not discovered during an incident.
Related guides
- How MySQL actually works in production: a mental model for operators
- MySQL Aborted_connects and Aborted_clients climbing: diagnosis
- MySQL ERROR 1045 (28000): Access denied for user - diagnosis
- MySQL adaptive hash index latch contention: high CPU, low throughput
- MySQL authentication failure spike: brute force vs broken credential rotation
- MySQL binary logs filling the disk: expiry, lagging replicas, and purge
- MySQL InnoDB buffer pool hit ratio collapse: the cliff edge
- MySQL slow after restart: buffer pool warm-up and the cold cache
- MySQL innodb_buffer_pool_size tuning: 60-80% of RAM and when that breaks
- MySQL Innodb_buffer_pool_wait_free > 0: buffer pool memory pressure
- MySQL InnoDB checkpoint age: the redo log capacity signal nobody watches
- MySQL connection exhaustion: detection, diagnosis, and prevention







