$ guides / mysql / mysql-monitoring-maturity-model ▌

Operations Guides

MySQL monitoring maturity model: from survival to expert

Most MySQL monitoring stacks grow reactively. A team starts with a liveness check and disk alerting. An incident exposes a gap, a new signal gets added, and the cycle repeats. The result is uneven coverage: strong in areas that caused past outages, blind in areas that will cause the next one.

This article maps MySQL monitoring into four capability levels. Use it as a self-assessment, not a checklist to max out. Every level includes signals the playbook identifies as operationally important, along with the failure modes they catch. If you are missing an entire category at your current level, that is your highest-priority gap.

The levels are cumulative. Level 2 assumes Level 1 is in place. Level 3 assumes Level 2. Do not skip ahead to expert signals while missing operational fundamentals.

A broader mental model of how MySQL works in production, including the subsystems referenced below, is covered in how MySQL works in production.

flowchart TD
    L1["Level 1: Survival
Is it alive?"] --> L2["Level 2: Operational
Is it healthy?"]
    L2 --> L3["Level 3: Mature
Are cliff edges approaching?"]
    L3 --> L4["Level 4: Expert
Where exactly is the contention?"]

Level 1: survival

The minimum to know MySQL is alive and accepting work. Teams at this level know when MySQL is completely down but have no warning before failures and limited diagnostics during incidents.

What this level catches:

Server crashes, OOM kills, and restart loops.
Connection slot exhaustion (hard refusal).
Replication thread failures (broken I/O or SQL thread).
Disk full on the data volume.

What this level misses:

Gradual performance degradation (slow queries, buffer pool erosion).
Lock contention and metadata lock cascades.
Replication lag (threads running does not mean caught up).
Write-path cliff edges (checkpoint stalls, purge lag).

Signals at this level

Signal	Source	What it tells you
Server availability	`SELECT 1` + `SHOW GLOBAL STATUS LIKE 'Uptime'`	MySQL is up and can answer a query. Uptime reset means a crash or restart.
Connection utilization	`Threads_connected` / `max_connections`	How close you are to refusing new connections.
Replication thread state	`SHOW REPLICA STATUS\G` (`Replica_IO_Running`, `Replica_SQL_Running`)	Whether replication threads are alive. Does not tell you if the replica is caught up.
Disk space free	OS-level (`df`) on data volume	Will writes fail soon?
Error log `[ERROR]` lines	`/var/log/mysql/error.log`	Crash signatures, corruption, replication errors with context.

Survival-level checklist

Liveness check responds within timeout. A TCP connect succeeding does not mean MySQL is ready. InnoDB crash recovery can block connections for minutes. The check must execute a query, not just open a socket.
Connection ratio below 80% at peak. Express as Threads_connected / max_connections, never as an absolute count. Peak headroom of 20-30% covers spikes and failover storms.
Both replication threads report Yes. Either thread stopping means lag grows unboundedly. Note: Seconds_Behind_Source at this level is a secondary signal and is unreliable in several scenarios (see Level 2).
Error log is collected and alertable. MySQL writes crash signatures, assertion failures, and corruption warnings here that appear nowhere in status variables.

What survival does not cover

A server passing SELECT 1 can still be in deep trouble. Metadata lock cascades, purge lag, checkpoint stalls, and buffer pool cliff edges all produce a server that answers liveness checks while serving queries at a fraction of normal throughput. If your monitoring stops here, your first indication of these problems will be application error rates.

Level 2: operational

What a competent team running MySQL in production monitors. Missing any of these would be considered a gap in peer review. This level adds throughput, active concurrency, query quality, and contention signals.

What this level adds:

Active load and throughput baselines.
Buffer pool effectiveness.
Query quality indicators (scans, joins without indexes, temp table spills).
Lock contention and deadlock detection.
Connection failure breakdown.

Signals at this level

Signal	Source	Why it matters
Query throughput (`Questions` rate)	`SHOW GLOBAL STATUS LIKE 'Questions'`	A sudden drop with stable connections means queries are stuck (locks, I/O saturation, metadata locks).
Active threads (`Threads_running`)	`SHOW GLOBAL STATUS LIKE 'Threads_running'`	The definitive load gauge. If consistently higher than CPU cores, the database is oversubscribed. Rising while `Questions` is flat means queries are queuing.
Slow query rate	`SHOW GLOBAL STATUS LIKE 'Slow_queries'` with `long_query_time` <= 1 second	The default 10-second threshold misses most operationally relevant slowdowns. Express as `Slow_queries / Questions` ratio.
Buffer pool hit ratio	`1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)`	Below 99% for OLTP is concerning. Below 95% is a crisis. The degradation curve is non-linear.
Row lock waits	`SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%'`	`Innodb_row_lock_current_waits > 0` means transactions are actively blocked right now.
Deadlocks	`INFORMATION_SCHEMA.INNODB_METRICS` metric `lock_deadlocks`	Sustained high rate indicates application design issues. Note: not available as a `SHOW STATUS` variable.
Aborted connections	`SHOW GLOBAL STATUS LIKE 'Aborted_connects'`, `Aborted_clients`	Auth failures, network drops, credential rotation problems.
Connection rejections	`SHOW GLOBAL STATUS LIKE 'Connection_errors_max_connections'`	Actual rejections at the limit, not just high utilization.
Replication lag	`Seconds_Behind_Source` from `SHOW REPLICA STATUS`	Data freshness. Unreliable in several cases (see below), but better than nothing.
Temp table disk ratio	`Created_tmp_disk_tables / Created_tmp_tables`	Above 25% means queries are spilling to disk.
Read/write mix	`Com_select`, `Com_insert`, `Com_update`, `Com_delete`	A shift in ratio often precedes or explains performance changes.

Operational-level checklist

Threads_running relative to CPU cores. Sustained above core count means context-switching overhead. Sustained at 3x cores with dropping throughput means contention or stall.
Hit ratio computed over intervals, not cumulative. After a year of uptime, cumulative hit ratio masks current degradation. Compute from deltas over 1-minute windows.
long_query_time set to 1 second or lower for OLTP. The 10-second default lets 9.9-second queries pass as “normal.” The counter increments even if the slow query log is disabled.
Replication lag validated. Seconds_Behind_Source shows 0 when the I/O thread has not fetched latest events but the SQL thread caught up with a stale relay log. It shows NULL when broken. For critical decisions, add heartbeat-based lag (for example pt-heartbeat) or GTID comparison at Level 4.
Deadlock count comes from INNODB_METRICS, not SHOW STATUS. The variable Innodb_deadlocks does not exist in upstream SHOW GLOBAL STATUS. Use SELECT COUNT FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE NAME = 'lock_deadlocks'.

Common operational mistakes

Alerting on Threads_connected instead of Threads_running. High connections with low running is a connection leak. High running is a crisis. These require different responses.
Treating max_connections as a tuning knob. Increasing it trades “connection refused” for potential OOM. Each connection consumes memory for per-thread buffers. The right response is investigating why so many connections are needed.
Relying solely on Seconds_Behind_Source. This metric is timestamp-based, not real-time. A burst after source idle time makes it jump to 3600 with no real lag. Use it as a secondary signal, not the primary one.

Level 3: mature

Full coverage including internals, leading indicators, and composite failure pattern detection. Teams at this level catch problems before they become incidents.

What this level adds:

Write-path cliff edge detection (checkpoint age, redo log capacity).
MVCC debt tracking (history list length).
Metadata lock detection before cascade.
Transaction age and open transaction monitoring.
Binlog growth and table cache efficiency.

Signals at this level

Signal	Source	Why it matters
Checkpoint age / redo log capacity	MySQL 8.0.30+: `Innodb_redo_log_current_lsn - Innodb_redo_log_checkpoint_lsn`. Pre-8.0.30: `INNODB_METRICS` metric `log_lsn_checkpoint_age` (disabled by default). All versions: parse `SHOW ENGINE INNODB STATUS` LOG section.	Above 75% triggers aggressive flushing. Above 90% with throughput collapse means synchronous flush stall. This is the most common “mysterious slowdown” in production MySQL.
History list length	`INFORMATION_SCHEMA.INNODB_METRICS` metric `trx_rseg_history_len`. Also: parse `SHOW ENGINE INNODB STATUS`.	Above 100,000 is concerning. Above 1,000,000 means a long-running transaction is blocking purge. Above 10,000,000 means severe degradation across all MVCC reads. Not available as a `SHOW STATUS` variable in MySQL 8.0.
Open transaction age	`information_schema.INNODB_TRX` ordered by `trx_started`	Any OLTP transaction older than 5 minutes is suspicious. `trx_query = NULL` with modifications means a forgotten transaction holding locks and a read view.
Metadata lock waits	`performance_schema.metadata_locks` where `LOCK_STATUS = 'PENDING'`	More than 3 sessions waiting on the same object means a cascade is developing. Metadata locks are a separate system from InnoDB row locks and do not appear in `data_lock_waits`.
Binary log growth	`SHOW BINARY LOGS` summed; `binlog_expire_logs_seconds`	MySQL 5.7 default `expire_logs_days = 0` means binlogs accumulate forever. Lagging replicas block purge on the source.
Table cache pressure	`Opened_tables` rate; `Open_tables / table_open_cache` ratio	High miss rate adds file descriptor overhead and latency. Cross-check against OS `ulimit -n`.
Buffer pool wait free	`SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free'`	Any nonzero sustained rate means queries are blocked waiting for dirty page flush. More reliable than hit ratio for detecting acute memory pressure.
Log buffer waits	`SHOW GLOBAL STATUS LIKE 'Innodb_log_waits'`	Any nonzero rate means `innodb_log_buffer_size` is undersized. Distinct from checkpoint stalls (which are about redo log capacity, not buffer size).
Thread cache efficiency	`Threads_created / Connections` ratio	High ratio means connection pooling is ineffective or `thread_cache_size` is too small.
Query digest latency	`performance_schema.events_statements_summary_by_digest`	Per-query-pattern latency catches individual regressions invisible in aggregate averages. Track per-digest latency over time, not absolute values.
Dirty page ratio	`Innodb_buffer_pool_pages_dirty / Innodb_buffer_pool_pages_total`	Approaching `innodb_max_dirty_pages_pct` triggers aggressive flushing, consuming I/O bandwidth.

Mature-level checklist

Checkpoint age expressed as ratio, not absolute LSN. Compare to redo log capacity (innodb_redo_log_capacity in 8.0.30+, or innodb_log_file_size * innodb_log_files_in_group pre-8.0.30). Below 50% is comfortable. Above 75% is the aggressive flushing zone.
History list from INNODB_METRICS, not SHOW STATUS. trx_rseg_history_len behaves as a gauge despite being labeled status_counter. In upstream MySQL it is not a status variable. Percona Server exposes it as Innodb_history_list_length.
Metadata lock instrument enabled. performance_schema.metadata_locks is available by default in MySQL 5.7+, but the mdl instrument must be enabled. Without it, metadata lock cascades are invisible until connection pools fill.
Binlog expiry configured. MySQL 8.0 defaults to 30 days (binlog_expire_logs_seconds = 2592000). MySQL 5.7 defaults to never (expire_logs_days = 0). Verify your version.
Composite pattern detection in place. Mature monitoring correlates multiple signals: metadata lock cascade (pending MDL + throughput drop), buffer pool cliff (hit ratio drop + wait_free nonzero + disk read saturation), checkpoint stall (checkpoint age > 90% + commit rate collapse).

Common mature-level mistakes

Confusing Innodb_log_waits with checkpoint stalls. Innodb_log_waits > 0 means the log buffer was too small. It does not confirm a checkpoint stall. Use throughput or commit rate collapse as the confirmer for checkpoint stalls.
Using cumulative hit ratio. After long uptime, cumulative counters mask current state. Always compute over recent intervals.
Not finding the blocking transaction. When history list grows, the blocker is found via SELECT * FROM information_schema.INNODB_TRX ORDER BY trx_started ASC LIMIT 1. The transaction may show trx_query = NULL because it is idle between statements, but it is still holding its read view.

Level 4: expert

Deep signals that experienced operators add after repeated incidents. These require Performance Schema instrumentation, structured parsing of SHOW ENGINE INNODB STATUS, and often custom collection logic.

What this level adds:

Per-index and per-table access statistics.
Lock-wait chain reconstruction.
Internal mutex and latch contention analysis.
GTID-based replication divergence detection.
Page cleaner and flush path efficiency.

Signals at this level

Signal	Source	Why it matters
Per-index access statistics	`performance_schema.table_io_waits_summary_by_index_usage`	`INDEX_NAME = NULL` rows indicate table scans. Identifies unused indexes and missing indexes on hot tables.
Lock-wait chains	`performance_schema.data_lock_waits` joined with `information_schema.innodb_trx`	Shows exactly who is blocking whom and on which rows. In MySQL 8.0, this is the single most useful lock diagnostics tool.
GTID divergence	`GTID_SUBTRACT(replica_executed_gtid_set, source_gtid_executed)`	Errant transactions on a replica (transactions the source does not have) indicate data divergence. The next failover will propagate conflicts. Ordinary lag (`GTID_SUBTRACT(source, replica)`) is not divergence.
Page cleaner efficiency	`Innodb_buffer_pool_pages_flushed` rate vs dirty ratio; `innodb_page_cleaners` count	Whether page cleaners are keeping up, independent of dirty page count. Capped by `innodb_buffer_pool_instances`.
Semaphore waits	`SHOW ENGINE INNODB STATUS` SEMAPHORES section	Mutex and rw-lock contention that no status variable reveals. `btr_search` waits indicate adaptive hash index latch contention.
AHI efficiency	`INFORMATION_SCHEMA.INNODB_METRICS`: `adaptive_hash_searches` vs `adaptive_hash_searches_btree`	Hit ratio below 50% means AHI consumes buffer pool memory without helping. Many high-performance operators disable AHI entirely.
Mutex spin/wait ratios	`INFORMATION_SCHEMA.INNODB_METRICS`	High spin counts with low throughput indicate CPU wasted on lock spinning. Note: Performance Schema mutex instrumentation excludes buffer pool internal mutexes. Use `SHOW ENGINE INNODB MUTEX` for those.
Sort merge pass rate	`Sort_merge_passes / (Sort_scan + Sort_range)`	Ratio above 1 means `sort_buffer_size` is too small for the workload. Increasing the buffer past 256KB-2MB has diminishing returns.
Binlog cache spill	`Binlog_cache_use` vs `Binlog_cache_disk_use`	Disk spill means transactions are too large for the binlog cache (`binlog_cache_size`).
Per-file I/O statistics	`performance_schema.file_summary_by_instance`	Reveals when a specific tablespace or log file has become a hotspot.
Prepared statement cache	`Com_stmt_prepare` vs `Com_stmt_reprepare`	High reprepare rate indicates table statistics changes invalidating prepared statements frequently.

Expert-level checklist

Errant GTID transactions trigger immediate investigation. GTID_SUBTRACT(replica_set, source_set) returning non-empty is data divergence. It cannot self-resolve. The next failover will propagate conflicting data.
Lock-wait chains reconstructed on demand. During contention incidents, joining data_lock_waits with innodb_trx and processlist identifies the blocker in seconds. MySQL 8.0.40 reduced the cost of querying these tables under contention.
AHI latch contention checked before disabling. Confirm btr_search waits in the SEMAPHORES section with spin rounds above 100K. Disabling AHI (SET GLOBAL innodb_adaptive_hash_index = OFF) is safe and immediate but clears the entire structure with a brief performance dip.
Page cleaner count adequate. innodb_page_cleaners defaults to 4, capped by innodb_buffer_pool_instances. If dirty pages accumulate with adequate I/O capacity, increase page cleaners.
Unused indexes identified periodically. Query table_io_waits_summary_by_index_usage for indexes with zero or negligible reads. Removing unused indexes reduces write amplification and storage.

How to use this model

Start by identifying your current level. If you are missing signals at your own level, fill those gaps first. A team at Level 2 with no buffer pool hit ratio monitoring has a more urgent problem than a team at Level 3 considering GTID divergence tracking.

The most common blind spots across all levels:

Checkpoint age (Level 3). The most common “mysterious slowdown” and the least monitored cliff edge. The Innodb_checkpoint_age status variable does not exist. You need INNODB_METRICS or 8.0.30+ redo log status variables.
History list length (Level 3). Silent degradation from idle transactions. Not a SHOW STATUS variable in 8.0.
Metadata locks (Level 3). Separate system from InnoDB row locks. Invisible without performance_schema.metadata_locks.
Threads_running vs Threads_connected (Level 2). The single most important real-time health signal, frequently absent or misunderstood.
GTID errant transactions (Level 4). Invisible divergences that cause failover failures.

For deeper coverage of specific failure patterns, see the related guides below.

How Netdata helps

Netdata collects MySQL status variables, Performance Schema tables, and InnoDB internals through its MySQL collector. The signals across all four levels are available as predefined charts and alarms.

Correlates Threads_running with Questions rate and CPU. When active concurrency rises while throughput drops, Netdata shows the divergence in a single view, distinguishing CPU saturation from lock contention.
Tracks buffer pool hit ratio as interval-computed, not cumulative. The cliff edge from 99% to 95% is visible as it happens, not masked by historical averages.
Monitors connection utilization and rejection rate together. Threads_connected / max_connections and Connection_errors_max_connections are charted alongside each other, making the difference between “approaching” and “refusing” immediately clear.
Exposes checkpoint age as a derived metric. Netdata computes the ratio against redo log capacity, eliminating the need to parse SHOW ENGINE INNODB STATUS manually or enable INNODB_METRICS.
Correlates replication signals across levels. Thread state, lag, relay log space, and GTID position are collected together, so I/O thread disconnects are not masked by a Seconds_Behind_Source of 0.
History list and open transaction age as first-class metrics. Purge lag from idle transactions is charted continuously, not discovered during an incident.

The Netdata solution

MySQL monitoring with Netdata

Netdata monitors MySQL and MariaDB with per-second metrics and ML anomaly detection. Track connection usage, query throughput, slow queries, redo-log pressure, and replication lag alongside the host and storage signals that explain them.

See MySQL monitoring → Start monitoring free

MySQL monitoring maturity model: from survival to expert

MySQL monitoring maturity model: from survival to expert

Level 1: survival

Signals at this level

Survival-level checklist

What survival does not cover

Level 2: operational

Signals at this level

Operational-level checklist

Common operational mistakes

Level 3: mature

Signals at this level

Mature-level checklist

Common mature-level mistakes

Level 4: expert

Signals at this level

Expert-level checklist

How to use this model

How Netdata helps

Related guides

MySQL monitoring with Netdata