MySQL monitoring checklist: the signals every production instance needs

Most production incidents involving MySQL stem from missing observability into specific internal subsystems. Connection saturation, buffer pool thrashing, and replication lag can degrade a healthy cluster into an outage if the right signals are not monitored.

This checklist serves as a tiered operational reference for senior engineers and SREs. It defines the exact metrics needed to keep MySQL instances available, fast, and reliable, scaling from basic survival metrics to expert-level internals. Use it to validate existing dashboards or baseline monitoring for a new deployment.

Survival signals (Level 1)

These are the absolute minimum signals required to know MySQL is alive and accepting work. Without these, you are blind to hard failures.

  • Server availability (SELECT 1): Confirms the server accepts connections and can execute a query, covering the InnoDB crash recovery phase where the port is open but the database is not ready for traffic.
  • Server uptime (Uptime): Detects unplanned restarts, segfaults, or OOM kills by monitoring this state counter for unexpected resets to zero.
  • Connection utilization (Threads_connected vs max_connections): Warns when the instance approaches the hard cliff of rejecting new connections, calculated as a ratio to avoid absolute number dependencies.
  • Replication thread state (Replica_IO_Running, Replica_SQL_Running): Verifies the replica is actively receiving events from the source and applying them locally to prevent silent data divergence.
  • Replication lag (Seconds_Behind_Source): Measures data freshness on replicas, though it requires correlation with thread state because it spuriously reports zero if the I/O thread is stalled.
  • Disk space (df -h): Monitors free space on the data, binary log, and tmpdir volumes to prevent write failures, crash loops, and unclean shutdowns.

Operational signals (Level 2)

Operational signals provide visibility into performance degradation and load. Missing these means you will not know why the database is slow or unresponsive.

  • Active execution queue (Threads_running): Tracks concurrent query execution to identify oversubscription or lock contention, serving as the definitive instantaneous measure of database load.
  • Query throughput (Questions): Establishes baseline traffic and reveals drops that indicate global metadata locks, I/O saturation, or queueing.
  • Slow query rate (Slow_queries): Identifies queries exceeding your long_query_time threshold, which must be set to 1 second or lower for OLTP workloads to catch relevant degradations.
  • Buffer pool hit ratio (Innodb_buffer_pool_read_requests vs Innodb_buffer_pool_reads): Predicts read performance collapse when the working set exceeds available memory, degrading non-linearly below 95%.
  • InnoDB row lock contention (Innodb_row_lock_current_waits): Highlights active transaction conflicts blocking query progress, distinguishing lock waits from genuine CPU or I/O saturation.
  • Aborted connections (Aborted_connects): Spots authentication failures, brute force attacks, or network instability affecting the connection handshake before they trigger host caching blocks.
  • Temporary table disk spills (Created_tmp_disk_tables): Reveals queries materializing large intermediate results to disk because they exceed memory limits or involve BLOB columns.
  • Binary log space (SHOW BINARY LOGS): Tracks accumulation against the binlog_expire_logs_seconds policy to ensure lagging replicas do not block purge operations and fill the disk.

Mature signals (Level 3)

Mature signals track internal subsystems and leading indicators of cliff-edge failures. These are necessary to prevent mysterious slowdowns and complex cascading outages.

  • Redo log checkpoint age (Innodb_redo_log_current_lsn minus Innodb_redo_log_checkpoint_lsn): Watches the write-path cliff edge before InnoDB forces a synchronous flush that stalls all writes completely.
  • History list length / purge lag (trx_rseg_history_len): Detects MVCC debt caused by long-running transactions blocking undo cleanup, which progressively degrades all reads in the database.
  • Metadata lock buildup (performance_schema.metadata_locks): Prevents DDL operations from cascading into full table lock queues that exhaust application connection pools and cause partial outages.
  • Open transaction age (information_schema.INNODB_TRX): Finds forgotten or idle transactions holding locks or read views that block purge threads and schema migrations.
  • Log buffer waits (Innodb_log_waits): Identifies undersized redo log buffers or disk fsync latency bottlenecks on the durability path before they throttle commit rates.
  • Thread creation rate (Threads_created): Shows whether connection pooling is effective or missing, preventing unnecessary CPU overhead from thread initialization.
  • Query quality indicators (Handler_read_rnd_next, Select_full_join): Catches query plan regressions and missing indexes by tracking full table scan and unindexed join rates.

Expert signals (Level 4)

Expert signals dive into structured internal instrumentation. These are required for complex debugging, deep performance tuning, and preventing subtle replication divergence.

  • Per-query latency distribution (performance_schema.events_statements_summary_by_digest): Tracks specific query regressions hidden by aggregate throughput metrics to pinpoint exact deployment or schema-change failures.
  • Semaphore waits (SHOW ENGINE INNODB STATUS): Exposes internal mutex and latch contention, such as adaptive hash index storms, that are invisible to standard status variables.
  • Lock wait chain reconstruction (performance_schema.data_lock_waits): Details exactly which transactions are blocking others and on which rows, simplifying complex deadlock and concurrency analysis.
  • GTID execution consistency (GTID_SUBTRACT): Detects replica data divergence or errant transactions before an automated failover fails due to split-brain conditions or missing transactions.

How Netdata helps

Netdata provides pre-configured dashboards and alerts for MySQL out of the box, mapping directly to this operational checklist.

  • Real-time saturation tracking: Netdata visualizes Threads_connected against max_connections and Threads_running against CPU cores, providing instant visibility into connection and execution queue saturation.
  • Replication health: Netdata monitors both replication thread states and Seconds_Behind_Source, alerting you to broken apply threads before data staleness impacts application logic.
  • InnoDB internals: Netdata automatically collects and charts buffer pool hit ratios, checkpoint age, and log buffer waits, giving you early warning before write-path cliff edges or read thrashing occurs.
  • Performance schema integration: Netdata leverages performance schema digest data to track per-query latency changes over time, isolating specific query regressions instead of just aggregate slowdowns.
  • System-level correlation: By mapping MySQL disk I/O metrics directly to Linux block device latency and utilization, Netdata helps you distinguish between database tuning issues and underlying storage hardware bottlenecks.