MySQL online DDL still blocking: ALGORITHM, LOCK, and the copy phase

Online DDL in MySQL is not lock-free. Even with ALGORITHM=INPLACE or ALGORITHM=INSTANT, and even with LOCK=NONE, every online DDL operation passes through a brief window where it upgrades its metadata lock (MDL) to exclusive. That window is short, but it is real, and it is the single most common reason an “online” schema change still stalls a busy production table.

This article explains the three-phase MDL model that online DDL uses, why ALGORITHM=INPLACE with LOCK=NONE can still block, what the copy phase actually does, and how third-party tools such as pt-online-schema-change and gh-ost change the picture. It assumes you already understand the broader MySQL mental model. See the hub page for the failure-pattern catalogue that frames metadata lock stalls as one of MySQL’s characteristic outage archetypes.

What online DDL actually promises

“Online” in MySQL terminology means concurrent DML (INSERT, UPDATE, DELETE) is permitted during the execution phase of the DDL. It does not mean the DDL acquires no exclusive locks. The promise is narrower than most operators assume:

  • ALGORITHM=INPLACE with LOCK=NONE allows concurrent reads and writes during the bulk of the operation.
  • ALGORITHM=INSTANT modifies only data dictionary metadata and touches no table data.
  • Both still require metadata locks, and both still upgrade to an exclusive MDL at commit time.

If your expectation is “the DDL will never block another session,” that expectation is wrong. The correct expectation is “the DDL blocks for a very short time at the start and at the end, and during that window anything holding a shared MDL on the table will block the DDL, which in turn blocks everything queuing behind the DDL.”

For the full background on how metadata locks cascade into partial outages, see the hub page section on the metadata lock cascade pattern.

The three-phase MDL model

The MySQL 8.4 Reference Manual documents online DDL as three phases. Understanding the phases is the key to understanding every “online DDL still blocking” symptom.

flowchart TD
  A[Phase 1: Initialization] -->|shared upgradeable MDL| B[Phase 2: Execution]
  B -->|concurrent DML permitted for INPLACE/INSTANT| C[Phase 3: Commit table definition]
  C -->|exclusive MDL, brief| D[New table definition visible]
  D -->|all new DML on table resumes| E[Operation complete]
  C -->|any session holding shared MDL blocks commit| F[DDL waits, new DML queues behind DDL]

Phase 1: initialization

The DDL takes a shared, upgradeable metadata lock on the table. This lock is compatible with concurrent DML, which is why the operation can proceed while queries run. The server assesses whether the operation can be done INPLACE or must fall back to COPY.

Phase 2: execution

For ALGORITHM=INPLACE with LOCK=NONE, concurrent DML is permitted. InnoDB may build a new version of the table (or index) in the background while the original table continues to serve reads and writes. For ALGORITHM=INSTANT, this phase does effectively nothing to table data because only data dictionary metadata changes.

However, the manual notes that an exclusive metadata lock may be taken briefly during the execution phase, during statement preparation. Whether the upgrade happens depends on factors assessed in the initialization phase. If it is required, it is brief, but it is still exclusive.

Phase 3: commit table definition

This is where the “still blocking” symptom originates. The manual states that in the commit table definition phase, the metadata lock is upgraded to exclusive. The server must evict the old table definition from every session’s table cache and publish the new one. While the exclusive MDL is held:

  • New DML against the table queues behind the exclusive request.
  • Any session that still holds a shared MDL on the table prevents the DDL from acquiring the exclusive MDL.
  • The DDL therefore waits for that shared MDL to release, and every new DML statement queues behind the DDL.

This is the metadata lock cascade, triggered by a DDL that is otherwise fully “online” during execution.

Why ALGORITHM=INPLACE with LOCK=NONE still blocks

LOCK=NONE governs what level of table access the DDL requires during execution. It does not remove the exclusive MDL requirement at commit. The two clauses operate at different layers:

  • LOCK=NONE controls InnoDB-level table access during the bulk of the operation.
  • Metadata locks are a server-level layer, separate from InnoDB row locks, and they apply regardless of the LOCK clause.

Bug #106480 (filed February 2022, last publicly modified May 2022) traced this to wait_while_table_is_used() in sql/sql_base.cc, which unconditionally upgrades to an exclusive MDL before commit regardless of the locking mode requested. The MySQL 8.4 manual text now acknowledges that an exclusive MDL may be taken briefly during execution and is taken at commit.

The practical consequence is that the presence of any long-running transaction holding a shared MDL on the table will block the DDL’s commit, and every new DML statement on the table will queue behind the DDL. The DDL is not stuck because of row locks or because the copy is slow. It is stuck because the server cannot get exclusive control of the table definition to swap in the new one.

Why ALGORITHM=INSTANT still blocks

ALGORITHM=INSTANT modifies only data dictionary metadata. No table data is copied, no rows are rebuilt, and concurrent DML is permitted during execution. Despite this, INSTANT still requires the exclusive MDL at commit. The same wait_while_table_is_used() code path applies.

INSTANT in MySQL 8.4 supports operations such as adding or dropping a column (at any position since 8.0.29), renaming a column, setting or dropping a column default, adding or dropping a virtual generated column, and changing index type. INSTANT is the default algorithm in MySQL 8.4. The server tries INSTANT first, then INPLACE, then COPY.

Always specify ALGORITHM explicitly. If you request ALGORITHM=INSTANT and the operation cannot be done INSTANT, MySQL returns an error rather than silently degrading to INPLACE or COPY. That error is cheap insurance against an unexpected full table rebuild.

One subtlety: ALGORITHM=INSTANT combined with LOCK=NONE, LOCK=SHARED, or LOCK=EXCLUSIVE produces an error. Only LOCK=DEFAULT is compatible with ALGORITHM=INSTANT.

The copy phase

ALGORITHM=COPY is the legacy path. It creates a temporary table, copies every row from the original, applies the schema change to the copy, and then swaps the new table in for the old. The copy phase blocks concurrent DML and holds an exclusive MDL for the entire operation, plus the brief commit-phase upgrade.

You can confirm a COPY operation by checking the rows-affected count. A nonzero rows-affected count after an ALTER indicates data was copied. INPLACE operations that do not rebuild the table report zero rows affected.

INPLACE rebuild operations

Some INPLACE operations rebuild the table internally even though they are classified as INPLACE rather than COPY. Examples include adding or dropping a primary key, changing a column type, and reordering columns. These operations consume disk space roughly equal to the original table size in the InnoDB temporary directory, controlled by innodb_tmpdir. They still permit concurrent DML when combined with LOCK=NONE, but the rebuild itself is more expensive than a metadata-only change.

The online alter log and the commit window

For ALGORITHM=INPLACE with LOCK=NONE, InnoDB buffers concurrent DML changes during execution in the online alter log, sized by innodb_online_alter_log_max_size (default 128 MB). These buffered changes are replayed at the end of the operation, during the exclusive MDL window at commit. The larger the log, the more replay work happens under the exclusive MDL, and the longer the commit window.

If the online alter log overflows during execution, the DDL fails and rolls back any uncommitted concurrent changes. Monitoring innodb_online_alter_log_max_size relative to write rate during a long INPLACE operation is important on write-heavy tables.

The INSTANT row-version limit

Every INSTANT ADD COLUMN or DROP COLUMN increments a per-table counter, visible as TOTAL_ROW_VERSIONS in INFORMATION_SCHEMA.INNODB_TABLES. In MySQL 8.0 through 8.4, the limit is 64. MySQL 9.1.0 raised the limit to 255.

When the limit is reached, MySQL silently degrades to INPLACE or COPY unless ALGORITHM is explicitly specified. If you always specify ALGORITHM=INSTANT, the server returns an error instead of silently downgrading, which is the safer operational behavior.

Monitor row versions with:

SELECT NAME, TOTAL_ROW_VERSIONS
FROM INFORMATION_SCHEMA.INNODB_TABLES
WHERE TOTAL_ROW_VERSIONS > 0
ORDER BY TOTAL_ROW_VERSIONS DESC;

Reset the counter with OPTIMIZE TABLE or ALTER TABLE ... ENGINE=InnoDB. Both rebuild the table and reset TOTAL_ROW_VERSIONS to zero.

pt-online-schema-change and gh-ost

Third-party schema change tools work around the copy phase by doing the copy outside of MySQL’s online DDL machinery, but they have their own MDL behavior.

pt-online-schema-change

pt-online-schema-change (pt-osc) creates a shadow table, copies rows in chunks, and keeps the shadow in sync with the original using INSERT, UPDATE, and DELETE triggers on the original table. Because of the trigger design, pt-osc cannot run on tables that already have triggers, and it requires exclusive access to the original table during trigger creation.

pt-osc holds metadata locks during the copy phase for operations such as trigger creation, the final table swap, and foreign key updates. Its lock_wait_timeout defaults to 60 seconds for metadata lock operations. If any session holds a long transaction on the table, pt-osc blocks at the same MDL layer as native online DDL.

gh-ost

gh-ost also holds MDL during the copy phase but uses row-based binlog streaming instead of triggers. This allows it to run on tables that already have triggers, which pt-osc cannot do. The trigger-free design changes some of the operational constraints but does not eliminate metadata lock requirements at the swap phase.

Both tools are useful when the operation cannot be done INPLACE or INSTANT, or when you need finer control over chunking, throttling, and pause-resume than native online DDL provides. Neither tool is lock-free.

Where this shows up in production

The most common production symptom is an ALTER TABLE that appears to hang, followed within minutes by a flood of application timeouts against the affected table. The pattern is:

  1. A DDL is issued during a window when a long transaction is open on the same table. The long transaction may be a monitoring query, a reporting query, an ORM that began a transaction without committing, or a mysqldump --single-transaction backup.
  2. The DDL reaches commit and requests the exclusive MDL.
  3. The DDL waits for the long transaction to release its shared MDL.
  4. New DML statements on the table queue behind the DDL’s exclusive request.
  5. The connection pool fills with waiting sessions. Other tables continue to work, producing a confusing partial outage.

The DDL is not doing work during the wait. It is blocked at the MDL layer. CPU and disk I/O may be completely normal, which makes the incident harder to diagnose if you are only watching resource metrics.

Tradeoffs and when to use what

GoalRecommended approachTradeoff
Add or drop a column, rename a column, change index typeALGORITHM=INSTANTStill needs exclusive MDL at commit, but commit is fast
Add secondary index, change column type, rebuild primary keyALGORITHM=INPLACE, LOCK=NONEConcurrent DML allowed, but watch online alter log size and commit replay
Operation not supported by INPLACEALGORITHM=COPY or pt-osc / gh-ostCOPY blocks DML for the whole operation; pt-osc / gh-ost avoid that but add tooling complexity
Force failure instead of silent downgradeAlways specify ALGORITHM explicitlyOperation errors out if the requested algorithm is not supported

Run DDL when no long transactions are open on the target table. This is the single most effective way to avoid the commit-phase MDL stall.

Signals to watch in production

SignalWhy it mattersWarning sign
performance_schema.metadata_locks rows with LOCK_STATUS = 'PENDING'Direct evidence of MDL queue buildingMore than 3 sessions pending on the same object, sustained
SHOW PROCESSLIST state “Waiting for table metadata lock”Sessions blocked behind a DDL or behind each otherCount rising on a single table
INNODB_TRX oldest transaction ageLong transactions hold shared MDLsAny OLTP transaction older than 60 seconds
information_schema.INNODB_TABLES.TOTAL_ROW_VERSIONSApproaching the 64-row-version ceiling before silent INPLACE/COPY fallbackCounter approaching 50 for frequently altered tables
Innodb_buffer_pool_pages_dirty ratio during INPLACE rebuildRebuild dirties many pages, can pressure checkpointSustained dirty ratio above 75%
innodb_online_alter_log_max_size relative to write rateOnline alter log overflow aborts the DDLLog filling during long INPLACE on write-heavy table

How Netdata helps

  • The MySQL collector exposes Threads_running, Threads_connected, and Questions rate, which together reveal the moment a DDL starts to stall query processing.
  • Netdata’s correlation across the metadata lock cascade makes it easier to see the relationship between a rising MDL wait count and a corresponding drop in throughput on the affected table.
  • Monitoring long-lived transactions via INNODB_TRX age gives early warning before a DDL is issued against a table with an open transaction.
  • History list length tracking (trx_rseg_history_len) catches the related failure mode where an idle transaction is also blocking purge.
  • Per-table query digest latency from performance_schema.events_statements_summary_by_digest shows whether a stall is isolated to one table or system-wide.
  • Alerting on performance_schema.metadata_locks pending counts detects the cascade before it reaches connection exhaustion.