Imagine your server hangs even on simple, seemingly unrelated commands such as cat file, ps aux or top. This is the issue I've been dealing with today. Short version: it was the MySQL/MariaDB server clogging the VPS's I/O.
Important: even though a terminal window with an established SSH connection would freeze, I could always open a new terminal window with a fresh SSH connection to try a different command, or to debug the one stalled in the first window.
Below are the steps I used to debug it.
Confirm this actually is I/O – iostat
Run iostat -x 1 and check %b.
root@:~ # iostat -x 1
extended device statistics
device r/s w/s kr/s kw/s ms/r ms/w ms/o ms/t qlen %b
da0 201 98 7201.8 5448.0 5 30 0 13 1 98
cd0 0 0 0.0 0.0 0 0 0 0 0 0
pass0 0 0 0.0 0.0 0 0 0 0 0 0
pass1 0 0 0.0 0.0 0 0 0 0 0 0
extended device statistics
device r/s w/s kr/s kw/s ms/r ms/w ms/o ms/t qlen %b
da0 199 101 7276.7 6159.6 4 11 0 6 0 97
cd0 0 0 0.0 0.0 0 0 0 0 0 0
pass0 0 0 0.0 0.0 0 0 0 0 0 0
pass1 0 0 0.0 0.0 0 0 0 0 0 0
extended device statistics
device r/s w/s kr/s kw/s ms/r ms/w ms/o ms/t qlen %b
da0 195 105 7839.5 6175.6 4 22 0 10 0 97
cd0 0 0 0.0 0.0 0 0 0 0 0 0
pass0 0 0 0.0 0.0 0 0 0 0 0 0
pass1 0 0 0.0 0.0 0 0 0 0 0 0
extended device statistics
device r/s w/s kr/s kw/s ms/r ms/w ms/o ms/t qlen %b
da0 166 125 6351.7 6129.4 5 53 0 25 0 99
cd0 0 0 0.0 0.0 0 0 0 0 0 0
pass0 0 0 0.0 0.0 0 0 0 0 0 0
pass1 0 0 0.0 0.0 0 0 0 0 0 0
^C
(Output truncated, but you get the idea.)
The above shows very heavy disk utilization and write latency spikes:
- %b pushing 100% almost constantly confirms the disk is busy nearly all the time
- ms/w jumping to 53 ms shows that writes are slow
- if qlen goes up occasionally, that means queued I/O bursts (backpressure)
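If you would rather not eyeball the numbers, the %b column can be filtered with awk. A minimal sketch assuming FreeBSD's iostat -x layout, where %b is the last field; the heredoc carries the captured sample from above, and live you would pipe iostat -x 1 into the same awk program:

```shell
#!/bin/sh
# Count samples where da0 is more than 90% busy. %b is the last field
# in FreeBSD's `iostat -x` output. The heredoc stands in for live data;
# in practice you would pipe `iostat -x 1` in instead.
awk '$1 == "da0" && $NF + 0 > 90 { busy++ } END { print busy+0, "busy samples" }' <<'EOF'
da0 201 98 7201.8 5448.0 5 30 0 13 1 98
cd0 0 0 0.0 0.0 0 0 0 0 0 0
da0 199 101 7276.7 6159.6 4 11 0 6 0 97
da0 166 125 6351.7 6129.4 5 53 0 25 0 99
EOF
```

With the sample above this prints "3 busy samples"; anything persistently nonzero means the disk is saturated.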
Confirm this actually is I/O – vmstat
Run vmstat 1 and check the column for da0, the busy device identified in the previous test.
root@:~ # vmstat 1
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr da0 cd0 in sy cs us sy id
0 0 0 1.1G 455M 0 0 0 0 1.9k 60 300 0 274 654 2.1k 1 1 96
0 0 0 1.1G 457M 0 0 0 0 2.6k 60 299 0 288 537 1.8k 0 5 93
1 0 0 1.1G 457M 0 0 0 0 2.9k 60 301 0 298 467 1.9k 0 0 99
0 0 0 1.1G 456M 0 0 0 0 2.0k 60 298 0 294 413 1.6k 0 5 94
^C
(Output truncated, but you get the idea.)
As you can see, the same device da0 is busy, doing around 300 transfers per second.
Do you run a database on this server?
On FreeBSD, MariaDB is almost always configured with durability-heavy defaults.
Check the status: service mysql-server onestatus
root@:~ # service mysql-server onestatus
mysql is running as pid 2313.
Run procstat kstack <PID> | egrep 'bwait|bufwait' to spot I/O buffer waits. Grep is used as a fast filter. Use a for loop to check multiple snapshots.
for i in 0 1 2 3 4 5 6 7 8 9; do procstat kstack <PID> | egrep 'bwait|bufwait'; sleep 2; done
If this prints anything, at least one thread is blocked on filesystem and/or disk I/O. Example:
root@:~ # procstat kstack 2313 | egrep 'bwait|bufwait'
2313 100447 mariadbd - mi_switch _sleep bwait bufwait cluster_read ffs_read VOP_READ_APV vn_read vn_io_fault_doio vn_io_fault1 vn_io_fault dofileread kern_preadv sys_pread amd64_syscall fast_syscall_common
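Since grep is only used as a filter here, you can also count matching threads with grep -E -c. A small sketch run against a captured kstack snapshot (the heredoc below; live, you would pipe procstat kstack <PID> in instead):

```shell
#!/bin/sh
# Count threads sitting in bwait/bufwait. The two lines are a captured
# snapshot: the first thread is blocked on buffer I/O, the second is idle.
grep -Ec 'bwait|bufwait' <<'EOF'
2313 100447 mariadbd - mi_switch _sleep bwait bufwait cluster_read ffs_read
2313 100448 mariadbd - mi_switch _sleep select kqueue
EOF
```

A count that stays above zero across several snapshots of the loop above is a strong sign the process is I/O bound.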
So this is MariaDB?
Check if your MariaDB server is tuned appropriately.
- How much RAM does this server have?

root@:~ # sysctl hw.physmem
hw.physmem: 2111275008

That is about 2 GB.

- How large is the database?

root@:~ # du -sh /var/db/mysql
 40G	/var/db/mysql

- Check innodb_buffer_pool_size, innodb_flush_log_at_trx_commit and sync_binlog.

root@:~ # mysql -e "SELECT @@innodb_buffer_pool_size/1024/1024 AS buffer_pool_mb; SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'; SHOW VARIABLES LIKE 'sync_binlog';"
+----------------+
| buffer_pool_mb |
+----------------+
|   128.00000000 |
+----------------+
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 1     |
+--------------------------------+-------+
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| sync_binlog   | 0     |
+---------------+-------+

The above shows:
- InnoDB buffer pool is 128 MB, which is tiny
- innodb_flush_log_at_trx_commit = 1 means an fsync on every commit
So we’re running a 40 GB dataset with effectively 128 MB of cache. That guarantees constant cache misses. On top of that, trx_commit=1 forces frequent log flushes, which is a problem on storage with poor sync latency.
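To put a number on a better cache size: a common rule of thumb (an assumption, not official MariaDB guidance) is to give InnoDB roughly half the RAM of a box that primarily runs the database. A sketch using the roughly 2 GB physical memory reading from above:

```shell
#!/bin/sh
# Rough sizing: ~50% of physical RAM for the buffer pool on a box that
# mostly runs the database. 2111275008 is the hw.physmem reading (~2 GB);
# live, you would use physmem=$(sysctl -n hw.physmem).
physmem=2111275008
awk -v m="$physmem" 'BEGIN { printf "suggested buffer pool: %d MB\n", m / 2 / 1024 / 1024 }'
```

This prints "suggested buffer pool: 1006 MB", which is why 1G is the starting point used in the fix below.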
Fix
- Increase innodb_buffer_pool_size. On a 2 GB box, a sane starting point is 1G if this machine primarily runs MySQL/MariaDB.
- Relax the flush policy (optional but very impactful on a VPS). Set innodb_flush_log_at_trx_commit = 2 to avoid an fsync on every commit. Tradeoff: you can lose up to ~1 second of transactions if the host crashes.
Add this to /usr/local/etc/mysql/conf.d/server.cnf under [mysqld]:
# Cache more of the working set in RAM (big read I/O reduction)
innodb_buffer_pool_size = 1G
# Reduce fsync pressure (reduce stalling on VPS storage)
innodb_flush_log_at_trx_commit = 2
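After editing the config, restart the service and read the values back to confirm they took effect. A sketch, assuming the same FreeBSD rc script name and local mysql client access used earlier in this post:

```shell
# Restart MariaDB so the new settings load, then read them back.
service mysql-server restart
mysql -e "SELECT @@innodb_buffer_pool_size/1024/1024 AS buffer_pool_mb,
                 @@innodb_flush_log_at_trx_commit AS trx_commit;"
```

Once the buffer pool has warmed up, re-run iostat -x 1 and watch %b: the read traffic should drop sharply and the freezes should stop.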