Diagnosing and Fixing an I/O Bottleneck in My Kubernetes Node

I’ve been having some issues with my homelab servers. One was that the media server (Mac Mini hardware running Ubuntu) kept freezing and locking up, requiring a manual restart. This caused the Kubernetes containers running on it to fail. Kubernetes, in response, would move the failed apps to another node (an HP laptop) and slightly overload it. I suspected the issue was I/O-related, since this machine used a spinning hard disk drive (HDD), which is much slower than a solid-state drive (SSD). The media server has significantly less memory than the HP, but it has a better processor, and its RAM never maxed out, so CPU and RAM weren’t the issue. Before I set up Kubernetes on it, its performance was acceptable, so I figured the slowdown came from Kubernetes-related disk activity such as logs, image pulls, and metrics hammering the disk.

I fixed the problem by cloning the system from the HDD to a new SSD and swapping the drives. In this post, I’ll walk you through how I did that with minimal downtime.

The Problem

Node froze often, especially during heavy disk usage.
It had a decent CPU but less RAM than my HP635 node.
Disk I/O seemed to be the bottleneck.
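
One way to put a number on a suspicion like this is to look at how much CPU time is spent waiting on I/O. The sketch below is not something from my original troubleshooting; it is a minimal example that assumes a Linux host exposing /proc/stat, and the function name iowait_pct is made up for illustration. It compares two samples of the aggregate "cpu" line and prints the iowait share of the elapsed CPU time.

```shell
# Print the percentage of CPU time spent in iowait between two samples
# of the aggregate "cpu" line from /proc/stat.
# Field order: cpu user nice system idle iowait irq softirq steal ...
# iowait_pct is a hypothetical helper name, not a standard tool.
iowait_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, " "); split(b, y, " ")
    # Sum the deltas of user..steal (fields 2 to 9) for the total elapsed time.
    for (i = 2; i <= 9; i++) total += y[i] - x[i]
    if (total <= 0) { print "0.0"; exit }
    # Field 6 is iowait.
    printf "%.1f\n", 100 * (y[6] - x[6]) / total
  }'
}

# Live sample over two seconds, only where /proc/stat is available.
if [ -r /proc/stat ]; then
  first=$(head -n 1 /proc/stat)
  sleep 2
  second=$(head -n 1 /proc/stat)
  echo "iowait over 2s: $(iowait_pct "$first" "$second")%"
fi
```

A consistently high value while the node is under load would support the disk being the bottleneck; a low one would point elsewhere.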

The Plan

Clone the HDD to a new SSD.
Swap the drives.
Boot from the SSD.

Steps I Took

I drained the node of all Kubernetes applications running on it. The apps moved to the HP node.
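
A drain like this can be sketched as follows. The node name media-server is a placeholder (the real name comes from kubectl get nodes), and the two flags are an assumption about what a typical drain needs rather than a record of the exact command used here.

```shell
# Evict workloads from a node and mark it unschedulable.
# --ignore-daemonsets: DaemonSet pods cannot be evicted, so skip them.
# --delete-emptydir-data: allow evicting pods that use emptyDir volumes
#   (their scratch data is discarded).
drain_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Allow pods to be scheduled on the node again once the work is done.
undrain_node() {
  kubectl uncordon "$1"
}

# Usage with a hypothetical node name:
#   drain_node media-server
#   undrain_node media-server
```

Draining also cordons the node, so nothing gets scheduled back onto it until it is uncordoned.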
Next, I connected the new SSD to the server using a USB-SATA connector.
Checked the drive names with lsblk:
- sda: 149.1GB (old HDD)
- sdb: 238.5GB (new SSD)
The new SSD was larger than the HDD, which made cloning much easier since I didn’t need to resize the contents of the source disk. I ensured that the SSD was connected but not mounted.
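
The "connected but not mounted" check can be scripted. This is a small sketch that assumes the SSD shows up as /dev/sdb, as in the lsblk output above; disk_is_mounted is a made-up helper name.

```shell
# Succeeds (exit status 0) if the disk or any of its partitions
# currently has a mount point, according to lsblk.
disk_is_mounted() {
  lsblk -nro MOUNTPOINT "$1" 2>/dev/null | grep -q .
}

if disk_is_mounted /dev/sdb; then
  echo "/dev/sdb has mounted partitions; unmount them before cloning"
else
  echo "/dev/sdb is not mounted"
fi
```

Writing to a disk with dd while one of its partitions is mounted risks corrupting whatever is on it, so this is worth checking before cloning.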
Next, I estimated how long the clone would take by timing a 256 MB read from the hard drive, using this command and the formula below:
```
sudo dd if=/dev/sda of=/dev/null bs=1M count=256 status=progress
```
- Running the command above gave me a rough estimate of the disk’s speed. I then divided the disk size by the speed to get how long it’d take to copy the entire disk. (Formula: disk size (MB) ÷ speed (MB/s) = ETA)
- I knew that it would take longer to copy data from the HDD to the SSD through a USB cable, but running the test above allowed me to get a reasonable estimate of the time it’d take. My initial thought was that it’d take about 40 minutes to an hour to clone the drive.
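
The formula is easy to run as a quick calculation. In the sketch below, the 149.1 GB size comes from the lsblk output above, but the 60 MB/s read speed is a hypothetical figure chosen for illustration, since the speed dd reported isn't recorded in this post. The helper names are made up as well.

```shell
# eta_minutes SIZE_MB SPEED_MB_PER_S
# Estimated copy time in minutes: size / speed gives seconds, / 60 gives minutes.
eta_minutes() {
  awk -v size="$1" -v speed="$2" 'BEGIN { printf "%.1f\n", size / speed / 60 }'
}

# effective_mb_per_s SIZE_MB SECONDS
# Average throughput actually achieved over a finished copy.
effective_mb_per_s() {
  awk -v size="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", size / secs }'
}

# 149.1 GB (about 149100 MB) at a hypothetical 60 MB/s.
eta_minutes 149100 60          # prints 41.4

# The same disk copied in 2 hours (7200 seconds).
effective_mb_per_s 149100 7200 # prints 20.7
```

The second call works backwards from a 2-hour copy and shows the average throughput dropping to roughly 20 MB/s once the USB adapter and the SSD writes are part of the path, which is why a read-only estimate comes out optimistic.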
The server wasn’t connected to a screen, and setting up a screen for it would have been inconvenient, so I cloned the drive over SSH.
To prevent losing progress if the SSH connection dropped during the clone process, I started a tmux session to keep the job running in the background:
```
tmux new -s clone_disk
```
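
If the SSH connection does drop, the session keeps running on the server and can be picked up again after reconnecting. A minimal sketch, using the clone_disk session name from above and a made-up helper name:

```shell
# Attach to the named tmux session if it already exists,
# otherwise create it.
attach_or_create() {
  if tmux has-session -t "$1" 2>/dev/null; then
    tmux attach -t "$1"
  else
    tmux new -s "$1"
  fi
}

# Usage after logging back in over SSH:
#   attach_or_create clone_disk
```

With tmux’s default key bindings, Ctrl-b followed by d detaches from the session and leaves the job running.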
Next, I cloned the disk using dd:
```
sudo dd if=/dev/sda of=/dev/sdb bs=1M status=progress conv=noerror,sync
```
- Here, if is the input device, of is the output device, bs is the block size, and status=progress prints running transfer statistics. conv=noerror,sync tells dd to keep going if it hits a read error on the input device and to pad the affected input block with zeroes, so the output stays aligned block by block.
- Cloning the drive took twice as long as I had initially thought: the process completed after 2 hours.
Once the disk was cloned, I ran a check on the cloned disk:
```
sudo fdisk -l /dev/sdb
```
- The check was mostly okay except for this warning:
  "The backup GPT table is not on the end of the device." This meant that the backup copy of the GPT (GUID Partition Table) wasn’t located at the physical end of the disk, which is where it’s supposed to be. dd copied the primary GPT header and partition table to the beginning of the SSD, and the backup GPT to the same offset it occupied on the HDD (that disk’s last sectors). Since the SSD is bigger, that offset is no longer the end of the disk. I took note of the warning and moved on.
After swapping the drives, I booted from the SSD. The server booted up without errors and quicker than it had previously.
Next, I fixed the GPT warning using gdisk:
```
sudo gdisk /dev/sda
```
- Entered v to verify
- And w to write the fix
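
The same fix can also be done non-interactively with sgdisk, the scriptable tool that ships alongside gdisk. This is an alternative sketch, not what I ran; relocate_backup_gpt is a made-up helper name, and /dev/sda is the SSD’s name after the swap.

```shell
# Move the backup GPT header and table to the end of the disk,
# then verify the partition table.
relocate_backup_gpt() {
  sudo sgdisk --move-second-header "$1" && sudo sgdisk --verify "$1"
}

# Usage:
#   relocate_backup_gpt /dev/sda
```

Either way, the backup structures end up at the true end of the disk, and the space after the last partition becomes usable.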
I tried to resize the new drive’s volume by running:
```
sudo pvresize /dev/sda2
```
Checking the volume sizes with the commands below showed no change; the volume still showed the old size of a hundred-odd GB:
```
sudo vgdisplay
sudo lvdisplay
```
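This happened because the partition itself was still the size it had been on the HDD, so pvresize had nothing to grow into. To resolve this, I installed cloud-guest-utils (which provides growpart), grew the partition to take up the extra space, resized the physical volume, and extended the logical volume and its filesystem: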
- Installed the required tool:
```
sudo apt install cloud-guest-utils
```
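- Grew the partition, resized the physical volume, then extended the logical volume and its filesystem:
```
# growpart takes the disk and the partition number as separate arguments
sudo growpart /dev/sda 2
# Let LVM see the larger partition
sudo pvresize /dev/sda2
# Give the root logical volume all the free space in the volume group
sudo lvextend -l +100%FREE /dev/ubuntu-vg/root
# Grow the filesystem to fill the logical volume
sudo resize2fs /dev/ubuntu-vg/root
```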
- After that, I confirmed that everything was the way it should have been by running:
```
lsblk /dev/sda
```
  Everything looked good: the drive and its partitions were using all the available space.

Replacing the drives

After verifying that the new drive looked good and that all the data from the old drive was copied successfully, I popped the Mac Mini open, removed the HDD and replaced it with the SSD before closing it back up.

[Image: SSD connected to a SATA-to-USB adapter; data was copied to the SSD first]

[Image: Unopened Mac Mini; data was copied to the SSD first]

The Result

The media server now runs on a solid-state drive, and it’s faster and more stable. To test it, I drained all Kubernetes applications from the HP control node onto the Mac Mini and ran them there for a few days. They ran beautifully: the CPU didn’t take a huge hit, and the server didn’t lock up under the heavy I/O from the apps. One slight problem I have now is that the server’s fan is much louder. When replacing the hard drive, I forgot to reinstall a temperature sensor in the Mac Mini that regulates how fast the fan runs. Now it probably thinks the sensor is damaged, so the fan runs at full throttle all the time, and it’s quite loud. Other than that, the server works perfectly.

Takeaway

I need to be more careful when opening Apple devices; it’s easy to forget to put an important component back. If you’re running a server on an old HDD and it keeps freezing, locking up, or running slowly, consider swapping it out for an SSD. Clone your data to the new drive with dd before swapping the drives, fix any GPT issues with gdisk, and resize your volumes.
