Skip to content

Linux

๐Ÿ“ฆ Auto-growing disks in Vagrant: because 10 GB is never enough

Have you ever fired up a Vagrant VM, provisioned a project, pulled some Docker images, ran a buildโ€ฆ and ran out of disk space halfway through? Welcome to my world. Apparently, the default disk size in Vagrant is tinyโ€”and while you can specify a bigger virtual disk, Ubuntu won’t magically use the extra space. You need to resize the partition, the physical volume, the logical volume, and the filesystem. Every. Single. Time.

Enough of that nonsense.

๐Ÿ›  The setup

Hereโ€™s the relevant part of my Vagrantfile:

Vagrant.configure(2) do |config|
  config.vm.box = 'boxen/ubuntu-24.04'
  config.vm.disk :disk, size: '20GB', primary: true

  config.vm.provision 'shell', path: 'resize_disk.sh'
end

This makes sure the disk is large enough and automatically resized by the resize_disk.sh script at first boot.

โœจ The script

#!/bin/bash
set -euo pipefail
LOGFILE="/var/log/resize_disk.log"
exec > >(tee -a "$LOGFILE") 2>&1
echo "[$(date)] Starting disk resize process..."

REQUIRED_TOOLS=("parted" "pvresize" "lvresize" "lvdisplay" "grep" "awk")
for tool in "${REQUIRED_TOOLS[@]}"; do
  if ! command -v "$tool" &>/dev/null; then
    echo "[$(date)] ERROR: Required tool '$tool' is missing. Exiting."
    exit 1
  fi
done

# Read current and total partition size (in sectors)
parted_output=$(parted --script /dev/sda unit s print || true)
read -r PARTITION_SIZE TOTAL_SIZE < <(echo "$parted_output" | awk '
  / 3 / {part = $4}
  /^Disk \/dev\/sda:/ {total = $3}
  END {print part, total}
')

# Trim 's' suffix
PARTITION_SIZE_NUM="${PARTITION_SIZE%s}"
TOTAL_SIZE_NUM="${TOTAL_SIZE%s}"

if [[ "$PARTITION_SIZE_NUM" -lt "$TOTAL_SIZE_NUM" ]]; then
  echo "[$(date)] Resizing partition /dev/sda3..."
  parted --fix --script /dev/sda resizepart 3 100%
else
  echo "[$(date)] Partition /dev/sda3 is already at full size. Skipping."
fi

if [[ "$(pvresize --test /dev/sda3 2>&1)" != *"successfully resized"* ]]; then
  echo "[$(date)] Resizing physical volume..."
  pvresize /dev/sda3
else
  echo "[$(date)] Physical volume is already resized. Skipping."
fi

LV_SIZE=$(lvdisplay --units M /dev/ubuntu-vg/ubuntu-lv | grep "LV Size" | awk '{print $3}' | tr -d 'MiB')
PE_SIZE=$(vgdisplay --units M /dev/ubuntu-vg | grep "PE Size" | awk '{print $3}' | tr -d 'MiB')
CURRENT_LE=$(lvdisplay /dev/ubuntu-vg/ubuntu-lv | grep "Current LE" | awk '{print $3}')

USED_SPACE=$(echo "$CURRENT_LE * $PE_SIZE" | bc)
FREE_SPACE=$(echo "$LV_SIZE - $USED_SPACE" | bc)

if (($(echo "$FREE_SPACE > 0" | bc -l))); then
  echo "[$(date)] Resizing logical volume..."
  lvresize -rl +100%FREE /dev/ubuntu-vg/ubuntu-lv
else
  echo "[$(date)] Logical volume is already fully extended. Skipping."
fi

๐Ÿ’ก Highlights

  • โœ… Uses parted with --script to avoid prompts.
  • โœ… Automatically fixes GPT mismatch warnings with --fix.
  • โœ… Calculates exact available space using lvdisplay and vgdisplay, with bc for floating point math.
  • โœ… Extends the partition, PV, and LV only when needed.
  • โœ… Logs everything to /var/log/resize_disk.log.

๐Ÿšจ Gotchas

  • Your disk must already use LVM. This script assumes you’re resizing /dev/ubuntu-vg/ubuntu-lv, the default for Ubuntu server installs.
  • You must use a Vagrant box that supports VirtualBox’s disk resizingโ€”thankfully, boxen/ubuntu-24.04 does.
  • If your LVM setup is different, youโ€™ll need to adapt device paths.

๐Ÿ” Automation FTW

Calling this script as a provisioner means I never have to think about disk space again during development. One less yak to shave.

Feel free to steal this setup, adapt it to your team, or improve it and send me a patch. Or better yetโ€”donโ€™t wait until your filesystem runs out of space at 3 AM.

Safer Commands with argv in Ansible: Pros, Cons, and Real Examples

When using Ansible to automate tasks, the command module is your bread and butter for executing system commands. But did you know that there’s a safer, cleaner, and more predictable way to pass arguments? Meet argvโ€”an alternative to writing commands as strings.

In this post, Iโ€™ll explore the pros and cons of using argv, and Iโ€™ll walk through several real-world examples tailored to web servers and mail servers.


Why Use argv Instead of a Command String?

โœ… Pros

  • Avoids Shell Parsing Issues: Each argument is passed exactly as intended, with no surprises from quoting or spaces.
  • More Secure: No shell = no risk of shell injection.
  • Clearer Syntax: Every argument is explicitly defined, improving readability.
  • Predictable: Behavior is consistent across different platforms and setups.

โŒ Cons

  • No Shell Features: You can’t use pipes (|), redirection (>), or environment variables like $HOME.
  • More Verbose: Every argument must be a separate list item. Itโ€™s explicit, but more to type.
  • Not for Shell Built-ins: Commands like cd, export, or echo with redirection won’t work.

Real-World Examples

Letโ€™s apply this to actual use cases.

๐Ÿ”ง Restarting Nginx with argv

- name: Restart Nginx using argv
  hosts: amedee.be
  become: yes
  tasks:
    - name: Restart Nginx
      ansible.builtin.command:
        argv:
          - systemctl
          - restart
          - nginx

๐Ÿ“ฌ Check Mail Queue on a Mail-in-a-Box Server

- name: Check Postfix mail queue using argv
  hosts: box.vangasse.eu
  become: yes
  tasks:
    - name: Get mail queue status
      ansible.builtin.command:
        argv:
          - mailq
      register: mail_queue

    - name: Show queue
      ansible.builtin.debug:
        msg: "{{ mail_queue.stdout_lines }}"

๐Ÿ—ƒ๏ธ Back Up WordPress Database

- name: Backup WordPress database using argv
  hosts: amedee.be
  become: yes
  vars:
    db_user: wordpress_user
    db_password: wordpress_password
    db_name: wordpress_db
  tasks:
    - name: Dump database
      ansible.builtin.command:
        argv:
          - mysqldump
          - -u
          - "{{ db_user }}"
          - -p{{ db_password }}
          - "{{ db_name }}"
          - --result-file=/root/wordpress_backup.sql

โš ๏ธ Avoid exposing credentials directlyโ€”use Ansible Vault instead.


Using argv with Interpolation

Ansible lets you use Jinja2-style variables ({{ }}) inside argv items.

๐Ÿ”„ Restart a Dynamic Service

- name: Restart a service using argv and variable
  hosts: localhost
  become: yes
  vars:
    service_name: nginx
  tasks:
    - name: Restart
      ansible.builtin.command:
        argv:
          - systemctl
          - restart
          - "{{ service_name }}"

๐Ÿ•’ Timestamped Backups

- name: Timestamped DB backup
  hosts: localhost
  become: yes
  vars:
    db_user: wordpress_user
    db_password: wordpress_password
    db_name: wordpress_db
  tasks:
    - name: Dump with timestamp
      ansible.builtin.command:
        argv:
          - mysqldump
          - -u
          - "{{ db_user }}"
          - -p{{ db_password }}
          - "{{ db_name }}"
          - --result-file=/root/wordpress_backup_{{ ansible_date_time.iso8601 }}.sql

๐Ÿงฉ Dynamic Argument Lists

Avoid join(' '), which collapses the list into a single string.

โŒ Wrong:

argv:
  - ls
  - "{{ args_list | join(' ') }}"  # BAD: becomes one long string

โœ… Correct:

argv: ["ls"] + args_list

Or if the length is known:

argv:
  - ls
  - "{{ args_list[0] }}"
  - "{{ args_list[1] }}"

๐Ÿ“ฃ Interpolation Inside Strings

- name: Greet with hostname
  hosts: localhost
  tasks:
    - name: Print message
      ansible.builtin.command:
        argv:
          - echo
          - "Hello, {{ ansible_facts['hostname'] }}!"


When to Use argv

โœ… Commands with complex quoting or multiple arguments
โœ… Tasks requiring safety and predictability
โœ… Scripts or binaries that take arguments, but not full shell expressions

When to Avoid argv

โŒ When you need pipes, redirection, or shell expansion
โŒ When you’re calling shell built-ins


Final Thoughts

Using argv in Ansible may feel a bit verbose, but it offers precision and security that traditional string commands lack. When you need reliable, cross-platform automation that avoids the quirks of shell parsing, argv is the better choice.

Prefer safety? Choose argv.
Need shell magic? Use the shell module.

Have a favorite argv trick or horror story? Drop it in the comments below.

Creating 10 000 Random Files & Analyzing Their Size Distribution: Because Why Not? ๐Ÿง๐Ÿ’พ

Ever wondered what itโ€™s like to unleash 10 000 tiny little data beasts on your hard drive? No? Well, buckle up anyway โ€” because today, weโ€™re diving into the curious world of random file generation, and then nerding out by calculating their size distribution. Spoiler alert: itโ€™s less fun than it sounds. ๐Ÿ˜

Step 1: Letโ€™s Make Some Files… Lots of Them

Our goal? Generate 10 000 files filled with random data. But not just any random sizes โ€” we want a mean file size of roughly 68 KB and a median of about 2 KB. Sounds like a math puzzle? Thatโ€™s because it kind of is.

If you just pick file sizes uniformly at random, youโ€™ll end up with a median close to the mean โ€” which is boring. We want a skewed distribution, where most files are small, but some are big enough to bring that average up.

The Magic Trick: Log-normal Distribution ๐ŸŽฉโœจ

Enter the log-normal distribution, a nifty way to generate lots of small numbers and a few big ones โ€” just like real life. Using Pythonโ€™s NumPy library, we generate these sizes and feed them to good old /dev/urandom to fill our files with pure randomness.

Hereโ€™s the Bash script that does the heavy lifting:

#!/bin/bash

# Directory to store the random files
output_dir="random_files"
mkdir -p "$output_dir"

# Total number of files to create
file_count=10000

# Log-normal distribution parameters
mean_log=9.0  # Adjusted for ~68KB mean
stddev_log=1.5  # Adjusted for ~2KB median

# Function to generate random numbers based on log-normal distribution
generate_random_size() {
    python3 -c "import numpy as np; print(int(np.random.lognormal($mean_log, $stddev_log)))"
}

# Create files with random data
for i in $(seq 1 $file_count); do
    file_size=$(generate_random_size)
    file_path="$output_dir/file_$i.bin"
    head -c "$file_size" /dev/urandom > "$file_path"
    echo "Generated file $i with size $file_size bytes."
done

echo "Done. Files saved in $output_dir."

Easy enough, right? This creates a directory random_files and fills it with 10 000 files of sizes mostly small but occasionally wildly bigger. Donโ€™t blame me if your disk space takes a little hit! ๐Ÿ’ฅ

Step 2: Crunching Numbers โ€” The File Size Distribution ๐Ÿ“Š

Okay, youโ€™ve got the files. Now, what can we learn from their sizes? Letโ€™s find out the:

  • Mean size: The average size across all files.
  • Median size: The middle value when sizes are sorted โ€” because averages can lie.
  • Distribution breakdown: How many tiny files vs. giant files.

Hereโ€™s a handy Bash script that reads file sizes and spits out these stats with a bit of flair:

#!/bin/bash

# Input directory (default to "random_files" if not provided)
directory="${1:-random_files}"

# Check if directory exists
if [ ! -d "$directory" ]; then
    echo "Directory $directory does not exist."
    exit 1
fi

# Array to store file sizes
file_sizes=($(find "$directory" -type f -exec stat -c%s {} \;))

# Check if there are files in the directory
if [ ${#file_sizes[@]} -eq 0 ]; then
    echo "No files found in the directory $directory."
    exit 1
fi

# Calculate mean
total_size=0
for size in "${file_sizes[@]}"; do
    total_size=$((total_size + size))
done
mean=$((total_size / ${#file_sizes[@]}))

# Calculate median
sorted_sizes=($(printf '%s\n' "${file_sizes[@]}" | sort -n))
mid=$(( ${#sorted_sizes[@]} / 2 ))
if (( ${#sorted_sizes[@]} % 2 == 0 )); then
    median=$(( (sorted_sizes[mid-1] + sorted_sizes[mid]) / 2 ))
else
    median=${sorted_sizes[mid]}
fi

# Display file size distribution
echo "File size distribution in directory $directory:"
echo "---------------------------------------------"
echo "Number of files: ${#file_sizes[@]}"
echo "Mean size: $mean bytes"
echo "Median size: $median bytes"

# Display detailed size distribution (optional)
echo
echo "Detailed distribution (size ranges):"
awk '{
    if ($1 < 1024) bins["< 1 KB"]++;
    else if ($1 < 10240) bins["1 KB - 10 KB"]++;
    else if ($1 < 102400) bins["10 KB - 100 KB"]++;
    else bins[">= 100 KB"]++;
} END {
    for (range in bins) printf "%-15s: %d\n", range, bins[range];
}' <(printf '%s\n' "${file_sizes[@]}")

Run it, and voilร  โ€” instant nerd satisfaction.

Example Output:

File size distribution in directory random_files:
---------------------------------------------
Number of files: 10000
Mean size: 68987 bytes
Median size: 2048 bytes

Detailed distribution (size ranges):
&lt; 1 KB         : 1234
1 KB - 10 KB   : 5678
10 KB - 100 KB : 2890
>= 100 KB      : 198

Why Should You Care? ๐Ÿคทโ€โ™€๏ธ

Besides the obvious geek cred, generating files like this can help:

  • Test backup systems โ€” can they handle weird file size distributions?
  • Stress-test storage or network performance with real-world-like data.
  • Understand your data patterns if youโ€™re building apps that deal with files.

Wrapping Up: Big Files, Small Files, and the Chaos In Between

So there you have it. Ten thousand random files later, and weโ€™ve peeked behind the curtain to understand their size story. Itโ€™s a bit like hosting a party and then figuring out who ate how many snacks. ๐Ÿฟ

Try this yourself! Tweak the distribution parameters, generate files, crunch the numbers โ€” and impress your friends with your mad scripting skills. Or at least have a fun weekend project that makes you sound way smarter than you actually are.

Happy hacking! ๐Ÿ”ฅ

How I Tamed Duplicityโ€™s Buggy Versions โ€” and Made Sure I Always Have a Backup ๐Ÿ›ก๏ธ๐Ÿ’พ

If youโ€™re running Mail-in-a-Box like me, you might rely on Duplicity to handle backups quietly in the background. Itโ€™s a great tool โ€” until it isnโ€™t. Recently, I ran into some frustrating issues caused by buggy Duplicity versions. Hereโ€™s the story, a useful discussion from the Mail-in-a-Box forums, and a neat trick I use to keep fallback versions handy. Spoiler: it involves an APT hook and some smart file copying! ๐Ÿš€


The Problem with Duplicity Versions

Duplicity 3.0.1 and 3.0.5 have been reported to cause backup failures โ€” a real headache when you depend on them to protect your data. The Mail-in-a-Box forum post โ€œSomething is wrong with the backupโ€ dives into these issues with great detail. Users reported mysterious backup failures and eventually traced it back to specific Duplicity releases causing the problem.

Hereโ€™s the catch: those problematic versions sometimes sneak in during automatic updates. By the time you realize somethingโ€™s wrong, you might already have upgraded to a buggy release. ๐Ÿ˜ฉ


Pinning Problematic Versions with APT Preferences

One way to stop apt from installing those broken versions is to use APT pinning. Hereโ€™s an example file I created in /etc/apt/preferences/pin_duplicity.pref:

Explanation: Duplicity version 3.0.1* has a bug and should not be installed
Package: duplicity
Pin: version 3.0.1*
Pin-Priority: -1

Explanation: Duplicity version 3.0.5* has a bug and should not be installed
Package: duplicity
Pin: version 3.0.5*
Pin-Priority: -1

This tells apt to refuse to install these specific buggy versions. Sounds great, right? Except โ€” it often comes too late. You could already have updated to a broken version before adding the pin.

Also, since Duplicity is installed from a PPA, older versions vanish quickly as new releases push them out. This makes rolling back to a known good version a pain. ๐Ÿ˜ค


My Solution: Backing Up Known Good Duplicity .deb Files Automatically

To fix this, I created an APT hook that runs after every package operation involving Duplicity. It automatically copies the .deb package files of Duplicity from aptโ€™s archive cache โ€” and even from my local folder if Iโ€™m installing manually โ€” into a safe backup folder.

Hereโ€™s the script, saved as /usr/local/bin/apt-backup-duplicity.sh:

#!/bin/bash
set -x

mkdir -p /var/backups/debs/duplicity

cp -vn /var/cache/apt/archives/duplicity_*.deb /var/backups/debs/duplicity/ 2>/dev/null || true
cp -vn /root/duplicity_*.deb /var/backups/debs/duplicity/ 2>/dev/null || true

And hereโ€™s the APT hook configuration I put in /etc/apt/apt.conf.d/99backup-duplicity-debs to run this script automatically after DPKG operations:

DPkg::Post-Invoke { "/usr/local/bin/apt-backup-duplicity.sh"; };

Use apt-mark hold to Lock a Working Duplicity Version ๐Ÿ”’

Even with pinning and local .deb backups, there’s one more layer of protection I recommend: freezing a known-good version with apt-mark hold.

Once you’ve confirmed that your current version of Duplicity works reliably, run:

sudo apt-mark hold duplicity

This tells apt not to upgrade Duplicity, even if a newer version becomes available. Itโ€™s a great way to avoid accidentally replacing your working setup with something buggy during routine updates.

๐Ÿง  Pro Tip: I only unhold and upgrade Duplicity manually after checking the Mail-in-a-Box forum for reports that a newer version is safe.

When you’re ready to upgrade, do this:

sudo apt-mark unhold duplicity
sudo apt update
sudo apt install duplicity

If everything still works fine, you can apt-mark hold it again to freeze the new version.


How to Use Your Backup Versions to Roll Back

If a new Duplicity version breaks your backups, you can easily reinstall a known-good .deb file from your backup folder:

sudo apt install --reinstall /var/backups/debs/duplicity/duplicity_<version>.deb

Replace <version> with the actual filename you want to roll back to. Because you saved the .deb files right after each update, you always have access to older stable versions โ€” even if the PPA has moved on.


Final Thoughts

While pinning bad versions helps, having a local stash of known-good packages is a game changer. Add apt-mark hold on top of that, and you have a rock-solid defense against regressions. ๐Ÿชจโœจ

Itโ€™s a small extra step but pays off hugely when things go sideways. Plus, itโ€™s totally automated with the APT hook, so you donโ€™t have to remember to save anything manually. ๐ŸŽ‰

If you run Mail-in-a-Box or rely on Duplicity in any critical backup workflow, I highly recommend setting up this safety net.

Stay safe and backed up! ๐Ÿ›ก๏ธโœจ

๐Ÿงฑ Letโ€™s Get Hard (Links): Deduplicating My Linux Filesystem with Hadori

File deduplication isnโ€™t just for massive storage arrays or backup systemsโ€”it can be a practical tool for personal or server setups too. In this post, Iโ€™ll explain how I use hardlinking to reduce disk usage on my Linux system, which directories are safe (and unsafe) to link, why Iโ€™m OK with the trade-offs, and how I automated it with a simple monthly cron job using a neat tool called hadori.


๐Ÿ”— What Is Hardlinking?

In a traditional filesystem, every file has an inode, which is essentially its real identityโ€”the data on disk. A hard link is a different filename that points to the same inode. That means:

  • The file appears to exist in multiple places.
  • But there’s only one actual copy of the data.
  • Deleting one link doesnโ€™t delete the content, unless itโ€™s the last one.

Compare this to a symlink, which is just a pointer to a path. A hardlink is a pointer to the data.

So if you have 10 identical files scattered across the system, you can replace them with hardlinks, and boomโ€”nine of them stop taking up extra space.


๐Ÿค” Why Use Hardlinking?

My servers run a fairly standard Ubuntu install, and like most Linux machines, the root filesystem accumulates a lot of identical binaries and librariesโ€”especially across /bin, /lib, /usr, and /opt.

Thatโ€™s not a problemโ€ฆ until you’re tight on disk space, or youโ€™re just a curious nerd who enjoys squeezing every last byte.

In my case, I wanted to reduce disk usage safely, without weird side effects.

Hardlinking is a one-time cost with ongoing benefits. Itโ€™s not compression. Itโ€™s not archival. But itโ€™s efficient and non-invasive.


๐Ÿ“ Which Directories Are Safe to Hardlink?

Hardlinking only works within the same filesystem, and not all directories are good candidates.

โœ… Safe directories:

  • /bin, /sbin โ€“ system binaries
  • /lib, /lib64 โ€“ shared libraries
  • /usr, /usr/bin, /usr/lib, /usr/share, /usr/local โ€“ user-space binaries, docs, etc.
  • /opt โ€“ optional manually installed software

These contain mostly static files: compiled binaries, libraries, man pagesโ€ฆ not something that changes often.

โš ๏ธ Unsafe or risky directories:

  • /etc โ€“ configuration files, might change frequently
  • /var, /tmp โ€“ logs, spools, caches, session data
  • /home โ€“ user files, temporary edits, live data
  • /dev, /proc, /sys โ€“ virtual filesystems, do not touch

If a file is modified after being hardlinked, it breaks the deduplication (the OS creates a copy-on-write scenario), and youโ€™re back where you startedโ€”or worse, sharing data you didnโ€™t mean to.

Thatโ€™s why I avoid any folders with volatile, user-specific, or auto-generated files.


๐Ÿงจ Risks and Limitations

Hardlinking is not magic. It comes with sharp edges:

  • One inode, multiple names: All links are equal. Editing one changes the data for all.
  • Backups: Some backup tools donโ€™t preserve hardlinks or treat them inefficiently.
    โžค Duplicity, which I use, does not preserve hardlinks. It backs up each linked file as a full copy, so hardlinking wonโ€™t reduce backup size.
  • Security: Linking files with different permissions or owners can have unexpected results.
  • Limited scope: Only works within the same filesystem (e.g., canโ€™t link / and /mnt if theyโ€™re on separate partitions).

In my setup, I accept those risks because:

  • I’m only linking read-only system files.
  • I never link config or user data.
  • I donโ€™t rely on hardlink preservation in backups.
  • I test changes before deploying.

In short: I know what Iโ€™m linking, and why.


๐Ÿ” What the Critics Say About Hardlinking

Not everyone loves hardlinksโ€”and for good reasons. Two thoughtful critiques are:

The core arguments:

  • Hardlinks violate expectations about file ownership and identity.
  • They can break assumptions in software that tracks files by name or path.
  • They complicate file deletion logicโ€”deleting one name doesn’t delete the content.
  • They confuse file monitoring and logging tools, since itโ€™s hard to tell if a file is “new” or just another name.
  • They increase the risk of data corruption if accidentally modified in-place by a script that assumes it owns the file.

Why Iโ€™m still OK with it:

These concerns are validโ€”but mostly apply to:

  • Mutable files (e.g., logs, configs, user data)
  • Systems with untrusted users or dynamic scripts
  • Software that relies on inode isolation or path integrity

In contrast, my approach is intentionally narrow and safe:

  • I only deduplicate read-only system files in /bin, /sbin, /lib, /lib64, /usr, and /opt.
  • These are owned by root, and only changed during package updates.
  • I donโ€™t hardlink anything under /home, /etc, /var, or /tmp.
  • I know exactly when the cron job runs and what it targets.

So yes, hardlinks can be dangerousโ€”but only if you use them in the wrong places. In this case, I believe Iโ€™m using them correctly and conservatively.


โšก Does Hardlinking Impact System Performance?

Good news: hardlinks have virtually no impact on system performance in everyday use.

Hardlinks are a native feature of Linux filesystems like ext4 or xfs. The OS treats a hardlinked file just like a normal file:

  • Reading and writing hardlinked files is just as fast as normal files.
  • Permissions, ownership, and access behave identically.
  • Common tools (ls, cat, cp) donโ€™t care whether a file is hardlinked or not.
  • Filesystem caches and memory management work exactly the same.

The only difference is that multiple filenames point to the exact same data.

Things to keep in mind:

  • If you edit a hardlinked file, all links see that change because thereโ€™s really just one file.
  • Some tools (backup, disk usage) might treat hardlinked files differently.
  • Debugging or auditing files can be slightly trickier since multiple paths share one inode.

But from a performance standpoint? Your system wonโ€™t even notice the difference.


๐Ÿ› ๏ธ Tools for Hardlinking

There are a few tools out there:

  • fdupes โ€“ finds duplicates and optionally replaces with hardlinks
  • rdfind โ€“ more sophisticated detection
  • hardlink โ€“ simple but limited
  • jdupes โ€“ high-performance fork of fdupes

๐Ÿ“Œ About Hadori

From the Debian package description:

This might look like yet another hardlinking tool, but it is the only one which only memorizes one filename per inode. That results in less memory consumption and faster execution compared to its alternatives. Therefore (and because all the other names are already taken) it’s called “Hardlinking DOne RIght”.

Advantages over other tools:

  • Predictability: arguments are scanned in order, each first version is kept
  • Much lower CPU and memory consumption compared to alternatives

This makes hadori especially suited for system-wide deduplication where efficiency and reliability matter.


โฑ๏ธ How I Use Hadori

I run hadori once per month with a cron job. Hereโ€™s the actual command:

/usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt

This scans those directories, finds duplicate files, and replaces them with hardlinks when safe.

And hereโ€™s the crontab entry I installed in the file /etc/cron.d/hadori:

@monthly root /usr/bin/hadori --verbose /bin /sbin /lib /lib64 /usr /opt

๐Ÿ“‰ What Are the Results?

After the first run, I saw a noticeable reduction in used disk space, especially in /usr/lib and /usr/share. On my modest VPS, that translated to about 300โ€“500 MB savedโ€”not huge, but non-trivial for a small root partition.

While this doesnโ€™t reduce my backup size (Duplicity doesnโ€™t support hardlinks), it still helps with local disk usage and keeps things a little tidier.

And because the job only runs monthly, itโ€™s not intrusive or performance-heavy.


๐Ÿงผ Final Thoughts

Hardlinking isnโ€™t something most people need to think about. And frankly, most people probably shouldnโ€™t use it.

But if you:

  • Know what youโ€™re linking
  • Limit it to static, read-only system files
  • Automate it safely and sparingly

โ€ฆthen it can be a smart little optimization.

With a tool like hadori, itโ€™s safe, fast, and efficient. Iโ€™ve read the horror storiesโ€”and decided that in my case, they donโ€™t apply.


โœ‰๏ธ This post was brought to you by a monthly cron job and the letters i-n-o-d-e.

๐Ÿค” โ€œWasnโ€™t /dev/null Good Enough?โ€ โ€” Understanding the Difference Between /dev/null and /dev/zero

After my last blog post about the gloriously pointless /dev/scream, a few people asked:

โ€œWasnโ€™t /dev/null good enough?โ€

Fair questionโ€”but it misses a key point.

Let me explain: /dev/null and /dev/zero are not interchangeable. In fact, they are opposites in many ways. And to fully appreciate the joke behind /dev/scream, you need to understand where that scream is coming fromโ€”not where it ends up.


๐ŸŒŒ Black Holes and White Holes

To understand the difference, let us borrow a metaphor from cosmology.

  • /dev/null is like a black hole: it swallows everything. You can write data to it, but nothing ever comes out. Not even light. Not even your logs.
  • /dev/zero is like a white hole: it constantly emits data. In this case, an infinite stream of zero bytes (0x00). It produces, but does not accept.

So when I run:

dd if=/dev/zero of=/dev/null

I am pulling data out of the white hole, and sending it straight into the black hole. A perfectly balanced operation of cosmic futility.


๐Ÿ“ฆ What Are All These /dev/* Devices?

Let us break down the core players:

DeviceCan You Write To It?Can You Read From It?What You ReadCommonly Used ForNickname / Metaphor
/dev/nullYesYesInstantly empty (EOF)Discard console output of a programBlack hole ๐ŸŒ‘
/dev/zeroYesYesEndless zeroes (0x00)Wiping drives, filling files, or allocating memory with known contentsWhite hole ๐ŸŒ•
/dev/randomNoYesRandom bytes from entropy poolSecure wiping drives, generating random dataQuantum noise ๐ŸŽฒ
/dev/urandomNoYesPseudo-random bytes (faster, less secure)Generating random dataPseudo-random fountain ๐Ÿ”€
/dev/oneYesYesEndless 0xFF bytesWiping drives, filling files, or allocating memory with known contentsThe dark mirror of /dev/zero โ˜ ๏ธ
/dev/screamYesYesaHAAhhaHHAAHaAaAAAA…CatharsisEmotional white hole ๐Ÿ˜ฑ

Note: /dev/one is not a standard part of Linuxโ€”it comes from a community kernel module, much like /dev/scream.


๐Ÿ—ฃ๏ธ Back to the Screaming

/dev/scream is a parody of /dev/zeroโ€”not /dev/null.

The point of /dev/scream was not to discard data. That is what /dev/null is for.

The point was to generate data, like /dev/zero or /dev/random, but instead of silent zeroes or cryptographic entropy, it gives you something more cathartic: an endless, chaotic scream.

aHAAhhaHHAAHaAaAAAAhhHhhAAaAAAhAaaAAAaHHAHhAaaaaAaHahAaAHaAAHaaHhAHhHaHaAaHAAHaAhhaHaAaAA

So when I wrote:

dd if=/dev/scream of=/dev/null

I was screaming into the void. The scream came from the custom device, and /dev/null politely absorbed it without complaint. Not a single bit screamed back. Like pulling screams out of a white hole and throwing them into a black hole. The ultimate cosmic catharsis.


๐Ÿงช Try Them Yourself

Want to experience the universe of /dev for yourself? Try these commands (press Ctrl+C to stop each):

# Silent, empty. Nothing comes out.
cat /dev/null

# Zero bytes forever. Very chill.
hexdump -C /dev/zero

# Random bytes from real entropy (may block).
hexdump -C /dev/random

# Random bytes, fast but less secure.
hexdump -C /dev/urandom

# If you have the /dev/one module:
hexdump -C /dev/one

# If you installed /dev/scream:
cat /dev/scream

๐Ÿ’ก TL;DR

  • /dev/null = Black hole: absorbs, never emits.
  • /dev/zero = White hole: emits zeroes, absorbs nothing.
  • /dev/random / /dev/urandom = Entropy sources: useful for cryptography.
  • /dev/one = Evil twin of /dev/zero: gives endless 0xFF bytes.
  • /dev/scream = Chaotic white hole: emits pure emotional entropy.

So no, /dev/null was not โ€œgood enoughโ€โ€”it was not the right tool. The original post was not about where the data goes (of=/dev/null), but where it comes from (if=/dev/scream), just like /dev/zero. And when it comes from /dev/scream, you are tapping into something truly primal.

Because sometimes, in Linux as in life, you just need to scream into the void.

๐Ÿง Falling Down the /dev Rabbit Hole: From Secure Deletion to /dev/scream

It started innocently enough. I was reading a thread about secure file deletion on Linuxโ€”a topic that has popped up in discussions for decades. You know the kind: “Is shred still reliable? Should I overwrite with random data or zeroes? What about SSDs and wear leveling?”

As I followed the thread, I came across a mention of /dev/zero, the classic Unix device that outputs an endless stream of null bytes (0x00). It is often used in scripts and system maintenance tasks like wiping partitions or creating empty files.

That led me to wonder: if there is /dev/zero, is there a /dev/one?

Turns out, not in the standard kernelโ€”but someone did write a kernel module to simulate it. It outputs a continuous stream of 0xFF, which is essentially all bits set to one. It is a fun curiosity with some practical uses in testing or wiping data in a different pattern.

But then came the real gem of the rabbit hole: /dev/scream.

Yes, it is exactly what it sounds like.

What is /dev/scream?

/dev/scream is a Linux kernel module that creates a character device which, when read, outputs a stream of text that mimics a chaotic, high-pitched scream. Think:

aHAAhhaHHAAHaAaAAAAhhHhhAAaAAAhAaaAAAaHHAHhAaaaaAaHahAaAHaAAHaaHhAHhHaHaAaHAAHaAhhaHaAaAA

It is completely uselessโ€ฆ and completely delightful.

Originally written by @matlink, the module is a humorous take on the Unix philosophy: “Everything is a file”โ€”even your existential dread. It turns your terminal into a primal outlet. Just run:

cat /dev/scream

And enjoy the textual equivalent of a scream into the void.

Why?

Why not?

Sometimes the joy of Linux is not about solving problems, but about exploring the weird and wonderful corners of its ecosystem. From /dev/null swallowing your output silently, to /dev/urandom serving up chaos, to /dev/scream venting itโ€”all of these illustrate the creativity of the open source world.

Sure, shred and secure deletion are important. But so is remembering that your system is a playground.

Try it Yourself

If you want to give /dev/scream a go, here is how to install it:

โš ๏ธ Warning

This is a custom kernel module. It is not dangerous, but do not run it on production systems unless you know what you are doing.

Build and Load the Module

git clone https://github.com/matlink/dev_scream.git
cd dev_scream
make build
sudo make install
sudo make load
sudo insmod dev_scream.ko

Now read from the device:

cat /dev/scream

Or, if you are feeling truly poetic, try screaming into the void:

dd if=/dev/scream of=/dev/null

In space, nobody can hear you screamโ€ฆ but on Linux, /dev/scream is loud and clearโ€”even if you pipe it straight into oblivion.

When you are done screaming:

sudo rmmod dev_scream

Final Thoughts

I started with secure deletion, and I ended up installing a kernel module that screams. This is the beauty of curiosity-driven learning in Linux: you never quite know where you will end up. And sometimes, after a long day, maybe all you need is to cat /dev/scream.

Let me know if you tried itโ€”and whether your terminal feels a little lighter afterward.

Automating My Server Management with Ansible and GitHub Actions

Managing multiple servers can be a daunting task, especially when striving for consistency and efficiency. To tackle this challenge, I developed a robust automation system using Ansible, GitHub Actions, and Vagrant. This setup not only streamlines server configuration but also ensures that deployments are repeatable and maintainable.

A Bit of History: How It All Started

This project began out of necessity. I was maintaining a handful of Ubuntu servers โ€” one for email, another for a website, and a few for experiments โ€” and I quickly realized that logging into each one to make manual changes was both tedious and error-prone. My first step toward automation was a collection of shell scripts. They worked, but as the infrastructure grew, they became hard to manage and lacked the modularity I needed.

That is when I discovered Ansible. I created the ansible-servers repository in early 2024 as a way to centralize and standardize my infrastructure automation. Initially, it only contained a basic playbook for setting up users and updating packages. But over time, it evolved to include multiple roles, structured inventories, and eventually CI/CD integration through GitHub Actions.

Every addition was born out of a real-world need. When I got tired of testing changes manually, I added Vagrant to simulate my environments locally. When I wanted to be sure my configurations stayed consistent after every push, I integrated GitHub Actions to automate deployments. When I noticed the repo growing, I introduced linting and security checks to maintain quality.

The repository has grown steadily and organically, each commit reflecting a small lesson learned or a new challenge overcome.

The Foundation: Ansible Playbooks

At the core of my automation strategy are Ansible playbooks, which define the desired state of my servers. These playbooks handle tasks such as installing necessary packages, configuring services, and setting up user accounts. By codifying these configurations, I can apply them consistently across different environments.

To manage these playbooks, I maintain a structured repository that includes:

  • Inventory Files: Located in the inventory directory, these YAML files specify the hosts and groups for deployment targets.
  • Roles: Under the roles directory, I define reusable components that encapsulate specific functionalities, such as setting up a web server or configuring a database.
  • Configuration File: The ansible.cfg file sets important defaults, like enabling fact caching and specifying the inventory path, to optimize Ansible’s behavior.

Seamless Deployments with GitHub Actions

To automate the deployment process, I leverage GitHub Actions. This integration allows me to trigger Ansible playbooks automatically upon code changes, ensuring that my servers are always up-to-date with the latest configurations.

One of the key workflows is Deploy to Production, which executes the main playbook against the production inventory. This workflow is defined in the ansible-deploy.yml file and is triggered on specific events, such as pushes to the main branch.

Additionally, I have set up other workflows to maintain code quality and security:

  • Super-Linter: Automatically checks the codebase for syntax errors and adherence to best practices.
  • Codacy Security Scan: Analyzes the code for potential security vulnerabilities.
  • Dependabot Updates: Keeps dependencies up-to-date by automatically creating pull requests for new versions.

Local Testing with Vagrant

Before deploying changes to production, it is crucial to test them in a controlled environment. For this purpose, I use Vagrant to spin up virtual machines that mirror my production servers.

The deploy_to_staging.sh script automates this process by:

  1. Starting the Vagrant environment and provisioning it.
  2. Installing required Ansible roles specified in requirements.yml.
  3. Running the Ansible playbook against the staging inventory.

This approach allows me to validate changes in a safe environment before applying them to live servers.

Embracing Open Source and Continuous Improvement

Transparency and collaboration are vital in the open-source community. By hosting my automation setup on GitHub, I invite others to review, suggest improvements, and adapt the configurations for their own use cases.

The repository is licensed under the MIT License, encouraging reuse and modification. Moreover, I actively monitor issues and welcome contributions to enhance the system further.


In summary, by combining Ansible, GitHub Actions, and Vagrant, I have created a powerful and flexible automation framework for managing my servers. This setup not only reduces manual effort but also increases reliability and scalability. I encourage others to explore this approach and adapt it to their own infrastructure needs. What began as a few basic scripts has now evolved into a reliable automation pipeline I rely on every day.

If you are managing servers and find yourself repeating the same configuration steps, I invite you to check out the ansible-servers repository on GitHub. Clone it, explore the structure, try it in your own environment โ€” and if you have ideas or improvements, feel free to open a pull request or start a discussion. Automation has made a huge difference for me, and I hope it can do the same for you.


Benchmarking USB Drives with Shell Scripts – Part 2: Evolving the Script with ChatGPT

Introduction

In my previous post, I shared the story of why I needed a new USB stick and how I used ChatGPT to write a benchmark script that could measure read performance across various methods. In this follow-up, I will dive into the technical details of how the script evolvedโ€”from a basic prototype into a robust and feature-rich toolโ€”thanks to incremental refinements and some AI-assisted development.


Starting Simple: The First Version

The initial idea was simple: read a file using dd and measure the speed.

dd if=/media/amedee/Ventoy/ISO/ubuntu-24.10-desktop-amd64.iso \
   of=/dev/null bs=8k

That worked, but I quickly ran into limitations:

  • No progress indicator
  • Hardcoded file paths
  • No USB auto-detection
  • No cache flushing, leading to inflated results when repeating the measurement

With ChatGPTโ€™s help, I started addressing each of these issues one by one.


Tools check

On a default Ubuntu installation, some tools are available by default, while others (especially benchmarking tools) usually need to be installed separately.

Tools used in the script:

ToolInstalled by default?Needs require?
hdparmโŒ Not installedโœ… Yes
ddโœ… YesโŒ No
pvโŒ Not installedโœ… Yes
catโœ… YesโŒ No
iopingโŒ Not installedโœ… Yes
fioโŒ Not installedโœ… Yes
lsblkโœ… Yes (in util-linux)โŒ No
awkโœ… Yes (in gawk)โŒ No
grepโœ… YesโŒ No
basenameโœ… Yes (in coreutils)โŒ No
findโœ… YesโŒ No
sortโœ… YesโŒ No
statโœ… YesโŒ No

This function ensures the system has all tools needed for benchmarking. It exits early if any tool is missing.

This was the initial version:

check_required_tools() {
  local required_tools=(dd pv hdparm fio ioping awk grep sed tr bc stat lsblk find sort)
  for tool in "${required_tools[@]}"; do
    if ! command -v "$tool" &>/dev/null; then
      echo "โŒ Required tool '$tool' is not installed."
      exit 1
    fi
  done
}

That’s already nice, but maybe I just want to run the script anyway if some of the tools are missing.

This is a more advanced version:

ALL_TOOLS=(hdparm dd pv ioping fio lsblk stat grep awk find sort basename column gnuplot)
MISSING_TOOLS=()

require() {
  if ! command -v "$1" >/dev/null; then
    return 1
  fi
  return 0
}

check_required_tools() {
  echo "๐Ÿ” Checking required tools..."
  for tool in "${ALL_TOOLS[@]}"; do
    if ! require "$tool"; then
      MISSING_TOOLS+=("$tool")
    fi
  done

  if [[ ${#MISSING_TOOLS[@]} -gt 0 ]]; then
    echo "โš ๏ธ  The following tools are missing: ${MISSING_TOOLS[*]}"
    echo "You can install them using: sudo apt install ${MISSING_TOOLS[*]}"
    if [[ -z "$FORCE_YES" ]]; then
      read -rp "Do you want to continue and skip tests that require them? (y/N): " yn
      case $yn in
        [Yy]*)
          echo "Continuing with limited tests..."
          ;;
        *)
          echo "Aborting. Please install the required tools."
          exit 1
          ;;
      esac
    else
      echo "Continuing with limited tests (auto-confirmed)..."
    fi
  else
    echo "โœ… All required tools are available."
  fi
}

Device Auto-Detection

One early challenge was identifying which device was the USB stick. I wanted the script to automatically detect a mounted USB device. My first version was clunky and error-prone.

detect_usb() {
  USB_DEVICE=$(lsblk -o NAME,TRAN,MOUNTPOINT -J | jq -r '.blockdevices[] | select(.tran=="usb") | .name' | head -n1)
  if [[ -z "$USB_DEVICE" ]]; then
    echo "โŒ No USB device detected."
    exit 1
  fi
  USB_PATH="/dev/$USB_DEVICE"
  MOUNT_PATH=$(lsblk -no MOUNTPOINT "$USB_PATH" | head -n1)
  if [[ -z "$MOUNT_PATH" ]]; then
    echo "โŒ USB device is not mounted."
    exit 1
  fi
  echo "โœ… Using USB device: $USB_PATH"
  echo "โœ… Mounted at: $MOUNT_PATH"
}

After a few iterations, we (ChatGPT and I) settled on parsing lsblk with filters on tran=usb and hotplug=1, and selecting the first mounted partition.

We also added a fallback prompt in case auto-detection failed.

detect_usb() {
  if [[ -n "$USB_DEVICE" ]]; then
    echo "๐Ÿ“Ž Using provided USB device: $USB_DEVICE"
    MOUNT_PATH=$(lsblk -no MOUNTPOINT "$USB_DEVICE")
    return
  fi

  echo "๐Ÿ” Detecting USB device..."
  USB_DEVICE=""
  while read -r dev tran hotplug type _; do
    if [[ "$tran" == "usb" && "$hotplug" == "1" && "$type" == "disk" ]]; then
      base="/dev/$dev"
      part=$(lsblk -nr -o NAME,MOUNTPOINT "$base" | awk '$2 != "" {print "/dev/"$1; exit}')
      if [[ -n "$part" ]]; then
        USB_DEVICE="$part"
        break
      fi
    fi
  done < <(lsblk -o NAME,TRAN,HOTPLUG,TYPE,MOUNTPOINT -nr)

  if [ -z "$USB_DEVICE" ]; then
    echo "โŒ No mounted USB partition found on any USB disk."
    lsblk -o NAME,TRAN,HOTPLUG,TYPE,SIZE,MOUNTPOINT -nr | grep part
    read -rp "Enter the USB device path manually (e.g., /dev/sdc1): " USB_DEVICE
  fi

  MOUNT_PATH=$(lsblk -no MOUNTPOINT "$USB_DEVICE")
  if [ -z "$MOUNT_PATH" ]; then
    echo "โŒ USB device is not mounted."
    exit 1
  fi

  echo "โœ… Using USB device: $USB_DEVICE"
  echo "โœ… Mounted at: $MOUNT_PATH"
}

Finding the Test File

To avoid hardcoding filenames, we implemented logic to search for the latest Ubuntu ISO on the USB stick.

find_ubuntu_iso() {
  # Function to find an Ubuntu ISO on the USB device
  find "$MOUNT_PATH" -type f -regextype posix-extended \
    -regex ".*/ubuntu-[0-9]{2}\.[0-9]{2}-desktop-amd64\\.iso" | sort -V | tail -n1
}

Later, we enhanced it to accept a user-provided file, and even verify that the file was located on the USB stick. If it was not, the script would gracefully fall back to the Ubuntu ISO search.

find_test_file() {
  if [[ -n "$TEST_FILE" ]]; then
    echo "๐Ÿ“Ž Using provided test file: $(basename "$TEST_FILE")"
    
    # Check if the provided test file is on the USB device
    TEST_FILE_MOUNT_PATH=$(realpath "$TEST_FILE" | grep -oP "^$MOUNT_PATH")
    if [[ -z "$TEST_FILE_MOUNT_PATH" ]]; then
      echo "โŒ The provided test file is not located on the USB device."
      # Look for an Ubuntu ISO if it's not on the USB
      TEST_FILE=$(find_ubuntu_iso)
    fi
  else
    TEST_FILE=$(find_ubuntu_iso)
  fi

  if [ -z "$TEST_FILE" ]; then
    echo "โŒ No valid test file found."
    exit 1
  fi

  if [[ "$TEST_FILE" =~ ubuntu-[0-9]{2}\.[0-9]{2}-desktop-amd64\.iso ]]; then
    UBUNTU_VERSION=$(basename "$TEST_FILE" | grep -oP 'ubuntu-\d{2}\.\d{2}')
    echo "๐Ÿงช Selected Ubuntu version: $UBUNTU_VERSION"
  else
    echo "๐Ÿ“Ž Selected test file: $(basename "$TEST_FILE")"
  fi
}

Read Methods and Speed Extraction

To get a comprehensive view, we added multiple methods:

  • hdparm (direct disk access)
  • dd (simple block read)
  • dd + pv (with progress bar)
  • cat + pv (alternative stream reader)
  • ioping (random access)
  • fio (customizable benchmark tool)
    if require hdparm; then
      drop_caches
      speed=$(sudo hdparm -t --direct "$USB_DEVICE" 2>/dev/null | extract_speed)
      mb=$(speed_to_mb "$speed")
      echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
      TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
      echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    fi
    ((idx++))

    drop_caches
    speed=$(dd if="$TEST_FILE" of=/dev/null bs=8k 2>&1 |& extract_speed)
    mb=$(speed_to_mb "$speed")
    echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
    TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
    echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    ((idx++))

    if require pv; then
      drop_caches
      FILESIZE=$(stat -c%s "$TEST_FILE")
      speed=$(dd if="$TEST_FILE" bs=8k status=none | pv -s "$FILESIZE" -f -X 2>&1 | extract_speed)
      mb=$(speed_to_mb "$speed")
      echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
      TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
      echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    fi
    ((idx++))

    if require pv; then
      drop_caches
      speed=$(cat "$TEST_FILE" | pv -f -X 2>&1 | extract_speed)
      mb=$(speed_to_mb "$speed")
      echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
      TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
      echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    fi
    ((idx++))

    if require ioping; then
      drop_caches
      speed=$(ioping -c 10 -A "$USB_DEVICE" 2>/dev/null | grep 'read' | extract_speed)
      mb=$(speed_to_mb "$speed")
      echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
      TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
      echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    fi
    ((idx++))

    if require fio; then
      drop_caches
      speed=$(fio --name=readtest --filename="$TEST_FILE" --direct=1 --rw=read --bs=8k \
            --size=100M --ioengine=libaio --iodepth=16 --runtime=5s --time_based --readonly \
            --minimal 2>/dev/null | awk -F';' '{print $6" KB/s"}' | extract_speed)
      mb=$(speed_to_mb "$speed")
      echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
      TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
      echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    fi

Parsing their outputs proved tricky. For example, pv outputs speed with or without spaces, and with different units. We created a robust extract_speed function with regex, and a speed_to_mb function that could handle both MB/s and MiB/s, with or without a space between value and unit.

extract_speed() {
  grep -oP '(?i)[\d.,]+\s*[KMG]i?B/s' | tail -1 | sed 's/,/./'
}

speed_to_mb() {
  if [[ "$1" =~ ([0-9.,]+)[[:space:]]*([a-zA-Z/]+) ]]; then
    value="${BASH_REMATCH[1]}"
    unit=$(echo "${BASH_REMATCH[2]}" | tr '[:upper:]' '[:lower:]')
  else
    echo "0"
    return
  fi

  case "$unit" in
    kb/s)   awk -v v="$value" 'BEGIN { printf "%.2f", v / 1000 }' ;;
    mb/s)   awk -v v="$value" 'BEGIN { printf "%.2f", v }' ;;
    gb/s)   awk -v v="$value" 'BEGIN { printf "%.2f", v * 1000 }' ;;
    kib/s)  awk -v v="$value" 'BEGIN { printf "%.2f", v / 1024 }' ;;
    mib/s)  awk -v v="$value" 'BEGIN { printf "%.2f", v }' ;;
    gib/s)  awk -v v="$value" 'BEGIN { printf "%.2f", v * 1024 }' ;;
    *) echo "0" ;;
  esac
}

Dropping Caches for Accurate Results

To prevent cached reads from skewing the results, each test run begins by dropping system caches using:

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

What it does:

CommandPurpose
syncFlushes all dirty (pending write) pages to disk
echo 3 > /proc/sys/vm/drop_cachesClears page cache, dentries, and inodes from RAM

We wrapped this in a helper function and used it consistently.


Multiple Runs and Averaging

We made the script repeat each test N times (default: 3), collect results, compute averages, and display a summary at the end.

  echo "๐Ÿ“Š Read-only USB benchmark started ($RUNS run(s))"
  echo "==================================="

  declare -A TEST_NAMES=(
    [1]="hdparm"
    [2]="dd"
    [3]="dd + pv"
    [4]="cat + pv"
    [5]="ioping"
    [6]="fio"
  )

  declare -A TOTAL_MB
  for i in {1..6}; do TOTAL_MB[$i]=0; done
  CSVFILE="usb-benchmark-$(date +%Y%m%d-%H%M%S).csv"
  echo "Test,Run,Speed (MB/s)" > "$CSVFILE"

  for ((run=1; run<=RUNS; run++)); do
    echo "โ–ถ Run $run"
    idx=1

  ### tests run here

  echo "๐Ÿ“„ Summary of average results for $UBUNTU_VERSION:"
  echo "==================================="
  SUMMARY_TABLE=""
  for i in {1..6}; do
    if [[ ${TOTAL_MB[$i]} != 0 ]]; then
      avg=$(echo "scale=2; ${TOTAL_MB[$i]} / $RUNS" | bc)
      echo "${TEST_NAMES[$i]} average: $avg MB/s"
      RESULTS+=("${TEST_NAMES[$i]} average: $avg MB/s")
      SUMMARY_TABLE+="${TEST_NAMES[$i]},$avg\n"
    fi
  done

Output Formats

To make the results user-friendly, we added:

  • A clean table view
  • CSV export for spreadsheets
  • Log file for later reference
  if [[ "$VISUAL" == "table" || "$VISUAL" == "both" ]]; then
    echo -e "๐Ÿ“‹ Table view:"
    echo -e "Test Method,Average MB/s\n$SUMMARY_TABLE" | column -t -s ','
  fi

  if [[ "$VISUAL" == "bar" || "$VISUAL" == "both" ]]; then
    if require gnuplot; then
      echo -e "$SUMMARY_TABLE" | awk -F',' '{print $1" "$2}' | \
      gnuplot -p -e "
        set terminal dumb;
        set title 'USB Read Benchmark Results ($UBUNTU_VERSION)';
        set xlabel 'Test Method';
        set ylabel 'MB/s';
        plot '-' using 2:xtic(1) with boxes notitle
      "
    fi
  fi

  LOGFILE="usb-benchmark-$(date +%Y%m%d-%H%M%S).log"
  {
    echo "Benchmark for USB device: $USB_DEVICE"
    echo "Mounted at: $MOUNT_PATH"
    echo "Ubuntu version: $UBUNTU_VERSION"
    echo "Test file: $TEST_FILE"
    echo "Timestamp: $(date)"
    echo "Number of runs: $RUNS"
    echo ""
    echo "Read speed averages:"
    for line in "${RESULTS[@]}"; do
      echo "$line"
    done
  } > "$LOGFILE"

  echo "๐Ÿ“ Results saved to: $LOGFILE"
  echo "๐Ÿ“ˆ CSV exported to: $CSVFILE"
  echo "==================================="

The Full Script

Here is the complete version of the script used to benchmark the read performance of a USB drive:

#!/bin/bash

# ==========================
# CONFIGURATION
# ==========================
RESULTS=()
USB_DEVICE=""
TEST_FILE=""
RUNS=1
VISUAL="none"
SUMMARY=0

# (Consider grouping related configuration into a config file or associative array if script expands)

# ==========================
# ARGUMENT PARSING
# ==========================
while [[ $# -gt 0 ]]; do
  case $1 in
    --device)
      USB_DEVICE="$2"
      shift 2
      ;;
    --file)
      TEST_FILE="$2"
      shift 2
      ;;
    --runs)
      RUNS="$2"
      shift 2
      ;;
    --visual)
      VISUAL="$2"
      shift 2
      ;;
    --summary)
      SUMMARY=1
      shift
      ;;
    --yes|--force)
      FORCE_YES=1
      shift
      ;;
    *)
      echo "Unknown option: $1"
      exit 1
      ;;
  esac
done

# ==========================
# TOOL CHECK
# ==========================
ALL_TOOLS=(hdparm dd pv ioping fio lsblk stat grep awk find sort basename column gnuplot)
MISSING_TOOLS=()

require() {
  if ! command -v "$1" >/dev/null; then
    return 1
  fi
  return 0
}

check_required_tools() {
  echo "๐Ÿ” Checking required tools..."
  for tool in "${ALL_TOOLS[@]}"; do
    if ! require "$tool"; then
      MISSING_TOOLS+=("$tool")
    fi
  done

  if [[ ${#MISSING_TOOLS[@]} -gt 0 ]]; then
    echo "โš ๏ธ  The following tools are missing: ${MISSING_TOOLS[*]}"
    echo "You can install them using: sudo apt install ${MISSING_TOOLS[*]}"
    if [[ -z "$FORCE_YES" ]]; then
      read -rp "Do you want to continue and skip tests that require them? (y/N): " yn
      case $yn in
        [Yy]*)
          echo "Continuing with limited tests..."
          ;;
        *)
          echo "Aborting. Please install the required tools."
          exit 1
          ;;
      esac
    else
      echo "Continuing with limited tests (auto-confirmed)..."
    fi
  else
    echo "โœ… All required tools are available."
  fi
}

# ==========================
# AUTO-DETECT USB DEVICE
# ==========================
detect_usb() {
  if [[ -n "$USB_DEVICE" ]]; then
    echo "๐Ÿ“Ž Using provided USB device: $USB_DEVICE"
    MOUNT_PATH=$(lsblk -no MOUNTPOINT "$USB_DEVICE")
    return
  fi

  echo "๐Ÿ” Detecting USB device..."
  USB_DEVICE=""
  while read -r dev tran hotplug type _; do
    if [[ "$tran" == "usb" && "$hotplug" == "1" && "$type" == "disk" ]]; then
      base="/dev/$dev"
      part=$(lsblk -nr -o NAME,MOUNTPOINT "$base" | awk '$2 != "" {print "/dev/"$1; exit}')
      if [[ -n "$part" ]]; then
        USB_DEVICE="$part"
        break
      fi
    fi
  done < <(lsblk -o NAME,TRAN,HOTPLUG,TYPE,MOUNTPOINT -nr)

  if [ -z "$USB_DEVICE" ]; then
    echo "โŒ No mounted USB partition found on any USB disk."
    lsblk -o NAME,TRAN,HOTPLUG,TYPE,SIZE,MOUNTPOINT -nr | grep part
    read -rp "Enter the USB device path manually (e.g., /dev/sdc1): " USB_DEVICE
  fi

  MOUNT_PATH=$(lsblk -no MOUNTPOINT "$USB_DEVICE")
  if [ -z "$MOUNT_PATH" ]; then
    echo "โŒ USB device is not mounted."
    exit 1
  fi

  echo "โœ… Using USB device: $USB_DEVICE"
  echo "โœ… Mounted at: $MOUNT_PATH"
}

# ==========================
# FIND TEST FILE
# ==========================
find_ubuntu_iso() {
  # Function to find an Ubuntu ISO on the USB device
  find "$MOUNT_PATH" -type f -regextype posix-extended \
    -regex ".*/ubuntu-[0-9]{2}\.[0-9]{2}-desktop-amd64\\.iso" | sort -V | tail -n1
}

find_test_file() {
  if [[ -n "$TEST_FILE" ]]; then
    echo "๐Ÿ“Ž Using provided test file: $(basename "$TEST_FILE")"
    
    # Check if the provided test file is on the USB device
    TEST_FILE_MOUNT_PATH=$(realpath "$TEST_FILE" | grep -oP "^$MOUNT_PATH")
    if [[ -z "$TEST_FILE_MOUNT_PATH" ]]; then
      echo "โŒ The provided test file is not located on the USB device."
      # Look for an Ubuntu ISO if it's not on the USB
      TEST_FILE=$(find_ubuntu_iso)
    fi
  else
    TEST_FILE=$(find_ubuntu_iso)
  fi

  if [ -z "$TEST_FILE" ]; then
    echo "โŒ No valid test file found."
    exit 1
  fi

  if [[ "$TEST_FILE" =~ ubuntu-[0-9]{2}\.[0-9]{2}-desktop-amd64\.iso ]]; then
    UBUNTU_VERSION=$(basename "$TEST_FILE" | grep -oP 'ubuntu-\d{2}\.\d{2}')
    echo "๐Ÿงช Selected Ubuntu version: $UBUNTU_VERSION"
  else
    echo "๐Ÿ“Ž Selected test file: $(basename "$TEST_FILE")"
  fi
}



# ==========================
# SPEED EXTRACTION
# ==========================
extract_speed() {
  grep -oP '(?i)[\d.,]+\s*[KMG]i?B/s' | tail -1 | sed 's/,/./'
}

speed_to_mb() {
  if [[ "$1" =~ ([0-9.,]+)[[:space:]]*([a-zA-Z/]+) ]]; then
    value="${BASH_REMATCH[1]}"
    unit=$(echo "${BASH_REMATCH[2]}" | tr '[:upper:]' '[:lower:]')
  else
    echo "0"
    return
  fi

  case "$unit" in
    kb/s)   awk -v v="$value" 'BEGIN { printf "%.2f", v / 1000 }' ;;
    mb/s)   awk -v v="$value" 'BEGIN { printf "%.2f", v }' ;;
    gb/s)   awk -v v="$value" 'BEGIN { printf "%.2f", v * 1000 }' ;;
    kib/s)  awk -v v="$value" 'BEGIN { printf "%.2f", v / 1024 }' ;;
    mib/s)  awk -v v="$value" 'BEGIN { printf "%.2f", v }' ;;
    gib/s)  awk -v v="$value" 'BEGIN { printf "%.2f", v * 1024 }' ;;
    *) echo "0" ;;
  esac
}

drop_caches() {
  echo "๐Ÿงน Dropping system caches..."
  if [[ $EUID -ne 0 ]]; then
    echo "  (requires sudo)"
  fi
  sudo sh -c "sync && echo 3 > /proc/sys/vm/drop_caches"
}

# ==========================
# RUN BENCHMARKS
# ==========================
run_benchmarks() {
  echo "๐Ÿ“Š Read-only USB benchmark started ($RUNS run(s))"
  echo "==================================="

  declare -A TEST_NAMES=(
    [1]="hdparm"
    [2]="dd"
    [3]="dd + pv"
    [4]="cat + pv"
    [5]="ioping"
    [6]="fio"
  )

  declare -A TOTAL_MB
  for i in {1..6}; do TOTAL_MB[$i]=0; done
  CSVFILE="usb-benchmark-$(date +%Y%m%d-%H%M%S).csv"
  echo "Test,Run,Speed (MB/s)" > "$CSVFILE"

  for ((run=1; run<=RUNS; run++)); do
    echo "โ–ถ Run $run"
    idx=1

    if require hdparm; then
      drop_caches
      speed=$(sudo hdparm -t --direct "$USB_DEVICE" 2>/dev/null | extract_speed)
      mb=$(speed_to_mb "$speed")
      echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
      TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
      echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    fi
    ((idx++))

    drop_caches
    speed=$(dd if="$TEST_FILE" of=/dev/null bs=8k 2>&1 |& extract_speed)
    mb=$(speed_to_mb "$speed")
    echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
    TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
    echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    ((idx++))

    if require pv; then
      drop_caches
      FILESIZE=$(stat -c%s "$TEST_FILE")
      speed=$(dd if="$TEST_FILE" bs=8k status=none | pv -s "$FILESIZE" -f -X 2>&1 | extract_speed)
      mb=$(speed_to_mb "$speed")
      echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
      TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
      echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    fi
    ((idx++))

    if require pv; then
      drop_caches
      speed=$(cat "$TEST_FILE" | pv -f -X 2>&1 | extract_speed)
      mb=$(speed_to_mb "$speed")
      echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
      TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
      echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    fi
    ((idx++))

    if require ioping; then
      drop_caches
      speed=$(ioping -c 10 -A "$USB_DEVICE" 2>/dev/null | grep 'read' | extract_speed)
      mb=$(speed_to_mb "$speed")
      echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
      TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
      echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    fi
    ((idx++))

    if require fio; then
      drop_caches
      speed=$(fio --name=readtest --filename="$TEST_FILE" --direct=1 --rw=read --bs=8k \
            --size=100M --ioengine=libaio --iodepth=16 --runtime=5s --time_based --readonly \
            --minimal 2>/dev/null | awk -F';' '{print $6" KB/s"}' | extract_speed)
      mb=$(speed_to_mb "$speed")
      echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
      TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
      echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
    fi
  done

  echo "๐Ÿ“„ Summary of average results for $UBUNTU_VERSION:"
  echo "==================================="
  SUMMARY_TABLE=""
  for i in {1..6}; do
    if [[ ${TOTAL_MB[$i]} != 0 ]]; then
      avg=$(echo "scale=2; ${TOTAL_MB[$i]} / $RUNS" | bc)
      echo "${TEST_NAMES[$i]} average: $avg MB/s"
      RESULTS+=("${TEST_NAMES[$i]} average: $avg MB/s")
      SUMMARY_TABLE+="${TEST_NAMES[$i]},$avg\n"
    fi
  done

  if [[ "$VISUAL" == "table" || "$VISUAL" == "both" ]]; then
    echo -e "๐Ÿ“‹ Table view:"
    echo -e "Test Method,Average MB/s\n$SUMMARY_TABLE" | column -t -s ','
  fi

  if [[ "$VISUAL" == "bar" || "$VISUAL" == "both" ]]; then
    if require gnuplot; then
      echo -e "$SUMMARY_TABLE" | awk -F',' '{print $1" "$2}' | \
      gnuplot -p -e "
        set terminal dumb;
        set title 'USB Read Benchmark Results ($UBUNTU_VERSION)';
        set xlabel 'Test Method';
        set ylabel 'MB/s';
        plot '-' using 2:xtic(1) with boxes notitle
      "
    fi
  fi

  LOGFILE="usb-benchmark-$(date +%Y%m%d-%H%M%S).log"
  {
    echo "Benchmark for USB device: $USB_DEVICE"
    echo "Mounted at: $MOUNT_PATH"
    echo "Ubuntu version: $UBUNTU_VERSION"
    echo "Test file: $TEST_FILE"
    echo "Timestamp: $(date)"
    echo "Number of runs: $RUNS"
    echo ""
    echo "Read speed averages:"
    for line in "${RESULTS[@]}"; do
      echo "$line"
    done
  } > "$LOGFILE"

  echo "๐Ÿ“ Results saved to: $LOGFILE"
  echo "๐Ÿ“ˆ CSV exported to: $CSVFILE"
  echo "==================================="
}

# ==========================
# MAIN
# ==========================
check_required_tools
detect_usb
find_test_file
run_benchmarks

You van also find the latest revision of this script as a GitHub Gist.


Lessons Learned

This script has grown from a simple one-liner into a reliable tool to test USB read performance. Working with ChatGPT sped up development significantly, especially for bash edge cases and regex. But more importantly, it helped guide the evolution of the script in a structured way, with clean modular functions and consistent formatting.


Conclusion

This has been a fun and educational project. Whether you are benchmarking your own USB drives or just want to learn more about shell scripting, I hope this walkthrough is helpful.

Next up? Maybe a graphical version, or write benchmarking on a RAM disk to avoid damaging flash storage.

Stay tunedโ€”and let me know if you use this script or improve it!

Benchmarking USB Drives with Shell Scripts – Part 1: Why I Built a Benchmark Script

Introduction

When I upgraded from an old 8GB USB stick to a shiny new 256GB one, I expected faster speeds and more convenienceโ€”especially for carrying around multiple bootable ISO files using Ventoy. With modern Linux distributions often exceeding 4GB per ISO, my old drive could barely hold a single image. But I quickly realized that storage space was only half the storyโ€”performance matters too.

Curious about how much of an upgrade I had actually made, I decided to benchmark the read speed of both USB sticks. Instead of hunting down benchmarking tools or manually comparing outputs, I turned to ChatGPT to help me craft a reliable, repeatable shell script that could automate the entire process. In this post, Iโ€™ll share how ChatGPT helped me go from an idea to a functional USB benchmark script, and what I learned along the way.


The Goal

I wanted to answer a few simple but important questions:

  • How much faster is my new USB stick compared to the old one?
  • Do different USB ports affect read speeds?
  • How can I automate these tests and compare the results?

But I also wanted a reusable script that would:

  • Detect the USB device automatically
  • Find or use a test file on the USB stick
  • Run several types of read benchmarks
  • Present the results clearly, with support for summary and CSV export

Getting Help from ChatGPT

I asked ChatGPT to help me write a shell script with these requirements. It guided me through:

  • Choosing benchmarking tools: hdparm, dd, pv, ioping, fio
  • Auto-detecting the mounted USB device
  • Handling different cases for user-provided test files or Ubuntu ISOs
  • Parsing and converting human-readable speed outputs
  • Displaying results in human-friendly tables and optional CSV export

We iterated over the script, addressing edge cases like:

  • USB devices not mounted
  • Multiple USB partitions
  • pv not showing output unless stderr was correctly handled
  • Formatting output consistently across tools

ChatGPT even helped optimize the code for readability, reduce duplication, and handle both space-separated and non-space-separated speed values like โ€œ18.6 MB/sโ€ and โ€œ18.6MB/sโ€.


Benchmark Results

With the script ready, I ran tests on three configurations:

1. Old 8GB USB Stick

hdparm       16.40 MB/s
dd 18.66 MB/s
dd + pv 17.80 MB/s
cat + pv 18.10 MB/s
ioping 4.44 MB/s
fio 93.99 MB/s

2. New 256GB USB Stick (Fast USB Port)

hdparm      372.01 MB/s
dd 327.33 MB/s
dd + pv 310.00 MB/s
cat + pv 347.00 MB/s
ioping 8.58 MB/s
fio 992.78 MB/s

3. New 256GB USB Stick (Slow USB Port)

hdparm       37.60 MB/s
dd 39.86 MB/s
dd + pv 38.13 MB/s
cat + pv 40.30 MB/s
ioping 6.88 MB/s
fio 73.52 MB/s

Observations

  • The old USB stick is not only limited in capacity but also very slow. It barely breaks 20 MB/s in most tests.
  • The new USB stick, when plugged into a fast USB 3.0 port, is significantly fasterโ€”over 10x the speed in most benchmarks.
  • Plugging the same new stick into a slower port dramatically reduces its performanceโ€”a good reminder to check where you plug it in.
  • Tools like hdparm, dd, and cat + pv give relatively consistent results. However, ioping and fio behave differently due to the way they access dataโ€”random access or block size differences can impact results.

Also worth noting: the metal casing of the new USB stick gets warm after a few test runs, unlike the old plastic one.


Conclusion

Using ChatGPT to develop this benchmark script was like pair-programming with an always-available assistant. It accelerated development, helped troubleshoot weird edge cases, and made the script more polished than if I had done it alone.

If you want to test your own USB drivesโ€”or ensure you’re using the best port for speedโ€”this benchmark script is a great tool to have in your kit. And if you’re looking to learn shell scripting, pairing with ChatGPT is an excellent way to level up.


Want the script?
Iโ€™ll share the full version of the script and instructions on how to use it in a follow-up post. Stay tuned!