Ever wondered what it’s like to unleash 10 000 tiny little data beasts on your hard drive? No? Well, buckle up anyway — because today, we’re diving into the curious world of random file generation, and then nerding out by calculating their size distribution. Spoiler alert: it’s less fun than it sounds. 😏
Step 1: Let’s Make Some Files… Lots of Them
Our goal? Generate 10 000 files filled with random data. But not just any random sizes — we want a mean file size of roughly 68 KB and a median of about 2 KB. Sounds like a math puzzle? That’s because it kind of is.
If you just pick file sizes uniformly at random, you’ll end up with a median close to the mean — which is boring. We want a skewed distribution, where most files are small, but some are big enough to bring that average up.
The Magic Trick: Log-normal Distribution 🎩✨
Enter the log-normal distribution, a nifty way to generate lots of small numbers and a few big ones — just like real life. Using Python’s NumPy library, we generate these sizes and feed them to good old /dev/urandom to fill our files with pure randomness.
Here’s the Bash script that does the heavy lifting:
#!/bin/bash
# Directory to store the random files
output_dir="random_files"
mkdir -p "$output_dir"
# Total number of files to create
file_count=10000
# Log-normal distribution parameters
mean_log=9.0 # Adjusted for ~68KB mean
stddev_log=1.5 # Adjusted for ~2KB median
# Function to generate random numbers based on log-normal distribution
generate_random_size() {
python3 -c "import numpy as np; print(int(np.random.lognormal($mean_log, $stddev_log)))"
}
# Create files with random data
for i in $(seq 1 $file_count); do
file_size=$(generate_random_size)
file_path="$output_dir/file_$i.bin"
head -c "$file_size" /dev/urandom > "$file_path"
echo "Generated file $i with size $file_size bytes."
done
echo "Done. Files saved in $output_dir."
Easy enough, right? This creates a directory random_files and fills it with 10 000 files of sizes mostly small but occasionally wildly bigger. Don’t blame me if your disk space takes a little hit! 💥
Step 2: Crunching Numbers — The File Size Distribution 📊
Okay, you’ve got the files. Now, what can we learn from their sizes? Let’s find out the:
Mean size: The average size across all files.
Median size: The middle value when sizes are sorted — because averages can lie.
Distribution breakdown: How many tiny files vs. giant files.
Here’s a handy Bash script that reads file sizes and spits out these stats with a bit of flair:
#!/bin/bash
# Input directory (default to "random_files" if not provided)
directory="${1:-random_files}"
# Check if directory exists
if [ ! -d "$directory" ]; then
echo "Directory $directory does not exist."
exit 1
fi
# Array to store file sizes
file_sizes=($(find "$directory" -type f -exec stat -c%s {} \;))
# Check if there are files in the directory
if [ ${#file_sizes[@]} -eq 0 ]; then
echo "No files found in the directory $directory."
exit 1
fi
# Calculate mean
total_size=0
for size in "${file_sizes[@]}"; do
total_size=$((total_size + size))
done
mean=$((total_size / ${#file_sizes[@]}))
# Calculate median
sorted_sizes=($(printf '%s\n' "${file_sizes[@]}" | sort -n))
mid=$(( ${#sorted_sizes[@]} / 2 ))
if (( ${#sorted_sizes[@]} % 2 == 0 )); then
median=$(( (sorted_sizes[mid-1] + sorted_sizes[mid]) / 2 ))
else
median=${sorted_sizes[mid]}
fi
# Display file size distribution
echo "File size distribution in directory $directory:"
echo "---------------------------------------------"
echo "Number of files: ${#file_sizes[@]}"
echo "Mean size: $mean bytes"
echo "Median size: $median bytes"
# Display detailed size distribution (optional)
echo
echo "Detailed distribution (size ranges):"
awk '{
if ($1 < 1024) bins["< 1 KB"]++;
else if ($1 < 10240) bins["1 KB - 10 KB"]++;
else if ($1 < 102400) bins["10 KB - 100 KB"]++;
else bins[">= 100 KB"]++;
} END {
for (range in bins) printf "%-15s: %d\n", range, bins[range];
}' <(printf '%s\n' "${file_sizes[@]}")
Run it, and voilà — instant nerd satisfaction.
Example Output:
File size distribution in directory random_files:
---------------------------------------------
Number of files: 10000
Mean size: 68987 bytes
Median size: 2048 bytes
Detailed distribution (size ranges):
< 1 KB : 1234
1 KB - 10 KB : 5678
10 KB - 100 KB : 2890
>= 100 KB : 198
Why Should You Care? 🤷♀️
Besides the obvious geek cred, generating files like this can help:
Test backup systems — can they handle weird file size distributions?
Stress-test storage or network performance with real-world-like data.
Understand your data patterns if you’re building apps that deal with files.
Wrapping Up: Big Files, Small Files, and the Chaos In Between
So there you have it. Ten thousand random files later, and we’ve peeked behind the curtain to understand their size story. It’s a bit like hosting a party and then figuring out who ate how many snacks. 🍿
Try this yourself! Tweak the distribution parameters, generate files, crunch the numbers — and impress your friends with your mad scripting skills. Or at least have a fun weekend project that makes you sound way smarter than you actually are.
If you’re running Mail-in-a-Box like me, you might rely on Duplicity to handle backups quietly in the background. It’s a great tool — until it isn’t. Recently, I ran into some frustrating issues caused by buggy Duplicity versions. Here’s the story, a useful discussion from the Mail-in-a-Box forums, and a neat trick I use to keep fallback versions handy. Spoiler: it involves an APT hook and some smart file copying! 🚀
The Problem with Duplicity Versions
Duplicity 3.0.1 and 3.0.5 have been reported to cause backup failures — a real headache when you depend on them to protect your data. The Mail-in-a-Box forum post “Something is wrong with the backup” dives into these issues with great detail. Users reported mysterious backup failures and eventually traced it back to specific Duplicity releases causing the problem.
Here’s the catch: those problematic versions sometimes sneak in during automatic updates. By the time you realize something’s wrong, you might already have upgraded to a buggy release. 😩
Pinning Problematic Versions with APT Preferences
One way to stop apt from installing those broken versions is to use APT pinning. Here’s an example file I created in /etc/apt/preferences/pin_duplicity.pref:
Explanation: Duplicity version 3.0.1* has a bug and should not be installed
Package: duplicity
Pin: version 3.0.1*
Pin-Priority: -1
Explanation: Duplicity version 3.0.5* has a bug and should not be installed
Package: duplicity
Pin: version 3.0.5*
Pin-Priority: -1
This tells apt to refuse to install these specific buggy versions. Sounds great, right? Except — it often comes too late. You could already have updated to a broken version before adding the pin.
Also, since Duplicity is installed from a PPA, older versions vanish quickly as new releases push them out. This makes rolling back to a known good version a pain. 😤
My Solution: Backing Up Known Good Duplicity .deb Files Automatically
To fix this, I created an APT hook that runs after every package operation involving Duplicity. It automatically copies the .deb package files of Duplicity from apt’s archive cache — and even from my local folder if I’m installing manually — into a safe backup folder.
Here’s the script, saved as /usr/local/bin/apt-backup-duplicity.sh:
Use apt-mark hold to Lock a Working Duplicity Version 🔒
Even with pinning and local .deb backups, there’s one more layer of protection I recommend: freezing a known-good version with apt-mark hold.
Once you’ve confirmed that your current version of Duplicity works reliably, run:
sudo apt-mark hold duplicity
This tells apt not to upgrade Duplicity, even if a newer version becomes available. It’s a great way to avoid accidentally replacing your working setup with something buggy during routine updates.
🧠 Pro Tip: I only unhold and upgrade Duplicity manually after checking the Mail-in-a-Box forum for reports that a newer version is safe.
Replace <version> with the actual filename you want to roll back to. Because you saved the .deb files right after each update, you always have access to older stable versions — even if the PPA has moved on.
Final Thoughts
While pinning bad versions helps, having a local stash of known-good packages is a game changer. Add apt-mark hold on top of that, and you have a rock-solid defense against regressions. 🪨✨
It’s a small extra step but pays off hugely when things go sideways. Plus, it’s totally automated with the APT hook, so you don’t have to remember to save anything manually. 🎉
If you run Mail-in-a-Box or rely on Duplicity in any critical backup workflow, I highly recommend setting up this safety net.
In my previous post, I shared the story of why I needed a new USB stick and how I used ChatGPT to write a benchmark script that could measure read performance across various methods. In this follow-up, I will dive into the technical details of how the script evolved—from a basic prototype into a robust and feature-rich tool—thanks to incremental refinements and some AI-assisted development.
Starting Simple: The First Version
The initial idea was simple: read a file using dd and measure the speed.
No cache flushing, leading to inflated results when repeating the measurement
With ChatGPT’s help, I started addressing each of these issues one by one.
Tools check
On a default Ubuntu installation, some tools are available by default, while others (especially benchmarking tools) usually need to be installed separately.
Tools used in the script:
Tool
Installed by default?
Needs require?
hdparm
❌ Not installed
✅ Yes
dd
✅ Yes
❌ No
pv
❌ Not installed
✅ Yes
cat
✅ Yes
❌ No
ioping
❌ Not installed
✅ Yes
fio
❌ Not installed
✅ Yes
lsblk
✅ Yes (in util-linux)
❌ No
awk
✅ Yes (in gawk)
❌ No
grep
✅ Yes
❌ No
basename
✅ Yes (in coreutils)
❌ No
find
✅ Yes
❌ No
sort
✅ Yes
❌ No
stat
✅ Yes
❌ No
This function ensures the system has all tools needed for benchmarking. It exits early if any tool is missing.
This was the initial version:
check_required_tools() {
local required_tools=(dd pv hdparm fio ioping awk grep sed tr bc stat lsblk find sort)
for tool in "${required_tools[@]}"; do
if ! command -v "$tool" &>/dev/null; then
echo "❌ Required tool '$tool' is not installed."
exit 1
fi
done
}
That’s already nice, but maybe I just want to run the script anyway if some of the tools are missing.
This is a more advanced version:
ALL_TOOLS=(hdparm dd pv ioping fio lsblk stat grep awk find sort basename column gnuplot)
MISSING_TOOLS=()
require() {
if ! command -v "$1" >/dev/null; then
return 1
fi
return 0
}
check_required_tools() {
echo "🔍 Checking required tools..."
for tool in "${ALL_TOOLS[@]}"; do
if ! require "$tool"; then
MISSING_TOOLS+=("$tool")
fi
done
if [[ ${#MISSING_TOOLS[@]} -gt 0 ]]; then
echo "⚠️ The following tools are missing: ${MISSING_TOOLS[*]}"
echo "You can install them using: sudo apt install ${MISSING_TOOLS[*]}"
if [[ -z "$FORCE_YES" ]]; then
read -rp "Do you want to continue and skip tests that require them? (y/N): " yn
case $yn in
[Yy]*)
echo "Continuing with limited tests..."
;;
*)
echo "Aborting. Please install the required tools."
exit 1
;;
esac
else
echo "Continuing with limited tests (auto-confirmed)..."
fi
else
echo "✅ All required tools are available."
fi
}
Device Auto-Detection
One early challenge was identifying which device was the USB stick. I wanted the script to automatically detect a mounted USB device. My first version was clunky and error-prone.
detect_usb() {
USB_DEVICE=$(lsblk -o NAME,TRAN,MOUNTPOINT -J | jq -r '.blockdevices[] | select(.tran=="usb") | .name' | head -n1)
if [[ -z "$USB_DEVICE" ]]; then
echo "❌ No USB device detected."
exit 1
fi
USB_PATH="/dev/$USB_DEVICE"
MOUNT_PATH=$(lsblk -no MOUNTPOINT "$USB_PATH" | head -n1)
if [[ -z "$MOUNT_PATH" ]]; then
echo "❌ USB device is not mounted."
exit 1
fi
echo "✅ Using USB device: $USB_PATH"
echo "✅ Mounted at: $MOUNT_PATH"
}
After a few iterations, we (ChatGPT and I) settled on parsing lsblk with filters on tran=usb and hotplug=1, and selecting the first mounted partition.
We also added a fallback prompt in case auto-detection failed.
detect_usb() {
if [[ -n "$USB_DEVICE" ]]; then
echo "📎 Using provided USB device: $USB_DEVICE"
MOUNT_PATH=$(lsblk -no MOUNTPOINT "$USB_DEVICE")
return
fi
echo "🔍 Detecting USB device..."
USB_DEVICE=""
while read -r dev tran hotplug type _; do
if [[ "$tran" == "usb" && "$hotplug" == "1" && "$type" == "disk" ]]; then
base="/dev/$dev"
part=$(lsblk -nr -o NAME,MOUNTPOINT "$base" | awk '$2 != "" {print "/dev/"$1; exit}')
if [[ -n "$part" ]]; then
USB_DEVICE="$part"
break
fi
fi
done < <(lsblk -o NAME,TRAN,HOTPLUG,TYPE,MOUNTPOINT -nr)
if [ -z "$USB_DEVICE" ]; then
echo "❌ No mounted USB partition found on any USB disk."
lsblk -o NAME,TRAN,HOTPLUG,TYPE,SIZE,MOUNTPOINT -nr | grep part
read -rp "Enter the USB device path manually (e.g., /dev/sdc1): " USB_DEVICE
fi
MOUNT_PATH=$(lsblk -no MOUNTPOINT "$USB_DEVICE")
if [ -z "$MOUNT_PATH" ]; then
echo "❌ USB device is not mounted."
exit 1
fi
echo "✅ Using USB device: $USB_DEVICE"
echo "✅ Mounted at: $MOUNT_PATH"
}
Finding the Test File
To avoid hardcoding filenames, we implemented logic to search for the latest Ubuntu ISO on the USB stick.
find_ubuntu_iso() {
# Function to find an Ubuntu ISO on the USB device
find "$MOUNT_PATH" -type f -regextype posix-extended \
-regex ".*/ubuntu-[0-9]{2}\.[0-9]{2}-desktop-amd64\\.iso" | sort -V | tail -n1
}
Later, we enhanced it to accept a user-provided file, and even verify that the file was located on the USB stick. If it was not, the script would gracefully fall back to the Ubuntu ISO search.
find_test_file() {
if [[ -n "$TEST_FILE" ]]; then
echo "📎 Using provided test file: $(basename "$TEST_FILE")"
# Check if the provided test file is on the USB device
TEST_FILE_MOUNT_PATH=$(realpath "$TEST_FILE" | grep -oP "^$MOUNT_PATH")
if [[ -z "$TEST_FILE_MOUNT_PATH" ]]; then
echo "❌ The provided test file is not located on the USB device."
# Look for an Ubuntu ISO if it's not on the USB
TEST_FILE=$(find_ubuntu_iso)
fi
else
TEST_FILE=$(find_ubuntu_iso)
fi
if [ -z "$TEST_FILE" ]; then
echo "❌ No valid test file found."
exit 1
fi
if [[ "$TEST_FILE" =~ ubuntu-[0-9]{2}\.[0-9]{2}-desktop-amd64\.iso ]]; then
UBUNTU_VERSION=$(basename "$TEST_FILE" | grep -oP 'ubuntu-\d{2}\.\d{2}')
echo "🧪 Selected Ubuntu version: $UBUNTU_VERSION"
else
echo "📎 Selected test file: $(basename "$TEST_FILE")"
fi
}
Read Methods and Speed Extraction
To get a comprehensive view, we added multiple methods:
Parsing their outputs proved tricky. For example, pv outputs speed with or without spaces, and with different units. We created a robust extract_speed function with regex, and a speed_to_mb function that could handle both MB/s and MiB/s, with or without a space between value and unit.
extract_speed() {
grep -oP '(?i)[\d.,]+\s*[KMG]i?B/s' | tail -1 | sed 's/,/./'
}
speed_to_mb() {
if [[ "$1" =~ ([0-9.,]+)[[:space:]]*([a-zA-Z/]+) ]]; then
value="${BASH_REMATCH[1]}"
unit=$(echo "${BASH_REMATCH[2]}" | tr '[:upper:]' '[:lower:]')
else
echo "0"
return
fi
case "$unit" in
kb/s) awk -v v="$value" 'BEGIN { printf "%.2f", v / 1000 }' ;;
mb/s) awk -v v="$value" 'BEGIN { printf "%.2f", v }' ;;
gb/s) awk -v v="$value" 'BEGIN { printf "%.2f", v * 1000 }' ;;
kib/s) awk -v v="$value" 'BEGIN { printf "%.2f", v / 1024 }' ;;
mib/s) awk -v v="$value" 'BEGIN { printf "%.2f", v }' ;;
gib/s) awk -v v="$value" 'BEGIN { printf "%.2f", v * 1024 }' ;;
*) echo "0" ;;
esac
}
Dropping Caches for Accurate Results
To prevent cached reads from skewing the results, each test run begins by dropping system caches using:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
What it does:
Command
Purpose
sync
Flushes all dirty (pending write) pages to disk
echo 3 > /proc/sys/vm/drop_caches
Clears page cache, dentries, and inodes from RAM
We wrapped this in a helper function and used it consistently.
Multiple Runs and Averaging
We made the script repeat each test N times (default: 3), collect results, compute averages, and display a summary at the end.
echo "📊 Read-only USB benchmark started ($RUNS run(s))"
echo "==================================="
declare -A TEST_NAMES=(
[1]="hdparm"
[2]="dd"
[3]="dd + pv"
[4]="cat + pv"
[5]="ioping"
[6]="fio"
)
declare -A TOTAL_MB
for i in {1..6}; do TOTAL_MB[$i]=0; done
CSVFILE="usb-benchmark-$(date +%Y%m%d-%H%M%S).csv"
echo "Test,Run,Speed (MB/s)" > "$CSVFILE"
for ((run=1; run<=RUNS; run++)); do
echo "▶ Run $run"
idx=1
### tests run here
echo "📄 Summary of average results for $UBUNTU_VERSION:"
echo "==================================="
SUMMARY_TABLE=""
for i in {1..6}; do
if [[ ${TOTAL_MB[$i]} != 0 ]]; then
avg=$(echo "scale=2; ${TOTAL_MB[$i]} / $RUNS" | bc)
echo "${TEST_NAMES[$i]} average: $avg MB/s"
RESULTS+=("${TEST_NAMES[$i]} average: $avg MB/s")
SUMMARY_TABLE+="${TEST_NAMES[$i]},$avg\n"
fi
done
Output Formats
To make the results user-friendly, we added:
A clean table view
CSV export for spreadsheets
Log file for later reference
if [[ "$VISUAL" == "table" || "$VISUAL" == "both" ]]; then
echo -e "📋 Table view:"
echo -e "Test Method,Average MB/s\n$SUMMARY_TABLE" | column -t -s ','
fi
if [[ "$VISUAL" == "bar" || "$VISUAL" == "both" ]]; then
if require gnuplot; then
echo -e "$SUMMARY_TABLE" | awk -F',' '{print $1" "$2}' | \
gnuplot -p -e "
set terminal dumb;
set title 'USB Read Benchmark Results ($UBUNTU_VERSION)';
set xlabel 'Test Method';
set ylabel 'MB/s';
plot '-' using 2:xtic(1) with boxes notitle
"
fi
fi
LOGFILE="usb-benchmark-$(date +%Y%m%d-%H%M%S).log"
{
echo "Benchmark for USB device: $USB_DEVICE"
echo "Mounted at: $MOUNT_PATH"
echo "Ubuntu version: $UBUNTU_VERSION"
echo "Test file: $TEST_FILE"
echo "Timestamp: $(date)"
echo "Number of runs: $RUNS"
echo ""
echo "Read speed averages:"
for line in "${RESULTS[@]}"; do
echo "$line"
done
} > "$LOGFILE"
echo "📝 Results saved to: $LOGFILE"
echo "📈 CSV exported to: $CSVFILE"
echo "==================================="
The Full Script
Here is the complete version of the script used to benchmark the read performance of a USB drive:
#!/bin/bash
# ==========================
# CONFIGURATION
# ==========================
RESULTS=()
USB_DEVICE=""
TEST_FILE=""
RUNS=1
VISUAL="none"
SUMMARY=0
# (Consider grouping related configuration into a config file or associative array if script expands)
# ==========================
# ARGUMENT PARSING
# ==========================
while [[ $# -gt 0 ]]; do
case $1 in
--device)
USB_DEVICE="$2"
shift 2
;;
--file)
TEST_FILE="$2"
shift 2
;;
--runs)
RUNS="$2"
shift 2
;;
--visual)
VISUAL="$2"
shift 2
;;
--summary)
SUMMARY=1
shift
;;
--yes|--force)
FORCE_YES=1
shift
;;
*)
echo "Unknown option: $1"
exit 1
;;
esac
done
# ==========================
# TOOL CHECK
# ==========================
ALL_TOOLS=(hdparm dd pv ioping fio lsblk stat grep awk find sort basename column gnuplot)
MISSING_TOOLS=()
require() {
if ! command -v "$1" >/dev/null; then
return 1
fi
return 0
}
check_required_tools() {
echo "🔍 Checking required tools..."
for tool in "${ALL_TOOLS[@]}"; do
if ! require "$tool"; then
MISSING_TOOLS+=("$tool")
fi
done
if [[ ${#MISSING_TOOLS[@]} -gt 0 ]]; then
echo "⚠️ The following tools are missing: ${MISSING_TOOLS[*]}"
echo "You can install them using: sudo apt install ${MISSING_TOOLS[*]}"
if [[ -z "$FORCE_YES" ]]; then
read -rp "Do you want to continue and skip tests that require them? (y/N): " yn
case $yn in
[Yy]*)
echo "Continuing with limited tests..."
;;
*)
echo "Aborting. Please install the required tools."
exit 1
;;
esac
else
echo "Continuing with limited tests (auto-confirmed)..."
fi
else
echo "✅ All required tools are available."
fi
}
# ==========================
# AUTO-DETECT USB DEVICE
# ==========================
detect_usb() {
if [[ -n "$USB_DEVICE" ]]; then
echo "📎 Using provided USB device: $USB_DEVICE"
MOUNT_PATH=$(lsblk -no MOUNTPOINT "$USB_DEVICE")
return
fi
echo "🔍 Detecting USB device..."
USB_DEVICE=""
while read -r dev tran hotplug type _; do
if [[ "$tran" == "usb" && "$hotplug" == "1" && "$type" == "disk" ]]; then
base="/dev/$dev"
part=$(lsblk -nr -o NAME,MOUNTPOINT "$base" | awk '$2 != "" {print "/dev/"$1; exit}')
if [[ -n "$part" ]]; then
USB_DEVICE="$part"
break
fi
fi
done < <(lsblk -o NAME,TRAN,HOTPLUG,TYPE,MOUNTPOINT -nr)
if [ -z "$USB_DEVICE" ]; then
echo "❌ No mounted USB partition found on any USB disk."
lsblk -o NAME,TRAN,HOTPLUG,TYPE,SIZE,MOUNTPOINT -nr | grep part
read -rp "Enter the USB device path manually (e.g., /dev/sdc1): " USB_DEVICE
fi
MOUNT_PATH=$(lsblk -no MOUNTPOINT "$USB_DEVICE")
if [ -z "$MOUNT_PATH" ]; then
echo "❌ USB device is not mounted."
exit 1
fi
echo "✅ Using USB device: $USB_DEVICE"
echo "✅ Mounted at: $MOUNT_PATH"
}
# ==========================
# FIND TEST FILE
# ==========================
find_ubuntu_iso() {
# Function to find an Ubuntu ISO on the USB device
find "$MOUNT_PATH" -type f -regextype posix-extended \
-regex ".*/ubuntu-[0-9]{2}\.[0-9]{2}-desktop-amd64\\.iso" | sort -V | tail -n1
}
find_test_file() {
if [[ -n "$TEST_FILE" ]]; then
echo "📎 Using provided test file: $(basename "$TEST_FILE")"
# Check if the provided test file is on the USB device
TEST_FILE_MOUNT_PATH=$(realpath "$TEST_FILE" | grep -oP "^$MOUNT_PATH")
if [[ -z "$TEST_FILE_MOUNT_PATH" ]]; then
echo "❌ The provided test file is not located on the USB device."
# Look for an Ubuntu ISO if it's not on the USB
TEST_FILE=$(find_ubuntu_iso)
fi
else
TEST_FILE=$(find_ubuntu_iso)
fi
if [ -z "$TEST_FILE" ]; then
echo "❌ No valid test file found."
exit 1
fi
if [[ "$TEST_FILE" =~ ubuntu-[0-9]{2}\.[0-9]{2}-desktop-amd64\.iso ]]; then
UBUNTU_VERSION=$(basename "$TEST_FILE" | grep -oP 'ubuntu-\d{2}\.\d{2}')
echo "🧪 Selected Ubuntu version: $UBUNTU_VERSION"
else
echo "📎 Selected test file: $(basename "$TEST_FILE")"
fi
}
# ==========================
# SPEED EXTRACTION
# ==========================
extract_speed() {
grep -oP '(?i)[\d.,]+\s*[KMG]i?B/s' | tail -1 | sed 's/,/./'
}
speed_to_mb() {
if [[ "$1" =~ ([0-9.,]+)[[:space:]]*([a-zA-Z/]+) ]]; then
value="${BASH_REMATCH[1]}"
unit=$(echo "${BASH_REMATCH[2]}" | tr '[:upper:]' '[:lower:]')
else
echo "0"
return
fi
case "$unit" in
kb/s) awk -v v="$value" 'BEGIN { printf "%.2f", v / 1000 }' ;;
mb/s) awk -v v="$value" 'BEGIN { printf "%.2f", v }' ;;
gb/s) awk -v v="$value" 'BEGIN { printf "%.2f", v * 1000 }' ;;
kib/s) awk -v v="$value" 'BEGIN { printf "%.2f", v / 1024 }' ;;
mib/s) awk -v v="$value" 'BEGIN { printf "%.2f", v }' ;;
gib/s) awk -v v="$value" 'BEGIN { printf "%.2f", v * 1024 }' ;;
*) echo "0" ;;
esac
}
drop_caches() {
echo "🧹 Dropping system caches..."
if [[ $EUID -ne 0 ]]; then
echo " (requires sudo)"
fi
sudo sh -c "sync && echo 3 > /proc/sys/vm/drop_caches"
}
# ==========================
# RUN BENCHMARKS
# ==========================
run_benchmarks() {
echo "📊 Read-only USB benchmark started ($RUNS run(s))"
echo "==================================="
declare -A TEST_NAMES=(
[1]="hdparm"
[2]="dd"
[3]="dd + pv"
[4]="cat + pv"
[5]="ioping"
[6]="fio"
)
declare -A TOTAL_MB
for i in {1..6}; do TOTAL_MB[$i]=0; done
CSVFILE="usb-benchmark-$(date +%Y%m%d-%H%M%S).csv"
echo "Test,Run,Speed (MB/s)" > "$CSVFILE"
for ((run=1; run<=RUNS; run++)); do
echo "▶ Run $run"
idx=1
if require hdparm; then
drop_caches
speed=$(sudo hdparm -t --direct "$USB_DEVICE" 2>/dev/null | extract_speed)
mb=$(speed_to_mb "$speed")
echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
fi
((idx++))
drop_caches
speed=$(dd if="$TEST_FILE" of=/dev/null bs=8k 2>&1 |& extract_speed)
mb=$(speed_to_mb "$speed")
echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
((idx++))
if require pv; then
drop_caches
FILESIZE=$(stat -c%s "$TEST_FILE")
speed=$(dd if="$TEST_FILE" bs=8k status=none | pv -s "$FILESIZE" -f -X 2>&1 | extract_speed)
mb=$(speed_to_mb "$speed")
echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
fi
((idx++))
if require pv; then
drop_caches
speed=$(cat "$TEST_FILE" | pv -f -X 2>&1 | extract_speed)
mb=$(speed_to_mb "$speed")
echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
fi
((idx++))
if require ioping; then
drop_caches
speed=$(ioping -c 10 -A "$USB_DEVICE" 2>/dev/null | grep 'read' | extract_speed)
mb=$(speed_to_mb "$speed")
echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
fi
((idx++))
if require fio; then
drop_caches
speed=$(fio --name=readtest --filename="$TEST_FILE" --direct=1 --rw=read --bs=8k \
--size=100M --ioengine=libaio --iodepth=16 --runtime=5s --time_based --readonly \
--minimal 2>/dev/null | awk -F';' '{print $6" KB/s"}' | extract_speed)
mb=$(speed_to_mb "$speed")
echo "${idx}. ${TEST_NAMES[$idx]}: $speed"
TOTAL_MB[$idx]=$(echo "${TOTAL_MB[$idx]} + $mb" | bc)
echo "${TEST_NAMES[$idx]},$run,$mb" >> "$CSVFILE"
fi
done
echo "📄 Summary of average results for $UBUNTU_VERSION:"
echo "==================================="
SUMMARY_TABLE=""
for i in {1..6}; do
if [[ ${TOTAL_MB[$i]} != 0 ]]; then
avg=$(echo "scale=2; ${TOTAL_MB[$i]} / $RUNS" | bc)
echo "${TEST_NAMES[$i]} average: $avg MB/s"
RESULTS+=("${TEST_NAMES[$i]} average: $avg MB/s")
SUMMARY_TABLE+="${TEST_NAMES[$i]},$avg\n"
fi
done
if [[ "$VISUAL" == "table" || "$VISUAL" == "both" ]]; then
echo -e "📋 Table view:"
echo -e "Test Method,Average MB/s\n$SUMMARY_TABLE" | column -t -s ','
fi
if [[ "$VISUAL" == "bar" || "$VISUAL" == "both" ]]; then
if require gnuplot; then
echo -e "$SUMMARY_TABLE" | awk -F',' '{print $1" "$2}' | \
gnuplot -p -e "
set terminal dumb;
set title 'USB Read Benchmark Results ($UBUNTU_VERSION)';
set xlabel 'Test Method';
set ylabel 'MB/s';
plot '-' using 2:xtic(1) with boxes notitle
"
fi
fi
LOGFILE="usb-benchmark-$(date +%Y%m%d-%H%M%S).log"
{
echo "Benchmark for USB device: $USB_DEVICE"
echo "Mounted at: $MOUNT_PATH"
echo "Ubuntu version: $UBUNTU_VERSION"
echo "Test file: $TEST_FILE"
echo "Timestamp: $(date)"
echo "Number of runs: $RUNS"
echo ""
echo "Read speed averages:"
for line in "${RESULTS[@]}"; do
echo "$line"
done
} > "$LOGFILE"
echo "📝 Results saved to: $LOGFILE"
echo "📈 CSV exported to: $CSVFILE"
echo "==================================="
}
# ==========================
# MAIN
# ==========================
check_required_tools
detect_usb
find_test_file
run_benchmarks
You van also find the latest revision of this script as a GitHub Gist.
Lessons Learned
This script has grown from a simple one-liner into a reliable tool to test USB read performance. Working with ChatGPT sped up development significantly, especially for bash edge cases and regex. But more importantly, it helped guide the evolution of the script in a structured way, with clean modular functions and consistent formatting.
Conclusion
This has been a fun and educational project. Whether you are benchmarking your own USB drives or just want to learn more about shell scripting, I hope this walkthrough is helpful.
Next up? Maybe a graphical version, or write benchmarking on a RAM disk to avoid damaging flash storage.
Stay tuned—and let me know if you use this script or improve it!
When I upgraded from an old 8GB USB stick to a shiny new 256GB one, I expected faster speeds and more convenience—especially for carrying around multiple bootable ISO files using Ventoy. With modern Linux distributions often exceeding 4GB per ISO, my old drive could barely hold a single image. But I quickly realized that storage space was only half the story—performance matters too.
Curious about how much of an upgrade I had actually made, I decided to benchmark the read speed of both USB sticks. Instead of hunting down benchmarking tools or manually comparing outputs, I turned to ChatGPT to help me craft a reliable, repeatable shell script that could automate the entire process. In this post, I’ll share how ChatGPT helped me go from an idea to a functional USB benchmark script, and what I learned along the way.
The Goal
I wanted to answer a few simple but important questions:
How much faster is my new USB stick compared to the old one?
Do different USB ports affect read speeds?
How can I automate these tests and compare the results?
But I also wanted a reusable script that would:
Detect the USB device automatically
Find or use a test file on the USB stick
Run several types of read benchmarks
Present the results clearly, with support for summary and CSV export
Getting Help from ChatGPT
I asked ChatGPT to help me write a shell script with these requirements. It guided me through:
Handling different cases for user-provided test files or Ubuntu ISOs
Parsing and converting human-readable speed outputs
Displaying results in human-friendly tables and optional CSV export
We iterated over the script, addressing edge cases like:
USB devices not mounted
Multiple USB partitions
pv not showing output unless stderr was correctly handled
Formatting output consistently across tools
ChatGPT even helped optimize the code for readability, reduce duplication, and handle both space-separated and non-space-separated speed values like “18.6 MB/s” and “18.6MB/s”.
Benchmark Results
With the script ready, I ran tests on three configurations:
The old USB stick is not only limited in capacity but also very slow. It barely breaks 20 MB/s in most tests.
The new USB stick, when plugged into a fast USB 3.0 port, is significantly faster—over 10x the speed in most benchmarks.
Plugging the same new stick into a slower port dramatically reduces its performance—a good reminder to check where you plug it in.
Tools like hdparm, dd, and cat + pv give relatively consistent results. However, ioping and fio behave differently due to the way they access data—random access or block size differences can impact results.
Also worth noting: the metal casing of the new USB stick gets warm after a few test runs, unlike the old plastic one.
Conclusion
Using ChatGPT to develop this benchmark script was like pair-programming with an always-available assistant. It accelerated development, helped troubleshoot weird edge cases, and made the script more polished than if I had done it alone.
If you want to test your own USB drives—or ensure you’re using the best port for speed—this benchmark script is a great tool to have in your kit. And if you’re looking to learn shell scripting, pairing with ChatGPT is an excellent way to level up.
Want the script? I’ll share the full version of the script and instructions on how to use it in a follow-up post. Stay tuned!
Because curiosity killed the cat, not because it’s useful! 😀
Start with a clean install in a virtual machine
I start with a simple Vagrantfile:
Vagrant.configure("2") do |config|
config.vm.box = "ubuntu/jammy64"
config.vm.provision "ansible" do |ansible|
ansible.playbook = "playbook.yml"
end
end
This Ansible playbook updates all packages to the latest version and removes unused packages.
- name: Update all packages to the latest version
hosts: all
remote_user: ubuntu
become: yes
tasks:
- name: Update apt cache
apt:
update_cache: yes
cache_valid_time: 3600
force_apt_get: yes
- name: Upgrade all apt packages
apt:
force_apt_get: yes
upgrade: dist
- name: Check if a reboot is needed for Ubuntu boxes
register: reboot_required_file
stat: path=/var/run/reboot-required get_md5=no
- name: Reboot the Ubuntu box
reboot:
msg: "Reboot initiated by Ansible due to kernel updates"
connect_timeout: 5
reboot_timeout: 300
pre_reboot_delay: 0
post_reboot_delay: 30
test_command: uptime
when: reboot_required_file.stat.exists
- name: Remove unused packages
apt:
autoremove: yes
purge: yes
force_apt_get: yes
Then bring up the virtual machine with vagrant up --provision.
Get the installation size
I ssh into the box (vagrant ssh) and run a couple of commands to get some numbers.
Soms moet ne mens al eens iets speciaals doen, zoals het nemen van een screenshot op een toestel dat wel Linux draait, maar geen X. Oink? Volgens StackExchange zou ik fbgrab of fbdump moeten gebruiken, maar dat is in dit concrete geval niet mogelijk because reasons.
In dit concrete geval is er een toepassing die rechtstreeks naar de framebuffer beelden stuurt. Bon, alles is een file onder Linux, dus ik ging eens piepen wat er dan eigenlijk in dat framebuffer device zat:
$ cp /dev/fb0 /tmp/framebuffer.data
$ head -c 64 /tmp/framebuffer.data
kkk�kkk�kkk�kkk�kkk�kkk�kkk�kkk�kkk�kkk�kkk�kkk�kkk�kkk�kkk�kkk�
IEKS!!! Alhoewel… Tiens, dat zag er verdacht regelmatig uit, telkens in groepjes van 4 bytes. “k” heeft ASCII waarde 107, of 6B hexadecimaal, en #6B6B6B is een grijstint. Ik had voorlopig nog geen enkel idee wat die “�” betekende, maar ik wist dat ik iets op het spoor was!
Ik heb framebuffer.data dan gekopieerd naar een pc met daarop Gimp. (referentie naar Contact invoegen)