
Creating 10 000 Random Files & Analyzing Their Size Distribution: Because Why Not? 🧐💾

Ever wondered what it's like to unleash 10 000 tiny little data beasts on your hard drive? No? Well, buckle up anyway, because today we're diving into the curious world of random file generation, and then nerding out by calculating their size distribution. Spoiler alert: it's less fun than it sounds.

Step 1: Let's Make Some Files… Lots of Them

Our goal? Generate 10 000 files filled with random data. But not just any random sizes: we want a mean file size of roughly 68 KB and a median of about 2 KB. Sounds like a math puzzle? That's because it kind of is.

If you just pick file sizes uniformly at random, you'll end up with a median close to the mean, which is boring. We want a skewed distribution, where most files are small, but some are big enough to bring that average up.
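To see why a uniform draw won't do, here is a minimal Python sketch. It uses only the standard library, and the function name and the 136 KB upper bound are assumptions for illustration (the bound is picked so the uniform mean lands near 68 KB). The mean and median should come out almost identical, nowhere near the 68 KB vs. 2 KB split we are after.

```python
import random
import statistics

def uniform_mean_and_median(upper_bound, count=10_000, seed=0):
    """Draw `count` sizes uniformly from [0, upper_bound] and return (mean, median)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    sizes = [rng.randint(0, upper_bound) for _ in range(count)]
    return statistics.mean(sizes), statistics.median(sizes)

if __name__ == "__main__":
    # Hypothetical bound: twice the 68 KB target mean
    mean, median = uniform_mean_and_median(upper_bound=2 * 68 * 1024)
    print(f"mean   = {mean:.0f} bytes")
    print(f"median = {median:.0f} bytes")
```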

The Magic Trick: Log-normal Distribution 🎩✨

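Enter the log-normal distribution, a nifty way to generate lots of small numbers and a few big ones, just like real life. A log-normal with parameters mu and sigma has median exp(mu) and mean exp(mu + sigma²/2), so mu pins down the median and sigma controls how far the mean gets dragged above it. Using Python's NumPy library, we generate these sizes and then read that many bytes from good old /dev/urandom to fill our files with pure randomness.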
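If you want to work out the two parameters from a target median and mean instead of guessing, the standard log-normal identities (median = exp(mu), mean = exp(mu + sigma²/2)) can be inverted. The sketch below does that; the function name is hypothetical, and it assumes the same parameterization NumPy uses, where mu and sigma describe the underlying normal distribution.

```python
import math

def lognormal_params(target_median, target_mean):
    """Return (mu, sigma) of a log-normal with the given median and mean.

    Uses median = exp(mu) and mean = exp(mu + sigma**2 / 2).
    """
    if not 0 < target_median < target_mean:
        raise ValueError("need 0 < median < mean for a right-skewed log-normal")
    mu = math.log(target_median)
    sigma = math.sqrt(2.0 * (math.log(target_mean) - mu))
    return mu, sigma

if __name__ == "__main__":
    # Targets from the article: 2 KB median, 68 KB mean
    mu, sigma = lognormal_params(target_median=2 * 1024, target_mean=68 * 1024)
    print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
```

For a 2 KB median and a 68 KB mean this works out to mu ≈ 7.62 and sigma ≈ 2.66. With a sigma that large, the sample mean of 10 000 draws can wander noticeably from run to run, because a handful of huge files dominate the total.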

Here's the Bash script that does the heavy lifting:

#!/bin/bash

# Directory to store the random files
output_dir="random_files"
mkdir -p "$output_dir"

# Total number of files to create
file_count=10000

# Log-normal distribution parameters (mean and stddev of the underlying normal)
mean_log=7.62    # median = exp(7.62), about 2 KB
stddev_log=2.66  # mean = exp(7.62 + 2.66^2 / 2), about 68 KB

# Draw all sizes in a single Python call, one per line
# (starting Python and importing NumPy once per file would be very slow)
generate_random_sizes() {
    python3 -c "import numpy as np; print(*np.random.lognormal($mean_log, $stddev_log, $file_count).astype(int), sep='\n')"
}

# Create files with random data
i=0
generate_random_sizes | while read -r file_size; do
    i=$((i + 1))
    file_path="$output_dir/file_$i.bin"
    head -c "$file_size" /dev/urandom > "$file_path"
    echo "Generated file $i with size $file_size bytes."
done

echo "Done. Files saved in $output_dir."

Easy enough, right? This creates a directory random_files and fills it with 10 000 files whose sizes are mostly small but occasionally wildly bigger. At a 68 KB mean that adds up to roughly 680 MB, so don't blame me if your disk space takes a little hit! 💥
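Before letting the script loose, it may be worth doing a dry run that only draws the sizes and adds them up. The sketch below is one way to do that with Python's standard library; the function name is hypothetical, and because it uses random.lognormvariate instead of NumPy, it gives an estimate and not the exact sizes the script will produce. The example values mu = 7.62 and sigma = 2.66 correspond to the 2 KB median and 68 KB mean target; swap in whatever mean_log and stddev_log your copy of the script uses.

```python
import random

def estimate_footprint(mu, sigma, count=10_000, seed=None):
    """Draw `count` log-normal sizes and return (total_bytes, largest_bytes)."""
    rng = random.Random(seed)
    # int() truncates to whole bytes, matching what the generator script does
    sizes = [int(rng.lognormvariate(mu, sigma)) for _ in range(count)]
    return sum(sizes), max(sizes)

if __name__ == "__main__":
    # Assumed example parameters (2 KB median, 68 KB mean target)
    total, largest = estimate_footprint(mu=7.62, sigma=2.66, seed=42)
    print(f"total:   {total / 1024**2:.1f} MiB")
    print(f"largest: {largest / 1024**2:.1f} MiB")
```

Running it with a few different seeds gives a feel for how much the total swings when one or two giant files show up.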

Step 2: Crunching Numbers on the File Size Distribution 📊

Okay, you've got the files. Now, what can we learn from their sizes? Let's find out the:

  • Mean size: The average size across all files.
  • Median size: The middle value when sizes are sorted, because averages can lie.
  • Distribution breakdown: How many tiny files vs. giant files.

Here's a handy Bash script that reads file sizes and spits out these stats with a bit of flair (it relies on GNU stat's -c option, so it is meant for Linux):

#!/bin/bash

# Input directory (default to "random_files" if not provided)
directory="${1:-random_files}"

# Check if directory exists
if [ ! -d "$directory" ]; then
    echo "Directory $directory does not exist."
    exit 1
fi

# Array to store file sizes (stat runs once per batch of files, not once per file)
mapfile -t file_sizes < <(find "$directory" -type f -exec stat -c %s {} +)

# Check if there are files in the directory
if [ ${#file_sizes[@]} -eq 0 ]; then
    echo "No files found in the directory $directory."
    exit 1
fi

# Calculate mean
total_size=0
for size in "${file_sizes[@]}"; do
    total_size=$((total_size + size))
done
mean=$((total_size / ${#file_sizes[@]}))

# Calculate median
sorted_sizes=($(printf '%s\n' "${file_sizes[@]}" | sort -n))
mid=$(( ${#sorted_sizes[@]} / 2 ))
if (( ${#sorted_sizes[@]} % 2 == 0 )); then
    median=$(( (sorted_sizes[mid-1] + sorted_sizes[mid]) / 2 ))
else
    median=${sorted_sizes[mid]}
fi

# Display file size distribution
echo "File size distribution in directory $directory:"
echo "---------------------------------------------"
echo "Number of files: ${#file_sizes[@]}"
echo "Mean size: $mean bytes"
echo "Median size: $median bytes"

# Display detailed size distribution (optional)
echo
echo "Detailed distribution (size ranges):"
awk '{
    if ($1 < 1024) bins["< 1 KB"]++;
    else if ($1 < 10240) bins["1 KB - 10 KB"]++;
    else if ($1 < 102400) bins["10 KB - 100 KB"]++;
    else bins[">= 100 KB"]++;
} END {
    # "for (range in bins)" visits keys in an unspecified order, so list them explicitly
    n = split("< 1 KB|1 KB - 10 KB|10 KB - 100 KB|>= 100 KB", order, "|");
    for (i = 1; i <= n; i++) printf "%-15s: %d\n", order[i], bins[order[i]];
}' <(printf '%s\n' "${file_sizes[@]}")

Run it, and voilà: instant nerd satisfaction.

Example output (the numbers are illustrative; yours will differ from run to run):

File size distribution in directory random_files:
---------------------------------------------
Number of files: 10000
Mean size: 68987 bytes
Median size: 2048 bytes

Detailed distribution (size ranges):
< 1 KB         : 1234
1 KB - 10 KB   : 5678
10 KB - 100 KB : 2890
>= 100 KB      : 198
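If you'd like a second opinion on those numbers, the same statistics can be computed in a few lines of Python. This is a sketch and not part of the original scripts: the function names are made up for illustration, and it uses only the standard library with the same four size ranges as the Bash version.

```python
import os
import statistics
import sys

# Exclusive upper bounds and labels for the size ranges used in the article
BINS = [(1024, "< 1 KB"), (10 * 1024, "1 KB - 10 KB"), (100 * 1024, "10 KB - 100 KB")]
LAST_BIN = ">= 100 KB"

def summarize(sizes):
    """Return count, mean, median and per-range counts for a list of byte sizes."""
    counts = {label: 0 for _, label in BINS}
    counts[LAST_BIN] = 0
    for size in sizes:
        for upper, label in BINS:
            if size < upper:
                counts[label] += 1
                break
        else:  # no upper bound matched, so the file is 100 KB or larger
            counts[LAST_BIN] += 1
    return {
        "count": len(sizes),
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "bins": counts,
    }

def sizes_in(directory):
    """Collect the sizes of all non-directory entries below `directory`."""
    return [
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, names in os.walk(directory)
        for name in names
    ]

if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "random_files"
    sizes = sizes_in(directory)
    if not sizes:
        print(f"No files found in {directory}.")
    else:
        stats = summarize(sizes)
        print(f"Number of files: {stats['count']}")
        print(f"Mean size: {stats['mean']:.0f} bytes")
        print(f"Median size: {stats['median']:.0f} bytes")
        for label, n in stats["bins"].items():
            print(f"{label:<15}: {n}")
```

One small difference from the Bash script: the mean here is a float that gets rounded for display, whereas Bash integer division truncates it.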

Why Should You Care? 🤷‍♀️

Besides the obvious geek cred, generating files like this can help:

  • Test backup systems: can they handle weird file size distributions?
  • Stress-test storage or network performance with real-world-like data.
  • Understand your data patterns if you're building apps that deal with files.

Wrapping Up: Big Files, Small Files, and the Chaos In Between

So there you have it. Ten thousand random files later, and we've peeked behind the curtain to understand their size story. It's a bit like hosting a party and then figuring out who ate how many snacks. 🍿

Try this yourself! Tweak the distribution parameters, generate files, crunch the numbers, and impress your friends with your mad scripting skills. Or at least have a fun weekend project that makes you sound way smarter than you actually are.
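When tweaking parameters, you don't have to regenerate 10 000 files to see where the sizes will land. The log-normal CDF tells you the expected share of files in each range. The sketch below assumes the same four ranges as the analysis script, uses hypothetical function names, and ignores the truncation of sizes to whole bytes.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """P(size < x) for a log-normal with log-mean mu and log-stddev sigma."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def expected_bin_shares(mu, sigma):
    """Expected fraction of files in each size range of the analysis script."""
    edges = [1024, 10 * 1024, 100 * 1024]
    labels = ["< 1 KB", "1 KB - 10 KB", "10 KB - 100 KB", ">= 100 KB"]
    cdf = [0.0] + [lognormal_cdf(edge, mu, sigma) for edge in edges] + [1.0]
    return {label: cdf[i + 1] - cdf[i] for i, label in enumerate(labels)}

if __name__ == "__main__":
    # Assumed example: parameters matching a 2 KB median and a 68 KB mean
    mu = math.log(2 * 1024)
    sigma = math.sqrt(2.0 * (math.log(68 * 1024) - mu))
    for label, share in expected_bin_shares(mu, sigma).items():
        print(f"{label:<15}: {share:.1%}")
```

Multiply each share by the file count to get the expected number of files per range, then compare that with what the analysis script reports.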

Happy hacking! 🔥
