Split & Cat Large Files

We still live in a 32-bit world.  This means that if you deal with large file transfers, you will undoubtedly bump up against the 4 GB limit (2^32 bytes) imposed by legacy file systems and network infrastructure.  One way of skirting the 4 GB limit is to use the Unix commands split and cat to break large files into smaller chunks and reassemble them later.

For example, if you’re trying to save a 5 GB file (e.g. 5gb_largefile.iso) to a USB drive formatted with the FAT32 filesystem (maximum file size of 4 GB), you can split the file into 3 pieces of 2 GB or less with the following command:

split -b 2048m 5gb_largefile.iso 5gb_largefile.iso_

to yield the following files:

  • 5gb_largefile.iso_aa (2 GB)
  • 5gb_largefile.iso_ab (2 GB)
  • 5gb_largefile.iso_ac (1 GB)

When you’re ready to use the file, you can combine the 3 pieces back into the original file with the following command:

cat 5gb_largefile.iso_a* > /tmp/5gb_largefile.iso

In this case, be sure to write the output to a directory (e.g. /tmp on a Unix machine) whose filesystem does not have the 4 GB limit.
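To confirm the reassembled file matches the original, compare checksums before and after the round trip.  A minimal sanity check, assuming GNU coreutils’ md5sum is available (on macOS the equivalent command is md5):

md5sum 5gb_largefile.iso /tmp/5gb_largefile.iso

The two checksums should be identical.  Alternatively, cmp 5gb_largefile.iso /tmp/5gb_largefile.iso exits silently when the files match byte for byte.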

cURL Shell Scripting

A shell script that processes a batch of file URLs and determines the availability of each file:

#!/bin/sh 
# expects "|" delimited file containing 2 columns
# column 1: line number
# column 2: URL
#
# outputs "|" delimited file containing 3 columns
# column 1: line number
# column 2: file URL
# column 3: file available (Yes|No) 
printf "Enter the file name containing the file URLs (input)\n"
printf ":"
read URL_LIST

# make sure input file exists
if [ ! -f "$URL_LIST" ]; then
 echo "Invalid input file specified."
 exit 1
fi

printf "Enter the file name to capture results (output)\n"
printf ":"
read URL_STATUS

# make sure output file doesn't exist
if [ -f "$URL_STATUS" ]; then
 echo "Invalid output file specified."
 exit 1
fi

printf "line number|file url|file available\n" >> "$URL_STATUS"

# loop through input file, row by row
# (reading line by line avoids the word-splitting pitfalls of for-in-cat)
while IFS= read -r i; do

 # parse for 1st & 2nd column, delimited by "|"
 line_no=`echo "$i" | awk -F'|' '{print $1}'`
 url=`echo "$i" | awk -F'|' '{print $2}'`

 # echo progress to screen
 printf "Processing line %s\n" "$line_no"

 # check HTTP header if remote file exists; counts "HTTP/1.1 200 OK" status lines
 # (a server answering over HTTP/2 reports "HTTP/2 200" instead)
 output=`curl --max-time 6 --silent --head "$url" | grep -c 'HTTP/1.1 200 OK'`

 # output line number & URL to file
 printf "%s|%s|" "$line_no" "$url" >> "$URL_STATUS"

 # output file availability (Yes|No) to file
 if [ "$output" -eq 1 ]; then
  printf "Yes\n" >> "$URL_STATUS"
 else
  printf "No\n" >> "$URL_STATUS"
 fi

done < "$URL_LIST"
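To illustrate a run, here’s a hypothetical session; the script name, file names, and URLs below are placeholders, not real endpoints:

$ cat url_list.txt
1|http://example.com/files/report.pdf
2|http://example.com/files/missing.pdf

$ sh check_urls.sh
Enter the file name containing the file URLs (input)
:url_list.txt
Enter the file name to capture results (output)
:url_status.txt
Processing line 1
Processing line 2

$ cat url_status.txt
line number|file url|file available
1|http://example.com/files/report.pdf|Yes
2|http://example.com/files/missing.pdf|No

If you would rather match on the numeric status code than on a protocol-specific status line, curl can print it directly via --write-out; a sketch of the equivalent check:

status=`curl --max-time 6 --silent --head --output /dev/null --write-out '%{http_code}' "$url"`

This yields 200 for an available file whether the server answers over HTTP/1.1 or HTTP/2.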