One Liner Breakdown: Downloading an Entire Album from Photobucket


I recently downloaded 1202 infographics from Photobucket (one very large album) using this one-liner:

curl -s http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/\?start\=all | \
egrep -o 'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}' | \
uniq -d | xargs -n1 -IIMAGE wget http://IMAGE

We were passing around links to interesting images in our IRC channel and someone linked to this set of 1202 infographics:

http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all

We wanted to download all of the images, so I started by looking at the source of the page.

curl

We’ll start by using curl to download the page source (the first line of the one-liner above). We use the -s flag for silent mode, which keeps curl’s progress meter out of the output (it would otherwise mess up the rest of the pipeline).
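If you want to sanity-check this step on its own, you can pipe the page source through head; this is just an illustrative peek, not part of the final one-liner:

curl -s 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | head -n 20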

egrep

After staring at the source, you’ll notice a common pattern that lets you pull out the full URL for each image. We use egrep with the -o flag to print only the portion of each line that matches the pattern (rather than the full matching line). After massaging the regex a little, our one-liner looks like the following:

curl -s 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | \
egrep -o 'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}'

This should bring back thousands of URLs.

uniq

The page tells us that there are only 1202 images, but examining the source shows that the path to each full image is repeated several times. We pipe the output to uniq with the -d flag (which prints one copy of each line that appears more than once in a row), and temporarily pipe the result through wc -l to check the number of lines:

curl -s 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | \
egrep -o \
'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}' | \
uniq -d | wc -l

There are 1202 images, and wc -l reports 1202 matching lines. We remove the wc -l and are left with the list of images to download.
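One caveat: uniq only collapses adjacent duplicate lines (there’s no sort in this pipeline), which works out here because each image’s repeated path shows up consecutively in the page source; the matching line count confirms it. A quick illustration of what -d does:

printf 'a\na\nb\nc\nc\nc\n' | uniq -d

This prints only a and c, since b never repeats.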

xargs

xargs allows us to run a command for each line of input. We append xargs to the pipeline with -n1 so that each invocation handles a single item, and -IIMAGE to use the word “IMAGE” as the placeholder text for our input:

curl -s 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | \
egrep -o \
'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}' | \
uniq -d | \
xargs -n1 -IIMAGE wget http://IMAGE

The only thing we’ve done is prepend ‘http://’ to every URL for wget. This matches the one-liner at the top of the post and will download every image from the page! The final gzipped tarball of those images ended up being 327MB.
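If the -I substitution is new to you, here’s a tiny, self-contained illustration of it on its own (not part of the pipeline):

printf 'one\ntwo\n' | xargs -IX echo "got X"

xargs runs echo once per line, printing “got one” and then “got two”.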

Bonus Round:

Wget provides a batch mode: the -i option reads a list of URLs from a file, or from STDIN when given ‘-’. With a little sed to clean up the URLs, we can feed them to wget directly and save the cost of spawning a separate process per URL with xargs:

curl -s 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | \
egrep -o \
'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}' | \
uniq -d | \
sed 's/^/http:\/\//g;s/\\//g' | wget -i -
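The sed expression does two things: the first substitution prepends http:// to each line, and the second strips the escaped backslashes that egrep carried over from the page source. For example, with a made-up filename:

printf '%s\n' 'i966.photobucket.com\/albums\/ae143\/sentenal01\/Informational%20pictures\/example.jpg' | sed 's/^/http:\/\//g;s/\\//g'

This prints http://i966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/example.jpg, which is exactly what wget wants.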

You’ll see a significant improvement in terms of resources used by the system. I used wget’s --spider flag to check only for the existence of the URLs, and timed both approaches against a saved file of the URLs (to minimize differences caused by network latency on Photobucket’s side, other processes slowing things down, etc.):

xargs: 1.41s user 6.74s system 51% cpu 15.694 total

wget: 0.13s user 0.10s system 5% cpu 4.113 total
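For the curious, the comparison can be run roughly like the following, assuming the extracted URLs (with http:// prepended and the backslashes stripped) were saved to a file; urls.txt is my own stand-in name:

time xargs -n1 wget --spider < urls.txt
time wget --spider -i urls.txt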

As the timings show, there’s significant overhead in having xargs spawn a separate wget process for each URL. Try to use the existing functionality of your tools before adding one more command to the pipeline.

Stan Schwertly is the Founder of the Dead Coder Society and maintains his personal blog at schwertly.com

