Archive for January 2012

One Liner Breakdown: Downloading an Entire Album from Photobucket

I recently downloaded 1202 infographics from Photobucket (one very large album) using this one-liner:

curl -s http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/\?start\=all | \
egrep -o 'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}' | \
uniq -d | xargs -n1 -IIMAGE wget http://IMAGE

We were passing around links to interesting images in our IRC channel and someone linked to this set of 1202 infographics:

http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all

We wanted to download all of the images, so I started by looking at the source of the page.

curl

We’ll start by using curl to download the page source (the first line of the extended one-liner). We use the -s flag for silent mode to keep curl’s progress meter out of the output (it would otherwise mess up the rest of the one-liner).
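As a quick sanity check (this isn’t part of the original one-liner), you can pipe curl’s output through head to confirm the album markup is coming back:

curl -s 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | head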

egrep

After staring at the source, you’ll notice a common pattern that lets you pull the full URL for each image out of the page. We use egrep with the -o flag to print only the matching part of each line (rather than the full matching line). After massaging the regex a little, our one-liner looks like the following:

curl -Ls 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | \
egrep -o 'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}'

This should bring back thousands of URLs.

uniq

The page tells us that there are only 1202 images, but after examining the source we notice that the path to each full image is repeated several times. Because the duplicates sit on consecutive lines, we can collapse them by piping the output to uniq with the -d flag (which prints one copy of each repeated line). To check how many lines that leaves, we temporarily pipe the output through wc -l:

curl -Ls 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | \
egrep -o \
'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}' | \
uniq -d | wc -l

There are 1202 images, and wc -l reports 1202 matching lines. With the counts matching, we remove the wc -l and are left with the list of images to download.

xargs

xargs allows us to run a command for each line of input. We append xargs to the pipeline with -n1 to use one line per command, and -IIMAGE to use the word “IMAGE” as the placeholder for each input line:

curl -s 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | \
egrep -o \
'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}' | \
uniq -d | \
xargs -n1 -IIMAGE wget http://IMAGE

The only thing we’ve done is prepend ‘http://’ to each URL for wget. This matches the one-liner at the top of the post and will download every image from that page! The final gzipped tarball of those images ended up being 327MB.
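If you want to see exactly what xargs will run before kicking off the downloads, a dry run with echo in front of wget (my addition, not part of the original one-liner) prints each substituted command instead of executing it:

curl -s 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | \
egrep -o \
'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}' | \
uniq -d | \
xargs -n1 -IIMAGE echo wget http://IMAGE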

Bonus Round:

Wget provides a batch mode that lets the user supply URLs via a text file or STDIN. With a little sed to prepend ‘http://’ and strip the escaped backslashes, we can feed the URLs to wget directly and save the cost of spawning a separate process per URL with xargs:

curl -s 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | \
egrep -o \
'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}' | \
uniq -d |\
sed 's/^/http:\/\//g;s/\\//g' | wget -i -

You’ll see a significant improvement in terms of resources used by the system. I used wget’s --spider flag to only check for the existence of each URL, and ran both versions against the same file of URLs (to minimize differences caused by network latency on Photobucket’s side, other processes slowing things down, etc.):

xargs: 1.41s user 6.74s system 51% cpu 15.694 total

wget: 0.13s user 0.10s system 5% cpu 4.113 total
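The comparison looked something like the following (a rough reconstruction rather than the exact commands; urls.txt is just a placeholder filename):

curl -s 'http://s966.photobucket.com/albums/ae143/sentenal01/Informational%20pictures/?start=all' | \
egrep -o \
'i966.photobucket.com\\/albums\\/ae143\\/sentenal01\\/Informational%20pictures\\/[a-zA-Z0-9_\\-]+\.\w{3}' | \
uniq -d > urls.txt

time xargs -n1 -IIMAGE wget --spider http://IMAGE < urls.txt
time sed 's/^/http:\/\//g;s/\\//g' urls.txt | wget --spider -i -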

As you can see, there’s a significant overhead with using xargs to spawn another wget process for each URL. Try to use the existing functionality of your tools before adding one more command to the pipeline.

Stan Schwertly is the Founder of the Dead Coder Society and maintains his personal blog at schwertly.com

A New Year for the Dead Coder Society

The Dead Coder Society has been undergoing some serious changes. A majority of the original members have graduated from Stockton. We’re left with the question, “What next?”

We’re still meeting, and our online discussion is more active than ever before. We’re meeting outside of Stockton now, and that takes more work to organize: we’re not just staying after class anymore; we’re planning travel and finding dates that fit around work schedules and the real world. We’re changing.

When we do meet, the discussions are as good as they’ve ever been; most times they’re even better. We’ve seen new members raise the bar and old members step forward with new presentation material. The hunger is still there.

Presentations have taken on a new form. We’ve moved from a central topic to an extended lightning-talk cycle: each member presents one topic in summarized form, anyone can interrupt the discussion, and there’s no time limit. Talks tend to last 20 minutes, and nearly every member is able to contribute, largely thanks to the reduced frequency of our meetings. I’ve noticed that other members are taking in the presentations and really applying the technology in a practical way.

There’s still a lot of work that can be done to advance the group and its overall goal of promoting discussion and ideas. Cheers to the Dead Coder Society.