{"id":299,"date":"2012-01-24T21:38:05","date_gmt":"2012-01-25T02:38:05","guid":{"rendered":"http:\/\/www.deadcodersociety.org\/?p=299"},"modified":"2012-01-26T06:47:37","modified_gmt":"2012-01-26T11:47:37","slug":"one-liner-breakdown-downloading-an-entire-album-from-photobucket","status":"publish","type":"post","link":"http:\/\/www.deadcodersociety.org\/blog\/one-liner-breakdown-downloading-an-entire-album-from-photobucket\/","title":{"rendered":"One Liner Breakdown: Downloading an Entire Album from Photobucket"},"content":{"rendered":"<p>I recently downloaded 1202 infographics from Photobucket (one very large album) using this one-liner:<\/p>\n<pre>curl -s http:\/\/s966.photobucket.com\/albums\/ae143\/sentenal01\/Informational%20pictures\/\\?start\\=all | \\\r\negrep -o 'i966.photobucket.com\\\\\/albums\\\\\/ae143\\\\\/sentenal01\\\\\/Informational%20pictures\\\\\/[a-zA-Z0-9_-\\\\]+\\.\\w{3}' | \\\r\nuniq -d | xargs -n1 -IIMAGE wget http:\/\/IMAGE<\/pre>\n<p>We were passing around links to interesting images in <a title=\"IRC\" href=\"http:\/\/www.deadcodersociety.org\/blog\/irc\/\" target=\"_blank\">our IRC channel<\/a>\u00a0and\u00a0someone linked to this set of 1202 infographics:<\/p>\n<p><a title=\"http:\/\/s966.photobucket.com\/albums\/ae143\/sentenal01\/Informational%20pictures\/\\?start\\=all\" href=\"http:\/\/s966.photobucket.com\/albums\/ae143\/sentenal01\/Informational%20pictures\/\\?start\\=all\" target=\"_blank\">http:\/\/s966.photobucket.com\/albums\/ae143\/sentenal01\/Informational%20pictures\/\\?start\\=all<\/a>.<\/p>\n<p>We wanted to download all of the images, so I started by looking at the source of the page.<\/p>\n<h2>curl<\/h2>\n<p>We&#8217;ll start by using\u00a0<strong>curl<\/strong>\u00a0to download the page source (like the first line of the extended one-liner). We&#8217;re using the &#8216;<strong>s<\/strong>&#8216; flag for <strong>silent<\/strong>, in order to keep the progress meter out of our output (which would mess up the rest of the one-liner).<\/p>\n<h2>egrep<\/h2>\n<p>After staring at the source, you&#8217;ll notice that there&#8217;s a common pattern that&#8217;ll let you pull out the full URL for the image. In our version, we&#8217;re using <strong>egrep<\/strong>\u00a0with the &#8216;<strong>o<\/strong>&#8216; flag to return <strong>only<\/strong>\u00a0the result we ask for (rather than the matching line in full).\u00a0After massaging the regex for a little, our one-liner looks like the following:<\/p>\n<pre>curl -Ls 'http:\/\/s966.photobucket.com\/albums\/ae143\/sentenal01\/Informational%20pictures\/\\?start\\=al' | \\\r\negrep -o 'i966.photobucket.com\\\\\/albums\\\\\/ae143\\\\\/sentenal01\\\\\/Informational%20pictures\\\\\/[a-zA-Z0-9_-\\\\]+\\.\\w{3}'<\/pre>\n<p>This should bring back thousands of URLs.<\/p>\n<h2>uniq<\/h2>\n<p>The page tells us that there&#8217;s only 1202 images. After examining the source, we notice that the path to the full image is repeated several times. Piping the output to the <strong>uniq<\/strong>\u00a0program with the <strong>-d<\/strong>\u00a0flag (for showing only entries with multiple results), we temporarily pipe the output through <strong>wc -l<\/strong>\u00a0to check the number of lines in the output:<\/p>\n<pre>curl -Ls 'http:\/\/s966.photobucket.com\/albums\/ae143\/sentenal01\/Informational%20pictures\/\\?start\\=al' | \\\r\negrep -o \\\r\n'i966.photobucket.com\\\\\/albums\\\\\/ae143\\\\\/sentenal01\\\\\/Informational%20pictures\\\\\/[a-zA-Z0-9_-\\\\]+\\.\\w{3}' \\\r\nuniq -d | wc -l<\/pre>\n<p>There are 1202 images, and <strong>wc -l<\/strong>\u00a0reports 1202 \u00a0matching lines. We remove the <strong>wc -l<\/strong>\u00a0and are left to download the resulting images.<\/p>\n<h2>xargs<\/h2>\n<p><strong>xargs<\/strong>\u00a0allows us to run a command on each line of input. We append <strong>xargs<\/strong>\u00a0to the command with <strong>-n1<\/strong>\u00a0for 1 line per command, and -IIMAGE to use the word &#8220;IMAGE&#8221; as the placeholder text for our input:<\/p>\n<pre>curl -s 'http:\/\/s966.photobucket.com\/albums\/ae143\/sentenal01\/Informational%20pictures\/\\?start\\=al' | \\\r\negrep -o \\\r\n'i966.photobucket.com\\\\\/albums\\\\\/ae143\\\\\/sentenal01\\\\\/Informational%20pictures\\\\\/[a-zA-Z0-9_-\\\\]+\\.\\w{3}' | \\\r\nuniq -d | \\\r\nxargs -n1 -IIMAGE wget http:\/\/IMAGE<\/pre>\n<p>The only thing we&#8217;ve done is prepended &#8216;http:\/\/&#8217; to every URL for wget. This command matches the line above and will download every image from that page! The final gzipped tarball for those images ended up being 327MB.<\/p>\n<p><span style=\"text-decoration: underline;\">Bonus Round:<\/span><\/p>\n<p>Wget provides a batch mode that allows the user to provide input via a text file or STDIN for processing files. With some added manipulation with <strong>sed<\/strong>, we can feed the URLS to wget directly and save the cost of spawning a process per URL with xargs:<\/p>\n<pre>curl -s 'http:\/\/s966.photobucket.com\/albums\/ae143\/sentenal01\/Informational%20pictures\/\\?start\\=al' | \\\r\negrep -o \\\r\n'i966.photobucket.com\\\\\/albums\\\\\/ae143\\\\\/sentenal01\\\\\/Informational%20pictures\\\\\/[a-zA-Z0-9_-\\\\]+\\.\\w{3}' | \\\r\nuniq -d |\\\r\nsed 's\/^\/http:\\\/\\\/\/g;s\/\\\\\/\/g' | wget -i -<\/pre>\n<p>You&#8217;ll see a significant improvement in terms of resources used by the system. I used wget&#8217;s <strong>&#8211;spider<\/strong>\u00a0flag to check the existence of the URLs and compared the two against a file of the URLs (to minimize differences caused by network latency on Photobucket&#8217;s side, other processes slowing things down, etc):<\/p>\n<h4>xargs:\u00a0<strong>1.41s user 6.74s system 51% cpu 15.694 total<\/strong><\/h4>\n<h4>wget:\u00a0<strong>0.13s user 0.10s system 5% cpu 4.113 total<\/strong><\/h4>\n<p>As you can see, there&#8217;s a significant overhead with using xargs to spawn another wget process for each URL. Try to use the existing functionality of your tools before adding one more command to the pipeline.<\/p>\n<p><em><a title=\"Stan Schwertly\" href=\"http:\/\/www.deadcodersociety.org\/blog\/about\/members\/stan-schwertly\/\" target=\"_blank\">Stan Schwertly<\/a>\u00a0is the Founder of the Dead Coder Society and maintains his personal blog at\u00a0<a title=\"Stan Schwertly\" href=\"http:\/\/www.schwertly.com\" target=\"_blank\">schwertly.com<\/a><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Downloading 1202 images\/entire photobucket album using one line with Linux<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_links_to":"","_links_to_target":""},"categories":[4],"tags":[7,5],"aioseo_notices":[],"_links":{"self":[{"href":"http:\/\/www.deadcodersociety.org\/blog\/wp-json\/wp\/v2\/posts\/299"}],"collection":[{"href":"http:\/\/www.deadcodersociety.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.deadcodersociety.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.deadcodersociety.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.deadcodersociety.org\/blog\/wp-json\/wp\/v2\/comments?post=299"}],"version-history":[{"count":31,"href":"http:\/\/www.deadcodersociety.org\/blog\/wp-json\/wp\/v2\/posts\/299\/revisions"}],"predecessor-version":[{"id":328,"href":"http:\/\/www.deadcodersociety.org\/blog\/wp-json\/wp\/v2\/posts\/299\/revisions\/328"}],"wp:attachment":[{"href":"http:\/\/www.deadcodersociety.org\/blog\/wp-json\/wp\/v2\/media?parent=299"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.deadcodersociety.org\/blog\/wp-json\/wp\/v2\/categories?post=299"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.deadcodersociety.org\/blog\/wp-json\/wp\/v2\/tags?post=299"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}