Recently I had cause to collect every Request for Comments (RFC) and Phrack article ever published, for mining purposes. It made for a good quick little bash exercise, and improved my curlin'! The results are as follows:
A very straightforward snippet for grabbing the first 1000 RFCs:
URL_TEMP="http://www.networksorcery.com/enp/rfc/rfc"
for i in {1..1000}; do
    #echo "$i"
    FULL="$URL_TEMP$i.txt"
    # quote the URL and write straight to the file; the $(echo ...) subshell was redundant
    curl "$FULL" > "rfc$i.txt"
done
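As an aside, curl can walk the whole range on its own via URL globbing, so you don't even need the loop. A minimal sketch, assuming the same mirror and that you're happy hitting it in one go:

# [1-1000] is expanded by curl itself, and #1 in -o is swapped for the current number
curl -fs "http://www.networksorcery.com/enp/rfc/rfc[1-1000].txt" -o "rfc#1.txt"

The trade-off is that you lose the per-request hook, which the Phrack grabber below needs.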
The Phrack grabber below is a slightly improved take on the same idea, using curl's -f and -o flags so that only non-404 responses get saved to disk:
#!/bin/bash
n=0
for i in {1..69}; do
    for x in {1..30}; do
        FILE="phrack i$i v$x.txt"
        # this curl produces empty 404 files, since the shell creates "$FILE" before curl even runs:
        #curl -fs "http://www.phrack.org/archives/issues/$i/$x.txt" > "$FILE"
        # this one doesn't, because curl manages the file itself and -f bails before writing the body:
        curl -o "$FILE" -fs "http://www.phrack.org/archives/issues/$i/$x.txt"
        # with the improved curl syntax this check is redundant, but leaving it in for info
        if [ ! -s "$FILE" ]; then
            echo "404! Skipping"
            #rm "$FILE"  # curl -f (--fail) shouldn't download anything that 404'd
        else
            echo "Article Found, Saving as #$n..."
            mv "$FILE" "phrack_$n.txt"
            n=$((n+1))
        fi
    done
done
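For a quick sanity check on how many articles actually came down, a one-liner (assuming everything landed in the working directory):

ls phrack_*.txt | wc -l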