CAPTCHA is perceived as a quick and effective way to stop bots from performing abusive actions on a website. Bots are often deployed to do things like automatically enter spam into email forms or comment forms. They can also be used to submit fraudulent entries in other forms, such as registration forms or voting forms. CAPTCHA works by presenting a challenge to the user (typically an image containing jumbled-up letters) which must be solved to proceed in the interaction flow.
On the surface, CAPTCHA seems perfect because bots only have access to that which is in the document source. Text within images cannot be seen by an internet bot and therefore the bot cannot submit a response to the challenge. This is also why CAPTCHA is an accessibility problem. Requiring vision to solve the CAPTCHA locks out all persons who are blind. Lest we think only persons who are blind are impacted, CAPTCHAs can often also lock out low-vision users and those with dyslexia, particularly when there’s a lot of “noise” in the image.
Some have attempted to create alternate versions of the typical image CAPTCHA, such as the well-known reCAPTCHA, which combines audio with the image. In nearly all cases, some problems with accessibility still remain. For instance, reCAPTCHA is still inaccessible to the 45,000-50,000 Deaf-blind persons in the United States.
CAPTCHA is also not as effective as some may believe. Automated means of beating CAPTCHA have been around since 2003. As CAPTCHA techniques advance, so do the means of beating them. There are even services which will employ humans to beat CAPTCHAs.
Keep this in mind at all times when considering CAPTCHA or any other security approaches on your site: The level of effort expended on abusing a system is directly proportional to the perceived benefit gained by the abuser. This applies to the recommendations I make below as well. CAPTCHA is, in many cases, very effective. Otherwise websites wouldn’t use it. But it does lock real people out of your site and it can be beaten. For those reasons, I’d like to discuss some approaches to thwarting website abuse without CAPTCHA.
CAPTCHA-less Security Approaches
Because all of the code for all of my sites (except this one, ironically) is home grown, I’ve developed my own code to handle security as well. This has its advantages and disadvantages: it took a long time of learning (some of it painful, to be honest) to get my code where it is today. Still, I’m proud to say that using the approaches below, I’ve wholly eliminated all spam and fraudulent registrations on my sites that use this code. Keep in mind, the more attractive a site is for abuse, the more that abusive users will try to find exploits. As I said earlier, in certain scenarios even humans can be employed to simply overcome whatever automated methods you have in place to fight abuse. Here’s what I’ve used with success.
Filter, Validate, Escape
Not directly related to CAPTCHA is the need to filter, validate, and escape all input. This is something every developer should be doing at all times when developing systems which utilize forms. It is also a topic that could take up several postings on security by itself. Instead, I’d like to point you to Chris Shiflett and encourage you to read his articles, blog, and the books he’s written on this topic. I’ll go over some of these topics here and encourage you to check out Chris Shiflett’s work for more details.
Filter all input
“Input filtering is the method by which you validate all incoming data and prevent any invalid data from being used by your application. It’s very similar in theory to how water filtering works, where impurities in water are not allowed to pass.” (Chris Shiflett)
In my approach, I filter all input from superglobals. Any key from a superglobal array that I do not expect is automatically removed. For instance, if I’m only expecting ‘id’ from $_GET then that is the only key that is kept. Furthermore, I strip out any input I consider out of bounds for the type of content expected. For instance, if I’m expecting a number for the value of ‘id’, then all non-numeric characters are stripped. If I’m expecting alphanumerics, then anything that is not a letter or a number is stripped, and so on.
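As a rough illustration of that whitelist idea, here is a minimal sketch in Python (not the PHP code described above). The field names and the `EXPECTED_FIELDS` table are hypothetical; the point is only that unexpected keys are dropped and out-of-bounds characters are stripped.

```python
import re

# Hypothetical whitelist: each expected key maps to a pattern matching
# the characters that are NOT allowed in its value.
EXPECTED_FIELDS = {
    "id": re.compile(r"[^0-9]"),              # numeric only
    "username": re.compile(r"[^A-Za-z0-9]"),  # alphanumeric only
}


def filter_input(raw):
    """Keep only expected keys and strip disallowed characters from their values."""
    cleaned = {}
    for key, disallowed in EXPECTED_FIELDS.items():
        if key in raw:
            cleaned[key] = disallowed.sub("", raw[key])
    return cleaned


# 'debug' is not expected, so it is dropped; everything but digits is stripped from 'id'.
query = {"id": "42; DROP TABLE users", "debug": "1"}
filtered = filter_input(query)
```

Filtering like this only narrows what reaches the application. The filtered values still need to be validated and escaped afterwards.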
Validate strongly to ensure that input adheres to very specific constraints. In the PHP forms class I created for use on all my sites, I have 48 different types of validation ranging from simple string length validation to rather involved regular expressions. The type of validation in use in the final implementation depends on the type of expected input, but everything is validated in some way after input filtering. Even if a field isn’t required, it still gets filtered and validated against various rules meant to prevent abuse. For some of the validation rules, the user is permanently blocked from access as soon as a submission fails.
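To make the idea of per-field rules concrete, here is a small Python sketch (again, not the PHP forms class mentioned above). The two rules, the field names, and the deliberately loose email pattern are all assumptions chosen for illustration.

```python
import re

# Deliberately loose pattern, for illustration only; real email validation is stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_length(value, minimum, maximum):
    """True when the value's length falls inside the allowed range."""
    return minimum <= len(value) <= maximum


def validate_email(value):
    """Length check first, then a pattern check."""
    return validate_length(value, 3, 254) and EMAIL_RE.match(value) is not None


def validate_form(fields, rules):
    """Return the names of the fields that failed their rule."""
    return [name for name, rule in rules.items() if not rule(fields.get(name, ""))]


# Hypothetical rule set for a comment form.
RULES = {
    "email": validate_email,
    "comment": lambda v: validate_length(v, 1, 2000),
}

errors = validate_form({"email": "not-an-email", "comment": "Hello"}, RULES)
```

Returning the list of failed fields lets the caller decide what happens next, whether that is showing an error or treating certain failures as abuse.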
In this process, any input is escaped to prevent SQL injection, XSS, mail header injection, and so on. Upon accepting submission, most of these things are validated against rather strong rules in the first place. Any signs of abuse result in immediate banning of the offender. Still, all submissions are escaped and submissions stored in a database use prepared statements. On the way out, content is escaped as well. This extra step may be seen as redundant, but helps act as an added protection (in this case, protecting the user in case previous steps were inadequate).
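The two halves of that paragraph, parameterized queries on the way in and escaping on the way out, can be sketched in Python as follows. This is an illustration using the standard `sqlite3` and `html` modules with a hypothetical `comments` table, not the code used on my sites.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, body TEXT)")


def save_comment(author, body):
    # Placeholders keep user data out of the SQL text itself.
    conn.execute("INSERT INTO comments (author, body) VALUES (?, ?)", (author, body))
    conn.commit()


def render_comments():
    # Escape on output so stored markup is displayed as text, not executed.
    rows = conn.execute("SELECT author, body FROM comments").fetchall()
    return "\n".join(
        "<p><b>{}</b>: {}</p>".format(html.escape(a), html.escape(b)) for a, b in rows
    )


save_comment("Robert'); DROP TABLE comments;--", "<script>alert(1)</script>")
page = render_comments()
```

The hostile author name is stored as plain data and the table survives, while the script tag comes back out as harmless escaped text.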
A spam honeypot is a field intended to trap spammers by detecting submissions of attempted spam or fraudulent registrations. One of the ways spammers try to exploit a site is the automatic submission of forms. It is relatively trivial to create a bot which will crawl the web, looking for forms and filling them in. Looking for the string “email”, “e-mail” or other variant as the value for the ‘name’ attribute on fields is also trivial and allows spammers a relatively easy way to exploit forms, unless, of course, some other method exists to stop them, which is the entire purpose of CAPTCHA. Bots can vary pretty widely, so no method is perfect, but honeypots do tend to work well against bots that have been designed to fill in all fields.
To implement a honeypot, create an ordinary text field (not an input of type “hidden”) that you will hide visually:
<label for="honeypot">Enter something here if you're a spammer</label><input type="text" id="honeypot" name="honeypot">
Then, use CSS to position the item offscreen. Using this method, you now have an accessible means of tripping up bots.
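On the server side, the check itself is tiny. A minimal Python sketch, assuming the submitted form arrives as a dictionary and the decoy field is named `honeypot` as in the markup above:

```python
def is_honeypot_tripped(form, field="honeypot"):
    """True when the decoy field holds any non-whitespace value."""
    return bool(form.get(field, "").strip())


# A human leaves the decoy empty; a fill-everything bot does not.
human = {"email": "reader@example.com", "honeypot": ""}
bot = {"email": "spam@example.com", "honeypot": "http://spam.example"}
```

A submission that trips the honeypot can then be rejected, logged, or handed to whatever banning logic the site uses.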
As I mentioned in the section on validation, I aggressively ban – immediately – all abusive requests I detect. But banning actually starts even before the time of request using the following approaches.
External Services for Checking Emails and IPs
I’ve created an automated cron script which uses cURL to grab spammy IPs and e-mail addresses from various services. One of my favorites is Stop Forum Spam. I take those items and put them into my own database, because I don’t want to burden them with constant lookups. Those items are then used at the time of the initial request and also during form submission as part of the validation process.
UPDATE: Since I wrote the above, I’ve created my own service, called BotSmasher, which aggregates the data from several services and provides an API you can query to check whether the submitted email address or the user’s IP address has previously been caught submitting spam or abusing a system.
Internal Systems for Checking Emails and IPs
Internally, all email addresses and IPs that have been banned (by any means, such as those described above) are retained in a database table. All IPs are checked at the time of request and immediately rejected if the user’s IP has ever been banned. All submissions of all forms are logged as well. During this process, any time a submission is found to be abusive (see Filter, Validate, Escape, above), the IP address associated with the request is immediately logged in this table. If the form in question had an email field, then the email address is banned as well.
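A ban table like this can be sketched in a few lines of Python with `sqlite3`. The table layout and function names are hypothetical, and a real site would use its own database rather than an in-memory one.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE banned (kind TEXT, value TEXT, PRIMARY KEY (kind, value))")


def ban(kind, value):
    """Record a banned IP or email; duplicates are silently ignored."""
    db.execute(
        "INSERT OR IGNORE INTO banned (kind, value) VALUES (?, ?)",
        (kind, value.lower()),
    )
    db.commit()


def is_banned(kind, value):
    """Check the table at request time or during form validation."""
    row = db.execute(
        "SELECT 1 FROM banned WHERE kind = ? AND value = ?", (kind, value.lower())
    ).fetchone()
    return row is not None


def handle_abusive_submission(ip, email=None):
    """Ban the requesting IP, and the email too if the form had one."""
    ban("ip", ip)
    if email:
        ban("email", email)


handle_abusive_submission("203.0.113.7", "Spammer@Example.com")
```

Lowercasing values before storing and comparing them is a design choice in this sketch so that the same email address in a different case is still recognized.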
Registration requires confirmation
For any of my sites that require membership to certain areas, users must register with a working email address, to which I send a confirmation email containing a link. Users must click that link, which takes them back to the site, in order to confirm their registration and be granted access. This tactic is pretty common on the web and the reason it works is two-fold: first, it stops bots dead because they often enter nonsensical email addresses which go nowhere, and second, even in cases where the fraudulent submissions are run by humans who use a good email address, they aren’t going to waste their time clicking confirmation links. One of my sites has been up for 3 years and not once has a spammer confirmed their registration.
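The confirmation step boils down to issuing an unguessable, single-use token and activating the account only when it comes back. A minimal Python sketch, where the in-memory dictionaries stand in for database tables and the `example.com` URL is a placeholder (sending the email itself is left out):

```python
import secrets

pending = {}       # token -> email; stands in for a database table
confirmed = set()  # emails whose registration has been confirmed


def register(email):
    """Create a single-use token and return the link to put in the email."""
    token = secrets.token_urlsafe(32)
    pending[token] = email
    return "https://example.com/confirm?token=" + token


def confirm(token):
    """Activate the registration if the token is known; each token works once."""
    email = pending.pop(token, None)
    if email is None:
        return False
    confirmed.add(email)
    return True


link = register("reader@example.com")
token = link.split("token=", 1)[1]
```

A production version would also expire unconfirmed tokens after some period. That detail is omitted here.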
One tactic employed by bot developers is to copy the form itself and then have their script send fraudulent submissions directly to the form handler. Doing so is a good way to get past validation: once they know the expected format for each field, they can submit conforming data over and over (doing this with cURL is super simple). To prevent this, I assign a temporary token to every user at the start of their session. The token expires when the session expires, and it is submitted with each form request. This essentially means that the user submitting the form must be on the site, and the token submitted with the form must match the token stored for that session, or the submission will fail.
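Here is a minimal Python sketch of the token check. The session is modeled as a plain dictionary and the field name `token` is an assumption. In a real application the session store and its expiry are handled by the framework or language runtime.

```python
import hmac
import secrets


def start_session():
    """Assign a random token when the session begins."""
    return {"form_token": secrets.token_hex(32)}


def render_hidden_field(session):
    """Embed the session's token in every form served to this user."""
    return '<input type="hidden" name="token" value="{}">'.format(session["form_token"])


def is_valid_submission(session, form):
    """Accept the form only if its token matches the one stored in the session."""
    submitted = form.get("token", "").encode("utf-8")
    expected = session.get("form_token", "").encode("utf-8")
    # compare_digest avoids leaking how many leading characters matched.
    return bool(expected) and hmac.compare_digest(submitted, expected)


session = start_session()
```

A script that posts a copied form from outside the site has no session and therefore no matching token, so its submission fails this check.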
Other CAPTCHA-less techniques
The above are all things that I currently do on my sites. There are a couple of other techniques that I think show promise at thwarting abuse.
A confirmation screen is, in a lot of ways, a challenge-response. Confirmation screens also help you comply with WCAG Success Criterion 3.3.4 (Error Prevention). Simply asking the user to confirm their form submission is a great way to beat bots. However, confirmation screens only make sense in certain situations. Using a confirmation on a login screen would be silly.
What I really love about doing my online banking with Bank of America is that they require SMS verification to perform certain actions. For instance, when I add a new payee to online bill payments, they send an SMS to my phone with a special code. This special code must be entered into their site to confirm the new payee. This feature is incredibly useful on systems for which security is absolutely critical.
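I don’t know how any particular bank implements this, but the general shape of a one-time code check can be sketched in Python. Everything here is an assumption for illustration: the six-digit format, the five-minute lifetime, and the function names. Actually delivering the SMS is out of scope.

```python
import hmac
import secrets
import time

CODE_TTL_SECONDS = 300  # assumed lifetime of a code: five minutes


def issue_code(now=None):
    """Create a random six-digit code to send to the user's phone."""
    now = time.time() if now is None else now
    code = "{:06d}".format(secrets.randbelow(10**6))
    return {"code": code, "expires": now + CODE_TTL_SECONDS, "used": False}


def verify_code(challenge, submitted, now=None):
    """Accept the code once, and only before it expires."""
    now = time.time() if now is None else now
    if challenge["used"] or now > challenge["expires"]:
        return False
    if not hmac.compare_digest(challenge["code"].encode(), submitted.encode()):
        return False
    challenge["used"] = True
    return True
```

A real system would also limit the number of wrong guesses per code, since six digits are easy to brute-force without such a limit.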
Success, Current Challenges, and Weaknesses
Using the methods I’ve discussed above, I’ve not had one successful fraudulent form submission on one of my sites in its two-year existence. The caveat, however, is that none of my sites are high-traffic websites. My most popular website ever has about 300,000 pageviews a month. As I said at the beginning of this post, the level of effort expended on abusing a system is directly proportional to the perceived benefit gained by the abuser. The #1 way to beat everything I outlined above, and to beat CAPTCHA, is to employ a human to do the abuse. Furthermore, each of the items above can be beaten by bots in some way. The reason they have worked so well for me is that each of them adds another layer of protection on top of the others. Overall I think the most effective approaches have been the email confirmations, the use of BotSmasher, and the honeypot.
UPDATE 13-November 2013: In addition to the CAPTCHA-less steps above, something must be done to thwart repeated attempts by bad guys. One thing I’ve noticed is that despite the fact that they’re ultimately unsuccessful, bad guys using bots or other automated means of submitting forms will continue doing so as long as they think the form submission is successful. For example, fraudulent registrations on my site A11yBuzz went through the roof recently. Because the bad guys never confirmed the registrations, there were no successful submissions of spam, but the nearly non-stop submissions of the registration form were effectively a Denial-of-Service attack. The answer to this so far has been to monitor the server logs to determine the IP address(es) responsible for these continuous attempts and to ban those IPs at the firewall level. Eventually I’ll develop something that does this automatically, although I’m sure something like it already exists in toolsets employed by server administrators at large organizations.
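The log-monitoring part of that is easy to automate. As a rough Python sketch, assuming the access log is in Common Log Format and the registration form posts to a hypothetical `/register` path, the following lists IPs that exceed a threshold of POSTs; feeding the result to the firewall is left out.

```python
from collections import Counter


def find_repeat_offenders(log_lines, path="/register", threshold=3):
    """Return the IPs that POSTed to `path` at least `threshold` times."""
    hits = Counter()
    for line in log_lines:
        parts = line.split()
        # Assumes Common Log Format: client IP first, then the quoted
        # request line, e.g. "POST /register HTTP/1.1".
        if len(parts) > 6 and parts[5] == '"POST' and parts[6] == path:
            hits[parts[0]] += 1
    return sorted(ip for ip, count in hits.items() if count >= threshold)


sample = [
    '203.0.113.7 - - [13/Nov/2013:10:00:00 +0000] "POST /register HTTP/1.1" 200 512',
] * 3 + [
    '198.51.100.4 - - [13/Nov/2013:10:00:05 +0000] "GET /register HTTP/1.1" 200 900',
    '198.51.100.9 - - [13/Nov/2013:10:00:09 +0000] "POST /register HTTP/1.1" 200 512',
]
offenders = find_repeat_offenders(sample)
```

The threshold of three is only for the example. A sensible value depends on how often a legitimate visitor might resubmit the form.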
The solutions I’ve discussed above aren’t meant to be the ultimate replacement for CAPTCHA. Instead, I hope I’ve shown that you can and should attempt to apply other sensible security approaches before simply resorting to CAPTCHA. For the vast majority of cases, the methods listed above should suffice. As my final case-in-point, I’d simply like to point to Amazon, which has only one CAPTCHA on the entire site: the form you use to change your password. I would argue that if Amazon has figured out a way to be secure without CAPTCHA, so can you.