Rolling your own CAPTCHA solution

As an AI chatbot provider, we need some serious bot protection to prevent malicious users from creating bots that attack us. Google reCAPTCHA is of course “the industry standard” here, which I’m sure most readers are aware of.

It’s no big secret that Google’s code sux – Big time! This becomes a problem for your website as you include their junkware JavaScript on your site. Because on the one hand, Google will punish you if your site is slow. On the other hand, they release garbage JavaScript libraries, increasing your page load time by 10x – And some of these JavaScript libraries have no known real alternatives you can use. Google’s reCAPTCHA is one of those junkware projects that seems to download half the internet if you include it on your page.

At the same time, you do need some sort of bot protection, and most of the open source alternatives I’ve seen depend upon users physically proving they’re human by solving some image test, such as “find all squares with ducks”, etc.

The one beautiful thing Google’s reCAPTCHA has going for it is that it’s 100% invisible. It doesn’t bother your users, but instead relies upon DOM events, such as the user clicking or typing on his or her keyboard, to prove it’s a human being behind the machine – And not some malicious automated bot.

After pondering the problem domain for a couple of hours, I realised I could create a 100% stateless solution, that doesn’t rely on DOM events, and more importantly, doesn’t download half the freakin’ internet to “initialise”.

Proof of Work

Proof of Work is a big word in crypto. Bitcoin for instance is based upon the idea, and basically it implies that it is so CPU intensive to perform some type of job, that doing the job for malicious reasons becomes difficult. Some mail servers also depend upon the idea, so it’s not as if Bitcoin invented it or anything.

Proof of Work can also be utilised in CAPTCHA solutions, because it makes each request so CPU intensive that abuse becomes impractical, and most malicious users would probably choose to go somewhere else.

The basic idea is that in order to invoke some HTTP endpoint, you will need to run your CPU at maximum level for 0.3 seconds to solve some mathematical problem.
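To get a feel for what “expensive on purpose” looks like, here is a minimal Python sketch. It assumes PBKDF2 from the standard library as a stand-in for the slow hash; the function name, the salt and the iteration count are made up for illustration, and you would tune the iteration count until one call costs roughly what you want on a typical client.

```python
import hashlib
import time


def slow_hash(message: str, iterations: int = 300_000) -> str:
    """Deliberately expensive hash. PBKDF2 is only a stand-in here."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", message.encode(), b"demo-salt", iterations
    )
    return digest.hex()


# Time a single call; raise or lower `iterations` to hit the cost you want.
start = time.perf_counter()
slow_hash("hello")
print(f"one hash took {time.perf_counter() - start:.3f} s")
```

One call is barely noticeable for a human submitting a form, but the cost adds up quickly for a script that wants to submit the same form over and over.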

A couple of years ago, after I had created a website, I woke up one morning to my phone going berserk. Some script kid had created a script to send us emails through our contact form. Before I could turn the thing off, I had 46,000 emails in my inbox.

With the solution I am illustrating in the following video, those 46,000 emails would probably be reduced to 500, and the guy’s laptop would probably catch fire after that.

The solution doesn’t completely eliminate bots, but it makes it so difficult to create a bot, that the idea is that most would probably choose to attack somebody else instead. Besides, Google reCAPTCHA doesn’t completely eliminate bots either, especially not their version 3, which is the nice one that you can use without bothering your users.

Implementation

The first thing I do in my solution, is that I download a dynamically generated JavaScript file. This file contains a timestamp, in addition to a public key, that’s generated based upon the server timestamp, a couple of HTTP headers, the client’s IP address, and a secret.
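A rough sketch of how such a key could be derived on the server, in Python. Everything specific here is an assumption: the text only says “a couple of HTTP headers”, so the two headers, the HMAC-SHA256 construction, the secret and the function name are all placeholders.

```python
import hashlib
import hmac
import time

SECRET = b"change-me"  # hypothetical server-side secret, never sent to clients


def make_public_key(server_ts: int, client_ip: str, headers: dict) -> str:
    """Bind a key to the server timestamp, the caller's IP and a few headers."""
    # Which headers to include is an assumption made for this sketch.
    parts = [
        str(server_ts),
        client_ip,
        headers.get("User-Agent", ""),
        headers.get("Accept-Language", ""),
    ]
    message = "\n".join(parts).encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


server_ts = int(time.time())
key = make_public_key(server_ts, "203.0.113.7", {"User-Agent": "demo"})
print(server_ts, key)
```

Because the key is a pure function of the request and the secret, the server can recompute it later from the incoming request instead of storing it, which is what keeps the whole thing stateless.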

Then before I create an HTTP invocation towards a protected endpoint, I find the Unix timestamp on the client. I concatenate the client’s Unix timestamp with the server Unix timestamp for when the public key was created, and then create a Blowfish hash of this concatenated string.

This Blowfish hash is then passed to the endpoint, together with the Unix timestamps for both the client and the server. The endpoint can then rapidly verify the hash, since it has access to both the client timestamp and the server timestamp for when the key was created.
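The round trip could look something like the following Python sketch. It is not the real implementation: PBKDF2 from the standard library stands in for the Blowfish hash, the function names and the salt are invented, and the server here simply recomputes the hash, so in this stand-in verification costs as much as creation.

```python
import hashlib
import hmac


def make_token(client_ts: int, server_ts: int, iterations: int = 300_000) -> str:
    """Client side: slow-hash the two timestamps concatenated as strings."""
    message = f"{client_ts}{server_ts}".encode()
    # PBKDF2 is a stand-in for the slow hash; the salt is a placeholder.
    return hashlib.pbkdf2_hmac("sha256", message, b"demo-salt", iterations).hex()


def verify_token(token: str, client_ts: int, server_ts: int,
                 iterations: int = 300_000) -> bool:
    """Server side: recompute from the submitted timestamps and compare."""
    expected = make_token(client_ts, server_ts, iterations)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, token)


token = make_token(1_700_000_005, 1_700_000_000)
print(verify_token(token, 1_700_000_005, 1_700_000_000))
```

Changing either timestamp after the fact invalidates the token, so a client cannot pay for one hash and then reuse it with a fresher timestamp.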

If the verification fails, it just throws an exception, preventing the endpoint’s code from executing, resulting in an HTTP error being returned to the caller.

The rules for a valid token are that it must have been created by the client less than 5 seconds ago (sorry, I said 10 seconds in the video), and that the server timestamp it is based upon must be less than 20 minutes old. The JavaScript file also includes logic to invoke the server every 10 minutes, to fetch a new token that it uses for subsequent requests.
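Those two time windows boil down to a small check like the Python sketch below. It assumes both timestamps are in seconds and that the client and server clocks roughly agree; rejecting timestamps from the future is an extra assumption of mine, not something stated above, and the names are illustrative.

```python
import time
from typing import Optional

CLIENT_MAX_AGE = 5        # seconds since the client created the token
SERVER_MAX_AGE = 20 * 60  # seconds since the server issued its timestamp


def is_fresh(client_ts: int, server_ts: int, now: Optional[float] = None) -> bool:
    """Return True only if both timestamps fall inside their windows."""
    now = time.time() if now is None else now
    client_age = now - client_ts
    server_age = now - server_ts
    # Negative ages (timestamps from the future) are rejected as well.
    return 0 <= client_age < CLIENT_MAX_AGE and 0 <= server_age < SERVER_MAX_AGE


now = int(time.time())
print(is_fresh(now - 2, now - 60))  # recent on both counts
print(is_fresh(now - 30, now - 60)) # client timestamp too old
```

With a 20 minute lifetime and a refresh every 10 minutes, a page that stays open always holds a server timestamp that is still valid.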

The whole point about the solution is that it’s slooooooooooow. Implying not even a super computer can generate thousands of these without spending so much CPU that it starts becoming a real cost to malicious clients.

A malicious client might be able to generate hundreds of these in an hour, but it will not be able to generate thousands per hour – And definitely not hundreds of thousands. Besides, it also implies that the client needs the ability to execute code. And because it’s a novelty, any malicious hacker will first have to reverse engineer it simply to understand how to create valid tokens.

Besides, simply your electricity bill for generating thousands of these becomes non-negligible …

Conclusion

For a high risk website, you should probably have something better. But for a low risk website, the above would probably be good enough, and completely eliminate 99% of all bots capable of actually faking form post invocations.

The solution can be further improved upon, by for instance storing each used token for 20 minutes or something, and then making sure the same token is never used twice, which would eliminate the need for the client’s Unix timestamp. However, for some few hours of coding, I think it’s pretty awesome.
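A minimal Python sketch of that improvement, assuming a single server process and an in-memory dictionary; the class and method names are invented, and a real deployment with several servers would need a shared store instead.

```python
import time
from typing import Dict, Optional


class UsedTokens:
    """Remembers each token for `ttl` seconds and refuses to accept it twice."""

    def __init__(self, ttl: float = 20 * 60) -> None:
        self.ttl = ttl
        self._expiry: Dict[str, float] = {}

    def claim(self, token: str, now: Optional[float] = None) -> bool:
        """Return True the first time a token is seen, False on a replay."""
        now = time.time() if now is None else now
        # Forget tokens whose 20 minutes are up so the dictionary stays small.
        self._expiry = {t: exp for t, exp in self._expiry.items() if exp > now}
        if token in self._expiry:
            return False
        self._expiry[token] = now + self.ttl
        return True


store = UsedTokens()
print(store.claim("abc"))  # first use is accepted
print(store.claim("abc"))  # replay is rejected
```

The trade-off is that the server now holds state, so the solution is no longer 100% stateless.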

I’ll probably end up improving upon it in the future, because there’s absolutely no way I can continue using Google’s reCAPTCHA – It’s simply too much of a junkware solution to be able to include on your website without basically destroying it from a usability perspective. And the argument basically proves itself once you compare the following.

Google reCAPTCHA: 1.7MB of download
My DIY solution: 50KB of download

Google’s reCAPTCHA junkware also has blocking code and document.write invocations, and basically violates every single best practice that exists in regards to web development.

Yeah, I know it sounds a bit weird, but Google basically does not know how to create software – At least not software for the web …

If you’ve got some great ideas about additional improvements for my solution, I would love to hear your comment about it. Maybe if this article goes viral, I’ll consider turning it into Open Sauce too … 😊
