users@jersey.java.net

[Jersey] Re: Preventing API abuse, throttling connections

From: Tatu Saloranta <tsaloranta_at_gmail.com>
Date: Thu, 3 Mar 2011 08:57:14 -0800

On Thu, Mar 3, 2011 at 2:55 AM, Tauren Mills <tauren_at_groovee.com> wrote:
> I have two related questions.
> First, are there any Jersey features for throttling connections from users?
> Services such as Twitter only allow a certain number of requests per time
> period. How would I do something similar? Is this something that is better
> handled by a proxy or other service?

As mentioned, a filter works well; or, if you do not have too many end
points, just explicitly call a check method before proceeding with
processing, and return 503 (Service Unavailable) to the caller with a
precomposed message.
This is typically much faster than doing the real processing; and
clients must already deal with transient failures, so it is a good way
to signal overload.
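A framework-independent sketch of that "check first, answer 503" pattern (all names and limits here are illustrative, not Jersey API): a shared per-client counter consulted before processing, where a false return means "send the precomposed 503 now". In Jersey the check would typically live in a ContainerRequestFilter or at the top of each resource method.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Fixed-window request throttle, one counter per client id.
// MAX_PER_WINDOW and the window length are assumed values.
public class RequestThrottle {
    private static final int MAX_PER_WINDOW = 60;   // assumed: 60 requests/minute
    private static final long WINDOW_MS = 60_000L;

    private final Map<String, Window> counts = new ConcurrentHashMap<>();

    private static final class Window {
        final long startedAt;
        final AtomicInteger count = new AtomicInteger();
        Window(long now) { startedAt = now; }
    }

    /** Returns true if the client may proceed; false means "answer 503 now". */
    public boolean allow(String clientId, long nowMs) {
        Window w = counts.compute(clientId, (id, old) ->
            (old == null || nowMs - old.startedAt >= WINDOW_MS)
                ? new Window(nowMs) : old);
        return w.count.incrementAndGet() <= MAX_PER_WINDOW;
    }
}
```

The clock is passed in explicitly only to keep the sketch testable; in a filter you would pass System.currentTimeMillis().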

To help with this, you always need some kind of client id to be able
to discriminate. An IP address is not a very good identifier, but if
that's all you have, you can work with it. Spammers usually use ranges
of IP addresses, however, so it is often not very effective.

But as opposed to counting the number of requests per time unit, I
find it more useful to limit the number of concurrent requests during
overload. The number of requests served is by itself pretty much
irrelevant unless there are clear per-request costs; the main reason
to count requests is really for usage fees.

What usually kills a Java service is concurrency: having hundreds of
in-progress requests, all waiting for something (like DB query
results). Put another way, whenever there is overload, the amount of
concurrency invariably grows, so it is a pretty good signal for
detecting overload.
So better throttling can be achieved by load shedding based on the
amount of concurrency; and ideally with differing quality of service --
clients that already have many open requests should have a higher
chance of getting further requests denied. A simple model is one where
above N1 concurrent requests there is a statistical chance of
rejection; and above N2 (a higher limit), 100% failure. This can be
applied both per client and globally.
Using multiple limits is useful in order to try to keep a higher
quality of service for more responsible clients.
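A minimal sketch of that two-limit scheme (class name and the N1/N2 values are mine, not from any library): below N1 in-flight requests everything is admitted, at or above N2 everything is rejected, and in between the rejection probability rises linearly. The same class can be instantiated per client as well as once globally.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Concurrency-based load shedder with a soft limit (N1) and hard limit (N2).
public class ConcurrencyShedder {
    private final int n1;   // soft limit: start shedding statistically
    private final int n2;   // hard limit: shed everything
    private final AtomicInteger inFlight = new AtomicInteger();

    public ConcurrencyShedder(int n1, int n2) { this.n1 = n1; this.n2 = n2; }

    /** Returns true if the request is admitted; caller must later call exit(). */
    public boolean tryEnter() {
        int current = inFlight.incrementAndGet();
        boolean admit;
        if (current <= n1) {
            admit = true;
        } else if (current >= n2) {
            admit = false;
        } else {
            // rejection probability grows from 0 at N1 to 1 at N2
            double rejectP = (current - n1) / (double) (n2 - n1);
            admit = ThreadLocalRandom.current().nextDouble() >= rejectP;
        }
        if (!admit) inFlight.decrementAndGet();
        return admit;
    }

    public void exit() { inFlight.decrementAndGet(); }

    public int inFlight() { return inFlight.get(); }
}
```

A per-client variant would simply keep one instance per client id (with lower limits) in front of the global one.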

> The thing is, I'd like to have varying service levels based upon the "plan"
> the user is paying for. So whatever is doing the throttling would need to be
> able to access business logic.

You could have differing levels based on what I mention above. And you
definitely should cache the limit values (but not indefinitely, to
keep updates simple).
For a client's first requests you could just assume one of the levels
before the actual information is available -- most likely the highest
limits. The real limits can then be fetched asynchronously from
another thread (queue the lookup requests).
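One way that "assume the highest limits until the real ones arrive" idea could look (all names illustrative; the lookup function stands in for whatever business-logic call fetches the client's plan): the request path never blocks, a background thread fills the cache, and invalidation keeps entries from living forever.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Non-blocking cache of per-client limits with an optimistic default.
public class PlanLimitCache {
    private static final int DEFAULT_LIMIT = Integer.MAX_VALUE; // assume best plan first

    private final Map<String, Integer> limits = new ConcurrentHashMap<>();
    private final ExecutorService loader = Executors.newSingleThreadExecutor();
    private final Function<String, Integer> lookup;  // the slow business-logic call

    public PlanLimitCache(Function<String, Integer> lookup) { this.lookup = lookup; }

    /** Returns the cached limit, scheduling a background load on a miss. */
    public int limitFor(String clientId) {
        Integer known = limits.get(clientId);
        if (known != null) return known;
        loader.submit(() -> limits.put(clientId, lookup.apply(clientId)));
        return DEFAULT_LIMIT;
    }

    /** Drop a cached entry so plan changes get picked up. */
    public void invalidate(String clientId) { limits.remove(clientId); }

    public void shutdown() { loader.shutdown(); }
}
```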

> Secondly, I have a specific use case that needs even more strict API abuse
> prevention measures. I'm sure doing this probably isn't a best practice, but
> my API allows new user accounts to be created. The web UI is a single-page
> javascript app, and there are mobile, iPhone and Android apps, all of which
> need to be able to create accounts. Third-party apps will be able to provide
> UI for users to create accounts as well. So using a RESTful API call is
> quite convenient.
> But are there any good ways to keep bots from automatically signing up lots
> of accounts? Perhaps any given IP can only sign up one account per minute.

That will only help for about half a day, until abusers figure it out
and simply start spacing out their creation requests.
I'm not sure I would bother. This is based on experience with
sign-up-then-spam patterns at one fairly big hosting company.

> But what about all those AOL users, or corporate users behind firewalls?
> As a pure API solution, there would not be a CAPTCHA involved. But perhaps
> there could be. What if the client was required to make an initial request

If you do have issues, yes, there probably should be some system that
requires proof that the request is not made by a bot.

...
> But I hate CAPTCHAs, especially in my mobile apps. And it seems like there
> should be a better solution. Any ideas?

Ok, here goes....

<rant>

Existing CAPTCHAs are, in my opinion, a very naive and silly idea.
They rely on one of the few things that LOTS of work has gone into
automating: image recognition, OCR. About the only good thing about
CAPTCHAs is that question/answer pair generation and verification are
automatable. But in my experience they are hard enough for legitimate
users to get right that they are annoying; potentially crackable by
code; and, worst of all, just as easy for any other human being to
solve.
That last point leads to automated systems where cheap laborers from
third-world countries solve them in batches, via redirection; so one
can buy correct answers in bulk using a semi-automatic system.

A better solution would be a combination of:

(a) multiple kinds of tasks -- for visual recognition, don't just use
letters; use geometric objects with different relative sizes, colors,
and shapes, and ask about relationships between the objects.
(b) domain-specific knowledge -- on the Jersey list, for example, one
could ask questions about the JAX-RS API, which anyone with a genuine
interest can answer efficiently. Combined with (a), you can generate
questions and answers about images from within the domain (movie or
book characters for discussion groups).
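As a toy illustration of (a), question/answer pairs with a known answer can be auto-generated from simple geometric "scenes". Everything here is hypothetical; a real system would render the objects as an image rather than describe them in text, but the generation/verification split stays the same.

```java
import java.util.Random;

// Generates a challenge about two distinct colored shapes of different sizes.
public class ShapeChallenge {
    private static final String[] COLORS = { "red", "green", "blue" };
    private static final String[] SHAPES = { "circle", "square", "triangle" };

    public final String question;
    public final String answer;

    private ShapeChallenge(String q, String a) { question = q; answer = a; }

    public static ShapeChallenge generate(Random rnd) {
        String c1 = COLORS[rnd.nextInt(COLORS.length)];
        String s1 = SHAPES[rnd.nextInt(SHAPES.length)];
        String c2, s2;
        do {  // make sure the second object is distinguishable from the first
            c2 = COLORS[rnd.nextInt(COLORS.length)];
            s2 = SHAPES[rnd.nextInt(SHAPES.length)];
        } while (c1.equals(c2) && s1.equals(s2));
        int size1 = 1 + rnd.nextInt(9);
        int size2 = 1 + rnd.nextInt(9);
        while (size2 == size1) size2 = 1 + rnd.nextInt(9);
        String bigger = size1 > size2 ? (c1 + " " + s1) : (c2 + " " + s2);
        String q = "Which is larger: the " + c1 + " " + s1
                 + " (size " + size1 + ") or the " + c2 + " " + s2
                 + " (size " + size2 + ")?";
        return new ShapeChallenge(q, bigger);
    }

    /** Verification is trivial because the generator knows the answer. */
    public boolean check(String candidate) {
        return answer.equalsIgnoreCase(candidate.trim());
    }
}
```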

The real benefit is simply that spammers must automate a solution for
each specific domain, and cannot reuse existing solutions. Whereas
with standard CAPTCHAs, solve it once and you have solved it
everywhere, or at least within a very large group of systems.

</rant>

Too bad this might not work too well for your case. But I would
seriously suggest thinking about it: if you can auto-generate q/a
pairs, a custom solution is actually likely to perform better than an
off-the-shelf one, simply because abusers may decide that automating a
solution is too expensive.

For a use case I had, this actually works quite well: for discussion
groups, the group owner must either create N questions to answer, or
manually approve all members (or both). It turns out this alone
eliminates spam for the majority of groups (the smallest ones), even
though there is no automatic generation of varying questions, just
static choices. Apparently even this is enough additional work (on the
group I admin, I just ask which data format the Jackson library
handles :-) ).

-+ Tatu +-