[Jersey] Re: Preventing API abuse, throttling connections

From: Tauren Mills <tauren_at_groovee.com>
Date: Thu, 3 Mar 2011 16:35:26 -0800

Thanks for the suggestions Tatu and Mark. I really appreciate the very
thorough response you provided Tatu! I've commented inline below, but have
one additional question.

I was expecting someone to say "Don't do user registrations via an API
call!", but nobody did. Is this an acceptable practice, or should it be
avoided?

> As mentioned, a filter works well; or if you do not have too many end
> points, just explicitly call a method to check before
> proceeding with processing, and then return 504 or 503 to the caller
> with a precomposed message.
> This is typically much faster to do than real processing; and clients
> already must deal with transient failures. This is a good way to
> signal overload.
>
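A minimal sketch of the "check before processing" idea, in plain Java so it stands alone. The class name `OverloadGuard`, the in-flight limit, and the message text are assumptions for illustration; none of this is Jersey API. In a real service the check would probably be called from a servlet or Jersey filter, which would write the precomposed body with the 503 status.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper, not part of Jersey: a cheap admission check. */
final class OverloadGuard {
    static final int STATUS_UNAVAILABLE = 503;

    // Composed once, so rejecting a request does no formatting work.
    static final byte[] BODY =
            "Service temporarily overloaded, please retry later\n"
                    .getBytes(StandardCharsets.UTF_8);

    private final int maxInFlight;
    private final AtomicInteger inFlight = new AtomicInteger();

    OverloadGuard(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    /** Returns true if the request may proceed; the caller must then call release(). */
    boolean tryEnter() {
        if (inFlight.incrementAndGet() > maxInFlight) {
            inFlight.decrementAndGet(); // undo: this request is rejected
            return false;
        }
        return true;
    }

    /** Call in a finally block once an admitted request has finished. */
    void release() {
        inFlight.decrementAndGet();
    }
}
```

A filter would call `tryEnter()` first and, on `false`, respond with `STATUS_UNAVAILABLE` and `BODY` without touching any business logic.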

> To help with this, you always must have some kind of client id, to be
> able to discriminate. IP address is not a very good identifier, but
> if that's all you have, you can work with it. Spammers usually use ranges
> of IP addresses, however, so this is not often very effective.
>

I do want to make my API stateless at some point, but right now it is
stateful. A session is created, and remember-me cookies are accepted. So I
could use the session ID for now. But eventually, when I transition to a
fully stateless API, an IP number is really the only client ID available,
right? And if so, then how do you deal with AOL users or big corporations
where everyone comes from the same IP?

> So, better throttling can be achieved by load shedding based on amount of
> concurrency; and ideally with differing quality-of-service -- clients
> with existing open requests should have higher chance of getting
> requests denied. A simple model is one where over N1 requests there is
> a statistical chance of rejection; and over N2 (higher limit), 100%
> failure. This can be applied both on per-client and global basis.
> Use of multiple limits is useful in order to try to keep higher
> quality of service for more responsible clients.
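
The two-limit model can be sketched roughly as follows. The class name `LoadShedder` is hypothetical, and the linear ramp between N1 and N2 is an assumption: the description above only says the chance of rejection is "statistical" above N1 and 100% above N2.

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: probabilistic rejection between a soft and a hard limit. */
final class LoadShedder {
    private final int softLimit; // N1: at or below this, never reject
    private final int hardLimit; // N2: at or above this, always reject
    private final Random random;
    private final AtomicInteger open = new AtomicInteger();

    LoadShedder(int softLimit, int hardLimit, Random random) {
        if (softLimit < 0 || hardLimit <= softLimit) {
            throw new IllegalArgumentException("need 0 <= softLimit < hardLimit");
        }
        this.softLimit = softLimit;
        this.hardLimit = hardLimit;
        this.random = random;
    }

    /** Chance of rejecting a new request, given how many are already open. */
    double rejectionProbability(int openRequests) {
        if (openRequests <= softLimit) return 0.0;
        if (openRequests >= hardLimit) return 1.0;
        // Assumed shape: linear ramp from 0 at N1 to 1 at N2.
        return (double) (openRequests - softLimit) / (hardLimit - softLimit);
    }

    /** Returns true if the request is admitted; the caller must then call release(). */
    boolean tryAcquire() {
        int before = open.incrementAndGet() - 1; // open requests seen by this one
        if (random.nextDouble() < rejectionProbability(before)) {
            open.decrementAndGet();
            return false;
        }
        return true;
    }

    void release() {
        open.decrementAndGet();
    }
}
```

One instance could be kept per client id and another shared globally, with a request admitted only if both agree.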

Good ideas, I'll consider these suggestions. Although, as you mention,
tracking requests is more for billing purposes, which may be something I
will need to do eventually. So I wanted to basically reuse that for
throttling connections in general as well. But your approach makes more
sense.

> > The thing is, I'd like to have varying service levels based upon the "plan"
> > the user is paying for. So whatever is doing the throttling would need to be
> > able to access business logic.
>
> You could have differing levels based on what I mention above. And you
> definitely should cache values (but not indefinitely, to make updates
> simpler)
> For first requests, you could just assume one of levels, before
> getting actual information -- most likely assume highest limits, until
> information is available. Limits can be fetched asynchronously from
> another thread (queue lookup requests)
>
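A rough sketch of the caching suggestion: limits are cached with an expiry, unknown clients get an assumed default (the highest limit), and the real value is looked up on another thread. `PlanLimitCache`, the loader function, and the TTL are all hypothetical names and parameters; the loader stands in for whatever business logic knows the user's plan.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.ToIntFunction;

/** Hypothetical sketch: per-client limits cached with a TTL and loaded asynchronously. */
final class PlanLimitCache {
    private static final class Entry {
        final int limit;
        final long expiresAtMillis;

        Entry(int limit, long expiresAtMillis) {
            this.limit = limit;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final ToIntFunction<String> loader; // assumed: looks up the plan's limit
    private final Executor executor;            // runs lookups off the request thread
    private final int defaultLimit;             // assumed: the highest plan limit
    private final long ttlMillis;

    PlanLimitCache(ToIntFunction<String> loader, Executor executor,
                   int defaultLimit, long ttlMillis) {
        this.loader = loader;
        this.executor = executor;
        this.defaultLimit = defaultLimit;
        this.ttlMillis = ttlMillis;
    }

    int limitFor(String clientId) {
        return limitFor(clientId, System.currentTimeMillis());
    }

    /** Never blocks: returns a fresh value, else a stale one, else the default. */
    int limitFor(String clientId, long nowMillis) {
        Entry e = cache.get(clientId);
        if (e != null && nowMillis < e.expiresAtMillis) {
            return e.limit;
        }
        // Missing or expired: queue at most one lookup per client.
        if (pending.add(clientId)) {
            executor.execute(() -> {
                try {
                    int limit = loader.applyAsInt(clientId);
                    cache.put(clientId, new Entry(limit, nowMillis + ttlMillis));
                } finally {
                    pending.remove(clientId);
                }
            });
        }
        return e != null ? e.limit : defaultLimit;
    }
}
```

With a single-threaded executor this behaves like the queue of lookup requests mentioned above; the expiry keeps plan changes from being cached indefinitely.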

Nice ideas again. Thanks!

> > Secondly, I have a specific use case that needs even more strict API abuse
> > prevention measures. I'm sure doing this probably isn't a best practice, but
> > my API allows new user accounts to be created. The web UI is a single-page
> > javascript app, and there are mobile, iphone and android apps, all which
> > need to be able to create accounts. Third-party apps will be able to provide
> > UI for users to create accounts as well. So using a RESTful API call is
> > quite convenient.
> > But are there any good ways to keep bots from automatically signing up lots
> > of accounts? Perhaps any given IP can only sign up one account per minute.
>
> That will only help for about half a day, until abusers figure it out
> and properly start delaying creation requests.
> I'm not sure I would bother. This is based on experience with
> sign-up-then-spam modes for one fairly big hosting company.
>

That's exactly what I was thinking. It seems pointless to go that route.

> > But I hate CAPTCHAs, especially in my mobile apps. And it seems like there
> > should be a better solution. Any ideas?
>
> Ok, here goes....
>
> <rant>
>
> Existing CAPTCHAs are pretty much a very very naive and silly idea, in
> my opinion. They use one of few things for which LOTS of work has gone
>

...snip...

> The real benefit is actually just the fact that one must automate
> solution on specific domain, so spammers can not reuse existing
> solutions. Whereas with captchas, you solve it once, you solve it
> everywhere, or at least within very large group of systems.
>
> </rant>
>
> Too bad this might not work too well for your case. But I would
> seriously suggest thinking about this: if you can auto-generate q/a
> pairs, custom solution is actually likely to have better performance
> than off-the-shelf one. Just because abusers might decide trying to
> automate solution is too expensive.
>

I feel absolutely the same about CAPTCHAs. And your suggestions describe
exactly my goal -- to make automating a solution so expensive that they
just look elsewhere. If someone is determined enough, they'll break any sort
of measures I come up with. I'd rather they feel it just isn't worth doing.

In fact, I already avoid CAPTCHAs in my typical web interfaces by doing
custom things similar to your suggestions. However, I don't even require
that users answer a question. Basically, I have some extra hidden fields in
the form that must have certain values or lack of values. Of course, it
won't solve the Mechanical Turk approach, but it seems to do pretty well.
This approach is described fairly well in this SO answer:
http://stackoverflow.com/questions/2603363/good-form-security-no-captcha/2603408#2603408
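
A minimal sketch of that kind of hidden-field check, for illustration only. The class name `HoneypotCheck` is hypothetical, and the field names and expected value are placeholders borrowed from the example values used elsewhere in this message, not the real ones.

```java
import java.util.Map;

/** Hypothetical sketch of a hidden-field ("honeypot") form check. */
final class HoneypotCheck {
    // Placeholder names and value; a real form would use its own.
    static final String MUST_EXIST = "mustExist";
    static final String MUST_BE_EMPTY = "mustBeEmpty";
    static final String EXPECTED = "must be this string";

    /** True if the hidden fields look like they came from the real form. */
    static boolean looksHuman(Map<String, String> form) {
        String token = form.get(MUST_EXIST);
        String trap = form.get(MUST_BE_EMPTY);
        // The token must match, and the trap field must be left blank:
        // naive bots tend to fill in every field they find.
        return EXPECTED.equals(token) && (trap == null || trap.isEmpty());
    }
}
```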

The problem is I'm trying to figure out how to do something similar with an
API. It is way too easy to automate an API call with values like { mustExist:
"must be this string", mustBeEmpty: "" }, so that approach doesn't work.

Also, maybe I shouldn't be too concerned with this, as user registrations
require more steps than posting to a forum or blog comment.
Users have to check their email and click an activation link. But I still
don't want to fill up my database with a bunch of junk un-activated accounts
just because someone is calling my API incessantly.

Thanks again for the thoughts.

Tauren

> For use case I had, this actually works quite well: for discussion
> groups, group owner must generate N questions to answer, or manually
> approve all members (or both). Turns out this alone eliminates spam
> for majority of groups (smallest ones). This even though there isn't
> automatic generation of differing questions, just static choices. Even
> this is enough additional work (on group I admin, I just ask which
> data format does Jackson library handle. :-) ) apparently.
>
> -+ Tatu +-
>