users@jersey.java.net

[Jersey] Re: Preventing API abuse, throttling connections

From: Tatu Saloranta <tsaloranta_at_gmail.com>
Date: Fri, 4 Mar 2011 12:07:03 -0800

On Thu, Mar 3, 2011 at 4:35 PM, Tauren Mills <tauren_at_groovee.com> wrote:
> Thanks for the suggestions Tatu and Mark. I really appreciate the very
> thorough response you provided Tatu! I've commented inline below, but have
> one additional question.
> I was expecting someone to say "Don't do user registrations via an API
> call!", but nobody did. Is this an acceptable practice, or should it be
> avoided?

I don't see anything absolutely wrong with it, but maybe others can
point out some serious issues.

Obviously I would not expect users to call APIs directly, but rather via
an app/client of some sort.

...
>> To help with this, you must always have some kind of client id, to be
>> able to discriminate. An IP address is not a very good identifier, but
>> if that's all you have, you can work with it. Spammers usually use
>> ranges of IP addresses, however, so this is often not very effective.
>
> I do want to make my API stateless at some point, but right now it is
> stateful. A session is created, and remember-me cookies are accepted. So I
> could use the session ID for now. But eventually, when I transition to a
> fully stateless API, an IP number is really the only client ID available,
> right? And if so, then how do you deal with AOL users or big corporations
> where everyone comes from the same IP?

No, I mean a real client id, created with some kind of registration, in
which case the IP number would be meaningless.
For internal systems these ids can be chosen by the client, but for
external-facing systems the id should tie back to some server-side state.
The hierarchic case is more difficult; there you might have a sort of
group id (all AOL users) plus an individual id. That would allow more
fine-grained control, but also make things more complicated.
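
To make the two-level idea concrete, here is a rough sketch. Everything
in it (class names, limits, the idea of counting per window) is made up
for illustration and is not part of Jersey or any library; it just shows
requests being counted against both an individual id and its group id,
with either bucket being full causing rejection:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical two-level limiter: each request is counted against both the
// individual client id and its group id (e.g. all AOL users share a group).
// Counts are for a single time window; resetting per window is omitted.
public class TwoLevelLimiter {
    static final int PER_CLIENT_LIMIT = 5;   // made-up quota per client
    static final int PER_GROUP_LIMIT = 100;  // made-up quota per group

    private final Map<String, AtomicInteger> clientCounts = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> groupCounts = new ConcurrentHashMap<>();

    // Returns false if either the client's or the group's quota is exhausted.
    public boolean allow(String groupId, String clientId) {
        int c = clientCounts.computeIfAbsent(clientId, k -> new AtomicInteger()).incrementAndGet();
        int g = groupCounts.computeIfAbsent(groupId, k -> new AtomicInteger()).incrementAndGet();
        return c <= PER_CLIENT_LIMIT && g <= PER_GROUP_LIMIT;
    }
}
```

The point of the group bucket is that one misbehaving client behind a
shared id (the AOL case) can be cut off individually before the whole
group gets punished.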

>> So, better throttling can be achieved by load shedding based on the
>> amount of concurrency, ideally with differing quality-of-service --
>> clients with more open requests should have a higher chance of getting
>> new requests denied. A simple model is one where above N1 open requests
>> there is a statistical chance of rejection, and above N2 (a higher
>> limit), 100% failure. This can be applied on both a per-client and a
>> global basis. Using multiple limits helps keep a higher quality of
>> service for the more responsible clients.
>
> Good ideas, I'll consider these suggestions. Although, as you mention,
> tracking requests is more for billing purposes, which may be something I
> will need to do eventually. So I wanted to basically reuse that for

Yeah, for billing, the number of calls is the most obvious and intuitive
approach. There are other ways (sampling, for example), but nothing
necessarily superior.

> throttling connections in general as well. But your approach makes more
> sense.

Right, I think they often serve slightly different goals. Throttling can
both protect the service against grave overload (graceful degradation)
and improve QoS (by punishing greedy clients), whereas charging for
usage may optimize for something else entirely.
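
A minimal sketch of the N1/N2 model from above -- the class name and the
limit values are invented for illustration, nothing here is a Jersey API.
Below N1 open requests a client is always accepted; between N1 and N2 the
chance of rejection grows linearly; at or above N2 everything is rejected:

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical load shedder: statistical rejection between a soft limit
// (N1) and a hard limit (N2) on concurrently open requests per client.
public class LoadShedder {
    static final int SOFT_LIMIT = 10;  // N1: start rejecting statistically
    static final int HARD_LIMIT = 20;  // N2: reject everything

    private final Map<String, AtomicInteger> openRequests = new ConcurrentHashMap<>();
    private final Random random = new Random();

    // Call when a request arrives; returns false if it should be rejected.
    public boolean tryAcquire(String clientId) {
        AtomicInteger count = openRequests.computeIfAbsent(clientId, k -> new AtomicInteger());
        int current = count.get();
        if (current >= HARD_LIMIT) {
            return false; // above N2: 100% failure
        }
        if (current >= SOFT_LIMIT) {
            // Between N1 and N2: rejection probability rises from 0 to 1.
            double rejectProbability =
                (current - SOFT_LIMIT) / (double) (HARD_LIMIT - SOFT_LIMIT);
            if (random.nextDouble() < rejectProbability) {
                return false;
            }
        }
        count.incrementAndGet();
        return true;
    }

    // Call when the request completes, successful or not.
    public void release(String clientId) {
        AtomicInteger count = openRequests.get(clientId);
        if (count != null) count.decrementAndGet();
    }
}
```

The same structure can be applied a second time with global counters to
get the per-client plus global combination.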

...
>> That will only help for about half a day, until abusers figure it out
>> and simply start delaying creation requests.
>> I'm not sure I would bother. This is based on experience with
>> sign-up-then-spam schemes at one fairly big hosting company.
>
> That's exactly what I was thinking. It seems pointless to go that route.

Yeah. I was a bit sad to see how quickly that defense was bypassed, but
not surprised.

>> > But I hate CAPTCHAs, especially in my mobile apps. And it seems like there
>> > should be a better solution. Any ideas?
>>
>> Ok, here goes....
>>
>> <rant>
>>
>> Existing CAPTCHAs are pretty much a very very naive and silly idea, in
>> my opinion. They use one of the few things for which LOTS of work has gone
>
> ...snip...
>
>>
>> The real benefit is actually just the fact that one must automate a
>> solution for a specific domain, so spammers cannot reuse existing
>> solutions. Whereas with CAPTCHAs, solve it once and you have solved it
>> everywhere, or at least within a very large group of systems.
>>
>> </rant>
>>
>> Too bad this might not work too well for your case. But I would
>> seriously suggest thinking about this: if you can auto-generate q/a
>> pairs, a custom solution is actually likely to perform better than an
>> off-the-shelf one, simply because abusers might decide that trying to
>> automate a solution is too expensive.
>
> I feel absolutely the same about CAPTCHAs. And your suggestion describes
> exactly my goal -- to make automating a solution so expensive that they
> just look elsewhere. If someone is determined enough, they'll break any
> measures I come up with. I'd rather they feel it just isn't worth doing.

Cool, 100% agreed.
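
As an illustration of the auto-generated q/a mechanism, here is a sketch.
The class, the storage, and especially the trivial arithmetic template
are all made up; the real value would come from templates specific to
your own domain, which is what forces abusers to write custom automation:

```java
import java.util.Map;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical auto-generated challenge: the server builds a question from
// a template, keeps the answer keyed by a one-time id, and checks the
// client's answer later. The answer itself never leaves the server.
public class ChallengeGenerator {
    private final Map<String, String> pendingAnswers = new ConcurrentHashMap<>();
    private final Random random = new Random();

    // Returns {challengeId, question}.
    public String[] newChallenge() {
        int a = 1 + random.nextInt(9);
        int b = 1 + random.nextInt(9);
        String id = UUID.randomUUID().toString();
        pendingAnswers.put(id, String.valueOf(a + b));
        return new String[] { id, "What is " + a + " plus " + b + "?" };
    }

    // Each challenge can be answered exactly once; replays always fail.
    public boolean verify(String challengeId, String answer) {
        String expected = pendingAnswers.remove(challengeId);
        return expected != null && expected.equals(answer.trim());
    }
}
```

Making the challenge single-use is what keeps a solved answer from being
replayed across many registration calls.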

> In fact, I already avoid CAPTCHAs in my typical web interfaces by doing
> custom things similar to your suggestions. However, I don't even require
> that users answer a question. Basically, I have some extra hidden fields in
> the form that must have certain values or lack of values. Of course, it
> won't stop the Mechanical Turk approach, but it seems to do pretty well.
> This approach is described fairly well in this SO answer:
> http://stackoverflow.com/questions/2603363/good-form-security-no-captcha/2603408#2603408

Thanks, I'll read that.

> The problem is I'm trying to figure out how to do something similar with an
> API. It is way too easy to automate an API call with values like { mustExist:
> "must be this string", mustBeEmpty: "" }, so that approach doesn't work.
> Also, maybe I shouldn't be too concerned with this, as user registration
> requires more steps than posting to a forum or blog comment.
> Users have to check their email and click an activation link. But I still
> don't want to fill up my database with a bunch of junk un-activated accounts
> just because someone is calling my API incessantly.

Right. It is still good to think about additional measures, since no
single thing can solve the problem (at least based on what I have
seen). So the death of a thousand cuts can work both ways.

I agree that automation is easier for APIs (for spammers etc.) because
those verification fields are so obvious. And I would be interested in
hearing about any additional tricks or approaches you figure out. If I
had more time, I would love to work on more creative replacements for
standard CAPTCHAs.

-+ Tatu +-