users@jersey.java.net

[Jersey] Re: default regex for path variables

From: Waclaw Kusnierczyk <waclaw.kusnierczyk_at_gmail.com>
Date: Thu, 29 Jan 2015 00:17:19 +0100

Marek,

Thanks for the explanation. Applied globally (as in Perl with the g
modifier), [^/]+? can, with a dose of imprecision, be said to effectively
match the whole string. In fact, though, it does not match the string; it
matches each of the string's characters one by one. So it matches as many
times as the string is long, each time just one character.

Consider the example given in the doc you refer to:

>>

Enter your regex: .*?foo // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

The second example [the one quoted above], however, is reluctant, so it
starts by first consuming "nothing". Because "foo" doesn't appear at the
beginning of the string, it's forced to swallow the first letter (an "x"),
which triggers the first match at 0 and 4. Our test harness continues the
process until the input string is exhausted. It finds another match at 4
and 13.

<<

Very clearly, the pattern .*?foo matches two separate substrings. It never
reports matching xfooxxxxxxfoo, nor does the description claim that it
does. It _exhausts_ the string, true -- in a loop, it finds multiple
successive non-overlapping matches that concatenate to the whole string.
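
To make this concrete, here is a minimal sketch that reproduces the
tutorial's find loop with java.util.regex. The class and method names
(ReluctantFindDemo, findAll) are mine, chosen for illustration; only
Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantFindDemo {
    // Hypothetical helper: collects each successive non-overlapping match
    // reported by Matcher.find(), the way the tutorial's test harness does.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Reluctant: two separate matches, never the whole input at once.
        System.out.println(findAll(".*?foo", "xfooxxxxxxfoo")); // [xfoo, xxxxxxfoo]
        // Greedy, for comparison: a single match spanning the whole input.
        System.out.println(findAll(".*foo", "xfooxxxxxxfoo"));  // [xfooxxxxxxfoo]
    }
}
```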

With [^/]+? it so happens that it will match the same characters in a path
fragment as [^/]+. However, the latter matches just one string (the whole
fragment), whereas the former matches each of the fragment's characters
individually but not the whole fragment (except in degenerate cases, such
as a one-character fragment).
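
The same difference can be sketched for a single path fragment. The
fragment "users" and the names SegmentFindDemo and findAll are made up
for the example; this shows unanchored find() behaviour only, not how
Jersey itself applies the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SegmentFindDemo {
    // Hypothetical helper: every successive match found by Matcher.find().
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Reluctant: one match per character of the fragment.
        System.out.println(findAll("[^/]+?", "users")); // [u, s, e, r, s]
        // Greedy: one match, the whole fragment.
        System.out.println(findAll("[^/]+", "users"));  // [users]
    }
}
```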

Note that the situation is different if the regex is terminated with a
slash. Then [^/]+?/ will gradually extend the string consumed until it
finds a slash, while [^/]+/ will consume the whole string (as per the doc
you cite) and then backtrack. Both end up matching the same string, but
they arrive at it in different ways.

The original regex in the doc I referred to does not have a trailing
slash. I still believe the explanation given there is not appropriate. I
can see that the sources do use [^/]+?, but this pattern must then be used
in a loop to match the characters individually. It still will not match
the whole string in the usual sense. Once you use global matching (in a
loop), you can just use [^/] with the same effect: it will successively
consume all characters one by one until the first slash.
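
Here is a rough sketch of what I mean by consuming in a loop until the
first slash. The helper consumeFromStart and the input "users/42" are my
own; it is one possible way to model such a loop with Matcher.region()
and Matcher.lookingAt(), and I am not claiming this is how the Jersey
sources do it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConsumeLoopDemo {
    // Hypothetical helper: repeatedly matches the pattern at the current
    // position and stops at the first position where it no longer matches.
    static List<String> consumeFromStart(String regex, String input) {
        List<String> pieces = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            // lookingAt() anchors at the region start but need not reach its end.
            // Stop on failure, or on an empty match that would never advance.
            if (!m.lookingAt() || m.end() == pos) {
                break;
            }
            pieces.add(m.group());
            pos = m.end();
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Both patterns consume one character per step and stop at the slash.
        System.out.println(consumeFromStart("[^/]+?", "users/42")); // [u, s, e, r, s]
        System.out.println(consumeFromStart("[^/]", "users/42"));   // [u, s, e, r, s]
    }
}
```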

Let me know if this seems wrong to you.

Best,
Wacek