[Jersey] Re: default regex for path variables

From: Waclaw Kusnierczyk <waclaw.kusnierczyk_at_gmail.com>
Date: Thu, 29 Jan 2015 01:30:06 +0100

Marek,

I just checked the behaviour of Pattern.matches(String, CharSequence). The
javadoc is rather imprecise and does not really explain what 'matches'
means. Pattern.matches("foo", "xfoox") returns false, unlike every other
regular-expression implementation I know.

It seems that matches() requires the pattern to capture the whole string.
(Not 'exhaust'---Pattern.matches(".", "foo") is false, even though the
pattern does exhaust the string in the sense of 'exhaust' as in the doc you
referred to.)
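
The observations above can be reproduced with a small sketch, assuming a
standard JDK. The class and helper names (MatchesVsFind, wholeMatch,
substringMatch) are made up for illustration; only Pattern.matches() and
Matcher.find() are real API calls.

```java
import java.util.regex.Pattern;

final class MatchesVsFind {
    // True only if the pattern matches the entire input string.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the pattern matches some substring of the input.
    static boolean substringMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("foo", "xfoox"));      // false
        System.out.println(substringMatch("foo", "xfoox"));  // true
        System.out.println(wholeMatch(".", "foo"));          // false
        System.out.println(wholeMatch("...", "foo"));        // true
    }
}
```

The substring search that most other regex tools perform by default
corresponds to find() here, not to matches().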

However, Pattern.matches("[^/]+?", "foo") returns true. But, analogously
to the example you referred to before, the pattern should match only the
individual characters, not the whole string, precisely because of the
reluctance. It seems that matches() effectively uses the pattern as if it
were surrounded with the start and end anchors, '^' and '$'. Of course,
^[^/]+?$ matches the whole string. But [^/]+? is not ^[^/]+?$; they are
substantially different patterns.
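
To make the difference concrete, here is a hedged sketch comparing the
whole-string call with an unanchored search for the same reluctant
pattern. ReluctantWholeMatch and firstFind are hypothetical names chosen
for this example; it illustrates the observed behaviour and does not
claim to show how Jersey itself applies the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReluctantWholeMatch {
    // First substring found by an unanchored search, or null if none.
    static String firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Whole-string match succeeds: the quantifier is forced to extend.
        System.out.println(Pattern.matches("[^/]+?", "foo"));      // true
        // Unanchored search stops at the shortest match.
        System.out.println(firstFind("[^/]+?", "foo"));            // f
        // With explicit anchors, the search also yields the whole string.
        System.out.println(firstFind("^[^/]+?$", "foo"));          // foo
        // A slash in the input makes the whole-string match fail.
        System.out.println(Pattern.matches("[^/]+?", "foo/bar"));  // false
    }
}
```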

If that's what Pattern.matches() does, then I'd say it's a bug that needs
to be fixed. As an Oracle developer, you may want to discuss it with your
team. I'd be happy to be proved wrong---with good, clear argumentation.

Referring to your earlier comment: "Our regex says “find a match of a
largest substring with one or more non-slash chars, by starting with an
empty string and adding one character at a time until you cannot add
another matching character”, i.e. match reluctantly. It does not however
mean “find a smallest matching substring and finish”…" It's precisely the
inverse---the regex says, find the shortest string that is a match. For
^[^/]+?$, "foo" is the shortest match in "foo", but for [^/]+?, it is "f".
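
The "shortest match" reading can be checked by collecting every match
that a search loop reports. This is only a sketch: AllMatches and findAll
are names invented for illustration, and the loop over Matcher.find() is
my stand-in for Perl-style global matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AllMatches {
    // Collects all successive non-overlapping matches, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("[^/]+?", "foo"));           // [f, o, o]
        System.out.println(findAll("[^/]+", "foo"));            // [foo]
        System.out.println(findAll("^[^/]+?$", "foo"));         // [foo]
        // With a trailing slash, reluctant and greedy agree on the result.
        System.out.println(findAll("[^/]+?/", "foo/bar"));      // [foo/]
        System.out.println(findAll("[^/]+/", "foo/bar"));       // [foo/]
        // The tutorial example: two separate matches, never the whole input.
        System.out.println(findAll(".*?foo", "xfooxxxxxxfoo")); // [xfoo, xxxxxxfoo]
    }
}
```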

Best,
Wacek

On Thu, Jan 29, 2015 at 12:25 AM, Waclaw Kusnierczyk <
waclaw.kusnierczyk_at_gmail.com> wrote:

> Please consider these examples:
>
> https://regex101.com/r/fE9sF5/1
> https://regex101.com/r/fE9sF5/2
> https://regex101.com/r/fE9sF5/3
> https://regex101.com/r/fE9sF5/4
>
> I don't believe Java is any different in this respect.
>
> Wacek
>
>
> On Thu, Jan 29, 2015 at 12:19 AM, Waclaw Kusnierczyk <
> waclaw.kusnierczyk_at_gmail.com> wrote:
>
>> Just a minor correction:
>>
>> "Clearly, [^/]+? applied globally (as in Perl with the modifier g), can
>> with a dose of imprecision be said to effectively match the whole string --
>> but in fact it does not match the string, it matches each of the string's
>> characters one by one. So it matches as many times as the string is long,
>> each time just one character."
>>
>> is of course meant to say 'the whole string up to and exclusive of the
>> first slash' and 'as many times as there are characters before the first
>> slash'.
>>
>> Wacek
>>
>> On Thu, Jan 29, 2015 at 12:17 AM, Waclaw Kusnierczyk <
>> waclaw.kusnierczyk_at_gmail.com> wrote:
>>
>>> Marek,
>>>
>>> Thanks for the explanation. Clearly, [^/]+? applied globally (as in
>>> Perl with the modifier g), can with a dose of imprecision be said to
>>> effectively match the whole string -- but in fact it does not match the
>>> string, it matches each of the string's characters one by one. So it
>>> matches as many times as the string is long, each time just one character.
>>>
>>> Consider the example given in the doc you refer to:
>>>
>>> >>
>>>
>>> Enter your regex: .*?foo // reluctant quantifier
>>> Enter input string to search: xfooxxxxxxfoo
>>> I found the text "xfoo" starting at index 0 and ending at index 4.
>>> I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
>>>
>>> The second example [the one quoted above], however, is reluctant, so it
>>> starts by first consuming "nothing". Because "foo" doesn't appear at the
>>> beginning of the string, it's forced to swallow the first letter (an "x"),
>>> which triggers the first match at 0 and 4. Our test harness continues the
>>> process until the input string is exhausted. It finds another match at 4
>>> and 13.
>>>
>>> <<
>>>
>>> Very clearly, the pattern .*?foo matches two separate substrings. It
>>> never reports matching xfooxxxxxxfoo. Neither does the description claim
>>> that. It _exhausts_ the string, true -- in a loop, it finds multiple
>>> subsequent non-overlapping matches that concatenate to the whole string.
>>>
>>> With [^/]+? it so happens that it will match the same characters in a
>>> path fragment as [^/]+. However, the latter matches just one string (the
>>> whole fragment), while the former matches each of the fragment's
>>> characters individually but not the whole fragment (except for
>>> degenerate cases).
>>>
>>> Note, the situation is different if the regex is terminated with a
>>> slash. Then [^/]+?/ will gradually extend the string consumed until it
>>> finds a slash, while [^/]+/ will consume the whole string (as per the doc
>>> you cite) and then backtrack. They effectively will match the same string,
>>> but arrive at it in different ways.
>>>
>>> The original regex in the doc I referred to does not have a trailing
>>> slash. I still believe this is not an appropriate explanation. I can see
>>> that the sources do use [^/]+?, but this pattern must then be used in a
>>> loop to match the characters individually. It still will not match the
>>> whole string in the usual sense. Once you use global matching (in a loop),
>>> you can just use [^/] with the same effect---it will successively consume
>>> all characters one by one until the first slash.
>>>
>>> Let me know if this seems wrong to you.
>>>
>>> Best,
>>> Wacek
>>>
>>
>>
>