users@jersey.java.net

[Jersey] Re: default regex for path variables

From: Marek Potociar <marek.potociar_at_oracle.com>
Date: Tue, 3 Feb 2015 17:27:33 +0100

Wacek,

Just did a similar experiment on my own. You’re right. We’ll fix the documentation.

Thanks & Cheers,
Marek

> On 29 Jan 2015, at 01:30, Waclaw Kusnierczyk <waclaw.kusnierczyk_at_gmail.com> wrote:
>
> Marek,
>
> I just checked the behaviour of Pattern.matches(String, CharSequence). The javadoc is rather imprecise, and does not really explain what 'matches' means. Pattern.matches("foo", "xfoox") returns false, which is unlike every other regular-expression implementation I know.
>
> It seems that matches() requires the pattern to capture the whole string. (Not 'exhaust'---Pattern.matches(".", "foo") is false, even though the pattern does exhaust the string in the sense of 'exhaust' as in the doc you referred to.)
>
> However, Pattern.matches("[^/]+?", "foo") returns true. But, analogously to the example you referred to before, the pattern should match only the individual characters, not the whole string, precisely because of the reluctance. It seems that matches() effectively uses the pattern as if it were surrounded with the start and end anchors, '^' and '$'. Of course, ^[^/]+?$ matches the whole string. But [^/]+? is not ^[^/]+?$; they are substantially different patterns.
>
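These observations can be reproduced with a short sketch. The class and helper names below are invented for illustration; only Pattern.matches, Pattern.compile, Matcher.find and Matcher.group are actual java.util.regex calls. Per the javadoc, Pattern.matches(regex, input) behaves like Pattern.compile(regex).matcher(input).matches(), which attempts to match the entire input, whereas find() scans for the next matching substring.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVersusFind {
    // Whole-input match: true only if the pattern can match the entire input.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // First substring match found by scanning the input, or null if there is none.
    static String firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("foo", "xfoox"));   // false
        System.out.println(firstFind("foo", "xfoox"));    // foo
        System.out.println(wholeMatch(".", "foo"));       // false
        System.out.println(wholeMatch("[^/]+?", "foo"));  // true
        System.out.println(firstFind("[^/]+?", "foo"));   // f
    }
}
```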
> If that's what Pattern.matches() does, then I'd say it's a bug that needs to be fixed. As an Oracle developer, you may discuss it with your team, I'd be happy to be proved wrong---with good, clear argumentation.
>
> Referring to your earlier comment: "Our regex says “find a match of a largest substring with one or more non-slash chars, by starting with an empty string and adding one character at a time until you cannot add another matching character”, i.e. match reluctantly. It does not however mean “find a smallest matching substring and finish”…" It's precisely the inverse: the regex says, find the shortest string that is a match. For ^[^/]+?$, "foo" is the shortest match in "foo", but for [^/]+?, it is "f".
>
> Best,
> Wacek
>
>
> On Thu, Jan 29, 2015 at 12:25 AM, Waclaw Kusnierczyk <waclaw.kusnierczyk_at_gmail.com> wrote:
> Please consider these examples:
>
> https://regex101.com/r/fE9sF5/1
> https://regex101.com/r/fE9sF5/2
> https://regex101.com/r/fE9sF5/3
> https://regex101.com/r/fE9sF5/4
>
> I don't believe Java is any different in this respect.
>
> Wacek
>
>
> On Thu, Jan 29, 2015 at 12:19 AM, Waclaw Kusnierczyk <waclaw.kusnierczyk_at_gmail.com> wrote:
> Just a minor correction:
>
> "Clearly, [^/]+? applied globally (as in Perl with the modifier g), can with a dose of imprecision be said to effectively match the whole string -- but in fact it does not match the string, it matches each of the string's characters one by one. So it matches as many times as the string is long, each time just one character."
>
> is of course meant to say 'the whole string up to and exclusive of the first slash' and 'as many times as there are characters before the first slash'.
>
> Wacek
>
> On Thu, Jan 29, 2015 at 12:17 AM, Waclaw Kusnierczyk <waclaw.kusnierczyk_at_gmail.com> wrote:
> Marek,
>
> Thanks for the explanation. Clearly, [^/]+? applied globally (as in Perl with the modifier g), can with a dose of imprecision be said to effectively match the whole string -- but in fact it does not match the string, it matches each of the string's characters one by one. So it matches as many times as the string is long, each time just one character.
>
> Consider the example given in the doc you refer to:
>
> >>
> Enter your regex: .*?foo // reluctant quantifier
> Enter input string to search: xfooxxxxxxfoo
> I found the text "xfoo" starting at index 0 and ending at index 4.
> I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
> The second example [the one quoted above], however, is reluctant, so it starts by first consuming "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13.
>
> <<
>
> Very clearly, the pattern .*?foo matches two separate substrings. It never reports matching xfooxxxxxxfoo. Neither does the description claim that. It _exhausts_ the string, true -- in a loop, it finds multiple subsequent non-overlapping matches that concatenate to the whole string.
>
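The two-match behaviour described above can be sketched with a plain find() loop. This is only an approximation of the tutorial's test harness, not its actual source; the class and method names are hypothetical. For contrast, the sketch also asks whether the same pattern matches the input as a whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo {
    // Collects every successive non-overlapping match as "text@start-end".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Two separate matches that together cover the input.
        System.out.println(findAll(".*?foo", input));          // [xfoo@0-4, xxxxxxfoo@4-13]
        // A whole-input match is a different question from the loop above.
        System.out.println(Pattern.matches(".*?foo", input));  // true
    }
}
```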
> With [^/]+? it so happens that it will match the same characters in a path fragment as [^/]+; however, the latter matches just one string (the whole fragment), whereas the former matches each of the fragment's characters individually but not the whole fragment (except for degenerate cases).
>
> Note, the situation is different if the regex is terminated with a slash. Then [^/]+?/ will gradually extend the string consumed until it finds a slash, while [^/]+/ will consume the whole string (as per the doc you cite) and then backtrack. They effectively will match the same string, but arrive at it in different ways.
>
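A minimal sketch of the trailing-slash case, assuming a made-up input "foo/bar" and hypothetical class and method names. Both patterns report the same first match. One detail worth noting: in the greedy form, [^/]+ stops at the first slash by itself, because the character class cannot consume a slash, so the match is reached without backtracking over the rest of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingSlashDemo {
    // First substring match found by scanning the input, or null if there is none.
    static String firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Reluctant: extends one character at a time until a slash follows.
        System.out.println(firstFind("[^/]+?/", "foo/bar"));  // foo/
        // Greedy: takes all non-slash characters at once, then the slash.
        System.out.println(firstFind("[^/]+/", "foo/bar"));   // foo/
    }
}
```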
> The original regex in the doc I referred to does not have a trailing slash. I still believe this is not an appropriate explanation. I can see that the sources do use [^/]+?, but this pattern must then be used in a loop to match the characters individually. It still will not match the whole string in the usual sense. Once you use global matching (in a loop), you can just use [^/] with the same effect---it will successively consume all characters one by one until the first slash.
>
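The loop behaviour argued above can be checked with a small sketch, assuming a slash-free fragment "foo" as input; the class and method names are hypothetical. Used with find() in a loop, [^/]+? and [^/] yield the same single-character matches, while [^/]+ yields the fragment in one piece.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PerCharacterDemo {
    // Collects the text of every successive non-overlapping match.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("[^/]+?", "foo"));  // [f, o, o]
        System.out.println(findAll("[^/]", "foo"));    // [f, o, o]
        System.out.println(findAll("[^/]+", "foo"));   // [foo]
    }
}
```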
> Let me know if this seems wrong to you.
>
> Best,
> Wacek
>
>
>