[Jersey] Re: default regex for path variables

From: Waclaw Kusnierczyk <waclaw.kusnierczyk_at_gmail.com>
Date: Wed, 4 Feb 2015 01:42:03 +0100

Marek,

Thank you for the update.

I was actually following on the reference examples you pointed to and
looked at some of the java.util.regex.* code. I believe you've got more to
fix than the jersey doc. I believe you should know best where to redirect,
as this now is only indirectly related to the original issue.

Here is java.util.regex.Pattern:1132+:

    public static boolean matches(String regex, CharSequence input) {
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(input);
        return m.matches();
    }

You compile the pattern, create a matcher for the input -- no magic. But
then there is the call m.matches leading to java.util.regex.Matcher:594+:

    /**
     * Attempts to match the entire region against the pattern.
     *
     * <p> If the match succeeds then more information can be obtained via
the
     * <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods. </p>
     *
     * @return <tt>true</tt> if, and only if, the entire region sequence
     * matches this matcher's pattern
     */
    public boolean matches() {
        return match(from, ENDANCHOR);
    }

Note the ENDANCHOR. This leads to java.util.regex.Matcher:1255+:

    /**
     * Initiates a search for an anchored match to a Pattern within the
given
     * bounds. The groups are filled with default values and the match of
the
     * root of the state machine is called. The state machine will hold the
     * state of the match as it proceeds in this matcher.
     */
    boolean match(int from, int anchor) {

To make clear what ENDANCHOR is, Matcher:136+:

    /**
     * Matcher state used by the last node. NOANCHOR is used when a
     * match does not have to consume all of the input. ENDANCHOR is
     * the mode used for matching all the input.
     */
    static final int ENDANCHOR = 1;

Effectively, you **force** the pattern to match the whole string. You
transform a pattern p to the pattern p$, i.e., add the end of string
anchor. This is where the relation to the original problem is: a pattern
like ".+?", which captures only one character (but will capture each
character of the target individually if used globally, or in a loop), is
transformed into the *substantially different* pattern ".+?$", which will
capture any target as a whole in one go.

java.util.regex.Pattern#matches(String, CharSequence) seems to be intended
to return true iff the regex captures the whole target:

"foo" matches "foo"
"foo" does not match "xfoo"
"foo" does not match "foox"
"foo" does not match "xfoox"

So proper partial matching is not sufficient. But the implementation is
wrong; instead of testing *whether what the pattern matches is the whole
target*, it *forces* the pattern to match the whole target.

Then, for this method, 'true' means that the pattern can be forced to match
if appended the end-of-string anchor, rather than that the pattern matches
the whole target as is, without the anchor:

".+?" matches "foo" even though it actually doesn't---it matches "f"
(proper partial match). Since the pattern actually used is ".+?$", it
matches "foo": matches(".+?", "foo") returns true. This is wrong, because
the input pattern really is ".+?". If this matches(".+?", "foo") returns
true, so should matches("f", "foo").

I haven't read the whole code bit by bit, so please let me know if I
misinterpret it (I don't believe I do).

Best
Wacek

On Tue, Feb 3, 2015 at 5:27 PM, Marek Potociar <marek.potociar_at_oracle.com>
wrote:

> Wacek,
>
> Just did a similar experiment on my own. You’re right. We’ll fix the
> documentation.
>
> Thanks & Cheers,
> Marek
>
> On 29 Jan 2015, at 01:30, Waclaw Kusnierczyk <waclaw.kusnierczyk_at_gmail.com>
> wrote:
>
> Marek,
>
> I just checked the behaviour of Pattern.matches(String, CharSequence).
> The javadoc is rather imprecise, and does not really explain what 'matches'
> means. Pattern.matches("foo", "xfoox") returns false, which is unlike in
> all other implementations of regular expressions I know.
>
> It seems that matches() requires the pattern to capture the whole string.
> (Not 'exhaust'---Pattern.matches(".", "foo") is false, even though the
> pattern does exhaust the string in the sense of 'exhaust' as in the doc you
> referred to.)
>
> However, Pattern.matches("[^/]+?", "foo") returns true. But, analogously
> to the example you referred to before, the pattern should only match the
> individual characters, not the whole string, precisely because of the
> reluctance. It seems that matches() effectively uses the pattern as if it
> was surrounded with the start and end anchors, '^' and '$'. Of course,
> ^[^/]+?$ matches the whole string. But [^/]+? is not ^[^/]+?$, they are
> substantially different patterns.
>
> If that's what Pattern.matches() does, then I'd say it's a bug that needs
> to be fixed. As an Oracle developer, you may discuss it with your team,
> I'd be happy to be proved wrong---with good, clear argumentation.
>
> Referring to your earlier comment: "Our regex says “find a match of a
> largest substring with one or more non-slash chars, by starting with an
> empty string and adding one character at a time until you cannot add
> another matching character”, i.e. match reluctantly. It does not however
> mean “find a smallest matching substring and finish”…" It's precisely the
> inverse---the regex says, find the shortest string that is a match. For
> ^[^/]+?$, "foo" is the shortest match in "foo", but for [^/]+?, it is "f'.
>
> Best,
> Wacek
>
>
> On Thu, Jan 29, 2015 at 12:25 AM, Waclaw Kusnierczyk <
> waclaw.kusnierczyk_at_gmail.com> wrote:
>
>> Please consider these examples:
>>
>> https://regex101.com/r/fE9sF5/1
>> https://regex101.com/r/fE9sF5/2
>> https://regex101.com/r/fE9sF5/3
>> https://regex101.com/r/fE9sF5/4
>>
>> I don't believe Java is any different in this respect.
>>
>> Wacek
>>
>>
>> On Thu, Jan 29, 2015 at 12:19 AM, Waclaw Kusnierczyk <
>> waclaw.kusnierczyk_at_gmail.com> wrote:
>>
>>> Just a minor correction:
>>>
>>> "Clearly, [^/]+? applied globally (as in Perl with the modifier g), can
>>> with a dose of imprecision be said to effectively match the whole string --
>>> but yet fact it does not match the string, it matches each of the string's
>>> characters one by one. So it matches as many times as the string is long,
>>> each time just one character."
>>>
>>> is of course meant to say 'the whole string up to and exclusive of the
>>> first slash' and 'as many times as there are characters before the first
>>> slash'.
>>>
>>> Wacek
>>>
>>> On Thu, Jan 29, 2015 at 12:17 AM, Waclaw Kusnierczyk <
>>> waclaw.kusnierczyk_at_gmail.com> wrote:
>>>
>>>> Marek,
>>>>
>>>> Thanks for the explanation. Clearly, [^/]+? applied globally (as in
>>>> Perl with the modifier g), can with a dose of imprecision be said to
>>>> effectively match the whole string -- but yet fact it does not match the
>>>> string, it matches each of the string's characters one by one. So it
>>>> matches as many times as the string is long, each time just one character.
>>>>
>>>> Consider the example given in the doc you refer to:
>>>>
>>>> >>
>>>>
>>>> Enter your regex: .*?foo // reluctant quantifier
>>>> Enter input string to search: xfooxxxxxxfoo
>>>> I found the text "xfoo" starting at index 0 and ending at index 4.
>>>> I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
>>>>
>>>> The second example [the one quoted above], however, is reluctant, so it
>>>> starts by first consuming "nothing". Because "foo" doesn't appear at the
>>>> beginning of the string, it's forced to swallow the first letter (an "x"),
>>>> which triggers the first match at 0 and 4. Our test harness continues the
>>>> process until the input string is exhausted. It finds another match at 4
>>>> and 13.
>>>>
>>>> <<
>>>>
>>>> Very clearly, the pattern .*?foo matches two separate substrings. It
>>>> never reports matching xfooxxxxxxfoo. Neither does the description claim
>>>> that. It _exhausts_ the string, true -- in a loop, it finds multiple
>>>> subsequent non-overlapping matches that concatenate to the whole string.
>>>>
>>>> With [^/]+? it so happens that it will match the same characters in a
>>>> path fragment as [^/]+, however, the latter matches just one string (the
>>>> whole fragment), the former matches all of the fragment's characters
>>>> individually but not the whole fragment (except for degenerate cases).
>>>>
>>>> Note, the situation is different if the regex is terminated with a
>>>> slash. Then [^/]+?/ will gradually extend the string consumed until it
>>>> finds a slash, while [^/]+/ will consume the whole string (as per the doc
>>>> you cite) and then backtrack. They effectively will match the same string,
>>>> but arrive at it in different ways.
>>>>
>>>> The original regex in the doc I referred to does not have a trailing
>>>> slash. I still believe this is not an appropriate explanation. I can see
>>>> that the sources do use [^/]+?, but this pattern must then be used in a
>>>> loop to match the characters individually. It still will not match the
>>>> whole string in the usual sense. Once you use global matching (in a loop),
>>>> you can just use [^/] with the same effect---it will successively consume
>>>> all characters one by one until the first slash.
>>>>
>>>> Let me know if this seems wrong to you.
>>>>
>>>> Best,
>>>> Wacek
>>>>
>>>
>>>
>>
>
>