Improving output of syntax errors

  110761
June 28, 2020 18:53 rowan.collins@gmail.com (Rowan Tommins)
Hi all,

During the discussion of "T_PAAMAYIM_NEKUDOTAYIM", there was broad 
agreement that internal token names should not be included in errors 
shown to users. I now have an implementation of this.

Note that we currently only customise the token descriptions placed into 
Bison's hard-coded template, so the format remains "syntax error, 
unexpected %s, expecting %s or %s or %s".


I have distinguished between two types of token:

  * Tokens which always represent the same text (e.g. keywords and
    operators) are represented by their standard form, e.g. 'unexpected
    token "static", expecting "function" ...'
  * Tokens which have variable content are given a user-friendly name,
    shown as well as the actual text when possible, e.g. 'unexpected
    identifier "foo", expecting quoted string ...'

As a special-case, quoted strings show the string's *content* in double 
quotes, e.g. 'unexpected quoted string "foo" ...' rather than 
'unexpected quoted string ""foo"" ...' or 'unexpected quoted string 
"'foo'" ...'.

A "..." is also included if the text is longer than 30 bytes (where 
previously it would have been silently truncated).


For example, given the following:

<<<<<<<<<<<<

The current 8.0 alpha will show this:

Parse error: syntax error, unexpected '<<' (T_SL), expecting identifier 
(T_STRING) or static (T_STATIC) or namespace (T_NAMESPACE) or \\ 
(T_NS_SEPARATOR) in filename.php on line N

The proposed patch will instead show this:

Parse error: syntax error, unexpected token "<<", expecting identifier 
or "static" or "namespace" or "\" in filename.php on line N


For more examples, see: 
https://rwec.co.uk/x/php-parse-errors/comparison.html

The patch can be reviewed at: https://github.com/php/php-src/pull/5722


I am happy to post a small RFC if people think this requires a vote.

Any other feedback is welcome.


(As an aside, the other commonly requested change was to include column 
numbers; this appears to be feasible, but definitely more complex, and 
with potential performance trade-offs. I hope to re-visit this later.)


Regards,

-- 
Rowan Tommins (né Collins)
[IMSoP]
  110762
June 28, 2020 21:03 smalyshev@gmail.com (Stanislav Malyshev)
Hi!

> For more examples, see: > https://rwec.co.uk/x/php-parse-errors/comparison.html > > The patch can be reviewed at: https://github.com/php/php-src/pull/5722
I think this is great, thanks for implementing it!
> I am happy to post a small RFC if people think this requires a vote. > > Any other feedback is welcome.
I personally don't think it requires a vote, as error messages are an implementation detail and nobody should ever rely on it. Let's wait for a bit though if somebody argues that it needs an RFC, but otherwise I think we can merge it.
> (As an aside, the other commonly requested change was to include column > numbers; this appears to be feasible, but definitely more complex, and > with potential performance trade-offs. I hope to re-visit this later.)
Column numbers would be awesome too. -- Stas Malyshev smalyshev@gmail.com
  110763
June 28, 2020 22:08 krakjoe@gmail.com (Joe Watkins)
Nicely done.

No need for an RFC.

Thanks.

On Sunday, 28 June 2020, Stanislav Malyshev <smalyshev@gmail.com> wrote:

> Hi! > > > For more examples, see: > > https://rwec.co.uk/x/php-parse-errors/comparison.html > > > > The patch can be reviewed at: https://github.com/php/php-src/pull/5722 > > I think this is great, thanks for implementing it! > > > I am happy to post a small RFC if people think this requires a vote. > > > > Any other feedback is welcome. > > I personally don't think it requires a vote, as error messages are an > implementation detail and nobody should ever rely on it. Let's wait for > a bit though if somebody argues that it needs an RFC, but otherwise I > think we can merge it. > > > (As an aside, the other commonly requested change was to include column > > numbers; this appears to be feasible, but definitely more complex, and > > with potential performance trade-offs. I hope to re-visit this later.) > > Column numbers would be awesome too. > > -- > Stas Malyshev > smalyshev@gmail.com > > -- > PHP Internals - PHP Runtime Development Mailing List > To unsubscribe, visit: https://www.php.net/unsub.php > >
  110764
June 29, 2020 06:27 claude.pache@gmail.com (Claude Pache)
> As a special-case, quoted strings show the string's *content* in double quotes, e.g. 'unexpected quoted string "foo" ...' rather than 'unexpected quoted string ""foo"" ...' or 'unexpected quoted string "'foo'" ...'.
For me, in this case, you have dropped the wrong pair of quotes. That is, I prefer to see the quotes used in the source text (“unexpected string 'foo' ...” or “unexpected string "foo" ...”), over the quotes that make error messages consistent with other error messages. This is because I like to have hints that help me to visually identify the offending string in the source text when it is of length 0 or 1, but on other hand I don’t care about consistency across error messages. —Claude
  110766
June 29, 2020 08:29 rowan.collins@gmail.com (Rowan Tommins)
Hi Claude,

On Mon, 29 Jun 2020 at 07:27, Claude Pache pache@gmail.com> wrote:

> > > > As a special-case, quoted strings show the string's *content* in double > quotes, e.g. 'unexpected quoted string "foo" ...' rather than 'unexpected > quoted string ""foo"" ...' or 'unexpected quoted string "'foo'" ...'. > > For me, in this case, you have dropped the wrong pair of quotes. That is, > I prefer to see the quotes used in the source text (“unexpected string > 'foo' ...” or “unexpected string "foo" ...”), over the quotes that make > error messages consistent with other error messages. This is because I like > to have hints that help me to visually identify the offending string in the > source text when it is of length 0 or 1, but on other hand I don’t care > about consistency across error messages. >
I did actually have strings matching the quote marks used in an earlier version of the patch, but dropped it partly to simplify the implementation - because it sometimes truncates strings, it can't just use the real quotes, but has to strip and re-add them. The consistency should aid anyone trying to parse the error message, since the pattern /unexpected ([^"]+) "(.*?)"/ will always give you a token name (even if just 'token') and its content. I can re-visit that part if you feel strongly, though. Regards, -- Rowan Tommins [IMSoP]
  110767
June 29, 2020 09:42 claude.pache@gmail.com (Claude Pache)
> Le 29 juin 2020 à 10:29, Rowan Tommins collins@gmail.com> a écrit : > > > I did actually have strings matching the quote marks used in an earlier > version of the patch, but dropped it partly to simplify the implementation > - because it sometimes truncates strings, it can't just use the real > quotes, but has to strip and re-add them. > > The consistency should aid anyone trying to parse the error message, since > the pattern /unexpected ([^"]+) "(.*?)"/ will always give you a token name > (even if just 'token') and its content. > > I can re-visit that part if you feel strongly, though.
At this point, I don’t feel strongly, because currently, when I see “unexpected '''' (T_CONSTANT_ENCAPSED_STRING)”, my automatism is not looking for a string with specific quote style (which is not evident by looking at four consecutive quotes), but for a missing or misplaced punctuation mark, because, by experience, it is most of the time the root cause of an error message containing that specific token name. Another way to give the programmer the appropriate information, is replacing “quoted string” with “single-quoted string” and “double-quoted string”. And while I am thinking about it: a case that tends to cause divergent squint is: unexpected '"', which you changed into: unexpected token """. In that specific case, it would be nice if it the common name of the offending token was spelt out, something like: unexpected double quotation mark. —Claude
  110773
June 29, 2020 11:15 rowan.collins@gmail.com (Rowan Tommins)
On Mon, 29 Jun 2020 at 10:42, Claude Pache pache@gmail.com> wrote:

> > At this point, I don’t feel strongly, because currently, when I see > “unexpected '''' (T_CONSTANT_ENCAPSED_STRING)”, > my automatism is not looking for a string with specific quote style (which > is not evident by looking at four consecutive quotes), but for a missing or > misplaced punctuation mark, because, by experience, it is most of the time > the root cause of an error message containing that specific token name. >
Yes, there's a bunch of cases where what the parser sees and expects isn't really that helpful, and what's parsed as a zero-length string may not actually look that way in the source. There's probably diminishing returns trying to handle that at the message level, rather than extending the parser itself as has been done for mis-matched brackets.
> Another way to give the programmer the appropriate information, is > replacing “quoted string” with “single-quoted string” and “double-quoted > string”. >
That would actually make a lot of sense. I'll have a look if it can be done without changing the actual parser definitions (which recognise both forms as the same token type).
> And while I am thinking about it: a case that tends to cause divergent > squint is: unexpected '"', which you changed into: unexpected token """. In > that specific case, it would be nice if it the common name of the offending > token was spelt out, something like: unexpected double quotation mark. >
That would be a reasonable special-case; theoretically, the same would apply for a lone single-quote, although I think the parser happens never to see that as a separate token. Thanks for the feedback, -- Rowan Tommins [IMSoP]