Re: [PHP-DEV] [RFC] Permit trailing whitespace in numeric strings

  105167
April 9, 2019 10:47 nikita.ppv@gmail.com (Nikita Popov)
On Thu, Apr 4, 2019 at 1:16 AM Andrea Faulds <ajf@ajf.me> wrote:

> Nikita Popov wrote:
> > I'm always a fan of making things stricter, but think that in this
> > particular case there are some additional considerations we should keep in
> > mind.
> >
> > 1. What is more important to me here than strictness is consistency. Either
> > both " 123" and "123 " are numeric, or neither are. Making "123 "
> > numeric is a change we can easily do, because it makes the numeric string
> > definition more permissive and is thus mostly backwards compatible. Doing
> > the reverse change is certainly not compatible and will be a much harder
> > sell.
> >
> > 2. I believe that a large part of the motivation here is that by making the
> > numeric string definition slightly more lax (in a consistent manner), we
> > can make *other* things more strict, because this essentially eliminates
> > the only "somewhat reasonable" case of trailing characters. The RFC already
> > mentions two of them:
> >
> > a) We can hard reject "123foo" inputs to "int" arguments (and some other
> > places). Currently this is allowed with a notice. I think if we resolve the
> > trailing whitespace question, then there cannot be any reasonable
> > opposition to this change.
> > b) My own RFC on number to string comparisons would benefit from this. From
> > initial testing it has surprisingly little impact, but one of the few cases
> > that turned up was this comparison with a string that had trailing
> > whitespace.
> >
> > Personally I think both of those changes are a lot more valuable than a
> > stricter numeric string definition without leading/trailing whitespace.
>
> I'm kinda unsure how to go forward because of these points. I would like
> to see improved comparisons, and I would like to see the end of the
> “non-well-formed” numeric string, and I think this whitespace RFC could
> be helpful to both. But I can't see the future, I don't know whether
> people will vote for removing leading or permitting trailing whitespace
> and whether or not they will be influenced by or this will influence
> opinion on the further improvements. ¯\_(ツ)_/¯
>
> I'm torn between:
>
> * Vote on allowing trailing whitespace
> * Vote on disallowing leading whitespace
> * Vote on which of those two approaches to go for
> * Trying to bundle everything together and voting on it as a package.
>
> I'm probably thinking too strategically.

Given the response on the mailing list (and also other places like Reddit), it seems like people feel pretty strongly that it's better to drop support for leading whitespace than add support for trailing whitespace. If we do this, I think we should couple this change with the removal of "non well-formed numeric strings", because they are so closely related (one change would forbid leading whitespace and the other trailing characters).

One possible course of action would be:

a) In PHP 7.4 throw a deprecation warning in is_numeric_string if there is leading whitespace (always).
b) In PHP 7.4 throw a deprecation warning in is_numeric_string if there are trailing characters in mode 1 (mode -1 already throws a notice and 0 already treats as non-numeric).
c) In PHP 8.0 treat leading whitespace as non-numeric (always).
d) In PHP 8.0 treat trailing characters as non-numeric (always), and remove the non well-formed distinction (mode -1).

Notably this also affects (int) behavior in that (int) " 42" will be 0 and (int) "42xyz" will be 0.

A less aggressive alternative would be:

a) In PHP 7.4 throw a deprecation warning in is_numeric_string if there is leading whitespace (unless mode is 1).
b) In PHP 8.0 treat leading whitespace as non-numeric (unless mode is 1).
c) In PHP 8.0 treat trailing characters as non-numeric (unless mode is 1). Remove the non well-formed distinction (mode -1).

This would keep the behavior of (int) as-is and only affect implicit numeric string checks.

This discussion has mostly been around the implicit cases; what do people think about the desired behavior of (int)?

Regards,
Nikita
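
To make the effect on (int) concrete, here is a minimal userland sketch of the end state of the stricter option. The helper `strict_int_cast()` is hypothetical: it is not part of PHP or of the proposal, and it only approximates "no leading whitespace, no trailing characters" by combining `is_numeric()` with a `trim()` comparison.

```php
<?php
// Hypothetical helper, not part of PHP or of the proposal: models the
// PHP 8.0 end state of the stricter option for (int) casts.
function strict_int_cast(string $s): int
{
    // Whitespace characters that is_numeric() may skip, depending on version.
    $ws = " \t\n\r\v\f";

    // Numeric only if is_numeric() agrees and nothing surrounds the number.
    $wellFormed = is_numeric($s) && $s === trim($s, $ws);

    return $wellFormed ? (int) $s : 0;
}

var_dump(strict_int_cast("42"));     // int(42)
var_dump(strict_int_cast(" 42"));    // int(0) under the stricter option
var_dump(strict_int_cast("42xyz"));  // int(0) under the stricter option
var_dump((int) " 42");               // int(42): the current cast behaviour
```

Under the less aggressive alternative the plain `(int)` cast on the last line would stay as it is, and only the implicit checks would change.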
  105168
April 9, 2019 10:57 derick@php.net (Derick Rethans)
On Tue, 9 Apr 2019, Nikita Popov wrote:

> [...]
> This discussion has mostly been around the implicit cases, what do people
> think about the desired behavior of (int)?

I think there should be no difference in behaviour between implicit and
explicit cases.

cheers,
Derick

--
https://derickrethans.nl | https://xdebug.org | https://dram.io
Like Xdebug? Consider a donation: https://xdebug.org/donate.php,
or become my Patron: https://www.patreon.com/derickr
twitter: @derickr and @xdebug
  105170
April 9, 2019 11:06 nikita.ppv@gmail.com (Nikita Popov)
On Tue, Apr 9, 2019 at 12:57 PM Derick Rethans <derick@php.net> wrote:

> [...]
> I think there should be no difference in behaviour between implicit and
> explicit cases.

I should probably clarify what I mean by explicit and implicit here. By explicit I mean anything using (int) casts or doing so internally (implicitly ^^) -- this *must* produce an integer in some way and does not have the option of rejecting the input. By implicit I mean other places checking for numeric strings, such as "int" parameters. These *do* have the option of rejecting the input. Both cannot work the same way due to the different constraints.

So to rephrase my question: While I think there is a consensus that "123xyz" and " 123" should not be accepted by an "int" parameter, it is not clear to me that there is also a consensus that (int) "123xyz" and (int) " 123" should result in 0 rather than 123.

Regards,
Nikita
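
The different constraints can be seen in a small sketch. It uses "abc" rather than "123xyz" or " 123", because how those two are handled by an "int" parameter is exactly what is under discussion and may vary by PHP version. The names `takes_int()` and `try_takes_int()` are hypothetical, and the example assumes the default coercive typing mode (no `declare(strict_types=1)`).

```php
<?php
function takes_int(int $n): int
{
    return $n;
}

// Hypothetical helper: returns null where the "int" parameter rejects the input.
function try_takes_int(string $s): ?int
{
    try {
        return takes_int($s);
    } catch (TypeError $e) {
        return null;
    }
}

var_dump((int) "abc");          // int(0): the cast always yields an integer
var_dump(try_takes_int("abc")); // NULL: the parameter rejected the input
```

The cast has no channel for refusing a value, so any stricter definition has to map rejected input to some integer, which is where the choice between 0 and 123 comes from.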
  105171
April 9, 2019 11:10 derick@php.net (Derick Rethans)
On Tue, 9 Apr 2019, Nikita Popov wrote:

> On Tue, Apr 9, 2019 at 12:57 PM Derick Rethans <derick@php.net> wrote:
>
> > > think about the desired behavior of (int)?
> >
> > I think there should be no difference in behaviour between implicit and
> > explicit cases.
>
> So to rephrase my question: While I think there is a consensus that
> "123xyz" and " 123" should not be accepted by an "int" parameter, it
> is not clear to me that there is also a consensus that (int) "123xyz"
> and (int) " 123" should result in 0 rather than 123.

I meant that (int) should accept the same data as "int" parameters, and
hence, the result of (int) " 123" should be 0.

cheers,
Derick
  105173
April 9, 2019 11:55 benjamin.morel@gmail.com (Benjamin Morel)
> I should probably clarify what I mean by explicit and implicit here. By
> explicit I mean anything using (int) casts or doing so internally
> (implicitly ^^) -- this *must* produce an integer in some way and does not
> have the option of rejecting the input. By implicit I mean other places
> checking for numeric strings, such as "int" parameters. These *do* have the
> option of rejecting the input. Both cannot work the same way due to the
> different constraints.

Why? Wouldn't it be nice to align the behaviour of implicit and explicit
casting, so that (int) "abc" throws a TypeError?

Ben
  105174
April 9, 2019 13:27 robehickman@gmail.com (Robert Hickman)
> Why? Wouldn't it be nice to align the behaviour of implicit and explicit
> casting, so that (int) "abc" throws a TypeError?

I agree with this. It would be odd if they behaved differently.
  105191
April 10, 2019 08:32 markyr@gmail.com (Mark Randall)
On 09/04/2019 14:27, Robert Hickman wrote:
>> Why? Wouldn't it be nice to align the behaviour of implicit and explicit
>> casting, so that (int) "abc" throws a TypeError?

I'm somewhat split on this one.

On the one hand, if I make an explicit cast, my intention is clear: make the best attempt to convert this string into a number. Trimming whitespace would be a natural consequence of that.

On the other hand, I'm a stickler for consistency.

On balance, I would most certainly like to see leading and trailing whitespace render a string ineligible to be implicitly converted, but I'd be satisfied with (int)" 123 " converting without error.

On the topic of errors for explicit casts, I'd be more inclined to also introduce (?int), which would return null in the event the value cannot be converted. As long as the precedence worked out, it would allow:

$output = (?int)$input ?? $default;

--
Mark Randall
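
The `(?int)` cast does not exist in PHP, but the pattern can be approximated in userland today. The sketch below uses a hypothetical helper `nullable_int()` built on `filter_var()`; its acceptance rules are those of `FILTER_VALIDATE_INT`, which need not match whatever a real `(?int)` cast would use.

```php
<?php
// Hypothetical stand-in for the suggested (?int) cast; no such cast exists in PHP.
function nullable_int(string $input): ?int
{
    // FILTER_NULL_ON_FAILURE makes filter_var() return null instead of false.
    return filter_var($input, FILTER_VALIDATE_INT, FILTER_NULL_ON_FAILURE);
}

$default = 10;
var_dump(nullable_int("42") ?? $default);   // int(42)
var_dump(nullable_int("abc") ?? $default);  // int(10)
```

With a function call there is no precedence question; with a real cast, `(?int)$input` would have to bind more tightly than `??` for the one-liner above to work.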
  105268
April 13, 2019 17:34 claude.pache@gmail.com (Claude Pache)
> On 9 Apr 2019, at 12:47, Nikita Popov wrote:
>
> [...]
> This discussion has mostly been around the implicit cases, what do people
> think about the desired behavior of (int)?

I think that, whatever change we make around “well-formed numeric strings” (those treated as numeric by the comparison operator, etc.), the casting operation itself (implicit as in `$a + $b` or explicit as in `(float) $a`) must keep the same semantics, except maybe with additional notices and/or warnings for the implicit case. People depend on those semantics, and we must not break their code, except maybe we could force them to be explicit (using (float)) if we have sufficiently good reason for that. I’ll give use cases further below.

To keep things simple, I suggest classifying strings into three categories:

1. Well-formed numeric strings; those, and those only, are considered valid numeric values for the sake of comparison, of typed parameters, of is_numeric(), etc. Let’s be strict here, and let’s not permit whitespace.

2. Loose numeric strings. Explicit casting must continue to work as today, since people expect those semantics (see first use case below); ditto for implicit casting but with an additional notice or warning. Those strings are of the form:
   * optional whitespace, followed by
   * a well-formed numeric string, followed by
   * optional garbage.

3. Non-numeric strings. Explicit casting must continue to work as today, as people may depend on that (see second use case below). A notice may be acceptable, provided it does not push people into using the evil @ operator in more places than today. Ditto for implicit casting, but with an additional warning, unless we decide to harshly throw a Throwable if we have a good reason for that.

------

Here is a real-world use case, where I want the casting from “loose numeric string” to float to continue to work as today:

I have a CSS stylesheet. Because I find it more practical, I prefer to encode the colours in the hwb (hue-whiteness-blackness) format:

```css
:root {
    --red-hue: 350;
}
.alert {
    color: hwb(var(--red-hue), 6% , 12%);
}
```

However browsers do not (yet) understand hwb; so I must serve them as hsl (hue-saturation-lightness):

```css
.alert {
    color: hsl(var(--red-hue), 87.2%, 47%);
}
```

The transformation between hwb() and hsl() is quite simple, and I use PHP as a preprocessor (on my dev machine) for that purpose. The algorithm does not need to include a true CSS parser, as I know in advance that the `hwb()` expressions I’ve written myself contain only either literal numbers or literal percentages for the “whiteness” and “blackness” components, and no comma in the “hue” component:

* find an occurrence of "hwb(";
* read characters up to the next ",": this is the “hue” component;
* read characters up to the next ",": this is the “whiteness” component;
* read next characters up to the next ")": this is the “blackness” component;
* if the “whiteness” (respectively the “blackness”) component contains the character "%", it is a percentage: divide it by 100;
* calculate “saturation” and “lightness” from “whiteness” and “blackness”;
* format and assemble the “hue”, “saturation” and “lightness” components into a "hsl(...)" expression.

If you have followed the algorithm, you should have noticed that at one point I have the values:

    $whiteness === " 6% ";
    $blackness === " 12%";

which are exactly of the “loose numeric string” format described above (whitespace, followed by a well-formed numeric string, followed by garbage), and which I straightforwardly treat as numeric (knowing that PHP permits that), with or without explicitly casting depending on how much I am bothered by the “not well-formed numeric value” notice.

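The steps above can be sketched roughly as follows. This is not the actual preprocessor: the function names `hwb_to_hsl()` and `convert_hwb()` are hypothetical, the regular expression is one possible reading of the "read up to the next comma" steps, and the conversion formula is the standard hwb-to-hsl relation rather than anything stated in this message. The point is that `(float)` is applied directly to strings such as " 6% ".

```php
<?php
// Hypothetical helpers sketching the preprocessor described above.
function hwb_to_hsl(string $hue, string $whiteness, string $blackness): string
{
    // (float) reads the leading number of " 6% " or " 12%" and ignores the rest.
    $w = (float) $whiteness;
    $b = (float) $blackness;

    // A "%" marks a percentage: scale it to the 0..1 range.
    if (strpos($whiteness, '%') !== false) {
        $w /= 100;
    }
    if (strpos($blackness, '%') !== false) {
        $b /= 100;
    }

    if ($w + $b >= 1) {
        // Achromatic: whiteness and blackness are normalised to a grey.
        $s = 0.0;
        $l = $w / ($w + $b);
    } else {
        $l = (1 - $b + $w) / 2;
        $s = (1 - $b - $w) / 2 / min($l, 1 - $l);
    }

    return sprintf('hsl(%s, %.1F%%, %.1F%%)', trim($hue), $s * 100, $l * 100);
}

function convert_hwb(string $css): string
{
    // Hue: up to the first ","; whiteness: up to the next ","; blackness: up to ")".
    return preg_replace_callback(
        '/hwb\(([^,]*),([^,]*),([^)]*)\)/',
        function (array $m): string {
            return hwb_to_hsl($m[1], $m[2], $m[3]);
        },
        $css
    );
}

echo convert_hwb('.alert { color: hwb(var(--red-hue), 6% , 12%); }'), "\n";
// .alert { color: hsl(var(--red-hue), 87.2%, 47.0%); }
```

If the explicit cast stopped accepting leading whitespace or trailing garbage, both `(float)` lines would need an extra `trim()` and a step to strip the "%" first.
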
------

Here is a use case where I want a string-to-float cast that never fails (and does not change semantics either):

I have config parameters stored in a database, with values in the form of strings. Some of those parameters are supposed to be numeric; for example, two of them may represent the offset (in millimetres) of the company logo, on printed material, relative to a predefined position. A missing value (either the empty string or null) is treated as zero. An invalid value (such as "foo") is also happily treated as zero, as I don’t bother to report the error at this point: it is immediately visible whether the logo is correctly placed or not.

For that purpose, I’m relying on the fact that the explicit string-to-numeric casting operation is infallible.

Claude
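
A minimal sketch of that reliance, assuming a hypothetical accessor `logo_offset_mm()` that receives the raw database value (a string, or null when the row is missing):

```php
<?php
// Hypothetical accessor for a config value stored as a string (or missing).
function logo_offset_mm(?string $raw): float
{
    // The explicit cast never fails: null, "" and "foo" all become 0.0.
    return (float) $raw;
}

var_dump(logo_offset_mm(null));   // float(0)
var_dump(logo_offset_mm(""));     // float(0)
var_dump(logo_offset_mm("foo"));  // float(0)
var_dump(logo_offset_mm("12.5")); // float(12.5)
```

If `(float) "foo"` were to throw, every such accessor would need a try/catch or a prior validity check.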