[RFC] [DISCUSSION] Locale-independent float to string cast

  109261
March 24, 2020 09:40 kocsismate90@gmail.com (Máté Kocsis)
Hi Internals,

Together with George, I'd like to propose an RFC for a long-standing
problem PHP has: casting floats to string depends on the locale settings.
As this behaviour is nonsense, and because it can cause quite serious
problems, we would like to get rid of locale-dependence in PHP 8.

Please find our RFC at
https://wiki.php.net/rfc/locale_independent_float_to_string.

Regards,
George and Máté
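
To make the problem concrete, here is a minimal sketch of the cast described above. It assumes a German locale such as `de_DE.UTF-8` may be installed (the locale names tried are illustrative guesses), and `castUnderLocale()` is a hypothetical helper, not something from the RFC. With such a locale active, PHP 7 would typically print `3,5` for the first line; under the proposed change it would print `3.5`.

```php
<?php
/**
 * Hypothetical helper: cast a float to string while one of the given
 * LC_NUMERIC locales is active, then restore the previous locale.
 * Returns null when none of the requested locales is installed.
 */
function castUnderLocale(float $value, array $locales): ?string
{
    // Passing '0' only queries the current locale, it changes nothing.
    $previous = setlocale(LC_NUMERIC, '0');

    if (setlocale(LC_NUMERIC, $locales) === false) {
        return null; // none of the locales is available on this system
    }

    $result = (string) $value; // the cast under discussion

    if ($previous !== false) {
        setlocale(LC_NUMERIC, $previous);
    }

    return $result;
}

// Locale names are assumptions; availability differs per system.
$german = castUnderLocale(3.5, ['de_DE.UTF-8', 'de_DE', 'German_Germany.1252']);
echo $german ?? 'German locale not installed', "\n";

// The "C" locale always uses "." as the decimal separator.
echo castUnderLocale(3.5, ['C']), "\n";
```

The helper restores the previous locale so that the experiment does not leak into the rest of the script.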
  109262
March 24, 2020 10:03 kjarli@gmail.com (Lynn)
Hi,

This is a great RFC! Just one minor thing.

> Outputting floats as strings in locales which change the decimal
> separator will have a slightly different output. In our opinion, the
> backward compatibility break won't be serious in practice

In my opinion, this will be huge. I can't trace back where the thousands of
possible conversions are used in legacy software. There's a variety of
custom format functions, sprintf implementations, string casts, and
number_format usages. Would it be possible to trigger a warning or
deprecation when using a locale that would have a different result from
what the new result will be? If I'd use a locale that results in `3.5`, I
don't need a warning. If I use a locale that results in `3,5`, I would like
to see a warning of sorts so I can fix this before untested legacy code
will seriously break data exports.

Regards,
Lynn

On Tue, Mar 24, 2020 at 10:40 AM Máté Kocsis <kocsismate90@gmail.com> wrote:

> Hi Internals,
>
> Together with George, I'd like to propose an RFC for a long-standing
> problem PHP has:
> casting floats to string depends on the locale settings. As this behaviour
> is nonsense, and
> because it can cause quite serious problems, we would like to get rid of
> locale-dependence
> in PHP 8.
>
> Please find our RFC at
> https://wiki.php.net/rfc/locale_independent_float_to_string.
>
> Regards,
> George and Máté
  109267
March 24, 2020 10:53 cmbecker69@gmx.de ("Christoph M. Becker")
On 24.03.2020 at 11:03, Lynn wrote:

>> Outputting floats as strings in locales which change the decimal
>> separator will have a slightly different output. In our opinion, the
>> backward compatibility break won't be serious in practice
>
> In my opinion, this will be huge. I can't trace back where the thousands of
> possible conversions are used in legacy software. There's a variety of
> custom format functions, sprintf implementations, string casts, and
> number_format usages. Would it be possible to trigger a warning or
> deprecation when using a locale that would have a different result from
> what the new result will be? If I'd use a locale that results in `3.5`, I
> don't need a warning. If I use a locale that results in `3,5`, I would like
> to see a warning of sorts so I can fix this before untested legacy code
> will seriously break data exports.

It would certainly be *possible* to raise a warning/notice if a (non SCCP
optimized) float to string conversion would have different results. This
would have a (small) performance impact though, so I'd rather avoid
enforcing it (i.e. maybe temporarily introduce an INI directive for this
purpose).

--
Christoph M. Becker
  109278
March 24, 2020 18:41 george.banyard@gmail.com ("G. P. B.")
On Tue, 24 Mar 2020 at 11:03, Lynn <kjarli@gmail.com> wrote:

> Hi,
>
> This is a great RFC! Just one minor thing.
>
>> Outputting floats as strings in locales which change the decimal
>> separator will have a slightly different output. In our opinion, the
>> backward compatibility break won't be serious in practice
>
> In my opinion, this will be huge. I can't trace back where the thousands of
> possible conversions are used in legacy software. There's a variety of
> custom format functions, sprintf implementations, string casts, and
> number_format usages. Would it be possible to trigger a warning or
> deprecation when using a locale that would have a different result from
> what the new result will be? If I'd use a locale that results in `3.5`, I
> don't need a warning. If I use a locale that results in `3,5`, I would like
> to see a warning of sorts so I can fix this before untested legacy code
> will seriously break data exports.
>
> Regards,
> Lynn

We are not saying that this won't be prevalent, especially in legacy code;
however, IMHO, this is not a serious issue as it is only a matter of how one
character is displayed to end users.

As said by Christoph, we did not consider a deprecation warning due to the
performance impact this would have, as float to string conversions are a
common operation.

The idea of a temporary INI setting which warns about these conversions is
interesting, but I personally would rather not introduce one. Moreover, how
long would a temporary INI setting last: one minor version, one major
version, something in between the two?

On Tue, 24 Mar 2020 at 11:15, Nikita Popov <nikita.ppv@gmail.com> wrote:

> I'm obviously in favor of this proposal.
>
> Only really comment I have is on printf(): You're right that we have %f and
> %F to toggle locale-sensitivity, but the %e, %E, %g, %G formats are always
> locale-sensitive. It might make sense to introduce locale-insensitive
> variants of those, especially considering that %G is considered the
> "standard" floating point format. Internally we support %H for that, so we
> could expose that... (Alternatively, locale-sensitivity might be removed
> for e/E/g/G.)
>
> Regards,
> Nikita

*sigh* I was not aware of this issue ...

Personally, I would prefer that %e and %E are not locale aware as they are
meant to represent a standard notation. This would then mean that %g is
locale aware as %f is, and %G is not, because %E and %F are not locale aware.
But I'm open to suggestions considering this edge case.

Best regards

George P. Banyard
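
The printf() point can be illustrated with a short sketch. It relies only on the documented distinction between `%f` (locale aware) and `%F` (non-locale aware); `formatFixed()` is a hypothetical wrapper name chosen for illustration, and nothing here settles how `%e`, `%E`, `%g` or `%G` should behave.

```php
<?php
/**
 * Hypothetical wrapper: fixed-point formatting that does not depend on
 * LC_NUMERIC, because %F is the non-locale-aware counterpart of %f.
 */
function formatFixed(float $value, int $decimals = 2): string
{
    return sprintf('%.' . $decimals . 'F', $value);
}

echo formatFixed(3.5), "\n";          // 3.50
echo formatFixed(1234.5678, 1), "\n"; // 1234.6

// %f follows the active LC_NUMERIC locale, so its decimal separator
// depends on earlier setlocale() calls.
echo sprintf('%.2f', 3.5), "\n";
```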
  109279
March 24, 2020 18:59 kjarli@gmail.com (Lynn)
On Tue, Mar 24, 2020 at 7:41 PM G. P. B. <george.banyard@gmail.com> wrote:

> On Tue, 24 Mar 2020 at 11:03, Lynn <kjarli@gmail.com> wrote:
>
>> I would like
>> to see a warning of sorts so I can fix this before untested legacy code
>> will seriously break data exports.
>>
>> Regards,
>> Lynn
>
> We are not saying that this won't be prevalent,especially in legacy code,
> however, IMHO, this is not a serious issue as it is only a matter how one
> character is displayed to end users.

It's so much more than a character that's being displayed. There are
automated data exports with prices that will be automatically consumed on
the other side. These kinds of changes can seriously break things if the
consumer expected localized currency formats. We're talking massive amounts
of data here. I've seen price formats go wrong often, and seen them be fixed
as recently as a week ago. It's hard to track all usages in a 20 year old
code base.

As much as I love that this can be fixed, I'd rather not see it fixed at all
if there's no upgrade path available.

Regards,
Lynn
  109281
March 24, 2020 23:14 kocsismate90@gmail.com (Máté Kocsis)
Hi Lynn,

> There are automated data exports with prices, that will be automatically
> consumed on the other side.

I agree that these cases can go horribly wrong. However, my reasoning is
the following:
- if a piece of code currently relies on locale-independence (e.g.
  automated data exports) then this change wouldn't cause any breakage
  since a workaround has already been in place there (e.g. the
  programmers use var_export() instead of casting)
- if a piece of code relies on the locale-dependent string representation
  of floats then there will be a BC break, sure, however I believe that
  code isn't very sensitive to the change in the vast majority of the
  cases since that data is for presentation purposes only.

Or do you have other locale-dependent use-cases in mind? I am sure there
are some but I think the number of the situations where the change is
problematic is less than what it first seems.

Regards,
Máté
  109282
March 25, 2020 08:25 kjarli@gmail.com (Lynn)
I'm very sorry, I pressed the reply instead of reply all button, I hope
this fixes it!

> I agree that these cases can go horribly wrong. However, my reasoning is
> the following:
> - if a piece of code currently relies on locale-independence (e.g.
> automated data exports) then this
> change wouldn't cause any breakage since a workaround has already been in
> place there (e.g. the
> programmers use var_export() instead of casting)
> - if a piece of code relies on the locale-dependent string representation
> of floats then there will be
> a BC break, sure, however I believe that code isn't very sensitive to the
> change in the vast majority of the
> cases since that data is for presentation purposes only.

I know it's meant for presentation purposes only; sadly, it's not used for
just presentation code. It's being consumed and parsed by automated imports
that expect a format different than `3.5`, because that's how it organically
evolved. Even when it was meant only for presentation, the consumer expected
the `3,5` format and will keep expecting this until we notify them of a
change (and then it will probably still take weeks before this is fixed on
their side).

A lot of code that I've encountered is stuff like:

```
$csv .= $product->getPrice() . ';';

// older code I've seen come by
$product = get_product($$var5);
$csv .= $product['price'] . ';';
```

As much as I'd like to see this fixed and always give a `3.5`, sadly that's
not the case if you use a different locale. Going through thousands of files
to fix this is not going to be an easy task, especially as this code is
often old enough to not be usable by static code analyzers.

> Or do you have other locale-dependent use-cases in mind? I am sure there
> are some but I think the number of the situations where the change is
> problematic is less than what it first seems.

No, this is the only issue I personally see with it. However, the impact
could be severe enough to not be able to upgrade to a new PHP version in
the foreseeable future, thus I would like to see an upgrade path, one that
tells you where it would've gone wrong, instead of turning the behavior on
or off. If I can gather the logs and thus find usages, I can pro-actively
start fixing where this would go wrong. I'll take a performance hit over a
breakage. The performance penalty would only have to apply if the locale is
set to something we know will change the outcome.

Maybe it turns out this is a small issue, maybe it's a big issue... I can't
tell because it's either upgrade & break, or upgrade & work in the proposed
scenario. Some breakage may take weeks to find out because scripts run
periodically.

I fully understand that you don't see this as a big issue, and I really
wish I could say the same. I can't vote, so I can't change the outcome;
please consider my use-case when moving forward, thanks!

Regards,
Lynn
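
One way to make an export like the one above independent of how the cast behaves is to state the expected separator explicitly. The following is a minimal sketch, assuming the consumer accepts a decimal comma with a fixed two decimals (which differs from the bare `3,5` a cast produces); `formatPriceForExport()`, `buildCsvLine()` and the sample prices are hypothetical.

```php
<?php
/**
 * Hypothetical helper: always emit a decimal comma and no thousands
 * separator, regardless of the active locale or PHP version.
 */
function formatPriceForExport(float $price): string
{
    return number_format($price, 2, ',', '');
}

/**
 * Hypothetical helper: join formatted prices into one ';'-separated line.
 */
function buildCsvLine(array $prices): string
{
    return implode(';', array_map('formatPriceForExport', $prices));
}

echo buildCsvLine([3.5, 1299.0, 0.25]), "\n"; // 3,50;1299,00;0,25
```

Because the separator is passed explicitly, the output stays the same whether or not float to string casts are locale-dependent.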
  109328
March 26, 2020 09:54 kocsismate90@gmail.com (Máté Kocsis)
Hi Lynn,

Thank you very much for the input! Then we could add a secondary vote to
the RFC where an upgrade path would be decided upon.

The first such path is the one proposed by Christoph: introduce a temporary
ini setting with which a "debug" mode of float to string casting could be
enabled. That is, when it's turned on and a cast is affected by locale
settings, a warning is emitted. This setting would be removed in a (near)
future version of PHP. PHP 8.1 would be an unusual candidate for the
removal, but it *might* be possible if we found a good way to warn people
from day 0 that it's just a temporary flag. Maybe by an immediate
deprecation?

Is there any other idea for a sensible upgrade path?

Cheers:
Máté
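
Independently of any such setting, a rough userland approximation of the underlying check is possible. The sketch below only reports whether the active LC_NUMERIC locale uses a decimal separator other than `.`; it does not locate the affected casts, which is what the proposed debug mode would do. `localeChangesDecimalPoint()` is a hypothetical name and the German locale names are assumptions; `localeconv()` is the standard PHP function.

```php
<?php
/**
 * Hypothetical helper: true when the active LC_NUMERIC locale would
 * render floats with a decimal separator other than ".".
 */
function localeChangesDecimalPoint(): bool
{
    return localeconv()['decimal_point'] !== '.';
}

setlocale(LC_NUMERIC, 'C');
var_dump(localeChangesDecimalPoint()); // bool(false)

// Locale names are assumptions; setlocale() returns false if none exists.
if (setlocale(LC_NUMERIC, 'de_DE.UTF-8', 'de_DE') !== false) {
    var_dump(localeChangesDecimalPoint());
}
```

A check like this could run once at bootstrap to decide whether auditing float casts in a code base is needed at all.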
  109668
April 15, 2020 20:19 george.banyard@gmail.com ("G. P. B.")
Hello all,

We've updated the implementation [1] and RFC [2] to provide a
temporary INI setting which will emit a warning every time a locale-aware
float to string conversion would have been done in PHP 7.
Its current name, debug_local_sensitive_float_casts, is also up for
debate and will be voted on in a secondary vote.

How long this INI setting should remain in PHP is also up to discussion.

One remaining question is if we should change the %e and %E formats
for the printf family of functions to be locale independent or not and/or if
we should expose the %H format which is similar to %G but with %E
being locale independent.

Moreover, we would like to open voting next week, so your input on this
is very much appreciated so that we can amend this before going to voting.

Best regards

George P. Banyard

[1] https://github.com/php/php-src/pull/5224
[2] https://wiki.php.net/rfc/locale_independent_float_to_string
  109265
March 24, 2020 10:14 nikita.ppv@gmail.com (Nikita Popov)
On Tue, Mar 24, 2020 at 10:40 AM Máté Kocsis <kocsismate90@gmail.com> wrote:

> Hi Internals,
>
> Together with George, I'd like to propose an RFC for a long-standing
> problem PHP has:
> casting floats to string depends on the locale settings. As this behaviour
> is nonsense, and
> because it can cause quite serious problems, we would like to get rid of
> locale-dependence
> in PHP 8.
>
> Please find our RFC at
> https://wiki.php.net/rfc/locale_independent_float_to_string.

I'm obviously in favor of this proposal.

Only really comment I have is on printf(): You're right that we have %f and
%F to toggle locale-sensitivity, but the %e, %E, %g, %G formats are always
locale-sensitive. It might make sense to introduce locale-insensitive
variants of those, especially considering that %G is considered the
"standard" floating point format. Internally we support %H for that, so we
could expose that... (Alternatively, locale-sensitivity might be removed
for e/E/g/G.)

Regards,
Nikita
  109671
April 15, 2020 20:54 ajf@ajf.me (Andrea Faulds)
Hi,

Máté Kocsis wrote:
> Hi Internals,
>
> Together with George, I'd like to propose an RFC for a long-standing
> problem PHP has:
> casting floats to string depends on the locale settings. As this behaviour
> is nonsense, and
> because it can cause quite serious problems, we would like to get rid of
> locale-dependence
> in PHP 8.
>
> Please find our RFC at
> https://wiki.php.net/rfc/locale_independent_float_to_string.
>
> Regards,
> George and Máté

Just a little thing, saying “certain locales (notably the German, French,
and Spanish ones)” perhaps understates the issue. Most
non-English-speaking countries do not use . as the decimal separator!

https://en.wikipedia.org/wiki/Decimal_separator#Arabic_numerals

Anyway, thank you for this RFC, I think fixing this is long, long overdue.
I expect there could be a serious BC impact for certain PHP users, but I
don't think there is a good long-term solution for this short of forcing
such code to be updated.

As a potential improvement, I would suggest to try to be closer to
comprehensive in documenting which functions' behaviour changes or does
not. For example, this RFC doesn't currently say what happens to settype()
and print_r().

Thanks,
Andrea
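
As a starting point for such a list, the sketch below runs one float through several conversions so their output can be compared under a given locale. It makes no claim about which of them were locale-sensitive before the change; that is what the RFC would need to document. `floatConversions()` is a hypothetical helper, and the selection of functions is illustrative rather than exhaustive.

```php
<?php
/**
 * Hypothetical helper: convert one float to a string in several ways
 * and return the results keyed by the conversion used.
 */
function floatConversions(float $value): array
{
    $viaSettype = $value;
    settype($viaSettype, 'string'); // converts the variable in place

    return [
        '(string) cast' => (string) $value,
        'strval()'      => strval($value),
        'settype()'     => $viaSettype,
        'print_r()'     => print_r($value, true),
        'var_export()'  => var_export($value, true),
        'json_encode()' => json_encode($value),
        'sprintf %F'    => sprintf('%F', $value),
    ];
}

// Under the "C" locale every conversion uses "." as the separator.
// Switching LC_NUMERIC before the call shows which entries differ.
setlocale(LC_NUMERIC, 'C');
print_r(floatConversions(3.5));
```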