[VOTE] Locale-independent case conversion

  116497
November 25, 2021 05:05 tstarling@wikimedia.org (Tim Starling)
Voting is now open for my RFC on locale-independent case conversion.

https://wiki.php.net/rfc/strtolower-ascii

Voting will close in two weeks, on 2021-12-09.

-- Tim Starling
  116499
November 25, 2021 08:55 nicolas.grekas+php@gmail.com (Nicolas Grekas)
On Thu, 25 Nov 2021 at 06:05, Tim Starling <tstarling@wikimedia.org> wrote:

> Voting is now open for my RFC on locale-independent case conversion.
>
> https://wiki.php.net/rfc/strtolower-ascii
>
> Voting will close in two weeks, on 2021-12-09.

Hi Tim,

I voted yes because I want to see this happen, but I raised a point in https://externals.io/message/116141#116259 and didn't get an answer:

> Despite their name, I never used natcase functions for natural language processing. I use them e.g. to sort lists of files in a directory, mainly to account for numbers. But that's not what I would call natural language processing. I'm not aware of anyone using them for that, actually. I'm wondering if it's a good idea to postpone migrating them to a hypothetical future since, to me, the whole reasoning of the RFC applies to them.

I wish the strnatcasecmp() and natcasesort() functions, as well as the SORT_NATURAL flag, were also covered by this RFC.
Is that possible? Would it make sense?

Nicolas
  116503
November 25, 2021 09:47 tstarling@wikimedia.org (Tim Starling)
On 25/11/21 7:55 pm, Nicolas Grekas wrote:
> I voted yes because I want to see this happen but I raised a point in https://externals.io/message/116141#116259 and didn't get an answer:
>
> > Despite their name, I never used natcase functions for natural language processing. I use them e.g. to sort lists of files in a directory, to account for numbers mainly. But that's not what I would call natural language processing. I'm not aware of anyone using them for that actually. I'm wondering if it's a good idea to postpone migrating them to a hypothetical future as to me, the whole reasoning of the RFC applies to them.
>
> I wish the strnatcasecmp() and natcasesort() functions, but also the SORT_NATURAL flag were also covered by this RFC.
> Is that possible? Would it make sense?

I'm not going to migrate those functions at this time. It's just a project scope decision.

-- Tim Starling
  116504
November 25, 2021 09:58 nicolas.grekas+php@gmail.com (Nicolas Grekas)
On Thu, 25 Nov 2021 at 10:47, Tim Starling <tstarling@wikimedia.org> wrote:

> On 25/11/21 7:55 pm, Nicolas Grekas wrote:
>
> > I voted yes because I want to see this happen but I raised a point in https://externals.io/message/116141#116259 and didn't get an answer:
> >
> > > Despite their name, I never used natcase functions for natural language processing. I use them e.g. to sort lists of files in a directory, to account for numbers mainly. But that's not what I would call natural language processing. I'm not aware of anyone using them for that actually. I'm wondering if it's a good idea to postpone migrating them to a hypothetical future as to me, the whole reasoning of the RFC applies to them.
> >
> > I wish the strnatcasecmp() and natcasesort() functions, but also the SORT_NATURAL flag were also covered by this RFC.
> > Is that possible? Would it make sense?
>
> I'm not going to migrate those functions at this time. It's just a project scope decision.

Why? The RFC says:

> because they also use isdigit() and isspace(),

Does that mean "too much work needed"? I would totally understand that, of course, but I hope someone can go that last mile.

> and because they are intended for natural language processing

I definitely do not agree with this argument, and in my view it should be removed from the RFC, as it might cause confusion in the future.

Nicolas
  116505
November 25, 2021 10:34 cmbecker69@gmx.de ("Christoph M. Becker")
On 25.11.2021 at 10:58, Nicolas Grekas wrote:

> On Thu, 25 Nov 2021 at 10:47, Tim Starling <tstarling@wikimedia.org> wrote:
>
> > and because they are intended for natural language processing
>
> I definitely do not agree with this argument and it should be removed from the RFC to me as it might add confusion in the future.

Yeah, the PHP manual says[1]:

| This function implements a comparison algorithm that orders
| alphanumeric strings in the way a human being would, this is described
| as a "natural ordering".

[1] <https://www.php.net/manual/en/function.strnatcmp.php>

-- Christoph M. Becker
  116506
November 25, 2021 10:39 nicolas.grekas+php@gmail.com (Nicolas Grekas)
On Thu, 25 Nov 2021 at 11:34, Christoph M. Becker <cmbecker69@gmx.de> wrote:

> On 25.11.2021 at 10:58, Nicolas Grekas wrote:
>
> > On Thu, 25 Nov 2021 at 10:47, Tim Starling <tstarling@wikimedia.org> wrote:
> >
> > > and because they are intended for natural language processing
> >
> > I definitely do not agree with this argument and it should be removed from the RFC to me as it might add confusion in the future.
>
> Yeah, the PHP manual says[1]:
>
> | This function implements a comparison algorithm that orders
> | alphanumeric strings in the way a human being would, this is described
> | as a "natural ordering".
>
> [1] <https://www.php.net/manual/en/function.strnatcmp.php>

Yep, yet "natural language processing" means processing sentences we write as humans, e.g. processing this very message. The natcase sorting functions are not meant for that. They're useful for sorting items in a list, typically file names, in a way that makes sense to humans. This is very different from "natural language processing".

Having "natsort" vary by locale doesn't make more sense than having "sort()" vary by locale. That's my point.

The argument doesn't stand against implementing locale-insensitivity for these functions, and I think the RFC shouldn't make that argument.

Nicolas
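
For readers following the file-name point, here is a small sketch of the difference between byte-wise and natural ordering. The file names are made up for illustration, and the orderings in the comments assume plain ASCII input, where locale plays no part.

```php
<?php
// Hypothetical file names, chosen only to show the ordering difference.
$files = ["img12.png", "IMG10.png", "img2.png", "Img1.png"];

// Byte-wise comparison: uppercase sorts before lowercase, "12" before "2".
$bytewise = $files;
sort($bytewise, SORT_STRING);

// Case-insensitive natural order: digit runs are compared as numbers.
$natural = $files;
natcasesort($natural);              // preserves the original keys
$natural = array_values($natural);  // reindex for display

print_r($bytewise); // IMG10.png, Img1.png, img12.png, img2.png
print_r($natural);  // Img1.png, img2.png, IMG10.png, img12.png
```

Nothing in this use case involves natural language; only the case folding and the digit handling matter.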
  116508
November 25, 2021 11:23 tstarling@wikimedia.org (Tim Starling)
On 25/11/21 8:58 pm, Nicolas Grekas wrote:
> The RFC says:
>
> > because they also use isdigit() and isspace(),
>
> Does that mean "too much work needed"? I would totally understand that of course but I hope someone could do these last miles.

Yes.

> > and because they are intended for natural language processing
>
> I definitely do not agree with this argument and it should be removed from the RFC to me as it might add confusion in the future.

Done.

-- Tim Starling
  116509
November 25, 2021 11:31 nicolas.grekas+php@gmail.com (Nicolas Grekas)
On Thu, 25 Nov 2021 at 12:23, Tim Starling <tstarling@wikimedia.org> wrote:

> On 25/11/21 8:58 pm, Nicolas Grekas wrote:
>
> > The RFC says:
> >
> > > because they also use isdigit() and isspace(),
> >
> > Does that mean "too much work needed"? I would totally understand that of course but I hope someone could do these last miles.
>
> Yes.
>
> > > and because they are intended for natural language processing
> >
> > I definitely do not agree with this argument and it should be removed from the RFC to me as it might add confusion in the future.
>
> Done.

Great, thanks!
  116500
November 25, 2021 08:57 come@chilliet.eu (Côme Chilliet)
On Thursday, 25 November 2021, 06:05:37 CET, Tim Starling wrote:
> Voting is now open for my RFC on locale-independent case conversion.
>
> https://wiki.php.net/rfc/strtolower-ascii

Hello,

The RFC is missing information about alternatives:
Do all of these functions have an mbstring version?
Are those locale-dependent, or do they have an option for it?

To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?

Côme
  116501
November 25, 2021 09:08 divinity76@gmail.com (Hans Henrik Bergan)
btw why is this code *not* getting dotted capital i on 3v4l?
https://3v4l.org/D1WG1#v7.4.26
it gets ["res_hex"]=> string(2) "49"

<?php
setlocale(LC_ALL, "Turkish");
$str = "i";
$res = strtoupper($str);
var_dump([
    "str"=>$str,
    "str_hex"=>bin2hex($str),
    "res"=>$res,
    "res_hex"=>bin2hex($res),
]);
?>

On Thu, 25 Nov 2021 at 09:57, Côme Chilliet <come@chilliet.eu> wrote:

> On Thursday, 25 November 2021, 06:05:37 CET, Tim Starling wrote:
> > Voting is now open for my RFC on locale-independent case conversion.
> >
> > https://wiki.php.net/rfc/strtolower-ascii
>
> Hello,
>
> The RFC is missing information about alternatives:
> Do all of these functions have an mbstring version?
> Are those locale-dependent or have an option for it?
>
> To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?
>
> Côme
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: https://www.php.net/unsub.php
  116502
November 25, 2021 09:17 dusk@woofle.net (Dusk)
On Nov 25, 2021, at 01:08, Hans Henrik Bergan <divinity76@gmail.com> wrote:
> btw why is this code *not* getting dotted capital i on 3v4l?
> https://3v4l.org/D1WG1#v7.4.26
> it gets ["res_hex"]=> string(2) "49"
>
> setlocale(LC_ALL, "Turkish");

Because "Turkish" isn't a locale. "tr_TR" is.

https://3v4l.org/GD91W#v7.4.26

Notice that the output doesn't show up correctly, as it is not UTF-8. (Which is part of the problem addressed by this RFC.)
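
A quick way to see which locale names a given system accepts is to check the return value of setlocale(), which is false when the name is unknown. This is only a sketch: the candidate names below are examples, and which of them exist depends entirely on the operating system and its installed locales.

```php
<?php
// setlocale() returns the applied locale name, or false if the name is unknown.
// Which names exist depends on the operating system and installed locales.
$candidates = ["Turkish", "tr_TR", "tr_TR.UTF-8"];
$results = [];
foreach ($candidates as $name) {
    $results[$name] = setlocale(LC_CTYPE, $name);
    printf(
        "%-12s => %s\n",
        $name,
        $results[$name] === false ? "not available" : $results[$name]
    );
}

// Restore the default "C" locale so later string functions are unaffected.
setlocale(LC_CTYPE, "C");
```

Since a failed call leaves the previous locale in place without raising an error, code that relies on a locale should check this return value.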
  116507
November 25, 2021 11:14 tstarling@wikimedia.org (Tim Starling)
On 25/11/21 7:57 pm, Côme Chilliet wrote:
> Hello,
>
> The RFC is missing information about alternatives:
> Do all of these functions have an mbstring version?

The following functions have an mbstring version: strtolower, strtoupper, stristr, stripos, strripos.

mb_convert_case() provides functionality equivalent to lcfirst, ucfirst and ucwords.

There is no mbstring version of str_ireplace; that is https://bugs.php.net/bug.php?id=75225

There is no mbstring equivalent for the array sorting functions with SORT_FLAG_CASE, but there is Collator::asort() in intl.

> Are those locale-dependent or have an option for it?

The mbstring functions are locale-independent. Unfortunately there do not seem to be PHP wrappers for the family of case conversion functions in ICU's ustring.h. There are IntlChar::tolower() and IntlChar::toupper(), but they provide locale-independent case conversion, equivalent to mbstring. It's not ideal to change the case of a string character by character, since some languages have multi-character mappings. ICU calls this context-sensitive case conversion.

Considering the lack of wide character support or context-sensitive case conversion in the existing strtoupper/strtolower, I would consider this missing functionality rather than functionality which I am removing.

> To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?

For case-insensitive comparison you can use Collator. But for display you just have to do it yourself. For the Turkish Wikipedia and other Turkic language websites we are currently using str_replace().

-- Tim Starling
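
The alternatives listed above could look roughly like the following. This is a minimal sketch, assuming the mbstring extension is loaded (and intl for the Collator part) and that the source file is saved as UTF-8. The turkish_upper() helper is hypothetical and only illustrates the str_replace() idea; it is not the code Wikimedia actually runs.

```php
<?php
// Assumes the mbstring extension; the Collator part also needs intl.

// Locale-independent, multibyte-aware case conversion and search.
$upper = mb_strtoupper("café", "UTF-8");                          // "CAFÉ"
$title = mb_convert_case("hello world", MB_CASE_TITLE, "UTF-8");  // "Hello World"
$pos   = mb_stripos("Crème Brûlée", "BRÛLÉE", 0, "UTF-8");        // 6

// Hypothetical helper: map the dotted and dotless i by hand, then let
// mb_strtoupper() handle every other character.
function turkish_upper(string $s): string
{
    return mb_strtoupper(str_replace(["i", "ı"], ["İ", "I"], $s), "UTF-8");
}

echo turkish_upper("istanbul"), "\n"; // İSTANBUL

// Case-insensitive sorting through intl's Collator, if it is available.
$fruits = ["banana", "apple", "Cherry"];
if (class_exists("Collator")) {
    $collator = new Collator("en_US");
    $collator->asort($fruits);          // keeps keys, like asort()
    print_r(array_values($fruits));     // apple, banana, Cherry
}
```

The byte-level str_replace() is safe on UTF-8 input here because the ASCII byte for "i" never occurs inside a multi-byte sequence.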
  116510
November 25, 2021 12:34 paul.crovella@gmail.com (Paul Crovella)
On Thu, Nov 25, 2021 at 3:14 AM Tim Starling <tstarling@wikimedia.org> wrote:
> On 25/11/21 7:57 pm, Côme Chilliet wrote:
>
> > To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?
>
> For case-insensitive comparison you can use Collator. But for display you just have to do it yourself. For the Turkish Wikipedia and other Turkic language websites we are currently using str_replace().

Any particular reason not to use transliterators? https://3v4l.org/I038T
  116518
November 25, 2021 22:26 tstarling@wikimedia.org (Tim Starling)
On 25/11/21 11:34 pm, Paul Crovella wrote:
> On Thu, Nov 25, 2021 at 3:14 AM Tim Starling <tstarling@wikimedia.org> wrote:
> > On 25/11/21 7:57 pm, Côme Chilliet wrote:
> >
> > > To reuse the example from the RFC, if I want to convert a UTF string to uppercase using Turkish rules and get dotted capital I, what should I use?
> >
> > For case-insensitive comparison you can use Collator. But for display you just have to do it yourself. For the Turkish Wikipedia and other Turkic language websites we are currently using str_replace().
>
> Any particular reason not to use transliterators? https://3v4l.org/I038T

Thanks, I missed that. You would need to do your own mapping from language code to transliterator name, since it only has converters for az/tr, el, lt and "Any", with no fallbacks. For example, if you did Transliterator::create("en-Upper")->transliterate('a') you would get a fatal error.

Presumably if I submitted a PR adding wrappers for u_strToUpper() etc., it would not be rejected on the basis that we already have transliterators.

-- Tim Starling
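
One way to handle the missing fallback is to try the language-specific ID first and fall back to "Any-Upper" when create() returns null. The sketch below assumes the intl extension is loaded; upper_for_language() is a hypothetical helper name, not an existing PHP function.

```php
<?php
// Assumes the intl extension. upper_for_language() is a hypothetical helper.
function upper_for_language(string $lang, string $text): string
{
    // Transliterator::create() returns null for unknown IDs, so fall back
    // to the language-neutral "Any-Upper" instead of calling a method on null.
    $t = Transliterator::create($lang . "-Upper")
        ?? Transliterator::create("Any-Upper");
    if ($t === null) {
        throw new RuntimeException("No upper-case transliterator available");
    }

    $result = $t->transliterate($text);
    if ($result === false) {
        throw new RuntimeException("Transliteration failed");
    }
    return $result;
}

if (extension_loaded("intl")) {
    echo upper_for_language("tr", "istanbul"), "\n"; // İSTANBUL
    echo upper_for_language("en", "istanbul"), "\n"; // ISTANBUL
}
```

Throwing on failure is a design choice for the sketch; returning the input unchanged would be an equally reasonable policy for display code.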
  116511
November 25, 2021 14:22 Danack@basereality.com (Dan Ackroyd)
On Thu, 25 Nov 2021 at 05:05, Tim Starling <tstarling@wikimedia.org> wrote:
> Voting is now open for my RFC on locale-independent case conversion.

It seems popular, and likely to pass, but I voted no as the "Backward Incompatible Changes" section is missing, which makes it hard to evaluate the impact.

cheers
Dan Ack