Re: [PHP-DEV] Proposal for a new basic function: str_contains

  108819
March 3, 2020 10:04 rowan.collins@gmail.com (Rowan Tommins)
On Tue, 3 Mar 2020 at 08:46, Andreas Heigl <andreas@heigl.org> wrote:

> While it is mainly aimed at being a mere convenience-function that could
> also be easily implemented in userland it misses one main thing IMO when
> handling unicode-strings: Normalization.

While I would love to see more functionality for handling Unicode which didn't treat it as just another character set, I don't think sprinkling it into the main string functions of the language would be the right approach. Even if we changed all the existing functions to be "Unicode-aware", as was planned for PHP 6, the resulting API would not handle all cases correctly.

In this case, a Unicode-based string API ought to provide at least two variants of "contains", as options or separate functions:

- a version which matches on code point, for answering queries like "does this string contain right-to-left override characters?"
- at least one form of normalization, but probably several

If there was serious work on a new string API in progress, a freeze on additions to the current API would make sense; but right now, the byte-based string API is what we have, and I think this function is a sensible addition to it.

Regards,
-- 
Rowan Tommins
[IMSoP]
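
To illustrate the distinction drawn above, here is a minimal sketch of how a byte-based "contains" behaves on Unicode text. The helper name `bytes_contain()` is hypothetical, and the sample strings are made up for the example. The normalized comparison assumes the intl extension's `Normalizer` class is available, so it is guarded with `class_exists()`.

```php
<?php
// Hypothetical helper: a byte-wise "contains", in the spirit of the proposal.
function bytes_contain(string $haystack, string $needle): bool
{
    return $needle === '' || strpos($haystack, $needle) !== false;
}

$precomposed = "\u{00E9}";      // "é" as a single code point
$decomposed  = "cafe\u{0301}";  // "café" written as "e" + combining acute accent

// Same text to a reader, different byte sequences: the byte-wise check fails.
$byteMatch = bytes_contain($decomposed, $precomposed);

// A code-point query such as "is there a right-to-left override (U+202E)?"
// works byte-wise on valid UTF-8, because UTF-8 is self-synchronizing.
$hasRlo = bytes_contain("abc\u{202E}def", "\u{202E}");

// A normalized comparison needs an explicit normalization step first.
// Assumption: ext-intl is installed; otherwise this part is skipped.
$normalizedMatch = null;
if (class_exists('Normalizer')) {
    $normalizedMatch = bytes_contain(
        (string) Normalizer::normalize($decomposed, Normalizer::FORM_C),
        (string) Normalizer::normalize($precomposed, Normalizer::FORM_C)
    );
}

var_dump($byteMatch, $hasRlo, $normalizedMatch);
```

The point of the sketch is only that the two questions (code-point match and normalized match) need different handling, which is why a single byte-based function cannot answer both.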
  108821
March 3, 2020 13:29 nicolas.grekas+php@gmail.com (Nicolas Grekas)
On Tue, 3 Mar 2020 at 11:04, Rowan Tommins <rowan.collins@gmail.com> wrote:

> On Tue, 3 Mar 2020 at 08:46, Andreas Heigl <andreas@heigl.org> wrote:
>
> > While it is mainly aimed at being a mere convenience-function that could
> > also be easily implemented in userland it misses one main thing IMO when
> > handling unicode-strings: Normalization.
>
> While I would love to see more functionality for handling Unicode which
> didn't treat it as just another character set, I don't think sprinkling it
> into the main string functions of the language would be the right approach.
> Even if we changed all the existing functions to be "Unicode-aware", as was
> planned for PHP 6, the resulting API would not handle all cases correctly.
>
> In this case, a Unicode-based string API ought to provide at least two
> variants of "contains", as options or separate functions:
>
> - a version which matches on code point, for answering queries like "does
>   this string contain right-to-left override characters?"
> - at least one form of normalization, but probably several
>
> If there was serious work on a new string API in progress, a freeze on
> additions to the current API would make sense; but right now, the
> byte-based string API is what we have, and I think this function is a
> sensible addition to it.

FYI, I wrote a String handling lib, shipped as Symfony String:
- doc: https://symfony.com/doc/current/components/string.html
- src: https://github.com/symfony/string

TL;DR, it provides 3 classes of value objects, dealing with bytes, code points and grapheme clusters (~= normalized unicode)

It makes no sense to have `str_contains()` or any global function able to deal with Unicode normalization *unless* the PHP string values embed their unit system (one of: bytes, codepoints or graphemes).

With this rationale, I agree with Rowan: PHP's native string functions deal with bytes. So should str_contains(). Other unit systems can be implemented in userland (until PHP implements something similar to Symfony String in core - but that's another topic.)

Nicolas
  108822
March 3, 2020 13:53 andreas@heigl.org (Andreas Heigl)
On 03.03.20 at 14:29, Nicolas Grekas wrote:
> On Tue, 3 Mar 2020 at 11:04, Rowan Tommins <rowan.collins@gmail.com> wrote:
>
>> On Tue, 3 Mar 2020 at 08:46, Andreas Heigl <andreas@heigl.org> wrote:
>>
>>> While it is mainly aimed at being a mere convenience-function that could
>>> also be easily implemented in userland it misses one main thing IMO when
>>> handling unicode-strings: Normalization.
>>
>> While I would love to see more functionality for handling Unicode which
>> didn't treat it as just another character set, I don't think sprinkling it
>> into the main string functions of the language would be the right approach.
>> Even if we changed all the existing functions to be "Unicode-aware", as was
>> planned for PHP 6, the resulting API would not handle all cases correctly.
>>
>> In this case, a Unicode-based string API ought to provide at least two
>> variants of "contains", as options or separate functions:
>>
>> - a version which matches on code point, for answering queries like "does
>>   this string contain right-to-left override characters?"
>> - at least one form of normalization, but probably several
>>
>> If there was serious work on a new string API in progress, a freeze on
>> additions to the current API would make sense; but right now, the
>> byte-based string API is what we have, and I think this function is a
>> sensible addition to it.
>
> FYI, I wrote a String handling lib, shipped as Symfony String:
> - doc: https://symfony.com/doc/current/components/string.html
> - src: https://github.com/symfony/string
>
> TL;DR, it provides 3 classes of value objects, dealing with bytes, code
> points and grapheme cluster (~= normalized unicode)
>
> It makes no sense to have `str_contains()` or any global function able to
> deal with Unicode normalization *unless* the PHP string values embed their
> unit system (one of: bytes, codepoints or graphemes).
>
> With this rationale, I agree with Rowan: PHP's native string functions deal
> with bytes. So should str_contains(). Other unit systems can be implemented
> in userland (until PHP implements something similar to Symfony String in
> core - but that's another topic.)

str_contains as it is currently implemented can also easily be implemented in userland. That was my reasoning. I would think otherwise if it took Unicode into account, as that is much harder to implement in userland.

And I didn't want to start a new discussion; I merely wanted to explain the reasoning behind my decision.

Cheers

Andreas
-- 
                                                              ,,,
                                                             (o o)
+---------------------------------------------------------ooO-(_)-Ooo-+
| Andreas Heigl                                                       |
| mailto:andreas@heigl.org                  N 50°22'59.5" E 08°23'58" |
| http://andreas.heigl.org                       http://hei.gl/wiFKy7 |
+---------------------------------------------------------------------+
| http://hei.gl/root-ca                                               |
+---------------------------------------------------------------------+
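
For reference, the userland implementation mentioned above could look roughly like the sketch below. The function name `userland_str_contains()` is hypothetical, chosen so that it does not clash with a native `str_contains()`. Treating an empty needle as "always found" is an assumption of this sketch, not something stated in the thread.

```php
<?php
// Hypothetical userland equivalent of a byte-based str_contains().
function userland_str_contains(string $haystack, string $needle): bool
{
    // Assumption: an empty needle counts as contained. The short-circuit also
    // avoids the "Empty needle" warning that strpos() raises on PHP 7.
    if ($needle === '') {
        return true;
    }

    // strpos() returns int(0) for a match at the start, so compare strictly.
    return strpos($haystack, $needle) !== false;
}

var_dump(userland_str_contains('str_contains', 'str'));  // match at offset 0
var_dump(userland_str_contains('str_contains', 'xyz'));  // no match
```

The strict `!== false` comparison is the part that is easy to get wrong by hand, since a match at offset 0 is falsy under a loose comparison.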