Re: [PHP-DEV] Proposal for a new basic function: str_contains

  108569
February 14, 2020 11:53 nikita.ppv@gmail.com (Nikita Popov)
On Fri, Feb 14, 2020 at 10:18 AM Philipp Tanlak wrote:

> Hello PHP Devs,
>
> I would like to propose the new basic function: str_contains.
>
> The goal of this proposal is to standardize on a function to check whether
> or not a string is contained in another string, which is a very common
> use case in almost every PHP project.
> PHP frameworks like Laravel create helper functions for this behavior
> because it is so ubiquitous.
>
> There are currently a couple of approaches to create such a behavior, most
> commonly:
>
>     strpos($haystack, $needle) !== false;
>     strstr($haystack, $needle) !== false;
>     preg_match('/' . $needle . '/', $haystack) != 0;
>
> All of these functions serve the same purpose but are either not intuitive,
> easy to get wrong (especially with the !== comparison) or hard to remember
> for new PHP developers.
>
> The proposed signature for this function follows the conventions of other
> signatures of string functions and should look like this:
>
>     str_contains(string $haystack, string $needle): bool
>
> This function is very easy to implement, has no side effects or backward
> compatibility issues.
> I've implemented this feature and created a pull request on GitHub (Link:
> https://github.com/php/php-src/pull/5179).
>
> To get this function into the PHP core, I will open up an RFC for this.
> But first, I would like to get your opinions and consensus on this
> proposal.
>
> What are your opinions on this proposal?

Sounds good to me. This operation is needed often enough that it deserves
a dedicated function.

I'd recommend leaving the proposal at only str_contains(), in particular:

 * Do not propose a case-insensitive variant. I believe this is really the
   point on which the last str_starts_with/str_ends_with proposal failed.

 * Do not propose mb_str_contains(). Especially as no offsets are
   involved, there is no reason to have this function. (For UTF-8, the
   behavior would be exactly equivalent to str_contains.)

Regards,
Nikita
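
A short sketch of two of the pitfalls named in the proposal (the loose comparison and the unescaped regex needle). The sample strings and variable names are made up for illustration and are not taken from the pull request:

```php
<?php
// Pitfall 1: strpos() returns int 0 for a match at the very start,
// and 0 compares loosely equal to false.
$haystack = 'needle in a haystack';
$loose  = strpos($haystack, 'needle') != false;   // loose comparison: wrong answer
$strict = strpos($haystack, 'needle') !== false;  // strict comparison: correct answer

// Pitfall 2: an unescaped needle is interpreted as a regex pattern.
$needle   = 'a.c';
$unquoted = preg_match('/' . $needle . '/', 'abc') != 0;                   // "." matches "b"
$quoted   = preg_match('/' . preg_quote($needle, '/') . '/', 'abc') != 0;  // literal match only

var_dump($loose, $strict, $unquoted, $quoted);
```

A dedicated boolean function sidesteps both problems, since the caller never sees an offset and the needle is never treated as a pattern.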
  108577
February 14, 2020 14:14 philipp.tanlak@gmail.com (Philipp Tanlak)
I'd like to elaborate on Nikita's response:

I don't think an mb_str_contains is necessary, because the proposed
function does not behave differently if the input strings are multibyte
strings. When searching for a multibyte string in another multibyte
string, the return value would consistently be true/false. The
position/offset at which the multibyte string was found is not relevant.
The reason for the existence of strpos/mb_strpos is the fact that the
returned position/offset varies depending on whether or not the string
is a multibyte string.

The only possible valid variants concerning multibyte and
case-insensitivity I see are:

 * str_contains: works as expected with multibyte and non-multibyte
   strings.
 * mb_str_icontains: is the only valid option to do a case-insensitive
   search for multibyte strings.

Unneeded variants I see are:

 * mb_str_contains: does not behave differently when compared to
   str_contains, as mentioned above.
 * str_icontains: is a possible option but could be error-prone when
   used with multibyte strings like UTF-8, as it is de facto the
   standard nowadays.

I'm certain there would be confusion among PHP developers when the newly
proposed functions are only str_contains and mb_str_icontains.

Patrick ALLAERT:

> Yes, it does have one: people having already defined a str_contains()
> function in the global scope will have a PHP Fatal error: Cannot
> redeclare str_contains()

You are absolutely correct with this. Although functions added by
frameworks to the global scope are usually guarded by:

    if (!function_exists('str_contains')) {}
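
To make the guard and the offset argument concrete, here is a minimal sketch of a guarded userland fallback. The function body is an assumption for illustration (it is not the code from the pull request), and returning true for an empty needle is a choice made here to avoid calling strpos() with an empty string:

```php
<?php
// Hypothetical userland fallback: only defined when no str_contains()
// exists yet, so it cannot trigger "Cannot redeclare str_contains()".
if (!function_exists('str_contains')) {
    function str_contains(string $haystack, string $needle): bool
    {
        // Treat the empty needle as always contained, without calling strpos().
        return $needle === '' || strpos($haystack, $needle) !== false;
    }
}

// "Täöüßtstring" in UTF-8, written with code point escapes (PHP 7+).
$haystack = "T\u{E4}\u{F6}\u{FC}\u{DF}tstring";

// strpos() reports a byte offset, which differs from the character offset
// for multibyte input. The boolean answer does not depend on the offset.
$byteOffset = strpos($haystack, 'string');
$found      = str_contains($haystack, "\u{F6}\u{FC}");

var_dump($byteOffset, $found);
```

The four accented characters take two bytes each in UTF-8, which is why the byte offset of "string" is larger than its character position.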
  108814
March 2, 2020 19:09 ajf@ajf.me (Andrea Faulds)
Hi,

Philipp Tanlak wrote:
> I like to elaborate on Nikitas response: I don't think a mb_str_contains is
> necessary, because the proposed function does not behave differently, if
> the input strings are multibyte strings.

This is not true for all character encodings. For UTF-8 it is correct,
but consider for example the Japanese encoding Shift_JIS, where the
second byte of a multi-byte character can be a valid first byte of a
single-byte character. str_contains() would have incorrect behaviour for
this case.

Regards,
Andrea Faulds
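
A minimal sketch of the Shift_JIS case, assuming the katakana character "ソ" (U+30BD), whose Shift_JIS encoding is the byte pair 0x83 0x5C. The choice of character and the optional mbstring call are assumptions for illustration:

```php
<?php
// Shift_JIS bytes for the katakana character "ソ" (U+30BD): 0x83 0x5C.
// The trail byte 0x5C is also the single-byte code that ASCII uses for "\".
$haystack = "\x83\x5C";
$needle   = "\x5C";

// Byte-wise search: reports a match although the haystack consists of
// one two-byte character only.
$byteMatch = strpos($haystack, $needle) !== false;
var_dump($byteMatch);

// Encoding-aware search, only attempted if ext/mbstring is loaded.
if (function_exists('mb_strpos')) {
    var_dump(mb_strpos($haystack, $needle, 0, 'SJIS') !== false);
}
```

UTF-8 does not have this problem because lead bytes and continuation bytes come from disjoint ranges, so a valid needle cannot match across a character boundary.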
  108648
February 17, 2020 13:37 pierre.php@gmail.com (Pierre Joye)
hello,

Btw, while some mbstring references were mentioned, I do like the ICU
search implementation as well.

http://userguide.icu-project.org/collation/icu-string-search-service

It handles a lot of cases based on locales.
  108649
February 17, 2020 14:23 rowan.collins@gmail.com (Rowan Tommins)
On Mon, 17 Feb 2020 at 13:38, Pierre Joye wrote:

> Btw, while some mbstring references were mentioned, I do like the ICU
> search implementation as well.
>
> http://userguide.icu-project.org/collation/icu-string-search-service
>
> It handles a lot of cases based on locales.

That's a lovely example of why treating Unicode as a character encoding is
the wrong mindset.

I would love to see more people using ext/intl rather than ext/mbstring,
and more ICU features like this being included.

Regards,
--
Rowan Tommins
[IMSoP]
  108816
March 3, 2020 08:46 andreas@heigl.org (Andreas Heigl)
Hey all.

Just a short note why I voted against the current implementation of the
str_contains functionality.

While it is mainly aimed at being a mere convenience-function that could
also be easily implemented in userland it misses one main thing IMO when
handling unicode-strings: Normalization.

It is correct, that the binary representation of the string "äöüß"
within the string "Täöüßtstring" seems to be the same and that a simple
`strpos('Täöüßtstring', 'äöüß')` results in a not-false result.

But with Unicode it might be that the two strings are using different
normalizations. So for the human eye the two strings look (almost)
identical but internally they are completely different (and even
mb_strpos might not be able to detect the similarity).

See https://3v4l.org/fasO4 for more information.
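
A minimal sketch of the kind of mismatch meant here. This is not the code behind the 3v4l link; the sample strings and the use of ext/intl's Normalizer class are assumptions for illustration:

```php
<?php
$nfc = "\u{00E4}";    // "ä" precomposed (NFC): bytes C3 A4
$nfd = "a\u{0308}";   // "a" + combining diaeresis (NFD): bytes 61 CC 88

$haystack = "T" . $nfd . "st";

// Byte-wise search misses the visually identical needle.
$bytewise = strpos($haystack, $nfc) !== false;
var_dump($bytewise);

// With ext/intl available, both sides can be brought to the same
// normalization form before searching.
if (class_exists('Normalizer')) {
    $h = Normalizer::normalize($haystack, Normalizer::FORM_C);
    $n = Normalizer::normalize($nfc, Normalizer::FORM_C);
    var_dump(strpos($h, $n) !== false);
}
```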

As we are creating new functionality here it would have been great to
solve this issue. But as it is IMO merely a convenience add on that can
easily be implemented in userland I vote against it.

Cheers

Andreas

  108819
March 3, 2020 10:04 rowan.collins@gmail.com (Rowan Tommins)
On Tue, 3 Mar 2020 at 08:46, Andreas Heigl <andreas@heigl.org> wrote:

> While it is mainly aimed at being a mere convenience-function that could
> also be easily implemented in userland it misses one main thing IMO when
> handling unicode-strings: Normalization.

While I would love to see more functionality for handling Unicode which
didn't treat it as just another character set, I don't think sprinkling it
into the main string functions of the language would be the right approach.
Even if we changed all the existing functions to be "Unicode-aware", as was
planned for PHP 6, the resulting API would not handle all cases correctly.

In this case, a Unicode-based string API ought to provide at least two
variants of "contains", as options or separate functions:

- a version which matches on code point, for answering queries like "does
  this string contain right-to-left override characters?"
- at least one form of normalization, but probably several

If there was serious work on a new string API in progress, a freeze on
additions to the current API would make sense; but right now, the
byte-based string API is what we have, and I think this function is a
sensible addition to it.

Regards,
--
Rowan Tommins
[IMSoP]
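
As a rough illustration of the code point query mentioned above, here is a sketch that checks for U+202E RIGHT-TO-LEFT OVERRIDE in UTF-8 input. The file names and variable names are made up for the example:

```php
<?php
// U+202E RIGHT-TO-LEFT OVERRIDE, written with a PHP 7+ code point escape.
$rlo        = "\u{202E}";
$suspicious = "invoice" . $rlo . "fdp.exe";
$plain      = "invoice.pdf";

// For valid UTF-8 input, a byte-wise search for the encoded code point
// cannot match in the middle of another character.
$viaBytes = strpos($suspicious, $rlo) !== false;

// The same check phrased as a code point match with PCRE in UTF-8 mode.
$viaRegex = preg_match('/\x{202E}/u', $suspicious) === 1;

var_dump($viaBytes, $viaRegex, strpos($plain, $rlo) !== false);
```

For UTF-8 the byte-based check is enough for this query; the normalization variant is the one that needs more than a byte comparison.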
  108821
March 3, 2020 13:29 nicolas.grekas+php@gmail.com (Nicolas Grekas)
FYI, I wrote a String handling lib, shipped as Symfony String:
- doc: https://symfony.com/doc/current/components/string.html
- src: https://github.com/symfony/string

TL;DR, it provides 3 classes of value objects, dealing with bytes, code
points and grapheme cluster (~= normalized unicode)

It makes no sense to have `str_contains()` or any global function able to
deal with Unicode normalization *unless* the PHP string values embed their
unit system (one of: bytes, codepoints or graphemes).

With this rationale, I agree with Rowan: PHP's native string functions deal
with bytes. So should str_contains(). Other unit systems can be implemented
in userland (until PHP implements something similar to Symfony String in
core - but that's another topic.)

Nicolas
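
The three unit systems can be sketched with built-in functions only. This does not use Symfony String; counting via PCRE in UTF-8 mode is an assumption chosen here to keep the example free of extra extensions:

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
$s = "e\u{0301}";

$bytes      = strlen($s);                   // unit: bytes
$codePoints = preg_match_all('/./su', $s);  // unit: code points
$graphemes  = preg_match_all('/\X/u', $s);  // unit: extended grapheme clusters

var_dump($bytes, $codePoints, $graphemes);
```

The same string has three different lengths depending on the unit, which is why a plain PHP string cannot tell a global function which kind of "contains" is wanted.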
  108822
March 3, 2020 13:53 andreas@heigl.org (Andreas Heigl)
str_contains as it is currently implemented can also easily be
implemented in userland. That was my reasoning. I would think otherwise
if it took Unicode into account, as that is much harder to implement in
userland.

And I didn't want to start a new discussion; I merely wanted to explain
the reasoning behind my decision.

Cheers

Andreas