Re: [PHP-DEV] [Discussion] Scalar Object Strings and Multibyte Encodings

June 22, 2019 13:32 (Rowan Collins)
On 20/06/2019 23:30, Mark Randall wrote:
> There does at least seem to be the starting point in that mb_string is
> already widely used, and my suggestion that it "work as expected" is
> more that it would work as the equivalent mb_string / iconv function
> would.

I think this is a rather short-sighted way of looking at it. If people want the API provided by the mbstring extension, they can just use those functions; the advantage of designing a new set of functions is surely that we don't need to stick to past decisions. If we start to build a new standard library, as Zeev suggested in the deprecation thread, it is a once-in-a-lifetime chance to build something better, not just copy what's gone before.

> mb_strlen returns the number of codepoints for example, I'm not
> immediately seeing anything about mb_string supporting Graphemes as
> the only reference I could find to their manipulation was The intl
> extension.

The mbstring extension was not built for Unicode, but for older Japanese multi-byte encodings, where the definition of "character" is much more straightforward. Its Unicode support seems mostly to treat code points as mappings for characters in some other encoding. (The oldest manual page for it on [1] is from 2001, and includes the quaint remark "As Unicode is getting popular, UTF-8 is used also.")

The iconv library is even more explicitly aimed at converting between character sets, rather than understanding them (the extra functions such as iconv_strlen are unique to PHP).

Unicode today is much more than a mapping of legacy encodings to a universal character set, and I can think of no useful purpose in declaring the "string length" of the British flag emoji to be 2, just because it is encoded as the sequence U+1F1EC U+1F1E7.

[1]

Regards,

--
Rowan Collins
[IMSoP]
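
To make the flag example concrete, here is a minimal sketch comparing the three notions of "length" discussed above. It assumes PHP 7 or later (for the `\u{...}` escape) with both the mbstring and intl extensions loaded; the variable names are only illustrative.

```php
<?php
// Regional indicator symbols G (U+1F1EC) and B (U+1F1E7),
// which together render as the British flag.
$flag = "\u{1F1EC}\u{1F1E7}";

// Bytes in the UTF-8 encoding: each of these code points takes 4 bytes.
$bytes = strlen($flag);

// Code points, which is what mbstring counts.
$codePoints = mb_strlen($flag, 'UTF-8');

// Grapheme clusters (user-perceived characters), counted by intl.
$graphemes = grapheme_strlen($flag);

printf("bytes=%d code points=%d graphemes=%d\n", $bytes, $codePoints, $graphemes);
// prints: bytes=8 code points=2 graphemes=1
```

Only the last figure matches what a user would call one character, and it comes from intl rather than mbstring or iconv.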