Re: [PHP-DEV] [Discussion] Scalar Object Strings and Multibyte Encodings

  106010
June 20, 2019 21:19 rowan.collins@gmail.com (Rowan Collins)
On 20/06/2019 16:36, Mark Randall wrote:
> "Hello".substr(1) // would work as expected regardless of encoding
As I always point out when "multi-byte support" or "Unicode support" is discussed, it's often ambiguous just what should be "expected".

A lot of systems go from "each character is one byte" to "each character is one code point", but that leads to what I call "the noël problem": if you reverse the string "noël", the expected behaviour is probably for the diaeresis to stay on the "e". However, if it is encoded as a combining diacritic, a code point based implementation will place it onto the "l" instead. Similarly, taking the first three "characters" should give "noë" not "noe".

Enforcing normalisation helps in this case, because there is a composed form of e+diaeresis, but that's not true for all combinations ("graphemes") you can encode, or for all operations.

Another example is "length"; what practical purpose does "number of code points" serve, when some of those code points may be combiners or non-printing marks? Often, number of bytes (in some encoding, such as UTF-8) is actually the relevant measure; other times, "width on screen" is what is actually required, and very hard to compute.

My point is that any attempt to make the language "do the right thing by default" needs serious thought on what "the right thing" is.

Regards,
-- 
Rowan Collins
[IMSoP]
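The "noël problem" described above can be reproduced in a few lines. This is only a sketch, assuming PHP 7 or later with the mbstring and intl extensions loaded and PCRE built with Unicode support. The helpers `reverse_code_points()` and `reverse_graphemes()` are hypothetical names introduced for illustration, not existing PHP functions.

```php
<?php
// "noël" with the diaeresis as a combining mark: "e" followed by U+0308.
$decomposed = "noe\u{0308}l";

// Hypothetical helper: reverse one code point at a time, as a naive
// "each character is one code point" implementation would.
function reverse_code_points(string $s): string
{
    $points = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_reverse($points));
}

// Hypothetical helper: reverse one grapheme cluster at a time, using
// PCRE's \X, which matches an extended grapheme cluster.
function reverse_graphemes(string $s): string
{
    preg_match_all('/\X/u', $s, $matches);
    return implode('', array_reverse($matches[0]));
}

// The combining mark now follows the "l", so it renders on the wrong letter.
echo reverse_code_points($decomposed), "\n"; // "l" + U+0308 + "eon"

// The mark stays attached to the "e".
echo reverse_graphemes($decomposed), "\n";   // "l" + "e" + U+0308 + "on"

// First three code points: the diaeresis is cut off.
echo mb_substr($decomposed, 0, 3, 'UTF-8'), "\n"; // "noe"

// Normalising to NFC first helps here, because e + diaeresis has a
// composed form (U+00EB); that is not true of every combination.
$composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);
echo mb_substr($composed, 0, 3, 'UTF-8'), "\n";   // "no" + U+00EB
```

Which of these results is "expected" depends on the operation and on the data, which is the ambiguity at issue.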
  106011
June 20, 2019 22:30 markyr@gmail.com (Mark Randall)
On 20/06/2019 22:19, Rowan Collins wrote:
> On 20/06/2019 16:36, Mark Randall wrote:
>> "Hello".substr(1) // would work as expected regardless of encoding
>
> As I always point out when "multi-byte support" or "Unicode support" is
> discussed, it's often ambiguous just what should be "expected".
>
> My point is that any attempt to make the language "do the right thing by
> default" needs serious thought on what "the right thing" is.
Without a doubt, and I expect people will have terrible flashbacks to PHP6 discussions when thinking about it. It will require a consensus that I have no power to aid or influence.

There does at least seem to be a starting point, in that mbstring is already widely used. My suggestion that it "work as expected" means that it would work as the equivalent mbstring / iconv function would.

mb_strlen returns the number of code points, for example. I'm not immediately seeing anything about mbstring supporting graphemes; the only reference I could find to their manipulation was the intl extension.

-- 
Mark Randall
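For comparison, the three notions of "length" discussed in this thread can be put side by side. This is a minimal sketch, assuming the mbstring and intl extensions are both available; the sample string is chosen purely for illustration.

```php
<?php
// "noël" with a combining diaeresis: "e" followed by U+0308.
$s = "noe\u{0308}l";

echo strlen($s), "\n";             // 6: bytes in UTF-8 (U+0308 takes two)
echo mb_strlen($s, 'UTF-8'), "\n"; // 5: code points (mbstring)
echo grapheme_strlen($s), "\n";    // 4: grapheme clusters (intl)

// intl's grapheme_substr counts in grapheme clusters, so the first three
// "characters" keep the diaeresis together with its "e".
$firstThree = grapheme_substr($s, 0, 3);
echo strlen($firstThree), "\n";    // 5: "n", "o", "e", plus two bytes for U+0308
```

A scalar-object `substr()` modelled on `mb_substr` would therefore behave like the code point case, not the grapheme case.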