default_charset and mb_internal_encoding

  105034
April 2, 2019 09:42 nicolai.scheer@gmail.com (Nicolai Scheer)
Hi list,

I'm currently in the process of migrating an old application from php 5.6
to 7.2.
In the process, I fiddled with the default_charset ini setting.

The documentation states (c.f.
https://www.php.net/manual/en/ini.core.php#ini.default-charset):

"In PHP 5.6 onwards, "UTF-8" is the default value and [...] The value of
default_charset
will also be used to set the default character set for [...] and for
mbstring functions
if the mbstring.http_input mbstring.http_output mbstring.internal_encoding
configuration option is unset."

As such, I'd expect to be able to set default_charset to iso-8859-1 and
mbstring to pick that same setting for its internal encoding (if the
mentioned directives are unset, that is).

This seems not to be the case:


    
  105239
April 11, 2019 13:41 cmbecker69@gmx.de ("Christoph M. Becker")
On 02.04.2019 at 11:42, Nicolai Scheer wrote:

> I'm currently in the process of migrating an old application from php 5.6
> to 7.2.
> In the process, I fiddled with the default_charset ini setting.
>
> The documentation states (c.f.
> https://www.php.net/manual/en/ini.core.php#ini.default-charset):
>
> "In PHP 5.6 onwards, "UTF-8" is the default value and [...] The value of
> default_charset
> will also be used to set the default character set for [...] and for
> mbstring functions
> if the mbstring.http_input mbstring.http_output mbstring.internal_encoding
> configuration option is unset."
>
> As such, I'd expect to be able to set default_charset to iso-8859-1 and
> mbstring to pick that same setting for its internal encoding (if the
> mentioned directives are unset, that is).
>
> This seems not to be the case:
>
> ini_set( 'default_charset', 'iso-8859-1' );
> var_dump( ini_get("mbstring.internal_encoding") );
> var_dump( ini_get("mbstring.http_input") );
> var_dump( ini_get("mbstring.http_output") );
> echo mb_internal_encoding() . "\n";
> echo mb_strlen( "\xc3\xb6" ) . "\n";
> echo mb_strlen( "\xc3\xb6", '8bit' ) . "\n";
>
> This outputs (7.2.15 on a CentOS box):
> string(0) ""
> string(0) ""
> string(0) ""
> UTF-8
> 1
> 2
>
> The default_charset is set but mbstring settings are not, so I'd expect to
> get 2 as the character/byte count in both cases.
>
> If I throw a mb_internal_encoding("iso-8859-1") in the mix, both string
> lengths are equal.
>
> Since the mentioned mbstring directives are deprecated as of 5.6.0 - do I
> really need to use mb_internal_encoding() instead?
> Is the documentation wrong or am I just misinterpreting it? I thought that
> default_charset should act as some kind of "master setting" in order not to
> have to set all specific settings as well (e.g. iconv, mbstring).
>
> Usually we use UTF-8, so I did not come across this before...
>
> Any insight?

<https://3v4l.org/ZvQ67> confirms the reported behavior. A quick look
at the code, too. I suggest you file a ticket on <https://bugs.php.net/>.

Thanks,
Christoph M. Becker
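
For readers hitting the same behavior, the workaround mentioned in the original report (calling mb_internal_encoding() explicitly) can be sketched as follows. This is only an illustration, assuming the mbstring extension is loaded; the variable names are made up for the example.

```php
<?php
// Sketch of the workaround from the report: copy default_charset into
// mbstring's internal encoding by hand instead of relying on the fallback.
ini_set('default_charset', 'iso-8859-1');
mb_internal_encoding(ini_get('default_charset'));

// Two bytes: one character in UTF-8 ("ö"), two characters in ISO-8859-1.
$sample = "\xc3\xb6";

$chars = mb_strlen($sample);          // uses the internal encoding set above
$bytes = mb_strlen($sample, '8bit');  // always counts raw bytes

echo mb_internal_encoding(), "\n";
echo $chars, "\n";
echo $bytes, "\n";
```

With the internal encoding set explicitly, both counts should agree, which matches what the report describes.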
  105577
May 2, 2019 23:18 bjorn.x.larsson@telia.com (Björn Larsson)
On 2019-04-11 at 15:41, Christoph M. Becker wrote:
> <https://3v4l.org/ZvQ67> confirms the reported behavior. A quick look
> at the code, too. I suggest you file a ticket on <https://bugs.php.net/>.
>
> Thanks,
> Christoph M. Becker

Hi,

Did this lead to a bug report?

It led to a bug in Smarty 3.1.33 for me. I got a warning about
"mbregex compile err: invalid code point value" in mb_split().
I have content in ISO-8859-1, and Smarty's normal procedure for
setting the encoding, plus the php.ini setting of ISO-8859-1, failed.

However, mb_regex_encoding('ISO-8859-1') did the trick!

r//Björn L
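
A minimal sketch of the mb_regex_encoding() workaround described above. It does not reproduce Smarty's internals; the sample string and pattern are invented for illustration, and it assumes mbstring was built with mbregex support.

```php
<?php
// Tell mbregex which encoding patterns and subjects are in, rather than
// relying on it to follow default_charset.
mb_regex_encoding('ISO-8859-1');

// Hypothetical ISO-8859-1 content: \xe9 is "é" and \xe8 is "è" in Latin-1.
$text = "caf\xe9 cr\xe8me";

// mb_split() takes the pattern without delimiters.
$parts = mb_split('\s+', $text);

echo count($parts), "\n";
```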
  105579
May 3, 2019 09:43 cmbecker69@gmx.de ("Christoph M. Becker")
On 03.05.2019 at 01:18, Björn Larsson wrote:

> Did this lead to a bug report?

Hmm, apparently not.

> It lead to a bug in Smarty 3.1.33 for me. I got a warning about
> "mbregex compile err: invalid code point value" in mb_split().
> I have content in ISO-8859-1 and Smarty normal procedure to
> set encoding and php.ini setting to ISO-8859-1 flunked.
>
> However mb_regex_encoding('ISO-8859-1') did the trick!

While the RFC[1] states

| all functions that take encoding option use php.internal_encoding as
| default (e.g. htmlentities/mb_strlen/mb_regex/etc)

apparently this has not been implemented (yet).

[1] <https://wiki.php.net/rfc/default_encoding>

--
Christoph M. Becker
  105624
May 7, 2019 14:09 nikita.ppv@gmail.com (Nikita Popov)
On Fri, May 3, 2019 at 11:44 AM Christoph M. Becker <cmbecker69@gmx.de>
wrote:

> On 03.05.2019 at 01:18, Björn Larsson wrote:
>
> > Did this lead to a bug report?
>
> Hmm, apparently not.

This was reported as https://bugs.php.net/bug.php?id=77907 and will be
fixed in 7.4.

Nikita
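
Until an upgrade is possible, code that has to run on older versions could apply the explicit settings only where needed. The following is a rough sketch, not an official recommendation: the function name is hypothetical, the 7.4.0 cutoff is assumed from the statement above, and whether the fix also covers mb_regex_encoding() is not stated in this thread.

```php
<?php
/**
 * Hypothetical helper: on PHP versions before the announced fix, copy
 * default_charset into mbstring's internal and regex encodings.
 * Returns true if the settings were applied, false if nothing was done.
 */
function sync_mbstring_with_default_charset(int $phpVersionId = PHP_VERSION_ID): bool
{
    // Assumption: the fix for bug 77907 is present from 7.4.0 onwards.
    if ($phpVersionId >= 70400) {
        return false;
    }

    $charset = (string) ini_get('default_charset');
    if ($charset === '') {
        return false;
    }

    // Both calls return true on success when given an encoding name.
    return mb_internal_encoding($charset) && mb_regex_encoding($charset);
}

ini_set('default_charset', 'ISO-8859-1');
sync_mbstring_with_default_charset();
```

The version parameter exists only so the behavior can be exercised on any PHP version; in application code the default argument would normally be used.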
  105625
May 7, 2019 14:18 cmbecker69@gmx.de ("Christoph M. Becker")
On 07.05.2019 at 16:09, Nikita Popov wrote:

> On Fri, May 3, 2019 at 11:44 AM Christoph M. Becker <cmbecker69@gmx.de>
> wrote:
>
>> On 03.05.2019 at 01:18, Björn Larsson wrote:
>>
>>> Did this lead to a bug report?
>>
>> Hmm, apparently not.
>
> This was reported as https://bugs.php.net/bug.php?id=77907 and will be
> fixed in 7.4.

Thanks!

--
Christoph M. Becker