fputcsv() and $escape character

  100729
September 21, 2017 11:43 cmbecker69@gmx.de ("Christoph M. Becker")
Hi everybody!

There are several bug reports regarding "broken" fputcsv() behavior in
our tracker, namely, because the $escape parameter causes unexpected
results.  For instance:

    php://memory', 'w+');
    fputcsv($fp, $row);
    rewind($fp);
    echo stream_get_contents($fp);
    fclose($fp);
    ?>

outputs

    "a\"",bbb

instead of the expected

    "a\""",bbb

I don't think the current behavior is a bug, but rather the escape
character is an extension to the CSV "standard" (RFC 7111).  One can
practically disable the escape character by passing any character that
is not contained in any of the strings in the array; "\0" is usually a
good choice, so changing line 5 in the script above to the following
gives the desired behavior:

    fputcsv($fp, $row, ',', '"', "\0");

Cf. <https://3v4l.org/InlUv> vs. <https://3v4l.org/tVFBo>.

It is, however, not possible to pass an empty string as $escape
argument, because fputcsv() bails out in this case raising

  Warning: fputcsv(): escape must be a character

I suggest to allow an empty string instead, and to consider making this
the default in a future version, probably after some time of deprecating
any other $escape argument.

Thoughts?

-- 
Christoph M. Becker
  100737
September 21, 2017 17:32 andreas@dqxtech.net (Andreas Hennings)
So empty string would enable the standard behavior RFC 7111 with no escape char.
If so, I support this.

I don't know if '' or true / false / null should be this "special value".
It has to be something that was not legal before, and that people
should use intentionally and not by accident.
I guess '' is good enough for this, or not worse than other options.

One question: Does this also require a change in fgetcsv()?
So it can read csv without escape character?
I remember that fgetcsv() does currently tolerate some broken CSV. It
should continue to do so for BC reasons.

For the record, here are some links:
https://stackoverflow.com/questions/44427926/data-gets-garbled-when-writing-to-csv-with-fputcsv-fgetcsv/46342634#46342634
https://bugs.php.net/bug.php?id=74713&edit=2

On Thu, Sep 21, 2017 at 1:43 PM, Christoph M. Becker <cmbecker69@gmx.de> wrote:
> Hi everybody! > > There are several bug reports regarding "broken" fputcsv() behavior in > our tracker, namely, because the $escape parameter causes unexpected > results. For instance: > > $row = ['a\\"', 'bbb']; > > $fp = fopen('php://memory', 'w+'); > fputcsv($fp, $row); > rewind($fp); > echo stream_get_contents($fp); > fclose($fp); > ?> > > outputs > > "a\"",bbb > > instead of the expected > > "a\""",bbb > > I don't think the current behavior is a bug, but rather the escape > character is an extension to the CSV "standard" (RFC 7111). One can > practically disable the escape character by passing any character that > is not contained in any of the strings in the array; "\0" is usually a > good choice, so changing line 5 in the script above to the following > gives the desired behavior: > > fputcsv($fp, $row, ',', '"', "\0"); > > Cf. <https://3v4l.org/InlUv> vs. <https://3v4l.org/tVFBo>. > > It is, however, not possible to pass an empty string as $escape > argument, because fputcsv() bails out in this case raising > > Warning: fputcsv(): escape must be a character > > I suggest to allow an empty string instead, and to consider making this > the default in a future version, probably after some time of deprecating > any other $escape argument. > > Thoughts? > > -- > Christoph M. Becker > > -- > PHP Internals - PHP Runtime Development Mailing List > To unsubscribe, visit: http://www.php.net/unsub.php >
  100738
September 21, 2017 18:01 rowan.collins@gmail.com (Rowan Collins)
On 21 September 2017 18:32:57 BST, Andreas Hennings <andreas@dqxtech.net> wrote:
>So empty string would enable the standard behavior RFC 7111 with no >escape char. >If so, I support this.
Just a note regarding standards: please make sure the behaviour of common applications, most notably Excel, is considered and tested. In my experience it has its own quirks, and it's likely that a large proportion of users require interoperability with it. It may be there's nothing relevant to consider, but I thought I'd mention it, in case people get too caught up with a specification that nobody actually follows. Regards, -- Rowan Collins [IMSoP]
  100750
September 22, 2017 11:20 cmbecker69@gmx.de ("Christoph M. Becker")
On 21.09.2017 at 20:01, Rowan Collins wrote:

> On 21 September 2017 18:32:57 BST, Andreas Hennings <andreas@dqxtech.net> wrote: > >> So empty string would enable the standard behavior RFC 7111 with no >> escape char. >> If so, I support this. > > Just a note regarding standards:
Actually, there is no accepted standard regarding CSV. RFC 7111 is just an update to RFC 4180 (sorry, I've messed that up), but anyway both are just informational.
> please make sure the behaviour of common applications, most notably > Excel, is considered and tested. In my experience it has its own quirks, > and it's likely that a large proportion of users require > interoperability with it. It may be there's nothing relevant to > consider, but I thought I'd mention it, in case people get too caught up > with a specification that nobody actually follows.
That's exactly the point of this proposal. As it is, fputcsv() outputs CSV that won't be understood by Excel (or any other CSV aware implementation I'm aware of), as soon as the escape character is actually part of any string. So being able to avoid any such escaping would be helpful wrt. major CSV implementations, and making that the default even more so. The only other issue that I can see is that currently our fputcsv() hard-codes the line endings to LF (while RFC 4180 demands CRLF), but that may not be a real problem anymore. (Well, there might be issues with non ASCII compatible character encodings as well). -- Christoph M. Becker
  100758
September 22, 2017 18:39 andreas@dqxtech.net (Andreas Hennings)
On Fri, Sep 22, 2017 at 1:20 PM, Christoph M. Becker <cmbecker69@gmx.de> wrote:
>> please make sure the behaviour of common applications, most notably >> Excel, is considered and tested. In my experience it has its own quirks, >> and it's likely that a large proportion of users require >> interoperability with it. It may be there's nothing relevant to >> consider, but I thought I'd mention it, in case people get too caught up >> with a specification that nobody actually follows. > > That's exactly the point of this proposal. As it is, fputcsv() outputs > CSV that won't be understood by Excel (or any other CSV aware > implementation I'm aware of), as soon as the escape character is > actually part of any string. So being able to avoid any such escaping > would be helpful wrt. major CSV implementations, and making that the > default even more so.
The other problem that this proposal wants to fix is that CSV cannot currently be used reliably to store or transport data between different PHP scripts, due to this bug: https://bugs.php.net/bug.php?id=74713&edit=2
> If one cell of the data sent to fputcsv() contains "{$enclosure}{$escape_char}{$escape_char}{$enclosure}{$delimiter}", this cell will be split after a round trip of fputcsv() + fgetcsv().
> > The only other issue that I can see is that currently our fputcsv() > hard-codes the line endings to LF (while RFC 4180 demands CRLF), but > that may not be a real problem anymore. (Well, there might be issues > with non ASCII compatible character encodings as well).
I have opened a lot of php-generated CSV files with LibreOffice Calc, and the line endings have not been a problem. I assume Excel and OpenOffice would also be ok with it. I'd say stick with existing behavior for BC reasons, or discuss it separately.
  100740
September 21, 2017 21:08 danack@basereality.com (Dan Ackroyd)
On 21 September 2017 at 12:43, Christoph M. Becker <cmbecker69@gmx.de> wrote:
> Hi everybody! > > There are several bug reports regarding "broken" fputcsv() behavior in > our tracker, namely, because the $escape parameter causes unexpected > results. For instance:
I looked at fixing some of the CSV related bugs about a year or so ago. My conclusions were: i) There is no way to fix the problems that wouldn't cause horrible BC breaks for code that is only coincidentally working currently. ii) Handling strings in C is much more error prone than handling them in PHP. I'm reasonably certain that trying to fix the current functions is the wrong approach, and one of the following would be much better. Either, find a C library that has already been proven to handle CSV parsing/generating 'correctly' and bring that into PHP core under either new function names or namespace. Or, write the code in PHP (or just use https://github.com/thephpleague/csv) and find a way to make that fast enough for people to use. Touching the existing code is pretty certain to bring a lot of pain, without resulting in a fully compliant csv parser/generator. cheers Dan Ack
  100744
September 22, 2017 00:39 andreas@dqxtech.net (Andreas Hennings)
On Thu, Sep 21, 2017 at 11:08 PM, Dan Ackroyd <danack@basereality.com> wrote:
> On 21 September 2017 at 12:43, Christoph M. Becker <cmbecker69@gmx.de> wrote: >> Hi everybody! >> >> There are several bug reports regarding "broken" fputcsv() behavior in >> our tracker, namely, because the $escape parameter causes unexpected >> results. For instance: > > > I looked at fixing some of the CSV related bugs about a year or so > ago. My conclusions were: > > i) There is no way to fix the problems that wouldn't cause horrible BC > breaks for code that is only coincidentally working currently.
How so? What we can do: If $escape_char !== '', fputcsv() should work exactly as it does now. It can even use the same C code. If $escape_char === '', fputcsv() will have a different behavior than currently, where no character is treated as special. If we want to be 100% sure, these two cases can use completely separate C code, with one if() at the entry point.
> > ii) Handling strings in C is much more error prone than handling them in PHP. > > I'm reasonably certain that trying to fix the current functions is the > wrong approach, and one of the following would be much better. > > Either, find a C library that has already been proven to handle CSV > parsing/generating 'correctly' and bring that into PHP core under > either new function names or namespace.
Yeah why not. But it will require a lot more work and design decisions than just introducing a new allowed value for $enclosure parameter.
> > Or, write the code in PHP (or just use > https://github.com/thephpleague/csv) and find a way to make that fast > enough for people to use.
It can never be as fast as native functions. I vaguely remember something like factor 2 or 3 when I tried it, but for now treat this as hearsay.
> > Touching the existing code is pretty certain to bring a lot of pain, > without resulting in a fully compliant csv parser/generator. > > cheers > Dan > Ack > > -- > PHP Internals - PHP Runtime Development Mailing List > To unsubscribe, visit: http://www.php.net/unsub.php >
  100759
September 22, 2017 22:09 cmbecker69@gmx.de ("Christoph M. Becker")
On 21.09.2017 at 23:08, Dan Ackroyd wrote:

> On 21 September 2017 at 12:43, Christoph M. Becker <cmbecker69@gmx.de> wrote: > >> There are several bug reports regarding "broken" fputcsv() behavior in >> our tracker, namely, because the $escape parameter causes unexpected >> results. For instance: > > I looked at fixing some of the CSV related bugs about a year or so > ago. My conclusions were: > > i) There is no way to fix the problems that wouldn't cause horrible BC > breaks for code that is only coincidentally working currently. > > ii) Handling strings in C is much more error prone than handling them in PHP. > > I'm reasonably certain that trying to fix the current functions is the > wrong approach, and one of the following would be much better. > > Either, find a C library that has already been proven to handle CSV > parsing/generating 'correctly' and bring that into PHP core under > either new function names or namespace. > > Or, write the code in PHP (or just use > https://github.com/thephpleague/csv) and find a way to make that fast > enough for people to use. > > Touching the existing code is pretty certain to bring a lot of pain, > without resulting in a fully compliant csv parser/generator.
I agree that php_fgetcsv() has serious issues, and it might not be possible to fix it without causing severe BC breaks. php_fputcsv(), on the other hand, is less of a problem, though. Overall, the most demanding issue is that both functions try to regard the current locale, but already fail that generally, since several parameters are declared as char, which can't work for (some) multibyte encodings. For instance, it is impossible to generate proper UTF-16 encoded CSV files, or to read them. This issue continues, because several (mostly?) whitespace characters are hard-coded assuming an ASCII compatible character encoding. A minor issue are the hard-coded record terminators, which are currently LF (RFC 4180 specifies CRLF). Apparently, this isn't a real issue nowadays (besides the missing support for non ASCII compatible character encodings). Another issue concerns the escape character. Frankly, I don't even have the slightest clue how that is supposed to work, and why it even has been introduced in the first place. Maybe it has been introduced for compatibility with some application requiring it; maybe it has been introduced to support "DSV style"[1]. If the latter, at least php_fputcsv() doesn't support it (anymore). Unless it's clear what the escape character exactly is supposed to do, we *cannot* even *hope* to fix the implementation. Introducing new functions with a clearly defined behavior would be nice, but appears to me somewhat as pie in the sky (somebody would have to do the actual work!). But even if we do so, at least the actual behavior of the existing functions would have to be documented. And frankly, I don't see why it would be a problem to allow to use no escape character for fputcsv(). That certainly wouldn't be a BC break, since currently the function bails out if `escape_str_len < 1`. Of course, that wouldn't fix all issues, but it appears to make the function work as expected for ASCII compatible character encodings (for other character encodings the function appears to be broken anyway). Ad league/csv: rather impressive! However, including this functionality into ext/standard is totally over the top, in my opinion. I guess that fgetcsv() and fputcsv() are mostly used for importing from and exporting to CSV, respectively, but not as replacement for an SQL database engine. [1] <http://www.catb.org/~esr/writings/taoup/html/ch05s02.html> -- Christoph M. Becker
  100745
September 22, 2017 00:55 andreas@dqxtech.net (Andreas Hennings)
On Thu, Sep 21, 2017 at 1:43 PM, Christoph M. Becker <cmbecker69@gmx.de> wrote:
> I don't think the current behavior is a bug, but rather the escape > character is an extension to the CSV "standard" (RFC 7111).
Are you sure you mean RFC 7111 ? I was just parrotting the number in my previous email, but now I looked it up and only find this: https://tools.ietf.org/html/rfc7111 This talks about uri fragments with CSV, not CSV itself. RFC 4180 seems to be closer to what we are looking for: https://tools.ietf.org/html/rfc4180#section-2 fputcsv() and fgetcsv() already have a number of extensions to this format, which I think are not harmful and that we should keep: - option to choose a different delimiter - option to choose a different enclosure - option to have a different number of cells per row. - fgetcsv() has some tolerance for broken CSV, that we should continue to support. In the stackoverflow discussion, someone argues that line breaks are not part of the standard / not portable: https://stackoverflow.com/questions/44427926/data-gets-garbled-when-writing-to-csv-with-fputcsv-fgetcsv/46342634#comment75882780_44427926 However, the RFC 4180 that I found clearly mentions them:
> 6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
So, all we would change is the escape behavior. In RFC 4180, it says:
> If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.