PhpSpreadsheet

Commit Graph

Author	SHA1	Message	Date
oleibman	c3f53854b6	Php/iconv Should Not Treat FFFE/FFFF as Valid (#2910 ) Fix #2897. We have been relying on iconv/mb_convert_encoding to detect invalid UTF-8, but all techniques designed to validate UTF-8 seem to accept FFFE and FFFF. This PR explicitly converts those characters to FFFD (Unicode substitution character) before validating the rest of the string. It also substitutes one or more FFFD when it detects invalid UTF-8 character sequences. A comment in the code being change stated that it doesn't handle surrogates. It is right not to do so. The only case where we should see surrogates is reading UTF-16. Additional tests are added to an existing test reading a UTF-16 Csv to demonstrate that surrogates are handled correctly, and that FFFE/FFFF are handled reasonably.	2022-07-02 08:53:39 -07:00
oleibman	e768cb0f19	CSV - Guess Encoding, Handle Null-string Escape (#1717 ) * CSV - Guess Encoding, Handle Null-string Escape This is in response to issue #1647 (detect CSV character encoding). First, my tests with mb_detect_encoding indicate that it doesn't work well enough; regardless, users can always do that on their own if they deem it useful. Rolling my own is also troublesome, but I can at least: a. Check for BOM (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE). b. Do some heuristic tests for each of the above encodings. c. Fallback to a user-specified encoding (default CP1252) if a and b don't yield result. I think this is probably useful enough to include, and relatively easy to expand if other potential encodings should be considered. Starting with PHP7.4, fgetcsv allows specification of null string as escape character in fgetcsv. This is a much better choice than the PHP (and PhpSpreadsheet) default of backslash in that it handles the file in the same manner as Excel does. There is one statement in Reader/CSV which would be adversely affected if the caller so specified (building a regular expression under the assumption that escape character is a single character). Fix that statement appropriately and add tests.	2020-12-25 17:47:29 +01:00

Author

SHA1

Message

Date

oleibman

c3f53854b6

Php/iconv Should Not Treat FFFE/FFFF as Valid (#2910 )

Fix #2897. We have been relying on iconv/mb_convert_encoding to detect invalid UTF-8, but all techniques designed to validate UTF-8 seem to accept FFFE and FFFF. This PR explicitly converts those characters to FFFD (Unicode substitution character) before validating the rest of the string. It also substitutes one or more FFFD when it detects invalid UTF-8 character sequences.

A comment in the code being change stated that it doesn't handle surrogates. It is right not to do so. The only case where we should see surrogates is reading UTF-16. Additional tests are added to an existing test reading a UTF-16 Csv to demonstrate that surrogates are handled correctly, and that FFFE/FFFF are handled reasonably.

2022-07-02 08:53:39 -07:00

oleibman

e768cb0f19

CSV - Guess Encoding, Handle Null-string Escape (#1717 )

* CSV - Guess Encoding, Handle Null-string Escape

This is in response to issue #1647 (detect CSV character encoding).
First, my tests with mb_detect_encoding indicate that it doesn't work
well enough; regardless, users can always do that on their own
if they deem it useful.
Rolling my own is also troublesome, but I can at least:
a. Check for BOM (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE).
b. Do some heuristic tests for each of the above encodings.
c. Fallback to a user-specified encoding (default CP1252)
  if a and b don't yield result.
I think this is probably useful enough to include, and relatively
easy to expand if other potential encodings should be considered.

Starting with PHP7.4, fgetcsv allows specification of null string as
escape character in fgetcsv. This is a much better choice than the PHP
(and PhpSpreadsheet) default of backslash in that it handles the file
in the same manner as Excel does. There is one statement in Reader/CSV
which would be adversely affected if the caller so specified (building
a regular expression under the assumption that escape character is
a single character). Fix that statement appropriately and add tests.

2020-12-25 17:47:29 +01:00

2 Commits