See the discussion in PR #2232 which came about 3 months after it was merged. It caused a problem in an unusual situation which did not come to light until the change was part of the new release version. The original PR changed PhpSpreadsheet's behavior to match Excel's for (not case sensitive) strings `TRUE` and `FALSE`. Excel treats the values as boolean, and now so does PhpSpreadsheet.
When StringValueBinder is used, this becomes a tricky situation. The user wants the original strings preserved, including the case of all the letters. This PR changes the behavior of CSV reader as follows:
- If StringValueBinder is not in effect, convert to boolean.
- If StringValueBinder (actually any binder with method getBooleanConversion) is in effect, and the result of getBooleanConversion is true (which is the default in StringValueBinder), leave the value coming out of Csv Reader as the unchanged string.
- Otherwise, convert to boolean.
This should mean that there are no regression problems with StringValueBinder, while allowing PhpSpreadsheet to continue to match Excel in the default situation. No new settings are required.
* CSV - Guess Encoding, Handle Null-string Escape
This is in response to issue #1647 (detect CSV character encoding).
First, my tests with mb_detect_encoding indicate that it doesn't work
well enough; regardless, users can always do that on their own
if they deem it useful.
Rolling my own is also troublesome, but I can at least:
a. Check for BOM (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE).
b. Do some heuristic tests for each of the above encodings.
c. Fallback to a user-specified encoding (default CP1252)
if a and b don't yield result.
I think this is probably useful enough to include, and relatively
easy to expand if other potential encodings should be considered.
Starting with PHP7.4, fgetcsv allows specification of null string as
escape character in fgetcsv. This is a much better choice than the PHP
(and PhpSpreadsheet) default of backslash in that it handles the file
in the same manner as Excel does. There is one statement in Reader/CSV
which would be adversely affected if the caller so specified (building
a regular expression under the assumption that escape character is
a single character). Fix that statement appropriately and add tests.
File author erroneously assumed that backslash was used to escape
quotes in CSV; in fact, doubling the quote is used for escape.
The test still worked, but mainly because the content of the cell
with the escape wasn't tested. The file is now fixed, and
a new test added.
I believe that both CSV Reader and Writer are 100% covered now.
There were some errors uncovered during development.
The reader specifically permits encodings other than UTF-8 to be used.
However, fgetcsv will not properly handle other encodings.
I tried replacing it with fgets/iconv/strgetcsv, but that could not
handle line breaks within a cell, even for UTF-8.
This is, I'm sure, a very rare use case.
I eventually handled it by using php://memory to hold the translated
file contents for non-UTF8. There were no tests for this situation,
and now there are (probably too many).
"Contiguous" read was not handle correctly. There is a file
in samples which uses it. It was designed to read a large sheet,
and split it into three. The first sheet was corrrect, but the
second and third were almost entirely empty. This has been corrected,
and the sample code was adapted into a formal test with assertions
to confirm that it works as designed.
I made a minor documentation change. Unlike HTML, where you never
need a BOM because you can declare the encoding in the file,
a CSV with non-ASCII characters must explicitly include a BOM
for Excel to handle it correctly. This was explained in the Reading CSV
section, but was glossed over in the Writing CSV section, which I
have updated.
We now always trust the file extension to avoid false positive of mime
detection for most simple cases. But we still try to guess the mime type
if the file extension does not match or is missing.
Fixes#564
CSV reader used to accept any file without any kind of check. That made
users incorrectly believe that things were ok, even though there is no
way for CSV reader to read anything else that plain text files.
Fixes#167