Fix#2897. We have been relying on iconv/mb_convert_encoding to detect invalid UTF-8, but all techniques designed to validate UTF-8 seem to accept FFFE and FFFF. This PR explicitly converts those characters to FFFD (Unicode substitution character) before validating the rest of the string. It also substitutes one or more FFFD when it detects invalid UTF-8 character sequences.
A comment in the code being change stated that it doesn't handle surrogates. It is right not to do so. The only case where we should see surrogates is reading UTF-16. Additional tests are added to an existing test reading a UTF-16 Csv to demonstrate that surrogates are handled correctly, and that FFFE/FFFF are handled reasonably.
These changes have already been implemented twice, and been regressed twice. I'll try once more (with a different approach), then give up ...
As configured, Phpstan running under Php7 reports no errors. However, running under Php8, it reports 100 (!) errors. The vast majority of these are due to two reasons:
- renaming parameters in Php builtin functions in preparation for named parameters.
- using the new class GdImage rather than type resource as the argument type for many image-based functions.
Regardless of the cause, this will be a problem sooner or later. This PR is an attempt to get ahead of that problem. For source members, it mostly adds annotations or updates doc-blocks. Only 2 members have changes to executable code, and these are very minor - BitWise and Writer/Xlsx. For test members, all baseline errors are deleted and the code is fixed. Php7 and Php8 both report no errors with this configuration.
This request does not change any source code, only tests.
For a change on which I was working, a test passed when run on its own,
but failed when run as part of the full test suite. It turned out that
an existing test had changed a static value,
thousands separator in this case, and failed to restore it.
The test turned out to be AdvancedBinderTest.
The search for the offending test was more difficult than it should have
been because 26 test scripts which had nothing to do with thousands
separator nevertheless changed that value. They all changed
decimal separator, currency code, and compatibility mode as well,
again for no reason. I changed all of those to eliminate those operations.
I changed the following tests, which actually do change the static
properties identified above for a reason, to restore them as part of teardown.
- CalculationTest sets compatibilityMode and locale
- DayTest sets compatibilityMode, returnDateType, and excelCalendar
- CountTest sets compatibilityMode
- FunctionsTest sets compatibilityMode and returnDateType
- AdvancedValueBinderTest sets currencyCode, decimalSeparator, thousandsSeparator
- StringHelperTest sets currencyCode, decimalSeparator, thousandsSeparator
- NumberFormatTest sets currencyCode, decimalSeparator, thousandsSeparator
- HtmlNumberFormatTest sets currencyCode, decimalSeparator, thousandsSeparator
Because even if it doesn't make a difference in practice, it is
technically more correct to call static methods statically. It
also better advertise that those methods can be used from any context.
Control characters in cell values are automatically escaped without
the need to excplicitly call `StringHelper::buildCharacterSets()` beforehand.
Fixes#212