Scrutinizer had previously suggested annotations for 3 LookupRef tests, but it no longer accepts its own annotation for those cases. This PR cleans them up. ColumnsTest and RowsTest are extremely straightforward.
IndexTest is a bit more complicated, but only because, unlike the other two, it had no test which executed in the context of a spreadsheet. And, when I added those, I discovered a couple of bugs. INDEX always requires at least 2 parameters (row# is always required), but its entry in the function table specified 1-4 parameters, now changed to 2-4. And, omitting col# is not handled the same way as specifying 0 for col#, though the code had treated them identically. (The same would have been true for row# but, because it is now required, ...)
* Xlsx Reader External Data Validations Flag Missing
Fix#2677. This PR supersedes #2679, written by @technghiath, which lacks tests, and probably doesn't solve the problem entirely. The code causing the problem appears to be the last remnant in Xlsx Reader which calls `children` using a namespace prefix rather than a namespace. That is changed, and tests are added where the tag is unexpectedly missing, and also where it uses a non-standard namespace prefix.
* Scrutinizer
Reports 1 "new" error. It isn't, but fix it anyhow.
* Fix One Existing Scrutinizer Problem
Only remaining problem in Reader/Xlsx.
* R1C1 Format and Internationalization, plus Relative Offsets
Fix#1704, albeit imperfectly. Excel's implementation of this feature makes it impossible to fix perfectly. I don't know why it was necessary to internationalize R1C1 in the first place - the benefits are so minimal,and the result is worksheets that break when opened in different locales. Ugh. I can't even find complete documentation about the format in different languages; I am using https://answers.microsoft.com/en-us/officeinsider/forum/all/indirect-function-is-broken-at-least-for-excel-in/1fcbcf20-a103-4172-abf1-2c0dfe848e60 as my definitive reference.
This fix concentrates on the original report, using the INDIRECT function; there may be other areas similarly affected. As with ambiguous date formats, PhpSpreadsheet will do a little better than Excel itself when reading spreadsheets with internationalized R1C1 by trying all possibilities before giving up. When it does give up, it will now return `#REF!`, as Excel does, rather than throwing an exception, which is certainly friendlier. Although read now works better, when writing it will use whatever the user specified, so spreadsheets breaking in the wrong locale will still happen.
There were some bugs that turned up as I added test cases, all of them concerning relative addressing in R1C1 format, e.g. `R[+1]C[-1]`. The regexp for validating the format allowed for minus signs, but not plus signs. Also, the relevant functions did not allow for passing the current cell address, which made relative addressing impossible. The code now allows these, and suitable test cases are added.
* Use Locale for Formats, but Not for XML
Implementing a suggestion from @MarkBaker to use the system locale for determining R1C1 format rather than looping through a set of regexes and accepting any that work. This is closer to how Excel itself operates. The assumption we are making is to use the first character of the translated ROW and COLUMN functions. This will not work for Russian or Bulgarian, where each starts with the same letter, but it appears that Russian, at least, still uses R1C1. So our algorithm will not use non-ASCII characters, nor characters where ROW and COLUMN start with the same letter, falling back to R/C in those cases. Turkish falls into that category. Czech uses an accented character for one of the functions, and I'm guessing to use the unaccented character in that case. Polish COLUMN function is NR.KOLUMNY, and I'm guessing to use K in that case.
The function that converts R1C1 references is also used by the XML reader *where the format is always R1C1*, not locale-based (confirmed by successfully opening in Excel an XML spreadsheet when my language is set to French). The conversion code now handles that distinction through the use of an extra parameter. Xml Reader Load Test is duplicated to confirm that spreadsheet is loaded properly whether the locale is English or French. (No, I did not add an INDIRECT function to the Xml spreadsheet.)
Tests CsvIssue2232Test and TranslationTest both changed locale without resetting it when done. That omission was exposed by the new code, and both are now corrected.
* OpenOffice and Gnumeric
OpenOffice and Gnumeric make it much easier to test with other languages - they can be handled with an environment variable. Sensibly, they require R and C as the characters for R1C1 notation regardless of the language. Change code to recognize this difference from Excel.
* Handle Output of ADDRESS Function
One other function has to deal with R1C1 format as a string. Unlike INDIRECT, which receives the string on input, ADDRESS generates the string on output. Ensure that the ADDRESS output is consistent with the INDIRECT input.
ADDRESS expects its 4th arg to be bool, but it can also accept int, and many examples on the net supply it as an int. This had not been handled properly, but is now corrected.
* More Structured Test
I earlier introduced a new test for relative R1C1 addressing. Rewrite it to be clearer.
* Add Row for This to Locale Spreadsheet
It took a while for me to figure out how it all works. I have added a new row (with English value `*RC`) to Translations.xlsx, in the "Lookup and Reference" section of sheet "Excel Functions". By starting the "function name" with an asterisk, it will not be confused with a "real" function (confirmed by a new test). This approach also gives us the flexibility to do something similar if another surprise case occurs in future; in particular, I think this is more flexible than adding this as another option on the "Excel Localisation" sheet. It also means that any errors or omissions in the list below will be handled as with any other translation problem, by updating the spreadsheet without needing to touch any code.
The spreadsheet has the following entries in the *RC row:
- first letter of ROW/COLUMN functions for da, de, es, fi, fr, hu, nl, nb, pt, pt_br, sv
- no value for locales where ROW/COLUMN functions start with same letter - bg, ru, tr
- no value for locales with a multi-part name for ROW and/or COLUMN - it, pl (I had not previously noted Italian as an exception)
- no value for locales where ROW and/or COLUMN starts with a non-ASCII character - cs (this would also apply to bg and ru which are already included under "same letter")
- it does nothing for locales which are defined on the "Excel Localisation" sheet but have no entries yet on the "Excel Functions" sheet (e.g. eu)
Note that all but the first bullet item will continue to use R/C, which leaves them no worse off than they were before this change.
Fix#1929. This was already substantially fixed, but there was a lingering problem with an unexpected leading space. It turns out there was also a problem with leading zeros, also fixed. There are also problems involving commas; fixing those seems too complicated to delay these changes, but I will add it to my to-do list.
This is a start in addressing issue #2965 (and earlier issue #749). Chart Titles are usually entered as strings or Rich Text strings, and PhpSpreadsheet supports that. They can also be entered as formulas (typically a pointer to a cell with the title text), and, not only did PhpSpreadsheet not support that, it threw an exception when reading a spreadsheet that did so.
This change does:
- eliminate the exception
- set a static chart title when it can determine it from the Xml
This change does not:
- fully support dynamic titles (e.g. if you change the contents of the source cell, or delete or insert cells or rows or columns)
- permit the user to set the title to a formula
- allow the use of formulas when writing a chart title to a spreadsheet
- provide styling for titles when it has read them as a formula
Fix#2934. Null is passed to StringHelper::strtolower which expects string. Same problem appears to be applicable to HLOOKUP.
I noted the following problem in the code, but will document it here as well. Excel's results are not consistent when a non-numeric string is passed as the third parameter. For example, if cell Z1 contains `xyz`, Excel will return a REF error for function `VLOOKUP(whatever,whatever,Z1)`, but it returns a VALUE error for function `VLOOKUP(whatever,whatever,"xyz")`. I don't think PhpSpreadsheet can match both behaviors. For now, it will return VALUE for both, with similar results for other errors.
While studying the returned errors, I realized there is something that needs to be deprecated. `ExcelError::$errorCodes` is a public static array. This means that a user can change its value, which should not be allowed. It is replaced by a constant. Since the original is public, I think it needs to stay, but with a deprecation notice; users can reference and change it, but it will be unused in the rest of the code. I suppose this might be considered a break in functionality (that should not have been allowed in the first place).
* Html Reader Not Handling non-ASCII Data Correctly
Fix#2942. Code was changed by #2894 because PHP8.2 will deprecate how it was being done. See linked issue for more details. Dom loadhtml assumes ISO-8859-1 in the absence of a charset attribute or equivalent, and there is no way to override that assumption. Sigh. The suggested replacements are unsuitable in one way or another. I think this will work with minimal disruption (replace ampersand, less than, and greater than with entities representing illegal characters, then use htmlentities, then restore ampersand, less than, and greater than).
* Better Implementation
Use regexp to escape non-ASCII. Less kludgey, less reliant on the vagaries of the PHP maintainers.
* Additional Tests
Test non-ASCII outside of cell contents: sheet title, image alt attribute.
* Apply Same Change in Second Location
Forgot to change loadFromString.
* Additional Test
Confirm escaped ampersand is handled correctly.
* Keep Calculated String Results Below 32K
This is the result of an investigation into issue #2884 (see also PR #2913). It is, unfortunately, not a fix for the original problem; see the discussion in that PR for why I don't think there is a practical fix for that specific problem at this time.
Excel limits strings to 32,767 characters. We already truncate strings to that length when added to the spreadsheet. However, we have been able to exceed that length as a result of the concatenation operator (Excel truncates); as a result of the CONCATENATE or TEXTJOIN functions (Excel returns #CALC!); or as a result of the REPLACE, REPT, SUBSTITUTE functions (Excel returns #VALUE!). This PR changes PhpSpreadsheet to return the same value as Excel in these cases. Note that Excel2003 truncates in all those cases; I don't think there is a way to differentiate that behavior in PhpSpreadsheet.
However, LibreOffice and Gnumeric do not have that limit; if they have a limit at all, it is much higher. It would be fairly easy to use existing settings to differentiate between Excel and LibreOffice/Gnumeric in this respect. I have not done so in this PR because I am not sure how useful that is, and I can easily see it leading to problems (read in a LibreOffice spreadsheet with a 33K cell and then output to an Excel spreadsheet). Perhaps it should be handled with an additional opt-in setting.
I changed the maximum size from a literal to a constant in the one place where it was already being enforced (Cell/DataType). I am not sure that is the best place for it to be defined; I am open to suggestions.
* Implement Some Suggestions
... from @MarkBaker.
Fix#2897. We have been relying on iconv/mb_convert_encoding to detect invalid UTF-8, but all techniques designed to validate UTF-8 seem to accept FFFE and FFFF. This PR explicitly converts those characters to FFFD (Unicode substitution character) before validating the rest of the string. It also substitutes one or more FFFD when it detects invalid UTF-8 character sequences.
A comment in the code being change stated that it doesn't handle surrogates. It is right not to do so. The only case where we should see surrogates is reading UTF-16. Additional tests are added to an existing test reading a UTF-16 Csv to demonstrate that surrogates are handled correctly, and that FFFE/FFFF are handled reasonably.
Fix#2908. When support for two-cell anchors was added for drawings, we neglected to adjust the second cell address when rows or columns are added or deleted. It also appears that "twoCell" and "oneCell" were introduced as lower-case literals when support for the editAs attribute was subsequently introduced.
* Ignore square-$-brackets prefix in format string
* Test for square-$-brackets prefix in format string issue fixed
* Fix for phpstan compliance
* Additional assert for checking number format of tested source cell
File from https://www.rondebruin.nl/win/s2/win003.htm. I have been in conversation with the author, who has no objection to its use. I have not actually opened the file in Excel (at least not with macros enabled); I am using it merely to demonstrate that the ribbon data is read and written correctly. Test added; no source code changed. This should slightly increase coverage for Reader/Xlsx (moderate), Writer/Xlsx (slight), and Spreadsheet (substantial). Note that this file has no Ribbon Bin objects, so some coverage is still lacking.
Mostly new tests, some code annotations, some minor code changes:
- RichText clone logic is wrong
- TextElement doesn't have object properties, doesn't need clone
Disabled it earlier because its reliance on an external site not under our control was causing problems. URL in spreadsheet is now changed to point to an image in phpspreadsheet.readthedocs.io, which should be more reliable. Test is re-enabled.
Fix#2810. Repairing some Phpstan diagnostics, used `?:` rather than `??` in a few places.
2 different Html modules are affected. Also, Ods Reader, but its problem is with sheet title rather than cell contents. And, as it turns out, Ods Reader was already not handling sheets with a title of `0` correctly - it made a truthy test before setting sheet title. That is now changed to truthy or numeric. Other readers are not susceptible to this problem. Tests are added.
* Real Errors Identified in Calculation by Scrutinizer
Before Scrutinizer broke, I took a look at the remaining 43 errors which it categorized as 'major'. Most of these were false positives, but, in the case of Calculation and Reader/Xlsx/Chart, I was able to determine that its analysis of some of the problems was correct. There is little point addressing the false positives until it starts working again, but we should fix the real errors.
This PR addresses the real errors in Calculation.
- A test for `$pCellParent` should have been a test for `$pCellWorksheet`.
- A test for `$operand1Data['reference']` should have been a test for `$operand1Data['value']`.
- A test for `$operand2Data['reference']` should have been a test for `$operand2Data['value']`.
* Fix Attempt to Erroneously Call trim on Array
Fix#2780. Warning message is being issued when getting calculated value of cell with value `=INDIRECT(ADDRESS(ROW(),COLUMN()-2))/$C$4`. This appears to be the case for all recent (and probably not so recent) releases; it is not a result of changes to the code base. Fix added to this PR because the erring section of code was proximate to code already changed in the PR. Test added.
* Minor Code Changes
Apply some suggestions from @MarkBaker
Fixed that bug, using a slightly faster algorithm for the sort index than the simple fix would have used, and modified the tests that didn't have the correct expected result :-(
Fix#2768. DateFormatter handles only one of six special formats for time intervals `[h] [hh] [m] [mm] [s] [ss]`. This PR extends support to the rest. There should be no more than one of these in any format string. Although it certainly could make sense to treat `[d] [dd]` in the same manner, Excel does not seem to support those.
Interesting observations - hours and minutes are truncated (presumably because they may be followed by minutes and seconds), but seconds are rounded. Also, there are some floating point issues, which fortunately showed up for the example in the original issue. There, the time interval was 1.15, which should evaluate to a minutes value of 1656 (as it does in Excel). However, on my system it evaluated to 1655 because of a rounding error in the 13th decimal place. To overcome this, values are rounded to 10 decimal places before truncating.
* Fix reading of files in the root of a zip
Xlsx.php relies in dirname($filename) for path generation. When path is a bare filename (i.e. files in the root of the zip file), dirname($filename) returns a relative path to the current directory ("."). This is ok for filesystems, but not when accesing contents in a zip file.
Xlsx documents with files in the root of the zip container are not common, but legit. I've found it to happen in files generated by Google Campaign Manager 360.
* Update Xlsx.php
* Update Xlsx.php
* Update CHANGELOG.md
* Add files via upload
* Create XlsxRootZipFilesTest.php
* Update XlsxRootZipFilesTest.php
* Add files via upload
* Delete rootZipFiles.xlsx
* Update XlsxRootZipFilesTest.php
* Update Xlsx.php