Commit Graph

11 Commits

Author SHA1 Message Date
oleibman 5de82981d8
Html Reader Not Handling non-ASCII Data Correctly (#2943)
* Html Reader Not Handling non-ASCII Data Correctly

Fix #2942. Code was changed by #2894 because PHP8.2 will deprecate how it was being done. See linked issue for more details. Dom loadhtml assumes ISO-8859-1 in the absence of a charset attribute or equivalent, and there is no way to override that assumption. Sigh. The suggested replacements are unsuitable in one way or another. I think this will work with minimal disruption (replace ampersand, less than, and greater than with entities representing illegal characters, then use htmlentities, then restore ampersand, less than, and greater than).

* Better Implementation

Use regexp to escape non-ASCII. Less kludgey, less reliant on the vagaries of the PHP maintainers.

* Additional Tests

Test non-ASCII outside of cell contents: sheet title, image alt attribute.

* Apply Same Change in Second Location

Forgot to change loadFromString.

* Additional Test

Confirm escaped ampersand is handled correctly.
2022-07-16 22:08:44 -07:00
oleibman c936f1d9f8
Coverage Improvements (#2859)
Mostly new tests, some code annotations, some minor code changes:
- RichText clone logic is wrong
- TextElement doesn't have object properties, doesn't need clone
2022-06-01 08:29:56 -07:00
oleibman 070bc68514
Html Reader Converting Cell Containing 0 to Null String (#2813)
Fix #2810. Repairing some Phpstan diagnostics, used `?:` rather than `??` in a few places.

2 different Html modules are affected. Also, Ods Reader, but its problem is with sheet title rather than cell contents. And, as it turns out, Ods Reader was already not handling sheets with a title of `0` correctly - it made a truthy test before setting sheet title. That is now changed to truthy or numeric. Other readers are not susceptible to this problem. Tests are added.
2022-05-10 07:33:45 -07:00
Mark Baker 05466e99ce
Html import dimension conversions (#2152)
Allows basic column width conversion when importing from Html that includes UoM... while not overly-sophisticated in converting units to MS Excel's column width units, it should allow import without errors

Also provides a general conversion helper class, and allows column width getters/setters to specify a UoM for easier usage
2021-06-11 17:29:49 +02:00
oleibman cc5c0205d5
Fix for Issue 2029 (Invalid Cell Coordinate A-1) (#2032)
* Fix for Issue 2029 (Invalid Cell Coordinate A-1)

Fix for #2021. When Html Reader encounters an embedded table, it tries to shift it up a row. It obviously should not attempt to shift it above row 1. @danmodini reported the problem, and suggests the correct solution. This PR implements that and adds a test case.

Performing some additional testing, I found that Html Reader cannot handle inline column width or row height set in points rather than pixels (and HTML writer with useInlineCss generates these values in points). It also doesn't handle border style when the border width (which it ignores) is omitted. Fixed and added tests.
2021-04-29 22:59:01 +02:00
Adrien Crivelli 49f87de165
Reduce PHPStan error in tests 2021-04-12 11:10:23 +09:00
oleibman cb23cca3ec
Avoid Duplicate Titles When Reading Multiple HTML Files (#1829)
This issue arose while researching issue #1823. The issue was not a bug;
it just required clarification to the author of how to use the software.
But, while researching, I discovered that loading html into 2
sheets of a spreadsheet has a problem if the html title tag is the same
for the 2 sheets. PhpSpreadsheet would be able to save the resulting file,
but Excel would not be able to read it properly because of the duplicate title.
The worksheet setTitle method allows for disambiguation is such a circumstance.
The html reader passed a parameter indicating "don't disambiguate", but I can't
see any harm in changing that to "disambiguate". An extremely simple fix,
with tests to back it up.
2021-02-27 15:10:04 +01:00
Adrien Crivelli 6a41381c1d
PSR12 code style 2020-07-26 14:13:11 +09:00
Adrien Crivelli 4739f8b2e7
Merge branch 'readhtml' 2020-07-26 13:11:15 +09:00
Owen Leibman 752a0a5a6c Scrutinizer Recommendations
Two unneeded assignments in tests, one unused parameter in source code.
2020-06-25 23:11:30 -07:00
Owen Leibman 6080c4561d Improve Coverage for HTML Reader
Reader/Html is now covered except for 1 statement.
There is some coverage of RichText when you know in advance that the
html will expand into a single cell.
It is a tougher nut, one that I have not yet cracked,
to try to handle rich text while converting unkown html to multiple cells.
The original author left this as a TODO, and so for now must I.

It made sense to restructure some of the code. There are some changes.
- Issue #1532 is fixed (links are now saved when using rowspan).
- Colors can now be specified as html color name. To accomplish this,
  Helper/Html function colourNameLookup was changed from protected
  to public, and changed to static.
- Superfluous empty lines were eliminated in a number of places, e.g.
  <ul><li>A</li><li>B</li><li>C</li></ul>
  had formerly caused a wrapped cell to be created with 2 empty lines
  followed by A, B, and C on separate lines; it will now just have the
  3 A/B/C lines, which seems like a more sensible interpretation.
- Img alt tag, which had been cast to float, is now used as a string.

Private member "encoding" is not used. Functions getEncoding and setEncoding
have therefore been marked deprecated. In fact, I was unable to get
SecurityScanner to pass *any* html which is not UTF-8. There are
possibly ways of getting around this (in Reader/Html - I have no
intention of messing with Security Scanner), as can be seen in my
companion pull request for Excel2003 Xml Reader. Doing this would be
easier for ASCII-compatible character sets (like ISO-8859-1),
than for non-compatible charsets (like UTF-16). I am not
convinced that the effort is worth it, but am willing to investigate
further.

I added a number of tests, creating an Html directory, and moving
HtmlTest to that directory.
2020-06-25 22:42:38 -07:00