PhpSpreadsheet

Commit Graph

Author	SHA1	Message	Date
oleibman	cb23cca3ec	Avoid Duplicate Titles When Reading Multiple HTML Files (#1829 ) This issue arose while researching issue #1823. The issue was not a bug; it just required clarification to the author of how to use the software. But, while researching, I discovered that loading html into 2 sheets of a spreadsheet has a problem if the html title tag is the same for the 2 sheets. PhpSpreadsheet would be able to save the resulting file, but Excel would not be able to read it properly because of the duplicate title. The worksheet setTitle method allows for disambiguation is such a circumstance. The html reader passed a parameter indicating "don't disambiguate", but I can't see any harm in changing that to "disambiguate". An extremely simple fix, with tests to back it up.	2021-02-27 15:10:04 +01:00
Owen Leibman	6080c4561d	Improve Coverage for HTML Reader Reader/Html is now covered except for 1 statement. There is some coverage of RichText when you know in advance that the html will expand into a single cell. It is a tougher nut, one that I have not yet cracked, to try to handle rich text while converting unkown html to multiple cells. The original author left this as a TODO, and so for now must I. It made sense to restructure some of the code. There are some changes. - Issue #1532 is fixed (links are now saved when using rowspan). - Colors can now be specified as html color name. To accomplish this, Helper/Html function colourNameLookup was changed from protected to public, and changed to static. - Superfluous empty lines were eliminated in a number of places, e.g. <ul><li>A</li><li>B</li><li>C</li></ul> had formerly caused a wrapped cell to be created with 2 empty lines followed by A, B, and C on separate lines; it will now just have the 3 A/B/C lines, which seems like a more sensible interpretation. - Img alt tag, which had been cast to float, is now used as a string. Private member "encoding" is not used. Functions getEncoding and setEncoding have therefore been marked deprecated. In fact, I was unable to get SecurityScanner to pass any html which is not UTF-8. There are possibly ways of getting around this (in Reader/Html - I have no intention of messing with Security Scanner), as can be seen in my companion pull request for Excel2003 Xml Reader. Doing this would be easier for ASCII-compatible character sets (like ISO-8859-1), than for non-compatible charsets (like UTF-16). I am not convinced that the effort is worth it, but am willing to investigate further. I added a number of tests, creating an Html directory, and moving HtmlTest to that directory.	2020-06-25 22:42:38 -07:00

Author

SHA1

Message

Date

oleibman

cb23cca3ec

Avoid Duplicate Titles When Reading Multiple HTML Files (#1829 )

This issue arose while researching issue #1823. The issue was not a bug;
it just required clarification to the author of how to use the software.
But, while researching, I discovered that loading html into 2
sheets of a spreadsheet has a problem if the html title tag is the same
for the 2 sheets. PhpSpreadsheet would be able to save the resulting file,
but Excel would not be able to read it properly because of the duplicate title.
The worksheet setTitle method allows for disambiguation is such a circumstance.
The html reader passed a parameter indicating "don't disambiguate", but I can't
see any harm in changing that to "disambiguate". An extremely simple fix,
with tests to back it up.

2021-02-27 15:10:04 +01:00

Owen Leibman

6080c4561d

Improve Coverage for HTML Reader

Reader/Html is now covered except for 1 statement.
There is some coverage of RichText when you know in advance that the
html will expand into a single cell.
It is a tougher nut, one that I have not yet cracked,
to try to handle rich text while converting unkown html to multiple cells.
The original author left this as a TODO, and so for now must I.

It made sense to restructure some of the code. There are some changes.
- Issue #1532 is fixed (links are now saved when using rowspan).
- Colors can now be specified as html color name. To accomplish this,
  Helper/Html function colourNameLookup was changed from protected
  to public, and changed to static.
- Superfluous empty lines were eliminated in a number of places, e.g.
  <ul><li>A</li><li>B</li><li>C</li></ul>
  had formerly caused a wrapped cell to be created with 2 empty lines
  followed by A, B, and C on separate lines; it will now just have the
  3 A/B/C lines, which seems like a more sensible interpretation.
- Img alt tag, which had been cast to float, is now used as a string.

Private member "encoding" is not used. Functions getEncoding and setEncoding
have therefore been marked deprecated. In fact, I was unable to get
SecurityScanner to pass *any* html which is not UTF-8. There are
possibly ways of getting around this (in Reader/Html - I have no
intention of messing with Security Scanner), as can be seen in my
companion pull request for Excel2003 Xml Reader. Doing this would be
easier for ASCII-compatible character sets (like ISO-8859-1),
than for non-compatible charsets (like UTF-16). I am not
convinced that the effort is worth it, but am willing to investigate
further.

I added a number of tests, creating an Html directory, and moving
HtmlTest to that directory.

2020-06-25 22:42:38 -07:00

2 Commits