CSV - Guess Encoding, Handle Null-string Escape (#1717)

* CSV - Guess Encoding, Handle Null-string Escape

This is in response to issue #1647 (detect CSV character encoding).
First, my tests with mb_detect_encoding indicate that it doesn't work
well enough; regardless, users can always call it themselves
if they deem it useful.
Rolling my own is also troublesome, but I can at least:
a. Check for a BOM (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE).
b. Do some heuristic tests for each of the above encodings.
c. Fall back to a user-specified encoding (default CP1252)
  if a and b don't yield a result.
I think this is probably useful enough to include, and relatively
easy to expand if other potential encodings should be considered.
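
Steps a and c can be sketched in a few lines. This is only an illustration, not the code in this commit: the function name `sniffBom` and its signature are made up for the example, and it works on a byte string rather than a file.

```php
<?php
// Hypothetical helper, not part of PhpSpreadsheet: map a leading BOM to an
// encoding name, or return the caller's fallback when no BOM is found.
function sniffBom(string $bytes, string $fallback = 'CP1252'): string
{
    // UTF-32LE is listed before UTF-16LE because its BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE).
    $boms = [
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        // strncmp is binary-safe, so NUL bytes in the BOM are compared too.
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }

    return $fallback;
}

echo sniffBom("\xEF\xBB\xBFa,b\n"), "\n";     // UTF-8
echo sniffBom("\xFF\xFE\x00\x00a"), "\n";     // UTF-32LE
echo sniffBom("\xFF\xFEa\x00"), "\n";         // UTF-16LE
echo sniffBom("a,b\n"), "\n";                 // CP1252 (no BOM, default fallback)
echo sniffBom("a,b\n", 'ISO-8859-2'), "\n";   // ISO-8859-2 (caller's fallback)
```

The heuristic tests of step b would sit between the BOM loop and the fallback; they are left out here to keep the sketch short.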

Starting with PHP 7.4, fgetcsv allows the null (empty) string to be
specified as the escape character. This is a much better choice than the PHP
(and PhpSpreadsheet) default of backslash in that it handles the file
in the same manner as Excel does. There is one statement in Reader/CSV
which would be adversely affected if the caller so specified (it builds
a regular expression under the assumption that the escape character is
a single character). Fix that statement appropriately and add tests.
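
To illustrate the regular-expression issue: with an empty escape character, the old expression would contain a lookbehind with nothing inside it, which is not what was intended. The sketch below shows the general idea of dropping the lookbehind in that case. The function names are hypothetical and the counting step is only an assumption about how such a pattern might be used, not a copy of the reader's code.

```php
<?php
// Hypothetical helper: build a pattern matching an enclosure character
// that is not preceded by the escape character.
function buildEnclosurePattern(string $escape, string $enclosure): string
{
    // With an empty escape string there is nothing to look behind for,
    // so the lookbehind is omitted entirely.
    $lookbehind = ($escape === '') ? '' : '(?<!' . preg_quote($escape, '/') . ')';

    return '/' . $lookbehind . preg_quote($enclosure, '/') . '/';
}

// Hypothetical helper: count enclosures that are not escaped.
function countUnescapedEnclosures(string $line, string $escape, string $enclosure): int
{
    return (int) preg_match_all(buildEnclosurePattern($escape, $enclosure), $line);
}

// Single-quoted, so each \" stays as a backslash followed by a quote.
$line = 'a\"hello;hello\",b';

echo countUnescapedEnclosures($line, '\\', '"'), "\n"; // 0: both quotes are escaped
echo countUnescapedEnclosures($line, '', '"'), "\n";   // 2: no escape character, both count
```

With backslash as the escape character neither quote counts as an enclosure; with the empty string both do, which is the Excel-like reading described above.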
oleibman 2020-12-25 08:47:29 -08:00 committed by GitHub
parent 607d3473e6
commit e768cb0f19
GPG Key ID: 4AEE18F83AFDEB23
15 changed files with 170 additions and 8 deletions

@ -458,6 +458,24 @@ $reader->setSheetIndex(0);
$spreadsheet = $reader->load("sample.csv");
```
You may also let PhpSpreadsheet attempt to guess the input encoding.
It first tests for a BOM (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE),
then applies heuristic tests for those encodings, and falls back to a
caller-specified encoding (default is CP1252) if all of those tests fail.
```php
$reader = new \PhpOffice\PhpSpreadsheet\Reader\Csv();
$encoding = \PhpOffice\PhpSpreadsheet\Reader\Csv::guessEncoding('sample.csv');
// or, e.g. $encoding = \PhpOffice\PhpSpreadsheet\Reader\Csv::guessEncoding(
// 'sample.csv', 'ISO-8859-2');
$reader->setInputEncoding($encoding);
$reader->setDelimiter(';');
$reader->setEnclosure('');
$reader->setSheetIndex(0);
$spreadsheet = $reader->load('sample.csv');
```
#### Read a specific worksheet

@ -9,6 +9,21 @@ use PhpOffice\PhpSpreadsheet\Spreadsheet;
class Csv extends BaseReader
{
const UTF8_BOM = "\xEF\xBB\xBF";
const UTF8_BOM_LEN = 3;
const UTF16BE_BOM = "\xfe\xff";
const UTF16BE_BOM_LEN = 2;
const UTF16BE_LF = "\x00\x0a";
const UTF16LE_BOM = "\xff\xfe";
const UTF16LE_BOM_LEN = 2;
const UTF16LE_LF = "\x0a\x00";
const UTF32BE_BOM = "\x00\x00\xfe\xff";
const UTF32BE_BOM_LEN = 4;
const UTF32BE_LF = "\x00\x00\x00\x0a";
const UTF32LE_BOM = "\xff\xfe\x00\x00";
const UTF32LE_BOM_LEN = 4;
const UTF32LE_LF = "\x0a\x00\x00\x00";
/**
* Input encoding.
*
@ -90,12 +105,8 @@ class Csv extends BaseReader
{
rewind($this->fileHandle);
-switch ($this->inputEncoding) {
-case 'UTF-8':
-fgets($this->fileHandle, 4) == "\xEF\xBB\xBF" ?
-fseek($this->fileHandle, 3) : fseek($this->fileHandle, 0);
-break;
+if (fgets($this->fileHandle, self::UTF8_BOM_LEN + 1) !== self::UTF8_BOM) {
+rewind($this->fileHandle);
}
}
@ -213,7 +224,9 @@ class Csv extends BaseReader
private function getNextLine()
{
$line = '';
-$enclosure = '(?<!' . preg_quote($this->escapeCharacter, '/') . ')' . preg_quote($this->enclosure, '/');
+$enclosure = ($this->escapeCharacter === '' ? ''
+: ('(?<!' . preg_quote($this->escapeCharacter, '/') . ')'))
+. preg_quote($this->enclosure, '/');
do {
// Get the next line in the file
@ -307,7 +320,7 @@ class Csv extends BaseReader
$this->fileHandle = fopen('php://memory', 'r+b');
$data = StringHelper::convertEncoding($entireFile, 'UTF-8', $this->inputEncoding);
fwrite($this->fileHandle, $data);
-rewind($this->fileHandle);
+$this->skipBOM();
}
}
@ -531,4 +544,63 @@ class Csv extends BaseReader
return in_array($type, $supportedTypes, true);
}
private static function guessEncodingTestNoBom(string &$encoding, string &$contents, string $compare, string $setEncoding): void
{
if ($encoding === '') {
$pos = strpos($contents, $compare);
if ($pos !== false && $pos % strlen($compare) === 0) {
$encoding = $setEncoding;
}
}
}
private static function guessEncodingNoBom(string $filename): string
{
$encoding = '';
$contents = file_get_contents($filename);
self::guessEncodingTestNoBom($encoding, $contents, self::UTF32BE_LF, 'UTF-32BE');
self::guessEncodingTestNoBom($encoding, $contents, self::UTF32LE_LF, 'UTF-32LE');
self::guessEncodingTestNoBom($encoding, $contents, self::UTF16BE_LF, 'UTF-16BE');
self::guessEncodingTestNoBom($encoding, $contents, self::UTF16LE_LF, 'UTF-16LE');
if ($encoding === '' && preg_match('//u', $contents) === 1) {
$encoding = 'UTF-8';
}
return $encoding;
}
private static function guessEncodingTestBom(string &$encoding, string $first4, string $compare, string $setEncoding): void
{
if ($encoding === '') {
if ($compare === substr($first4, 0, strlen($compare))) {
$encoding = $setEncoding;
}
}
}
private static function guessEncodingBom(string $filename): string
{
$encoding = '';
$first4 = file_get_contents($filename, false, null, 0, 4);
if ($first4 !== false) {
self::guessEncodingTestBom($encoding, $first4, self::UTF8_BOM, 'UTF-8');
self::guessEncodingTestBom($encoding, $first4, self::UTF16BE_BOM, 'UTF-16BE');
self::guessEncodingTestBom($encoding, $first4, self::UTF32BE_BOM, 'UTF-32BE');
self::guessEncodingTestBom($encoding, $first4, self::UTF32LE_BOM, 'UTF-32LE');
self::guessEncodingTestBom($encoding, $first4, self::UTF16LE_BOM, 'UTF-16LE');
}
return $encoding;
}
public static function guessEncoding(string $filename, string $dflt = 'CP1252'): string
{
$encoding = self::guessEncodingBom($filename);
if ($encoding === '') {
$encoding = self::guessEncodingNoBom($filename);
}
return ($encoding === '') ? $dflt : $encoding;
}
}
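
The no-BOM heuristic above relies on where the first line feed falls: in UTF-16 and UTF-32 it has to start on a code-unit boundary. The standalone sketch below illustrates that idea on plain byte strings. The name `guessFromLineFeed` is hypothetical, and the function returns an empty string when nothing matches so that a caller can apply its own fallback.

```php
<?php
// Hypothetical helper: guess a BOM-less encoding from how the first
// line feed is encoded.
function guessFromLineFeed(string $contents): string
{
    // Wider encodings first, because a UTF-32 line feed contains the
    // byte sequence of a UTF-16 line feed.
    $lineFeeds = [
        'UTF-32BE' => "\x00\x00\x00\x0A",
        'UTF-32LE' => "\x0A\x00\x00\x00",
        'UTF-16BE' => "\x00\x0A",
        'UTF-16LE' => "\x0A\x00",
    ];
    foreach ($lineFeeds as $encoding => $lf) {
        $pos = strpos($contents, $lf);
        // The first occurrence must start on a code-unit boundary.
        if ($pos !== false && $pos % strlen($lf) === 0) {
            return $encoding;
        }
    }
    // An empty pattern with the u modifier matches only valid UTF-8.
    if (preg_match('//u', $contents) === 1) {
        return 'UTF-8';
    }

    return '';
}

echo guessFromLineFeed("\x00a\x00\x0A"), "\n";      // UTF-16BE
echo guessFromLineFeed("a\x00\x0A\x00"), "\n";      // UTF-16LE
echo guessFromLineFeed("premi\xC3\xA8re\n"), "\n";  // UTF-8
var_dump(guessFromLineFeed("premi\xE8re\n"));       // string(0) ""
```

The last input is the CP1252 spelling of "première"; it is not valid UTF-8, so the sketch gives no answer and the fallback encoding would apply.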

@ -275,4 +275,66 @@ EOF;
$reader = new Csv();
$reader->load('tests/data/Reader/CSV/encoding.utf8.csvxxx');
}
/**
* @dataProvider providerEscapes
*/
public function testInferSeparator(string $escape, string $delimiter): void
{
$reader = new Csv();
$reader->setEscapeCharacter($escape);
$filename = 'tests/data/Reader/CSV/escape.csv';
$reader->listWorksheetInfo($filename);
self::assertEquals($delimiter, $reader->getDelimiter());
}
public function providerEscapes()
{
return [
['\\', ';'],
["\x0", ','],
[(version_compare(PHP_VERSION, '7.4') < 0) ? "\x0" : '', ','],
];
}
/**
* @dataProvider providerGuessEncoding
*/
public function testGuessEncoding(string $filename): void
{
$reader = new Csv();
$reader->setInputEncoding(Csv::guessEncoding($filename));
$spreadsheet = $reader->load($filename);
$sheet = $spreadsheet->getActiveSheet();
self::assertEquals('première', $sheet->getCell('A1')->getValue());
self::assertEquals('sixième', $sheet->getCell('C2')->getValue());
}
public function providerGuessEncoding()
{
return [
['tests/data/Reader/CSV/premiere.utf8.csv'],
['tests/data/Reader/CSV/premiere.utf8bom.csv'],
['tests/data/Reader/CSV/premiere.utf16be.csv'],
['tests/data/Reader/CSV/premiere.utf16bebom.csv'],
['tests/data/Reader/CSV/premiere.utf16le.csv'],
['tests/data/Reader/CSV/premiere.utf16lebom.csv'],
['tests/data/Reader/CSV/premiere.utf32be.csv'],
['tests/data/Reader/CSV/premiere.utf32bebom.csv'],
['tests/data/Reader/CSV/premiere.utf32le.csv'],
['tests/data/Reader/CSV/premiere.utf32lebom.csv'],
['tests/data/Reader/CSV/premiere.win1252.csv'],
];
}
public function testGuessEncodingDefltIso2(): void
{
$filename = 'tests/data/Reader/CSV/premiere.win1252.csv';
$reader = new Csv();
$reader->setInputEncoding(Csv::guessEncoding($filename, 'ISO-8859-2'));
$spreadsheet = $reader->load($filename);
$sheet = $spreadsheet->getActiveSheet();
self::assertEquals('premičre', $sheet->getCell('A1')->getValue());
self::assertEquals('sixičme', $sheet->getCell('C2')->getValue());
}
}

@ -0,0 +1,4 @@
a\"hello;hello;hello;\",b\"hello;hello;hello;\",c\"\hello;hello;hello;\"
a\"hello;hello;hello;\",b\"hello;hello;hello;\",c\"\hello;hello;hello;\",d
a\"hello;hello;hello;\",b\"hello;hello;hello;\",c\"\hello;hello;hello;\"
a\"hello;hello;hello;\",b\"hello;hello;hello;\",c\"\hello;hello;hello;\"

Binary file not shown.
8 binary files in total; each preview shows the same two rows:
première,second,troisième
Quatrième,cinquième,sixième

@ -0,0 +1,2 @@
première,second,troisième
Quatrième,cinquième,sixième

@ -0,0 +1,2 @@
première,second,troisième
Quatrième,cinquième,sixième

@ -0,0 +1,2 @@
première,second,troisième
Quatrième,cinquième,sixième