Replacing Or Removing A New Line With Something Else But Only Between Single Or Double Quotes Using PHP On A CSV File

- 1 answer

I have a CSV file that holds about 200,000 to 300,000 records. Most of the records can be separated and inserted into a MySQL database with a simple

$line = explode("\n", $fileData);

and then the values separated with

$lineValues = explode(',', $line);

and then inserted into the database using the proper data type (int, float, string, text, etc.).

However, some of the records have a text column that includes a \n in the string, which breaks the $line = explode("\n", $fileData); approach. Each line of data that needs to be inserted into the database has approximately 216 columns. Not every line has a field containing a \n, but whenever a \n does occur within a line, it is enclosed between a pair of single quotes (').

Each line is set up in the following format:

id,data,data,data,text,more data

Example:

1,0,0,0,'Hello World',0
2,0,0,0,'Hello
    World',0
3,0,0,0,'Hi',0
4,0,0,0,,0

As you can see from the example, most records can be easily split with the methods shown above. It's the second record in the example that causes the problem.

Newlines are always \n; the file contains no \r at all.

Answer

If the CSV data is in a file, you can just use fgetcsv(), as others have pointed out. fgetcsv() handles embedded newlines correctly.
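As a minimal sketch of the fgetcsv() route, assuming the single-quote enclosure described in the question: the helper name readCsvFile() is hypothetical, and the explicit backslash escape character is an assumption about how quotes inside the text fields are escaped.

```php
<?php
// Hypothetical helper (not from the original answer): read every record
// from a CSV file whose text fields are enclosed in single quotes.
function readCsvFile(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $rows = [];
    // Arguments: 0 = no line-length limit, ',' = separator,
    // "'" = enclosure, '\\' = escape character (assumed).
    while (($row = fgetcsv($handle, 0, ',', "'", '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv() reports a blank line as an array holding one null
        }
        $rows[] = $row;
    }
    fclose($handle);

    return $rows;
}
```

Because fgetcsv() keeps reading while a quoted field is still open, a record such as the second one in the question should come back as a single row with the newline preserved inside the text field.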

However, if your CSV data is in a string (like $fileData in your example), the following method may be useful, as str_getcsv() only works on one row at a time and cannot split a whole file into records.
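An alternative worth noting for string data, which is not the method this answer goes on to describe, is to copy the string into an in-memory stream and let fgetcsv() do the record splitting. This is only a sketch: parseCsvString() is a hypothetical name, and the backslash escape character is an assumption about the data.

```php
<?php
// Hypothetical helper (a different technique from the quote-counting
// method in this answer): parse a whole CSV string via a temporary stream.
function parseCsvString(string $fileData): array
{
    // php://temp is a built-in read/write stream held in memory
    // (it spills to a temporary file if the data grows large).
    $stream = fopen('php://temp', 'r+');
    fwrite($stream, $fileData);
    rewind($stream);

    $rows = [];
    // Single-quote enclosure as in the question; backslash escape is assumed.
    while (($row = fgetcsv($stream, 0, ',', "'", '\\')) !== false) {
        if ($row === [null]) {
            continue; // skip blank lines
        }
        $rows[] = $row;
    }
    fclose($stream);

    return $rows;
}
```

The trade-off is that the data is briefly held twice (once in the string, once in the stream), which may matter for a file with several hundred thousand records.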

You can detect the embedded newlines by counting the quotes in each line. If a line contains an odd number of quotes, it is incomplete, so concatenate it with the following line. Once you have an even number of quotes, you have a complete record.

Once you have a complete record, split it at the quotes (again using explode()). Chunks at odd (zero-based) indexes were inside quotes, so commas embedded in them are not separators; chunks at even indexes were outside quotes.
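The two parity checks described above can be tried on their own before reading the full example. The sample strings are hypothetical, modeled on the question's format, and the variable names are illustrative only.

```php
<?php
// Hypothetical sample lines, based on the question's single-quoted format.
$partial  = "2,0,0,0,'Hello";                 // first physical line of record 2
$complete = $partial . "\n" . "    World',0"; // after appending the next line

// An odd quote count means a quoted field is still open.
$partialIsOpen  = substr_count($partial, "'") % 2 === 1;
$completeIsOpen = substr_count($complete, "'") % 2 === 1;

// Splitting a complete record at the quotes: explode() uses zero-based keys,
// so chunk 1 (odd) is the quoted text and chunks 0 and 2 (even) are unquoted.
$chunks = explode("'", "3,0,0,0,'Hi, there',0");
```

Here $partialIsOpen is true and $completeIsOpen is false, and the comma inside $chunks[1] is data rather than a field separator.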

Example:

# Split file into physical lines (records may span lines)
$lines = explode("\n", $fileData);

# Re-assemble records
$records = array ();
$record = '';
$lineSep = '';
foreach ($lines as $line) {
  # Escape @ symbol so we can use it as a marker (as it does not conflict with
  # any special CSV character.)
  $line = str_replace('@', '@a', $line);

  # Escape commas as we don't yet know which ones are separators
  $line = str_replace(',', '@c', $line);

  # Escape backslash-escaped quotes (\') and any remaining backslashes
  # in a form that uses no special characters
  $line = str_replace("\\'", '@q', $line);
  $line = str_replace('\\', '@b', $line);

  $record .= $lineSep . $line;
  $lineSep = "\n";

  # Must have an even number of quotes in a complete record!
  if (substr_count($record, "'") % 2 == 0) {
    $records[] = $record;
    $record = '';
    $lineSep = '';
  }
}
if (strlen($record) > 0) {
  $records[] = $record;
}

$rows = array ();

foreach ($records as $record) {
  $chunks_in = explode("'", $record);
  $chunks_out = array ();

  # Decode escaped quotes/backslashes.
  # Decode field-separating commas (unless quoted)
  foreach ($chunks_in as $i => $chunk) {
    # Unescape quotes & backslashes
    $chunk = str_replace('@q', "'", $chunk);
    $chunk = str_replace('@b', '\\', $chunk);
    if ($i % 2 == 0) {
      # Unescape commas
      $chunk = str_replace('@c', ',', $chunk);
    }
    $chunks_out[] = $chunk;
  }

  # Join back together, discarding unescaped quotes
  $record = join('', $chunks_out);

  $chunks_in = explode(',', $record);
  $row = array ();
  foreach ($chunks_in as $chunk) {
    $chunk = str_replace('@c', ',', $chunk);
    $chunk = str_replace('@a', '@', $chunk);
    $row[] = $chunk;
  }
  $rows[] = $row;
}
source: stackoverflow.com