Montag, 13. Juli 2020

Why it is a stupid idea to consider CSV as a valid long-term preservation file format

Take CSV!

It's so nice and quick and easy to say. Take CSV!

For simple cases that may be true. CSV files look so simple, so innocent, so sweet. Yet by their very nature they are insidious, vicious, and resemble a bloody walk into the deepest dungeons of classic role-players.

Let us begin our journey.

Innocent simplicity

You take a separator, e.g. the comma, use it to separate your values. Pour both into readable form. Done.

Okay. We need a second separator to show us the next line. But then, done! It's a CSV.

Hmm. There was something. Line separator. Now, is that line feed, carriage return or carriage return and line feed? It depends. For example, what operating system you're running.

The monster is growing

It is not a bad idea to separate values of a list by commas. Especially for Americans, this feels quite natural.

In other parts of the world, the decimal places of fractional numbers are separated by commas. Good, then we'll give the spreadsheets the opportunity to define the separator freely. Problem solved.

Well, not quite. It could be in other contexts that somehow the separator could appear in the individual values of a list. Good, then we'll introduce quoting. We define a character that allows us to recognize whether a separator is a separator or just a text component of a list value. Apostrophes would fit. That was easy, wasn't it?

Short break

So, to sum up. CSV files are easy. You need a separator, which can be a comma or anything else. We have a second separator that separates the lines. Usually there are three variations. We need quoting to see that a value cannot be confused with a separator.

Yeah, it may have been a little more complex than it looked at first. But what is there to make it worse?

Little toothy pegs!

Hmm, what if I want to store a text like this as a value after the raw value 1:

And he said "Oh, no!"

In the text, we have a comma, which would be protected by quoting, But we also have quotation marks, which we need for our quoting. No problem, then we double the quotation mark at that point to indicate that the text is not finished. So in the CSV it looks like this now:

1, "And he said ""Oh, no!""

I got it.

But, wait, what happens if my text consists of a single quotation mark?

1,""""

You're lucky. It seems to be working.

Wait, so what if I have a lot of quotation marks? As in

""""""
This is translated to
1, """"""""""""""

It works, too.

The problem is in the details

Now, a nasty little devil might get the idea to construct a text as value that contains line breaks, for example this one:

Evil Text
",
",

That would then:

1, "Evil text
"","
"",

Oops! If I now stubbornly read this in line by line, I would have read strange lines.
Good thing there is real software out there that reads and parses CSV files cleanly from the beginning. Not that anyone here still uses 'grep' and co.

The Abyss

Have we actually talked about character encoding yet? ASCII, Latin-1, UTF32? UTF8? With or without byte-order mark? No. Let's turn back. We still have a chance.

Later, at the pub.

I admit it was a terrible trip. Now, over a cold beer, we can laugh about it. But our hearts were already in our mouth. We had no idea what to expect.

If only there had been a sign that said what character encoding, what line end encoding, what separators for lines and columns we could expect, yes, then we would have been able to understand CSV and we would have been spared the horror. But the horror comes from the darkness, from the premonitions of the unknown.

Therefore, be warned!

Don't use CSV, it could get you!

Keine Kommentare:

Kommentar veröffentlichen