Show HN: Parsing CSV files with GPU (opens in new tab)

(github.com)

123 pointsantonmks11y ago60 comments

60 comments

51 comments · 15 top-level

kazinator11y ago· 9 in thread

Though there is no standard definition of CSV, de facto processing it properly requires recognizing quotes, and also escapes of literal quotes using double quoting:

    this, "is, like, CSV", "with three so-called ""fields"""

Note that unquoted leading and trailing whitespace, and whitespace around the commas, is deleted, too.

(See CSV page in the Wikipedia)

A GPU-accelerated string split could be useful but it's not quite "parsing CSV".

haberman11y ago

This is a perfect example of how text parsing is really inherently non-parallelizable. It's very rare that you can do anything useful with a buffer of text without knowing the precise state of the parse at the beginning of that buffer.

The kinds of patterns that would make parsing more parallelizable, like marking the beginning of a delimited region with its length, are human unfriendly so would never be part of an actual text format. Who would ever want to write this?

    # Update "19" whenever string length changes.
    x = {19}"String of length 19"

pcwalton11y ago

> This is a perfect example of how text parsing is really inherently non-parallelizable. It's very rare that you can do anything useful with a buffer of text without knowing the precise state of the parse at the beginning of that buffer.

There are two mechanisms that are usually used to get around this:

(1) Perform a fast, sequential "skeleton parsing" pass before the main parse that scans just enough to find "split points" that are consumed by the parallel parser. This is what some of the parallel XML parsing work [1] did.

(2) Guess the state you're in based on some heuristics, and roll back on failure. This actually works surprisingly well in practice for many grammars, for example HTML [2].

[1]: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=410047...

[2]: http://www.cs.wm.edu/~xshen/Publications/taco14.pdf

1 more reply

wfunction11y ago

I've always wondered if parsing is still unparallelizable if you also allow backtracking. In other words, can you parse eagerly assuming that you're likely to be in a certain state, and then if you're proven wrong maybe you can retry?

1 more reply

thesz11y ago

http://www.cse.chalmers.se/edu/year/2010/course/TDA341/Paper...

Also look at CYK parsing algorithm, it is highly parallelizable (it is based on matrix multiplication).

TheLoneWolfling11y ago

And that's why you let the computer do it for you.

1 more reply

bostonpete11y ago

> unquoted leading and trailing whitespace, and whitespace around the commas, is deleted, too. (See CSV page in the Wikipedia)

I'm not sure why you referenced the Wikipedia page -- it indicates that according to RFC 4180, (a) "spaces outside quotes in a field are not allowed", and (b) "Spaces are considered part of a field and should not be ignored."

Maybe you're referring to the comment that CSV parsers should "be liberal in what you accept from others"...?

http://en.wikipedia.org/wiki/Comma-separated_values#Basic_ru...

kazinator11y ago

> (a) "spaces outside quotes in a field are not allowed", and (b) "Spaces are considered part of a field and should not be ignored."

Note that it's clear that (b) refers only to spaces inside quotes, because spaces outside quotes are not allowed.

> Maybe you're referring to the comment that CSV parsers should "be liberal in what you accept from others"...?

I actually don't believe in that principle. If there is a spec that everyone agrees upon, violations should be accurately and loudly diagnosed and rejected.

Both preparation of the data format and processing of that data format should be conservative. Being liberal in what is accepted has unintended negative consequences.

Be that as it may, there is no universal CSV spec, though. RFC 4180 is just someone's opinion on what CSV should be. CSV is something that has been widely implemented in numerous programs over numerous decades, in different ways.

My main point holds that if you split the string on commas and do nothing else, then one of the aspects you're neglecting to handle is the treatment of unquoted whitespace outside of a field.

1 more reply

motoboi11y ago

You forgot with fields with line-breaks, with non printrable characters and so on.

kazinator11y ago

I don't need to reproduce an entire detailed CSV spec here to make the point.

There is a danger in being too detailed because there is no universal spec anyway. For many users, CSV is whatever the most recent version or two of Microsoft Excel accept, as confirmed by trial-and-error reverse engineering.

castratikron11y ago· 6 in thread

I'm skeptical of the 8x speedup for several reasons, the main one being that this particular problem does not fit the paradigm of problems that work well on the GPU; the GPU cache is not used at all, and there are also many branches. You need to be able to use the cache of the GPU in your application, otherwise your performance is guaranteed to be memory-bound. The reason you want to avoid branches is that there is only one control unit per a number of cores on the GPU, which means that if some threads follow one branch they will have to stall until the other threads complete. Generally the only code that maps well to the GPU is that which contains large for loops and has good spacial locality (e.g. matrix multiplication).

The author is comparing a GPU to a CPU, yet the CPU is only running a single thread (supposedly, the author did not provide the CPU code used in the comparison). For a true comparison the full capability of the CPU should be exposed by means of a multithreaded application (and, as someone else has already mentioned, vector instructions such as SSE). Think performance per socket, not performance per thread.

pcwalton11y ago

> Generally the only code that maps well to the GPU is that which contains large for loops and has good spacial locality (e.g. matrix multiplication).

You also need high arithmetic intensity (the ratio of arithmetic/logical operations to memory loads). Of common CPU-bound tasks, CSV parsing has one of the lowest arithmetic intensities imaginable.

castratikron11y ago

Right, maybe I should have said that instead. In my mind locality implies arithmetic intensity, but intensity may not necessarily imply locality.

zvrba11y ago

> I'm skeptical of the 8x speedup for several reasons [...] yet the CPU is only running a single thread

That's why I'm not skeptical at all. A GPU program can operate over THOUSANDS many more data items in parallel than a single-threaded scalar CPU program can. Yet the speedup is not thousands, not even hundreds, but a mere 8.

Fits perfectly with

> main one being that this particular problem does not fit the paradigm of problems that work well on the GPU

1 more reply

stephen27211y ago

You hit the nail on the head. All of that on top of the fact that he is comparing a top of the line GPU against a mediocre i3 processor. GPU to CPU comparisons are always apples to oranges, but this is pretty bad.

I would like to see an OpenCL kernel that is run on both the GPU and CPU to possibly even the playing field a little.

paulmd11y ago

You can compile Thrust using a device target of OpenMP, so no need to rewrite it in OpenCL.

paulmd11y ago

There's also just the sheer bandwidth factor. A GTX Titan has 288 GB/s of BW, a Haswell with DDR3 puts out about 35 GB/s. That's a factor of 8.22x more bandwidth, so there's your 8x speedup.

It's a highly memory-limited task so I suspect that's where any speedup would come from.

hannibalhorn11y ago· 4 in thread

For file sizes where parsing speed really makes a difference, loading the entire file into memory doesn't seem feasible. Maybe some sort of a hybrid approach (load chunks into memory and parse them via the GPU) would provide some real benefits though.

victorNicollet11y ago

In a REPL environment, for a 1GB file, this can be the difference between "wait three seconds" and "let me get my coffee".

atwebb11y ago

I think there's a utility that does something similar (windows based, assuming it's available elsewhere), I remember it having a "crass" name, you'd think it would make it easier to remember, maybe F.U.C.K.?

bane11y ago

And if you're going to load it, you may as well parse it on the way in.

victorNicollet11y ago

A streaming CSV parser is very difficult to get right, once you get past a certain level of complexity.

Sometimes, you are not lucky enough to have perfect control over the encoding, number format and date format of the input, so you need to look ahead at a value sample to try and find out what those are.

Sometimes, you cannot even assume that the software that produced the file didn't mangle the quotes around fields, and you have to detect that you are 40960 bytes into a quoted field, decide that it's probably an error, and backtrack.

If you have enough memory to load the entire file, you will save a lot of time by giving up on streaming processing.

1 more reply

WhitneyLand11y ago· 4 in thread

I like your creativity in applying the GPU to a less typical task.

Is it the case that the CPU version could be sped up dramatically by using multiple cores and a variation on the line splitting technique?

hvidgaard11y ago

You could use a lockfree queue, fill it, and spawn a suitable number of consumers that will put the parsed data back into a new queue. If the data order matters, then some extra boilerplate is needed, and I do not know how it's done on the GPU, or if it's needed at all.

Seeing that it takes 14.5s with the handrolled code to parse 750GB, I however doubt the optimization is needed - it's more than 50GB every second and you need some really really exotic hardware to generate that much data. You still need to do something with the data and it's likely significantly more computational intensitive - that should be on a different thread.

That said, using the GPU would free up the CPU to do meaningful work with the data. But to be fair, assuming 1GB/s datarate, it's still 12,5m to read it all and you only gain 14.5s or less than 2% speedup if you're CPU bound.

antonmksOP11y ago

It should be 750MB, not 750GB. There is no way a single hard drive can read 750GB in 14 seconds.

2 more replies

antonmksOP11y ago

Yes, CPU version can be accelerated as well, depending on availability of SSE extensions.

Narishma11y ago

Shouldn't that be the default assumption? It's pretty rare to not have SSE.

2 more replies

ape411y ago· 4 in thread

In the real world... read the CSV into a database (doesn't really matter how fast or slow it is). Access the data from the database.

victorNicollet11y ago

There are situations where CSV is the appropriate solution. Our applications read and write CSV data at 1-1.5 million cells per second, and the process can be scaled out at will (there is no central data server, all data is stored on Azure Blob Storage).

I'm not sure we could afford the skills and software to achieve these speeds with SQL.

dilap11y ago

In the real world, sometimes you have an external data source, in csv (which you can't control), which is changing, which you want to reread as quickly as possible. :)

ldng11y ago

You then use PostgreSQL Foreign Data Wrapper (aka SQL/MED)

1 more reply

bitwize11y ago

In the real world you may have to ingest hundreds of megabytes or more of CSV data, daily, from a provider or partner, and CPU time spent ingesting that data is not spent processing transactions. So any speedup you can get is well worth it.

paulmd11y ago· 2 in thread

I kinda suspect he might be measuring the time it takes to launch a kernel rather than the time it takes the kernel to complete.

Thrust device calls, like those of the underlying CUDA library, are asynchronous by default. The only exception is calls that result in a memcpy, which are synchronous. To wait until an async call is completed you need to call one of the synchronize commands, like cudaDeviceSynchronize.

Looking through his test.cu file, he snaps a timestamp using std::clock right after doing the kernel launch with for_each. Ignoring the fact that this is not an accurate way to benchmark a GPU (you need to use events to accurately benchmark the kernel) what you're capturing will just be the processor time it takes to make the async kernel launch. Std::clock measures CPU time, which is (rightly) close to 0 for a program that runs on the GPU.

It's entirely possible that you're not even getting valid results out of the other end - note that you don't show output. I don't know if thrust's magic device memory access function triggers a synchronization or not. I kinda remember having to make an explicit call when I did a GPU simulation.

I don't have access to a CUDA box at the moment, I'd have to add those cudaDeviceSynchronize calls after the for_each invocations to be sure.

antonmksOP11y ago

Thrust CUDA calls are synchronous to each other. You can add an explicit synchronization call cudaDeviceSynchronize() and there won't be a difference in results.

andrewguenther11y ago

I had suspected this as well, but I'm not familiar with how Thrust's CUDA bindings work.

victorNicollet11y ago· 2 in thread

Very interesting. From my experience, the hard part about parsing CSV isn't to identify the individual cells, but rather parsing those cells afterwards (as numbers, dates, etc).

What is the performance of those operations (e.g. parsing YYYY-MM-DD dates to Unix timestamps) when performed on the GPU ?

My company actually picked another optimization strategy, by making the tokenization significantly longer, but it de-duplicates the tokenized cells so that each distinct cell value (a date, a number, a string) can be parsed exactly once. We have seen some fairly good results out of this, compared to the naive approach of stream-token-parse:

https://github.com/Lokad/lokad-flatfiles

TheLoneWolfling11y ago

Makes sense for low-entropy data. Though I can see that approach choking on some datasets. What happens if every entry has a GUID, for example?

May be better to do a best-effort deduplication instead of an exhaustive approach.

victorNicollet11y ago

At some point, we considered tweaking by dropping any strings longer than a certain length from the deduplication (it also helps with memory usage when streaming the data).

Our method makes most sense for many-to-many data (several orders per product, several orders per day), which happens to be the largest data sets we manipulate (by 3 orders of magnitude). I can certainly see situations where this would not be the case (e.g. web crawler logs).

andrewguenther11y ago· 1 in thread

This title is incredibly misleading.

* This isn't parsing a CSV, this is a program written to split this exact dataset. (The code is filled with hard coded values)

* You're comparing a single-threaded run on a low-end CPU to a top-tier GPU.

* Your dataset can fit into GPU memory.

* There is a pull request for a missing semicolon, which means the posted version of the code won't even compile, so couldn't have been the version used to generate the benchmarks.

* The amount of branching in the GPU code makes it hard for me to believe that it actually ran that fast. GPU parallelism does not work well with branching since all cores in a cube must executing in lock-step, if you branch, then you now have to go back and execute all of your branches separately.

antonmksOP11y ago

Sorry if you are misled :-) The program does parses selected fields, there are strings-to-binaries procedures.

I tested this approach on multiterabyte files, take a look at my alenka project, it uses the same method to load large CSV files into databases. It just have to be done in chunks.

The program compiles fine, that pull request was referring to incorrect earlier version of test.cu file.

Test it for yourself, see if you get similar results.

roel_v11y ago· 1 in thread

I don't understand the benchmark. How can a 750GB file be read from disk in 1.5 seconds, let alone be parsed? He mentions it's a 2TB drive, so it's not even SSD presumably?

bbrks11y ago

The README has just been updated to change 750GB to MB.

https://github.com/antonmks/nvParse/commit/fab4c4728096003bc...

mholt11y ago· 1 in thread

I gotta admit, having written a high-speed, multi-threaded streaming CSV parser for the browser[1], I'm quite impressed by your project. I've done CUDA programming before and it's not easy (albeit libraries do help). Good work!

[1] http://papaparse.com

pestaa11y ago

I just used Papa at the office today. Remembered it from a while ago, it really turned out to be effortless. Excellent work.

(While I'm at it, may I suggest to provide CDN URLs on the website? IMO it would make the otherwise awesome page perfect.)

Thank you again.

avarsheny11y ago· 1 in thread

I don't think CPU and GPU is going to make much difference here. You speed up in GPU run could be because of hot cache of file system. Try running GPU run first and non-GPU after that and plz post the results.

antonmksOP11y ago

I used average time for multiple runs, both GPU and CPU benchmarks ran from memory.

maxhou11y ago· 1 in thread

In the real world, the slow part of "parsing" a CSV file is IO: reading the file content from disk to memory, and from memory to CPU cache.

You would avoid reading the file content more than once if you had to parse it.

> The first line counts the number of lines in a buffer (assuming that file is read into memory and copied to gpu buffer d_readbuff).

but this is what is done here, first search to find all \n, then multi-core GPU stuff for each line content.

azurezyq11y ago

Things have been changed in a world of SSDs and machines w/ much memory. Actually, parsing a CSV in a single thread will never reach several hundred megabytes per second.

rdc1211y ago

"However this method of parsing is CPU bound because * it doesn't take advantage of multiple cores of modern CPUs. * memory bandwidth limitations"

So not CPU bound at all then.

snissn11y ago

I think it would be really cool to integrate a GPU into a database to speed up certain operations if an optimizer can decide that certain parts of a query will benefit from it. The thrust docs [0] indicate that the C++ gpu library can be used effectively for sorting, I wonder if sorts on non-indexed fields can be sped up by attaching a GPU to my database!

[0] http://thrust.github.io/

aftbit11y ago

Am I crazy? With a hot disk cache, the cut command he gave takes 1.2 seconds on my machine.

j / k navigate · click thread line to collapse

60 comments

51 comments · 15 top-level

kazinator11y ago· 9 in thread

Though there is no standard definition of CSV, de facto processing it properly requires recognizing quotes, and also escapes of literal quotes using double quoting:

    this, "is, like, CSV", "with three so-called ""fields"""

Note that unquoted leading and trailing whitespace, and whitespace around the commas, is deleted, too.

(See CSV page in the Wikipedia)

A GPU-accelerated string split could be useful but it's not quite "parsing CSV".

haberman11y ago

    # Update "19" whenever string length changes.
    x = {19}"String of length 19"

pcwalton11y ago

There are two mechanisms that are usually used to get around this:

(2) Guess the state you're in based on some heuristics, and roll back on failure. This actually works surprisingly well in practice for many grammars, for example HTML [2].

[1]: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=410047...

[2]: http://www.cs.wm.edu/~xshen/Publications/taco14.pdf

1 more reply

wfunction11y ago

1 more reply

thesz11y ago

http://www.cse.chalmers.se/edu/year/2010/course/TDA341/Paper...

Also look at CYK parsing algorithm, it is highly parallelizable (it is based on matrix multiplication).

TheLoneWolfling11y ago

And that's why you let the computer do it for you.

1 more reply

bostonpete11y ago

> unquoted leading and trailing whitespace, and whitespace around the commas, is deleted, too. (See CSV page in the Wikipedia)

Maybe you're referring to the comment that CSV parsers should "be liberal in what you accept from others"...?

http://en.wikipedia.org/wiki/Comma-separated_values#Basic_ru...

kazinator11y ago

> (a) "spaces outside quotes in a field are not allowed", and (b) "Spaces are considered part of a field and should not be ignored."

Note that it's clear that (b) refers only to spaces inside quotes, because spaces outside quotes are not allowed.

> Maybe you're referring to the comment that CSV parsers should "be liberal in what you accept from others"...?

I actually don't believe in that principle. If there is a spec that everyone agrees upon, violations should be accurately and loudly diagnosed and rejected.

Both preparation of the data format and processing of that data format should be conservative. Being liberal in what is accepted has unintended negative consequences.

My main point holds that if you split the string on commas and do nothing else, then one of the aspects you're neglecting to handle is the treatment of unquoted whitespace outside of a field.

1 more reply

motoboi11y ago

You forgot with fields with line-breaks, with non printrable characters and so on.

kazinator11y ago

I don't need to reproduce an entire detailed CSV spec here to make the point.

castratikron11y ago· 6 in thread

pcwalton11y ago

> Generally the only code that maps well to the GPU is that which contains large for loops and has good spacial locality (e.g. matrix multiplication).

You also need high arithmetic intensity (the ratio of arithmetic/logical operations to memory loads). Of common CPU-bound tasks, CSV parsing has one of the lowest arithmetic intensities imaginable.

castratikron11y ago

Right, maybe I should have said that instead. In my mind locality implies arithmetic intensity, but intensity may not necessarily imply locality.

zvrba11y ago

> I'm skeptical of the 8x speedup for several reasons [...] yet the CPU is only running a single thread

Fits perfectly with

> main one being that this particular problem does not fit the paradigm of problems that work well on the GPU

1 more reply

stephen27211y ago

I would like to see an OpenCL kernel that is run on both the GPU and CPU to possibly even the playing field a little.

paulmd11y ago

You can compile Thrust using a device target of OpenMP, so no need to rewrite it in OpenCL.

paulmd11y ago

There's also just the sheer bandwidth factor. A GTX Titan has 288 GB/s of BW, a Haswell with DDR3 puts out about 35 GB/s. That's a factor of 8.22x more bandwidth, so there's your 8x speedup.

It's a highly memory-limited task so I suspect that's where any speedup would come from.

hannibalhorn11y ago· 4 in thread

victorNicollet11y ago

In a REPL environment, for a 1GB file, this can be the difference between "wait three seconds" and "let me get my coffee".

atwebb11y ago

bane11y ago

And if you're going to load it, you may as well parse it on the way in.

victorNicollet11y ago

A streaming CSV parser is very difficult to get right, once you get past a certain level of complexity.

If you have enough memory to load the entire file, you will save a lot of time by giving up on streaming processing.

1 more reply

WhitneyLand11y ago· 4 in thread

I like your creativity in applying the GPU to a less typical task.

Is it the case that the CPU version could be sped up dramatically by using multiple cores and a variation on the line splitting technique?

hvidgaard11y ago

antonmksOP11y ago

It should be 750MB, not 750GB. There is no way a single hard drive can read 750GB in 14 seconds.

2 more replies

antonmksOP11y ago

Yes, CPU version can be accelerated as well, depending on availability of SSE extensions.

Narishma11y ago

Shouldn't that be the default assumption? It's pretty rare to not have SSE.

2 more replies

ape411y ago· 4 in thread

In the real world... read the CSV into a database (doesn't really matter how fast or slow it is). Access the data from the database.

victorNicollet11y ago

I'm not sure we could afford the skills and software to achieve these speeds with SQL.

dilap11y ago

In the real world, sometimes you have an external data source, in csv (which you can't control), which is changing, which you want to reread as quickly as possible. :)

ldng11y ago

You then use PostgreSQL Foreign Data Wrapper (aka SQL/MED)

1 more reply

bitwize11y ago

paulmd11y ago· 2 in thread

I kinda suspect he might be measuring the time it takes to launch a kernel rather than the time it takes the kernel to complete.

I don't have access to a CUDA box at the moment, I'd have to add those cudaDeviceSynchronize calls after the for_each invocations to be sure.

antonmksOP11y ago

Thrust CUDA calls are synchronous to each other. You can add an explicit synchronization call cudaDeviceSynchronize() and there won't be a difference in results.

andrewguenther11y ago

I had suspected this as well, but I'm not familiar with how Thrust's CUDA bindings work.

victorNicollet11y ago· 2 in thread

Very interesting. From my experience, the hard part about parsing CSV isn't to identify the individual cells, but rather parsing those cells afterwards (as numbers, dates, etc).

What is the performance of those operations (e.g. parsing YYYY-MM-DD dates to Unix timestamps) when performed on the GPU ?

https://github.com/Lokad/lokad-flatfiles

TheLoneWolfling11y ago

Makes sense for low-entropy data. Though I can see that approach choking on some datasets. What happens if every entry has a GUID, for example?

May be better to do a best-effort deduplication instead of an exhaustive approach.

victorNicollet11y ago

At some point, we considered tweaking by dropping any strings longer than a certain length from the deduplication (it also helps with memory usage when streaming the data).

andrewguenther11y ago· 1 in thread

This title is incredibly misleading.

* This isn't parsing a CSV, this is a program written to split this exact dataset. (The code is filled with hard coded values)

* You're comparing a single-threaded run on a low-end CPU to a top-tier GPU.

* Your dataset can fit into GPU memory.

* There is a pull request for a missing semicolon, which means the posted version of the code won't even compile, so couldn't have been the version used to generate the benchmarks.

antonmksOP11y ago

Sorry if you are misled :-) The program does parses selected fields, there are strings-to-binaries procedures.

I tested this approach on multiterabyte files, take a look at my alenka project, it uses the same method to load large CSV files into databases. It just have to be done in chunks.

The program compiles fine, that pull request was referring to incorrect earlier version of test.cu file.

Test it for yourself, see if you get similar results.

roel_v11y ago· 1 in thread

I don't understand the benchmark. How can a 750GB file be read from disk in 1.5 seconds, let alone be parsed? He mentions it's a 2TB drive, so it's not even SSD presumably?

bbrks11y ago

The README has just been updated to change 750GB to MB.

https://github.com/antonmks/nvParse/commit/fab4c4728096003bc...

mholt11y ago· 1 in thread

[1] http://papaparse.com

pestaa11y ago

I just used Papa at the office today. Remembered it from a while ago, it really turned out to be effortless. Excellent work.

(While I'm at it, may I suggest to provide CDN URLs on the website? IMO it would make the otherwise awesome page perfect.)

Thank you again.

avarsheny11y ago· 1 in thread

antonmksOP11y ago

I used average time for multiple runs, both GPU and CPU benchmarks ran from memory.

maxhou11y ago· 1 in thread

In the real world, the slow part of "parsing" a CSV file is IO: reading the file content from disk to memory, and from memory to CPU cache.

You would avoid reading the file content more than once if you had to parse it.

> The first line counts the number of lines in a buffer (assuming that file is read into memory and copied to gpu buffer d_readbuff).

but this is what is done here, first search to find all \n, then multi-core GPU stuff for each line content.

azurezyq11y ago

Things have been changed in a world of SSDs and machines w/ much memory. Actually, parsing a CSV in a single thread will never reach several hundred megabytes per second.

rdc1211y ago

"However this method of parsing is CPU bound because * it doesn't take advantage of multiple cores of modern CPUs. * memory bandwidth limitations"

So not CPU bound at all then.

snissn11y ago

[0] http://thrust.github.io/

aftbit11y ago

Am I crazy? With a hot disk cache, the cut command he gave takes 1.2 seconds on my machine.

j / k navigate · click thread line to collapse