this, "is, like, CSV", "with three so-called ""fields"""
Note that unquoted leading and trailing whitespace, and whitespace around the commas, is deleted, too.(See CSV page in the Wikipedia)
A GPU-accelerated string split could be useful but it's not quite "parsing CSV".
The kinds of patterns that would make parsing more parallelizable, like marking the beginning of a delimited region with its length, are human unfriendly so would never be part of an actual text format. Who would ever want to write this?
# Update "19" whenever string length changes.
x = {19}"String of length 19"There are two mechanisms that are usually used to get around this:
(1) Perform a fast, sequential "skeleton parsing" pass before the main parse that scans just enough to find "split points" that are consumed by the parallel parser. This is what some of the parallel XML parsing work [1] did.
(2) Guess the state you're in based on some heuristics, and roll back on failure. This actually works surprisingly well in practice for many grammars, for example HTML [2].
[1]: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=410047...
Also look at CYK parsing algorithm, it is highly parallelizable (it is based on matrix multiplication).
I'm not sure why you referenced the Wikipedia page -- it indicates that according to RFC 4180, (a) "spaces outside quotes in a field are not allowed", and (b) "Spaces are considered part of a field and should not be ignored."
Maybe you're referring to the comment that CSV parsers should "be liberal in what you accept from others"...?
http://en.wikipedia.org/wiki/Comma-separated_values#Basic_ru...
Note that it's clear that (b) refers only to spaces inside quotes, because spaces outside quotes are not allowed.
> Maybe you're referring to the comment that CSV parsers should "be liberal in what you accept from others"...?
I actually don't believe in that principle. If there is a spec that everyone agrees upon, violations should be accurately and loudly diagnosed and rejected.
Both preparation of the data format and processing of that data format should be conservative. Being liberal in what is accepted has unintended negative consequences.
Be that as it may, there is no universal CSV spec, though. RFC 4180 is just someone's opinion on what CSV should be. CSV is something that has been widely implemented in numerous programs over numerous decades, in different ways.
My main point holds that if you split the string on commas and do nothing else, then one of the aspects you're neglecting to handle is the treatment of unquoted whitespace outside of a field.
There is a danger in being too detailed because there is no universal spec anyway. For many users, CSV is whatever the most recent version or two of Microsoft Excel accept, as confirmed by trial-and-error reverse engineering.
The author is comparing a GPU to a CPU, yet the CPU is only running a single thread (supposedly, the author did not provide the CPU code used in the comparison). For a true comparison the full capability of the CPU should be exposed by means of a multithreaded application (and, as someone else has already mentioned, vector instructions such as SSE). Think performance per socket, not performance per thread.
You also need high arithmetic intensity (the ratio of arithmetic/logical operations to memory loads). Of common CPU-bound tasks, CSV parsing has one of the lowest arithmetic intensities imaginable.
That's why I'm not skeptical at all. A GPU program can operate over THOUSANDS many more data items in parallel than a single-threaded scalar CPU program can. Yet the speedup is not thousands, not even hundreds, but a mere 8.
Fits perfectly with
> main one being that this particular problem does not fit the paradigm of problems that work well on the GPU
I would like to see an OpenCL kernel that is run on both the GPU and CPU to possibly even the playing field a little.
It's a highly memory-limited task so I suspect that's where any speedup would come from.
Sometimes, you are not lucky enough to have perfect control over the encoding, number format and date format of the input, so you need to look ahead at a value sample to try and find out what those are.
Sometimes, you cannot even assume that the software that produced the file didn't mangle the quotes around fields, and you have to detect that you are 40960 bytes into a quoted field, decide that it's probably an error, and backtrack.
If you have enough memory to load the entire file, you will save a lot of time by giving up on streaming processing.
Is it the case that the CPU version could be sped up dramatically by using multiple cores and a variation on the line splitting technique?
Seeing that it takes 14.5s with the handrolled code to parse 750GB, I however doubt the optimization is needed - it's more than 50GB every second and you need some really really exotic hardware to generate that much data. You still need to do something with the data and it's likely significantly more computational intensitive - that should be on a different thread.
That said, using the GPU would free up the CPU to do meaningful work with the data. But to be fair, assuming 1GB/s datarate, it's still 12,5m to read it all and you only gain 14.5s or less than 2% speedup if you're CPU bound.
I'm not sure we could afford the skills and software to achieve these speeds with SQL.
Thrust device calls, like those of the underlying CUDA library, are asynchronous by default. The only exception is calls that result in a memcpy, which are synchronous. To wait until an async call is completed you need to call one of the synchronize commands, like cudaDeviceSynchronize.
Looking through his test.cu file, he snaps a timestamp using std::clock right after doing the kernel launch with for_each. Ignoring the fact that this is not an accurate way to benchmark a GPU (you need to use events to accurately benchmark the kernel) what you're capturing will just be the processor time it takes to make the async kernel launch. Std::clock measures CPU time, which is (rightly) close to 0 for a program that runs on the GPU.
It's entirely possible that you're not even getting valid results out of the other end - note that you don't show output. I don't know if thrust's magic device memory access function triggers a synchronization or not. I kinda remember having to make an explicit call when I did a GPU simulation.
I don't have access to a CUDA box at the moment, I'd have to add those cudaDeviceSynchronize calls after the for_each invocations to be sure.
What is the performance of those operations (e.g. parsing YYYY-MM-DD dates to Unix timestamps) when performed on the GPU ?
My company actually picked another optimization strategy, by making the tokenization significantly longer, but it de-duplicates the tokenized cells so that each distinct cell value (a date, a number, a string) can be parsed exactly once. We have seen some fairly good results out of this, compared to the naive approach of stream-token-parse:
May be better to do a best-effort deduplication instead of an exhaustive approach.
Our method makes most sense for many-to-many data (several orders per product, several orders per day), which happens to be the largest data sets we manipulate (by 3 orders of magnitude). I can certainly see situations where this would not be the case (e.g. web crawler logs).
* This isn't parsing a CSV, this is a program written to split this exact dataset. (The code is filled with hard coded values)
* You're comparing a single-threaded run on a low-end CPU to a top-tier GPU.
* Your dataset can fit into GPU memory.
* There is a pull request for a missing semicolon, which means the posted version of the code won't even compile, so couldn't have been the version used to generate the benchmarks.
* The amount of branching in the GPU code makes it hard for me to believe that it actually ran that fast. GPU parallelism does not work well with branching since all cores in a cube must executing in lock-step, if you branch, then you now have to go back and execute all of your branches separately.
I tested this approach on multiterabyte files, take a look at my alenka project, it uses the same method to load large CSV files into databases. It just have to be done in chunks.
The program compiles fine, that pull request was referring to incorrect earlier version of test.cu file.
Test it for yourself, see if you get similar results.
https://github.com/antonmks/nvParse/commit/fab4c4728096003bc...
(While I'm at it, may I suggest to provide CDN URLs on the website? IMO it would make the otherwise awesome page perfect.)
Thank you again.
You would avoid reading the file content more than once if you had to parse it.
> The first line counts the number of lines in a buffer (assuming that file is read into memory and copied to gpu buffer d_readbuff).
but this is what is done here, first search to find all \n, then multi-core GPU stuff for each line content.
So not CPU bound at all then.