bf.lines().forEach(s -> sb.append(s));
However, this ends up concatenating all the lines into one giant line, since the Strings that lines() produces have the newline character stripped. That makes the second lines() call read a single 23 MB line (for the file produced by gen.py). This is less than optimal.

The fastest version I managed to write was:
    public void readString5(String data) throws IOException {
        int lastIdx = 0;
        for (int idx = data.indexOf('\n'); idx > -1; idx = data.indexOf('\n', lastIdx)) {
            parseLine(data.substring(lastIdx, idx));
            lastIdx = idx + 1;
        }
        parseLine(data.substring(lastIdx));
    }
Not the prettiest thing, but it went from 0.594 GB/s to 1.047 GB/s. Also, it doesn't quite do the same as the lines() method, but that's easily changed.

    public void readString(String data) throws IOException {
        int lastIdx = 0;
        for (int idx = data.indexOf('\n'); idx > -1; idx = data.indexOf('\n', lastIdx)) {
            parseLine(subSequenceView(data, lastIdx, idx));
            lastIdx = idx + 1;
        }
        parseLine(subSequenceView(data, lastIdx, data.length()));
    }
    CharSequence subSequenceView(CharSequence base, int beginIndex, int endIndex) {
        return new StringView(base, beginIndex, endIndex - beginIndex);
    }

    static class StringView implements CharSequence {
        final CharSequence base;
        final int offset;
        final int length;

        StringView(CharSequence base, int offset, int length) {
            if (length < 0) throw new IllegalArgumentException("length must be >= 0");
            this.base = base;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public char charAt(int n) {
            if (n < 0 || n >= length)
                throw new IndexOutOfBoundsException(n);
            return base.charAt(offset + n);
        }

        @Override
        public int length() {
            return length;
        }

        @Override
        public CharSequence subSequence(int beginIndex, int endIndex) {
            return new StringView(base, offset + beginIndex, endIndex - beginIndex);
        }
    }

Where is the second lines() call?
What does it say about you that you're asking a leading question in this way?
Nothing. Mistakes happen.
Erlang is another language where "naive" IO is kind of slow. https://github.com/bbense/beatwc/ is a project someone did to test various methods of doing IO in Erlang/Elixir, and their performance for a line-counting task, relative to the Unix wc(1) command.
It's interesting to see which approaches are faster. Yes, parallelism gains you a bit, but a much larger win comes from avoiding the stutter-stop effect of cutting the read buffer off whenever you hit a newline. Instead, the read buffer should be the same size as your IO source's optimal read-chunk size (a disk block; a jumbo network packet), and you should grab a whole buffer-ful of lines at a time, do a pattern-matching binary scan to collect all the indices of the newlines, and then use those indices to part the buffer out as slice references.
This achieves quite a dramatic speedup, since most of the time you don't need movable copies of the lines, and can copy the line (or more likely just part of it) yourself when you need to hold onto it.
This approach is probably also already built into Java's "better" IO libraries, like NIO.
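For illustration, here is a rough Java sketch of that scan-then-slice idea (class and method names are mine; the "slices" here are just offset/length pairs, copied into Strings only at the point of use):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch: scan one big read buffer for newline positions in a single pass,
// then hand out (offset, length) slices instead of copying every line.
public class SliceScan {
    // Collect the index of every '\n' in the first len bytes of buf.
    static List<Integer> newlineIndices(byte[] buf, int len) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < len; i++) {
            if (buf[i] == '\n') idx.add(i);
        }
        return idx;
    }

    public static void main(String[] args) {
        byte[] buf = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.US_ASCII);
        int start = 0;
        for (int nl : newlineIndices(buf, buf.length)) {
            // A "slice" is just (start, nl); copy only when you need to keep it.
            System.out.println(new String(buf, start, nl - start, StandardCharsets.US_ASCII));
            start = nl + 1;
        }
    }
}
```

The point of collecting the indices first is that the scan is a tight, predictable loop over the whole buffer, and the slicing step allocates nothing until a line is actually retained.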
edit: Lemire is showing a lack of what Fowler describes as "mechanical sympathy".
https://martinfowler.com/articles/lmax.html#QueuesAndTheirLa...
This means that, even when your data set compresses so well that you'd theoretically gain a ton of speed by streaming the data from disk compressed and then decompressing it during parsing, this doesn't apply in practice, since you're introducing an artificial bottleneck in your IO reads.
I'm not actually sure if this is a bug in Erlang, per se, or if it's just the intended behavior and compressed file IO was never intended to be used for performance, only for e.g. embedded devices with tiny ROMs.
(If people here think it's a bug, I'll probably go to the effort at some point to profile the performance impact and submit it as a bug on https://bugs.erlang.org.)
† What it's actually doing, is that all zlib compression/decompression passes get sent to a zlib "port driver" as messages. Port drivers can handle multiple requests in-flight at once (they're the in-process equivalent to sockets—each Erlang process's port against the port-driver is its own "connection") but the zlib port driver shim is coded to expose zlib as a single-threaded, blocking, request-response style of server, rather than one that accepts connections in parallel and instantiates a separate zlib context for each separate connection it receives.
Some of the blame for that probably lies with his headline choice, but he clearly states at the end of this post:
""" This is not the best that Java can do: Java can ingest data much faster. However, my results suggest that on modern systems, Java file parsing might be frequently processor-bound, as opposed to system bound. That is, you can buy much better disks and network cards, and your system won’t go any faster. Unless, of course, you have really good Java engineers.
Many firms probably just throw more hardware at the problem. """
It's not about this piece of code. It's not even about Java. In the previous post, which he mentions at the start of this one, he pointed out:
""" These results suggest that reading a text file in C++ could be CPU bound in that sense that buying an even faster disk would not speed up your single-threaded throughput. """
So, I take his point to be that one shouldn't make assumptions about performance. Rough performance scales -- such as have been posted here many times (e.g. [1]) -- make great rules of thumb for implementation choices or as a guide for where to look first for bottlenecks. To optimize in the real world, though, you're best served using real measurements.
[1] https://www.prowesscorp.com/computer-latency-at-a-human-scal...
It has no baseline and no specs. For all I know, he could have got his 0.5 GB/sec on an old Pentium II processor.
There is no analysis.
I am perplexed.
edit: fine, so instead maybe click on the links in the post to see that this article is just one of a series. He's probably tired of copy-pasting the specs of his reference hardware (Skylake https://arxiv.org/pdf/1902.08318.pdf) since all he's concerned about is the relative performance of different software.
There is a difference between "I'm being dumb because I don't know what I'm doing" and "I'm being lazy because I've done it 1,000 times and the target audience knows what I mean".
I'm actually a stickler about good benchmarks - it riles me when people draw sweeping conclusions from poorly-designed experiments. Lemire is actually one of the good ones. If you want something more fully developed than a blog post, read one of his papers.
I personally really enjoy his blog because of this - he's good at picking interesting exploratory experiments that provide some insight, without trying to over-generalize from the results. If you read his conclusion, the point is that there is a good probability that even relatively simple programs are CPU-bound. His experiment supports that point. My experience also matches that - I've seen a lot of data processing code that could be I/O bound in theory (i.e. a perfect implementation could max out CPU or network) but is CPU bound in practice. Usually because of string manipulation, regexes, or any number of other things.
> This is not the best that Java can do: Java can ingest data much faster. However, my results suggest that on modern systems, Java file parsing might be frequently processor-bound, as opposed to system bound. That is, you can buy much better disks and network cards, and your system won’t go any faster. Unless, of course, you have really good Java engineers.
Java NIO channel should have been used for this. It was demonstrated back in the early 2000s with the "Grand Canyon" demo achieving very good throughput for its time, and it's still the gold standard.
- at least one heap allocation for every line. After it finds the EOL it first uses 'new String' followed by '.toString()'
- the C++ version will almost certainly be backing onto memchr() behind the scenes, which will be using SIMD instructions where it makes sense (e.g. large enough scan size, probably true in this case). The Java version is a manual bytewise-coded loop.
- the C++ version is reusing its output buffer, no reallocations assuming the same string length or less
No idea about encodings in Java, maybe that is playing a role too
As a result, std::string was disgustingly slow compared to C code.
I made a similar benchmark. The idea is as follows: we have a 2 GB byte array (because arrays in Java have a 32-bit limit, LoL) filled with values 32..126, imitating ASCII text, and the value 13, imitating newlines.
The first test simply XORs the whole array. It's the ideal result, which should correspond to memory bandwidth.
The second test wraps this array into a ByteArrayInputStream, converts it into a Reader using InputStreamReader with UTF-8 encoding, reads lines using BufferedReader, and in the end also XORs every char value.
For 2 GB I have 516 ms as an ideal time (3.8 GB/s, which is still almost an order of magnitude less than the theoretical 19.2 GB/s DDR4 speed) and 3566 ms for the BufferedReader version, so you can get almost a 7x speed improvement with a better implementation.
Benchmark: https://pastebin.com/xMD4W8mn
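Not the linked benchmark itself, but a condensed sketch of the two tests described (names are mine, and it uses '\n' rather than the byte value 13 for simplicity; timing harness omitted):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Two passes over the same data: a raw XOR over the bytes (the memory-bandwidth
// "ideal") vs. the BufferedReader line-by-line path.
public class XorBench {
    // Test 1: XOR every byte directly.
    static long xorBytes(byte[] data) {
        long acc = 0;
        for (byte b : data) acc ^= b;
        return acc;
    }

    // Test 2: decode as UTF-8, read lines, XOR every char value.
    static long xorViaReader(byte[] data) throws IOException {
        long acc = 0;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                for (int i = 0; i < line.length(); i++) acc ^= line.charAt(i);
            }
        }
        return acc;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "ab\ncd\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(xorBytes(data) + " " + xorViaReader(data));
    }
}
```

Note the two results differ in general, since the reader path strips line terminators before XORing; the XOR is only there to keep the JIT from eliminating the loops.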
Read a file using something like Vert.X, which is optimized for speed. I'm 100% confident it will be faster than the naive C approach.
I do a lot of work with large GZIP files that are read line by line using the standard IO (i.e. GZIPInputStream(FileInputStream), etc.), but your comment has really made me second-guess my choice of doing that...
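For reference, the "standard IO" pattern being described is roughly this (a sketch, not the commenter's actual code; the main method just builds a throwaway gzip file to run against):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// GZIPInputStream over a FileInputStream, read line by line via BufferedReader.
public class GzipLines {
    static long countLines(String file) throws IOException {
        long count = 0;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(file))))) {
            while (r.readLine() != null) count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".gz");
        try (var out = new GZIPOutputStream(new FileOutputStream(f))) {
            out.write("one\ntwo\nthree\n".getBytes());
        }
        System.out.println(countLines(f.getPath())); // prints 3
        f.delete();
    }
}
```

Whether this is actually a bottleneck depends on where the time goes: for gzip-heavy workloads the inflate step often dominates, in which case the per-line String allocations matter less than they would on uncompressed input.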
You should be able to hack something together that feeds zero copy buffers into Netty compression handler. Maybe using Netty or Vert.X file API, or maybe just raw NIO2.
I'm not sure how fast this would be, but my gut says "very". Netty can easily saturate 40 Gigabit Ethernet lines, and file IO should have less overhead. That's ~5 gigabytes a second.
It's going to be a good bit of coding for sure. Vert.X/Netty/NIO2 are all async and pretty low level. They're generally 1 thread per core, and along with SSD read patterns you're probably best off reading files in parallel, one per core. Might not be worth the effort.
You may want to look into ZStandard as well. It's superior to Gzip in most ways when you need something fast but decent.
    Path path = ...
    var count = 0;
    try (var channel = FileChannel.open(path)) {
        var buffer = ByteBuffer.allocateDirect(8192);
        while (channel.read(buffer) != -1) {
            buffer.flip(); // switch the buffer from filling to draining before scanning
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') {
                    count++;
                }
            }
            buffer.clear();
        }
    }
    System.out.println(count);
but in your particular case, I don't think there is a Gzip decoder that works on ByteBuffer.

I'm not saying that's happening here, but it's a basic fact when writing benchmarks that you have to actually test something real and not a transient property of the program after the compiler has had its chance to be really smart.
The correct way to do it is using collect(Collectors.joining("\n")) or straightforward imperative style (without streams).
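A minimal illustration of that joining approach (class and method names are mine), which puts the newlines back instead of silently concatenating the stripped lines:

```java
import java.util.List;
import java.util.stream.Collectors;

// Join lines with Collectors.joining("\n") rather than appending
// newline-stripped Strings into one giant StringBuilder blob.
public class JoinLines {
    static String join(List<String> lines) {
        return lines.stream().collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        System.out.println(join(List.of("one", "two", "three")));
    }
}
```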
I don't think the general statement holds (that Java or BufferedReader is CPU-bound in particular).
It is like asserting something about C based on GCC specific behaviour.
Java is not a single language implementation.
So now I have to go out and do benchmarks across all these implementations just to prove they aren't the same?!?
https://en.m.wikipedia.org/wiki/List_of_Java_virtual_machine...
I was once working on an Android app on a cheap custom board with 128 MB of RAM (don't ask why Android on a single-function custom board, wasn't my decision).
Among other things, I had to parse an 80,000-line CSV file. Splitting and the rest of the processing created so many temporary strings that the system ran out of RAM. We eventually gave up.
- Copies and pastes the top answer, which buffers the whole thing into the heap before parsing
- Blames Java/hardware when this doesn't scale
Let us all hope that Project Valhalla, the effort to add value types to Java, comes to a swift completion. It would be very helpful in these sorts of scenarios.
Though at this point I'd wonder (as you did) why I'm using Java in the first place given the resource constraints of the system, I've had some success using byte[] arrays and using `yourInputStreamHere.read(someBuffer, yourOffset, YOUR_PAGE_SIZE)` in a few less constrained but relatively more performance critical areas.
Depending on your exact needs, this can sometimes result in more or less constant memory consumption; it's possible to reuse the array on each read. You can also take it a step further by finding the indexes where you would split the array (e.g. the index of the next comma for a CSV) and using methods that operate on an array plus start and end indexes. That sort of strategy can allow you to avoid additional copies of the array, at the cost of somewhat annoying and less maintainable code that is definitely not Java-esque.
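A toy sketch of that reused-buffer, index-based splitting idea (the buffer size and the field-counting task are illustrative, not from the original app):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Read into the same byte[] each time and split on commas by index,
// never materializing a substring unless a field is actually needed.
public class CsvScan {
    // Count comma-separated fields in the first len bytes without allocating.
    static int countFields(byte[] buf, int len) {
        int fields = (len > 0) ? 1 : 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == ',') fields++; // field boundary: just an index, no String
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[4096]; // reused across reads
        InputStream in = new ByteArrayInputStream("a,b,c".getBytes(StandardCharsets.US_ASCII));
        int n = in.read(buf, 0, buf.length);
        System.out.println(countFields(buf, n)); // prints 3
    }
}
```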
There are also less severe solutions that won't buffer the whole file in memory, using StringBuilder or other techniques, etc. depending on what you're doing.
This all might be for naught though; I'm assuming the records have to be parsed into some structure, so tricks like the above might not be good enough given Java's seemingly insatiable hunger for heap space.
While I'm not the language's greatest fan, this is one area where Go has been able to shine. Having value types as a possibility makes solving this sort of problem a typically less expensive proposition.
https://mail.openjdk.java.net/pipermail/valhalla-dev/2019-Ju...
1. 128 MB of RAM total, 20-30 MB free at best. The CPU was slow and we were running off an SD card. Not exactly an enterprise monster.
2. Of course I was parsing it line by line. But running OOM triggered the garbage collector so many times that it took like 3-4 minutes to finish. I didn't mean that the system actually stopped because of running out of RAM.
3. Why didn't you do it with the NDK/some other solution? Because the PM just gave up on the feature when it wasn't done in a day. It was just real-time search completion, which we didn't really need there; it was just "nice to have".
You just have to use NIO and use direct buffers instead of naively reading and allocating strings.
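One hedged sketch of what that might look like, decoding through a reused CharBuffer so no per-line Strings are allocated (names and buffer sizes are mine; the final end-of-input decoder flush is omitted for brevity, which is fine for counting newlines in well-formed text):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Direct-buffer reading: decode bytes into one reused CharBuffer and scan it,
// instead of allocating a String per line.
public class DirectRead {
    static int countNewlines(Path path) throws IOException {
        var decoder = StandardCharsets.UTF_8.newDecoder();
        int newlines = 0;
        try (FileChannel ch = FileChannel.open(path)) {
            ByteBuffer bytes = ByteBuffer.allocateDirect(8192);
            CharBuffer chars = CharBuffer.allocate(8192); // reused every pass
            while (ch.read(bytes) != -1) {
                bytes.flip();                       // drain what was just read
                decoder.decode(bytes, chars, false);
                chars.flip();
                while (chars.hasRemaining()) {
                    if (chars.get() == '\n') newlines++;
                }
                chars.clear();
                bytes.compact();                    // keep any partial UTF-8 sequence
            }
        }
        return newlines;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("demo", ".txt");
        Files.writeString(p, "one\ntwo\nthree\n");
        System.out.println(countNewlines(p)); // prints 3
        Files.delete(p);
    }
}
```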