Making Rust as Fast as Go (opens in new tab)

(christianfscott.com)

304 pointschrfrasco6y ago200 comments

200 comments

87 comments · 21 top-level

codeflo6y ago· 16 in thread

I recently did some experiments with creating small static Rust binaries, custom linking, no_std et cetera. A lot of stuff around that kind of thing is unstable or unfinished, which might be somewhat expected. But I’ve also come to the conclusion that Rust relies on libc way too much. That might be fine on Linux, where GNU’s libc is well-maintained, is a bit questionable on MacOS (as seen in this article) and is a a complete distribution nightmare on Windows (in no small part due to a series of very questionable decisions by Microsoft).

My understanding is that Go doesn’t use the libc at all and makes system calls directly, which IMO is the correct decision in a modern systems programming language that doesn’t want to be limited by 40 years of cruft.

ekidd6y ago

As far as I know, the official system interface on Windows and several Unix systems is via the standard library, not via direct syscalls. I don't know about the MacOS. But in general, you may be required to dynamically link the standard library on many platforms.

Linux guarantees syscalls are stable. And on Linux, you have the option of telling Rust to cross-compile using a statically-linked musl-libc. (If you also need to statically link OpenSSL or a few other common libraries, I maintain https://github.com/emk/rust-musl-builder, and there's at least one similar image out there.)

majewsky6y ago

macOS requires you to make syscalls through libSystem if you want a stable interface. Go binaries used to make direct syscalls until 1.10. Since this caused major breakage on most new macOS releases, they have since switched to using libSystem as well in 1.11:

> On macOS and iOS, the runtime now uses libSystem.dylib instead of calling the kernel directly. This should make Go binaries more compatible with future versions of macOS and iOS. The syscall package still makes direct system calls; fixing this is planned for a future release.

Source: https://golang.org/doc/go1.11

masklinn6y ago

> I don't know about the MacOS.

MacOS literally forbids statically linking to libSystem.

Go finally had to bow down and accept that they just could not perform raw syscalls on MacOS after gettimeofday (IIRC) changed ABI multiple times during the Sierra beta.

pansa26y ago

AFAIK on Windows, the hierarchy is:

C library => kernel32.dll => ntdll.dll => system calls

You don’t have to go via the C library - calling kernel32 directly is fine (I believe this is what Go does). However, it’s very rare to call ntdll or to make system calls directly.

4 more replies

codeflo6y ago

That’s partly true, but the stable interface on Windows is not the libc, but the kernel32/user32 functions like VirtualAlloc, CreateFileW etc. Those are stable since the NT days. The libc functions like malloc and fopen are a compatibility layer above that and unfortunately switch places every few years. Currently they are delivered by a combination of a Windows update and a redistributable package, which makes it a nightmare to ship (on pre-Windows 10 even more so).

1 more reply

fluffything6y ago

> But I’ve also come to the conclusion that Rust relies on libc way too much.

How did you come to this conclusion?

People using Rust rely on libc a lot.

For example, #![no_std] means "no standard-library", but it doesn't mean "no libc, no libunwind, etc.".

So a lot of people like to advertise their crates as "#![no_std]" compatible, because they compile with #![no_std], but then the first thing the crate does is linking against libc, against liballoc, against libm, .... or even the standard library itself...

So... if you are trying to build a binary that does not depend on libc or libstd, then there is no language feature to help you there.

#[no_std] binaries are not only unstable, but also probably not what you want, since that won't prevent you from linking any library that links libc, or the standard library, etc.

If you want to avoid libc, you have to enforce that, e.g., by checking in your build system that your binary doesn't contain any libc, libstd, etc. symbols (I just use nm for this). #![no_std] helps a bit here, but making sure you don't link any library that violates this is up to you.

geofft6y ago

To me, a #![no_std] library is useful for contexts where there isn't a platform libc (embedded systems, kernels, etc.) and you can't assume things like the existence of stdin or processes. On those systems, there may still be a dynamic allocator, because that's a super common feature in all but the most spartan environments. For those use cases, a #![no_std] library that links liballoc (and documents that it needs it) is totally okay, and often better than a more-limited library that only does static allocation - or conversely implementing all of libstd with runtime panics for most of it, just because your library needs to pass a Vec around. It's a balance.

The only case I think I've seen where a #![no_std] library ends up pulling in libc is if you haven't added a custom allocator and your platform's default allocator uses libc (and so you could switch to a libc-free allocator if you want). Are there other cases?

2 more replies

codeflo6y ago

That’s true. But going the no_std route is very hard (the ecosystem isn’t big, and relying cargo and crates.io you’re almost guaranteed to link in the std or liballoc by accident at some point). Even when using the std intentionally, I really wouldn’t have expected that basic std functions like println require the libc.

2 more replies

cesarb6y ago

> [...] relies on libc way too much. That might be fine on Linux [...]

> My understanding is that Go doesn’t use the libc at all and makes system calls directly

Actually, the only system on which it's fine to "not use the libc at all and make system calls directly" is Linux. On MacOS, Windows, and most non-Linux Unix-like systems, you must to go through the libc or its equivalent (which on Windows is kernel32.dll and/or ntdll.dll), since the system call interface is unstable (the libc or its equivalent is distributed together with the kernel, so it's updated whenever the system call interface changes).

AFAIK, Go tried for a while to use system calls directly on MacOS; after being broken several times by operating system updates, they gave up and now go through the shared libraries like everyone else. They still insist on using direct system calls on Linux, where it works mostly fine (except for things like the DNS resolver, in which AFAIK they try to guess whether they can do directly network DNS requests, or have to go through libc; and that one time in which Go's small stacks conflicted with a misoptimized kernel VDSO using too much stack).

steveklabnik6y ago

The problem with this is that not every system makes their system call ABI stable. You have two choices here: use the interface that is stable (which is libc), or track changes and fix things up when they break.

codeflo6y ago

The only stable interface on Windows are the documented functions from kernel32.dll, user32.dll etc. Libc is a compatibility layer above that, that Microsoft invents a new incompatible distribution mechanism for every 3-5 years. It’s pure DLL hell unfortunately.

Edit: Not even the DLL name of Microsoft’s libc is stable (msvcrt140.dll etc.), leading to all kinds of wild goose chases when trying to run old binaries.

1 more reply

rhinoceraptor6y ago

Go’s pathological NIH syndrome does come with downsides. For example, there was an infamous memory corruption bug in the way they called into vDSO.

adev_6y ago

> Go’s pathological NIH syndrome does come with downsides.

Yes. And you can add the inability to use the glibc's nss modules under Linux.

Making it unable to use sssd properly and authenticate a posix user on a machine with LDAP authentication.

Getting completely independent from OS sys lib has consequences

1 more reply

weff_6y ago

I think Go tries to reduce its dependence on libc but, by default, it will still link to it.

For instance, this code:

  package main
  import "net"
  func main(){
    net.Dial("tcp", "golang.org:80")
  }

When compiled with go build main.go does link:

  linux-vdso.so.1 (0x00007ffe3d7f0000) 
  libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc7ac05a000)
  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc7abc69000)
  /lib64/ld-linux-x86-64.so.2 (0x00007fc7ac279000)

There are of course compiler options to truly statically compile.

fortran776y ago

This is simply false. Windows APIs are stable. It's one of the best platforms for back compatibility. The kernel32/user32 API has been stable for 20 years. Rust has some work to do.

codeflo6y ago

That’s exactly my point: The problem is that Rust doesn’t use the stable kernel32/user32 functions (VirtualAlloc etc.), but the libc ones, which don’t have a stable location on Windows. Without looking it up: Which DLL do you have to redistribute this week to get “malloc”?

1 more reply

matklad6y ago· 10 in thread

Note that Rust and Go programs differ in a seemingly insignificant, but actually important detail.

Rust does this

    next_dist = std::cmp::min(
        dist_if_substitute,
        std::cmp::min(dist_if_insert, dist_if_delete),
    );

Go does this

    nextDist = min(
         distIfDelete, 
         min(distIfInsert, distIfSubstitute)
    )

The order of minimums is important for this dynamic programming loop. If I change Rust version to take minimums in the same order (swapping substitute and delete), runtime drops from 1.878696288 to 1.579639363.

I haven't investigated this, but I would guess that this is the same effect I've observed in

* https://matklad.github.io/2017/03/12/min-of-three.html

* https://matklad.github.io/2017/03/18/min-of-three-part-2.htm...

(reposting my comment from reddit, as it's a rather unexpected observation)

ttd6y ago

Those are two really well written articles -- taking a complicated topic and making it very accessible. Thanks! FWIW I think a companion article on how to effectively use perf for tasks like this would be a great addition, since it can be a bit novice-unfriendly.

ishanjain286y ago

I think they made the rust version same as Go because I cloned it just now and they are both the same. Also, Thank you soo much for the blog posts! :)

_____smurf_____6y ago

Given this information, and for general parsing functionalities, which one is faster, Go or Rust?

arcticbull6y ago

As always, it depends on what your goals are. String processing is usually not the long pole when you're building something that consumes the output of a parser. Based on micro and macro benchmarks, Rust is typically substantially faster than Go and pretty much always uses less RAM [1].

But again, depends what you're doing with the output, and if these deltas even matter in your context.

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

fluffything6y ago

You can write a parser as far as the limits of your imagination allow in Rust, or C, or any other baremetal language without a thick run-time.

In Go, beyond the limits of your imagination, you'll also hit other limits, like those of the garbage collector.

drfuchs6y ago

Very few things have ever been measured accurately to ten significant digits. How much did these numbers change run to run? How many measurements did you take? Were the caches warmed up similarly? Still, point taken.

matklad6y ago

The excessive "precision" is because I've just copy-pasted what the original bench printed.

As for "is this a reliable result", I believe I've performed diligence, appropriate for a HN comment, to make sure that this is not a completely random result. As I've said, I did not investigate this particular bit of code thoroughly. You are welcome to read the linked blog posts, which study the issue in depth.

londons_explore6y ago

Since min(a, min(b,c)) == min(b, min(a,c)), perhaps the compiler should be smart enough to swap the comparisons around if it makes it quicker?

dan-robertson6y ago

I suspect that statement is not true for floats. Possibly you don’t get the same float from min(0,-0) as min(-0,0), and similarly with NaNs. Rust specifies that if one input is NaN then the other is returned but doesn’t say what happens if both are NaN.

3 more replies

arcticbull6y ago

Part of the problem may be they re-implemented `std::cmp::min` at the bottom of the file, I wonder if there's a more optimized version in the stdlib.

1 more reply

molf6y ago· 10 in thread

The Rust version uses `target.chars().count()` to initialise the cache, while the Go version counts up to `len(target)`. These are not equivalent: the Rust version counts Unicode code points, the Go version counts bytes.

I am confused by the implementations, although I have not spent any time testing them. Both versions contain a mix of code that counts bytes (`.len()` and `len(...)`) and Unicode code points (`chars()` and `[]rune(...)`). My guess is that the implementation might not work correctly for certain non-ASCII strings, but I have not verified this.

Of course, if only ASCII strings are valid as input for this implementation then both versions will be a lot faster if they exclusively operate on bytes instead.

eis6y ago

Yep.

Here a Go playground example showing that the result is indeed wrong:

https://play.golang.org/p/vmctMFUevPc

It should output 3 but outputs 5 because each ö is two bytes, len("föö") = 5.

I would suggest using "range" to iterate over the unicode characters.

earthboundkid6y ago

If you diff föö and f though, it correctly gives an edit distance of 2.

The code is weird because someone knew enough to convert the strings to slices of runes but not enough to use the rune slices consistently. :-/

1 more reply

savaki6y ago

Looks like a small bug in the go code, corrected here. Original author should have used rune throughout. https://play.golang.org/p/mGZZMFtMgHH

steveklabnik6y ago

Yes, especially because that changes it from constant time (strings know their length in bytes) to linear time (counting chars means a loop through the text.)

I was a bit suspicious of the conclusion, but didn’t dig in myself. I imagine this would be a much larger source of difference.

maxmcd6y ago

huh, yeah if I switch the rust version to target.len() the execution time drops by more than 10%

edit: and if I switch to source.bytes().enumerate() it drops by 20% more

arcticbull6y ago

Go's version and the Rust version differ in yet more subtle ways. It appears that Go's "rune" type is a Code Point, but Rusts's "char" type is a Unicode Scalar Value, a subset of Code Point that excludes surrogate pairs. Both versions will not work with complex Unicode input unless you perform both segmentation by Grapheme Cluster [1] and utilize a consistent Normalization [2] when comparing clusters.

Unicode is hard, fams, and it's rare that anything that looks easy is actually what you want.

[1] https://unicode.org/reports/tr29/

[2] http://unicode.org/reports/tr15/

fluffything6y ago

> Both versions will not work with complex Unicode input unless you perform both segmentation by Grapheme Cluster [1] and utilize a consistent Normalization [2] when comparing clusters

Doing this is quite easy from Rust.

1 more reply

chrfrascoOP6y ago

Nice catch, thanks for pointing this out. I've updated the cache initialization to use `len(targetChars)` rather than `len(target)`:

    cache := make([]int, len(targetChars)+1)
    for i := 0; i < len(targetChars)+1; i++ {
        cache[i] = i
    }

AFAIK this makes them equivalent (fingers crossed). It seems to not have made much of a difference (-0.03s)

molf6y ago

They might be equivalent (perhaps apart from other issues pointed out in other comments), but both implementations are still either

1) incorrect if UTF-8-strings are supposed to be valid input, or

2) very inefficient if only ASCII-strings are supposed to be valid input.

Thaxll6y ago

https://blog.golang.org/strings

chrfrascoOP6y ago· 9 in thread

Hey all, as some keen-eyed commenters have pointed out, it looks like the rust program is not actually equivalent to the go program. The go program parses the string once, while the rust program parses it repeatedly inside every loop. It's quite late in Sydney as I write this so I'm not up for a fix right now, but this post is probably Fake News. The perf gains from jemalloc are real, but it's probably not the allocators fault. I've updated the post with this message as well.

The one-two combo of 1) better performance on linux & 2) jemalloc seeming to fix the issue lured me into believing that the allocator was to blame. I’m not sure what the lesson here is – perhaps more proof of Cunningham’s law? https://en.wikipedia.org/wiki/Ward_Cunningham#Cunningham's_L...

arcticbull6y ago

Thanks for following up. Just as an FYI, there's a few bugs in your implementation, the most obvious one is the use of ".len()" in a number of places interspersed with ".chars().count()". These two return different values. ".len()" returns then number of UTF-8 bytes in the input string, which for ASCII is the same as ".chars().count()" obviously, but if you do attempt any Unicode characters, your function won't work. ".chars()" provides Unicode Scalar Values (USVs) -- which is a subset of code points, excluding surrogate pairs [1]. Note also this is not the same as a Go rune, which is a code point including surrogate pairs.

Secondly, you re-implemented "std::cmp::min" at the bottom of the file, and I'm not sure if the stdlib version is more optimized.

Lastly, well, you caught the issue with repeated passes over the string.

I've fixed the issues if you're curious: https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea...

Unrelated, I hate the term "fake news" as it's an intentional attempt to destroy the world public's faith in news media. It's a cancer on civilized society. Somewhere your civics teacher is crying into some whiskey, even though of course you're joking.

[1] http://www.unicode.org/glossary/#unicode_scalar_value

arcticbull6y ago

This doesn't even begin to get into the question of what Levenshtein Distance even means in a Unicode context. What's the Levenshtein Distance of 3 emoji flags? I suppose we should be segmenting by grapheme clusters and utilizing a consistent normalization form when comparing, but Rust has no native support for processing grapheme clusters -- or for normalizations I believe. The UnicodeSegementation crate might help.

Based on some cursory research, the go version differs in a more subtle way too. A Rune is a Code Point, which is a superset of the Rust "char" type; it includes surrogate pairs.

1 more reply

afiori6y ago

(related to the unrelated part) what if the media is corrupt? I mean, independently on current events (I really don't want to enter that here) we do live in a world where very few amoral corporations own most of the media industry.

If we (correctly) rely on the media to bring to public attentions relevant facts (both criminal and non-criminal) and keep a watchful eye on the nation who then keeps a watchful eye on the media?

is the model entirely based on always being there enough good journalist to spot the bad ones? how is this affected by the very precarious economics of current internet ads-based venture-funded media enterprises?

I just blurted too many questions... what I am trying to say is that similarly with the police there is not as easy answer in shoud-trust should-not-trust (in the US a supreme Court judge advised to "not talk to the police").

in that case I guess part of the problem is that the job of the police can be miscontrued as "arresting people". in the same way the job of a journalist can be miscontrued as "getting clicks"

overall I don't think we can pass an a priori moral judgement on that term, as essentially represent a statement that the default safety measures have failed.

(I want to reiterate that here I try not to intermingle my point with whether I believe or not that the current use is warranted, I am just trying to say that as a concept it needs to be part of an healthy democracy, the same as some distrust in electoral promises)

dcow6y ago

"Fake News" entered the realm (not even recently, might I add) of popular misuse. Actually, is there a term for: language/grammar incorrectly used because society has developed a familiar "meme" use?

Common examples:

* Look at this dank "meme".

meme has come to mean "a picture shared on the internet that has words on it".

* Let's [have a] "cheers".

It's a toast. You say "cheers" when you toast.

* You missed Suzie and I's party last night.

It's Suzie and my party. This one is particularly annoying because it's made it way past editors and into writing, screenplay, etc.

1 more reply

masklinn6y ago

> Hey all, as some keen-eyed commenters have pointed out, it looks like the rust program is not actually equivalent to the go program. The go program parses the string once, while the rust program parses it repeatedly inside every loop. […] The perf gains from jemalloc are real, but it's probably not the allocators fault. I've updated the post with this message as well.

I don't know that it would be a gain: Rust is pretty good at decoding UTF8 quickly given how absolutely fundamental that operation is, and "caching" the decoded data would increase pressure on the allocator.

Unless you also changed the interface of the levenshtein to hand it preallocated cache, source and destination buffers (or the caller did that).

edit: to the downvoter, burntsushi did the legwork of actually looking at this[0] and found caching the decoding to have no effect at best, unless the buffers get lifted out of the function entirely, which matches my comment's expectations.

[0] https://news.ycombinator.com/item?id=23059753

> But yes, I did benchmark this, even after reusing allocations, and I can't tell a difference. The benchmark is fairly noisy.

1 more reply

otterley6y ago

> this post is probably Fake News

It's not Fake News. Fake News is the publication of intentionally false stories. This is just erroneous.

There's a yawning chasm between the two.

arcticbull6y ago

"fake news" is an attempt at permanently damaging the American public's faith in the fifth estate. Nobody should ever use that term in the current political climate, ever.

2 more replies

will42746y ago

Says who?

When news organizations take other news organizations word for it and the story is false, that's fake news. We called it something different back then, but fake news led to the invasion of Iraq. Negligence is sufficient for fake news, malice not required.

1 more reply

Ar-Curunir6y ago

It’s a joke mate

1 more reply

devit6y ago· 6 in thread

The main problem is that allocating a vector for each evaluation is completely wrong: instead, it needs to be allocated once by making the function a method on a struct containing a Vec; which makes the allocator moot.

The second problem is that at least the Rust code is decoding UTF-8 every iteration of the inner loop instead of decoding once and saving the result, or even better interning the characters and having versions of the inner loop for 32-bit chars and 8-bit and 16-bit interned indexes.

Furthermore the code rereads cache[j] instead of storing the previous value, and doesn't do anything to make sure that bound checks are elided in the inner loop (although perhaps the compiler can optimize that).

The code for computing the min seems to have been written mindlessly rather than putting serious thought towards whether to have branches or not and in what order (depending on an analysis of what the branch directions rates would be).

Implausible benchmark results are almost always an indicator of the incompetence of the person performing the benchmark.

burntsushi6y ago

> The second problem is that at least the Rust code is decoding UTF-8 every iteration of the inner loop instead of decoding once and saving the result

Indeed. This is a pretty damning difference. The `target` string is being repeatedly UTF-8 decoded where as the same is not true in the Go version. The Go version even goes out of its way to do UTF-8 decoding exactly once for each of `source` and `target`, but then doesn't do the same for the Rust program.

> Implausible benchmark results are almost always an indicator of the incompetence of the person performing the benchmark.

Come on. We can do better than this. Please don't make it personal. We all have to learn things at some point.

masklinn6y ago

> Indeed. This is a pretty damning difference. The `target` string is being repeatedly UTF-8 decoded where as the same is not true in the Go version. The Go version even goes out of its way to do UTF-8 decoding exactly once for each of `source` and `target`, but then doesn't do the same for the Rust program.

I'm really not sure that's an issue, utf8 decoding is very, very cheap and it's iterating either way.

It would have to be benched, but I wouldn't be surprised if allocating the caches (at least one allocation per line of input) had way more overhead, especially given the inputs are so very short.

I'm not going to claim Rust's utf8 decoder is the fastest around, but it's very fast.

1 more reply

chrfrascoOP6y ago

> The second problem is that at least the Rust code is decoding UTF-8 every iteration of the inner loop instead of decoding once and saving the result, or even better interning the characters and having versions of the inner loop for 32-bit chars and 8-bit and 16-bit interned indexes.

I tried this. Pulling the .chars() call out of the loop & collecting into a Vec made the performance even worse – the following balloons the runtime from ~2.7s to ~5s:

    let target_chars: Vec<char> = target.chars().collect();
    for (i, source_char) in source.chars().enumerate() {
        let mut next_dist = i + 1;

        for (j, target_char) in target_chars.iter().enumerate()  {

> written mindlessly > incompetence of the person

No challenge there :P I am operating under the assumption that I don't need to understand how compilers work to get good performance from rust (where good is "similar enough to an equivalent go program")

burntsushi6y ago

Indeed, it looks like doing the UTF-8 decoding up-front is exacerbating the performance difference in the allocator.

I think this is where the GP's first suggestion comes into play. If one were writing this code _and_ cared about performance, then you'd usually find a way to reuse allocations. I submitted a PR to demonstrate this: https://github.com/christianscott/levenshtein-distance-bench...

1 more reply

molf6y ago

> Implausible benchmark results are almost always an indicator of the incompetence of the person performing the benchmark.

An ad hominem attack surely isn't needed.

masklinn6y ago

And that's really too bad because I think the rest of the comment is spot-on as the fundamental issue is that Rust simply delegates to the system allocator, and MacOS's allocator is pretty slow.

As a result, while Rust allows very explicitly and relatively easily removing allocations (compared to C or C++), getting the most performances out of your program also requires doing so, unless you use a non-system allocator with better support for the workload.

1 more reply

ishanjain286y ago· 3 in thread

I tried to benchmark Go/Rust versions as well.

I made 4 changes in Rust version.

1. Moved up the line that gets a value from cache[j+1] before any calls are made to cache[j]. This removes 1 bound check. (Improvement from 182,747ns down to 176,xyzns +-4800)

2. Moved from .chars().enumerate() to .as_bytes() and manually tracking current position with i/j variables. (Improvement from 176,xyz ns down to 140,xyz ns)

3. Moved to the standard benchmark suite from main + handrolled benchmark system.(File read + load + parse into lines was kept out of benchmark)

4. Replaced hand rolled min with std::cmp::min. (improvement from 140,xyz down to 139,xyz but the std deviation was about the same. So Could just be a fluke. Don't know)

In Go version, I made three changes.

1. Doing the same thing from #1 in Rust actually increased the runtime from 190,xyz to 232,xyz and quite consistently too. I ran it 10+ times to confirm)

2. Replaced []rune(source), []rune(target) to []byte(source), []byte(target). (Improvement from 214817ns to 190152 ns)

3. Replaced hand rolled bench mark system with a proper bench mark system in Go. (Again, File read + load + parse into lines was kept out of benchmark)

So, At the end of it, Rust version was about 50k ns faster than Go version.

Edit #1:

In rust version, I had also replaced the cache initialization to (0..=target.len()).collect() before doing anything els.. This also gave a good perf boost but I forgot to note down the exact value.

blablabla1236y ago

I'd be really surprised to hear that Go is supposed to be faster than Rust. Of course Rust is a bit newer but to me it always sounded like Go is fast because it's static but it doesn't have to be high-speed if that would sacrifice conciseness. Given that this is an artifical example, this here looks more realistic: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

arcticbull6y ago

Rust and Go are contemporaries. Rust started in 2006 at Mozilla, and the first Go public release from Google was in 2009, meaning it probably started at the same time.

Except of course all the Plan 9 garbage (like Go's hand-rolled assembler) brought in to underpin Go from the 80s ;)

2 more replies

burntsushi6y ago

Note that using bytes is a fundamentally different implementation that will produce different results on non-ASCII input. Using codepoints (or "runes") will better approximate edit distance based on visual characters. (And grapheme clusters would be even better. Although one could put the text in composed normal form to get more mileage out of the rune based algorithm.)

tromp6y ago· 3 in thread

Missed chance to shorten title to "Making Rust Go Fast" :-)

katktv6y ago

Making Rust As Fast As It Can Go

codegladiator6y ago

Making the code rust as fast as it can go

kreetx6y ago

Go Rust!

1 more reply

virtualritz6y ago· 2 in thread

I tried this on a spare time project[1]. Runtime in a quick test went down from 14.5 to 12.2 secs on macOS!

So a solid ~15% by changing the allocator to jemalloc.

However, I now have a segfault w/o a stack trace when the data gets written at the end of the process.

Possibly something fishy in some `unsafe{}` code of a dependent crate of mine that the different allocator exposed. :]

Still – no stack trace at all is very strange in Rust when one runs a debug build with RUST_BACKTRACE=full.

[1] https://github.com/virtualritz/rust-diffusion-limited-aggreg...

saagarjha6y ago

I have found that jemallocator is currently broken on macOS Catalina, so that might be the problem. If you can reproduce this issue reliably, I'd love to hear about it because I can't myself unless I use a very specific toolchain that produces -O3 binaries that are a real pain to work with.

virtualritz6y ago

It's 100% reproducible. Just check out the previous to last commit on master on the github repo I linked to and run the tool with any command that invokes the nsi crate.

Eg.:

   > rdla dump foo.nsi

should produce the segfault before exiting the process.

Is there a jemallocator ticked where to attach a report for this?

1 more reply

goranmoomin6y ago· 2 in thread

TLDR for people who didn't read:

The speed difference came from the allocator.

Rust switched from jemalloc to the system allocator per ticket #36963[0] for various reasons (like binary bloat, valgrind incompatibility, etc...).

Go uses a custom allocator[1] instead.

To make 'Rust Go fast' (pun intended), one can use the '#[global_allocator]' to use a custom allocator (in this case, with the jemallocator crate) to make allocations fast again.

[0]: https://github.com/rust-lang/rust/issues/36963

[1]: https://golang.org/src/runtime/malloc.go

k__6y ago

The comments of Rust programmers here also suggest that the Rust implementation is, indeed, different from the Go implementation.

pcr9103036y ago

It was just a summary of the post contents - the post suggests that the biggest difference comes from the allocator.

maoeurk6y ago· 2 in thread

Assuming this was run on a 64bit system, the Rust version seems to be allocating and zeroing twice as much memory as the Go version.

edit: this has been pointed out as incorrect, Go ints are 8 bytes on 64bit systems -- thanks for the correction!

  let mut cache: Vec<usize> = (0..=target.chars().count()).collect();

which can be simplified as

  let mut cache: Vec<usize> = vec![0; target.len()];

  cache := make([]int, len(target)+1)
  for i := 0; i < len(target)+1; i++ {
    cache[i] = i
  }

Rust usize being 8 bytes and Go int being 4 bytes as I understand it.

So between doing more work and worse cache usage, it wouldn't be surprising if the Rust version was slower even with the faster allocator.

rossmohax6y ago

Go int is 8 bytes

eis6y ago

It can be either depending on the system.

https://golang.org/ref/spec#Numeric_types

mqus6y ago· 1 in thread

Is it intentional that the benchmarks include not only running the program itself but also compiling it? e.g. in the linked source code, the bench.sh includes the compilation step which is known to be slow in rust:

    #!/usr/bin/env bash

    set -e
    run() {
      cargo build --release 2> /dev/null
      ./target/release/rust
    }

    run;

Sure, if you run it many times in succession the compiler won't do much but the benchmarking script (run.js) doesn't really indicate that and the blog post also doesn't mention that.

EDIT: I was just being stupid, don't mind me. The times were taken within each language and not externally.

chrfrascoOP6y ago

run.js is not doing the benchmarking. If you look at the source for each of the programs being benchmarked, you'll see that the programs themselves are responsible for benchmarking

hu36y ago· 1 in thread

I'm surprised that a naive implementation in Go can outperform a naive implementation in Rust.

empath756y ago

I’m not. Hell, when I first started learning rust i frequently wrote code that ran slower than _python_.

savaki6y ago· 1 in thread

This discussion seems to me like a microcosm of the differences in philosophies between Rust and Go.

With Rust, you have much more control, but you also need a deep understanding of the language to get the most out of it. With Go, the way you think it should work is usually is Good Enough™.

jeffdavis6y ago

I wouldn't put it that way. Both languages are fast at nice straight-line code.

The main area I'd expect to see performance benefits for rust (though I don't have experience here) is larger rust programs. Rust's zero-cost abstractions have more benefits as the abstractions nest more deeply. For a small program, you don't really have a lot of abstractions, so Go will do just fine.

I think Go has a number of nice performance tricks up it's sleeve, though, so I wouldn't rule out Go on performance grounds too quickly.

alvarelle6y ago

Could also try to use the smallvec crate in this case, which put small allocation on the stack https://docs.rs/smallvec/

arcticbull6y ago

There's a bunch of issues with the Rust implementation, not least that the initial condition where source or target lengths are zero, it returns the number of UTF-8 bytes of the other, but all other computations are performed in terms of Unicode chars -- except at the end: `cache[target.len()]` which will return the wrong value if any non-ASCII characters precede it.

Further, each time you call `.chars().count()` the entire string is re-enumerated at Unicode character boundaries, which is O(n) and hardly cheap, hence wrapping it in an iterator over char view.

Also, re-implementing std::cmp::min at the bottom there may well lead to a missed optimization.

Anyways, I cleaned it up here in case the author is curious: https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea...

submeta6y ago

Impressed to see four posts about Rust on the front page of HN simultaneously.

1 more reply

anderskaseorg6y ago

I’ve found that Microsoft’s mimalloc (available for Rust via https://crates.io/crates/mimalloc) typically provides even better performance than jemalloc. Of course, allocator performance can vary a lot depending on workload, which is why it’s good to have options.

novocaine6y ago

It may be that the system allocator is making an excessive number of syscalls to do work, whereas most custom allocators will allocate in slabs to avoid this. You could try using dtruss or strace to compare the differences.

savaki6y ago

A few folks have commented that there were logic errors in the Go version. Specifically that

  len("föö") = 5

should instead have returned

  len("föö") = 3

I submitted a pull request, https://github.com/christianscott/levenshtein-distance-bench..., that fixes these issues in the Go implementation.

Interestingly enough, when I re-ran the benchmark, the Go version is roughly 19% faster than it was previously:

  old: 1.747889s
  new: 1.409262s (-19.3%)

fortran776y ago

loeg6y ago

FreeBSD's system allocator is jemalloc :-).

j / k navigate · click thread line to collapse

200 comments

87 comments · 21 top-level

codeflo6y ago· 16 in thread

ekidd6y ago

majewsky6y ago

Source: https://golang.org/doc/go1.11

masklinn6y ago

> I don't know about the MacOS.

MacOS literally forbids statically linking to libSystem.

Go finally had to bow down and accept that they just could not perform raw syscalls on MacOS after gettimeofday (IIRC) changed ABI multiple times during the Sierra beta.

pansa26y ago

AFAIK on Windows, the hierarchy is:

C library => kernel32.dll => ntdll.dll => system calls

You don’t have to go via the C library - calling kernel32 directly is fine (I believe this is what Go does). However, it’s very rare to call ntdll or to make system calls directly.

4 more replies

codeflo6y ago

1 more reply

fluffything6y ago

> But I’ve also come to the conclusion that Rust relies on libc way too much.

How did you come to this conclusion?

People using Rust rely on libc a lot.

For example, #![no_std] means "no standard-library", but it doesn't mean "no libc, no libunwind, etc.".

So... if you are trying to build a binary that does not depend on libc or libstd, then there is no language feature to help you there.

#[no_std] binaries are not only unstable, but also probably not what you want, since that won't prevent you from linking any library that links libc, or the standard library, etc.

geofft6y ago

2 more replies

codeflo6y ago

2 more replies

cesarb6y ago

> [...] relies on libc way too much. That might be fine on Linux [...]

> My understanding is that Go doesn’t use the libc at all and makes system calls directly

steveklabnik6y ago

codeflo6y ago

Edit: Not even the DLL name of Microsoft’s libc is stable (msvcrt140.dll etc.), leading to all kinds of wild goose chases when trying to run old binaries.

1 more reply

rhinoceraptor6y ago

Go’s pathological NIH syndrome does come with downsides. For example, there was an infamous memory corruption bug in the way they called into vDSO.

adev_6y ago

> Go’s pathological NIH syndrome does come with downsides.

Yes. And you can add the inability to use the glibc's nss modules under Linux.

Making it unable to use sssd properly and authenticate a posix user on a machine with LDAP authentication.

Getting completely independent from OS sys lib has consequences

1 more reply

weff_6y ago

I think Go tries to reduce its dependence on libc but, by default, it will still link to it.

For instance, this code:

  package main
  import "net"
  func main(){
    net.Dial("tcp", "golang.org:80")
  }

When compiled with go build main.go does link:

  linux-vdso.so.1 (0x00007ffe3d7f0000) 
  libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc7ac05a000)
  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc7abc69000)
  /lib64/ld-linux-x86-64.so.2 (0x00007fc7ac279000)

There are of course compiler options to truly statically compile.

fortran776y ago

This is simply false. Windows APIs are stable. It's one of the best platforms for back compatibility. The kernel32/user32 API has been stable for 20 years. Rust has some work to do.

codeflo6y ago

1 more reply

matklad6y ago· 10 in thread

Note that Rust and Go programs differ in a seemingly insignificant, but actually important detail.

Rust does this

    next_dist = std::cmp::min(
        dist_if_substitute,
        std::cmp::min(dist_if_insert, dist_if_delete),
    );

Go does this

    nextDist = min(
         distIfDelete, 
         min(distIfInsert, distIfSubstitute)
    )

I haven't investigated this, but I would guess that this is the same effect I've observed in

* https://matklad.github.io/2017/03/12/min-of-three.html

* https://matklad.github.io/2017/03/18/min-of-three-part-2.htm...

(reposting my comment from reddit, as it's a rather unexpected observation)

ttd6y ago

ishanjain286y ago

I think they made the rust version same as Go because I cloned it just now and they are both the same. Also, Thank you soo much for the blog posts! :)

_____smurf_____6y ago

Given this information, and for general parsing functionalities, which one is faster, Go or Rust?

arcticbull6y ago

But again, depends what you're doing with the output, and if these deltas even matter in your context.

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

fluffything6y ago

You can write a parser as far as the limits of your imagination allow in Rust, or C, or any other baremetal language without a thick run-time.

In Go, beyond the limits of your imagination, you'll also hit other limits, like those of the garbage collector.

drfuchs6y ago

matklad6y ago

The excessive "precision" is because I've just copy-pasted what the original bench printed.

londons_explore6y ago

Since min(a, min(b,c)) == min(b, min(a,c)), perhaps the compiler should be smart enough to swap the comparisons around if it makes it quicker?

dan-robertson6y ago

3 more replies

arcticbull6y ago

Part of the problem may be they re-implemented `std::cmp::min` at the bottom of the file, I wonder if there's a more optimized version in the stdlib.

1 more reply

molf6y ago· 10 in thread

Of course, if only ASCII strings are valid as input for this implementation then both versions will be a lot faster if they exclusively operate on bytes instead.

eis6y ago

Yep.

Here a Go playground example showing that the result is indeed wrong:

https://play.golang.org/p/vmctMFUevPc

It should output 3 but outputs 5 because each ö is two bytes, len("föö") = 5.

I would suggest using "range" to iterate over the unicode characters.

earthboundkid6y ago

If you diff föö and f though, it correctly gives an edit distance of 2.

The code is weird because someone knew enough to convert the strings to slices of runes but not enough to use the rune slices consistently. :-/

1 more reply

savaki6y ago

Looks like a small bug in the go code, corrected here. Original author should have used rune throughout. https://play.golang.org/p/mGZZMFtMgHH

steveklabnik6y ago

Yes, especially because that changes it from constant time (strings know their length in bytes) to linear time (counting chars means a loop through the text.)

I was a bit suspicious of the conclusion, but didn’t dig in myself. I imagine this would be a much larger source of difference.

maxmcd6y ago

huh, yeah if I switch the rust version to target.len() the execution time drops by more than 10%

edit: and if I switch to source.bytes().enumerate() it drops by 20% more

arcticbull6y ago

Unicode is hard, fams, and it's rare that anything that looks easy is actually what you want.

[1] https://unicode.org/reports/tr29/

[2] http://unicode.org/reports/tr15/

fluffything6y ago

> Both versions will not work with complex Unicode input unless you perform both segmentation by Grapheme Cluster [1] and utilize a consistent Normalization [2] when comparing clusters

Doing this is quite easy from Rust.

1 more reply

chrfrascoOP6y ago

Nice catch, thanks for pointing this out. I've updated the cache initialization to use `len(targetChars)` rather than `len(target)`:

    cache := make([]int, len(targetChars)+1)
    for i := 0; i < len(targetChars)+1; i++ {
        cache[i] = i
    }

AFAIK this makes them equivalent (fingers crossed). It seems to not have made much of a difference (-0.03s)

molf6y ago

They might be equivalent (perhaps apart from other issues pointed out in other comments), but both implementations are still either

1) incorrect if UTF-8-strings are supposed to be valid input, or

2) very inefficient if only ASCII-strings are supposed to be valid input.

Thaxll6y ago

https://blog.golang.org/strings

chrfrascoOP6y ago· 9 in thread

arcticbull6y ago

Secondly, you re-implemented "std::cmp::min" at the bottom of the file, and I'm not sure if the stdlib version is more optimized.

Lastly, well, you caught the issue with repeated passes over the string.

I've fixed the issues if you're curious: https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea...

[1] http://www.unicode.org/glossary/#unicode_scalar_value

arcticbull6y ago

Based on some cursory research, the go version differs in a more subtle way too. A Rune is a Code Point, which is a superset of the Rust "char" type; it includes surrogate pairs.

1 more reply

afiori6y ago

If we (correctly) rely on the media to bring to public attentions relevant facts (both criminal and non-criminal) and keep a watchful eye on the nation who then keeps a watchful eye on the media?

in that case I guess part of the problem is that the job of the police can be miscontrued as "arresting people". in the same way the job of a journalist can be miscontrued as "getting clicks"

overall I don't think we can pass an a priori moral judgement on that term, as essentially represent a statement that the default safety measures have failed.

dcow6y ago

"Fake News" entered the realm (not even recently, might I add) of popular misuse. Actually, is there a term for: language/grammar incorrectly used because society has developed a familiar "meme" use?

Common examples:

* Look at this dank "meme".

meme has come to mean "a picture shared on the internet that has words on it".

* Let's [have a] "cheers".

It's a toast. You say "cheers" when you toast.

* You missed Suzie and I's party last night.

It's Suzie and my party. This one is particularly annoying because it's made it way past editors and into writing, screenplay, etc.

1 more reply

masklinn6y ago

Unless you also changed the interface of the levenshtein to hand it preallocated cache, source and destination buffers (or the caller did that).

[0] https://news.ycombinator.com/item?id=23059753

> But yes, I did benchmark this, even after reusing allocations, and I can't tell a difference. The benchmark is fairly noisy.

1 more reply

otterley6y ago

> this post is probably Fake News

It's not Fake News. Fake News is the publication of intentionally false stories. This is just erroneous.

There's a yawning chasm between the two.

arcticbull6y ago

"fake news" is an attempt at permanently damaging the American public's faith in the fifth estate. Nobody should ever use that term in the current political climate, ever.

2 more replies

will42746y ago

Says who?

1 more reply

Ar-Curunir6y ago

It’s a joke mate

1 more reply

devit6y ago· 6 in thread

Implausible benchmark results are almost always an indicator of the incompetence of the person performing the benchmark.

burntsushi6y ago

> The second problem is that at least the Rust code is decoding UTF-8 every iteration of the inner loop instead of decoding once and saving the result

> Implausible benchmark results are almost always an indicator of the incompetence of the person performing the benchmark.

Come on. We can do better than this. Please don't make it personal. We all have to learn things at some point.

masklinn6y ago

I'm really not sure that's an issue, utf8 decoding is very, very cheap and it's iterating either way.

It would have to be benched, but I wouldn't be surprised if allocating the caches (at least one allocation per line of input) had way more overhead, especially given the inputs are so very short.

I'm not going to claim Rust's utf8 decoder is the fastest around, but it's very fast.

1 more reply

chrfrascoOP6y ago

I tried this. Pulling the .chars() call out of the loop & collecting into a Vec made the performance even worse – the following balloons the runtime from ~2.7s to ~5s:

    let target_chars: Vec<char> = target.chars().collect();
    for (i, source_char) in source.chars().enumerate() {
        let mut next_dist = i + 1;

        for (j, target_char) in target_chars.iter().enumerate()  {

> written mindlessly > incompetence of the person

burntsushi6y ago

Indeed, it looks like doing the UTF-8 decoding up-front is exacerbating the performance difference in the allocator.

1 more reply

molf6y ago

> Implausible benchmark results are almost always an indicator of the incompetence of the person performing the benchmark.

An ad hominem attack surely isn't needed.

masklinn6y ago

And that's really too bad because I think the rest of the comment is spot-on as the fundamental issue is that Rust simply delegates to the system allocator, and MacOS's allocator is pretty slow.

1 more reply

ishanjain286y ago· 3 in thread

I tried to benchmark Go/Rust versions as well.

I made 4 changes in Rust version.

1. Moved up the line that gets a value from cache[j+1] before any calls are made to cache[j]. This removes 1 bound check. (Improvement from 182,747ns down to 176,xyzns +-4800)

2. Moved from .chars().enumerate() to .as_bytes() and manually tracking current position with i/j variables. (Improvement from 176,xyz ns down to 140,xyz ns)

3. Moved to the standard benchmark suite from main + handrolled benchmark system.(File read + load + parse into lines was kept out of benchmark)

4. Replaced hand rolled min with std::cmp::min. (improvement from 140,xyz down to 139,xyz but the std deviation was about the same. So Could just be a fluke. Don't know)

In Go version, I made three changes.

1. Doing the same thing from #1 in Rust actually increased the runtime from 190,xyz to 232,xyz and quite consistently too. I ran it 10+ times to confirm)

2. Replaced []rune(source), []rune(target) to []byte(source), []byte(target). (Improvement from 214817ns to 190152 ns)

3. Replaced hand rolled bench mark system with a proper bench mark system in Go. (Again, File read + load + parse into lines was kept out of benchmark)

So, At the end of it, Rust version was about 50k ns faster than Go version.

Edit #1:

In rust version, I had also replaced the cache initialization to (0..=target.len()).collect() before doing anything els.. This also gave a good perf boost but I forgot to note down the exact value.

blablabla1236y ago

arcticbull6y ago

Rust and Go are contemporaries. Rust started in 2006 at Mozilla, and the first Go public release from Google was in 2009, meaning it probably started at the same time.

Except of course all the Plan 9 garbage (like Go's hand-rolled assembler) brought in to underpin Go from the 80s ;)

2 more replies

burntsushi6y ago

tromp6y ago· 3 in thread

Missed chance to shorten title to "Making Rust Go Fast" :-)

katktv6y ago

Making Rust As Fast As It Can Go

codegladiator6y ago

Making the code rust as fast as it can go

kreetx6y ago

Go Rust!

1 more reply

virtualritz6y ago· 2 in thread

I tried this on a spare time project[1]. Runtime in a quick test went down from 14.5 to 12.2 secs on macOS!

So a solid ~15% by changing the allocator to jemalloc.

However, I now have a segfault w/o a stack trace when the data gets written at the end of the process.

Possibly something fishy in some `unsafe{}` code of a dependent crate of mine that the different allocator exposed. :]

Still – no stack trace at all is very strange in Rust when one runs a debug build with RUST_BACKTRACE=full.

[1] https://github.com/virtualritz/rust-diffusion-limited-aggreg...

saagarjha6y ago

virtualritz6y ago

It's 100% reproducible. Just check out the previous to last commit on master on the github repo I linked to and run the tool with any command that invokes the nsi crate.

Eg.:

   > rdla dump foo.nsi

should produce the segfault before exiting the process.

Is there a jemallocator ticked where to attach a report for this?

1 more reply

goranmoomin6y ago· 2 in thread

TLDR for people who didn't read:

The speed difference came from the allocator.

Rust switched from jemalloc to the system allocator per ticket #36963[0] for various reasons (like binary bloat, valgrind incompatibility, etc...).

Go uses a custom allocator[1] instead.

To make 'Rust Go fast' (pun intended), one can use the '#[global_allocator]' to use a custom allocator (in this case, with the jemallocator crate) to make allocations fast again.

[0]: https://github.com/rust-lang/rust/issues/36963

[1]: https://golang.org/src/runtime/malloc.go

k__6y ago

The comments of Rust programmers here also suggest that the Rust implementation is, indeed, different from the Go implementation.

pcr9103036y ago

It was just a summary of the post contents - the post suggests that the biggest difference comes from the allocator.

maoeurk6y ago· 2 in thread

Assuming this was run on a 64bit system, the Rust version seems to be allocating and zeroing twice as much memory as the Go version.

edit: this has been pointed out as incorrect, Go ints are 8 bytes on 64bit systems -- thanks for the correction!

  let mut cache: Vec<usize> = (0..=target.chars().count()).collect();

which can be simplified as

  let mut cache: Vec<usize> = vec![0; target.len()];

  cache := make([]int, len(target)+1)
  for i := 0; i < len(target)+1; i++ {
    cache[i] = i
  }

Rust usize being 8 bytes and Go int being 4 bytes as I understand it.

So between doing more work and worse cache usage, it wouldn't be surprising if the Rust version was slower even with the faster allocator.

rossmohax6y ago

Go int is 8 bytes

eis6y ago

It can be either depending on the system.

https://golang.org/ref/spec#Numeric_types

mqus6y ago· 1 in thread

    #!/usr/bin/env bash

    set -e
    run() {
      cargo build --release 2> /dev/null
      ./target/release/rust
    }

    run;

Sure, if you run it many times in succession the compiler won't do much but the benchmarking script (run.js) doesn't really indicate that and the blog post also doesn't mention that.

EDIT: I was just being stupid, don't mind me. The times were taken within each language and not externally.

chrfrascoOP6y ago

run.js is not doing the benchmarking. If you look at the source for each of the programs being benchmarked, you'll see that the programs themselves are responsible for benchmarking

hu36y ago· 1 in thread

I'm surprised that a naive implementation in Go can outperform a naive implementation in Rust.

empath756y ago

I’m not. Hell, when I first started learning rust i frequently wrote code that ran slower than _python_.

savaki6y ago· 1 in thread

This discussion seems to me like a microcosm of the differences in philosophies between Rust and Go.

With Rust, you have much more control, but you also need a deep understanding of the language to get the most out of it. With Go, the way you think it should work is usually is Good Enough™.

jeffdavis6y ago

I wouldn't put it that way. Both languages are fast at nice straight-line code.

I think Go has a number of nice performance tricks up it's sleeve, though, so I wouldn't rule out Go on performance grounds too quickly.

alvarelle6y ago

Could also try to use the smallvec crate in this case, which put small allocation on the stack https://docs.rs/smallvec/

arcticbull6y ago

Further, each time you call `.chars().count()` the entire string is re-enumerated at Unicode character boundaries, which is O(n) and hardly cheap, hence wrapping it in an iterator over char view.

Also, re-implementing std::cmp::min at the bottom there may well lead to a missed optimization.

Anyways, I cleaned it up here in case the author is curious: https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea...

submeta6y ago

Impressed to see four posts about Rust on the front page of HN simultaneously.

1 more reply

anderskaseorg6y ago

novocaine6y ago

savaki6y ago

A few folks have commented that there were logic errors in the Go version. Specifically that

  len("föö") = 5

should instead have returned

  len("föö") = 3

I submitted a pull request, https://github.com/christianscott/levenshtein-distance-bench..., that fixes these issues in the Go implementation.

Interestingly enough, when I re-ran the benchmark, the Go version is roughly 19% faster than it was previously:

  old: 1.747889s
  new: 1.409262s (-19.3%)

fortran776y ago

loeg6y ago

FreeBSD's system allocator is jemalloc :-).

j / k navigate · click thread line to collapse