That's an assumption that is repeated very often recently, and measured very rarely. The truth is that the number of applications for which threads don't work is surprisingly low. I work at a well-known cloud provider, and lots of people would be really surprised which applications at the largest scale work fine with a thread-per-request model. 50k OS threads are not really an issue on modern server hardware. While it might not be the most efficient option [1], it will not perform so badly that it causes an availability impact either.
There are obviously some exceptions to that [2] - but I encourage people to measure instead of making assumptions. Unless you find yourself in a weekly meeting about server efficiency or scaling cliffs, both models probably work.
[1] it really depends on the workload, but people might find an efficiency degradation (e.g. measured as BYTES_TRANSFERRED/CPU_CORES_USED) of 20% at a concurrency level of 1000, or maybe only at a concurrency level of 10k. Coarse-grained work items (e.g. send a large file to a socket) will show a lower degradation.
[2] Load balancers, CDN services, and e.g. chat applications which maintain a massive number of mostly idle client connections can be such environments. They have a high amount of concurrency that needs to be managed, but much less "active concurrency". If all clients were active at the same time, those environments would run out of disk I/O or network bandwidth far before CPU or memory became an issue.
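As a hedged mini-demonstration of the thread-per-request claim (scaled down from 50k so it runs quickly): spawning a few thousand OS threads is unremarkable on modern hardware, and each thread here stands in for one "request".

```rust
use std::thread;

// Sketch: one OS thread per "request", scaled down to 1000 threads.
// Each thread just returns its id; a real server would do request work.
fn sum_via_threads(n: u64) -> u64 {
    let handles: Vec<_> = (0..n)
        .map(|i| thread::spawn(move || i)) // one thread per "request"
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // Sum of 0..1000 is 499500; all 1000 threads spawned, ran, and joined.
    assert_eq!(sum_via_threads(1_000), 499_500);
}
```

The default stack size per thread is mostly a virtual-memory reservation, so the real memory cost is far lower than the headline number suggests.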
Performance is important, but the biggest performance gain happens when a program goes from not working to working correctly.
Debugging is another corner case: async makes it intolerably hard to get a backtrace and make sense of what is going on.
It's not like debugging threads is easy, but in a low-contention environment that is entirely "1 thread holds the state of one request", with few interlocking threads, threading is a fair bit better than async execution. Plus, logs that include thread names make it possible to draw out something like a post-processed Catapult timing diagram (open chrome://tracing and look at an example - it is a great UI for dropping in your own multi-threaded event log as JSON).
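To make the Catapult idea concrete: chrome://tracing ingests the Trace Event JSON format, so a log post-processor only needs to emit one small object per log line. A minimal sketch (the event names and ids are illustrative; `ph` is "B"/"E" for begin/end and `ts` is in microseconds, per the published format):

```rust
// Emit one Trace Event object per log line; chrome://tracing accepts a
// plain JSON array of these.
fn trace_event(name: &str, ph: char, ts_us: u64, tid: u64) -> String {
    format!(r#"{{"name":"{name}","ph":"{ph}","ts":{ts_us},"pid":1,"tid":{tid}}}"#)
}

fn main() {
    // A request handled on thread 1 between t=10us and t=250us.
    let events = [
        trace_event("handle_request", 'B', 10, 1),
        trace_event("handle_request", 'E', 250, 1),
    ];
    let json = format!("[{}]", events.join(","));
    assert!(json.contains(r#""ph":"B""#));
    println!("{json}");
}
```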
I'm a big fan of executor thread-groups and work queues, but damn does it make it hard to mentally walk through a bug when the stack traces are scattered across multiple places.
> That's an assumption that is repeated very often recently, and measured very rarely.
I would go further--there is a whole infrastructure that needs to appear when massive concurrency is involved, and that is very rarely taken into account.
For those interested in genuine massive concurrency, I encourage you to investigate Erlang. In my opinion, the language itself is just "meh", but OTP, the infrastructure around managing, upgrading, restarting, etc. processes/threads, is extremely on point.
Side note: Erlang still has the absolute best handling of binary parsing of any language ever. https://www.erlang.org/doc/programming_examples/bit_syntax.h...
I really wish the Rust people would pick something like the Erlang Bit Syntax up and integrate it with their pattern matching (probably necessitating some pattern matching language fixes) rather than the amount of effort they continue to piddle away on async/await.
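For contrast, here's what parsing a small binary header looks like in today's Rust with plain slicing - the 4-byte layout is invented for illustration. In Erlang's bit syntax the whole thing is a single pattern: `<<Version:8, Flags:8, Len:16/big, Rest/binary>> = Packet`.

```rust
// Hypothetical header: 1-byte version, 1-byte flags, 2-byte big-endian
// length, then the payload. In Rust today this is manual slicing and
// byte-order conversion rather than one declarative pattern.
fn parse_header(buf: &[u8]) -> Option<(u8, u8, u16, &[u8])> {
    if buf.len() < 4 {
        return None; // too short to contain the fixed header
    }
    let version = buf[0];
    let flags = buf[1];
    let len = u16::from_be_bytes([buf[2], buf[3]]);
    Some((version, flags, len, &buf[4..]))
}

fn main() {
    let pkt = [1u8, 0, 0, 5, 0xde, 0xad];
    let (version, flags, len, rest) = parse_header(&pkt).unwrap();
    assert_eq!((version, flags, len), (1, 0, 5));
    assert_eq!(rest, &[0xde, 0xad]);
}
```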
Re concurrency. I learned Erlang before Akka. It took me a bit but I find Akka more ergonomic. Akka will easily handle millions of actors on a single machine, too. But I always miss matching on binaries.
Another good one is protoactor for golang. That will also do a million actors no problem. Comes really close to Erlang in terms of how concise the syntax is. But again, no binary matching.
If you're writing a CRUD app, sure, do it in PHP and spin up a thread per request.
In the real world here are the kinds of problems that people at Google etc. care about when it comes to performance or scalability issues with hugely concurrent programs:
- Noisy neighbor problems from other threads messing with your TLB and L1 cache
- High cost of context switches
- Unpredictable scheduling/priority inversion in the scheduler
The first problem isn't actually made any better by using async coroutines or green threads/fibers: if you switch to another coroutine or fiber and it does something naughty (e.g. munmaps memory, which will cause a TLB shootdown), it's going to degrade performance for your unrelated coroutine/fiber.

The second and third problems can be solved in some cases by things like fibers and userspace scheduling, but this is a fairly advanced topic and "just use async" is definitely not the solution. If you're interested in learning more about how these problems are actually solved at Google, for example, I recommend [2] and [3].
[1] https://abseil.io/docs/cpp/guides/synchronization#thread-ann... [2] https://www.youtube.com/watch?v=KXuZi9aeGTw [3] https://storage.googleapis.com/pub-tools-public-publication-...
Switching between threads within the same process doesn't require a TLB or L1 cache flush. Not sure if you were implying this, just wanted to point that out.
> - High cost of context switches
Userspace schedulers (like rust's tokio) do make context switching cheaper, however, most of the context switching in the case of a web server is due to blocking I/O and the most expensive part of the switch, entering the kernel, is already accounted for by the I/O request. Kernel context switching is unlikely to be your bottleneck.
> Unpredictable scheduling/priority inversion in the scheduler
This can definitely be an issue at scale, but a general purpose async scheduler like most use is unlikely to be any better.
$ ps -eLf | grep firefox | wc -l
569
$

The perception that async Rust is where you should start for concurrent Rust, because it's built in and everyone uses it, perhaps should be revisited. I would argue that the other options are worth considering first, and that dropping down to low-level async code might be warranted when you need the performance it gives and that performance justifies the increase in development costs.
Rust isn't meant to be a language for CRUD apps (despite making inroads in this space). It's meant to be a C/C++ alternative that can work in every difficult niche where these two can, including processes that already have their own runtimes, kernel space, microcontrollers, and other situations where any overhead, or bringing custom threads with magic I/O and special stack handling, is unacceptable.
Rust's async is designed to be separate from the core language, and work on top of arbitrary runtimes. Most people use tokio, but it can also work with your custom loop on microcontrollers, or on top of another runtime, e.g. WASM + browser's event loop, or gtk-rs that can work on top of GTK's event loop.
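To make the "arbitrary runtimes" point concrete: Rust's `Future` trait only requires something that calls `poll()`, so even a busy-polling loop with a no-op waker counts as an executor. A minimal sketch (a real runtime like tokio parks the thread and wakes on I/O readiness instead of spinning):

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// The simplest possible "runtime": poll the future until it's Ready,
// using a waker that does nothing when invoked.
fn block_on<F: Future>(fut: F) -> F::Output {
    const VTABLE: RawWakerVTable = RawWakerVTable::new(|_| RAW, |_| {}, |_| {}, |_| {});
    const RAW: RawWaker = RawWaker::new(std::ptr::null(), &VTABLE);
    let waker = unsafe { Waker::from_raw(RAW) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now(); // a real executor would sleep/park here
    }
}

fn main() {
    // An async block compiles to a state machine; any poll loop can drive it.
    assert_eq!(block_on(async { 40 + 2 }), 42);
}
```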
I just think that the cultural decision in the wider ecosystem to make practically everything I/O-related async is possibly a mistake.
It's hard to wind down that existing momentum.
Node.js devs seem to be doing fine? And I would say their development is faster than most devs working on other stacks. Node.js is also a top-3 server stack and growing.
This puts limits on what can be accomplished. Starting with a more restricted set of allowed code, and then expanding it over time, can be more successful in many cases - without locking you into a perhaps more ergonomic-looking interface that needs to be coddled, with no tooling support, to avoid the "slow path". For example, in Rust: `impl Trait` used not to exist, which meant you had to use `Box<dyn Trait>` instead, which can be slower and certainly adds some verbosity. Then `impl Trait` was added and a bunch of code became representable, and soon `type Alias = impl Trait;` will be stabilized, which will allow even more code to be representable in a way that is both performant and easier to use. A language that instead says "just use `-> Trait` and the compiler will figure out what to do" would have increased users' perf without intervention, but anyone who really cares about FFI stability or wants to keep on top of heap allocations would be left out in the cold.
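A small sketch of the `Box<dyn Trait>` vs `impl Trait` progression described above (the iterator is just an arbitrary example):

```rust
// Pre-impl-Trait style: a heap allocation plus dynamic dispatch,
// but the return type is nameable and FFI/ABI-friendly.
fn evens_boxed() -> Box<dyn Iterator<Item = u32>> {
    Box::new((0u32..).filter(|n| n % 2 == 0))
}

// impl-Trait style: zero-overhead static dispatch, no allocation,
// but the concrete type stays anonymous.
fn evens_impl() -> impl Iterator<Item = u32> {
    (0u32..).filter(|n| n % 2 == 0)
}

fn main() {
    let a: Vec<u32> = evens_boxed().take(3).collect();
    let b: Vec<u32> = evens_impl().take(3).collect();
    assert_eq!(a, vec![0, 2, 4]);
    assert_eq!(a, b); // same behavior, different dispatch and allocation cost
}
```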
It is the same reason that you can complain about the complexity of the String/&str distinction in Rust[1], but avoiding lingering references to big strings in JS (effectively a memory leak) becomes much harder.
[1]: https://fasterthanli.me/articles/working-with-strings-in-rus...
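A sketch of that String/&str point: in Rust a `&str` slice borrows its owner, so the borrow checker forces you to either keep the big `String` alive or explicitly copy the small piece out - the JS-style silent retention of a large parent buffer can't happen by accident. (ASCII input assumed here, since slicing panics off a char boundary.)

```rust
// Copy the interesting prefix out so the caller is free to drop the
// big buffer; holding a &str into it instead would keep it alive.
fn keep_prefix(big: &str, n: usize) -> String {
    big[..n].to_owned()
}

fn main() {
    let big = "x".repeat(1_000_000);
    let small = keep_prefix(&big, 10);
    drop(big); // the megabyte buffer is freed; `small` owns its own 10 bytes
    assert_eq!(small, "xxxxxxxxxx");
}
```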
Requesting urls n-at-a-time took me a while (https://play.rust-lang.org/?version=stable&mode=debug&editio...). In particular rust-analyzer itself cannot figure out `buffer`'s type here.
You can consider me very intrigued by Lunatic.
At first, I noticed that the Go version was actually faster than the Rust one, and then I saw that in `reqwest` they recommend, if you're doing multiple GET requests, creating a `Client` and then using that to get better performance[1]. After changing my code, the Rust version was effectively a bit faster (not by much, to be honest, which was a bit disappointing considering Go's version was way easier to write, and I say this as a general Rust shill).
Hopefully this comment is somewhat helpful :)
[1] https://docs.rs/reqwest/latest/reqwest/#making-a-get-request
Always create a client explicitly. And also always add a timeout.
The Go http.Get() function uses a shared global client, so making a request doesn't have high initialization costs, and requests can make use of a shared connection pool.
Right, then it doesn't have to reopen the connection for each request. That's not an async thing, it's a caching thing.
My problem is more that even if I don't need massive concurrency (say in a client that only talks to a single server, in a serial manner), I'm still more or less forced into async code because that's what the ecosystem switched to. No matter whether you benefit from async or not, not using it means going against the grain and generally makes your life harder, despite threads being much better from a language-ergonomics point of view.
Async rust lets you implement different combinators on async tasks and cancel them effortlessly.
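A hedged sketch of the "cancel effortlessly" point: in Rust, cancelling a task amounts to dropping its future, and cleanup runs through `Drop`. The names below are illustrative, not from any library.

```rust
use std::cell::Cell;
use std::rc::Rc;

// A Drop guard standing in for cleanup that must run when a task is
// cancelled (closing a connection, releasing a permit, etc.).
struct Cleanup(Rc<Cell<bool>>);
impl Drop for Cleanup {
    fn drop(&mut self) {
        self.0.set(true);
    }
}

fn cancel_runs_cleanup() -> bool {
    let cleaned = Rc::new(Cell::new(false));
    let guard = Cleanup(cleaned.clone());
    let fut = async move {
        let _guard = guard; // moved into the future's captured state
        std::future::pending::<()>().await; // would wait forever if driven
    };
    assert!(!cleaned.get()); // nothing dropped yet
    drop(fut); // cancellation: captured state, including the guard, is dropped
    cleaned.get()
}

fn main() {
    assert!(cancel_runs_cleanup());
}
```

The flip side, as noted elsewhere in this thread, is that cancellation can land between any two `.await` points, so mid-operation state (like a half-written socket) is your problem to handle.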
As for performance, tokio is not exactly a zero-cost abstraction. Just run perf on a tokio program to see how big an overhead it introduces. It was claimed to be zero-cost from the start, and it has since gone through at least two major performance overhauls to back that point up. That being said, I love tokio and its ecosystem - but it's the ergonomics, not the speed, that I love. async-std was much slower for the networking use case that I had, so overall tokio is as good as it gets.
Is this really true? All the problems that are solvable in Go should be solvable in Rust too right (but not vice versa because Go is GCed)? They might not compete on every front but there definitely should be overlap in the use cases.
Rust is difficult to learn, unless you already have a lot of experience with existing low-level languages. Getting complex programs up and running with Rust is cumbersome. But the performance is excellent, you can have a high degree of confidence that your program is rock solid, and there are entire classes of security issues that don't happen in Rust. For the types of applications where Rust does well, it does very well indeed. The time investment to become a decent Rust programmer is high, but this higher barrier to entry can make your programming skills even more valuable, since there's less supply to meet the demand.
Async is hard again, taking more months to feel proficient. I have a suspicion that much of the resistance to async comes from people who have made the initial effort to feel comfortable in Rust and expect async to fit right in - but it doesn't, because it's hard too.
Threads are also hard, but in Rust they map better to existing thread models, so pre-existing skills carry over: someone skilled in threads and in Rust will be skilled in threads with Rust.
For sure, there are missing pieces of the async world like async traits, but they will come.
I know it sounds crazy. I recently dove into the area, and was pretty surprised at how many interesting building blocks there are out there. It feels like if we just combine them in the right way, we'll discover something that works a lot better.
Off the top of my head:
Google discovered a way to switch between OS threads without the syscall overhead. All it needs is to solve the memory overhead. [0]
Zig discovered a way to use monomorphization to enable colorless async/await. If someone could figure out how to make it work through polymorphism / virtual dispatch, that would be amazing. [1]
Vale discovered a possible way to make structured concurrency in a memory safe way that's easier than existing methods. [2]
Go [3] and Loom [4] show us that we can move stacks around. Loom is particularly interesting as it shows we can move the stack to its original location, a unique mechanism that could solve some other approaches' problems with pointer invalidation.
Cone is designing a unique blend of actors and async await, to enable simpler architectures. [5]
We're close to solving the problem, I can feel it.
[0] No public docs on it, but TL;DR: we tell the OS the thread is blocked, and manually switch over to it by saving/manipulating registers.
[1] https://kristoff.it/blog/zig-colorblind-async-await/
[2] https://verdagon.dev/blog/seamless-fearless-structured-concu...
[3] https://blog.cloudflare.com/how-stacks-are-handled-in-go/
[4] https://youtu.be/NV46KFV1m-4
[5] Can't find the link, but was a discussion on their server.
A lot of this stuff is intriguing from the implementation side, but where we're really lacking is in the syntax and semantic side to make concurrency "make sense" to programmers. I don't think we're close to solving that problem (for example, call/cc isn't the answer, it's the problem).
imho the issue isn't function coloring, threads, whatever. It's a compiler that defaults to async code in the calling convention and then optimization passes to de-async-ify (remove unnecessary yield points) the code at compile time. The result would be code that looks synchronous but is async where it matters (i/o).
A lot of the symptoms of the sync/async problem are caused by the explicit decoupling of sync/async APIs in source code. If you remove that and force it to be implicit internal to the language implementation, the issue goes away. It would take a lot of work to determine if that was worth it.
Basically as we've now accepted garbage collection to be an acceptable part of language implementation, one day I think we'll accept async executors to be a part of that too. We're halfway there on the impl side (Go, Java through Loom, NodeJS, etc). The other half is removing the explicit syntax for it.
Safepoints for garbage collection are somewhat similar, but for preemption one wants to interrupt threads on a timer, rather than before the collector takes over. Despite occurring very frequently (at around 100 _million_ checks per second), the time overhead is only about 2.5% or so, according to a study by Blackburn et al [0]. It appears, I think, that as long as the fast not-interrupting path is fast enough, eliminating safepoints isn't too important.
[0] Stop and Go: Understanding Yieldpoint Behaviour <https://users.cecs.anu.edu.au/~steveb/pubs/papers/yieldpoint...>
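A rough userspace analogue of such a check, sketched under the assumption that the fast not-interrupting path is a single relaxed atomic load per iteration:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// The hot loop pays one relaxed load per iteration (the cheap fast path)
// and only reacts when some external party sets the flag - the same shape
// as a compiler-inserted yieldpoint.
fn run_until_preempted(stop: Arc<AtomicBool>) -> u64 {
    let mut iters: u64 = 0;
    while !stop.load(Ordering::Relaxed) {
        iters += 1; // stand-in for real work between checks
    }
    iters
}

fn main() {
    let stop = Arc::new(AtomicBool::new(false));
    let worker = {
        let stop = stop.clone();
        thread::spawn(move || run_until_preempted(stop))
    };
    thread::sleep(Duration::from_millis(20)); // the "timer interrupt"
    stop.store(true, Ordering::Relaxed);
    assert!(worker.join().unwrap() > 0);
}
```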
Sounds like Erlang and single assignment languages.
Jokes aside, part of the problem seems to be the computer model and cpu architectures themselves.
We need something that is designed from scratch to run things concurrently.
Yes. There is current research into Algebraic Effects (see for instance https://www.microsoft.com/en-us/research/wp-content/uploads/...).
Algebraic Effects promise a return to non-colored functions, as AE can abstract over exceptions, continuations, async and other control-flow mechanisms.
Now, we want to remove threading from the concurrency story, in the hopes of getting another performance boost. This itself is the problem, because threads were giving us automatic preemption, akin to how GCs were giving us automatic memory safety. Now we have to statically determine a "good time" for the program to yield. I/O yielding is the easy part, and the reason why people are flocking to async; but we also need to support yielding for fairness reasons. Kernels can do this because they have interrupt timers; but there's no lower-overhead equivalent for userspace code that I'm aware of.
The other problems mentioned with async Rust are particular to Rust itself. The language has a policy that heap allocations only ever happen in `std`, because they want to support embedding Rust into applications where heaps don't exist. This means that futures need to be structs. Rust does support structs of indeterminate size, but barely; and there's no support for structs that can grow. Such a thing is likely unsound without a way for the compiler to check growth limits, and the memory is pinned, so we can't grow beyond a preset limit set at the start of the future[0].
Async infects everything it touches because it's a total pain to write networking library code that's preemption-agnostic. Monad<T> would fix that, but higher-kinded traits aren't a thing in Rust yet and we would need lots of language tooling (akin to `?`) to make this ergonomic to use.
There's also just the possibility that we've been engineering the wrong fix, and we should be trying to get OS threads to be as lightweight as possible rather than trying to move the entire threading system into userspace. There's no particular reason why we need 8MB stacks, other than the fact that compilers don't check stack growth themselves. (Which, BTW, is also a soundness hole in Rust as far as I know.)
[0] Go gets around this with a linked list of stacks, which adds its own overhead.
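On the stack-size point above: the multi-MB default is a reservation, not a requirement, and std already lets each thread request a much smaller stack. A sketch (64 KiB is an arbitrary choice that only works if the thread's actual usage stays below it):

```rust
use std::thread;

// Run a closure on a thread with a deliberately small stack.
// If the closure recurses deeply, it will overflow - the size must be
// chosen against the thread's real stack usage.
fn on_small_stack<T: Send + 'static>(f: impl FnOnce() -> T + Send + 'static) -> T {
    thread::Builder::new()
        .stack_size(64 * 1024) // 64 KiB instead of the multi-MB default
        .spawn(f)
        .unwrap()
        .join()
        .unwrap()
}

fn main() {
    assert_eq!(on_small_stack(|| 6 * 7), 42);
}
```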
There may still be some fracturing here, ie in the first example (but not the others, inexplicably?) `lunatic::net` vice `std::net`.
The reason why we provide `lunatic::net` and you can't just use `std::net` is that WASI (system interface for WebAssembly) still doesn't have support for sockets[0]. `lunatic::net::TcpStream` is for now just a drop in replacement for `std::net::TcpStream` and once sockets get standardised you will be able to use the standard library types instead.
I believe all of these are handled. I just cannot find sufficient documentation to understand the details of how this works.
let mut offset = 0;
while offset != number_as_bytes.len() {
let written = stream.write(&number_as_bytes[offset..]).await.unwrap();
offset += written;
}
The synchronous version would be the same without the .await, and offers stronger guarantees: either all bytes are written to the socket, or the socket errored and is dead. The async version could be cancelled in the middle of the invocation, after some segments have already been written.

Because it sounds interesting - but the hard part is that you need a combo of request/webserver to have a chance.
and then the DB side....
I'm unable to get debugger breakpoints in Async functions in Rust to actually break.
Is this a known bug with Async Rust? Or is this simply unsupported (yet)? Seems like a really broken experience currently.
Stop trying to stir shit.
No, you will benefit from parallelism/multithreading. Why only use 1 core? Multitasking as it was once called, or "async" as it is now, is fundamentally _synchronous_ because everything still happens on one core. Just that the order of execution may be a bit wonky, which technically all code already suffers from at the microscopic level with instruction reordering and out of order execution. You almost certainly don't need multitasking unless you are writing an OS for embedded.