Rust intentionally provides the simplest possible growable string buffer String, which is literally (under the hood, you can't poke this legitimately) Vec<u8> plus the promise that this is UTF-8 text.
But you might find your needs better served by one (or several) of:
Box<str> -- you don't need capacity, so, don't store it => length == capacity
CompactString -- use the entire 24 bytes for SSO, up to 24 bytes of UTF-8 inline, obviously doesn't make sense if all or the vast majority of your strings are 25 bytes or longer
ColdString -- same idea but for 8 bytes, and also not storing capacity, this only makes sense over Box<str> if you have plenty of <= 8 byte strings
Atoms: Each string can be referenced with a single u32 or even u16, and they're inherently deduplicated.
Bump allocator: your strings are &str, allocation is super fast with limited fragmentation.
Single pointer strings (this has a name, I can't think of it right now): you store the length inside the allocation instead of in each reference, so your strings are a single pointer.
First, on the heap we have a self-indicating length prefix, basically we use the bottom 7 bits of each byte to indicate 7 bits of length and the top bit indicates there are more bits in the next byte. So "ben-schaaf" would be 0x0A then the ASCII for "ben-schaaf"
But, we avoid even having a heap allocation if we have 8 or fewer UTF-8 bytes to encode the text, that's our Small String Optimisation.
To pull this off we specify that our heap allocations will have 4 byte alignment even though they don't need it. This shouldn't be a problem, in fact many allocators never actually deliver smaller alignments anyway.
This means our pointer now has two spare bits, the least significant bits are now always zero for a valid heap pointer. We rotate these bits to the top of the first byte (this varies depending on whether the target is big-endian or little-endian) and we mask them so that for these valid pointers they are 0b10xxxxxx
So, now we can look at the "single pointer" and figure out
If it begins 0b10xxxxxx it really will be a valid pointer, rotate it, mask out that flag bit and dereference the pointer to find the length-prefixed text.
If it begins 0b11111AAA there's a short string here but it didn't need all 8 bytes, the next AAA bytes of the "pointer" are just UTF-8 and conveniently AAA is enough binary for 0 through 7 to be signalled, the exact length we have
If it has any other value the entire 8 bytes of "pointer" is a UTF-8 encoded string
But it also means the CPU has to follow the pointer (and potentially get a cache miss or pipeline stall) to find the length. Having a fat pointer of ptr+length makes a lot of sense for string views, and for owned string buffers with capacity it can mean avoiding a cache miss when appending to the buffer.
It's complicated in other words.
You can build an ad-hoc bump allocator by using a String and indexing into it. You can't use &str references though, as a growing String may reallocate elsewhere and invalidate your references (Rust won't even let you try this), so you have to use your own indices. This is the same thing that bump allocator libraries usually do, too. It can be tricky but have great performance gains.
I recently 100x-d the speed of an XML/HTML builder I use internally by rewriting it to only have one thing on the heap, a single String. Every push happens right at the call site linearly, and by passing data through closures the formatting (indentation, etc.) is controllable. My first iteration was written in the least efficient way possible and had thousands of tiny allocations in nested heap objects, it was painfully slow.
Yes. It is exactly how they are described.
https://docs.rs/string_cache/latest/string_cache/struct.Atom...
> Represents a string that has been interned.
These aren't really optimizations. They are specialized implementations that introduce design and architectural tradeoffs.
For example, Rust's Atom represents a string that has been interned, and it's actually an implementation of a design pattern popular in the likes of Erlang/Elixir. This is essentially a specialized implementations of the old Flyweight design pattern, where managing N independent instances of an expensive read-only object is replaced with a singleton instance that's referenced through a key handle.
I would hardly call this an optimization. It actually represents a significant change to a system's architecture. You have to introduce a set of significant architectural constraints into your system to leverage a specific tradeoff. This isn't just a tweak that makes everything run magically leaner and faster.
In my opinion, there's no magic in the software engineering. Everything (or almost everything) is a system that can be described, explained, modified and so on. Applications, libraries, operating systems, kernels, CPUs/RAM/GPU/NPU/xPU/whatever silicon there is, ALUs/etc, transistors, electricity, physics... That's nowhere near "magic". There's always some trade-offs, it's just that you may not be aware of them initially.
`String::as_vec_mut` kinda implies that, since it gives you access to that underlying `Vec` which must then exist somewhere.
In case anyone else was wondering it, yes, it's "unsafe".
Commonly as... conversions are actually no-ops at runtime (the type changes but the data does not, no CPU instructions are emitted) whereas to... conversions might do quite a lot, especially if they bring into existence an actual thing at runtime -- maybe Goose::to_donkey actually needs to go allocate memory for a Donkey and destroy the Goose.
Yes it's unsafe because the Vec doesn't enforce the promise we made about this being UTF-8 text whereas String did, so now that promise is ours to keep and `unsafe` is how we signify that you the programmer took on the responsibility for safety here.
Option<Box<SmithyTraits> -> SmithyTrait^?I think I agree though, especially with Option. Swift’s option syntax (and kotlin’s which is similar) is so much better, a simple question mark in the type. Options are important enough that dedicated syntax makes so much sense. Rust blew their chance here with ? meaning “maybe early return”, it would have been a lot more useful as an Option indicator.
Note that both Option and Result implement that same trait.
Perhaps if try blocks ever become a thing... we can finally use it for our own types ;)