hyperfine -N --warmup 5 test/test_fastq_record
'needletail_test/target/release/rust_parser data/fastq_test.fastq'
Benchmark 1: test/test_fastq_record
Time (mean ± σ): 1.936 s ± 0.086 s [User: 0.171 s, System: 1.386 s]
Range (min … max): 1.836 s … 2.139 s 10 runs
Benchmark 2: needletail_test/target/release/rust_parser data/fastq_test.fastq
Time (mean ± σ): 838.8 ms ± 4.4 ms [User: 578.2 ms, System: 254.3 ms]
Range (min … max): 833.7 ms … 848.2 ms 10 runs
Summary
needletail_test/target/release/rust_parser data/fastq_test.fastq ran
2.31 ± 0.10 times faster than test/test_fastq_record
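As a quick sanity check on hyperfine's summary line, the 2.31× figure is just the ratio of the two means reported above (values copied from the output; this is an illustrative sketch, not part of either benchmark):

```rust
fn main() {
    // Means from the two runs above, both in seconds.
    let parser_a = 1.936_f64;  // test/test_fastq_record: 1.936 s
    let parser_b = 0.8388_f64; // needletail rust_parser: 838.8 ms
    let ratio = parser_a / parser_b;
    // hyperfine reports "2.31 ± 0.10 times faster"; the point estimate matches.
    assert!((ratio - 2.31).abs() < 0.005);
    println!("{:.2}", ratio);
}
```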
(Edit: I built the Rust version with `cargo build --release` on Rust 1.74, and Mojo with `mojo build` on Mojo 0.7.0.)

When running on the commit & code you point to, here are my new results:
$ hyperfine -N --warmup 5 './benchmark/fast_parser data/fastq_test.fastq' './benchmark/needletail_benchmark/target/release/rust_parser data/fastq_test.fastq'
Benchmark 1: ./benchmark/fast_parser data/fastq_test.fastq
Time (mean ± σ): 675.0 ms ± 2.4 ms [User: 399.3 ms, System: 269.4 ms]
Range (min … max): 670.5 ms … 677.5 ms 10 runs
Benchmark 2: ./benchmark/needletail_benchmark/target/release/rust_parser data/fastq_test.fastq
Time (mean ± σ): 840.8 ms ± 3.0 ms [User: 578.0 ms, System: 257.0 ms]
Range (min … max): 837.0 ms … 847.7 ms 10 runs
Summary
./benchmark/fast_parser data/fastq_test.fastq ran
1.25 ± 0.01 times faster than ./benchmark/needletail_benchmark/target/release/rust_parser data/fastq_test.fastq
Which indeed shows your parser running about 25% faster than the needletail version.

But, for my work, C++/Fortran reign supreme. I really wish Julia had easy AOT compilation and no GC, that would be perfect, but beggars can't be choosers. I am just glad that there are alternatives to C++/Fortran now.
Rust has been great, but I have noticed something: there isn't much of a community of numerical/scientific/ML library writers in Rust. That's not a big problem, BUT, the new libraries being written by the communities in Julia/C++ have made me question the free time I have spent writing Rust code for my domain. When it comes time to get serious about heterogeneous compute, you have to drop Rust and go back to C++/CUDA, and when you try to replicate some of that C++/CUDA infrastructure for your own needs in Rust, you really feel alone! I don't like that feeling ... of constantly being "one of the few" interested in scientific/numerical code in Rust community discussions ...
Mojo seems to be betting heavily on a world where deep heterogeneous compute abilities are table stakes. The language is really a frontend for MLIR, which is very exciting to me as someone who works at the intersection of systems programming and numerical programming.
I don't feel like Mojo will cause any issues for Julia, I think that Mojo provides an alternative that complements Julia. After toiling away for years with C/C++/Fortran, I feel great about a future where I have the option of using Julia, Mojo, or Rust for my projects.
I pretty strongly disagree with the no-GC part of this. A well written GC has the same throughput (or higher) as reference counting for most applications, and the Rust approach is very cool, but a significant usability cliff for users who are domain first, CS second. A GC is a pretty good compromise for 99% of users, since it is a minor performance cost for a fairly large usability gain.
I don't find ownership models that difficult. It's things one should be thinking of anyway. I think this provides a good example of where stricter checking/an ownership model like Rust has makes it easier than languages that do not have it (in this case, C++): https://blog.dureuill.net/articles/too-dangerous-cpp/
Reference counting has its own problems. The true comparison should be with code that (mostly) doesn’t do reference counting.
Then, the claim still holds, IF you give your process enough memory. https://cse.buffalo.edu/~mhertz/gcmalloc-oopsla-2005.pdf:
“with five times as much memory, an Appel-style generational collector with a non-copying mature space matches the performance of reachability-based explicit memory management. With only three times as much memory, the collector runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.”
That paper is old and garbage collectors have improved, but I think there typically still is a factor of 2 to 3.
Would love to see a comparison between modern refcounting and modern GC, though. Static code analysis can avoid a lot of refcount updates and creation of garbage.
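To make the refcount-update cost concrete, here is a minimal sketch using Rust's `Rc` (non-atomic reference counting): every clone and drop is a runtime counter write, which is exactly the traffic that static analysis would try to elide.

```rust
use std::rc::Rc;

fn main() {
    let a = Rc::new(vec![1, 2, 3]);
    assert_eq!(Rc::strong_count(&a), 1);
    let b = Rc::clone(&a); // cloning a handle bumps the count: a runtime write
    assert_eq!(Rc::strong_count(&a), 2);
    drop(b);               // dropping a handle decrements it: another write
    assert_eq!(Rc::strong_count(&a), 1);
    println!("ok");
}
```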
No Rust or Julia on their radar.
"Mojo may be the biggest programming language advance in decades"
It is a bunch of incremental improvements to the Python like language environment.
That's no big programming language advance to me. A biggie would be something closer to Haskell, or even Rust.
That's not to say it won't be wildly more successful as it gives a lot of what people want in a number of areas all in one go.
I'd jump on board except for the vibe around the current licensing. Maybe that will change and I'll be one of those Rust people who comment 'but Rust' on every C and C++ article, except I'll be saying "but Mojo" :)
EDIT: perhaps I'm being too harsh; this was literally just announced. I'm just taken aback by the blatant marketing, as everyone else is.
As for this specific claim, it was coupled with a blog post that actually demonstrated the speedup on a specific problem. Getting several orders of magnitude speedup over plain Python is often quite easy. That's why we have numpy and pandas, after all!
* Mojo provides first-class support for AoT compilation of standalone binaries [1]. Julia provides second-class support at best.
* Mojo aims to provide first-class support for traits and a modern Rust-like memory ownership model. Julia has second-class support for traits ("Tim Holy trait trick") and uses a garbage collector.
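For readers unfamiliar with what "first-class traits" buys you, here is a minimal Rust sketch (illustrative names only) of the kind of compile-time-checked interface being referred to, as opposed to Julia's dispatch-convention-based trait trick:

```rust
// A trait is an interface the compiler enforces at the call site.
trait Area {
    fn area(&self) -> f64;
}

struct Square { side: f64 }
struct Circle { r: f64 }

impl Area for Square { fn area(&self) -> f64 { self.side * self.side } }
impl Area for Circle { fn area(&self) -> f64 { std::f64::consts::PI * self.r * self.r } }

// Generic code can require the trait; passing a type without an
// `impl Area` is a compile error, not a runtime MethodError.
fn total_area(shapes: &[&dyn Area]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

fn main() {
    let sq = Square { side: 2.0 };
    let ci = Circle { r: 1.0 };
    let total = total_area(&[&sq, &ci]);
    assert!((total - (4.0 + std::f64::consts::PI)).abs() < 1e-9);
    println!("ok");
}
```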
To be clear, I really like Julia and have been gravitating back to it over time. Julia has a very talented community and a massive head start on its package ecosystem. There are plenty of other strengths I could list as well.
But I'm still keeping my eye on Mojo. There's nothing wrong with having two powerful languages learning from each other's innovations.
[1]: https://docs.modular.com/mojo/manual/get-started/hello-world...
> Bioinformatics is like 0.1% dealing with FASTQ files and the rest is using the ecosystem of libraries for statistics and plotting. Many of them in R
Considering that, do you need AOT, memory ownership for doing plotting and statistics? I'd argue not, and that's why R and Python are so popular in Bio.
So it was comparing something that a) didn't do meaningful parsing against b) the full parsing rust implementation in a non-optimized debug build.
> needletail_benchmark folder was compiled using the command cargo build --release and ran using the following command ./target/release/<binary> <path/to/file.fq>.
Or are you talking about something else here?
I don’t think the article mentions it explicitly, but I suppose the timing is from Julia 1.10: as far as I can remember, this kind of execution time would have been impossible in Julia 1.8, even for a simple script.
Bravo, Julia devs. Bravo.
I do most DS/ML work in Python but move to R for stats, and publication-ready plots and tables (gt is really great). I often switch between them frequently, which is a hassle in the EDA and prototyping stages, especially when using notebooks. I enjoy Quarto in RStudio, but the VS Code version is not that great.
How do you make it work?
Also, after so many years using Python and R, I would love to learn a new language, even if only for just a couple of use cases. I considered Elixir for parallel processing and because it has a nice syntax, but ultimately decided against it because it can be a little slow and isn't used much in my area (sadly!). Rust seems to require too much time to get decent at it. Any recommendations? (Prolog?)
Tbf, Python's stats implementations can be garbage; the last time I checked you couldn't do multiple levels for hierarchical regression.
I use go extensively for data preprocessing. Sounds weird but it works well for highly repetitive conversion tasks like DICOM parsing, converting EKGs to numpy, etc.
Golang is a popular answer, as you can start building stuff with it fairly quickly (especially compared to Rust). Java can also be useful if you haven't learned it and find a use case (although you will hear it bemoaned as the "New COBOL", there is still a lot of work done using it).
I find Go is a great middle-ground though! And now there starts to be a few more bio-related tools and toolkits out there, including:
- https://github.com/vertgenlab/gonomics
- https://github.com/biogo/biogo
- https://github.com/pbenner/gonetics
- https://github.com/shenwei356/bio
... not to mention some really popular bio tools written in Go, like:
- https://github.com/shenwei356/seqkit
I think Go lost a bit of steam in bio after Rust started to take off, but the field is growing to such an extent, and people are also starting to realize Rust isn't the answer to everything. I.e. it is fantastic for fast tools, but as a replacement for Python for all of the various ad hoc coding in biology ... nah, not so much. That's where I think Go shines.
I'm a microbiologist though, for stuff like human RNA-Seq I understand that it's often plug and play to get a gene counts table at this point.
I'm a microbiologist too, but the kind that uses mostly off-the-shelf tools to do the taxonomic/functional assignment on metagenomes, and then stats/data science on the features. I kinda don't know what you mean by "99% of the scientifically important stuff happens before the stats and the plotting".
I mean, give me a 500x2.6x10^6 sparse matrix of gene function abundances and tell me that you've done anything scientifically meaningful. Or on the other side, let me hand you a fastq file from sequencing a poorly extracted DNA sample, and you give me the best algorithm in the world, and there's nothing scientifically meaningful that's going to come out of that.
I got my start at a NGS facility, so handling FASTQ was closer to 80% of my time, so any speedups would have been greatly appreciated.
Agreed. I know people in my department who just ran Galaxy pipelines and R scripts to make pretty plots. I was on the other side of the spectrum and needed fast parsers, so the SAM and VCF specifications were my bible.
If this is not the way to remove workflow friction, what is?
For me, the pain points are often the same as in business. Biologists with no data analysis experience want something done without understanding constraints. Requirements are often not understood and there isn’t a good plan.
Some people do indeed suffer from code being slow, and this can be solved with better tools. I work with large datasets in single-cell genomics (over a million cells), and the model takes ~12 hrs to train on an entry-level GPU. So, most of my time is spent trying to understand the results.
I’m wary of software engineers coming over to bioinformatics, because they never have the domain expertise required to make meaningful contributions, and yet many think they know everything.
Python is a juggernaut with total control of the ML space and is a huge part (even if less dominant) in modern scientific computing.
A VC has way better chances of success building solutions compatible with Python rather than replacing it.
No one will use a language that isn't free and open source.
If Mojo was free and open source (not a company product), and didn't just give out binaries with a 'trust me bro' stamp of approval, then I would have worked with it. But it's not, so I will never use it.
- Never actually runs it. Seriously.
- Wants us to know it's definitely not a real parser as compared to Needletail... then 1000 words later, "real parser" means "handles \r\n... and validates that the 1st & 3rd lines begin with @ and +... and that seq and qual lines have the same length".
- At the end, "Julia is faster!!!!" off a one-off run on their own machine, comparing it to benchmark times on the Mojo website
It reads as an elaborate way to indicate they don't like that the Mojo website says it's faster, coupled to an entry-level explanation of why it is faster, coupled to disturbingly poor attempts to benchmark without running Mojo code.
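For what it's worth, the validation rules being dismissed here are easy to state. A minimal Rust sketch (hypothetical function, not taken from any of the benchmarked parsers) of "handles \r\n, checks the @/+ line markers, checks seq/qual lengths match":

```rust
// Strip one trailing '\r' so CRLF input is handled like LF input.
fn strip_cr(s: &str) -> &str {
    s.strip_suffix('\r').unwrap_or(s)
}

// Validate one 4-line FASTQ record per the rules quoted above:
// header starts with '@', separator starts with '+',
// and the sequence and quality lines have equal length.
fn validate_record(lines: &[&str]) -> bool {
    if lines.len() != 4 {
        return false;
    }
    let hdr = strip_cr(lines[0]);
    let seq = strip_cr(lines[1]);
    let plus = strip_cr(lines[2]);
    let qual = strip_cr(lines[3]);
    hdr.starts_with('@') && plus.starts_with('+') && seq.len() == qual.len()
}

fn main() {
    assert!(validate_record(&["@r1", "ACGT", "+", "IIII"]));
    assert!(validate_record(&["@r1\r", "ACGT\r", "+\r", "IIII\r"])); // CRLF ok
    assert!(!validate_record(&["r1", "ACGT", "+", "IIII"])); // missing '@'
    assert!(!validate_record(&["@r1", "ACGT", "+", "III"])); // length mismatch
    println!("ok");
}
```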
The point is that the original blog's claim of "Mojo is faster" isn't right - it's comparing different programs. That implementation in Mojo is faster than Needletail - but that doesn't say very much, and I prove it by also beating Needletail in Julia using the same algorithm Mojo does. So it's the algorithm. Not Mojo. Not Julia.
Also, did you even read my discussion of how much a parser ought to validate? Your summary is completely missing the point.
It's just, the length : content ratio is high - all I got out of it was you don't like the Mojo speed claim & genomics parsing is text parsing*
Don't take that the wrong way, I feel bad. It's just bad for me - I'm a mobile developer, so I was way out of my domain, I've barely written Python, Julia is a complete abstraction to me outside of HN. An alternative way to think about it is, I shouldn't have expected an in-depth analysis of Mojo.
* i mean, everything is bytes parsing, but it always tickles me when I find out other domains aren't castles in the sky, speaking an alien language
For the bioinformatics part, I think something like the "Genomics data science" specialization on Coursera should be a pretty good start.
Crystal itself is a gem, but comparing it to Mojo and its relation to Python gives the wrong message. Python is by far more popular because of all the packages, so the market is way larger there.
Besides, people opting for closer-to-C speed had Rust, Go, Java, Swift, and other options to go to, all with more momentum and support, before going for a yet-unproven Ruby clone.
> Mojo is still early and not yet a Python superset, so only simple programs can be brought over as-is with no code changes. We will continue investing in this and build migration tools as the language matures.
https://docs.modular.com/mojo/faq.html#how-do-i-convert-pyth...
1. Back to the Rust vs Mojo article that kicked this off... this isn't someone who is going to use Rust.
2. Availability, portability, ease of use... These are the reasons Python is winning.
3. I am baffled that this person has to write code as part of their job, and does not know what a VM is! Note: This isn't a slight against the author, I doubt they are an isolated case. I think this is my own cognitive dissonance showing.
On the other hand, the fact that Mojo doesn't run on Windows and most Linux distros is a point in itself. And also, would the blog post really be substantially improved if I had gotten the number of milliseconds right for the Mojo implementation on my computer? Of course not. It should be clear that the implementations are incomparable, and that a similar Julia implementation is very fast which implies that the reason the original Mojo implementation allegedly beat Rust is not because Mojo is faster. It's just a different program.
Yes.
Would you talk about a book you didn't read? Or a movie you didn't see? Not on any meaningful level.
It's odd to read something that's pretty well-versed with some relatively complex CS concepts, i.e. it's not just a PhD with a blank text editor. But simultaneously, makes egregiously obvious mistakes that I wouldn't expect any college graduate to roll with.
There's a certain type, and I don't know what name to give it, especially because I certainly don't want to give it a condescending name. I call it "data scientist types" when I'm in person with someone who I trust to give me some verbal rope.
Software really feels like it ate everything and everyone. So you end up with insanely bright people who do software engineering as part of their job, but miss some pieces you expect from trad software engineering.
He benchmarks against the rust implementation, which, unless benchmarks have zero meaning, should be sufficient to get a general sense of the scale of the difference. The post is obviously not meant as the last word on this benchmark, it's meant to show that the benchmark is kinda meaningless.
>Then you conclude with "the language I use is faster!!!"
If this is your take-home from the post, it's pretty clear you didn't read it, or your reading comprehension needs some work. That sentence was obviously facetious, poking a little fun at the author of the original piece.