hyperfine -N --warmup 5 test/test_fastq_record
'needletail_test/target/release/rust_parser data/fastq_test.fastq'
Benchmark 1: test/test_fastq_record
Time (mean ± σ): 1.936 s ± 0.086 s [User: 0.171 s, System: 1.386 s]
Range (min … max): 1.836 s … 2.139 s 10 runs
Benchmark 2: needletail_test/target/release/rust_parser data/fastq_test.fastq
Time (mean ± σ): 838.8 ms ± 4.4 ms [User: 578.2 ms, System: 254.3 ms]
Range (min … max): 833.7 ms … 848.2 ms 10 runs
Summary
needletail_test/target/release/rust_parser data/fastq_test.fastq ran
2.31 ± 0.10 times faster than test/test_fastq_record
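As a quick sanity check on hyperfine's summary line, the 2.31× figure is just the ratio of the two means reported above (values copied from the output; this is an illustrative sketch, not part of either benchmark):

```rust
fn main() {
    // Means from the two runs above, both in seconds.
    let parser_a = 1.936_f64;  // test/test_fastq_record: 1.936 s
    let parser_b = 0.8388_f64; // needletail rust_parser: 838.8 ms
    let ratio = parser_a / parser_b;
    // hyperfine reports "2.31 ± 0.10 times faster"; the point estimate matches.
    assert!((ratio - 2.31).abs() < 0.005);
    println!("{:.2}", ratio);
}
```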
(Edit: I built the Rust version with `cargo build --release` on Rust 1.74, and Mojo with `mojo build` on Mojo 0.7.0.)

When running on the commit & code you point to, here are my new results:
$ hyperfine -N --warmup 5 './benchmark/fast_parser data/fastq_test.fastq' './benchmark/needletail_benchmark/target/release/rust_parser data/fastq_test.fastq'
Benchmark 1: ./benchmark/fast_parser data/fastq_test.fastq
Time (mean ± σ): 675.0 ms ± 2.4 ms [User: 399.3 ms, System: 269.4 ms]
Range (min … max): 670.5 ms … 677.5 ms 10 runs
Benchmark 2: ./benchmark/needletail_benchmark/target/release/rust_parser data/fastq_test.fastq
Time (mean ± σ): 840.8 ms ± 3.0 ms [User: 578.0 ms, System: 257.0 ms]
Range (min … max): 837.0 ms … 847.7 ms 10 runs
Summary
./benchmark/fast_parser data/fastq_test.fastq ran
1.25 ± 0.01 times faster than ./benchmark/needletail_benchmark/target/release/rust_parser data/fastq_test.fastq
Which indeed shows your parser running about 25% faster than the needletail version.

But, for my work, C++/Fortran reign supreme. I really wish Julia had easy AOT compilation and no GC, that would be perfect, but beggars can't be choosers. I am just glad that there are alternatives to C++/Fortran now.
Rust has been great, but I have noticed something: there isn't much of a community of numerical/scientific/ML library writers in Rust. That's not a big problem, BUT, the new libraries being written by the communities in Julia/C++ have made me question the free time I have spent writing Rust code for my domain. When it comes time to get serious about heterogeneous compute, you have to drop Rust and go back to C++/CUDA, and when you try to replicate some of that C++/CUDA infrastructure for your own needs in Rust, you really feel alone! I don't like that feeling ... of constantly being "one of the few" interested in scientific/numerical code in Rust community discussions ...
Mojo seems to be betting heavily on a world where deep heterogeneous compute abilities are table stakes. The language is really a frontend for MLIR, which is very exciting to me as someone who works at the intersection of systems programming and numerical programming.
I don't feel like Mojo will cause any issues for Julia, I think that Mojo provides an alternative that complements Julia. After toiling away for years with C/C++/Fortran, I feel great about a future where I have the option of using Julia, Mojo, or Rust for my projects.
I pretty strongly disagree with the no-GC part of this. A well written GC has the same throughput (or higher) as reference counting for most applications, and the Rust approach is very cool, but a significant usability cliff for users who are domain first, CS second. A GC is a pretty good compromise for 99% of users, since it is a minor performance cost for a fairly large usability gain.
I don't find ownership models that difficult. It's things one should be thinking of anyway. I think this provides a good example of where stricter checking/an ownership model like Rust has makes it easier than languages that do not have it (in this case, C++): https://blog.dureuill.net/articles/too-dangerous-cpp/
Reference counting has its own problems. The true comparison should be with code that (mostly) doesn’t do reference counting.
Then, the claim still holds, IF you give your process enough memory. https://cse.buffalo.edu/~mhertz/gcmalloc-oopsla-2005.pdf:
“with five times as much memory, an Appel-style generational collector with a non-copying mature space matches the performance of reachability-based explicit memory management. With only three times as much memory, the collector runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.”
That paper is old and garbage collectors have improved, but I think there typically still is a factor of 2 to 3.
Would love to see a comparison between modern refcounting and modern GC, though. Static code analysis can avoid a lot of refcount updates and creation of garbage.
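To make the refcount-update cost concrete, here is a minimal sketch using Rust's `Rc` (non-atomic reference counting): every clone and drop is a runtime counter write, which is exactly the traffic that static analysis would try to elide.

```rust
use std::rc::Rc;

fn main() {
    let a = Rc::new(vec![1, 2, 3]);
    assert_eq!(Rc::strong_count(&a), 1);
    let b = Rc::clone(&a); // cloning a handle bumps the count: a runtime write
    assert_eq!(Rc::strong_count(&a), 2);
    drop(b);               // dropping a handle decrements it: another write
    assert_eq!(Rc::strong_count(&a), 1);
    println!("ok");
}
```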
No Rust or Julia on their radar.
"Mojo may be the biggest programming language advance in decades"
It is a bunch of incremental improvements to the Python like language environment.
That's no big programming language advance to me. A biggie would be something closer to Haskell, or even Rust.
That's not to say it won't be wildly more successful as it gives a lot of what people want in a number of areas all in one go.
I'd jump on board except for the vibe around the current licensing. Maybe that will change and I'll be one of those Rust people who comment 'but Rust' on every C and C++ article, except I'll be saying "but Mojo" :)
EDIT: perhaps I'm being too harsh; this was literally just announced. I'm just taken aback by the blatant marketing, as everyone else is.
As for this specific claim, it was coupled with a blog post that actually demonstrated the speedup on a specific problem. Getting several orders of magnitude speedup over plain Python is often quite easy. That's why we have numpy and pandas, after all!
* Mojo provides first-class support for AoT compilation of standalone binaries [1]. Julia provides second-class support at best.
* Mojo aims to provide first-class support for traits and a modern Rust-like memory ownership model. Julia has second-class support for traits ("Tim Holy trait trick") and uses a garbage collector.
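For readers unfamiliar with what "first-class traits" buys you, here is a minimal Rust sketch (illustrative names only) of the kind of compile-time-checked interface being referred to, as opposed to Julia's dispatch-convention-based trait trick:

```rust
// A trait is an interface the compiler enforces at the call site.
trait Area {
    fn area(&self) -> f64;
}

struct Square { side: f64 }
struct Circle { r: f64 }

impl Area for Square { fn area(&self) -> f64 { self.side * self.side } }
impl Area for Circle { fn area(&self) -> f64 { std::f64::consts::PI * self.r * self.r } }

// Generic code can require the trait; passing a type without an
// `impl Area` is a compile error, not a runtime MethodError.
fn total_area(shapes: &[&dyn Area]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

fn main() {
    let sq = Square { side: 2.0 };
    let ci = Circle { r: 1.0 };
    let total = total_area(&[&sq, &ci]);
    assert!((total - (4.0 + std::f64::consts::PI)).abs() < 1e-9);
    println!("ok");
}
```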
To be clear, I really like Julia and have been gravitating back to it over time. Julia has a very talented community and a massive head start on its package ecosystem. There are plenty of other strengths I could list as well.
But I'm still keeping my eye on Mojo. There's nothing wrong with having two powerful languages learning from each other's innovations.
[1]: https://docs.modular.com/mojo/manual/get-started/hello-world...
> Bioinformatics is like 0.1% dealing with FASTQ files and the rest is using the ecosystem of libraries for statistics and plotting. Many of them in R
Considering that, do you need AOT, memory ownership for doing plotting and statistics? I'd argue not, and that's why R and Python are so popular in Bio.
So it was comparing something that a) didn't do meaningful parsing against b) the full parsing rust implementation in a non-optimized debug build.
> needletail_benchmark folder was compiled using the command cargo build --release and ran using the following command ./target/release/<binary> <path/to/file.fq>.
Or are you talking about something else here?
I don’t think the article mentions it explicitly, but I suppose the timing is from Julia 1.10: as far as I can remember, this kind of execution time would have been impossible in Julia 1.8, even for a simple script.
Bravo, Julia devs. Bravo.
I do most DS/ML work in Python but move to R for stats, and publication-ready plots and tables (gt is really great). I often switch between them frequently, which is a hassle in the EDA and prototyping stages, especially when using notebooks. I enjoy Quarto in RStudio, but the VS Code version is not that great.
How do you make it work?
Also, after so many years using Python and R, I would love to learn a new language, even if only for just a couple of use cases. I considered Elixir for parallel processing and because it has a nice syntax, but ultimately decided against it because it can be a little slow and isn't used much in my area (sadly!). Rust seems to require too much time to get decent at it. Any recommendations? (Prolog?)
Tbf, Python's stats implementations can be garbage; the last time I checked you couldn't do multiple levels for hierarchical regression.
I use go extensively for data preprocessing. Sounds weird but it works well for highly repetitive conversion tasks like DICOM parsing, converting EKGs to numpy, etc.
Golang is a popular answer, as you can start building stuff with it fairly quickly (especially compared to Rust). Java can also be useful if you haven't learned it and find a use case (although you will hear it bemoaned as the "New COBOL", there is still a lot of work done using it).
I find Go is a great middle-ground though! And now there starts to be a few more bio-related tools and toolkits out there, including:
- https://github.com/vertgenlab/gonomics
- https://github.com/biogo/biogo
- https://github.com/pbenner/gonetics
- https://github.com/shenwei356/bio
... not to mention some really popular bio tools written in Go, like:
- https://github.com/shenwei356/seqkit
I think Go lost a bit of steam in bio after Rust started to take off, but the field is growing to such an extent, and people are also starting to realize Rust isn't the answer to everything. I.e. it is fantastic for fast tools, but as a replacement for Python for all of the various ad hoc coding in biology ... nah, not so much. That's where I think Go shines.
I'm a microbiologist though, for stuff like human RNA-Seq I understand that it's often plug and play to get a gene counts table at this point.
I'm a microbiologist too, but the kind that uses mostly off-the-shelf tools to do the taxonomic/functional assignment on metagenomes, and then stats/data science on the features. I kinda don't know what you mean by "99% of the scientifically important stuff happens before the stats and the plotting".
I mean, give me a 500x2.6x10^6 sparse matrix of gene function abundances and tell me that you've done anything scientifically meaningful. Or on the other side, let me hand you a fastq file from sequencing a poorly extracted DNA sample, and you give me the best algorithm in the world, and there's nothing scientifically meaningful that's going to come out of that.
I got my start at a NGS facility, so handling FASTQ was closer to 80% of my time, so any speedups would have been greatly appreciated.
Agreed. I know people in my department who just ran Galaxy pipelines and R scripts to make pretty plots. I was on the other side of the spectrum and needed fast parsers, so the SAM and VCF specifications were my bible.
If this is not the way to remove workflow friction, what is?
For me, the pain points are often the same as in business. Biologists with no data analysis experience want something done without understanding constraints. Requirements are often not understood and there isn’t a good plan.
Some people do indeed suffer from code being slow, and this can be solved with better tools. I work with large datasets in single-cell genomics (over a million cells), and the model takes ~12 hrs to train on an entry-level GPU. So, most of my time is spent trying to understand the results.
I’m wary of software engineers coming over to bioinformatics, because they never have the domain expertise required to make meaningful contributions, and yet many think they know everything.
Python is a juggernaut with total control of the ML space and is a huge part (even if less dominant) in modern scientific computing.
A VC has way better chances of success building solutions compatible with Python rather than replacing it.
No one will use a language that isn't free and open source.
If Mojo was free and open source (not a company product), and didn't just give out binaries with a 'trust me bro' stamp of approval, then I would have worked with it. But it's not, so I will never use it.
- Never actually runs it. Seriously.
- Wants us to know it's definitely not a real parser as compared to Needletail... then 1000 words later, "real parser" means "handles \r\n... and validates that the 1st & 3rd lines begin with @ and +... and that seq and qual lines have the same length".
- At the end, "Julia is faster!!!!" off a one-off run on their own machine, comparing it to benchmark times on the Mojo website
It reads as an elaborate way to indicate they don't like that the Mojo website says it's faster, coupled to an entry-level explanation of why it is faster, coupled to disturbingly poor attempts to benchmark without running Mojo code.
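For what it's worth, the validation rules being dismissed here are easy to state. A minimal Rust sketch (hypothetical function, not taken from any of the benchmarked parsers) of "handles \r\n, checks the @/+ line markers, checks seq/qual lengths match":

```rust
// Strip one trailing '\r' so CRLF input is handled like LF input.
fn strip_cr(s: &str) -> &str {
    s.strip_suffix('\r').unwrap_or(s)
}

// Validate one 4-line FASTQ record per the rules quoted above:
// header starts with '@', separator starts with '+',
// and the sequence and quality lines have equal length.
fn validate_record(lines: &[&str]) -> bool {
    if lines.len() != 4 {
        return false;
    }
    let hdr = strip_cr(lines[0]);
    let seq = strip_cr(lines[1]);
    let plus = strip_cr(lines[2]);
    let qual = strip_cr(lines[3]);
    hdr.starts_with('@') && plus.starts_with('+') && seq.len() == qual.len()
}

fn main() {
    assert!(validate_record(&["@r1", "ACGT", "+", "IIII"]));
    assert!(validate_record(&["@r1\r", "ACGT\r", "+\r", "IIII\r"])); // CRLF ok
    assert!(!validate_record(&["r1", "ACGT", "+", "IIII"])); // missing '@'
    assert!(!validate_record(&["@r1", "ACGT", "+", "III"])); // length mismatch
    println!("ok");
}
```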
The point is that the original blog's claim of "Mojo is faster" isn't right - it's comparing different programs. That implementation in Mojo is faster than Needletail - but that doesn't say very much, and I prove it by also beating Needletail in Julia using the same algorithm Mojo does. So it's the algorithm. Not Mojo. Not Julia.
Also, did you even read my discussion of how much a parser ought to validate? Your summary is completely missing the point.
It's just, the length : content ratio is high - all I got out of it was you don't like the Mojo speed claim & genomics parsing is text parsing*
Don't take that the wrong way, I feel bad. It's just bad for me - I'm a mobile developer, so I was way out of my domain, I've barely written Python, Julia is a complete abstraction to me outside of HN. An alternative way to think about it is, I shouldn't have expected an in-depth analysis of Mojo.
* i mean, everything is bytes parsing, but it always tickles me when I find out other domains aren't castles in the sky, speaking an alien language
For the bioinformatics part, I think something like the "Genomics data science" specialization on Coursera should be a pretty good start.
Crystal itself is a gem, but comparing it to Mojo and its relation to Python gives the wrong message. Python is by far more popular because of all the packages, so the market is way larger there.
Besides, people opting for closer-to-C speed had Rust, Go, Java, Swift, and other options to go to, all with more momentum and support, before going for a yet-unproven Ruby clone.
> Mojo is still early and not yet a Python superset, so only simple programs can be brought over as-is with no code changes. We will continue investing in this and build migration tools as the language matures.
https://docs.modular.com/mojo/faq.html#how-do-i-convert-pyth...
1. Back to the Rust vs Mojo article that kicked this off... this isn't someone who is going to use Rust.
2. Availability, portability, ease of use... These are the reasons Python is winning.
3. I am baffled that this person has to write code as part of their job, and does not know what a VM is! Note: This isn't a slight against the author, I doubt they are an isolated case. I think this is my own cognitive dissonance showing.
On the other hand, the fact that Mojo doesn't run on Windows and most Linux distros is a point in itself. And also, would the blog post really be substantially improved if I had gotten the number of milliseconds right for the Mojo implementation on my computer? Of course not. It should be clear that the implementations are incomparable, and that a similar Julia implementation is very fast which implies that the reason the original Mojo implementation allegedly beat Rust is not because Mojo is faster. It's just a different program.
Yes.
Would you talk about a book you didn't read? Or a movie you didn't see? Not on any meaningful level.
It's odd to read something that's pretty well-versed with some relatively complex CS concepts, i.e. it's not just a PhD with a blank text editor. But simultaneously, makes egregiously obvious mistakes that I wouldn't expect any college graduate to roll with.
There's a certain type, and I don't know what name to give it, especially because I certainly don't want to give it a condescending name. I call it "data scientist types" when I'm in person with someone who I trust to give me some verbal rope.
Software really feels like it ate everything and everyone. So you end up with insanely bright people who do software engineering as part of their job, but miss some pieces you expect from trad software engineering.
He benchmarks against the rust implementation, which, unless benchmarks have zero meaning, should be sufficient to get a general sense of the scale of the difference. The post is obviously not meant as the last word on this benchmark, it's meant to show that the benchmark is kinda meaningless.
>Then you conclude with "the language I use is faster!!!"
If this is your take-home from the post, it's pretty clear you didn't read it, or your reading comprehension needs some work. That sentence was obviously facetious, poking a little fun at the author of the original piece.