I really like that Seq seems to have some parallelization ability built in. I spend no small amount of time in my day job doing that manually in R with RcppParallel, for loops whose iterations are totally independent.
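For anyone unfamiliar with the pattern, here's a minimal Python sketch of what "parallelizing independent loop iterations" looks like; the `gc_content` task and the read list are made-up stand-ins, not from Seq or RcppParallel:

```python
from concurrent.futures import ProcessPoolExecutor

def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a sequence (a stand-in per-iteration task)."""
    return sum(base in "GC" for base in seq) / len(seq)

reads = ["ACGT", "GGCC", "ATAT", "GCGC"]

if __name__ == "__main__":
    # Each iteration depends only on its own input, so the work can be
    # farmed out to worker processes with no coordination between them.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(gc_content, reads))
    print(results)  # [0.5, 1.0, 0.0, 1.0]
```

The appeal of having this built into the language is that you don't have to write the pool-management scaffolding yourself for every loop.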
Bioinformaticians are often educated to use a specific programming language and environment. They aren't usually looking to try other languages. For example, I support our bioinformatics group and they are basically 100% R and RStudio users. We have a single user of Python and that user is doing "typical" tensorflow stuff with images.
I've noticed this same bias towards a single language in some other academic niches, like the SAS or Stata camps in public health or psychology. I think of these languages as basically the same, but for non-CS folks the perception seems to be more like English vs. Russian.
Even more complicated, researchers may be extremely committed to a specific library in a language and suspicious of languages that don't have their favorite library available.
Any shift to new tooling for these highly-committed users will almost certainly require large and obvious benefits to gain traction.
It's really saying something when scientists think writing Python code is a pain, because Python's a pretty forgiving language, too.
My son works in poli-sci analytics and I see the same thing you describe. A group will pick a tool and flog all problems with it. Change rarely occurs. He was in the Stata camp at one university, the Tidyverse camp at MIT.
It's very weird for me. I develop and maintain a piece of software that has 3 OSes and 5 languages to wrestle with, as well as multiple "tool" technologies like Ansible, MQTT, etc., so I'm very much in a polyglot, best-tool-for-the-job environment. Observationally, from a casual POV, I see pros/cons both ways.
I am aware of only one case where a community migrated to other software. Many economists I know switched from Stata to R. Some of them later moved on to Python.
> So it appears the primary reason BioJulia code is slower than Seq code in these three benchmarks is that BioSequences.jl is doing important work for you that Seq is not doing. As scientists, we hope you value tools that spend the time and effort to validate inputs given to it rather than fail silently.
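To make the trade-off concrete, here is a hedged Python sketch of the kind of input validation being described; the function name, the `ACGTN` alphabet, and the error message are illustrative, not taken from BioSequences.jl or Seq:

```python
VALID_DNA = set("ACGTN")

def parse_dna(raw: str) -> str:
    """Check that a sequence contains only recognized bases before using it.

    Validation like this costs time on every input, which shows up in
    benchmarks, but it turns bad data into a loud error instead of
    silently propagating garbage downstream.
    """
    seq = raw.strip().upper()
    bad = set(seq) - VALID_DNA
    if bad:
        raise ValueError(f"invalid bases in input: {sorted(bad)}")
    return seq

print(parse_dna("acgtn"))   # ACGTN
# parse_dna("ACGQ") raises ValueError rather than failing silently
```

The benchmark gap being discussed is essentially the cost of checks like this on every record.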
Reminds me of the myriads of Excel catastrophes.
> Seq is able to outperform Python code by up to 160x.
So ... a reimplementation of Python that can outperform CPython by over 100x? I know literally nothing about this project, but I have to say that rings pretty false to me. Hell, even PyPy has trouble with many applications. (Plus they're claiming to outperform "equivalent" C code by 2x.)
Even if the performance claims are overblown, it's always nice to see new work on compiled languages with easy-to-read syntax. It's hard to beat Python for an education / prototyping language, so I will definitely be giving this a look.
That's why I quoted their claim that the "vast majority" of Python programs run unmodified. Even PyPy barely achieves that. To really get 100x performance over Python (and even supposedly beat C) with a compiler that works on most unmodified Python code would be an extraordinary achievement.
It seems it's better to think of this particular claim as "we made a C++ algorithm that is 2x faster than the previous SotA C++ algorithm" (with the help of a heavily optimized DSL).
Python's main advantage was that it was easier than some of its competitors (C++/Java). But that is no longer the case, with modern languages (Nim/Crystal/Julia/JavaScript) being both faster and comparably easy (or easier).
It is now coasting on its momentum, mostly due to the vast amount of (usually poorly designed) open-source libraries. That and Jupyter.
Not claiming it as any sort of reference, but you can see how it [2] may be used to solve some basic genome-sequencing tasks.
My biggest concern is that Seq sucks users into a sort of local maximum. While piping syntax is nice, and the built-in routines are handy, it's a lot less flexible than a "mainstream" programming language, simply because of the smaller community and relative paucity of libraries. BioPython[1] has been around a long long time, and I think a lot of potential users of Seq would be better suited by using a regular bioinformatics library in the language they know best.
e.g., the example of reading FASTA files in Seq:

    # iterate over everything
    for r in FASTA('genome.fa'):
        print r.name
        print r.seq

versus BioPython:

    from Bio import SeqIO

    for r in SeqIO.parse("genome.fa", "fasta"):
        print(r.id)
        print(r.seq)
It might be pretty useful as a teaching tool, but I'm skeptical of its long-term benefit to professionals. I'm not sure the ecosystem of Seq users will be large enough, y'know? Again, it's pretty impressive work, and it's come a long way. I wish the devs all the best. :)

Big enough for what? Instead of a gratuitous critique of its "benefit to professionals", maybe you could comment on the project's design choices and implementation. That would be more useful to us amateurs.
Nim could be sold as "a strongly-typed and statically-compiled high-performance Pythonic language" like Seq (although it is more than that and does not actually have being Pythonic as a goal; see https://nim-lang.org/ or https://github.com/Araq/nimconf2021/blob/main/zennim.rst).
Still, given the small size of the Nim community and the even smaller size of the genomics Nim subcommunity, I would say it is not that odd that it is not included in the benchmark. The existing Nim genomics library might not even cover the functionality required by the benchmark.
I don't expect the community will adopt other languages at a large scale. My hope, though, is that more of these algorithms move to real distributed processing systems like Spark, to take advantage of all the great ideas in systems like that. But genomics will continue to trail the leading edge by about 20 years for the foreseeable future.
The workflows I deal with generally involve moving hundreds of terabytes of storage into memory, processing it, and writing it out. Single machines (even beefy ones) tend to hit their limits (networking, max RAM, cache size, TLB, etc).
Maybe there's a better tool than Spark, I don't know; the important thing is that Spark is the most ubiquitous.
[1]https://databricks.com/wp-content/uploads/2018/08/SSE15-40-D...
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
A pitch most people doing applied bioinformatics won’t understand/appreciate.
Will this be available via conda? And how would Seq integrate with Snakemake, since that is also based on Python?
It seems to me there is a huge demand for making Python faster, whether it be via making a more optimisation friendly subset, or ideally throwing engineering talent into improving the interpreter.
V8 shows this can be done with highly dynamic JavaScript. I guess we need a big corporate sponsor, or the community to fund some positions.
It's kind of crazy how few developers are working on optimising CPython; it may even be worth it for environmental reasons.