I really like that Seq seems to have some parallelization ability built in. I spend no small amount of time in my day job doing that manually in R with RcppParallel, for loops whose iterations are totally independent.
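For anyone unfamiliar with the pattern, here's a minimal Python sketch of what "parallelizing independent loop iterations" looks like; the `gc_content` task and the read list are made-up stand-ins, not from Seq or RcppParallel:

```python
from concurrent.futures import ProcessPoolExecutor

def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a sequence (a stand-in per-iteration task)."""
    return sum(base in "GC" for base in seq) / len(seq)

reads = ["ACGT", "GGCC", "ATAT", "GCGC"]

if __name__ == "__main__":
    # Each iteration depends only on its own input, so the work can be
    # farmed out to worker processes with no coordination between them.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(gc_content, reads))
    print(results)  # [0.5, 1.0, 0.0, 1.0]
```

The appeal of having this built into the language is that you don't have to write the pool-management scaffolding yourself for every loop.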
Bioinformaticians are often educated to use a specific programming language and environment. They aren't usually looking to try other languages. For example, I support our bioinformatics group and they are basically 100% R and RStudio users. We have a single user of Python and that user is doing "typical" tensorflow stuff with images.
I've noticed this same bias towards a single language in some other academic niches, like the SAS or Stata camps in public health or psychology. I think of these languages as basically the same, but for non-CS folks the perception seems to be more like English vs. Russian.
Even more complicated, researchers may be extremely committed to a specific library in a language and suspicious of languages that don't have their favorite library available.
Any shift to new tooling for these highly-committed users will almost certainly require large and obvious benefits to gain traction.
It's really saying something when scientists think writing Python code is a pain, because Python's a pretty forgiving language, too.
My son works in poli-sci analytics and I see the same thing you describe. A group will pick a tool and flog all problems with it. Change rarely occurs. He was in the Stata camp at one university, the Tidyverse camp at MIT.
It's very weird for me. I develop and maintain a piece of software that has 3 OSes and 5 languages to wrestle with, as well as multiple "tool" technologies like Ansible, MQTT, etc., so I'm very much in a polyglot, best-tool-for-the-job environment. Observationally, from a casual POV, I see pros/cons both ways.
I am aware of only one case where a community migrated to other software. Many economists I know switched from Stata to R. Some of them later moved on to Python.
> So it appears the primary reason BioJulia code is slower than Seq code in these three benchmarks is that BioSequences.jl is doing important work for you that Seq is not doing. As scientists, we hope you value tools that spend the time and effort to validate inputs given to it rather than fail silently.
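To make the trade-off concrete, here is a hedged Python sketch of the kind of input validation being described; the function name, the `ACGTN` alphabet, and the error message are illustrative, not taken from BioSequences.jl or Seq:

```python
VALID_DNA = set("ACGTN")

def parse_dna(raw: str) -> str:
    """Check that a sequence contains only recognized bases before using it.

    Validation like this costs time on every input, which shows up in
    benchmarks, but it turns bad data into a loud error instead of
    silently propagating garbage downstream.
    """
    seq = raw.strip().upper()
    bad = set(seq) - VALID_DNA
    if bad:
        raise ValueError(f"invalid bases in input: {sorted(bad)}")
    return seq

print(parse_dna("acgtn"))   # ACGTN
# parse_dna("ACGQ") raises ValueError rather than failing silently
```

The benchmark gap being discussed is essentially the cost of checks like this on every record.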
Reminds me of the myriads of Excel catastrophes.
> Seq is able to outperform Python code by up to 160x.
So ... a reimplementation of Python that can outperform CPython by over 100x? I know literally nothing about this project, but I have to say that rings pretty false to me. Hell, even PyPy has trouble with many applications. (Plus they're claiming to outperform "equivalent" C code by 2x.)
Even if the performance claims are overblown, it's always nice to see new work on compiled languages with easy-to-read syntax. It's hard to beat Python for an education / prototyping language, so I will definitely be giving this a look.
That's why I quoted their claim that the "vast majority" of Python programs run unmodified. Even PyPy barely achieves that. To really get 100x performance over Python (and even supposedly beat C) with a compiler that works on most unmodified Python code would be an extraordinary achievement.
It seems it's better to think of this particular claim as "we made a C++ algorithm that is 2x faster than the previous SotA C++ algorithm" (with the help of a heavily optimized DSL).
Python's main advantage was that it was easier than some of its competitors (C++/Java). But that is no longer the case, with modern languages (Nim/Crystal/Julia/JavaScript) being both faster and comparably easy (or easier).
It is now coasting on its momentum, mostly due to the vast amount of (usually poorly designed) open-source libraries. That and Jupyter.
Not claiming it as any sort of reference, but you can see how it [2] may be used to solve some basic genome-sequencing tasks.
My biggest concern is that Seq sucks users into a sort of local maximum. While piping syntax is nice, and the built-in routines are handy, it's a lot less flexible than a "mainstream" programming language, simply because of the smaller community and relative paucity of libraries. BioPython[1] has been around a long long time, and I think a lot of potential users of Seq would be better suited by using a regular bioinformatics library in the language they know best.
e.g., the example of reading FASTA files in Seq:

    # iterate over everything
    for r in FASTA('genome.fa'):
        print r.name
        print r.seq

versus BioPython:

    from Bio import SeqIO

    for r in SeqIO.parse("genome.fa", "fasta"):
        print(r.id)
        print(r.seq)
It might be pretty useful as a teaching tool, but I'm skeptical of its long-term benefit to professionals. I'm not sure the ecosystem of Seq users will be large enough, y'know? Again, it's pretty impressive work, and it's come a long way. I wish the devs all the best. :)

Big enough for what? Instead of a gratuitous critique of its "benefit to professionals", maybe you could comment on the project's design choices and implementation. That would be more useful to us amateurs.
Nim could be sold as "a strongly-typed and statically-compiled high-performance Pythonic language" like Seq (although it is more than that and does not actually have being Pythonic as a goal; see https://nim-lang.org/ or https://github.com/Araq/nimconf2021/blob/main/zennim.rst).
Still, given the small size of the Nim community and the even smaller size of the genomics Nim subcommunity, I would say it is not that odd that it is not included in the benchmark. The existing Nim genomics library might not even cover the functionality required by the benchmark.
I don't expect the community will adopt other languages at a large scale. My hope, though, is that more of these algorithms move to real distributed processing systems like Spark, to take advantage of all the great ideas in systems like that. But genomics will continue to trail the leading edge by about 20 years for the foreseeable future.
The workflows I deal with generally involve moving hundreds of terabytes of storage into memory, processing it, and writing it out. Single machines (even beefy ones) tend to hit their limits (networking, max RAM, cache size, TLB, etc).
Maybe there's a better tool than Spark, I don't know; the important thing is that Spark is the most ubiquitous.
[1]https://databricks.com/wp-content/uploads/2018/08/SSE15-40-D...
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
A pitch most people doing applied bioinformatics won’t understand/appreciate.
Will this be available via conda? And how would Seq integrate with Snakemake, since that is also based on Python?
It seems to me there is a huge demand for making Python faster, whether it be via making a more optimisation friendly subset, or ideally throwing engineering talent into improving the interpreter.
V8 shows this can be done with highly dynamic JavaScript. I guess we need a big corporate sponsor, or the community to fund some positions.
It's kind of crazy how few developers are working on optimising CPython; it may even be worth it for environmental reasons.