
0 points · dekhn · 5y ago

The issue is that, from what I can tell, the authors just used R to analyze some data, with no explicit parallelism. You would do a better job just renting time on AWS, saving money for everybody.


secabeen · 5y ago

In my experience with research computing, if you are able to keep a computer doing active work more than 60% of the time, it will be cheaper to purchase and run that computer yourself than renting it from AWS. That's the case even with commodity machines with only 10G Ethernet interconnect. $15k for a machine is only $0.34/hour over 5 years. That doesn't buy much of an AWS machine. (Yes, cooling, real estate and power are all overhead on that, but researchers often don't pay those costs directly, they are covered by the university with other monies.)
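The amortization arithmetic above can be checked with a short sketch. It covers only the purchase price; power, cooling, real estate, and admin time are left out, as in the comment. Dividing by utilization to get a cost per hour of active work is an assumption made here for illustration, and `owned_cost_per_hour` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap days


def owned_cost_per_hour(purchase_price, years, utilization=1.0):
    """Amortized purchase cost per hour of active work.

    utilization is the fraction of wall-clock time the machine is busy.
    Operating costs (power, cooling, staff) are deliberately excluded.
    """
    total_hours = years * HOURS_PER_YEAR
    return purchase_price / (total_hours * utilization)


# Figures from the comment: a $15k machine kept for 5 years.
always_on = owned_cost_per_hour(15_000, 5)
# Assumption: at the 60% break-even utilization, idle hours are paid for too.
at_60_percent = owned_cost_per_hour(15_000, 5, utilization=0.60)

print(f"per wall-clock hour: ${always_on:.2f}")      # $0.34
print(f"per busy hour at 60%: ${at_60_percent:.2f}")  # $0.57
```

Comparing the second figure with the hourly price of a rented instance of similar size gives a rough version of the break-even test described above.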

dekhn (OP) · 5y ago

You're completely ignoring the other valuable aspects of being in a cloud: you're close to huge amounts of high-throughput storage (blob and DB), and can increase/decrease the size of your fleet trivially. These are critical to nearly all modern scientific workflows (more so than the raw compute, IMHO).

As for the cost structure for research computing, the argument that the costs are externalized isn't a good one: the overhead that pays for the facility and the networking comes out of your grant money, and using grad student time to admin your cluster often just causes your grad students to leave for FAAMG.

secabeen · 5y ago

> You're completely ignoring the other valuable aspects of being in a cloud: you're close to huge amounts of high-throughput storage (blob and DB), and can increase/decrease the size of your fleet trivially. These are critical to nearly all modern scientific workflows (more so than the raw compute, IMHO).

That has not been my experience. There are lots of scientific workflows that only need 10s of TB at most, yet can still consume lots of cycles.

> As for the cost structure for research computing, the argument that the costs are externalized isn't a good one: the overhead that pays for the facility and the networking comes out of your grant money, and using grad student time to admin your cluster often just causes your grad students to leave for FAAMG.

At the universities I've worked at, equipment (large purchases) is exempt from overhead, or results in a lower overhead charge. (Researchers balk at paying a ~50% overhead rate on a $1 million instrument.) Using grad student time to admin your cluster is dumb, but I'm talking more about users who need single-digit numbers of computers. If you need real HPC, you're in the world of queues, national and regional supercomputers, etc.

btilly · 5y ago

R handles statistics.

It does not simulate complex biochemical interactions in different parts of the body.

From the description, they did something that requires a lot more horsepower.

dekhn (OP) · 5y ago

I read the underlying article (https://elifesciences.org/articles/59177) and was unable to find any evidence of that. In fact, the paper doesn't mention anything about Summit or give details on the computations :(

There certainly weren't any "heavy biochemical calculations"; this work is entirely comparative genomics, so just operating on DNA strings.

I see this fluff article https://www.ornl.gov/blog/genomics-code-exceeds-exaops-summi... and there may be more detail here: https://www.hpcuserforum.com/presentations/april2019/Joubert... which shows near-linear scaling that they ascribe to "Made possible by aggressive communication overlap and low-congestion Mellanox Infiniband fat tree network with adaptive routing".

So there may actually be an HPC/supercomputer story in there, but I'm having trouble figuring out what they did in this most recent work.

rrss · 5y ago

I can't find details either, but I think they just used R for the post-processing, and there's a lot of computation behind this sentence:

> RNA-Seq analysis was performed using the latest version of the human transcriptome

I found this article discussing read mapping for RNA sequence analysis: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4833417/

> In particular, RNA sequencing (RNA-seq) technology [1], which provides a comprehensive profile of a transcriptome, is increasingly replacing conventional expression microarrays [2]. Primary data processing in RNA-seq (as well as in other massive sequencing experiments, including genome resequencing) involves mapping reads onto a reference genome. This step constitutes a computationally expensive process in which, in addition, sensitivity is a serious concern.


mnw21cam · 5y ago

Depends how much of the "supercomputer" the calculation used. If it reserved a single CPU core on a single node, then that'd be cheap, allowing the rest of the system to get on with something else. There's no reason AWS would be cheaper.

natechols · 5y ago

The supercomputer in this article is extremely specialized hardware designed to maximize peak performance across all nodes. Using just a single CPU core to run a bioinformatics study would be like taking an army tank to drop the kids off at soccer practice.

dekhn (OP) · 5y ago

The supercomputer nodes cost far more, per node, than AWS machines; 15% or more of the budget was spent on interconnect. Supercomputers don't partition CPUs like that, because interference causes a small performance degradation. Unfortunately, CPU is not perfectly compressible: in particular, the cache is shared by all processes that run on the CPU, so if you run another job on the same node, you will see slower performance due to higher cache replacement (this is measured using Cycles Per Instruction).
