The PI started asking me to run some analyses on a raw dataset. Since I was so new at it, I often messed up and had to rerun the whole thing after looking at the output; this was painful because the entire script took a few hours to run.
I started poking around to see whether it could be optimized at all. The raw data was divided up into hundreds of files from different runs, sensors, etc., that were each processed independently in sequence, and the results were all combined together into a big array for the final result. Seems reasonable enough.
Except this code was all written by scientists, and the combination was done in the "naive" way - after each data file was processed, a new array was created and the previous results were copied into the new array, as were the results from the current data file. This meant that for the iterations at the end, we needed roughly 2x the size of the final data in memory, which eventually exceeded the amount of physical memory on the machine (and because there were so many data files, it was doing this allocation and copying dozens of times after it had used all the RAM).
I updated this to pre-allocate the required size at the beginning for a very, very easy 3-4 fold improvement in the overall runtime, and felt rather proud of myself.
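For anyone who hasn't run into this before, here is a minimal sketch of the difference, in plain Python rather than whatever the original script was written in. The function names and the `copied` counter are made up for illustration, and it assumes the size of every chunk is known before combining starts, which may not match the original code.

```python
def combine_naive(chunks):
    """Build a brand-new list after every chunk, copying everything so far."""
    result = []
    copied = 0  # total elements written into freshly allocated lists
    for chunk in chunks:
        result = result + chunk  # allocates a new list and copies both operands
        copied += len(result)
    return result, copied


def combine_preallocated(chunks):
    """Allocate the final size once, then write each chunk into its own slot."""
    total = sum(len(chunk) for chunk in chunks)
    result = [0.0] * total
    copied = 0
    offset = 0
    for chunk in chunks:
        # Same-length slice assignment overwrites in place, no reallocation.
        result[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
        copied += len(chunk)
    return result, copied


# 200 hypothetical "data files" with 1000 values each.
chunks = [[float(i)] * 1000 for i in range(200)]
naive, naive_copied = combine_naive(chunks)
prealloc, prealloc_copied = combine_preallocated(chunks)
print(naive_copied, prealloc_copied)
```

Both versions produce the same result, but the naive one copies a number of elements that grows quadratically with the number of files, while the pre-allocated one copies each element once.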
To his credit, once I (as nicely as possible) showed him how to do it with two nested for-loops, he clearly felt stupid and conceded the point. He was otherwise a very smart guy and good to work with, but it goes to show how we can take our training for granted. Even freshman-level stuff goes over the heads of PhDs, and I'm sure the same would be true if I were to drop into a biochem lab.
The formula can be "oblivious" to the final size of the matrix too, which is helpful if you're doing some sparse ML training on edges (e.g., GNNs).
Without even looking at the processing function, which I considered some sciency science, I set up pthreads and mutexes on the result array and such to reap almost perfectly linear scaling. So far, so good.
Then I ran a profiler to see what was actually taking so long.
... Uh, why are you spending all this time copying strings back and forth?
Turns out they passed all strings by value. Sprinkling in a few const & here and there got a 1000-fold speedup or such. I felt pretty stupid for my multithreading antics after that.
Also, H5 data formats[0] have been a godsend for scientific computing, due to their ability to inherently make sense of how to store your data. You can have your previous results carried over into your new analysis without doubling your data.
I believe what was roughly happening under the hood was:
1. Allocate an array `tmp` of size `length of allOrbitFiles` + `length of currentOrbitFiles`.
2. Copy data from `allOrbitFiles` over to `tmp`.
3. Copy data from `currentOrbitFiles` to `tmp`.
4. Reassign `allOrbitFiles` to the new array `tmp`.
5. Garbage collect the old `allOrbitFiles`.
So the doubling of memory usage comes after Step 1. I would imagine (but don't know for sure) that this would also occur in any garbage-collected language I'm familiar with (Java, Python, JavaScript).
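To make the doubling concrete, here is a rough Python sketch of the five steps listed above. It counts live elements rather than bytes, and `append_by_copy`, `all_results`, and `peak_live` are illustrative names, not the original code. It also says nothing about when a particular runtime actually frees the old array.

```python
def append_by_copy(all_results, current):
    """Follow steps 1-4 and report how many elements were alive at the peak."""
    # Step 1: allocate tmp big enough for the old and the new data.
    tmp = [None] * (len(all_results) + len(current))
    # Right after step 1 the old array, the current chunk and tmp all exist.
    peak_live = len(all_results) + len(current) + len(tmp)
    # Step 2: copy the old results over.
    tmp[:len(all_results)] = all_results
    # Step 3: copy the current chunk in behind them.
    tmp[len(all_results):] = current
    # Step 4 happens at the call site (rebinding the name);
    # step 5 is up to the runtime once nothing refers to the old list.
    return tmp, peak_live


chunks = [[i] * 10 for i in range(100)]  # 100 hypothetical files, 10 values each
all_results = []
peaks = []
for chunk in chunks:
    all_results, peak = append_by_copy(all_results, chunk)
    peaks.append(peak)

final_size = len(all_results)
print(final_size, peaks[-1])
```

On the last iteration the peak is twice the final size, which is the "2x the final data" figure mentioned upthread.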
A good example: there was recently a thread on the Julia discourse comparing Julia and Mojo. The Julia version used no external libraries (compared to 7 with Mojo) and was a simpler, faster, and cleaner implementation of the Mojo code that was used to showcase how fast Mojo was: https://discourse.julialang.org/t/julia-mojo-mandelbrot-benc.... Then, further still, folks were able to optimize for even more speed with various abstractions that let Julia take more advantage of the hardware.
That's the promise I think Julia makes and delivers on - you can write incredibly "fast" code simply and cleanly. Yes, you can have a higher standard of "fast" which requires a bit more advanced knowledge, but I'd argue that Julia still offers the cleanest/simplest way to take advantage of those micro-optimizations.
Performance is on a spectrum, and usually a tradeoff against readability and conciseness. I think it IS true that Julia excels in that it gives, by far, the best expressibility/performance tradeoff.
Also, often, you really can get "free lunch" - there are many times where if you just do the obvious thing, Julia and its backend LLVM can optimise it to extremely efficient code. A simple example is just summing an array with a for loop.
This is quite a different situation to traditional scientific computing.
No, because of JIT compilation he would write code faster than Python by default. Now, to truly rival optimized C++ code, one has to do the tricks mentioned in this post, like optimizing memory access, SIMD and maximizing instruction parallelism.
The key point is you are better off by default and can do some ugly stuff in the critical parts of the code while still using the same language.
I remember a quote that was like “Lisp programmers know the value of everything and the cost of nothing” in reference to that.
Obviously the developers of Lisp Machine operating systems could not ignore the cost of the operations, especially since they developed ambitious software (an operating system and its applications) on relatively slow machines (a Symbolics 3600 was as fast as a 1 MIPS DEC VAX 11/780).
GC was a kernel service, and there were low level primitives, including Assembly level Lisp forms.
Parentheses all the way down to microcode.
The site is a statically published version of a Pluto notebook, which uses modern web features to enable interactivity, reactivity, code syntax highlighting, etc. The tradeoff is that those features need to be enabled in your browser. The underlying file that the notebook is based on is just a basic `.jl` file, so you could happily run the notebook from a Julia instance instead of the browser-based notebook environment.
Julia itself will be happy to run however you'd like it to of course.
I thought I was visiting a website.
The HN description actually referred to nicely layered technical abstractions, which is why I had clicked the link. The description of Julia as being Lisp-like. Thank you for taking an interest. I will go see if a newer Safari works.
What scientists must know about hardware to write fast code (2020) - https://news.ycombinator.com/item?id=29601342 - Dec 2021 (29 comments)
We moved the website to https://biojulia.dev/, with permissions given to more people, including a core dev of Julia. That should reduce the risk of this happening again.
For those folks, getting the output they need is much more important than the CPU cycles - as it should be.
As a C++ programmer, I posed the question as to why they don’t hire coders to do this for them. The answer was cost, which rather surprised me given the cost of the LHC.
We also have meetings dedicated to performance, some of which are not public, but this series from ROOT is: https://indico.cern.ch/category/14122/ If you search above, you will see many discussions about performance. The CI for ROOT also has a set of benchmarks to catch regressions, and Geant4 has two systems to track performance, a CI job checking every merge request, which I've set up myself (not publicly accessible), and a more complex system to track performance run by FNAL: https://g4cpt.fnal.gov/
These are just some examples from the projects I've worked on. There are also efforts to port stuff to GPUs and HPCs, and many other projects like event generators that are also undergoing performance work for HL-LHC. If you Google you can probably find a lot more stuff than what I already mentioned. Cheers,
But for large problems the article falls short. Scientific applications may need to use several computers at a time. COMP Superscalar (COMPSs) is a task-based programming model which aims to ease the development of applications for distributed infrastructures. COMPSs programmers do not need to deal with the typical duties of parallelization and distribution, such as thread creation and synchronization, data distribution, messaging or fault tolerance. Instead, the model is based on sequential programming, which makes it appealing to users that either lack parallel programming expertise or are looking for better programmability. Other popular frameworks such as LEGION offer a lower-level interface.
A minor detail I find a bit confusing, though, is explaining the potential benefits of SMT/hyperthreading with an example where threads are spending some of their time idle (or sleeping).
I don't know Julia so I don't know if sleep is implemented with busy-waiting or something there, but generally if a thread is put to sleep, the thread gets blocked from being run until the timer expires or the sleep is interrupted. The operating system doesn't schedule the blocked thread for running on the CPU in the first place, so a thread that's sleeping is not sharing a CPU core with another thread that's being executed.
So the example does not finish 8 jobs almost as fast as 4 or 1 jobs using 4 cores due to SMT; it's rather that half of the time each of the threads is not even being scheduled for running. A total of eight concurrent jobs/threads works out to approximately four of them being eligible to run at a time, matching the four physical cores available.
If there are only four concurrent jobs/threads, each sleeping half of the time, you end up not utilizing the four cores fully because on average two of the cores will be idle with no thread scheduled.
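The arithmetic in the last two paragraphs can be written down as a tiny model. This is only a back-of-the-envelope sketch in Python: `busy_cores` is a made-up helper, it works with long-run averages, and it assumes a sleeping thread is never scheduled, so it ignores moments where more threads happen to be awake than there are cores.

```python
def busy_cores(jobs, cores, runnable_fraction=0.5):
    """Average number of cores doing work when each job is runnable only part of the time."""
    runnable = jobs * runnable_fraction  # average number of threads eligible to run
    return min(cores, runnable)          # cannot keep more cores busy than exist


for jobs in (1, 4, 8):
    busy = busy_cores(jobs, cores=4)
    print(f"{jobs} jobs on 4 cores: about {busy} cores busy on average")
```

With 8 jobs the model gives four busy cores, and with 4 jobs it gives two, without SMT appearing anywhere in the calculation.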
AFAIK SMT should only really be beneficial in cases of stalls due to CPU internal reasons such as cache misses or branch mispredictions, not in cases of threads being blocked for I/O (or sleeping).
The post is of course correct in that the example computation benefits from a higher number of concurrent jobs because of each thread being blocked half of the time. However, that's unrelated to SMT.
Considering how meticulous and detailed the post generally is, I think it would make sense to more clearly separate SMT from the benefits of multithreading in case of partially I/O-bound work.
Thanks for the heads up!
This link is not meant for you. It is meant for a scientist, and most scientists do not also have an EE degree or CS degree.
How much graduate level biology, oceanography, physics, geology, chemistry, meteorology, or other scientific field do you know?
All of those have subfields where computational performance is important. My experience is scientists are more likely to pick up the software skills than EEs are willing to pick up the science background. (In part because scientific software development generally pays less well than commercial software development.)
The standard entomologist curriculum does not require calculus, while a physics curriculum does. Both produce scientists. (For example, https://cals.cornell.edu/education/degrees-programs/entomolo... under "Major Requirements" says "One semester of college statistics or biometry", and the listed physics requirement doesn't require calculus.)
On the other hand, an entomologist interested in population ecology may need to know differential equations.
Your use of "study program" suggests your experience is at the undergrad level, and not at the grad school level, which is how most scientists I know got their training.
At the undergrad level the study programs do reflect what's needed for a solid education. If a student is interested in computational biology, that program will emphasize taking more CS courses than the program for a student interested in marine biology.
But at the grad level, the "study program" is much less formalized. You might take graduate level classes the first couple of years, but then you are expected to pick up the missing bits on your own.
Once you have your PhD and are a working scientist, you rarely have the luxury of following any study program.
And if you've been a scientist for 20 years, any CS training you had likely did not cover SIMD, and emphasized practices which are no longer relevant. (For example, the link points out "That advice [about HDDs] is mostly outdated today [with SSDs]".)
Those latter categories are who the linked-to piece is for, not undergrads in a well-defined study program.
If you had prepended the comment with something like "I love this topic!" to show enthusiasm or approval, you probably would have gotten a much different response.