Memory Bandwidth Napkin Math (opens in new tab)

(forrestthewoods.com)

269 pointsEpholys6y ago54 comments

54 comments

I’ve been fascinated by the 'napkin math' topic recently, but felt a need for a way to routinely practise. It’s an acquired skill once it’s effortless to do the order of magnitude calculations in a meeting, or zipping through possible solutions on a whiteboard (what I imagine Jeff Dean does). Will it be fast enough? How much will it cost? Does the benchmarked performance match the order of magnitude we’d expect?

To practise, I created a newsletter last year with a monthly problem that may be of interest to others wanting to sharpen their napkin math: https://sirupsen.com/napkin/

gumby6y ago

The two big things are a way to conceptualize your problem in a straightforward way (hard) and to learn to rapidly do adequate (not precise) arithmetic in your head (easy).

Wait, the math is easy? Sure, if all you are concerned with is the right, not correct answer. Back when people used slide rules this was common, but now when you do it it seems to weird some people out. You should develop this skill anyway because when you see an answer you should be able to tell at a glance if it's probably right or almost certainly wrong.

Why its simple: the first part is just to keep track of the order of magnitude.* People who used slide rules always had to to do this, and it's quick to pick up and pretty easy to do once you're used to it.

Second is just to know a few common fractions and be comfortable rounding intermediate results to convenient amounts (if you have "86" you might round it to 81 if you're dividing into thirds or ninths, or 88 if its by 11, or 80 or 90 if you care about 10x and would prefer your error to be a "too small" or "too big".

Third is to understand those error bars above, and, as you do when you work by floating point, avoid dealing with incommensurate numbers (this factor is so tiny I'll just ignore it).

When you get good at this you'll usually be within a few percent of the actual answer, which is usually enough to decide if it's worth actually calculating the answer or not.

Example: I remember a discussion a few years ago where we were trying to figure out if we could fit our product into a certain volume. As we discussed the parameters, I and a colleague simultaneously said "360 micrograms" (density was 1 so g = ml). The calculator welder beavered on and a few seconds later triumphantly said "357 mg". Great, 357 was more accurate than 360, but about right, but it was clear he was madly off in magnitude. He wanted to believe his calculator, and checked his work while the rest of us moved on.

* Unless you're a physicist in which case within a few orders of magnitude is probably OK, or a cosmologist in which case all you care about is 10^0

ramblenode6y ago

This sort of order of magnitude estimation is essentially a Fermi question [1]. Anyone interested in these might also check out problems from the high school Science Olympiad event of the same name. Here is a page which links to some past tests: [2]

[1] https://en.wikipedia.org/wiki/Fermi_problem

[2] https://www.sciencenc.com/resources/high-school/fermi-questi...

BubRoss6y ago

The article isn't really about napkin math and rough estimates. It actually goes into using simple benchmarks to calibrate to modern performance numbers.

ultrasounder6y ago

Thanks for making this list. And WOW!. What an impressive list of books You have read. I have Thinking Fast and Slow but haven't gotten around to reading it.

russellbeattie6y ago

150ms round trip California to the Netherlands. Oof. That's like 9 frames of a 60fps video. The speed of light sucks! Someone should do something about that!

wincy6y ago

Plenty of money to be made if you drill through the earth and run some fiber optics.

rckoepke6y ago

Light travels 31% slower in fiber optic cables than it does in a vacuum. (Speed of light is only constant in a vacuum, not in arbitrary media).

It can be faster to send the signal into space and bounce it off a satellite, than it would be through a perfect "great circle" surface fiber cable. This is the plan for SpaceX's Starlink constellation. Mark Handley makes an excellent series of YouTube videos explaining some of the high level engineering behind this with incredible visualizations[0]

I like the napkin idea of a "great chord" optic cable as you seem to be suggesting (it's novel at least!) but temperatures get pretty hot very quickly when drilling down. At 5km depth you're already hitting 170C/340F, otherwise 10km is 375C and definitely a hard limit. Most fiber optic cable has a melting point of 70C, way under this.

Keeping a maximum 6km depth (already way past the melting point of fiber cable), you can drill a perfectly straight line connecting two points on the surface of the earth that are 555.89 km apart. The horizontal tunnel will be 555.71 km long. Mainly this is because the earth is ~13,000 km diameter, so 6km depth is truly negligible.

This tunnel would cost ~$1.2 Trillion for a 0.32% distance savings.

Additionally, the speed of light in fiber slows down as the fiber increases in temperature. So you'll actually probably get significantly long ping, even though the signal is traveling an insignificantly shorter distance. [1]

Lastly, Rayleigh scattering (primarily, as well as Raman and Brillouin scattering) increases with respect to temperature. This means the signal won't go as far as it does at surface conditions.

0: https://www.youtube.com/channel/UC-k1j7M2-hBfXeECd9YAQ_g/vid...

1: https://accelconf.web.cern.ch/accelconf/d09/papers/tupb35.pd...

paranoidrobot6y ago

Another idea was to use beams of Neutrinos and go point-to-point. It's been demonstrated[1] already.

[1] https://physicsworld.com/a/neutrino-based-communication-is-a...

1 more reply

nkurz6y ago

> Most fiber optic cable has a melting point of 70C

This seems very unlikely, or at least unlikely to be relevant. The type of fiber optic cable that melts at 70C is PMMA acrylic plastic. To my understanding, this type of fiber not used for long distance communication. Instead, glass fibers with lower losses(which happen to have a much higher melting point) are used: "Because of these properties silica fibers are the material of choice in many optical applications, such as communications (except for very short distances with plastic optical fiber)" (https://en.wikipedia.org/wiki/Optical_fiber). Do you know more than my limited Wikipedia level of understanding, and are confident in saying that long haul cables indeed have a 70C melting point, or should you perhaps include a few more caveats on your napkin math?

1 more reply

867-53096y ago

> Light travels 31% slower in fiber optic cables than it does in a vacuum.

in 2013 this was reduced to 0.03%, but the abstract does not allude to distance or cost

https://www.nature.com/articles/nphoton.2013.45

PeterCorless6y ago

You can if you make the Planck constant a variable. People are postulating it just might be under the right conditions.

https://iopscience.iop.org/article/10.1088/1742-6596/1051/1/...

lightsighter6y ago

SF to Amsterdam is 5448 miles, which is 0.29 lightseconds, or 29ms at the speed of light, or about 2 frames of 60fps video, so 5X faster. :)

pkroll6y ago

I'm shocked no one has pointed out .29 lightseconds is 290ms, not 29ms, and that you meant 0.029 lightseconds.

rckoepke6y ago

Light in fiber optics travels 31% slower than the speed of light in a vacuum, and does not travel the ideal "great circle" route.

ComSubVie6y ago

That's a really great post!

Good problem description, good introduction, explained code, meaningful and explained graphs, full source code available, conclusion and even possible further exercises. I wish (all) scientific papers were that well written.

jandrese6y ago

This is a lot more than just Napkin Math. He writes a synthetic benchmark to demonstrate the orders of magnitude difference in performance you can see depending on how you approach a problem.

pixelpoet6y ago

... to get the numbers presented at the end for doing napkin math.

ColanR6y ago

It's funny, I had this conversation last week as the coworker, but didn't actually know how to go about calculating this without running tests.

It does seem like there are classes of problems which are completely bandwidth-limited. I've heard this is the next area of expansion for hardware tech, but I haven't seen much yet.

gumby6y ago

> I've heard this is the next area of expansion for hardware tech, but I haven't seen much yet.

This has been an continuously active area of hardware design since the 1960s (consider Cray's work on the 6600 and his later work at his own company). The whole HPC world has to obsess on this issue.

ColanR6y ago

Fair enough. I was thinking compared to cpu clock speed and ram capacity, which seemed to have been an area of greater focus compared to the bandwidth between them.

myself2486y ago

I swear I saw something recently where they're putting mediocre little processors right into the RAM packages.

So you send the task to the memory, and it's done right there, where the latency is lowest. And the more RAM you have, the more processors you have working in parallel. When they arrive at an answer, they send it back.

Sort of like content-addressable memory, but even more so.

hinkley6y ago

> Let this sink in. Random access into the cache has comparable performance to sequential access from RAM. The drop off from sub-L1 16 KB to L2-sized 256 KB is 2x or less.

> I think this has profound implications.

I think I agree.

Can anyone here posit a theory why this is true? Is this a consequence of all the stream processing work in recent generations of processor? Or something else?

Is he saying that pointer chasing even when the values are in cache is the culprit?

crest6y ago

The CPU data caches are low latency and high throughput if you hit them. There is nothing more to it. The interesting questions are:

* What is your cache hit rate? * How much of each cache hit is used? * Are you utilizing all available resources e.g. memory channels, cache banks, vector lanes?

wpietri6y ago

I'm sure you already know this, but if I understand hinkley's question correctly, there's one more thing to it. Modern processors may prefetch memory in chunks, greatly speeding up the sequential case: https://software.intel.com/en-us/articles/optimizing-applica...

hinkley6y ago

I took "Random access into the cache" to mean "random access of memory locations that are already in the cache".

That would be a cache hit rate of 100%. Which would imply that pointer chasing has seen none of the improvements that branch prediction has seen over that interval (see also, 'threaded interpreters' are now considered passé).

Is there a different interpretation I was meant to take from that sentence?

staticfloat6y ago

I actually don't think this is that profound; when OP is testing datasets that fit within cache, what's happening is that we are simply not waiting for data to be loaded in from RAM. Let's look at this from a different angle; instead of looking at GB/s, let's think about the CPU as a machine that executes instructions as fast as it can, then look at what can go wrong.

I could write a program in assembly that is simply 1000000 triplets of (load, add, store) instructions, each reading from a sequentially-increasing memory location. We could think of it like a fully-unrolled addition loop. My CPU, operating at 3GHz, supposedly should complete this program in ~1ms (3 million instructions running at 3 billion instructions per second), but (spoiler alert) it doesn't. Why?

The answer is that the CPU executes instructions as quickly as it can, but it spends an awful lot of time waiting around for various subsystems to finish doing something. Unless you are running highly optimized code, chances are your CPU, even though "utilization" is pegged at 100%, has more circuits sitting around doing nothing than it has circuits doing something. One of the largest contributors to your CPU impatiently tapping its foot is the memory subsystem, because fetches from RAM are so slow, and the `load` instruction (and all dependent instructions such as the `add` and `store` instructions) is completely blocked upon that instruction finishing completely.

To help with this, we have the whole cache hierarchy, with prefetching of whole cache lines and whatnot, to try and grab pieces of memory that we think will be used next into cache, such that we can then access them with _much_ lower latency (and therefore higher bandwidth, since the only thing preventing us from processing more data is waiting around for more data to come in, ergo latency is directly the inverse of bandwidth in this case).

Therefore, when doing random accesses upon a dataset that won't fit into cache, we expect the average time-per-instruction-retired to be roughly how long it takes to pull it in from RAM. Sequential access is faster only because when I ask for a value and it doesn't exist, I grab not only that whole value, but the entire cache line at once, such that the next couple of loads only have to reach out to cache. Smart compilers can place prefetching hints in here as well, so that future loads are overlapping our accesses to cache.

The reason I find random access into cache having the same performance as sequential access as not that profound is because it falls out directly from the above scenario: sequential access into RAM _is_ random access of cache! The reason sequential access to RAM is fast is because the values are in cache due to having fetched an entire cache line; therefore randomly accessing those same cache buckets in a random order is equivalent (from this perspective).

For those that are interested in learning more about the cache hierarchy and why it's important to organize your algorithms/datastructures such that you can do a lot of work without your working set falling out of cache, I highly recommend reading the HPC community's tutorials on writing high-performance matrix multiplication. Writing a fast GEMM is one of the simplest algorithms and datastructures that will drive home why we have tricks like tiling, loop unrolling, loop fusion, parallelization, etc.. to make maximal use of the hardware available to us.

For those that like academic papers, Goto et. al's paper has a lot of this laid out nicely: https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf For those that like follow-along tutorials, this was a fun one for me: http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/

forrestthewoods6y ago

Hi. OP here.

> The reason I find random access into cache having the same performance as sequential access as not that profound is because it falls out directly from the above scenario

Correct. The behavior can be logically explained. There's no magic involved.

> sequential access into RAM _is_ random access of cache

I actually like your statement even better than my post. Sequential access into RAM _is_ random access of the cache! What a delightfully profound statement.

High-performance computing is your specialty. Of course everything in my post is obvious to you. Elementary even.

If you want to argue the semantics as to whether a logical conclusion constitutes as profound or not. Well, I guess?

Not gonna lie. Your comment kind of comes across as a long-winded humblebrag. "Everything OP said is true. I just think it's obvious." Sorry if that wasn't your intent.

staticfloat6y ago

Sorry, I don't mean for this to come across as playing down your post; you're right that "profoundness" is completely subjective and there's no use in me saying "it's not that profound", for that I apologize.

I enjoyed your post quite a bit, especially how far you went in order to get concrete results that back up the theory. Thank you for going through the effort of writing this post and educating others on your findings. :)

1 more reply

nkurz6y ago

> I could write a program in assembly that is simply 1000000 triplets of (load, add, store) instructions, each reading from a sequentially-increasing memory location. We could think of it like a fully-unrolled addition loop. My CPU, operating at 3GHz, supposedly should complete this program in ~1ms (3 million instructions running at 3 billion instructions per second), but (spoiler alert) it doesn't. Why?

I'll bite. Why doesn't it? And how long do you expect it to take? I'll claim that with a modern processor a simple loop in C probably beats this speed. If you want, we can test afterward to see if our respective theories are true.

The linked article claims that single-threaded reading speed of sequential memory (on his machine) is 11 GB/s. This means a 3 GHz system has a throughput of a little over 3B per cycle (11/3). This means that ever 3 cycles we can pull down 11B, which should be enough to comfortably finish our 1M loads in 1 ms. With 64-bit integers, it's getting a little tighter, but still should be possible.

I guess on a technicality you might be right, but not in a good way. If you were to fully unroll the assembly, you might be able to slow things down enough such that you were running at less than 1 integer per 3 cycles. A simple loop is going to be faster here. Done right (fused SUB/JNZ) the loop overhead should actually be zero. Depending on what you are doing with the store (in place, or to another sequential array?) I'd guess you'd be able to get down to less than 2 cycles per 32-bit int with simple C, and pretty close to 1 cycle per int if you went all out with software prefetching and huge pages.

hinkley6y ago

I think I see the problem. OP thinks these sentences say the same thing:

"Random access into the cache has comparable performance to sequential access from RAM."

"Sequential access from RAM has comparable performance to random access into the cache."

Whereas I do not.

KiwiJohnno6y ago

If you are reading sequential addresses from RAM then the CPU cache is smart enough to prefetch what you want next, so you are actually hitting cache, not RAM. Remember, the "bandwidth" being referenced is limited not by actual bandwidth, but by latency. If you prefetch the right data, the latency dissapears.

Accessing a linked list (pointer chasing) means the memory accesses will be random, not sequential, so the prefetching doesn't really work.

thedance6y ago

You'll become a better programmer if you also understand why randomly accessing main memory is slow and why randomly accessing caches is not slow.

gumby6y ago

...randomly accessing caches need not be slow.

smallnamespace6y ago

Fast mental math is useful enough that folks in different fields keep reinventing variants of it, with different terminology:

- Estimation ('market sizing') is a standard part of management consulting company case interviews [1]. The reason is because clients will be throwing you questions and one part of the job is to look smart and give reasonable answers on the fly, without going to a computer or grabbing a calculator first.

- Physicists call them Fermi problems [2]

- Microsoft (in)famously asked 'How many ping pong balls fit into a 747?' as a brain teaser [3]. This was common enough that someone wrote a book about these brain teasers [4].

- Fast mental math is a standard part of many trader interviews, since you'll be making split-second decisions under pressure [5]

One technique is converting everything into log10 first, e.g. 3 billion is about 3 * 10^9 ~ 10^9.5, then you're just adding / subtracting exponents to multiply / divide. Another way is to always round inputs to 'easy' numbers (2, 3, 5), and calculate them separately from exponents.

A few minutes with a napkin can easily save several hours doing something that can't possibly be worthwhile [6]

[1] https://mconsultingprep.com/market-sizing-example/

[2] https://en.wikipedia.org/wiki/Fermi_problem

[3] https://www.inc.com/minda-zetlin/microsoft-changes-job-inter...

[4] https://www.amazon.com/How-Would-Move-Mount-Fuji/dp/03167784...

[5] https://www.quora.com/Why-do-hedge-prop-quant-funds-ask-ment...

[6] https://xkcd.com/1205/

rckoepke6y ago

I'm having trouble finding information on the datatype used, 'matrix4x4_simd'. Can anyone point me to some resources to learn about these?

inetknght6y ago

At the bottom of the blog post, the author links to a Github gist which includes the C++ source code.

PeterCorless6y ago

Check out pages 48-49.

http://englishonlineclub.com/pdf/Metal%20Programming%20Guide...

fulafel6y ago

Where does the 5 GB/s napkin estimate for RAM come from? Its lower than the pointer chasing fihure of 7 GB/s.

Taniwha6y ago

Unlike just reading a bunch of random data the read instructions can't be pipelined, the instruction that uses the read pointer can't be dispatched to the load-store unit until after it's address has arrived in the CPU (two reads where you know the address can just be queued, and even finished out of order if the second one hits in a closer cache than the first one)

fulafel6y ago

I think you are describing the pessimal pointer chasing case, that should be the smaller figure.

Taniwha6y ago

More I'm trying to explain why pointer chasing is going to be slower than random integer accesses

1 more reply

mooibos6y ago

Nice. It explains a few "WTF" performance issues I had in the past.

andikleen46y ago

Sounds like he's trying to reinvent the roofline model.

A lot of work has been already done in this area.

https://en.wikipedia.org/wiki/Roofline_model

bubblethink6y ago

Are you andi kleen from toplev ? Used it quite extensively during a summer a few years ago for profiling stuff. Great toolkit.

BubRoss6y ago

He didn't try to 'reinvent a model' he just showed where modern numbers fall in his simple benchmarks.

m0zg6y ago

Yep. Most programmers today don't realize that "R" in "random access memory" can turn your RAM into a pumpkin very easily, and reduce bandwidth to below what you can get reading sequentially _from a budget SSD_.

Another number to put things into perspective: at 4GHz, 60ns is _240_ cycles. So every time you feel like you don't care about cache locality, try laboriously counting to 240.

whois6y ago

Oh snap it's Forest! I wonder what he's up to these days. He used to be a dev at Uber Entertainment and did some cool community engagement.

j / k navigate · click thread line to collapse

54 comments

Sirupsen6y ago

To practise, I created a newsletter last year with a monthly problem that may be of interest to others wanting to sharpen their napkin math: https://sirupsen.com/napkin/

gumby6y ago

The two big things are a way to conceptualize your problem in a straightforward way (hard) and to learn to rapidly do adequate (not precise) arithmetic in your head (easy).

Third is to understand those error bars above, and, as you do when you work by floating point, avoid dealing with incommensurate numbers (this factor is so tiny I'll just ignore it).

When you get good at this you'll usually be within a few percent of the actual answer, which is usually enough to decide if it's worth actually calculating the answer or not.

* Unless you're a physicist in which case within a few orders of magnitude is probably OK, or a cosmologist in which case all you care about is 10^0

ramblenode6y ago

[1] https://en.wikipedia.org/wiki/Fermi_problem

[2] https://www.sciencenc.com/resources/high-school/fermi-questi...

BubRoss6y ago

The article isn't really about napkin math and rough estimates. It actually goes into using simple benchmarks to calibrate to modern performance numbers.

ultrasounder6y ago

Thanks for making this list. And WOW!. What an impressive list of books You have read. I have Thinking Fast and Slow but haven't gotten around to reading it.

russellbeattie6y ago

150ms round trip California to the Netherlands. Oof. That's like 9 frames of a 60fps video. The speed of light sucks! Someone should do something about that!

wincy6y ago

Plenty of money to be made if you drill through the earth and run some fiber optics.

rckoepke6y ago

Light travels 31% slower in fiber optic cables than it does in a vacuum. (Speed of light is only constant in a vacuum, not in arbitrary media).

This tunnel would cost ~$1.2 Trillion for a 0.32% distance savings.

Lastly, Rayleigh scattering (primarily, as well as Raman and Brillouin scattering) increases with respect to temperature. This means the signal won't go as far as it does at surface conditions.

0: https://www.youtube.com/channel/UC-k1j7M2-hBfXeECd9YAQ_g/vid...

1: https://accelconf.web.cern.ch/accelconf/d09/papers/tupb35.pd...

paranoidrobot6y ago

Another idea was to use beams of Neutrinos and go point-to-point. It's been demonstrated[1] already.

[1] https://physicsworld.com/a/neutrino-based-communication-is-a...

1 more reply

nkurz6y ago

> Most fiber optic cable has a melting point of 70C

1 more reply

867-53096y ago

> Light travels 31% slower in fiber optic cables than it does in a vacuum.

in 2013 this was reduced to 0.03%, but the abstract does not allude to distance or cost

https://www.nature.com/articles/nphoton.2013.45

PeterCorless6y ago

You can if you make the Planck constant a variable. People are postulating it just might be under the right conditions.

https://iopscience.iop.org/article/10.1088/1742-6596/1051/1/...

lightsighter6y ago

SF to Amsterdam is 5448 miles, which is 0.29 lightseconds, or 29ms at the speed of light, or about 2 frames of 60fps video, so 5X faster. :)

pkroll6y ago

I'm shocked no one has pointed out .29 lightseconds is 290ms, not 29ms, and that you meant 0.029 lightseconds.

rckoepke6y ago

Light in fiber optics travels 31% slower than the speed of light in a vacuum, and does not travel the ideal "great circle" route.

ComSubVie6y ago

That's a really great post!

jandrese6y ago

This is a lot more than just Napkin Math. He writes a synthetic benchmark to demonstrate the orders of magnitude difference in performance you can see depending on how you approach a problem.

pixelpoet6y ago

... to get the numbers presented at the end for doing napkin math.

ColanR6y ago

It's funny, I had this conversation last week as the coworker, but didn't actually know how to go about calculating this without running tests.

It does seem like there are classes of problems which are completely bandwidth-limited. I've heard this is the next area of expansion for hardware tech, but I haven't seen much yet.

gumby6y ago

> I've heard this is the next area of expansion for hardware tech, but I haven't seen much yet.

This has been an continuously active area of hardware design since the 1960s (consider Cray's work on the 6600 and his later work at his own company). The whole HPC world has to obsess on this issue.

ColanR6y ago

Fair enough. I was thinking compared to cpu clock speed and ram capacity, which seemed to have been an area of greater focus compared to the bandwidth between them.

myself2486y ago

I swear I saw something recently where they're putting mediocre little processors right into the RAM packages.

Sort of like content-addressable memory, but even more so.

hinkley6y ago

> Let this sink in. Random access into the cache has comparable performance to sequential access from RAM. The drop off from sub-L1 16 KB to L2-sized 256 KB is 2x or less.

> I think this has profound implications.

I think I agree.

Can anyone here posit a theory why this is true? Is this a consequence of all the stream processing work in recent generations of processor? Or something else?

Is he saying that pointer chasing even when the values are in cache is the culprit?

crest6y ago

The CPU data caches are low latency and high throughput if you hit them. There is nothing more to it. The interesting questions are:

* What is your cache hit rate? * How much of each cache hit is used? * Are you utilizing all available resources e.g. memory channels, cache banks, vector lanes?

wpietri6y ago

hinkley6y ago

I took "Random access into the cache" to mean "random access of memory locations that are already in the cache".

Is there a different interpretation I was meant to take from that sentence?

staticfloat6y ago

forrestthewoods6y ago

Hi. OP here.

> The reason I find random access into cache having the same performance as sequential access as not that profound is because it falls out directly from the above scenario

Correct. The behavior can be logically explained. There's no magic involved.

> sequential access into RAM _is_ random access of cache

I actually like your statement even better than my post. Sequential access into RAM _is_ random access of the cache! What a delightfully profound statement.

High-performance computing is your specialty. Of course everything in my post is obvious to you. Elementary even.

If you want to argue the semantics as to whether a logical conclusion constitutes as profound or not. Well, I guess?

Not gonna lie. Your comment kind of comes across as a long-winded humblebrag. "Everything OP said is true. I just think it's obvious." Sorry if that wasn't your intent.

staticfloat6y ago

1 more reply

nkurz6y ago

hinkley6y ago

I think I see the problem. OP thinks these sentences say the same thing:

"Random access into the cache has comparable performance to sequential access from RAM."

"Sequential access from RAM has comparable performance to random access into the cache."

Whereas I do not.

KiwiJohnno6y ago

Accessing a linked list (pointer chasing) means the memory accesses will be random, not sequential, so the prefetching doesn't really work.

thedance6y ago

You'll become a better programmer if you also understand why randomly accessing main memory is slow and why randomly accessing caches is not slow.

gumby6y ago

...randomly accessing caches need not be slow.

smallnamespace6y ago

Fast mental math is useful enough that folks in different fields keep reinventing variants of it, with different terminology:

- Physicists call them Fermi problems [2]

- Microsoft (in)famously asked 'How many ping pong balls fit into a 747?' as a brain teaser [3]. This was common enough that someone wrote a book about these brain teasers [4].

- Fast mental math is a standard part of many trader interviews, since you'll be making split-second decisions under pressure [5]

A few minutes with a napkin can easily save several hours doing something that can't possibly be worthwhile [6]

[1] https://mconsultingprep.com/market-sizing-example/

[2] https://en.wikipedia.org/wiki/Fermi_problem

[3] https://www.inc.com/minda-zetlin/microsoft-changes-job-inter...

[4] https://www.amazon.com/How-Would-Move-Mount-Fuji/dp/03167784...

[5] https://www.quora.com/Why-do-hedge-prop-quant-funds-ask-ment...

[6] https://xkcd.com/1205/

rckoepke6y ago

I'm having trouble finding information on the datatype used, 'matrix4x4_simd'. Can anyone point me to some resources to learn about these?

inetknght6y ago

At the bottom of the blog post, the author links to a Github gist which includes the C++ source code.

PeterCorless6y ago

Check out pages 48-49.

http://englishonlineclub.com/pdf/Metal%20Programming%20Guide...

fulafel6y ago

Where does the 5 GB/s napkin estimate for RAM come from? Its lower than the pointer chasing fihure of 7 GB/s.

Taniwha6y ago

fulafel6y ago

I think you are describing the pessimal pointer chasing case, that should be the smaller figure.

Taniwha6y ago

More I'm trying to explain why pointer chasing is going to be slower than random integer accesses

1 more reply

mooibos6y ago

Nice. It explains a few "WTF" performance issues I had in the past.

andikleen46y ago

Sounds like he's trying to reinvent the roofline model.

A lot of work has been already done in this area.

https://en.wikipedia.org/wiki/Roofline_model

bubblethink6y ago

Are you andi kleen from toplev ? Used it quite extensively during a summer a few years ago for profiling stuff. Great toolkit.

BubRoss6y ago

He didn't try to 'reinvent a model' he just showed where modern numbers fall in his simple benchmarks.

m0zg6y ago

Another number to put things into perspective: at 4GHz, 60ns is _240_ cycles. So every time you feel like you don't care about cache locality, try laboriously counting to 240.

whois6y ago

Oh snap it's Forest! I wonder what he's up to these days. He used to be a dev at Uber Entertainment and did some cool community engagement.

j / k navigate · click thread line to collapse