I was working in C, and looking back I came up with a quite performant solution mostly by accident: all the memory was allocated up front in a very cache-friendly way.
The first time I ran the program, it finished in a couple seconds. I was sure something must have failed, so I looked at the output to try to find the error, but to my surprise it was totally correct. I added some debug statements to check that all the data was indeed being read, and it was working totally as expected.
I think before then I had a mental model of a little person inside the CPU looking over each line of code and dutifully executing it, and that was a real eye-opener about how computers actually work.
At one of my first jobs I was a DBA supporting a CRUD app in the finance industry. The app had one report that took forever and usually timed out, and I was told to take a look at it. The DB query was just missing a couple of indexes, so I added those.
After I added them, my boss told one of the users of the app to try out the report and she said it was still broken. He asked what she meant and she said she clicked the button and the page with the results came up right away. She thought it was broken because it didn't take forever.
I blame all the ads, tracking, and bloatware that is prevalent now most of all.
It's bananas how slow the web is today on average when you're on a symmetric gigabit connection.
I've done video4linux stuff in Go, and passing an unsafe.Pointer to a Go struct in an ioctl() worked fine, which tells me that Go structs are isomorphic to C structs. Even though Go has garbage collection, it allocates everything it can on the stack, so only long-lived shared-between-goroutines objects are subject to garbage collection.
Go abstracts concurrency down to just three features: the "go" keyword (which launches a goroutine, basically a tiny virtual thread), channels (which are selectable queues), and the "select" keyword, which waits for the first "input" from a static set of channels.
Julia is fast, easy to read, and easy to write. But it's not easy to maintain. There is a direct tradeoff between dynamism on one hand making things easier to read/write and static enforceability on the other making it easier to maintain.
Then there were BLISS, Mesa and PL/I, but the OSes that made use of them lost to UNIX, so.
With the exception of classic Mac OS, which was written in Object Pascal and later ported to a mix of Object Pascal and C++.
Having said this, plenty of alternatives with AOT compilers exist nowadays.
The only things C has going for it are historical weight, the UNIX/POSIX ecosystem, and some domains that are closed to any alternative suggestions due to tooling or cargo cult against alternatives.
Code being easier to read and maintain is a function of how close it is to human semantics. The more the algorithm is presented in terms and notations humans like and find familiar, the easier. Code being performant is a function of how close it is to machine semantics, the more the algorithm is presented as steps that the machine likes and finds familiar, the faster it will run, as the machine is doing less to execute each step.
There is a fundamental tension between the two, even if compilation from high-level languages might, at first glance, give us the illusion that we can have both. We can't, not in general. We can only do it for a class of human semantics that C++ folks call "Zero-Cost Abstractions": the set of abstractions that can be completely erased without a trace by the time you get to the executable.
But otherwise, there is a fundamental cost to making code more readable by humans: making it less readable by the machines that will execute it. This is a reflection of the fundamental alienness of computers; what they find quite easy you find quite hard, and vice versa. Optimizing for humans means generality and ruthless hiding of details; optimizing for machines is all about special cases and ruthless exploitation of assumptions.
(Incidentally, C is not all it's cracked up to be. Generic containers, off the top of my head, resort to using void* pointers for data and function pointers for operations, which has a runtime cost besides being unsafe and error-prone. C++ templates, on the other hand, can aggressively inline types and operations for you, as if you hadn't written generic code at all; no wonder templates are the poster child for C++'s zero-cost abstractions. Another example I hear often is how pointer aliasing in C and C++ makes it extraordinarily difficult for the compiler to optimize array and memory operations, whereas a language like Fortran makes it easier by restricting such aliasing.)
I use JavaScript and C++ for different things, sometimes in the same day. (And python and PHP and others, but this is not relevant.)
Believe me, JavaScript can be a real head scratcher compared to C++.
And now for the purists: No, I don't use all features of C++, only the minimal necessary ones for the problem I have to solve. This ridiculous idea that you are not using C++ if you are not using every single language feature is what makes programs difficult to write and maintain.
The nice thing about D for me is that you can generally banish the unreadable metaprogramming code to a library.
Swap out { } for Begin End, and make a few other changes, and you've got Pascal. Single-pass Pascal compilers have been faster (at compiling) than almost anything out there since Turbo Pascal 3.0 for MS-DOS.
Modern versions, such as Free Pascal, Delphi and Lazarus also deal with strings in a manner that totally avoids needing to manually manage memory. The GUI builders are awesome as well.
Fortran is fine. Also lua (using the luajit interpreter you get really close to C speed) and julia (except for the atrocious startup time).
If you hold up a sign with, say, a multiplication, a CPU will produce the result before light reaches a person a few metres away.
The latency of a multiplication (register input to register output) is about 5 clock cycles, and many CPUs run at 4 GHz or 5 GHz these days.
5 clock cycles at 5 GHz is 1 ns, which is 30 centimeters of light travel.
If we include an L1 cache read and an L1 cache write, IIRC it's 4 clock cycles for the read + 4 more for the write. So 13 clock cycles, which is almost 80 centimeters.
------------
A DDR4 read and L1 cache write will add 50 nanoseconds (~250 cycles) of delay, and we're up to almost 16 meters.
And now you know why cache exists; otherwise computers would be waiting on DDR4 RAM all day rather than doing work.
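For the curious, the arithmetic behind these figures can be sanity-checked in a few lines (assuming a 5 GHz clock; exact cache latencies vary by microarchitecture):

```python
# Back-of-the-envelope: how far light travels while the CPU works.
C = 299_792_458  # speed of light, m/s

def light_metres(cycles, ghz=5.0):
    """Distance light covers while the CPU spends `cycles` at `ghz` GHz."""
    return C * cycles / (ghz * 1e9)

mul = light_metres(5)          # one multiply: ~0.30 m
mul_l1 = light_metres(13)      # L1 read + multiply + L1 write: ~0.78 m
dram = light_metres(13 + 250)  # same, but the read misses to DDR4: ~15.8 m
```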
Are there any directives to the Operating System to say - “here keep this data in the fastest accessible L[1,2,3] please”?
It's interesting to consider this paired against how technologically primitive we ostensibly must be, given that digital computers didn't even exist 90 years ago.
So the processor in my hand can compute a multiplication faster than light can cross the room?
2. But, there is a real limit to the speed of a particular piece of code. You can try finding it with a roofline model, for example. This post didn't do that. So we don't know if 201ms is good for this benchmark. It could still be very slow.
I don't use a high-end laptop, and I'm not eager to upgrade, because I can relate to the average user of the software I develop. I've seen plenty of popular web apps feel really sluggish.
Thank you so so much. It's insane how it feels like the speed of much of our software hasn't improved, or has even regressed, despite the gigantic advancements made over the years. People really don't seem to care about this.
I had an argument about it with a senior colleague regarding some industry software. He figured it wasn't worthwhile to improve the speed of some table fetching and calculations that people actually had to wait on since it would only amount to a bit more than a second or so on top of the regular slowness of it all.
A second multiplied across at least 20 PCs, each going through it at least 100 times a day, more than 260 days each year, over at least 10 years so far. It turns out more than 5 million seconds is a lot of man-hours, which, whilst cheaper than ours, amount to many times what it would have taken to fix it.
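The arithmetic checks out; a quick sketch using the figures from the comment:

```python
# One extra second, 20 PCs, ~100 runs a day, ~260 working days a year, 10 years.
wasted_seconds = 1 * 20 * 100 * 260 * 10
wasted_hours = wasted_seconds / 3600
workdays = wasted_hours / 8  # eight-hour days

# 5,200,000 seconds, roughly 1,444 hours, roughly 180 full workdays.
```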
It's much easier to reason about state updates if all you have is pure functions. It allows you to avoid very annoying and hard-to-catch bugs. I've seen this personally, when replacing a spaghetti component with a straightforward `useReducer` hook.
Unfortunately, we don't really have a performant way to express this pattern in JS (or even in other languages?). You could use something like elm-lang, but it's not as widespread.
So from your post it follows that if a developer can reason about the state changes of their app without redux, they should do so if there are performance concerns. Right?
I say this as a webdev who has written pure vanilla Js SPAs a decade ago, and someone who often uses Redux now on most projects today. So I know it’s totally possible to have performant mutable state management on a project that isn’t a mess - that’s how we always did stuff before redux.
For the cases when it's not, use memo.
Developer time is spent once. Users will always have to pay the price of additional run time. For. Each. Single. User. Always.
It scales!
Due to the scale of, e.g., slow front-ends with millions of users, this wastes a HUGE amount of time, only to save the few hours or days it would have taken to develop it better.
Having 1 million users each wait a single second is already 11 days. If they have to wait that single second for each interaction, it quickly adds up.
It is also bad for the environment due to scaled up inefficiency and resulting increase of power usage.
Don't get me wrong, pandas is a nice library ... but the odd thing is, numpy already has, like, 99% of that functionality built in in the form of structured arrays and records, is super-optimised under the hood, and it's just that nobody uses it or knows anything about it. Most people will have never heard of it.
To me pandas seems to be the sort of library that became popular because it mimics the interface of a popular library from another language that people wanted to migrate from (namely dataframes from R), but that's about it.
Compounding this, it is now becoming the effective library for doing these things, even if backwards, because the network effect means that people are building stuff on top of pandas rather than on top of numpy.
The only times I've had to use pandas in my personal projects was either:
a) when I needed a library that 'used pandas rather than numpy' to hijack a function I couldn't be bothered to write myself (most recently seaborn heatmaps, and exponentially weighted averages; both relatively trivial things to do with pure numpy, and probably faster, but, eh. Leftpad mentality etc ...)
b) when I knew I'd have to share the code with people who would then be looking for the pandas stuff.
I'm probably wrong, but ...
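For readers who have never seen them, numpy's structured arrays look like this (a minimal sketch; the field names and data are made up):

```python
import numpy as np

# A structured array: named, typed columns in one contiguous block of memory.
people = np.array(
    [("alice", 30, 55.0), ("bob", 25, 80.5), ("carol", 35, 68.2)],
    dtype=[("name", "U10"), ("age", "i4"), ("weight", "f8")],
)

# Column access by field name, much like a dataframe column.
ages = people["age"]

# Boolean-mask filtering, entirely within numpy.
over_28 = people[people["age"] > 28]

# Sorting by a named field.
by_weight = np.sort(people, order="weight")
```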
Respectfully, this is pretty wrong. Pandas does vastly more out of the box than numpy. Off the top of my head: I/O from over a dozen data formats, joins/merges, SQL queries directly to dataframes, SQL-like queries on dataframes, index slicing by time, multi-indexes, much more ergonomic grouping/aggregation functions, ergonomic wrappers around common graphing use-cases, rolling windows.
I'm not even really a power user of it, so there are probably a zillion more things it does that numpy can't do out of the box, and I don't wanna spend time writing and validating implementations that already exist.
Because I needed these operations, I wanted to work with Numpy directly, and didn’t want to write custom implementations each time, I created a library to do it. It also has constructor methods for Python Dicts, any kind of Iterable, CSV, SQL query, pandas DataFrames and Series, or otherwise. As well as destructor methods to generate whatever you need when done. It tries its best to maintain the types you specify, and offers a means to cast as easily as possible. All functions return a single type to allow static type checking. And for performance, there is a “trust me I know what I’m doing” mode for extremely fast access to the data which achieves about a 10x speed up by skipping all data validation steps.
Everything it does outperforms pandas, except for the joins. It does allow inequality joins and multiple join conditions, but the general solution used isn't very fast. Anyone reading this who would be interested in improving this component would be welcome to contribute!
It gets even better when people start switching between percentages and "percentage points" referring to a measure that's in percentages originally.
Unfortunately, most of those things are easier to communicate and harder to get wrong if you speak in a more natural way. This is why "twice as fast" or "2.1x faster" is much clearer and can't go past zero :)
Similarly, I think it'd help to switch back from percentages to actual factors (119% = 1.19), and saying "we reduced the time for the computation by 1.19 of original time" would clearly show what's wrong (and saying "by 1.19x" would signal how it's a small reduction, so it's wrong as well).
Finally, I am 94.8% certain people will keep using percentages even where inappropriate, and with too much precision too!
It's about language use and what those per-cents (per-hundredths) are of.
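A tiny pair of hypothetical helpers (the function names are mine, not from the thread) makes the distinction concrete:

```python
def speedup_factor(old_time, new_time):
    """Factor style: '2.1x faster'. Unbounded and hard to misread."""
    return old_time / new_time

def percent_reduction(old_time, new_time):
    """Percentage style: capped at 100%, and easy to confuse with
    'percent faster', which is a different number entirely."""
    return (old_time - new_time) / old_time * 100

# A run dropping from 420 ms to 200 ms is "2.1x faster",
# or equivalently "takes 52.4% less time".
```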
Of course multiply this by the sheer number of calculations and even that little misprediction results in huge differences. The reality is actually quite sobering: a computer mostly calculates the same thing over and over.
I have a hard time using (pure) Python anymore for any task where speed is even remotely a consideration. Not only is it slow even at the best of times, but so many of its features beg you to slow down even more without thinking about it.
Related anecdote: My blog used to be written using Jekyll with Pygments for syntax highlighting. As the number of posts increased, builds got slower and slower. Eventually, it took about 20 seconds to refresh a simple text change in a single blog post.
I eventually decided to just write my own damn blog engine completely from scratch in Dart. Wrote my own template language, build graph, and syntax highlighter. By having a smart build system that knew which pages actually needed to be regenerated based on what data actually changed, I hoped to get very fast incremental rebuilds in the common case where only text inside a single post had changed.
Before I got the incremental rebuild system working, I worked on getting it to just do a full build of the entire blog: every post page, pages for each tag, date archives, and RSS support. I diffed it against the old blog to ensure it produced the same output.
Once I got that working... I realized I didn't even need to implement incremental rebuilds. It could build the entire blog and every single post from scratch in less than a second.
I don't know how people tolerate slow frameworks and build systems.
I've also worked in Python shops for the entirety of my career. There are a lot of Python programmers who don't have experience with and thus can't quite believe how much faster many other languages are (100X-1000X sounds fast in the abstract, but it's really, really fast). I've seen engineering months spent trying to get a CPU-bound endpoint to finish reliably in under 60s (yes, we tried all of the "rewrite the hot path in X" things), while a naive Go implementation completed in hundreds of milliseconds.
Starting a project in Python is a great way to paint yourself into a corner (unless you have 100% certainty that Python [and "rewrite hot path in X"] can handle every performance requirement your project will ever have). Yeah, 3.11 is going to get a bit faster, but other languages are 100-1000X faster--too little, too late.
Conversely when 99.9% of the software you use in your daily life is blazing fast C / C++, having to do anything in other stacks is a complete exercise in frustration, it feels like going back a few decades in time
And a product which is designed inefficiently where the engineer has figured out clever ways to get it to be more performant is most likely a product that is more complicated under the hood than it would be if performance were a design goal in the first place.
Also those languages show you don't actually have to give up modern features or even that much convenience in order to get blazing fast speeds.
At all my recent jobs, I grow frustrated with how slow running a single unit test is locally on a codebase. We are talking 5+ seconds for even the most trivial of trivial unit tests (say, purely functional arithmetic unit test).
And this is even with dynamic languages like Python (you see pytest reporting that your unit test completed in 0.00s, while wall time is 7s).
And then I get grumpy if they don't let me go and fix it because I am the only one who is that annoyed with this :D
Well, it felt slower after the "upgrade". Clicking the start menu and opening something like the Downloads or Documents folder was basically instant before. Now, with Windows 10 and the new SSD there was a noticeable delay when opening and browsing folders.
It really made me wonder how it would be running something like Windows 98 and websites of the past on modern hardware.
It's probable the old Windows 7 install was 32-bit, while your fresh install of 10 would have defaulted to 64-bit. That, combined with 10's naturally higher memory requirements, means the system has less headroom to work with.
This should involve absolutely zero disk reads or anything of the sort; it's a window that runs a command. And it used to work reliably in past years. It feels like keyboard input simply isn't buffered like it used to be. Calculator is even worse, as it loses input if you start typing the formula too soon. It used to be very easy for casual calculations; now I have to wait for the computer.
A lot of things remained slow though.
I still remember how fast console based computing, an old gameboy or a 90's macintosh would be - click a button and stuff would show up instantly.
There was a tactility present with computers that's gone today.
Today everything feels sluggish. Just writing this comment on my $3000 MacBook Pro, I can feel the latency; sometimes there are even small pauses. A little when I write stuff, a lot when I drag windows.
Hopefully the tech industry's focus on 100Hz+ screens will bring more attention to latency from click to screen, now that resolution and interface graphics in general are close to biological limits.
I'm asking because I've been thinking of getting a MacBook Air in the future with the intent to use it for writing.
Come again? I think anything beyond 60hz still qualifies as niche. Vendors are still selling 720p laptops.
I'd also prefer the sluggishness gone if I had my choice between the two.
What do those tools even do for that long? They can read enough data from the disk to overflow my computer's main memory a few times during it.
By the way, Apple isn't much better. Xcode takes around 15 seconds to launch on an M1 Max.
edit: probably this video https://youtu.be/j_4iTovYJtc?t=282
But yeah. I agree. Why does Lightroom take forever to load, when I can query its backing SQLite in no time at all?
And that's not even mentioning the RAM elephant in the room: chrome.
Younglings today don't understand what a mindbogglingly large amount of data a GB is.
But here's the thing: it's cheaper to waste thousands of CPU cores on bad performance than to have an engineer spend a day optimizing it.
The result is usually one CPU core running at 40% with sporadic disk access while you stare at Loading progress bar.
Nobody has created a language that is both thousands of times faster than Python and nearly as straightforward to learn and to use. The closest thing I know of might be Julia, but that has its own performance problems and is tied closely to its AI/ML niche. Even within that niche I'm certainly not going to get most data scientists to write their code in C or C++ (or heaven forbid Rust) to solve a performance impediment that they've generally been able to work around.
It's great that you've been able to switch to higher-performance languages, but not everyone can do that easily enough to make it worth doing.
Some data scientists I know like (or even love) Scala, but that tends to blow up once it's handed over to the data engineers as Scala supports too many paradigms and just a couple DSs will probably manage to find all of them in one program.
We use Go extensively for other things, and most data scientists I've worked with sketching ideas in Go liked it a lot, but the library support just isn't there, and it's not really a priority for any of the big players who are all committed to Python wrapper + C/C++/GPU core, or stock Java stacks. (The performance also isn't quite there yet compared to the top C and C++ libraries, but it's improving.)
Not Python-based, but Lua-based is Nelua [1]
If you like Lua's syntax, LISP's metaprogramming abilities, and C's performance, well there you have it!
Python allows you to program as if you’re a jazz pianist. You can improvise, iterate and have fun.
And when you've found a solution, you just refactor it and use numba. Boom, it runs at the same speed as a compiled language.
I once wrote one little program that ran in 24 min without numba and ca. 8 seconds with numba.
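A minimal sketch of that workflow, assuming numba is installed (with a no-op fallback so the code still runs without it); the function is my own toy example, not the commenter's program:

```python
# numba's @njit compiles a numeric Python loop to machine code on first call.
try:
    from numba import njit
except ImportError:
    def njit(f):  # fallback: run as plain Python if numba is absent
        return f

@njit
def sum_of_squares(n):
    total = 0.0
    for i in range(n):
        total += i * i
    return total

# Same source either way; with numba the loop typically runs orders of
# magnitude faster than the pure-Python interpreter.
result = sum_of_squares(1000)
```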
At least with Chrome's V8, the difference is not that big.
Sure, it loses to C/C++, because it can't vectorize and uses orders of magnitude more memory, but at least in the Computer Language Benchmarks Game it's "just" 2-4x slower.
I remember getting a faster program doing large matrix multiplication in JavaScript than in C with -O1, because V8 figured out that I was reading from and writing to the same cell and optimised that out, which gave it an edge, because in both cases memory bandwidth limited the speed of execution.
As for Electron and the like: half of the reason why they're slow is that document reflows are not minimized, so the underlying view engine works really, really hard to re-render the same thing over and over again.
It's not nearly as visible in web apps, because these in turn are often slowed down by the HTTP connection limit (hardcoded to six in most browsers).
For as many orders of magnitude as I am talking about, you have to be screwing up algorithms, networks, and a whole bunch of other things too.
Python and similar languages like Ruby really do make it easy to accidentally pile things on top of each other, but you can screw up in pure assembler with enough work put into it. Assembler doesn't stop you from being accidentally quadratic or using networks in a silly way.
For most tasks, modern mid-level statically typed languages like C#, Go, Kotlin really are the sweet spot for productivity. Languages like Python, Ruby and JS are a false economy that appear more productive than they really are.
IOW you lose both. It's not a huge size either.
How do you know whether or not speed is a consideration?
Yes, OP delivered impressive efficiency gains. I'm sure he could improve the efficiency even more by dropping into pure Assembly.
But is it worth it?
The prime consideration is not execution speed but maintainability. The further that OP got away from pure Python, the more difficult to maintain the code became. That's a downside.
Now, OP describes an important technique because in the real world, you have a performance budget. Code needs to execute at speeds that return quickly enough to the user, or long execution is financially expensive (i.e. cloud computing resources), etc. But optimizing beyond what the budget requires is wasteful in terms of time needed to do the optimization as well as harmful in terms of negatively impacting future maintainability.
Why? And how did you measure this drop in maintainability? I'm asking because I see developers prioritize _perceived_ maintainability over _measurable_ things that matter to the user (like performance).
Most of the time it's not that you need a faster language, it's that you need to write faster code. I was working on a problem recently where random.choices was slow but I realized that due to the structure of my problem I could convert it to numpy and get a 100X speedup.
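A hypothetical reconstruction of that kind of conversion (not the commenter's actual code), assuming numpy is available: repeated single draws via `random.choices` versus one vectorized batch draw.

```python
import random
import numpy as np

population = np.arange(1000)
weights = np.linspace(1, 2, 1000)
p = weights / weights.sum()  # numpy wants normalized probabilities

# Pure-Python style: one weighted draw per call; the per-call overhead
# dominates when you need many samples.
slow_draws = [random.choices(population, weights=weights, k=1)[0]
              for _ in range(100)]

# NumPy style: one call draws the whole batch from the same distribution.
rng = np.random.default_rng(0)
fast_draws = rng.choice(population, size=100, p=p)
```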
As a fun exercise this year I've been doing Advent of Code 2020 in C, and my god it's crazy how much faster my solutions seem to execute. These are just little toy problems, but even still the speed difference is night and day.
Although, I still find Python much easier to read and maintain, but that may just be I'm more experienced with the language.
Python is definitely easier to read and maintain if you have loads of dependencies. C dependency management is a pain.
If you can read and write a little C, you should consider giving C#/Java/Kotlin/Swift a try. They're probably an order of magnitude slower than C if you write them in a maintainable style, but they're still much faster than Python. If you're doing stuff like web APIs then ASP.NET/Spring will perform very admirably without manually optimizing code, for example. You might find that these languages are C-like enough to understand and Python-like enough to be productive in. Or you might not, but it's worth a shot!
I personally believe that C is difficult if not impossible to properly maintain long term, at least compared to the faster alternatives. On the other hand, my experience with Python is that it's one of the slowest mainstream languages out there, relying heavily on C libraries to get acceptable performance.
If there isn't a compiler in the box (JIT or AOT), I won't be using language XYZ, unless forced by customers.
The only reason I use Python is for UNIX scripting.
This kind of blanket comment that "scripting languages are too slow" makes it sound like you shouldn't use them for anything, but they are perfectly adequate for many tasks. I'm more likely to have network and DB slowdowns than problems with scripting languages.
So you don't pretty much ever need to reinvent or even use a HackerRank algorithm; you need to understand that the database compute instance has a fast CPU and lots of RAM too.
I wonder what would be the software engineering landscape today if hardware specs were growing like 10% per year...
Nowadays you need vector operations, you need to utilise GPU, you need to utilise various accelerators. For me it is black magic.
Another perspective on premature opt: When my software tool is used for an hour in the middle of a 20-day data pipeline, most optimization becomes negligible unless it's saving time on the scale of hours. And even then, some of my coworkers just shrug and run the job over the weekend.
The pure C++ version is so fast, it finishes before you even start it!
I once owned a small business server with a Xeon processor, Linux installed. Just for kicks I wrote a C program that would loop over many thousands of files, read their content, sort in memory, dump into a single output file.
I ran the program, and as soon as I ran it, it was done. I kept upping the scope and load, but it seemed I could throw anything at it and the response time was zero, or something perceived as zero.
Meanwhile, it's 2022 and we can't even have a text editor place a character on screen without noticeable lag.
Shit performance is even ingrained in our culture. When you have a web shop with a "submit order" button, if you clicked it and it instantly said "thanks for your order", people are going to call you. They'd wonder if the order got through.
Habit is a very powerful force.
Performance is somewhat abstract, as in "just throw more CPUs at it" / it works for me (on my top of the line PC). But people will happily keep on using unergonomic tools just because they've always done so.
I work for a shop that's mainly Windows (but I'm a Linux guy). I won't even get into how annoying the OS is and how unnecessary, since we're mostly using web apps through Chrome. But pretty much all my colleagues have no issue with using VNC for remote administration of computers.
It's so painful, it hurts to see them do it. And for some reason, they absolutely refuse to use RDP (I'm talking about local connections, over a controlled network). And they don't particularly need to see what the user in front of the computer is seeing, they just need to see that some random app starts or something.
I won't even get into Windows Remote Management and controlling those systems from the comfort of their local terminal with 0 lag.
But for some reason, "we've always done it this way" is stronger than the inconvenience through which they have to suffer every day.
If you stick to only doing arithmetic and avoid making lots of small objects, javascript engines are pretty fast (really!). The tricky part with doing performance-sensitive work in JS is that it’s hard to reason about the intricacies of JITs and differences between implementations and sometimes subtle mistakes will dramatically bonk performance, but it’s not impossible to be fast.
People building giant towers of indirection and never bothering to profile them is what slows the code down, not running in JS per se.
JS, like other high-level languages, offers convenient features that encourage authors to focus on code clarity and concision by building abstractions out of abstractions out of abstractions, whereas performance is best with simple for loops working over pre-allocated arrays.
I would like to see all of the actual code he omitted, because I am skeptical how that would happen. It's been a while since I've used pandas for anything, but it should be pretty fast. The only thing I can think is he was maybe trying to run an apply on a column where the function was something doing Python string processing, or possibly the groupby is on something that isn't a categorical variable and needs to be converted on the fly.
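For illustration, here is the kind of pattern that produces that slowdown, assuming pandas is installed; the data and column names are made up:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# Slow pattern: apply calls back into Python for every group.
slow = df.groupby("key")["val"].apply(lambda s: s.sum())

# Fast pattern: the built-in aggregation stays on pandas' compiled code path.
fast = df.groupby("key")["val"].sum()
```

The same split shows up with string-processing lambdas in `apply`, which is one plausible reading of where the article's pandas version lost its time.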
Though I will never understand webpages that use more code than you'd reasonably need to implement a performant Lisp compiler and build the webpage in that (not that I'm saying that's what they should have done; I just don't understand how they use more code).
Implementations of languages like javascript, ruby - and I would presume python and php - are a lot faster than they used to be.
I think most slowness is architectural.
It's hilarious how quickly things work these days if you just use the 90s-era APIs.
It's also fun to play with ControlSpy++ and see the dozens, maybe hundreds, of messages that your Win32 windows receive, and imagine all the function calls that occur in a short period of time (e.g. moving your mouse cursor over a button and moving it around a bit).
Think of a mobile game that could last 8 hours instead of 2 if it wasn't doing unnecessary linear searches on a timer in JavaScript.
By the next morning, I'd found it was doing an O(n^2) operation that, while probably sensible when the app had first been released, was now totally unnecessary and which I could safely remove. That alone reduced the 20 minutes to 200 milliseconds.
(And this is despite that coworker repeatedly emphasising the importance of making the phone battery last as long as possible).
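An illustrative, made-up example of the same class of bug: a list membership scan inside a loop is O(n^2), while the identical loop over a set is O(n).

```python
def dedupe_quadratic(items):
    """Accidentally O(n^2): each `in` check scans the whole list."""
    seen, out = [], []
    for x in items:
        if x not in seen:   # O(n) scan per element
            seen.append(x)
            out.append(x)
    return out

def dedupe_linear(items):
    """Same logic, O(n): set membership is O(1) on average."""
    seen, out = set(), []
    for x in items:
        if x not in seen:   # O(1) average lookup
            seen.add(x)
            out.append(x)
    return out
```

On a few hundred items both are instant, which is why this kind of thing survives in shipped apps until the data grows and 200 ms becomes 20 minutes.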
Nim should be part of the conversation.
Typically, people trade slower compute time for faster development time.
With Nim, you don't need to make that trade-off. It allows you to develop in a high-level language but get C-like performance.
I'm surprised it's not more widely used.
It's a ~community language without the backing of an 800lb gorilla to offer up both financial and cheerleading support.
I love the idea of Nim, but it is in a real chicken-and-egg problem where it is hard for me to dedicate time to a language I fear will never reach a critical mass.
And on a slightly ranty note, Apple's A12z and A14 are still apparently "too weak" to run multiple windows simultaneously :)
https://appleinsider.com/articles/22/06/11/stage-manager-for...
"The function looks something like this:"
And then he shows some grouping and sorting functions using pandas. Then he says:
"I replaced Pandas with simple python lists and implemented the algorithm manually to do the group-by and sort."
I think the point of the first optimization is that you can do the relatively expensive group/sort operations without pandas and improve performance. For the rest of the article it's just "algorithm_wizardry", which no longer deals with that portion of the code.

For all this decreased performance, what new features do we have to show for it? Oh great, I can search my Start menu and my taskbar has had a shiny gradient for a decade.
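The article's code isn't reproduced here, but a hedged sketch of that kind of replacement (plain lists plus `itertools.groupby` instead of a pandas group-by/sort) might look like this, with made-up sample rows:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-in for the real data: (key, value) rows.
rows = [("b", 3), ("a", 1), ("b", 1), ("a", 2)]

# groupby only merges *adjacent* equal keys, so sort by key first.
rows.sort(key=itemgetter(0))
grouped = {key: sorted(value for _, value in group)
           for key, group in groupby(rows, key=itemgetter(0))}
print(grouped)  # {'a': [1, 2], 'b': [1, 3]}
```

For small and mid-sized inputs this avoids pandas' per-call overhead, which is plausibly where that first win came from.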
My usual 1-to-1 translations result in C++ taking 1-5% of the Python execution time, even on combinatorial stuff.
+----------------------------------------------------+
| People really do love Python to death, don't they? |
+----------------------------------------------------+
I find that extremely weird. As a bystander who never relied on Python for anything important, and as a person who regularly had to wrestle with it and tried to use it several times: the language is non-intuitive in terms of syntax, ecosystem, package management, language-version management, probably 10+ ways to install dependencies by now, a subpar standard library, and an absolute cosmic-scale Wild West state of things in general. Not to mention people keep making command-line tools with it, ignoring the fact that it often takes 0.3 seconds just to boot.

Why a programmer who wants semi-predictable productivity would choose Python today (or even 10 years ago) remains a mystery to me. (Example: I don't like Go that much, but it seems to do everything Python does, and better.)
Can somebody chime in and give me something better than "I got taught Python in university and never moved on since" or "it pays the bills and I don't want to learn more"?
And please don't give me the fabled "Python is good, you are just biased" crap. Python is, technically and factually and objectively, not that good at all. There are languages out there that do everything that it does much better, and some are pretty popular too (Go, Nim).
I suppose it's the well-trodden path on integrating with pandas and numpy?
Or is it a collective delusion and a self-feeding cycle of "we only ever hired for Python" from companies and "professors teach Python because it's all they know" from universities? Perhaps this is the most plausible explanation: inertia. Maybe people just want to believe because they are scared they'd have to learn something else.
I am interested in what people think about why Python is popular despite a lot of objective evidence that, as a tech, it's not impressive at all.
This talk by the creator of micropython [0] gives his reasoning for why to implement python on microcontrollers despite it being hundreds of times slower than C. Starts @ 3:00
- it has nice features like list comprehension, generators, and good exception handling
- it has a big, friendly, helpful community with lots of online learning resources
- it has a shallow but long learning curve. It's easy to get started as a beginner, but you never get bored of the language, there's always more advanced features to learn.
- it has native bitwise operations
- it has a good distinction between ints and floats, and ints are arbitrary precision: you're not restricted to long longs or any fixed width. (I'll add that built-in complex numbers are a plus)
- compiled language, so it can be optimized to improve performance
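A few of the bullets above in action (list comprehensions, generators, arbitrary-precision ints, built-in complex numbers):

```python
# List comprehension, then a generator that lazily filters it.
squares = [n * n for n in range(10)]
even_squares = (s for s in squares if s % 2 == 0)
print(sum(even_squares))  # 0 + 4 + 16 + 36 + 64 = 120

# Ints are arbitrary precision: this doesn't overflow.
print(2 ** 100)  # 1267650600228229401496703205376

# Complex numbers are built in.
print((1 + 2j) * (1 - 2j))  # (5+0j)
```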
Emotionally though (once you have the environment set up), it’s just such a breeze to write it. It’s like executable pseudo code with zero boilerplate. You can focus purely on the algorithms and business logic. Compared to many other languages the line count is often 50-80%, even if you include type annotations! This doesn’t only apply to plain imperative code, using the dynamic features you can also turn it into your own DSL where needed.
Then there is obviously the huge ecosystem around it: there is not a single service, file format or database that doesn't have a good Python library for it. While Go might have equally wide library choices, I wouldn't be so sure about Nim; Go, on the other hand, has a lot of other WTFs, even though it provides a lot of good fresh tech.
Would I use it for a big service with potentially lots of performance requirements? No. But there is no doubt why it's so popular. For many applications where the outcome of the program is more important than the performance or environment, like glue code, simple intranet applications or exploratory coding, it is still the perfect choice. You also have to consider what it is replacing; often the alternative would be even worse: bash scripts, Excel or Matlab.
Another way to put it is that it’s a very good Swiss Army knife that is good at everything but not best at anything.
Software/System Developers using 'good enough' stacks/solutions are externalising costs for their own benefit.
Making those externalities transparent will drive a lot of the transformation needed.
You could have had those discussions at any time since upgraded computers and microprocessors became compatible with the previous generation (i.e. the x86 and PC lines).
The point is that software efficiency measurement has never changed: it is human patience. The developers and their bosses decide the user can wait a reasonable time for the provided service. It is one-to-five seconds for non-real-time applications, it is often about a target framerate or refresh in 3D or real-time applications... The optimization stops when the target is met with current hardware, no matter how powerful it is.
This measure drives the use of programming languages, libraries, data load... all getting heavier and heavier when more processing power gets available. And that will probably never change.
Not sure about it? Just open your browser debugger on the Network tab and load the Google homepage (a field, a logo and 2 buttons). I just did: 2.2 MB, loaded in 2 seconds. It is sized for current hardware and 100 Mbps fiber, not for the actually provided service!
Using Pandas in production might make sense if your production system only has a few users. Who cares if 3 people have to wait 20 minutes 4 times a year? But if you're public facing and speed equals user retention then no way can you be that slow.
Almost always yes, because software is almost always used many more times than it is written. Even if you doubled your dev time to only get a 5% increase of speed at runtime, that's usually worth it!
(Of course, capitalism is really bad at dealing with externalities and it makes our society that much worse. But that's an argument against capitalism, not an argument against optimization.)
No. O3 is fine. -ffast-math is dangerous.
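One concrete reason -ffast-math is dangerous: among other things it licenses the compiler to reassociate floating-point arithmetic (and to assume no NaNs or infinities), but FP addition is not associative. The effect is easy to demonstrate in any language:

```python
a, b, c = 1e16, -1e16, 1.0

# Evaluated left to right, the huge terms cancel before the 1.0 is added.
print((a + b) + c)  # 1.0

# Reassociated, as -ffast-math would permit a C compiler to do:
# 1.0 is smaller than one ulp of 1e16, so it vanishes in rounding.
print(a + (b + c))  # 0.0
```

Same expression, two different answers; -O3 never makes that trade on its own.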
https://codegolf.stackexchange.com/questions/215216/high-thr...
An optimized assembler implementation is 500 times faster than a naive Python implementation.
By the way, it is still missing a Javascript entry!
Then rewrite it in a more performant language, or with Cython hooks.
Developing features quickly is greatly aided by nice tools like Python and Pandas. And these tools make it easy to drop into something better when needed.
Eat your cake and have it too!
I doubt the author's C++ implementations beat BLAS/LAPACK, but since they're not shown I can only guess.
I've done stuff like this before but the tooling is really no fun, somewhere between 2 and 3 I'd just write it all in C++.
Changing the interface just to get parallelism out seems not great - give it to the user for free if the array is long enough - but maybe it was more reasonable for the non-trivial real problem.
Note that I'm not saying that their second version of the code wasn't faster, just that this has nothing to do with python vs. pandas.
This is for normal computer tasks-- browser, desktop applications, UI. The exception to this seem to be tasks that were previously bottlenecked by HDD speeds which have been much improved by solid state disks.
It amazes me, for example, that keeping a dozen miscellaneous tabs open in Chrome will eat roughly the same amount of idling CPU time as a dozen tabs did a decade ago, while RAM usage is 5-10x higher.
/s
Sorry for the rude sarcasm, but isn't this a post truly just about the efficiency pitfalls of Python? (or any language / framework choice for that matter)
Of course modern computers are lightning fast. The overhead of every language, framework, and tool will add significant additional compute however, reducing this lightning speed more and more with each complex abstraction level.
I don't know, I guess I'm just surprised this post is so popular, this stuff seems quite obvious.
For instance running unoptimised code can eat a lot of energy unnecessarily, which has an impact on carbon footprint.
Do you think we are going to see regulation in this area akin to car emission bands?
Even to an extent that some algorithms would be illegal to use when there are more optimal ways to perform a task? Like using BubbleSort when QuickSort would perform much better.
it has thankfully started: https://www.blauer-engel.de/en/productworld/resources-and-en...
I think KDE's Okular has been one of the first certified software :-)
To some extent they can claim to deliver a unique feature where there is no replacement for the algorithm they are using.
I overheard this quote recently: 'I'd rather have today's algorithms on an old computer, than a new computer with old algorithms'
I agree, though. I used these tricks a lot in scientific computing; go out into the wider world and people are just unaware. That said, there is a cost to introducing those tricks: your team has to learn new tools and techniques, you have to maintain the build process across different operating systems, etc. Python extension modules on Windows, for example, are still a PITA if you're not able to use Conda.
[0] -- https://www.pola.rs/
As an example, with ILP of ~4 instructions/cycle at 5 GHz we get 20 billion instructions executed each second on a single core. That number is not really tangible, but it is shocking.
Nothing really happened in the end, but it's a funny story around the office.
[…]
Took ~8 seconds to do 1000 calls. Not good at all :(
Isn’t that 8ms per call, way faster than the target performance? Or should that “500ms” be “*500 μs”?
no surprise pandas was "slow"
Believe me I do. This is why my backends are single file native C++ with no Docker/VM/etc. The performance on decent hardware (dedicated servers rented from OVH/Hetzner/Selfhost) is nothing short of amazing.
Every cloud / SaaS is throwing free tier compute capacity at people and it’s just overwhelming (in a good way I suppose)
It could be a bit overkill, but whenever I'm writing code, on top of optimizing data structures and memory allocations I always try to minimize the use of if statements to reduce the chance of branch mispredictions. Seeing woefully unoptimized Python code being used in a production environment just breaks my heart.
That is not to say aiming for generally unbranchy code is not a good thing; it often implies well-designed code and well-chosen data structures anyway.
>double score_array[]
E.g.: call a "ping" function that does no computation using different styles.
- In-process function call.
- In-process virtual ("abstract") function call.
- Cross-process RPC call in the same operating system.
- Cross-VM call on the same box (2 VMs on the same host).
- Remote call across a network switch.
- Remote call across a firewall and a load balancer.
- Remote call across the above, but with HTTPS and JSON encoding.
- Same as above, but across Availability Zones.
In my tests these scenarios have a performance range of about 1 million from the fastest to slowest. Languages like C++ and Rust will inline most local calls, but even when that's not possible overhead is typically less than 10 CPU clocks, or about 3 nanoseconds. Remote calls in the typical case start at around 1.5 milliseconds and HTTPS+JSON and intermediate hops like firewalls or layer-7 load balancers can blow this out to 3+ milliseconds surprisingly easily.
To put it another way, a synchronous/sequential stream of remote RPC calls in the typical case can only provide about 300-600 calls per second to a function that does nothing. Performance only goes downhill from here if the function does more work, or calls other remote functions.
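The arithmetic behind that ceiling, using the latencies quoted above:

```python
# A blocking caller can issue at most one RPC per round trip,
# so round-trip latency directly caps sequential throughput.
for rtt_ms in (1.5, 3.0):
    print(f"{rtt_ms} ms round trip -> {1000 / rtt_ms:.0f} sequential calls/sec")
```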
Yet, every enterprise architecture you will ever see, without exception has layers and layers, hop upon hop, and everything is HTTPS and JSON as far as the eye can see.
I see K8s architectures growing side-cars, envoys, and proxies like mushrooms, and then having all of that go across external L7 proxies ("ingress"), multiple firewall hops, web application firewalls, etc...
If you provide an end result response from your web app to a user's browser in 50ms-100ms (before external latency) then things like 200 microseconds vs 4 milliseconds have less of a meaningful difference. If your app makes a couple of internal service calls (over HTTP inside of the same Kubernetes cluster) it's not breaking the bank in terms of performance even if you're using "slow" frameworks like Rails and get a few million requests a month.
I'm not defending microservices and using Kubernetes for everything but I could see how people don't end up choosing raw performance over everything. Personally my preference is to keep things as a monolith until you can't and in a lot of cases the time never comes to break it up for a large class of web apps. I also really like the idea of getting performance wins when I can (creating good indexes, caching as needed, going the extra mile to ensure a hot code path is efficient, generally avoiding slow things when I have a hunch it'll be slow, etc.) but I wouldn't choose a different language based only on execution speed for most of the web apps I build.
This is a provocative framing but I'm not sure it makes sense. Functions aren't resources; they don't have throughput or utilization. It would be bad if a core could only call the function 300-600 times per second, but that is why we have async programming models, lightweight threads, etc. So that the core can do other stuff during the waiting-on-IO slices of the timeline. Which, as you mention, dominate.
It would also be bad if a user had to wait on 300-600 sequential RPCs to get back a single request, but like... don't do that. Remote endpoints are not for use in tight loops. There are cases where pathological architectures lead to ridiculous fanout/amplification, but even then we are usually talking about parallel tasks.
There is overhead to doing things remotely vs. locally. But the waiting isn't the interesting part. It's serialization, deserialization, copying, tracking which tasks are waiting, etc. A lot of performance work goes on around these topics! Compact and efficient binary wire protocols, zero-copy network stacks, epoll, green threads, async function coloring schemes, etc. The upshot of this work is also, as is typical in web/enterprise backend world, not so much about the latency of individual requests (those are usually simple) but about the number of concurrent requests/users you can serve from a given hardware footprint. That is normally what we're optimizing for. It's a different set of constraints vs. few but individually expensive computations. So of course the solution space looks different too.
Granted, this is exacerbated when architectures don't make a good division between control/compute/data planes.
Control plane, which is exposed to users, should almost certainly be limited to a single (or handful, at most) microservice calls. Preferably to the fastest storage mechanism that you have, such that what latency it does add is minimized entirely.
Converted the list to a sorted array and performed a simple binary search to find it.
A basic Python script could handle about 4,000 records a second.
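A hedged sketch of that approach (the actual script isn't shown): sort once, then use `bisect` for O(log n) lookups instead of a linear scan per record:

```python
import bisect

# Hypothetical data: the reference list records are matched against.
reference = sorted("id_%06d" % i for i in range(100_000))

def contains(key):
    # Binary search: O(log n) per lookup after the one-time sort.
    i = bisect.bisect_left(reference, key)
    return i < len(reference) and reference[i] == key

print(contains("id_012345"))  # True
print(contains("missing"))    # False
```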
Corporate IT reached out to Oracle, which built a custom solution that probably cost a couple hundred thousand.
They tried to force us to use it. They were a little upset when I asked if they could up the performance by a few thousand percent.
I was on their shit list after that until I left.
And then your actual function starts, and returns after roughly 10s.
I think you underestimate just how inefficient enterprise can be. The extra time taken in connections between layers is not even a consideration.
Should be the opposite. Overhead as a proportion of total time goes down the more useful work is involved.
Doesn't this mean it's less of a problem? Like, isn't that good?
With Nixos I switch between Gnome 40 (I do like the Gnome workflow) and i3 w/ some Xfce4 packages, but lately on my older machine the performance of Gnome (especially while running Firefox) is so sluggish in comparison that I may have switched back permanently now.