I don't think people at Microsoft are looking at Excel as a whole. They're like lost souls squatting in a mansion, building sand castles in the rooms they live in, with no relationship to the actual building and what it needs to keep from falling down.
I'm not sure what you mean by performance cliffs. Can you give an example of where and how you would better accomplish something without Power Query? Are you talking about processing data in the range of a few hundred megabytes?
It’s not going to replace the functionality of core spreadsheet-based Excel for accountants, for instance, who typically won’t have a use for Power Query because their data is structured differently.
In a spreadsheet the data is much less structured, which is where a lot of the power comes from - for instance, Power Query doesn’t easily support things like subtotals, scratch calculations, or quick financial models. It is closer to a paper ledger with calculations scribbled in the margins than to a big-data database.
Power Query is more about ingesting lots of data and cleaning it, while finance is often about working stuff out and playing with numbers to see what happens - and playing with numbers is easier in a less structured, loosely typed environment.
Accumulation patterns perform abysmally even with data in the 100Ks of elements. Say you have a table of inventory movements and want instead a snapshot table of inventory at a point in time. You can do an O(n^2) self-join of the table with itself, matching each record to all records with a lesser or equal date and summing their movements to derive the total quantity at that time.
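A minimal sketch of that self-join in M, assuming a hypothetical Movements query with [Date] and [Qty] columns (the names are illustrative, not from any real workbook):

```m
// Hypothetical Movements table with columns [Date, Qty].
// For each row, sum every movement dated on or before it: O(n^2).
let
    Source = Movements,
    Snapshot = Table.AddColumn(
        Source,
        "QtyOnHand",
        (row) => List.Sum(
            // Inner `each [Date]` is the scanned row; `row[Date]` is the outer row.
            Table.SelectRows(Source, each [Date] <= row[Date])[Qty]
        ),
        type number
    )
in
    Snapshot
```

Since each of the n rows rescans the whole table, the work is quadratic, which is why it degrades so sharply past the low 100Ks.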
If you want to use an accumulation pattern, you can sort and cast your table to a list of records and then use List.Accumulate to iterate over each list element, deriving a new field with the running total of inventory amount. If you do this, you will find that it falls right over even with 1Ks or 10Ks of records. This is because the intermediate list that you're appending to through the accumulation is itself a lazy stream, so each append re-walks the stream from the start and the fold goes quadratic. Thus, you have to use List.Buffer at each step. Even with List.Buffer at each step, this solution falls over at high 10Ks or low 100Ks of records.
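A sketch of that accumulation in M, again assuming the same hypothetical Movements query with [Date] and [Qty] columns:

```m
let
    Sorted = Table.Sort(Movements, {{"Date", Order.Ascending}}),
    Recs   = Table.ToRecords(Sorted),
    // Fold over the records, appending a running total to each one.
    WithRunning = List.Accumulate(
        Recs,
        {},
        (state, current) =>
            let
                prior = if List.IsEmpty(state) then 0
                        else List.Last(state)[RunningQty],
                next  = Record.AddField(current, "RunningQty", prior + current[Qty])
            in
                // Without List.Buffer here, `state` stays a lazy stream that is
                // re-walked on every append, and performance collapses.
                List.Buffer(state & {next})
    )
in
    Table.FromRecords(WithRunning)
```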
Incredibly unintuitively, you can use List.Generate with an already-buffered input list to derive a new list that can then be cast back to a table, though this still struggles with 100Ks of records.
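The List.Generate variant, under the same assumed Movements shape, looks something like this:

```m
let
    Sorted = Table.Sort(Movements, {{"Date", Order.Ascending}}),
    // Buffer once up front so the generator indexes an in-memory list.
    Recs   = List.Buffer(Table.ToRecords(Sorted)),
    n      = List.Count(Recs),
    Running = List.Generate(
        // Initial state; assumes at least one record exists.
        () => [i = 0, total = Recs{0}[Qty]],
        each [i] < n,
        // Next state: advance the index and extend the running total,
        // guarding the lookup so the final state doesn't index past the end.
        each [i = [i] + 1,
              total = if [i] + 1 < n then [total] + Recs{[i] + 1}[Qty] else null],
        // Selector: emit the record with its running total attached.
        each Record.AddField(Recs{[i]}, "RunningQty", [total])
    )
in
    Table.FromRecords(Running)
```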
If your snapshots can be aggregates, then you can happily throw out the idea of such an accumulation pattern and just join to a date table at the appropriate grain with all movement records less than or equal to the date in that date table.
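Sketched in M with a hypothetical Dates query holding one row per snapshot date:

```m
// One aggregate per snapshot date rather than one per movement row.
let
    Snapshots = Table.AddColumn(
        Dates,
        "QtyOnHand",
        (d) => List.Sum(
            Table.SelectRows(Movements, each [Date] <= d[Date])[Qty]
        ),
        type number
    )
in
    Snapshots
```

The work is now proportional to snapshot dates times movements rather than movements squared, which at a coarse grain is typically far smaller.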
I'll note that I regularly speak with several of the people whose blogs you will inevitably come across when performance tuning Power Query. The approaches above are the current state of the art in PQ for iteration and accumulation patterns. This is not an appeal to authority or a brag. This is to highlight the difference from the Excel spreadsheet formula approach below, which even beginners can derive from first principles.
In an Excel spreadsheet, for the same challenge, you just define a new column with a special first-row formula (say, =B2 to seed the total), with each subsequent cell referencing the row above (=C2+B3, filled down). This will happily run right up to the spreadsheet row limit with no performance concerns. If you really want, you can spill over to multiple worksheets, which is clunky to manage, but it still performs just fine and degrades slowly. The M approaches above hit a cliff and start hanging.
Excel formulas make it trivial to reference arbitrary cells. M is a nearly general-purpose language. PQ uses M, but as a framework for writing M, it has a strong emphasis on a query/table paradigm. A table-based operation model cuts against the grain of a spreadsheet, because a spreadsheet is a collection of arbitrary cells, while a tabular approach is a collection of similarly shaped records stacked one upon the other. These two paradigms have a fair amount of overlap, but they are not isomorphic: there are things trivial to express in one that become difficult, bordering on impossible, in the other.
My language is strict and statically typed. However, once arrays (tables are conceptually arrays of records) exceed a certain length, rather than being processed in-memory as arrays, they are transparently offloaded to storage and processed in a streaming fashion.
I’m surprised that this doesn’t work well in PowerQuery. I would have thought that 100K would be peanuts for it.
Mine is a SaaS, however, so the user’s laptop isn’t a constraint, and I can transparently throw a million records into BigQuery or some other data warehouse and use its aggregates if needed. Although at the 100K scale you can use SQLite, and it handles that scale of data trivially on commodity laptops.
So your experience is interesting indeed.
I'll note, as I did to a sibling reply of yours, I made observations about a specific pattern that showcases performance issues in PQ/M. PQ/M easily scales beyond 100Ks of records, but not for arbitrary processing patterns.
That's not my experience. At work the data usually isn't very large, but I have experimented on my own time with, for instance, a public COVID data file that I think was several GB.
I also thought lazy semantics were a good thing, not a fundamental flaw.
Rather than debate, I would be interested enough to spend some time on a sample problem, if you could provide one where you believe Power Query is inadequate, together with an alternate solution to serve as a benchmark for what is adequate.
These are two very different statements. I've happily used PQ to ingest GBs of data. Its streaming semantics are fine to great for some types of processing and introduce performance cliffs for others. There's no binary judgment to be made here. Laziness is neither a fundamental flaw nor an unmitigated good.
I've already shared one specific pattern above. I can share some mocked up data if you need me to, but that might be a day or two. Also, feel free to reach out via email (in my profile).