1) Make front-end calculate the next-id for each element after a sort, and call the back-end to update only required records.
2) next_id is a unique key, so if data was stale, worst case, a transaction error occurs, inform user that sort failed and undo the optimistic update, but this happens if multiple users are sorting the same list like crazy. generally, it just works. can be problematic if users are doing mass updates on a giant list. in such a case, instead of allowing free-sort, adding a user-adjustable "priority" column which allows equals would make much more sense.
3) deletes and inserts are more costly, because now they also require an extra read and update, but we never have the case of updating the whole list, and updating a record through a unique indexed key is an insignificant cost (in most cases - noted as otherwise someone would surely nerd-snipe me with an uncommon case, hehe).
4) for whatever reason, you want sorted results from the back-end and not do sorting on the front-end, if keys are not sorted, means recursive CTE, which isn't the end of the world but could be slower and means additional complexity.
5) you can change next_id to prev_id and spare the updates for the common case of inserting at the bottom (you still need the read, and a retry mechanism on transaction fail though)
It was definitely fast enough for my purposes, sorting about 30k items in 3 different linked lists in just a second or two. The function allows you to limit your depth or modify your starting location in the linked list (graph) which opens a lot of possibilities and/or performance improvements when only loading the first N records.
I guess the worst case is where you want to iterate over a large dataset in list order, but either you can't fit it in memory, or you don't want it all to go over the wire. In that case...yeah I don't know...what's the linked list equivalent of a B-tree? :)
1. The position of an element in an array is not a property of the element but of the array.
2. Don't use sorted sets when you need the properties of an array / linked list.
You can store the todo list as an array of todos (use a user defined datatype), you can create another table called todo_list that contains an array of references to the todos. You can also create a linked list. Note that you'll have to use modern SQL to retrieve the linked list 'efficiently'.
The problem only becomes difficult when we dislike the obvious solutions because it violates some property of 'elegance' which apparently means restricting yourself to the 1980s versions of SQL implementations.
If the language we were writing this in was called JS and the code has the order of the array as a property of the object and we were continually sorting it every time we accessed the array instead of just reordering the array we'd recognize this as the anti-pattern it is. Similarly if for some reason using JS functions was considered 'inelegant', or for some reason using only JS features of 1996 instead of 2020.
Modern SQL has CTEs, functions, and arrays. They are very elegant when solving the problem of arrays in SQL.
You can try to emulate it with triggers, and start to appreciate the subtle complexities of the problem. For example, would we want a new class of element-wise actions to act analogous to ON DELETE/UPDATE CASCADE/SET NULL? Rather than pruning referenced row or referencing column value, you'd want to mutate just an element within the structured type, right? How would you expose these many choices if trying to design a new constraint syntax with reusable machinery?
Even without modern SQL arrays one could even keep references to place, per user, in a simple comma separated text field and munge it in code. Elegant? Probably not. But perhaps more practical then some of the approaches presented.
But modern SQL does have arrays.
I'm using Mudder.js to do it, but the algorithm is straightforward. Define an alphabet (e.g., A-Z). Every time an item is inserted/repositioned, its sort order is given the midpoint between the sort values of its two neighbors. So the first item is given the sort value M (midpoint between A and Z). Inserting before that item gets G (between A and M). Inserting between these two items gets J (between G and M).
Once you run out of letters between two sort values, the algorithm tacks a digit onto the sort value of the item ahead of it. So inserting between M and N yields MM. If you do this enough times in a pathological pattern and wind up with some long string of characters you can reflow your list and rebalance everything out evenly (though that's strictly an optimization for storage space/bandwidth, and not a requirement for the algorithm to function).
This all sorts perfectly with ORDER BY etc., supports an number of repositions bounded only by your storage space, and doesn't require arbitrary-precision decimal datatypes or fraction handling.
Strings work everywhere, and are first-class data types it most systems/languages. They take more bits per digit than numbers (though you can choose a wider alphabet to mitigate that, such as ASCII 33 '!' through 126 '~'), but I'm happy to trade the storage away for using a first-class data types.
Start with .M (encoded as "M"); add in .F, .T to insert before or after. Once you find yourself trying to insert between .M and .N, do it by adding .MM.
But in a database char types can be arbitrarily long, and database indexes are well suited to indexing them. That gives them a lot of advantages for this kind of usecase.
If you have a space of 64 bits to represent elements of some ordered field, and you insert items one at a time, there is some ordering of the items that allows you to make only 64 inserts.
Claiming that numbering the items with integers only allows you 16 insertions, but "It would take virtually forever to run out of space through repeated list reordering" using rational numbers just can't be correct. The author cherry-picked a bad implementation for the integers and a favourable insertion order for his pet approach.
Where exactly did he go wrong? By picking the first number in the sequence of integers to be 65536, one of the smallest of the unsigned 64-bit integers, and then deciding all the inserts would happen in descending order. If he had picked a number like 2^63 instead, he would have got at least 63 inserts no matter what the insertion order. Using this integer strategy also lends itself to easy reindexing if you do need to: the row of rank N gets reindexed to 2^64 / N. Or if you want to increase the size of the index field to 128 bits, bitshift everything left.
Note, just because we have this NFL theorem doesn't mean one approach can't outperform another in typical use patterns, but I don't see evidence for that here. The most logical integer-index strategy isn't considered and the insertion orders that exhaust the rational strategy are not particularly pathological.
If 64 bits represent an ordering, and you split it in two for each insert - necessarily, to permit future insertion on either side of the new insert - then at least one side has no more than 63 bits left to store its ordering information for all future inserts on its side. You can't avoid this even with tricks that represent fractional bits.
The problem is that in order to avoid renumbering (which redistributes ordering state), the bisection needs to pre-allocate state space for future ordering information. The best it can do (without having more information about future insert order) is to split the state space perfectly evenly.
In the article, there's a big blue Update box which mentions a sequence of L/R path flips which make the fraction into a Fibonacci sequence, where you can go to 46 in sequence before exceeding a 32-bit integer. You'll observe than 46 is less than 64.
If he had picked 2^63, it wouldn’t be possible to create an initial list of even 2 items.
Depending on the use case, I probably would have picked 2^54 or so (allows for appending about a thousand items)
If your lists are small it doesn’t matter much what you do, but for large lists, I don’t think there’s a clean solution for this, if one keeps the key size constant.
N = 2^48 is what gives you his strategy of numbering the initial entries in increments of 65536.
Keeping a space between the items and reordering/"defraging" periodically might be easier.
A DB schema with an ID for the whole collection and a linked list is basically optimal here. Various obvious extension present themselves at that point, such as multiple orders by moving the linked list to an external table or adding reverse links. You don't need a complicated CTE, in 2021 you just query the whole list by the collection ID and assemble in memory.
By the time this solution doesn't scale, you've almost certainly got other problems where your data structure no longer matches the true nature of the data. Humans do not generally have total orders on 10,000s of elements. In those rare cases where for some reason you do, which I have encountered 0 times in ~25 years, by all means use some other solution. But don't expect it to come up often.
I seem to recall seeing _something_ more elegant than a linked list for this type of thing, but I don't recall what is what, offhand.
Early in a project, people stick to some heuristic, like using small z-index values like 0, 1, 2, 3... etc. But this has drawbacks because you can't add new z-index items in between without also shifting everything with a higher z-index
Using z-index values with a gap is a better bet. 0, 10, 20, 30, etc. Of course, you are limited to only a few items in between the original. You've also lost an expectation of ordering in the values.
Larger gaps make more sense, as long as you don't exceed the size of signed integers ±2147483647.
The author of the post mentions how floating point has a similar problem in the amount of floating point precision you can wring out before you end up with rounding errors.
The true crux of the problem is one of items related to one another, in other words it's a graph problem. I want these items to show up underneath these other items.
This seems to have exposed a bug in HN. Clicking on the "past" link for the present submission does not turn up that past submission, although the links are identical.
However, clicking on the "past" link for that three year old submission does turn up the present submission.
>However, clicking on the "past" link for that three year old submission does turn up the present submission.
The "past" feature is implemented by a third-party site, not HN.
Also, you can see the URL of the page and highlighted text on the page that it's searching by the title text as well as the link. The previous submission's title did not have all the words of the current one, hence does not show in the search results.
The current 'past' link isn't searching for the url at all. I think if it looked at the url instead of the title then it would miss when there are articles from other sites on the same topic so there is a bit of a tradeoff. If it looked at both, Algolia treats it as an 'and' so it would also not return the older submission. An 'or' would probably work well for this use case but not sure if it is supported.
Say you insert three items. a: 1, b: 2, c: 3.
Then you reorder 3 after 1. The new order is a: 1, aa: 3, b:2.
Then you reorder 2 after 1. The new order is a: 1, aa: 2, aa: 3.
But since there are now two values with the same sort key, it might as well be a: 1, aa: 3, aa: 2.
Did I misunderstand your solution or is that a bug?
A similar system is used in Jira called LexoRank
To be fair though, the scenario of inserting 38 items in a row in the same location (the fail mode of Approach 2) does sound like something that could happen in one user session, so might need another check of some sort in addition to the weekly check.
Come to think of it, 38 inserts in the same location might be a good test case to add for any sortable data model.
Querying is a bit less straightforward, though; you need a recursive query to traverse the pointers. (Or traverse in application logic, at the cost of a bunch of unnecessary round-trips.)
Updating dozens/hundreds/thousands of records at once is obviously unnecessarily heavy to process. But depending on your storage layer's locking model, it could cause high lock contention if concurrent actors are updating the same list and fight over large fractions of the dataset with every update.
If you're working in a networked scenario, naïve implementations can turn into transmitting a lot of data across the wire for every single reposition.
It's also algorithmically challenging in concurrent scenarios. If I update my local state rearrange items, and you update your local state to rearrange items in an overlapping range, when they have to sync up this quickly turns into the developer having to learn to implement CRDTs or Operational Transforms. You could implement optimistic or pessimistic locking, but you have to lock such a large fraction of the dataset that it's easy to limit throughput.
Distributed/eventual-consistency problems are particularly challenging with ordering solutions of this variety, but the difference between one operation updating a wide swath of values vs just the single updated value can make a difference in how primitive a reconciliation algorithm the develop can bring to bear.
ItemsOrderConfiguration Table
Id | Json (varchar(MAX) or proper JSON type in Relational DBs)
1 | [
{
"ItemId": 1,
"Order": 2
},
{
"ItemId": 2,
"Order": 1
},
]
2 | [ {
"ItemId": 15,
"Order": 1
},
{
"ItemId": 32,
"Order": 2
},
]
the bad thing about this is that you lose constrains e.g that ItemId actually exists, but it shouldn't be a problem.This belongs to the class of problems involving trying to represent a tree in a relational database. There are many solutions, all with some problem.
I found this slide show from him. It covers the same ideas:
https://www.slideshare.net/billkarwin/models-for-hierarchica...
I do not see why this is necessary, you could also reindex the positions because only the relative order of integers matters. How hard can it be to recreate a bunch of integers once every 16 insertions in the same place in the list?
Drill the junior devs: use decimals if you want accuracy.
What is unsaid with this: if you want accuracy with monetary representation.
Lots of problems out there where decimals are simply inappropriate.
One day, rational as a core type in postgres would be a nice addition. Rationals should belong as a core type in most every platform where general purpose compute can happen.