Rate limiting and pagination aren’t (necessarily) about making full data consumption more difficult. They’re more often about optimizing common use cases and general quality of service.
Edit to add: in certain circles (eg those of us who take REST and HATEOAS as baseline HTTP API principles), parallelism is not just expected but actively encouraged. A service can provide efficient, limited subsets of a full representation and allow clients to retrieve as little or as much of the full representation as they see fit.
But limiting for efficiency is usually done in a way I'd call cargo-culted. The number of items per "page" is usually a number one would pick for a displayed page, in the range of 10 to 20. That's inefficient for the general case: the amount of data transmitted per page is often about the same size as the request and response headers themselves. So if the API isn't strictly for display purposes, pick a number of items per page that gives a useful balance between not transmitting too much useless data and keeping query and response overhead low. Paginate in chunks of 100 kB or more.
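To make that concrete, here is a minimal sketch of deriving a page size from a byte budget rather than a display-oriented count. All names here (`TARGET_PAGE_BYTES`, `avg_item_bytes`, `overhead_bytes`) are illustrative, not from any particular API:

```python
# Hypothetical sketch: size pages by a byte budget instead of a UI count.
TARGET_PAGE_BYTES = 100 * 1024  # aim for ~100 kB per page, per the rule of thumb above

def page_size(avg_item_bytes: int, overhead_bytes: int = 1024) -> int:
    """Items per page so that the payload is roughly TARGET_PAGE_BYTES, never below 1."""
    usable = max(TARGET_PAGE_BYTES - overhead_bytes, avg_item_bytes)
    return max(1, usable // avg_item_bytes)

# A 500-byte item yields a page of a couple hundred items instead of 10-20.
print(page_size(500))
```

For items of ~500 bytes this gives a page of roughly 200 items, so the per-page header overhead becomes negligible instead of dominating the transfer.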
In terms of computation and backend load, pagination can be as expensive for a one-page query as for a full query. This usually happens when the query doesn't directly hit an index or a similar data structure, so a full sweep over all the data cannot be avoided. So think and benchmark before you paginate, and maybe add an index here and there.
Pagination would just complicate things. I think with most APIs, intended as APIs (i.e., not just an endpoint primarily meant to feed a front-end page), you're better off thinking of your default as "I'm going to just stream everything they ask for", and look for reasons why that won't work, rather than start from the presumption that everything must be paginated from the beginning.
Don't get me wrong; there are plenty of solid reasons to paginate. You may discover one applies to your API. But if you can leave it out, it's often simpler for both the producer and the consumer. Wait until you find the need for it. Plus, if that happens, you'll have a better understanding of the actual problem you need to solve and better solutions may reveal themselves.
In very simple cases, like a single-table SQL query, absolutely: databases effectively have to compute the full result, sort it, and return a window anyway. There's almost no reason to paginate here at the API level, unless the consumer wants only a subset (say, due to bandwidth limitations). Sending it all at once can be a huge benefit for consumers that will use it all; it's both simpler and faster for all parties.
But in most real-world cases, there are at least two additional details that can add significant response time: joins (when not involved in sorting) and additional data-gathering needed to fully build the response (e.g. getting data from other systems, internal or external). Joined data is not typically loaded prior to computing limit/offset since it may be a massive waste, and external data is effectively the same issue but with far higher latency.
And that's before getting into other practical issues, e.g. systems that can't process the response stream as it comes in. A subset will load and return faster than the whole content in all cases, so e.g. a website loading some JSON can show initial UI faster while loading more in the background. Streaming is often possible and negates a lot of those downsides, but it's far less common than processing a response only after it completes.
Speaking from experience... we want to make it easy, but we also want to keep it performant. Getting all the data in one go is generally not performant and is easy to abuse as an API consumer: for example, always asking for all of the data rather than maintaining a cursor and a secondary index (which is so much more performant for everyone involved).
Is there any good literature or patterns on supporting dumps in the tens of millions or larger?
I wrote a sheets plug-in that uses our cursor API to provide a full dump into a spreadsheet. Our professional services team is in love with it, so I bet they'd love generic data export capability.
1) AWS DynamoDB has parallel scan functionality for this exact use case. https://docs.aws.amazon.com/amazondynamodb/latest/developerg...
2) A typical database already internally maintains an approximately balanced B-tree for every index. Therefore, it should in principle be cheap for the database to return a list of keys that approximately divide the keyrange into N similarly sized ranges, even if the key distribution is very uneven. Is anybody aware of a way this information could be obtained in a query in e.g. Postgres?
3) The term 'cursor pagination' is sometimes used for different things, either referring to an in-database concept of cursor, or sometimes as an opaque pagination token. Therefore, for the concept described in the article, I have come to prefer the term keyset pagination, as described in https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin.... The term keyset pagination makes it clear that we are paginating using conditions on a set of columns that form a unique key for the table.
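A minimal in-memory sketch of what keyset pagination means: each page is defined by a condition on a unique key ("everything after the last key I saw"), not by an offset. The rows and column name (`id`) below are invented for illustration:

```python
# Keyset pagination over an in-memory list, mimicking
# SQL's  WHERE id > :last_key ORDER BY id LIMIT :limit
from bisect import bisect_right

def keyset_page(rows, last_key=None, limit=3):
    """rows must be sorted by their unique key; returns (page, next_cursor)."""
    keys = [r["id"] for r in rows]
    start = 0 if last_key is None else bisect_right(keys, last_key)
    page = rows[start:start + limit]
    # A short page means we reached the end; otherwise the last key is the cursor.
    next_key = page[-1]["id"] if len(page) == limit else None
    return page, next_key

rows = [{"id": i, "val": i * 10} for i in range(1, 8)]  # ids 1..7
page1, cursor = keyset_page(rows)          # ids 1, 2, 3
page2, cursor = keyset_page(rows, cursor)  # ids 4, 5, 6
```

Because the page boundary is a key condition rather than a row count to skip, inserts and deletes before the cursor don't shift subsequent pages, and the query can be answered directly from the index.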
There's no standard way because index implementation details are hidden for a reason.
>in e.g. postgres
You can query pg_stats view (histogram_bounds column in particular) after statistics are collected.
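Sketching how that could answer point 2 above: the `histogram_bounds` in `pg_stats` (populated by `ANALYZE`) already delimit roughly equal-frequency buckets of the column, so picking every k-th bound yields N ranges with similar row counts even under heavy skew. The bounds list below is made up; in practice you would read it from the view:

```python
# Split a keyrange into n roughly equal-frequency (lo, hi) ranges using
# equal-frequency histogram bounds (e.g. pg_stats.histogram_bounds).
def split_ranges(histogram_bounds, n):
    """Return n (lo, hi) ranges of roughly equal row counts; None = open end."""
    buckets = len(histogram_bounds) - 1  # bounds delimit equal-frequency buckets
    cuts = [histogram_bounds[(i * buckets) // n] for i in range(1, n)]
    edges = [None] + cuts + [None]
    return list(zip(edges, edges[1:]))

bounds = [1, 2, 3, 5, 8, 13, 21, 100, 500, 9000, 64000]  # skewed keys, equal counts
print(split_ranges(bounds, 4))
```

Each worker can then run a keyset scan over its own (lo, hi] range in parallel, DynamoDB-parallel-scan style.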
Here we just see regular APIs being abused for data export. I'm rather surprised the author didn't run into rate limiting.
At this point, if these requests are expensive, you have an opportunity to use a very simple (and optimistic) cache for good-faith API users, relegate rate limiting to preventing abuse of cache creation (which should be even easier to detect than merely overzealous parallelism), and even use the same or similar semantics to implement deltas for subsequent export/sync.
Obviously this has some potential caveats if that churn is also likely to quickly invalidate data, or revoke sensitive information. Time limits for historical data retrieval can be imposed to help mitigate this. And individual records can be revised (eg with bitemporal modeling) without altering the set of referenced records.
Why do you think it is important for users to have temporal consistency?
It’s not a great UX. And in some ways I suspect that my own views were at least partially causing it, which made me more hesitant to even click on anything unless I was sure it was worth the disruption.
True result sets require relative page tokens and a synchronization mechanism if the software demands it.
Ideally I'd want a system that guarantees at-least-once delivery of every item. I can handle duplicates just fine, what I want to avoid is an item being missed out entirely due to the way I break up the data.
Pagination is hard
This is different from LIMIT in RDBMS
Wouldn’t this pattern solve the complexity you are talking about?
This is funny. Using offsets is known to be bad practice because... it's hard to do.
Look I’m just a UI guy so what do I know. But this argument gets old because I’m sorry, but people want a paginated list and to know how many pages are in the list. Clicking “next page” 10 times instead of clicking to page 10 is bullshit, and users know it.
If you ask for "first 50 items after key X", you just need to keep a priority queue of size 50 on each BE node and merge them before returning them (I'm assuming a distributed backend). It doesn't matter on which page you are.
But if you specify "first 50 items after element N" it gets really tricky, each BE shard needs to sort the first N elements, and it can use some trickery to avoid doing a naive merge (see: https://engineering.medallia.com/blog/posts/sorting-and-pagi... ).
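A hedged sketch of the easy case above ("first k items after key X" on a distributed backend): each shard filters locally, keeps only its own top k, and the coordinator merges the sorted candidate lists. Shard contents here are invented:

```python
# Distributed keyset page: no shard ever needs to materialize more than k items.
import heapq

def shard_top_k(shard, after_key, k):
    # Each shard applies the key condition locally and keeps its k smallest.
    return heapq.nsmallest(k, (x for x in shard if x > after_key))

def global_page(shards, after_key, k):
    candidates = [shard_top_k(s, after_key, k) for s in shards]
    # nsmallest returns sorted lists, so heapq.merge can combine them lazily.
    return list(heapq.merge(*candidates))[:k]

shards = [[1, 4, 9, 12], [2, 5, 7, 30], [3, 6, 8, 11]]
print(global_page(shards, after_key=5, k=4))
```

Contrast with the offset case: "first k after element N" forces every shard to produce up to N + k candidates before the merge, which is exactly why it gets expensive for deep pages.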
You can at most save some transfer over the network.
Either way, 10 pages isn't so bad but tens of thousands can become troublesome as explained on https://shopify.engineering/pagination-relative-cursors
Maybe someone else has been there before and told them "Go to page ten".
Maybe you know that there are 20 pages, and are looking to find the median, or are just informally playing around, looking at the distribution.
Same as you'd do with an interesting page of a book. I don't think I'd stop using page numbers in dead tree books if they somehow came with filters.