Rate limiting and pagination aren’t (necessarily) about making full data consumption more difficult. They’re more often about optimizing common use cases and general quality of service.
Edit to add: in certain circles (eg those of us who take REST and HATEOAS as baseline HTTP API principles), parallelism is not just expected but actively encouraged. A service can provide efficient, limited subsets of a full representation and allow clients to retrieve as little or as much of the full representation as they see fit.
But limiting for efficiency is usually done in a way I'd call cargo-culted. The number of items per "page" is usually a number one would pick for a displayed page, in the range of 10 to 20. That's inefficient for the general case: the amount of data transmitted per page is often about the same size as the request and response headers themselves. So if the API isn't strictly for display purposes, pick a number of items per page that gives a useful balance between not transmitting too much useless data and keeping query and response overhead low. Paginate in chunks of 100 kB or more.
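To make that concrete, here is a minimal sketch of deriving a page size from a byte budget rather than a display-oriented count. All names here (`TARGET_PAGE_BYTES`, `avg_item_bytes`, `overhead_bytes`) are illustrative, not from any particular API:

```python
# Hypothetical sketch: size pages by a byte budget instead of a UI count.
TARGET_PAGE_BYTES = 100 * 1024  # aim for ~100 kB per page, per the rule of thumb above

def page_size(avg_item_bytes: int, overhead_bytes: int = 1024) -> int:
    """Items per page so that the payload is roughly TARGET_PAGE_BYTES, never below 1."""
    usable = max(TARGET_PAGE_BYTES - overhead_bytes, avg_item_bytes)
    return max(1, usable // avg_item_bytes)

# A 500-byte item yields a page of a couple hundred items instead of 10-20.
print(page_size(500))
```

For items of ~500 bytes this gives a page of roughly 200 items, so the per-page header overhead becomes negligible instead of dominating the transfer.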
In terms of computation and backend load, pagination can be as expensive for a one-page query as for a full query. This usually happens when the query doesn't directly hit an index or a similar data structure, so a full sweep over all the data cannot be avoided. So think and benchmark before you paginate, and maybe add an index here and there.
Pagination would just complicate things. I think with most APIs, intended as APIs (i.e., not just an endpoint primarily meant to feed a front-end page), you're better off thinking of your default as "I'm going to just stream everything they ask for", and look for reasons why that won't work, rather than start from the presumption that everything must be paginated from the beginning.
Don't get me wrong; there are plenty of solid reasons to paginate. You may discover one applies to your API. But if you can leave it out, it's often simpler for both the producer and the consumer. Wait until you find the need for it. Plus, if that happens, you'll have a better understanding of the actual problem you need to solve and better solutions may reveal themselves.
In very simple cases, like a single-table SQL query, absolutely: databases effectively have to compute the full result, sort it, and return a window anyway. There's almost no reason to paginate here at the API level, unless the consumer wants only a subset (say, due to bandwidth limitations). Sending it all at once can be a huge benefit for consumers that will use it all; it's both simpler and faster for all parties.
But in most real-world cases, there are at least two additional details that can add significant response time: joins (when not involved in sorting) and additional data-gathering needed to fully build the response (e.g. getting data from other systems, internal or external). Joined data is not typically loaded prior to computing limit/offset since it may be a massive waste, and external data is effectively the same issue but with far higher latency.
And that's before getting into other practical issues, e.g. systems that can't process the response stream as it comes in. A subset will load and return faster than the whole content in all cases, so e.g. a website loading some JSON can show initial UI faster while loading more in the background. Streaming is often possible and negates a lot of those downsides, but it's far less common than processing a response only after it completes.
Speaking from experience... we want to make it easy, but we also want to keep it performant. Getting all the data in one go is generally not performant and is easy to abuse as an API consumer: for example, always asking for all of the data rather than maintaining a cursor and a secondary index (which is so much more performant for everyone involved).
Is there any good literature or patterns on supporting dumps in the tens of millions or larger?
I wrote a sheets plug-in that uses our cursor API to provide a full dump into a spreadsheet. Our professional services team is in love with it, so I bet they'd love generic data export capability.
1) AWS DynamoDB has parallel scan functionality for this exact use case. https://docs.aws.amazon.com/amazondynamodb/latest/developerg...
2) A typical database already internally maintains an approximately balanced B-tree for every index. Therefore, it should in principle be cheap for the database to return a list of keys that approximately divide the keyrange into N similarly sized ranges, even if the key distribution is very uneven. Is anybody aware of a way this information could be obtained in a query in e.g. Postgres?
3) The term 'cursor pagination' is sometimes used for different things, either referring to an in-database concept of cursor, or sometimes as an opaque pagination token. Therefore, for the concept described in the article, I have come to prefer the term keyset pagination, as described in https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin.... The term keyset pagination makes it clear that we are paginating using conditions on a set of columns that form a unique key for the table.
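A minimal in-memory sketch of what keyset pagination means: each page is defined by a condition on a unique key ("everything after the last key I saw"), not by an offset. The rows and column name (`id`) below are invented for illustration:

```python
# Keyset pagination over an in-memory list, mimicking
# SQL's  WHERE id > :last_key ORDER BY id LIMIT :limit
from bisect import bisect_right

def keyset_page(rows, last_key=None, limit=3):
    """rows must be sorted by their unique key; returns (page, next_cursor)."""
    keys = [r["id"] for r in rows]
    start = 0 if last_key is None else bisect_right(keys, last_key)
    page = rows[start:start + limit]
    # A short page means we reached the end; otherwise the last key is the cursor.
    next_key = page[-1]["id"] if len(page) == limit else None
    return page, next_key

rows = [{"id": i, "val": i * 10} for i in range(1, 8)]  # ids 1..7
page1, cursor = keyset_page(rows)          # ids 1, 2, 3
page2, cursor = keyset_page(rows, cursor)  # ids 4, 5, 6
```

Because the page boundary is a key condition rather than a row count to skip, inserts and deletes before the cursor don't shift subsequent pages, and the query can be answered directly from the index.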
There's no standard way because index implementation details are hidden for a reason.
>in e.g. postgres
You can query pg_stats view (histogram_bounds column in particular) after statistics are collected.
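Sketching how that could answer point 2 above: the `histogram_bounds` in `pg_stats` (populated by `ANALYZE`) already delimit roughly equal-frequency buckets of the column, so picking every k-th bound yields N ranges with similar row counts even under heavy skew. The bounds list below is made up; in practice you would read it from the view:

```python
# Split a keyrange into n roughly equal-frequency (lo, hi) ranges using
# equal-frequency histogram bounds (e.g. pg_stats.histogram_bounds).
def split_ranges(histogram_bounds, n):
    """Return n (lo, hi) ranges of roughly equal row counts; None = open end."""
    buckets = len(histogram_bounds) - 1  # bounds delimit equal-frequency buckets
    cuts = [histogram_bounds[(i * buckets) // n] for i in range(1, n)]
    edges = [None] + cuts + [None]
    return list(zip(edges, edges[1:]))

bounds = [1, 2, 3, 5, 8, 13, 21, 100, 500, 9000, 64000]  # skewed keys, equal counts
print(split_ranges(bounds, 4))
```

Each worker can then run a keyset scan over its own (lo, hi] range in parallel, DynamoDB-parallel-scan style.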
Here we just see regular APIs being abused for data export. I'm rather surprised the author didn't run into rate limiting.
At this point, if these requests are expensive, you have an opportunity to use a very simple (and optimistic) cache for good-faith API users, relegate rate limiting to preventing abuse of cache creation (which should be even easier to detect than merely overzealous parallelism), and even use the same or similar semantics to implement deltas for subsequent export/sync.
Obviously this has some potential caveats if that churn is also likely to quickly invalidate data, or revoke sensitive information. Time limits for historical data retrieval can be imposed to help mitigate this. And individual records can be revised (eg with bitemporal modeling) without altering the set of referenced records.
Why do you think it is important for users to have temporal consistency?
It’s not a great UX. And in some ways I suspect that my own views were at least partially causing it, which made me more hesitant to even click on anything unless I was sure it was worth the disruption.
True result sets require relative page tokens and a synchronization mechanism if the software demands it.
Ideally I'd want a system that guarantees at-least-once delivery of every item. I can handle duplicates just fine, what I want to avoid is an item being missed out entirely due to the way I break up the data.
Pagination is hard
This is different from LIMIT in RDBMS
Wouldn’t this pattern solve the complexity you are talking about?
This is funny. Using offsets is known to be bad practice because... it's hard to do.
Look I’m just a UI guy so what do I know. But this argument gets old because I’m sorry, but people want a paginated list and to know how many pages are in the list. Clicking “next page” 10 times instead of clicking to page 10 is bullshit, and users know it.
If you ask for "first 50 items after key X", you just need to keep a priority queue of size 50 on each BE node and merge them before returning them (I'm assuming a distributed backend). It doesn't matter on which page you are.
But if you specify "first 50 items after element N" it gets really tricky, each BE shard needs to sort the first N elements, and it can use some trickery to avoid doing a naive merge (see: https://engineering.medallia.com/blog/posts/sorting-and-pagi... ).
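A hedged sketch of the easy case above ("first k items after key X" on a distributed backend): each shard filters locally, keeps only its own top k, and the coordinator merges the sorted candidate lists. Shard contents here are invented:

```python
# Distributed keyset page: no shard ever needs to materialize more than k items.
import heapq

def shard_top_k(shard, after_key, k):
    # Each shard applies the key condition locally and keeps its k smallest.
    return heapq.nsmallest(k, (x for x in shard if x > after_key))

def global_page(shards, after_key, k):
    candidates = [shard_top_k(s, after_key, k) for s in shards]
    # nsmallest returns sorted lists, so heapq.merge can combine them lazily.
    return list(heapq.merge(*candidates))[:k]

shards = [[1, 4, 9, 12], [2, 5, 7, 30], [3, 6, 8, 11]]
print(global_page(shards, after_key=5, k=4))
```

Contrast with the offset case: "first k after element N" forces every shard to produce up to N + k candidates before the merge, which is exactly why it gets expensive for deep pages.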
You can at most save some transfer over the network.
Either way, 10 pages isn't so bad but tens of thousands can become troublesome as explained on https://shopify.engineering/pagination-relative-cursors
Maybe someone else has been there before and told them "Go to page ten".
Maybe you know that there are 20 pages, and are looking to find the median, or are just informally playing around, looking at the distribution.
Same as you'd do with an interesting page of a book. I don't think I'd stop using page numbers in dead tree books if they somehow came with filters.