Oh my sweet summer child. Unfortunately, there is a whole industry built around this. This is a great blog discussing different detection methods: https://incolumitas.com/
Let me tell you about this little thing called tax accountants and the IRS...
> I think the aliens would conclude there is something wrong with the economic system.
These aliens are unfamiliar with adversarial systems?
Reminds me of a SciFi short story I read once where some aliens came to Earth. Everyone thought they were amazingly smart, but it turned out they had just been working on their tech for a lot longer and were very dumb. The protagonist in the story figured this out, and sold them the Brooklyn Bridge...
My advice, based on experience: when you find yourself in this situation immediately start looking for a way off the boat. Urgently. It is rare for people on the boat to notice this before it sinks, and those few who do always seem to overestimate the supply of lifeboats.
Regardless, starting out with any variation of "you're blissfully ignorant" isn't needed either. I get that offense usually isn't intended, but the use of that phrase has always struck me as a very condescending way to respond.
In the early days, the developer abuses around ASP view state payload were an absolute nightmare to deal with. I used to half-joke that I could speak HTTP after staring at the raw traffic and how 5 page loads could generate 100+ requests which had dependencies on one another.
There was also an interesting class of client-server bugs that were only obvious in recordings (e.g. multiple repeated HTTP HEAD requests to check if a resource existed). Each object or library dev clearly had no idea that the function triggered just before theirs also wanted to check whether that resource existed. This resulted in a huge number of redundant calls because nobody coordinated and optimized at this level.
Fun stroll down memory lane.
This happens more often than you would expect, even without any auth sometimes. At that point you're basically developing with the same DX as internal developers.
My theory is people just turn off the GraphiQL endpoint on their GraphQL server and think they have hidden the schema, not realizing any external tool can do the introspection. Either that or it's developers slipping a little something under the radar for other developers (same thing with source maps).
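For anyone who hasn't seen it, the introspection mentioned above is just an ordinary GraphQL query sent to the same endpoint. A minimal sketch in Python, assuming the server accepts the usual JSON POST body with a "query" key; the function names and the example endpoint are made up for illustration, and the request is only built here, not sent:

```python
import json
import urllib.request

# A deliberately small introspection query: it only asks for type names
# and the names of their fields, not the full schema.
INTROSPECTION_QUERY = """
query {
  __schema {
    types {
      name
      fields { name }
    }
  }
}
"""


def build_introspection_request(endpoint: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the introspection query."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def list_types(response_json: dict) -> list[str]:
    """Pull the type names out of a decoded introspection response."""
    return [t["name"] for t in response_json["data"]["__schema"]["types"]]
```

Passing the request to `urllib.request.urlopen` and feeding the decoded JSON to `list_types` would give the type names, provided introspection is actually enabled on the server.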
Another tip: If the service in question has a mobile app, sniffing the traffic on that with a MITM proxy can yield more interesting results than a web app.
If you really intend for your GraphQL API to be used only internally and from your official web client, and you consider any fields not currently requested in your web client to be highly sensitive, you should really turn off public access to the full GraphQL API and use something like GraphQL's persisted queries, where your web client requests queries by an opaque unique identifier rather than the full text of the query.
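To make the persisted-query idea concrete, here is a rough server-side sketch in Python. It assumes the identifier is a SHA-256 hash of the query text and that the client sends `{"id": ...}`; both are illustrative choices, not the wire format of any particular GraphQL server, and `PersistedQueryStore` is a hypothetical name.

```python
import hashlib


def query_id(query_text: str) -> str:
    """Derive an opaque identifier from the query text (assumption: SHA-256 hex)."""
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()


class PersistedQueryStore:
    """Allowlist of the queries the official client is known to send."""

    def __init__(self, allowed_queries):
        self._by_id = {query_id(q): q for q in allowed_queries}

    def resolve(self, request_body: dict) -> str:
        """Return the stored query text for a request, or refuse it."""
        # Ad hoc query text is rejected outright, so introspection and
        # hand-written queries never reach the executor.
        if "query" in request_body:
            raise PermissionError("ad hoc query text is not accepted")
        try:
            return self._by_id[request_body.get("id")]
        except KeyError:
            raise PermissionError("unknown query id") from None
```

The point of the design is that the set of executable queries is fixed at build time, so fields the web client never requests are not reachable from outside.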
Not only is it translucent, but your audience tends to have a better idea than they can directly see at the moment as to what it's hiding.
https://github.com/nikitastupin/clairvoyance https://github.com/swisskyrepo/GraphQLmap
If you get stuck, look at their JavaScript and see what it is doing. Double check your network requests in developer tools; some of them might be more important than you think. Plus, it's so nice that we don't have to use Burp for this anymore. Some sites check referrers and user agents, or expect a field from a specific server-rendered page to be added to a header. More than one expected a JavaScript-style timestamp on every request.
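A small sketch of what reproducing those checks can look like in Python. The header name `X-Timestamp` is purely a placeholder (each site picks its own, if it uses one at all), and `build_request` is a made-up helper; the only firm detail is that JavaScript's `Date.now()` counts milliseconds while Python's `time.time()` counts seconds.

```python
import time
import urllib.request


def build_request(url, referer, user_agent, timestamp_header="X-Timestamp"):
    """Build a request carrying the headers some sites insist on."""
    headers = {
        "Referer": referer,
        "User-Agent": user_agent,
        # JavaScript-style timestamp: integer milliseconds since the epoch.
        timestamp_header: str(int(time.time() * 1000)),
    }
    return urllib.request.Request(url, headers=headers)
```

In practice you would copy the real header names and values from the network tab rather than guess them.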
The weirdest behavior comes from older apps that started as purely server rendered, and slowly added a dynamic frontend. I always cringe when it's obvious that different developers were given tasks over the years, and completed them without bothering to learn the rest of the system.
I work on some SPAs and some server side rendered systems.
It's so nice to fire up the network tab and see some of the requests right away to troubleshoot.
Server side rendered stuff, not so easy. Not impossible and you can always add some debugging, but the nature of SPAs to just call all the things that are easily seen, very nice. And I can use that elsewhere.
Specifically regarding Instagram, you can take a look at the implementation of https://github.com/dilame/instagram-private-api to understand more workarounds, as Instagram is getting better and better at working against the workarounds.
https://github.com/lwthiker/curl-impersonate
I haven't tried it, haven't really come across a service that blocks curl, but I'll be keeping an eye on it in case I need it.
Except curl.
You can "curl" their RSS feed. You can open it in a browser. Anything else that doesn't lie about its User-Agent will fail.
W T F.
Somebody please go strangle those people. I had to set my RSS feed reader to impersonate curl's User-Agent.
I wrote a bit about that here: https://simonwillison.net/2022/Mar/10/shot-scraper/#how-it-w...
Yes, go ahead and disable that header when piping curl's output into `less`, however when converting the curl request into python just remember to re-add that header. Pretty much every python library I've used to handle web requests will automatically unzip the response from the server so you don't need to futz about with the zipping/unzipping logic yourself.
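One exception worth knowing about: the standard library's `urllib.request` does not decompress for you, so if you keep the `Accept-Encoding: gzip` header there you have to handle it yourself. A minimal sketch, assuming the headers arrive as a plain dict with the exact key `Content-Encoding` and that the body is UTF-8 text; `decode_body` is a made-up helper name:

```python
import gzip


def decode_body(headers: dict, body: bytes) -> str:
    """Decompress a response body if the server gzipped it, then decode it."""
    encoding = headers.get("Content-Encoding", "").lower()
    if encoding == "gzip":
        body = gzip.decompress(body)
    # Assumption: the payload is UTF-8 text (JSON, HTML, ...).
    return body.decode("utf-8")
```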
Having done this multiple times, be aware that you can break other people's stuff by messing up requests. Most web APIs suck and some won't behave nicely on unexpected failures.
1. When trying to automate a process on an energy management platform I ended up creating resources under some kind of master account, some things broke and they had to manually clean the DB.
2. When trying to access an operation I couldn't do via the provided API I reverse engineered the API of their admin dashboard. It sucked really bad, with a lot of strange sync tokens that felt like going back 20 years. Anyway, my implementation wasn't perfect, and it ground their platform to a halt.
I could go on, so please only do stuff like that if you're in contact with the people on the other side. If you're not, limit yourself to GETs.
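One way to hold yourself to the GET-only rule is to make the script refuse anything else. A minimal Python sketch; `read_only_request` is a hypothetical helper, and it only guards the HTTP method (a badly designed API can still change state on a GET):

```python
import urllib.request


def read_only_request(url, method="GET", headers=None):
    """Build a request, refusing any method other than GET."""
    method = method.upper()
    if method != "GET":
        raise ValueError(f"refusing to send {method}: this script is read-only")
    return urllib.request.Request(url, headers=headers or {}, method=method)
```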
Pro-tip: If the undocumented API sends an "Access-Control-Allow-Origin: *" CORS header, you can call it directly from the browser on your own domain, without having to proxy it or use curl.
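A quick way to check for that before building anything is to look at the response headers from one request. A small Python sketch, assuming you already have the headers as a dict (for example copied from curl's `-i` output); `allows_any_origin` is a made-up name:

```python
def allows_any_origin(response_headers: dict) -> bool:
    """True if the response carries Access-Control-Allow-Origin: *."""
    for name, value in response_headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "access-control-allow-origin":
            return value.strip() == "*"
    return False
```

Note that the wildcard only covers requests made without credentials; browsers will not send cookies to such an endpoint cross-origin.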
As an example, I published https://captnemo.in/plugo/ this week that calls the Plugo.io private API (the ones used by the mobile app) to fetch the data, and publish it using GitHub Pages. The data is just a list of places where Plugo provides powerbanks on rent (500+ locations, mostly concentrated across 3 Indian cities, and 2 places in Germany somehow). I'm running a simple curl command on a scheduled GitHub Action that commits back to itself so the data remains updated.
I similarly did this to make a nocode frontend for another "clubhouse-alternative" which would keep recordings, but only provide them in-app. A friend wanted to listen to his prior recordings, but the app was too cumbersome, so I made an alternative frontend that would call the private API, and render a simple table with MP4 links for all recordings.
I even use this as a "nocode testing ground"[1] for many of the new nocode apps in the market - seeing if they are feasible enough to build fully functional frontends on top of existing APIs (which would be great for someone like me).
As a bonus, this works as an alternative-data stream for (i) Plugo's growth metrics, if you were an investor or interested in the "rent-powerbank" space, as well as (ii) finding out cool new places to visit around you.
There is a way.^1 One might need to copy the static elements of the TLS Client Hello in addition to certain HTTP headers.
1. https://blog.squarelemon.com/tls-fingerprinting/
See, e.g., https://github.com/refraction-networking/utls
"problem 1: expiring session cookies
One big problem here is that I'm using my Google session cookie for authentication, so this script will stop working whenever my browser session expires.
That means that this approach wouldn't work for a long running program (I'd want to use a real API), but if I just need to quickly grab a little bit of data as a 1-time thing, it can work great!"
Sometimes Google keeps users logged in. For example, session cookies in Gmail will last for months or more. This makes it easy to check Gmail from the command line without a browser. It also means if someone steals a session cookie and the user never logs out, e.g., she closes the browser without logging out first,^2 then the thief can access the account for months, or longer.
2. Of course, it is also possible to log out and disable specific session cookies from the command line, without a browser.
"A special compilation of curl that makes it impersonate Chrome & Firefox", and it now can also impersonate Edge and Safari.
Previously discussed: https://news.ycombinator.com/item?id=30378562 _Show HN: Curl modified to impersonate Firefox and mimic its TLS handshake_ (21 days ago, 58 comments)
What is a reasonable rate to send requests? I've done a little scraping and I wanted to do the same thing but I realized I had no idea what would be considered acceptable use and what would be unacceptable. If anyone has a heuristic they like to use I'm all ears.
Usually you're automating these things not to get the job done that much faster, but instead just to do it without all the tedium, so a slow but asynchronous scrape is fine.
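A sketch of what "slow but asynchronous" can look like in Python: one request at a time with a fixed pause in between, running in the background while you do something else. The one-second default is an arbitrary placeholder, not a recommendation, and `fetch` is a coroutine you supply yourself.

```python
import asyncio


async def scrape_slowly(urls, fetch, delay_seconds=1.0):
    """Fetch URLs one at a time, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i:
            # Pause before every request except the first.
            await asyncio.sleep(delay_seconds)
        results.append(await fetch(url))
    return results
```

Run it with `asyncio.run(scrape_slowly(urls, fetch))`, or schedule it as a task alongside other work.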
This is wrong, and the fact that somebody clearly experienced in web development is totally unaware that it is wrong should be a clear sign of the danger.
For starters: TLS fingerprinting, ETag fingerprinting (including subtle browser-to-browser differences in how ETags are cached and evicted), JS VM fingerprinting, timing side channels; there is a massive list here. And then there's wasm...
It also makes the process of writing your code more mechanical, which is useful since you'll likely have to redo the process when the API changes.
IME, this "header minimisation" works for almost any website, or "endpoint". IOW, it is useful outside of "APIs". As a matter of practice, I minimise headers automatically with a forward proxy.^1
Thus, one can send less data to "tech" companies and still receive the same results. We know that data received by "tech" companies is used at every opportunity to support surveillance and online advertising. The most well-known example is perhaps "fingerprinting". Given a choice between sending more data or less data to "tech" companies, which choice, in the aggregate,^2 lends itself better to increased surveillance and online advertising?
If the author here can send fewer headers and still get the desired result, then it stands to reason sending those extra headers benefits someone else besides the user. Send more data, not less, to make surveillance and online advertising easier. "Tech" companies will often defend data collection by suggesting that data supplied in headers are being used to "improve the user experience" or some such, and this may well be true for many cases, but the "fingerprinting" example exemplifies how there can also be another purpose. Data can be multi-purpose.
1. An added benefit is one does not need to fiddle with the browser to copy HTTP headers^3 as they are all easily accessible in the proxy logs.
2. Here, "in the aggregate" means "if every user makes the same choice".
3. The online advertising company or its business partner (e.g., Mozilla) could change the browser, without notice, at any time.
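For those doing the header minimisation by hand rather than with a proxy, the trial-and-error version can be sketched in a few lines of Python. This is not the proxy setup described above; `still_works` is a callback you write yourself (for example: send the request with these headers and check that the response still contains the data you expect).

```python
def minimise_headers(headers: dict, still_works) -> dict:
    """Drop each header in turn and keep it dropped if the request still works."""
    kept = dict(headers)  # work on a copy; the caller's dict is left untouched
    for name in list(headers):
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial
    return kept
```

This makes one request per header, so it is best run against a single endpoint rather than in a loop over many.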
I have a switch (I think a TP-Link TL-SG1016PE) with PoE, and a finicky PoE device that periodically needs a reboot, so I figured I'd replay turning the port on and off in the Web interface. Notably, logging in does not issue me any authentication token, but I can still turn the port on and off, and can still do it via `curl`, too. But as soon as I try it on another machine? Access denied!
(Yes, I could just fake the login process the same way, but that was more work than I had time for.)
You probably don't want Accept: */*. If the value of Accept is anything other than */*, then you probably want it.