How to use undocumented web APIs

(jvns.ca)

239 points · pingiun · 4y ago · 91 comments

benmmurphy4y ago· 16 in thread

> I think there’s literally no way for the backend to tell that the request isn’t sent by my browser and is actually being sent by a random Python program.

Oh my sweet summer child. Unfortunately, there is a whole industry built around this. This is a great blog discussing different detection methods: https://incolumitas.com/

paxys4y ago

The point still stands. The server can use any number of heuristics to try and figure out traffic source but (1) it is still an approximation, since they can all be spoofed, and (2) the more strict you make the detection logic, the more regular users are affected as well.

djeikyb4y ago

There's a way to make your point without being rude and infantilizing.

benmmurphy4y ago

I can't resist a rhetorical flourish :) The situation between defenders of the API and users of the API is extremely weird. If you were an alien from another planet and witnessed what was going on you would be shocked. People are being paid to prevent people from accessing an API and people are being paid to defeat these countermeasures. This is similar to people being paid to dig holes and other people are being paid to fill in the holes. I think the aliens would conclude there is something wrong with the economic system. I don't know what the solution is but seeing as I'm a digger I really don't want to rock the boat :)

nl4y ago

> People are being paid to prevent people from accessing an API and people are being paid to defeat these countermeasures. This is similar to people being paid to dig holes and other people are being paid to fill in the holes

Let me tell you about this little thing called tax accountants and the IRS...

> I think the aliens would conclude there is something wrong with the economic system.

These aliens are unfamiliar with adversarial systems?

Reminds me of a sci-fi short story I read once where some aliens came to Earth. Everyone thought they were amazingly smart, but it turned out they had just been working on their tech for a lot longer and were very dumb. The protagonist in the story figured this out, and sold them the Brooklyn Bridge...

octoberfranklin4y ago

> there is something wrong with the economic system. ... I'm a digger I really don't want to rock the boat

My advice, based on experience: when you find yourself in this situation immediately start looking for a way off the boat. Urgently. It is rare for people on the boat to notice this before it sinks, and those few who do always seem to overestimate the supply of lifeboats.

Spivak4y ago

I think that interpretation is more on you than anything -- "sweet summer child" is not literally referring to a child but someone whose innocence or blissful ignorance hasn't been ruined by the can of worms they just opened.

zamadatix4y ago

It means those things precisely because the person is a child. In A Song of Ice and Fire winter can last over a decade hence "summer child", a young one that has never experienced the hardship of winter https://en.wiktionary.org/wiki/sweet_summer_child

Regardless, starting out with any variation of "you're blissfully ignorant" isn't needed either. I get that offense usually isn't intended, but the use of that phrase has always struck me as a very condescending way to respond.

yakshaving_jgt4y ago

Even if that were so, it strikes me as odd to characterise the author — one of the world's more accomplished software professionals — as blissfully ignorant.

vincentmarle4y ago

Browser calls (and sessions) are indeed tricky to emulate - you'll generally have much better luck with reverse engineering mobile client API calls.

boilerupnc4y ago

Totally agree. I used to work on a load testing product that spent many, many dev hours attempting to achieve a high degree of fidelity on web recordings at the HTTP and sometimes even the socket level of emulation. It was extremely tricky. We employed a lot of regex matching mechanisms and used to keep a regression test bucket of thousands of example HTTP traffic recordings to avoid messing up cookies, headers, post data and query strings, to name a few things.

In the early days, the developer abuses around ASP view state payload were an absolute nightmare to deal with. I used to half-joke that I could speak HTTP after staring at the raw traffic and how 5 page loads could generate 100+ requests which had dependencies on one another.

Interestingly, there was also a class of client-server bugs that were only obvious in recordings (e.g. multiple repeated HTTP HEAD requests to check whether a resource existed). Each object or library dev clearly had no knowledge that the function triggered just before also wanted to check if that resource exists. This resulted in a huge number of redundant calls because nobody coordinated and optimized at this level.

Fun stroll down memory lane.

aryamaan4y ago

Any resources for that, like this post? I am going to google as well, but wanted to ask in case people already had something off the top of their head.

Dunedan4y ago

Her example code doesn't even set a user agent header, making it trivial to distinguish these requests from ones an actual browser would make.

MaxDPS4y ago

Yup, she is using the Requests library which has a default header that explicitly states the request is coming from Python Requests library.
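
The default-header point above is easy to see for yourself. The sketch below uses only the standard library's `urllib.request` rather than Requests (so the exact default string differs; Requests announces itself as `python-requests/<version>`), and the browser-style User-Agent value and URL are assumptions for illustration. Nothing is sent over the network.

```python
import urllib.request

# The opener behind urlopen() announces itself unless you override the header.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])  # a string starting with "Python-urllib/"

# Assumed browser-like value, for illustration only; copy the real one from your own browser.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def browser_like_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a request that carries a browser-style User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


# Hypothetical endpoint; urllib stores header names capitalised as "User-agent".
req = browser_like_request("https://example.com/api/items")
print(req.get_header("User-agent"))
```

Setting the header only removes the most obvious signal; as other comments in this thread note, TLS and behavioural fingerprints are untouched by it.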

naniwaduni4y ago

Perhaps the fact that there's a whole industry built around this instead of, you know, a couple of well-cooked solutions that everyone uses, implies that it's not possible and isn't going to be for the foreseeable future.

tshaddox4y ago

Well, in this case the existence of a whole industry built around this is a reflection of the fact that it's impossible to implement a single static solution to the problem. It will forever be a cat and mouse game.

benmmurphy4y ago

Yes. And I think the great thing about her observation is that the naive view is actually true in practice (to some degree...). It is very difficult to distinguish between a 'legitimate' user that is a human and a bot, at least without resorting to CAPTCHAs, and if the benefit of subverting the CAPTCHAs is high enough then you can just get humans to solve them.

kall4y ago· 13 in thread

When they have a GraphQL API with introspection enabled, it feels like discovering a pot of gold.

This happens more often than you would expect, even without any auth sometimes. At that point you're basically developing with the same DX as internal developers.

My theory is people just turn off the GraphiQL endpoint on their GraphQL server and think they have hidden the schema, not realizing any external tool can do the introspection. Either that or it's developers slipping a little something under the radar for other developers (same thing with source maps).

Another tip: If the service in question has a mobile app, sniffing the traffic on that with a MITM proxy can yield more interesting results than a web app.
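
As a rough illustration of why turning off GraphiQL hides nothing: introspection is just an ordinary GraphQL query sent to the same endpoint. The sketch below builds such a request with the standard library and pulls the type names out of a response. The endpoint URL is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Standard GraphQL introspection fields: ask the schema for its types and their fields.
INTROSPECTION_QUERY = """
{
  __schema {
    queryType { name }
    types { name kind fields { name } }
  }
}
"""


def introspection_request(endpoint: str) -> urllib.request.Request:
    """Build a POST request carrying the introspection query as a JSON body."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def type_names(response_json: dict) -> list:
    """Extract the sorted type names from a decoded introspection response."""
    return sorted(t["name"] for t in response_json["data"]["__schema"]["types"])
```

If the server has introspection disabled, the response will typically contain an `errors` entry instead of `data`, so `type_names` would raise; a real script should check for that first.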

tshaddox4y ago

I've always thought it's a bit silly to have a publicly accessible GraphQL API but then turn off introspection. If the only thing you're relying on to prevent someone from knowing about a certain field is that none of your web client code currently requests that field, you're already in a pretty flimsy predicament. And even then, people could trivially check for common or expected field names, or even brute force a lot of short field names.

If you really intend for your GraphQL API to be used only internally and from your official web client, and you consider any fields not currently requested in your web client to be highly sensitive, you should really turn off public access to the full GraphQL API and use something like GraphQL's persisted queries where your web client requests queries by an opaque unique identifier rather than the full text of the query.
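
The persisted-query idea above can be sketched server-side as an allowlist lookup. This is a minimal illustration, not any particular GraphQL server's implementation: the class name, the choice of a SHA-256 hex digest as the opaque identifier, and the sample query are all assumptions.

```python
import hashlib


def query_id(query_text: str) -> str:
    """Derive an opaque identifier for a query (here: its SHA-256 hex digest)."""
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()


class PersistedQueries:
    """Allowlist of the queries the official client ships with."""

    def __init__(self, allowed_queries):
        self._by_id = {query_id(q): q for q in allowed_queries}

    def resolve(self, requested_id: str) -> str:
        """Return the stored query text, or refuse anything not registered at build time."""
        try:
            return self._by_id[requested_id]
        except KeyError:
            raise PermissionError("unknown query id") from None


# Hypothetical query registered when the web client is built.
store = PersistedQueries(["{ viewer { name } }"])
print(store.resolve(query_id("{ viewer { name } }")))
```

With this shape, a client can only run queries it already knows the identifier for, so fields that no shipped query touches are not reachable by guessing field names.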

kall4y ago

Well, the gist of the op article is kind of "they can't prevent you from using their internal API", so most services shouldn't try. I think there's still a difference between making people scan your entire frontend code/traffic to find all the edge cases and making them reverse engineer your auth/headers/cookies (hours of work) vs handing them database access after 2 minutes of work. But I appreciate it, and it might be engineers that know this (that preventing access is futile) leaving it on intentionally. I certainly have done that.

slaymaker19074y ago

So it's not only security through obscurity, it's very weak obscurity.

monocasa4y ago

Security through the obscurity of a wedding veil.

Not only is it translucent, but your audience tends to have a better idea than they can directly see at the moment as to what it's hiding.

throwthere4y ago

You may not even need introspection--

https://github.com/nikitastupin/clairvoyance https://github.com/swisskyrepo/GraphQLmap

trever1234y ago

Some GraphQL APIs do this on purpose if the API is meant to be completely public and if they want to allow self discovery and documentation of things through introspection. It allows anyone to point their own instance of GraphiQL or GraphQL Playground at the endpoint and find things out. We even include comments in the schema to help with this as another form of documentation.

pantsforbirds4y ago

I saw a website that exposed the results of a very expensive paid LinkedIn API + the enrichment they did to those results in their GraphQL endpoint. Seemed like an expensive oversight.

robk4y ago

Which site was that??

brazzledazzle4y ago

Just be ready for mitm proxying on some mobile apps to be a bust if they use certificate pinning. I’m not aware of anything that can get you past that besides patching the app itself.

buildfocus4y ago

https://httptoolkit.tech/blog/frida-certificate-pinning/ has a good guide and Frida script that will disable certificate pinning automatically in most cases.

jonatron4y ago

There's plenty of Frida scripts that can disable app certificate pinning

brazzledazzle4y ago

To be fair I’m sure that uses patching but I didn’t know about that tool and how easy it is to use. Thanks for another thing to put in the ol’ bag o’ tricks.

oyebenny4y ago

I love you.

dec0dedab0de4y ago· 6 in thread

I said this on a thread complaining about SPAs a little bit ago, but I love that the SPA trend has caused all kinds of web apps to open up APIs to their users. It's not as fun as pure screen scraping, but it is very exciting when you figure out whatever weird behavior they're expecting, and it starts working.

If you get stuck, look at their javascript, see what it is doing. double check your network requests in developer tools, some of them might be more important than you think, plus it's so nice that we don't have to use burp for this anymore. Some sites check referrers, and user agents, or expect a field from a specific server rendered page to be added to a header. More than one expected a javascript style timestamp on every request.

The weirdest behavior comes from older apps that started as purely server rendered, and slowly added a dynamic frontend. I always cringe when it's obvious that different developers were given tasks over the years, and completed them without bothering to learn the rest of the system.
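
For the "javascript style timestamp" quirk mentioned above: JavaScript's `Date.now()` returns milliseconds since the Unix epoch, whereas Python's `time.time()` returns seconds as a float. A small sketch follows; the `X-Timestamp` header name is purely hypothetical, since each site picks its own.

```python
import time


def js_timestamp(now=None) -> int:
    """Milliseconds since the Unix epoch, like Date.now() in JavaScript."""
    seconds = time.time() if now is None else now
    return int(seconds * 1000)


def spa_style_headers(page_url: str, now=None) -> dict:
    """Headers a picky endpoint might expect: a Referer plus a per-request timestamp."""
    return {
        "Referer": page_url,
        "X-Timestamp": str(js_timestamp(now)),  # hypothetical header name
    }


print(spa_style_headers("https://example.com/dashboard"))
```

The `now` parameter exists only so the function can be exercised with a fixed time.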

throwawayboise4y ago

Sometimes it is when two or three different products were acquired and then clumsily glued together in a single web UI. You can often spot these by jarring changes in conventions and behavior in different areas of the app.

klenwell4y ago

As I started to try to track down government sources of Covid data a couple years ago, I soon discovered this approach was generally much more efficient than consulting any official documentation.

8n4vidtmkvmk4y ago

lol you think it's different developers but it's just me evolving over 8 years and not bothering to update the old shit

pjmlp4y ago

We didn't need SPAs for that. Ajax, XML-RPC and SOAP have existed since around 1999.

jaredsohn4y ago

Yes, SPAs aren't required but they make it more likely that the site will use AJAX to request data instead of rendering the page with server data PHP-style.

duxup4y ago

It is pretty handy.

I work on some SPAs and some server side rendered systems.

It's so nice to fire up the network tab and see some of the requests right away to troubleshoot.

Server side rendered stuff, not so easy. Not impossible and you can always add some debugging, but the nature of SPAs to just call all the things that are easily seen, very nice. And I can use that elsewhere.

helsinki4y ago· 5 in thread

You would be surprised to find out that some web servers are capable of detecting browser emulation through curl or Python’s requests lib. Try programmatically scrolling through Instagram photos. It will work if you use curl, but it will not work using Python’s requests lib. Not sure how they detect it - maybe related to timing of packets.

chockchocschoir4y ago

The most trivial check a website owner can do is checking the user-agent, which Python requests automatically sets to show its name, unless you configure your own. Trivial way to work around is to set your own user-agent to one that looks like a browser.

Specifically regarding Instagram, you can take a look at the implementation of https://github.com/dilame/instagram-private-api to understand more workarounds, as Instagram is getting better and better at working against the workarounds.

jmt_4y ago

In the particular case of Instagram as GP mentions, I'm guessing the devs don't go off of user agent since curl's default user agent is "curl/<installed version num>". Even if they are going off user agent, it seems strange to block requests but not curl. GP doesn't mention whether they tried changing the user agent; I would be interested to know if Instagram can guess whether the client is curl or requests based on other heuristics.

RF_Savage4y ago

I wonder if they have some internal tooling or monitoring that use curl. And thus blocking it would break things.

abdusco4y ago

This tool claims to replicate Firefox/Chrome's TLS handshake signature:

https://github.com/lwthiker/curl-impersonate

I haven't tried it, haven't really come across a service that blocks curl, but I'll be keeping an eye on it in case I need it.

octoberfranklin4y ago

The clowns who run the Seattle Times's website block all non-browser user-agent requests to their RSS feeds.

Except curl.

You can "curl" their RSS feed. You can open it in a browser. Anything else that doesn't lie about its User-Agent will fail.

W T F.

Somebody please go strangle those people. I had to set my RSS feed reader to impersonate curl's User-Agent.

gfd4y ago· 4 in thread

I found puppeteer very nice to script against if you need a real headless browser:

https://github.com/puppeteer/puppeteer

simonw4y ago

I've just started switching from Puppeteer to Playwright - pretty much the exact same functionality, but in a more actively maintained, tighter package (and with great language bindings for JavaScript, Python, .NET and Java).

I wrote a bit about that here: https://simonwillison.net/2022/Mar/10/shot-scraper/#how-it-w...

ydant4y ago

[Playwright](https://playwright.dev/) (Node / Python) is my current preferred - mainly because I seem to have fewer reliability issues with the browser starting/stopping cleanly (although it's never perfect with any of the tools I've tried).

dorianmariefr4y ago

I like to use capybara https://github.com/teamcapybara/capybara

lysecret4y ago

I used selenium. Really like it and very well maintained.

cameroncairns4y ago· 3 in thread

Really great techniques listed in this thread! I wanted to point out though that it's generally nicer to the website owner if you enable `Accept-Encoding: gzip, deflate`. The difference in the amount of bandwidth charges for the site owner is quite significant, especially should you want to do comprehensive crawls.

Yes, go ahead and disable that header when piping curl's output into `less`, however when converting the curl request into python just remember to re-add that header. Pretty much every python library I've used to handle web requests will automatically unzip the response from the server so you don't need to futz about with the zipping/unzipping logic yourself.
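
For anyone whose HTTP client does not decompress automatically, the logic is small. The sketch below is an assumption-laden illustration using only the standard library: it handles the two encodings named above and treats HTTP "deflate" as zlib-wrapped data, which is what the spec describes, though some servers are known to send raw deflate instead.

```python
import gzip
import zlib
from typing import Optional

# The request header discussed above; add it back when porting a curl command.
REQUEST_HEADERS = {"Accept-Encoding": "gzip, deflate"}


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo the compression named in the response's Content-Encoding header."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # Assumes the zlib container; raw-deflate servers would need zlib.decompress(body, -15).
        return zlib.decompress(body)
    if encoding == "identity":
        return body
    raise ValueError(f"unsupported Content-Encoding: {encoding!r}")


# Round-trip with locally compressed data, no network involved.
print(decode_body(gzip.compress(b'{"ok": true}'), "gzip"))
```

Only advertise encodings you can actually decode; that is the risk raised in the reply below.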

Nextgrid4y ago

Your HTTP client library is likely to set that by itself to a value it can understand. Setting it manually risks setting it to something your library can’t actually decode when it gets the response.

Klonoar4y ago

No, some HTTP clients actually require you to set it - you wouldn't set the header directly, sure, but you would enable gzip/etc. Their point is super valid.

1vuio0pswjnm74y ago

There have been some very popular websites that ignore Accept-Encoding and only send compressed data. Sometimes I want uncompressed responses. I always have the urge to complain about these websites on HN but I sense that HN commenters/voters would be unsympathetic. (I do not use curl nor python.)

01acheru4y ago· 2 in thread

I need to point something out to people doing this kind of thing to other people's websites/webapps/whatever:

Having done this multiple times, be aware that you can break other people's stuff by messing up requests. Most web APIs suck and some won't behave nicely on unexpected failures.

1. When trying to automate a process on an energy management platform I ended up creating resources under some kind of master account, some things broke and they had to manually clean the DB.

2. When trying to access an operation I couldn't do via the provided API I reverse engineered the API of their admin dashboard. It sucked really bad, with a lot of strange sync tokens that felt like going back 20 years. Anyway, my implementation wasn't perfect and it ground their platform to a halt.

I could go on, so please only do stuff like that if you're in contact with the people on the other side. If you're not, limit yourself to GETs.

trinovantes4y ago

Bold of you to assume GET requests do not have side effects

sokoloff4y ago

It’s amusing to think the same devs who cobbled together a pile of otherwise fragile excrement were somehow careful to make sure that GETs were side-effect free.

getcrunk4y ago· 2 in thread

So I just checked the WhatsApp web app. No network activity whatsoever on a fully loaded page that has incoming messages. And then a bunch of error messages in the console about WebSockets and source maps. How did they pull that off? Does Chrome not show WebSocket activity or service worker activity on the network tab?

plibither84y ago

You can definitely see the incoming/outgoing messages in the DevTools! Since it's a WebSocket connection, you must have the DevTools open and reload the page. Filter for the WebSocket network activity (you can quickly do this by selecting "WS"), and you'll find the WebSocket connection. Clicking on it and selecting the "Messages" sub-tab will let you see the live list of binary messages sent and received by the connection. Not too meaningful though, unfortunately.

getcrunk4y ago

ah ur right! thanks.

captn3m04y ago· 1 in thread

I love doing this, especially to liberate content that is locked away in an app-only world otherwise. That's one important usecase that I'd love more people to work on - it is a great way to start with reverse-engineering, and building simple websites.

Pro-tip: If the undocumented API sends an `Access-Control-Allow-Origin: *` response header (i.e. CORS is wide open), you can call it directly from the browser on your own domain, without having to proxy it or use curl.

As an example, I published https://captnemo.in/plugo/ this week that calls the Plugo.io private API (the ones used by the mobile app) to fetch the data, and publish it using GitHub Pages. The data is just a list of places where Plugo provides powerbanks on rent (500+ locations, mostly concentrated across 3 Indian cities, and 2 places in Germany somehow). I'm running a simple curl command on a scheduled GitHub Action that commits back to itself so the data remains updated.

I similarly did this to make a nocode frontend for another "clubhouse-alternative" which would keep recordings, but only provide them in-app. A friend wanted to listen to his prior recordings, but the app was too cumbersome, so I made an alternative frontend that would call the private API, and render a simple table with MP4 links for all recordings.

I even use this as a "nocode testing ground"[1] for many of the new nocode apps in the market - seeing if they are feasible enough to build fully functional frontends on top of existing APIs (which would be great for someone like me).

As a bonus, this works as an alternative-data stream for i) Plugo's growth metrics, if you were an investor or interested in the "rent-powerbank" space, as well as ii) finding out cool new places to visit around you.

[1]: https://news.ycombinator.com/item?id=29243536
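
The CORS tip in this comment refers to the `Access-Control-Allow-Origin` response header. Before building a browser-only frontend on someone's API, you can check for it from a script. A minimal sketch, assuming you already have the response headers as a plain mapping (for example from a saved curl `-i` output); the function name is made up for illustration.

```python
def allows_any_origin(response_headers: dict) -> bool:
    """True if the response opts in to cross-origin reads from any site."""
    for name, value in response_headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "access-control-allow-origin":
            return value.strip() == "*"
    return False


print(allows_any_origin({"Content-Type": "application/json",
                         "Access-Control-Allow-Origin": "*"}))
```

A wildcard only covers requests without credentials; browsers refuse to combine `*` with cookies, so this works for public data rather than logged-in sessions.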

slaymaker19074y ago

They can still prevent you from sending requests from another domain by looking at the origin header. AFAIK, origin inspection is actually more secure since no OPTIONS request is sent for GET requests. If CORS doesn't allow a GET request, what typically happens is the request is still made, but the browser tells the requestor that the request failed. Therefore, you could get timing attacks or something and you have to deal with additional load. Just inspecting the origin header can be done with a lot less resources than looking up a bunch of data in the database to service some request.

1vuio0pswjnm74y ago· 1 in thread

"The answer is sort of yes - browsers aren't magic! All the information browsers send to your backend is just HTTP requests. So if I copy all of the HTTP headers that my browser is sending, I think there's literally no way for the backend to tell that the request isn't sent by my browser and is actually being sent by a random Python program."

There is a way.^1 One might need to copy the static elements of the TLS Client Hello in addition to certain HTTP headers.

1. https://blog.squarelemon.com/tls-fingerprinting/

See, e.g., https://github.com/refraction-networking/utls

"problem 1: expiring session cookies

One big problem here is that I'm using my Google session cookie for authentication, so this script will stop working whenever my browser session expires.

That means that this approach wouldn't work for a long running program (I'd want to use a real API), but if I just need to quickly grab a little bit of data as a 1-time thing, it can work great!"

Sometimes Google keeps users logged in. For example, session cookies in Gmail will last for months or more. This makes it easy to check Gmail from the command line without a browser. It also means if someone steals a session cookie and the user never logs out, e.g., she closes the browser without logging out first,^2 then the thief can access the account for months, or longer.

2. Of course, it is also possible to logout and disable specific session cookies from the command line, without a browser.

epitactic4y ago

The first problem can be solved with curl-impersonate: https://github.com/lwthiker/curl-impersonate

"A special compilation of curl that makes it impersonate Chrome & Firefox", and it now can also impersonate Edge and Safari.

Previously discussed: https://news.ycombinator.com/item?id=30378562 _Show HN: Curl modified to impersonate Firefox and mimic its TLS handshake_ (21 days ago, 58 comments)

cehrlich4y ago· 1 in thread

Undocumented APIs are great when you only need to use them for a short amount of time, but if you try to build anything long term on top of them you should keep in mind that there could be changes that completely break your stuff, unannounced, at any time.

throwawayboise4y ago

You should, at least, build your own shim between your app and the API. That way, if there are changes, hopefully the fixes (if they are possible) are at least confined to one place.
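
One way such a shim might look, as a sketch: the application only ever sees its own `Recipe` type, and every detail of the undocumented API (path, field names) lives in one class. The endpoint path and the `title` / `cook_time_min` field names are invented for illustration, and the transport is injected so the shim can be exercised without a network.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Recipe:
    """The shape the rest of the app depends on, independent of the remote API."""
    name: str
    minutes: int


class RecipeApi:
    """The only place that knows the undocumented API's paths and field names."""

    def __init__(self, fetch_json: Callable[[str], dict]):
        self._fetch_json = fetch_json  # e.g. a function wrapping your HTTP client

    def get_recipe(self, recipe_id: int) -> Recipe:
        raw = self._fetch_json(f"/v1/recipes/{recipe_id}")  # hypothetical path
        # If the API renames a field, this mapping is the single line to fix.
        return Recipe(name=raw["title"], minutes=raw["cook_time_min"])


# Usage with a stand-in transport instead of real HTTP.
api = RecipeApi(lambda path: {"title": "Soup", "cook_time_min": 25})
print(api.get_recipe(7))
```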

burnished4y ago· 1 in thread

>>If I’m using a small website, there’s a chance that my little Python script could take down their service because it’s doing way more requests than they’re able to handle. So when I’m doing this I try to be respectful and not make too many requests too quickly.

What is a reasonable rate to send requests? I've done a little scraping and I wanted to do the same thing but I realized I had no idea what would be considered acceptable use and what would be unacceptable. If anyone has a heuristic they like to use I'm all ears.

ydant4y ago

If you're going to be doing it manually regardless (if not automated), then as far as I'm concerned, you could definitely just use a "normal clicking speed" rate - so a second or two between clicks is probably just fine and non-parallel requests. Usually if it's likely to overload the server, it's probably slower to return, too, so the server itself will slow the requests down naturally if you're not using parallel requests.

Usually you're automating these things not to get the job done that much faster, but instead just to do it without all the tedium, so a slow but asynchronous scrape is fine.
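
The "normal clicking speed" heuristic above amounts to sequential requests with a pause between them. A minimal sketch, where the 1.5 second default is just one reading of "a second or two" and the `fetch` callable stands in for whatever HTTP client you use:

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    delay_seconds: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[object]:
    """Fetch URLs one at a time, pausing between requests (never in parallel)."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause before the first request
        # Each fetch blocks until the server answers, so a slow server slows us down too.
        results.append(fetch(url))
    return results
```

Because each call waits for the previous response, an overloaded server naturally throttles the script, which is the self-limiting behaviour the comment describes.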

isbvhodnvemrwvn4y ago

I would also add that any search boxes are typically keys to the kingdom if you're scraping shops/job boards or similar things. They are often not hardened, so you can file e.g. an empty query (even if frontend doesn't allow it), or effectively disable pagination by requesting 1000000 results per page.

octoberfranklin4y ago

> there’s literally no way for the backend to tell that the request isn’t sent by my browser and is actually being sent by a random Python program.

This is wrong, and the fact that somebody clearly experienced in web development is totally unaware that it is wrong should be a clear sign of the danger.

For starters: TLS fingerprinting, ETag fingerprinting (including subtle browser-to-browser changes in how ETags are cached and evicted), JS VM fingerprinting, timing side channels, there is a massive list here. And then there's wasm...

kjgkjhfkjf4y ago

It's more robust not to remove the extra headers IMO. Otherwise you give an unnecessary signal to the backend that the traffic's not coming from the expected sources.

It also makes the process of writing your code more mechanical, which is useful since you'll likely have to redo the process when the API changes.

1vuio0pswjnm74y ago

"I usually just figure out which headers I can delete with trial and error - I keep removing headers until the request starts failing. In general you probably don't need Accept, Referer, Sec-, DNT, User-Agent, and caching headers though."

IME, this "header minimisation" works for almost any website, or "endpoint". IOW, it is useful outside of "APIs". As a matter of practice, I minimise headers automatically with a forward proxy.^1

Thus, one can send less data to "tech" companies and still receive the same results. We know that data received by "tech" companies is used at every opportunity to support surveillance and online advertising. The most well-known example is perhaps "fingerprinting". Given a choice between sending more data or less data to "tech" companies, what is the choice that, in the aggregate,^2 lends itself better to increased surveillance and online advertising?

If the author here can send fewer headers and still get the desired result, then it stands to reason sending those extra headers benefits someone else besides the user. Send more data, not less, to make surveillance and online advertising easier. "Tech" companies will often defend data collection by suggesting that data supplied in headers are being used to "improve the user experience" or some such, and this may well be true for many cases, but the "fingerprinting" example exemplifies how there can also be another purpose. Data can be multi-purpose.

1. An added benefit is one does not need to fiddle with the browser to copy HTTP headers^3 as they are all easily accessible in the proxy logs.

2. Here, "in the aggregate" means "if every user makes the same choice".

3. The online advertising company or its business partner (e.g., Mozilla) could change the browser, without notice, at any time.
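
The trial-and-error header removal quoted at the top of this comment can be automated. The sketch below is one possible reading of it, not the commenter's proxy setup: it drops one header at a time and keeps the removal only if a caller-supplied check still passes. The `still_works` callable is an assumption; in practice it would send the request with the trial headers and inspect the response.

```python
from typing import Callable, Dict


def minimise_headers(
    headers: Dict[str, str],
    still_works: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Greedily remove headers, keeping only those the request fails without."""
    kept = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial  # the request survived without this header, so drop it
    return kept


# Stand-in check: pretend the endpoint only cares about the Cookie header.
copied = {"Accept": "*/*", "Cookie": "session=abc", "DNT": "1", "Referer": "https://example.com/"}
print(minimise_headers(copied, lambda h: "Cookie" in h))
```

Each trial costs one real request, so the rate-limiting advice elsewhere in this thread applies here as well.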

SahAssar4y ago

I do a bit of scraping for hobby projects, and much of that comes down to basically this (but I do it in node instead of python). Sometimes you need to use jsdom or puppeteer, but the second step (after checking if there are official data dumps made available or some official API) is always checking the full data flow in devtools if there is some undocumented way to more quickly get the raw data I want.

simonw4y ago

A trick that works great for me: filter the browser network pane by XHR, then sort by size - this usually ends up with the most interesting JSON responses listed at the top.
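
The same "sort by size" trick can be applied offline if you export the network pane as a HAR file. The sketch below relies only on fields defined by the HAR 1.2 format (`log.entries[].request.url`, `response.content.mimeType`, `response.content.size`); the function names and the file path are made up, and it filters on JSON MIME types rather than on the XHR filter itself.

```python
import json


def load_har(path: str) -> dict:
    """Read a HAR export (it is plain JSON) from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def largest_json_responses(har: dict, limit: int = 5) -> list:
    """Return (size, url) pairs for JSON responses, biggest first."""
    rows = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        if "json" in content.get("mimeType", ""):
            rows.append((content.get("size", 0), entry["request"]["url"]))
    rows.sort(reverse=True)
    return rows[:limit]


# Usage, assuming a hypothetical export saved as "site.har":
# for size, url in largest_json_responses(load_har("site.har")):
#     print(size, url)
```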

theblazehen4y ago

If you still use the website via browser, I find https://github.com/richardpenman/browsercookie/ is great for working around the expiring cookie problem

don-code4y ago

While I've successfully used this method for public APIs, I ran into an interesting one not long ago: where authentication is performed _by IP address_.

I have a switch (I think a TP-Link TL-SG1016PE) with PoE - and a finicky PoE device that periodically needs a reboot, so I figured I'd replay turning the port on and off in the Web interface. Notably, logging in does not issue me any authentication token, but I can still turn the port on and off - and can still do it via `curl`, too. But as soon as I try it on another machine? Access denied!

(Yes, I could just fake the login process the same way, but that was more work than I had time for.)

joshstrange4y ago

It's always a joy when you start to reverse engineer an undocumented API and find out it is cleaner/nicer than some paid APIs you've used. Paprika (Cloud sync for the recipes/other data) was an example of that for me. Their API is (was, it's been a minute since I last looked at it) super RESTful and really easy to reason about, more or less just simple CRUD.

slaymaker19074y ago

The copy as cURL is a great idea! That makes it easy to get a succinct summary of the components of the request, including how they are doing auth. If the API in question is used by a desktop app, Fiddler can be a great alternative. Obviously Wireshark can see more, but Fiddler is a lot easier to use and set up in my experience.

moron4hire4y ago

Small nitpick on the comments about removing the headers that the browser request had made.

You probably don't want Accept: */*. If the value of Accept is anything other than */*, then you probably want it.

jeffrallen4y ago

Julia is really an excellent teacher.

tkanarsky4y ago

I used this approach last year to run a Twitter bot that would report when local pharmacies had 'rona vaccine appointments open up. I scraped the APIs of CVS, Rite-Aid, Walgreens, and a few other chains this way. Although I didn't go fancy and try to distill the API down to the bare minimum headers, I just called into cURL from Python with that giant command as a string.
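
Calling the copied command from Python could look roughly like the sketch below. It is an assumption about the approach, not the commenter's actual code: it splits the string into arguments instead of passing it to a shell, and it joins the backslash-newline continuations that "Copy as cURL" output usually contains. Commands using bash-only `$'...'` quoting are not handled.

```python
import shlex
import subprocess


def curl_argv(copied_command: str) -> list:
    """Turn a pasted 'Copy as cURL' string into an argument list."""
    # Join line continuations first; shlex would otherwise keep the newline as an argument.
    one_line = copied_command.replace("\\\n", " ")
    argv = shlex.split(one_line)
    if not argv or argv[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")
    return argv


def run_curl(copied_command: str, timeout: float = 30.0) -> str:
    """Run curl without a shell and return the response body as text."""
    result = subprocess.run(
        curl_argv(copied_command),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if curl exits with a non-zero status
    )
    return result.stdout
```

Avoiding `shell=True` means cookies and tokens embedded in the command are passed as plain arguments and never interpreted by a shell.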

ipnon4y ago

gobuster is an effective way to enumerate subdomains and their directories quickly.

https://github.com/OJ/gobuster

j / k navigate · click thread line to collapse

91 comments

81 comments · 26 top-level


octoberfranklin4y ago

> there is something wrong with the economic system. ... I'm a digger I really don't want rock the boat

Spivak4y ago

zamadatix4y ago


yakshaving_jgt4y ago

Even if that were so, it strikes me as odd to characterise the author — one of the world's more accomplished software professionals — as blissfully ignorant.

vincentmarle4y ago

Browser calls (and sessions) are indeed tricky to emulate - you'll generally have much better luck with reverse engineering mobile client API calls.

boilerupnc4y ago

Fun stroll down memory lane.

aryamaan4y ago

Any resources for that, like this post? I am going to Google as well, but wanted to ask in case people already had something off the top of their head.


Dunedan4y ago

Her example code doesn't even set a user agent header, making it trivial to distinguish these requests from ones an actual browser would make.

MaxDPS4y ago

Yup, she is using the Requests library which has a default header that explicitly states the request is coming from Python Requests library.
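To make the default-header point concrete without any third-party install, here is a small sketch using the standard library's urllib rather than Requests; it behaves analogously, announcing itself as Python-urllib unless told otherwise. BROWSER_UA is a placeholder value and browser_like_request is a hypothetical helper name.

```python
import urllib.request

# What the standard library's opener announces when no User-Agent is set,
# e.g. "Python-urllib/3.12". Requests does something similar under its own name.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])

# Placeholder browser-style value; copy the real string from your own browser.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def browser_like_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


req = browser_like_request("https://example.com/api")
# urllib stores header names capitalized, hence "User-agent" on lookup.
print(req.get_header("User-agent"))
```

Setting the header only removes the most obvious tell; it does nothing about the other detection signals discussed elsewhere in the thread.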

naniwaduni4y ago

tshaddox4y ago

benmmurphy4y ago

kall4y ago· 13 in thread

When they have a GraphQL API with introspection enabled, it feels like discovering a pot of gold.

This happens more often than you would expect, even without any auth sometimes. At that point you're basically developing with the same DX as internal developers.

Another tip: If the service in question has a mobile app, sniffing the traffic on that with a MITM proxy can yield more interesting results than a web app.
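For readers who haven't seen introspection in action, a rough sketch of what asking a GraphQL endpoint for its schema might look like. The endpoint URL is a placeholder and introspection_request / type_names are hypothetical helper names; the __schema query itself is standard GraphQL.

```python
import json
import urllib.request

# A deliberately small introspection query: list type names and their fields.
INTROSPECTION_QUERY = """
query {
  __schema {
    types {
      name
      fields { name }
    }
  }
}
"""


def introspection_request(endpoint: str) -> urllib.request.Request:
    """Build (but do not send) a POST carrying the introspection query."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def type_names(response_json: dict) -> list[str]:
    """Pull the non-internal type names out of an introspection response."""
    types = response_json["data"]["__schema"]["types"]
    return sorted(t["name"] for t in types if not t["name"].startswith("__"))
```

Sending the request with urllib.request.urlopen and passing the decoded JSON to type_names gives a quick inventory of what the API exposes, assuming the server has introspection enabled.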

tshaddox4y ago

kall4y ago

slaymaker19074y ago

So it's not only security through obscurity, it's very weak obscurity.

monocasa4y ago

Security through the obscurity of a wedding veil.

Not only is it translucent, but your audience tends to have a better idea than they can directly see at the moment as to what it's hiding.

throwthere4y ago

You may not even need introspection--

https://github.com/nikitastupin/clairvoyance https://github.com/swisskyrepo/GraphQLmap

trever1234y ago

pantsforbirds4y ago

I saw a website that exposed the results of a very expensive paid LinkedIn API + the enrichment they did to those results in their GraphQL endpoint. Seemed like an expensive oversight.

robk4y ago

Which site was that??

brazzledazzle4y ago

Just be ready for MITM proxying on some mobile apps to be a bust if they use certificate pinning. I’m not aware of anything that can get you past that besides patching the app itself.

buildfocus4y ago

https://httptoolkit.tech/blog/frida-certificate-pinning/ has a good guide and Frida script that will disable certificate pinning automatically in most cases.

jonatron4y ago

There are plenty of Frida scripts that can disable app certificate pinning.

brazzledazzle4y ago

To be fair I’m sure that uses patching but I didn’t know about that tool and how easy it is to use. Thanks for another thing to put in the ol’ bag o’ tricks.

oyebenny4y ago

I love you.

dec0dedab0de4y ago· 6 in thread

throwawayboise4y ago

klenwell4y ago

As I started to try to track down government sources of Covid data a couple years ago, I soon discovered this approach was generally much more efficient than consulting any official documentation.

8n4vidtmkvmk4y ago

lol you think it's different developers but it's just me evolving over 8 years and not bothering to update the old shit

pjmlp4y ago

We didn't need SPAs for that. Ajax, XML-RPC and SOAP have existed since around 1999.

jaredsohn4y ago

Yes, SPAs aren't required but they make it more likely that the site will use AJAX to request data instead of rendering the page with server data PHP-style.

duxup4y ago

It is pretty handy.

I work on some SPAs and some server side rendered systems.

It's so nice to fire up the network tab and see some of the requests right away to troubleshoot.


helsinki4y ago· 5 in thread

chockchocschoir4y ago

jmt_4y ago

RF_Savage4y ago

I wonder if they have some internal tooling or monitoring that uses curl, so that blocking it would break things.

abdusco4y ago

This tool claims to replicate Firefox/Chrome's TLS handshake signature:

https://github.com/lwthiker/curl-impersonate

I haven't tried it, haven't really come across a service that blocks curl, but I'll be keeping an eye on it in case I need it.

octoberfranklin4y ago

The clowns who run the Seattle Times's website block all non-browser user-agent requests to their RSS feeds.

Except curl.

You can "curl" their RSS feed. You can open it in a browser. Anything else that doesn't lie about its User-Agent will fail.

W T F.

Somebody please go strangle those people. I had to set my RSS feed reader to impersonate curl's User-Agent.

gfd4y ago· 4 in thread

I found puppeteer very nice to script against if you need a real headless browser:

https://github.com/puppeteer/puppeteer

simonw4y ago

I wrote a bit about that here: https://simonwillison.net/2022/Mar/10/shot-scraper/#how-it-w...

ydant4y ago

dorianmariefr4y ago

I like to use capybara https://github.com/teamcapybara/capybara

lysecret4y ago

I used Selenium. I really like it, and it's very well maintained.

cameroncairns4y ago· 3 in thread

Nextgrid4y ago

Klonoar4y ago

No, some HTTP clients actually require you to set it - you wouldn't set the header directly, sure, but you would enable gzip/etc. Their point is super valid.

1vuio0pswjnm74y ago

01acheru4y ago· 2 in thread

I need to point something out to people doing that kind of thing to other people's websites/webapps/whatever:

Having done this multiple times, be aware that you can break other people's stuff by messing up requests. Most web APIs suck and some won't behave nicely on unexpected failures.

1. When trying to automate a process on an energy management platform I ended up creating resources under some kind of master account, some things broke and they had to manually clean the DB.

I could go on, so please only do stuff like that if you're in contact with the people on the other side. If you're not, limit yourself to GETs.
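One way to hold yourself to the GET-only advice is to put a guard in the scraper itself. A minimal sketch, with SAFE_METHODS, UnsafeMethodError, and read_only_request all being hypothetical names:

```python
import urllib.request
from typing import Optional

# Methods HTTP defines as "safe", i.e. not intended to change server state.
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


class UnsafeMethodError(Exception):
    """Raised when the script tries to send a state-changing request."""


def read_only_request(
    url: str,
    method: str = "GET",
    headers: Optional[dict] = None,
) -> urllib.request.Request:
    """Build a request, refusing anything that is not a safe method."""
    method = method.upper()
    if method not in SAFE_METHODS:
        raise UnsafeMethodError(f"refusing to send {method} to {url}")
    return urllib.request.Request(url, headers=headers or {}, method=method)
```

This only checks the verb on the client side. It cannot tell whether a particular server treats its GET endpoints as side-effect free.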

trinovantes4y ago

Bold of you to assume GET requests do not have side effects

sokoloff4y ago

It’s amusing to think the same devs who cobbled together a pile of otherwise fragile excrement were somehow careful to make sure that GETs were side-effect free.

getcrunk4y ago· 2 in thread

plibither84y ago

getcrunk4y ago

ah ur right! thanks.

captn3m04y ago· 1 in thread

Pro-tip: If the undocumented API sends a permissive CORS header (Access-Control-Allow-Origin: *), you can call it directly from the browser on your own domain, without having to proxy it or use curl.

[1]: https://news.ycombinator.com/item?id=29243536
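The header behind the "CORS:*" shorthand is Access-Control-Allow-Origin: *. A small sketch of checking for it in a response's headers, with allows_any_origin as a hypothetical helper name:

```python
def allows_any_origin(response_headers: dict[str, str]) -> bool:
    """True if the response carries Access-Control-Allow-Origin: *."""
    # Header names are case-insensitive, so normalize before looking up.
    normalized = {k.lower(): v.strip() for k, v in response_headers.items()}
    return normalized.get("access-control-allow-origin") == "*"
```

For example, the dict(response.headers) from a urllib response could be passed straight in. A specific origin in that header, or no header at all, means the browser will block cross-origin reads from your own page.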

slaymaker19074y ago

1vuio0pswjnm74y ago· 1 in thread

There is a way.^1 One might need to copy the static elements of the TLS Client Hello in addition to certain HTTP headers.

1. https://blog.squarelemon.com/tls-fingerprinting/

See, e.g., https://github.com/refraction-networking/utls

"problem 1: expiring session cookies

One big problem here is that I'm using my Google session cookie for authentication, so this script will stop working whenever my browser session expires.

That means that this approach wouldn't work for a long running program (I'd want to use a real API), but if I just need to quickly grab a little bit of data as a 1-time thing, it can work great!"

2. Of course, it is also possible to log out and disable specific session cookies from the command line, without a browser.

epitactic4y ago

The first problem can be solved with curl-impersonate: https://github.com/lwthiker/curl-impersonate

"A special compilation of curl that makes it impersonate Chrome & Firefox", and it now can also impersonate Edge and Safari.

Previously discussed: https://news.ycombinator.com/item?id=30378562 _Show HN: Curl modified to impersonate Firefox and mimic its TLS handshake_ (21 days ago, 58 comments)

cehrlich4y ago· 1 in thread

throwawayboise4y ago

You should, at least, build your own shim between your app and the API. That way, if there are changes, hopefully the fixes (if they are possible) are at least confined to one place.
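A rough sketch of what such a shim could look like. Everything here is hypothetical: RecipeClient, Recipe, the /recipes/ path, and the "uid" and "name" keys stand in for whatever the undocumented API really returns.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Recipe:
    """The shape the rest of the app sees, independent of the upstream JSON."""
    id: str
    title: str


class RecipeClient:
    """The only place in the app that knows the upstream paths and JSON keys."""

    def __init__(self, fetch_json: Callable[[str], dict]):
        # The transport is injected, so tests can pass a fake and need no network.
        self._fetch_json = fetch_json

    def get_recipe(self, recipe_id: str) -> Recipe:
        raw = self._fetch_json(f"/recipes/{recipe_id}")
        # If the upstream renames a field, this mapping is the one place to fix.
        return Recipe(id=raw["uid"], title=raw["name"])
```

The app only ever handles Recipe objects, so a change in the undocumented API should surface as a failure inside RecipeClient rather than all over the codebase.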

burnished4y ago· 1 in thread

ydant4y ago

Usually you're automating these things not to get the job done that much faster, but instead just to do it without all the tedium, so a slow but asynchronous scrape is fine.

isbvhodnvemrwvn4y ago

octoberfranklin4y ago

> there’s literally no way for the backend to tell that the request isn’t sent by my browser and is actually being sent by a random Python program.

This is wrong, and the fact that somebody clearly experienced in web development is totally unaware that it is wrong should be a clear sign of the danger.

kjgkjhfkjf4y ago

It's more robust not to remove the extra headers IMO. Otherwise you give an unnecessary signal to the backend that the traffic's not coming from the expected sources.

It also makes the process of writing your code more mechanical, which is useful since you'll likely have to redo the process when the API changes.

1vuio0pswjnm74y ago

IME, this "header minimisation" works for almost any website, or "endpoint". IOW, it is useful outside of "APIs". As a matter of practice, I minimise headers automatically with a forward proxy.^1

1. An added benefit is one does not need to fiddle with the browser to copy HTTP headers^3 as they are all easily accessible in the proxy logs.

2. Here, "in the aggregate" means "if every user makes the same choice".

3. The online advertising company or its business partner (e.g., Mozilla) could change the browser, without notice, at any time.
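For those without a forward proxy set up, header minimisation can also be done by plain trial and error. The sketch below is that swapped-in approach, not the proxy method described above; minimise_headers and the still_works callback are hypothetical names, and the callback is assumed to send the request and report whether the response still looks right.

```python
from typing import Callable, Mapping


def minimise_headers(
    headers: Mapping[str, str],
    still_works: Callable[[dict[str, str]], bool],
) -> dict[str, str]:
    """Greedily drop each header the request turns out not to need."""
    kept = dict(headers)
    for name in list(headers):
        # Try the request without this one header.
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial  # the server did not miss it, so leave it out
    return kept
```

Each probe is a real request, one per header, so this is best kept to read-only endpoints and run once rather than on every call.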

SahAssar4y ago

simonw4y ago

A trick that works great for me: filter the browser network pane by XHR, then sort by size - this usually ends up with the most interesting JSON responses listed at the top.
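The same trick can be applied offline to a HAR file exported from the network pane. A minimal sketch, with two assumptions called out: it filters on a JSON MIME type rather than on the browser's XHR category, and load_har / largest_json_responses are hypothetical helper names.

```python
import json


def load_har(path: str) -> dict:
    """Read a HAR file exported from the browser's network pane."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def largest_json_responses(har: dict, limit: int = 10) -> list[tuple[int, str]]:
    """Return (size, url) pairs for JSON responses, biggest first."""
    rows = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        # Assumption: match on MIME type instead of the XHR resource category.
        if "json" in content.get("mimeType", ""):
            rows.append((content.get("size", 0), entry["request"]["url"]))
    return sorted(rows, reverse=True)[:limit]
```

Running largest_json_responses(load_har("session.har")) would list the heaviest JSON endpoints first, which is usually where the interesting data lives.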

theblazehen4y ago

If you still use the website via browser, I find https://github.com/richardpenman/browsercookie/ is great for working around the expiring cookie problem

don-code4y ago

While I've successfully used this method for public APIs, I ran into an interesting one not long ago: where authentication is performed _by IP address_.

(Yes, I could just fake the login process the same way, but that was more work than I had time for.)

joshstrange4y ago

slaymaker19074y ago
