How we scrape 300k prices per day from Google Flights

(medium.com)

47 points · gusgordon · 6y ago · 67 comments

50 comments · 14 top-level

jsnell · 6y ago · 9 in thread

> This isn’t an astronomical number, but it’s large enough that we (at least, as a bootstrapped company) have to care about cost efficiency.

... by externalizing the costs to a third party.

In general, I'm really surprised that they published this article. It's like they described exactly the data that somebody working on preventing scraping would need to block this traffic, in a totally unnecessary level of detail. (E.g. telling exactly which ASN this traffic would be arriving from, describing the very specific timing of their traffic spikes, the kind of multi-city searches that probably see almost no organic traffic).

I just don't get it. It's like they're intentionally trying to get blocked so that they can write a follow-up "how Google blocked our bootstrapped business" blog post.

fxtentacle · 6y ago

Or they just don't understand that what they are doing is illegal.

I'm always surprised by the level of ignorance, but I've seen more than one startup burn because the founders didn't understand which taxes were due and, thus, failed to account for them in their pricing.

temp43t453 · 6y ago

? https://news.ycombinator.com/item?id=22180559 I don't think it's unethical in the age of the ad-driven web

sdinsn · 6y ago

Scraping public data is not illegal in the US.

> I'm always surprised by the level of ignorance

Such as the ignorance displayed in your comment?

meritt · 6y ago

> what they are doing is illegal

It's not illegal. Google can sue them and bury them in court fees and potentially win a civil suit, but it sure as hell isn't illegal.

aardvark291 · 6y ago

Unethical? Yes. Illegal? How?

splonk · 6y ago

I agree with you in principle, but having worked on both sides of this, I think there's very little chance they get blocked at their current traffic levels.

I do think that if they ever get traction they'll have a lot of problems - there's a reason GDS access to flight availability is slow, expensive, and difficult to implement well. Scraping definitely won't scale.

sneak · 6y ago

> E.g. telling exactly which ASN this traffic would be arriving from

The article mentions that they are using rotating residential proxies.

jsnell · 6y ago

Ah, thanks! I missed that part in the writeup.

fxtentacle · 6y ago

Why would they need those proxies if what they are doing is fully legal?

cortesoft · 6y ago · 6 in thread

> The crawl function reads a URL from the SQS queue, then Pyppeteer tells Chrome to navigate to that page behind a rotating residential proxy. The residential proxy is necessary to prevent Google from blocking the IP Lambda makes requests from.

I am very interested in what a 'rotating residential proxy' is. Are they routing requests through random people's internet connections? Are these people willing participants? Where do they come from?
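For anyone unfamiliar with the pattern in the quoted passage, here is a minimal sketch of a queue-driven crawl step that sends each URL through the next proxy in a pool. It is not the article's code: `queue.Queue` stands in for SQS, the `fetch` callable stands in for the Pyppeteer/Chrome navigation, the proxy addresses are made-up placeholders, and round-robin rotation is an assumption (commercial proxy services usually rotate on their side).

```python
import itertools
import queue
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical proxy endpoints; a real provider would supply these.
PROXIES = ("proxy-a.example:8000", "proxy-b.example:8000", "proxy-c.example:8000")


def crawl_all(
    urls: Iterable[str],
    fetch: Callable[[str, str], str],
    proxies: Sequence[str] = PROXIES,
) -> List[Tuple[str, str, str]]:
    """Drain a work queue, sending each URL through the next proxy in turn."""
    work: "queue.Queue[str]" = queue.Queue()  # stand-in for the SQS queue
    for url in urls:
        work.put(url)

    rotation = itertools.cycle(proxies)  # simple round-robin rotation
    results = []
    while not work.empty():
        url = work.get()
        proxy = next(rotation)
        results.append((url, proxy, fetch(url, proxy)))
    return results


def fake_fetch(url: str, proxy: str) -> str:
    """Placeholder for the browser navigation; performs no network I/O."""
    return f"fetched {url} via {proxy}"
```

Calling `crawl_all(["u1", "u2"], fake_fetch)` returns one `(url, proxy, body)` tuple per URL; swapping `fake_fetch` for a real browser call is where the actual scraping would happen.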

heipei · 6y ago

Check out Luminati for example. They have a huge network of true residential IPs to exit traffic from, and you have to pay a hefty premium per GB of traffic to do so ($12.50 per GB for rotating residential IPs, but requires a minimum $500 commitment per month). The reason they can offer this is because they're exiting traffic through the users of the free Hola VPN Chrome extension.

kazz · 6y ago

It looks like that's basically what they are (https://smartproxy.com/blog/what-is-a-residential-proxies-ne...).

A residential proxy is listed as an "IP address provided by an Internet Service Provider", but I still don't really understand how they get access to them. ISPs have to be selling them access, right?

paulryanrogers · 6y ago

My guess is free VPN services and browser extensions which resell your residential network in tiny chunks.

nsgf · 6y ago

Yes, to all your questions.

https://luminati.io/

Providers of the 'free' Hola vpn.

ac29 · 6y ago

How awful.

"80M+ Monthly devices hosting Luminati's SDK" & "100% Peers chose to opt-in to Luminati's network" (https://luminati.io/network-details)

There is a 0% chance that 80M+ are agreeing "I am OK with Luminati selling access to my home internet connection to any party able to pay", which seems like an honest description of their business model. More likely Luminati is paying unscrupulous app developers to include this SDK in their apps, and some put some legalese into 10,000 word install-time agreements that no one reads.

Traster · 6y ago

I think you can make a reasonable argument that Hola VPN is largely exploiting users who don't actually understand and consent to having their IP address and connection used as a proxy.

dlhavema · 6y ago · 5 in thread

Interesting. A scraper scraping a scraper. I don't get what the value add is over clients just searching Google Flights directly. Not trying to be mean, just trying to understand.

namdnay · 6y ago

Google Flights isn’t a scraper, it’s an evolution of ITA Matrix from what I remember, directly connected to that GDS. They aren’t piggybacking on someone else’s servers.

Which is what this guy could have done, instead of behaving like pond scum. It’s not like it’s particularly complicated to get programmatic access to a GDS API, that’s what they’re there for.

bryanrasmussen · 6y ago

Pond scum? He's scraping some data from a company that got rich scraping data and that probably will tell him to stop doing it. I mean if he's pond scum, what level of scum are those guys with upshot sites? What level of scum is Mark Zuckerberg? Pond scum is generally supposed to be pretty scummy, I can think of thousands of people more scummy than someone scraping from google.

gusgordon (OP) · 6y ago

It’s expensive to get access to a GDS API and, from what I’ve heard, the data they provide is quite difficult to work with. There’s a reason Google bought ITA for $700m, right? If this project ever grows, it could make sense to pull from a GDS.

dlhavema · 6y ago

Cool. I did not know that. Thanks for the clarification.

tpmx · 6y ago

Well, Google Flights is probably the best publicly available data on flight prices.

> Brisk Voyage finds cheap, last-minute weekend trips for our members. The basic idea is that we continuously check a bunch of flight and hotel prices, and when we find a trip that’s a low-priced outlier, we send an email with booking instructions.

Edit: Ok, this could actually be interesting. At least in the short while. .)

mongodbhater · 6y ago · 5 in thread

All the (AWS) technologies used are totally unnecessary. SQS/DynamoDB/Lambda. I can buy a laptop at Walmart for $500 and I can do all the scraping on Starbucks wifi.

rblatz · 6y ago

Right, it seems like they overbuilt this hacky solution. You are scraping, eventually you just need to subscribe to the data. Why invest that much effort into a temporary solution?

nunez · 6y ago

Lambda is needed to get rotating IPs and scale while avoiding browser fingerprinting. SQS takes the results of those scrapes and puts them into a database, DynamoDB. It's a straightforward web scraping pipeline.

toddh · 6y ago

Lambda isn't enough. You'll get blocked in a heartbeat. You still need a proxy service.

atesti · 6y ago

No, they access Google over residential proxy servers provided by packetstream.io

sdinsn · 6y ago

Of course it's unnecessary. The point is that you can do it in the cloud, instead of on a laptop at Starbucks...

dmortin · 6y ago · 4 in thread

It's strange they write about this so openly. Aren't they wary that someone at Google Flights will read it and try blocking them? (E.g. by scrambling the page's code)

meritt · 6y ago

Google doesn't need to block them on a technical level, they just need to send a simple C&D. If Brisk keeps scraping without permission after that, they can look forward to a financially ruinous legal battle [1]. Or they could just not blog about what they're doing and fly under the radar for years and years without any concern.

[1] https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...

z6 · 6y ago

This isn't really anything new that Google isn't aware of.

thinkloop · 6y ago

I don't think it's possible to "scramble" a site to be unscrapeable - it has to render at some point.

tpmx · 6y ago

You're supposed to break rules and laws in the early days; it's part of the startup lore.

Blogging about it publicly, as they're doing it: that may appear newish, but I'm sure some other startup did that 15 years ago.

cleansy · 6y ago · 3 in thread

It's ironic writing an article like that, while their ToS states:

> As a user of the Site, you agree not to:

> 1. Systematically retrieve data or other content from the Site to create or compile, directly or indirectly, a collection, compilation, database, or directory without written permission from us.

Frost1x · 6y ago

The irony is even deeper when you look at the other side, which is Google, who made most of their money off scraping data from people in different forms.

It's data scraping/middlemen all the way down... I wonder if Google indexes their scrape results to throw some loops in the mix.

namdnay · 6y ago

Google have respected no-scrape headers for decades, no?

sdinsn · 6y ago

Just because it's in the ToS doesn't mean it's enforceable. That line is not enforceable in the US.

randombytes6869 · 6y ago · 2 in thread

To those lamenting that they're scraping... Google is the biggest scraper of them all. Facebook, Amazon, Google, Microsoft. All the big boys scrape voraciously, yet try their best to block themselves from being scraped. Scraping is vital for the functionality of the internet. The narrative that scraping is evil is what big companies want you to think.

When you block small scrapers from your site but permit giants like Googlebot and Bing all you're doing is locking in a monopoly that's bad for everyone

occamrazor · 6y ago

Google has the (often implicit) permission of the website owner to scrape. OTOH, Google Flights explicitly disallows scraping results.

sdinsn · 6y ago

No, Google's scraping is opt-out only, which they offer to be friendly.

Google does not need anyone's permission to scrape publicly accessible data, and they are not required to follow any opt-out requests.

nojito · 6y ago · 2 in thread

You state that you care about costs but you end up using some of the most expensive cloud offerings out there?

heipei · 6y ago

I'm torn about their account. It's true that you could easily scrape 25k pages per day on a small VPS that costs less than the $50 Lambda costs they mentioned. And in order to scrape from that VPS you wouldn't have to engineer this much with getting Chrome to run in Lambda, batching URLs, and you wouldn't worry about Lambda timeouts because you could run the whole scrape in one session more or less. So you could say that the engineering effort they spent building this was a waste of money. On the other hand, if they ever do need to scale up for whatever reason (information spread across more pages, or they need to scrape more services, or need multiple attempts per URL), all they have to do is push a button, at which point the upfront engineering effort will have paid off. Either way, their current Lambda costs are definitely eclipsed by the costs of paying for the residential proxy IPs. My two cents.
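As a rough sanity check on the figures mentioned in this thread (300k prices per day from the title, 25k pages per day from the comment above), here is a back-of-the-envelope calculation. Spreading the load evenly over 24 hours is an assumption; the article reportedly has traffic spikes, so peak rates would be higher.

```python
PRICES_PER_DAY = 300_000  # from the article title
PAGES_PER_DAY = 25_000    # figure cited in this thread
SECONDS_PER_DAY = 24 * 60 * 60

prices_per_page = PRICES_PER_DAY / PAGES_PER_DAY
pages_per_second = PAGES_PER_DAY / SECONDS_PER_DAY  # assumes an even spread
seconds_per_page = SECONDS_PER_DAY / PAGES_PER_DAY

print(f"{prices_per_page:.0f} prices per page")
print(f"{pages_per_second:.2f} pages per second")
print(f"{seconds_per_page:.2f} seconds between page loads")
```

That works out to about 12 prices per page and roughly one page load every 3.5 seconds on average, which fits the point that a single small server could handle the volume.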

sdinsn · 6y ago

Seriously, Lambda does not make sense for their use case.

nunez · 6y ago

Flights isn't really the best way of getting cheap flights. They pepper the results, especially if they think you're scraping (which they probably do). Matrix is more accurate. Using a GDS is even more accurate but that costs money.

dandanio · 6y ago

Hey Gus, you might be interested in https://pricelinepartnernetwork.com/ (take a look at the API part for example)

(Disclaimer: I work for priceline).

founderling · 6y ago

The way I read it, they scrape 25k pages per day?

I wonder if that could already put them on Google's radar. If so, Google would probably send a cease and desist letter and this startup would simply give up.

I wonder if Google would also demand their legal expenses? Probably a couple thousand dollars?

I know, nobody would go to court against Google - but what would happen if this did go to court? Which laws would Google cite to deem this illegal?

BaitBlock · 6y ago

Reader mode in case you don't prefer Medium: https://baitblock.app/read/medium.com/brisk-voyage/how-we-sc...

ykevinator · 6y ago

This is awesome

tpmx · 6y ago

The Internet is not a series of tubes. It's a series of leeches...
