Tracking supermarket prices with Playwright

(sakisv.net)

467 points · sakisv · 1y ago · 210 comments

155 comments · 44 top-level

brikym1y ago· 16 in thread

I have been doing something similar for New Zealand since the start of the year, with Playwright/TypeScript dumping Parquet files to cloud storage. I've just been collecting the data; I have not yet displayed it. Most of the work is getting around the reverse proxy services like Akamai and Cloudflare.

At the time I wrote it I thought nobody else was doing it, but now I know of at least 3 startups doing the same in NZ. It seems the inflation really stoked a lot of innovation here. The patterns are about what you'd expect. Supermarkets are up to the usual tricks of arbitrarily making pricing as complicated as possible, using 'sawtooth' methods to segment time-poor people from poor people. Often they'll segment on brand loyalty vs price sensitive people; There might be 3 popular brands of chocolate and every week only one of them will be sold at a fair price.

ustad1y ago

Can anyone comment on how supermarkets exploit customer segmentation by updating prices? How do the time-poor and poor-poor people generally respond?

“Often they'll segment on brand loyalty vs price sensitive people; There might be 3 popular brands of chocolate and every week only one of them will be sold at a fair price.”

brikym1y ago

Let's say there are three brands of some item. Each week one of the brands is rotated to $1 while the others are $2. And let's also suppose that the supermarket pays 80c per item.

The smart shopper might only buy in bulk once every three weeks when his favourite brand is at the lower price, or switch to the cheapest brand every week. A hurried or lazy shopper might always pick their favourite brand every week. If they each buy one item a week, then over three weeks the lazy shopper would have spent $5, while the smart shopper has only spent $3.

They've made 60c off the smart shopper and $2.60 off the lazy shopper. By segmenting out the lazy shoppers they've made an extra $2. The whole idea of rotating the prices has nothing to do with the cost of goods sold; it's all about making shopping a pain in the ass for busy people and catching them out.
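
For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes one full three-week rotation and one item bought per week; the function and variable names are made up for illustration, and amounts are in cents so the sums stay exact.

```python
# Amounts are in cents to keep the arithmetic exact.
SALE_PRICE = 100      # the brand that is rotated to $1 that week
REGULAR_PRICE = 200   # the other two brands
UNIT_COST = 80        # what the supermarket pays per item
WEEKS = 3             # one full rotation across three brands


def shopper_spend(always_favourite: bool) -> int:
    """Total spend over one rotation, buying one item per week."""
    if always_favourite:
        # The favourite brand is on sale in exactly one of the three weeks.
        return SALE_PRICE + (WEEKS - 1) * REGULAR_PRICE
    # The price-sensitive shopper always ends up paying the sale price.
    return WEEKS * SALE_PRICE


def supermarket_margin(spend: int) -> int:
    """Revenue from one shopper minus the cost of the three items sold."""
    return spend - WEEKS * UNIT_COST


lazy = shopper_spend(always_favourite=True)
smart = shopper_spend(always_favourite=False)
extra = supermarket_margin(lazy) - supermarket_margin(smart)
print(lazy, smart, supermarket_margin(lazy), supermarket_margin(smart), extra)
```

The gap between the two margins is the extra money made purely from the segmentation, independent of the cost of goods.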

seoulmetro1y ago

Legality of this is rocky in Australia. I dare say that NZ is the same?

There are so many scrapers that come and go doing this in AU but are usually shut down by the big supermarkets.

It's a cycle of usefulness and "why doesn't this exist", except it had existed many times before.

russelg1y ago

I think with the current climate wrt the big supermarkets in AU, now would be the time to push your luck. The court of public opinion will definitely not be on the supermarkets' side, and the government may even step in.

timrkn1y ago

Agreed. Hopefully the gov's price gouging mitigation strategy includes free flow of information (allowing scraping for price comparison).

I’ve been interested in price comparison for Australia for a while, am a Product designer/manager with a concept prototype design, looking for others interested to work on it. My email is on my profile if you are.

jaza1y ago

Aussie here. I hadn't heard that price scraping is only quasi-legal here and that scrapers get shut down by the big supermarkets - but then again I'm not surprised.

I'm thinking of starting a little price comparison site, mainly to compare select products at Colesworths vs Aldi (I've just started doing more regular grocery shopping at Aldi myself). But as far as I know, Aldi don't have any prices / catalogues online, so my plan is to just manually enter the data myself in the short term, and to appeal to crowdsourcing the data in the long term. The plan is to make it a simple SSG site (e.g. Hugo powered), with the data all in simple markdown / json files, sourced via GitHub pull requests.

Feel free to get in touch if you'd like to help out, or if you know of anything similar that already exists: greenash dot net dot au slash contact

_nivlac_1y ago

For the other commenters here - looks like this site does the job? https://hotprices.org/

With the corresponding repo too: https://github.com/Javex/hotprices-au

sumedh1y ago

> Legality of this is rocky in Australia. I dare say that NZ is the same?

You might be breaking the site's terms and conditions, but that does not mean it's illegal.

Dan Murphy's uses a similar thing; they have their own price checking algorithm.

Dev1021y ago

I built one called https://bbdeals.in/ for India. I mostly use it to buy just fruits and it's saved me about 20% of spending, which is not bad in these hard times.

Building crawlers and infra to support it took not more than 20 hours.

alwinaugustin1y ago

Does this work for HYD only?

pikelet1y ago

As a kiwi, are you able to make any of these (or your) projects available? I'm quite interested.

walterbell1y ago

Those who order grocery delivery online would benefit from price comparisons, because they can order from multiple stores at the same time. In addition, there's only one marketplace that has all the prices from different stores.

gruez1y ago

>Those who order grocery delivery online would benefit from price comparisons, because they can order from multiple stores at the same time.

Not really, since the delivery fees/tips that you have to pay would eat up all the savings, unless maybe if you're buying for a family of 5 or something.

teruakohatu1y ago

I think the fees they tack on for online orders would ruin ordering different products from different stores. It mostly makes sense with staples that don't perish.

With fresh produce I find Pak n Save a lot more variable with quality, making online orders more risky despite the lower cost.

teruakohatu1y ago

I was planning on doing the same in NZ. I would be keen to chat to you about it (email in HN profile). I am a data scientist

Did you notice anything pre and post Whittaker's price increase(s)? They must have a brilliant PR firm on retainer for every major news outlet to more or less push the line that increased prices are a good thing for the consumer. I noticed more aggressive "sales" more recently, but I'm unsure if I am just paying more attention.

My prediction is that they will decrease the size of the bars soon.

scubadude1y ago

I think Whittaker's changed their recipe some time in the last year. Whittaker's was what Cadbury used to be (good) but now I think they have both followed the same course. Markedly lower quality. This is the 200g blocks fwiw not sure about the wee 50g peanut slab.

RasmusFromDK1y ago· 16 in thread

Nice writeup. I've been through similar problems that you have with my contact lens price comparison website https://lenspricer.com/ that I run in ~30 countries. I have found, like you, that websites changing their HTML is a pain.

One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it).
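
To illustrate the regex-plus-manual-mapping idea, here is a rough sketch. It is not the actual lenspricer.com code; the patterns, the `MANUAL_MAP` entry, and the example listings are all made up for illustration.

```python
import re

# Hypothetical manual overrides for names the regexes cannot resolve.
# Keys are names *after* normalisation.
MANUAL_MAP = {
    "oasys daily": "acuvue oasys 1-day",
}


def normalize(name: str) -> str:
    """Reduce a store's product title to a canonical comparison key."""
    s = name.lower().strip()
    s = re.sub(r"[®™]", "", s)                                # trademark symbols
    s = re.sub(r"\b\d+\s*(?:pack|pk|lenses|pcs)\b", "", s)    # pack sizes
    s = re.sub(r"\b(?:1|one)[\s-]*day\b", "1-day", s)         # "1 day", "One-Day"
    s = re.sub(r"[^a-z0-9\- ]+", " ", s)                      # leftover punctuation
    s = re.sub(r"\s+", " ", s).strip()                        # collapse whitespace
    return MANUAL_MAP.get(s, s)


listings = [
    "Acuvue® Oasys 1 Day (30 pack)",
    "ACUVUE OASYS One-Day 30pk",
    "Oasys Daily 30pk",
]
keys = {normalize(title) for title in listings}
print(keys)
```

The manual map is the escape hatch: anything the regexes get wrong is pinned by hand, and every entry still needs a human to verify it.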

I've found that building the scrapers and infrastructure is the comparatively easy part. The hard part is maintaining all of the scrapers and, when a product disappears from a site, figuring out why: does my scraper have an error, is my scraper being blocked, did the site make a change, was the site randomly down for maintenance when I scraped it, etc.
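
One way to triage those cases is a per-run health check. The sketch below is only a heuristic under assumed inputs (the HTTP status of the run and the product count compared with the previous run); the `ScrapeRun` type, the labels, and the 50% threshold are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScrapeRun:
    http_status: int       # status of the listing page request
    products_found: int    # products parsed out of the response


def diagnose(run: ScrapeRun, previous_count: int, drop_threshold: float = 0.5) -> str:
    """Guess why a run looks wrong, compared with the previous run."""
    if run.http_status in (403, 429):
        return "likely blocked"
    if run.http_status >= 500:
        return "site down or under maintenance"
    if run.http_status != 200:
        return "unexpected response"
    if previous_count > 0 and run.products_found == 0:
        # The page loaded but nothing parsed: the HTML probably changed.
        return "parser broken"
    if run.products_found < previous_count * drop_threshold:
        return "partial results"
    return "ok"


print(diagnose(ScrapeRun(http_status=200, products_found=0), previous_count=120))
```

A check like this does not say whether a single product was really delisted, but it helps avoid recording a blocked or broken run as "everything disappeared".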

A fun project, but challenging at times, and annoying problems to fix.

siamese_puff1y ago

Doing the work we need. Every year I get fucked by my insurance company when buying a basic thing - contacts. Pricing is all over the place and coverage is usually 30% done by mail in reimbursement. Thanks!

RasmusFromDK1y ago

Thanks for the nice words!

heap_perms1y ago

I'm curious, can you wear contact lenses while working? I notice my eyes get tired when I look at a monitor for too long. Have you found any solutions for that?

RasmusFromDK1y ago

I use contact lenses basically every day, and I have had no problems working in front of screens. There's a huge difference between the different brands. Mine is one of the more expensive ones (Acuvue Oasys 1-Day), so that might be part of it, but each eye is compatible with different lenses.

If I were you I would go to an optometrist and talk about this. They can also often give you free trials for different contacts and you can find one that works for you.

pavel_lishin1y ago

This is very likely age-dependent.

When I was in my 20s, this was absolutely not a problem.

When I hit my 30s, I started wearing glasses instead of contacts basically all the time, and it wasn't a problem.

Now that I'm in my 40s, I'm having to take my glasses off to read a monitor and most things that are closer than my arm's reach.

kristianbrigman1y ago

My eye doctor recommended wearing “screen glasses”. They are a small prescription (maybe 0.25 or 0.5) with blue blocking. It’s small but it does help; I work on normal glasses at night (so my eyes can rest) and contacts + screen glasses during the day and they are really close.

dotancohen1y ago

Go try an E-Ink device. B&N Nooks are small Android tablets in disguise, you just need to install a launcher. Boox devices are also Android.

I can use an E-Ink device all day without my eyes getting tired.

siamese_puff1y ago

I cannot, personally. They dry out

shellfishgene1y ago

For Germany, below the prices it says "some links may be sponsored", but it does not mark which ones. Is that even legal? Also there seem to be very few shops, are maybe all the links sponsored? Also idealo.de finds lower prices.

RasmusFromDK1y ago

When I decided to put the text like that, I had looked at maybe 10-20 of the biggest price comparison websites across different countries because I of course want to make sure I respect all regulations that there are. I found that many of them don't even write anywhere that the links may be sponsored, and you have to go to the "about" page or similar to find this. I think that I actually go further than most of them when it comes to making it known that some links may be sponsored.

Now that you mention idealo, there seems to be no mention at all on a product page that they are paid by the stores, you have to click the "rank" link in the footer to be brought to a page https://www.idealo.de/aktion/ranking where they write this.

bane1y ago

> One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it)

In the U.S. at least, big retailers will have product suppliers build slightly different SKUs for them to make price comparisons tricky. Costco is somewhat notorious for this: almost all electronics (and many other products) sold in their stores are a custom SKU, often with a slightly different product configuration.

throwaway7ahgb1y ago

Costco does this for sure, but Costco also creates their own products. For instance there are some variations of a package set that can only be bought at Costco, so you aren't getting the exact same box and items as anywhere else.

bob_theslob6461y ago

Would that still matter if you just compare by description?

ludvigk1y ago

Isn’t this a use-case where LLMs could really help?

RasmusFromDK1y ago

Yeah, it is to some degree. I tried to use it as much as possible, but there are always those annoying edge cases that make me not trust the results, so I have to check everything, and it ended up being faster just building a simple UI where I can easily classify the name myself.

Part of the problem is simply due to bad data from the websites. Just as an example - there's a 2-week contact lens called "Acuvue Oasys". And there's a completely different 1-day contact lens called "Acuvue Oasys 1-Day". Some sites have been bad at writing this properly, so both variants may be called "Acuvue Oasys" (or close to it), and the way to distinguish them is to look at the image to see which actual lens they mean, look at the price etc.

It's true that this could probably also be handled by AI, but in the end, classifying the lenses takes like 1-2% of the time it takes to make a scraper for a website so I found it was not worth trying to build a very good LLM classifier for this.

brunoqc1y ago

Do you support Canada?

langsoul-com1y ago· 8 in thread

The hard thing is not scraping, but getting around the increasingly sophisticated blockers.

You'll need to constantly rotate (highly rated) residential proxies and make sure not to exhibit data scraping patterns. Some supermarkets don't show the network requests in the network tab, so you cannot just grab that API response.

Even then, mitm attacks with mobile app (to see the network requests and data) will also get blocked without decent cover ups.

I tried but realised it isn't worth it due to the costs and constant dev work required. In fact, some of the supermarket pricing comparison services just have (cheap labour) people scrape it

__MatrixMan__1y ago

I wonder if we could get some legislation in place to require that they publish pricing data via an API so we don't have to tangle with the blockers at all.

immibis1y ago

Perhaps in Europe. Anywhere else, forget about it.

zackmorris1y ago

I'd prefer that governments enact legislation that prevents discriminating against IP addresses, perhaps under net neutrality laws.

For anyone with some clout/money who would like to stop corporations like Akamai and Cloudflare from unilaterally blocking IP addresses, the way that works is you file a lawsuit against the corporations and get an injunction to stop a practice (like IP blacklisting) during the legal proceedings. IANAL, so please forgive me if my terminology isn't quite right here:

https://pro.bloomberglaw.com/insights/litigation/how-to-file...

https://www.law.cornell.edu/wex/injunctive_relief

Injunctions have been used with great success for a century or more to stop corporations from polluting or destroying ecosystems. The idea is that since anyone can file an injunction, that creates an incentive for corporations to follow the law or risk having their work halted for months or years as the case proceeds.

I'd argue that unilaterally blocking IP addresses on a wide scale pollutes the ecosystem of the internet, so can't be allowed to continue.

Of course, corporations have thought of all of this, so have gone to great lengths to lobby governments and use regulatory capture to install politicians and judges who rule in their favor to pay back campaign contributions they've received from those same corporations:

https://www.crowell.com/en/insights/client-alerts/supreme-co...

https://www.mcneeslaw.com/nlrb-injunction/

So now the pressures that corporations have applied on the legal system to protect their own interests at the cost of employees, taxpayers and the environment have started to affect other industries like ours in tech.

You'll tend to hear that disruptive ideas like I've discussed are bad for business from the mainstream media and corporate PR departments, since they're protecting their own interests. That's why I feel that the heart of hacker culture is in disrupting the status quo.

sakisvOP1y ago

Thankfully I'm not there yet.

Since this is just a side project, if it starts demanding too much of my time too often I'll just stop it and open both the code and the data.

BTW, how could the network request not appear in the network tab?

For me the hardest part is to correlate and compare products across supermarkets

langsoul-com1y ago

If they don't populate the page via Ajax or other network requests, i.e. the page is rendered server side, then no requests for supermarket data will appear.

See Next.js server-side rendering; I believe they mention that as a security thing in their docs.

In terms of comparison, most names tend to be the same, so a similarity search restricted to the same category matches well enough.
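
A minimal sketch of that idea, using the standard library's `difflib.SequenceMatcher` as a stand-in for whatever similarity measure you prefer. The function name, the 0.8 threshold, and the sample products are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional


def best_match(name: str, category: str,
               candidates: list[tuple[str, str]],
               threshold: float = 0.8) -> Optional[str]:
    """Return the most similar candidate name in the same category, if any.

    `candidates` is a list of (product name, category) pairs from another store.
    """
    best_name, best_score = None, 0.0
    for cand_name, cand_category in candidates:
        if cand_category != category:
            continue  # only compare within the same category
        score = SequenceMatcher(None, name.lower(), cand_name.lower()).ratio()
        if score > best_score:
            best_name, best_score = cand_name, score
    return best_name if best_score >= threshold else None


other_store = [
    ("Whole Milk 1L", "dairy"),
    ("Skim Milk 1L", "dairy"),
    ("Whole Wheat Bread", "bakery"),
]
print(best_match("Whole milk 1 L", "dairy", other_store))
```

Restricting to the category first keeps the number of comparisons down and avoids cross-category false positives; anything under the threshold is left unmatched for manual review.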

seanthemon1y ago

And you couldn't use OCR and simply take an image of the product list? Not ideal, but difficult or impossible to track depending on your method.

langsoul-com1y ago

You'll get blocked before even seeing the page most times.

eddyfromtheblok1y ago

Crowdsource it with a browser extension

nosecreek1y ago· 8 in thread

Very cool! I did something similar in Canada (https://grocerytracker.ca/)

odiroot1y ago

Similar for Austria: https://heisse-preise.io

snac1y ago

Love your site! It was a great source of inspiration with the amount of data you collect.

I did the same and made https://grocerygoose.ca/

Published the API endpoints that I “discovered” to make the app https://github.com/snacsnoc/grocery-app (see HACKING.md)

It’s an unfortunate state of affairs when devs like us have to go to such great lengths to track the price of a commodity (food).

kareemm1y ago

Was looking for one in Canada. Tried this out and it seems like some of the data is missing from where I live (halifax). Got an email I can hit you up at? Mine's in my HN profile - couldn't find yours on HN or your site.

nosecreek1y ago

For sure, just replace the first dot in the url from my profile with an @

sakisvOP1y ago

Oh nice!

A thorny problem in my case is that the same item is named in 3 different ways between the 3 supermarkets which makes it very hard and annoying to do a proper comparison.

Did you have a similar problem?

seszett1y ago

I have built a similar system for myself, but since it's small scale I just have "groups" of similar items that I manually populate.

I have the additional problem that I want to compare products across France and Belgium (Dutch-speaking side) so there is no hope at all to group products automatically. My manual system allows me to put together say 250g and 500g packaging of the same butter, or of two of the butters that I like to buy, so I can always see easily which one I should get (it's often the 250g that's cheaper by weight these days).

Also the 42000 or so different packagings for Head and Shoulders shampoo. 250ml, 270ml, 285ml, 480ml, 500ml (285ml is usually cheapest). I'm pretty sure they do it on purpose so each store doesn't have to match price with the others because it's a "different product".
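
A small sketch of the per-volume comparison across pack sizes. The sizes are the ones listed above, but the prices are invented for illustration, not real data.

```python
def price_per_100ml(price: float, volume_ml: int) -> float:
    """Normalise a pack price to a price per 100 ml."""
    return price / volume_ml * 100


# volume in ml -> shelf price (hypothetical prices)
packs = {250: 3.49, 285: 3.29, 500: 6.99}

cheapest_volume = min(packs, key=lambda v: price_per_100ml(packs[v], v))
for volume, price in sorted(packs.items()):
    print(f"{volume} ml: {price_per_100ml(price, volume):.2f} per 100 ml")
print("cheapest by volume:", cheapest_volume)
```

Once items are in a manual group, this normalisation is all that is needed to rank them, whatever odd sizes the stores pick.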

nosecreek1y ago

Absolutely! It’s made it difficult to implement some of the cross-retailer comparison features I would like to add. For my charts I’ve just manually selected some products, but I’ve also been trying to get a “good enough but not perfect” string comparison algorithm working.

maxglute1y ago

Excellent work.

odysseus1y ago· 7 in thread

I used to price track when I moved to a new area, but now I find it way easier to just shop at 2 markets or big box stores that consistently have low prices.

In Europe, that would probably be Aldi/Lidl.

In the U.S., maybe Costco/Trader Joe's.

For online, CamelCamelCamel/Amazon. (for health/beauty/some electronics but not food)

If you can buy direct from the manufacturer, sometimes that's even better. For example, I got a particular brand of soap I love at the soap's wholesaler site in bulk for less than half the retail price. For shampoo, buying the gallon size direct was way cheaper than buying from any retailer.

bufferoverflow1y ago

> In the U.S., maybe Costco/Trader Joe's.

Costco/Walmart/Aldi in my experience.

Trader Joe's is higher quality, but generally more expensive.

DontchaKnowit1y ago

Walmart is the undisputed king of low prices, and honestly in my experience the quality of their store brand stuff is pretty damn solid, and usually like half the price of comparable products. Been living off their Greek yogurt for a while now. It's great.

dawnerd1y ago

Sam's Club, I’ve found, beats Costco in some areas, but for some items Costco absolutely undercuts like crazy. Cat litter at Sam's is twice the price when not on sale.

I pretty much just exclusively shop at Aldi/Walmart as they have the best prices overall. Kroger owned stores and Albertsons owned are insanely overpriced. Target is a good middle ground but I can’t stand shopping there now with everything getting locked up.

shiroiushi1y ago

Trader Joe's also only carries Trader Joe's-branded merchandise, aside from the produce. So if you're looking for something in particular that isn't a TJ item, you won't find it there.

odysseus1y ago

Occasionally you can get the same Trader Joe’s private label products rebranded as Aldi merchandise for even cheaper at Aldi.

dexwiz1y ago

You can find ALDIs in the USA, but they are regional. Trader Joe’s is owned by the same family as ALDIs, and until recently (past 10 years) you wouldn’t see them in the same areas.

jasomill1y ago

I'd usually associate the term "regional" with chains like Meijer, Giant Eagle, and Winn-Dixie.

With 2,392 stores in 38 states plus DC[1], I'm not sure Aldi US qualifies.

[1] https://stores.aldi.us

xyst1y ago· 6 in thread

It would be nice to have price transparency for goods. It would make processes like this much easier to track by store and region.

For example, compare the price of oat milk at different zip codes and grocery stores. Additionally track “shrinkflation” (same price but smaller portion).

On that note, it seems you are tracking price but are you also checking the cost per gram (or ounce)? Manufacturer or store could keep price the same but offer less to the consumer. Wonder if your tool would catch this.

sakisvOP1y ago

I do track the price per unit (kg, lt, etc) and I was a bit on the fence on whether I should show and graph that number instead of the price that someone would pay at the checkout, but I opted for the latter to keep it more "familiar" with the prices people see.

Having said that, that's definitely something that I could add, and it would show when the shrinkflation occurred, if any.
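
As a rough sketch of how the per-unit number could surface shrinkflation: flag any observation where the pack got smaller while the price per kg went up. The `Observation` type and the sample history are assumptions for illustration, not the site's actual data model.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    date: str          # ISO date of the scrape
    price_cents: int   # checkout price
    size_g: int        # net weight of the pack


def unit_price(obs: Observation) -> float:
    """Price in cents per kg."""
    return obs.price_cents / obs.size_g * 1000


def shrinkflation_dates(history: list[Observation]) -> list[str]:
    """Dates where the pack shrank and the per-kg price rose."""
    dates = []
    for prev, cur in zip(history, history[1:]):
        if cur.size_g < prev.size_g and unit_price(cur) > unit_price(prev):
            dates.append(cur.date)
    return dates


history = [
    Observation("2024-01-01", 350, 200),
    Observation("2024-02-01", 350, 200),
    Observation("2024-03-01", 350, 180),  # same price, smaller pack
]
print(shrinkflation_dates(history))
```

The checkout-price graph stays flat for this product, which is exactly why the per-unit series is the one that reveals the change.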

barbazoo1y ago

Grocers not putting per unit prices on the label is a pet peeve of mine. I can’t imagine any purpose not rooted in customer hostility.

baronswindle1y ago

In my experience, grocers always do include unit prices…at least in the USA. I’ve lived in Florida, Indiana, California, and New York, and in 35 years of life, I can’t remember ever not seeing the price per oz, per pound, per fl oz, etc. right next to the total price for food/drink and most home goods.

There may be some exceptions, but I’m struggling to think of any except things where weight/volume aren’t really relevant to the value — e.g., a sponge.

dawnerd1y ago

Or when they change what unit to display so you can’t easily cross compare.

girvo1y ago

It's required by law in Australia, which is nice

candiddevmike1y ago

Imagine mandating transparent cost of goods pricing. I'd love to see farmer was paid X, manufacturer Y, and grocer added Z.

batata0041y ago· 4 in thread

I created a similar website which got lots of interest in my city. I scrape even app and websites data using a single server at Linode with 2GB of RAM, 5 IPv4 and 1000 IPv6 addresses (which is free), and every single product is scraped at intervals of at most 40 minutes, never more than that, with an average of 25 minutes. I use curl-impersonate and scrape JSON as much as possible, because 90% of markets provide prices via Ajax calls; for the other 10% I use regex to parse the HTML. You can check it at https://www.economizafloripa.com.br

latexr1y ago

> I scrape even app and websites data

And then try to sell it back to businesses, even suggesting they use the data to train AI. You also make it sound like there’s a team manually doing all the work.

https://www.economizafloripa.com.br/?q=parceria-comercial

That whole page makes my view of the project go from “helpful tool for the people, to wrestle back control from corporations selling basic necessities” to “just another attempt to make money”. Which is your prerogative, I was just expecting something different and more ethically driven when I read the homepage.

mechanical_bear1y ago

Where does this lack ethics? It seems that they are providing a useful service, that they created with their hard work. People are allowed to make money with their work.

presentation1y ago

It’s almost like people try to do valuable services for others in exchange for money.

siamese_puff1y ago

How does the ipv6 rotation work in this flow?

grafraf1y ago· 4 in thread

We have been doing it for the Swedish market for more than 8 years. We have a website, https://www.matspar.se/ , where the customer can browse all the products of all major online stores, compare the prices and add the products they want to buy to the cart. At the end of the journey the customer can compare the total price of that cart (including shipping fee) and export the cart to the store they want to order from.

I'm also one of the founders and the current CTO, so there has been a lot of scraping and maintaining over the years. We are scraping over 30 million prices daily.

filleokus1y ago

On the business side, what's your business model, how do you generate revenue? What's the longer term goals?

(Public data shows the company have a revenue of ≈400k USD and 6 employees https://www.allabolag.se/5590076351/matspar-i-sverige-ab)

grafraf1y ago

We are selling price/distribution data about the products we scrape. We do run some ads and have some affiliate deals.

The insight I can share is that the main (tech) goal is to make the product more user friendly and more aligned with customer needs, as it has many pain points and we have gained some insights into the preferred customer journey.

showsover1y ago

Do you have a technical writeup of your scraping approach? I'd love to read more about the challenges and solutions for them.

grafraf1y ago

Unfortunately no, but i can share some insights that i hope can be of value:

- Tech: Everything is hosted in AWS. We are using Golang in Docker containers that do the scraping. They run on ECS Fargate Spot when needed, triggered by a cron job. The scraping result is stored as Parquet in S3 and processed in our RDS PostgreSQL. We need to be creative and have some methods to identify that a particular product A in store 1 is the same as product A in store 2, so they are mapped together. Sometimes it needs to be verified manually. The data that is of interest for the user/site is indexed into Elasticsearch.

Things that might be of interest:

- We always try to avoid parsing the HTML and instead call the site's APIs directly to reduce scraping time. We also try to scrape the category listing to get multiple prices in one request; this can reduce the total from over 100 000 requests to maybe fewer than 1000.

- We also try to avoid scraping the sites during peak times and respect their robots.txt. We add some delay to each request. The scrapes are often done during night/early morning.

- The main challenge is that stores can redesign or modify their sites, which makes our scrapers fail, so we need to be fast and adapt to the new changes.

- Another major hidden challenge is that the stores have different prices for the same product depending on your zip code, so we have our ways of identifying each store's different warehouses and which zip codes belong to a specific warehouse, and then do a scrape for that warehouse. A store might have 5 warehouses, so we need to scrape it 5 times with different zip codes.

There is much more but i hope that gave you some insights of challenges and some solutions!
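
A rough Python sketch of the shape described in the list above (our real scrapers are in Go, and this is not that code): check robots.txt, scrape category listings rather than single products, repeat once per warehouse zip code, and pause between requests. The bot name, URLs, zip codes and the `fetch` callback are placeholders.

```python
import time
from typing import Callable, Iterable
from urllib.robotparser import RobotFileParser


def build_plan(robots_txt: str, category_urls: Iterable[str],
               zip_codes: Iterable[str],
               user_agent: str = "example-price-bot") -> list[tuple[str, str]]:
    """One (category URL, zip code) job per warehouse, minus disallowed URLs."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [url for url in category_urls if parser.can_fetch(user_agent, url)]
    return [(url, zip_code) for zip_code in zip_codes for url in allowed]


def run_plan(plan: list[tuple[str, str]],
             fetch: Callable[[str, str], dict],
             delay_seconds: float = 1.0) -> list[dict]:
    """Run the jobs sequentially with a fixed delay between requests."""
    results = []
    for url, zip_code in plan:
        results.append(fetch(url, zip_code))
        time.sleep(delay_seconds)
    return results


robots = "User-agent: *\nDisallow: /checkout\n"
urls = ["https://shop.example/category/dairy", "https://shop.example/checkout/cart"]
plan = build_plan(robots, urls, zip_codes=["11111", "22222"])
print(len(plan), "requests planned")
```

The request count is categories times warehouses rather than products times warehouses, which is where the drop from six figures to three comes from.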

lotsofpulp1y ago· 4 in thread

In the US, retail businesses are offering individualized and general coupons via the phone apps. I wonder if this pricing can be tracked, as it results in significant differences.

For example, I recently purchased fruit and dairy at Safeway in the western US, and after I had everything I wanted, I searched each item in the Safeway app, and it had coupons I could apply for $1.5 to $5 off per item. The other week, my wife ran into the store to buy cream cheese. While she did that, I searched the item in the app, and “clipped” a $2.30 discount, so what would have been $5.30 to someone that didn’t use the app was $3.

I am looking at the receipt now, and it is showing I would have spent $70 total if I did not apply the app discounts, but with the app discounts, I spent $53.

These price obfuscation tactics are seen in many businesses, making price tracking very difficult.

mcoliver1y ago

I wrote a chrome extension to help with this. Clips all the coupons so you don't have to do individual searches. Has resulted in some wild surprise savings when shopping. www.throwlasso.com

Larrikin1y ago

This looks amazing. Do you have plans to support Firefox and other browsers?

koolba1y ago

Ha! I have the same thing as a bookmarklet for specific sites. It’s fun to watch it render the clicks.

lotsofpulp1y ago

Wow! This is amazing, thank you. I usually use Safari, but will give it a try.

pcblues1y ago· 3 in thread

This is interesting because I believe the two major supermarkets in Australia can create a duopoly in anti-competitive pricing by just employing price analysis AI algorithms on each side and the algorithms will likely end up cooperating to maximise profit. This can probably be done legally through publicly obtained prices and illegally by sharing supply cost or profit per product data. The result is likely to be similar. Two trained AIs will maximise profit in weird ways using (super)multidimensional regression analysis (which is all AI is), and the consumer will pay for maximised profits to ostensible competitors. If the pricing data can be obtained like this, not much more is needed to implement a duopoly-focused pair of machine learning implementations.

TrackerFF1y ago

Here in Norway, what is called the "competition authority" (https://konkurransetilsynet.no/norwegian-competition-authori...) is frequently critical of open and transparent (food) price information for that exact reason.

The rationale is that if all prices are out there in the open, consumers will end up paying a higher price, as the actors (supermarkets) will end up pricing their stuff equally, at a point where everyone makes a maximum profit.

For years said supermarkets have employed "price hunters", which are just people that go to competitor stores and register the prices of everything.

Here in Norway you will oftentimes notice that supermarket A will have sale/rebates on certain items one week, then the next week or after supermarket B will have something similar, to attract customers.

pcblues1y ago

The word I was looking for was collusion, but done with software and without people-based collusion.

avador1y ago

Compusion.

ikesau1y ago· 3 in thread

Ah, I love this. Nice work!

I really wish supermarkets were mandated to post this information whenever the price of a particular SKU updated.

The tools that could be built with such information would do amazing things for consumers.

sakisvOP1y ago

Thanks!

If Greece's case is anything to go by, I doubt they'd ever accept that as it may bring to light some... questionable practices.

At some point I need to deduplicate the products and plot the prices across all 3 supermarkets on the same graph as I suspect it will show some interesting trends.

project2501a1y ago

fyi, I posted this on /r/greece

robotnikman1y ago

As someone who actively works on these kind of systems, it's a bit more complicated than that. The past few years we worked on migrating from some old system from the 80's designed for LAN use only, to a cloud based item catalogue system that finally allowed us the ability to easily make pricing info more available to consumers, such as through an app.

gadders1y ago· 3 in thread

This reminds me a bit of a meme that said something along the lines of "I don't want AI to draw my art, I want AI to review my weekly grocery shop, work out which combinations of shops save me money, and then schedule the deliveries for me."

ElCapitanMarkla1y ago

Something I was talking over with a friend a while ago was something along the lines of this.

Where you could set a list of various meals that you like to eat regularly, a list of like 20 meal options. And then the app fetches the pricing for all ingredients and works out which meals are the best value that week.

You kind of end up with a DIY HelloFresh / meal in a box service.
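
A rough sketch of the meal-ranking idea, assuming a hypothetical `prices` mapping (ingredient to this week's cheapest price) that a scraper would fill in. The meal names and numbers are made up for illustration.

```python
from typing import Dict, List, Tuple


def rank_meals(meals: Dict[str, List[str]], prices: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (meal, total cost) pairs, cheapest first.

    Meals with an ingredient that has no known price are skipped, not guessed.
    """
    ranked = []
    for meal, ingredients in meals.items():
        if all(item in prices for item in ingredients):
            total = round(sum(prices[item] for item in ingredients), 2)
            ranked.append((meal, total))
    return sorted(ranked, key=lambda pair: pair[1])


# Hypothetical data: "peppers" has no price, so "stir fry" is left out.
meals = {
    "pasta bake": ["pasta", "tinned tomatoes", "cheese"],
    "stir fry": ["rice", "chicken", "peppers"],
    "omelette": ["eggs", "cheese"],
}
prices = {
    "pasta": 0.90, "tinned tomatoes": 0.60, "cheese": 2.50,
    "eggs": 1.80, "rice": 1.20, "chicken": 4.00,
}

for meal, cost in rank_meals(meals, prices):
    print(f"{meal}: {cost:.2f}")
```

A real version would also need quantities per ingredient and a way to map recipe ingredients onto store products, which is the harder part.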

gadders1y ago

Yes, that would work.

"Dave, the cheapest meals for you this week are [LIST OF DINNERS]. Based on your preferred times, deliveries from Waitrose, Tesco and Sainsbury's are turning up at 7pm on Monday. Please check you still have the following staples in stock [e.g. pasta, tinned tomatoes etc.]".

sakisvOP1y ago

Ha, you can't imagine how many times I've thought of doing just that - it's just that it's somewhat blocked by other things that need to happen before I even attempt to do it

joelthelion1y ago· 3 in thread

We should mutualize scraping efforts, creating a sort of Wikipedia of scraped data. I bet a ton of people and cool applications would benefit from it.

sakisvOP1y ago

Haha all we have to do is agree on the format, right?

Spivak1y ago

We already did. The format supports attaching related content, the scraped info, with the archive itself. So you get your data along with the means to generate it yourself if you want something different.

https://en.m.wikipedia.org/wiki/WARC_(file_format)

joelthelion1y ago

Honestly I don't think that matters a lot. Even if all sites were scraped in a different format, the collection would still be insanely useful.

The most important part is being able to consistently scrape every day or so for a long time. That isn't easy.

haolez1y ago· 3 in thread

I heard that some e-commerce sites will not block scrapers, but poison the data shown to them (e.g. subtly wrong prices). Does anyone know more about this?

barryrandall1y ago

I never poisoned data, but I have implemented systems where clients who made requests too quickly got served data from a snapshot that only updated every 15 minutes.
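
A minimal sketch of that kind of setup, not the commenter's actual system: clients that request faster than a threshold get a cached snapshot that is refreshed at most every 15 minutes. The class name, the `min_interval` threshold and the `fetch_live` callable are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional


class SnapshotGate:
    """Serve live data to normal clients and a slowly refreshed snapshot to fast ones."""

    def __init__(self, fetch_live: Callable[[], dict], min_interval: float = 1.0,
                 snapshot_ttl: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self.fetch_live = fetch_live
        self.min_interval = min_interval  # hypothetical "too quick" threshold, seconds
        self.snapshot_ttl = snapshot_ttl  # the 15 minutes mentioned above
        self.clock = clock                # injectable so the logic can be tested
        self.last_seen: Dict[str, float] = {}
        self.snapshot: Optional[dict] = None
        self.snapshot_at = 0.0

    def handle(self, client_id: str) -> dict:
        now = self.clock()
        previous = self.last_seen.get(client_id)
        self.last_seen[client_id] = now
        too_fast = previous is not None and now - previous < self.min_interval
        if not too_fast:
            return self.fetch_live()
        # Refresh the snapshot only when it is missing or older than the TTL.
        if self.snapshot is None or now - self.snapshot_at >= self.snapshot_ttl:
            self.snapshot = self.fetch_live()
            self.snapshot_at = now
        return self.snapshot
```

The fast client never sees an error, so it has no obvious signal that it has been throttled.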

MathMonkeyMan1y ago

This HN post had me playing around with Key Food's website. A lot of information is wrapped up in a cookie, but it looks like there isn't too much javascript rendering.

But when I hit the URLs with curl, without a cookie, I get a valid looking page, but it's just a hundred listings for "Baby Bok Choy." Maybe a test page?

After a little more fiddling, the server just responded with an empty response body. So, it looks like I'll have to use browser automation.

marginalia_nu1y ago

Yeah, by far the most reliable way of preventing bots is to silently poison the data. The harder you try to fight them in a visible fashion, the harder they become to detect. If you block them, they just come back with a hundred times as many IP addresses and u-a fingerprints.

hk13371y ago· 3 in thread

I would be curious if there were a price difference between what is online and physically in the store.

devjab1y ago

In Denmark there often is, things like localised sales the 4-8 times a year a specific store celebrates its birthday or similar. You can scan their PDF brochures but you would need image recognition for most of them, and some well trained recognitions to boot since they often alter their layouts and prices are listed differently.

The biggest sales come from the individual store “close to expiration” sales where items can become really cheap. These aren’t available anywhere but the stores themselves though.

Here I think the biggest challenge might be the monopoly supermarket chains have on the market. We basically have two major corporations with various brands. They are extremely similar in their pricing, and even though there are two low price competitors, these don’t seem to affect the competition with the two major corporations at all. What is worse is that one of these two major corporations is “winning”, meaning that we’re heading more and more toward what will basically be a true monopoly.

flir1y ago

Next step: monitoring the updates to those e-ink shelf edge labels that are starting to crop up.

sakisvOP1y ago

The few random checks that I did on a few products as I was shopping didn't show any difference.

Either I was lucky or they don't bother, who knows

maerten1y ago· 2 in thread

Nice article!

> The second kind is nastier. > > They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.

I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.

I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/

Most of the scraping in my project is done by doing simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, parsers and APIs are built using Haskell and run on AWS ECS. The website is NextJS.

The main challenge I have been trying to work on, is trying to link products from different supermarkets, so that you can list prices in a single view. See for example: https://www.superprijsvergelijker.nl/supermarkt-aanbieding/6...

It works for the most part, as long as at least one correct barcode number is provided for a product.

sakisvOP1y ago

Thanks!

> I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.

Yes, that's exactly what I've been doing and it saved me more times than I'd care to admit!
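
A small sketch of the scrape/parse split, assuming raw responses are stored as files keyed by shop and date. The directory layout and the payload shape (`products` with `name` and `price`) are hypothetical, not what either project actually uses.

```python
import json
import tempfile
from datetime import date
from pathlib import Path
from typing import List


def save_raw(shop: str, day: date, payload: str, raw_dir: Path) -> Path:
    """Scrape step: write the response body untouched, keyed by shop and date."""
    path = raw_dir / shop / f"{day.isoformat()}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(payload, encoding="utf-8")
    return path


def parse_raw(path: Path) -> List[dict]:
    """Parse step: can be rerun on old files after a parser fix."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return [{"name": p["name"], "price": float(p["price"])} for p in data["products"]]


# Demo with a made-up payload, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_raw("shop-a", date(2024, 1, 2),
                     '{"products": [{"name": "chips", "price": "1.99"}]}', Path(tmp))
    products = parse_raw(saved)
print(products)
```

Because `parse_raw` only reads files, a parser bug found weeks later can be fixed and replayed over every stored day without scraping again.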

Tryk1y ago

Awesome, have been looking for something like this!

seanwilson1y ago· 2 in thread

> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products. However the way they write the prices has changed and now a bag of chips doesn't cost €1.99 but €199. To catch these changes I rely on my transformation step being as strict as possible with its inputs.

You could probably add some automated checks to not sync changes to prices/products if a sanity check fails e.g. each price shouldn't change by more than 100%, and the number of active products shouldn't change by more than 20%.

z3t41y ago

Sanity checks in programming are underrated: not only are they cheap performance-wise, they also catch bugs early that would otherwise poison the state.

sakisvOP1y ago

Yeah I thought about that, but I've seen cases that a product jumped more than 100%.

I used this kind of heuristic to check if a scrape was successful by checking that the amount of products scraped today is within ~10% of the average of the last 7 days or so
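
A minimal sketch of both checks discussed here. The ~10% tolerance against the recent average comes from the comment above; the price check uses a looser factor of 10 as an assumption, since real products can legitimately jump by more than 100% while a €1.99 to €199 parsing error is a factor of 100.

```python
from statistics import mean
from typing import Dict, List


def count_looks_sane(todays_count: int, previous_counts: List[int],
                     tolerance: float = 0.10) -> bool:
    """True if today's product count is within `tolerance` of the recent average."""
    if not previous_counts:
        return True  # nothing to compare against yet
    average = mean(previous_counts)
    return abs(todays_count - average) <= tolerance * average


def suspicious_prices(old: Dict[str, float], new: Dict[str, float],
                      max_ratio: float = 10.0) -> List[str]:
    """Products whose price moved by a factor of `max_ratio` or more, up or down."""
    flagged = []
    for product, old_price in old.items():
        new_price = new.get(product)
        if new_price is None or old_price <= 0 or new_price <= 0:
            continue
        ratio = max(new_price / old_price, old_price / new_price)
        if ratio >= max_ratio:
            flagged.append(product)
    return flagged
```

A failed check would typically stop the sync and raise an alert rather than silently drop data, so a human can decide whether the site changed or the prices did.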

andrewla1y ago· 2 in thread

One problem that the author notes is that so much rendering is done client side via javascript.

The flip side to this is that very often you find that the data populating the site is in a very simple JSON format to facilitate easy rendering, ironically making the scraping process a lot more reliable.

sakisvOP1y ago

Initially that's what I wanted to do, but the first supermarket I did is sending back HTML rendered on the server side, so I abandoned this approach for the sake of "consistency".

Lately I've been thinking to bite the bullet and Just Do It, but since it's working I'm a bit reluctant to touch it.

andrewla1y ago

For your purposes scraping the user-visible site probably makes the most sense since in the end, their users' eyes are the target.

I am typically doing one-off scraping and for that, an undocumented but clean JSON api makes things so much easier, so I've grown to enjoy sites that are unnecessarily complex in their rendering.

PigiVinci831y ago· 2 in thread

Nice article, enjoyed reading it. I’m Pier, co founder of https://Databoutique.com, which is a marketplace for web scraped data. If you’re willing to monetize your data extractions, you can list them on our website. We just started with the grocery industry and it would be great to have you on board.

bob_theslob6461y ago

This looks like a really cool website but my only critique is how are you verifying that the data is actually real and not just generated randomly?

redblacktree1y ago

Do you have data on which data is in higher demand? Do you keep a list of frequently-requested datasets?

Alifatisk1y ago· 2 in thread

Some stores don’t have an interactive website but instead send out magazines to your email with news for the week.

How would one scrape those? Anyone experienced?

psd11y ago

IMAP library to dump the attachment, pandoc to convert it to HTML, then a DOM library to parse it statically.

Likely easier than website scraping.

Alifatisk1y ago

I’ll try this approach, thanks! Most magazines I’ve noticed use a grid design, so my first thought was to somehow detect each square and then OCR the product name with its price.

xnx1y ago· 1 in thread

Scraping tools have become more powerful than ever, but bot restrictions have become equally more strict. It's hard to scrape reliably under any circumstance, or even consistently without residential proxies.

sakisvOP1y ago

When I first started, there were a couple of instances where my IP was blocked - despite being a residential IP behind CGNAT.

I then started randomising every aspect of the scraping process that I could: The order that I visited the links, the sleep duration between almost every action, etc.

As long as they don't implement a strict fingerprinting technique, that seems to be enough for now
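
A minimal sketch of that kind of randomisation, assuming a hypothetical `visit` callable that does the actual page load. The sleep bounds are illustrative, not the values used in the post.

```python
import random
import time
from typing import Callable, Iterable, List


def visit_in_random_order(links: Iterable[str], visit: Callable[[str], None],
                          min_sleep: float = 1.0, max_sleep: float = 5.0,
                          sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Visit every link once, in shuffled order, pausing a random time between visits."""
    order = list(links)
    random.shuffle(order)
    for i, link in enumerate(order):
        if i:  # no pause before the first visit
            sleep(random.uniform(min_sleep, max_sleep))
        visit(link)
    return order
```

The `sleep` parameter is only there so the function can be exercised without waiting; in a real run the default `time.sleep` is used.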

ptrik1y ago· 1 in thread

> My CI of choice is [Concourse](https://concourse-ci.org/) which describes itself as "a continuous thing-doer". While it has a bit of a learning curve, I appreciate its declarative model for the pipelines and how it versions every single input to ensure reproducible builds as much as it can.

What's the thought process behind using a CI server - which I thought is mainly for builds - for what essentially is a data pipeline?

sakisvOP1y ago

Well I'm just thinking of concourse the same way it describes itself, "a continuous thing doer".

I want something that will run some code when something happens. In my case that "something" is a specific time of day. The code will spin up a server, connect it to tailscale, run the 3 scraping jobs and then tear down the server and parse the data. Then another pipeline runs that loads the data and refreshes the caches.

Of course I'm also using it for continuously deploying my app across 2 environments, or its monitoring stack, or running terraform etc.

Basically it runs everything for me so that I don't have to.

hnrodey1y ago· 1 in thread

Nice job getting through all this. I kind of enjoy writing scrapers and browser automation in general. Browser automation is quite powerful and under explored/utilized by the average developer.

Something I learned recently, which might help your scrapers, is the ability in Playwright to sniff the network calls made through the browser (basically, programmatic API to the Network tab of the browser).

The boost is that you allow the website/webapp to make the API calls and then the scraper focuses on the data (rather than allowing the page to render DOM updates).

This approach falls apart if the page is doing server side rendering as there are no API calls to sniff.

sakisvOP1y ago

...or worse, if there _is_ an API call but the response is HTML instead of a json
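
A sketch of the network-sniffing approach using Playwright's Python sync API, which exposes a `"response"` event on the page. The `/api/` URL fragment is an assumption about how a site names its endpoints, and the content-type filter covers the case above where an API call returns HTML.

```python
from typing import Any, Callable, List


def make_json_collector(sink: List[Any], url_fragment: str = "/api/") -> Callable[[Any], None]:
    """Build a handler for Playwright's "response" event that keeps JSON bodies only."""
    def on_response(response: Any) -> None:
        # Playwright lower-cases header names.
        content_type = response.headers.get("content-type", "")
        # Skip other endpoints, and API calls that return HTML instead of JSON.
        if url_fragment in response.url and "application/json" in content_type:
            sink.append(response.json())
    return on_response


def scrape(url: str) -> List[Any]:
    # Imported here so the collector above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    captured: List[Any] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", make_json_collector(captured))
        page.goto(url, wait_until="networkidle")
        browser.close()
    return captured
```

Waiting for `networkidle` is a simple way to let the page finish its API calls; sites that poll continuously may need an explicit wait on a specific response instead.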

mishu21y ago· 1 in thread

Playwright is basically necessary for scraping nowadays, as the browser needs to do a lot of work before the web page becomes useful/readable. I remember scraping with HTTrack back in high school and most of the sites kept working...

For my project (https://frankendash.com/), I also ran into issues with dynamically generated class names which change on every site update, so in the end I just went with saving a crop area from the website as an image and showing that.

ElCapitanMarkla1y ago

HTTrack was fantastic, still was a couple of years ago when I used it for a small project too.

moohaad1y ago· 1 in thread

Cloudflare Worker has Browser Rendering API

pencilcode1y ago

It’s pretty good actually. Used in a small scraping site and worked without a hitch.

antman1y ago· 1 in thread

Looks great. Perhaps more than 30 days comparisons would be interesting. Or customizable should be fast enough with a duckdb backend

sakisvOP1y ago

When you click on a product you get its full price history by default.

I did consider adding a 3 and 6 month button, but for some reason I decided against it, don't remember why. It wasn't performance because I'm heavily caching everything so it wouldn't have made a difference. Maybe aesthetics?

65101y ago· 1 in thread

Can someone name the South-American country where they have a government price comparison website. Listing all products was required by law.

Someone showed me this a decade ago. The site had many obvious issues but it did list everything. If I remember correctly it was started to stop merchants pricing things by who is buying.

I forget which country it was.

roberdam1y ago

Argentina https://www.preciosclaros.gob.ar/#!/buscar-productos

Scrapemist1y ago· 1 in thread

What if you add all products to your shopping cart and save it as “favourites” and scrape that every other day.

nilsherzig1y ago

You would still need a way to add all items and to check if there are new ones

ptrik1y ago· 1 in thread

> While the supermarket that I was using to test things every step of the way worked fine, one of them didn't. The reason? It was behind Akamai and they had enabled a firewall rule which was blocking requests originating from non-residential IP addresses.

Why did you pick Tailscale as the solution for proxy vs scraping with something like AWS Lambda?

anamexis1y ago

Didn't you answer your own question with the quote? It needs to originate from a residential IP address

mt_1y ago· 1 in thread

What about networking costs? Is it free in Hetzner?

kjksf1y ago

Depends on the server.

Most have at least 20 TB of bandwidth included in the price, even the lowest $5/mo shared cpu machines. 20 TB is a gigantic amount unless you're serving videos or some such.

Some have unlimited bandwidth (I mean they are effectively limited by the speed of network connection but you don't pay for amount).

jfil1y ago

I'm building something similar for 7 grocery vendors in Canada and am looking to talk with others who are doing this - my email is in my profile.

One difference: I'm recording each scraping session as a HAR file (for proving provenance). mitmproxy (mitmdump) is invaluable for that.

kinderjaje1y ago

A few years ago, we had a client and built a price-monitoring app for women's beauty products. They had multiple marketplaces, and like someone mentioned before, it was tricky because many products come in different sizes and EANs, and you need to be able to match them.

We built a system for admins so they can match products from Site A with products from Site B.

The scraping part was not that hard. We used our product https://automatio.co/ where possible, and where we couldn't, we built some scrapers from scratch using simple cURL or Puppeteer.

Thanks for sharing your experience, especially since I haven't used Playwright before.

Stubbs1y ago

I did something very similar for the price of wood from sellers here in the UK, but instead of Playwright, which I'd never heard of at the time, I used Node-RED.

You just reminded me, it's probably still running today :-D

ptrik1y ago

> I went from 4vCPUs and 16GB of RAM to 8vCPUs and 16GB of RAM, which reduced the duration by about ~20%, making it comparable to the performance I get on my MBP. Also, because I'm only using the scraping server for ~2h the difference in price is negligible.

Good lesson on cloud economics. Below a certain threshold, performance scales roughly linearly with instance size, so the total spend stays about the same: a more expensive machine runs the same workload for a shorter period, and you save time.

scarredwaits1y ago

Great article and congrats on making this! It would be great to have a chat if you like, because I’ve built Zuper, also for Greek supermarkets, which has similar goals (and problems!)

NKosmatos1y ago

Hey, thanks for creating https://pricewatcher.gr/en/ very much appreciated.

Nice blog post and very informative. Good to read that it costs you less than 70€ per year to run this and hope that the big supermarkets don’t block this somehow.

Have you thought of monetizing this? Perhaps with ads from the 3 big supermarkets you scrape ;-)

cynicalsecurity1y ago

> My first thought was to use AWS, since that's what I'm most familiar with, but looking at the prices for a moderately-powerful EC2 instance (i.e. 4 cores and 8GB of RAM) it was going to cost much more than I was comfortable to spend for a side project.

Yep, AWS is hugely overrated and overpriced.

jonatron1y ago

If you were thinking of making a UK supermarket price comparison site, IIRC there's a company who owns all the product photos, read more at https://news.ycombinator.com/item?id=31900312

janandonly1y ago

I live in the Netherlands, where we are blessed with a price comparison website (https://tweakers.net/pricewatch/) for gadgets.

ptrik1y ago

> The data from the scraping are saved in Cloudflare's R2 where they have a pretty generous 10GB free tier which I have not hit yet, so that's another €0.00 there.

Wonder how's the data from R2 fed into frontend?

Closi1y ago

This is great! Would be great if the website would give a summary of which shop was actually cheapest (e.g. based on a basket of comparable goods that all retailers stock).

Although might be hard to do with messy data.

SebFender1y ago

I've worked with similar solutions for decades (completely different need) and in the end web changes made the solution unscalable. Fun idea to play with, but with too many error scenarios.

throwaway3464341y ago

https://prices.openfoodfacts.org/

raybb1y ago

Anyone know of one of these for Spain?

brikym1y ago· 16 in thread

seoulmetro1y ago

Legality of this is rocky in Australia. I dare say that NZ is the same?

There are so many scrapers that come and go doing this in AU but are usually shut down by the big supermarkets.

It's a cycle of usefulness and "why doesn't this exist", except it had existed many times before.

russelg1y ago

1 more reply

timrkn1y ago

Agreed. Hopefully the govs price gouging mitigation strategy includes free flow of information (allowing scraping for price comparison).

jaza1y ago

Aussie here. I hadn't heard that price scraping is only quasi-legal here and that scrapers get shut down by the big supermarkets - but then again I'm not surprised.

Feel free to get in touch if you'd like to help out, or if you know of anything similar that already exists: greenash dot net dot au slash contact

3 more replies

_nivlac_1y ago

For the other commenters here - looks like this site does the job? https://hotprices.org/

With the corresponding repo too: https://github.com/Javex/hotprices-au

sumedh1y ago

> Legality of this is rocky in Australia. I dare say that NZ is the same?

You might be breaking the site's terms and conditions but that does not mean it's illegal.

Dan Murphy uses a similar thing, they have their own price checking algorithm.

1 more reply

Dev1021y ago

I built one called https://bbdeals.in/ for India. I mostly use it to buy just fruits and it's saved me about 20% of spending, which is not bad in these hard times.

Building the crawlers and the infra to support it took no more than 20 hours.

alwinaugustin1y ago

Does this work for HYD only?

1 more reply

pikelet1y ago

As a kiwi, are you able to make any of these (or your) projects available? I'm quite interested.

walterbell1y ago

gruez1y ago

>Those who order grocery delivery online would benefit from price comparisons, because they can order from multiple stores at the same time.

Not really, since the delivery fees/tips that you have to pay would eat up all the savings, unless maybe if you're buying for a family of 5 or something.

1 more reply

teruakohatu1y ago

I think the fees they tack on for online orders would ruin ordering different products from different stores. It mostly makes sense with staples that don't perish.

With fresh produce I find Pak n Save a lot more variable with quality, making online orders more risky despite the lower cost.

1 more reply

teruakohatu1y ago

I was planning on doing the same in NZ. I would be keen to chat to you about it (email in HN profile). I am a data scientist

My prediction is that they will decrease the size of the bars soon.

scubadude1y ago

RasmusFromDK1y ago· 16 in thread

A fun project, but challenging at times, and annoying problems to fix.

siamese_puff1y ago

RasmusFromDK1y ago

Thanks for the nice words!

heap_perms1y ago

I'm curious, can you wear contact lenses while working? I notice my eyes get tired when I look at a monitor for too long. Have you found any solutions for that?

RasmusFromDK1y ago

If I were you I would go to an optometrist and talk about this. They can also often give you free trials for different contacts and you can find one that works for you.

3 more replies

pavel_lishin1y ago

This is very likely age-dependent.

When I was in my 20s, this was absolutely not a problem.

When I hit my 30s, I started wearing glasses instead of contacts basically all the time, and it wasn't a problem.

Now that I'm in my 40s, I'm having to take my glasses off to read a monitor and most things that are closer than my arm's reach.

1 more reply

kristianbrigman1y ago

dotancohen1y ago

Go try an E-Ink device. B&N Nooks are small Android tablets in disguise, you just need to install a launcher. Boox devices are also Android.

I can use an E-Ink device all day without my eyes getting tired.

siamese_puff1y ago

I cannot, personally. They dry out

shellfishgene1y ago

RasmusFromDK1y ago

1 more reply

bane1y ago

throwaway7ahgb1y ago

bob_theslob6461y ago

Would that still matter if you just compare by description?

ludvigk1y ago

Isn’t this a use-case where LLMs could really help?

RasmusFromDK1y ago

1 more reply

brunoqc1y ago

Do you support Canada?

langsoul-com1y ago· 8 in thread

The hard thing is not scraping, but getting around the increasingly sophisticated blockers.

Even then, mitm attacks with mobile app (to see the network requests and data) will also get blocked without decent cover ups.

I tried but realised it isn't worth it due to the costs and constant dev work required. In fact, some of the supermarket pricing comparison services just have (cheap labour) people scrape it

__MatrixMan__1y ago

I wonder if we could get some legislation in place to require that they publish pricing data via an API so we don't have to tangle with the blockers at all.

immibis1y ago

Perhaps in Europe. Anywhere else, forget about it.

zackmorris1y ago

I'd prefer that governments enact legislation that prevents discriminating against IP addresses, perhaps under net neutrality laws.

https://pro.bloomberglaw.com/insights/litigation/how-to-file...

https://www.law.cornell.edu/wex/injunctive_relief

I'd argue that unilaterally blocking IP addresses on a wide scale pollutes the ecosystem of the internet, so can't be allowed to continue.

https://www.crowell.com/en/insights/client-alerts/supreme-co...

https://www.mcneeslaw.com/nlrb-injunction/

sakisvOP1y ago

Thankfully I'm not there yet.

Since this is just a side project, if it starts demanding too much of my time too often I'll just stop it and open both the code and the data.

BTW, how could the network request not appear in the network tab?

For me the hardest part is to correlate and compare products across supermarkets

langsoul-com1y ago

If they don't populate the page via Ajax or network requests. Ie server side, then no requests for supermarket data will appear.

See nextjs server side, I believe they mention that as a security thing in their docs.

In terms of comparison, most names tend to be the same. So some similarity search if it's in the same category matches good enough.
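
A minimal sketch of that same-category similarity search, using `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever similarity measure is actually used. The 0.8 threshold and the `(category, name)` tuple shape are assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Product = Tuple[str, str]  # (category, name)


def best_match(product: Product, candidates: List[Product],
               threshold: float = 0.8) -> Optional[Product]:
    """Most similar candidate in the same category, or None if nothing clears the threshold."""
    category, name = product
    best, best_score = None, threshold
    for candidate in candidates:
        if candidate[0] != category:
            continue  # only compare within the same category
        score = SequenceMatcher(None, name.lower(), candidate[1].lower()).ratio()
        if score >= best_score:
            best, best_score = candidate, score
    return best
```

String similarity alone will confuse different pack sizes of the same product, so a real matcher would likely compare size and unit separately, or fall back to barcodes where they are available.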

seanthemon1y ago

And you couldn't use OCR and simply take an image of the product list? Not ideal, but difficult or impossible to track depending on your method.

langsoul-com1y ago

You'll get blocked before even seeing the page most times.

eddyfromtheblok1y ago

Crowdsource it with a browser extension

nosecreek1y ago· 8 in thread

Very cool! I did something similar in Canada (https://grocerytracker.ca/)

odiroot1y ago

Similar for Austria: https://heisse-preise.io

snac1y ago

Love your site! It was a great source of inspiration with the amount of data you collect.

I did the same and made https://grocerygoose.ca/

Published the API endpoints that I “discovered” to make the app https://github.com/snacsnoc/grocery-app (see HACKING.md)

It’s an unfortunate state of affairs when devs like us have to go to such great lengths to track the price of a commodity (food).

kareemm1y ago

nosecreek1y ago

For sure, just replace the first dot in the url from my profile with an @

sakisvOP1y ago

Oh nice!

A thorny problem in my case is that the same item is named in 3 different ways between the 3 supermarkets which makes it very hard and annoying to do a proper comparison.

Did you have a similar problem?

seszett1y ago

I have built a similar system for myself, but since it's small scale I just have "groups" of similar items that I manually populate.

nosecreek1y ago

2 more replies

maxglute1y ago

Excellent work.

odysseus1y ago· 7 in thread

I used to price track when I moved to a new area, but now I find it way easier to just shop at 2 markets or big box stores that consistently have low prices.

In Europe, that would probably be Aldi/Lidl.

In the U.S., maybe Costco/Trader Joe's.

For online, CamelCamelCamel/Amazon. (for health/beauty/some electronics but not food)

bufferoverflow1y ago

> In the U.S., maybe Costco/Trader Joe's.

Costco/Walmart/Aldi in my experience.

Trader Joe's is higher quality, but generally more expensive.

DontchaKnowit1y ago

dawnerd1y ago

Sams club I’ve found beats Costco in some areas but for some items Costco absolutely undercuts like crazy. Cat litter at sams is twice the price when not on sale.

shiroiushi1y ago

Trader Joe's also only carries Trader Joe's-branded merchandise, aside from the produce. So if you're looking for something in particular that isn't a TJ item, you won't find it there.

odysseus1y ago

Occasionally you can get the same Trader Joe’s private label products rebranded as Aldi merchandise for even cheaper at Aldi.

dexwiz1y ago

You can find ALDIs in the USA, but they are regional. Trader Joe’s is owned by the same family as ALDIs, and until recently (past 10 years) you wouldn’t see them in the same areas.

jasomill1y ago

I'd usually associate the term "regional" with chains like Meijer, Giant Eagle, and Winn-Dixie.

With 2,392 stores in 38 states plus DC[1], I'm not sure Aldi US qualifies.

[1] https://stores.aldi.us

xyst1y ago· 6 in thread

Would be nice to have price transparency of goods. It would make processes like this much easier to track by store and region.

For example, compare the price of oat milk at different zip codes and grocery stores. Additionally track “shrinkflation” (same price but smaller portion).

sakisvOP1y ago

Having said that, that's definitely something that I could add and it would show when the shrinkflation occurred, if any.
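
A small sketch of how shrinkflation could be flagged from a price history, assuming each observation also records the pack size. The `Observation` fields and the rule (smaller pack and higher price per unit) are illustrative assumptions, not the site's data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    day: str      # ISO date
    price: float  # shelf price
    size: float   # pack size, e.g. grams or millilitres


def unit_price(obs: Observation) -> float:
    return obs.price / obs.size


def shrinkflation_days(history: List[Observation]) -> List[str]:
    """Days on which the pack got smaller while the price per unit went up."""
    days = []
    for before, after in zip(history, history[1:]):
        if after.size < before.size and unit_price(after) > unit_price(before):
            days.append(after.day)
    return days
```

With this rule, a plain price rise on an unchanged pack is not flagged; only a drop in size that raises the per-unit price counts.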

barbazoo1y ago

Grocers not putting per unit prices on the label is a pet peeve of mine. I can’t imagine any purpose not rooted in customer hostility.

baronswindle1y ago

There may be some exceptions, but I’m struggling to think of any except things where weight/volume aren’t really relevant to the value — e.g., a sponge.

3 more replies

dawnerd1y ago

Or when they change what unit to display so you can’t easily cross compare.

girvo1y ago

It's required by law in Australia, which is nice

candiddevmike1y ago

Imagine mandating transparent cost of goods pricing. I'd love to see farmer was paid X, manufacturer Y, and grocer added Z.

batata0041y ago· 4 in thread

latexr1y ago

> I scrape even app and websites data

And then try to sell it back to businesses, even suggesting they use the data to train AI. You also make it sound like there’s a team manually doing all the work.

https://www.economizafloripa.com.br/?q=parceria-comercial

mechanical_bear1y ago

Where does this lack ethics? It seems that they are providing a useful service, that they created with their hard work. People are allowed to make money with their work.

2 more replies

presentation1y ago

It’s almost like people try to do valuable services for others in exchange for money.

2 more replies

siamese_puff1y ago

How does the ipv6 rotation work in this flow?

grafraf1y ago· 4 in thread

I'm also one of the founders and the current CTO, so there has been a lot of scraping and maintaining over the years. We are scraping over 30 million prices daily.

filleokus1y ago

On the business side, what's your business model, how do you generate revenue? What's the longer term goals?

(Public data shows the company have a revenue of ≈400k USD and 6 employees https://www.allabolag.se/5590076351/matspar-i-sverige-ab)

grafraf1y ago

We are selling price/distribution data about the products we scrape. We do run some ads and have some affiliate deals.

showsover1y ago

Do you have a technical writeup of your scraping approach? I'd love to read more about the challenges and solutions for them.

grafraf1y ago

Unfortunately no, but i can share some insights that i hope can be of value:

- We also try to avoid scraping the sites during peak times and respect their robots.txt. We add some delay to each request. The scrapes are often done during night/early morning.

- The main challenge is that stores can redesign or modify their sites, which makes our scrapers fail, so we need to be fast and adapt to the new changes.

There is much more but i hope that gave you some insights of challenges and some solutions!
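
The robots.txt and per-request delay points above can be sketched with Python's standard `urllib.robotparser`. The user agent name, the 2 second delay and the `fetch` callable are hypothetical; this is an illustration, not the company's scraper.

```python
import time
from typing import Callable, Iterable, List
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, urls: Iterable[str],
                 user_agent: str = "price-bot") -> List[str]:
    """Filter URLs down to those the given robots.txt allows for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_fetch(urls: Iterable[str], fetch: Callable[[str], str], delay: float = 2.0,
                 sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch URLs one at a time, sleeping `delay` seconds after each request."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        sleep(delay)
    return pages
```

Scheduling the run for night or early morning, as described above, would sit outside this code in whatever job scheduler triggers the scrape.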

3 more replies

lotsofpulp1y ago· 4 in thread

In the US, retail businesses are offering individualized and general coupons via the phone apps. I wonder if this pricing can be tracked, as it results in significant differences.

I am looking at the receipt now, and it is showing I would have spent $70 total if I did not apply the app discounts, but with the app discounts, I spent $53.

These price obfuscation tactics are seen in many businesses, making price tracking very difficult.

mcoliver1y ago

I wrote a chrome extension to help with this. Clips all the coupons so you don't have to do individual searches. Has resulted in some wild surprise savings when shopping. www.throwlasso.com

Larrikin1y ago

This looks amazing. Do you have plans to support Firefox and other browsers?

koolba1y ago

Ha! I have the same thing as a bookmarklet for specific sites. It’s fun to watch it render the clicks.

lotsofpulp1y ago

Wow! This is amazing, thank you. I usually use Safari, but will give it a try.

pcblues1y ago· 3 in thread

TrackerFF1y ago

For years said supermarkets have employed "price hunters", which are just people that go to competitor stores and register the prices of everything.

pcblues1y ago

The word I was looking for was collusion, but done with software and without people-based collusion.

avador1y ago

Compusion.

ikesau1y ago· 3 in thread

Ah, I love this. Nice work!

I really wish supermarkets were mandated to post this information whenever the price of a particular SKU updated.

The tools that could be built with such information would do amazing things for consumers.

sakisvOP1y ago

Thanks!

If Greece's case is anything to go by, I doubt they'd ever accept that as it may bring to light some... questionable practices.

At some point I need to deduplicate the products and plot the prices across all 3 supermarkets on the same graph as I suspect it will show some interesting trends.

project2501a1y ago

fyi, I posted this on /r/greece

robotnikman1y ago

gadders1y ago· 3 in thread

ElCapitanMarkla1y ago

Something I was talking over with a friend a while ago was something along the lines of this.

You kind of end up with a DIY HelloFresh / meal in a box service.

gadders1y ago

Yes, that would work.

sakisvOP1y ago

Ha, you can't imagine how many times I've thought of doing just that - it's just that it's somewhat blocked by other things that need to happen before I even attempt to do it

joelthelion1y ago· 3 in thread

We should mutualize scraping efforts, creating a sort of Wikipedia of scraped data. I bet a ton of people and cool applications would benefit from it.

sakisvOP1y ago

Haha all we have to do is agree on the format, right?

Spivak1y ago

https://en.m.wikipedia.org/wiki/WARC_(file_format)

joelthelion1y ago

Honestly I don't think that matters a lot. Even if all sites were scraped in a different format, the collection would still be insanely useful.

The most important part is being able to consistently scrape every day or so for a long time. That isn't easy.

haolez1y ago· 3 in thread

I heard that some e-commerce sites will not block scrapers, but poison the data shown to them (e.g. subtly wrong prices). Does anyone know more about this?

barryrandall1y ago

I never poisoned data, but I have implemented systems where clients who made requests too quickly got served data from a snapshot that only updated every 15 minutes.
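
To make that concrete, here is a minimal sketch of the idea: slow clients get live data, while clients that request too quickly get a snapshot that is refreshed at most every 15 minutes. The class name, the one-second threshold, and the `load_live` callback are assumptions for illustration, not the system I actually built.

```python
import time

SNAPSHOT_TTL = 15 * 60  # seconds between snapshot refreshes (from the comment)
MIN_INTERVAL = 1.0      # assumed threshold: anything faster counts as "too quickly"


class PriceServer:
    """Serves live data to slow clients and a stale snapshot to fast ones."""

    def __init__(self, load_live, clock=time.monotonic):
        self._load_live = load_live  # hypothetical callback returning current prices
        self._clock = clock          # injectable so the behaviour can be tested
        self._last_seen = {}         # client id -> time of previous request
        self._snapshot = None
        self._snapshot_at = None

    def _cached(self, now):
        # Refresh the snapshot only when it is missing or older than the TTL.
        if self._snapshot_at is None or now - self._snapshot_at >= SNAPSHOT_TTL:
            self._snapshot = self._load_live()
            self._snapshot_at = now
        return self._snapshot

    def get_prices(self, client_id):
        now = self._clock()
        previous = self._last_seen.get(client_id)
        self._last_seen[client_id] = now
        if previous is not None and now - previous < MIN_INTERVAL:
            return self._cached(now)  # too fast: possibly stale data
        return self._load_live()      # normal pace: live data
```

From the scraper's side this looks like a perfectly healthy response, which is why a plain "did the request succeed" check would not notice it.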

MathMonkeyMan1y ago

This HN post had me playing around with Key Food's website. A lot of information is wrapped up in a cookie, but it looks like there isn't too much javascript rendering.

But when I hit the URLs with curl, without a cookie, I get a valid looking page, but it's just a hundred listings for "Baby Bok Choy." Maybe a test page?

After a little more fiddling, the server just responded with an empty response body. So, it looks like I'll have to use browser automation.

marginalia_nu1y ago

hk13371y ago· 3 in thread

I would be curious if there were a price difference between what is online and physically in the store.

devjab1y ago

The biggest sales come from the individual store “close to expiration” sales where items can become really cheap. These aren’t available anywhere but the stores themselves though.

flir1y ago

Next step: monitoring the updates to those e-ink shelf edge labels that are starting to crop up.

sakisvOP1y ago

The few random checks that I did on a few products as I was shopping didn't show any difference.

Either I was lucky or they don't bother, who knows

maerten1y ago· 2 in thread

Nice article!

> The second kind is nastier. > > They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.

I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
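
A small sketch of that split, assuming a made-up directory layout and a made-up product markup (`<li data-sku=... data-price=...>`). The scrape step only writes the raw response to disk; the parse step only reads from disk, so it can be re-run over old scrapes after a parser fix. The regex is just to keep the example short; a real parser would more likely use a DOM library.

```python
import re
from pathlib import Path


def save_raw(raw_dir, store, scraped_on, page, html):
    """Scrape step: persist the response untouched, with no parsing at all."""
    target = Path(raw_dir) / store / scraped_on / f"page-{page:04d}.html"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return target


# Hypothetical markup: <li data-sku="123" data-price="1.99">Name</li>
PRODUCT_RE = re.compile(
    r'<li data-sku="(?P<sku>[^"]+)" data-price="(?P<price>[\d.]+)">(?P<name>[^<]+)</li>'
)


def parse_saved(raw_dir, store, scraped_on):
    """Parse step: works only from the saved files, never from the network."""
    products = []
    day_dir = Path(raw_dir) / store / scraped_on
    for path in sorted(day_dir.glob("page-*.html")):
        for match in PRODUCT_RE.finditer(path.read_text(encoding="utf-8")):
            products.append(
                {
                    "sku": match["sku"],
                    "name": match["name"].strip(),
                    "price": float(match["price"]),
                }
            )
    return products
```

If the store changes its markup, only `PRODUCT_RE` (or whatever parser replaces it) needs fixing, and `parse_saved` can be run again over every day already on disk.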

I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/

It works for the most part, as long as at least one correct barcode number is provided for a product.

sakisvOP1y ago

Thanks!

> I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.

Yes, that's exactly what I've been doing and it saved me more times than I'd care to admit!

Tryk1y ago

Awesome, have been looking for something like this!

seanwilson1y ago· 2 in thread

z3t41y ago

Sanity checks in programming are underrated. Not only are they cheap performance-wise, they also catch bugs early that would otherwise poison the state.

sakisvOP1y ago

Yeah I thought about that, but I've seen cases where a product's price jumped more than 100%.

I used this kind of heuristic to check if a scrape was successful by checking that the number of products scraped today is within ~10% of the average of the last 7 days or so
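
Roughly, that check could look like the sketch below. The function name and the "pass when there is no history yet" behaviour are my assumptions here; the 10% tolerance and 7-day window are the ones mentioned above.

```python
from statistics import mean


def scrape_looks_complete(todays_count, previous_counts, tolerance=0.10, window=7):
    """Return True if today's product count is close to the recent average.

    previous_counts is a list of daily product counts, oldest first.
    """
    recent = previous_counts[-window:]
    if not recent:
        return True  # assumption: nothing to compare against on the first run
    baseline = mean(recent)
    return abs(todays_count - baseline) <= tolerance * baseline
```

This catches the "nasty" kind of failure where the scraper finishes without errors but quietly returns far fewer products than usual.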

andrewla1y ago· 2 in thread

One problem that the author notes is that so much rendering is done client side via javascript.

sakisvOP1y ago

Initially that's what I wanted to do, but the first supermarket I did is sending back HTML rendered on the server side, so I abandoned this approach for the sake of "consistency".

Lately I've been thinking to bite the bullet and Just Do It, but since it's working I'm a bit reluctant to touch it.

andrewla1y ago

For your purposes scraping the user-visible site probably makes the most sense since in the end, their users' eyes are the target.

I am typically doing one-off scraping and for that, an undocumented but clean JSON api makes things so much easier, so I've grown to enjoy sites that are unnecessarily complex in their rendering.

PigiVinci831y ago· 2 in thread

bob_theslob6461y ago

This looks like a really cool website but my only critique is how are you verifying that the data is actually real and not just generated randomly?

redblacktree1y ago

Do you have data on which data is in higher demand? Do you keep a list of frequently-requested datasets?

Alifatisk1y ago· 2 in thread

Some stores don’t have an interactive website but instead send out magazines to your email with news for the week.

How would one scrape those? Anyone experienced?

psd11y ago

IMAP library to dump the attachment, pandoc to convert it to HTML, then a DOM library to parse it statically.

Likely easier than website scraping.
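
As a sketch of the first step only: once an IMAP library (e.g. Python's `imaplib`) has fetched the raw message bytes, the standard `email` package can pull out the attachments. The sample message and file name below are invented for the example; the pandoc and DOM steps are not shown.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_attachments(raw_message):
    """Return {filename: bytes} for every named attachment in a raw email."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    return {
        part.get_filename(): part.get_payload(decode=True)
        for part in msg.iter_attachments()
        if part.get_filename()
    }


def build_sample_message():
    """Hypothetical weekly-offers email, standing in for a real fetched message."""
    msg = EmailMessage()
    msg["Subject"] = "Weekly offers"
    msg.set_content("This week's magazine is attached.")
    msg.add_attachment(
        b"%PDF-placeholder",
        maintype="application",
        subtype="pdf",
        filename="week-12.pdf",
    )
    return msg.as_bytes()
```

Each extracted file could then be written to disk and handed to pandoc or another converter, depending on the attachment format.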

Alifatisk1y ago

I’ll try this approach, thanks! Most magazines I’ve noticed are using a grid design, so my first thought was to somehow detect each square and then OCR the product name with its price.

xnx1y ago· 1 in thread

sakisvOP1y ago

When I first started, there were a couple of instances where my IP was blocked, despite being a residential IP behind CGNAT.

I then started randomising every aspect of the scraping process that I could: the order in which I visited the links, the sleep duration between almost every action, etc.

As long as they don't implement a strict fingerprinting technique, that seems to be enough for now
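
A simplified sketch of what that randomisation amounts to. The function, the `visit` callback, and the 1 to 5 second range are illustrative assumptions, not the real scraper code.

```python
import random
import time


def randomized_visit(links, visit, min_sleep=1.0, max_sleep=5.0,
                     rng=random, sleep=time.sleep):
    """Visit every link once, in a shuffled order, with a random pause after each.

    visit is a hypothetical callback that does the actual page load and scrape.
    rng and sleep are injectable so the behaviour can be tested without waiting.
    """
    order = list(links)
    rng.shuffle(order)  # a different visiting order on every run
    for link in order:
        visit(link)
        sleep(rng.uniform(min_sleep, max_sleep))  # no fixed rhythm between actions
    return order
```

The same `rng.uniform` pause can be dropped between smaller actions too (clicks, scrolls, pagination), which is where most of the sleeps end up.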

ptrik1y ago· 1 in thread

What's the thought process behind using a CI server - which I thought is mainly for builds - for what essentially is a data pipeline?

sakisvOP1y ago

Well I'm just thinking of concourse the same way it describes itself, "a continuous thing doer".

Of course I'm also using it for continuously deploying my app across 2 environments, or its monitoring stack, or running terraform etc.

Basically it runs everything for me so that I don't have to.

hnrodey1y ago· 1 in thread

Nice job getting through all this. I kind of enjoy writing scrapers and browser automation in general. Browser automation is quite powerful and under explored/utilized by the average developer.

The boost is that you allow the website/webapp to make the API calls and then the scraper focuses on the data (rather than allowing the page to render DOM updates).

This approach falls apart if the page is doing server side rendering as there are no API calls to sniff.
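
A minimal sketch of that idea in Python. The handler only relies on `response.url`, `response.headers`, and `response.json()`, which Playwright's response objects provide; the `/api/products` path and the shop URL are hypothetical.

```python
def make_json_collector(url_fragment, sink):
    """Build a response handler that stores JSON bodies of matching API calls."""

    def on_response(response):
        content_type = response.headers.get("content-type", "")
        # Ignore anything that is not a JSON response from the API we care about.
        if url_fragment in response.url and "application/json" in content_type:
            sink.append(response.json())

    return on_response


# Assumed usage with Playwright's sync API (not executed here):
#
#   from playwright.sync_api import sync_playwright
#
#   captured = []
#   with sync_playwright() as p:
#       browser = p.chromium.launch()
#       page = browser.new_page()
#       page.on("response", make_json_collector("/api/products", captured))
#       page.goto("https://shop.example/category/dairy")  # hypothetical URL
#       browser.close()
```

The page does the work of calling the API, and the scraper just collects the payloads as they go by, without touching the rendered DOM.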

sakisvOP1y ago

...or worse, if there _is_ an API call but the response is HTML instead of a json

mishu21y ago· 1 in thread

ElCapitanMarkla1y ago

HTTrack was fantastic, still was a couple of years ago when I used it for a small project too.

moohaad1y ago· 1 in thread

Cloudflare Workers has a Browser Rendering API

pencilcode1y ago

It’s pretty good actually. Used in a small scraping site and worked without a hitch.

antman1y ago· 1 in thread

Looks great. Comparisons over more than 30 days might be interesting. Or make the range customizable; that should be fast enough with a DuckDB backend.

sakisvOP1y ago

When you click on a product you get its full price history by default.

65101y ago· 1 in thread

Can someone name the South American country that has a government price comparison website? Listing all products was required by law.

Someone showed me this a decade ago. The site had many obvious issues but it did list everything. If I remember correctly it was started to stop merchants pricing things by who is buying.

I forget which country it was.

roberdam1y ago

Argentina https://www.preciosclaros.gob.ar/#!/buscar-productos

Scrapemist1y ago· 1 in thread

What if you add all products to your shopping cart and save it as “favourites” and scrape that every other day.

nilsherzig1y ago

You would still need a way to add all items and to check if there are new ones

ptrik1y ago· 1 in thread

Why did you pick Tailscale as the solution for proxy vs scraping with something like AWS Lambda?

anamexis1y ago

Didn't you answer your own question with the quote? It needs to originate from a residential IP address

mt_1y ago· 1 in thread

What about networking costs? Is it free in Hetzner?

kjksf1y ago

Depends on the server.

Most have at least 20 TB of bandwidth included in the price, even the lowest $5/mo shared cpu machines. 20 TB is a gigantic amount unless you're serving videos or some such.

Some have unlimited bandwidth (I mean they are effectively limited by the speed of network connection but you don't pay for amount).

jfil1y ago

I'm building something similar for 7 grocery vendors in Canada and am looking to talk with others who are doing this - my email is in my profile.

One difference: I'm recording each scraping session as a HAR file (for proving provenance). mitmproxy (mitmdump) is invaluable for that.

kinderjaje1y ago

We built a system for admins so they can match products from Site A with products from Site B.

The scraping part was not that hard. We used our product https://automatio.co/ where possible, and where we couldn't, we built some scrapers from scratch using simple cURL or Puppeteer.

Thanks for sharing your experience, especially since I haven't used Playwright before.

Stubbs1y ago

I did something very similar, but for the price of wood from sellers here in the UK. Instead of Playwright, which I'd never heard of at the time, I used Node-RED.

You just reminded me, it's probably still running today :-D

ptrik1y ago

scarredwaits1y ago

Great article and congrats on making this! It would be great to have a chat if you like, because I’ve built Zuper, also for Greek supermarkets, which has similar goals (and problems!)

NKosmatos1y ago

Hey, thanks for creating https://pricewatcher.gr/en/ very much appreciated.

Nice blog post and very informative. Good to read that it costs you less than 70€ per year to run this and hope that the big supermarkets don’t block this somehow.

Have you thought of monetizing this? Perhaps with ads from the 3 big supermarkets you scrape ;-)

cynicalsecurity1y ago

Yep, AWS is hugely overrated and overpriced.

jonatron1y ago

If you were thinking of making a UK supermarket price comparison site, IIRC there's a company who owns all the product photos, read more at https://news.ycombinator.com/item?id=31900312

janandonly1y ago

I live in the Netherlands, where we are blessed with a price comparison website (https://tweakers.net/pricewatch/) for gadgets.

ptrik1y ago

> The data from the scraping are saved in Cloudflare's R2 where they have a pretty generous 10GB free tier which I have not hit yet, so that's another €0.00 there.

I wonder how the data from R2 is fed into the frontend?

Closi1y ago

This is great! Would be great if the website would give a summary of which shop was actually cheapest (e.g. based on a basket of comparable goods that all retailers stock).

Although might be hard to do with messy data.

SebFender1y ago

I've worked with similar solutions for decades (for a completely different need), and in the end changes to the websites made the solution unscalable. It's a fun idea to play with, but there are too many error scenarios.

throwaway3464341y ago

https://prices.openfoodfacts.org/

raybb1y ago

Anyone know of one of these for Spain?
