At the time I wrote it I thought nobody else was doing it, but now I know of at least 3 startups doing the same in NZ. It seems the inflation really stoked a lot of innovation here. The patterns are about what you'd expect. Supermarkets are up to the usual tricks of arbitrarily making pricing as complicated as possible, using 'sawtooth' methods to segment time-poor people from poor people. Often they'll segment on brand loyalty vs price sensitivity: there might be 3 popular brands of chocolate, and every week only one of them will be sold at a fair price.
The smart shopper might only buy in bulk once every three weeks when his favourite brand is at a lower price, or switch to the cheapest brand every week. A hurried or lazy shopper might always pick their favourite brand every week. If they buy one item a week the lazy shopper would have spent $5, while the smart shopper has only spent $3.
They've made 60c off the smart shopper and $2.60 off the lazy shopper. By segmenting out the lazy shoppers they've made $2. The whole idea of rotating the prices has nothing to do with the cost of goods sold; it's all about making shopping a pain in the ass for busy people and catching them out.
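To make that arithmetic concrete, here is a small sketch. The comment doesn't give the per-bar prices, so the sale price ($1), regular price ($2), cost of goods ($0.80) and the three-week rotation are all assumptions picked only because they reproduce the $5 / $3 / 60c / $2.60 figures above.

```python
# All figures are in cents and are assumptions chosen to match the totals above.
SALE_CENTS = 100     # assumed price of whichever brand is on sale that week
REGULAR_CENTS = 200  # assumed price of the brands not on sale
COST_CENTS = 80      # assumed cost of goods per bar
WEEKS = 3            # one full rotation across three brands


def lazy_spend():
    # The favourite brand is on sale once per rotation, full price otherwise.
    return SALE_CENTS + (WEEKS - 1) * REGULAR_CENTS


def smart_spend():
    # Always buys whichever brand is on sale that week.
    return WEEKS * SALE_CENTS


def margin(spend):
    # Supermarket margin over the rotation for a shopper buying one bar a week.
    return spend - WEEKS * COST_CENTS


lazy, smart = lazy_spend(), smart_spend()
print(f"lazy:  spends ${lazy / 100:.2f}, margin ${margin(lazy) / 100:.2f}")
print(f"smart: spends ${smart / 100:.2f}, margin ${margin(smart) / 100:.2f}")
print(f"extra from segmentation: ${(lazy - smart) / 100:.2f}")
```

The cost of goods is identical for both shoppers, so the whole $2 difference comes from the rotation itself.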
There are so many scrapers that come and go doing this in AU but are usually shut down by the big supermarkets.
It's a cycle of usefulness and "why doesn't this exist", except it had existed many times before.
I’ve been interested in price comparison for Australia for a while, am a Product designer/manager with a concept prototype design, looking for others interested to work on it. My email is on my profile if you are.
I'm thinking of starting a little price comparison site, mainly to compare select products at Colesworths vs Aldi (I've just started doing more regular grocery shopping at Aldi myself). But as far as I know, Aldi don't have any prices / catalogues online, so my plan is to just manually enter the data myself in the short-term, and to appeal to crowdsourcing the data in the long-term. And plan is to just make it a simple SSG site (e.g. Hugo powered), data all in simple markdown / json files, data all sourced via github pull requests.
Feel free to get in touch if you'd like to help out, or if you know of anything similar that already exists: greenash dot net dot au slash contact
With the corresponding repo too: https://github.com/Javex/hotprices-au
You might be breaking the site's terms and conditions, but that does not mean it's illegal.
Dan Murphy uses a similar thing, they have their own price checking algorithm.
Building the crawlers and the infra to support them took no more than 20 hours.
Not really, since the delivery fees/tips that you have to pay would eat up all the savings, unless maybe if you're buying for a family of 5 or something.
With fresh produce I find Pak n Save a lot more variable with quality, making online orders more risky despite the lower cost.
Did you notice anything pre and post the Whittakers price increase(s)? They must have a brilliant PR firm on retainer for every major news outlet to more or less push the line that increased prices are a good thing for the consumer. I noticed more aggressive "sales" more recently, but I'm unsure if I am just paying more attention.
My prediction is that they will decrease the size of the bars soon.
One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it).
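A rough sketch of what "regexes first, manual map for the rest" could look like. The specific patterns, the `MANUAL_MAP` entry and the function name are made up for illustration; they are not the rules actually used.

```python
import re

# Hypothetical manual overrides for names the regexes cannot reconcile.
# Keys are already-normalised names, values are the canonical product name.
MANUAL_MAP = {
    "oasys 2wk": "acuvue oasys",
}


def normalise(name: str) -> str:
    s = name.lower()
    s = re.sub(r"[^\w\s]", " ", s)  # drop punctuation
    # Unify different ways of writing the pack size (assumed variants).
    s = re.sub(r"\b(\d+)\s*(pack|pk|lenses)\b", r"\1pk", s)
    s = re.sub(r"\s+", " ", s).strip()  # collapse whitespace
    # Fall back to the manual map for the cases regexes can't handle.
    return MANUAL_MAP.get(s, s)


print(normalise("Acuvue  Oasys (6 Pack)"))
print(normalise("ACUVUE Oasys, 6 lenses"))
```

Both example names collapse to the same key, which is the point: anything that still doesn't line up after this step goes into the manual map and gets verified by hand.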
I've found that building the scrapers and infrastructure is somewhat the easy part. The hard part is maintaining all of the scrapers and figuring out, when a product disappears from a site, whether that's because my scraper has an error, my scraper is being blocked, the site made a change, the site was randomly down for maintenance when I scraped it, etc.
A fun project, but challenging at times, and annoying problems to fix.
If I were you I would go to an optometrist and talk about this. They can also often give you free trials for different contacts and you can find one that works for you.
When I was in my 20s, this was absolutely not a problem.
When I hit my 30s, I started wearing glasses instead of contacts basically all the time, and it wasn't a problem.
Now that I'm in my 40s, I'm having to take my glasses off to read a monitor and most things that are closer than my arm's reach.
I can use an E-Ink device all day without my eyes getting tired.
Now that you mention idealo, there seems to be no mention at all on a product page that they are paid by the stores, you have to click the "rank" link in the footer to be brought to a page https://www.idealo.de/aktion/ranking where they write this.
In the U.S. at least, big retailers will have product suppliers build slightly different SKUs for them to make price comparisons tricky. Costco is somewhat notorious for this, where almost everything in electronics (and many other products) sold in their stores is a custom SKU -- often with a slightly different product configuration.
Part of the problem is simply due to bad data from the websites. Just as an example - there's a 2-week contact lens called "Acuvue Oasys". And there's a completely different 1-day contact lens called "Acuvue Oasys 1-Day". Some sites have been bad at writing this properly, so both variants may be called "Acuvue Oasys" (or close to it), and the way to distinguish them is to look at the image to see which actual lens they mean, look at the price etc.
It's true that this could probably also be handled by AI, but in the end, classifying the lenses takes like 1-2% of the time it takes to make a scraper for a website so I found it was not worth trying to build a very good LLM classifier for this.
You'll need to constantly rotate (highly rated) residential proxies and make sure not to exhibit data scraping patterns. Some supermarkets don't show the network requests in the network tab, so you cannot just grab that API response.
Even then, MITM interception of the mobile app (to see the network requests and data) will also get blocked without decent cover.
I tried, but realised it isn't worth it due to the costs and the constant dev work required. In fact, some of the supermarket price comparison services just have (cheap labour) people scrape it.
For anyone with some clout/money who would like to stop corporations like Akamai and Cloudflare from unilaterally blocking IP addresses, the way that works is you file a lawsuit against the corporations and get an injunction to stop a practice (like IP blacklisting) during the legal proceedings. IANAL, so please forgive me if my terminology isn't quite right here:
https://pro.bloomberglaw.com/insights/litigation/how-to-file...
https://www.law.cornell.edu/wex/injunctive_relief
Injunctions have been used with great success for a century or more to stop corporations from polluting or destroying ecosystems. The idea is that since anyone can file an injunction, that creates an incentive for corporations to follow the law or risk having their work halted for months or years as the case proceeds.
I'd argue that unilaterally blocking IP addresses on a wide scale pollutes the ecosystem of the internet, so can't be allowed to continue.
Of course, corporations have thought of all of this, so have gone to great lengths to lobby governments and use regulatory capture to install politicians and judges who rule in their favor to pay back campaign contributions they've received from those same corporations:
https://www.crowell.com/en/insights/client-alerts/supreme-co...
https://www.mcneeslaw.com/nlrb-injunction/
So now the pressures that corporations have applied on the legal system to protect their own interests at the cost of employees, taxpayers and the environment have started to affect other industries like ours in tech.
You'll tend to hear that disruptive ideas like I've discussed are bad for business from the mainstream media and corporate PR departments, since they're protecting their own interests. That's why I feel that the heart of hacker culture is in disrupting the status quo.
Since this is just a side project, if it starts demanding too much of my time too often I'll just stop it and open both the code and the data.
BTW, how could the network request not appear in the network tab?
For me the hardest part is to correlate and compare products across supermarkets
See nextjs server side, I believe they mention that as a security thing in their docs.
In terms of comparison, most names tend to be the same, so a similarity search restricted to the same category matches well enough.
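Something like this is what I'd picture for a category-restricted similarity search. It uses `difflib` from the standard library as a stand-in; the 0.8 threshold, the `(name, category)` tuple shape and the function name are assumptions, not what any of the sites here actually run.

```python
from difflib import SequenceMatcher


def best_match(name, category, candidates, threshold=0.8):
    """Return the most similar candidate name in the same category, or None.

    candidates: iterable of (name, category) tuples from another store.
    threshold: assumed cut-off below which we refuse to match.
    """
    best, best_score = None, 0.0
    for cand_name, cand_category in candidates:
        if cand_category != category:
            continue  # only compare within the same category
        score = SequenceMatcher(None, name.lower(), cand_name.lower()).ratio()
        if score > best_score:
            best, best_score = cand_name, score
    return best if best_score >= threshold else None


other_store = [
    ("Anchor Butter 500g", "dairy"),
    ("Anchor Blue Milk 2L", "dairy"),
    ("Anchor Butter 500g", "toys"),
]
print(best_match("Anchor butter 500 g", "dairy", other_store))
```

Restricting to one category keeps the candidate list small and avoids silly matches between similarly named products in unrelated aisles.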
I did the same and made https://grocerygoose.ca/
Published the API endpoints that I “discovered” to make the app https://github.com/snacsnoc/grocery-app (see HACKING.md)
It’s an unfortunate state of affairs when devs like us have to go to such great lengths to track the price of a commodity (food).
A thorny problem in my case is that the same item is named in 3 different ways between the 3 supermarkets which makes it very hard and annoying to do a proper comparison.
Did you have a similar problem?
I have the additional problem that I want to compare products across France and Belgium (Dutch-speaking side) so there is no hope at all to group products automatically. My manual system allows me to put together say 250g and 500g packaging of the same butter, or of two of the butters that I like to buy, so I can always see easily which one I should get (it's often the 250g that's cheaper by weight these days).
Also the 42000 or so different packagings for Head and Shoulders shampoo. 250ml, 270ml, 285ml, 480ml, 500ml (285ml is usually cheapest). I'm pretty sure they do it on purpose so each store doesn't have to match price with the others because it's a "different product".
In Europe, that would probably be Aldi/Lidl.
In the U.S., maybe Costco/Trader Joe's.
For online, CamelCamelCamel/Amazon. (for health/beauty/some electronics but not food)
If you can buy direct from the manufacturer, sometimes that's even better. For example, I got a particular brand of soap I love at the soap's wholesaler site in bulk for less than half the retail price. For shampoo, buying the gallon size direct was way cheaper than buying from any retailer.
Costco/Walmart/Aldi in my experience.
Trader Joe's is higher quality, but generally more expensive.
I pretty much just exclusively shop at Aldi/Walmart as they have the best prices overall. Kroger owned stores and Albertsons owned are insanely overpriced. Target is a good middle ground but I can’t stand shopping there now with everything getting locked up.
With 2,392 stores in 38 states plus DC[1], I'm not sure Aldi US qualifies.
For example, compare the price of oat milk at different zip codes and grocery stores. Additionally track “shrinkflation” (same price but smaller portion).
On that note, it seems you are tracking price but are you also checking the cost per gram (or ounce)? Manufacturer or store could keep price the same but offer less to the consumer. Wonder if your tool would catch this.
Having said that, that's definitely something that I could add, and it would show when the shrinkflation occurred, if any.
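A minimal sketch of how that check might work, assuming each observation already has the pack size parsed out. The `(date, price_cents, grams)` tuple layout and both function names are hypothetical.

```python
def unit_prices(observations):
    """Price per 100 g for each observation.

    observations: list of (date, price_cents, grams), sorted by date.
    """
    return [(day, price / grams * 100) for day, price, grams in observations]


def shrinkflation_events(observations):
    """Dates where the pack got smaller while the shelf price did not drop."""
    events = []
    for (_, p0, g0), (d1, p1, g1) in zip(observations, observations[1:]):
        if g1 < g0 and p1 >= p0:
            events.append(d1)
    return events


history = [
    ("2024-01-01", 500, 250),
    ("2024-02-01", 500, 250),
    ("2024-03-01", 500, 220),  # same price, smaller bar
]
print(shrinkflation_events(history))
```

Tracking the unit price alongside the shelf price means the event shows up on the graph even when the sticker price never moves.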
There may be some exceptions, but I’m struggling to think of any except things where weight/volume aren’t really relevant to the value — e.g., a sponge.
And then try to sell it back to businesses, even suggesting they use the data to train AI. You also make it sound like there’s a team manually doing all the work.
https://www.economizafloripa.com.br/?q=parceria-comercial
That whole page makes my view of the project go from “helpful tool for the people, to wrestle back control from corporations selling basic necessities” to “just another attempt to make money”. Which is your prerogative, I was just expecting something different and more ethically driven when I read the homepage.
I'm also one of the founders and the current CTO, so there has been a lot of scraping and maintaining over the years. We are scraping over 30 million prices daily.
(Public data shows the company has a revenue of ≈400k USD and 6 employees https://www.allabolag.se/5590076351/matspar-i-sverige-ab)
The insight I can share is that the main (tech) goal is to make the product more user friendly and more aligned with the customer's needs, as it has many pain points and we have gained some insights into the preferred customer journey.
- Tech: Everything is hosted in AWS. We use Golang in Docker containers to do the scraping. They run on ECS Fargate Spot when needed, triggered by a cron job. The scraping result is stored as a Parquet file in S3 and processed in our RDS PostgreSQL. We need to be creative and have some methods to identify that a particular product A in store 1 is the same as product A in store 2 so they are mapped together. Sometimes it needs to be verified manually. The data that is of interest to the user/site is indexed into Elasticsearch.
Things that might be of interest:
- We always try to avoid parsing the HTML and instead call the sites' APIs directly to reduce scraping time. We also try to scrape the category listings to get multiple prices in one request; this can reduce the total from over 100 000 requests to maybe fewer than 1000.
- We also try to avoid scraping the sites during peak times and respect their robots.txt. We add some delay to each request. The scrapes are often done during night/early morning.
- The main challenge is that stores can redesign or modify their sites, which makes our scrapers fail, so we need to be fast and adapt to the changes.
- Another major hidden challenge is that the stores have different prices for the same product depending on your zip code, so we have our ways of identifying a store's different warehouses and which zip codes belong to a specific warehouse, and we do a scrape for each warehouse. A store might have 5 warehouses, so we need to scrape it 5 times with different zip codes.
There is much more, but I hope that gave you some insight into the challenges and some solutions!
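For the warehouse point, the bookkeeping might look roughly like the sketch below. How the zip-to-warehouse mapping is discovered isn't described above, so here it is simply assumed to exist as a dict; the function name and the sample zip codes are made up.

```python
def zips_to_scrape(zip_to_warehouse):
    """Pick one representative zip code per warehouse.

    zip_to_warehouse: assumed mapping of zip code -> warehouse id,
    obtained by whatever means the store exposes it.
    """
    representative = {}
    for zip_code, warehouse in sorted(zip_to_warehouse.items()):
        # Keep the first (lowest) zip code seen for each warehouse.
        representative.setdefault(warehouse, zip_code)
    return representative


mapping = {"11120": "north", "11121": "north", "41101": "west"}
print(zips_to_scrape(mapping))
```

With this, a store with 5 warehouses gets scraped 5 times, once per representative zip code, rather than once per zip code.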
For example, I recently purchased fruit and dairy at Safeway in the western US, and after I had everything I wanted, I searched each item in the Safeway app, and it had coupons I could apply for $1.5 to $5 off per item. The other week, my wife ran into the store to buy cream cheese. While she did that, I searched the item in the app, and “clipped” a $2.30 discount, so what would have been $5.30 to someone that didn’t use the app was $3.
I am looking at the receipt now, and it is showing I would have spent $70 total if I did not apply the app discounts, but with the app discounts, I spent $53.
These price obfuscation tactics are seen in many businesses, making price tracking very difficult.
The rationale is that if all prices are out there in the open, consumers will end up paying a higher price, as the actors (supermarkets) will end up pricing their stuff equally, at a point where everyone makes a maximum profit.
For years said supermarkets have employed "price hunters", which are just people that go to competitor stores and register the prices of everything.
Here in Norway you will oftentimes notice that supermarket A will have sale/rebates on certain items one week, then the next week or after supermarket B will have something similar, to attract customers.
I really wish supermarkets were mandated to post this information whenever the price of a particular SKU updated.
The tools that could be built with such information would do amazing things for consumers.
If Greece's case is anything to go by, I doubt they'd ever accept that as it may bring to light some... questionable practices.
At some point I need to deduplicate the products and plot the prices across all 3 supermarkets on the same graph as I suspect it will show some interesting trends.
Where you could set a list of various meals that you like to eat regularly, a list of like 20 meal options. And then the app fetches the pricing for all ingredients and works out which meals are the best value that week.
You kind of end up with a DIY HelloFresh / meal in a box service.
"Dave, the cheapest meals for you this week are [LIST OF DINNERS]. Based on your preferred times, deliveries from Waitrose, Tesco and Sainsbury's are turning up at 7pm on Monday. Please check you still have the following staples in stock [EG pasta, tinned tomatoes etc]".
The most important part is being able to consistently scrape every day or so for a long time. That isn't easy.
But when I hit the URLs with curl, without a cookie, I get a valid looking page, but it's just a hundred listings for "Baby Bok Choy." Maybe a test page?
After a little more fiddling, the server just responded with an empty response body. So, it looks like I'll have to use browser automation.
The biggest sales come from the individual store “close to expiration” sales where items can become really cheap. These aren’t available anywhere but the stores themselves though.
Here I think the biggest challenge might be the monopoly supermarket chains have on the market. We basically have two major corporations with various brands. They are extremely similar in their pricing, and even though there are two low-price competitors, these don't seem to affect the competition with the two major corporations at all. What is worse is that one of these two major corporations is "winning", meaning that we're heading more and more toward what will basically be a true monopoly.
Either I was lucky or they don't bother, who knows
> The second kind is nastier.
>
> They change things in a way that doesn't make your scraper fail. Instead the scraping continues as before, visiting all the links and scraping all the products.
I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
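As a rough illustration of that split: one function only writes the raw response to disk, and a separate one parses whatever is on disk. The directory layout and the `products` / `name` / `price` keys are invented for the example; a real site's JSON will look different.

```python
import json
from pathlib import Path


def save_raw(raw_dir: Path, store: str, day: str, body: str) -> Path:
    """Scrape step: persist the response body untouched."""
    path = raw_dir / store / f"{day}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    return path


def parse_raw(path: Path) -> list[dict]:
    """Parse step: can be re-run at any time against the saved files."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    # Assumed payload shape: {"products": [{"name": ..., "price": ...}, ...]}
    return [
        {"name": p["name"], "price": p["price"]}
        for p in payload.get("products", [])
    ]
```

If the parser turns out to have a bug, you fix `parse_raw` and replay it over the saved files instead of losing those days of data.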
I have built a similar system and website for the Netherlands, as part of my master's project: https://www.superprijsvergelijker.nl/
Most of the scraping in my project is done by making simple HTTP calls to JSON APIs. For some websites, a Playwright instance is used to get a valid session cookie and circumvent bot protection and captchas. The rest of the crawler/scraper, parsers and APIs are built using Haskell and run on AWS ECS. The website is NextJS.
The main challenge I have been trying to work on, is trying to link products from different supermarkets, so that you can list prices in a single view. See for example: https://www.superprijsvergelijker.nl/supermarkt-aanbieding/6...
It works for the most part, as long as at least one correct barcode number is provided for a product.
> I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.
Yes, that's exactly what I've been doing and it saved me more times than I'd care to admit!
You could probably add some automated checks to not sync changes to prices/products if a sanity check fails e.g. each price shouldn't change by more than 100%, and the number of active products shouldn't change by more than 20%.
I used this kind of heuristic to check if a scrape was successful by checking that the amount of products scraped today is within ~10% of the average of the last 7 days or so
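The two checks mentioned in this sub-thread could be sketched like this. The 10% and 100% thresholds come from the comments above; the function names and data shapes are assumptions.

```python
def scrape_looks_sane(today_count, recent_counts, tolerance=0.10):
    """True if today's product count is within tolerance of the recent average."""
    if not recent_counts:
        return True  # nothing to compare against yet
    average = sum(recent_counts) / len(recent_counts)
    return abs(today_count - average) <= tolerance * average


def suspicious_price_changes(old_prices, new_prices, max_change=1.0):
    """Products whose price moved by more than max_change (1.0 = 100%)."""
    flagged = []
    for product, new in new_prices.items():
        old = old_prices.get(product)
        if old and abs(new - old) / old > max_change:
            flagged.append(product)
    return flagged


print(scrape_looks_sane(950, [1000] * 7))
print(suspicious_price_changes({"a": 100, "b": 100}, {"a": 150, "b": 350}))
```

If either check fails, the idea is to skip syncing that day's data and look at the raw scrape by hand rather than publish bad prices.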
The flip side to this is that very often you find that the data populating the site is in a very simple JSON format to facilitate easy rendering, ironically making the scraping process a lot more reliable.
Lately I've been thinking to bite the bullet and Just Do It, but since it's working I'm a bit reluctant to touch it.
I am typically doing one-off scraping and for that, an undocumented but clean JSON api makes things so much easier, so I've grown to enjoy sites that are unnecessarily complex in their rendering.
How would one scrape those? Anyone experienced?
Likely easier than website scraping.
I then started randomising every aspect of the scraping process that I could: The order that I visited the links, the sleep duration between almost every action, etc.
As long as they don't implement a strict fingerprinting technique, that seems to be enough for now
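A small sketch of that kind of randomisation, covering the visit order and the sleep between actions. The 1 to 5 second range is an assumption, and `visit` is a placeholder for whatever actually loads the page.

```python
import random
import time


def randomised_visit(urls, visit, min_sleep=1.0, max_sleep=5.0,
                     rng=None, sleep=time.sleep):
    """Visit every URL once, in random order, with a random pause after each.

    visit: callable taking a URL (placeholder for the real page load).
    rng / sleep: injectable so the behaviour can be tested without waiting.
    """
    rng = rng or random.Random()
    order = list(urls)
    rng.shuffle(order)  # randomise the order the links are visited in
    for url in order:
        visit(url)
        sleep(rng.uniform(min_sleep, max_sleep))  # randomise the delay
    return order
```

Passing `rng` and `sleep` in as parameters is only there so the function can be exercised in a test without real delays.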
What's the thought process behind using a CI server - which I thought is mainly for builds - for what essentially is a data pipeline?
I want something that will run some code when something happens. In my case that "something" is a specific time of day. The code will spin up a server, connect it to tailscale, run the 3 scraping jobs and then tear down the server and parse the data. Then another pipeline runs that loads the data and refreshes the caches.
Of course I'm also using it for continuously deploying my app across 2 environments, or its monitoring stack, or running terraform etc.
Basically it runs everything for me so that I don't have to.
Something I learned recently, which might help your scrapers, is the ability in Playwright to sniff the network calls made through the browser (basically, programmatic API to the Network tab of the browser).
The boost is that you allow the website/webapp to make the API calls and then the scraper focuses on the data (rather than allowing the page to render DOM updates).
This approach falls apart if the page is doing server side rendering as there are no API calls to sniff.
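For anyone who hasn't tried it, here is a minimal sketch with Playwright's Python sync API, using `page.expect_response` to wait for the API call the page makes. The `api_fragment` argument (a substring of the API URL) and both function names are hypothetical; you'd find the real URL in the Network tab first.

```python
def is_target_response(url, api_fragment):
    """True if this response URL looks like the API call we care about."""
    return api_fragment in url


def fetch_api_payload(page_url, api_fragment):
    """Load the page and return the JSON body of the matching API response."""
    # Imported lazily so this module loads even where Playwright isn't installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Let the page make its own API call and wait for the matching response.
        with page.expect_response(
            lambda r: is_target_response(r.url, api_fragment)
        ) as info:
            page.goto(page_url)
        data = info.value.json()
        browser.close()
        return data
```

As noted above, this only helps when the page fetches its data client-side; with server-side rendering there is no response to wait for and `expect_response` will time out.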
For my project (https://frankendash.com/), I also ran into issues with dynamically generated class names which change on every site update, so in the end I just went with saving a crop area from the website as an image and showing that.
I did consider adding a 3 and 6 month button, but for some reason I decided against it, don't remember why. It wasn't performance because I'm heavily caching everything so it wouldn't have made a difference. Maybe aesthetics?
Someone showed me this a decade ago. The site had many obvious issues but it did list everything. If I remember correctly it was started to stop merchants pricing things by who is buying.
I forget which country it was.
Why did you pick Tailscale as the solution for proxy vs scraping with something like AWS Lambda?
Most have at least 20 TB of bandwidth included in the price, even the lowest $5/mo shared cpu machines. 20 TB is a gigantic amount unless you're serving videos or some such.
Some have unlimited bandwidth (I mean they are effectively limited by the speed of network connection but you don't pay for amount).
One difference: I'm recording each scraping session as a HAR file (for proving provenance). mitmproxy (mitmdump) is invaluable for that.
We built a system for admins so they can match products from Site A with products from Site B.
The scraping part was not that hard. We used our product https://automatio.co/ where possible, and where we couldn't, we built some scrapers from scratch using simple cURL or Puppeteer.
Thanks for sharing your experience, especially since I haven't used Playwright before.
You just reminded me, it's probably still running today :-D
Good lesson on cloud economics. Below a certain threshold, we get a linear performance gain from a more expensive instance type. The spend is essentially the same, but you save time by running the same workload on a more expensive machine for a shorter period.
Nice blog post and very informative. Good to read that it costs you less than 70€ per year to run this and hope that the big supermarkets don’t block this somehow.
Have you thought of monetizing this? Perhaps with ads from the 3 big supermarkets you scrape ;-)
Yep, AWS is hugely overrated and overpriced.
I wonder how the data from R2 is fed into the frontend?
Although might be hard to do with messy data.