I thought this was a pretty funny story and just wanted to rant a bit about it.
Recently I was deploying our application, Potarix, on an EC2 instance. I asked ChatGPT for the right EC2 server for our SaaS app, and it recommended spot instances, which are cheaper than normal (On-Demand) instances because they run on AWS’s unused EC2 capacity. I thought, you know, why not? I’d rather save a little extra cash, and there are probably periods of downtime in our app anyway.
We deployed, and everything was fine for a few days. Then I woke up on a Friday morning to a bunch of pings that the application wasn’t working. I was like, hmmm, strange. I gave it a try and confirmed our app was down. I then decided to SSH into the EC2 instance to figure out what was going wrong, only to find that the instance wasn’t even listed in the AWS console. I was panicking and thinking: did I even deploy this, did I give the instance a different name, or did I accidentally delete it? I looked at the EC2 history and found that Amazon had terminated the instance for me. A quick ChatGPT prompt revealed that spot instances can be terminated anytime Amazon reclaims the capacity.
Through this experience, I learned two things: 1. Don’t use spot instances for anything that’s deployed to production and needs to be up all the time. 2. Every time ChatGPT gives you a sketchy recommendation, ask for the downsides.
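If you do end up on spot instances anyway, AWS gives the instance a two-minute warning before reclaiming it, and that warning shows up in the instance metadata service. Here is a minimal sketch of checking for it from inside the instance, assuming IMDSv2 is enabled. The function names are made up for illustration; this is not something we ran in production.

```python
import json
import urllib.error
import urllib.request

# Link-local address of the EC2 instance metadata service (only reachable from inside the instance).
METADATA = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=2):
    """Return the raw spot interruption notice as text, or None if none is scheduled."""
    # IMDSv2 requires a short-lived session token before any metadata read.
    token_req = urllib.request.Request(
        f"{METADATA}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    action_req = urllib.request.Request(
        f"{METADATA}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        # A 404 here means the instance is not currently marked for interruption.
        if err.code == 404:
            return None
        raise


def parse_instance_action(raw):
    """Turn the notice JSON into an (action, time) tuple, or None when there is no notice."""
    if raw is None:
        return None
    notice = json.loads(raw)
    return notice["action"], notice["time"]
```

A small loop that calls `parse_instance_action(fetch_instance_action())` every few seconds could at least fire an alert or drain traffic before the box disappears. It doesn't change lesson 1, though: two minutes is not much time.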
We found that people were manually going through Google Maps to find small and medium-sized businesses (SMBs). They would type the kind of business they were looking for into the search, for example “restaurants”, and then manually call or email each result.
What we decided to do was gather the Google Maps data automatically and surface it to our customers so they could download it. The problem was that we would need a lot of data from Google Maps to pull it off. We would need to grab all the SMBs across the United States, which is a huge undertaking.
Initially, I tried no-code AI web scraping solutions, and they worked horribly. For some reason, I couldn’t even get them to scroll down the page. I also dug through their open-source code and discovered that they were taking the entire web page and passing it into GPT to extract data. That just ran up my OpenAI bill.
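To give a feel for why sending the whole page is wasteful, here is a rough sketch that strips scripts and styles with Python’s standard `html.parser` and compares approximate token counts. The 4-characters-per-token figure is only a common rule of thumb for English text, not an exact tokenizer, and the helper names are made up for illustration. This is not how those tools (or our scraper) actually work.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the page's visible text joined with single spaces."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def rough_tokens(text):
    """Very rough token estimate: about 4 characters per token (assumption)."""
    return len(text) // 4
```

Comparing `rough_tokens(page_html)` with `rough_tokens(visible_text(page_html))` on a real listing page shows how much of the bill goes to markup and JavaScript the model never needed to see.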
I then tried the semi-code approach, using something like Apify or the Google Places API to pull the businesses. This worked better, but price was still an issue at the scale we wanted.
Eventually, we ended up writing our own scraper for the task. The main problem came after writing the scraper, when we had to run many copies of it in parallel on a server (the scraping task was so large that a single process would have taken too long). Battling with different infrastructure cost us weeks.
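On a single machine, the basic shape of that fan-out looks something like the sketch below. It uses threads from Python’s `concurrent.futures` rather than separate processes, since scraping is mostly waiting on the network. `run_parallel` and `fake_scrape` are hypothetical stand-ins, not our actual code, and the hard part (spreading this across servers) is exactly what this leaves out.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_parallel(tasks, worker, max_workers=8):
    """Run worker(task) for every task concurrently; return (results, failures) dicts keyed by task."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its task so we know what succeeded or failed.
        futures = {pool.submit(worker, task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                results[task] = future.result()
            except Exception as exc:
                # One bad task should not kill the whole run; record it for a retry.
                failures[task] = exc
    return results, failures


def fake_scrape(task):
    """Hypothetical stand-in for a real scraper: takes a (category, city) pair."""
    category, city = task
    if city == "Nowhere":
        raise ValueError("no results")
    return f"{category} in {city}"
```

You would call it with a list of `(category, city)` pairs, for example `run_parallel([("restaurants", "Austin")], fake_scrape)`, and retry whatever lands in `failures`.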
This experience was so horrible that we ended up creating potarix.com. It’s a no-code scraper: type in the URL you want to scrape, briefly describe your scraping task, and our AI will generate a script. We’ve also made it super easy to control the infrastructure: you can specify how many processes you want running, provide login information, and bypass CAPTCHAs.
We also understand that AI is shit and doesn’t work a lot of the time, so depending on your task, we’re also creating a white-glove onboarding service where we work alongside the AI to complete a data extraction task for you.