I think one or both were acquired and immediately shut down, but I’m not 100% sure about that.
We are doing well and are independently owned.
I think there are 3 things that contribute to this:
1. It is very easy to make a prototype that looks "magical" but very hard to build something that works in real applications. There is an enormous number of quirks that a browser allows, and each site you encounter will use a different set of those quirks. Sites also tend to be unreliable, so whatever you build has to be very resistant to errors.
2. There is a technological wall that every company in this space reaches where it is not yet possible to mass-specialize for different websites. So even if you're able to build a tool that works very well on any individual website, the technology is not there yet to be able to generalize the instructions across websites in the same category. So if a customer wants to scrape 1000 websites, they still have to build custom instructions for each website (5-10x reduction in labor vs scripting) when what they really want, and what is economically viable for them, is to build a single set of instructions that will work for all similar websites (10000x reduction in labor vs scripting). This is something that we're working on for the next version of ParseHub, but it is still a couple of years away from launch.
3. Many of the YC startups you hear about have raised funding from investors and have short term pressures to exit.
The combination of the three makes it very tempting to give up and sell.
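To make point 1 a bit more concrete, here is a minimal sketch of the kind of error resistance a scraper needs, assuming a caller-supplied `fetch` function. The names `fetch_with_retries`, `fetch`, `attempts` and `base_delay` are made up for illustration and are not taken from any tool in this thread.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff when it raises.

    `fetch` is a hypothetical callable supplied by the caller. `sleep` is
    injectable so the waiting can be skipped or recorded in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:  # a real scraper would catch narrower errors
            last_error = error
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, then 4x, ... before the next try.
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

This only covers transient failures. Quirks like captchas, logins, or unexpected markup need site-specific handling, which is where most of the real work goes.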
Just curious, in your experimentation, have you found it necessary to train a new model for each "category"? Or have you found a way to generalize it?
Can't this be crowdsourced in some way? Having each individual entity reinvent the same wheel feels like the main problem to me. What if there was a marketplace? The ability to buy / trade / sell? Maybe subscription based in some way?
If I wanted to scrape 100 sites, it might be worth $1 per year per site. Those who put in the time make money. Those who don't have the time would pay.
This isn't a technology issue per se. It's scaling a solution to the final gap the technology can't cover. A different kind of mechanical turk?
I wonder what other kinds of products and services would be good for that model. In other words, ones that would tend to be acquired for good money in order to stop them.
1. Narrow target
Your market is people who need scraped data to input into some kind of app/program/code, but don't have the resources/skills/time to use scrapy or whatever.
2. Sensitive to configuration
This is also the problem with visual code and ML apps, but even a small issue with the source you are scraping from -- say, a captcha, a login, or some weird format or CSS you did not anticipate -- makes it almost useless, whereas if you were coding up a solution you could (usually, not always) deal with it more easily.
Those are the reasons they shut down.
The reasons why they launch:
1. Many developers have this need
Many developers have built scrapers internally and then used them, so a lot of people have worked on this problem.
What follows from this is that they can productize it, see that other people have the need, imagine the market etc.
The idea is to be able to choose a website, select the data you want, and make it available (as JSON, CSV or an API) with as little friction as possible.
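As a rough illustration of the "make it available as JSON or CSV" step, here is a minimal sketch that assumes the selected data has already been extracted into a list of dictionaries. The function names `to_json` and `to_csv` and the sample records are hypothetical and say nothing about how the tool itself is implemented.

```python
import csv
import io
import json


def to_json(records):
    """Serialize the scraped records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize the scraped records as CSV text with a header row."""
    # Use the union of all keys, in first-seen order, as the column names,
    # because scraped rows often have missing fields.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Example with made-up records:
records = [
    {"title": "A", "price": "1"},
    {"title": "B", "url": "http://x"},
]
print(to_csv(records))
```

Missing fields come out as empty cells rather than errors, which matters because real pages rarely have every field on every row.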
Kimono was the gold standard for a while, so I did yoink some of their ideas while doing some other things differently.
Still needs some work but as an MVP would appreciate any feedback. Cheers.
Any option for a firefox build?
No particular tricks to avoid detection. It's Puppeteer under the hood with a few customizations which works well on the majority of sites tested so far.
Given the cat-and-mouse game around web scraping you may never cover every website, and that's ok.
I have a similar idea that I'm working on. Your site is definitely bookmarked and I will try the extension later.
If they don't want their data to be scraped, it is up to them to secure it.
> You agree that you will not:
> Copy, modify or create derivative works of the Service or any Content;
> Copy, manipulate or aggregate any Content (including data) for the purpose of making it available to any third party; Trade, sell, rent, loan, lease or license any Content or access to the Service, whether commercially or free of charge;
> Use or introduce to the Service any data mining, crawling, "scraping", robot or similar automated or data gathering or extraction method, or manually access, acquire, monitor or copy any portion of the Service, or download or store Content (unless expressly authorized by CMC).
We built WrapAPI (https://wrapapi.com) back in the day, before we ended up starting Wanderlog (https://wanderlog.com), our current travel planning Y Combinator startup. This definitely is still an unsolved problem.
However, from a business point of view, we found that it was rather difficult to make a business out of an unspecialized scraping tool. The Kimono founders expressed a similar sentiment: ultimately, scraping is a solution looking for a problem.
Developers can often roll their own solution too, which limits your customer base and how much you can charge. Instead, vertical-specific tools that target particular industries seem to be the way to go (see Plaid as an example!)
Alternatively, you have to be good at Enterprise and B2B sales. This is a product for which you need to get the word out, get a champion, and do customer success, since it has a substantial learning curve. We were not good at that, which is why we chose to focus on other projects to start out.
Best of luck, and feel free to get in touch if you'd like to chat more
Really appreciate the insights.
You're right that much depends on mapping the solution to a particular problem. Are you selling yet another scraping tool or are you freeing data to drive better decisions / save time / yada yada.
With the right frame, a sensible price point, and as much complexity abstracted away as is possible, there may exist a business model - seems to be many opportunities hiding in plain sight.
Will reach out soon for sure. Best of luck with Wanderlog
FYI, on my mobile device (Brave on iOS), entering the date in the calendar was janky: I had to tap another text box to keep my date selection and make the calendar widget disappear so I could submit the form.
This just seems to add another dependency to whatever I'm developing. Plus, it sends data through a server I don't control. (I assume)
Also, what use could a website have that spits out essentially the same Python/JS script over and over?
The site/extension basically has to do that each time it scrapes locally (or use a generic parametrised scraper). If you wanted to use it as an API, my impression is that you either run it in Chrome as an extension you need to get from the Chrome store, or tunnel your data through a third-party server. Is that wrong?
Can you scrape data locally without running Chrome/the extension? I can't tell from reading the site, sorry. (If it's actually there, please link an anchor tag to it or something.)
And no - if you choose to also create a cloud recipe that runs on the server, the remote browser instance won't be able to access data behind a login.
It's possible but I'd rather not store third-party credentials for the time being.
Also, scraping a website to use/copy its data is illegal in my country (Belgium). I'm not sure this tool itself would be.
Please consider adding the ability to script clicks on elements, e.g. buttons.
I manage a site where we load a subset of articles on initial page load and then have a "Load more" button that executes Javascript to load another batch of articles. Getting a list of articles from our CMS is a bit of a hassle so being able to scrape it easily instead would be ideal.
If the site's publicly accessible and you're able to share, send the details to mike @ simplescraper.io and I'll get this working for you.
Anyway, send me an email at joel at browserless dot io if you ever want to chat.
For sure, once the app is a notch more tried and tested I'll get in touch. Appreciate it.
This will likely change as I have more stats and feedback on usage and expenses. But the goal is to offer a price point that's fair and low relative to other options.
Hmm, definite missed merch opportunity there.
- Click the pagination icon and then click the pagination element (usually 'Next' or an arrow). The icon will turn green
- Click 'view results' and then choose to save the recipe
- Select the number of pages you'd like to scrape
- Run your recipe and it will scrape those pages
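For anyone curious what those steps amount to, here is a minimal sketch of a page-limited pagination loop. It assumes a hypothetical `fetch_page` callable that returns the rows on a page plus the URL behind the 'Next' element; none of these names come from the extension itself.

```python
def scrape_pages(fetch_page, start_url, max_pages):
    """Collect rows from up to max_pages pages by following 'next' links.

    `fetch_page` is a hypothetical callable returning (rows, next_url),
    where next_url is None when there is no further page.
    """
    rows = []
    seen = set()
    url = start_url
    for _ in range(max_pages):
        # Stop early on the last page, or if pagination loops back on itself.
        if url is None or url in seen:
            break
        seen.add(url)
        page_rows, url = fetch_page(url)
        rows.extend(page_rows)
    return rows


# Example with a made-up three-page site held in a dictionary:
site = {
    "/p1": (["a", "b"], "/p2"),
    "/p2": (["c"], "/p3"),
    "/p3": (["d"], None),
}
print(scrape_pages(site.__getitem__, "/p1", 2))  # ['a', 'b', 'c']
```

The `max_pages` argument plays the role of "the number of pages you'd like to scrape", and the `seen` set guards against sites whose 'Next' link points back to a page already visited.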
"Turn website into an API", for me, evokes the image that I can automate (say) placing an order in Amazon as an API, or paying my bills automatically. It includes scraping, of course, but requires a lot more (mechanize/twill/selenium/phantom/etc power).
There was a company called Orsus that did exactly that. Last I heard about them it was the year 2000.
This seems hard to solve entirely programmatically, maybe having a way to be more specific by providing a selector yourself or selecting multiple entries and having the plugin figure it out could add a lot of utility in such cases.
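One way the "select multiple entries and let the plugin figure it out" idea could work is sketched below. It assumes each clicked element is described by a CSS path whose steps are joined with " > ", and it only handles the simple case where the examples differ by `:nth-child(...)` position. The function name `generalize` and the sample paths are hypothetical.

```python
import re

# Matches a positional suffix such as ":nth-child(3)".
NTH_CHILD = re.compile(r":nth-child\(\d+\)")


def generalize(paths):
    """Merge example CSS paths into one selector that matches all of them.

    Returns None when the examples are too different for this simple sketch.
    """
    split = [path.split(" > ") for path in paths]
    if not split or len({len(steps) for steps in split}) != 1:
        return None  # no examples, or examples at different depths
    merged = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            merged.append(steps[0])  # identical in every example: keep as is
            continue
        # The steps differ: accept that only if dropping the position
        # makes them identical.
        stripped = {NTH_CHILD.sub("", step) for step in steps}
        if len(stripped) != 1:
            return None
        merged.append(stripped.pop())
    return " > ".join(merged)


print(generalize([
    "ul.items > li:nth-child(1) > a.title",
    "ul.items > li:nth-child(3) > a.title",
]))  # ul.items > li > a.title
```

When the sketch gives up and returns None, falling back to a selector typed in by the user would cover the remaining cases.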