Show HN: Turn any website into an API (for those who miss Kimono)

(simplescraper.io)

332 points · welanes · 6y ago · 74 comments

save_ferris · 6y ago · 9 in thread

What is it about this service as a business model that prevents it from taking off? I’ve known at least two YC startups that tried to build businesses around this idea.

I think one or both were acquired and immediately shut down, but I’m not 100% sure about that.

tsergiu · 6y ago

I'm the founder of parsehub.

We are doing well and are independently owned.

I think there are 3 things that contribute to this:

1. It is very easy to make a prototype that looks "magical" but very hard to build something that works in real applications. There is an enormous number of quirks that a browser allows, and each site you encounter will use a different set of those quirks. Sites also tend to be unreliable, so whatever you build has to be very resistant to errors.

2. There is a technological wall that every company in this space reaches, where it is not yet possible to mass-specialize for different websites. So even if you're able to build a tool that works very well on any individual website, the technology is not there yet to generalize the instructions across websites in the same category. So if a customer wants to scrape 1000 websites, they still have to build custom instructions for each website (a 5-10x reduction in labor vs scripting), when what they really want, and what is economically viable for them, is to build a single set of instructions that will work for all similar websites (a 10000x reduction in labor vs scripting). This is something that we're working on for the next version of parsehub, but it is still a couple of years away from launch.

3. Many of the YC startups you hear about have raised funding from investors and have short-term pressure to exit.

The combination of the three makes it very tempting to give up and sell.

swalsh · 6y ago

#2 is what would transform this from a nice niche tool to something very valuable. In the ecommerce space, tracking competitor pricing is a great example of this type of thing. I can also see use cases for recipes, finance, healthcare, you name it. Those B2B use cases are worth real money.

Just curious, in your experimentation, have you found it necessary to train a new model for each "category"? Or have you found a way to generalize it?

chiefalchemist · 6y ago

> So if a customer wants to scrape 1000 websites, they still have to build custom instructions for each website...

Can't this be crowdsourced in some way? Having each individual entity reinvent the same wheel feels like the main problem to me. What if there was a marketplace? The ability to buy / trade / sell? Maybe subscription based in some way?

If I wanted to scrape 100 sites, it might be worth $1 per year per site. Those who put in the time make money. Those who don't have the time would pay.

This isn't a technology issue per se. It's scaling a solution to the final gap the technology can't cover. A different kind of Mechanical Turk?

tixocloud · 6y ago

Possibly a mix of use cases, maintainability and economics. We used to scrape economic indicators data at a fintech startup and monetized it - every slight change to the website created an issue for the data feeds. It was a huge nightmare to maintain. Scraping any website is quite generic and doesn't really speak to a specific audience with a specific need. But more importantly, having been in the data and analytics industry for years, I've seen that data has far lower margins than insights and recommendations. The market is willing to pay a crazy premium (look at how much all the consultants are being billed out for) to get insights and recommendations. Data itself isn't inherently valuable to most companies.

jlokier · 6y ago

Repeatedly being acquired to be immediately shut down sounds like quite a good business model, if your goal is to be paid.

I wonder what other kinds of products and services would be good for that model. In other words, ones that would tend to be acquired for good money in order to stop them.

pezo1919 · 6y ago

Acquired by who?

omarhaneef · 6y ago

I would guess:

1. Narrow target

Your market is people who need scraped data to input into some kind of app/program/code, but don't have the resources/skills/time to use scrapy or whatever.

2. Sensitive to configuration

This is also the problem with visual code and ML apps, but even a small issue with the source you are scraping from -- say, a CAPTCHA, or a login, or some weird format or CSS you did not anticipate -- makes it almost useless, whereas if you were coding up a solution you can (usually, not always) deal with it more easily.

Those are the reasons they shut down.

The reasons why they launch:

1. Many developers have this need

Many developers have built scrapers internally and then used them, so a lot of people have worked on this problem.

What follows from this is that they can productize it, see that other people have the need, imagine the market etc.

slowenough · 6y ago

I applied to YC with an idea like this and was rejected. 12 times. Maybe it's not the idea. Maybe it's me. Or maybe it's YC.

hk__2 · 6y ago

I don’t know anything about your case, but the general rule is that ideas are worthless; it’s the execution that matters.

welanes (OP) · 6y ago · 7 in thread

Hey HN, I posted this in a comment thread the other day and (to my surprise) it got a positive reception, so I added a few more updates and decided to post it properly.

The idea is to be able to choose a website, select the data you want, and make it available (as JSON, CSV or an API) with as little friction as possible.

Kimono was the gold standard for a while, so I did yoink some of their ideas, while doing some other things differently.

Still needs some work, but as an MVP I would appreciate any feedback. Cheers.

nannal · 6y ago

> would appreciate any feedback

Any option for a Firefox build?

welanes (OP) · 6y ago

Yes, working on it now.

lucasverra · 6y ago

also will try when on FF

bko · 6y ago

When I saw this service last week, I think you had a section about a paid service where you do the scraping on a server and send the results. Do you offer that? How do you get around anti-scraping technology, if it exists?

welanes (OP) · 6y ago

Yeah, that's offered, although it's currently free.

No particular tricks to avoid detection. It's Puppeteer under the hood with a few customizations, which works well on the majority of sites tested so far.

Given the cat-and-mouse game around web scraping you may never cover every website, and that's ok.

rapind · 6y ago

Unrelated question. There's a "Made by Lanes" badge. What was made by lanes.io though? The web page?

giarc · 6y ago

Why is one page scrape 2 credits? Why not just 1?

uberswe · 6y ago · 7 in thread

I like the idea, but I was skeptical as to how well it works and noticed that the video on the main page of your website, which scans coinmarketcap, seems to be wrong. It gets 200 cryptocurrency names but only 100 prices, which means only the first result is correct.
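
To illustrate how a mismatch like this can slip through, here is a minimal Python sketch. It assumes the name selector matched two elements per row; the sample values and the `pair_columns` helper are hypothetical and not taken from the extension. Pairing the columns naively misaligns every row after the first, while a simple length check surfaces the problem right away.

```python
# Hypothetical scraped columns: the name selector matched two cells per row
# (name and ticker), while the price selector matched one cell per row.
names = ["Bitcoin", "BTC", "Ethereum", "ETH"]
prices = ["9000", "180"]

# Naive pairing silently truncates and misaligns: only the first row is right.
naive_rows = list(zip(names, prices))


def pair_columns(names, prices):
    """Pair two scraped columns, refusing to guess when their lengths differ."""
    if len(names) != len(prices):
        raise ValueError(
            f"column length mismatch: {len(names)} names vs {len(prices)} prices"
        )
    return list(zip(names, prices))
```

Calling `pair_columns(names, prices)` on the data above raises a `ValueError` instead of returning misaligned rows.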

I have a similar idea that I'm working on. Your site is definitely bookmarked and I will try the extension later.

welanes (OP) · 6y ago

Good catch, uberswe. Was an older video and I flubbed the selection process - here it is working correctly: https://www.kapwing.com/videos/5dbc3e33ee4d0f00136d01e6

m00dy · 6y ago

Hi, is it the Chrome extension that does the whole work, or is there a separate background task on your side that actually runs those recipes?

uberswe · 6y ago

Nice, that looks much better!

treve · 6y ago

Also interesting that this main example is also a violation of coinmarketcap's terms. They have a paid API.

chirau · 6y ago

If I use my pen and notebook to write down all those values, am I also in violation of those terms?

If they don't want their data to be scraped, it is up to them to secure it.

sh87 · 6y ago

I think so too. From their terms [1]

> You agree that you will not:

> Copy, modify or create derivative works of the Service or any Content;

> Copy, manipulate or aggregate any Content (including data) for the purpose of making it available to any third party;

> Trade, sell, rent, loan, lease or license any Content or access to the Service, whether commercially or free of charge;

> Use or introduce to the Service any data mining, crawling, "scraping", robot or similar automated or data gathering or extraction method, or manually access, acquire, monitor or copy any portion of the Service, or download or store Content (unless expressly authorized by CMC).

[1]: https://coinmarketcap.com/terms/

1996 · 6y ago

Then use data from a free API without any TOS, which has more data, such as separate bid and ask:

http://cmplot.com/api.json

phsource · 6y ago · 5 in thread

This is very cool! I love how you brought back the original Kimono UI with the checkmark and Xs for adding and removing data tags.

We built WrapAPI (https://wrapapi.com) back in the day, before we ended up starting Wanderlog (https://wanderlog.com), our current travel planning Y Combinator startup. This definitely is still an unsolved problem.

However, from a business point of view, we found that it was rather difficult to make a business out of an unspecialized scraping tool. The Kimono founders expressed a similar sentiment: ultimately, scraping is a solution looking for a problem.

Developers can often roll their own solution too, which limits your customer base and how much you can charge. Instead, vertical-specific tools that target particular industries seem to be the way to go (see Plaid as an example!)

Alternatively, you have to be good at Enterprise and B2B sales. This is a product for which you need to get the word out, find a champion, and do customer success, since it has a substantial learning curve. We were not good at that, which is why we chose to focus on other projects to start out.

Best of luck, and feel free to get in touch if you'd like to chat more

welanes (OP) · 6y ago

Thanks! Yeah the checkmark confirmation just feels effortless. Haven't got it perfected yet, but soon.

Really appreciate the insights.

You're right that much depends on mapping the solution to a particular problem. Are you selling yet another scraping tool, or are you freeing data to drive better decisions / save time / yada yada?

With the right frame, a sensible price point, and as much complexity abstracted away as possible, there may exist a business model - there seem to be many opportunities hiding in plain sight.

Will reach out soon for sure. Best of luck with Wanderlog

bravura · 6y ago

I tried your site and am curious why, for Ko Pha Ngan, there is only one recommended resource. Shouldn’t there be more?

On my mobile device, using Brave on iOS, entering the date in the calendar was janky, FYI, and I had to click another text box to keep my date selection and make the calendar widget disappear so I could submit the form.

xmly · 6y ago

Very insightful comments!

MetalGuru · 6y ago

Curious, what comparison are you making with Plaid here?

phsource · 6y ago

Plaid, Yodlee, and others abstract away extracting data from various banks and financial services providers, so they're providing a solution built on top of the same data extraction techniques that this tool uses.

ainiriand · 6y ago · 2 in thread

Hi, is it possible to make it compatible with Firefox?

welanes (OP) · 6y ago

Sure, in fact I'll do it this weekend.

seniorThrowaway · 6y ago

I'm also interested in this. I no longer use Chrome due to its pervasive surveillance and telemetry.

flingo · 6y ago · 2 in thread

Is there a reason this doesn't spit out some Python or JavaScript code to scrape the same info out?

This just seems to add another dependency to whatever I'm developing. Plus, it sends data through a server I don't control. (I assume)

petr-nagy · 6y ago

Did you read the website? It says "Scrape locally or create recipes that run quickly in the cloud."

Also, what use could a website spitting out essentially the same Python/JS script over and over have?

flingo · 6y ago

I must have skimmed past that. Whoops. I avoided trying it out because it's not available on Firefox, so I couldn't correct my assumption by testing it. Also, I couldn't easily find a copy of the extension source and gave up.

The site/extension basically has to do that each time it scrapes locally (or use a generic parametrised scraper). If you wanted to use it in an API, my impression is that you can either run it in Chrome as an extension you need to get from the Chrome store, or tunnel your data through a third-party server. Is that wrong?

Can you scrape data locally without running Chrome/the extension? I can't tell from reading the site, sorry. (If it's actually there, please link an anchor tag to it or something.)

holeyness · 6y ago · 2 in thread

Does this work with authenticated pages?

welanes (OP) · 6y ago

Yes - you're able to save data behind a login using the point and click functionality as it extracts whatever data is loaded in your browser ("local scraping").

And no - if you choose to also create a cloud recipe that runs on the server, the remote browser instance won't be able to access data behind a login.

It's possible but I'd rather not store third-party credentials for the time being.

darkstar999 · 6y ago

It doesn't look like it. I got an error trying to scrape my HN upvotes URL.

nopcode · 6y ago · 1 in thread

I believe this could be a good solution to turn legacy software into an API. The “generated code” should be a reverse proxy, not a scraping lib.

Also, scraping a website to use/copy its data is illegal in my country (Belgium). I’m not sure this tool itself would be.

ilrwbwrkhv · 6y ago

Nothing can stop it. Lots of Belgian sites are scraped every day across the world.

maroonblazer · 6y ago · 1 in thread

I like this.

Please consider adding the ability to script clicks on elements, e.g. buttons.

I manage a site where we load a subset of articles on initial page load and then have a "Load more" button that executes JavaScript to load another batch of articles. Getting a list of articles from our CMS is a bit of a hassle, so being able to scrape it easily instead would be ideal.

welanes (OP) · 6y ago

Hey, right now you can select a Pagination element that the app will use to load the next page / new data.

If the site's publicly accessible and you're able to share, send the details to mike @ simplescraper.io and I'll get this working for you.

mrskitch · 6y ago · 1 in thread

This is super cool. I really enjoyed and missed the Kimono workflow. Automating something like this with browserless.io would be really fun (I run that project). Extensions are one of the things we’re looking to support.

Anyway, send me an email at joel at browserless dot io if you ever want to chat.

welanes (OP) · 6y ago

Cheers Joel. I have most of your blog posts on Puppeteer bookmarked - super helpful and well written.

For sure, once the app is a notch more tried and tested I'll get in touch. Appreciate it.

joelvalleroy · 6y ago · 1 in thread

Awesome! One question I have after reading the page is - what are the pricing plans concerning credits? (for automated scraping)

welanes (OP) · 6y ago

Right now it's free and will be until it's stable. Starting price will be about $25 for 4000 scraping credits, 200k API calls and data storage.

This will likely change as I have more stats and feedback on usage and expenses. But the goal is to offer a price point that's fair and low relative to other options.
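
As a rough back-of-the-envelope check, here is what those numbers work out to per page. This assumes the 2-credits-per-page rate mentioned elsewhere in the thread still applies once pricing starts, which is only an assumption since the pricing is expected to change.

```python
# Figures quoted in the thread; the per-page credit rate is an assumption
# carried over from another comment and may not match the final pricing.
PRICE_USD = 25
CREDITS = 4000
CREDITS_PER_PAGE = 2

# Number of page scrapes the starting plan would cover.
pages = CREDITS // CREDITS_PER_PAGE

# Effective cost of a single page scrape in US dollars.
cost_per_page = PRICE_USD / pages

print(f"{pages} pages at ${cost_per_page:.4f} per page")
```

Under those assumptions the plan covers 2000 page scrapes, or a little over one cent per page.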

ntaylor · 6y ago · 1 in thread

Kimono was cool, nice to see another option. I still have a Kimono t-shirt in a drawer somewhere.

kitd · 6y ago

> Kimono t-shirt

Hmm, definite missed merch opportunity there.

matz1 · 6y ago · 1 in thread

How do I use the 'pagination' feature? The help guide doesn't even mention it.

welanes (OP) · 6y ago

Hey, yes the guide still needs work. Here's what you gotta do:

- Click the pagination icon and then click the pagination element (usually 'Next' or an arrow). The icon will turn green

- Click 'view results' and then choose to save the recipe

- Select the number of pages you'd like to scrape

- Run your recipe and it will scrape those pages
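
For anyone curious what a recipe like this amounts to, the steps above can be sketched roughly as a loop that follows the "next" element up to a page limit. This is only an illustration in Python, not how the app is implemented: the `scrape_pages` function, the `fetch` callback and the in-memory `SITE` data are all hypothetical stand-ins for a real browser session.

```python
# Hypothetical stand-in for a paginated site: each page has some items and
# a link to the next page (None on the last page).
SITE = {
    "/p1": {"items": ["a", "b"], "next": "/p2"},
    "/p2": {"items": ["c"], "next": "/p3"},
    "/p3": {"items": ["d"], "next": None},
}


def scrape_pages(fetch, start_url, max_pages):
    """Collect items page by page, following 'next' until the limit is hit."""
    results = []
    url = start_url
    for _ in range(max_pages):
        if url is None:
            # No pagination element left to follow: stop early.
            break
        page = fetch(url)
        results.extend(page["items"])
        url = page.get("next")
    return results


# Scrape at most two pages, starting from the first one.
first_two = scrape_pages(SITE.__getitem__, "/p1", max_pages=2)
```

Passing a larger `max_pages` than the site has pages is harmless, because the loop stops as soon as there is no next page.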

earth2mars · 6y ago · 1 in thread

If you could add an RSS feed response, that would be great.

SweeToxin · 6y ago

If you need data from a website that updates on a regular basis there’s a recent Show HN I’ve seen that does exactly this https://news.ycombinator.com/item?id=21398524

beagle3 · 6y ago

I don't feel it is right to describe it as "turns a website into an API", rather "gives scraped data through an API".

"Turn website into an API", for me, evokes the image that I can automate (say) placing an order in Amazon as an API, or paying my bills automatically. It includes scraping, of course, but requires a lot more (mechanize/twill/selenium/phantom/etc power).

There was a company called Orsus that did exactly that. Last I heard about them it was the year 2000.

mikikian · 6y ago

Maybe a better business model is to offer this as a service to site owners who are not tech savvy. Site owners then have the ability to offer an API (free or paid) to new customers, and API consumers can rely on getting data in the future, making it a win / win.

MildlySerious · 6y ago

I just gave this a shot on the ISO website to get a list of country codes[1], but it seems the selection algorithm breaks down when there are no specific classes applied to elements. Every td.v-grid-cell is selected, which is all of them, instead of just the values of the alpha2 column, for example.

This seems hard to solve entirely programmatically. Maybe having a way to be more specific, by providing a selector yourself or by selecting multiple entries and having the plugin figure it out, could add a lot of utility in such cases.

[1] - https://www.iso.org/obp/ui/#search/code/
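
One way to be more specific when every cell shares the same class is to select by column position rather than by class, roughly what a CSS selector like `td:nth-child(2)` would do. Here is a small standard-library Python sketch of that idea. The table markup and the `ColumnExtractor` class are hypothetical; the real ISO page structure may well differ.

```python
from html.parser import HTMLParser

# Hypothetical markup: every cell has the same class, so class alone
# cannot tell the columns apart.
HTML = """
<table>
<tr><td class="v-grid-cell">Belgium</td><td class="v-grid-cell">BE</td><td class="v-grid-cell">BEL</td></tr>
<tr><td class="v-grid-cell">Thailand</td><td class="v-grid-cell">TH</td><td class="v-grid-cell">THA</td></tr>
</table>
"""


class ColumnExtractor(HTMLParser):
    """Collect the text of the cell at a given zero-based index in each row."""

    def __init__(self, column):
        super().__init__()
        self.column = column
        self.cell_index = -1
        self.in_target = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # New row: restart the cell counter.
            self.cell_index = -1
        elif tag == "td":
            self.cell_index += 1
            self.in_target = self.cell_index == self.column
            if self.in_target:
                self.values.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.values[-1] += data.strip()


def extract_column(html, column):
    parser = ColumnExtractor(column)
    parser.feed(html)
    return parser.values


alpha2 = extract_column(HTML, 1)
```

Selecting by position is brittle if the site reorders its columns, so a combination of user-provided selectors and multiple example selections, as suggested above, would likely be more robust.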

monkeydust · 6y ago

Looks good, could this be integrated into n8n.io to be used to drive a workflow?

cfan01 · 6y ago

Firefox add-on, please.

nightnight · 6y ago

OT: or just use Puppeteer. It's not really hard, it's free, and you can rule the world.
