This was our launch: https://news.ycombinator.com/item?id=36032081
We recently decided to revive and rebrand the project after seeing a sudden spike in interest from people who wanted to connect LLMs to data - but specifically through browsers. It's also a problem we've experienced firsthand, having built scraping features into Psychic and previously working on bot detection at Robinhood.
If you haven’t built a web scraper or browser automation before, you might assume it’s very straightforward. People have been building scrapers for as long as the internet has existed, so there must be many tools for the job.
The truth is that web scraping strategies need to constantly adapt as web standards change, and as companies that don’t want to be scraped adopt new technologies to try to block it. The old standards never completely go away, so the longer the internet exists, the more edge cases you’ll need to account for. This adds up to a LOT of infrastructure that needs to be set up and a lot of schlep developers have to go through to get up and running.
Scraping is no easier today than it was 10 years ago - the problems are just different.
Finic is an open source platform for building and deploying browser agents. Browser agents are bots deployed to the cloud that mimic the behaviour of humans, like web scrapers or remote process automation (RPA) jobs. Simple examples include scripts that scrape static websites like the SEC's EDGAR database. More complex use cases include integrating with legacy applications that don’t have public APIs, where the best way to automate data entry is to just manipulate HTML selectors (EHRs for example).
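For the simple static-site case, a few lines of Playwright are often enough. A hedged sketch (the URL and selector below are illustrative, not a tested EDGAR recipe):

```python
def extract_rows(cell_texts):
    # Pure post-processing: drop empty cells, strip whitespace. Kept separate
    # from the browser code so it can be tested without launching anything.
    return [t.strip() for t in cell_texts if t.strip()]

def scrape(url):
    # Lazy import so the helper above works even without playwright installed.
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        texts = page.locator("table tr td").all_inner_texts()
        browser.close()
    return extract_rows(texts)

# Usage (needs `pip install playwright && playwright install chromium`):
# rows = scrape("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany")
print(extract_rows([" 10-K ", "", " 2023-02-01 "]))  # ['10-K', '2023-02-01']
```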
Our goal is to make Finic the easiest way to deploy a Playwright-based browser automation. With this launch, you can already do so in just 4 steps. Check out our docs for more info: https://docs.finic.io/quickstart
This actually creates an evergreen problem that companies need to overcome, and our paid version will probably involve helping companies overcome these barriers.
Also, I should clarify that we're explicitly not trying to build a Playwright abstraction - we're trying to remain as unopinionated as possible about how developers code the bot, and just help with the network-level infrastructure they'll need to make it reliable and make it scale.
It's good feedback for us, we'll make that point more clear!
While this might be true in theory, it doesn't stop them from trying! And believe me, it's getting to the point where the WAF settings on some websites are even annoying the majority of real users! Some of the issues I am hinting at, however, are fundamental issues you run into when automating the web using any mainstream browser that hasn't had some source code patches. I'm curious to see if a solution to that will be part of your service if you decide to tackle it.
I wish your company the kind of success it deserves.
If I remember correctly, Skyvern also has an implementation of scaling these browser tasks built in.
ps. Is it not called Robotic Process Automation? First time I'm hearing it as Remote process Automation.
Cloning into 'tarsier'...
remote: Enumerating objects: 15238, done.
remote: Counting objects: 100% (1613/1613), done.
remote: Compressing objects: 100% (929/929), done.
Receiving objects: 100% (15238/15238), 3.01 GiB | 14.82 MiB/s, done.

Based on the feedback in this thread we're going to be releasing an updated version that focuses more around tooling for the browser agents themselves as opposed to scaling/scheduling, so stay tuned for that!
If you don't already have this feature for your system, I would recommend it.
What does this check look like for you? Do you just diff the html to see if there are any changes?
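One simple version of the check being asked about: hash a normalized copy of the page and compare it to the last run's hash. A sketch (the normalization rules are whatever matters for your site - this one only collapses whitespace):

```python
import hashlib
import re

def fingerprint(html):
    # Collapse whitespace so cosmetic reflows don't trigger false alarms;
    # real normalization might also strip timestamps, CSRF tokens, etc.
    normalized = re.sub(r"\s+", " ", html).strip()
    return hashlib.sha256(normalized.encode()).hexdigest()

def has_changed(html, last_hash):
    return fingerprint(html) != last_hash

h = fingerprint("<p>hello   world</p>")
print(has_changed("<p>hello world</p>", h))  # False: whitespace-only change
print(has_changed("<p>goodbye</p>", h))      # True: real change
```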
Thanks for the feedback! I just updated the repo to make it more clear that it's Playwright based. Once my cofounder wakes up I'll see if he can re-record the video as well.
I have never, ever understood anyone who goes to the trouble of booting up a browser, and then uses a python library to do static HTML parsing
Anyway, I was surfing around the repo trying to find what, exactly "Safely store and access credentials using Finic’s built-in secret manager" means
I am open to the fact that html5lib strives to parse correctly, and good for them, but that would be the case where one wished to use python for parsing to avoid the pitfalls of dragging a native binary around with you
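For a truly static page the browser really can be skipped entirely. A sketch of the idea with the stdlib parser (html5lib would be the more spec-compliant choice the comment mentions, at the cost of an extra dependency):

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    # Minimal static-HTML extraction with no browser involved. For messy
    # real-world markup, html5lib's browser-grade error recovery is safer.
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles.append(data)

p = TitleCollector()
p.feed("<h2>First</h2><p>body</p><h2>Second</h2>")
print(p.titles)  # ['First', 'Second']
```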
- connect to it remotely
- ghost cursor and friends
- save cookies and friends to data dir
- run from residential ip
- if served a captcha or Cloudflare challenge, direct to a solver and then route back
- mobile ip if possible
…can’t go into any more specifics than that
…I forget the site right now, but there’s a guy that gives a good rundown of this stuff. I’ll see if I can find it.
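A couple of items on the checklist above map directly onto Playwright launch options - a hedged sketch of just the profile-persistence ("save cookies to data dir") and proxy pieces; ghost cursors and captcha routing are their own projects, and none of this is a guaranteed-undetectable recipe:

```python
def launch_kwargs(user_data_dir, proxy_server=None):
    # Arguments for Playwright's launch_persistent_context. The persistent
    # profile keeps cookies/local storage between runs; the proxy setting is
    # where a residential or mobile exit would be plugged in. Values here
    # are illustrative.
    kwargs = {
        "user_data_dir": user_data_dir,
        "headless": False,  # headful passes some naive headless checks
        "viewport": {"width": 1366, "height": 768},
    }
    if proxy_server:
        kwargs["proxy"] = {"server": proxy_server}
    return kwargs

# Usage: playwright.chromium.launch_persistent_context(**launch_kwargs(...))
print(sorted(launch_kwargs("/tmp/profile", "http://user:pass@proxy:8080")))
```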
It seems that some sites can determine when a headless or web-driver-enabled profile is being used.
Sometimes I'm through a VPN.
The automation is the easy part.
One thing I’ve also been doing recently, when I find a site that I just want an API for, is to execute a curl via Python. I populate the curl from Chrome’s network tab. I also have a purpose-built extension in my browser that saves cookies to a LAN Postgres DB, and the script then uses those values.
Can even probably do more by automating the browser to navigate there on failure.
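The cookie-replay half of that can be sketched like this, assuming the extension writes (name, value, domain) rows to a table - the schema is hypothetical, and the actual request is the curl copied from the network tab:

```python
import requests

def session_from_rows(rows):
    # rows: (name, value, domain) tuples as a browser extension might have
    # written them to a local Postgres table (schema here is hypothetical).
    s = requests.Session()
    for name, value, domain in rows:
        s.cookies.set(name, value, domain=domain)
    return s

rows = [("sessionid", "abc123", "example.com")]
s = session_from_rows(rows)
print(s.cookies.get("sessionid"))  # abc123
# s.get("https://example.com/...")  # replay the curl from the network tab
```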
This is not always possible, but if the product in question has a mobile app or a wearable talking to a server, you might be able to utilize the same API it's using:
- intercept requests from the device
- find relevant auth headers/cookies/params
- use that auth to access the API
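The steps above can be sketched as a session that replays the captured auth. The header names and values are assumptions - use whatever interception (e.g. a proxy like mitmproxy) shows the app actually sending:

```python
import requests

def client_with_captured_auth(token):
    # "Authorization: Bearer" is just a common case; some apps use custom
    # headers, signed params, or cookies instead. Matching the app's
    # User-Agent can also matter.
    s = requests.Session()
    s.headers.update({
        "Authorization": "Bearer " + token,
        "User-Agent": "ExampleApp/1.2.3 (iPhone)",  # hypothetical app UA
    })
    return s

s = client_with_captured_auth("captured-token")
print(s.headers["Authorization"])  # Bearer captured-token
# s.get("https://api.example.com/...")  # same endpoint the app talks to
```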
I can see a few years from now almost all web traffic is agents.
I don't think the dead internet theory is true today, but I think it will be true soon. IMO that's actually a good thing, more agents representing us online = more time spent in the real world.
Your sign up flow might be broken. I tried creating an account (with my own email), received the confirmation email, but couldn't get my account to be verified. I get "Email not confirmed" when I try to log in.
Also, the verification email was sent from accounts@godealwise.com, which is a bit confusing.
If you're trying to build an agent for a long-running job like that, you run into different problems:
- Failures are magnified, since such a workflow has multiple upstream dependencies and most scraping jobs don't.
- You have to account for different auth schemes (OAuth, password, magic link, etc.)
- You have to implement token refresh logic for when sessions expire, unless you want to manually log in several times per day
We don't have most of these features yet, but it's where we plan to focus.
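The token-refresh point above can be sketched as a wrapper that retries once after re-authenticating - the call/refresh/is_expired callables are placeholders for whatever auth scheme applies:

```python
def with_refresh(call, refresh, is_expired):
    # call(): performs the request; is_expired(resp): detects a dead session
    # (e.g. a 401); refresh(): re-authenticates. Retry exactly once so a
    # genuinely failing refresh doesn't loop forever.
    resp = call()
    if is_expired(resp):
        refresh()
        resp = call()
    return resp

# Toy demo with a session that starts out expired:
state = {"token": "stale"}

def call():
    return "ok" if state["token"] == "fresh" else "401"

def refresh():
    state["token"] = "fresh"

print(with_refresh(call, refresh, lambda r: r == "401"))  # ok
```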
And finally, we've licensed Finic under Apache 2.0 whereas Browserless is only available under a commercial license.
I think this needs more elaboration on what the Finic wrapper is adding to stock Playwright that can't just be achieved through more effective use of stock Playwright.
Also, curious why your unstructured idea did not pan out?
Our approach is a bit different. With Finic you just write the script; we handle the entire job deployment and scaling on our end.
1. Developer tooling should be open source by default
2. Open source doesn't meaningfully affect revenue/scaling because developers that would use your self-hosted version would build in-house anyway.
So, out of respect for the open source way on top of which you are building all this, please stop advertising it as "open source infrastructure" in bold, and sell it like a normal software product with "source available" in the footer.
If you do plan to go open source and actually follow its ethos, remove the funded-by-VC label and put self-hosting front and center in the docs, with the hosted bit somewhere in the footer.
Also, if my mental model is correct, the more browsing and mouse-movement telemetry those Cloudflare/Akamai/etc. gizmos encounter, the more likely they are to think the browser is real, whereas encountering a "fresh" one is almost certainly a red alert. Not a panacea, for sure, but I'd guess every little bit helps.