An example of this is California drought data: automatically grabbing data on the drought is incredibly difficult because it involves scraping HTML tables. I tried to build an API that presents drought data so volunteers would have an easier time building out data visualizations. I ended up just getting exhausted doing all the scraping work.
I then moved on to a new project: building a free-to-use Padmapper for affordable housing. The data for income-restricted apartment units is managed by a government-contracted vendor. A city or county will declare income stabilization policies and legally enforce them against landowners, and the landowners then send their list of units over to the vendor.
This would be great, except the vendor does the bare minimum. Padmapper looks amazing, but really it's only useful for the upper middle class, given the explosive housing costs in the Bay Area. So, in order to provide a more modern website and mobile application for the community, I started to scrape the vendor's website. It was terrible. I kept getting throttled. So I gave up.
We have a new WrapAPI API Builder that looks like a browser, and is as easy to use as one too. You can define your API's inputs with a quick tap on the address bar, and point and click at the data you want to extract.
We also have a Chrome extension that's smarter and better integrated than ever. It records your requests and automatically creates parameter inputs for the values that change between requests to the same endpoints. The contents of your captures are immediately ready for you to start defining outputs and the data to extract, too.
Let me know if you have any questions or feedback!
This is a big thing on many sites now.
Also, since that is the case, you could build this in a few hours using something like https://github.com/bda-research/node-crawler. Yes, it would have no GUI, so you lose that.
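For what it's worth, the no-GUI version really is a small amount of code. Here's a rough, self-contained sketch of just the extraction step (node-crawler's real callback hands you the fetched page body plus a cheerio handle; I'm faking the fetch with a static HTML string and a regex so the idea stands on its own):

```javascript
// Stand-in for the parsing you'd do inside a node-crawler callback.
// The HTML below is a fake sample page; real code would receive the
// fetched body (and a cheerio $ handle) from the crawler.
const html =
  '<table>' +
  '<tr><td>Moderate</td><td>41%</td></tr>' +
  '<tr><td>Severe</td><td>12%</td></tr>' +
  '</table>';

// Pull each <td>'s text out of every <tr>.
function extractRows(html) {
  return [...html.matchAll(/<tr>(.*?)<\/tr>/g)].map((row) =>
    [...row[1].matchAll(/<td>(.*?)<\/td>/g)].map((cell) => cell[1])
  );
}

const rows = extractRows(html);
// rows is [['Moderate', '41%'], ['Severe', '12%']]
```

With node-crawler you'd swap the static string for the crawler's queue/callback loop, but the scraping logic itself is about this size.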
Just reading about Kantu now. It reminds me of http://www.sikuli.org/
>Do you find it better than Phantom?
It depends. Once you have a working script, web scraping with Phantom is much faster and more resource-efficient. But since Kantu works visually, you don't have to touch any page source code. That makes it much easier and faster to create the automation in the first place, especially for complex sites with date controls, drag & drop, and other JavaScript-heavy widgets.
Is this happening on your site? If not, I'd appreciate some tips about coding it and how to handle exception cases where the wizard can't keep in sync or the user clicks on unintended page elements.
The most helpful part is that you can pass a callback which will trigger before/during/after each step, which can let you ensure that the state of the page matches what you're expecting. In our case, we use it to make sure that you're switched to the right tab, etc. Take a look! I highly recommend it.
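For anyone curious, here's a hypothetical sketch of that pattern (the names are illustrative, not the tour library's actual API): each step carries lifecycle callbacks that run around it, which is where the state checks like "make sure the right tab is open" go.

```javascript
// Illustrative step-runner with before/after hooks per step. A real
// tour library renders tooltips; here we just record the order so the
// callback sequencing is visible.
function runTour(steps) {
  const log = [];
  for (const step of steps) {
    if (step.before) step.before(log); // e.g. switch to the correct tab
    log.push(`show:${step.id}`);       // the library displays the step
    if (step.after) step.after(log);   // e.g. verify the page advanced
  }
  return log;
}

const log = runTour([
  { id: 'address-bar', before: (l) => l.push('check:tab') },
  { id: 'extract-data', after: (l) => l.push('verify:state') },
]);
// log is ['check:tab', 'show:address-bar', 'show:extract-data', 'verify:state']
```

The useful bit is that a `before` hook can bail out or repair state before the step renders, which is how you avoid the wizard drifting out of sync with the page.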
I often see one or more commenters write what seems like an excessively positive thought dump on Show HNs. It just doesn't seem like the natural conversational tone everyone uses, but I can't quite put my finger on it.
Has anyone else noticed it? Is there a term for this sort of writing style?
That endpoint will then emit a state token, which includes the session cookies. You can feed that state token into your next request and it'll authenticate you.
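A sketch of what that might look like on the wire (the token format here is a made-up illustration, not WrapAPI's actual one): the login response's Set-Cookie headers get bundled into the token, and the follow-up request replays them as a Cookie header.

```javascript
// Hypothetical state token: the session cookies captured from a login
// response, bundled so a later request can replay them.
function makeStateToken(setCookieHeaders) {
  // keep only the "name=value" part of each Set-Cookie header
  return { cookies: setCookieHeaders.map((h) => h.split(';')[0]) };
}

// Merge the token's cookies into the headers of the next request.
function applyStateToken(token, headers = {}) {
  return { ...headers, Cookie: token.cookies.join('; ') };
}

const token = makeStateToken(['sid=abc123; Path=/; HttpOnly', 'csrf=xyz']);
const headers = applyStateToken(token, { Accept: 'text/html' });
// headers.Cookie is 'sid=abc123; csrf=xyz'
```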
I wanted to give you a heads-up that the YouTube video at the end of your joyride tutorial is broken.
It tries to play this: https://www.youtube.com/watch?v=10yKzP3gtkc
Why?
When I was last working inside an organization and reviewing vendors for a product, it really left a bad taste in my mouth when they had "Ask for Pricing." I get it, my consulting work is basically Ask for Pricing, I understand the business strategy. But it's such a headache to sit through bullshit product demos for multiple vendors over a few weeks just to hear that their pricing structure is way out of line.
There is this idea that a lot of companies have, where they're more "professional" or conversion-optimized by removing public pricing and putting everyone through a sales funnel. But that concept only works if 1) you have a great product and 2) you have a great sales team, capable of making my time to failure in the conversion process fast and painless. Every company thinks they have this, but they almost never do. I really don't think you want to optimize your business for keeping stamp enthusiasts happy.
In the back of their heads, some people imagine the service is going to be huge, and then they worry that all the profits will be paid out to WrapAPI.
Better to have a high headline number and then offer discounts for certain uses (non-profit, open source, students, etc). People are optimistic about how much money they might make so a high headline future price for when you graduate from the free tier is not necessarily bad.
WrapAPI seems to tackle the same task (web scraping) from a very different angle. I wonder if anyone has used both and can compare.
Let's say you have a web-based inventory management system or CRM that requires a login, but you want to take data a customer has sent you in a spreadsheet and automatically batch enter it into the CRM, which doesn't have that functionality. You could then:
1. Create an API endpoint that allows you to log into that system and return a state token
2. Create a second API endpoint that parameterizes the inputs of the form to create a new inventory entry
3. Chain those 2 API endpoints together so that the 2 actions are actually combined into one API call
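The three steps above can be sketched as plain functions (everything here is a hypothetical stand-in for the real HTTP calls each endpoint would make):

```javascript
// Step 1: the login endpoint returns a state token with the session.
// (A real implementation would POST the login form and capture cookies.)
function loginEndpoint(user, pass) {
  return { stateToken: { session: `sess-${user}` } };
}

// Step 2: the parameterized form-submission endpoint, authenticated
// by the state token from step 1.
function createEntryEndpoint(stateToken, entry) {
  return { ok: true, session: stateToken.session, entry };
}

// Step 3: chain the two, so one call logs in once and batch-enters
// every row from the customer's spreadsheet.
function batchCreate(user, pass, entries) {
  const { stateToken } = loginEndpoint(user, pass);
  return entries.map((entry) => createEntryEndpoint(stateToken, entry));
}

const results = batchCreate('alice', 'pw', [{ sku: 'A1' }, { sku: 'B2' }]);
// two results, both reusing the same session from the single login
```

The point of the chaining is that the login happens once and its state flows into every subsequent submission, which is what makes the batch entry a single API call from the caller's perspective.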
Our focus is not only on getting data, but on automating the many things that you or your company does with websites, to save time.
I've used similar services like parsehub.com in the past and if they didn't have a pricing page I would have never tried it. Just my 2 cents.
You are using xpath here right?
Bought by Palantir, they shut down gracefully, keeping people's data available for a while and communicating well.
It was a great product, but it was still complicated to find a practical business model.
This WrapAPI v2 is an alternative, I think, but I would use it with care since the economic model is uncertain and it seems to be really new. Still promising! :)
The company that runs this software as a service needs to be very careful. 3Taps was similar and got destroyed for relaying data scraped from Craigslist.
Contacting the server after its operator has expressed its wish for you to stop is a violation of the CFAA (in that you are "exceeding authorized access" and/or gaining "unauthorized access" to a protected computer system). If it's found that the site's ToS is binding upon you, which it typically would be, you don't really even need separate notice to be held liable.
Storing a copy of a web page in RAM creates a copy that is eligible for copyright protection, and it is likely that any implied license to read that page will be invalidated by the access revocation.
IANAL.
https://books.google.ca/books?id=a-yu2-JUQNAC&pg=PT249&lpg=P...
Thanks for that! Like I said, I'm not a lawyer and I'm sure there are other gaps in my case knowledge. It's certainly positive to see the Second Circuit recognizing that there is some need to consider the transient nature of RAM copies before ruling them infringing.
The ruling suggests that MAI v. Peak did not address the transitory argument merely because it was not raised by the litigants, and that the precedent set there (which wouldn't have necessarily been binding anyway) is therefore not abrogated by ruling that some RAM copies are transient enough to fail to qualify.
Importantly, the durations listed here describe the runtime of the content, not the amount of time the data is held in RAM. The opinion notes that the system would buffer 0.1 seconds (100 ms) of content at one point and 1.2 seconds of content at another point.
The Court does not seem to establish "1.2 seconds" as a general benchmark for RAM transience, but rather it suggests that transience should be considered on a case-by-case basis, per the language of the statute.
However, the general rule of thumb is that if a copy exists long enough to derive any value from it, it is non-transient. Guidance from the Copyright Office [0] reads:
>[...] we believe that Congress intended the copyright owner’s exclusive right to extend to all reproductions from which economic value can be derived. The economic value derived from a reproduction lies in the ability to copy, perceive or communicate it. Unless a reproduction manifests itself so fleetingly that it cannot be copied, perceived or communicated, the making of that copy should fall within the scope of the copyright owner’s exclusive rights. The dividing line, then, can be drawn between reproductions that exist for a sufficient period of time to be capable of being "perceived, reproduced, or otherwise communicated" and those that do not. As a practical matter, as discussed above, this would cover the temporary copies that are made in RAM in the course of using works on computers and computer networks.
and scrapers have been held liable for copyright infringement via RAM copies on multiple occasions. Ticketmaster v. RMG states:
>[...] copies of ticketmaster.com webpages automatically stored on a viewer's computer are “copies” within the meaning of the Copyright Act.
despite the fact that they likely would've been held for a much shorter time than either 100ms or 1.2 seconds.
Notably, this was before the case referenced above, but it's typical of later cases, and it succinctly demonstrates that courts are more likely to find RAM copies of an entire work (the web page) non-transitory than snippets of roughly 1/1500th of a work, regardless of how long they're stored in RAM.
[0; PDF] https://www.copyright.gov/reports/studies/dmca/sec-104-repor...