
minimaxir9y ago

That analogy is not equitable. If you take photographs of a building while on the building's property, they have the right to tell you to stop, or call the police to escort you off if you refuse to do so.


15 comments · 4 top-level

mindslight9y ago· 7 in thread

Sure, but they do not have the right to retroactively declare you as having been trespassing, nor even to preemptively put up a "no photography" sign and have you arrested for trespassing if you disobey it.

The entire point of protocols is to precisely define the terms of communication. The status code is '200 OK', not '200 OK/Asterisk'. But of course if lawlers didn't force themselves into the situation, they'd be out of jobs.

As an aside, I'd really like to see a browser plugin that would scrape sites in the normal course of access, storing the proceeds in a distributed public database.

cookiecaper9y ago

>As an aside, I'd really like to see a browser plugin that would scrape sites in the normal course of access, storing the proceeds in a distributed public database.

This would be copyright infringement, since the content of the page is a substantive unique work that is automatically copyrighted by its author. A site that doesn't want you scraping its content is not going to want you posting dumps of its pages. Much like BitTorrent, they'd get into the protocol and send subpoenas to the ISPs behind the IPs that serve their pages, and use that info to sue the customer.

When my company was shut down by a legal threat related to scraping, I did suggest to my lawyer that we create something like a browser extension that would grab the data we needed out of normal client-side browsing sessions. This wouldn't be as nice as controlling the flow of information ourselves but it would've worked OK. My lawyer strongly suggested avoiding that as it could've been construed as conspiratorial conduct that would've made criminal prosecution more likely.

niftich9y ago

Not to discount the validity of your experience, but the usual counterpoint to this is Google, who (as mentioned elsewhere in the thread) has been continuously scraping since the very beginning and in fact built their entire business model on doing so. They are also responsible for advancing the state of the art of scraping (albeit mostly internally), through the development of V8 and headless Chromium so that they can inspect dynamic pages too.

Perhaps this illustrates the fungibility of the legal system: it's an inherently human construct that pits a plaintiff against a defendant, and given a big enough war chest and persuasive enough arguments, catastrophe can be avoided -- by Google; perhaps not by you, me, or someone else.


SilasX9y ago

Isn't that what the Wayback Machine (web archive) does? I think they use a "Fair Use" defense.

mindslight9y ago

Oh for sure. But BitTorrent is still around and works great!

Asooka9y ago

Well, it's just a technical response code. 200 OK - everything went as normal, here's your data. By the same margin, the door on a shop doesn't stop you walking out without paying and the road markings don't stop you from driving in the wrong lane.
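To make that concrete, here is a minimal Python sketch using the standard library's `http.HTTPStatus` enum. It only illustrates that the status line carries a number and a short reason phrase, with nothing about terms or licensing; the `describe` helper is a made-up name for this example.

```python
from http import HTTPStatus


def describe(status: HTTPStatus) -> str:
    """Render a status the way it appears on an HTTP status line."""
    return f"{status.value} {status.phrase}"


# The status says the request succeeded at the protocol level, nothing more.
print(describe(HTTPStatus.OK))
```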

I think imbuing technical protocols with legal implications would be even worse than the current situation: changing anything in a protocol would then require changing the law, and getting a protocol implementation slightly wrong would carry real-world legal repercussions on the order of releasing your work into the public domain rather than retaining copyright. Let the lawyers make the law, and check the human terms of service before using the data. Trying to out-lawyer the lawyers is like challenging a hedgehog to a butt-kicking brawl.

Doctor_Fegg9y ago

The protocol is also that you send a valid, non-faked User-Agent:

"The User-Agent request-header field contains information about the user agent originating the request. This is for [...] the tracing of protocol violations [...]. User agents SHOULD include this field with requests"

Many scrapers disregard this part of the protocol. Of course, whether a headless browser should send a different UA is an interesting question.

https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
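For what it's worth, sending an honest User-Agent from a scraper is not much work. A rough sketch with Python's standard `urllib.request`, assuming a hypothetical bot name and contact URL (both invented here); it only builds the request and does not send it:

```python
from urllib.request import Request

# Hypothetical bot name and info URL, chosen only for illustration.
BOT_USER_AGENT = "ExampleScraper/1.0 (+https://example.com/bot-info)"


def build_request(url: str) -> Request:
    """Build (but do not send) a GET request that identifies the scraper."""
    return Request(url, headers={"User-Agent": BOT_USER_AGENT})


req = build_request("https://example.com/")
# urllib stores header names via str.capitalize(), hence "User-agent" here.
print(req.get_header("User-agent"))
```

Whether a site treats such a self-identified client any differently is up to the site; the header is only a claim made by the client.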

TheCoelacanth9y ago

User-Agent is a SHOULD, not a MUST. There are also practically no browsers that send a non-fake User-Agent, since they almost all claim to be Mozilla/5.0.
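A quick way to see the "everyone claims Mozilla" point is to look at the first product token of a UA string. This is only a sketch: the two sample strings are illustrative examples of the common shapes, not captured traffic, and the helper names are made up.

```python
def leading_product(user_agent: str) -> str:
    """Return the first whitespace-separated product token, e.g. 'Mozilla/5.0'."""
    parts = user_agent.split()
    return parts[0] if parts else ""


def claims_mozilla(user_agent: str) -> bool:
    """True if the UA string leads with a Mozilla product token."""
    return leading_product(user_agent).startswith("Mozilla/")


# Illustrative samples: a Chrome-style string and a command-line client.
chrome_like = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
curl_like = "curl/8.5.0"

for ua in (chrome_like, curl_like):
    print(leading_product(ua), claims_mozilla(ua))
```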

ysavir9y ago· 4 in thread

Let's go with a more apt analogy:

If you're entering a country, do its laws not apply to you until you've seen a copy of them? "Oh, sorry, no one told me theft is illegal here. Where does it say that? Oh, I see. Okay. I'll stop now. Thanks for letting me know."

If you cross the border without necessary documents, does that country have no right to detain you, simply because you haven't checked the laws?

Just because a website is visible and public doesn't mean its content is public domain. It just means that your first order of business as a user should be to check the terms of service. Sure, most people using a website probably don't need to--same as not needing to check a country's stance on murder--and so can just use the website as intended without violating the terms. But when you plan on using it in a way that might not be intended, and you don't check the terms of service, well, that's on you.

baddox9y ago

A country's laws are a bit different, simply because a country has virtually absolute legal power over its territory. Countries can and do punish people for breaking laws that they could not feasibly have known they were breaking. Does any human know all the laws in the United States? Would that even be physically possible?

mentat9y ago

There are some interesting science fiction opportunities here. When you open a connection to a site, all traffic over that connection is subject to the jurisdiction of that site's ToS, regardless of disclosure.

Also, we don't even know how many laws there are in the United States, so I'd say knowing their content is impossible.

buzzdenver9y ago

That is not a good analogy. There is such a thing as reasonable expectations when visiting a website, so you do not need to read the TOS. Otherwise I could put "you owe me $1000 for visiting my site" into the TOS. In other words, just clicking on a page does not constitute entering into a contract with the website. Registering and accepting the TOS does, but that still doesn't mean that everything in the TOS is enforceable.

slrz9y ago

>But when you plan on using it in a way that might not be intended, and you don't check the terms of service, well, that's on you.

I don't need to check your terms of service if I'm doing something that I'm allowed to do by law anyway; the TOS cannot deny me those rights (they might, of course, grant me additional rights provided that I follow certain conditions).

baddox9y ago

Regardless of whether that would be reasonable, is it actually true? I know that the United States has specific rules for "public accommodations," which are private properties that are generally accessible to the public, like retail businesses. Property owners in this case don't have complete control over who enters their property. The obvious example is refusal of service due to membership of a protected class like race or religion.

So I'm not so sure that police will escort you out of a Walmart because they caught you taking a picture of the parking lot with your smartphone.

jessaustin9y ago

One rarely visits corporate property in order to access corporate websites. The analogy may be flawed, but this objection to it is as well.
