politician · 13y ago · 0 points

Why are bulk downloads prevented when S3 Requester-Pays buckets could deliver public domain content at $0.12/GiB? Why does a one-time salvage operation require a watermark - are they asserting a copyright?

6 comments · 2 top-level

res0nat0r · 13y ago · 3 in thread

Exactly how is $0.12/GB going to fund an operation that provides full-text search of academic journals and saw 74 million downloads in 2010?

Does anyone chiming in who claims this info must be available for pennies per article actually have any evidence that an operation of this scale can be funded this cheaply? The costs to provide this service are not as simple or as cheap as you think.

duaneb · 13y ago

> The costs to provide this service are not as simple or as cheap as you think.

Exactly. I can't believe bandwidth is JSTOR's limiting cost.

politician (OP) · 13y ago

Ah, yeah, I think you've misunderstood the implications of what I've suggested; from your tone, I think you'll find a corrected interpretation both more and less radical.

JSTOR tells the marketplace that it is a faithful steward of the public domain, yet it knows that a) its TOS prevents unrestricted access to the public domain, and b) a US Attorney will prosecute violations of that TOS as felonies carrying several years in prison. I argue that their speech does not match their actions and that this is dishonest. The extreme negative consequences of taking them at their word (as Aaron did) aggravate this, and are why I've chosen to say that their actions probe the depths of intellectual dishonesty.

Full-text search, like printing, is a value-added service. There is no reason to keep public domain works hostage by the threat of sending people like Aaron to jail for decades just so that they can offer FTS. If JSTOR wants to offer FTS or other services on the corpus of public domain works, let them charge for access to those services.

An existence proof: arXiv offers bulk download access to the works in their repository (~490GiB) via Amazon S3 Requester-Pays buckets [1]. The requester is paying Amazon, not arXiv; so, arXiv doesn't earn "pennies per article", it earns nothing. Incidentally, arXiv also provides full-text search [3].
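For anyone unfamiliar with how Requester-Pays works in practice, here is a minimal sketch. It assumes boto3 is installed and AWS credentials are configured (the requester has to be authenticated so Amazon knows whom to bill). The bucket name, key, and helper names are hypothetical placeholders, not arXiv's or JSTOR's actual layout.

```python
def requester_pays_get_args(bucket, key):
    """Build get_object arguments that opt in to paying the transfer cost."""
    # RequestPayer="requester" tells S3 the caller accepts the charges.
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download(bucket, key, dest_path):
    """Stream one object from a Requester-Pays bucket to a local file."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_get_args(bucket, key))
    with open(dest_path, "wb") as out:
        for chunk in response["Body"].iter_chunks():
            out.write(chunk)


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    print(requester_pays_get_args("example-public-domain-bucket", "articles/0001.pdf"))
```

The point is that the bucket owner pays only for storage; every download is billed to whoever made the request.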

Let's talk about capacity planning. Let's guess that the average size of a digitized journal article is 5 MiB; across the 74 million downloads in 2010, that comes out to ~361,000 GiB (about 353 TiB), or roughly $43,400 to stream from S3 at $0.12/GiB. The at-rest cost of those articles is far lower because there are far fewer articles than downloads (I don't have a number, but would you argue otherwise?).
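The back-of-the-envelope math can be checked with a few lines. This is a sketch under the thread's own assumptions: the 5 MiB average article size is a guess, and $0.12/GiB is the transfer price quoted above. Depending on whether MiB are converted by dividing by 1024 or by 1000, the bill lands at about $43,400 or $44,400.

```python
DOWNLOADS_2010 = 74_000_000   # download count cited upthread
AVG_ARTICLE_MIB = 5           # guessed average article size
PRICE_PER_GIB = 0.12          # quoted S3 transfer price, USD


def transfer_estimate(downloads, avg_mib, price_per_gib, mib_per_gib=1024):
    """Return (total volume in GiB, total transfer cost in USD)."""
    total_mib = downloads * avg_mib
    total_gib = total_mib / mib_per_gib
    return total_gib, total_gib * price_per_gib


# Strict binary conversion: 1 GiB = 1024 MiB.
binary_gib, binary_cost = transfer_estimate(
    DOWNLOADS_2010, AVG_ARTICLE_MIB, PRICE_PER_GIB
)
# Decimal shortcut: treat 1000 MiB as one billing unit.
decimal_gb, decimal_cost = transfer_estimate(
    DOWNLOADS_2010, AVG_ARTICLE_MIB, PRICE_PER_GIB, mib_per_gib=1000
)

print(f"binary:  {binary_gib:,.0f} GiB, ${binary_cost:,.0f}")
print(f"decimal: {decimal_gb:,.0f} GB,  ${decimal_cost:,.0f}")
```

Either way the total is tens of thousands of dollars a year, and under Requester-Pays it is paid by the downloaders rather than by the host.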

My proposal: JSTOR removes its watermarks and puts all public domain works and associated metadata into S3 Requester-Pays buckets. They finance their operations by selling non-public-domain works and value-added services like FTS at a price the market will bear.

Earnings made by restricting bulk access to public domain works are blood money. Watermarking these documents is confusing - no one seems to have answered my question about whether they are claiming a new copyright.

[1] http://arxiv.org/help/bulk_data_s3
[2] https://forums.aws.amazon.com/ann.jspa?annID=386
[3] http://arxiv.org/find

res0nat0r · 13y ago

arXiv is funded by Cornell.

How will JSTOR be a "good faith steward" if they give all of their content away and make it available for free? They have to pay their bills to maintain the infrastructure that supports all of this, and pay their employees to run it. Charging for article access is how this is accomplished and how the company behind this allows the service to continue. There is a lot more money behind the operation than just copying PDFs to an S3 bucket.

duaneb · 13y ago · 1 in thread

> Why are bulk downloads prevented when S3 Requester-Pays buckets could deliver public domain content at $0.12/GiB?

Because a) they need ammo against people like, well, Aaron Swartz, and b) it doesn't take many people bulk downloading to DDoS the sucker.

> Why does a one-time salvage operation require a watermark - are they asserting a copyright?

They probably don't distinguish between public domain and copyrighted in document production. TBH I doubt that most documents people access are public domain, so it probably hasn't been an issue until now. Even in fields that are really old (e.g. classics) the vast majority of work is recent and copyrighted.

Besides, if I were to scan these things in, I would want people to know I did it. Do you know how much effort goes into the process? Even when things are completely automated (which is expensive) there's a lot of manual labor in operating the machine, editing, and cleaning stuff up. What do you do with figures? How about typeset math? Glyphs not in Unicode?

Not saying I agree with their tactics, but I understand them and I don't think it makes them an immoral/bad organization.

bjustin · 13y ago

> Do you know how much effort goes into the process? Even when things are completely automated (which is expensive) there's a lot of manual labor in operating the machine, editing, and cleaning stuff up. What do you do with figures? How about typeset math? Glyphs not in Unicode?

Step 1: Ask Google to do it. Step 2: There is no step two.

Google Books shows that Google is willing to do these kinds of tasks, as long as they can show the results on their site and thus keep people using Google services. I'm sure they would digitize all of the public domain papers and host them at no cost to JSTOR.