ISBNdb dump – how many books are preserved forever?

(annas-blog.org)

174 points · pilimi_anna · 3y ago · 38 comments

bloak · 3y ago · 9 in thread

Google has claimed that about 130 million books have been published (that factoid is all over the web). The number of 10-digit ISBNs is 1000 million (the tenth digit is a check digit, leaving nine free digits) and people have only just started using 13-digit ISBNs that start with 979 instead of 978; but of course there must be lots of wasted ISBNs, for example when a publisher optimistically buys a big block and then goes bankrupt. Both those numbers suggest that the "ISBNdb" with fewer than 31 million ISBNs is far from complete.

The frequency of each top-level prefix (which tells you the geographical or language region) would be interesting. That would be the first thing I'd calculate if I had the data on my disc.
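A rough sketch of how that prefix count could be done in Python, assuming the dump can be read as a list of 13-digit ISBN strings. The `GROUP_RANGES_978` table is an assumption written from memory of the published registration-group ranges and should be checked against the International ISBN Agency's current range file; 979 prefixes are only bucketed, not split, and the function names are made up for illustration.

```python
from collections import Counter

# ASSUMPTION: registration-group ranges for the digits following "978".
# Each entry is (low, high, width). Verify against the official range file.
GROUP_RANGES_978 = [
    ("0", "5", 1),
    ("600", "649", 3),
    ("65", "65", 2),
    ("7", "7", 1),
    ("80", "94", 2),
    ("950", "989", 3),
    ("9900", "9989", 4),
    ("99900", "99999", 5),
]


def registration_group(isbn13: str) -> str:
    """Return a label such as '978-0' or '978-83' for one ISBN-13."""
    digits = isbn13.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        return "invalid"
    if digits[:3] != "978":
        # 979 groups are not modelled in this sketch.
        return digits[:3] + "-?"
    rest = digits[3:]
    for low, high, width in GROUP_RANGES_978:
        # Equal-length digit strings compare the same way as numbers.
        if low <= rest[:width] <= high:
            return "978-" + rest[:width]
    return "978-unassigned"


def group_frequencies(isbns):
    """Count how many ISBNs fall into each registration group."""
    return Counter(registration_group(i) for i in isbns)
```

Calling `group_frequencies(isbns).most_common(10)` would then list the ten busiest groups.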

ComputerGuru · 3y ago

I always loved how, despite the massive domain differences, the ISBN situation is extremely similar to the IPv4/IPv6 situation (except more aggressively rent-seeking), with prefixes leased out to the old dogs, concerns about eventual address/ISBN exhaustion, a scheme for mapping old ISBN-10 codes to new ISBN-13 codes, etc.
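For anyone curious what that mapping looks like, here is a minimal Python sketch: drop the old check digit, prepend 978, and recompute the check digit with the EAN-13 rule. The function names are invented for illustration.

```python
def isbn10_check_digit(first9: str) -> str:
    """ISBN-10 check digit: weights 10 down to 2, modulo 11, 'X' means 10."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def isbn13_check_digit(first12: str) -> str:
    """ISBN-13 / EAN-13 check digit: alternating weights 1 and 3, modulo 10."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a valid ISBN-10 to its 978-prefixed ISBN-13."""
    digits = isbn10.replace("-", "").replace(" ", "").upper()
    if len(digits) != 10 or not digits[:9].isdigit():
        raise ValueError("not a 10-character ISBN")
    if isbn10_check_digit(digits[:9]) != digits[9]:
        raise ValueError("ISBN-10 check digit mismatch")
    core = "978" + digits[:9]
    return core + isbn13_check_digit(core)


print(isbn10_to_isbn13("0-306-40615-2"))  # prints 9780306406157
```

The reverse mapping only exists for 978-prefixed numbers; a 979 ISBN-13 has no ISBN-10 equivalent, which is where the IPv6 comparison fits rather well.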

23skidoo · 3y ago

I'm a little perplexed by the ISBN system. The whole centralized affair, where you have to purchase ISBNs, seems like a racket. ISBNs cost more in some countries (America) than they do in others (Canada), for no reason other than that they can get away with it.

Much better would be a UUID generated from unique values, like a hash of the timestamp and publisher of a book. If you limit the length and number of the fields you hash to generate the UUID, you could even prove there will be zero collisions and eliminate any need for collision checks, and thus for an organization that charges money.

egypturnash · 3y ago

ISBN was introduced in 1970. While hash functions did exist at this point (https://en.wikipedia.org/wiki/Hash_function#History) the computational resources generally available for this sort of thing were... rather lacking. The Apple II wasn't introduced until 1977.

I will leave figuring out which hashing functions were known back in 1970, and experimenting with calculating them by hand, up to you. :)

xyzzy123 · 3y ago

While archaic, ISBN doesn't seem a bad system to me.

Short values are more reliable in retail situations. They can be typed in by hand or read with cheap scanners.

You are of course free to publish without an ISBN if you don't care about the legacy ecosystem.

There's nothing stopping anyone from creating or promoting an alternative but I don't think the incentives are there. There's not enough money in it, and I don't think the cost savings are enough to make a switch compelling.

bloak · 3y ago

That's definitely an interesting question, why they don't use a longer identifier without central/hierarchical allocation. I don't have an answer, but some possibly relevant points:

* Rather than compute a hash you could just generate a random number: same risk of collision if done correctly (but different opportunities for making a mistake).

* When ISBNs were introduced in the 1960s people would have been typing and even handwriting them so keeping them short would have been important.

* ISBNs have now been incorporated into EANs (13 digits), which are used for all things sold by retailers, except in the USA and Canada, which, according to Wikipedia, use a system called UPC. (Ironically, the U stands for "universal" while the E stands for "European". Of course the 12-digit system got incorporated into the 13-digit system. Probably there will be a 14-digit system one day.)

* In a UK supermarket if the barcode won't scan someone has to type in the digits. I assume that in most cases they type all 13 digits but I haven't watched carefully. (Of course I am now inspired to watch more carefully next time it happens.) They could have a really clever interface connected to a real-time database of barcodes which recently failed to scan because I expect whole batches of a product have badly printed or crinkled packaging.

* A suitably designed 25-digit system would only take twice as long, or less than twice as long, to type in as the current 13-digit system, but the system would have to be suitably designed for that purpose. Having the computer tell the human at the end "there's a mistake somewhere" would be no good at all. At the very least you could have a check digit for each half and tell the human which half contains the mistake but of course you could do much better than that ...

* I have noticed that Sainsbury's (a major UK supermarket) has a system of 8-digit barcodes for its own products, but Tesco (another major supermarket) uses the standard 13-digit barcodes for its own products.

* ALDI products have giant barcodes printed in several places on the packaging without the corresponding digits printed underneath the barcode: the scanner will never fail!

IncRnd · 3y ago

> Much better would be a UUID generated from unique values, like a hash of the timestamp and publisher of a book. If you limit the length and number of the fields you hash to generate the UUID, you could even prove there will be zero collisions and eliminate any need to collision checks and thus an organization that charges money.

That's false. Your algorithm of hashing a timestamp and book publisher name cannot be proven to be collision-free.
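To put rough numbers on the collision risk, here is a small Python sketch of the standard birthday-bound approximation. It assumes identifiers behave like uniformly random values, which a hash of timestamp and publisher name would only approximate at best; the function name is made up for illustration.

```python
import math


def collision_probability(n_items: int, bits: int) -> float:
    """Approximate chance of at least one collision among n_items random IDs.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    space = 2.0 ** bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


# 130 million books (the figure quoted upthread), 122 random bits as in a
# version-4 UUID:
print(collision_probability(130_000_000, 122))

# The same number of books drawn at random from a space of about 10**9
# values (roughly 30 bits):
print(collision_probability(130_000_000, 30))
```

The first probability is vanishingly small but not zero, which is the point: random or hashed identifiers make collisions improbable, not provably impossible. The second shows why a space as small as the ISBN one only works with central allocation.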

contingencies · 3y ago

Many things are published without ISBNs or have ISBNs and aren't traditional books. Here in China, to get an ISBN for a book you have to have a government approval process. So many publishers will print stuff on the proverbial sly, often at night, without assigning an ISBN. There's also book-like printed matter (pamphlets, maps, posters, puzzles, 3D/fold-out dioramas, etc.) which often lack an ISBN. So equating ISBN with book is not correct. Then there's all the stuff published pre-ISBN...

adolph · 3y ago

An ISBN is assigned to each separate edition and variation (except reprintings) of a publication. For example, an e-book, a paperback and a hardcover edition of the same book will each have a different ISBN.

Additionally, there is address fragmentation; ISBN has blocks:

ISBN issuance is country-specific, in that ISBNs are issued by the ISBN registration agency that is responsible for that country or territory regardless of the publication language. The ranges of ISBNs assigned to any particular country are based on the publishing profile of the country concerned, and so the ranges will vary depending on the number of books and the number, type, and size of publishers that are active. Some ISBN registration agencies are based in national libraries or within ministries of culture and thus may receive direct funding from the government to support their services. In other cases, the ISBN registration service is provided by organisations such as bibliographic data providers that are not government funded.

https://en.wikipedia.org/wiki/ISBN

eCa · 3y ago

Yeah, lots and lots of unused ISBNs. As an example, O’Reilly has the 978-0-596 series. That’s a hundred thousand editions.
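The arithmetic behind that figure, as a tiny Python sketch: after the 978 prefix and before the check digit there are nine digits, shared between the registration group, the publisher (registrant) prefix, and the title number. The function name is made up for illustration.

```python
def block_capacity(group: str, registrant: str) -> int:
    """Number of ISBNs available under one group + registrant prefix."""
    title_digits = 9 - len(group) - len(registrant)
    if title_digits < 0:
        raise ValueError("group and registrant prefixes are too long")
    return 10 ** title_digits


print(block_capacity("0", "596"))  # prints 100000
```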

billblack · 3y ago · 5 in thread

As someone who would like to publish, my main concern with ISBNs is the cost, because publishers are required to assign an ISBN to every item in their catalog.

Section 6.1 of the ISBN International User Manual says: "A separate ISBN shall be assigned to each separate monographic publication or separate edition or format of a monographic publication issued by a publisher."

This would not be a problem if the numbers were more affordable.

gigel82 · 3y ago

That's surprising; I wanted to make a picture book (with a bit of text the kids wrote) to send to grandparents and stumbled upon BookWright; seemed an affordable choice but was very surprised they actually included an ISBN with the little one-off kids picture book.

Maybe they're just sitting on a big block of numbers and just giving them away...

pilimi_anna (OP) · 3y ago

This is interesting to learn about. How expensive is it, and do you know if it differs around the world (since there are lots of national ISBN agencies)?

jkingsman · 3y ago

It's $125 list price for a single ISBN, but there are bulk discounts when buying direct, and purchasing through a large-scale supplier can make them as cheap as $10 each. There are deals to be had; Amazon, for example, may give you a free ISBN for your ebook as long as you publish it using KDP, their walled-garden publishing system, but the gotcha is that the ISBN is not portable: you're not permitted to use it for other editions outside of the Amazon system.

The other downside to these free (just about always) and discounted (sometimes) ISBNs is that they list the service you got the ISBN through as the publisher, rather than yourself, even if you're doing what would classically be considered a self-publishing job. How big of an issue is that? IANAExpert, but it seems like there are some nooks and crannies of IP law that can be swayed by owning the imprint, with little practical concern for the average person putting an ebook on Amazon, for example. Perhaps someone with more in-depth publishing knowledge can color the risks better than I.

wrs · 3y ago

For the US, between $125 each (for 1) and $1.50 each (for 1000) from the official source, a company called Bowker. The structure is described in Bowker’s FAQ [0].

[0] http://isbn.org/faqs_general_questions#isbn_faq6

jwilk · 3y ago

In Poland, you get ISBNs for free.

tedivm · 3y ago · 3 in thread

The timing on this for me is really interesting, as last week I got an ISBN issued for a book I'm working on (9781633438002 if anyone is curious!).

This will be the first book I'm the author of, but the second book I've worked on (I was the technical editor for the first). Neither of these books is out yet (I start writing tomorrow) but they both have ISBNs issued. Even if I never publish the book, that ISBN is locked in.

I imagine there's a lot of books that started out but never got finished. That said it looks like ISBNdb doesn't grab directly from the source, but instead crawls the internet looking for ISBN data to put into its database. I'll be interested to see at which stage my ISBN shows up in the database.

delecti · 3y ago

What's the rationale behind reserving an ISBN before even beginning the writing process?

lmm · 3y ago

It's a good unique key to use for tracking the book, even internally. You might change the title of the book at a late stage. You could use your own ID scheme, but what if your publisher merges with another while the book production is in process?

johannes1234321 · 3y ago

Before writing might be a bit early, but before finishing and producing is useful as it can be listed early in catalogs for preorder. Getting the ISBN earlier probably is cheap and allows the publisher to use the ISBN as identifier for the whole project.

pugworthy · 3y ago · 1 in thread

Define "forever" in this context? 10 years? 100? 1000?

It's a legit question to answer.

kleer001 · 3y ago

A real forever? Past the life of Sol and Earth. Unlikely.

A more conservative forever... at the end of the human species? Maybe.

ZeroGravitas · 3y ago · 1 in thread

> extracting ISBNs from the actual book scans themselves (in the case of Z-Library/Libgen).

OpenLibrary also uses book scans in Archive.org to extract ISBNs (and a few other bits of metadata, like URLs in the text):

https://blog.openlibrary.org/2021/08/23/gsoc-2021-making-boo...

And they have a software pipeline for that kind of thing available.
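As an illustration of the general idea only (this is not Open Library's actual pipeline, and the names are hypothetical), a minimal Python sketch that pulls ISBN-13 candidates out of OCR text and keeps those whose check digit validates:

```python
import re

# 978 or 979 followed by ten more digits, each optionally preceded by a
# hyphen or space, as ISBNs are usually printed.
ISBN13_CANDIDATE = re.compile(r"\b97[89](?:[- ]?\d){10}\b")


def is_valid_isbn13(digits: str) -> bool:
    """Check the EAN-13 checksum: alternating weights 1 and 3, sum % 10 == 0."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def extract_isbn13s(text: str) -> list[str]:
    """Return the distinct valid ISBN-13s found in text, in order of appearance."""
    found = []
    for match in ISBN13_CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group())
        if is_valid_isbn13(digits) and digits not in found:
            found.append(digits)
    return found


page = "ISBN 978-1-63343-800-2 (print); OCR misread: 9781633438003"
print(extract_isbn13s(page))  # prints ['9781633438002']
```

The checksum filter matters with OCR output, since a single misread digit would otherwise produce a plausible-looking but wrong identifier. Older books would also need ISBN-10 handling, which is left out here.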

pilimi_anna (OP) · 3y ago

There's probably a lot of things that Open Library does that we can try to apply to shadow libraries!

Tomte · 3y ago

ISBNs are supposed to be unique, but they aren't. Publishers reuse ISBNs, by mistake if they are following the rules, or sometimes intentionally.

It's not super common, but common enough that I ran across that problem when scanning in my bookshelf years ago.

Archelaos · 3y ago

The problem with counting "books" is that the term is used in so many different ways that one might end up with estimates that differ by several orders of magnitude, depending on how narrow or wide a definition or characterization one adopts. How many books is a bible? One, or around 80?[1] When there is a new minor edition, do we count zero, one or 80 new books? Some of these 80 "books" are only letters, less than a page or only a few pages long. Shall we count them all as "books"? If we do so, should we then count each letter of a modern published correspondence as a single "book"? Poems were often published as very small booklets, but for prominent writers you may be able to purchase their "complete works" in a single more or less thick volume, or the very same text in one thick volume or a few handier volumes. How should we count this?

> Physical copies. Obviously this is not very helpful, since they’re just duplicates of the same material.

Alas, this is quite often not the case, in particular for older books, for various reasons: for example, copies were bound from sheets of different print runs that used freshly assembled typesettings containing accidental or deliberate variations, sometimes sheets were missing or the order of pages is not correct, etc.[2] For important "books" we should therefore digitize every available copy.

As great as it would be to have 129,864,880 "books" scanned, this would be just an initial phase. We would need quality control: Is the resolution of the scans really always sufficient? Are the colours correctly represented (does every scan include a standard colour chart for comparison)? What about watermarks (they are extremely important for dating old books)? ...

Besides, I personally prefer to speak of "making books digitally available" rather than of "preserving" them, because many features of a physical copy are impossible to preserve digitally: chemical composition, (bio-)chemical traces, the DNA of parchment or animal bindings, their texture, how it feels to handle them, their visual appearance under different illuminations ...

[1] The number varies from denomination to denomination.

[2] And even renowned contemporary publishers sometimes silently correct errors without changing the numbering of the edition.

photochemsyn · 3y ago

There are some interesting technologies in the pipeline for truly long-term data storage. Synthetic diamond is one option (light-sensitive, so perhaps susceptible to cosmic-ray degradation over time):

https://theconversation.com/turning-diamonds-defects-into-lo...

Another is microetching, i.e. ion-beam insertion of foreign atoms into crystalline materials such as diamond or nickel. Although the data density is lower than with the above approach, it seems a lot less sensitive (i.e. light should have less effect):

https://en.wikipedia.org/wiki/HD-Rosetta

omoikane · 3y ago

That statement of "before the demise of Google Books" seems unnecessary. The next quoted bit, "at least until Sunday", might have been an attempt to complete the joke, but it should be read as saying that the number of books changes rapidly, according to the (12-year-old) linked article.

http://booksearch.blogspot.com/2010/08/books-of-world-stand-...

mechanical_bear · 3y ago

Forever? 0.
