The irony here is that the only other full copy of the Internet Archive is actually hosted at the Library of Alexandria.
Source: Digital Amnesia Documentary [1]
[0] https://ipfs.io/
Related previous post: https://news.ycombinator.com/item?id=24485444
Much of the catalog functionality can be accessed from the fatcat.wiki API (https://api.fatcat.wiki/redoc). Scholar adds a search index over the body content of papers, and we are still thinking through how to make this available through a public API without increasing query latency even more.
Folks here might also be interested in this CLI for interfacing with the catalog and making edits: https://gitlab.com/bnewbold/fatcat-cli
Super fast. All my test searches returned what I was looking for.
What is your relationship with semantic scholar like?
Any plans to integrate ranking signals like references, etc?
I'm going to double my monthly donation. This is great.
We are friendly with Semantic Scholar, and have used their "open corpus" dumps as one of several URL seed lists for crawling in the past. Their search and discovery tech is more sophisticated than ours is likely to be any time soon (https://medium.com/ai2-blog/building-a-better-search-engine-...). We would love to get to the place where groups like AI2, which are primarily research-oriented, could build on an existing open catalog and corpus, and not need to duplicate time crawling, merging catalogs, cleaning metadata, etc. As of today Microsoft Academic (used by Semantic Scholar) might be a better option.
Want to be thoughtful about ranking signals, and are deeply skeptical of journal impact factor, h-index, and most bibliometrics. "Has this been cited more than a handful of times" seems like a reasonable coarse boost. Hope to include more curated signals, like "won a paper prize", "journal in DOAJ and other reviewed indices", etc.
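To make the "coarse boost" idea concrete, here is a rough sketch, not how Scholar actually ranks anything. The `Hit` fields, the threshold of five citations, and the multipliers are all made-up assumptions; the point is only that the citation signal is a yes/no bump rather than a score that grows with the count.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical search hit; field names are illustrative only."""
    title: str
    text_score: float          # relevance score from the full-text index
    citation_count: int = 0
    won_prize: bool = False    # curated signal: "won a paper prize"
    in_doaj: bool = False      # curated signal: journal listed in DOAJ


def boosted_score(hit: Hit, min_citations: int = 5) -> float:
    """Apply coarse, capped boosts on top of the text relevance score."""
    score = hit.text_score
    # Binary boost: 6 citations and 6,000 citations are treated the same.
    if hit.citation_count > min_citations:
        score *= 1.2
    if hit.won_prize:
        score *= 1.1
    if hit.in_doaj:
        score *= 1.05
    return score


def rank(hits: list[Hit]) -> list[Hit]:
    """Order hits from highest to lowest boosted score."""
    return sorted(hits, key=boosted_score, reverse=True)
```

Because the boost is binary, a heavily cited paper cannot drown out everything else on citations alone, which fits the skepticism about bibliometrics above.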
Have been working on a citation graph, keep an eye out for something about that in coming months. One cool thing we hope to do with the citation graph is find "missing works" not yet in the catalog (eg, works that don't have a DOI, especially from the pre-1990 era).
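A toy illustration of the "missing works" idea, under heavy assumptions: each citation edge is a `(citing_id, cited_key)` pair, where `cited_key` is either a catalog identifier or some normalized reference string. The function name, the key format, and the `min_mentions` cutoff are all hypothetical.

```python
from collections import Counter


def find_missing_works(citations, catalog_keys, min_mentions=2):
    """Return (cited_key, mention_count) pairs for cited works that are
    not in the catalog, most frequently cited first.

    citations: iterable of (citing_id, cited_key) pairs
    catalog_keys: set of keys already present in the catalog
    """
    mentions = Counter(
        cited for _citing, cited in citations if cited not in catalog_keys
    )
    # Require a few independent mentions to filter out one-off parse noise.
    return [(key, n) for key, n in mentions.most_common() if n >= min_mentions]
```

A reference that many papers cite but that resolves to nothing in the catalog is a good candidate for a work that should be added.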
The bad news is that only half of them use DOIs or embed licenses in the metadata. Are they indexed or archived somewhere?
To my surprise, there are more than 350,000 papers published in OA Diamond journals every year, and most journals publish fewer than 25 articles a year.
"BASE is one of the world's most voluminous search engines especially for academic web resources. BASE provides more than 240 million documents from more than 8,000 content providers. You can access the full texts of about 60% of the indexed documents for free (Open Access). BASE is operated by Bielefeld University Library."
BASE, SHARE (https://share.osf.io/), and CORE (https://core.ac.uk) all primarily pull metadata via OAI-PMH, though they may also incorporate other sources these days. We have worked with CORE to check content overlap. We have also done our own OAI-PMH bulk scraping and broad preservation crawling, but most of this content has not ended up indexed in fatcat yet.
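For anyone unfamiliar with OAI-PMH, harvesting boils down to paging through `ListRecords` responses with a resumption token. Below is a minimal sketch using only the Python standard library. It assumes the endpoint serves the `oai_dc` metadata format; the helper names are mine, and this is not the code behind any of the services mentioned.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH 2.0 responses and Dublin Core metadata.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords request URL.

    resumptionToken is an exclusive argument in OAI-PMH, so it must not be
    combined with metadataPrefix on follow-up requests.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Parse one response page into (records, next_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", OAI_NS):
        identifier = rec.findtext(
            "oai:header/oai:identifier", default="", namespaces=OAI_NS
        )
        title = rec.findtext(".//dc:title", default="", namespaces=OAI_NS)
        records.append({"identifier": identifier, "title": title})
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    # An absent or empty token means this was the last page.
    return records, (token.strip() or None)
```

A harvester would fetch `list_records_url(base)`, call `parse_page` on the body, and repeat with the returned token until it comes back as `None`.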
The main reason that we haven't done bulk imports from OAI-PMH directly or from any of these sources is that we haven't gotten a handle on the metadata quality yet. There are many, many duplicate records out there, and we think it is important to merge these correctly (under "work" entities in fatcat). Until recently we didn't have a mechanism to fuzzy-match new records to prevent duplicate creation. Once we get a policy figured out and polish the de-dupe code, we expect to significantly increase the amount of content in the catalog from these sources.
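To give a flavor of what fuzzy-matching records under a shared "work" could look like, here is a deliberately naive sketch. It is not the fatcat de-dupe code: the record shape (`title`, `year`), the 0.9 similarity threshold, and the one-year tolerance are all assumptions picked for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def is_probable_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Hypothetical rule: near-identical titles, years at most one apart."""
    ratio = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return ratio >= threshold and abs(a["year"] - b["year"]) <= 1


def group_into_works(records: list[dict]) -> list[list[dict]]:
    """Greedy grouping: attach each record to the first group it matches,
    comparing against that group's first record only."""
    works: list[list[dict]] = []
    for record in records:
        for work in works:
            if is_probable_duplicate(record, work[0]):
                work.append(record)
                break
        else:
            works.append([record])
    return works
```

Pairwise comparison like this does not scale to hundreds of millions of records, so a real pipeline would need some blocking or indexing step first, and title-plus-year alone will produce false merges (eg, errata, or "Part I" vs. "Part II").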
Another thing we haven't figured out yet is accurately tagging OAI-PMH feeds as journals (eg, OJS instances) vs. institutional repositories or subject repositories. That distinction will change how we classify imported records (eg, "published" versions vs. "pre-print" or "manuscript").
Thus, I was able to solve the PDP-1 "Amherst Mystery" [1]: https://www.masswerk.at/nowgobang/2021/pdp1-spotting#update
Some fields/disciplines are probably still systemically under-represented. For example, I bet we are missing a bunch of scholarship on art and history published before 1980. We have a couple ideas up our sleeves which we hope will help with "completeness" across more disciplines.
To answer your question directly, you can search journal names here: https://fatcat.wiki/container/search
And click through to see how many articles we know about, and what we think the preservation status is. Click through again to the "coverage" tab for a more detailed breakdown. (improving the usability and ranking on the journal search results is on our short list)
Two large sources in Japan you might consider trying to mirror are the UTokyo Repository [1] and Researchmap [2], through which many researchers in Japan release PDFs of their own papers. Other Japanese universities probably have archives similar to [1].
If you would like me to contact somebody at [1] who might be able to work with you, please let me know in a reply to this comment. (I helped to arrange the IA’s recent tie-up with the University of Tokyo General Library.)
Don't forget to donate if you also like Internet Archive, they need every penny: https://archive.org/donate/?origin=hn
In computer science we are pretty lucky because open access is the norm.
I checked a few well-known exceptions, and this seems to find them OK.
"Mastering the game of Go without human knowledge" (Deepmind in Nature): https://scholar.archive.org/search?q=key:work_yqdj7vjbefg7hh...
"Typing candidate answers using type coercion" (IBM Watson special edition, IEEE IBM Systems Journals): https://scholar.archive.org/search?q=key:work_dym4lqay5fcdxo...