Internet Archive Scholar: Search Millions of Research Papers

(blog.archive.org)

342 points · bnewbold · 5y ago · 47 comments

marcodiego · 5y ago

The Internet Archive is becoming a good alternative internet. It has a web archive, film archive, software archive, media archive... and now a research paper archive. That is the internet as a giant library, as we dreamed of it in the early '90s.

Black101 · 5y ago

Way too centralized (Centranet?), but it is very nice for now. It's a bit like the library of Alexandria, so it could change/disappear at any time.

cookiengineer · 5y ago

> It's a bit like the library of Alexandria, so it could change/disappear at any time.

The irony here is that the only other full copy of the Internet Archive is actually hosted at the library of Alexandria.

Source: Digital Amnesia Documentary [1]

[1] https://www.youtube.com/watch?v=NdZxI3nFVJs

dbrereton · 5y ago

I'm sure they'd be willing to decentralize it if there were a good way to do that. Maybe this can be done with something like IPFS [0].

[0] https://ipfs.io/

phendrenad2 · 5y ago

Exactly. I encourage everyone to become a digital hoarder. See a cool blog post? Assume it will be GONE in 5-10 years. So make a backup PDF copy and throw it in Dropbox. In 5-10 years, if you re-encounter that page and the Internet Archive is missing it, you'll be delighted to find it in your own archive, and you can be the one who restores that information to the world.

puddingnomeat · 5y ago

Is it easy to have a local copy?

bnewbold (OP) · 5y ago

This service was hinted at back in September, but is now formally announced and live at https://scholar.archive.org

Related previous post: https://news.ycombinator.com/item?id=24485444

Much of the catalog functionality can be accessed from the fatcat.wiki API (https://api.fatcat.wiki/redoc). Scholar adds a search index over the body content of papers, and we are still thinking through how to make this available through a public API without making query latency even worse.

Folks here might also be interested in this CLI for interfacing with the catalog and making edits: https://gitlab.com/bnewbold/fatcat-cli

breck · 5y ago

I absolutely love everything about it (the logo <3).

Super fast. All my test searches returned what I was looking for.

What is your relationship with semantic scholar like?

Any plans to integrate ranking signals like references, etc?

I'm going to double my monthly donation. This is great.

bnewbold (OP) · 5y ago

Thank you for the kind words!

We are friendly with Semantic Scholar, and have used their "open corpus" dumps as one of several URL seed lists for crawling in the past. Their search and discovery tech is more sophisticated than ours is likely to be any time soon (https://medium.com/ai2-blog/building-a-better-search-engine-...). We would love to get to the place where groups like AI2, which are primarily research-oriented, could build on an existing open catalog and corpus, and not need to duplicate time crawling, merging catalogs, cleaning metadata, etc. As of today Microsoft Academic (used by Semantic Scholar) might be a better option.

Want to be thoughtful about ranking signals, and are deeply skeptical of journal impact factor, h-index, and most bibliometrics. "Has this been cited more than a handful of times" seems like a reasonable coarse boost. Hope to include more curated signals, like "won a paper prize", "journal in DOAJ and other reviewed indices", etc.
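A minimal sketch of what such a coarse boost could look like. Everything here is an assumption for illustration: the names (`Paper`, `coarse_boost`, `rerank`), the cutoff of 5 citations, and the weights are all hypothetical, and none of it is taken from how Scholar actually ranks results. The point is only that the citation signal is binary, so a paper with 10,000 citations gets no more boost than one with 10.

```python
from dataclasses import dataclass

# Hypothetical cutoff for "more than a handful of times"; the value is an assumption.
CITATION_THRESHOLD = 5


@dataclass
class Paper:
    title: str
    citation_count: int
    won_prize: bool = False
    journal_in_doaj: bool = False


def coarse_boost(paper: Paper) -> float:
    """Return a multiplicative boost built from a few binary signals."""
    boost = 1.0
    if paper.citation_count > CITATION_THRESHOLD:
        boost *= 1.5  # arbitrary illustrative weight
    if paper.won_prize:
        boost *= 1.2  # arbitrary illustrative weight
    if paper.journal_in_doaj:
        boost *= 1.1  # arbitrary illustrative weight
    return boost


def rerank(results):
    """Sort (text_score, paper) pairs by text score times the coarse boost."""
    return sorted(results, key=lambda r: r[0] * coarse_boost(r[1]), reverse=True)
```

Because the boost multiplies the text-match score rather than replacing it, a poorly matching but heavily cited paper still cannot outrank a strong match by citations alone.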

Have been working on a citation graph, keep an eye out for something about that in coming months. One cool thing we hope to do with the citation graph is find "missing works" not yet in the catalog (eg, don't have a DOI, especially for pre-1990 era).
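One way the "missing works" idea could be sketched, under stated assumptions: references are plain dicts with an optional `doi` and a `title`, the catalog is a list of titles, and a reference counts as missing when it has no DOI and its normalized title is not in the catalog. The function names and the `min_mentions` parameter are hypothetical, not part of fatcat.

```python
from collections import Counter


def normalize(title: str) -> str:
    """Lowercase and keep only alphanumerics so trivial variants collapse."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def find_missing_works(references, catalog_titles, min_mentions=2):
    """Return (title, mention_count) for cited works absent from the catalog.

    references: iterable of dicts with an optional "doi" and a "title".
    catalog_titles: titles already present in the catalog.
    """
    known = {normalize(t) for t in catalog_titles}
    counts = Counter()
    display = {}
    for ref in references:
        if ref.get("doi"):
            continue  # has a DOI, so it could be matched by identifier instead
        key = normalize(ref.get("title", ""))
        if key and key not in known:
            counts[key] += 1
            display.setdefault(key, ref["title"])  # remember the first spelling seen
    return [(display[k], n) for k, n in counts.most_common() if n >= min_mentions]
```

Requiring several independent mentions is a cheap guard against treating a single garbled reference string as a real work.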

fenrissan · 5y ago

Today I discovered "Open Access Diamond journals" in a report (https://zenodo.org/record/4558704/files/OADJS-Findings.pdf). These are small-scale, peer-reviewed, free-to-read, free-to-publish, non-commercial journals, typically supported by universities or government agencies. They serve diverse communities and are not predatory journals (they are free after all).

The bad news is only half of them use DOI or embed licenses in the metadata. Are they indexed or archived somewhere?

To my surprise, there are more than 350,000 papers published in OA Diamond journals every year, and most journals publish fewer than 25 articles a year.

pasttense01 · 5y ago

How does this compare to BASE and why isn't BASE used as a source?

"BASE is one of the world's most voluminous search engines especially for academic web resources. BASE provides more than 240 million documents from more than 8,000 content providers. You can access the full texts of about 60% of the indexed documents for free (Open Access). BASE is operated by Bielefeld University Library."

https://www.base-search.net/

bnewbold (OP) · 5y ago

Great question!

BASE, SHARE (https://share.osf.io/), and CORE (https://core.ac.uk) all primarily pull metadata via OAI-PMH, though they may also incorporate other sources these days. We have worked with CORE to check content overlap. We have also done our own OAI-PMH bulk scraping and broad preservation crawling, but most of this content has not ended up indexed in fatcat yet.
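For readers unfamiliar with OAI-PMH, a rough sketch of how harvesting works: a client requests `ListRecords` pages from a repository and follows the `resumptionToken` in each response until none is returned. The verb, the `metadataPrefix` and `resumptionToken` arguments, and the XML namespace below come from the OAI-PMH 2.0 specification; the helper names and the example endpoint are hypothetical, and this is not the Internet Archive's crawler.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # resumptionToken is an exclusive argument: only the verb accompanies it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty resumptionToken element marks the final page.
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token
```

A harvester would fetch `list_records_url(base)` first, then keep calling `list_records_url(base, token)` with the token from `parse_page` until it comes back as `None`.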

The main reason that we haven't done bulk imports from OAI-PMH directly or from any of these sources is that we haven't gotten a handle on the metadata quality yet. There are many, many duplicate records out there, and we think it is important to merge these correctly (under "work" entities in fatcat). Until recently we didn't have a mechanism to fuzzy-match new records to prevent duplicate creation. Once we get a policy figured out and polish the de-dupe code, we expect to significantly increase the amount of content in the catalog from these sources.
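To illustrate the kind of fuzzy matching described above, here is a small sketch using only the Python standard library. It is not fatcat's de-dupe code: the record shape, the function names, and the 0.9 threshold are assumptions, and title similarity via `difflib.SequenceMatcher` stands in for whatever matching the real system uses.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(cleaned.split())


def find_fuzzy_match(new_record, existing, threshold=0.9):
    """Return the most similar existing record at or above threshold, else None.

    Records are dicts with a "title" key; threshold is an arbitrary assumption.
    """
    best, best_score = None, threshold
    new_title = normalize(new_record["title"])
    for candidate in existing:
        score = SequenceMatcher(None, new_title, normalize(candidate["title"])).ratio()
        if score >= best_score:
            best, best_score = candidate, score
    return best
```

If a match is found, the incoming record would be attached to the matched record's work instead of creating a new one. This linear scan compares against every existing record, so a catalog of any real size would need a blocking or indexing step first, and title-only matching will produce false positives for generic titles.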

Another thing we haven't figured out yet is accurately tagging OAI-PMH feeds as journals (eg, OJS instances) vs. institutional repositories or subject repositories. That distinction will change how we classify imported records (eg, "published" versions vs. "pre-print" or "manuscript").

text7263 · 5y ago

BASE indexes metadata only, I believe.

masswerk · 5y ago

This is nice! I just managed to find an article I couldn’t find with Google.

Thus, I was able to solve the PDP-1 "Amherst Mystery" [1]: https://www.masswerk.at/nowgobang/2021/pdp1-spotting#update

[1] https://news.ycombinator.com/item?id=26313124

Black101 · 5y ago

Google fails to index so many good sources nowadays... I think it has gotten worse over the last 10 years.

masswerk · 5y ago

See also https://fatcat.wiki (which is, I think, incorporated into Internet Archive Scholar.)

BugsJustFindMe · 5y ago

I couldn't find a list of what sources (like which journals) they're archiving from. Does anyone know where to find that? It would be nice to see what subject categories the archive covers.

bnewbold (OP) · 5y ago

We are mostly not indexing on a journal-by-journal basis, but try to import from large, broad sources. For example, DOI registrars (Crossref, Datacite, J-Stage), DOAJ article and journal metadata (for OA publications), etc. Some field-specific indexes we have imported from include JSTOR early journals subset, PubMed, and dblp.

Some fields/disciplines are probably still systemically under-represented. For example, I bet we are missing a bunch of scholarship on art and history published before 1980. We have a couple ideas up our sleeves which we hope will help with "completeness" across more disciplines.

To answer your question directly, you can search journal names here: https://fatcat.wiki/container/search

And click through to see how many articles we know about, and what we think the preservation status is. Click through again to the "coverage" tab for a more detailed breakdown. (improving the usability and ranking on the journal search results is on our short list)

tkgally · 5y ago

I did a vanity search for my own (modest) academic output and found only one paper, which was published by a journal in Europe. The other papers were all published in either Japan or Korea and don’t appear in your search results.

Two large sources in Japan you might consider trying to mirror are the UTokyo Repository [1] and Researchmap [2], through which many researchers in Japan release PDFs of their own papers. Other Japanese universities probably have archives similar to [1].

If you would like me to contact somebody at [1] who might be able to work with you, please let me know in a reply to this comment. (I helped to arrange the IA’s recent tie-up with the University of Tokyo General Library.)

[1] https://repository.dl.itc.u-tokyo.ac.jp/?lang=english

[2] https://researchmap.jp/?lang=en

tasogare · 5y ago

What are the differences and advantages over Sci-Hub?

kmeisthax · 5y ago

Sci-Hub exists specifically to exfiltrate paywalled research papers; IA Scholar is for open-access papers that have disappeared off the Internet. They do different things.

endisneigh · 5y ago

I'm curious, how does the Internet Archive handle copyright with all of its services?

breck · 5y ago

#endCopyright

capableweb · 5y ago

Internet Archive strikes again! I love Internet Archive, not just for archiving websites but for archiving everything and making it easily accessible. This is another great service that'll help a lot of researchers and hobby-researchers, which is lovely to see.

Don't forget to donate if you also like Internet Archive, they need every penny: https://archive.org/donate/?origin=hn

betamaxthetape · 5y ago

This is amazing. I had a play around with it whilst it was in beta, and was blown away by the variety of papers returned. On a whim I searched for a very obscure topic that I'd previously researched (just for personal interest) using WorldCat / Google Scholar, and to my surprise was presented with several highly relevant papers I'd never come across before that were exactly what I was looking for.

nl · 5y ago

This seems pretty good.

In computer science we are pretty lucky because open access is the norm.

I checked a few well known exceptions, and this seems to find them ok.

"Mastering the game of Go without human knowledge" (DeepMind in Nature): https://scholar.archive.org/search?q=key:work_yqdj7vjbefg7hh...

"Typing candidate answers using type coercion" (IBM Watson special edition, IEEE IBM Systems Journals): https://scholar.archive.org/search?q=key:work_dym4lqay5fcdxo...

nathias · 5y ago

archive.org is really one of the few things still good on the internet. It has been invaluable for my studies; I can't imagine what previous generations, who could only access 5% of sources, were even doing.

8bitsrule · 5y ago

Oh yeah! Tried this on several specific topics I've looked at recently (2 years ago, 7ya, and 150ya) and the results were fast and on the mark. I'll certainly favor using Scholar over IA searches. Congratulations!

sundarurfriend · 5y ago

(OffTopic) All this talk about the logo here made me check the page out, instead of moving on after reading just the comments as I might otherwise have done. Perhaps that's an HN strategy to use to get people to actually click through: add a bikesheddy thing to the page that's likely to be divisive but doesn't require thought. Gives us a cheap way to have an opinion, and thus an incentive to click!

carbocation · 5y ago

Interesting. For my field (cardiovascular genetics), the results weren't really what I was expecting. I think that my expectations probably fit pretty well with a PageRank graph of citations. So my guess is that the "relevancy" is semantic only?
