
bnewbold · 5y ago · 0 points

Thank you for the kind words!

We are friendly with Semantic Scholar, and have used their "open corpus" dumps as one of several URL seed lists for crawling in the past. Their search and discovery tech is more sophisticated than ours is likely to be any time soon (https://medium.com/ai2-blog/building-a-better-search-engine-...). We would love to get to the place where groups like AI2, which are primarily research-oriented, could build on an existing open catalog and corpus, and would not need to duplicate the work of crawling, merging catalogs, cleaning metadata, etc. As of today, Microsoft Academic (used by Semantic Scholar) might be a better option.

We want to be thoughtful about ranking signals, and are deeply skeptical of journal impact factor, h-index, and most bibliometrics. "Has this been cited more than a handful of times" seems like a reasonable coarse boost. We hope to include more curated signals, like "won a paper prize", "journal in DOAJ and other reviewed indices", etc.
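A minimal sketch of what such a coarse boost could look like, not a description of how Fatcat actually ranks results. The function name, the threshold of five citations, and the multipliers are all hypothetical values chosen for illustration.

```python
def boosted_score(base_score, citation_count, curated_flags,
                  min_citations=5, citation_boost=1.2, flag_boost=1.1):
    """Apply coarse ranking boosts to a text-relevance score.

    All parameters and default values are illustrative assumptions.
    """
    score = base_score
    # Coarse signal: only "cited more than a handful of times" matters,
    # so 6 citations and 6,000 citations receive the same boost.
    if citation_count > min_citations:
        score *= citation_boost
    # Curated signals, e.g. {"won_paper_prize": True, "journal_in_doaj": False}.
    # Each flag that is set multiplies the score once.
    active_flags = sum(1 for is_set in curated_flags.values() if is_set)
    score *= flag_boost ** active_flags
    return score
```

Because the citation signal is a threshold rather than a count, a heavily cited paper cannot drown out everything else, which fits the skepticism about raw bibliometrics expressed above.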

We have been working on a citation graph; keep an eye out for something about that in the coming months. One cool thing we hope to do with the citation graph is find "missing works" not yet in the catalog (e.g., works that don't have a DOI, especially from the pre-1990 era).
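One way the "missing works" idea could be sketched, assuming each parsed reference has already been reduced to some normalized lookup key. The key format and the `min_mentions` cutoff are hypothetical, and this is not the actual Fatcat pipeline.

```python
from collections import Counter


def find_missing_works(references, catalog_keys, min_mentions=2):
    """Return (key, mention_count) pairs for cited works absent from the catalog.

    references:   iterable of (citing_id, cited_key) pairs, where cited_key is
                  a hypothetical normalized string such as "smith 1975 foo".
    catalog_keys: set of keys for works already in the catalog.
    min_mentions: ignore keys cited fewer times than this, since a single
                  unmatched reference is often just a parsing error.
    """
    counts = Counter(
        cited_key
        for _citing_id, cited_key in references
        if cited_key not in catalog_keys
    )
    # most_common() orders candidates by how often they are cited.
    return [(key, n) for key, n in counts.most_common() if n >= min_mentions]
```

Requiring several independent mentions before treating a key as a candidate is a design choice to reduce noise from badly parsed reference strings.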


4 comments · 2 top-level

curiouscats · 5y ago

I really like what you have done.

One easy improvement: "Showing results 16 — 30 out of 26 results" :-) is shown below the search results...

> Hope to include more curated signals, like "won a paper prize", "journal in DOAJ and other reviewed indices", etc.

This would be a great addition.

bnewbold (OP) · 5y ago

Fixed, thanks!

tomthe · 5y ago

Thank you for this great datasource!

Do you think this is suitable for bibliometric research? (We don't need citation-graphs). We use Scopus and Web of Science, but I really don't like that we are not able to publish helpful datasets that we extract from these databases.

bnewbold (OP) · 5y ago

I think it is in a good place for simple bibliometric queries. The Fatcat Elasticsearch API is open at https://api.fatcat.wiki/fatcat_release/ (behind a proxy to filter "unsafe" requests). That works pretty well for Jupyter-notebook-style experimentation if you are willing to learn the Elasticsearch query DSL for aggregations and things.
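A minimal sketch of the kind of aggregation query this allows, assuming the index exposes the standard Elasticsearch `_search` endpoint under the URL above. The field name `release_year` is an assumption about the index mapping, and the proxy may reject some request shapes, so treat this as a starting point rather than a tested recipe.

```python
import json
import urllib.request

# Base URL from the comment above; the "_search" suffix is the standard
# Elasticsearch endpoint and is assumed to pass through the proxy.
BASE_URL = "https://api.fatcat.wiki/fatcat_release/"


def build_year_histogram_query(query_text, year_field="release_year"):
    """Build a request body: full-text filter plus a per-year terms aggregation."""
    return {
        "size": 0,  # skip individual hits, return only aggregation buckets
        "query": {"query_string": {"query": query_text}},
        "aggs": {"by_year": {"terms": {"field": year_field, "size": 50}}},
    }


def build_request(body, base_url=BASE_URL):
    """Wrap the query body in a POST request to the search endpoint."""
    url = base_url.rstrip("/") + "/_search"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(body):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(body), timeout=30) as resp:
        return json.load(resp)
```

In a notebook, something like `run_query(build_year_histogram_query("coronavirus"))["aggregations"]["by_year"]["buckets"]` would then give per-year counts, provided the assumed field name matches the real mapping.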

I don't think the catalog has high enough metadata quality today for use in published research. There are some glaring errors and omissions when you actually start digging in. On the other hand, almost all bibliographic catalogs seem to have such problems. Fatcat, by being open and having an API, does have the potential to aggregate corrections, fixes, and contributions directly from researchers over time.

A particular missing piece today is that there is almost no categorization or "discipline" metadata of any type. This sort of metadata is more subjective, and the catalog is currently careful to include only factual information. We will likely start collecting metadata at the journal ("container") level and can trickle that down to papers. Aggregating, editing, and curating that metadata in Wikidata first, then importing to Fatcat, might be the best and most sustainable path forward.
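The "trickle down" step could be sketched roughly as follows, assuming each paper record carries an identifier for its journal. The field names `container_id` and `disciplines` are used here for illustration only and are not confirmed schema.

```python
def inherit_disciplines(releases, container_disciplines):
    """Copy journal-level discipline labels onto each paper record.

    releases:              list of dicts, each with an optional "container_id"
                           (field name assumed for illustration).
    container_disciplines: dict mapping container id -> list of labels,
                           e.g. curated in Wikidata and then imported.
    Returns new dicts; the input records are left unmodified.
    """
    enriched = []
    for release in releases:
        record = dict(release)
        labels = container_disciplines.get(release.get("container_id"), [])
        # Papers with no journal, or a journal without labels, get an empty list.
        record["disciplines"] = list(labels)
        enriched.append(record)
    return enriched
```

Keeping the labels on the journal and deriving them for papers at query or index time means a correction made once at the container level reaches every paper in that journal.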
