Definitely understand the motivations from a user standpoint for not needing to log in to download.
There are some non-obvious benefits we get as a small team by requiring login, in addition to new user growth. Bandwidth for hosting data can be large, and it's easier to reason about and prevent abuse in the context of authenticated users.
We do enable previewing the dataset while logged out, and the preview functionality will become more full-featured.
The biggest issue is that I cannot (easily) share my work with others if I use a dataset from Kaggle. For example, ideally I just want to have a notebook online somewhere which can be instantly downloaded and run by anyone. Having to (automatically) download a dataset is a hindrance, and having to create a Kaggle account first is an outright blocker.
On the other hand, using something like IPFS or a torrent would be better, because you can reference the dataset using a global identifier and anyone can easily get access to it.
There are certainly benefits to you as a host if you restrict access. But if you have bandwidth concerns, perhaps just link to the source.
I may be confused, but as it is now, it looks like you are using open data as a way to drive user growth.
That sounds like you are inflating your user counts with garbage accounts.
As a data publisher, you have an easy way to publish data online, see how it's used, and interact with the users of the data. You can create the dataset via a simple web interface, and update it through the interface or an API. We automatically version these updates under the hood.
As a data consumer, you can browse the data online and download it (through the web or an API). You can see the code and insights others have generated on the data through Kaggle Kernels (hosted, versioned IPython notebooks that run in Docker containers). You can fork their code to get started on the data, or start coding from scratch on your own analysis. If you find improvements that could be made to the metadata (dataset/file/column-level descriptions), you can make those directly.
We're rapidly iterating on this product and expanding its functionality, and would love any feedback and suggestions.
Do you have plans for adding file hashes to the datasets, e.g. sha256? This would make it a lot easier to integrate with other systems.
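To make the use case concrete, here is a minimal sketch of how a consumer could check a downloaded file against a published SHA-256 digest. It uses only Python's standard `hashlib`; the function names `sha256_of_file` and `verify_file` are made up for illustration, and nothing here reflects an actual Kaggle API or an existing checksum field.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so large datasets do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Hypothetical helper: compare a local file against a published digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

With a published digest, a notebook or build script could call `verify_file("train.csv", published_digest)` after downloading and refuse to run on a mismatch (both the filename and the variable are placeholders).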
Not to assume bad faith on Kaggle's part, but we got burned one too many times with private companies pushing their proprietary ("open") platforms for gobbling up data. The "it's free! just create an account — data lock-in — gap after project death/monetization" pattern leaves me a little cynical.
It's awesome that resources like these exist, but I'd be more comfortable paying attention if this were hosted as raw data somewhere (GitHub?), with a clear licensing and access model.
- https://www.kaggle.com/tags/linguistics
As an aside – I'm really curious to explore the datasets with "fake" in the title :)
https://www.kaggle.com/datasets?sortBy=relevance&group=publi...