Kaggle Datasets – Discover and analyze open data

(kaggle.com)

222 points · benhamner · 8y ago · 36 comments

23 comments · 9 top-level

QasimK · 8y ago · 8 in thread

How about you let me download them without creating an account before calling them “public”?

benhamner (OP) · 8y ago

Thanks for the feedback. This is likely a "not quite yet" vs. "never".

Definitely understand the motivations from a user standpoint for not needing to log in to download.

There are some non-obvious benefits we get as a small team by requiring login, in addition to new user growth. Bandwidth for hosting data can be large, and it's easier to reason about and prevent abuse in the context of authenticated users.

We do enable previewing the dataset while logged out, and the preview functionality will become more full-featured.

QasimK · 8y ago

Thanks for your response. I spoke slightly evocatively because this has caused issues for me and hence I didn’t like to see it advertised as “public”.

The biggest issue is that I cannot share my work with others (easily) if I use a dataset from Kaggle. For example, ideally I just want to have a notebook online somewhere which can be instantly downloaded and run by anyone. Having to (automatically) download a dataset is a hindrance, and having to create a Kaggle account first is an outright blocker.

On the other hand using, for example, IPFS or a torrent would be better, because you can reference the dataset using a global identifier and anyone can easily get access to it.

hyperocular · 8y ago

A torrent option would go a long way to offsetting hosting costs on more popular items. Also I appreciate the preview option, it's good to see what is in a dataset before committing to downloading and extracting hundreds of gigabytes.

prepend · 8y ago

Part of the best practices for public data involves unrestricted download - https://project-open-data.cio.gov/principles/

There are certainly benefits to you as a host if you restrict access. But if you have bandwidth concerns, perhaps just link to the source.

As it is now, I could get the impression that you are using open data as a way to drive user growth.

shepardrtc · 8y ago

I disagree with the parent. You've taken the time to organize and host these datasets. The least we can do is create an account to download them.

1 more reply

gertye · 8y ago

AFAIK, Kaggle is a part of Google, therefore is not operated independently by a "small team". You just actively try not to be associated with the behemoth.

1 more reply

mynewtb · 8y ago

> new user growth

That sounds like you are inflating your user counts with garbage accounts.

1 more reply

alxvio · 8y ago

This is highly unreasonable. If I'm hosting large files, you bet I'm going to require the public to create an account so I can protect my quality of service for others. They're not restricting who can create accounts... they're not requiring a payment...

benhamner (OP) · 8y ago · 3 in thread

Our goal with Kaggle Datasets is to provide the best place to publish, collaborate on, and consume public data.

As a data publisher, you have an easy way to publish data online, see how it's used, and interact with the users of the data. You can create the dataset via a simple web interface, and update it through the interface or an API. We automatically version these updates under the hood.

As a data consumer, you can browse the data online and download it (through the web or an API). You can see the code and insights others have generated on the data through Kaggle Kernels (hosted, versioned IPython notebooks that run in Docker containers). You can fork their code to get started on the data, or start coding from scratch on your own analysis. If you find improvements that could be made to the metadata (dataset/file/column-level descriptions), you can make those directly.

We're rapidly iterating on this product and expanding its functionality, and would love any feedback and suggestions.

lgierth · 8y ago

First of all, this looks like a great tool for datasets, thank you.

Do you have plans for adding file hashes to the datasets, e.g. sha256? This would make it a lot easier to integrate with other systems.
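To illustrate the integration point, here is a minimal sketch of how a downstream tool could check a downloaded file against a published SHA-256 digest. The function names and the idea that a digest is listed next to each file are assumptions for illustration, not an existing Kaggle feature.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte datasets do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file at `path` matches the published digest.

    `expected_hex` is the (hypothetical) digest a publisher would list
    alongside the file; the comparison ignores case and whitespace.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

With a digest available, a mirror, cache, or notebook can confirm it has exactly the bytes the publisher released, regardless of where the copy came from, and the digest itself can serve as a stable identifier for that version of the file.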

amrrs · 8y ago

Sorry for the noob question, but could you please explain how adding hashes would help with integration?

1 more reply

benhamner (OP) · 8y ago

Thanks and great point! Added this to our list

Radim · 8y ago · 1 in thread

What happens when the company changes direction? If there's a shift of priorities, an internal restructuring, a "strategic startup pivot", an acquisition?

Not to assume bad faith on Kaggle's part, but we got burned one too many times with private companies pushing their proprietary ("open") platforms for gobbling up data. The "it's free! just create an account — data lock-in — gap after project death/monetization" pattern leaves me a little cynical.

It's awesome that resources like these exist, but I'd be more comfortable paying attention if this were hosted as raw data somewhere (GitHub?), with a clear licensing and access model.

benhamner (OP) · 8y ago

We joined Google via acquisition one year ago, and Kaggle Datasets has grown from 450 datasets to over 13,000 in that timeframe. We are firmly committed to supporting and growing this platform.

1 more reply

cosmic_ape · 8y ago · 1 in thread

It would help if the datasets were categorized by data type: time series, multilabel, etc.

benhamner (OP) · 8y ago

Not all the datasets are ML specific, but hopefully this helps:

- https://www.kaggle.com/tags

- https://www.kaggle.com/tags/linguistics

- https://www.kaggle.com/tags/multiclass-classification

- https://www.kaggle.com/tags/text-data

socksy · 8y ago · 1 in thread

Is there an announcement of some kind of change? Are they still owned by Google? Or is this the thing where sometimes existing solutions will hit the front page of HN? :)

benhamner (OP) · 8y ago

I shared this because another public data portal that I don't think has changed in years ended up at the top of HN. Kaggle Datasets has grown by over an order of magnitude in the past year, and jumps in scale fundamentally change the utility of community products like this.

antirez · 8y ago

This is gold. When I wrote the NeuralRedis module I had so much fun downloading a few random datasets from Kaggle and wrapping them in a few lines of Ruby script to check what the results were in terms of predictions. Normally the data is very high quality, the format well documented, and so forth. However, make sure to check the license for the details, depending on how you plan to use the data.

neuromantik8086 · 8y ago

The Awesome Public Datasets GitHub repo [1] is also a good effort at organizing all of the open data out there that people can play around with.

[1] https://github.com/awesomedata/awesome-public-datasets

metakermit · 8y ago

Wonderful, thanks for sharing this! It's useful that the kernels people have submitted are there as well and that there is a HN-style upvoting mechanism.

As an aside – I'm really curious to explore the datasets with "fake" in the title :)

https://www.kaggle.com/datasets?sortBy=relevance&group=publi...

naushit · 8y ago

Any plans to share the same data/files using IPFS?
