streetcat1 · 3y ago

So 90% of the data is unstructured, but 99% of the ML use cases are tabular (structured), where tree-based approaches can beat deep learning.

Also, 90% of the unstructured data is unlabeled. Hence, your calculation should be for "labeled" unstructured data. Is that 90%? I would argue that outside big tech, it is 0.1%.

Your competition is not Databricks. Databricks' main use case is tabular data (both in Delta Lake and in ML). That is, Databricks competes with Snowflake. It is trying to be a database, i.e. it is trying to get out of the data lake.

I think your competition is S3 and R2 on the storage side, and transformer-based models (Hugging Face) on the ML side. Correct me if I am wrong, but isn't the whole idea with transformer models that the training has already been done, and you only need a small amount of domain-specific data? I.e. you do not need a lot of storage?


davidbuniat · 3y ago

"...I would argue that outside big tech, this is 0.1%."

Fair point regarding the unlabeled/unstructured data. One could also argue that labeled data isn't going to be a prerequisite forever (see https://ai.facebook.com/blog/the-first-high-performance-self...). We see a very sharp rise in unstructured data use for ML (especially a large spike caused by large generative models like DALL-E 2 and Stable Diffusion). In my opinion, the majority of the novel use cases are outside of big tech, and we also see a trend of "legacy" companies in media, manufacturing, etc. starting to build dedicated ML teams. The industry is still nascent, but it is growing fast. Frankly, we see the pain points we're solving resonate with many more companies than just a year ago.

Agree re Snowflake/Databricks, they are partners rather than competitors. We sit on top of S3/GCS or other blob storage and currently compete with various in-house solutions that ML scientists built themselves. I do see your point regarding large foundational models that would only be fine-tuned at the tail end for various use cases. I believe there will still be companies building foundational models from scratch (currently at 5 billion images) so they can serve more application-specific products, as well as unstructured data generators that partner with those companies, which together create a good enough market for the tool.
