Also, 90% of the structured data is unlabeled. Hence, your calculation should be for "labeled" unstructured data. Is that 90%? I would argue that outside big tech, it is 0.1%.
Your competition is not Databricks. Databricks' main use case is tabular data (both in the Delta Lake and in ML), i.e. Databricks competes with Snowflake. It tries to be a database, i.e. it tries to get out of the data lake.
I think that your competition is S3 and R2 on the storage side, and transformer-based models (Hugging Face). Correct me if I am wrong, but isn't the whole idea with transformer models that the training has already been done, and you can use a small amount of domain-specific data? I.e. you do not need a lot of storage?