Show HN: Basilica – word2vec for anything

(basilica.ai)

153 points · hiphipjorge · 7y ago · 77 comments

mlucy · 7y ago · 14 in thread

Hey all,

I did a lot of the ML work for this. Let me know if you have any questions.

The title might be a little ambitious since we only have two embeddings right now, but it really is our goal to have embeddings for everything. You can see some of our upcoming embeddings at https://www.basilica.ai/available-embeddings/.

We basically want to do for these other datatypes what word2vec did for NLP. We want to turn getting good results with images, audio, etc. from a hard research problem into something you can do on your laptop with scikit.

ru999gol · 7y ago

In the case of images I can just take an off-the-shelf pre-trained model like ResNet as a feature extractor, why should I use a cloud service for that? I don't quite get what the benefit of it is. Are your embeddings better? Well prove it then? In terms of transfer learning, fine-tuning convolutional layers would perform way better anyways?

PeterisP · 7y ago

Do you plan to make these (many in the future) embeddings to refer to a single 'semantic vector space' or have each of them be separate?

I.e. do you plan to do the work to align the embeddings of different types of media so that the contents are somewhat similar and e.g. an audio recording of a snippet gets a similar embedding to the equivalent text and a somewhat similar embedding to a picture that's being described?

mlucy · 7y ago

We aren't currently doing this.

In the future I think we'll try to embed into a single space on a best-effort basis, assuming we can find the engineering resources. It will be really hard for some data types, but for the big ones like text/image/audio it isn't that hard, and will probably be valuable to people.

xapata · 7y ago

What's different between an "embedding" and a projection, which I believe is the more standard term for this kind of transformation?

mlucy · 7y ago

"Embedding" is the term I've heard used for this most often. It's definitely the term that seems to dominate in the literature. (Just to pick a random paper off my reading list: https://arxiv.org/abs/1709.03856 .)

In my mind "embedding" carries the connotation that you're moving into a smaller space that's easier to work with, and where things which are similar in some way are near each other.

thanatropism · 7y ago

Projection is a type of embedding. But you can't really describe what LTSA, UMAP, etc. do as projection. LTSA "unrolls" data rather than projecting it.

sjg007 · 7y ago

Embedding is the ML term for a non-linear projection.

farza · 7y ago

Hi there Lucy!

So, I don't know a ton about Word2Vec, which probably doesn't help, but I do understand that it makes tasks like NLP much easier, since you're going from this massive space (the English language) into a smaller embedding that you learn.

That being said, how are you embedding images? Is it based on how similar they are, if so what does "similarity" mean? Also, what dataset was leveraged?

Any more info on how you do this task would be awesome :).

mlucy · 7y ago

Hey!

We're embedding images by feeding them through a deep neural net and using the activations of an intermediate layer as an embedding.

You can read https://arxiv.org/abs/1403.6382 to learn more about this technique if you're interested.

Our launch model is trained on ImageNet, which has enough variety that it usually generalizes well even when your dataset is very dissimilar to the input distribution. We're planning to train on a wider variety of image data in the future, but we wanted to get something into people's hands quickly.
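
As a rough illustration of the technique described above (not Basilica's actual model), the sketch below builds a tiny feed-forward net with random weights and reads out the activations of a middle layer as the embedding. The layer sizes, the `forward` helper and its `stop_after` parameter are all made up for the example; a real system would use a trained network such as an ImageNet classifier.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def dense(weights, bias, v):
    # weights holds one row per output unit
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def random_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
# Hypothetical net: 8 inputs -> 6 hidden -> 4 hidden -> 3 class scores.
# The weights are random stand-ins for trained ones.
layers = [random_layer(8, 6, rng), random_layer(6, 4, rng), random_layer(4, 3, rng)]

def forward(x, stop_after=None):
    """Run x through the net; if stop_after is set, return that layer's activations."""
    h = x
    for i, (w, b) in enumerate(layers):
        h = dense(w, b, h)
        if i < len(layers) - 1:
            h = relu(h)  # no ReLU on the final score layer
        if stop_after is not None and i == stop_after:
            return h
    return h

pixels = [rng.random() for _ in range(8)]   # stand-in for a flattened image
embedding = forward(pixels, stop_after=1)   # activations of the second layer
scores = forward(pixels)                    # final classification outputs
```

The classifier head (`scores`) is only needed during training; at embedding time the forward pass simply stops early.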

thanatropism · 7y ago

Isn't the point of word2vec that embeddings are semantically meaningful vectors?

mlucy · 7y ago

Definitely!

In particular, semantically similar words are close to each other after embedding, so the space ends up with semantically meaningful clusters.

Our embeddings have the same property. If you embed two similar images, they'll end up closer to each other than two dissimilar images. (Where "similar" depends on the training details, but that's true for word2vec as well.)

Fireflite · 7y ago

Most other word embeddings have hundreds of dimensions, not thousands. Are you able to hint at what causes this difference? Do you see better downstream task performance?

mlucy · 7y ago

It depends on the task.

If you're doing clustering or instance retrieval, you probably want to PCA the number of dimensions down to 200 or so. (In fact, we do this in the tutorial at https://www.basilica.ai/tutorials/how-to-train-an-image-mode... .)

If you're training a big regression, you'll probably get better results with the larger embedding.

We decided to err on the side of making the embeddings too big, because it's very easy to reduce the number of dimensions on the user's end, and impossible to increase it.
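
Reducing dimensions on the user's end could look like the sketch below: a small pure-Python PCA using power iteration with deflation. The `pca` and `project` helpers are illustrative names and the toy data is 3-dimensional; for real embeddings with thousands of dimensions you would almost certainly use a library implementation (for example scikit-learn's `PCA`) instead.

```python
import random

def pca(rows, k, iters=200):
    """Return (mean, top-k principal directions) for a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # d x d covariance matrix of the centered data
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)] for a in range(d)]
    rng = random.Random(0)
    components = []
    for _ in range(k):  # assumes the data has at least k directions of variance
        v = [rng.uniform(-1, 1) for _ in range(d)]
        for _ in range(iters):
            # Power iteration: apply cov, then renormalize
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        components.append(v)
        # Deflate so the next pass finds the next-largest direction
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)] for a in range(d)]
    return mean, components

def project(row, mean, components):
    centered = [x - m for x, m in zip(row, mean)]
    return [sum(c * x for c, x in zip(comp, centered)) for comp in components]

# Toy "embeddings" that all lie along the direction (1, 2, 0)
rows = [[float(t), 2.0 * t, 0.0] for t in range(-5, 6)]
mean, components = pca(rows, k=1)
reduced = project([1.0, 2.0, 0.0], mean, components)  # one coordinate instead of three
```

Going the other way is the problem mentioned above: once the discarded directions are gone, no post-processing brings them back.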

itronitron · 7y ago

Can you give a brief history of the use of the word 'embedding'?

aaaaaaaaaab · 7y ago · 9 in thread

>Job Candidate Clustering

>Basilica lets you easily cluster job candidates by the text of their resumes. A number of additional features for this category are on our roadmap, including a source code embedding that will let you cluster candidates by what kind of code they write.

Wonderful! We were in dire need of yet another black-box criterion based on which employers can reject candidates.

“We’re sorry to inform you that we choose not to go on with your application. You see, for this position we’re looking for someone with a different embedding.”

mlucy · 7y ago

For what it's worth, people are doing job candidate clustering anyway right now. It's just that most people are doing it with keyword search or something.

Doing it with embeddings instead would probably increase the quality of the clustering at the cost of some interpretability. (I.e., you wouldn't be able to say "We didn't show your resume to this employer because it didn't contain both the word 'Java' and the word 'Agile'", but maybe that's a good thing.)

It's sort of a hard philosophical question how much you care about transparency/interpretability vs. quality, especially for socially important tasks like hiring.

wutbrodo · 7y ago

Right, it's a little absurd to complain about flaws in a potential hiring filter without realizing how incredibly flawed current hiring is, relative to some unrealized ideal.

panarky · 7y ago

word2vec:

    king
  - man
  + woman
  --------
  = queen

Basilica:

    resumes of candidates
  - resumes of employees you fired
  + resumes of employees you promoted
  ---------------------------------------
  = resumes of candidates you should hire
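
For readers who haven't seen the word2vec arithmetic before, here is a minimal sketch of how "king - man + woman" is evaluated: add and subtract the vectors, then pick the nearest remaining word by cosine similarity. The 2-d vectors are invented for the example (axis 0 loosely "royalty", axis 1 loosely "gender") and are not real word2vec output.

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Made-up toy vectors, purely for illustration
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}

def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")  # "queen" with these toy vectors
```

Whatever regularities the training data contains, wanted or not, are what this arithmetic surfaces.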

yongjik · 7y ago

Unless you have thousands of fired and promoted employees, it may easily end up "Sorry, most of our promoted employees are Indians, because they were our founder's close friends and joined earlier, and that one guy fired for flirting with a customer was the only one from UIUC. Your name doesn't sound sufficiently Indian and you graduated from UIUC. Bye."

Worse, the person who looks at the rejection decision may have no idea that it boils down to this.

kvb · 7y ago

word2vec[0]:

      computer programmer
    - man
    + woman
    ---------------------
    = homemaker

Basilica?

[0] - https://arxiv.org/pdf/1607.06520.pdf

lifeisstillgood · 7y ago

The second one being (I assume this is your point) merely a way to copy all of your existing biases, but without being able to see them.

E.g., if you fire all the black people and don't promote women, guess what resumes the Artificial Intelligence will send you.

mandeepj · 7y ago

I don’t think this approach will give you a good signal.

For the most part, people don’t get fired because of their skills. They get fired for lacking in execution or behavior. Someone screws up a production deployment or makes lewd comments about a coworker. This is hard to spot in a resume.

dang · 7y ago

This breaks the Show HN guidelines (https://news.ycombinator.com/showhn.html) as well as the HN guidelines (https://news.ycombinator.com/newsguidelines.html), which ask you not to post shallow dismissals, especially of others' work. That's particularly important in Show HN threads. We don't want a culture where the reflex is to be a jerk and kick things.

That doesn't mean you can't raise concerns. Someone else raised the same concern that you did in a fine way: https://news.ycombinator.com/item?id=18348005. When in doubt, emulate them and ask a simple question.

varjag · 7y ago

"Your CV must have fallen between the cracks in multidimensional vector space"

projectramo · 7y ago · 3 in thread

What is the use case for this? (And this is a general point for AI cloud APIs)

Specifically, I am trying to think of an example where the user cares about a vector representation of something, but doesn't care about how that vector representation was obtained.

I can think of why it would be useful: the ML examples given, or perhaps a compression application.

However, in each of these cases, it would seem that the user has the skill to spin up their own, and a lot of motivation to do so and understand it.

mlucy · 7y ago

Apologies for the long answer, but this touches on a lot of interesting points:

1. Transfer learning / data volume. If you have a small image dataset, embedding it using an embedding trained on a much larger image dataset is really really helpful. In our tutorial (https://www.basilica.ai/tutorials/how-to-train-an-image-mode...), we get good results with only 2-3k animal pictures, which is only possible because of the transfer learning aspect of embeddings.

You could do transfer learning yourself, if you have the time and expertise. And for a domain like images, it's really easy to find big public datasets. But long-term we're hoping to have embeddings for a lot of areas where there aren't good public datasets, and pool data from all our customers to produce a better embedding than any of them could alone.

2. Ease of Use. You can take a Basilica image embedding, feed it into a linear regression running on your laptop CPU, and get really good results. To get equally good results on your own, you'd need to run tensorflow on GPUs. This is harder than it sounds for a lot of people.

3. Exploration. Because of the other two points, if you have a thought like "huh, I wonder if including these images would improve our pricing model", you can whip up some code and train a model in a few minutes to check. Maybe if it's a big model you go grab lunch while it trains.

If you're doing everything from scratch in tensorflow, it can take days to try the same thing. This activation energy reduces the amount of experimentation people do. It's bad for the same reasons having a multi-day compile/test loop would be bad.
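
To make the "linear model on your laptop CPU" point concrete, here is a minimal sketch that fits a logistic regression to precomputed embedding vectors with plain gradient descent. The 3-d vectors and labels are invented stand-ins; real embeddings would come from an embedding model or service, and in practice you would likely use scikit-learn rather than the hand-rolled `train_logreg` shown here.

```python
import math

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Made-up 3-d "embeddings" with labels (1 = cat, 0 = dog)
X = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.1, 0.9, 0.3], [0.2, 0.8, 0.2]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
predictions = [predict(w, b, x) for x in X]
```

The heavy lifting (the deep network) happened when the embeddings were produced; the model trained here has only four parameters.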

projectramo · 7y ago

Hi mlucy,

I agree with what you're saying here. I just wonder how it would work in practice.

So imagine I have this monster text or image, and I want to know if it looks like another text or image.

I send each to Basilica, it gives me back two vectors and I compare the vectors.

I use the cosine of the vectors as a similarity score, and let's say it comes out to be 0.6.

However, I think this is too low, and I want to tweak my algorithm.

At this point, doesn't the question of how the vector was generated come to the front? Did you get rid of common words, how did you treat stems, and so on? Or what biases did you introduce in training?

Furthermore, these questions come up right away, and they seem fundamental to whatever the main practice is.

In other words, can I even experiment or start without knowing how the word2vec works?
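
The comparison step described above (two vectors in, one cosine score out) is only a few lines. This is a generic sketch; the two vectors are made up and much shorter than a real embedding.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vec_a = [3.0, 4.0, 0.0]
vec_b = [3.0, 0.0, 4.0]
score = cosine_similarity(vec_a, vec_b)  # 9 / (5 * 5) = 0.36
```

The score depends only on the two vectors, so any tuning has to happen either upstream (how the vectors were produced) or downstream (what threshold or model consumes the score).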

hiphipjorge (OP) · 7y ago

Hi, Jorge from Basilica here.

Our bet is that the simplicity of using Basilica will provide a much easier experience that doesn't require complex infrastructure and training and will provide very good results thanks to transfer learning. The amount of data needed for this is also much smaller than if you were training a model from scratch.

piccolbo · 7y ago · 2 in thread

You quote a target of 200ms per embedding; not sure if that's for one type of embedding in particular. I am using InferSent (a sentence embedding from FAIR, https://github.com/facebookresearch/InferSent) for filtering, and they quote a number of 1000 sentences per second on a generic GPU. That's 200 times faster than your number, but it is a local API, so I am comparing apples to oranges. Yet it's hard to imagine you are spending 1ms embedding and 199 on API overhead. I am sure I have missed a 0 here or there, but I don't see where, other than that theirs is a batch number (batch size 128) and maybe yours is a single-embedding number. Can you please clarify? Thanks
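
The arithmetic behind the "200 times" figure, spelled out. The last two lines are a hypothetical, assuming one 200ms round trip could carry a full batch of 128 at the same latency, which is not something stated in the thread.

```python
basilica_latency_s = 0.200      # quoted target: 200 ms per embedding
infersent_throughput = 1000     # quoted: sentences per second on a GPU

# If requests are strictly sequential, 200 ms each means 5 embeddings per second
basilica_throughput = 1 / basilica_latency_s
speedup = infersent_throughput / basilica_throughput   # 200x

# Hypothetical: a whole batch per round trip at unchanged latency
batch_size = 128
batched_throughput = batch_size / basilica_latency_s   # about 640 per second
```

Under that assumption the gap shrinks from 200x to under 2x, which is why per-call latency and batch throughput are not directly comparable.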

piccolbo · 7y ago

So I am going to answer it myself. On batched data, it's a lot faster than 200ms per embedding and I'd say on a par with InferSent. On the other hand, I couldn't get statistical performance in the same ballpark as InferSent and I had to backtrack. This was training a logistic regression on the embeddings to filter some text streams according to my preferences. If I had, I would have preferred Basilica, as InferSent is py2 only, hard to install and distribute, and a battery killer on my laptop. Its vectors are also 4x bigger. I experienced some server errors and the team at Basilica was very responsive and fixed them; very pleased with the interaction. It would be important IMHO to publish some benchmark results for these embeddings, as is usually done in the universal embedding literature, or serve published embeddings with known performance when licensing terms are favorable.

piccolbo · 7y ago

Another update: on their v2 sentence embedding, Basilica is ahead of InferSent for my task. Well done, Basilica!

captn3m0 · 7y ago · 2 in thread

Do you think board game states might be a good target later?

mlucy · 7y ago

Sort of depends on how late "later" is.

In the very-long term, I want us to literally have embeddings for everything people want to embed, which will probably include the states of popular boardgames.

I'm not sure how we'll get there. Maybe we'll have community embeddings, or an embedding marketplace, or we'll abstract away the process of creating an embedding so well that we can create simple embeddings just by pointing our code at a dataset. But I'd like to get there eventually.

In the less-very-long term, we're still focusing on embeddings that are either very general and useful across a lot of domains (e.g. images, text), or embeddings that have clear and immediate business value (e.g. resumes), since running GPUs is expensive.

EmilStenstrom · 7y ago

Probably not, board game states are different for different games, so I doubt this will be a big enough niche.

asdfghjl · 7y ago · 2 in thread

How are you embedding images?

mlucy · 7y ago

We're feeding them through a deep neural network and using the activations of an intermediate layer as an embedding.

You can read more about this technique in https://arxiv.org/abs/1403.6382 if you're interested.

PeterisP · 7y ago

For most purposes, taking a decent ImageNet model and ripping off a couple final layers works reasonably well.

mathena · 7y ago · 2 in thread

Am I really missing something here, or is this thing complete nonsense with no actual use cases whatsoever in practice?

There are a number of off-the-shelf models that would give you image/sentence embeddings easily. Anyone with sufficient understanding of embeddings/word2vec would have no trouble training an embedding that is catered to the specific application, with much better quality.

For NLP applications, the corpus quality dictates the quality of embedding if you use simple W2V. Word2Vec trained on the Google News corpus is not gonna be useful for a chatbot, for instance. Different models also give different quality of embedding. As an example, if you use Google BERT (a bidirectional Transformer) then you would get world-class performance in many NLP applications.

The embedding is so model/application specific that I don't see how a generic embedding would be useful in serious applications. Training a model these days is so easy to do. Calling the TensorFlow API is probably easier than calling the Basilica API in 99% of cases.

I'd be curious if the embedding is "aligned", in the sense that an embedding of the word "cat" is close to the embedding of a picture of a cat. I think that would be interesting and useful. I don't see how Basilica solves that problem by taking the top layers off ResNet though.

I appreciate the developer API etc, but as an ML practitioner this feels like a troll.

vladf · 7y ago

> Calling the TensorFlow API is probably easier than calling the Basilica API in 99% of cases.

Maybe, but training/curating data appropriate for your application isn't. It's not in that state right now but I think this service could save you some time if they had a domain-relevant embedding ready to roll for your application and it performed decently well -- that would save you a lot of time gathering training data and help you focus on the "business logic" ML needs that accept the embeddings as input.

That said, they'd need to be more performant than, say, GloVe 2B, which I can get for free off of torchtext, meaning they have to do the domain-specific heavy-lifting.

mlucy · 7y ago

Hi there :)

Apologies for the super long response, but you had a lot of points.

> Am I really missing something here, or is this thing complete nonsense with no actual use cases whatsoever in practice?

Hopefully you're missing something, or we've been wasting a great deal of our time ;)

> There are a number of off-the-shelf models that would give you image/sentence embeddings easily. Anyone with sufficient understanding of embeddings/word2vec would have no trouble training an embedding that is catered to the specific application, with much better quality.

For images and text, it's definitely true that you can train your own embeddings with an off-the-shelf model. But I think it's more likely that we end up in a place where a small number of people train a bunch of really good models and everyone else uses them.

I think this for three reasons:

1. It's what we've seen with word2vec. The vast majority of people that use word2vec aren't training it themselves, they're downloading pretrained weights.

2. Most people don't have enough data to train a good embedding themselves. There are good public datasets for images and text, but we're planning to produce embeddings for more niche verticals too.

Keep in mind that modern deep neural nets are very data hungry, and the problem gets worse every year. In a few years I think we're going to be in a spot where getting state of the art performance requires a lot of compute, and more data than most people have access to.

3. Prebuilt embeddings drastically speed up development. If you have a traditional model, and you think feeding some images into it might improve it, you can test that hypothesis in twenty minutes with Basilica. We've talked to a lot of teams that have high-dimensional data lying around which they think might improve their models, but they aren't sure, and they can't really justify a week or two of someone's time to explore it.

> For NLP applications, the corpus quality dictates the quality of embedding if you use simple W2V. Word2Vec trained on the Google News corpus is not gonna be useful for a chatbot, for instance. Different models also give different quality of embedding. As an example, if you use Google BERT (a bidirectional Transformer) then you would get world-class performance in many NLP applications.

> The embedding is so model/application specific that I don't see how a generic embedding would be useful in serious applications. Training a model these days is so easy to do. Calling the TensorFlow API is probably easier than calling the Basilica API in 99% of cases.

It's definitely true that you usually want your input distribution to be reasonably close to the distribution the embedding was trained on. (Although it's worth noting that having a different distribution for your embedding acts as a form of regularization, and sometimes that matters more than the problems you get from the distributional shift.)

I think you're overstating the case though. An embedding trained on a wide variety of sources will perform really well on a lot of tasks, and often other things, like the amount of data you trained on, matter more than distributional similarity.

You may find https://research.fb.com/wp-content/uploads/2018/05/exploring... interesting, especially the end of section 3.1.2. The paper trains a giant network on billions of Instagram images, and then explores both fine-tuning it on ImageNet and using the features of the last layer as inputs to a logistic regression (which they call "feature transfer" rather than "embedding").

The logistic regression trained on the Instagram features gets 83.6% top-1 accuracy, compared to 85.4% for full network fine-tuning and 80.9% for a ResNeXt model trained directly on ImageNet.

In other words, the effect of the larger training set dominated the distributional shift.

ASpring · 7y ago · 1 in thread

How do you plan to counter the harmful societal biases that embeddings embody?

See Bolukbasi (https://arxiv.org/pdf/1607.06520.pdf) and Caliskan (http://science.sciencemag.org/content/356/6334/183)

While these examples are solely language based, it is easy to imagine the transfer to other domains.

hiphipjorge (OP) · 7y ago

Hi. Jorge from Basilica here.

We don't have any concrete plans to tackle this right now but it is something we're definitely mindful of. Thanks for the links! We'll be sure to go through them.

gugagore · 7y ago · 1 in thread

Aren't these embeddings task-specific? For example a word2vec embedding is found by letting the embedder participate in a task to predict a word given words around it, on a particular corpus of text.

Sentence embeddings are trained on translation tasks. An embedding that works both for images and sentences is found by training for a picture captioning task.

The point I'm asking about is that there may be many ways to embed a "data type", depending on what you might want to use the embedding for. Someone brought up board game states. You could imagine embedding images of board games directly. That embedding would only contain information about the game state if it was trained for the appropriate task.
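
The word2vec training task mentioned above (predict a word from the words around it, the CBOW variant) starts from pairs like the ones generated below. This sketch only builds the training pairs; the `context_target_pairs` helper and the window size are illustrative choices, and the actual model training is omitted.

```python
def context_target_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs, as in word2vec's CBOW setup."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        pairs.append((left + right, target))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = context_target_pairs(sentence, window=1)
# pairs[1] is (["the", "sat"], "cat"): predict "cat" from its neighbors
```

Swapping the task (translation, captioning, game outcome) changes which pairs are built, and with them what the resulting embedding encodes.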

mlucy · 7y ago

You can definitely improve performance by choosing an embedding closely related to your task. In the future we're hoping to have more embeddings for specialized tasks.

Kind of surprisingly, though, if you get your embedding by training a deep neural network to do a fairly general task -- like denoising autoencoding, or classification with many classes -- it ends up being useful for a wide variety of other tasks. (You get the embedding out of the neural network by taking the activations of an intermediate layer.)

In some sense you'd expect this, since you'd hope that the intermediate layers of the neural network are learning general features -- if they were learning totally nongeneral features, it would be overfitting -- but I found it surprising when I first learned about it.

jdoliner · 7y ago · 1 in thread

How much does this depend on the data type? I.e. do you need people to specify: this is an image, this is a resume, this is an English resume, etc. Could you ever get to a point where you can just feed it general data, not knowing more than that it's 1s and 0s?

mlucy · 7y ago

That's a really interesting idea.

I can't really think of a barrier to this. Detecting the file format is straightforward, and generic image/text/etc. embeddings work surprisingly well. (In fact, you can actually get some generalization gains by training subword text embeddings on corpora in multiple languages.)

If we wanted to be able to use specific embeddings (e.g. photos vs. line art, English vs. German), we could probably do it by running the data through a generic embedding, seeing which cluster of training data it's closest to, and then running it through that specific embedding.

It would be really important in this case to make sure that all the specific embeddings are embedding into the same space, in case people have a mixed dataset, but that's very doable.
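
The routing idea above can be sketched as a nearest-centroid lookup. Everything named here is hypothetical: the centroids, the `specialized` embedders (stubbed as lambdas) and the lookup-table "generic embedder" are placeholders for real models.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical centroids of training-data clusters in the generic embedding space
centroids = {
    "photo":    [0.9, 0.1],
    "line_art": [0.1, 0.9],
}

# Hypothetical specialized embedders; real ones would be separate models
specialized = {
    "photo":    lambda item: ("photo-embedding", item),
    "line_art": lambda item: ("line-art-embedding", item),
}

def route(item, generic_embed):
    """Embed generically, pick the nearest cluster, then use that cluster's embedder."""
    g = generic_embed(item)
    cluster = min(centroids, key=lambda name: squared_distance(centroids[name], g))
    return cluster, specialized[cluster](item)

# Stand-in generic embedder: a lookup table instead of a model
fake_generic = {"sketch.png": [0.2, 0.8], "beach.jpg": [0.8, 0.3]}.get
cluster, result = route("sketch.png", fake_generic)
```

Each item costs two forward passes in this scheme (generic, then specialized), which is one of the trade-offs of routing.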

pkaye · 7y ago · 1 in thread

Slightly different topic, but what are some approaches to categorizing webpages? Like, I have 1000s of web links I want to organize with tags. Is there a software technique to group them by related topics?

yorwba · 7y ago

The task is known as document clustering https://en.wikipedia.org/wiki/Document_clustering or topic modeling https://en.wikipedia.org/wiki/Topic_model

Generally, you'll want to extract features (e.g. word counts) and then apply a clustering algorithm to group related documents together. The precise details are the subject of thousands of papers, each one doing things slightly differently.
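
A minimal sketch of that two-step recipe, using word counts as features and a deliberately simple greedy clustering pass (a stand-in for k-means or a topic model, not a recommendation). The page texts and the 0.3 similarity threshold are made up for the example.

```python
from collections import Counter
import math

def bag_of_words(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed document is similar enough."""
    clusters = []  # list of (seed_vector, [document indices])
    for i, doc in enumerate(docs):
        vec = bag_of_words(doc)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))  # no match: start a new cluster
    return [members for _, members in clusters]

pages = [
    "python machine learning tutorial",
    "sourdough bread baking recipe",
    "deep learning with python",
    "bread recipe for beginners",
]
groups = cluster(pages)  # [[0, 2], [1, 3]] for these four texts
```

For real web pages you would first fetch and strip the HTML, and probably weight the counts (e.g. TF-IDF) so that common words don't dominate.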

Lerc · 7y ago · 1 in thread

Is this actually 'for anything'? I see references to sentences and images. If I, for example, wanted to compare audio samples, how would it work?

mlucy · 7y ago

"Word2vec for anything" is where we want to get to. Right now we only support images and text, but you can see the other data types on our roadmap at https://www.basilica.ai/available-embeddings/ .

kolleykibber · 7y ago · 1 in thread

Hi Lucy. Looks great. Do you have any production use cases you can tell us about? Are you a YC company?

mlucy · 7y ago

Thanks!

No production use cases yet. This is the first usable release, and it's the bare minimum we felt we could build before showing it to people.

> Are you a YC company?

We have a YC interview on Friday, so hopefully in a few days I'll be able to say yes.

msla · 7y ago · 1 in thread

So the actual code is closed-source?

hiphipjorge (OP) · 7y ago

Hi, Jorge from Basilica here.

Yes. We intend to run this as a cloud service API for now.

e_ameisen · 7y ago

Interesting idea, but seems to very much fall within the category of something you would often want to build in-house. I always imagined the right level of abstraction was closer to spacy's, a framework that lets you easily embed all the things.

If you are interested in how to build and use embeddings for search and classification yourself, I wrote a completely open source tutorial here: https://blog.insightdatascience.com/the-unreasonable-effecti...

j / k navigate · click thread line to collapse

77 comments

56 comments · 15 top-level

mlucy7y ago· 14 in thread

Hey all,

I did a lot of the ML work for this. Let me know if you have any questions.

ru999gol7y ago

PeterisP7y ago

Do you plan to make these (many in the future) embeddings to refer to a single 'semantic vector space' or have each of them be separate?

mlucy7y ago

We aren't currently doing this.

xapata7y ago

What's different between an "embedding" and a projection, which I believe is the more standard term for this kind of transformation?

mlucy7y ago

In my mind "embedding" carries the connotation that you're moving into a smaller space that's easier to work with, and where things which are similar in some way are near each other.

1 more reply

thanatropism7y ago

Projection is a type of embedding. But you can't really describe what LTSA, UMAP, etc. do as projection. LTSA "unrolls" data rather than projecting it.

1 more reply

sjg0077y ago

Embedding is the ML term for a non-linear projection.

1 more reply

farza7y ago

Hi there Lucy!

That being said, how are you embedding images? Is it based on how similar they are, if so what does "similarity" mean? Also, what dataset was leveraged?

Any more info on how you do this task would be awesome :).

mlucy7y ago

Hey!

We're embedding images by feeding them through a deep neural net and using the activations of an intermediate layer as an embedding.

You can read https://arxiv.org/abs/1403.6382 to learn more about this technique if you're interested.

thanatropism7y ago

Isn't the point of word2vec that embeddings are semantically meaningful vectors?

mlucy7y ago

Definitely!

In particular, semantically similar words are close to each other after embedding, so the space ends up with semantically meaningful clusters.

1 more reply

Fireflite7y ago

Most other word embeddings have hundreds of dimensions, not thousands. Are you able to hint at what causes this difference? Do you see better downstream task performance?

mlucy7y ago

It depends on the task.

If you're training a big regression, you'll probably get better results with the larger embedding.

We decided to err on the side of making the embeddings too big, because it's very easy to reduce the number of dimensions on the user's end, and impossible to increase it.

itronitron7y ago

can you give a brief history on the use of the word 'embedding' ?

aaaaaaaaaab7y ago· 9 in thread

>Job Candidate Clustering

Wonderful! We were in dire need for yet another black-box criteria based on which employers can reject candidates.

“We’re sorry to inform you that we choose not to go on with your application. You see, for this position we’re looking for someone with a different embedding.”

mlucy7y ago

For what it's worth, people are doing job candidate clustering anyway right now. It's just that most people are doing it with keyword search or something.

It's sort of a hard philosophical question how much you care about transparency/interpretability vs. quality, especially for socially important tasks like hiring.

wutbrodo7y ago

Right, it's a little absurd to complain about flaws in a potential hiring filter without realizing how incredibly flawed current hiring is, relative to some unrealized ideal.

1 more reply

panarky7y ago

word2vec:

    king
  - man
  + woman
  --------
  = queen

Basilica:

    resumes of candidates
  - resumes of employees you fired
  + resumes of employees you promoted
  ---------------------------------------
  = resumes of candidates you should hire

yongjik7y ago

Worse, the person who looks at the rejection decision may have no idea that it boils down to this.

kvb7y ago

word2vec[0]:

      computer programmer
    - man
    + woman
    ---------------------
    = homemaker

Basilica?

[0] - https://arxiv.org/pdf/1607.06520.pdf

3 more replies

lifeisstillgood7y ago

The second one being (I assume this is your point) merely a way to copy of all your existing biases, but not be able to see it.

eg If you fire all the black people and don't promote women, guess what resumes Artificial Intelligence will send you

mandeepj7y ago

I don’t think this approach will give you a good signal.

dang7y ago

varjag7y ago

"Your CV must have fell between the cracks in multidimensional vector space"

projectramo7y ago· 3 in thread

What is the use case for this? (And this is a general point for AI cloud APIs)

Specifically, I am trying to think of an example where the user cares about a vector representation of something, but doesn't care about how that vector representation was obtained.

I can think of why it would be useful: the ML examples given, or perhaps a compression application.

However, in each of these cases, it would seem that the user has the skill to spin up their own, and a lot of motivation to do so and understand it.

mlucy7y ago

Apologies for the long answer, but this touches on a lot of interesting points:

projectramo7y ago

Hi mlucy,

I agree with what you're saying here. I just wonder how it would work in practice.

So imagine I have this monster text or image, and I want to know if it looks like another text or image.

I send each to Basilica, it gives me back two vectors and I compare the vectors.

I use the cosine of the vectors as a similarity score, and lets say it comes out to be 0.6.

However, I think this is too low, and I want to tweak my algorithm.

Furthermore, these questions come up right away, and they seem fundamental to whatever the main practice is.

In other words, can I even experiment or start without knowing how word2vec works?
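For reference, the comparison step described in this comment can be sketched in a few lines of plain Python. The two vectors below are placeholder values, not real output from Basilica or any other service, and `cosine_similarity` is a name chosen here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Placeholder embeddings (hypothetical values, not returned by any real API).
vec_a = [3.0, 4.0, 0.0]
vec_b = [4.0, 3.0, 0.0]

score = cosine_similarity(vec_a, vec_b)  # 24 / (5 * 5) = 0.96
```

The score depends only on the direction of the vectors, not their length. What counts as "too low" is a property of the embedding model, so a threshold is probably best calibrated on a handful of pairs you have judged yourself.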


hiphipjorgeOP7y ago

Hi, Jorge from Basilica here.

piccolbo7y ago· 2 in thread

piccolbo7y ago

Another update: on their v2 sentence embedding, Basilica is ahead of InferSent for my task. Well done, Basilica!

captn3m07y ago· 2 in thread

Do you think board game states might be a good target later?

mlucy7y ago

Sort of depends on how late "later" is.

In the very long term, I want us to literally have embeddings for everything people want to embed, which will probably include the states of popular board games.

EmilStenstrom7y ago

Probably not. Board game states are different for different games, so I doubt this will be a big enough niche.

asdfghjl7y ago· 2 in thread

How are you embedding images?

mlucy7y ago

We're feeding them through a deep neural network and using the activations of an intermediate layer as an embedding.

You can read more about this technique in https://arxiv.org/abs/1403.6382 if you're interested.
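To make the idea concrete, here is a minimal sketch of "use an intermediate layer's activations as the embedding". It uses a toy two-layer network with hand-picked weights standing in for a deep pretrained model; the weights, the input, and the function names are all assumptions for illustration, not Basilica's actual model.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]

def matvec(matrix, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]

# Toy weights (hypothetical). A real network would have many more layers,
# with weights learned on a large labeled dataset.
W1 = [[1.0, -1.0, 0.0, 0.5],
      [0.0, 2.0, -1.0, 0.0],
      [-0.5, 0.0, 1.0, 1.0]]   # 4 inputs -> 3 hidden units
W2 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]         # 3 hidden units -> 2 class scores

def forward(x):
    """Run the full network and return (hidden activations, class scores)."""
    hidden = relu(matvec(W1, x))
    logits = matvec(W2, hidden)
    return hidden, logits

def embed(x):
    """Discard the final classification layer and keep the hidden activations."""
    hidden, _ = forward(x)
    return hidden

pixels = [1.0, 0.5, 0.0, 2.0]   # stand-in for a flattened image
embedding = embed(pixels)
```

The class scores are tied to the original training labels, while the hidden activations tend to carry more general features, which is why they are the part that gets reused.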

PeterisP7y ago

For most purposes, taking a decent ImageNet model and ripping off a couple final layers works reasonably well.

mathena7y ago· 2 in thread

Am I really missing something here, or is this thing complete nonsense with no actual use cases whatsoever in practice?

I appreciate the developer API etc, but as an ML practitioner this feels like a troll.

vladf7y ago

> Calling the TensorFlow API is probably easier than calling the Basilica API in 99% of cases.

That said, they'd need to be more performant than, say, GloVe 2B, which I can get for free off of torchtext, meaning they have to do the domain-specific heavy-lifting.

mlucy7y ago

Hi there :)

Apologies for the super long response, but you had a lot of points.

> Am I really missing something here, or is this thing complete nonsense with no actual use cases whatsoever in practice?

Hopefully you're missing something, or we've been wasting a great deal of our time ;)

I think this for three reasons:

1. It's what we've seen with word2vec. The vast majority of people who use word2vec aren't training it themselves; they're downloading pretrained weights.

2. Most people don't have enough data to train a good embedding themselves. There are good public datasets for images and text, but we're planning to produce embeddings for more niche verticals too.

The logistic regression trained on the Instagram features gets 83.6% top-1 accuracy, compared to 85.4% for full network fine-tuning and 80.9% for a ResNeXt model trained directly on ImageNet.

In other words, the effect of the larger training set dominated the distributional shift.
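As an illustration of the "logistic regression on fixed features" setup mentioned above, here is a minimal sketch in plain Python. The feature vectors are made-up stand-ins for precomputed embeddings, and the training loop is ordinary batch gradient descent, not the procedure used in the comparison being quoted.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0

# Hypothetical precomputed embeddings (the embedding model itself stays frozen).
features = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.7]]
labels = [1, 1, 0, 0]

weights, bias = train_logistic_regression(features, labels)
predictions = [predict(weights, bias, x) for x in features]
```

Only the small linear model is trained here; the embeddings are treated as fixed inputs, which is what distinguishes this from fine-tuning the full network.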

ASpring7y ago· 1 in thread

How do you plan to counter the harmful societal biases that embeddings embody?

See Bolukbasi (https://arxiv.org/pdf/1607.06520.pdf) and Caliskan (http://science.sciencemag.org/content/356/6334/183)

While these examples are solely language based, it is easy to imagine the transfer to other domains.

hiphipjorgeOP7y ago

Hi. Jorge from Basilica here.

We don't have any concrete plans to tackle this right now but it is something we're definitely mindful of. Thanks for the links! We'll be sure to go through them.


gugagore7y ago· 1 in thread

Aren't these embeddings task-specific? For example, a word2vec embedding is found by letting the embedder participate in a task to predict a word given the words around it, on a particular corpus of text.

Sentence embeddings are trained on translation tasks. An embedding that works for both images and sentences is found by training on a picture-captioning task.
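The word-prediction task described above corresponds to word2vec's continuous bag-of-words (CBOW) variant. As a rough sketch of what the training examples for such a task look like (the sentence, the window size, and the `cbow_pairs` name are all chosen here for illustration):

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style objective."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip tokens that have no neighbours at all
            pairs.append((context, target))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = cbow_pairs(sentence, window=1)
# First two pairs: (['quick'], 'the') and (['the', 'brown'], 'quick')
```

A model trained to predict each target from its context ends up with word vectors shaped by that corpus and that objective, which is the sense in which the embedding is task-specific.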

mlucy7y ago

You can definitely improve performance by choosing an embedding closely related to your task. In the future we're hoping to have more embeddings for specialized tasks.

jdoliner7y ago· 1 in thread

mlucy7y ago

That's a really interesting idea.

It would be really important in this case to make sure that all the specific embeddings are embedding into the same space, in case people have a mixed dataset, but that's very doable.

pkaye7y ago· 1 in thread

Slightly different topic, but what are some approaches to categorizing webpages? I have thousands of web links I want to organize with tags. Is there a software technique to group them by related topic?

yorwba7y ago

The task is known as document clustering https://en.wikipedia.org/wiki/Document_clustering or topic modeling https://en.wikipedia.org/wiki/Topic_model
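To give a feel for the clustering side, here is a deliberately crude sketch: bag-of-words vectors over page titles, cosine similarity, and a greedy single pass that assigns each title to the first cluster whose seed is similar enough. This is a simplified stand-in, not the TF-IDF plus k-means or LDA pipelines those articles describe; the titles, the threshold, and the function names are made up for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(titles, threshold=0.3):
    """Greedy clustering: join the first cluster whose seed title is similar enough."""
    clusters = []
    for title in titles:
        vec = bag_of_words(title)
        for c in clusters:
            if cosine(vec, c["seed"]) >= threshold:
                c["titles"].append(title)
                break
        else:
            # No existing cluster matched, so this title seeds a new one.
            clusters.append({"seed": vec, "titles": [title]})
    return [c["titles"] for c in clusters]

titles = [
    "python pandas tutorial",
    "pandas dataframe tutorial",
    "sourdough bread recipe",
    "easy bread recipe",
]
groups = cluster(titles)
```

For thousands of real pages you would likely want to work from page text rather than titles and use a library implementation of one of the methods linked above.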

Lerc7y ago· 1 in thread

Is this actually 'for anything'? I see references to sentences and images. If I, for example, wanted to compare audio samples, how would it work?

mlucy7y ago

"Word2vec for anything" is where we want to get to. Right now we only support images and text, but you can see the other data types on our roadmap at https://www.basilica.ai/available-embeddings/ .

kolleykibber7y ago· 1 in thread

Hi Lucy. Looks great. Do you have any production use cases you can tell us about? Are you a YC company?

mlucy7y ago

Thanks!

No production use cases yet. This is the first usable release, and it's the bare minimum we felt we could build before showing it to people.

> Are you a YC company?

We have a YC interview on Friday, so hopefully in a few days I'll be able to say yes.

msla7y ago· 1 in thread

So the actual code is closed-source?

hiphipjorgeOP7y ago

Hi, Jorge from Basilica here.

Yes. We intend to run this as a cloud service API for now.

e_ameisen7y ago
