undefined | Better HN

0 pointspleshkov5d ago0 comments

Author here — questions and pushback both welcome.

0 comments

Looking at your experiment code, it seems like the retrieval experiments are done with the reconstructed vectors of dimension D rather than the compressed vectors of dimension d, which doesn't have any direct performance improvements. Later on in the post you indicate that the real advantage is that the residuals are more isotropic and therefore you can quantize the pair (p, V_resid) with less quality degradation, but I don't see any experiments actually verifying that retrieval quality holds up in this setting. Also, it's not quite clear to me how you efficiently compute cosine similarity for vectors encoded in this form. Doesn't the V_resid part of the computation require something significantly more complex than a dot product?

pleshkovOP13h ago

I agree that we don't want to reconstruct the whole vector while retrieval and it makes poly-AE toy-like at the current state non production ready. My main interest here in the just taking more recall pp in closed form. And then think about how to make it fast. In all threads I got a good intermediate thoughts about the topic which may help me to bring to closer to production form

brunosan1d ago

Cool work! Something that worries me with PCA is that it's designed to retain variance, but variance might nor be the right metric for the semantics we want to retrieve. Ditto UMAP/tSNE that retains distances in lower dimensionality... If semantics are mostly encoded as directions on subsets of dimensions, PCA and friends would be too blunt of a tool... I wonder if a better approach would be linear probes or other decoders for a wide range of the concepts one wants to retrieve, and then optimize compression while keeping those retrievals as high as possible... i.e. tune the compressor to the usecase, like MP3 or MPEG do.

dbfclark1d ago

I think your per-axis std normalization is likely doing a big pile of the work —- it’s fairly well-known that “wrong” PCA, setting sigma=Id or just taking a square root, gives better embeddings than the un-normalized version. It would be worth showing a comparison to similarly-normalized PCA I think, if it’s not too hard?

pleshkovOP14h ago

Just checked the normalization point. You were partially right, sqrt-normalization makes the difference x2 less. I'm updating the numbers in the post. Interesting moment. I did a smoke test of poly-AE without whitening, and the result didn't change. I won't mention it in the post cause right now I'm not sure if it's a random effect or really a polynomial lift compensates normalization

pleshkovOP1d ago

Good catch, this is the obvious ablation I should have included. I'll re-run with per-axis normalized PCA as a separate baseline and post numbers in this thread tomorrow. Prior: I expect some of the gap to come from normalization, but not all — the no-improvement results on isotropic datasets (§4) suggest there's structural signal the polynomial cross-terms catch that linear projection structurally can't. But that's a prediction; let me actually run it.

Devilstro2d ago

In the article, you mention this approach requires no search over hyper-parameter, because the method comprises a closed-form solution with "simple" linear algebra. I agree with this, but do you not in think need to tune the L2-regularization strength? That would for me be a hyper-parameter you would need to do a CV over (or similarly).

pleshkovOP1d ago

Fair point — lam is technically a hyperparameter. In practice I used lam=1e-3 (the default in the code) across all four models without tuning, and the gap to PCA is robust enough that small variations don't change the conclusion. So more accurately: "one hyperparameter with a benign default" rather than "no hyperparameters" — you're right I overstated.

yorwba2d ago

You should benchmark the retrieval speed of each method in terms of queries per second. I suspect that the gain in bandwidth you get from slightly better compression will be defeated by decompression being much more expensive.

afxuh2d ago

Cool idea. But it only works when the data never changes. could you make a streaming/incremental version? One that updates the math cheaply when new data arrives, instead of recomputing everything, or does the math fundamentally prevent it?

1 more reply

stephantul2d ago

Really cool! I was investigating PCA on retrieval, thanks for the references.

j / k navigate · click thread line to collapse