
minimaxir 2y ago

You can use DBSCAN instead of k-means, but DBSCAN has a worst-case memory complexity of O(n^2), so things can get spicy with large datasets, which is why I opt to use it only for subclusters. k-means also fixes the number of clusters, which is good for visualization sanity.

https://scikit-learn.org/stable/modules/generated/sklearn.cl...
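
A minimal sketch of the k-means-then-DBSCAN idea described above, assuming scikit-learn and NumPy. The function name `two_stage_cluster` and the `n_clusters`, `eps`, and `min_samples` values are illustrative placeholders, not the settings actually used:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans


def two_stage_cluster(embeddings, n_clusters=8, eps=0.5, min_samples=5, seed=0):
    """Hypothetical helper: k-means for a fixed number of top-level clusters,
    then DBSCAN inside each cluster to find subclusters."""
    embeddings = np.asarray(embeddings)

    # Stage 1: k-means gives a fixed, predictable number of clusters.
    top = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)

    # Stage 2: DBSCAN only ever sees one k-means cluster at a time, so its
    # worst-case O(n^2) memory applies to the cluster size, not the full dataset.
    sub = np.full(len(embeddings), -1)
    for c in range(n_clusters):
        idx = np.where(top == c)[0]
        if len(idx) == 0:
            continue
        # Subcluster ids are local to each top-level cluster; -1 means noise.
        sub[idx] = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings[idx])
    return top, sub
```

A document is then identified by the pair `(top[i], sub[i])`. Reasonable values for `eps` depend heavily on the embedding model and distance scale, so treat the defaults here as something to tune.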


2 comments · 1 top-level

Xenoamorphous 2y ago

Isn’t the embedding step much slower than clustering? How many documents are you dealing with?

For a news aggregator I worked on, I disregarded k-means because you have to know the number of clusters in advance, and I think it will assign every document to a cluster, which is bad for the actual outliers in a dataset.

Agglomerative clustering yielded the best results for us. HDBSCAN was promising but did weird things with some docs.
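
For illustration, here is a rough sketch of agglomerative clustering without a preset cluster count, assuming scikit-learn. The `distance_threshold` value, the average linkage, and the choice to relabel single-document clusters as outliers (`-1`) are assumptions made for this example, not the setup from the news aggregator:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def cluster_with_outliers(embeddings, distance_threshold=1.0):
    """Hypothetical helper: cut the merge tree at a distance threshold
    instead of asking for a fixed number of clusters."""
    model = AgglomerativeClustering(
        n_clusters=None,                      # required when distance_threshold is set
        distance_threshold=distance_threshold,
        linkage="average",
    )
    labels = model.fit_predict(np.asarray(embeddings))

    # Assumption for this sketch: a cluster with a single member is an outlier.
    counts = np.bincount(labels)
    return np.where(counts[labels] == 1, -1, labels)
```

Unlike k-means, nothing here forces a far-away document into an existing cluster: it simply never merges below the threshold and ends up alone.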

whakim 2y ago

The embedding step is certainly slower than clustering, but the memory requirements blow up pretty fast when you're doing density-based clustering on a dataset of even, say, 100k embeddings.
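
To put a rough number on that, here is a back-of-the-envelope sketch of the worst case, where every pairwise distance ends up stored. The helper name and the 8-bytes-per-value (float64) default are assumptions; real implementations may store far less depending on parameters such as the neighborhood radius:

```python
def pairwise_matrix_gib(n, bytes_per_value=8):
    """Size in GiB of a dense n x n matrix (hypothetical helper)."""
    return n * n * bytes_per_value / 2**30


# 100k embeddings, float64 distances: about 74.5 GiB in the worst case.
print(f"{pairwise_matrix_gib(100_000):.1f} GiB")

# Halving the value size (float32) only halves the total; the n^2 term dominates.
print(f"{pairwise_matrix_gib(100_000, bytes_per_value=4):.1f} GiB")
```

Doubling the dataset to 200k quadruples the figure, which is why the memory blows up long before the embedding step becomes the bottleneck.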
