0 points | lacker | 15y ago | 0 comments

The main problem is that you would need a totally different indexing system.

Roughly, search engines work in two phases: retrieval and scoring. Retrieval is when you pick, out of the billions of documents in the index, the top few thousand that could plausibly be worth returning as search results. Scoring is when you look at each of those documents in more detail to figure out the actual top ten.
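To make the two phases concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the function names are made up, the "index" is just a dict of sets, and the score is a plain term count, which is nothing like what a real engine uses.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def retrieve(index, query, limit=1000):
    """Phase 1: cheap candidate selection by intersecting per-word sets."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return []
    candidates = set.intersection(*postings)
    return sorted(candidates)[:limit]

def score(docs, doc_id, query):
    """Phase 2: a toy score, here just the total count of query terms."""
    words = docs[doc_id].lower().split()
    return sum(words.count(word) for word in query.lower().split())

def search(docs, index, query, k=10):
    """Retrieve candidates first, then rank only those candidates."""
    candidates = retrieve(index, query)
    ranked = sorted(candidates, key=lambda d: score(docs, d, query), reverse=True)
    return ranked[:k]
```

The point of the split is that `retrieve` never reads document text, only the index, so the expensive `score` step runs on a small candidate set.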

Scoring based on regular expressions wouldn't be too tough. Retrieval is the killer. Typically retrieval works from "posting lists", which are basically per-word indices listing the documents that contain that word. To retrieve based on regular expressions, you would need posting lists for individual characters or short sequences of characters. That would take a lot more space, because each document would appear in far more lists.
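As a rough illustration of posting lists over short character sequences, here is a sketch that indexes every 3-character sequence (trigram) of each document. The choice of trigrams and all the function names are my assumptions, not something a particular search engine is known to do this way.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_trigram_index(docs):
    """Map each trigram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text.lower()):
            index[gram].add(doc_id)
    return index

def candidates_for_literal(index, literal):
    """Documents that contain every trigram of a literal substring.

    Returns None when the literal is shorter than 3 characters,
    because the index cannot narrow anything down in that case.
    """
    grams = trigrams(literal.lower())
    if not grams:
        return None
    return set.intersection(*(index.get(gram, set()) for gram in grams))

def posting_count(index):
    """Total number of (key, document) entries, a crude proxy for index size."""
    return sum(len(doc_ids) for doc_ids in index.values())
```

Running `posting_count` on a trigram index and on a word index built from the same documents gives a feel for the space cost: a document contributes one entry per distinct trigram instead of one per distinct word. Note that the candidates are only a superset of the true matches, so each one still has to be checked against the actual pattern.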

You might be able to hack together some hybrid that would use existing posting lists, for example by requiring that the regular expression contain a whole word. But pure regular expressions would require a different index. That sort of added complexity is not worth it for the feature.
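A minimal sketch of that hybrid, assuming the caller names the word that every match must contain (extracting it from the pattern automatically is a harder problem that I am skipping). The names `build_word_index`, `hybrid_regex_search`, and `required_word` are hypothetical.

```python
import re
from collections import defaultdict

def build_word_index(docs):
    """Ordinary word-level posting lists: word -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def hybrid_regex_search(docs, index, pattern, required_word):
    """Narrow candidates with one posting list, then run the regex on them.

    Assumption: every match of pattern contains required_word as a whole
    word. If that does not hold, matching documents will be missed.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    candidates = index.get(required_word.lower(), set())
    return sorted(doc_id for doc_id in candidates if regex.search(docs[doc_id]))
```

For example, `hybrid_regex_search(docs, index, r"error code \d+", "error")` only runs the regex over documents that contain the word "error". A pattern with no required word, like `\d+`, gets no help from the word index at all, which is where the different index comes in.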
