fulmicoton 2y ago

What we do is CPU bound and we are not just parsing JSON here.

The largest work we do is building an inverted index. Oversimplified, it is equivalent to this:

  import json
  from collections import defaultdict

  inverted_index = defaultdict(list)
  for (doc_id, doc_json) in enumerate(doc_jsons):
    doc = json.loads(doc_json)
    for (field, field_text) in doc.items():
      # whitespace split stands in for a real tokenizer
      for (position, token) in enumerate(field_text.split()):
        inverted_index[token].append((doc_id, position))

  serialize_in_compressed_way_that_allows_lookup(inverted_index)
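
To make the last step less abstract, here is a minimal sketch of what "compressed in a way that allows lookup" could look like. It assumes delta encoding plus variable-length integers for the doc ids and a plain dict as the term dictionary. The names `serialize`, `lookup`, `encode_postings` and friends are made up for illustration, positions are dropped to keep it short, and this is not a description of tantivy's actual on-disk format.

```python
def encode_varint(n):
    """Encode a non-negative int in 7-bit groups, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data):
    """Decode a concatenation of varints back into a list of ints."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def encode_postings(doc_ids):
    """Delta-encode sorted doc ids, then varint-encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data):
    """Invert encode_postings by summing the gaps."""
    doc_ids, running = [], 0
    for gap in decode_varints(data):
        running += gap
        doc_ids.append(running)
    return doc_ids


def serialize(inverted_index):
    """Return (blob, term_dict); term_dict maps token -> (offset, length)."""
    blob = bytearray()
    term_dict = {}
    for token in sorted(inverted_index):
        # Assumption: keep only distinct doc ids and drop positions.
        doc_ids = sorted({doc_id for doc_id, _position in inverted_index[token]})
        encoded = encode_postings(doc_ids)
        term_dict[token] = (len(blob), len(encoded))
        blob += encoded
    return bytes(blob), term_dict


def lookup(blob, term_dict, token):
    """Decode only the slice of the blob that belongs to one token."""
    if token not in term_dict:
        return []
    offset, length = term_dict[token]
    return decode_postings(blob[offset:offset + length])
```

Small gaps between consecutive doc ids fit in one byte each, which is where the compression comes from. The lookup only touches the bytes for the requested token.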

You can implement it in a couple of hours in the language of your choice to get a proper baseline.
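
As a rough starting point for such a baseline, the sketch below times the in-memory part of the loop above and reports megabytes of JSON indexed per second. `build_index` and `measure_throughput` are hypothetical names, tokenization is a plain whitespace split, and serialization is left out, so a fair comparison would need to add that step.

```python
import json
import time
from collections import defaultdict


def build_index(doc_jsons):
    """Build token -> [(doc_id, position), ...] from JSON documents."""
    inverted_index = defaultdict(list)
    for doc_id, doc_json in enumerate(doc_jsons):
        doc = json.loads(doc_json)
        for field_text in doc.values():
            # Assumption: every field is indexed as whitespace-split text.
            for position, token in enumerate(str(field_text).split()):
                inverted_index[token].append((doc_id, position))
    return inverted_index


def measure_throughput(doc_jsons):
    """Return (index, MB of input JSON indexed per second)."""
    total_bytes = sum(len(d.encode("utf-8")) for d in doc_jsons)
    start = time.perf_counter()
    index = build_index(doc_jsons)
    # Guard against a zero reading on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return index, total_bytes / elapsed / 1e6


if __name__ == "__main__":
    # Synthetic documents, only here to have something to time.
    docs = [
        json.dumps({"title": f"doc {i}", "body": "the quick brown fox " * 50})
        for i in range(10_000)
    ]
    _index, mb_per_s = measure_throughput(docs)
    print(f"{mb_per_s:.1f} MB/s")
```

The number it prints depends heavily on the machine and the documents, so it is only meaningful next to another implementation run on the same input.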

I am sure we can still improve our indexing throughput... but I have never seen any search engine indexing as fast as tantivy.

If someone knows a project I should know of, I'd be genuinely keen on learning from it.

dist1ll 2y ago

I'm curious: what is your frame of reference for the maximum speed of building inverted indices? That is, what maximum throughput would you expect for this type of task, and what is your reasoning for it?
