Building a full-text search engine in 150 lines of Python code (2021)

(bart.degoe.de)

84 points · matt_daemon · 1y ago · 21 comments

cocodill · 1y ago · 9 in thread

> return [token for token in tokens if token]

I love that kind of bullshit poetry.

ks2048 · 1y ago

If you're used to this, it's nice and readable.

Or you can do:

    return filter(None, tokens)

Not obvious, but passing None is like passing "lambda x: x" to filter().
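
A quick sketch of the equivalence, assuming a small made-up `tokens` list. One caveat: in Python 3, `filter()` returns a lazy iterator rather than a list, so it needs `list()` to match the comprehension exactly.

```python
# Hypothetical token list with some empty strings, for illustration only.
tokens = ["full", "", "text", "", "search"]

by_comprehension = [token for token in tokens if token]
by_filter_none = list(filter(None, tokens))  # None means "keep the truthy items"

assert by_comprehension == by_filter_none
print(by_comprehension)  # ['full', 'text', 'search']
```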

evnp · 1y ago

Is this any different from `filter(bool, tokens)`? (similar to the JS equivalent mentioned in a sibling)

If not, I'm now curious why the `None` special case was added to `filter`.

sroussey · 1y ago

You can do the same thing in JavaScript with Boolean.

E.g., `arr.filter(Boolean)`

chaos_emergent · 1y ago

Holy shit, I’ve been writing Python for 15 years and it’s the first time I’ve seen None used as an identity function. That’s nuts! How does it work under the hood? Does filter have a special case for evaluating None?
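
For what it's worth, the Python docs describe `filter(function, iterable)` as equivalent to a generator expression, with `None` handled as a separate case that tests each item's own truthiness. A rough pure-Python sketch of that documented behavior follows; `filter_sketch` is a made-up name, and the real builtin is implemented in C.

```python
def filter_sketch(function, iterable):
    """Approximate the documented behavior of the builtin filter()."""
    if function is None:
        # Special case: no predicate call, just test each item's truthiness.
        return (item for item in iterable if item)
    return (item for item in iterable if function(item))

tokens = ["a", "", "b", None, "c"]
assert list(filter_sketch(None, tokens)) == list(filter(None, tokens))
assert list(filter_sketch(bool, tokens)) == list(filter(bool, tokens))
print(list(filter_sketch(None, tokens)))  # ['a', 'b', 'c']
```

So `filter(None, tokens)` and `filter(bool, tokens)` keep the same items; the `None` form simply skips the extra function call.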

dkga · 1y ago

It grew on me, too. Originally I frowned upon this kind of Python shenanigans, but now I must confess it makes me a bit happier inside whenever I have to type a similar thing.

graemep · 1y ago

I think it's lovely if it's reasonably readable (which this is), but they can be convoluted. I have written Python list comprehensions that I could not read myself the next day, so I am more careful now.

pastage · 1y ago

Yeah, but which is worse? I tried to understand the Haskell way in an article last week [0]:

  pure (n, guard (factor /= n) $> factor)

Which I think is more or less the same as this Python line:

  return [factor for factor in factors if factor and not factor == n]

The article does fancy stuff with memory caches, which I believe is easy to do in Python, but I need to understand the Haskell code better.

[0] Haskell: A Great Procedural Language https://entropicthoughts.com/haskell-procedural-programming#... https://news.ycombinator.com/item?id=42754098

mrkeen · 1y ago

> pure (n, guard (factor /= n) $> factor)

Returns a tuple: Left side is n, right side is (Just factor) if factor is not n, or Nothing if it is.
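
In Python terms that is roughly the following, with `None` standing in for `Nothing` and the bare value standing in for `Just factor`. The function name is hypothetical, and the sketch ignores the effectful context that `pure` wraps the tuple in.

```python
def tag_factor(n, factor):
    # Loose analog of `(n, guard (factor /= n) $> factor)`:
    # keep the factor only when it differs from n, otherwise return None.
    return (n, factor if factor != n else None)

print(tag_factor(12, 3))  # (12, 3)
print(tag_factor(7, 7))   # (7, None)
```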

itishappy · 1y ago

You're being way too clever! The machinery from the article is only needed to deal with IO.

A more direct translation:

    # py
    [token for token in tokens if token]

    -- hs
    filter (not . null) tokens

The full functions:

    # py
    def analyze(text):
      tokens = tokenize(text)
      tokens = lowercase_filter(tokens)
      tokens = stopword_filter(tokens)
      tokens = stem_filter(tokens)
      return [token for token in tokens if token]

    -- hs
    analyze :: String -> [String]
    analyze = filter (not . null) . stem_filter . stopword_filter . lowercase_filter . tokenize
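
The point-free Haskell version can be approximated in Python with a small `compose` helper. This is a sketch only: `compose`, `drop_empty`, and the toy stage functions below are assumptions for illustration, not the article's implementations (there is no real stemmer here).

```python
from functools import reduce

def compose(*functions):
    """Compose right-to-left, like Haskell's (.) operator."""
    return lambda value: reduce(lambda acc, fn: fn(acc), reversed(functions), value)

# Toy stand-ins for the article's pipeline stages.
STOPWORDS = {"the", "a", "of"}

def tokenize(text):
    return text.split(" ")

def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

def stopword_filter(tokens):
    return [token for token in tokens if token not in STOPWORDS]

def drop_empty(tokens):
    return [token for token in tokens if token]

# Reads right-to-left, the same order as the Haskell definition.
analyze = compose(drop_empty, stopword_filter, lowercase_filter, tokenize)

print(analyze("The  Art of Search"))  # ['art', 'search']
```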

jankovicsandras · 1y ago · 3 in thread

This is a good intro to text search. Shameless plug: If you throw in a bit more, ca. 250 SLOC, you can have BM25 search: https://github.com/jankovicsandras/bm25opt
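
For readers who haven't seen BM25, here is a minimal scoring sketch. It is not taken from the linked repo; it assumes the common Lucene-style IDF, the customary defaults `k1=1.5` and `b=0.75`, and documents that are already tokenized into lists of strings.

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Longer-than-average documents are penalized via b.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    ["full", "text", "search"],
    ["python", "search", "engine"],
    ["cooking", "recipes"],
]
scores = bm25_scores(["python", "search"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 1
```

The document matching both query terms ranks first, and the one matching neither scores zero.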

marginalia_nu · 1y ago

You can probably have phrase matching in a hundred lines more, maybe less.

Most of the difficulty in search is dealing with the sheer volume of data. The algorithms themselves are pretty trivial for the most part.
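
As a rough illustration of how little code phrase matching needs, here is a sketch built on a positional inverted index. The function names and the pre-tokenized input are assumptions for the example, not the article's code, and nothing here addresses the data-volume problem.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map term -> {doc_id: [positions]} for a list of tokenized documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in enumerate(documents):
        for position, term in enumerate(tokens):
            index[term][doc_id].append(position)
    return index

def phrase_search(index, phrase):
    """Return sorted ids of documents containing the phrase terms consecutively."""
    if not phrase or any(term not in index for term in phrase):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set.intersection(*(set(index[term]) for term in phrase))
    matches = []
    for doc_id in candidates:
        starts = index[phrase[0]][doc_id]
        # The phrase starts at p if term i appears at position p + i.
        if any(
            all(p + i in index[term][doc_id] for i, term in enumerate(phrase))
            for p in starts
        ):
            matches.append(doc_id)
    return sorted(matches)

docs = [
    ["full", "text", "search"],
    ["search", "full", "text"],
    ["text", "full", "search"],
]
index = build_positional_index(docs)
print(phrase_search(index, ["full", "text"]))  # [0, 1]
```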

benob · 1y ago

Mine: a BM25 implementation I use for teaching (sorry for the French example)

https://gist.github.com/benob/69d48421f88f5dcc2b26a204d3251d...

heresie-dabord · 1y ago

Thanks! But is there any need to apologize for it? The example is in French, that's all!
