Inaugural release of Apache Lucy, Version 0.1.0

(mail-archives.apache.org)

61 points · mthomas · 15y ago · 16 comments

16 comments · 7 top-level

rbrown46 15y ago · 5 in thread

I tried to get into Lucene (using Solr) recently but was put off by its complexity for what was, in my case, a simple use case (searching through a large document set of html, txt, and doc files quickly using proximity search).

After futzing for hours with XSLT and writing scripts to submit content via the REST API, I found out about FTS4 in SQLite, and was impressed by its relative simplicity. I had something working in under an hour in Python.
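A minimal sketch of the kind of FTS4 proximity search described above, using Python's standard `sqlite3` module. It assumes the SQLite build bundled with Python has FTS4 enabled, which is common but not guaranteed. The table name, column names, helper functions, and sample documents are all hypothetical and only illustrate the idea; this is not the commenter's actual code.

```python
import sqlite3

# Hypothetical sample documents as (path, extracted text) pairs.
DOCS = [
    ("a.txt", "a full text search engine library written in C"),
    ("b.html", "search is easy but writing a fast and correct engine is hard"),
    ("c.doc", "notes about databases"),
]


def build_index(docs):
    """Create an in-memory FTS4 table and load (path, body) pairs into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts4(path, body)")
    conn.executemany("INSERT INTO docs (path, body) VALUES (?, ?)", docs)
    return conn


def proximity_search(conn, first, second, distance=5):
    """Return paths of documents whose body has both terms with at most
    `distance` other terms between them (FTS4's NEAR/N operator)."""
    query = f"{first} NEAR/{distance} {second}"
    rows = conn.execute(
        "SELECT path FROM docs WHERE body MATCH ? ORDER BY path", (query,)
    )
    return [path for (path,) in rows]


if __name__ == "__main__":
    conn = build_index(DOCS)
    print(proximity_search(conn, "search", "engine", distance=2))
```

The search terms are interpolated into the FTS query string without escaping, so a real script would need to sanitize terms that contain quotes or FTS operators.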

fizx 15y ago

Wait, what!? You have at least two good options for Solr libraries in Python, neither of which brings you anywhere close to XSLT.

- http://haystacksearch.org/
- http://code.google.com/p/pysolr/

nzadrozny 15y ago

Also, Sunburnt: https://github.com/tow/sunburnt/

ericmoritz 15y ago

This may be beyond the scope of your needs, but Riak Search <http://wiki.basho.com/Riak-Search.html> is pretty easy to use, and is distributed and elastic. As a bonus you have Riak lying around if you need an awesome K/V store.

lepht 15y ago

Seconded. I heartily recommend anyone with even a slight interest in Riak to check out the Riak Fast Track:

http://wiki.basho.com/The-Riak-Fast-Track.html

It's a quick, interesting, and to-the-point intro to Riak, including theory, installation, and usage. The Basho guys have seriously great documentation; I've enjoyed browsing through the Riak wiki despite not really having any live Riak deployments.

notJim 15y ago

I used Solr for a site about a year ago, and I was absolutely shocked at how little documentation there was, and how low quality the existing documentation was. Everything was spread out over some awfully maintained wiki that seemed to have no logical organization. Ugh, what a nightmare.

chuhnk 15y ago · 2 in thread

The announcement says: Lucy is a "loose C" port of Apache Lucene, a search engine library for Java -- it is similar in purpose to Lucene, but designed to take advantage of C's unique capabilities.

I'm wondering what these unique capabilities are. Speed? Smaller memory footprint? And I wonder what the reason is behind doing this. I'm all for C projects, but I'm very curious as to why, when Lucene was already very well done.

nkurz 15y ago

I'm one of the developers, although not currently very active. The main "unique capability" is closer integration to the machine. Our approach has been "The OS is our VM".

We use mmap() heavily, and when running on 64-bit systems take liberal advantage of the giant address space. Using the system to do more of the buffering also allows us to have lightweight processes that can start quickly. We think there will be both speed and memory advantages in the long run. The other main difference is that C is much easier to integrate with other languages than Java. We're starting out with Perl bindings, but have plans for Ruby, Python, Lua, Tcl, and others. The goal is to offer a truly native interface from the language of your choice.

The degree of host language integration is wild. You'll be able to seamlessly subclass just about any part of the C library in any supported language. Nothing is ready beyond C and Perl, but eventually you'll be able to have your indexer in one language, and your customized searchers in a couple more, all sharing the same system cache.

As for why? Marvin, the main developer, started the project as KinoSearch at a time when Lucene wasn't really ready for prime time. He's been very interested in real time indexing, and at the time Lucene didn't handle this well. I got interested because I was looking for something lighter weight than Lucene, where I could try to blend the boundaries between search and database retrieval. Lucene had too many layers of abstraction for my purposes. A parallel might be SQLite and Postgres. Both have their place, but Lucy is more on the SQLite side of things.
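To make the mmap() point in the comment above concrete, here is a small sketch in Python of reading a record out of a memory-mapped file. It only illustrates the general technique of letting the OS page cache do the buffering: the fixed-width record layout, the `RECORD_SIZE` constant, and both helper functions are hypothetical and are not Lucy's actual file format or API.

```python
import mmap

# Hypothetical fixed record width; real index formats are more involved.
RECORD_SIZE = 16


def write_segment(path, records):
    """Write NUL-padded fixed-width records to a file standing in for an
    index segment."""
    with open(path, "wb") as f:
        for rec in records:
            if len(rec) > RECORD_SIZE:
                raise ValueError("record longer than RECORD_SIZE")
            f.write(rec.ljust(RECORD_SIZE, b"\0"))


def read_record(path, n):
    """Map the file read-only and slice out record n.

    Nothing is copied into a user-space buffer up front; the OS pages data
    in on demand, and processes that map the same file share those cached
    pages.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            start = n * RECORD_SIZE
            return m[start:start + RECORD_SIZE].rstrip(b"\0")
```

Because the mapping is created lazily, opening a large file this way is cheap, which fits the comment's point about lightweight processes that start quickly.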

hotdox 15y ago

>I'm wondering what these unique capabilities are.

Easy bindings for every dynamic language. They start with Perl.

z92 15y ago · 1 in thread

How much better is it compared to CLucene? CLucene got stuck at 1.9 and now shows very little activity, while Java Lucene is rolling towards version 4. Still, CLucene was less of a memory hog and faster than Java Lucene while it was active. They claim it was 2.5 times faster.

If Lucy can deliver the latest progress in Java Lucene as a usable C library, that would be very good news for me. Lucene is still the best choice for large data indexing and searching solutions.

nkurz 15y ago

CLucene aimed for binary compatibility with Lucene indexes, and as a result had very little room to innovate. Lucy started out with the same approach, but decided early on that it was better to take the best parts of Lucene's internals while not being bound to all of them.

Some parts will be leading Lucy, and some will be catching up. There's already increasing cross-pollination between the two. It's a very loose port at this point.

toisanji 15y ago · 1 in thread

Why would Apache incubate a competing product like this? And what exactly are the unique capabilities that this project can take advantage of? Lucene is already extremely easy to interface to since it's just a REST interface.

pjscott 15y ago

Solr provides a REST interface. Lucene is a Java library.

endgame 15y ago

When announcing things with cute names, can people please put a short description in the link?

For everyone else: Lucy's apparently a full-text search library written in C targeting dynamic languages, with Perl bindings to start with.

ojosilva 15y ago

Here's the Perl binding library in CPAN, in its simplest form: http://search.cpan.org/perldoc?Lucy::Simple

The synopsis is quite elucidative. Just cpanm installed it and in 10 minutes had a program that indexes and searches a collection of files with highlighting. Looks promising!

powertower 15y ago

clicky http://incubator.apache.org/lucy/
