Maybe everyone should already know this, but I was working with a very smart team and we totally missed this initially. Setting "stored" to false for most fields resulted in a 90% reduction in index size, which means less to fit into RAM.
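For reference, "stored" is a per-field attribute in a classic Solr schema.xml; a minimal sketch (field names are illustrative):

```xml
<!-- schema.xml: stored="false" keeps a field searchable (indexed)
     but drops the original value from the index, shrinking it. -->
<field name="body" type="text_general" indexed="true" stored="false"/>
<!-- Keep stored="true" only for fields you actually need returned, e.g. the ID. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
```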
Websolr's indexes return in under 50ms for queries of average complexity.
The more expensive queries usually involve faceting or sorting a large number of results. For example, say you search GitHub for "while." GitHub used to show language facets, where it would tell you that out of a million results, 200,103 files were in JavaScript, 500,358 files were in C, etc.
The problem with this is that you have to count over a million records on every search! Unlike most search operations, which are IO-bound, the counting can be CPU-bound, so sharding on one box will let you take advantage of multiple cores.
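To make the cost concrete: a facet count is essentially a group-by over every matching document, not just the page you display. A toy sketch in Python (field names made up):

```python
from collections import Counter

def facet_counts(matching_docs, field):
    """Count field values across ALL matches, not just the visible page.
    This is why faceting over a million hits burns CPU even when you
    only render ten results."""
    return Counter(doc[field] for doc in matching_docs)

hits = [
    {"path": "a.js", "language": "JavaScript"},
    {"path": "b.c",  "language": "C"},
    {"path": "c.c",  "language": "C"},
]
counts = facet_counts(hits, "language")
print(counts["C"], counts["JavaScript"])  # 2 1
```

Solr's real faceting is far more optimized (field caches, doc values), but the work still scales with the number of matches, which is the point.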
raccoonone is "sorting on two dimensions, a geo bounding box, four numeric range filters, one datetime range filter, and a categorical range filter." This should put him in CPU-bound territory (in particular because of the sort).
Websolr has customers on sharded plans, but they are usually custom sales cases where we're serving many, many millions of documents. We'll look at adding sharding as an option to our default plans, so that it's more accessible to people like raccoonone. In the meantime, if you send an email to info@onemorecloud.com, we'll try to accommodate use cases like this.
Edit: Other possible optimizations include (1) indexing in the same order you will sort on, if you know it ahead of time, and (2) using Lucene's TimeLimitingCollector.
Here are some that come to mind right now that are very useful:
- Be smart about your commit strategy if you're indexing a lot of documents (commitWithin is great). Use batches too.
- Many times I've seen Solr index documents faster than the database could create them (considering joins, denormalization, etc.). Cache these somewhere so you don't have to recreate the ones that haven't changed.
- Set up and use the Solr caches properly. Think about what you want to warm and when. Take advantage of filter queries (fq) and their cache! It will improve performance quite a bit.
- Don't store what you don't need for search. I personally only use Solr to return IDs of the data. I can usually pull that up easily in batch from the DB / KV store. Beats having to reindex data that was just for show anyway...
- Solr (Lucene really) is memory greedy and picky about the GC type. Make sure that you're sorted out in that respect and you'll enjoy good stability and consistent speed.
- Shards are useful for large datasets, but test first. Some query features aren't available in a sharded environment (YMMV).
- Solr is improving quickly and v4 should include some nice cloud functionality (zookeeper ftw).
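To illustrate the first tip (batches plus commitWithin), here's a minimal stdlib-only sketch; the URL and core name are placeholders, and in practice you'd use a client library like pysolr:

```python
import json
import urllib.request

def batches(docs, size):
    """Split a document list into fixed-size chunks."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]

def index_all(docs, batch_size=500, commit_within_ms=10000,
              url="http://localhost:8983/solr/mycore/update"):
    # commitWithin tells Solr "make these visible within N ms", letting it
    # coalesce many adds into a few commits instead of one commit per batch.
    for batch in batches(docs, batch_size):
        req = urllib.request.Request(
            f"{url}?commitWithin={commit_within_ms}",
            data=json.dumps(batch).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # check the response status in real code
```

The batch size and commitWithin values here are arbitrary; tune them against your own indexing throughput.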
If this sounds interesting, check us out at http://www.searchify.com - We offer true real-time, fast hosted search, without requiring you to learn the innards of Solr or Lucene.
Now if someone could put SenseiDB on the cloud, I'd pay for it...
I'm currently running an index of 96 million documents (393 GB) on a single shard with a response time of 18ms.
If you're comfortable with it, I'd suggest profiling Solr. We found that we were spending more time garbage collecting than expected, and spent some time speeding up and minimizing the impact of it. Most of this was related to our IO, though.
Second, don't use the default settings. Adjust the cache sizes, ramBufferSizeMB, and other settings so they are appropriate for your application.
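For reference, these all live in solrconfig.xml; a sketch with illustrative sizes (not recommendations — tune against your own workload):

```xml
<!-- solrconfig.xml: the filter cache backs fq clauses; queryResultCache
     and documentCache serve repeated queries and stored-field fetches. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
<documentCache class="solr.LRUCache" size="1024" initialSize="1024"/>

<!-- inside <indexConfig>: MB of buffered documents before a segment flush -->
<ramBufferSizeMB>128</ramBufferSizeMB>
```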
I'd also start instrumenting your web application so that you can test removing the query options that may be creating your CPU usage issue. You get a lot of bang for your buck this way, and you may find that some of the options you were using provide no meaningful improvement in search quality. A metric like mean reciprocal rank can go a long way here.
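Mean reciprocal rank is easy to compute offline against a set of queries with known "right" answers, so you can check whether dropping an expensive option actually hurts relevance. A tiny sketch:

```python
def mean_reciprocal_rank(results, relevant):
    """results: one ranked result list per query.
    relevant: the correct item for each query.
    MRR = average of 1/rank of the first correct hit (0 if absent)."""
    total = 0.0
    for ranked, target in zip(results, relevant):
        for rank, item in enumerate(ranked, start=1):
            if item == target:
                total += 1.0 / rank
                break
    return total / len(results)

# Right doc is ranked 1st for query A and 2nd for query B:
print(mean_reciprocal_rank([["a", "b"], ["x", "b"]], ["a", "b"]))  # (1 + 0.5) / 2 = 0.75
```

Run it with each query-option variant against the same judged query set and compare the scores.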
The end result works very well, though it's a real memory hog when you get into the "hundreds" of shards on an individual server.
(rather: why is it useful to explicitly shard vs running one big instance with all of the memory and the same total number of threads? queuing theory would lead me to believe the latter would be better)
Tuning Solr in Near Real Time Search environments: https://vimeo.com/17402451
java -jar start.jar
(I'd expect performance between ES and Solr to be about the same and highly dependent on the underlying lucene engine)
It seems like an interesting product. Advertising here on HN is always tricky: it's a balancing act between self-promotion and restraint. I voted your comment up in the hope that you'll stick around and tell us more about it. But perhaps more on the tech side, and less on the marketing.
Have you published any papers describing your approach? Or white papers with more meaty details? I work on the Apache Lucy project, and am very interested in things that work better than Lucene.
Thanks for voting my comment up. Per your request, here is more information. Let me know if you'd like to move this discussion to private email.
We have two published white papers comparing our engine with Lucene/Solr:
(1) A comparison white paper in the case of traditional keyword search: http://www.bimaple.com/files/Bimaple-Keyword-Search-Comparis... . We have a demonstration site to support instant, fuzzy search on Stack Overflow data (600K question titles as of January 2011): http://demo.bimaple.com/stackoverflow
(2) A comparison white paper in the case of geo-location keyword search: http://www.bimaple.com/files/Bimaple-Map-Search-White-Paper-... . We have a demonstration site to support instant, fuzzy search on 17 million US business listings: http://www.omniplaces.com . It has an iPhone app at: http://itunes.apple.com/us/app/omniplaces/id466162583?mt=8
If you have questions, please feel free to contact me at chenli AT bimaple DOT com. Thank you.