Sophia – An embeddable key-value database

(sphia.org)

134 points | isaacb | 12y ago | 56 comments

I know it seems superficial but beautiful docs are one of my most trusted heuristics when I'm considering using a library. If the author cares about the aesthetics of the docs, it often means they care about the aesthetics of the code, which really does matter a lot. We can make ugly things or we can make beautiful things. I really respect people who take the time to make beautiful tools.

VikingCoder12y ago

Was that intended to be a compliment of this project? I had the opposite reaction. I accept that English is often a second language, but this page was off-putting.

Sophia is a modern [should add comma] embeddable key-value database designed for a high [should hyphenate] load environment.

It has a unique architecture that was born as a result of research and rethinking primary alghorithmical [sic, should say "algorithm"] constraints associated with a [sic] getting popular Log-file based data structures, such as LSM-tree [should say "trees"], it's [sic] variations based on Fractional Cascading ideas and a B-Tree. (see architecture) [run-on, meaning unclear]

It is very fast. (see benchmarks)

it [sic] is easy to use. (see documentation)

Implemented as a small C [should probably hyphenate, or just rewrite] written, BSD [should probably hyphenate] licensed library.

jasonwatkinspdx12y ago

It's quite clear English is not the primary language of the author, and I think it's in poor taste to criticize grammatical errors that are clearly sourced in this.

English is not an easy second language to learn as an adult, and technical English doubly so.

dilap12y ago

I don't think you're weighing the "English as a second language" factor nearly strongly enough. While grammar can be a great proxy measure of quality for works by native speakers, it's completely inappropriate for non-native speakers -- the amount of effort required to reach native-level fluency and polish (i.e., at the level of your critique, above) is herculean.

To cite just one example, Redis is widely considered excellent software, but much of its documentation is written in very imperfect English.

rgbrgb12y ago

They should definitely do some editing. The grammatical errors are kind of off-putting, I totally agree. That said, there's a very clear value prop/use case and clear, concise documentation. You may be getting downvoted because your language was rather harsh.

tolitius12y ago

Since you're so well versed in English grammar, and so bothered by missing commas and hyphens, tell me how best to phrase and "proofread" this very comment, or... is that emptiness in one part of your grammatically sharpened brain (you could put a hyphen there, by the way) keeping you from handling the task?

next time (you could start with a capital letter, by the way), before abusing your keyboard, better go take prep courses in Russian or Chinese, or at least in that same English.

it's very easy to leave garbage comments that, by the way, have nothing to do with the topic of conversation, and to call yourself a viking coder. it's much harder to live up to your name and "viking" the architecture, the code, the thoughts, the philosophy of creation.

if the internet and/or Hacker News had a shared trash bin, and you could vote on what gets sent there, you could count on my vote.

hey VikingCoder, I heard you are a fan of grammar? see above? grammar that.

Sami_Lehtinen12y ago

Check out this post: https://news.ycombinator.com/item?id=6314628

I know my spoken and written English is far from perfect, but I can live with it.

StavrosK12y ago

> rethinking primary alghorithmical [sic, should say "algorithm"] constraints

"Algorithm" isn't an adjective. I accept "algorithmical", although it should probably be "algorithmic".

jasonwatkinspdx12y ago

Unfortunately the chosen font renders very poorly (at least for me in chrome).

thezilch12y ago

  document.getElementById("main").style.fontFamily = "sans-serif";

saurik12y ago

Anyone know anything about how this compares in practice to Lightning MDB (which uses a memory-mapped B-tree, I think, and is apparently insanely faster than most of the other embedded key-value stores people normally examine)?

hosay12312y ago

This one copies, and it has no concept of transactions from the looks of it (not even LevelDB-style snapshots)

apendleton12y ago

Having just gone through the exercise of picking an embedded key-value store for a project, some things that would be nice: how does it compare to other things besides LevelDB (which, to be frank, isn't a stellar performer)? In particular, how does it compare to Tokyo/Kyoto Cabinet, Lightning MDB, or SQLite4's LSM? Does it support data compression (either with a single pre-selected algorithm like LevelDB and Snappy, or in pluggable fashion like LSM)? How does it deal with concurrent access by multiple processes?

shepik12y ago

I've gone through picking an embedded key-value store, too.

What really concerns me is why benchmarks are never run against an already-filled database (say 14G, 28G, 60G). "Add 100k random keys into an empty database" is very different from "add 100k random keys into a large database", and that is where the more novel algorithms start to shine.

Yes, the read speed of LevelDB (and, I assume, Sophia) with its fancy SSTs is lower than that of plain old B-trees or hash tables (kctree/kchash), but it is still high enough for most tasks. Write performance of kc* (and B-tree-based libraries in general) is, however, unacceptable, at least on hard drives: even with a reasonably sized database (~90% of RAM) it degrades to a couple of random writes per IOP (so, 200-300 writes per second on a consumer-grade HDD, or up to 1000 on 2x 10k SAS HDDs in RAID-0, if I remember correctly).

It may be reasonable to use kc* on SSD, but I did not test that.
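
The empty-versus-filled comparison described above can be sketched as a small harness. This is only an illustration under stated assumptions: the function names are hypothetical, the store is anything exposing `__setitem__` and `__len__`, and a plain `dict` stands in for a real database, so the timings it prints say nothing about Sophia, LevelDB, or Kyoto Cabinet. The 16-byte keys and 100-byte values mirror the record sizes mentioned elsewhere in this thread.

```python
import os
import time


def time_random_inserts(put, n_keys, key_size=16, value_size=100):
    """Insert n_keys random keys through put(key, value); return elapsed seconds."""
    value = b"v" * value_size
    # Keys are generated before the clock starts so only the inserts are timed.
    keys = [os.urandom(key_size) for _ in range(n_keys)]
    start = time.perf_counter()
    for key in keys:
        put(key, value)
    return time.perf_counter() - start


def compare_empty_vs_prefilled(make_store, prefill_keys, n_keys):
    """Run the same insert workload against an empty and a pre-filled store."""
    empty = make_store()
    t_empty = time_random_inserts(empty.__setitem__, n_keys)

    filled = make_store()
    time_random_inserts(filled.__setitem__, prefill_keys)  # load phase, result ignored
    t_filled = time_random_inserts(filled.__setitem__, n_keys)
    return t_empty, t_filled, len(empty), len(filled)


if __name__ == "__main__":
    # dict is only a stand-in; swap in a wrapper around the store under test.
    t_empty, t_filled, _, _ = compare_empty_vs_prefilled(dict, 200_000, 100_000)
    print(f"empty: {t_empty:.3f}s  prefilled: {t_filled:.3f}s")
```

For an on-disk store the prefill size would need to exceed RAM for the difference to show up, which is the point of the 14G to 60G figures above.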

clumsysmurf12y ago

For Java / Android, I've been using H2's MVStore, which is log structured and uses counted B+-trees. It's nice not having to go through JNI for good performance in Java.

http://www.h2database.com/html/mvstore.html

eropple12y ago

Funny - H2's slowness (either with a standard storage system or with MVStore and a standard key format) is the main reason we're moving back to a hand-rolled data storage system that's specific for our data on Android.

clumsysmurf12y ago

Thomas Mueller, the author of H2 / MVStore, gives some thoughts on H2's performance issues under Android here if you are interested:

https://groups.google.com/forum/#!topic/h2-database/Q8K-nbCh...

scanr12y ago

That looks awesome. Thanks for pointing it out.

dfischer12y ago

Typography is hard to read.

XorNot12y ago

Seconded: that font at that size strains the eyes a fair bit.

Amadou12y ago

It looks great with JavaScript disabled, though maybe the font in the examples was a mite small.

I turned on JavaScript and it looked a lot like a man page.

i_have_to_speak12y ago

Cute website. Some random thoughts:

Concurrency:

- No mention of it. There appear to be spin locks in the source. No multi-threaded tests.

Stability and data safety:

- GitHub has 2 days of history, and 4kLoC of test code. Why should I trust my data to you?

"high load environment":

- So what exactly does it do in a "high load environment"? How do you define "high load" in the first place? CPU load? I/O load from other processes? What shortcomings of the competition under a "high load environment" are you trying to overcome?

Backup:

- How do I do hot backup?

Benchmark:

- LevelDB is not a fair comparison as it offers additional non-trivial functionality (snapshots) that cannot be built up on top of Sophia. LevelDB APIs are also safe for concurrent use, which adds overhead. Kyoto Cabinet would have been more suitable as a peer to benchmark with.

- 3 million records with 16-byte keys and 100-byte values is not really an interesting benchmark dataset.

- Iteration over a static database is not interesting, either. Is there any alternative other than locking an entire mutating database for the duration of iteration?

pwpwp12y ago

I, for one, wouldn't trust my data to a library by somebody who uses the same text decoration for hyperlinks as for plain text.

msvan12y ago

Computer scientists aren't known for their design chops. I'd take that as a sign of authenticity.

FraaJad12y ago

REAL computer scientists do not use CSS. Bonus points if they use FONT tags (in caps of course!).

oscargrouch12y ago

Really guys, can you give more constructive comments, or at least ones not based on bullshit assumptions? If not, just shut up.

This is a non-trivial effort, and all people do is complain about the font face or whether the punctuation was right?

First, in the benchmarks it just crushes LevelDB; that is already a great achievement by itself. Can you refute the benchmarks? Did you run one yourself with a different configuration? No?

Second, if you are not a database expert and can't offer a proper critique (constructive or not), just keep it to yourself. I wonder how so many people come up with all these conclusions so fast, without a proper look at the source code and a reasonable amount of time to know what they are talking about.

It's very hard to create things like this, but very easy to criticize without any background. Don't forget that.

If you have something to say about a small thing that has no direct relation to the product itself, and there is already one comment about it, that's enough! Do not spam by answering it or creating new comments about it; this is just so rude and disrespectful.

Really, things are getting creepy on HN, and it's not only in this thread.

VikingCoder12y ago

You: Making the documentation readable and easy to parse adds no value to projects! Everyone who disagrees should shut up.

If I'm being kind to you, HN commenters (myself included) should do a better job of commenting politely, and spend more effort making sure their criticism comes off as constructive rather than just whining and aggressive... but I think you make it sound like criticism of anything outside of the source code itself is creepy, rude, and disrespectful.

Negitivefrags12y ago

It says that the benchmark source is on github, but I can't find it.

It doesn't appear to be in their primary repo.

I would like to try and do my own test against another embedded data store like Berkeley DB but I want to know more about the conditions on the test. How many threads were used, that kind of thing.

the112y ago

https://github.com/pmwkaa/sophia_benchmark

Goopplesoft12y ago

Very cool. As a suggestion, increase the link size under the main title; it wasn't clear to me at first what the next step was after reading the introduction text.

laichzeit012y ago

In case any of the devs read this:

1. Can multiple processes use the same database concurrently? (Separate address space processes, not fork()'d)

2. Have you tested this with uClibc and a cross compiler? (I would like to use it on a MIPS embedded router)

The reason I ask is that I recently had the displeasure of having to hack a non-volatile RAM library to be thread-safe and work with shared memory, and something small like this would be a perfect replacement with a lot less pain.

dkhenry12y ago

Was this man's computer use being charged by the keystroke? I mean, I understand using a few abbreviations here and there, but i.c? At least name your files descriptively.

I would avoid using this for realzies, if only for the fact that if something broke, trying to fix it in that code base would be prohibitive.

hosay12312y ago

Looks nice, but note this doesn't appear to support consistent reads (unlike LevelDB snapshots)

conductor12y ago

I love the simplicity of the site and the C code. I will definitely use it, thank you.

MichaelGG12y ago

I've got a need for something like this, but would like to have the keys and values delta encoded to achieve simple, yet effective, compression.
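
One common form of delta encoding for sorted keys is front coding: each key is stored as the length of the prefix it shares with the previous key, plus the remaining suffix. The sketch below is a minimal illustration of that idea for keys only; the function names are hypothetical, and nothing here reflects how Sophia or any other store mentioned in this thread actually lays out its data.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Return the number of leading bytes that a and b share."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def delta_encode(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    encoded = []
    prev = b""
    for key in sorted_keys:
        shared = common_prefix_len(prev, key)
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def delta_decode(encoded):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    prev = b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys
```

The saving depends on how much adjacent keys overlap, so the input must already be sorted; decoding is sequential, which is why real formats that use this trick usually restart the chain at regular intervals.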

leif12y ago

How about http://github.com/Tokutek/ft-index? It's embeddable (BDB-like API), has compression built in, and is a similar data structure to this but with more mature features like transactions.

acron012y ago

Had a quick, 30 min bash at a win32 port using msinttypes and pthread-win32 but no luck yet :( Would love to see one though...

ksec12y ago

Something for Mozilla to consider using inside Firefox in place of LevelDB (if that ever landed).

buster12y ago

I'd be far more interested in benchmarks versus BDB (and maybe even sqlite).

maaku12y ago

Snapshots? I couldn't find them in the documentation.

jgalt21212y ago

Any word on support for Unicode keys?

dlundqvist12y ago

Keys are arbitrary data (you pass in a pointer to the data and its length in bytes), so you can use anything that makes sense for you as keys.
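
To illustrate what "arbitrary data plus a length" means for Unicode keys, here is a minimal sketch in which a `dict` keyed by `bytes` stands in for the store. The `put` and `get` helpers are hypothetical and are not Sophia's API; the remark about ordering assumes a store that compares keys bytewise, which is not confirmed in this thread.

```python
import struct

store = {}  # stand-in for a key-value store that treats keys as opaque bytes


def put(key: bytes, value: bytes) -> None:
    """Store value under key; the key is just a byte string of known length."""
    store[bytes(key)] = bytes(value)


def get(key: bytes):
    """Return the stored value, or None if the key is absent."""
    return store.get(bytes(key))


# A Unicode string becomes a usable key once it is encoded to bytes.
unicode_key = "ключ-鍵".encode("utf-8")
put(unicode_key, b"text key")

# A fixed-width big-endian integer compares bytewise in numeric order,
# which matters only if the store orders keys bytewise (an assumption here).
int_key = struct.pack(">Q", 42)
put(int_key, b"integer key")

# Keys may contain NUL bytes because the length is passed explicitly.
put(b"a\x00b", b"binary key")
```

The store never interprets the bytes, so the application has to pick one encoding (UTF-8 here) and one normalization form and use them consistently for both writes and lookups.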

luisbebop12y ago

Awesome work, congratulations!
