The magic of small databases

(tomcritchlow.com)

192 points · topcat31 · 3y ago · 63 comments

49 comments · 25 top-level

dmje · 3y ago · 10 in thread

I run a little agency in the UK that works with museums to help them with digital. A large part of this is getting collections online.

Some years ago we commissioned a developer to make CultureObject[0], a free and open source WordPress plugin to make it easier to ingest collections data for display on the web. At the heart it's a glorified data importer, and many people just use the CSV mode to sync and import collections data.

It requires some dev effort - we've built an add-on which makes this easier, but there's no denying that search, faceting and display need knowledge of WordPress development.

Three years ago we then launched The Museum Platform[1] which is a more SaaS based model - we take away the need for dev skills and ask clients to just send us a CSV and any related media and we do the hard work. It's WordPress again but a modified version where we also facilitate storytelling and narrative around the ingested collections.

The interesting thing about this journey is that the requirement to "get a collection online" is apparently and theoretically easy. But the reality is it gets hard quite quickly as the need for search / filtering appears, and it gets harder still as scale comes into it. 1000 records is fine. 100,000 gets quite a bit harder.

There are also many subtleties - particularly with museum collections. "Location" of a record could be where it was collected, or where it is now, or where it's on display. Relational stuff is hard, as are taxonomies and authority terms. It's hard to generalise and it's hard to scale.

[0] https://cultureobject.co.uk/ [1] https://themuseumplatform.com/
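One way to picture the "Location" subtlety is to store a place per role instead of a single location column. This is only a minimal sketch in Python: the `LocationRole` and `ObjectRecord` names and the place names are made up for illustration and are not taken from CultureObject or The Museum Platform.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class LocationRole(Enum):
    """Hypothetical roles a 'location' can play for one museum object."""
    COLLECTED = "where it was collected"
    HELD = "where it is now"
    DISPLAYED = "where it is on display"


@dataclass
class ObjectRecord:
    title: str
    # Maps LocationRole -> place name; a role may simply be missing.
    locations: dict = field(default_factory=dict)

    def location(self, role: LocationRole) -> Optional[str]:
        """Return the place recorded for one role, or None if unknown."""
        return self.locations.get(role)


# Made-up example record: collected in one place, stored in another,
# and not currently on display.
axe = ObjectRecord(
    title="Hand axe",
    locations={
        LocationRole.COLLECTED: "Example Quarry",
        LocationRole.HELD: "Store room B",
    },
)
```

A flat CSV with one `location` column cannot say which of these roles it means, which is part of why the import step gets hard.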

mmsimanga · 3y ago

I see you decided on WordPress. If you were going to use a CMS, I think Drupal 7 would have been a good choice. Drupal has the concepts of entities and views. An entity, as the name suggests, is essentially a table, and you can add all sorts of different fields to it, from simple text and number fields to images and fields that look up other entities, thus creating relationships between entities. Views is another construct that lets you choose how to display the entities; a list and a table are two possible views. Most of this can be done in Drupal 7 without writing code. I say Drupal 7 because you mentioned WordPress. Drupal 8 and above is more of a developer framework and requires knowledge of Composer. Backdrop [0] is a fork of Drupal 7.

[0] https://backdropcms.org/

dmje · 3y ago

WordPress has custom post types, taxonomies and metafields so is very capable of dealing with complex relationships if you need it to. What's challenging is going from simple columnar data such as CSV to something complex and relational.

We chose WordPress because of its ubiquity and power - plus it's insanely easy to host and use as a non technical editor, which (last time I looked) can't be said of Drupal.

justusthane · 3y ago

This is how ExpressionEngine is structured as well, except they're called channels and templates. I really enjoyed working with EE, although coding is definitely required - you basically have to build your site from the ground up. No themes included.

That being said, I found it much, much easier to develop than WordPress.

noduerme · 3y ago

This is a really cool niche... and I love the idea of it being more generally applicable or extensible to the kinds of private collections of objects that the writer is describing. (I really like what the article seems to be arguing for).

It seems like the data storage / search / filtering aspects of your software would be really fun and interesting to develop flexible solutions to. The Wordpress aspects probably wouldn't be so fun to maintain, but it's always pick-your-poison when it comes to CMSs unless you develop your own in-house.

That being said, a collection CMS doesn't necessarily need to have all the plugins and doodads that a Wordpress site does. It could be something bare-bones and extensible that was written to be more tightly coupled to a layer that interpreted the underlying data structure. Just toying with the idea, maybe even something that flattened the data views of the collection into static webpages for deployment so that at least some of the indexing could be handled by naming conventions and directory structure without recourse to database searches.

The world could definitely use an open source kit along these lines, with a GUI backend that would let non-developers build their own table structure and search parameters, draw up some page layouts, and just generate a searchable site that collated CSV records with images.

Some of this actually reminds me of what HyperCard could do... it allowed some really interesting experiments with user-classified data. Like this, from 1989: https://core.ac.uk/download/pdf/225955134.pdf

Relational stuff is hard, as you say, but in a structure built around a collection it seems like you could come up with a DSL that defined which columns needed to relate to other tables (any column with repeating data, for instance), suggest making that column "normalized", and automatically generate a linked table.
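A rough sketch of that "suggest normalising" step in plain Python rather than a DSL. The 0.5 repeat threshold, the function names and the sample rows are all assumptions for illustration.

```python
import csv
import io


def suggest_normalisable(rows, max_ratio=0.5):
    """Return columns whose distinct values are few relative to the row count."""
    if not rows:
        return []
    suggestions = []
    for column in rows[0].keys():
        distinct = {row[column] for row in rows}
        if len(distinct) / len(rows) <= max_ratio:
            suggestions.append(column)
    return suggestions


def normalise(rows, column):
    """Split `column` into a linked table and replace its values with ids."""
    lookup = {}  # value -> integer id, in order of first appearance
    linked_rows = []
    for row in rows:
        key = lookup.setdefault(row[column], len(lookup) + 1)
        new_row = {k: v for k, v in row.items() if k != column}
        new_row[f"{column}_id"] = key
        linked_rows.append(new_row)
    linked_table = [{"id": i, column: value} for value, i in lookup.items()]
    return linked_rows, linked_table


# Made-up sample data.
SAMPLE = """title,maker,material
Teapot,Wedgwood,ceramic
Vase,Wedgwood,ceramic
Spoon,Unknown,silver
Plate,Wedgwood,ceramic
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
candidates = suggest_normalisable(rows)
objects, makers = normalise(rows, "maker")
```

Here `candidates` flags `maker` and `material` but not `title`, and `makers` becomes the generated linked table. A real tool would still need a human to confirm each suggestion, since a repeating column is not always a relation.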

dmje · 3y ago

That's a nice idea - point a script at a CSV file and generate a bunch of flat files for each item using some kind of simple templating language. I might take this back to the guys at the Platform and see if we can do a POC for the clients who have zero budget but want to get going with something straightforward... Thanks for the thinking :-)
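A minimal sketch of such a script, assuming Python's standard library only, a CSV with a `title` column, and `string.Template` as the "simple templating language". The page layout and file naming are illustrative choices, not anything the Platform actually does.

```python
import csv
import html
import io
import re
import tempfile
from pathlib import Path
from string import Template

PAGE = Template("<!doctype html>\n<title>$title</title>\n<h1>$title</h1>\n<dl>\n$fields</dl>\n")


def slugify(text):
    """Lowercase, hyphen-separated file name derived from a title."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-") or "item"


def build_site(csv_text, out_dir, title_column="title"):
    """Write one HTML page per CSV row plus an index page; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, links = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        title = html.escape(row[title_column])
        fields = "".join(
            f"<dt>{html.escape(k)}</dt><dd>{html.escape(v or '')}</dd>\n"
            for k, v in row.items()
            if k != title_column
        )
        # Note: two rows with the same title would overwrite each other here.
        path = out_dir / f"{slugify(row[title_column])}.html"
        path.write_text(PAGE.substitute(title=title, fields=fields), encoding="utf-8")
        written.append(path)
        links.append(f'<li><a href="{path.name}">{title}</a></li>\n')
    index = out_dir / "index.html"
    index.write_text("<!doctype html>\n<ul>\n" + "".join(links) + "</ul>\n", encoding="utf-8")
    return written + [index]


# Made-up sample data, written to a throwaway directory.
SAMPLE = "title,maker,material\nTeapot,Wedgwood,ceramic\nSilver spoon,Unknown,silver\n"

with tempfile.TemporaryDirectory() as tmp:
    paths = build_site(SAMPLE, tmp)
    page_names = [p.name for p in paths]
    first_page = paths[0].read_text(encoding="utf-8")
```

The output is a folder of static files that any web host can serve; search and filtering would still have to be layered on top.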

tomrod · 3y ago

Is this not Libre Office, more or less?

mootpoints · 3y ago

I'm curious about what you make of Omeka, and whether you think it relates to OP's point. It's quite common in the digital humanities, but I've never seen it used outside that context.

dmje · 3y ago

I really like Omeka. It's a very cool project and we did look into it early on. Really though we chose WP because it's ubiquitous - no lock in, very powerful, easy for editors to use. With the right nudging it does all the things Omeka does.

ZephyrBlu · 3y ago

This is really interesting. The problem you're describing sounds similar to what I wrote about why this kind of thing is hard to generalize in another comment: https://news.ycombinator.com/item?id=34564394.

8n4vidtmkvmk · 3y ago

maybe I'm being naive, but 100K records doesn't sound hard to search either. maybe at 5 or 10M it starts getting ugly/expensive
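For a sense of scale, here is a minimal sketch of an in-memory inverted index over 100,000 synthetic records in plain Python. The records are invented and far more uniform than real collections data, so this only illustrates raw keyword lookup, not relevance ranking, faceting or messy source data.

```python
import re
from collections import defaultdict


def build_index(records):
    """Map each lowercase word to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(record_id)
    return index


def search(index, query):
    """Return sorted ids of records containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    # Intersect the smallest posting lists first to keep the work small.
    postings = sorted((index.get(w, set()) for w in words), key=len)
    return sorted(set.intersection(*postings))


# Synthetic data: 100,000 made-up records, purely for illustration.
MATERIALS = ["bronze", "ceramic", "silver", "wood"]
KINDS = ["bowl", "coin", "mask", "spoon", "vase"]
records = {i: f"{MATERIALS[i % 4]} {KINDS[i % 5]}" for i in range(100_000)}

index = build_index(records)
hits = search(index, "silver vase")
```

The lookup itself is a couple of set intersections. The harder parts described upthread (relations, taxonomies, authority terms) are not touched by this.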

btown · 3y ago · 4 in thread

Buried in here is a fascinating musing on "Market-making Small Databases" - "Imagine a Substack for databases - an easy tool for creating, maintaining and publishing databases with the ability to restrict parts or all of it behind a pay wall. Pair it with the ability to send email updates to your audience about changes and additions..." It's worth a read in full in the original article.

One of my favorite small databases is https://hiregoats.com/ - it's a simple site showing goat herds for rent (for clearing brush in a sustainable way, etc.), monetized with a $35 listing fee and nothing else. There's no e-commerce, no attempt to insert the site into the transaction or funds flow, no bells and whistles. Certainly this doesn't scale to other niches where suppliers are less incentivized to pay a listing fee, but I'd love to see this kind of thing be more common, and incentivize people to curate.

fbdab103 · 3y ago

I was quite amused when I went to the goats page to see they are expanding into other markets. They now have a sister site of https://hiresheep.com/

tomcam · 3y ago

Damnit I need to register hirequokkas.com and hireibexes.com immediately before one of you sharks beats me to it

btown · 3y ago

Much less inventory, though! But it's cool that they're starting somewhere - they have no need to feel sheepish just because their other site is so much more goated.

2h · 3y ago

uBlock Origin blocks that site for some reason

zokier · 3y ago · 4 in thread

Personally I find the whole family of dBase-style, non-SQL, kinda-graphical database systems an interesting historical software branch that feels like it has mostly died out these days. Access probably did quite a lot of damage here, killing off competitors before mostly succumbing itself.

gcanyon · 3y ago

FileMaker is still a thing. I don't know their internal financials, but they've steadily improved the product over the years. https://www.claris.com

Or if you want to go super-niche, Panorama is still around, and (they say) the longest-running Mac software developer apart from Microsoft. https://www.provue.com

Either one makes it easy to build a database+interface.

digitalsankhara · 3y ago

I had a distant memory about this Mac based spreadsheet/database thing but could not remember its name (Panorama). Couldn't surface it in searches either. Thought about it the other week and here we are!

Odd pricing though = pay in advance credits. Ummm, not something I'd like to use for work when I'm in the middle of an important analysis with a deadline and I (inevitably) run out of credits and have to start faffing about with in-app purchases. Maybe it's not that bad and I'm being unfair.

gavinmckenzie · 3y ago

Takes me back to the days of dBase, Clipper, and my favourite FoxPro which was acquired by Microsoft and continued to exist in the 90s. Access definitely destroyed the market for these other products by combining aspects of Visual Basic and database tech.

pstuart · 3y ago

FoxPro on the Mac was wonderful. I learned SQL wrangling with analytics on it -- there weren't all the options we have today.

LunarAurora · 3y ago · 2 in thread

There are categories of “Nocode” online services that could work, more or less, as small databases. Some are already cited in the article:

- DBs platforms (Best for more advanced DB) : Airtable, getgrist.com

- wikis+DB platforms (Best for building a site around the DB) : notion.so, coda.io

- Airtable/GSheet publishing (Best for simple/custom UI) : glideapps.com, siteoly.com

- Bookmarks/Collections (Best for links/References) : Zotero (online groups), are.na

- List sharing (Best for open collaboration?) : listium.com, (ranker.com ?)

- BI platforms (Best for advanced filters/charts) : polymersearch.com, Google Data Studio

- Data Set Hosting (Best for downloading?) : data.world, kaggle.com

All these allow publishing, and some collaboration

nerdponx · 3y ago

What about Datasette and/or Dolt.

LunarAurora · 3y ago

My list included nocode services only.

simongray · 3y ago · 2 in thread

This post is an exercise in describing the motivation and features of the Semantic Web seemingly without realising the tech stack already exists.

simonw · 3y ago

I honestly think that reflects more poorly on the semantic web tech stack than it does on the author of that piece.

I spend almost all of my time thinking about this class of problems and hanging out with other people who do, and sadly it's vanishingly rare to run into anyone outside of academia who's trying to use the classic semantic web stack (RDF and suchlike) to build this kind of thing.

osi · 3y ago

i worked at a then-web3.0 startup in the 00’s that had built something that could have been pivoted into this, but instead the CEO wanted to be like Digg instead.

the commercial community of practice is small for sure.

dgudkov · 3y ago · 2 in thread

Small databases aren't popular because Excel spreadsheets already occupy that niche. A small database doesn't have to be normalized. Because it's small, it can be denormalized into a flat table that can be conveniently handled in Excel.
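To make the "denormalized into a flat table" point concrete, here is a minimal Python sketch that joins two small made-up tables into a single CSV that a spreadsheet could open. The artists, works and column names are assumptions for illustration.

```python
import csv
import io

# Hypothetical normalised data: works reference artists by id.
artists = {
    1: {"artist": "Artist A", "borough": "Brooklyn"},
    2: {"artist": "Artist B", "borough": "Queens"},
}
works = [
    {"title": "Untitled I", "year": 2019, "artist_id": 1},
    {"title": "Untitled II", "year": 2021, "artist_id": 1},
    {"title": "Harbour", "year": 2020, "artist_id": 2},
]


def denormalise(works, artists):
    """Copy each artist's fields onto every one of their works."""
    flat = []
    for work in works:
        row = {k: v for k, v in work.items() if k != "artist_id"}
        row.update(artists[work["artist_id"]])
        flat.append(row)
    return flat


def to_csv(rows):
    """Serialise a list of same-shaped dicts as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


flat_csv = to_csv(denormalise(works, artists))
```

The trade-off is the usual one: the flat file is easy to open and filter, but "Artist A" and "Brooklyn" are now repeated on every row, so a correction has to be made in several places.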

FridgeSeal · 3y ago

That’s not really analogous and kind of misses some of the other aspects the author talks about.

Excel doesn’t cover the publishing and discovery aspect. It is absolutely atrocious from a machine usability and schema perspective, nevermind performance, etc.

Even if you think excel does address those, I think the shortcomings of the format should rule it out. It is better to have a more powerful tool, and fix the usability aspects, rather than trying to proverbially rub glitter on what amounts to a turd of a format.

omnipath · 3y ago

Yeah, but I think you're underestimating what the grandparent post is saying. The people who would PAY for some of these functions are already making do with Excel. And Microsoft has responded in kind by increasing Excel's ability to do, well, everything. I'm pretty sure if someone wanted to include those features the article is talking about on top of Excel, they would. Just last week, we saw someone add onto Excel a C# IDE with debugger, and posted about it here. It seems one's limit with Excel is only one's imagination.

breck · 3y ago

I'm going to plug our related project: TreeBase. It's the public domain software that powers PLDB.com (a Programming Language DataBase).

It's very simple. If your small database was about cars, your structure might look something like this:

    database/
     grammar/
      engine.grammar
      interior.grammar
     things/
      model3.car
      camry.car

The `grammar` files are written in a Tree Language called Grammar. Those are your schema files. You basically create a new syntax-free plain text "language" for storing your data, in this case 1 "car" file per model of car.

It was a pipedream of mine until the M1's came out. Those changed everything, because then it became fast enough to actually do it.

We have a new release coming out soon with a new query language that will change everything. Here is the source code: https://github.com/breck7/jtree/tree/main/treeBase

xnx · 3y ago

"Publishing documents to the web is a well-served use case but publishing small indexes, databases and collections to the web is still an incredibly frustrating and under-served use case. Here I outline why I think it matters and a variety of approaches to solving it."

Amen. I'm surprised the post doesn't mention sqlite3 WASM/JS (https://sqlite.org/wasm/doc/trunk/about.md). That, paired with an easy-to-use faceting library, would go a long way.
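As a rough illustration of how little SQL faceting needs, here is a sketch using Python's built-in `sqlite3` module rather than the WASM build (the SQL itself would be the same). The `objects` table, its columns and the `facet_counts` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (title TEXT, material TEXT, century INTEGER)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [
        ("Teapot", "ceramic", 18),
        ("Vase", "ceramic", 19),
        ("Spoon", "silver", 18),
        ("Coin", "silver", 17),
        ("Mask", "wood", 19),
    ],
)

# Column names cannot be bound as SQL parameters, so check them against a whitelist.
FACET_COLUMNS = ("material", "century")


def facet_counts(conn, facet, filters=None):
    """Count rows per value of `facet`, restricted by equality filters."""
    filters = filters or {}
    if facet not in FACET_COLUMNS or any(c not in FACET_COLUMNS for c in filters):
        raise ValueError("unknown facet column")
    sql = f"SELECT {facet}, COUNT(*) FROM objects"
    if filters:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in filters)
    sql += f" GROUP BY {facet} ORDER BY {facet}"
    return conn.execute(sql, list(filters.values())).fetchall()


all_materials = facet_counts(conn, "material")
materials_in_18th = facet_counts(conn, "material", {"century": 18})
```

Each facet panel in a UI is one such `GROUP BY` query, re-run with the currently selected filters.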

ZephyrBlu · 3y ago

I love this. I've been thinking about something similar lately. There are so few good indexes and search engines for niche collections of data.

Imagine if there was a niche search engine for everything, and the search engine was customized for that niche.

I think the main problems here are:

- Data format and ingestion

- Domain-specific indexing/relevance

Most data is super messy and is not accessible through nice APIs, which presents a problem. You might need custom ingestion for each niche and it's pretty likely you'll need some rules to standardize data from multiple sources, neither of which seems easy to generalize and automate because they're very domain-specific.

The other part to this is indexing/relevance so the search feels good to use. Some fields are obviously going to be more important than others and people are going to want to utilize search for things that are hard to predict ahead of time.

To use the author's example of artists in Brooklyn, people might want to search for artists near them. Now you have to gather location data, format it, ingest it, index it and add it to the search UI.

The fact that adding another field to index on is a vertical integration adds a lot of overhead.

All of this stuff in isolation is not difficult, but when you put it together it becomes quite a lot of work that generally isn't easily scalable.
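The "artists near them" case, once coordinates have been gathered, could be sketched like this in Python. The artist names and coordinates are invented, and a plain linear scan with the haversine formula is assumed to be good enough at small-database sizes.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points in km."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearby(records, lat, lon, radius_km):
    """Return names of records within radius_km, nearest first."""
    scored = [(haversine_km(lat, lon, r["lat"], r["lon"]), r["name"]) for r in records]
    return [name for distance, name in sorted(scored) if distance <= radius_km]


# Made-up records; the coordinates are illustrative only.
ARTISTS = [
    {"name": "Artist B", "lat": 40.7831, "lon": -73.9712},
    {"name": "Artist A", "lat": 40.6800, "lon": -73.9500},
    {"name": "Artist C", "lat": 34.0500, "lon": -118.2400},
]

within_5km = nearby(ARTISTS, 40.6782, -73.9442, 5)
within_15km = nearby(ARTISTS, 40.6782, -73.9442, 15)
```

The code is the small part. Getting a latitude and longitude for every artist in the first place is the ingestion work described above.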

itsmemattchung · 3y ago

Reminds me of Amazon EBS and a white paper describing the philosophy of deploying millions of tiny databases:

https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c...

moehm · 3y ago

For what it's worth, here is my "small database" attempt, a structured list of worthwhile Wikipedia articles to read.

https://www.mostdiscussed.com

overgard · 3y ago

People would love this for sports. There's so much interesting data locked up in proprietary databases

roncesvalles · 3y ago

I'm aware that this may sound dismissive but the solution that the author of the OP is looking for is the World Wide Web itself.

The "small database" in question is, well, an HTML page. It can be shared and passed around by selecting the portions of it that you need and pressing Ctrl+C/Ctrl+V. Search is accomplished by the browser using Ctrl+F. Collaboration can take many forms - wikis, comments, forums, live editing. Links between databases are what URL links are. The database that OP is looking for is a page of text (for unstructured data) or somewhat structured solutions like CSV, JSON, or YAML.

Now, yes, there are certain participants on the WWW who make poor web design choices that cause agreed-upon functionality to break. E.g. unnecessary pagination or accordions breaking Ctrl+F, not offering data for download, not having useful URL paths etc.

Zababa · 3y ago

I like the idea, but I think one issue is that the database is the easy part. If I look again at the list of requirements, most are not about the database but about how to put data from external sources in the database, how to edit the database, and how to publish it. To me this sounds like an interface problem. But since the whole point is small, specialized collections, interfaces have to be specialized too. That means no single tool can offer a solution. Maybe it's an issue of definition: I call a database something like MySQL or SQLite or even a CSV file, while for the author it's the finished product, the database about <stuff> and the tools that are adapted to <stuff>.

Substack is an interesting example. It's great for written content with a few images, which mostly looks the same everywhere. But it lacks great customisation features that I think a database would need, because that stuff is hard to do.

If I had to propose a solution, it would be this: if you want to do a small database, do it. Experimentation in the cyberspace is very cheap. These days you have lots of resources for everything online. It can be intimidating, and can lead to analysis paralysis. I'm supposed to be a professional developer and still struggle with that. But one thing that has helped me a lot recently is to try stuff, see if it works, if it fails, ask questions (to either real people or ChatGPT/Copilot, Copilot is especially valuable to get in a "just keep writing, editing comes later" mood). It's not always fun, in fact it can be quite frustrating, but that's how things are.

In the end, this is about decentralisation and you can't have proper decentralisation if you don't also decentralise the skills, the know-how. For example, there has been a lot of talk about Mastodon as a decentralised alternative to Twitter. And it is one. But if you simply go from being a user on Twitter to being a user on Mastodon, well you don't regain much control. On the other hand if you try running a small instance, even just a local instance to see how it works, or maybe add a few features to your preferred client (it can be code, but it could also be helping translation, or maybe a color scheme (you wouldn't believe how many color schemes are barely usable when you're colorblind)), well then you start being in control.

topcat31 (OP) · 3y ago

Hey OP here, just wanted to say thanks for all the comments (goats and all). There's lots I still need to learn about (actual) databases as a hobby developer...

In the meantime I've made a big update to the Airtable with links to tools, examples and further reading:

https://airtable.com/shrYY94GrqVB4HUsi/tblHPrdomiPbLpod6/viw...

marniewebb · 3y ago

H2O (https://h2o.law.harvard.edu/) is a now-defunct collaborative syllabus project from Harvard that gets at a lot of this, I think. It’s basically a list maker with a lot of additional capabilities. While it’s made for small lists of things, it’s easy to imagine this is a piece of the solution.

082349872349872 · 3y ago

A search for "filemaker" reveals that Claris is still in business; I'd hope they'd have something that might address this need?

hardwaresofton · 3y ago

Weirdly enough I haven’t seen too much mention of CMSes — them plus/minus spreadsheet like tools are almost surely the way to handle this kind of use case.

What’s missing is the added search + UI capabilities.

I think about saas ideas a lot and this is actually quite a common one (though I’m generally thinking of a specific niche) —- enabling people to craft and expose datasets would surely be a great startup.

jerryu · 3y ago

Having a small database is especially useful when collaborating on data strategy. I have seen some database diagrams with 1000s of tables and it is hard to make sense of it using ERD tools.

Even with advanced views offered by tools like ERDLab.io it is a pain in the ass to collaborate on large schemas at various stages of development.

cavisne · 3y ago

I feel like this is getting really close. GPT is great at writing SQL queries from text and turning a blob of semi-structured data into an SQL schema.

We just need to somehow tie it together so anyone can explain their use case, and show an example of the data in plain english, then lock in a schema and feed everything in.
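The "show an example of the data, then lock in a schema" step can be sketched without a language model at all. The version below swaps GPT for plain rule-based type inference over a CSV sample, which is a much weaker technique but shows the shape of the pipeline. The function names and sample data are assumptions, and column names are assumed not to contain double quotes.

```python
import csv
import io
import sqlite3


def infer_type(values):
    """Pick INTEGER, REAL or TEXT for a column from its non-empty sample values."""
    present = [v for v in values if v != ""]
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for value in present:
                cast(value)
        except ValueError:
            continue  # this type does not fit, try the next one
        if present:
            return sql_type
    return "TEXT"


def infer_schema(csv_text, table):
    """Build a CREATE TABLE statement from a CSV sample (needs at least one row)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = rows[0].keys()
    parts = [f'"{c}" {infer_type([r[c] for r in rows])}' for c in columns]
    return f'CREATE TABLE "{table}" ({", ".join(parts)})'


# Made-up sample with a missing year in the last row.
SAMPLE = "title,year,height_cm\nTeapot,1780,14.5\nVase,1850,30\nSpoon,,12.25\n"

schema = infer_schema(SAMPLE, "objects")
conn = sqlite3.connect(":memory:")
conn.execute(schema)  # "lock in" the schema
```

Rules like these cannot guess that a column is a date, a relation or a controlled vocabulary, which is presumably where a model that reads a plain-English description would earn its keep.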

aabbcc1241 · 3y ago

For collections of links with short descriptions of projects/services, there are many awesome lists on GitHub.

For more complex data to be shared, maybe it can be csv/md/mdx shared over git as well?

It can have a stable URL and be searchable from GitHub, search engines, and 3rd-party indices.

vaporup · 3y ago

https://zed.brimdata.io

maphew · 3y ago

Makes me think of what something like Datasette fused with Fossil SCM could accomplish.

Trayja-Peter · 3y ago

"I want to empower more individuals to publish, maintain and collaborate on small indexes. To build a million tiny libraries, community databases, weird collections and indie indexes."

Funnily enough, a friend and I have been building https://Trayja.com, a tool which does this exact thing, with a focus on the "community" aspect. There's a huge amount of wisdom in communities, whose value could be multiplied if it would be aggregated in a structured, indexable, searchable way. This article articulated so much of what I've been trying to explain about my project.

LAC-Tech · 3y ago

With how fast computers are now, they can work well for small businesses too.
