One day I will do it.
Speaking more broadly, whether we talk about HTML or PDF it's the same problem: documents should have two representations - human-friendly and machine-friendly until AI gets so good that only having the human-friendly representation is enough.
When I download papers from arXiv I sometimes choose the LaTeX version because it often comes with commented-out ideas that didn't make it into the paper. The author's thought process becomes clear. The metadata helps me understand the whole thing quicker, the same way the semantic web helps the machine.
Perhaps there is a clever philosophical analogy in there somewhere about "us becoming the machine" or "the map becoming the territory", but I can't put my finger on it.
I personally believe the key is in "information architecture". We have been conveying information as a linear sequence of words for so long that we don't know yet how to best exploit non-linear formats.
Programming languages harness the relation between specific instructions and the structure in which they are embedded; but this structure is oriented towards building a single executable block that controls a machine step by step. We have yet to build tools equivalent to IDEs that exploit the overall structure of knowledge, with the goal of understanding a topic at all levels of detail, both at the local flow of ideas and the overall relation between its subtopics.
The first widespread step in that direction was the 1.0 static World Wide Web, and we have learnt a lot thanks to it so we can now improve upon it. I have great hopes in online notebooks and no-code spreadsheet-like tools as the basis of such information-processing environments.
I think we can use CTPH algorithms to fingerprint the data independently of whatever names are used, and then we can use that to find representations of the "same data" submitted by other users. Probably there would be some reputation stuff involved, a web of trust, etc. The flow would go something like this:
COGNIZE:
0. Encounter messy data in the wild (has pagination, timestamps of access, etc), need other representation (human/computer/whatever)
1. Calculate CTPH fingerprints, use them to search for link: miss
3. Clean data the hard way and publish canonical representation (ipfs?)
4. Generate missing representation the hard way, publish that too
5. Calculate the fingerprints common to both the missing representation and the canonical representation, and publish it as a "link" between the two. Unlike traditional web links, this one is bidirectional.
RECOGNIZE
1. Different user encounters "same" data in the wild
2. Calculate fingerprints, use them to search for canonical representation: hit
3. Find further links to see what other representations of the "same data" are available, download them if desired.
The fingerprint stuff works, but there's a lot of work left to be done re: mapping fuzzy hashes of "in the wild" data to cryptographic ones of "canonical representations" and finding ways to incentivize users to go through the hassle of the "cognize" step so that other users can benefit from the "recognize" step.
Sorry for talking your ear off, it just feels good to know I'm not the only one working on something like this, even if our approaches are quite different (mine works on PDFs only because it works on arbitrary bytes). Good luck with yours.
But I don't think it changes the main problem to be solved, which is that it's data consumers who want the semantic web, but so far it's been up to data providers to implement it. We need to be able to create links between data without the participation of whoever hosts it.
I figure if you configure the client to keep track of whose annotations you have a history of using, you've got a real granular view of which content providers to pay. I'm imagining a game where we all put $5 in at the start of the month and we all pay each other based on whose content we use the most. Some users will pay more than they make, others will make more than they pay.
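The arithmetic of that game is easy to sketch. The following assumes one simple rule, which is only one of many possible: each reader's stake is split across providers in proportion to how often the reader used their annotations. The user names and usage counts are made up.

```python
from collections import Counter

MONTHLY_STAKE = 5.00  # dollars each participant puts in at the start of the month

# Hypothetical client-side history: reader -> how often they used each provider's annotations.
usage = {
    "alice": Counter({"bob": 3, "carol": 1}),
    "bob": Counter({"carol": 2}),
    "carol": Counter({"alice": 1, "bob": 1}),
}


def settle(usage, stake=MONTHLY_STAKE):
    """Split each reader's stake across providers in proportion to use; return net balances."""
    net = {user: -stake for user in usage}
    for reader, counts in usage.items():
        total = sum(counts.values())
        if total == 0:
            net[reader] += stake  # used nobody's content: the stake comes back (assumed rule)
            continue
        for provider, n in counts.items():
            net[provider] = net.get(provider, 0.0) + stake * n / total
    return net


balances = settle(usage)
```

With these numbers alice ends the month 2.50 down while bob and carol are each 1.25 up, and the balances sum to zero because the pool only redistributes what was paid in.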
This is exactly what we are trying to achieve at Anvil. 1. Provide the no-code tools to make it easy to convert existing PDF forms into web forms. 2. Share the web forms with prospective customers instead of PDF forms as email attachments. 3. PDFs are generated as part of the workflow once the data is captured and represented in structured JSON. 4. (optional) Request certification of the PDF via e-signatures.
The end result is a JSON payload that can be shared via API as well as a static PDF that is stored for human consumption. In most cases, we find that our customers actually just use the PDF as an interface with legacy systems (IRS, Banks, Insurance Companies) that haven't yet figured out how to modernize to a data-first business model.
Of course this really only addresses PDFs that are used for information capture and transfer between two parties. But most PDFs that are not "standardized-forms" are made for consumption by humans not by machines (think ebooks, journal articles, graphics etc), and therefore having a JSON payload of the data attached doesn't really matter.
https://de.wikipedia.org/wiki/ZUGFeRD
My CodeDraw app also puts the source code in PDFs or PNGs that it generates.
Is this compatible with User Defined Language?
But: don't we need some way to prove that the data matches what is visually rendered in the PDF reader?
And if we can prove that the embedded data matches the rendered document, couldn't that same logic just be used in reverse to generate the structured data from the renderable PDF?
I do really like the idea of having a checksum of some sort if we end up embedding metadata like this.
But in the real world people are going to want to annotate the PDFs, there will be open-and-save cycles that add metadata and break the checksum, etc and so on, even without considering malicious actors. Restricting all that is maybe easy -- just reject anything that doesn't match the checksum, done -- but communicating that restriction to the users without making a mess of it is probably hard.
Long ago I wrote some PDF generating programs and it was a lot of fun, but the spec has evolved in ways I imagine would make it less fun today. Still, could be a cool thing, and I'd be surprised if someone hasn't already done a version of it somewhere.
[Edit: plus, whoever creates the one-way function is deciding what all the PDFs are going to look like, which means you will end up with many such functions to accommodate the different rendering goals, and then each validator needs to know each one and someone decides which ones to trust, and so on...]
Wait until you realize this applies to business presentations (Excel and PPT-authored-PDFs)
But it doesn't matter. The API doesn't matter. Web 3.0 was never about APIs, it was about data. A standardized API is only useful if it outputs standardized data. Having a bunch of bespoke SQLite tables scattered across the web gets us no closer to the ideal of Web 3.0.
Features you may have missed:
- CTEs - the WITH statement works great. And recursive CTEs, which can do mandelbrot fractals! https://www.sqlite.org/lang_with.html
- Surprisingly good built-in full text search: https://www.sqlite.org/fts5.html
- Functions for directly querying JSON data in columns: https://www.sqlite.org/json1.html
- Extremely comprehensive geospatial capabilities thanks to the SpatiaLite extension - this has a huge amount of functionality, which I think is better than MySQL though not yet as good as PostGIS: https://www.gaia-gis.it/fossil/libspatialite/index
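As a small illustration of the first item, here is a recursive CTE run through Python's built-in `sqlite3` module. The `page` table and its rows are invented for the example; the point is only that `with recursive` works out of the box.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table page (id integer primary key, parent integer, title text)")
conn.executemany(
    "insert into page values (?, ?, ?)",
    [(1, None, "Home"), (2, 1, "Docs"), (3, 2, "SQL"), (4, 1, "Blog")],
)

# Recursive CTE: start at the root, then repeatedly join children onto rows already found.
rows = conn.execute(
    """
    with recursive tree(id, title, depth) as (
        select id, title, 0 from page where parent is null
        union all
        select page.id, page.title, tree.depth + 1
        from page join tree on page.parent = tree.id
    )
    select title, depth from tree order by depth, title
    """
).fetchall()
```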
I'm not saying it's as "good" as PostgreSQL, but I don't think your argument that PostgreSQL and MySQL implement a substantially larger portion of SQL holds up particularly well.
Schema.org exists, but all websites adopting it seems unlikely.
That being said, I can maybe see a world in which one company adopts schema.org schemas and the rest have to follow suit to be competitive in that particular domain.
Schema.org has the backing of major search engines and other reusers of Web-served content. It's way more likely to be adopted compared to anything else in its general domain.
The only difference is that it is usually run locally (compared to Postgres and your other examples), but something doesn't have to run remotely to be considered running in production :)
You could probably take it one step further and define an OpenAPI spec which is populated via those queries. Tho that would require an intermediary / post-processing, likely with a cache.
Regardless, the capability to determine how and what to consume sits with the consumer (developer) from the outset. Rather than having to scrape the data, normalise it into some form of schema, and then build an api / interface around it. And then worry about keeping it up to date.
This article is talking about that earlier definition, and a way it might once again be the definition, perhaps relegating Web3 to Web 4.0 (or web we-worked-out-it-was-a-ponzi-scheme-so-just-stopped with something else being Web 4.0, if you take the more cynical view).
The idea is good, that a web page should be generated from some data somewhere. But the "web" is not so much about a single document as about the links between documents, which allow you to represent a "semantic net". The data should be about the links between them. Now where is such a database? And how can it be "sharded" into multiple databases running in thousands of locations on the internet?
Though to be serious: what do you expect a graph database to provide that sqlite cannot / does not do efficiently?
(Note: I've actually written a graph database from scratch, for exactly these reasons.)
Triple stores are essentially relational databases in 6th normal form. But relational databases like SQLite don't have good join algorithms to deal with this (they do pairwise joins instead of worst-case optimal ones like Leapfrog Triejoin or Tetris). They also lack good interfaces for so many joins: you want something more declarative, like Datalog/SPARQL/GraphQL, rather than explicitly writing out every join.
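To show what "explicitly write out every join" looks like, here is a toy triple table in SQLite via Python. The facts are made up, and the sketch says nothing about join algorithms or performance; it only illustrates that every extra hop in the graph costs another self-join in the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table triple (subject text, predicate text, object text)")
conn.executemany(
    "insert into triple values (?, ?, ?)",
    [
        ("alice", "works_at", "acme"),
        ("bob", "works_at", "acme"),
        ("acme", "located_in", "berlin"),
        ("berlin", "capital_of", "germany"),
    ],
)

# "Who works at a company located in a capital city?" needs one self-join per hop.
rows = conn.execute(
    """
    select t1.subject, t3.object
    from triple t1
    join triple t2 on t2.subject = t1.object and t2.predicate = 'located_in'
    join triple t3 on t3.subject = t2.object and t3.predicate = 'capital_of'
    where t1.predicate = 'works_at'
    order by t1.subject
    """
).fetchall()
```

In a declarative graph language the same question would be a three-pattern query, with the join order left to the engine.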
That has been my pet peeve for a while, and can make it hard to define navigation.
Totally not a deal-breaker though. I'd still use sqlite because I ### love it :)
relation_id │ relatee_id
────────────┼───────────
1 │ x
1 │ y
voila, directionless relationships. you can even do N-way, just insert more rows with the same relation-id.
SQLite is not a graph database. Even if you used SQLite to implement a graph database, it would not solve any significant problems of the semantic web, such as access to data, taxonomies, ontologies, lexicons, tagging, user interfaces to semantic data management, etc.
It's a really odd suggestion that you would just copy around a database or leave it on the internet for people to copy from. For the BBS mentioned here, that might actually be illegal, as it might contain PII, and on other sites possibly PHI. Many countries now have laws that require user data to remain in-country. Besides the challenges of just organizing data semantically, there still needs to be work done on data security controls to prevent leaking sensitive information.
The funny thing is, that isn't even hard to do with the semantic web. You classify the data that needs protecting and build functions and queries to match. You can tie that data to a unique ID so that people can "own" their data wherever it goes, and sign it with a user's digital certificate which can also expire.
But all of that (afaik) doesn't exist yet. Everyone is more concerned with blockchains and SQL, either because the fancy new tech is sexier, or the old boring tech doesn't require any work to implement. The Semantic Web never caught on because it's really fucking hard to get right. No companies are investing in making it easier. Maybe in 20 years somebody will get bored enough over a holiday to make a simple website creation tool that implicitly creates semantic web sites that are easy to reason about. It'll probably be a WordPress plugin.
[1] https://en.wikipedia.org/wiki/Semantic_Web [2] https://graphdb.ontotext.com/documentation/enterprise/introd...
This just isn't true, on multiple levels. RDF is an interoperability standard that does not per se depend on a 'graph-like' data model - you can very much expose plain old relational data via RDF, and this is quite intended. Additionally, modern general-purpose RDBMS's support graph-focused data models quite well, despite being built on 'relational' principles - there's no need for special tech when working with general-purpose graph models, unless you're doing some sort of heavy-duty network analytics.
RDBMS were a niche research project for a decade before they started to catch on in business apps. They've stayed around forever because they're just functional enough to be dangerous. But we've already hit the upper limits of both reliability and performance years ago (remember NoSQL?) and we just keep bolting on features because nobody wants to leave them. The old designs and implementations are holding us back.
The online docs (and TBL himself) rarely mention graph databases, but obviously the idea is tied tightly to RDF. Separating it from that implementation detail is part of the point, though. Getting people to represent their data via an additional format was never going to work.
> For the BBS mentioned here, that might actually be illegal, as it might contain PII
Can't imagine the purpose you had in even making this point. In theory, any arbitrary database exposed publicly could be illegal to replicate due to copyright, PII laws, etc. But that has nothing at all to do with a technical discussion of a technique for exposing data. What a bizarre point to make.
As an aside, I'm glad you removed the "Uh........." from the beginning of your post. We're all making an effort to reduce the typical HN snark in the comments, and there's always room for improvement :D
There is a WP plugin to expose some basic data: https://wordpress.org/plugins/wp-linked-data/ and a Swedish startup Metasolutions has Wordpress plugins for embedding any kind of RDF information: https://docs.entryscape.com/en/blocks/
I took advantage of that for my datasette-graphql plugin - it's not a graph database, but it does allow deeply nested graph-like queries that take advantage of SQLite's fast small query performance: https://datasette.io/plugins/datasette-graphql
1. Graph databases on top of triple stores are a lot less scalable than relational databases or key-value stores, and this is how semantic data is meant to be stored/queried.
2. Data is valuable. Handing out data for free in a machine-consumable way is both expensive (machines can request data much more quickly than a human) and a recipe for copycats. The incentives just aren't there.
TBL's Solid project is about trying to separate semantic data providers from the presentation layer and opening up the possibility of payment from these data providers to try to improve the incentives around semantic data sharing.
I really appreciate this point. I had the opportunity to work on an exploratory project with an experienced ontologist (yes, you really need one of those, I think). The tools were fascinating (reasoners quickly became necessary) but I had the feeling that many of these tools were at a comparatively early stage of maturity.
Trying to explain to people how the system would work was a challenge as it required a primer on theory and application -- we glazed many eyes. The CTO wanted to know if we could use blockchain somehow. Another group addressed a slice of the problem with technologies already in use and that decided the matter.
Ouch. Most uses of reasoners/inference are quite computationally-intensive, to the point of making "reasoning" quickly infeasible. But if you really want, you can do all this stuff in traditional databases by defining appropriate 'views' and having your application query them. You could even use custom database triggers to enable inserts/updates on views.
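A minimal sketch of that view-plus-trigger approach, using SQLite through Python for convenience (the comment is about traditional databases in general, and the `subclass_of` table and its facts are invented). The view derives a transitive "is a" relation on every query, and an `instead of` trigger lets the application insert through the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table subclass_of (child text, parent text);
    insert into subclass_of values ('cat', 'mammal'), ('mammal', 'animal');

    -- The view stores nothing: it derives the transitive closure on every query.
    create view is_a as
    with recursive closure(child, ancestor) as (
        select child, parent from subclass_of
        union
        select closure.child, subclass_of.parent
        from closure join subclass_of on subclass_of.child = closure.ancestor
    )
    select child, ancestor from closure;

    -- INSTEAD OF trigger: inserts into the view are stored as plain asserted facts.
    create trigger is_a_insert instead of insert on is_a
    begin
        insert into subclass_of values (new.child, new.ancestor);
    end;
    """
)
conn.execute("insert into is_a values ('lion', 'cat')")
inferred = conn.execute(
    "select ancestor from is_a where child = 'lion' order by ancestor"
).fetchall()
```

Only one fact about lions was asserted, yet the view also reports the two inferred ancestors. Richer inference rules would need more views, and the cost concerns raised above still apply.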
E.g. Stardog does most of their reasoning via query rewriting (and also lean on some restrictions). That way you can leverage DBs to do what they are good at. If you can then on top of that build some clever caching or incremental computation, you should be fine for even pretty huge dataset sizes.
The positive thing I feel when reading about this is that it dramatically lowers the barrier for the producer of the data to expose it in a meaningful way. Previously it was necessary to think about the format and write code to expose the data, while now it's possible to just throw the data over a wall.
You could use a framework to automate the first thing, but this would be specific to one programming language, while the second approach works with all languages. So it lowers the total effort to get to the goal, effectively side-stepping the "have to implement framework or serialization code" issue.
Warning: heavy speculation below
So if more people built sites using this technique, the pressure for better tools (at a higher level than right now) for consumers would increase, so these would be built by someone. As you have a proper standard (there is only one SQLite) you would have a new "ecosystem" growing. This would lower the pain for the consumers of said data. You'd still have to implement it in every programming language that wants to access the data, but this is another problem.
It isn't quite so bad. You can have wiki-esque volunteer-driven cooperative authoring, linking to known good versions, etc. to keep it from becoming a complete free-for-all.
Ontologies were centrally published (and had URLs when not - "URIs/URNs are cool"), so it was easy to understand data models. The entity name was the location was the definition. Ridiculously clever.
Furthermore, HTML was headed back to its "markup" / "document" roots. It focused around meaning and information conveyance, where applications could be layered on top. Almost more like JSON, but universally accessible and non-proprietary, and with a built in UI for structured traversal.
Remember CSS Zen Garden? That was from a time where documents were treated as information, not thick web applications, and the CSS and Javascript were an ethereal cloak. The Semantic Web folks concurrently worked on making it so that HTML wasn't just "a soup of tags for layout", so that it wasn't just browsers that would understand and present it. RSS was one such first step. People were starting to mark up a lot of other things. Authorship and consumption tools were starting to arise.
The reason this grand utopia didn't happen was that this wave of innovation coincided with the rise of VC-fueled tech startups. Google, Facebook. The walled gardens. As more people got on the internet (it was previously just us nerds running Linux, IRC, and Bittorrent), focus shifted and concentrated into the platforms. Due to the ease of Facebook and the fact that your non-tech friends were there, people not only stopped publishing, but they stopped innovating in this space entirely. There are a few holdouts, but it's nothing like it once was. (No claims of "you can still do this" will bring back the palpable energy of that day.)
Google later delivered HTML5, which "saved us" from XHTML's strictness. Unfortunately this also strongly deemphasized the semantic layer and made people think of HTML as more of a GUI / Application design language. If we'd exchanged schemas and semantic data instead, we could have written desktop apps and sharable browser extensions to parse the documents. Natively save, bookmark, index, and share. But now we have SPAs and React.
It's also worth mentioning that semantic data would have made the search problem easier and more accessible. If you could trust the author (through signing), then you could quickly build a searchable database of facts and articles. There was benefit for Google in having this problem remain hard. Only they had the infrastructure and wherewithal to deal with the unstructured mess and web of spammers. And there's a lot of money in that moat.
In abandoning the Semantic Web, we found a local optimum. It worked out great for a handful of billionaires and many, many shareholders and early engineers. It was indeed faster and easier to build for the more constrained sandboxiness of platforms, and it probably got more people online faster. But it's a far less robust system that falls well short of the vision we once had.
At one point twitter seemed to want to be a relatively general protocol, where users could build their own UI, use 3rd party apps and maybe even interoperate or extend with other social networks & such.
Pg/yc even wrote about it, inviting startups to start writing apps for this exciting new protocol. The early app ecosystem was pretty slimy, with a lot of spam-ish clients for promoting snake oil. More importantly, it became clear that controlling the UI means control over users: the data, the rights, and often the ability to decide what goes into people's feed. That's where the (financial) value is, and they're not going to give that up.
TBL's ideas were naive perhaps, but he did have his thumb in the right place. Something like semantic web was necessary, in order to avoid the centralisation that did end up happening.
RSS, via podcasting did catch on. Today it's one of the only "free" media forms. There's no company moderating podcasts like twitter, FB, youtube, etc.
"Marking up forum posts" is something that's getting quite a bit of traction nowadays via specifications like ActivityStreams (with its "push" extension ActivityPub now powering the 'Fediverse') and WebMention.
While that concept sounds cool in theory, in practice it was and is a disaster. Given the high degree of centralization and the lack of versioning mechanisms, you have to trust the publisher not to alter the semantics, and also hope that they stay online forever or your semantics vanish.
When I first learned about the semantic web, I was very hyped about it, but that quickly subsided once I tried actually querying the ontologies and found that most of them yield a 404.
I'm still very hopeful for semantic data (and happy to be able to work on a product leveraging it), but I think for an open semantic web there is a lot of work that needs to go into tooling to make it succeed.
In the end though I'm not sure it ever would have been any different. People want it "now" and they want it "convenient".
RSS can be considered a primitive separation of data and UI, yet was killed everywhere. When you hand over your data to the world, you lose all control of it. Monetization becomes impossible and you leave the door wide open for any competitor to destroy you.
That pretty much limits the idea to the "common goods" like Wikipedia and perhaps the academic world.
Even something as silly as a semantic recipe for cooking is controversial. Somebody built a recipe scraping app and got a massive backlash from food bloggers. Their ad-infested 7000 word lectures intermixed with a recipe are their business model.
Unfortunately, we have very little common good data, that is free from personal or commercial interests. You can think of a million formats and databases but it won't take off without the right incentives.
True. Demonstrable in the health-care IT world. Think of electronic health records. My personal portable electronic health record would either be a bunch of images of scrawled notes and maybe some nice medical images (= nonsemantic web), or it would be in a highly wrought format, i dunno, XML or something, with carefully worked out schemata for everything from flu shot records to heart transplants (= semantic web).
Back in 2007-2010, "electronic health records" (EHR) were spottily and sloppily implemented by some providers. But, in the US, a federal law pushed more widespread implementation. Now my online EHR, and yours, is decidedly app-mediated and non-semantic, on a web site portal. Export to JSON? Hah. No.
The hospitals and health care systems only did it because of incentives.
I happened to work at a B2B SaaS company focused on making connections between hospitals and rehab/skilled nursing providers. A rehab outfit can't decide to accept a patient without seeing her medical records and doctors' orders. So our customers had a real incentive to be able to share records. It worked. But the data we had access to (go read about HL7) was not even close to semantic. And our SQL database schemas were, umm, quinquiremes of Nineveh, really intricate, somewhat brittle. Let's leave privacy issues out of the conversation for a moment. Publishing the schema and accepting random queries would help NOBODY except some partner outfit willing to develop and test useful stuff.
Hey, I got an idea! Let's give them an API! Oh, wait, nobody wants to bother with an API? OK, how about a nice web site! And we're back where we started.
With a universal semantic web, the same problems would crop up everywhere.
Taking someone else's content and republishing it without permission isn't cool, even if you wrap it in a nice machine readable format.
The question isn't even what one can do, because obviously nobody can change how incentives work in a given society.
The question is: is there a timeline in which the right incentives (to share data) start being enforced? How would that play out?
Why would it be semantic?
Case in point: YES to at least remembering that web3 is (also) the Semantic Web. But no, this solution is not semantic data.
Any announcement I missed? The Solid project has existed for a long time, and it seems that many specs are still in very early days.
But as I wanted to point out, which data models are used is not the major obstacle to the semantic web. It is these other problems that are not addressed.
Similar to the OP, one of the things I've realized is that while the dream of getting everyone to use the exact same standards for their data has proved almost impossible to achieve, having a SQL-powered API actually provides a really useful alternative.
The great thing about SQL APIs is that you can use them to alter the shape of the data you are querying.
Let's say there's a database with power plants in it. You need them as "name, lat, lng" - but the database you are querying has "latitude" and "longitude" columns.
If you can query it with a SQL query, you can do this:
select name, latitude as lat, longitude as lng from [global-power-plants]
Here's a demo using exactly that query: https://global-power-plants.datasettes.com/global-power-plan...
That URL gives you back an HTML page, but if you change the extension to .json you get back JSON data:
https://global-power-plants.datasettes.com/global-power-plan...
Or use .csv to get back the data as CSV:
https://global-power-plants.datasettes.com/global-power-plan...
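The same reshaping can be tried locally. This sketch uses Python's `sqlite3` with two made-up rows standing in for the real dataset, runs the aliasing query from above, and serialises the result as JSON and CSV. It is not how Datasette itself produces those formats, just the idea in miniature.

```python
import csv
import io
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    'create table "global-power-plants" (name text, latitude real, longitude real)'
)
# Made-up rows, only here so the query has something to return.
conn.executemany(
    'insert into "global-power-plants" values (?, ?, ?)',
    [("Plant A", 51.5, -0.1), ("Plant B", 40.7, -74.0)],
)

# The consumer picks the column names it wants; the stored schema is untouched.
rows = conn.execute(
    "select name, latitude as lat, longitude as lng from [global-power-plants] order by name"
).fetchall()

as_json = json.dumps([dict(row) for row in rows])

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(rows[0].keys())
writer.writerows(tuple(row) for row in rows)
as_csv = buffer.getvalue()
```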
But what if you need some other format, like Atom or ICS or RDF?
Datasette supports plugins which let you do that. I'm running the datasette-atom plugin (https://datasette.io/plugins/datasette-atom) on this other site. That plugin lets you define Atom feeds using a SQL query like this one:
select
  issues.updated_at as atom_updated,
  issues.id as atom_id,
  issues.title as atom_title,
  issues.body as atom_content,
  repos.html_url || '/issues/' || number as atom_link
from
  issues join repos on issues.repo = repos.id
order by
  issues.updated_at desc
limit
  30
Try that query here: https://github-to-sqlite.dogsheep.net/github?sql=select%0D%0...
The plugin notices that columns with those names are returned, and adds a link to the .atom feed. Here's that URL - you can subscribe to that in your feed reader to get a feed of new GitHub issues across all of the projects I'm tracking in that Datasette instance: https://github-to-sqlite.dogsheep.net/github.atom?sql=select...
As you can see, there's a LOT of power in being able to use SQL as an API language to reshape data into the format that you need to consume.
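For a feel of how little glue that column-naming convention needs, here is a rough sketch that turns any query returning `atom_*` columns into Atom XML. It is not the datasette-atom implementation: `rows_to_atom`, the sample tables and the example.com URL are all made up, and the feed omits the feed-level id and updated elements a valid Atom document requires.

```python
import sqlite3
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def rows_to_atom(cursor, feed_title):
    """Build an Atom document from any query that returns atom_* columns."""
    columns = [d[0] for d in cursor.description]
    missing = {"atom_id", "atom_title", "atom_updated"} - set(columns)
    if missing:
        raise ValueError(f"query must return {sorted(missing)}")
    feed = ET.Element("feed", xmlns=ATOM_NS)
    ET.SubElement(feed, "title").text = feed_title
    for values in cursor:
        row = dict(zip(columns, values))
        entry = ET.SubElement(feed, "entry")
        ET.SubElement(entry, "id").text = str(row["atom_id"])
        ET.SubElement(entry, "title").text = row["atom_title"]
        ET.SubElement(entry, "updated").text = row["atom_updated"]
        if row.get("atom_content") is not None:
            ET.SubElement(entry, "content", type="text").text = row["atom_content"]
        if row.get("atom_link"):
            ET.SubElement(entry, "link", href=row["atom_link"])
    return ET.tostring(feed, encoding="unicode")


# Tiny stand-in for the issues/repos tables used in the query above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table repos (id integer primary key, html_url text);
    create table issues (id integer primary key, repo integer, number integer,
                         title text, body text, updated_at text);
    insert into repos values (1, 'https://example.com/demo/repo');
    insert into issues values (10, 1, 7, 'First issue', 'Body text', '2022-01-01T00:00:00Z');
    """
)
cursor = conn.execute(
    """
    select issues.updated_at as atom_updated, issues.id as atom_id,
           issues.title as atom_title, issues.body as atom_content,
           repos.html_url || '/issues/' || number as atom_link
    from issues join repos on issues.repo = repos.id
    order by issues.updated_at desc limit 30
    """
)
atom_xml = rows_to_atom(cursor, "Recent issues")
```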
https://github.com/mumba-org/mumba
It's badly documented as I have just published it to GitHub, but I hope it gives a clue of how it is supposed to work.
I'm putting the final touches on this project; the main concept is already working, as is 90% of it. But I think exposing SQL is too raw and may not offer the whole picture, because sometimes what is important is not data but pure computation. E.g. suppose you offer deep learning inference where you receive and give back tensors. In the middle of it is a different sort of computation that has nothing to do with databases.
Or suppose you need to access something at a third party before giving an answer, or you want to do it in a distributed fashion without your API consumer even noticing?
APIs are a good answer to that, and in my opinion are superior interfaces. Whatever the semantic web of the future will be, it will need this network of API peers to work as a floor for it.
For instance, you can design a Graph API on top of it. Exposing your data layer directly is bad engineering, as there are a lot of problems you won't be able to solve that way, and that letting clients talk to "you" over a well-defined API will.
To put it simply, in my view the direction the semantic web is pointing in is cool, but its answer is not the right one. This idea of exposing SQLite directly, while cooler, has the same flaws; otherwise something like GraphQL would have taken over the world, as it's not much different from the answer presented here.
With Datasette, my solution is to specifically publish the subset of your data in the schema that you think is suitable for exposing to the outside world. You might have an internal PostgreSQL database, then use my db-to-sqlite tool - https://datasette.io/tools/db-to-sqlite - to extract just a small portion of that into a SQLite database which you periodically publish using Datasette.
The other idea I have is to use views. Imagine having a PostgreSQL database with a couple of documented SQL views that you expose to the outside world. Now you can change your schema any time you like, provided you then update the definition of those views to expose the same shape of data that your external, documented API requires.
As with all APIs of this sort, adding new columns is fine - it's only removing columns or changing the behaviour of existing columns that will cause breakages for clients.
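Here is a small sketch of the views idea, using SQLite through Python instead of PostgreSQL so that it runs anywhere. The table names and the `public_plants` view are hypothetical. The internal schema is refactored in the middle, and the documented view keeps returning the same shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table plants_v1 (name text, latitude real, longitude real);
    insert into plants_v1 values ('Plant A', 51.5, -0.1);
    create view public_plants as
        select name, latitude as lat, longitude as lng from plants_v1;
    """
)
before = conn.execute("select name, lat, lng from public_plants").fetchall()

# Internal refactor: coordinates move to their own table, and the view is redefined to match.
conn.executescript(
    """
    drop view public_plants;
    create table plants_v2 (id integer primary key, name text);
    create table coords (plant_id integer, lat real, lng real);
    insert into plants_v2 (id, name) select rowid, name from plants_v1;
    insert into coords select rowid, latitude, longitude from plants_v1;
    drop table plants_v1;
    create view public_plants as
        select p.name, c.lat, c.lng
        from plants_v2 p join coords c on c.plant_id = p.id;
    """
)
after = conn.execute("select name, lat, lng from public_plants").fetchall()
```

External clients that only ever query `public_plants` see identical rows before and after the change.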
- https://datasette.io/plugins/datasette-ics can be used to generate ICS calendar feeds, which you can subscribe to using desktop calendars or Google Calendar
- https://datasette.io/plugins/datasette-geojson can generate GeoJSON files for any SpatiaLite database table with a geometry column.
https://intercoolerjs.org/2016/05/08/hatoeas-is-for-humans.h...
The idea that metadata can be provided and utilized in a similar manner doesn't strike me as realistic. If it is code consuming the metadata, the flexibility of the uniform interface is wasted. If it is a human consuming the metadata, they want something nice like HTML.
For code, why not just a structured and standardized JSON API?
This appears to be what we have settled on, and I don't see any big advantage extending REST-ful web concepts on top of it. The machines just ignore all that meta-data crap.
So in this version of the idea... because structuring data requires work. Unstandardized data exists already. Some of it is already SQLite. A lot of the rest is in other SQL databases, and that might be a smaller bridge.
Author claims (if I'm understanding correctly) that a static website could easily query sqlites over HTTP, and bam, web 3.0.
Honestly, it's hard for me to think/discuss these ideas without examples, even if contrived. What kind of websites would be built this way? What data will they be querying?
A web app that uses photos and address books on the users phone? An alternative UI for news.yc?
The word "just" is doing a lot of work there! Getting everyone to use the same standardized JSON API format turns out to be incredibly difficult.
This is why I'm a big fan of the idea of using SQL as an API language to redefine the data into the output format that you need, see my comment here: https://news.ycombinator.com/item?id=29900403
Okay, offline first, what does that mean? Should I download the entire 600 MB SQLite database? Should I do it every time it changes? Who will pay for the bandwidth? We cannot employ standard HTTP proxy and caching mechanisms here; they are not meant for 600 MB files.
A typical request using indexes will be less than 10 separate 1KB GET requests, not 50. But yeah, more work needs to be done on performance.
Whether it makes sense to fully download the dataset depends on the project; maybe it does not. But it doesn't have to be a monolithic file. You can use SQLite's multiplex VFS to split the SQLite file into many smaller pieces (and still update the db later!).
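To make the request arithmetic concrete, here is a rough sketch of how a single database page maps onto an HTTP Range request. It assumes a 1 KB page size (taken from the "1KB GET requests" figure above) and hypothetical helper names; it builds headers only and makes no network calls:

```python
def page_range_header(page_number: int, page_size: int = 1024) -> dict:
    """Return the HTTP Range header covering one SQLite page (1-indexed)."""
    if page_number < 1:
        raise ValueError("SQLite page numbers start at 1")
    start = (page_number - 1) * page_size
    end = start + page_size - 1  # HTTP byte ranges are inclusive
    return {"Range": f"bytes={start}-{end}"}


def bytes_transferred(pages_touched: int, page_size: int = 1024) -> int:
    """Total payload if every touched page is fetched exactly once."""
    return pages_touched * page_size


first_page = page_range_header(1)
third_page = page_range_header(3)
# Ten page reads, as in the estimate above, versus the full 600 MB file.
indexed_query_bytes = bytes_transferred(10)
full_download_bytes = 600 * 1024 * 1024
print(first_page, third_page, indexed_query_bytes, full_download_bytes)
```

Under these assumptions an indexed lookup moves about 10 KB, so the cost is dominated by round trips rather than bandwidth.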
Schema-enforced document databases, on the other hand, are neat and mostly people- and machine-readable at the same time.
Another idea: downloading only indexes may greatly reduce number of requests needed to query the data.
Despite the number of requests seeming excessive to us, the performance of this setup is already in the ballpark of your usual underpowered MySQL going through an app server.
Someone has to pay for the infrastructure. Right now the one hosting (not the consumer) pays for the infrastructure. So there is not really any incentive to host data for free - like a kiosk offering free goods and services.
The problem is if you would get paid by sending stuff, everyone would be spamming data everywhere. Imagine if you would have to pay 10c every time someone sent you an e-mail.
Something that I think would help is micro-transactions, built into browsers so that you could easily make one. We already have Bitcoin and other cryptocurrencies, but they are too big to run inside the browser of a mobile phone, and if it wasn't for the high transaction costs, the ledger/blockchain would be even bigger...
Today publishers (those that publish stuff on the web) earn money by showing ads. Ads initially worked very well for a few years around 2000, before people started cheating with bots. But you can still make individual deals with webmasters and choose to trust them.
Also, a lot of "the web" has moved to videos and YouTube. The average web user chooses to watch a video rather than read a text article covering the topic of interest.
This is the standard ISP argument that I never understood. Google pays their ISP and any other ISPs they are using for peering, but they don't HAVE to pay your ISP. You pay for your usage.
The thing it's kinda missing for me is the ability to compose multiple SQLite databases, possibly provided by different domains.
It'd be nice to join together different public datasets. In a weird personal example, if Strava exposed SQLite, I'd love to do a join to weather.com and see when the last time I biked in the rain was.
It'd be cool if one half of some table was at foo.com and I could add a few rows to it on my bar.com domain, and then the combined dataset was queryable as a single unit.
Datasette grew a `--crossdb` option a while back, which means that if you attach multiple SQLite files to the same Datasette instance you can run joins across them: https://docs.datasette.io/en/stable/sql_queries.html#cross-d...
So one option is to download the database you want to join against and run the joins locally. Datasette encourages making the raw SQLite database file available, so if it's less than about 100MB this may be a good way to do it.
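Here is a small sketch of the "download and join locally" option using plain SQLite `ATTACH`, with Python's sqlite3 module. The two databases, their tables and their rows are invented for illustration (in-memory stand-ins for downloaded files), borrowing the biking-in-the-rain example from above:

```python
import sqlite3

# The main database plays the role of the locally held ride log.
conn = sqlite3.connect(":memory:")
# A second, separate database attached under the schema name "weather".
# With real downloads this would be a file path instead of ':memory:'.
conn.execute("ATTACH DATABASE ':memory:' AS weather")

conn.executescript("""
    CREATE TABLE rides (ride_date TEXT, distance_km REAL);
    INSERT INTO rides VALUES
        ('2022-01-03', 21.5), ('2022-01-09', 40.0), ('2022-01-11', 12.3);

    CREATE TABLE weather.daily (day TEXT, rain_mm REAL);
    INSERT INTO weather.daily VALUES
        ('2022-01-03', 0.0), ('2022-01-09', 4.2), ('2022-01-11', 0.0);
""")

# One SQL statement joining tables that live in two different databases.
last_wet_ride = conn.execute("""
    SELECT rides.ride_date, rides.distance_km, daily.rain_mm
    FROM rides
    JOIN weather.daily AS daily ON daily.day = rides.ride_date
    WHERE daily.rain_mm > 0
    ORDER BY rides.ride_date DESC
    LIMIT 1
""").fetchone()

print(last_wet_ride)
```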
If you're willing to do the joins in your client-side code, Datasette's default JSON API can help. You can write an application (including a client-side JavaScript application) which fetches and combines data from multiple different Datasette JSON instances by hitting their APIs.
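A client-side join can be as small as a hash join over the rows each API returns. In this sketch the two lists stand in for JSON responses that have already been fetched and decoded; the field names and values are hypothetical and no request is made:

```python
def join_rows(left, right, left_key, right_key):
    """Inner-join two lists of dict rows using a hash index on the right side."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[left_key], []):
            # On a field-name clash, the value from the left row wins.
            joined.append({**match, **row})
    return joined


# Stand-ins for rows decoded from two different JSON APIs.
rides = [
    {"ride_date": "2022-01-03", "km": 21.5},
    {"ride_date": "2022-01-09", "km": 40.0},
]
weather = [
    {"day": "2022-01-09", "rain_mm": 4.2},
    {"day": "2022-01-10", "rain_mm": 1.0},
]

combined = join_rows(rides, weather, "ride_date", "day")
print(combined)
```

Only rows present on both sides survive, which matches SQL's inner join; a left join would also keep unmatched rides.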
My last idea is the most out-of-left-field: since Datasette lets you define custom SQL functions using Python code, it would be feasible to create a Python function which itself makes a query via the JSON API against another Datasette instance! You could then use that to simulate joins in SQL queries that you run against a single Datasette instance.
I've not built a prototype of this yet, and to be honest I think combining data fetched from multiple JSON APIs (which is possible today) will provide just-as-good results, but it's an interesting potential option.
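As a rough illustration of that last idea, the sketch below registers a Python function as a custom SQL function on a plain sqlite3 connection. The remote lookup is stubbed with a dictionary rather than a real JSON API call, and every name here (`remote_rain_mm`, `rides`) is hypothetical. Datasette would register such a function through its plugin mechanism, which is not shown:

```python
import sqlite3

# Stand-in for a JSON API call to another instance (made-up data, no network).
REMOTE_RAIN_MM = {"2022-01-03": 0.0, "2022-01-09": 4.2}


def remote_rain_mm(day):
    """Look up rainfall for a day; a real version would issue an HTTP request."""
    return REMOTE_RAIN_MM.get(day)  # None becomes SQL NULL


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL as a one-argument function.
conn.create_function("remote_rain_mm", 1, remote_rain_mm)

conn.executescript("""
    CREATE TABLE rides (ride_date TEXT);
    INSERT INTO rides VALUES ('2022-01-03'), ('2022-01-09'), ('2022-01-11');
""")

# The "join" is simulated by calling the function once per row.
rows = conn.execute(
    "SELECT ride_date, remote_rain_mm(ride_date) FROM rides ORDER BY ride_date"
).fetchall()

print(rows)
```

The function runs once per row, so a real implementation would likely need caching or batching to avoid one HTTP round trip for every row.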
1. You design your data model as a set of PHP classes, or you generate the class from any RDF vocabulary such as Schema.org
2. API Platform uses the classes to expose a JSON-LD API with all the typical features (sorting, filtering, pagination…)
3. You use the provided "smart clients" to build a dynamic admin interface or to scaffold Next, Nuxt or React Native apps (these tools rely on the Hydra API description vocabulary, and work with any Hydra-enabled API)
In addition to RDF/JSON-LD/Hydra, API Platform also supports ActivityPub.
Great idea by the way! I had the pleasure of working with API Platform in some Symfony applications in a past gig. I can vouch it's easy enough to use, but the GraphQL integration (at least at that time) was really slow. I have not found PHP to be the ideal runtime for GraphQL.
we love web3 labeled articles here
fortunately the more popular variant of web 3.0 doesn't even need the developer to make a database or anything on the backend. just frontend development, and deploying code once to the nearest node. frontend is optional depending on your userbase.
There's been a technology around for so long, that it is forgotten meanwhile (like the semweb itself): xslt.
A lot can be done by just publishing raw xml data plus a visual representation generated in the browser right before display.
I'm doing so with RDF https://demo.mro.name/geohash.cgi/about, GPX https://demo.mro.name/geohash.cgi/u154c, homegrown xml http://rec.mro.name/stations/b2/2022/01/12/1005 or atom feeds https://demo.mro.name/shaarligo and on and on.
The server is a source of data, its filesystem the database, and the client has to make sense of it. There is no API but GET requests. Works wonders for all but big data queries, naturally.
So you publish raw data (TimBL, you want it that way) plus a recipe for a visual representation and the browser shows a sensible view to begin with.
In a world where my blogpost object has the same information as your blogpost object, this works without a problem.
In a world where I actually want to up my database to you, we could agree on a format.
Both of these cases, from where I stand, seem very unlikely, and we have not even talked about the people that would clone your data 1 to 1 just to host an ad-filled alternative of your site in real time.
But I don't think it should be actually used for anything serious.
And I don't really get the connection with "semantic web", which was essentially idealistic vaporware of the 2000s.
Not going to work unless imposed by some external force. The semantics of the web can more practically be extracted with neural nets, but it's a long tail and there are errors. Lots of good work recently in parsing tables, document layouts and key-value extraction. LayoutLM and its kin come to mind.[1]
[1] https://scholar.google.com/scholar?cites=9435785928704193879...
I poked around the ANSIWAVE BBS and it looks fun!
Are there more than 2 blogposts? Cannot find a posts page.
Is manual labor the reason things turned out the way they did, with google spending whatever it took to index and monetise the whole web the way it did?
Or might money have something to do with it?
Re exposing an entire database of static content: again, reality gets in the way. Websites want to keep control over how they present their data. Not to mention that many news sites segregate their content as public and paywalled. Making raw content available as a structured and queryable database may work for the likes of Wikipedia or arxiv.org. But it's not likely to be adopted by commercial sites.
Would be great if providers could offer data in raw form without the overhead of all the gunk that gets them paid.
The lower case variety kind of survives as a smart thing to do to "help" search engines a little but otherwise has very little real world relevance. All talk of doing anything with on-page information in browsers evaporated a long time ago. E.g. MS had some plans for this with early versions of Edge and there were some nice extensions for Chrome and Firefox as well. Not a thing any more. Most of that got unceremoniously ripped out of browsers a long time ago. At this point it's basically just good SEO practice to use microformats, as search engines can use all the help they can get to figure out what is what on a page. Other than that, whether you render your data to a canvas, a table, or nice semantic HTML has very little relevance for anyone. It's all just pixels that hit your eyeballs in the end. There's nothing else that looks at that information. With the exception of search engines. And they were part of web 1.0 already.
The capital S Semantic Web with ontologies, triple databases, etc. never really got out of the gates and is perpetually stuck in people doing very academic stuff or specialist niche stuff that largely does not matter to anyone else. The exception is graph databases, which are still used in some data/backend teams for some stuff. And of course a few of those also pay lip service to some of the Semantic Web W3C standards from the early 2000s even though that is not the main thing they do anymore. Either way, too much of a specialist thing to call it a semantic web (capital or lower case). Most of the web uses exactly none of this stuff. But nice tools to have if you need them. You could argue a lot of the people involved moved their focus to AI and machine learning, which certainly looks like it is having a very large impact on e.g. search engines.
I guess web3 has that in common with web 3.0 (other than the number 3). There are a few people who desperately (and loudly) want the web to go their way and insist it must be the future. But most people couldn't care less. In the end people just vote with their feet and gravitate to technologies that work for them or solve a problem they have and ignore things that don't do anything useful for them. In the case of Semantic Web, there was nothing there that you could coherently explain (i.e. without using all sorts of abstractions, complex stuff, and simplistic hyperbole). There were a few startups and lots of hype. They did a bunch of stuff. Most of those startups no longer exist or have faded into irrelevance. And the few that survived carved out a few interesting niches but did not end up producing any mainstream, must have technology. Certainly no unicorns there. Wolfram Alpha probably is one of the more well-known ones that actually shipped something useful. But it's a destination and not the web.
Web3 has the same issues. Most threads on HN on web3 devolve into people talking about what it is, ought to be, or isn't and why that is or isn't important. That seems to be impossible to do without using a lot of hyperbole and BS. Very little substance in terms of widely adopted technology or even in terms of what that technology looks like or should look like. It's Web 1.0 all over again. Step 1: Blockchain, Step 2: ????, Step 3: Profit (or not).
Most of the web is just a slightly slicker version of what we had 15 years ago (web 2.0). AJAX definitely became commonplace. We now have mature versions of HTML, SVG, CSS, etc. that actually work. And with WASM we can finally engineer some proper software without having to worry about polyfills and other crazy hacks to make JavaScript do stuff it clearly is not very good at. I'm looking forward to the next 15 years. It's going to be interesting and possibly a wild ride.