This is an important project, with the potential to eclipse Wikipedia and maybe even to become the saviour of free software. My reasoning follows.
Currently we program computers by giving them a set of instructions on how to achieve a goal. As computers grow more powerful, we will stop giving detailed instructions. Instead, we will write a general purpose deduction/inference engine, feed in a volume of raw data and let the computer derive the instructions it must follow to achieve the given goal.
There are two parts to such a system: the engine and the data. The engine is something that free software is capable of producing. The missing component is the data. The Wikidata project is this missing component.
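To make the engine/data split concrete, here is a toy sketch of forward chaining, one of the simplest kinds of inference engine. The facts and rules are made up for illustration, and a real system would need variables, unification and far more data; this only shows that the engine stays fixed while the data does the work.

```python
def forward_chain(facts, rules):
    """Derive everything that follows from `facts` under `rules`.

    facts: iterable of strings (the raw data)
    rules: list of (premises, conclusion), where premises is a set of strings
    """
    known = set(facts)
    changed = True
    while changed:  # keep sweeping until a full pass adds nothing new
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical data: swapping these out changes the behaviour, not the engine.
FACTS = {"socrates is a man"}
RULES = [
    ({"socrates is a man"}, "socrates is mortal"),
    ({"socrates is mortal"}, "socrates will die"),
]

if __name__ == "__main__":
    for fact in sorted(forward_chain(FACTS, RULES)):
        print(fact)
```

The engine is a dozen lines; everything interesting lives in FACTS and RULES, which is the part a project like Wikidata could supply.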
I'm convinced that Wolfram Alpha is a glimpse of this future: an engine coupled to a growing body of structured data. Wolfram's end game isn't taking over search, but taking over computer programming and ultimately reasoning. It's just that search is currently a tractable problem for Alpha, one that can pay the bills until it becomes more capable. There will come a day when Alpha is powerful enough to automatically translate natural language into structured data, at which point it will spider the Internet and its database and capabilities will grow explosively.
Free software needs Wikidata to arrive at this endpoint first and avoid being made largely irrelevant by Alpha (or Google?).
I do think that by virtue of breadth Wikipedia's version may become the best data resource in niches that have no specialized structured-data project for them, and it may give other informal-schema, broad-coverage projects like ConceptNet a competitor.
Wolfram Alpha is already completely worthless because it doesn't cite the sources for any of its results. It's basically just a fancy search engine built on top of a garbage dump.
Click "Source information."
The field is very exciting but not without grave risks. I am of the opinion that the final key breakthrough(s) in Artificial Intelligence will be raced towards, not collaborated on. The advantages the possessor of such a system would have would be enough to test the purest of saints. Also, Computational Ethics lags far behind even current primitive attempts at AGI. Furthermore, there are incentives to leave off the moral brakes, since the consequences seem ephemeral and burdening your system with ethics would further increase the search space: doing the right thing is computationally harder than doing what is best for just yourself. The future: step carefully.
For what it is worth, you can merge large data sources with automatic program construction today. I recently started a project in this area. ConceptNet has an excellent API. Then look at Genetic Programming, Markov Logic Networks and inductive logic programming, each with its own strengths and weaknesses. Program Transformation is a related area, where programs are deduced from formal specifications that are unoptimized or non-polynomial in time or space. The most interesting take on this I have seen: http://www.cas.mcmaster.ca/~kahl/HOPS/ANIM/index.html.
There are meta-format standards: XML, RDF, HTML and lately JSON. With these four you are probably covering 80% of the world's published open data; the rest is PDF, MS DOC and MS XLS.
What is missing, and good luck filling this void, is a single format that you can use to describe everything. Personally, I think that such a single format will never exist and looking for one is pointless. Geographical data requires attention to certain details, music data to others; this means two different formats must be used (serialized through XML, RDF, HTML, whatever). If you are thinking about "bridging" different formats and data models, then welcome to the world of RDF/S, OWL, TopicMaps ontologies (or ontologY). I'm not sure you want to live there :)
This new Wikidata, just like Freebase, is trying to collect structured or semi-structured data instead of unstructured data such as that present in Wikipedia. I am happy about the aim (completely unstructured data is basically useless for any serious data reuse and data extraction) but my fear is that they will not succeed as well as they did with Wikipedia. Wikipedia founded its success on the fact that anybody could edit it. In order to edit a Wikipedia page you only need very low technical skills and basic writing skills (plus knowledge of the topic, obviously). Adding and manipulating structured data requires people to conform to a certain mental grid, to a formalized model, to a schema developed by someone else and meant to be respected strictly. The vast majority of people are easily demotivated when they are required to learn something substantial beforehand, and most edits by unskilled users end up removed by watchdogs (something seen often in high-quality Wikipedia articles: edits made by new users are quickly reverted on the grounds that they did not follow some of the many guidelines that must be followed).
My idea is that many problems found in structured-data projects (Freebase, MusicBrainz...) could be alleviated by better interfaces and wide use of automation, two things that Wikipedia projects do not seem to excel at.
The data will be exported in different formats, especially RDF, SKOS, and JSON.
http://meta.wikimedia.org/wiki/Wikidata/Technical_proposal
The question is which schema, ontology or vocabulary will they use to express their data? Who will develop it? Or will they reuse other vocabularies? How do they intend to extend them? If they are RDF based, how will they project to JSON given that there are a dozen different conversion methods?
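To illustrate the RDF-to-JSON problem, here is a rough sketch of three equally plausible projections of the same triples. The example.org URIs, the values and the function names are all invented for this example; none of this is taken from the Wikidata proposal.

```python
import json
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; all URIs are made up.
TRIPLES = [
    ("http://example.org/Berlin", "http://example.org/population", "3500000"),
    ("http://example.org/Berlin", "http://example.org/country", "http://example.org/Germany"),
    ("http://example.org/Berlin", "http://example.org/nickname", "Spree-Athen"),
    ("http://example.org/Berlin", "http://example.org/nickname", "Grey City"),
]


def flat_projection(triples):
    """One JSON object per triple: lossless but verbose."""
    return [{"s": s, "p": p, "o": o} for s, p, o in triples]


def nested_projection(triples):
    """Group by subject, then predicate; values are always lists."""
    grouped = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        grouped[s][p].append(o)
    return {s: dict(preds) for s, preds in grouped.items()}


def local_name(uri):
    """Return the part of a URI after the last '/' or '#'."""
    return uri.replace("#", "/").rstrip("/").rsplit("/", 1)[-1]


def compact_projection(triples):
    """Short keys and unwrapped single values: friendly, but lossy."""
    out = {}
    for s, preds in nested_projection(triples).items():
        out[local_name(s)] = {
            local_name(p): (vals[0] if len(vals) == 1 else vals)
            for p, vals in preds.items()
        }
    return out


if __name__ == "__main__":
    for project in (flat_projection, nested_projection, compact_projection):
        print(project.__name__)
        print(json.dumps(project(TRIPLES), indent=2))
```

The compact form is the one most JSON consumers would want, but it drops the namespaces and makes a property a string or a list depending on how many values happen to exist, which is exactly the kind of decision someone has to make and document.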
How can that document not cite DBpedia, a project that is extracting structured data from Wikipedia infoboxes and has years of experience in doing that?
The fact that their technical proposal document is quite confused about these ground technologies makes me fear that there is more wishful thinking than past experiences.
The important thing is to jump in and make a start. The right way of doing things will become evident as the project evolves.
In theory, it's a good idea. It takes an existing useful data source and puts it in a form that encourages reuse, and since it solves the bootstrapping problem, it's not obviously doomed to failure like the Semantic Web.
I see two potential downsides.
My first concern is that, in practice, it will make editing Wikipedia more complex. There's no inherent reason why this should be the case, but there's no inherent reason why Wikimedia Commons should make editing Wikipedia more complex either, yet it undeniably does.
Secondly, it will prevent a similar source of data from appearing with broader terms of use. For example, OpenLibrary is public domain.
Somewhat ironically, since part of the reason is that you can't copyright facts, they didn't just take the existing data under the same theory, but asked everyone to accept the new licence. I wonder what Wikipedia plans to do?
I would think the more-pressing problem would be the 'viral' nature of the 'share alike' restriction when it came to API use.
Attribution would also seem to be thorny and difficult to police, but not intractable.
[1] e.g. I can make a phone directory and copyright it. You could take all the data out of my phone directory to make your own directory and that would be fine. But you could not simply make copies of my directory and sell those as your own.
The part that's actually interesting is:
- The methodology behind the numbers
- What we think is most likely the case based on the evidence available
- How each fact connects with other facts
- What we think we should do based on the evidence available
Being able to embed facts is definitely a cool use case, but unless you have all the other stuff backing it up when you click the link back to the database then it's pretty much worthless. And curating these sorts of epistemological discussions and third party analyses isn't something that really fits within the Wikimedia mission, so I doubt they will even try.
Because of this I doubt their implementation of the project will be successful, although I do think it's a space that ultimately has potential.
If there were no operations, math would just be numbers on their own -- and what fun is that?
The problem is that the relations turn it into the Semantic Web, and after trying and failing to crack that nut for so long, everyone is turned off of it. Which is too bad, because what was failing was the approach. Trying several shipping routes to the New World and failing each time doesn't mean that the New World doesn't exist.
Not really. Assuming there are only four or five simple relationships like "Knowing fact X is necessary to understand fact Y", then the whole system isn't much more complicated than trackbacks for blog posts.
Edit: As an afterthought, it would make a lot of sense to manage it like a git repository, where someone could submit a pull request for data changes, and then some subgroup or a trusted percentage of the population approves the request and it gets merged into the master dataset.
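A minimal sketch of that pull-request idea, assuming a fixed set of trusted reviewers and a fixed approval threshold. The class names, the threshold of two and the example data are all hypothetical; a real system would also need conflict handling and proper revision history.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    author: str
    changes: dict                      # key -> proposed new value
    approvals: set = field(default_factory=set)


class Dataset:
    def __init__(self, trusted, required_approvals=2):
        self.master = {}               # the accepted data
        self.trusted = set(trusted)    # reviewers whose approval counts
        self.required_approvals = required_approvals
        self.history = []              # (author, changes) for each merge

    def approve(self, pr, reviewer):
        # Only trusted reviewers count, and authors cannot approve themselves.
        if reviewer in self.trusted and reviewer != pr.author:
            pr.approvals.add(reviewer)

    def merge(self, pr):
        """Apply the changes to master if enough reviewers approved."""
        if len(pr.approvals) < self.required_approvals:
            return False
        self.master.update(pr.changes)
        self.history.append((pr.author, dict(pr.changes)))
        return True


if __name__ == "__main__":
    data = Dataset(trusted={"alice", "bob", "carol"})
    pr = PullRequest("dave", {"Berlin.population": 3500000})
    data.approve(pr, "alice")
    print(data.merge(pr))   # not enough approvals yet
    data.approve(pr, "bob")
    print(data.merge(pr))   # threshold reached, change lands in master
    print(data.master)
```

Swapping the fixed threshold for "a trusted percentage of the population" would only change the check inside merge.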
As someone who suffered from an unknown illness (no doctor could figure it out), I can see how such a system would have been helpful. You see a bit of this with WebMD's Symptom Checker, but I feel tools like that aren't comprehensive enough and we end up with a lot of cyberchondria. You can't rely on correlation to find absolute answers, but helping map out symptoms and lifestyle choices may be a tool for finding solutions faster.
It took about a year to resolve my illness. Going to the doctor 2-4 times a week for 10-20 minutes isn't enough to work with when you have no clear-cut diagnosis.
Now, to be clear, I am not talking about replacing doctors or devaluing doctors by allowing everyone to be an expert.
Creating something general purpose like Cyc or the Semantic Web is very challenging, especially because different people have different notions of "meaning". Just look at the back-and-forth arguments over some controversial Wikipedia page. This is 100 times more conceptual and challenging.
1. http://dxplain.org/dxp/dxp.pl
From their site: The DBpedia knowledge base currently describes more than 3.64 million things, out of which 1.83 million are classified in a consistent Ontology, including 416,000 persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organisations, 183,000 species and 5,400 diseases.
"O3.1. Develop and prepare a SPARQL endpoint to the data. Even though a full-fledged SPARQL endpoint to the data will likely be impossible, we can provide a SPARQL endpoints that allows certain patterns of queries depending on the expressivity supported by the back end."
I see the semantic web slowly realizing its actual purpose (which is not related to semantic natural language processing but rather linking data).
"Wikidata is a secondary database. Wikidata will not simply record statements, but it will also record their sources, thus also allowing to reflect the diversity of knowledge available in reality."
That sounds pretty cool to me, because you could potentially upload probabilistic data from statistical analysis. If they make this so that you can tell how reliable the source is, you could upload information that's accurate to a given degree of probability.
It would be very interesting if you could version data by reliability, so that less-reliable data could eventually be replaced by definitive data. This is an Achilles' heel of current data modeling systems.
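Here is a rough sketch of what sourced, reliability-ranked statements could look like. The 0.0 to 1.0 reliability scale, the class names and the example figures are my own assumptions, not anything Wikidata has specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str
    prop: str
    value: object
    source: str
    reliability: float  # assumed scale: 0.0 (a guess) to 1.0 (definitive)


class StatementStore:
    def __init__(self):
        self._statements = {}  # (subject, prop) -> list of Statement

    def add(self, statement):
        if not 0.0 <= statement.reliability <= 1.0:
            raise ValueError("reliability must be between 0.0 and 1.0")
        key = (statement.subject, statement.prop)
        self._statements.setdefault(key, []).append(statement)

    def all(self, subject, prop):
        """Every recorded claim, so diverging sources stay visible."""
        return list(self._statements.get((subject, prop), []))

    def best(self, subject, prop, min_reliability=0.0):
        """Most reliable claim at or above the threshold, or None."""
        candidates = [
            s for s in self._statements.get((subject, prop), [])
            if s.reliability >= min_reliability
        ]
        return max(candidates, key=lambda s: s.reliability, default=None)


if __name__ == "__main__":
    store = StatementStore()
    # Hypothetical figures and sources, for illustration only.
    store.add(Statement("Berlin", "population", 3400000, "survey estimate", 0.6))
    store.add(Statement("Berlin", "population", 3500000, "official census", 0.95))
    print(store.best("Berlin", "population"))
```

Nothing is ever overwritten: the estimate stays on record, and the census figure simply outranks it once it arrives, which is roughly the "replace less-reliable data with definitive data" behaviour described above.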