Wikidata: The first new project from Wikimedia Foundation since 2006

(meta.wikimedia.org)

142 points · mattrichardson · 14y ago · 49 comments

femto 14y ago

Like others here, it's something I've been thinking about for a number of years.

This is an important project, with the potential to eclipse Wikipedia, maybe even growing to be the saviour of free software. My reasoning follows.

Currently we program computers by giving them a set of instructions on how to achieve a goal. As computers grow more powerful, we will stop giving detailed instructions. Instead, we will write a general purpose deduction/inference engine, feed in a volume of raw data and let the computer derive the instructions it must follow to achieve the given goal.

There are two parts to such a system: the engine and the data. The engine is something that free software is capable of producing. The missing component is the data. The Wikidata project is this missing component.
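To make the engine/data split concrete, here is a minimal sketch of a forward-chaining inference loop. It is only an illustration of the general idea, not anything the Wikidata proposal describes; the function name `forward_chain` and the toy facts and rules are assumptions made up for this example.

```python
def forward_chain(facts, rules):
    """Derive new facts until nothing changes.

    facts: set of strings.
    rules: list of (premises, conclusion) pairs, where premises is a set.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire a rule when all its premises are already known.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


# The "data" half: toy facts and rules, purely illustrative.
facts = {"socrates is a man"}
rules = [
    ({"socrates is a man"}, "socrates is mortal"),
    ({"socrates is mortal"}, "socrates will die"),
]

# The "engine" half is generic; only the data changes between domains.
derived = forward_chain(facts, rules)
```

The engine stays the same size however much data is fed in, which is the point being made: the code is the easy part and the body of facts is the scarce one.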

I'm convinced that Wolfram Alpha is a glimpse of this future: an engine coupled to a growing body of structured data. Wolfram's end game isn't taking over search, but taking over computer programming and ultimately reasoning. It's just that search is currently a tractable problem for Alpha, one that can pay the bills until it becomes more capable. There will come a day when Alpha is powerful enough to automatically translate natural language into structured data, at which point it will spider the Internet and its database and capabilities will grow explosively.

Free software needs Wikidata, to arrive at this endpoint first and avoid being made largely irrelevant by Alpha (or Google?)

_delirium 14y ago

I think one problem is that it's really hard to do structured data in general. Projects that pick a specific domain tend to do it much better, because they have a more tractable problem, can build a community with domain expertise, etc., in ways that Wikipedia will have trouble matching unless they plan to collaborate with those projects and/or pull data from them. For example, I think a structured-data version of Wikipedia artist/album infoboxes is going to have a long way to go to catch up to http://musicbrainz.org/, which has a carefully thought out ontology and years of iteration on that specific problem. Alternatively you can try to do a carefully thought out, consistent schema for all metadata, but the Cyc project shows how hard that is.

I do think that by virtue of breadth Wikipedia's version may become the best data resource in niches that have no specialized structured-data project for them, and it may give other informal-schema, broad-coverage projects like ConceptNet a competitor.

Alex3917 14y ago

"Free software needs Wikidata, to [] avoid being made largely irrelevant by Alpha"

Wolfram Alpha is already completely worthless because it doesn't cite the sources for any of its results. It's basically just a fancy search engine built on top of a garbage dump.

mkr-hn 14y ago

http://www.wolframalpha.com/input/?i=how+heavy+is+earth

Click "Source information."

Dn_Ab 14y ago

The way I read this it looks like you are placing the data part as more difficult? Although the system you speak about sounds like it is AI Complete. Figuring out how the human mind manages to maneuver combinatorial explosions in interesting search spaces is a very hard problem.

The field is very exciting but not without grave risks. I am of the opinion that the final key breakthrough(s) in Artificial Intelligence will be raced and not collaborated towards. The advantages the possessor of such a system would have would be enough to test the purest of saints. Also, Computational Ethics lags far behind even current primitive attempts at AGI. Furthermore, there are incentives to leave off the moral brakes, since the consequences seem ephemeral and burdening your system with ethics would further increase the search space: doing the right thing is computationally harder than doing what is best for just yourself. The future: step carefully.

For what it is worth, you can merge large data sources with automatic program construction today. I recently started a project in this area. ConceptNet has an excellent API. Then look at Genetic Programming, Markov Logic Networks, and inductive logic programming, each with its own strengths and weaknesses. Program Transformation is a related area, which deduces programs from formal specifications that are unoptimized or non-polynomial in time or space. The most interesting take on this I have seen: http://www.cas.mcmaster.ca/~kahl/HOPS/ANIM/index.html.

jsmcgd 14y ago

I'm with you to an extent, but why do you think Wikidata in particular will be the missing component and not some other service like Freebase or DBpedia?

femto 14y ago

I don't really. Substitute any free body of structured data for Wikidata, or even view them as one body of data, which happens to be spread across multiple servers (and maybe requiring some translation for unification).

judofyr 14y ago

Missing from the FAQ: What's the difference between Freebase and Wikidata?

_delirium 14y ago

It looks like the main difference is two-way integration: instead of just scraping data from Wikipedia dumps to produce a structured database (as Freebase and DBpedia do), it's going to store the canonical version of some of the information there, and pull from it to populate the infoboxes. One of the motivations seems to be to keep the data in sync across Wikipedia languages, so an addition or fix propagates to them all, which is currently done somewhat awkwardly by a mix of manual and bot measures.
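A rough sketch of that two-way integration, assuming a hypothetical central store keyed by item and property. `CentralStore`, `render_infobox`, the label table and the numbers are all made up for illustration and say nothing about how Wikidata will actually be built.

```python
from dataclasses import dataclass, field


@dataclass
class CentralStore:
    """Hypothetical canonical store: one value per (item, property)."""
    facts: dict = field(default_factory=dict)

    def set_fact(self, item: str, prop: str, value: str) -> None:
        self.facts[(item, prop)] = value

    def get_fact(self, item: str, prop: str) -> str:
        return self.facts[(item, prop)]


# Per-language labels for the same property (illustrative only).
LABELS = {
    "en": {"population": "Population"},
    "de": {"population": "Einwohner"},
}


def render_infobox(store: CentralStore, item: str, lang: str) -> str:
    """Render one infobox line by pulling the value from the central store."""
    label = LABELS[lang]["population"]
    return f"{label}: {store.get_fact(item, 'population')}"


store = CentralStore()
store.set_fact("ExampleCity", "population", "1000")  # placeholder value
before = [render_infobox(store, "ExampleCity", lang) for lang in ("en", "de")]

# A single fix in the central store reaches every language edition.
store.set_fact("ExampleCity", "population", "1001")
after = [render_infobox(store, "ExampleCity", lang) for lang in ("en", "de")]
```

Each language edition keeps its own labels and layout, while the value itself lives in one place, so no bot has to copy a correction from one wiki to the next.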

sjaakkkkk 14y ago

For the interested reader, here is a cool paper on Information Arbitrage Across Multi-lingual Wikipedia: http://www.cond.org/paper_202.pdf

huherto 14y ago

So they are adding an extra layer?...Who said that CS is the science where everything is solved with an extra level of indirection?

wslh 14y ago

And DBpedia?

gojomo 14y ago

Also related: Factual.com.

manuletroll 14y ago

This might be very interesting if it's implemented in a sane way. Unfortunately there doesn't seem to be a very widely-adopted standard in the world of open data for now.

gioele 14y ago

What does it mean "a very widely-adopted standard in the world of open data"? "standard" of what?

There are meta-format standards: XML, RDF, HTML and lately JSON. With these four you are probably covering 80% of the world's published open data; the rest is PDF, MS DOC and MS XLS.

What is missing, and good luck filling this void, is a single format that you can use to describe everything. Personally, I think that such a single format will never exist and looking for one is pointless. Geographical data requires attention to certain details, music data to others; this means two different formats must be used (serialized through XML, RDF, HTML, whatever). If you are thinking about "bridging" different formats and data models, then welcome to the world of RDF/S, OWL, TopicMaps ontologies (or ontologY); I'm not sure you want to live there :)

This new Wikidata, just like Freebase, is trying to collect structured or semi-structured data instead of unstructured data such as that present in Wikipedia. I am happy about the aim (completely unstructured data is basically useless for any serious data reuse and data extraction), but my fear is that they will not succeed as well as they did with Wikipedia. Wikipedia founded its success on the fact that anybody could edit it. In order to edit a Wikipedia page you need only very low technical skills and basic writing skills (plus knowledge of the topic, obviously). Adding and manipulating structured data requires people to obey a certain mental grid, a formalized model, a schema developed by someone and put in place to be respected strictly. The vast majority of people are easily demotivated when they are required to learn something substantial beforehand, and most of the edits of unskilled users end up removed by watchdogs (something seen often in high-quality Wikipedia articles: edits made by new users are quickly reverted on the grounds that they did not follow some of the many guidelines that must be followed).

My idea is that many problems found in structured-data projects (Freebase, MusicBrainz...) could be alleviated by better interfaces and a wide use of automation, both things that Wikipedia projects do not seem to excel in.

icebraining 14y ago

RDF has been adopted by some pretty big data websites, and apparently that's one of the formats they plan to support:

    The data will be exported in different formats, especially RDF, SKOS, and JSON.

http://meta.wikimedia.org/wiki/Wikidata/Technical_proposal

gioele 14y ago

Technically unsound: RDF is a relationship model and a meta-model (think XML Infoset), SKOS is a vocabulary (think XHTML) and JSON is a serialization format (think XML or RDF/N3).

The question is which schema, ontology or vocabulary will they use to express their data? Who will develop it? Or will they reuse other vocabularies? How do they intend to extend them? If they are RDF based, how will they project to JSON given that there are a dozen different conversion methods?
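To illustrate the distinction between a data model and its serializations, here is a small sketch that writes one RDF-style statement as an N-Triples line and as two different JSON projections. The URIs are made-up `example.org` names and both JSON shapes are hypothetical conventions invented for this example, not any standard mapping.

```python
import json

# One statement in the RDF model: (subject, predicate, object).
# The URIs are made-up examples, not a real vocabulary.
triple = (
    "http://example.org/item/Berlin",
    "http://example.org/prop/country",
    "http://example.org/item/Germany",
)


def to_ntriples(t):
    """Serialize the triple as one N-Triples line."""
    s, p, o = t
    return f"<{s}> <{p}> <{o}> ."


def to_json_nested(t):
    """Projection A: subject -> predicate -> list of objects."""
    s, p, o = t
    return json.dumps({s: {p: [o]}}, sort_keys=True)


def to_json_flat(t):
    """Projection B: a flat list of statement records."""
    s, p, o = t
    return json.dumps([{"s": s, "p": p, "o": o}], sort_keys=True)


ntriples = to_ntriples(triple)
nested = to_json_nested(triple)
flat = to_json_flat(triple)
```

The statement is the same in all three outputs, yet a consumer written against the nested JSON cannot read the flat one, which is why "export as RDF and JSON" leaves the schema and mapping questions open.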

How can that document not cite DBpedia, a project that is extracting structured data from Wikipedia infoboxes and has years of experience in doing that?

The fact that their technical proposal document is quite confused about these ground technologies makes me fear that there is more wishful thinking than past experiences.

mapleoin 14y ago

I think whatever they choose to implement it in has a good chance at becoming the next de facto standard.

femto 14y ago

Does the standard really matter? If it's machine understandable, it should be able to be automatically translated into any other format in the future.

The important thing is to jump in and make a start. The right way of doing things will become evident as the project evolves.

halo 14y ago

tl;dr: spin-off Wikipedia infoboxes into a separate project with an API, and then use that data to bootstrap an open data project with broader goals.

In theory, it's a good idea. It takes an existing useful data source and puts it in a form that encourages reuse, and since it solves the bootstrapping problem, it's not obviously doomed to failure like the Semantic Web.

I see two potential downsides.

My first concern is that, in practice, it will make editing Wikipedia more complex. There's no inherent reason why this should be the case, but there's no inherent reason why Wikimedia Commons should make editing Wikipedia more complex either, yet it undeniably does.

Secondly, it will prevent a similar source of data from appearing with broader terms of use. For example, OpenLibrary is public domain.

ZeroGravitas 14y ago

Is it even possible to have a database of factual content under CC-BY-SA? This is part of the reason OpenStreetMap is moving to ODbL.

Somewhat ironically, since part of the reason is that you can't copyright facts, they didn't just take the existing data under the same theory, but asked everyone to accept the new licence. I wonder what Wikipedia plans to do?

roc 14y ago

I don't see why you couldn't have a database of facts under CC-BY-SA. You can't copyright individual facts, but you absolutely can copyright a collection of facts as a collection. [1]

I would think the more-pressing problem would be the 'viral' nature of the 'share alike' restriction when it came to API use.

Attribution would also seem to be thorny and difficult to police, but not intractable.

[1] e.g. I can make a phone directory and copyright it. You could take all the data out of my phone directory to make your own directory and that would be fine. But you could not simply make copies of my directory and sell those as your own.

indubitably 14y ago

What editing interface could possibly be more complex than the current system of Infobox "markup"? If Wikidata does nothing besides make it easier to edit those infoboxen, it will be a success.

Alex3917 14y ago

This is actually a startup idea I've had for a while now. It's a great idea in theory, but it's very tricky in practice. Facts have a mysterious way of vanishing if you look closely enough at them, and the raw numbers themselves don't actually tell you anything.

The part that's actually interesting is:

- The methodology behind the numbers

- What we think is most likely the case based on the evidence available

- How each fact connects with other facts

- What we think we should do based on the evidence available

Being able to embed facts is definitely a cool use case, but unless you have all the other stuff backing it up when you click the link back to the database then it's pretty much worthless. And curating these sorts of epistemological discussions and third party analyses isn't something that really fits within the Wikimedia mission, so I doubt they will even try.

Because of this I doubt their implementation of the project will be successful, although I do think it's a space that ultimately has potential.

david927 14y ago

You couldn't be more right, and I think the key here is: How each fact connects with other facts

If there were no operations, math would just be numbers on their own -- and what fun is that?

The problem is that the relations turn it into the Semantic Web, and after trying and failing to crack that nut for so long, everyone is turned off of it. Which is too bad, because what was failing was the approach. Trying several shipping routes to the New World and failing each time doesn't mean that the New World doesn't exist.

Alex3917 14y ago

"The problem is that the relations turn it into the Semantic Web"

Not really. Assuming there are only four or five simple relationships like "Knowing fact X is necessary to understand fact Y", then the whole system isn't much more complicated than trackbacks for blog posts.

debacle 14y ago

My concern about the potential for abuse in this project is much greater than for Wikipedia. How is Wikimedia going to ensure that there are no malicious edits to this data? Any changes will almost certainly need stringent peer review.

Edit: As an afterthought, it would make a lot of sense to manage it like a git repository, where someone could submit a pull request for data changes, and then some subgroup or a trusted percentage of the population approves the request and it gets merged into the master dataset.
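The pull-request idea can be sketched as follows, assuming a fixed set of trusted reviewers and a simple approval threshold. `ChangeRequest`, `Dataset`, the reviewer names and the key format are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A proposed edit to one key of the dataset (hypothetical model)."""
    key: str
    new_value: str
    approvals: set = field(default_factory=set)


class Dataset:
    def __init__(self, reviewers, required_approvals):
        self.master = {}
        self.reviewers = set(reviewers)
        self.required = required_approvals

    def approve(self, request, reviewer):
        # Only trusted reviewers count toward the threshold.
        if reviewer in self.reviewers:
            request.approvals.add(reviewer)

    def merge(self, request):
        """Apply the change only once enough reviewers approved it."""
        if len(request.approvals) < self.required:
            return False
        self.master[request.key] = request.new_value
        return True


data = Dataset(reviewers={"alice", "bob", "carol"}, required_approvals=2)
req = ChangeRequest(key="Earth/mass_kg", new_value="5.97e24")

data.approve(req, "alice")
data.approve(req, "mallory")       # not a trusted reviewer, ignored
merged_early = data.merge(req)     # only one valid approval so far

data.approve(req, "bob")
merged_late = data.merge(req)      # threshold reached
```

A real system would also need to handle conflicting requests against the same key and to decide who counts as trusted, which is where most of the difficulty lies.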

femto 14y ago

Given that the data is structured, to some extent it should be possible to automatically check its consistency.
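As a small sketch of what such an automatic check might look like, assume a person record with hypothetical `birth_year` and `death_year` fields. The field names and rules are assumptions chosen for illustration.

```python
def check_person(record):
    """Return a list of consistency problems found in one record.

    The field names and rules are illustrative assumptions.
    """
    problems = []
    born = record.get("birth_year")
    died = record.get("death_year")

    if not isinstance(born, int):
        problems.append("birth_year must be an integer")
    if died is not None and not isinstance(died, int):
        problems.append("death_year must be an integer or missing")
    if isinstance(born, int) and isinstance(died, int) and died < born:
        problems.append("death_year precedes birth_year")
    return problems


ok = check_person({"name": "A", "birth_year": 1900, "death_year": 1980})
bad = check_person({"name": "B", "birth_year": 1900, "death_year": 1880})
```

Checks like this catch edits that contradict the schema or each other, but not a plausible value that is simply wrong, so they complement review rather than replace it.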

john61 14y ago

OpenStreetMap has the same problem and it handles it well.

tomkin 14y ago

One area I really want to see this take off in is Medicine.

As someone who has suffered from an unknown illness (no doctor could figure it out), I can rationalize how such a system would have been helpful. You see a bit of this with WebMD's Symptom Checker, but I feel tools like that aren't comprehensive enough and we end up with a lot of cyberchondria. You can't rely on correlation to find absolute answers, but helping map out symptoms and lifestyle choices may be a tool for finding solutions faster.

It took about a year to resolve my illness. Going to the doctor 2-4 times a week for 10-20 minutes isn't enough to work with when you have no clear-cut diagnosis.

Now, to be clear, I am not talking about replacing doctors or devaluing doctors by allowing everyone to be an expert.

chintan 14y ago

In the medical domain, there do exist large structured knowledge bases and "expert systems" for diagnosis. Read up on DXplain[1], MYCIN[2] and the UMLS[3]. Even in biology, there seems to be significant activity in formalizing knowledge. It literally took decades to develop and refine these knowledge bases.

Creating something general purpose like Cyc or the Semantic Web is very challenging, especially because different people have different notions of "meaning". Just look at the back-and-forth arguments over some controversial Wikipedia page. This is 100 times more conceptual and challenging.

1. http://dxplain.org/dxp/dxp.pl

2. http://en.wikipedia.org/wiki/Mycin

3. http://www.nlm.nih.gov/research/umls/

sjaakkkkk 14y ago

For people interested in this subject, you might want to check out the DBpedia project: http://dbpedia.org/About. They have been extracting structured data from Wikipedia for quite some time already and allow you to query their database with SPARQL.

From their site: The DBpedia knowledge base currently describes more than 3.64 million things, out of which 1.83 million are classified in a consistent Ontology, including 416,000 persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organisations, 183,000 species and 5,400 diseases.
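For a feel of what querying with SPARQL involves, here is a sketch that only builds the request URL and does not contact any server. The endpoint address and the `dbo:Film` class name are assumptions about DBpedia's current setup and should be checked against its documentation; the `query` parameter is the one defined by the SPARQL protocol.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed public endpoint; check the DBpedia site for the current address.
ENDPOINT = "http://dbpedia.org/sparql"

# Count resources typed as films. The class name dbo:Film is an
# assumption about the DBpedia ontology and may need adjusting.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT (COUNT(?film) AS ?n) WHERE { ?film a dbo:Film }
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as a GET request URL."""
    return f"{endpoint}?{urlencode({'query': query.strip()})}"


url = build_request_url(ENDPOINT, QUERY)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
```

Fetching `url` with any HTTP client would return the result set; the response format depends on the server and the `Accept` header, so that part is left out here.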

jasonkolb 14y ago

Nice to see they're going to support SPARQL:

"O3.1. Develop and prepare a SPARQL endpoint to the data. Even though a full-fledged SPARQL endpoint to the data will likely be impossible, we can provide a SPARQL endpoints that allows certain patterns of queries depending on the expressivity supported by the back end."

I see the semantic web slowly realizing its actual purpose (which is not related to semantic natural language processing but rather linking data).

nsns 14y ago

Hats off to Wikimedia, a beacon of the true ideals of the free Internet; they've never tried to monetize their substantial achievements, they've really made a difference, and they've actually realized what for other companies has been merely lip service (i.e. freeing up information).

jasonkolb 14y ago

Now this is interesting (from the page):

"Wikidata is a secondary database. Wikidata will not simply record statements, but it will also record their sources, thus also allowing to reflect the diversity of knowledge available in reality."

That sounds pretty cool to me, because you could potentially upload probabilistic data from statistical analysis. If they make this so that you can tell how reliable the source is, you could upload information that's accurate to a given degree of probability.

It would be very interesting if you could version data by reliability, so that less-reliable data could eventually be replaced by definitive data. This is an Achilles' heel of current data modeling systems.
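One way to picture statements that carry their sources, sketched under the assumption that each source gets a numeric reliability score. The `Statement` fields, the scores and the sample values are all hypothetical; the quoted page only says that sources will be recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One claim about an item, kept together with where it came from."""
    item: str
    prop: str
    value: float
    source: str
    reliability: float  # hypothetical score in [0, 1]


def best_statement(statements, item, prop):
    """Pick the most reliable statement; the others stay in the list."""
    candidates = [s for s in statements if s.item == item and s.prop == prop]
    return max(candidates, key=lambda s: s.reliability, default=None)


# Placeholder data: two conflicting claims about the same property.
statements = [
    Statement("X", "height_m", 100.0, "estimate from survey A", 0.6),
    Statement("X", "height_m", 102.5, "official measurement B", 0.95),
]
best = best_statement(statements, "X", "height_m")
```

Nothing is overwritten: the weaker estimate remains available, and adding a more reliable statement later changes what `best_statement` returns without deleting history.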

Monotoko 14y ago

It's hardly new... it's been a non starter for about 5 years: http://lists.wikimedia.org/pipermail/wikidata-l/

nathell 14y ago

This kind of reminds me of http://dabanese.blogspot.com/2009/09/introduction.html
