This is the first in a short series of blog entries in which I explain some of the design decisions for Wikidata. They are my personal opinion, but they have a strong impact on some features or non-features of Wikidata. This is to explain them.
One of the features – others call it a bug – of Wikidata is that you can choose any item as the value for a property. Many of these choices do not make sense: if you have the article on Paris, saying that its country is goat cheese is not meaningful. Wouldn’t it be great if Wikidata knew which values for a country would make sense, and only allowed you to choose those, instead of allowing any possible value here? Wouldn’t it be great if the community decided that a property like the widely used P107 could actually be restricted to the six possible values they decided on?
I strongly disagree.
Another feature – others call it a bug – of Wikidata is that you can use any property on any item. If you want to add the capital city of Julius Caesar, you’re welcome to do so. Wouldn’t it be great if Wikidata knew which properties make sense for a given item, and would not only restrict you to using those but even list the ones that still have missing values? Wouldn’t it be great if the community could create templates of properties that should all be filled out for a person, or for a city, or a country – and not allow anything else?
I strongly disagree.
I completely agree that smarter suggestions would be great. Some of these could be pretty trivial to implement: count the frequency for the values of a property and make a suggestion based on that. What about suggesting properties? There’s lots of research going on in that area, basically something like “items with these properties also have these properties” – you might have seen that on certain shopping sites.
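To make the two suggestion ideas above a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the triple layout, the sample data, and the function names `suggest_values` and `suggest_properties` are all hypothetical and are not part of Wikidata's software. Note that nothing is forbidden here; unusual values are simply ranked lower.

```python
from collections import Counter


def suggest_values(statements, prop, limit=3):
    """Return the most frequently used values for `prop`, most common first."""
    counts = Counter(value for (_item, p, value) in statements if p == prop)
    return [value for value, _n in counts.most_common(limit)]


def suggest_properties(item_properties, current, limit=3):
    """Suggest properties that co-occur with the ones in `current`."""
    counts = Counter()
    for props in item_properties.values():
        if props & current:                 # item shares at least one property
            counts.update(props - current)  # count the properties still missing
    return [p for p, _n in counts.most_common(limit)]


# Hypothetical (item, property, value) triples, for illustration only.
statements = [
    ("Paris", "country", "France"),
    ("Lyon", "country", "France"),
    ("Berlin", "country", "Germany"),
    ("Paris", "country", "goat cheese"),
    ("Marseille", "country", "France"),
    ("Hamburg", "country", "Germany"),
]

# Hypothetical mapping from item to the set of properties it already uses.
item_properties = {
    "Paris": {"country", "population", "mayor"},
    "Berlin": {"country", "population"},
    "Julius Caesar": {"date of birth", "spouse"},
}

value_suggestions = suggest_values(statements, "country")
property_suggestions = suggest_properties(item_properties, {"country"})
print(value_suggestions)
print(property_suggestions)
```

With this sample data, "goat cheese" still appears as a possible value for "country", but only at the end of the list, which is the difference between a suggestion and a restriction.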
I am all for better suggestions. What I strongly disagree with are strong restrictions. They provide far too much space for drama and edit-warring. Does every country have a capital? What is a country anyway? What should the possible values for the property “gender” be? What are the right properties for presidents?
Anything that the system uses for building its user interface and core functionality – labels and descriptions, for example, or the links to Wikipedia pages – cannot have references. This is something the system simply “believes.” On the other hand, if you add a statement saying that Kosovo is a country, you can add a reference to it. Others might say that Kosovo is a part of Serbia. You can add a reference for that too. But if you want to make the user interface use this kind of information – for example when a property is restricted to countries – the system needs to make a call whether Kosovo is an independent country or not. There is no room for the kind of knowledge diversity that Wikidata is built for.
I perceive the danger that some parts of Wikidata might get stuck in an ontology engineering exercise. I think these exercises can be fundamentally unresolvable, and thus that Wikidata’s mandate should not be to solve them. Wikidata should, in my opinion, work on a less abstract level: Let us enter the authors of Aerosmith’s “I Don’t Want to Miss a Thing”, and not discuss whether authorship can apply to a song or not. Let us trace the genealogy of the British monarch, and not whether officials can only be persons. Are you sure that no donkey has ever become a Roman senator? Can you tell whether drinks should have inventors?
Wikidata allows for a unique collaborative space for humans and bots, much more so than Wikipedia, which already sports a pretty amazing example of such an environment. In Wikipedia, we have bots checking for outdated references to websites, for correct usage of punctuation, etc. In Wikidata we can create bots that check whether a teacher has indeed lived before the death of their student, whether all Roman senators have lived before the 6th century, or whether the populations of the cities of a country add up to less than the population of the country as a whole. And the bots doing these checks will need to find a way to report their results to humans, who can then check whether the bots discovered genuine inconsistencies – either in the real world or in Wikidata – or not.
The world is complex. Wikidata aims to collect structured knowledge about this complex world. The roots of Wikidata, as the name hints, are wikis – and wikis mean freedom. Based on this legacy, Wikidata as a piece of software does not aim to implement restricted types for properties, nor restricted sets of properties for types of items, anytime soon.
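As a rough sketch of what such checking bots might look like, the Python below flags suspicious statements and produces reports for humans instead of rejecting edits. Everything in it is an assumption made for illustration: the data layout, the function names `check_teacher_student` and `check_city_populations`, and the sample records (years are approximate, negative numbers stand for BCE, and one teacher relation is deliberately wrong).

```python
def check_teacher_student(people, teacher_of):
    """Report teachers who were born after their student had already died."""
    reports = []
    for teacher, student in teacher_of:
        born = people[teacher].get("born")
        died = people[student].get("died")
        # Only compare when both dates are known; missing data is not an error.
        if born is not None and died is not None and born > died:
            reports.append(
                f"{teacher} (born {born}) is listed as teacher of "
                f"{student} (died {died})"
            )
    return reports


def check_city_populations(country, country_population, city_populations):
    """Report a country whose cities add up to more than the country itself."""
    total = sum(city_populations.values())
    if total > country_population:
        return [f"cities of {country} sum to {total}, "
                f"more than the country's {country_population}"]
    return []


# Illustrative sample data; years are approximate, negative means BCE.
people = {
    "Socrates": {"born": -470, "died": -399},
    "Plato": {"born": -428, "died": -348},
    "Aristotle": {"born": -384, "died": -322},
}
teacher_of = [
    ("Socrates", "Plato"),
    ("Plato", "Aristotle"),
    ("Aristotle", "Socrates"),  # deliberate error for the bot to find
]

teacher_reports = check_teacher_student(people, teacher_of)
# "Exampleland" and its cities are made up for this sketch.
population_reports = check_city_populations(
    "Exampleland", 1000, {"City A": 600, "City B": 500}
)
for line in teacher_reports + population_reports:
    print(line)
```

The design choice worth noting is that both checks return reports and change nothing, so a human decides whether the inconsistency lies in the data or in the real world.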
(I skipped the boring technical details about why it would be hard to implement and what kind of problems could arise from implementations of the suggested features. There are some serious problems with that, but I wanted to stick with the conceptual reasons.)
Denny, I didn’t elaborate because Emw mostly did so for me. But since you asked, I will do so as well. Let’s start with your examples.
The allowed value for the “country” property should not be “goat cheese”. The allowed values should be those items that contain the statement “instance of: country”. And you’ll notice that this allows ambiguity, because even though Kosovo is not a country according to Serbia, it is a country according to some other countries. Since Wikidata is not about the truth, but about statements and their sources, we can record that a certain country stated that Kosovo is a country, just as we can also record that Serbia stated that Kosovo is a region within Serbia. And yes, Wikidata would also need built-in “instance of” and “subclass of” properties.
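The range check proposed here could be sketched roughly as follows in Python. This is only an illustration under assumptions: the four-part statement layout, the source labels, and the function name `allowed_values` are hypothetical and do not describe an existing Wikidata feature.

```python
def allowed_values(statements, cls):
    """Map each item claimed to be an instance of `cls` to the sources saying so."""
    allowed = {}
    for item, prop, value, source in statements:
        if prop == "instance of" and value == cls:
            allowed.setdefault(item, []).append(source)
    return allowed


# Hypothetical sourced statements: (item, property, value, source).
statements = [
    ("France", "instance of", "country", "hypothetical source A"),
    ("Kosovo", "instance of", "country", "stated by some countries"),
    ("Kosovo", "instance of", "region of Serbia", "stated by Serbia"),
    ("goat cheese", "instance of", "cheese", "hypothetical source B"),
]

countries = allowed_values(statements, "country")
print(sorted(countries))
```

In this sketch a single sourced claim is enough to make an item an allowed value, so Kosovo qualifies while the competing claim stays recorded alongside it, and "goat cheese" does not qualify.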
Does every country have a capital city? Emw mentioned Nauru, which doesn’t. Well, Wikidata already covers that – the “no value” special property value.
Why isn’t it sensible to restrict the domains of properties as well as their ranges? The domain of “capital city” should be restricted to those items that are instances of countries (or countries union administrative divisions, not sure). That is, after all, part of the semantics of the “capital city” concept–why shouldn’t we be able to capture those semantics?
Can authorship apply to a song? I haven’t yet heard the argument that it can’t. But if someone claims it can’t, and others claim it can, obviously they have different definitions of the concept of authorship. And that is fine, because those different definitions can be captured by different properties. What’s wrong with that?
What’s wrong with ontology engineering? You said you think that it can get stuck in a fundamentally unresolvable situation. How and why?
What I’m mainly disappointed about is that your opinion in the article, and thus the design of Wikidata, is presented matter-of-factly, and not as an invitation to a discussion. Don’t you think anyone beyond the Wikidata team should have some input regarding these matters?
John H., I am sorry to do so. I really should be more conscious about the environment and recycle more.
Or am I disappointing you in some other way? It would be helpful if you actually elaborated a little bit.
Denny, you disappoint me.
“The world is complex. Wikidata aims to collect structured knowledge about this complex world.”
The world is complex, but it has structure. Classes or types are a useful way to express that structure. All knowledge representation technologies that I’m aware of — RDF, RDFS, OWL, UML, etc. — support statements about a subject’s class. As an important development in knowledge representation, shouldn’t Wikidata also support class relations?
How many Roman senators were donkeys? How many countries don’t have capitals? I think the proportion of such class exceptions to conforming subjects is very low. (There was a horse, Incitatus, who was said to be treated as a Roman consul, and Nauru is the only country without an official capital; in both cases, the proportion of such exceptions in those classes is well under 1%.) For subjects with disputed classes, why not allow users to specify different classes for a subject, and in those cases relax certain property restrictions? Even then, while there is dispute as to whether, for example, Kosovo is an independent country, everyone agrees that its area cannot be 60 minutes.
I think the overarching promise of Wikidata is as a centralized repository of structured data for Wikipedia. Without support for classes, Wikidata might be slightly more flexible, but it would be much less structured.
Nonetheless, there is value in ontological checking of one’s belief system, as Clive says above. This is only part of error control, though.
I strongly agree. SAP-class business systems would need considerably less customization without a myriad of restrictive data features and, conversely, would benefit considerably from bots checking for inconsistencies.
On ACID: Amazon’s Dynamo paper has proven that availability trumps consistency. It’s good to see the same thinking propagated all the way to the freedom of the user’s application.
A fascinating post – thanks for sharing, Denny!