This is the third in a short series of blog entries in which I explain some of the design decisions for Wikidata. The first one was about restricting property values or properties, the second about veracity and verifiability. The essays represent my personal opinion, and are not to be understood as the official opinion of the Wikidata project.
First, a name that people doing knowledge representation care very, very strongly about: Barbara. Introduced about 2500 years ago by Aristotle (teacher to Alexander the Great, who had conquered the known world and beyond by the age of 33. School and awesome teachers do matter!) and named a millennium later by my favorite philosopher, Boethius. (Seriously, this guy is awesome. He had everything you could have hoped for back in that time, and he lost it all. Read his bio. He had both his sons made consuls of the mightiest empire of the world, and then suddenly he got his riches taken, family members executed, and was awaiting his own execution in exile in a prison. Instead of lamenting, he sat down and wrote a book about what really is important in life. Read his Consolation of Philosophy. It remained on the bestselling list for a few centuries, not without a reason. Kings copied it by hand!) Barbara is part of the logical foundation of anything that has to do with classes. You might know classes as types, categories, genera, or anything else that is somehow taxonomical. Barbara is a type of syllogism, thus a rule for correct reasoning. Modus Barbara states that if all A are B and all B are C, well then also all A are C. As an example: If we know that all billionaires are human, and we know that all humans are mortal, bang, all billionaires are mortal, too.
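To make the rule concrete, here is a minimal Python sketch of Barbara as a procedure. It is only an illustration, not anything Wikidata implements: the function name `barbara_closure` and the pair encoding (A, B) for "all A are B" are assumptions made up for this example.

```python
def barbara_closure(premises):
    """Apply Barbara repeatedly: from (A, B) and (B, C) conclude (A, C).

    Each pair (A, B) stands for the premise "all A are B".
    The pair encoding is an assumption of this sketch.
    """
    known = set(premises)
    changed = True
    while changed:
        changed = False
        for a, b in list(known):
            for b2, c in list(known):
                # Barbara: the middle term must match.
                if b == b2 and (a, c) not in known:
                    known.add((a, c))
                    changed = True
    return known


premises = {("billionaire", "human"), ("human", "mortal")}
conclusions = barbara_closure(premises)
print(("billionaire", "mortal") in conclusions)
```

The loop stops once a full pass adds no new conclusion, so the result is the complete set of "all A are C" statements that follow from the premises by this one rule.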
Aristotle’s idea of categories and logic has profoundly shaped western thinking and civilization, and still continues to have an iron grip on the whole business of knowledge representation. Any book on it will make sure that you understand how really, really important this idea is. Seriously. Any book I read. Brachman? The whole idea of description logics is about nothing else than about describing classes efficiently. Sowa? Barbara on page two. Russell and Norvig’s AIMA? The chapter on knowledge representation starts with a figure of “the upper ontology of the world”, a classification of basically everything. If you find a book on knowledge representation that is not focusing on classification, please let me know.
I have studied philosophy, computer science, and then I did a Ph.D. on the topic of ontologies: one of the connecting things that ran through all of that was the utter importance of categories and their taxonomies, be it in logic, object-oriented design, or in the OWL-based ontologies I worked with. Even in Semantic MediaWiki we didn’t start with much reasoning (this changed over time, but remained within the framework of description logics), but from the very first paper, when Semantic MediaWiki was still just an idea and neither Markus nor I had touched a line of PHP yet, it was clear that categories and subcategories would play a paramount role.
Considering all of this background it is thus very hard for me to make the following confession: I don’t like classification. I dislike Barbara even more. (Nothing personal. I know a few Barbaras in real life, and all of them are great persons. This is not about them.) Maybe “dislike” is too strong a word. I am wary of them. I distrust them. They fill me with unease. I get a queasy, hard-to-explain feeling about them. And therefore I would rather drop them. For now. Let’s see how it works.
I suggested: imagine Wikidata without classification. (OK, my words might have been “Let’s just kill all this classification nonsense”, but you get my idea.) And, to put it briefly, in no way could I have imagined beforehand the reactions I got. Long discussions with and surprised looks from basically everyone involved. Everyone said it is a stupid idea. (Or something like “You might want to reconsider”. Most of them are much more polite than me.) People I worked with. People I learned from. People I admire. To be able to classify and categorize seems to be imperative for Wikidata.
So what is the fuss about? Basically, it is about two properties, the instanceOf-property and the subClassOf-property. (You might want to take a look at the primer on the Wikidata data model in case you feel lost here.) It is obviously impossible not to have them in Wikidata (since the community can create properties as they like, these two properties appeared, as I expected, within the first few days of having properties at all), and I do not suggest banning them. The question is whether they should enjoy any special treatment or whether they should just be properties like all the others.
Think of categories in Wikipedia. They started as normal links and they slowly and slightly chaotically turned into today’s category system, where the category links are treated very differently from normal links, and where there are plenty of category pages and special category functionality. Hidden categories, category trees, subcategories, functionality for dealing with very large categories, etc.
My own position is the following: Treat them like any other property and implement no special meaning for them. Let’s call this “weak classification”. Wikidata has that.
But can we leave out “strong classification”? Or does Wikidata have to know about Barbara and implement her rules? What would we be missing without strong classification? If you add a statement “instance of: Billionaire” to the item on Bruce Wayne, and the item on Billionaire has the statement “Superclass: Person”, you might expect that now automatically Bruce Wayne is also a person, i.e. if you make a query for all persons, you would like Bruce Wayne to be among the correct answers. The current draft of the Wikidata data model specification introduces special ways to model these two properties. (That draft assumes that qualifiers and references would not be applicable to them as they might suggest changes to the semantics of such a statement so that the resulting statement would not be automatically interpretable.)
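The difference between the two readings can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Wikidata data model: statements are plain (item, property, value) triples, the class link is written as “subclass of” on the Billionaire item, and the item “Alfred Pennyworth” is a hypothetical extra added only so that the weak query returns something.

```python
# Hypothetical toy data: (item, property, value) statements.
STATEMENTS = [
    ("Bruce Wayne", "instance of", "Billionaire"),
    ("Billionaire", "subclass of", "Person"),
    ("Alfred Pennyworth", "instance of", "Person"),
]


def values(item, prop):
    """All values stated for a given item and property."""
    return {v for (i, p, v) in STATEMENTS if i == item and p == prop}


def weak_instances(cls):
    """Weak classification: only explicit 'instance of' statements count."""
    return {i for (i, p, v) in STATEMENTS if p == "instance of" and v == cls}


def strong_instances(cls):
    """Strong classification: also follow 'subclass of' links upward."""
    result = set()
    for item, prop, direct_class in STATEMENTS:
        if prop != "instance of":
            continue
        seen, todo = set(), [direct_class]
        while todo:
            current = todo.pop()
            if current in seen:
                continue  # guard against cycles in the class hierarchy
            seen.add(current)
            todo.extend(values(current, "subclass of"))
        if cls in seen:
            result.add(item)
    return result


print(sorted(weak_instances("Person")))
print(sorted(strong_instances("Person")))
```

Under the weak reading Bruce Wayne is missing from the answer unless someone adds “instance of: Person” to his item by hand; under the strong reading the system adds him silently, which is exactly the behaviour discussed here.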
Also, it seems that many people take classification for granted and it might confuse the heck out of them if the system did not offer it. This shows that the costs of not having classification are quite high, and I admit that. So why do I still dislike strong classification?
- People attach more meaning to classification than to property assignments. Saying that “Nikola is a Serbian” has a much stronger meaning than “Nikola is a citizen of Serbia” or “Nikola is born in Serbia”. There is much more attached to it for human readers. One might say “being a Serbian is whatever being a Serbian is defined to be, and if being a Serbian is defined as having the Serbian citizenship then that’s all there is to it” – but in Wikidata we do not yet plan for ways to define what a class means beyond natural language. In Wikidata the meaning of a class is at the intersection of a social construct and how the system acts. If you don’t believe that, check the history of the article on Nikola Tesla and see how much people fight over whether he is a Serbian, a Croat, an American, or a combination thereof. They do not fight that much about where he was actually born or what citizenship he had. And if examples do not convince you, there’s also research about that effect. Just one starter: Yamauchi, T. (2007). The Semantic Web and human inference: A lesson from cognitive science. In Lecture Notes in Computer Science, vol. 4825, pp. 609-622.
- A statement in Wikidata can have a reference, which is usually displayed when the statement is displayed. This way, you can still assess the referenced source and then decide whether you believe it or not. But a classification statement might be hidden in the background and not displayed. If I ask for all persons, and I get back Bruce Wayne, maybe because fictional billionaires are billionaires and billionaires are persons, how and where do I display the reference information for these intermediary steps? Yes, this is a problem for most kinds of inferences. Which is why I would like to keep inferences mostly out of Wikidata until we understand better how the Wikidata and Wikipedia communities interact with the Wikidata system. Inferences might also have effects that are hard to localize, and thus in a wiki also hard to correct.
- Having qualifiers be restricted from the type- and subclass-property would restrict them from being used with certain patterns. The statement “Kosovo is a country” is obviously more problematic than “Kosovo is a country recognized by Germany”, i.e. having a qualifier “recognized by” on the “is a” statement. The statement “Turks are Europeans” might be more controversial than if we were able to add a qualifier subtracting Anatolians. As the latter example of statements shows, such a hierarchy of classes would still be highly controversial. I am wary of Wikidata becoming the place to fight for the one hierarchy of classes for the world, the one world ontology. Wikidata can provide a lot of useful information without taking on the burden of creating the one classification scheme for everything. This holds especially if the classes have a special meaning inside the system.
- Finally: classification appeals to a certain demographic more than to others, a demographic that seems to be overrepresented in Wikipedia and Wikidata already anyway. Just to name one example, the importance of classification might be overrepresented in Western thought in general. (See the book The Geography of Thought, which is very much worth reading, for a discussion of this.) It is unclear if catering to classification might not further reinforce the demographic composition of Wikimedia projects, whereas one of the strategic goals of Wikimedia is to significantly increase the number of editors and their diversity.
In short, my decision is to not prioritize strong classification. Instead I will prioritize more data types, ranks, querying, and result formats. What we will deliver in Wikidata is a trade-off anyway, and it means prioritizing. Strong classification can be added later, together with other inferences (or syllogisms), once we start to understand how Wikidata works as a socio-technical system. (See also the paper by Mathias Schindler and me on how Wikipedia as a socio-technical system has developed in the past in the face of such features.) It can even be added by external services. Also, the effects of a strong classification can be mostly recreated by the community, making type-statements explicitly – something I very strongly encourage and welcome. Already, an increasing number of properties have additional descriptions on their talk pages, which are used by bots in creating reports that support the community in maintaining Wikidata. We will keep an eye on these activities, and see what needs to be done in order to better support them. It also shows that Wikidata manages to be a sufficiently flexible system to actually support these kinds of activities.
I have written this essay in order to give a rationale for my decision and to invite wider participation in its discussion. Despite the mostly negative reactions I got so far, I would like to remain stubborn. But recognizing that these negative reactions come from very smart people makes me naturally wary. By taking this discussion to the wider public I would like to invite a broader audience to think and talk about this subject and to participate in creating Wikidata. Since in the end, just like Wikipedia, Wikidata is not only for everybody, but also by everybody.
Last night I came across [[:fr:Statue du Christ-Roi]] with a list of giant statues of Christ the King and the corresponding wikidata item ‘monumental statue of Christ the King’. I went and tagged all those statues on Wikidata with ‘instance of:monumental statue of Christ the King’.
Sleeping on it I realised this morning that ‘monumental statue of Christ the King’, when used as a class, is trying to do two things at the same time. Today I am going to change all of the items for those statues to ‘instance of:colossal statue’ and ‘depicts:Christ the King’.
Similarly the important property for classifying humans is probably going to be ‘occupation’ even if they are all tagged with ‘instance of:human’ as well.
‘instance of’ will be important but it is not the only important property.
Classes, defined using the ‘subclass of’ property to link specific classes to more general items, seem to have a place defining what values are acceptable with various properties. The ‘occupation’ property will mostly link to items which are a subclass of the ‘occupation’ item. ‘instance of’ should, in general, not link to items which are a subclass of the ‘occupation’ item. Bots can be used to highlight exceptions to these guidelines for review by humans. The human can then change the property, mark the value item as a ‘subclass of:occupation’ or accept it as an exception.
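A bot of the kind described here could be sketched roughly as follows in Python. Everything in it is a hypothetical illustration: the item names, the in-memory tables, and the function names `is_subclass` and `review_report` are made up for the example and do not correspond to any existing Wikidata bot or API.

```python
# Hypothetical 'subclass of' links: class -> set of direct superclasses.
SUBCLASS_OF = {
    "politician": {"occupation"},
    "sculptor": {"artist"},
    "artist": {"occupation"},
}

# Hypothetical statements to check: (item, property, value).
STATEMENTS = [
    ("item A", "occupation", "sculptor"),
    ("item B", "occupation", "colossal statue"),
    ("item C", "instance of", "politician"),
    ("item D", "instance of", "colossal statue"),
]


def is_subclass(cls, ancestor):
    """True if cls is the ancestor itself or reaches it via 'subclass of'."""
    seen, todo = set(), [cls]
    while todo:
        current = todo.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue  # guard against cycles
        seen.add(current)
        todo.extend(SUBCLASS_OF.get(current, ()))
    return False


def review_report(statements):
    """List statements that break the two guidelines, for human review."""
    report = []
    for item, prop, value in statements:
        is_occupation = is_subclass(value, "occupation")
        if prop == "occupation" and not is_occupation:
            report.append((item, prop, value, "value is not a subclass of occupation"))
        elif prop == "instance of" and is_occupation:
            report.append((item, prop, value, "value is a subclass of occupation"))
    return report


for entry in review_report(STATEMENTS):
    print(entry)
```

The report only flags statements; nothing is rejected or changed automatically, so a human can still change the property, add the missing ‘subclass of’ link, or accept the exception.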
This is, I now believe, the appropriate compromise between rules and exceptions and is not something I would ever have come up with without the nudging built into the software by Denny.
Hi. I think the biggest problem with having no hardwired classification scheme is not a philosophical or a theoretical one, it’s a practical one: it makes it quite difficult to suggest properties to the user, as the system has no knowledge of what the item is.
It puts a lot into the hands of the community: making a lot of specialised, domain-specific tools with more or less hard-wired properties, producing periodic reports, and building tools for expressing constraints or patterns (with wiki syntax, as on the DBpedia mappings wiki at http://mappings.dbpedia.org/index.php/Main_Page? It can be made useful, but I can’t help thinking it’s a bit of a suboptimal hack :) ).
On the other hand, a type or class system in Wikidata could be implemented on the same principles that appear in your posts: annotated (qualified) classification, and soft constraints which serve more as patterns than as limits, with additional benefits such as immediate reports to the user. It’s a matter of choice, but I think it tends to be kind of hard for the community to understand all these problems and make (another level of) choices. Maybe this would help the community at no cost in expressiveness, make things go faster, and make users understand the project a little better?
Thank you, thank you, thank you, Denny! All of the textbooks on semantic modelling seem to strongly believe in the ability to naturally classify everything into a neat hierarchy. Computer scientists love to have things fit nicely into disjunct categories. Except that as human beings, we are messy and don’t fit. There are always exceptions, and especially exceptions over time. With the man recently giving birth in Neukölln we now have a jillion or so standard father-mother-children examples shot to hell ;)
Please keep strong classification OUT of WikiData. Otherwise we will end up having to fudge our way around the problems that occur, and since we don’t have good means of representing inference, that will make it immensely difficult to figure out what went wrong. Tagging will permit multiple and overlapping “classification” and more closely fit the Real World ™, imho.
I think you might like the approach presented recently by Stefan Decker that prefers prototypes to classes when modelling data. See Stefan’s slides (http://www.slideshare.net/stefandecker1/stefan-decker-keynote-at-cshals/28) or the somewhat quiet Google+ group on this topic (https://plus.google.com/communities/102405508518643959546).
I think you should be more explicit about what you mean by “strong classification”. Does it just mean knowing automatically the set of all the superclasses of an item? It seems that we can do that through other means anyway (by recursively querying the value of the “instance of” property or, if we want to make it more efficient, by having bots compile lists of subclasses/superclasses for major items).
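The second option, a bot that compiles superclass lists ahead of time, might look something like this minimal Python sketch. The input format (a list of child/parent pairs) and the function name `compile_superclasses` are assumptions for illustration; the class names are borrowed from the examples in the post.

```python
from collections import defaultdict

# Hypothetical 'subclass of' pairs: (class, direct superclass).
SUBCLASS_PAIRS = [
    ("Billionaire", "Person"),
    ("Fictional billionaire", "Billionaire"),
]


def compile_superclasses(pairs):
    """Precompute, for every class with a parent, the set of all its superclasses."""
    parents = defaultdict(set)
    for child, parent in pairs:
        parents[child].add(parent)

    def ancestors(cls, seen):
        # Recursive walk upward; 'seen' also protects against cycles.
        for parent in parents.get(cls, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    return {cls: ancestors(cls, set()) for cls in list(parents)}


table = compile_superclasses(SUBCLASS_PAIRS)
print(sorted(table["Fictional billionaire"]))
```

A lookup in the compiled table replaces the recursive query at read time, at the price of having to rerun the bot whenever a ‘subclass of’ statement changes.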