A categorical imperative?


  • Denny Vrandecic
  • 12. September 2013

(The German version of this article is here.)

This is the third in a short series of blog entries in which I explain some of the design decisions for Wikidata. The first one was about restricting property values or properties, the second about veracity and verifiability. The essays represent my personal opinion, and are not to be understood as the official opinion of the Wikidata project.

First, a name that people doing knowledge representation care very, very strongly about: Barbara. She was introduced about 2500 years ago by Aristotle (teacher to Alexander the Great, who had conquered the known world and beyond by the age of 33. School and awesome teachers do matter!) and named a millennium later by my favorite philosopher, Boethius. (Seriously, this guy is awesome. He had everything you could have hoped for back in that time, and he lost it all. Read his bio. He had both his sons made consuls of the mightiest empire of the world, and then suddenly he had his riches taken, family members executed, and was awaiting his own execution in exile in a prison. Instead of lamenting, he sat down and wrote a book about what really is important in life. Read his Consolation of Philosophy. It remained on the bestseller list for a few centuries, and not without reason. Kings copied it by hand!) Barbara is part of the logical foundation of anything that has to do with classes. You might know classes as types, categories, genera, or anything else that is somehow taxonomical. Barbara is a type of syllogism, and thus a rule for correct reasoning. Modus Barbara states that if all A are B and all B are C, well, then all A are C, too. As an example: If we know that all billionaires are human, and we know that all humans are mortal, bang, all billionaires are mortal, too.
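For readers who prefer code to Latin, here is a minimal Python sketch of what applying Barbara mechanically amounts to. It is only an illustration: the function name `barbara_closure` and the encoding of "all A are B" as a pair `(A, B)` are my assumptions, not anything from Wikidata.

```python
def barbara_closure(premises):
    """Apply Barbara until nothing new follows.

    Each pair (A, B) is read as "all A are B". From (A, B) and (B, C)
    the rule derives (A, C). The encoding is an assumption for this sketch.
    """
    known = set(premises)
    changed = True
    while changed:
        changed = False
        for a, b in list(known):
            for b2, c in list(known):
                if b == b2 and (a, c) not in known:
                    known.add((a, c))
                    changed = True
    return known


premises = {("billionaire", "human"), ("human", "mortal")}
conclusions = barbara_closure(premises) - premises
print(conclusions)  # {('billionaire', 'mortal')}
```

The loop keeps applying the rule until a full pass adds nothing, so longer chains of classes are covered as well.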

Aristotle’s idea of categories and logic has profoundly shaped western thinking and civilization, and still continues to have an iron grip on the whole business of knowledge representation. Any book on it will make sure that you understand how really, really important this idea is. Seriously. Any book I read. Brachman? The whole idea of description logics is about nothing other than describing classes efficiently. Sowa? Barbara on page two. Russell and Norvig’s AIMA? The chapter on knowledge representation starts with a figure of “the upper ontology of the world”, a classification of basically everything. If you find a book on knowledge representation that does not focus on classification, please let me know.

I have studied philosophy and computer science, and then I did a Ph.D. on the topic of ontologies: one of the connecting threads that ran through all of that was the utter importance of categories and their taxonomies, be it in logic, object-oriented design, or in the OWL-based ontologies I worked with. Even in Semantic MediaWiki we didn’t start with much reasoning (this changed over time, but remained within the framework of description logics), but from the very first paper, when Semantic MediaWiki was still just an idea and neither Markus nor I had touched a line of PHP yet, it was clear that categories and subcategories would play a paramount role.

Considering all of this background, it is very hard for me to make the following confession: I don’t like classification. I dislike Barbara even more. (Nothing personal. I know a few Barbaras in real life, and all of them are great people. This is not about them.) Maybe “dislike” is too strong a word. I am wary of them. I distrust them. They fill me with unease. I get a queasy, hard-to-explain feeling about them. And therefore I would rather drop them. For now. Let’s see how it works.

I suggested that we imagine Wikidata without classification. (OK, my words might have been “Let’s just kill all this classification nonsense”, but you get my idea.) And, to put it briefly, in no way could I have imagined beforehand the reactions I got. Long discussions with and surprised looks from basically everyone involved. Everyone said it is a stupid idea. (Or something like “You might want to reconsider”. Most of them are much more polite than me.) People I worked with. People I learned from. People I admire. To be able to classify and categorize seems to be imperative for Wikidata.

So what is the fuss about? Basically, it is about two properties, the instanceOf-property and the subClassOf-property. (You might want to take a look at the primer on the Wikidata data model in case you feel lost here.) It is obviously impossible not to have them in Wikidata (since the community can create properties as they like, these two properties appeared, as I expected, within the first few days of having properties at all), and I do not suggest banning them. The question is whether they should enjoy any special treatment or whether they should just be properties like all the others.

Think of categories in Wikipedia. They started as normal links and they slowly and slightly chaotically turned into today’s category system, where the category links are treated very differently from normal links, and where there are plenty of category pages and special category functionality. Hidden categories, category trees, subcategories, functionality for dealing with very large categories, etc.

My own position is the following: Treat them like any other property and implement no special meaning for them. Let’s call this “weak classification”. Wikidata has that.

But can we leave out “strong classification”? Or does Wikidata have to know about Barbara and implement her rules? What would we be missing without strong classification? If you add a statement “instance of: Billionaire” to the item on Bruce Wayne, and the item on Billionaire has the statement “Superclass: Person”, you might expect that now automatically Bruce Wayne is also a person, i.e. if you make a query for all persons, you would like Bruce Wayne to be among the correct answers. The current draft of the Wikidata data model specification introduces special ways to model these two properties. (That draft assumes that qualifiers and references would not be applicable to them as they might suggest changes to the semantics of such a statement so that the resulting statement would not be automatically interpretable.)
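To make the difference concrete, here is a small Python sketch of the Bruce Wayne example. Everything in it is an assumption for illustration: the dictionary layout, the property labels "instance of" and "subclass of", and the function names are made up and do not reflect the actual Wikidata data model or API.

```python
# Hypothetical toy data: each item maps property labels to lists of values.
items = {
    "Bruce Wayne": {"instance of": ["Billionaire"]},
    "Billionaire": {"subclass of": ["Person"]},
    "Person": {},
}


def weak_instances(cls):
    """Weak classification: only explicit 'instance of' statements count."""
    return {
        name
        for name, statements in items.items()
        if cls in statements.get("instance of", [])
    }


def superclasses(cls):
    """Return cls plus every class reachable by following 'subclass of'."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(items.get(current, {}).get("subclass of", []))
    return seen


def strong_instances(cls):
    """Strong classification: instances of any subclass also count."""
    return {
        name
        for name, statements in items.items()
        if any(cls in superclasses(c) for c in statements.get("instance of", []))
    }


print(weak_instances("Person"))    # set()
print(strong_instances("Person"))  # {'Bruce Wayne'}
```

With weak classification the query for persons comes back empty unless someone also states "instance of: Person" on Bruce Wayne explicitly. With strong classification the system follows the subclass chain for you.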

Also, it seems that many people take classification for granted, and it might confuse the heck out of them if the system did not offer it. This shows that the costs of not having classification are quite high, and I admit that. So why do I still dislike strong classification?

In short, my decision is to not prioritize strong classification. Instead I will prioritize more data types, ranks, querying, and result formats. What we will deliver in Wikidata is a trade-off anyway, and that means prioritizing. Strong classification can be added later, together with other inferences (or syllogisms), once we start to understand how Wikidata works as a socio-technical system. (See also the paper by Mathias Schindler and me on how Wikipedia as a socio-technical system has developed in the past in the face of such features.) It can even be added by external services. Also, the effects of a strong classification can be mostly recreated by the community, making type-statements explicitly – something I very strongly encourage and welcome. Already, an increasing number of properties have additional descriptions on their talk pages, which are used by bots in creating reports that support the community in maintaining Wikidata. We will keep an eye on these activities, and see what needs to be done in order to better support them. It also shows that Wikidata manages to be a sufficiently flexible system to actually support these kinds of activities.

I have written this essay in order to give a rationale for my decision and to invite wider participation in its discussion. Despite the mostly negative reactions I got so far, I would like to remain stubborn. But recognizing that these negative reactions come from very smart people naturally makes me wary. By taking this discussion to the wider public I would like to invite a broader audience to think and talk about this subject and to participate in creating Wikidata. After all, just like Wikipedia, Wikidata is not only for everybody, but also by everybody.

  1. I think you should be more explicit about what you mean by “strong classification”. Does it just mean knowing automatically the set of all the superclasses of an item? It seems that we can do that through other means anyway (by recursively querying the value of the “instance of” property or, if we want to make it more efficient, by having bots compile lists of subclasses/superclasses for major items).

    Comment by zolo on 12. September 2013 at 19:24

  2. I think you might like the approach presented recently by Stefan Decker that prefers prototypes to classes when modelling data. See Stefan’s slides (http://www.slideshare.net/stefandecker1/stefan-decker-keynote-at-cshals/28) or the somewhat quiet Google+ group on this topic (https://plus.google.com/communities/102405508518643959546).

    Comment by Jindřich Mynarz on 12. September 2013 at 19:54

  3. Thank you, thank you, thank you, Denny! All of the textbooks on semantic modelling seem to strongly believe in the ability to naturally classify everything into a neat hierarchy. Computer scientists love to have things fit nicely into disjunct categories. Except that as human beings, we are messy and don’t fit. There are always exceptions, and especially exceptions over time. With the man recently giving birth in Neukölln we now have a jillion or so standard father-mother-children examples shot to hell ;)

    Please keep strong classification OUT of WikiData. Otherwise we will end up having to fudge our way around the problems that occur, and since we don’t have good means of representing inference, that will make it immensely difficult to figure out what went wrong. Tagging will permit multiple and overlapping “classification” and more closely fit the Real World ™, imho.

    Comment by WiseWoman on 12. September 2013 at 20:54

  4. Hi. I think the biggest problem with having no hardwired classification scheme is not a philosophical or a theoretical one, it’s a practical one: it tends to make it quite difficult to suggest properties to the user, as the system has no knowledge of what the item is.

    It puts a lot into the hands of the community: building many specialised, domain-specific tools with more or less hard-wired properties, producing periodic reports, and building tools for expressing constraints or patterns (with Wikisyntax like http://mappings.dbpedia.org/index.php/Main_Page at DBpedia? It can be made useful, but I can’t help thinking it’s a bit of a suboptimal hack :) ).

    On the other hand, a type or class system in Wikidata could be implemented with the same principles that appear in your posts: annotated (qualified) classification and soft constraints which serve more as patterns than as limits, with additional benefits such as immediate reports to the user. It’s a matter of choice, but I think it tends to be kind of hard for the community to understand all these problems and make (another level of) choices. Maybe this would help the community at no cost in expressiveness, make things go faster, and help users understand the project a little better?

    Comment by TomT0m on 13. September 2013 at 18:56

  5. Last night I came across [[:fr:Statue du Christ-Roi]] with a list of giant statues of Christ the King and the corresponding wikidata item ‘monumental statue of Christ the King’. I went and tagged all those statues on Wikidata with ‘instance of:monumental statue of Christ the King’.

    Sleeping on it I realised this morning that ‘monumental statue of Christ the King’, when used as a class, is trying to do two things at the same time. Today I am going to change all of the items for those statues to ‘instance of:colossal statue’ and ‘depicts:Christ the King’.

    Similarly the important property for classifying humans is probably going to be ‘occupation’ even if they are all tagged with ‘instance of:human’ as well.

    ‘instance of’ will be important but it is not the only important property.

    Classes, defined using the ‘subclass of’ property to link specific classes to more general items, seem to have a place defining what values are acceptable with various properties. The ‘occupation’ property will mostly link to items which are a subclass of the ‘occupation’ item. ‘instance of’ should, in general, not link to items which are a subclass of the ‘occupation’ item. Bots can be used to highlight exceptions to these guidelines for review by humans. The human can then change the property, mark the value item as a ‘subclass of:occupation’ or accept it as an exception.

    This is, I now believe, the appropriate compromise between rules and exceptions and is not something I would ever have come up with without the nudging built into the software by Denny.

    Comment by filceolaire on 13. September 2013 at 21:40

  6. […] Denny Vrandečić (ever keen to show just how much he deserved his PhD in the subject!) argued recently, strong classification of the kind that categories lead to seems to hit a nerve in the human […]

    Pingback by Railing against the categorical imperative :: Pathological Neologisms on 13. September 2013 at 22:32
