On truths and lies

(The German version of this article is here.)

This is the second in a short series of blog entries in which I explain some of the design decisions behind Wikidata. The first one was about restricting property values or properties. The essays represent my personal opinion, and are not to be understood as the official opinion of the Wikidata project.

Databases have an aura of correctness. When we query a database, we expect the result that comes back to basically be The Answer and The Truth. Ask Amazon’s database about the author of the Bible. Ask IMDB about the director of Adaptation. You are not expecting to get a possible answer, or different points of view – you expect one definitive answer.

Wikidata is collecting structured data about the world. It is basically a crowdsourced database. Unlike text, structured data necessarily and unfortunately lacks nuance. Whereas it is possible to talk about the statehood of Kosovo in an NPOV way in natural language, a naive approach to representing that in structured data would fail: either we say Kosovo is a state, or we do not. There are no shades of grey.

Fortunately, some of the roots of Wikidata lie in an EU research project called RENDER. The goal of this project is to explore and support the diversity of knowledge on the Web. RENDER discards the assumption of a simple, single truth – and this was inherited by the Wikidata data model. Instead of collecting facts, we collect statements. We define statements as claims that can have references. A reference supports the claim. A beautiful example is Ethanol, where the CAS number – a standard identifier for chemical compounds – is given with a reference to the actual standard, pointing out the page in the source.
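To make the idea of "claims that can have references" a bit more concrete, here is a minimal sketch in Python. It is not the actual Wikidata data model: the class names `Reference` and `Statement`, their fields, the helper `statements_for`, and the two sources are all assumptions made up for illustration, and the values are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Reference:
    """Hypothetical reference: a source plus an optional locator such as a page."""
    source: str
    locator: str = ""


@dataclass
class Statement:
    """Hypothetical statement: a claim (item, property, value) with its references."""
    item: str
    prop: str
    value: str
    references: List[Reference] = field(default_factory=list)


def statements_for(statements, item, prop):
    """Return every statement about the given item and property, contradictions included."""
    return [s for s in statements if s.item == item and s.prop == prop]


# Two contradicting claims about the same item and property, each with its own
# (made-up) source. Both are stored; neither is declared "the truth".
statements = [
    Statement("Rhine", "length", "1230 km", [Reference("Source A (hypothetical)", "p. 12")]),
    Statement("Rhine", "length", "1320 km", [Reference("Source B (hypothetical)")]),
]

claims = statements_for(statements, "Rhine", "length")
```

The point of the sketch is only that the store keeps both claims side by side, and that deciding between them is left to whoever reads `claims`.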

Unlike many other databases, Wikidata can contain contradicting statements, supported by different references. Unlike the natural text in Wikipedia, Wikidata does not offer the possibility to reconcile and explain the differences in prose, providing due weight to the different points of view. The responsibility for deciding which sources to trust lies with the Wikidata reader and reuser. I expect quite a bit of research and exploration to deal with this question in the following years. The first reusers to deal with these issues will be the Wikipedia communities that opt to use data from Wikidata.

In the next few weeks and months we will add a few more features to support the diversity of statements in Wikidata.

Currently, the most obvious omission is a lack of datatypes to specify numbers, text and URLs. Only with these datatypes will it be possible to actually write down references in their full glory. Another opportunity – once URLs are available – would be to provide content locators for text in HTML pages through XPath, oxPath, CSS selectors, or something similar, thus enabling bots to check if the given references are still valid. I am very curious to see how the usage of references and sources will develop in and around Wikidata.

Another major feature that will be introduced in the course of this year is the possibility to rank statements: not all statements are to be regarded equally. We will introduce three ranks, and every statement will be in one of them: preferred, normal, and deprecated.

“Preferred” statements should be the most current and most widely accepted statements. There can be several preferred statements for the same item and property.

“Deprecated” statements are those that are considered unreliable for some reason. They are still mentioned because they might have a strong source supporting them, or because they are widespread for some reason but no longer accepted. Examples include typos from influential textbooks – for example regarding the iron content of spinach, or the length of the Rhine – or numbers spread by some form of propaganda that are no longer considered correct today.

“Normal” statements are thus the ones left, which are neither “preferred” nor “deprecated”. This will often apply to historic statements (the population of Rome in the time of Julius Caesar, former capitals of Russia, etc.).
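The three ranks can be pictured as a simple enumeration attached to each statement. The following is only a sketch under my own assumptions: the names `Rank` and `group_by_rank` and the pair-based representation are hypothetical and do not describe how ranks will be stored in Wikidata.

```python
from enum import Enum


class Rank(Enum):
    """The three ranks described above; every statement has exactly one."""
    PREFERRED = "preferred"
    NORMAL = "normal"
    DEPRECATED = "deprecated"


def group_by_rank(claims):
    """Group (value, rank) pairs by rank, keeping the original order within each rank."""
    grouped = {rank: [] for rank in Rank}
    for value, rank in claims:
        grouped[rank].append(value)
    return grouped


# Hypothetical claims for a "capital of Russia" property: the current capital is
# preferred, a former capital is a historic statement and therefore normal.
capital_claims = [
    ("Moscow", Rank.PREFERRED),
    ("Saint Petersburg", Rank.NORMAL),
]

grouped = group_by_rank(capital_claims)
```

Note that nothing in this sketch prevents several values from sharing the preferred rank, which matches the rule that there can be several preferred statements for the same item and property.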

Technically, we will start by using only preferred statements for answering queries (i.e. when you ask for all capitals with a population of less than 500,000, you won’t get answers where the city had a population of 120,000 in the 16th century). Likewise, only preferred statements will be returned by the property parser function. The Lua interface will have access to all statements and thus provide full flexibility. It is planned to extend query answering later to support more complex queries, at which point we will have to think about integrating other ranks.
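The effect of answering queries from preferred statements only can be sketched as a filter. This is an illustration under stated assumptions, not the planned query implementation: the function `capitals_below`, the tuple layout, and the cities "Capital A" to "Capital C" with their population figures are all invented for the example.

```python
def capitals_below(population_statements, limit):
    """Return capitals whose *preferred* population statement is below the limit.

    Each statement is a hypothetical (city, population, rank) tuple, where rank
    is one of "preferred", "normal" or "deprecated".
    """
    return sorted({
        city
        for city, population, rank in population_statements
        if rank == "preferred" and population < limit
    })


# Made-up data: Capital A has a small historic population (normal rank) next to
# its current one (preferred rank); Capital C only has a deprecated figure.
population_statements = [
    ("Capital A", 2_000_000, "preferred"),
    ("Capital A", 120_000, "normal"),
    ("Capital B", 300_000, "preferred"),
    ("Capital C", 400_000, "deprecated"),
]

small_capitals = capitals_below(population_statements, 500_000)
```

Here only Capital B is returned: the historic figure for Capital A and the deprecated figure for Capital C are ignored, even though both are below the limit.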

The ranks should allow for a more inclusive policy in Wikidata, allowing it to reflect a wider diversity of knowledge.

To give an idea of the time scale: we will first implement the datatypes that are still missing, and then, as a prerequisite for ranks, the possibility to reorder statements. After that, ranks will be the next feature to land in Wikidata.

Ranks introduce a vector for debate that has not existed in Wikidata so far. The question moves from “should this statement be included?” to “what should be the rank of this statement?” This seems like a necessary step: unlike natural text, Wikidata otherwise could not include statements that are agreed to be bogus but that have historical or other value. This makes it even more important to remember that Wikidata is not about truth, but about collecting referenced statements in a secondary database. The criterion for inclusion should not be veracity, but verifiability – a policy that has served Wikipedia very well.

Wikidata will always – and that is both a necessity and acknowledged by design – fall short of Wikipedia in many aspects. Wikipedia articles can explore causal and informal connections, they can inspire curiosity, and they can support one of the major modes of knowledge transfer between humans: storytelling. Wikidata has other, unique advantages: it can provide some ground data about a topic of interest in many languages more easily, and it provides the data in a way that is much more accessible for bots and apps. It could be a step towards relieving some Wikipedias of a lot of bot-created articles that are never touched by a human editor, clutter recent changes, and skew statistics.

Without the ability to express a plurality of statements about an item – even if they are considered truths only by some and lies by others – Wikidata would fall short of one of the major pillars of Wikipedia, the Neutral Point of View and the possibility of integrating conflicting points of view.

I hope that the technical platform that we as developers are building, and the rules and processes of the communities in Wikidata, the Wikipedias, and other Wikimedia projects, are establishing a useful ecosystem, understanding the limitations of each project, and discovering how we can most effectively help each other. And this means understanding the peculiar relationship between Wikidata and the Truth.
