Transatlantic work on structured data in Berlin

Last week Wikimedia Deutschland was happy to welcome guests for a special technical discussion that spanned an entire week at the headquarters in Berlin. Members from the multimedia team of the Wikimedia Foundation in San Francisco, members from the team developing software for Wikidata at Wikimedia Deutschland and technical experts and developers from the volunteer community came together to discuss Wikimedia Commons and structured data.

Structured data was an important topic in many talks on technology at this year’s Wikimania in London. It is the principle behind Wikidata, a free knowledge base with data that can be filtered, sorted, queried, and of course edited by machines and human beings alike, all in a way that goes beyond storing wikitext in a specific human language. The technology in the engine room of Wikidata is a software project called Wikibase, which stores data in a structured way. Ideas that Wikimedia Commons, the free repository of media files, could benefit from structured data and Wikibase have been floating around for a long time, as have thoughts about making Commons more user-friendly and making license-conforming re-use of pictures easier. The weeklong meeting in Berlin marked the starting point of a planning and discussion process that brought together Wikimedians from both sides of the pond.

Outreach Program for Women at Wikidata

This May, Wikidata was part of the Outreach Program for Women. Helen Halbert and Anjali Sharma took care of documenting Wikidata for the general public and the community, with tasks ranging from guided tours for those new to Wikidata to handling the various social media channels. The following guest post is a summary by Helen (written together with Anjali) about her time with Wikidata.

The journey to contributor

This past May, Anjali and I were thrilled to learn we would both be working for Wikidata for the summer as part of GNOME Foundation’s Outreach Program for Women (OPW), which provides paid internships with participating organizations to encourage more women to get involved with free and open source software. Both of us were assigned the task of working on outreach efforts.

Wikidata at Wikimania 2014 in London

Wikidata was one of the dominating themes at Wikimania 2014. Many talks mentioned it in passing, even those that didn’t focus on technical topics. Structured data with Wikibase was a frequent topic, be it in discussions on the future of Wikimedia Commons or in projects related to GLAM.

When it comes to Wikidata, more and more people are beginning to see the light, so to speak. It was fitting that Lydia Pintscher’s talk on Wikidata used this metaphor for the projects: creating more dots of light on the map of free knowledge.

Another excellent talk on Wikidata was dedicated to the research around it. Markus Krötzsch took us on a journey through the data behind the free knowledge base that anyone can edit.

Of course, there were meetups by the Wikidata community, and hacks were developed during the hackathon. One enthusiastically celebrated project came from the Russian Wikipedia, which has had infoboxes that draw on Wikidata for quite a while now. What they added at the hackathon was the ability to edit data in the columns of these infoboxes in place and change it on Wikidata at the same time, pretty much like a visual editor for Wikidata. Read about their hack on Wikidata, or have a look at the source code (which is still a long way from being easy to adapt to other Wikipedias, but it’s a start).

Guided tours and Wikidata: How to explain a complex project and encourage new editors

The following is a contribution by Bene*, admin and bureaucrat on Wikidata and author of the guided tours on Wikidata. He explains the motivation behind guided tours and how they can attract new editors to the Wikidata community:

Wikidata is no longer a brand new project but still a lot of people do not really know what it actually does. This makes it hard for new editors to get involved with the project and become active contributors. We realized that something had to change; that we had to make things easier to understand and take our newbies by the hand.

Wikidata guided tour intro

Wikidata guided tour labels


When it comes to planning how to help new editors, a first approach is typically to create help pages for individual topics. However, these pages are often very long and do not do a good job of explaining concepts beyond their theoretical context. Another way to explain things is to create illustrative presentations including slideshows. Unfortunately, the users still only get the theory and have to make the leap from reading to actually editing on their own. Keeping all this in mind, we decided that we needed a format that is integrated with the editing interface of Wikidata and gives users the opportunity to edit content through a series of practical exercises.

In fact, this is exactly what the GuidedTour extension does. It provides a way to create presentations, or rather interactive tutorials, in which the user can actually complete a set of actions. One great use case of Guided Tours is the Wikipedia Adventure. However, for Wikidata we needed something different because the item editing interface has very little in common with a standard wiki page. The pages contain more buttons and small text fields because an item does not simply consist of text but stores structured data instead. Therefore, we adjusted the guided tours to our needs and added an overlay feature to highlight single design elements. We also made the tours translatable, as Wikidata is a multilingual project. If you are interested in the result, just try it out for yourself: there are currently two Wikidata tours available, one on items and one on statements.

Wikidata items tour stats

Wikidata statements tour stats

As you can see from the usage statistics, the work was well worth the effort. Since the release on 11th July more than 150 users have taken the first tour and more than 100 went on to complete the second one. This shows the impact our tours have had and the great need for them. It was lots of fun to create and implement the interactive tutorials but there is still a lot of work to do. New tours are being worked on and the existing ones are also in need of translations. If you have any ideas for new tours or improvements to the existing ones, just add your comments to the coordination page. You might also want to help translate the released tours (which is just like translating any wiki page). You can translate the existing tutorials about items and about statements.

A note from Lydia (Wikidata’s product manager): Thank you so much to Bene* (Wikidata community developer) and Helen (Free Software Outreach Program for Women intern with Wikimedia) who have worked together over the past weeks to make these first guided tours a reality. It’s great to see us making progress towards making Wikidata easier to use every single day.


Pushing Wikidata to the next level

In early 2010 I met Denny and Markus for the first time in a small room at the Karlsruhe Institute of Technology to talk about Semantic MediaWiki, its development and its community. I was intrigued by the idea they’d been pushing for since 2005 – bringing structured data to Wikipedia. So when the time came to assemble the team for the development of Wikidata and Denny approached me to do community communications for it there was no way I could have said no. The project sounded amazing and the timing was perfect since I was about to finish my studies of computer science. In the one and a half years since then we have achieved something amazing. We’ve built a great technical base for Wikidata and much more importantly we’ve built an amazing community around it. We’ve built the foundation for something extraordinary. On a personal level I could never have dreamed where this one meeting in a small room in Karlsruhe has taken me now.

From now on I will be taking over product ownership of Wikidata as its product manager.

Up until today we’ve built the foundation for something extraordinary. But at the same time there are still a lot of things that need to be worked on by all of us together. The areas that we need to focus on now are:

  • Building trust in our data. The project is still young and the Wikipedia editors and others are still wary of using data from Wikidata on a large scale. We need to build tools and processes to make our data more trustworthy.
  • Improving the user experience around Wikidata. Building Wikidata to the point where it is today was a tremendous technical task that we achieved in a rather short time. This meant, though, that in places the user experience has not gotten as much attention. We need to make the experience of using Wikidata smoother.
  • Making Wikidata easier to understand. Wikidata is a very geeky and technical project. However, to be truly successful, the ideas behind it will need to be easy to grasp.

These are crucial for Wikidata to have the impact we all want it to have. And we will all need to work on those – both in the development team and in the rest of the Wikidata community.

Let’s make Wikidata a joy to use and get it used in places and ways we can’t even imagine yet.

Data for the people!

In the first session of the first Wikimania, I presented the idea of enriching Wikipedia with structured data. Asked how long it would take to implement this, I answered: “Two weeks, if you know the MediaWiki software well.”

That was in 2005. It turned out, I would be slightly off.

Now, in 2013, we finally started using structured data from Wikidata in the Wikipedias. The project is still in its infancy, but I am already extremely proud of the Wikidata team and what they have achieved. I am very thankful to the many, many people that helped us get to where we are today (I started listing them explicitly, but this post became too long). There are still many things that need to be done, but the rough sketch of what Wikidata is and is not has been drawn, and I think we have created a very interesting new project. I am confident enough about Wikidata and its future, or else I would not be leaving.

A categorical imperative?

This is the third in a short series of blog entries in which I explain some of the design decisions for Wikidata. The first one was about restricting property values or properties, the second about veracity and verifiability. The essays represent my personal opinion, and are not to be understood as the official opinion of the Wikidata project.

First, a name that people doing knowledge representation care very, very strongly about: Barbara. Introduced about 2500 years ago by Aristotle (teacher to Alexander the Great, who had conquered the known world and beyond by the age of 33. School and awesome teachers do matter!) and named a millennium later by my favorite philosopher, Boethius. (Seriously, this guy is awesome. He had everything you could have hoped for back in that time, and he lost it all. Read his bio. He had both his sons made consuls of the mightiest empire of the world, and then suddenly he got his riches taken, family members executed, and was awaiting his own execution in exile in a prison. Instead of lamenting, he sat down and wrote a book about what really is important in life. Read his Consolation of Philosophy. It remained on the bestseller list for a few centuries, and not without reason. Kings copied it by hand!) Barbara is part of the logical foundation of anything that has to do with classes. You might know classes as types, categories, genera, or anything else that is somehow taxonomical. Barbara is a type of syllogism, and thus a rule for correct reasoning. Modus Barbara states that if all A are B and all B are C, well then also all A are C. As an example: If we know that all billionaires are human, and we know that all humans are mortal, bang, all billionaires are mortal, too.
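
To make the rule concrete, here is a minimal Python sketch that applies modus Barbara repeatedly to a set of "all A are B" facts until nothing new can be derived. The function name and the data layout are my own assumptions for illustration; this is not how Wikidata or any reasoner actually implements it.

```python
from collections import defaultdict


def subclass_closure(subclass_of):
    """Apply modus Barbara until a fixed point: if all A are B and all B are C, all A are C."""
    closure = defaultdict(set)
    for sub, sup in subclass_of:
        closure[sub].add(sup)
    changed = True
    while changed:
        changed = False
        for sub in list(closure):
            for sup in list(closure[sub]):
                # Everything that holds for the superclass also holds for the subclass.
                new = closure.get(sup, set()) - closure[sub]
                if new:
                    closure[sub] |= new
                    changed = True
    return dict(closure)


# The example from the text: billionaires are human, humans are mortal.
facts = [("billionaire", "human"), ("human", "mortal")]
derived = subclass_closure(facts)
print(sorted(derived["billionaire"]))  # ['human', 'mortal']
```

The derived entry for "billionaire" now contains "mortal", although that fact was never stated directly.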

Wikidata quality and quantity

One of the goals of the Wikidata development project is a community that is strong enough to maintain the content in Wikidata. The community is – as with all other Wikimedia projects – the only guarantee of quality and sustainability.

It is not one of the objectives of the Wikidata development project to be the largest collection of data on the net. The sheer number of statements in Wikidata is not a metric that is indicative of healthy growth or quality. Since it is an easy-to-obtain and easily understood number it is nonetheless used a lot, but we should not attach too much importance to it.

This leads to the question: which metrics are meaningful for quality in Wikidata? And I have to admit: we do not know. This may seem particularly ironic since my dissertation was on the topic of quality measurement of knowledge structures. But it is not surprising: the opportunity to make statements in Wikidata has existed for only about half a year. The site is in continuous development, and some important pieces for quality assurance that are planned for Wikidata are not yet developed – including, for example, ranks for statements, web links as a data type, the protection of individual statements and aggregated views of the data. How to make quality measurable in Wikidata, which metrics correlate with quality – it has simply not yet been investigated sufficiently. I expect that science will provide some answers in the coming months and years.

To get an overview of the development of Wikidata, we must for now make assumptions about which numbers are likely to indicate quality. I hereby call on the community to make suggestions and discuss them. A few first thoughts follow below.

The number of data elements (items) does not seem to be a useful measure. So far this number has been shaped almost exclusively by the fact that items are required for storing language links. Accordingly, there was strong growth initially, while the links were being transferred, and in recent months the number has been relatively stable.

The number of edits per page seems to be more meaningful. Last week it went above 5.0 and is rising quickly. The number of edits alone is less meaningful in Wikidata than in many other Wikimedia projects, as an extraordinarily high proportion of the edits are done by bots. Bots are programs written by users to make changes automatically or semi-automatically. The bots are controlled by a group of about 80 users. This leads many to the idea that Wikidata is written only by bots. But that’s not true: every month between 600,000 and 1 million edits are made by human users. These are numbers that can be reached only by the most active Wikipedias – including their own bot edits. Worries that Wikidata is growing too fast and that the quality of the data would suffer have so far, except for anecdotes, not proven true.

Perhaps the simplest metric is the number of active users. Active users in Wikimedia projects are defined as the users who contributed at least five edits in a given month. Wikidata has nearly 4,000 active users, which places it 6th among the most active Wikimedia projects, on a par with the Japanese and Russian Wikipedias and behind only the English Wikipedia, Commons, and the German, French and Spanish Wikipedias. In other words, Wikidata has more active users than 100 smaller Wikipedias combined! Whenever the smaller Wikipedias access Wikidata, they rely on a knowledge base that is maintained by a much larger community than their own Wikipedia. But the advantages don’t end there: by using the content of Wikidata in the Wikipedias it becomes more visible, gets more attention, and errors are more likely to be found (although we still lack the technical means to then correct the error easily from Wikipedia – but that is on the development plan). This mainly benefits the smaller Wikipedias.
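
As a rough illustration of the definition above, the following Python sketch counts active users from an edit log. The log format (pairs of user name and month) and all names in it are made up for this example; the real numbers come from the Wikimedia statistics, not from code like this.

```python
from collections import Counter

ACTIVE_THRESHOLD = 5  # "at least five edits in a given month"


def active_users(edits, month):
    """Return the users with at least ACTIVE_THRESHOLD edits in `month` ("YYYY-MM")."""
    counts = Counter(user for user, edit_month in edits if edit_month == month)
    return {user for user, n in counts.items() if n >= ACTIVE_THRESHOLD}


# Hypothetical edit log: one (user, month) pair per edit.
edits = (
    [("Alice", "2013-08")] * 6
    + [("Bob", "2013-08")] * 4
    + [("Carol", "2013-07")] * 9
)
print(active_users(edits, "2013-08"))  # {'Alice'}
```

Bob falls just below the threshold in that month, and Carol's edits belong to a different month, so only Alice counts as active.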

But it already has useful advantages for the larger Wikipedias too: an exciting – and for me completely unexpected – opportunity for quality assurance came when the English Wikipedia decided not to simply take IMDB IDs from Wikidata but instead to load them from Wikidata, compare them with the existing numbers in Wikipedia, and in the case of inconsistency add a hidden category to the article. This way, difficult-to-detect errors and easily vandalisable data got an additional safety net: it may well be that you have a typo in the number on the English Wikipedia, or some especially funny person switched the ID for Hannah Montana’s latest film with that of Natural Born Killers in the French Wikipedia – but now these situations are detected quickly and automatically. Data that has been validated in several ways like this can then be used by the smaller Wikipedias with little concern.
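
The comparison described above could look roughly like the following Python sketch. It is only an illustration of the idea: the function, the status names, the article titles and the IDs are invented, and the English Wikipedia actually does this inside its templates rather than in Python.

```python
def check_identifier(local_value, wikidata_value):
    """Compare an ID stored in an article with the ID stored on Wikidata."""
    if local_value is None:
        return "use_wikidata" if wikidata_value is not None else "missing"
    if wikidata_value is None:
        return "local_only"
    return "consistent" if local_value == wikidata_value else "mismatch"


# Hypothetical articles: title -> (ID in the article, ID on Wikidata).
articles = {
    "Film A": ("tt0000001", "tt0000001"),
    "Film B": ("tt0000002", "tt0000020"),  # a typo on one of the two sides
}

# Articles that would receive the hidden tracking category.
flagged = [
    title
    for title, (local, remote) in articles.items()
    if check_identifier(local, remote) == "mismatch"
]
print(flagged)  # ['Film B']
```

The point of the design is that neither side is trusted blindly: a mismatch only flags the article for a human to look at.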

As mentioned earlier, a lot is still missing and Wikidata is a very young project. Many of the statements in Wikidata are without a source. Even in the German Wikipedia, the statement that Paris is the capital of France does not have a source. Should we impose much stricter rules on a much smaller project after such a short time? But, one may interject, if a statement has no source, I cannot use it in my Wikipedia. And that is perfectly okay: it is already possible to use data from Wikidata only if it has a source of a certain type.

There are two ways to ensure the long-term quality of Wikipedia: allow users to be more effective, or attract more users. We should continue to pursue both, and Wikidata does both very effectively: the mechanisms described above aim to give users the means to build more powerful tools and processes for quality assurance, and at the same time Wikidata has already brought more than 1300 new users to the Wikimedia projects who had not edited in the other Wikimedia projects before.

Wikidata’s main goal is to support the Wikimedia projects: it should enable higher quality of the content and reduce the effort required to achieve it. We need more metrics that capture this goal and show how we evolve. The simple metrics all indicate that the initial growth in width has come to an end after months, and that the project is gaining in depth and quality. There are useful applications for small as well as for large projects. But it is also clear that I am an avid supporter of Wikidata and so have a bias, and I therefore call for ideas on how to track Wikidata’s effect critically and accurately.

Wikidata and other technical bits at Wikimania

Denny, Lydia and Daniel (by Fabrice Florin, CC-by-sa 2.0)

I’m back from an amazing Wikimania. First of all thank you to everyone who helped make the event happen. It was very well organized and an overall useful and productive event. I was there to discuss everything Wikidata as well as new technology like the VisualEditor and Flow and how they affect the German language Wikipedia.

It felt like Wikidata and the VisualEditor were on everyone’s mind during this Wikimania. No matter which talk or panel or dinner I went to – every single one of them mentioned Wikidata and the VisualEditor in some way. It’s great to see the Wikimedia community embrace Wikidata as its sister project. And the VisualEditor, while still rough, seems to be getting to that point very quickly too.

On truths and lies

This is the second in a short series of blog entries in which I explain some of the design decisions behind Wikidata. The first one was about restricting property values or properties. The essays represent my personal opinion, and are not to be understood as the official opinion of the Wikidata project.

Databases have an aura of correctness. When we query a database, we expect the result that comes back to basically be The Answer and The Truth. Ask Amazon’s database about the author of the Bible. Ask IMDB about the director of Adaptation. You are not expecting to get a possible answer, or different points of view – you expect one definitive answer.

Wikidata is collecting structured data about the world. It is basically a crowdsourced database. Unlike text, structured data necessarily and unfortunately lacks nuance. Whereas it is possible to talk about the statehood of Kosovo in an NPOV way in natural language, a naive approach to representing that in structured data would fail: either we say Kosovo is a state, or we do not. There are no shades of grey.

Fortunately, some of the roots of Wikidata lie in an EU research project called RENDER. The goal of this project is to explore and support the diversity of knowledge on the Web. RENDER discards the assumption of a simple, single truth – and this was inherited by the Wikidata data model. Instead of collecting facts, we collect statements. We define statements as claims that can have references. A reference supports the claim. A beautiful example is ethanol, where the CAS number – a standard identifier for chemical compounds – is given with a reference to the actual standard, pointing out the page in the source.
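
The idea of "claims that can have references" can be sketched as a small data structure. The Python below is only an illustration under my own assumptions: the class and field names are invented, the source names and the river example are placeholders, and the real Wikibase data model is considerably richer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Reference:
    source: str                 # where the claim comes from
    page: Optional[str] = None  # optional locator inside the source


@dataclass
class Statement:
    item: str
    prop: str
    value: str
    references: List[Reference] = field(default_factory=list)


def values_with_sources(statements, item, prop):
    """Collect every value claimed for (item, prop) together with its sources."""
    result = {}
    for s in statements:
        if s.item == item and s.prop == prop:
            result.setdefault(s.value, []).extend(r.source for r in s.references)
    return result


statements = [
    # The CAS number of ethanol; the reference itself is a placeholder.
    Statement("ethanol", "CAS number", "64-17-5",
              [Reference("placeholder standard", page="placeholder page")]),
    # Two made-up, conflicting claims about the same hypothetical river.
    Statement("River X", "length in km", "1230", [Reference("Source A")]),
    Statement("River X", "length in km", "1320", [Reference("Source B")]),
]
lengths = values_with_sources(statements, "River X", "length in km")
print(lengths)  # {'1230': ['Source A'], '1320': ['Source B']}
```

Nothing in this structure decides which value is "true"; it only records who claims what, which is exactly the point of collecting statements instead of facts.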

Unlike many other databases, Wikidata can contain contradicting statements, supported by different references. Unlike the natural text in Wikipedia, Wikidata does not offer the possibility to reconcile and explain the differences in prose, providing due weight to the different points of view. The responsibility for deciding which sources to trust lies with the Wikidata reader and reuser. I expect quite a bit of research and exploration to deal with this question in the following years. The first reusers to deal with these issues will be the Wikipedia communities that opt to use data from Wikidata.

In the next few weeks and months we will add a few more features to support the diversity of statements in Wikidata.

Currently, the most obvious omission is a lack of datatypes to specify numbers, text and URLs. Only with these datatypes will it be possible to actually write down references in their full glory. Another opportunity – once URLs are available – would be to provide content locators for text in HTML pages through XPath, oxPath, CSS selectors, or something similar, thus enabling bots to check if the given references are still valid. I am very curious to see how the usage of references and sources will develop in and around Wikidata.

Another major feature that will be introduced in the course of this year is the possibility to rank statements: not all statements are to be regarded equally. We will introduce three ranks, and every statement will be in one of them: preferred, normal, and deprecated.

“Preferred” statements should be the most current and most widely accepted statements. There can be several preferred statements for the same item and property.

“Deprecated” statements are those that are considered to be not reliable for some reason. They are mentioned, though, because they might have a strong source supporting them, or because they are widespread for some reason but no longer accepted. Examples can include typos from influential textbooks – for example regarding the iron content of spinach, or the length of the Rhine – or numbers spread by some form of propaganda that are no longer considered correct today.

“Normal” statements are thus the ones left, which are neither “preferred” nor “deprecated”. This will often apply to historic statements (the population of Rome in the time of Julius Caesar, former capitals of Russia, etc.).

Technically, we will start by using only preferred statements for answering queries (i.e. when you ask for all capitals with a population of less than 500,000, then you won’t get answers where the city had a population of 120,000 in the 16th century). Also, only preferred statements will be returned by the property-parserfunction. The Lua interface will have access to all statements and thus provide full flexibility. It is planned to extend query answering later to support more complex queries, at which point we will have to think about integrating other ranks.
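
The effect of answering queries from preferred statements only can be shown with a small Python sketch. Everything in it is hypothetical: the cities, the numbers, and the function are invented to mirror the example in the text, and this is not the planned query implementation.

```python
from enum import Enum


class Rank(Enum):
    PREFERRED = "preferred"
    NORMAL = "normal"
    DEPRECATED = "deprecated"


# Hypothetical population statements: (city, population, rank).
STATEMENTS = [
    ("City A", 2_000_000, Rank.PREFERRED),   # current figure
    ("City A", 120_000, Rank.NORMAL),        # historic figure, 16th century
    ("City B", 300_000, Rank.PREFERRED),
    ("City B", 3_000_000, Rank.DEPRECATED),  # a known typo in some source
]


def cities_below(limit, ranks=(Rank.PREFERRED,)):
    """Cities with a population statement below `limit`, restricted to `ranks`."""
    return sorted({city for city, population, rank in STATEMENTS
                   if rank in ranks and population < limit})


print(cities_below(500_000))                     # ['City B']
print(cities_below(500_000, ranks=tuple(Rank)))  # ['City A', 'City B']
```

With the default, City A does not show up just because it once had 120,000 inhabitants; a caller with access to all ranks, comparable to the Lua interface mentioned above, can still see every statement.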

The ranks should allow for a more inclusive policy in Wikidata, allowing it to reflect a wider diversity of knowledge.

To give an idea of the time scale: we will first implement the datatypes that are still missing, and then, as a prerequisite for ranks, the possibility to reorder statements. After that, ranks will be the next feature to land in Wikidata.

Ranks introduce a vector for debate, which has not been there in Wikidata yet. The question moves from “should this statement be included?” to “what should be the rank of this statement?” This seems like a necessary step: unlike natural text, Wikidata otherwise could not include statements that are agreed on to be bogus but that have historical or other value. This makes it even more important to remember that Wikidata is not about truth, but about collecting referenced statements in a secondary database. The criterion for inclusion should not be veracity, but verifiability – a policy that has served Wikipedia very well.

Wikidata will always – and that is both a necessity and acknowledged by design – fall short of Wikipedia in many aspects. Wikipedia articles can explore causal and informal connections, they can inspire curiosity, and they can support one of the major modes of knowledge transfer between humans: storytelling. Wikidata has other, unique advantages: it can provide some ground data about a topic of interest in many languages more easily, and it provides the data in a way that is much more accessible for bots and apps. It could be a step towards relieving some Wikipedias of a lot of bot-created articles that are never touched by a human editor, clutter recent changes, and skew statistics.

Without the ability to express a plurality of statements about an item – even if they are considered truths only by some and lies by others – Wikidata would fall short of one of the major pillars of Wikipedia, the Neutral Point of View and the possibility of integrating conflicting points of view.

I hope that the technical platform that we as developers are building, and the rules and processes of the communities in Wikidata, the Wikipedias, and other Wikimedia projects, are establishing a useful ecosystem, understanding the limitations of each project, and discovering how we can most effectively help each other. And this means understanding the peculiar relationship between Wikidata and the Truth.

