Wikidata birthday: “We started programming with 12 people”

Ten years ago, a group of twelve people in Berlin laid the foundation for what is now the world's largest free knowledge database: Wikidata was launched on October 29, 2012. Lydia Pintscher was there at the time. In this interview, she tells us how it all began and where Wikidata's data journey could still lead.

Patrick Wildermann (freelance editor)

October 27, 2022

Wikidata celebrates its 10th birthday – how did the project get started in the first place?

LYDIA PINTSCHER: The idea for Wikidata had been around for a while, at least since the first Wikimania conference we held in 2005. At that time, Denny Vrandečić and Markus Krötzsch – the two originators of the idea for Wikidata – submitted a paper proposing to make Wikipedia machine-readable. The idea was that people should be able to reuse the data that can be extracted from Wikipedia. This resulted in the Semantic MediaWiki project, software that can be used to query and visualize data. Semantic MediaWiki was and is very successful, but it was never deployed on Wikipedia itself.

Denny nevertheless stuck to the idea of building a knowledge graph from Wikipedia. With the experience gained from Semantic MediaWiki in the background – and with funding from Google, AI² and the Gordon and Betty Moore Foundation – Wikimedia Germany (WMDE) set up a team to start working on Wikidata. That was in 2012. We started programming with 12 people, and just six months later Wikidata went live.

Which lessons from Semantic MediaWiki went into the development of Wikidata?

Among other things, Semantic MediaWiki is not multilingual, but multilingualism is crucial for Wikidata. An open database to which people from all over the world are supposed to contribute cannot function only in English. In addition, Wikidata has references for the data, which means that it is possible to trace where the data comes from. That is also a crucial point. A third difference compared to Semantic MediaWiki is Wikidata’s idea of centrality. The principle is that all language versions of Wikipedia get their data from Wikidata – which then only needs to be maintained once, instead of several hundred times for all versions individually.

Lydia Pintscher | Photo: VGrigas (WMF), Lydia Pintscher – 1, CC BY-SA 3.0

In 2012, Lydia Pintscher was part of the team that developed Wikidata. Recently, the free knowledge database passed the milestone of 100 million items. In this interview, Lydia talks about why this is only partly a reason to celebrate.

What are the major milestones in the history of the project?

One milestone was the release of Wikidata after six months – the point at which editors could create their first items. Another important point, not much later, was the ability to add links to Wikipedia articles. Before Wikidata existed, an article in the English Wikipedia, for example, ended with links to the French version, the German, the Italian and so on – very long lists for all articles, kept redundantly in each Wikipedia, which meant chaos. After all, these links had to be kept consistent in every single Wikipedia. With the help of bots, editors imported them into Wikidata and removed them from Wikipedia. From then on, Wikidata got a lot of new items…

Please explain this surge in more detail…

Now there had to be an item in Wikidata for every relevant concept that is described somewhere in Wikipedia. One such concept is ‘Berlin’, for example. There were articles about Berlin in almost 300 Wikipedias. In the next step, people could then collect data for the item ‘Berlin’ in Wikidata. This helped us enormously to build up a base of data in a relatively short time, which could then be improved and expanded.

Was there an original idea of who or what this data would serve?

For us as a team, the priorities were clear from the start. First: collect data and make it available centrally for Wikipedia. Then, in a next step: do the same for the other Wikimedia projects, such as Wikimedia Commons, Wiktionary, Wikivoyage and so on. But our data is a treasure that is relevant not only for Wikimedia projects, but also for anyone else out there who needs a basic set of data about the world. So we focused on making the data available to everyone outside of Wikimedia projects. The final step was realizing that not only is our data relevant, but so is the software we developed for Wikidata, Wikibase, which others can use to build their own open knowledge base. This is the evolution of the project.

Today, Wikidata’s data is also used in voice assistants such as Siri or Alexa, i.e. in commercial projects. An unavoidable side effect of the open access philosophy?

We explicitly decided to publish our data under CC0 – which means that everyone can do what they want with it. This includes any kind of commercial use, whether we welcome it or not. Not to mention that there are also non-commercial uses that we may not approve of. I have mixed feelings about this. Voice assistants are precisely the tools through which people obtain their knowledge these days. Accordingly, I prefer it when that knowledge comes from a source that everyone can contribute to and not from a closed system that no one can influence.

The Wikidata team in 2012 | License: Phillip Wilke (WMDE), Wikidata Fotos 183, CC BY-SA 3.0

These dedicated people launched Wikidata in Berlin in 2012: John Erling Blad, Abraham Taherivand, Tobias Gritschacher, Jeroen De Dauw, Henning Snater, Lydia Pintscher, Daniel Kinzler, Markus Krötzsch, Silke Meyer, Denny Vrandečić, Katie Filbert, Daniel Werner, Jens Ohlig.

Which technical innovations have made a difference in Wikidata’s ten-year history?

Our decision to make the Wikibase software available to everyone and to simplify access via Wikibase Cloud was certainly important. Our wish is that other people also set up Wikibase instances where they publish and maintain their own data – so that we can link to it in Wikidata, or vice versa. But that’s not technically easy. Wikibase Cloud is a hosted service: Wikimedia Deutschland takes care of the hosting and any technical problems, and the operator only has to take care of the content.

Another innovation was the Query Service, which enables queries in Wikidata – and based on this, the Query Builder…

… which is being hailed as the new “superpower” in the world of Open Data – what exactly is behind it?

In Wikidata, as discussed, huge amounts of data are available: the population of Berlin as well as the name of the capital of Paraguay, or the winner of the “Oscar” for best sound editing. The thing is, this data is not very meaningful in itself. More relevant is the knowledge that can be gleaned from it. One question might be: how many people from Asia have won “Oscars” compared to people from Europe or the US? To answer that, you need to know: Who has won an “Oscar”? Where was this person born? On which continent is that place? The point is to establish links. To do this, you have to run queries on the data in Wikidata. This is made possible by the Query Builder.
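The chain of links described above (award, winner, birthplace, continent) can be sketched as a query against the Wikidata Query Service. The following Python example is only an illustration, not the way the Query Builder itself works. The property and item identifiers in it (P166 "award received", P31 "instance of", P19 "place of birth", P17 "country", P30 "continent", Q19020 "Academy Awards") and the detour via the country of the birthplace are assumptions that should be checked on wikidata.org before relying on the result; the function names are made up for this sketch, and a query this broad may also run into the service's time limit.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed identifiers (verify on wikidata.org before use):
#   P166 = award received, P31 = instance of, P19 = place of birth,
#   P17 = country, P30 = continent, Q19020 = Academy Awards.
QUERY = """
SELECT ?continent ?continentLabel (COUNT(DISTINCT ?person) AS ?winners) WHERE {
  ?person wdt:P166 ?award .          # who has won an award ...
  ?award wdt:P31 wd:Q19020 .         # ... that is an Academy Award?
  ?person wdt:P19 ?place .           # where was this person born?
  ?place wdt:P17 ?country .          # in which country is that place ...
  ?country wdt:P30 ?continent .      # ... and on which continent?
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?continent ?continentLabel
"""


def run_query(query=QUERY, endpoint=ENDPOINT):
    """Send the query to the endpoint and return the parsed JSON response."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            # Hypothetical client name, used only for this illustration.
            "User-Agent": "oscar-continent-example/0.1 (illustrative sketch)",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def tally_by_continent(sparql_json):
    """Turn a SPARQL JSON result into a {continent label: winner count} dict."""
    counts = {}
    for row in sparql_json["results"]["bindings"]:
        counts[row["continentLabel"]["value"]] = int(row["winners"]["value"])
    return counts


# Usage (requires network access):
#   print(tally_by_continent(run_query()))
```

Each line in the `WHERE` block corresponds to one of the questions above, and the query only produces an answer because the individual facts are linked to one another.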

In which projects has Wikidata already been involved?

One example is the QURATOR research project, which we carried out in collaboration with ten partner organizations, including the German Research Center for Artificial Intelligence GmbH (DFKI). It ran under the broad heading of “curation technologies”. The goal was to develop technologies that would make the work of various knowledge workers easier, for example journalists researching an article. As Wikidata, we mainly worked on making our data easier to use and enabling editors to increase its quality.

What responsibility comes with Wikidata’s data being used to train algorithms?

I see the responsibilities in different places. One is certainly that we have to provide a data basis with Wikidata that is of high quality, representative, up-to-date and verifiable. I see that as the task of my team and the Wikidata community. The next stage that determines whether something is done well or badly is the question: How does the algorithm use the data? However, we have no influence on this; this responsibility lies with the developers of these algorithms.

How can data be kept non-discriminatory?

I’m afraid it will never be completely free of discrimination, unfortunately. But there are various points that can be addressed. In my eyes, the most important is that we are an open project. People who have recognized discrimination can do something about it with us. The other point is that we analyze our data very carefully to see where we have gaps and biases. For example, we have a gender gap dashboard that shows the ratio of men to women in Wikipedias, also broken down by profession. Such an overview was not possible before Wikidata, because the data basis was missing.

Social negotiation processes are constantly taking place about what discrimination is. What does that mean for your work?

The nice thing about Wikidata and Wikimedia projects in general is that you don’t have to stop at complaining about the situation; you can do something about it and concretely improve things – as the group Women in Red does, for example, which writes articles about women in Wikipedia and makes entries in Wikidata. That is also my attitude: Let’s do something, not just complain!

Where is Wikidata’s untapped potential?

There is so much! There are certainly quite a few apps, services or websites that don’t exist today only because no one has yet thought of using Wikidata to make the idea a reality. There is also still potential in terms of expanding our data. For a few years now, we’ve had a new part of Wikidata that deals with lexicographic data, the kind of data you would find in a dictionary. That’s still an untapped treasure. In the Linked Open Data ecosystem we have in mind, we can also build many more Wikibase instances that make new, better data accessible and linked to Wikidata.

Can you personally think of a field of application that does not yet exist, but could?

A service that keeps you informed about the publications or activities of musicians or authors you appreciate. Whenever the artist releases something new, you get a notification: the new book is out, the tour is coming up. Wikidata does not yet have all the data it would need for that. But that can still change.

More things to know about Wikidata

On October 29, Wikidata celebrates its 10th birthday! To mark the occasion, we’ve published a series of blog articles with lots of interesting facts about the history of the world’s largest free knowledge database and its unique community.