10 Years of Wikidata – Part 1

Wikidata is approaching its tenth birthday, and to celebrate this incredible milestone we wanted to know more about the editors who made Wikidata the collaborative project it is today. There are countless stories to tell as well as countless projects to describe – from the occasional clean-up made by one contributor on a very limited set of Items, to bigger WikiProjects that created new data models for all to follow, to outreach collaborations with GLAMs and cultural institutions.

Luca Martinelli [Sannita]

30 September 2022

This is part one of a two-part series of blog posts collecting some of those stories, all of which show our community members’ ingenuity and willingness to contribute to the sum of all knowledge. The reasons for contributing vary from person to person, but there are some common themes: we may do it because we value our home community and want it to “appear on a map”, or because we feel close to a particular topic and like to share our passion with the world. Ultimately, we do what we do because we can and because we like to do it, each in our own fashion and with our own capabilities. We hope these stories will resonate with you and maybe inspire you for the next ten years.

There Since the Very Beginning

Disruptive Media Learning Lab, Andy Mabbett, newly-appointed Wikimedian in Residence at Coventry University, June 2019, CC BY 4.0

Andy Mabbett, aka Pigsonthewing, has been a well-known Wikipedian since 2003. He has contributed more than 940,000 edits across 162 projects in his 19-year tenure as a Wikimedia editor and was one of the very first contributors to edit Wikidata in its early days.

“Earlier than that, in about 2007 – he says – I did some work on ‘microformats’, which was a method of marking metadata within a page using HTML classes in order to specify that a piece of text was a date, a name, an address, geographical coordinates, or whatever. I did that mostly on external websites because I needed it for a project I was running for an organization. In 2007 I started to apply those markup techniques to English Wikipedia infoboxes so that computers could read that metadata.” In other words, Andy was doing some of the things Wikidata would do a few years later: “I didn’t know that at the same time people like Denny were planning a project to make metadata available through a database. But I guess we were both trying to do the same thing in different ways.”

Wikidata arrived in 2012, a clean slate of a knowledge graph where only Wikipedia sitelinks could be added for the first few months. “I got involved with Wikidata probably even before its official launch. At that time it was very basic and then, when we had the ability to add Properties, I contributed to some of the specifications of what the new Properties might be.” Working on Properties was, in fact, one of the domains in which Andy contributed heavily: he directly proposed at least 175 Properties and added comments and suggestions to improve many more proposals. “I’m not the only one that did that, of course, I worked with other people – he interjects – but I did my part to try to make sure that the project was robust, that it was going to survive because of its quality and it wasn’t going to get sidetracked. I think what we did in the early days was give it a good steer. We made some sensible ground rules and some of them are written down as policies now. Some others were just more a case of custom and practice, but we can now say to people ‘this is how you do it’. I suppose ‘developing best practices’ is the best way to put it.”

Meanwhile, as many other contributors have done, Andy started to populate Wikidata’s main namespace by adding Items about Wikipedia articles he had written and topics that were of interest to him: from Birmingham’s local historic buildings, to people he had written about, to some of the GLAMs (galleries, libraries, archives and museums) he worked with. “I did this to get as much practice as I could on Wikidata, to see what it was all about. And then – he continues – I became involved with a number of bulk imports of data into Wikidata through my GLAM work. One of the first that springs to mind is the TED Talks. Jane Darnell and I were both recruited by TED as Wikimedians in Residence and we encouraged volunteers to write Wikipedia articles about speakers at the main TED conference since many of them didn’t have one. A lot of our work was also adding the speakers’ biographical data and details about their TED talks into Wikidata. This wasn’t any ‘scraping from Wikipedia’ initiative, this was one of the first in which gathering data was done by exporting the content from a database, reorganizing it and then uploading it in bulk through QuickStatements.”

Many other imports followed, from ORCID to BBC, from the British historical cultivars of apples, to Songkick, to Quora – which is probably one of the projects that Andy is most proud of, “because it was one of the biggest projects and it also had an impact on their website”. Quora, in fact, started to draw data from Wikidata in order to improve its own ontology of data, much as Songkick did. “The other thing I am particularly proud of was the dblp computer science bibliography import: I worked with them to import their identifiers and match them with Wikidata, which they imported into their own database, so that you have a two-way linking. By doing this, they were able to import other data from Wikidata and improve our project as well, by exchanging identifiers and data. I think this is our strength: we have a symbiotic relationship with our partner organizations that strengthens both in the long run.”

All in all, when asked about the next ten years of Wikidata, Andy feels there is still room for improvement: “there are some features I would like to see, maybe a gadget or a tool or a minor change to the core software, that would be nice to have. These do not represent a failure in Wikidata, the website is good and useful as it is, but these things would make it a little better.” Of course, there’s also the wish that more people understand and use Wikidata: “I am certainly a very strong advocate for the project. I think it’s a great pity that it is so badly underutilized on English Wikipedia; it could be so much better if it made more use of Wikidata, and if members of the community fed back their issues into Wikidata.”

Of Names, WikiProjects and Workshops

French Wikipedian Harmonia Amanda fell in love with Wikidata when the project started. “I first edited Wikidata in its first week of life, but I really started contributing only in early 2013 – she remembers – It immediately changed the way I was working with sitelinks. I remember it was so empty at the beginning, but then another early Wikidata adopter, I think it was Otourly, told me that we could finally start doing data correction by batches, so we started adding descriptions and basic data, going through the Wikipedia categories.”

One of Harmonia’s first big projects was a thorough clean-up of the Items related to the Lord of the Rings fictional universe, but the most durable project she set up is WikiProject Names: “names are very, very, very complicated to deal with. Back in 2013 we started to import some external databases which were primarily about Western and English names. The rest of the world was just not represented and that was the first problem that we grappled with. Then we had the problem of modeling data, because there are people who can have two family names or none at all. There are also the variants of names, their spelling and their original script: is the name ‘Nadia’ coming from its Russian form or from its Arabic form? We decided to go with the ‘one variant, one item’ solution, which is just less problematic than having one Item for all variants!”

But it’s not all books and names. Sport is another topic that saw Harmonia work hard: “Someone in 2016 asked me who won the Grand Prix of Figure Skating Final twice in a row. I thought ‘this is an easy SPARQL query’, but despite many attempts – she remembers – the queries always gave no results. I went to the Item of a skater I knew had won the competition just to find it basically empty, not even ‘sport: figure skating’ was present as data! So I started adding to it, combing through every category on various Wikipedias, and then I started modeling data for the results of competitions. Because you know, to answer that query in the beginning, you need the Items of the competitions to have the necessary data.” From this work a new project was born, which also provides a number of queries for anyone who wants to help with maintenance.

Another big part of Harmonia’s involvement with Wikidata was her monthly in-person workshops in France: “unfortunately I had to slow down my activities significantly in the last three years, because of COVID and other problems… but I still did some online workshops with African communities to on-board them onto Wikidata. They’re doing a lot of good work! From around 2014 to the beginning of the pandemic, I hosted Wikidata workshops every month. Initially, I thought it was a good idea to have a fixed theme for each of them, so that we could work on a specific set of Items or problems. Then I discovered that people came when they could and not because they were interested in the monthly topic, so I changed my plan. I let people just come and ask whatever question they had and that question would become the topic of the workshop. I think I helped a lot of people with their problems with Wikidata!”

When asked what the future may have in store for Wikidata, Harmonia gets more serious: “The problem I see for Wikidata is the same as for all Wikimedia projects: there’s just too much content to cover and quality can differ very much from topic to topic depending on whether we have editors who are specialized in those topics or not. I should say that Wikidata helped a lot in cleaning up many mistakes on Wikipedia: I was doing that kind of work before Wikidata and it was definitely harder than it is now. I’m also very worried about the ‘weaponization’ of our data – she adds – even though it’s not a Wikimedian problem per se, there are people on the Internet who are determined to do harm and Wikidata, in my opinion, is still very vulnerable to this kind of vandalism. On the other hand, I’m seeing more stable contributions in what I call ‘minority languages on the Internet’ – because they’re not minority languages for those who speak them! I’m seeing established community members who run workshops, who edit, who use Wikidata in their own projects; like Hausa for example. This makes me very hopeful that the more we grow, the more we become multi-lingually diverse. One fear I had – she concludes – was that we would have so much data in English that all other languages would have been abandoned, but now I can say that people are adding information in their languages and getting them represented on the internet. That’s great, that’s just what we wanted.”

Evolving Throughout the Years

Alexmar983, 2019-05-01 Camillo Pellizzari 01, CC BY-SA 4.0

Camillo Pellizzari, known on the Wikimedia projects as Epìdosis, is about to start his Ph.D. in Antiquity Sciences at the Scuola Normale Superiore di Pisa. He started contributing to the Italian Wikipedia in December 2012 and started to contribute to Wikidata in the following months.

“I immediately specialized in merging duplicate Items – he remembers – especially those about categories. I realized that the old interwiki system left out a lot of potential interconnections between projects, especially between languages that are not close to each other. At the time, it was not yet possible to keep the second Item as a redirect to the first, so I quickly amassed quite a lot of requests for deletion.” There were so many that Camillo decided to put himself forward as an administrator on Wikidata, “mostly to speed up the process of deletion”.

To date, according to the statistics, around 55,000 merges bear his signature, making Epìdosis the eighth-ranked contributor in that field. Camillo still continues to clean up duplicate Items, but his focus has shifted over the years from the duplicates that come from the Wikimedia projects to those that come from external database imports. “I started to grasp this ‘new’ aspect of Wikidata, in other words that Wikidata no longer was a project serving only Wikipedia, especially since 2019. Until then I mostly did imports of Wikipedia data into Wikidata or merging Items. I must say that sometimes – he concedes – I unfortunately still have some doubts about the quality of certain data imports, even though it really depends much on the import and on the topic.”

Camillo still holds true to the ‘old Wikidata values’, “maybe – he quips – because of my background as a history student. I still explain, when I do a presentation about Wikidata, how the project started as a way to centralize data coming from various Wikipedias, in order to be reused on even more Wikipedias.” This, however, doesn’t prevent him from finding that the ten-year-long evolution of Wikidata has been extremely positive for the project: “Wikidata certainly evolved with time and I did too. I have expanded my work towards the interaction with external databases, especially GLAM databases, that want to transpose data from Wikidata into their own database and the other way round. It seems to me that the basic function of the project didn’t change, but it merely expanded its potential area of intervention to other entities which do the same thing that Wikipedia does: choosing which relevant data to show to their readers.”

Speaking of GLAMs, right before the start of the pandemic Camillo met in person with another contributor he had happened to get to know on Wikidata – a meeting that resulted in the birth of a brand new GLAM project in Italy: “I met Stefano Bargioni – User:Bargioni on Wikidata – because of his edits on Wikidata. In real life, he’s Deputy Director of the Library of the Pontifical University of the Holy Cross and he started editing Wikidata by adding the identifiers from his own library. I casually stumbled on his edits and we started messaging. In January 2020, we met in person in Rome, and I showed him Mix’n’match and other Magnus tools. Stefano immediately understood the potential and from there our collaboration took off. He then contacted other colleagues, who in turn realized the value of collaborating with us – Camillo continues – and so the ‘Gruppo MAB’ was created” (“MAB” is the Italian acronym for museums, archives and libraries). Almost all of the activities of the group were held online in recent years, “not just because of the various lockdowns, but also because we’re all in different cities. Nonetheless, we managed to have some in-person meetings and we plan to have others in the coming months.”

Regarding the future, Camillo wishes mainly for two things. First, a way to make tool development and technical changes easier to request. “It’s disappointing that there is no established process to request a gadget or tool – he notes – and have it realized by someone. Then we have the problem of community-maintained tools that sometimes accumulate a number of requests and bugs that are never addressed or very slowly fulfilled.” Second, an improvement on the so-called ‘data round-tripping’: “To me it’s paramount to forge close partnerships with all institutions that give us, or can give us, data, who can then act as data quality controllers. At the moment there is almost a complete lack of a workflow that would allow corrections in both directions, especially from Wikidata back to the original database, which is an almost non-existent feature. In my opinion, correcting someone’s data is as important as giving the correction back to its original source, since the quality of the other databases is key to maintaining the quality of Wikidata itself.”

More things to know about Wikidata

On October 29, Wikidata celebrates its 10th birthday! To mark the occasion, we’ve published a series of blog articles with lots of interesting facts about the history of the world’s largest free knowledge database and its unique community.