Data Quality Management in Wikidata – Workshop write-up

Data growth is more visible than ever, and with this growth the trustworthiness of data is becoming a challenge. Wikidata is the central source of structured knowledge for many projects and is designed to be used by anyone, anywhere in the world. Ensuring data quality is therefore of utmost importance. A workshop brought together scholars and Wikidata community members to take a deeper look at the challenges ahead.

  • Lisa Dittmer
  • March 7, 2019

A guest article by Cristina Sarasua, Claudia Müller-Birn, and Mariam Farda-Sarbas

Despite the development of many methods and tools to improve different dimensions of quality in Wikidata, further efforts are needed to achieve high-quality data. One such effort was the workshop “Data Quality Management in Wikidata”, held on January 18, 2019 at Wikimedia Deutschland (Germany). It brought together members of the Wikidata community and scholars who monitor and work to improve data quality on Wikidata. The workshop began with a welcome note by two of its organizers, Claudia Müller-Birn from Freie Universität Berlin and Cristina Sarasua from the University of Zurich.

The first keynote speaker, Amrapali Zaveri, then gave a talk titled “Open Data Quality: dimensions, metrics, assessment and improvement”.

Afterwards, the workshop participants introduced themselves in a round-robin session and formed three discussion groups. The participants came from different institutions and countries: besides Germany, where the workshop was held, they came from Switzerland, the UK, the Netherlands, Belgium and the United States.

The workshop was designed as a discussion forum organized in three sprints: a first sprint to collectively identify the key data quality challenges in Wikidata, a second to brainstorm solutions to those challenges, and a third to discuss how to prioritize the next activities.

The main challenges discussed within the groups included the velocity of Wikidata’s schema, the diverging meaning of items in various languages, subtle vandalism, measuring completeness without introducing bias, and extending references and sources.

Participants at work, CC BY-SA 4.0

The groups also identified and discussed further data quality challenges in Wikidata, such as the measurement of quality, the need for more tools to identify and resolve quality issues, the consistency of property usage, the trustworthiness and accuracy of claims, and the lack of a global perspective when decisions are made locally.

During the second sprint, the groups focused on more specific topics and discussed possible solutions. The suggestions included making data more trustworthy and accurate by checking its sources and references through a voting or ranking system; obtaining more references, either by enforcing their use or through information extraction; overcoming the language challenge of Wikidata by motivating communities through a ranking system or by introducing a record button to support spoken languages; and identifying items without statements based on categories in linked articles.

After each sprint, the groups presented their ideas to all participants and answered their questions.

The groups then discussed their results and suggested continuing to work on this topic after the workshop. Based on the discussions in the room, the participants founded the WikidataProject: DataQuality.

After the sprint sessions, the authors of accepted abstracts briefly presented their projects and ideas. The abstracts addressed data quality from different angles, such as an overview of data quality tools for Wikidata, OpenRefine and ProWD as new data quality tools, completeness, external identifiers, Wikidata schema language, bulk edits and bot edits.

The workshop was summarized by the second keynote speaker, Daniel Mietchen, and wrapped up with a group photo.

Please note: Since the workshop footage and all relevant materials were published in English, this blog post will also – as an exception – only be available in English.