Data Quality Management in Wikidata – Workshop write-up
Data is growing faster than ever, and with this growth the trustworthiness of data is becoming a challenge. Wikidata is a central source of structured knowledge for many projects and is designed to be usable by anyone, anywhere in the world. Ensuring data quality is therefore of utmost importance. A workshop brought together scholars and Wikidata community members to take a closer look at the challenges ahead.
A guest article by Cristina Sarasua, Claudia Müller-Birn, and Mariam Farda-Sarbas
Despite the development of many methods and tools to improve different dimensions of quality in Wikidata, further efforts are needed to achieve high-quality data. One such effort was the workshop “Data Quality Management on Wikidata”, held on January 18, 2019 at Wikimedia Deutschland (Germany). It brought together scholars interested in and working on monitoring and improving data quality on Wikidata, as well as members of the Wikidata community. The workshop began with a welcome note by two of its organizers, Claudia Müller-Birn from Freie Universität Berlin and Cristina Sarasua from the University of Zurich.
After that, the first keynote speaker, Amrapali Zaveri, gave a talk about “Open Data Quality: dimensions, metrics, assessment and improvement” (Slides).
Afterwards, the workshop participants introduced themselves in a round-robin introduction session and formed three discussion groups. The participants came from different institutions and countries. Besides Germany, where the workshop was held, participants came from Switzerland, the UK, the Netherlands, Belgium and the United States.
The workshop was designed as a discussion forum organized in three sprints: a first to collectively identify the key data quality challenges in Wikidata, a second to brainstorm solutions to the identified challenges, and a third to discuss ways to prioritize the next activities.
The main challenges discussed within the groups included, among other things, the velocity of Wikidata’s schema, the diverging meaning of items across languages, subtitle vandalism, measuring completeness without introducing bias, and extending references and sources.
The groups also identified and discussed further data quality challenges in Wikidata, such as the measurement of quality, the need for more tools to identify and resolve quality issues, the consistency of property usage, the trustworthiness and accuracy of claims, and the lack of a global perspective when decisions are made locally.
During the second sprint, the groups focused on more specific topics and discussed possible solutions. Suggested solutions included, for instance: making data more trustworthy and accurate by checking the sources and references of statements through a voting or ranking system; obtaining more references by enforcing their use or through information extraction; addressing Wikidata’s language challenge by motivating communities through a ranking system or by introducing a record button to support spoken languages; and identifying items without statements based on the categories of linked articles.
After each sprint, the groups presented their ideas to all participants and answered questions.
The groups then discussed their results and suggested continuing to work on this topic after the workshop. Based on the discussions in the room, the participants founded the WikiProject Data Quality on Wikidata.
After the sprint sessions, the authors of accepted abstracts briefly presented their projects and ideas. The abstracts approached data quality from different angles, such as an overview of data quality tools for Wikidata, OpenRefine and ProWD as new data quality tools, completeness, external identifiers, a schema language for Wikidata, and bulk and bot edits.
The workshop was summarized by the second keynote speaker, Daniel Mietchen.
Please note: Since the workshop footage and all relevant materials were published in English, this blog post will also – as an exception – only be available in English.