Summary: In this blog post, Alessandro reports on his research stay with the software development team at Wikimedia Deutschland e.V. The focus of his doctoral thesis is to find out how community processes influence data quality in Wikidata. The central questions are how volunteers collaborate in Wikidata, how that collaboration changes as they gain experience in the community, what effects this has on data quality, and what data quality actually means in such a knowledge base.
This is a blog post by Alessandro Piscopo.
My secondment at Wikimedia Germany was part of the research work for my Computer Science PhD, which I am pursuing at the University of Southampton as an Early-Stage Researcher within the Marie Curie ITN WDAqua project. The aim of WDAqua is to advance the state of the art of Question-Answering systems based on web data. The outcome of my PhD will be integrated with the research carried out by 14 other participants in this project.
In this blog post I would like to recap this interesting and fruitful experience, and to share the progress I was able to make in my research.
What does data quality mean in Wikidata?
My research investigates how community processes influence data quality in collaborative knowledge engineering systems, in particular in Wikidata. Making progress therefore means finding answers to questions such as: How do users collaborate in Wikidata? How does their behaviour change as they gain experience in the community? How do these aspects affect data quality? What does data quality mean in this knowledge base?
We (my PhD supervisor, fellow researchers at the Web and Internet Science group of the University of Southampton, and I) have already tried to answer some of these questions. Now we would like to concentrate our efforts on what data quality means in Wikidata.
Data quality is a complex concept. It covers a number of different aspects, also called dimensions, and is usually defined as fitness for use. But which use? Wikidata already has quite broad coverage, has already been used for several different purposes, and will probably be used for tasks we cannot foresee yet. Furthermore, Wikidata is entirely maintained by its community. It is the community that determines what is in Wikidata and, ultimately, what Wikidata is and will be for.
Therefore, the most reliable and appropriate way to determine what data quality means for Wikidata seemed to be to directly involve the community and the development team behind it. At Wikimedia Germany, I was able to obtain valuable information about how Wikidata was conceived, to learn about several technical details that would otherwise have been difficult to find, and to understand the project better. For example, I could observe how developers of a collaborative project such as Wikidata relate to its vast user community (and believe me, it can be a hard task sometimes).
In more specific terms, the main goal of my secondment at Wikimedia Germany has been to define a data quality framework for Wikidata. The first step was to review the relevant literature in order to select data quality dimensions, i.e. different aspects of data quality, that could be relevant for Wikidata. This was at the same time a top-down and a bottom-up approach: high-level, abstract dimensions were selected from the literature and included on the basis of quality issues already observed in Wikidata.
After several revisions, the resulting list of dimensions was published as a Request for Comments (RfC) and advertised on the Wikidata Project Chat page (in the English, French, and Italian versions) and on social media (thanks, Lydia and Léa!). Since its publication date (11 August), the page has been viewed about 92 times per day on average and has been edited more than 80 times. This testifies to a certain degree of interest from the community, considering that other RfC pages I checked had much lower numbers of daily views in similar time spans. Nevertheless, the hope is to get as many participants as possible into the discussion, which would be beneficial both for this research and for Wikidata itself.
The RfC is set to close on September 4. After that date, user comments will be analysed so that they can be used to further refine the quality framework. The refined framework could then be submitted to the Wikidata community again, to ask whether or not the project should officially adopt it. A subsequent step will be to find appropriate metrics for the dimensions identified, in order to eventually perform a large-scale evaluation of Wikidata. Finally, for the continuation of my PhD project, these results should be analysed and related to the community processes in Wikidata. But this is another part of the story, which I hope to write about here sooner or later.
This has been a description of the research I carried out over the last four weeks at Wikimedia Germany. It is obviously only a partial account of my time there. It has been an interesting, enriching experience, which allowed me to learn a lot, both about the topic of my research and beyond it. Furthermore, this research would have been much harder without the help and advice of everyone in the Wikidata team and, in particular, of Lydia Pintscher, who made it possible for me to be here, revised my work, and patiently answered a very large number of questions.
Finally, I would like to thank everybody at Wikimedia Germany once again. You have been great hosts and made me feel at ease immediately. It has been a pleasure to spend this time here, and I hope to be back soon.