
Wikidata quality and quantity

WMDE general

4 September 2013

One of the goals of the Wikidata development project is a community that is strong enough to maintain the content in Wikidata. The community is – as with all other Wikimedia projects – the only guarantee of quality and sustainability.

It is not one of the objectives of the Wikidata development project to be the largest collection of data on the net. The sheer number of statements in Wikidata is not a metric that indicates healthy growth or quality. Since it is an easily obtained and understandable number it is nonetheless used a lot, but we should not attach too much importance to it.

This leads to the question: which metrics are meaningful for quality in Wikidata? And I have to admit: we do not know. This may seem particularly ironic since my dissertation was on the topic of quality measurement of knowledge structures. But it is not surprising: the ability to make statements in Wikidata has only existed for about half a year. The site is under continuous development, and some important pieces of quality assurance that are planned for Wikidata have not yet been developed, including, for example, ranks for statements, web links as a data type, the protection of individual statements, and aggregated views of the data. How to make quality measurable in Wikidata, which metrics correlate with quality – it has simply not yet been investigated sufficiently. I expect that science will provide some answers in the coming months and years.

To get an overview of the development of Wikidata, we must for now make assumptions about which numbers likely indicate quality. I hereby call on the community to make suggestions and discuss them. A few first thoughts follow below.

The number of data elements (items) does not seem to be a useful measure. So far this number has been driven almost exclusively by the fact that items are required for storing language links. Accordingly, there was initially strong growth while the links were being transferred, and in recent months the number has been relatively stable.

The number of edits per page seems to be more meaningful. Last week it went above 5.0 and is rising quickly. The number of edits alone is less meaningful in Wikidata than in many other Wikimedia projects, as an extraordinarily high proportion of the edits are done by bots. Bots are programs written by users to make changes automatically or semi-automatically. The bots are controlled by a group of about 80 users. This leads many to the idea that Wikidata is written only by bots. But that is not true: every month 600,000 to 1 million edits are performed by human users. These are numbers reached only by the most active Wikipedias, including their own bot edits. Worries that Wikidata's growth is too fast and that the quality of the data would suffer have so far, apart from anecdotes, not proven true.
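
Incidentally, the edits-per-page figure is easy to reproduce from the public site statistics. A minimal sketch in Python using the standard MediaWiki API (the endpoint and the siteinfo field names are documented parts of the API; no error handling):

```python
import requests

# Fetch Wikidata's site statistics via the standard MediaWiki API.
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    },
)
stats = resp.json()["query"]["statistics"]

# Edits per page: the simple ratio discussed above.
edits_per_page = stats["edits"] / stats["pages"]
print(f"{stats['edits']} edits over {stats['pages']} pages = {edits_per_page:.2f} edits per page")
```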

Perhaps the simplest metric is the number of active users. Active users in Wikimedia projects are defined as users who contributed at least five edits in a given month. Wikidata has nearly 4,000 active users, which places it sixth among the Wikimedia projects, together with the Japanese and Russian Wikipedias, behind only the English Wikipedia, Commons, and the German, French, and Spanish Wikipedias. In other words, Wikidata has more active users than 100 smaller Wikipedias combined! Whenever the smaller Wikipedias access Wikidata, they rely on a knowledge base that is maintained by a much larger community than their own. But the advantages do not end there: by using the content of Wikidata in the Wikipedias, it becomes more visible, gets more attention, and errors are more likely to be found (although we still lack the technical means to then correct an error easily from within Wikipedia; that is on the development plan). This mainly benefits the smaller Wikipedias.
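
The definition of an active user is mechanical enough to state in a few lines of code. A small sketch over hypothetical edit records, made up purely for illustration:

```python
from collections import Counter

def active_users(edits, month, threshold=5):
    """Count users with at least `threshold` edits in the given month.

    `edits` is an iterable of (username, "YYYY-MM") pairs; this mirrors
    the Wikimedia definition of an active user quoted above.
    """
    counts = Counter(user for user, m in edits if m == month)
    return sum(1 for n in counts.values() if n >= threshold)

# Hypothetical data for illustration:
edits = [("Alice", "2013-08")] * 7 + [("Bob", "2013-08")] * 3
print(active_users(edits, "2013-08"))  # -> 1 (only Alice reaches 5 edits)
```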

But there are already useful advantages for the larger Wikipedias as well: an exciting, and for me completely unexpected, opportunity for quality assurance arose when the English Wikipedia decided not to simply take IMDb IDs from Wikidata, but instead to load them from Wikidata, compare them with the numbers already in Wikipedia, and, in case of inconsistency, add a hidden category to the article. This gives hard-to-detect errors and easily vandalized data an additional safety net: it may well be that there is a typo in the number on the English Wikipedia, or that some especially funny person switched the ID of Hannah Montana's latest film with that of Natural Born Killers in the French Wikipedia; now these situations are detected quickly and automatically. Data validated in several ways like this can then be used by the smaller Wikipedias with little concern.
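
The comparison itself is simple. Below is a sketch of the idea in Python; note this is an illustration, not the English Wikipedia's actual implementation, which runs in templates on the wiki itself. P345 is Wikidata's IMDb ID property, and the item ID and value in the commented-out usage line are placeholders.

```python
import requests

def wikidata_imdb_id(item_id):
    """Fetch the IMDb ID (property P345) stored on a Wikidata item.

    Sketch only: assumes exactly one P345 statement and does no error
    handling.
    """
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": item_id,
            "props": "claims",
            "format": "json",
        },
    )
    claims = resp.json()["entities"][item_id]["claims"]
    return claims["P345"][0]["mainsnak"]["datavalue"]["value"]

def ids_consistent(item_id, local_value):
    """Compare the local IMDb ID with Wikidata's; a mismatch is what
    would put the article into a hidden tracking category."""
    return wikidata_imdb_id(item_id) == local_value

# Hypothetical usage (placeholder item and ID):
# ids_consistent("Q12345", "tt0000000")
```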

As mentioned earlier, a lot is still missing and Wikidata is a very young project. Many of the statements in Wikidata are without a source. Even in the German Wikipedia the statement that Paris is the capital of France does not have a source. Should we impose much stricter rules on a much smaller project after such a short time? But, one may interject, if a statement has no source, I cannot use it in my Wikipedia. And that is perfectly okay: it is already possible to use data from Wikidata only if it has a source of a certain type.
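
Mechanically, such a filter only needs to inspect the references attached to each statement. A minimal sketch, assuming `claims` is the "claims" object of an entity as returned by the wbgetentities API; requiring a specific reference property such as P248 ("stated in") is one possible interpretation of "a source of a certain type", not a rule the projects have settled on:

```python
def sourced_statements(claims, property_id, required_ref_property=None):
    """Yield statements for `property_id` that carry at least one reference.

    If `required_ref_property` is given (e.g. "P248", "stated in"), only
    references that use that property count. Sketch only; no error
    handling.
    """
    for statement in claims.get(property_id, []):
        for reference in statement.get("references", []):
            snaks = reference.get("snaks", {})
            if required_ref_property is None or required_ref_property in snaks:
                yield statement
                break  # one acceptable reference is enough
```

A Wikipedia could then display the Wikidata value only when this generator yields a statement, and fall back to locally stored data otherwise.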

There are two ways to ensure long-term quality in Wikipedia: allow users to be more effective, or attract more users. We should continue to pursue both ways, and Wikidata uses both very effectively: the mechanisms described above aim to give users the means to build more powerful tools and processes for quality assurance; at the same time, Wikidata has already brought more than 1,300 new users to the Wikimedia projects who had not edited in the other Wikimedia projects before.

Wikidata's main goal is to support the Wikimedia projects: it should enable higher quality of the content and reduce the effort required to achieve it. We need more metrics that capture this goal and show how we are evolving. The simple metrics all indicate that the initial growth in breadth has come to an end after a few months, and that the project is gaining in depth and quality. There are useful applications for small as well as for large projects. But it is also clear that I am an avid supporter of Wikidata and therefore biased, which is why I am starting a call for ideas on how to track Wikidata's effect critically and accurately.

Comments

  1. […] of a statement. Denny Vrandečić published 2 blog posts about the ideas behind Wikidata: “Wikidata Quality and Quantity” and “A categorical imperative?“. In addition, he shared a few thoughts on the […]

  2. Amrapali
    19 September 2013 at 11:49

    I'm currently working on a survey paper focused on data quality assessment methodologies, dimensions, metrics and tools, particularly for LOD: http://www.semantic-web-journal.net/content/quality-assessment-methodologies-linked-open-data (although currently under review). I think this would be a potential answer to your statement: “How to make quality measurable in Wikidata, which metrics correlate with quality – it has simply not yet been investigated sufficiently. I expect that science will provide some answers in the coming months and years.” Of course, in this case not particularly for Wikidata, but it can definitely be applied to it!

  3. Thieol
    8 September 2013 at 02:12

    In my opinion, Wikidata quality will increase naturally with the number of statements. Many statements are linked together by obvious rules: “A term cannot have a date of birth” and so forth. Bots will then be able to detect errors by logical rules.

  4. Torsten
    5 September 2013 at 10:59

    Press “Random article” 100 times and count the incidences of Wikidata use. (All interwiki language links together count only as one incidence.)

  5. Boris Schneider
    4 September 2013 at 19:20

    I hope the project will have many contributors and that it will be successful. :) Thanks for the great article about Wikidata.
