
The Wikidata tool ecosystem

Lydia Pintscher

6. May 2013

(The German version of this article is available here.)

The following is a guest post by Magnus Manske, active tool developer around Wikidata and author of the software that later evolved into MediaWiki.

Wikidata is the youngest child of the Wikimedia family. Its main purpose is to serve as a “Commons for factoids”, a central repository for key data about the topics covered by, and the links between, the hundreds of language editions of Wikipedia. At the time of writing, Wikidata already contains about 10 million items, more than any edition of Wikipedia (the English Wikipedia currently has 4.2 million articles). But while, as with Commons, its central purpose is to serve Wikipedia and its sister projects, Wikidata has significant value beyond that: it offers machine-readable, interlinked data about millions of topics in many languages via a standardized interface (API).
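
To give a flavour of that interface, here is a minimal sketch (assuming Python 3 with the requests library, and Q254 as the item ID for Wolfgang Amadeus Mozart) that fetches an item's labels and site links from the public API:

    # Minimal sketch: fetch one Wikidata item via the public API.
    # Q254 is assumed to be the item for Wolfgang Amadeus Mozart.
    import requests

    API = "https://www.wikidata.org/w/api.php"

    params = {
        "action": "wbgetentities",
        "ids": "Q254",
        "props": "labels|descriptions|sitelinks",
        "languages": "en|de",
        "format": "json",
    }
    entity = requests.get(API, params=params).json()["entities"]["Q254"]

    print(entity["labels"]["en"]["value"])         # English label
    print(len(entity["sitelinks"]), "site links")  # links to Wikipedia editions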

Such a structured data repository has long been a “holy grail” of computer science, from the humble beginnings of research into artificial intelligence, through current applications like Google’s Knowledge Graph and Wolfram Alpha, towards future systems like “intelligent” user agents or (who knows?) the Singularity.

The scale of any such data collection is daunting, and while some companies can afford to pour money into it, other groups, such as DBpedia, have tried to harvest the free-form data stored in Wikipedia. However, Wikidata’s mixture of human and bot editing, its use of Wikipedia as a resource, and evolving features such as multiple property types, source annotation, and qualifiers add a new quality to this web of knowledge. Several tools have already sprung up to take advantage of these features and to demonstrate the project’s potential. A fairly complete list is available.

Views on Wikidata

Family tree of Johann Sebastian Bach

For a straightforward example of such a tool, have a look at Mozart. This tool does not merely pull and display data about an item; it “understands” that this item is a person, and queries additional, person-specific items, such as relatives. It also shows person-specific information that does not refer to other items, such as authority control data. Mozart’s compositions are listed, and can be played right on the page if a file exists on Commons. To a degree, it can also use the language information in Wikidata, so you can request the same page in German (mostly).
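
The kind of check such a tool makes boils down to reading an item's statements. A rough sketch, assuming Python with requests and the usual IDs for “instance of” (P31), “human” (Q5), “father” (P22), “mother” (P25) and the VIAF identifier (P214); treat all of these IDs as assumptions of the sketch:

    # Sketch: check whether an item is a person, then read person-specific
    # statements such as relatives and authority control identifiers.
    # Property/item IDs are assumptions of this sketch.
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def get_claims(qid):
        r = requests.get(API, params={
            "action": "wbgetentities", "ids": qid,
            "props": "claims", "format": "json",
        })
        return r.json()["entities"][qid].get("claims", {})

    def linked_items(claims, prop):
        """Item IDs referenced by statements of the given property."""
        ids = []
        for statement in claims.get(prop, []):
            snak = statement["mainsnak"]
            if snak["snaktype"] == "value":
                ids.append("Q%d" % snak["datavalue"]["value"]["numeric-id"])
        return ids

    claims = get_claims("Q254")                   # Mozart (assumed item ID)
    if "Q5" in linked_items(claims, "P31"):       # instance of: human
        print("father:", linked_items(claims, "P22"))
        print("mother:", linked_items(claims, "P25"))
        for statement in claims.get("P214", []):  # VIAF authority control ID
            print("VIAF:", statement["mainsnak"]["datavalue"]["value"])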

Instead of looking only for direct relatives, a tool can also follow a “chain” of certain properties between items and retrieve an “item cluster”, such as a genealogical tree (pretty and heavy-duty tree for Mozart). The Wikidata family tree around John F. Kennedy contains over 10,000 people at the time of writing. In a similar fashion, a tool can follow taxonomic connections between species up to their taxonomic roots and generate an entire tree of life (warning: huge page!).
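
A sketch of that chain-following idea, again in Python with requests, using the (assumed) parent properties P22 and P25 and a breadth-first search with a hard item limit so the sketch does not hammer the API:

    # Sketch: follow "father" (P22) and "mother" (P25) statements from a
    # start item to collect a small ancestor cluster, breadth-first.
    import requests
    from collections import deque

    API = "https://www.wikidata.org/w/api.php"

    def parents(qid):
        claims = requests.get(API, params={
            "action": "wbgetentities", "ids": qid,
            "props": "claims", "format": "json",
        }).json()["entities"][qid].get("claims", {})
        found = []
        for prop in ("P22", "P25"):
            for statement in claims.get(prop, []):
                snak = statement["mainsnak"]
                if snak["snaktype"] == "value":
                    found.append("Q%d" % snak["datavalue"]["value"]["numeric-id"])
        return found

    def ancestor_cluster(start, max_items=30):
        seen, queue = {start}, deque([start])
        while queue:
            for parent in parents(queue.popleft()):
                if parent not in seen and len(seen) < max_items:
                    seen.add(parent)
                    queue.append(parent)
        return seen

    print(ancestor_cluster("Q254"))   # ancestors of Mozart (assumed item ID)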

These tools demonstrate that, even in its early stages, Wikidata makes it possible to generate complex results with a fairly moderate amount of programming. For a more futuristic demo, talk to Wiri (Google Chrome recommended).

Edit this item

Unsurprisingly to anyone who has volunteered on Wikimedia projects before, tools to help with editing are also emerging. Some have the dual function of interrogating Wikidata and displaying results while also pointing out “things to do”. If you look at the genres of television series on Wikidata, you will notice that over half of them have no genre assigned. (Hint: click on a slice of the pie chart to see the items. Can you assign a genre to Lost?)
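
Under the hood, a to-do check like this amounts to testing items for a missing statement. A minimal sketch, assuming the “genre” property is P136 and using placeholder item IDs (a real tool would feed in actual television-series items):

    # Sketch: report which items lack a "genre" (P136, assumed ID) statement,
    # the kind of check behind a "things to do" list.
    import requests

    API = "https://www.wikidata.org/w/api.php"

    item_ids = ["Q42", "Q254"]   # placeholder IDs; use real series items here
    entities = requests.get(API, params={
        "action": "wbgetentities", "ids": "|".join(item_ids),
        "props": "claims|labels", "languages": "en", "format": "json",
    }).json()["entities"]

    for qid, entity in entities.items():
        if "P136" not in entity.get("claims", {}):
            label = entity.get("labels", {}).get("en", {}).get("value", qid)
            print("no genre yet:", label)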

When editing Wikidata, one usually links to an item by looking for its name. Bad luck if you look for “John Taylor”: there are currently 52 items with that name but no distinguishing description. If you want to find all items that use the same term, try the Terminator; it also has (daily updated) lists of items that have the same title but no description.
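
One way to do that lookup programmatically is a plain label search against the API; a sketch, assuming Python with requests:

    # Sketch: search Wikidata for items labelled "John Taylor" and show
    # how many come back without a description to tell them apart.
    import requests

    API = "https://www.wikidata.org/w/api.php"

    hits = requests.get(API, params={
        "action": "wbsearchentities", "search": "John Taylor",
        "language": "en", "limit": 50, "format": "json",
    }).json()["search"]

    for hit in hits:
        print(hit["id"], hit.get("label", ""),
              "--", hit.get("description", "(no description)"))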

Similarly, you can look for items by Wikipedia category. If you need a more complex filter, or want to write your own tool and are looking for something to ease your workload, there is a tool that can find, say, operas without a librettist (you will need to edit the URL to change the query, though).
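
One way to do the category lookup yourself is to combine the two APIs: ask Wikipedia for the category members, then map the article titles to Wikidata items. A rough sketch, with the category name as an assumed example:

    # Sketch: resolve members of an English Wikipedia category (example
    # name assumed) to their Wikidata items via the enwiki sitelinks.
    import requests

    WP_API = "https://en.wikipedia.org/w/api.php"
    WD_API = "https://www.wikidata.org/w/api.php"

    # Step 1: article titles in the category (first batch only, max 50).
    members = requests.get(WP_API, params={
        "action": "query", "list": "categorymembers",
        "cmtitle": "Category:Operas by Giuseppe Verdi",
        "cmnamespace": 0, "cmlimit": 50, "format": "json",
    }).json()["query"]["categorymembers"]
    titles = [m["title"] for m in members]

    # Step 2: map the titles to Wikidata items through their enwiki sitelinks.
    entities = requests.get(WD_API, params={
        "action": "wbgetentities", "sites": "enwiki",
        "titles": "|".join(titles), "props": "info", "format": "json",
    }).json()["entities"]

    for qid in entities:
        if not qid.startswith("-"):   # skip titles with no Wikidata item
            print(qid)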

There are also many JavaScript-based tools that work directly on Wikidata. A single click to import all language links or species taxonomy from Wikipedia, find authority control data, declare the current item to be a female football player from Bosnia, or apply the properties of the current item to all items in the same Wikipedia category — tools for all of these exist.

This is only the beginning

While most of these tools are little more than demos, or primarily serve Wikidata and its editors, they nicely showcase the potential of the project. There might not be much you can learn about Archduke Ernest of Austria from Wikidata, but it is more than you would get on the English Wikipedia (no article), and it might be enough information to write a stub article. With more statements being added, more property types (dates, locations) emerging, and more powerful ways to query Wikidata, I am certain we will see many more, and even more amazing, tools written in the near future. Unless the Singularity writes them for us.

Comments

  1. […] which for example now makes it possible to enter the date of birth of a person. Magnus Manske blogged about the tool ecosystem that is building around Wikidata. During the next 3 months, the team will be working with 3 Google […]

  2. […] Wikidata has now begun to serve all language versions of Wikipedia as a common source of structured data that can be used in any Wikipedia article, e.g. in infoboxes. Wikidata’s machine-readable knowledge database already contains over 11 million items. They can also be queried, evaluated and edited with the help of a growing collection of tools. […]

  3. Magnus Manske
    7. May 2013 at 19:40

    @Chris: These ships should be notable on Wikidata according to [1]. We could create items even if there is no Wikipedia article (yet). We should check that all major data can be represented by Wikidata properties first, though.

    [1] http://www.wikidata.org/wiki/Wikidata:Notability

  4. Chris Keating
    7. May 2013 at 18:57

    Very interesting. I have a thought: How does Wikidata work for ships? A British museum has released a dataset based on its own collections and research, in part so it can be used on Wikipedia. However, it might be more useful on Wikidata. You can see the information in the dataset in some PDFs here, though it is also available in CSV files:

    http://www.rmg.co.uk/researchers/research-areas-and-projects/warship-histories/

  5. Andy Mabbett
    7. May 2013 at 14:37

    The English-language article http://en.wikipedia.org/wiki/Lost_(TV_series) gives five genres for Lost, provided by volunteer editors and moderated by the wider community. Why are we asking humans to re-do that work?

  6. Andy Mabbett
    7. May 2013 at 14:24

    “Archduke Ernest of Austria from Wikidata” links to German and other language articles which have sources; anyone fluent in one or more of those languages and English may make a translation.

  7. Lydia Pintscher
    7. May 2013 at 12:54

    Please be aware that Wikidata is a very young project. Sources, for example, only started working a few weeks ago, and the community is still deciding how exactly to make use of them. So while this surely isn’t perfect, I don’t think you would have expected anything else so shortly after Wikipedia started, either. This needs a bit more time.

  8. Hannes
    6. May 2013 at 19:11

    “Archduke Ernest of Austria from Wikidata” is not a good starting point for a stub. The information has no reliable sources, as required by every Wikipedia project, and it provides neither context nor meaning. For techies this often doesn’t matter; after all, a bot could create an article from this data, and a lot of Wikidata activists would call it an “article”. But there is no editorial content. When will Wikidata finally start to provide reliable data from external sources?
