


Wikidata and other technical bits at Wikimania

Denny, Lydia and Daniel (by Fabrice Florin, CC-by-sa 2.0)

I’m back from an amazing Wikimania. First of all, thank you to everyone who helped make the event happen. It was very well organized and, overall, a useful and productive event. I was there to discuss everything Wikidata, as well as new technology like the VisualEditor and Flow and how they affect the German-language Wikipedia.

It felt like Wikidata and the VisualEditor were on everyone’s mind during this Wikimania. No matter which talk, panel or dinner I went to, every single one of them mentioned Wikidata and the VisualEditor in some way. It’s great to see the Wikimedia community embrace Wikidata as its sister project. And the VisualEditor, while still rough, seems to be getting to that point very quickly too.


On truths and lies

(The German version of this article is here.)

This is the second in a short series of blog entries in which I explain some of the design decisions behind Wikidata. The first one was about restricting property values or properties. The essays represent my personal opinion, and are not to be understood as the official opinion of the Wikidata project.

Databases have an aura of correctness. When we query a database, we expect the result that comes back to basically be The Answer and The Truth. Ask Amazon’s database about the author of the Bible. Ask IMDB about the director of Adaptation. You are not expecting to get a possible answer, or different points of view – you expect one definitive answer.

Wikidata is collecting structured data about the world. It is basically a crowdsourced database. Unlike text, structured data necessarily and unfortunately lacks nuance. Whereas it is possible to talk about the statehood of Kosovo in an NPOV way in natural language, a naive approach to representing that in structured data would fail: either we say Kosovo is a state, or we do not. There are no shades of grey.

Fortunately, some of the roots of Wikidata lie in an EU research project called RENDER. The goal of this project is to explore and support the diversity of knowledge on the Web. RENDER discards the assumption of a simple, single truth – and this was inherited by the Wikidata data model. Instead of collecting facts, we collect statements. We define statements as claims that can have references. A reference supports the claim. A good example is ethanol, where the CAS number – a standard identifier for chemical compounds – is given with a reference to the actual standard, pointing out the page in the source.
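To make the idea of "statements as claims with references" a bit more concrete, here is a minimal sketch in Python. It is not the actual Wikidata data model: the class names, field names, and the placeholder sources are all assumptions made for illustration. It only shows that two conflicting claims can sit side by side, each carrying its own reference.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reference:
    """A pointer to a source that supports a claim (hypothetical structure)."""
    source: str          # e.g. the title of a standard or a book
    locator: str = ""    # e.g. a page number inside that source


@dataclass
class Statement:
    """A claim about an item, plus the references that support it."""
    item: str
    prop: str
    value: str
    references: list[Reference] = field(default_factory=list)


# Two claims that contradict each other can live side by side,
# each backed by its own (placeholder) reference.
STATEMENTS = [
    Statement("Kosovo", "instance of", "sovereign state",
              [Reference("hypothetical source A", "p. 12")]),
    Statement("Kosovo", "part of", "Serbia",
              [Reference("hypothetical source B")]),
]


def values_for(item: str, statements: list[Statement]) -> list[tuple[str, str, int]]:
    """List every (property, value, number of references) recorded for an item."""
    return [(s.prop, s.value, len(s.references))
            for s in statements if s.item == item]
```

Calling `values_for("Kosovo", STATEMENTS)` returns both claims rather than a single "answer", which is the point: deciding between them is left to the reader or reuser.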

Unlike many other databases, Wikidata can contain contradicting statements, supported by different references. Unlike the natural text in Wikipedia, Wikidata does not offer the possibility to reconcile and explain the differences in prose, providing due weight to the different points of view. The responsibility for deciding which sources to trust lies with the Wikidata reader and reuser. I expect quite a bit of research and exploration around this question in the coming years. The first reusers to deal with these issues will be the Wikipedia communities that opt to use data from Wikidata.

In the next few weeks and months we will add a few more features to support the diversity of statements in Wikidata.

Currently, the most obvious omission is a lack of datatypes to specify numbers, text and URLs. Only with these datatypes will it be possible to actually write down references in their full glory. Another opportunity – once URLs are available – would be to provide content locators for text in HTML pages through XPath, oxPath, CSS selectors, or something similar, thus enabling bots to check if the given references are still valid. I am very curious to see how the usage of references and sources will develop in and around Wikidata.
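As a rough illustration of what such a bot check could look like, the sketch below uses the limited XPath subset in Python's standard `xml.etree.ElementTree`. It assumes the page is well-formed XML, which real HTML often is not, so an actual bot would need a proper HTML parser. The function name, the sample markup, and the locator are all made up for this example.

```python
import xml.etree.ElementTree as ET


def reference_still_valid(page_markup: str, locator: str, expected_text: str) -> bool:
    """Return True if the node at `locator` still contains `expected_text`."""
    root = ET.fromstring(page_markup)
    node = root.find(locator)      # ElementTree supports a subset of XPath
    if node is None:
        return False               # the page structure changed
    return expected_text in "".join(node.itertext())


# A tiny, well-formed stand-in for a referenced web page.
PAGE = (
    "<html><body>"
    "<div id='facts'><p>CAS number: 64-17-5</p></div>"
    "</body></html>"
)

# Hypothetical content locator stored alongside the reference URL.
LOCATOR = ".//div[@id='facts']/p"
```

With this toy page, `reference_still_valid(PAGE, LOCATOR, "64-17-5")` holds; if the paragraph were moved or the number changed, the check would fail and a human could be asked to look at the reference.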

Another major feature that will be introduced in the course of this year is the possibility to rank statements: not all statements are to be regarded equally. We will introduce three ranks, and every statement will be in one of them: preferred, normal, and deprecated.

“Preferred” statements should be the most current and most widely accepted statements. There can be several preferred statements for the same item and property.

“Deprecated” statements are those that are considered unreliable for some reason. They are still recorded because they might have a strong source supporting them, or because they are widespread for some reason even though they are no longer accepted. Examples can include typos from influential textbooks – for example regarding the iron content of spinach, or the length of the Rhine – or numbers spread by some form of propaganda that are no longer considered correct today.

“Normal” statements are thus the ones left, which are neither “preferred” nor “deprecated”. This will often apply to historic statements (the population of Rome in the time of Julius Caesar, former capitals of Russia, etc.).

Technically, we will start by using only preferred statements for answering queries (for example, when you ask for all capitals with a population of less than 500,000, you won’t get answers where the city had a population of 120,000 in the 16th century). They are also the only statements that will be returned by the property parser function. The Lua interface will have access to all statements and thus provide full flexibility. It is planned to extend query answering later to support more complex queries, at which point we will have to think about integrating other ranks.
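The effect of ranks on query answering can be sketched in a few lines of Python. The three rank names come from the description above; everything else (class and function names, the cities, the population figures) is invented for illustration and says nothing about how Wikidata implements this internally.

```python
from dataclasses import dataclass
from enum import Enum


class Rank(Enum):
    PREFERRED = "preferred"
    NORMAL = "normal"
    DEPRECATED = "deprecated"


@dataclass
class RankedStatement:
    item: str
    prop: str
    value: int
    rank: Rank = Rank.NORMAL


def preferred_only(statements):
    """Keep only the statements a simple query would look at."""
    return [s for s in statements if s.rank is Rank.PREFERRED]


def items_below(statements, prop, limit):
    """Items whose preferred value for `prop` is below `limit`."""
    return sorted({s.item for s in preferred_only(statements)
                   if s.prop == prop and s.value < limit})


# Made-up figures for two hypothetical capitals.
DATA = [
    RankedStatement("City A", "population", 120_000, Rank.NORMAL),      # historic figure
    RankedStatement("City A", "population", 2_800_000, Rank.PREFERRED),
    RankedStatement("City B", "population", 400_000, Rank.PREFERRED),
    RankedStatement("City B", "population", 40_000, Rank.DEPRECATED),   # known typo
]
```

Here `items_below(DATA, "population", 500_000)` yields only `["City B"]`: the historic figure for City A is still stored, but because it is merely "normal" it does not leak into the answer.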

The ranks should allow for a more inclusive policy in Wikidata, allowing it to reflect a wider diversity of knowledge.

To give an idea of the time scale: we will first implement the datatypes that are still missing, and then, as a prerequisite for ranks, the possibility to reorder statements. After that, ranks will be the next feature to land in Wikidata.

Ranks introduce a vector for debate, which has not been there in Wikidata yet. The question moves from “should this statement be included?” to “what should be the rank of this statement?” This seems like a necessary step: unlike natural text, Wikidata otherwise could not include statements that are agreed on to be bogus but that have historical or other value. This makes it even more important to remember that Wikidata is not about truth, but about collecting referenced statements in a secondary database. The criterion for inclusion should not be veracity, but verifiability – a policy that has served Wikipedia very well.

Wikidata will always – and that is both a necessity and acknowledged by design – fall short of Wikipedia in many aspects. Wikipedia articles can explore causal and informal connections, they can inspire curiosity, and they can support one of the major modes of knowledge transfer between humans: storytelling. Wikidata has other, unique advantages: it can provide some ground data about a topic of interest in many languages more easily, and it provides the data in a way that is much more accessible for bots and apps. It could be a step towards relieving some Wikipedias of a lot of bot-created articles that are never touched by a human editor, clutter recent changes, and skew statistics.

Without the ability to express a plurality of statements about an item – even if they are considered truths only by some and lies by others – Wikidata would fall short of one of the major pillars of Wikipedia, the Neutral Point of View and the possibility of integrating conflicting points of view.

I hope that the technical platform that we as developers are building, and the rules and processes of the communities in Wikidata, the Wikipedias, and other Wikimedia projects, are establishing a useful ecosystem, understanding the limitations of each project, and discovering how we can most effectively help each other. And this means understanding the peculiar relationship between Wikidata and the Truth.


The Wikidata tool ecosystem

(The German version of this article is here.)

The following is a guest post by Magnus Manske, active tool developer around Wikidata and author of the software that later evolved into MediaWiki.

Wikidata is the youngest child of the Wikimedia family. Its main purpose is to serve as a „Commons for factoids“, a central repository for key data about the topics on, and links between, the hundreds of language editions of Wikipedia. At time of writing, Wikidata already contains about 10 million items, more than any edition of Wikipedia (English Wikipedia currently has 4.2 million entries). But while, as with Commons, its central purpose is to serve Wikipedia and its sister projects, Wikidata has significant value beyond that; namely, it offers machine-readable, interlinked data about millions of topics in many languages via a standardized interface (API).

Such a structured data repository has long been a „holy grail“ in computer science, since the humble beginnings of research into artificial intelligence, to current applications like Google’s Knowledge Graph and Wolfram Alpha, and towards future systems like „intelligent“ user agents or (who knows?) the Singularity.

The scale of any such data collection is a daunting one, and while some companies can afford to pour money into it, other groups, such as DBpedia, have tried to harvest the free-form data stored in Wikipedia. However, Wikidata’s mixture of human and bot editing, the knowledge of Wikipedia as a resource, and evolving features such as multiple property types, source annotation, and qualifiers add a new quality to the web of knowledge, and several tools have already sprung up to take advantage of these, and to demonstrate its potential. A fairly complete list is available.

Views on Wikidata


Family tree of Johann Sebastian Bach

For a straightforward example of such a tool, have a look at Mozart. This tool does not merely pull and display data about an item; it „understands“ that this item is a person, and queries additional, person-specific items, such as relatives. It also shows person-specific information that does not refer to other items, such as Authority Control data. Mozart’s compositions are listed, and can be played right on the page, if a file exists on Commons. To a degree, it can also use the language information in Wikidata, so you can request the same page in German (mostly).

Instead of looking only for direct relatives, a tool can also follow a „chain“ of certain properties between items, and retrieve an „item cluster“, such as a genealogical tree (pretty and heavy-duty tree for Mozart). The Wikidata family tree around John F. Kennedy contains over 10,000 people at time of writing. In similar fashion, a tool can follow taxonomic connections between species up to their taxonomic roots, and generate an entire tree of life (warning: huge page!).
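Following a „chain“ of properties is, at heart, a graph traversal. The sketch below does a breadth-first walk over a tiny in-memory dictionary; a real tool would fetch the links from Wikidata instead. The property names ("father", "child"), the data layout, and the function name are assumptions made for this example.

```python
from collections import deque

# Toy data: item -> {property: [linked items]}
LINKS = {
    "Johann Sebastian Bach": {
        "father": ["Johann Ambrosius Bach"],
        "child": ["Carl Philipp Emanuel Bach", "Johann Christian Bach"],
    },
    "Johann Ambrosius Bach": {
        "father": ["Christoph Bach"],
        "child": ["Johann Sebastian Bach"],
    },
    "Carl Philipp Emanuel Bach": {
        "father": ["Johann Sebastian Bach"],
    },
}


def item_cluster(start, links, follow):
    """Collect every item reachable from `start` via the properties in `follow`."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for prop in follow:
            for neighbour in links.get(current, {}).get(prop, []):
                if neighbour not in seen:   # avoid looping through parent/child cycles
                    seen.add(neighbour)
                    queue.append(neighbour)
    return seen
```

Choosing which properties go into `follow` decides the shape of the cluster: `("father", "child")` gives a family tree, while only `("child",)` gives the descendants of the starting item.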

These tools demonstrate that even in its early stages, Wikidata makes it possible to generate complex results with a fairly moderate amount of programming. For a more futuristic demo, talk to Wiri (Google Chrome recommended).

Edit this item

Unsurprisingly to anyone who has volunteered on Wikimedia projects before, tools to help with editing are also emerging. Some have the dual function of interrogating Wikidata and displaying results, while at the same time informing about „things to do“. If you look at the genre of television series on Wikidata, you will notice that over half of them have no genre assigned. (Hint: Click on the „piece of pie“ in the pie chart to see the items. Can you assign a genre to Lost?)

When editing Wikidata, one usually links to an item by looking for its name. Bad luck if you look for „John Taylor“, for there are currently 52 items with that name but no discerning description. If you want to find all items that use the same term, try the Terminator; it also has (daily updated) lists with items that have the same title but no description.
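The kind of list described here (items that share a label but lack a description to tell them apart) can be produced with a simple grouping step. This is only a sketch of the idea, not the Terminator's actual code; the item IDs, descriptions, and function name below are placeholders.

```python
from collections import defaultdict


def ambiguous_items(items):
    """Map each shared label to the item IDs that have no description.

    `items` is an iterable of (item_id, label, description) tuples.
    """
    by_label = defaultdict(list)
    for item_id, label, description in items:
        by_label[label].append((item_id, description))

    result = {}
    for label, entries in by_label.items():
        if len(entries) > 1:                       # label is shared
            missing = [item_id for item_id, description in entries
                       if not description]
            if missing:
                result[label] = missing
    return result


# Placeholder IDs and descriptions, not real Wikidata items.
SAMPLE = [
    ("item-1", "John Taylor", "hypothetical poet"),
    ("item-2", "John Taylor", ""),
    ("item-3", "John Taylor", ""),
    ("item-4", "Unique Name", ""),
]
```

For the sample data, `ambiguous_items(SAMPLE)` points at `item-2` and `item-3`; `item-4` has no description either, but its label is unique, so it is not a disambiguation problem.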

Similarly, you can look for items by Wikipedia category. If you want some more complex filter, or want to write your own tool and look for something to ease your workload, there is a tool that can find, say, Operas without a librettist (you will need to edit the URL to change the query, though).

There are also many JavaScript-based tools that work directly on Wikidata. A single click to import all language links or species taxonomy from Wikipedia, find authority control data, declare the current item to be a female football player from Bosnia, or apply the properties of the current item to all items in the same Wikipedia category — tools for all of these exist.

This is only the beginning

While most of these tools are little more than demos, or primarily serve Wikidata and its editors, they nicely showcase the potential of the project. There might not be much you can learn about Archduke Ernest of Austria from Wikidata, but it is more than you would get on English Wikipedia (no article). It might be enough information to write a stub article. And with more statements being added, more property types (dates, locations) emerging, and more powerful ways to query Wikidata, I am certain we will see many more, and even more amazing, tools being written in the near future. Unless the Singularity writes them for us.


Wikidata all around the world

(The German version of this article is here.)

For a month now, 11 Wikipedias have had the ability to include data from Wikidata in their articles. Two days ago English Wikipedia was added to that group. Today the remaining 274 are joining. Usage examples are in the last blog entry. There is also an FAQ for this deployment.

This is a huge step for Wikidata and at the same time also another beginning. It’s a huge step because from now on all Wikipedias are able to collect, curate and use data together. For example, every Wikipedia can query the ID of a movie on the Internet Movie Database and use it in its articles as soon as someone has added it to Wikidata. At the same time it is a beginning because there is still a lot to do. Accessing the data has to be made easier. More data has to be added to Wikidata (and translated where necessary). More sources have to be added to existing claims. More data types need to be made available – for example geocoordinates and time. Your help and your feedback are very welcome and important here.

We’re looking forward to the next steps!


And that makes 12

(The German version of this post is here.)

Today the English Wikipedia got the ability to include data from Wikidata. Four weeks ago the first 11 Wikipedias started testing this second phase of the project. This means by now 12 Wikipedias can make use of the shared data in their infoboxes for example. The available data includes things like conservation status for a species, ISBN for a book or the top level domain of a country.

A Request for Comments about how to use data from Wikidata is currently ongoing. Until the Request for Comments is closed you can continue to try it out on test2.wikipedia.org.

There are two ways to access the data:

  • Use a parser function like {{#property:p159}} in the wiki text of the article on the Wikimedia Foundation. This will return “San Francisco” as that is the current headquarters location of the non-profit.
  • For more complicated things you can use Lua. The documentation for this is here.

We are working on expanding the parser function so you can for example use {{#property:headquarters location}} instead of {{#property:p159}}. The complete plan for this is here.
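To illustrate the planned behaviour (accepting a property label as well as a property ID), here is a toy model in Python. It is not how the parser function is implemented in MediaWiki; the lookup tables and function names are hypothetical, and the two data points are simply the examples quoted in these posts.

```python
# Hypothetical lookup tables standing in for Wikidata.
PROPERTY_LABELS = {
    "headquarters location": "p159",
    "chief executive officer": "p169",
}

ITEM_DATA = {
    "Wikimedia Foundation": {"p159": "San Francisco"},
    "Yahoo!": {"p169": "Marissa Mayer"},
}


def resolve_property(argument: str) -> str:
    """Turn a label into a property ID; IDs pass through unchanged."""
    key = argument.strip().lower()
    return PROPERTY_LABELS.get(key, key)


def render_property(page_item: str, argument: str) -> str:
    """Return the value a call like {{#property:...}} would show on a page."""
    prop_id = resolve_property(argument)
    return ITEM_DATA.get(page_item, {}).get(prop_id, "")
```

In this model, `render_property("Wikimedia Foundation", "p159")` and `render_property("Wikimedia Foundation", "headquarters location")` give the same result, and a property the item does not have renders as an empty string.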

The next step is the deployment on all remaining 274 Wikipedias. If there are no issues they will follow on Wednesday.

There is an FAQ for this deployment. Please help us with testing and feedback. The best place to leave feedback is this discussion page.


You can have all the data!

(The German version of this post is here.)

Today the first 11 Wikipedias got the ability to include data from Wikidata in their articles. These are the Italian, Hebrew, Hungarian, Russian, Turkish, Ukrainian, Uzbek, Croatian, Bosnian, Serbian and Serbo-Croatian Wikipedias. If you are curious you can also try it out on test2.wikipedia.org. This means the editors on these Wikipedias are now able to make use of the growing amount of structured data that is available in Wikidata as a common dataset. It includes things like conservation status for a species, ISBN for a book or the top level domain of a country.

There are two ways to access the data:

  • Use a parser function like {{#property:p169}} in the wiki text of the article on Yahoo!. This will return “Marissa Mayer” as she is the chief executive officer of the company.
  • For more complicated things you can use Lua. The documentation for this is here.

We are working on expanding the parser function so you can for example use {{#property:chief executive officer}} instead of {{#property:p169}}. The complete plan for this is here.

The next step is the deployment on the other Wikipedias. We will carefully monitor performance and if there are no issues they will follow within a week or two.

We have prepared an FAQ for this deployment and are looking forward to your testing and feedback. The best place to leave feedback is this discussion page.


Some data on Wikidata

(The German version of this article is here.)

This weekend we saw the creation of the seven-millionth item on Wikidata. We had already collected some data based on the Wikidata database dumps, but now we have extended the scripts so that they can provide us with daily updates. We want to use this chance to publish a few statistics.

Within the last month – since statements got enabled – more than 660,000 of these items also got statements about them, resulting in more than 1.4 million statements. The item with the most statements is the United Nations (Q1065), listing all member states. The growth of the number of statements is amazing and well beyond what we expected.

So far we have more than 22 million links to Wikipedia articles. There are about 24-25 million Wikipedia articles, which means that we have roughly 90% of all links already in Wikidata. Assuming the bots continue working as efficiently as they have so far, all links could be transferred in about a month or two, and then the rapid growth in the number of items is expected to slow down considerably.

At the same time, the number of edits on Wikidata is increasing a lot. With more than 12.5 million edits as of now, Wikidata is one of the most dynamic Wikimedia projects. One might say that this is all due to bot activity, but that would be very wrong. About 2 million edits have been done by human editors, and the percentage of edits performed by human editors is actually increasing. More than 4,500 human editors have been active on Wikidata in the last thirty days.
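For readers who want to check the proportions behind the last two paragraphs, the arithmetic is short. The inputs are the rounded figures quoted above (which are themselves lower bounds or approximations), so the results are only indicative.

```python
def share(part: float, whole: float) -> float:
    """Fraction of `whole` that `part` represents."""
    return part / whole


# Language links: 22 million in Wikidata vs. 24-25 million Wikipedia articles.
coverage_low = share(22_000_000, 25_000_000)    # if there are 25 million articles
coverage_high = share(22_000_000, 24_000_000)   # if there are 24 million articles

# Edits: about 2 million by humans out of about 12.5 million in total.
human_share = share(2_000_000, 12_500_000)

print(f"link coverage: {coverage_low:.0%} to {coverage_high:.0%}")
print(f"human edits:   {human_share:.0%}")
```

So the quoted figures put link coverage somewhere between 88% and 92%, and human editors account for roughly one edit in six.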

Regarding labels and descriptions, Wikidata has collected more than 23 million labels and more than 5 million descriptions so far, in 333 languages. We see a great opportunity for external tools and websites to help us with collecting labels and descriptions, as they basically provide the translation of the content of Wikidata and make its content available in many languages simultaneously.

It is still too early to really understand what these numbers mean, but we can clearly state that the activity of the community exceeds the hopes of the development team. Although many features are still missing, the warm embrace of the Wikidata project in its current state by the Wikimedia communities is simply amazing, and I can only say „thank you“ to those thousands and thousands of editors for their contribution to free knowledge.


Wikidata now live on all Wikipedias

(The German version of this entry is here.)

Today we have enabled Wikidata’s language link support on all remaining Wikipedias. (That is, on all 282 Wikipedias other than Hungarian, Hebrew, Italian and English, which already had it.) This means the links in the sidebar that link to articles on the same topic in other languages are now coming from Wikidata.

What does this mean exactly?

  • Language links in the sidebar are automatically coming from Wikidata, once the article is linked on Wikidata. No special syntax is needed for that.
  • Existing language links in the wikitext will continue to work and override links from Wikidata.
  • For individual articles, language links from Wikidata can be suppressed completely with the noexternallanglinks magic word.
  • Changes on Wikidata that relate to articles on this Wikipedia show up in Recent Changes and Watchlist, if the option is enabled by the user. (There are still some issues with this when you have enhanced recent changes enabled.)
  • At the bottom of the language links list you will see a link to edit the language links that leads you to the linked page on Wikidata.
  • You can see an example of how it looks in the article about the long-eared hedgehog.
  • The second phase (which is about statements/infoboxes) is in use on Wikidata, but can’t yet be used on any Wikipedia. It is scheduled to be enabled on the first few Wikipedias at the end of the month. The rest will follow soon after that.
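The precedence rules in the list above can be summed up in a small function. This is a toy model of the described behaviour, not the code of the Wikibase client; the function name, the dictionary layout, and the placeholder titles are assumptions.

```python
NO_EXTERNAL = "noexternallanglinks"   # magic word mentioned above


def sidebar_links(wikidata_links: dict, local_links: dict, magic_words: set) -> dict:
    """Compute the language links shown in an article's sidebar.

    Both link arguments map a language code to an article title.
    """
    if NO_EXTERNAL in magic_words:
        # Links from Wikidata are suppressed completely for this article.
        return dict(local_links)
    merged = dict(wikidata_links)     # start with what Wikidata provides
    merged.update(local_links)        # links in the wikitext take precedence
    return merged


# Placeholder titles for illustration only.
FROM_WIKIDATA = {"de": "Title from Wikidata (de)", "fr": "Title from Wikidata (fr)"}
FROM_WIKITEXT = {"de": "Title from wikitext (de)"}
```

With these inputs, the German link comes from the wikitext and the French one from Wikidata; adding the magic word leaves only the German link.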

An FAQ for editors is here and documentation exists here.

Staying up-to-date and contributing

There are several ways to stay up-to-date on everything happening around Wikidata. The weekly status updates are the most important ones. You can add yourself here to have them delivered to your talk page on-wiki. There are also Twitter, identi.ca, Facebook and Google+.
If you’d like to contribute to Wikidata this page is a good start.


Restricting the World

(The German version of this article is here.)

This is the first in a short series of blog entries in which I explain some of the design decisions for Wikidata. They are my personal opinion, but they have a strong impact on some features or non-features of Wikidata. This is to explain them.

By Tomascastelazo (Own work), CC-BY-SA-3.0, via Wikimedia Commons

One of the features – others call it a bug – of Wikidata is that you can choose any item as the value for a property. Many of them do not make sense: so, if you have the article on Paris, saying that its country is goat cheese does not really make sense. Wouldn’t it be great if Wikidata knew which values for a country would make sense, and only allow you to choose those, instead of allowing any possible value here? Wouldn’t it be great if the community decided that a property like the widely used P107 could actually be restricted to the six possible values they decided on?

I strongly disagree.

Another feature – others call it a bug – of Wikidata is that you can use any property on any item. If you want to add the capital city of Julius Caesar, you’re welcome to do so. Wouldn’t it be great if Wikidata knew which properties make sense for a given item, and would not only restrict you to using those but even list the ones that still have missing values? Wouldn’t it be great if the community could create templates of properties that should all be filled out for a person, or for a city, or a country – and not allow anything else?

I strongly disagree.

I completely agree that smarter suggestions would be great. Some of these could be pretty trivial to implement: count the frequency for the values of a property and make a suggestion based on that. What about suggesting properties? There’s lots of research going on in that area, basically something like “items with these properties also have these properties” – you might have seen that on certain shopping sites.
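Both kinds of suggestion can be sketched with nothing more than frequency counts. The functions below are a deliberately naive illustration, not a proposal for the actual Wikidata implementation; the function names and the sample data are invented.

```python
from collections import Counter


def suggest_values(statements, prop, top=3):
    """Suggest the most frequently used values for a property.

    `statements` is an iterable of (property, value) pairs.
    """
    counts = Counter(value for p, value in statements if p == prop)
    return [value for value, _ in counts.most_common(top)]


def suggest_properties(items, current_props, top=3):
    """'Items with these properties also have these properties.'

    `items` is an iterable of property sets, one per existing item.
    """
    counts = Counter()
    for props in items:
        if current_props <= props:              # item has all current properties
            counts.update(props - current_props)
    return [prop for prop, _ in counts.most_common(top)]


# Invented sample data.
STATEMENTS = [
    ("country", "France"), ("country", "France"),
    ("country", "Germany"), ("capital", "Paris"),
]
ITEMS = [
    {"country", "population", "mayor"},
    {"country", "population"},
    {"author", "genre"},
]
```

Note that these are suggestions only: nothing stops an editor from entering a value or property that has never been used before, which is exactly the freedom argued for here.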

I am all for better suggestions. What I strongly disagree with are hard restrictions. They provide far too much space for drama and edit-warring. Does every country have a capital? What is a country anyway? What should the possible values for the property „gender“ be? What are the right properties for presidents?

Anything that the system uses for building its user interface and core functionality – labels and descriptions, for example, or the links to Wikipedia pages – cannot have references. This is something the system simply “believes.” On the other hand, if you add a statement saying that Kosovo is a country, you can add a reference to it. Others might say that Kosovo is a part of Serbia. You can add a reference for that too. But if you want to make the user interface use this kind of information – for example when a property is restricted to countries – the system needs to make a call whether Kosovo is an independent country or not. There is no room for the kind of knowledge diversity that Wikidata is built for.

I perceive the danger that some parts of Wikidata might get stuck in an ontology engineering exercise. I think these exercises can be fundamentally unresolvable, and thus that Wikidata’s mandate should not be to solve them. Wikidata should, in my opinion, work on a less abstract level: Let us enter the authors of Aerosmith’s “I Don’t Want to Miss a Thing”, and not discuss whether authorship can apply to a song or not. Let us trace the genealogy of the British monarch, and not whether officials can only be persons. Are you sure that no donkey has ever become a Roman senator? Can you tell whether drinks should have inventors?

Wikidata allows for a unique collaborative space for humans and bots, much more so than Wikipedia, which already sports a pretty amazing example of such an environment. In Wikipedia, we have bots checking for outdated references to websites, for correct usage of punctuation, etc. In Wikidata we can create bots that check whether a teacher was indeed alive before the death of their student. Whether all Roman senators lived before the 6th century. Whether the populations of the cities of a country add up to less than the population of the country as a whole. And the bots doing these checks will need to find a way to report their results to humans, who can then check whether the bots discovered genuine inconsistencies – either in the real world or in Wikidata – or not.
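Two of the checks mentioned above might look roughly like this in a bot. The sketch works on plain Python objects rather than on real Wikidata items, and every name and number in it is hypothetical. Importantly, the functions only report candidates; deciding whether a finding is a genuine inconsistency stays with a human.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    name: str
    born: int
    died: Optional[int] = None


def teacher_student_conflicts(pairs):
    """Report (teacher, student) pairs where the teacher was born
    only after the student had already died."""
    return [(teacher.name, student.name)
            for teacher, student in pairs
            if student.died is not None and teacher.born > student.died]


def city_population_exceeds_country(country_population, city_populations):
    """True if the listed cities together claim more people than the country."""
    return sum(city_populations.values()) > country_population


# Hypothetical data with one deliberate inconsistency.
PAIRS = [
    (Person("Teacher A", 1700, 1770), Person("Student A", 1730, 1800)),
    (Person("Teacher B", 1900), Person("Student B", 1800, 1860)),
]
```

Running `teacher_student_conflicts(PAIRS)` flags only the second pair, which a human can then inspect: the dates may be wrong in Wikidata, or the "teacher" link may point at the wrong item.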

The world is complex. Wikidata aims to collect structured knowledge about this complex world. The roots of Wikidata, as the name hints, are wikis – and wikis mean freedom. Based on this legacy, Wikidata as a piece of software does not aim to implement restricted types for properties, nor restricted sets of properties for types of items, anytime soon.

(I skipped the boring technical details about why it would be hard to implement and what kind of problems could arise from implementations of the suggested features. There are some serious problems with that, but I wanted to stick with the conceptual reasons.)


The Future of Wikidata

(The German version of this article is here.)

Almost precisely one year ago, in March 2012, Wikimedia Deutschland started a completely new Wikimedia project – Wikidata. The goal of Wikidata is to create an open knowledge base about the world that can be read and edited by everyone. Wikidata is the biggest technical project that a chapter of the Wikimedia movement has ever undertaken.

The initial development of the Wikidata project is almost completed. Much has been achieved: the language links from Wikidata are already in use in four Wikipedia language versions (Hungarian, Hebrew, Italian, and English) and the other language versions will follow in the next few days. The current state of Wikidata is nicely illustrated by the example of the page about Russia. With the help of staff, volunteers, and generous donations by [ai]², the Gordon and Betty Moore Foundation and Google, the foundation for the first new Wikimedia project since 2006 has been laid: a scalable infrastructure that allows for the central management of data in a wiki in order to make them available on Wikipedia and beyond, for example on blogs or websites.

The board of Wikimedia Deutschland has decided to continue the development of Wikidata with a team of eight in 2013. Wikimedia Deutschland will fund this development by means of donations.

In the coming year, the team will be working on the further development and maintenance of Wikidata. This includes, among other things:

  • the implementation of the third phase of Wikidata: the automatic creation and updating of lists and visualizations of the data in Wikidata
  • extending Wikidata with other data types, e.g. geodata
  • supporting the community in the growth and expansion of Wikidata, also when it is used outside of the different Wikipedia language versions
  • the possibility of deploying Wikidata in further Wikimedia projects, e.g. Wikimedia Commons or Wikivoyage

We expect that Wikidata will become an integral part of the Wikimedia movement. The excellent cooperation with the Wikimedia Foundation was an essential factor for this development: the Wikimedia Foundation not only operates Wikidata but also the many tools that have supported us during its development. We show our trust in the project and its goals by continuing to support Wikidata. In addition, we ensure the further development and maintenance.

Wikidata has the potential for more „great leaps“: Without the generous donations that funded the first year Wikidata would not have been possible. For the further expansion we hope to find additional partners who support us in reaching our goal of making the sum of all human knowledge accessible for every single person.
