Lexicographical data on Wikidata: Words, words, words

Language is what makes our world beautiful, diverse, and complicated. Wikidata is a multilingual project, serving the more than 300 languages of the Wikimedia projects. This multilinguality at the core of Wikidata means that right from the start, every Item about a piece of knowledge in the world and every property to describe that Item can have a label in one of the languages we support, making Wikidata a polyglot knowledge base that speaks your language. Expanding Wikidata to deal with languages is an exciting new application.
Photo: https://pixabay.com/vectors/hello-languages-word-cloud-foreign-3791381/

Jens Ohlig

25. March 2019

While structured data about the sum of all human knowledge may help machines and artificial intelligence to understand the world, teaching Wikidata to also represent languages can help them understand how humans express their knowledge in words. And just think about all the things that are possible with all the language combinations we have in Wikimedia projects: translations from Estonian to Maltese or Tamil to Zulu — while a printed dictionary for these combinations probably doesn’t exist, it may be possible with structured data about languages.

Items in Wikidata describe a thing, person, or concept in this world. What Wikidata didn’t have until recently was the linguistic side of things: the words to describe these entities as they appear in a language, their grammatical forms and meanings. Over the last months, we developed features in Wikidata and the software that powers it, Wikibase, to describe data about words. We call it lexicographical data.

Lexicographical data were introduced in May 2018 and have been with us now for almost a year. Time to take a closer look.

Lexicographical data means just that: data that can appear in a lexicon. What we’re dealing with here is the linguistic side of words. As the word “word” is already very loaded, we use the linguistic term Lexeme – a Lexeme is an entry in a dictionary.

Lexemes are a little different from other entities in Wikidata and thus have a namespace of their own. Their entity numbers don’t start with a Q — they start with an L. At https://www.wikidata.org/wiki/Lexeme:L1 you can find the first Lexeme in Wikidata, the Sumerian word for “mother”. As Sumerian is one of the oldest languages we know, and the word for mother is one of the most basic words in any language, this may very well be one of the earliest utterances in human history.
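To make the namespace difference concrete, here is a minimal sketch that turns an entity ID into its page URL. The function name `entity_url` is made up for this illustration. The Lexeme URL pattern is taken from the link above; the patterns for Items and Properties are how wikidata.org addresses those pages.

```python
import re

# Namespace prefixes used in page URLs, keyed by the first letter of the ID.
# Items (Q) live in the main namespace and need no prefix.
_NAMESPACE_PREFIX = {"Q": "", "P": "Property:", "L": "Lexeme:"}


def entity_url(entity_id: str) -> str:
    """Return the wikidata.org page URL for an ID such as 'Q42' or 'L1'.

    Hypothetical helper for illustration; it is not part of any library.
    """
    if not re.fullmatch(r"[QPL][1-9][0-9]*", entity_id):
        raise ValueError(f"not a Wikidata entity ID: {entity_id!r}")
    return "https://www.wikidata.org/wiki/" + _NAMESPACE_PREFIX[entity_id[0]] + entity_id


print(entity_url("L1"))  # https://www.wikidata.org/wiki/Lexeme:L1
```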

Every Lexeme has Senses, which tell you what a word means in various languages. It also has Forms which describe how the Lexeme can change grammatically — just think of the 15 cases a noun can be used with in the Finnish language.

Every Lexeme is an entry in exactly one language: the English “apple” and the French “pomme” are different Lexemes (L3257 and L15282). As Wikidata is a linked database, a Lexeme can even link to the Item (with a Q-ID) that represents the concept behind it. You can learn more about the data model for Lexemes on the documentation page.
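The relationship between Lexemes, Forms and Senses can be pictured as a small data structure. The following is a simplified sketch, not the actual Wikibase data model: the class and field names are invented for illustration, grammatical features are plain strings rather than Items, and the Form and Sense IDs, the gloss text and the linked Item `Q89` are assumptions that should be checked against the live Lexeme page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Form:
    form_id: str                       # assumed ID pattern: "<Lexeme ID>-F<n>"
    representation: str                # the written form of the word
    grammatical_features: List[str] = field(default_factory=list)


@dataclass
class Sense:
    sense_id: str                      # assumed ID pattern: "<Lexeme ID>-S<n>"
    glosses: Dict[str, str]            # language code -> short explanation
    item: Optional[str] = None         # Q-ID of the concept, if linked


@dataclass
class Lexeme:
    lexeme_id: str
    lemma: str
    language: str                      # exactly one language per Lexeme
    lexical_category: str
    forms: List[Form] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)

    def representations(self) -> List[str]:
        """Return the written forms in the order they were added."""
        return [form.representation for form in self.forms]


# Illustrative values only; the Lexeme IDs come from the text above.
apple = Lexeme(
    lexeme_id="L3257",
    lemma="apple",
    language="English",
    lexical_category="noun",
    forms=[
        Form("L3257-F1", "apple", ["singular"]),
        Form("L3257-F2", "apples", ["plural"]),
    ],
    senses=[Sense("L3257-S1", {"en": "fruit of the apple tree"}, item="Q89")],
)
pomme = Lexeme("L15282", "pomme", "French", "noun")

print(apple.representations())  # ['apple', 'apples']
```

Keeping the language on the Lexeme itself, rather than on each Form, mirrors the rule that “apple” and “pomme” are separate entries even though both can point to the same concept.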

Some Lexemes in some languages can take many Forms. To make entering them easier, Wikidata Lexeme Forms is a tool that creates a Lexeme with a whole set of Forms, e.g. the declensions of a noun or the conjugations of a verb.

If you want to add Senses (i.e. explanations of what a word really means) to Lexemes, there is also a handy tool: Wikidata Senses shows the list of languages and the number of Senses that are missing; after you select a language, it shows a random Lexeme that needs a Sense so you can create it. Try it while waiting for your bus at a bus stop. It’s a quick way to contribute to Free Knowledge!

Of course, you can also query lexicographical data. An interesting example is this query by Finn Årup Nielsen, which finds persons whose surname matches the past participle form of a Danish verb.
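To give a feel for what such a query looks like, here is a hedged sketch that builds a much simpler one: it lists the Forms of Lexemes in one language that carry one grammatical feature. It is not Finn Årup Nielsen's query. The helper names are invented for this example, and the IDs used below (`Q9035` for Danish, `Q12717679` for past participle) are assumptions to verify on Wikidata before relying on them. The prefixes `dct:`, `wikibase:`, `ontolex:` and `wd:` are predefined by the Wikidata Query Service, so the query does not declare them.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def forms_query(language_qid: str, feature_qid: str, limit: int = 20) -> str:
    """Build a SPARQL query for Forms with a given grammatical feature.

    Hypothetical helper: both arguments are Q-IDs supplied by the caller.
    """
    return f"""
SELECT ?lexeme ?lemma ?representation WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lemma ?lemma ;
          ontolex:lexicalForm ?form .
  ?form ontolex:representation ?representation ;
        wikibase:grammaticalFeature wd:{feature_qid} .
}}
LIMIT {limit}
""".strip()


def query_url(query: str) -> str:
    """Return a GET URL that asks the query service for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# Assumed IDs: Q9035 = Danish, Q12717679 = past participle.
danish_participles = forms_query("Q9035", "Q12717679", limit=5)
print(danish_participles)
print(query_url(danish_participles))
```

The sketch only builds the query text and the request URL; it sends nothing over the network, so you can paste the printed query into the query service and adapt it from there.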

With querying you can also build amazing applications. One of the most common sources of headaches and frustration for learners of German is the articles for nouns: der, die, das. There is very little logic involved, which mostly means that the articles have to be memorized. As Mark Twain remarked in his classic essay “The Awful German Language”: “Every noun has a gender, and there is no sense or system in the distribution; so the gender of each must be learned separately and by heart. There is no other way. To do this one has to have a memory like a memorandum-book. In German, a young lady has no sex, while a turnip has. Think what overwrought reverence that shows for the turnip, and what callous disrespect for the girl.”

Fortunately, there is a game that uses lexicographical data from Wikidata to help you with the memorizing: DerDieDas. Can you make it through 10 randomly selected German nouns by guessing the correct article? For those who already speak German, there is also a French version and a Danish version.
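The core of such a game is small: look up a noun's grammatical gender and compare the matching article with the player's guess. The sketch below shows that idea only; it is not the code behind DerDieDas. The function names are invented, and the three nouns are a hard-coded sample (including Twain's young lady and turnip) standing in for data that a real game would fetch from Wikidata.

```python
# German definite articles (nominative singular) by grammatical gender.
ARTICLE_BY_GENDER = {"masculine": "der", "feminine": "die", "neuter": "das"}

# Hard-coded sample for illustration; a real game would query Lexemes instead.
SAMPLE_NOUNS = {
    "Mädchen": "neuter",    # girl
    "Rübe": "feminine",     # turnip
    "Tisch": "masculine",   # table
}


def correct_article(noun: str) -> str:
    """Return the article that belongs to a noun from the sample."""
    return ARTICLE_BY_GENDER[SAMPLE_NOUNS[noun]]


def check_guess(noun: str, guess: str) -> bool:
    """Compare a guess with the correct article, ignoring case and spaces."""
    return guess.strip().lower() == correct_article(noun)


def score(guesses: dict) -> int:
    """Count the correct guesses in a mapping of noun -> guessed article."""
    return sum(check_guess(noun, guess) for noun, guess in guesses.items())


print(correct_article("Mädchen"))  # das
print(score({"Mädchen": "die", "Rübe": "die", "Tisch": "der"}))  # 2
```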

Wikidata currently has 43440 Lexemes in 315 different languages, dialects or scripts (14762 Lexemes in English, 10334 in French, 3039 in Swedish, 2651 in Nynorsk, 2095 in Polish, and 2027 in German — see the full list). While this is already a good start, it is clearly just the beginning. Start exploring lexicographical data on Wikidata and help build a new repository of Free Knowledge for language!

Comments

  1. […] See the English announcement here – wmdeblog.local/2019/03/25/lexicographical-data-on-wikidata-words-words-words/ […]

  2. Malte
    28. March 2019 at 23:28

    Pretty and wise.
    Lexemes, Senses and Forms
    Nice!

  3. Finn Årup Nielsen
    27. March 2019 at 00:59

    @Andrew Krizhanovsky,

    For the surname query you can follow the unreadable URL to the embed.html page. In the lower left corner of that page is the “Wikidata Query Service” link. Press it and you will get to my query.

  4. Andrew Krizhanovsky
    26. March 2019 at 11:24

    “… query by Finn Årup Nielsen to query for persons with a surname that matches the past participle form of a Danish verb…”

    Could you provide the text of this SPARQL script?
    It would be more convenient for readers than the unreadable URL with embed.html :(

    P.S. Yes, I hope that Wiktionary data will be used in the Wikidata filling…

  5. Infovarius
    26. March 2019 at 09:31

    Not a single word about Wiktionary! It is not only the predecessor of lexicographical data but also contains a lot more of this data at the moment. How can that possibly be?
