Wikidata and Artificial Intelligence: Simplified Access to Open Data for Open-Source Projects
Corinna Schuster
September 17, 2024
More and more people use AI applications to obtain information, among other things. This makes it all the more important that these applications are trained on high-quality data. At the same time, large companies dominate the development of ChatGPT and similar systems because they can afford the cost. Wikimedia Deutschland has launched a new project to support the development of non-profit AI projects and to contribute to a reliable information ecosystem. It aims to make open data from Wikidata easier to use.
As an open knowledge graph with over 112 million machine- and human-readable entries, Wikidata is a centralized source of high-quality open data. All Wikimedia projects, Wikipedia included, access this data to automatically update information such as population figures or dates of birth. Supported by over 12,000 volunteer contributors, Wikidata offers a comprehensive, validated database. Although the data is accessible to developers of open-source projects, they often lack the resources to use it for AI training, something usually only large technology companies can afford.
Vectorized data for machine learning
The goal of the new project is to give smaller open-source projects in particular the opportunity to use data from Wikidata. Wikimedia Deutschland has therefore teamed up with DataStax and Jina AI to process Wikidata’s data so that smaller projects without the financial and human resources of large companies can also use it.
At the center of the new project lies the transformation of Wikidata’s data into semantic vectors – a time-consuming but necessary step that open-source developers typically cannot manage alone. To this end, DataStax is providing a powerful vector database, while Jina AI is contributing an open-source model for vectorizing the textual data.
Transforming the data into vectors allows developers to run semantic search queries more efficiently and to integrate Wikidata’s data into their AI models. This not only makes search faster and more precise but also simplifies embedding Wikidata into RAG (retrieval-augmented generation) applications. These applications reduce AI errors by supplementing their results with current and verified facts.
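To make the idea of semantic search over vectors more concrete, here is a minimal sketch in Python. It is not the project's implementation: the three-dimensional vectors, the entry labels, and the helper names (`semantic_search`, `build_prompt`) are invented for illustration. In practice, an embedding model would produce vectors with hundreds of dimensions and a vector database would do the ranking.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for this example; real embeddings come from a model.
ENTRIES = {
    "Berlin (capital of Germany)": [0.9, 0.1, 0.0],
    "Marie Curie (physicist and chemist)": [0.0, 0.2, 0.9],
    "Paris (capital of France)": [0.8, 0.3, 0.1],
}

def semantic_search(query_vector, entries, top_k=2):
    """Return the labels of the top_k entries most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), label)
              for label, vec in entries.items()]
    scored.sort(reverse=True)
    return [label for _, label in scored[:top_k]]

def build_prompt(question, facts):
    """RAG step: put the retrieved facts in front of the user's question."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Use only these facts:\n{context}\n\nQuestion: {question}"

# Hypothetical vector for the query "European capital city".
query = [0.85, 0.2, 0.05]
results = semantic_search(query, ENTRIES)
prompt = build_prompt("Name a European capital city.", results)
print(prompt)
```

The search compares meanings rather than keywords: the query vector lies close to both capital cities and far from the entry about a scientist, so only the cities are passed on to the language model as context.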
A further goal of the project is to make vandalism on Wikidata easier to detect. Because generative AI can create content en masse, it also lends itself to spreading false information. Vectorizing the data makes it possible to identify and correct potentially damaging changes to Wikidata entries.
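The article does not say how vectors would be used to spot vandalism. One conceivable approach, sketched here purely as an assumption, is to compare the vector of an entry before and after an edit and flag edits whose meaning has shifted sharply. The vectors, the function name `is_suspicious`, and the threshold of 0.5 are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_suspicious(old_vector, new_vector, threshold=0.5):
    """Flag an edit if the new text is semantically far from the old text.

    The threshold is an arbitrary illustrative value, not a tuned setting.
    """
    return cosine_similarity(old_vector, new_vector) < threshold

# Toy vectors, invented for this example.
old_description = [0.9, 0.1, 0.0]     # entry before the edit
minor_edit = [0.88, 0.15, 0.02]       # small wording change
unrelated_edit = [0.0, 0.1, 0.95]     # text with a very different meaning

print(is_suspicious(old_description, minor_edit))      # False
print(is_suspicious(old_description, unrelated_edit))  # True
```

A check like this could only surface candidates for human review; a large change in meaning can also be a legitimate correction.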
AI and Wikimedia Deutschland’s values
With this project, Wikimedia Deutschland affirms its belief in the values of transparency and of free access to information in the form of open data. Particularly in the context of generative AI, which often produces content that is inaccurate or unreliable, making validated data available is an important way to improve the quality of AI-generated content.
Dr. Jonathan Fraine, head of software development at Wikimedia Deutschland, explains: “Many developers share our values, but accessing Wikidata’s data is challenging for them. We must simplify the process in order to make these enormous volumes of data usable for the latest AI applications.” Lydia Pintscher, Portfolio Lead Wikidata, adds: “By making high-quality open data available, we support the communities in developing innovative ideas that benefit humankind rather than serving commercial ends.”
Wikidata as the basis of a more equitable digital future
The significance of this project lies in establishing Wikidata’s data as a reliable source for AI applications. At a time when AI-generated content increasingly dominates the internet, there is a danger that unverified and often incorrect information will be propagated. Wikidata offers a stable alternative. The knowledge graph holds an enormous volume of data that is freely accessible under an open license and is constantly validated and expanded by an active community.
In collaborating with DataStax and Jina AI, Wikimedia Deutschland is creating a technical infrastructure that makes Wikidata’s open knowledge repository usable by smaller development teams as well. Over the long term, this could allow open-source AI projects to more easily hold their own against the dominant tech giants. At the same time, access to reliable data will become easier for everyone, supporting democratic access to knowledge in a digitized world.
The future of AI at Wikimedia Deutschland
In December of 2023, Wikimedia Deutschland began implementing this semantic search plan. The initial beta tests of a prototype are planned for 2025. This project is a huge opportunity to improve the information ecosystem with AI and at the same time to protect the fundamental values of openness and transparency.
This undertaking is an important step in Wikimedia Deutschland’s mission of making Free Knowledge accessible to all. With the help of machine learning and semantic search, access to Wikidata’s valuable data will become even simpler, potentially benefiting not only the developer community but society as a whole.
Presentation of the project in Paris
Jonathan Fraine (head of software development at Wikimedia Deutschland) and Lydia Pintscher (Portfolio Lead Wikidata) presented the project at “AI_dev: Open Source GenAI & ML Summit Europe 2024”. Their presentation is available on YouTube.