Wikibase, the software behind Wikidata, draws a lot of attention from those interested in modelling knowledge for specific domains: scientists, people working on political data, those working with cultural data, and many more. In June, the second meeting in a series of workshops by the Wikibase community brought experts together. Wikibase developer Thomas Arrow was at the workshop and shares his insights on the event.

by Thomas Arrow

Following a first Wikibase workshop in Antwerp, a follow-up meeting was held in Berlin at the WMDE offices. It was funded by the European Research Council and focussed on the modelling of grant data in Wikibases.

Participants of the Wikibase workshop in Berlin. Photo by Lisa-Marie Köhler, CC BY-SA 4.0

The workshop started at Sunday lunchtime with a warm-up day of talks setting the scene for a variety of topic areas:

First came a talk by Diego from the ERC about his past work on modelling grant data and the challenges he found.

Next came a talk on “Federation first” from Andra Waagmeester (Wikimedia volunteer, member of the Gene Wiki Project) about how he sees Wikibase federation, which he put succinctly as ‘SPARQL’ and federation between graph database endpoints. This was followed by a description by Lydia Pintscher (WMDE) of all the different types of federation she could envision, including Andra’s interpretation, and how close to or far from seeing these types we were within the Wikibase ecosystem.
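To make the ‘SPARQL’ view of federation concrete, here is a minimal sketch of a federated query built in Python. It assumes a local Wikibase that links its items to Wikidata items through some property; the property URI passed in and the function name are hypothetical and were not shown at the workshop. The only standard parts are the SPARQL 1.1 `SERVICE` keyword and the public Wikidata query service endpoint.

```python
# Public SPARQL endpoint of the Wikidata query service.
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def build_federated_query(link_property_uri, remote_endpoint=WIKIDATA_SPARQL, limit=10):
    """Return a SPARQL query joining local items to English labels held remotely.

    link_property_uri is a hypothetical property in the local Wikibase whose
    values are items on the remote endpoint.
    """
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?item ?remoteItem ?label WHERE {{
  ?item <{link_property_uri}> ?remoteItem .
  SERVICE <{remote_endpoint}> {{
    ?remoteItem rdfs:label ?label .
    FILTER(LANG(?label) = "en")
  }}
}}
LIMIT {int(limit)}"""


if __name__ == "__main__":
    # The example.org URI is a placeholder, not a real Wikibase property.
    print(build_federated_query("https://example.org/prop/direct/P1"))
```

The resulting query would be sent to the local Wikibase's own SPARQL endpoint, which then contacts the remote endpoint for the part inside the `SERVICE` block.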

Next up was a talk about ShEx from Eric Prud’hommeaux, who is on the W3C ShEx Community Group. He described how Shape Expressions could be used to validate data stored in places like Wikibase. He also showed an in-browser ShEx validator that can indicate whether a given graph meets the constraints of a particular shape expression.

We then heard about the FAIR data principles and how they could work with Wikidata and federated Wikibases.

After a short break we heard about the OpenAIRE project, which has an API providing open data on EU-funded research projects and their outputs.

Raz Shuty of Wikimedia Deutschland gave a short talk about ongoing work on a tool called bubber. He explained that it was still being developed, but that the goal was to provide a click-through interface that generates the config file needed to set up a containerized Wikibase.

Daniel Mietchen and Tom Arrow then presented another outcome of the earlier Antwerp workshop: the Wikibase registry, a Wikibase that stores information about other Wikibases. They encouraged the audience to add their own setup if it was missing.

To finish the day we heard about methods of loading data into Wikidata or a Wikibase. First we heard from Gregg Stupp about the WikidataIntegrator tool, which has been used extensively by the Gene Wiki project. He explained that he had recently adapted it to work with Wikibases other than Wikidata. Antonin Delpeuch then told us about OpenRefine, a tool he works on as a volunteer developer. It provides a graphical interface to load data into Wikidata, and he was keen to adapt it to work with arbitrary Wikibases if possible.

Monday started bright and early and we came together into groups to work on different topic areas for the next day and a half.

One group worked on importing data about grants into Wikidata using OpenRefine. Specifically, they worked on a small dataset of researchers and metadata about them, e.g. ORCID and Scopus ID. They used OpenRefine to ‘reconcile’ this dataset against the parts of it that already existed on Wikidata.

Another area of work was linking a variety of other datasets that already exist on Wikidata to funding sources. For example, the group looked at cell lines and linked those to their discovery publications using SPARQL on the Wikidata query service. A similar strategy from ‘the other direction’ was to search Wikidata for scientific papers whose main subject is a piece of scientific software. This could then help determine who funded that software and how.
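The ‘other direction’ search can be sketched as a SPARQL query built and run from Python. This is an illustration, not the query the group wrote: it assumes the Wikidata identifiers P921 (main subject), P31 (instance of), P279 (subclass of), Q7397 (software) and P859 (sponsor), which should be checked against Wikidata before use. The function names and the User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata query service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_software_paper_query(main_subject="P921", instance_of="P31",
                               subclass_of="P279", software="Q7397",
                               sponsor="P859", limit=50):
    """Build a query for papers whose main subject is a piece of software.

    The default identifiers are assumptions to verify on Wikidata.
    """
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?paper ?software ?sponsor WHERE {\n"
        f"  ?paper wdt:{main_subject} ?software .\n"
        f"  ?software wdt:{instance_of}/wdt:{subclass_of}* wd:{software} .\n"
        f"  OPTIONAL {{ ?paper wdt:{sponsor} ?sponsor . }}\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def run_query(query, endpoint=WDQS_ENDPOINT):
    """Send the query over HTTP and return the list of result bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "wikibase-workshop-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    # Printing only; call run_query(...) to contact the live service.
    print(build_software_paper_query(limit=5))
```

Papers that come back without a sponsor value are still useful: they mark places where funding information is missing on Wikidata.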

The second group worked in great detail on a project that was finally titled DIEGO (Data Integration Extension for Grants Ontology), a detailed graph model to describe the funding for projects. They described the outcomes of this in ShEx and could then use the online validator shown on Sunday by Eric Prud’hommeaux to validate examples directly from Wikidata against it. The model could be summed up verbally as: “Funders empower bureaucrats, who provide money, in partial payments, to projects, which have participants, to attain given goals, possibly in collaboration with other projects.”
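The verbal summary of the model maps naturally onto a handful of linked record types. The following is only a toy sketch of that one sentence in Python dataclasses; it is not the DIEGO model or its ShEx schema, and every class, field and method name here is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Payment:
    """One partial payment of money to a project."""
    amount: float
    provided_by: str  # name of the bureaucrat who provided the money


@dataclass
class Project:
    name: str
    goals: List[str] = field(default_factory=list)
    participants: List[str] = field(default_factory=list)
    collaborators: List["Project"] = field(default_factory=list)
    payments: List[Payment] = field(default_factory=list)

    def total_funding(self) -> float:
        """Sum of all partial payments received so far."""
        return sum(payment.amount for payment in self.payments)


@dataclass
class Bureaucrat:
    name: str

    def pay(self, project: Project, amount: float) -> Payment:
        """Provide money to a project as one partial payment."""
        payment = Payment(amount=amount, provided_by=self.name)
        project.payments.append(payment)
        return payment


@dataclass
class Funder:
    name: str
    bureaucrats: List[Bureaucrat] = field(default_factory=list)

    def empower(self, bureaucrat_name: str) -> Bureaucrat:
        """Empower a new bureaucrat to provide money on the funder's behalf."""
        bureaucrat = Bureaucrat(bureaucrat_name)
        self.bureaucrats.append(bureaucrat)
        return bureaucrat


if __name__ == "__main__":
    funder = Funder("Example Funder")
    officer = funder.empower("Example Officer")
    project = Project("Example Project", goals=["publish open data"])
    officer.pay(project, 100.0)
    officer.pay(project, 50.0)
    print(project.total_funding())
```

In a Wikibase, each of these record types would more likely be an item class with properties linking them, which is what the shape expressions constrain.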

The third group worked on the WikidataIntegrator tool previously shown by Gregg Stupp. They were keen to adapt it to work on a wider range of papers than before. It used to work only on papers available on PubMed or PubMed Central, and they succeeded in producing a working version that gets data on papers from CrossRef.
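For readers unfamiliar with CrossRef, here is a rough sketch of fetching a paper's metadata by DOI from the public CrossRef REST API and reducing it to a few fields. It is not the code added to WikidataIntegrator; the function names and the choice of fields are assumptions made for this example, and the DOI in the usage example is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Base URL of the CrossRef REST API route for single works.
CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_crossref_work(doi):
    """Fetch the metadata record for one DOI (requires network access)."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as response:
        return json.load(response)["message"]


def summarize_work(message):
    """Reduce a CrossRef 'message' record to a few fields of interest."""
    titles = message.get("title") or [""]
    date_parts = message.get("issued", {}).get("date-parts") or [[None]]
    first_date = date_parts[0] if date_parts[0] else [None]
    return {
        "doi": message.get("DOI"),
        "title": titles[0],
        "year": first_date[0],
        "authors": [
            " ".join(part for part in (a.get("given"), a.get("family")) if part)
            for a in message.get("author", [])
        ],
        "funders": [f.get("name") for f in message.get("funder", [])],
    }


if __name__ == "__main__":
    # A hand-written sample record, so no network call is needed here.
    sample = {
        "DOI": "10.1234/example",
        "title": ["An Example Paper"],
        "issued": {"date-parts": [[2018, 6]]},
        "author": [{"given": "Ada", "family": "Lovelace"}],
        "funder": [{"name": "Example Funder"}],
    }
    print(summarize_work(sample))
```

The `funder` field is worth noting in the context of this workshop, since it is one place where grant information is already attached to papers.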

A fourth group worked on infrastructure and thought about the practicalities of a world in which many people run many Wikibases. They created a new Wikibase on the WMF Cloud VPS infrastructure to store data about grants and funders that may be too fine-grained for the Wikidata community to want to curate. This was named ORIG (Open Research Impact Graph).

Much time was also spent working on containerizing another Wikibase: Rhizome, which is one of the longest-running Wikibases other than Wikidata. The group investigated the difficulty of porting a legacy install to the containerized infrastructure and then actually went through the process, noting the pain points along the way as well as the “good” parts where containerization made maintenance easier.

Attendees did not stick rigidly to one group; they moved between them frequently and also held ad-hoc meetings on other topics. Two notable ones this author recalls: the first was about understanding what a data import tool needs in order to work with an arbitrary Wikibase other than Wikidata. This resulted in a plan to expose that information in the Wikibase API so that tools like OpenRefine could happily work on any Wikibase, not just Wikidata. The second went through the mechanics of hosting a Wikibase using the WMDE-developed container infrastructure on a number of different platforms including (but not limited to) Google Cloud Engine, Wikimedia’s own Cloud VPS, AWS and custom OpenStack setups.

The Wikibase community is extremely active and motivated to bring structured data to fields beyond Wikidata itself. We are happy and proud to have the support of the European Research Council in bringing this diverse and talented group of people working with Wikibase together. Both the exchange among stakeholders and the concrete solution for the data model of grants for the ERC were valuable outcomes of this three-day workshop. We hope to bring more data and people together in the future.