
Wikidata for Research – a grant proposal that anyone can edit


5. December 2014

Summary (translated from German): A few weeks ago, this blog reported on an initiative that created Wikidata items for all of the nearly 40,000 human genes. Here, Daniel Mietchen, a researcher at the Museum für Naturkunde Berlin and an active Wikimedian, builds on that idea and presents a European research proposal on integrating Wikidata with scientific databases, which anyone can edit via Wikidata before it is submitted in just under six weeks.

A few weeks ago, this blog was enriched with a post entitled “Establishing Wikidata as the central hub for linked open life science data”. It introduced the Gene Wiki – a wiki-based collection of information related to human genes – and reported upon the creation of Wikidata items for all human genes, along with their annotation with statements imported from a number of scientific databases. The blog post mentioned plans to extend the approach to diseases and drugs, and a few weeks later (in the meantime, Wikidata had won an Open Data award), the underlying proposal for the grant that funds these activities was made public, followed by another proposal that involves Wikidata as a hub for metadata about audiovisual materials on scientific topics.

Now it’s time to take this one step further: we plan to draft a proposal that aims at establishing Wikidata as a central hub for linked open research data more generally, so that it can facilitate fruitful interactions at scale between professional research institutions and citizen science and knowledge initiatives. We plan to draft this proposal in public – you can join us and help develop it via a dedicated page on Wikidata.

The proposal – provisionally titled “Wikidata for research” – will be coordinated by the Museum für Naturkunde Berlin (for which I work), in close collaboration with Wikimedia Germany (which oversees development of Wikidata). About three or four further partners are invited to join in, and you can help determine who these may be. Maastricht University has already signaled interest in covering data related to small molecules, and we are open to suggestions from any discipline, as long as there are relevant databases suitable for integration with Wikidata.

Two aspects – technical interoperability and community engagement – are the focus points of the proposal. In terms of the former, we are interested in external scientific databases providing information to Wikidata in such a way that both parties benefit. This information may take the form of new items, new properties, or new statements added to existing items. One focus here would be on mapping the identifiers that different databases use to describe related concepts, and on aligning the controlled vocabularies built around them.
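
To make the identifier-mapping idea more concrete, here is a minimal sketch in Python. It assumes toy records with hypothetical database names, identifiers and item IDs (none of them are real Wikidata content, and no Wikidata API is called). It only illustrates how mapping each external identifier to one hub item yields cross-references between all participating databases.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (database name, identifier in that database, hub item).
# The item IDs ("Q_GENE_A", ...) are placeholders, not real Wikidata items.
RECORDS = [
    ("db_genes", "G-001", "Q_GENE_A"),
    ("db_proteins", "P-17", "Q_GENE_A"),
    ("db_diseases", "D-9", "Q_DISEASE_X"),
    ("db_genes", "G-002", "Q_GENE_B"),
    ("db_pathways", "W-5", "Q_GENE_A"),
]


def build_hub(records):
    """Group external identifiers by the hub item they are mapped to."""
    hub = defaultdict(dict)
    for database, identifier, item in records:
        hub[item][database] = identifier
    return dict(hub)


def cross_references(hub):
    """Yield identifier pairs from different databases that share a hub item."""
    for item, ids in hub.items():
        # Sort by database name so the pairs come out in a stable order.
        for (db_a, id_a), (db_b, id_b) in combinations(sorted(ids.items()), 2):
            yield item, (db_a, id_a), (db_b, id_b)


hub = build_hub(RECORDS)
links = list(cross_references(hub))

for item, left, right in links:
    print(item, left, "<->", right)
```

In this sketch, each database maintains a single mapping to the hub rather than a separate mapping to every other database, and the pairwise links are derived from the hub.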

In terms of community engagement, the focus would be on the curation of Wikidata-based information, on syncing this curation with other databases (a prototype for that is in the making) and especially on the reuse of Wikidata-based information – ideally in ways not yet possible – be it in the context of Wikimedia projects, research, or elsewhere.

Besides the Gene Wiki project, a number of other initiatives have been active at the interface between the Wikimedia and scholarly communities. Several of these have focused on curating scholarly databases, e.g. Rfam/Pfam and WikiPathways, which would thus seem like good candidates for extending the Gene Wiki’s Wikidata activities to other areas. There is also a wide range of WikiProjects on scientific topics (including within the humanities), both on Wikidata and beyond. Some of them team up with scholarly societies (e.g. Biophysical Society or International Society for Computational Biology), journals (e.g. PLOS Computational Biology) or other organizations (e.g. CrossRef). In addition to all that, research about wikis is regularly monitored in the Research Newsletter.

The work on Wikidata – including contributions by the Gene Wiki project – is being performed by volunteers (directly or through semi-automatic tools), and the underlying software is open by default. Complementing such curation work, the Wikidata Toolkit has been developed as a framework to facilitate analysis of the data contained in Wikidata. The funding proposal for that is public too and was indeed written in the open. Outside Wikidata, the proposal for Wikimedia Commons as a central hub of multimedia from open-access sources is public, as is a similar one to establish Wikisource as a central hub for open-access literature (both of these received support from Wikimedia Germany).

While such openness is customary within the Wikimedia community, it contrasts sharply with current practice within the research community. As first calls for more transparency in research funding are emerging, the integration of Wikidata with research workflows seems like a good context to explore the potential of drafting a research proposal in public.

Like several other Wikimedia chapters, Wikimedia Germany has experience with participation in research projects (e.g. RENDER), but it is not in a position to lead such endeavours. Its interactions with the research community have intensified over the last few years, e.g. through GLAM-Wiki activities, participation in the Leibniz research network Science 2.0, in a traveling science exhibition, or in events around open science. In parallel, the interest of research institutions in engaging with Wikimedia projects has grown, especially so for Wikidata.

One of these institutions is the Museum für Naturkunde Berlin, which has introduced Wikidata-related ideas into a number of research proposals already (no link here – all non-public). One of the largest research museums worldwide, it curates 30 million specimens and is active in digitization, database management, development of persistent identifiers, open-access publishing, semantic integration and public engagement with science. It is involved in a number of activities aimed at bringing biodiversity-related information together from separate sources and making it available in a way compatible with research workflows.

Increasingly, this includes efforts towards more openness. For instance, it participated in the Open Up! project that fed media on natural history into Europeana and in the Europeana Creative project that explores reuse scenarios of Europeana materials, and it leads the EU BON project focused on sharing biodiversity data. Within the framework of the pro-iBiosphere project, it was also one of the major drivers behind the launch of the Bouchout Declaration for Open Biodiversity Knowledge Management, which brings the biodiversity research community together around principles of sharing and openness. Last but not least, the museum participated in the Coding da Vinci hackathon that brought together developers with data from heritage institutions.

As a target for submission of the proposal, we have chosen a call for the development of “e-infrastructures for virtual research environments”, issued by the European Commission. According to the call, “[t]hese virtual research environments (VRE) should integrate resources across all layers of the e-infrastructure (networking, computing, data, software, user interfaces), should foster cross-disciplinary data interoperability and should provide functions allowing data citation and promoting data sharing and trust.”

It is not hard to see how Wikidata could fit in there, nor that this still requires work. Considering that Wikidata is a global platform and that initial funding came mainly from the United States, it would be nice to see Europe taking its turn now. The modalities of this kind of EU funding are such that funds can only be provided to certain kinds of legal entities based in Europe, but we appreciate input from anywhere as to how the project should be shaped.

In order to ensure compatibility with both Wikidata and academic customs, all materials produced for this proposal shall be dual-licensed under CC BY-SA 3.0 and CC BY 4.0.

The submission deadline is very soon – on January 14, 2015, 17:00 Brussels time. Let’s find out what we can come up with by then – see you over there!


Written by Daniel Mietchen


  1. Ivan Ferrero
    18. December 2014 at 14:26

    I spread the word in the Psychology field, for I’d like to contribute to an Open Source Psychology.
    How may Psychology benefit from WikiData?

    I asked the Web, and I’ll repost here the most useful comments.

    The first one:

    “I like the idea of Wikidata as the central hub for linked open life science data. Yet, the concept is too general at the moment, so it’s too early to conclude whether it’s going to work for psychology data”

    I’m not a Researcher and, while I really support the Open Source concept, I need more information and suggestions from people who work in the Research field.

    It’s a new field, so it’s up to us to build it! ;-)

    So…How could it benefit our Psychology?

  2. Scott Edmunds
    13. December 2014 at 14:30

    For proof of concept, if you are interested in seeding Wikidata with a broad variety of pretty well curated biological data types, we have ~50TB of CC0 data in GigaDB (http://gigadb.org) that you are welcome to try. There are obviously plenty of public domain genomics datasets to work with (and I see you’ve already flagged our 50 bird/reptile genomes), but on top of sequencing and optical mapping data we have lots of interesting things like imaging (CT/micro-CT, DICOM, MRI), neuroscience (EEG, fMRI, multielectrode array) and mass spec (metabolomics and mass spec imaging both in the pipeline) data to play with. To do this systematically would probably take a bit of curation and improvements to the API we are working on, but even without that there could be some scope for exploration. Have you considered trying to tie some of this with “Bring Your Own Data” parties? We and ELIXIR-NL/DTL have been experimenting with some of these, and are currently working on getting some funding for more (see: http://blogs.biomedcentral.com/gigablog/2014/08/22/aint-no-party-like-a-bring-your-own-data-party/). I’m not sure if it’s problematic that we are not in the EU, but a lot of the data producers and people we are working with are. Will try to talk more on this at FORCE2015.

  3. Daniel Mietchen
    9. December 2014 at 16:30

    @Snipre: Sorry if that was not clear, but the _materials produced for this proposal_ shall be put under the above-mentioned licenses. We are aware of Wikidata’s policy that
    “[a]ll structured data from the main and property namespace is available under the Creative Commons CC0 License” and fully plan to abide by that.

  4. Spencer Bliven
    9. December 2014 at 12:51

    I am definitely supportive of this proposal, and I think that after the success of specific projects like Rfam, now is the perfect time for this. The momentum is definitely there now for an initiative like this, what with the funding of the Data Discovery Index Consortium in the US and many opportunities for data integration.

    Looking over the current outline of the proposal, I would suggest discussing specific Wikidata developments a bit more. Wikidata is currently very well suited for providing mappings between databases. However, most of this information will presumably be populated by bots from the various member databases, which makes it similar to the many existing DB mapping services. The strength of Wikidata should be in the ability for users to correct annotation errors and add missing links. I think that this requires a lot of tool improvements, for instance better Wikidata searching, better summary pages, data aggregation and visualization, etc. Bots to propagate changes to related entries (where appropriate) might be useful too. I feel like Wikidata is not particularly easy to use currently as a database, so this proposal would be the perfect time to suggest improvements.

  5. Snipre
    9. December 2014 at 11:22

    @Peter Murray-Rust. Due to European law, no database can be released under CC0 without the formal consent of the database owner. I just hope that the project team has discussed the licence problem in detail, because I can’t see how the project can work under CC BY-SA 3.0 and CC BY 4.0 while Wikidata works under CC0. Solutions can be found, but there is a need for a formal declaration that data released in Wikidata are under CC0. Right now, Wikidata doesn’t have a formal system for donating data under a specific licence, as exists for Wikipedia or Commons (see https://en.wikipedia.org/wiki/Wikipedia:Donating_copyrighted_materials). So something has to be done in that direction with the Wikidata team.

  6. Peter Murray-Rust
    7. December 2014 at 15:24

    “no European database will release data under this licence”. I don’t understand this.

    I am personally very supportive of this proposal.

  7. Snipre
    5. December 2014 at 17:00

    Nice initiative. But I think we will have a methodology clash between the open world and the scientific world. For the scientific world, referencing and source citation are two important elements that are not yet common practice in Wikidata. First we have to solve the problem of the CC0 licence: no European database will release data under this licence. So I am waiting for a clear explanation of how this proposal under CC BY-SA 3.0 and CC BY 4.0 can fit with the CC0 of the Wikidata database.
