Data Partnerships in Wikidata: Project Durchblick

21 August 2017

Dr. Georg Schelbert heads the media library of the Institut für Kunst- und Bildgeschichte (IKB) at the Humboldt-Universität Berlin. The name of his project, “Durchblick”, may be translated as “Through the Looking-Glass”, and it fits in more than one sense: the project is about glass slides, so it literally deals with glass you can see through. But it is also about gaining insights from a vast collection of cultural assets, almost like exploring the wonderland hidden in it. Project Durchblick makes extensive use of Wikidata. This kind of data partnership does not take the form of a data donation; instead, it uses Wikidata as a hub for other data collections, so that the objects in those collections share common identifiers.

What is Durchblick all about? What kind of collection is that and how did it start?

We decided to call our project “Durchblick!” because we faced the task of exploring a large number of glass slides that had been in use for many decades in the department of art history at the Humboldt-Universität, and of making them accessible again.

Every institute for the history of art has a collection of slides, large or small, that is used in lectures and seminars. Two formats predominate. The first is the larger glass slide, in which a black-and-white film is applied directly to a glass pane measuring 8.5 x 10 cm; these were produced from the late 19th century until the middle of the 20th century. The second is the so-called small picture slide, which emerged from the 35mm color film format used in movie theaters and has been in use from the 1940s until today. With Durchblick we focus only on the glass slides. The Berlin collection is one of the oldest and largest, because Berlin professors of art history such as Herman Grimm (son of Wilhelm Grimm) and Heinrich Wölfflin discovered and used the potential of projected photographs quite early. The collection thus also reflects the research and teaching interests of famous representatives of the field. Today, after the damage caused by wars, a significant part of it consists of replicas and additions made in the 1950s; over time, however, many interesting new themes were also added, such as East German or Soviet art, urban development, and even traffic planning.

What do the digitisation and digital opening of a collection actually look like?

For the digitisation, as for all aspects of the project, we were looking for the most effective way. Because we were interested in the slides not only as depictions of something but also as objects (with their inscriptions, traces of usage, and general state), we developed a photographic process that captures the frame and the transparent picture, lit from behind, in a single shot. We use a classic copy stand with lighting from above and below and a high-resolution digital camera (36 MP). This allowed us to digitise the approximately 56,000 slides in less than a year and put them online, for the time being largely without metadata. The only metadata at this point are general systematic categories that say something about the picture’s content (e.g. Italy, painting, Renaissance). Through the index of the slides’ storage system, the digital representations can be queried more specifically.

Scientifically opening a collection of photos or slides that depict pieces of art normally means describing both the art depicted and, if it is historic, the photo or slide itself. This is done using specific standards that prescribe mandatory fields and terminologies. However, describing the pieces of art on our slides turned out to be an unnecessary burden. As they are all well-known or even famous pieces of art, we can presume that they are already described elsewhere. Whichever external descriptions of the pieces of art we choose, it made sense to us to connect them through authority file identifiers. In the case of a person, you would choose an established authority file like the GND (Integrated Authority File of the German National Library) or go straight for the meta-identifiers of VIAF (Virtual International Authority File). For pieces of art or architecture, there is no sufficiently comprehensive authority file. The GND is far too incomplete in this field. Thus we saw Wikidata as a new solution. At least in some areas it has reasonably complete collections of pieces of art and architecture.

Wikidata also offers other advantages. If something is missing, any user can add a new item. Wikidata items contain further data and are linked to at least one Wikipedia article. Because the data are structured as statements, every item is connected to an ever-growing web of knowledge whose content can also be used in the future. Nevertheless, we still have to do some descriptive work ourselves: all the properties of the slide, including the inscriptions on it, have to be documented for every single object. We can imagine a future workflow in which transcribed inscriptions are matched with Wikidata items and the proposed matches are approved manually.
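Editorial illustration: the matching workflow imagined above could be sketched roughly as follows. This is not the project's own code. It assumes the public Wikidata search endpoint (`wbsearchentities`); the function names, the User-Agent string, and the `approve` callback standing in for a human reviewer are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(inscription, language="en", limit=5):
    """Build a wbsearchentities request URL for a transcribed inscription."""
    params = {
        "action": "wbsearchentities",
        "search": inscription,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def extract_candidates(response):
    """Reduce a decoded API response to (id, label, description) tuples."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in response.get("search", [])
    ]


def propose_matches(inscription, language="en"):
    """Ask Wikidata for candidate items (requires network access)."""
    request = Request(
        build_search_url(inscription, language),
        # Hypothetical User-Agent; replace with real contact details.
        headers={"User-Agent": "durchblick-sketch/0.1 (example)"},
    )
    with urlopen(request) as reply:
        return extract_candidates(json.load(reply))


def review(candidates, approve):
    """Keep only the candidates that the reviewer callback approves."""
    return [candidate for candidate in candidates if approve(candidate)]
```

In such a setup, `propose_matches` would be called once per transcribed inscription, and `review` would be wired to whatever interface the person approving the matches uses.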

Your project was awarded a prize. What kind of prize is it, and how were you chosen? And what actually are the “digital humanities”?

It makes sense to start with the last question. There are different definitions of digital humanities. In a broader sense, the term applies to every methodical use of computers in the humanities. Pure digitisation or the use of office programs would fall outside the scope of the definition, but processing digitised objects and preparing them with metadata may very well fall within it, especially if the data are used in research. In a narrower sense, we understand digital humanities as the analysis and compilation of text corpora. This is why we were happy to learn that our project, which deals with tangible objects and pictures and looks for the most efficient way of scientifically opening them, was awarded a prize.

The Prize for Digital Humanities has been awarded annually since 2015 by the interdisciplinary research network for digital humanities to innovative projects in that field. The jury consists of computer scientists, information scientists, and scientists from other fields. The jury may have been led to its decision by the assessment that we will see even more of Wikidata and GLAM in the future.

What is your take on Wikidata’s data quality, its documentation and data access? Are there things we could improve?

Of course the quality of the data varies a lot. There are incorrect statements as well as inconsistent uses of properties. For instance, it does not seem to be possible to search for self-portraits, because this is rarely used as a category, even though several of Rembrandt’s self-portraits have items of their own. In my opinion, Wikidata (and Wikipedia, too) works best when it comes to comparatively hard facts or references to other resources: a person’s date of birth, geographic coordinates, membership in institutions, and so on. But first and foremost it is about the identifiers contained in a Wikidata item. Of course it would be advantageous if we did not have to look up the artist, the size of the painting, the style, the current museum location or literature references for every piece of art we link to Wikidata. But for now it is much more important for us to identify objects unambiguously, for instance Rembrandt’s self-portraits as the Apostle Paul (https://www.wikidata.org/wiki/Q2267759) or as Zeuxis (https://www.wikidata.org/wiki/Q2267594), which would otherwise be hard to describe in an unambiguous way.

So far our work is primarily manual. That means that, based on the inscriptions on the slides, we search for matching Wikidata items and then use their identifiers. Simpler ways to search would be helpful (the Query Service with its Query Helper is a good start). For the time being, Google search (which corrects misspelled words) is still the most effective way. Wikipedia pages typically rank very highly, and from there it is easy to get to the Wikidata ID.
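Editorial illustration: the last step, getting from a Wikipedia page to its Wikidata ID, can also be done programmatically. The following is a minimal sketch, assuming the standard MediaWiki `pageprops` query, in which the linked item is exposed as `wikibase_item`. The function names are hypothetical, and the request itself is left to the reader's HTTP client of choice.

```python
from urllib.parse import urlencode


def build_pageprops_url(title, wiki="en"):
    """Build a MediaWiki API URL asking for the Wikidata item of a page."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": title,
        "redirects": 1,
        "format": "json",
    }
    return f"https://{wiki}.wikipedia.org/w/api.php?" + urlencode(params)


def extract_item_id(response):
    """Return the first Wikidata ID found in a decoded response, or None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        item_id = page.get("pageprops", {}).get("wikibase_item")
        if item_id:
            return item_id
    # Missing pages and pages without a linked item end up here.
    return None
```

A page that does not exist, or that has no linked item, yields `None`, which is the case where a new Wikidata item might have to be created.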

In cases where a piece of art is missing (even for the famous Rembrandt self-portraits, Wikidata is probably not complete yet), we rarely add it manually. To do this in a more orderly fashion, it would be helpful to have templates tailored to GLAM needs (e.g. a template for quickly adding an item for a painting, a sculpture, or a monument); the Quick Statements tool is simply not easy enough. It is also conceivable to upload whole catalogues and other sets of works to Wikidata. Here, however, I see a growing problem: how to avoid duplicates. The more pieces of art Wikidata contains, the higher the probability that part of a mass upload from a GLAM database is already included and creates duplicates.

Another aspect would be the re-use of data from Wikidata. For now we do not systematically re-use Wikidata data. However, we are planning to pull in part of the general information on the pieces of art via the Wikidata identifier, either on the fly or through regular data imports from Wikidata. As I said above, we would do this so as not to repeat things that are already known and recorded elsewhere.
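Editorial illustration: re-using general information via the Wikidata identifier could look roughly like the sketch below. It assumes the public `Special:EntityData` JSON export and the Wikidata properties P170 (creator) and P276 (location). The function names, the choice of properties, and the User-Agent string are hypothetical and not taken from the project.

```python
import json
from urllib.request import Request, urlopen

ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"

# Assumed selection of properties: P170 = creator, P276 = location.
PROPERTIES = {"creator": "P170", "location": "P276"}


def entity_data_url(qid):
    """Return the JSON export URL for a Wikidata item ID."""
    return ENTITY_DATA.format(qid=qid)


def summarize(document, qid, language="en"):
    """Pick the label and the linked item IDs for the selected properties."""
    # Note: for redirected items the key in "entities" may differ from qid.
    entity = document["entities"][qid]
    label = entity.get("labels", {}).get(language, {}).get("value")
    summary = {"id": qid, "label": label}
    for name, prop in PROPERTIES.items():
        targets = []
        for statement in entity.get("claims", {}).get(prop, []):
            # Statements with "no value" or "unknown value" have no datavalue.
            value = statement["mainsnak"].get("datavalue", {}).get("value")
            if isinstance(value, dict) and "id" in value:
                targets.append(value["id"])
        summary[name] = targets
    return summary


def fetch_summary(qid, language="en"):
    """Fetch and summarize an item on the fly (requires network access)."""
    request = Request(
        entity_data_url(qid),
        # Hypothetical User-Agent; replace with real contact details.
        headers={"User-Agent": "durchblick-sketch/0.1 (example)"},
    )
    with urlopen(request) as reply:
        return summarize(json.load(reply), qid, language)
```

Calling `fetch_summary("Q2267759")` at display time would correspond to the on-the-fly variant; running it in a scheduled batch and storing the results would correspond to regular imports.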

What does the path ahead for Wikidata and GLAM look like? Can Wikidata truly become something like a meta-vocabulary for collections?

I find it hard to describe the path ahead in general terms. But I can imagine the classic cultural heritage, i.e. everything that is on the walls of museums or on monument lists, being completely included in Wikidata at the very least. I’m sure other communities like ethnology or cultural studies can propose many other items as well. With an extensive cultural corpus, Wikidata could become a central, if not global, reference for cultural assets. Where authority files are organised either nationally (national libraries, national registers of cultural assets) or by item type (museums), Wikidata could become a meta authority file that contains all the other identifiers. It is possible that it could also become some kind of vocabulary if pieces of art are tagged with Wikidata. That would cover terms and concepts such as art genres (“painting”, “portrait”), techniques (“oil painting”), artistic styles (“Baroque”) and similar things. However, vocabularies depend heavily on their internal classification (how terms are related to each other, e.g. “painting > landscape painting > mountain landscape”), so I expect classifications like Iconclass or AAT to remain dominant for the time being.

I am much less sure whether Wikidata could play a central role for the statements about the pieces of art themselves. Its data model is fundamentally suitable for that, but the general desire for certified and original information will probably lead people to look for information on works (and the digitised objects of these works) at the museums. Wikidata’s function, however, would be to link there.

Another scenario would be that Wikidata becomes a repository for information that is not documented elsewhere — Wikipedia already plays that role to a certain degree.

All in all, I expect Wikidata to have a powerful impact on the world of GLAM. One of the largest museums of cultural history, the British Museum in London, is in the process of developing a documentation system (ResearchSpace) that is based on the same software as Wikidata itself (Metaphacts), which is a clear sign that GLAM and Wikidata are moving in a common direction.
