# CityLIS Student Perspectives: No Data is an Island

Student Perspectives is our series of guest posts written by current CityLIS students.

This post is by current CityLIS student Bethany Sherwood. Bethany discusses the implications of the Semantic Web for interpreting information when the context in which it is presented is not always clear.

***

There are two things that the past few weeks of DITA reading have made me more certain of: resource description is never a neutral activity, and you should always be suspicious when Google gets involved.

The non-neutrality of resource description is pretty much agreed on when we’re talking about subject headings (see Simon Barron’s essay on the hidden politics of collection management for an excellent explanation of this [1]), but I’ve become increasingly aware of a similar tension at work in the formats of bibliographic description, and especially in the case of Linked Data.

Linked Data is the manifestation of the vision for a connected Web that Tim Berners-Lee, James Hendler and Ora Lassila put forward in their 2001 article ‘The Semantic Web’. The Semantic Web is a vision for increased machine readability on the Web: structuring the content of Web pages so that the relationships and meaning of their data can be read and, crucially, “understood” by computers.

In practice, this looks like the structured markup of Web pages in XML and RDF, with RDF triples as the method by which meaning is encoded:

“In RDF, a document makes assertions that particular things (people, Web pages or whatever) have properties (such as “is a sister of”, “is the author of”) with certain values (another person, another Web page).” (p.2) [2]
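To make that concrete, here’s a minimal sketch of such an assertion using Python’s rdflib library. The people, properties and URIs are invented for illustration, not drawn from any real dataset:

```python
from rdflib import Graph, Namespace, URIRef

# A hypothetical namespace standing in for real-world resources.
EX = Namespace("http://example.org/")

g = Graph()
# Each triple is a thing, a property, and a value --
# exactly the pattern Berners-Lee et al. describe.
g.add((EX.alice, EX.isSisterOf, EX.bob))
g.add((EX.alice, EX.isAuthorOf, URIRef("http://example.org/a-web-page")))

# Turtle is one human-readable serialisation of the same assertions.
print(g.serialize(format="turtle"))
```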

These assertions run in the background of Web pages, not intended for human reading but telling machines what things mean. This is the semantics of the Semantic Web. It is also where I start to get suspicious. Words are fundamentally ambiguous beings: we understand what they mean based on the context they appear in (their semantic relationships to other words).

The Semantic Web is not a contextually neutral space. RDF triples are not simple statements of fact. They’re human-made assumptions about human-written text, encoded in human-designed markup languages. They involve decisions, omissions and consequences that aren’t immediately visible in an end-user interface.

As MARC becomes increasingly outdated and the demand for openness becomes unignorable, libraries will need to open up their closed systems to the promise of the Semantic Web and the formats of Linked Data. As Roy Tennant points out in his article ‘MARC Must Die’, libraries’ choice of software that can read MARC is limited to a small market; ‘meanwhile, the wider information technology industry is moving wholesale to XML as a means to encode and transfer information.’ [3]

If libraries want their data to be included in the Semantic Web, they ultimately need to adopt new bibliographic standards and formats. This in turn forces the questions: who gets to set these standards, and who gets to contribute to them? Whose assumptions and semantics will be encoded into our Linked Data?

The Schema.org vocabulary, created by Google, Bing and Yahoo! in 2011 for structured data markup on web pages, isn’t a bibliographic standard, but it is a format of linked data that wants library data. Schema.org describes itself as ‘a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet’. The phrasing is one of “community”, but the project is ultimately owned and driven by the search engines represented on its Steering Group.
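To give a sense of what that markup looks like in practice, here’s a sketch of a catalogue item described with the Schema.org vocabulary, again using rdflib; the book, author and URI are invented:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")

g = Graph()
g.bind("schema", SCHEMA)

# A hypothetical catalogue record.
book = URIRef("http://example.org/catalogue/record/1")
g.add((book, RDF.type, SCHEMA.Book))
g.add((book, SCHEMA.name, Literal("An Invented Title")))
g.add((book, SCHEMA.author, Literal("A. N. Author")))

# JSON-LD is the form typically embedded in web pages
# (built into rdflib from version 6 onwards).
print(g.serialize(format="json-ld"))
```

Even in a sketch this small, decisions have been made: which properties are worth stating, and which are left out.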

Schema.org contributes to Google’s Knowledge Graph, a development that has been subtly changing the way we google since its release in 2012. The Knowledge Graph uses semantic search to deliver information scraped from linked data, building results around inferences about what the user wants to know next. It’s the system that infers that if you search for the film title “Mean Girls” you’ll likely also be interested in other films about the politics and anguish of adolescence, such as 10 Things I Hate About You. The Knowledge Graph draws on data from Wikidata, an information source that is fundamentally crowd-based but no less subject to influence from interested parties (Google included, which in 2014 announced the transfer of data from its Freebase into Wikidata). The Knowledge Graph is not neutral information, although it is increasingly treated as such.

In their research into the influence of the Semantic Web on urban politics, Heather Ford and Mark Graham point out how the Knowledge Graph is distancing us from the sources of the data it presents:

“because of the ease of separating content from containers, the provenance of data is often obscured. Contexts are stripped away, and sources vanish into Google’s black box. For instance, most of the information in Google’s infoboxes on cities doesn’t tell us where the data is sourced from.” [4]
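Ironically, the underlying formats can record exactly this context. RDF named graphs let every assertion be stored under an identifier naming its source; it is the interface on top, not the data model, that throws that information away. A minimal sketch, with invented URIs and a deliberately contested claim (see [4]):

```python
from rdflib import Dataset, Namespace, URIRef

EX = Namespace("http://example.org/")

ds = Dataset()
# A named graph whose identifier records where the claims came from
# (a hypothetical source URI, standing in for e.g. a Wikidata dump).
source = ds.graph(URIRef("http://example.org/source/some-dataset"))
source.add((EX.Jerusalem, EX.capitalOf, EX.Israel))

# TriG serialisation keeps the graph name -- the provenance -- attached.
print(ds.serialize(format="trig"))
```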

In Lent term this year I was helping an English undergraduate with her dissertation bibliography. One of the references she was puzzling over was a screenshot of Google search suggestions. Unsurprisingly, the MHRA guidelines offered little help with correctly citing the outputs of an algorithm. The Knowledge Panel produces the same dilemma: as data becomes increasingly linked, the lines between data sources are blurred.

I don’t think I’m being overly cynical to be concerned about linked data and its implications for the future of bibliographic description. The search engines invested in Schema.org are ultimately commercially driven and will manipulate the data at their disposal to that end. Carol Jean Godby points this out in her description of the experiments with Schema.org undertaken by OCLC, a cooperative organisation whose member libraries contribute the data that makes up WorldCat: ‘OCLC researchers were also skeptical of Schema.org because the vocabulary seemed too focused on commercial products, which overlap only partially with the curatorial needs of libraries.’ (p.77) [5]

Despite this skepticism, OCLC came to conclude that ‘a sharable semantics from a group of influential organisations with a commercial incentive could only be interpreted as a positive development for the Semantic Web.’ (p.78) Godby is positive about library engagement with Schema.org, presenting it as part of a natural progression of data sharing between data providers and search engines:

“Schema.org is the latest incentive, and as in earlier solutions such as the HTML tag or microdata, the promised return is a Rich Snippet, a Knowledge Card, and generally greater visibility in the marketplace where users now begin their quest for information.” (p.78)

This statement is footnoted with a reference to Google blog posts: “Promote Your Content With Structured Data Markup” and “Introducing the Knowledge Graph”. OCLC buys into Google’s conceptualisation of the data landscape as a market; the promised benefit for libraries is inclusion in the Knowledge Graph, which Godby considers a good exchange for access to their data.

I don’t think library use of Schema.org is a bad thing, and I don’t want this to end up as a Google-is-the-root-of-all-evil kind of blogpost, but I do think Google’s influence on Schema.org means there is a lot at stake in which format libraries choose to replace MARC. The same issue of underlying assumptions is at work in every data format, including the Library of Congress’ BIBFRAME, which will also encode data based on the assumptions of its creators.
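As a sketch of how differently two vocabularies can carve up the same object (titles and URIs invented, and heavily simplified), compare Schema.org’s single product-like Book with BIBFRAME’s split between an abstract Work and the physical Instance that realises it:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")
BF = Namespace("http://id.loc.gov/ontologies/bibframe/")

g = Graph()
g.bind("schema", SCHEMA)
g.bind("bf", BF)

# Schema.org: one product-like entity.
book = URIRef("http://example.org/book/1")
g.add((book, RDF.type, SCHEMA.Book))
g.add((book, SCHEMA.name, Literal("An Invented Novel")))

# BIBFRAME: the same resource modelled as an abstract Work
# realised by a concrete Instance.
work = URIRef("http://example.org/work/1")
instance = URIRef("http://example.org/instance/1")
g.add((work, RDF.type, BF.Work))
g.add((instance, RDF.type, BF.Instance))
g.add((instance, BF.instanceOf, work))

print(g.serialize(format="turtle"))
```

Neither model is wrong; each encodes a different answer to the question of what a ‘book’ is.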

As Linked Data begins to change the way we search for and receive information, as the Knowledge Graph is already proving it will, it’s going to change the context of the data, both in terms of the semantic relationships encoded within it and where a person receives it. There’s no content without context, and no context is neutral. If the context of library data is likely to become the Knowledge Panel or a commercially driven vocabulary, I think it’s worth being a little suspicious. No data is an island: structurally, semantically, or contextually.

[1] Barron, Simon, ‘The Hidden Politics of Collection Management’, SimonXIX, 2017 [accessed 12 November 2017]

[2] Berners-Lee, Tim, James Hendler, and Ora Lassila, ‘The Semantic Web’, Scientific American, 2001 [accessed 19 November 2017]

[3] Tennant, Roy, ‘MARC Must Die’, Library Journal, 2002 [accessed 19 November 2017]

[4] Graham, Mark, ‘Why Does Google Say Jerusalem Is the Capital of Israel?’, Slate, 30 November 2015 [accessed 12 November 2017]

See also: Ford, Heather, and Mark Graham, Semantic Cities: Coded Geopolitics and the Rise of the Semantic Web (Rochester, NY: Social Science Research Network, 28 October 2015) [accessed 19 November 2017]

[5] Godby, Carol Jean, ‘A Division of Labor: The Role of Schema.Org in a Semantic Web Model of Library Resources’, in Linked Data for Cultural Heritage, ed. by Ed Jones and Michele Seikel (London: Facet Publishing, 2016), pp. 73–103
