4 Data Linking

The internal data are linked to various external data sources.

Main database table el_matches. Idea is to just dump everything there in a first step.

  • additional_data is a json column for maximal flexibility
CREATE TABLE `el_matches` (
  `created_at` timestamp NULL DEFAULT current_timestamp(),
  `updated_at` timestamp NULL DEFAULT NULL ON UPDATE current_timestamp(),
  `entity_id` varchar(55) NOT NULL,
  `entity_id_type` varchar(255) DEFAULT NULL,
  `entity_id_combination` varchar(255) DEFAULT NULL,
  `entity_id_combination_type` varchar(255) DEFAULT NULL,
  `external_id` varchar(255) DEFAULT NULL,
  `external_id_label` varchar(255) DEFAULT NULL,
  `external_id_type` varchar(55) NOT NULL,
  `external_id_desc` varchar(255) DEFAULT NULL,
  `source` varchar(255) DEFAULT NULL,
  `source_id` longtext DEFAULT NULL,
  `property_type` varchar(255) DEFAULT NULL,
  `additional_data` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL CHECK (json_valid(`additional_data`))

4.1 GND/lobid

I want to link the entities, like a book author or publisher, to the external identifier of the German National Library (GND). The easiest way seems to be calling the lobid API. It returns a json with meta data, e.g. contributors or topics that often, but not always, have a GND id.

I filter the json file by using jq, a json processor, that has a wrapper in R: jqr.

4.1.1 Output

  • DB tbl: lgbtiq_kg.el_matches
  • beware: it’s not cleaned data, so not in “clean” database lgbtiq_kg_clean

4.1.2 Examples for jq commands

  • getting contributors: .member[].contribution[]? | {type: .agent?.type[], label: .agent?.label, role:.role?.id, gnd_id: .agent?.gndIdentifier}
  • getting topics: .member[].subject[].componentList[]? | {gnd_id: select(.type?).gndIdentifier, type: select(.gndIdentifier?).type[], label: select(.type?).label}
  • getting lobid ressource id: .member[] | {source_id: .id}
  • more complex with optional fields: .member[] | {id: ((select(.sameAs) | {sameAs}) // null), gnd_subject_category: .gndSubjectCategory, placeOfBusiness: .placeOfBusiness}

4.1.3 Basic workflow

  • Most sophisticated script: data-linking/lobid/04-search-via-publisher.R
    • serves for templates from now on
  • search for isbn
    • the most precise way, because it’s already a persistent id
    • data-linking/lobid/01-search-via-isbn.R
    • script first version
  • search for author name and title in the relevant objects
    • query like contribution.agent.label:AUTHOR+AND+title: TITLE&format=json")
    • data-linking/lobid/02-search-via-author-title.R
  • search just by name
    • data-linking/lobid/03-search-via-author.R
  • search for publishers as corporate bodies: data-linking/lobid/04-search-via-publisher.R

4.2 VIAF

To VIAF by using the viafr R-package. It suggest a VIAF entity when providing a string.

  • db table: el_viaf_books_authors