<p><em>The digitised past</em>: a blog by Glen Robson investigating different use cases around IIIF and other web technologies to help explore the past through historical documents and objects.</p>
<h1 id="nlw-journals-writeup">NLW Journals Writeup (4 June 2021)</h1>
<p><a href="https://journals.library.wales">NLW Journals</a> write-up from 2017. I often find that I link to this article but the original isn’t always available, so I have copied it here for posterity. The original is available on the <a href="https://dev.llgc.org.uk/wiki/index.php?title=IIIF_Journals">NLW Dev Site</a>.</p>
<h1 id="welsh-journals-technical-writeup">Welsh Journals Technical Writeup</h1>
<p>Authors:</p>
<ul>
<li>Glen Robson - Head of Systems,</li>
<li>Dan Field - Head of Development,</li>
<li>Dylan Jones - Senior Web Developer,</li>
<li>Kim Botticelli - Senior Web Developer</li>
</ul>
<p>This document describes how the National Library of Wales (NLW) built the new Journals website, <a href="https://journals.library.wales">https://journals.library.wales</a>. The new website brings together two collections of Welsh Journals and also provides access to content that hasn’t previously been available. The content in the new website comprises:</p>
<ul>
<li>Total titles: 479</li>
<li>Total new titles: 395</li>
<li>Total pages: 1,226,816</li>
<li>Total new pages: 669,223</li>
</ul>
<p>The development was done on top of the NLW IIIF infrastructure, including harvesting using sitemaps and indexing the IIIF Manifests and Annotation Lists.</p>
<h2 id="background">Background</h2>
<p>The NLW has worked on two Journal projects. The first project was funded by JISC digitising 50 modern Welsh Journals. The second project was funded by the Welsh Government and the European Regional Development fund to digitise over one million pages of Journal content. The content of the Journals ranges from academic and scientific publications to literary and popular magazines. Further details on the collection can be seen at:</p>
<p><a href="https://www.llgc.org.uk/index.php?id=7594">https://www.llgc.org.uk/index.php?id=7594</a></p>
<p>The JISC-funded modern Welsh Journals project was undertaken around 2008 and included content that was under copyright, so authors and publications had to be identified and their rights status established. As part of the digitisation workflow, articles were identified by noting the article start and end page. The project website was launched around 2009/10.</p>
<p>The second project learned a number of lessons from the first, particularly in what metadata was captured and in the workflow for making items available. For this project, journals that were out of copyright were chosen, and article metadata wasn’t captured as part of the digitisation process. 36 titles were shared with Europeana and made available on a temporary website while development was undertaken to merge the two datasets. The project included digitising 425 titles and around 1 million pages.</p>
<h2 id="iiif-interface-to-the-data">IIIF interface to the data</h2>
<p>The Journal pages were scanned to TIFF files and a number of derivatives were created to support the previous website, including a 650px image, a 50% image and a PFF zoomable image. The OCR was stored as ALTO and also as Abbyy FineReader XML. For the set of Journals shared with Europeana the access file was JPEG 2000, and the NLW had undertaken a migration to replace the previous images with a single JPEG 2000. After delivery of the Europeana project the NLW developed a PFF-to-IIIF image server, so converting the images not covered by this project was no longer necessary. Now both JP2 and PFF images appear in the website as IIIF images and the differing source formats are no longer an issue. Using the IIIF Image API has also made the 650px and 50% images redundant.</p>
<p>The metadata for the Journals is stored in the NLW Fedora repository as METS documents, and these were converted to IIIF Manifests and Collections as shown in the diagram below. The IIIF standard promotes using a seeAlso link to point to structured metadata, and we converted our MODS metadata to EDM. The Journals that are part of the project are listed in a sitemap, and this is what we used for indexing.</p>
<p><img src="http://dev.llgc.org.uk/iiif/IIIF_Journals_Technical_ARC.jpg" alt="Technical infrastructure diagram" /></p>
<h3 id="sitemap">Sitemap</h3>
<p>An example of the sitemap we used is below and includes live links to IIIF collections for Public Domain Titles:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:iiif="http://iiif.io/api/presentation/2/">
<url>
<iiif:manifest>http://dams.llgc.org.uk/iiif/2.0/2000001/manifest.json</iiif:manifest>
</url>
<url>
<iiif:manifest>http://dams.llgc.org.uk/iiif/2.0/2035242/manifest.json</iiif:manifest>
</url>
...
<url>
<iiif:manifest>http://dams.llgc.org.uk/iiif/2.0/2007313/manifest.json</iiif:manifest>
</url>
<url>
<iiif:manifest>http://dams.llgc.org.uk/iiif/2.0/2007768/manifest.json</iiif:manifest>
</url>
</urlset>
</code></pre></div></div>
<p>There wasn’t time to add last-modification dates to the sitemap, but these would have been very useful in enabling the indexer to pick up when a manifest changed. The process of re-indexing was instead managed through an issue-tracking system, which became complicated to manage at times.</p>
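<p>The harvesting step can be sketched as follows. This is an illustrative helper written for this post, not the NLW indexer itself; the namespace URIs are taken from the example sitemap above:</p>

```python
# Minimal sketch of harvesting manifest URLs from the IIIF-extended sitemap
# shown above. parse_sitemap() is a hypothetical helper, not NLW code.
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "iiif": "http://iiif.io/api/presentation/2/",
}

def parse_sitemap(xml_text: str) -> list[str]:
    """Return every <iiif:manifest> URL listed in the sitemap."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.findall(".//iiif:manifest", NS) if el.text]

example = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:iiif="http://iiif.io/api/presentation/2/">
  <url><iiif:manifest>http://dams.llgc.org.uk/iiif/2.0/2000001/manifest.json</iiif:manifest></url>
</urlset>"""

print(parse_sitemap(example))  # one manifest URL in this cut-down example
```

An indexer would then fetch each returned URL in turn; with lastmod dates in the sitemap, it could skip manifests that haven’t changed.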
<h3 id="europeana-data-model-edm-metadata">Europeana Data Model (EDM) Metadata</h3>
<p>An example EDM record for a title can be seen below:</p>
<p><a href="http://dams.llgc.org.uk/behaviour/llgc-id:2000001/fedora-sdef:rdf/toRDF">http://dams.llgc.org.uk/behaviour/llgc-id:2000001/fedora-sdef:rdf/toRDF</a></p>
<p>Issue level manifests also contain links to EDM but it was only the title level that was indexed for this project. The EDM record provided the following information in the SOLR record:</p>
<ul>
<li>Publication title - dc:title</li>
<li>Publisher details - dc:publisher (Welsh and English)</li>
<li>Language of Journal - dc:language</li>
<li>Description of Journal - dc:description (Welsh and English)</li>
</ul>
<p>Frequency of publication was added later to the EDM.</p>
<h3 id="articles">Articles</h3>
<p>The original Welsh Journals project contained article information, and this has been encoded using IIIF Ranges; an example can be seen at:</p>
<p><a href="http://dams.llgc.org.uk/iiif/2.0/1097087/manifest.json">http://dams.llgc.org.uk/iiif/2.0/1097087/manifest.json</a></p>
<p>The ranges contain an article title, author and type although only the title is shown through the Universal Viewer. The Universal Viewer displays these ranges in the index panel and this manifest can be seen in isolation at:</p>
<p><a href="https://viewer.library.wales/1097087">https://viewer.library.wales/1097087</a></p>
<p>When viewing this manifest in the Journals website, the Universal Viewer shows the current issue in the context of the Journal’s run of issues:</p>
<p><a href="https://journals.library.wales/view/1093205/1097087">https://journals.library.wales/view/1093205/1097087</a></p>
<p>In the index section you can view the issues either by label (from the Journal Title IIIF Collection) or by navDate.</p>
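<p>Reading the article Ranges out of a manifest is straightforward. A minimal sketch in the Presentation 2.x shape follows; the manifest fragment is hand-made for illustration (it is not the real 1097087 manifest) and <code>list_articles</code> is a hypothetical helper, not Universal Viewer code:</p>

```python
# Sketch: pull article Ranges from a IIIF Presentation 2.x manifest, as a
# viewer would for its index panel. The manifest dict is illustrative only.
def list_articles(manifest: dict) -> list[str]:
    """Return the labels of the ranges (articles) in a v2 manifest."""
    return [r.get("label", "") for r in manifest.get("structures", [])
            if r.get("@type") == "sc:Range"]

manifest = {
    "@id": "http://example.org/iiif/manifest.json",
    "structures": [
        {"@type": "sc:Range", "label": "Editorial",
         "canvases": ["http://example.org/canvas/1"]},
        {"@type": "sc:Range", "label": "Article on Welsh poetry",
         "canvases": ["http://example.org/canvas/2"]},
    ],
}
print(list_articles(manifest))  # ['Editorial', 'Article on Welsh poetry']
```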
<h3 id="missing-pages">Missing Pages</h3>
<p>During the processing of the second Journals project the NLW developed a tool which allowed the re-arrangement of images and also the addition of placeholders for missing pages, where the original pages weren’t available for digitisation. These pages are noted in the METS document, and in the generated IIIF manifests they appear as canvases without a IIIF Image service. The following, picked up during testing, is a slightly extreme example: we believe that only one image (page 16) was available for this issue, hence the pages before it are marked as missing. The more common case is a single missing image in a sequence:</p>
<p><a href="https://viewer.library.wales/2886549">https://viewer.library.wales/2886549</a> (<a href="http://dams.llgc.org.uk/iiif/2.0/2886549/manifest.json">http://dams.llgc.org.uk/iiif/2.0/2886549/manifest.json</a>)</p>
<h3 id="broken-images">Broken Images</h3>
<p>During the image validation we undertook during development, we found six images which required attention. Some of these could be regenerated from the original TIFF file, but some will require further work after the project has completed. One JP2 in particular passes JHOVE validation and jpylyzer analysis but still doesn’t work with IIPImage, and will have to be reported. To create a valid manifest we treated these errors like the missing pages above and marked them as missing. See page 35 of the following:</p>
<p><a href="https://viewer.library.wales/2161205">https://viewer.library.wales/2161205</a> (<a href="http://dams.llgc.org.uk/iiif/2.0/2161205/manifest.json">http://dams.llgc.org.uk/iiif/2.0/2161205/manifest.json</a>)</p>
<h3 id="rights-protected-images">Rights protected Images</h3>
<p>In the original Welsh Journals project there are a number of pages which are protected for copyright reasons. Some images have been blanked, but others are protected by the Fedora rights management process; these images return a 401 when requested. We looked into how this would work in the Universal Viewer and whether it was a use case for the IIIF Authentication API. The following manifest responds to 401s in two different ways:</p>
<p><a href="https://viewer.library.wales/1272050">https://viewer.library.wales/1272050</a> (<a href="http://dams.llgc.org.uk/iiif/2.0/1272050/manifest.json">http://dams.llgc.org.uk/iiif/2.0/1272050/manifest.json</a>)</p>
<p>Pages 162 and 163 return a 401 if the info.json is requested, with a custom error message in Welsh and English. If you request these images using curl you get the following response:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HTTP/1.1 401 Nid yw'r hawliau gennym i arddangos y deunydd hwn. - We do not have the rights to display this material.
</code></pre></div></div>
<p>This works in the Universal Viewer, which displays the message in an information box. The arguably more correct approach is used on pages 166 to 177, where the info.json advertises a rights service with a description saying the content is unavailable. Unfortunately this doesn’t currently work in the UV, and more experimentation is needed to see what needs to be in the info.json to make this use case work. The experimental info.json for page 166 is:</p>
<p><a href="https://damsssl.llgc.org.uk/iiif/2.0/image/1272222/info.json">https://damsssl.llgc.org.uk/iiif/2.0/image/1272222/info.json</a></p>
<h3 id="validation">Validation</h3>
<p>One of the things we found most useful during the development of the website was to have a validation script that would go through the sitemap and validate all of the IIIF collections and manifests and identify missing fields that were considered ‘mandatory’ for the project. The validator for this project checked the following:</p>
<ul>
<li>All collections and manifests were valid JSON and existed in the cache</li>
<li>All Journal titles and Manifests had valid navDates - there were a handful of issues which didn’t have dates and even after investigation dates couldn’t be assigned.</li>
<li>All Journal titles had descriptions and frequency information (not all missing frequencies could be fixed by the end of the project) in the EDM</li>
<li>All Issues had a search api pointer</li>
<li>All images responded to at least an info.json request</li>
</ul>
<p>This validator was run regularly, and on a daily basis towards the end of the project. The navDates were created by converting textual date strings, e.g. ‘Nov 1852’ or ‘Spring 1924’, into ISO dates, so the validation picked out problems with this conversion. It also picked out mistakes in the data, like 30th Feb 2001.</p>
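<p>A sketch of that conversion. The real NLW converter isn’t published, so the month and season mappings below (e.g. Spring to March) are assumptions, and <code>to_navdate</code> is an illustrative helper; impossible dates raise an error, which is the class of problem the validator caught:</p>

```python
# Sketch of converting textual dates like 'Nov 1852' or 'Spring 1924' into
# ISO navDates. The season-to-month table is an assumption for illustration.
import re
from datetime import datetime

SEASONS = {"spring": 3, "summer": 6, "autumn": 9, "winter": 12}

def to_navdate(text: str) -> str:
    """Convert 'Nov 1852', 'Spring 1924' or '30th Nov 1852' to a navDate."""
    parts = text.lower().replace(",", " ").split()
    year = int(parts[-1])
    month, day = 1, 1
    if len(parts) >= 2:
        token = parts[-2]
        month = SEASONS.get(token) or datetime.strptime(token[:3], "%b").month
    if len(parts) >= 3:
        day = int(re.sub(r"(st|nd|rd|th)$", "", parts[-3]))
    # datetime() rejects impossible dates such as 30th Feb 2001
    return datetime(year, month, day).strftime("%Y-%m-%dT00:00:00Z")

print(to_navdate("Nov 1852"))     # 1852-11-01T00:00:00Z
print(to_navdate("Spring 1924"))  # 1924-03-01T00:00:00Z
```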
<h2 id="indexing">Indexing</h2>
<p>In order to deliver site content in a timely manner, we indexed the IIIF manifests, EDM and sitemap into a Solr search index. This allows our front-end web developers to work against a simple HTTP search API and receive JSON content for inclusion in the website.</p>
<p>We developed an indexing script which first iterated over the sitemap to collect all of the top-level manifest documents for the project. Each manifest was then processed, and a page-level document was indexed to Solr containing the full text of the journal page along with associated metadata for faceting, e.g. issue date, publication title, and page order within the issue (used to land on the correct page within the Universal Viewer). We also indexed a single ‘publication’ document, which was used for the publication information page.</p>
<p>Page-level documents needed to store the location of each word on the page in pixel co-ordinates. This is stored as ALTO XML but converted on the fly into IIIF Annotations. Storing the annotations as separate records in Solr would have meant storing a lot of redundant information, and it wouldn’t have been possible to make use of Solr’s ranking algorithms, which can rank a page with multiple word matches higher. We devised a cut-down JSON representation of an annotation for storing the string and coordinates, and stored this as a string within Solr. This could be extracted in the IIIF Search API implementation in order to highlight search terms on the page within the Universal Viewer.</p>
<p>Each word is marked up as [x, y, width, height, word], e.g. ["671","178","21","14","cymdeithas"], and all the words on a page are stored in an array, which makes up the JSON object.</p>
<p>We considered a number of alternatives to compressing the annotations into SOLR including:</p>
<ul>
<li>Reading the Annotation List at query time.
We were concerned this might lead to performance problems. We currently show 100 results per page, which could mean 100 Annotation List requests per user per search. It would also take time for the search implementation to process the JSON and find the annotations for each result.</li>
<li>Storing the annotations in a separate annotation store.
This would lead to a similar number of requests as the above, and we would have to develop a similar method of matching the initial query with annotations. We were also concerned about keeping another system alongside Solr up to date.</li>
<li>Indexing the annotations as annotations (as the SimpleAnnotationServer does).
This would produce a large Solr instance containing repeated information, for example over 1 million copies of the same motivation. We would also have to develop something to match searches for phrases, as these would cross annotations if we encoded them at word level. If we encoded them at another level, e.g. line or paragraph, the highlight box in the Universal Viewer would be bigger than it needs to be.</li>
</ul>
<p>An example Solr page document is included below, with comments after the -- symbol.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "Doctype":"page",
"page_coords":"[[\"580\",\"299\",\"163\",\"84\",\"YR\"],[\"773\",\"295\",\"31\",\"84\",\"\"],[\"834\",\"295\",\"861\",\"88\",\"HYFFORDDWR.\"],[\"347\",\"524\",\"104\",\"42\",\"Cyp.\"],[\"473\",\"524\",\"111\",\"54\",\"III.]\"],[\"1000\",\"520\",\"129\",\"48\",\"MAI,\"]]", -- encoded page annotations
"page_text":" YR HYFFORDDWR. Cyp. III.] MAI", -- the above text without coordinate information
"Id":"2005214", -- cutdown page identifier
"Publicationpid":"llgc-id:2004553", -- cut down identifier of title
"publicationtitle":"Yr hyfforddwr", -- title from EDM dc:title
"publicationtitle_facet":"Yr hyfforddwr", -- SOLR processed copy of above for faceting
"publisherdetails_en":"Hughes and Co.", -- from EDM dc:publisher[@xml:lang=en]
"publisherdetails_cy":"Hughes and Co.", -- from EDM dc:publisher[@xml:lang=cy]
"Publicationlanguage":["cy"], -- from EDM dc:language
"publicationdetails_en":"The monthly Welsh language religious periodical of the Campbellite Baptists in Wales. The periodical's main contents were religious articles and poetry. The periodical was edited by the poet and man of letters, John Edwards (Meiriadog, 1813-1906). Associated titles: Yr Hyfforddiadyr (1855); Y Llusern (1858).", -- from EDM dc:description[@xml:lang=en]
"publicationdetails_cy":"Cylchgrawn crefyddol, Cymraeg ei iaith, y Bedyddwyr Campbellaidd yng Nghymru. Prif gynnwys y cylchgrawn oedd erthyglau crefyddol a barddoniaeth. Golygwyd y cylchgrawn gan y bardd a llenor, John Edwards (Meiriadog, 1813-1906). Teitlau cysylltiol: Yr Hyfforddiadyr (1855); Y Llusern (1858).", -- from EDM dc:description[@xml:lang=cy]
"Issuepid":"llgc-id:2005213", -- cut down issue identifier
"issuelabel":"Cyf. III rhif. 5 - Mai 1854", -- from Issue manifest label
"Issueorder":31, -- Important for linking into the Universal Viewer index of issue within the Title (from IIIF Collection)
"isodate":"1854-05-01T00:00:00Z", -- From Issue Manifest
"Issuedate_decade":1850, -- Created from Nav date for facets and front page graphic.
"Issuedate_year":1854, -- Created from Nav date for facet
"Issuedate_month":5, -- Created from Nav date for facet
"Issuedate_day":1, -- Created from Nav date for facet
"Issuedate_weekday":1, -- Created from Nav date for facet
"issuerightsurl":"https://creativecommons.org/publicdomain/mark/1.0/", -- from license field in IIIF manifest
"Pagepid":"llgc-id:2005214", -- cutdown Page identifier
"Pagelabel":"[65]", -- from Issue manifest, canvas label
"Pageorder":2, -- Important for linking into the Universal Viewer, index of canvas in sequence from issue manifest.
"Project":"scif", -- internal project identifier. If it has Ranges this would be ccymod.
"timestamp":"2017-04-07T12:15:48.112Z"} -- indexing timestamp for debugging purposes.
</code></pre></div></div>
<p>Due to the iterative development process of an Agile project like this, we needed to implement caching on the indexing server to reduce stress on the IIIF server. Certain content, like the Annotation Lists, was unlikely to change between iterations, so a gzipped cache of the JSON content was stored locally with the indexer. An MD5 sum of the URL was used to name the file on the filesystem so it could be easily matched on future cache requests. Gzipping kept the file sizes down; the entire cache for the project is only 12GB in size.</p>
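<p>The caching scheme can be sketched as follows. <code>cache_path</code> and <code>cached_get</code> are illustrative names, the <code>fetch</code> callable stands in for the real HTTP request, and a temporary directory is used here where the real indexer used a fixed local one:</p>

```python
# Sketch of the indexer's local cache: the JSON for a URL is gzipped to disk
# under the MD5 of the URL, so repeated indexing runs skip the IIIF server.
import gzip
import hashlib
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # the real indexer used a fixed directory

def cache_path(url: str) -> Path:
    """Name the cache file after the MD5 of the URL for easy matching."""
    return CACHE_DIR / (hashlib.md5(url.encode()).hexdigest() + ".json.gz")

def cached_get(url: str, fetch) -> str:
    """Return the body for url, reading or writing the gzip cache."""
    path = cache_path(url)
    if path.exists():
        return gzip.decompress(path.read_bytes()).decode()
    body = fetch(url)
    path.write_bytes(gzip.compress(body.encode()))
    return body

body = cached_get("http://example.org/manifest.json",
                  lambda u: '{"@type": "sc:Manifest"}')
print(body)  # fetched once, then served from the cache on later calls
```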
<p>As IIIF does not promote storing structured metadata within the manifest, we instead opted to provide access to the metadata in an EDM file linked from the manifests. This enabled us to use existing vocabularies such as Dublin Core, RDF and DCTerms to further mark up the data.</p>
<h2 id="website">Website</h2>
<p>The website is written in PHP and is built on top of Solr. The majority of the searching, and the website front page, is generated directly from queries to Solr. The IIIF manifests and collections come in when you start viewing an item in the Universal Viewer, which reads the IIIF Collection and Manifests to produce an index showing all of the issues of a Journal, and articles if present. The metadata in the right-hand panel is generated from the current issue shown in the central image panel.</p>
<h3 id="supporting-iiif-search-api">Supporting IIIF search API</h3>
<p>We have developed an implementation of the IIIF search API which uses the same SOLR instance as the website and allows searching within the Universal Viewer. It is also used when navigating between the results page into an item. This was a complicated development due to the way we had decided to compress the annotations in SOLR.</p>
<p>One of the issues is that it is very difficult to predict the number of results per IIIF search results ‘page’ (page as in results pagination). As a working example, if you search for a common word like ‘the’, chances are you will get multiple matches per journal page. The IIIF search implementation will query Solr and get back x pages which have at least one occurrence of ‘the’; when the annotations are decompressed, this produces x times the number of occurrences of ‘the’ on each page. It is possible to support paging of the returned Solr records, but this won’t tally with the actual number of results per page. The current implementation cuts off at 100 results, but we have considered whether the Search API specification would allow, and the Universal Viewer support, varying numbers of results per page. This would mean we couldn’t support the startIndex field in the results. The other alternative is to implement some kind of paging cache outside of Solr which keeps track of the results for a query.</p>
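<p>The mismatch can be sketched as follows: decompressing a ‘page’ of Solr results yields an unpredictable number of word-level hits. <code>expand_hits</code> and the document shapes are illustrative, not the production code, though the field names follow the Solr document above:</p>

```python
# Sketch of why IIIF Search result counts are unpredictable: Solr returns one
# document per journal page, but each document expands to one hit per word
# occurrence, with an arbitrary cut-off (100 in the current implementation).
import json

def expand_hits(solr_docs: list[dict], term: str, limit: int = 100) -> list[dict]:
    """Expand page-level Solr docs into per-occurrence highlight hits."""
    hits = []
    for doc in solr_docs:
        for x, y, w, h, word in json.loads(doc["page_coords"]):
            if word.lower() == term.lower():
                hits.append({"page": doc["Id"], "xywh": f"{x},{y},{w},{h}"})
            if len(hits) >= limit:
                return hits
    return hits

docs = [{"Id": "2005214",
         "page_coords": '[["10","10","20","10","the"],["40","10","20","10","the"]]'}]
print(expand_hits(docs, "the"))  # one Solr document, two search hits
```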
<p>The second problem we came across was linking between the results in the website e.g:</p>
<p><a href="https://journals.library.wales/search?query=%22Pryse+Family%22&range%5Bmin%5D=1735&range%5Bmax%5D=2007">https://journals.library.wales/search?query=%22Pryse+Family%22&range%5Bmin%5D=1735&range%5Bmax%5D=2007</a></p>
<p>and if you click on the first result you get to the viewer, but the search words are not highlighted on the page. The reason is that complicated query parameters can be passed to Solr, including OR, AND, NOT and quoted phrases among others (see https://wiki.apache.org/solr/SolrQuerySyntax for full details). These produce hits in the Solr search, but in the IIIF search implementation they then need to be matched against the compressed annotations. OR, AND and NOT are supported, but quoted phrases are currently unsupported.</p>
<h3 id="developments-with-universal-viewer">Developments with Universal Viewer</h3>
<p>Throughout our project we have used a standard URI structure for our document viewers of /[publicationpid]/[issuepid]/[pagenumber], for example:</p>
<ul>
<li><a href="http://newspapers.library.wales/view/3600141/3600147/64">http://newspapers.library.wales/view/3600141/3600147/64</a></li>
<li><a href="https://journals.library.wales/view/1050541/1051974/8">https://journals.library.wales/view/1050541/1051974/8</a></li>
</ul>
<p>To continue following this standard while using the UniversalViewer we had to detect changes in the front end, such as users navigating through issues or pages, and update the URI to reflect them. JavaScript was used to process the hash parameters, to detect when either the m (issue) value or the cv (page) value changed, and to replace the URI segments with the updated values. The issuepid was retrieved from the manifest using the publicationpid.</p>
<p>Originally published at 23:59, 11 April 2017 (BST) on the <a href="https://dev.llgc.org.uk/wiki/index.php?title=IIIF_Journals">NLW Dev Site</a>.</p>
<h1 id="ai-and-iiif-ideas">AI and IIIF Ideas (8 September 2020)</h1>
<p>I am following the <a href="https://course.fast.ai/">Fast AI course</a> and one of the suggestions was to blog as you follow the course. This post documents some ideas for the <a href="https://course.fast.ai/videos/?lesson=2">second session</a>, which looks at creating an image classifier. In the example they use Bing to gather images of three different types of bears (Grizzly, Black and Teddy) and build a machine learning tool that, when supplied with an image of a bear, will tell them the type of bear.</p>
<p>I would like to use IIIF images, but the issue is getting labeled data that I can easily use. Having worked at the National Library of Wales I know their data best. A few options I’ve been thinking about:</p>
<h2 id="lctgm-classifier">LCTGM Classifier</h2>
<p>LCTGM stands for <a href="https://www.loc.gov/rr/print/tgm1/">Library of Congress Thesaurus for Graphic Materials</a> and is a way to add subjects to graphic materials. It has the advantage of splitting the subject heading into discrete parts, namely Topic, Geographic and Temporal. I wonder if it would be possible to give the predictor an image and have it return suitable topics for that image.</p>
<h3 id="data">Data</h3>
<p>While working at the NLW there were a number of digitised image collections that were catalogued using LCTGM and these include:</p>
<ul>
<li>the <a href="https://www.library.wales/discover/digital-gallery/photographs/john-thomas">John Thomas collection</a>, which is a photographic collection but contains many pictures of people.</li>
<li>the <a href="https://www.library.wales/discover/digital-gallery/photographs/p-b-abery">P B Abery collection</a> which contains a good mix of people and places.</li>
<li>the <a href="https://www.library.wales/discover/digital-gallery/photographs/geoff-charles">Geoff Charles collection</a>, which is a very large collection. Geoff Charles was a news photographer so there is a good mix of pictures, but unfortunately the collection was catalogued at story level, so any metadata applies to numerous images rather than having a one-to-one relationship.</li>
</ul>
<p>All of the above have IIIF images associated with them. Unfortunately the metadata isn’t directly downloadable, but a number of records have been shared with Wikidata. For example, to find all of the images from the P B Abery collection you can run the following query:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT ?item ?itemLabel ?manifest ?subjectLabel
WHERE
{
  ?item wdt:P195 wd:Q74836239 .
  ?item wdt:P6108 ?manifest .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
</code></pre></div></div>
<p>Once you have the manifest e.g:</p>
<p><a href="https://damsssl.llgc.org.uk/iiif/2.0/1293458/manifest.json">https://damsssl.llgc.org.uk/iiif/2.0/1293458/manifest.json</a></p>
<p>you can work out the ID and get to the metadata by going to:</p>
<p><a href="http://dams.llgc.org.uk/behaviour/llgc-id:1294464/fedora-bdef:mets/mets">http://dams.llgc.org.uk/behaviour/llgc-id:1294464/fedora-bdef:mets/mets</a></p>
<p>The LCTGM headings are in the MODS section of the METS documents.</p>
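<p>A sketch of driving this from a script: <code>QUERY</code> mirrors the SPARQL above and <code>manifests_from_response</code> unpacks the standard SPARQL JSON results shape. The sample response is hand-made; to run the query for real you would GET <code>https://query.wikidata.org/sparql</code> with the query urlencoded and <code>format=json</code>:</p>

```python
# Sketch of collecting the P B Abery manifest URLs from a Wikidata SPARQL
# JSON response. The network call itself is left out; the sample response
# below is hand-made in the standard results format.
QUERY = """SELECT ?item ?itemLabel ?manifest WHERE {
  ?item wdt:P195 wd:Q74836239 .
  ?item wdt:P6108 ?manifest .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}"""

def manifests_from_response(resp: dict) -> list[str]:
    """Pull the manifest URLs out of a SPARQL JSON result."""
    return [b["manifest"]["value"] for b in resp["results"]["bindings"]]

sample = {"results": {"bindings": [
    {"manifest": {"value": "https://damsssl.llgc.org.uk/iiif/2.0/1293458/manifest.json"}},
]}}
print(manifests_from_response(sample))
```

Note that, as the examples in the text show, the manifest identifier and the llgc-id of the METS record are not the same number, so a further lookup is needed to get from one to the other.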
<h3 id="issues-and-thoughts">Issues and thoughts</h3>
<ul>
<li>P B Abery collection seems a good match for machine learning as it has a good range of subjects and is individually catalogued.</li>
<li><strong>But</strong> it will be very time consuming to download 1,746 manifests plus images.</li>
<li>Some subjects will be more heavily populated than others… For the purposes of the demo I could restrict it to the top 3 most popular subjects</li>
<li>Test images will need to be restricted to black and white photographs probably taken in a similar era (1911 - 1948)</li>
<li>If it works it could be used to assign sensible topics to historical photographs.</li>
</ul>
<p>I think on balance the effort to download the metadata and parse the METS is putting me off this idea.</p>
<h2 id="tribunal-records-form-identification">Tribunal records form identification</h2>
<p>While at the NLW I was involved with a crowdsourcing project to transcribe the <a href="https://www.library.wales/discover/digital-gallery/archives/cardiganshire-great-war-tribunal-appeals-records#?c=&m=&s=&cv=&xywh=-2068%2C-1%2C7731%2C5641">Cardiganshire War Tribunal records</a>. This is an amazing archive of records in which people applied for exemption from enlisting into the Army during WW1. One of the issues with the crowdsourcing was that there were different types of forms in the archive: some were the same form but a different version, as new versions were released during the war; others were appeals after an initial application. Each different form required different fields to be entered, so users were asked to identify which form they were transcribing before they started filling in the fields. Now this data has been entered, would it be possible to create a predictor that could tell you which type of form you were looking at?</p>
<h3 id="data-1">Data</h3>
<p>This data is available as IIIF Annotations on GitHub:</p>
<p><a href="https://github.com/NLW-paulm/Welsh-Tribunal-annotations">https://github.com/NLW-paulm/Welsh-Tribunal-annotations</a></p>
<p>There is one JSON file per district and 11 districts in total. A district is made up of all of the forms and letters that the tribunal reviewed, ordered so that one person’s application, appeal and any letters are located together. The JSON files are IIIF Annotation Lists and it looks like there is one annotation per page. The text of the annotation is split into fields, e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"chars": "Name of Local Tribunal: Aberaeron Borough/Urban District<p>Number of Case: 1<p>Name: Dewi Trefor Jones <p>Address: Castle House Aberaeron <p>Occupation, profession or business: Draper's Assistant <p>Attested or not attested: Attested <p>Grounds: I am 73 years of age, in ill health and can't stand any work. <p>Signature of appellant: John Hugh Jones <p>Address of appellant: Castle House, Aberaeron <p>Occupation, profession or business: Draper <p>Why appellant acts for the man: Employer <p>Date: 1916-03-04T10:41:02.000Z<p>Tag: Pink R43/R44: Page 1"
</code></pre></div></div>
<p>Could be ordered as:</p>
<ul>
<li><strong>Name of Local Tribunal:</strong> Aberaeron Borough/Urban District</li>
<li><strong>Number of Case:</strong> 1</li>
<li><strong>Name:</strong> Dewi Trefor Jones</li>
<li><strong>Address:</strong> Castle House Aberaeron</li>
<li><strong>Occupation, profession or business:</strong> Draper’s Assistant</li>
<li><strong>Attested or not attested:</strong> Attested</li>
<li><strong>Grounds:</strong> I am 73 years of age, in ill health and can’t stand any work.</li>
<li><strong>Signature of appellant:</strong> John Hugh Jones</li>
<li><strong>Address of appellant:</strong> Castle House, Aberaeron</li>
<li><strong>Occupation, profession or business:</strong> Draper</li>
<li><strong>Why appellant acts for the man:</strong> Employer</li>
<li><strong>Date:</strong> 1916-03-04T10:41:02.000Z</li>
<li><strong>Tag:</strong> Pink R43/R44: Page 1</li>
</ul>
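<p>That reordering is a simple split: the fields are delimited by &lt;p&gt; markers and each one is ‘Label: value’. A sketch (<code>parse_chars</code> is a hypothetical helper written for this post, not project code):</p>

```python
# Sketch of splitting a transcription's "chars" string into labelled fields.
# split(":", 1) keeps values that themselves contain colons (dates, tags).
def parse_chars(chars: str) -> dict[str, str]:
    fields = {}
    for part in chars.split("<p>"):
        if ":" in part:
            label, value = part.split(":", 1)
            fields[label.strip()] = value.strip()
    return fields

chars = ("Name of Local Tribunal: Aberaeron Borough/Urban District"
         "<p>Number of Case: 1<p>Tag: Pink R43/R44: Page 1")
print(parse_chars(chars)["Tag"])  # Pink R43/R44: Page 1
```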
<p>The bit I am interested in is the Tag field, which says what type of page/document this is. Possible options are:</p>
<ul>
<li>Tag: Beige R186/187, page 1</li>
<li>Tag: Beige R186/187, page 2</li>
<li>Tag: Beige R41/42: Page 1</li>
<li>Tag: Beige R41/42: Page 2</li>
<li>Tag: Blue R52/53, page 1</li>
<li>Tag: Blue R52/53, page 2</li>
<li>Tag: Pink R43/R44: Page 1</li>
<li>Tag: Pink R43/R44: Page 2</li>
<li>Tag: Unknown document type</li>
</ul>
<p>The Unknown document types are usually letters of support. A full annotation looks as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> {
"@type": "oa:Annotation",
"motivation": "sc:painting",
"on": "https://damsssl.llgc.org.uk/iiif/2.0/4001851/canvas/4001854.json#xywh=0,0,3497,4413",
"resource": {
"@type": "cnt:ContentAsText",
"chars": "Description: Letter from Fred Burris & Sons from Horse Shoes and Mule Shoes for war department of British Government.<p>Transcription: We understand from Mr John Rees, 16 Albert St, that exemption has been refused him. We however, suggest that there must be some error here, particularly as this man is badged No. L 16876, and consequently can only be taken with the permission of the Ministry of Munitions, \nWe think we have previously explained the matter and are at a loss to understand what has happened. The only thing, therefore, is for us to communicate with the War Office and inform them of the details, \n<p>Name of Tribunal: Aberaeron Borough/Urban District<p>Number of Case: Appeal Form No 4<p>Tag: Unknown document type",
"format": "text/plain"
}
},
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">on</code> field gives the required IDs for the IIIF Image and the manifest the annotation applies to.</p>
<h3 id="issues-and-thoughts-1">Issues and thoughts</h3>
<ul>
<li>This data has now been captured for all of the digitised tribunal records, so a predictor may not be that useful</li>
<li>Could potentially be applied to other collections. Most tribunal records were destroyed although it seems there are copies in <a href="https://www.flickr.com/photos/manchesterarchiveplus/albums/72157632619308865/">Manchester Archive</a> and some from the <a href="https://discovery.nationalarchives.gov.uk/details/r/C14091136">National Archives</a></li>
</ul>Glen RobsonI am following the Fast AI course and one of the suggestions was to blog as you follow the course. This blog will document some ideas for the second session. In the second session they look at creating an Image classifier. In the example they use Bing to get examples of three different types of bears: Grizzly, Black and Teddy. They build a machine learning tool that, when supplied with an image of a bear, will tell them the type of bear.IIIF Tribunal Image Classifier2020-09-08T12:27:22+00:002020-09-08T12:27:22+00:00/iiif,ai/2020/09/08/image-classifier<p>As <a href="/iiif,ai/2020/09/08/AI-ideas.html">mentioned previously</a>, I am following the <a href="https://course.fast.ai/">Fast AI course</a> and as part of the second lesson you are encouraged to develop an Image Classifier. An Image Classifier is a type of program which can look at an image and decide which bucket the image should go in. In the first lesson the example Image Classifier let you know if an image was a picture of a dog or a cat and in this week’s lesson the Image Classifier distinguished between pictures of Brown, Grizzly and Teddy Bears.</p>
<p>In the previous post I discussed a couple of ideas but decided to go with the Tribunal records identification as this seemed the easiest to get started with. As a reminder, the <a href="https://www.library.wales/discover/digital-gallery/archives/cardiganshire-great-war-tribunal-appeals-records#?c=&m=&s=&cv=&xywh=-2068%2C-1%2C7731%2C5641">Cardiganshire War Tribunal records</a> are a fascinating collection of records covering the communication between the community in Ceredigion and the Military Tribunals who decided if applicants were allowed to avoid military conscription during WW1. The records were part of a National Library of Wales (NLW) crowdsourcing project working with volunteers both to transcribe the records and to retain a link between the transcription and the field on the form. The completed transcriptions were made available by the NLW and are available on <a href="https://github.com/NLW-paulm/Welsh-Tribunal-annotations">Paul McCann’s Github Page</a> as IIIF Annotations.</p>
<p>The archive contains many different forms and supporting correspondence. My plan is to create a classifier which will identify the form type. This is now mostly an academic exercise, as all of the pages have been identified and transcribed, but if, in theory, someone else were to digitise a set of tribunal records, they could use this tool to identify which type of documents they have. I can test this theory using the sample tribunal record copies in <a href="https://www.flickr.com/photos/manchesterarchiveplus/albums/72157632619308865/">Manchester Archive</a> and the <a href="https://discovery.nationalarchives.gov.uk/details/r/C14091136">National Archives</a>.</p>
<h2 id="preparing-the-data">Preparing the data</h2>
<p>To fit the data structure discussed in the course there should be a directory per category (or bucket) containing images in that set. So there should be 9 directories containing images for the following document types:</p>
<ul>
<li>Tag: Beige R186/187, page 1</li>
<li>Tag: Beige R186/187, page 2</li>
<li>Tag: Beige R41/42: Page 1</li>
<li>Tag: Beige R41/42: Page 2</li>
<li>Tag: Blue R52/53, page 1</li>
<li>Tag: Blue R52/53, page 2</li>
<li>Tag: Pink R43/R44: Page 1</li>
<li>Tag: Pink R43/R44: Page 2</li>
<li>Tag: Unknown document type</li>
</ul>
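<p>The tag strings contain characters that are awkward in directory names (slashes, colons, commas and spaces). The directory names that appear later in this post (e.g. <code class="language-plaintext highlighter-rouge">Beige_R186-187_page_1</code>) suggest a sanitisation along these lines; the exact rule and the function name below are my guesses, not the real script’s code:</p>

```python
def tag_to_dirname(tag):
    """Turn an annotation tag into a filesystem-safe directory name.

    Strips the leading 'Tag: ' label, then drops or replaces the
    characters that are awkward in a directory name, so that
    'Tag: Beige R186/187, page 1' becomes 'Beige_R186-187_page_1'.
    This is a guess at the mapping; the real script may differ.
    """
    name = tag.replace('Tag: ', '', 1)
    name = name.replace('/', '-').replace(':', '').replace(',', '')
    return '_'.join(name.split())

print(tag_to_dirname('Tag: Pink R43/R44: Page 1'))  # Pink_R43-R44_Page_1
```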
<p>The dataset from Paul contains Annotation Lists for each of the different districts that make up the county of Cardiganshire (now called Ceredigion). The districts are:</p>
<ul>
<li>Aberaeron Borough District</li>
<li>Aberaeron Rural District</li>
<li>Aberystwyth Borough District</li>
<li>Aberystwyth Rural District</li>
<li>Cardigan Borough District</li>
<li>Cardigan Rural District</li>
<li>Lampeter Rural District</li>
<li>Lampeter Urban District</li>
<li>Llandysyl Rural District</li>
<li>Newquay District</li>
<li>Tregaron District</li>
</ul>
<p>Inside these files there are lists of annotations and there is one annotation per page which looks as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"@type": "oa:Annotation",
"motivation": "sc:painting",
"on": "https://damsssl.llgc.org.uk/iiif/2.0/4001851/canvas/4001854.json#xywh=0,0,3497,4413",
"resource": {
"@type": "cnt:ContentAsText",
"chars": "Description: Letter from Fred Burris & Sons from Horse Shoes and Mule Shoes for war department of British Government.
<p>Transcription: We understand from Mr John Rees, 16 Albert St, that exemption has been refused him. We however,
suggest that there must be some error here, particularly as this man is badged No. L 16876, and consequently can only
be taken with the permission of the Ministry of Munitions, \nWe think we have previously explained the matter and are
at a loss to understand what has happened. The only thing, therefore, is for us to communicate with the War Office and
inform them of the details, \n<p>Name of Tribunal: Aberaeron Borough/Urban District<p>Number of Case: Appeal Form No 4
<p>Tag: Unknown document type",
"format": "text/plain"
}
},
</code></pre></div></div>
<p>The first job is to find the page <code class="language-plaintext highlighter-rouge">Tag</code> and link it to the image it references. The <code class="language-plaintext highlighter-rouge">Tag</code> is buried in the <code class="language-plaintext highlighter-rouge">resource.chars</code> part of the annotation; to pick it out we need to split the string, find <code class="language-plaintext highlighter-rouge">&lt;p&gt;Tag: </code>, and take what follows it. I used the following Python method to do this conversion:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def string2json(line):
data = {}
for field in line.split('<p>'):
# print (field)
name, value = field.split(':')[0], ':'.join(field.split(':')[1:])
# print("Name: {}, value: {}".format(name,value))
data[name] = value.strip()
return data
</code></pre></div></div>
<p>This will return a dictionary with all of the fields in the string separated, so <code class="language-plaintext highlighter-rouge">data['Tag']</code> will return the page tag.</p>
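<p>For example, running it over an abbreviated version of one of the <code class="language-plaintext highlighter-rouge">chars</code> strings (shortened here for readability):</p>

```python
# string2json from the listing above, repeated so this snippet runs on its own.
def string2json(line):
    data = {}
    for field in line.split('<p>'):
        name, value = field.split(':')[0], ':'.join(field.split(':')[1:])
        data[name] = value.strip()
    return data

# An abbreviated chars string in the same shape as the annotations:
chars = ("Description: Letter of support."
         "<p>Name of Tribunal: Aberaeron Borough/Urban District"
         "<p>Tag: Unknown document type")

data = string2json(chars)
print(data['Tag'])  # Unknown document type
```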
<p>The second part of the task is to identify the image this annotation points to. To start we need to find the canvas which is in the <code class="language-plaintext highlighter-rouge">on</code> field:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"on": "https://damsssl.llgc.org.uk/iiif/2.0/4001851/canvas/4001854.json#xywh=0,0,3497,4413",
</code></pre></div></div>
<p>If this were a set of images I didn’t know, I would find the manifest, look for this canvas id and then find the reference to the IIIF Image. Luckily I can take a shortcut, as I know the last number (<code class="language-plaintext highlighter-rouge">4001854</code>) before the .json is actually the Image Identifier. I can then use the following URL to access the image information:</p>
<p>http://dams.llgc.org.uk/iiif/2.0/image/<strong>4001854</strong>/info.json</p>
<p>One decision that was required was the size of the image to request. The course recommends:</p>
<p><em>“We don’t have a lot of data for our problem (150 pictures of each sort of bear at most), so to train our model, we’ll use RandomResizedCrop with an image size of 224 px, which is fairly standard for image classification, and default aug_transform”</em></p>
<p>(From <a href="https://github.com/fastai/fastbook/blob/master/02_production.ipynb">‘Training Your Model, and Using It to Clean your Data’</a>)</p>
<p>so I can request a IIIF Image at this size by running:</p>
<p>http://dams.llgc.org.uk/iiif/2.0/image/4001854/full/<strong>224,</strong>/0/default.jpg</p>
<p><img src="https://damsssl.llgc.org.uk/iiif/2.0/image/4001854/full/224,/0/default.jpg" alt="Example Tribunal Image" /></p>
<p>The full code for the program to extract the <code class="language-plaintext highlighter-rouge">Tags</code> and Images from the Annotation Lists is called <a href="https://github.com/glenrobson/Welsh-Tribunal-annotations/blob/master/scripts/typeImage.py">typeImage.py</a> and can be found on <a href="https://github.com/glenrobson/Welsh-Tribunal-annotations/blob/master/scripts/typeImage.py">GitHub</a>. Once this program has run you will have a directory for each Tag containing the following number of images:</p>
<ul>
<li>Beige_R186-187_page_1: 138 images</li>
<li>Beige_R186-187_page_2: 137 images</li>
<li>Beige_R41-42_Page_1: 1,331 images</li>
<li>Beige_R41-42_Page_2: 1,350 images</li>
<li>Blue_R52-53_page_1: 701 images</li>
<li>Blue_R52-53_page_2: 690 images</li>
<li>Pink_R43-R44_Page_1: 1,618 images</li>
<li>Pink_R43-R44_Page_2: 1,609 images</li>
<li>Unknown_document_type: 971 images</li>
</ul>
<p>To access it in the next steps I needed to make it public so I zipped it up and put it here: <a href="https://iiif.gdmrdigital.com/ww1-tribunal/TribunalTypeImages.zip">TribunalTypeImages.zip</a>.</p>
<h2 id="training-the-model">Training the model</h2>
<p>The next part of the task was to take this zip file of images and train a model that can identify the different forms. The course recommends doing this by creating a Jupyter Notebook. I haven’t quite got my head around how these notebooks fit in with, or compare to, virtual machines, Docker or Vagrant, but they allow you to both run code and mix in Markdown descriptions to comment on the process. Using the Google Colab software you can run these Jupyter notebooks on their virtual machine infrastructure for free. One of the main advantages of this approach is that Google Colab provides virtual machines with performant graphics cards, which appear essential for training models. This was mentioned in the first lesson and I had a go at running the Jupyter notebook locally, but even though my machine isn’t slow, it took a surprisingly long time to train the basic example models. Even though I was sorely tempted, lesson one advised against spending <em>(wasting)</em> time building your own computer with a powerful graphics card and recommended just using the free or low cost options they link to in the course.</p>
<p>The Jupyter notebook is embedded in this blog as a read-only version with the saved output from when I ran it.</p>
<p>You should be able to open a version you can run yourself by clicking the Open in Colab button below. In the Colab version, when you see a block of code you should be able to run it by moving your mouse cursor to the top left part of the code block. This should then show a play button. You should run the code blocks in order and wait for them to finish before going onto the next one.</p>
<script src="https://gist.github.com/glenrobson/969a6a25b4822c7f37945bf85087b7e9.js"></script>
<h2 id="results-and-reflection">Results and reflection</h2>
<p>I was really pleased to see the results of the Image Classification. It worked on most of the images and for the ones that it didn’t there seemed to be clear reasons why not. It would be interesting to look further into the top losses to see if this could be used to identify data that needs correcting. The only real downside to this project is the lack of applicability to real world problems. Its very specific to this collection but I wonder if the techniques could be applied to identify different types of objects in a collection. One thought I’ve had is whether you could train a classifier to identify types of images e.g. Maps, Newspapers, Manuscripts etc. This maybe something I try next although finding the source data will more challenging.</p>Glen RobsonAs mentioned previously, I am following the Fast AI course and as part of the second lesson you are encouraged to develop an Image Classifier. An Image Classifier is a type of program which can look at an image and decide which bucket the image should go in. In the first lesson the example Image Classifier let you know if an image was a picture of a dog or a cat and in this weeks lesson the Image Classifier distinguished between pictures of Brown, Grizzly and Teddy Bears.IIIF from Scratch2018-01-12T12:27:22+00:002018-01-12T12:27:22+00:00/iiif/2018/01/12/iiif-from-scrtach<p>I’ve been trying to learn Welsh for a number of years but unfortunately I’m not a natural linguist. So when one of our neighbours gave our Children an old Welsh book to read I set about using it as a challenge and translating it. Being prone to technical procrastination rather than getting on with the translation I’ve had a look at how I can turn this book into a IIIF resource. I hope this will be useful for people looking to make their images available as IIIF but don’t have the time or resources to invest in a comprehensive IIIF stack.</p>
<h2 id="stage-1---digitisation">Stage 1 - Digitisation</h2>
<p>So the quality of the digitisation isn’t going to win any prizes, but I used an HP Envy 5546 combination printer and scanner:</p>
<figure>
<a href="https://support.hp.com/us-en/product/hp-envy-5540-all-in-one-printer-series/5447939/model/5447940/drivers"><img src="/assets/images/HP-Envy-5546.jpg" /></a>
<figcaption><a href="https://support.hp.com/us-en/product/hp-envy-5540-all-in-one-printer-series/5447939/model/5447940/drivers" title="HP Envy 5546">HP Envy 5546</a>.</figcaption>
</figure>
<p>The provided ‘HP Easy Scan’ software has a nice simple interface for scanning. For scanning a book like this you can click ‘return’ to capture another scan, allowing you to faff with the book on the scanner trying to ensure it doesn’t fall off the table. As it’s a graphical book I started off with the ‘Photos, Graphics, Etc.’ setting, but this was a bit too clever for its own good and cropped the most graphical part of the page. Once I switched to the ‘Documents with Color’ setting it scanned the page as expected.</p>
<p>I exported the scanned images (14 pages) to the following formats:</p>
<ul>
<li>jp2 - I won’t use this in this blog post but in future I plan to write a post about using these images with one of the more fully featured image servers. See <a href="https://github.com/IIIF/awesome-iiif#image-servers">IIIF Awesome for a list</a>.</li>
<li>tiff</li>
</ul>
<p>The images are named page001 to page014.</p>
<h2 id="stage-2---making-the-images-through-iiif">Stage 2 - Making the images through IIIF</h2>
<p>For this project I’m looking for a no-server option that ideally will be hosted through this blog. The fully featured IIIF image servers require a service to be running to do the required conversions on a single source image. An alternative is to chop up the source image into tiles and store them on disk. This allows you to get all of the zoomable features of IIIF without running a server. The limitation is that you won’t get all of the functionality a IIIF image server can support, particularly custom regions of an image. I’ll demonstrate this limitation later.</p>
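<p>To make the limitation concrete: a static site contains only the exact region/size paths the tile generator wrote out, so a request is answered only if its URL maps to a file that exists. A small sketch of that mapping, assuming the directory layout produced by the generator output shown later (the function itself is mine):</p>

```python
import os

def static_tile_path(root, identifier, region, size):
    """Map a IIIF Image API style request onto the static file layout.

    A real IIIF image server can compute any region and size on demand;
    with static tiles, only the combinations that were pre-generated
    (e.g. '0,0,512,512' at '512,') exist on disk. Anything else is a 404.
    """
    return os.path.join(root, identifier, region, size, '0', 'default.jpg')

# A tile the generator wrote, so this file would exist:
print(static_tile_path('../welsh-book/page001', 'page001', '0,0,512,512', '512,'))
# An arbitrary crop that was never generated, so this file would not:
print(static_tile_path('../welsh-book/page001', 'page001', '10,10,300,200', '150,'))
```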
<p>To chop up the images I’m going to use <a href="https://github.com/zimeon/iiif/tree/master/demo-static">Simeon’s tile generator</a>. Following the install instructions I do the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hostname:website gmr$ cd /tmp/
hostname:tmp gmr$ git clone git://github.com/zimeon/iiif.git
Cloning into 'iiif'...
remote: Counting objects: 5514, done.
remote: Total 5514 (delta 0), reused 0 (delta 0), pack-reused 5514
Receiving objects: 100% (5514/5514), 32.25 MiB | 1.71 MiB/s, done.
Resolving deltas: 100% (1825/1825), done.
iPhone-Glen:tmp gmr$ cd iiif/
hostname:iiif gmr$ ls
CHANGES.rst MANIFEST.in demo-static iiif_static.py pypi_upload.md testimages
CONTRIBUTING.md README etc iiif_testserver.py run_tests.sh testpages
INSTALLATION.md README.rst iiif iiif_testserver.wsgi run_validate.sh tests
LICENSE.txt demo-auth iiif_cgi.py iiif_testservertest.py setup.py
hostname:iiif gmr$ sudo pip install 'Pillow<4.0.0'
Password:
The directory '/Users/gmr/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/gmr/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Pillow<4.0.0
Downloading Pillow-3.4.2-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (3.4MB)
100% |████████████████████████████████| 3.5MB 372kB/s
Installing collected packages: Pillow
Successfully installed Pillow-3.4.2
hostname:iiif gmr$
</code></pre></div></div>
<p>The next stage is to feed in the tiff images and give an output directory for the IIIF tiles. Trying the first page:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hostname:iiif gmr$ mkdir -p ../welsh-book/page001
hostname:iiif gmr$ ./iiif_static.py --write-html ../welsh-book/page001 -d ../welsh-book/page001 --api-version=2.1 ~/Documents/images/welsh_book/tiff/page001.tif
iiif_static.py: source file: /Users/gmr/Documents/images/welsh_book/tiff/page001.tif
iiif.static: ../welsh-book/page001 / page001/0,0,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/0,512,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/0,1024,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/0,1536,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/0,2048,512,286/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/512,0,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/512,512,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/512,1024,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/512,1536,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/512,2048,512,286/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1024,0,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1024,512,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1024,1024,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1024,1536,512,512/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1024,2048,512,286/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1536,0,160,512/160,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1536,512,160,512/160,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1536,1024,160,512/160,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1536,1536,160,512/160,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1536,2048,160,286/160,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/0,0,1024,1024/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/0,1024,1024,1024/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/0,2048,1024,286/512,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1024,0,672,1024/336,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1024,1024,672,1024/336,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/1024,2048,672,286/336,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/0,0,1696,2048/424,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/0,2048,1696,286/424,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/full/212,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/full/212,292 -> page001/full/212,
iiif.static: ../welsh-book/page001 / page001/full/106,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/full/106,146 -> page001/full/106,
iiif.static: ../welsh-book/page001 / page001/full/53,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/full/53,73 -> page001/full/53,
iiif.static: ../welsh-book/page001 / page001/full/27,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/full/27,36 -> page001/full/27,
iiif.static: ../welsh-book/page001 / page001/full/13,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/full/13,18 -> page001/full/13,
iiif.static: ../welsh-book/page001 / page001/full/7,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/full/7,9 -> page001/full/7,
iiif.static: ../welsh-book/page001 / page001/full/3,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/full/3,5 -> page001/full/3,
iiif.static: ../welsh-book/page001 / page001/full/2,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/full/2,2 -> page001/full/2,
iiif.static: ../welsh-book/page001 / page001/full/1,/0/default.jpg
iiif.static: ../welsh-book/page001 / page001/full/1,1 -> page001/full/1,
iiif.static: ../welsh-book/page001 / page001/info.json
iiif.static: Writing HTML to ../welsh-book/page001
iiif.static: ../welsh-book/page001 / page001.html
</code></pre></div></div>
<p>I then had a look at the webpage <code class="language-plaintext highlighter-rouge">page001.html</code> to see if it had worked. Unfortunately I noticed that it couldn’t access the OpenSeadragon script in <code class="language-plaintext highlighter-rouge">/tmp/welsh-book/page001/openseadragon200/openseadragon.min.js</code> so I symlinked the OpenSeadragon javascript file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hostname:iiif gmr$ ln -s /tmp/iiif/iiif/third_party/openseadragon200 ../welsh-book/page001/openseadragon200
</code></pre></div></div>
<p>and hit refresh and got a working zoomable image:</p>
<p><img src="/assets/images/iiif-tiles-page001.png" alt="Working IIIF Tiles shown in OpenSeaDragon" /></p>
<p>To process all of the images I can run the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./iiif_static.py -d ../welsh-book --api-version=2.1 ~/Documents/images/welsh_book/tiff/*.tif
</code></pre></div></div>
<p>and this generates 14 directories in <code class="language-plaintext highlighter-rouge">/tmp/welsh_book</code>. I’ve then copied the 14 directories into <a href="https://github.com/glenrobson/glenrobson.github.io/tree/serverless_iiif/iiif/welsh_book">this github project</a> which is the code behind this blog. Now I’ve moved the images, I have to update the <code class="language-plaintext highlighter-rouge">@id</code> in all of the info.json files so that it points to the correct place. Note that if you forget to do this, OpenSeadragon won’t be able to open the image.</p>
<p>I’ve written a program <a href="https://github.com/glenrobson/iiif_stuff/blob/master/standalone-iiif/updateImageId.py">updateImageId.py</a> that can be run as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./standalone-iiif/updateImageId.py json_file 'https://glenrobson.github.io/iiif/welsh_book/
</code></pre></div></div>
<p>and this should prepend the correct prefix to make it work on this site. To run this on all of the images I run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>find ../gdmr-digital/ -name "info.json" -exec ./standalone-iiif/updateImageId.py {} 'https://glenrobson.github.io/iiif/welsh_book/' \;
</code></pre></div></div>
<p>Note that if you make a mistake, have a look at the updateImageId.py README to see how you can replace a prefix.</p>
<p>It should now be possible to view the static image using OpenSeadragon:</p>
<div id="openseadragon1" style="width: 100%; height: 500px;"></div>
<script>
//<![CDATA[
function initOpenSeadragon() {
OpenSeadragon({
id: "openseadragon1",
minZoomImageRatio: 1,
prefixUrl: "/assets/images/openseadragon/", // OSD needs trailing slash
tileSources: "https://glenrobson.github.io/iiif/welsh_book/page002/info.json",
crossOriginPolicy: false});
}
window.onload = initOpenSeadragon;
document.addEventListener("page:load", initOpenSeadragon); // Initialize when using turbolinks
//]]>
</script>Glen RobsonMarineLives New Year Wishes2018-01-04T12:27:22+00:002018-01-04T12:27:22+00:00/iiif/wish-list/2018/01/04/marinelives-new-year-wishes<p>This post began as a discussion on <a href="https://twitter.com/MarineLivesorg/status/948365759205519360">twitter</a> on how IIIF could help the MarineLives project achieve its goals of collecting and transcribing historical content related to the manuscripts of the High Court of Admiralty 1627-1677. You can read more about their mission on the MarineLives <a href="http://www.MarineLives.org/wiki/MarineLives#About_MarineLives">wiki</a>.</p>
<p>The MarineLives 2018 New Years Wish list is as follows:</p>
<ol>
<li>
<p>A world in which we could download a single manuscript image from an archival, library or museum website; transcribe it, or add key words; then lob into the Commons. All data recorded with the image. And a Universal Search Tool would search all the content.</p>
</li>
<li>
<p>A manuscript image uploading &amp; exchange site for PhD students to upload &amp; share images taken as part of their research work, which would otherwise be unpublished.</p>
</li>
<li>
<p>A C17th combination of ZoomInfo &amp; LinkedIn to create, improve &amp; make available prosopographies by automatically aggregating historical web &amp; archival sources.</p>
</li>
<li>
<p>A digital camera which saves each new image of a manuscript page with its archival reference number, not a number file name [e.g. TNA HCA 13/53 f.12r] - We could upload 1000 images a day to our laptops &amp; search for them by archival reference &amp; find instantly.</p>
</li>
</ol>
<p>Some of these wishes fall squarely into what IIIF offers and the others may be assisted by some related technologies. I will split the first wish into its parts:</p>
<h2 id="1-a-a-world-in-which-we-could-download-a-single-manuscript-image-from-an-archival-library-or-museum-website">1-A: A world in which we could download a single manuscript image from an archival, library or museum website.</h2>
<p>Starting with the first one, IIIF supports the transcription of images without requiring the images to be downloaded. This can make it easier for institutions, including archives, libraries and museums, to make their images available, and will also help in linking transcriptions back to a source image. IIIF provides support for transcriptions by using the <a href="https://www.w3.org/TR/annotation-model/">Web Annotations</a> W3C specification. Web annotations allow annotation on many different types of resources including images, web pages and videos. It is an open format which allows the interoperability of annotations created by different systems.</p>
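<p>To give a feel for the model, the sketch below builds a minimal Web Annotation that attaches a plain-text transcription to a region of an image. All URIs are placeholders, and since the core Web Annotation vocabulary has no dedicated transcription motivation, <code class="language-plaintext highlighter-rouge">commenting</code> stands in here:</p>

```python
import json

# A minimal W3C Web Annotation: a plain-text transcription attached to a
# region of an image canvas. The URIs are placeholders, not real resources.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "commenting",
    "body": {
        "type": "TextualBody",
        "value": "Transcribed text for this region of the page",
        "format": "text/plain",
    },
    # The target uses a media fragment (xywh) to pin the text to a region:
    "target": "https://example.org/iiif/book1/canvas/1#xywh=100,100,500,300",
}
print(json.dumps(annotation, indent=2))
```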
<p>Below is a video from Stanford demonstrating how to take a reference to two IIIF objects (one from Stanford and one from the Bodleian Library in Oxford) and drag them into a third viewer Mirador.</p>
<iframe width="640" height="360" src="https://www.youtube.com/embed/uih5JuQnYuo" frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen=""></iframe>
<h2 id="1-b-transcribing-images">1-B: Transcribing images</h2>
<p>The <a href="http://projectmirador.org/">Mirador</a> viewer allows the comparison of images from different locations and also supports transcription of these images through its annotation interface and this is demonstrated in the following video:</p>
<iframe width="640" height="360" src="https://www.youtube.com/embed/9b1ReCZh9-E?start=151" frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen=""></iframe>
<p>Mirador supports both transcription and tagging type annotations. Transcribing through Mirador does not require hosting of the images separately as you can rely on the IIIF Image and Presentation API to annotate the images while they are stored and published by the institution. There is a public demo of <a href="http://projectmirador.org/">Mirador</a> available that allows you to load remote Manifests and start transcribing them into a local browser store. For persistent annotations you would have to look at installing an annotation store and a list of options can be found on the <a href="https://github.com/IIIF/awesome-iiif#annotation-servers">IIIF Awesome</a> List.</p>
<p>There are also other options for creating annotations, including various crowdsourcing systems mentioned in the <a href="https://github.com/IIIF/awesome-iiif/issues/176">Awesome List issue</a>. All of these should allow the import of IIIF image and presentation endpoints and allow their transcription. While working at the National Library of Wales (NLW) I was involved in the development of a crowdsourcing system with <a href="http://digirati.com">Digirati</a>. I gave a presentation at the Rome IIIF conference demonstrating the crowdsourcing of photographs from <a href="https://d.lib.ncsu.edu/collections">North Carolina State University</a> using the NLW crowdsourcing system. A video showing the setup is available on <a href="https://www.youtube.com/playlist?list=PLMd2mmRYjSJlKs829X0z_kYueQemSfwDd">youtube</a>. This video hopefully demonstrates the idea of taking content from a third party and transcribing it using IIIF tools.</p>
<h2 id="1-c-then-lob-annotations-into-the-commons-all-data-recorded-with-the-image">1-C: then lob [annotations] into the Commons. All data recorded with the image.</h2>
<p>Annotations created through Mirador or the other crowdsourcing systems will likely be in either the Web Annotations format or the precursor Open Annotations. There isn’t, as far as I know, a way to convert these annotations to something that could go on Wiki Commons, but as the annotations are stored as JSON-LD it should be possible to convert them to other formats. An alternative to submitting the annotations to Commons is to submit the annotations back to the institution that is publishing the IIIF resource. Specifying this protocol is part of the work of the <a href="http://iiif.io/community/groups/discovery/">IIIF Discovery group</a> but there is already an early proof of concept created between Jeff Witt (Loyola University Maryland) and e-Codices detailed in <a href="http://lombardpress.org/2016/04/16/iiif-webmentions/">Jeff’s blog</a>. This and Jeff’s further work on Linked Data Notifications offer an exciting option for the future where institutions can take back the work of volunteers and integrate the valuable data generated into their existing systems.</p>
<h2 id="1-d-and-a-universal-search-tool-would-search-all-the-content">1-D: And a Universal Search Tool would search all the content.</h2>
<p>There are many institutions which implement IIIF and some of these can be found on the <a href="http://iiif.io/community/#participating-institutions">IIIF website</a> but if you are working with content that isn’t IIIF accessible then there are some alternatives detailed in the next section. The <a href="http://iiif.io/community/groups/discovery/">IIIF Discovery</a> group are looking at ways to make it easier to find IIIF institutions and their images but the ‘Universal Search Tool’ mentioned is likely out of scope. The group does aim to put the protocols in place to allow someone to create this type of discovery tool particularly for annotations.</p>
<h2 id="2-image-upload-and-exchange">2. Image upload and exchange</h2>
<p>There is a little-known feature of the IIIF Presentation API that allows you to describe non-IIIF images. This could allow you to create a IIIF Manifest that links to images hosted on Wiki Commons or Flickr and then load this manifest into Mirador for transcription. It is a relatively unused feature of IIIF that has recently been supported in the Universal Viewer and Mirador, but it may not be supported in the other crowdsourcing solutions mentioned above. Tom Crane has been pushing for this to be supported in IIIF viewers and created <a href="https://github.com/tomcrane/wikipedia-to-iiif">an application</a> which uses static images from Wikipedia. This would allow you to transcribe content using IIIF viewers without the source institution supporting IIIF.</p>
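<p>The feature works because a canvas’s painting annotation can point at a plain image resource with no <code>service</code> block, so the viewer just loads the whole image. As a sketch, here is a minimal Presentation API 2.x manifest generator for static images; the base URI, image URL and dimensions are all invented for illustration:</p>

```python
# Sketch of the "non-IIIF images" feature: a Presentation 2.x manifest whose
# canvases point at plain static images rather than a IIIF Image API service.

def make_manifest(base, label, images):
    """images: list of (image_url, width, height, page_label) tuples."""
    canvases = []
    for i, (url, w, h, page) in enumerate(images):
        canvas_id = f"{base}/canvas/{i}"
        canvases.append({
            "@id": canvas_id,
            "@type": "sc:Canvas",
            "label": page,
            "width": w,
            "height": h,
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "on": canvas_id,
                "resource": {
                    # Plain image resource: no "service" block, so viewers
                    # fall back to loading the whole image directly.
                    "@id": url,
                    "@type": "dctypes:Image",
                    "format": "image/jpeg",
                    "width": w,
                    "height": h,
                },
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"{base}/manifest.json",
        "@type": "sc:Manifest",
        "label": label,
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
    }

# Hypothetical base URI and image URL:
m = make_manifest(
    "https://example.org/iiif/hca-13-53",
    "TNA HCA 13/53",
    [("https://example.org/static/f12r.jpg", 2400, 3200, "f.12r")],
)
print(m["sequences"][0]["canvases"][0]["label"])
```

<p>A manifest like this can be dropped into Mirador or the Universal Viewer; only deep zoom is lost, because there is no tile source to request regions from.</p>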
<p>It is also possible to support more of the IIIF Image API functions by either using static tiles, which can be generated by <a href="https://github.com/zimeon/iiif/tree/master/demo-static">Simeon’s tile generator</a>, or hosting a full <a href="https://github.com/IIIF/awesome-iiif#image-servers">IIIF-compliant image server</a>.</p>
<p>Although it’s possible to use non-IIIF images in these ways, there are a number of reasons why it is preferable to use an institution’s IIIF image and manifest:</p>
<ul>
<li>Single identifier for an image or object</li>
</ul>
<p>Using IIIF and transcribing IIIF resources encourages the re-use of shared identifiers, particularly the Canvas URI, which is an identifier for a particular image and the target of most IIIF annotations. By sharing the Canvas URI, multiple users can transcribe the same resource and their annotations can be shared and combined. Publishing a separate manifest with new Canvas URIs means the annotations are only associated with that instance, so it may not be possible to combine them with other transcriptions or with the source content held by the institution.</p>
<ul>
<li>Shared coordinate space</li>
</ul>
<p>As well as a Canvas URI, the canvas also specifies the width and height of the image, and this is usually the width and height of the source TIFF or JP2. This means annotations on the canvas will always be located in the correct place, even when viewing the image at different sizes. If you were to transcribe a downloaded JPEG or a new capture, the institution and other users of the content may not have the information needed to transpose coordinates on the derivative JPEG back to the source TIFF or JP2. Coordinates for annotations are particularly important when implementing the <a href="http://iiif.io/api/search/1.0/">IIIF Search API</a>, which allows users to search transcribed or generated text.</p>
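<p>The arithmetic involved is simple but needs the source dimensions, which is exactly what a downloaded derivative loses. A sketch, with all pixel values invented for illustration:</p>

```python
# Why the shared coordinate space matters: a region marked on a derivative
# JPEG must be scaled back to the canvas (source TIFF/JP2) dimensions before
# the annotation is useful to anyone else. All numbers are illustrative.

def scale_xywh(xywh, derivative_size, canvas_size):
    """Map an x,y,w,h region from derivative pixels to canvas pixels."""
    x, y, w, h = xywh
    dw, dh = derivative_size
    cw, ch = canvas_size
    sx, sy = cw / dw, ch / dh
    return (round(x * sx), round(y * sy), round(w * sx), round(h * sy))

# Region drawn on a 600x800 access JPEG, canvas declared as 2400x3200:
print(scale_xywh((100, 200, 50, 25), (600, 800), (2400, 3200)))
# -> (400, 800, 200, 100)
```

<p>When annotations target the canvas directly, this conversion never has to happen, because every client and the institution agree on one coordinate space.</p>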
<p>With the above points noted, it isn’t always possible to have IIIF images provided by an institution, and using static images with a IIIF Manifest will still allow some of the IIIF features to be used.</p>
<h2 id="3-aggregating-prosopographies">3. Aggregating prosopographies</h2>
<p>This falls outside the scope of IIIF, but IIIF will allow better aggregation of the sources of some of these prosopographies. Europeana is a good example of an aggregator which harvests IIIF manifests from institutions and is able to show a rich interface using the IIIF images remotely:</p>
<figure>
<a href="http://www.europeana.eu/portal/en/record/07931/diglit_wordsworth1841.html"><img src="/assets/images/Europeana-screenshot.png" /></a>
<figcaption><a href="http://www.europeana.eu/portal/en/record/07931/diglit_wordsworth1841.html" title="IIIF Image from the Universitätsbibliothek Heidelberg shown on Europeana">IIIF Image from the Universitätsbibliothek Heidelberg shown on Europeana</a>.</figcaption>
</figure>
<p>Wikidata offers an exciting place to aggregate prosopographies, but only for entries that meet its notability criteria. Beyond that, it is going to be difficult to create an aggregator of this type. Possibly the way to make this feasible is to make prosopographies available as Linked Open Data so they can be aggregated at a future date. The issue with aggregation is how to disambiguate different entries and ensure correctly linked entries are created. I have tried some of this type of disambiguation with the <a href="https://www.llgc.org.uk/en/collections/activities/research/nlw-data/aberystwyth-shipping-records-dataset/">NLW Shipping Records</a> and appreciate how difficult it can be.</p>
<h2 id="4-digitising-with-the-original-archival-reference-number">4. Digitising with the original archival reference number</h2>
<p>As mentioned in the Image upload and exchange section, it is preferable to use an institution’s IIIF resource rather than host a separate copy of the image. This is particularly relevant with identifiers, including archival reference numbers. Institutions typically maintain many identifiers for a digital image, including catalogue numbers, reference numbers, digital image filenames and repository identifiers. It can be difficult to transpose between identifiers, but using the Canvas URI as mentioned previously allows institutions to use this identifier as the master for the published image. The reference number in the example, ‘TNA HCA 13/53 f.12r’, looks to be a combination of the identifier for the archival object with a page number appended. It can be difficult for a machine to split this identifier to work out which image is being referenced. With a Canvas URI there is defined behaviour and an API which allow it to be machine processed (if the machine also has a reference to the manifest it came from). Identifiers like the example in the wish list are important for users to be able to navigate the material, and in a IIIF manifest the first portion of the identifier, ‘TNA HCA 13/53’, would be present in the metadata section of the manifest while ‘f.12r’ would be the canvas label. Neither the manifest metadata nor the canvas label would be expected, in a IIIF implementation, to be machine processable.</p>
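<p>One way to see the difference is that, given a manifest, a machine can find a page by its human-readable label without ever guessing where ‘TNA HCA 13/53 f.12r’ splits. The manifest and <code>example.org</code> URIs below are hypothetical, minimal stand-ins:</p>

```python
# Sketch: resolving a page label to a Canvas URI by walking a (hypothetical,
# minimal) manifest, rather than trying to split a combined reference string
# like 'TNA HCA 13/53 f.12r' at an undefined boundary.

def find_canvas_by_label(manifest, page_label):
    """Return the Canvas URI whose label matches page_label, or None."""
    for seq in manifest.get("sequences", []):
        for canvas in seq.get("canvases", []):
            if canvas.get("label") == page_label:
                return canvas.get("@id")
    return None

manifest = {
    "label": "TNA HCA 13/53",  # human-readable object reference
    "sequences": [{"canvases": [
        {"@id": "https://example.org/canvas/f12r", "label": "f.12r"},
        {"@id": "https://example.org/canvas/f12v", "label": "f.12v"},
    ]}],
}

print(find_canvas_by_label(manifest, "f.12r"))
# -> https://example.org/canvas/f12r
```

<p>The labels stay free-form for human navigation, while the Canvas URI carries the machine-processable identity.</p>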
<p>If it’s not possible to have a IIIF image from the institution, then the images would likely have short filenames and new Canvas IDs would be minted. At the NLW we handled the association between identifiers and digital images using a workflow management system developed internally called Wombat. This allowed you to associate archival reference numbers with a book or file, and filenames with page labels. We relied on the public catalogue to allow users to search archival reference numbers and get access to digital objects.</p>
<p>Glen Robson</p>
<p>This post began as a discussion on Twitter on how IIIF could help the MarineLives project achieve their goals of collecting and transcribing historical content related to the manuscripts of the High Court of Admiralty 1627-1677. You can read more about their mission on the MarineLives wiki.</p>