How does Operation War Diary generate its data?
With all our Citizen Historians gallantly tagging away, we thought it was high time we explained how all that hard work is being used to produce the data sets for the project.
While we really appreciate all the effort each and every individual is putting in on the diaries, we know that errors can arise for one reason or another. For that reason, we use an algorithm to generate what’s known as consensus data.
To begin with, each of the diary pages is tagged by at least five Citizen Historians. Five different people who might each look at that page in a slightly different way. Once that tagging is complete, the diary page is closed and put in the queue for processing.
The system starts by identifying tags of the same type that relate to the same entity (a place, a person, an action, etc.). It has to take a best guess at this, clustering tags that sit close together on the scanned diary page, with closeness measured as a percentage of the image size. Trial and error has shown that this percentage is best set at 3% vertically and 10% horizontally. There must be a minimum of two tags for a particular entity if it is to make it into the final consensus data set. So, if two of the five Citizen Historians who have tagged a diary page have both identified a place in the same position on that page, that place makes it in.
Image © IWM (Q 5700)
The consensus tag generated from this tag cluster is then placed at the average location of all of its constituent tags.
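To make the clustering step more concrete, here is a minimal sketch in Python. It is not the project's actual code: the `Tag` class, the greedy grouping strategy and the choice to store positions as fractions of the image size are all assumptions made for illustration. Only the 3% and 10% thresholds and the two-tag minimum come from the description above.

```python
from dataclasses import dataclass

# Thresholds quoted in the post, as fractions of image height and width.
MAX_DY = 0.03
MAX_DX = 0.10
MIN_TAGS = 2  # a cluster needs at least two tags to reach the consensus set


@dataclass
class Tag:
    """Hypothetical tag record, used only for this illustration."""
    user: str
    kind: str   # e.g. "place" or "person"
    x: float    # horizontal position as a fraction of image width (0..1)
    y: float    # vertical position as a fraction of image height (0..1)


def centroid(cluster):
    """Average location of the tags in a cluster."""
    n = len(cluster)
    return (sum(t.x for t in cluster) / n, sum(t.y for t in cluster) / n)


def cluster_tags(tags):
    """Greedily group same-type tags that lie close to an existing cluster."""
    clusters = []
    for tag in sorted(tags, key=lambda t: (t.kind, t.y, t.x)):
        for cluster in clusters:
            cx, cy = centroid(cluster)
            if (cluster[0].kind == tag.kind
                    and abs(tag.x - cx) <= MAX_DX
                    and abs(tag.y - cy) <= MAX_DY):
                cluster.append(tag)
                break
        else:
            # No nearby cluster of the same type: start a new one.
            clusters.append([tag])
    # Drop clusters that only one tag supports.
    return [c for c in clusters if len(c) >= MIN_TAGS]


tags = [
    Tag("a", "place", 0.50, 0.20),
    Tag("b", "place", 0.52, 0.21),
    Tag("c", "place", 0.10, 0.80),   # isolated, so it is dropped
    Tag("d", "person", 0.51, 0.20),  # different type, so it is kept apart
]
clusters = cluster_tags(tags)
```

Here only the two nearby place tags from users "a" and "b" survive as a cluster, and `centroid(clusters[0])` gives the position at which the consensus tag would be placed.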
Next the system has to determine exactly what information should be attached to each tag. This is relatively straightforward when the original tags came from a fixed list (e.g. Activities tags, which can be of only a certain number of types). Where tags contain free text (e.g. person or place), fuzzy text matching is used to determine their attached information (e.g. Slater-Booth, Sclater-Booth and similar variants would be grouped together). Where a majority of these free text tags have the same value, that value becomes the consensus value. However, if there is no clear majority value, then the consensus tag will be formed of the leading variants.
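As a rough illustration of the free text step, the sketch below groups similar spellings and then picks a consensus value. The post does not say which fuzzy matching method is used, so this example swaps in `difflib.SequenceMatcher` from the Python standard library. The 0.8 similarity cut-off, the function names and the choice to return the top two variants when there is no majority are all assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

SIMILARITY = 0.8  # assumed cut-off for treating two spellings as variants


def similar(a, b):
    """True if two strings are close enough to count as the same entity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= SIMILARITY


def group_variants(values):
    """Group each value with the first existing group it resembles."""
    groups = []
    for value in values:
        for group in groups:
            if similar(value, group[0]):
                group.append(value)
                break
        else:
            groups.append([value])
    return groups


def consensus_value(values, top_n=2):
    """Return the majority value, or the leading variants if there is none."""
    counts = Counter(v.strip() for v in values)
    best, best_count = counts.most_common(1)[0]
    if best_count > len(values) / 2:
        return [best]
    return [v for v, _ in counts.most_common(top_n)]


names = ["Slater-Booth", "Sclater-Booth", "Sclater-Booth", "Smith"]
groups = group_variants(names)
result = consensus_value(groups[0])
```

In this example the three Booth spellings end up in one group and "Smith" in another. Two of the three Booth tags agree, so the consensus value for that group is the single spelling "Sclater-Booth".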
The algorithm is also designed to create serialised data. In essence, this means that each consensus tag is associated with a date, which allows the data generated to then be ordered by date. When Citizen Historians tag dates on a diary page, they essentially segment that page, and it’s that segmentation which allows the system to determine which consensus tags should lie inside which date area.
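One simple way to picture that segmentation is sketched below. It assumes that a page reads from top to bottom and that each date applies to everything beneath it until the next date tag, which may not match the real system exactly. The function name and the tuple layout are hypothetical, and positions are again fractions of the image height.

```python
from bisect import bisect_right


def assign_dates(date_tags, other_tags):
    """Attach to each tag the nearest date tag at or above it on the page.

    date_tags:  list of (y, date_string) pairs
    other_tags: list of (y, label) pairs
    """
    date_tags = sorted(date_tags)
    ys = [y for y, _ in date_tags]
    dated = []
    for y, label in other_tags:
        i = bisect_right(ys, y) - 1  # index of the last date at or above y
        # Tags above the first date on the page get no date in this sketch.
        date = date_tags[i][1] if i >= 0 else None
        dated.append((date, y, label))
    return dated


dates = [(0.10, "1914-09-01"), (0.55, "1914-09-02")]
places = [(0.30, "Albert"), (0.70, "Arras"), (0.05, "HQ")]
dated = assign_dates(dates, places)
```

Here "Albert" falls in the segment for the first date and "Arras" in the segment for the second, while "HQ" sits above the first date tag and is left undated. A real system would need a rule for that last case, such as carrying over the final date from the previous page.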
Once these operations have been carried out for one page of a diary, the next page will be processed and so on until the diary is complete.
Don’t worry about us losing all the tags you’ve generated, though – our databases hold everything that every single one of our Citizen Historians has added to Operation War Diary, be it individual tags, hashtags or text comments. We know just how valuable a resource that’s going to be for anybody wanting to investigate the diaries beyond the standard, structured tags we’ve defined.
Why not check out our first batch of consensus data here: http://wd3.herokuapp.com/public