ADB Mining

I’d like to propose a session that aims to produce (or, failing that, define) a tool that takes an Australian historical text and locates within it every mention of a person named in the Australian Dictionary of Biography (ADB).

Annotation in the Dictionary of Sydney

For each person referenced I would like to create a metadata record for the corresponding ADB entry in my own database and link this to an annotation record pointing to the mention in the text. If a given person is mentioned multiple times, multiple annotation records should be created, all pointing to a single ADB metadata record.
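
To make that one-to-many relationship concrete, here is a minimal sketch in Python. It is not how Heurist stores records: the class names, field names, the `adb-0001` style identifiers and the exact-string matching are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class AdbRecord:
    """Metadata record for one ADB entry (hypothetical fields)."""
    adb_id: str
    name: str


@dataclass
class Annotation:
    """One mention of a person in the source text."""
    text: str          # the annotated string
    start: int         # character offset where the mention begins
    end: int           # offset one past the last character
    target: AdbRecord  # pointer to the shared metadata record


def annotate(source: str, people: List[AdbRecord]) -> List[Annotation]:
    """Create one annotation per mention, all pointing at the same record."""
    annotations = []
    for person in people:
        # Exact-string matching is an assumption; real text needs fuzzier matching.
        for match in re.finditer(re.escape(person.name), source):
            annotations.append(
                Annotation(match.group(), match.start(), match.end(), person)
            )
    return sorted(annotations, key=lambda a: a.start)
```

Calling `annotate` on a text that names the same person twice yields two `Annotation` objects whose `target` is the very same `AdbRecord`, which is the structure described above.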

I’d ideally like to be able to scan the text and have it OCRed into TEI first, but I think that would be a bit ambitious. I wanted at least to mention this as a reminder that ultimately the process should be input agnostic: one should be able to load a newly created Word document as easily as a scan of a historical manuscript.

I’ve hacked around over the years with offset markup using TEI source files in Heurist. We render TEI documents as XHTML using Cocoon and allow the user to select passages of text, then create an annotation record that stores the annotated string, the source text, the location within the text and an optional pointer to another record, which could be an ADB entry metadata record. The Dictionary of Sydney project in particular makes extensive use of this data structure, with mentions of famous Sydneysiders within text entries pointing – via annotations – to person entity records. As often as not these then reference the ADB entry.

This tool would greatly simplify the current workflow of the Dictionary of Sydney, and many of the other projects I am involved with would also benefit from it. At present the Dictionary of Sydney relies on a large team of volunteers and a pretty active editorial team to create annotations “manually”. It would be great to have a semi-automated procedure – possibly based around a bunch of clever XSL transforms and an XML feed from the ADB – that would enable me to connect a given research database with the ADB and indeed beyond (to paraphrase Buzz Lightyear).

About Steven Hayes

I've worked at the Archaeological Computing lab for the past five years and have been deeply involved in the development and implementation of Heurist. Specific projects include the Dictionary of Sydney and the Charles Harpur Critical Archive. I work predominantly with research groups, usually with ARC funding, to model their research data, implement the model in Heurist and, in doing so, refine the general model of humanities research data that underpins the Heurist project.
This entry was posted in Session Proposal.

2 Responses to ADB Mining

  1. Tim Sherratt says:

    Count me in!

    You could of course scrape the ADB (and who hasn’t scraped the ADB at some stage), but for this it would seem to make more sense to use People Australia as the reference source. It includes the ADB as well as other biographical sources and has a convenient API (SRU).

    I started to work on something similar to this with my Identity Browser. If you’ve got your text in a browser you can just highlight a name, click on the bookmarklet, choose from a list of possible matches, and then copy some annotated text for insertion into your document. In this case I was providing RDFa, but it could be TEI.

    It seems that (leaving aside the OCR part) there are 4 main steps: named-entity extraction to pull out all the names from the text, searching People Australia for matches, returning matches for human disambiguation, and annotation.
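
    Those four steps can be sketched as a small Python pipeline. This is only an outline under loud assumptions: the regular expression is a crude stand-in for a real named-entity extractor, `index` is a local dictionary standing in for a People Australia search, and `choose` is a hypothetical callback representing the human who picks a match.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Step 1 stand-in: treat runs of two or more capitalised words as names.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_names(text: str) -> List[Tuple[str, int, int]]:
    """Return (name, start, end) for each candidate name in the text."""
    return [(m.group(), m.start(), m.end()) for m in NAME_PATTERN.finditer(text)]


def search(name: str, index: Dict[str, List[str]]) -> List[str]:
    """Step 2 stand-in: look the name up in a local index of record ids."""
    return index.get(name.lower(), [])


def link(
    text: str,
    index: Dict[str, List[str]],
    choose: Callable[[str, List[str]], Optional[str]],
) -> List[Tuple[int, int, str]]:
    """Run all four steps and return (start, end, record_id) annotations."""
    links = []
    for name, start, end in extract_names(text):
        candidates = search(name, index)
        if not candidates:
            continue
        # Step 3: hand the candidates to a person for disambiguation.
        chosen = choose(name, candidates)
        # Step 4: record the annotation only if a match was accepted.
        if chosen is not None:
            links.append((start, end, chosen))
    return links
```

    In practice `choose` would present the candidates in a user interface; returning `None` lets the reviewer reject every suggested match.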

    There is a growing number of tools and web services available for the entity extraction. I’m not sure you could do without human involvement in the disambiguation phase. You could do some content analysis on the source text and on the PA/ADB records to provide a confidence measure on each match, but you would probably still want a human brain to check these.
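
    One simple form such a confidence measure could take is word overlap between the passage around a mention and each candidate's biography. The sketch below uses Jaccard similarity, which is an assumption on my part rather than anything People Australia or the ADB provides, and the function names and record ids are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def confidence(context: str, biography: str) -> float:
    """Jaccard similarity: shared words divided by all distinct words."""
    a, b = tokens(context), tokens(biography)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank(context: str, candidates: Dict[str, str]) -> List[Tuple[str, float]]:
    """Score every candidate biography and sort best match first."""
    scored = [(rid, confidence(context, bio)) for rid, bio in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

    The scores only order the candidates for the reviewer; they do not replace the human check.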

    It’s all very interesting and potentially do-able. It’s interesting too that the Twitter stream has brought a couple of similar projects to my attention today: DBpedia Spotlight and AXLD.

    I’ve also been thinking about building on my Identity Browser and Flickr Machine Tag Challenge projects to develop a more general biographical linking and annotation web service. I’d be interested in floating this to get some feedback and ideas.

  2. Sounds like an excellent proposal. Look forward to exploring it some more on the day
