|Title||Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service |
|Publication Type||Conference Paper |
|Year of Publication||2016 |
|Authors||Collins, M |
|Editor||Poelen, J, Thompson, A, Hammock, J |
|Conference Name||31st Annual Meeting 2016 of the Society for the Preservation of Natural History Collections (SPNHC) |
|Date Published||06/2016 |
|Publisher||31st Annual Meeting 2016 of the Society for the Preservation of Natural History Collections (SPNHC) |
|Conference Location||Berlin, Germany |
|Keywords||biodiversity, GUODA, iDigBio |
|Abstract||Digitization of museum collection objects is a laborious task. While OCR and machine learning can assist with the manual keying of label information, the process of extracting properties from notes and descriptions requires both time and biological expertise. However, keying and OCR can be used to simply transcribe additional label information, and natural language processing algorithms can be used to perform some interpretation of the results. The formatting of notes and remarks data, like the Darwin Core fieldNotes field, varies widely from semi-structured attribute:value pairs to phrases to full sentences, which complicates the automated analysis of these fields. Using text mining techniques such as entity recognition and libraries like Python’s Natural Language Toolkit (NLTK), attributes and values, as well as broader understanding, can be extracted and ground-truthed against known data available from sources like the Encyclopedia of Life’s TraitBank or previously expert-digitized specimens.
Computation engines like Apache Spark combined with whole datasets from multiple biodiversity data sources are ideal for performing these investigations. Global Unified Open Data Access (GUODA), hosted at iDigBio, provides resources where your Python, Scala, or R code can be run on a cluster to quickly explore algorithms and data at a scale not available previously.
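
As an illustration of the semi-structured case the abstract describes, the following is a minimal sketch, not the authors' method, of pulling attribute:value pairs out of a fieldNotes-style string. It uses only Python's standard `re` module rather than NLTK. The function name `extract_pairs`, the pattern `PAIR_RE`, and the sample strings are all hypothetical, and the assumed delimiters (a colon or equals sign between attribute and value, a semicolon or comma between pairs) will not hold for every collection.

```python
import re

# Hypothetical pattern (an assumption, not from the source): an attribute name
# made of letters and spaces, a colon or equals sign, then a value that runs
# up to the next semicolon, comma, or the end of the string.
PAIR_RE = re.compile(r"([A-Za-z][A-Za-z ]*?)\s*[:=]\s*([^;,]+)")


def extract_pairs(notes):
    """Return a dict mapping lower-cased attribute names to stripped values.

    Free-text notes with no attribute:value structure yield an empty dict,
    which signals that they need a different technique (for example,
    entity recognition).
    """
    pairs = {}
    for match in PAIR_RE.finditer(notes):
        key = match.group(1).strip().lower()
        value = match.group(2).strip()
        if key and value:
            pairs[key] = value
    return pairs


if __name__ == "__main__":
    # Made-up example strings, not real specimen records.
    print(extract_pairs("sex: female; wing length: 52 mm; weight=14.2 g"))
    print(extract_pairs("Collected under a log near the stream."))
```

Notes that return an empty dictionary are the phrases and full sentences the abstract mentions; those would be handed to a heavier tool such as NLTK, and the extracted pairs could then be compared against a trusted source before being accepted.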