GUODA: A Unified Platform for Large-Scale Computational Research on Open-Access Biodiversity Data

Title: GUODA: A Unified Platform for Large-Scale Computational Research on Open-Access Biodiversity Data
Publication Type: Conference Paper
Year of Publication: 2016
Authors: Collins, M
Editor: Thompson, A, Poelen, J, Hammock, J
Conference Name: Biodiversity Information Standards (TDWG) 2016 Annual Conference
Date Published: 12/2016
Publisher: TDWG 2016 Annual Conference
Conference Location: Santa Clara de San Carlos, Costa Rica
Keywords: biodiversity, GUODA, iDigBio
Abstract: Managing research data has always been challenging, but the recent availability of multi-gigabyte and larger datasets from major aggregators has created new problems, especially for individual researchers and those at small institutions. A recent collaboration between Integrated Digitized Biocollections (iDigBio) and the Encyclopedia of Life (EOL), called Global Unified Open Data Access (GUODA), aims to bring new techniques and resources for working with large biodiversity datasets to the widest possible community of researchers. GUODA is both a computing infrastructure built and hosted by iDigBio and a community for collaboration in using that infrastructure. Our collaboration focuses on developing tools and workflows that use Apache Spark for highly parallelized data analysis, a repository of pre-formatted, ready-to-use biodiversity datasets, and a resource management system capable of exposing these resources to software developers and data analysts across the full range of skill levels. This presentation will outline the software and hardware used in GUODA, describe the process and formats for transforming common biodiversity data from sources such as the Global Biodiversity Information Facility (GBIF), iDigBio, and the Biodiversity Heritage Library (BHL) into computable data structures, and demonstrate the Jupyter Notebook interface to GUODA, which is designed for researchers to interact with directly.