|Title||GUODA: A Unified Platform for Large-Scale Computational Research on Open-Access Biodiversity Data |
|Publication Type||Conference Paper |
|Year of Publication||2016 |
|Authors||Collins, M |
|Editor||Thompson, A, Poelen, J, Hammock, J |
|Conference Name||Biodiversity Information Standards (TDWG) 2016 Annual Conference |
|Date Published||12/2016 |
|Publisher||TDWG 2016 Annual Conference |
|Conference Location||Santa Clara de San Carlos, Costa Rica |
|Keywords||biodiversity, GUODA, iDigBio |
|Abstract||Managing research data has always been challenging, but the recent availability of multi-gigabyte and larger datasets from major aggregators has created new problems, especially for individual researchers and those at small institutions. A recent collaboration between Integrated Digitized Biocollections (iDigBio) and the Encyclopedia of Life (EOL), called Global Unified Open Data Access (GUODA), aims to bring new techniques and resources for working with large biodiversity datasets to the widest possible community of researchers.
GUODA is both a computing infrastructure built and hosted by iDigBio and a community for collaboration in using that infrastructure. Our collaboration focuses on developing tools and workflows that use Apache Spark for highly parallelized data analysis, a repository of pre-formatted, ready-to-use biodiversity datasets, and a resource management system capable of exposing these resources to software developers and data analysts across the full range of skill levels.
This presentation will outline the software and hardware used in GUODA, describe the process and formats for transforming common biodiversity data from sources such as the Global Biodiversity Information Facility (GBIF), iDigBio, and the Biodiversity Heritage Library (BHL) into computable data structures, and demonstrate the Jupyter Notebook interface to GUODA, which is designed for researchers to interact with directly. |