After getting oriented to the research problems of phenology, understanding data collection and storage, and discussing the statistical methods and approaches during the past few days of our expedition to Acadia National Park, we dug into solutions and designs on day four.
Fundamentally, more complete and accurate data sets on bird migration, barnacle abundance, weather, duck populations, and water resources all help us understand the impact of climate change. Today’s effort focused on the questions to answer, the data sources to ingest, the models to build, and the visualizations to share with others, ultimately leading to a solution and approach.
As we dove into selecting a pilot, we believed that every problem domain could be improved, and it was clear that a template approach could be applied across domains once the first area was developed. With this in mind, the team chose measuring and predicting the impact of climate change on hawk migration as the starting candidate project. From the viewpoint of Acadia and Schoodic scientists, the underlying business case supports the goals of the National Park Service by making the data consumable by scientists, educators, and citizen scientists.
We outlined an approach that uses a web portal to host interactive visualizations showing within-year and inter-year variability in hawk movement and its dependence on climate factors. A user could drill into a region of interest and then see or compare past migrations and predictions of future ones under given climate conditions. In addition, we added the concept of decomposition reports to help identify the various climate levers on migration. Within the architecture, we expect open source visualization tools like D3.js to operate within a web portal running on Pivotal Cloud Foundry, with the data lake served by the Pivotal Big Data Suite running on the EMC Big Data solution. This includes storing the data on Hadoop and querying it with Pivotal’s SQL-on-Hadoop engine, HAWQ.
The initial target data sources will include Hawkwatch and eBird for bird migration data, the National Climatic Data Center (NCDC) or the British Atmospheric Data Center (BADC) for weather-related data, and iNaturalist for plant and animal observations, such as food sources for hawks. We had also previously discussed using field-based stationary or mobile cameras, image processing, and object recognition techniques to help offload the burden of data collection, and these approaches could be added to the architecture over time. With the initial target data, several operations will be required as data is moved into the system, including standardization, conversion to a common frequency, imputation of missing values, and more. These functions will need to be handled during the extract, transform, and load (ETL) process. Once data is loaded into the data lake, the wider team will have a robust platform for joining relevant tables, generating features of interest, and preparing models and visualizations.
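To make those ETL steps a little more concrete, here is a minimal sketch in plain Python of the three operations named above: converting readings to a common (daily) frequency, imputing missing days, and standardizing the result. Everything in it is an assumption for illustration. The function names (`to_daily`, `fill_gaps`, `standardize`), the choice of linear interpolation and z-scores, and the temperature readings are all made up, and a real pipeline on the data lake would likely do this in SQL or a dedicated ETL tool instead.

```python
from datetime import date, timedelta
from statistics import mean, pstdev


def to_daily(observations):
    """Average all readings that share a date; observations are (date, value) pairs."""
    by_day = {}
    for day, value in observations:
        by_day.setdefault(day, []).append(value)
    return {day: mean(values) for day, values in by_day.items()}


def fill_gaps(daily, start, end):
    """Return (date, value) for every day in [start, end].

    Missing days between two known days are linearly interpolated;
    missing days at either edge reuse the nearest known value.
    Assumes `daily` holds at least one reading.
    """
    known = sorted(daily)
    series = []
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        if day in daily:
            series.append((day, daily[day]))
            continue
        before = [d for d in known if d < day]
        after = [d for d in known if d > day]
        if before and after:
            lo, hi = before[-1], after[0]
            frac = (day - lo).days / (hi - lo).days
            value = daily[lo] + frac * (daily[hi] - daily[lo])
        else:
            nearest = before[-1] if before else after[0]
            value = daily[nearest]
        series.append((day, value))
    return series


def standardize(series):
    """Rescale values to zero mean and unit (population) standard deviation."""
    values = [v for _, v in series]
    mu, sigma = mean(values), pstdev(values)
    return [(d, (v - mu) / sigma if sigma else 0.0) for d, v in series]


# Invented temperature readings (degrees C): two on Sept 1, none on Sept 2.
raw = [
    (date(2014, 9, 1), 18.0),
    (date(2014, 9, 1), 20.0),
    (date(2014, 9, 3), 15.0),
    (date(2014, 9, 4), 13.0),
]
daily = to_daily(raw)
filled = fill_gaps(daily, date(2014, 9, 1), date(2014, 9, 4))
scaled = standardize(filled)
```

Linear interpolation is only one of many imputation choices. For long gaps or strongly seasonal variables, the scientists would probably want something more careful.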
Within this architecture, analytic development using MADlib or libraries within the PL/Python and PL/R ecosystems can operate at very high scale and high speed. For example, using these tools, we could quickly build a regression model to predict the time of arrival of a certain species at a given site in a given year, given climate factors such as temperature, precipitation, hours of daylight, wind speed vector, and more.
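As a rough illustration of that kind of model, the sketch below fits an ordinary least squares regression in plain Python, predicting arrival day-of-year from two climate factors. It is a stand-in under stated assumptions, not the MADlib or PL/Python code we would actually run: the helper names (`solve`, `fit_ols`, `predict`) are hypothetical, only temperature and precipitation are used, and the numbers are invented to lie exactly on a line so the fit is easy to check by hand. They are not real observations.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(features, targets):
    """Return [intercept, coef_1, ..., coef_k] from the normal equations X'X b = X'y."""
    rows = [[1.0] + list(f) for f in features]  # prepend an intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, feature_row):
    """Predicted target for one row of features."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], feature_row))


# Invented site-years: (mean autumn temperature in C, precipitation in mm).
climate = [(12.0, 80.0), (13.0, 60.0), (14.0, 90.0), (15.0, 70.0), (16.0, 50.0)]
# Invented arrival day-of-year, generated as 300 - 2 * temp + 0.1 * precip.
arrival_day = [284.0, 280.0, 281.0, 277.0, 273.0]

coefs = fit_ols(climate, arrival_day)
forecast = predict(coefs, (14.5, 75.0))
```

With real observations the fit would not be exact, and we would want to add the other factors mentioned above (hours of daylight, wind) and check the residuals before trusting any forecast.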
Ultimately, the team believed in building a community of citizen scientists who can participate and become advocates in combating climate change. Whether a larger group of scientists contributes models and visualizations or citizens help collect and properly attribute data through crowdsourcing, the existing program can benefit from enabling others to participate without having to physically travel to the park. These types of programs connect with people who have a passion for the cause and assist scientists and ecologists who face resource and budget constraints.
As someone who loves data and national parks, I couldn’t have asked for a more interesting and rewarding experience in the field. With a Maine lobster dinner (or a salad for vegetarians like me) to end the week, we will be taking our formative plans and lessons learned back to our respective organizations to identify the next steps for building out a climate data lake. We thank the teams at Acadia National Park, the Schoodic Research Institute, and Earthwatch for their hospitality and insight. Our generation must stay on top of the challenges of climate change, and I feel fortunate to have made a minor contribution.
You can read articles from my data science colleagues or find out more about what open source software and products we use at the Pivotal Data Science blog.