In this series, we have provided an introduction to the project and cited specific technology improvements that could transform the way phenology is studied: stationary camera networks and machine-based image processing of large data sets on big data platforms.
With days one and two behind us, our team spent day three learning about current data archives, weather station sensors, data processing issues, the models in use today, and visualizations. Even though the week is only half over, several clear ways that technology can change how science is practiced have already emerged, and I share them below.
The learning started once we were done with breakfast, and the team of scientists from Schoodic Research Institute, Earthwatch, and Acadia National Park, along with the group of us from Pivotal and EMC, set out from Schoodic Research Institute to the main part of the park on Mount Desert Island. Fortunately, our journey included a few scenic stops along the way, including the historic Carriage Roads, Thunder Hole, and Sand Beach. After arriving, we visited the archive room, which is in the process of digitizing many of its historic and cultural artifacts, including notebooks, photographs, and various plant and animal specimens. Eventually, the information will be available through digital repositories like Data Observation Network for Earth (DataONE) and Dryad. The former is supported by the National Science Foundation and built to make environmental science data available for education and outreach. The latter is a non-profit that makes the data underlying scientific publications discoverable, freely reusable, and citable.
Next on the agenda was a visit to one of the weather stations, near Cadillac Mountain, where we could see first hand the sensors that measure temperature, mercury levels, and precipitation as well as wind speed and direction. Importantly, this data is shared with two government bodies. The first, the National Oceanic and Atmospheric Administration (NOAA), uses the data in service of its mission of protecting life and property and enhancing the U.S. economy. The second, the Environmental Protection Agency, addresses climate change, air quality, water protection, clean and sustainable communities, chemical safety, pollution, and health. In addition, some samples collected at this weather station were mailed out to processing plants as far away as Oregon.
I spent a large amount of time during the day with two people, Adam Kozlowski and Dr. Richard Feldman, and we spoke in depth about how each of them uses data.
Adam manages data for the National Park Service (NPS) at the Northeast Temperate Network (NETN). He walked us through how they use data within the current system and the problems they face. One of their goals is detecting changes in the properties of water within the NETN region, including its biological, chemical, and physical makeup. Adam and his team analyze water trends for a dozen parks, including Acadia.
The process starts when they pull data out of IRMA (Integrated Resource Management Applications), a repository of files, publications, and datasets and the one-stop source for NPS data. The data is then stored in a SQL database on a Linux server, where it is queried. The team analyzes it in R and develops visualizations using RShiny. Ultimately, these visualizations are presented to other researchers and employees within the NPS.
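To make the pattern concrete, here is a minimal sketch of that load-query-visualize workflow, using Python with an in-memory SQLite database as a stand-in for their Linux SQL server. Their actual stack queries from R, and the table and column names below are hypothetical:

```python
import sqlite3

# Schematic analogue of the NETN workflow: load records into a SQL store,
# then query an aggregate trend that a visualization layer could chart.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE water_quality (park TEXT, year INT, ph REAL)")
conn.executemany(
    "INSERT INTO water_quality VALUES (?, ?, ?)",
    [("ACAD", 2012, 6.8), ("ACAD", 2013, 6.9), ("ACAD", 2014, 7.0)],
)

# A yearly average per park is the kind of result a dashboard would plot.
rows = conn.execute(
    "SELECT year, AVG(ph) FROM water_quality WHERE park = ? GROUP BY year",
    ("ACAD",),
).fetchall()
```

In the real system, `rows` would feed an interactive chart in RShiny rather than stopping at the query result.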
Dr. Feldman studies ecology to uncover how changes in the environment affect a species' population and how that population fluctuates. Within the Prairie Pothole region of the U.S. and Canada, he studies data sets covering 10 duck species measured at over 1,000 sites across 50 years. He uses Structural Equation Models (SEM) in this work, and the posterior distribution of his model's parameters is estimated using Markov Chain Monte Carlo (MCMC) sampling in tools like OpenBUGS. There are common challenges here. First, fluctuations can happen regardless of environmental variables; species density alone affects survival and reproduction rates. In addition, observers can introduce manual errors, as we discussed yesterday.
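For readers new to MCMC, the sketch below shows the core idea on a toy problem: a random-walk Metropolis sampler drawing from the posterior of a mean abundance parameter given simulated counts. This illustrates the sampling technique only, under an assumed simple Normal model; it is not Dr. Feldman's actual SEM:

```python
import math
import random
import statistics

random.seed(42)
data = [random.gauss(50.0, 5.0) for _ in range(200)]  # simulated counts
sigma = 5.0        # assumed known observation noise
prior_sd = 100.0   # vague Normal(0, prior_sd) prior on the mean mu

def log_post(mu):
    """Unnormalized log posterior: log prior + log likelihood."""
    lp = -mu ** 2 / (2 * prior_sd ** 2)
    lp += sum(-(y - mu) ** 2 / (2 * sigma ** 2) for y in data)
    return lp

samples, mu = [], 0.0
cur = log_post(mu)
for _ in range(5000):
    prop = mu + random.gauss(0, 1.0)          # random-walk proposal
    lp = log_post(prop)
    if lp - cur > math.log(random.random()):  # Metropolis accept/reject
        mu, cur = prop, lp
    samples.append(mu)

post_mean = statistics.mean(samples[1000:])   # discard burn-in
```

Tools like OpenBUGS automate exactly this kind of loop for far richer models, where no closed-form posterior exists.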
There are multiple improvements we can make to the current processes by applying our platforms, tools, and techniques.
In the case of Adam’s use of RShiny, our teams at Pivotal Data Labs use and love the tool. R is a great language for statistical analysis and data visualization. By using our open source product, PivotalR, on a platform built for massive scale and parallelization, like Pivotal Greenplum Database or Pivotal HAWQ (SQL on Hadoop), scientists can apply their existing skills to larger data sets, more complex models, and more advanced visualizations.
With the duck species population, we face both a data-parallel problem and a completely parallel problem, and our technology stack can drastically improve performance at scale here as well. As an example, we have predicted demand for consumer goods as a function of meaningful explanatory levers (e.g., pricing, product and geographic attributes, and weather). That work was based on Bayesian hierarchical regression in the context of demand modeling, and the models were estimated using the MCMC algorithm known as Gibbs sampling, leveraging tools such as Procedural Language R (PL/R) and MADlib. By running PL/R and MADlib on a massively parallel platform, we can allow scientists to achieve much greater performance and scale while using existing skill sets, as explained in this article.
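To give a flavor of what Gibbs sampling does, here is a self-contained toy sampler for a Normal model with unknown mean and precision, alternating draws from each conjugate conditional distribution. This is a sketch of the algorithm only, not the hierarchical demand model or its PL/R and MADlib implementation:

```python
import random
import statistics

random.seed(7)
true_mu, true_sd = 10.0, 2.0
y = [random.gauss(true_mu, true_sd) for _ in range(500)]
n, ybar = len(y), statistics.mean(y)

# Gibbs sampling: draw each parameter from its full conditional in turn.
# Priors assumed: flat on mu, p(tau) proportional to 1/tau.
mu, tau = 0.0, 1.0
mus = []
for _ in range(3000):
    # mu | tau, y  ~  Normal(ybar, 1 / (n * tau))
    mu = random.gauss(ybar, (1.0 / (n * tau)) ** 0.5)
    # tau | mu, y  ~  Gamma(shape=n/2, rate=SS/2); gammavariate takes a scale
    rate = sum((yi - mu) ** 2 for yi in y) / 2.0
    tau = random.gammavariate(n / 2.0, 1.0 / rate)
    mus.append(mu)

post_mu = statistics.mean(mus[500:])  # posterior mean after burn-in
```

Each conditional draw here is independent across observations, which is what makes this style of sampler amenable to the data-parallel execution that Greenplum and HAWQ provide.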
At the end of day three, we watched Chasing Ice, which shares time-lapse imagery of glaciers in Alaska, Greenland, and Iceland, and we felt the weight of the challenge our generation and the world face.
The film brought a new sense of motivation, particularly in the context of yesterday's brainstorming session, where we were asked to consider the perspectives of researchers, educators, and corporations: how could we ask citizen scientists to help? Almost unanimously, we wanted to give citizen scientists the ability to participate in building models that study the relationships between stressors and the dependent variables of interest.
Tomorrow, the team will discuss how a climate data lake, powered by Pivotal and the EMC Federation, can help take citizen science a step further.
You can read articles from my data science colleagues or find out more about what open source software and products we use at the Pivotal Data Science blog.