1 - Statistical bootstrapping - developed statistical assessments of climate models and observations (2020-2022, academic work)
2 - Neural network regression - linked multiple modes of climate variability to Arctic sea ice (2022-2023, academic work)
3 - Database: building, querying, analyzing - an opportunity to learn SQL and Go by building, querying, analyzing, and visualizing a database of personal running data (2023-present, personal project)
1) Statistical Bootstrapping
Aim: To compare interannual variability in climate models and observational data for Arctic sea ice concentration.
Problem: Observations cover only a short time period and models have coarse resolutions. Statistical bootstrapping is required to generate sufficiently long records to enable comparisons.
Methods: Resample observational and modeled sea ice concentration anomalies 10,000 times (Fig 1.1). Compare the distributions of standard deviations to determine whether models and observations are consistent.
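The resampling step can be sketched with synthetic data. This is a minimal illustration, not the paper's exact configuration: the anomaly values, record lengths, and the 95% consistency criterion below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for detrended sea ice concentration anomalies
# (the real data are satellite and climate model time series).
obs_anoms = rng.normal(0.0, 5.0, size=40)    # ~40 years of observations
model_anoms = rng.normal(0.0, 6.0, size=40)  # one model, same record length

def bootstrap_std(anoms, n_resamples=10_000, rng=rng):
    """Resample anomalies with replacement; return the std of each resample."""
    idx = rng.integers(0, anoms.size, size=(n_resamples, anoms.size))
    return anoms[idx].std(axis=1, ddof=1)

obs_dist = bootstrap_std(obs_anoms)
model_dist = bootstrap_std(model_anoms)

# One possible consistency criterion (an assumption here): the model's mean
# resampled standard deviation falls within the observational 95% range.
lo, hi = np.percentile(obs_dist, [2.5, 97.5])
consistent = lo <= model_dist.mean() <= hi
print(f"obs 95% range: ({lo:.2f}, {hi:.2f}); model mean std: {model_dist.mean():.2f}")
```

In the real analysis this comparison is repeated per month, region, and model, which is where Dask and Xarray come in.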
Conclusions: In general, models agree well with observations, but as no model is within observational uncertainty for all months and locations, choosing the right model for a given task is crucial (Fig 1.2). Consistency with observations is a relatively low bar because observational uncertainty is high; if this uncertainty were reduced, models would likely be considered biased for more regions and seasons.
Tools: Python - Dask, SciPy, Xarray, Matplotlib, NumPy, Cartopy. Shell scripting on the Cheyenne supercomputer - Climate Data Operators, data downloading.
Journal article: Wyburn-Powell et al. (2022) Modeled Interannual Variability of Arctic Sea Ice Cover is within Observational Uncertainty. DOI:10.1175/JCLI-D-21-0958.1.
Published data: At the Arctic Data Center - DOI:10.18739/A2H98ZF3T
Figure 1.1 - Resampling methodology for the observations
Figure 1.2 - Consistency between models and observations
2) Neural network regression
Aim: Link climate variability modes to regional Arctic sea ice anomalies. Assess the effect of each of the modes at different lag times and for different regions.
Problem: Sparse data and, until recently, small climate model ensembles have not provided sufficient training data to detect the small signal of multiple climate modes' remote Arctic impacts.
Methods: Reduce dimensionality by using a subset of climate variability modes as input features. Regress the modes at different lag times onto regional Arctic sea ice concentration anomalies. Select the best of 4 ML models of differing complexity. Sequentially remove the worst-performing variables. Retain only the relationships whose validation r2 exceeds that of a persistence forecast by more than 0.2.
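The sequential variable-removal step can be sketched as backward feature elimination. This uses ordinary least squares on synthetic "climate mode" features purely for illustration; the real project compares 4 ML models (including neural networks in PyTorch) and scores candidates against a persistence forecast, neither of which is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def r2(y, yhat):
    """Coefficient of determination."""
    return 1.0 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def fit_predict(Xtr, ytr, Xva):
    """Ordinary least squares with an intercept column."""
    coef, *_ = np.linalg.lstsq(np.c_[np.ones(len(Xtr)), Xtr], ytr, rcond=None)
    return np.c_[np.ones(len(Xva)), Xva] @ coef

# Synthetic "climate modes" at a fixed lag; only modes 0 and 2 carry signal.
n = 400
X = rng.normal(size=(n, 5))
y = 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)
Xtr, Xva, ytr, yva = X[:300], X[300:], y[:300], y[300:]

features = list(range(X.shape[1]))
best = r2(yva, fit_predict(Xtr[:, features], ytr, Xva[:, features]))

# Repeatedly drop the feature whose removal hurts validation r2 least,
# stopping once every removal degrades the score. (The real method also
# requires the final r2 to beat a persistence forecast by 0.2.)
while len(features) > 1:
    scores = {f: r2(yva, fit_predict(Xtr[:, [g for g in features if g != f]], ytr,
                                     Xva[:, [g for g in features if g != f]]))
              for f in features}
    drop, score = max(scores.items(), key=lambda kv: kv[1])
    if score < best:
        break
    features.remove(drop)
    best = score

print(f"retained features: {features}, validation r2: {best:.3f}")
```

With this setup the elimination keeps the two signal-carrying modes and discards most of the noise features.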
Conclusions: The dominant climate variability modes are the global surface temperature anomaly and the Nino 3.4 Index, which have strong negative/positive correlations with regional Arctic sea ice (Fig 2.1). Despite the many nonlinearities in the climate system, at least with the constrained available data, nonlinearities are not important to include in our regression model to achieve high predictability (Fig 2.2).
Tools: Python - PyTorch, SciPy, NumPy, Pandas, Xarray, Matplotlib. Shell scripting, frequently incorporating the Climate Data Operators software, on the Cheyenne supercomputer.
Journal article: Large-scale Climate Modes Drive Low-Frequency Regional Arctic Sea Ice Variability. Preprint https://doi.org/10.31223/X56D59
Data archiving: in progress
Figure 2.1 - Comparison of 4 ML methods' validation coefficients of determination with lag time.
Figure 2.2 - Linear coefficients linking specific climate variability modes with sea ice concentration anomalies in the Chukchi Sea in October, by climate model.
3) Analyze running data with a database
Aim: Learn and practice several skills which my time as a PhD student did not cover. This includes: accessing data using an API, coding in a new language (Go), using new data formats such as JSON, building a database in MySQL and running it in Docker, querying the database with SQL, and analyzing the data with a new ML method, e.g. clustering.
Phase 1: Summer 2023 - Learning Go and SQL, building and querying a database:
Learn to use Go for applications where Python would be impractical, such as when memory, speed, strong typing, and compilation are important. Practice regular expressions and explore useful, language-independent fundamental computer science concepts.
Write a Go script that uses the Strava API to authenticate as a specific user and extract the required variables. Incorporate unit testing and create an output JSON file.
Design the schema and build the database with MySQL. The database will consist of two tables, one containing activities and the other containing athlete information; the tables are linked by the Activities table's foreign key AthleteID.
Learn enough SQL to formulate complex queries on the database. Experiment with queries, even those not directly related to required outputs. Consider using Spark to facilitate some of these queries.
Use Python to add value beyond what is already available on the Strava website; output static visualizations alongside summary statistics.
Host the database in the cloud and schedule the Go code to pull in new data on a monthly basis. Take extra care with error reporting and unit testing.
Throughout the automated process, back up the data so that reverting is easy if issues arise. Also produce an email with a detailed report of the updates made, sufficient to be quickly alerted to errors.
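The two-table design and an example query can be sketched as follows. This is a hypothetical illustration, not the project code: it uses Python with an in-memory sqlite3 database standing in for Go and MySQL so the example is self-contained, and the table names, column names, and sample JSON record are assumptions rather than the actual Strava payload.

```python
import json
import sqlite3

# Two-table schema: Athletes, and Activities linked by an AthleteID foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Athletes (
    AthleteID INTEGER PRIMARY KEY,
    Name      TEXT NOT NULL
);
CREATE TABLE Activities (
    ActivityID  INTEGER PRIMARY KEY,
    AthleteID   INTEGER NOT NULL REFERENCES Athletes(AthleteID),
    Name        TEXT,
    DistanceM   REAL,
    MovingTimeS INTEGER
);
""")

# A sample record standing in for one activity pulled from the API as JSON.
activity = json.loads(
    '{"id": 101, "athlete": {"id": 1}, "name": "Morning Run",'
    ' "distance": 8046.7, "moving_time": 2400}'
)

conn.execute("INSERT INTO Athletes VALUES (?, ?)", (1, "Example Athlete"))
conn.execute(
    "INSERT INTO Activities VALUES (?, ?, ?, ?, ?)",
    (activity["id"], activity["athlete"]["id"], activity["name"],
     activity["distance"], activity["moving_time"]),
)

# Example query: average pace (minutes per km) per athlete.
row = conn.execute("""
    SELECT a.Name, AVG(act.MovingTimeS / 60.0 / (act.DistanceM / 1000.0)) AS pace
    FROM Activities AS act JOIN Athletes AS a USING (AthleteID)
    GROUP BY a.AthleteID
""").fetchone()
print(row)
```

The same schema and query translate directly to MySQL; only the connection and driver code differ in the Go implementation.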
Figure 3 - An example activity from Strava that will form part of the database and analysis. The left shows the web page for the activity; the right shows the data from Strava's API in JSON format.