Contents
1 - Statistical bootstrapping - developed statistical assessments of climate models and observations (2020-2022, academic work)
2 - Neural network regression - linked multiple modes of climate variability to Arctic sea ice (2022-2023, academic work)
3 - Database: building, querying, analyzing - an opportunity to learn SQL, Spark, Go, JavaScript, and AWS by building, querying, analyzing, and visualizing a database of personal running data (2023-present, personal project)
1) Statistical Bootstrapping
Aim: To compare interannual variability in climate models and observational data for Arctic sea ice concentration.
Problem: Observations cover only a short time period, and models have coarse resolutions. Statistical bootstrapping is required to generate sufficiently long records to enable comparisons.
Methods: Resample observational and modeled sea ice concentration anomalies 10,000 times (Fig 1.1), then compare the distributions of standard deviations to determine whether models and observations are consistent.
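As a minimal illustration of the resampling step (not the published code), the NumPy sketch below bootstraps the standard deviation of a synthetic anomaly time series for a single location and month:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative stand-in for ~40 years of detrended sea ice concentration
# anomalies at one grid cell for one month (synthetic data, not observations).
anomalies = rng.normal(loc=0.0, scale=5.0, size=40)

n_resamples = 10_000

# Resample the anomalies with replacement and store the standard deviation
# of each synthetic time series.
boot_std = np.empty(n_resamples)
for i in range(n_resamples):
    sample = rng.choice(anomalies, size=anomalies.size, replace=True)
    boot_std[i] = sample.std(ddof=1)

# The spread of bootstrapped standard deviations gives an uncertainty range
# for interannual variability, against which modeled values can be compared.
lower, upper = np.percentile(boot_std, [2.5, 97.5])
print(f"Observed std: {anomalies.std(ddof=1):.2f}")
print(f"95% bootstrap interval: [{lower:.2f}, {upper:.2f}]")
```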
Conclusions: In general, models agree well with observations, but as no model is within observational uncertainty for all months and locations, choosing the right model for a given task is crucial (Fig 1.2). Consistency is a relatively low bar because observational uncertainty is high; if this uncertainty were reduced, models would likely be considered biased for more regions and seasons.
Tools: Python - Dask, SciPy, Xarray, Matplotlib, NumPy, Cartopy. Shell scripting on the Cheyenne supercomputer, Climate Data Operators, data downloading.
Journal article: Wyburn-Powell et al. (2022) Modeled Interannual Variability of Arctic Sea Ice Cover is within Observational Uncertainty. DOI:10.1175/JCLI-D-21-0958.1.
Published data: At the Arctic Data Center - DOI:10.18739/A2H98ZF3T
Published code: synthetic-ensemble GitHub repo, archived with Zenodo, DOI:10.5281/zenodo.6687725.
Figure 1.1 - Resampling methodology for the observations
Figure 1.2 - Consistency between models and observations
2) Neural network regression
Aim: Link climate variability modes to regional Arctic sea ice anomalies. Assess the effect of each of the modes at different lag times and for different regions.
Problem: Sparse observational data and, until recently, small ensembles of climate model simulations have not provided sufficient training data to detect the small signal of multiple climate modes' remote impacts on the Arctic.
Methods: Reduce dimensionality by using a subset of climate variability modes as input features. Regress these features, at a range of lag times, onto regional Arctic sea ice concentration anomalies. Select the best of 4 ML models of differing complexity. Sequentially remove the worst-performing variables. Retain only the relationships that improve the validation r2 by more than 0.2 compared with a persistence forecast.
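The full analysis compares four model complexities in PyTorch; the sketch below uses a single linear model on synthetic data simply to illustrate the backward feature elimination and the persistence-based skill threshold (the persistence baseline here is an assumed stand-in, not the published forecast):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic example: 500 samples of lagged climate-mode indices (features)
# and a sea ice anomaly (target) that depends on only two of them.
n = 500
X = rng.normal(size=(n, 6))
y = 0.8 * X[:, 0] - 0.5 * X[:, 3] + 0.3 * rng.normal(size=n)

X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]

# Crude persistence baseline: predict each validation anomaly with the
# previous one (a stand-in for the real persistence forecast).
persistence_r2 = r2_score(y_val[1:], y_val[:-1])

features = list(range(X.shape[1]))

def val_r2(cols):
    """Validation r2 of a linear model fit on the given feature columns."""
    model = LinearRegression().fit(X_train[:, cols], y_train)
    return r2_score(y_val, model.predict(X_val[:, cols]))

# Backward elimination: repeatedly drop the feature whose removal hurts
# validation skill the least, as long as skill does not degrade.
while len(features) > 1:
    scores = {f: val_r2([g for g in features if g != f]) for f in features}
    drop, score_without = max(scores.items(), key=lambda kv: kv[1])
    if score_without < val_r2(features):
        break
    features.remove(drop)

final_r2 = val_r2(features)
print("Retained features:", features)
print(f"Validation r2 = {final_r2:.2f}, persistence r2 = {persistence_r2:.2f}")
# Keep the relationship only if it beats persistence by more than 0.2.
print("Keep relationship:", final_r2 - persistence_r2 > 0.2)
```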
Conclusions: The dominant climate variability modes are the global surface temperature anomaly and the Niño 3.4 index, which have strong negative/positive correlations with regional Arctic sea ice (Fig 2.1). Despite the many nonlinearities in the climate system, at least with the limited data available, nonlinearities are not important to include in our regression model to produce high predictability (Fig 2.2).
Tools: Python - PyTorch, SciPy, NumPy, pandas, Xarray, Matplotlib. Shell scripting on the Cheyenne supercomputer, Climate Data Operators, data downloading.
Journal article: working title: Large-scale Climate Modes Drive Low-Frequency Regional Arctic Sea Ice Variability.
Data archiving: in progress
Figure 2.1 - Comparison of 4 ML methods' validation coefficients of determination with lag time.
Figure 2.2 - Linear coefficients linking specific climate variability modes with sea ice concentration anomalies in the Chukchi Sea in October, by climate model.
3) Database: building, querying, analyzing
Aim: Learn and practice several skills which my time as a PhD student did not cover, including: accessing data using an API, coding in a new language (Go), building a database in MySQL, using new data structures such as JSON, querying the database with SQL, and analyzing the data with a new ML method such as clustering.
Phase 1: Feb-Sep 2023 - Learn Go, build and query a database, and run some analyses in Python to generate static content:
Learn to use Go for applications where Python would be impractical, such as when memory, speed, strong typing, and compilation are important. Practice regular expressions and explore useful language-independent computer science fundamentals.
Write a Go script that uses the Strava API to authenticate as a specific user and extract the required variables for a given time period. Incorporate unit testing and create an output JSON file. Solicit friends to volunteer their data for me to analyze.
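The script itself will be written in Go; as a placeholder, here is a minimal Python sketch of the intended token-refresh and activity-download flow, assuming Strava's standard OAuth token and activities endpoints (credentials, time window, and field names are illustrative):

```python
import json
import requests

# Illustrative credentials; real values come from the Strava API application
# settings and the athlete's one-time authorization.
CLIENT_ID = "12345"
CLIENT_SECRET = "secret"
REFRESH_TOKEN = "refresh-token"

def get_access_token():
    """Exchange the long-lived refresh token for a short-lived access token."""
    resp = requests.post(
        "https://www.strava.com/oauth/token",
        data={
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "grant_type": "refresh_token",
            "refresh_token": REFRESH_TOKEN,
        },
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def fetch_activities(after_epoch, before_epoch):
    """Download all activities in the given time window, page by page."""
    token = get_access_token()
    activities, page = [], 1
    while True:
        resp = requests.get(
            "https://www.strava.com/api/v3/athlete/activities",
            headers={"Authorization": f"Bearer {token}"},
            params={"after": after_epoch, "before": before_epoch,
                    "per_page": 200, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        activities.extend(batch)
        page += 1
    return activities

if __name__ == "__main__":
    # Keep only the variables needed for the database (field names assumed).
    keep = ("id", "start_date", "distance", "moving_time", "average_heartrate")
    runs = [{k: a.get(k) for k in keep}
            for a in fetch_activities(1672531200, 1704067200)]
    with open("activities.json", "w") as f:
        json.dump(runs, f, indent=2)
```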
Build a single database with Postgres or MySQL to consolidate the separate JSON files per person. Carefully define the database schema.
Learn enough SQL to formulate complex queries on the database. Experiment with queries, even ones not directly related to the required outputs. Also use Spark to facilitate some of these queries.
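A sketch of the Spark side, assuming the consolidated activities are available as JSON with (hypothetical) columns athlete, start_date, and distance:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("running-queries").getOrCreate()

# Load the consolidated per-person JSON files (path and schema illustrative).
activities = spark.read.json("data/activities/*.json")
activities.createOrReplaceTempView("activities")

# Example query: monthly distance totals per athlete, biggest months first.
monthly = spark.sql("""
    SELECT athlete,
           substring(start_date, 1, 7)     AS month,
           ROUND(SUM(distance) / 1000, 1)  AS km
    FROM activities
    GROUP BY athlete, substring(start_date, 1, 7)
    ORDER BY km DESC
""")
monthly.show(20)

# The same query via the DataFrame API, for comparison.
(activities
    .withColumn("month", F.substring("start_date", 1, 7))
    .groupBy("athlete", "month")
    .agg(F.round(F.sum("distance") / 1000, 1).alias("km"))
    .orderBy(F.desc("km"))
    .show(20))
```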
Use Python to analyze the output and design analyses that include a new machine learning method such as clustering, or classification using a random forest. Output static visualizations for this website.
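For the clustering idea, a minimal scikit-learn sketch on hypothetical per-run features (distance, pace, heart rate), standardized before k-means:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical per-run feature matrix: distance (km), pace (min/km),
# average heart rate (bpm); replaced here by synthetic data.
runs = np.column_stack([
    rng.gamma(shape=2.0, scale=4.0, size=300),   # distance
    rng.normal(loc=5.5, scale=0.8, size=300),    # pace
    rng.normal(loc=150, scale=12, size=300),     # heart rate
])

# Standardize so each feature contributes comparably to the distance metric.
X = StandardScaler().fit_transform(runs)

# Cluster runs into e.g. easy / tempo / long-run groups (k would be chosen by
# an elbow or silhouette criterion in the real analysis).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

for label in range(3):
    members = runs[kmeans.labels_ == label]
    print(f"Cluster {label}: {len(members)} runs, "
          f"mean distance {members[:, 0].mean():.1f} km, "
          f"mean HR {members[:, 2].mean():.0f} bpm")
```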
Phase 2: Oct-Dec 2023 - Build dynamic content for this website, using JavaScript and Tableau to navigate through the data already analyzed:
Select several analysis products that lend themselves to dynamic content, such as zooming in and out of clustering scatter plots or graphs, as well as layering or animating specified variables.
Output the analyzed data in a format such as JSON, using Python.
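A small sketch of that export step, assuming the analysis results sit in a pandas DataFrame (column names illustrative):

```python
import pandas as pd

# Hypothetical analysis output: one row per run with its cluster assignment.
results = pd.DataFrame({
    "activity_id": [101, 102, 103],
    "distance_km": [8.2, 21.4, 5.0],
    "cluster": [0, 2, 1],
})

# Records orientation gives a list of objects, convenient for JavaScript/Tableau.
results.to_json("analysis_output.json", orient="records", indent=2)
```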
Learn JavaScript basics sufficient to pass analyzed data to Tableau and embed the output into this website.
Figure 3.1 - Example activity from Strava that will form part of the database and analysis.
Phase 3: Jan-Jun 2024 - Enable automated monthly updating of the database using AWS, and automate the updating of the Python analysis and the visualization of dynamic data on this website:
In addition to the analysis of a fixed set of multiple Strava users' data, automate the addition of each new month of my own Strava data to the database. To do this, adapt the Strava API Go script to run in the cloud, e.g. on AWS.
Using Go and SQL, update the database in the cloud on a monthly basis, taking extra care over useful error reporting and unit testing.
After the database has successfully been updated, rerun the Python analysis script and output updated data files.
Pass this updated output data with JavaScript to Tableau and update the dynamic content.
Throughout the automated process, ensure backups of data are made for ease of reverting if issues arise. Also produce an email with a detailed report of the updates made, sufficient to be quickly alerted to errors.
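A sketch of such an email report using Python's standard library, with placeholder addresses, server, and credentials:

```python
import smtplib
from email.message import EmailMessage

# Illustrative summary assembled by the update pipeline.
report = {"new_activities": 12, "rows_inserted": 12, "errors": []}

msg = EmailMessage()
msg["Subject"] = "Monthly Strava database update"
msg["From"] = "updates@example.com"   # placeholder sender
msg["To"] = "me@example.com"          # placeholder recipient
msg.set_content(
    "Database update completed.\n"
    f"New activities: {report['new_activities']}\n"
    f"Rows inserted:  {report['rows_inserted']}\n"
    f"Errors: {report['errors'] or 'none'}\n"
)

# Send via a placeholder SMTP server; in the automated version the
# credentials would come from the cloud environment's secret store.
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("updates@example.com", "app-password")
    server.send_message(msg)
```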
Code: GitHub repo will be created shortly