DEA Notebooks Hackathon: Make DEA Notebooks faster! - GeoscienceAustralia/dea-notebooks GitHub Wiki
Hackathon aim
Our aim is to make DEA Notebooks faster and more efficient so they run more quickly for our users and in our integration tests.
More efficient notebooks will lead to an improved user experience, and allow us to expand our test coverage across the entire DEA Notebooks repository. This will free us up from having to manually check and test notebooks for broken code, and make sure we fix issues before they impact our users.
Guidance and suggested approach
[!IMPORTANT]
We want to preserve the overall "purpose" of our notebooks when making changes.
We should only make changes if we can do it without making our examples less useful or informative to our users. The guide below provides an example workflow you can follow on the day!
- Look at the spreadsheet here, and choose a notebook to focus on from the "Longest running notebooks" column (you can view the notebooks on Knowledge Hub to help you choose)
- Open the DEA Sandbox (https://app.sandbox.dea.ga.gov.au/), and launch the `Default environment 2 Cores, 16G Memory` server option (this is what our external users use, and is most similar to how we run our tests)
- If you haven't edited DEA Notebooks before, follow the technical guide below to get started with Git on the DEA Sandbox.
- Launch your notebook on the DEA Sandbox. Before making any changes, read through and run each cell in your notebook carefully. Try to understand its overall purpose or message, i.e. what is it trying to convey to our users? What functionality is it showing off? What do we need to keep to make sure the example is still useful?
- Once you understand the purpose and approach of the notebook, look for places where we can make it faster to run without affecting its overall purpose. For most notebooks, these will be the most important things to look at:
  - Reducing the time period, e.g. `time=...` (can the notebook be run on one year of data instead of two without affecting its conclusions?)
  - Reducing the area/extent, e.g. `x=..., y=...` (can we load data for a smaller area and still demonstrate the same functionality?)
  - Loading fewer products, e.g. `products=...` (do we have to load data from Landsat 7 and 8 if just Landsat 8 will do?)
  Changes to time/spatial extents/products should be enough for most notebooks. Some other more advanced ideas include:
  - Filtering to less cloudy images by metadata (e.g. `cloud_cover=(0, 10)`) to load only clear images
  - Updating code to be more efficient (e.g. using built-in `xarray` or `numpy` tools instead of for-loops)
- Once you have made some changes to the notebook, double check that the notebook markdown cells still correctly match and describe the analysis (e.g. update references to time periods/locations to match your new values).
- Re-run the entire notebook (`Kernel > Restart kernel and run all cells`), then commit it back into the repo for review! (see Git details below)
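
As a rough illustration of the time, extent, product and cloud cover reductions described above, here is a minimal sketch of a "before" and "after" query. The coordinates, dates and product names are placeholders chosen for illustration, not values taken from any particular notebook, and `relative_load_size` is a hypothetical helper that only gives a crude proxy for data volume. It assumes the notebook stores its load parameters in a dictionary that is later passed as keyword arguments to whatever load function it already uses.

```python
from datetime import date

# Hypothetical "before" query: two years, a 0.2 x 0.2 degree area, two products.
# Product names and coordinates are illustrative placeholders only.
query_before = {
    "x": (149.0, 149.2),
    "y": (-35.4, -35.2),
    "time": ("2020-01-01", "2021-12-31"),
    "products": ["ga_ls7e_ard_3", "ga_ls8c_ard_3"],
}

# Hypothetical "after" query: one year, a quarter of the area, one product,
# and only scenes whose metadata reports 0-10% cloud cover.
query_after = {
    "x": (149.0, 149.1),
    "y": (-35.4, -35.3),
    "time": ("2021-01-01", "2021-12-31"),
    "products": ["ga_ls8c_ard_3"],
    "cloud_cover": (0, 10),
}


def relative_load_size(query):
    """Crude proxy for data volume: days x area (square degrees) x products.

    It ignores revisit frequency, resolution and cloud filtering, so it is
    only useful for comparing two versions of the same query.
    """
    start, end = (date.fromisoformat(t) for t in query["time"])
    n_days = (end - start).days + 1
    width = abs(query["x"][1] - query["x"][0])
    height = abs(query["y"][1] - query["y"][0])
    return n_days * width * height * len(query["products"])


ratio = relative_load_size(query_after) / relative_load_size(query_before)
print(f"The smaller query covers roughly {ratio:.0%} of the original volume")
```

The real speed-up will differ from this estimate, so time the notebook before and after your change, and check that the smaller query still shows the feature the notebook is meant to demonstrate.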
[!TIP] If you can't find an easy way to update the notebook without impacting its purpose, that's completely fine - some will be easier or harder than others! Feel free to skip it and move on to something new. 🙂
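
For the "built-in tools instead of for-loops" idea above, the sketch below compares a pixel-by-pixel loop with the equivalent whole-array calculation in `numpy`. The band arrays are synthetic random data and the normalised difference index is just an illustrative calculation, not code from a specific notebook. The same whole-array pattern generally applies to `xarray` objects as well.

```python
import numpy as np

# Small synthetic "red" and "nir" bands standing in for loaded satellite data
rng = np.random.default_rng(seed=0)
red = rng.uniform(0.01, 0.3, size=(50, 50))
nir = rng.uniform(0.2, 0.6, size=(50, 50))


def ndvi_loop(red, nir):
    """Normalised difference computed one pixel at a time (slow)."""
    out = np.empty_like(red)
    for i in range(red.shape[0]):
        for j in range(red.shape[1]):
            out[i, j] = (nir[i, j] - red[i, j]) / (nir[i, j] + red[i, j])
    return out


def ndvi_vectorised(red, nir):
    """Same calculation applied to the whole arrays at once (fast)."""
    return (nir - red) / (nir + red)


# Both approaches should give the same result; only the run time differs
same = np.allclose(ndvi_loop(red, nir), ndvi_vectorised(red, nir))
print(f"Loop and vectorised results match: {same}")
```

When replacing a loop like this, confirm that the outputs still match on the notebook's real data before committing the change.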
Technical guide to editing notebooks
If it is your first time using the DEA Sandbox, follow this guide to register: https://knowledge.dea.ga.gov.au/guides/setup/Sandbox/sandbox/
If it is your first time editing a notebook, follow this guide to set up DEA Notebooks with Git: https://github.com/GeoscienceAustralia/dea-notebooks/wiki/Edit-a-DEA-Notebook
More details about using the DEA Sandbox and DEA Notebooks are available here: