Deep learning (DL) on satellite imagery - alabarga/geospatial-python GitHub Wiki

Introduction

This document lists resources for performing deep learning (DL) on satellite imagery. To a lesser extent, classical machine learning (ML, e.g. random forests) is also discussed, as are classical image processing techniques.

Top links

Datasets

WorldView

Sentinel

Landsat

Spacenet

Planet

DEM (digital elevation models)

  • Shuttle Radar Topography Mission: data - open access
  • Copernicus Digital Elevation Model (DEM) on S3 -> represents the surface of the Earth including buildings, infrastructure and vegetation. Data is provided as Cloud Optimized GeoTIFFs.

Kaggle

Kaggle hosts over 60 satellite image datasets, search results here. The kaggle blog is an interesting read.

Kaggle - Amazon from space - classification challenge

Kaggle - DSTL - segmentation challenge

Kaggle - Airbus Ship Detection Challenge

Kaggle - Draper - place images in order of time

Kaggle - Deepsat - classification challenge

Not satellite but airborne imagery. Each image patch is size-normalized to 28x28 pixels and consists of 4 bands: red, green, blue and near infrared. The training and test labels are one-hot encoded 1x6 vectors. Data is in .mat (Matlab) format. JPEG?

  • Imagery source
  • Sat4 500,000 image patches covering four broad land cover classes - barren land, trees, grassland and a class that consists of all land cover classes other than the above three
  • Sat6 405,000 image patches each of size 28x28 and covering 6 landcover classes - barren land, trees, grassland, roads, buildings and water bodies.
  • Deep Gradient Boosted Learning article
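The one-hot labels described above can be turned back into class names with an argmax. This is a minimal sketch, assuming the labels have already been loaded from the .mat file into a NumPy array of shape (n_samples, 6). The column order in CLASS_NAMES is an assumption made for illustration and should be checked against the dataset's own annotation file.

```python
import numpy as np

# Hypothetical column order; verify against the dataset's annotation file.
CLASS_NAMES = ["barren land", "trees", "grassland", "roads", "buildings", "water bodies"]

def decode_one_hot(labels: np.ndarray) -> list[str]:
    """Map each one-hot row to its class name."""
    indices = np.argmax(labels, axis=1)
    return [CLASS_NAMES[i] for i in indices]

labels = np.array([[0, 1, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0, 1]])
decoded = decode_one_hot(labels)
```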

Kaggle - Understanding Clouds from Satellite Images

In this challenge, you will build a model to classify cloud organization patterns from satellite images.

Kaggle - miscellaneous

Alternative datasets

There are a variety of datasets suitable for land classification problems.

Tensorflow datasets

  • There are a number of remote sensing datasets
  • resisc45 - RESISC45 dataset is a publicly available benchmark for Remote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU). This dataset contains 31,500 images, covering 45 scene classes with 700 images in each class.
  • eurosat - EuroSAT dataset is based on Sentinel-2 satellite images covering 13 spectral bands and consisting of 10 classes with 27000 labeled and geo-referenced samples.
  • bigearthnet - The BigEarthNet is a new large-scale Sentinel-2 benchmark archive, consisting of 590,326 Sentinel-2 image patches. The image patch size on the ground is 1.2 x 1.2 km with variable image size depending on the channel resolution. This is a multi-label dataset with 43 imbalanced labels.
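The "variable image size" for BigEarthNet follows from dividing the 1.2 km ground footprint by each band's ground sampling distance. A quick check, assuming Sentinel-2's 10 m, 20 m and 60 m band resolutions:

```python
PATCH_SIZE_M = 1200  # 1.2 km patch edge on the ground

def patch_pixels(resolution_m: int) -> int:
    """Number of pixels along one patch edge for a band at the given resolution."""
    return PATCH_SIZE_M // resolution_m

sizes = {res: patch_pixels(res) for res in (10, 20, 60)}
```

So a single patch is stored at three different pixel sizes, and the bands usually need resampling to a common grid before they can be stacked into one array.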

UCMerced

AWS datasets

Microsoft

  • USBuildingFootprints -> computer generated building footprints in all 50 US states, GeoJSON format, generated using semantic segmentation

Quilt

  • Several people have uploaded datasets to Quilt

Google Earth Engine

Weather Datasets

UAV & Drone datasets

Synthetic data

Interesting deep learning projects

Raster Vision by Azavea

RoboSat

  • https://github.com/mapbox/robosat
  • Semantic segmentation on aerial and satellite imagery. Extracts features such as: buildings, parking lots, roads, water, clouds
  • robosat-jupyter-notebook -> walks through all of the steps in an excellent blog post on the Robosat feature extraction and machine learning pipeline.
  • Note there is/was a fork of RoboSat, originally named RoboSat.pink and subsequently neat-EO.pink, although this appears to be dead/archived

DeepOSM

DeepNetsForEO - segmentation

Skynet-data

Techniques

This section explores the different techniques (DL, ML & classical) people are applying to common problems in satellite imagery analysis. Classification problems are the simplest to address with DL, object detection is harder, and cloud detection is harder still (and of niche interest). Note that almost all aerial imagery data on the internet is in RGB format, and techniques designed for this 3-band imagery may fail or need significant adaptation to work with multiband data (e.g. 13-band Sentinel 2).
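One simple adaptation is to pull three bands out of a multiband array so that RGB-only tooling can be used. The sketch below assumes a channels-first array holding 13 Sentinel-2 bands ordered B01, B02, B03, B04, and so on, so that red, green and blue sit at indices 3, 2 and 1. Both that ordering and the max_value scaling ceiling are assumptions to check against your own data.

```python
import numpy as np

# Assumed band order B01, B02, B03, B04, ...: B04 = red, B03 = green, B02 = blue.
RGB_INDICES = (3, 2, 1)

def to_rgb(image: np.ndarray, max_value: float = 3000.0) -> np.ndarray:
    """Convert a (bands, height, width) array to an 8-bit (height, width, 3) RGB image."""
    rgb = image[list(RGB_INDICES)].astype(np.float32)
    rgb = np.clip(rgb / max_value, 0.0, 1.0)  # scale to 0..1, clipping bright pixels
    return (rgb * 255).astype(np.uint8).transpose(1, 2, 0)

multiband = np.random.default_rng(0).integers(0, 4000, size=(13, 64, 64))
rgb = to_rgb(multiband)
```

This discards the other ten bands, which is exactly the information a multiband-aware model would exploit.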

Land classification

Assign a label to an image, e.g. this is an image of a forest.

Semantic segmentation

Whilst classification assigns a label to a whole image, semantic segmentation assigns a label to each pixel.
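The difference shows up directly in the shape of the output. In this sketch random scores stand in for real model output; the class count and image size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, height, width = 4, 8, 8

# Classification: one score per class for the whole image -> a single label.
image_scores = rng.random(n_classes)
image_label = int(np.argmax(image_scores))

# Semantic segmentation: one score per class per pixel -> a label map.
pixel_scores = rng.random((n_classes, height, width))
label_map = np.argmax(pixel_scores, axis=0)
```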

Change detection

Monitor water levels, coast lines, size of urban areas, wildfire damage. Note, clouds change often too..!
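A simple classical baseline is image differencing: subtract two co-registered images of the same area and threshold the result. The sketch below assumes single-band images already scaled to 0..1, and the threshold value is purely illustrative.

```python
import numpy as np

def change_mask(before: np.ndarray, after: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """Flag pixels whose absolute difference exceeds the threshold."""
    diff = np.abs(after.astype(np.float32) - before.astype(np.float32))
    return diff > threshold

before = np.zeros((5, 5), dtype=np.float32)
after = before.copy()
after[1:3, 1:3] = 1.0  # simulate a 2x2 patch that changed
mask = change_mask(before, after)
```

Differencing flags clouds, shadows and seasonal variation as change just as readily as real change, which is why those are usually masked out first.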

Image registration

Image registration is the process of transforming different sets of data into one coordinate system. Typical use is overlapping images taken at different times or with different cameras.
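For the simplest case, where two images differ only by a whole-pixel translation, the offset can be estimated with phase correlation. This is a sketch of that one technique using NumPy's FFT. It does not handle rotation, scale or sub-pixel shifts, which real registration pipelines generally need.

```python
import numpy as np

def estimate_shift(reference: np.ndarray, moved: np.ndarray) -> tuple[int, int]:
    """Estimate the (row, col) shift that maps `reference` onto `moved`."""
    cross_power = np.conj(np.fft.fft2(reference)) * np.fft.fft2(moved)
    cross_power /= np.abs(cross_power) + 1e-12  # keep phase only
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Shifts past the half-way point wrap around to negative values.
    return tuple(int(p - s) if p > s // 2 else int(p)
                 for p, s in zip(peak, correlation.shape))

rng = np.random.default_rng(0)
reference = rng.random((64, 64))
moved = np.roll(reference, shift=(5, -3), axis=(0, 1))
shift = estimate_shift(reference, moved)
```

Applying np.roll with the negated shift to the moved image would bring it back into alignment with the reference.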

Object detection

A good introduction to the challenge of performing object detection on aerial imagery is given in this paper. In summary, images are large and objects may comprise only a few pixels, easily confused with random features in background. An example task is detecting boats on the ocean, which should be simpler than land based detection owing to the relatively blank background in images, but is still challenging.
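One common workaround for the image-size problem (a general practice, not something taken from the paper above) is to cut the scene into overlapping tiles and run the detector on each tile. The sketch below only generates the tile windows; the tile size and overlap are hypothetical defaults.

```python
def tile_windows(width: int, height: int, tile: int = 512, overlap: int = 64):
    """Yield (x, y, w, h) windows that cover the image with the given overlap."""
    step = tile - overlap
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            # Windows on the right and bottom edges are clipped to the image.
            yield x, y, min(tile, width - x), min(tile, height - y)

windows = list(tile_windows(width=1000, height=600))
```

The overlap is there so that an object straddling a tile boundary appears whole in at least one tile. Detections from neighbouring tiles then need de-duplicating.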

Cloud detection

A subset of the object detection problem, but surprisingly challenging

  • From this article on sentinelhub there are three popular classical algorithms that apply thresholds in multiple bands in order to identify clouds. In the same article they propose using semantic segmentation combined with a CNN for a cloud classifier (excellent review paper here), but state that this requires too many compute resources.
  • This article compares a number of ML algorithms: random forests, stochastic gradient descent, support vector machines and a Bayesian method.
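To give a flavour of the classical threshold approach, here is a toy rule that marks a pixel as cloud when it is bright in both a blue and a near-infrared band. It is not one of the three algorithms mentioned above, and the thresholds are hypothetical values chosen only for illustration.

```python
import numpy as np

def simple_cloud_mask(blue, nir, blue_min=0.3, nir_min=0.3):
    """Mark pixels that are bright in both bands as cloud (toy rule, illustrative thresholds)."""
    return (blue > blue_min) & (nir > nir_min)

# Reflectance values in 0..1 for a 2x2 scene.
blue = np.array([[0.05, 0.45], [0.35, 0.10]])
nir = np.array([[0.40, 0.50], [0.20, 0.08]])
cloud = simple_cloud_mask(blue, nir)
```

Only the top-right pixel passes both thresholds. Snow and bright bare ground would pass this rule too, which is one reason cloud detection is harder than it looks.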

Wealth and economic activity measurement

The goal is to predict economic activity from satellite imagery rather than conducting labour intensive ground surveys

Super resolution

Super-resolution imaging is a class of techniques that enhance the resolution of an imaging system. Very hot topic of research.

Pansharpening

Image fusion of low res multispectral with high res pan band.
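One of the simplest fusion schemes is a Brovey-style ratio: upsample the multispectral bands to the pan grid, then rescale each band so that the per-pixel mean intensity matches the pan band. The sketch below assumes the pan image is exactly `scale` times larger than the multispectral image, and it uses nearest-neighbour upsampling for brevity.

```python
import numpy as np

def brovey_pansharpen(ms: np.ndarray, pan: np.ndarray, scale: int) -> np.ndarray:
    """Sharpen a (bands, h, w) multispectral image with a (h*scale, w*scale) pan band."""
    # Nearest-neighbour upsampling of the multispectral bands to the pan grid.
    ms_up = ms.repeat(scale, axis=1).repeat(scale, axis=2).astype(np.float32)
    intensity = ms_up.mean(axis=0)
    # Rescale each band so the per-pixel mean intensity matches the pan band.
    return ms_up * pan / (intensity + 1e-6)

ms = np.full((3, 2, 2), 0.2, dtype=np.float32)
pan = np.full((4, 4), 0.4, dtype=np.float32)
sharpened = brovey_pansharpen(ms, pan, scale=2)
```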

Stereo imaging for terrain mapping & DEMs

Measure surface contours.

Lidar

NDVI - vegetation index
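NDVI (normalized difference vegetation index) is computed per pixel as (NIR - Red) / (NIR + Red), giving values between -1 and 1 for non-negative reflectances, with dense vegetation towards the high end. A minimal NumPy sketch, assuming the two bands are already loaded as arrays of equal shape:

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Compute NDVI = (NIR - Red) / (NIR + Red), returning 0 where the denominator is 0."""
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    denom = nir + red
    out = np.zeros_like(denom)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

nir = np.array([[0.6, 0.0], [0.3, 0.2]])
red = np.array([[0.2, 0.0], [0.3, 0.4]])
index = ndvi(nir, red)
```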

SAR

  • Removing speckle noise from Sentinel-1 SAR using a CNN
  • A dataset which is specifically made for deep learning on SAR and optical imagery is the SEN1-2 dataset, which contains corresponding patch pairs of Sentinel 1 (VV) and 2 (RGB) data. It is the largest manually curated dataset of S1 and S2 products, with corresponding labels for land use/land cover mapping, SAR-optical fusion, segmentation and classification tasks. Data: https://mediatum.ub.tum.de/1474000
  • so2sat on Tensorflow datasets - So2Sat LCZ42 is a dataset consisting of co-registered synthetic aperture radar and multispectral optical image patches acquired by the Sentinel-1 and Sentinel-2 remote sensing satellites, and the corresponding local climate zones (LCZ) label. The dataset is distributed over 42 cities across different continents and cultural regions of the world.

Image formats, data management and catalogues

Cloud Optimised GeoTiff (COG)

STAC - SpatioTemporal Asset Catalog specification

The STAC specification provides a common metadata specification, API, and catalog format to describe geospatial assets, so they can be more easily indexed and discovered.
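In STAC, each scene is described by an Item, which is a GeoJSON Feature with a handful of extra fields. The sketch below builds a minimal Item as a Python dict; the id, coordinates, datetime and asset URL are all hypothetical placeholders.

```python
import json

# All values below are hypothetical placeholders for illustration.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-001",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[10.0, 45.0], [11.0, 45.0], [11.0, 46.0],
                         [10.0, 46.0], [10.0, 45.0]]],
    },
    "bbox": [10.0, 45.0, 11.0, 46.0],
    "properties": {"datetime": "2021-06-01T10:30:00Z"},
    "links": [],
    "assets": {
        "visual": {
            "href": "https://example.com/example-scene-001.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
}

item_json = json.dumps(item, indent=2)
```

Because the Item is plain GeoJSON plus metadata, ordinary GeoJSON tooling can index it, and the assets entry points at the actual imagery.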

State of the art

What are companies doing?

Online platforms for Geo analysis

  • This article discusses some of the available platforms
  • Pangeo -> There is no single software package called “pangeo”; rather, the Pangeo project serves as a coordination point between scientists, software, and computing infrastructure. Includes open source resources for parallel processing using Dask and Xarray
  • Airbus Sandbox -> will provide access to imagery
  • Descartes Labs -> access to EO imagery from a variety of providers via python API
  • DigitalGlobe have a cloud hosted Jupyter notebook platform called GBDX. Cloud hosting means they can guarantee the infrastructure supports their algorithms, and they appear to be close/closer to deploying DL. Tutorial notebooks here
  • Planet have a Jupyter notebook platform which can be deployed locally.
  • jupyteo.com -> hosted Jupyter environment with many features for working with EO data
  • eurodatacube.com -> data & platform for EO analytics in Jupyter env, paid

Free online computing resources

Generally a GPU is required for DL, and this section lists a couple of free Jupyter environments with GPU available. There is a good overview of online Jupyter development environments on the fast.ai site. I personally use Colab with data hosted on Google Drive.

Google Colab

  • Colaboratory notebooks with GPU as a backend for free for 12 hours at a time. Note that the GPU may be shared with other users, so if you aren't getting good performance try reloading.
  • Also a pro tier for $10 a month -> https://colab.research.google.com/signup
  • Tensorflow and pytorch can be installed

Kaggle - also Google!

  • Free to use
  • GPU Kernels - may run for 1 hour
  • Tensorflow, pytorch & fast.ai available
  • Advantage that many datasets are already available

Paperspace

Production

Once you have a trained model, how do you expose it to the internet and other services? Usually through a REST API. This section lists a number of training and hosting options. For an overview of this topic check out Practical-Deep-Learning-on-the-Cloud.

Custom REST API

A conceptually simple and scalable approach to serving up deep learning model inference code is to wrap it in a REST API that is implemented in Python (typically using flask or FastAPI) and deploy it to a lambda function.
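The core of such a service is a small handler that parses the request, runs inference and returns JSON. The sketch below is framework-free and assumes the request arrives as an API Gateway-style proxy event, with the payload as a JSON string under "body". The predict function and the "pixels" field are hypothetical stand-ins for a real model and its input. With flask or FastAPI the same logic would sit inside a route function.

```python
import json

def predict(pixels):
    """Hypothetical stand-in for model inference: returns the mean pixel value."""
    return sum(pixels) / len(pixels)

def lambda_handler(event, context):
    """Parse a JSON request body, run inference, and return a JSON response."""
    try:
        pixels = json.loads(event["body"])["pixels"]
        result = predict(pixels)
    except (KeyError, TypeError, ValueError, ZeroDivisionError):
        error = {"error": "expected JSON with a non-empty 'pixels' list"}
        return {"statusCode": 400, "body": json.dumps(error)}
    return {"statusCode": 200, "body": json.dumps({"prediction": result})}

response = lambda_handler({"body": json.dumps({"pixels": [0.2, 0.4, 0.6]})}, None)
bad_response = lambda_handler({"body": "{}"}, None)
```

In a real deployment the model would be loaded once at module import time rather than inside the handler, so that warm invocations can reuse it.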

Tensorflow Serving

  • https://www.tensorflow.org/serving/
  • TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. Multiple models, or indeed multiple versions of the same model, can be served simultaneously. TensorFlow Serving comes with a scheduler that groups individual inference requests into batches for joint execution on a GPU
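From the client side, a model hosted by TensorFlow Serving is queried over its REST API by posting a JSON body with an "instances" list to the model's predict endpoint. The sketch below builds such a request without sending it. The model name, host and input values are hypothetical, and 8501 is assumed as the REST port.

```python
import json
import urllib.request

def build_predict_request(instances, model_name="my_model", host="localhost", port=8501):
    """Build (but do not send) a POST request for TensorFlow Serving's REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

request = build_predict_request([[0.1, 0.2, 0.3]])
# To send it against a running server: urllib.request.urlopen(request)
```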

Pytorch serve

AWS

Paperspace gradient

chip-n-scale-queue-arranger by developmentseed

Useful paid software

Useful open source software

A note on licensing: The two general types of licenses for open source are copyleft and permissive. Copyleft requires that subsequent derived software products also carry the license forward, e.g. the GNU General Public License (GNU GPLv3). Permissive licenses, e.g. MIT & Apache 2, leave you freer to modify and use the code as you please. Check out choosealicense.com/

Python low level numerical & data manipulation

  • Dask works with your favorite PyData libraries to provide performance at scale for the tools you love -> check out Read and manipulate tiled GeoTIFF datasets and accelerating-science-dask. Coiled is a managed Dask service.
  • Rasterio -> reads and writes GeoTIFF and other raster formats and provides a Python API based on Numpy N-dimensional arrays and GeoJSON.
  • xarray -> N-D labeled arrays and datasets. Read Handling multi-temporal satellite images with Xarray
  • xarray-spatial -> Fast, Accurate Python library for Raster Operations. Implements algorithms using Numba and Dask, free of GDAL
  • Geowombat -> geo-utilities applied to air- and space-borne imagery, uses Rasterio, Xarray and Dask for I/O and distributed computing with named coordinates
  • NumpyTiles -> a specification for providing multiband full-bit depth raster data in the browser
  • Zarr -> Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. Zarr depends on NumPy

Python general utilities

  • gcsfs for Google Cloud Storage file-system -> Pythonic file-system interface for Google Cloud Storage
  • satpy - a python library for reading and manipulating meteorological remote sensing data and writing it to various image and data file formats
  • Pyviz examples include several interesting geospatial visualisations
  • geemap: A Python package for interactive mapping with Google Earth Engine, ipyleaflet, and ipywidgets. See the Landsat timelapse example
  • rio-color -> Color correction plugin for Rasterio
  • WaterDetect -> an end-to-end algorithm to generate open water cover mask, specially conceived for L2A Sentinel 2 imagery. It can also be used for Landsat 8 images and for other multispectral clustering/segmentation tasks.
  • DeepHyperX -> A Python/pytorch tool to perform deep learning experiments on various hyperspectral datasets.
  • landsat_ingestor -> Scripts and other artifacts for landsat data ingestion into Amazon public hosting
  • PyShp -> The Python Shapefile Library (PyShp) reads and writes ESRI Shapefiles in pure Python
  • s2p -> a Python library and command line tool that implements a stereo pipeline which produces elevation models from images taken by high resolution optical satellites such as Pléiades, WorldView, QuickBird, Spot or Ikonos
  • TorchSat is an open-source deep learning framework for satellite imagery analysis based on PyTorch.
  • torchvision-enhance -> Enhance PyTorch vision for semantic segmentation, multi-channel images and TIF file,...
  • felicette -> Satellite imagery for dummies. Generate JPEG earth imagery from coordinates/location name with publicly available satellite data.
  • napari -> napari is a fast, interactive, multi-dimensional image viewer for Python. It’s designed for browsing, annotating, and analyzing large multi-dimensional images. By integrating closely with the Python ecosystem, napari can be easily coupled to leading machine learning and image analysis tools. Example viewing Landsat-8 imagery. Note that to view a 3GB COG I had to install the napari-tifffile-reader plugin.
  • EarthPy -> A set of helper functions to make working with spatial data in open source tools easier. Read Exploratory Data Analysis (EDA) on Satellite Imagery Using EarthPy
  • detectree -> Tree detection from aerial imagery
  • pylandstats -> compute landscape metrics

Tools for image annotation

If you are performing object detection you will need to annotate images with bounding boxes. Check that your annotation tool of choice supports large image (likely geotiff) files, as not all will. Note that GeoJSON is widely used by remote sensing researchers but this annotation format is not commonly supported in general computer vision frameworks, and in practice you may have to convert the annotation format to use the data with your chosen framework. There are both closed and open source tools for creating and converting annotation formats.
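As an illustration of the conversion step, the sketch below turns a GeoJSON Polygon in map coordinates into a COCO-style [x, y, width, height] box in pixel coordinates. It assumes a north-up image in a projected CRS with square pixels and 2D coordinates. The function name, origin and pixel size are hypothetical, and in practice the transform would come from the image's georeferencing metadata.

```python
def polygon_to_pixel_bbox(geometry, x_origin, y_origin, pixel_size):
    """Convert a GeoJSON Polygon to a COCO-style [x, y, width, height] box in pixels."""
    ring = geometry["coordinates"][0]  # exterior ring
    cols = [(x - x_origin) / pixel_size for x, _ in ring]
    rows = [(y_origin - y) / pixel_size for _, y in ring]  # image rows grow downwards
    return [min(cols), min(rows), max(cols) - min(cols), max(rows) - min(rows)]

geometry = {
    "type": "Polygon",
    "coordinates": [[[500100.0, 4000900.0], [500200.0, 4000900.0],
                     [500200.0, 4000800.0], [500100.0, 4000800.0],
                     [500100.0, 4000900.0]]],
}
bbox = polygon_to_pixel_bbox(geometry, x_origin=500000.0, y_origin=4001000.0, pixel_size=10.0)
```

Here a 100 m square located 100 m in from the top-left corner of a 10 m/pixel image becomes a 10x10 pixel box at column 10, row 10.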

Movers and shakers

  • Adam Van Etten is doing interesting things in object detection and segmentation
  • Andrew Cutts cohosts the Scene From Above podcast and has many interesting repos
  • Ankit Kariryaa published a recent nature paper on tree detection
  • Chris Holmes is doing great things at Planet
  • Christoph Rieke maintains a very popular imagery repo and has published his thesis on segmentation
  • Jake Shermeyer has many interesting repos
  • Nicholas Murray is an Australia-based scientist with a focus on delivering the science necessary to inform large scale environmental management and conservation
  • Qiusheng Wu is an Assistant Professor in the Department of Geography at the University of Tennessee
  • Robin Wilson is a former academic who is very active in the satellite imagery space

Courses

Online communities

Companies

  • awesome-geospatial-companies -> List of 500+ geospatial companies by Christoph Rieke
  • Dymaxion Analytics -> a machine learning API for developing bespoke object detection models for satellite and drone imagery.
  • Element84 -> consultancy
  • CosmiQ Works -> an IQT Lab focused on developing, prototyping, and evaluating emerging open source artificial intelligence capabilities for geospatial use cases.

Jobs

Neural nets in space

Processing on the satellite allows less data to be downlinked. E.g. a super-resolution image might take 4-8 images to generate, after which only a single image is downlinked.

About the author

My background is optical physics, and I have a PhD from Cambridge on the topic of Plasmon enhanced Raman spectroscopy. After doing a post doc I left academia and took a variety of roles, from industrial research at Sharp Labs Europe, to medical physics, to building optical telescopes at Surrey Satellites (SSTL). It was whilst at SSTL that I started this repo as a personal resource. I left SSTL, actually was made redundant along with 30% of the company, and after a brief stint at an IOT start up, I now work as a data engineer. Deep learning is currently a hobby, but I have ambitions to move into this domain when the right opportunity presents itself. My own satellite imagery projects are here, and feel free to connect with me on Twitter & LinkedIn

LinkedIn: robmarkcole