Waleed Alzuhair, flickr

Big Geodata Newsletter, April 2024

Become a high-skilled geospatial professional

Greetings from the Big Geodata Newsletter!

In this issue you will find information on PMTiles - a novel archive format for tiled data visualization, DiffusionSAT - a large generative model trained on high-resolution remote sensing datasets, the One Billion Row Challenge, and a MOOC on Cubes & Clouds.

After more than 3 years of operation, we are curious how our Geospatial Computing Platform has performed and how it can be further improved. Join us for the Geospatial Computing Platform Users Meeting on 12 June 2024 to share your experience, exchange your ideas, and voice needs!

Dr. Yijian Zeng, Prof. Dr. Bob Su, Ir. Bas Retsios, Dr. Yunfei Wang from the Department of Water Resources at ITC along with Dr. Sarah Alidoost, Dr. ir. Bart Schilperoort from the Netherlands eScience Center share their experience in using our Geospatial Computing Platform in building an Open Digital Twin of Soil-Plant System following Open Science. Don't miss their Big Geodata Story!

Happy reading! 

You can access the previous issues of the newsletter on our web portal. If you find the newsletter useful, please share the subscription link below with your network.

PMTiles Unleashes Effortless Visualization

Image credits: Protomaps, 2024

PMTiles is a novel single-file archive format tailored for tiled data visualization. It is designed to be a cloud-native file format used directly from a client over a network via HTTP range requests, without having a server in the middle. Functioning akin to a ZIP file, PMTiles encapsulates numerous individual files within a singular entity, which facilitates data management and reduces the complexity of handling numerous small files. This streamlined approach enhances the accessibility and usability of tiled data, empowering users to seamlessly integrate it into their mapping projects. While it is most employed with vector data, utilizing the Mapbox Vector Tile (MVT) encoding, PMTiles offers versatility across different data formats including raster and terrain mesh data.

Read more on internal structure and compression of PMTiles by this guide from Cloud Native Geospatial Foundation. Also, try out the amazing efficiency and speed of PMTiles on OpenStreetMap base layer using PMTiles Viewer.

Cubes & Clouds: MOOC for Cloud Native Open Data Sciences in EO

Image credits: EO College, 2024

With the growing volume and velocity of satellite data, we are constantly adapting our workflows to process the available EO data. Traditional methods of downloading the data are no longer appropriate in the world of cloud computing and open science practices. The massive open online course (MOOC) Cubes & Clouds can help to keep up with these changes and the learn the new concepts and workflow of processing such data. The MOOC introduces cloud platforms, data cubes, open science and gives emphasis on sharing results in a FAIR manner. The hands-on exercises are provided on the Copernicus Data Space Ecosystem cloud platform and leverage data through the STAC Catalog and openEO API. The final exercise is designed to be collaborative where each student creates a snow-cover map of a small region of the Alps and submits it to a STAC catalog.

You can access the Cubes & Clouds MOOC and many other lessons on the EO College E-learning Platform (requires an account login) or on the Zenodo open repository. If you want to learn more on STAC catalogs, don't miss our training workshop on 10 April!

Process One Billion Rows in Less Than 2 seconds!

Image credits: Dale McDiarmid, 2024

2024 kicked-off with quite an interesting challenge: One Billion Row Challenge (1BRC). The challenge required to calculate using Java simple stats like min, mean, and max from a billion rows of data representing weather stations and order the results. Primarily aimed to highlight recent features and optimization techniques in Java, the challenge blew-up on social media and received over 500 pull-requests in GitHub! While the Java implementations brought the run time down to just 1.5 s on eight cores of a Hetzner AX161 dedicated server with 128 GB RAM, many others tried this also with C++, Python, DuckDB, PostgreSQL and so on! A fast Python implementation example uses multiprocessing with good old dictionaries and lists instead of numpy or pandas. This was taken a step further with a Dask implementation using Dask Expressions that ran in 32.8 s on an M1 Mac (8-core CPU, 16 GB memory). An extreme end of this was an implementation in Python with Dask and cuDF run on GPUs in 4.5s!

Assessing the enthusiasm around this, Coiled recently opened a 'One Trillion Row Challenge' that aims to use Dask + Coiled for testing the limits of optimization for processing on the Cloud. Follow the link to participate or read more about the challenge!

Upcoming Meetings

Recent Releases

The "Big" Picture

Image credits: Khanna et. al., 2023

In the realm of AI, the evolution of diffusion models has been nothing short of groundbreaking. However, one frontier has remained largely unexplored: remote sensing data. That is, until now. Developed as the largest generative foundation model explicitly trained on high-resolution remote sensing datasets, DiffusionSat marks a significant leap forward. What sets DiffusionSat apart is its ability to understand and generate realistic samples while considering the spatio-temporal nature of remote sensing data. Unlike conventional approaches that rely solely on textual or visual cues, DiffusionSat incorporates critical metadata such as location for enhanced context and accuracy. The model facilitates temporal generation and enables super-resolution and in-painting, opening doors to a myriad of applications. Moreover, DiffusionSat surpasses previous state-of-the-art methods for satellite image generation. 

Khanna, S., Liu, P., Zhou, L., Meng, C., Rombach, R., Burke, M., Lobell, D., & Ermon, S. (2023). DiffusionSat: A Generative Foundation Model for Satellite Imagery. ArXiv. /abs/2312.03606. doi:10.48550/arXiv.2312.03606

CRIB Training Workshops

The past two months have been progressive and fruitful regarding our outreach and training activities. We gathered over 95 registrations for the Introduction to Raster and Vector data with Python and Introduction to Docker training workshops. The first workshop focused on utilizing Python and geospatial libraries for manipulating raster and vector data and, the second one, elaborated on how Docker in modern-day wields benefits in reproducibility in research workflows along with its flaws through showcasing all the crucial features and background information.

For anyone who missed these, there are more workshops and Big Geodata Talks upcoming, starting with the training workshop on Publishing SpatioTemporal Data with STAC on 10 April, Big Geodata Talk on High-Performance Data Management with Weka.io on 24 April, training workshop on Introduction to Raster and Vector data with R on the 15-16 May and more. Stay tuned for more updates!