Waleed Alzuhair, flickr

Big Geodata Newsletter, March 2024

Become a high-skilled geospatial professional

Greetings from the Big Geodata Newsletter!

In this issue you will find information on utilizing GDAL with AWS EMR-Serverless, GridMesa: an adaptive grid approximation model to handle large spatial data, and on creating EO data cubes with Cubo and XEE. 

After more than 3 years of operation and 500,000+ multi-core CPU hours provided for academic work, we are happy to invite you our first Geospatial Computing Platform Users Meeting on 12 June, 2024 to share experience, exchange your ideas, and voice needs! Find more information about it below and sign-up early to participate.

Dr. Rosa AguilarDr. Ellen-Wien Augustin and Prof. Dr. Raul Zurita-Milla from the Department of Geo-Information Processing at ITC share their experience in using our Geospatial Computing Platform to model pollen and pollen allergies in the Netherlands. Don't miss their Big Geodata Story! 

Happy reading! 

You can access the previous issues of the newsletter on our web portal. If you find the newsletter useful, please share the subscription link below with your network.

Using GDAL with EMR-serverless for large-scale geodata processing

Image credits: AWS, 2024 

The data processing landscape has undergone a dramatic transformation, marking a departure from the era of supercomputers to a new age of scalability and accessibility through cloud computing. Initially constrained by expensive hardware, innovations like MapReduce paved the way for distributed computing. Apache Spark further accelerated this progress by capitalizing on the expanding availability of RAM. AWS's EMR was considered as a game-changer by many, streamlining the deployment of large-scale data workloads. Despite its benefits, deploying EMR requires expertise and resource management. Enter EMR-Serverless, a paradigm shift eliminating these hurdles. With custom execution environments and automated scaling, it heralds a future where cutting-edge technologies are both accessible and economically feasible, revolutionizing data processing as we know it. 

Read more on utilizing Geospatial Data Abstraction Library (GDAL) with AWS EMR-Serverless for extensive data processing that involves leveraging the power of GDAL within the context of AWS EMR-Serverless for handling massive datasets efficiently.

Creating EO Data Cubes with ease!

Image credits: Cubo, 2024

The SpatioTemporal Asset Catalog (STAC) specifications provide standardized format to describe geospatial datasets. This format is adopted by many data providers to expose data to clients in a consistent manner. Users can access any data hosted in this format from different satellite missions, instruments, or collections with simply using the STAC API. Cubo is a python package that lets users of STAC easily create Earth System Data Cubes (ESDCs) for analysis on long term data. The cubo.create() function requires minimal inputs such as start/end date, edge size and central pixel location to create an Xarray object of the specified STAC data collection. The function also allows users to specify the resolution in meters for the ESDC. This cube as an xarray object can easily be used for further analysis tasks, including deep learning.

The documentation of the package provides interesting examples to use cubo. The package can also be used to access datasets from the Google Earth Engine catalogue, where the dataset is retrieved via XEE. Read more about it below!

XEE: Using Xarray with Google Earth Engine

Image credits: Medium, 2024

With over 90+ petabytes of analysis-ready data in the GEE catalogue, the platform has launched a few new tools for easier processing of geospatial workflows. XEE is a new python tool that integrates Google Earth Engine (GEE) with Xarray, a popular open-source python package for processing large multidimensional arrays. XEE allows users to work on Earth Engine ImageCollections as Xarray Datasets. This is especially helpful when processing large spatiotemporal EO or climate datasets as the xarray integration uses Dask under the hood for computing across multiple processors. The package also allows users to export data into the cloud optimised Zarr format. However as with the limitations of GEE, using the XEE tool for large datasets requires users to have a GEE account with a high volume endpoint

An article by Ali Ahmadalipour gives insight into using XEE for processing large climate data. GEE also recently introduced new data extraction methods that allow users to export data programmatically in a variety of formats.

Upcoming Meetings

Recent Releases

The "Big" Picture

Image credits: Yang et al., 2024

In response to the pressing need for managing vast spatial data sets, numerous spatial data management systems have emerged, often built on distributed NoSQL databases. However, existing systems typically use coarse Minimum Bounding Rectangles (MBRs) to approximate spatial objects, leading to inefficiencies in spatial operations and retrieval processes due to mismatches and varied index levels. GridMesa, a new grid-based management system designed to handle massive spatial data more effectively addresses these challenges. It features an adaptive grid-based approximation model called Grid-AM, which optimizes spatial computations by balancing filtering and refinement costs. Additionally, GridMesa utilizes a NoSQL-based storage scheme integrated with HBase, incorporating a novel secondary index to enhance query window quality. With efficient query algorithms supporting various data retrieval tasks, such as range and kNN queries, GridMesa demonstrates superior performance compared to popular NoSQL-based systems like GeoMesa and GeoWave, as evidenced by extensive experiments conducted with real-world datasets. 

Yang, X., Guan, X., Pang, Z., Kui, X., and Wu, H. (2024). GridMesa: A NoSQL-based big spatial data management system with an adaptive grid approximation model. Future Generation Computer Systems, 155, 324–339. doi:10.1016/j.future.2024.02.010 

Platform Users Meeting in June!

The Geospatial Computing Platform is ITC’s state-of-the-art cloud-based interactive computing platform provided to its staff, students, alumni, and partner organisations. Developed and operated by The Center of Expertise on Big Geodata Science (CRIB), it is available since January 2021 and currently serves more than 500 users for a wide range of use cases, such as research projects, thesis studies, course practicals, and even personal projects.

After more than 3 years of operation, we are curious how the platform has performed and how it can be further improved. We would like to hear about your user stories and ideas! How did you utilize the platform? Was it useful? Was it sufficient? What were the difficulties you encountered? How did you manage to overcome these difficulties? What else do you need - more computing resources, additional software and services?

Join us on 12 June 2024 to share experiences, exchange ideas, and voice needs!