Ask HN: Best practices for organizing geospatial data from different sources?

12 points · chasely · 5y ago · 7 comments

I am working on a project for fun that uses GeoTIFF, NetCDF, GeoJSON, and satellite imagery, all as part of the analysis. My ETL process is basically a bunch of scripts that get the data to a place where I can actually run the analyses I'm interested in.

I'd like to make it easier on myself by using a system that can query these different sources, e.g., give me the data within a bounding box (or polygon) for these variables and in the year 2018.

Does such a system exist? Would dumping everything I can into a PostGIS database get me most of the way there? Hoping someone who works with this type of data at scale can provide some insight into best practices.
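To make the kind of query described above concrete, here is a minimal sketch of a file catalog that records a bounding box, variable, and year per source and filters on all three. It is only an illustration under stated assumptions: the `Asset` record, the file names, and the requirement that every bounding box is expressed in one shared CRS are hypothetical and not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """One source file in a hypothetical catalog."""
    path: str       # hypothetical file location
    variable: str   # e.g. "temperature"
    year: int
    bbox: tuple     # (min_x, min_y, max_x, max_y), assumed to share one CRS


def bbox_intersects(a, b):
    """Return True if two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query(catalog, bbox, variables, year):
    """Return the assets matching the year, the variables, and the bounding box."""
    wanted = set(variables)
    return [
        asset
        for asset in catalog
        if asset.year == year
        and asset.variable in wanted
        and bbox_intersects(asset.bbox, bbox)
    ]


# Made-up entries, purely for illustration.
catalog = [
    Asset("temp_2018.nc", "temperature", 2018, (-125.0, 32.0, -114.0, 42.0)),
    Asset("ndvi_2018.tif", "ndvi", 2018, (-90.0, 25.0, -80.0, 35.0)),
    Asset("temp_2017.nc", "temperature", 2017, (-125.0, 32.0, -114.0, 42.0)),
]

hits = query(catalog, (-123.0, 37.0, -121.0, 39.0), ["temperature", "ndvi"], 2018)
print([asset.path for asset in hits])  # ['temp_2018.nc']
```

A catalog like this only narrows down which files to open; reading and clipping the matching files would still be done by format-specific tools.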

7 comments · 5 top-level

aaron-santos · 5y ago · 1 in thread

Loads of questions that might help to find your answers:

What's your re-projection strategy? Are you at liberty to apply the same projection to all of the data in your pipeline? If not (using UTM for rasters, for example), what is the smallest number of CRSs you can get away with?
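One rough way to answer the "how few CRSs" question when UTM is in play is to count the distinct UTM zones your data touches. The sketch below uses the standard 6-degree zone formula and the EPSG 326xx (north) / 327xx (south) codes for WGS 84 UTM; it deliberately ignores the Norway and Svalbard zone exceptions, and the helper names are hypothetical.

```python
import math


def utm_epsg(lon, lat):
    """Return the WGS 84 / UTM EPSG code for a lon/lat point in degrees.

    Ignores the special zone boundaries around Norway and Svalbard.
    """
    if not (-180.0 <= lon <= 180.0) or not (-80.0 <= lat <= 84.0):
        raise ValueError("point is outside the area covered by UTM")
    # Zones are 6 degrees wide, numbered 1..60 starting at 180 degrees west.
    zone = min(int(math.floor((lon + 180.0) / 6.0)) + 1, 60)
    base = 32600 if lat >= 0 else 32700  # northern vs southern hemisphere
    return base + zone


def distinct_utm_crs(points):
    """Return the sorted set of UTM EPSG codes needed to cover the given points."""
    return sorted({utm_epsg(lon, lat) for lon, lat in points})


# Made-up sample points: two near San Francisco, one near Sydney.
samples = [(-122.4, 37.8), (-121.9, 37.3), (151.2, -33.9)]
print(distinct_utm_crs(samples))  # [32610, 32756]
```

If the list comes back long, that is a hint that a single global or regional projection may be easier to live with than per-zone UTM.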

How are you going to efficiently retrieve data? For example, do you intend to convert your rasters to COGs (Cloud Optimized GeoTIFFs) to enable range reads? Do you intend to build pyramids for your rasters on ingest so you can pull different zoom levels quickly? If you have a mix of resolutions, do you want to standardize them so that co-registration is easier on the read side?
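Pyramiding a raster means storing progressively downsampled copies (overviews) next to the full-resolution data. As a minimal sketch of how the levels might be chosen, the function below produces power-of-two decimation factors until the coarsest overview fits inside a single tile. The 512-pixel tile size and the stopping rule are assumptions for illustration, not a requirement of any format.

```python
import math


def overview_factors(width, height, tile_size=512):
    """Return power-of-two decimation factors for a raster of the given size.

    Levels are added until the coarsest overview fits within one tile.
    The default tile_size of 512 is an assumed value.
    """
    if width <= 0 or height <= 0 or tile_size <= 0:
        raise ValueError("width, height and tile_size must be positive")
    factors = []
    factor = 1
    # Keep halving the resolution while the current level is larger than a tile.
    while max(math.ceil(width / factor), math.ceil(height / factor)) > tile_size:
        factor *= 2
        factors.append(factor)
    return factors


print(overview_factors(10000, 8000))  # [2, 4, 8, 16, 32]
print(overview_factors(400, 300))     # []
```

A list like this is the kind of input that overview-building tools take, for example the level list passed to `gdaladdo`.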

Do you want to automate your ETL process and have it run continuously or are you ok with ad-hoc manual runs?

Is there any data filtering you want to apply in your ETL? Cloud removal, special NODATA cases, spatial-temporal filtering?

What are your cost, latency, throughput requirements? Does this project prioritize any of those more than the others?

Source: built a raster/vector ingestion pipeline which I now use for analysis. Contact info in bio if you want to chat more about this.

chasely (OP) · 5y ago

Thank you for your response. This gives me a lot of good questions to think about and answer. I'll try to answer these for myself and probably reach out to chat.

schoenobates · 5y ago · 1 in thread

Might be worth giving QGIS a try. It can work with loads of formats and can be used to run an analysis (as well as make maps).

https://qgis.org

chasely (OP) · 5y ago

I started out using QGIS, but since most of my work is in Python I found it frustrating to always tool hop. I then kind of went around QGIS to use tools like GRASS directly, but have recently found Python-native tools like xarray to work very well for what I'm looking to do.

I know there are people who use QGIS/ArcGIS directly with Python, but I could never really get comfortable with it.

based2 · 5y ago

* https://www.safe.com/ $

* https://sourceforge.net/projects/geokettle/

* http://talend-spatial.github.io/

ref: https://gis.stackexchange.com/questions/5049/seeking-options...

tcbasche · 5y ago

Having worked on an ingestion pipeline for geospatial data, I’d say PostGIS is more than OK (and possibly an industry leader) for those kinds of queries.

There are other “big data” DBs like Cassandra or Elastic that can handle GIS data, but I’m skeptical they are even necessary until you reach petabytes of data.

Having a solid indexing strategy will address the majority of your performance concerns, along with things like simplifying geometry, reducing scans, etc.

nknealk · 5y ago

As others have said, PostGIS is a great option if it supports everything in your use case. The GiST indexes in particular give you very fast bounding box lookups. You can also do shape intersection and other sophisticated things.
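A bounding box lookup of the kind described above can be sketched as a parameterized query that uses PostGIS's `&&` operator (bounding box overlap, which can use a spatial index on the geometry column) together with `ST_MakeEnvelope`. The table name, the `geom` and `observed_at` column names, and the psycopg-style `%s` placeholders are all assumptions for illustration.

```python
from datetime import date


def build_bbox_query(table, bbox, year, geom_column="geom", srid=4326):
    """Build (sql, params) for rows whose geometry bbox overlaps `bbox` in `year`.

    `table`, `geom_column` and the `observed_at` column are hypothetical names.
    `bbox` is (min_x, min_y, max_x, max_y) in the CRS identified by `srid`.
    """
    # Identifiers cannot be passed as query parameters, so validate them here.
    for name in (table, geom_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = (
        f"SELECT * FROM {table} "
        f"WHERE {geom_column} && ST_MakeEnvelope(%s, %s, %s, %s, {int(srid)}) "
        "AND observed_at >= %s AND observed_at < %s"
    )
    min_x, min_y, max_x, max_y = bbox
    # Half-open date range covering the whole requested year.
    params = (min_x, min_y, max_x, max_y, date(year, 1, 1), date(year + 1, 1, 1))
    return sql, params


sql, params = build_bbox_query("observations", (-123.0, 37.0, -121.0, 39.0), 2018)
print(sql)
print(params)
```

The pair can then be handed to a database cursor's `execute(sql, params)`. For an exact polygon test rather than a bounding box overlap, `ST_Intersects` is the usual follow-up filter.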

I recommend the Boundless Geo tutorial on PostGIS. It does a great job of teaching you about bounding box indexes and all the GIS functions and types.
