Managing large amounts of geospatial data?

I think the stock/obvious answer would be to use a spatial database (PostGIS, Oracle, SDE, MSSQL Spatial, etc.) in conjunction with a metadata server such as Esri's GeoPortal or the open source GeoNetwork application, and overall I think this is generally the best solution. However, you'll likely always have a need for project-based snapshots / branches / tags. Some of the more advanced databases have ways of managing these, but they're generally not all that easy to use or manage.
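
If you go the PostGIS route, standing up an authoritative layer is a small amount of work. Here's a minimal sketch (assuming psycopg2 and a PostgreSQL database with the PostGIS extension available; the connection details and the `parcels` table are placeholders for illustration, not from any real setup):

```python
import psycopg2

# Connection details are placeholders -- adjust for your environment.
conn = psycopg2.connect(host="localhost", dbname="gisdata",
                        user="gis", password="secret")

with conn, conn.cursor() as cur:
    # Enable PostGIS once per database.
    cur.execute("CREATE EXTENSION IF NOT EXISTS postgis;")

    # A hypothetical authoritative layer; 4326 = WGS84.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS parcels (
            id      serial PRIMARY KEY,
            name    text,
            updated timestamptz DEFAULT now(),
            geom    geometry(Polygon, 4326)
        );
    """)

    # Spatial index so users can query by extent efficiently.
    cur.execute("CREATE INDEX IF NOT EXISTS parcels_geom_idx "
                "ON parcels USING GIST (geom);")

conn.close()
```

That only covers the central-copy part, of course; the snapshot/branch/tag problem still needs its own process (separate schemas, periodic dumps, or whatever the database offers).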

For things you store outside of a database (large images, project-based files) I think the key is to have a consistent naming convention and again a metadata registry (even something low-tech like a spreadsheet) that allows you to track them and ensure that they are properly managed. For instance, in the case of project-based files this can mean deleting them when records management policy dictates, or rolling them into the central repository on project completion.
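
If a full metadata catalogue is overkill, even a script-generated registry goes a long way. A minimal sketch (the directory layout and CSV columns here are just assumptions for illustration) that walks a data share and records what's there:

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

DATA_ROOT = Path("/data/projects")    # hypothetical share of project-based files
REGISTRY = Path("data_registry.csv")  # the low-tech "metadata spreadsheet"

with REGISTRY.open("w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["path", "size_mb", "modified", "project"])

    for path in DATA_ROOT.rglob("*"):
        if not path.is_file():
            continue
        stat = path.stat()
        rel = path.relative_to(DATA_ROOT)
        writer.writerow([
            str(rel),
            round(stat.st_size / 1_048_576, 2),
            datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
            rel.parts[0],  # assumes the top-level folder is the project name
        ])
```

Run nightly, that gives you enough to spot orphaned project files and apply whatever retention policy applies.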

I have seen some interesting solutions though...

Back when the BC Ministry of Environment was running things off of Arc/Info coverages, they had a really cool rsync-based two-way synchronization process in place. The coverages that were under central control were pushed out to regions nightly, and regional data was pushed back in. This block-level differential transfer worked really well, even over 56k links. There were similar processes for replicating the Oracle-based attribute databases, but I don't think they typically did too well over dial-up :)

My current place of work uses a similar hybrid solution. Each dataset has its authoritative copy (some in Oracle, others in MapInfo, others in personal geodatabases) and these are cross-ETL'd nightly using FME. There is some pretty major overhead here when it comes to maintenance though; the effort to create any new dataset and ensure organisational visibility is considerably higher than it should be. We're in the process of a review intended to find some way of consolidating to avoid this overhead.


Metadata is by far the most important issue here. If the metadata answers who, when, why, and where, it's an acceptable metadata record.
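
To make that concrete, a minimal record can be as small as this (a sketch only; the field names are my own shorthand, not taken from any particular metadata standard):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class MetadataRecord:
    """The bare minimum: who, when, why, and where."""
    who: str    # creator / responsible party
    when: date  # creation or last-update date
    why: str    # purpose / abstract
    where: str  # spatial extent, e.g. a bounding box or place name

record = MetadataRecord(
    who="Regional survey team",
    when=date(2010, 6, 1),
    why="Parcel boundaries digitised for a cadastral update project",
    where="BBOX(-123.3, 48.4, -123.0, 48.7)",
)
```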

Having worked in large companies with just a few GIS users (around 30), we had major issues controlling data, especially versions and permissions. One side of this can be solved with extensive documentation of the data (metadata); the other problems are most likely solved with a central repository, which is where PostGIS shines.

GeoNetwork is a good start for handling the metadata issues. Solving the central repository problem is more complicated, because it may take a specialized person to design and maintain the database.

The complicated issue is who will be in charge of QA/QC for these datasets and their metadata. Although computer-driven processes work great, they cannot be as rigorous as a good data manager/data keeper, which is the route the company I worked for took. There is now someone whose sole job is to review/commit metadata and organize the geospatial data that is not centralized in a DBMS.


We have used a file system organized hierarchically by:

- geographic extent (country or continent)
- data provider, licensor
- domain/dataset
- date/version

Below that, we have a policy of separating the source data (in the same format as on whatever CD/DVD we got from the provider) from any derived datasets that we produce within our company.
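
As a sketch of how such a convention can be enforced in code (the root path and the exact level names are assumptions; adapt them to your own hierarchy):

```python
from pathlib import Path

ARCHIVE_ROOT = Path("/archive")  # hypothetical root of the data archive

def dataset_dir(extent: str, provider: str, dataset: str,
                version: str, derived: bool = False) -> Path:
    """Build the canonical path for a dataset:
    <root>/<extent>/<provider>/<dataset>/<version>/{source|derived}
    """
    kind = "derived" if derived else "source"
    return ARCHIVE_ROOT / extent / provider / dataset / version / kind

# Example: the provider's original DVD contents vs. our reprojected copy.
src = dataset_dir("europe", "acme_mapping", "landcover", "2010-03")
drv = dataset_dir("europe", "acme_mapping", "landcover", "2010-03", derived=True)
print(src)  # /archive/europe/acme_mapping/landcover/2010-03/source
print(drv)  # /archive/europe/acme_mapping/landcover/2010-03/derived
```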

The file system makes it really easy to ingest any data from the customer and also allows for some flexibility in terms of the physical storage - we keep our archives on larger, slower disks and we have special file servers (transparently linked into the hierarchy) for the more frequently used datasets.

To facilitate management within projects, we use symbolic links. We keep our vectors in a database (Oracle) and we make it a rule to have at least one database instance per customer (and several users/schemas for the projects). We haven't been keeping many rasters in a database, though, as they tend to take too much space even outside one. Also, we like to keep our database instances as lightweight as possible.
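
The symbolic-link part can be scripted so that project folders never hold physical copies. A minimal sketch (the paths are made up; on Windows shares you'd need junctions or a different approach entirely):

```python
from pathlib import Path

ARCHIVE = Path("/archive/europe/acme_mapping/landcover/2010-03/derived")
PROJECT = Path("/projects/customer_x/project_42/data")

PROJECT.mkdir(parents=True, exist_ok=True)

link = PROJECT / "landcover_2010-03"
if not link.exists():
    # Point the project at the archived dataset instead of copying it.
    link.symlink_to(ARCHIVE, target_is_directory=True)
```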

And yes, we have someone in charge of 'policing' the whole thing so it doesn't get too messy.

The biggest issue we have with this setup currently is the lack of a nice user interface that would give us a better overview of the whole thing, and we've been planning to add metadata storage on top of all that. We're still considering our options here.

We're using version control for our code and we've used it for documents, but it turns out that version control isn't really made for large datasets, especially when they're mostly binary files, so I wouldn't recommend it unless you're dealing with GML or something similarly text-like. The problems include huge overheads in server-side disk usage, as well as clients crashing when checking out huge repositories.