Performance of ArcGISScripting and large spatial data sets

Although this question was already answered, I thought I could chime in and give my two cents.

DISCLAIMER: I worked for ESRI on the GeoDatabase team for several years and was in charge of maintaining various parts of the GeoDatabase code (Versioning, Cursors, EditSessions, History, Relationship Classes, etc.).

I think the biggest source of performance problems with ESRI code is not understanding the implications of using different objects, particularly the "little" details of the various GeoDatabase abstractions! Very often the conversation turns to the language being used as the culprit of the performance issues. In some cases it can be. But not all the time. Let's start with the language discussion and work our way back.

1. The programming language that you pick only matters when you are doing something complicated in a tight loop. Most of the time, this is not the case.

The big elephant in the room is that at the core of all ESRI code you have ArcObjects - and ArcObjects is written in C++ using COM. There is a cost for communicating with this code. This is true for C#, VB.NET, Python, or whatever else you are using.

You pay a price at initialization of that code. That may be a negligible cost if you do it only once.

You then pay a price for every subsequent time that you interact with ArcObjects.

Personally, I tend to write code for my clients in C#, because it is easy and fast enough. However, every time I want to move data around or do some processing on large amounts of data that is already implemented in Geoprocessing, I just initialize the scripting subsystem and pass in my parameters. Why?

  • It is already implemented. So why reinvent the wheel?
  • It may actually be faster. "Faster than writing it in C#?" Yes! If I implement, say, data loading manually, it means that I pay the price of .NET context switching in a tight loop. Every GetValue, Insert, ShapeCopy has a cost. If I make one call into GP, that entire data-loading process happens in the actual implementation of GP - in C++ within the COM environment. I don't pay the price for context switching because there is none - and hence it is faster. (A minimal sketch of this pattern follows below.)
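
To make that concrete, here is a minimal C# sketch of the pattern. The paths and the choice of CopyFeatures are hypothetical placeholders; the point is that there is a single call into the geoprocessor, and the row-by-row work runs inside the tool's native implementation rather than in .NET.

    using ESRI.ArcGIS.Geoprocessing;
    using ESRI.ArcGIS.esriSystem;

    public static class GpCallSketch
    {
        public static void CopyWithOneGpCall()
        {
            // Fine-grained geoprocessor object (assumes ArcObjects is already bound and licensed).
            IGeoProcessor2 gp = new GeoProcessorClass();

            // Tool parameters go in a variant array, in the tool's parameter order.
            IVariantArray parameters = new VarArrayClass();
            parameters.Add(@"C:\data\source.gdb\parcels");        // hypothetical input
            parameters.Add(@"C:\data\target.gdb\parcels_copy");   // hypothetical output

            // One COM round trip; the per-row loop happens in the tool's C++ code.
            gp.Execute("CopyFeatures_management", parameters, null);
        }
    }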

Ah yes, so then the solution is to use a lot of geoprocessing functions. Actually, you have to be careful.

2. GP is a black box that copies data around (potentially unnecessarily)

It is a double-edged sword. It is a black box that does some magic internally and spits out results - but those results are very often duplicated. 100,000 rows can easily be turned into 1,000,000 rows on disk after you run your data through 9 different functions. Using only GP functions is like creating a linear GP model, and well...

3. Chaining too many GP functions for large datasets is highly inefficient. A GP Model is (potentially) equivalent to executing a query in a really really really dumb way

Now don't get me wrong. I love GP Models - they save me from writing code all the time. But I am also aware that they are not the most efficient way of processing large datasets.

Have you ever heard of a query planner? Its job is to look at the SQL statement you want to execute, generate an execution plan in the form of a directed graph that looks a heck of a lot like a GP Model, look at the statistics stored in the database, and choose the optimal order in which to execute the steps. GP just executes things in the order you put them because it has no statistics to do anything more intelligent - you are the query planner. And guess what? The order in which you execute things is very dependent on your dataset, and it can make the difference between days and seconds - and that is up to you to decide.

"Great" you say, I will not script things myself and be careful about how I write stuff. But do you understand GeoDatabase abstractions?

4. Not understanding GeoDatabase abstractions can easily bite you

Instead of pointing out every single thing that can possibly give you a problem, let me just point out a few common mistakes that I see all the time, along with some recommendations.

  • Understand the difference between True/False for recycling cursors. This tiny little flag set to true can make runtime orders of magnitude faster.
  • Put your table in LoadOnlyMode for data loads. Why update the index on every insert?
  • Understand that even though IWorkspaceEdit::StartEditing looks the same in all workspaces, it is a very different beast on each data source. On an Enterprise GDB you may have versioning or support for transactions; on shapefiles it has to be implemented in a very different way. How would you implement Undo/Redo? Do you even need to enable it (yes, it can make a difference in memory usage)?
  • Understand the difference between batch operations and single-row operations. Case in point: GetRow vs. GetRows - this is the difference between executing a query to get one row and executing one query to fetch multiple rows. A tight loop with a call to GetRow means horrible performance, and it is culprit #1 of performance issues.
  • Use UpdateSearchedRows
  • Understand the difference between CreateRow and CreateRowBuffer. Huge difference in insert runtime.
  • Understand that IRow::Store and IFeature::Store trigger super heavy polymorphic operations. This is probably culprit #2 of really slow performance. It doesn't just save the row; it is the method that makes sure your geometric network is OK, that notifies the ArcMap editor that a row has changed, and that tells every relationship class that has anything to do with this row to validate, to make sure the cardinality is valid, and so on. You should not be inserting new rows with this - you should be using an InsertCursor (see the sketch after this list)!
  • Do you want (need) to do those inserts inside an EditSession? It makes a huge difference either way. Some operations require it (and make things slower), but when you don't need it, skip the undo/redo features.
  • Cursors are expensive resources. Once you have a handle to one, you are guaranteed Consistency and Isolation, and that has a cost.
  • Cache other resources like database connections (don't create and destroy your Workspace reference) and table handles (every time you open or close one, several metadata tables need to be read).
  • Putting FeatureClasses inside or outside a FeatureDataset makes a huge difference in performance. It is not meant as an organizational feature!
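
To tie a few of those bullets together, here is a minimal C# sketch of a bulk load: a recycling search cursor on the source, LoadOnlyMode on the target, a buffered insert cursor fed from a feature buffer instead of CreateRow + Store, and an edit session started without undo/redo. Treat it as an outline under simplifying assumptions (no error handling, and the exclusive schema lock that LoadOnlyMode normally wants is only noted in a comment), not production code.

    using ESRI.ArcGIS.Geodatabase;

    public static class BulkLoadSketch
    {
        // Sketch of a bulk copy from one feature class to another.
        public static void LoadFeatures(IFeatureClass source, IFeatureClass target,
                                        IWorkspaceEdit workspaceEdit)
        {
            // LoadOnlyMode defers index maintenance until the load is finished.
            // (A real load would also acquire an exclusive schema lock first.)
            IFeatureClassLoad load = target as IFeatureClassLoad;
            if (load != null)
                load.LoadOnlyMode = true;

            // Edit session without undo/redo - skip it entirely if your data source
            // does not require one for inserts.
            workspaceEdit.StartEditing(false);
            workspaceEdit.StartEditOperation();

            // Recycling cursor: one row object is reused on every NextFeature call.
            IFeatureCursor search = source.Search(null, true);

            // Buffered insert cursor + feature buffer instead of CreateRow/Store:
            // no per-row Store() pipeline, and inserts are flushed in batches.
            IFeatureCursor insert = target.Insert(true);
            IFeatureBuffer buffer = target.CreateFeatureBuffer();

            IFeature feature;
            int count = 0;
            while ((feature = search.NextFeature()) != null)
            {
                buffer.Shape = feature.ShapeCopy;
                // ...copy attribute values into the buffer as needed...
                insert.InsertFeature(buffer);

                if (++count % 1000 == 0)
                    insert.Flush();   // push the current batch to the data source
            }
            insert.Flush();

            workspaceEdit.StopEditOperation();
            workspaceEdit.StopEditing(true);   // true = save edits

            if (load != null)
                load.LoadOnlyMode = false;     // indexes get rebuilt here
        }
    }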

5. And last but not least...

Understand the difference between I/O bound and CPU bound operations

I honestly thought about expanding more on every single one of those items and perhaps doing a series of blog entries that covers every single one of those topics, but my calendar's backlog list just slapped me in the face and started yelling at me.

My two cents.


Generally, for performance computations, I try to stay away from using any ESRI-related stuff. For your example, I would suggest doing the process in steps: first read the data into normal Python objects, then do the calculations, and finally convert to the final ESRI spatial format. For ~10k records, you could probably get away with storing everything in memory during processing, which would give a definite performance gain.