Character encoding support in geodatabases and shapefiles

I think you're part way there. You can use iconv to convert from one encoding to another, and you can use this as part of the shp2pgsql process. For example:

shp2pgsql *yourshapefile* *postgrestablename* | iconv -f *sourceencoding* -t *targetencoding* | psql -d *yourdatabase*

If you're working in a Linux environment then iconv should be installed already. For Windows I found LibIconv for Windows, but I have no experience of using iconv under Windows, so I can't vouch for it.
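
If iconv isn't to hand, the same kind of conversion can be sketched in Python. This is only an illustration of what iconv -f/-t does: the function name reencode and the choice of cp1256 (the Windows Arabic code page) as the source encoding are assumptions for the example, not something taken from the question.

```python
def reencode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another (like iconv -f/-t)."""
    # The default strict error handling raises if a byte is invalid in the source encoding.
    text = data.decode(source_encoding)
    return text.encode(target_encoding)


# Hypothetical sample: Arabic text stored in the Windows-1256 code page.
sample = "مرحبا".encode("cp1256")
utf8_bytes = reencode(sample, "cp1256", "utf-8")
print(utf8_bytes.decode("utf-8"))
```

A conversion like this only helps if the source bytes still hold the original characters; it cannot recover text that was already replaced with question marks.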

Hope this helps!

Jo


Below are the details of the process I used for converting a File GeoDataBase with Arabic fields into shapefiles with UTF-8 encoding that open happily in both QGIS and ArcMap showing both Arabic and English correctly (without using extensions to export or to read):

  • The basic idea is: from the FGDB export a shapefile including a .dbf (in the wrong encoding), then export the Attribute Table of the same layer as text (in the right encoding, which is UTF-8), and use another program to replace the contents of the shapefile .dbf with proper UTF-8 data fields and save the .dbf with UTF-8 encoding. Then add a .cpg file to each shapefile to inform ArcGIS of the new encoding of the .dbf. Steps:

1) Add the layers from the FGDB into ArcMap (I used 10.1, but there's absolutely no reason for it not to work in earlier versions, because the encoding bit happens later, outside of Arc). To export, right-click a layer and choose Data -> Export Data, click the folder button in the export dialogue to bring up the Save dialog, and choose Shapefile as the output format.

1b) Alternate method to the above: navigate to the FGDB in ArcCatalog, right-click it, choose Export -> To Shapefile (multiple), and export the whole FGDB as a folder full of shapefiles in a single operation.

2) Now you have a set of shapefiles with gibberish where the Arabic script should be (on my machine it displayed question marks in place of characters). The .dbf portions of the shapefiles themselves, opened in Excel or whatever, have gibberish instead of Arabic; it's not merely a display issue in the GIS program, it's that the .dbf files themselves don't contain the Arabic characters. Not helpful just yet.
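
To illustrate why question marks in the file mean the data is actually gone, here is a small Python sketch. I don't know what ArcMap does internally; the assumption here is simply that the export wrote the text into a code page with no Arabic letters (cp1252 is used as a stand-in).

```python
arabic = "مرحبا"

# Assumption for illustration: the export targeted a Western code page (cp1252),
# which has no Arabic letters, so every character is replaced on the way out.
lossy = arabic.encode("cp1252", errors="replace")
print(lossy)

# Decoding again cannot bring the Arabic back: only literal "?" bytes were stored.
recovered = lossy.decode("cp1252")
print(recovered == arabic)
```

That is why no choice of encoding when opening the exported .dbf will fix it, and why the text has to be brought in again from a source that still has it.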

3) In ArcMap, open the Attribute Table of a layer from the FGDB. The table opens with both the English and the Arabic showing properly (that's why FGDB was used in the first place). In the Table Options menu of the Attribute Table window, choose Export, and in the Export Data dialogue click the output folder button to get to the Saving Data dialogue where you choose Text File as the output type. Now you have a text file that will open in Notepad with comma delimiters, encoded as UTF-8, with both English and Arabic properly encoded (the Arabic should, at this point, display properly in Notepad).
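
If you'd rather check the exported text file with a script than in Notepad, a minimal Python sketch follows. The function name read_attribute_export is made up for the example, and I'm assuming the export is comma-delimited UTF-8 with a header row, as described above.

```python
import csv


def read_attribute_export(path):
    """Read the exported attribute table into a list of dicts keyed by column name."""
    # "utf-8-sig" reads plain UTF-8 and also skips a byte-order mark if one is present.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))
```

Printing a row or two from the result is a quick way to confirm that the Arabic survived the export and that the data lines are actually there.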

Now to get that information into the .dbf portions of the shapefiles!

4) Open the .dbf file of a shapefile in LibreOffice Calc, a free and open-source Excel clone that opens, manipulates, and saves .dbf files easily.

By the way, in this case I'm not using LibreOffice instead of MS Office for ideological reasons, but simply because I can't figure out how to make Excel save a .dbf file. In Calc it's easy; in fact it's the default option when hitting Save after having opened and modified a .dbf file. Excel, by contrast, states that the file "cannot be saved in the current format" and not-so-helpfully offers to "save it as the latest format" (no option for .dbf comes up). There are extensions/plugins for Excel that purport to do the job (

The .dbf file in Calc still shows the gibberish in place of the Arabic. Alongside it, open the .csv that you exported from the attribute table of the same shapefile, making sure you specify UTF-8 as the encoding (and commas as delimiters) in the opening dialogue. The text file should open in a second Calc spreadsheet with the Arabic displayed correctly, and it should contain the same columns as the .dbf plus an OBJECTID column at the beginning. Copy-paste the columns from the .csv containing the proper Arabic into the .dbf (I actually just copy-pasted the whole table with the exception of the leftmost ID column to save time; the information is identical anyway). Hit Save in the modified .dbf in LibreOffice (it'll ask if you really want to use such a weird format as .dbf; yes, you do). You may have to again specify UTF-8 as the file encoding.

Repeat this process for all of the .dbf components of the shapefiles from the FGDB, replacing all gibberish columns with the Arabic strings.
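
A rough way to confirm that a resaved .dbf really holds UTF-8 Arabic rather than question marks is sketched below in Python. The function name contains_arabic is made up, and the check is deliberately crude: it decodes the whole file as UTF-8, ignoring bytes that don't decode (such as the binary header), and looks for any character in the basic Arabic Unicode block (U+0600 to U+06FF).

```python
def contains_arabic(raw: bytes) -> bool:
    """Return True if the bytes contain at least one UTF-8 encoded Arabic character."""
    # errors="ignore" skips bytes that are not valid UTF-8, e.g. parts of the .dbf header.
    text = raw.decode("utf-8", errors="ignore")
    return any("\u0600" <= ch <= "\u06ff" for ch in text)
```

To use it, read a .dbf in binary mode and pass the bytes in. A False result on a file that should contain Arabic suggests the paste or the save encoding went wrong.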

5) As soon as you've resaved the .dbf portions with the Arabic columns pasted in, you can open the shapefiles in QGIS and they will work properly in both languages, provided that you specify UTF-8 as the encoding in the Import Vector File dialog. However, they still won't work properly in ArcGIS (or at least not in all versions) because ArcGIS doesn't automatically recognize the encoding or let you choose it when you add the shapefile to a project. Arc needs a separate component to the shapefile, called a Code Page Conversion (.cpg) file, to instruct it which encoding to read.

6) Use a text editor (notepad, nano, or whatever, but not Word or any other word processor) to create a text file that contains only the five characters "UTF-8". Save a copy with the .cpg extension for each of the shapefiles (I just click on a piece of the shapefile in the Save As dialogue, then erase the extension and add .cpg), in the same folder as the shapefile (it basically becomes another piece of the multi-part shapefile). The .cpg extension tells Arc that this is a file containing information about the encoding of the .dbf file; once it's bundled into the shapefile along with its same-name-but-different-extension siblings, the encoding of the shapefile is automatically recognized by ArcGIS.
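
With a whole folder of shapefiles, creating the .cpg files by hand gets tedious. A minimal Python sketch of the same step follows; the function name write_cpg_files is made up, and it assumes the shapefiles use a lowercase .shp extension and all sit directly in one folder.

```python
from pathlib import Path


def write_cpg_files(folder, encoding_name="UTF-8"):
    """Write a .cpg file next to every .shp in the folder and return the paths created."""
    created = []
    for shp in sorted(Path(folder).glob("*.shp")):
        # Same base name as the shapefile, with the extension swapped for .cpg.
        cpg = shp.with_suffix(".cpg")
        # The file holds only the encoding name, with no trailing newline.
        cpg.write_text(encoding_name, encoding="ascii")
        created.append(cpg)
    return created
```

Note that this overwrites any .cpg that already exists, so only run it on shapefiles whose .dbf you have actually resaved as UTF-8.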

7) Voila. Now you have shapefiles that contain both English and Arabic strings, as far as I can tell exactly as they were in the original File GeoDataBase. They open in my installations of both ArcMap and QGIS, and in both cases the strings in both languages display correctly including in map labels.

Caveats:

  • Not all copies of ArcGIS seem to export the attribute table as a properly populated text file (on at least one computer, attempting to export the attribute table to a text file results in a file with only the headers, not the data lines). This is NOT the proper behaviour of Arc (of course it's supposed to be able to export Attribute Tables as text), but it may come up for some users. This makes the rest of the steps impossible.

  • It doesn't seem as though ArcGIS will save new shapefiles with UTF-8 encoding. This will only affect users that want to create new shapefiles from the data, not people who just want to display, modify, and use them to make maps. The workaround seems to involve messing with your Windows registry as detailed here: (http://support.esri.com/cn/knowledgebase/techarticles/detail/21106). I haven't had to deal with it because my ArcGIS and QGIS both seem to happily recognize the shapefiles I saved using the above process, and I can modify geometry and table entries or even add new polygons with more Arabic text without any obvious problems (even though Arc doesn't seem to want to save new shapefiles with UTF-8 encoding, it seems willing to update/resave them).

  • I'm assuming that the functionality of LibreOffice is the same in Windows as on my computer. I use GNU/Linux for most of my work, and only boot to Windows if I need to use ArcGIS or AutoCAD for some task or another, so I did the modification of the .dbf file in LibreOffice running on Fedora. I assume it works the same way on Windows, but I can't test that without installing LibreOffice on my Windows partition and my current Internet connection is a bit slow for non-necessary downloads. There are plugins for Excel that allow you to save .dbf files in a selected encoding (exceltodbf.sourceforge.net/, for example), but I haven't tried them. There may be other ways altogether to manipulate and save .dbf, but I haven't looked into them after finding a reasonably easy way to do it with LibreOffice. The Cadillac solution would be to write a script that automatically merges the .dbf with the text file and saves the .dbf in UTF-8, but unless you've got a lot of stuff to convert this seems like a rather extreme solution (especially given the option below).

  • The whole issue seems to be avoidable if you pay for the Production Mapping extension in ArcGIS, which allows you to directly convert FGDBs to shapefiles with UTF-8 encoding according to this page: http://resources.arcgis.com/en/help/main/10.1/index.html#//0103000001m1000000. Why this rather basic functionality (Unicode has been around for a while now, and there are a lot of languages other than English out there) is only available to those customers who pay extra is a question for ESRI.
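
For anyone who does have a lot to convert, the scripted merge mentioned in the caveats could look roughly like the Python sketch below. I haven't run it against real ArcGIS output, so treat it as a starting point under stated assumptions: the .dbf is a plain dBASE III style file (32-byte header, 32-byte field descriptors ending in 0x0D, fixed-width records), the rows from the text export are in the same order as the .dbf records, and the dictionary keys match the .dbf field names (which shapefiles limit to 10 characters, so longer names from the export would need mapping first). The function names are made up.

```python
import struct


def read_dbf_layout(data):
    """Return (record_count, header_len, record_len, fields) from raw .dbf bytes."""
    n_records, header_len, record_len = struct.unpack_from("<IHH", data, 4)
    fields = []
    offset = 32  # field descriptors start right after the 32-byte file header
    pos = 1      # byte 0 of every record is the deletion flag
    while data[offset] != 0x0D:
        name = data[offset:offset + 11].split(b"\x00", 1)[0].decode("ascii")
        ftype = chr(data[offset + 11])
        length = data[offset + 16]
        fields.append((name, ftype, pos, length))
        pos += length
        offset += 32
    return n_records, header_len, record_len, fields


def patch_text_fields(data, rows):
    """Overwrite character (C) fields with UTF-8 values taken from rows (a list of dicts)."""
    n_records, header_len, record_len, fields = read_dbf_layout(data)
    if len(rows) != n_records:
        raise ValueError("text export and .dbf have different record counts")
    out = bytearray(data)
    for i, row in enumerate(rows):
        start = header_len + i * record_len
        for name, ftype, pos, length in fields:
            if ftype != "C" or name not in row:
                continue  # leave numeric/date fields and unmatched columns untouched
            raw = row[name].encode("utf-8")
            if len(raw) > length:
                # Arabic letters take two bytes each in UTF-8, so values can outgrow the field.
                raise ValueError("value too long for field " + name)
            out[start + pos:start + pos + length] = raw.ljust(length, b" ")
    return bytes(out)
```

The sketch keeps the existing field widths, so it refuses values that no longer fit once encoded as UTF-8 rather than silently truncating them. It also doesn't touch the header, so the .cpg file from step 6 is still needed.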