Hosted Copies, Targeted Updates

An Alternative, Efficient ETL Strategy

Note

This was originally written in 2022.

Some of the Python modules have changed since then, especially the arcgis module.

Snippets here may no longer run as written.

Overview

  1. What is a “hosted copy”?
  2. What not to do
  3. The Process
  4. Improvements!

Requirements

  • Python
    • arcgis module
    • depending on source data, sqlalchemy, openpyxl, fiona
  • access to source data (files, database)
  • Portal / AGOL authentication

“Hosted Copy”?

A common problem

An organization (say, a county) has datasets that they wish to make available to the public, but…

The authoritative source(s) may be:

  • Not available to register with ArcGIS Server
  • Too sensitive to expose outside your organization’s network
  • Not all in the same place (so a single SQL view layer is not possible)

A common solution

  1. The organization publishes a “hosted copy” of the data to ArcGIS Online / Portal.
  2. Data is loaded into ArcGIS Datastore and served from there.
  3. The public can see and interact with the dataset, but the authoritative source remains inaccessible to the rest of the internet.
  4. All is well!

A new problem!

How can updates be passed to the hosted layer to ensure that the public sees the most current information?

Common Approaches

Overwrite the Service

If you load the source data into a map using ArcGIS Pro (or via arcpy), you can overwrite the hosted layer with it.

This replaces literally everything about the service except for its ItemID and the service URL.

It’s like bulldozing and rebuilding your house to repaint the kitchen.
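
For reference, an overwrite from Python looks roughly like this (a sketch; the item ID and file path are placeholders):

from arcgis.gis import GIS
from arcgis.features import FeatureLayerCollection

gis = GIS(profile='personalAGOL')

# placeholder item ID for the hosted feature layer
item = gis.content.get('<item-id>')
flc = FeatureLayerCollection.fromitem(item)

# replaces the service's data (and nearly everything else) with the file's contents
flc.manager.overwrite('./data/parcels_export.zip')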

When to Overwrite

  • You have changed the number / order of layers in the service
  • You are making changes to relationship classes between layers and tables

When Not to Overwrite

  • Simple schema alterations
  • Updating features and attribute values
  • Making changes to visualization settings

Truncate / Append

  • Leaves the schema and settings alone
  • Empties the layer, resets the objectid
  • Loads in fresh data

Much safer than overwriting! Also, pretty straightforward.
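
In code, it’s roughly the following (a sketch that borrows the dlayer and p_df objects set up in the code section later):

# empty the layer; this also resets ObjectIDs
dlayer.manager.truncate()

# reload everything from the source dataframe
# (for very large layers, chunk the adds or use FeatureLayer.append())
dlayer.edit_features(adds=p_df.spatial.to_featureset())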

When to Truncate/Append

  • All or most features have been edited
  • Zero tolerance for errors
  • Some tolerance for unavailable data

When not to Truncate/Append

  • Very large layers with few edits
  • Zero tolerance for unavailable data

A Different Approach: Targeted updates

What’s different about this method?

  • Compares source / destination data, identifies changed rows
  • Identifies adds / deletes
  • Passes data directly into the destination layer without a separate upload or staging layer
  • Can run per source on subsets of fields

Example: Parcels

  • Parcels layer has about 60,000 rows
  • Source data comes from many tables, two databases
  • Source data is edited daily, but only on a few hundred rows

60,000 Edits, All at Once

flowchart LR

S1[Table 1]
S2[Table 2]
S3[Table 3]
T[Temporary Table / Join Layer]
H[Hosted Layer]

S1 & S2 & S3 --> T
T --Overwrite--> H
T --Truncate / Append--> H

300 Edits in 4 Operations

flowchart LR

S1[Table 1]
S2[Table 2]
S3[Table 3]

S1 --200 Updated\nAttributes--> H
S2 --50 Updated\nAttributes--> H
S3 --40 Updated\nGeometries--> H
S3 --10 New Geometries--> H

H[Hosted Layer]

Benefits

  • Large datasets can be refreshed in minutes
  • Public layer is never empty or unavailable
  • Updates can happen at any time without interruption
  • Total data transfer and server-side processing is dramatically reduced

Let’s Talk Code

Prep Work

#| echo: true
from arcgis.gis import GIS
from arcgis.features import GeoAccessor
import pandas as pd
from datetime import datetime

# Connect to AGOL
gis = GIS(profile='personalAGOL')

# Get Destination Layer
dest = gis.content.get('9dc51858aaa5475b8d3eafea84ee0e14')
dlayer = dest.layers[0]

dlayer

Pulling Source Data: Spatial Files

#| echo: true
p_df = GeoAccessor.from_featureclass('./data/parcels.shp')
p_df.sample()

Pulling Source Data: Non-Spatial Files

#| echo: true
s_df = pd.read_csv('./data/structures.csv')
a_df = pd.read_excel('./data/assessments.xlsx')

a_df.sample()

Other Sources?

  • Feature Services: FeatureLayer.query(as_df=True)
  • Databases: pd.read_sql()
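
Both in brief (a sketch; the service URL and connection string are placeholders):

from arcgis.features import FeatureLayer
import pandas as pd
from sqlalchemy import create_engine

# another organization's feature service
flayer = FeatureLayer('https://services.example.com/arcgis/rest/services/roads/FeatureServer/0')
f_df = flayer.query(as_df=True)

# a database table or view
engine = create_engine('postgresql://user:password@host/assessor')
db_df = pd.read_sql('SELECT pin, assessed_value FROM assessments', engine)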

Updating Features from Separate Sources

Assessments Data

  • Only certain fields
  • No shapes
  • Of many features, only a few are edited on any given day

Query the Destination Layer

#| echo: true
dest_a = dlayer.query(
  out_fields = a_df.columns.to_list(),
  as_df = True,
  return_geometry = False
)

dest_a.sample()

Merge the Source

#| echo: true
inner_a = a_df.merge(
  dest_a[['fid', 'pin']],
  how = 'inner',
  on = 'pin'
)

inner_a.sample()

Side Note: What about GeoAccessor.compare()?

This was originally written before that method was added, but I’m happy to say that using compare(destination, source) works well, and can simplify your code considerably.

We’ll still look at the entire process.

#| echo: true
a_df.spatial.compare(dest_a.spatial, 'pin')['modified_rows']

Compare

#| echo: true
# match the destination's columns and dtypes to the source subset
dest_a = dest_a[inner_a.columns.to_list()].astype(inner_a.dtypes)

# index both frames on the key field so compare() aligns rows
for df in [inner_a, dest_a]:
    df.set_index('pin', drop=False, inplace=True)
    df.sort_index(inplace=True)

comp_a = inner_a.compare(dest_a.loc[inner_a.index,:])
comp_a.head()

Selective Updates

#| echo: true
if len(comp_a) > 0:
  print(f'{len(comp_a)} changes identified.')
  # push only the rows the comparison flagged
  update_a = inner_a.loc[comp_a.index,:]
  a_result = dlayer.edit_features(updates=update_a.spatial.to_featureset())
  print(f"Updates: {len(a_result['updateResults'])}")
else:
  print('No changes!')

Spatial Stuff

Parcel Shapes

  • Pretty much the same, but with the geometry
  • Can be an issue if source / destination have different projections / precision
  • Adds and Deletes, too
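
If the projections differ, reproject the source to match the destination before comparing. A sketch, assuming the destination layer is in WGS 84 (wkid 4326):

# project the source dataframe in place to match the destination
# (needs arcpy or pyproj available)
p_df.spatial.project(4326)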

Query the Destination Geometry

#| echo: true
dest_p = dlayer.query(
  out_fields = ['pin', 'fid'],
  as_df = True,
  return_geometry = True
)

dest_p.head()

Reshape, Merge

#| echo: true
p_df.drop(columns=['date_creat', 'date_modif', 'globalid', 'OBJECTID', 'index'], inplace=True)

merge_p = p_df.merge(
  dest_p[['fid', 'pin']],
  how = 'outer',
  on = 'pin',
  indicator = True
)

merge_p.sample(3)

The Merged Table

  • _merge indicator
  • left_only: feature is only in source, i.e., is a new feature
  • right_only: feature is only in destination, i.e., should be deleted
  • both: common to both sets, possibly an update
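
A quick way to see the breakdown:

merge_p['_merge'].value_counts()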

Updates

Create a comparison frame and pass in the edits, as before. Use the _merge column to select the common rows.

update_p = merge_p.query("_merge == 'both'").drop(columns='_merge')

for df in [update_p, dest_p]:
    df.set_index('pin', drop=False, inplace=True)
    df.sort_index(inplace=True)

dest_p = dest_p[update_p.columns.to_list()].astype(update_p.dtypes)
comp_p = update_p.compare(dest_p.loc[update_p.index,:])
comp_p.head()

Apply Edits

# split the merged table into adds, updates, and deletes
adds_p = merge_p.query("_merge == 'left_only'")
deletes_p = merge_p.query("_merge == 'right_only'")

edit_flags = [
  len(comp_p) > 0,
  len(adds_p) > 0,
  len(deletes_p) > 0
]

if any(edit_flags):
  print(f"Adds: {len(adds_p)}, Edits: {len(comp_p)}, Deletes: {len(deletes_p)}")

  # new features: drop the merge indicator and the (empty) destination key
  adds_fs = adds_p.drop(columns=['_merge', 'fid']).spatial.to_featureset()
  # updates: the common rows the comparison flagged
  updates_fs = update_p.loc[comp_p.index,:].spatial.to_featureset()
  # deletes: a comma-separated string of ObjectIDs (here, the fid field)
  delete_oids = ','.join(deletes_p['fid'].astype(int).astype(str))

  p_result = dlayer.edit_features(
    adds = adds_fs,
    updates = updates_fs,
    deletes = delete_oids
  )
else:
  print('No changes!')

Alternate Route

If your source has a last_edited date, you can use that to filter the source instead and skip the comparison process. (Comparing geometry is especially problematic, anyway.)
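
A sketch of that filter, assuming the source table has a last_edited column and the timestamp of the previous successful run is at hand:

from datetime import datetime

# hypothetical: the last successful run, e.g. read back from a log
last_run = datetime(2022, 6, 1)

# only rows edited since then need to enter the update flow
recent_a = a_df[a_df['last_edited'] > last_run]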

NOTE: This will not help you identify deletes. If your source does not track deleted features, you’ll still need to compare against the entire table.

Further Enhancements

Simple

  • Add some logging
  • Track timestamps of successful runs and feed them into date-based filters for future runs (see the sketch below)
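
A sketch of the bookkeeping (the log and timestamp file names are assumptions):

import json
import logging
from datetime import datetime

logging.basicConfig(filename='etl.log', level=logging.INFO)

# read the timestamp of the last successful run, if one exists
try:
    with open('last_run.json') as f:
        last_run = datetime.fromisoformat(json.load(f)['last_run'])
except FileNotFoundError:
    last_run = None

# ... run the updates, filtering sources on last_run ...

# record this run for next time
logging.info('Run completed successfully')
with open('last_run.json', 'w') as f:
    json.dump({'last_run': datetime.now().isoformat()}, f)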

Complex

  • Abandon the ArcGIS Python API and do it all with requests and geopandas
  • Use the comparison dataframe to selectively update individual columns, not just rows (sketch below)
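
For the second idea, a sketch: DataFrame.compare() returns a (field, 'self'/'other') column MultiIndex, so the changed fields for each row can be pulled out and sent as minimal update payloads. This assumes fid is the layer’s ObjectID field, as in the examples above:

from arcgis.features import Feature

# build per-feature payloads containing only the fields that changed
features = []
for pin, row in comp_a.iterrows():
    # the non-null 'self' values are the new, source-side values
    changed = row.xs('self', level=1).dropna().to_dict()
    changed['fid'] = int(inner_a.loc[pin, 'fid'])
    features.append(Feature(attributes=changed))

result = dlayer.edit_features(updates=features)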

Questions?

Be in Touch!