Hosted Copies, Targeted Updates

An Alternative, Efficient ETL Strategy

Note

This was originally written in 2022.

Some of the Python modules have changed since then, especially the arcgis module.

Snippets here may no longer run as written.

Overview

  1. What is a “hosted copy”?
  2. What not to do
  3. The Process
  4. Improvements!

Requirements

  • Python
    • arcgis module
    • depending on source data, sqlalchemy, openpyxl, fiona
  • access to source data (files, database)
  • Portal / AGOL authentication

“Hosted Copy”?

A common problem

An organization (say, a county) has datasets that they wish to make available to the public, but…

The authoritative source(s) may be:

  • Not available to register with ArcGIS Server
  • Too sensitive to expose outside your organization’s network
  • Not all in the same place (so a single SQL view layer is not possible)

A common solution

  1. The organization publishes a “hosted copy” of the data to ArcGIS Online / Portal.
  2. Data is loaded into ArcGIS Datastore and served from there.
  3. The public can see and interact with the dataset, but the authoritative source remains inaccessible to the rest of the internet.
  4. All is well!

A new problem!

How can updates be passed to the hosted layer to ensure that the public sees the most current information?

Common Approaches

Overwrite the Service

If you load the source data into a map using ArcGIS Pro (or via arcpy), you can overwrite the hosted layer with it.

This replaces literally everything about the service except for its ItemID and the service URL.

It’s like bulldozing and rebuilding your house to repaint the kitchen.
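
For reference, an overwrite from Python looks roughly like this (a sketch; the item ID and file path are placeholders):

from arcgis.gis import GIS
from arcgis.features import FeatureLayerCollection

gis = GIS(profile='personalAGOL')

# placeholder item ID for the hosted feature layer
item = gis.content.get('<item-id>')
flc = FeatureLayerCollection.fromitem(item)

# replaces the service's data (and nearly everything else) with the file's contents
flc.manager.overwrite('./data/parcels_export.zip')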

When to Overwrite

  • You have changed the number / order of layers in the service
  • You are making changes to relationship classes between layers and tables

When Not to Overwrite

  • Simple schema alterations
  • Updating features and attribute values
  • Making changes to visualization settings

Truncate / Append

  • Leaves the schema and settings alone
  • Empties the layer, resets the objectid
  • Loads in fresh data

Much safer than overwriting! Also, pretty straightforward.
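
In code, it’s roughly the following (a sketch that borrows the dlayer and p_df objects set up in the code section later):

# empty the layer; this also resets ObjectIDs
dlayer.manager.truncate()

# reload everything from the source dataframe
# (for very large layers, chunk the adds or use FeatureLayer.append())
dlayer.edit_features(adds=p_df.spatial.to_featureset())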

When to Truncate/Append

  • All or most features have been edited
  • Zero tolerance for errors
  • Some tolerance for unavailable data

When not to Truncate/Append

  • Very large layers with few edits
  • Zero tolerance for unavailable data

A Different Approach: Targeted updates

What’s different about this method?

  • Compares source / destination data, identifies changed rows
  • Identifies adds / deletes
  • Passes data directly into the destination layer without a separate upload or staging layer
  • Can run per source on subsets of fields

Example: Parcels

  • Parcels layer has about 60,000 rows
  • Source data comes from many tables, two databases
  • Source data is edited daily, but only on a few hundred rows

60,000 Edits, All at Once

flowchart LR

S1[Table 1]
S2[Table 2]
S3[Table 3]
T[Temporary Table / Join Layer]
H[Hosted Layer]

S1 & S2 & S3 --> T
T --Overwrite--> H
T --Truncate / Append--> H

300 Edits in 4 Operations

flowchart LR

S1[Table 1]
S2[Table 2]
S3[Table 3]

S1 --200 Updated\nAttributes--> H
S2 --50 Updated\nAttributes--> H
S3 --40 Updated\nGeometries--> H
S3 --10 New Geometries--> H

H[Hosted Layer]

Benefits

  • Large datasets can be refreshed in minutes
  • Public layer is never empty or unavailable
  • Updates can happen at any time without interruption
  • Total data transfer and server-side processing is dramatically reduced

Let’s Talk Code

Prep Work

#| echo: true
from arcgis.gis import GIS
from arcgis.features import GeoAccessor
import pandas as pd
from datetime import datetime

# Connect to AGOL
gis = GIS(profile='personalAGOL')

# Get Destination Layer
dest = gis.content.get('9dc51858aaa5475b8d3eafea84ee0e14')
dlayer = dest.layers[0]

dlayer

Pulling Source Data: Spatial Files

#| echo: true
p_df = GeoAccessor.from_featureclass('./data/parcels.shp')
p_df.sample()

Pulling Source Data: Non-Spatial Files

#| echo: true
s_df = pd.read_csv('./data/structures.csv')
a_df = pd.read_excel('./data/assessments.xlsx')

a_df.sample()

Other Sources?

  • Feature Services: FeatureLayer.query(as_df=True)
  • Databases: pd.read_sql()
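
Both in brief (a sketch; the service URL and connection string are placeholders):

from arcgis.features import FeatureLayer
import pandas as pd
from sqlalchemy import create_engine

# another organization's feature service
flayer = FeatureLayer('https://services.example.com/arcgis/rest/services/roads/FeatureServer/0')
f_df = flayer.query(as_df=True)

# a database table or view
engine = create_engine('postgresql://user:password@host/assessor')
db_df = pd.read_sql('SELECT pin, assessed_value FROM assessments', engine)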

Updating Features from Separate Sources

Assessments Data

  • Only certain fields
  • No shapes
  • Of many features, only a few are edited on any given day

Query the Destination Layer

#| echo: true
dest_a = dlayer.query(
  out_fields = a_df.columns.to_list(),
  as_df = True,
  return_geometry = False
)

dest_a.sample()

Merge the Source

#| echo: true
inner_a = a_df.merge(
  dest_a[['fid', 'pin']],
  how = 'inner',
  on = 'pin'
)

inner_a.sample()

Side Note: What about GeoAccessor.compare()?

This was originally written before that method was added, but I’m happy to say that using compare(destination, source) works well, and can simplify your code considerably.

We’ll still look at the entire process.

#| echo: true
a_df.spatial.compare(dest_a.spatial, 'pin')['modified_rows']

Compare

#| echo: true
# match the destination's columns and dtypes to the source subset
dest_a = dest_a[inner_a.columns.to_list()].astype(inner_a.dtypes)

# index both frames on the key field so compare() aligns rows
for df in [inner_a, dest_a]:
    df.set_index('pin', drop=False, inplace=True)
    df.sort_index(inplace=True)

comp_a = inner_a.compare(dest_a.loc[inner_a.index,:])
comp_a.head()

Selective Updates

#| echo: true
if len(comp_a) > 0:
  print(f'{len(comp_a)} changes identified.')
  # push only the rows the comparison flagged
  update_a = inner_a.loc[comp_a.index,:]
  a_result = dlayer.edit_features(updates=update_a.spatial.to_featureset())
  print(f"Updates: {len(a_result['updateResults'])}")
else:
  print('No changes!')

Spatial Stuff

Parcel Shapes

  • Pretty much the same, but with the geometry
  • Can be an issue if source / destination have different projections / precision
  • Adds and Deletes, too
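
If the projections differ, reproject the source to match the destination before comparing. A sketch, assuming the destination layer is in WGS 84 (wkid 4326):

# project the source dataframe in place to match the destination
# (needs arcpy or pyproj available)
p_df.spatial.project(4326)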

Query the Destination Geometry

#| echo: true
dest_p = dlayer.query(
  out_fields = ['pin', 'fid'],
  as_df = True,
  return_geometry = True
)

dest_p.head()

Reshape, Merge

#| echo: true
p_df.drop(columns=['date_creat', 'date_modif', 'globalid', 'OBJECTID', 'index'], inplace=True)

merge_p = p_df.merge(
  dest_p[['fid', 'pin']],
  how = 'outer',
  on = 'pin',
  indicator = True
)

merge_p.sample(3)

The Merged Table

  • _merge indicator
  • left_only: feature is only in source, i.e., is a new feature
  • right_only: feature is only in destination, i.e., should be deleted
  • both: common to both sets, possibly an update
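
A quick way to see the breakdown:

merge_p['_merge'].value_counts()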

Updates

Create a comparison frame and pass in the edits, as before. Use the _merge column to select the common rows.

update_p = merge_p.query("_merge == 'both'").drop(columns='_merge')

for df in [update_p, dest_p]:
    df.set_index('pin', drop=False, inplace=True)
    df.sort_index(inplace=True)

dest_p = dest_p[update_p.columns.to_list()].astype(update_p.dtypes)
comp_p = update_p.compare(dest_p.loc[update_p.index,:])
comp_p.head()

Apply Edits

# split the merged table into adds, updates, and deletes
adds_p = merge_p.query("_merge == 'left_only'")
deletes_p = merge_p.query("_merge == 'right_only'")

edit_flags = [
  len(comp_p) > 0,
  len(adds_p) > 0,
  len(deletes_p) > 0
]

if any(edit_flags):
  print(f"Adds: {len(adds_p)}, Edits: {len(comp_p)}, Deletes: {len(deletes_p)}")

  # new features: drop the merge indicator and the (empty) destination key
  adds_fs = adds_p.drop(columns=['_merge', 'fid']).spatial.to_featureset()
  # updates: the common rows the comparison flagged
  updates_fs = update_p.loc[comp_p.index,:].spatial.to_featureset()
  # deletes: a comma-separated string of ObjectIDs (here, the fid field)
  delete_oids = ','.join(deletes_p['fid'].astype(int).astype(str))

  p_result = dlayer.edit_features(
    adds = adds_fs,
    updates = updates_fs,
    deletes = delete_oids
  )
else:
  print('No changes!')

Alternate Route

If your source has a last_edited date, you can use that to filter the source instead and skip the comparison process. (Comparing geometry is especially problematic, anyway.)
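
A sketch of that filter, assuming the source table has a last_edited column and the timestamp of the previous successful run is at hand:

from datetime import datetime

# hypothetical: the last successful run, e.g. read back from a log
last_run = datetime(2022, 6, 1)

# only rows edited since then need to enter the update flow
recent_a = a_df[a_df['last_edited'] > last_run]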

NOTE: This will not help you identify deletes. If your source does not track deleted features, you’ll still need to compare against the entire table.

Further Enhancements

Simple

  • Add some logging
  • Track timestamps of successful runs and feed them into date-based filters for future runs (see the sketch below)
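
A sketch of the bookkeeping (the log and timestamp file names are assumptions):

import json
import logging
from datetime import datetime

logging.basicConfig(filename='etl.log', level=logging.INFO)

# read the timestamp of the last successful run, if one exists
try:
    with open('last_run.json') as f:
        last_run = datetime.fromisoformat(json.load(f)['last_run'])
except FileNotFoundError:
    last_run = None

# ... run the updates, filtering sources on last_run ...

# record this run for next time
logging.info('Run completed successfully')
with open('last_run.json', 'w') as f:
    json.dump({'last_run': datetime.now().isoformat()}, f)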

Complex

  • Abandon the ArcGIS Python API and do it all with requests and geopandas
  • Use the comparison dataframe to selectively update individual columns, not just rows (sketch below)
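
For the second idea, a sketch: DataFrame.compare() returns a (field, 'self'/'other') column MultiIndex, so the changed fields for each row can be pulled out and sent as minimal update payloads. This assumes fid is the layer’s ObjectID field, as in the examples above:

from arcgis.features import Feature

# build per-feature payloads containing only the fields that changed
features = []
for pin, row in comp_a.iterrows():
    # the non-null 'self' values are the new, source-side values
    changed = row.xs('self', level=1).dropna().to_dict()
    changed['fid'] = int(inner_a.loc[pin, 'fid'])
    features.append(Feature(attributes=changed))

result = dlayer.edit_features(updates=features)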

Questions?

Be in Touch!