flowchart LR S1[Table 1] S2[Table 2] S3[Table 3] T[Temporary Table / Join Layer] H[Hosted Layer] S1 & S2 & S3 --> T T --Overwrite--> H T --Truncate / Append--> H
An Alternative, Efficient ETL Strategy
This was originally written in 2022.
Some of the Python modules have changed since then, especially the arcgis
module.
Snippets here may no longer run as written.
arcgis
modulesqlalchemy
, openpyxl
, fiona
An organization (say, a county) has datasets that they wish to make available to the public, but…
The authoritative source(s) is/are:
How can updates be passed to the hosted layer to ensure that the public sees the most current information?
If you load the source data into a map using ArcGIS Pro (or via arcpy
), you can overwrite the hosted layer with something it.
This replaces literally everything about the service except for its ItemID and the service URL.
It’s like bulldozing and rebuilding your house to repaint the kitchen.
Much safer than overwriting! Also, pretty straightforward.
flowchart LR S1[Table 1] S2[Table 2] S3[Table 3] T[Temporary Table / Join Layer] H[Hosted Layer] S1 & S2 & S3 --> T T --Overwrite--> H T --Truncate / Append--> H
flowchart LR S1[Table 1] S2[Table 2] S3[Table 3] S1 --200 Updated\nAttributes--> H S2 --50 Updated\nAttributes--> H S3 --40 Updated\nGeometries--> H S3 --10 New Geometries--> H H[Hosted Layer]
FeatureLayer.query(as_df=True)
pd.read_sql()
GeoAccessor.compare()
?This was originally written before that method was added, but I’m happy to say that using compare(destination, source)
works well, and can simplify your code considerably.
We’ll still look at the entire process.
_merge
indicatorleft_only
: feature is only in source, i.e., is a new featureright_only
: feature is only in destination, i.e., should be deletedboth
: common to both sets, possibly an updateCreate a comparison frame and pass in the edits, as before. Use the _merge
column to select the common rows
update_p = merge_p.query("_merge == 'both'").drop(columns='_merge')
for df in [update_p, dest_p]:
df.set_index('pin', drop=False, inplace=True)
df.sort_index(inplace=True)
dest_p = dest_p[update_p.columns.to_list()].astype(update_p.dtypes)
comp_p = update_p.compare(dest_p.loc[update_p.index,:])
comp_p.head()
edit_flags = [
len(comp_p) > 0,
len(merge_p.query("_merge == 'left_only'")),
len(merge_p.query("_merge == 'right_only'"))
]
if any(edit_flags):
print(f'Adds:{len(merge_p.query("_merge == 'left_only'"))}, Edits: {len(comp_a)}, Deletes: {len(merge_p.query("_merge == 'right_only'"))}')
add_p = merge_p.query("_merge == 'left_only'").spatial.to_featureset()
update_p = merge_p.loc[comp_a.index,:].spatial.to_featureset()
delete_p = merge_p.query("_merge == 'right_only'").spatial.to_featureset()
p_adds = dlayer.edit_features(
adds = add_p,
updates = update_p,
deletes = delete_p
)
else:
print(f'No changes!')
If your source has a last_edited
date, you can use that filter the source instead, and skip the comparison process. (Comparing geometry is especially problematic, anyway.)
NOTE: This will not help you identify deletes. If your source does not track deleted features, you’ll need to use the entire table to compare.
requests
and geopandas
Questions?