I’ve got a soft spot for spatial joins. This is partly because they’re so darn useful, but also because a spatial join with raster data (e.g. the ArcGIS Sample tool) was one of the first GIS operation I coded up by hand, using Python and Numpy. Ah, those were the days …
But today, we’re going to take the easy route, and use libraries to do a spatial join with Clojure. Let’s find out how many fires were detected in the US since late 2000.
I’ve got some 46 million points representing fires detected by the MODIS sensors flying on NASA’s Terra and Aqua satellites over the last 11 years. You can get more info and all the data from the UMD FIRMS web site. We’ll be processing a 2.6gb textfile of fire points, and using polygon data from the Global Administrative Areas database (GADM).
I happen to have the GADM data sitting on a CartoDB instance, so let’s just grab some well-known-text in CSV format using the CartoDB SQL API:
http://your-account.cartodb.com/api/v2/sql/?q=SELECT iso,ST_asEWKT(the_geom)
AS geom FROM gadm0 WHERE iso='USA'&format=csv
That was easy! You can grab a zipped copy here if you want to follow along at home. I’ve stripped out the iso,geom
header line for convenience.
The file looks like this:
USA,"SRID=4326;MULTIPOLYGON(((179.585615158081
52.0173091888428,179.652040481567 52.0247783660889 ...)))"
A line from the fires data represents one 1km^2 pixel, and looks like this:
-18.735,144.626,324.8,1.1,1.0,05/13/2012,0030,T,78,5.0,297.2,24.9
A full description of the attributes is available on the FIRMS site. We’re only interested in the first two fields, latitude and longitude - the rest includes brightness, confidence, date, time, etc.
Add [cljts "0.1.0"]
to project.clj to get cljts and JTS. Check out this post for some simple use cases. Cljts pulls in the raw JTS library if you want to use it, but also gives you some convenient helper functions. Thank you sunng! Be sure to check out the jts and cljts docs.
Cljts has a nice read-wkt-str
function that we’ll be using here. Let’s set up the environment first:
Then we load the polygon, remembering to strip out some of the metadata provided by CartoDB:
Now we’ve got a JTS multipolygon object. Let’s create a fire point based on a coordinate:
Now we can check intersection:
That point happens to be in northeast Australia. So let’s try another random latlon, from Yellowstone National Park:
Now we need a spatial join function to handle the polygon attributes.
It works! Now we just need to loop through 46 million fires points, storing the count by iso code, and then we would have a spatial join.
So, looping through fire points line by line, intersecting each one with the polygon geometry in memory, would technically work. But it would take a while. On my 2010 iMac, intersect?
returns false
in as few as .0025 milliseconds. I suspect that intersect?
first checks whether a point is within the bounding envelope of the polygon of interest - that operation screams. If the point doesn’t fall within the bounding envelope, intersects?
immediately returns false. If it does, I ended up waiting 268 milliseconds on average. Granted, the border of the US and its territories is a complex polygon, with 40+ thousand vertices, but I would love to see it faster.
I don’t know enough (read anything) about computational geometry to know how hard it would be to do better than this. JTS uses uses a simple polygon intersection algorithm to check for point location. Tim Robertson, who helped inspire this exploration, has a nice implementation of a slab decomposition algorithm that is incredibly fast. But as I said at the beginning, for now I want to stick with libraries, hoping to retain some semblance of a general solution.
This exercise is going to be a problem. I happen to know (foreshadowing the next blog post) that of 46 million fires, 8.33 million fall within the USA’s bounding envelope, which apparently stretches across most of the northern hemisphere, from Alaska to perhaps Guam. From the GADM site:
So the question is, ignoring super-fast non-intersecting operations, how long should the exact intersection take?
That’s crazy! For completeness, note that ArcGIS chokes with a memory error after about 5 hours of work on a spatial join.
If only this ran in parallel … (stay tuned)
I’m satisfied with the ease of using JTS in Clojure thanks to cljts, but this level of performance just won’t do for the task at hand. Let’s not forget that I’ve simplified the question. My real question - how many fires were detected in each country? - would take ages to complete given the size of the bounding envelopes of the US and Russia alone. There has to be a better way.
Cljts is a great way to use JTS Topology Suite in Clojure, and has some convenient helper functions for IO and spatial relationships. But a naive loop through 46 million fires points to count up fires intersecting the US would take nearly 26 days using JTS’ intersection algorithm. A custom solution could be much faster, at the loss of generality. Look for a future post on trying this with Hadoop and Cascalog.