Thursday, September 24, 2009

Send Your R-Trees Packing

So long ago now I was looking at R-Trees and how scary the ones created by GiST indexing in PostGIS looked. I had actually already written a packing algorithm at the time - which I called Median-Split packing, which made some pretty nice looking indexes. Here is a picture of the index of the same data I was using before (the roads of British Columbia) when packed using Median-Split:


And here is a zoomed-in look at the densest area, the Greater Vancouver Area - approximately one third of all the road segments in the province are within this area which is on the order of 1/100th of the area of the province.


I have yet to do empirical tests to determine if the packed index works any better, but from a purely aesthetic point of view I'm happy with my packing algorithm. It works in O( n*log(n) ) time, using an algorithm similar to quick-sort. I am confident there will be a measurable improvement in performance in terms of the number of page-accesses required for the packed index vs. the regularly constructed index, however the real question is, once caching and what-not is factored in, can you see a significant difference in query time in the database.

Lots of other people have thought about packing R-Tree indexes before, STR-Pack and Hilbert-Sort do perform similar operations in similar time, but I feel strongly that Median-Split does a better job than they do, particularly for region queries, which are common for map rendering (as opposed to point queries). I don't yet have any proof of this but it just "feels good". To my knowledge no-one else has published an algorithm that works the way Median-Split does - I may look into publishing it.

Shoehorning an index packing mechanism into PostgreSQL is a daunting task, so before I go down that road I'm going to do some simulations to determine if I'm right about the performance in terms of page-accesses.

1 comment:

Ray Cast said...

Have you had a look at the R*-Tree?
I thought that everybody was using the R*-Tree, but apparently I was wrong, and people have fallen for the "oh, there is a linear time heuristic" trap.
Give it a try.