Justin's Code Musings: The Mystery of the Pivot table

While working for a previous employer I found myself in the position of being the "SQL" expert. This came about partially because of my work with our reporting tool, but also because of attrition. So, one day a problem that had hampered my co-workers was handed off to me. The issue was with an application that was pulling images from a server and was taking upwards of 15 minutes to return for a certain customer. My co-worker had tracked down that the issue was with a database query and that the customer was having the problem because their database had grown to a very large size. My task was to "fix" the query to "work" on the larger set of data.

I dug into the SQL and found that the issue was with one sub-query that looked like this


Select
    DocId,
    Max(Case When AttrName="AttrA" Then AttrValue Else null End) As AttrA,
    Max(Case When AttrName="AttrB" Then AttrValue Else null End) As AttrB,
    Max(Case When AttrName="AttrC" Then AttrValue Else null End) As AttrC
From
    AttrTable

Now for those of you in the know you might realize that this is a Pivot table. At that point in time I had not come across a Pivot table before, so it took a little bit of pondering on what this query was doing. What I realized is that we had designed a table of name-value pairs (which went against what I had learned in my DB classes). The reason for this design was that the attributes required by each customer could be different and we did not want to have custom database designs for each one.

So I now had to figure out why this query gave such bad performance on large data sets and what if anything could be done to fix it. I decided to look at the query and figure out how the database would execute it. I figured it would first have to group the rows by the DocID, that should be pretty fast with indexing. Then it would have to pull the values out into the column that matched the name of that attribute while filling the other columns with nulls. Finally it would have to compress all rows with the same DocID into one by taking the Max of those columns, basically resulting in removing all the nulls and taking the actual value. Pretty neat, but you can quickly see that this is at best O(n) because it is basically creating the results row by row.

So I got to thinking if we cannot create the results row by row, then what about column by column. So I got to thinking of how we would get the results for just one column and came up with this.


Select
    DocId,
    AttrValue As AttrA
From
    AttrTable
Where
    AttrName = "AttrA"

Well this seemed pretty simple and should run quickly given an index on AttrName. Then all we have to do is Join all the columns together. However we need to ensure that we put nulls in for DocIDs that didn't have a particular attribute name, so we do Left Joins with the first Sourse being the table. It ended up looking something like this


Select
    main.DocId,
    ta.AttrA,
    tb.AttrB,
    tc.AttrC
From
    AttrTable main
    Left Join
    (Select
        DocId,
        AttrValue As AttrA
    From
        AttrTable
    Where
        AttrName = "AttrA") ta
    On main.DocId = ta.DocId
    Left Join
    (Select
        DocId,
        AttrValue As AttrB
    From
        AttrTable
    Where
        AttrName = "AttrB") tb
    On main.DocId = tb.DocID
    Left Join
    (Select
        DocId,
        AttrValue As AttrC
    From
        AttrTable
    Where
        AttrName = "AttrC") tc
    On main.DocId = tc.DocID

Granted this is not as pretty as the other query, but because it constructs the results column by column it take advantage of the indexes and runs in something close to O(log n). And the query that took 15 minutes in the application was reduced to less than one second.

There are a couple of things to learn here. First, just because you wrote something that is really compact doesn't mean that it's "good". Second, even in SQL you have to understand how a query is implemented in order to take full advantage of all the goodness the Database provides. Or to state it another way, just because a query returns what you want for a test data set doesn't mean it will work (and yes taking 15 minutes could be considered as not working) on a production data set.

Justin's Code Musings

Followers

Blog Archive

About Me

Monday, March 29, 2010

The Mystery of the Pivot table

No comments:

Post a Comment