Tuesday, March 30, 2010

Best Quote ever


public static List&lt;T&gt; BubbleSort&lt;T&gt;(this List&lt;T&gt; array)
    where T : IComparable&lt;T&gt;
{
    List&lt;T&gt; sortedArray = new List&lt;T&gt;(array);
    bool sorted;
    do
    {
        // Assume sorted until a swap proves otherwise; resetting this
        // each pass is what lets the loop terminate.
        sorted = true;
        for (int i = 0; i &lt; sortedArray.Count - 1; i++)
        {
            if (sortedArray[i].CompareTo(sortedArray[i + 1]) &gt; 0)
            {
                // Swap the out-of-order pair.
                T temp = sortedArray[i];
                sortedArray[i] = sortedArray[i + 1];
                sortedArray[i + 1] = temp;
                sorted = false;
            }
        }
    } while (!sorted);
    return sortedArray;
}


That is an implementation of Bubble Sort. It's one of the first sorts you learn about in Computer Science. It's simple to understand, but with its O(n²) running time it's one of the worst-performing sorting algorithms.
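For comparison, here is a sketch of the same early-exit bubble sort in Python. The swap counter is an addition of mine, just to make the quadratic behavior visible: a fully reversed list of n items needs n·(n−1)/2 swaps.

```python
def bubble_sort(items):
    result = list(items)   # work on a copy, like the C# version
    swaps = 0
    done = False
    while not done:
        done = True        # assume sorted until a swap proves otherwise
        for i in range(len(result) - 1):
            if result[i] > result[i + 1]:
                # Swap the out-of-order pair.
                result[i], result[i + 1] = result[i + 1], result[i]
                swaps += 1
                done = False
    return result, swaps

print(bubble_sort([5, 4, 3, 2, 1]))  # reversed input of 5 items: 10 swaps
```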

A professor I had for a Data Structures class stated the following about Bubble Sort.

"Bubble Sort is like a transvestite hooker in the French Quarter of New Orleans. It looks really good, but it isn't"

That's always been one of my favorite quotes. If you've got any better ones, let me know.

Monday, March 29, 2010

Attack of the TLAs

For those of you not in the know, a TLA is a three-letter abbreviation. You find a lot of TLAs in the computing world. TLA is also self-referential, in that it is itself a TLA; I assume that, as much as anything else, is why it doesn't stand for "two-letter abbreviation". Anyway, this is my story of how TLAs can get out of hand.

In a previous life I worked as a Systems Engineer. Basically, I configured applications in a work-flow and set everything up for each customer. Like any work-flow system, we had input and output. The vast majority of our input came from a scanner, as we were a document capture solution company. However, every so often you came across customers that already had a solution to capture the images and wanted to use our system for further processing. As it would happen, I was the Systems Engineer for the first customer that wanted to input XML into our system.

Now, we had a worker called GenPickup (short for Generic Pickup) that basically had a framework for picking up files. It would search a directory for files with a certain extension and process them once they were found. It was generic because you could plug in different DLLs that determined exactly how the picked-up files were handled. These DLLs were called Pickup Modules, or PUMs.

Before I go on with my story I need to give a little more background. In our work-flow system we had a number of workers that did not require any interaction with a user. We called them background workers. And because they shared a lot of common functionality, we had created a Background Worker Shell, or BWS. The BWS handled things like loading configuration files, writing to log files, and communicating with the work-flow system. The behavior of each worker was determined by a DLL that was plugged in. These DLLs would have names like DoStuff_bws.dll; the "_bws" was added to the end to indicate that the DLL was a plug-in to the BWS. Eventually the GenPickup worker was rewritten to be a BWS DLL as well.

Now we were in need of a worker that would pick up XML, so I was assigned to work with a developer to create it. For reasons I will not go into here, he decided to bypass creating a PUM for GenPickup and instead created a BWS worker. Even so, he decided that this was still a Pickup Module and named the DLL xml_pum_bws.dll.

Now, if that isn't enough TLAs for you: later, after I switched to being a Developer, I became the maintainer of the XML Pickup Worker. While looking through the code I noticed that he had named a lot of classes with XPB as a prefix, like XPBFile and XPBInfo. At first I was a little stumped as to what XPB was, until I looked at the name of the DLL again: Xml_Pum_Bws.dll. That's right, he created a TLA of TLAs. And that, ladies and gentlemen, is TLA overload.

The Mystery of the Pivot table

While working for a previous employer I found myself in the position of being the "SQL" expert. This came about partially because of my work with our reporting tool, but also because of attrition. So, one day a problem that had hampered my co-workers was handed off to me. The issue was with an application that was pulling images from a server and was taking upwards of 15 minutes to return for a certain customer. My co-worker had tracked down that the issue was with a database query and that the customer was having the problem because their database had grown to a very large size. My task was to "fix" the query to "work" on the larger set of data.

I dug into the SQL and found that the issue was with one sub-query that looked like this:


Select
    DocId,
    Max(Case When AttrName = 'AttrA' Then AttrValue Else Null End) As AttrA,
    Max(Case When AttrName = 'AttrB' Then AttrValue Else Null End) As AttrB,
    Max(Case When AttrName = 'AttrC' Then AttrValue Else Null End) As AttrC
From
    AttrTable
Group By
    DocId


Now, those of you in the know might recognize that this is a pivot query. At that point in time I had not come across a pivot table before, so it took a little pondering to work out what this query was doing. What I realized was that we had designed a table of name-value pairs (which went against what I had learned in my DB classes). The reason for this design was that the attributes required by each customer could differ, and we did not want to have a custom database design for each one.
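A runnable sketch of this name-value design may make the pivot clearer. Here it is in Python with sqlite3; the table and column names follow the ones above, but the sample documents and values are made up for illustration. Each document's attributes are scattered across rows, and the pivot query folds them back into one row per DocId:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE AttrTable (DocId INTEGER, AttrName TEXT, AttrValue TEXT)")
con.executemany(
    "INSERT INTO AttrTable VALUES (?, ?, ?)",
    [(1, "AttrA", "a1"), (1, "AttrB", "b1"),            # doc 1 has no AttrC
     (2, "AttrA", "a2"), (2, "AttrB", "b2"), (2, "AttrC", "c2")],
)

# The pivot: one output row per DocId, one column per attribute name.
# Max() collapses each group, discarding the Nulls from non-matching rows.
rows = con.execute("""
    SELECT DocId,
           MAX(CASE WHEN AttrName = 'AttrA' THEN AttrValue ELSE NULL END) AS AttrA,
           MAX(CASE WHEN AttrName = 'AttrB' THEN AttrValue ELSE NULL END) AS AttrB,
           MAX(CASE WHEN AttrName = 'AttrC' THEN AttrValue ELSE NULL END) AS AttrC
    FROM AttrTable
    GROUP BY DocId
""").fetchall()
print(rows)  # [(1, 'a1', 'b1', None), (2, 'a2', 'b2', 'c2')]
```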

So I now had to figure out why this query gave such bad performance on large data sets, and what, if anything, could be done to fix it. I decided to look at the query and work out how the database would execute it. It would first have to group the rows by DocId, which should be pretty fast with indexing. Then it would have to pull each value out into the column that matched the name of that attribute while filling the other columns with nulls. Finally it would have to collapse all rows with the same DocId into one by taking the Max of each column, essentially throwing away the nulls and keeping the actual value. Pretty neat, but you can quickly see that this is at best O(n) in the size of the table, because it builds the results row by row.

So I got to thinking: if we cannot create the results row by row, then what about column by column? I worked out how we would get the results for just one column and came up with this:


Select
    DocId,
    AttrValue As AttrA
From
    AttrTable
Where
    AttrName = 'AttrA'


Well, this seemed pretty simple, and it should run quickly given an index on AttrName. Then all we have to do is join all the columns together. However, we need to ensure that we put nulls in for DocIds that don't have a particular attribute name, so we do left joins, with the first source being the set of distinct DocIds from the table (using the table itself would give one row per name-value pair instead of one per document). It ended up looking something like this:


Select
    main.DocId,
    ta.AttrA,
    tb.AttrB,
    tc.AttrC
From
    (Select Distinct DocId From AttrTable) main
Left Join
    (Select
        DocId,
        AttrValue As AttrA
    From
        AttrTable
    Where
        AttrName = 'AttrA') ta
    On main.DocId = ta.DocId
Left Join
    (Select
        DocId,
        AttrValue As AttrB
    From
        AttrTable
    Where
        AttrName = 'AttrB') tb
    On main.DocId = tb.DocId
Left Join
    (Select
        DocId,
        AttrValue As AttrC
    From
        AttrTable
    Where
        AttrName = 'AttrC') tc
    On main.DocId = tc.DocId


Granted, this is not as pretty as the other query, but because it constructs the results column by column it takes advantage of the indexes: each column is gathered with an indexed search (something close to O(log n) per lookup) instead of a scan over every row. The query that took 15 minutes in the application was reduced to less than one second.
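Here is a small Python/sqlite3 check that the join-based rewrite returns the same rows as the pivot query. The schema and column names follow the ones in the post; the data and the index name are made up for the sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE AttrTable (DocId INTEGER, AttrName TEXT, AttrValue TEXT)")
# The index the column-by-column rewrite leans on (name is illustrative).
con.execute("CREATE INDEX IX_AttrName ON AttrTable (AttrName, DocId)")
con.executemany("INSERT INTO AttrTable VALUES (?, ?, ?)",
    [(1, "AttrA", "a1"), (1, "AttrB", "b1"),
     (2, "AttrA", "a2"), (2, "AttrC", "c2")])

pivot = con.execute("""
    SELECT DocId,
           MAX(CASE WHEN AttrName = 'AttrA' THEN AttrValue END) AS AttrA,
           MAX(CASE WHEN AttrName = 'AttrB' THEN AttrValue END) AS AttrB,
           MAX(CASE WHEN AttrName = 'AttrC' THEN AttrValue END) AS AttrC
    FROM AttrTable GROUP BY DocId
""").fetchall()

joined = con.execute("""
    SELECT main.DocId, ta.AttrA, tb.AttrB, tc.AttrC
    FROM (SELECT DISTINCT DocId FROM AttrTable) main
    LEFT JOIN (SELECT DocId, AttrValue AS AttrA FROM AttrTable
               WHERE AttrName = 'AttrA') ta ON main.DocId = ta.DocId
    LEFT JOIN (SELECT DocId, AttrValue AS AttrB FROM AttrTable
               WHERE AttrName = 'AttrB') tb ON main.DocId = tb.DocId
    LEFT JOIN (SELECT DocId, AttrValue AS AttrC FROM AttrTable
               WHERE AttrName = 'AttrC') tc ON main.DocId = tc.DocId
""").fetchall()

print(set(pivot) == set(joined))  # prints True
```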

There are a couple of things to learn here. First, just because you wrote something really compact doesn't mean it's "good". Second, even in SQL you have to understand how a query is executed in order to take full advantage of all the goodness the database provides. Or, to state it another way: just because a query returns what you want on a test data set doesn't mean it will work on a production data set (and yes, taking 15 minutes can be considered not working).

Sunday, March 28, 2010

Hello World

This is my "Hello World" post. More to come later.