Sorting behavior

One of the best parts of my job is explaining database concepts to non-database folks: developers, managers, CEOs, my wife, etc. One lesson I’ve had to explain numerous times is that SQL Server doesn’t guarantee that data will come back in any particular order unless you explicitly ask for it in that order. Even if you run a query a million times and the data always comes back the same way, if you haven’t explicitly told SQL Server that you need it in a particular order, it may come back differently the next time.

 

Developers are often really smart folks and they notice patterns. One pattern they notice is that if you select rows from a table, they will usually come back sorted on the primary key.

 

For example, let’s create the following table.

CREATE TABLE [dbo].[t1]
  (
     [col1] INT NOT NULL ,
     [col2] INT NULL,
     [col3] NVARCHAR(10) NULL,
     [col4] DATE NULL,
     PRIMARY KEY CLUSTERED ( [col1] ASC ) WITH ( PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON ) ON [PRIMARY]
  )
ON [PRIMARY];

GO

Notice the primary key – [col1].

PRIMARY KEY CLUSTERED ( [col1] ASC )

 

And now we’ll populate it with some data.

INSERT INTO [dbo].[t1]
            ([col1],
             [col2],
             [col3],
             [col4])
SELECT 1,
       150,
       'aaa',
       '2015-03-01'
UNION
SELECT 2,
       160,
       'bbb',
       '2015-03-07'
UNION
SELECT 3,
       170,
       'ccc',
       '2015-03-10'
UNION
SELECT 4,
        180,
       'ddd',
       '2015-03-20';

 

Now, let’s run a simple query on a subset of the columns – [col1], [col2] and [col4].

SELECT [t].[col1],
       [t].[col2],
       [t].[col4]
FROM   [dbo].[t1] AS [t];

[Screenshot sort_order_1: the result set comes back in [col1] order]

 

The data comes back in the order of the primary key. This is because the data is stored in the physical database in a way that makes it very easy for the storage engine to return the data sorted. (Technically, it’s not stored in sorted order, but the details of the actual storage are a little too involved for a short blog post like this.)

 

If nothing else about the database changes, you can run that query all day and the data will always come back sorted. You can even add, delete and update data and it will still come back in sorted order. Let’s prove this by adding a couple rows, updating a row and deleting a row.

UPDATE [dbo].[t1]
SET col1 = 10
WHERE [col1] = 2

INSERT INTO [dbo].[t1]
            ([col1],[col2],
             [col3],
             [col4])
SELECT 1000,8,
       'eee',
       '2015-04-17'
UNION
SELECT 7,97182,
       'xyz',
       '2015-10-02'

DELETE [dbo].[t1]
WHERE [col1] = 3

SELECT [t].[col1],
       [t].[col2],
       [t].[col4]
FROM   [dbo].[t1] AS [t];


[Screenshot sort_order_2: the results are still in primary key order after the modifications]

The potential for pain in this is that SQL Server does not – and should not! – guarantee that data is returned in any particular order unless an explicit ORDER BY clause is added to the query. There are situations where SQL Server will return the data in a different order because that gets the data back to the client faster.

 

And if a developer creates, for example, a report or a data entry screen that relies on the data coming back in a particular order, and they do not explicitly instruct SQL Server to return the data in that order – or do the ordering in the report or data entry screen itself – their beautiful report or screen may look confusing or even lose functionality when the conditions arise that cause SQL Server to return data in non-key order.

 

Here’s a fairly realistic scenario where the data may come back in a different order after running the same query over and over again.

 

Developer #1 writes a report assuming the data comes back in Primary Key sorted order.

 

Developer #2 is writing a similar report, but the client needs the order sorted in a different manner, so she uses explicit sorting – either in the database layer using an ORDER BY clause or in the report. The consumers of the report complain to developer #2 that the report isn’t running fast enough. Developer #2 decides to add a covering index to support the report. Since Developer #2 needs to sort all her reports explicitly, she may think other folks are doing the same thing. Perhaps it’s even in the development standards to always explicitly sort your data as needed. But we’re all human and Developer #1 forgot about that section of the standard.

 

So, Developer #2 deploys the following covering index.

CREATE INDEX [sort_it]
  ON [dbo].[t1]([col2] DESC)
  INCLUDE([col4]);

 

Now, let’s run our select statement again and see what happens.

SELECT [t].[col1],
       [t].[col2],
       [t].[col4]
FROM   [dbo].[t1] AS [t];

GO

 

[Screenshot sort_order_3: the same query now returns the rows in a different, non-key order]

Wow! Now the same query is returning the data in a completely unexpected order. (Well, unexpected to Developer #1!) Why is that? Because the SQL Server Query Optimizer knows it can get the required result set without scanning the full table. Let’s take a look at the query plan.

[Screenshot sort_order_plan_nc_index: the execution plan shows a scan of the new nonclustered index rather than the clustered index]

As you can see, the Query Optimizer doesn’t even bother looking at our table as it knows all the data it needs for this query is stored in the index. (That’s what makes it a covering index. An index is said to cover a query when it has all the data needed for the query.) This behavior is really helpful in scenarios where you might have dozens of columns on a table and you only need a small subset of columns for a given query. The storage engine doesn’t need to pull all those unnecessary columns back if it can get all the data it needs from the covering index. (Remember, the core SQL Server Engine reads pages with full rows on them rather than specific columns. Columnstore indexes work differently in that they read data by columns instead of rows. But let’s stick with the core engine as that’s what most of us are using.)
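Just to show that the ordering really is tracking the optimizer’s index choice – and not to suggest this as a fix! – you can force the clustered index with a table hint and watch the rows come back in key order again. This is purely a demonstration; the real fix is in the morals below.

-- Demonstration only, not a fix: forcing the clustered index (index id 1)
-- makes the scan happen on the clustered index instead of [sort_it],
-- so the rows happen to come back in key order again.
SELECT [t].[col1],
       [t].[col2],
       [t].[col4]
FROM   [dbo].[t1] AS [t] WITH (INDEX(1));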

 

There are other conditions under which the sort order might change, but this happens to be one that is really easy to demonstrate. Another reason your sort order might change is a so-called Merry-go-round scan. It’s a performance feature built into Enterprise Edition of SQL Server and you can read about it here.

 

Another cause for the order to be different is documented in this post and this post. Although that is also caused by an index, it’s behavior you’ll see right from the beginning – it’s a side effect of the table schema – instead of at a later point in time due to performance tuning. Under the hood, though, they are very similar cases, even though they look quite different on the surface. My intuition is that if someone sees this behavior “in the wild” it will more likely be due to the creation of a covering index with explicit ordering than due to indexes implicitly created by unique constraints. Your mileage may vary, though.

 

I seem to recall that there are cases where the query engine will return the results in the order the physical pages are written on the disk – meaning that chunks of data will be sorted, but those chunks may come back in an arbitrary order. So, you might get rows 100-199 in sorted order, followed by rows 500-599 in sorted order, followed by rows 1-99 in sorted order, etc. Sadly, I can’t recall when that will happen or why. If anyone knows why this might happen – or if this recollection is just my brain misfiring – let me know!

 

Moral of the story #1 – if you need your data to come back in a particular order, you really need to sort the data explicitly. If you have development standards, this should be added to them. If you do code reviews, this should be on your checklist.
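For our demo table, that looks like this – with an explicit ORDER BY, the rows come back in key order no matter which index the optimizer decides to use.

SELECT [t].[col1],
       [t].[col2],
       [t].[col4]
FROM   [dbo].[t1] AS [t]
ORDER  BY [t].[col1];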

 

Moral of the story #2 – even a seemingly innocuous change like adding an index can have surprising and negative impacts on your system, so test all your changes! Do full regression testing before a release!

 

Moral of the story #3 – it pays to know your SQL Server internals.


Schema changes and row size

I was working on a stored procedure that was misbehaving on a rather massive and embarrassing scale. The code seemed solid. The data model made sense. And yet… Every time I ran this proc it crashed and burned. There were some odd things going on and I knew I needed to get under the hood for a bit.

 

It had been a while since I had done any serious study of SQL Server internals, so I figured it was a good time to do so. I had a copy of Dmitri Korotkevitch’s Pro SQL Server Internals and I thought I’d skim through it as I watched test runs of my proc failing miserably.

 

In the first chapter, I found something useful that I had either forgotten or never known. Mr. Korotkevitch writes: “Unfortunately, table alteration never decreases the size of a data row.” (Emphasis in original.) He quite generously provides some code to prove this point. If you start out with a table full of BIGINTs and then change them all to INTs or even TINYINTs, the rows will still be the same size.
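If you want to see this for yourself, here’s a minimal sketch of the kind of before-and-after check you can run – the table name is made up for the demo, and the measurement comes from avg_record_size_in_bytes in sys.dm_db_index_physical_stats. If the book’s claim holds, the record size should stay the same after the ALTER and only drop after a rebuild.

-- Table name is illustrative.
CREATE TABLE [dbo].[row_size_demo]
  (
     [id]      INT PRIMARY KEY,
     [big_col] BIGINT NOT NULL
  );

INSERT INTO [dbo].[row_size_demo] ([id], [big_col]) SELECT 1, 42;

-- leaf-level record size before any alteration
SELECT [avg_record_size_in_bytes]
FROM   [sys].[dm_db_index_physical_stats](DB_ID(), OBJECT_ID(N'dbo.row_size_demo'), 1, NULL, 'DETAILED')
WHERE  [index_level] = 0;

ALTER TABLE [dbo].[row_size_demo] ALTER COLUMN [big_col] INT NOT NULL;

-- same size as before the ALTER (the row has not shrunk)...
SELECT [avg_record_size_in_bytes]
FROM   [sys].[dm_db_index_physical_stats](DB_ID(), OBJECT_ID(N'dbo.row_size_demo'), 1, NULL, 'DETAILED')
WHERE  [index_level] = 0;

ALTER TABLE [dbo].[row_size_demo] REBUILD;

-- ...and smaller only after the rebuild
SELECT [avg_record_size_in_bytes]
FROM   [sys].[dm_db_index_physical_stats](DB_ID(), OBJECT_ID(N'dbo.row_size_demo'), 1, NULL, 'DETAILED')
WHERE  [index_level] = 0;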

 

The tables in my database are pretty big and I was wondering if there were some table alterations that might have given me some wasted space. So, I took his query and modified it a bit to ferret out any tables that might benefit from an ALTER TABLE REBUILD.

 

I ended up with the following query.

 

 


WITH [offset_to_expected]
AS (SELECT [t].[name]              AS [table_name],
           [t].[object_id],
           [c].[column_id],
           [c].[name]              AS [column_name],
           [pc].[leaf_offset]      AS [offset_in_row],
           [pc].[max_inrow_length] AS [max_length],
           LAG([pc].[max_inrow_length], 1, 0)
             OVER(
               PARTITION BY [t].[object_id]
               ORDER BY [c].[column_id]) AS [prev_max_length],
           LAG([pc].[leaf_offset], 1, -1)
             OVER(
               PARTITION BY [t].[object_id]
               ORDER BY [c].[column_id]) AS [prev_offset_in_row],
           LAG([pc].[max_inrow_length], 1, 0)
             OVER(
               PARTITION BY [t].[object_id]
               ORDER BY [c].[column_id])
           + LAG([pc].[leaf_offset], 1, 0)
               OVER(
                 PARTITION BY [t].[object_id]
                 ORDER BY [c].[column_id]) AS [expected_offset_in_row]
    FROM   [sys].[system_internals_partition_columns] AS [pc]
           INNER JOIN [sys].[partitions] AS [p]
                   ON [pc].[partition_id] = [p].[partition_id]
           INNER JOIN [sys].[columns] AS [c]
                   ON [c].[column_id] = [pc].[partition_column_id]
                      AND [c].[object_id] = [p].[object_id]
           INNER JOIN [sys].[tables] AS [t]
                   ON [c].[object_id] = [t].[object_id]
    WHERE  [p].[index_id] = 1 --clustered index only
           AND [pc].[leaf_offset] > 0 --simple case of fixed-length, in-row data
   )
SELECT [ote].[table_name],
       [ote].[object_id],
       [ote].[column_id],
       [ote].[column_name],
       [ote].[offset_in_row],
       [ote].[expected_offset_in_row]
FROM   [offset_to_expected] AS [ote]
WHERE  [ote].[offset_in_row] > [ote].[expected_offset_in_row]
       AND [ote].[expected_offset_in_row] > 0;

 

 

The basic principle is that we are comparing the OFFSET of a column to the combination of the OFFSET and MAX_LENGTH of the previous column in the table. To do this we need to use the window function LAG. Sadly, window functions are evaluated – logically, at least – after the FROM and WHERE clauses, so if we want to use the result of the OFFSET calculation in a filter, we need to put the results in some sort of temporary object.

 

Why are window functions evaluated after the FROM and WHERE clauses? Wouldn’t it be a lot more useful to evaluate them earlier so they could be used more flexibly? I’m no expert on the subject, but if you think about it a bit, you’ll see that window functions are basically giving you a way to compare the data in a given row to some subset of the total dataset returned. How could you do that if the total dataset had not already been fully specified? And to fully specify that dataset, you’d basically be evaluating the FROM and WHERE clauses.
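Here’s a quick illustration against a made-up table of why the temporary object is needed: you can’t reference LAG directly in the WHERE clause, but you can filter on it once it has been materialized in a CTE or derived table.

-- [dbo].[some_table] is hypothetical; this is just to show the pattern.
-- This is rejected with "Windowed functions can only appear in the SELECT or ORDER BY clauses":
-- SELECT [id] FROM [dbo].[some_table] WHERE LAG([id]) OVER (ORDER BY [id]) IS NOT NULL;

-- This works: evaluate LAG first, then filter on its result.
WITH [with_prev]
AS (SELECT [id],
           LAG([id]) OVER (ORDER BY [id]) AS [prev_id]
    FROM   [dbo].[some_table])
SELECT [with_prev].[id],
       [with_prev].[prev_id]
FROM   [with_prev]
WHERE  [with_prev].[prev_id] IS NOT NULL;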

 

So, I decided to put the results of my LAG calculations in a CTE. You could easily use a temp table or some other object, but I tend to like the clarity of the CTE syntax. (If you have a huge data set, temp tables often perform better, but for the purposes of this demo, that’s not relevant.)

 

Now, I knew up front that the query had some limitations. For one, it only works for fixed-length, in-row data – e.g., numbers, dates, fixed-length character data, etc. But I’m working on a data warehouse at the moment and most of the data is numeric, so that’s not a big deal for me. (If anyone cares to come up with a solution that would work for variable-length data, I’d love to hear about it.)
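If I remember my internals reading correctly, variable-length columns show up in sys.system_internals_partition_columns with negative leaf_offset values, so something along these lines should at least list the columns my analysis is skipping – treat it as a starting point rather than gospel.

-- Rough sketch, assuming variable-length columns carry negative leaf_offset values.
SELECT [t].[name] AS [table_name],
       [c].[name] AS [column_name],
       [pc].[leaf_offset]
FROM   [sys].[system_internals_partition_columns] AS [pc]
       INNER JOIN [sys].[partitions] AS [p]
               ON [pc].[partition_id] = [p].[partition_id]
       INNER JOIN [sys].[columns] AS [c]
               ON [c].[column_id] = [pc].[partition_column_id]
                  AND [c].[object_id] = [p].[object_id]
       INNER JOIN [sys].[tables] AS [t]
               ON [c].[object_id] = [t].[object_id]
WHERE  [p].[index_id] = 1 --clustered index only
       AND [pc].[leaf_offset] < 0; --variable-length, in-row columns skipped by the analysis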

 

Let’s see it in action. For demo purposes, create the following table in an empty database.

 

CREATE TABLE [t1]
  (
     [col1] BIGINT PRIMARY KEY,
     [col2] BIGINT,
     [col3] INT,
     [col4] INT,
     [col5] TINYINT,
     [col6] SMALLINT,
     [col7] DATETIME,
     [col8] SMALLDATETIME
  );

 

 

I run the offset analysis query and get no rows back. There have been no schema changes so there’s nothing to cause the offsets to be unexpected. Now, I change some columns to a smaller data type.

 

ALTER TABLE [t1]
ALTER COLUMN [col3] SMALLINT;

GO

 

And then I re-run the offset analysis query.

[Screenshot t1_comes_back_20151111_1: the offset analysis now returns rows for table t1]

 

Now I see that the offsets are unexpected and that indicates that the table might benefit from a rebuild.

 

I execute a rebuild.

 

ALTER TABLE [dbo].[t1] REBUILD;

 

And now I can re-run the offset analysis and no results come back.

 

But there is another limitation. If I increase the size of a column then SQL Server may change the internal column order and that, too, will give me some unexpected offsets. To demonstrate this, I will create another table.

 

CREATE TABLE [t2]
  (
     [col1] BIGINT PRIMARY KEY,
     [col2] BIGINT,
     [col3] INT,
     [col4] INT,
     [col5] TINYINT,
     [col6] SMALLINT,
     [col7] DATETIME,
     [col8] SMALLDATETIME
  );

 

I run the offset analysis query and nothing comes back. But if I increase the size of a column like so:

ALTER TABLE [dbo].[t2]
ALTER COLUMN [col4] BIGINT;

Then, when I re-run the query, I can see that the new table comes back.

 

[Screenshot t2_comes_back_20151111_1: the offset analysis now returns rows for table t2]

Now this is a pretty serious limitation. If you are in an environment where you do a lot of schema changes using the ALTER TABLE syntax (as opposed to using tools like E/R Studio that rebuild tables to accommodate schema changes), then you might get a lot of false positives. But I encourage you to play around with the query anyway, if only to get you thinking about how the internal representation of your data might be very different from what you expect. Look up the system views and DMVs – learn what they do and see if you can come up with some of your own ideas. Maybe start with sys.partitions. And let me know what you discover!
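As a starting point, even something this simple – run against the demo tables above – shows what sys.partitions keeps track of: one row per partition, per index, with a row count.

SELECT OBJECT_NAME([p].[object_id]) AS [table_name],
       [p].[index_id],
       [p].[partition_number],
       [p].[rows]
FROM   [sys].[partitions] AS [p]
WHERE  [p].[object_id] IN ( OBJECT_ID(N'dbo.t1'), OBJECT_ID(N'dbo.t2') );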

So, what system views, DMVs or DMFs have you found useful?

 


Automating ER/Studio with C# code

I really enjoy data modeling and I enjoy using the various modeling tools I’ve come across. All of them have their charms and their drawbacks, but I’ve yet to use one that works the way I want it to out of the box. Anyway, at my current job I’m using ER/Studio Data Architect 9.7. It’s a fine tool.

But sometimes you need to move fast. Someone wants to see an ERD at a particular meeting and you haven’t made all the changes you need to make. So, you cut some corners. However, as in so many fields and disciplines, there are activities you can add to your work as you go which cost you almost nothing, but which can be a big pain to go back and do at a later date.

In my current project, I was asked to change a bunch of attribute names in the logical model. No big deal, right? Well, I had to push through this operation rather quickly. But then, when I generated the physical model and then the physical database objects, I noticed that a lot of the column names didn’t match the attribute names in the logical model.

This is because ER/Studio allows you to enter an attribute name and give it a separate column name. This can be quite useful: you might have prefixes and abbreviations in your column names that you want to hide from the subject matter experts reviewing your logical model.

[Screenshot attribute_versus_column: an ER/Studio attribute whose attribute name and column name differ]

But if you want those names to stay the same and you change a whole bunch at once, it might cost you a lot of time to re-open all those dialogue boxes and update the default column names. Had you done it the first go-round, it would have cost you only a few seconds for each attribute: you could just copy and paste the change you make in one text box into the other. The biggest cost in terms of time is opening the entity, selecting the attribute and opening that dialogue box.

When it comes to grunt work like this, I tend to script out as much of the work as possible. When I worked with Sybase PowerDesigner I used PowerShell a lot. (You can see a sample of the work I did with that here and a more efficient version of the same script here.) But for some reason I couldn’t get the ER/Studio assemblies to load in PowerShell. So, I turned to C#.

I decided to post the code not because it’s great C# code, but because when you’re playing around with something like ER/Studio automation, it’s good to have an example to work from – even if it’s not the most elegant example.

Below is my C# code to force the default column names to match the attribute names in an ER/Studio data model. (Don’t forget to add ER/Studio assemblies to your project!)

[Screenshot er_studio_ref: the ER/Studio assembly reference added to the Visual Studio project]

As always, this code is offered “AS-IS”. Please save your work, back up your files, etc., before you use it. Use it on a test copy of your DM1 file and take all the usual precautions.


using System;
using System.Collections.Generic;
using System.Linq;
using System.Data;
using System.Text;
using System.Data.OleDb;
using ERStudio;
using Microsoft.Office.Core;
using Excel = Microsoft.Office.Interop.Excel;


namespace er_studio_automation
{
    class er_studio_automator
    {
        private ERStudio.Application app;
        private ERStudio.SubModel smod;
        private ERStudio.Diagram diag;
        private ERStudio.Model mod;


        private void FixDefaultColumnNames()
        {
            ERStudio.Entity ent;
            foreach (ERStudio.EntityDisplay entD in smod.EntityDisplays())
            {
                ent = entD.ParentEntity();
                foreach (ERStudio.AttributeObj attr in ent.Attributes())
                {
                    //Console.WriteLine("attr " + attr.AttributeName + " -- " + attr.ColumnName);

                    if (String.Compare(attr.ColumnName, attr.AttributeName, false) != 0)
                    {
                        attr.ColumnName = attr.AttributeName;
                    }
                     
                }

            }
            diag.SaveFile("");

        }
        public er_studio_automator(string diagram_path, string model_name, string sub_model_name)
        {
            this.app = OpenApp();

            this.diag = OpenDiagram(diagram_path);
            this.mod = OpenModel(model_name);
            this.smod = OpenSubModel(sub_model_name);


        }

        static int Main(string[] args)
        {

            string diagram_path;
            string model_name;
            string sub_model_name;
            string action;
            string file_path;

            

            // Test if input arguments were supplied: 
            if (args.Length < 4)
            {
                System.Console.WriteLine("Please enter a \\-escaped diagram path, a model name and a submodel name");
                return 1;
            }

            diagram_path = args[0];
            model_name = args[1];
            sub_model_name = args[2];
            action = args[3];

            if (args.Length > 4)
            {
                file_path = args[4];
            }
            else
            {
                file_path = "";
            }

            er_studio_automator my_app = new er_studio_automator(diagram_path,model_name,sub_model_name);
            //my_app.OpenDiagram()
            switch (action)
            {

                case "fix_default_column_names":
                    my_app.FixDefaultColumnNames();
                    break;
 
                
                default:
                    System.Console.WriteLine("Command not recognized");
                    break;
            }

            //my_app.ValidateRelationships();
            
            my_app.CloseApp();
            return 0;

        }
        private void CloseApp()
        {
            //this.app.Quit();
            this.app.CloseDiagram(this.diag.FileName);
            this.app.Quit();
        }
        private ERStudio.Application OpenApp()
        {
            ERStudio.Application app = new ERStudio.Application();
            app.HideWindow();
            return app;

        }
        private ERStudio.Diagram OpenDiagram(string FileName) 
        {
            ERStudio.Diagram diag = app.OpenFile(FileName);
            return diag;
        }
        private ERStudio.Model OpenModel(string ModelName)
        {
            ERStudio.Model mod = diag.Models().Item[ModelName];
            return mod;
        }

        private ERStudio.SubModel OpenSubModel(string SubModelName)
        {
            
            
            ERStudio.SubModel smod = mod.SubModels().Item[SubModelName];
            return smod;
            
        }
        
    }
}


This code assumes you’re running from the command-line. If you are running in debug mode from the Visual Studio IDE, you can put the appropriate command-line parameters in the project properties by right-clicking on the project in the solution explorer and selecting properties. The parameters are:

  1. Path to the DM1 file
  2. The model name
  3. The sub-model name
  4. A command. I’ve only put the code for one of my commands – the one that forces the default column names to match the attribute names – in this blog post. The command is fix_default_column_names.
  5. An optional file path. I use this with some other commands I’ve written which import Excel spreadsheets or spit them out.

Now, my model name is “Logical” and my sub model is called “Main Model”. My DM1 file is called data_warehouse_v1.DM1 and let’s say its path is G:\DW\data_warehouse_v1.DM1. Assuming your .exe file ends up in G:\DW\bin and is called automation.exe then the syntax to call this program from the command-line is:

G:\DW\bin\automation.exe G:\DW\data_warehouse_v1.DM1 “Logical” “Main Model” fix_default_column_names

You need to be sure that ER/Studio is not running when you execute this code! If it is, the code will crash and you’ll be left with an invisible instance of ER/Studio running on your machine. You can kill it through the task manager or with this PowerShell one-liner:


get-process | where-object {$_.Name -match "Erstudio"} | foreach-object {stop-process $_.Id}

Play around with it. Let me know what you think. If you can’t get it to work or have some ideas to make it better or just want to point something out, let me know.

So what are y’all automating these days?


SQL Server install fails due to SQL Browser not starting

While trying to install SQL Server 2012 as a named instance I got the following error in the error log:

Exception type: Microsoft.SqlServer.Configuration.Sco.ScoException
Message:
Service ‘SQLBrowser’ start request failed.
HResult : 0x84bb0001
FacilityCode : 1211 (4bb)
ErrorCode : 1 (0001)
Data:
Feature = SQL_Browser_Redist_SqlBrowser_Cpu32
Timing = Startup
DisableRetry = true

After reading this article I suspected the problem might be due to a network activity monitor installed on the machine.

Uninstalling the activity monitor was not an option. However, SQL Browser is not needed when there is only a default instance running on the machine, and since no other instances were running on this machine, I decided to install SQL Server as a single default instance.

This work-around allowed the install to continue successfully. I may need to revisit this issue if I ever need to install a second instance of SQL Server on this machine.


Broaden your vision

Where there is no vision, the people perish
(Proverbs 29:18A, KJV)

It’s old wisdom but still valid. When you have no direction you flounder and sometimes the people around you flounder, too. This is true in the tech world as much as it is in more obviously social arenas.

A corollary of this is that if your vision is a bad one then the people may just barely scrape by.

A while back I posted some code that imported documentation from a CSV file and used that data to update a Sybase PowerDesigner data model. I mentioned that the code was slow because you were potentially iterating through almost the entire data model for each piece of column documentation you imported. I was pretty sure there was some method of the PowerDesigner COM object that held the key to fixing my performance problem. That may have been true, but I managed to solve it by changing my vision. I moved from a very limited vision to a broader vision, and this fixed things.

I was pretty interested in learning a lot about the PowerDesigner COM object. I was more interested in this than in solving my actual problem – i.e., importing documentation into the data model. I was so focused on getting the PowerDesigner COM object to do interesting things that I didn’t pay much attention to the less nifty parts of my project – the lowly, old CSV file.

Iterating through a PowerDesigner data model through the COM interface is slow but interesting. Pulling a CSV file into an in-memory hash object and searching the hash object is less interesting. But it allowed me to solve the problem significantly faster. On my machine, when the CSV file is pulled into an in-memory hash object and the PowerDesigner data model is iterated over only once, the code executes in about a tenth of the time it takes to iterate over the PowerDesigner data model multiple times and the CSV file only once.

Once I stepped back and looked at things more objectively it was easy to see how to make this thing run faster. Once my vision became broader – i.e., solving the problem rather than solving the problem by getting PowerDesigner to do something interesting – I was able to see the obvious solution.

The root of the problem was that my vision was too specific.

This seems to happen a lot in the tech world. XML is a great answer to a lot of different problems but a number of years ago people were using it to solve all sorts of problems to which it wasn’t suited. It seemed almost as if there was an unwritten mandate in a number of organizations that if you couldn’t fit XML into your application then you wouldn’t get your project funded. Developers were trying like mad to get that acronym on to their resumes, too, and that didn’t help. A lot of people got soured on XML or believed it was all hype because of that. The dust has settled and XML is a mature technology with a variety of uses but it was hurt by the narrowing of vision that came along with the XML boom.

Anyway, I’ll post the updated code below. It has two functions which do the same work. One of them does it by iterating over the PowerDesigner data model one time and finding the appropriate CSV data in an in-memory hash object. That function is called test_hash. There is another function called test_model_iteration that basically uses the algorithm in the original post. I’ve also included some helper functions that are used by test_model_iteration. The script outputs the start time and end time of each call so you can see for yourself that the in-memory hash approach is much faster.

As in the original example, the following variables need to be set:

$pathToDocumentation is a path to the CSV file with a column called Column and a column called Description

$PathToModel is a path to the Sybase PowerDesigner model

$tableToUpdate is the name of the table whose documentation comments you are updating.

Does anyone else have any stories about how the narrowness or broadness of their vision affected their work?

Please let me know if you have any questions about the code or have problems adapting it to your situation. I can’t promise I’ll have time to help you but I might be able to offer a comment or two.


Function GetTable($Model,$table){

$Model.Children | where-object{$_.Name -eq $table}

}

Function GetColumn ($Model,$table, $column){

GetTable -Model $Model -table $table | foreach-object{$_.Columns} | where-object{$_.Name -eq $column}

}

Function GetColumnComment($Model,$table, $column){

GetTable -Model $Model -table $table | GetColumn -Model $Model -Table $table -Column $column | foreach-object{$_.Comment}

}

Function SetColumnComment($Model,$table, $column, $comment){

GetTable -Model $Model -table $table | GetColumn -Model $Model -Table $table -Column $column | foreach-object{

$_.Comment = $comment

}

}

function test_hash{

foreach ($row in $documentaion_raw){

$documentaion[$row.Column] = $row.Description

}

$MyModel.Children | foreach-object{

If ($_.ClassName -eq "Table" -and $_.Name -eq $tableToUpdate){

$_.Columns | foreach-object{

$_.Comment = $documentaion[$_.Name]

}

}

}

}

function test_model_iteration{

Foreach($comment_entry in $documentaion_raw){

SetColumnComment -Model $MyModel -table $tableToUpdate -column $comment_entry.Column -comment $comment_entry.Description

}

}

$pathToDocumentation = "C:\Documents and Settings\My Documents\PD_Test_Project\data models\documentation\test_import.csv"

$PathToModel = "C:\Documents and Settings\My Documents\PD_Test_Project\data models\PD_Test_Project.pdm"
$tableToUpdate = "ps_import_test"

$PDApp = new-object -comobject PowerDesigner.Application

$MyModel = $PDApp.OpenModel($PathToModel)

$documentaion_raw = Import-Csv($pathToDocumentation)

$documentaion = @{}

Get-Date

test_hash

Get-Date

Get-Date

test_model_iteration

Get-Date

$MyModel.Save()

$MyModel.Close()

[System.Runtime.Interopservices.Marshal]::ReleaseComObject($PDApp) > $null

Remove-Variable MyModel

Remove-Variable PDApp


Data Modeling, Conciseness and Philosophy

Graeme Simsion and Graham Witt use the term conciseness to describe one of the most interesting aspects of a data model. Simsion and Witt write that a good data model

implicitly defines a whole set of screens, reports and processes needed to capture, update, retrieve and delete the specified data.

The data modeling process can similarly take us more directly to the heart of the business requirements.

When I first read this quote, it was like a light going off in my head. I had always felt that a good understanding of the data model was 90% of the battle in understanding an application, a business process or possibly even an enterprise.

On reflection, this really shouldn’t be much of a surprise. I think at our heart we are conceptual beings more than task oriented beings. We are often finding new and supposedly better ways of doing things but basically we are still living in a world of people, places and things and the relationships between them. Email, Facebook and Twitter may have changed the way we meet, keep up with and relate to people but it’s still essentially about relationships.

Get the concepts right. The code will follow.

If this sounds a bit on the philosophical side, I suppose that can be forgiven, as apparently I’m not the only one who sees data modeling as a primarily conceptual activity. With this article I also had that feeling of everything falling into place. People have sometimes asked me whether my philosophy training was useful, and I would say that it really did seem to have a positive impact on my career as a database professional, though I never really had the time to sit down and think about why that was.


Importing Column Documentation from a CSV into a Sybase PowerDesigner Data Model

(Update – I have a more efficient version of this script posted here)
I’m a big fan of data modeling tools. I used to chafe against them as so often it was easier for me to think in code. But the advantages of using tools like E/R Studio, Toad Data Modeler and Sybase PowerDesigner became apparent quickly. Being able to visually look at your data model through diagrams and automatically generated reports allowed you to get a general sense of what was going on in the database as a whole. And of course, it was much easier to document the database through the tool as you created or modified your tables with the same tool. And many of these tools have built in checks and warnings that can be very useful.

Still, certain repetitive tasks are easier to do through scripts. Or even through scripts generated by scripts. Right now I’m working with PowerDesigner and it has some built in scripting capabilities but there were some aspects of it that I just couldn’t get used to. And I already knew PowerShell pretty well so I decided to try and see if I could use PowerShell to do some grunt work.

Here’s the scenario. A non-technical subject area expert is asked to review the documentation of one of the key entities in your model. You are asked to give them a spreadsheet with all the attributes/columns as well as the description of each column. The subject area expert makes some improvements in the spreadsheet and sends it back.

Now you need to import the spreadsheet and update the data model with the revised documentation.

For the sake of this example, we’ll assume that you want to store the column-level documentation in the Comment field of the column. You might want to put it in the annotations or somewhere else depending on your situation.

$PathToModel holds the path to the data model file.

$pathToDocumentation is the path to the CSV file you are importing. It should have a column called Column which has the column names and Description which has the new comment you want in the data model.

$tableToUpdate should have the name of the table in the model you want to update.

Obviously, this isn’t a production utility. It’s just a quick and dirty script to give you ideas for building more useful, flexible tools. I will probably use this as a basis for a PowerDesigner library if I end up doing a lot of PowerDesigner work.

Here is the code:


Function GetComment($table, $column){

$MyModel.Children | where-object{$_.Name -eq $table} | foreach-object{$_.Columns} | where-object{$_.Name -eq $column} | foreach-object{$_.Comment}

}

Function SetComment($table, $column, $comment){

$MyModel.Children | where-object{$_.Name -eq $table} | foreach-object{$_.Columns} | where-object{$_.Name -eq $column} | foreach-object{

$_.Comment = $comment

}

}

$PathToModel = "C:\my_model.pdm"

$pathToDocumentation = "C:\updated_documentation.csv"

$PDApp = new-object -comobject PowerDesigner.Application

$tableToUpdate = "ps_import_test"

$MyModel = $PDApp.OpenModel($PathToModel)

$documentaion = Import-Csv($pathToDocumentation)

Foreach($comment_entry in $documentaion){

SetComment -table $tableToUpdate -column $comment_entry.Column -comment $comment_entry.Description

}

$MyModel.Save()

$MyModel.Close()

[System.Runtime.Interopservices.Marshal]::ReleaseComObject($PDApp)

Remove-Variable PDApp

The big problem with this code is that for each update you may end up iterating over the whole model, so it might be slow for large models. It looks like there are methods in the COM interface that might avoid this but I had no luck calling them from PowerShell. Anyone know the PowerDesigner object model well enough to help with this?

Anyone else have some good automation tips you’d like to share?


SSRS report throws an error that says a parameter doesn’t exist when the report designer clearly shows the parameter exists

The order in which the parameters appear in the BIDS Report Data tab can be the difference between your report executing and it failing. Failure to pay attention to this will give you very mysterious error messages!

"The Value expression for the query parameter PARAMNAME contains and error:The expression that references the parameter PARAMNAME does not exist in the parameters collection. Letters in the names of parameters must have the correct case"

It may be visible at design time but not come into scope when the particular expression is evaluated!

You can force the order of the parameters by going into the XML code and adjusting as needed.


Downloading the Results of a Twitter Search using PowerShell and SQL (part 3)

Part one of this series is here and part two is here. You’ll need the SQL scripts from part two in order for the code below to work but you can skip part one unless you’re curious to see the evolution of this project.

First off, I found a bug in my code. A unique constraint violation may be thrown if the exact same tweet is tweeted at the exact same time by two different users. I didn’t code to handle this because I thought it was an incredibly unlikely scenario, but given the popularity of automatic tweeting – e.g., “Share this on Twitter” buttons – it seems like it is a scenario which can and does happen.

Anyway, I decided to create a different constraint. Here is the code to drop the old one and add the new one.


use twitter
go

/*
a web app which auto tweets for a user 
might cause two identical tweets 
at the same second
*/

if exists (select 1 from sys.indexes where name = N'uq_tblTweet_tweet_tweetDate')
	alter table tblTweet drop constraint uq_tblTweet_tweet_tweetDate

alter table tblTweet
add constraint uq_tblTweet_tweet_tweetDate_twitterUserId unique (tweet, tweetDate, twitterUserId) 


Secondly, I added some parameters to the script. The parameter -searchString takes a single string to search on. Or you can specify a file path as an argument to the -inputFile parameter and it will go through each of the strings in the file and do a search on those strings.

If you use both parameters it will process the -searchString first and then the -inputFile.

I needed to update the proc which moves data from the staging table to the data tables so that the staging table doesn’t grow indefinitely. I like the data to hang around for a bit in case I need to debug some error, so I don’t want to truncate the table every day. I decided to add a parameter to the proc so that it deletes data whose import date (not the Twitter date but the date the data was imported) is more than a certain number of days old. The number of days is passed as an argument on the command line of the PowerShell script call. The default is 30 days for both the PowerShell script and the stored proc, and the parameter is named -delAgedStagingData. If you pass a 0, it will not delete any data.

Here is the updated proc:


use twitter
go

if object_id(N'uspLoadTwitterDataFromStaging') is not null
	drop procedure uspLoadTwitterDataFromStaging

go

create procedure uspLoadTwitterDataFromStaging
/*
If data is older than @delAgedStagingData (in days), delete it
default = 30
if 0 then don't delete data
*/
@delAgedStagingData int = 30 
as
begin


	insert into tblTwitterUser (twitterUserName)
	select twitterUser
	from tblTwitterStaging
	except
	select twitterUserName
	from tblTwitterUser


	/*
	I could prime this table in a config step 
	but I want the whole thing to be driven by user configurable files in the end product
	or parameters on the command line

	Anyway, the PowerShell script is responsible for inserting the search string into the staging table
	*/

	insert into tblSearchString (searchString)
	select searchString
	from tblTwitterStaging
	except 
	select searchString
	from tblSearchString


	--there is some weird date stuff going on - salvage what you can!
	update tblTwitterStaging
	set twitterdate = convert(datetime,left(twitterdate,10))
	where isdate(twitterdate) = 0


	insert into tblTweet(tweet,tweetDate,twitterUserId)
	select distinct stg.tweet, convert(datetime,stg.twitterDate), u.twitterUserId
	from tblTwitterStaging stg
	inner join tblTwitterUSer u
	on stg.twitterUser = u.twitterUserName

	except select tweet,tweetDate,twitterUserId
	from tblTweet


	insert into tblTweetBySearchString(tweetId,searchStringId)
	select t.tweetId, s.searchStringId
	from tblTwitterStaging stg
	inner join tblTweet t
	on stg.tweet = t.tweet
	inner join tblSearchString s
	on stg.searchString = s.searchString
	except select tweetId,searchStringId
	from tblTweetBySearchString





	if @delAgedStagingData <> 0
		
		delete tblTwitterStaging where datediff(day,importDate,getdate()) > @delAgedStagingData
	
		
end

I also found one case where the published string came back in a format that SQL Server couldn’t handle. If I drop off the time then it can convert. That causes a bit of data loss, but I figure it’s better than losing the whole record, and it seems like the situation is pretty rare.

Thirdly, I decided, for ease of use, to do the URL encoding in the PowerShell script. So when you pass in a string you simply type the search string as you would into any search box. So, if you want to search for “i am the cloud” you can just pass that string as a parameter.

Here is the updated PowerShell script.



param([string]$searchString = "", [string]$inputFile = "", [int]$delAgedStagingData = 30)
Write-Host "searchString: $searchString"
Write-Host "inputFile: $inputFile"
Write-Host "delAgedStagingData: $delAgedStagingData"


[System.Reflection.Assembly]::LoadWithPartialName("System.Web") | out-null
$searchString = [System.Web.HttpUtility]::UrlEncode($searchString)



Function getSearchPage ($searchString, $page)
{
	"searchString = $searchString"
	"page = $page"


	$req = "http://search.twitter.com/search.atom?q=$searchString&show_user=1&rpp=100&since_id=1&page=$page"
	
	
	"`$req = $req"
	
	([xml](new-object net.webclient).DownloadString($req)).feed.entry |
	%{

		#replace problematic characters in input
		$title = $_.title -replace ("`t","``t")
		$title = $title -replace ("'","''")
		$title = $title -replace ("`n","``n")
		$title = $title -replace ("`r","``r")
		$title = $title -replace ("`$","``$")
		$title = $title -replace ("`"","`"`"")
		$published = $_.published -replace ("`t"," ")
		$published = $published -replace ("'","''")
		$published = $published -replace ("`n","``n")
		$published = $published -replace ("`"","`"`"")
		
		
		
		
		$splitData = $title.split(":")
		$twitterUser = $splitData[0]
		
		$tweet = $title.substring($twitterUser.length,$title.length - $twitterUser.length)
		if ($tweet.length -ge 2)
		{$tweet = $tweet.substring(2,$tweet.length -2)}
		
		

		
		#turn input into sql statement
		if ($tweet.length -gt 0){
			$sql = "insert into twitter.dbo.tblTwitterStaging 
			(twitterUser,tweet, twitterDate,searchString) 
			select '$twitterUser','$tweet' ,'$published','$searchString'" 
			
			sqlcmd -E -Q $sql 
			

		}

	} 

	
}

Function getLast10SearchPages ($searchString)
{
	for ($i=10; $i -ge 1; $i--)
	{



		$i

		getSearchPage $searchString $i



	} 
}


#process the command-line search string first
if ($searchString -ne '') {getLast10SearchPages ($searchString)}

#process any search strings in the file
if ($inputFile  -ne '') {Get-Content $inputFile | Foreach-Object {getLast10SearchPages ([System.Web.HttpUtility]::UrlEncode($_))}}



sqlcmd -E -Q "exec twitter.dbo.uspLoadTwitterDataFromStaging @delAgedStagingData = $delAgedStagingData"



I’ve been calling the file twitter_search_to_db.ps1 and assuming that you call yours the same thing here are some sample calls. I created a text file called search_strings.txt which sits in the same directory in which I’m running my script. Here is what is saved in search_strings.txt:

#powershell
#sqlhelp
i am the cloud
death cab for cutie
florence + the machine
mary timony
Jolie Holland

Here is the call with a search string:

.\twitter_search_to_db.ps1 -searchString "J. Vernon McGee"

Here is a call with a file parameter:

.\twitter_search_to_db.ps1 -inputFile ".\search_strings.txt" -delAgedStagingData 30

And here is a call with both:

.\twitter_search_to_db.ps1 -searchString "J. Vernon McGee" -inputFile ".\search_strings.txt" 

Notice that the search string is stored URL encoded in the tables:

select * from twitter.dbo.tblSearchString

I figure that this is closer to what is actually required by the program. It should be fairly safe for use in SQL and over the web. It’s up to the interface to transform user input into something computer friendly. You could store both, but I try to avoid repetitive data, and since the unencoded data should be fairly easy to get from the encoded data, it seems silly to have both. You might end up with difficult-to-find bugs if you have both.

I would recommend you add this as a scheduled task. That’s fairly easy to do. You can find details on that here along with some other nice PowerShell tricks.

I hope to play around a bit with displaying the results, but my plan for my next post is probably going to be some pure T-SQL stuff. I feel like I could go on and on with this project. I love PowerShell, but my deeper interest in T-SQL is starting to pull me away. So I’m going to take a brief PowerShell vacation.


Downloading the Results of a Twitter Search using PowerShell and SQL (part 2)

I’ve expanded a little on my Twitter project. (See part 1 here.) I adjusted the PowerShell script to page through the data going back as far as Twitter holds the data. There is still a lot to be done.

Left to be done:

  • allow for dynamic search strings instead of hard coding
  • read search strings from file or command line
  • better structure in the script

But today I want to focus on the data model. I pasted the commented DDL statements below.


--assuming you have created a database for this project. I'm calling mine simply twitter
use twitter 
go

/*

staging - dump all the strings in here, then clean them with T-SQL

*/


create table tblTwitterStaging(
	staging_row_id int identity
	/*
	I'm making the twitterUser and tweet columns much larger than
	the twitter app allows because this is a staging DB and I want to be sure that
	I can handle any mess that twitter sends me
	
	As a general rule, I try to code as defensively as the situation allows
	
	
	*/
	,twitterUser nvarchar(1000)
	,tweet nvarchar(1000) 
	
	--use nvarchar to avoid differences in formatting between their system and SQL server.
	,twitterDate nvarchar(25) 
	
	--the actual date the row is inserted into the staging table
	,importDate datetime 
	constraint df_tblTwitterStaging_importDate default getdate()
	
	--search string sent to twitter
	,searchString nvarchar(1000) 
	
)


/*

Since a user can have multiple Tweets across many searches it seems good to give them a table

*/

create table tblTwitterUser(
	twitterUserId int identity
	--code defensively - more than twitter will allow
	,twitterUserName nvarchar(450) not null 
)


alter table tblTwitterUser
add constraint pk_tblTwitterUser primary key (twitterUserId)

/*
I can't see how twitter could allow dups in user name
So let's put that in our data model

We may be able to use knowledge like this later for query optimization and so forth
*/
alter table tblTwitterUser
add constraint uq_tblTwitterUser_twitterUserName unique (twitterUserName) 

	
/*

I'm going to scale this out to track multiple searches

*/


create table tblSearchString(
	searchStringId int identity
	,searchString nvarchar(450)
)		


alter table tblSearchString
add constraint pk_tblSearchString primary key (searchStringId)

--We control the search strings and it doesn't make any sense to have multiple, identical search strings
alter table tblSearchString
add constraint uq_tblSearchString_searchString unique (searchString)



create table tblTweet(
	tweetId int identity
	,twitterUserId int
	--once again, much longer than we'd ever expect a tweet to be but coding defensively
	,tweet nvarchar(400) 
	,tweetDate datetime
)

alter table tblTweet
add constraint pk_tblTweet primary key (tweetId)

/*
I don't care about duplicate tweets
if the same tweet shows up in more than one search it will tie back to this one
*/
alter table tblTweet
add constraint uq_tblTweet_tweet_tweetDate unique (tweet, tweetDate) 

--every tweet ties to a user
alter table tblTweet
add constraint fk_tblTweet_tblTwitterUser foreign key (twitterUserId) references tblTwitterUser(twitterUserId)

/*
each search will return zero or more tweets

The same tweet could be returned by more than one search string
So this relationship needs its own table
*/
create table tblTweetBySearchString(
	tweetId int not null
	,searchStringId int not null
)		

alter table tblTweetBySearchString
add constraint pk_tblTweetBySearchString primary key (tweetId, searchStringId)

alter table tblTweetBySearchString
add constraint fk_tblTweetBySearchString_tblTweet foreign key (tweetId) references tblTweet(tweetId)

alter table tblTweetBySearchString
add constraint fk_tblTweetBySearchString_tblSearchString foreign key (searchStringId) references tblSearchString(searchStringId)

go

/*

And here we have the actual results

I'm using a view so that a client application will not break if the table schemas change

*/
create view vwTweetsBySearchString
as
select  s.searchString, u.twitterUserName,t.tweet, t.tweetDate

from tblTweetBySearchString t_by_s

inner join tblSearchString s
on t_by_s.searchStringId = s.searchStringId

inner join tblTweet t
on t_by_s.tweetId = t.tweetId

inner join tblTwitterUser u
on t.twitterUserId = u.twitterUserId

Here is the proc we’ll use to move the data from staging to the tables we’ll actually query:

use twitter
go

if object_id(N'uspLoadTwitterDataFromStaging') is not null
	drop procedure uspLoadTwitterDataFromStaging

go

create procedure uspLoadTwitterDataFromStaging
as
begin


	insert into tblTwitterUser (twitterUserName)
	select twitterUser
	from tblTwitterStaging
	except
	select twitterUserName
	from tblTwitterUser


	/*
	I could prime this table in a config step 
	but I want the whole thing to be driven by user configurable files in the end product
	or parameters on the command line

	Anyway, the PowerShell script is responsible for inserting the search string into the staging table
	*/

	insert into tblSearchString (searchString)
	select searchString
	from tblTwitterStaging
	except 
	select searchString
	from tblSearchString




	insert into tblTweet(tweet,tweetDate,twitterUserId)
	select distinct stg.tweet, convert(datetime,stg.twitterDate), u.twitterUserId
	from tblTwitterStaging stg
	inner join tblTwitterUSer u
	on stg.twitterUser = u.twitterUserName

	except select tweet,tweetDate,twitterUserId
	from tblTweet


	insert into tblTweetBySearchString(tweetId,searchStringId)
	select t.tweetId, s.searchStringId
	from tblTwitterStaging stg
	inner join tblTweet t
	on stg.tweet = t.tweet
	inner join tblSearchString s
	on stg.searchString = s.searchString
	except select tweetId,searchStringId
	from tblTweetBySearchString






end

And here is the PowerShell script to get the data into the data staging table. Like I said before there is a lot to be done but it pages through the data and puts the data into our staging table using sqlcmd.




for ($i=10; $i -ge 1; $i--)
{
	
	
	$i
	$req = "http://search.twitter.com/search.atom?q=%23sqlhelp&show_user=1&rpp=100&since_id=1&page=$i"
	
	([xml](new-object net.webclient).DownloadString($req)).feed.entry |
	%{

		#replace problematic characters in input
		$title = $_.title -replace ("`t","``t")
		$title = $title -replace ("'","''")
		$title = $title -replace ("`n","``n")
		$title = $title -replace ("`r","``r")
		$title = $title -replace ("`"","`"`"")
		$published = $_.published -replace ("`t"," ")
		$published = $published -replace ("'","''")
		$published = $published -replace ("`n","``n")
		$published = $published -replace ("`"","`"`"")
		
		
		

		
		$splitData = $title.split(":")
		$twitterUser = $splitData[0]
		
		$tweet = $title.substring($twitterUser.length,$title.length - $twitterUser.length)
		if ($tweet.length -ge 2)
		{$tweet = $tweet.substring(2,$tweet.length -2)}
		
		

		
		#turn input into sql statement
		if ($tweet.length -gt 0){
			$sql = "insert into twitter.dbo.tblTwitterStaging 
			(twitterUser,tweet, twitterDate,searchString) 
			select '$twitterUser','$tweet' ,'$published','%23sqlhelp'" 
		
			#insert into staging table
			sqlcmd -E -Q $sql
		}

	} 
} 

sqlcmd -E -Q "exec twitter.dbo.uspLoadTwitterDataFromStaging"

And here is a view that our downstream systems will use to view the results

create view vwTweetsBySearchString
as
select  s.searchString, u.twitterUserName,t.tweet, t.tweetDate

from tblTweetBySearchString t_by_s

inner join tblSearchString s
on t_by_s.searchStringId = s.searchStringId

inner join tblTweet t
on t_by_s.tweetId = t.tweetId

inner join tblTwitterUser u
on t.twitterUserId = u.twitterUserId

In the next installment in the series I’m hoping to build out the PowerShell script to accept multiple search strings. I may play with the T-SQL a bit as that is the sort of thing I never get tired of doing.

Thoughts? Feature requests?
