SqlBits 2017 In Review

This past Saturday, 8th April 2017, the annual SqlBits conference took place at the International Centre in Telford, Shropshire.  The event is a four-day conference, with the first three days being a paid conference and the final day, the Saturday, always being a free community day.

I’d had to get up quite early for this event, setting my alarm for 5:30am to allow me to get my all-important cup of coffee before setting off on the approximately 90-mile journey to arrive at the conference for the opening time of 7:30am!

After arriving at the venue, which on this occasion had free car parking in the venue’s ample car park, I headed indoors to search for the registration booth.  It appeared that I’d arrived slightly early, as I joined a small crowd that was gathered in one of the venue’s corridors just outside the entrance to a large hall where the event itself was taking place.  After a short wait, we were informed that we could enter the hall and proceed to the registration booths on the way in.

After registering and receiving my badge, conference programme and goodie bag, I headed to the catering area for another all-important cup of coffee and some breakfast.

This year, having recently become vegetarian, I knew the almost obligatory bacon butties would not be an option, so I quickly acquired a cup of coffee (the most important thing) and searched for the vegetarian breakfast option.  The choice was either a fruit smoothie or a fruit bowl.  I selected the fruit bowl and made my way to one of the many tables dotted around the venue to consume my breakfast and take a peek through the goodie bag!

After a short while, an announcement came over a venue wide loudspeaker system to tell the attendees that the first sessions would be starting in 10 minutes.  This year, there were eight tracks of talks, each one presented in one of eight separate domes scattered throughout the venue.  I quickly finished my coffee, collected my things and made my way to Dome 3 for my first session, Ust Oldfield’s “A Deep Dive Into Data Lakes”.

Ust first introduced himself as a consultant working for Adatis, who provide Business Intelligence (BI) consultancy services.  He says that, as a company, they were fully conversant with standard data warehouses, but needed to move forward in order to understand the relatively recent phenomenon of data lakes on the Microsoft Azure platform.

Ust asks the audience who already knows about Data Lakes; not many of us do, so he asks if we’re familiar with Hadoop – the Apache Foundation’s distributed computing framework – with which a few more people are familiar.  Ust explains that, under the hood, Azure’s Data Lake is a combination of distributed Azure BLOB storage (and so can work with any file type or size) with Hadoop overlaid on top to provide the distributed compute capability.

Azure Data Lake (hereafter referred to as ADL) uses things called “Extents” to contain its data, which are 250MB blocks that all storage is divided into.  Ust explains that ADL uses the “lambda” architecture and allows users to perform computations and queries using a language called U-SQL, which he says is like a cross between T-SQL used in SQL Server and C#.  All of the files that are added to a data lake can be set to automatically expire and be deleted, and so ADL contains functionality to allow some automated maintenance of the data held within it.

All data warehouses and lakes in ADL go through three stages: the ingestion of raw data, an enrichment phase (where the data is verified, de-duplicated, cleaned and augmented with additional data from other sources) and finally curation, where the data is presented for user consumption.  U-SQL scripts specify how the raw input data is transformed into output data, and U-SQL includes both traditional ways to select and filter raw data, similar to how SQL Server would provide such functionality, and other methods of transforming data more specific to distributed data sets, such as MapReduce.

ADL provides a dashboard within the Azure website where ADL can be accessed and scripts created and run against the ADL data; however, there is also Microsoft Visual Studio tooling available so that many of the ADL functions can be accessed through Visual Studio.  One very interesting feature is that U-SQL scripts that would normally be confined to running within Azure can be downloaded to a local machine and debugged using Microsoft Visual Studio.  It’s also important to note that some functionality of ADL can only be accessed via the ADL Visual Studio tooling.

When performing queries against ADL data, U-SQL scripts are split and parallelized into multiple “vertexes”, which are the discrete units of computation within ADL.  Each vertex can be independent or dependent upon a previous vertex completing its computation.  You can manage vertexes and their dependencies within ADL, but this is a piece of functionality only available within the ADL Visual Studio tooling.

Ust shows us a demo of some U-SQL queries running over some sample ADL data.  He demonstrates how, despite ADL, like almost all Azure features, being a pay-as-you-go service, reworking your queries to be longer-running queries that use fewer ADLAUs (Azure Data Lake Analysis Units – the discrete single compute/billing units within ADL) can actually save you a lot of money.  This is due to how ADL charges are calculated, meaning that it’s far more expensive to use more ADLAUs than it is to use fewer for a longer time.

Ust shows us some tips around using “partition elimination”, which is a mechanism whereby data is pre-filtered prior to being distributed and computed upon by your standard U-SQL scripts.  Partition elimination is best implemented with a deliberately defined file naming system (e.g. MyLogs_2017_05_01.txt, MyLogs_2017_05_16.txt etc.)  Using such a mechanism, you can filter the data files to be included within the U-SQL compute based upon partial filename matches and wildcards (e.g. you could process MyLogs for the month of April 2017 with a pre-filter such as MyLogs_2017_04_**.txt).  Ust tells us some more about the ADL data and the requirements for its storage.  He says that indexes are mandatory within ADL data, but that we can only have a single clustered index on each table.  Currently, ADL does not support non-clustered indexes, however this is something that may come in the future.

Finally, Ust talks about data “skew”, which concerns how your dataset is distributed throughout the cluster for computing.  Data can be split for processing based upon a round-robin technique, which guarantees an even distribution of data across all nodes in the cluster, but does not guarantee that similar data will be kept together and processed by the same node.  This can cause a performance degradation of the compute function, as separate nodes must communicate much more to transfer related data when it is spread across multiple nodes.  The other technique for data distribution is to split the data based upon a hash.  This guarantees that related data will be kept together on the same node – thus potentially improving the compute performance – but can no longer guarantee that the data is evenly distributed across all the nodes in the cluster.  This means that some nodes will have significantly more work to perform than other nodes, which can again impact overall compute performance.  Therefore, it’s essential that you understand the general “shape” of your raw data in order to maximise the compute performance – and thus minimise the overall cost – of your ADL service.

After Ust’s session, we had a quick coffee break and I grabbed another cup of coffee.  There was just enough time to drink my coffee and take a very quick look around the main hall of the venue before I had to make my way back to Dome 3 for the next session.  This one was Hugo Kornelis’s “Normalization Beyond Third-Normal Form”.

Hugo starts his session by reminding us of some key concepts that we need to be aware of when performing any data modelling activity.  He talks about the “Universe Of Discourse”, which is the view of reality as defined by the data/software model – not necessarily the view of actual reality.  We then look at the purpose of database normalization.  We recall that normalization is the process of organising our data into columns and tables in such a way as to reduce redundancy and improve data integrity.  Hugo points out that normalization’s purpose is not to prevent incorrect data but to prevent impossible, inconsistent or business rule violating data.  We can’t stop the user from entering false data into the name column, but we can prevent them from providing us with a non-date value for their birthdate.  Hugo also reminds us that normalization is never performed at a database level, only at a table level.  It’s perfectly possible to have a database that, across its many tables, contains multiple forms of normalization.

Next, we look at what defines normalization.  Hugo tells us that it’s based upon functional dependency.  This is a constraint that dictates that for every value in Column A in a relationship, there is at most one value in Column B (i.e. A –> B).  Column A can actually be a composite of multiple actual columns (i.e. {A,B} –> C) and Hugo gives the conference-specific example of a SqlBits Dome number and a chair number, which together define the exact name of the attendee sitting there.  It’s possible that the composite can exist on the other side of the relationship (i.e. A –> {B,C}) however this can be reduced to two constraints of A –> B and A –> C.

Hugo reminds us of 3rd normal form.  This is the most “popular” normal form that many people take their databases to and then stop there.  3rd normal form (3NF) states that, in a given table, every non-key column is dependent upon the key, the whole key and nothing but the key (so help me Codd!).  We can use an algorithm called “Bernstein’s Algorithm for Synthesis of a 3rd Normal Form Schema” to help us create a database schema that is guaranteed to be in 3rd normal form, so long as all of our functional dependencies are known up front.  Hugo also mentions Boyce-Codd normal form, which is based upon 3rd normal form but extends the requirement so that all columns in a table, including key columns, must be dependent upon the key.  When all columns in a table are dependent upon the key, there should usually be no duplicated data within that table’s rows.
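
The example below isn’t from Hugo’s session; it’s just a minimal T-SQL sketch (with hypothetical table and column names) of the kind of transitive dependency that 3NF removes – here, CustomerCity depends on CustomerId rather than on the key of the orders table:

-- Not in 3NF: CustomerCity depends on CustomerId, not on the key (OrderId)
CREATE TABLE dbo.OrdersDenormalised
(
	OrderId      INT          NOT NULL PRIMARY KEY,
	CustomerId   INT          NOT NULL,
	CustomerCity NVARCHAR(50) NOT NULL
);

-- In 3NF: every non-key column depends on the key, the whole key and nothing but the key
CREATE TABLE dbo.Customers
(
	CustomerId INT          NOT NULL PRIMARY KEY,
	City       NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Orders
(
	OrderId    INT NOT NULL PRIMARY KEY,
	CustomerId INT NOT NULL REFERENCES dbo.Customers (CustomerId)
);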

Hugo proceeds by detailing something called Elementary Key normal form.  This is perhaps a little-known and little-used normal form, based upon 3rd normal form but where the constraint is defined as only non-elementary columns being dependent upon the key.  So what is a non-elementary column?  Well, it’s where functional dependencies such as {A,B} –> C do not have the reverse dependencies of either C –> A or C –> B.  This can also be expressed as: for every full non-trivial functional dependency of the form A –> B, either A is a key or B is (a part of) an elementary key.  Hugo explains that, in practice, Elementary Key normal form is almost identical to 3rd normal form.

From here, Hugo takes us into the more elaborate normal forms.  We start with 4th normal form. 4th normal form, unlike the lower normal forms, is less concerned with functional dependencies, but rather with multi-valued dependencies.  These are best explained with an example.  Hugo uses a table representing the availability of experts to discuss SQL problems on given days of the week:

Day      Expert  Subject
Monday   Jim     Design
Monday   Jim     Tuning
Tuesday  Jim     Debugging
Tuesday  Fred    Design

Looking at this data, we can infer the following fact: On Monday, you can ask Jim about Design.  From this fact, we can further infer two additional facts: On Monday, you can ask Jim questions, and Jim knows about Design.  In looking at the two facts that we’ve inferred, we can see that it is not possible to work backwards and infer the first original fact merely from the two subsequent facts.  This is a violation of 4th normal form.  In order to make this data compliant with 4NF, we must separate the information regarding days of the week and subjects into different tables, each table then becomes compliant with 4th normal form:

Expert  Day
Jim     Monday
Jim     Tuesday
Fred    Tuesday
 
Expert  Subject
Jim     Design
Jim     Tuning
Jim     Debugging
Fred    Design

After 4NF we move on to look at 5th normal form.  5th normal form is based upon 4th normal form but extends the rules to dictate that there must be no “join dependencies” between the columns except those based upon a key.   A join dependency is effectively the ability to take a single table, split it into multiple tables and then recreate the original table by constructing a query that joins the split tables back into one.  In practice, a table being in 5NF effectively means that if a column has the same value in multiple rows, and removing that value from the table requires the removal of multiple columns, then our table is not compliant with 5th normal form.  5NF is so closely related to 4NF that it’s very rare for a table compliant with 4NF to not also be compliant with 5NF.

Expert
Jim
Fred
 
Day
Monday
Tuesday
 
Subject
Design
Tuning
Debugging

Hugo briefly touches upon 6th normal form.  He starts by stating that 6th normal form is very hard to find in practice, being far more an academic curiosity.  6NF is based upon 5NF but further constrains the join dependencies to state that no join dependencies, even those implied by key, are allowed to exist within the table.  This effectively means that there can never be any NULL values within any columns of a 6NF table.  There would be no need for NULLs as we could simply remove the entire row.  The primary reason we don’t see tables and especially entire databases that conform to 6th normal form is that 6NF largely implies that our entire data schema is modelled using a very large number of tables with each table having only a key column and a data value column.  Today’s real-world database platforms are simply not optimised to operate with such a data schema and so data normalisation to this level is rarely, if ever, performed in the real world.

Hugo next talks about Optimal Normal Form.  This is based upon 6NF but prevents the “splitting” of tables if “elementary fact types” would be split. Elementary fact types are multiple columns that would have to remain together in a single table to ensure integrity of data.  Again, optimal normal form is very rarely found in the real-world.

Finally, Hugo talks about an entirely different type of data normalization, known as Domain/Key Normal Form.  Domain Key Normal Form (DKNF) is not based upon functional dependencies like all the other forms of normalization, but is instead based solely upon domain constraints and key constraints.  “Domain” in this context refers to the range of values that are allowed within a given column.  These are not the values allowed by the data type of the underlying column, but rather the values allowed by the business logic of the domain.  An example that violates DKNF could be shown as follows, with a school report card containing a score for each student that is a value between 0 and 100, and a status of PASS or FAIL (FAIL for scores below 50):

Student  Score  Status
James    78     PASS
William  63     PASS
David    48     FAIL
Timothy  57     PASS

From the table above, we can see that it would be possible to enter a value of FAIL in the Status column for the row containing James’ name.  The database constraints would not prevent us from doing this, however, we would be violating our business rules that state that Scores greater than or equal to 50 are a PASS status.  In order to correct this data so as not to violate DKNF, we would change it as follows by splitting into two tables:

Student  Score
James    78
William  63
David    48
Timothy  57
 
Status  Minimum Score  Maximum Score
FAIL    0              49
PASS    50             100

By splitting the data, we ensure that the business logic is captured and no table data can violate the domain rules.  An interesting side-effect of complying with DKNF is that you’ll also comply with 5NF too.  The relevance of DKNF, despite being a very different form of normalization than the other forms, is that data integrity against business rules can now be expressed and enforced from the database design alone, something that has traditionally been enforced only within application code that is responsible for reading and writing data to and from the database.  It should be noted, however, that compliance with DKNF isn’t always possible and depends very much on the business domain.
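
As a rough illustration (not code from Hugo’s session, and using hypothetical table names), the split above could be expressed in T-SQL something like this, with the business rule living in the lookup table and a check constraint rather than in a redundant Status column:

-- Scores live in one table, constrained to the valid domain of 0 to 100
CREATE TABLE dbo.StudentScores
(
	Student NVARCHAR(50) NOT NULL PRIMARY KEY,
	Score   INT          NOT NULL CHECK (Score BETWEEN 0 AND 100)
);

-- The business rule mapping score ranges to a status lives in its own table
CREATE TABLE dbo.ScoreStatus
(
	Status       NVARCHAR(10) NOT NULL PRIMARY KEY,  -- 'FAIL' or 'PASS'
	MinimumScore INT          NOT NULL,
	MaximumScore INT          NOT NULL
);

-- A student's status is now derived, so it can never contradict the domain rule
SELECT s.Student, s.Score, st.Status
FROM dbo.StudentScores AS s
JOIN dbo.ScoreStatus AS st
	ON s.Score BETWEEN st.MinimumScore AND st.MaximumScore;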

After this, Hugo’s session was complete and it was time for another short coffee break.  I quickly grabbed another coffee from one of the numerous catering stands dotted throughout the venue and checked my programme for the Dome that I would need to head towards for the next session.  That was Dome 4 and Conor Cunningham’s “SQL Server vNext and SQL Azure – Upcoming Features”.

Conor’s talk was originally intended to be given by Lindsey Allen, however, a scheduling mix-up had resulted in Lindsey being unable to give the presentation.  Instead we were provided with some excellent content from Conor Cunningham, who is the Principal Software Architect for Microsoft on the SQL Server Query Processor Team.  Conor is here to tell us all about the new features that will be coming in the upcoming versions of the on-premise SQL Server product as well as SQL Azure.

Firstly, Conor tells us that both the on-premise SQL Server product and the SQL Azure product share the exact same codebase.  SQL Azure has a monthly release cadence and so is always the first product to receive new SQL Server functionality and have that functionality available to the public, whilst on-premise SQL Server currently has a release cadence of approximately one year and so receives the same features from SQL Azure in each of its subsequent public releases.

A big feature coming in SQL Server vNext (the official marketing title is not yet decided) is the ability to run it on Linux.  This isn’t just a version of SQL Server that’s specially built for Linux, but the exact same binaries that run the Windows version of SQL Server.  Microsoft has built an abstraction layer, known as a “PAL” (or Platform Abstraction Layer) which is used to align all operating system or platform specific code in one place and allow the rest of the codebase to stay operating system agnostic.  Moreover, SQL Server when run on Linux will effectively be SQL Server running inside a Docker container.  Previously, SQL Server has relied on Windows Server Failover Clustering (WSFC) to provide clustering capability to SQL Server, however, as part of the work required to allow SQL Server to run on Linux, this is being abstracted away to allow 3rd party cluster management software to be used.  Initially, SQL Server on Linux will support an open-source product called Pacemaker, however more cluster management product support will follow in time.

There have been big improvements within the In-Memory Tables features of SQL Server.  These improvements mean that In-Memory tables, which were previously constrained in how they operated compared to normal disk-based tables, will now operate much closer to how standard tables operate, supporting many more features including JSON support, CROSS APPLY and CASE statements, amongst others.
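
As a rough sketch (my own example, not one from the session), a memory-optimised table is declared with the MEMORY_OPTIMIZED option; the database is assumed to already have a MEMORY_OPTIMIZED_DATA filegroup configured, and the table and column names are illustrative:

-- Assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup
CREATE TABLE dbo.SessionState
(
	SessionId   INT           NOT NULL PRIMARY KEY NONCLUSTERED,
	Payload     NVARCHAR(MAX) NULL,   -- e.g. JSON, queryable with JSON_VALUE etc.
	LastUpdated DATETIME2     NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);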

Another major set of improvements within SQL Server vNext is in the area of ColumnStore indexes.  ColumnStore indexes are perhaps one of the best new features to be added to SQL Server in recent years and allow potentially significant performance enhancements for queries on tables using such an index.  ColumnStore indexes now have support for BLOB column data types and the index itself is now compressed, reducing space and storage requirements as well as improving performance.  Further, rebuilds of ColumnStore indexes will no longer cause significant blocking of the tables upon which the index is being rebuilt, meaning that users of the database are no longer severely negatively impacted by such rebuilds.
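
For reference, creating a ColumnStore index is a one-liner in T-SQL; the table and column names here are purely illustrative:

-- Clustered columnstore: the whole table is stored in columnar format
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;

-- Or a nonclustered columnstore over selected columns of a rowstore table
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
	ON dbo.FactSales (SaleDate, ProductId, Quantity, Amount);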

SQL Server vNext also includes advancements in “Adaptive Query Processing”.  This is a major new area of functionality in SQL Server and will receive even more improvements in future versions of SQL Server beyond vNext.  Adaptive Query Processing is a series of algorithms that work within SQL Server’s Query Processor in order to improve query performance by analysing query plans, SQL Server data and other meta-data.  It aims to improve query performance without introducing any query degradation from incorrect query plan optimisations.  It does this by dynamically adjusting joins (i.e. switching from hash joins to merge or loop joins, or vice-versa), adjusting memory grants in order to ensure efficient allocation of memory without under- or over-allocating, and interleaving compilation and execution for the most complex queries in order to maximise their performance.
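
As an aside (my own note rather than something from the session), the expectation is that these adaptive query processing features light up when a database runs under the newest compatibility level; the level number and database name below are assumptions to verify against the final release:

-- Assumption: adaptive query processing is gated behind the newest compatibility level (140 for vNext)
ALTER DATABASE [MyDatabase] SET COMPATIBILITY_LEVEL = 140;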

Another major new feature of SQL Server vNext will be support for Graph Databases.  Graph databases are highly specialised databases that store their data in graph structures, using nodes, edges and properties, allowing for semantic querying of the data.  Common applications of graph databases are for querying large graphs of data such as those found inside a social network.  Graph data, and the ability to efficiently query it, makes questions such as “How many friends of Person A are also friends of Person B?” and “Which friends of Person A are also friends of the friends of Person B?” very easy to answer, something a relational database would have difficulty achieving in an efficient manner.  SQL Server vNext’s support for graph databases promises to offer full CRUD support for node and edge creation, query language extensions to allow querying of graph data as well as allowing queries to span both standard relational SQL Server data and graph data at the same time.
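
As a flavour of those promised query language extensions (syntax based on the vNext previews, so treat it as a sketch rather than gospel; the table and column names are my own):

-- Nodes and edges are just tables with an extra AS NODE / AS EDGE clause
CREATE TABLE dbo.Person
(
	PersonId INT           NOT NULL PRIMARY KEY,
	Name     NVARCHAR(100) NOT NULL
) AS NODE;

CREATE TABLE dbo.FriendOf AS EDGE;

-- "Who does Person A count as friends?" using the new MATCH clause
SELECT p2.Name
FROM dbo.Person AS p1, dbo.FriendOf AS f, dbo.Person AS p2
WHERE MATCH(p1-(f)->p2)
	AND p1.Name = N'Person A';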

Conor continues his exploration of the new SQL Azure features by telling us about a new feature in SQL Azure that can automatically create indexes for table columns, inferred from the usage of those columns within queries.  The maximum database size has also improved, now supporting databases up to 4TB in size.  There are also some long-awaited improvements to the syntax of the T-SQL query language itself.  There are new string concatenation and aggregation functions as well as a TRIM function (finally!).  New Japanese collation families have also been added, and new bulk insert operations have been added to support specific standards such as the RFC 4180 CSV file format.
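
The string functions being referred to are presumably along the lines of TRIM, CONCAT_WS and STRING_AGG (my assumption; the table in the last query is illustrative):

SELECT TRIM('   padded value   ');                       -- 'padded value'
SELECT CONCAT_WS(', ', 'Telford', 'Shropshire', 'UK');   -- 'Telford, Shropshire, UK'
SELECT STRING_AGG(Name, ', ') FROM dbo.Person;           -- one comma-separated list of all names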

After a brief Q&A at the end of Conor’s session it was time for another refreshment/toilet break.  I decided I was all coffee’d out by this point, having had around 4 or 5 cups of coffee so far, and it still only being just past 12pm.  There was one more session before the break for lunch and so, after consulting my conference programme, I headed off to Dome 2 for the somewhat light-hearted session that was Denny Cherry’s “Things You Should Never Do In SQL Server”.

Denny introduced himself first and indicated that this session was to be a bit lighter than other sessions in the day, being a look at some of things that you should not do rather than the things you should.  As such, he reminded us that everything on his slides was wrong!

He started by talking about the enforcement of data integrity in SQL Server and tells us that we really shouldn’t use things like triggers, stored procedures or even application code to enforce data integrity.  SQL Server is a fully relational database and we can leverage what SQL Server is good at by designing our schemas to provide such integrity for us.  Denny talks about a book that he reviewed as a technical editor, which was so bad that he implored the publisher to completely scrap the book.  One of the pearls of wisdom in this book, he says, was the recommendation to use 32-bit editions of SQL Server for “local offices”, reserving the 64-bit edition of SQL Server only for large corporations.  Don’t do this.  We all run on 64-bit operating systems today, so where available, we should always be running 64-bit application software, too.  He states that recommendations from third party software vendors should always be questioned, too.

Next up is migrating databases.  Something seen quite frequently is the use of the “copy database wizard” to migrate databases from one server to another.  This is error prone and simply not as good an option as something like log shipping, which has been around for decades and is a very robust and mature technique for performing migrations.  Then we look at the account under which SQL Server will run.  Whilst it’s true that very old versions of SQL Server (pre-2005 versions at least) required local administrator privileges in order to run, modern versions of SQL Server do not require such special privileges at all.  SQL Server does require some additional permissions above those usually found in a standard local user account, but not many more.  Always run with the minimum permissions you need.

Next we look at SQL Profiler.  It’s a great tool for debugging issues on a SQL Server instance, but it should never really be connected to your production database.  This can negatively impact performance of the database.  It’s far better to use it against either a local server or an offline backup or staging server.  Moreover, the very latest versions of SQL Server have functionality that SQL Profiler unfortunately doesn’t support.

Denny then moves on to look at the SQL queries we write.  He says that it’s really not worth the effort to ensure that SQL is written in a cross-platform manner (i.e. ANSI SQL).  Whilst it’ll work, of course, you’re really giving up a lot of functionality and performance improvements that have been built into the platform-specific dialects of SQL used on each database platform.  Using SQL that is written specifically for the platform you’re targeting will always allow you to write code in the most performant manner.  Moreover, it’s incredibly rare to actually need your SQL written in a cross-platform way, as it’s incredibly rare to ever want to migrate your databases to an entirely new database platform.

Denny then looks at some anti-patterns with the data itself.  He states that you should always use NULL where there’s an absence of data, and not values such as empty strings, minimum dates (e.g. 01/01/1990) and other “magic” values.  He also says that you should never blindly design your database schema to a specific level of normalization.  You should always consider the application that the data will support and the required performance of that data and domain design, and then design your database schema accordingly.  Next we talk about transaction logs.  Denny says he’s seen a number of people simply deleting transaction logs in order to reclaim disk space.  This is a bad idea, and if you find you really need more disk space, you should simply buy more disk space rather than severely impact your ability to recover your database from crashes by deleting the transaction log.  On the subject of transaction logs, Denny reminds us not to use RAID 5 for the disk array that will store the transaction log.  RAID 5 is not optimized for write-intensive operations – which the transaction log requires – and so performance will suffer as a result.  Also, never ever use AUTO_SHRINK to automatically reclaim disk space.  Whilst this does reclaim disk space, the negatives of doing this far outweigh the positives.

Next we look at columns and schema.  Denny reminds us to always use the correct and most appropriate data type for our data, and always be aware of the kinds of data we’ll be working with.  For example, in the US, the zip code (equivalent to the postcode in the UK) is entirely numeric – it’s a 5 digit number.  An integer column might seem appropriate here, but some states (such as Maine) have zip codes that start with a leading zero, and the code must always be written in 5 digit format with that leading zero preserved.  Further, zip codes / post codes are not always numeric once you move beyond the USA, so it’s highly likely you’ll need to support alphabetic characters in there too.  Also, don’t assume that certain values will never change and therefore use them as a primary key.  Some developers have previously used a US Social Security Number as a primary key thinking that it’ll never change, and whilst it very rarely does, it’s not guaranteed not to.

Some developers believe that views will improve performance, however, they really don’t.  And don’t ever be tempted to use nested views, as they are considered evil and incredibly difficult to debug.  Don’t require RDP access to a SQL Server in order to run queries against it; it’s not only a security risk, but it’s simply not required.  In thinking about the permissions that you can grant to users and other objects in SQL Server, always ensure that you grant the minimum amount of permissions required.  Also, don’t ever revoke permissions from the built-in database roles, such as the public role.  Denny talks about a time that an over-zealous auditor at a client insisted that certain permissions be removed from the public database role (this was a permission allowing access to the underlying Windows Registry).  When revoked, this caused users within the public role to be entirely unable to log in to the SQL Server due to SQL Server’s own requirement to access the Windows Registry when a user logs in!  And with that in mind, just as with third-party vendors, don’t blindly follow external auditors’ advice about what permissions you should or should not assign or revoke on your SQL Server.

After Denny’s session, it was time for lunch.  All of the attendees gathered in the main hall and headed towards one of the 4 main catering points throughout the venue.  For vegetarians, there was a rather nice option of butternut squash & sage tortellini with roasted Mediterranean vegetables.  This was followed by either a strawberry bavarois or a warm chocolate brownie with toffee sauce.  Well, I couldn’t resist the chocolate brownie, so after collecting my meal along with some freshly squeezed orange juice to wash it all down, I found an empty spot at one of the many tables and ate my lunch.

After my lunch I decided to take a stroll around the grounds of the International Centre as it had become such a lovely sunny day outside.  The morning’s sessions had been great but intensive, so this gentle stroll allowed me to clear my head, get some fresh air, enjoy the sunshine and put myself back in the right frame of mind required for the final two sessions of the day.  I made my way back inside the venue as the lunch break drew to a close and once again consulted my programme to determine the correct dome for the next session.  This time it was Dome 6 for Richard Douglas’s “Understanding the Transaction Log”.  After taking my seat, the speaker announced that he wasn’t actually Richard at all, as unfortunately, Richard was suffering from food poisoning so this talk was to be given by one of Richard’s colleagues, John Martin.

John starts by saying that the transaction log is not just all about backups.  There are numerous and varied uses for the transaction log, and it’s integral to a well-running and performant SQL Server.  John talks about the three recovery models for the transaction log.  There are “Simple”, “Full” and “Bulk-Logged” modes, and John is keen to point out that even when using Simple mode, everything is still logged to the transaction log.  There may not be as much detail as could be found if using the Full or Bulk-Logged modes, but everything is still there.  One of the golden rules is to only ever have one transaction log file for each database.  You can have numerous data files, however, transaction logs should be kept within a single file.  This improves performance and makes recovery and backups much easier.  John reminds us what the various modes do.  Simple mode is effectively auto-maintenance and auto-shrinking of our log – SQL Server will take care of all of this itself.  This may or may not be a good thing depending upon your use case.  In Full mode, SQL Server will not perform any of its own maintenance or shrinking at all.  You’re responsible for doing this.  This is very often the better choice as you will know your own use cases better than SQL Server does and so can schedule such maintenance for the most convenient and appropriate times.
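
Checking and switching the recovery model is a quick bit of T-SQL (the database name here is illustrative):

-- See which recovery model each database is currently using
SELECT name, recovery_model_desc FROM sys.databases;

-- Switch a database to the Full recovery model (or SIMPLE / BULK_LOGGED)
ALTER DATABASE [MyDatabase] SET RECOVERY FULL;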

John reminds us of the set of operations within SQL Server that are “minimally logged”.  Many common operations, such as SELECT * INTO and the creating, altering and dropping of indexes, can be minimally logged in the transaction log when using the Bulk-Logged (or Simple) recovery model, and this minimal logging impacts your ability to perform a complete “point-in-time” restore of your database in the event of a crash.

After this we look at the general process by which a standard SELECT or INSERT statement is processed by SQL Server.  We can see that there are a lot of moving parts in the entire process flow and how the transaction log is central to ensuring that SQL Server is able to provide the D (Durability) from the ACID properties that we require from our database engine.  John reminds us that it’s not until our data is fully persisted to the transaction log file that SQL Server considers the data persisted and committed to the database, as it’s only after this step that the data could be recovered in the event of a server crash.

John moves on to talk about the internals of the transaction log and how SQL Server uses the space within the log file.  Transaction log files are split into multiple VLFs (Virtual Log Files).  These VLFs exist within the same single physical file on disk.  A VLF can be in either an active or inactive state, and at any given time, there’s always at least one active VLF.  SQL Server will manage the creation of new VLFs as the transaction log grows over time, but it’s possible to control some of how VLFs are created yourself.  The creation of too many VLFs within a file is thought to be a bad thing and there are various discussions about how best to manage this.  Ordinarily, each growth “chunk” of the transaction log file is divided into 4, 8 or 16 VLFs, with the VLFs sized appropriately based upon the chunk size (a chunk being the amount by which the physical transaction log file grows on disk).
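
If you want to see the VLFs inside your own log file, the long-standing (though undocumented) DBCC LOGINFO command returns one row per VLF; newer builds also expose this via the sys.dm_db_log_info dynamic management function (its availability on your particular version is an assumption worth checking):

-- One row per VLF in the current database's transaction log
DBCC LOGINFO;

-- A supported alternative on recent versions (SQL Server vNext onwards)
SELECT * FROM sys.dm_db_log_info(DB_ID());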

We look at the makeup of a VLF and John tells us that within a VLF there are many “log blocks”.  Log blocks are between 512 bytes and 60KB in size, and a VLF will have as many log blocks as are required to fill the VLF space.  John states that it’s better to try to keep the log blocks at the largest possible size of 60KB, as performance is better with fewer, larger log blocks than with many smaller log blocks.  Inside the log block are the individual log records.  Each log record is identified by a unique Log Sequence Number (LSN).  The log sequence number is made up from the VLF sequence number, the log block number and the log record number.

John talks about how the transaction log grows over time.  First we look at log growth when using the “Simple” recovery mode.  SQL Server will periodically instigate a “checkpoint”.  This is the flushing of dirty pages of data within memory to disk, and any VLFs that have no open transactions within them are cleared down.  Note that this does not reclaim any disk space from the VLF, however.  If additional log records are added, SQL Server moves through the VLFs within the file looking for space to write the log records, primarily looking for currently inactive VLFs that have been previously cleared and reusing those.  If we reach the end of the file, we “wrap around” to the beginning of the file searching for inactive VLFs where we can write the transactions.  If no inactive VLFs are found, we must increase the size of the physical transaction log file.  This negatively impacts the server, which has to pause all activity whilst the physical file is extended on disk.  Then we look at log growth with the “Full” or “Bulk-Logged” recovery modes, which is almost the same as log growth with the Simple recovery mode, but instead of a checkpoint clearing the VLFs, it’s a transaction log backup that does so, which ensures we have much more transaction data available for a full recovery of the state of the server if required.

At this point, John talks about how we can make things go faster with SQL Server by improving transaction log performance.  We start with a good I/O architecture – consider the balance of reads and writes of your data and select an appropriate RAID strategy that’s optimized for your own use case, and remember that a bigger RAID cache is always better.  Within the architecture of your SQL Server itself, it’s best if you can determine the actual required size of your physical transaction log file at the beginning.  This is difficult to predict, but if you can get close, your server will perform better as a result.  We should be aware of such things as page splits, and particularly the “evil” variety, as not all page splits are bad.  These bad splits can have serious negative performance impacts.  John also cautions against using “delayed durability”, which effectively means asynchronous transaction writes.  This causes the server to consider data persisted even though it’s not yet fully written to disk.  Depending upon your application, delayed durability could be OK, but if your system must never lose a single transaction then don’t use it.  One time when it might be appropriate to temporarily enable delayed durability is during large-scale purging of data from the database.  John tells us to keep an eye on our indexes.  Too many of those means that each one is a discrete data structure that must be updated and persisted to disk individually, which can hurt performance.

Finally John tells us about monitoring the SQL Server transaction log file; we can use both external tools such as PERFMON for that as well as built-in SQL Server dynamic management objects such as the sys.dm_io_virtual_file_stats function.  Regularly reviewing these monitors and logs can help identify bottlenecks within the server and highlight the areas where issues may arise.
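
A sample query along those lines (my own sketch), pulling write counts and write latency for each database’s log file from sys.dm_io_virtual_file_stats:

-- Write volume and write stalls for every transaction log file on the instance
SELECT DB_NAME(vfs.database_id) AS database_name,
	mf.name AS log_file_name,
	vfs.num_of_writes,
	vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
	ON mf.database_id = vfs.database_id
	AND mf.file_id = vfs.file_id
WHERE mf.type_desc = 'LOG';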

After John’s session, it was time for the final afternoon coffee break, which this time was accompanied by a rather nice selection of cakes!  After a cup of coffee and a cake or two (the scones and carrot cake were particularly nice!) it was time to consult the conference programme one last time to determine the dome for the final session of the day.

This time it was Dome 3 for Emanuele Zanchettin’s “Performance Tips For Faster SQL Queries”.

Emanuele starts his session by talking about debugging.  He reminds us that debugging database queries often starts at the application layer with developers digging through C# code, but he states that debugging sometimes also stops there too.  We need to be aware that we often need to debug down into SQL Server itself.  With that, Emanuele asks the attendees if they’d rather have a talk full of slides or a talk full of real-world demos.  The attendees unanimously vote for demos, so from here Emanuele opens up his SQL Server Management Studio tool and begins to show us some T-SQL code.

He first creates a demo database which includes at least one table of over 2 million rows.  He writes some simple SELECT queries that contain a few joined tables.  We run the queries and see that they execute in a very short space of time; however, looking at the execution plan generated for our query, we can see that we can make the query perform even better.  So Emanuele’s first tip is to make sure you always check the execution plans generated for your queries, as they can often indicate where an additional index on the source tables would greatly improve query performance.
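
Alongside eyeballing individual plans, SQL Server also aggregates its missing-index suggestions in a set of dynamic management views; a query like the one below (my own sketch, and the suggestions always need sanity-checking before being acted upon) surfaces the candidates it thinks would help most:

SELECT TOP (10)
	mid.statement AS table_name,
	mid.equality_columns,
	mid.inequality_columns,
	mid.included_columns,
	migs.user_seeks,
	migs.avg_user_impact
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig
	ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs
	ON migs.group_handle = mig.index_group_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;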

Unfortunately, it was at this point that I had to leave Emanuele’s session in order to make an early start on my journey back home. I missed the end of Emanuele’s session as well as the final conference wrap-up and prize giving session, but I’d had a fantastic day at another incredibly well-run, well-organised and very informative SQLBits event.

As ever, I can’t wait until the next one!

Entity Framework with MySQL - Booleans, Bits and "String was not recognized as a valid Boolean" errors

In my current role, I'm using MySQL v5.6 as the database engine for an ASP.NET MVC web application which also uses Entity Framework v6.1.3 as the ORM to connect to the backend database, doing so via the use of the MySQL .NET Connector ADO.NET driver.

It appears that there are some "interesting" issues regarding how MySQL (and the .NET Connector) handle and model boolean entity properties, modelling them as a TINYINT(1) column by default, and causing odd and spurious errors, such as the "String was not recognized as a valid Boolean" error when performing entirely innocuous Entity Framework functions within your code.

The MySQL .NET Connector contains a bug whereby database columns that are generated for any boolean property of an entity, by default, get modelled with a TINYINT(1) data type rather than the more appropriate BIT datatype.  This, in itself, isn't the bug; however, the strange usage of a TINYINT(1) data type for a boolean column rather than the seemingly more appropriate BIT(1) datatype does enable the bug to subsequently raise its head.

In using a TINYINT(1) data type by default, this causes the MySQL .NET Connector to occasionally throw an exception which upon examination contains the error string, "String was not recognized as a valid Boolean".  This occurs when performing such innocuous functionality as using a number of Entity Framework Include methods when attempting to retrieve a large enough graph of objects from the database.  For example, code such as this:

var results = dbContext.SocietyMemberships
				.Include(m => m.TenantPersona)
				.Include(m => m.Society)
				.Include(m => m.IncomeTypeGroups)
				.Include(m => m.IncomeTypeGroups.Select(itg => itg.IncomeTypeGroup))
				.Include(m => m.IncludeTerritories)
				.Include(m => m.ExcludeTerritories);

return results.ToList();

Would result in the aforementioned exception being thrown.  However, take out a few of those "includes" and the query works perfectly with no exception thrown.

It appears that the spurious exception is caused by the MySQL .NET Connector's attempt to generate the necessary SQL statement required for the query; the subsequent parsing of that SQL statement fails to parse the included TINYINT(1) columns as boolean values.  In my (albeit limited) testing, it appeared that including 5 or fewer related tables did not trigger the error, but including more than 5 tables would cause the error consistently.

In order to fix the issue, we simply have to ensure that the MySQL connector, via Entity Framework, will model all boolean entity properties as a BIT(1) data type column.  This must be done explicitly, using code such as the following:

protected override void OnModelCreating(DbModelBuilder modelBuilder)
{
	base.OnModelCreating(modelBuilder);
	
	// other initialization code here...
	
	modelBuilder.Properties<bool>().Configure(c => c.HasColumnType("bit"));
}

Performing this configuration in the OnModelCreating method of the DbContext object ensures that this configuration applies to all entities, both those already existing and any new ones created in the future,  ensuring they are modelled with a BIT(1) column data type rather than a TINYINT(1) data type.

It should be noted that the same "String was not recognized as a valid Boolean" exception can be thrown when using other Entity Framework functionality, however, I've yet to pin down exactly what code can cause the error.  I suspect that, due to the nature of the error as detailed above, there's an incredible amount of different C# code that, when translated to the MySQL SQL dialect, generates SQL of sufficient amount and complexity that the exception is thrown.

Finally, it would appear that the choice of the MySQL .NET Connector to use TINYINT(1) as the default column data type instead of BIT(1) stems from the history of the BIT data type in MySQL.  An old blog post from Baron Schwartz that documents this history and behaviour states,

MySQL has supported the BIT data type for a long time, but only as a synonym for TINYINT(1) until version 5.0.3. Once the column was created, MySQL no longer knew it had been created with BIT columns.

In version 5.0.3 a native BIT data type was introduced for MyISAM, and shortly thereafter for other storage engines as well. This type behaves very differently from TINYINT.

He goes on to indicate how this behaviour has changed over time:

The old data type behaved just like the small integer value it really was, with a range from -128 to 127. The new data type is a totally different animal. It’s not a single bit. It’s a fixed-width “bit-field value,” which can be from 1 to 64 bits wide. This means it doesn’t store a single BIT value; it’s something akin to the ENUM and SET types. The data seems to be stored as a BINARY value, even though the documentation lists it as a “numeric type,” in the same category as the other numeric types. The data isn’t treated the same as a numeric value in queries, however. Comparisons to numeric values don’t always work as expected.

This change in behavior means it’s not safe to use the BIT type in earlier versions and assume upgrades will go smoothly.

It's from this chequered history that such issues converting "BIT" columns (which can potentially hold a large range of actual values) to a binary boolean appear to have arisen.

SQLBits 2016 In Review

On 7th May 2016 in Liverpool, the 15th annual SQLBits event took place at the new Liverpool Exhibition Centre.  The event had actually been running since Wednesday 4th, however, as with all other SQLBits events, the Saturday is a free, community day.

This particular SQLBits was rather special, as Microsoft had selected the event as the UK launch event for SQL Server 2016.  As such the entire conference had a very large Microsoft presence.

Since the event was in my home town, I didn’t have too far to travel to get to the venue.  That said, I did have to set my alarm for 6am (a full 45 minutes earlier than I usually do on a working weekday!) to ensure I could get the two different trains required to get me to the venue in good time.  The Saturday is jam-packed with content and as such, the event opened at the eye-watering time of 7:30am!

After arriving at the venue just as it was opening at 7:30am, I headed straight to the registration booth to confirm my registration and collect my conference lanyard.  Once collected, it was time to head into the main hall.  The theme for this year’s SQLBits was “SQLBits in Space”, so the various rooms within the hall where the sessions would take place were giant inflatable white domes.  In between the domes and around the main hall there was plenty of space and sci-fi themed objects.

After a short while, the venue staff started to wheel out the morning refreshments of tea & coffee, shortly followed by the obligatory bacon, sausage and egg sandwiches!

After enjoying the delicious breakfast, it was soon time to head off to the relevant “dome” for the first session of the day.  The SQLBits Saturday event had 9 different tracks, so choosing which talk to attend was difficult and there were always bound to be clashes of interesting content throughout the day.  For the first session, I decided to attend Aaron Bertrand’s T-SQL: Bad Habits and Best Practices.

Aaron’s talk is all about the various bad habits that we can sometimes pick up when writing T-SQL code and also the myths that have built up around certain approaches to achieving specific things with T-SQL.  Aaron starts by stating that we should ensure that we don’t make blind assumptions about anything in SQL Server.  We can’t always say that a seek is better than a scan (or vice-versa) or that a clustered index is better than a non-clustered one.  It always depends.  The first big myth we encounter is the often-stated claim that using SELECT * when retrieving all columns from a table is bad practice (instead of naming all columns individually).  This can be bad practice, as we don’t know exactly what columns we’ll be getting – e.g. columns added in the future will be returned by the query – however, it’s often stated that another reason it’s bad practice is that SQL Server has to look up the database metadata to figure out the column names.  The reality is that SQL Server will do this anyway, even with named columns!

Next, Aaron shows us a little tip using SQL Server Management Studio.  It’s something that many audience members already knew, but it was new to me.  He showed how you can drag-and-drop the “Columns” node from the left-hand treeview into a query window and it will add a comma-separated list of all of the table’s columns to the query text!

Aaron continues by warning us about omitting explicit lengths from varchar/nvarchar data types.  Without specifying explicit lengths, varchars can very easily be truncated to a single character as this simple T-SQL shows:

DECLARE @x VARCHAR = 'testing'; 
SELECT [myCol] = @x;
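-- Returns 't' – without an explicit length, the VARCHAR variable here defaults to a length of 1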

We’re told that we should always use the correct data types for our data!  This may seem obvious, but many times we see people storing dates as varchars (strings) simply to ensure they can preserve the exact formatting that they’re using.  This is a presentation concern, though, and doing this means we lose the ability to perform correct sorting and date arithmetic on the values.  Also, avoid using data types such as MONEY simply because they sound appropriate.  MONEY is a particularly bad example and should always be replaced with DECIMAL.

Aaron reminds us to always explicitly use a schema prefix when referencing tables and SQL Server objects within our queries (i.e. use [dbo].[TableName] rather than just [TableName]).  Doing this ensures that, if two different users of our query have different default schemas, there won’t be any strange potential side-effects to our query.

We’re reminded not to abuse the ORDER BY clause.  Using ORDER BY with an ordinal column number after it can easily break if columns are added, removed or their order in the schema altered.  Be aware of the myth that tables have a “natural order”; they don’t.  Omitting an ORDER BY clause may appear to order the data the same way each time, however, this can easily change if additional indexes are added to the table.

We should always use the SET NOCOUNT ON directive as this cuts down on noisy chatter in our application’s communication with SQL Server, but make sure you always test this first.  Applications built using older technologies, such as the original ADO from the Classic ASP era, can be reliant upon the additional count message being returned when NOCOUNT is off.

Next, Aaron highlights the cost of poorly written date / range queries.  He tells us that we shouldn’t use non-sargable expressions on a column – for example, if we use a WHERE clause which does something like WHERE YEAR([DateColumn]) = 2016, SQL Server will not be able to utilise any indexes that may exist on that column and will have to scan the entire table to compute the YEAR() function for the date column in question – a very expensive operation.  We’re told not to use the BETWEEN keyword as it’s imprecise – does BETWEEN include the boundary conditions or only everything between them?  It’s far better to explicitly use a greater-than-or-equal and less-than clause for date ranges – e.g.  WHERE [OrderDate] >= '1 Feb 2016' AND [OrderDate] < '1 March 2016'.  This ensures we’re not incorrectly including outlying boundary values (i.e. exactly midnight on 1st March, which BETWEEN would include).  Regarding dates, we should also be aware of date format strings.  Formatting a date with many date format strings can give entirely different values for different languages.  The only two “safe” format strings which work the same across all languages are YYYYMMDD and the full ISO 8601 date format string, “YYYY-MM-DDTHH:MM:SS”.
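
To make the sargability point concrete, here’s a small sketch (illustrative table and column names) showing both forms; only the second can use an index on the date column:

-- Non-sargable: the YEAR() function applied to the column forces a full scan
SELECT COUNT(*) FROM dbo.Orders WHERE YEAR(OrderDate) = 2016;

-- Sargable: an open-ended range over the bare column can seek on an index
SELECT COUNT(*) FROM dbo.Orders
WHERE OrderDate >= '20160201' AND OrderDate < '20160301';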

Aaron continues by reminding us to use the MERGE statement wisely.  We must remember that it effectively turns two statements into one, but this can potentially mess with triggers, especially if they rely on @@ROWCOUNT.  Next up is cursors.  We shouldn’t default to using a cursor if we can help it.  Sometimes, it can be difficult to think in set-based terms to avoid the cursor, but it’s worth the investment of time to see if some computation can be performed in a set-based way.  If you must use a cursor, it’s almost always best to apply the LOCAL FAST_FORWARD qualifier on our cursor definition as the vast majority of cursors we’ll use are “firehose” cursors (i.e. we iterate over each row of data once from start to end in a forward-only manner).  Remember that applying no options to the cursor definition effectively means the cursor is defined with the default options, which are rather “heavy-handed” and not always the most performant.

We’re reminded that we should always use sp_executesql when executing dynamic SQL rather than using the EXEC() statement.  sp_executesql allows the use of strongly-typed parameters (although unfortunately not for dynamic table or column names) which reduces the chances of SQL injection.  It’s not complete protection against injection attacks, but it’s better than nothing.  We’re also reminded not to use CASE or COALESCE in sub-queries.  COALESCE turns into a CASE statement within the query plan, which means SQL Server will effectively evaluate the inner query twice.  Aaron asks that we remember to use semi-colons to separate our SQL statements.  It protects against future edits to the query/code and ensures atomic statements continue to operate in that way.
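
For reference, a parameterised call via sp_executesql looks something like this (the table and parameter names are purely illustrative):

DECLARE @sql NVARCHAR(MAX) =
	N'SELECT * FROM dbo.Orders WHERE CustomerId = @CustomerId;';

-- The second argument declares the strongly-typed parameters used by the statement
EXEC sys.sp_executesql @sql, N'@CustomerId INT', @CustomerId = 42;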

Aaron says that we should not abuse the COUNT() function.  We very often write code such as:

IF (SELECT COUNT(*) FROM [SomeTable]) > 0 …..

when it’s really much more efficient to write:

IF EXISTS (SELECT 1 FROM [SomeTable]) ….

We don’t really need the count in the first query so there’s no reason to use it.  Moreover, if you do really need a table count, it’s much better to query the sys.partitions table to get the count:

-- Do this:
SELECT SUM(rows) FROM sys.partitions WHERE index_id IN (0,1)
AND object_id = (SELECT object_id FROM sys.tables WHERE name = 'Addresses')
-- Instead of this:
SELECT COUNT(*) FROM Addresses

Aaron’s final two points are to ensure we don’t overuse the NOLOCK statement.  It’s a magic “go-faster stripes” turbo button for your query, but it can produce inaccurate results.  This is fine if, for example, you only need a “ballpark” row count, however, it’s almost always better to use a scoped READ COMMITTED SNAPSHOT isolation level for your query instead.  This must be tested, though, as it can place a heavy load on the tempdb.  Finally, we should remember to always wrap every query we run in a BEGIN TRANSACTION and a COMMIT/ROLLBACK.  Remember – SQL Server doesn’t have an “undo” button!  And it’s perfectly fine to simply BEGIN a transaction when writing ad-hoc queries in SQL Server Management Studio, even if we don’t explicitly close it straight away.  The transaction will remain so long as the connection remains open, so we can always manually perform the commit or the rollback at a slightly later point in time.
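
Read committed snapshot is switched on per database; a minimal sketch (illustrative database name, and as noted above it needs testing because of the extra version-store load on tempdb):

-- Needs (near-)exclusive access to the database; often run with WITH ROLLBACK IMMEDIATE
ALTER DATABASE [MyDatabase] SET READ_COMMITTED_SNAPSHOT ON;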

And with that, Aaron’s session was over.  An excellent and informative start to the day.

After a coffee break, during which time there were some leftover breakfast bacon sandwiches available for those people who fancied a second breakfast, it was time to head off to the next session.  For this one, I’d chosen something a little leftfield.  This wasn’t a session directly based upon technology, but rather a session based upon employment within the field of technology.  This was Alex Whittle’s Permy, Contractor Or Freelance.

Alex’s session was focused on how we might get employed within the technology sector, the various options open to us in gaining meaningful employment and the pros and cons associated with each approach.

Alex starts his talk by introducing himself and talking us through his own career history so far.  He started as an employee developer, then a team lead and then director of software before branching out on his own to become a contractor.  From there, he became a freelancer and finally started his own consultancy company.

Alex talks about an employer’s expectations for the various types of working relationship.  For permanent employees, the focus is very much on your overall personality, attitude and ability to learn.  Employers are making a long term bet with a permanent employee.  For contractors, it’s your existing experience in a given technology or specific industry that will appeal most to the client.  They’re looking for someone who can deliver without needing any training “on-the-job”, although you’ll get time to “figure it out” whilst you’re there.  You’ll also have “tech-level” conversations with your client, so you’ll largely avoid the politics that can come with a permanent role.  Finally, as a freelancer, you’ll be engaged because of your technical expertise and your ability to deliver quickly.  You’re expected to be a business expert too, and your engagement will revolve around “senior management/CxO” level conversations with the client.

Alex moves on to discuss the various ways of marketing yourself based upon the working relationship.  For permanent employees it’s recruitment agencies, LinkedIn and keeping your CV up to date.  Your main marketing point is your stability, so your CV needs to show a list of jobs with good lengths of tenure for each one.  One or two shorter tenures are acceptable, but you’ll need to be able to explain them well to a prospective employer.  For contractors, it’s much the same avenues for marketing (recruitment agencies, LinkedIn and a good CV), but here the focus is quite different.  Firstly, a contractor’s CV can be much longer than a permanent employee’s CV, which is usually limited to 3 pages.  A contractor’s CV can be up to 4-6 pages long and should highlight relevant technical and industry experience as well as show contract extensions and renewals (although older roles should be in summary only).  For freelancers, it’s not really about your CV at all.  Clients are not interested in you per se, they’re interested in your company.  This is where company reputation and your ability to really sell the company itself has the biggest impact.  For all working relationships, one of the biggest factors is networking.  Networking will lead to contacts, which will lead to roles.  Never underestimate the power of simply speaking to people!

We then move on to talk about cash flow in the various types of working relationship.  Alex states how, for permanent employees, there’s long term stability, holiday and sickness pay and also a pension.  It’s the “safest” and lowest stress option.  For contractors, cash flow has medium term stability.  There’s no holiday or sickness pay and you’d need to pay for your own pension.  You need to build a good cash buffer of at least 6 months’ living expenses, but you can probably get started on the contracting road with only 1 or 2 months of cash buffer.  Finally, the freelance option is the least “safe” and has the least stability of cash flow.  It’s often very “spiky”, consisting of short periods of good income interspersed with longer periods of little or no income.  For this reason, it’s essential to build a cash buffer of at least 12 months’ living expenses, although the quieter income periods can be mitigated by taking on short term contracts.

Alex shares details of the time when he quit his permanent job to go contracting.  He says he sent around 20-30 CVs per week to various contract roles for the first 3 weeks but didn’t get a single interview.  A helpful recruiter eventually told him that it was probably largely to do with the layout of his CV.  This recruiter spent 30 minutes with him on the phone, helping him to reformat his CV, after which he sent out another 10 CVs to various contract roles and got almost 10 interviews as a result!

We continue by looking into the differences in accounting structures between the various working types.  As a permanent employee, there’s nothing to worry about at all here; it’s all sorted for you as a PAYE employee.  As a contractor, you’ll send out invoices, usually once a month, but since you’ll rarely have more than one client at a time, the invoicing requirements are fairly simple.  You will need to do real-time PAYE returns as you’ll be both a director and an employee of your Ltd. company, and you’ll need to perform year-end tax returns and quarterly VAT returns, however, you can use the flat-rate VAT scheme if it’s applicable to you.  This can boost your income as you charge your clients VAT at 20% but only have to pay 14.5% to HMRC!  As a freelancer, you’ll also be sending out invoices, however, you may have more than one client at a time, so you may have multiple invoices per month, thereby requiring better management of them (software such as Xero or QuickBooks can help here).  One useful tip that Alex shares at this point is that, as a freelancer, it can be very beneficial to join the Federation of Small Businesses (FSB) as they can help to pay for things like tax investigations, should you ever receive one.

Alex then talks about how you can, as an independent contractor, either operate as a sole trader, work through an umbrella company, or run your own Limited company.  A Limited company is usually the best route to go down, as Limited companies are entirely separate legal entities so you’re more protected personally (although not from things like malpractice), however, the previous tax efficiency of company dividends that used to be enjoyed by Ltd’s no longer applies due to that loophole in the law being closed.  As a sole trader, you are the company – the same legal entity, so you can’t be VAT registered and you’re not personally protected from liability.  When working through an umbrella company, you become a permanent employee of the umbrella company.  They invoice on your behalf and pay your PAYE.  This affords you the same protection as any other employee and takes away some of the management of invoicing etc., however, this is probably the least cost efficient way of working since the umbrella company will take a cut of your earnings.

We then move onto the thorny issue of IR35.  This is legislation designed to catch contractors who are really operating as “disguised employees”.  IR35 is constantly evolving and its application by HMRC can be inconsistent.  The best way to mitigate being “caught” inside the IR35 legislation is to do things that an employee does not do.  For example, advertising your business differentiates you from an employee; ensuring your contracts have a “right of substitution” (whereby the actual worker/person performing the work can be changed) helps; having multiple contracts at any one time – whilst sometimes difficult for a contractor to achieve – can greatly help; and showing that you are taking on risk (especially financial risk), along with being able to show that you don’t receive any benefits from the engagement as an employee would (for example, no sick pay), all strengthen your position.

Finally, Alex asks, “When should you change?”  He puts forward a number of questions that we’d each need to answer for ourselves.  Are you happy with your current way of working?  Understand the relative balance of income versus stress across the various working practices.  Define your goals regarding work/life balance.  Ask yourself why you would want to change and how you stand to benefit.  Where do you live?  Be aware that, very often, contracts may not be readily available in your area and that you may need to travel a considerable distance (or even stay away from home during the working week).  And finally, Alex asks that you ask yourself, “Are you good enough?”.  Alex closes by re-stating the key takeaways: enjoy your job, figure out your goals, increase your profile, network, remember that change can be good (but only for the right reasons), and start now – don’t wait.

After another coffee break following Alex’s session, it’s time for the next one.  This one was Lori Edwards’ SQL Server Statistics – What are the chances?

Lori opens by asking “What are statistics?”.  Just as Indexes provide a “path” to find some data, usually based upon a single column, statistics contain information relating to the distribution of data within a column across the entire set of rows within the table.  Statistics are always created when you create an index, but you can create statistics without needing an index.

Statistics can help with predicates in your SQL Server queries.  Predicates are the filtering conditions within your queries, for example in WHERE clauses and JOIN conditions.  Statistics contain information about density, which relates to the number of unique values in a column, along with cardinality, which relates to how many rows match a given value.  There are a number of different ways to create statistics: you can simply add an index, you can use the AUTO_CREATE_STATISTICS database option or the CREATE STATISTICS statement, or you can use the sp_createstats system stored procedure.  If you’re querying on a column, statistics for that column will be automatically created for you if they don’t already exist, however, if you anticipate heavy querying utilising a given column, it’s best to ensure that statistics are created ahead of time.
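
As a quick sketch (the table, column, database and statistics names are assumed purely for illustration), creating statistics ahead of time looks like this:

-- Create column statistics ahead of anticipated heavy querying
CREATE STATISTICS [stat_Addresses_City]
    ON [dbo].[Addresses] ([City]);

-- Or let SQL Server create single-column statistics automatically as needed
ALTER DATABASE [SomeDatabase] SET AUTO_CREATE_STATISTICS ON;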

Statistics are quite small and don’t take up as much space as indexes.  You can view statistics by running the sp_helpstats system stored procedure, or you can query the sys.stats system view or the sys.dm_db_stats_properties dynamic management function.  The best way of examining statistics, though, is to use the database console command, DBCC SHOW_STATISTICS.  When viewing statistics, low density values indicate a high level of uniqueness (density is roughly one divided by the number of distinct values).  Statistics histograms show a lot of data: RANGE_HI_KEY is the upper-bound key value of each histogram step, whilst RANGE_ROWS indicates how many rows fall between that step’s key value and the previous one.
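
A brief sketch of examining statistics, re-using the hypothetical table and statistics names from above:

-- List the statistics objects that exist on a table
SELECT s.[name], s.auto_created, s.user_created
FROM sys.stats AS s
WHERE s.[object_id] = OBJECT_ID('dbo.Addresses');

-- Show the header, density vector and histogram for one statistics object
DBCC SHOW_STATISTICS ('dbo.Addresses', 'stat_Addresses_City');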

The SQL Server Query Optimizer uses statistics heavily to generate the optimized query plan.  Note, though, that optimized query plans aren’t necessarily optimal for every situation; they’re the most optimal general purpose plans.  The optimizer’s purpose is to come up with a good plan, fast, and statistics are necessary for this to be able to happen.  To make the most of the cardinality estimates from statistics, it’s best to ensure you use parameters for queries and stored procedures, use temp tables where necessary and keep column orders consistent.  Table variables and table-valued parameters can negatively affect cardinality estimates.  Whether the query optimizer selects a serial or parallel plan can be affected by cardinality, as can the choice of an index seek versus an index scan.  Join algorithms (i.e. hash match, nested loops etc.) can also be affected.

From here, the query optimizer will decide how much memory it thinks it needs for a given plan, so memory grants are important.  Memory grants are effectively the cost of the operation multiplied by the number of rows that the operation is performed against, therefore, it’s important for the query optimizer to have accurate row count data from the statistics. 

One handy tip that Lori shares is in interpreting some of the data from the “yellow pop-up” box shown when hovering over certain parts of a query plan in SQL Server Management Studio.  She states how the “Estimated Number Of Rows” is what the table’s statistics say there should be, whilst the “Actual Number Of Rows” is what the query plan actually encountered within the table.  If there’s a big discrepancy between these values, you’ll probably need to update the statistics for the table!

Statistics are automatically updated by SQL Server, although they’re only updated after a certain amount of data has been added or updated within the table.  You can manually update statistics yourself by calling the sp_updatestats system stored procedure.

By default, tables inside a database will have AUTO_UPDATE_STATISTICS switched on, which is what causes the statistics to be updated automatically by SQL Server occasionally – usually once around 500 rows plus 20% of the table’s rows have been added or modified.  It’s usually best to leave this turned on, however, if you’re dealing with a table that contains a very large number of rows and has either many new rows added or many rows modified, it may be better to turn off the automatic updating of statistics and perform the updates manually, either after a specific number of modifications or at certain appropriate times.
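
A minimal sketch of doing that maintenance manually (object names assumed for illustration) might be:

-- Refresh every statistics object on one table, sampling all rows
UPDATE STATISTICS [dbo].[Addresses] WITH FULLSCAN;

-- Or refresh statistics across the whole database
EXEC sys.sp_updatestats;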

Finally, it’s important to remember that whenever statistics are updated or recomputed, any execution plans built on those statistics that were previously cached will be invalidated.  They’ll need to be recompiled and re-cached.

After Lori’s session, there’s another quick coffee break, and then it’s on to the next session.  This one was Mark Broadbent’s Lock, Block & Two Smoking Barrels.  Mark’s session focused on SQL Server locks.  Different types of locks, how they’re acquired and how to best design our queries and applications to ensure we don’t lock data for any longer than we need to.

Mark first talks about SQL Server’s transactions.  He explains that transactions are not committed to the transaction log immediately.  They are processed through in-memory buffers first before being flushed to disk.  Moreover, those in-memory buffers need to fill to a certain size before they get flushed to disk, so there’s always a possibility of executing a COMMIT TRANSACTION statement yet the transaction not being visible within the transaction log until some time later.  The transaction being available in the transaction log is the D in ACID – Durability, but Mark highlights that it’s really delayed durability.

Next, Mark talks about concurrency versus correctness.  He reminds us of some of the laws of concurrency control.  The first is that concurrent execution should not cause application programs to malfunction.  The second is that concurrent execution should not have lower throughput or higher response times than serial execution.  To balance concurrency and correctness, SQL Server uses isolation, and there are numerous isolation levels available to us, all of which offer differing trade-offs between concurrency and correctness.

Mark continues by stating that SQL Server attempts to perform our queries in as serial a manner as possible, and it uses a technique called transaction interleaving in order to achieve this between multiple concurrent and independent transactions.  Isolation levels attempt to solve the interleaving dependency problems.  They can’t completely cure them, but they can reduce the issues caused by interleaving dependencies.  Isolation levels can be set at the statement, transaction or session levels.  There are 4 types defined by the ANSI standards, but SQL Server 2005 (and above) offer a fifth level.  It’s important to remember that not all isolation levels can be used everywhere, for example, the FILESTREAM data type is limited in the isolation levels that it supports.
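
As a small illustration of those different scopes (the table name here is hypothetical), an isolation level can be set for the whole session, or overridden for a single statement with a table hint:

-- Session level: applies to subsequent transactions on this connection
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

BEGIN TRANSACTION;
    SELECT [Quantity] FROM [dbo].[Basket] WHERE [BasketId] = 42;
COMMIT TRANSACTION;

-- Statement level: a table hint (here HOLDLOCK, i.e. SERIALIZABLE behaviour)
-- overrides the session setting for this one query
SELECT [Quantity] FROM [dbo].[Basket] WITH (HOLDLOCK) WHERE [BasketId] = 42;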

We’re told how SQL Server’s locking is two-phased, and is considered so if every LOCK is eventually succeeded by an UNLOCK.  SQL Server has different levels of locks, and they can exist at various levels of granularity, from row locks to page locks all the way up to table locks.  When SQL Server has to examine existing locks in order to acquire a new or additional lock, it will only ever compare locks on the same resource.  This means that row locks are only ever compared to other row locks, page locks to other page locks and table locks to other table locks.  They’re all separate.  That said, SQL Server will automatically perform lock escalation when certain conditions occur, so when SQL Server has acquired more than 5000 locks of either row or page type, it will escalate those locks to a single table level lock.  Table locks are the least granular kind of lock and are very bad for performance and concurrency within SQL Server – basically, the one query that holds the table level lock prevents any other query from accessing that table.  For this reason, it’s important to ensure that our queries are written in such a way as to minimise the locks that they need, and to ensure that when they do require locks, those locks are as granular as they can be.  Update locks allow readers to continue accessing the same table and/or rows (they’re compatible with shared locks), but they’re not compatible with other update locks or exclusive locks, so it’s worth bearing in mind how many concurrent writes we attempt to make to our data.
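
One quick way to see the granularity and mode of the locks currently being held (purely illustrative; run it from a second connection while a transaction is open elsewhere) is to query the sys.dm_tran_locks DMV:

-- Show current lock requests, their granularity and their mode
SELECT request_session_id,
       resource_type,    -- e.g. KEY, PAGE, OBJECT
       request_mode,     -- e.g. S, U, X, IX
       request_status    -- e.g. GRANT or WAIT
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();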

Mark continues to show us some sample query code that demonstrates how some simple looking queries can cause concurrency problems and can result in lost updates to our data.  For example, Mark shows us the following query:

DECLARE @newquantity INT;
BEGIN TRAN T1
SELECT @newquantity = quantity FROM basket
SET @newquantity = @newquantity + 1
UPDATE some_other_table SET quantity = @newquantity
COMMIT TRAN T1

The above query can fail badly, with the required UPDATE being lost, if multiple concurrently running transactions perform this query at the same time.  This is due to transaction interleaving.  It can result in the two SELECTs happening simultaneously, both acquiring the same quantity value, but the two UPDATEs being performed in interleaved transactions, which means the second UPDATE to run is using stale data, effectively “overwriting” the first UPDATE (so the final newquantity value is one less than it should be).  The solution to this problem is to perform the quantity incrementing in-line within the UPDATE statement itself:

BEGIN TRAN T1
UPDATE some_other_table SET quantity = t2.newquantity FROM (SELECT quantity + 1 AS newquantity FROM basket) AS t2
COMMIT TRAN T1

Reducing the number of statements needed to perform some given function on our data is always the best approach.  It means our queries are as granular as they can be, providing us with better atomic isolation and thereby reducing the necessity to interleave transactions.

After Mark’s session was over, it was time for lunch.  Lunch at the SQLBits conferences in previous years has always been excellent, with a number of choices of hot, cooked food available, and this year was no different.  There was a choice of 3 meals: cottage pie with potato wedges, Moroccan chicken with couscous or a vegetarian option (I’m not quite sure what that was, unfortunately), each of which could be finished off with one of a wide selection of cakes and desserts!

I elected to go for the Moroccan chicken, which was delicious, and plumped for a very nice raspberry ripple creamy yoghurt.  An excellent lunch, as ever!

During lunch, I managed to catch up with a few old friends and colleagues who had also attended the conference, as well as talking to a few newly made acquaintances whilst wandering around the conference floor and the various sponsors stands.

After a good wander around, during which I was able to acquire ever more swag from the sponsors, it was soon time for the afternoon’s sessions.  There were only two more sessions left in the day, it now being around 14:30 after the late lunch hour was over, so I headed off to find the correct “dome” for the first of the afternoon’s sessions, Erland Sommarskog’s Dynamic Search Conditions.

Erland’s talk highlights the best approaches when dealing with dynamic WHERE and ORDER BY clauses in SQL Server queries, something that I’m sure most developers have had to deal with at some time or another.  For this talk, Erland uses his own Northgale database, which has the same schema as Microsoft’s old Northwind database, but with a huge amount of additional data added to it!

Erland first starts off by warning us about filtered indexes, which are indexes that themselves have a WHERE condition attached to them (e.g. WHERE value <> [somevalue]), as these tend not to play very well with dynamic queries.  Erland continues by talking about how SQL Server will deal with parameters to queries.  It will perform parameter “sniffing” to determine how best to optimize a static query by closely examining the actual parameter values we’re supplying.  Erland shows us both a good and a bad example: WHERE xxx = ISNULL(@xxx, xxx) versus WHERE (xxx = @xxx OR @xxx IS NULL).  He explains how the intended query will fail if you use ISNULL in this situation, since the ISNULL form will never match rows where the column itself is NULL.  We’re told how the SQL Server query optimizer doesn’t look at the body of the stored procedure itself, so it really has no way of knowing if any parameters we pass in to it are altered or modified in any way by the stored procedure’s code.  For this reason, SQL Server must generate a query plan that is optimized for any and all possible values.  This is likely to be somewhat suboptimal for most of the parameters we’re likely to supply.  It’s for this reason that the execution plan can show things like an index scan against an index on a “FromDate” datetime column even if that parameter is not being passed to the stored procedure.  When we’re supplying only a subset of parameters for a stored procedure with many optional parameters, it’s often best to use the OPTION (RECOMPILE) hint to force a recompilation of the query every time it’s called.  This way, the execution plan is regenerated based upon the exact parameters in use for that call.  It’s important to note, however, that recompiling queries is an expensive operation, so it’s best to measure exactly how often you’ll need to perform such queries.  If you’re calling the query very frequently, you may well get the best performance from using purely dynamic SQL.
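
As a hedged sketch of that optional-parameter pattern (the procedure, table and column names are my own illustration, not Erland’s code), it might look something like this:

CREATE PROCEDURE dbo.SearchOrders
    @CustomerId INT      = NULL,
    @FromDate   DATETIME = NULL
AS
BEGIN
    SELECT o.[OrderID], o.[CustomerID], o.[OrderDate]
    FROM dbo.Orders AS o
    WHERE (o.[CustomerID] = @CustomerId OR @CustomerId IS NULL)
      AND (o.[OrderDate] >= @FromDate   OR @FromDate   IS NULL)
    OPTION (RECOMPILE);   -- the plan is rebuilt for the parameters actually supplied
END;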

Erland then moves on to discuss dynamically ordering data.  He states that a CASE expression inside the ORDER BY clause is the best way to achieve this, for example: ORDER BY CASE WHEN @sortcolumn = ‘OrderID’ THEN [Id] END, CASE WHEN @sortcolumn = ‘OrderDate’ THEN [Date] END ….. etc.  This is a great way to achieve sorting by dynamic columns, however, there’s a gotcha with this method, and that is that you have to be very careful of datatype differences between the columns in the different CASE expressions, as this can often lead to errors.
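
Spelled out a little more fully (and keeping each column in its own CASE expression so the datatypes never get mixed), the pattern looks roughly like this; the table, columns and @sortcolumn value are assumed purely for illustration:

-- In practice @sortcolumn would be a parameter of the surrounding procedure;
-- declared here so the sketch runs standalone
DECLARE @sortcolumn NVARCHAR(20) = 'OrderDate';

SELECT o.[OrderID], o.[OrderDate], o.[CustomerID]
FROM dbo.Orders AS o
ORDER BY
    CASE WHEN @sortcolumn = 'OrderID'   THEN o.[OrderID]    END,
    CASE WHEN @sortcolumn = 'OrderDate' THEN o.[OrderDate]  END,
    CASE WHEN @sortcolumn = 'Customer'  THEN o.[CustomerID] END;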

Next, we look at the permissions required in order to use such dynamic SQL, and Erland says that it’s important to remember that any user who wishes to run such a dynamic query will require permissions to access the underlying table(s) upon which the dynamic query is based.  This differs from (say) a stored procedure, where the user only needs permissions to the stored procedure and not necessarily the underlying table upon which the stored procedure is based.  One trick that can be used to gain something of the best of both of these approaches is to use the sp_executesql system stored procedure.  Using this will create a nameless stored procedure from your query, cache it and execute it.  The cached plan can then be re-used on subsequent calls to the query, with the nameless stored procedure being identified based upon a hash of the query text itself.

Another good point that Erland mentions is to ensure that all SQL server objects (tables, functions etc.) referenced within a dynamic query should always be prefixed with the full schema name and not just referenced by the object name (i.e. use [dbo].[SomeTable] rather than [SomeTable]).  This is important as different users who run your dynamic SQL code could be using different default schemas – if they are and you haven’t specified the schema explicitly, the query will fail.

Erland also mentions that one very handy tip with dynamic queries is to always include a @debug input parameter of datatype bit, with a default value of 0 (off).  It allows you to pass in a value of 1 (on) so that code such as IF @debug = 1 PRINT @sql will run, letting you output the actual T-SQL query generated by the dynamic code.  Erland says that you will need this eventually, so it’s always best to build it in from the start.

When building up your dynamic WHERE clause, one tricky issue is knowing whether to prepend an AND to a condition: the 2nd and higher conditions need it, but the first condition of the WHERE clause doesn’t.  One simple way around this is to make it so that all of the dynamically added conditions are always the 2nd or higher numbered condition, by statically creating the first WHERE clause condition in your code as something benign such as “WHERE 1 = 1”.  This, of course, matches all records, and all subsequently added conditions can then always be prefixed with an AND, for example, IF @CustomerPostCode IS NOT NULL SET @sql += ‘ AND Postcode LIKE ….. ’.  It’s also important to always pass values into the dynamic SQL as parameters rather than concatenating them into the string (i.e. avoid doing @sql += ‘ AND OrderId = ‘ + @OrderId), as concatenation will mess with the query optimizer and your generated queries will be less efficient overall as a result.  Moreover, raw concatenation of values can be a vector for SQL injection attacks.  For the same reason, you should always translate any column or sort names for the WHERE and ORDER BY clauses that are passed into your stored procedure: translate the passed parameter value to a specific hard-coded value that you explicitly control, and don’t just splice the passed-in value directly into the query.
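
Pulling those points together, here’s a hedged sketch of the overall pattern (the table, columns and parameter names are my own illustration rather than Erland’s code):

-- In practice @CustomerId, @PostCode and @debug would be parameters of the
-- surrounding stored procedure; declared here so the sketch runs standalone
DECLARE @CustomerId INT          = 42,
        @PostCode   NVARCHAR(10) = NULL,
        @debug      BIT          = 1;

DECLARE @sql NVARCHAR(MAX) = N'SELECT [OrderID], [OrderDate], [CustomerID]
FROM [dbo].[Orders]
WHERE 1 = 1';

IF @CustomerId IS NOT NULL
    SET @sql += N' AND [CustomerID] = @CustomerId';          -- a parameter, not a concatenated value
IF @PostCode IS NOT NULL
    SET @sql += N' AND [PostCode] LIKE @PostCode + N''%''';

IF @debug = 1
    PRINT @sql;   -- the "you will need this eventually" debug switch

EXEC sys.sp_executesql
    @sql,
    N'@CustomerId INT, @PostCode NVARCHAR(10)',
    @CustomerId = @CustomerId,
    @PostCode   = @PostCode;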

Occasionally, it can be a useful optimization to inline some WHERE clause values in order to force a whole new query plan to be cached.  This is useful in the scenario where, for example, you’re querying by order city and 60% of all orders are in the same city.  You can inline that one city value to get a cached plan just for that city and a different single cached plan for all other cities.

Finally, for complex grouping, aggregation and the dynamic selection of the columns returned from the query, Erland says it’s often easier and more robust to construct these kinds of queries in the client application code rather than in a dynamic-SQL-producing stored procedure.  One caveat around this is to ensure that you perform the entirety of your query construction client-side (or entirely server-side if you must) – don’t try to mix and match by performing some client-side and some server-side.

And with this, Erland’s session on dynamic SQL search conditions is complete.  After yet another short coffee break, we’re ready for the final session of what has been a long, but information-packed, day.  For this final session, I decided to attend Simon D’Morias’ “What is DevOps For Databases?”

Simon starts with explaining the term “DevOps” and reminds us that it’s the blurring of lines between the two traditionally separate disciplines of development and operations.  DevOps means that developers are far closer to the “operations” side of applications which frequently means getting involved with deployments, infrastructure and a continuous delivery process.  DevOps is about automation of application integration and deployment, provably and reliably.

Simon shows the three pillars upon which a successful DevOps process is built: Develop, Deploy & Measure.  We develop some software, deploy it and then measure the performance and reliability of the deployment.  From this measurement we can plan better and can thus feed this back into the next iteration of the cycle.  We’re told that to make these iterations work successfully, we need to keep changes small.  With small changes, rather than larger ones, we can keep deployment simple and fast.  It allows us to gather frequent feedback on the process and allows continuous improvement of the deployment process itself.  With the teams behind the software (development, operations etc.) being more integrated, there’s a greater spread of knowledge about the software itself and the changes made to it in a given development/deployment cycle, which improves early feedback.  Automation of these systems also ensures that deployment is made easier and thus also contributes to better and earlier feedback.

When it comes to databases, DBAs and other database professionals are frequently nervous about automating any changes to production databases, however, by keeping changes small and to a minimum within a given deployment cycle, and by having a continuously improving, robust process for performing that deployment, we can ensure that each change is less risky than if we performed a single large change or upgrade to the system.  Continuous deployment also allows failures to be detected fast, which is a good thing.  We don’t want failures caused by changes to take a long time to surface before we’re made aware of them.  Failing fast allows easy rollback, and the reliability of the process enables automation, which further reduces risk.  Naturally, monitoring plays a large part in this, and a comprehensive monitoring infrastructure allows the detection of issues and failures and lets reliability improve over time which, again, further reduces risk.

Simon moves on to discuss the things that can break DevOps.  Unreliability is one major factor that can break a DevOps process: even something running at 95% reliability is no good, as that 5% failure rate will kill you.  Requiring approval within the deployment chain (i.e. some manual approval, governance or compliance process) will break continuity and is also a potential bottleneck for a successful DevOps deployment iteration.  A greater “distance” between the development, operations and other teams will impact their ability to be knowledgeable about the changes being made and deployed.  This will negatively impact the teams’ ability to troubleshoot any issues in the process, hindering the improvement of reliability.

It can often be difficult to know where to start with moving to an automated and continuous DevOps process.  The first step is to ensure we can “build ourselves a pipeline to live” – a complete end-to-end automated continuous integration process.  There are numerous tools available to help with this, and SQL Server itself assists in this regard with the ability to package SQL Server objects into a DACPAC package.  Simon insists that attempting to proceed with only a partial implementation of this process will not work.  It’s an all or nothing endeavour.  Automating deployments to development and test environments, but not to the production environment (as some more nervous people may be inclined to do), is like building only half a bridge across a chasm.  Half a bridge is no bridge at all!

Simon concludes by showing us a quick demo of a simple continuous deployment process using Visual Studio to make some local database changes, which are committed to version control using Git and then pushed to Visual Studio Team Services (previously known as Visual Studio Online) which performs the “build” of the database objects and packages this into a DACPAC package.  This package is then automatically pushed to an Azure DB for deployment.

Finally, Simon suggests that one of the best ways to ensure that our continuous deployment process is consistent and reliable is to ensure that there are minimal differences (ideally, no differences) between our various environments (development, test, staging, production etc.), and especially between our staging and production environments.

After Simon’s session was over, it was time for all of the conference attendees to gather in the main part of the exhibition hall and listen as one of the conference organisers read out the names of those who had won prizes by filling in forms and entering competitions run by each of the conference sponsors.  I didn’t win a prize, but then I’d entered very few competitions, having been far too busy either attending the many sessions or drinking copious amounts of coffee in between them!  Once the prizes were all dished out, it was sadly time for yet another fantastic SQLBits conference to end.  It had been a long but brilliant day at another superbly organised and well run event.  Here’s hoping next year’s conference is even better!