SQLBits 2017 In Review

This past Saturday 8th April 2017, the annual SQLBits conference took place in the International Centre in Telford, Shropshire. The event is a four-day conference, with the first three days being a paid conference and the final day, the Saturday, always being a free community day.

I’d had to get up quite early for this event, setting my alarm for 5:30am to allow me to get my all-important cup of coffee before setting off for the approximately 90 mile journey to arrive at the conference for the opening time of 7:30am!

After arriving at the venue, which on this occasion had free car parking in the venue’s ample car park, I headed indoors to search for the registration booth. It appeared that I’d arrived slightly early as I joined a small crowd that was gathered in one of the venue’s corridors just outside the entrance to a large hall where the event itself was taking place. After a short wait, we were informed that we could enter the hall and proceed to the registration booths on the way in.

After registering and receiving my badge, conference programme and goodie bag, I headed to the catering area for another all-important cup of coffee and some breakfast.

This year, having recently become vegetarian, I knew the almost obligatory bacon butties would not be an option, so I quickly acquired a cup of coffee (the most important thing) and searched for the vegetarian breakfast option. The choice was of either a fruit smoothie or a fruit bowl. I selected the fruit bowl and made my way to one of the many tables dotted around the venue to consume my breakfast and take a peek through the goodie bag!

After a short while, an announcement came over a venue-wide loudspeaker system to tell the attendees that the first sessions would be starting in 10 minutes. This year, there were eight tracks of talks, each one presented in one of eight separate domes scattered throughout the venue. I quickly finished my coffee, collected my things and made my way to Dome 3 for my first session, Ust Oldfield’s “A Deep Dive Into Data Lakes”.

Ust first introduced himself as a consultant working for Adatis who provide Business Intelligence (BI) consultancy services. He says that, as a company, they were fully conversant with standard data warehouses, but needed to move forward in order to understand the relatively recent phenomenon of data lakes on the Microsoft Azure platform.

Ust asks the audience who already knows about Data Lakes. Not many of us do, so he asks if we’re familiar with Hadoop – the Apache Foundation’s distributed computing framework – and a few more people are. Ust explains that, under the hood, Azure’s Data Lake is a combination of distributed Azure BLOB storage (and so can work with any file type or size) with Hadoop overlaid on top to provide the distributed compute capability.

Azure Data Lake (hereafter referred to as ADL) uses things called “Extents” to contain its data, which are 250MB blocks that all storage is divided into. Ust explains that ADL uses the “lambda” architecture and allows users to perform computations and queries using a language called U-SQL, which he says is like a cross between T-SQL used in SQL Server and C#. All of the files that are added to a data lake can be set to automatically expire and be deleted, and so ADL contains functionality to allow some automated maintenance of the data held within it.

All data warehouses and lakes in ADL go through three stages: the ingestion of raw data, an enrichment phase (where the data is verified, de-duplicated, cleaned and augmented with additional data from other sources) and finally a stage where the data is curated and presented for user consumption. U-SQL scripts specify how the raw input data is transformed into output data. U-SQL includes traditional ways to select and filter raw data, similar to how SQL Server would provide such functionality, but also includes other methods of transforming data more specific to distributed data sets, such as MapReduce.

ADL provides a dashboard within the Azure website where ADL can be accessed and scripts created and run against the ADL data. However, there is also Microsoft Visual Studio tooling available so that many of the ADL functions can be accessed through Visual Studio. One very interesting feature is that U-SQL scripts that would normally be confined to running within Azure can be downloaded to a local machine and debugged using Microsoft Visual Studio. It’s important to note that some functionality of ADL can only be accessed via the ADL Visual Studio tooling.

When performing queries against ADL data, U-SQL scripts are split and parallelized into multiple “vertexes”, which are the discrete units of computation within ADL. Each vertex can be independent or dependent upon a previous vertex completing its computation. You can manage vertexes and their dependencies within ADL, but this is a piece of functionality only available within the ADL Visual Studio tooling.

Ust shows us a demo of some U-SQL queries running over some sample ADL data. He demonstrates how, although ADL (like almost all Azure features) is a pay-as-you-go service, reworking your queries to be longer-running queries that use fewer ADLAUs (Azure Data Lake Analytics Units – the discrete single compute/billing units within ADL) can actually save you a lot of money. This is due to how ADL charges are calculated, meaning that it’s far more expensive to use more ADLAUs than it is to use fewer but for a longer time.
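
To make the arithmetic concrete, here is a rough sketch in Python. It assumes that a job is billed as units multiplied by running time multiplied by an hourly rate, and that adding units does not speed the job up proportionally. The rate and the job timings are invented purely for illustration; they are not real Azure prices or figures from Ust’s demo.

```python
def job_cost(analytics_units: int, hours: float, rate_per_unit_hour: float) -> float:
    """Cost of one job, assuming billing is units x hours x hourly rate."""
    return analytics_units * hours * rate_per_unit_hour


# Hypothetical price per unit-hour (NOT a real Azure price).
RATE = 2.0

# Hypothetical timings for the same query: five times the units only
# makes it three times faster, because not all of the work parallelises.
wide = job_cost(analytics_units=10, hours=1.0, rate_per_unit_hour=RATE)
narrow = job_cost(analytics_units=2, hours=3.0, rate_per_unit_hour=RATE)

print(f"10 units for 1 hour:  {wide:.2f}")
print(f" 2 units for 3 hours: {narrow:.2f}")
```

With these made-up numbers the slower job uses 6 unit-hours instead of 10, so it comes out cheaper even though it takes three times as long.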

Ust shows us some tips around using “partition elimination”, which is a mechanism whereby data is pre-filtered prior to being distributed and computed upon by your standard U-SQL scripts. Partition Elimination is best implemented with a deliberately defined file naming system (e.g. MyLogs_2017_05_01.txt, MyLogs_2017_05_16.txt etc.). Using such a mechanism, you can filter the data files to be included within the U-SQL compute based upon partial filename matches and wildcards (e.g. you could process MyLogs for the month of April 2017 with a pre-filter such as MyLogs_2017_04_**.txt). Ust tells us some more about the ADL data and the requirements for its storage. He says that indexes are mandatory within ADL data, but that we can only have a single clustered index on each table. Currently, ADL does not support non-clustered indexes, however this is something that may come in the future.
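
The idea of filtering by filename before any processing happens can be sketched outside of ADL. The snippet below is plain Python rather than U-SQL, and the file list and the select_partitions helper are hypothetical; it only illustrates how a consistent naming scheme lets a wildcard discard whole files up front.

```python
from fnmatch import fnmatchcase

# Hypothetical log files following a MyLogs_YYYY_MM_DD.txt naming scheme.
files = [
    "MyLogs_2017_03_28.txt",
    "MyLogs_2017_04_01.txt",
    "MyLogs_2017_04_16.txt",
    "MyLogs_2017_05_01.txt",
]


def select_partitions(names, pattern):
    """Keep only the files whose names match the wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Only the April 2017 files survive; the others are never opened.
april = select_partitions(files, "MyLogs_2017_04_*.txt")
print(april)
```

Because the date is encoded in the name, the filter never has to read the contents of the March or May files at all.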

Finally, Ust talks about data “skew”, which is the mechanism of how your dataset is distributed throughout the cluster for computing. Data can be split for processing based upon a round-robin technique, which guarantees an even distribution of data across all nodes in the cluster, but does not guarantee that similar data will be kept together and processed by the same node. This can cause a performance degradation of the compute function as potentially separate nodes must communicate much more to transfer related data when it’s on multiple nodes. The other technique for data distribution is to split the data based upon a hash. This guarantees that related data will be kept together on the same node – thus potentially improving the compute performance – but can now no longer guarantee that the data is evenly distributed across all the nodes in the cluster. This means that some nodes will have significantly more work to perform than other nodes which can again impact overall compute performance. Therefore, it’s essential that you understand the general “shape” of your raw data in order to maximise the compute performance – and thus minimise the overall cost – of your ADL service.
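
The trade-off between the two distribution techniques can be seen in a small simulation. This is only a sketch in plain Python, not how ADL itself is implemented: the sample rows, the node count and the use of CRC32 as the hash are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from zlib import crc32

NODES = 3
# Hypothetical rows of (key, value); one key is far more common than the rest.
rows = [("alice", 1), ("alice", 2), ("alice", 3), ("alice", 4), ("bob", 5), ("carol", 6)]


def round_robin(rows, nodes):
    """Deal rows out to nodes in turn, ignoring their content."""
    return [(i % nodes, row) for i, row in enumerate(rows)]


def hash_distribute(rows, nodes):
    """Send each row to the node chosen by a hash of its key."""
    return [(crc32(row[0].encode("utf-8")) % nodes, row) for row in rows]


def rows_per_node(placement, nodes):
    """How many rows each node has been given."""
    counts = Counter(node for node, _ in placement)
    return [counts[n] for n in range(nodes)]


def nodes_per_key(placement):
    """Which nodes hold rows for each key."""
    seen = defaultdict(set)
    for node, (key, _) in placement:
        seen[key].add(node)
    return dict(seen)


rr = round_robin(rows, NODES)
hd = hash_distribute(rows, NODES)
print("round robin:", rows_per_node(rr, NODES), nodes_per_key(rr))
print("hash:       ", rows_per_node(hd, NODES), nodes_per_key(hd))
```

Round-robin gives every node two rows but scatters the common key across all three nodes, whereas hashing keeps each key on a single node at the price of one node holding at least four of the six rows.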

After Ust’s session, we had a quick coffee break and I grabbed another cup of coffee. There was just enough time to drink my coffee and take a very quick look around the main hall of the venue before I had to make my way back to Dome 3 for the next session. This one was Hugo Kornelis’s “Normalization Beyond Third-Normal Form”.

Hugo starts his session by reminding us of some key concepts that we need to be aware of when performing any data modelling activity. He talks about the “Universe Of Discourse” which is the view of reality as defined by the data/software model; it’s not necessarily the view of actual reality. We then look at the purpose of database normalization. We recall that normalization is the process of organising our data into columns and tables in such a way as to reduce redundancy and improve data integrity. Hugo points out that normalization’s purpose is not to prevent incorrect data but to prevent impossible, inconsistent or business rule violating data. We can’t stop the user from entering false data into the name column, but we can prevent them from providing us with a non-date value for their birthdate. Hugo also reminds us that normalization is never performed at a database level, only at a table level. It’s perfectly possible to have a database that, across its many tables, contains multiple forms of normalization.

Next, we look at what defines normalization. Hugo tells us that it’s based upon Functional Dependency. This is a constraint that dictates that for every value in Column A in a relationship, there is at most one value for Column B (i.e. A –> B). Column A can actually be a composite of multiple actual columns (i.e. {A,B} –> C) and Hugo gives the conference-specific example of a SQLBits Dome number and a chair number which together can define the exact name of the attendee sitting there. It’s possible that the composite can exist on the other side of the relationship (i.e. A –> {B,C}), however this can be reduced to two constraints of A –> B and A –> C.
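
A functional dependency is easy to test for against sample data. The sketch below is a minimal Python check; the seating rows and the holds function are invented for illustration, loosely following the dome and chair example. Note that sample data can only disprove a dependency, since whether one truly holds is a matter of business rules.

```python
def holds(rows, determinant, dependent):
    """True if every combination of determinant values maps to one dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in determinant)
        value = tuple(row[column] for column in dependent)
        # setdefault stores the first value seen for this key and returns it
        # on later visits, so a mismatch means the dependency is broken.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical seating data: dome and chair together identify an attendee.
seating = [
    {"dome": 3, "chair": 12, "attendee": "Ann"},
    {"dome": 3, "chair": 13, "attendee": "Bob"},
    {"dome": 4, "chair": 12, "attendee": "Cat"},
]

print(holds(seating, ["dome", "chair"], ["attendee"]))  # {dome, chair} -> attendee
print(holds(seating, ["chair"], ["attendee"]))          # chair alone is not enough
```

The composite of dome and chair determines the attendee, but chair number 12 appears in two domes with different people, so chair on its own does not.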

Hugo reminds us of 3rd normal form. This is the most “popular” normal form that many people take their databases to and then stop there. 3rd normal form (3NF) states that, in a given table, every non-key column is dependent upon the key, the whole key and nothing but the key (so help me Codd!). We can use an algorithm called “Bernstein’s Algorithm for Synthesis of a 3rd Normal Form Schema” to help us create a database schema that is guaranteed to be in 3rd normal form, so long as all of our functional dependencies are known up front. Hugo also mentions Boyce-Codd normal form, which is based upon 3rd normal form but extends the requirement so that all columns in a table, including key columns, must be dependent upon the key. When all columns in a table are dependent upon the key, there should usually be no duplicated data within that table’s rows.

Hugo proceeds by detailing something called Elementary Key normal form. This is perhaps a little known and little used type of normal form, based upon 3rd normal form but where the constraint is defined as only non-elementary columns being dependent upon the key. So what is a non-elementary column? Well, it’s where a functional dependency such as {A,B} –> C does not have the reverse dependencies of either C –> A or C –> B. This can also be expressed as follows: for every full non-trivial functional dependency of the form A –> B, either A is a key or B is (a part of) an elementary key. Hugo explains that, in practice, Elementary Key normal form is almost identical to 3rd normal form.

From here, Hugo takes us into the more elaborate normal forms. We start with 4th normal form. 4th normal form, unlike the lower normal forms, is less concerned with functional dependencies, but rather with multi-valued dependencies. These are best explained with an example. Hugo uses a table representing the availability of experts to discuss SQL problems on given days of the week:

Day	Expert	Subject
Monday	Jim	Design
Monday	Jim	Tuning
Tuesday	Jim	Debugging
Tuesday	Fred	Design

Looking at this data, we can infer the following fact: On Monday, you can ask Jim about Design. From this fact, we can further infer two additional facts: On Monday, you can ask Jim questions, and Jim knows about Design. In looking at the two facts that we’ve inferred, we can see that it is not possible to work backwards and infer the first original fact merely from the two subsequent facts. This is a violation of 4th normal form. In order to make this data compliant with 4NF, we must separate the information regarding days of the week and subjects into different tables. Each table then becomes compliant with 4th normal form:

Expert	Day
Jim	Monday
Jim	Tuesday
Fred	Tuesday

Expert	Subject
Jim	Design
Jim	Tuning
Jim	Debugging
Fred	Design
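
The split above can be reproduced with a few lines of Python, using sets of tuples as stand-ins for tables (an illustration only, not anything shown in the session). It also makes one assumption explicit: the two-table design treats an expert’s availability and an expert’s subjects as independent facts, so joining the tables back together pairs every available day with every known subject.

```python
# The original three-column table as (day, expert, subject) rows.
original = {
    ("Monday", "Jim", "Design"),
    ("Monday", "Jim", "Tuning"),
    ("Tuesday", "Jim", "Debugging"),
    ("Tuesday", "Fred", "Design"),
}

# Project the original onto the two smaller tables.
expert_day = {(expert, day) for day, expert, _ in original}
expert_subject = {(expert, subject) for _, expert, subject in original}

# Join the two tables back together on the expert column.
rejoined = {
    (day, expert, subject)
    for expert, day in expert_day
    for other_expert, subject in expert_subject
    if expert == other_expert
}

# Rows implied by the split design that were not in the original table.
extra = rejoined - original
print(sorted(expert_day))
print(sorted(expert_subject))
print(sorted(extra))
```

The join contains all four original rows plus three more for Jim (for example Monday and Debugging), so the two-table design is only a faithful model if the business rule really is that an expert can be asked about any of their subjects on any day they are available.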

After 4NF we move on to look at 5th normal form. 5th normal form is based upon 4th normal form but extends the rules to dictate that there must be no “join dependencies” between the columns except based upon key. A join dependency is effectively the ability to take a single table, split it into multiple tables and be able to recreate the original table by constructing a query that joins the split tables back into one. In practice, a table being in 5NF effectively means that if a column has the same value in multiple rows and to remove the value from the table requires the removal of multiple columns then our table is not compliant with 5th normal form. 5NF is so closely related to 4NF that it’s very rare for a table compliant with 4NF to not also be compliant with 5NF.

Expert
Jim
Fred

Day
Monday
Tuesday

Subject
Design
Tuning
Debugging

Hugo briefly touches upon 6th normal form. He starts by stating that 6th normal form is very hard to find in practice, being far more an academic curiosity. 6NF is based upon 5NF but further constrains the join dependencies to state that no join dependencies, even those implied by key, are allowed to exist within the table. This effectively means that there can never be any NULL values within any columns of a 6NF table. There would be no need for NULLs as we could simply remove the entire row. The primary reason we don’t see tables and especially entire databases that conform to 6th normal form is that 6NF largely implies that our entire data schema is modelled using a very large number of tables with each table having only a key column and a data value column. Today’s real-world database platforms are simply not optimised to operate with such a data schema and so data normalisation to this level is rarely, if ever, performed in the real world.

Hugo next talks about Optimal Normal Form. This is based upon 6NF but prevents the “splitting” of tables if “elementary fact types” would be split. Elementary fact types are multiple columns that would have to remain together in a single table to ensure integrity of data. Again, optimal normal form is very rarely found in the real-world.

Finally, Hugo talks about an entirely different type of data normalization, and this is known as Domain/Key Normal Form. Domain/Key Normal Form (DKNF) is not based upon functional dependencies like all other forms of normalization, but is instead based solely upon domain constraints and key constraints. Domains in this context refer to the range of values that are allowed within the given column. These are not the values allowed by the data type of the underlying column, but rather the values allowed by the business logic of the domain. An example that violates DKNF could be shown as follows with a school report card and grade for students whereby the score is a value between 0 and 100, and the status of FAIL or PASS (FAIL for scores below 50):

Student	Score	Status
James	78	PASS
William	63	PASS
David	48	FAIL
Timothy	57	PASS

From the table above, we can see that it would be possible to enter a value of FAIL in the Status column for the row containing James’ name. The database constraints would not prevent us from doing this, however, we would be violating our business rules that state that Scores greater than or equal to 50 are a PASS status. In order to correct this data so as not to violate DKNF, we would change it as follows by splitting into two tables:

Student	Score
James	78
William	63
David	48
Timothy	57

Status	Minimum Score	Maximum Score
FAIL	0	49
PASS	50	100

By splitting the data, we ensure that business logic is captured and no table data can violate the domain rules. An interesting side-effect of complying with DKNF is that you’ll also comply with 5NF too. The relevance of DKNF, despite being a very different form of normalization than other forms, is that data integrity against business rules can now be expressed and enforced from the database design alone, something that has traditionally been enforced only within application code that is responsible for reading and writing data to and from the database. It should be noted, however, that compliance with DKNF isn’t always possible and depends very much on the business domain.
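
As a rough sketch of what the split design looks like in practice, the following Python uses the built-in sqlite3 module with an in-memory database. SQLite stands in for SQL Server here, and the table and column names are my own; the point is only that the status is derived from the score ranges rather than stored per student, so a contradictory PASS or FAIL cannot be entered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Scores are constrained to the business domain of 0 to 100.
    CREATE TABLE student_score (
        student TEXT PRIMARY KEY,
        score   INTEGER NOT NULL CHECK (score BETWEEN 0 AND 100)
    );
    -- The pass/fail rule lives in data, once, instead of on every row.
    CREATE TABLE status_range (
        status    TEXT PRIMARY KEY,
        min_score INTEGER NOT NULL,
        max_score INTEGER NOT NULL
    );
    INSERT INTO student_score VALUES
        ('James', 78), ('William', 63), ('David', 48), ('Timothy', 57);
    INSERT INTO status_range VALUES ('FAIL', 0, 49), ('PASS', 50, 100);
""")

# The status is looked up from the range table at query time.
report = conn.execute("""
    SELECT s.student, s.score, r.status
    FROM student_score AS s
    JOIN status_range AS r
      ON s.score BETWEEN r.min_score AND r.max_score
    ORDER BY s.student
""").fetchall()

for row in report:
    print(row)
```

There is no status column on the student table to get wrong, and the CHECK constraint rejects a score outside the allowed domain.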

After this, Hugo’s session was complete and it was time for another short coffee break. I quickly grabbed another coffee from one of the numerous catering stands dotted throughout the venue and checked my programme for the Dome that I would need to head towards for the next session. That was Dome 4 and Conor Cunningham’s “SQL Server vNext and SQL Azure – Upcoming Features”.

Conor’s talk was originally intended to be given by Lindsey Allen, however, a scheduling mix-up had resulted in Lindsey being unable to give the presentation. Instead we were provided with some excellent content from Conor Cunningham who is the Principal Software Architect for Microsoft on the SQL Server Query Processor Team. Conor is here to tell us all about the new features that will be coming in the upcoming versions of the on-premise SQL Server product as well as SQL Azure.

Firstly, Conor tells us that both the on-premise SQL Server product and the SQL Azure product share the exact same codebase. SQL Azure has a monthly release cadence and so is always the first product to receive new SQL Server functionality and have that functionality available to the public. On-premise SQL Server currently has a release cadence of approximately 1 year and so receives the same features from SQL Azure in each of its subsequent public releases.

A big feature coming in SQL Server vNext (the official marketing title is not yet decided) is the ability to run it on Linux. This isn’t just a version of SQL Server that’s specially built for Linux, but the exact same binaries that run the Windows version of SQL Server. Microsoft has built an abstraction layer, known as a “PAL” (or Platform Abstraction Layer) which is used to align all operating system or platform specific code in one place and allow the rest of the codebase to stay operating system agnostic. Moreover, SQL Server when run on Linux will effectively be SQL Server running inside a Docker container. Previously, SQL Server has relied on Windows Server Failover Clustering (WSFC) to provide clustering capability to SQL Server, however, as part of the work required to allow SQL Server to run on Linux, this is being abstracted away to allow 3rd party cluster management software to be used. Initially, SQL Server on Linux will support an open-source product called Pacemaker, however more cluster management product support will follow in time.

There have been big improvements within the In-Memory Tables features of SQL Server. These improvements mean that In-Memory tables, which were previously constrained in how they operated compared to normal disk-based tables, will now operate much closer to how standard tables operate, supporting many more features including JSON support, CROSS APPLY and CASE statements, amongst others.

Another major set of improvement work within SQL Server vNext is in the area of ColumnStore indexes. ColumnStore indexes are perhaps one of the best new features to be added to SQL Server in recent years and allow potentially significant performance enhancements for queries on tables using such an index. ColumnStore indexes now have support for BLOB column data types and the index itself is now compressed, reducing space and storage requirements as well as improving performance. Further, rebuilds of ColumnStore indexes will now no longer cause significant blocking of the tables upon which the index is being rebuilt, meaning that users of the database are no longer severely negatively impacted by such rebuilds.

SQL Server vNext also includes advancements in “Adaptive Query Processing”. This is a major new area of functionality in SQL Server and will receive even more improvements in future versions of SQL Server beyond vNext. Adaptive Query Processing is a series of algorithms that work within SQL Server’s Query Processor in order to improve query performance by analysing query plans, SQL Server data and other meta-data. It aims to improve query performance without introducing any query degradation from incorrect query plan optimisations. It does this by dynamically adjusting joins (i.e. switching from hash joins to merge or loop joins, or vice-versa), adjusting memory grants in order to ensure efficient allocation of memory without under or over allocating, and interleaving compilation and execution for the most complex queries in order to maximise their performance.

Another major new feature of SQL Server vNext will be support for Graph Databases. Graph databases are highly specialised databases that have their data in graph structures, using nodes, edges and properties in which to store their data allowing for semantic querying of data. Common applications of graph databases are for querying large graphs of data such as those found inside a social network. Graph data and the ability to efficiently query it makes questions such as “How many friends of Person A are also friends of Person B?” and “Which friends of Person A are also friends of the friends of Person B?” very easy to answer, something a relational database would have difficulty in achieving in an efficient manner. SQL Server vNext’s support for graph databases promises to offer full CRUD support for node and edge creation, query language extensions to allow querying of graph data as well as allowing queries to span both standard relational SQL Server data and graph data at the same time.
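
To give a feel for the kind of question a graph structure answers naturally, here is a small Python sketch that holds a friendship graph as a dictionary of sets. The people and friendships are invented, and this is plain Python rather than SQL Server’s graph syntax, which wasn’t shown in detail.

```python
from collections import defaultdict

# Hypothetical friendships; each pair is an undirected edge between two people.
edges = [("A", "C"), ("A", "D"), ("A", "E"), ("B", "D"), ("B", "E"), ("B", "F"), ("F", "C")]

graph = defaultdict(set)
for left, right in edges:
    graph[left].add(right)
    graph[right].add(left)


def mutual_friends(graph, a, b):
    """Friends that a and b have in common."""
    return graph[a] & graph[b]


def friends_of_friends(graph, person):
    """Everyone exactly reachable via one of person's friends, excluding person."""
    reached = set()
    for friend in graph[person]:
        reached |= graph[friend]
    return reached - {person}


# "How many friends of Person A are also friends of Person B?"
print(len(mutual_friends(graph, "A", "B")))

# "Which friends of Person A are also friends of the friends of Person B?"
print(graph["A"] & friends_of_friends(graph, "B"))
```

Each question is a short walk along edges from a known starting node, whereas the equivalent relational query typically needs one self-join per hop.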

Conor continues his exploration of the new SQL Azure features by telling us about a new feature in SQL Azure that can automatically create indexes for table columns inferred from usage of the column within queries. The maximum database size has also improved, now supporting databases up to 4TB in size. There are also some long-awaited improvements to the syntax of the T-SQL query language itself. There are new string concatenation and aggregation functions as well as a TRIM function (finally!). New Japanese collation families have also been added, and new bulk insert operations have been added to support specific new standards such as RFC 4180 CSV file formats.

After a brief Q&A at the end of Conor’s session it was time for another refreshment/toilet break. I decided I was all coffee’d out by this point, having had around 4 or 5 cups of coffee so far, and it still only being just past 12pm. There was one more session before the break for lunch and so after consulting my conference programme, I headed off to Dome 2 for the somewhat light-hearted session that was Denny Cherry’s “Things You Should Never Do In SQL Server”.

Denny introduced himself first and indicated that this session was to be a bit lighter than other sessions in the day, being a look at some of the things that you should not do rather than the things you should. As such, he reminded us that everything on his slides was wrong!

He started by talking about the enforcement of data integrity in SQL Server and tells us that we really shouldn’t use things like triggers, stored procedures or even application code to enforce data integrity. SQL Server is a fully relational database and we can leverage what SQL Server is good at by designing our schemas to provide such integrity for us. Denny talks about a book that he reviewed as a technical editor, which was so bad that he implored the publisher to completely scrap the book. One of the pearls of wisdom in this book, he says, was the recommendation to use 32-bit editions of SQL Server for “local offices” reserving the 64-bit edition of SQL Server only for large corporations. Don’t do this. We all run on 64 bit operating systems today, so where available, we should always be running 64 bit application software, too. He states that recommendations from third party software vendors should always be questioned, too.

Next up is migrating databases. Something seen quite frequently is the “copy database wizard” to migrate databases from one server to another. This is error prone and simply not as good an option as something like log shipping, which has been around for decades and is a very robust and mature technique for performing migrations. Then we look at the account under which SQL Server will run. Whilst it’s true that very old versions of SQL Server (pre-2005 versions at least) required local administrator privileges in order to run, modern versions of SQL Server do not require such special privileges at all. SQL Server does require some additional permissions above those usually found in a standard local user account, but not many more. Always run with the minimum permissions you need.

Next we look at SQL Profiler. It’s a great tool for debugging issues on a SQL Server instance, but it should never really be connected to your production database. This can negatively impact performance of the database. It’s far better to use it against either a local server or an offline backup or staging server. Moreover, the very latest versions of SQL Server have functionality that SQL Profiler unfortunately doesn’t support.

Denny then moves on to look at the SQL queries we write. He says that it’s really not worth the effort to ensure that SQL is written in a cross-platform manner (i.e. ANSI SQL). Whilst it’ll work, of course, you’re really giving up a lot of functionality and performance improvements that have been built into the platform specific dialects of SQL used on each database platform. Using SQL that is written specifically for the platform you’re targeting will always allow you to write code in the most performant manner. Moreover, it’s incredibly rare to need your SQL written in a cross-platform way, as it’s incredibly rare to actually want to migrate your databases to an entirely new database platform.

Denny then looks at some anti-patterns with data itself. He states that you should always use NULL where there’s an absence of data, and not values such as empty strings, minimum dates (e.g. 01/01/1990) and other “magic” values. He also says that you should never blindly design your database schema to a specific level of normalization. You should always consider the application that the data will support and the required performance of that data and domain design and then design your database schema accordingly. Next we talk about transaction logs. Denny says he’s seen a number of people simply deleting transaction logs in order to reclaim disk space. This is a bad idea, and if you find you really need more disk space, you should simply buy more disk space rather than severely impact your ability to recover your database from crashes by deleting the transaction log. On the subject of transaction logs, Denny reminds us not to use RAID 5 for our disk array that will store the transaction log. RAID 5 is not optimized for write intensive operations – which the transaction log requires – and so the performance will suffer as a result. Also, never ever use AUTO_SHRINK to automatically reclaim disk space. Whilst this does reclaim disk space, the negatives of doing this far outweigh the positives.

Next we look at columns and schema. Denny reminds us to always use the correct and most appropriate data type for our data, and always be aware of the kinds of data we’ll be working with. For example, in the US, the zip code (equivalent to postcode in the UK) is entirely numeric – it’s a 5 digit number. An integer column might seem appropriate here, but some states (Maine, for example) have zip codes that start with a zero and must always be written in 5 digit format with the leading zero intact. Further, zip codes / post codes are not always numeric once you move beyond the USA, so it’s highly likely you’ll need to support alphabetic characters in there too. Also, don’t assume that certain values will never change and therefore use them as a primary key. Some developers have previously used a US Social Security Number as a primary key thinking that it’ll never change, and whilst it very rarely does, it’s not guaranteed not to.
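
The zip code problem is easy to demonstrate. The Python sketch below uses illustrative values only (the UK-style postcode is made up), and the fits_integer_column helper is hypothetical; it simply checks whether a code would survive a round trip through an integer.

```python
def fits_integer_column(code: str) -> bool:
    """True only if storing the code as an integer and reading it back is lossless."""
    return code.isdigit() and str(int(code)) == code


zip_as_text = "04101"                # illustrative 5 digit code with a leading zero
zip_as_int = int(zip_as_text)        # the leading zero is silently dropped
restored = f"{zip_as_int:05d}"       # only recoverable if we know it was 5 digits

print(zip_as_int)
print(restored)

for code in ["90210", "04101", "AB1 2CD"]:   # "AB1 2CD" is a made-up UK-style code
    print(code, fits_integer_column(code))
```

Padding on the way out rescues the US case, but only because the width is fixed and known; a text column avoids the issue and copes with alphanumeric codes as well.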

Some developers believe that views will improve performance, however, they really don’t. And don’t ever be tempted to use nested views as they are considered evil and incredibly difficult to debug. Don’t require RDP access to a SQL Server in order to run queries against it; it’s not only a security risk, but it’s simply not required. In thinking about the permissions that you can grant to users and other objects in SQL Server, always ensure that you grant the minimum amount of permissions required. Also, don’t ever revoke permissions from the built-in database roles, such as the public role. Denny talks about a time that an over-zealous auditor at a client insisted that certain permissions be removed from the public database role (this was a permission allowing access to the underlying Windows Registry). When revoked, this caused users within the public role to be entirely unable to log in to the SQL Server due to SQL Server’s own requirement to access the Windows Registry when a user logs in! And with that in mind, like third-party vendors, don’t ever listen to external auditors about what permissions you should or should not assign/revoke on your SQL Server.

After Denny’s session, it was time for lunch. All of the attendees gathered in the main hall and headed towards one of the 4 main catering points throughout the venue. As a vegetarian, there was a rather nice option which was Butternut squash & Sage tortellini with roasted Mediterranean vegetables. This was followed by either a Strawberry bavarois or a warm chocolate brownie with toffee sauce. Well, I couldn’t resist the chocolate brownie, so after collecting my meal along with some freshly squeezed orange juice to wash it all down, I found an empty spot at one of the many tables and ate my lunch.

After my lunch I decided to take a stroll around the grounds of the International Centre as it had become such a lovely sunny day outside. The morning’s sessions had been great but intensive, so this gentle stroll allowed me to clear my head, get some fresh air, enjoy the sunshine and put myself back in the right frame of mind required for the final two sessions of the day. I made my way back inside the venue as the lunch break drew to a close and once again consulted my programme to determine the correct dome for the next session. This time it was Dome 6 for Richard Douglas’s “Understanding the Transaction Log”. After taking my seat, the speaker announced that he wasn’t actually Richard at all, as unfortunately, Richard was suffering from food poisoning so this talk was to be given by one of Richard’s colleagues, John Martin.

John starts by saying that the transaction log is not just all about backups. There are numerous and varied uses for the transaction log, and it’s integral to a well-running and performant SQL Server. John talks about the three recovery models for the transaction log. There’s “Simple”, “Full” and “Bulk-logged” modes and John is keen to point out that even when using Simple mode, everything is still logged to the transaction log. There may not be as much detail as could be found if using Full or Bulk-logged modes, but everything is still there. One of the golden rules is to only ever have one transaction log file for each database. You can have numerous data files, however, transaction logs should be kept within a single file. This improves performance and makes recovery and backups much easier. John reminds us what the various modes do. Simple mode is effectively auto-maintenance and auto-shrinking of our log – SQL Server will take care of all of this itself. This may or may not be a good thing depending upon your use case. In Full mode, SQL Server will not perform any of its own maintenance or shrinking at all. You’re responsible for doing this. This is very often the better choice as you will know your own use cases better than SQL Server does and so can schedule such maintenance for the most convenient and appropriate times.

John reminds us of the set of operations within SQL Server that are “minimally logged” when operating under the Simple or Bulk-Logged recovery models. Many common operations, such as SELECT * INTO and the CREATE, ALTER and DROP of indexes, are minimally logged in the transaction log, and this minimal logging impacts your ability to perform a complete “point-in-time” restore of your database in the event of a crash.

After this we look at the general process by which a standard SELECT or INSERT statement is processed by SQL Server. We can see that there are a lot of moving parts to the entire process flow, and that the transaction log is central to ensuring that SQL Server is able to provide the D (Durability) from the ACID properties that we require from our database engine. John reminds us that it’s not until our data is fully persisted to the transaction log file that SQL Server considers the data committed to the database, as it’s only after this step that the data could be recovered in the event of a server crash.

John moves on to talk about the internals of the transaction log and how SQL Server uses the space within the log file. Transaction log files are split into multiple VLFs (Virtual Log Files). These VLFs exist within the same single physical file on disk. A VLF can be in either an active or inactive state, and at any given time, there’s always at least one active VLF. SQL Server will manage the creation of new VLFs as the transaction log grows over time, but it’s possible to control some of how VLFs are created yourself. The creation of too many VLFs within a file is thought to be a bad thing, and there are various discussions about how best to manage this. Ordinarily, each chunk added to the transaction log file is divided into either 4, 8 or 16 VLFs, with each VLF sized accordingly based upon the “chunk” size (chunks are the increments by which the physical transaction log file grows on disk).
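
As a rough illustration of the 4, 8 or 16 rule, here is a minimal Python sketch. The 64MB and 1GB thresholds were not given in John’s talk; they are the figures commonly cited for older versions of SQL Server, and both the exact boundaries and the function names here are my own assumptions.

```python
MB = 1024 * 1024
GB = 1024 * MB


def vlfs_for_growth(chunk_bytes: int) -> int:
    """Return how many VLFs one growth chunk is split into.

    Assumed thresholds (not from the talk): under 64MB gives 4 VLFs,
    64MB up to 1GB gives 8, and anything larger gives 16.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk size must be positive")
    if chunk_bytes < 64 * MB:
        return 4
    if chunk_bytes <= 1 * GB:
        return 8
    return 16


def vlf_size_bytes(chunk_bytes: int) -> float:
    """Each VLF gets an equal share of the chunk it was created from."""
    return chunk_bytes / vlfs_for_growth(chunk_bytes)


if __name__ == "__main__":
    for chunk in (10 * MB, 512 * MB, 8 * GB):
        count = vlfs_for_growth(chunk)
        size_mb = vlf_size_bytes(chunk) / MB
        print(f"{chunk // MB}MB chunk: {count} VLFs of {size_mb:.1f}MB each")
```

Under these assumed thresholds, a log that creeps up to 1GB in 1MB increments ends up with roughly 4,000 tiny VLFs, whereas a single 1GB chunk yields only 8, which is why the growth increment matters.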

We look at the makeup of a VLF and John tells us that within a VLF there are many “log blocks”. Log blocks are between 512 bytes and 60KB in size, and a VLF will have as many log blocks as are required to fill the VLF space. John states that it’s better to try to keep the log blocks at the largest possible size of 60KB, as performance is better with fewer, larger log blocks than with many smaller ones. Inside the log block are the individual log records. Each log record is identified by a unique Log Sequence Number (LSN). The log sequence number is made up of the VLF sequence number, the log block number and the log record number.
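
To make the three-part structure of an LSN concrete, here is a small Python sketch. The class and field names are hypothetical, and the colon-separated hexadecimal rendering is an assumption modelled on how LSNs are usually displayed rather than something shown in the session.

```python
from typing import NamedTuple


class LSN(NamedTuple):
    """A log sequence number as three parts, most significant first."""

    vlf_seq: int      # sequence number of the VLF
    log_block: int    # log block within that VLF
    record: int       # log record within that block

    def __str__(self) -> str:
        # Assumed display format: 8, 8 and 4 hex digits separated by colons.
        return f"{self.vlf_seq:08x}:{self.log_block:08x}:{self.record:04x}"


if __name__ == "__main__":
    records = [LSN(21, 0, 1), LSN(20, 97, 2), LSN(20, 97, 1)]
    # Tuples compare field by field, so sorting gives log order.
    for lsn in sorted(records):
        print(lsn)
```

Because the VLF sequence number is compared first, then the block, then the record, a later LSN always sorts after an earlier one, which is what lets LSNs act as an ordering of everything written to the log.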

John talks about how the transaction log grows over time. First we look at log growth when using the “Simple” recovery mode. SQL Server will periodically instigate a “checkpoint”. This is the flushing of dirty pages of data within memory to disk, and any VLFs that have no open transactions within them are cleared down. Note that this does not reclaim any disk space from the VLF, however. As additional log records are added, SQL Server moves through the VLFs within the file looking for space to write them, primarily looking for currently inactive VLFs that have been previously cleared and reusing those. If we reach the end of the file, we “wrap around” to the beginning of the file, searching for inactive VLFs where we can write the transactions. If no inactive VLFs are found, we must increase the size of the physical transaction log file. This negatively impacts the server, which has to pause all activity whilst the physical file is extended on disk. Then we look at log growth with the “Full” or “Bulk-Logged” recovery modes, which is almost the same as log growth with the Simple recovery mode, but instead of a checkpoint clearing down the VLFs, we have a transaction log backup doing so, which ensures we have much more transaction data available for a full recovery of the state of the server if required.
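
The wrap-around behaviour described above can be sketched as a toy Python model. This is only an illustration under heavy simplifying assumptions: the class and method names are hypothetical, VLFs hold a fixed number of records, growth adds a single VLF, and clearing assumes no open transactions remain in the older VLFs.

```python
class CircularLog:
    """Toy model of a transaction log file made of reusable VLFs."""

    def __init__(self, vlf_count: int = 4, records_per_vlf: int = 3) -> None:
        self.records_per_vlf = records_per_vlf
        self.vlfs = [{"used": 0, "active": False} for _ in range(vlf_count)]
        self.current = 0                  # index of the VLF being written
        self.vlfs[0]["active"] = True     # there is always one active VLF
        self.growths = 0                  # times the "file" had to grow

    def write(self) -> None:
        """Write one log record, moving to another VLF when this one is full."""
        vlf = self.vlfs[self.current]
        if vlf["used"] == self.records_per_vlf:
            self.current = self._next_inactive()
            vlf = self.vlfs[self.current]
            vlf["active"] = True
            vlf["used"] = 0               # reuse overwrites the old contents
        vlf["used"] += 1

    def _next_inactive(self) -> int:
        """Scan forward, wrapping around, for an inactive VLF; grow if none."""
        count = len(self.vlfs)
        for step in range(1, count):
            index = (self.current + step) % count
            if not self.vlfs[index]["active"]:
                return index
        self.vlfs.append({"used": 0, "active": False})
        self.growths += 1
        return len(self.vlfs) - 1

    def clear_inactive(self) -> None:
        """Stand-in for a checkpoint (Simple) or log backup (Full).

        Marks every VLF except the current one as reusable. The file
        keeps its size; no space is handed back.
        """
        for index, vlf in enumerate(self.vlfs):
            if index != self.current:
                vlf["active"] = False


if __name__ == "__main__":
    log = CircularLog(vlf_count=2, records_per_vlf=2)
    for _ in range(4):
        log.write()
    log.clear_inactive()
    log.write()
    print("after clearing:", log.current, len(log.vlfs), log.growths)
    for _ in range(2):
        log.write()
    print("without clearing:", log.current, len(log.vlfs), log.growths)
```

In this model, regular clearing lets the writer wrap back to the first VLF without the file growing, while skipping it forces a growth as soon as every VLF is still active.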

At this point, John talks about how we can make things go faster with SQL Server by improving transaction log performance. We start with a good I/O architecture: consider the balance of reads and writes of your data and select an appropriate RAID strategy that’s optimised for your own use case, and remember that a bigger RAID cache is always better. Within the architecture of your SQL Server itself, it’s best if you can determine the actual required size of your physical transaction log file at the beginning. This is difficult to predict, but if you can get close, your server will perform better as a result. We should be aware of such things as page splits, and particularly the “evil” variety, as not all page splits are bad. These bad splits can have serious negative performance impacts. John also cautions against using “delayed durability”, which is effectively asynchronous transaction log writes. This causes the server to consider data persisted even though it’s not yet fully written to disk. Depending upon your application, delayed durability could be OK, but if your system must never lose a single transaction then don’t use it. One time when it might be appropriate to temporarily enable delayed durability is during large-scale purging of data from the database. John also tells us to keep an eye on our indexes. Each index is a discrete data structure that must be updated and persisted to disk individually, so having too many of them can hurt performance.

Finally, John tells us about monitoring the SQL Server transaction log file. We can use external tools such as PERFMON for this, as well as built-in SQL Server dynamic management objects such as the sys.dm_io_virtual_file_stats function. Regularly reviewing these monitors and logs can help identify bottlenecks within the server and highlight the areas where issues may arise.

After John’s session, it was time for the final afternoon coffee break, which this time was accompanied by a rather nice selection of cakes! After a cup of coffee and a cake or two (the scones and carrot cake were particularly nice!) it was time to consult the conference programme one last time to determine the dome for the final session of the day.

This time it was Dome 3 for Emanuele Zanchettin’s “Performance Tips For Faster SQL Queries”.

Emanuele starts his session by talking about debugging. He reminds us that debugging database queries often starts at the application layer, with developers digging through C# code, but he states that debugging sometimes stops there too. We need to be aware that we often need to debug down into SQL Server itself. With that, Emanuele asks the attendees if they’d rather have a talk full of slides or a talk full of real-world demos. The attendees unanimously vote for demos, so from here Emanuele opens up SQL Server Management Studio and begins to show us some T-SQL code.

He first creates a demo database which includes at least one table of over 2 million rows. He writes some simple SELECT queries that contain a few joined tables. We run the queries and see that they execute in a very short space of time. However, in looking at the execution plan generated for our query, we can see that we could make the query perform even better. So Emanuele’s first tip is to make sure you always check the execution plans generated for your queries, as they can often indicate where an additional index on the source tables would greatly improve query performance.

Unfortunately, it was at this point that I had to leave Emanuele’s session in order to make an early start on my journey back home. I missed the end of Emanuele’s session as well as the final conference wrap-up and prize giving session, but I’d had a fantastic day at another incredibly well-run, well-organised and very informative SQLBits event.

As ever, I can’t wait until the next one!
