SQLBits 2019 In Review

Saturday 09 Mar 2019 at 21:00
Conferences  |  conferences development sql database powerbi analytics

On Saturday 2nd March, the annual SQLBits conference held its free community day, this year at the Central Convention Centre in Manchester.


The theme for this year's SQLBits was "Roaring '20s" and the convention centre was indeed fitted out in impressive decor to reflect that theme.


Having recently started a new role that involved extensive use of Azure, Big Data and Analytics, the line-up of talks for the Saturday of SQLBits was very appealing indeed. I made my way to Manchester from Liverpool by train at a completely ungodly hour of the morning in order to arrive at the convention centre in time for registration and breakfast at 7:30am!


After registering and collecting my badge and programme, I settled down at one of the tables to double check the sessions I'd earmarked to attend, eat my breakfast and drink my coffee whilst waiting for the first session. This first session was to be Alberto Ferrari's Aggregations In Power BI.


Alberto runs sqlbi.com and has written The Definitive Guide To DAX book. Alberto tells us that he's going to talk about PowerBI's aggregations feature and how it works. He says we're going to build a sample model and aggregation from scratch to demonstrate this. We'll take a look at DirectQuery, Import and Composite models inside PowerBI, then get an introduction to aggregates, how to detect when our queries actively use our aggregates, and why and when aggregates may not work for us.

Alberto starts by saying that we'll be using a custom version of the Contoso SQL Server sample database and we'll use PowerBI Desktop to build the initial models. First, we start off with a DirectQuery model, which means the PowerBI query is translated to a SQL Server query, so we're querying SQL Server directly. This approach is convenient but the downside is speed - it's slow.

Alberto builds a simple model for products grouped by brand with a sum of quantity. He shows how changes to this model alter the DirectQuery and re-run the underlying SQL query, and how, as the underlying data grows, performance suffers. This is where aggregations come in.

One way to create an Aggregation is to simply perform a SQL Server query with the necessary groupings and save the query results into a new table, then use a PowerBI DirectQuery against that new table. We can then tell PowerBI, via the Manage Aggregations feature, to use this new aggregated SQL table when a PowerBI query is set to use the various groupings that are also used in the underlying SQL table.
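This idea can be sketched with SQLite standing in for SQL Server (the table and column names here are invented for illustration, not the real Contoso schema):

```python
import sqlite3

# A hypothetical miniature sales table; in Alberto's demo this would be a
# large table in the Contoso SQL Server sample database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Brand TEXT, Quantity INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?)",
                 [("Contoso", 5), ("Litware", 3), ("Contoso", 2)])

# Pre-compute the grouping once and persist the results as their own table;
# a DirectQuery pointed at Sales_Agg then scans far fewer rows than one
# pointed at the full Sales table.
conn.execute("""
    CREATE TABLE Sales_Agg AS
    SELECT Brand, SUM(Quantity) AS TotalQuantity
    FROM Sales
    GROUP BY Brand
""")
print(dict(conn.execute("SELECT * FROM Sales_Agg")))
```

The Manage Aggregations step then maps each column of the small table back to the grouping columns of the large one.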

Alberto shows us how we can determine if our PowerBI queries are actively using the aggregations when the PowerBI queries are being run. Unfortunately, PowerBI isn't the correct tool for this job, but another tool, DAX Studio, is far more effective. DAX Studio's "Server Timings" feature gives not only the timings on the server but also the underlying queries and sub-queries that are run to satisfy the PowerBI query.

We move back to PowerBI Desktop and look at how we can actually configure usage of aggregation tables. In order to use the custom-built aggregation tables from SQL Server, we of course need to "load" them into our PowerBI model. We can then build a relationship in the PowerBI model that allows a DAX query to be grouped by a column that isn't in the aggregation table itself, but is in a table linked to it.

Alberto tells us that aggregations in PowerBI can't be used to speed up calculated columns in the DAX query. We need to build the calculation in the underlying query in SQL Server and then build the aggregation query on top of that to store the results in the aggregation table. Only then can it be used in the DAX query as a standard column.

We look at some potential "gotchas" with PowerBI aggregations, firstly looking at the Storage Mode for the various tables in the PowerBI model. If we try to set the Storage Mode of our aggregate tables to "Import", we can potentially introduce "weak references", which negatively impact performance. A far better approach is to use the "Dual" Storage Mode to import some of our aggregated data into PowerBI's model (i.e. the VertiPaq storage engine) whilst some data remains in SQL Server.

One problem when building PowerBI queries is that you can't perform a query that works across data in different places, for example, a table stored in PowerBI's VertiPaq storage engine and a linked DirectQuery table whose data is in SQL Server. This is an example of a "weak reference" that we looked at earlier, and this is where the "Dual" mode of storage can help. Dual mode imports the linked data into the VertiPaq engine but maintains a connection to the underlying SQL Server data store. This allows PowerBI models and queries to span data that exists both inside of PowerBI and outside in a SQL Server database.

Alberto explains how, when there are multiple aggregate candidates that can satisfy a DAX query, you will need to explicitly tell PowerBI which candidate it should use. Having to perform this manual configuration step helps you to decide whether to use imported or external (DirectQuery) aggregations.

Finally, Alberto tells us that another potential "gotcha" is that data that has been imported into a PowerBI model is not kept in sync with any underlying SQL Server tables and so, if using imported tables or the Dual mode of storage, we must be careful to manually refresh that data to ensure we're not working with stale data.

In summary, aggregations in PowerBI are powerful optimization tools. They work particularly well with large SQL Server tables or tables with large cardinality columns, but they do require careful consideration when configuring with regard to the storage mode of the data.


After a quick coffee break, the next session was Brett Cooper's SQL Developer's Guide To Open Source Analytic workloads on HDInsight.

Brett's talk is aimed at SQL Server developers, from the perspective of helping them understand the possibilities of open source software and how it could help them achieve their goals. Specifically, the talk will focus on Apache Hadoop and the wider Azure HDInsight group of technologies.

Brett starts by asking us to consider why we should use something other than SQL Server; after all, SQL Server is a mature product and well understood. One good reason is that SQL Server is very difficult to scale in a cloud environment, whereas many open source tools are naturally distributed and much easier to scale. They are also very portable across environments and operating systems, have an excellent price point for adoption and usually separate the compute and storage requirements well, making scaling both up and out very easy.

Azure's HDInsight combines Apache Hadoop (for distributed data processing), Apache Kafka (for data streaming) and Apache Spark (for real-time data analytics), along with other services, into a complete end-to-end streaming and data analytics platform.

With Azure HDInsight, we can use a schema-on-read approach, meaning that data is not validated against a given schema when writing to storage, but is validated against the required schema at the time of reading. This gives great flexibility, making processes resilient to changes in source data, and makes versioning of data over time much easier.
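As a rough sketch of schema-on-read (the schema and field names are invented), the raw text is stored untouched and a reader-side schema is applied, and can be swapped, at query time:

```python
import csv, io

# Raw data is written to storage as-is; nothing validates it at write time.
raw = "id,amount,region\n1,10.5,EMEA\n2,oops,APAC\n"

# A hypothetical reader-side schema: validation happens only on read, so the
# same stored bytes can be read with a newer schema later.
schema = {"id": int, "amount": float, "region": str}

def read_with_schema(text, schema):
    good, bad = [], []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            good.append({k: schema[k](v) for k, v in row.items()})
        except ValueError:
            bad.append(row)   # quarantine rows that don't fit today's schema
    return good, bad

good, bad = read_with_schema(raw, schema)
print(len(good), "valid,", len(bad), "quarantined")
```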

Brett tells us how the open source tool-chain is rich and broad in scope too. We have largely free choice of IDE, database connectivity tools for various platforms and frameworks will generally work, and the continuous integration and delivery pipelines are well defined. There's rich functionality for data management and data science, and many tools have been integrated together, but the lineage of tools can be a problem for corporate IT departments, who might prefer commercial tools from known large vendors.

Brett starts a demonstration by looking at the HDInsight cluster architecture. He tells us there are three types of nodes: Gateway, Head and Edge nodes. The first two are not generally consumer facing but are used by Azure itself to perform management and security, and Hadoop processing services, respectively. The edge nodes allow users to access the cluster and are lightweight Linux virtual machines with the necessary client tools pre-installed. There are also worker nodes and ZooKeeper nodes within the cluster, which perform the actual data analysis and deal with leader elections for master services on the head nodes. HDInsight also contains RevoScaleR Servers to provide machine learning capabilities within a cluster.

Data storage for HDInsight is backed by either Azure Blob Storage or Data Lake Storage. Brett says it's likely that you'll end up using both types of storage in an HDInsight solution, and especially Data Lake Storage for large-scale compute over the dataset being processed.

We move on to look at Spark structured streaming, which provides "exactly once" processing for events. Exactly once comes at a cost, however: Spark operates with high latency, although it does support numerous languages such as Python, R and others. An alternative approach is Apache Storm, which offers low-latency processing in Java, but only "at least once" delivery of messages.

For orchestration of the data pipeline, we can use Azure Data Factory; however, this is not open source. One alternative is Apache Airflow, which is open source, or even Apache Oozie, which is also open source but is unfortunately losing traction in the community due to its use of XML for configuration.

Whilst IDE choices are varied, the IDEs that have the best out-of-the-box support for the HDInsight technology stack are Visual Studio Code and Apache Zeppelin.

Brett moves on to show us the portion of the Azure user interface that allows us access to the HDInsight tools, named Ambari - yet another Apache software project. Brett shows us a demo of creating and working with a complete HDInsight cluster using Visual Studio Code. There's an extension to VS Code called Azure HDInsight Tools that allows for working with HDInsight and can interactively create and query data. We see how simple it is to load a CSV dataset of some New York City taxi data into a file in Azure Blob Storage using VS Code. Brett then shows us how a simple query composed in VS Code's editor can be immediately invoked against that data, displaying the results in a new window. Brett tells us that the query language isn't ANSI SQL, but is far closer to the syntax of MySQL.

Brett then continues by showing us HQL, which is the Hive Query Language used by Apache Hive which sits on top of Apache Hadoop and provides the data querying and analysis services. Brett uses HQL to construct a query that creates a database with a schema based upon the columns contained within the previously loaded New York Taxi CSV file.

We're reminded of the importance of optimizing the data contained within the data store. Such optimization can keep bandwidth requirements down and greatly improve IOPS, improving processing. In Brett's own example, optimizing the dataset reduced the required storage from 42GB down to 25GB. Optimizing data entails reading the raw data from the data store, performing some processing (usually aggregations, deduplications etc.) and writing the results back to a new area of the data store, before finally reading the data back in for querying. Brett tells us that this whole sequence of operations is far more efficient with Data Lake Storage than it is with Blob Storage, as the final step of writing and re-reading for querying happens only as a metadata operation in Data Lake Storage, reducing the need to transfer the dataset to and from persistent storage and eradicating the final read of the data.
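The optimize step Brett describes can be sketched as a read-process-write loop (the trip data here is made up):

```python
# Sketch of the optimize step: read raw rows, deduplicate, aggregate, and
# write the (much smaller) result back for querying.
raw_rows = [
    ("trip-1", "2019-03-02", 12.5),
    ("trip-1", "2019-03-02", 12.5),   # the same record landed twice
    ("trip-2", "2019-03-02", 4.0),
]

deduped = list(dict.fromkeys(raw_rows))   # drop exact duplicates, keep order

daily_total = {}
for trip_id, day, fare in deduped:        # aggregate fares per day
    daily_total[day] = daily_total.get(day, 0.0) + fare

print(len(raw_rows), "raw rows ->", len(deduped), "rows;", daily_total)
```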

Finally, Brett tells us that the LLAP (Live Long and Process) functionality that was added in Hive 2.0 can provide extensive caching of our data, meaning that queries don't always need to go back to storage to retrieve data.

After another coffee break, it was time for the last of the morning's sessions. This one was Joe Plumb's Getting started with real-time analytics with Kafka and Spark in Microsoft Azure.


Joe starts by saying that the alternative title for his talk was going to be "Everything I know about real time analytics in MS Azure" and that the agenda for his talk is based around his own learnings with real time analytics. The agenda will cover the fundamentals of streaming data and the various options available in Azure to process that streaming data along with some demos.

We start with a definition of streaming data. Streaming data is data that is continually generated from different sources. We also define what is meant by a streaming system: this is the system that ingests and processes the streaming data.

Joe gives us an overview of the available options for dealing with streaming data in Azure. For data ingestion we have Event Hubs, Apache Kafka and the IoT hubs. For processing, we have Stream Analytics, Storm and Spark. For data storage we have SQL Data Warehouse, Spark SQL, Apache HBase and Apache Hive.

All of these streaming systems work with Events or Messages. Joe asks us to consider the difference between an event and a message and says that the difference is really just semantics, but that message is the more generic, catch-all term.

We then move on to look at the difference between event time and processing time. This defines the difference between the actual time that an event occurs (Event time) versus the time that the streaming system becomes aware of an event (Processing time). In an ideal world, these times would be the same however in the real-world, there can be a large amount of skew between them.

Joe talks about event buffers. When processing events, we must buffer them as they arrive within the system. For the event buffer, we use "watermarks" to indicate which events have been processed and which events are buffered and not yet processed. A watermark can be a strict guarantee of processing, known as a perfect watermark, or it can be imperfect, known as a heuristic watermark, which uses algorithms to determine a "best guess" of the events still left to process.
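A heuristic watermark can be sketched in a few lines, assuming events arrive at most a fixed number of seconds late (the five-second bound is an invented example):

```python
# Toy heuristic watermark: assume events arrive at most MAX_DELAY seconds
# late, so the watermark trails the newest event time seen by that amount.
MAX_DELAY = 5   # assumed maximum lateness, in seconds (invented figure)

def watermark(event_times, max_delay=MAX_DELAY):
    """Events at or before this time are guessed to be fully processed."""
    return max(event_times) - max_delay

seen = [100, 103, 101, 107]               # event times observed so far
wm = watermark(seen)
complete = [t for t in seen if t <= wm]   # events the watermark has passed
print(wm, complete)
```

Because the bound is only a guess, events later than MAX_DELAY would slip past the watermark, which is exactly why it is heuristic rather than perfect.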

Next, we look at "windowing" on the event stream. Windowing is a way of partitioning, or batching, the incoming event stream. There are a number of different types of window that we can use, and each type suits a different purpose.

The first type of window is a Tumbling Window, in which a window of a given size is placed over the event stream and moves forward over time. Tumbling windows don't overlap with each other and no event is repeated within more than one window. Tumbling windows can be used to answer a question such as "Tell me how many things occur every 10 seconds".

Next is a Hopping Window, which is similar to a tumbling window, except that each temporal window can overlap with another, meaning that single events can exist in multiple windows. Hopping windows have a window size and an additional parameter for how much they "hop" by with each window. Hopping windows can answer questions such as "Every 5 seconds, tell me how many things occurred in the last 10 seconds".

Sliding Windows are similar to hopping windows and can overlap, containing the same event within multiple windows, but rather than moving forward at fixed time intervals, they occur only in response to events. This means that every sliding window will always contain at least one event. They help answer questions such as "Give me the count of things that have happened at least 10 times in the last 10 seconds".

Finally, there's Session Windows. These group events that occur at similar times, filtering out periods of time where there are no events. They help to answer questions such as "Tell me the count of things that occurred within 5 minutes of each other".
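The first two window types can be sketched as functions mapping an event timestamp to the window(s) it belongs to (sizes in seconds; the numbers are illustrative):

```python
# Assign an event timestamp (in seconds) to the window(s) that contain it.
def tumbling(ts, size):
    """Non-overlapping windows: each event lands in exactly one window."""
    start = (ts // size) * size
    return [(start, start + size)]

def hopping(ts, size, hop):
    """Overlapping windows: one event can land in several (starts >= 0 only)."""
    wins = []
    start = ((ts - size) // hop + 1) * hop   # earliest window containing ts
    while start <= ts:
        if 0 <= start <= ts < start + size:
            wins.append((start, start + size))
        start += hop
    return wins

print(tumbling(11, 10))    # event at t=11 -> window [10, 20)
print(hopping(11, 10, 5))  # event at t=11 -> windows [5, 15) and [10, 20)
```

Counting events per window then answers questions like the ones above: group the assigned windows and tally.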

Joe talks about the concept of triggers. Triggers are intrinsically linked to the windowing capabilities and determine when processing on a given window starts. We see that we can have repeated update triggers which fire multiple times in a given window as events within the window occur. There are also completeness triggers, which fire once when the window is determined to be complete.

Next, we move on to look at delivery guarantees of messages within the streaming system. There are a few different guarantees available, each placing different constraints on other parts of the system.

Exactly Once guarantees that a message will be delivered downstream in the processing pipeline exactly one time. Kafka can offer this kind of delivery guarantee, but this kind of guarantee is often difficult to achieve and requires close cooperation between both the message broker and the consumer. This increases latency and decreases message throughput. At Least Once is a common delivery guarantee which is easier to achieve than exactly once delivery but requires that consumers can appropriately deal with potentially receiving duplicate messages. At Most Once is another delivery guarantee mechanism offered, but like Exactly Once, requires cooperation between sender and receiver of messages.
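A common way for a consumer to cope with at-least-once delivery is to make processing idempotent by tracking message IDs, sketched here (the message shape is invented):

```python
# At-least-once delivery means duplicates can arrive, so this consumer makes
# processing idempotent by remembering the message IDs it has handled.
processed_ids = set()
results = []

def handle(message):
    if message["id"] in processed_ids:   # duplicate redelivery: skip it
        return
    processed_ids.add(message["id"])
    results.append(message["payload"])

for msg in [{"id": 1, "payload": "a"},
            {"id": 2, "payload": "b"},
            {"id": 1, "payload": "a"}]:  # the broker redelivered message 1
    handle(msg)

print(results)
```

In a real system the seen-ID set would need to be bounded or persisted, but the principle is the same: duplicates become harmless no-ops.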

Joe then tells us about some of the options within Azure for streaming data and streaming processing. We first look at Event Hubs, which are a fully managed PaaS (Platform-As-A-Service) event ingestion service. Although it's a service specific to Azure, it can operate with API compatibility with Apache Kafka in order to capture data into a data store, usually Azure Blob Storage or Azure Data Lake Storage.

Stream Analytics is another Azure PaaS which provides event processing, allowing the examination of high volumes of data from an Event Hub. Event Hubs and Stream Analytics are often referred to as two sides of the same coin and work very well together.

Azure HDInsight is another service, but this is a managed version of the Apache Hadoop and Spark services. It also includes Apache Kafka for event ingestion, along with Apache Storm, which is the open source equivalent of Azure's Stream Analytics. Azure also includes a service known as Azure Databricks, which is effectively Microsoft's own version of Apache Spark.

Joe then shows us a quick demo capturing user interactions on a webpage (Joe calls them "wows" - similar to likes or claps). These are API calls from the web page that generate real-time events. Joe starts with a Python Flask application which serves as a mock application generating lots of "wow" events for us by continually hitting the API endpoint, generating lots of events in a short space of time. Joe then switches to the Azure Portal and to an Event Hub which is ingesting all of those events. We then see how a Stream Analytics job is processing those events as they arrive in real-time. Finally, Joe shows us PowerBI Desktop, which has the ability to use a streaming event source as its input data. Joe has created a simple report with a graph showing the "wows" over time and we can see how this report is updated in near real-time as Joe generates more and more events.
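A stdlib-only stand-in for the kind of "wow" event payload Joe's Flask app generates might look like this (all field names are invented for illustration):

```python
import json, random, time

# Stdlib-only stand-in for Joe's Flask "wow" generator: build the sort of
# JSON event an ingestion endpoint such as an Event Hub might receive.
def make_wow_event(page_id):
    return json.dumps({
        "eventType": "wow",
        "pageId": page_id,
        "eventTime": time.time(),   # event time, stamped at the source
    })

events = [make_wow_event(random.choice(["home", "blog"])) for _ in range(3)]
print(len(events), "events generated")
```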

After another coffee break, it's time for the last session before lunch. This one is Kasper de Jonge's Microsoft PowerBI Premium: Building Enterprise-grade models.


Kasper starts by explaining some of the differences between standard and premium PowerBI offerings. Standard is free and contains 1GB of storage for datasets, whilst Premium has a cost attached, but contains 10GB of storage along with additional features, such as allowing the usage of larger source datasets of 400GB compressed, and even larger if you're also using Azure. You can also use SQL Server Management Studio, SQL Server Profiler and SQL Server Data Tools for deployment against PowerBI workspaces, as well as using the Tabular Model Scripting Language which enables rich functionality for data refreshing and administrative functions.

PowerBI Premium exposes its datasets via a powerbi:// protocol, which allows you to use external tools such as SQL Server Management Studio and Microsoft Excel to connect to the PowerBI data. This provides a powerful combination of reporting capabilities over your PowerBI model.

PowerBI premium allows for an incremental refresh of data. This affects data in the PowerBI service, not the data inside PowerBI desktop. Incremental refresh relies on an underlying data source that has a date column. We can configure incremental refresh by giving it a start and end date and also define a smaller time window that will refresh. For example, we can tell it to display data from the last 10 years, but to refresh the data that's from the last 5 days. PowerBI achieves this by automatically creating partitions based upon date ranges within the data source for the required levels of granularity. PowerBI can then work to update these partitions independently of other partitions and these partitions are also automatically maintained over time.
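The partitioning behind incremental refresh can be sketched like this, reusing Kasper's 10-year/5-day example (the partition-per-day granularity is an assumption for illustration):

```python
from datetime import date, timedelta

today = date(2019, 3, 2)
keep_from = today - timedelta(days=365 * 10)   # display window: last 10 years
refresh_from = today - timedelta(days=5)       # refresh window: last 5 days

# One partition per day for the last 30 days of data (granularity assumed).
partitions = [today - timedelta(days=d) for d in range(30)]

kept = [p for p in partitions if p >= keep_from]          # still displayed
fresh = [p for p in kept if refresh_from <= p <= today]   # re-queried
stale = [p for p in kept if p < refresh_from]             # left untouched
print(len(fresh), "partitions refreshed,", len(stale), "left as-is")
```

Only the small set of recent partitions is re-queried on each refresh; the rest are served from what was previously imported.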

Kasper then moves on to look at a modern Data Warehouse on Azure with PowerBI Premium. He explains how reporting over data sources with literally billions, or even trillions of rows of data is entirely possible, and shares with us a link to a Microsoft demo, the TrillionRowDemo. Kasper tells us that the trillion rows of data consume 0.25 petabytes and are housed within a Spark cluster. The data is a series of latitude and longitude updates for a large number of delivery drivers for a courier company.

We are shown how we can build a PowerBI desktop set of reports using various columns from the dataset including location, driver name, total distance driven, and how we can slice and aggregate this data in different ways. Changes to these groupings and filters are processed instantaneously even though we're processing 1 trillion rows of data.

We continue by looking at Imported versus DirectQuery data. For big models and data freshness, we'll need DirectQuery. This gives us access to the underlying data source, but comes at the price of performance. If we're optimizing for performance, however, we should use Imported data. Kasper tells us about a couple of new features: Composite Models and Aggregations. We heard all about Aggregations in Alberto's talk earlier in the morning, and the Composite Model is similar to the Dual Storage mode that Alberto had discussed, allowing multiple types of data connections within a single report.

Kasper asks why we would even need PowerBI for such composite data functionality. Why not query the Spark cluster directly? Well, it's very costly to directly query data of such volume from the backend Spark cluster. PowerBI acts largely as a cache for approximately 80% of the use cases of the data, thereby vastly reducing the cost of processing.

Finally, we take a quick look at multi-dimensional versus tabular data. This includes ragged hierarchies and aggregates/composite models. We're shown a new feature, currently only in CTP (Community Technology Preview), called Calculation Groups, which allows the reuse of DAX calculations and can group multiple calculated values together in their own group or table of data.

After Kasper's session it was time for lunch. Lunch at SQLBits is always excellent and this year was no different, with a great variety and quantity of food on offer to suit all dietary requirements.



The wild mushroom and spinach risotto was so delicious, I may even have gone back for seconds!

After a delightful lunch, it was time for the first session of the afternoon. This was Asgeir Gunnarson's Who, what, when - Monitoring usage of your MS BI Stack.

Asgeir looks at a number of approaches to monitoring all aspects of a business intelligence stack focusing specifically on databases and PowerBI.

For database monitoring, SQL Server Reporting Services (SSRS) Execution Logs are one approach. They're flexible and contain lots of useful data, although they're very rudimentary. They only show the subscription ID of the connection to the database, and not the actual user's account ID, when connecting via a subscription, making identifying such users quite difficult.

We look at some commercial tools that can provide similar monitoring, such as Idera's SQL BI Manager, SentryOne's BI Suite and Scrubs from SummitCloud. One advantage of commercial tools is that they're usually very well supported, with very good tooling, but the associated cost is a disadvantage. They're also not always focused on user activity and behaviour within the system, sometimes focusing only on the reporting data.

For monitoring of PowerBI, we can use Workspace Usage Reports, which are built in to PowerBI and very easy to set up. One constraint of these reports is that they only run against a single workspace at a time, making user reporting across multiple workspaces difficult. There's also the Office 365 Audit Logs. These show all activity within a tenant, capturing not only usage and the different types of usage, but also much more user data. The downsides of using the audit logs are the extensive permissions required to access the necessary data, which can often be difficult to obtain from corporate IT departments. They also require a code solution to collect and process the logs, although this can be automated with something like a PowerShell script that corporate IT staff can run using their own credentials. A commercial product from a company called Neal Analytics is also available that performs this function.

Asgeir continues by looking at the Query Log from SQL Server Analysis Services (SSAS). It's built into SSAS and is easy to implement but it has a lot of overhead as the data contained within requires a lot of transformation and interpretation and isn't really intended for audit purposes, instead being intended for query performance tuning.

We also look at SQL Server's Dynamic Management Views, which are another option for gathering monitoring data. They're built into SQL Server and have a clear and well-supported structure, so there's lots of information available online for querying and processing this data. One downside to this approach is that they only work with data for operations currently happening on the SQL Server in real time.

Finally, we look at trace events. There's SQL Server Profiler traces and Extended Events (xEvents). Again, both of these options are built into SQL Server, so they're available without additional tooling or expense. It's possible to trace lots of different events and there's an abundance of information available online to help. One downside is that the event/trace data is emitted in XML format, so will require some processing in order to extract all of the data contained inside the XML document. The data is also very granular in nature, so you may need to aggregate multiple events together in order to get a complete picture of a user's given actions. xEvents will only work with versions of SQL Server from 2012 onwards, so will not be available if you're stuck on an older version of SQL Server, and the actual information for a PowerBI user is very difficult to find and requires cross-joining with other data sources. This is due to xEvents only showing the "gateway" user within the event data.

Asgeir concludes by stating that the decisions between rolling-your-own solution versus buying a commercial product come down to your ability to perform the often required additional coding work and appetite for learning about the raw data and how it must be transformed in order to provide you with the monitoring data required. Commercial solutions will require none of the additional coding work, but must be carefully evaluated to ensure that they provide the required information from the data that they work with.


After another coffee break, during which we were treated to some lovely chocolate brownies, it was time for the penultimate session of the day. This one was to be Ana Maria Bisbe York's Some transformations that can save your ETL in PowerBI.

Ana's talk agenda is to first define the Extract, Transform, Load (ETL) process and in particular, the process from within PowerBI and then to take a deep dive with some hands-on demos of configuring such a process.

The ETL process is the most important layer in a complete business intelligence system, but it's often a largely "hidden" process. For example, when end users run their reports, they don't see the work that has gone into preparing the data for those reports.

We first look at the Extract part of ETL. Data must be imported or linked from various sources. Then comes Transform. We must filter, format and otherwise prepare the data. Finally, there's Load, which loads the transformed data into the PowerBI Model. We also have the opportunity at this stage to actually apply further transformations on the data, such as defining DAX calculated columns.

Ana shows us how to use PowerBI Desktop to load two CSV files - one for each of two years of data (Sales2017 & Sales2018). She adds a number of transformation steps, which are retained as individual steps within PowerBI and can be replayed against other data sources. These transformation steps are applied to the first CSV file once it's loaded, and are then also applied to the second CSV file, which is loaded and transformed in the same way. After this, we combine the resulting transformed data into one table containing the data from both files.

Ana continues and shows us how to add a calculated and transformed column for a person's age based upon the date of birth column in the source data. We then add another column representing a textual description of the age based upon age ranges: "very young", "young" and "old". The specific ranges chosen can be encoded directly into the column definition, but it's better to use a run-time parameter for this.
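Ana's age and age-band columns might be sketched like this (the band cut-offs are invented; in Ana's setup they would come from a run-time parameter, modelled here as function defaults):

```python
from datetime import date

# Sketch of the calculated columns: derive age from a date of birth, then a
# textual age band. The cut-offs (18 and 65) are invented for illustration.
def age(dob, on=date(2019, 3, 2)):
    # Subtract a year if the birthday hasn't occurred yet this year.
    return on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))

def age_band(years, young_under=18, old_from=65):
    if years < young_under:
        return "very young"
    if years < old_from:
        return "young"
    return "old"

a = age(date(1990, 6, 15))
print(a, age_band(a))   # 28 young
```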

We then see how columns can be renamed based upon example values. There's a column called "P_Mountain" within the source data but we'd really prefer it to be named "Mountain". We can write the function expression ourselves, but we can also tell PowerBI the result that we'd like to achieve (in this case, we give it the value "Mountain") and PowerBI will reverse-compute the required expression.

We then see how, with all of these transformation steps set up, we can simply add additional CSV or Excel files into the source folder and hit refresh within the PowerBI desktop and the new files are loaded, transformed and processed using the same sequence of steps as previously defined for the original files. (Of course, the files need to have the same schema/structure).
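The replayable pipeline can be sketched as a list of per-row transformation steps applied to every file with the same schema (the steps and file contents are invented):

```python
import csv, io

# Sketch of replaying recorded transformation steps over every source file:
# each step is a function, and each new file with the same schema flows
# through the same pipeline unchanged.
steps = [
    lambda row: {**row, "amount": float(row["amount"])},   # type conversion
    lambda row: {**row, "year": row["date"][:4]},          # derived column
]

def run_pipeline(text):
    rows = list(csv.DictReader(io.StringIO(text)))
    for step in steps:
        rows = [step(r) for r in rows]
    return rows

sales2017 = "date,amount\n2017-01-05,10\n"
sales2018 = "date,amount\n2018-02-07,20\n"   # a newly added file, same schema
combined = run_pipeline(sales2017) + run_pipeline(sales2018)
print([r["year"] for r in combined], sum(r["amount"] for r in combined))
```

Dropping another file into the folder and re-running is all it takes, mirroring the refresh behaviour Ana demonstrates.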

We continue by looking at the creation of queries which are based upon other queries within PowerBI desktop. We can do this whilst only loading in the final query that produces the ultimate dataset that we require, electing to leave the working (or temporary) queries out of the PowerBI model.

Ana then shows us how we can write code using M Script that can be added to PowerBI and which performs processing against some underlying data in order to produce a transformed dataset. This is very useful for creating such things as a table of data that represents a chronological calendar, for example.

PowerBI Desktop also includes the ability to view the complete list of dependencies for the data model. You can see all of the interim queries and transformation steps that are required in order to build the model tables. This essentially provides the user with a complete overview of the entirety of the ETL process and all of its constituent steps.

Finally, Ana concludes by reiterating that the PowerBI Query Editor is a very powerful tool for ETL purposes and that all of this power can be used not only when sourcing data from large backend datasets but also as a standalone tool that can load data from local files.

After another short break, which was the last of the day and unfortunately did not include any coffee, it was time for the last session of the day. This final session was Jason Horner's Data Lake Design Patterns.

Jason starts his talk by stating that "data is the new currency". He says that a wallet is great for monetary notes, but not for a huge quantity of money, and likewise we need to treat the containers for our data in the same way. A single folder is great for a few files, but for large quantities of data, we need something better. Within Azure, we'd likely use a Data Lake.

Azure Data Lakes can store huge amounts of data and can use Azure Blob Storage under the hood, however, Jason says that this is probably not the best overall solution. Data Lakes in Azure are currently available in both Gen 1 and Gen 2 variations. Gen 1, the original Data Lake architecture, has its limitations, but the new Gen 2 architecture addresses those limitations and provides the current best option with improved performance and a more flexible structure for data storage.

Jason then talks about file formats for storage within the Azure Data Lake. He suggests that a good file format to use is Apache Parquet, which is a columnar storage format. It supports good compression and predicate pushdown, which is a way of limiting the data that needs to be read to satisfy a query. It's a relatively new format, originally developed by Twitter and Cloudera, and already enjoys broad support.
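To illustrate why a columnar layout enables predicate pushdown, here's a small Python sketch (purely illustrative, not how Parquet is actually implemented): the filter is evaluated against a single column, and only the matching values of the column we need are then read, while unrelated columns are never touched:

```python
# A few rows of sample data, first in the familiar row-oriented shape.
rows = [
    {"brand": "Contoso", "qty": 10, "price": 9.99},
    {"brand": "Fabrikam", "qty": 5, "price": 4.50},
    {"brand": "Contoso", "qty": 7, "price": 2.25},
]

# A columnar layout stores each column contiguously instead.
columns = {key: [row[key] for row in rows] for key in rows[0]}

def sum_qty_for_brand(cols, brand):
    """Predicate pushdown sketch: evaluate the filter against the
    'brand' column alone, then read only the matching 'qty' values.
    The 'price' column is never read at all."""
    matches = [i for i, b in enumerate(cols["brand"]) if b == brand]
    return sum(cols["qty"][i] for i in matches)

print(sum_qty_for_brand(columns, "Contoso"))  # -> 17
```

Real Parquet readers push predicates further still, using per-chunk statistics to skip whole blocks of the file, but the principle is the same.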

Jason says that it's important to think about Zones when designing your Data Lake. Zones are logical or physical separations of data that allow you to keep the overall Data Lake environment secure and well organised. In many ways, Zones are really just folders in the lake, with some of the common folders used being Archive, Raw, Reference and others.

For querying within the Data Lake, a currently popular choice is U-SQL, however, the relatively recent introduction of Azure DataBricks will likely effectively replace it.

We look at some good practices around the naming of files inside the Data Lake. One common good practice is to suffix the file with the date in ISO format (i.e. nameYYYYMMDD_HHmm.csv). It's then best to partition these files in folders based upon a similar date structure. For example, a folder structure such as archive/2017/03/14/sales20170314_1132.csv. Jason also reminds us that it's important to avoid creating folders that contain no data and not to mix file types within the same folder.
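That naming and partitioning convention is easy to automate. A small Python sketch, where the `lake_path` helper is a hypothetical example rather than anything from the Azure SDK:

```python
from datetime import datetime

def lake_path(zone: str, name: str, when: datetime, ext: str = "csv") -> str:
    """Build a date-partitioned Data Lake path following the
    zone/YYYY/MM/DD/nameYYYYMMDD_HHmm.ext convention."""
    return (f"{zone}/{when:%Y/%m/%d}/"
            f"{name}{when:%Y%m%d_%H%M}.{ext}")

path = lake_path("archive", "sales", datetime(2017, 3, 14, 11, 32))
print(path)  # -> archive/2017/03/14/sales20170314_1132.csv
```

Generating paths from a single helper like this also makes it much harder to accidentally drift into inconsistent naming across different ingestion jobs.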

We continue by talking about the "splittability" of our files. Splittability is the ability to arbitrarily split one file into multiple files and for the data to still make sense. If our data is amenable to this kind of splitting it means we can use parallel computation when processing the data which improves processing speed. However, this also needs to be contrasted against the fact that our files may also be compressed and it's probably easier to decompress and process fewer larger files than it is to decompress and process many smaller files.
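The idea can be sketched in a few lines of Python: line-delimited records can be split at arbitrary boundaries and each chunk processed independently in parallel. This is a toy illustration of the principle, not a real distributed engine:

```python
from concurrent.futures import ThreadPoolExecutor

# A "splittable" file: line-delimited records can be partitioned
# at any line boundary and each chunk still makes sense on its own.
lines = [f"record,{i}" for i in range(100)]

def split(records, parts):
    """Divide records into roughly equal chunks for parallel work."""
    size = -(-len(records) // parts)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def process(chunk):
    """Each worker processes its chunk independently; here we
    simply count records as a stand-in for real work."""
    return len(chunk)

chunks = split(lines, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(process, chunks))
print(counts, sum(counts))  # -> [25, 25, 25, 25] 100
```

A gzip-compressed file, by contrast, cannot be split this way, since decompression has to start from the beginning of the stream, which is part of the trade-off Jason describes.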

Jason then starts to look at the common areas that we might divide our Data Lake into. The Raw zone should contain only immutable data and it's really to be considered a staging area for data which will undergo further processing. A Curated zone is the place where data from the Raw zone has been processed and enriched with additional data. It's a good place to allow end-user access to the data. The Temp zone houses mutable data and is the place where files can be moved, archived, deleted and more. Data here should be considered to be transient. One good use of the Temp zone is to compare data here with the same data in the immutable Raw zone in order to compute deltas. Finally, the Archive zone is the place to store data when it's no longer required on a regular basis. Data here can be restricted via ACLs as this is the last place for data prior to it being moved into long-term archival storage, perhaps on magnetic tape or other medium.
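The Temp-versus-Raw delta computation that Jason mentions amounts to a keyed comparison of two datasets. A Python sketch of the idea, where the `compute_delta` helper and its field names are hypothetical:

```python
def compute_delta(raw_rows, temp_rows, key="id"):
    """Compare mutable Temp-zone data against the immutable Raw
    zone to find inserted, deleted and changed records."""
    raw = {row[key]: row for row in raw_rows}
    temp = {row[key]: row for row in temp_rows}
    inserted = [temp[k] for k in sorted(temp.keys() - raw.keys())]
    deleted = [raw[k] for k in sorted(raw.keys() - temp.keys())]
    changed = [temp[k] for k in sorted(raw.keys() & temp.keys())
               if raw[k] != temp[k]]
    return inserted, deleted, changed

raw = [{"id": 1, "qty": 10}, {"id": 2, "qty": 5}]
temp = [{"id": 2, "qty": 8}, {"id": 3, "qty": 1}]
ins, dele, chg = compute_delta(raw, temp)
print(ins, dele, chg)
```

Because the Raw zone is immutable, it provides a stable baseline for this comparison, which is precisely why mixing mutable data into it would undermine the design.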

We're reminded that we need to avoid creating a "Data Swamp". A Data Swamp is a Data Lake that has no standard file naming, poor folder structure, the mixing of multiple file types within the same folder and overall poor governance. There's also the "small file problem" that we need to be concerned with. We want to avoid placing too many small files in the same folder, however, a good date-based folder naming structure will go a long way to avoid this. The main reason for not wanting too many files in a single folder is that they can slow down the Massively Parallel Processing (MPP) engine when processing these files.

With regards to Data Lake governance, we should use a Data Catalogue. This is a central place in which we can list all data sources and the relevant connections and credentials for access. Azure itself has a data catalogue which, although only a rudimentary product, does allow registration, documentation, search and discovery.

Finally, Jason looks at other best practices to use for the data inside the Data Lake. We should consider good metadata for all files inside the lake. Use a common and consistent set of columns such as a Created Date, Processed Date, Source and MD5Hash to allow easy identification of key stages of the file's processing lifecycle. Use schema-on-read for data access to allow data to be ingested without reference to a specific strongly typed schema. Use Azure security with Users, Groups & Principals along with POSIX ACLs to restrict access to the relevant areas of the Data Lake for certain users and user groups. One handy tip that Jason shares for Azure security is to ensure that folder permissions are set as not only Access permissions but also as Default permissions. This ensures that files written into the folder are given the same permissions as the containing folder.
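The common metadata columns can be sketched in Python. The `file_metadata` helper and the `Source` value here are hypothetical examples for illustration, not part of any Azure API:

```python
import hashlib
from datetime import datetime, timezone

def file_metadata(content: bytes, source: str) -> dict:
    """Build the suggested common metadata columns: creation and
    processing timestamps, the originating system, and an MD5 hash
    that identifies the file's content across its lifecycle."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "CreatedDate": now,
        "ProcessedDate": now,
        "Source": source,
        "MD5Hash": hashlib.md5(content).hexdigest(),
    }

meta = file_metadata(b"brand,qty\nContoso,10\n", source="erp-extract")
print(meta["MD5Hash"])
```

The hash in particular makes it cheap to detect whether a file landing in the lake is a genuine new version or just a duplicate of something already processed.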

After Jason's session it was time for the final wrap up for the day and the prize draw. It had been an excellent day at SQLBits 2019 and I'd learned a lot about many new areas that are going to be directly relevant to me in my current role, so when my name was also read out as a prize winner, it was the icing on the cake for a brilliant day. Thanks to the guys at Power BI Sentinel for the remote control plane that I'd won.


Once again, this year's SQLBits proved to be an excellent conference, incredibly well organised and run, with lots of informative content. I'm very much looking forward to attending next time.