How-To
What Is Azure Purview?
Paul Schnackenburg looks at the new public preview of the cloud giant's unified data governance service.
In this article we'll look at the recently released public preview of Azure Purview, how it fits into the overall information governance story from Microsoft and why you should consider using it in your organization.
Data Governance
GDPR and CCPA brought home to many businesses the realization that if you don't know what data you have, where it's stored, who's got access to it and where it's being moved to, you are accepting an unknown amount of risk. Most businesses see their data as a strategic asset and are looking for the right technology to manage the risk associated with storing, mining and managing it.
In Microsoft 365 (M365), documents and emails stored in SharePoint Online, OneDrive for Business and Exchange Online -- plus file shares and SharePoint Server on-premises -- are covered by Information Protection as well as Data Loss Prevention, which have been available for quite a few years. Microsoft Information Protection (MIP) lets you assign tags to documents, either manually or by scanning the content and automatically identifying credit card numbers or Social Security numbers, for example. Based on the tags, you can apply a policy that can include visual indicators in the document as well as encryption. This protection follows the file, so if a user copies sensitive documents to a USB stick and gives them to someone unauthorized, that person won't be able to open the files. This system works well for emails and files and it's one of the strengths of M365, with many new features added, such as machine learning-based custom templates for more complex document types and Content Explorer, which gives a quick overview of what sensitive data is present across an entire M365 estate.
But for structured data, stored in cloud-based or on-premises databases, data lakes and data warehouses, there's been nothing comparable (from Microsoft), until Azure Purview.
Azure Purview Overview
Azure Purview is a data governance solution that helps you deeply understand all data across your data estate and maintain control over its usage. It's built on Apache Atlas, an open-source project for metadata management and governance for data assets.
Apart from the C-suite, whose jobs might be on the line if there's a major data breach, Azure Purview is aimed at two distinct personas. The first is the data scientist or data engineer who needs to know what data the business has, where it's coming from and what value it can be mined for. The second is the chief data officer or chief risk officer. This latter role is relatively new in many businesses, and might be filled by the CISO, but in any case this is the person who needs to be able to answer questions across the entire data estate: What data do we have? Can it be trusted? Where is it stored? What is sensitive? What's the risk of storing this particular Personally Identifiable Information (PII) in this way? Who's got access to it?
Azure Purview provides three main functions, starting with the Data Map, which provides fast and precise scanning across your data estate as well as showing Lineage, i.e., where data is sourced and where it's targeted when it's transformed. Lineage is tracked both at the asset and column level for supported data sources.
Second is a Data Catalog to present all discovered data sources so that the right people can easily understand what data is there and where it's stored. Finally, there's Data Insights which gives you reports to understand what assets you have, glossary terms across them plus your classification and labelling results.
A glossary is a naming convention used by business consumers of the data, for example SKU name or Shipment address. Classification, on the other hand, is a tag applied at the table, column or file level to identify the data in the asset, based on its sensitivity level. A scan rule set combines several scan rules for easy scanning of your data sources. Each scan rule lets you define how file types are handled and which classification rules to use.
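To make the idea of a classification rule concrete, here is a minimal Python sketch of tagging columns based on sampled values. It is not how Azure Purview implements scanning: the rule names, the regular expressions, the `classify_column` and `classify_table` helpers and the 60 percent match threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical classification rules: name -> regex tested against sampled values.
# These patterns are illustrative only, not Purview's built-in definitions.
CLASSIFICATION_RULES = {
    "US Social Security Number": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "Email Address": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def classify_column(values, rules=CLASSIFICATION_RULES, threshold=0.6):
    """Return rule names matched by at least `threshold` of the non-empty values."""
    sample = [v for v in values if v]
    if not sample:
        return []
    matched = []
    for name, pattern in rules.items():
        hits = sum(1 for v in sample if pattern.fullmatch(v))
        if hits / len(sample) >= threshold:
            matched.append(name)
    return matched


def classify_table(columns):
    """Map each column name to the classifications its sampled values triggered."""
    return {col: classify_column(vals) for col, vals in columns.items()}


# Made-up sample data for a single table.
table = {
    "ssn": ["123-45-6789", "987-65-4321", ""],
    "contact": ["a@example.com", "n/a", "b@example.org"],
    "sku": ["AB-1", "CD-2"],
}
print(classify_table(table))
```

In this toy model, a scan rule set would simply be the dictionary of rules passed to `classify_column`, so different data sources could be scanned with different subsets of rules.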
The following data sources are supported at the time of writing:
- SQL Server on-premises
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
- Azure Blob Storage
- Azure Data Explorer
- Azure SQL DB
- Azure SQL DB Managed Instance
- Azure Synapse Analytics (formerly SQL DW)
- Azure Cosmos DB
- Power BI
A defined list of file types is supported within those data sources. Expect both lists to grow as Azure Purview goes towards General Availability and beyond.
Two other Azure services are supported in Azure Purview for lineage. Azure Data Share is a mechanism to securely share data with external business partners without having to set up clunky FTP endpoints or create copies of large datasets. Azure Data Factory is a managed Extract-Transform-Load (ETL) service for data transformation.
Stored procedures are inventoried by Azure Purview so that it can tell when data from table A is joined to table B to produce C, which helps you understand the lineage of data. These relationships aren't just analyzed statically but are also identified at runtime, as in "this procedure ran at this time/date and produced this result." This feature helps with two main scenarios: impact analysis if you're planning a change and root cause analysis if there's an issue with data quality.
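The two scenarios map onto two directions of traversal over a lineage graph: impact analysis walks downstream from an asset, and root cause analysis walks upstream. The following Python sketch illustrates that idea only. The `LineageGraph` class, its method names and the asset names are hypothetical and do not reflect Purview's data model or API.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage graph: an edge source -> target means target is derived from source."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, source, target):
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _reach(start, edges):
        # Breadth-first search collecting every node reachable from `start`.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def impact_of(self, asset):
        """Assets derived from `asset`, directly or indirectly (impact analysis)."""
        return self._reach(asset, self._downstream)

    def sources_of(self, asset):
        """Assets that `asset` is derived from (root cause analysis)."""
        return self._reach(asset, self._upstream)


# Table A joined to table B produces C, which feeds a report.
graph = LineageGraph()
graph.add_edge("table_a", "table_c")
graph.add_edge("table_b", "table_c")
graph.add_edge("table_c", "sales_report")

print(sorted(graph.impact_of("table_a")))
print(sorted(graph.sources_of("sales_report")))
```

Before changing `table_a` you would check `impact_of("table_a")`. If `sales_report` shows bad numbers, `sources_of("sales_report")` lists the upstream assets to investigate.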
The sensitive information types (SITs) supported in Azure Purview are exactly the same 200+ built-in ones that are available in M365. Note, however, that Azure Purview doesn't rely on M365: you manage your SITs and classification rules directly in Purview Studio.
Microsoft provides good deployment guidance on how to think about Purview in your business, what questions to ask to lay the foundation for a good architecture, stakeholders in your organization to involve, and key business scenarios to focus on. There's also a four-step deployment plan from pilot to production.
Azure Purview Features
Azure Purview integrates with Azure Key Vault for storing credentials. As one would expect, there's a Role Based Access Control (RBAC) model with Azure Purview permissions. The Purview Data Reader role has read-only access to the portal and can read all content except for scan bindings. The Purview Data Curator role adds the ability to edit information about assets, classification definitions and glossary terms as well as apply classifications and glossary terms to assets. Finally, the Purview Data Source Administrator role does not have access to the portal but can manage all aspects of scanning the data.
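One way to picture how these three roles relate is as sets of permissions that are combined when a user holds more than one role. The Python sketch below models only what is described above. The `Permission` flags and the `can` helper are hypothetical names, and the assumption that permissions from several roles simply add up is mine rather than documented behavior.

```python
from enum import Flag, auto


class Permission(Flag):
    # Hypothetical permission flags summarizing the role descriptions above.
    NONE = 0
    PORTAL_ACCESS = auto()
    READ_CONTENT = auto()   # all content except scan bindings
    EDIT_CATALOG = auto()   # assets, classification definitions, glossary terms
    MANAGE_SCANS = auto()


ROLES = {
    "Purview Data Reader": Permission.PORTAL_ACCESS | Permission.READ_CONTENT,
    "Purview Data Curator": (
        Permission.PORTAL_ACCESS | Permission.READ_CONTENT | Permission.EDIT_CATALOG
    ),
    "Purview Data Source Administrator": Permission.MANAGE_SCANS,
}


def effective_permissions(assigned_roles):
    """Union of the permissions granted by each assigned role (assumed additive)."""
    perms = Permission.NONE
    for role in assigned_roles:
        perms |= ROLES[role]
    return perms


def can(assigned_roles, permission):
    """True if the combined roles include `permission`."""
    return permission in effective_permissions(assigned_roles)


print(can(["Purview Data Reader"], Permission.EDIT_CATALOG))
print(can(["Purview Data Source Administrator"], Permission.PORTAL_ACCESS))
```

The second check reflects the point made above that a Data Source Administrator on its own has no portal access.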
For those following this space closely, it's worth mentioning that there's a current product called Azure Data Catalog (ADC) Gen 1 that provides some of what Purview offers. It was to be replaced by ADC Gen 2, which has morphed into Azure Purview. Its codename early on was Babylon. If you're using ADC Gen 1, be prepared for some migration work if you're going to adopt Azure Purview: at this time it involves extracting from the ADC Gen 1 API and then bulk importing via CSV files.
Note that today an Azure Purview account can only scan data sources in the same tenant that it exists in. I suspect that many Microsoft partners will want to unlock this power as a managed service for their clients through Azure Lighthouse.
Provisioning Azure Purview
You can use either the portal or PowerShell to provision an Azure Purview account. As this is a preview there are a few prerequisites and hoops to jump through, including registering three resource providers. You'll also need to pick 4 or 16 Capacity Units, each of which supports one API call per second. I suspect more sizing guidance will be provided as the preview progresses and Microsoft gets real-world data on the load that different data sources place on the service. There are currently some limitations, and it'll be interesting to see how these change for GA. Be aware of some caveats around which browser to use.
Once you've created your first account, delegate permissions as per your business requirements, link Azure Purview to an Azure Key Vault, then start adding data sources to scan. Since this is a preview you don't want to run this against production data sources, so there's a starter kit with dummy data to help you kick the tires.
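Until real sizing guidance arrives, the one-call-per-second figure allows some back-of-the-envelope arithmetic. The Python sketch below assumes ideal, sustained throughput with no throttling or bursting, which is a simplification. The function names and the one-million-call workload are made up for illustration.

```python
# From the article: each Capacity Unit supports one API call per second.
CALLS_PER_SECOND_PER_UNIT = 1


def max_calls(capacity_units, seconds):
    """Upper bound on API calls in `seconds`, assuming perfectly sustained throughput."""
    return capacity_units * CALLS_PER_SECOND_PER_UNIT * seconds


def seconds_needed(api_calls, capacity_units):
    """Minimum whole seconds to issue `api_calls` at the provisioned rate."""
    rate = capacity_units * CALLS_PER_SECOND_PER_UNIT
    return -(-api_calls // rate)  # ceiling division using integers only


for units in (4, 16):
    per_hour = max_calls(units, 3600)
    hours = seconds_needed(1_000_000, units) / 3600
    print(f"{units} CU: up to {per_hour} calls/hour, 1,000,000 calls in about {hours:.1f} hours")
```

Under these assumptions, 4 Capacity Units top out at 14,400 calls per hour and 16 at 57,600, so a workload of a million calls would take roughly 69 hours versus 17 hours.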
Conclusion
I suspect integration with another data-related hit product, Azure Synapse, is coming to Azure Purview. A demo on the Azure Friday show with Scott Hanselman revealed that scanning of AWS data sources is coming, although it's not mentioned in the documentation yet. It's early days for Azure Purview, and I'm really looking forward to seeing how this "Data Governance as a Service" (DGaaS?) shapes up.