In-Depth

Data Hygiene 101: A Step-by-Step Data Cleaning Process

It's time for spring cleaning, including your enterprise data stores, says data expert Joey D'Antoni, who offers front-line data-hygiene advice straight from the IT trenches.

"Data can be one of our most valuable corporate assets, but frequently it becomes cluttered and unreliable" was advice dispensed in a recent presentation by D'Antoni, principal cloud architect at 3Cloud. "We have too many various points of data collection. Frequently in organizations, we keep data too long, which can lead to both regulatory risk as well as performance and its cost."

D'Antoni was dispensing his hard-won expertise to an audience of hundreds who attended today's online IT education summit, "Data Hygiene 101: Cleaning Up Enterprise Data for the AI Age," now available for on-demand replay from Virtualization & Cloud Review.

As part of the hour-long presentation, he provided and explained a step-by-step data cleaning process involving assessment, categorization, validation, cleaning and maintenance.

Step-by-Step Data Cleaning Process
Step-by-Step Data Cleaning Process (source: D'Antoni).

Although the process was a relatively small part of his presentation, it provides easily digestible, crucial information for enterprise data jocks. Here are summaries of what he had to say about each step.

Assess: Conduct a Data Audit
"So the first step here is to conduct a data audit," D'Antoni said. "And the goal of this is really to understand the current state of the data you have, and what you want to do is inventory all of your data sources. And in a large organization, this could take six months. You're going to look for all databases, spreadsheets and files where data is stored. Typically, this is going to have to involve system admins."

"You really want to analyze your data quality. You want to look for inconsistencies, missing values, duplicates and outdated records. This is a lot harder to do in things like Excel files. So you want to go to your sources of truth. So your systems that are sources of your truth are going to want to be where you're going to want to do that data quality lookup. You're also going to want to check your compliance, and this assumes you have these policies in place, but you want to make sure that you're compliant with GDPR and HIPAA, and you want to take advantage of profiling tools to help you analyze patterns and anomalies."

Categorize: Identify ROT Data
"Next, we want to identify that ROT [Redundant, Obsolete, or Trivial] data. So we want to tag any data that's redundant. Ideally, in your storage subsystem, you're going to have some level of deduplication. I know a lot of the large storage vendors do this to identify duplicate records and use pointers, but you kind of want to get rid of that data. You want to flag any outdated data and potentially get rid of it based on last access data in your policies. And if you have a lot of trivial data, like temporary files or relevant log files that you're no longer using, those are a good way to do that. You can use file analysis tools ... or file systems, a lot of folks will write PowerShell scripts to do that. There are also some data deduplication and compression tools, like I mentioned a lot of times, that's going to be built into the storage array. So depending on where you are in your organization, you may want to talk to your storage vendor. You may want to talk to your legal team."

Validate: Ensure Accuracy and Completeness
"And our goal here is to ensure that we have reliable, formatted and complete data. There are a number of steps to do this, and they kind of begin at software development. If you're you're you're transaction processing application, or if you're using something like Shopify, doesn't validate the data as it's incoming, you're setting yourself up to have poor data quality in your system. So you want to make sure that data, ideally, is getting checked on input that doesn't always happen. So if you don't have that in place, you want to standardize those formats. You want to make sure phone numbers are all in a single format, dates and addresses all in a uniform format and verified and validated, identifying missing values and fill them where necessary. There are various approaches to this." Watch the replay for those various approaches.

Clean: Remove or Archive Unnecessary Data
"You want to do this while maintaining data integrity across organizations. So there are some steps to this process. You want to correct inconsistent or incorrect data, like long ZIP codes you want to merge or remove, redundant records in your systems, and then delete or archive so that trivial or non compliant data you probably want to delete. And if you have old but legally required data, you want to archive it, you move that to tape storage or archive cloud storage, where you have it if you need it big, keep that going, and then you want to make sure that your reporting systems are up to date with current information. And you can use ETL tools to streamline some of that cleaning and just overall best practices that we have, or backup before cleaning. You always want to have a rollback plan document the changes you make."

Maintain: Implement Ongoing Data Governance
"And then finally, you want to maintain and have ongoing data governance, and you want to prevent those future quality issues by enforcing best practices. So you want to assign ownership to data stewards that are responsible for data quality. You want to have validation processes like I mentioned. Ideally, you want to have that on the front end, and maybe additional checks in your data warehouse to double check for a belt and suspenders approach, and you want to enforce data retention access controls and validation rules, and you can take advantage of You can monitor that data using data quality tools."

And More
D'Antoni provided much more expert information on data hygiene, and third-party tools were discussed in another session, all valuable content that can be consumed on demand. Of course, a benefit of attending these events live is the ability to ask questions of the expert presenters, a rare opportunity for one-on-one advice. And there is the opportunity to win free prizes, in this case $5 Starbucks gift cards for the first 300 attendees, provided by event sponsor Rubrik, a Zero Trust data security specialist. With those considerations in mind, here are some upcoming events being held for the rest of the month by the parent company of Virtualization & Cloud Review.

About the Author

David Ramel is an editor and writer at Converge 360.


Upcoming Training Events

Visual Studio Live! San Diego
September 8-12, 2025
Live! 360 Orlando
November 16-21, 2025
Cloud & Containers Live! Orlando
November 16-21, 2025
Data Platform Live! Orlando
November 16-21, 2025
Visual Studio Live! Orlando
November 16-21, 2025