7 Steps to Tame the Unstructured Data Jungle
It's not just a mess -- it's a security risk, a compliance hazard, and a missed opportunity. That was the message from data evangelist Karen Lopez during the June 27 session of the "How To Take Unstructured Data from Chaos to Clarity" summit, hosted by Virtualization & Cloud Review and now available for on-demand viewing thanks to sponsor Rubrik. In a presentation packed with real-world examples, tactical advice, and a few quotable zingers, Lopez laid out a structured framework for getting a handle on one of IT's most persistent problems: unstructured data.
The longtime data expert -- known for her frank talk and deep industry knowledge -- had just returned from a business trip to Iceland when she walked attendees through a seven-step roadmap for bringing order to chaotic data environments, just one part of her extensive presentation.
"We need to automate away the boring stuff so we can free up humans to do the difficult stuff."
Karen Lopez, @datachick.bsky.social
Titled "How to Tackle Unstructured Data," the slide that anchored the part of her talk we'll discuss here offered both a conceptual framework and a practical to-do list for IT pros looking to turn messy data into a manageable, secure, and value-generating asset.
How to Tackle Unstructured Data (source: Karen Lopez).
1. Find Unstructured Data
Lopez opened her strategic roadmap with a reminder that organizations can't address unstructured data unless they know where it is. She detailed how to find it, because there are so many places it could be. She emphasized emerging formats and overlooked sources like telemetry, software-defined configurations, and AI artifacts: "The newest ones are prompts and all the other AI related data and values."
She introduced a favorite term for 2025: "dark data -- data that we don't know is out there, and therefore it's hidden, either intentionally or unintentionally."
2. Inventory Format, Size, Location, & Access Controls
Once found, unstructured data must be inventoried. Lopez stressed this doesn't mean ingestion into a system, but rather documentation: "By inventory, I don't mean bringing it in. I mean you've discovered it, you've scanned it, you've entered it, you've exposed it to a catalog or portal. Now you need people to help you do all the next steps."
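An inventory in this sense, one that records format, size, location, and access controls without pulling the data in, might be sketched as follows. This is a minimal illustration rather than anything shown in the session: the names InventoryRecord and inventory are hypothetical, and the file-mode string is only a stand-in for real access controls such as ACLs or share permissions.

```python
import os
import stat
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class InventoryRecord:
    """One catalog entry (hypothetical shape, not from the presentation)."""
    location: str      # full path to the file
    file_format: str   # lowercase extension without the dot, "" if none
    size_bytes: int
    permissions: str   # mode string such as "-rw-r--r--" (a proxy for access controls)


def inventory(root: Path) -> List[InventoryRecord]:
    """Walk `root` and describe each file without reading or moving its content."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            info = path.stat()
            records.append(InventoryRecord(
                location=str(path),
                file_format=path.suffix.lower().lstrip("."),
                size_bytes=info.st_size,
                permissions=stat.filemode(info.st_mode),
            ))
    return records
```

The resulting records could then be exported to whatever catalog or portal the organization already uses, which is where the people described next come in.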
She advised leveraging a cross-functional set of allies:
- Business users and process owners
- Data stewards or guardians
- Developers and DBAs
- Records managers
- Storage staff
"Because while storage people tend not to know about the content of files, they certainly know a lot about files and blocks, and they can trace where it comes from and what it's exposed to many times."
3. Establish & Enhance Metadata
Lopez underscored that metadata is foundational: "Metadata is one of the number one ways to handle chaos with data."
She provided a variety of useful metadata categories:
- Embedded metadata -- e.g., "camera model, GPS location, when it was taken"
- Derived data -- results of "entity extraction," "topics," "tone"
- Usage metadata -- how it's accessed, stored, or versioned
Metadata also helps with automation and governance: "We need metadata to help us secure it, to make it easier to use, to support the new things like data mesh."
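To make the three categories concrete, here is a rough sketch of a record that keeps them separate, plus a toy "derived" value. Everything in it is an assumption for illustration: FileMetadata, derive_topics, and the stopword list are hypothetical, and counting frequent words is a crude stand-in for the entity extraction and topic detection Lopez mentioned.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import List

# Tiny illustrative stopword list; a real one would be far larger.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


@dataclass
class FileMetadata:
    embedded: dict = field(default_factory=dict)  # read from the file itself (e.g., camera model)
    derived: dict = field(default_factory=dict)   # computed from the content (e.g., topics, tone)
    usage: dict = field(default_factory=dict)     # how the file is accessed, stored, or versioned


def derive_topics(text: str, top_n: int = 3) -> List[str]:
    """Return the most frequent non-stopword terms as naive 'topics'."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _count in counts.most_common(top_n)]


meta = FileMetadata()
meta.derived["topics"] = derive_topics("Invoice invoice invoice payment payment overdue")
```

Keeping the categories apart makes it easier to see which values came from the file, which were inferred, and which describe how the file is used.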
4. Classify Data
Classification is essential for securing and governing unstructured data. "That's a label or tag we put on data to say, is it PII? Does it have sensitive data, confidential data?" she said.
She called out the need for a standardized tagging system, and stressed the importance of keeping it updated: "We need to regularly review and update those classification standards, because legislation changes, our understanding of the data changes."
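As a hedged illustration of rule-based tagging, the sketch below applies a few regular expressions and returns the matching labels. The tag names and patterns are hypothetical, not a scheme from the presentation; a real classifier would follow the organization's own standard and would need far more than pattern matching to find PII reliably.

```python
import re
from typing import Set

# Hypothetical tags and patterns, kept in one place so they are easy to
# review and update when legislation or the classification standard changes.
PATTERNS = {
    "PII:email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PII:us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CONFIDENTIAL": re.compile(r"\bconfidential\b", re.IGNORECASE),
}


def classify(text: str) -> Set[str]:
    """Return every tag whose pattern appears at least once in `text`."""
    return {tag for tag, pattern in PATTERNS.items() if pattern.search(text)}
```

For example, classify("Contact jane@example.com") yields the single tag "PII:email", while text with no matches yields an empty set.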
5. Standardize Formats Where Possible
Format diversity adds chaos. Lopez advised that teams should "convert it to other formats" and consider "naming standards," "setting folders and paths," and even cleaning up text.
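Two of those ideas, naming standards and text cleanup, can be sketched in a few lines. Both functions are hypothetical examples under assumed conventions (lowercase, hyphen-separated file names and ISO 8601 dates), not rules given in the session. Because a string like 03/04/2025 is ambiguous, the date helper makes the caller state which formats a given source uses instead of guessing.

```python
import re
import unicodedata
from datetime import datetime
from pathlib import PurePath
from typing import List, Optional


def standardize_name(filename: str) -> str:
    """Apply an assumed naming standard: ASCII, lowercase, hyphen-separated."""
    path = PurePath(filename)
    # Strip accents by decomposing characters and dropping non-ASCII marks.
    stem = unicodedata.normalize("NFKD", path.stem).encode("ascii", "ignore").decode("ascii")
    stem = re.sub(r"[^A-Za-z0-9]+", "-", stem).strip("-").lower()
    return f"{stem or 'unnamed'}{path.suffix.lower()}"


def to_iso_date(raw: str, formats: List[str]) -> Optional[str]:
    """Try each declared format in order; return YYYY-MM-DD or None if none fit."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning None for an unparseable date, rather than a best guess, leaves the doubtful cases for a person to review.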
"Sometimes date formats, you know, get transposed and messed around at times," she said. Even redaction may be necessary: "A typical one would be on resumes… you want to take away any photos… or any sensitive data that we're not supposed to use in evaluating candidates."
6. Add to Data Governance Processes
Unstructured data must be folded into broader governance efforts. "It's still data, even if it's unstructured, even if it's a raw image, even if it's a JPEG or PNG," Lopez said.
She warned: "Most of the data governance programs that I look at today don't have a lot of good details or policies or tools for unstructured data." Challenges include masking, complex security, and legacy tools built for structured environments.
7. Automate It All
Lopez closed her framework with a call to scale: "We need to automate all the things. Now, not everything can be automated, but in order to handle the volume and the just fire hose of unstructured data… we need to be automating as much as possible."
She highlighted both opportunity and risk: "What I really say is we need to automate away the boring stuff so we can free up humans to do the difficult stuff." But she also cautioned that "a mismatch for all the types of unstructured data" and "format obsolescence" remain major concerns.
And More
Lopez had much more expert advice to share on those seven points, and on other topics, in her complete presentation, so the on-demand replay is well worth watching. And although replays are fine -- this session was held just today, after all, so timeliness isn't an issue -- there are benefits to attending such summits and webcasts from Virtualization & Cloud Review and sister sites live. Paramount among these is the ability to ask questions of the presenters, a rare chance to get one-on-one advice from bona fide subject matter experts (not to mention the chance to win prizes -- in this case, $5 Starbucks gift cards awarded to the first 300 attendees during a session by sponsor Rubrik, a leader in cloud data management and enterprise data protection that also presented at the summit).
About the Author
David Ramel is an editor and writer at Converge 360.