Why does it have to be so hard?
Unlike structured data, unstructured data is “the people’s data”: it is used by just about everyone in your organization for a million different purposes. And let’s face it, data is meant to be shared to get the maximum value out of it; that’s how collaboration works. We are all human (perhaps with a little squirrel mixed in), so we naturally like to share our data socially while at the same time stashing it away in our favorite places, like some crazy folder structure on a rarely used file share that is older than your high school daughter. Think about the number of times you have copied a file off SharePoint and then emailed it to the team so they can download it and access it offline, on a long flight perhaps?

One of the biggest contributors to the complexity of unstructured data management is data sprawl. Add to that the significant employee turnover rates at most businesses, which compound the sprawl problem: those files typically become “orphaned”, often with open permissions, and remain completely unmanaged, creating potential exposure to attackers and clogging file shares. We have witnessed file shares where over 70% of the data was orphaned, most of it with open read/write permissions... not pretty!
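To make that risk concrete, here is a minimal sketch of how you might surface stale, world-writable files on a POSIX file share. The mount point and the five-year cutoff are assumptions for illustration only; a real inventory tool would also check ownership against the directory service and log its findings.

```python
import os
import stat
import time

SHARE_ROOT = "/mnt/fileshare"        # hypothetical mount point
STALE_SECONDS = 5 * 365 * 24 * 3600  # ~5 years untouched; an arbitrary cutoff

def find_risky_files(root):
    """Yield (path, idle_days) for stale files that anyone can write to."""
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # broken link, permission denied, etc.
            idle = now - st.st_atime
            world_writable = bool(st.st_mode & stat.S_IWOTH)
            if idle > STALE_SECONDS and world_writable:
                yield path, int(idle / 86400)

if __name__ == "__main__":
    for path, idle_days in find_risky_files(SHARE_ROOT):
        print(f"{idle_days:>6} days idle, world-writable: {path}")
```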
Why am I drowning in data?
The growth of unstructured data is quite impressive, and yet somehow depressing if you are the one tasked with managing it within your data center environment. New applications are driving enormous storage demand, whether it is higher-resolution image processing or new sensors generating billions of data points to manage ever more complex processes. Add the demands of large-scale AI models, and it is no wonder that many analysts predict unstructured data growth will continue to increase, with estimates of 25% to 30% per year. Overall data capacity is expected to reach 180 zettabytes in less than two years’ time, of which roughly 80% will be unstructured. If history is any guide, this trend will only accelerate during the second half of the decade.
Why is data management increasingly important?
Many businesses have come to the realization that their data is an important asset, and the 2024 results of Wavestone’s (formerly NewVantage Partners) 12th annual survey of data and analytics leadership confirm it. I found it interesting that over the last five years of the survey:
- Data-driven business innovation increased from 59.5% to 77.6%
- Businesses managing data as a business asset increased from 39.5% to 49.1%
- Businesses that established a data and analytics culture more than doubled, from 20.6% to 42.6%
And, like any other valuable asset, businesses must secure and protect data in accordance with its value to the organization, which adds another layer of complexity.
Data is obviously an important asset, but how can my company manage it more efficiently?
Here are a few ideas our customers have used to deal with these challenges and become more successful data-driven businesses.
- At a recent AI conference, I spoke with the Chief Data Scientist of a large bank, who mentioned that every one of his projects begins with an assessment of the metadata. He called it the “DNA” of the data and believes it provides the logical starting point for any data management project. I completely agree with his take and believe it is the most efficient way to inventory your data (a minimal scan sketch follows this list). To tackle the scale and scope of your data set, no matter where it is stored, use a solution that can:
  - Scale by leveraging a flexible VM-based architecture
  - Utilize multi-threaded scan processing
  - Utilize robust connectors that can process PBs of data in weeks, not months
  - Integrate with a wide range of cloud sources and applications to cover all of the new unstructured data sources and storage
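As a rough illustration of the multi-threaded metadata scan described above, here is a minimal sketch. The thread-pool fan-out, the `FileRecord` fields, and the share path are all assumptions for illustration; an enterprise scanner would distribute work across VMs and persist records to a catalog rather than a list in memory.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class FileRecord:
    path: str
    size: int     # bytes
    mtime: float  # last modified, epoch seconds
    ext: str      # lowercase file extension

def scan_directory(dirpath: str) -> list[FileRecord]:
    """Collect metadata for every regular file directly inside one directory."""
    records = []
    try:
        with os.scandir(dirpath) as entries:
            for entry in entries:
                if entry.is_file(follow_symlinks=False):
                    st = entry.stat(follow_symlinks=False)
                    records.append(FileRecord(
                        path=entry.path,
                        size=st.st_size,
                        mtime=st.st_mtime,
                        ext=os.path.splitext(entry.name)[1].lower(),
                    ))
    except OSError:
        pass  # unreadable directory; a real scanner would log and report it
    return records

def inventory(root: str, workers: int = 16) -> list[FileRecord]:
    """Fan directory scans out across a thread pool (metadata reads are I/O-bound)."""
    dirs = [dirpath for dirpath, _, _ in os.walk(root)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(scan_directory, dirs)
    return [rec for batch in results for rec in batch]

if __name__ == "__main__":
    records = inventory("/mnt/fileshare")  # hypothetical share
    print(f"{len(records)} files, {sum(r.size for r in records) / 2**40:.2f} TiB total")
```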
- Once you have a full inventory of your data, you can start the remediation process: organize the data by classifying it, then use automated policies to take actions such as culling ROT (Redundant, Obsolete & Trivial data), archiving inactive data, identifying data for AI model inclusion and routing it to the proper queue, migrating data to the cloud, and flagging data that may carry risk for further content analysis or tightened access controls (a simple policy sketch follows this list). The solution should include:
  - A simple-to-use, enterprise-class GUI
  - Robust, flexible reporting capabilities
  - API connectivity for integration with reporting tools
  - Detailed audit trail capabilities
  - Automated workflow with notifications and approval templates
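Building on the `FileRecord` sketch above, here is a hedged illustration of what a policy-driven remediation pass might look like. The ROT extensions and archive threshold are hypothetical placeholders; a real solution would load policies from configuration and route every action through the approval workflow and audit trail listed above.

```python
import time

# Hypothetical policy thresholds; a real system would load these from config.
ROT_EXTENSIONS = {".tmp", ".bak", ".log"}  # candidate "trivial" file types
ARCHIVE_AFTER_DAYS = 3 * 365               # ~3 years inactive -> archive tier

def classify(record) -> str:
    """Map an inventory record (anything with .ext and .mtime) to an action."""
    age_days = (time.time() - record.mtime) / 86400
    if record.ext in ROT_EXTENSIONS:
        return "cull"     # ROT candidate; route to approval workflow, not straight deletion
    if age_days > ARCHIVE_AFTER_DAYS:
        return "archive"  # move to a cheaper storage or cloud tier
    return "keep"

def build_action_queues(records):
    """Bucket every record into per-action queues for downstream automation."""
    queues = {"cull": [], "archive": [], "keep": []}
    for rec in records:
        queues[classify(rec)].append(rec)
    return queues
```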
- Now that you have done “clean up on aisle 6” with your data, you have a significantly smaller data set (typically 10% to 20% of the original scan amount) on which to apply the heavy lifting of content analysis. At the same time, your data hygiene efforts can yield impressive cost savings from storage and backup optimization, AI model preparation, cloud transformation, and reduced security risk. Now you are ready to tackle the last part, Governance, Risk & Compliance mitigation, which requires full content analysis (a toy content scan follows this list). The solution should include:
  - Advanced AI/ML technology to drive higher accuracy
  - Scalable architecture
  - Advanced automated workflow
  - Detailed audit trail capabilities
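To show the shape of that last step, here is a deliberately naive sketch that flags files containing possible PII. The regex patterns merely stand in for the AI/ML classifiers a real solution would use; they are illustrative only and will both miss matches and produce false positives.

```python
import re

# Toy stand-ins for real content classifiers; illustrative, not production-grade.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_content(path: str) -> dict[str, int]:
    """Return counts of each PII-like pattern found in one text file."""
    try:
        with open(path, "r", errors="ignore") as f:
            text = f.read()
    except OSError:
        return {}  # unreadable file; a real tool would record this in the audit trail
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = len(matches)
    return hits
```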
By breaking the data management process down into these three steps (inventory, remediation, and content analysis), you can really reduce the complexity, time, and cost associated with managing unstructured data in today’s changing environment.