We’re going back to basics on the role of data friction in software development today and how the Delphix platform unleashes the flow of data across the enterprise for faster innovation.
Serapheim Dimitropoulos
May 21, 2019
Imagine you’re a developer at an enterprise company with millions of users, and you’re about to integrate code that will change your production database in some way. You’ve finished your local testing, and now you want to test your change against real data before rolling it out. Obviously, you’re not allowed to connect to the production database: a bug in your code could destroy actual user data. In addition, the production database may contain sensitive information that you shouldn’t be able to access anyway.
Most engineering organizations use some variation of a QA or “staging” database for this reason. A staging database is one to which a subset of the production data is periodically copied and curated so that it doesn’t contain sensitive information. Once a staging database is in place, developers can make copies of it to test their changes before deploying code to production. For many companies, this process is automated by a Continuous Integration and Deployment (CI/CD) pipeline, but the idea remains the same: code changes are tested against curated samples of production data.
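To make the idea concrete, here’s a minimal sketch of what such a periodic refresh job might look like, using SQLite and an invented users table (the schema, sample rate, and redaction rule are all assumptions for illustration):

```python
import sqlite3

# Toy "production" database with an invented schema, seeded so the sketch
# runs end to end. In reality this would be the live production system.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
prod.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(i, f"user{i}", f"user{i}@real-domain.com") for i in range(1000)],
)

# The refresh job: copy a small sample of production rows into a fresh
# staging database, scrubbing the sensitive column along the way.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
for user_id, name, _email in prod.execute(
    "SELECT id, name, email FROM users WHERE id % 100 = 0"
):
    staging.execute(
        "INSERT INTO users VALUES (?, ?, ?)",
        (user_id, name, "redacted@example.com"),
    )
staging.commit()
```

Every step here (sampling, scrubbing, loading) has to be scripted, scheduled, and maintained by someone, which is exactly the overhead discussed next.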
This setup provides the benefits of security and reliability: production data can’t be accessed directly, so it can’t be mutated by mistake during tests, and developers don’t have access to sensitive customer data. However, there are a few drawbacks, the most obvious being the waste of organizational resources. Developers spend a lot of time waiting for data to be copied to their own staging databases, and storage is wasted on multiple staging databases that contain the same data. Another, less obvious drawback is the need for a dedicated QA or infrastructure team responsible for automating those databases, which introduces human overhead and even more cost. Finally, even with a dedicated team and automation in place, the workflows associated with restoring, subsetting, and masking fresh production data can have refresh cycles that take weeks.
These drawbacks are what we at Delphix call “data friction,” and they’re why we’ve built our platform to solve these challenges.
The Delphix platform is a virtual machine with all our software installed (both OS and userland). Customers generally place the VM within their network, either on-premises or in their private cloud network, between the production/sensitive data sources and the parties interested in them. In our example above, that would be between the production database and the developers introducing changes to it. Customers can then enable one or both of the platform’s two main features: virtualization and masking.
Virtualization addresses the resource aspect of data friction. Instead of creating and copying a staging database for each developer, an organization using the Delphix platform can let developers create virtual copies of the production database, referred to as virtual databases or VDBs, on demand. A VDB initially looks exactly like the database it was copied from, yet it is created instantaneously and requires no extra storage. Its space usage grows only as the developer introduces changes, because internally the VDB stores only the differences from the original database.
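The general idea at play is copy-on-write. Here’s a minimal Python sketch of a copy-on-write view over shared data; it illustrates why creating a copy is instantaneous and why space grows only with changes, though it is not how Delphix is implemented internally:

```python
class VirtualCopy:
    """Toy copy-on-write view: reads fall through to a shared base,
    writes land in a private delta, so creating a copy costs nothing
    until the first modification."""

    def __init__(self, base):
        self.base = base      # shared, never-mutated source data
        self.delta = {}       # this copy's private changes
        self.deleted = set()  # keys removed in this copy

    def get(self, key):
        if key in self.deleted:
            raise KeyError(key)
        if key in self.delta:
            return self.delta[key]
        return self.base[key]

    def set(self, key, value):
        self.deleted.discard(key)
        self.delta[key] = value

production = {"alice": 100, "bob": 200}
vdb = VirtualCopy(production)      # "instant" copy: nothing is duplicated
vdb.set("alice", 999)
assert production["alice"] == 100  # the base is never mutated
assert vdb.get("alice") == 999     # the copy sees its own change
assert vdb.get("bob") == 200       # unchanged keys read through to base
```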
This is a powerful concept: developers can safely test their changes against production data, creating copies of that data on the fly as they work. The process is efficient because developers no longer wait for data to be copied beforehand, and space requirements are minimized.
Database virtualization alone is not enough to completely eliminate data friction. Even if developers can make efficient virtual copies of production data, that doesn’t mean they should, because production data usually contains an enormous amount of sensitive information. Hence, curation and cleaning are necessary before use.
With our masking technology, a database administrator (DBA) can automatically identify sensitive information in database schemas and files and choose from a variety of ways to transform it. Rather than simply redacting sensitive data, DBAs can create test data that preserves referential integrity and looks real enough for developers to effectively test their applications. Together, the masking and virtualization capabilities are very powerful: they enable end users to efficiently receive virtual databases that are free of sensitive data and safe to use for testing.
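One common way to preserve referential integrity during masking (a sketch of the general technique, not necessarily the algorithm Delphix uses) is deterministic masking: each sensitive value is keyed through a secret, and the result selects a realistic replacement, so the same input maps to the same fake value in every table and joins still line up:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # in practice, kept away from developers

FAKE_NAMES = ["Avery Reed", "Jordan Blake", "Casey Morgan", "Riley Quinn"]

def mask_name(real_name: str) -> str:
    """Deterministically map a real name to a realistic fake one. The same
    input always yields the same output, so relationships that join on the
    value survive masking, while the original can't be recovered without
    the secret key."""
    digest = hmac.new(SECRET_KEY, real_name.encode(), hashlib.sha256).digest()
    return FAKE_NAMES[int.from_bytes(digest[:4], "big") % len(FAKE_NAMES)]

# A customer masks identically wherever they appear, so joins still work.
assert mask_name("Ada Lovelace") == mask_name("Ada Lovelace")
```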
To recap the traditional workflow: because the production database contains sensitive data, QA has to create and maintain a staging database, and each developer or development team has to request a copy of that database to perform their testing. This whole process involves a lot of overhead in terms of engineering resources. The staging database cannot be the same size as production, because curating the whole production database demands too much effort and time. In addition, creating hard copies of terabytes of data for each development team takes a long time and wastes space, since developers typically care only about a specific portion of the data.
With the Delphix platform, every developer can get access to a masked version of the production database. Virtual database creation is instantaneous, and space consumption automatically scales with the demands of its users. Last but not least, the maintenance burden is reduced to administering the Delphix platform itself.
The use case described above is a textbook example of friction in test data management (TDM), where it’s easy to highlight the benefits of our platform. Over time, our customers have been using Delphix in more complicated database setups and workflows, such as CI/CD pipelines, but the story doesn’t end there. Virtualization and masking also work with unstructured data like documents, images, and videos, enabling a wide range of possibilities.