Software Architect
Sharing Data in a Secure Way
"The companies that do the best job on managing a user’s privacy will be the companies that ultimately are the most successful." – Fred Wilson
One of the most common processes in custom enterprise software development is obtaining a copy of production data to develop new features or troubleshoot an issue. This normally involves creating a database backup and copying the backup files over the internet to a local development server or the developer’s computer.
These processes carry several risks for the data. For example: How was the backup generated? Is the copy being made over a clear or an encrypted connection? Is the developer’s computer fully patched and secured with antivirus, anti-malware, and disk encryption?
Today, sophisticated attackers steal information all around the world, and the ways we share data have become a primary target. Whether that happens by eavesdropping on unprotected connections or through malware, the reality is that we need to protect our customer data at all costs. We must always be one step ahead of hackers, and because data still needs to be shared and moved around, it is imperative that we consistently provide the proper tools to protect information while also implementing a secure practice for sharing such data. We will discuss the secure practice in a different article; for now, we’ll focus on the tools to implement different levels of data protection.
The first and most obvious way to prevent anyone from stealing data while it is being shared is not to share it, but then we would not have data to work on! Once we accept that the data must be shared, what can we do to maximize its protection so it won’t be stolen?
Some solutions have surfaced to address this problem. One solution is to allow the developer to connect directly to the production database. This is by far the most dangerous solution. Anything can happen in this scenario and most likely the entire infrastructure and operations teams would refuse to allow it.
Another solution is encryption. In general, you should aim to have all (or at least the most sensitive) information encrypted at rest. This can be done in several ways, and certificate-based encryption has been one of the most popular choices in recent years. A backup of an encrypted database contains encrypted data too, so even if the backup is stolen, the data remains inaccessible without the keys.
A second level of encryption applies while the backup is being copied. The connection, whether over the Internet or through a regular network, must also be encrypted. This protects the data while it is in transit to its destination. The third and final level of encryption applies once the data reaches the developer’s computer (or local development server), where disks should be encrypted with a technology such as BitLocker.
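As a rough illustration of the in-transit level, the sketch below uses Python's standard `ssl` module to build a client-side TLS configuration that refuses unverified servers and old protocol versions. The function name `make_strict_tls_context` is made up for this example, and the choice of TLS 1.2 as the minimum is an assumption; how the context is then attached to the actual transfer tool depends on the environment.

```python
import ssl


def make_strict_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context for copying backups over a network.

    The name and the TLS 1.2 floor are illustrative assumptions,
    not a requirement taken from any specific tool.
    """
    # create_default_context() turns on certificate verification
    # and hostname checking for client connections.
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_strict_tls_context()
```

A context like this can be passed to standard-library clients that accept one (for example `http.client.HTTPSConnection(host, context=context)`), so a transfer fails rather than silently falling back to an unverified or clear-text connection.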
Now, let’s pause a moment and consider why we are doing this. We are talking about customer data that needs to be shared. If you take, for example, a database from an online shop, it could hold sensitive information that was used to make purchases. Most likely there will also be customers’ addresses, birth dates, and so forth. This information constitutes PII (personally identifiable information). PII is, in simple terms, information that can potentially identify a specific individual. It can be misused in many ways, for example, to access financial information and steal money, or to contact or locate a person. It would be very dangerous for this information to fall into the wrong hands.
As you can see, while encryption is a solid solution, it requires many elements at different levels and may not only be costly but also impractical or overkill to implement.
What should we do then? A solution that works very well is data masking. With data masking, we are providing developers with a copy of production data, but sensitive information has already been masked to protect PII. Once PII has been masked and cannot be traced back to people, it is safe to share with the development team.
Data masking is a tedious and difficult process, but there are tools that can help, such as Red Gate’s Data Masker. Masking must also follow certain data rules: for example, when a value is shared between tables, its masked replacement must propagate to every table that uses it. Information about a single person can be spread across multiple tables, and sometimes it needs to flow through multiple databases. In the end, all PII will be masked but will also remain consistent.
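One common way to keep masked values consistent across tables, independent of any particular product, is to derive the replacement deterministically from the real value with a keyed hash. The sketch below is only an illustration of that idea: `mask_value`, the `FAKE_SURNAMES` dictionary, the sample rows, and the placeholder key are all hypothetical, and it is not a description of how Data Masker works internally.

```python
import hashlib
import hmac
from typing import Sequence

# Hypothetical dictionary of fake values; real dictionaries are much larger.
FAKE_SURNAMES = ["Alder", "Birch", "Cedar", "Dogwood",
                 "Elm", "Fir", "Hazel", "Juniper"]


def mask_value(real_value: str, secret_key: bytes,
               dictionary: Sequence[str]) -> str:
    """Map a real value to a fake one; equal inputs give equal outputs."""
    digest = hmac.new(secret_key, real_value.encode("utf-8"),
                      hashlib.sha256).digest()
    # Use the first 8 bytes of the keyed hash to pick a dictionary entry.
    index = int.from_bytes(digest[:8], "big") % len(dictionary)
    return dictionary[index]


# Placeholder key (an assumption); a real key would be stored securely.
key = b"example-key-kept-outside-source-control"

# Two hypothetical tables that both carry the same person's surname.
customers = [{"id": 1, "surname": "Garcia"}, {"id": 2, "surname": "Nguyen"}]
orders = [{"order_id": 10, "customer_surname": "Garcia"}]

masked_customers = [
    {**row, "surname": mask_value(row["surname"], key, FAKE_SURNAMES)}
    for row in customers
]
masked_orders = [
    {**row, "customer_surname": mask_value(row["customer_surname"], key,
                                           FAKE_SURNAMES)}
    for row in orders
]
```

Because the replacement depends only on the real value and the key, "Garcia" receives the same fake surname in both tables without any lookup table being shared. With a dictionary this small, two different real surnames may map to the same fake one, which is usually acceptable for test data but worth knowing.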
In the following image, you can see how Red Gate Data Masker works. It works based on a set of rules that are applied to tables. In the image, a substitution rule works on the customer table by changing the values in some of the columns. The tool includes a set of dictionaries with readily available fake information that is used to anonymize real data.
After applying the rules to the database tables, the new, fake data is ready to be used. It cannot be traced back to real people but can certainly be used to troubleshoot problems or develop new features.
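To make the idea of a substitution rule concrete, here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database. It is not Data Masker's implementation or API; the `customer` table layout, the `FAKE_NAMES` and `FAKE_STREETS` dictionaries, and the `apply_substitution_rule` helper are all assumptions made for illustration.

```python
import random
import sqlite3

# Hypothetical dictionaries of fake values.
FAKE_NAMES = ["Alex Rivers", "Sam Field", "Jo Marsh", "Kit Stone"]
FAKE_STREETS = ["1 Example Road", "2 Sample Street", "3 Placeholder Lane"]


def apply_substitution_rule(conn, table, column, dictionary, rng):
    """Overwrite every value in one column with a dictionary entry.

    Table and column names are interpolated into the SQL text, so they
    must come from trusted rule definitions, never from user input.
    """
    row_ids = [row[0] for row in conn.execute(f"SELECT rowid FROM {table}")]
    for row_id in row_ids:
        conn.execute(
            f"UPDATE {table} SET {column} = ? WHERE rowid = ?",
            (rng.choice(dictionary), row_id),
        )
    conn.commit()


# Stand-in for a provisioned copy of production data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer "
    "(id INTEGER PRIMARY KEY, name TEXT, street TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO customer (id, name, street, total) VALUES (?, ?, ?, ?)",
    [(1, "Maria Garcia", "12 Real Street", 40.5),
     (2, "Linh Nguyen", "7 Actual Avenue", 19.99)],
)

rng = random.Random(42)  # seeded so repeated runs behave the same way
apply_substitution_rule(conn, "customer", "name", FAKE_NAMES, rng)
apply_substitution_rule(conn, "customer", "street", FAKE_STREETS, rng)

masked_rows = conn.execute(
    "SELECT id, name, street, total FROM customer ORDER BY id"
).fetchall()
```

Only the columns named in a rule are changed; keys and non-sensitive columns such as `id` and `total` stay intact, which is what keeps the masked copy useful for troubleshooting. Where the same person must look identical in several tables, a keyed, deterministic mapping could take the place of `rng.choice`.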
Data masking is a great solution to protect data considered PII. With the right tools, it can be implemented to meet compliance requirements and can run repeatedly, consistently, and fast. Data masking should be planned into every data-sharing process, from provisioning a copy of a database, to masking the data, to sharing the masked database, all to protect our customers and their data.