With data comes responsibility. As data grows exponentially in volume and value, organizations that collect and manage large amounts of data seek to build their data ecosystems and grow their data collaboration partnerships. At the same time, data privacy regulations and the ever-growing risk of cyberattacks mean that data owners must maintain robust control over their datasets. Until recently, such control required time-consuming manual processes on top of customized, code-intensive architectures. With Habu, data owners have a better way.
Habu delivers security and privacy by design
Organizations managing data ecosystems today can leverage a modern Habu Data Clean Room to accelerate data collaboration with advanced controls for security, privacy, governance, and consent management. For data owners, control over data begins by ensuring that the three types of data involved in data collaboration are handled securely:
- Source data: Habu minimizes source data movement and allows access only for approved processes.
- Processing data is a temporary subset of source data that must be accessed, joined, and processed to facilitate data collaboration. Habu never stores or persists this data.
- Results data is the approved, privacy-preserved output of a data collaboration that meets the requirements of data owners. Habu securely delivers results data to an agreed cloud location.
With these types of data in mind, Habu delivers security and privacy for data owners across five key dimensions of the data collaboration process:
1. Advanced data governance and data access
For most data owners, restricting the movement of critical datasets is paramount. Habu facilitates data collaboration by connecting to and accessing data directly at its source, which can be any cloud-based location (e.g. AWS, GCP, Databricks, Snowflake, Azure). Data owners provide identity and access management (IAM) credentials to Habu that enable read-only access to source data, which is not copied or stored. Habu stores metadata on the structure and schema of source data so that data owners can further configure controls and provision access as required.
Data owners have two additional methods of controlling access to their data. The first comes at the time of connection. During the data mapping process, data owners can exclude specific columns of data; once excluded, these columns can never be accessed by Habu for data collaboration.
Next, data owners can filter their datasets during configuration of the clean room. For each collaboration, data owners can implement additional column- and row-level filtering logic that Habu enforces prior to accessing and processing source data at runtime.
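The two filtering controls above can be pictured with a minimal sketch (illustrative Python, not Habu's implementation; the column names, records, and predicate are invented for the example):

```python
# Illustrative sketch: column- and row-level filtering applied to
# source data before it ever reaches a collaboration.

def apply_filters(records, allowed_columns, row_predicate):
    """Keep only approved columns, and only rows that pass the predicate."""
    filtered = []
    for row in records:
        if row_predicate(row):
            filtered.append({col: row[col] for col in allowed_columns})
    return filtered

# Example: expose only region and spend, and only EU rows.
source = [
    {"email": "a@example.com", "region": "EU", "spend": 120},
    {"email": "b@example.com", "region": "US", "spend": 80},
]
safe_view = apply_filters(source, ["region", "spend"], lambda r: r["region"] == "EU")
print(safe_view)  # → [{'region': 'EU', 'spend': 120}]
```

The excluded `email` column and the filtered-out rows never appear in the view handed to the collaboration.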
Once a dataset is filtered and connected, Habu enables data owners to control usage of their data through a detailed approval process. In the Habu Data Clean Room, participants select analyses in the form of template-based questions (e.g., “What is the propensity score for CRM users by category?”), each of which contains metadata about the data or inputs required to run it. Data owners must specifically approve each analysis by assigning their pre-configured datasets to a question in order for the analysis to run.
Habu’s approval process minimizes the amount of data necessary to complete each analysis, and that data is only accessed from its source at runtime, subject to the data owner’s pre-defined filtering. Habu accesses only the subset of data required to perform the pre-approved analysis, and does not store or persist data during analysis.
2. Secure multi-party collaboration
Habu’s goal is to enable the broadest spectrum of collaboration use cases. To achieve that, Habu offers data owners two ways to run secure multi-party data analyses; the choice between them depends on customer requirements.
For joins that involve our key partner integrations — including Databricks, Snowflake, GCP, and more — Habu orchestrates distributed joins and multi-party queries on data stored in two or more cloud accounts from that partner.
For joins that run on different clouds, Habu auto-provisions temporary database infrastructure and reference data connections in a designated cloud location.
When data requires processing beyond distributed joins or SQL queries (as with machine learning (ML) tasks), Habu establishes a trusted, temporary execution environment called Clean Compute. This execution environment is auto-provisioned at runtime and decommissioned after processing. Habu can spin up processing infrastructure in any public cloud, including AWS, GCP, and Azure, depending on the requirements of data owners and/or their partners.
During processing, Habu reads source data into the execution environment based on filtering rules defined by each owner. Ingress is limited to data sources mapped by the clean room participants, and egress is disabled except to the approved cloud location where results will be written. Once the data has been processed in accordance with the data governance rules defined at the outset of the collaboration, Habu produces a privacy-preserving results set, deletes all data used during processing, and securely spins down the runtime infrastructure.
3. Layered privacy-preserving techniques
Habu extends data owners’ control over sensitive datasets with a variety of configurable privacy-preserving techniques, including tokenization, K-min enforcement, and noise injection. Optional tokenization enables data owners to substitute tokens for identifying fields so that raw identifiers are never present in the clean room. Within a Habu Data Clean Room, data owners can also specify K-min thresholds for user metrics and add randomized noise to guard against inadvertent leakage of sensitive information.
K-minimization removes all records whose values are too unique to meet the configured crowd size. For example, if the K-min threshold is 25 and a dataset contains a field recording the total basket value for a shopping trip, any record whose total basket value does not appear at least 25 times in the dataset will be dropped. In cases where K-min is insufficient or not applicable, Habu also supports injecting random noise, drawn from the Laplace distribution, into the output results of questions. Depending on the level of noise, or data decibel, selected, Habu slightly and randomly alters results so that the privacy of the individuals in the applicable datasets is preserved.
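As a rough illustration of both techniques (a plain-Python sketch, not Habu's implementation; the crowd size, basket values, and noise scale are invented), K-min filtering drops under-represented values, and Laplace noise is drawn via inverse-CDF sampling:

```python
import math
import random
from collections import Counter

def k_min_filter(values, k):
    """Drop records whose value occurs fewer than k times (the crowd size)."""
    counts = Counter(values)
    return [v for v in values if counts[v] >= k]

def laplace_noise(value, scale):
    """Perturb a numeric result with Laplace noise (inverse-CDF sampling)."""
    u = random.random() - 0.5            # u in [-0.5, 0.5)
    while abs(u) >= 0.5:                 # guard against log(0) at the edge
        u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return value - scale * sign * math.log(1.0 - 2.0 * abs(u))

# With a crowd size of 3, the basket value seen only once is dropped.
baskets = [25, 25, 25, 7]
print(k_min_filter(baskets, 3))          # → [25, 25, 25]

# A larger scale means more noise and stronger privacy.
print(laplace_noise(100.0, 1.0))         # e.g. ~99.6, varies per run
```

The crowd size of 3 here is small purely for readability; the article's example uses a threshold of 25.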
4. Adaptive governance framework
Habu Data Clean Room solutions feature a flexible governance model that gives data owners granular control over who can access their data, and in which context. Governance options include:
- At the organization level: Specify global defaults for all data processing activities in a Habu Data Clean Room.
- At the dataset level: Dictate which field schemas are accessible to which clean rooms and partners.
- At the user level: Develop queries to filter datasets for user-level attributes, including consent and purpose settings.
All Habu Data Clean Room queries, either for analytics or data activation, respect the most granular level of consent provided by the data owner.
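One way to picture this "most granular wins" rule is a simple resolution function (a hypothetical sketch; the level names and deny-by-default behavior are assumptions for illustration, not Habu's actual API):

```python
# Hypothetical sketch of layered consent resolution: the most specific
# setting that has been configured wins; names here are illustrative.

def resolve_consent(org_default, dataset_setting=None, user_setting=None):
    """Return the most granular consent decision that has been configured."""
    for setting in (user_setting, dataset_setting, org_default):
        if setting is not None:
            return setting
    return False  # deny by default when nothing is configured

# A user-level opt-out overrides permissive dataset and org settings.
print(resolve_consent(org_default=True, dataset_setting=True, user_setting=False))  # → False
```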
5. Full auditing capabilities
With Habu, data owners have complete traceability and auditability of every user action and every analytical process that executes within the Data Clean Room. If the data collaboration involves two or more instances, all actions and executions are recorded within each data owner’s instance logs. If the data collaboration involves datasets hosted on other cloud resources, audit logs are preserved and written as outputs along with the results set.
Control your data — at every stage of collaboration
With Habu, data owners can be confident in the end-to-end security and privacy of their datasets during data collaboration. Talk to one of our experts today about how Habu Data Clean Room can help meet your data collaboration goals.