Preparing Your Data for Matching
Please Note
Before you load PI, you must have signed the Privacy-Preserving Matching license agreement. You may test with synthetic data in the meantime.
Preparing PI for upload
Datasets loaded onto the Data Republic Platform must not contain Personally Identifiable Information (PII). To work with data in Data Republic, you will need to de-identify the data you intend to upload. Your data preparation process would look like this:
Extract the data from the original source (e.g. your CRM or EDW).
Split the data into two tables:
PII data (contains personally identifiable information, such as email addresses or phone numbers); and
Attribute data (contains non-identifying data, such as gender or age-range).
Format your PII data for alignment with the other matching party.
Tokenize the PII data with your Privacy-Preserving Matching Contributor Node. This will replace all PII with a randomly generated token.
Download your tokens, attach them to your attribute data (if required for your project) and remove the personid column.
To join the randomly generated tokens with the right attribute records, you will use a PersonID. This is usually your internal customer ID. It is not shared and will not leave your organisation. However, you can use it to map tokens back to the right records.
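Because PII fields are hashed before matching, values that differ only in case, spacing, or punctuation would likely not line up between parties, which is why the formatting step matters. The following is a minimal Python sketch of that step. The rules it applies (trim and lowercase emails, keep only digits in phone numbers) and the function names are assumptions for illustration; use whatever format you and the other matching party have actually agreed on.

```python
import re


def normalize_email(value):
    """Assumed rule: trim surrounding whitespace and lowercase the address."""
    return value.strip().lower()


def normalize_phone(value):
    """Assumed rule: drop every character that is not a digit."""
    return re.sub(r"\D", "", value)


print(normalize_email("  Alison@Example.com "))
print(normalize_phone("(555) 623-2565"))
```

Applying the same functions on both sides before tokenization keeps the formatting decision in one place.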
Tokenization of PII data
To tokenize your PII data and receive tokens for each customer record, you can upload a CSV to your Contributor Node. The CSV file should contain:
A ‘personid’ column, which is your internal key to reference this record. It will be used to track which tokens belong to which records. It is not used in matching, and will not leave your Contributor Node.
One or more PII fields which will be salted, hashed, sliced, and then distributed.
Once tokens are generated, they can be appended to your table containing the non-identifiable transactional or demographic information for each customer. Your tokenized dataset therefore does not contain any PII data and can be uploaded to Data Republic and matched with other tokenized datasets in governed spaces on the platform.
Example:
1. We start with a single table that has a mix of PII and attribute data extracted from the original source. Note the PersonID which is an internal customer reference.
PersonID | Email | Phone | Last name | First name | Gender | Age range |
11 | alison@example.com | (555) 623-2565 | Sutton | Alison | F | 20-29 |
23 | james@example.com | (555) 710-1092 | James | Logan | M | 30-39 |
43 | john@example.com | (555) 877-9905 | Gilbert | John | M | 49-50 |
2. The first step is to split this data into two tables. The first table will have just the PII:
PersonID | Email | Phone | Last name | First name |
11 | alison@example.com | (555) 623-2565 | Sutton | Alison |
23 | james@example.com | (555) 710-1092 | James | Logan |
43 | john@example.com | (555) 877-9905 | Gilbert | John |
The second table will contain only the personid and the chosen attributes. The personid field links the two tables. This second table is only needed if attributes will be added to your data:
PersonID | Gender | Age range |
11 | F | 20-29 |
23 | M | 30-39 |
43 | M | 49-50 |
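The split described in this step can be sketched in Python with the standard `csv` module. The column names used here (`personid`, `email`, `phone`, `last_name`, `first_name`, `gender`, `age_range`) are hypothetical; substitute the field names from your own extract and from the Contributor Node data template.

```python
import csv
import io

# Hypothetical column names, not taken from the product documentation.
PII_COLUMNS = ["personid", "email", "phone", "last_name", "first_name"]
ATTRIBUTE_COLUMNS = ["personid", "gender", "age_range"]


def split_records(rows):
    """Return (pii_rows, attribute_rows); both keep personid as the link."""
    pii_rows = [{c: row[c] for c in PII_COLUMNS} for row in rows]
    attribute_rows = [{c: row[c] for c in ATTRIBUTE_COLUMNS} for row in rows]
    return pii_rows, attribute_rows


def to_csv(rows, columns):
    """Serialise rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


source = [
    {"personid": "11", "email": "alison@example.com", "phone": "(555) 623-2565",
     "last_name": "Sutton", "first_name": "Alison",
     "gender": "F", "age_range": "20-29"},
]
pii, attributes = split_records(source)
print(to_csv(pii, PII_COLUMNS))
print(to_csv(attributes, ATTRIBUTE_COLUMNS))
```

Only the PII output is intended for the Contributor Node; the attribute output stays with you until tokens are attached.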
3. Once the data is split in this way, the next step is to upload the PII table to the Contributor Node. This will:
Generate a random token identifier for each record in the PII table.
Record the token value for each PersonID.
Hash, slice and distribute the tokens and hash slices to the Matcher Network.
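To make the terms in this step concrete, here is a purely illustrative Python sketch of generating a random token and salting, hashing, and slicing one PII value. It is not the Contributor Node's actual algorithm: the hash function (SHA-256), the salt handling, the number of slices, and the token format are all assumptions chosen for the example.

```python
import hashlib
import secrets


def new_token():
    """Random identifier that carries no information about the record."""
    return "0x" + secrets.token_hex(8)


def salted_hash_slices(value, salt, slice_count=4):
    """Hash salt + value with SHA-256 and cut the hex digest into equal slices.

    Illustration only: the real product's scheme is not described here.
    """
    digest = hashlib.sha256(salt + value.encode("utf-8")).hexdigest()
    if len(digest) % slice_count != 0:
        raise ValueError("slice_count must divide the digest length (64)")
    size = len(digest) // slice_count
    return [digest[i * size:(i + 1) * size] for i in range(slice_count)]


token = new_token()
slices = salted_hash_slices("alison@example.com", b"demo-salt")
print(token, slices)
```

The point of the sketch is that the token is random rather than derived from the PII, and that no single slice contains the whole hash.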
4. You can then download the mapping file that records the tokens for each PersonID. If your project requires attribute data with your tokens, you will need to replace the PersonID in your attribute table with these token values, using any software you like. The final table that you upload to Data Republic would look like this:
Token | Gender | Age range |
0x34575 | F | 20-29 |
0x94251 | M | 30-39 |
0x45732 | M | 49-50 |
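Replacing the PersonID with its token is a simple key lookup. The sketch below assumes the mapping file has been read into rows with `personid` and `token` columns; those column names, like the attribute column names, are hypothetical and should be checked against the file you actually download.

```python
def attach_tokens(mapping_rows, attribute_rows):
    """Swap personid for its token in every attribute row."""
    token_by_person = {row["personid"]: row["token"] for row in mapping_rows}
    output = []
    for row in attribute_rows:
        person_id = row["personid"]
        if person_id not in token_by_person:
            # Fail loudly rather than upload a row without a token.
            raise KeyError(f"no token found for personid {person_id}")
        tokenized = {"token": token_by_person[person_id]}
        # Copy every attribute except personid, which must not be uploaded.
        tokenized.update({k: v for k, v in row.items() if k != "personid"})
        output.append(tokenized)
    return output


mapping = [
    {"personid": "11", "token": "0x34575"},
    {"personid": "23", "token": "0x94251"},
]
attributes = [
    {"personid": "11", "gender": "F", "age_range": "20-29"},
    {"personid": "23", "gender": "M", "age_range": "30-39"},
]
tokenized = attach_tokens(mapping, attributes)
print(tokenized)
```

Raising an error on a missing token is a design choice for the example; you may prefer to log and skip such rows instead.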
If your project only requires tokens, you will need to remove the personid column and upload a file containing only the tokens from the Contributor Node (CN) to Data Republic. The file might look like this:
Token |
0x34575 |
0x94251 |
0x45732 |
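For a tokens-only project, the personid column can be dropped when writing the upload file. A minimal Python sketch, again assuming hypothetical `personid` and `token` column names in the downloaded mapping file:

```python
import csv
import io


def tokens_only_csv(mapping_rows):
    """Return CSV text holding just the token column, with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["token"])
    for row in mapping_rows:
        writer.writerow([row["token"]])  # personid is deliberately left out
    return buffer.getvalue()


mapping_file_rows = [
    {"personid": "11", "token": "0x34575"},
    {"personid": "23", "token": "0x94251"},
]
print(tokens_only_csv(mapping_file_rows))
```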
Uploading PI for tokenization
There are two methods of uploading PI for tokenization:
Via the browser UI (see below); or
1 | Initial Privacy-Preserving Matching login: Type or copy the address of your Contributor Node into the browser (https://[host name of your Contributor Node]/).
2 | Dashboard: The first page you see will be the Dashboard. It shows:
3 | Database Management Screen
4 | Upload PII data. Note: It may be helpful to click the Download data template button and save the blank CSV to use as a template when testing. This will ensure the correct field names (which are case sensitive) are used.
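Since field names are case sensitive, it can help to check a file's header before uploading it. The following Python sketch compares a CSV header against a list of allowed names. The allowed names shown are placeholders; take the real ones from the blank template downloaded from your Contributor Node.

```python
import csv
import io


def header_problems(csv_text, allowed_fields):
    """Return a list of human-readable problems found in the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    by_lowercase = {name.lower(): name for name in allowed_fields}
    problems = []
    for name in header:
        if name in allowed_fields:
            continue
        if name.lower() in by_lowercase:
            # Right field, wrong capitalisation.
            problems.append(f"'{name}' should be written '{by_lowercase[name.lower()]}'")
        else:
            problems.append(f"unexpected field '{name}'")
    if "personid" not in {name.lower() for name in header}:
        problems.append("missing required field 'personid'")
    return problems


# Placeholder field names; replace with the names from your data template.
ALLOWED = ["personid", "email", "phone"]
print(header_problems("PersonID,email\n11,alison@example.com\n", ALLOWED))
```

An empty list means the header uses only allowed names with the expected capitalisation.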
Downloading Tokens and preparing Attribute Data
1 | Download tokens & join to attributes
The downloaded mapping file is then used to attach only the token to the customer attribute data, which is then loaded into Data Republic.
[Figure: Append tokens to your attribute data table - example]
2 | Upload tokenized attribute data (Data Republic)
Log in to Data Republic for your region.
Upload the file with tokens & attributes into Data Republic:
Create Database and Table
Load data into the Table
Create a View. This is optional, if you want to share only a subset of the table. For example, you can share a view rather than the whole table in a project (Create a view).
3 | Create a data package
Create a Data Package (with tokens). Select the table you created earlier.
4 | Link Token Database
Click 'Link token database' and specify which token database generated the tokens in this package. The package is now created in status 'Draft', and an ‘M’ icon appears to represent linked tokens.
5 | Update package
6 | Unlink token database: You can always unlink the Token Database by: