Preparing Your Data for Matching

Please Note

Before you load PII, you need to have signed the Privacy-Preserving Matching license agreement. You may test with synthetic data in the meantime.

Preparing PII for upload

Datasets loaded onto the Data Republic Platform must not contain Personally Identifiable Information (PII). To work with data in Data Republic, you will need to de-identify the data you intend to upload. Your data preparation process will look like this:

  1. Extract the data from the original source (e.g. your CRM or EDW).

  2. Split the data into two tables:

    1. PII data (contains personally identifiable information, such as email addresses or phone numbers); and

    2. Attribute data (contains non-identifying data, such as gender or age-range).

  3. Format your PII data for alignment with the other matching party.

  4. Tokenize the PII data with your Privacy-Preserving Matching Contributor Node. This will replace all PII with a randomly generated token.

  5. Download your tokens, attach them to your attribute data (if required for your project), and remove the personid column.

To join the randomly generated tokens with the right attribute records you will use a PersonID. This is usually your internal customer ID. It is not shared and will not leave your organisation; however, you can use it to map tokens back to the right records.
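
For illustration, here is a minimal sketch of steps 1-3 in Python using pandas. The file names, column names, and formatting rules are assumptions based on the example below; the exact alignment rules (how emails, phone numbers, and names should be formatted) are whatever you agree with the other matching party.

```python
import pandas as pd

# Load the raw extract from your CRM or EDW (file and column names are assumed).
source = pd.read_csv("crm_extract.csv")

# Example alignment rules - agree the exact formatting with the other matching party.
source["email"] = source["email"].str.strip().str.lower()
source["phone"] = source["phone"].str.replace(r"\D", "", regex=True)  # keep digits only

# Split into a PII table (to be tokenized) and an attribute table (kept internally).
pii_cols = ["personid", "email", "phone", "family_name", "given_name"]
attr_cols = ["personid", "gender", "age_range"]

source[pii_cols].to_csv("pii_upload.csv", index=False)   # upload to your Contributor Node
source[attr_cols].to_csv("attributes.csv", index=False)  # keep for the token join later
```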

Tokenization of PII data

To tokenize your PII data and receive tokens for each customer record, you can upload a CSV to your Contributor Node. The CSV file should contain:

  • A ‘personid’ column, which is your internal key to reference this record. It will be used to track which tokens belong to which records. It is not used in matching, and will not leave your Contributor Node.

  • One or more PII fields which will be salted, hashed, sliced, and then distributed.

Once tokens are generated, they can be appended to your table containing the non-identifiable transactional or demographic information for each customer. Your tokenized dataset therefore does not contain any PII data and can be uploaded to Data Republic and matched with other tokenized datasets in governed spaces on the platform.

Example:

1. We start with a single table that has a mix of PII and attribute data extracted from the original source. Note the PersonID, which is an internal customer reference.

personid | email              | phone          | family_name | given_name | gender | age_range
11       | alison@example.com | (555) 623-2565 | Sutton      | Alison     | F      | 20-29
23       | james@example.com  | (555) 710-1092 | James       | Logan      | M      | 30-39
43       | john@example.com   | (555) 877-9905 | Gilbert     | John       | M      | 49-50

2. The first step is to split this data into two tables. The first table will have just the PII:

personid | email              | phone          | family_name | given_name
11       | alison@example.com | (555) 623-2565 | Sutton      | Alison
23       | james@example.com  | (555) 710-1092 | James       | Logan
43       | john@example.com   | (555) 877-9905 | Gilbert     | John

The second table will contain only attributes. The personid field is what links the two tables, so the tokens can later be attached to the chosen attributes (this is only needed if attributes will be added to your data):

personid | gender | age_range
11       | F      | 20-29
23       | M      | 30-39
43       | M      | 49-50

3. Once we have our data split in this way the next step is to upload the PII table to the Contributor Node. This will:

  • Generate a random token identifier for each record in the PII table.

  • Record the token value for each PersonID.

  • Hash, slice and distribute the tokens and hash slices to the Matcher Network.

4. You can then download the mapping file that records the token for each PersonID. If your project requires attribute data with your tokens, you will need to replace the PersonID in your attribute table with these token values, using any software you like. The final table that you will upload to Data Republic would look like this:

token   | gender | age_range
0x34575 | F      | 20-29
0x94251 | M      | 30-39
0x45732 | M      | 49-50

If your project only requires tokens, remove the personid column and upload a file containing only the tokens from the Contributor Node to Data Republic. The file might look like this:

token
0x34575
0x94251
0x45732
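
As a minimal sketch of step 4, the join might look like this in Python with pandas. The file names and the token column name ("token") are assumptions; check the header of the mapping file you download from your Contributor Node.

```python
import pandas as pd

# Mapping file downloaded from the Contributor Node: personid -> token.
mapping = pd.read_csv("token_mapping.csv")

# If your project needs attributes, join them on personid and drop the internal ID.
attributes = pd.read_csv("attributes.csv")
tokenized = attributes.merge(mapping, on="personid", how="inner").drop(columns=["personid"])
tokenized.to_csv("upload_with_attributes.csv", index=False)

# If your project only needs tokens, upload the token column on its own.
mapping[["token"]].to_csv("upload_tokens_only.csv", index=False)
```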

Uploading PII for tokenization

There are two methods of uploading PII for tokenization:

  1. Via the browser UI (see below); or

  2. Using the REST API.

Step 1: Initial Privacy-Preserving Matching login

Type or paste the address of your Contributor Node into the browser (https://[host name of your Contributor Node]/)

  1. Your login credentials are created by your organisation when your Contributor Node is configured.

    1. If using local authentication, the username is "api" and the password is set in contributor.sh (See Contributor Script - HTTP_BASICAUTHPASSWORD setting)

    2. Alternatively, if your IT department has configured SSO support, you will be directed to your organisation's single sign on system.

  2. If you have any issues logging in, please reach out to support@datarepublic.com



Step 2: Dashboard

The first page you see will be the Dashboard. It shows:

  1. A list of databases. By default, there will be two (one for production and one for testing).

  2. The Privacy-Preserving Matching system status, including:

    1. The status of your Contributor Node (essentially an internal health check),

    2. Connectivity between your Contributor Node and the Privacy-Preserving Matching network, and

    3. The overall health of the Privacy-Preserving Matching network.

Step 3: Database Management Screen

  1. Select the token database you want to load your file into (e.g. the Production database)

  2. On the database page you will see:

    1. A database summary panel, showing:

      • The number of tokens (should reflect the number of customers in the database)

      • A Download tokens button, to download the PersonID-to-token mapping file (CSV format)

      • The last updated date

      • A Download data template button, which provides a blank CSV showing the format to use for uploading

    2. The middle panel, for uploading a CSV to create or update customer records

    3. Finally, the right side panel is a history of recent uploads.

Step 4: Upload PII data

Note: It may be helpful to click the Download data template button and save the blank CSV to use as a template when testing. This will ensure the correct field names (which are case sensitive) are used.

  1. Upload your prepared CSV file of PII records:

    1. Drag and drop the file into the middle panel, or use the browse button to select the file from your machine.

    2. Leave the CSV format details at their defaults (currently, only the defaults are supported).

    3. Click Upload data file

  2. Your browser will parse the CSV, then salt and hash the data before sending it to the Contributor Node "back end".

  3. A green progress bar will show the file being uploaded and processed, and the token count will start increasing.

  4. The status panel will show a new entry with today's date. Wait a few seconds and it will update to show how many records were processed.

  5. Once the tokenization process is complete, the Database summary panel will update to show today’s date and the total number of tokens in the database.

Notes:

  • There is currently an upper limit of 1 million records per CSV file when using the browser interface. Larger uploads are possible depending on the amount of RAM available to your browser.

  • Your Contributor Node also supports an API for automating data updates.

Downloading Tokens and preparing Attribute Data

Step 1: Download tokens & join to attributes

  1. Click the 'Download Tokens' button in the database summary panel (on the left-hand side)

  2. Save the resulting file when prompted

  3. The file contains two columns:

    • Your original customer ID (called "personid")

    • New random token for that customer

This mapping file is then used to attach just the token to your customer attribute data, which is then loaded into Data Republic.

  • You can do this by joining on the "personid" field, which will match the "personid" values you provided prior to tokenization, using whichever method you prefer - many custodians use SQL, Excel, or a scripting language to do this.

  • Your resulting file should drop the "personid" column, as this should not be uploaded to Data Republic.

Step 2: Upload tokenized attribute data (Data Republic)

Upload file with tokens & attributes into Data Republic

  1. Log in to Data Republic for your region and go to the Manage Data screen (select the Files tab)

  2. Use either SFTP (for large files) or HTTP (for smaller files, up to 100MB) to upload the file into Data Republic

  3. The file should contain a token column (with the token values from your node) and one or more attribute columns

Create Database and Table

  1. Data Republic → Manage Data → Databases → create a new Database

  2. Create a new Table  + add columns structured to match the data that you're about to load into it (attributes & tokens)

Load data into the Table

  1. When viewing the newly created Table → click 'Load Data'

  2. Select the file you wish to load + how many header rows it contains (the rest can be left as default)

  3. Click 'Done, Load Data'

  4. Monitor the status of your data load under Manage Data → Load Jobs

    1. If everything went well, the Row count (when viewing a table) will increase to reflect the number of loaded rows

Create a View

This is optional: use it if you only want to share a subset of the table. For example, you can share a view rather than the whole table in a project (see Create a view).

Step 3: Create a data package

  1. Data Republic → Manage Data → Packages → click 'Create a new data package'

  2. Fill in all the fields + click 'Create new data package'

  3. Select the table you loaded data into earlier

Step 4: Link Token Database

  1. Click Link token database to tell Data Republic which token database generated the tokens in this package

    1. Select the table or view in your package that has the tokens

    2. Select the name of the column that contains the tokens

    3. Select which token database issued the tokens – if you’re unsure, check with the user in your organization who is responsible for preparing your dataset for matching

  2. Click 'Save'

The package is now created in status ‘Draft’; an ‘M’ icon appears to represent the linked tokens.

Step 5: Update package

  1. The button at the bottom of the Packages screen will change from Link token database to Edit token link

  2. Click 'Submit' (currently, packages get approved automatically)

  3. On the main Packages screen, an ‘M’ icon will appear next to packages that have been configured for matching (i.e. that have a linked token database)

Step 6: Unlink token database

You can always Unlink the Token Database by:

  1. Editing the token link

  2. Clicking the Unlink button