Contributor Node Data Schema

General Notes - Applies to all Fields

Field type or scenario

Considerations for formatting

Field type or scenario

Considerations for formatting

Missing / No data 

When a field does not have any data, it should be left empty.  Do not use "space", hyphen, NULL character, or other placeholder values.

Numeric fields

Take particular care with numeric fields that they do not default to "0" when an empty value should be used (e.g. due to missing data)

Large integers

Large integers should not be written in scientific notation, as this often loses precision. This can be particularly problematic for phone numbers – take care to export these as string.

Custom fields

Custom fields may be added – contact DR.

Tokens (as strings)

They are random integers written in hexadecimal notation. You should process them as "opaque strings" to avoid any conversion to scientific notation, stripping of leading zeros, or loss of precision caused by conversion to floating point types.

When updating a CSV

Privacy-Preserving Matching takes the whole record to be an "upsert" operation.

  • If the personid (natural key) is new, a new token is generated and the fields inserted.

  • If the personid exists in the database, the previously issued token is used and the PII fields are updated.

  • If a PII field is blank, it is taken to mean that you want any previous values deleted.

  • If a PII field is non-blank, it is taken to mean that any previous values are to be overwritten.

  • In other words, the "unit of operation" is the record. To update "just" the email, you must send values for all of the fields, including the new email.

Data upload format

  • The data format for uploading PII to a Contributor Node is a simple, single table format.

  • Upload is typically via CSV however a JSON format is also available via the API.

  • When uploading using CSV files, please use consistent line endings – i.e. all lines should end in DOS/Windows line endings (CR+LF) or Unix line endings (CR). Mixing line ending styles in the same file yields an error.



Working with Multi-Value Fields

The following fields can accept multiple values for a field (a list of values):

Field name

Max values

Field name

Max values

email

2

phone

3

dpid

2

nationalid

1

frequent_flyer_number

1

custom_name

2

When using multi-valued fields in a CSV file, please follow these rules / examples:

  • In your CSV file, use the same header name for the field but append a digit to separate into multiple columns, one for each value you want to provide.

  • For example, to provide up to 2 values for "email", use the field names "email:0" and "email:1".

  • Example below:

    • John has 2 email addresses, when Jane has only 1

    • John has 1 phone number, when Jane has 2

personid

email:0

email:1

phone:0

phone:1

dpid

nationalid

frequent_flyer_number

custom_name

birthdate

family_name

given_name

postcode

personid

email:0

email:1

phone:0

phone:1

dpid

nationalid

frequent_flyer_number

custom_name

birthdate

family_name

given_name

postcode

123

john@example.com

johno@example.com

0400000000



0101

888

1901234567



1978-04-25

Doe

John

2000

456

jane@example.com



0400000001

0400000002

0202

999



Jane Doe

1993-12-01

Doe

Jane

2000



Default Schema

  • NOTE: CN template headers are case sensitive - should all be in lower case

Field Name

Type

Description

Formatting & Normalization Rules

Field Name

Type

Description

Formatting & Normalization Rules

1

personid

String (varchar 100)

  • A unique string that identifies the customer record.

  • Should uniquely identify the person within the token database (this ID usually comes from your CRM or data lake).

  • Important: Do not use personid to store PII (e.g. an email address). Appropriate values are internal customer numbers, database primary keys, or some other non-identifying value.

  • Your Contributor Node will maintain a mapping table between this personid and the randomly generated token.

  • Customers typically download this mapping table to store the token values in their CRM or EDW, which can then be used to upload anonymised attribute data to Senate.

  • Only one value per row allowed

  • Mandatory, case sensitive, and unique within token database

  • Allowed characters are alpha-numeric plus underscore (_), semi-column (:) and dot (.)

  • Max 100 characters

  • No normalization or hashing is applied – this field does not leave your Contributor Node

2

email

String (varchar)

  • Person's email address

  • Multiple values are allowed (2)

  • Provide email address only (e.g. test@example.com)

  • Contributor Node supports UTF-8 in email addresses

  •  Normalisation (pre-hash): email

    • Convert to all lower case

    • All spaces removed

3

phone

Numeric

  • Person's phone number. Use local format (area code + phone number)

  • Multiple values are allowed (2)

  • Provide phone numbers with area codes (where applicable) but strip international dialling country codes.

  •  Normalisation (pre-hash): phone

    • Letters are converted to the numbers (see E.161 standard)

    • Non-digits are stripped, including all spaces

4

dpid

Numeric

  • Delivery Point Identifier. This is an 8-digit number which is allocated to each address maintained in Australia Post’s National Address File.

  • Contact DR for list of suppliers who can convert address data to DPID.

  • Multiple values are allowed (2)

  • Provide complete 8-digit code.

  • Do NOT strip leading zeros

  • Leave blank if no DPID is available

  • Normalisation (pre-hash): numeric

    • Non-digits are stripped, including any spaces, letters, etc. Only 0-9 retained.

5

nationalid

String (varchar)

  • National Identifier (e.g. a Social Security Number in USA; or NRIC in Singapore)

  • This is a unique identifier of each citizen in a country

  • Only one value allowed

  • Important: There may be regulatory restrictions on using a "national ID" in matching projects. Check with DR or legal team.

  • Provide complete code.

  • Normalisation (pre-hash): uppercase

    • Any letters are converted to upper case (A-Z)

    • Leading and trailing spaces are stripped

  • Australian users please note: There is no "national id" in Australia. Use of Medicare, Passport, or Tax File Numbers is prohibited.

6

frequent_flyer_number

String (varchar)

  • Person's Frequent Flyer Number (if exists)

  • Only one value allowed

  • Provide complete code.

  • Normalisation (pre-hash): uppercase

    • Any letters are converted to upper case (A-Z)

    • Leading and trailing spaces are stripped

7

custom_name

String (varchar)

 

  • Full name of a person, with customer qualifiers as agreed between matching parties (e.g. addition of post code or birthdate - see below)

  • Multiple values are allowed (2)

  • Concatenate given_name and family_name, separated by a space.

  • Remove titles and suffixes (e.g. Mr, Ms, Dr, or Jnr, esq.)

  • Add qualifiers as agreed with matching partner (e.g. post code), preceded by a space.

  • Normalisation (pre-hash):

    • Same as for family_name above.

8

birthdate

Date

  • Person's date of birth, in YYYY-MM-DD format

  • Note that this field is used as a qualifier – matching cannot happen on birthdate alone

  • Normalisation (pre-hash): numeric

    • Provide date in the format YYYY-MM-DD (e.g. 1975-03-19)

    • Normalisation (pre-hash):

      • Non-digits are stripped, including any spaces, letters, etc. Only 0-9 retained

9

family_name

String (varchar)

  • Family name of a person (AKA "surname")

  • In some countries, this might be referred to as a "last name", however note that there are regional differences in the ordering of family and given names. This field should always be the family name.

  • Multiple values are allowed (2)

  • Databank Users: Unlike Databank, there is no special handling of suffixes that might have been included (e.g. "Jnr" or "the third"). DR recommends you remove these.

  • Remove or exclude titles (e.g. Mr, Ms, Dr) and suffixes (e.g. Jnr, III, esq.)

  • Contributor Nodes support UTF-8, make sure you're using UTF-8 string encoding.

  • Normalisation (pre-hash): name

    • Convert to all lower case (should be locale sensitive using UTF-8)

    • Remove non-alpha characters (e.g. apostrophes, dashes, digits) and white space.

    • The contributor's locale should be specified and used when deciding how to "down-case" letters.

10

given_name

String (varchar)

  • Given name of a person (AKA "Christian name")

  • In some countries, this might be referred to as "first name", however note that there are regional differences in the ordering of family and given names. This field should always be the given name.

  • Multiple values are allowed (2)

  • Same as 'family_name' → name

11

postcode

String (varchar)

  • Person's postcode, that is part of their residential address

  • Normalisation (pre-hash): numeric

    • Non-digits are stripped, including any spaces, letters, etc. Only 0-9 retained.



Default Matching Combinations

The following fields are always matched, because they are generally "unique enough" in a data set to be useful to match on.

  • email

  • phone

  • dpid

  • nationalid

  • frequent_flyer_number

  • custom_name

These next fields however are used as "qualifiers" – they are not matched on their own but instead are matched in combination. This is because alone, they are not useful fields to match on (e.g. it is not useful to match all the first names together, as it would include many false positives).

  • email + given_name

  • custom_name + postcode

  • phone + given_name

  • custom_name + birthdate

  • email + family_name

  • phone + family_name

  • dpid + family_name

  • dpid + given_name

Other combinations can be configured for you. Talk to support@datarepublic.com.