by Tony Scott / Madhu Therani

Principal Data Scientist / Chief Technology Officer

September 24, 2018

The How and Why of ID Unification

Identifier (ID) Unification is one of the key data fusion problems to be solved when integrating data from multiple data sources. In the context of enterprises, building a 360-degree view of your customers is a key need. Data from internal sources (first party), various kinds of partners (second party), external data brokers (third party) and open, public sources needs to be “fused” to build this view. Furthermore, data from these systems can be at an individual level or a group level, where a group may be defined by a variety of criteria.

One fundamental issue in data integration is the use of different unique identifiers for the same individual across these systems. Data per individual is associated with these identifiers. The identifiers help one not only to access the data but also to “engage” with the individual on whichever channel the individual is active: for example, cookies on a browser, “Ad Identifiers” on a mobile device, or email ids via email. To give a sense of the “ID” landscape, we list the main identifier types below:

  • Cookies - Desktop & Mobile Browsers - Cookies on a mobile browser are different from desktop browser cookies. Smartphones will have different cookies from those on tablets! Additionally, cookies may also differ by the browser the consumer uses on each device (Chrome, Firefox, Safari). Further, there are first party and third party cookies (DoubleClick, Criteo, Nielsen, Quantcast, Krux and Adobe, to name a few). So a single user already has many cookies: we estimate about 30 - 50 cookies (both first party and third party) for a single individual, depending on their level of online activity. Furthermore, these identifiers may be hashed and morphed depending on the context of usage, such as derived identifiers that an ad exchange defines for its own internal use.
  • Mobile Ad IDs - Both Android and iPhone have unique identifiers for delivery of content - especially ads. These can be reset periodically by the end-user and can be obfuscated - based on privacy settings.
  • Mobile device identifiers - IMEI and other hardware centric identifiers.
  • Network identifiers - MAC addresses for devices that the user is active on. MAC addresses can also be randomized at different points in the network.
  • Email identifiers - A single user may have multiple touchpoints with different entities in the data eco-system and use different email ids.
  • Usernames - If a user is forced to sign in with a username, they could have a number of those.
  • Social Media handles - Facebook, Twitter, and other handles.
  • CRM/Loyalty System numbers/identifiers - such as subscriber, account or customer identifiers (internal to an organization) - TV and Telco identifiers for example.
  • PII - Full names, personal device/phone number, social security identifier or equivalent, home address.

Going down the list above, the IDs become less anonymous. The ID Unification problem boils down to linking these identifiers together so that the data associated with each ID may be merged. The problem occurs in various guises and goes by different names in different fields, namely:

  • Record Linkage.
  • Entity Resolution, ID Resolution.
  • ID Unification.
  • Entity Disambiguation.

In the context of Marketing and Advertising, the problem is also referred to as cross-device, cross-screen or cross-channel identification, with the focus being on engaging with the customer. However, the underlying problem is still the same. In the context of CRM data management, it is also called the “Onboarding” problem, and is posed as linking offline customer data and its identifiers to the online identifiers of the same customer.
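
Whatever name is used, the core linking step can be viewed as grouping identifiers into connected sets: if a cookie and a mobile ad ID are each matched to the same hashed email, all three belong to one resolved entity. The following is a minimal sketch of that idea using a union-find structure. The class name, the `type:value` identifier strings and the example links are all hypothetical, and the sketch assumes the pairwise matches have already been established by some other means; it is not a description of any particular vendor's method.

```python
from collections import defaultdict


class IdentifierGraph:
    """Union-find over identifiers; each connected set is one resolved entity."""

    def __init__(self):
        self.parent = {}

    def find(self, ident):
        # Register unseen identifiers as their own root.
        self.parent.setdefault(ident, ident)
        root = ident
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[ident] != root:
            self.parent[ident], ident = root, self.parent[ident]
        return root

    def link(self, a, b):
        # Record that identifiers a and b belong to the same entity.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def groups(self):
        # Collect identifiers by their root to obtain the resolved entities.
        clusters = defaultdict(set)
        for ident in list(self.parent):
            clusters[self.find(ident)].add(ident)
        return list(clusters.values())


# Made-up pairwise matches for illustration only.
graph = IdentifierGraph()
graph.link("cookie:abc", "email_hash:e1")
graph.link("maid:xyz", "email_hash:e1")
graph.link("cookie:def", "crm:42")
resolved = graph.groups()
```

Here `resolved` holds two sets: one containing the first cookie, the mobile ad ID and the hashed email, and one containing the second cookie and the CRM identifier. Note that transitive linking of this kind propagates errors as well: a single wrong link merges two entities entirely, which is one reason match validation matters.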

Approaches to linking identifiers vary across vendors and enterprises. The key point to realize is that the more “common” knowledge about an individual is available across the sources to be “fused”, the better your match rate. Having deep knowledge about an ID in only one source is really not that useful. A number of issues also need to be considered while developing a holistic solution:

  • Staleness of the identifiers - We at Near have seen old, inactive or no longer relevant identifiers being used for linking. ID unification needs to stay relevant by using knowledge of the life cycles of identifiers across channels. Some identifiers are long-lived whereas others are short-lived.
  • Linking identifiers as individuals and as groups - Given the number of identifiers and the different points of data overlap, it is very difficult to get unique matches at scale. The classic quality/quantity trade-off occurs in this problem too. Depending on the use-case, identifier matching has to be tuned appropriately: a single identifier in one channel may be linked to a small group, say a household, on another.
  • ID Unification is a repeat activity and needs to be done periodically - Periods vary based on the ID churn in different channels.
  • How does one validate a match between any two identifiers? What can improve your confidence?
  • Finally, standards and comparisons across different vendors are non-existent. Different metrics are used, which obscures comparison. It is not clear to the advertiser what to rely on. Intuitions can go awry. It is important to state how each metric is defined, explain how one expects it to behave, and have baselines for good and bad quality. Much work remains to be done here by the advertising and marketing community.
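
As one illustration of stating a metric explicitly, the sketch below computes pairwise precision and recall of predicted identifier groups against a labeled truth set. This is a common way to score record linkage output, not a metric the article prescribes; the function names and the tiny example data are hypothetical, and it assumes a trusted truth set (for example, from logged-in users) is available.

```python
from itertools import combinations


def pairs(clusters):
    """Every unordered pair of identifiers placed in the same cluster."""
    return {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(sorted(cluster), 2)
    }


def pairwise_precision_recall(predicted, truth):
    """Precision: share of predicted links that are correct.
    Recall: share of true links that were found."""
    predicted_pairs, true_pairs = pairs(predicted), pairs(truth)
    correct = len(predicted_pairs & true_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Made-up identifiers: c = cookie, m = mobile ad ID, e = hashed email.
truth = [{"c1", "m1", "e1"}, {"c2", "m2"}]
predicted = [{"c1", "m1"}, {"e1", "c2", "m2"}]
precision, recall = pairwise_precision_recall(predicted, truth)
```

In this example both values come out at 0.5: two of the four predicted links are correct, and two of the four true links were found. Reporting both numbers, together with how the truth set was obtained, makes the quality/quantity trade-off visible instead of hiding it behind a single "match rate".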

Once identifiers are linked, a range of issues arises in the data enrichment/merging problem. These include:

  • Which fields to unify when data from multiple buckets are integrated?
  • What if some of the field values are in conflict? Let us say the data attached to a cookie id says the person is male whereas the data attached to the mobile id says female - what do you pick? Or do you discard the match itself?
  • Which data sets should one consider as sources of truth compared to others?
  • When data flows between different identifiers - should they be kept separate? What about privacy issues? How do you link encrypted data?
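
One possible way to handle the conflict and source-of-truth questions above is to rank sources by trust, let the highest-ranked source win each field, and keep a record of every conflict so that a match with too many disagreements can be reviewed or discarded. The sketch below assumes such a ranking exists; the source names, the priority values and the sample records are hypothetical and only illustrate the mechanics.

```python
# Hypothetical trust ranking: higher number wins a conflict.
SOURCE_PRIORITY = {"crm": 3, "mobile_app": 2, "cookie": 1}


def merge_profiles(records):
    """Merge (source, fields) records for one resolved entity.

    Returns the merged profile and, per conflicting field, the set of
    distinct values that were seen.
    """
    merged, winner, conflicts = {}, {}, {}
    for source, fields in records:
        rank = SOURCE_PRIORITY.get(source, 0)  # unknown sources rank lowest
        for field, value in fields.items():
            if field not in merged:
                merged[field], winner[field] = value, rank
            elif value == merged[field]:
                # Agreement: remember the most trusted source backing it.
                winner[field] = max(winner[field], rank)
            else:
                conflicts.setdefault(field, {merged[field]}).add(value)
                if rank > winner[field]:
                    merged[field], winner[field] = value, rank
    return merged, conflicts


# Made-up records attached to three linked identifiers.
records = [
    ("cookie", {"gender": "male", "age_band": "25-34"}),
    ("mobile_app", {"gender": "female"}),
    ("crm", {"age_band": "25-34", "email_hash": "e1"}),
]
profile, conflicts = merge_profiles(records)
```

With this ranking, `profile["gender"]` ends up as `"female"` because the mobile app outranks the cookie, while `conflicts` still records that both values were observed. A stricter policy could instead drop the field, or reject the link altogether, once the number of conflicts passes a chosen threshold.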

At scale, many of these problems become more acute and designing viable solutions is a formidable task.

CrossMatrix at Near

At Near, we have developed an in-house technology for ID unification as part of our ongoing efforts to build our Ambient Intelligence Platform. CrossMatrix embeds generic statistical matching techniques along with domain/channel specific heuristics to generate resolved identifiers. Our internal AllsparkID is linked to different identifiers across data sets and used in specific use-cases. We are developing approaches and ground rules to address some of the aforementioned issues.

As an end note, it is important to appreciate that solving the ID unification problem is fundamental to the digital eco-system. The same problem is seen when linking product identifiers from different manufacturers, which is what the UPC (Universal Product Code) was developed to address by giving unique ids to physical goods. Similar issues will arise when digitizing and linking any real-world or virtual entity.

If you would like to know more, please reach us here.