Our ongoing efforts to develop an Ambient Intelligence Platform at Near depend critically on fusing heterogeneous data sets. Understanding and utilizing this heterogeneity raises a number of conceptual and technical questions that need to be addressed. These problems are also common to many similar data productization efforts, as enterprises globally try to get a real data economy going.
So, what are the different dimensions of heterogeneity in the data sets? Some of the key ones to consider are:
- Semantic heterogeneity - Do the different data sets refer to the same phenomena? Are they complementary or conflicting? Do they share the same event source? Do they share the same conceptual abstractions and the same representations? This is especially acute in textual, audio and visual data streams - are we capturing or referring to the same events of interest?
- Temporal heterogeneity - Data sources may be static or dynamic. Static data sets are snapshots of a phenomenon at a point in time. Dynamic data may be streaming data that reflects a phenomenon continuously (possibly sampled at fast rates), or data that arrives at larger time intervals (on the order of days or weeks).
- Spatial heterogeneity - Data may reflect spatial effects or capture spatial nuances in one, two or three dimensions. Space may be represented in multiple ways, with different assumptions about its spatial properties. For example, population density may be represented or estimated in different ways.
- Modeling heterogeneity - Finally, most of the data we gather comes from sensors and devices that capture analog and digital phenomena through an underlying model. The nature of this abstraction itself matters when fusing data. Properties of interest include a) resolution or granularity of the data, b) active versus passive generation of events/measurements, c) observability of the data - full, partial or sampled observations, d) density of the observations along different dimensions of interest, e) biases in the data due to natural skews, the method of measurement/instrumentation, and environmental effects such as noise and transmission losses, and f) manually induced effects - especially in digital phenomena, many data streams can be faked, artificially generated or digitally doctored. How do we know what is real and what is not if we do not even have a baseline (a library of truth)? At scale, detecting this becomes extremely complex.
- Infrastructural heterogeneity - Large data sets may not be captured in full due to storage/power/bandwidth limitations, and data may be corrupted due to infrastructure issues.
Given these properties of different data sets, before defining an approach to fusion you need to know what questions you want to answer. Different questions will lead to different fusion approaches on the same data sets, because different properties become relevant in the fusion process. The following example illustrates the point.
One of the most pressing sets of questions for retail customers is the following: How many prospects are there for their wares? How many actually show up at their outlet? How many can they reach on a variety of advertising channels?
Suppose the retailer has invested in the following: a) one or more WiFi hotspots in the outlet, b) a beacon-based system, c) a digital POS system that tracks actual transactions in the store, d) video cameras in the store and e) human mobility data into and near the store from vendors such as Near. Given these five data sources, we want to develop various count time series to answer the following questions (say, in a given time interval):
- How many customers came near the store?
- How many came into the store?
- How many were window shoppers?
- How many actually bought something?
- How many are repeat customers? How many buy each time they come into the store?
- How many did not find what they were looking for?
- How many were price-shopping or comparing?
- How much time was spent by each customer?
- How many were disappointed by the customer service?
- Which products really sold well versus those that did not?
- How many visitors show up in one source and not the other? Should one count visits or number of people per day?
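The last question in the list, visits versus people and who appears in which source, can be made concrete with a small sketch. It assumes, optimistically, that visitor identifiers have already been linked across sources; the IDs and the two helper functions are hypothetical and only illustrate the bookkeeping.

```python
# Hypothetical visitor IDs observed by two sources on the same day.
# Each entry is one observed visit, so a repeated ID is a repeat visit.
# Assumption: the IDs are already resolved to the same person across sources.
wifi_visits = ["a", "b", "b", "c"]
pos_visits = ["b", "d"]

def summarize(visits):
    """Return (number of visits, number of distinct people)."""
    return len(visits), len(set(visits))

def overlap(source_a, source_b):
    """Split distinct people into only-in-a, in-both and only-in-b."""
    a, b = set(source_a), set(source_b)
    return sorted(a - b), sorted(a & b), sorted(b - a)

print(summarize(wifi_visits))            # (4, 3): four visits by three people
print(overlap(wifi_visits, pos_visits))  # (['a', 'c'], ['b'], ['d'])
```

Even in this toy case the "count" depends on the definition chosen: WiFi reports four visits but three people, and only one person is seen by both sources.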
These questions can be broken down by temporal buckets - weekdays versus weekends, mornings, afternoons, evenings. Depending on the nature of the outlet, different questions from the list above become important, and each of the five sources may be fused in different ways depending on the question being answered. A generic department store may have different needs from a restaurant, which in turn may differ from a shoe store. Now consider the scenario of all these stores being co-located in a mall versus being in independent areas: the approach to data fusion will differ based on spatial constraints too. All this complexity is just for understanding what has happened in a past window of interest. It is much more complex to build something predictive that answers questions about the future - How many customers will actually come? How many should I reach? (And there are many more such questions.)
Getting into some more of the nitty-gritty - when a data scientist reviews the five streams, some of the issues to grapple with are:
- Some streams are every minute, whereas some are daily.
- Some data is available instantaneously, some arrives with a delay.
- There are blackouts in the data on different streams at different times.
- There are spikes and troughs in the data.
- How does one really check what the actual count was? (Even the notion of an actual count is on sketchy ground.)
- How does one validate the fused model?
- If decisions are to be made based on the model, what is the acceptable error margin? What risks need to be weighed when evaluating different data trade-offs during the fusion process?
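The first few items (mixed cadences and blackouts) can be sketched as a small alignment step. This is only an illustration under stated assumptions, not a description of the Near platform: `to_daily`, `align`, the 80% coverage threshold and the sample numbers are all hypothetical, and scaling a partial day up by its coverage assumes the missing minutes look like the observed ones, which is often not true.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def to_daily(minute_counts, min_coverage=0.8, minutes_per_day=1440):
    """Aggregate (timestamp, count) per-minute samples into daily totals.

    Days with too few samples are treated as blackouts and reported as None
    instead of as a misleadingly low total.
    """
    totals = defaultdict(int)
    samples = defaultdict(int)
    for ts, count in minute_counts:
        totals[ts.date()] += count
        samples[ts.date()] += 1
    daily = {}
    for day, total in totals.items():
        coverage = samples[day] / minutes_per_day
        # Assumption: missing minutes resemble observed ones, so scale up.
        daily[day] = total / coverage if coverage >= min_coverage else None
    return daily

def align(daily_a, daily_b):
    """Keep only the days on which both streams have a usable value."""
    shared = sorted(daily_a.keys() & daily_b.keys())
    return {d: (daily_a[d], daily_b[d]) for d in shared
            if daily_a[d] is not None and daily_b[d] is not None}

# Hypothetical data: a per-minute WiFi stream with one full day and one
# blackout day, and a daily POS stream covering both days.
start = datetime(2024, 1, 1)
full_day = [(start + timedelta(minutes=i), 1) for i in range(1440)]
blackout_day = [(start + timedelta(days=1, minutes=i), 1) for i in range(10)]
wifi_daily = to_daily(full_day + blackout_day)
pos_daily = {start.date(): 310, (start + timedelta(days=1)).date(): 295}
print(align(wifi_daily, pos_daily))
```

Only the first day survives alignment; the blackout day is dropped rather than compared against a POS count it cannot be fairly matched to.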
So even fusing multiple count streams into one suddenly looks like quite a complex problem. All the above business/modeling questions are just related to counts! We are not even considering other attributes such as demographics or affluence/income. Even for count questions, additional complexity arises when you pick subsets - weekdays versus weekends, counts in one store versus another, and so on. Fusion approaches that infer properties of an individual entity (person, place, product etc.) are quite different from approaches that infer group/cohort properties. For example, in the five streams above, different identifiers may be used to tag an individual - how do we link the different tags in different streams to a single individual in the real world? This is known as entity resolution, and it is a different kind of fusion problem. It needs to be done properly if one wants to infer properties from multiple sources at the level of an individual entity. Inferring group/cohort properties from multiple sources is also complex. How does one maintain a consistent definition of a group, so that the dependent properties are valid? Consider, for example, estimating the demographic mix of an audience in a store from multiple streams. The estimates could be in conflict - how does one resolve them?
As we build out the Near Platform, we are considering data from different aspects of human activity such as:
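One common way to hold the result of entity resolution is a union-find (disjoint-set) structure over identifiers. The sketch below is an illustration only: the `IdentityGraph` class and the tag values are hypothetical, and it assumes the pairwise links (the genuinely hard part, deciding that two tags refer to the same person) have already been established by some other process.

```python
class IdentityGraph:
    """Union-find over (source, identifier) tags."""

    def __init__(self):
        self.parent = {}

    def find(self, tag):
        """Return the representative tag of the entity that `tag` belongs to."""
        self.parent.setdefault(tag, tag)
        root = tag
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every tag on the path straight at the root.
        while self.parent[tag] != root:
            next_tag = self.parent[tag]
            self.parent[tag] = root
            tag = next_tag
        return root

    def link(self, tag_a, tag_b):
        """Record evidence that two tags refer to the same real-world entity."""
        root_a, root_b = self.find(tag_a), self.find(tag_b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def entities(self):
        """Group all known tags by the entity they resolve to."""
        groups = {}
        for tag in list(self.parent):
            groups.setdefault(self.find(tag), set()).add(tag)
        return list(groups.values())

# Hypothetical tags from three of the five streams.
graph = IdentityGraph()
graph.link(("wifi", "mac:aa:01"), ("beacon", "app:9f3"))
graph.link(("beacon", "app:9f3"), ("pos", "loyalty:1742"))
graph.find(("wifi", "mac:bb:02"))  # seen once, linked to nothing so far
print(len(graph.entities()))       # 2
```

Links are transitive here, so a single wrong link merges two people permanently; in practice one would likely want to keep the evidence behind each link so that merges can be reviewed or undone.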
- Traffic data - counts and trajectories distributed spatially
- Real estate data
- Weather data
- Social data streams - posts/blogs/events
- Online digital activity data - page visits, clicks etc
- Census data - static snapshots in time
The problem of fusion only gets harder as more sources are brought on board. In the discussion thus far, we have implicitly assumed that all the sources are available concurrently. In practice they are not: sources become available incrementally and may be active over different, non-overlapping time intervals. So the next major issue is how to combine the evidence from these sources to understand the past and predict the future. A key aspect is to adopt a Bayesian world view of the overall phenomena at hand to guide the platform development. Retrofitting Bayesian views onto a non-Bayesian framework is cumbersome, to say the least. Though intuitively more information is better, it is not straightforward to combine multiple data sets in a consistent way, and wrong approaches will lead to erroneous conclusions. Finally, the overall approach is further complicated by power-law behaviors in real-world data, at multiple levels of abstraction and along multiple dimensions of the data. This has been discussed in one of the earlier blogs.
The Near platform has been envisioned with these constraints in mind. Some of the core modeling tasks in Allspark 3.0, such as a) Audience Estimation, b) Offline Attribution, c) User Profiling and d) Place Profiling, depend on addressing data fusion issues at scale. Our teams are addressing these issues in a methodical manner. Though the problems are complex, we have taken the first steps in addressing some of them. Major enterprises can often control the provenance of their data sets from cradle to grave; building a platform where one does not control the provenance is a much harder problem.
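As a minimal illustration of the Bayesian view, the sketch below updates a belief about a store's hourly footfall rate as sources arrive one at a time. It uses the standard Gamma-Poisson conjugate update and is not a description of Near's models: the `GammaBelief` class and all numbers are hypothetical, and it assumes each source gives unbiased, conditionally independent Poisson counts of the same underlying rate, which real streams rarely do.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GammaBelief:
    """Gamma(shape, rate) belief about an hourly visitor rate."""
    shape: float
    rate: float

    def update(self, total_count, exposure_hours):
        """Conjugate update after seeing `total_count` visitors over
        `exposure_hours` hours, assuming Poisson-distributed counts."""
        return GammaBelief(self.shape + total_count, self.rate + exposure_hours)

    @property
    def mean(self):
        return self.shape / self.rate

    @property
    def variance(self):
        return self.shape / self.rate ** 2

# Weak prior: roughly 20 visitors per hour, with high uncertainty.
prior = GammaBelief(shape=2.0, rate=0.1)

# Sources come online at different times and cover different intervals.
after_wifi = prior.update(total_count=180, exposure_hours=8)
after_camera = after_wifi.update(total_count=95, exposure_hours=4)

for label, belief in [("prior", prior), ("+wifi", after_wifi), ("+camera", after_camera)]:
    print(f"{label:8s} mean={belief.mean:.2f} variance={belief.variance:.2f}")
```

Because the update only adds counts and exposure, the order in which sources arrive does not matter, and a source that is silent for a period simply contributes nothing for that period. Handling biased or correlated sources would need a richer model than this one.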
If you are interested in learning more, or want to work with us on similar or related problems, please reach out to us here.