Data fusion, the art of merging information from multiple sources to create more sophisticated models, offers significant potential benefits across a range of industries. It is a fundamental problem in many domains as one tries to bootstrap a data economy.
Bringing together heterogeneous datasets, however, poses several conceptual and technical questions, particularly around choosing among existing analytical approaches and developing appropriate evaluation metrics. Understanding these challenges, along with the complexities introduced by different kinds of heterogeneity in the data, is key to developing viable data fusion approaches. The boundary between generic and domain-specific approaches to data fusion is yet to be defined.
For starters, these are the main dimensions of data heterogeneity to consider when discussing datasets:
- Semantic Heterogeneity — Are the different data sets referring to the same phenomena? Are they complementary or conflicting? Do they share the same “event” source?
- Temporal Heterogeneity — Data sources may be static or dynamic. Static data sets are snapshots of a phenomenon at a point in time. Dynamic data may be “streaming” data that reflects a phenomenon continuously or at longer intervals.
- Spatial Heterogeneity — Data may reflect spatial effects or capture spatial nuances in one, two or three dimensions. Do the different data sets share similar abstractions?
- Modeling Heterogeneity — Most of the data gathered comes from sensors and devices that capture analog and digital phenomena based on an underlying model. The nature of that abstraction matters when fusing data: it determines what one can reasonably expect the fused result to mean. Are the underlying assumptions compatible?
- Infrastructural Heterogeneity — Large data sets may go uncaptured due to storage, power, or bandwidth limitations; data may be corrupted by infrastructure failures, or left incomplete by operational and systems issues. How do you fuse data on an ongoing basis under such conditions?
Knowing these various properties (and that different properties become relevant in different scenarios), anyone looking to fuse data must consider the questions they wish to answer in order to determine the most appropriate approach to merging their data. Let’s use the example of a retailer wishing to develop various “count” time-series datasets to illustrate the point.
Many retailers want to put together detailed views of their customers and the performance of their stores/products. Those retailers might gather consumer data from a multitude of sources, such as:
- WiFi hotspots within the outlet
- A beacon-based system
- A digital POS system that tracks transactions in-store
- In-store video cameras
- Human mobility data from a third party
With the data from those sources, they might want to answer questions regarding store activity in the recent past, like:
- How many customers came near the store?
- How many came into the store?
- How many were window shoppers?
- How many actually bought something?
- How many are repeat customers? How many buy each time they come into the store?
- How many did not find what they were looking for?
- How many were price-shopping? Comparing products?
- How much time was spent by each customer?
- How many were disappointed by the customer service?
- Which products really sold well versus those that did not?
- How many visitors show up in one source and not the other? Should one count visits or number of people per day?
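That last question already forces a concrete fusion decision. As a minimal sketch (with hypothetical device IDs and dates, since real WiFi and beacon logs look quite different), de-duplicating visit records across two overlapping sources yields two distinct counts from the same data:

```python
from datetime import date

# Hypothetical visit records: (device_id, day) pairs seen by each source.
wifi_visits = [("dev-a", date(2024, 1, 1)), ("dev-a", date(2024, 1, 2)),
               ("dev-b", date(2024, 1, 1))]
beacon_visits = [("dev-b", date(2024, 1, 1)), ("dev-c", date(2024, 1, 1))]

def fuse_counts(*sources):
    """Union visit records across sources, then count two different ways."""
    visits = set()
    for source in sources:
        visits.update(source)                       # de-duplicate (device, day) pairs
    visit_count = len(visits)                       # one count per device per day
    people_count = len({dev for dev, _ in visits})  # one count per device
    return visit_count, people_count

visits, people = fuse_counts(wifi_visits, beacon_visits)
# Four distinct (device, day) visits, but only three distinct devices:
# "visits per day" and "people per day" diverge as soon as sources overlap.
```

Note that this assumes a device identifier can be matched across sources at all; in practice that cross-source identity resolution is itself a semantic-heterogeneity problem.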
Depending on the nature of the outlet, different questions listed above will take priority and each of the five possible data sources may be fused in different ways depending on the question being answered.
A generic department store, for instance, may have different needs from a restaurant which may be different from a shoe store. Stores co-located in a mall will require a different approach than those located in independent areas. There are even differing spatial constraints between stores that factor into the data fusion process.
And the above applies merely to understanding what has happened in a sliver of the past. Building something predictive, to answer questions about the future (like how many customers will come to the store, or how many customers a store owner should reach), brings even more challenges.
The rabbit hole goes deeper. Those five aforementioned data streams bring with them questions and hurdles of their own:
- Some streams update every minute, whereas others update daily
- Some data is available instantaneously; some arrives delayed
- There are blackouts in the data on different streams at different times
- There are spikes and troughs in the data
- How does one verify the data’s accuracy?
- How does one validate the “fused” model?
- If decisions are to be made on the model — what is an acceptable margin of error? What are the risks involved with evaluating different data trade-offs during the fusion process?
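Even the first two hurdles, mixed update frequencies and blackouts, require deliberate handling before any fusion happens. A minimal sketch (with invented timestamps and counts) is to roll the finer stream up to the coarser one's granularity, at which point a gap in the fine stream shows up as a discrepancy against the coarse source:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical minute-level counts with a blackout (a missing minute).
minute_stream = {
    datetime(2024, 1, 1, 9, 0): 5,
    datetime(2024, 1, 1, 9, 1): 7,
    # 9:02 missing -- an infrastructure blackout
    datetime(2024, 1, 1, 9, 3): 4,
}
daily_stream = {datetime(2024, 1, 1).date(): 20}  # a second, coarser source

def to_daily(minute_counts):
    """Roll a minute-level stream up to daily totals so it can be
    compared against a source that only reports daily figures."""
    daily = defaultdict(int)
    for ts, count in minute_counts.items():
        daily[ts.date()] += count
    return dict(daily)

rolled = to_daily(minute_stream)
day = datetime(2024, 1, 1).date()
# A positive gap hints at data lost during the blackout; deciding whether
# to impute, discount, or flag it is exactly the trade-off question above.
discrepancy = daily_stream[day] - rolled[day]
```

How one resolves such a discrepancy (trust the coarse source, impute the missing minutes, widen the error bars) is a modeling decision, not a plumbing detail.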
Layer upon layer of complexity, just for relatively straightforward counts. And the above says nothing about subsets of counts (weekdays vs. weekends, one store vs. another, and so on), or about fusing data related to other attributes like demographics, income, and the like.
All of these considerations are exacerbated by the fact that data is not always available concurrently. More often than not, it becomes available incrementally — so how can anyone combine multiple sources of data to further fusion efforts that will help understand the past and predict the future?
Building data fusion approaches from the ground up into a data processing platform is essential. From a conceptual perspective, adopting a Bayesian worldview is equally important: its mathematical formalisms have sound underpinnings that allow incremental assimilation and propagation of updates and inferences across the complete data stack. Data-driven thinking has to pervade the architectural design of the data platform. Current approaches treat architecting data processing pipelines as an activity independent of the design and use of the data itself — the classic split between data engineering and data science. Retrofitting Bayesian data views onto frameworks that are primarily non-Bayesian is cumbersome, to say the least. Current approaches are ad hoc and highly heuristic; one does not really know when they will work and when they will not. Building data teams with an underlying Bayesian vision is still in its early days.
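To make the incremental-assimilation point concrete, here is a minimal sketch using a standard Gamma-Poisson conjugate pair (the prior values and counts are illustrative, not from any real store): the daily visitor rate carries a Gamma(alpha, beta) belief, and each batch of observed daily counts updates that belief in closed form, no matter when the batch arrives.

```python
# Gamma-Poisson conjugate update: if daily counts are Poisson(rate) and the
# rate has a Gamma(alpha, beta) prior, the posterior after observing counts
# c_1..c_n is Gamma(alpha + sum(c_i), beta + n). Data can thus be assimilated
# batch by batch, in any order, as it becomes available.

def assimilate(alpha, beta, counts):
    """Fold a batch of observed daily counts into the Gamma posterior."""
    return alpha + sum(counts), beta + len(counts)

alpha, beta = 2.0, 1.0  # weak illustrative prior: expected rate = alpha/beta = 2
alpha, beta = assimilate(alpha, beta, [30, 42, 38])  # first batch arrives
alpha, beta = assimilate(alpha, beta, [35])          # later data, assimilated incrementally
posterior_mean = alpha / beta  # current point estimate of the daily visitor rate
```

The appeal is exactly what the paragraph above describes: updates are incremental and order-insensitive, so late-arriving or blacked-out data slots into the same machinery instead of forcing a full recomputation.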
“Data Fusion” as it stands today is not a well-defined problem, and every data platform, however big or small, faces this issue. Developing a viable, extensible approach even within your own organization requires your technology team to stay cognizant of the constraints of data fusion, understand its implications for your products and services, and invest rigorously in the underlying R&D needed to forge a path forward.
Also published in Datafloq.