Since our last blog entry in the summer of 2016, which outlined the nuances of developing a Location Intelligence platform, the tech team at Near has been focused on implementing that vision. Over the past 18 months, we have gathered quite a few interesting learnings and relevant observations on this journey. These observations span the data we process, our places repository (Near Places), the architecture of our pipelines, our APIs and finally Allspark. Over the next few weeks, we intend to share these on this blog.
- How to handle this data
- What analysis and models built with this data are meaningful
We process 6-10 TB of data every day, at about 60,000-80,000 events per second, from multiple sources including mobile apps, Wi-Fi access point providers, infrastructure providers and data partners. Here are some key characteristics of the data and the questions it raises:
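To put those figures in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal terabytes (1 TB = 10^12 bytes), a rate sustained evenly over the day, and that the low volume pairs with the low rate and the high with the high. None of this is stated above, so treat the result as an order-of-magnitude estimate only. The function names are ours, purely for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def events_per_day(events_per_second: int) -> int:
    """Events in one day if the rate were sustained evenly (an assumption)."""
    return events_per_second * SECONDS_PER_DAY


def mean_bytes_per_event(tb_per_day: float, events_per_second: int) -> float:
    """Implied average event size, assuming decimal TB (1 TB = 10**12 bytes)."""
    return tb_per_day * 10**12 / events_per_day(events_per_second)


if __name__ == "__main__":
    # Assumed pairing: (6 TB, 60k events/s) and (10 TB, 80k events/s).
    for tb, rate in [(6, 60_000), (10, 80_000)]:
        print(f"{rate:,} events/s -> {events_per_day(rate):,} events/day, "
              f"~{mean_bytes_per_event(tb, rate):.0f} bytes/event")
```

Under these assumptions the volumes work out to roughly 5 to 7 billion events a day, at a little over a kilobyte per event.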
Location Data Variance
Location data about consumers varies temporally and spatially in every country. Weekday and weekend patterns are quite different. Spatial coverage is highly varied - metro, suburban and rural areas are represented at different density levels. This is characteristic of all data sources and data types. One of the key questions is how to “fuse” data across these sources. We process both “ping” data and “trajectories”. How do you distinguish one from the other in an online mode versus an offline mode? What are common metrics to compare one data stream with another? The “fusion” needs to happen along multiple dimensions - spatial, temporal and any others that are relevant.
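As one illustration of what a common comparison metric could look like, the sketch below buckets events into coarse space-time cells and measures how much two streams overlap. This is a generic technique, not a description of our pipeline. The `Event` type, the 0.01 degree cell size, the one-hour window and the choice of Jaccard overlap are all assumptions made for the example.

```python
import math
from typing import Iterable, NamedTuple, Set, Tuple


class Event(NamedTuple):
    """Hypothetical minimal event: position plus Unix timestamp in seconds."""
    lat: float
    lon: float
    ts: int


Cell = Tuple[int, int, int]


def cell_key(e: Event, cell_deg: float = 0.01, window_s: int = 3600) -> Cell:
    """Map an event to a (lat bucket, lon bucket, time bucket) cell."""
    return (math.floor(e.lat / cell_deg),
            math.floor(e.lon / cell_deg),
            e.ts // window_s)


def occupied_cells(events: Iterable[Event], cell_deg: float = 0.01,
                   window_s: int = 3600) -> Set[Cell]:
    """Set of space-time cells in which a stream has at least one event."""
    return {cell_key(e, cell_deg, window_s) for e in events}


def jaccard_overlap(a: Iterable[Event], b: Iterable[Event]) -> float:
    """Share of occupied cells the two streams have in common (0.0 to 1.0)."""
    cells_a, cells_b = occupied_cells(a), occupied_cells(b)
    union = cells_a | cells_b
    return len(cells_a & cells_b) / len(union) if union else 0.0


if __name__ == "__main__":
    stream_a = [Event(12.9716, 77.5946, 0), Event(12.9816, 77.5946, 10)]
    stream_b = [Event(12.9719, 77.5949, 100), Event(13.055, 77.605, 100)]
    print(jaccard_overlap(stream_a, stream_b))
```

A cell-level overlap like this ignores event counts and density, so it is only a first-pass signal. The cell size and window would need tuning per source and per geography.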
Data Granularity and Errors
Location information granularity varies - GPS information arrives at different precision levels, or as tiled information. Any intermediary manipulation of the data upstream distorts the picture - so how do we infer what really happened when the event was generated? What potential errors are being introduced? Given these errors, what should we send downstream to the rest of the subsystems?
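To make the precision question concrete, the sketch below estimates the ground resolution implied by rounding a coordinate to a given number of decimal places. It uses the common approximation of about 111.32 km per degree of latitude and treats the Earth as a sphere. The function name is illustrative, and real feeds may be truncated, tiled or jittered rather than rounded.

```python
import math
from typing import Tuple

# Approximate length of one degree of latitude; spherical-Earth assumption.
METERS_PER_DEGREE_LAT = 111_320.0


def rounding_resolution_m(decimals: int, lat_deg: float) -> Tuple[float, float]:
    """Approximate (north-south, east-west) grid step in meters for
    coordinates rounded to `decimals` decimal places at latitude `lat_deg`."""
    step_deg = 10 ** -decimals
    north_south = step_deg * METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


if __name__ == "__main__":
    for d in (2, 3, 4, 5):
        ns, ew = rounding_resolution_m(d, lat_deg=12.97)
        print(f"{d} decimals: ~{ns:.1f} m north-south, ~{ew:.1f} m east-west")
```

Three decimal places give a grid of roughly 100 m, which is already coarser than many storefronts. The same event can therefore land in different places of interest depending on how an upstream system handled precision.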
Context from Spatial References
The quality and coverage of our “spatial” reference database is critical for linking and contextualizing the incoming streams. Having a reasonable density and coverage of different place categories and places of interest (POIs) in a given area is essential; otherwise one can make erroneous inferences. Popular tasks such as analysing white-spaces to identify future store or billboard locations, or offline attribution for marketing, critically depend on understanding the spatial distribution of POIs. We have spent quite a bit of effort understanding how the physical world evolves based on our Near Places repository. This has guided us on what physical world data to source and how to merge it.
Inferences from Disparate Data Streams
Data analysis spans time and geographic scales. Imagine a retail store which provides a video stream from a closed-circuit TV camera, a beacon system, a Wi-Fi ping stream and Near’s data streams. What should one expect when one compares these streams at a single location for a given duration? What are the correlations? What inferences can one make reliably? What are potential artefacts of the data?
Static vs Dynamic Data
At Near we process anonymized data about individuals and also aggregate this data to make inferences about “groups” of individuals. These groups/cohorts can be defined in a variety of ways in Allspark - our SaaS product. What kinds of individual attributes can be inferred? What kinds of individual data can be merged reliably to characterize a group? What kinds of cohort inferences can be made reliably? What kinds of offline data sources are available to inform our model building efforts? For example, census data from different countries is a good indicator of near-static characteristics of a census group - at a city or country level, or even for a smaller region. However, it is unclear what we should expect to see when the data is dynamic. For example, it is one thing to say the gender ratio in a city is 50:50, but do we expect to see the same ratio over a given time period at a McDonald’s, at a golf course, at a movie theater, at a spa? What are the bounds of this ratio at different locations, and what are the reasonable time-scales to measure this ratio? Our common sense notions or general census characteristics cannot be reliably extrapolated under different temporal and spatial windows.
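One standard way to put bounds on an observed ratio is a binomial confidence interval. The sketch below uses the Wilson score interval, a textbook method that behaves reasonably for small samples. It is offered as an illustration rather than as the model we use, and it assumes observations are independent and unbiased, which location data samples often are not.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical counts: the same 50% observed share at two sample sizes.
    for observed, total in [(20, 40), (2000, 4000)]:
        low, high = wilson_interval(observed, total)
        print(f"{observed}/{total}: {low:.3f} to {high:.3f}")
```

With 40 observations, a 50% observed share is compatible with anything from roughly 35% to 65%. With 4,000 it narrows to about 48.5% to 51.5%. The time window chosen for a single location therefore largely determines how much can be said about its ratio.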
Data Sources and Data Gaps
Sourcing data appropriately is essential. Understanding the data provenance - the original purpose of a data source, or the reason for collecting it - is essential when making downstream inferences. We have sourced different types of data from multiple partners in different geographies. Studying these data sets has helped us develop techniques to utilize them, and also to identify “data gaps”. Sampling techniques introduce their own biases.
Finally, a key focus of our efforts is to make inferences and share data in a privacy-compliant manner. How can one merge data and make inferences in a privacy-preserving manner? How can we prevent identities from being resolved from the data and inferences in the platform?
We have been working with navigation data, traffic data, spend data, weather data, event data, e-commerce data, search data and a few others - including data from new interactive devices such as wearables and chatbots. As these human activities happen in the physical world, location is a key attribute that can potentially link these data streams.
There are innumerable patterns that one could potentially infer. As our online and offline worlds merge in seamless ways, the points of interaction are already potential sources of “ambient” data. Obtaining high-quality, high-coverage data at scale may be quite difficult, but we believe that by merging data from moderate-quality sources at different scales, one can support reasonable data-driven decision making.
While we set out to build a Location Intelligence Platform, we realised we cannot provide real-time intelligence on Places, People and Products with just Location Data. Hence, over the last 18 months, we’ve scaled up our data collection and cleansing models, along with our data science engines to steer towards building an Ambient Intelligence Platform.
Contact Us to use data for superior decision-making.