Data processing guidelines
Any location-based data set has inherent strengths and limitations based on how the underlying data is sourced and pre-processed. It is important to understand these nuances to fully leverage the value of the Mapbox Movement data set. The sections below are recommendations and considerations for processing and analyzing Movement data.
Quadkeys with zero activity in a given day are excluded from that day's data. As a result, when comparing activity aggregated over multiple days, we do not recommend taking a simple average of the activity, since that value would be biased upward by the omitted zero-activity quadkeys. Instead, we recommend either using the sum of the activity index across multiple days as the point of comparison, or weighting the average appropriately to account for the omitted quadkeys.
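The difference between a naive average and a correctly weighted one can be sketched as follows. This is a minimal illustration with hypothetical quadkeys, dates, and column names; the real data schema may differ.

```python
import pandas as pd

# Hypothetical daily activity data. Quadkey-days with zero activity
# are simply absent, so a naive mean over present rows is biased high.
daily = pd.DataFrame({
    "quadkey": ["0231", "0231", "0231", "0232", "0232"],
    "date": pd.to_datetime([
        "2020-03-02", "2020-03-03", "2020-03-04",
        "2020-03-02", "2020-03-04",  # 2020-03-03 omitted (zero activity)
    ]),
    "activity_index": [0.8, 0.6, 0.7, 0.2, 0.1],
})

n_days = daily["date"].nunique()  # length of the comparison period

per_quadkey = daily.groupby("quadkey")["activity_index"].agg(
    total="sum",        # recommended: sum across the period
    naive_mean="mean",  # biased: ignores omitted zero-activity days
)
# Weighted average: treat omitted days as zeros by dividing by n_days.
per_quadkey["adjusted_mean"] = per_quadkey["total"] / n_days
```

For quadkey `0232` above, the naive mean is 0.15 while the adjusted mean, which accounts for the omitted day, is 0.1.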
Due to Mapbox's strict privacy requirements, we omit quadkeys that do not pass minimum activity thresholds over the period of aggregation. As a result, the daily data will have more zero-activity areas, even when aggregated over a month, than monthly data. If you're interested in comparing activity over low-activity areas, contact Mapbox to see whether data products at monthly or other aggregation levels may be a good fit for your use case.
The data for each country is normalized within that country, and the normalization factors differ between countries. As a result, activity index levels are not directly comparable across different countries. The data can still be used to compare trends in activity levels over time within each country, though comparisons are inherently more exact within closer geographic proximity. This is described in more detail below.
The anonymous telemetry data that provide the foundation for the Mapbox Movement product are well correlated with the movement of people around the world, but do not provide a direct measure of the absolute number of people moving. To emphasize this point, we have chosen to present these data as the unit-less activity index. Any decision made from Mapbox Movement data should be informed by a comparison, whether that's a comparison of the activity difference between two blocks of a city on any given day, of the change in activity for a given location over some time span, or some combination of the above.
For any single place and time, the actual value of the activity index is not the only thing that matters. We also care about the difference in activity at that place and time relative to some other place or time. In fact, it makes more sense to talk about our confidence level for a comparison made from Movement data over space or time than of an individual data point.
As we develop a mechanism to provide confidence levels for any given comparison, this general rule applies: Comparisons made over short timescales and short distances are more trustworthy than those made over long timescales and distances.
We are always working to better calibrate our underlying data, but we expect some drift to accumulate. Day-to-day comparisons are almost always quite exact, while year-over-year comparisons can be trusted to show general trends but not to support detailed examination. Similarly, a comparison made within a single city will almost always be quite exact, while comparisons made across cities should be used only for directional trends, as small differences may not carry much signal.
Also note that the activity index measurement is more stable for areas with relatively high average activity, since an individual event is less likely to influence the scores in the area. As we work toward a quantified confidence level for absolute values, this general rule also applies: Comparisons made over areas that have a high average activity are more trustworthy than those made over areas that have a low average activity.
Sometimes it’s useful to normalize the data regionally, to understand how local trends differ from other regions. In other cases it may be useful to smooth out the timeline data, to avoid having too much volatility in the data. Here are a few examples:
- When you are interested in smooth timelines, we recommend working with moving averages with a window of seven days, or a multiple of seven. This will help smooth out some natural weekend effects. You can do this at any zoom level or boundary aggregation.
- When you want to normalize and compare individual days of the week against each other, we recommend defining a period of time for the normalization (possibly accounting for holidays). For example, you can pick four weeks in February and define a generic Monday as the average of those, to compare future Mondays against. This normalization helps when trying to identify changes over time in individual days of the week. For example, do people move more on Sundays during shelter in place than they did before shelter in place, compared to weekday movement?
- When you want to compare individual areas (like counties) that have a different average activity level, it might be useful to normalize each area independently if you are interested in their relative change more than in their absolute difference. For example, you could pick four weeks in February as the normalization period, and normalize each area based on the average activity seen in February. This can help if you want to analyze when different areas are going back to the same activity as in the reference period, using the same scale for each area.
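The smoothing and normalization techniques above can be sketched as follows. This is a minimal example on synthetic data; the dates, baseline period, and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily activity index for one area (stand-in for real data).
rng = pd.date_range("2020-02-01", "2020-04-30", freq="D")
np.random.seed(0)
activity = pd.Series(1.0 + 0.1 * np.random.randn(len(rng)), index=rng)

# 1. Smooth with a 7-day moving average to absorb weekend effects.
smoothed = activity.rolling(window=7).mean()

# 2. Build a day-of-week baseline from four weeks in February
#    (Mon 2020-02-03 through Sun 2020-03-01).
baseline = activity["2020-02-03":"2020-03-01"]
weekday_baseline = baseline.groupby(baseline.index.dayofweek).mean()

# 3. Express each day relative to its "generic" weekday: a value of
#    1.0 means "a typical Monday (or Tuesday, ...) of the baseline".
dow = pd.Series(activity.index.dayofweek, index=activity.index)
relative = activity / dow.map(weekday_baseline)
```

For comparing areas with different average activity (the last bullet), the same pattern applies per area: divide each area's series by that area's own mean over the reference period, putting all areas on a common relative scale.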
We’ve seen cases of unusual spikes in the time series of normalized data for small areas. They likely represent a Mapbox customer going through some high-impact event, or they could represent an app or feature launch. This is usually corrected for in our normalization procedures, but there are some edge cases in which our normalization doesn’t re-absorb the full effect. We’re working on a way to reduce this issue by introducing a more localized anomaly detection within our normalization and baselining methodologies.
As a mitigation plan, we recommend introducing an outlier removal procedure to avoid showing or using these spurious data points. There are many methodologies for handling such outliers. We think a method based on the z-score (or standard score) can correctly identify anomalies, with the mean and standard deviation computed either over the full time series or over a moving window that favors recency on long time series.
Since these time series are a ratio of measured data over a baseline, take the logarithm of the ratio before computing the z-score so that outliers in both directions are treated symmetrically, and set a threshold based on the intrinsic variability observed.
Finally, since the spikes are likely proportional to real events and movement increase, though not as high as measured, we recommend evaluating whether to cap the value of the outlier, rather than removing it fully from the time series, depending on the specific purpose of the work.
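The approach above can be sketched as follows: a z-score on the log of the ratio over a trailing moving window, with outliers capped rather than removed. The window size, threshold, and synthetic data are assumptions to be tuned to your own series.

```python
import numpy as np
import pandas as pd

def cap_outliers(ratio: pd.Series, window: int = 28, z_thresh: float = 3.0):
    """Cap spikes in a measured-over-baseline ratio series.

    Computes a z-score on the log of the ratio (so deviations in both
    directions are treated symmetrically), using a trailing moving
    window for the mean and standard deviation.
    """
    log_r = np.log(ratio)
    # Trailing window, shifted by one day so a spike does not inflate
    # its own baseline statistics.
    mu = log_r.rolling(window, min_periods=window).mean().shift(1)
    sigma = log_r.rolling(window, min_periods=window).std().shift(1)
    z = (log_r - mu) / sigma
    # Cap rather than drop: spikes are likely proportional to real
    # movement increases, just overstated.
    upper = np.exp(mu + z_thresh * sigma)
    lower = np.exp(mu - z_thresh * sigma)
    return ratio.clip(lower=lower, upper=upper), z

# Synthetic example: a noisy ratio series with one spurious spike.
rng = pd.date_range("2020-01-01", periods=120, freq="D")
np.random.seed(1)
ratio = pd.Series(np.exp(0.05 * np.random.randn(120)), index=rng)
ratio.iloc[100] *= 10.0  # inject a spurious spike
capped, z = cap_outliers(ratio)
```

The spike is flagged (its z-score far exceeds the threshold) and clipped down to the window's upper band, while the rest of the series passes through unchanged.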