Missing data on Experience and Experiment dashboards

Minor incident Management Platform
Jul 18, 2023 20:30 -03 · 2 days, 14 hours, 48 minutes

Updates

Retroactive

Summary

Between 20:30 BRT on July 18 and 11:18 BRT on July 21, 2023, data collection for our Experience and Experiment dashboards was interrupted. This left a gap in Experience Traffic Data for this period. Our team identified and resolved the issue on July 21, and data collection resumed at 11:18 BRT.

Timeline of Investigation and Resolution

  • July 21, 10:39 BRT: We received initial reports about a lack of experience traffic data from the previous day.
  • July 21, 10:47 BRT: We identified the affected services, and the team responsible for these services joined the investigation.
  • July 21, 10:55 BRT: We identified the source of the problem. An automatic update of a third-party service had changed some of its functionality.
  • July 21, 11:18 BRT: We fixed the underlying issue, and data collection resumed.

Incident Details

Time and Date of the Incident: The issue started at 20:30 BRT on July 18 and took full effect by 07:30 BRT on July 19, when data collection stopped completely. The team fixed the problem on July 21 at 11:18 BRT.

Symptoms: The data collection service for our Experience and Experiment dashboards was interrupted. However, the Workspace and Application dashboards were unaffected. The functionality of the Experiences and Experiments was also not affected; they continued to run seamlessly.

Affected Parties: The incident affected marketers using the Experience and Experiment dashboards during the timeframe above. Importantly, no end users were impacted, and all personalizations on our customer applications continued to receive the appropriate content from the Experiences and Experiments.

Root Cause: The problem originated from an unprompted automatic update to a third-party service, which discarded some of our configurations for that service. One of them was the health check for application readiness, an indicator of whether an application has completed its initialization and is ready to receive traffic. Our service component that collects the traffic data incorrectly initialized itself only when it received a readiness check call. Because the update had removed the health check configuration, that call never came, and the component did not initialize.
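The failure mode described in the root cause can be illustrated with a minimal sketch. Everything in it is a hypothetical assumption made for illustration (the class name `TrafficCollector`, its methods, and the in-memory event list); it is not our actual service code.

```python
class TrafficCollector:
    """Hypothetical collector whose initialization is coupled to the readiness probe."""

    def __init__(self):
        self.initialized = False
        self.events = []

    def readiness_check(self):
        # Side effect: the first readiness probe is what triggers initialization.
        if not self.initialized:
            self.initialized = True
        return self.initialized

    def collect(self, event):
        # Events that arrive before initialization are silently dropped.
        if not self.initialized:
            return False
        self.events.append(event)
        return True


# If the platform never calls the readiness probe, nothing is collected.
unprobed = TrafficCollector()
dropped = unprobed.collect({"experience": "homepage"})

# Once the probe has been called, collection works as expected.
probed = TrafficCollector()
probed.readiness_check()
stored = probed.collect({"experience": "homepage"})
```

In this sketch, removing the readiness probe configuration is equivalent to never calling `readiness_check()`, which leaves the collector running but permanently uninitialized.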

Actions Taken and Future Measures

Immediately after identifying the issue, we restored the configurations that the third-party service update had discarded. This action re-enabled the service and allowed data collection to resume.

We are working on a permanent fix for the data collection service component to ensure it doesn’t rely on the readiness health check. We will update this incident report as soon as this fix is implemented.
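One possible shape for such a fix is sketched below, under stated assumptions. The names (`TrafficCollector`, `readiness_check`, `collect`) are hypothetical and the sketch does not represent the final implementation. The idea is that the component initializes eagerly at startup, and the readiness check only reports state without changing it.

```python
class TrafficCollector:
    """Hypothetical collector that initializes at startup, independent of any probe."""

    def __init__(self):
        self.initialized = False
        self.events = []
        self._start()  # eager initialization, no probe required

    def _start(self):
        # Placeholder for real setup work (connections, buffers, and so on).
        self.initialized = True

    def readiness_check(self):
        # Read-only: reports state and never triggers initialization.
        return self.initialized

    def collect(self, event):
        # Fail loudly rather than dropping data if setup did not complete.
        if not self.initialized:
            raise RuntimeError("collector not initialized")
        self.events.append(event)
        return True


# Collection works even if the readiness probe is never called.
collector = TrafficCollector()
stored = collector.collect({"experience": "homepage"})
```

With this design, a missing or misconfigured readiness probe would affect traffic routing decisions made by the platform, but not whether the collector itself is able to record data.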

In the meantime, we have heightened our service monitoring to ensure accurate data collection. We sincerely apologize for any inconvenience caused and thank you for your understanding and patience as we continue to improve our service.

July 21, 2023 · 13:51 -03
