The curious case of misrendered JSON

Have you ever felt that AEM is playing tricks on you? I bet you have. This is one of those stories - the more you dig in, the less you understand.

It was a classic headless setup - AEM’s role was to expose content via a REST-like JSON API so that various applications could consume the data. Things were running smoothly, but one day a misrendered JSON showed up - a set of mandatory properties had simply disappeared from the object. The case was thoroughly investigated, but no one could even reproduce it. Nothing had changed at the JCR level, there had been no deployment in the meantime, and when you visited the exact same URL, all the data was correct. “Oh, that must have been a one-off incident,” someone said. The ticket got closed and life went on. A week later a similar issue was reported - a different JSON object was broken this time, but at least you could reproduce it. Unfortunately, an hour later the problem magically went away. Time passed and a slightly different variant of the problem surfaced in production - you kept requesting the affected URL and the response alternated between completely valid JSON and its broken form. Your team hopped on a call to get to the bottom of the problem, but in a matter of minutes it vanished without a trace again.

Interested in what happened and where we ended up? That’s what the talk’s going to be about.

Krystian

Did you know that for the Sling Installer JMX resource, "activeResourceCount" can be 0 while the separate boolean "active" is true at the same time? AEM is full of surprises. Even the types of JSON fields change during startup, which is evil (my parser failed one day because a boolean-like field was served as a string for a moment). Thanks, Kuba, for the nice talk 🙂

jwadolowski

That's so true - there are a number of inconsistencies out there and some are extremely hard to predict. Thanks for the feedback, Krystian!

Anian Weber

What would you recommend to others to mitigate those issues? Are you, e.g., extending the health check for all future projects?

Stefan Seifert

You have to be especially careful with OSGi components that do not have a "hard dependency" on the components/models they are using. For example, we had an OSGi component to configure our link/media handler, and if this component was not there, everything still worked - but with the default instead of the project-specific configuration. Including those components in the health checks ensured they were ready on startup.

Konrad Windszus

Have you raised an issue with Sling Models (Exporter) to no longer return a 200 when exceptions are happening in some of the models (or they cannot be initialized at all)?

(see answer in talk video)

Henry Kuijpers

So if I understand correctly, your Sling Models Exporters / Sling Models didn't "require" the appropriate adapter definitions to be there? You had to delay making them work until, for example, the Resource to Asset adapter was there?

(see answer in talk video)