Challenges when operating 1000+ different AEM applications

Over the course of almost 15 years, many APIs were added to AEM, and even more implementations. The extensible nature of OSGi, Sling and the JVM itself made it possible to develop a multitude of solutions that no one anticipated when AEM was built. But that also makes it hard to evolve the platform while staying backwards compatible and not breaking anything.

And evolution is necessary, both to build new features and to improve operational efficiency, because not everything that was possible in the past is a feature worth supporting in the future.

This talk gives an overview of some of the challenges we face in this area as the Site Reliability Team of AEM CS, how we plan to solve them, and why your collaboration as customers and partners is important to make this work.

Tomasz Sobczyk

Could you elaborate on how leader selection works when AEM authors are recycling / restarting? I have seen isLeader() be unreliable, and I wonder what the proper way is to find out which instance is responsible, for example, for running workflows or any other actions that need to run on only a single instance?

Jörg

isLeader() is actually quite reliable, but it can happen that for some time isLeader() returns false on all cluster nodes. Improving this is on my todo list, although it does not have the highest priority.

Michal

How do you synchronize the content on newly created publishers after startup? How do you decide when all the states on all publishers are the same, especially when there are ongoing publications?

Jörg

Please approach me during the conference, as I want to understand your requirements better. In general, replication is a totally async process: each publish instance imports changes at its own pace. Also, the number of publish instances can change at any time, so you cannot wait for N publish instances to have reported back that they have imported. (And that's definitely worth a blog post.)

Maverick

Are there any numbers on how much startup time was reduced over the years, and how long it takes on average to have a new instance ready to serve traffic during a peak?

Jörg

Quite hard to say ... especially as the number of bundles and services was constantly growing, and that definitely contributed to a slower startup (comparing AEM 6.0 and AEM CS). On the other hand, we improved a lot of operations and moved them out of the critical path on startup. Right now the startup of AEM instances can be quite fast, in the range of 3-5 minutes, including the provisioning of all resources and services required.

Helge

How bad is it to have async communication on startup? Anything to watch out for? Any rule of thumb for handling timeouts/retries and other failovers? For example, getting feature flags from an external registry that need to be registered.

Jörg

In general, async communication is not much of a problem; the tough ones are the synchronous calls made while handling requests and during startup.
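To illustrate the distinction between blocking and non-blocking calls at startup, here is a minimal sketch using only the JDK (Java 9+). It is not an AEM or Sling API: `FeatureFlagLoader`, `loadFlags`, the flag name `newCheckout` and the timeout values are all hypothetical. The remote call runs off the startup thread, is bounded by a timeout, and falls back to defaults if the registry is slow or fails.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, not part of AEM: loads feature flags without blocking startup.
class FeatureFlagLoader {

    // Conservative defaults used when the remote registry does not answer in time.
    static final Map<String, Boolean> DEFAULT_FLAGS = Map.of("newCheckout", false);

    /**
     * Runs the remote call on another thread and returns immediately.
     * If the call throws or exceeds the timeout, the future completes with the defaults.
     */
    static CompletableFuture<Map<String, Boolean>> loadFlags(
            Supplier<Map<String, Boolean>> remoteCall, Duration timeout) {
        return CompletableFuture.supplyAsync(remoteCall)
                .orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS)
                .exceptionally(error -> DEFAULT_FLAGS);
    }

    public static void main(String[] args) {
        // Simulated registry that is far slower than the allowed timeout.
        Supplier<Map<String, Boolean>> slowRegistry = () -> {
            try {
                Thread.sleep(2_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return Map.of("newCheckout", true);
        };

        CompletableFuture<Map<String, Boolean>> flags =
                loadFlags(slowRegistry, Duration.ofMillis(200));
        System.out.println("startup continues while flags load...");
        // join() is only used here to show the result; real code would react in a callback.
        System.out.println("flags = " + flags.join());
    }
}
```

The design choice is that the caller never waits longer than the timeout and always receives a usable value. Whether to add retries on top of this depends on the registry and is left out of the sketch.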

Helge

Is there any way to measure the time that all the related services take during a request/response cycle?

Jörg

I am not aware of any out-of-the-box way. A profiler could give you that information; another option could be AOP to instrument service methods.
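As a rough illustration of the instrumentation idea, the following sketch uses a plain JDK dynamic proxy rather than an AOP framework; it is an assumption about how one might approach this, not an AEM feature. `TimingProxy`, `wrap`, and the `PriceService` interface are hypothetical names. The proxy wraps a service interface and accumulates the time spent per method.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper, not part of AEM: times every call made through a service interface.
class TimingProxy {

    // Accumulated nanoseconds, keyed by "Interface.method".
    static final Map<String, LongAdder> TOTAL_NANOS = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> serviceInterface, T target) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the service's own exception unchanged
            } finally {
                String key = serviceInterface.getSimpleName() + "." + method.getName();
                TOTAL_NANOS.computeIfAbsent(key, k -> new LongAdder())
                        .add(System.nanoTime() - start);
            }
        };
        return (T) Proxy.newProxyInstance(
                serviceInterface.getClassLoader(), new Class<?>[] {serviceInterface}, handler);
    }

    // Example service interface, purely for demonstration.
    interface PriceService {
        int priceFor(String sku);
    }

    public static void main(String[] args) {
        PriceService timed = wrap(PriceService.class, sku -> sku.length() * 10);
        timed.priceFor("abc");
        TOTAL_NANOS.forEach((name, nanos) ->
                System.out.println(name + " took " + nanos.sum() + " ns in total"));
    }
}
```

This only measures calls that go through the proxy, so it suits a handful of selected services; a profiler remains the better tool for a complete picture of a request.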

Yves De Bruyne

Would providing an incentive to use the CDN help? Currently CDN hits cost the same as publisher responses...

Jörg

That could be a way :-) But it would require changing the licensing, and for that reason it is rather hard to implement.

Helge

Is there a way to optimize build time in Cloud Manager for large repos?

Jörg

The actual build time should be independent of the repository size. For the deployment duration itself, Adobe should be responsible, and often you as a customer cannot influence it that much. (We will let you know if you can/should.)

Tomasz Sobczyk

Are you looking to provide customers / partners with any of your internal tooling for monitoring? It doesn't sound very practical to ask everyone to build their own logging / observability for an as-a-cloud-service platform.

Radu Cotescu

For logging there's log forwarding: https://experienceleague.adobe.com/en/docs/experience-manager-cloud-service/content/implementing/developing/log-forwarding For monitoring you can integrate with New Relic One: https://experienceleague.adobe.com/en/docs/experience-manager-cloud-service/content/implementing/using-cloud-manager/user-access-new-relic

Jörg

No, it's not practical, and you should not be required to do it. Adobe should take care of it. But caching is entirely project-specific, and for that reason the caching decisions you make have a high impact. That is why we are working on integrating this information and these recommendations into Cloud Manager.

Rogier

Will you add a post-content-sync CDN cache flush to help us all avoid content update issues?

Jörg

As said, it is not that easy to achieve. Ping me here at adaptTo(), then I can outline the implications. And for the other attendees I will write it up as a blog post :-)

Tomasz Sobczyk

Are you expecting to reduce the fullstack deployment times any further from the current 30+ minutes?

Jörg

What target time would you expect? Please be aware that fullstack builds always include a full build plus a deployment on all instances. And to be honest, I don't expect it to reduce dramatically. If you need a faster turnaround time, I suggest using RDEs, as many shortcuts are taken there.

Divanshu Goyal

At times we have multi-module projects that combine several projects. The general idea is that the old project is migrated and a new project holds the new features from the day of the release to AEMaaCS, while we keep the old CQ projects and just make them compatible. After a while this becomes a bottleneck, as the old project gets compiled every time although it is only meant to sit on the instance, and because of the immutable nature of the repository we can't take it out. Is there a way that project can be once compiled and sit in

Jörg

There are ways to pull binaries into the "all" Maven project ... (e.g. using a private Maven repository)

wolf

Adobe is actively discouraging the use of a proper CDN configuration (configuration is still limited; campaign string blocking is a simple change that everyone has been waiting on for years), and it is discouraging BYOCDN (via licensing terms and lack of support to connect at the correct endpoints). What incentive is there for a customer to optimise and actually apply good caching practices? There are models of operation (e.g. post-content-sync flush to ensure coherency) that we have not even heard of being in the plan…

Jörg

There is the incentive of delivering a decent experience to your users, which is much harder if the CDN is not used. (Approach me regarding the post-content-sync flush; I want to understand more about that requirement. Thanks!)

Krystian Panek

Could you describe in more detail in the documentation how an AEMaaCS author cluster works in practice? There are hidden details that can cause problems for developers when building features. For example, sometimes the same node handles both HTTP traffic and Sling jobs, while other times these responsibilities are separated.

Jörg

Sling jobs are executed based on the queue properties. That can mean that either the local node executes the job or only the cluster leader does. Right now we haven't implemented a way to split job execution out into dedicated services that don't respond to incoming HTTP requests.

Konrad

How can we manually defer making a server ready (to wait a short time for a custom service to do some initialization)?

Radu Cotescu

You could use the org.apache.felix.hc.generalchecks.DsComponentsCheck and friends, with these tags: "hc.tags":[ "systemalive", "systemready"]. Starting with release 21005 there's a REQUIRED_CUSTOM_OSGI_COMPONENTS environment variable that you can set to provide a CSV list of component names that have to be active for the system to be considered ready. Jörg worked on this feature, so he could give you some more details about it.