Challenges when operating 1000+ different AEM applications
Over the course of almost 15 years, many APIs were added to AEM, and even more implementations. The extensible nature of OSGi, Sling and the JVM itself made it possible to develop a multitude of solutions that no one anticipated when building AEM. But that also makes it hard to evolve the platform while ensuring backwards compatibility and not breaking anything.
And evolution is necessary, both to build new features and to improve operational efficiency, because not everything that was possible in the past is a feature worth supporting in the future.
This talk gives an overview of some challenges that we, as the Site Reliability Team of AEM CS, are facing in this area, how we plan to solve them, and why your collaboration as customers and partners is important to make this work.
Tomasz Sobczyk
Could you elaborate on how leader selection works when AEM authors are recycling / restarting? I have seen isLeader to be unreliable, and I wonder what the proper way is to find out which instance is responsible, for example, for running workflows or any other actions that need to run only on a single instance.
Jörg
isLeader() is actually quite reliable, but it can happen that at some point in time isLeader() returns false on all cluster nodes. Improving this is on my todo list, although it does not have the highest priority.
Michal
How do you synchronize the content on newly created publishers after startup? How do you decide when all the states on all publishers are the same, especially when there are ongoing publications?
Jörg
Please approach me during the conference, as I want to understand your requirements better. In general, replication is a totally async process; each publish instance imports changes at its own pace. Also, the number of publish instances can change at any time, so you cannot wait for N publish instances to have reported back that they have imported. (And that's definitely worth a blog post.)
Maverick
Are there any numbers on how much the startup time was reduced across the years, and how long it takes on average to have a new instance ready to serve traffic during a peak?
Jörg
Quite hard to say ... especially as the number of bundles and services was constantly growing, and that definitely contributed to a slower startup (comparing AEM 6.0 and AEM CS). On the other hand, we improved a lot of operations and moved them out of the critical path on startup. Right now the startup of AEM instances can be quite fast, in the range of 3-5 minutes, including the provisioning of all resources and services required.
Helge
How bad is it to have async communication on startup? Anything to watch out for? Any rule of thumb, like handling timeouts/retries and other failovers? For example, getting feature flags from an external registry that need to be registered.
Jörg
In general, async communication is not that much of a problem; the tough ones are the synchronous calls made while handling requests and during startup.
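To illustrate the timeout/retry part of the question, here is a minimal sketch in plain Java, not AEM-specific code. It assumes a hypothetical feature-flag registry exposed as a `Supplier`; the names `StartupFlags`, `withRetries` and `loadAsync` are made up for illustration. The lookup runs off the calling thread, is retried a bounded number of times, and falls back to defaults when it fails or takes too long, so startup never blocks on the external system.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, for illustration only.
class StartupFlags {

    /** Calls the supplier up to maxAttempts times; returns the fallback if every attempt throws. */
    static <T> T withRetries(Supplier<T> call, int maxAttempts, T fallback) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                // Swallow and retry; a real implementation would log and back off here.
            }
        }
        return fallback;
    }

    /**
     * Loads the flags asynchronously. The returned future completes with the
     * defaults if the registry fails three times or does not answer within the timeout.
     */
    static CompletableFuture<Map<String, Boolean>> loadAsync(
            Supplier<Map<String, Boolean>> registry,
            Map<String, Boolean> defaults,
            Duration timeout) {
        return CompletableFuture
                .supplyAsync(() -> withRetries(registry, 3, defaults))
                .completeOnTimeout(defaults, timeout.toMillis(), TimeUnit.MILLISECONDS);
    }
}
```

A caller would start `loadAsync(...)` during activation and apply the result in a callback such as `thenAccept`, instead of calling `join()` on the startup thread. `completeOnTimeout` requires Java 9 or later.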
Helge
Is there any way to measure the time that all the related services take during a request/response cycle?
Jörg
I am not aware of any out-of-the-box way. A profiler could give you that information; another option could be AOP to instrument service methods.
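As a rough illustration of the instrumentation idea, the sketch below uses a JDK dynamic proxy, a lightweight stand-in for a full AOP framework, to accumulate the time spent in each method of a service interface. `TimingProxy` and `PriceService` are hypothetical names invented for this example; how such a wrapper would be wired around real OSGi services is left open here.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical service interface, used only to demonstrate the wrapper.
interface PriceService {
    int priceFor(String sku);
}

// Hypothetical helper, for illustration only.
class TimingProxy {

    /** Accumulated nanoseconds per "Interface.method" key. */
    static final Map<String, LongAdder> NANOS = new ConcurrentHashMap<>();

    /** Returns a proxy that delegates to target and records the duration of every call. */
    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> api, T target) {
        return (T) Proxy.newProxyInstance(
                api.getClassLoader(),
                new Class<?>[] { api },
                (proxy, method, args) -> {
                    long start = System.nanoTime();
                    try {
                        return method.invoke(target, args);
                    } catch (InvocationTargetException e) {
                        // Rethrow the exception raised by the service itself.
                        throw e.getCause();
                    } finally {
                        String key = api.getSimpleName() + "." + method.getName();
                        NANOS.computeIfAbsent(key, k -> new LongAdder())
                             .add(System.nanoTime() - start);
                    }
                });
    }
}
```

For example, `TimingProxy.wrap(PriceService.class, realService)` returns an object that behaves like `realService`, and `TimingProxy.NANOS` then holds the total time per method. The totals are wall-clock time and include the small reflection overhead of the proxy.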
Yves De Bruyne
Would providing an incentive to use the CDN help? Currently CDN hits cost the same as publisher responses...
Jörg
That could be a way :-) But it would require changing the licensing, and for that reason it is rather hard to implement.
Helge
Is there a way to optimize for build time in Cloud Manager for large repos?
Jörg
The actual build time should be independent of the repository size. For the deployment duration itself, Adobe should be responsible, and often you as a customer cannot influence it that much. (We will let you know if you can/should.)
Tomasz Sobczyk
Are you looking to provide customers / partners with any of your internal tooling for monitoring? It doesn't sound very practical to ask everyone to build their own logging / observability for an as-a-cloud-service platform.
Radu Cotescu
For logging there's log forwarding: https://experienceleague.adobe.com/en/docs/experience-manager-cloud-service/content/implementing/developing/log-forwarding For monitoring you can integrate with New Relic One: https://experienceleague.adobe.com/en/docs/experience-manager-cloud-service/content/implementing/using-cloud-manager/user-access-new-relic
Jörg
No, it's not practical, and you should not be required to do it. Adobe should take care of it. But caching is entirely project-specific, and for that reason the caching decisions you make have a high impact. That is why we are working on integrating this information and these recommendations into Cloud Manager.
Rogier
Will you add a post-content-sync CDN cache flush to help us all avoid content update issues?
Jörg
As said, it is not that easy to achieve. Ping me here on adaptTo(), then I can outline the implications. And for the other attendees I will write it up as a blog post :-)
Tomasz Sobczyk
Are you expecting to reduce the fullstack deployment times any further from the current 30+ minutes?
Jörg
What target time would you expect? Please be aware that full-stack builds always include a full build plus a deployment on all instances. And to be honest, I don't expect it to decrease dramatically. If you need a faster turnaround time, I suggest using RDEs, as many shortcuts are taken there.
Divanshu Goyal
At times, we have multi-module projects made up of several projects. The general idea is to keep the migrated old project next to a new project for new features from the day it is released to AEMaaCS; we keep the old CQ projects and just make them compatible. After a while this becomes a bottleneck, because the old project gets compiled every time although it is only meant to sit on the instance, and because of the immutable behaviour of the repo we can't take it away. Is there a way that project can be once compiled and sit in
Jörg
There are ways to pull binaries into the "all" Maven project ... (e.g. using a private Maven repository)
wolf
Adobe is actively discouraging the use of a proper CDN configuration (configuration is still limited; campaign string blocking is a simple change that everyone has been waiting on for years), and it is discouraging BYOCDN (via licensing terms and a lack of support for connecting at the correct endpoints). What incentive is there for a customer to optimise and actually apply good caching practices? There are models of operation (e.g. a post-content-sync flush to ensure coherency) that we have not even heard of being in the plan…
Jörg
There is the incentive of delivering a decent experience to your users, which is much harder if the CDN is not used. (Approach me regarding the post-content-sync flush; I want to understand more about that requirement. Thanks!)
Krystian Panek
Could you describe in more detail in the documentation how an AEMaaCS author cluster works in practice? There are hidden details that can cause problems for developers when building features. For example, sometimes the same node handles both HTTP traffic and Sling jobs, while other times these responsibilities are separated.
Jörg
Sling jobs are executed based on the queue properties. That can mean that either the local node executes the job or just the cluster leader. Right now we haven't implemented a way to split the job execution into dedicated services that don't respond to incoming HTTP requests.
Konrad
How can we manually defer making a server ready (to wait a short time for a custom service to do some initialization)?
Radu Cotescu
You could use the org.apache.felix.hc.generalchecks.DsComponentsCheck and friends, with these tags: "hc.tags":[ "systemalive", "systemready"]. Starting with release 21005, there's a REQUIRED_CUSTOM_OSGI_COMPONENTS environment variable that you can set to provide a CSV list of component names that have to be active for the system to be considered ready. Jörg worked on this feature, so he could give you some more details about it.