Leveraging Asynchronous Jobs in AEM's Multi-Site Manager
In this session, we explore how AEM's asynchronous job framework works with Multi-Site Manager (MSM) to efficiently handle resource-intensive operations like creating live copies and content rollouts across multiple sites.
Combined with AEM's robust job management APIs, the MSM async jobs provide scalable content replication, improved performance for large operations, and reliable tracking mechanisms for complex site management tasks.
You will learn:
- How an async job works under the hood
- How to implement custom async jobs following MSM job patterns for your own complex content operations
- How to track, monitor, and manage async jobs using AEM's job management APIs
- How to avoid common mistakes that can occur during time-consuming async operations
This session will demonstrate practical techniques for extending AEM's async job framework, focusing on:
- Async Job Architecture – Understanding the core components of AEM's job system
- MSM Async Operations – Common features and limitations
- Job Monitoring and Error Handling – Implementing robust tracking and recovery mechanisms
- Custom Async Job Implementation – Creating your own async jobs following best practices from MSM
This session will help developers and architects leverage AEM's asynchronous capabilities to build more efficient and scalable content management solutions.
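Before diving into the Q&A, the submit/track/query life cycle described above can be illustrated with a minimal, language-level sketch. This uses plain java.util.concurrent rather than the Sling API; in AEM these roles are played by Sling's JobManager and job consumers, and every name below (MiniJobManager, addJob, getJobById) is a hypothetical stand-in, not the real API.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.*;

// Minimal async-job sketch: submit work, poll status, fetch the result.
// Hypothetical stand-in for the JobManager pattern MSM builds on.
public class MiniJobManager {
    public enum State { QUEUED, ACTIVE, SUCCEEDED, FAILED }

    public static final class Job {
        final String id = UUID.randomUUID().toString();
        volatile State state = State.QUEUED;
        volatile String message = "";
    }

    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final Map<String, Job> jobs = new ConcurrentHashMap<>();

    /** Queue a job; the callable does the actual (slow) work. */
    public String addJob(Callable<String> work) {
        Job job = new Job();
        jobs.put(job.id, job);
        pool.submit(() -> {
            job.state = State.ACTIVE;
            try {
                job.message = work.call();
                job.state = State.SUCCEEDED;
            } catch (Exception e) {
                job.message = String.valueOf(e.getMessage());
                job.state = State.FAILED;   // keep the job around for inspection
            }
        });
        return job.id;
    }

    public Job getJobById(String id) { return jobs.get(id); }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

A caller would submit the rollout as a Callable, keep the returned id, and poll getJobById until the job reaches a terminal state; keeping failed jobs queryable is what enables the monitoring and retry discussed below.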
Henry Kuijpers
At our customer, we face a lot of issues with this: we have so much content (100+ websites running on one AEM instance). Often there is an error in some content somewhere, and we really have to dive into the logs to figure out which content is broken. It can't be fixed by an author; we have to do it ourselves. Are there any improvements you are going to make? Or will the topic presented here soon be contained in AEM, so that we get better processing in a new service pack?
Alexandru Tudoran
As mentioned on stage, the errors should be visible in their respective steps, with a bit more detailed information about why the step failed and at which page (path). Usually, if there is a content issue, we believe the content editors should decide why that page was missing, for example, and whether they want to bring it back and continue with the rollout job. We don't want to start fixing or restoring content that might have been intentionally deleted.
Robert Wunsch
Will this "Job-Manager" be open source and actively developed?
Alexandru Stancioiu
The Job Manager interface is part of the Sling Event Jobs API: https://github.com/apache/sling-org-apache-sling-event-api/blob/master/src/main/java/org/apache/sling/event/jobs/JobManager.java, and an implementation we can use in AEM is available in Sling as well: https://github.com/apache/sling-org-apache-sling-jobs/blob/master/src/main/java/org/apache/sling/jobs/impl/JobManagerImpl.java. Sling is open source and has been robust enough to serve as the underlying framework for long-running operations in AEM for a very long time. We don't think it needs active development, because it already works well and is quite feature-rich and developer-friendly!
Henry Kuijpers
Why do live copies/blueprints/rollouts actually take SO MUCH time? Isn't it crazy that this amount of time is needed? I understand that a lot has to be figured out and that there are lots of writes to the repository, but we can't make customers understand this (and we don't understand it either). (And we also can't purchase SSDs just to make this a little faster.)
Alexandru Tudoran
We plan to have a workshop with the Oak team to see what we can do to improve these operations. Content certainly won't shrink in the future, and we want to make sure these operations get better instead of worse.
Konrad
Why don't you open the APIs for Granite Async Jobs ("com.adobe.granite:com.adobe.granite.jobs.async")? They are neither part of the uber-jar/aem-sdk-api nor part of the Javadoc (https://developer.adobe.com/experience-manager/reference-materials/cloud-service/javadoc/index.html).
Alexandru Stancioiu
We prefer to preserve the ability to make changes to the classes exported from that bundle; if customers start introducing dependencies on that package, we cannot easily do that without causing issues when rolling out a new cloud release. From a broader perspective, we would like to encourage the use of OpenAPI APIs instead of Java APIs, and we have started releasing more and more of them. For example, we have the MSM OpenAPIs here: https://developer.adobe.com/experience-cloud/experience-manager-apis/api/experimental/sites/msm/. Of course, they might not be flexible enough (as Java APIs are) to cover your use case, but they are worth some consideration at least!
Henry Kuijpers
For me, intermittent commits are not the solution to this problem; I would rather have 100% correct content (or unchanged content if something goes wrong) than 10% broken/unhandled content. This doesn't make sense, does it?
Konrad
Oak does not scale that well with large uncommitted session changes, so that is only an option if the number of changes to the JCR is rather low!
Henry Kuijpers
Do we have any numbers on that? There would probably be 1,000+ or maybe 10,000+ modified nodes; is that already too much for Oak?
Thomas
I would prefer a useful split (e.g. by country/languages) over a rollout that never works any time.
Henry Kuijpers
That's already a bit better -- but it means you'll need to sort the result set as well. That could indeed be an option.
Konrad
FileVault commits after 1024 modified nodes: https://github.com/apache/jackrabbit-filevault/blob/4806680419c3b403c2a7ce8c46708eaa96a48e92/vault-core/src/main/java/org/apache/jackrabbit/vault/fs/io/AutoSave.java#L56. That works quite well, but I don't have numbers at hand. It is not only about performance but also about memory consumption though.
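The FileVault mechanism Konrad links to can be sketched independently of JCR: count pending modifications and flush every time a threshold is reached, which bounds both the transient-space memory and the work lost on a failure. Below is a simplified, hypothetical stand-in (a Runnable in place of Session.save()); the real AutoSave class also deals with cases where a save cannot succeed mid-batch.

```java
// Batched-commit sketch modeled on FileVault's AutoSave idea:
// save every `threshold` modified nodes instead of once at the end.
public class AutoSave {
    private final int threshold;   // e.g. 1024, as in FileVault
    private final Runnable saver;  // hypothetical stand-in for Session.save()
    private int pending = 0;
    private int saves = 0;

    public AutoSave(int threshold, Runnable saver) {
        this.threshold = threshold;
        this.saver = saver;
    }

    /** Call once per modified node; saves when the threshold is reached. */
    public void modified() {
        pending++;
        if (pending >= threshold) {
            saver.run();
            saves++;
            pending = 0;
        }
    }

    /** Flush whatever is left at the end of the operation. */
    public void finish() {
        if (pending > 0) {
            saver.run();
            saves++;
            pending = 0;
        }
    }

    public int saveCount() { return saves; }
}
```

With a threshold of 1024, an operation touching 2,500 nodes would save three times instead of holding all 2,500 changes in the session at once.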
Henry Kuijpers
It could also just fix the content instead of breaking on it, or skip it, finish everything, and report that a few paths are broken for whatever reason.
Robin
I hope this (the intermediate commits) is indeed not enabled by default. That can create so many dead links to failed pages, etc.
Alexandru Tudoran
The feature with intermittent commits is hidden behind a feature toggle, but it is active for all of our customers on the September 2025 release or higher. It can be deactivated on demand, even though we don't see a reason why they wouldn't want to benefit from it. A job failing at 90% leaves the remaining 10% missing rather than broken, which in most cases can easily be fixed; where customers have removed, modified, or moved content, the issue is now clearly visible in the Step Logs, and they can either trigger a Rollout for the remaining content or retry the job from the Jobs Console UI.
Alexandru Tudoran
> I would prefer a useful split (e.g. by country/languages) over a rollout that never works any time.

The algorithm that splits the pages uses a Breadth-First Search approach, because you cannot create a sub-tree in JCR without its parent node already existing. We therefore build up an ordered list, which we then split into batches of X pages. However, a depth-first (DFS) approach would also work and could be nicer in cases where the content is organised by /country/languages/etc. Maybe we can offer both implementations and let customers decide which one they want to use.
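The ordering constraint described here (a page can only be created once its parent exists) is exactly what a breadth-first traversal guarantees. A minimal sketch, with a hypothetical in-memory page tree standing in for the JCR:

```java
import java.util.*;

// BFS over a page tree yields an order in which every parent path
// precedes its children; that order can then be cut into fixed-size batches.
public class RolloutBatcher {

    // `children` maps a parent path to its child paths (hypothetical tree).
    static List<String> bfsOrder(String root, Map<String, List<String>> children) {
        List<String> order = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>(List.of(root));
        while (!queue.isEmpty()) {
            String page = queue.removeFirst();
            order.add(page);
            queue.addAll(children.getOrDefault(page, List.of()));
        }
        return order;
    }

    /** Split the ordered list into batches of at most `size` pages. */
    static List<List<String>> batches(List<String> order, int size) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < order.size(); i += size) {
            out.add(order.subList(i, Math.min(i + size, order.size())));
        }
        return out;
    }
}
```

A depth-first variant would differ only in using a stack instead of a queue, which keeps whole subtrees (e.g. one country/language branch) together within consecutive batches while still emitting parents before children.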
Mehdi Al.
Did you consider relying on a lower-abstraction API, like the JCR API, to accelerate the processing?
Jörg
The performance overhead of higher-level APIs is not that high. I doubt that a rewrite with pure JCR will yield significantly better performance.
Amine
What’s the biggest challenge you still face with asynchronous jobs in MSM?
Alexandru Tudoran
We would like to make it a bit smarter. We've seen customers with 1,000 pages and very few properties on each page, and customers with a few pages and loads of properties. Dynamically deciding after how many pages, or more specifically how many properties, we should commit is one idea we have for the future. We're also thinking about a mechanism that caches the modified pages, so that the rollout operation no longer has to check the entire content tree for modifications to decide whether to roll out.
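The "smarter" commit policy described above could be sketched as a property-weighted budget rather than a fixed page count; everything below is a hypothetical illustration of that idea, not the MSM implementation.

```java
// Property-weighted commit policy sketch: instead of committing every
// N pages, accumulate the number of properties written and signal a
// commit once the budget is exceeded, so property-heavy pages trigger
// earlier saves than pages with only a handful of properties.
public class WeightedCommitPolicy {
    private final int propertyBudget;
    private int accumulated = 0;

    public WeightedCommitPolicy(int propertyBudget) {
        this.propertyBudget = propertyBudget;
    }

    /** Record one rolled-out page; returns true when the caller should commit. */
    public boolean pageWritten(int propertyCount) {
        accumulated += propertyCount;
        if (accumulated >= propertyBudget) {
            accumulated = 0;   // the caller commits; reset the budget
            return true;
        }
        return false;
    }
}
```

With a budget of, say, 100 properties, three pages of 40 properties each trigger one commit, while a hundred near-empty pages would be batched into far fewer commits than a fixed per-page threshold would produce.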