[CIR][MERGE-SPLIT-COMPILATION] Adds a compilation step that merges Device and Host CIR into a single module and can co-optimize their execution #2097

koparasy · 2026-01-05T19:25:36Z

This PR is a draft / working skeleton that documents my current progress on CIR host–device combine. The goal here is not to land this as-is, but to outline the individual components that need to change to enable the desired functionality.

Once I have an end-to-end pipeline working for PolyBench on AMD, I plan to split this into smaller, focused, reviewable PRs. Until then, this PR serves as a reference point for understanding the overall direction and moving pieces.

Current status

Introduce a new cc1 Action that reads two CIR inputs (host + device) and merges them under a single module.
Fix existing Actions to correctly operate on CIR bytecode instead of textual CIR.
- ⚠️ There is a known memory leak here that still needs to be addressed.
Fix parsing of selected CIR attributes:
- GPUBinaryHandle
- ConstDataArrays
Add HIP ABI support (tracked separately as PR [CIR][HIP] Add ABI support for HIP #2087).
Allow the driver to emit .cir files and make the driver aware of CIR lowering.
This introduces a new driver flag that generates an Action of type TY_CIR that is intentionally non-mergeable with earlier Actions.
- This separation is required so we can hook the follow-up Action that consumes both host and device CIR.

Next steps

Make the Driver dependency resolution work properly and invoke the individual tools correctly.
Generate the respective tool. Do we want the tool to be a separate CC1 command or a standalone tool?
Extend the tool to allow splitting of the CIR file. Every command needs to emit a single file to follow Driver semantics, but the current implementation splits and emits both Host and Device.
Build PolyBench to verify that this is working on something more realistic.

This commit introduces a new cc1 frontend action, -cir-combine, which enables combining separately generated host and device CIR modules into a single CIR container for downstream processing.

Key changes

Add a new frontend program action frontend::CIRCombine and corresponding CIRCombineAction.
Introduce typed cc1 options:
- -cir-host-input <file>: specifies the host CIR module.
- -cir-device-input <file>: specifies the device CIR module.
Enforce strict argument validation in ParseFrontendArgs:
- -cir-combine requires -fclangir.
- Exactly one host and one device CIR input must be provided.
- User-provided positional inputs are rejected.
Extend FrontendOptions to store CIR host/device inputs.
Ensure correct cc1 argument round-trip by regenerating -cir-combine, -cir-host-input, and -cir-device-input.
Treat -cir-combine as a non-source action:
- Suppress implicit stdin input and language (-x) emission.
- Inject a synthetic precompiled input to drive frontend execution without parsing source.
Wire the new action through frontend action creation so it executes correctly.

This lays the groundwork for split compilation workflows in ClangIR, allowing host and device CIR to be analyzed and transformed together before lowering, while preserving existing offload and bundling flows.
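The argument validation rules listed in this commit message can be sketched as a small standalone check. This is only an illustration of the rules, not the code in ParseFrontendArgs: the `CombineArgs` struct, its field names, the helper functions, and the error strings are all hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the cc1 state relevant to -cir-combine.
// Field names are illustrative; they are not the FrontendOptions members.
struct CombineArgs {
  bool CIRCombine = false;               // -cir-combine was given
  bool ClangIR = false;                  // -fclangir was given
  std::vector<std::string> HostInputs;   // one entry per -cir-host-input
  std::vector<std::string> DeviceInputs; // one entry per -cir-device-input
  std::vector<std::string> Positional;   // user-provided positional inputs
};

// Returns an error message if a rule is violated, std::nullopt otherwise.
std::optional<std::string> validateCombineArgs(const CombineArgs &A) {
  if (!A.CIRCombine)
    return std::nullopt; // the rules only apply to the combine action
  if (!A.ClangIR)
    return "-cir-combine requires -fclangir";
  if (A.HostInputs.size() != 1)
    return "-cir-combine requires exactly one -cir-host-input";
  if (A.DeviceInputs.size() != 1)
    return "-cir-combine requires exactly one -cir-device-input";
  if (!A.Positional.empty())
    return "-cir-combine does not accept positional inputs";
  return std::nullopt;
}

// Returns the (host, device) pair; callers must have validated the arguments.
std::pair<std::string, std::string> combineInputs(const CombineArgs &A) {
  assert(A.CIRCombine && !validateCombineArgs(A) &&
         "arguments must be validated first");
  return {A.HostInputs.front(), A.DeviceInputs.front()};
}
```

Returning the first violated rule keeps the diagnostic specific; a real implementation would report through the Clang diagnostics engine instead of returning strings.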

* Reads 2 CIR code files and creates a 'combined' region of these files
* Improves CLI arg checking
* Adds CLI test

koparasy · 2026-01-05T19:26:28Z

@RiverDave and @bcardosolopes I will keep this PR open so that both of you can have an idea of the progress.

graph between actions. They do not properly carry the Input suffix and their device.
* Notes for next sprint:
  1. CombineCIRActions needs to look more like an OffloadAction
     * It will contain a Host Input and a Device Input (one of each)
     * It needs to explicitly set Device and Host Kinds
  2. SplitCIRAction needs to follow the CombineCIR paradigm.
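One way to picture the shape described in these notes is a job that holds exactly one host input and one device input, each carrying an explicitly set kind together with its triple and arch. The sketch below is standalone and hypothetical: `Side`, `SideInput`, and `CombineCIRJob` are illustrative names, not clang::driver classes.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins; these are not the clang::driver types.
enum class Side { Host, Device };

struct SideInput {
  std::string File;   // CIR file produced by the compiler action
  Side Kind;          // set explicitly instead of being inferred
  std::string Triple; // e.g. "amdgcn-amd-amdhsa"
  std::string Arch;   // e.g. "gfx942"; empty when not applicable
};

class CombineCIRJob {
  SideInput Host;
  SideInput Device;

public:
  CombineCIRJob(SideInput H, SideInput D)
      : Host(std::move(H)), Device(std::move(D)) {
    // Exactly one input of each kind, in a fixed order.
    assert(Host.Kind == Side::Host && "first input must be the host side");
    assert(Device.Kind == Side::Device &&
           "second input must be the device side");
  }

  const SideInput &host() const { return Host; }
  const SideInput &device() const { return Device; }

  // Target string that keeps the arch attached to the triple,
  // e.g. "amdgcn-amd-amdhsa:gfx942".
  static std::string target(const SideInput &In) {
    return In.Arch.empty() ? In.Triple : In.Triple + ":" + In.Arch;
  }
};
```

Keeping the arch on the input itself means a follow-up split step can read it back from the combine job rather than re-deriving it from the action graph.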

koparasy · 2026-01-06T16:40:31Z

Currently the clang driver action graph looks like this:

                     +- 0: input, "vecadd.cu", hip, (host-hip)
                  +- 1: preprocessor, {0}, hip-cpp-output, (host-hip)
               +- 2: compiler, {1}, cir, (host-hip)
               |     +- 3: input, "vecadd.cu", hip
               |  +- 4: preprocessor, {3}, hip-cpp-output
               |- 5: compiler, {4}, cir
            +- 6: comebinecir, {2, 5}, cir
         +- 7: splitcir, {6}, cir, (host-hip)
         |              +- 8: splitcir, {6}, cir
         |           +- 9: backend, {8}, assembler, (device-hip, gfx942)
         |        +- 10: assembler, {9}, object, (device-hip, gfx942)
         |     +- 11: linker, {10}, image, (device-hip, gfx942)
         |  +- 12: offload, "device-hip (amdgcn-amd-amdhsa:gfx942)" {11}, image
         |- 13: linker, {12}, hip-fatbin, (device-hip)
      +- 14: offload, "host-hip (x86_64-unknown-linux-gnu)" {7}, "device-hip (amdgcn-amd-amdhsa)" {13}, cir
   +- 15: backend, {14}, assembler, (host-hip)
+- 16: assembler, {15}, object, (host-hip)
17: linker, {16}, image, (host-hip)

We are losing information regarding the device arch.

* There is still room for improvement, and for reusing quite a few of the existing base-class Action capabilities.

Output looks like this:

                     +- 0: input, "vecadd.cu", hip, (host-hip)
                  +- 1: preprocessor, {0}, hip-cpp-output, (host-hip)
               +- 2: compiler, {1}, cir, (host-hip)
               |     +- 3: input, "vecadd.cu", hip, (device-hip, gfx942)
               |  +- 4: preprocessor, {3}, hip-cpp-output, (device-hip, gfx942)
               |- 5: compiler, {4}, cir, (device-hip, gfx942)
            +- 6: comebinecir, "host-hip (x86_64-unknown-linux-gnu)" {2}, "device-hip (amdgcn-amd-amdhsa:gfx942)" {5}, cir
         +- 7: splitcir, {6}, cir, (host-hip)
         |              +- 8: splitcir, {6}, cir, (device-hip, gfx942)
         |           +- 9: backend, {8}, assembler, (device-hip, gfx942)
         |        +- 10: assembler, {9}, object, (device-hip, gfx942)
         |     +- 11: linker, {10}, image, (device-hip, gfx942)
         |  +- 12: offload, "device-hip (amdgcn-amd-amdhsa:gfx942)" {11}, image
         |- 13: linker, {12}, hip-fatbin, (device-hip)
      +- 14: offload, "host-hip (x86_64-unknown-linux-gnu)" {7}, "device-hip (amdgcn-amd-amdhsa)" {13}, cir
   +- 15: backend, {14}, assembler, (host-hip)
+- 16: assembler, {15}, object, (host-hip)
17: linker, {16}, image, (host-hip)

bcardosolopes · 2026-01-13T00:17:36Z

clang/include/clang/CIR/Dialect/IR/CIROps.td

+  let summary = "Container for host and device CIR modules";
+  let description = [{
+    `cir.offload.container` is a top-level container used to keep host and device
+    CIR modules together for joint analysis and transformation.


I was originally thinking about something like:

module {
  cir.host { ... }
  cir.device { ... }
}

Any reason why you need the cir.offload.container wrapping both of them?
Based on our experience optimizing across libraries with the LLVM IR dialect, we had to flatten the namespace and put everything in the same module, adding an extra attribute to indicate each symbol's origin so that we could split them back after optimizations. In our case, many problems came from symbol definitions being available only within another cir.library, with MLIR unable to properly handle symbol tables across them. Is this a problem for this approach?
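The flatten-then-split idea from this comment can be illustrated with a toy model in which a module is just a list of symbol names, each tagged with its origin. This is a sketch under that assumption only; `Origin`, `Symbol`, `FlatModule`, `flatten`, and `splitByOrigin` are hypothetical names, and real CIR would carry the origin as an attribute on the operation.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: each symbol remembers which side it came from.
enum class Origin { Host, Device };

struct Symbol {
  std::string Name;
  Origin From;
};

using FlatModule = std::vector<Symbol>;

// Flatten both symbol lists into one module, recording each symbol's origin.
FlatModule flatten(const std::vector<std::string> &Host,
                   const std::vector<std::string> &Device) {
  FlatModule M;
  std::set<std::string> Seen;
  auto Add = [&](const std::string &N, Origin O) {
    // A single symbol table needs unique names; this sketch only checks for
    // clashes, whereas a real pass would have to rename them.
    bool Inserted = Seen.insert(N).second;
    assert(Inserted && "symbol name clash between host and device");
    (void)Inserted;
    M.push_back({N, O});
  };
  for (const auto &N : Host)
    Add(N, Origin::Host);
  for (const auto &N : Device)
    Add(N, Origin::Device);
  return M;
}

// Split the flat module back into (host, device) using the origin tag.
std::pair<std::vector<std::string>, std::vector<std::string>>
splitByOrigin(const FlatModule &M) {
  std::pair<std::vector<std::string>, std::vector<std::string>> Out;
  for (const auto &S : M)
    (S.From == Origin::Host ? Out.first : Out.second).push_back(S.Name);
  return Out;
}
```

Because every symbol lives in one table while flattened, cross-side references resolve like any other reference, and the origin tag is the only thing needed to undo the merge.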

bcardosolopes · 2026-01-13T00:18:38Z

clang/include/clang/Driver/Action.h

    StaticLibJobClass,
    BinaryAnalyzeJobClass,
    BinaryTranslatorJobClass,
+    CIRCombineJobClass,


CIRCombineJobClass -> CIRCombineHostDeviceJobClass

koparasy added 10 commits December 29, 2025 10:53

Remove tooling

cb4c90d

Combine CIR code into a single module

f94fbfb


Allow cir-combine to split back to device/host modules

8a307be

Allow lowering to object files through emit-* flags

9dc8414

Add ABI support for HIP

63a8f40

GpuBinaryHandle Attribute is now stringref

a9476f7

Proper parsing of ConstDataArays

ed0ff3c

Read/Write bytecode

0e2a637

[WIP][Non-Functional] Make Driver aware of cir combine actions

e54f4b8

koparasy added 2 commits January 5, 2026 18:23

Minor changes to relax action checking

463d7c5

bcardosolopes reviewed Jan 13, 2026
