@koparasy
Contributor

@koparasy koparasy commented Jan 5, 2026

This PR is a draft / working skeleton that documents my current progress on CIR host–device combine. The goal here is not to land this as-is, but to outline the individual components that need to change to enable the desired functionality.

Once I have an end-to-end pipeline working for PolyBench on AMD, I plan to split this into smaller, focused, reviewable PRs. Until then, this PR serves as a reference point for understanding the overall direction and moving pieces.

Current status

  • Introduce a new cc1 Action that reads two CIR inputs (host + device) and merges them under a single module.
  • Fix existing Actions to correctly operate on CIR bytecode instead of textual CIR
    • ⚠️ There is a known memory leak here that still needs to be addressed.
  • Fix parsing of selected CIR attributes:
    - GPUBinaryHandle
    - ConstDataArrays
  • Add HIP ABI support (tracked separately as PR [CIR][HIP] Add ABI support for HIP #2087).
  • Allow the driver to emit .cir files and make the driver aware of CIR lowering.
  • This introduces new driver flags that generate an Action of TY_CIR type that is intentionally non-mergeable with earlier Actions.
    - This separation is required so we can hook the follow-up Action that consumes both host and device CIR.

Next steps

  • Make the Driver dependency resolution work properly and call the individual tools properly.
  • Generate the respective tool. Open question: should the tool be a separate CC1 command or an individual tool?
  • Extend the tool to allow splitting of the CIR file. Every command needs to emit a single file to follow Driver semantics. Currently the implementation splits and emits both Host and Device.
  • Build PolyBench to verify that this is working on something more realistic.

This commit introduces a new cc1 frontend action, -cir-combine, which
enables combining separately generated host and device CIR modules into
a single CIR container for downstream processing.

Key changes

Add a new frontend program action frontend::CIRCombine and corresponding
CIRCombineAction.

Introduce typed cc1 options:

-cir-host-input <file>: specifies the host CIR module.
-cir-device-input <file>: specifies the device CIR module.

Enforce strict argument validation in ParseFrontendArgs:

-cir-combine requires -fclangir.

Exactly one host and one device CIR input must be provided.

User-provided positional inputs are rejected.

Extend FrontendOptions to store CIR host/device inputs.

Ensure correct cc1 argument round-trip by regenerating -cir-combine,
-cir-host-input, and -cir-device-input.

Treat -cir-combine as a non-source action:

Suppress implicit stdin input and language (-x) emission.

Inject a synthetic precompiled input to drive frontend execution
without parsing source.

Wire the new action through frontend action creation so it executes
correctly.

This lays the groundwork for split compilation workflows in ClangIR,
allowing host and device CIR to be analyzed and transformed together
before lowering, while preserving existing offload and bundling flows.
* Reads 2 CIR code files and creates a 'combined' region of these files
* Improves CLI arg checking
* Adds CLI test
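
The validation rules listed in the commit message can be sketched as a small standalone check. This is only an illustration of the rules, not the code in ParseFrontendArgs: the `CombineArgs` struct, the `validateCombineArgs` function, and the error strings are all hypothetical names invented for this sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the subset of FrontendOptions relevant here.
struct CombineArgs {
  bool CIRCombine = false;               // -cir-combine was passed
  bool ClangIR = false;                  // -fclangir was passed
  std::vector<std::string> HostInputs;   // one entry per -cir-host-input <file>
  std::vector<std::string> DeviceInputs; // one entry per -cir-device-input <file>
  std::vector<std::string> Positional;   // user-provided positional inputs
};

// Returns an empty string when the arguments are accepted, otherwise an
// (illustrative) error message describing the first violated rule.
std::string validateCombineArgs(const CombineArgs &A) {
  if (!A.CIRCombine)
    return ""; // the rules only apply to the combine action
  if (!A.ClangIR)
    return "-cir-combine requires -fclangir";
  if (A.HostInputs.size() != 1)
    return "expected exactly one -cir-host-input";
  if (A.DeviceInputs.size() != 1)
    return "expected exactly one -cir-device-input";
  if (!A.Positional.empty())
    return "positional inputs are not allowed with -cir-combine";
  return "";
}
```

Checking the rules in a fixed order means a caller sees one error at a time; the real implementation may report diagnostics differently.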
@koparasy
Contributor Author

koparasy commented Jan 5, 2026

@RiverDave and @bcardosolopes I will keep this PR open so that both of you can have an idea of the progress.

graph between actions. They do not carry properly the Input suffix and
their device.

* Notes for next sprint:
  1. CombineCIRActions needs to look more like an OffloadAction
    * It will contain a Host Input and a Device Input (One of each)
    * It needs to explicitly set Device and Host Kinds
  2. SplitCIRAction needs to follow the CombineCIR paradigm.
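
A minimal sketch of the shape these notes describe: a combine node holding exactly one host and one device input with explicitly set kinds, and a split step that reads its kind and arch from that node. The types below (`OffloadKind`, `CIRInput`, `CombineCIR`, `makeCombine`, `splitFor`) are hypothetical and are not Clang's driver classes.

```cpp
#include <string>

// Hypothetical types for illustration only; not Clang's Action hierarchy.
enum class OffloadKind { Host, Device };

struct CIRInput {
  std::string File;
  OffloadKind Kind; // set explicitly, never inferred
  std::string Arch; // e.g. "gfx942"; empty on the host side
};

// Exactly one host input and one device input.
struct CombineCIR {
  CIRInput Host;
  CIRInput Device;
};

CombineCIR makeCombine(const std::string &HostFile,
                       const std::string &DeviceFile,
                       const std::string &DeviceArch) {
  return CombineCIR{{HostFile, OffloadKind::Host, ""},
                    {DeviceFile, OffloadKind::Device, DeviceArch}};
}

// The split step follows the same paradigm: the output for one side copies
// the kind and arch recorded on the combine node, so neither is lost.
CIRInput splitFor(const CombineCIR &C, OffloadKind Kind,
                  const std::string &OutFile) {
  const CIRInput &Src = (Kind == OffloadKind::Host) ? C.Host : C.Device;
  return CIRInput{OutFile, Src.Kind, Src.Arch};
}
```

Keeping the arch on the combine node is one way to make sure each split output can still be tagged with its device arch.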
@koparasy
Contributor Author

koparasy commented Jan 6, 2026

Currently the clang driver action graph looks like this:

                     +- 0: input, "vecadd.cu", hip, (host-hip)
                  +- 1: preprocessor, {0}, hip-cpp-output, (host-hip)
               +- 2: compiler, {1}, cir, (host-hip)
               |     +- 3: input, "vecadd.cu", hip
               |  +- 4: preprocessor, {3}, hip-cpp-output
               |- 5: compiler, {4}, cir
            +- 6: comebinecir, {2, 5}, cir
         +- 7: splitcir, {6}, cir, (host-hip)
         |              +- 8: splitcir, {6}, cir
         |           +- 9: backend, {8}, assembler, (device-hip, gfx942)
         |        +- 10: assembler, {9}, object, (device-hip, gfx942)
         |     +- 11: linker, {10}, image, (device-hip, gfx942)
         |  +- 12: offload, "device-hip (amdgcn-amd-amdhsa:gfx942)" {11}, image
         |- 13: linker, {12}, hip-fatbin, (device-hip)
      +- 14: offload, "host-hip (x86_64-unknown-linux-gnu)" {7}, "device-hip (amdgcn-amd-amdhsa)" {13}, cir
   +- 15: backend, {14}, assembler, (host-hip)
+- 16: assembler, {15}, object, (host-hip)
17: linker, {16}, image, (host-hip)

We are losing information regarding the device arch.

* There is still room for improvement, including reusing more of the
  existing base-class Action capabilities

Output looks like this:

                     +- 0: input, "vecadd.cu", hip, (host-hip)
                  +- 1: preprocessor, {0}, hip-cpp-output, (host-hip)
               +- 2: compiler, {1}, cir, (host-hip)
               |     +- 3: input, "vecadd.cu", hip, (device-hip, gfx942)
               |  +- 4: preprocessor, {3}, hip-cpp-output, (device-hip, gfx942)
               |- 5: compiler, {4}, cir, (device-hip, gfx942)
            +- 6: comebinecir, "host-hip (x86_64-unknown-linux-gnu)" {2}, "device-hip (amdgcn-amd-amdhsa:gfx942)" {5}, cir
         +- 7: splitcir, {6}, cir, (host-hip)
         |              +- 8: splitcir, {6}, cir, (device-hip, gfx942)
         |           +- 9: backend, {8}, assembler, (device-hip, gfx942)
         |        +- 10: assembler, {9}, object, (device-hip, gfx942)
         |     +- 11: linker, {10}, image, (device-hip, gfx942)
         |  +- 12: offload, "device-hip (amdgcn-amd-amdhsa:gfx942)" {11}, image
         |- 13: linker, {12}, hip-fatbin, (device-hip)
      +- 14: offload, "host-hip (x86_64-unknown-linux-gnu)" {7}, "device-hip (amdgcn-amd-amdhsa)" {13}, cir
   +- 15: backend, {14}, assembler, (host-hip)
+- 16: assembler, {15}, object, (host-hip)
17: linker, {16}, image, (host-hip)
let summary = "Container for host and device CIR modules";
let description = [{
`cir.offload.container` is a top-level container used to keep host and device
CIR modules together for joint analysis and transformation.
Member

I was originally thinking about something like:

module {
  cir.host {
     ...
  }

  cir.device {
    ...
  }
}

Any reason why you need the cir.offload.container wrapping both of them?
Based on our experience with cross-library optimization using the LLVM IR dialect, we had to flatten the namespace and put everything in the same module, adding an extra attribute to record each symbol's origin so that we could split them back after optimizations. In our case, many problems came from symbol definitions being available only within another cir.library, with MLIR unable to properly handle symbol tables across them. Is this a problem for this approach?
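
For readers following along, the flatten-and-split idea can be sketched in plain C++ without any MLIR APIs. This is an assumption-laden toy model: symbols are just names, the "origin attribute" is an enum value, and `flatten`, `split`, `Origin`, `SymbolList`, and `FlatModule` are hypothetical names. Rejecting duplicate names is a choice made for the sketch, not something the comment above prescribes.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class Origin { Host, Device };

using SymbolList = std::vector<std::string>;
// One flat symbol table; the mapped value plays the role of the origin attribute.
using FlatModule = std::map<std::string, Origin>;

// Put host and device symbols into a single namespace, tagging each with its
// origin. A name present on both sides is reported as an error in this sketch.
FlatModule flatten(const SymbolList &Host, const SymbolList &Device) {
  FlatModule Flat;
  auto Add = [&Flat](const SymbolList &Syms, Origin From) {
    for (const std::string &Name : Syms)
      if (!Flat.emplace(Name, From).second)
        throw std::invalid_argument("duplicate symbol: " + Name);
  };
  Add(Host, Origin::Host);
  Add(Device, Origin::Device);
  return Flat;
}

// Recover the two sides from the origin tags (names come back sorted,
// because std::map iterates in key order).
std::pair<SymbolList, SymbolList> split(const FlatModule &Flat) {
  SymbolList Host, Device;
  for (const auto &Entry : Flat)
    (Entry.second == Origin::Host ? Host : Device).push_back(Entry.first);
  return {Host, Device};
}
```

With a single table, every symbol is visible to every lookup, and the origin tag is what makes the later split possible.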

StaticLibJobClass,
BinaryAnalyzeJobClass,
BinaryTranslatorJobClass,
CIRCombineJobClass,
Member

CIRCombineJobClass -> CIRCombineHostDeviceJobClass
