LoopSchedule Dialect Rationale

This document describes various design points of the loopschedule dialect, why it is the way it is, and current status. This follows in the spirit of other MLIR Rationale docs.

Introduction ¶

The loopschedule dialect provides a collection of ops to represent software-like loops after scheduling. There are currently two main kinds of loops that can be represented: pipelined and sequential. Pipelined loops allow multiple iterations of the loop to be in-flight at a time and have an associated initiation interval (II) to specify the number of cycles between the start of successive loop iterations. In contrast, sequential loops are guaranteed to only have one iteration in-flight at any given time.

A primary goal of the loopschedule dialect, as opposed to many other High-Level Synthesis (HLS) representations, is to maintain the structure of loops after scheduling. As such, the loopschedule ops are inspired by the scf and affine dialect ops.

Pipelined Loops ¶

Pipelined loops are represented with the loopschedule.pipeline op. A pipeline loop resembles a while loop in the scf dialect, but the body must contain only loopschedule.pipeline.stage and loopschedule.terminator ops. To have a better understanding of how loopschedule.pipeline works, we will look at the following example:

func.func @test1(%arg0: memref<10xi32>) -> i32 {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c10 = arith.constant 10 : index
  %c0_i32 = arith.constant 0 : i32
  %0 = loopschedule.pipeline II = 1 iter_args(%arg1 = %c0, %arg2 = %c0_i32) : (index, i32) -> i32 {
    %1 = arith.cmpi ult, %arg1, %c10 : index
    loopschedule.register %1 : i1
  } do {
    %1:2 = loopschedule.pipeline.stage start = 0 {
      %3 = arith.addi %arg1, %c1 : index
      %4 = memref.load %arg0[%arg1] : memref<10xi32>
      loopschedule.register %3, %4 : index, i32
    } : index, i32
    %2 = loopschedule.pipeline.stage start = 1 {
      %3 = arith.addi %1#1, %arg2 : i32
      pipeline.register %3 : i32
    } : i32
    loopschedule.terminator iter_args(%1#0, %2), results(%2) : (index, i32) -> i32
  }
  return %0 : i32
}

A pipeline op first defines the initial values for the iter_args. iter_args are values that will be passed back to the first stage after the last stage of the pipeline. The pipeline also defines a specific, static II. Each pipeline stage in the do block represents a series of ops run in parallel.

Values are registered at the end of a stage and passed out as results for future pipeline stages to use. Each pipeline stage must have a defined start time, which is the number of cycles between the start of the pipeline and when the first valid data will be available as input to that stage.

Finally, the terminator is called with the iter_args for the next iteration and the result values that will be returned when the pipeline completes. Even though the terminator is located at the end of the loop body, its values are passed back to a previous stage whenever needed. We do not need to wait for an entire iteration to finish before iter_args become valid for the next iteration.

Multi-cycle and pipelined ops can also be supported in pipeline loops. In the following example, assume the multiply op is bound to a 3-stage pipelined multiplier:

func.func @test1(%arg0: memref<10xi32>, %arg1: memref<10xi32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c10 = arith.constant 10 : index
  %c1_i32 = arith.constant 1 : i32
  loopschedule.pipeline II = 1 iter_args(%arg2 = %c0) : (index, i32) -> () {
    %1 = arith.cmpi ult, %arg1, %c10 : index
    loopschedule.register %1 : i1
  } do {
    %1:2 = loopschedule.pipeline.stage start = 0 {
      %3 = arith.addi %arg1, %c1 : index
      %4 = memref.load %arg0[%arg2] : memref<10xi32>
      loopschedule.register %3, %4 : index, i32
    } : index, i32
    %2:2 = loopschedule.pipeline.stage start = 1 {
      %3 = arith.muli %1#1, %c1_i32 : i32
      pipeline.register %3, %1#0 : i32
    } : i32
    loopschedule.pipeline.stage start = 4 {
      memref.store %2#0, %arg0[%2#1] : i32
      pipeline.register
    } : i32
    loopschedule.terminator iter_args(%1#0), results() : (index, i32) -> ()
  }
  return
}

Here, the II is still 1 because new values can be introduced to the multiplier every cycle. The last stage is delayed by 3 cycles because of the 3 cycle latency of the multiplier. The pipeline op is currently tightly coupled to the lowering implementation used, as the latency of operators is not represented in the IR, but rather an implicit assumption made when lowering later. The scheduling problem is constructed with these implicit operator latencies in mind. This coupling can be addressed in the future with a proper operator library to maintain explicit operator latencies in the IR.

Status ¶

Added pipeline loop representation, more documentation and rationale to come as ops are added.