Chapter 14: Synthesis Subset

Loop Unrolling, Pipelining, and Hardware Scheduling

A deep dive into loop unrolling, pipelining pragmas, and hardware scheduling in SystemC High-Level Synthesis (HLS).

How to Read This Lesson

For synthesis, the question changes from 'can C++ run this?' to 'can hardware be built from this?' Keep storage, timing, and static structure in your head as you read.

Loop Unrolling and Pipelining in HLS

In standard C++ software, loops execute sequentially on a CPU. You don't have to worry about how long they take in terms of "clock cycles," only their general algorithmic complexity (O(N)).

In High-Level Synthesis (HLS), however, C++ loops are physically transformed into silicon. The way you write your loop—and specifically where you place wait() statements—dictates whether the HLS tool generates massive parallel combinational logic, sequential state machines, or optimized hardware pipelines.

It is crucial to understand the difference between how the Accellera SystemC Simulation Kernel treats a loop and how an HLS Compiler treats it.

Source and LRM Trail

For synthesis, use Docs/LRMs/SystemC_Synthesis_Subset_1_4_7.pdf as the primary contract and Docs/LRMs/SystemC_LRM_1666-2023.pdf for base SystemC semantics. Source internals explain simulation behavior, but synthesizability is a tool contract: focus on static structure, reset modeling, wait placement, and bounded loops.

The Kernel Reality vs. The HLS Compiler

When you compile a SystemC model with GCC or Clang and link against the Accellera kernel, your loop is just a standard C++ loop. It executes sequentially on your host machine's CPU. If there is no wait(), the loop runs to completion inside a single evaluation phase without advancing simulated time, and no other process can run until it finishes because the scheduler (sc_simcontext::crunch()) is cooperative. If there is a wait(), the thread process suspends: its stack is preserved by the kernel's coroutine package (QuickThreads or pthreads, depending on the build) and control returns to the scheduler until the process is resumed on the next clock edge.

An HLS compiler (like Siemens Catapult or Cadence Stratus) behaves very differently. It parses the Abstract Syntax Tree (AST) of your C++ code. It uses the wait() statements as explicit register boundaries to slice your C++ code into a Finite State Machine (FSM).

1. Loops Without wait(): Unrolling and Combinational Logic

If you write a for or while loop that does not contain a wait() statement, you are instructing the HLS compiler that all iterations of this loop must execute in the same clock cycle.

// Executed entirely within one clock cycle
int sum = 0;
for(int i = 0; i < 4; i++) {
    sum += data[i]; 
}
result.write(sum);
wait(); // Clock edge occurs HERE

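To achieve this physically, the HLS tool must unroll the loop. It flattens the loop in the AST, creating one adder per iteration and chaining them together as pure combinational logic. The first addition, to the constant 0, is typically optimized away, which leaves three adders for this example.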

  • The Catch: If your loop iterates 10,000 times, the tool will try to generate 10,000 adders in a massive combinational chain. The resulting path delay will almost certainly exceed the clock period, so the design fails timing, and the area cost is enormous. Therefore, loops without wait() must have a small, statically determinable number of iterations.
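
As a rough illustration of what full unrolling means, the plain C++ sketch below (no SystemC required) writes the four-iteration sum by hand in its flattened form. The function names sum_rolled and sum_unrolled are hypothetical, and grouping the additions as a balanced tree is one choice a tool might make, not a guarantee about any particular compiler.

```cpp
#include <array>

// Rolled form: what the designer writes.
int sum_rolled(const std::array<int, 4>& data) {
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += data[i];
    }
    return sum;
}

// Unrolled form: every iteration becomes its own operator.
// Grouping the additions as a balanced tree shortens the longest
// combinational path from three adders in series to two.
int sum_unrolled(const std::array<int, 4>& data) {
    int lo = data[0] + data[1];   // adder 1
    int hi = data[2] + data[3];   // adder 2, in parallel with adder 1
    return lo + hi;               // adder 3
}
```

Both functions return the same value for the same input; only the structure that an HLS tool would map to hardware differs.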

2. Loops With wait(): Sequential Execution

If you place a wait() inside the loop, the HLS tool slices the AST at that boundary, generating an FSM state transition.

int sum = 0;
for(int i = 0; i < 4; i++) {
    sum += data[i]; 
    wait(); // Clock edge occurs on EVERY iteration
}
result.write(sum);

In this case, the HLS tool only needs to generate one physical adder. On clock cycle 1 (State 1), it adds data[0]. On cycle 2 (State 2), it adds data[1]. The loop will take exactly 4 clock cycles to complete, saving massive silicon area at the cost of latency.
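
The sequential behavior can be pictured with a small cycle-level model in plain C++. This is only a sketch of the kind of FSM a tool could derive, under the assumption of one state per iteration; the struct AccumulatorFsm and its members are hypothetical names, not tool output.

```cpp
#include <array>

// Hypothetical cycle-level model: one shared adder, one loop counter
// register, and one accumulator register.
struct AccumulatorFsm {
    std::array<int, 4> data{};
    int i = 0;          // loop counter register
    int sum = 0;        // accumulator register
    bool done = false;  // set once the loop has exited

    // One call models one rising clock edge.
    void clock_edge() {
        if (done) return;
        sum += data[i];            // the single adder is reused every cycle
        ++i;
        if (i == 4) done = true;   // loop exits after the fourth edge
    }
};
```

Calling clock_edge() until done is set takes four calls, which mirrors the four-cycle latency described above.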

Tool-Specific Pragmas: Unrolling and Pipelining

Because SystemC is standard C++, it doesn't have native language keywords for hardware micro-architecture. EDA vendors provide compiler directives (pragmas, macros, or tool scripts) to control exactly how the AST is transformed. The spellings below follow the AMD/Xilinx Vitis HLS convention and serve only as examples; Siemens Catapult and Cadence Stratus expose the same concepts under their own directive names, so check your tool's documentation.

  • #pragma HLS UNROLL: Tells the compiler to explicitly replicate the hardware logic for the loop body. You can specify a factor (e.g., factor=2) to partially unroll a loop, balancing area and speed.
  • #pragma HLS PIPELINE: Rather than waiting for one loop iteration to finish before starting the next, pipelining inserts registers between the stages of the datapath so that the next iteration can start while the current one is still executing. The time between starting consecutive iterations, measured in clock cycles, is known as the Initiation Interval (II); an II of 1 starts a new iteration every cycle.
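
To make the two directives more concrete, here is a plain C++ sketch under stated assumptions. sum_unrolled_by_2 shows by hand what a factor-2 unroll of an 8-iteration loop amounts to, and pipelined_cycles applies the usual idealized estimate of depth + (trips - 1) * II cycles, which ignores stalls and memory conflicts. Both function names and the parameter depth (cycles for one iteration) are hypothetical.

```cpp
#include <cassert>

// Manual equivalent of unrolling an 8-iteration loop by a factor of 2:
// half as many trips, two copies of the loop body per trip.
int sum_unrolled_by_2(const int (&data)[8]) {
    int sum = 0;
    for (int i = 0; i < 8; i += 2) {
        sum += data[i];       // body copy 1
        sum += data[i + 1];   // body copy 2
    }
    return sum;
}

// Idealized cycle count for a loop of `trips` iterations, where one
// iteration takes `depth` cycles and a new one starts every `ii` cycles.
// Setting ii == depth models a loop that is not pipelined at all.
int pipelined_cycles(int trips, int depth, int ii) {
    assert(trips > 0 && depth > 0 && ii > 0);
    return depth + (trips - 1) * ii;
}
```

Under this model, 100 iterations of a 3-cycle body take 300 cycles when run back to back and 102 cycles with an II of 1.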

Synthesis Subset LRM Restrictions

When dealing with loops, the SystemC Synthesis Subset 1.4.7 mandates:

  1. Static Bounds for Unrolling: If a loop contains no wait() (meaning it must be completely unrolled into combinational logic), the number of iterations must be statically determinable at compile time. You cannot use a dynamically changing port value as the termination condition for a loop without a wait().
  2. wait() in helper functions: Generally, a helper function may contain a wait() only if the tool can inline it into the parent thread; its wait() statements then become states in the parent thread's FSM and affect that thread's schedule.
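
One common way to satisfy the static-bound rule when the useful length is only known at run time is to loop to a compile-time maximum and guard the body. The sketch below is plain C++ rather than SystemC; MAX_LEN and guarded_sum are hypothetical names, and whether a given tool accepts this exact form is an assumption to verify against its documentation.

```cpp
#include <cassert>

constexpr int MAX_LEN = 8;  // hypothetical compile-time upper bound

// The trip count is the constant MAX_LEN, so the loop can be unrolled.
// The run-time length `n` only gates each iteration, which maps to a
// comparator and a multiplexer rather than to a variable loop bound.
int guarded_sum(const int (&data)[MAX_LEN], int n) {
    assert(n >= 0 && n <= MAX_LEN);
    int sum = 0;
    for (int i = 0; i < MAX_LEN; i++) {
        if (i < n) {
            sum += data[i];
        }
    }
    return sum;
}
```

The cost of this pattern is that hardware for all MAX_LEN iterations is built even when n is small.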

End-to-End Example: A Dot Product Unit

Below is a complete, compilable SystemC model of a Dot Product unit. The loop inside compute_thread lacks a wait(), making it a prime candidate for Loop Unrolling by an HLS compiler.

#include <systemc.h>
 
// -------------------------------------------------------------------------
// Synthesizable Hardware Module
// -------------------------------------------------------------------------
SC_MODULE(DotProductUnit) {
    sc_in<bool> clk;
    sc_in<bool> rst_n;
    
    // Arrays of ports for input vectors
    sc_in<int> a[4];
    sc_in<int> b[4];
    sc_in<bool> start;
    
    sc_out<int> result;
    sc_out<bool> valid;
 
    void compute_thread() {
        // --- RESET BLOCK ---
        result.write(0);
        valid.write(false);
        wait();
 
        // --- FUNCTIONAL BLOCK ---
        while (true) {
            if (start.read()) {
                int sum = 0;
                
                // --- LOOP UNROLLING CANDIDATE ---
                // Because there is no wait() inside this loop, the HLS compiler
                // will fully unroll this, generating 4 parallel multipliers 
                // and an adder tree that executes in a single clock cycle.
                //
                // Example vendor directive: 
                // #pragma HLS UNROLL
                for (int i = 0; i < 4; i++) {
                    sum += a[i].read() * b[i].read();
                }
                
                result.write(sum);
                valid.write(true);
            } else {
                valid.write(false);
            }
            wait(); // End of the clock cycle state
        }
    }
 
    SC_CTOR(DotProductUnit) {
        SC_CTHREAD(compute_thread, clk.pos());
        async_reset_signal_is(rst_n, false);
    }
};
 
// -------------------------------------------------------------------------
// Testbench / Simulation
// -------------------------------------------------------------------------
int sc_main(int argc, char* argv[]) {
    sc_clock clk("clk", 10, SC_NS);
    sc_signal<bool> rst_n;
    sc_signal<bool> start;
    
    sc_signal<int> a[4];
    sc_signal<int> b[4];
    sc_signal<int> result;
    sc_signal<bool> valid;
 
    // Instantiate and bind
    DotProductUnit dut("dut");
    dut.clk(clk);
    dut.rst_n(rst_n);
    dut.start(start);
    for(int i = 0; i < 4; ++i) {
        dut.a[i](a[i]);
        dut.b[i](b[i]);
    }
    dut.result(result);
    dut.valid(valid);
 
    // Initialization
    rst_n.write(false); // Assert reset
    start.write(false);
    for(int i = 0; i < 4; ++i) {
        a[i].write(0);
        b[i].write(0);
    }
 
    sc_start(15, SC_NS);
    rst_n.write(true); // Release reset
 
    // Test Case: Provide vector data
    std::cout << "@" << sc_time_stamp() << " Feeding inputs..." << std::endl;
    for(int i = 0; i < 4; ++i) {
        a[i].write(i + 1); // Vector A: [1, 2, 3, 4]
        b[i].write(2);     // Vector B: [2, 2, 2, 2]
    }
    start.write(true);
    
    // Step one clock cycle: the next rising edge samples the inputs and
    // writes result and valid, which become visible one delta cycle later
    sc_start(10, SC_NS);

    // Expected: (1*2) + (2*2) + (3*2) + (4*2) = 2 + 4 + 6 + 8 = 20
    std::cout << "@" << sc_time_stamp() << " Result: " << result.read() 
              << " (Expected 20)" << std::endl;
    std::cout << "Valid: " << (valid.read() ? "true" : "false") << std::endl;

    // Deassert start and step one more clock cycle; valid drops on that edge
    start.write(false);
    sc_start(10, SC_NS);
 
    return 0;
}

By carefully managing loops and wait() statements according to the Synthesis Subset LRM, you control whether your SystemC algorithm is synthesized into parallel combinational hardware or a sequential FSM, even though the Accellera kernel simply runs both forms as sequential C++ on the host.

Deep Dive: Accellera Source for sc_signal and update()

The sc_signal<T> channel illustrates the Evaluate-Update paradigm of SystemC. In the Accellera source, the sc_signal template is defined in the header src/sysc/communication/sc_signal.h (sc_signal.cpp holds supporting code), and it derives from sc_prim_channel.

The write() Implementation

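When you call write(const T&), the signal does not immediately change its value. Instead, it stores the requested value in m_new_val and, if that value differs from the current one, registers itself with the kernel. The listing below is simplified; the real template also takes a writer-policy parameter and performs writer checks:

template<class T>
inline void sc_signal<T>::write(const T& value_) {
    m_new_val = value_;                 // Always remember the latest request
    if( !(m_new_val == m_cur_val) ) {
        this->request_update();         // Inherited from sc_prim_channel
    }
}

The request_update() call places the channel on the update list maintained by the kernel's primitive channel registry (sc_prim_channel_registry, owned by sc_simcontext).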

The update() Phase

After the Evaluate phase finishes (all runnable processes have run), the kernel walks its update list and calls the update() virtual function on each primitive channel that requested an update. For sc_signal, a simplified version looks like:

template<class T>
inline void sc_signal<T>::update() {
    if( !(m_new_val == m_cur_val) ) {
        m_cur_val = m_new_val;
        m_value_changed_event.notify(SC_ZERO_TIME); // Notify processes sensitive to value_changed_event()
    }
}

This guarantees that every process running in the same evaluation phase reads the old value. The new value becomes visible only in the next delta cycle, which models how registers clocked by the same edge behave in hardware.
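
The same two-phase idea can be reproduced in a few lines of plain C++. The sketch below is a hypothetical miniature, not the Accellera API: MiniKernel, MiniSignal, and swap_in_one_delta are invented names, and events and sensitivity are left out entirely.

```cpp
#include <vector>

class MiniSignal;

// Hypothetical stand-in for the kernel's update list.
struct MiniKernel {
    std::vector<MiniSignal*> update_list;
    void update_phase();  // defined after MiniSignal
};

class MiniSignal {
public:
    MiniSignal(MiniKernel& k, int init) : kernel_(k), cur_(init), new_(init) {}

    int read() const { return cur_; }   // always the current value

    void write(int v) {
        new_ = v;                        // remember the latest request
        if (new_ != cur_ && !pending_) {
            pending_ = true;             // queue at most once per phase
            kernel_.update_list.push_back(this);
        }
    }

    void update() {
        pending_ = false;
        cur_ = new_;                     // the new value becomes visible here
    }

private:
    MiniKernel& kernel_;
    int cur_;
    int new_;
    bool pending_ = false;
};

void MiniKernel::update_phase() {
    for (MiniSignal* s : update_list) s->update();
    update_list.clear();
}

// Two "processes" exchange values within one evaluate phase.
void swap_in_one_delta(MiniKernel& k, MiniSignal& a, MiniSignal& b) {
    a.write(b.read());   // process 1
    b.write(a.read());   // process 2 still reads the old value of a
    k.update_phase();
}
```

Because reads return the current value until update_phase() runs, the two signals swap cleanly, just as two registers exchanging contents on one clock edge would.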
