API documentation

class llm_kernel_tuner.LLMKernelTransformer(kernel_code: str, llm: BaseChatModel | None = None, tuning_strategy: BaseTuningStrategy | None = None, tests: List[KernelTest] = [], testing_strategies: List[BaseTestingStrategy] = [], transformer_retry_policy: RetryPolicy = RetryPolicy(max_retries=3), device: int = 0, clang_args: List[str] = [], cuda_gpu_arch: str | None = None, time_per_test: int = 15, strip_thinking_output: bool = False, thinking_pattern: str = '<think>.*?</think>\\s*', structured_output_type: StructuredOutputType = StructuredOutputType.DEFAULT, performance_threshold: float = 1.0, verbosity: Literal['none', 'debug', 'info', 'warning', 'all'] = 'info')

The main class that orchestrates the process of analyzing, transforming, and tuning a given compute kernel using LLMs and predefined strategies.

It sets up a workflow involving kernel analysis (description, problem size, outputs), test generation, test validation, and finally kernel tuning based on the provided strategies and retry policies.

Parameters:

kernel_code (str) – Kernel device code that will be tuned
llm (BaseChatModel, optional) – The language model instance to use for LLM interactions. Defaults to ChatOpenAI(model="gpt-5").
tuning_strategy (BaseTuningStrategy, optional) – The strategy for tuning the kernel’s parameters. Defaults to AutonomousTuningStrategy().
tests (List[KernelTest], optional) – A list of initial tests to validate the kernel’s correctness. Defaults to an empty list.
testing_strategies (List[BaseTestingStrategy]) – Strategies for generating tests. Defaults to List[NaiveLLMTester()].
transformer_retry_policy (RetryPolicy, optional) – The retry policy for the kernel analysis (get_kernel_info) subgraph. Defaults to default_transformer_retry_policy.
device (int, optional) – The CUDA device ID to use. Defaults to 0.
clang_args (List[str], optional) – Additional arguments that will be provided to clang when parsing the kernel code
cuda_gpu_arch (Optional[str], optional) – Pre-determined CUDA GPU architecture (e.g., “sm_86”). If provided, skips PyCUDA detection. Defaults to None.
time_per_test (int) –
The base time limit, in seconds, used to calculate timeouts for kernel tests and tuning runs.

For tune operations: The timeout is calculated as max(time_per_test * num_combinations, time_per_test) + buffer, where num_combinations is the product of the number of values for each parameter in tune_params, and buffer is a fixed additional time (currently 10 seconds). This ensures the timeout scales with the search space size but is never less than time_per_test plus the buffer.

For test operations: The timeout is calculated as time_per_test + buffer, as there is only one parameter combination being tested (num_combinations is effectively 1).
strip_thinking_output (bool, optional) – Whether to strip thinking/reasoning sections from LLM responses. When enabled, the LLM will be wrapped with ThinkingStripperWrapper to remove specified thinking patterns from the output. Defaults to False.
thinking_pattern (str, optional) – Regular expression pattern to match thinking sections that should be stripped from LLM responses. Only used when strip_thinking_output is True. If not provided, defaults to r"<think>.*?</think>\s*" which matches content within <think> tags.
structured_output_type (StructuredOutputType, optional) – The type of structured output format to use for LLM interactions. Defaults to StructuredOutputType.DEFAULT.
performance_threshold (float, optional) –
Minimum performance improvement threshold as a percentage required for accepting a new kernel version. This prevents accepting kernels with marginal improvements that may be due to measurement noise or system variability.

The improvement percentage is calculated using the formula: ((old_time - new_time) / old_time) * 100

A new kernel is accepted only if the calculated improvement percentage is greater than or equal to the threshold value.

Examples
- performance_threshold=1.0: Requires at least 1.0% improvement (default)
- performance_threshold=2.0: Requires at least 2.0% improvement (more conservative)
- performance_threshold=0.0: Accepts any improvement, however small
Defaults to 1.0.
verbosity (Literal["none", "debug", "info", "warning", "all"], optional) –
Controls the verbosity of logging output. Options are:
- “none”: No logging output (uses NullHandler)
- “debug”: Shows debug, info, warning, and error messages
- “info”: Shows info, warning, and error messages (default)
- “warning”: Shows only warning and error messages
- “all”: Shows all logging levels including debug

Defaults to “info”.

add_test(test: KernelTest)

Adds a test to the test suite. The test is used to check the kernel’s correctness while it is being tuned.

Parameters:: test (KernelTest) – The test to add to the test suite.

make_kernel_tunable() → Tuple[TunableKernel, Dict[str, Any], PerformanceTracker]

Transforms the kernel to make it tunable.

Returns:: The transformed kernel as TunableKernel, the best tuning parameters as a dictionary, and the performance tracker containing optimization steps
Return type:: Tuple[TunableKernel, Dict[str, Any], PerformanceTracker]

class llm_kernel_tuner.TunableKernel(code: str, kernel_info: TunableKernelInfo)

Kernel that can be used for tuning.

Parameters:

code (str) – The kernel code that will be tuned.
kernel_info (TunableKernelInfo) – Object that stores general information about the kernel. This object is shared between internal copies of TunableKernel during a tuning session.

copy() → TunableKernel

Creates a full copy of the TunableKernel instance.

Returns:: A new instance with the same attribute values.
Return type:: TunableKernel

get_arg_position(arg: str) → int | None

Returns the position of the argument.

Parameters:: arg (str) – name of the argument.
Returns:: Returns the zero indexed position of the argument or None if the argument does not exist.
Return type:: int | None

test(test: KernelTest, tune_params: Dict[str, int | float], run_kernel_kwargs: Dict[str, Any] | None = None)

Tests the kernel with specific parameter values against a test case.

This method runs the kernel once with the specified parameter configuration and verifies that the output matches the expected output from the test case.

Parameters:

test (KernelTest) – A KernelTest instance containing input data and expected output for validating the kernel’s correctness.
tune_params (Dict[str, Union[int, float]]) – A dictionary mapping parameter names to their values for this specific test run.
run_kernel_kwargs (Optional[Dict[str, Any]]) – Optional additional keyword arguments to pass to the run_kernel function.

Raises:

FailedTestsError – If the kernel’s output doesn’t match the expected output.
CompileErrorError – If there is an error while compiling the kernel.
TimeoutError – If the kernel execution takes longer than the calculated timeout.

tune(test: KernelTest, tune_params: Dict[str, List[Any]], restrictions: List[str] | None = None, tune_kernel_kwargs: Dict[str, Any] | None = None) → TuneResult

Tunes the kernel with various parameter configurations to find optimal performance.

This method runs the kernel with different parameter configurations specified in tune_params, evaluates each configuration using the provided test case, and returns the best performing configuration.

Parameters:

test (KernelTest) – A KernelTest instance containing input data and expected output to validate the kernel’s correctness during tuning.
tune_params (Dict[str, List[Any]]) – A dictionary mapping parameter names to lists of possible values to be explored during the tuning process.
restrictions (Optional[List[str]]) – Optional list of restriction strings that define relationships between parameters to constrain the search space. Examples include “block_size_x>=tile_size” or “block_size_y==block_size_x”. If None, no restrictions are applied.
tune_kernel_kwargs (Optional[Dict[str, Any]]) – Optional additional keyword arguments to pass to the run_kernel function.

Returns:

An object containing the best parameter configuration and its execution time.

Return type:

TuneResult

Raises:

CompileErrorError – If there is an error while compiling the kernel.
TimeoutError – If the tuning process takes longer than the calculated timeout.

class llm_kernel_tuner.tuning_strategies.BaseTuningStrategy

class llm_kernel_tuner.tuning_strategies.BaseTuningStrategy(retry_policy: RetryPolicy | None)

Base class for kernel tuning strategies.

This abstract class defines the interface for tuning strategies that optimize kernel parameters. Implementations are responsible for tuning, testing, and evaluating the kernel themselves.

Note

Subclasses must implement the create_graph method.

llm_kernel_tuner.tuning_state.State will be passed to this graph.

See also

Custom tuning strategy for example usage.

Parameters:: retry_policy (Any)

_ask_restrictions(kernel_code: str, tune_params: Dict[str, List[Any]]) → List[str]

Ask the LLM to generate parameter restrictions for kernel tuning.

This method queries the language model to determine appropriate restrictions between tuning parameters based on the kernel code and available parameters. The restrictions help constrain the parameter search space during optimization.

Parameters:

kernel_code (str) – The CUDA kernel source code to analyze.
tune_params (Dict[str, List[Any]]) – Dictionary mapping parameter names to lists of possible values that can be tuned.

Returns:

A list of restriction strings that define relationships between parameters (e.g., “block_size_x==block_size_y” or “tile_size<=block_size_x”).

Return type:

List[str]

Raises:

RestrictionCheckError – If the LLM generates restrictions referencing parameters not present in tune_params.

Note

This method uses the configured retry policy if available. If restrictions contain invalid parameter names, the operation will be retried according to the retry policy configuration.

Example

Given a matrix multiplication kernel with parameters:

tune_params = {
    "block_size_x": [16, 32, 64],
    "block_size_y": [16, 32, 64],
    "tile_size": [2, 4, 8]
}

The method might return:

["block_size_x>=tile_size", "block_size_y>=tile_size"]
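
To make the role of restriction strings concrete, the sketch below checks that each restriction only mentions known parameters and then filters a parameter grid with them. It assumes restrictions are plain Python comparison expressions, as in the examples above; the helper names are hypothetical, ValueError stands in for RestrictionCheckError, and the library's actual validation may work differently.

```python
import ast
import itertools


def referenced_names(restriction):
    """Return the identifiers used in a restriction expression."""
    tree = ast.parse(restriction, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def check_restrictions(restrictions, tune_params):
    """Raise ValueError if a restriction references an unknown parameter."""
    for restriction in restrictions:
        unknown = referenced_names(restriction) - set(tune_params)
        if unknown:
            raise ValueError(
                f"{restriction!r} references unknown parameters: {sorted(unknown)}"
            )


def filter_configs(tune_params, restrictions):
    """Enumerate the grid and keep configurations satisfying every restriction."""
    names = list(tune_params)
    kept = []
    for values in itertools.product(*(tune_params[name] for name in names)):
        config = dict(zip(names, values))
        # Evaluate with no builtins; only the parameter values are visible.
        if all(eval(r, {"__builtins__": {}}, config) for r in restrictions):
            kept.append(config)
    return kept
```

For the tune_params shown above, the restriction "block_size_x==block_size_y" would reduce the 27 grid points to 9 under this sketch.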

_extract_and_sanitize_kernel(answer_prompt: str) → str | None

Extracts and sanitizes kernel code from the model’s response.

This method processes an LLM response to extract only the kernel code. It first removes preprocessor directives and then extracts code that appears between triple backticks.

Parameters:: answer_prompt (str) – The text response potentially containing kernel code.
Returns:: The extracted and sanitized kernel code, or None if no code could be extracted.
Return type:: str | None

Example

Given an LLM response like:

Here's an optimized CUDA kernel:

```cuda
#define BLOCK_SIZE 256

__global__ void matrixMul(float* A, float* B, float* C, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < width && col < width) {
        float sum = 0.0f;
        for (int i = 0; i < width; i++) {
            sum += A[row * width + i] * B[i * width + col];
        }
        C[row * width + col] = sum;
    }
}
```

The method will return:

__global__ void matrixMul(float* A, float* B, float* C, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < width && col < width) {
        float sum = 0.0f;
        for (int i = 0; i < width; i++) {
            sum += A[row * width + i] * B[i * width + col];
        }
        C[row * width + col] = sum;
    }
}
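
A rough sketch of the two steps described above (drop preprocessor directives, then take the code between triple backticks) might look like the following. The function name and the regular expression are assumptions made for illustration; the library's real implementation may handle more cases, such as multi-line macros.

```python
import re

# Triple backtick, built indirectly so this example can sit inside a fenced block.
FENCE = "`" * 3


def extract_and_sanitize_kernel(answer):
    """Return the first fenced code block without preprocessor lines, or None."""
    # Step 1: drop lines that start with '#', e.g. "#define BLOCK_SIZE 256".
    without_directives = "\n".join(
        line for line in answer.splitlines() if not line.lstrip().startswith("#")
    )
    # Step 2: take the text between the first pair of fences,
    # skipping an optional language tag such as "cuda".
    pattern = FENCE + r"[^\n]*\n(.*?)" + FENCE
    match = re.search(pattern, without_directives, re.DOTALL)
    if match is None:
        return None
    code = match.group(1).strip()
    return code or None
```

A response that contains no fenced block yields None here, matching the documented return value.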

_run_tests(kernel: TunableKernel, tune_params: Dict[str, List[Any]], tests: List[KernelTest], restrictions: List[str] | None = None) → TuneResult

Tunes the kernel with the first test and evaluates correctness with remaining tests.

Parameters:

kernel (TunableKernel) – The tunable kernel to optimize.
tune_params (Dict[str, List[Any]]) – Dictionary mapping parameter names to lists of possible values.
tests (List[KernelTest]) – List of kernel tests to execute.
restrictions (List[str]) – List of restriction strings that define relationships between parameters to constrain the search space (e.g., “block_size_x>=tile_size”).

Returns:

Test result containing the tuning performance metrics.

Return type:

TuneResult

Note

Only the first test is used for parameter tuning to save computational resources. The remaining tests validate the correctness of the kernel. Results are cached based on kernel code and tuning parameters.
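
The flow in the note above (tune once on the first test, validate with the rest, cache by kernel code and tuning parameters) can be sketched as follows. Everything here is hypothetical: tune_fn and test_fn are stand-in callables, the result is a plain dict rather than a TuneResult, and the cache key format is an assumption.

```python
def run_tests(kernel_code, tune_params, tests, tune_fn, test_fn, cache):
    """Tune on the first test, validate on the rest, and cache the result."""
    # Cache key derived from the kernel code and the tuning parameters.
    key = (kernel_code, repr(sorted(tune_params.items())))
    if key in cache:
        return cache[key]

    first, *remaining = tests
    # The expensive parameter search runs only once, on the first test.
    result = tune_fn(first, tune_params)
    # The other tests only check correctness with the best parameters found;
    # test_fn is expected to raise if the output does not match.
    for test in remaining:
        test_fn(test, result["best_params"])

    cache[key] = result
    return result
```

Calling this twice with the same kernel code and parameters would run the search only once and serve the second call from the cache.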

abstractmethod create_graph(llm: BaseChatModel) → CompiledStateGraph

Subclasses must implement this method to build the graph used for tuning.

Parameters:: llm (BaseChatModel) – LLM model that user wants to use for tuning.
Returns:: Compiled LangGraph graph that will be called for tuning. llm_kernel_tuner.tuning_state.State will be passed to this graph.
Return type:: CompiledStateGraph

class llm_kernel_tuner.tuning_strategies.AutonomousTuningStrategy(breakdown_steps: bool = False, with_replanning: bool = True, max_breakdowns: int = 1, max_replanning: int = 3, retry_policy: Any = <object object>)

A fully autonomous strategy for kernel optimization based on the plan-and-solve approach.

This strategy first generates an optimization plan and then executes each step sequentially. Steps can be broken down recursively into smaller steps and the strategy can replan if needed. The process includes validation of steps, breaking them down if necessary, and executing them while ensuring correctness.

Parameters:

retry_policy (RetryPolicy, optional) – Retry policy for LLM calls. If not provided, defaults to default_tuner_retry_policy. Can be set to None for no retries.
breakdown_steps (bool, optional) – Whether to enable breakdown of steps. If True, the strategy will recursively break each step down into smaller steps. Defaults to False.
with_replanning (bool, optional) – Whether to enable replanning. If True, the strategy will re-prompt the LLM for additional steps if needed. Defaults to True.
max_breakdowns (int, optional) – The maximum number of breakdown steps allowed per step. Because steps are broken down recursively, the number of steps grows exponentially, so setting this parameter higher than 2 is not advisable. Must be greater than 0 if breakdown_steps is enabled. Defaults to 1.
max_replanning (int, optional) – The maximum number of replanning attempts allowed. Must be greater than 0 if with_replanning is enabled. Defaults to 3.

See also

For a detailed explanation of the strategy workflow and examples, see Autonomous Tuning Strategy.

class llm_kernel_tuner.tuning_strategies.ExplicitTuningStrategy(tuning_steps: List[TuningStep] | None = None, retry_policy: Any = <object object>)

A strategy that applies a sequence of explicit tuning steps to a kernel.

This tuning strategy follows a predefined sequence of tuning steps, each applying a specific optimization technique to the kernel. Each step can have tunable parameters, dependencies on previous steps, and can be conditionally applied based on an evaluation.

Parameters:

tuning_steps (List[TuningStep] | None) – A list of tuning steps to apply. If None, uses the default explicit_tuning_steps.
retry_policy (Any)

Example

from llm_kernel_tuner import TunableKernel, LLMKernelTransformer
from llm_kernel_tuner.tuning_strategies import ExplicitTuningStrategy, TuningStep
from langchain_core.prompts import PromptTemplate

# Define custom tuning steps
my_tuning_steps = [
    TuningStep(
        id="shared_memory_tiling",
        prompt_template=PromptTemplate.from_template(...),
        tune_params={"tile_size": [16, 32, 64]}
    ),
    TuningStep(
        id="loop_unrolling",
        prompt_template=PromptTemplate.from_template(...),
        tune_params={"unroll_factor": [2, 4, 8]},
        depends_on=["shared_memory_tiling"]
    )
]

# Create the strategy with custom steps
strategy = ExplicitTuningStrategy(tuning_steps=my_tuning_steps)

# Use the strategy with a kernel transformer
kernel_transformer = LLMKernelTransformer(
    kernel_code="...",
    tuning_strategy=strategy
)

See also

Explicit Tuning Strategy for more information on how to use this strategy.

class llm_kernel_tuner.tuning_strategies.TuningStep(id: str, prompt_template: PromptTemplate, tune_params: Dict[str, List[Any]] = {}, depends_on: List[str] = [], skip_evaluation: bool = False)

Initialize a TuningStep instance.

A TuningStep represents a single optimization technique to be applied to a kernel. Each step has a unique identifier, a prompt template to guide the LLM in applying the optimization, optional parameters to tune, and dependency information.

Parameters:

id (str) – A unique identifier for this tuning step.
prompt_template (PromptTemplate) – The prompt template containing instructions for the LLM on how to implement this optimization technique.
tune_params (Dict[str, List[Any]], optional) – A dictionary mapping parameter names to lists of possible values to explore during tuning. Each parameter will be grid-searched across its possible values. Defaults to an empty dictionary.
depends_on (List[str], optional) – A list of tuning step IDs that must be completed before this step can be executed. Allows for defining dependencies between optimization techniques. Defaults to an empty list.
skip_evaluation (bool, optional) – If True, the system will not ask the LLM to evaluate whether this step is necessary and will always execute it. Defaults to False.
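
The depends_on and tune_params fields suggest two simple mechanics: ordering steps so that dependencies run first, and enumerating a grid of parameter values. The sketch below illustrates both with the standard library. It is an assumption about how such ordering could be done, not a description of the library's scheduler, and both helper names are hypothetical.

```python
import itertools
from graphlib import TopologicalSorter


def execution_order(depends_on):
    """Order step IDs so that every step comes after the steps it depends on.

    depends_on maps a step ID to the list of step IDs it requires.
    A cyclic dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())


def parameter_grid(tune_params):
    """Enumerate every combination of values, as a grid search would."""
    names = list(tune_params)
    return [
        dict(zip(names, values))
        for values in itertools.product(*(tune_params[name] for name in names))
    ]


# Step IDs taken from the ExplicitTuningStrategy example.
order = execution_order(
    {"loop_unrolling": ["shared_memory_tiling"], "shared_memory_tiling": []}
)
grid = parameter_grid({"tile_size": [16, 32, 64]})
```

Here order places "shared_memory_tiling" before "loop_unrolling", and grid holds one dictionary per tile_size value.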

KernelTest: a test that checks the correctness of the kernel being tuned.

Parameters:

input_data (np.ndarray) – Input array for kernel testing. Should be a 2D array where the first dimension represents different parameter sets and the second dimension contains the values for each parameter.
expected_output (np.ndarray) – Expected kernel output for validation. Should be a 2D array with the same first dimension as input_data. Elements with None value will be excluded from comparison with actual kernel output.
size (int | Tuple[int, ...]) – Problem size specification used by the tuning process. May be a single integer or a tuple of dimensions depending on the kernel requirements.

class llm_kernel_tuner.testing_strategies.BaseTestingStrategy

class llm_kernel_tuner.testing_strategies.BaseTestingStrategy(retry_policy: RetryPolicy | None, max_data_size: int)

Base class for strategies that generate tests for a kernel.

This abstract class defines the interface for testing strategies that generate tests for a given kernel. Implementations of this class should generate tests and store them in state["tests"].

Parameters:

retry_policy (RetryPolicy, optional) – The retry policy to use for the testing strategy. Defaults to default_tester_retry_policy.
max_data_size (int, optional) – The maximum size of the data that can be passed to the kernel. Defaults to 2 GB.

Note

Subclasses must implement the create_graph method.

llm_kernel_tuner.tuning_state.State will be passed to this graph.

See also

Custom Testing Strategy for example usage.

get_test_from_code(kernel: TunableKernel, code: str, params: Dict[str, Any], timeout: float) → KernelTest

Generates a kernel test case from provided Python code.

Executes the given Python code string, wrapped in a template, in a separate process to obtain input data. The provided code should define a variable named input_data containing a list of numpy arrays for the kernel arguments. The wrapping template handles saving this data to a temporary file and printing its path.

Then, this method runs the kernel using the obtained inputs and the specified tuning params. Finally, it packages the inputs, corresponding kernel outputs, and the derived problem size into a KernelTest object.

Parameters:

kernel (TunableKernel) – The TunableKernel object representing the kernel to test.

code (str) –

A string containing Python code that defines a variable named input_data as a list of numpy arrays representing the kernel input arguments. An example of what code should look like:

import numpy as np
size = 10000000

a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(a)
n = np.int32(size)

input_data = [c, a, b, n]

params (Dict[str, Any]) – A dictionary containing the tuning parameters to be used when running the kernel to generate outputs.
timeout (float, optional) – Timeout in seconds for the process to generate input. Default is 120.

Returns:

A KernelTest object containing the generated input arguments, the corresponding kernel output arguments, and the problem size used for the kernel execution.

Raises:

InvalidTest – If the provided code fails to execute successfully within the template, if the template execution does not produce the expected output file path, or if the kernel

Return type:

KernelTest
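
The first half of this method (run user code in a separate process, save input_data to a temporary file, print the path) can be sketched with the standard library alone. To stay free of third-party dependencies the sketch uses plain lists and pickle where the real method works with numpy arrays. The template, the function name, and the use of RuntimeError in place of InvalidTest are all assumptions.

```python
import os
import pickle
import subprocess
import sys

# Hypothetical wrapping template: runs the user code, saves input_data to a
# temporary file, and prints the file path on the last line of stdout.
TEMPLATE = """
import pickle, tempfile
{user_code}
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump(input_data, f)
print(f.name)
"""


def load_input_data(code, timeout=120.0):
    """Execute code in a child process and return the input_data it defines."""
    script = TEMPLATE.replace("{user_code}", code)
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired when exceeded
    )
    lines = proc.stdout.strip().splitlines()
    if proc.returncode != 0 or not lines:
        raise RuntimeError(f"input generation failed: {proc.stderr.strip()}")
    path = lines[-1]
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    finally:
        os.remove(path)  # clean up the temporary file
```

Running the generator in a child process keeps a crash or an endless loop in the generated code from taking down the tuning process itself.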

class llm_kernel_tuner.tuning_state.State

A TypedDict representing the state of the kernel tuning process.

This class is used to maintain the state during the kernel tuning process. It contains information about the kernel being tuned, current parameters, test cases, performance tracking, and other necessary information for the tuning process.

If you are implementing custom tuning/testing strategies, you will need to manage this state yourself by implementing proper state transitions and updates in your strategy.

kernel

The kernel object that is being tuned, containing the kernel code and associated metadata.

Type:: TunableKernel

best_params

The best parameters found so far during the tuning process. None if no parameters have been evaluated yet.

Type:: Optional[Dict[str, Any]]

llm

The language model chosen by the user for tuning.

Type:: BaseChatModel

tests

A list of test cases used to validate the kernel during the tuning process.

Type:: List[KernelTest]

messages

A list of messages exchanged during the tuning process, typically for tracking conversation with the LLM.

Type:: List[BaseMessage]

curr_tune_params

The current set of tunable parameters being used or considered in the tuning process.

Type:: Dict[str, Any]

performance_tracker

Tracks successful optimization steps and generates performance overviews for the tuning process.

Type:: PerformanceTracker

class llm_kernel_tuner.PerformanceTracker

Tracks and manages successful optimization steps during kernel tuning.

This class provides functionality to record successful optimization steps, calculate performance improvements, and generate formatted overviews of the tuning process results.

generate_overview() → str

Generate a formatted overview of all recorded optimization steps.

Returns:: A formatted string containing the performance overview, or a message indicating no improvements were found if no steps were recorded.
Return type:: str

get_total_improvement() → float

Calculate the total performance improvement from baseline.

Returns:: The total improvement percentage from baseline to final result. Returns 0.0 if no baseline is set or no steps are recorded.
Return type:: float

has_improvements() → bool

Check if any optimization steps have been recorded.

Returns:: True if at least one optimization step has been recorded, False otherwise.
Return type:: bool

record_step(step: PerformanceStep) → None

Record a successful optimization step.

Parameters:: step (PerformanceStep) – The PerformanceStep to record
Return type:: None

set_baseline_time(time: float) → None

Set the baseline execution time before any optimizations.

Parameters:: time (float) – The initial kernel execution time
Return type:: None
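
As a rough mental model of the bookkeeping these methods describe, consider the minimal stand-in below. It is not the real PerformanceTracker: the class and attribute names are hypothetical, it stores only execution times rather than PerformanceStep objects, and it assumes the total improvement uses the same percentage formula as performance_threshold.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchTracker:
    """Simplified stand-in that records only execution times."""

    baseline_time: Optional[float] = None
    step_times: List[float] = field(default_factory=list)

    def set_baseline_time(self, time: float) -> None:
        self.baseline_time = time

    def record_step(self, new_execution_time: float) -> None:
        self.step_times.append(new_execution_time)

    def has_improvements(self) -> bool:
        return bool(self.step_times)

    def get_total_improvement(self) -> float:
        # 0.0 when no baseline is set or no steps are recorded, as documented.
        if self.baseline_time is None or not self.step_times:
            return 0.0
        final_time = self.step_times[-1]
        return (self.baseline_time - final_time) / self.baseline_time * 100
```

With a baseline of 2.0 and recorded times of 1.5 and then 1.0, this sketch reports a total improvement of 50.0 percent.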

class llm_kernel_tuner.PerformanceStep(step_description: str, kernel_code: str, old_execution_time: float | None, new_execution_time: float, improvement_percentage: float, tunable_parameters: Dict[str, List[Any]], restrictions: List[str], best_tune_params: Dict[str, Any], timestamp: datetime)

Data class representing a successful optimization step.

This class captures comprehensive information about each accepted optimization step, including the step description, kernel code changes, performance metrics, tunable parameters, and restrictions.

Parameters:

step_description (str)
kernel_code (str)
old_execution_time (float | None)
new_execution_time (float)
improvement_percentage (float)
tunable_parameters (Dict[str, List[Any]])
restrictions (List[str])
best_tune_params (Dict[str, Any])
timestamp (datetime)

step_description

Human-readable description of the optimization step

Type:: str

kernel_code

The optimized kernel code after this step

Type:: str

old_execution_time

Previous best execution time (None for first step)

Type:: float | None

new_execution_time

New execution time after optimization

Type:: float

improvement_percentage

Calculated improvement percentage

Type:: float

tunable_parameters

The tunable parameters used for this step

Type:: Dict[str, List[Any]]

restrictions

Parameter restrictions applied during tuning

Type:: List[str]

best_tune_params

The best parameter values found for this kernel

Type:: Dict[str, Any]

timestamp

When this step was recorded

Type:: datetime.datetime

class llm_kernel_tuner.retry.RetryPolicy(max_retries: int = 3, handlers: Dict[Type[Exception], Callable[[Any, Exception], Any]] | None = None, default_handler: Callable[[Any, Exception], Any] | None = None)

Defines a retry policy for a retry wrapper.

Parameters:

max_retries (int, optional) – The maximum number of retries to attempt.
handlers (Optional[Dict[Type[Exception], Callable[[Any, Exception], Any]]]) – A dictionary mapping exception types to handler functions. When a matching exception is raised, the handler function is called with the current state and the exception.
default_handler (Optional[Callable[[Any, Exception], Any]]) – A default handler function to use if no specific handler is provided for an exception.

Note

Either handlers or default_handler must be provided.

See also

See Retry Policy for usage.

llm_kernel_tuner.retry.create_retry_wrapper(what_to_rerty: Callable[[...], Any] | CompiledStateGraph, policy: RetryPolicy, state_type: Type[Any] = dict) → CompiledStateGraph

Creates a retry wrapper around a graph or a function using the specified policy. The wrapper catches any exception raised by the wrapped function or graph and executes the matching handler defined in the policy.

Example usage:

...
graph = graph_builder.compile()

def value_error_handler(retry_state: State, error: Exception) -> State:
    print("ValueError occurred:", error)
    print(f"Current value: {retry_state['value']}")
    retry_state["value"] += 1
    return retry_state

retry_policy: RetryPolicy = RetryPolicy(max_retries=7, handlers={ValueError: value_error_handler})
wrapped_graph = create_retry_wrapper(graph, retry_policy)

The example above will create a retry policy that retries up to 7 times and will call the value_error_handler function if a ValueError is raised during a retry.

Parameters:

what_to_rerty (Callable[[Any], Any], CompiledStateGraph) – The graph or a function to wrap
policy (RetryPolicy) – The retry policy to use
state_type (Type[Any], optional) – Type of the state used for the graph

Returns:

Graph or function wrapped with the retry policy

Return type:

CompiledStateGraph
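
The retry semantics can be illustrated without LangGraph by a plain loop. This is a simplified sketch under stated assumptions, not the wrapper's implementation: the real function returns a CompiledStateGraph, and run_with_retries and its exact stopping behavior are hypothetical.

```python
def run_with_retries(fn, state, max_retries=3, handlers=None, default_handler=None):
    """Call fn(state); on failure, let a handler adjust the state and retry."""
    handlers = handlers or {}
    attempts = 0
    while True:
        try:
            return fn(state)
        except Exception as error:
            if attempts >= max_retries:
                raise  # retries exhausted: propagate the last error
            # Pick the first handler whose exception type matches.
            handler = next(
                (h for exc_type, h in handlers.items() if isinstance(error, exc_type)),
                default_handler,
            )
            if handler is None:
                raise  # no handler registered for this exception
            state = handler(state, error)
            attempts += 1


def needs_three(state):
    """Toy function that fails until state['value'] reaches 3."""
    if state["value"] < 3:
        raise ValueError("value too small")
    return state


def value_error_handler(retry_state, error):
    """Mirror of the handler in the example above: bump the value and retry."""
    retry_state["value"] += 1
    return retry_state


result = run_with_retries(
    needs_three, {"value": 0}, max_retries=7, handlers={ValueError: value_error_handler}
)
```

In this run the function fails three times, the handler increments the value each time, and the fourth call succeeds with a value of 3.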