API documentation
- class llm_kernel_tuner.LLMKernelTransformer(kernel_code: str, llm: BaseChatModel | None = None, tuning_strategy: BaseTuningStrategy | None = None, tests: List[KernelTest] = [], testing_strategies: List[BaseTestingStrategy] = [], transformer_retry_policy: RetryPolicy = RetryPolicy(max_retries=3), device: int = 0, clang_args: List[str] = [], cuda_gpu_arch: str | None = None, time_per_test: int = 15, strip_thinking_output: bool = False, thinking_pattern: str = '<think>.*?</think>\\s*', structured_output_type: StructuredOutputType = StructuredOutputType.DEFAULT, performance_threshold: float = 1.0, verbosity: Literal['none', 'debug', 'info', 'warning', 'all'] = 'info')
The main class that orchestrates the process of analyzing, transforming, and tuning a given compute kernel using LLMs and predefined strategies.
It sets up a workflow involving kernel analysis (description, problem size, outputs), test generation, test validation, and finally kernel tuning based on the provided strategies and retry policies.
- Parameters:
kernel_code (str) – Kernel device code that will be tuned
llm (BaseChatModel, optional) – The language model instance to use for LLM interactions. Defaults to ChatOpenAI(model="gpt-5").
tuning_strategy (BaseTuningStrategy, optional) – The strategy for tuning the kernel’s parameters. Defaults to AutonomousTuningStrategy().
tests (List[KernelTest], optional) – A list of initial tests to validate the kernel’s correctness. Defaults to an empty list.
testing_strategies (List[BaseTestingStrategy]) – Strategies for generating tests. Defaults to List[NaiveLLMTester()].
transformer_retry_policy (RetryPolicy, optional) – The retry policy for the kernel analysis (get_kernel_info) subgraph. Defaults to default_transformer_retry_policy.
device (int, optional) – The CUDA device ID to use. Defaults to 0.
clang_args (List[str], optional) – Additional arguments that will be provided to clang when parsing the kernel code
cuda_gpu_arch (Optional[str], optional) – Pre-determined CUDA GPU architecture (e.g., “sm_86”). If provided, skips PyCUDA detection. Defaults to None.
time_per_test (int) –
The base time limit, in seconds, used for calculating timeouts for kernel tests and tuning runs.
For tune operations: The timeout is calculated as max(time_per_test * num_combinations, time_per_test) + buffer, where num_combinations is the product of the number of values for each parameter in tune_params, and buffer is a fixed additional time (currently 10 seconds). This ensures the timeout scales with the search space size but is never less than time_per_test plus the buffer.
For test operations: The timeout is calculated as time_per_test + buffer, as there is only one parameter combination being tested (num_combinations is effectively 1).
strip_thinking_output (bool, optional) – Whether to strip thinking/reasoning sections from LLM responses. When enabled, the LLM will be wrapped with ThinkingStripperWrapper to remove specified thinking patterns from the output. Defaults to False.
thinking_pattern (str, optional) – Regular expression pattern to match thinking sections that should be stripped from LLM responses. Only used when strip_thinking_output is True. If not provided, defaults to r"<think>.*?</think>\s*", which matches content within <think> tags.
structured_output_type (StructuredOutputType, optional) – The type of structured output format to use for LLM interactions. Defaults to StructuredOutputType.DEFAULT.
performance_threshold (float, optional) –
Minimum performance improvement threshold as a percentage required for accepting a new kernel version. This prevents accepting kernels with marginal improvements that may be due to measurement noise or system variability.
The improvement percentage is calculated using the formula:
((old_time - new_time) / old_time) * 100
A new kernel is accepted only if the calculated improvement percentage is greater than or equal to the threshold value.
Examples
performance_threshold=1.0: Requires at least 1.0% improvement (default)
performance_threshold=2.0: Requires at least 2.0% improvement (more conservative)
performance_threshold=0.0: Accepts any improvement, however small
Defaults to 1.0.
verbosity (Literal["none", "debug", "info", "warning", "all"], optional) –
Controls the verbosity of logging output. Options are:
- “none”: No logging output (uses NullHandler)
- “debug”: Shows debug, info, warning, and error messages
- “info”: Shows info, warning, and error messages (default)
- “warning”: Shows only warning and error messages
- “all”: Shows all logging levels including debug
Defaults to “info”.
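The timeout and acceptance rules described for time_per_test and performance_threshold can be sketched in plain Python. This is only an illustration of the documented formulas, not the library's code: the helper names (tune_timeout, single_test_timeout, improvement_percentage, is_accepted) are hypothetical, and the 10-second buffer is taken from the description above.

```python
def tune_timeout(time_per_test: int, tune_params: dict, buffer: int = 10) -> int:
    """Timeout in seconds for a tune run, following the documented formula."""
    num_combinations = 1
    for values in tune_params.values():
        num_combinations *= len(values)  # product of the value counts
    return max(time_per_test * num_combinations, time_per_test) + buffer


def single_test_timeout(time_per_test: int, buffer: int = 10) -> int:
    """Timeout in seconds for a single test run (one parameter combination)."""
    return time_per_test + buffer


def improvement_percentage(old_time: float, new_time: float) -> float:
    """Relative speed-up of new_time over old_time, in percent."""
    return ((old_time - new_time) / old_time) * 100


def is_accepted(old_time: float, new_time: float, performance_threshold: float = 1.0) -> bool:
    """A new kernel is accepted only if it improves by at least the threshold."""
    return improvement_percentage(old_time, new_time) >= performance_threshold


params = {"block_size_x": [16, 32, 64], "tile_size": [2, 4]}
print(tune_timeout(15, params))   # 6 combinations: 15 * 6 + 10 = 100
print(single_test_timeout(15))    # 15 + 10 = 25
print(is_accepted(2.00, 1.99))    # 0.5% improvement, below the 1.0% default: False
print(is_accepted(2.00, 1.90))    # 5% improvement: True
```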
- add_test(test: KernelTest)
Add test to the test suite. This test will be used to test the correctness of the kernel while it is being tuned.
- Parameters:
test (KernelTest) – test to be added to the testsuite.
- make_kernel_tunable() → Tuple[TunableKernel, Dict[str, Any], PerformanceTracker]
Transforms the kernel to make it tunable.
- Returns:
The transformed kernel as TunableKernel, the best tuning parameters as a dictionary, and the performance tracker containing optimization steps
- Return type:
Tuple[TunableKernel, Dict[str, Any], PerformanceTracker]
- class llm_kernel_tuner.TunableKernel(code: str, kernel_info: TunableKernelInfo)
Kernel that can be used for tuning.
- Parameters:
code (str) – The kernel code that will be tuned.
kernel_info (TunableKernelInfo) – Object that stores general information about the kernel.
session. (This object is shared between internal copies of TunableKernel during tuning)
- copy() → TunableKernel
Creates a full copy of the TunableKernel instance.
- Returns:
A new instance with the same attribute values.
- Return type:
TunableKernel
- get_arg_position(arg: str) → int | None
Returns the position of the argument.
- Parameters:
arg (str) – name of the argument.
- Returns:
The zero-indexed position of the argument, or None if the argument does not exist.
- Return type:
int | None
- test(test: KernelTest, tune_params: Dict[str, int | float], run_kernel_kwargs: Dict[str, Any] | None = None)
Tests the kernel with specific parameter values against a test case.
This method runs the kernel once with the specified parameter configuration and verifies that the output matches the expected output from the test case.
- Parameters:
test (KernelTest) – A KernelTest instance containing input data and expected output for validating the kernel’s correctness.
tune_params (Dict[str, Union[int, float]]) – A dictionary mapping parameter names to their values for this specific test run.
run_kernel_kwargs (Optional[Dict[str, Any]]) – Optional additional keyword arguments to pass to the run_kernel function.
- Raises:
FailedTestsError – If the kernel’s output doesn’t match the expected output.
CompileErrorError – If there is an error while compiling the kernel.
TimeoutError – If the kernel execution takes longer than the calculated timeout.
- tune(test: KernelTest, tune_params: Dict[str, List[Any]], restrictions: List[str] | None = None, tune_kernel_kwargs: Dict[str, Any] | None = None) → TuneResult
Tunes the kernel with various parameter configurations to find optimal performance.
This method runs the kernel with different parameter configurations specified in tune_params, evaluates each configuration using the provided test case, and returns the best performing configuration.
- Parameters:
test (KernelTest) – A KernelTest instance containing input data and expected output to validate the kernel’s correctness during tuning.
tune_params (Dict[str, List[Any]]) – A dictionary mapping parameter names to lists of possible values to be explored during the tuning process.
restrictions (Optional[List[str]]) – Optional list of restriction strings that define relationships between parameters to constrain the search space. Examples include “block_size_x>=tile_size” or “block_size_y==block_size_x”. If None, no restrictions are applied.
tune_kernel_kwargs (Optional[Dict[str, Any]]) – Optional additional keyword arguments to pass to the run_kernel function.
- Returns:
An object containing the best parameter configuration and its execution time.
- Return type:
TuneResult
- Raises:
CompileErrorError – If there is an error while compiling the kernel.
TimeoutError – If the tuning process takes longer than the calculated timeout.
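As an illustration of how restriction strings narrow the search space, the sketch below enumerates every combination in tune_params and keeps only those that satisfy all restrictions. It assumes each restriction is a Python boolean expression over the parameter names; how the library actually evaluates restrictions is not specified here, and enumerate_configs is a hypothetical helper.

```python
from itertools import product


def enumerate_configs(tune_params, restrictions=None):
    """Return every parameter combination that satisfies all restrictions."""
    names = list(tune_params)
    configs = []
    for values in product(*(tune_params[name] for name in names)):
        config = dict(zip(names, values))
        # Assumption: a restriction is a Python expression over parameter names.
        # Builtins are disabled so only the parameters are visible to eval.
        if all(eval(r, {"__builtins__": {}}, config) for r in restrictions or []):
            configs.append(config)
    return configs


tune_params = {"block_size_x": [16, 32], "tile_size": [8, 16, 32]}
all_configs = enumerate_configs(tune_params)
valid_configs = enumerate_configs(tune_params, ["block_size_x>=tile_size"])
print(len(all_configs), len(valid_configs))  # 6 5
```

Only the pair block_size_x=16, tile_size=32 is dropped, so five of the six combinations remain.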
- class llm_kernel_tuner.tuning_strategies.BaseTuningStrategy
- class llm_kernel_tuner.tuning_strategies.BaseTuningStrategy(retry_policy: RetryPolicy | None)
Base class for kernel tuning strategies.
This abstract class defines the interface for tuning strategies that optimize kernel parameters. Implementations are responsible for tuning, testing, and evaluating the kernel themselves.
Note
Subclasses must implement the create_graph method. llm_kernel_tuner.tuning_state.State will be passed to this graph.
See also
Custom tuning strategy for example usage.
- Parameters:
retry_policy (Any)
- _ask_restrictions(kernel_code: str, tune_params: Dict[str, List[Any]]) → List[str]
Ask the LLM to generate parameter restrictions for kernel tuning.
This method queries the language model to determine appropriate restrictions between tuning parameters based on the kernel code and available parameters. The restrictions help constrain the parameter search space during optimization.
- Parameters:
kernel_code (str) – The CUDA kernel source code to analyze.
tune_params (Dict[str, List[Any]]) – Dictionary mapping parameter names to lists of possible values that can be tuned.
- Returns:
A list of restriction strings that define relationships between parameters (e.g., “block_size_x==block_size_y” or “tile_size<=block_size_x”).
- Return type:
List[str]
- Raises:
RestrictionCheckError – If the LLM generates restrictions referencing parameters not present in tune_params.
Note
This method uses the configured retry policy if available. If restrictions contain invalid parameter names, the operation will be retried according to the retry policy configuration.
Example
Given a matrix multiplication kernel with parameters:
tune_params = { "block_size_x": [16, 32, 64], "block_size_y": [16, 32, 64], "tile_size": [2, 4, 8] }
The method might return:
["block_size_x>=tile_size", "block_size_y>=tile_size"]
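The validity check described above (restrictions must only reference parameters present in tune_params) could look roughly like the following. This is a minimal sketch under stated assumptions: RestrictionCheckError here is a local stand-in for the library's exception, check_restrictions is a hypothetical helper, and identifiers are found with a simple regular expression rather than a real parser.

```python
import keyword
import re


class RestrictionCheckError(Exception):
    """Local stand-in for the library's exception of the same name."""


def check_restrictions(restrictions, tune_params):
    """Raise if any restriction mentions a name that is not a tunable parameter."""
    known = set(tune_params)
    for restriction in restrictions:
        # Collect identifier-like tokens, ignoring Python keywords such as "and".
        names = {
            name
            for name in re.findall(r"[A-Za-z_]\w*", restriction)
            if not keyword.iskeyword(name)
        }
        unknown = names - known
        if unknown:
            raise RestrictionCheckError(
                f"{restriction!r} references unknown parameters: {sorted(unknown)}"
            )


tune_params = {
    "block_size_x": [16, 32, 64],
    "block_size_y": [16, 32, 64],
    "tile_size": [2, 4, 8],
}
# Both restrictions use known parameters, so this call returns without raising.
check_restrictions(["block_size_x>=tile_size", "block_size_y>=tile_size"], tune_params)
```

In the library, a failed check is what triggers a retry under the configured retry policy.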
- _extract_and_sanitize_kernel(answer_prompt: str) → str | None
Extracts and sanitizes kernel code from the model’s response.
This method processes an LLM response to extract only the kernel code. It first removes preprocessor directives and then extracts code that appears between triple backticks.
- Parameters:
answer_prompt (str) – The text response potentially containing kernel code.
- Returns:
The extracted and sanitized kernel code, or None if no code could be extracted.
- Return type:
str | None
Example
Given an LLM response like:
    Here's an optimized CUDA kernel:
    ```cuda
    #define BLOCK_SIZE 256
    __global__ void matrixMul(float* A, float* B, float* C, int width) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int i = 0; i < width; i++) {
                sum += A[row * width + i] * B[i * width + col];
            }
            C[row * width + col] = sum;
        }
    }
    ```
The method will return:
    __global__ void matrixMul(float* A, float* B, float* C, int width) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int i = 0; i < width; i++) {
                sum += A[row * width + i] * B[i * width + col];
            }
            C[row * width + col] = sum;
        }
    }
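The two steps described for this method (drop preprocessor directives, then take the code between triple backticks) can be approximated as follows. This is a sketch, not the library's implementation: extract_and_sanitize_kernel is a hypothetical free function, and it treats any line starting with # as a directive without handling multi-line macros.

```python
import re

FENCE = "`" * 3  # a triple backtick, built programmatically for readability


def extract_and_sanitize_kernel(answer: str):
    """Return the code from the first fenced block, without directive lines."""
    # Step 1: remove preprocessor directives (lines starting with '#').
    without_directives = "\n".join(
        line for line in answer.splitlines() if not line.lstrip().startswith("#")
    )
    # Step 2: capture what sits between the opening fence line and the closing fence.
    match = re.search(FENCE + r"[^\n]*\n(.*?)" + FENCE, without_directives, re.DOTALL)
    if match is None:
        return None
    code = match.group(1).strip()
    return code or None


response = "\n".join([
    "Here's an optimized CUDA kernel:",
    FENCE + "cuda",
    "#define BLOCK_SIZE 256",
    "__global__ void scale(float* x) { x[threadIdx.x] *= 2.0f; }",
    FENCE,
])
print(extract_and_sanitize_kernel(response))
# __global__ void scale(float* x) { x[threadIdx.x] *= 2.0f; }
```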
- _run_tests(kernel: TunableKernel, tune_params: Dict[str, List[Any]], tests: List[KernelTest], restrictions: List[str] | None = None) → TuneResult
Tunes the kernel with the first test and evaluates correctness with remaining tests.
- Parameters:
kernel (TunableKernel) – The tunable kernel to optimize.
tune_params (Dict[str, List[Any]]) – Dictionary mapping parameter names to lists of possible values.
tests (List[KernelTest]) – List of kernel tests to execute.
restrictions (List[str]) – List of restriction strings that define relationships between parameters to constrain the search space (e.g., “block_size_x>=tile_size”).
- Returns:
Test result containing the tuning performance metrics.
- Return type:
TuneResult
Note
Only the first test is used for parameter tuning to save computational resources. The remaining tests validate the correctness of the kernel. Results are cached based on kernel code and tuning parameters.
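The behaviour in the note (tune on the first test, validate with the rest, cache by kernel code and tuning parameters) is sketched below with plain callables. The names run_tests, fake_tune and fake_test are hypothetical, and checking the remaining tests with the best parameters from the tuning run is an assumption, not something this reference states.

```python
import json

_cache = {}  # keyed by (kernel code, serialized tune_params)


def run_tests(kernel_code, tune_params, tests, tune_fn, test_fn):
    """Tune with the first test, check correctness with the others, cache the result."""
    key = (kernel_code, json.dumps(tune_params, sort_keys=True))
    if key in _cache:
        return _cache[key]
    first, *rest = tests
    result = tune_fn(first, tune_params)       # the expensive search runs once
    for test in rest:
        test_fn(test, result["best_params"])   # assumed: validate with best params
    _cache[key] = result
    return result


calls = {"tune": 0, "test": 0}


def fake_tune(test, tune_params):
    calls["tune"] += 1
    return {"best_params": {"block_size_x": 32}, "time": 1.0}


def fake_test(test, params):
    calls["test"] += 1


args = ("__global__ void k() {}", {"block_size_x": [16, 32]}, ["t0", "t1", "t2"])
first_result = run_tests(*args, fake_tune, fake_test)
second_result = run_tests(*args, fake_tune, fake_test)  # served from the cache
print(calls)  # {'tune': 1, 'test': 2}
```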
- abstractmethod create_graph(llm: BaseChatModel) → CompiledStateGraph
Subclasses must implement this method.
- Parameters:
llm (BaseChatModel) – The LLM that the user wants to use for tuning.
- Returns:
The compiled LangGraph graph that will be called for tuning. llm_kernel_tuner.tuning_state.State will be passed to this graph.
- Return type:
CompiledStateGraph
- class llm_kernel_tuner.tuning_strategies.AutonomousTuningStrategy(breakdown_steps: bool = False, with_replanning: bool = True, max_breakdowns: int = 1, max_replanning: int = 3, retry_policy: Any = <object object>)
A fully autonomous strategy for kernel optimization based on the plan-and-solve approach.
This strategy first generates an optimization plan and then executes each step sequentially. Steps can be broken down recursively into smaller steps and the strategy can replan if needed. The process includes validation of steps, breaking them down if necessary, and executing them while ensuring correctness.
- Parameters:
retry_policy (RetryPolicy, optional) – Retry policy for LLM calls. If not provided, defaults to default_tuner_retry_policy. Can be set to None for no retries.
breakdown_steps (bool, optional) – Whether to enable breakdown of steps. If True, the strategy will recursively break down each step into smaller steps. Defaults to False.
with_replanning (bool, optional) – Whether to enable replanning. If True, the strategy will re-prompt the LLM for additional steps if needed. Defaults to True.
max_breakdowns (int, optional) – The maximum number of breakdown steps allowed per step. Because steps are broken down recursively, the number of steps grows exponentially, so setting this parameter higher than 2 is not advisable. Must be greater than 0 if breakdown_steps is enabled. Defaults to 1.
max_replanning (int, optional) – The maximum number of replanning attempts allowed. Must be greater than 0 if with_replanning is enabled. Defaults to 3.
See also
For a detailed explanation of the strategy workflow and examples, see Autonomous Tuning Strategy.
- class llm_kernel_tuner.tuning_strategies.ExplicitTuningStrategy(tuning_steps: List[TuningStep] | None = None, retry_policy: Any = <object object>)
A strategy that applies a sequence of explicit tuning steps to a kernel.
This tuning strategy follows a predefined sequence of tuning steps, each applying a specific optimization technique to the kernel. Each step can have tunable parameters, dependencies on previous steps, and can be conditionally applied based on an evaluation.
- Parameters:
tuning_steps (List[TuningStep] | None) – A list of tuning steps to apply. If None, uses the default explicit_tuning_steps.
retry_policy (Any)
Example
    from llm_kernel_tuner import TunableKernel, LLMKernelTransformer
    from llm_kernel_tuner.tuning_strategies import ExplicitTuningStrategy, TuningStep
    from langchain_core.prompts import PromptTemplate

    # Define custom tuning steps
    my_tuning_steps = [
        TuningStep(
            id="shared_memory_tiling",
            prompt_template=PromptTemplate.from_template(...),
            tune_params={"tile_size": [16, 32, 64]}
        ),
        TuningStep(
            id="loop_unrolling",
            prompt_template=PromptTemplate.from_template(...),
            tune_params={"unroll_factor": [2, 4, 8]},
            depends_on=["shared_memory_tiling"]
        )
    ]

    # Create the strategy with custom steps
    strategy = ExplicitTuningStrategy(tuning_steps=my_tuning_steps)

    # Use the strategy with a kernel transformer
    # (the constructor parameter is kernel_code, per the signature above)
    kernel_transformer = LLMKernelTransformer(
        kernel_code="...",
        tuning_strategy=strategy
    )
See also
Explicit Tuning Strategy for more information on how to use this strategy.
- class llm_kernel_tuner.tuning_strategies.TuningStep(id: str, prompt_template: PromptTemplate, tune_params: Dict[str, List[Any]] = {}, depends_on: List[str] = [], skip_evaluation: bool = False)
Initialize a TuningStep instance.
A TuningStep represents a single optimization technique to be applied to a kernel. Each step has a unique identifier, a prompt template to guide the LLM in applying the optimization, optional parameters to tune, and dependency information.
- Parameters:
id (str) – A unique identifier for this tuning step.
prompt_template (PromptTemplate) – The prompt template containing instructions for the LLM on how to implement this optimization technique.
tune_params (Dict[str, List[Any]], optional) – A dictionary mapping parameter names to lists of possible values to explore during tuning. Each parameter will be grid-searched across its possible values. Defaults to an empty dictionary.
depends_on (List[str], optional) – A list of tuning step IDs that must be completed before this step can be executed. Allows for defining dependencies between optimization techniques. Defaults to an empty list.
skip_evaluation (bool, optional) – If True, the system will not ask the LLM to evaluate whether this step is necessary and will always execute it. Defaults to False.
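To make the depends_on semantics concrete, the sketch below picks the steps whose dependencies have all been completed. It is an illustration only: Step is a hypothetical stand-in for TuningStep that keeps just id and depends_on, runnable_steps is a hypothetical helper, and how the library actually schedules steps is not described in this reference.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Step:
    """Hypothetical stand-in for TuningStep with only the fields needed here."""
    id: str
    depends_on: List[str] = field(default_factory=list)


def runnable_steps(steps: List[Step], completed: Set[str]) -> List[str]:
    """IDs of steps that are not done yet and whose dependencies are all completed."""
    return [
        step.id
        for step in steps
        if step.id not in completed
        and all(dep in completed for dep in step.depends_on)
    ]


steps = [
    Step(id="shared_memory_tiling"),
    Step(id="loop_unrolling", depends_on=["shared_memory_tiling"]),
]
print(runnable_steps(steps, completed=set()))                     # ['shared_memory_tiling']
print(runnable_steps(steps, completed={"shared_memory_tiling"}))  # ['loop_unrolling']
```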
- class llm_kernel_tuner.KernelTest(input_data: List[List[ndarray[tuple[Any, ...], dtype[int32]] | ndarray[tuple[Any, ...], dtype[float32]] | int32 | float32]], expected_output: List[List[ndarray[tuple[Any, ...], dtype[int32]] | ndarray[tuple[Any, ...], dtype[float32]] | int32 | float32] | None], size: int | Tuple[int, ...])
Kernel test that will test the correctness of the kernel being tuned.
- Parameters:
input_data (np.ndarray) – Input array for kernel testing. Should be a 2D array where the first dimension represents different parameter sets and the second dimension contains the values for each parameter.
expected_output (np.ndarray) – Expected kernel output for validation. Should be a 2D array with the same first dimension as input_data. Elements with None value will be excluded from comparison with actual kernel output.
size (int | Tuple[int, ...]) – Problem size specification used by the tuning process. May be a single integer or a tuple of dimensions depending on the kernel requirements.
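The rule that None entries in expected_output are skipped during comparison can be illustrated as follows. This is a simplified sketch: plain Python lists stand in for NumPy arrays, outputs_match is a hypothetical helper, and the tolerances are assumptions rather than values taken from the library.

```python
import math


def outputs_match(expected_args, actual_args, rel_tol=1e-6, abs_tol=1e-6):
    """Compare kernel arguments element-wise, skipping arguments expected as None."""
    for expected, actual in zip(expected_args, actual_args):
        if expected is None:  # this argument is excluded from the comparison
            continue
        if len(expected) != len(actual):
            return False
        if not all(
            math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
            for e, a in zip(expected, actual)
        ):
            return False
    return True


# For a vector add c = a + b, only the output argument c is checked.
expected = [[5.0, 7.0, 9.0], None, None]
actual = [[5.0, 7.0, 9.0], [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(outputs_match(expected, actual))                                 # True
print(outputs_match(expected, [[5.0, 7.0, 0.0], [1.0], [4.0]]))        # False
```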
- class llm_kernel_tuner.testing_strategies.BaseTestingStrategy
- class llm_kernel_tuner.testing_strategies.BaseTestingStrategy(retry_policy: RetryPolicy | None)
- class llm_kernel_tuner.testing_strategies.BaseTestingStrategy(retry_policy: RetryPolicy | None, max_data_size: int)
Base class for strategies that generate tests for a kernel.
This abstract class defines the interface for testing strategies that generate tests for a given kernel. Implementations of this class should generate tests and store them in state["tests"].
- Parameters:
retry_policy (RetryPolicy, optional) – The retry policy to use for the testing strategy. Defaults to default_tester_retry_policy
max_data_size (int, optional) – The maximum size of the data that can be passed to the kernel. Defaults to 2GB.
Note
Subclasses must implement the create_graph method. llm_kernel_tuner.tuning_state.State will be passed to this graph.
See also
Custom Testing Strategy for example usage.
- get_test_from_code(kernel: TunableKernel, code: str, params: Dict[str, Any], timeout: float) → KernelTest
Generates a kernel test case from provided Python code.
Executes the given Python code string, wrapped in a template, in a separate process to obtain input data. The provided code should define a variable named input_data containing a list of numpy arrays for the kernel arguments. The wrapping template handles saving this data to a temporary file and printing its path.
Then, this method runs the kernel using the obtained inputs and the specified tuning params. Finally, it packages the inputs, corresponding kernel outputs, and the derived problem size into a KernelTest object.
- Parameters:
kernel (TunableKernel) – The TunableKernel object representing the kernel to test.
code (str) –
A string containing Python code that defines a variable named input_data as a list of numpy arrays representing the kernel input arguments. Here is an example of what code should look like:
    import numpy as np
    size = 10000000
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)
    n = np.int32(size)
    input_data = [c, a, b, n]
params (Dict[str, Any]) – A dictionary containing the tuning parameters to be used when running the kernel to generate outputs.
timeout (float, optional) – Timeout in seconds for the process to generate input. Default is 120.
- Returns:
A KernelTest object containing the generated input arguments, the corresponding kernel output arguments, and the problem size used for the kernel execution.
- Raises:
InvalidTest – If the provided code fails to execute successfully within the template, if the template execution does not produce the expected output file path, or if the kernel
- Return type:
KernelTest
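The input-generation step (run the user code inside a template in a separate process, save input_data to a temporary file, print the path) might be sketched like this. Everything here is an assumption made for illustration: the template text, the generate_input_data helper, the local InvalidTest class, and the use of JSON instead of NumPy arrays so that the example stays self-contained.

```python
import json
import os
import subprocess
import sys


class InvalidTest(Exception):
    """Local stand-in for the library's exception of the same name."""


# Hypothetical template: runs the user code, saves input_data, prints the file path.
TEMPLATE = """
import json, tempfile
USER_CODE
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(input_data, f)
print(f.name)
"""


def generate_input_data(code: str, timeout: float = 120):
    """Execute `code` in a child process and load the input_data it defines."""
    script = TEMPLATE.replace("USER_CODE", code)
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired as error:
        raise InvalidTest("input generation timed out") from error
    if result.returncode != 0:
        raise InvalidTest(result.stderr)
    lines = result.stdout.strip().splitlines()
    if not lines or not os.path.exists(lines[-1]):
        raise InvalidTest("template did not print an output file path")
    path = lines[-1]
    try:
        with open(path) as f:
            return json.load(f)
    finally:
        os.remove(path)  # clean up the temporary file


code = "size = 4\na = [float(i) for i in range(size)]\ninput_data = [a, size]"
print(generate_input_data(code))  # [[0.0, 1.0, 2.0, 3.0], 4]
```

Running the code in a child process keeps a crash or an endless loop in generated code from taking down the tuning session, which is presumably why a timeout parameter exists.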
- class llm_kernel_tuner.tuning_state.State
A TypedDict representing the state of the kernel tuning process.
This class is used to maintain the state during the kernel tuning process. It contains information about the kernel being tuned, current parameters, test cases, performance tracking, and other necessary information for the tuning process.
If you are implementing custom tuning/testing strategies, you will need to manage this state yourself by implementing proper state transitions and updates in your strategy.
- kernel
The kernel object that is being tuned, containing the kernel code and associated metadata.
- Type:
TunableKernel
- best_params
The best parameters found so far during the tuning process. None if no parameters have been evaluated yet.
- Type:
Optional[Dict[str, Any]]
- llm
The language model chosen by the user for tuning.
- Type:
BaseChatModel
- tests
A list of test cases used to validate the kernel during the tuning process.
- Type:
List[KernelTest]
- messages
A list of messages exchanged during the tuning process, typically for tracking conversation with the LLM.
- Type:
List[BaseMessage]
- curr_tune_params
The current set of tunable parameters being used or considered in the tuning process.
- Type:
Dict[str, Any]
- performance_tracker
Tracks successful optimization steps and generates performance overviews for the tuning process.
- Type:
PerformanceTracker
- class llm_kernel_tuner.PerformanceTracker
Tracks and manages successful optimization steps during kernel tuning.
This class provides functionality to record successful optimization steps, calculate performance improvements, and generate formatted overviews of the tuning process results.
- generate_overview() → str
Generate a formatted overview of all recorded optimization steps.
- Returns:
A formatted string containing the performance overview, or a message indicating no improvements were found if no steps were recorded.
- Return type:
str
- get_total_improvement() → float
Calculate the total performance improvement from baseline.
- Returns:
The total improvement percentage from baseline to final result. Returns 0.0 if no baseline is set or no steps are recorded.
- Return type:
float
- has_improvements() → bool
Check if any optimization steps have been recorded.
- Returns:
True if at least one optimization step has been recorded, False otherwise.
- Return type:
bool
- record_step(step: PerformanceStep) → None
Record a successful optimization step.
- Parameters:
step (PerformanceStep) – The PerformanceStep to record
- Return type:
None
- set_baseline_time(time: float) → None
Set the baseline execution time before any optimizations.
- Parameters:
time (float) – The initial kernel execution time
- Return type:
None
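A stripped-down tracker with the same method names shows how these pieces fit together. It is a sketch, not the library class: TrackerSketch and StepSketch are hypothetical, and the total improvement is assumed to use the same ((old_time - new_time) / old_time) * 100 formula as performance_threshold, measured from the baseline to the last recorded step.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepSketch:
    """Hypothetical, reduced version of PerformanceStep."""
    step_description: str
    new_execution_time: float


@dataclass
class TrackerSketch:
    """Hypothetical, reduced version of PerformanceTracker."""
    baseline_time: Optional[float] = None
    steps: List[StepSketch] = field(default_factory=list)

    def set_baseline_time(self, time: float) -> None:
        self.baseline_time = time

    def record_step(self, step: StepSketch) -> None:
        self.steps.append(step)

    def has_improvements(self) -> bool:
        return bool(self.steps)

    def get_total_improvement(self) -> float:
        # Documented behaviour: 0.0 without a baseline or without recorded steps.
        if self.baseline_time is None or not self.steps:
            return 0.0
        final_time = self.steps[-1].new_execution_time
        return (self.baseline_time - final_time) / self.baseline_time * 100


tracker = TrackerSketch()
print(tracker.get_total_improvement())  # 0.0
tracker.set_baseline_time(2.0)
tracker.record_step(StepSketch("shared memory tiling", 1.5))
tracker.record_step(StepSketch("loop unrolling", 1.0))
print(tracker.has_improvements())       # True
print(tracker.get_total_improvement())  # 50.0
```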
- class llm_kernel_tuner.PerformanceStep(step_description: str, kernel_code: str, old_execution_time: float | None, new_execution_time: float, improvement_percentage: float, tunable_parameters: Dict[str, List[Any]], restrictions: List[str], best_tune_params: Dict[str, Any], timestamp: datetime)
Data class representing a successful optimization step.
This class captures comprehensive information about each accepted optimization step, including the step description, kernel code changes, performance metrics, tunable parameters, and restrictions.
- Parameters:
step_description (str)
kernel_code (str)
old_execution_time (float | None)
new_execution_time (float)
improvement_percentage (float)
tunable_parameters (Dict[str, List[Any]])
restrictions (List[str])
best_tune_params (Dict[str, Any])
timestamp (datetime)
- step_description
Human-readable description of the optimization step
- Type:
str
- kernel_code
The optimized kernel code after this step
- Type:
str
- old_execution_time
Previous best execution time (None for first step)
- Type:
float | None
- new_execution_time
New execution time after optimization
- Type:
float
- improvement_percentage
Calculated improvement percentage
- Type:
float
- tunable_parameters
The tunable parameters used for this step
- Type:
Dict[str, List[Any]]
- restrictions
Parameter restrictions applied during tuning
- Type:
List[str]
- best_tune_params
The best parameter values found for this kernel
- Type:
Dict[str, Any]
- timestamp
When this step was recorded
- Type:
datetime.datetime
- class llm_kernel_tuner.retry.RetryPolicy(max_retries: int = 3, handlers: Dict[Type[Exception], Callable[[Any, Exception], Any]] | None = None, default_handler: Callable[[Any, Exception], Any] | None = None)
Defines a retry policy for a retry wrapper.
- Parameters:
max_retries (int, optional) – The maximum number of retries to attempt.
handlers (Optional[Dict[Type[Exception], Callable[[Any, Exception], Any]]]) – A dictionary of exception types to handler functions. If an exception is raised during a retry, the handler function will be called with the current state.
default_handler (Optional[Callable[[Any, Exception], Any]]) – A default handler function to use if no specific handler is provided for an exception.
Note
Either handlers or default_handler must be provided.
See also
See Retry Policy for usage.
- llm_kernel_tuner.retry.create_retry_wrapper(what_to_rerty: Callable[[...], Any] | CompiledStateGraph, policy: RetryPolicy, state_type: Type[Any] = <class 'dict'>) → CompiledStateGraph
Creates a retry wrapper around a graph or a function using the specified policy. The wrapper catches any exception raised by the wrapped function or graph and executes the matching handler defined in the policy.
Example usage:
    ...
    graph = graph_builder.compile()

    def value_error_handler(retry_state: State, error: Exception) -> State:
        print("ValueError occurred:", error)
        print(f"Current value: {retry_state['value']}")
        retry_state["value"] += 1
        return retry_state

    retry_policy: RetryPolicy = RetryPolicy(max_retries=7, handlers={ValueError: value_error_handler})
    wrapped_graph = create_retry_wrapper(graph, retry_policy)
The example above will create a retry policy that retries up to 7 times and will call the value_error_handler function if a ValueError is raised during a retry.
- Parameters:
what_to_rerty (Callable[[Any], Any], CompiledStateGraph) – The graph or a function to wrap
policy (RetryPolicy) – The retry policy to use
state_type (Type[Any], optional) – Type of the state used for the graph
- Returns:
Graph or function wrapped with the retry policy
- Return type:
CompiledStateGraph
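The retry behaviour can be pictured with a plain-Python loop that does not depend on LangGraph. This is a conceptual sketch only: run_with_retries is a hypothetical helper, and the exact semantics (one initial attempt plus up to max_retries retries, handler chosen by exception type with default_handler as fallback) are assumptions based on the descriptions above.

```python
def run_with_retries(fn, state, max_retries=3, handlers=None, default_handler=None):
    """Call fn(state); on failure let a handler adjust the state, then retry."""
    handlers = handlers or {}
    retries = 0
    while True:
        try:
            return fn(state)
        except Exception as error:
            if retries >= max_retries:
                raise  # retry budget exhausted, surface the last error
            # Pick the first handler registered for this exception type.
            handler = next(
                (h for exc_type, h in handlers.items() if isinstance(error, exc_type)),
                default_handler,
            )
            if handler is None:
                raise
            state = handler(state, error)
            retries += 1


def needs_at_least_three(state):
    if state["value"] < 3:
        raise ValueError(f"value too small: {state['value']}")
    return state


def value_error_handler(retry_state, error):
    retry_state["value"] += 1
    return retry_state


final_state = run_with_retries(
    needs_at_least_three, {"value": 0},
    max_retries=7, handlers={ValueError: value_error_handler},
)
print(final_state)  # {'value': 3}
```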