# hub.solver.pi.pi

# PI

Policy Iteration solver for Markov Decision Processes.

Enumerates all reachable states via BFS from the initial state, then alternates between policy evaluation (Gauss-Seidel sweeps computing V^pi) and policy improvement (greedy action selection maximizing Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')) until the policy stabilizes.

Terminal states (where is_terminal() returns true) are absorbing; their value is set by the terminal_value functor. It defaults to Value(reward=0), which suits goal-like terminals; return a large negative reward for dead-end-like terminals.
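The loop described above can be sketched in plain Python on a toy problem. This is a minimal illustration, not the library's implementation: the TRANSITIONS table, the helper names, and the choices gamma=0.9 and epsilon=1e-3 are assumptions made for the example.

```python
from collections import deque

# Hypothetical toy MDP: TRANSITIONS[s][a] lists (probability, next_state, reward).
TRANSITIONS = {
    0: {"stay": [(1.0, 0, -1.0)], "right": [(1.0, 1, -1.0)]},
    1: {"left": [(1.0, 0, -1.0)], "right": [(0.8, 2, -1.0), (0.2, 0, -1.0)]},
    2: {"left": [(1.0, 1, -1.0)], "right": [(1.0, 3, -1.0)]},
    3: {},  # goal-like terminal state, absorbing
}


def is_terminal(state):
    return not TRANSITIONS[state]


def terminal_value(state):
    return 0.0  # mirrors the documented default, Value(reward=0)


def reachable_states(initial):
    """Enumerate the states reachable from `initial` in breadth-first order."""
    order, seen, queue = [], {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        order.append(state)
        for outcomes in TRANSITIONS[state].values():
            for _, nxt, _ in outcomes:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return order


def q_value(state, action, V, gamma):
    # Same as R(s,a) + gamma * sum_s' P(s'|s,a) * V(s'), with R(s,a) the expected reward.
    return sum(p * (r + gamma * V[nxt]) for p, nxt, r in TRANSITIONS[state][action])


def policy_iteration(initial, gamma, epsilon):
    states = reachable_states(initial)
    V = {s: terminal_value(s) if is_terminal(s) else 0.0 for s in states}
    # Seed the policy with the first applicable action of each non-terminal state.
    policy = {s: next(iter(TRANSITIONS[s])) for s in states if not is_terminal(s)}
    iterations = 0
    while True:
        iterations += 1
        # Policy evaluation: in-place (Gauss-Seidel) sweeps until the residual is small.
        residual = epsilon + 1.0
        while residual > epsilon:
            residual = 0.0
            for s in policy:
                new_value = q_value(s, policy[s], V, gamma)
                residual = max(residual, abs(new_value - V[s]))
                V[s] = new_value
        # Policy improvement: switch only on a strict gain, so ties keep the old action.
        stable = True
        for s in policy:
            best = max(TRANSITIONS[s], key=lambda a: q_value(s, a, V, gamma))
            if q_value(s, best, V, gamma) > q_value(s, policy[s], V, gamma) + 1e-9:
                policy[s] = best
                stable = False
        if stable:
            return policy, V, iterations


policy, V, iterations = policy_iteration(initial=0, gamma=0.9, epsilon=1e-3)
```

Because values are updated in place, each sweep already uses the freshest estimates of the states visited earlier in the same sweep, which is what distinguishes Gauss-Seidel from a Jacobi-style update.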

This implementation supports two optional warm-starts:

heuristic: initializes V(s) before the first evaluation sweep (a non-standard extension; the default, Value(reward=0), gives standard PI).
initial_policy: seeds pi(s) with a domain-specific action before the first evaluation (defaults to the first applicable action).

When both are provided and consistent with each other, the first evaluation converges in very few sweeps.
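To see why a consistent warm start shortens the first evaluation, the following sketch counts Gauss-Seidel sweeps on a toy chain, once from V = 0 and once from an exact heuristic. The chain, the fixed "move right" policy, and the names GAMMA, N, evaluate and heuristic are all assumptions made for illustration.

```python
GAMMA = 0.9
N = 10  # hypothetical chain: states 0..9, state 10 is a goal-like terminal


def evaluate(V, epsilon=1e-3):
    """Gauss-Seidel evaluation of the 'always move right' policy; returns the sweep count."""
    sweeps = 0
    while True:
        sweeps += 1
        residual = 0.0
        # Sweeping from the farthest state to the nearest is the slow order for a cold start.
        for s in range(N):
            new_value = -1.0 + GAMMA * V[s + 1]
            residual = max(residual, abs(new_value - V[s]))
            V[s] = new_value
        if residual <= epsilon:
            return sweeps


def heuristic(s):
    # Exact discounted return of walking N - s steps at -1 per step.
    return -(1.0 - GAMMA ** (N - s)) / (1.0 - GAMMA)


cold = [0.0] * (N + 1)                           # standard PI start, V(s) = 0
warm = [heuristic(s) for s in range(N)] + [0.0]  # warm start, V(s) = h(s)
cold_sweeps = evaluate(cold)
warm_sweeps = evaluate(warm)
```

With the exact heuristic, the first sweep already finds a residual below epsilon, whereas the cold start needs roughly one sweep per state of the chain before the goal value has propagated back to state 0.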

# Constructor PI

PI(
  domain_factory: Callable[[], Domain],
  heuristic: Callable[[Domain, D.T_state], StrDict[Value[D.T_value]]] = <lambda function>,
  terminal_value: Callable[[D.T_state], Value[D.T_value]] = <lambda function>,
  initial_policy: Optional[Callable[[Domain, D.T_state], StrDict[list[D.T_event]]]] = None,
  discount: float = 0.999,
  epsilon: float = 0.001,
  max_eval_sweeps: int = 0,
  parallel: bool = False,
  shared_memory_proxy = None,
  callback: Callable[[PI], bool] = <lambda function>,
  verbose: bool = False
) -> None

Construct a Policy Iteration solver instance

# Parameters

domain_factory: The lambda function to create a domain instance.
heuristic: Optional function h(domain, state) -> Value used to initialize V(s) = h(s).reward before the first policy evaluation sweep (non-standard warm-start). Defaults to Value(reward=0) = standard PI.
terminal_value: Function f(state) -> Value assigning a fixed value to terminal (absorbing) states. Use Value(reward=0) for goal-like terminals and Value(reward=-penalty) for dead-end-like terminals. Defaults to Value(reward=0).
initial_policy: Optional function pi(domain, state) -> action to seed the initial policy. When provided, each state's policy is initialized to the returned action (falling back to first applicable if the action is not applicable). Defaults to None (first applicable action).
discount: Value function's discount factor. Defaults to 0.999.
epsilon: Maximum Bellman error for policy evaluation convergence. Defaults to 0.001.
max_eval_sweeps: Maximum number of Gauss-Seidel sweeps per policy evaluation phase. 0 means unlimited (exact evaluation until convergence). A positive value yields modified policy iteration, which can prevent divergence when discount=1.0 and the current policy has cycles. Defaults to 0.
parallel: Parallelize evaluation sweeps on different processes. Defaults to False.
shared_memory_proxy: The optional shared memory proxy. Defaults to None.
callback: Lambda function called at the end of each evaluate/improve iteration, taking the solver as argument, returning true to stop. Defaults to never stop.
verbose: Whether verbose messages should be logged. Defaults to False.
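The role of max_eval_sweeps with discount=1.0 can be illustrated on a two-state cycle. The sketch below is an assumption-laden toy, not the solver's code: the TRANSITIONS table, the state names, and the helper functions are hypothetical. With an unlimited number of sweeps the evaluation of the cycling policy would never converge, since each sweep lowers the values by a constant; capping the sweeps lets the improvement step find the exit.

```python
# Hypothetical MDP: TRANSITIONS[s][a] = (next_state, reward); "G" is a goal terminal.
TRANSITIONS = {
    "A": {"to_B": ("B", -1.0), "exit": ("G", -1.0)},
    "B": {"to_A": ("A", -1.0)},
}
GAMMA = 1.0  # undiscounted: the cycle A -> B -> A has an unbounded negative value


def q_value(s, a, V):
    nxt, reward = TRANSITIONS[s][a]
    return reward + GAMMA * V[nxt]


def evaluate(policy, V, epsilon, max_eval_sweeps):
    """Gauss-Seidel sweeps; max_eval_sweeps=0 would mean 'sweep until convergence'."""
    sweeps = 0
    while max_eval_sweeps == 0 or sweeps < max_eval_sweeps:
        sweeps += 1
        residual = 0.0
        for s in policy:
            new_value = q_value(s, policy[s], V)
            residual = max(residual, abs(new_value - V[s]))
            V[s] = new_value
        if residual <= epsilon:
            break
    return sweeps


def modified_policy_iteration(epsilon=1e-3, max_eval_sweeps=3):
    V = {"A": 0.0, "B": 0.0, "G": 0.0}
    policy = {"A": "to_B", "B": "to_A"}  # first applicable actions form a cycle
    sweep_counts = []
    while True:
        sweep_counts.append(evaluate(policy, V, epsilon, max_eval_sweeps))
        stable = True
        for s in policy:
            best = max(TRANSITIONS[s], key=lambda a: q_value(s, a, V))
            if q_value(s, best, V) > q_value(s, policy[s], V):
                policy[s] = best
                stable = False
        if stable:
            return policy, V, sweep_counts


policy, V, sweep_counts = modified_policy_iteration()
```

The first evaluation is cut off at three sweeps while the values are still drifting downward; the improvement step then switches state A to the exit action, and the second evaluation converges on its own.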

# call_domain_method ParallelSolver

call_domain_method(
  self,
  name,
  *args
)

Calls a parallel domain's method. This is the only way to get a domain method for a parallel domain.

# check_domain Solver

check_domain(
  domain: Domain
) -> bool

Check whether a domain is compliant with this solver type.

By default, Solver.check_domain() provides some boilerplate code and internally calls Solver._check_domain_additional() (which returns True by default but can be overridden to define specific checks in addition to the "domain requirements"). The boilerplate code automatically checks whether all domain requirements are met.
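The split between the boilerplate check and the overridable hook can be pictured with a small stand-alone sketch. All classes below (Domain, Initializable, SolverSketch, and so on) are hypothetical stand-ins written for this example; they only mimic the described behavior and are not the library's classes.

```python
class Domain:
    """Hypothetical stand-in for a domain base class."""


class Initializable:
    """Hypothetical stand-in for one builder class a solver may require."""


class SolverSketch:
    @classmethod
    def get_domain_requirements(cls):
        return [Initializable]

    @classmethod
    def check_domain(cls, domain):
        # Boilerplate part: the domain must inherit from every required class.
        meets_requirements = all(
            isinstance(domain, req) for req in cls.get_domain_requirements()
        )
        return meets_requirements and cls._check_domain_additional(domain)

    @classmethod
    def _check_domain_additional(cls, domain):
        return True  # default: no extra check


class SmallStateSpaceSolver(SolverSketch):
    @classmethod
    def _check_domain_additional(cls, domain):
        # Hypothetical extra rule: refuse domains that declare too many states.
        return getattr(domain, "nb_states", 0) <= 1000


class GridDomain(Domain, Initializable):
    nb_states = 25


class HugeDomain(Domain, Initializable):
    nb_states = 10**6


class BareDomain(Domain):
    nb_states = 25
```

Here SmallStateSpaceSolver.check_domain(GridDomain()) passes both stages, HugeDomain fails only the additional check, and BareDomain fails the requirements check before the hook is even consulted.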

# Parameters

domain: The domain to check.

# Returns

True if the domain is compliant with the solver type (False otherwise).

# close ParallelSolver

close(
  self
)

Joins the parallel domains' processes.

# complete_with_default_hyperparameters Hyperparametrizable

complete_with_default_hyperparameters(
  kwargs: dict[str, Any],
  names: Optional[list[str]] = None
)

Add missing hyperparameters to kwargs by using default values

Args:
kwargs: keyword arguments to complete (e.g. for __init__, init_model, or solve)
names: names of the hyperparameters to add if missing. By default, all available hyperparameters.

Returns: a new dictionary, completion of kwargs
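The completion semantics described here (a new dictionary, user-supplied values kept, defaults added only where missing) can be sketched as follows. The function complete_with_defaults and the DEFAULTS table are hypothetical; whether discount, epsilon and max_eval_sweeps are actually registered as hyperparameters of PI is an assumption, and the values are simply copied from the constructor signature.

```python
def complete_with_defaults(kwargs, defaults, names=None):
    """Return a new dict: `kwargs` plus the default of every missing name in `names`."""
    if names is None:
        names = list(defaults)
    completed = dict(kwargs)  # copy, so the caller's dict is left untouched
    for name in names:
        completed.setdefault(name, defaults[name])
    return completed


# Assumed defaults, copied from the constructor signature for illustration only.
DEFAULTS = {"discount": 0.999, "epsilon": 0.001, "max_eval_sweeps": 0}

user_kwargs = {"discount": 0.95}
completed = complete_with_defaults(user_kwargs, DEFAULTS)
partial = complete_with_defaults(user_kwargs, DEFAULTS, names=["epsilon"])
```

Passing names restricts which defaults are filled in, so partial gains only epsilon while completed gains every missing entry.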

# copy_and_update_hyperparameters Hyperparametrizable

copy_and_update_hyperparameters(
  names: Optional[list[str]] = None,
  **kwargs_by_name: dict[str, Any]
) -> list[Hyperparameter]

Copy hyperparameters definition of this class and update them with specified kwargs.

This is useful to define hyperparameters for a child class for which only choices of the hyperparameter change for instance.

Args:
names: names of hyperparameters to copy. Default to all.
**kwargs_by_name: for each hyperparameter specified by its name, the attributes to update. If a given hyperparameter name is not specified, the hyperparameter is copied without further update.

Returns:

# get_default_hyperparameters Hyperparametrizable

get_default_hyperparameters(
  names: Optional[list[str]] = None
) -> dict[str, Any]

Get hyperparameters default values.

Args: names: names of the hyperparameters to choose. By default, all available hyperparameters will be suggested.

Returns: a mapping between hyperparameter's name_in_kwargs and its default value (None if not specified)

# get_domain ParallelSolver

get_domain(
  self
)

Returns the domain, optionally creating a parallel domain if not already created.

# get_domain_requirements Solver

get_domain_requirements(
) -> list[type]

Get domain requirements for this solver class to be applicable.

Domain requirements are classes from the skdecide.builders.domain package that the domain needs to inherit from.

# Returns

A list of classes to inherit from.

# get_explored_states PI

get_explored_states(
  self
) -> set[StrDict[D.T_observation]]

Get all reachable states discovered by BFS

# get_hyperparameter Hyperparametrizable

get_hyperparameter(
  name: str
) -> Hyperparameter

Get hyperparameter from given name.

# get_hyperparameters_by_name Hyperparametrizable

get_hyperparameters_by_name(
) -> dict[str, Hyperparameter]

Mapping from name to corresponding hyperparameter.

# get_hyperparameters_names Hyperparametrizable

get_hyperparameters_names(
) -> list[str]

List of hyperparameters names.

# get_nb_iterations PI

get_nb_iterations(
  self
) -> int

Get the number of evaluate/improve iterations performed

# get_nb_of_explored_states PI

get_nb_of_explored_states(
  self
) -> int

Get the number of states discovered by BFS

# get_next_action DeterministicPolicies

get_next_action(
  self,
  observation: StrDict[D.T_observation],
  domain: Optional[Domain] = None
) -> StrDict[list[D.T_event]]

Get the next deterministic action (from the solver's current policy).

# Parameters

observation: The observation for which next action is requested.
domain: the domain source of the observation. Typically used to get current applicable actions or action mask. NB: Be careful that the domain has not been autocast, so may not respect the T_domain specs.

# Returns

The next deterministic action.

# get_next_action_distribution UncertainPolicies

get_next_action_distribution(
  self,
  observation: StrDict[D.T_observation],
  domain: Optional[Domain] = None
) -> Distribution[StrDict[list[D.T_event]]]

Get the probabilistic distribution of next action for the given observation (from the solver's current policy).

# Parameters

observation: The observation to consider.
domain: the domain source of the observation. Typically used to get current applicable actions or action mask.

# Returns

The probabilistic distribution of next action.

# get_policy PI

get_policy(
  self
) -> dict[StrDict[D.T_observation], tuple[StrDict[list[D.T_event]], float]]

Get the full solution policy

# get_policy_changed_states PI

get_policy_changed_states(
  self
) -> set[StrDict[D.T_observation]]

Get states where the policy action changed in the last improvement

# get_solving_time PI

get_solving_time(
  self
) -> int

Get the solving time in milliseconds

# get_utility Utilities

get_utility(
  self,
  observation: StrDict[D.T_observation]
) -> D.T_value

Get the estimated on-policy utility of the given observation.

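In mathematical terms, for a fully observable domain, this function estimates:

$$V^\pi(s) = \underset{\tau \sim \pi}{\mathbb{E}}\left[R(\tau) \mid s_0 = s\right]$$

where $\pi$ is the current policy, any $\tau = (s_0, a_0, s_1, a_1, \ldots)$ represents a trajectory sampled from the policy, $R(\tau)$ is the return (cumulative reward) and $s_0$ the initial state for the trajectories.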

# Parameters

observation: The observation to consider.

# Returns

The estimated on-policy utility of the given observation.
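To make the expectation over trajectories concrete, the sketch below estimates an on-policy utility by averaging sampled returns on a toy process. It is only an illustration of the quantity being estimated: the process (reward -1 per step, stopping with probability 0.5 after each step), the function names, and the sample size are assumptions, and the solver itself would presumably read the value from its evaluated value function rather than sample.

```python
import random


def sample_return(rng, stop_probability=0.5):
    """Return the cumulative reward of one trajectory: -1 per step until the episode stops."""
    total = 0.0
    while True:
        total += -1.0
        if rng.random() < stop_probability:
            return total


def estimate_utility(n_trajectories, seed=0):
    """Average the returns of independently sampled trajectories from the start state."""
    rng = random.Random(seed)
    returns = [sample_return(rng) for _ in range(n_trajectories)]
    return sum(returns) / n_trajectories


estimate = estimate_utility(20000)
```

The number of steps is geometric with mean 2, so the true utility of the start state is -2 and the sample average should land close to it.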

# is_policy_defined_for Policies

is_policy_defined_for(
  self,
  observation: StrDict[D.T_observation]
) -> bool

Check whether the solver's current policy is defined for the given observation.

# Parameters

observation: The observation to consider.

# Returns

True if the policy is defined for the given observation memory (False otherwise).

# reset Solver

reset(
  self
) -> None

Reset whatever is needed on this solver before running a new episode.

This function does nothing by default but can be overridden if needed (e.g. to reset the hidden state of a LSTM policy network, which carries information about past observations seen in the previous episode).

# sample_action Policies

sample_action(
  self,
  observation: StrDict[D.T_observation],
  domain: Optional[Domain] = None
) -> StrDict[list[D.T_event]]

Sample an action for the given observation (from the solver's current policy).

# Parameters

observation: The observation for which an action must be sampled.
domain: the domain source of the observation. Typically used to get current applicable actions or action mask.

# Returns

The sampled action.

# solve FromInitialState

solve(
  self,
  from_memory: Optional[Memory[D.T_state]] = None
) -> None

Run the solving process.

# Parameters

from_memory: The source memory (state or history) from which we begin the solving process. If None, initial state is used if the domain is initializable, else a ValueError is raised.

TIP

The nature of the solutions produced here depends on the solver's other characteristics, such as policy and assessability.

# solve_from FromAnyState

solve_from(
  self,
  memory: Memory[D.T_state]
) -> None

Run the solving process from a given state.

# Parameters

memory: The source memory (state or history) of the transition.

TIP

The nature of the solutions produced here depends on the solver's other characteristics, such as policy and assessability.

# suggest_hyperparameter_with_optuna Hyperparametrizable

suggest_hyperparameter_with_optuna(
  trial: optuna.trial.Trial,
  name: str,
  prefix: str,
  **kwargs
) -> Any

Suggest hyperparameter value during an Optuna trial.

This can be used during Optuna hyperparameters tuning.

Args:
trial: optuna trial during hyperparameters tuning
name: name of the hyperparameter to choose
prefix: prefix to add to optuna corresponding parameter name (useful for disambiguating hyperparameters from subsolvers in case of meta-solvers)
**kwargs: options for optuna hyperparameter suggestions

Returns:

kwargs can be used to pass relevant arguments to:

trial.suggest_float()
trial.suggest_int()
trial.suggest_categorical()

For instance, it can:

add a low/high value if the hyperparameter does not define one, or override it to narrow the search (for float or int hyperparameters)
add a step or log argument (for float or int hyperparameters, see optuna.trial.Trial.suggest_float())
override choices for categorical or enum parameters to narrow the search
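How such overrides might reach the Optuna call can be sketched without running a real study. Everything below is hypothetical: RecordingTrial is a stand-in that only mirrors the keyword layout of optuna.trial.Trial.suggest_float(name, low, high, step=None, log=False), and suggest_float_hyperparameter is an illustrative helper, not the library's method.

```python
class RecordingTrial:
    """Stand-in for an Optuna trial that only records suggest_float calls."""

    def __init__(self):
        self.calls = []

    def suggest_float(self, name, low, high, *, step=None, log=False):
        self.calls.append((name, low, high, step, log))
        return low  # always pick the lower bound, to keep the example reproducible


def suggest_float_hyperparameter(trial, name, prefix, defaults, **kwargs):
    # User-supplied kwargs override the hyperparameter's own bounds and options.
    options = {**defaults, **kwargs}
    return trial.suggest_float(prefix + name, **options)


trial = RecordingTrial()
value = suggest_float_hyperparameter(
    trial,
    name="discount",
    prefix="pi.",
    defaults={"low": 0.9, "high": 1.0},  # assumed bounds of the hyperparameter
    low=0.99,                            # narrows the search from below
)
```

The recorded call shows the prefixed parameter name and the narrowed lower bound, while the upper bound still comes from the assumed defaults.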

# suggest_hyperparameters_with_optuna Hyperparametrizable

suggest_hyperparameters_with_optuna(
  trial: optuna.trial.Trial,
  names: Optional[list[str]] = None,
  kwargs_by_name: Optional[dict[str, dict[str, Any]]] = None,
  fixed_hyperparameters: Optional[dict[str, Any]] = None,
  prefix: str
) -> dict[str, Any]

Suggest hyperparameters values during an Optuna trial.

Args:
trial: optuna trial during hyperparameters tuning
names: names of the hyperparameters to choose. By default, all available hyperparameters will be suggested. If fixed_hyperparameters is provided, the corresponding names are removed from names.
kwargs_by_name: options for optuna hyperparameter suggestions, by hyperparameter name
fixed_hyperparameters: values of fixed hyperparameters, useful for suggesting subbrick hyperparameters, if the subbrick class is not suggested by this method, but already fixed. Will be added to the suggested hyperparameters.
prefix: prefix to add to optuna corresponding parameters (useful for disambiguating hyperparameters from subsolvers in case of meta-solvers)

Returns: mapping between the hyperparameter name and its suggested value. If the hyperparameter has an attribute name_in_kwargs, this is used as the key in the mapping instead of the actual hyperparameter name. The mapping is updated with fixed_hyperparameters.

kwargs_by_name[some_name] will be passed as **kwargs to suggest_hyperparameter_with_optuna(name=some_name)

# _check_domain_additional Solver

_check_domain_additional(
  domain: Domain
) -> bool

Check whether the given domain is compliant with the specific requirements of this solver type (i.e. the ones in addition to "domain requirements").

This is a helper function called by default from Solver.check_domain(). It focuses on specific checks, as opposed to taking also into account the domain requirements for the latter.

# Parameters

domain: The domain to check.

# Returns

True if the domain is compliant with the specific requirements of this solver type (False otherwise).

# _get_next_action DeterministicPolicies

_get_next_action(
  self,
  observation: StrDict[D.T_observation],
  domain: Optional[Domain] = None
) -> StrDict[list[D.T_event]]

Get the next deterministic action (from the solver's current policy).

# Parameters

observation: The observation for which next action is requested.
domain: the domain source of the observation. Typically used to get current applicable actions or action mask. NB: Be careful that the domain has not been autocast, so may not respect the T_domain specs.

# Returns

The next deterministic action.

# _get_next_action_distribution UncertainPolicies

_get_next_action_distribution(
  self,
  observation: StrDict[D.T_observation],
  domain: Optional[Domain] = None
) -> Distribution[StrDict[list[D.T_event]]]

Get the probabilistic distribution of next action for the given observation (from the solver's current policy).

# Parameters

observation: The observation to consider.
domain: the domain source of the observation. Typically used to get current applicable actions or action mask. NB: Be careful that the domain has not been autocast, so may not respect the T_domain specs.

# Returns

The probabilistic distribution of next action.

# _get_utility Utilities

_get_utility(
  self,
  observation: StrDict[D.T_observation]
) -> D.T_value

Get the estimated on-policy utility of the given observation.

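In mathematical terms, for a fully observable domain, this function estimates:

$$V^\pi(s) = \underset{\tau \sim \pi}{\mathbb{E}}\left[R(\tau) \mid s_0 = s\right]$$

where $\pi$ is the current policy, any $\tau = (s_0, a_0, s_1, a_1, \ldots)$ represents a trajectory sampled from the policy, $R(\tau)$ is the return (cumulative reward) and $s_0$ the initial state for the trajectories.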

# Parameters

observation: The observation to consider.

# Returns

The estimated on-policy utility of the given observation.

# _initialize Solver

_initialize(
  self
)

Launches the parallel domains. This method requires that the following have been recorded beforehand: self._domain_factory, the set of lambda functions passed to the solver's constructor (e.g. the heuristic lambda for heuristic-based solvers), and whether the parallel domain jobs should notify their status via the IPC protocol (required when interacting with other programming languages such as C++).

# _is_policy_defined_for Policies

_is_policy_defined_for(
  self,
  observation: StrDict[D.T_observation]
) -> bool

Check whether the solver's current policy is defined for the given observation.

# Parameters

observation: The observation to consider.

# Returns

True if the policy is defined for the given observation memory (False otherwise).

# _reset Solver

_reset(
  self
) -> None

Reset whatever is needed on this solver before running a new episode.

# _sample_action Policies

_sample_action(
  self,
  observation: StrDict[D.T_observation],
  domain: Optional[Domain] = None
) -> StrDict[list[D.T_event]]

Sample an action for the given observation (from the solver's current policy).

# Parameters

observation: The observation for which an action must be sampled.
domain: the domain source of the observation. Typically used to get current applicable actions or action mask. NB: Be careful that the domain has not been autocast, so may not respect the T_domain specs.

# Returns

The sampled action.

# _solve FromInitialState

_solve(
  self,
  from_memory: Optional[Memory[D.T_state]] = None
) -> None

Run the solving process.

# Parameters

from_memory: The source memory (state or history) from which we begin the solving process. If None, initial state is used if the domain is initializable, else a ValueError is raised.

TIP

The nature of the solutions produced here depends on the solver's other characteristics, such as policy and assessability.

# _solve_from FromAnyState

_solve_from(
  self,
  memory: Memory[D.T_state]
) -> None

Run Policy Iteration from a given state

# Parameters

memory: State from which to enumerate reachable states and run PI