Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Zero-Shot Trajectory Planning for Signal Temporal Logic Tasks

Authors: Ruijia Liu, Ancheng Hou, Xiao Yu, Xiang Yin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Simulation results demonstrate its effectiveness in generating dynamically feasible trajectories across diverse long-horizon STL tasks. Project Page: https://cps-sjtu.github.io/Zero-Shot-STL/ 1 Introduction Signal Temporal Logic (STL) is a formal specification language used to describe the temporal behavior of continuous signals. It has become widely adopted for specifying high-level robotic behaviors due to its expressiveness and the availability of both Boolean and quantitative evaluation measures. Controlling robots under STL task constraints, however, is a challenging problem, as it requires balancing both the satisfaction of the task and the feasibility of the system dynamics. In cases where the environment and system dynamics are fully known, several representative methods have been developed, including optimization-based approaches [1, 2, 3], gradient-based techniques [4, 5], and sampling-based methods [6]. However, these methods are often difficult to apply in practical scenarios, where the system dynamics and environment are either unknown or difficult to model. To address the challenge of unknown dynamics, several learning-based approaches have been proposed. One typical method is reinforcement learning (RL) [7, 8, 9, 10, 11, 12], where an appropriate reward function is designed to approximate the satisfaction of the STL task. However, these methods often struggle with long-horizon STL tasks and lack generalization capabilities across different tasks. Another approach involves first learning a system model and then integrating it with model-based planning methods. For example, in [13], the authors trained a neural network to approximate the system dynamics and combined it with an optimization-based approach. However, this method is limited to simple short-horizon STL tasks due to its high computational cost. 
In [14], the authors used goal-conditioned RL to train multiple goal-conditioned policies, referred to as "skills," to accomplish specific objectives. They then applied a search algorithm to determine the optimal sequence of "skills" needed to satisfy the given STL tasks. While this approach enables a certain degree of task generalization, these tasks must be based on pre-defined objectives associated with the skills. More recently, generative models, such as diffusion models [15], have emerged as a new approach for generating trajectories for systems with unknown dynamics [16, 17, 18, 19, 20], gaining popularity across various applications. Compared to traditional model-based reinforcement learning methods, these generative approaches are better suited for long-horizon decision-making and offer greater test-time flexibility [16], making them particularly effective for complex tasks. For example, for finite Linear Temporal Logic (LTLf) tasks, [21] introduced a classifier-based guidance approach to steer the sampling of diffusion models, ensuring that generated trajectories satisfy LTLf requirements. Similarly, [22] proposed a hierarchical framework that decomposes co-safe LTL tasks into subtasks using hierarchical reinforcement learning. This framework employs a diffusion model with a determinant-based sampling strategy to generate diverse low-level trajectories, improving both planning success rates and task generalization. In the context of STL trajectory planning, the use of generative models has also been explored recently. For example, [23] proposed a classifier-based guidance approach that leverages robustness gradients to guide diffusion model sampling, enabling the generation of vehicle trajectories that adhere to traffic rules specified by STL. Building on this, [24] introduced a data augmentation method to enhance trajectory diversity and improve rule satisfaction rates. 
However, these approaches are still limited to simpler STL tasks, primarily due to the complexity of optimizing robustness values and the inherent trade-off between maximizing reward objectives and maintaining the feasibility of the generated trajectories [25]. In this paper, considering that trajectories satisfying complex STL specifications are typically difficult to collect in real-world scenarios, we focus on composing such trajectories by stitching together short trajectory chunks. The main challenge lies in determining appropriate ways to combine these chunks such that the resulting trajectory satisfies the global STL specification while maintaining dynamic consistency and feasibility. Inspired by recent advances in decomposition-based STL planning [26], we propose a novel hierarchical framework that integrates task decomposition, search algorithms, and generative models. First, complex STL tasks are decomposed into a set of time-aware reach-avoid progresses. Next, a search algorithm, heuristically guided by the trajectory data, is employed to allocate these progresses and generate a sequence of waypoints with corresponding timestamps. Finally, a pre-trained diffusion model, trained on task-agnostic data, is used to sequentially generate trajectory chunks that connect adjacent timed waypoints. All trajectory chunks are then stitched together to form the complete trajectory. To the best of our knowledge, our algorithm is the first data-driven approach with zero-shot generalization capability for complex STL tasks. We have formally proven the soundness of our STL decomposition and planning algorithm, which guarantees that the generated trajectories satisfy any given STL specification. Furthermore, we empirically evaluate the dynamic consistency and feasibility of the planned trajectories through simulation experiments. 
Simulation results demonstrate that our method achieves a high execution success rate across diverse long-horizon STL tasks, where the diffusion-based baseline fails. Moreover, it outperforms both the diffusion-based baseline and a standard non-data-driven method in planning efficiency.
Researcher Affiliation	Academia	Ruijia Liu, Department of Automation, Shanghai Jiao Tong University; Ancheng Hou, Department of Automation, Shanghai Jiao Tong University; Xiao Yu, Institute of Artificial Intelligence, Xiamen University; Xiang Yin, Department of Automation, Shanghai Jiao Tong University
Pseudocode	Yes	Algorithm 1 Main-Allocation. Input: initial state x0, reachability progresses PR, invariance progresses PI, time variable constraints T. Output: a valid waypoint sequence s, or None if no solution is found. 1: Initialize: 2: current state x ← x0; current time t ← 0 3: sequence s ← [(x, t)] 4: stack ← [(x, t, PR, T, s)] 5: while stack is not empty do 6: (x, t, PR, T, s) ← pop(stack) 7: if PR = ∅ then 8: return s, T // All reachability progresses satisfied 9: for each progress R(aΛ, bΛ, µ) ∈ PR do 10: t′, x′ ← SampleState(R(aΛ, bΛ, µ), x, t, T, PI) Algorithm 2 SampleState. Input: reachability progress R(a, b, µ), current state x, time step t, time variable constraints T, invariance progresses PI. Output: assigned satisfaction time tnew of constraint R(a, b, µ) and new state x′, or None if no solution is found. 1: tmin ← a^min_{Λ,T}, tmax ← b^max_{Λ,T} 2: for up to Nmax attempts do 3: sample state x′ such that x′ ⊨ µ 4: Initialize: conflict time interval O ← ∅ 5: for all I(c, dΛ, µ′) ∈ PI^det with determined starting time do 6: if x′ ⊭ µ′ then 7: O ← O ∪ [c, d^min_{Λ,T}] 8: end if 9: end for 10: t′ ← t + TimePredict(x, x′) 11: if t′ > tmax or [max{t′, tmin}, tmax] \ O = ∅ then 12: continue to next sampling attempt 13: end if 14: tnew ← earliest time in [max{t′, tmin}, tmax] \ O 15: return tnew, x′ 16: end for 17: return None // No valid time found
Open Source Code	Yes	We provide detailed implementation descriptions and key experimental settings in both the main text and the appendix. In addition, we have included the experimental code in the supplementary materials. We also commit to open-sourcing the code and will add the corresponding link in the paper upon publication.
Open Datasets	Yes	Both algorithms use diffusion models trained on the D4RL dataset [38] to generate trajectories and a PD controller for trajectory execution.
Dataset Splits	No	Both algorithms use diffusion models trained on the D4RL dataset [38] to generate trajectories and a PD controller for trajectory execution.
Hardware Specification	Yes	All experiments were run on a PC with an Intel i7-13700K CPU and Nvidia 4090 GPU.
Software Dependencies	No	Using the open-source library stlpy [2], we calculate the robustness values for both the planned and actual execution trajectories as 0.180 and 0.115, respectively.
Experiment Setup	Yes	To generate random STL tasks, we design nine STL task templates, as illustrated in Table F.1. For each template, we randomly sample time intervals as well as the positions and sizes of the regions corresponding to the atomic predicates, thereby producing diverse randomized STL tasks. The feasibility of each generated STL task is verified using the Progress Allocation module of our method in all experiments, except for the custom-built environment experiment, where feasibility is instead verified using the sound-and-complete baseline algorithm. Table F.1: STL Task Templates for Experiments Type STL Templates 1 FI1µ1 G( µ2) 2 FI1µ1 FI2µ2 3 FI1µ1 ( µ1UI1µ2) 4 FI1(µ1 (FI2(µ2 FI3(µ3 FI4(µ4))))) 5 FI1(µ1 (FI2(µ2 FI3(µ3)))) G( µ4) 6 FI1(µ1) FI2(µ2) FI3(µ3) G( µ4) 7 FI1(GI2(µ1)) FI3(µ2) G( µ3) 8 FI1(µ1 FI2(GI3(µ2))) 9 FI1(µ1 FI2(µ2) FI3(µ3) GI4(µ4)) F.3 Details of Experiment in Maze2D Environment. Baseline Algorithm. We compare our algorithm with the method proposed in [23], which adopts classifier-based guidance and directly uses the gradient of the trajectory's robustness value to guide the sampling process of the diffusion model, thereby optimizing the robustness of the generated trajectory. The gradient of robustness is calculated by the STLCG method proposed in [55]. In the following text, we refer to this algorithm as the Robustness Guided Diffuser (RGD). Experimental Settings. The diffusion models used in both RGD and our algorithm are trained following the procedure in [16] using the D4RL dataset [38]. A simple multilayer perceptron (MLP) with four fully connected layers is used as the TimePredict model in our algorithm, and it is trained on the same dataset. In our experiments, we employ the diffusion model to generate only the state sequence of the trajectory and use a simple PD controller to follow the state sequence at run time, which yields the actual execution trajectory as described in Section 3.5. Evaluation Metrics. 
For each pair of Maze2D environment (U-Maze, Medium, Large) and task template listed in Table F.1, we generate 150 feasible random STL formulae as test cases, run both RGD and our algorithm on them, and record the following metrics: Execution Success Rate (SR): The proportion of cases where the actual execution trajectory achieves a non-negative robustness value. Average Robustness Value (RV): The average robustness value of the executed trajectories, after discarding the top and bottom 5% to mitigate the effect of outliers. Total Planning Time (T0): The average total running time (in seconds) to plan a trajectory per case. In addition, we record the average Trajectory Generation Time (T1), which is the average time spent by the Trajectory Generation module of our algorithm per case. This metric lets us analyze the proportion of runtime contributed by each module in our algorithm. For total planning time, we report both the mean and standard deviation in our results. For success rate, we compute the proportion of successful cases over the total number of test cases. Full Results. The full experimental results are presented in Table F.2. More Cases. We visualized some of the experimental results. The actual execution trajectories for some successful cases (where the STL tasks were satisfied by the execution trajectories) are shown in Figure F.2 to Figure F.4. For some failed cases (where the STL tasks were not satisfied), the trajectories planned by our algorithm are shown in Figure F.5. Failure-case Analysis. By analyzing the failure cases, we identified that the primary reason for execution failure is that the trajectories generated by the diffusion model significantly violated system dynamics, such as colliding with obstacles in the environment or having excessively large distances between consecutive states. 
To further enhance the actual execution success rate, our method can be integrated with receding horizon control methods [56] or online replanning strategies [57]. This extension will be explored in our future work.
Table F.2: Full Result of Experiment in Maze2D Environment. U: U-Maze; M: Medium; L: Large; RGD: Robustness Guided Diffuser; T1: Trajectory Generation Time. - means that RGD fails to generate a feasible trajectory.
Env	Type	Success Rate (%) RGD	Success Rate (%) ours	Robustness Value RGD	Robustness Value ours	Total Planning Time (s) RGD	Total Planning Time (s) ours	T1 (s)
U	1	80.00	97.33	0.1084 ± 0.1132	0.1938 ± 0.0715	13.43 ± 1.51	0.86 ± 0.13	0.86
U	2	36.67	92.00	-0.2208 ± 0.2826	0.1504 ± 0.0814	16.65 ± 2.06	0.64 ± 0.17	0.64
U	3	32.00	91.33	-0.1695 ± 0.2521	0.1354 ± 0.0745	19.68 ± 11.76	1.31 ± 0.15	1.21
U	4	-	90.00	-	0.1120 ± 0.1424	-	1.64 ± 0.20	1.38
U	5	-	84.67	-	0.0921 ± 0.0984	-	2.59 ± 0.45	2.46
U	6	-	86.67	-	0.1003 ± 0.0806	-	2.35 ± 0.42	2.35
U	7	-	89.33	-	0.1293 ± 0.0766	-	1.86 ± 0.33	1.86
U	8	-	97.33	-	0.1965 ± 0.0435	-	0.98 ± 0.15	0.81
U	9	-	88.67	-	0.1047 ± 0.1081	-	1.53 ± 0.27	1.52
M	1	70.00	94.67	0.0885 ± 0.2768	0.3205 ± 0.0957	53.90 ± 5.78	3.74 ± 0.34	3.74
M	2	34.67	89.33	-0.2277 ± 0.3089	0.2013 ± 0.1950	70.69 ± 8.27	2.72 ± 0.67	2.72
M	3	35.33	83.33	-0.2393 ± 0.3130	0.1761 ± 0.2060	129.87 ± 27.25	5.53 ± 0.33	5.43
M	4	-	83.33	-	0.1534 ± 0.2615	-	7.09 ± 0.54	6.80
M	5	-	82.00	-	0.1457 ± 0.2182	-	11.36 ± 1.32	11.22
M	6	-	84.67	-	0.1477 ± 0.2454	-	11.76 ± 1.36	11.76
M	7	-	90.00	-	0.2148 ± 0.1599	-	8.01 ± 1.21	8.00
M	8	-	91.33	-	0.2712 ± 0.1772	-	3.87 ± 0.38	3.71
M	9	-	82.67	-	0.1320 ± 0.2454	-	6.53 ± 0.74	6.52
L	1	34.67	92.00	-0.1927 ± 0.3198	0.3180 ± 0.1228	55.48 ± 5.31	3.62 ± 0.33	3.62
L	2	16.67	81.33	-0.3926 ± 0.2284	0.1672 ± 0.2411	68.70 ± 7.65	2.88 ± 0.63	2.87
L	3	26.67	79.33	-0.2886 ± 0.2934	0.1639 ± 0.2286	136.35 ± 38.12	5.59 ± 0.34	5.49
L	4	-	69.33	-	0.0589 ± 0.3133	-	7.46 ± 0.52	7.13
L	5	-	79.33	-	0.1324 ± 0.2663	-	12.62 ± 0.75	12.45
L	6	-	73.33	-	0.1104 ± 0.2673	-	12.37 ± 1.35	12.37
L	7	-	84.00	-	0.1770 ± 0.2360	-	8.18 ± 0.68	8.18
L	8	-	85.33	-	0.2424 ± 0.2502	-	3.93 ± 0.33	3.75
L	9	-	76.67	-	0.0823 ± 0.2703	-	6.93 ± 0.60	6.92
F.4 Details of Experiments under More Complex Dynamics
To further evaluate the generality and robustness of our framework, we conduct experiments on two dynamics-rich domains from the offline goal-conditioned RL benchmark OGBench [42]: cube-singleplay ("Cube") and antmaze-medium-navigate ("AntMaze"). Experimental Settings. The Cube environment involves a 6-DoF UR5e robotic arm manipulating a cube, while AntMaze features an 8-DoF quadruped ant navigating through a complex maze. In both environments, training relies solely on STL task-agnostic trajectory datasets provided by the benchmark; during evaluation, we replace the original goal-conditioned objectives with randomly generated STL tasks that encode multi-stage temporal and spatial constraints. For AntMaze, we adopt the STL formulation described in Section 5.1: the agent must visit designated regions in a specific temporal order to satisfy the task specification. Planning is performed in the two-dimensional x-y workspace, while execution uses an inverse dynamics controller that maps a 29-dimensional observation and the next x-y target to an 8-dimensional action at each control step, as in [47]. Given the higher dynamic complexity of this domain, we set k = 2, such that each planning timestep corresponds to two control steps, as described in Section 3.5. For the Cube environment, we focus on the end-effector's Cartesian motion rather than the physical manipulation of objects. Trajectories are generated in x-y-z space and tracked via a PD controller, as in Section 3.5. This abstraction ensures methodological consistency with other experiments while still capturing the essential planning characteristics of high-dimensional robotic control. Evaluation Metrics. 
For each environment and STL task template (as defined in Table F.1), we randomly generate 100 feasible STL formulae as test cases. We report two quantitative metrics: Execution Success Rate (SR) and Total Planning Time (T0). Results. The complete results are summarized in Table F.3. Our framework achieves high success rates across both environments, with moderate computational cost despite the distinct dynamic structures of the two systems.
Table F.3: Results in Cube and AntMaze. SR: Execution Success Rate; T0: Total Planning Time.
Env	Type	SR (%)	T0 (s)
Cube	1	100.0	0.69 ± 0.12
Cube	2	95.0	0.69 ± 0.26
Cube	3	95.0	1.20 ± 0.20
Cube	4	84.0	1.43 ± 0.17
Cube	5	89.0	2.12 ± 0.32
Cube	6	88.0	1.83 ± 0.18
Cube	7	91.0	1.22 ± 0.16
Cube	8	91.0	0.76 ± 0.13
Cube	9	85.0	1.27 ± 0.14
AntMaze	1	92.0	7.95 ± 2.87
AntMaze	2	94.0	4.53 ± 0.60
AntMaze	3	81.0	9.95 ± 2.81
AntMaze	4	83.0	9.72 ± 1.01
AntMaze	5	62.0	32.42 ± 5.89
AntMaze	6	64.0	24.63 ± 5.57
AntMaze	7	82.0	15.80 ± 4.78
AntMaze	8	91.0	4.58 ± 0.57
AntMaze	9	60.0	8.98 ± 0.94
Analysis. In the Cube environment, our method achieves consistently strong performance, with execution success rates exceeding 90% on most task templates. The low planning time (typically below 2 seconds) indicates that the framework efficiently handles the dynamics of the 6-DoF manipulator. Performance degradation is observed only in more complex templates (Types 4 and 9), which involve long-horizon temporal dependencies and nested STL tasks. These cases require more extensive search and multiple calls to the trajectory generator, slightly increasing the planning time while reducing the success rate due to accumulated modeling uncertainty. Nevertheless, the overall high success rate demonstrates that our framework generalizes well to this manipulation-like domain. In contrast, the AntMaze environment presents a substantially greater challenge due to its high-dimensional locomotion dynamics and strong coupling between joint configurations and global motion. Here, the success rate remains around 80 to 90% for moderate templates but drops to around 60% for the most complex ones (Types 5, 6, and 9). 
A closer examination reveals that these harder tasks often involve stringent temporal nesting and avoid constraints, which are more sensitive to execution noise and controller-induced deviation. The higher planning times, especially for long-horizon templates (e.g., Types 5 and 6), reflect the reduced efficiency of the trajectory generation module in scenarios requiring longer trajectories due to more complex dynamics. Nevertheless, our method is still able to produce feasible trajectories in most cases within an acceptable time frame. Overall, these results confirm that the proposed framework scales effectively from smooth, fully actuated manipulation systems to complex locomotion environments. The modular combination of decomposition, allocation, and diffusion-based trajectory generation allows efficient reasoning over STL objectives, maintaining both computational efficiency and high task success rates across diverse dynamical regimes.
F.5 Details of Comparative Experiment with Optimization-based Method
To further evaluate the success rate of the progress allocation module in our algorithm, we compare it against a widely used optimization-based algorithm [4] in a custom-built simulation environment. The baseline algorithm is employed as a sound and complete solution to accurately assess the feasibility of randomly generated test cases. Experimental Settings. The experiment is conducted within a bounded 10 × 10 square 2D plane containing a circular obstacle. The underlying system dynamics are modeled using a double integrator. The agent starts from a randomly generated position and must complete the randomly generated STL tasks by reaching the target region within the specified time interval. In this experiment, the baseline algorithm is implemented using the open-source library stlpy [2] and has full knowledge of the environmental information and system dynamics, while our algorithm only has access to the trajectory dataset. 
To generate the trajectory dataset, we randomly sample start and end points in the environment and use the baseline algorithm to solve reach-avoid tasks. This process produces 200,000 collision-free trajectories that satisfy the system dynamics, which are then used to train the diffusion model and the Time Predictor. To further enhance the trajectory generator's ability to produce trajectories of varying lengths, we train two diffusion models with different horizons: one dedicated to generating shorter trajectories and the other specialized for longer trajectories. The first model is trained on trajectory segments of length 16 for shorter trajectories, while the second model uses segments of length 32 to improve generalization to longer trajectories. We generate 200 feasible STL tasks for each template, as described in Section F.2. The deterministic baseline algorithm is used to ensure the feasibility of these tasks. However, for templates 4 and 5, which involve multi-layer nesting of temporal operators, the baseline algorithm fails to find solutions within an acceptable time. In these cases, we still employ our algorithm's progress allocation module to verify feasibility. Evaluation Metrics. In addition to the Execution Success Rate (SR) and Total Planning Time (T0) metrics described in Section 5, we introduce an additional evaluation metric: Progress Allocation Success Rate (SR0): The proportion of cases where the progress allocation module successfully identifies a sequence of waypoints. This metric specifically measures the reliability of the progress allocation module in our algorithm.
Table F.4: Result of Experiment in Custom-built Environment. 
SR0: Progress Allocation Success Rate; SR: Execution Success Rate.
Type	SR0 (%)	SR (%)	Total Planning Time (s) ours	Total Planning Time (s) baseline
1	96.0	93.5	0.99 ± 0.08	3.82 ± 1.44
2	98.0	96.5	0.81 ± 0.03	6.30 ± 1.36
3	96.0	89.0	1.26 ± 0.05	31.60 ± 10.46
4	-	78.5	2.10 ± 0.26	Timeout
5	-	83.0	2.57 ± 0.10	Timeout
6	97.5	69.5	2.87 ± 0.12	24.23 ± 6.39
7	80.0	73.5	1.80 ± 0.08	7.71 ± 3.50
8	89.5	89.0	0.82 ± 0.03	106.58 ± 82.19
9	81.0	72.0	1.61 ± 0.06	151.19 ± 78.82
Analysis. The experimental results are summarized in Table F.4. Since the optimization-based baseline is used as an expert solver to certify the feasibility of randomly generated STL tasks, its execution success rate (SR) is naturally 100% and thus omitted from Table F.4. Our algorithm achieves consistently high success rates across test cases generated from diverse task templates. Notably, the Execution Success Rate (SR) exceeds 69% in all scenarios, demonstrating the algorithm's strong generalization capability for STL tasks. For all templates except 4 and 5, the Progress Allocation Success Rate (SR0) is at least 80%, indicating that the progress allocation module is generally reliable, albeit slightly conservative. Finally, in terms of Total Planning Time, our algorithm significantly outperforms the optimization-based baseline algorithm, highlighting the efficiency of the task decomposition and planning framework employed in our approach. Notably, for templates 4 and 5, which involve multilayered nested STL tasks, the baseline algorithm fails to find feasible solutions within a reasonable time. In contrast, our algorithm demonstrates both high success rates and high efficiency, even in these complex scenarios.
F.6 Analysis of the Predictor-Generator-Controller Framework
In our framework, the Time Predictor serves as a crucial component that provides heuristic guidance on system reachability during progress allocation. 
By estimating the time required for transitions between states, it allows the allocation process to consider not only the logical satisfaction of STL constraints but also the underlying dynamical feasibility between consecutive waypoints, thereby creating favorable conditions for subsequent trajectory generation. The Time Predictor operates in close collaboration with the diffusion-based trajectory generator. Specifically, the predictor estimates the expected travel time between two states, and the generator then produces a trajectory of the corresponding length. Although the true time-to-reach can vary considerably due to stochasticity and unmodeled dynamics, a generator trained on trajectory segments of diverse lengths demonstrates strong generalization. In practice, even a moderately accurate statistical estimate from the predictor typically suffices to produce a dynamically feasible trajectory. Moreover, the feedback control layer further compensates for residual prediction errors, ensuring consistent execution despite model approximation. To empirically validate the effectiveness of this collaboration, we conducted an additional experiment in both Maze2D and AntMaze environments. Experimental Settings. For each environment, we randomly sampled 1000 start-goal pairs (with goal region radii ranging from 3% to 6% of the arena size). For each case, the Time Predictor estimated the required trajectory length from the start position to the center of the goal region; the trajectory generator then produced a trajectory of that length, which was subsequently executed using a PD controller (in Maze2D) or an inverse dynamics model (in AntMaze) under the strict time-synchronous control protocol described in Section 3.5 (with k = 1 in Maze2D and k = 2 in AntMaze). All modules were employed exactly as implemented in our main framework, without additional tuning. Results and Analysis. 
The resulting execution success rates, summarized in Table F.5, demonstrate that even in the more challenging AntMaze setting, the current predictor-generator-controller pipeline achieves high overall reliability. These results underscore the effectiveness of modular integration among prediction, generation, and control within our framework. Given this modular design, each component can be further improved independently, for example by incorporating more accurate time prediction models [51, 52, 58] or more expressive trajectory generators, to handle increasingly complex dynamical systems. We leave such extensions as promising directions for future work.
Table F.5: Execution Success Rate (%) of the Predictor-Generator-Controller Pipeline in Different Environments. Each result is computed over 1000 randomly sampled start-goal pairs.
Environment	Umaze	Medium	Large	AntMaze
Execution Success Rate (%)	93.9	89.0	83.7	84.8
F.7 Implementation Details of Experiments
F.7.1 Calculation of the Robustness Value
In our experiments, we compute the robustness values of execution trajectories using the open-source library stlpy [2], which implements quantitative semantics for Signal Temporal Logic (STL). However, in environments such as Maze2D, AntMaze, and Cube, transitions between states often require relatively long trajectories. As a result, the corresponding STL task intervals become lengthy, leading to a substantial computational burden when evaluating robustness directly on the full-resolution trajectories, often exceeding the available computational resources. To mitigate this issue, we introduce a temporal sampling factor η, which defines the mapping between the time scale of the STL task and the resolution of the system trajectory used for evaluation. Specifically, one discrete time step in the STL task corresponds to η time steps in the system trajectory. 
When computing robustness, we sample one state every η steps to obtain a down-sampled trajectory, and then evaluate robustness on this sampled sequence. Importantly, the STL formula used for evaluation is also temporally rescaled, with all time intervals divided by η, so that the robustness value is computed with respect to a temporally consistent but shorter-horizon STL specification. This procedure effectively reduces computational overhead while preserving the temporal and logical structure of the original task. It is worth emphasizing that the parameter η is conceptually distinct from the parameter k introduced in the control protocol (Section 3.5). While k determines the number of low-level control updates executed per planning step during runtime (thus linking planning and control frequencies), η only affects the post-hoc evaluation of robustness by defining how densely the executed trajectory is sampled for STL computation.
F.7.2 Experimental Parameter Settings
Some of the parameters involved in the experiments are listed below, and their specific values are shown in Table F.6:
Table F.6: Parameters Used in the Experiments
Env	Nmax	H	N	γ	η	k
Maze2D-Umaze	1	128	64	0.8	8	1
Maze2D-Medium	1	256	256	0.9	8	1
Maze2D-Large	1	384	256	1.1	8	1
AntMaze	1	512	512	1.0	12	2
Cube	1	128	64	1.2	4	1
Custom-Built	1	16&32	64	1	1	1
Maximum Number of Attempts (Nmax): The maximum number of attempts for new state sampling in Algorithm 2.
Horizon (H): The planning horizon used during the training of the diffusion model.
Total Denoise Steps (N): The total number of steps in the denoising process.
Scaling Factor (γ): Applied to the predicted mean trajectory length, used to control the conservativeness of the Progress Allocation Module, as described in Section 3.3.
Sampling Factor (η): Used when computing the robustness value of trajectories, as described in Section F.7.1.
Control Frequency (k): The number of low-level control updates executed per planning step during runtime, as described in Section 3.5.
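
The role of the sampling factor η in Section F.7.1 can be illustrated with a minimal Python sketch. This is not the authors' evaluation code and it does not call stlpy. The function names (downsample, rescale_interval, eventually_robustness), the circular goal region, and the requirement that interval bounds be multiples of η are all assumptions made for illustration; the excerpt does not say how non-divisible bounds are rounded. The robustness used here is the standard quantitative semantics of a single "eventually within [a, b], be inside the region" formula, namely the maximum over the interval of (radius minus distance to the center).

```python
import math
from typing import List, Sequence, Tuple

State = Tuple[float, float]


def downsample(trajectory: Sequence[State], eta: int) -> List[State]:
    """Keep one state every eta steps (indices 0, eta, 2*eta, ...)."""
    if eta < 1:
        raise ValueError("eta must be a positive integer")
    return list(trajectory[::eta])


def rescale_interval(interval: Tuple[int, int], eta: int) -> Tuple[int, int]:
    """Divide both bounds of a time interval [a, b] by eta.

    Assumption (not from the source): a and b are multiples of eta,
    so no rounding rule has to be chosen.
    """
    a, b = interval
    if a % eta or b % eta:
        raise ValueError("interval bounds must be multiples of eta")
    return a // eta, b // eta


def eventually_robustness(
    trajectory: Sequence[State],
    interval: Tuple[int, int],
    center: State,
    radius: float,
) -> float:
    """Robustness of 'eventually in [a, b], be inside a circular region'.

    The predicate value at a state is radius - distance(state, center);
    'eventually' takes the maximum of that value over the interval.
    """
    a, b = interval
    window = trajectory[a:b + 1]
    if not window:
        raise ValueError("interval lies outside the trajectory")
    return max(radius - math.dist(state, center) for state in window)


if __name__ == "__main__":
    eta = 8  # hypothetical choice, matching the Maze2D rows of Table F.6
    # Hypothetical executed trajectory: 33 states moving along the x-axis.
    full = [(0.1 * i, 0.0) for i in range(33)]
    coarse = downsample(full, eta)
    coarse_interval = rescale_interval((8, 24), eta)
    rho = eventually_robustness(coarse, coarse_interval, (1.6, 0.0), 0.5)
    print(len(full), len(coarse), coarse_interval, rho)
```

In this toy case the down-sampled evaluation visits 5 states instead of 33 while the rescaled interval still covers the same portion of the motion. In general, down-sampling can miss the state that attains the maximum, so the coarse robustness value is an approximation of the full-resolution one.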