Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Uncertainty-Sensitive Privileged Learning

Authors: Fan-Ming Luo, Lei Yuan, Yang Yu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across nine tasks demonstrate that USPL significantly reduces the behavioral discrepancies, achieving superior deployment performance compared to baselines. Additional visualization results show that the DP accurately quantifies its uncertainty, and the PP effectively adapts to uncertainty variations.
Researcher Affiliation | Academia | Fan-Ming Luo, Lei Yuan, Yang Yu. National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China; Polixir.ai.
Pseudocode | Yes | The entire algorithmic process is summarized in Alg. 1 in App. A.1. In each iteration, we first collect the common, target, and privileged observations, denoted as $o_c$, $o_t$, and $o_p$, respectively, from the environment. Next, the observation encoder is used to compute the current uncertainty $\sigma_p$. The privileged observation $o_p$ is then perturbed using Eq. 3 or Eq. 4 to obtain $\hat{o}_p$. The privileged policy is invoked to generate the action output $a$ based on $(\hat{o}_p, o_c, \sigma_p)$. After executing the action, the next set of observations and the reward are received from the environment. The reward is scaled using Eq. 2 to obtain $\hat{r}$. Finally, the state transition and $\hat{r}$ are inserted into both the on-policy and off-policy buffers. After sample collection is complete, we train the policy using PPO [37] with the $(\hat{o}_p, o_c, \hat{r}, a)$ samples from the on-policy buffer. Additionally, the observation encoder is optimized using the $o_p$, $o_c$, $o_t$ samples from both the on-policy and off-policy buffers in conjunction with Eq. 1. More implementation details can be found in App. C.3. Algorithm 1: Uncertainty-Sensitive Privileged Learning (USPL).
Open Source Code | Yes | Code is available at https://github.com/FanmingL/USPL.
Open Datasets | No | We consider three common types of robots in Isaacgym [38]: a quadruped robot, a quadrotor UAV, and a robotic arm. As shown in Fig. 3, we constructed 7 environments based on these three types of robots, with two of them supporting both image-based and non-image-based inputs, resulting in a total of 9 tasks.
Dataset Splits | No | The paper describes the creation of custom environments within the Isaacgym simulator for reinforcement learning tasks, rather than utilizing pre-existing datasets with defined splits for training, validation, or testing. The experimental setup involves continuous interaction with these environments. Therefore, explicit dataset splits are not mentioned.
Hardware Specification | Yes | Our experiments were conducted on GPUs with 82.6 TFLOPS of compute and 24 GB of memory. For non-image-based tasks, training was performed using a single GPU, and the training times are summarized in Tab. 6. For image-based tasks, we utilized five GPUs in parallel to accelerate training.
Software Dependencies | No | The paper mentions the use of specific algorithms and frameworks like PPO [37], RESeL [25], and Isaacgym [38]. It also states that the training framework is built upon the RESeL codebase and the CNN encoder is adapted from an implementation in [15]. However, it does not provide specific version numbers for these software components or for general programming languages/libraries (e.g., Python, PyTorch versions).
Experiment Setup | Yes | C.3 Implementation Details of USPL. C.3.1 More Implementation Tricks. Learning Schedule: In our implementation, we employ a cosine schedule to anneal both the standard deviation of the PPO policy and its learning rate. In addition, a separate cosine schedule is applied to gradually increase the degree of privileged blurring and reward transformation. Training begins with a 100-epoch warm-up phase... C.3.2 Hyperparameters: Tabs. 4 and 5 list the hyperparameters used for the quadruped robot and for the other two environments, respectively. C.3.3 Network Architectures: In this part, we describe the architectures of the two key modules involved in USPL. Privileged Policy: The architecture of the Privileged Policy is consistent with the policy design in RESeL [25], as our training framework is built upon the RESeL codebase. A context encoder, implemented as a two-layer MLP with hidden dimensions [256, 256], first encodes the environment information based on the last observation, last action, and current observation... Observation Encoder (non-image): For non-image-based tasks, the observation encoder first processes the common observation through a three-layer MLP with hidden sizes [512, 256, 128]... Observation Encoder (image): In image-based tasks, the image input is first processed by a CNN encoder. The encoded image features are then concatenated with the common observation features...
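
The Experiment Setup entry mentions two cosine schedules: one that anneals the PPO policy's standard deviation and learning rate, and another that gradually increases the degree of privileged blurring and reward transformation. The sketch below illustrates one common way to write such a schedule; it is not taken from the USPL code. The function name `cosine_interp`, the epoch budget, and all start and end values are hypothetical placeholders, and the 100-epoch warm-up phase is not modeled because the quoted text does not say how it behaves.

```python
import math


def cosine_interp(start: float, end: float, step: int, total_steps: int) -> float:
    """Move from `start` at step 0 to `end` at `total_steps` along a half cosine.

    Works for both annealing (start > end) and ramping up (start < end).
    """
    # Clamp progress to [0, 1] so steps outside the budget hold the boundary value.
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Weight on `start`: 1 at progress 0, falling smoothly to 0 at progress 1.
    weight = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end + (start - end) * weight


# Hypothetical training budget; the real value is not given in the text above.
TOTAL_EPOCHS = 1000


def schedules(epoch: int) -> dict:
    """Return illustrative schedule values for one epoch (all numbers are assumptions)."""
    return {
        # Annealed quantities: decrease over training.
        "learning_rate": cosine_interp(3e-4, 1e-5, epoch, TOTAL_EPOCHS),
        "policy_std": cosine_interp(1.0, 0.1, epoch, TOTAL_EPOCHS),
        # Ramped quantity: increases over training (a stand-in for the
        # degree of privileged blurring and reward transformation).
        "blur_degree": cosine_interp(0.0, 1.0, epoch, TOTAL_EPOCHS),
    }


if __name__ == "__main__":
    for epoch in (0, 250, 500, 750, 1000):
        print(epoch, schedules(epoch))
```

Using one interpolation helper for both directions keeps the two schedules consistent: only the start and end values differ between the annealed and the ramped quantities.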