V-MIN: Efficient Reinforcement Learning through Demonstrations and Relaxed Reward Demands

Authors: David Martínez, Guillem Alenyà, Carme Torras

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The performance of V-MIN has been validated through experimentation, including domains from the international planning competition. From Sec. 5, Experimental Results: "This section analyzes V-MIN performance: (1) comparing V-MIN with REX-D (Martínez et al. 2014) and REX (Lang, Toussaint, and Kersting 2012); (2) using different values of Vmin and an increasing Vmin; (3) adaptation to changes in the tasks."
Researcher Affiliation | Academia | David Martínez, Guillem Alenyà, and Carme Torras. Institut de Robòtica i Informàtica Industrial (CSIC-UPC), C/ Llorens i Artigas 4-6, 08028 Barcelona, Spain. {dmartinez,galenya,torras}@iri.upc.edu
Pseudocode | Yes | Algorithm 1: V-MIN (a minimal Python sketch of this loop is given after the table)
    Input: reward function R, confidence threshold ζ
    Updates: set of experiences E
    Observe state s_0
    loop
        Update transition model T according to E
        Create Tvmin(T, R, ζ)
        Plan an action a_t using Tvmin
        if a_t == TeacherRequest() then
            a_t = Request demonstration
        else
            Execute a_t
        end if
        Observe new state s_{t+1}
        Add {(s_t, a_t, s_{t+1})} to E
    end loop
Open Source Code | No | The paper does not provide concrete access to source code (a specific repository link, an explicit code release statement, or code in supplementary materials) for the methodology described in this paper.
Open Datasets | Yes | "In this experiment we compared V-MIN with REX-D (Martínez et al. 2014) and the R-MAX variant of REX in two problems of the International Probabilistic Planning Competition (2008)."
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. It mentions "Means over 250 runs" or "Means over 100 runs" but no explicit splits.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers, such as Python 3.8 or CPLEX 12.4) needed to replicate the experiment. It mentions using Pasula, Zettlemoyer, and Kaelbling's (2007) learner and the Gourmand planner (Kolobov, Mausam, and Weld 2012), but gives no version numbers for their implementations.
Experiment Setup | Yes | Episodes were limited to 100 actions before being considered a failure. The exploration threshold was set to ζ = 3, which yielded good results. Based on the performance analysis in Sec. 4, using a discount factor γ = 0.9, an accuracy parameter ε = 0.1, and an exploration threshold ζ = |S|/(ε²(1−γ)⁴) = 3, the upper bound for the R-MAX sample complexity is proportional to |S||A|ζ/(ε(1−γ)²) = 50·43·3/(0.1·(1−0.9)²) = 2.2·10⁶. (A worked evaluation of this bound is given after the table.)
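
To make the quoted pseudocode concrete, here is a minimal Python sketch of the same interaction loop. The env, learner, planner, and teacher objects are hypothetical interfaces invented for illustration; this is not the authors' implementation (the paper pairs Pasula, Zettlemoyer, and Kaelbling's (2007) rule learner with the Gourmand planner, neither of which is reproduced here). The max_steps=100 default mirrors the 100-action episode limit quoted in the Experiment Setup row.

    # Hypothetical sketch of the V-MIN loop (Algorithm 1); every object
    # interface below is a placeholder, not the authors' code.

    TEACHER_REQUEST = "teacher_request"  # extra action V-MIN adds to the MDP

    def v_min_loop(env, learner, planner, teacher, reward_fn, zeta,
                   max_steps=100):
        """Run the V-MIN interaction loop for at most max_steps actions."""
        experiences = []       # E: observed (s, a, s') transitions
        state = env.observe()  # observe s_0
        for _ in range(max_steps):
            model = learner.fit(experiences)           # update T according to E
            model_vmin = model.relax(reward_fn, zeta)  # create Tvmin(T, R, ζ)
            action = planner.plan(model_vmin, state)   # plan a_t using Tvmin
            if action == TEACHER_REQUEST:
                action = teacher.demonstrate(state)    # teacher executes a demo
            else:
                env.execute(action)                    # execute a_t ourselves
            next_state = env.observe()                 # observe s_{t+1}
            experiences.append((state, action, next_state))  # add to E
            state = next_state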
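
For concreteness, the quoted bound |S||A|ζ/(ε(1−γ)²) can also be evaluated mechanically. Below is a minimal Python sketch; γ = 0.9, ε = 0.1, and ζ = 3 are the values stated in the setup, while |S| = 50 and |A| = 43 are assumptions read off the (possibly extraction-garbled) arithmetic in the quote, so the result should only be trusted to its order of magnitude.

    def rmax_sample_bound(num_states, num_actions, zeta, epsilon, gamma):
        """Upper bound proportional to |S||A|ζ / (ε(1 - γ)^2), as quoted above."""
        return num_states * num_actions * zeta / (epsilon * (1 - gamma) ** 2)

    # γ, ε, and ζ are quoted in the setup; |S| = 50 and |A| = 43 are assumed.
    bound = rmax_sample_bound(num_states=50, num_actions=43, zeta=3,
                              epsilon=0.1, gamma=0.9)
    print(f"{bound:.2e}")  # millions of samples, on the order of 10^6

With these inputs the bound comes out in the millions of samples, consistent with the 10⁶ magnitude quoted above.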