Curriculum Learning for Cumulative Return Maximization

Authors: Francesco Foglino, Christiano Coletto Christakou, Ricardo Luna Gutierrez, Matteo Leonetti

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally compare our task sequencing algorithm to several popular metaheuristic algorithms for combinatorial optimization, and show that it achieves significantly better performance on the problem of cumulative return maximization. Furthermore, we validate our algorithm on a critical task, optimizing a home controller for a micro energy grid.
Researcher Affiliation | Academia | Francesco Foglino, Christiano Coletto Christakou, Ricardo Luna Gutierrez and Matteo Leonetti, University of Leeds, United Kingdom, {scff,mm18ccc,scrlg,m.leonetti}@leeds.ac.uk
Pseudocode | Yes | Algorithm 1 HTS-CR; Algorithm 2 Evaluate Pairs
Open Source Code | Yes | The complete source code of all our experiments can be found at https://github.com/francescofoglino/Curriculum-Learning
Open Datasets | No | The paper mentions using domains implemented within the Burlap software library and historical data for MGEnv, but does not provide specific access information (link, DOI, formal citation) for these datasets.
Dataset Splits | No | For MGEnv, the paper states, 'We divided them into a training and test set of 5 tasks each.' For other domains, it mentions 'intermediate tasks' and 'final tasks'. However, it does not provide specific percentages, sample counts, or explicit standard citations for dataset splits needed for reproduction.
Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU, GPU models, or cloud computing instance types) used for running the experiments.
Software Dependencies | No | The paper mentions using Sarsa(λ), Tile Coding, ϵ-greedy, Actor-Critic architecture, Proximal Policy Optimization, and Progressive Neural Networks. However, it does not provide specific version numbers for any software libraries or frameworks (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | In Tabu Search (TS) [Glover, 1989], we create the neighborhood of the current curriculum by: generating a list of curricula R composed of all the curricula obtained by removing from, or adding to, the last task in the current best curriculum; and generating all curricula resulting from any pairwise swap of any two tasks of a curriculum in R. We empty the tabu list of size T, when full, following a FIFO strategy. In our experiments T = 30. For Genetic Algorithm (GA) [Goldberg, 1989], we set the initial population to Q curricula randomly sampled from C_L. At each iteration we select two parent curricula with roulette-wheel selection. Given two parent curricula, we generate a new population of Q candidate curricula by applying a standard single-point crossover at randomized lengths along each parent gene (sequence of intermediate tasks). Each crossover step produces two children curricula and the process is repeated until Q children curricula are created. We also included a form of elitism in order to improve the performance of the algorithm by adding the parents to the population they generated. Genetic algorithms also include the definition of a mutation operator; in our implementation this acts on each candidate curriculum in the newly generated population with probability pm. The mutation can be of two equally probable types: task-wise mutation, which, given a candidate curriculum of length l, changes each of its intermediate tasks with probability 1/l; and length-wise mutation, where equal probability is given to either dropping or adding a new source at a randomly selected position of the candidate curriculum. In our experiments Q = 50 and pm = 0.5. For Ant Colony Optimization (ACO) [Dorigo et al., 1991], each agent in the colony moves towards the goal by adding a new intermediate task to the current candidate curriculum c, which represents the trail walked by the ant. Given a source task mi, its probability of being selected is P(mi) = [(τ_mi + K)^α + I_mi^β] / Σ_{mj ∈ E} [(τ_mj + K)^α + I_mj^β], with E = {mj ∈ T | mj ∉ c}. The variable τ_mi represents the quantity of pheromone on task mi while following the current candidate curriculum c. The visibility I_mi is calculated as the performance improvement obtained by adding task mi to the current candidate curriculum when positive, and zero otherwise. Parameters α and β control the influence of the pheromone versus the improvement, while K is a threshold to control from what pheromone value the search starts to take it into account. The pheromone evaporation rate is specified with the parameter ρ, while the maximum level of pheromone to be accumulated over a candidate solution is set to fmax. In our experiments α = 1, β = 1.2, K = 5, fmax = 50, ρ = 0.2 and the number of ants is 20.
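The GA mutation operator quoted in the Experiment Setup row is easy to make concrete. The sketch below assumes a curriculum is represented as a plain Python list of intermediate-task identifiers; the function name mutate_curriculum and its arguments are illustrative and not taken from the authors' repository.

```python
import random

def mutate_curriculum(curriculum, source_tasks, p_mutation=0.5):
    """Apply one of the two equally probable mutation types with probability p_mutation.

    `curriculum` is a list of intermediate-task identifiers; `source_tasks`
    is the pool of available source tasks. p_mutation defaults to the pm = 0.5
    reported in the row above.
    """
    if random.random() >= p_mutation:
        return list(curriculum)  # candidate left unchanged

    mutated = list(curriculum)
    if random.random() < 0.5:
        # Task-wise mutation: each intermediate task is replaced with
        # probability 1/l, where l is the curriculum length.
        l = max(len(mutated), 1)
        for i in range(len(mutated)):
            if random.random() < 1.0 / l:
                mutated[i] = random.choice(source_tasks)
    else:
        # Length-wise mutation: with equal probability, drop a task at a
        # random position or insert a new source task at a random position.
        if mutated and random.random() < 0.5:
            mutated.pop(random.randrange(len(mutated)))
        else:
            mutated.insert(random.randrange(len(mutated) + 1),
                           random.choice(source_tasks))
    return mutated
```

In this reading, elitism and crossover operate on whole curricula first, and the operator above is then applied independently to each child in the new population.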
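Likewise, the ACO selection rule P(mi) can be illustrated with a short sketch. It assumes pheromone levels and visibilities are stored in dictionaries keyed by task identifier; the helper name select_next_task is hypothetical, while alpha, beta and K default to the values quoted above.

```python
import random

def select_next_task(candidate_tasks, pheromone, visibility,
                     alpha=1.0, beta=1.2, K=5.0):
    """Sample the next intermediate task from E (tasks not yet in the curriculum)
    with probability proportional to (tau_m + K)^alpha + I_m^beta."""
    weights = [
        (pheromone[m] + K) ** alpha + max(visibility[m], 0.0) ** beta
        for m in candidate_tasks
    ]
    # random.choices performs the roulette-wheel draw, so the normalizing
    # sum over E in the denominator never needs to be computed explicitly.
    return random.choices(candidate_tasks, weights=weights, k=1)[0]

# Hypothetical usage with made-up task names and values:
tasks = ["task_a", "task_b", "task_c"]
tau = {t: 0.0 for t in tasks}                      # pheromone levels
vis = {"task_a": 1.3, "task_b": 0.0, "task_c": 0.4}  # improvement-based visibility
next_task = select_next_task(tasks, tau, vis)
```

Pheromone evaporation (ρ) and the cap fmax would be applied when updating the tau values between colony iterations; they are omitted here to keep the sketch focused on the selection probability.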