Author(s): Eduardo Muñoz

Originally published on Towards AI.

A brief description of this adaptation of Reinforced Self-Training (ReST) to an agentic configuration.

Picture by Aaron Burden from Unsplash

This article describes a very interesting and inspiring proposal: improving a ReAct agent that interleaves reasoning and actions with external knowledge. Besides offering promising results, it presents workflows that can be reused in many approaches, which is why I find it well worth reading and understanding.

Introduction

In December 2023, Google researchers published the paper titled “ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent” [1]. The paper describes a search-and-answer procedure for a multi-step reasoning LLM agent that uses web search to generate long-form answers to knowledge-seeking questions. It focuses on improving the agent’s performance and robustness through self-critique, AI feedback, and synthetic data generation, and it applies a ReST-like (Reinforced Self-Training) [2] algorithm to iteratively fine-tune the agent on its own reasoning traces. The contributions of the paper include building a ReAct agent with self-critique, defining a proxy evaluation metric, demonstrating the effectiveness of ReST-style iterative fine-tuning, and using synthetic data to distill the agent into smaller models.

Search Agent

The paper studies a specialized agent called Search Agent, a variant of the ReAct agent introduced by Yao et al. in 2022 [4]. This agent incorporates Reflexion, a concept presented by Shinn et al. in 2023 [5]. Reflexion is an approach to reinforcement learning for language agents that relies on linguistic feedback and reflective text kept in an episodic memory buffer. The framework’s flexibility and notable performance improvements across various tasks underscore its potential as an effective and versatile tool for training large language models in goal-driven scenarios.

The primary function of the Search Agent is to answer knowledge-seeking, open-ended questions, leveraging web search as a tool to generate comprehensive and explicitly attributable answers. Its workflow is outlined as follows:

• Given a question, the agent runs a search loop with a search tool, summarizes the retrieved pieces of text, and decides whether additional information is required.
• Using the information gathered during the search loop, the agent formulates a first attempt or draft of the answer.
• The agent performs two rounds of self-revision before producing the final answer: it verifies the relevance of the answer and checks that the answer is grounded in the snippets collected during the search.

Through this systematic combination of iterative search loops and self-revisions, the Search Agent refines and validates its responses, which helps it produce accurate and well-founded answers to diverse open-ended questions.
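To make that workflow more concrete, here is a minimal sketch of such a loop in Python. It is not the paper’s code: the llm and web_search callables, the prompts, and the stopping rule are all placeholder assumptions of mine; the real Search Agent is built on PaLM 2 models with a much more elaborate prompting setup.

```python
# Minimal sketch of a Search-Agent-style loop: search, draft, then two self-revisions.
# `llm` and `web_search` are hypothetical callables (prompt -> text, query -> results);
# they stand in for the paper's actual models and tools.

def run_search_agent(question: str, llm, web_search, max_searches: int = 5) -> str:
    """Answer an open-ended question via iterative search, drafting, and self-revision."""
    snippets = []

    # 1) Search loop: propose a query, summarize results, decide if more info is needed.
    for _ in range(max_searches):
        query = llm(f"Question: {question}\nSnippets so far: {snippets}\n"
                    "Propose the next search query, or reply DONE if there is enough information.")
        if query.strip() == "DONE":
            break
        results = web_search(query)                     # external search tool
        summary = llm(f"Summarize these results for the question '{question}':\n{results}")
        snippets.append(summary)

    # 2) First draft of the answer, based on the collected snippets.
    draft = llm(f"Question: {question}\nSnippets: {snippets}\nWrite a detailed draft answer.")

    # 3) Two self-revision rounds: a relevance check, then a grounding check.
    draft = llm(f"Question: {question}\nDraft: {draft}\n"
                "Revise the draft so that it directly and fully answers the question.")
    final = llm(f"Snippets: {snippets}\nDraft: {draft}\n"
                "Revise the draft so every claim is supported by the snippets, with attributions.")
    return final
```

The key point is the structure rather than the prompts: an information-gathering loop, a drafting step, and two explicit self-checks before the final answer is produced.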
Implementation and methodology

The iterative self-improvement process is described in the paper:

“• Start with a model capable of performing Search Agent task at a certain level, for example, with prompted PaLM 2-L model.
• Collect reasoning trajectories from this model based on our set of 2000 initial questions (essentially the “grow” stage of ReST, with the difference that we keep the set of initial questions fixed).
• Convert the trajectories into the fine-tuning mixture. Apply re-ranking with RM during the conversion (this is roughly equivalent to the “improve” stage of ReST, though we only do one iteration of “improve”).
• Fine-tune the new model (of the same size) on this mixture and verify that it’s performing better than the original model (we will discuss how to do it in the following section).
• Repeat the process, starting with this new, better model.” [1]

For the re-ranking reward model (RM), an instruction-tuned PaLM 2-L is used with a specifically designed prompt that receives the model input, multiple sampled outputs, and guidance on how to rank them. The highest-ranked samples are used for fine-tuning instead of the default sample chosen by its perplexity value. This approach differs from ReST and aligns more closely with RAFT (Reward rAnked FineTuning) [3], emphasizing the importance of reward model rankings in the selection process, particularly for off-policy trajectory rollouts. A schematic sketch of this grow/re-rank/fine-tune loop is shown at the end of this section.

Picture by Brett Jordan from Unsplash

Ablation studies explore the impact of human filtering and of using multiple trajectories per question in the fine-tuning process. Surprisingly, fine-tuning on human-filtered data results in a small performance drop, hypothesized to be due to the reduced data size and the preservation of “bad” steps in other examples. Using two trajectories per question for fine-tuning yields a performance gain, but further increases do not significantly improve results. The self-critique aspect of the multi-step setup is also examined, showing a small but measurable positive boost in overall agent performance, particularly during the “Answer Generation” step.
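As a rough illustration of this self-improvement recipe, the sketch below implements one grow-and-improve iteration in Python. Every name here is a placeholder assumption (the agent, reward_model, and fine_tune callables, the number of samples, the data format); the paper itself uses PaLM 2 models, a prompted PaLM 2-L ranker, and its own fine-tuning pipeline.

```python
# Hypothetical sketch of one ReST-style iteration: grow (collect trajectories),
# re-rank with a reward model, and fine-tune on the selected data.
# `agent`, `reward_model`, and `fine_tune` are placeholder callables, not the paper's API.

def rest_iteration(agent, reward_model, fine_tune, questions, samples_per_question=4):
    fine_tuning_mixture = []

    for question in questions:
        # Grow: sample several reasoning trajectories from the current agent
        # on the fixed set of initial questions.
        trajectories = [agent(question) for _ in range(samples_per_question)]

        # Improve: have the reward model rank the sampled trajectories
        # (rather than keeping the lowest-perplexity sample by default).
        ranked = reward_model(question, trajectories)  # best-first list of trajectories

        # Keep the highest-ranked trajectories for the fine-tuning mixture
        # (the ablations found ~2 trajectories per question to be a sweet spot).
        for trajectory in ranked[:2]:
            fine_tuning_mixture.append({"question": question, "trajectory": trajectory})

    # Fine-tune a fresh model of the same size on the mixture; the next
    # iteration then starts from this new, better model.
    return fine_tune(fine_tuning_mixture)
```

Used this way, each call to rest_iteration plays the role of one grow-plus-improve round, and the returned model becomes the agent for the next round.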
Evaluation

The research evaluates the Search Agent on the Bamboogle dataset (Press et al., 2023), a semi-adversarial set of 2-hop questions deliberately designed to be unanswerable through a direct Google search, although the answers are available in Wikipedia. Improvement on Bamboogle therefore indicates that the agent has become better at using web search as a tool.

To address the challenges associated with human evaluation, the paper introduces an LLM-based auto-eval approach. This auto-eval is shown to be highly correlated with human evaluation scores, with a Pearson correlation of 0.98 and a Spearman correlation of 0.83. The authors use the Bamboogle auto-eval to estimate final model performance and to answer various questions related to model optimization, such as the choice of sampling temperature, checkpoint selection for different model sizes, the impact of multiple trajectories on fine-tuning, and the effectiveness of the self-checks.

The research also addresses the trade-off between data quantity and quality, highlighting that the quality of the data matters more than its quantity and emphasizing the importance of better data in reducing the variance of the evaluation trajectories.

To mitigate the risk of overfitting and address shortcomings found during human evaluations, a new dataset called BamTwoogle is introduced. This dataset, serving as a test set, is a slightly more challenging sequel to Bamboogle, requiring 2+ steps to answer each question. BamTwoogle is handcrafted and includes 100 information-seeking questions, ensuring they require multiple searches or […]