Author(s): Yunzhe Wang

Originally published on Towards AI.

Superintelligent AI systems will be extraordinarily powerful; humans could face catastrophic risks, including even extinction, if those systems are misaligned or misused. It is important for AI developers to have a plan for aligning superhuman models ahead of time, before they have the potential to cause irreparable harm. (Appendix G in the paper)

illustration by Midjourney

Paper: Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (arxiv.org)

Human supervision plays a critical role in overseeing today's large artificial intelligence models, such as GPT-4. However, as we progress toward creating superintelligence (AI that surpasses even the smartest humans on complex tasks), human supervision inevitably becomes comparatively weak. To ensure such models remain beneficial to humanity and under human control, two questions arise: Can weak human supervision effectively elicit and control the full capabilities of strong models? And is there additional work we can do to make weak supervision more effective?

Although superintelligence does not yet exist, solving this problem ahead of time is critical. The OpenAI superalignment team approaches it through an analogy: weak AI models (GPT-2) are to strong AI models (GPT-4) as humans are to superintelligence.

weak-to-strong superalignment analogy (figure 1 from the paper)

The main results and takeaways are summarized well in the official blog post. This post offers a complementary overview of key technical concepts and methodologies introduced in the paper, for learning purposes. Throughout the post, think of GPT-2 as the weak supervisor and GPT-4 as the strong student. The paper examines how much performance GPT-4 can achieve when it is trained on labels generated by GPT-2 for the tasks of interest.

Concepts and Methodologies

Weak-to-Strong Generalization: The phenomenon where a strong AI model, fine-tuned on labels generated by a weaker model, performs better than the weak supervisor itself.

Performance Gap Recovered (PGR): A metric that quantifies weak-to-strong generalization for a given pair of weak and strong models.

Bootstrapping with Intermediate Model Sizes: Sequentially training models of increasing size, each supervised by the previous one, which improves weak-to-strong generalization.

Auxiliary Confidence Loss: A loss term that encourages the strong model to first absorb the weak supervisor's intent and then shift toward confidence in its own predictions, so it avoids replicating the weak supervisor's errors, thereby improving weak-to-strong generalization.

Generative Supervision: Self-supervised finetuning on task-relevant data to make the representation of the task more salient, thereby improving weak-to-strong generalization.

The Challenge of Imitation and the Inverse Scaling Phenomenon: Strong students not only learn from the weak supervisor but can also overfit to its errors. However, an inverse scaling phenomenon is observed: larger student models tend to agree less with their supervisor's errors than smaller student models do.

Enhancing Concept Saliency through Linear Probing: Finetuning a language model on weak labels and then training a linear probe on ground-truth labels significantly improves concept saliency, indicating that weak-label finetuning makes the desired concept more linearly accessible in the model's representations (a minimal probe sketch follows this list).
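Regarding the last item: a linear probe here is simply a single linear layer trained with ground-truth labels on frozen hidden activations, to check how linearly accessible the concept is. Below is a minimal sketch; all tensors, shapes, and hyperparameters are placeholders chosen purely for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

# Placeholder activations and labels, standing in for hidden states extracted
# from a (weak-label-finetuned) model and the corresponding ground-truth labels.
hidden_states = torch.randn(1000, 768)   # [num_examples, hidden_dim]
labels = torch.randint(0, 2, (1000,))    # binary ground-truth labels

# The probe itself: one linear layer; the backbone model stays frozen.
probe = nn.Linear(768, 2)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    optimizer.zero_grad()
    loss_fn(probe(hidden_states), labels).backward()
    optimizer.step()
```

The probe's accuracy then serves as a measure of how salient (linearly recoverable) the concept is after finetuning.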
The Superalignment Challenge

Performance Gap Recovered (PGR)

PGR is a metric that measures how effective a weaker model's supervision is at improving the performance of a stronger model. It is a function of three performance measurements:

Weak Performance: the performance of the weak model trained on ground-truth labels. The predictions of this weak model are used as weak labels.

Strong Ceiling Performance: the performance of the strong model trained on ground-truth labels.

Weak-to-Strong Performance: the performance of the strong model trained on the weak labels generated by the weak model.

Performance Gap Recovered equation (illustration from the paper):

PGR = (weak-to-strong performance − weak performance) / (strong ceiling performance − weak performance)

Note: PGR = 1 indicates perfect weak-to-strong generalization, while PGR = 0 indicates negligible weak-to-strong generalization. (Minimal code sketches of the methods discussed in this post appear at the end.)

Bootstrapping with Intermediate Model Sizes

Bootstrapping is a long-standing idea in alignment: instead of directly aligning superhuman models, we could first align a slightly superhuman model, then use it to align an even smarter model, and so on. Specifically, consider a sequence of model sizes M1 < M2 < … < Mn. We use the weak labels from M1 to fine-tune M2, then use M2 to generate new weak labels to fine-tune the next model in the sequence, M3, and so on. The study finds that bootstrapping improves weak-to-strong generalization.

Auxiliary Confidence Loss

Directly training a strong student model to mimic a weaker supervisor risks the student replicating the supervisor's mistakes. To prevent this, a regularization term is added to the loss. It encourages the strong model to stay confident in its own answers when the weak supervisor errs, so the student learns the supervisor's intent without imitating its mistakes. In particular,

L_conf(f) = (1 − α) ⋅ CE(f(x), f_w(x)) + α ⋅ CE(f(x), f̂_t(x))

where

CE(⋅, ⋅) is the cross-entropy loss,
f(x) ∈ [0, 1] is the predictive distribution of the strong model,
f_w(x) ∈ [0, 1] is the predictive distribution of the weak model (the weak labels), and
f̂_t(x) = I[f(x) > t] ∈ {0, 1} hardens the predictive distribution of the strong model using threshold t.

Say the strong model outputs probabilities like [0.9, 0.2, 0.3, …]. With threshold t = 0.7, the hardened predictions become [1, 0, 0, …]. The second term therefore trains the strong model to become more confident in its own answers.

Let's see what happens as α gradually increases. When α is low (near 0), the loss mainly minimizes the discrepancy between the strong model's predictions and the weak labels, letting the strong model first learn the broader, possibly noisy signal (the intent) provided by the weak supervisor. As α increases, the loss shifts toward reinforcing the strong model's own hardened predictions, encouraging it to rely increasingly on its own learned representations and decisions.

Generative Supervision

A broader scope of self-supervised learning objectives. (Source)

Generative supervision (or, more broadly, self-supervised learning) involves fine-tuning a pre-trained model on task-relevant data without using traditional labeled data. The model learns by modeling (generating) data that reflects the task, which enhances its understanding of the task and its downstream performance. For example, in a sentiment analysis task, by fine-tuning a language model on a large volume […]
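As a rough illustration of generative supervision, the sketch below simply continues ordinary next-token-prediction training on unlabeled, task-relevant text. This is only a minimal sketch under that assumption; the model name, `task_corpus`, and hyperparameters are placeholders, not the paper's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

task_corpus = ["an unlabeled, task-relevant document ..."]  # hypothetical data

model.train()
for text in task_corpus:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Causal language modeling: the labels are the input ids themselves,
    # so no human-provided labels are needed.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```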
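The other methods above can be sketched in the same spirit. First, the PGR metric follows directly from its definition; the accuracy numbers below are made up for illustration.

```python
def performance_gap_recovered(weak_perf, strong_ceiling_perf, weak_to_strong_perf):
    """Fraction of the gap between weak and strong-ceiling performance
    that is recovered by training the strong model on weak labels."""
    return (weak_to_strong_perf - weak_perf) / (strong_ceiling_perf - weak_perf)

# Hypothetical accuracies for a GPT-2 supervisor and a GPT-4 student:
pgr = performance_gap_recovered(weak_perf=0.60,
                                strong_ceiling_perf=0.90,
                                weak_to_strong_perf=0.75)
print(pgr)  # ≈ 0.5: about half of the performance gap is recovered
```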
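Next, a schematic sketch of the bootstrapping loop. The `finetune` and `predict` helpers are hypothetical stand-ins for a full training and inference pipeline, and data-handling details (such as which split each stage labels) are omitted.

```python
def bootstrap(models, inputs, ground_truth_labels, finetune, predict):
    """Sketch of bootstrapping with intermediate model sizes.

    `models` is ordered weakest (M1) to strongest (Mn); `finetune` and
    `predict` are hypothetical helpers, not the paper's actual code.
    """
    # Train the weakest model on ground truth; its predictions become weak labels.
    supervisor = finetune(models[0], inputs, ground_truth_labels)
    weak_labels = predict(supervisor, inputs)

    # Each stronger model is finetuned on labels from the previous model,
    # then generates new (hopefully better) labels for the next one.
    student = supervisor
    for next_model in models[1:]:
        student = finetune(next_model, inputs, weak_labels)
        weak_labels = predict(student, inputs)
    return student
```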
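Finally, a possible PyTorch rendering of the auxiliary confidence loss described above, assuming a classification setting where the student outputs logits. The function and argument names are mine, not the paper's; in the paper's binary setting the hardened target is effectively a one-hot vector.

```python
import torch
import torch.nn.functional as F

def aux_confidence_loss(strong_logits, weak_probs, alpha, t=0.5):
    """(1 - alpha) * CE(f(x), f_w(x)) + alpha * CE(f(x), hardened f(x)).

    strong_logits: [batch, num_classes] raw outputs of the strong student, f.
    weak_probs:    [batch, num_classes] soft labels from the weak supervisor, f_w.
    alpha:         mixing weight, ramped up from 0 over the course of training.
    t:             threshold used to harden the student's own predictions.
    """
    log_probs = F.log_softmax(strong_logits, dim=-1)

    # Cross-entropy between the student's predictions and the soft weak labels.
    ce_weak = -(weak_probs * log_probs).sum(dim=-1).mean()

    # Harden the student's own predictions, I[f(x) > t]; detach so the target
    # is treated as a fixed label rather than backpropagated through.
    hard_self = (log_probs.exp() > t).float().detach()
    ce_self = -(hard_self * log_probs).sum(dim=-1).mean()

    return (1 - alpha) * ce_weak + alpha * ce_self
```

In training, α would start near 0 and gradually increase, matching the schedule described above: learn the supervisor's intent first, then trust the student's own predictions more.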