Large Language Models (LLMs) are increasingly used in critical domains and complex applications, so it is essential to know whether their outputs are trustworthy. Requiring models to produce justifications that are clear and verifiable can enhance trust in those outputs. To this end, OpenAI uses Prover-Verifier Games to improve the legibility of language model outputs.
When the problem-solving process of a strong model is optimized solely for getting the correct answer, the resulting solutions become harder to understand; human evaluators asked to assess such heavily optimized solutions made nearly twice as many errors. This finding underlines the need to value not only correctness but also ease of verification in AI-generated text.
The goal is to train advanced language models to produce text that is easy to verify and that humans can evaluate more effectively; this property is called legibility. This is where prover-verifier games come into play. These games involve two players: a “prover” that generates a solution and a “verifier” that checks whether it is correct. This setup pushes outputs toward being verifiable by both humans and AI systems.
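As a rough illustration of this two-player setup, here is a minimal Python sketch. The `prover` and `verifier` objects and their `generate` and `score` methods are hypothetical interfaces standing in for a strong and a weak language model; they are not OpenAI's actual models or API.

```python
def play_round(prover, verifier, problem: str) -> tuple[str, float]:
    """One round of the game: the prover writes a solution, the verifier judges it."""
    solution = prover.generate(problem)                  # free-form reasoning plus a final answer
    convincingness = verifier.score(problem, solution)   # verifier's estimate that the solution is correct
    return solution, convincingness
```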
Prover-Verifier Games Improve Legibility
To improve legibility, the chain-of-thought training procedure is inspired by the Prover-Verifier Game, a game-theoretic framework that encourages learning agents to solve decision problems in a checkable way. The method optimizes for correctness while preserving the ability of human evaluators to assess the solutions.
How it works
To incentivize the strong model, the “prover”, to produce text that is legible to humans, the training scheme requires it to generate solutions that a much weaker model, the “verifier”, can easily verify. The setup uses large and small models from the GPT-4 family, separated by roughly three orders of magnitude in pretraining compute, that have received only light supervised training on human-written math derivations and no extensive post-training. Training then alternates for multiple rounds between two phases, in which the models generate and check solutions to math problems (a minimal sketch of this loop follows the list):
- Verifier Training: the weak verifier model is trained to predict the correctness of solutions generated by the prover.
- Prover Training: the strong prover model is instructed to play either a “helpful” or a “sneaky” role. In the “helpful” role, it is rewarded for correct solutions that the verifier scores highly; in the “sneaky” role, it is rewarded for incorrect solutions that the verifier scores highly.
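A minimal sketch of this alternation is shown below. It reuses the hypothetical `prover` and `verifier` interfaces from the earlier sketch and additionally assumes an `is_correct` ground-truth checker, a `train_classifier` method on the verifier, and a `reinforce` update method on the prover; the simple role-gated reward is an illustrative simplification, not the exact reinforcement-learning objective used in the paper.

```python
import random

def verifier_phase(verifier, prover, problems, is_correct):
    """Phase 1: train the weak verifier to predict the correctness of prover solutions."""
    batch = []
    for problem in problems:
        solution = prover.generate(problem)       # sample a solution from the current prover
        label = is_correct(problem, solution)     # ground-truth correctness label
        batch.append((problem, solution, label))
    verifier.train_classifier(batch)              # supervised step: predict `label` from (problem, solution)

def prover_phase(prover, verifier, problems, is_correct):
    """Phase 2: train the prover with a role-conditioned reward against the current verifier."""
    for problem in problems:
        role = random.choice(["helpful", "sneaky"])
        solution = prover.generate(problem, role=role)
        correct = is_correct(problem, solution)
        convincing = verifier.score(problem, solution)   # verifier's estimated probability of correctness

        # Helpful role: rewarded when a correct solution convinces the verifier.
        # Sneaky role:  rewarded when an incorrect solution convinces the verifier.
        wants_correct = role == "helpful"
        reward = convincing if correct == wants_correct else 0.0

        prover.reinforce(problem, solution, role, reward)  # stand-in for an RL policy update

def train(prover, verifier, problems, is_correct, rounds=5):
    """Alternate the two phases for several rounds."""
    for _ in range(rounds):
        verifier_phase(verifier, prover, problems, is_correct)
        prover_phase(prover, verifier, problems, is_correct)
```

The idea behind alternating is that the verifier keeps adapting to the prover's evolving style, while the sneaky role supplies hard negative examples that make the verifier more robust.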
Key findings
- Performance vs. Legibility Trade-Off
A model trained only with supervision on human-written derivations has high legibility but poor accuracy. A model optimized solely for correctness achieves high accuracy but poor legibility. The game-based method balances both, retaining high legibility with moderate accuracy.
- Three Useful Models:
- Robust Verifier: effectively distinguishes correct from incorrect solutions, even when the solutions are designed to be misleading.
- Helpful Prover: generates solutions that remain legible to humans and reduce human evaluator errors.
- Sneaky Prover: produces subtly incorrect solutions that initially confuse human evaluators, highlighting areas for further model improvement.
In conclusion, a significant advantage of this methodology is that it reduces reliance on human demonstrations and judgments about legibility, which matters especially for reliably aligning AI systems with human values and expectations when direct human oversight is limited. Such methods will be instrumental in developing AI systems whose outputs are not only correct but also transparently verifiable, thereby enhancing trust and safety in real-world applications.