Surpassing GPT-4: Exploring How Agent Workflows Forge the Next Frontier of LLM Performance!
Recently, Professor Andrew Ng wrote in his newsletter that agent workflows built on large language models (LLMs) will be a key trend in the AI field and may drive significant AI advancements this year — even more so than the next generation of foundation models.
“I think AI agent workflows will drive massive AI progress this year — perhaps even more than the next generation of foundation models. This is an important trend, and I urge everyone who works in AI to pay attention to it.” — Andrew Ng
Agent workflow
We typically hope that a single prompt to ChatGPT will produce the desired result, but the first response often falls short. So we iterate, feeding back critique prompts until we get the output we want. The question is: can we standardize this process?
Some in the industry have proposed Agent workflows. In such a workflow, we ask the LLM to iterate on its output multiple times to produce a high-quality answer. By mimicking the iterative approach humans use to solve problems, this method makes AI-generated outputs more accurate and detailed. It not only leverages the strengths of large language models but also addresses their shortcomings by introducing feedback loops for continuous improvement. Through a cycle of planning, action, review, and adjustment, AI can produce higher-quality results. Professor Ng’s team tested this on the HumanEval benchmark, introduced in the paper “Evaluating Large Language Models Trained on Code,” and obtained the chart below.
From the chart above, we can see that in a zero-shot setting GPT-3.5 and GPT-4 achieve roughly 48% and 67% accuracy, respectively, while GPT-3.5 wrapped in an Agent workflow reaches up to about 95%. The performance gain from iterative Agent workflows thus far exceeds the gain from upgrading GPT-3.5 to GPT-4. This finding underscores the importance of Agent workflows in enhancing AI performance — perhaps even bringing GPT-5-level results within reach early. Professor Ng summarized current industry research into four workflow design patterns:
- Reflection: LLM reflects on its work and suggests improvements.
- Tool use: Provide tools to LLM, such as web searches, code execution, or any other functionality, to help it gather information, take action, or process data.
- Planning: LLM devises and executes a multi-step plan to achieve goals (e.g., outlining an article, then conducting online research, then drafting, etc.).
- Multi-agent collaboration: Multiple agents collaborate, assign tasks, and discuss and debate ideas to come up with better solutions than a single agent.
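The first of these patterns, Reflection, can be sketched as a short generate–critique–revise loop. This is a minimal illustration, not Professor Ng's implementation: `call_llm` is a hypothetical stand-in for any chat-completion API, stubbed here so the control flow runs end to end.

```python
# Minimal sketch of the Reflection pattern. `call_llm` is a hypothetical
# stub standing in for a real LLM API call.
def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a model here.
    if "Critique" in prompt:
        return "LGTM"  # the critic found no further issues
    return "def add(a, b):\n    return a + b"

def reflect_and_improve(task: str, max_rounds: int = 3) -> str:
    # Generate a first draft, then loop: critique -> revise -> critique ...
    draft = call_llm(f"Write code for this task:\n{task}")
    for _ in range(max_rounds):
        critique = call_llm(f"Critique this solution to '{task}':\n{draft}")
        if "LGTM" in critique:  # the critic is satisfied; stop iterating
            break
        draft = call_llm(
            f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}\nRevise the draft."
        )
    return draft

print(reflect_and_improve("add two numbers"))
```

The same loop structure underlies the 95% HumanEval result discussed above: the model keeps revising until its own critic (or a test harness) signs off.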
Agent System Overview
A common agent architecture is described in the blog post “LLM Powered Autonomous Agents” by Lilian Weng of OpenAI. The post systematically describes an Agent workflow in the style of AutoGPT, BabyAGI, and similar systems, using an LLM as the brain to complete tasks autonomously.
- Overview
As the brain of the Agent system, LLM is responsible for planning, reflection, memory, and tool use, among other key functions. Planning involves task decomposition and self-reflection, enabling agents to efficiently handle complex tasks. Memory is divided into short-term memory and long-term memory, with short-term memory involving context learning, and long-term memory utilizing external vector storage and rapid retrieval. Tool use involves calling external APIs to obtain missing information or perform specific tasks.
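These components can be pictured as a small skeleton, with the LLM as the brain coordinating planning, memory, and tools. All names here are illustrative assumptions, not an API from Weng's post; an echo function stands in for the model so the control flow is visible.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical skeleton of the components described above. Names are
# illustrative only; nothing here is a real library API.
@dataclass
class Agent:
    llm: Callable[[str], str]                       # the "brain"
    short_term: list = field(default_factory=list)  # in-context working memory
    long_term: dict = field(default_factory=dict)   # stand-in for a vector store
    tools: dict = field(default_factory=dict)       # name -> callable external API

    def run(self, task: str) -> str:
        # Plan first, keep the plan in short-term context, then execute.
        plan = self.llm(f"Decompose into steps: {task}")
        self.short_term.append(plan)
        return self.llm(f"Execute, using context {self.short_term}: {task}")

# An echo "LLM" just to show the control flow.
agent = Agent(llm=lambda p: f"[llm] {p}")
print(agent.run("summarize a paper"))
```

The sections below fill in each of these slots: planning, memory, and tool use.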
- Planning
Task decomposition: using techniques such as Chain of Thought (CoT) and Tree of Thoughts (ToT), complex tasks are broken down into smaller, more manageable sub-tasks.
Self-reflection: through frameworks like ReAct and Reflexion, agents can engage in self-critique and reflection to improve future actions.
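In the simplest form, task decomposition is just a prompt asking the model to list sub-goals, followed by parsing of the numbered reply. A hedged sketch, with `fake_llm` as a hypothetical stand-in for a real model call:

```python
import re

def fake_llm(prompt: str) -> str:
    # Placeholder reply in the numbered-steps style that CoT prompting elicits.
    return "1. Outline the article\n2. Research each section online\n3. Write a first draft"

def decompose(task: str, llm=fake_llm) -> list[str]:
    reply = llm(f"What are the subgoals for achieving: {task}?\nList them as 1., 2., 3.")
    # Keep only lines that start with a step number, stripping the "N. " prefix.
    return [re.sub(r"^\d+\.\s*", "", line).strip()
            for line in reply.splitlines() if re.match(r"^\d+\.", line)]

print(decompose("write a technical article"))
# → ['Outline the article', 'Research each section online', 'Write a first draft']
```

ToT extends this idea by branching at each step and searching over the resulting tree rather than committing to a single linear list.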
- Memory
Different types of memory in the human brain are introduced and mapped to memory mechanisms in the Agent system. Maximum Inner Product Search (MIPS) and related algorithms such as Locality-Sensitive Hashing (LSH), Annoy (Approximate Nearest Neighbors Oh Yeah), Hierarchical Navigable Small World (HNSW), and Facebook AI Similarity Search (FAISS) are discussed for optimizing retrieval speed over external memory.
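The operation all of those libraries accelerate is the same: given a query embedding, find the stored vectors with the highest inner product. A brute-force NumPy sketch of exact MIPS (the baseline that LSH, Annoy, HNSW, and FAISS approximate at scale), on made-up random data:

```python
import numpy as np

# Exact maximum inner product search: score every stored vector against the
# query and return the indices of the top-k scores.
def mips(query: np.ndarray, memory: np.ndarray, k: int = 2) -> np.ndarray:
    scores = memory @ query               # inner product with every stored vector
    return np.argsort(scores)[::-1][:k]   # indices of the k highest scores

rng = np.random.default_rng(0)
memory = rng.normal(size=(100, 64))                 # 100 fake 64-dim embeddings
query = memory[42] + 0.01 * rng.normal(size=64)     # near-duplicate of row 42
print(mips(query, memory))                          # row 42 should rank first
```

This scan is O(n) per query; the approximate methods named above trade a little recall for sub-linear lookups, which is what makes long-term vector memory practical.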
- Tool use
Human characteristics of tool use are discussed, and the application of this feature to LLMs is explored to extend the model’s capabilities. The Modular Reasoning, Knowledge, and Language (MRKL) architecture, which combines expert modules and a generic LLM as a router, is mentioned. TALM (Tool-Augmented Language Model) and Toolformer are introduced as methods for fine-tuning LMs to learn how to use external tool APIs. Examples such as ChatGPT plugins and OpenAI API calls demonstrate enhancing tool usage capabilities in practice.
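The MRKL routing idea can be shown with a toy dispatcher. This is an illustrative sketch only: a keyword rule stands in for the LLM router, and the tools are a restricted calculator plus a stubbed search function.

```python
# Toy MRKL-style router. In a real system the LLM would emit the tool name;
# here a simple keyword rule stands in for that decision.
def calculator(expr: str) -> str:
    # Restricted eval: no builtins, arithmetic expressions only.
    return str(eval(expr, {"__builtins__": {}}, {}))

def web_search(q: str) -> str:
    return f"(stub) search results for: {q}"  # placeholder for a real search API

TOOLS = {"calc": calculator, "search": web_search}

def route(query: str) -> str:
    # Dispatch arithmetic-looking queries to the calculator expert,
    # everything else to the generic search tool.
    if any(c in query for c in "+-*/") and any(c.isdigit() for c in query):
        return TOOLS["calc"](query)
    return TOOLS["search"](query)

print(route("17 * 3"))           # → 51
print(route("latest LLM news"))
```

TALM and Toolformer move this routing decision into the model itself: the LM is fine-tuned to emit API calls inline, rather than relying on an external dispatcher.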
Next Steps
In upcoming articles, we will explore the Reflection pattern of the Agent workflow and validate it with tests against the Llama.cpp setup we built previously. Stay tuned, give this article a clap if you found it helpful, drop a comment to share your thoughts, and don’t forget to follow me for the latest updates!