{{Short description|Software}}
{{Multiple issues|
{{essay-like|date=December 2008}}
}}
 
'''Agent-oriented software engineering''' ('''AOSE''') is a software engineering [[paradigm]] that arose to apply best practices to the development of complex [[Multi-agent systems|multi-agent systems]] (MAS) by focusing on the use of agents, and organizations (communities) of agents, as the main abstractions. The field of [[Product Family Engineering|software product line]]s (SPL) covers the entire [[software]] development lifecycle needed to develop a family of products, where the derivation of concrete products is performed systematically and rapidly.
 
==Commentary==
 
Multiagent system product lines (MAS-PL) is a research field devoted to combining the two approaches: applying the SPL philosophy to the construction of a MAS. This affords the advantages of SPLs while making MAS development more practical.
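
As a rough illustration of the MAS-PL idea, a product line can expose agent capabilities as optional features, and a concrete multi-agent product is then derived from a feature selection. The sketch below is hypothetical: the feature names, agent roles, and code structure are invented for illustration and are not taken from any published MAS-PL framework.

<syntaxhighlight lang="python">
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    """An agent role in the derived multi-agent system."""
    name: str
    capabilities: list = field(default_factory=list)

# Feature model of the product line: each optional feature maps to the
# capabilities it contributes to the agents of a derived product.
FEATURES = {
    "negotiation": ["propose", "accept", "reject"],
    "auction": ["bid", "award"],
    "monitoring": ["log", "alert"],
}

def derive_product(selected_features):
    """Systematically derive a concrete MAS from a feature selection."""
    agents = [AgentSpec("broker"), AgentSpec("worker")]
    for feature in selected_features:
        for agent in agents:
            agent.capabilities.extend(FEATURES[feature])
    return agents

# Two members of the same product family, derived from different selections.
market_mas = derive_product(["negotiation", "auction"])
logistics_mas = derive_product(["monitoring"])
</syntaxhighlight>

Every derived product reuses the same feature model, which is where the systematic and rapid derivation promised by SPLs comes from.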
 
==Benchmarks==
Several benchmarks have been developed to evaluate the capabilities of AI coding agents and large language models on software engineering and related agentic tasks, including the following:
 
{|class="wikitable"
|+ Agentic software engineering benchmarks
! Benchmark !! Description
|-
| [https://www.swebench.com/ SWE-bench]
| Assesses the ability of AI models to solve real-world software engineering issues sourced from GitHub repositories. The benchmark involves:
* Providing agents with a code repository and issue description
* Challenging them to generate a patch that resolves the described problem
* Evaluating the generated patch against unit tests (a simplified version of this loop is sketched below the table)
|-
| [https://github.com/snap-stanford/MLAgentBench MLAgentBench]
| Evaluates the ability of AI agents to perform machine learning experimentation tasks, such as improving a model's performance on a given dataset
|-
| [https://github.com/sierra-research/tau-bench τ-Bench]
| Developed by Sierra to evaluate AI agent performance and reliability in realistic settings with simulated users. It focuses on:
* Testing agents on complex tasks with dynamic user and tool interactions
* Assessing the ability to follow ___domain-specific policies
* Measuring consistency and reliability at scale
|-
| [https://github.com/web-arena-x/webarena WebArena]
| Evaluates AI agents in a simulated web environment. The benchmark tasks include:
* Navigating complex websites to complete user-driven tasks
* Extracting relevant information from the web
* Testing the adaptability of agents to diverse web-based challenges
|-
| [https://github.com/THUDM/AgentBench AgentBench]
| A benchmark designed to assess the capabilities of large language models acting as agents across a range of interactive environments, including operating systems, databases, and the web. The key areas of evaluation include:
* Multi-turn reasoning and decision-making
* Use of tools and environment interfaces
* Adaptability across diverse environments
|-
| [https://github.com/aryopg/mmlu-redux MMLU-Redux]
| A manually re-annotated subset of the MMLU benchmark, created to identify and correct errors in the original questions while still covering a broad range of academic subjects. It measures:
* Subject-matter knowledge across multiple disciplines
* Ability to handle complex problem-solving tasks
* Consistency in providing accurate answers across topics
|-
| [https://github.com/MCEVAL/McEval McEval]
| A multilingual coding benchmark designed to test AI models' ability to solve programming tasks across roughly 40 languages. The benchmark evaluates:
* Code correctness and efficiency
* Ability to handle diverse programming languages
* Performance across different coding paradigms and tasks
|-
| [https://csbench.github.io/ CS-Bench]
| A specialized benchmark for evaluating AI performance in computer science-related tasks. The key focus areas include:
* Algorithms and data structures
* Computational complexity and optimization
* Theoretical and applied computer science concepts
|-
| [https://github.com/allenai/WildBench WildBench]
| Evaluates AI models on challenging tasks drawn from real-world user queries collected in human–chatbot conversation logs. It emphasizes:
* Handling diverse, naturally occurring requests rather than curated problems
* Open-ended tasks spanning coding, writing, analysis, and reasoning
* Fine-grained, checklist-based automatic evaluation of responses
|-
| [https://huggingface.co/datasets/baharef/ToT Test of Time]
| A benchmark that focuses on evaluating AI models' ability to reason about time using synthetic questions. It assesses:
* Understanding of temporal semantics and the ordering of events
* Arithmetic involving dates, times, and durations
* Temporal reasoning that does not depend on facts memorised during training
|}
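
As a rough sketch of the SWE-bench-style evaluation loop referenced in the table above (this is not the official harness; the repository path, patch file name, and test command are placeholder assumptions), an evaluator resets a repository to the issue's base commit, applies the model-generated patch, and counts the instance as resolved only if the designated tests pass.

<syntaxhighlight lang="python">
import subprocess

# Illustrative only: the real SWE-bench harness manages isolated environments
# and repository-specific test commands. All paths and commands are placeholders.
instances = [
    {"repo_dir": "/tmp/astropy", "base_commit": "abc123",
     "model_patch": "fix.patch", "test_cmd": ["python", "-m", "pytest", "-x"]},
]

def evaluate(instance):
    repo = instance["repo_dir"]
    # Reset the working tree to the commit the issue was reported against.
    subprocess.run(["git", "checkout", "-f", instance["base_commit"]],
                   cwd=repo, check=True)
    # Apply the agent-generated patch; a patch that does not apply is unresolved.
    applied = subprocess.run(["git", "apply", instance["model_patch"]], cwd=repo)
    if applied.returncode != 0:
        return False
    # The instance counts as resolved only if the designated tests now pass.
    tests = subprocess.run(instance["test_cmd"], cwd=repo)
    return tests.returncode == 0

resolved = sum(evaluate(inst) for inst in instances)
print(f"resolved {resolved}/{len(instances)} instances")
</syntaxhighlight>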
 
== Software engineering agent systems ==
 
Several software engineering (SWE) agent systems are in development. Some examples are listed below, followed by a generic sketch of how such a system can drive its backend model.
 
{| class="wikitable"
|+ List of SWE Agent Systems
! SWE Agent System !! Backend LLM
|-
| [https://salesforce-research-dei-agents.github.io/ Salesforce Research DEIBASE-1] || GPT-4o
|-
| [https://cosine.sh/ Cosine Genie] || Fine-tuned OpenAI GPT
|-
| [https://aide.dev/ CodeStory Aide] || GPT-4o + Claude 3.5 Sonnet
|-
| [https://mentat.ai/blog/mentatbot-sota-coding-agent AbanteAI MentatBot] || GPT-4o
|-
| Salesforce Research DEIBASE-2 || GPT-4o
|-
| Salesforce Research DEI-Open || GPT-4o
|-
| [https://www.marscode.com/ ByteDance MarsCode] || GPT-4o
|-
| [https://arxiv.org/abs/2406.01422 Alibaba Lingma] || gpt-4-1106-preview
|-
| [https://www.factory.ai/ Factory Code Droid] || Anthropic + OpenAI
|-
| [https://autocoderover.dev/ AutoCodeRover] || GPT-4o
|-
| [https://aws.amazon.com/q/developer/ Amazon Q Developer] || (unknown)
|-
| [https://github.com/NL2Code/CodeR CodeR] || gpt-4-1106-preview
|-
| [https://github.com/masai-dev-agent/masai MASAI] || (unknown)
|-
| [https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240706_sima_gpt4o SIMA] || GPT-4o
|-
| [https://github.com/OpenAutoCoder/Agentless Agentless] || GPT-4o
|-
| [https://github.com/aorwall/moatless-tools Moatless Tools] || Claude 3.5 Sonnet
|-
| [https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240612_IBM_Research_Agent101 IBM Research Agent] || (unknown)
|-
| [https://github.com/paul-gauthier/aider Aider] || GPT-4o + Claude 3 Opus
|-
| [https://docs.all-hands.dev/ OpenDevin + CodeAct] || GPT-4o
|-
| [https://github.com/FSoft-AI4Code/AgileCoder AgileCoder] || (various)
|-
| [https://chatdev.ai/ ChatDev] || (unknown)
|-
| [https://github.com/geekan/MetaGPT MetaGPT] || GPT-4o
|}
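
At a high level, an LLM-backed SWE agent system couples a backend model with repository tooling in an iterative loop. The generic sketch below is hypothetical: the call_llm placeholder, the pytest test command, and the retry strategy are assumptions for illustration and do not describe the workflow of any specific system in the table.

<syntaxhighlight lang="python">
import subprocess

def call_llm(prompt: str) -> str:
    """Placeholder for a request to whichever backend LLM the system uses."""
    raise NotImplementedError("wire this to an actual model API")

def run_tests(repo_dir: str) -> bool:
    """Assumed test command; real systems discover this per repository."""
    return subprocess.run(["python", "-m", "pytest", "-q"],
                          cwd=repo_dir).returncode == 0

def solve_issue(issue_text: str, repo_dir: str, max_iterations: int = 3) -> bool:
    """Repeatedly ask the backend LLM for a patch until the tests pass."""
    feedback = ""
    for _ in range(max_iterations):
        patch = call_llm(
            f"Issue:\n{issue_text}\n\nPrevious feedback:\n{feedback}\n"
            "Produce a unified diff that fixes the issue."
        )
        # Apply the proposed diff from stdin; an unusable patch triggers a retry.
        applied = subprocess.run(["git", "apply", "-"], input=patch,
                                 text=True, cwd=repo_dir)
        if applied.returncode != 0:
            feedback = "The patch did not apply cleanly."
            continue
        if run_tests(repo_dir):
            return True
        # Revert the failed attempt before asking the model again.
        subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir)
        feedback = "The patch applied but the tests still fail."
    return False
</syntaxhighlight>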
 
== External links ==
* ''Agent-Oriented Software Engineering: Reflections on Architectures, Methodologies, Languages, and Frameworks'' {{ISBN|978-3642544316}}
 
== References ==
* Michael Winikoff and Lin Padgham. ''Agent Oriented Software Engineering''. Chapter 15 (pages 695–757) in G. Weiss (ed.), [http://mitpress.mit.edu/multiagentsystems ''Multiagent Systems''], 2nd edition, MIT Press. {{ISBN|978-0-262-01889-0}} (a survey of the field)
* Site of the MaCMAS methodology, which applies MAS-PL (archived): https://web.archive.org/web/20100922120209/http://james.eii.us.es/MaCMAS/index.php/Main_Page
* MAS Product Lines site: https://web.archive.org/web/20140518122645/http://mas-productlines.org/
* Joaquin Peña, Michael G. Hinchey, and Antonio Ruiz-Cortés. ''Multiagent system product lines: Challenges and benefits''. Communications of the ACM, volume 49, number 12, December 2006. {{doi|10.1145/1183236.1183272}}