{{Short description|Software}}
{{Multiple issues|
{{essay-like|date=December 2008}}
}}
 
'''Agent-oriented software engineering''' ('''AOSE''') is a software engineering [[paradigm]] that arose to apply best practices to the development of complex [[Multi-agent systems|multi-agent systems]] (MAS) by focusing on the use of agents, and organizations (communities) of agents, as the main abstractions. The field of [[Product Family Engineering|software product line]]s (SPL) covers the entire [[software]] development lifecycle needed to develop a family of products, where the derivation of concrete products is performed systematically and rapidly.
 
==Commentary==
 
Multiagent system product lines (MAS-PL) is a research field devoted to combining the two approaches: applying the SPL philosophy to the construction of a MAS. This affords the advantages of SPLs while making MAS development more practical.
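
As a rough illustration of the MAS-PL idea, a product line can expose agent capabilities as optional features, and a concrete multi-agent product is then derived from a feature selection. The sketch below is hypothetical: the feature names, agent roles, and code structure are invented for illustration and are not taken from any published MAS-PL framework.

<syntaxhighlight lang="python">
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    """An agent role in the derived multi-agent system."""
    name: str
    capabilities: list = field(default_factory=list)

# Feature model of the product line: each optional feature maps to the
# capabilities it contributes to the agents of a derived product.
FEATURES = {
    "negotiation": ["propose", "accept", "reject"],
    "auction": ["bid", "award"],
    "monitoring": ["log", "alert"],
}

def derive_product(selected_features):
    """Systematically derive a concrete MAS from a feature selection."""
    agents = [AgentSpec("broker"), AgentSpec("worker")]
    for feature in selected_features:
        for agent in agents:
            agent.capabilities.extend(FEATURES[feature])
    return agents

# Two members of the same product family, derived from different selections.
market_mas = derive_product(["negotiation", "auction"])
logistics_mas = derive_product(["monitoring"])
</syntaxhighlight>

Every derived product reuses the same feature model, which is where the systematic and rapid derivation promised by SPLs comes from.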
 
==Benchmarks==
Several benchmarks have been developed to evaluate the capabilities of AI coding agents and large language models on software engineering and related agentic tasks, including the following:
 
{|class="wikitable"
|+ Agentic software engineering benchmarks
! Benchmark !! Description
|-
| [https://www.swebench.com/ SWE-bench]
| Assesses the ability of AI models to solve real-world software engineering issues sourced from GitHub repositories. The benchmark involves:
* Providing agents with a code repository and issue description
* Challenging them to generate a patch that resolves the described problem
* Evaluating the generated patch against unit tests (a simplified version of this loop is sketched below the table)
|-
| [https://github.com/snap-stanford/MLAgentBench MLAgentBench]
| Evaluates the ability of AI agents to perform machine learning experimentation tasks, such as improving a model's performance on a given dataset
|-
| [https://github.com/sierra-research/tau-bench τ-Bench]
| Developed by Sierra to evaluate AI agent performance and reliability in realistic settings with simulated users. It focuses on:
* Testing agents on complex tasks with dynamic user and tool interactions
* Assessing the ability to follow ___domain-specific policies
* Measuring consistency and reliability at scale
|-
| [https://github.com/web-arena-x/webarena WebArena]
| Evaluates AI agents in a simulated web environment. The benchmark tasks include:
* Navigating complex websites to complete user-driven tasks
* Extracting relevant information from the web
* Testing the adaptability of agents to diverse web-based challenges
|-
| [https://github.com/THUDM/AgentBench AgentBench]
| A benchmark designed to assess the capabilities of large language models acting as agents across a range of interactive environments, including operating systems, databases, and the web. The key areas of evaluation include:
* Multi-turn reasoning and decision-making
* Use of tools and environment interfaces
* Adaptability across diverse environments
|-
| [https://github.com/aryopg/mmlu-redux MMLU-Redux]
| A manually re-annotated subset of the MMLU benchmark, created to identify and correct errors in the original questions while still covering a broad range of academic subjects. It measures:
* Subject-matter knowledge across multiple disciplines
* Ability to handle complex problem-solving tasks
* Consistency in providing accurate answers across topics
|-
| [https://github.com/MCEVAL/McEval McEval]
| A multilingual coding benchmark designed to test AI models' ability to solve programming tasks across roughly 40 languages. The benchmark evaluates:
* Code correctness and efficiency
* Ability to handle diverse programming languages
* Performance across different coding paradigms and tasks
|-
| [https://csbench.github.io/ CS-Bench]
| A specialized benchmark for evaluating AI performance in computer science-related tasks. The key focus areas include:
* Algorithms and data structures
* Computational complexity and optimization
* Theoretical and applied computer science concepts
|-
| [https://github.com/allenai/WildBench WildBench]
| Evaluates AI models on challenging tasks drawn from real-world user queries collected in human–chatbot conversation logs. It emphasizes:
* Handling diverse, naturally occurring requests rather than curated problems
* Open-ended tasks spanning coding, writing, analysis, and reasoning
* Fine-grained, checklist-based automatic evaluation of responses
|-
| [https://huggingface.co/datasets/baharef/ToT Test of Time]
| A benchmark that focuses on evaluating AI models' ability to reason about time using synthetic questions. It assesses:
* Understanding of temporal semantics and the ordering of events
* Arithmetic involving dates, times, and durations
* Temporal reasoning that does not depend on facts memorised during training
|}
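
As a rough sketch of the SWE-bench-style evaluation loop referenced in the table above (this is not the official harness; the repository path, patch file name, and test command are placeholder assumptions), an evaluator resets a repository to the issue's base commit, applies the model-generated patch, and counts the instance as resolved only if the designated tests pass.

<syntaxhighlight lang="python">
import subprocess

# Illustrative only: the real SWE-bench harness manages isolated environments
# and repository-specific test commands. All paths and commands are placeholders.
instances = [
    {"repo_dir": "/tmp/astropy", "base_commit": "abc123",
     "model_patch": "fix.patch", "test_cmd": ["python", "-m", "pytest", "-x"]},
]

def evaluate(instance):
    repo = instance["repo_dir"]
    # Reset the working tree to the commit the issue was reported against.
    subprocess.run(["git", "checkout", "-f", instance["base_commit"]],
                   cwd=repo, check=True)
    # Apply the agent-generated patch; a patch that does not apply is unresolved.
    applied = subprocess.run(["git", "apply", instance["model_patch"]], cwd=repo)
    if applied.returncode != 0:
        return False
    # The instance counts as resolved only if the designated tests now pass.
    tests = subprocess.run(instance["test_cmd"], cwd=repo)
    return tests.returncode == 0

resolved = sum(evaluate(inst) for inst in instances)
print(f"resolved {resolved}/{len(instances)} instances")
</syntaxhighlight>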
 
== Software engineering agent systems ==
 
Several software engineering (SWE) agent systems are in development. Some examples are listed below, followed by a generic sketch of how such a system can drive its backend model.
 
{| class="wikitable"
|+ List of SWE Agent Systems
! SWE Agent System !! Backend LLM
|-
| [https://salesforce-research-dei-agents.github.io/ Salesforce Research DEIBASE-1] || GPT-4o
|-
| [https://cosine.sh/ Cosine Genie] || Fine-tuned OpenAI GPT
|-
| [https://aide.dev/ CodeStory Aide] || GPT-4o + Claude 3.5 Sonnet
|-
| [https://mentat.ai/blog/mentatbot-sota-coding-agent AbanteAI MentatBot] || GPT-4o
|-
| Salesforce Research DEIBASE-2 || GPT-4o
|-
| Salesforce Research DEI-Open || GPT-4o
|-
| [https://www.marscode.com/ ByteDance MarsCode] || GPT-4o
|-
| [https://arxiv.org/abs/2406.01422 Alibaba Lingma] || gpt-4-1106-preview
|-
| [https://www.factory.ai/ Factory Code Droid] || Anthropic + OpenAI
|-
| [https://autocoderover.dev/ AutoCodeRover] || GPT-4o
|-
| [https://aws.amazon.com/q/developer/ Amazon Q Developer] || (unknown)
|-
| [https://github.com/NL2Code/CodeR CodeR] || gpt-4-1106-preview
|-
| [https://github.com/masai-dev-agent/masai MASAI] || (unknown)
|-
| [https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240706_sima_gpt4o SIMA] || GPT-4o
|-
| [https://github.com/OpenAutoCoder/Agentless Agentless] || GPT-4o
|-
| [https://github.com/aorwall/moatless-tools Moatless Tools] || Claude 3.5 Sonnet
|-
| [https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240612_IBM_Research_Agent101 IBM Research Agent] || (unknown)
|-
| [https://github.com/paul-gauthier/aider Aider] || GPT-4o + Claude 3 Opus
|-
| [https://docs.all-hands.dev/ OpenDevin + CodeAct] || GPT-4o
|-
| [https://github.com/FSoft-AI4Code/AgileCoder AgileCoder] || (various)
|-
| [https://chatdev.ai/ ChatDev] || (unknown)
|-
| [https://github.com/geekan/MetaGPT MetaGPT] || GPT-4o
|}
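
At a high level, an LLM-backed SWE agent system couples a backend model with repository tooling in an iterative loop. The generic sketch below is hypothetical: the call_llm placeholder, the pytest test command, and the retry strategy are assumptions for illustration and do not describe the workflow of any specific system in the table.

<syntaxhighlight lang="python">
import subprocess

def call_llm(prompt: str) -> str:
    """Placeholder for a request to whichever backend LLM the system uses."""
    raise NotImplementedError("wire this to an actual model API")

def run_tests(repo_dir: str) -> bool:
    """Assumed test command; real systems discover this per repository."""
    return subprocess.run(["python", "-m", "pytest", "-q"],
                          cwd=repo_dir).returncode == 0

def solve_issue(issue_text: str, repo_dir: str, max_iterations: int = 3) -> bool:
    """Repeatedly ask the backend LLM for a patch until the tests pass."""
    feedback = ""
    for _ in range(max_iterations):
        patch = call_llm(
            f"Issue:\n{issue_text}\n\nPrevious feedback:\n{feedback}\n"
            "Produce a unified diff that fixes the issue."
        )
        # Apply the proposed diff from stdin; an unusable patch triggers a retry.
        applied = subprocess.run(["git", "apply", "-"], input=patch,
                                 text=True, cwd=repo_dir)
        if applied.returncode != 0:
            feedback = "The patch did not apply cleanly."
            continue
        if run_tests(repo_dir):
            return True
        # Revert the failed attempt before asking the model again.
        subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir)
        feedback = "The patch applied but the tests still fail."
    return False
</syntaxhighlight>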
 
== External links ==
* ''Agent-Oriented Software Engineering: Reflections on Architectures, Methodologies, Languages, and Frameworks'' {{ISBN|978-3642544316}}
 
== References ==
* Michael Winikoff and Lin Padgham. ''Agent Oriented Software Engineering''. Chapter 15 (pages 695–757) in G. Weiss (ed.), [http://mitpress.mit.edu/multiagentsystems ''Multiagent Systems''], 2nd edition, MIT Press. {{ISBN|978-0-262-01889-0}} (a survey of the field)
* Site of the MaCMAS methodology, which applies MAS-PL (archived): https://web.archive.org/web/20100922120209/http://james.eii.us.es/MaCMAS/index.php/Main_Page
* MAS Product Lines site: https://web.archive.org/web/20140518122645/http://mas-productlines.org/
* Joaquin Peña, Michael G. Hinchey, and Antonio Ruiz-Cortés. ''Multiagent system product lines: Challenges and benefits''. Communications of the ACM, volume 49, number 12, December 2006. {{doi|10.1145/1183236.1183272}}