{{Short description|Use of software agents in software engineering}}
{{Multiple issues|
{{essay-like|date=April 2009}}
{{No footnotes|date=April 2009}}
}}
==Commentary==
Multiagent systems (MAS) developed for different applications tend to reuse the same techniques, adaptations, and approaches. The field is thus ripe for exploiting the benefits of software product lines (SPL): reduced costs, improved time-to-market, and so on, making agent technology more industrially applicable.
Multiagent Systems Product Lines (MAS-PL) is a research field devoted to combining the two approaches: applying the SPL philosophy to building a MAS. This affords all of the advantages of SPLs while making MAS development more practical.
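As a minimal illustration of this idea (a sketch in Python; the class and feature names are hypothetical, not taken from any particular MAS-PL framework), a product line of agents can be expressed as a shared core plus a per-product feature selection, mirroring SPL-style variability management:
<syntaxhighlight lang="python">
from dataclasses import dataclass, field

# Hypothetical sketch: a MAS product line derives concrete agent
# "products" from a shared core plus a feature selection.

@dataclass
class AgentProduct:
    name: str
    features: set = field(default_factory=set)

    def act(self, message: str) -> str:
        # Core behaviour shared by every product of the line.
        reply = f"{self.name} received: {message}"
        # Optional features introduce variability, as in an SPL feature model.
        if "negotiation" in self.features:
            reply += " | opening negotiation"
        if "logging" in self.features:
            print(f"[log] {reply}")
        return reply

# Two products of the same line, built from different feature selections.
basic_agent = AgentProduct("basic")
trader_agent = AgentProduct("trader", {"negotiation", "logging"})
</syntaxhighlight>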
==Benchmarks==
Several benchmarks have been developed to evaluate the capabilities of AI coding agents and large language models on software engineering tasks. Notable examples include:
{|class="wikitable"
|+ Agentic software engineering benchmarks
! Benchmark !! Description
|-
| [https://www.swebench.com/ SWE-bench]
| Assesses the ability of AI models to resolve real-world software engineering issues sourced from GitHub repositories (a minimal evaluation sketch follows this table). The benchmark involves:
* Providing the agent with a code repository and an issue description
* Challenging it to generate a patch that resolves the described problem
* Evaluating the generated patch against the repository's unit tests
|-
| [https://github.com/snap-stanford/MLAgentBench MLAgentBench]
| Evaluates AI agents on machine-learning experimentation tasks, such as improving a model's performance on a given dataset
|-
| [https://github.com/sierra-research/tau-bench τ-Bench]
| Developed by Sierra to evaluate AI agent performance and reliability in realistic user-facing settings. It focuses on:
* Testing agents on complex tasks with dynamic user and tool interactions
* Assessing the ability to follow ___domain-specific policies
* Measuring consistency and reliability at scale
|-
| [https://github.com/web-arena-x/webarena WebArena]
| Evaluates AI agents in a simulated web environment. The benchmark tasks include:
* Navigating complex websites to complete user-driven tasks
* Extracting relevant information from the web
* Testing the adaptability of agents to diverse web-based challenges
|-
| [https://github.com/THUDM/AgentBench AgentBench]
| A benchmark for evaluating large language models as agents across a diverse set of interactive environments. The evaluated settings include:
* Operating-system and database interaction
* Knowledge-graph querying and game playing
* Web browsing and web shopping
|-
| [https://github.com/aryopg/mmlu-redux MMLU-Redux]
| A manually re-annotated subset of the MMLU benchmark that identifies erroneous questions in the original dataset. It provides:
* Questions spanning a broad range of academic subjects and domains
* Annotations flagging flawed items (e.g., incorrect ground truth or unclear wording)
* A more reliable measurement of model accuracy across topics
|-
| [https://github.com/MCEVAL/McEval McEval]
| A multilingual coding benchmark spanning tasks in 40 programming languages. The benchmark evaluates:
* Code generation, completion, and explanation
* Ability to handle diverse programming languages
* Performance across different coding paradigms and tasks
|-
| [https://csbench.github.io/ CS-Bench]
| A specialized benchmark for evaluating AI performance in computer science-related tasks. The key focus areas include:
* Algorithms and data structures
* Computational complexity and optimization
* Theoretical and applied computer science concepts
|-
| [https://github.com/allenai/WildBench WildBench]
| Tests AI models on challenging tasks drawn from real-world user–chatbot conversation logs. It emphasizes:
* Handling noisy, unstructured, naturally occurring requests
* Covering diverse task types rather than curated academic problems
* Scoring responses against checklists tailored to each task
|-
| [https://huggingface.co/datasets/baharef/ToT Test of Time]
| A benchmark for evaluating AI models' ability to reason about temporal sequences and events. It assesses:
* Understanding of temporal logic and sequence prediction
* Ability to make decisions based on time-dependent data
* Performance in tasks requiring long-term planning and foresight
|}
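To make the SWE-bench-style protocol referenced in the table concrete, the following minimal sketch (in Python, with hypothetical paths and commands; it is not the official harness of any of these benchmarks) applies a model-generated patch to a repository checkout and then runs the repository's unit tests:
<syntaxhighlight lang="python">
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Sketch of one SWE-bench-style evaluation step (illustrative only)."""
    # Apply the model-generated patch to the repository working tree.
    apply = subprocess.run(
        ["git", "apply", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if apply.returncode != 0:
        return False  # Patch does not apply cleanly; the instance fails.
    # Run the unit tests associated with the issue; their outcome
    # decides whether the instance counts as resolved.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True, text=True)
    return tests.returncode == 0

# Example usage with illustrative paths:
# resolved = evaluate_patch("workspace/repo", "model.patch", ["pytest", "-x"])
</syntaxhighlight>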
== Software engineering agent systems ==
Several software engineering (SWE) agent systems are in development; examples include:
{| class="wikitable"
|+ List of SWE agent systems
! SWE agent system !! Backend LLM
|-
| [https://salesforce-research-dei-agents.github.io/ Salesforce Research DEIBASE-1] || GPT-4o
|-
| [https://cosine.sh/ Cosine Genie] || Fine-tuned OpenAI GPT
|-
| [https://aide.dev/ CodeStory Aide] || GPT-4o + Claude 3.5 Sonnet
|-
| [https://mentat.ai/blog/mentatbot-sota-coding-agent Abante AI MentatBot] || GPT-4o
|-
| Salesforce Research DEIBASE-2 || GPT-4o
|-
| Salesforce Research DEI-Open || GPT-4o
|-
| [https://www.marscode.com/ ByteDance MarsCode] || GPT-4o
|-
| [https://arxiv.org/abs/2406.01422 Alibaba Lingma] || gpt-4-1106-preview
|-
| [https://www.factory.ai/ Factory Code Droid] || Anthropic + OpenAI
|-
| [https://autocoderover.dev/ AutoCodeRover] || GPT-4o
|-
| [https://aws.amazon.com/q/developer/ Amazon Q Developer] || (unknown)
|-
| [https://github.com/NL2Code/CodeR CodeR] || gpt-4-1106-preview
|-
| [https://github.com/masai-dev-agent/masai MASAI] || (unknown)
|-
| [https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240706_sima_gpt4o SIMA] || GPT-4o
|-
| [https://github.com/OpenAutoCoder/Agentless Agentless] || GPT-4o
|-
| [https://github.com/aorwall/moatless-tools Moatless Tools] || Claude 3.5 Sonnet
|-
| [https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240612_IBM_Research_Agent101 IBM Research Agent] || (unknown)
|-
| [https://github.com/paul-gauthier/aider Aider] || GPT-4o + Claude 3 Opus
|-
| [https://docs.all-hands.dev/ OpenDevin + CodeAct] || GPT-4o
|-
| [https://github.com/FSoft-AI4Code/AgileCoder AgileCoder] || (various)
|-
| [https://chatdev.ai/ ChatDev] || (unknown)
|-
| [https://github.com/geekan/MetaGPT MetaGPT] || GPT-4o
|}
== External links ==
* ''Agent-Oriented Software Engineering: Reflections on Architectures, Methodologies, Languages, and Frameworks'' {{ISBN|978-3642544316}}
== References ==
* Michael Winikoff and Lin Padgham. ''Agent-Oriented Software Engineering''. Chapter 15 (pp. 695–757) in G. Weiss (ed.), [http://mitpress.mit.edu/multiagentsystems Multiagent Systems], 2nd edition, MIT Press. {{ISBN|978-0-262-01889-0}} (a recent survey of the field)
* Site of the MaCMAS methodology, which applies MAS-PL.
* Joaquin Peña, Michael G. Hinchey, and Antonio Ruiz-Cortés. ''Multiagent System Product Lines: Challenges and Benefits''. Communications of the ACM, 49(12), December 2006.
* Joaquin Peña, Michael G. Hinchey, Antonio Ruiz-Cortés, and Pablo Trinidad. ''Building the Core Architecture of a NASA Multiagent System Product Line''. In 7th International Workshop on Agent-Oriented Software Engineering (AOSE 2006), Hakodate, Japan, May 2006.
* Joaquin Peña, Michael G. Hinchey, Manuel Resinas, Roy Sterritt, and James L. Rash. ''Managing the Evolution of an Enterprise Architecture Using a MAS-Product-Line Approach''. In 5th International Workshop on System/Software Architectures (IWSSA'06), Nevada, USA, 2006.
* Soe-Tsyr Yuan. ''MAS Building Environments with Product-Line-Architecture Awareness''.
* [https://web.archive.org/web/20070517214904/http://www.cs.iastate.edu/~dehlinge/publications.html Josh Dehlinger's publications]
* [https://web.archive.org/web/20091231195122/http://james.eii.us.es/MaCMAS/images/6/69/Current-Research-MAS-PL-TF4-Lisbon.pdf Current Research on MAS-PL (TF4, Lisbon)]
[[Category:Software project management]]
{{software-eng-stub}}