{{Short description|Software}}
{{Multiple issues|
{{essay-like|date=December 2008}}
}}
'''Agent-oriented software engineering''' is the application of agent abstractions, such as autonomous agents and multi-agent systems, to the analysis, design, and implementation of software.
==Commentary==
Multiagent Systems Product Lines (MAS-PL) is a research field devoted to combining software product lines (SPLs) and multi-agent systems (MAS): applying the SPL philosophy of planned reuse and variability to the construction of a MAS. The aim is to carry the advantages of SPLs over to MAS development and thereby make it more practical; the sketch below illustrates the idea.
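In code, the MAS-PL idea amounts to deriving a concrete multi-agent system from a product line by selecting features and instantiating the agents those features contribute. The following is a minimal illustrative sketch; all names (<code>Feature</code>, <code>AgentSpec</code>, <code>derive_mas</code>) are hypothetical and do not come from any published MAS-PL framework.
<syntaxhighlight lang="python">
from dataclasses import dataclass, field

@dataclass
class Feature:
    """One selectable feature of the product line."""
    name: str
    requires: set = field(default_factory=set)   # names of prerequisite features
    agents: list = field(default_factory=list)   # agent roles this feature adds

@dataclass
class AgentSpec:
    role: str
    provided_by: str  # the feature that contributed this agent

def derive_mas(features: dict, selection: set) -> list:
    """Validate a feature selection against the feature model, then
    collect the agent roles contributed by every selected feature."""
    for name in selection:
        missing = features[name].requires - selection
        if missing:
            raise ValueError(f"{name} requires unselected features: {missing}")
    return [AgentSpec(role=role, provided_by=name)
            for name in sorted(selection)
            for role in features[name].agents]

# Example product line: a monitoring MAS with an optional alerting feature.
line = {
    "monitoring": Feature("monitoring", agents=["sensor-agent", "aggregator-agent"]),
    "alerting": Feature("alerting", requires={"monitoring"}, agents=["alert-agent"]),
}
print(derive_mas(line, {"monitoring", "alerting"}))
</syntaxhighlight>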
==Benchmarks==
Several benchmarks have been developed to evaluate the capabilities of AI coding agents and large language models on software engineering tasks. Key benchmarks include:
{|class="wikitable"
|+ Agentic software engineering benchmarks
! Benchmark !! Description
|-
| [https://www.swebench.com/ SWE-bench]
| Assesses the ability of AI models to resolve real-world software engineering issues sourced from GitHub repositories (a minimal sketch of this protocol appears after the table). The benchmark involves:
* Providing agents with a code repository and issue description
* Challenging them to generate a patch that resolves the described problem
* Evaluating the generated patch against unit tests
|-
| [https://github.com/snap-stanford/MLAgentBench MLAgentBench]
| Designed to evaluate AI agents on machine learning experimentation tasks, such as improving model performance on a given dataset
|-
| [https://github.com/sierra-research/tau-bench τ-Bench]
| Developed by Sierra to evaluate AI agent performance and reliability in real-world settings. It focuses on:
* Testing agents on complex tasks with dynamic user and tool interactions
* Assessing the ability to follow ___domain-specific policies
* Measuring consistency and reliability at scale
|-
| [https://github.com/web-arena-x/webarena WebArena]
| Evaluates AI agents in a simulated web environment. The benchmark tasks include:
* Navigating complex websites to complete user-driven tasks
* Extracting relevant information from the web
* Testing the adaptability of agents to diverse web-based challenges
|-
| [https://github.com/THUDM/AgentBench AgentBench]
| A benchmark designed to assess large language models acting as agents in a range of interactive environments. The key areas of evaluation include:
* Operating-system, database, and knowledge-graph interaction
* Web browsing and web shopping tasks
* Game-style environments requiring planning and reasoning
|-
| [https://github.com/aryopg/mmlu-redux MMLU-Redux]
| A manually re-annotated subset of the MMLU benchmark in which mislabelled and erroneous questions are identified and corrected. It measures:
* Subject matter expertise across multiple disciplines
* Ability to handle complex problem-solving tasks
* Consistency in providing accurate answers across topics
|-
| [https://github.com/MCEVAL/McEval McEval]
| A multilingual coding benchmark designed to test AI models' ability to solve coding challenges across dozens of programming languages. The benchmark evaluates:
* Code correctness and efficiency
* Ability to handle diverse programming languages
* Performance across different coding paradigms and tasks
|-
| [https://csbench.github.io/ CS-Bench]
| A specialized benchmark for evaluating AI performance in computer science-related tasks. The key focus areas include:
* Algorithms and data structures
* Computational complexity and optimization
* Theoretical and applied computer science concepts
|-
| [https://github.com/allenai/WildBench WildBench]
| Tests AI models on challenging tasks drawn from real user queries collected in the wild. It emphasizes:
* Handling noisy, open-ended requests rather than templated prompts
* Covering the diverse task mix of real usage
* Grading responses against fine-grained, task-specific checklists
|-
| [https://huggingface.co/datasets/baharef/ToT Test of Time]
| A benchmark that focuses on evaluating AI models' ability to reason about temporal sequences and events over time. It assesses:
* Understanding of temporal logic and sequence prediction
* Ability to make decisions based on time-dependent data
* Performance in tasks requiring long-term planning and foresight
|}
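The SWE-bench protocol above reduces to a short loop: check out the repository at the issue's base commit, ask the agent for a patch, apply it, and run the project's tests. The following is a minimal sketch of that loop; the <code>propose_patch</code> callable and the pytest test command are assumptions for illustration, and the real harness additionally pins per-repository dependencies and the set of tests that must pass.
<syntaxhighlight lang="python">
import subprocess

def run(cmd: list, cwd: str) -> bool:
    """Run a command in the given directory; True if it exits cleanly."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0

def evaluate(repo_dir: str, base_commit: str, issue_text: str, propose_patch) -> bool:
    # 1. Reset the repository to the commit the issue was filed against.
    run(["git", "checkout", "-f", base_commit], cwd=repo_dir)
    # 2. Ask the agent (hypothetical callable) for a unified diff.
    patch = propose_patch(repo_dir, issue_text)
    with open(f"{repo_dir}/agent.patch", "w") as f:
        f.write(patch)
    # 3. A patch that does not apply counts as a failure.
    if not run(["git", "apply", "agent.patch"], cwd=repo_dir):
        return False
    # 4. The issue counts as resolved only if the repository's tests pass
    #    (assumed here to be runnable with pytest).
    return run(["python", "-m", "pytest", "-q"], cwd=repo_dir)
</syntaxhighlight>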
== Software engineering agent systems ==
Several software engineering (SWE) agent systems are under development; examples include:
{| class="wikitable"
|+ List of SWE Agent Systems
! SWE Agent System !! Backend LLM
|-
| [https://salesforce-research-dei-agents.github.io/ Salesforce Research DEIBASE-1] || GPT-4o
|-
| [https://cosine.sh/ Cosine Genie] || Fine-tuned OpenAI GPT
|-
| [https://aide.dev/ CodeStory Aide] || GPT-4o + Claude 3.5 Sonnet
|-
| [https://mentat.ai/blog/mentatbot-sota-coding-agent AbanteAI MentatBot] || GPT-4o
|-
| Salesforce Research DEIBASE-2 || GPT-4o
|-
| Salesforce Research DEI-Open || GPT-4o
|-
| [https://www.marscode.com/ ByteDance MarsCode] || GPT-4o
|-
| [https://arxiv.org/abs/2406.01422 Alibaba Lingma] || gpt-4-1106-preview
|-
| [https://www.factory.ai/ Factory Code Droid] || Anthropic + OpenAI models
|-
| [https://autocoderover.dev/ AutoCodeRover] || GPT-4o
|-
| [https://aws.amazon.com/q/developer/ Amazon Q Developer] || (unknown)
|-
| [https://github.com/NL2Code/CodeR CodeR] || gpt-4-1106-preview
|-
| [https://github.com/masai-dev-agent/masai MASAI] || (unknown)
|-
| [https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240706_sima_gpt4o SIMA] || GPT-4o
|-
| [https://github.com/OpenAutoCoder/Agentless Agentless] || GPT-4o
|-
| [https://github.com/aorwall/moatless-tools Moatless Tools] || Claude 3.5 Sonnet
|-
| [https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240612_IBM_Research_Agent101 IBM Research Agent] || (unknown)
|-
| [https://github.com/paul-gauthier/aider Aider] || GPT-4o + Claude 3 Opus
|-
| [https://docs.all-hands.dev/ OpenDevin + CodeAct] || GPT-4o
|-
| [https://github.com/FSoft-AI4Code/AgileCoder AgileCoder] || (various)
|-
| [https://chatdev.ai/ ChatDev] || (unknown)
|-
| [https://github.com/geekan/MetaGPT MetaGPT] || GPT-4o
|}
== External links ==
* ''Agent-Oriented Software Engineering: Reflections on Architectures, Methodologies, Languages, and Frameworks'' {{ISBN|978-3642544316}}
== References ==
* Michael Winikoff and Lin Padgham. ''Agent Oriented Software Engineering''. Chapter 15 (pages 695–757) in G. Weiss (ed.), [http://mitpress.mit.edu/multiagentsystems Multiagent Systems], 2nd edition. MIT Press. {{ISBN|978-0-262-01889-0}} (a survey of the field)
* Site of the MaCMAS methodology, which applies MAS-PL: https://web.archive.org/web/20100922120209/http://james.eii.us.es/MaCMAS/index.php/Main_Page
* MAS Product Lines site: https://web.archive.org/web/20140518122645/http://mas-productlines.org/
* {{cite journal | last1 = Peña | first1 = Joaquin | last2 = Hinchey | first2 = Michael G. | last3 = Resinas | first3 = Manuel | last4 = Sterritt | first4 = Roy | last5 = Rash | first5 = James L. | title = Designing and Managing Evolving Systems using a MAS-Product-Line Approach | doi = 10.1016/j.scico.2006.10.007 | journal = Journal of Science of Computer Programming | year = 2007 | volume = 66| pages = 71–86| url = https://pure.ulster.ac.uk/en/publications/0a91f377-9421-4585-957b-77060a458644 | doi-access = free }}
* Joaquin Peña, Michael G. Hinchey, Antonio Ruiz-Cortés, and Pablo Trinidad. Building the Core Architecture of a NASA Multiagent System Product Line. In 7th International Workshop on Agent Oriented Software Engineering (AOSE 2006), Hakodate, Japan, May 2006. LNCS. https://doi.org/10.1007/978-3-540-70945-9_13
* Joaquin Peña, Michael G. Hinchey, Manuel Resinas, Roy Sterritt, James L. Rash. Managing the Evolution of an Enterprise Architecture using a MAS-Product-Line Approach. 5th Int. Workshop on System/Software Architectures (IWSSA’06). Nevada, USA. 2006
* Soe-Tsyr Yuan. MAS Building Environments with Product-Line-Architecture Awareness.
* [https://web.archive.org/web/20070517214904/http://www.cs.iastate.edu/~dehlinge/publications.html Josh Dehlinger's publications page]
* [https://web.archive.org/web/20091231195122/http://james.eii.us.es/MaCMAS/images/6/69/Current-Research-MAS-PL-TF4-Lisbon.pdf MAS-PL -- Current research]. In [http://www.irit.fr/ACTIVITES/EQ_SMI/SMAC/TFG4_CFP.html THE FOURTH TECHNICAL FORUM (TF4) of AgentLink]. December 2006.
[[Category:Software project management]]
{{software-eng-stub}}