Multiagent Systems Product Lines (MAS-PL) is a research field devoted to combining the two approaches: applying the SPL philosophy to the construction of a MAS. The aim is to gain the advantages of SPLs while making MAS development more practical.
==Benchmarks==
Several benchmarks have been developed to evaluate the capabilities of AI coding agents and large language models on software engineering and related agentic tasks. Notable benchmarks include:
{|class="wikitable"
|+ Benchmarks for AI agents and large language models
! Benchmark !! Description
|-
| [https://www.swebench.com/ SWE-bench]
| Assesses the ability of AI models to solve real-world software engineering issues sourced from GitHub repositories. The benchmark involves:
* Providing agents with a code repository and issue description
* Challenging them to generate a patch that resolves the described problem
* Evaluating the generated patch against unit tests
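A simplified sketch of this patch-and-test workflow is shown after the table.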
|-
| [https://github.com/snap-stanford/MLAgentBench MLAgentBench]
| Evaluates language agents on machine learning experimentation tasks, such as improving the performance of a baseline model on a given dataset
|-
| [https://github.com/sierra-research/tau-bench τ-Bench]
| Developed by Sierra to evaluate AI agent performance and reliability in realistic settings. It focuses on:
* Testing agents on multi-turn tasks with dynamic user and tool interactions
* Assessing the ability to follow ___domain-specific policies
* Measuring consistency and reliability over repeated runs of the same task
|-
| [https://github.com/web-arena-x/webarena WebArena]
| Evaluates AI agents in a simulated web environment. The benchmark tasks include:
* Navigating complex websites to complete user-driven tasks
* Extracting relevant information from the web
* Testing the adaptability of agents to diverse web-based challenges
|-
| [https://github.com/THUDM/AgentBench AgentBench]
| A benchmark for evaluating large language models acting as agents across a diverse set of interactive environments, including operating systems, databases, knowledge graphs, games, and the web. The key areas of evaluation include:
* Multi-turn reasoning and decision-making
* Instruction following and tool use in interactive settings
* Generalization across heterogeneous environments
|-
| [https://github.com/aryopg/mmlu-redux MMLU-Redux]
| A manually re-annotated subset of the MMLU benchmark in which questions were reviewed to identify and correct errors in the original dataset. It measures:
* Subject-matter knowledge across a broad range of academic disciplines
* The impact of mislabelled or flawed questions on reported MMLU scores
* Consistency in providing accurate answers across topics
|-
| [https://github.com/MCEVAL/McEval McEval]
| A multilingual coding benchmark covering code generation, completion, and explanation tasks. The benchmark evaluates:
* Code correctness and efficiency
* Ability to handle a wide range of programming languages
* Performance across different coding paradigms and tasks
|-
| [https://csbench.github.io/ CS-Bench]
| A specialized benchmark for evaluating AI performance in computer science-related tasks. The key focus areas include:
* Algorithms and data structures
* Computational complexity and optimization
* Theoretical and applied computer science concepts
|-
| [https://github.com/allenai/WildBench WildBench]
| Evaluates language models on challenging tasks drawn from real user queries in human-chatbot conversation logs. It emphasizes:
* Handling diverse, naturally phrased requests rather than curated academic questions
* Automated, checklist-based scoring of model responses
* Agreement with human judgements of overall chatbot quality
|-
| [https://huggingface.co/datasets/baharef/ToT Test of Time]
| A benchmark that evaluates AI models' ability to reason about time and the temporal relationships between events, using synthetically generated problems. It assesses:
* Understanding of temporal logic and the ordering of events
* Temporal arithmetic involving dates and durations
* Robustness of temporal reasoning to variations in problem structure and phrasing
|}
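The patch-and-test workflow used by SWE-bench-style evaluation can be illustrated with a minimal sketch. The example below is not the official SWE-bench harness; the repository path, patch file, and test identifiers are hypothetical placeholders, and it assumes the repository is already checked out at the issue's base commit with git and pytest available.

<syntaxhighlight lang="python">
"""Simplified, illustrative sketch of a SWE-bench-style evaluation step.

Not the official SWE-bench harness: the repository path, patch file,
and test identifiers below are hypothetical placeholders.
"""
import subprocess


def apply_patch(repo_dir: str, patch_file: str) -> bool:
    """Apply a model-generated patch to a checked-out repository."""
    result = subprocess.run(
        ["git", "apply", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0


def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the unit tests associated with the issue; all must pass."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    repo = "/tmp/checked-out-repo"              # repository at the issue's base commit (placeholder)
    patch = "/tmp/model_generated.patch"        # patch produced by the agent under evaluation (placeholder)
    tests = ["tests/test_issue_regression.py"]  # tests expected to pass once the issue is fixed (placeholder)

    resolved = apply_patch(repo, patch) and run_tests(repo, tests)
    print("resolved" if resolved else "not resolved")
</syntaxhighlight>

In the benchmark itself, an instance counts as resolved only when the tests that previously failed for the issue pass after the patch is applied and the repository's other tests continue to pass.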
==External links==
* ''Agent-Oriented Software Engineering: Reflections on Architectures, Methodologies, Languages, and Frameworks'' {{ISBN|978-3642544316}}
== References ==