Multiagent Systems Product Lines (MAS-PL) is a research field devoted to combining the two approaches: applying the SPL philosophy to the construction of a MAS. The aim is to gain the advantages of SPLs while making MAS development more practical.
==Benchmarks==
Several benchmarks have been developed to evaluate the capabilities of AI coding agents and large language models on software engineering and related agentic tasks. Notable benchmarks include:
{|class="wikitable"
|+ Benchmarks for AI agents and large language models
! Benchmark !! Description
|-
| [https://www.swebench.com/ SWE-bench]
| Assesses the ability of AI models to solve real-world software engineering issues sourced from GitHub repositories. The benchmark involves:
* Providing agents with a code repository and issue description
* Challenging them to generate a patch that resolves the described problem
* Evaluating the generated patch against unit tests
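A simplified sketch of this patch-and-test workflow is shown after the table.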
|-
| [https://github.com/snap-stanford/MLAgentBench MLAgentBench]
| Evaluates language agents on machine learning experimentation tasks, such as improving the performance of a baseline model on a given dataset
|-
| [https://github.com/sierra-research/tau-bench τ-Bench]
| Developed by Sierra to evaluate AI agent performance and reliability in realistic settings. It focuses on:
* Testing agents on multi-turn tasks with dynamic user and tool interactions
* Assessing the ability to follow ___domain-specific policies
* Measuring consistency and reliability over repeated runs of the same task
|-
| [https://github.com/web-arena-x/webarena WebArena]
| Evaluates AI agents in a simulated web environment. The benchmark tasks include:
* Navigating complex websites to complete user-driven tasks
* Extracting relevant information from the web
* Testing the adaptability of agents to diverse web-based challenges
|-
| [https://github.com/THUDM/AgentBench AgentBench]
| A benchmark for evaluating large language models acting as agents across a diverse set of interactive environments, including operating systems, databases, knowledge graphs, games, and the web. The key areas of evaluation include:
* Multi-turn reasoning and decision-making
* Instruction following and tool use in interactive settings
* Generalization across heterogeneous environments
|-
| [https://github.com/aryopg/mmlu-redux MMLU-Redux]
| A manually re-annotated subset of the MMLU benchmark in which questions were reviewed to identify and correct errors in the original dataset. It measures:
* Subject-matter knowledge across a broad range of academic disciplines
* The impact of mislabelled or flawed questions on reported MMLU scores
* Consistency in providing accurate answers across topics
|-
| [https://github.com/MCEVAL/McEval McEval]
| A multilingual coding benchmark covering code generation, completion, and explanation tasks. The benchmark evaluates:
* Code correctness and efficiency
* Ability to handle a wide range of programming languages
* Performance across different coding paradigms and tasks
|-
| [https://csbench.github.io/ CS-Bench]
| A specialized benchmark for evaluating AI performance in computer science-related tasks. The key focus areas include:
* Algorithms and data structures
* Computational complexity and optimization
* Theoretical and applied computer science concepts
|-
| [https://github.com/allenai/WildBench WildBench]
| Evaluates language models on challenging tasks drawn from real user queries in human-chatbot conversation logs. It emphasizes:
* Handling diverse, naturally phrased requests rather than curated academic questions
* Automated, checklist-based scoring of model responses
* Agreement with human judgements of overall chatbot quality
|-
| [https://huggingface.co/datasets/baharef/ToT Test of Time]
| A benchmark that evaluates AI models' ability to reason about time and the temporal relationships between events, using synthetically generated problems. It assesses:
* Understanding of temporal logic and the ordering of events
* Temporal arithmetic involving dates and durations
* Robustness of temporal reasoning to variations in problem structure and phrasing
|}
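The patch-and-test workflow used by SWE-bench-style evaluation can be illustrated with a minimal sketch. The example below is not the official SWE-bench harness; the repository path, patch file, and test identifiers are hypothetical placeholders, and it assumes the repository is already checked out at the issue's base commit with git and pytest available.

<syntaxhighlight lang="python">
"""Simplified, illustrative sketch of a SWE-bench-style evaluation step.

Not the official SWE-bench harness: the repository path, patch file,
and test identifiers below are hypothetical placeholders.
"""
import subprocess


def apply_patch(repo_dir: str, patch_file: str) -> bool:
    """Apply a model-generated patch to a checked-out repository."""
    result = subprocess.run(
        ["git", "apply", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0


def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the unit tests associated with the issue; all must pass."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    repo = "/tmp/checked-out-repo"              # repository at the issue's base commit (placeholder)
    patch = "/tmp/model_generated.patch"        # patch produced by the agent under evaluation (placeholder)
    tests = ["tests/test_issue_regression.py"]  # tests expected to pass once the issue is fixed (placeholder)

    resolved = apply_patch(repo, patch) and run_tests(repo, tests)
    print("resolved" if resolved else "not resolved")
</syntaxhighlight>

In the benchmark itself, an instance counts as resolved only when the tests that previously failed for the issue pass after the patch is applied and the repository's other tests continue to pass.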
==External links==
* ''Agent-Oriented Software Engineering: Reflections on Architectures, Methodologies, Languages, and Frameworks'' {{ISBN|978-3642544316}}
== References ==