MatToolBench
A real-environment benchmark for multimodal agents working with professional materials science software, covering GUI operation, code execution, and cross-tool workflows.
Problem
Multimodal agents are usually evaluated on general-purpose software, leaving a major gap in our understanding of how they perform with scientific tools and inside professional workflows.
Key Contributions
- Led the design of a benchmark spanning multiple materials science tools and task modalities.
- Built isolated Windows-based execution environments with standardized reset and evaluation hooks.
- Helped define layered evaluation methods suited to outputs ranging from GUI states to scripts and result files.
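The layered evaluation idea above can be sketched in miniature. This is a hypothetical illustration, not the benchmark's actual API: the `TaskResult` fields, checker names, and sample data are all invented, but they show how a single task result can be scored at the GUI-state layer and the result-file layer independently.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class TaskResult:
    gui_state: Dict[str, object]   # e.g. a parsed snapshot of window/control state
    script_text: str               # any script the agent produced along the way
    output_files: Dict[str, str]   # filename -> contents of generated result files

def eval_gui(result: TaskResult, expected: Dict[str, object]) -> bool:
    # GUI layer: compare selected properties against an expected state snapshot.
    return all(result.gui_state.get(k) == v for k, v in expected.items())

def eval_file(result: TaskResult, name: str, check: Callable[[str], bool]) -> bool:
    # File layer: validate a result file with a task-specific predicate.
    return name in result.output_files and check(result.output_files[name])

# Example run of both layers on one (invented) result.
result = TaskResult(
    gui_state={"active_tab": "Phase Diagram", "unsaved": False},
    script_text="calc.run()",
    output_files={"phases.csv": "T,phase\n300,FCC\n"},
)
print(eval_gui(result, {"active_tab": "Phase Diagram"}))      # True
print(eval_file(result, "phases.csv", lambda s: "FCC" in s))  # True
```

Keeping each layer as a small independent checker is what lets one harness cover outputs as different as GUI states and CSV files.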
Results
- Exposed a large gap between agent performance on general-purpose benchmarks and on professional software tasks.
- Created infrastructure suitable for reproducible large-scale evaluation.
- Made cross-tool agent workflows measurable rather than anecdotal.
Benchmark Scope
MatToolBench covers a mix of GUI-heavy tools, code-oriented tools, and workflows that require moving across both. That mix matters because professional research software rarely fits into a single interaction pattern.
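One way to make that mix concrete is to tag each task with its interaction pattern. The records below are purely illustrative; the task IDs, tool names, and field layout are assumptions, not the benchmark's real schema.

```python
# Hypothetical task records covering the three interaction patterns.
tasks = [
    {"id": "gui-001", "modality": "gui", "tools": ["structure-viewer"],
     "goal": "Load a structure file and enable the symmetry overlay."},
    {"id": "code-001", "modality": "code", "tools": ["simulation-cli"],
     "goal": "Write a script that relaxes the structure and prints the energy."},
    {"id": "cross-001", "modality": "cross-tool", "tools": ["gui-app", "simulation-cli"],
     "goal": "Export data from the GUI, then post-process it with a script."},
]

def modalities(task_list):
    """Return the sorted set of interaction patterns present in a task list."""
    return sorted({t["modality"] for t in task_list})

print(modalities(tasks))  # ['code', 'cross-tool', 'gui']
```

Tagging tasks this way also makes it easy to report per-modality scores, which is where the gap between GUI operation and code execution tends to show up.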
Systems Contribution
The benchmark is not only a task list. It also depends on a stable runner, environment reset logic, and evaluation interfaces that match the outputs of different software systems. I spent much of my effort on that systems layer so the benchmark could scale beyond manually curated demos.
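The reset logic can be sketched as a snapshot/restore wrapper around each task. This is a minimal stand-in for what a real Windows VM snapshot would do; the `Environment` class and its dictionary state are invented for illustration. The key property it shows is that the environment is restored even if the agent crashes mid-task.

```python
import contextlib

class Environment:
    # Hypothetical environment wrapper: snapshot/restore gives every task a
    # clean starting state, so side effects from one run cannot leak into the next.
    def __init__(self):
        self.state = {"files": [], "windows": []}
        self._snapshot = None

    def snapshot(self):
        self._snapshot = {k: list(v) for k, v in self.state.items()}

    def reset(self):
        self.state = {k: list(v) for k, v in self._snapshot.items()}

@contextlib.contextmanager
def isolated(env):
    env.snapshot()
    try:
        yield env
    finally:
        env.reset()  # runs even when the task body raises

env = Environment()
with isolated(env) as e:
    e.state["files"].append("scratch.tmp")  # agent side effect during a task
print(env.state["files"])  # [] -- the reset undid the side effect
```

In the real benchmark the same role is played by OS-level snapshots, but the contract is identical: snapshot before the task, restore after, unconditionally.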
Research Value
The result is a clearer picture of where general multimodal agents break down in specialized domains and what kinds of capabilities future systems need in order to become genuinely useful scientific assistants.