[feat][evaluation]trae eval by tpfz · Pull Request #551 · coze-dev/coze-loop

tpfz · 2026-06-17T11:46:15Z

What type of PR is this?

Check the PR title

This PR title matches the format: [<type>][<scope>] <description>. For example: [fix][backend] flaky fix
The description of this PR title is user-oriented and clear enough for others to understand.
Add documentation if the current PR requires user awareness at the usage level.
This PR is written in English. PRs not in English will not be reviewed.

(Optional) Translate the PR title into Chinese

(Optional) More detailed description for this PR(en: English/zh: Chinese)

en:
zh(optional):

(Optional) Which issue(s) this PR fixes

…for item-centric experiment refactoring
- Add expt_item_ref table (new flat item-binding table with item_config)
- Add item_version_id to expt_item_result/run_log, expt_turn_result/run_log, eval_target_record, evaluator_record
- Add source_type/inline_key/alias/target_record_id to evaluator_record (Builtin/Alias/Inline unification)
- Add source_type/inline_key/alias to expt_turn_evaluator_result_ref; rebuild 7-col unique key
- Add eval_set_source_type to experiment (new/legacy path discriminator)
- Add alias/filter/binding_config/eval_set_id to expt_evaluator_ref (query-only snapshot)
- Update generate.go: add expt_item_ref to experiment tables; add evaluator_version/evaluator_record to evaluator tables
- Regenerate all affected gorm_gen model/query files
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…experiment
- domain/expt.thrift: add ExptEvalSetSourceType enum, EvalSetConfig/ExptTargetConf/ExptEvaluatorConf/ExptFilter/ExptFilterField/ExptEvalSetDetail structs
- domain/expt.thrift: add fields 110-114 to Experiment DTO (eval_set_source_type, eval_set_configs, eval_set_details, evaluators_concur_num, total_item_count)
- coze.loop.evaluation.expt.thrift: add eval_set_configs field 70 to CreateExperimentRequest, field 75 to SubmitExperimentRequest
- regenerate kitex_gen for evaluation domain
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… EvaluatorResults
- expt.go: add ExptEvalSetSourceType enum, ExptItemRef/ExptItemConfig structs, EvalSetConfig/ExptTargetConf/ExptEvaluatorConf/ExptItemFilter domain types; add EvalSetSourceType to Experiment, EvalSetConfigs to EvaluationConfiguration
- expt_result.go: add ItemVersionID to ExptItemResult/ExptItemResultRunLog/ExptTurnResult/ExptTurnResultRunLog; upgrade EvaluatorResults to dual-array format (Registered+Inline) with legacy EvalVerIDToResID backward compat; add SourceType/InlineKey/Alias to ExptTurnEvaluatorResultRef; add Alias/Filter/BindingConfig/EvalSetID to ExptEvaluatorRef
- evaluator_record.go: add ItemVersionID, SourceType, InlineKey, Alias, TargetRecordID fields; add EvaluatorRecordSourceType enum; add EvaluatorRunStatusSkipped constant
- target_record.go: add ItemVersionID field
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…etItemCount entity
- Add IExptItemRefRepo interface with BatchCreate, ListByExptID, GetByExptIDAndItemID, MGetByExptIDAndItemIDs, CountByEvalSetGrouped methods
- Add ExptEvalSetItemCount struct for eval_set grouped item counts
- Update go:generate directive to include IExptItemRefRepo
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ields in converters
- mysql/expt_item_ref.go: IExptItemRefDAO interface + implementation (BatchCreateNX, ListByExptID, GetByExptIDAndItemID, MGetByExptIDAndItemIDs, CountByEvalSetGrouped)
- experiment/expt_item_ref.go: IExptItemRefRepo implementation wiring DAO + converter
- convert/expt_item_ref.go: ExptItemRef DO/PO converter with item_config JSON marshaling
- convert/expt_result.go: propagate ItemVersionID in ExptItemResult/ExptTurnResult converters
- convert/expt_item_result_run_log.go: propagate ItemVersionID
- convert/expt_run_log.go: propagate ItemVersionID in ExptTurnResultRunLog converter
- convert/expt_turn_evaluator_result_ref.go: add SourceType/InlineKey/Alias fields
- convert/expt_evaluator_ref.go: add EvalSetID/Alias/Filter/BindingConfig fields
- evaluator/convertor/evaluator_record.go: add all new fields (ItemVersionID, SourceType, InlineKey, Alias_, TargetRecordID)
- target/convertor/eval_target_record.go: add ItemVersionID
- domain/repo/expt.go: add IExptItemRefRepo interface
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… convertor
- entity/param.go: add EvalSetConfigs to CreateExptParam
- entity/expt.go: update ToEvaluatorRefDO to generate alias/filter/binding_config for MultiSetConfig path; add EvalSetSourceType to Experiment struct
- service/expt_manage_impl.go: detect EvalSetConfigs in CreateExpt; set EvalSetSourceType=MultiSetConfig or SingleSet accordingly; serialize EvalSetConfigs into eval_conf
- infra/convert/expt.go: propagate EvalSetSourceType in DO2PO/PO2DO
- application/convertor/experiment/expt.go: ConvertCreateReq handles eval_set_configs field 70 → domain EvalSetConfigs; ToExptDTO fills fields 110-114
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…atorResults upgrade
ExptStart (Task 6):
- ExptSubmitExec: add exptItemRefRepo field (variadic backward-compat constructor)
- ExptStart: detect EvalSetSourceType==MultiSetConfig → call exptStartMultiSet
- exptStartMultiSet: paginate items per set, build ExptItemRef with item_config, batch-write expt_item_ref then expt_item_result/expt_turn_result
- buildItemConfigFromSetConf: materialize per-set evaluator/target conf into item_config
storeTurnRunResult / RecordItemRunLogs (Task 7):
- storeTurnRunResult: write new EvaluatorResults{Registered:[], Alias:''} format
- NewTurnEvaluatorResultRefs: handle both old map format and new Registered+Inline arrays; set SourceType=Builtin for registered, SourceType=Inline for inline
- terminateZombieEvaluatorRecords: extract record IDs from both old/new format
- RecordItemRunLogs: extract evaluator IDs from both formats for weight calculation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
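As a rough illustration of reading record IDs from both the old map format and the new Registered+Inline arrays, the sketch below decodes either JSON shape into one struct. The struct layout, JSON field names, and the helper `parseRecordIDs` are assumptions made for this example, not the repository's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// EvaluatorResultRef is a hypothetical entry of the new array-based format.
type EvaluatorResultRef struct {
	EvaluatorVersionID int64  `json:"evaluator_version_id"`
	Alias              string `json:"alias,omitempty"`
	InlineKey          string `json:"inline_key,omitempty"`
	RecordID           int64  `json:"record_id"`
}

// EvaluatorResults carries both the legacy map and the new arrays.
// All JSON field names here are assumptions made for this sketch.
type EvaluatorResults struct {
	EvalVerIDToResID map[int64]int64      `json:"eval_ver_id_to_res_id,omitempty"`
	Registered       []EvaluatorResultRef `json:"registered,omitempty"`
	Inline           []EvaluatorResultRef `json:"inline,omitempty"`
}

// RecordIDs returns record IDs from whichever format is populated,
// preferring the new arrays and falling back to the legacy map.
func (r *EvaluatorResults) RecordIDs() []int64 {
	if r == nil {
		return nil
	}
	var ids []int64
	for _, ref := range r.Registered {
		ids = append(ids, ref.RecordID)
	}
	for _, ref := range r.Inline {
		ids = append(ids, ref.RecordID)
	}
	if len(ids) == 0 {
		for _, id := range r.EvalVerIDToResID {
			ids = append(ids, id)
		}
	}
	// Sort so the result is deterministic despite Go's random map order.
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

// parseRecordIDs decodes a stored JSON blob and extracts its record IDs.
func parseRecordIDs(raw string) ([]int64, error) {
	var res EvaluatorResults
	if err := json.Unmarshal([]byte(raw), &res); err != nil {
		return nil, err
	}
	return res.RecordIDs(), nil
}

func main() {
	legacy, _ := parseRecordIDs(`{"eval_ver_id_to_res_id":{"7":100,"8":101}}`)
	current, _ := parseRecordIDs(`{"registered":[{"evaluator_version_id":7,"alias":"strict","record_id":200}],"inline":[{"inline_key":"k1","record_id":201}]}`)
	fmt.Println(legacy, current)
}
```

Keeping the legacy map as an optional field lets rows written before the upgrade decode without a migration, at the cost of every reader having to check both shapes.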

- expt_result_impl_test.go: TestNewTurnEvaluatorResultRefs_NewFormat covers nil/old-map/Registered/Inline/mixed cases with SourceType assertion
- expt_result_eval_test.go: TestEvaluatorResultsSerializeCompat covers new-format serialize, old-format JSON backward-compat, new-format deserialize
- expt_convertor_test.go: TestConvertCreateReq_OldPath/EvalSetConfigs, TestConvertEvalSetConfigsDTOToDO (application convertor new path)
- expt_run_scheduler_mode_multiset_test.go: TestBuildItemConfigFromSetConf covers empty/single-evaluator/with-target-conf cases
All 11 new test cases PASS
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…s collision
Replace sync.Map(int64 key) with mutex+slice (evalRecordCollector) in callEvaluators/asyncCallEvaluator so alias multi-instances of the same evaluatorVersionID are stored as separate records instead of overwriting each other. ExptTurnRunResult.EvaluatorResults changes from map[int64]*EvaluatorRecord to []*EvaluatorRecord; GetEvaluatorRecord uses linear search by versionID for backward compatibility.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
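A minimal sketch of the mutex+slice idea described here, assuming a trimmed-down `EvaluatorRecord` and illustrative method names (`Add`, `Records`, `collect`) that are not taken from the repository. It shows why a slice keeps two alias instances of the same version ID apart where a map keyed by version ID would keep only one.

```go
package main

import (
	"fmt"
	"sync"
)

// EvaluatorRecord is a hypothetical, trimmed-down stand-in for the real entity.
type EvaluatorRecord struct {
	EvaluatorVersionID int64
	Alias              string
	Score              float64
}

// evalRecordCollector gathers records from concurrent evaluator calls.
// Appending to a slice under a mutex never overwrites an earlier record,
// unlike a map keyed only by EvaluatorVersionID.
type evalRecordCollector struct {
	mu      sync.Mutex
	records []*EvaluatorRecord
}

func (c *evalRecordCollector) Add(r *EvaluatorRecord) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.records = append(c.records, r)
}

// Records returns a copy so callers cannot mutate the collector's slice.
func (c *evalRecordCollector) Records() []*EvaluatorRecord {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]*EvaluatorRecord, len(c.records))
	copy(out, c.records)
	return out
}

// getEvaluatorRecord returns the first record with the given version ID,
// mirroring the linear-search fallback mentioned in the commit message.
func getEvaluatorRecord(records []*EvaluatorRecord, versionID int64) *EvaluatorRecord {
	for _, r := range records {
		if r != nil && r.EvaluatorVersionID == versionID {
			return r
		}
	}
	return nil
}

// collect simulates concurrent evaluator calls that each report one record.
func collect(inputs []EvaluatorRecord) []*EvaluatorRecord {
	c := &evalRecordCollector{}
	var wg sync.WaitGroup
	for i := range inputs {
		wg.Add(1)
		go func(r EvaluatorRecord) {
			defer wg.Done()
			c.Add(&r)
		}(inputs[i])
	}
	wg.Wait()
	return c.Records()
}

func main() {
	recs := collect([]EvaluatorRecord{
		{EvaluatorVersionID: 7, Alias: "strict", Score: 0.4},
		{EvaluatorVersionID: 7, Alias: "lenient", Score: 0.9},
	})
	fmt.Println("records kept:", len(recs))
}
```

The order of the collected records depends on goroutine scheduling, so a caller that needs a specific alias instance has to look it up by (version ID, alias) instead of by position.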

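For the new experiment type (MultiSetConfig), aggregation skips the EvaluatorScore/WeightedScore dimensions (with multiple eval sets and multiple alias instances, a single experiment-level average score is not semantically meaningful) and keeps the Target latency/tokens and Annotation dimensions. Also adds a fallback filter for Skipped records and a compatibility parser for field_key.
- domain/entity/aggr_field_key.go: add the ParseEvaluatorScoreFieldKey helper, which accepts both a plain number (current DB rows) and verID:alias (future extension)
- expt_result_aggr_impl.go:
  * CreateExptAggrResult branches on EvalSetSourceType at entry; computeEvaluatorAggrGroup is extracted into a standalone method
  * createWeightedScoreAggrResult guards MultiSetConfig at entry and returns nil
  * UpdateExptAggrResult guards MultiSetConfig at entry and returns early
  * both recordMap builders filter out Status=Skipped
  * 3 ParseInt(FieldKey) call sites switch to ParseEvaluatorScoreFieldKey
- filter.go: ConvertFilter switches from ParseInt to the compatibility parser
- 11 new unit tests (7 for the helper, 4 for the branching); 0 regressions in existing aggregation tests
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>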

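On the new item-centric path (MultiSetConfig), ConvertCreateReq only sets EvalSetConfigs and does not build EvalConf.ConnectorConf, while CheckRun->CheckConnector->checkEvaluatorsConnector still validates field mappings synchronously against the old connector structure, so SubmitExperiment fails with 'invalid evaluator connector' (cause: nil EvaluatorConf). Option A: add buildExptConfFromEvalSetConfigs, which flattens the field mappings in eval_set_configs[].evaluator_confs/target_confs into the old connectors (EvaluatorsConf + TargetConf), deduplicated by (evaluator_version_id, alias), so that both sides derive from the same input and are consistent by construction. EvalSetConfigs remains the source of truth for persistence and scheduling.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>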

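When the new experiment type (MultiSetConfig) runs evaluators, it applies the row-level filter from EvaluatorConfs in expt_item_ref.item_config; if the filter does not match, the evaluator is skipped (the evaluator service is not actually called). Old experiment types keep using the original experiment-level EvaluatorsConf path with zero behavior change.
- domain/entity/expt_run.go:
  * ExptItemEvalCtx gains an ItemConfig field
  * ExptTurnRunResult gains a GetEvaluatorRecordByVerAlias(versionID, alias) method, reserved for locating alias instances independently later (the old GetEvaluatorRecord is kept)
- domain/service/expt_run_item_event_impl.go:
  * ExptItemEventEvalServiceImpl gains an exptItemRefRepo field; the NewExptRecordEvalService constructor gains an IExptItemRefRepo parameter
  * BuildExptRecordEvalCtx branches on EvalSetSourceType: MultiSetConfig calls GetByExptIDAndItemID to fetch item_config and inject it into ctx; if it cannot be read, it degrades to nil (old path)
- domain/service/expt_item_filter_match.go (new):
  * ShouldRunByFilter decides from FilterMode (None/Include/Exclude) plus the matcher
  * MatchExptItemFilter supports QueryAndOr (AND/OR) and multiple FilterFields
  * matchByQueryType supports equal/in/not_equal/not_in/contains/not_contains; an unrecognized QueryType passes by default (does not block execution)
  * shouldRunEvaluatorByItemConfig bridge: finds the first conf for the versionID and evaluates its filter
- domain/service/expt_run_item_turn_impl.go:
  * CallEvaluators adds an ItemConfig guard at entry: filter miss → continue, skipping the evaluator without adding it to pendingEvaluatorVersionIDs
- wire: experiment/wire.go adds NewExptItemRefRepo, mysql/wire.go adds NewExptItemRefDAO, and wire_gen.go is regenerated automatically (including the iExptItemRefRepo variable)
- 18 new unit tests (12 filter matcher + 6 bridge + 4 dual-key lookup); 0 regressions in existing tests
tech debt (follow-up PR):
- independent execution of alias instances plus persistence of Skipped placeholder records: depends on extending the evaluatorService.RunEvaluator API with Alias/SourceType fields; the current MVP only avoids calling the evaluator (filter behavior is correct), but the GUI cannot show a Skipped marker
- with multiple alias instances of the same versionID, only the first conf is evaluated; to be switched once alias support actually lands
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>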
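The filter decision described in this commit could look roughly like the sketch below. The function names follow the commit message, but the types, field names, the string-valued query types, and the exact Include/Exclude semantics (Include runs only on a match, Exclude skips on a match) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// FilterMode mirrors the None/Include/Exclude modes named in the commit.
type FilterMode int

const (
	FilterModeNone FilterMode = iota
	FilterModeInclude
	FilterModeExclude
)

// FilterField is a hypothetical single condition on one item field.
type FilterField struct {
	Key       string
	QueryType string
	Values    []string
}

// ItemFilter is a hypothetical filter; UseOr=false combines fields with AND.
type ItemFilter struct {
	Mode   FilterMode
	UseOr  bool
	Fields []FilterField
}

// matchByQueryType evaluates one condition; unknown query types pass,
// so an unrecognized operator never blocks execution.
func matchByQueryType(queryType, actual string, values []string) bool {
	anyEqual, anyContains := false, false
	for _, v := range values {
		anyEqual = anyEqual || actual == v
		anyContains = anyContains || strings.Contains(actual, v)
	}
	switch queryType {
	case "equal", "in":
		return anyEqual
	case "not_equal", "not_in":
		return !anyEqual
	case "contains":
		return anyContains
	case "not_contains":
		return !anyContains
	default:
		return true
	}
}

// MatchExptItemFilter combines all field conditions with AND or OR.
func MatchExptItemFilter(f ItemFilter, item map[string]string) bool {
	if len(f.Fields) == 0 {
		return true
	}
	for _, field := range f.Fields {
		ok := matchByQueryType(field.QueryType, item[field.Key], field.Values)
		if f.UseOr && ok {
			return true
		}
		if !f.UseOr && !ok {
			return false
		}
	}
	// AND: every field matched. OR: no field matched.
	return !f.UseOr
}

// ShouldRunByFilter turns the match result into a run/skip decision.
func ShouldRunByFilter(f *ItemFilter, item map[string]string) bool {
	if f == nil || f.Mode == FilterModeNone {
		return true
	}
	matched := MatchExptItemFilter(*f, item)
	if f.Mode == FilterModeExclude {
		return !matched
	}
	return matched
}

func main() {
	include := &ItemFilter{
		Mode:   FilterModeInclude,
		Fields: []FilterField{{Key: "tier", QueryType: "in", Values: []string{"gold"}}},
	}
	fmt.Println(ShouldRunByFilter(include, map[string]string{"tier": "gold"}))
	fmt.Println(ShouldRunByFilter(include, map[string]string{"tier": "silver"}))
}
```

Letting unknown query types pass favors running an evaluator too often over silently dropping it, which matches the "does not block execution" note in the commit.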

PR2 injected IExptItemRefRepo only in BuildExptRecordEvalCtx (the record eval path), not in the scheduler path's DefaultSchedulerModeFactory. When the first scheduling of a new experiment type (MultiSetConfig) triggers ExptSubmitExec.exptStartMultiSet, e.exptItemRefRepo is nil and it fails immediately with "exptItemRefRepo is nil, cannot run multi-set ExptStart". This was reproduced conclusively in the BOE/PPE MQ consumer; the error stack includes commit ff64c57 (from before PR2 plus a colleague's commit, neither of which contains this fix). Fix:
- DefaultSchedulerModeFactory struct gains an exptItemRefRepo field; NewSchedulerModeFactory gains a parameter, and the factory's NewSchedulerMode passes f.exptItemRefRepo when calling NewExptSubmitMode / NewExptTrialRunMode
- NewExptTrialRunMode gains a variadic IExptItemRefRepo that is passed through to the internal NewExptSubmitMode (matching the latter's variadic form)
- wire regenerates wire_gen.go, so the NewSchedulerModeFactory call site passes iExptItemRefRepo automatically
- the TestNewSchedulerModeFactory test adds a MockIExptItemRefRepo argument
The FailRetry/Append/RetryAll/RetryItems structs have no exptItemRefRepo field and are unaffected; this change fixes only the Submit + TrialRun paths.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
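The failure mode described here comes from an optional, variadic constructor argument that one call path forgot to supply. The sketch below reproduces that shape under simplified assumptions: the interface method, the `memRepo` type, and the constructor bodies are hypothetical, and only the constructor names and the error text come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// IExptItemRefRepo is reduced to one hypothetical method for this sketch.
type IExptItemRefRepo interface {
	ListByExptID(exptID int64) ([]int64, error)
}

// memRepo is a hypothetical in-memory implementation used only here.
type memRepo struct{}

func (memRepo) ListByExptID(int64) ([]int64, error) { return nil, nil }

type ExptSubmitExec struct {
	exptItemRefRepo IExptItemRefRepo
}

// NewExptSubmitMode takes the repo as a variadic argument so existing
// call sites that pass nothing keep compiling.
func NewExptSubmitMode(repos ...IExptItemRefRepo) *ExptSubmitExec {
	e := &ExptSubmitExec{}
	if len(repos) > 0 {
		e.exptItemRefRepo = repos[0]
	}
	return e
}

// NewExptTrialRunMode forwards the variadic repo to NewExptSubmitMode.
func NewExptTrialRunMode(repos ...IExptItemRefRepo) *ExptSubmitExec {
	return NewExptSubmitMode(repos...)
}

// exptStartMultiSet fails fast when the dependency was never injected.
func (e *ExptSubmitExec) exptStartMultiSet() error {
	if e.exptItemRefRepo == nil {
		return errors.New("exptItemRefRepo is nil, cannot run multi-set ExptStart")
	}
	_, err := e.exptItemRefRepo.ListByExptID(1)
	return err
}

func main() {
	fmt.Println(NewExptSubmitMode().exptStartMultiSet())
	fmt.Println(NewExptTrialRunMode(memRepo{}).exptStartMultiSet())
}
```

The variadic form keeps old callers compiling, but it also means the compiler cannot flag a call site that should have passed the repo, which is how the scheduler path was missed.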

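… BAM parse
domain/expt.thrift includes two filter.thrift files with the same name, one from data and one from observability; the identical basename causes a Thrift alias conflict. thriftgo/kitex disambiguates automatically (filter0), but the BAM parser wrongly binds filter.Filter to the observability namespace (which has no Filter struct) → 'struct not defined' → blocks BAM idl update / AGW release. Rename the data-side filter.thrift → data_filter.thrift and have referrers use data_filter.Filter, removing the name ambiguity. The namespace is unchanged → the Go package path is unchanged → zero changes to backend business code.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>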

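When ItemConfig is non-nil, the CallEvaluators entry branches to callEvaluatorsByItemConfig:
- the main loop iterates over ItemConfig.EvaluatorConfs, so the same versionID with different aliases runs twice
- existResult is reused by the (versionID, alias) composite key
- filter miss → write a Skipped placeholder record (with alias)
- hit → call RunEvaluator/AsyncRunEvaluator, passing alias + SourceType through
On the write side, evaluator_impl.RunEvaluator/AsyncRunEvaluator persist request.Alias/SourceType into EvaluatorRecord, with a fallback of SourceType=Unknown -> Builtin.
Clean up the dead helper shouldRunEvaluatorByItemConfig and its unit tests left over from PR2; the function looked up the first conf by versionID, which is semantically wrong under multiple alias instances. callEvaluatorsByItemConfig now calls ShouldRunByFilter directly per conf and no longer needs it.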

commit 2ced00e mistakenly wrote submitReq := experiment.OpenAPITemplateToSubmitExperimentRequest(...) as expeinternal / cmd / evaluation / experiment.goriment.OpenAPITemplateToSubmitExperimentRequest(...), which broke compilation of the entire application package; restore the original call.

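In Get/MGet/List experiment responses, eval_set_details[].item_count has always been empty until now. This PR wires up the read path:
- entity.Experiment gains an EvalSetDetails field
- entity adds ExptEvalSetDetail (isomorphic to ExptEvalSetDetail in the IDL domain/expt.thrift)
- ExptMangerImpl.NewExptManager gains an IExptItemRefRepo parameter and struct field
- packExperimentResult calls fillEvalSetDetails at the end (entered only for the new experiment type; old experiments skip it)
- fillEvalSetDetails: builds the skeleton from EvalConf.EvalSetConfigs, keeps IsPrimary consistent with EvalSetID, takes ItemCount from IExptItemRefRepo.CountByEvalSetGrouped, and reports 0 before the first run
- the DTO converter maps entity.EvalSetDetails to thrift Experiment.EvalSetDetails
- wire_gen is regenerated
5 subtests PASS: nil repo is skipped / only MultiSetConfig triggers the repo / IsPrimary consistent with EvalSetID / missing ItemCount is filled with 0 / repo errors propagate / missing EvalConf is skipped / no repo call without MultiSetConfig. go test ./... across the whole evaluation module shows 0 regressions.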
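A small sketch of the skeleton-then-counts step in fillEvalSetDetails, under stated assumptions: the struct shapes are simplified, the grouped counts are passed in as a plain map instead of coming from CountByEvalSetGrouped, and IsPrimary is interpreted here as "this set's ID equals the experiment's own eval set ID", which the commit message does not spell out.

```go
package main

import "fmt"

// EvalSetConfig and ExptEvalSetDetail are simplified, hypothetical shapes.
type EvalSetConfig struct {
	EvalSetID int64
}

type ExptEvalSetDetail struct {
	EvalSetID int64
	IsPrimary bool
	ItemCount int64
}

// fillEvalSetDetails builds one detail per configured eval set.
// counts stands in for the grouped item counts read from expt_item_ref;
// a set with no rows yet (for example before the first run) reports 0.
func fillEvalSetDetails(primaryEvalSetID int64, configs []EvalSetConfig, counts map[int64]int64) []ExptEvalSetDetail {
	details := make([]ExptEvalSetDetail, 0, len(configs))
	for _, c := range configs {
		details = append(details, ExptEvalSetDetail{
			EvalSetID: c.EvalSetID,
			// Assumption: the primary set is the one matching the experiment's EvalSetID.
			IsPrimary: c.EvalSetID == primaryEvalSetID,
			// A missing map key yields the zero value, so absent counts become 0.
			ItemCount: counts[c.EvalSetID],
		})
	}
	return details
}

func main() {
	configs := []EvalSetConfig{{EvalSetID: 10}, {EvalSetID: 20}}
	details := fillEvalSetDetails(10, configs, map[int64]int64{10: 35})
	for _, d := range details {
		fmt.Printf("%+v\n", d)
	}
}
```

Building the skeleton from the configuration first means every configured set appears in the response even when the count query returns no row for it.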

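The new experiment type MultiSetConfig supports multiple aliases for the same evaluator version. This change addresses the problem where maps indexed by evaluator_version_id collide on keys under multiple alias instances, covering the three legacy consumer paths: aggregation, the acceleration data warehouse (CK), and export. Backward compatible: for old experiments (empty alias) the encoding degrades to the bare version_id, so the whole path is byte-for-byte unchanged.
- Add entity.EncodeEvaluatorInstanceKey: alias="" degrades to the bare verID, otherwise verID:alias; it is the inverse of ParseEvaluatorScoreFieldKey
- Aggregation: remove the 3 MultiSetConfig skip branches and restore the weighted average; computeEvaluatorAggrGroup buckets by (version, alias); field_key is encoded as verID:alias
- Weighted calculation: the keys of both maps in CalculateWeightedScore change from int64 to string; add a type-aware buildScoreWeights, which for MultiSetConfig takes weights, with alias, from EvalSetConfigs[].EvaluatorConfs
- CK wide table: add an item_version_id column (String DEFAULT '0', including two init-sql files); key_mapping FromField/lookup is upgraded to verID:alias; the key_mapping in CreateExpt + ManualUpsert is expanded per (version, alias) instance (globally increasing ToKey + deduplication)
- Export: the ExportExptResult_ entry rejects MultiSetConfig based on EvalSetSourceType; DoExportCSV and insight analysis are untouched
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>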
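The encode/parse pair described here can be sketched as follows. The function names come from the commit messages, but the signatures and error handling are assumptions; the real helpers may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// EncodeEvaluatorInstanceKey returns the bare version ID when alias is empty,
// otherwise "verID:alias". The signature is assumed for this sketch.
func EncodeEvaluatorInstanceKey(versionID int64, alias string) string {
	id := strconv.FormatInt(versionID, 10)
	if alias == "" {
		return id
	}
	return id + ":" + alias
}

// ParseEvaluatorScoreFieldKey accepts both a plain number and "verID:alias".
// It splits on the first colon, so an alias may itself contain colons.
func ParseEvaluatorScoreFieldKey(key string) (int64, string, error) {
	idPart, alias, _ := strings.Cut(key, ":")
	versionID, err := strconv.ParseInt(idPart, 10, 64)
	if err != nil {
		return 0, "", fmt.Errorf("invalid evaluator field key %q: %w", key, err)
	}
	return versionID, alias, nil
}

func main() {
	for _, alias := range []string{"", "strict"} {
		key := EncodeEvaluatorInstanceKey(42, alias)
		id, parsedAlias, err := ParseEvaluatorScoreFieldKey(key)
		fmt.Printf("key=%q id=%d alias=%q err=%v\n", key, id, parsedAlias, err)
	}
}
```

Because an empty alias encodes to the same string the old code wrote, rows stored before this change parse without any migration.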

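…s filtering) Two gaps:
1. The single-experiment Get in GetExperimentsOApi goes through OpenAPIExptDO2DTO, which previously did not backfill the item-centric fields (eval_set_source_type/eval_set_details/evaluators_concur_num/total_item_count); the backfill in §5 was added only in DomainExperimentDTO2OpenAPI (the List/Create response path). This change adds the backfill to OpenAPIExptDO2DTO; total_item_count is returned only for MultiSetConfig (aligned with ToExptDTO).
2. eval_set_source_types in ListExperimentsOApi was missing at two layers: the OpenAPI ExperimentFilterOption lacked the field, and OpenAPIExperimentFilterOptionDTO2Domain did not pass it through. This change adds eval_set_source_types to the OpenAPI IDL (field 2, at the same level as fuzzy_name), regenerates kitex_gen, converts the string to the int enum and passes it through, and fixes the case where passing only source_types was judged empty and returned nil.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>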

codecov · 2026-06-17T12:16:54Z

Codecov Report

❌ Patch coverage is 3.89294% with 395 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...modules/evaluation/application/eval_openapi_app.go	0.00%	116 Missing ⚠️
...uation/application/convertor/experiment/openapi.go	0.00%	106 Missing and 2 partials ⚠️
...domain/service/target_source_sandbox_agent_impl.go	0.00%	58 Missing ⚠️
...uation/application/convertor/target/eval_target.go	0.00%	51 Missing and 1 partial ⚠️
...d/modules/evaluation/domain/service/target_impl.go	34.28%	23 Missing ⚠️
...uation/infra/rpc/agent_studio/sandbox_scheduler.go	0.00%	12 Missing ⚠️
...d/modules/evaluation/application/experiment_app.go	18.18%	8 Missing and 1 partial ⚠️
...api/handler/coze/loop/apis/eval_open_apiservice.go	0.00%	4 Missing ⚠️
backend/modules/evaluation/domain/entity/param.go	0.00%	3 Missing ⚠️
...valuation/application/convertor/experiment/expt.go	0.00%	1 Missing and 1 partial ⚠️
... and 4 more

❌ Your patch check has failed because the patch coverage (3.89%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff             @@
##             main     #551      +/-   ##
==========================================
- Coverage   77.65%   77.41%   -0.25%     
==========================================
  Files         670      672       +2     
  Lines       76101    76488     +387     
==========================================
+ Hits        59098    59210     +112     
- Misses      13545    13808     +263     
- Partials     3458     3470      +12

Flag	Coverage Δ
unittests	`77.41% <3.89%> (-0.25%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...aluation/domain/service/expt_run_item_turn_impl.go	`86.46% <100.00%> (ø)`
...valuation/application/convertor/experiment/expt.go	`83.61% <0.00%> (-0.36%)`	⬇️
.../application/convertor/experiment/expt_template.go	`88.38% <0.00%> (-0.21%)`	⬇️
backend/modules/evaluation/domain/entity/expt.go	`95.87% <0.00%> (-2.02%)`	⬇️
backend/modules/evaluation/domain/entity/target.go	`97.93% <33.33%> (-2.07%)`	⬇️
...ules/evaluation/domain/service/expt_manage_impl.go	`76.32% <0.00%> (-0.25%)`	⬇️
backend/modules/evaluation/domain/entity/param.go	`80.70% <0.00%> (-4.49%)`	⬇️
...api/handler/coze/loop/apis/eval_open_apiservice.go	`0.00% <0.00%> (ø)`
...d/modules/evaluation/application/experiment_app.go	`83.85% <18.18%> (-0.50%)`	⬇️
...uation/infra/rpc/agent_studio/sandbox_scheduler.go	`0.00% <0.00%> (ø)`
... and 5 more

... and 9 files with indirect coverage changes

Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df9d61c...7276ae7. Read the comment docs.


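The old unique key covered only (space_id, expt_id, expt_turn_result_id, evaluator_version_id), so a second row with the same version but a different alias was overwritten by ON DUPLICATE KEY UPDATE. Change it to uniq_expt_turn_evaluator_result (expt_id, evaluator_version_id, expt_turn_result_id, inline_key, alias):
- drop space_id (expt_id is a snowflake ID and already globally unique)
- drop source_type (inline_key/alias are mutually exclusive, so the pair alone distinguishes Builtin/Inline instances)
- no merged column and no backfill (this table has 30 million rows, and NDB 8.0 does not support CHAR())
The helm init-sql also adds the three missing columns source_type/inline_key/alias. Existing online tables need a migration ALTER (CREATE IF NOT EXISTS has no effect on old tables).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>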

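The name now includes the disambiguating columns inline_key/alias, making it obvious at a glance that this is the alias multi-instance key. The gorm model and 2 init-sql files are updated in sync, matching the index name used by the online ALTER in BOE.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>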

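GetExperimentsOApi previously did not return eval_set_configs, so OpenAPI clients could not get the per-set item_filter or evaluator/target configuration. This change branches by experiment type in OpenAPIExptDO2DTO: MultiSetConfig backfills eval_set_configs, while old SingleSet experiments do not return it. The version_id→version string is looked up from data already loaded on the read object (zero extra RPCs).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>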

The gen-commercial package references ExptNotificationConf, WebhookNotificationConf, and FeishuNotificationConf types as aliases from the domain/expt package. These types were added on main but not yet on feat/eval0630. Adding minimal stubs to unblock the commercial build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a new EvalTargetType.SandboxAgent (=17) for evaluating agents launched via CLI inside a sandbox container. Wired through IDL, generated kitex code, DO entities, DTO/DO converters (internal + openapi), MySQL convertor, and turn-execution dispatch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…cord Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Route SandboxAgent eval target through the async execution path so that external sandbox runs can report results via ReportEvalTargetInvokeResult: - AsyncCallTarget() returns true for SandboxAgent targets - Register a SandboxAgent ISourceEvalTargetOperateService that allocates an invoke id placeholder in AsyncExecute; actual execution is performed outside and reported back through the existing async report endpoint Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Drop the redundant /evaluation segment from /v1/loop/evaluation/eval_targets/async_debug to match the sibling ReportEvalTargetInvokeResult route at /v1/loop/eval_targets/result. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add a sandbox_agent field on AsyncDebugEvalTargetOApiRequest and a SandboxAgent case in the openapi async-debug switch so that debugging a SandboxAgent target no longer fails with "unsupported eval target type: sandbox_agent". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

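Add the sandbox scheduling RPC adapter interface and its supporting DOs:
- domain/component/rpc/sandbox_scheduler.go: the ISandboxSchedulerAdapter interface (Init/Run/Get/GetTaskInfo/Destroy) and DOs/enums such as Sandbox{Execute,Task}{Info,Status...}
- infra/rpc/agent_studio/: a placeholder implementation inside the open-source backend (all 5 methods return not implement); the commercial repository overrides it with the logic that actually calls stone.cozeloop.agent_studio
- SandboxAgentSourceEvalTargetServiceImpl is injected with ISandboxSchedulerAdapter, and the NewSourceTargetOperators parameters are extended accordingly
- TargetDomainServiceSet brings in agent_studio.AgentStudioRPCSet and the interface wire.Bind; wire_gen.go is regenerated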

…heduler IDL DestroyType enum now matches new IDL (Task/Execute); SandboxRunRequest carries optional Image field.

image is now passed via the Param map (key=image); sandbox agent already serializes all input fields into Param, so the adapter does not need a dedicated Image field on SandboxRunRequest.

…d per-row completion Init a sandbox task via SandboxSchedulerAdapter in SubmitExperiment for SandboxAgent eval targets (TaskID=expt.ID, Concurrency=ItemConcurNum). When each row finishes (ReportInvokeRecords), best-effort destroy that execute via SandboxSchedulerAdapter.Destroy so sandbox resources are released; failures only warn and never block the report. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Wire SandboxAgent eval-target type through the submit-experiment OpenAPI: add sandbox_agent to SubmitExperimentEvalTargetParam and internal CreateEvalTargetParam IDLs, propagate it through OpenAPI/RPC convertors and entity.Opt (WithSandboxAgent), and let the SandboxAgent operator BuildBySource construct an EvalTarget from the request payload so the experiment can be created in one OpenAPI call. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Init sandbox task with TaskID "sandbox_debug" (concurrency=10) before AsyncDebugEvalTargetOApi dispatches a SandboxAgent debug run, and destroy the execute via defer in ReportEvalTargetInvokeResult_ regardless of success or failure when the report comes from a debug context (actx.Event == nil). Also fixes a nil deref on actx.Event in the log line for debug reports.

CLAassistant · 2026-06-24T13:06:23Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
4 out of 5 committers have signed the CLA.

✅ xueyizheng
✅ VinCinx
✅ tpfz
✅ alanyf
❌ NoahKex

xueyizheng and others added 30 commits June 13, 2026 22:20

fix

fd52d96

feat(eval): change record for the AI create-experiment API

e71112e

fix

4ca8dba

feat(eval): experiment create open api

08c67ab

feat(eval): the new experiment mode requires eval_set_source_type to be passed

2ced00e

feat(eval): list&get experiment

db5b73a

feat(eval): experiment list filter

4167db0

feat(eval): experiment list filter

cde491c

feat(eval): wire up dynamic-parameter support for evaluator runs

0524346

feat(eval): experiment list filter

5d4d0d4

feat(eval): experiment create api filter

24b2792

feat(eval): attempt to fix missing data and results for the second eval set when running with multiple eval sets

f1478cd

tpfz force-pushed the feat/wzq/trae_eval branch from 233352c to fcd209c on June 22, 2026 07:57

xueyizheng and others added 5 commits June 23, 2026 15:10

feat(cli): add the ability for evaluators to filter data

2c64774

tpfz force-pushed the feat/wzq/trae_eval branch from f53db8b to 7014835 on June 24, 2026 12:48

tpfz and others added 13 commits June 24, 2026 20:55

[feat][evaluation] add openapi for eval_target async_debug and get_re…

c115035

…cord Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

[feat][evaluation] add CustomFieldSchemas to SandboxAgent entity

fa0fff5

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chore(evaluation): align ISandboxSchedulerAdapter with new sandbox_sc…

4397abb

…heduler IDL DestroyType enum now matches new IDL (Task/Execute); SandboxRunRequest carries optional Image field.

chore(evaluation): drop Image from SandboxRunRequest

07c05e4

image is now passed via the Param map (key=image); sandbox agent already serializes all input fields into Param, so the adapter does not need a dedicated Image field on SandboxRunRequest.

[feat][evaluation] bump SandboxAgent debug concurrency to 50

394e837

tpfz force-pushed the feat/wzq/trae_eval branch from 7014835 to 394e837 on June 24, 2026 13:06

[feat][evaluation]trae eval #551
tpfz wants to merge 49 commits into main from feat/wzq/trae_eval

tpfz commented Jun 17, 2026 • edited by Tiny1028

6 participants
