[feat][evaluation]trae eval#551

Open
tpfz wants to merge 49 commits into main from feat/wzq/trae_eval

Conversation

@tpfz tpfz commented Jun 17, 2026

Collaborator

What type of PR is this?

Check the PR title

  • This PR title matches the format: [<type>][<scope>] <description>. For example: [fix][backend] flaky fix
  • The description of this PR title is user-oriented and clear enough for others to understand.
  • Add documentation if the current PR requires user awareness at the usage level.
  • This PR is written in English. PRs not in English will not be reviewed.

(Optional) Translate the PR title into Chinese

(Optional) More detailed description for this PR(en: English/zh: Chinese)

en:
zh(optional):

(Optional) Which issue(s) this PR fixes

xueyizheng and others added 30 commits June 13, 2026 22:20
…for item-centric experiment refactoring

- Add expt_item_ref table (new flat item-binding table with item_config)
- Add item_version_id to expt_item_result/run_log, expt_turn_result/run_log, eval_target_record, evaluator_record
- Add source_type/inline_key/alias/target_record_id to evaluator_record (Builtin/Alias/Inline unification)
- Add source_type/inline_key/alias to expt_turn_evaluator_result_ref; rebuild 7-col unique key
- Add eval_set_source_type to experiment (new/legacy path discriminator)
- Add alias/filter/binding_config/eval_set_id to expt_evaluator_ref (query-only snapshot)
- Update generate.go: add expt_item_ref to experiment tables; add evaluator_version/evaluator_record to evaluator tables
- Regenerate all affected gorm_gen model/query files

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…experiment

- domain/expt.thrift: add ExptEvalSetSourceType enum, EvalSetConfig/ExptTargetConf/
  ExptEvaluatorConf/ExptFilter/ExptFilterField/ExptEvalSetDetail structs
- domain/expt.thrift: add fields 110-114 to Experiment DTO (eval_set_source_type,
  eval_set_configs, eval_set_details, evaluators_concur_num, total_item_count)
- coze.loop.evaluation.expt.thrift: add eval_set_configs field 70 to
  CreateExperimentRequest, field 75 to SubmitExperimentRequest
- regenerate kitex_gen for evaluation domain

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… EvaluatorResults

- expt.go: add ExptEvalSetSourceType enum, ExptItemRef/ExptItemConfig structs,
  EvalSetConfig/ExptTargetConf/ExptEvaluatorConf/ExptItemFilter domain types;
  add EvalSetSourceType to Experiment, EvalSetConfigs to EvaluationConfiguration
- expt_result.go: add ItemVersionID to ExptItemResult/ExptItemResultRunLog/
  ExptTurnResult/ExptTurnResultRunLog; upgrade EvaluatorResults to dual-array
  format (Registered+Inline) with legacy EvalVerIDToResID backward compat;
  add SourceType/InlineKey/Alias to ExptTurnEvaluatorResultRef;
  add Alias/Filter/BindingConfig/EvalSetID to ExptEvaluatorRef
- evaluator_record.go: add ItemVersionID, SourceType, InlineKey, Alias,
  TargetRecordID fields; add EvaluatorRecordSourceType enum;
  add EvaluatorRunStatusSkipped constant
- target_record.go: add ItemVersionID field

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…etItemCount entity

- Add IExptItemRefRepo interface with BatchCreate, ListByExptID, GetByExptIDAndItemID,
  MGetByExptIDAndItemIDs, CountByEvalSetGrouped methods
- Add ExptEvalSetItemCount struct for eval_set grouped item counts
- Update go:generate directive to include IExptItemRefRepo

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ields in converters

- mysql/expt_item_ref.go: IExptItemRefDAO interface + implementation
  (BatchCreateNX, ListByExptID, GetByExptIDAndItemID, MGetByExptIDAndItemIDs,
  CountByEvalSetGrouped)
- experiment/expt_item_ref.go: IExptItemRefRepo implementation wiring DAO + converter
- convert/expt_item_ref.go: ExptItemRef DO/PO converter with item_config JSON marshaling
- convert/expt_result.go: propagate ItemVersionID in ExptItemResult/ExptTurnResult converters
- convert/expt_item_result_run_log.go: propagate ItemVersionID
- convert/expt_run_log.go: propagate ItemVersionID in ExptTurnResultRunLog converter
- convert/expt_turn_evaluator_result_ref.go: add SourceType/InlineKey/Alias fields
- convert/expt_evaluator_ref.go: add EvalSetID/Alias/Filter/BindingConfig fields
- evaluator/convertor/evaluator_record.go: add all new fields
  (ItemVersionID, SourceType, InlineKey, Alias_, TargetRecordID)
- target/convertor/eval_target_record.go: add ItemVersionID
- domain/repo/expt.go: add IExptItemRefRepo interface

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… convertor

- entity/param.go: add EvalSetConfigs to CreateExptParam
- entity/expt.go: update ToEvaluatorRefDO to generate alias/filter/binding_config
  for MultiSetConfig path; add EvalSetSourceType to Experiment struct
- service/expt_manage_impl.go: detect EvalSetConfigs in CreateExpt; set
  EvalSetSourceType=MultiSetConfig or SingleSet accordingly; serialize
  EvalSetConfigs into eval_conf
- infra/convert/expt.go: propagate EvalSetSourceType in DO2PO/PO2DO
- application/convertor/experiment/expt.go: ConvertCreateReq handles eval_set_configs
  field 70 → domain EvalSetConfigs; ToExptDTO fills fields 110-114

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…atorResults upgrade

ExptStart (Task 6):
- ExptSubmitExec: add exptItemRefRepo field (variadic backward-compat constructor)
- ExptStart: detect EvalSetSourceType==MultiSetConfig → call exptStartMultiSet
- exptStartMultiSet: paginate items per set, build ExptItemRef with item_config,
  batch-write expt_item_ref then expt_item_result/expt_turn_result
- buildItemConfigFromSetConf: materialize per-set evaluator/target conf into item_config

storeTurnRunResult / RecordItemRunLogs (Task 7):
- storeTurnRunResult: write new EvaluatorResults{Registered:[], Alias:''} format
- NewTurnEvaluatorResultRefs: handle both old map format and new Registered+Inline arrays;
  set SourceType=Builtin for registered, SourceType=Inline for inline
- terminateZombieEvaluatorRecords: extract record IDs from both old/new format
- RecordItemRunLogs: extract evaluator IDs from both formats for weight calculation
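
The dual-format handling described above can be sketched as follows. This is a minimal illustration, not the PR's code: the struct layout, the JSON tag names, and the helpers `ParseEvaluatorResults` and `RecordIDs` are assumptions made for the example; only the idea of a legacy map living next to Registered and Inline arrays comes from the commit message.

```go
package main

import (
	"encoding/json"
	"sort"
)

// EvaluatorResultRef is a hypothetical entry for one evaluator instance.
// Field and tag names are assumptions, not the PR's real schema.
type EvaluatorResultRef struct {
	EvaluatorVersionID int64  `json:"evaluator_version_id"`
	Alias              string `json:"alias,omitempty"`
	InlineKey          string `json:"inline_key,omitempty"`
	RecordID           int64  `json:"record_id"`
}

// EvaluatorResults keeps the legacy map next to the new dual arrays so that
// rows written before the upgrade still deserialize.
type EvaluatorResults struct {
	EvalVerIDToResID map[int64]int64      `json:"eval_ver_id_to_res_id,omitempty"`
	Registered       []EvaluatorResultRef `json:"registered,omitempty"`
	Inline           []EvaluatorResultRef `json:"inline,omitempty"`
}

// ParseEvaluatorResults decodes either format from the same JSON column.
func ParseEvaluatorResults(raw []byte) (*EvaluatorResults, error) {
	var r EvaluatorResults
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, err
	}
	return &r, nil
}

// RecordIDs collects record IDs from whichever fields are populated and
// returns them sorted, because map iteration order is not deterministic.
func (r *EvaluatorResults) RecordIDs() []int64 {
	if r == nil {
		return nil
	}
	var ids []int64
	for _, ref := range r.Registered {
		ids = append(ids, ref.RecordID)
	}
	for _, ref := range r.Inline {
		ids = append(ids, ref.RecordID)
	}
	for _, id := range r.EvalVerIDToResID {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}
```

A caller that only needs record IDs can then stay unaware of which format a given row was written in.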

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- expt_result_impl_test.go: TestNewTurnEvaluatorResultRefs_NewFormat
  covers nil/old-map/Registered/Inline/mixed cases with SourceType assertion
- expt_result_eval_test.go: TestEvaluatorResultsSerializeCompat
  covers new-format serialize, old-format JSON backward-compat, new-format deserialize
- expt_convertor_test.go: TestConvertCreateReq_OldPath/EvalSetConfigs,
  TestConvertEvalSetConfigsDTOToDO - application convertor new path
- expt_run_scheduler_mode_multiset_test.go: TestBuildItemConfigFromSetConf
  covers empty/single-evaluator/with-target-conf cases

All 11 new test cases PASS

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s collision

Replace sync.Map(int64 key) with mutex+slice (evalRecordCollector) in
callEvaluators/asyncCallEvaluator so alias multi-instances of the same
evaluatorVersionID are stored as separate records instead of overwriting
each other.

ExptTurnRunResult.EvaluatorResults changes from map[int64]*EvaluatorRecord
to []*EvaluatorRecord; GetEvaluatorRecord uses linear search by versionID
for backward compatibility.
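
A minimal sketch of the mutex+slice collector idea, assuming a simplified `EvaluatorRecord`. The type name `evalRecordCollector` comes from the commit message; its methods, `collectConcurrently`, and `findByVersionID` are hypothetical stand-ins for illustration.

```go
package main

import "sync"

// EvaluatorRecord is a simplified stand-in for the real entity.
type EvaluatorRecord struct {
	EvaluatorVersionID int64
	Alias              string
	Score              float64
}

// evalRecordCollector appends records under a mutex. Unlike a map keyed by
// evaluator version ID, two alias instances of one version both survive.
type evalRecordCollector struct {
	mu      sync.Mutex
	records []*EvaluatorRecord
}

// Add stores one record; safe for concurrent use.
func (c *evalRecordCollector) Add(r *EvaluatorRecord) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.records = append(c.records, r)
}

// Records returns a copy so callers cannot race with later Add calls.
func (c *evalRecordCollector) Records() []*EvaluatorRecord {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]*EvaluatorRecord, len(c.records))
	copy(out, c.records)
	return out
}

// collectConcurrently simulates evaluator goroutines reporting their results.
func collectConcurrently(records []*EvaluatorRecord) []*EvaluatorRecord {
	c := &evalRecordCollector{}
	var wg sync.WaitGroup
	for _, r := range records {
		wg.Add(1)
		go func(r *EvaluatorRecord) {
			defer wg.Done()
			c.Add(r)
		}(r)
	}
	wg.Wait()
	return c.Records()
}

// findByVersionID is the backward-compatible lookup: a linear search that
// returns the first record with the given version ID, or nil.
func findByVersionID(records []*EvaluatorRecord, versionID int64) *EvaluatorRecord {
	for _, r := range records {
		if r != nil && r.EvaluatorVersionID == versionID {
			return r
		}
	}
	return nil
}
```

The linear search is acceptable here on the assumption that a turn carries only a handful of evaluator records.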

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
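For the new experiment type (MultiSetConfig), aggregation skips the EvaluatorScore/WeightedScore dimensions
(with multiple eval sets and alias multi-instances, a single experiment-level average score is not meaningful) and keeps the Target
latency/tokens and Annotation dimensions. Also adds a fallback filter for Skipped records and a compatibility parser
for field_key.

- domain/entity/aggr_field_key.go: add the ParseEvaluatorScoreFieldKey helper,
  which accepts both pure numbers (current DB rows) and verID:alias (future extension)
- expt_result_aggr_impl.go:
  * CreateExptAggrResult branches on EvalSetSourceType at entry and extracts
    computeEvaluatorAggrGroup into a standalone method
  * createWeightedScoreAggrResult guards MultiSetConfig at entry and returns nil
  * UpdateExptAggrResult guards MultiSetConfig at entry and returns early
  * both recordMap builders now filter out Status=Skipped
  * 3 ParseInt(FieldKey) call sites switched to ParseEvaluatorScoreFieldKey
- filter.go: ConvertFilter switches from ParseInt to the compatibility parser
- Add 11 unit tests (7 for the helper + 4 for branching); 0 regressions in existing aggregation tests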
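
A rough sketch of a field_key parser that accepts both a bare version ID and a verID:alias pair. The function name is lowercased to mark it as hypothetical; the real ParseEvaluatorScoreFieldKey signature is not shown in this PR page, so the return values here are an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseEvaluatorScoreFieldKey splits a field key into a version ID and an
// optional alias. "101" yields (101, ""), "101:strict" yields (101, "strict").
// Hypothetical signature for illustration only.
func parseEvaluatorScoreFieldKey(fieldKey string) (versionID int64, alias string, err error) {
	// Cut at the first colon; when there is none, idPart is the whole key.
	idPart, aliasPart, _ := strings.Cut(fieldKey, ":")
	versionID, err = strconv.ParseInt(idPart, 10, 64)
	if err != nil {
		return 0, "", fmt.Errorf("invalid evaluator field key %q: %w", fieldKey, err)
	}
	return versionID, aliasPart, nil
}
```

Because the bare-number case is handled by the same code path, existing rows keep parsing exactly as they did with a plain `ParseInt`.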

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
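On the new item-centric path (MultiSetConfig), ConvertCreateReq only sets EvalSetConfigs
and does not build EvalConf.ConnectorConf, while CheckRun->CheckConnector->checkEvaluatorsConnector
still validates field mappings synchronously against the legacy connector structure, so SubmitExperiment fails with
'invalid evaluator connector' (cause: nil EvaluatorConf).

Option A: add buildExptConfFromEvalSetConfigs, which flattens the field mappings in eval_set_configs[].evaluator_confs
/target_confs to derive the legacy connector (EvaluatorsConf + TargetConf),
deduplicated by (evaluator_version_id, alias), so both sides derive from the same input and are consistent by construction.
EvalSetConfigs remains the source of truth for persistence and scheduling.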

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
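When the new experiment type (MultiSetConfig) executes evaluators, it applies the row-level filter from
EvaluatorConfs in expt_item_ref.item_config; if the filter does not match, that evaluator is skipped
(the evaluator service is not actually called). Legacy experiments keep the original experiment-level EvaluatorsConf path,
with zero behavior change.

- domain/entity/expt_run.go:
  * ExptItemEvalCtx: add an ItemConfig field
  * ExptTurnRunResult: add GetEvaluatorRecordByVerAlias(versionID, alias),
    reserved for locating alias multi-instances independently later (the old GetEvaluatorRecord is kept)
- domain/service/expt_run_item_event_impl.go:
  * ExptItemEventEvalServiceImpl: add an exptItemRefRepo field; the NewExptRecordEvalService
    constructor takes an IExptItemRefRepo parameter
  * BuildExptRecordEvalCtx branches on EvalSetSourceType: MultiSetConfig calls
    GetByExptIDAndItemID to load item_config and inject it into ctx; if it cannot be read, it degrades to nil (legacy path)
- domain/service/expt_item_filter_match.go (new):
  * ShouldRunByFilter decides by combining FilterMode (None/Include/Exclude) with the matcher
  * MatchExptItemFilter supports QueryAndOr (AND/OR) and multiple FilterFields
  * matchByQueryType supports equal/in/not_equal/not_in/contains/not_contains;
    an unrecognized QueryType passes by default (does not block execution)
  * shouldRunEvaluatorByItemConfig bridge: finds the first conf for the versionID
    and evaluates its filter
- domain/service/expt_run_item_turn_impl.go:
  * CallEvaluators: add an ItemConfig guard at entry: filter miss → continue, skipping
    that evaluator and keeping it out of pendingEvaluatorVersionIDs
- wire: experiment/wire.go adds NewExptItemRefRepo, mysql/wire.go adds
  NewExptItemRefDAO, and wire_gen.go is regenerated (including the iExptItemRefRepo variable)
- Add 18 unit tests (12 filter matcher + 6 bridge + 4 dual-key lookup); 0 regressions in existing tests

tech debt (follow-up PRs):
- Independent execution of alias multi-instances + persisting Skipped placeholder records: depends on extending the
  evaluatorService.RunEvaluator API with Alias/SourceType fields; the current MVP only avoids calling the
  evaluator (filter behavior is correct), but the GUI cannot show a Skipped marker
- Multiple alias instances of the same versionID are judged by the first conf only; switch once alias support actually lands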
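
The filter decision described in this commit can be sketched roughly as below. All type and function names are hypothetical simplifications (items are modeled as a flat `map[string]string`), and the Include/Exclude semantics are an assumption: Include runs the evaluator only on matching items, Exclude only on non-matching ones.

```go
package main

import "strings"

// FilterMode is an assumed enum mirroring None/Include/Exclude.
type FilterMode int

const (
	FilterModeNone FilterMode = iota
	FilterModeInclude
	FilterModeExclude
)

// FilterField is one condition on an item field.
type FilterField struct {
	Key       string
	QueryType string
	Values    []string
}

// ItemFilter combines several conditions with AND (default) or OR.
type ItemFilter struct {
	Mode   FilterMode
	UseOr  bool
	Fields []FilterField
}

func containsValue(values []string, v string) bool {
	for _, candidate := range values {
		if candidate == v {
			return true
		}
	}
	return false
}

// matchField evaluates one condition; unknown query types pass so that an
// unrecognized filter never blocks execution.
func matchField(f FilterField, item map[string]string) bool {
	actual := item[f.Key]
	first := ""
	if len(f.Values) > 0 {
		first = f.Values[0]
	}
	switch f.QueryType {
	case "equal":
		return actual == first
	case "not_equal":
		return actual != first
	case "in":
		return containsValue(f.Values, actual)
	case "not_in":
		return !containsValue(f.Values, actual)
	case "contains":
		return strings.Contains(actual, first)
	case "not_contains":
		return !strings.Contains(actual, first)
	default:
		return true
	}
}

// matchItemFilter applies AND/OR across all fields; no fields means a match.
func matchItemFilter(flt ItemFilter, item map[string]string) bool {
	if len(flt.Fields) == 0 {
		return true
	}
	for _, f := range flt.Fields {
		ok := matchField(f, item)
		if flt.UseOr && ok {
			return true
		}
		if !flt.UseOr && !ok {
			return false
		}
	}
	return !flt.UseOr
}

// shouldRunByFilter turns the match result into a run/skip decision.
func shouldRunByFilter(flt *ItemFilter, item map[string]string) bool {
	if flt == nil || flt.Mode == FilterModeNone {
		return true
	}
	matched := matchItemFilter(*flt, item)
	if flt.Mode == FilterModeExclude {
		return !matched
	}
	return matched
}
```

Defaulting unknown query types to "pass" trades strictness for availability: a newer filter definition read by older code runs the evaluator rather than silently dropping it.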

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR2 injected IExptItemRefRepo only in BuildExptRecordEvalCtx (the record eval path),
but not in DefaultSchedulerModeFactory on the scheduler path. When the new experiment type
(MultiSetConfig) is first scheduled and triggers ExptSubmitExec.exptStartMultiSet,
e.exptItemRefRepo is nil and it fails immediately with "exptItemRefRepo is nil, cannot run
multi-set ExptStart".

Reproduced conclusively on the BOE/PPE MQ consumer; the error stack includes commit ff64c57
(pre-PR2 plus a colleague's commit, neither of which contains this fix).

Fix:
- DefaultSchedulerModeFactory struct: add an exptItemRefRepo field;
  NewSchedulerModeFactory takes a new parameter, and the factory's NewSchedulerMode passes
  f.exptItemRefRepo when calling NewExptSubmitMode / NewExptTrialRunMode
- NewExptTrialRunMode takes a variadic IExptItemRefRepo and forwards it to the inner
  NewExptSubmitMode (matching the latter's variadic form)
- wire regenerates wire_gen.go, so the NewSchedulerModeFactory call site
  passes iExptItemRefRepo automatically
- Test TestNewSchedulerModeFactory: add the MockIExptItemRefRepo argument

The FailRetry/Append/RetryAll/RetryItems structs have no exptItemRefRepo
field and are unaffected; this change only fixes the Submit + TrialRun paths.
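
The failure mode and the fix can be illustrated with a small sketch of a variadic, backward-compatible constructor plus a nil guard. All types and names below are simplified, hypothetical stand-ins (including the `CountByExptID` method); only the error text is taken from the commit message.

```go
package main

import "errors"

// ItemRefRepo is a hypothetical, reduced version of the repository interface.
type ItemRefRepo interface {
	CountByExptID(exptID int64) (int64, error)
}

// noopItemRefRepo is a stub used only to make the sketch runnable.
type noopItemRefRepo struct{}

func (noopItemRefRepo) CountByExptID(int64) (int64, error) { return 0, nil }

var errItemRefRepoNil = errors.New("exptItemRefRepo is nil, cannot run multi-set ExptStart")

type submitExec struct {
	itemRefRepo ItemRefRepo
}

// newSubmitExec accepts the repo as an optional variadic argument so that
// older call sites compile unchanged. The cost is that forgetting to pass it
// is only detected at run time.
func newSubmitExec(repos ...ItemRefRepo) *submitExec {
	e := &submitExec{}
	if len(repos) > 0 {
		e.itemRefRepo = repos[0]
	}
	return e
}

// startMultiSet fails fast instead of dereferencing a nil dependency.
func (e *submitExec) startMultiSet(exptID int64) error {
	if e.itemRefRepo == nil {
		return errItemRefRepoNil
	}
	_, err := e.itemRefRepo.CountByExptID(exptID)
	return err
}

// schedulerModeFactory holds the repo and forwards it to every mode that
// needs it, which is the shape of the fix described above.
type schedulerModeFactory struct {
	itemRefRepo ItemRefRepo
}

func newSchedulerModeFactory(repo ItemRefRepo) *schedulerModeFactory {
	return &schedulerModeFactory{itemRefRepo: repo}
}

func (f *schedulerModeFactory) newSubmitMode() *submitExec {
	return newSubmitExec(f.itemRefRepo)
}
```

A required constructor parameter would turn this class of bug into a compile error; the variadic form keeps existing callers working at the price of a runtime check.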

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… BAM parse

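domain/expt.thrift includes two files named filter.thrift, one from data and one from observability;
the identical basenames cause a Thrift alias conflict. thriftgo/kitex disambiguates automatically (filter0), but the BAM
parser wrongly binds filter.Filter to the observability namespace (which has no Filter struct) →
'struct not defined' → blocks BAM idl update / AGW release.

Rename the data-side filter.thrift → data_filter.thrift and have referrers use data_filter.Filter,
removing the name ambiguity. The namespace is unchanged → Go package paths are unchanged → zero changes to backend business code.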

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
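At entry, CallEvaluators branches to callEvaluatorsByItemConfig when ItemConfig is non-empty:
- The main loop iterates over ItemConfig.EvaluatorConfs, so the same versionID with different aliases runs twice
- existResult is reused by the (versionID, alias) dual key
- Filter miss → write a Skipped placeholder record (with alias)
- Filter hit → call RunEvaluator/AsyncRunEvaluator, passing alias + SourceType through

On the write side, evaluator_impl.RunEvaluator/AsyncRunEvaluator persist request.Alias/SourceType
to EvaluatorRecord, falling back from SourceType=Unknown to Builtin.

Remove the dead helper shouldRunEvaluatorByItemConfig left by PR2, along with its unit tests;
it looked up the first conf by versionID, which is semantically wrong with alias multi-instances.
callEvaluatorsByItemConfig now calls ShouldRunByFilter per conf directly and no longer needs it.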
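
The per-conf loop with dual-key reuse might look roughly like the planning function below. It is a simplified sketch: `instanceKey`, `plannedCall`, and `planEvaluatorCalls` are hypothetical names, and the real code calls the evaluator service rather than returning a plan.

```go
package main

// instanceKey identifies one evaluator instance. A struct key avoids string
// concatenation and lets the same version appear under several aliases.
type instanceKey struct {
	VersionID int64
	Alias     string
}

// EvaluatorConf is a reduced, hypothetical per-item evaluator configuration.
type EvaluatorConf struct {
	VersionID int64
	Alias     string
}

type callAction string

const (
	actionReuse   callAction = "reuse"   // an existing result covers this instance
	actionSkipped callAction = "skipped" // filter miss: write a placeholder record
	actionRun     callAction = "run"     // filter hit: call the evaluator
)

type plannedCall struct {
	Key    instanceKey
	Action callAction
}

// planEvaluatorCalls walks the confs in order and decides, per instance,
// whether to reuse an existing result, record a skip, or run the evaluator.
func planEvaluatorCalls(
	confs []EvaluatorConf,
	existing map[instanceKey]bool,
	shouldRun func(EvaluatorConf) bool,
) []plannedCall {
	plan := make([]plannedCall, 0, len(confs))
	for _, conf := range confs {
		key := instanceKey{VersionID: conf.VersionID, Alias: conf.Alias}
		switch {
		case existing[key]:
			plan = append(plan, plannedCall{Key: key, Action: actionReuse})
		case !shouldRun(conf):
			plan = append(plan, plannedCall{Key: key, Action: actionSkipped})
		default:
			plan = append(plan, plannedCall{Key: key, Action: actionRun})
		}
	}
	return plan
}
```

Iterating over confs rather than over distinct version IDs is what lets two aliases of one version be handled independently.
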
Commit 2ced00e mistakenly rewrote submitReq := experiment.OpenAPITemplateToSubmitExperimentRequest(...)
as expeinternal / cmd / evaluation / experiment.goriment.OpenAPITemplateToSubmitExperimentRequest(...),
which broke compilation of the whole application package; restore the original call.
In Get/MGet/List experiment responses, eval_set_details[].item_count was previously always empty.
This PR wires up the read path:

- entity.Experiment: add an EvalSetDetails field
- entity: add ExptEvalSetDetail (same shape as ExptEvalSetDetail in IDL domain/expt.thrift)
- ExptMangerImpl.NewExptManager: add an IExptItemRefRepo parameter + struct field
- packExperimentResult calls fillEvalSetDetails at the end (only for the new experiment type; legacy experiments skip it)
- fillEvalSetDetails: builds the skeleton from EvalConf.EvalSetConfigs, keeps IsPrimary consistent with EvalSetID,
  and takes ItemCount from IExptItemRefRepo.CountByEvalSetGrouped, which is 0 before the first run
- The DTO converter maps entity.EvalSetDetails to thrift Experiment.EvalSetDetails
- wire_gen regenerated

5 subtests PASS: nil repo skipped / only MultiSetConfig triggers the repo / IsPrimary consistent with EvalSetID /
missing ItemCount filled with 0 / repo error propagated / missing EvalConf skipped / repo not called without MultiSetConfig.
go test ./... across the whole evaluation module shows 0 regressions.
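
A minimal sketch of the skeleton-plus-counts merge. The types and `buildEvalSetDetails` are hypothetical, and one reading of "IsPrimary consistent with EvalSetID" is assumed here: a set is primary when its ID equals the experiment's own eval set ID. The grouped counts are modeled as a plain map rather than a repository call.

```go
package main

// EvalSetConfig and EvalSetDetail are reduced, hypothetical versions of the
// real entities; only the fields needed for the illustration are kept.
type EvalSetConfig struct {
	EvalSetID int64
}

type EvalSetDetail struct {
	EvalSetID int64
	IsPrimary bool
	ItemCount int64
}

// buildEvalSetDetails builds one detail per configured eval set. Counts come
// from a grouped query result keyed by eval set ID; a set with no rows yet
// (for example before the first run) reads as 0 from the map.
func buildEvalSetDetails(primaryEvalSetID int64, configs []EvalSetConfig, counts map[int64]int64) []EvalSetDetail {
	details := make([]EvalSetDetail, 0, len(configs))
	for _, cfg := range configs {
		details = append(details, EvalSetDetail{
			EvalSetID: cfg.EvalSetID,
			IsPrimary: cfg.EvalSetID == primaryEvalSetID, // assumed semantics
			ItemCount: counts[cfg.EvalSetID],             // missing key yields 0
		})
	}
	return details
}
```

Building the skeleton from the configuration instead of from the count rows means sets with zero items still appear in the response.
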
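The new experiment type MultiSetConfig supports multiple aliases for the same evaluator version. This fixes the problem that
maps indexed by evaluator_version_id collide on keys with alias multi-instances, across the three legacy consumer paths: aggregation,
the acceleration data warehouse (CK), and export. Backward compatible: for legacy experiments (empty alias) the encoding degrades to
the bare version_id, so the whole chain is byte-for-byte unchanged.

- Add entity.EncodeEvaluatorInstanceKey: alias="" degrades to the bare verID, otherwise
  verID:alias; it is the inverse of ParseEvaluatorScoreFieldKey
- Aggregation: remove the 3 MultiSetConfig skip branches and restore the weighted average;
  computeEvaluatorAggrGroup buckets by (version,alias); field_key is encoded as verID:alias
- Weighted calculation: CalculateWeightedScore changes both map keys from int64 to string; add the type-aware
  buildScoreWeights, where MultiSetConfig takes weights with alias from
  EvalSetConfigs[].EvaluatorConfs
- CK wide table: add an item_version_id column (String DEFAULT '0', including two init-sql files);
  key_mapping FromField/lookup upgraded to verID:alias; key_mapping in CreateExpt + ManualUpsert
  is expanded per (version,alias) instance (globally incrementing ToKey + deduplication)
- Export: ExportExptResult_ rejects MultiSetConfig at entry based on EvalSetSourceType,
  leaving DoExportCSV / insight analysis untouched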
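
The key encoding and the string-keyed weighted average can be sketched as below. Function names are lowercased to mark them as hypothetical; the real EncodeEvaluatorInstanceKey and CalculateWeightedScore signatures are not shown on this page, and skipping instances without a positive weight is an assumption of this example.

```go
package main

import "strconv"

// encodeEvaluatorInstanceKey returns the bare version ID for an empty alias
// (so legacy rows keep their exact key) and "verID:alias" otherwise.
func encodeEvaluatorInstanceKey(versionID int64, alias string) string {
	id := strconv.FormatInt(versionID, 10)
	if alias == "" {
		return id
	}
	return id + ":" + alias
}

// weightedScore computes sum(score*weight)/sum(weight) over instance keys.
// Instances without a positive weight are ignored; ok is false when no
// instance contributes.
func weightedScore(scores, weights map[string]float64) (score float64, ok bool) {
	var sum, total float64
	for key, s := range scores {
		w, found := weights[key]
		if !found || w <= 0 {
			continue
		}
		sum += s * w
		total += w
	}
	if total == 0 {
		return 0, false
	}
	return sum / total, true
}
```

With string keys, two aliases of version 101 occupy the separate entries `101:strict` and `101:lenient`, whereas an `int64` key would have let one overwrite the other.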

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
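…s filtering)

Two gaps:
1. GetExperimentsOApi (single-experiment Get) goes through OpenAPIExptDO2DTO, which previously did not backfill the item-centric fields
   (eval_set_source_type/eval_set_details/evaluators_concur_num/total_item_count); the backfill from §5 was only added to
   DomainExperimentDTO2OpenAPI (the List/Create response path). This change adds the backfill to OpenAPIExptDO2DTO; total_item_count
   is returned only for MultiSetConfig (aligned with ToExptDTO).
2. eval_set_source_types was missing at two layers in ListExperimentsOApi: the OpenAPI ExperimentFilterOption lacked the field, and
   OpenAPIExperimentFilterOptionDTO2Domain did not pass it through. This change adds eval_set_source_types to the OpenAPI IDL (field 2,
   at the same level as fuzzy_name), regenerates kitex_gen, converts the string→int enum and passes it through, and fixes
   "passing only source_types is treated as empty and returns nil".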

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
codecov Bot commented Jun 17, 2026


Codecov Report

❌ Patch coverage is 3.89294% with 395 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...modules/evaluation/application/eval_openapi_app.go 0.00% 116 Missing ⚠️
...uation/application/convertor/experiment/openapi.go 0.00% 106 Missing and 2 partials ⚠️
...domain/service/target_source_sandbox_agent_impl.go 0.00% 58 Missing ⚠️
...uation/application/convertor/target/eval_target.go 0.00% 51 Missing and 1 partial ⚠️
...d/modules/evaluation/domain/service/target_impl.go 34.28% 23 Missing ⚠️
...uation/infra/rpc/agent_studio/sandbox_scheduler.go 0.00% 12 Missing ⚠️
...d/modules/evaluation/application/experiment_app.go 18.18% 8 Missing and 1 partial ⚠️
...api/handler/coze/loop/apis/eval_open_apiservice.go 0.00% 4 Missing ⚠️
backend/modules/evaluation/domain/entity/param.go 0.00% 3 Missing ⚠️
...valuation/application/convertor/experiment/expt.go 0.00% 1 Missing and 1 partial ⚠️
... and 4 more

❌ Your patch check has failed because the patch coverage (3.89%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #551      +/-   ##
==========================================
- Coverage   77.65%   77.41%   -0.25%     
==========================================
  Files         670      672       +2     
  Lines       76101    76488     +387     
==========================================
+ Hits        59098    59210     +112     
- Misses      13545    13808     +263     
- Partials     3458     3470      +12     
Flag Coverage Δ
unittests 77.41% <3.89%> (-0.25%) ⬇️

Files with missing lines Coverage Δ
...aluation/domain/service/expt_run_item_turn_impl.go 86.46% <100.00%> (ø)
...valuation/application/convertor/experiment/expt.go 83.61% <0.00%> (-0.36%) ⬇️
.../application/convertor/experiment/expt_template.go 88.38% <0.00%> (-0.21%) ⬇️
backend/modules/evaluation/domain/entity/expt.go 95.87% <0.00%> (-2.02%) ⬇️
backend/modules/evaluation/domain/entity/target.go 97.93% <33.33%> (-2.07%) ⬇️
...ules/evaluation/domain/service/expt_manage_impl.go 76.32% <0.00%> (-0.25%) ⬇️
backend/modules/evaluation/domain/entity/param.go 80.70% <0.00%> (-4.49%) ⬇️
...api/handler/coze/loop/apis/eval_open_apiservice.go 0.00% <0.00%> (ø)
...d/modules/evaluation/application/experiment_app.go 83.85% <18.18%> (-0.50%) ⬇️
...uation/infra/rpc/agent_studio/sandbox_scheduler.go 0.00% <0.00%> (ø)
... and 5 more

... and 9 files with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df9d61c...7276ae7. Read the comment docs.


@tpfz tpfz force-pushed the feat/wzq/trae_eval branch from 233352c to fcd209c Compare June 22, 2026 07:57
xueyizheng and others added 5 commits June 23, 2026 15:10
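The old unique key only covered (space_id, expt_id, expt_turn_result_id, evaluator_version_id),
so a second row with the same version but a different alias was overwritten by ON DUPLICATE KEY UPDATE. Change it to
uniq_expt_turn_evaluator_result (expt_id, evaluator_version_id,
expt_turn_result_id, inline_key, alias):
- Drop space_id (expt_id is a snowflake ID and already globally unique)
- Drop source_type (inline_key/alias are mutually exclusive, so the pair is enough to distinguish Builtin/Inline instances)
- No merged column and no backfill (this table has 30 million rows, and NDB 8.0 does not support CHAR())
The helm init-sql also adds the three missing columns source_type/inline_key/alias.
Existing production tables need a migration ALTER (CREATE IF NOT EXISTS has no effect on existing tables).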

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
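The name now carries the disambiguating columns inline_key/alias, making it obvious that this is the alias multi-instance key.
The gorm model and 2 init-sql files are updated in sync, consistent with the index name used by the online BOE ALTER.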

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
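GetExperimentsOApi previously did not return eval_set_configs, so OpenAPI clients could not get the
per-set item_filter / evaluator and target configuration. This change branches on the experiment type in OpenAPIExptDO2DTO:
MultiSetConfig backfills eval_set_configs, while legacy SingleSet experiments do not return it.
The version_id→version string is resolved from data already loaded on the read object (zero extra RPCs).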

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The gen-commercial package references ExptNotificationConf,
WebhookNotificationConf, and FeishuNotificationConf types as aliases
from the domain/expt package. These types were added on main but
not yet on feat/eval0630. Adding minimal stubs to unblock the
commercial build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tpfz tpfz force-pushed the feat/wzq/trae_eval branch from f53db8b to 7014835 Compare June 24, 2026 12:48
tpfz and others added 13 commits June 24, 2026 20:55
Add a new EvalTargetType.SandboxAgent (=17) for evaluating agents
launched via CLI inside a sandbox container. Wired through IDL,
generated kitex code, DO entities, DTO/DO converters (internal +
openapi), MySQL convertor, and turn-execution dispatch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…cord

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Route SandboxAgent eval target through the async execution path so that
external sandbox runs can report results via ReportEvalTargetInvokeResult:
- AsyncCallTarget() returns true for SandboxAgent targets
- Register a SandboxAgent ISourceEvalTargetOperateService that allocates
  an invoke id placeholder in AsyncExecute; actual execution is performed
  outside and reported back through the existing async report endpoint

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the redundant /evaluation segment from /v1/loop/evaluation/eval_targets/async_debug to match the sibling ReportEvalTargetInvokeResult route at /v1/loop/eval_targets/result.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a sandbox_agent field on AsyncDebugEvalTargetOApiRequest and a SandboxAgent case in the openapi async-debug switch so that debugging a SandboxAgent target no longer fails with "unsupported eval target type: sandbox_agent".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
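Add the sandbox scheduling RPC adapter interface and supporting DOs:
- domain/component/rpc/sandbox_scheduler.go: the ISandboxSchedulerAdapter interface (Init/Run/Get/GetTaskInfo/Destroy) and DOs/enums such as Sandbox{Execute,Task}{Info,Status...}
- infra/rpc/agent_studio/: a placeholder implementation in the open-source backend (all 5 methods return not implement); the commercial repository overrides it with the real calls to stone.cozeloop.agent_studio
- SandboxAgentSourceEvalTargetServiceImpl injects ISandboxSchedulerAdapter, and the NewSourceTargetOperators parameters are extended accordingly
- TargetDomainServiceSet adds agent_studio.AgentStudioRPCSet and the interface wire.Bind; wire_gen.go is regenerated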
…heduler IDL

DestroyType enum now matches new IDL (Task/Execute); SandboxRunRequest carries optional Image field.
image is now passed via the Param map (key=image); sandbox agent already
serializes all input fields into Param, so the adapter does not need a
dedicated Image field on SandboxRunRequest.
…d per-row completion

Init a sandbox task via SandboxSchedulerAdapter in SubmitExperiment for SandboxAgent eval targets (TaskID=expt.ID, Concurrency=ItemConcurNum). When each row finishes (ReportInvokeRecords), best-effort destroy that execute via SandboxSchedulerAdapter.Destroy so sandbox resources are released; failures only warn and never block the report.
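
The "best-effort destroy" behaviour can be sketched as follows. Everything here is a hypothetical simplification: the `SandboxDestroyer` interface, `reportRowDone`, and the choice to persist the report before releasing the sandbox are assumptions, and `fakeDestroyer`/`demoReport` exist only to make the example runnable.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// SandboxDestroyer is a reduced stand-in for the scheduler adapter.
type SandboxDestroyer interface {
	Destroy(ctx context.Context, executeID string) error
}

// reportRowDone saves the row's result and then releases the sandbox execute.
// A failing Destroy is only logged through warnf and never fails the report.
func reportRowDone(
	ctx context.Context,
	d SandboxDestroyer,
	executeID string,
	saveReport func() error,
	warnf func(format string, args ...any),
) error {
	if err := saveReport(); err != nil {
		return err
	}
	if d == nil {
		return nil
	}
	if err := d.Destroy(ctx, executeID); err != nil {
		warnf("destroy sandbox execute %s failed: %v", executeID, err)
	}
	return nil
}

// fakeDestroyer lets the sketch exercise both outcomes.
type fakeDestroyer struct {
	fail bool
}

func (f *fakeDestroyer) Destroy(_ context.Context, _ string) error {
	if f.fail {
		return errors.New("scheduler unavailable")
	}
	return nil
}

// demoReport runs one report and returns the warnings that were emitted.
func demoReport(destroyFails bool) ([]string, error) {
	var warnings []string
	warnf := func(format string, args ...any) {
		warnings = append(warnings, fmt.Sprintf(format, args...))
	}
	err := reportRowDone(
		context.Background(),
		&fakeDestroyer{fail: destroyFails},
		"exec-1",
		func() error { return nil },
		warnf,
	)
	return warnings, err
}
```

Swallowing the cleanup error keeps result reporting reliable, but leaked sandboxes then only show up in warning logs, so some form of monitoring on those warnings would be needed.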

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wire SandboxAgent eval-target type through the submit-experiment OpenAPI:
add sandbox_agent to SubmitExperimentEvalTargetParam and internal
CreateEvalTargetParam IDLs, propagate it through OpenAPI/RPC convertors
and entity.Opt (WithSandboxAgent), and let the SandboxAgent operator
BuildBySource construct an EvalTarget from the request payload so the
experiment can be created in one OpenAPI call.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Init sandbox task with TaskID "sandbox_debug" (concurrency=10) before
AsyncDebugEvalTargetOApi dispatches a SandboxAgent debug run, and destroy
the execute via defer in ReportEvalTargetInvokeResult_ regardless of
success or failure when the report comes from a debug context
(actx.Event == nil). Also fixes a nil deref on actx.Event in the log line
for debug reports.
@tpfz tpfz force-pushed the feat/wzq/trae_eval branch from 7014835 to 394e837 Compare June 24, 2026 13:06
CLAassistant commented Jun 24, 2026


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
4 out of 5 committers have signed the CLA.

✅ xueyizheng
✅ VinCinx
✅ tpfz
✅ alanyf
❌ NoahKex

6 participants