13 changes: 9 additions & 4 deletions backend/src/mastra/agents/populate.ts
@@ -15,16 +15,21 @@ You do broad research to see which rows to add, and then you spin up sub-agents
Your job is to make sure you dispatch and manage your army of sub agents to build up a dataset with 100 rows in it. Stop as soon as the dataset reaches 100 rows.

WORKFLOW:
- 1. Understand the data that is is needed and do some research to find places on the web where this data may be obvious and easy to find, collect these links to see what the task of scraping the web is going to look like.
+ 1. Understand the data that is needed and do some research to find places on the web where this data may be obvious and easy to find. Collect these links to see what the task of scraping the web is going to look like.
If the dataset is to look at YC Companies, collect links for the YC Startup registry and so on.

- 2. Trigger sub agents. Start doing broad research and identify basic information of the rows in the dataset. Let's say you find a company named "Boody", trigger the run_subagent tool with all the necesarry context (links and places to look) so that it can go and effectivly fill in the data.
+ 2. Trigger sub agents. Start doing broad research and identify basic information of the rows in the dataset. Let's say you find a company named "Boody", trigger the run_subagent tool with all the necessary context (links and places to look) so that it can go and effectively fill in the data.

- 3. See what the subagent reports back with, if all good and it gives you some information, use that to give better instuctions to subsequent sub agents.
+ 3. See what the subagent reports back with, if all good and it gives you some information, use that to give better instructions to subsequent sub agents.

CRITICAL:
- Do not stop after only calling search_web or fetch_page. That creates zero dataset rows.
- As soon as you have one concrete lead, call run_subagent for that lead.
- Your run is not useful unless at least one run_subagent inserts a row.

Keep going until you have 100 rows, then finish immediately. If run_subagent reports ROW_LIMIT_REACHED, stop calling tools and finish the run.

- This process should become faster overtime as you just find new rows to go and build, and you keep invoking sub agents in parallel to fill them in.
+ This process should become faster over time as you just find new rows to go and build, and you keep invoking sub agents in parallel to fill them in.

Duplicates are rejected automatically based on primary key columns. If a subagent reports a duplicate, don't re-investigate the same entity — move on to a new one.
`;
19 changes: 18 additions & 1 deletion backend/src/mastra/agents/refresh.ts
@@ -9,6 +9,20 @@ const openrouter = createOpenRouter({
apiKey: process.env.OPENROUTER_API_KEY!,
});

const OPENROUTER_MODEL_SLUG_PATTERN =
  /^[a-z0-9][a-z0-9._-]*\/[a-z0-9][a-z0-9._:-]*$/i;

function assertValidOpenRouterModelSlug(modelSlug: string): string {
  if (
    modelSlug.length > 200 ||
    !OPENROUTER_MODEL_SLUG_PATTERN.test(modelSlug)
  ) {
    throw new Error(`Invalid investigateSubagent model slug: "${modelSlug}"`);
  }

  return modelSlug;
}

function buildRefreshInstructions(columns: PopulateColumn[]): string {
const columnNames = columns.map((c) => c.name);
const columnsDesc = columns
@@ -56,6 +70,9 @@ export function buildRefreshAgent(
authContext: AuthContext,
columns: PopulateColumn[],
): Agent {
const modelSlug = assertValidOpenRouterModelSlug(
authContext.modelConfig!.investigateSubagent,
);
const { update_row } = buildPopulateTools(
authorizedDatasetId,
authContext,
@@ -64,7 +81,7 @@ export function buildRefreshAgent(
id: "refresh-agent",
name: "Dataset Refresh Agent",
instructions: buildRefreshInstructions(columns),
- model: openrouter("qwen/qwen3.7-max"),
+ model: openrouter(modelSlug),
tools: {
update_row,
search_web: searchWebTool,
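
Review note on the slug check in this file: the sketch below shows which inputs the new validation would accept or reject. The regular expression and the 200-character limit are copied from the diff; the `isValidSlug` helper and the sample slugs (other than the previously hardcoded one) are hypothetical and only illustrate the pattern's behavior.

```typescript
// Pattern and length limit copied from the diff; the rest is illustrative.
const OPENROUTER_MODEL_SLUG_PATTERN =
  /^[a-z0-9][a-z0-9._-]*\/[a-z0-9][a-z0-9._:-]*$/i;

// Hypothetical boolean variant of assertValidOpenRouterModelSlug.
function isValidSlug(modelSlug: string): boolean {
  return (
    modelSlug.length <= 200 && OPENROUTER_MODEL_SLUG_PATTERN.test(modelSlug)
  );
}

const accepted: string[] = [
  "qwen/qwen3.7-max", // the slug that was hardcoded before this change
  "vendor/model:free", // a colon is allowed in the model part only
  "Vendor/Model-1", // the `i` flag makes the match case-insensitive
];

const rejected: string[] = [
  "", // empty string
  "no-slash", // missing the vendor/model separator
  "vendor/model/extra", // a second slash is not allowed
  "/model", // empty vendor part
  "ven:dor/model", // colon is not allowed in the vendor part
  "vendor/" + "a".repeat(200), // longer than 200 characters overall
];

const acceptedResults: boolean[] = accepted.map(isValidSlug);
const rejectedResults: boolean[] = rejected.map(isValidSlug);
```

Because the pattern has no `g` flag, repeated `.test()` calls carry no state between invocations, so reusing the module-level constant is safe.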
35 changes: 35 additions & 0 deletions backend/src/mastra/workflows/populate.ts
@@ -202,6 +202,7 @@ ${columnsDesc}${pkNote}${manifestNote}${strategyNote}

Search the web broadly to find real entities that fit this dataset topic.
For each lead you find, call run_subagent with the primary key values and any context/URLs you have found.
Do not finish after only searching or fetching pages. If you have any concrete lead, call run_subagent before ending the run.
If run_subagent returns ROW_LIMIT_REACHED, stop immediately and do not make any more tool calls.
Stop the populate run as soon as the dataset reaches 100 rows.`;

@@ -253,6 +254,40 @@ const agentStep = createStep({
metrics.addOrchestratorResult(result);
// Use result.toolCalls (flat accumulated list) — same reasoning as investigate-tool.ts.
metrics.countToolCalls(result.toolCalls ?? []);

let rowCount = await convex.query(internal.datasetRows.countByDataset, {
  datasetId: inputData.authorizedDatasetId,
});

if (rowCount === 0) {
  console.warn(
    `[populate-agent] first pass inserted 0 rows for dataset ${inputData.authorizedDatasetId}; retrying with stricter instructions`,
  );
  const retryPrompt = `${inputData.prompt}

The previous pass ended without inserting any rows.

You MUST now:
1. Identify 3-5 concrete candidate rows from the dataset topic.
2. For each candidate, call run_subagent with primary_keys, context, and URLs.
3. Do not call only search_web/fetch_page and stop.
4. Stop as soon as at least one row has been inserted if you cannot confidently find more rows.`;

  const retryResult = await agent.generate(retryPrompt, { maxSteps: 80 });
  metrics.addOrchestratorResult(retryResult);
  metrics.countToolCalls(retryResult.toolCalls ?? []);

  rowCount = await convex.query(internal.datasetRows.countByDataset, {
    datasetId: inputData.authorizedDatasetId,
  });

  if (rowCount === 0) {
    throw new Error("Populate workflow completed with 0 rows after retry");
  }

  return { text: `${result.text}\n\nRetry result:\n${retryResult.text}` };
}

return { text: result.text };
} catch (err) {
status = "error";
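
Review note on the retry added in this hunk: the control flow (run once, count rows, retry once with a stricter prompt, fail if still empty) can be sketched in isolation as below. This is a simplified stand-in, not the workflow's actual code: `runWithZeroRowRetry`, `runPass`, `countRows`, and `retrySuffix` are hypothetical names that replace `agent.generate`, the `convex.query` row count, and the inline retry prompt so the logic can be exercised without those dependencies.

```typescript
type PassResult = { text: string };

// Hypothetical helper mirroring the hunk's control flow with injected dependencies.
async function runWithZeroRowRetry(
  runPass: (prompt: string) => Promise<PassResult>, // stands in for agent.generate
  countRows: () => Promise<number>, // stands in for the row-count query
  prompt: string,
  retrySuffix: string, // stricter instructions appended on retry
): Promise<PassResult> {
  const first = await runPass(prompt);

  // Common case: the first pass inserted at least one row.
  if ((await countRows()) > 0) {
    return { text: first.text };
  }

  // Exactly one retry, with the stricter instructions appended to the prompt.
  const retry = await runPass(`${prompt}\n\n${retrySuffix}`);

  if ((await countRows()) === 0) {
    throw new Error("Populate workflow completed with 0 rows after retry");
  }

  return { text: `${first.text}\n\nRetry result:\n${retry.text}` };
}
```

Injecting the two dependencies keeps the retry bounded to a single extra pass and makes the zero-row failure path easy to cover with stubs.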