Feature/tasks architectural refactoring #283
MOHITKOURAV01 wants to merge 7 commits into openml:main from
Conversation
…tasets and tasks routers
… and optimize feature fetching
…e to pure function, and simplify bindparams
Hey - I've found 2 issues and left some high-level feedback:
- The `list_datasets` endpoint has gained additional branching and parameter-building logic (extra quality filters, expanding bindparams, etc.); consider extracting the quality-filter/parameter construction into small helper functions to keep the main handler simpler and reduce the need for `noqa: C901, PLR0913, PLR0915`.
- The lookup-table whitelist is currently enforced in the router while `get_lookup_data` still interpolates the table name via f-strings; you might centralize the whitelist or table-name validation in the database layer so that any future callers of `get_lookup_data` cannot bypass the restriction.
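One way to sketch the enum-based narrowing this feedback points at (a hedged illustration, not the project's actual code; the member names below are placeholders for the real whitelisted tables):

```python
# Hypothetical sketch: narrow `table` to an Enum so the database layer itself
# enforces the whitelist. The member names below are illustrative only.
from enum import Enum


class LookupTable(str, Enum):
    TASK_TYPE = "task_type"
    ESTIMATION_PROCEDURE = "estimation_procedure"


def build_lookup_query(table: LookupTable) -> str:
    # `table.value` can only be one of the enum members, so interpolating it
    # into the SQL text can no longer inject an arbitrary identifier.
    return f"SELECT * FROM {table.value} WHERE `id` = :id_"
```

Callers holding a raw string would have to go through `LookupTable(raw)`, which raises `ValueError` for anything outside the whitelist.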
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `list_datasets` endpoint has gained additional branching and parameter-building logic (extra quality filters, expanding bindparams, etc.); consider extracting the quality-filter/parameter construction into small helper functions to keep the main handler simpler and reduce the need for `noqa: C901, PLR0913, PLR0915`.
- The lookup-table whitelist is currently enforced in the router while `get_lookup_data` still interpolates the table name via f-strings; you might centralize the whitelist or table-name validation in the database layer so that any future callers of `get_lookup_data` cannot bypass the restriction.
## Individual Comments
### Comment 1
<location path="src/routers/openml/datasets.py" line_range="174-176" />
<code_context>
```diff
     if not range_:
-        return ""
+        return "", {}
     if not (match := re.match(integer_range_regex, range_)):
         msg = f"`range_` not a valid range: {range_}"
         raise ValueError(msg)
     start, end = match.groups()
-    value = f"`value` BETWEEN {start} AND {end[2:]}" if end else f"`value`={start}"
```
</code_context>
<issue_to_address>
**issue (bug_risk):** Using `ValueError` for invalid range input may lead to a 500 instead of a client-facing 4xx response.
If `range_` fails `integer_range_regex`, this `ValueError` will likely propagate as a 500 response, even though it’s due to bad input. Consider raising an `HTTPException` with a 4xx status (e.g., 400/422) here, or ensure `ValueError` is translated to an appropriate 4xx at a higher layer so clients see a correct classification of the error.
</issue_to_address>
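A stdlib-only sketch of the boundary translation the comment suggests (the regex, messages, and function names here are reconstructions for illustration, not the project's actual code):

```python
# Sketch: keep ValueError for internal parsing, but translate it to a
# client-facing 4xx status at the boundary instead of letting it become a 500.
import re

integer_range_regex = r"(\d+)(\.\.\d+)?$"  # assumed pattern, for illustration


def parse_range(range_: str) -> tuple[str, dict]:
    if not (match := re.match(integer_range_regex, range_)):
        msg = f"`range_` not a valid range: {range_}"
        raise ValueError(msg)
    start, end = match.groups()
    if end:
        return "`value` BETWEEN :start AND :end", {"start": int(start), "end": int(end[2:])}
    return "`value` = :start", {"start": int(start)}


def handle_range_param(range_: str) -> tuple[int, str]:
    # Boundary layer: bad input maps to 422, not an unhandled exception.
    try:
        clause, _params = parse_range(range_)
    except ValueError as exc:
        return 422, str(exc)
    return 200, clause
```

In a FastAPI app the same translation is usually done by raising `HTTPException(status_code=422, ...)` or registering an exception handler for `ValueError`.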
### Comment 2
<location path="src/database/tasks.py" line_range="120-129" />
<code_context>
```diff
     return [row.tag for row in tag_rows]
+
+
+async def get_lookup_data(table: str, id_: int, expdb: AsyncConnection) -> RowMapping | None:
+    # table is already whitelisted in the router
+    result = await expdb.execute(
+        text(
+            f"""
+            SELECT *
+            FROM {table}
+            WHERE `id` = :id_
+            """,  # noqa: S608
+        ),
+        parameters={"id_": id_},
+    )
+    return result.mappings().one_or_none()
```
</code_context>
<issue_to_address>
**🚨 issue (security):** Relying on external whitelisting for `table` makes `get_lookup_data` easy to misuse.
The inline comment assumes all future callers will enforce the same whitelist as the current router. Because this helper sits in the DB layer, it’s easy for someone to call it later with a user-influenced `table` and bypass that assumption, reintroducing SQL injection risk via the interpolated table name. Please either (a) enforce the whitelist inside this function, (b) narrow the type to something like an enum of allowed tables instead of `str`, or (c) explicitly restrict this helper to internal, non-user-derived table names and document that clearly.
</issue_to_address>
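Option (a) from the comment could look roughly like this (a minimal sketch; the table names and helper are hypothetical):

```python
# Sketch: enforce the whitelist inside the database helper itself, so no
# caller can reach the f-string interpolation with an unvetted table name.
ALLOWED_LOOKUP_TABLES = frozenset({"task_type", "estimation_procedure"})


def lookup_sql(table: str) -> str:
    if table not in ALLOWED_LOOKUP_TABLES:
        msg = f"Table {table!r} is not an allowed lookup table."
        raise ValueError(msg)
    # Safe to interpolate: `table` was just validated against the whitelist.
    return f"SELECT * FROM {table} WHERE `id` = :id_"  # noqa: S608
```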
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/routers/openml/tasks.py (1)
**132-142**: Consider handling missing `task_inputs` key more gracefully.

On line 140, `task_inputs[table]` will raise a `KeyError` if the template contains a `[LOOKUP:table.column]` directive but `task_inputs` doesn't have a corresponding entry for that table. This would result in a 500 error rather than a more descriptive error message.

💡 Proposed defensive handling:

```diff
     if table not in ALLOWED_LOOKUP_TABLES:
         msg = f"Table {table} is not allowed for lookup."
         raise HTTPException(status_code=400, detail=msg)
+    if table not in task_inputs:
+        msg = f"Missing input for lookup table '{table}'"
+        raise ValueError(msg)
+
     row_data = await database.tasks.get_lookup_data(
         table=table,
         id_=int(task_inputs[table]),
         expdb=connection,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/routers/openml/tasks.py` around lines 132 - 142, The code currently assumes task_inputs contains the lookup key and will KeyError; update the lookup handling in the same block that checks ALLOWED_LOOKUP_TABLES to first confirm table in task_inputs (and that task_inputs[table] is not empty), and if missing raise an HTTPException(status_code=400) with a clear message; also guard the int conversion used for id_ before calling database.tasks.get_lookup_data (catch ValueError and raise HTTPException 400) so database.tasks.get_lookup_data(...) is only called with a validated integer id.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/database/tasks.py`:
- Around line 120-132: get_lookup_data currently always queries WHERE `id` =
:id_, which fails for tables whose PKs are named differently (e.g., task_type
uses ttid, dataset uses did); update get_lookup_data to either accept a
pk_column parameter or maintain an internal mapping from table name to primary
key column and use that column in the WHERE clause. Specifically, modify
get_lookup_data (and its callers if adding a param) to resolve the correct
column for table names "task_type" -> "ttid" and "dataset" -> "did" (fall back
to "id" for others) and interpolate that column name into the SQL text instead
of hardcoding `id`, keeping the table name whitelist behavior intact to avoid
injection.
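The per-table primary-key mapping described in that prompt could be sketched as follows (column names taken from the prompt itself; everything else is illustrative):

```python
# Sketch: resolve the primary-key column per table instead of hardcoding `id`.
PK_COLUMNS = {"task_type": "ttid", "dataset": "did"}


def lookup_query(table: str) -> str:
    pk = PK_COLUMNS.get(table, "id")  # fall back to `id` for other tables
    # Assumes `table` has already been validated against the whitelist.
    return f"SELECT * FROM {table} WHERE `{pk}` = :id_"
```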
---
Nitpick comments:
In `@src/routers/openml/tasks.py`:
- Around line 132-142: The code currently assumes task_inputs contains the
lookup key and will KeyError; update the lookup handling in the same block that
checks ALLOWED_LOOKUP_TABLES to first confirm table in task_inputs (and that
task_inputs[table] is not empty), and if missing raise an
HTTPException(status_code=400) with a clear message; also guard the int
conversion used for id_ before calling database.tasks.get_lookup_data (catch
ValueError and raise HTTPException 400) so database.tasks.get_lookup_data(...)
is only called with a validated integer id.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: a97e9c77-e321-4d99-be86-9333ff00fa45
📒 Files selected for processing (3)
- src/database/tasks.py
- src/routers/openml/datasets.py
- src/routers/openml/tasks.py
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/database/datasets.py`:
- Around line 348-377: The SQL text uses list parameters for the IN clauses (IN
:statuses and IN :data_ids) but doesn't mark them as expanding; update the code
that builds/binds the statement (the sql variable created with text(...)) to use
sqlalchemy.bindparam(..., expanding=True) for "statuses" and, when data_ids is
present, for "data_ids", and import bindparam from sqlalchemy; then pass the
bound statement to connection.execute (i.e., replace or wrap sql with
sql.bindparams(bindparam("statuses", expanding=True), bindparam("data_ids",
expanding=True)) only when data_ids is provided) while keeping the parameters
dict unchanged.
In `@src/routers/openml/datasets.py`:
- Around line 104-121: The call to database.datasets.list_datasets must be
updated to the new signature: replace passing pagination, data_id, status, and
user with low-level primitives limit, offset, data_ids, statuses, user_id, and
is_admin; compute limit/offset from the existing pagination object, convert
single data_id (or data_id param) into a list data_ids if present, map status
(handling DatasetStatusFilter.ALL by expanding it into the full list of
statuses) into statuses, and pass user.id (or None) and user.is_admin (or False)
for user_id and is_admin respectively before calling list_datasets in this
block.
In `@src/routers/openml/tasks.py`:
- Around line 144-154: The branch handling a missing lookup row raises a raw
ValueError, causing a 500 instead of the intended 400; change the row_data is
None path to raise an HTTPException(status_code=400, detail=...) consistent with
the except block. Specifically, in the block around the call to
database.tasks.get_lookup_data (variables table, id_val, connection, row_data,
task_inputs), replace the ValueError(msg) with raising HTTPException(400)
carrying the same message so get_task() returns the same lookup-validation
response as other get_lookup_data failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 2031f5f1-3029-49f1-a895-d0d5a1af9f43
📒 Files selected for processing (4)
- src/database/datasets.py
- src/database/tasks.py
- src/routers/openml/datasets.py
- src/routers/openml/tasks.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/database/tasks.py
…on, and error handling
♻️ Duplicate comments (1)
src/database/datasets.py (1)
**310-315**: ⚠️ Potential issue | 🟠 Major

Parameterize `user_id` instead of direct string interpolation.

Line 313 interpolates `user_id` directly into the SQL string. While `user_id` is currently an integer from the authenticated user, direct string interpolation into SQL is a security anti-pattern and should be avoided. Use a bind parameter instead.

Suggested fix:

```diff
 if is_admin:
     visible_to_user = "TRUE"
 elif user_id:
-    visible_to_user = f"(`visibility`='public' OR `uploader`={user_id})"
+    visible_to_user = "(`visibility`='public' OR `uploader`=:user_id)"
 else:
     visible_to_user = "`visibility`='public'"
```

Then ensure `user_id` is included in the parameters dict (it already is at line 371, but only when `user_id` is truthy, which aligns with this condition).

Note: The parameters dict at lines 366-375 doesn't currently include `user_id`. You'll need to add it:

```diff
 parameters = {
     "data_name": data_name,
     "data_version": data_version,
     "uploader": uploader,
     "tag": tag,
     "statuses": statuses,
     "limit": limit,
     "offset": offset,
+    "user_id": user_id,
     **q_params,
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/database/datasets.py` around lines 310 - 315, The SQL construction uses direct interpolation of user_id into visible_to_user (visible_to_user = f"(`visibility`='public' OR `uploader`={user_id})"), which is unsafe; change that to use a bind parameter (e.g. "(`visibility`='public' OR `uploader` = %(user_id)s" or the project's existing placeholder style) and add user_id to the parameters dict used to execute the query (the same params dict referenced near where visible_to_user is used) so the query is executed with a bound parameter rather than string interpolation.
🧹 Nitpick comments (2)
src/database/datasets.py (1)
**263-279**: Minor: duplicated regex matching with `_get_quality_filter`.

Both `_get_quality_filter` and `_get_range_params` perform the same regex match against `integer_range_regex`. Since `_get_quality_filter` is called first and raises `ValueError` on invalid input, this function's regex check at line 270 is effectively unreachable for invalid input. Consider having `_get_quality_filter` return the parsed groups to avoid the duplicated regex work.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/database/datasets.py` around lines 263 - 279, Avoid duplicating the integer_range_regex match: extract the parsing into a small helper (e.g., parse_integer_range or have _get_quality_filter return the matched groups) so that _get_quality_filter does the regex once and returns the parsed start/end groups; then update _get_range_params to accept those parsed groups (or call the shared parse helper) instead of re-running re.match against integer_range_regex. Update references to the returned tuple from _get_quality_filter (or the new parse function) in callers and remove the redundant regex check in _get_range_params.

src/routers/openml/datasets.py (1)
**139**: Consider using `strict=True` for defensive error detection.

The SQL query returns exactly 6 columns matching the `columns` list, so functionally this works. However, `strict=True` would catch any future mismatch between the query and the columns list during development rather than silently truncating/ignoring extra values.

Suggested change:

```diff
-    row.did: dict(zip(columns, row, strict=False)) for row in rows
+    row.did: dict(zip(columns, row, strict=True)) for row in rows
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/routers/openml/datasets.py` at line 139, The dict construction using dict(zip(columns, row, strict=False)) should use strict=True to fail fast on column/row length mismatches; update the expression in the comprehension that constructs rows (the line referencing zip(columns, row, strict=False)) to zip(columns, row, strict=True). Optionally wrap the comprehension in a try/except (catching TypeError) around the code that builds the dicts so you can log or raise a clearer error if the strict zip fails during development.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/database/datasets.py`:
- Around line 310-315: The SQL construction uses direct interpolation of user_id
into visible_to_user (visible_to_user = f"(`visibility`='public' OR
`uploader`={user_id})"), which is unsafe; change that to use a bind parameter
(e.g. "(`visibility`='public' OR `uploader` = %(user_id)s" or the project's
existing placeholder style) and add user_id to the parameters dict used to
execute the query (the same params dict referenced near where visible_to_user is
used) so the query is executed with a bound parameter rather than string
interpolation.
---
Nitpick comments:
In `@src/database/datasets.py`:
- Around line 263-279: Avoid duplicating the integer_range_regex match: extract
the parsing into a small helper (e.g., parse_integer_range or have
_get_quality_filter return the matched groups) so that _get_quality_filter does
the regex once and returns the parsed start/end groups; then update
_get_range_params to accept those parsed groups (or call the shared parse
helper) instead of re-running re.match against integer_range_regex. Update
references to the returned tuple from _get_quality_filter (or the new parse
function) in callers and remove the redundant regex check in _get_range_params.
In `@src/routers/openml/datasets.py`:
- Line 139: The dict construction using dict(zip(columns, row, strict=False))
should use strict=True to fail fast on column/row length mismatches; update the
expression in the comprehension that constructs rows (the line referencing
zip(columns, row, strict=False)) to zip(columns, row, strict=True). Optionally
wrap the comprehension in a try/except (catching TypeError) around the code that
builds the dicts so you can log or raise a clearer error if the strict zip fails
during development.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 2db5b3ac-6f89-448b-9a99-5539ceef7d86
📒 Files selected for processing (3)
- src/database/datasets.py
- src/routers/openml/datasets.py
- src/routers/openml/tasks.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/routers/openml/tasks.py
Description
This Pull Request implements an architectural refactoring of the Tasks router in src/routers/openml/tasks.py. Building on the security enhancements from PR 1, this change decouples the API routing logic from direct database interactions by moving SQL execution to a dedicated database module.
Key Changes:
- `LOOKUP` directives to ensure the system remains secure against SQL injection.

Checklist
- `pre-commit` passed; static analysis verified with `mypy` and `ruff`.