
Feature/tasks architectural refactoring#283

Open
MOHITKOURAV01 wants to merge 7 commits into openml:main from MOHITKOURAV01:feature/tasks-architectural-refactoring

Conversation

@MOHITKOURAV01

Description

This Pull Request implements an architectural refactoring of the Tasks router in src/routers/openml/tasks.py. Building on the security enhancements from PR 1, this change decouples the API routing logic from direct database interactions by moving SQL execution to a dedicated database module.

Key Changes:

  • Database Abstraction: Added a new get_lookup_data function in src/database/tasks.py to handle dynamic table lookups required by task templates.
  • Router Refactoring: Updated the recursive _fill_json_template function in src/routers/openml/tasks.py to use the new database layer, removing raw SQL execution from the router.
  • Security Maintenance: Preserved and integrated the strict table whitelisting for LOOKUP directives to ensure the system remains secure against SQL injection.
  • Code Quality: Improved maintainability by centralizing database access logic, making the Tasks router more focused on business logic and template processing.
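The whitelisting and lookup logic described above can be sketched in miniature. The names `ALLOWED_LOOKUP_TABLES` and `PK_MAPPING` come from this PR, but the table set, primary-key mapping, and the `build_lookup_query` helper below are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch of the whitelisted lookup described in this PR.
# Table names and PK columns are assumptions drawn from the review comments.
ALLOWED_LOOKUP_TABLES = {"task_type", "dataset", "estimation_procedure"}
PK_MAPPING = {"task_type": "ttid", "dataset": "did"}


def build_lookup_query(table: str) -> str:
    """Return a parameterized SELECT for a whitelisted table, or raise ValueError."""
    if table not in ALLOWED_LOOKUP_TABLES:
        raise ValueError(f"Table {table} is not allowed for lookup.")
    pk = PK_MAPPING.get(table, "id")
    # Only the whitelisted table name is interpolated; the id stays a bind parameter.
    return f"SELECT * FROM {table} WHERE `{pk}` = :id_"
```

Because the table name is checked against a closed set before interpolation and the id remains a bind parameter, a user-influenced `table` value cannot inject SQL.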

Checklist

  • I have performed a self-review of my own pull request to ensure it contains all relevant information, and the proposed changes are minimal but sufficient to accomplish their task.
  • Tests pass locally (pre-commit passed; static analysis verified with mypy and ruff).
  • I have commented my code in hard-to-understand areas, and provided or updated docstrings as needed.
  • Changes are already covered under existing tests.
  • This PR and the commits have been created autonomously by a bot/agent.

@coderabbitai
Contributor

coderabbitai bot commented Mar 21, 2026

Walkthrough

The PR adds `get_lookup_data` to `src/database/tasks.py` and introduces `ALLOWED_LOOKUP_TABLES` and `PK_MAPPING`. The OpenML datasets router now delegates dataset querying to `database.datasets.list_datasets` and uses non-strict zipping; `get_dataset_features` obtains nominal feature values via `database.datasets.get_feature_values_bulk`. The tasks router replaces inline lookup SQL with `get_lookup_data`, adds input validation, and maps lookup errors to HTTP 400. `src/database/datasets.py` gains `get_feature_values_bulk`, `list_datasets`, and helpers for numeric range parsing; `get_feature_ontologies` was adjusted for explicit typing and initialization.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 6.25%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title accurately describes the main architectural refactoring of the Tasks router, decoupling routing logic from database interactions and moving SQL execution to a dedicated database module.
  • Description check ✅ Passed: The description is directly related to the changeset, detailing the database abstraction layer, router refactoring, security maintenance, and code quality improvements implemented across multiple files.


Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues and left some high-level feedback:

  • The list_datasets endpoint has gained additional branching and parameter-building logic (extra quality filters, expanding bindparams, etc.); consider extracting the quality-filter/parameter construction into small helper functions to keep the main handler simpler and reduce the need for noqa: C901, PLR0913, PLR0915.
  • The lookup-table whitelist is currently enforced in the router while get_lookup_data still interpolates the table name via f-strings; you might centralize the whitelist or table-name validation in the database layer so that any future callers of get_lookup_data cannot bypass the restriction.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `list_datasets` endpoint has gained additional branching and parameter-building logic (extra quality filters, expanding bindparams, etc.); consider extracting the quality-filter/parameter construction into small helper functions to keep the main handler simpler and reduce the need for `noqa: C901, PLR0913, PLR0915`.
- The lookup-table whitelist is currently enforced in the router while `get_lookup_data` still interpolates the table name via f-strings; you might centralize the whitelist or table-name validation in the database layer so that any future callers of `get_lookup_data` cannot bypass the restriction.

## Individual Comments

### Comment 1
<location path="src/routers/openml/datasets.py" line_range="174-176" />
<code_context>
         if not range_:
-            return ""
+            return "", {}
         if not (match := re.match(integer_range_regex, range_)):
             msg = f"`range_` not a valid range: {range_}"
             raise ValueError(msg)
         start, end = match.groups()
-        value = f"`value` BETWEEN {start} AND {end[2:]}" if end else f"`value`={start}"
</code_context>
<issue_to_address>
**issue (bug_risk):** Using `ValueError` for invalid range input may lead to a 500 instead of a client-facing 4xx response.

If `range_` fails `integer_range_regex`, this `ValueError` will likely propagate as a 500 response, even though it’s due to bad input. Consider raising an `HTTPException` with a 4xx status (e.g., 400/422) here, or ensure `ValueError` is translated to an appropriate 4xx at a higher layer so clients see a correct classification of the error.
</issue_to_address>
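The fix suggested in this comment can be pictured with a small sketch. The regex shape below is an assumption inferred from the `end[2:]` slice in the diff (a single integer like "5", or an inclusive range like "5..10"); it is not the project's actual `integer_range_regex`:

```python
import re

# Assumed pattern based on the `end[2:]` slice in the diff above:
# a single integer ("5") or an inclusive range ("5..10").
integer_range_regex = r"^(\d+)(\.\.\d+)?$"


def parse_integer_range(range_: str) -> tuple[int, int]:
    """Parse "5" or "5..10" into (start, end); raise ValueError on bad input."""
    match = re.match(integer_range_regex, range_)
    if not match:
        msg = f"`range_` not a valid range: {range_}"
        raise ValueError(msg)
    start, end = match.groups()
    return int(start), int(end[2:]) if end else int(start)

# At the router boundary, this ValueError should be translated to a
# client-facing error, e.g. `raise HTTPException(status_code=400,
# detail=str(exc))` in FastAPI, rather than propagating as a 500.
```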

### Comment 2
<location path="src/database/tasks.py" line_range="120-129" />
<code_context>
     return [row.tag for row in tag_rows]
+
+
+async def get_lookup_data(table: str, id_: int, expdb: AsyncConnection) -> RowMapping | None:
+    # table is already whitelisted in the router
+    result = await expdb.execute(
+        text(
+            f"""
+            SELECT *
+            FROM {table}
+            WHERE `id` = :id_
+            """,  # noqa: S608
+        ),
+        parameters={"id_": id_},
+    )
+    return result.mappings().one_or_none()
</code_context>
<issue_to_address>
**🚨 issue (security):** Relying on external whitelisting for `table` makes `get_lookup_data` easy to misuse.

The inline comment assumes all future callers will enforce the same whitelist as the current router. Because this helper sits in the DB layer, it’s easy for someone to call it later with a user-influenced `table` and bypass that assumption, reintroducing SQL injection risk via the interpolated table name. Please either (a) enforce the whitelist inside this function, (b) narrow the type to something like an enum of allowed tables instead of `str`, or (c) explicitly restrict this helper to internal, non-user-derived table names and document that clearly.
</issue_to_address>
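Option (b) above might look like the following. The two table names are taken from the surrounding review context, but the enum and helper are a sketch, not the PR's code:

```python
from enum import Enum


class LookupTable(str, Enum):
    """Closed set of tables the lookup helper may touch (illustrative values)."""
    TASK_TYPE = "task_type"
    DATASET = "dataset"


def lookup_sql(table: LookupTable) -> str:
    """Build the lookup SELECT from an enum member only."""
    # Accepting only LookupTable members means the interpolated name can never
    # be an arbitrary user-supplied string, even if a future caller skips the
    # router-level whitelist check.
    if not isinstance(table, LookupTable):
        raise TypeError(f"expected LookupTable, got {type(table).__name__}")
    return f"SELECT * FROM {table.value} WHERE `id` = :id_"
```

With this narrowing, misuse fails at the type boundary instead of reaching the database as injected SQL.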


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/routers/openml/tasks.py (1)

132-142: Consider handling missing task_inputs key more gracefully.

On line 140, task_inputs[table] will raise a KeyError if the template contains a [LOOKUP:table.column] directive but task_inputs doesn't have a corresponding entry for that table. This would result in a 500 error rather than a more descriptive error message.

💡 Proposed defensive handling
             if table not in ALLOWED_LOOKUP_TABLES:
                 msg = f"Table {table} is not allowed for lookup."
                 raise HTTPException(status_code=400, detail=msg)

+            if table not in task_inputs:
+                msg = f"Missing input for lookup table '{table}'"
+                raise ValueError(msg)
+
             row_data = await database.tasks.get_lookup_data(
                 table=table,
                 id_=int(task_inputs[table]),
                 expdb=connection,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/tasks.py` around lines 132 - 142, The code currently
assumes task_inputs contains the lookup key and will KeyError; update the lookup
handling in the same block that checks ALLOWED_LOOKUP_TABLES to first confirm
table in task_inputs (and that task_inputs[table] is not empty), and if missing
raise an HTTPException(status_code=400) with a clear message; also guard the int
conversion used for id_ before calling database.tasks.get_lookup_data (catch
ValueError and raise HTTPException 400) so database.tasks.get_lookup_data(...)
is only called with a validated integer id.
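The guarded conversion described in this prompt can be sketched as a small helper. The function name `validated_lookup_id` is hypothetical, introduced only for illustration:

```python
def validated_lookup_id(task_inputs: dict[str, str], table: str) -> int:
    """Return task_inputs[table] as an int, raising ValueError with a clear message.

    Hypothetical helper: the router would catch this ValueError and re-raise it
    as an HTTPException(status_code=400) so clients see a 4xx, not a 500.
    """
    if table not in task_inputs or task_inputs[table] in ("", None):
        raise ValueError(f"Missing input for lookup table '{table}'")
    try:
        return int(task_inputs[table])
    except (TypeError, ValueError):
        raise ValueError(
            f"Invalid id for lookup table '{table}': {task_inputs[table]!r}",
        ) from None
```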
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/database/tasks.py`:
- Around line 120-132: get_lookup_data currently always queries WHERE `id` =
:id_, which fails for tables whose PKs are named differently (e.g., task_type
uses ttid, dataset uses did); update get_lookup_data to either accept a
pk_column parameter or maintain an internal mapping from table name to primary
key column and use that column in the WHERE clause. Specifically, modify
get_lookup_data (and its callers if adding a param) to resolve the correct
column for table names "task_type" -> "ttid" and "dataset" -> "did" (fall back
to "id" for others) and interpolate that column name into the SQL text instead
of hardcoding `id`, keeping the table name whitelist behavior intact to avoid
injection.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a97e9c77-e321-4d99-be86-9333ff00fa45

📥 Commits

Reviewing files that changed from the base of the PR and between 5a9eb8c and 7d8d391.

📒 Files selected for processing (3)
  • src/database/tasks.py
  • src/routers/openml/datasets.py
  • src/routers/openml/tasks.py

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/database/datasets.py`:
- Around line 348-377: The SQL text uses list parameters for the IN clauses (IN
:statuses and IN :data_ids) but doesn't mark them as expanding; update the code
that builds/binds the statement (the sql variable created with text(...)) to use
sqlalchemy.bindparam(..., expanding=True) for "statuses" and, when data_ids is
present, for "data_ids", and import bindparam from sqlalchemy; then pass the
bound statement to connection.execute (i.e., replace or wrap sql with
sql.bindparams(bindparam("statuses", expanding=True), bindparam("data_ids",
expanding=True)) only when data_ids is provided) while keeping the parameters
dict unchanged.

In `@src/routers/openml/datasets.py`:
- Around line 104-121: The call to database.datasets.list_datasets must be
updated to the new signature: replace passing pagination, data_id, status, and
user with low-level primitives limit, offset, data_ids, statuses, user_id, and
is_admin; compute limit/offset from the existing pagination object, convert
single data_id (or data_id param) into a list data_ids if present, map status
(handling DatasetStatusFilter.ALL by expanding it into the full list of
statuses) into statuses, and pass user.id (or None) and user.is_admin (or False)
for user_id and is_admin respectively before calling list_datasets in this
block.

In `@src/routers/openml/tasks.py`:
- Around line 144-154: The branch handling a missing lookup row raises a raw
ValueError, causing a 500 instead of the intended 400; change the row_data is
None path to raise an HTTPException(status_code=400, detail=...) consistent with
the except block. Specifically, in the block around the call to
database.tasks.get_lookup_data (variables table, id_val, connection, row_data,
task_inputs), replace the ValueError(msg) with raising HTTPException(400)
carrying the same message so get_task() returns the same lookup-validation
response as other get_lookup_data failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2031f5f1-3029-49f1-a895-d0d5a1af9f43

📥 Commits

Reviewing files that changed from the base of the PR and between 7d8d391 and 2054eb7.

📒 Files selected for processing (4)
  • src/database/datasets.py
  • src/database/tasks.py
  • src/routers/openml/datasets.py
  • src/routers/openml/tasks.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/database/tasks.py

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
src/database/datasets.py (1)

310-315: ⚠️ Potential issue | 🟠 Major

Parameterize user_id instead of direct string interpolation.

Line 313 interpolates user_id directly into the SQL string. While user_id is currently an integer from the authenticated user, direct string interpolation into SQL is a security anti-pattern and should be avoided. Use a bind parameter instead.

Suggested fix
     if is_admin:
         visible_to_user = "TRUE"
     elif user_id:
-        visible_to_user = f"(`visibility`='public' OR `uploader`={user_id})"
+        visible_to_user = "(`visibility`='public' OR `uploader`=:user_id)"
     else:
         visible_to_user = "`visibility`='public'"

Then ensure user_id is included in the parameters dict (it already is at line 371, but only when user_id is truthy which aligns with this condition).

Note: The parameters dict at line 366-375 doesn't currently include user_id. You'll need to add it:

     parameters = {
         "data_name": data_name,
         "data_version": data_version,
         "uploader": uploader,
         "tag": tag,
         "statuses": statuses,
         "limit": limit,
         "offset": offset,
+        "user_id": user_id,
         **q_params,
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/database/datasets.py` around lines 310 - 315, The SQL construction uses
direct interpolation of user_id into visible_to_user (visible_to_user =
f"(`visibility`='public' OR `uploader`={user_id})"), which is unsafe; change
that to use a bind parameter (e.g. "(`visibility`='public' OR `uploader` =
%(user_id)s" or the project's existing placeholder style) and add user_id to the
parameters dict used to execute the query (the same params dict referenced near
where visible_to_user is used) so the query is executed with a bound parameter
rather than string interpolation.
🧹 Nitpick comments (2)
src/database/datasets.py (1)

263-279: Minor: duplicated regex matching with _get_quality_filter.

Both _get_quality_filter and _get_range_params perform the same regex match against integer_range_regex. Since _get_quality_filter is called first and raises ValueError on invalid input, this function's regex check at line 270 is effectively unreachable for invalid input. Consider having _get_quality_filter return the parsed groups to avoid the duplicated regex work.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/database/datasets.py` around lines 263 - 279, Avoid duplicating the
integer_range_regex match: extract the parsing into a small helper (e.g.,
parse_integer_range or have _get_quality_filter return the matched groups) so
that _get_quality_filter does the regex once and returns the parsed start/end
groups; then update _get_range_params to accept those parsed groups (or call the
shared parse helper) instead of re-running re.match against integer_range_regex.
Update references to the returned tuple from _get_quality_filter (or the new
parse function) in callers and remove the redundant regex check in
_get_range_params.
src/routers/openml/datasets.py (1)

139-139: Consider using strict=True for defensive error detection.

The SQL query returns exactly 6 columns matching the columns list, so functionally this works. However, strict=True would catch any future mismatch between the query and columns list during development rather than silently truncating/ignoring extra values.

Suggested change
-        row.did: dict(zip(columns, row, strict=False)) for row in rows
+        row.did: dict(zip(columns, row, strict=True)) for row in rows
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` at line 139, The dict construction using
dict(zip(columns, row, strict=False)) should use strict=True to fail fast on
column/row length mismatches; update the expression in the comprehension that
constructs rows (the line referencing zip(columns, row, strict=False)) to
zip(columns, row, strict=True). Optionally wrap the comprehension in a
try/except (catching TypeError) around the code that builds the dicts so you can
log or raise a clearer error if the strict zip fails during development.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2db5b3ac-6f89-448b-9a99-5539ceef7d86

📥 Commits

Reviewing files that changed from the base of the PR and between 2054eb7 and db9b7fb.

📒 Files selected for processing (3)
  • src/database/datasets.py
  • src/routers/openml/datasets.py
  • src/routers/openml/tasks.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/routers/openml/tasks.py
