security: implement SQL parameterization and table whitelisting in datasets and tasks routers #282
MOHITKOURAV01 wants to merge 3 commits into openml:main
Conversation
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Walkthrough
SQL string interpolation was replaced with parameterized queries in src/routers/openml/datasets.py: query inputs (pagination, statuses, optional user_id, data_ids, and quality filters) are collected into a single params mapping.
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Hey - I've found 2 issues and left some high-level feedback:
- The `allowed_tables` set in `_fill_json_template` is recreated on every call; consider moving it to a module-level constant or using an Enum so the whitelist is centralized and easier to maintain.
- Although table names are now whitelisted in `_fill_json_template`, the SQL is still constructed via f-strings; consider mapping the user-facing keys to hardcoded SQL table identifiers instead of interpolating the string directly to further reduce the surface for mistakes.
- `list_datasets` is now complex enough to require multiple noqa flags (C901, PLR0913, PLR0915); consider extracting some of the filter/parameter-building logic into smaller helper functions to simplify control flow and make it easier to reason about the query.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `allowed_tables` set in `_fill_json_template` is recreated on every call; consider moving it to a module-level constant or using an Enum so the whitelist is centralized and easier to maintain.
- Although table names are now whitelisted in `_fill_json_template`, the SQL is still constructed via f-strings; consider mapping the user-facing keys to hardcoded SQL table identifiers instead of interpolating the string directly to further reduce surface for mistakes.
- `list_datasets` is now complex enough to require multiple noqa flags (C901, PLR0913, PLR0915); consider extracting some of the filter/parameter-building logic into smaller helper functions to simplify control flow and make it easier to reason about the query.
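The second point above, mapping user-facing keys to hardcoded identifiers, could look roughly like this (a sketch only; `LOOKUP_TABLE_SQL` and `resolve_table` are hypothetical names, not code from this PR):

```python
# Sketch of the "map user-facing keys to hardcoded SQL identifiers" idea.
# LOOKUP_TABLE_SQL and resolve_table are hypothetical names, not PR code.
LOOKUP_TABLE_SQL: dict[str, str] = {
    "estimation_procedure": "estimation_procedure",
    "evaluation_measure": "evaluation_measure",
    "task_type": "task_type",
    "dataset": "dataset",
}


def resolve_table(user_key: str) -> str:
    """Return a hardcoded SQL identifier for a user-facing key.

    Only values stored in LOOKUP_TABLE_SQL can ever reach the query string,
    so any later f-string interpolates a constant, never raw user input.
    """
    try:
        return LOOKUP_TABLE_SQL[user_key]
    except KeyError:
        msg = f"Table {user_key} is not allowed for lookup."
        raise ValueError(msg) from None


# The query is built from the resolved constant, not from user input.
sql = f"SELECT `name` FROM {resolve_table('dataset')} WHERE `did` = :did"
```

Even though the mapped values happen to equal the keys today, the indirection guarantees that the interpolated string is always one of the four stored constants.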
## Individual Comments
### Comment 1
<location path="src/routers/openml/tasks.py" line_range="133" />
<code_context>
table, _ = field.split(".")
+ # List of tables allowed for [LOOKUP:table.column] directive.
+ # This is a security measure to prevent SQL injection via table names.
+ allowed_tables = {"estimation_procedure", "evaluation_measure", "task_type", "dataset"}
+ if table not in allowed_tables:
+ msg = f"Table {table} is not allowed for lookup."
</code_context>
<issue_to_address>
**suggestion:** Hoist `allowed_tables` to a module-level constant and consider a more specific exception type.
Right now the set is rebuilt on every call; defining `ALLOWED_LOOKUP_TABLES = {...}` at module level avoids that overhead and makes the allowlist easier to manage. For the error, consider raising something the router can reliably translate to an appropriate HTTP status (e.g., a FastAPI `HTTPException` or a custom domain error) instead of a bare `ValueError` that may bubble up as a 500 if unhandled.
Suggested implementation:
```python
(field,) = match.groups()
if field not in fetched_data:
table, _ = field.split(".")
if table not in ALLOWED_LOOKUP_TABLES:
msg = f"Table {table} is not allowed for lookup."
raise HTTPException(status_code=400, detail=msg)
```
To fully implement the suggestion, you should also:
1. Define the module-level constant (near the top of `src/routers/openml/tasks.py`, after imports or alongside other constants):
```python
ALLOWED_LOOKUP_TABLES = {
"estimation_procedure",
"evaluation_measure",
"task_type",
"dataset",
}
```
2. Ensure `HTTPException` is imported from FastAPI. If you currently have something like:
```python
from fastapi import APIRouter
```
update it to:
```python
from fastapi import APIRouter, HTTPException
```
(or add a separate `from fastapi import HTTPException` line, depending on your style).
If you prefer a custom domain error instead of `HTTPException`, define that exception class in an appropriate module and replace the `HTTPException(...)` call with raising your domain-specific exception that the router already translates into an HTTP 4xx status.
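A minimal, dependency-free sketch of that domain-error option (all names here are hypothetical) is a dedicated exception type plus an explicit error-to-status mapping; in a FastAPI app the mapping would normally live in a handler registered via `@app.exception_handler`:

```python
# Hypothetical sketch of the "custom domain error" option: a dedicated
# exception type plus an explicit mapping to HTTP status codes. In a real
# FastAPI app this translation would typically sit inside a handler
# registered with @app.exception_handler(DisallowedLookupTableError).

class DisallowedLookupTableError(ValueError):
    """Raised when a [LOOKUP:table.column] directive names a non-whitelisted table."""


# Explicit translation table: domain error type -> HTTP status code.
DOMAIN_ERROR_STATUS: dict[type[Exception], int] = {
    DisallowedLookupTableError: 400,
}


def http_status_for(exc: Exception) -> int:
    """Map a raised domain error to a status; unknown errors stay 500."""
    for exc_type, status in DOMAIN_ERROR_STATUS.items():
        if isinstance(exc, exc_type):
            return status
    return 500
```

The advantage over raising `HTTPException` directly is that the lookup code stays framework-agnostic while the router layer decides the status code.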
</issue_to_address>
### Comment 2
<location path="src/routers/openml/datasets.py" line_range="156" />
<code_context>
else ""
)
- def quality_clause(quality: str, range_: str | None) -> str:
+ def quality_clause(range_: str | None, param_name: str) -> str:
if not range_:
</code_context>
<issue_to_address>
**issue (complexity):** Consider making `quality_clause` a pure function that returns its SQL and params and simplifying inline `bindparams` usage to avoid hidden coupling and extra mutable state.
You can keep the parameterization benefits while reducing indirection and hidden coupling by making `quality_clause` pure and letting the caller manage params, and by simplifying `bindparams` handling.
### 1. Make `quality_clause` explicit and side‑effect free
Instead of mutating `params` and relying on magic names like `quality_{param_name}`, have `quality_clause` return both the SQL fragment and its own parameter dict:
```python
def quality_clause(
quality_name: str,
range_: str | None,
*,
prefix: str,
) -> tuple[str, dict[str, Any]]:
if not range_:
return "", {}
if not (match := re.match(integer_range_regex, range_)):
msg = f"`range_` not a valid range: {range_}"
raise ValueError(msg)
start, end = match.groups()
clause_params: dict[str, Any] = {"quality_name": quality_name}
if end:
clause_params[f"{prefix}_start"] = int(start)
clause_params[f"{prefix}_end"] = int(end[2:])
value_clause = f"`value` BETWEEN :{prefix}_start AND :{prefix}_end"
else:
clause_params[f"{prefix}_val"] = int(start)
value_clause = f"`value` = :{prefix}_val"
sql = f""" AND
d.`did` IN (
SELECT `data`
FROM data_quality
WHERE `quality`=:quality_name AND {value_clause}
)
"""
return sql, clause_params
```
And at the call site:
```python
number_instances_filter, p = quality_clause(
"NumberOfInstances", number_instances, prefix="instances"
)
params.update(p)
number_classes_filter, p = quality_clause(
"NumberOfClasses", number_classes, prefix="classes"
)
params.update(p)
number_features_filter, p = quality_clause(
"NumberOfFeatures", number_features, prefix="features"
)
params.update(p)
number_missing_values_filter, p = quality_clause(
"NumberOfMissingValues", number_missing_values, prefix="missing"
)
params.update(p)
```
This removes:
- `params["quality_instances"] = "NumberOfInstances"` etc.
- The implicit dependency on `params` naming conventions inside `quality_clause`.
All parameter names are now explicit and local to the call site.
### 2. Simplify `bindparams` usage
You can also avoid a separate `bind_params` list and keep everything inline:
```python
matching_filter = (
text(
f"""
SELECT d.`did`, d.`name`, d.`version`, d.`format`, d.`file_id`,
IFNULL(cs.`status`, 'in_preparation')
FROM dataset AS d
LEFT JOIN ({current_status}) AS cs ON d.`did` = cs.`did`
WHERE {visible_to_user} {where_name} {where_version} {where_uploader}
{where_data_id} {matching_tag} {number_instances_filter} {number_features_filter}
{number_classes_filter} {number_missing_values_filter}
AND IFNULL(cs.`status`, 'in_preparation') IN :statuses
LIMIT :limit OFFSET :offset
"""
)
.bindparams(
bindparam("statuses", expanding=True),
bindparam("data_ids", expanding=True) if "data_ids" in params else bindparam("data_ids"),
)
)
```
Or, if you prefer, only bind `data_ids` when needed:
```python
matching_filter = text(...)
if "data_ids" in params:
matching_filter = matching_filter.bindparams(
bindparam("statuses", expanding=True),
bindparam("data_ids", expanding=True),
)
else:
matching_filter = matching_filter.bindparams(
bindparam("statuses", expanding=True),
)
```
This keeps the “what is bound” logic close to the query construction and removes the extra mutable `bind_params` list to track.
</issue_to_address>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/routers/openml/datasets.py`:
- Around lines 236-239: The bindparams call in the query-building code currently always adds bindparam("data_ids"), even when "data_ids" is not in params, causing a SQLAlchemy ArgumentError. Change the logic around the bindparams invocation to construct a list of bind parameters (e.g., always include bindparam("statuses", expanding=True) and only append bindparam("data_ids", expanding=True) when "data_ids" exists in params), then call .bindparams(*params_list) so "data_ids" is not bound on the no-filter path.
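Under the assumptions in that prompt (SQLAlchemy `text()`/`bindparam`, and a `params` dict that may or may not contain `data_ids`), the conditional binding could be sketched as follows; `build_matching_filter` is a hypothetical helper, not the PR's actual code:

```python
from sqlalchemy import bindparam, text
from sqlalchemy.sql.elements import TextClause


def build_matching_filter(sql: str, params: dict) -> TextClause:
    """Bind only the parameters that actually appear in this query.

    Binding a name that is absent from the SQL raises ArgumentError,
    so ``data_ids`` is appended only when that filter is active.
    """
    bind_params = [bindparam("statuses", expanding=True)]
    if "data_ids" in params:
        bind_params.append(bindparam("data_ids", expanding=True))
    return text(sql).bindparams(*bind_params)
```

Because the list is built before the single `.bindparams(*bind_params)` call, the no-filter path never mentions `data_ids` and so never trips the ArgumentError.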
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 000a2fb3-103f-40fb-98af-53276317bd1f
📒 Files selected for processing (2)
- src/routers/openml/datasets.py
- src/routers/openml/tasks.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/routers/openml/tasks.py
Changes
This Pull Request addresses critical SQL Injection (SQLi) vulnerabilities identified in the `datasets.py` and `tasks.py` routers. The changes replace insecure raw string concatenation with secure SQL parameterization and implement strict table whitelisting.

Key Changes:
- Refactored `list_datasets` in `src/routers/openml/datasets.py` to use SQLAlchemy `text()` and `bindparam`. This includes correctly handling list parameters with `expanding=True` for `IN` clauses to prevent injection via tags or IDs.
- Added table whitelisting in `src/routers/openml/tasks.py` to prevent unauthorized table access through manipulated input.
- Fixed `mypy` type-hint mismatches in `datasets.py` and suppressed false-positive security warnings (S608) after verifying structural safety with the new parameterized approach.
- Confirmed all `pre-commit` hooks (including `mypy` and `ruff`) are passing.

How to Test
- Send a malicious payload (e.g. `'study_14'); DROP TABLE dataset; --`) to the `/datasets/list` endpoint.
- Run `pre-commit run --all-files` to verify linting and security checks.
- Run `pytest` to ensure that standard dataset and task lookups still function correctly.

Checklist
- `pre-commit` passed; module loading verified in a fresh venv.

Related Issues
Closes #281