[Major Rewrite] 600K->71K LOC Backend Rewrite: dynamic IL emission and kernel inlining, SIMD parallelism, 25 new np.* and more (#573)
Conversation
Leftover: Regen Templates That Could Become IL-Generated

Analysis of remaining hard-coded switch cases (~82K lines) that could be migrated to ILKernelGenerator:

High Priority: Axis Reductions (~45K lines)
High Impact: BLAS Operations (~36K lines)

Triple-nested switches (result × left × right = 1,728 type combinations):
Medium Priority: Other Operations
Reduction Ops Pending (defined in
Progress Update: SIMD-Optimized Matrix Multiplication

New Commits
What Changed

Replaced 20K-line Regen template with clean 300-line implementation:
MatMul Optimizations
Performance vs Old Scalar Implementation
NumPy Comparison (i9-13900K)

Benchmarked against NumPy 2.4.2 with OpenBLAS 0.3.31:
Key Findings
Future: Native BLAS Integration

For full NumPy parity, P/Invoke to OpenBLAS/MKL would close the remaining gap. The pure C# SIMD implementation is a solid foundation without native dependencies.
MatMul Performance Update

Added cache-blocked SIMD matrix multiplication achieving 14-17 GFLOPS (single-threaded).

Commits Added
Performance Results
Implementation Details

SimdMatMul.cs - New cache-blocked implementation:
Comparison to OpenBLAS
The cache-blocked implementation achieves ~40% of OpenBLAS single-thread performance without any parallelization, which is reasonable for a pure C#/.NET implementation without hand-tuned assembly.
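The cache-blocking idea behind SimdMatMul.cs can be sketched in Python. This is an illustration of the technique, not the C# implementation; the function name and the block size of 64 are arbitrary choices for the sketch, and the tile multiply is delegated to `@` to keep the blocking structure visible:

```python
import numpy as np

def blocked_matmul(a, b, block=64):
    """Cache-blocked matrix multiply: process (block x block) tiles so each
    tile pair stays cache-resident, accumulating into the output tile."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i0 in range(0, m, block):
        for k0 in range(0, k, block):
            for j0 in range(0, n, block):
                # one tile of A times one tile of B, accumulated into C;
                # slices past the edge are clipped automatically
                c[i0:i0 + block, j0:j0 + block] += (
                    a[i0:i0 + block, k0:k0 + block] @ b[k0:k0 + block, j0:j0 + block]
                )
    return c
```

A real implementation additionally vectorizes the innermost loop (Vector256 in the C# version) and tunes the block size to the L1/L2 cache.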
IL Kernel Migration Progress Update

Completed Migrations
Bug Fixes
Tests Added

~265 new NumPy-based tests covering edge cases

Benchmarks (1M float64)
IL Kernel Migration Progress - Batch 2

Just Completed
New IL Infrastructure
Bug Fixes
Tests
Cumulative Progress
Remaining Work
Definition of Done - IL Kernel Migration (Updated)

✅ Completed

IL Kernel Infrastructure (15,272+ lines)
Major Migrations (This Session)
Bug Fixes in PR
Issues Closed
| Item | Status | Notes |
|---|---|---|
| #576 SIMD Reductions | Flat ✅, Axis iterator | SIMD axis is complex |
| #577 SIMD Unary | Scalar Math.* calls | Would need SVML/libm |
| NaN Reductions | Not implemented | NanSum, NanProd, NanMin, NanMax |
| ATan2 | 141 lines Regen | Has stride/offset bug |
🔲 Optional Post-Merge Enhancements
Quick Wins
- Fix Bug_Modulo_CSharpSemantics - Python mod semantics
- Fix Bug_Argmin_IgnoresNaN - NaN handling
- Fix Bug_Prod_BoolArray_Crashes - Bool dtype support
- Fix Bug_Cumsum_BoolArray_Crashes - Bool dtype support
Medium Effort
- Migrate Default.ATan2.cs - Fix stride bug
- Add NanSum/NanProd/NanMin/NanMax reductions
- ClipNDArray (array bounds) - 595 lines
📊 Final Metrics
| Metric | Value |
|---|---|
| Lines removed | ~500K+ |
| Lines added (IL) | ~15K |
| Net reduction | ~485K lines |
| Operations migrated | 50+ |
| Tests added | ~265 |
| OpenBugs fixed | 4+ |
✅ Merge Criteria Met
- All existing tests pass
- No performance regressions
- CLAUDE.md documentation updated
- PR comments document progress
- Definition of Done documented
…hift, Var/Std SIMD

Major changes:
- Dot.NDMD: 15,880 → 419 lines (97% reduction) with SIMD for float/double
- CumSum axis: IL kernel with caching, optimized inner contiguous path
- LeftShift/RightShift: New ILKernelGenerator.Shift.cs (546 lines) with SIMD
- Var/Std axis: SIMD support for int/long/short/byte types

New IL infrastructure:
- ILKernelGenerator.Shift.cs - Bit shift operations with Vector256
- ILKernelGenerator.Scan.cs - Extended with axis cumsum support
- ILKernelGenerator.Reduction.cs - SIMD for integer types in Var/Std

Bug fixes:
- Single element Var/Std with ddof >= size returns NaN (NumPy parity)
- Dot tests Dot3412x5621 and Dot311x511 now pass (removed OpenBugs)

Documentation:
- CLAUDE.md updated with all migrations
- PR #573 comments with progress updates and Definition of Done

Test coverage:
- All Var/Std/CumSum/Shift/Dot tests passing
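The ddof edge case fixed here follows NumPy, where a variance over size - ddof <= 0 effective degrees of freedom yields NaN (with a runtime warning) instead of throwing:

```python
import numpy as np
import warnings

x = np.array([5.0])                  # single element
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # NumPy warns "Degrees of freedom <= 0"
    v = np.var(x, ddof=1)            # divisor is size - ddof = 0  ->  0/0
    s = np.std(x, ddof=1)
print(np.isnan(v), np.isnan(s))      # True True
```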
IL Kernel Migration - Final Cleanup Batch

Code Removed This Session
Bugs Fixed
Tests
Remaining Legacy Code

After comprehensive audit:
Ready for Review

Branch has multiple commits ready for merge.
IL Kernel Migration - Final Batch Complete 🎉

Just Completed
ATan2 Improvements
ClipNDArray SIMD
NaN Reductions (New Feature!)
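For reference, the NumPy semantics these NaN-aware reductions follow - NaN values are skipped instead of propagated (quick Python check):

```python
import numpy as np

x = np.array([1.0, np.nan, 4.0])
print(np.nansum(x))    # 5.0  (plain np.sum would give nan)
print(np.nanprod(x))   # 4.0
print(np.nanmin(x))    # 1.0
print(np.nanmax(x))    # 4.0
```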
Test Results
PR Summary

This PR is now feature-complete for the IL kernel migration:
Ready for review and merge.
IL Kernel Migration - Final Summary & Future Work

✅ Completed (Ready for Merge)

IL Kernel Infrastructure (17K+ lines):
Migrated Operations:
Bug Fixes:
Tests: 3700+ passing, ~265 new NumPy-based tests

🔧 Future Work (Post-Merge)

Easy Wins:
Medium Effort:
Deferred (Complex):
Code Review Findings (Verified)

What Works Well:
What Could Be Improved:
Issues Status
PR is feature-complete and ready for merge. 🚀
Performance Optimizations Batch

New Feature: CumProd (Cumulative Product)
MethodInfo Cache (~30 calls optimized)

Replaced inline reflection lookups with a MethodInfo cache.
Before: Reflection lookup on every kernel generation

Integer Abs/Sign Bitwise

Benefit: Pure integer ops, no float conversion overhead

Vector512 Support Extended

IL-generated code was already V512-ready via the abstraction layer.

Test Results
PR #573 - Final Status: Ready for Merge ✅

All Planned Work Complete

Issue #576: SIMD Axis Reductions ✅
Issue #577: SIMD Transcendentals ⏸️
Complete Feature List
Optimizations Added
Code Metrics
Issues Closing
This PR is complete and ready for final review and merge. 🚀
PR #573: ILKernelGenerator - Complete Summary

High-Level Metrics
Core Achievement: ILKernelGenerator

19,069 lines across 27 partial class files. Runtime IL generation via System.Reflection.Emit.

Core Infrastructure:
Unary Operations:
Reductions:
Scans & Specialized:
Masking:
Supporting Infrastructure
Deleted Regen Template Code (~536K lines)

63 files deleted in Math/ alone, each ~8,417 lines:
Massive File Refactorings
New APIs Implemented
Bug Fixes

Fixed Comparison/Operator Bugs:
Fixed Type/Dtype Bugs:
Fixed Math/Reduction Bugs:
Fixed Shift/BLAS Bugs:
Performance Improvements
NEP50 Type Promotion (NumPy 2.x)
Memory Allocation Modernization

Replaced deprecated Marshal.AllocHGlobal/FreeHGlobal with NativeMemory.Alloc/Free.

Test Infrastructure

Categories:
NumPy-Ported Edge Case Tests (2,820 lines added):
Benchmark Infrastructure

New
Categories: Allocation, Binary ops, Reduction, MatMul, Unary

Documentation Added
Execution Path Classification
This PR represents a fundamental architectural transformation of NumSharp from template-generated code to runtime IL generation, achieving massive code reduction (~536K lines), significant performance improvements (35-100x for MatMul), and comprehensive NumPy 2.x alignment. |
Fixed ArgumentOutOfRangeException when performing matrix multiplication
on arrays with more than 2 dimensions (e.g., (3,1,2,2) @ (3,2,2)).
Root causes:
1. Default.MatMul.cs: Loop count used `l.size` (total elements) instead
of `iterShape.size` (number of matrix pairs to multiply)
2. UnmanagedStorage.Getters.cs: When indexing into broadcast arrays:
- sliceSize incorrectly used parent's BufferSize for non-broadcast
subshapes instead of the subshape's actual size
- Shape offset was double-counted (once in GetSubshape, again because
InternalArray.Slice already positioned at offset)
The fix ensures:
- Correct iteration count over batch dimensions
- Proper sliceSize calculation based on subshape broadcast status
- Shape offset reset to 0 after array slicing
Verified against NumPy 2.4.2 output.
The tests incorrectly expected both arrays to have IsBroadcasted=True after np.broadcast_arrays(). Per NumPy semantics, only arrays that actually get broadcasted (have stride=0 for dimensions with size>1) should be flagged.

When broadcasting (1,1,1) with (1,10,1):
- Array 'a' (1,1,1 → 1,10,1): IsBroadcasted=True (strides become 0)
- Array 'b' (1,10,1 → 1,10,1): IsBroadcasted=False (no change, no zero strides)

NumSharp's behavior was correct; the test expectations were wrong.
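The stride-based definition of "actually broadcast" can be checked directly in NumPy:

```python
import numpy as np

a = np.zeros((1, 1, 1))
b = np.zeros((1, 10, 1))
A, B = np.broadcast_arrays(a, b)

print(A.shape, B.shape)   # (1, 10, 1) (1, 10, 1)
print(A.strides[1])       # 0 - axis 1 was stretched from 1 to 10, so 'a' IS broadcast
print(B.strides[1] != 0)  # True - 'b' already had size 10 there, nothing changed
```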
When np.sum() or np.mean() is called with keepdims=True and no axis specified (element-wise reduction), the result should preserve all dimensions as size 1.

Before: np.sum(arr_2d, keepdims=True).shape = (1,)
After: np.sum(arr_2d, keepdims=True).shape = (1, 1)

Fixed in both ReduceAdd and ReduceMean by reshaping to an array of 1s with the same number of dimensions as the input, instead of just calling ExpandDimension(0) once. Verified against NumPy 2.4.2 behavior.
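The expected NumPy behavior, for reference:

```python
import numpy as np

arr_2d = np.arange(6.0).reshape(2, 3)
print(np.sum(arr_2d, keepdims=True).shape)          # (1, 1) - all dims kept as size 1
print(np.mean(arr_2d, keepdims=True).shape)         # (1, 1)
print(np.sum(arr_2d, axis=0, keepdims=True).shape)  # (1, 3) - only the reduced axis collapses
```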
BREAKING CHANGE: shuffle now correctly shuffles along axis (default 0) instead of shuffling individual elements randomly.

Changes:
- Add optional `axis` parameter matching NumPy Generator.shuffle API
- Implement Fisher-Yates shuffle algorithm for axis-based shuffling
- Optimize 1D contiguous arrays with direct memory swapping
- Support negative axis values
- Throw ArgumentException for 0-d arrays (matches NumPy TypeError)
- Throw ArgumentOutOfRangeException for invalid axis

The previous implementation incorrectly shuffled individual elements randomly across the entire array. NumPy's shuffle operates along an axis (default 0), reordering subarrays while preserving their contents.

Example (2D array):
Before: [[0,1,2], [3,4,5], [6,7,8]]
axis=0: rows shuffled → [[6,7,8], [0,1,2], [3,4,5]]
axis=1: within-row shuffle → [[2,0,1], [5,3,4], [8,6,7]]
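The NumPy Generator semantics being matched, in Python (the seed 42 is arbitrary): shuffling along an axis moves whole subarrays as units, so each subarray's contents are preserved.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(9).reshape(3, 3)
rng.shuffle(x, axis=0)   # rows move as units; each row's contents survive intact

# every original row is still present, just possibly reordered
rows = {tuple(r) for r in x}
print(rows == {(0, 1, 2), (3, 4, 5), (6, 7, 8)})  # True
```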
Tests based on actual NumPy output covering:
- 1D·1D inner product
- 2D·1D and 1D·2D matrix-vector products
- 2D·2D matrix multiplication
- Scalar operations
- Mixed dtypes (int32·float64 → float64)
- Empty array edge case
- 3D·2D higher-dimension products
- Non-contiguous (strided) arrays
- Transposed arrays
- Column vector · row vector (outer product)
- Row vector · column vector
- Large matrices

Marked as OpenBugs:
- ND·1D: NumSharp returns wrong shape (2,4) instead of (2,3); the last axis of a should contract with the only axis of b
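The OpenBugs case above corresponds to this NumPy rule for np.dot with a 1-D second operand - the last axis of a contracts with b's only axis:

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones(4)
print(np.dot(a, b).shape)  # (2, 3) - the length-4 axis is summed away
```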
np.matmul tests:
- 2D @ 2D matrix multiplication ✓
- Large matrices ✓
- Transposed arrays ✓
- [OpenBugs] 1D @ 1D (requires 2D inputs in NumSharp)
- [OpenBugs] 2D @ 1D (returns (n,1) instead of (n,))
- [OpenBugs] 1D @ 2D (throws dimension mismatch)
- [OpenBugs] 3D broadcasting (crashes)

np.outer tests:
- Basic 1D outer product ✓
- Different sizes ✓
- 2D inputs (flattened) ✓
- Float arrays ✓
- Single element ✓

All passing tests verified against actual NumPy output.
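For reference, the np.matmul 1-D and broadcasting rules the OpenBugs cases track:

```python
import numpy as np

M = np.ones((2, 3))
v = np.ones(3)
print((M @ v).shape)                # (2,) - trailing vector axis removed, not kept as (2, 1)
print((v @ np.ones((3, 4))).shape)  # (4,) - leading vector axis removed
print(v @ v)                        # 3.0 - 1D @ 1D is the inner product (a scalar)
print((np.ones((5, 2, 3)) @ np.ones((5, 3, 4))).shape)  # (5, 2, 4) - batch broadcasting
```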
BREAKING: Removed incorrectly added `axis` parameter.

NumPy has two distinct shuffle APIs:
1. Legacy: np.random.shuffle(x) - axis 0 only, no axis param
2. Generator: rng.shuffle(x, axis=0) - supports axis param

This implementation now correctly matches the legacy API:
- Only shuffles along first axis (axis=0)
- No axis parameter
- Throws ArgumentException for 0-d arrays

The previous commit incorrectly added an axis parameter which does not exist in NumPy's legacy np.random.shuffle. For axis support, users should use a future Generator API implementation.
Fixes:
- Renamed `stardard_normal` to `standard_normal` (typo fix)
- Added backwards-compat alias with [Obsolete] warning
- Added `random()` as alias for `random_sample()` (NumPy compat)

NumPy random API audit findings:
- 19 functions implemented correctly
- 1 typo fixed (stardard_normal → standard_normal)
- 1 alias added (random → random_sample)
- 33 functions still missing (mostly rare distributions)
- bernoulli() is NumSharp-specific (not in NumPy, use scipy)
Parameter name changes to match NumPy exactly:
- beta: alpha,betaValue → a,b
- binomial: dims → size
- chisquare: dims → size
- choice: probabilities → p
- exponential: dims → size
- gamma: dims → size
- geometric: dims → size
- lognormal: dims → size
- normal: dims → size
- poisson: dims → size
- rand: size → d0 (matches NumPy's *args style)
- randn: size → d0 (matches NumPy's *args style)
- bernoulli: dims → size (NumSharp-specific)

Documentation improvements:
- Added NumPy doc links to all functions
- Improved parameter descriptions
- Added usage notes and examples
- Clarified default values

No functional changes - all existing tests pass.
Removed the [Obsolete] stardard_normal alias. The typo is fixed, no need for backwards compatibility shims.
Add Shape parameter overloads to match other random functions:
- randn(Shape shape) - delegates to randn(shape.dimensions)
- normal(double loc, double scale, Shape size) - delegates to params overload
- standard_normal(Shape size) - delegates to params overload

Also includes minor doc improvements:
- Align parameter names with NumPy (d0 → shape)
- Use Greek letters in beta docs (Alpha → α, Beta → β)
- Simplify random() alias docs
Migrate away from embedded DecimalMath.DecimalEx to internal NumSharp.Utilities.DecimalMath:
- Create DecimalMath.cs with only the functions we need: Sqrt, Pow, ATan2, Exp, Log, Log10, ATan
- Update Default.Reduction.Std.cs to use Utilities.DecimalMath.Sqrt
- Update Default.ATan2.cs to use Utilities.DecimalMath.ATan2
- Update ILKernelGenerator.cs MethodInfo references to point to new class
- Remove old DecimalEx.cs (~1000 lines → ~300 lines)

Benefits:
- Cleaner namespace (NumSharp.Utilities vs external DecimalMath)
- AggressiveInlining attribute for kernel integration
- No external dependency
- Only includes functions actually used

Closes #588
NumPy's intp is a signed integer type matching the platform's pointer size (32-bit on x86, 64-bit on x64). Previously mapped to int (always 32-bit).

- np.intp = typeof(nint) - native signed integer
- np.uintp = typeof(nuint) - native unsigned integer (new)

Note: These types are defined but not currently used in NumSharp operations. Full support would require adding NPTypeCode.NInt/NUInt and updating all type switches, but this change makes the type aliases correct.
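The correspondence can be sanity-checked on the NumPy side, where intp/uintp track the platform pointer width:

```python
import ctypes
import numpy as np

ptr = ctypes.sizeof(ctypes.c_void_p)       # 8 on x64, 4 on x86
print(np.dtype(np.intp).itemsize == ptr)   # True
print(np.dtype(np.uintp).itemsize == ptr)  # True
print(np.dtype(np.intp).kind)              # 'i' (signed)
```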
Enforce clean architecture: all computation on NDArray goes through TensorEngine. ILKernelGenerator is now an internal implementation detail of DefaultEngine.

Changes:
- Add abstract methods to TensorEngine: Any, NanSum, NanProd, NanMin, NanMax, BooleanMask
- Create Default.Any.cs with all 12 dtypes + SIMD support
- Create Default.Reduction.Nan.cs for NaN-aware reductions with SIMD
- Create Default.BooleanMask.cs for boolean masking with SIMD
- Enhance Default.All.cs with all 12 dtypes + SIMD support
- Simplify np.all/any/nansum/nanprod/nanmin/nanmax to single TensorEngine calls
- Route NDArray.Indexing.Masking through TensorEngine.BooleanMask()
- Replace KernelProvider.* calls with ILKernelGenerator.* in DefaultEngine partials

Violations fixed: 7 files no longer import NumSharp.Backends.Kernels outside Backends/

Test results: 3907 passed, 0 failed
…atic

Phase 5-7 of kernel architecture cleanup:
- Delete IKernelProvider.cs - premature abstraction with no alternative backends
- Remove DefaultEngine.DefaultKernelProvider static property
- Remove DefaultEngine.KernelProvider protected field
- Convert ILKernelGenerator from sealed class with Instance singleton to static class
- Update all 27 ILKernelGenerator partial files to use static partial class
- Update DefaultEngine to call ILKernelGenerator methods directly
- Remove BackendFactory usage from np.array_manipulation.cs (use NDArray constructors)
- Add NDArray(Type, Shape, char order) constructor for API consistency
- Enhanced NDArray.Indexing.Masking with partial shape match and scalar boolean support

ILKernelGenerator is now purely internal to DefaultEngine - all kernel access goes through TensorEngine, not direct kernel calls.

Verification (all return no results):
- grep "IKernelProvider" - interface removed
- grep "DefaultKernelProvider" - static property removed
- grep "ILKernelGenerator.Instance" - singleton removed
- grep -l "using NumSharp.Backends.Kernels" | grep -v /Backends/ - no external access
Add BooleanIndexing.BattleTests.cs with 76 tests covering all NumPy boolean indexing behaviors verified against NumPy 2.4.2 output:
- Same-shape boolean masks (1D, 2D, 3D) → 1D result
- Axis-0 row selection with 1D masks
- Partial shape match (2D mask on 3D array)
- 0-D boolean indexing (arr[True], arr[False])
- Boolean mask assignment (scalar and array values)
- Empty masks and edge cases
- Shape mismatch error handling
- Non-contiguous arrays (sliced, transposed)
- Broadcast arrays and comparisons
- All dtypes (float64, float32, int64, bool, byte, etc.)
- NaN and Infinity handling
- Logical operations on masks (&, |, !)
- Chained boolean indexing
- Result memory layout verification

Tests validate that boolean indexing always returns a copy (not view) and that results are always contiguous.
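A few of the NumPy behaviors these tests pin down, shown in Python:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
mask = a > 2
print(a[mask])                  # [3 4 5] - same-shape mask always yields a 1D result

row_mask = np.array([True, False])
print(a[row_mask])              # [[0 1 2]] - 1D mask selects rows along axis 0

sel = a[mask]                   # the result is a copy, not a view...
sel[:] = -1
print(a.max())                  # 5 - ...so writing to it leaves 'a' untouched

a[mask] = 0                     # but mask *assignment* writes into 'a'
print(a.max())                  # 2
```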
Phase 8 of kernel architecture cleanup - broadcasting is pure shape math, not engine-specific logic. Shape is now the canonical location.

Changes:
- Create Shape.Broadcasting.cs with static broadcast methods
- Simplify Default.Broadcasting.cs to delegate to Shape (no duplicate code)
- Update all callers to use Shape.* directly:
  - np.are_broadcastable.cs → Shape.AreBroadcastable()
  - np.broadcast.cs → Shape.ResolveReturnShape()
  - np.broadcast_arrays.cs → Shape.Broadcast()
  - np.broadcast_to.cs → Shape.Broadcast() (9 occurrences)
  - MultiIterator.cs → Shape.Broadcast() (2 occurrences)
  - Template files → Shape.Broadcast() (16 occurrences)

Result: 0 usages of DefaultEngine.Broadcast outside Backends (was 32+)

All broadcasting logic now lives in Shape struct where it belongs.
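The shape math now centralized in Shape.Broadcasting.cs is the standard NumPy broadcast rule. A pure-Python sketch (the helper name is illustrative, not NumSharp API): align shapes from the right; each dimension pair must be equal or contain a 1.

```python
import numpy as np

def broadcast_shape(*shapes):
    """Resolve the broadcast result shape, NumPy-style."""
    ndim = max(len(s) for s in shapes)
    out = []
    for axis in range(ndim):
        dim = 1
        for s in shapes:
            i = len(s) - ndim + axis      # right-aligned index; < 0 means implicit 1
            d = s[i] if i >= 0 else 1
            if d != 1 and dim != 1 and d != dim:
                raise ValueError(f"shapes {shapes} are not broadcastable")
            dim = max(dim, d)
        out.append(dim)
    return tuple(out)

print(broadcast_shape((1, 1, 1), (1, 10, 1)))  # (1, 10, 1)
print(broadcast_shape((2, 1, 3), (4, 3)))      # (2, 4, 3)
print(np.broadcast_shapes((2, 1, 3), (4, 3)))  # (2, 4, 3) - agrees with NumPy
```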
Remove "Compile Remove" and "None Include/Remove" project-file entries for deleted Regen template files (.template.cs, .tt) that are no longer part of the codebase after the ILKernelGenerator migration.
These files are internal development artifacts not meant for the public repo:
- CHANGES.md (release changelog - premature)
- .claude/SIMD_INVESTIGATION_RESULTS.md (investigation notes)
- docs/*_MIGRATION.md, docs/*_PLAN.md, docs/*_AUDIT.md (internal planning)
- docs/UNIFIED_ITERATOR_DESIGN.md, docs/SIMD_EXECUTION_PLAN.md (design docs)
Remove internal development artifacts:
- scripts/test-extraction/SIMD_TEST_COVERAGE.md
- docs/plans/*.md (6 planning/audit documents)
Revise and modernize the CLAUDE.md project documentation: remove explicit NumPy v2.4.2 pin, reword goals to target NumPy 2.x, and clarify core principles (breaking changes accepted to match NumPy). Replace and expand ILKernelGenerator section with concise coverage and file/category listing, update ILKernel implementation details and SIMD notes. Update Shape struct layout (add cached flags/hash/size fields, reorder fields). Remove the verbose Known Issues section, adjust Missing Functions count, and reorganize supported APIs into clearer categorized lists (Array creation, Math, Reductions, Linear Algebra, Random, I/O, etc.). Update CI test invocation to an explicit dotnet run for net10.0 and note OpenBugs file additions. Miscellaneous wording and consistency improvements throughout Q&A and architecture explanations.
Route all elementwise ArgMax/ArgMin cases (including Boolean, Single, Double) through the IL kernel path and remove the old scalar fallbacks. Added specialized IL helpers for float/double NaN-aware semantics and Boolean semantics (ArgMax/ArgMin helpers and EmitArgReductionStep variants), updated kernel generator to emit correct initial min/max for Boolean, and dispatch to type-specific helpers. Deleted legacy SimdReductionOptimized and a Boolean elementwise template, and adjusted engine calls to use ExecuteElementReduction for the unified path. These changes consolidate logic, ensure NumPy-like NaN handling (first NaN wins), and reduce duplication.
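The NumPy NaN rule the new float/double helpers replicate - NaN propagates through comparisons, so the first NaN's index wins for both argmax and argmin:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0, np.nan])
print(np.argmax(x))  # 1 - index of the first NaN, not of 3.0
print(np.argmin(x))  # 1 - same rule for argmin
```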
- Replace all docs.scipy.org/doc/numpy/ URLs with numpy.org/doc/stable/
- Fix numpy.random.* URLs: reference/generated/ → reference/random/generated/
- Fix numpy.bitwise_not.html → numpy.invert.html (function renamed)
- Fix NEP 41 URL: nep-0041-improved-dtype.html → nep-0041-improved-dtype-support.html
- Fix arrays.strings.html → routines.strings.html

The scipy documentation URLs have been deprecated and now redirect to numpy.org. Some URLs were returning 404 because the paths changed in the new location.

Files updated:
- README.md
- 16 docs/issues/*.md files
- 2 docs/neps/*.md files
- 1 docs/plans/*.md file
- 4 src/NumSharp.Core/*.cs files
- Remove .claude/worktrees/benchmark and .claude/worktrees/npalign from tracking
- These are local worktree references, not meant to be shared
- Added .claude/worktrees/ to .gitignore to prevent future commits
- Remove version suffix from broadcasting URL: stable-1.15.0 → stable
- Fix indexing URL path: reference/arrays.indexing → user/basics.indexing

Both URLs now point to the latest stable numpy documentation.
Summary
This PR implements the IL Kernel Generator, replacing NumSharp's ~500K+ lines of template-generated type-switch code with ~7K lines of dynamic IL emission using System.Reflection.Emit.

Closes #544 - [Core] Replace ~636K lines of generated math code with DynamicMethod IL emission
Closes #545 - [Core] SIMD-Optimized IL Emission (SIMD for contiguous arrays AND scalar broadcast)
Changes
Core Kernel Infrastructure (~7K lines)
- ILKernelGenerator.cs
- SimdKernels.cs
- ReductionKernel.cs
- BinaryKernel.cs

Dispatch Files
- DefaultEngine.BinaryOp.cs - Binary ops (Add, Sub, Mul, Div, Mod)
- DefaultEngine.UnaryOp.cs - 22 unary ops (Sin, Cos, Sqrt, Exp, etc.)
- DefaultEngine.CompareOp.cs - Comparisons (==, !=, <, >, <=, >=)
- DefaultEngine.BitwiseOp.cs - Bitwise AND/OR/XOR
- DefaultEngine.ReductionOp.cs - Element-wise reductions

Files Deleted (73 total)
Net change: -498,481 lines (13,553 additions, 512,034 deletions)
SIMD Optimizations
Execution Path SIMD Status
Note: Same-type operations (e.g., double + double) fall back to C# SimdKernels.cs, which has full SIMD for SimdFull, SimdScalarRight/Left, and SimdChunk paths.

Scalar Broadcast Optimization

SIMD scalar operations hoist Vector256.Create(scalar) outside the loop:

Benchmark (10M elements):
Bug Fixes Included
- operator & and operator | - Were completely broken (returned null)
- Log1p - Incorrectly using Log10 instead of Log
- int/int now returns float64 (NumPy 2.x behavior)
- Sign(NaN) - Now returns NaN instead of throwing ArithmeticException

Test Plan
Architecture
Performance
Future Work
- int + double_scalar cases

Additional: NativeMemory Modernization
Closes #528 - Modernize unmanaged allocation: Marshal.AllocHGlobal → NativeMemory
Replaced deprecated Marshal.AllocHGlobal/FreeHGlobal with the modern .NET 6+ NativeMemory.Alloc/Free API across 5 allocation sites. Benchmarks confirmed identical performance.