
Optimize JSON pretty print indentation performance #21474

Open

LamentXU123 wants to merge 8 commits into php:master from LamentXU123:optimaze-1

Conversation

@LamentXU123
Contributor

This PR optimizes the php_json_pretty_print_indent function in the JSON extension to improve performance when encoding structures with JSON_PRETTY_PRINT.

While reading this code, I found that the indentation logic uses a for loop to append spaces in increments of 4 characters per depth level. For a JSON structure of depth N, this results in N consecutive calls to smart_str_appendl, which introduces unnecessary overhead from repeated function calls.

So I introduced a space-time tradeoff: a static constant string of pre-allocated spaces. For almost all typical JSON depths, this reduces the number of smart_str_appendl calls from O(N) to exactly 1. This significantly reduces function call overhead and improves CPU performance during json_encode with JSON_PRETTY_PRINT, especially for deeply nested data.

@staabm
Contributor

staabm commented Mar 20, 2026

This significantly reduces function call overhead and improves CPU performance during json_encode with JSON_PRETTY_PRINT, especially for deeply nested data.

Could you give some before/after numbers?

@LamentXU123
Contributor Author

This significantly reduces function call overhead and improves CPU performance during json_encode with JSON_PRETTY_PRINT, especially for deeply nested data.

Could you give some before/after numbers?

Sure, but maybe tomorrow. I thought this was a fairly obvious optimization, so I didn't include benchmarks initially.

@iluuu1994
Member

A short benchmark would be appreciated. smart_str_appendl() is inlined, so whether this actually improves performance is hard to predict.

@LamentXU123
Contributor Author

LamentXU123 commented Mar 21, 2026

A short benchmark would be appreciated. smart_str_appendl() is inlined, so whether this actually improves performance is hard to predict.

(Edited) The benchmark script:

For very simple JSON structures (depth <= 2), the original version is slightly faster.

<?php
$data = [];
$ptr = &$data;
for ($i = 0; $i < 2; $i++) {
    $ptr["level_$i"] = ["msg" => "opt_test", "id" => $i];
    $ptr = &$ptr["level_$i"];
}
for ($i = 0; $i < 500000; $i++) {
    json_encode($data, JSON_PRETTY_PRINT);
}
?>
[benchmark screenshot]

For complex structures (depth=3), the optimized version is slightly faster.

<?php
$data = [];
$ptr = &$data;
for ($i = 0; $i < 3; $i++) {
    $ptr["level_$i"] = ["msg" => "opt_test", "id" => $i];
    $ptr = &$ptr["level_$i"];
}
for ($i = 0; $i < 500000; $i++) {
    json_encode($data, JSON_PRETTY_PRINT);
}
?>
[benchmark screenshot]

For very complex structures (depth=50), the optimized version is faster.

<?php
$data = [];
$ptr = &$data;
for ($i = 0; $i < 50; $i++) {
    $ptr["level_$i"] = ["msg" => "opt_test", "id" => $i];
    $ptr = &$ptr["level_$i"];
}
for ($i = 0; $i < 500000; $i++) {
    json_encode($data, JSON_PRETTY_PRINT);
}
?>
[benchmark screenshot]

It seems the original one is faster for very simple structures, so I am thinking of something like:

static inline void php_json_pretty_print_indent(smart_str *buf, int options, const php_json_encoder *encoder) /* {{{ */
{
    if (options & PHP_JSON_PRETTY_PRINT) {
        int depth = encoder->depth;
        if (depth <= 2) {
            int i;
            for (i = 0; i < depth; i++) {
                smart_str_appendl(buf, "    ", 4);
            }
        } else {
            size_t remaining = (size_t) depth * 4;
            char *dst = smart_str_extend(buf, remaining);
            memset(dst, ' ', remaining);
        }
    }
}

I ran the benchmark again with the code above.

With this code, the performance gap for simple structures (depth <= 2) drops to 1.03x:
[benchmark screenshot]

Overall, the optimized version is slightly slower for simple structures with depth below 3 (1.03x slower), but offers a significant improvement for nested data (2x faster at depth 50 and above, 1.02x faster at depth 3). This seems good to me.

@bukka
Member

bukka commented Mar 21, 2026

So you should probably raise the threshold in that condition (if (depth <= 2) {), right?

@LamentXU123
Contributor Author

So you should probably raise threshold in that condition (if (depth <= 2) {), right?

Well, when depth > 2 the optimized version starts to outperform the original, and the benefit grows as depth increases. I think the threshold can be raised to 8, since beyond that it provides a 1.10x speedup, which is useful enough.

Comment on lines +58 to +67
if (depth <= 8) {
    int i;
    for (i = 0; i < depth; i++) {
        smart_str_appendl(buf, "    ", 4);
    }
} else {
    size_t remaining = (size_t) depth * 4;
    char *dst = smart_str_extend(buf, remaining);
    memset(dst, ' ', remaining);
}
Contributor


How can the original loop be faster? The inlined functions are almost the same:

/* smart_str_appendl() */
size_t new_len = smart_str_alloc(dest, len, persistent);
memcpy(ZSTR_VAL(dest->s) + ZSTR_LEN(dest->s), str, len);
ZSTR_LEN(dest->s) = new_len;

/* smart_str_extend() */
size_t new_len = smart_str_alloc(dest, len, persistent);
char *ret = ZSTR_VAL(dest->s) + ZSTR_LEN(dest->s);
ZSTR_LEN(dest->s) = new_len;
return ret;

Is the return variable the bottleneck? Or maybe use the original approach with a longer spaces literal ("    ...") with like 64 spaces?

Contributor Author


How can the original loop be faster?

I am wondering this too; probably memset is causing extra cost... But I think that's OK, since the loss becomes negligible as depth grows.

I will run benchmarks to test the original approach later (though I highly doubt it could beat memset).

Member


Just a guess, memset with an arbitrary length may be overspecialized for large sizes. In that case, I don't see a big point of this PR. 50 levels of nesting seem pretty artificial. Nevertheless, I'm not code owner so I'll keep that decision up to those who are.

Contributor Author

LamentXU123 commented Mar 21, 2026


50 levels of nesting seem pretty artificial.

The original approach (defining a spaces constant) should work for simple JSON structures, I guess. I will post the test results later in this thread.

Contributor Author

LamentXU123 commented Mar 21, 2026


Just a guess, memset with an arbitrary length may be overspecialized for large sizes. In that case, I don't see a big point of this PR. 50 levels of nesting seem pretty artificial. Nevertheless, I'm not code owner so I'll keep that decision up to those who are.

<?php
$data = [];
$ptr = &$data;
for ($i = 0; $i < 2; $i++) {
    $ptr["level_$i"] = ["msg" => "opt_test", "id" => $i];
    $ptr = &$ptr["level_$i"];
}
for ($i = 0; $i < 500000; $i++) {
    json_encode($data, JSON_PRETTY_PRINT);
}
?>

depth 2: [benchmark screenshot]
depth 1: [benchmark screenshot]

With the original optimization approach (the pre-allocated spaces constant), the optimized version is 1.29x slower at depth 2 and 1.12x slower at depth 1.

So yes, this PR only helps with nested data.

Is the return variable the bottleneck? Or maybe use the original approach with a longer spaces literal ("    ...") with like 64 spaces?

This happens to be even slower, oddly. Probably due to compiler optimization: smart_str_appendl(buf, "    ", 4) lets the compiler turn the constant-size-4 memcpy into a single 32-bit integer store, avoiding the variable-length memcpy and branch evaluation present in the optimized block.
