Friday, January 25, 2019

Re-Implementing Live2D runtime in LÖVE: Performance Optimization

Please see my previous blog post for more information. Am I really satisfied with 1.2ms? No, I think I can do better. Note that whenever I give a time measurement, it's the time taken to update the Kasumi casual summer model (shown below).


Code Optimization

After my previous blog post, I saw that there were many optimizations that could still be done. I started by reducing temporary table creation. Instead of creating a new table in the function body, I create the table at file scope and reuse it over and over. In the motion manager code, I used this variant of the algorithm to remove motion data when necessary. I also localized functions that are called every frame, mostly functions from the math namespace like math.min, math.floor, and math.max. I also cache variables that are used multiple times to reduce table lookup overhead.
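A minimal sketch of the scratch-table and localized-function ideas, not the actual runtime code. The helper name quantize and its behavior are hypothetical and only there to have something to measure:

```lua
-- Localize hot functions once at file scope, so each call is an upvalue
-- access instead of a global lookup followed by a table lookup.
local floor, min, max = math.floor, math.min, math.max

-- Scratch table created once at file scope and reused on every call,
-- instead of allocating a new table inside the function body.
local scratch = {}

-- Hypothetical helper: clamp each value to [lo, hi] and floor it.
-- The returned table is only valid until the next call.
local function quantize(values, lo, hi)
	local n = #values
	for i = 1, n do
		scratch[i] = floor(min(max(values[i], lo), hi))
	end
	-- Clear stale entries left over from a longer previous call.
	for i = n + 1, #scratch do
		scratch[i] = nil
	end
	return scratch
end
```

The trade-off is that the caller must not keep the result across calls, since the same table is handed out every time.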

Although the optimizations listed above don't save a significant amount of time when JIT is used, they are somewhat significant for the non-JIT codepath. The next optimization was converting the hair physics code to use an FFI datatype for the JIT codepath, and a class otherwise. Testing gives better performance, 1.17ms. Not much, but it's better than nothing.
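A rough sketch of what an FFI datatype with a fallback can look like. The struct layout, the names sketch_physics_point, newPoint and integrate, and the simple Euler step are all assumptions for illustration, and the fallback here is a plain table rather than a full class:

```lua
-- FFI is missing on vanilla Lua 5.1 and on LuaJIT builds compiled without it,
-- so load it defensively.
local hasFFI, ffi = pcall(require, "ffi")

local newPoint

if hasFFI then
	-- Hypothetical layout; the real physics state is likely different.
	ffi.cdef[[
	typedef struct {
		double x, y;
		double vx, vy;
	} sketch_physics_point;
	]]
	local ctor = ffi.typeof("sketch_physics_point")
	newPoint = function(x, y)
		return ctor(x, y, 0, 0)
	end
else
	-- Fallback with the same field names, so the code below works on both.
	newPoint = function(x, y)
		return {x = x, y = y, vx = 0, vy = 0}
	end
end

-- Same update code for both representations: accelerate, then move.
local function integrate(p, ax, ay, dt)
	p.vx = p.vx + ax * dt
	p.vy = p.vy + ay * dt
	p.x = p.x + p.vx * dt
	p.y = p.y + p.vy * dt
end
```

Because both variants expose the same fields, only the constructor has to know which codepath is active.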

Then a problem arose. When I inspected the verbose trace compiler output, I noticed lots of "NYI: register coalescing too complex" trace aborts in the curved surface deformer algorithm, which indicates that I was using too many local variables there. At first this was a bit hard to solve, but I managed to optimize it by analyzing the interpolation calculation done by the curved surface deformer algorithm. That solved the trace aborts entirely. Testing gives slightly better performance, 1.15ms.

Rendering Optimization

The last optimization I did was the Mesh optimization. Since I copied the Live2LOVE Mesh rendering codepath as-is, it was actually uploading lots of duplicate data to the GPU: it duplicated the vertices on the CPU side based on the vertex map, because I thought the vertex map could change. This can be very slow for the non-JIT codepath because the amount of data that needs to be sent in Mesh:setVertices can be too much. As a reference, before this optimization, the non-JIT codepath (LuaJIT interpreter) took 6ms.

After getting a better overview of how Live2D rendering works, I can safely assume the vertex map won't ever change, so I reduced the number of vertices that need to be uploaded to the GPU and send the vertex map instead. This actually gives a more significant performance boost on the CPU side. The JIT codepath now runs at 1.05ms, very close to Live2LOVE's 1ms. The interpreter (LuaJIT) took 4ms, yes, 4ms to update the model. Unfortunately, vanilla Lua 5.1 took as long as 12ms to update the model.
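To illustrate the difference, here is a small sketch using a single quad. The data and the helper name expandVertices are made up for the example. The LÖVE calls are left in comments so the snippet also runs outside LÖVE:

```lua
-- Old approach: expand shared vertices on the CPU so that every index in the
-- vertex map gets its own copy, then upload all of them.
local function expandVertices(vertices, vertexMap)
	local out = {}
	for i = 1, #vertexMap do
		out[i] = vertices[vertexMap[i]]
	end
	return out
end

-- A quad: 4 unique vertices {x, y, u, v} and 6 indices (two triangles).
local quadVertices = {
	{0, 0, 0, 0},
	{1, 0, 1, 0},
	{1, 1, 1, 1},
	{0, 1, 0, 1},
}
local quadMap = {1, 2, 3, 1, 3, 4}

-- New approach, roughly, when running inside LÖVE:
--   local mesh = love.graphics.newMesh(quadVertices, "triangles", "stream")
--   mesh:setVertexMap(quadMap)      -- once, since the map never changes
--   mesh:setVertices(quadVertices)  -- per update, unique vertices only
```

For this quad the old approach uploads 6 vertices per update and the indexed one uploads 4. The saving grows with how many triangles share each vertex.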

The non-JIT codepath is forced to use the table variant of Mesh:setVertices because the overhead of FFI is higher than the benefit of using the Data variant. Also, the non-JIT codepath can't assume FFI is available at all: LuaJIT can be compiled without FFI support (but who wants to do this?), or the code may be running in vanilla Lua 5.1. One of my goals for this project is to provide maximum compatibility with Lua 5.1 too, even though LÖVE is compiled with LuaJIT by default.
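One possible way to pick between the two codepaths at load time is sketched below. The function name pickUploadPath and the returned labels are hypothetical, and the real project may decide this differently:

```lua
-- Returns "data" when both the JIT compiler and FFI are usable, and "table"
-- otherwise (LuaJIT interpreter, LuaJIT without FFI, or vanilla Lua 5.1).
local function pickUploadPath()
	local hasFFI = pcall(require, "ffi")
	-- The global `jit` only exists on LuaJIT; the first return value of
	-- jit.status() tells whether the JIT compiler is currently enabled.
	local jitOn = type(jit) == "table" and jit.status() == true
	if hasFFI and jitOn then
		return "data"
	end
	return "table"
end
```

The check only relies on pcall and the optional jit global, so it is safe to run on vanilla Lua 5.1 as well.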

Experimental Rendering Codepath

Unfortunately I had to throw away the mesh batching technique I mentioned in my previous blog post. It causes a very significant slowdown in both the JIT and non-JIT codepaths with very little performance improvement on the GPU, so I decided to abandon it and go back to the old approach of updating models and drawing each Mesh one by one. You can see in the screenshot below that the model took 166 draw calls, plus additional draw calls caused by IMGUI.
