
Friday, January 25, 2019

Re-Implementing Live2D runtime in LÖVE: Performance Optimization

Please see my previous blog post for more information. Do you think I'm really satisfied with the 1.2ms performance? No, I think I can do more. Note that whenever I write a time measurement, it means the time taken to update the Kasumi casual summer model (shown below).


Code Optimization

Looking back after my previous blog post, I saw there were many optimizations still to be done. I started by reducing temporary table creation: instead of creating a new table in the function body, I create the table once at file scope and reuse it over and over. In the motion manager code, I used this variant of the algorithm to remove motion data when necessary. I also localized functions that are called every frame, mostly functions from the math namespace like math.min, math.floor, and math.max, and cached variables that are used multiple times to reduce table lookup overhead.
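A minimal sketch of those two techniques (the names here are made up, not the actual runtime code):

```lua
-- File-scope scratch table plus localized math functions: the table is
-- allocated once and reused every call, and min/max become upvalues
-- (one lookup) instead of two table lookups (math -> min) per call.
local tempParams = {}
local min, max = math.min, math.max

-- Clamp `count` raw parameter values into the shared scratch table.
local function clampParams(raw, count)
	for i = 1, count do
		tempParams[i] = max(min(raw[i], 1), -1)
	end
	return tempParams -- same table object every call: no per-frame garbage
end

local out = clampParams({2, -3, 0.5}, 3)
```

The caller must treat the returned table as transient, since the next call overwrites it.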

Although the optimizations listed above don't save a significant amount of time when the JIT is used, they are a somewhat significant optimization for the non-JIT codepath. The next optimization was converting the hair physics code to use FFI datatypes on the JIT codepath, and a class otherwise. Testing gives better performance, 1.17ms. Not much, but better than nothing.
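The JIT-or-class split can be sketched like this (the point struct is a made-up stand-in for the actual hair physics types):

```lua
-- Pick an FFI struct when running under LuaJIT with the compiler on,
-- and a plain-table "class" otherwise (interpreter, vanilla Lua 5.1,
-- or a LuaJIT build without FFI).
local useFFI = type(jit) == "table" and jit.status()
local ok, ffi = pcall(require, "ffi")

local newPoint
if useFFI and ok then
	local pointT = ffi.typeof("struct { double x, y; }")
	newPoint = function(x, y) return pointT(x, y) end
else
	newPoint = function(x, y) return {x = x, y = y} end
end

local p = newPoint(3, 4) -- same interface on both codepaths
```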

A problem arose when I inspected the verbose trace compiler output: I noticed lots of "NYI: register coalescing too complex" trace aborts in the curved surface deformer algorithm, which indicates I was using too many local variables there. At first this was a bit hard to solve, but I managed to optimize it by analyzing the interpolation calculation done by the curved surface deformer, which eliminated the trace aborts entirely. Testing gives slightly better performance, 1.15ms.
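The general idea, shown on a toy bilinear interpolation rather than the actual deformer code: folding repeated subexpressions keeps fewer locals (and thus registers) alive inside the hot path.

```lua
-- Interpolate inside a quad patch. Lerping along x twice and then
-- along y once needs only two intermediate locals, instead of keeping
-- t, 1-t, u, 1-u and four corner products alive at the same time.
local function bilerp(p00, p10, p01, p11, t, u)
	local a = p00 + (p10 - p00) * t
	local b = p01 + (p11 - p01) * t
	return a + (b - a) * u
end

local mid = bilerp(0, 1, 2, 3, 0.5, 0.5) -- 1.5, the center of the patch
```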

Rendering Optimization

The last optimization I did is the Mesh optimization. Since I copied the Live2LOVE Mesh rendering codepath as-is, it was actually uploading lots of duplicate data to the GPU, duplicating the vertices based on the vertex map manually on the CPU side, because I thought the vertex map could change. This can be very slow on the non-JIT codepath because the amount of data that needs to be sent in Mesh:setVertices can be too much. For reference, before this optimization, the non-JIT codepath (LuaJIT interpreter) took 6ms.

After getting a better overview of how Live2D rendering works, it's safe to assume the vertex map never changes, so I started by reducing the amount of vertex data that needs to be uploaded to the GPU and sending the vertex map once. This actually gives a more significant performance boost on the CPU side. The JIT codepath now runs at 1.05ms, very close to Live2LOVE's 1ms. The interpreter (LuaJIT) took 4ms, yes, 4ms to update the model. Unfortunately, vanilla Lua 5.1 took as long as 12ms.
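The dedup itself is plain Lua and can be sketched like this (vertices are strings here for brevity; the real ones are vertex attribute lists). The unique list is what goes to Mesh:setVertices every frame, while the index map goes to Mesh:setVertexMap once at load time.

```lua
-- Build the unique vertex list plus an index map from an expanded
-- per-corner vertex list, so duplicates are uploaded only once.
local function dedupVertices(expanded)
	local unique, map, seen = {}, {}, {}
	for i, v in ipairs(expanded) do
		local id = seen[v]
		if not id then
			unique[#unique + 1] = v
			id = #unique
			seen[v] = id
		end
		map[i] = id
	end
	return unique, map
end

local uniq, map = dedupVertices({"a", "b", "a", "c", "b"})
-- uniq = {"a", "b", "c"}; map = {1, 2, 1, 3, 2}
```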

The non-JIT codepath is forced to use the table variant of Mesh:setVertices because the overhead of FFI is higher than the benefit of the Data variant. The non-JIT codepath also can't assume FFI is available at all: LuaJIT can be compiled without FFI support (but who would want that?), or the code may be running under vanilla Lua 5.1. One of my goals for this project is maximum compatibility with Lua 5.1, even though LÖVE is compiled with LuaJIT by default.

Experimental Rendering Codepath

Unfortunately I had to throw away the mesh batching technique I mentioned in my previous blog post. It causes a very significant slowdown in both the JIT and non-JIT codepaths with very little GPU improvement, so I decided to abandon it and return to the old approach of updating the model and drawing the Meshes one by one. You can see in the screenshot below that the model takes 166 draw calls,

plus additional draw calls caused by IMGUI.

Wednesday, January 16, 2019

Re-Implementing Live2D runtime in LÖVE


 (video above shows my implementation in action using LÖVE framework)

Live2D is a nice thing; the fluid character movement adds an extra touch to games that use it. For my own use, however, it has an annoying limitation: Lua is not officially supported, especially not LÖVE. There are two ways to overcome this.

Writing an external Lua C module

This is probably the simplest way (though not that easy). Link against the Live2D C++ runtime library files, add code that interacts with Lua (and LÖVE), and you get Live2LOVE. This module is actually very fast, considering the Lua-to-C API overhead, and works by letting Live2D do the model transformation and LÖVE do the rendering. Since it uses the official Live2D runtime, it has these limitations:
  • Must link with OpenGL. This is not a problem since LÖVE uses OpenGL already.
  • VS2017 is not supported (you have to use Cubism 3 for that). However, it supports down to VS2010, and LÖVE requires VS2013, so this is not really a problem unless you compile LÖVE against the VS2017 runtime.
  • MinGW/Cygwin compilation is not supported. Not really a problem, since compiling LÖVE on Windows with MinGW/Cygwin is itself unsupported.
  • Linux and macOS are not supported. This is the real problem. Not everyone uses Windows to run LÖVE.
So another idea came to my mind:

Re-Implement Live2D Cubism 2 Runtime

In Lua, because why not. This was actually a rather time-consuming process and took me more than 3 weeks to get model rendering working as intended. An additional goal was a Live2LOVE-compatible interface, so switching between implementations is simply a matter of changing the "require"d module. From now on, I'll refer to the Live2D Runtime as the Live2D Cubism 2 Runtime.

I started by downloading the Live2D Runtime for WebGL (JavaScript) and beautifying the code (since it ships as a .min.js file). As expected, the method names are obfuscated. So I unpacked the Android version of the Live2D C++ Runtime and deobfuscated the method names by matching the arguments against the Live2D C++ Runtime header files, with the help of the IDA pseudocode decompiler to compare the implementation with the JavaScript one. This whole process took 2 weeks.

Then I started writing the Lua equivalent based on the JavaScript Live2D Runtime. This is the easiest part, since JavaScript is also dynamically typed; it's mostly a matter of carefully translating 0-based indexing to 1-based. Then came fixing bugs, writing the Live2LOVE-compatible interface so I could reuse my existing Live2D viewer code that uses Live2LOVE, and testing.

After a week, I had model rendering, motion, and expressions working, using the existing code from my LÖVE-based Live2D model viewer pointed at my own implementation instead of Live2LOVE. Then the next problem appeared: it was 4x slower than Live2LOVE. Live2LOVE took 1ms to render Kasumi (casual summer clothes) while my implementation took 4ms for the same model, even though I had already written the implementation carefully so that LuaJIT happily accepts my code and bails out to the interpreter as little as possible.

I started optimizing by using "Data" objects instead of plain tables when updating the "Mesh" object for drawing. This cuts the update time significantly, from 4ms to 1.7ms, so using tables to update the "Mesh" object is always a bad idea. Someone on the LÖVE Discord then said, "try to use FFI everywhere instead of plain tables". At first I disagreed, because I want to preserve compatibility with mobile, but I decided to proceed by falling back to tables when FFI is not suitable (JIT is off, FFI support is not compiled in, or we're running vanilla Lua). I swapped most types from plain tables to FFI objects and got as low as 1.2ms, close to Live2LOVE's 1ms.
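A hedged sketch of the FFI side (the vertex layout below is my guess at a typical LÖVE vertex format, and the ByteData upload at the end is commented out because it needs a running LÖVE):

```lua
-- Requires LuaJIT. One flat FFI array holds every vertex; it is filled
-- in place each frame instead of rebuilding a Lua table of tables.
local ffi = require("ffi")

ffi.cdef[[
typedef struct {
	float x, y, u, v;
	uint8_t r, g, b, a;
} vertex;
]]

local count = 4
local verts = ffi.new("vertex[?]", count)

for i = 0, count - 1 do -- FFI arrays are 0-based
	verts[i].x, verts[i].y = i * 10, 0
	verts[i].r, verts[i].g, verts[i].b, verts[i].a = 255, 255, 255, 255
end

-- Inside LÖVE, this buffer would then be wrapped and uploaded, e.g.:
--   mesh:setVertices(love.data.newByteData(ffi.string(verts, ffi.sizeof(verts))))
```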

Conclusion

Re-implementing the Live2D Runtime was a nice experience. It gave me a better sense of when to start optimizing code instead of optimizing early, and a better overview of how Live2D model transformation works. Apparently it can't beat the C++ version of the official Live2D Runtime in model updating, but I think it can beat it in model rendering. I'm thinking of a "Mesh" batching technique, which is basically: accumulate vertices to render, then draw them all at once when a flush is requested. I'm satisfied with the current result, but I think I can still do better
 

... and hope to God it succeeds without problems.

Friday, June 29, 2018

LÖVE Optimization Tips

It’s common for people to write a game and then go “damn, it performs very slowly”, or “ok, this game is fast enough” only to complain “damn, it’s slow af” when testing on slower hardware. These tips will help you squeeze more performance out of your game. They assume you’re using LÖVE with LuaJIT as the Lua VM; if you have a different configuration (like LÖVE with Lua 5.2 or Lua 5.1), some points here may not be relevant.

Note that these things may be considered micro-optimizations on a desktop PC, but they’re noticeable on mobile.

Move your game computation in love.update, and keep love.draw solely for rendering.


Okay, let’s see why. First, math computation in LuaJIT is compiled, which means it’s fast. Second, calls to the love.graphics API in LuaJIT are not compiled, and they prevent the code around them from being compiled. So instead of

function love.draw()
 var1 = var1 + 20 -- this computation is done in the LuaJIT interpreter
 -- because the call to love.graphics.draw below can't be compiled
 love.graphics.draw(image, var1 * 0.02, 0)
end

Do it like this.

function love.update()
 -- this computation is done in machine code because
 -- it's compiled.
 var1 = var1 + 20 * 0.02 -- LuaJIT will optimize 20*0.02 to 0.4  
end

function love.draw()
 love.graphics.draw(image, var1, 0)  
end

A note: this is less relevant on mobile (iOS and Android), as the JIT compiler is disabled by default there.

Use SpriteBatch, especially if it’s static.


Admit it. Calls to SpriteBatch object methods will never be compiled. As of LÖVE 11.0, there’s a feature called autobatching, which automatically batches your draws, much like a SpriteBatch in “dynamic” or “stream” mode.

Now, why use a SpriteBatch for static batches? Because it’s better to call love.graphics.draw once to draw many things than to call love.graphics.draw for each thing. The former saves CPU due to Lua-to-C API call overhead, allowing your game to run faster.
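A minimal static-SpriteBatch sketch (“tiles.png” and the tile grid are placeholders):

```lua
-- love.load: build the batch once; "static" hints that the GPU buffer
-- won't change after this, and drawing it is a single draw call.
local batch

function love.load()
	local image = love.graphics.newImage("tiles.png")
	batch = love.graphics.newSpriteBatch(image, 600, "static")
	for y = 0, 19 do
		for x = 0, 29 do
			batch:add(x * 32, y * 32) -- one quad per tile
		end
	end
end

function love.draw()
	love.graphics.draw(batch) -- 600 sprites, one love.graphics.draw call
end
```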

Only use pairs when you don’t have a choice.


Even with LuaJIT 2.1’s trace stitching feature (which allows uncompiled functions like love.graphics.draw to no longer break JIT compilation), pairs still breaks JIT compilation because it uses uncompiled bytecode. You can check which operations are compiled or not here.

If you really need to index by something other than numbers, store the keys in a numbered array, iterate that, then use each key to index the actual table. Then you can use ipairs or a simple for loop to iterate (both are compiled). Something like this:

myTableKey = {"key1", "key2", "key3"}
myTableValue = {key1 = "Hello", key2 = "World", key3 = io.write}
for i, v in ipairs(myTableKey) do
 local value = myTableValue[v]
 -- do something with `value`.
 -- It can be "Hello", "World", or the `io.write` function.
end

Also, pairs doesn’t guarantee iteration in definition order. When you run pairs on myTableValue above, key3 might come first, then key1, then key2.

Use io.write for simple debugging, not print.


print is not compiled, but io.write is. The downside of io.write is that you lose print’s automatic value separation and object-to-string conversion. If you don’t care about that, this is solely a matter of preference: if you need the full power of print, use print; for very simple variable printing, use io.write.
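A tiny helper of my own (not part of LÖVE) that restores print’s conveniences on top of io.write:

```lua
-- Tab-separate and tostring() the arguments like print does, but emit
-- them through io.write, which LuaJIT can compile.
local function dbg(...)
	local n, parts = select("#", ...), {}
	for i = 1, n do
		parts[i] = tostring((select(i, ...)))
	end
	local line = table.concat(parts, "\t")
	io.write(line, "\n")
	return line
end

dbg("frame", 42, nil) -- writes "frame\t42\tnil"
```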

Never use FFI for computation optimization purpose.


Very true if you’re targeting mobile devices. Check out this chat I took from the LÖVE Discord server:

A: Will vector-light be faster than brinevector?
B: Depends. Brinevector is faster, but only on desktop. It’s terribly slow on mobile. Vector-light (hump?) is faster, but the code is messy since that’s only for vector operations.
A: If I made an FFI version of vector-light that was just pure functions, it would be faster, right?

That’s where the problem begins. Once you step into FFI, your phone will start to cry when running your code. “Wait, based on benchmarks, FFI is faster than plain Lua tables.” That kind of argument doesn’t work once the JIT compiler is disabled. I’ve benchmarked it, and in most cases FFI is ~20x slower than plain Lua when the JIT compiler is disabled. Why does that matter for mobile devices? Take a look at the first tip above: mobile devices have the JIT compiler disabled by default, which means slow FFI.

Well, you shouldn’t use FFI on mobile devices at all anyway. Starting from Android 7, calls to ffi.load will simply fail due to changes in Android 7’s dlopen and dlsym, which restrict those functions so greatly that ffi.load ceases to function. And on iOS, I don’t see the point of using FFI, as everything must be statically compiled.
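If you still have to try, guard the load so the failure is graceful (“mylib” is a placeholder name):

```lua
-- pcall both the FFI require (fails on vanilla Lua / FFI-less builds)
-- and ffi.load (fails on Android 7+), then fall back to pure Lua.
local hasLib, lib = false, nil
local ok, ffi = pcall(require, "ffi")
if ok then
	hasLib, lib = pcall(ffi.load, "mylib")
end

if not hasLib then
	lib = nil -- use the pure-Lua fallback implementation instead
end
```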

Toggle the JIT compiler before love.conf is called, i.e. in conf.lua


You may ask yourself: why? The answer is that LÖVE applies some FFI call optimizations when it detects it’s running with the JIT compiler enabled, mainly in the love.sound and love.image modules, and the detection is done before main.lua is loaded. Some people get this wrong by turning the JIT compiler on/off in main.lua, which can lead to slower performance.
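So the toggle belongs at the very top of conf.lua, for example:

```lua
-- conf.lua runs before LÖVE checks jit.status(), so toggling the
-- compiler here is seen by love.sound / love.image.
if type(jit) == "table" then
	-- jit.off() -- uncomment to disable the compiler, e.g. for debugging
	jit.on()
end

function love.conf(t)
	t.window.title = "My Game" -- placeholder config
end
```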

Conclusion


These kinds of optimizations mostly help on the CPU side. For the GPU, you need to reduce draw calls, reduce canvas usage (or don’t use canvases at all), and use compressed textures. If it’s still slow but CPU usage stays low, you’re likely fillrate-limited, in which case nothing here will really help you.

Note that I may still be wrong, so if you have something to add, or something above is invalid, just comment below to discuss it.

'What was the original problem you were trying to fix?' 'Well, I noticed one of the tools I was using had an inefficiency that was wasting my time.'

Also, did I mention that the JIT compiler for ARM64 is unstable?

Friday, March 16, 2018

Fixing Image Quality of SIF Cards: RGBA4444 vs. RGB565

Some people were discussing how bad SIF card image quality has been since the New Year Kanan UR. They said KLab somehow changed the dithering method used for cards and now always uses RGBA4444. I argued that RGB565 is better for clean UR cards and actually makes sense (a clean UR card doesn't need transparency, but a "navi"[1] requires transparency).

Here's the original card image, used for comparison (the latest Ruby UR card, paired with Yoshiko, as I write this):

1489u-orig.png

You can notice the grainy picture, i.e. noise caused by dithering and RGBA4444 color loss. I started to think something was wrong with the dithering algorithm. Here's a version with selective Gaussian blur applied (someone in the discussion did the blur):

1489u-sgb.png

Selective Gaussian blur does the job, it seems, but I said waifu2x could achieve something better. He disagreed, saying waifu2x destroys the edges and some parts of the image. Well, I did it anyway. I tried several methods, like noise level 1, noise level 2, and noise level 2 followed by 2x scale then 50% downscale, but I found noise level 1 already does the job pretty well, and noise level 2 destroys some details, as he said. Here's the result:

1489u-w2x-noise1.png

Okay, it looks good now, so let's apply RGBA4444 conversion and dithering to the noise-removed image. Unfortunately, ImageMagick can't do RGBA4444 conversion (or any 16-bit RGB color conversion) and dithering at the same time, so I have to generate the necessary color tables (RGBA4444 and RGB565 for our purposes) and use ImageMagick's -remap option later. Here are the commands I use to generate the color tables:

C:\Users\MikuAuahDark\Pictures\1489>magick convert -size 256x256 xc:none -channel R -fx "((i*w+j)&15)/15.0" -channel G -fx "(((i*w+j)>>4)&15)/15.0" -channel B -fx "(((i*w+j)>>8)&15)/15.0" -channel A -fx "(((i*w+j)>>12)&15)/15.0" rgba4444.png
C:\Users\MikuAuahDark\Pictures\1489>magick convert -size 256x256 xc:white -channel R -fx "((i*w+j)&31)/31.0" -channel G -fx "(((i*w+j)>>5)&63)/63.0" -channel B -fx "(((i*w+j)>>11)&31)/31.0" rgb565.png

Now we can tell ImageMagick to use our color table:

C:\Users\MikuAuahDark\Pictures\1489>magick convert 1489u-w2x-noise1.png -dither none -remap rgba4444.png 1489u-w2x-noise1-rgba444none.png
C:\Users\MikuAuahDark\Pictures\1489>magick convert 1489u-w2x-noise1.png -dither riemersma -remap rgba4444.png 1489u-w2x-noise1-rgba444rie.png
C:\Users\MikuAuahDark\Pictures\1489>magick convert 1489u-w2x-noise1.png -dither floydsteinberg -remap rgba4444.png 1489u-w2x-noise1-rgba444fstein.png

And here's the result:

1489u-w2x-noise1-rgba444none.png (-dither none)
1489u-w2x-noise1-rgba444rie.png (-dither riemersma)
1489u-w2x-noise1-rgba444fstein.png (-dither floydsteinberg)

With -dither none, the color banding is very noticeable, while -dither riemersma and -dither floydsteinberg produce results similar to the dithering algorithm KLab uses (if you look closely, KLab's dithering looks a bit better). So it's RGBA4444 that's causing the image quality problems. Okay, now let's switch to RGB565.

magick convert 1489u-w2x-noise1.png -dither none -remap rgb565.png 1489u-w2x-noise1-rgb565none.png
magick convert 1489u-w2x-noise1.png -dither riemersma -remap rgb565.png 1489u-w2x-noise1-rgb565rie.png
magick convert 1489u-w2x-noise1.png -dither floydsteinberg -remap rgb565.png 1489u-w2x-noise1-rgb565fstein.png

1489u-w2x-noise1-rgb565none.png (-dither none)
1489u-w2x-noise1-rgb565rie.png (-dither riemersma)
1489u-w2x-noise1-rgb565fstein.png (-dither floydsteinberg)

Oh! The RGB565 variant looks way better than the RGBA4444-encoded ones. My argument is correct: RGB565 is certainly better for clean UR cards than RGBA4444. The problem is that the RGB565 variant is roughly twice the size of the RGBA4444 variant, and that's just the PNG representation, so let's simulate how it's stored in the SIF game engine texture bank format and compare sizes there.

The SIF texture bank can store raw pixel data in a variety of pixel formats[2]. For RGB, there are RGBA5551, RGBA4444, RGB565, and RGBA8888 (RGBA8888 is the usual pixel format used in images and Photoshop RGB/8). The raw pixel data can be stored uncompressed or compressed with zlib[3], PVR, ETC, or other compression formats supported by the GPU, but most of the time KLab just uses zlib, so that's what we'll use. For this example, we pick the -dither floydsteinberg RGB565 and RGBA4444 variants. For the sake of tooling, I used a Linux WSL environment, since I don't know how to specify the compression level with openssl zlib.

Unfortunately, neither FFmpeg nor ImageMagick can output raw RGBA4444, so I wrote my own script to do it in WSL. Here's the Lua script:

#!/usr/bin/env luajit
-- Expected RGBA8888 stdin

local ffi = require("ffi")
local w = assert(tonumber(arg[1]))
local h = assert(tonumber(arg[2]))

local rgba4444 = ffi.new("uint16_t[?]", w*h)
local rgbstruct = ffi.typeof("union {struct {uint8_t r,g,b,a;} rgba; uint8_t raw[4];}")

for i = 1, w*h do
    local d = rgbstruct {raw = io.read(4)}
    -- Remember FFI arrays are 0-based index
    local r = bit.lshift(bit.rshift(d.rgba.r, 4), 12)
    local g = bit.lshift(bit.rshift(d.rgba.g, 4), 8)
    local b = bit.lshift(bit.rshift(d.rgba.b, 4), 4)
    local a = bit.rshift(d.rgba.a, 4)
    rgba4444[i-1] = bit.bor(bit.bor(r, g), bit.bor(b, a))
end

io.write(ffi.string(rgba4444, w*h*ffi.sizeof("uint16_t")))
os.exit(0)

And here are the commands I used (I had to rename WSL's bash.exe to winbash so it doesn't clash with MSYS/Cygwin bash):

C:\Users\MikuAuahDark\Pictures\1489>env winbash
mikuauahdark@Xyrillia-20166:/mnt/c/Users/MikuAuahDark/Pictures/1489$ ffmpeg -i 1489u-w2x-noise1-rgb565fstein.png -pix_fmt rgb565 -f rawvideo 1489u-w2x-noise1-fstein.rgb565
mikuauahdark@Xyrillia-20166:/mnt/c/Users/MikuAuahDark/Pictures/1489$ ffmpeg -i 1489u-w2x-noise1-rgba444fstein.png -pix_fmt rgba -f rawvideo - | ./rgba4444.lua 512 720 > 1489u-w2x-noise1-fstein.rgba4444
mikuauahdark@Xyrillia-20166:/mnt/c/Users/MikuAuahDark/Pictures/1489$ cat 1489u-w2x-noise1-fstein.rgba4444 | pigz -z -11 > test-1489u-rgba4444.zz
mikuauahdark@Xyrillia-20166:/mnt/c/Users/MikuAuahDark/Pictures/1489$ cat 1489u-w2x-noise1-fstein.rgb565 | pigz -z -11 > test-1489u-rgb565.zz

With just this information, we can estimate the texture bank sizes. The RGBA4444 variant (compressed) is 245,965 bytes and the RGB565 variant (compressed) is 399,097 bytes; the RGB565 file is about 162% the size of the RGBA4444 one. Now let's calculate the additional size in SIF data if RGB565 were used. Assume an average relative size of 165%, an average of 500KB per card file (in RGBA4444), and that all clean UR cards are currently encoded in RGBA4444 (there are actually some cards already encoded in RGB565, but let's assume all are RGBA4444 to simplify things).

As of writing, School Idol Tomodachi says there are 86 SSR and 234 UR cards. Not all cards have an unidolized form, some only have the idolized variant, but we can estimate those from School Idol Tomodachi too: there are 0 SSR and 79 UR cards with only an idolized form[4], for a total of 561 card images[5]. The total size of the cards when encoded is:

RGBA4444 Cards = 561 Cards * 500 KB = 274 MB
RGB565 Cards = RGBA4444 Cards (274 MB) * 165% = 452 MB
Size Increase = 452 MB - 274 MB = 178 MB

The size increase is ~178MB. How big is that? BanG Dream Girls Band Party's voice data is ~150MB, if I remember correctly. The Live Simulator: 2 APK is 23MB, so that's equivalent to 8 copies of Live Simulator: 2 on your Android phone. 178MB is around twice the SIF install data size (the data preloaded onto your phone when you start SIF for the first time). A size increase of 178MB should be acceptable, so I think re-encoding to RGB565 would be worth it. The problem is that you'd have to download ~452MB of cards if this happens.


The conclusion: I see no reason why KLab uses RGBA4444 for clean UR cards except to reduce size, but the quality-size tradeoff is simply unbalanced (as in UNBALANCED LOVE), so KLab should change the pixel format of their clean UR cards from RGBA4444 to RGB565 to improve image quality, and then use the Genki Zenkai dither algorithm with the DAY! DAY! DAY! recipe to decrease the size a bit.


[1]: "navi" means the transparent variant of SIF cards, the one used in your SIF home menu. Dunno if KLab is making an Avatar joke, but of course KLab doesn't know anything about the Legend of Aang.

[2]: https://github.com/MikuAuahDark/Itsudemo/blob/master/TEXB_stream_format.txt#L29

[3]: https://github.com/MikuAuahDark/Itsudemo/blob/master/src/TEXBLoad.cpp#L195

[4]: https://schoolido.lu/cards/?search=&rarity=UR&attribute=&is_promo=on and https://schoolido.lu/cards/?search=&rarity=SSR&attribute=&is_promo=on

[5]: IdolizedMultipler*(SSR TotalCard+UR TotalCard)-(SSR PromoCard+UR PromoCard)

I wrote this post in Markdown and converted it to HTML, so I'm sorry if there are formatting problems.