Monday, October 21, 2019

Enabling Private DNS on Modified Android Builds That Lack the Setting?

One of the best features in Android 9 is Private DNS, which keeps your DNS requests from being modified in any way by a third party, or even by your ISP (this always happens in Indonesia, huh). Basically it encrypts your DNS requests so that no one except the destination resolver can see them, and even if someone could see them, they can't tamper with them because the connection is integrity-protected.

Enough of that. Some phones running Pie unfortunately lack this feature. Some OSes like MIUI actually just hide it, so invoking "am start com.android.settings/.Settings\$NetworkDashboardSetting" will show it. However, something like ColorOS completely removed it from their settings, so we shouldn't rely on that MIUI method.

Now my idea was: what if we set those options from ADB instead? It took some research and ADB shell + grep on my phone, and here's what I found.


$ settings list global | grep dns
private_dns_mode=hostname
private_dns_specifier=1dot1dot1dot1.cloudflare-dns.com

It looks clear that we can simply set those values from ADB. The shell does have access to modify those settings, at least on my phone (Mi A1 as of writing). So here are the possible combinations of settings.


$ settings put global private_dns_specifier resolver_hostname
$ settings put global private_dns_mode off|opportunistic|hostname

Change "resolver_hostname" to something like "1dot1dot1dot1.cloudflare-dns.com" and see if it works. Note that the Private DNS hostname setting only works if "private_dns_specifier" is set to "hostname". If your phone stops connecting to internet (can't resolve any hostname), that means you messed up the "private_dns_specifier". Double check and try again.

Note that this method works on my phone, which can also set those options from the settings UI, so it would be good if someone could test it on a phone that runs Android 9 but lacks the option in its settings UI.

Update: If you get something like "Neither user 2000 nor current process has android.permission.WRITE_SECURE_SETTINGS", that means the OS customizations enforce some additional protection. You may (or may not) be able to relax that from the developer options window; then try again. Thanks to my friend for testing this on a Realme 3 Pro, where the feature actually works as intended.

Sunday, June 30, 2019

MediaFoundation decoder for LOVE & decoding in-memory.

Windows 7 has its own COM-based API to help decode a variety of audio formats, and if you are concerned about patents (AAC, why), don't worry, as it's a licensed decoder. This blog post is about how I wrote a LOVE decoder that uses MediaFoundation.

Basics

The first thing to know is what the LOVE Decoder class looks like.
class Decoder : public Object
{
public:
    static love::Type type;
    Decoder(Data *data, int bufferSize);
    virtual ~Decoder();
    static const int DEFAULT_BUFFER_SIZE = 16384;
    static const int DEFAULT_SAMPLE_RATE = 44100;
    static const int DEFAULT_CHANNELS = 2;
    static const int DEFAULT_BIT_DEPTH = 16;
    virtual Decoder *clone() = 0;
    virtual int decode() = 0;
    virtual int getSize() const;
    virtual void *getBuffer() const;
    virtual bool seek(double s) = 0;
    virtual bool rewind() = 0;
    virtual bool isSeekable() = 0;
    virtual bool isFinished();
    virtual int getChannelCount() const = 0;
    virtual int getBitDepth() const = 0;
    virtual int getSampleRate() const;
    virtual double getDuration() = 0;

protected:
    StrongRef<Data> data;
    int bufferSize;
    int sampleRate;
    void *buffer;
    bool eof;
};
Anything that ends with = 0; is a pure virtual method, which we must implement in our derived class. Now let's derive the Decoder class.

MFDecoder Class

class MFDecoder: public Decoder
{
public:
    MFDecoder(Data *data, int bufferSize);
    virtual ~MFDecoder();

    static bool accepts(const std::string &ext);
    static void quit();
    Decoder *clone();
    int decode();
    bool seek(double s);
    bool rewind();
    bool isSeekable();
    int getChannelCount() const;
    int getBitDepth() const;
    double getDuration();

private:
    static bool initialize();
    static void *initData;
    // non-exposed datatype to prevent cluttering LOVE
    // includes with Windows.h
    void *mfData;
    // channel count
    int channels;
    // temporary buffer
    std::vector<char> tempBuffer;
    // amount of temporary PCM buffer
    int tempBufferCount;
    // byte depth
    int byteDepth;
    // duration
    double duration;
};
There are a few points worth noting here.
  • The MFDecoder constructor receives a LOVE Data object, which is data located in a block of memory, and the desired buffer size.
  • The MediaFoundation API can return any number of samples per read, so we use a temporary buffer to hold the leftovers after decoding (a decode() sketch follows after this list).
  • You may notice that mfData is a void*. The reason is that there's a problem compiling LOVE if Windows.h is included BEFORE the keyboard module. We also don't want to clutter the includes with Windows-specific headers and drag down compilation time.
  • tempBuffer is a std::vector. Yes, this is intentional: we don't have to manage the allocated memory ourselves and we take advantage of RAII. It also helps when MediaFoundation returns more data than the provided buffer, since we can simply grow the temporary buffer.
  • Then there's the initData member. It is set by the initialize function above it, which is called whenever a new MediaFoundation decoder is created or when the compatible extensions are checked.

Problems

Now there are some problems.
  • MediaFoundation doesn’t officially support loading media from memory.
    MediaFoundation assumes the media file resides somewhere in the local filesystem or on the network, probably due to its DRM nature. There's this blog post, but it only works on Windows 8 and uses C++/CLI, which has two problems of its own:
    1. We can't use C++/CLI when compiling LOVE; it would increase our compilation times and add bloat, which we want to keep to a minimum.
    2. My target is to make the decoder available on Windows 7 and later, while IRandomAccessStream and the function that post uses are Windows 8 and later only.
    Then I found out there's MFCreateMFByteStreamOnStream, which accepts an IStream interface. IStream has been available since Windows 2000, and SHCreateMemStream can be used to create one over a block of memory (used in the setup sketch below).
  • The functions that I use require linking to Mfplat.dll and Mfreadwrite.dll.
    The latter is only available starting with Windows 7. Since I want to make sure it still runs on Windows Vista too (without the MediaFoundation decoding capabilities, of course), I have to load them dynamically, hence the MFDecoder::initialize() static function (see the loader sketch after this list).
  • Once the IMFByteStream is created, you have to set the MIME type.
    It's done by querying the IMFByteStream for its IMFAttributes interface and setting the MIME type there. Unfortunately, as of LOVE commit ccf9e63, the decoder constructor no longer receives the audio extension, so we have to test every possibly supported MIME type. Fortunately, Microsoft gives us a list of supported media formats in their documentation, so I just grab the MIME strings from the IIS MIME Types list.
Once those problems are resolved, it's only a matter of setting the properties of the IMFSourceReader, such as asking it to output PCM. A rough sketch of the whole setup is shown below.
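Putting the pieces together, here's a sketch of that setup path, from a block of memory all the way to an IMFSourceReader that outputs PCM. Error handling is reduced to early returns, the MIME type is hard-coded to "audio/mpeg" purely for illustration, and the MediaFoundation entry points are linked directly here even though the real decoder resolves them at runtime as described above. It also assumes MFStartup() has already been called.

#include <windows.h>
#include <shlwapi.h> // SHCreateMemStream
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>

// Build an IMFSourceReader that decodes an in-memory audio file to PCM.
IMFSourceReader *createPCMReader(const void *data, UINT size)
{
    // 1. Wrap the memory block in an IStream (available since Windows 2000).
    IStream *stream = SHCreateMemStream((const BYTE *)data, size);
    if (!stream) return nullptr;

    // 2. Wrap the IStream in an IMFByteStream.
    IMFByteStream *byteStream = nullptr;
    HRESULT hr = MFCreateMFByteStreamOnStream(stream, &byteStream);
    stream->Release();
    if (FAILED(hr)) return nullptr;

    // 3. Tell MediaFoundation what the data is by setting the MIME type
    //    through the byte stream's IMFAttributes interface.
    IMFAttributes *attributes = nullptr;
    if (SUCCEEDED(byteStream->QueryInterface(IID_PPV_ARGS(&attributes))))
    {
        attributes->SetString(MF_BYTESTREAM_CONTENT_TYPE, L"audio/mpeg");
        attributes->Release();
    }

    // 4. Create the source reader on top of the byte stream.
    IMFSourceReader *reader = nullptr;
    hr = MFCreateSourceReaderFromByteStream(byteStream, nullptr, &reader);
    byteStream->Release();
    if (FAILED(hr)) return nullptr;

    // 5. Ask the reader to decode the first audio stream to raw PCM.
    IMFMediaType *pcm = nullptr;
    MFCreateMediaType(&pcm);
    pcm->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Audio);
    pcm->SetGUID(MF_MT_SUBTYPE, MFAudioFormat_PCM);
    reader->SetCurrentMediaType((DWORD)MF_SOURCE_READER_FIRST_AUDIO_STREAM, nullptr, pcm);
    pcm->Release();

    return reader;
}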

Seeking

Yes, the IMFSourceReader supports seeking, but there's no guarantee that it will be accurate. The function you're looking for is IMFSourceReader::SetCurrentPosition, which accepts the position in 100-nanosecond units (to convert seconds to 100-nanosecond units, multiply by 1e+7).
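For reference, a minimal seek built on that call might look like this (stream selection and error handling are omitted; InitPropVariantFromInt64 comes from propvarutil.h):

#include <windows.h>
#include <mfreadwrite.h>
#include <propvarutil.h>

// Seek the source reader to `seconds`, expressed in 100-nanosecond units.
bool seekTo(IMFSourceReader *reader, double seconds)
{
    PROPVARIANT position;
    // Seconds -> 100-nanosecond units: multiply by 1e+7.
    InitPropVariantFromInt64((LONGLONG)(seconds * 1e+7), &position);

    // GUID_NULL selects the default (100-nanosecond) time format.
    HRESULT hr = reader->SetCurrentPosition(GUID_NULL, position);
    PropVariantClear(&position);
    return SUCCEEDED(hr);
}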

Another Problem: GUID

I got linker errors when compiling LOVE with the MediaFoundation decoder: it complained about unresolved GUIDs. I also don't want to link against any of the MediaFoundation DLLs, so the binary stays compatible with Windows Vista (XP support was dropped as of LOVE 11.0). My temporary workaround is to copy the GUID declarations from the header files into const GUID variables.
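For illustration, the workaround looks roughly like this. The value must be copied verbatim from the corresponding DEFINE_GUID line in the SDK headers; the one below is MFMediaType_Audio.

#include <mfapi.h>

// Provide the definition ourselves instead of linking against mfuuid.lib.
// The value is copied from DEFINE_GUID(MFMediaType_Audio, ...) in the SDK headers.
extern "C" const GUID MFMediaType_Audio =
    {0x73647561, 0x0000, 0x0010, {0x80, 0x00, 0x00, 0xAA, 0x00, 0x38, 0x9B, 0x71}};

Depending on the SDK headers, including <initguid.h> before the MediaFoundation headers in exactly one translation unit may achieve the same thing.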

Aftermath

After getting everything running, I now have a LOVE build which can load AAC and WMA using MediaFoundation. How good is that, huh? You can check the full source code here. The respective header lies in the same directory as the C++ file.

Now you may ask: what about MinGW? Well, unfortunately LOVE doesn't support being compiled under MinGW in the first place, so compiling LOVE on Windows is only supported with the MSVC compiler.

And if anyone wants to decode audio from memory using MediaFoundation, then this blog post is what they're looking for.

Post is written in Markdown first then converted to HTML lol.

Wednesday, June 5, 2019

Fixing a stack overflow in an older game by limiting the exposed OpenGL extensions.

There's this GameHouse game called AirStrike 3D. The game was released back in the very old days, without accounting for future hardware, new GPUs, and new OpenGL extensions; things were mostly fixed-size buffers back then. Then at one point, when I decided to install it again, I couldn't run the game.

Trying to run the game simply results in a crash. I run the game in windowed mode by editing its config.ini.
At first I thought this was a Windows 10 problem, since running the game in VirtualBox with Windows XP seemed fine, until I had the idea to run it with Mesa3D instead: still a crash. So I fired up the Visual Studio debugger; it pointed to some random location with an access violation about executing a piece of RAM that isn't executable (due to page permissions), so I thought "this is probably a stack overflow" and checked the stack. What a surprise: I saw the entire OpenGL extension string sitting on the stack, so the crash is caused by the extension string returned by my OpenGL driver simply being too long for the game to handle. My laptop has two GPUs, and neither can run the game, because both return extension strings that are too long.

After a bit of searching, I found this Mesa3D page which describes how to limit the OpenGL extension string it returns to work around certain games. It works great, but I don't want to use Mesa3D because my laptop is not an AMD Threadripper with dozens of threads (Mesa3D would mean software rendering here), so I decided to roll my own OpenGL32.dll which forwards all OpenGL calls to the Windows OpenGL32.dll but intercepts the glGetString(GL_EXTENSIONS) call and limits the extension string it returns.
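To give an idea of what the hook looks like, here's a simplified sketch. This is not the actual code from my repository, and it truncates the string by length, while the real DLL filters by extension year like Mesa3D does; the export itself stays undecorated thanks to the .def file mentioned later in this post.

#include <windows.h>
#include <string>

typedef unsigned int GLenum;
typedef unsigned char GLubyte;
#define GL_EXTENSIONS 0x1F03

// Pointer to the real glGetString inside the system opengl32.dll.
typedef const GLubyte *(WINAPI *PFNGLGETSTRING)(GLenum name);
static PFNGLGETSTRING realGlGetString = nullptr;
static std::string trimmedExtensions;

extern "C" const GLubyte *WINAPI glGetString(GLenum name)
{
    if (!realGlGetString)
    {
        // Load the real OpenGL32 from the system directory, not from the
        // game folder (which now contains this hook DLL instead).
        char sysdir[MAX_PATH];
        GetSystemDirectoryA(sysdir, MAX_PATH);
        std::string realPath = std::string(sysdir) + "\\opengl32.dll";
        realGlGetString = (PFNGLGETSTRING)GetProcAddress(LoadLibraryA(realPath.c_str()), "glGetString");
    }

    const GLubyte *result = realGlGetString(name);
    if (name == GL_EXTENSIONS && result)
    {
        // Keep the extension string short enough for the game's fixed-size
        // buffer (roughly 4048 characters in AirStrike 3D's case),
        // cutting at a space so no extension name is left half-written.
        trimmedExtensions.assign((const char *)result);
        if (trimmedExtensions.size() > 4048)
        {
            size_t cut = trimmedExtensions.rfind(' ', 4048);
            trimmedExtensions.resize(cut == std::string::npos ? 4048 : cut);
        }
        return (const GLubyte *)trimmedExtensions.c_str();
    }
    return result;
}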

First, I took a look at the Mesa3D source code for an extension table I could use, and found this header file, which is exactly what I was looking for. Then I realized that the Windows OpenGL32.dll has 360 functions, and I didn't want to write the forwarding functions by hand (who wants to do that?). So instead I used my Lua programming power to parse the gl.h and WinGDI.h headers and generate a file which forwards each GL function to the original Windows OpenGL function. After fixing the calling conventions and making sure the resulting function names aren't decorated (by generating a .def file too), I finally had a working program.

After putting the new DLL into the game folder and setting the necessary environment variable, I was surprised that the game finally runs. It runs at about 500 FPS on my HD Graphics 620 (CPU bottlenecked), but this game uses the fixed-function pipeline, so I'm not surprised by the absurdly high FPS.

The game runs at ~540 FPS. For anyone curious, here's my astrike.log.
The source code of the hooked OpenGL32.dll that I'm talking about is available on my GitHub, including the build instructions (it's CMake), and if compiling is too much bother and you only want the 32-bit OpenGL32.dll, just go to the releases folder. One thing I noticed is that the game can only handle an extension string of at most 4048 characters before the stack overflow occurs, so setting the extensions year to 2009 or earlier should work.

If some older game has the same issue but you don't want to use Mesa3D, you can try my hooked OpenGL32.dll above and tell me how it performs.

Tuesday, April 30, 2019

Fixing black borders in your game images with ImageMagick.

Star with unintended black border
What's with that black around the star? That's because the image stores "transparent black" (or in CSS notation, rgba(0, 0, 0, 0)) for every fully transparent pixel, and white with varying alpha everywhere else. This is not a problem for image editing software like Photoshop, but it is a problem for OpenGL, especially if you use linear interpolation to resize your image.

What actually happens? If linear interpolation is enabled, the GPU will sample between the white and the "transparent black" texels, which results in a gray color with alpha around 0.5. This is not what you want, as it gives your image an unintended dark border, which may or may not be a problem for your game.
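To see the math, here's a tiny standalone illustration (plain C++, not GPU code) of a 50/50 sample between an opaque white texel and a "transparent black" one, versus a "transparent white" one:

#include <cstdio>

struct RGBA { float r, g, b, a; };

// 50/50 linear interpolation of two texels, done per channel,
// which is what the GPU does for a non-premultiplied-alpha texture.
static RGBA mix(RGBA p, RGBA q)
{
    return { (p.r + q.r) * 0.5f, (p.g + q.g) * 0.5f,
             (p.b + q.b) * 0.5f, (p.a + q.a) * 0.5f };
}

int main()
{
    RGBA white      = {1, 1, 1, 1};
    RGBA transBlack = {0, 0, 0, 0}; // "transparent black"
    RGBA transWhite = {1, 1, 1, 0}; // "transparent white"

    RGBA bad  = mix(white, transBlack); // (0.5, 0.5, 0.5, 0.5): half-transparent gray fringe
    RGBA good = mix(white, transWhite); // (1.0, 1.0, 1.0, 0.5): still white, just fainter

    std::printf("bad:  %.2f %.2f %.2f %.2f\n", bad.r, bad.g, bad.b, bad.a);
    std::printf("good: %.2f %.2f %.2f %.2f\n", good.r, good.g, good.b, good.a);
    return 0;
}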

A solution for this is to modify your image to use "transparent white" instead. Based on this answer, and assuming you have ImageMagick version 7 or later, I came up with this command:

magick convert input.png -channel RGB xc:"#ffffff" -clut output.png

And here's the result.
Star without black border around it.

What happens here is that we set every color channel to 255 (resulting in white) but keep the alpha values intact. The GPU then sees every pixel as white, varying only in alpha, so it effectively only interpolates the alpha because all the colors are the same.

And if you plan to pass your image through zopflipng, make sure not to pass --lossy_transparent, as that option changes all completely transparent pixels back to "transparent black", which is the source of the problem.

UPDATE: The ImageMagick command above won't work for images that contain more than one color. I forked an alpha-bleeding program which uses LodePNG to ease MSVC compilation; it can be found here: https://github.com/MikuAuahDark/alpha-bleeding.

Thursday, April 25, 2019

VS2013 RTM cl.exe and "Use Unicode UTF-8 for worldwide language support"

There's a feature in Windows 10 that lets you pass UTF-8 strings to the C function fopen and other ANSI WinAPI functions. This makes things feel Unix-like, where fopen on Unix expects the filename to be UTF-8. However, this doesn't mean everything works as expected: Microsoft warns that the feature may break applications which assume a multi-byte character is at most 2 bytes long. And unfortunately, that is true for the VS2013 RTM cl.exe compiler.

C:\Users\MikuAuahDark>cl.exe test.c
Microsoft (R) C/C++ Optimizing Compiler Version 18.00.21005.1 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

test.c
test.c : fatal error C1001: An internal error has occurred in the compiler.
(compiler file 'f:\dd\vctools\compiler\cxxfe\sl\p1\c\p0io.c', line 2812)
 To work around this problem, try simplifying or changing the program near the locations listed above.
Please choose the Technical Support command on the Visual C++
 Help menu, or open the Technical Support help file for more information

The file test.c is an empty file, but that error shows up regardless of the input file you specify. What happened here?

It turns out there's a check in c1.dll which is basically equivalent to this C code:

CPINFO cpInfo;
UINT chcp = GetACP();
GetCPInfo(chcp, &cpInfo);

if (cpInfo.MaxCharSize > 2) internal_error("f:\\dd\\vctools\\compiler\\cxxfe\\sl\\p1\\c\\p0io.c", 2812);
 
It assumes the maximum multi-byte character size is 2 bytes, but in this case I enabled the feature called "Use Unicode UTF-8 for worldwide language support", so this is what happens:
  1. GetACP returns 65001.
  2. GetCPInfo returns information about the UTF-8 code page, where the max char size is 4.
Is there any workaround for this? I'm afraid not. Basically we would have to make sure cl.exe doesn't see 65001 as the system code page, and the check for it is explicit. There's Locale Emulator, but that only emulates the locale string, not the code page.

If anyone has found out how to update VS2013 in 2019, please comment below. Yes, using VS2013 is mandatory in my case because I need to ensure compatibility with Windows Vista, which means targeting something lower than Windows 7 SP1.

Friday, January 25, 2019

Re-Implementing Live2D runtime in LÖVE: Performance Optimization

Please see my previous blog post for more information. Do you think I'm really satisfied with 1.2 ms? No, I think I can do more. Note that whenever I write a time measurement, it means the time taken to update the Kasumi casual summer model (shown below).


Code Optimization

After my previous blog post, I saw that there were still many optimizations to be done. I started by reducing temporary table creation: instead of creating a new table in the function body, I create the table once at file scope and reuse it over and over. In the motion manager code, I used this variant of the algorithm to remove motion data when necessary. I also localized functions that are called every frame, mostly functions from the math namespace like math.min, math.floor, and math.max, and I cache variables that are used multiple times to reduce table lookup overhead.

Although the optimizations listed above don't really save a significant amount of time when the JIT is used, they are a somewhat significant optimization for the non-JIT codepath. The next optimization was converting the hair physics code to use an FFI datatype on the JIT codepath, and a class otherwise. Testing gives better performance: 1.17 ms. Not much, but it's better than nothing.

Then a problem arose: when inspecting the verbose trace compiler output, I noticed lots of "NYI: register coalescing too complex" trace aborts in the curved surface deformer algorithm, which indicates I was using too many local variables there. At first this was a bit hard to solve, but I managed to optimize it by analyzing the interpolation calculation done by the curved surface deformer algorithm, which eliminated the trace aborts entirely. Testing gives slightly better performance: 1.15 ms.

Rendering Optimization

The last optimization I did is the Mesh optimization. Since I copied the Live2LOVE Mesh rendering codepath as-is, it was actually uploading lots of duplicate data to the GPU, duplicating the vertices according to the vertex map manually on the CPU side because I thought the vertex map could change. This can be very slow on the non-JIT codepath because the amount of data that needs to be sent through Mesh:setVertices can be too large. As a reference, before this optimization the non-JIT codepath (LuaJIT interpreter) took 6 ms.

After getting a better overview of how Live2D rendering works, I'm safe to assume the vertex map won't ever change, so I started by reducing the amount of vertex data that needs to be uploaded to the GPU and sending the vertex map instead. This actually gives a more significant performance boost on the CPU side. The JIT codepath now runs at 1.05 ms, very, very close to Live2LOVE's 1 ms. The LuaJIT interpreter takes 4 ms, yes, 4 ms to update the model. Unfortunately, vanilla Lua 5.1 takes as long as 12 ms to update the model.

The non-JIT codepath is forced to use the table variant of Mesh:setVertices because the overhead of FFI is higher than the benefit of using the Data variant. The non-JIT codepath also can't assume FFI is available at all: LuaJIT can be compiled without FFI support (but who wants to do that?), or the code may be running on vanilla Lua 5.1. One of my goals for this project is to provide maximum compatibility with Lua 5.1 too, even though LÖVE is compiled with LuaJIT by default.

Experimental Rendering Codepath

Unfortunately, I had to throw away the mesh batching technique I mentioned in my previous blog post. This mesh batching technique causes a very significant slowdown on both the JIT and non-JIT codepaths with very little performance improvement on the GPU side, so I decided to abandon it and go back to the old approach of updating the model and drawing the Meshes one by one. You can see in the screenshot below that the model takes 166 draw calls,

plus additional draw calls caused by IMGUI.

Wednesday, January 16, 2019

Re-Implementing Live2D runtime in LÖVE


(the video above shows my implementation in action using the LÖVE framework)

Live2D is a nice thing; the fluid character movement gives an additional touch to the games which use it. For my personal use, however, it has an annoying limitation: Lua is not officially supported, let alone LÖVE. Well, there are two ways to overcome this.

Writing an external Lua C module

This is probably the simplest way (but not that easy). Link against the Live2D C++ runtime library files, add code which interacts with Lua (and LÖVE), and you get Live2LOVE. This module is actually very fast, considering the Lua-to-C API overhead, and works by letting Live2D do the model transformation and LÖVE do the model rendering. Since it uses the official Live2D runtime, it has these limitations:
  • Must link with OpenGL. This is not a problem since LÖVE uses OpenGL already.
  • VS2017 is not supported (you have to use Cubism 3 for that). It does support compilers as old as VS2010, and LÖVE requires VS2013, so this is not really a problem unless you compile LÖVE against the VS2017 runtime.
  • MinGW/Cygwin compilation is not supported. Not really a problem, since compiling LÖVE on Windows using MinGW/Cygwin isn't supported either.
  • Linux and macOS are not supported. This is the real problem; not everyone uses Windows to run LÖVE.
So another idea that came to my mind is:

Re-Implement Live2D Cubism 2 Runtime

In Lua, because why not. This is actually a somewhat time-consuming process, and it took me more than 3 weeks to get model rendering working as intended. My additional goal is to have a Live2LOVE-compatible interface too, so switching between implementations is simply a matter of changing the "require" module. From now on, I'll use "Live2D Runtime" to mean the Live2D Cubism 2 Runtime.

I started by downloading the Live2D Runtime for WebGL (JavaScript) and beautifying the code (since it comes as a .min.js file). As expected, the method names are obfuscated. So I unpacked the Android version of the Live2D C++ Runtime and deobfuscated the method names by matching the arguments against the Live2D Runtime C++ header files, with the help of the IDA pseudocode decompiler to compare that implementation with the JavaScript one. This whole process took 2 weeks.

Then I started writing the equivalent Lua code based on the JavaScript Live2D Runtime. This is the easiest part, since JavaScript is also dynamically typed: carefully translating 0-based indexing code to 1-based, then fixing bugs, writing the Live2LOVE-compatible interface so I can reuse my existing Live2D viewer code that uses Live2LOVE, and testing.

After a week, I had model rendering, motion, and expressions working, reusing the code from my LÖVE-based Live2D model viewer but pointing it at my own implementation instead of Live2LOVE. Then the next problem came: it's 4x slower than Live2LOVE. Live2LOVE takes 1 ms to render Kasumi (casual summer clothes) while my implementation took 4 ms to render the same model, even though I had already written the implementation carefully so that LuaJIT happily accepts my code and bails out to the interpreter as little as possible.

I started the optimization by using "Data" objects instead of plain tables when updating the "Mesh" object for drawing. This cuts down the update time significantly, from 4 ms to 1.7 ms, so using tables to update the "Mesh" object is always a bad idea. Someone on the LÖVE Discord then said "try to use FFI everywhere instead of plain tables". At first I didn't agree with him because I want to preserve compatibility with mobile, but then I decided to proceed by falling back to tables in case FFI is not suitable (JIT is off, FFI support is not enabled, or it's running on vanilla Lua). I swapped most types from plain tables to FFI objects and I can get as low as 1.2 ms, close to Live2LOVE's 1 ms.

Conclusion

Re-implementing the Live2D Runtime was a nice experience. It gave me a better sense of when to start optimizing code instead of optimizing early, and an overview of how Live2D model transformation works. Apparently it can't beat the C++ version of the official Live2D Runtime in terms of model updating, but I think it can beat it in terms of model rendering. I'm thinking of a "Mesh" batching technique, which is basically: accumulate the vertices to render, then draw them all at once when a flush is requested. I'm satisfied with the current result, but I think I can still do better
 

... and hope to God it succeeds without problems.