The Android/LibGDX conspiracy

KaiHH · June 27, 2016, 10:55pm

tl;dr: If you want your Android app to run twice as fast, put all your application classes in the package “com.badlogic.gdx” (or subpackages).

While trying to evaluate how fit JOML currently is for running on Android devices, I did some benchmarking. Using a fresh Android Studio installation I used an Android 6.0.1 device and a simple application containing some benchmarking of JOML methods.
90 ns. for a 4x4 matrix multiplication, I thought, was not bad.
I also tested the math classes of the latest libGDX version with the arm64-v8a shared library against it. The shared library loaded just fine via System.loadLibrary(“gdx”) and the Matrix4.mul() method successfully called the native function. But its performance was nowhere near that of the pretty standard textbook Java-only solution in JOML. LibGDX performed at around 1370 ns. per invocation.
Next I tried to optimize libGDX’s Matrix4.mul() method. First, this meant getting rid of the very slow JNI invocation, which internally also made calls to expensive VM runtime routines. The actual matrix multiplication arithmetic was the least of the runtime cost.
So, I translated JOML’s matrix multiplication to float[] accesses that libGDX uses.
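
A minimal sketch of what such a float[]-based multiplication could look like. This is not the code from either library: the class name ArrayMatrix4, the scratch array and the column-major layout are assumptions for illustration, and the loops stand in for the fully unrolled arithmetic one would normally benchmark.

```java
// Sketch only: ArrayMatrix4 and its members are hypothetical names,
// not the libGDX or JOML API.
final class ArrayMatrix4 {
    // 16 floats in column-major order (an assumption of this sketch):
    // the element in row r, column c is stored at index c * 4 + r.
    final float[] val = new float[16];

    // Scratch space so that dest may alias this or right without allocating per call.
    private final float[] tmp = new float[16];

    ArrayMatrix4 identity() {
        java.util.Arrays.fill(val, 0f);
        val[0] = val[5] = val[10] = val[15] = 1f;
        return this;
    }

    // Computes dest = this * right and returns dest.
    ArrayMatrix4 mul(ArrayMatrix4 right, ArrayMatrix4 dest) {
        float[] a = val;
        float[] b = right.val;
        for (int c = 0; c < 4; c++) {
            for (int r = 0; r < 4; r++) {
                float sum = 0f;
                for (int k = 0; k < 4; k++) {
                    sum += a[k * 4 + r] * b[c * 4 + k];
                }
                tmp[c * 4 + r] = sum;
            }
        }
        // Copy only after all sums are computed, so aliased operands are read intact.
        System.arraycopy(tmp, 0, dest.val, 0, 16);
        return dest;
    }

    public static void main(String[] args) {
        ArrayMatrix4 m = new ArrayMatrix4().identity();
        ArrayMatrix4 n = new ArrayMatrix4().identity();
        n.val[12] = 3f; // column 3, row 0
        m.mul(n, m);    // dest aliases the left operand
        System.out.println(m.val[12]);
    }
}
```

The scratch array keeps the call allocation-free while still allowing m.mul(m, m), which is the call pattern used in the benchmark.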

Then I ran the benchmark again.

Result: 47 ns. for the new Java-only libGDX Matrix4.mul() method.
I thought: Nice, so on Android, arithmetic and memory accesses on float[] elements are actually twice as fast as on primitive float fields like the ones JOML uses.
So, I stripped JOML’s Matrix4f of everything except mul(), also used a float[16] array instead of the float fields, and finally used the same mul() method in JOML’s class which I already used for libGDX’s Matrix4.mul().
I expected that now JOML would perform exactly as fast as libGDX’s new Java-only mul() method.

That was not the case. Still around 90 ns.

That could not be, I thought.
Whatever I tried, such as changing the access pattern of the float[16] array in the mul() method from row-major to column-major, nothing worked. JOML’s Matrix4f class now literally looked exactly like my modified libGDX Matrix4 class. Something was fishy here.

Out of curiosity, I just refactored/moved JOML’s Matrix4f class into the com.badlogic.gdx.math package, rebuilt everything, wiped the app completely from the device (as always after each test) and reran the benchmark.

48 ns… wtf.

Moving libGDX’s Matrix4 class into any package other than com.badlogic.gdx or one of its subpackages also degraded the performance of its mul() method from 49 ns. to 90 ns.

Result: If you want your application to run fast on an Android 6.0.1 device, just put your application classes in the com.badlogic.gdx package.

…I’ll look into Android’s sources to see whether they actually have a codepath for classes in the ‘com.badlogic.gdx’ package…

SHC · June 28, 2016, 1:23am

Thing number one: Are you making a Java library or an android library?

This is the first question to settle, and I think it will point to a solution. Android can use plain Java libraries, but Android libraries are a different case: they declare their package in the manifest so that the compiler can optimise those classes. Android libraries have the .aar extension, which is short for Android archive.

This might be happening in case of LibGDX, because it was declared in the manifest. Try making another manifest for your library.

I’m not completely sure, but I think this might be the case. If you already built JOML as an Android library, then something else might be the issue.

KaiHH · June 28, 2016, 6:38am

Thanks for your answer, SHC!
What I did was choose “Start a new Android Studio project” on Android Studio’s welcome page.
There on the “Target Android Devices” page I selected “Phone and Tablet:” with “API 23: Android (Marshmallow)”.
Then on the next slide “Add an Activity to Mobile” I chose “Basic Activity”.
This set me up with a project structure in Android Studio and I could build and run the app on the device right away without specifying a package in a metadata information file.
Then I just copied the libGDX math classes from its original libGDX sources into my project into their corresponding package (which I created newly) in the app/src/main/java folder. Also without it being manually specified in any meta information file.
I did the same with JOML’s source files. Just copied them into the project source folder.
The file ultimately generated by the Gradle build process is an .apk file under app/build/outputs/apk.
When I inspect this file, no file in it mentions the com.badlogic.gdx package or my other package. There is just the AndroidManifest.xml (which strangely is not an XML file, but a binary file) containing only the fully-qualified name of the main activity to start when the application boots.

SHC · June 28, 2016, 9:53am

Okay this is interesting now. Thing number two, what is the Java source version that you are using?

The Android VM is not like the standard Java VM, and it is said that it still doesn’t support some Java 7 features. In that case the compiler generates code that is backported to Java 6. This is especially true for invokedynamic instructions, which are unsupported by Android.

The next thing to look for is using accessor methods. Whenever possible we should use the variables directly. The method count also has some impact on the performance. The tips specified in the android developer manual may be useful for you.

https://developer.android.com/training/articles/perf-tips.html

Apart from that, make sure that you enable ProGuard, which is disabled by default. This helps as it eliminates dead code prior to JIT execution.

KaiHH · June 28, 2016, 10:01am

No no no. The thing is, I am not using anything fancy from Java. The code would even be Java 1.1 compatible.
All it does is literally this now:


public class Matrix4f {
    public float m00, m01, m02, m03;
    public float m10, m11, m12, m13;
    public float m20, m21, m22, m23;
    public float m30, m31, m32, m33;

    public Matrix4f mul(Matrix4f right, Matrix4f dest) {
        float nm00 = m00 * right.m00 + m10 * right.m01 + m20 * right.m02 + m30 * right.m03;
        float nm01 = m01 * right.m00 + m11 * right.m01 + m21 * right.m02 + m31 * right.m03;
        float nm02 = m02 * right.m00 + m12 * right.m01 + m22 * right.m02 + m32 * right.m03;
        float nm03 = m03 * right.m00 + m13 * right.m01 + m23 * right.m02 + m33 * right.m03;
        float nm10 = m00 * right.m10 + m10 * right.m11 + m20 * right.m12 + m30 * right.m13;
        float nm11 = m01 * right.m10 + m11 * right.m11 + m21 * right.m12 + m31 * right.m13;
        float nm12 = m02 * right.m10 + m12 * right.m11 + m22 * right.m12 + m32 * right.m13;
        float nm13 = m03 * right.m10 + m13 * right.m11 + m23 * right.m12 + m33 * right.m13;
        float nm20 = m00 * right.m20 + m10 * right.m21 + m20 * right.m22 + m30 * right.m23;
        float nm21 = m01 * right.m20 + m11 * right.m21 + m21 * right.m22 + m31 * right.m23;
        float nm22 = m02 * right.m20 + m12 * right.m21 + m22 * right.m22 + m32 * right.m23;
        float nm23 = m03 * right.m20 + m13 * right.m21 + m23 * right.m22 + m33 * right.m23;
        float nm30 = m00 * right.m30 + m10 * right.m31 + m20 * right.m32 + m30 * right.m33;
        float nm31 = m01 * right.m30 + m11 * right.m31 + m21 * right.m32 + m31 * right.m33;
        float nm32 = m02 * right.m30 + m12 * right.m31 + m22 * right.m32 + m32 * right.m33;
        float nm33 = m03 * right.m30 + m13 * right.m31 + m23 * right.m32 + m33 * right.m33;
        dest.m00 = nm00;
        dest.m01 = nm01;
        dest.m02 = nm02;
        dest.m03 = nm03;
        dest.m10 = nm10;
        dest.m11 = nm11;
        dest.m12 = nm12;
        dest.m13 = nm13;
        dest.m20 = nm20;
        dest.m21 = nm21;
        dest.m22 = nm22;
        dest.m23 = nm23;
        dest.m30 = nm30;
        dest.m31 = nm31;
        dest.m32 = nm32;
        dest.m33 = nm33;
        return dest;
    }
}

Plus I have a button in the activity form which instantiates this class and calls this method a few million times and computes the average run time.
Now, when I copy this exact Matrix4f class into the package com.badlogic.gdx.math, it runs at around 48 ns.
When I copy it in ANY OTHER package other than under com.badlogic.gdx, it runs at 90 ns.
It totally freaks me out.
I am thus not interested in getting the code as fast as it could possibly be with ProGuard or anything. I am interested in getting to know why the same code performs twice as fast just because it is declared in the gdx package.

CoDi_R · June 28, 2016, 10:54am

Wow! Just wow! We all know that Mario is an evil mastermind, but this … :o ;D

May I suggest you upload your sample code and have someone verify your results on a different device?

KaiHH · June 28, 2016, 11:13am

Well, you have the code right there.

Just create a new Android project, put that class under a new package called “com.badlogic.gdx.math” (you don’t need to reference the libGDX library), instantiate it from your activity via a button press or something, call System.nanoTime(), execute m.mul(m, m) a few million times in a loop, call System.nanoTime() again afterwards and compute the average run time.
I printed the result timings on a simple TextView. (did not want to mess around with LogCat).
On an Android device with a 2.3GHz CPU it takes around 48ns.

Then, refactor/move the class into another package, such as “runs.slow.math” or something (anything, really), uninstall the app from the device, and push and execute the app again. On the same 2.3GHz CPU it gives around 90ns.
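
The timing loop described above might be sketched like this in plain Java. The names TinyMatrix and AverageTimer are made up for illustration, and the 2x2 class is only a short stand-in for the 4x4 Matrix4f posted earlier; on the device the result would go into a TextView instead of standard output.

```java
// Sketch only: TinyMatrix and AverageTimer are hypothetical names for illustration.
final class TinyMatrix {
    float m00 = 1f, m01 = 0f, m10 = 0f, m11 = 1f;

    // 2x2 stand-in for the 4x4 mul() in the thread; results are written to dest.
    TinyMatrix mul(TinyMatrix right, TinyMatrix dest) {
        float n00 = m00 * right.m00 + m10 * right.m01;
        float n01 = m01 * right.m00 + m11 * right.m01;
        float n10 = m00 * right.m10 + m10 * right.m11;
        float n11 = m01 * right.m10 + m11 * right.m11;
        dest.m00 = n00;
        dest.m01 = n01;
        dest.m10 = n10;
        dest.m11 = n11;
        return dest;
    }
}

final class AverageTimer {
    // Returns the average time per mul() call in nanoseconds.
    static double averageNanos(TinyMatrix m, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            m.mul(m, m);
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        TinyMatrix m = new TinyMatrix();
        double avg = averageNanos(m, 5_000_000);
        // Print a field of m as well so the result of the loop is observably used.
        System.out.println("avg ns per mul: " + avg + " (m00 = " + m.m00 + ")");
    }
}
```

Note that this loop has no separate warm-up phase, so whatever the runtime does during the first iterations is included in the average.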

VaTTeRGeR · June 28, 2016, 11:42am

Did you try having this class in multiple different packages under different names in the same app release and then testing out all the copied versions in the same run?

Start App
test code in package variation 1
test code in package variation 2
test code in package variation 3
test code in package variation 4
test …
test code in package variation n
print results
Exit App.

Do you count the first few cycles (warm up), how do you compute the avg. cycle value?
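
One way to sketch the procedure above in plain Java, with untimed warm-up calls and the variants run in a chosen order within the same process. Everything here is hypothetical: VariantBench, measure(), runAll() and standIn() are illustrative names, and the stand-in workloads take the place of the identical class copied into different packages.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: VariantBench and its methods are hypothetical names.
final class VariantBench {
    // Written by the stand-in workloads so their results stay observable.
    static volatile float sink;

    // Runs untimed warm-up calls first, then returns the average ns per timed call.
    static double measure(Runnable variant, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            variant.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            variant.run();
        }
        return (double) (System.nanoTime() - start) / iterations;
    }

    // Measures the named variants in the given order; the result keeps that order.
    static Map<String, Double> runAll(Map<String, Runnable> variants, List<String> order,
                                      int warmup, int iterations) {
        Map<String, Double> results = new LinkedHashMap<String, Double>();
        for (String name : order) {
            results.put(name, measure(variants.get(name), warmup, iterations));
        }
        return results;
    }

    // Stand-in workload; in the real test this would call mul() on one copy of the class.
    static Runnable standIn(final float factor) {
        return new Runnable() {
            float x = 1f;

            public void run() {
                x = x * factor + 1f;
                sink = x;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, Runnable> variants = new LinkedHashMap<String, Runnable>();
        variants.put("package variation 1", standIn(0.5f));
        variants.put("package variation 2", standIn(0.5f));
        // Run once in each order to see whether timings follow the class or the position.
        System.out.println(runAll(variants,
                Arrays.asList("package variation 1", "package variation 2"), 100_000, 1_000_000));
        System.out.println(runAll(variants,
                Arrays.asList("package variation 2", "package variation 1"), 100_000, 1_000_000));
    }
}
```

The Runnable indirection and the volatile write add a fixed cost to every call, so the absolute numbers are only comparable between variants measured the same way.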

KaiHH · June 28, 2016, 4:17pm

I did as you suggested. I tested both classes in the same test. Still the same timings. ~49ns. for the one in gdx package and around 90-100 ns. for the other one. Also when I swapped the order of the test execution, the timings swapped accordingly.
But it gets weirder: I created a new project with exactly the same layout/config and packages/classes and test fixture. The only thing that was different was the package in which the main activity resided and the project/activity name. Everything else the same.
Then I ran that: Now, both classes were equally fast/slow with 90-100 ns.

Then I renamed the package of the main activity in the first project (just appended a “2”).
All other packages (the gdx one and my other “slower” one) remained the same.
Result: Now both classes showed a timing of 90-100 ns.
There must be some really annoying caching going on, even though I seemingly “uninstalled” the app from the App Manager every time.

Anyway, I call it a day with this.

SHC · June 28, 2016, 9:50pm

This is what I originally meant when I asked which package you declared in the manifest. You give the package of the main activity class in the manifest, which is located in the src/AndroidManifest.xml file. The one in the apk is of course binary.

KaiHH · June 28, 2016, 9:55pm

Hm… yeah. I understand what you meant. But I never ever touched/modified any manifest file manually.
Android Studio / IntelliJ updated it automatically when I refactored/renamed the package.
Also, it is not the package that the class being called resides in. My activity is in “com.example.kai.appTest”.
The math class is in org.joml or in that libGDX package.
Anyway. I need more reading into Android development before I continue.
Because once I discovered the “Build Variants” (Debug, Release) and prepared a Release build, I was absolutely shocked by the run time I got with it. I’ll only say this: 25 times slower than Debug. Something is not right here.

Hydroque · June 29, 2016, 12:15am

If you right click windows task bar, bring up task manager and go to processes you can set priority to each individual process, as well as dictate how many cores to be put into use.

I wonder if that specific package is “elevated.” Although, I haven’t followed through with the understanding of what you said in this thread, so forgive me if I missed something in posting this. (Saying this because I am unsure about the packages existence in the workspace you are talking about, and I skimmed a lot)

SHC · June 29, 2016, 4:39am

Did I miss something here @Hydroque? I think the issue is with Android application running on an actual device, how did the windows task manager come into this context?

Hydroque · June 29, 2016, 6:20pm

I was drawing an analogy to that. My guess was that whenever a class is inside a ‘math’ package, the runtime optimizes it. Though, I don’t develop for Android, nor have I actually ‘stress tested’ this thing.

KaiHH · June 29, 2016, 7:47pm

I want to share my latest findings, which (for the first time) are consistent and reproducible.

It seems that on my device, the ART dex2oat ahead-of-time compiler, which produces native code out of dex classes, is not always being invoked when the app is installed and started on the device.

I said above that in the “Release” build type the Matrix4.mul() method was about 25 times slower compared to the “Debug” build type. In total it now takes around 2700 ns. compared to the debug version with 90-100ns.
This is strange when you think about it: A “release” version being slower than a “Debug” version.

So I played a bit with the project “Build Types” settings in Android Studio and tried different settings for the options “Debuggable”, “Signing” (signed or not) and “Minify”. The latter calls dexopt, which does some bytecode-only optimization on the dex classes, and calls ProGuard (if configured).
What I wanted to know is whether any of those options would trigger dex2oat to be run on the device or not.
There was also always a considerable startup time in the debug mode when I started the application on the device, showing a white screen without showing the activity UI for about 5 seconds. And this is in fact due to the ahead-of-time compiler compiling all the dex classes into native first. The “release” version however started immediately.
Afterwards, the application and the Matrix4.mul() method executed at peak speeds, really comparable to Java’s JIT compiler on the desktop (considering the slower speed of the CPU).

So I played around with these build type settings and the actual results for the settings are these:


Debuggable = true,  with signing, Minify = true  --> slow startup but fast execution
Debuggable = true,  with signing, Minify = false --> slow startup but fast execution
Debuggable = false, with signing, Minify = false --> fast startup but slow execution
Debuggable = false, with signing, Minify = true  --> fast startup but slow execution

So, the only option that has an effect is the “Debuggable” option, which in effect takes care that debug symbols will be emitted in the class files. If that option is enabled, the ART runtime invokes the ahead-of-time compiler when the app is installed and started on my device.

Probably, this is only an issue when deploying/installing/uploading the .apk via the USB wire protocol. It may be the case that applications are always optimized/ahead-of-time-compiled when installed from the Play Store.

Hydroque · June 29, 2016, 10:58pm

So there was optimization going on like I thought, but not on the scheduler side. You should report back on whether that is the case, i.e. whether apps installed from the store come pre-optimized. It would be interesting to manipulate the project in a way where you would have…

fast startup and fast execution