Tuesday, March 26, 2019

Qt painting performance with 4 different embedded GPUs (Mali, Adreno, PowerVR)


Summer is approaching, Qt 5.12.2 is out, and we wanted to once again get concrete painting performance numbers on some lower-end embedded SoC GPUs, the kind of chipsets people use in many kinds of embedded devices. If a user interface contains dynamically painted elements and the target is 60fps (or maybe 30fps), how much can you actually paint, and which Qt technologies should you reach for?

To keep this testing easily approachable, we grabbed 3 cheaper Android™ tablets with different GPUs. The other option would have been development boards, but those need a bit more setup time, and time-to-market (or time-to-blog.. ;-) ) is important to all of us. We also wanted to give Qt 5.12.2 a go while it's hot.

So let's first introduce the tablets & chipsets used for this testing:

1) Lenovo Tab 7 Essential, MediaTek MT8167D, GPU: IMG PowerVR GE8300. This is exactly the same GPU that e.g. the Renesas D3 uses, so if you are into automotive, this GPU may be interesting.

2) Lenovo Tab E8, MediaTek MT8163B, GPU: ARM Mali-T720 MP2. This ARM GPU is quite common in lower-end MediaTek chipsets, but it is also used in the Allwinner H6, which is found e.g. on the Zidoo H6 and various Orange Pi boards.

3) Huawei MediaPad T3 10, Qualcomm Snapdragon 425, GPU: Adreno 308. This Adreno version is a lower-end chip from the Qualcomm Snapdragon family, very commonly found in more affordable tablets and phones (like Samsung Galaxy J2, Motorola Moto E5, Nokia 2.1 etc.).

For comparison, we'll throw one more device into the set:

4) Nexus 6 (2014), Qualcomm Snapdragon 805, GPU: Adreno 420. This is the wild-card contender here, a higher-end phone from ~4.5 years ago. So how does a somewhat dated high-end device match up against the current low end? Well, the GPUs in these low-end tablets are also mostly dated, but let's see.

Setup

These tablets have different screen resolutions, so to get more comparable results we first configure them all to use the same resolution. A suitable resolution for our imaginary IoT touchscreen device could be 400x640 px. So using adb shell we change the resolution of each device with:

adb shell wm size 400x640

Now we want to know how these chipsets perform compared to each other. But we also want to know the difference between CPU-side QPainter drawing (Image rendertarget) vs. GPU-side QPainter (FramebufferObject rendertarget) vs. GPU-side QNanoPainter. So the rendering backends we will test are:
  • QPainter - CPU - antialiased
  • QPainter - CPU - non-antialiased
  • QPainter - GPU - non-antialiased
  • QNanoPainter - GPU - antialiased
  • QNanoPainter - GPU - non-antialiased
As you can see, the list above is missing antialiased GPU QPainter. The reason is that these devices don't support the OpenGL extensions Qt requires for MSAA antialiasing, so that combination is not available.

Testing

There are plenty of different testing possibilities and combinations we could try here, but we want to stay quite general instead of going into specific detailed operations. Our first test is "How much stuff can you draw on a fullscreen item?" and our second test is "How many smaller and less demanding items can you manage?". So let's start.

TEST 1: QNanoPainter vs. QPainter demo, all default tests enabled (ruler, circles, lines, bars, icons), running fullscreen (remember that all tablets are set to use 640x400 resolution). So the painting is already quite heavy and versatile by default. Then we increase how many times all tests are rendered and watch the framerate drop towards the floor... Here's a video of all devices running this test with QNanoPainter and a single render count:



The results are as follows:






TEST1 Conclusions:
  • The performance of MediaTek MT8167D (PowerVR GE8300) and MediaTek MT8163B (Mali-T720 MP2) is very similar. It seems the first one has a slightly faster GPU while the second has a slightly faster CPU.
  • Adreno 308 doesn't take as big a hit from QNanoPainter antialiasing as the other two. While with the others antialiased performance is ~50% of the non-antialiased one, with Adreno 308 it is ~65%.
  • With the MediaTek chipsets, QPainter with FBO rendertarget achieves ~50% higher fps than QPainter with Image rendertarget.
  • If your UI requires repainting items that cover the whole (640x400px) screen and you target 60fps, with these chipsets you should look towards QNanoPainter.
  • The Adreno 420 GPU of the comparison device (Nexus 6) is notably beefier, and with QNanoPainter you can render all tests 4 times while keeping a steady 60fps. But interestingly, the QQuickPaintedItem FBO rendertarget doesn't get much out of this GPU. What causes this could be analyzed further.
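
To put the 60fps target into numbers: one frame at 60fps has a budget of roughly 16.7 ms, and at 30fps roughly 33.3 ms. The sketch below is only a back-of-the-envelope helper and not part of the demo. The function names are made up for illustration, and it assumes that painting cost grows linearly with the render count, which real GPUs will not follow exactly.

```cpp
#include <cassert>
#include <cmath>

// Time available for one frame, in milliseconds, at the given target frame rate.
double frameBudgetMs(double targetFps) {
    assert(targetFps > 0.0);
    return 1000.0 / targetFps;
}

// Largest render count that still fits into the frame budget.
// Assumption (hypothetical): cost scales linearly with the render count and
// nothing else competes for the frame time.
int maxRenderCount(double singlePassMs, double targetFps) {
    assert(singlePassMs > 0.0);
    return static_cast<int>(std::floor(frameBudgetMs(targetFps) / singlePassMs));
}
```

For example, if a single pass of all tests took a hypothetical 4 ms, maxRenderCount(4.0, 60.0) would give 4 and maxRenderCount(4.0, 30.0) would give 8.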


TEST 2: QNanoPainter vs. QPainter demo, only the circles test enabled, smaller 256x256px item size. Also, instead of increasing the rendering count, we increase the number of these QQuickItems. So the output looks like this with 1, 2, 4, 8 and 16 items:


For this test we also add a 6th rendering mode: QNanoPainter with QNANO_USE_RENDERNODE defined. With this, QSGRenderNode is used, which basically means rendering directly into the Qt Quick Scene Graph instead of rendering through an FBO using QQuickFramebufferObject. As the number of items increases, the potential savings from not rendering through an FBO also increase, but we want to know by how much and whether it depends on the GPU.

The results are as follows:






TEST2 Conclusions:
  • Reducing the size of the items and the amount of painting makes these chipsets a more viable option for dynamically painted UI elements.
  • If your UI contains 2 items like this, Snapdragon 425 can manage 60fps using QPainter - CPU. But the CPUs on the MediaTek chipsets can't quite reach that (44fps & 50fps). Using QPainter - GPU, they all reach 60fps with 2 items.
  • With these simpler items, QNanoPainter antialiasing doesn't have major overhead on any of the chipsets. QPainter (CPU) antialiasing obviously does have notable overhead.
  • Using QNANO_USE_RENDERNODE (QSGRenderNode) with QNanoPainter gives a notable performance increase in this case, ~20-30% depending on the chipset. Our assumption about FBO overhead with more items was correct.
  • The comparison device (Nexus 6) can render at least 16 items with QNanoPainter at a fluid 60fps, both antialiased and non-antialiased.
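
To see how far the 44fps and 50fps results above are from the 60fps target, it helps to think in frame time instead of fps. A minimal sketch follows; the helper names are ours and purely illustrative.

```cpp
#include <cassert>

// Frame time in milliseconds for a measured frame rate.
double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Percentage of per-frame time that has to be saved to get from
// measuredFps up to targetFps. Returns 0 if the target is already met.
double requiredTimeCutPercent(double measuredFps, double targetFps) {
    assert(measuredFps > 0.0 && targetFps > 0.0);
    if (measuredFps >= targetFps)
        return 0.0;
    return (1.0 - measuredFps / targetFps) * 100.0;
}
```

With the numbers above, 44fps means about 22.7 ms per frame and 50fps means 20 ms, so reaching 60fps would require cutting per-frame time by roughly 27% and 17% respectively.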

As a final conclusion, I would say that if you are working on an embedded system with these or similar chipsets and your user interface contains elements which require dynamic painting, consider utilizing QNanoPainter for those.

Sunday, March 17, 2019

Using QNanoPainter without QtQuick (pure C++)

Originally, the QNanoPainter library was implemented to fulfill the need for easy-to-use but performant custom OpenGL QQuickItems. For those needs, the Qt Quick Scene Graph QSG* classes felt a bit too low-level to be productive, while QQuickPaintedItem was slightly lacking in performance and rendering quality on mobile hardware. For more details, please read this QNanoPainter introduction blog post.

I like (OK, love) Qt Quick & QML and have used them successfully in many different projects. On desktop, mobile and embedded software. A lot. But Qt Quick doesn't suit all situations, and it is not always required. As explained in this Qt blog post, Qt 5.12 improves Qt Quick performance and memory usage. But naturally the Qt Quick engine still adds some memory usage and startup time.

Fear not, QNanoPainter can also be used without Qt Quick. The available entry points are:
  1. QNanoQuickItem & QNanoQuickItemPainter - This is where it all started, use these to implement your QQuickItems.
  2. QNanoWidget - Based on QOpenGLWidget, so it can be used for widget-based applications. Used similarly to QWidget, it just provides the QNanoPainter API for painting instead of QPainter. As QNanoPainter is OpenGL (ES) powered, in some cases this can also substitute for QGLWidget-based components.
  3. QNanoWindow - Based on QOpenGLWindow / QWindow, so very lightweight. Optimal for embedded software which only needs a single QNanoWindow for the whole UI.
There are separate helloworld examples for all of these classes in the QNanoPainter sources, so to educate ourselves let's see how their memory consumption differs. First I unified all examples to look the same, like this:


I also made the applications exit automatically with a timer after running for 2 seconds, to let the memory consumption stabilize. Measuring was done using the MTuner memory profiler for Windows. Using the freshly released Qt 5.12.2, the MTuner memory usage graphs look like this:


So the QNanoQuickItem-based version uses the most memory, peaking at 28.9MB. QNanoWidget comes next with 23.3MB peak usage. And the slimmest, as expected, is the QNanoWindow app with 19.0MB.
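
The differences are easier to compare as relative savings. Here is a small sketch that uses only the peak figures quoted above; the constant and function names are ours, for illustration.

```cpp
#include <cassert>

// Peak memory usage in MB, as reported above for the three helloworld variants.
constexpr double kQuickItemPeakMb = 28.9;
constexpr double kWidgetPeakMb = 23.3;
constexpr double kWindowPeakMb = 19.0;

// Relative saving, in percent, of candidateMb compared to baselineMb.
double savingPercent(double baselineMb, double candidateMb) {
    assert(baselineMb > 0.0);
    return (baselineMb - candidateMb) / baselineMb * 100.0;
}
```

This gives roughly 34% less peak memory for the QNanoWindow app than for the QNanoQuickItem version, and roughly 18% less than for the QNanoWidget version.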

Note that all of these use the normal MSVC2015 Qt 5.12.2 from the installer and release builds of the applications, without extra compiler options or anything. With those and Qt Lite it would be possible to build more streamlined versions, especially for QNanoWindow, which doesn't depend on Qt modules other than Qt Core & GUI. Further optimization and testing are left as an exercise for readers and the ones needing it :)

In conclusion: If you are working on an embedded device with an OpenGL ES 2 / 3 capable GPU, are concerned about flash & RAM usage, and require a relatively simple user interface, I would encourage you to check out QNanoWindow. You get hardware-accelerated, nicely antialiased graphics, in pure C++.