Tuesday, March 26, 2019

Qt painting performance with 4 different embedded GPUs (Mali, Adreno, PowerVR)


Summer is approaching, Qt 5.12.2 is out and we wanted to again get concrete painting performance numbers on some lower end embedded SoC GPUs. The kind of chipsets which people are using for embedded devices in multiple places. If the user interface contains dynamically painted elements and the target is 60fps (or maybe 30fps), how much could you actually paint and which Qt technologies should you reach for?

To do this testing in an easily approachable way, we grabbed 3 cheaper Android™ tablets with different GPUs. The other option would have been different development boards, but those need a bit more setup time and time-to-market (or time-to-blog.. ;-) ) is important to all of us. We also wanted to give Qt 5.12.2 a go while it's hot.

So let's first introduce the tablets & chipsets used for this testing:

1) Lenovo Tab 7 Essential, MediaTek MT8167D, GPU: IMG PowerVR GE8300. This GPU is exactly the same which e.g. Renesas D3 uses, so if you are into automotives this GPU may be interesting.

2) Lenovo Tab E8, MediaTek MT8163B, GPU: ARM Mali-T720 MP2. This ARM GPU is quite common in lower-end MediaTek chipsets but also used in Allwinner H6 which is found e.g. on Zidoo H6 and different Orange Pi boards.

3) Huawei MediaPad T3 10, Qualcomm Snapdragon 425, GPU: Adreno 308. This Adreno version is lower end chip from Qualcomm Snapdragon family, very commonly found from more affordable tablets and phones (like Samsung Galaxy J2, Motorola Moto E5, Nokia 2.1 etc.).

As a comparison we'll throw one more device in the set:

4) Nexus 6 (2014), Qualcomm Snapdragon 805, GPU: Adreno 420. This is the wild-card contender here, higher end phone from ~4.5 years ago. So how does a bit dated highend match to current low end? Well GPUs on these low end tablets are also mostly dated, but let's see.

Setup

These tablets have different screen resolutions, so to get more comparable results we first configure them all to use same resolution. Suitable resolution for our imaginary IOT touchscreen device could be 400x640 px. So using adb shell we change resolutions of each device with:

adb shell wm size 400x640

Now we want to know how these chipsets perform compared to each other. But we also want to know the difference between CPU side QPainter drawing (Image rendertarget) vs. GPU side QPainter (Framebufferobject rendertarget) vs. GPU side QNanoPainter. So the different rendering backends we will test are:
  • QPainter - CPU - antialiased
  • QPainter - CPU - non-antialiased
  • QPainter - GPU - non-antialiased
  • QNanoPainter - GPU - antialiased
  • QNanoPainter - GPU - non-antialiased
As you see, above list is missing antialiased GPU QPainter. The reason is that these devices don't support OpenGL extensions Qt requires for MSAA antialiasing so that combination is not available.

Testing

There are plenty of different testing possibilities and combinations we could try here, but we want to be quite general instead of going into specific detailed operations. Our first test is "How much stuff can you draw on a fullscreen item?" and our second test "How many smaller and less demanding items can you manage?". So let's start.

TEST 1: QNanoPainter vs. QPainter demo, all default tests enabled (ruler, circles, lines, bars, icons), running fullscreen (remember all tablets are set to use 640x400 resolution). So quite heavy and versatile painting already by default. Then we increase how many times all tests are rendered and watch the framerate dropping towards the floor... Here's a video of all devices running this test with QNanoPainter and single render count:



Results are as follows:






TEST1 Conclusions:
  • Performance of MediaTek MT8167D (PowerVR GE8300) and MediaTek MT8163B (Mali-T720 MP2) is very similar. Seems like the first one has a slightly faster GPU while the second has a slightly faster CPU.
  • Adreno 308 doesn't take as big overhead from QNanoPainter antialiasing as other two. While with others antialiased performance is ~50% of non-antialiased one, with Adreno 308 it is ~65%.
  • With MediaTek chipsets QPainter with FBO rendertarget achieves ~50% higher fps than QPainter with Image rendertarget.
  • If your UI requires repainting items of whole (640x400px) screen and you target 60fps, with these chipsets you should look towards QNanoPainter.
  • The comparison device (Nexus 6) Adreno 420 GPU is notably beefier and with QNanoPainter you can render all tests 4 times while keeping steady 60fps. But interestingly QQuickPaintedItem FBO rendertarget doesn't get much out of this GPU. What is causing this could be analyzed further.
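
To put the 60fps target into perspective, here is a minimal sketch of the frame budget arithmetic behind these conclusions. The function names are hypothetical, and the model assumes painting cost grows linearly with render count and that nothing else competes for the frame time, which real devices won't follow exactly:

```cpp
#include <cassert>
#include <cmath>

// Time available for one frame at the given target frame rate, in milliseconds.
double frameBudgetMs(double targetFps)
{
    assert(targetFps > 0.0);
    return 1000.0 / targetFps;
}

// How many paint passes of the given cost fit into one frame.
// Assumption (not measured): cost scales linearly with the render count
// and the whole frame budget is available for painting.
int maxRenderCount(double costPerRenderMs, double targetFps)
{
    assert(costPerRenderMs > 0.0);
    return static_cast<int>(std::floor(frameBudgetMs(targetFps) / costPerRenderMs));
}
```

At 60fps the budget is about 16.7ms per frame, so under this simplified model rendering all tests 4 times at a steady 60fps means one full pass has to stay around 4ms or below. Dropping the target to 30fps doubles the budget.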


TEST 2: QNanoPainter vs. QPainter demo, only circles test enabled, smaller 256x256px item size. Also, instead of increasing the rendering count, we increase the amount of these QQuickItems. So output looks like this with 1, 2, 4, 8 and 16 items:


For this test we also add 6th rendering mode into test: QNanoPainter with QNANO_USE_RENDERNODE defined. With this, QSGRenderNode is used which basically means rendering directly into Qt Quick Scene Graph instead of rendering through FBO using QQuickFramebufferObject. When the amount of items increases, potential savings for not rendering through FBO also increases but we want to know how much and does it depend on GPUs.

Results are as follows:






TEST2 Conclusions:
  • Reducing size of items and the amount of painting makes these chipsets more viable option for dynamically painted UI elements.
  • If your UI contains 2 items like this, Snapdragon 425 can manage 60fps using QPainter - CPU. But CPUs on MediaTek chipsets can't quite reach that (44fps & 50fps). Using QPainter - GPU those all reach 60fps with 2 items.
  • With these simpler items, QNanoPainter antialiasing doesn't have major overhead on any of the chipsets. QPainter (CPU) antialiasing does obviously have notable overhead.
  • Using QNANO_USE_RENDERNODE (QSGRenderNode) with QNanoPainter gives notable performance increase in this case, ~20-30% depending on chipset. Our assumption about FBO overhead with more items was correct.
  • Comparison device (Nexus 6) can render at least 16 items with QNanoPainter at fluid 60fps, both antialiased and non-antialiased.

As a final conclusion I would say that if you are working on embedded system with these or similar chipsets and your user interface contains elements which require dynamic painting, consider utilizing QNanoPainter for those.

Sunday, March 17, 2019

Using QNanoPainter without QtQuick (pure C++)

Originally, QNanoPainter library was implemented to fulfill the needs of easy to use but performant custom OpenGL QQuickItems. For the needs, Qt Quick Scene graph QSG* classes felt a bit too low-level to be productive, while QQuickPaintedItem was slightly lacking in performance and rendering quality on mobile hardware. For more details, please read this QNanoPainter introduction blog post.

I like (OK, love) Qt Quick & QML and have used them successfully in many different projects. On desktop, mobile and embedded software. A lot. But Qt Quick doesn't suit all situations or it is not always required. As explained in this Qt blog post, Qt 5.12 improves Qt Quick performance and memory usage. But naturally there still is some memory and startup time additions coming from Qt Quick engine.

Fear not, QNanoPainter can be used also without Qt Quick. Available entry points are:
  1. QNanoQuickItem & QNanoQuickItemPainter - This is where it all started, use these to implement your QQuickItems.
  2. QNanoWidget - Based on QOpenGLWidget so can be used for widget based applications. Used similarly to QWidget and just contains QNanoPainter API for painting instead of QPainter. As QNanoPainter is OpenGL (ES) powered, in some cases this can substitute also QGLWidget based components.
  3. QNanoWindow - Based on QOpenGLWindow / QWindow so very lightweight. Optimal for embedded software which would only need a single QNanoWindow for the whole UI.
There are separate helloworld examples for all of these classes in QNanoPainter sources, so to educate ourselves let's see what the memory consumption differences of them are. First I unified all examples to look the same, like this:


Also made applications to exit automatically with timer after running for 2 seconds, to let the memory consumption stabilize. Measuring was done using MTuner memory profiler for Windows. Using freshly released 5.12.2, MTuner memory usage graphs look like this:


So QNanoQuickItem based version is using most memory, peaking at 28.9MB. QNanoWidget comes next with 23.3MB peak usage. And slimmest, as expected, is QNanoWindow app with 19.0MB.

Note that all of these use normal MSVC2015 Qt 5.12.2 from installer and release builds of applications, without extra compiler options or anything. With those and Qt Lite it would be possible to build more streamlined versions especially for QNanoWindow which doesn't depend on other Qt modules than Qt Core & GUI. Further optimizations and testing is left as an exercise for readers and the ones needing it :)

In conclusion: If you are working on an embedded device with an OpenGL ES 2 / 3 capable GPU, are concerned about flash & RAM usage and require a relatively simple user interface, I would encourage you to check out QNanoWindow. You get hardware accelerated, nicely antialiased graphics, in pure C++.

Monday, January 1, 2018

Qt 5.10 Windows Rendering Benchmarks

At the end of the previous blog post I promised to do a follow-up about Windows side rendering performance. So here we go.

But before that, let's revisit one earlier case. Previously I decided to provide results with and without "Bezier lines" test because it seemed to perform particularly poorly when rendered using QML Shape backend. Instead of just letting this one go, I decided to dig a bit deeper to try to improve QML Shape performance. After all, just disabling slow tests doesn't sound like a preferred long-term plan... ;-)

Improving QML Shape paths performance

After some trial and error, I found out that QQuickPath::createPath() uses a considerable amount of time every time the path changes, and the reason is QPainterPath::length(), which e.g. for all curved paths calls QBezier::length(). This isn't usually a problem for PathView as its normal use-case is creating the path once and then just moving elements along the path. But the new ShapePath on the other hand might be animated and change (re-create) its path multiple times.
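
To illustrate why a length calculation is extra work on every path change: the arc length of a cubic Bezier has no general closed form, so it has to be approximated numerically. The sketch below is not Qt's QBezier::length() implementation, just a simple stand-in that flattens the curve into line segments and sums them. The Point struct and function names are made up for this example:

```cpp
#include <cassert>
#include <cmath>

struct Point { double x; double y; };

// Evaluate a cubic Bezier curve at parameter t in [0, 1].
Point cubicBezierAt(Point p0, Point p1, Point p2, Point p3, double t)
{
    const double u = 1.0 - t;
    const double b0 = u * u * u;
    const double b1 = 3.0 * u * u * t;
    const double b2 = 3.0 * u * t * t;
    const double b3 = t * t * t;
    return { b0 * p0.x + b1 * p1.x + b2 * p2.x + b3 * p3.x,
             b0 * p0.y + b1 * p1.y + b2 * p2.y + b3 * p3.y };
}

// Approximate the curve length by summing the lengths of straight segments.
// More segments give a better estimate but cost more evaluations.
double approximateLength(Point p0, Point p1, Point p2, Point p3, int segments)
{
    assert(segments > 0);
    double length = 0.0;
    Point previous = p0;
    for (int i = 1; i <= segments; ++i) {
        const Point current = cubicBezierAt(p0, p1, p2, p3,
                                            static_cast<double>(i) / segments);
        length += std::hypot(current.x - previous.x, current.y - previous.y);
        previous = current;
    }
    return length;
}
```

Every curve in the path needs this kind of loop each time the path is re-created, so skipping it when nobody asks for the length is an easy win for animated paths.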

Luckily ShapePath wouldn't actually need to count its length there, as it doesn't support PathAttribute or PathPercent properties. So I went ahead and implemented a fast-path for creating path for ShapePath in patch linked into QTBUG-64951. This patch improves performance of all QML Shape paths, most notably for bigger animated paths. For QNanoPainter demo bezier line test, it provided up to ~20x performance boost on Windows PC. See FPS values at top-left corners of without (back windows) and with the patch (forward windows):


Hopefully that patch ends up into Qt in one form or another, but for now to get better results I used Qt 5.10 branch + patch for all the testing here.

Desktop OpenGL vs. Angle

As you probably know, Qt Quick on Windows PC can run either on desktop OpenGL or on OpenGL ES through Angle. What Angle does is translate OpenGL ES API calls to Direct3D, which is great for improving compatibility as OpenGL support varies on Windows. Direct3D drivers of different GPUs might be more optimized than OpenGL but on the other hand translation has some overhead, so one might ponder which one is faster, normal OpenGL or Angle? How much does it matter?

To get some view into this, I ran the test application first on trusty old Windows PC with the following related specs:
  • Intel i5-2500K @ 3.3GHz
  • Integrated Intel HD Graphics 3000 + NVIDIA GTX 1060
  • HD monitor
  • Windows 10
  • Building with MS Visual Studio 2015 - 64bit
  • All QNanoPainter perf demo default tests enabled

Test1: Windows PC with HD Graphics 3000, fullscreen:


Test1 conclusions: No huge differences in performances. Likely old integrated Intel GPU is so non-performant that none of the rendering methods can perform well. With QQuickPaintedItem (QImage and FBO), Angle is faster than desktop OpenGL. With all the other backends, OpenGL is faster. But as said, differences aren't very big.

Full HD resolution is probably too much for that integrated GPU to handle, so let's repeat the test with default window size (375x667).

Test2: Windows PC with HD Graphics 3000, default window size:


Test2 conclusions: Interesting part here is naturally comparison with the Test1 results. Immediately we can see QNanoPainter (OpenGL) outperforming other options. Decreasing the item size improved its fps ~200%, while QNanoPainter (Angle) only gained ~40%. So decreasing the pixel amount allows OpenGL to perform better than Angle. In many tests QQuickPaintedItem (QImage) has been the slowest option, but here it's actually second fastest. Combination of fast Intel CPU + poor GPU + small item size suits it well. Not really surprising, but good to prove it.

It is also interesting that item size didn't have a big effect on QML Shape or QQuickPaintedItem (FBO), both of those are limited by something else than item pixel amount. But some results are a bit shady, leading to think that maybe this older integrated GPU is doing strange things.

Next we can enable the external NVIDIA GTX 1060 graphics card and see how huge GPU performance increase affects these different rendering methods. As QML Shape supports NVIDIA-specific GL_NV_path_rendering we need to add into our test matrix one more option, with and without vendorExtensionsEnabled property enabled. To keep things clearer, let's do Angle and OpenGL as separate graphs from now on.

Test3: Windows PC with NVIDIA GTX 1060, Angle:


Test4: Windows PC with NVIDIA GTX 1060, OpenGL:


Tests 3 and 4 conclusions:
  • No difference for QQuickPaintedItem (QImage) between Test 3 & 4. This is as expected, Qt Raster CPU backend performs equally well (or bad) with OpenGL and Angle.
  • With QML Shape (no GL_NV_path_rendering), OpenGL and Angle perform quite close to each other, with OpenGL being ~20% faster.
  • Angle doesn't have GL_NV_path_rendering extension available so on Angle results with and without vendorExtensionsEnabled are exactly the same. On OpenGL, GL_NV_path_rendering gives about 20% performance improvement over default GeometryRenderer.
  • QQuickPaintedItem (FBO) is clearly faster with OpenGL, about double the speed compared to Angle. QNanoPainter is also much faster with OpenGL, about 4x the speed compared to Angle.
As we are in a good benchmarking flow now let's not stop yet. Next we will switch to fresh laptop hardware, Dell XPS 15 (9560) with the following related specs:
  • Core i7-7700HQ
  • Integrated Intel HD 630 + NVIDIA GTX 1050
  • HD screen
  • Windows 10
  • Building with MS Visual Studio 2015 - 64bit
  • All QNanoPainter perf demo default tests enabled
So how does a laptop with latest Intel CPU + GPU perform? What's the difference between integrated vs. additional GPU here? Let's find out, running all tests in fullscreen HD resolution.

Test 5: Dell XPS 15 with Intel HD 630, Angle:


Test 6: Dell XPS 15 with Intel HD 630, OpenGL:


Test 7: Dell XPS 15 with NVIDIA GTX 1050, Angle:


Test 8: Dell XPS 15 with NVIDIA GTX 1050, OpenGL:



Tests 5-8 conclusions:
  • With HD 630, OpenGL is also overall faster than Angle. With QML Shape and QQuickPaintedItem (FBO) OpenGL reaches ~100% higher fps, while with QNanoPainter OpenGL has ~50% higher fps than Angle.
  • Same thing with GTX 1050, OpenGL is faster than Angle. With QML Shape ~30%, QQuickPaintedItem (FBO) ~100% and QNanoPainter ~200% higher fps. It's clear that especially QNanoPainter enjoys taking all juices out of powerful GPUs.
  • Comparing integrated vs. external GPU here, enabling GTX 1050 increases QNanoPainter performance by ~200% (12fps vs. 34fps). Also interestingly QML Shape gains almost nothing from the external GPU, so the bottleneck is somewhere else. But enabling GL_NV_path_rendering for QML Shape with GTX 1050 gives ~20% higher fps which matches the results with the other PC. So when running on NVIDIA GPU it's usually preferred to keep vendorExtensionsEnabled on.
  • As expected, Intel integrated GPUs have improved a lot in ~6 years. Looking at QNanoPainter numbers, this laptop with HD 630 performs ~4x faster than HD 3000 of previous setup (43fps vs. 11fps). Yes, other parts have changed too, but GPU is a big factor here.
  • Comparison between old system + NVIDIA GTX 1060 vs. new system + NVIDIA GTX 1050 is also interesting. We can see that although the GTX 1060 in the old system is the beefier GPU, CPU, RAM etc. turn the tables for the newer system. On the new system, rendering is overall ~30% faster (QNanoPainter 34fps vs. 26fps).
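
The percentages and "x times faster" figures in these conclusions are rounded by eye, so here is a small sketch of how they can be recomputed from the raw fps pairs. The helper names are hypothetical, and the only inputs meant to be used are the fps values quoted above:

```cpp
#include <cassert>

// How many times faster the new result is compared to the baseline.
double speedupFactor(double baselineFps, double newFps)
{
    assert(baselineFps > 0.0);
    return newFps / baselineFps;
}

// Relative gain in percent: 0 means no change, 100 means doubled fps.
double percentGain(double baselineFps, double newFps)
{
    return (speedupFactor(baselineFps, newFps) - 1.0) * 100.0;
}
```

With the quoted pairs, percentGain(12, 34) gives roughly 183, speedupFactor(11, 43) roughly 3.9 and percentGain(26, 34) roughly 31, which is in line with the rounded numbers in the list.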

Now it's probably good time to call this blog post done. As always, thoughts about these results, own testing results or any other comments are warmly welcome. And happy 2018!

Monday, December 4, 2017

Qt 5.10 Rendering Benchmarks

Qt 5.10.0 RC packages are available now and actual release is happening pretty soon. So this seems to be a good time to run some rendering benchmarks with 5.10, including new QML Shape element, QQuickPaintedItem and QNanoPainter.

After my previous blog post, some initial comments mentioned how QML Shape didn't reach their performance expectations. But I think that might be more of a "use the right tool for the job" -kind of thing. This demo application is very much designed to test the limits of how much heavily animated graphics can be drawn while keeping performance high and while having its own strengths, QML Shape likely isn't the tool for that.

To prove this point, there is a new 'flower' test case in QNanoPainter demo app which renders a nice flower path, animating gradient color & rotation (but not path). Combining it with new setting to render multiple items (not just multiple renders per item) and the outcome looks like this with 1 and 16 items:


Now when we know what the desired outcome looks like let's start testing with the first run. 

Test1: Nexus 6, 'Render flower' test:


Test1 conclusions: In this test QQuickPaintedItem (QImage backend) has clearly worst performance, CPU Raster paint engine and uploading into GPU is very non-optimal on Nexus 6. QML Shape performs the best, maintaining fluid 60fps still with 16 individual items. QNanoPainter manages quite well also and switching for QSGRenderNode backend instead of QQuickFramebufferObject to avoid rendering going through FBO gives a nice boost. When the amount of items increases this FBO overhead naturally also increases. QQuickPaintedItem with FBO backend is somewhat slower than QNanoPainter.

This test is kind of a best-case-scenario for QML Shape. If the path were animated, that would be costly for the QML Shape backend. Also for example enabling antialiasing turns the tables, making QML Shape only render 2 items at 35fps while QNanoPainter manages fluid antialiased 60fps. But that's the thing, select the proper tool for your use case.

Next we can test more complex rendering where also paths animate and see how antialiasing affects the performance. In rest of the tests, instead of increasing item count we increase rendering count, meaning how many times stuff is rendered into a single QQuickItem. The default tests set contains ruler, circles, bezier lines, bars, and icons+text tests. With 1, 2 and 16 rendering counts it looks like this:



So let's continue to Test2: Nexus 6, all default tests enabled:


Test2 conclusions: Slowest performer is again QQuickPaintedItem (QImage). QML Shape comes right after it, dropping quite a bit from the lead position of Test1. Digging into QML Shape performance a bit deeper and enabling different tests individually, one can see that the Bezier lines test makes the biggest fps hit. And disabling some code there revealed that the biggest slowdown came from graph dots which were drawn with two PathArc elements, so I improved fps by switching the implementation to use QML Rectangle instead. QNanoPainter is fastest but even it only reaches 60fps with non-antialiased single rendering. Note that QNanoPainter with QSGRenderNode is missing here and in all the rest of the tests because when rendering only a single item its performance is almost the same as QNanoPainter with FBO.

Then we could switch to a bit more powerful hardware and repeat above test with that. 

Test3: Macbook Pro (Mid 2015, AMD R9 M370X), all default tests enabled:


Test3 conclusions: Macbook can clearly handle much more rendering than Nexus 6. As MSAA is fully supported here we are able to test both antialiased and non-antialiased for every rendering method. On Macbook MSAA antialiasing is quite cheap which can be seen from QML Shape and QQuickPaintedItem reaching pretty similar frame rates with and without antialiasing. Slowest performer is antialiased QQuickPaintedItem (QImage) while QNanoPainter is leading again, reaching solid 60fps with 16 render counts.

As we saw already earlier that Bezier lines test seemed particularly unsuitable for QML Shape, let's next repeat the above test except disabling that single test. After all we try to be fair here and avoid misinterpretations. 

Test4: Macbook Pro, all default tests except Bezier lines enabled:


Test4 conclusions: Most interesting data here comes from comparison to Test3 results. QQuickPaintedItem (QImage) results go up only a few percent, so the bezier line test doesn't seem to influence much there. QQuickPaintedItem (FBO) results are now identical for antialiased and non-antialiased so the light blue line can't be seen under the orange one. But not much changes in there either. QNanoPainter improves 30-50%, reaching solid 60fps now with 32 render counts when antialiasing is disabled. And finally, QML Shape improves frame rates by a whopping ~100% so we were right in this particular test being its Achilles' heel.

We are just scratching surface here. There would be plenty of things to test still and get deeper into individual tests. But for this blog post let's stop here.

General tips about Qt 5.10 QML Shape usage could be:
  • Use QML Shape for simple shape items as part of QML UIs. Consider other options for more complex shapes which animate also the path. 
  • Also don't use non-trivial Shape elements in places where creation time matters e.g. ListView delegates or making multiple shapes inside Repeater, as parsing the QML into renderable nodes tree has some overhead.
  • When the need is to render rectangles, straight lines or circles, QML Rectangle element gives generally better performance than QML Shape counterpart. You can experiment with this enabling alternative code paths for RulerComponent and LinesComponent of the demo. 
  • If you target mostly hardware with NVIDIA GPU, GL_NV_path_rendering backend of QML Shape should be more performant. I didn't have suitable NVIDIA hardware available currently for testing so these results will have to wait, anyone else want to provide comparisons?

Follow up post is planned for comparing Windows side OpenGL vs. OpenGL ES + Angle rendering performances so stay tuned!

Thursday, November 9, 2017

Qt 5.10 QML Shape testing

When implementing component into QtQuick UI which needs something more than rectangles, images and texts, pure declarative QML hasn't been enough. Popular choices to use for items with some sort of vector drawing are QML Canvas, QQuickPaintedItem or QNanoPainter.

But with Qt 5.10 there will be support for a new Shape element with paths that contain lines, quads, arcs etc. so I decided to install Qt 5.10 beta3 and implement all tests of "qnanopainter_vs_qpainter_demo" with also QML + Shape elements. (This kinda makes it "qnanopainter_vs_qpainter_vs_qmlshape_demo" but not renaming now). So here is in all glory the same UI implemented with QNanoPainter (left), QQuickPaintedItem (center), and QML+Shape (right):


Hard to spot the differences right? If only there would be a way to prove this, some way to x-ray into these UIs... like QSG_VISUALIZE=overdraw to visualize what Qt Quick Scene Graph Renderer sees?


Here you can see that the scene graph sees QNanoPainter and QQuickPaintedItem as just big unknown rectangles, while it can see into QML+Shape as that is composed of native scene graph nodes. But the proof is in the pudding as they say, what looks the same doesn't perform the same. Here's a video showing all 3 running on two different Android devices:



As different rendering components can be enabled/disabled and settings changed, this demo is quite nice for doing performance comparisons of both exact drawing methods or combining all methods. But those will have to wait for another blog post and for non-beta Qt 5.10 to get fair results. In the meantime, feel free to pull latest sources from github, test yourself and provide patches or comments!

Wednesday, October 25, 2017

FitGraph NG UI prototype

About a month ago I started exercising more, mostly jogging, weights and soccer (with kids). Target is to be in superb shape when 2018 starts, and I'm already feeling stronger & more energetic during the day so looking good!

Anyway, this blog post is somewhat related to that. There's plenty of health-related apps and gadgets available these days and in the past I used some time pondering what would be a perfect activity tracking app for my needs. Now I decided to revive this earlier concept as 'FitGraph NG' while porting it to use QNanoPainter and polishing some parts.

As usual, let's start with a video demonstrating the actual application:


There would of course be more views available, this being just the 'activity timeline' part, but it would already cover many of my initial wishes:
  • Showing the whole day as a graph, data or textually depending on needs.
  • Automatic annotation of activities, type, duration and related activity data. And importantly, being able to select each activity to cover only data during that.
  • See how well you have reached your 'moves' goal which would come from all your activities.
  • Also collect other notes, goals, concerns etc. during the day.
I could write quite a long presentation about this, explain why things are where they are, how interactions are thought out, pinpoint small (but important!) details etc. But don't want to, you can watch the video few times and ponder about those yourself if you wish.

Some more information about the implementation side:
  • Implemented with Qt, so cross-platform on Android, iOS etc. Application logic C++ and UI naturally QML with few shaders.
  • Graphs are painted with single QNanoPainter item for efficiency. Graph animations are driven from QML side for easy tuning.
  • Data is managed with SQLite and fetched into QAbstractListModel. There's configurable QCache to reduce SQL queries. Data in this prototype is generated dummy, but basically allows "unlimited" scrolling of days.
  • Performance was important target, some tricks and optimizations were required to get application working fluidly at 60fps also on lower end Android devices. 
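
For those wondering what the cache in front of the SQL queries roughly does, here is a minimal sketch of a least-recently-used cache in plain C++. It is only a hand-rolled stand-in for the idea, not QCache itself (QCache works with per-item costs and has a different API), and the class and member names are made up for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the cache idea: keeps at most `capacity`
// entries and evicts the least recently used one when full.
template <typename Key, typename Value>
class LruCache
{
public:
    explicit LruCache(std::size_t capacity) : m_capacity(capacity)
    {
        assert(capacity > 0);
    }

    // Returns a pointer to the cached value, or nullptr on a miss.
    // A hit marks the entry as most recently used.
    const Value *get(const Key &key)
    {
        auto it = m_index.find(key);
        if (it == m_index.end())
            return nullptr;
        m_entries.splice(m_entries.begin(), m_entries, it->second);
        return &it->second->second;
    }

    // Inserts or updates an entry, evicting the oldest one if needed.
    void put(const Key &key, Value value)
    {
        auto it = m_index.find(key);
        if (it != m_index.end()) {
            it->second->second = std::move(value);
            m_entries.splice(m_entries.begin(), m_entries, it->second);
            return;
        }
        if (m_entries.size() == m_capacity) {
            m_index.erase(m_entries.back().first);
            m_entries.pop_back();
        }
        m_entries.emplace_front(key, std::move(value));
        m_index[key] = m_entries.begin();
    }

    std::size_t size() const { return m_entries.size(); }

private:
    using Entry = std::pair<Key, Value>;
    std::size_t m_capacity;
    std::list<Entry> m_entries; // front = most recently used
    std::unordered_map<Key, typename std::list<Entry>::iterator> m_index;
};
```

The idea would be to key the cache by day, so that scrolling back and forth over recently viewed days gets a hit and only days that have fallen out of the cache trigger a new SQL query.
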
Thoughts welcome and thanks for reading!

Monday, October 23, 2017

Unity testing

Couple weeks ago I decided to study a bit about Unity as I haven't worked with it before. So implemented a simple "Rock Rolling in Terrain" game prototype as a case study, looking like this:


By coincidence Marko also just made a nice blog post related to Unity, and thanks to standard assets his terrain even looks quite similar to mine so not going to repeat similar notes. But all in all my initial feeling is that Unity seems quite productive environment and wouldn't mind implementing some bigger project with it to get deeper.