CUDA – MCC for Fingerprint Minutia Pairing
When I was literally thrown into biometrics, I had no idea how fingerprint matching works. Unfortunately for me, the colleagues who were supposed to know it, and even share that knowledge with us, didn't know it either. No, this is actually not a joke; it's a rather sad story, but I am not going to cry over my professional career here.
Before we start, let's do a very quick introduction to fingerprint biometrics: a minutia is a point in the 2D space of a fingerprint image (its coordinate range given by the scanner's resolution). As such, it has its XY location coordinates. That's not the whole story, however. Depending on the unique distortions of each fingerprint, a minutia is most probably the result of a ridge bifurcation or a ridge ending. One can therefore also define an orientation (i.e. an angle) of such a point or distortion. A standard fingerprint has on average between 40 and 100 of these points (minutiae).
In order to match two fingerprints, one has to make sure that points with the exact (or, usually, just "similar") properties can be found in both fingerprints. The algorithms that actually perform this minutia extraction are also quite complicated and definitely not part of this post. Right now, it is important only that we need to match a set of ~100 [X, Y, φ] triplets between the two fingerprints.
First of all, it's important to say that due to various reasons (faulty detections, imperfect and partial scans, fingerprint scars and other nonlinear distortions, etc.), fingerprint matching is still not a fully solved problem and likely never will be. Various algorithms have been introduced and described, and usually all of them have some pros and some cons; most of them trade speed for accuracy. If you have some mathematical background in combinatorics, you can try to calculate the total number of possible fingerprints on a 2D space of 1024×1024 samples. To save you time: it's so huge that brute-forcing fingerprint matching is beyond our abilities as of today.
Most algorithms that match two minutia-based fingerprints generally have three parts:
- Prepare a template and/or description of each fingerprint.
- Match minutiae from the first fingerprint onto the second fingerprint.
- Calculate a final similarity score between the two fingerprints.
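The three steps above can be sketched as a CPU-side skeleton. Everything here is my own simplification (a real MCC template build is far more involved, and the global-score step has many variants in the literature):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Minutia { float x, y, angle; };
using Cylinder = std::vector<float>;  // flattened cylinder samples

// Step 1: build one local descriptor (cylinder) per minutia.
// A real MCC implementation fills each cylinder from the neighboring
// minutiae; a zero-filled stub stands in for that here.
std::vector<Cylinder> buildTemplate(const std::vector<Minutia>& mins,
                                    std::size_t cylinderSize) {
    return std::vector<Cylinder>(mins.size(), Cylinder(cylinderSize, 0.0f));
}

// Step 2: local similarity = point-wise multiply + sum of two cylinders.
float cylinderSimilarity(const Cylinder& a, const Cylinder& b) {
    float dot = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) dot += a[i] * b[i];
    return dot;
}

// Step 3: fold the N-by-N local similarities into one global score;
// taking the single best pair is the simplest possible choice.
float globalScore(const std::vector<std::vector<float>>& localSims) {
    float best = 0.0f;
    for (const auto& row : localSims)
        for (float s : row) best = std::max(best, s);
    return best;
}
```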
Initially, I wanted to examine one of today's state-of-the-art algorithms for step #2, known as MCC – Minutia Cylinder-Code. The original document is located HERE – it's actually well described and involves some standard university math. The trickiest part of the algorithm is actually step #1; for us working on GPUs right now, it's enough to mention how step #2 works. Don't worry, I usually assign myself an "easy" 😇 topic for free-time activities. So, to be honest with you, step #2 basically multiplies (point-wise) one cylinder (imagine a vector of samples) with another cylinder and then sums the results. That's really it! If you did not follow: in other words, it's a dot product – one of my favorite CUDA algorithms, which has actually been the default project template in MS Visual Studio for the past few years once you install the CUDA SDK.
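The step-2 comparison really is just a point-wise multiply followed by a sum. A CPU reference of it is a few lines (the CUDA version would be the classic parallel reduction over the very same loop):

```cpp
#include <cstddef>
#include <vector>

// Local similarity of two MCC cylinders, following the post's
// simplification: point-wise multiply, then accumulate.
float cylinderDot(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];  // maps to one FMA per sample on a GPU
    return sum;
}
```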
Now that we know what we are trying to achieve, let's get to work! Before we start, let's define some variables. The cylinder size: here we can assume, for example, 16×16×8 samples (2048). Even though I mentioned ~100 points previously, let's use 128 to add some headroom for fingerprints that exceed the average. Now let's take a quick look at these numbers: 1 cylinder = 8 KB (assuming floating-point (4 B) values – even though the original whitepaper also allows for bitwise implementations). 1 fingerprint = 128 × 8 KB = 1 MB. The tricky part is however still hidden: in order to properly match those damn cylinders, we have to do an N-by-N comparison (i.e. compare every cylinder from one fingerprint with each cylinder of the other fingerprint). Wait, what? A 128×128 comparison for just one record? Yes, that's it. That is, to be honest, 2048 × 128 × 128 (or ~32M FP multiplications) for one record – omitting the accumulation, which is usually incorporated into an FMA anyway. The worst thing, however, is the memory throughput required to do the comparison.
Simply said, the algorithm itself is very easy to understand, but due to its huge memory and compute requirements, it's really a challenge to implement it properly in CUDA. In fact, I have spent quite some time investigating how one might achieve the highest performance, and after implementing and optimizing several kernels and adjusting various memory access patterns, I have finally estimated that I can reach up to 27.5K records/s on my mobile GeForce RTX 3060, which corresponds to about 1 TFLOP/s of computing power. Looking further into the PTX, it's hard to imagine this can be optimized much more (although slightly, it can). The performance can however be improved in proportion to the number of samples per cylinder in the first place, and in the second place it may even be doubled by using __half (FP16) precision only. As such, for FP16 and an 8×8×6 cylinder, the performance can jump up to 10 times, to ~250K records/s, and then scale further with better graphics cards. If only INT8 performance is considered, it can jump up an additional ~2 times, and another ~2 times if only 96 minutiae are taken into account instead of 128.
I am glad I have had this unique opportunity to play around with common GPU features (shared, constant and texture memory) as well as asynchronous copies into shared memory (introduced with the Ampere architecture), and with overlapping data transfers with compute using either CUDA streams or CUDA pipelines (or even both). The introduction of cooperative groups to CUDA is also a very nice feature, giving us the capability to issue different instructions to different warps. I have to say, though, that I would personally vote against warp-level tuning unless it was super beneficial: the lower we go, the tighter the bond with the hardware – and GPUs especially change frequently, so the performance gains are not that big compared with the time investment and other optimizations.
Finally, I have to say that NVIDIA has done a lot of work on both the hardware and on the profiling tools and documentation. However! The documentation still needs a lot of improvement (not to mention the bugs and typos I have found there), and so do the Nsight Compute profiling tools. Especially the warp stall reasons and the source-counters section – these definitely need a lot of work. You can find only a very basic description of what those counters mean, but not much information on how to overcome the stalls or what to do about them. I was also slightly confused by the occupancy calculations (among other things). It may actually happen that a kernel with lower occupancy works better than a kernel with higher occupancy – yet the tool will still include a nice warning that you can improve performance by increasing occupancy.
I would also say that in order to achieve maximum performance, one has to interleave a really large amount of work. For example, even if you think the L2 cache (thanks to its existence) is going to save you, and even if you see a close-to-100% hit rate, you will likely hit a bottleneck somewhere else – in this case it might be the overloaded TEX (texture) units. It feels like no matter what instruction you use, it is going to have a lot of latency, and you will have a hard time covering that latency with other warps and/or blocks. Naturally, I know that CUDA works by "hiding latencies", but I am getting a little bit afraid that even for a 1-bit shift, you will need tens of other warps to properly hide the latency of this single stupid instruction. At least that's my feeling – a feeling that NVIDIA is making it harder and harder to cover latencies due to increasing pipeline lengths.
If you are interested in more details, feel free to contact me 👀