Push Token Profiling and Benchmarking

  • privacyIDEA version: 3.8
  • Installation: Docker image
  • Deployment: Docker Compose stack consisting of an Apache reverse proxy, privacyIDEA behind a uwsgi frontend (serving a uwsgi protocol socket), an LDAP server as the user store, and MariaDB as the main database.
  • uwsgi configuration (sketched below):
    • master process
    • 4 processes
    • 2 threads per process
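
For reference, the settings above boil down to something like the following uwsgi ini fragment (the socket path is a placeholder and the WSGI entry point is omitted, since both depend on the image):

master = true
processes = 4
threads = 2
# uwsgi protocol socket consumed by the Apache reverse proxy
socket = /run/uwsgi/privacyidea.sock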

We are using HTTP for testing purposes in order to avoid any SSL handshake overhead.

After a user has enrolled their token, authentication is performed through Apereo CAS. When a new challenge is sent to the user, a web page with a countdown clock is displayed. Before we reach that step, CAS calls two privacyIDEA endpoints (sketched below):

  • GET /token/
  • POST /validate/triggerchallenge
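
In rough terms, the two calls amount to the following (an illustrative sketch, not the CAS plugin code; the base URL, admin JWT and user name are placeholders):

import requests

BASE = "http://pi.example.org"   # placeholder privacyIDEA base URL
ADMIN_JWT = "..."                # JWT obtained from POST /auth (placeholder)
HEADERS = {"Authorization": ADMIN_JWT}

# 1. Check that the user actually has a token
tokens = requests.get(f"{BASE}/token/",
                      params={"user": "alice"}, headers=HEADERS).json()

# 2. Trigger a challenge (push notification) for that user
challenge = requests.post(f"{BASE}/validate/triggerchallenge",
                          data={"user": "alice"}, headers=HEADERS).json()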

Our aim was to profile and benchmark these endpoints in order to determine how well privacyIDEA can handle bursts of authentication requests.

Running pi-manage profile produced the following results:

  • GET /token/ took 0.225 secs with most of the time consumed in cryptographic functions.
  • POST /validate/triggerchallenge took 1.141 secs with most of the time consumed in cryptographic functions.

Details below:

PATH: '/token/'
         121017 function calls (119585 primitive calls) in 0.225 seconds

   Ordered by: internal time, call count
   List reduced from 2474 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3    0.095    0.032    0.095    0.032 {built-in method cryptography.hazmat.bindings._openssl.RSA_check_key}
       32    0.011    0.000    0.011    0.000 {method 'recv_into' of '_socket.socket' objects}
        3    0.007    0.002    0.007    0.002 {built-in method cryptography.hazmat.bindings._openssl.EVP_DigestUpdate}
        8    0.006    0.001    0.023    0.003 /usr/local/lib/python3.10/site-packages/ldap3/protocol/rfc4512.py:427(from_definition)

 PATH: '/validate/triggerchallenge'
          74940 function calls (73736 primitive calls) in 1.141 seconds
 
    Ordered by: internal time, call count
    List reduced from 2735 to 30 due to restriction <30>
 
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         5    0.440    0.088    0.440    0.088 {method 'read' of '_ssl._SSLSocket' objects}
         5    0.350    0.070    0.350    0.070 {built-in method cryptography.hazmat.bindings._openssl.RSA_check_key}
         2    0.089    0.045    0.089    0.045 {method 'do_handshake' of '_ssl._SSLSocket' objects}
         3    0.048    0.016    0.048    0.016 {method 'connect' of '_socket.socket' objects}
         3    0.045    0.015    0.045    0.015 {built-in method _socket.getaddrinfo}
         2    0.040    0.020    0.040    0.020 {method 'load_verify_locations' of '_ssl._SSLContext' objects}

Our next step was using Apache Benchmark (ab from apache2-utils) to benchmark these endpoints. Our server is a workstation with 16 cores and our client a workstation with 12 cores. Calling the /validate/triggerchallenge endpoint resulted in ~530 msec/req, with horizontal scaling as long as the number of uwsgi processes/threads was greater than or equal to the number of concurrent ab requests.
Results per concurrency:

  • 2: 3.5 reqs/sec, 550 msec/req
  • 4: 7.14 reqs/sec, 530 msec/req
  • 6: 10.43 reqs/sec, 530 msec/req
  • 8: 13.91 reqs/sec, 506 msec/req
  • 10: 15.75 reqs/sec, 540 msec/req
  • 12: 17.71 reqs/sec, 590 msec/req

Similar scaling results were obtained when testing the /token/ endpoint.
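
For anyone who wants to reproduce a similar sweep without ab, a rough Python equivalent is sketched below (URL, authorization header and POST data are placeholders; it reports mean throughput and latency for a fixed concurrency level):

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://pi.example.org/validate/triggerchallenge"  # placeholder
HEADERS = {"Authorization": "..."}                       # placeholder admin JWT
DATA = {"user": "alice"}                                 # placeholder user

def one_request(_):
    t0 = time.monotonic()
    requests.post(URL, data=DATA, headers=HEADERS)
    return time.monotonic() - t0

def sweep(concurrency, total=100):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(one_request, range(total)))
    elapsed = time.monotonic() - start
    print(f"c={concurrency}: {total / elapsed:.2f} reqs/sec, "
          f"{1000 * sum(latencies) / len(latencies):.0f} msec/req")

for c in (2, 4, 6, 8, 10, 12):
    sweep(c)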

We are a bit troubled by the fact that it takes half a second to service a request to /validate/triggerchallenge. It seems that most of the service time is spent in cryptographic functions so it is not obvious how we might be able to improve the endpoint performance.

We will be providing the service to large academic institutions, so it is not unreasonable to expect hundreds or thousands of authentication requests within a short period of time (for instance, during an online vote), and we would like to pinpoint the hardware/server resources necessary to accommodate such traffic.

Do the above sound reasonable? Any thoughts on improving performance?

Thanks in advance.

I would like to follow up on the report with some additional information that might be helpful.

In order to pinpoint the problem, we added some execution-time reporting to the privacyIDEA code (sketched after the list), mainly in:

  • create_challenge() in lib/tokens/pushtoken.py
  • _build_smartphone_data() in lib/tokens/pushtoken.py (called by create_challenge()) and
  • submit_message() for the Firebase smsprovider in lib/smsprovider/FirebaseProvider.py
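
The reporting itself is nothing fancy; our actual patch is a couple of inline timing calls, but a decorator along the following lines captures the same information:

import logging
import time
from functools import wraps

log = logging.getLogger(__name__)

def timed(func):
    """Log the wall-clock execution time of a call in msecs."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        t0 = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            log.info("%s execution time (msecs): %d",
                     func.__name__, (time.monotonic() - t0) * 1000)
    return wrapper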

We are using push tokens and the Firebase service to send notifications to the user's phone.

Our results were the following:

tokens.pushtoken:208] _build_smartphone_data().load_pem_private_key() execution time (msecs): 202
tokens.pushtoken:220] _build_smartphone_data().sign execution time (msecs): 5
tokens.pushtoken:223] _build_smartphone_data() execution time (msecs): 208
smsprovider.FirebaseProvider:165] firebase.submit_message().post() execution time (msecs): 623
smsprovider.FirebaseProvider:175] firebase.submit_message() execution time (msecs): 665
tokens.pushtoken:932] create_challenge() execution time (msecs): 902

Loading the private key takes ~200 msecs and posting to the Firebase endpoint >600 msecs, which together account for most of the ~900 msecs spent in create_challenge(). Probably the most serious issue is that all these calls are blocking I/O, so the corresponding uwsgi thread is stalled while waiting for Firebase.

Based on the discussion in this issue, we ran similar tests in our container using the following code (cryptography v. 38.0.3 and openssl v. 1.1.1n-0+deb11u3):

from cryptography.hazmat.primitives import serialization

# Load the same RSA private key 1000 times to isolate the cost of
# load_pem_private_key(), which includes OpenSSL's key consistency check.
for _ in range(1000):
    with open("key.pem", "rb") as key_file:
        private_key = serialization.load_pem_private_key(key_file.read(), password=None)

which took 29 secs to complete, i.e. roughly 29 msecs per key load.
Adding the following code

# Skip the RSA key consistency check that OpenSSL performs on every key load
from cryptography.hazmat.backends.openssl import backend
backend._rsa_skip_check_key = True

resulted in a drop of the total execution time to 0.52 secs. The same change in _build_smartphone_data() resulted in an execution time of a few msecs.

So our questions are basically the following two:

  • Is it possible to enable async I/O in privacyIDEA while using uwsgi? uWSGI does provide documentation on the subject, but it clearly needs the application's cooperation and offers multiple loop engines to select from. Async I/O would probably solve our problem, since we want to keep handling new requests while waiting for Firebase to respond.
  • What are the implications of enabling backend._rsa_skip_check_key = True? Would it be reasonable to make it configurable through a policy setting?

Thanks for any insights

It would be interesting to add this code to your privacyIDEA profiling test environment.

It is bad practice for a plugin to do a GET /token/. If this CAS plugin worked correctly, you could save 200 ms.

Do not run GET /token/, rather run triggerchallenge directly!

We did that and the execution time for load_pem_private_key() dropped to a few msecs (from 200 msecs).

Should we open a feature request for that to be made configurable through a policy?

Thanks for the insight on that one and for your help. The main reason for calling GET /token/ is to make sure that the user has an active token in the first place (otherwise we can direct them to enroll one).

Yet I think that our main issue is concurrency (or lack thereof) and blocking I/O. As long as privacyIDEA cannot do Async I/O we will have to invest thread and CPU resources in order to combat the lack of concurrency.

For instance, if firebase.submit_message().post() was async then it would allow uwsgi to schedule the thread to handle another request while waiting for Firebase to respond. As things stand, blocking I/O blocks the whole thread and the only way to handle more requests is to increase the number of CPU cores and/or uwsgi processes/threads.
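
For illustration only (we have not tried this with privacyIDEA): the kind of mechanism that would make the blocking POST cooperative is uWSGI's gevent loop engine combined with monkey patching, roughly like this (function name, URL and arguments are placeholders):

# Requires running uwsgi with its gevent loop engine (e.g. "gevent = 100"
# in the uwsgi config) instead of plain processes/threads.
from gevent import monkey
monkey.patch_all()  # make socket/ssl I/O yield to other greenlets

import requests

def post_to_firebase(url, payload, headers):
    # While this call waits for Firebase, the same worker process can keep
    # serving other requests running in other greenlets.
    return requests.post(url, json=payload, headers=headers, timeout=10)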

Obviously, doing async I/O in a threaded environment (especially with an outside scheduler such as uwsgi) is more than tricky to do correctly. Do you have any suggestions on that one?

Ok, another update with some interesting results.

First of all, we tried adopting the uWSGI cheaper subsystem for dynamic process scaling using the spare2 algorithm. The relevant configuration is as follows:

# Use dynamic scaling with the spare2 cheaper algorithm
plugin-dir = /usr/local/lib/uwsgi
need-plugin = cheaper_spare2
workers = 32           # maximum number of workers
cheaper-algo = spare2
cheaper = 16           # try to keep 16 idle workers
cheaper-initial = 16   # start with 16 workers
cheaper-step = 2       # spawn at most 2 workers at once
cheaper-idle = 60      # cheap one worker per minute while idle

Other fine tuning actions included:

  • Enabling connection pooling on the LDAP resolver
  • Increasing the innodb_buffer_pool_size of the MariaDB container to 0.5 GB to be on the safe side, instead of the 135 MB default of the Docker image configuration (see the snippet below)
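
For reference, the buffer pool change is a one-line override in the MariaDB configuration (the exact file name and location depend on the image):

[mysqld]
innodb_buffer_pool_size = 512M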

Running ab against the /validate/triggerchallenge endpoint with an increasing number of concurrent requests still resulted in quite low throughput of a few reqs/sec, with the CPU saturated at 16 concurrent requests or above.

Disabling key checking changed things completely. Our results (mean values, grouped by concurrency) are as follows:

  • 12: 18.31 reqs/sec, 655 msec/req
  • 24: 24.27 reqs/sec, 989 msec/req
  • 32: 26.71 reqs/sec, 1198 msec/req
  • 48: 33.02 reqs/sec, 1454 msec/req
  • 56: 29.13 reqs/sec, 1922 msec/req
  • 64: 33.28 reqs/sec, 1923 msec/req

The server configuration is 8 CPU cores and 6 GB RAM (relatively under-powered). Each uwsgi process requires about 90 MB of memory, so a few dozen processes translate into several GB. In these runs the CPU was usually not saturated, which made a big difference in scaling the endpoint throughput.

Overall the results are quite satisfying. Even our under-powered server can service a fair number of concurrent requests (20-30 reqs/sec) and thus a large enough user base (or set of authentication requirements). It seems reasonable that one could scale horizontally (using HAProxy and a central SQL DB) or vertically, increasing the total capacity of the service roughly linearly.

We will be submitting a feature request to be able to disable key checking through a policy setting and will appreciate any further insights on the matter.


Yes, please do so.
Thanks!

We have already done that with issue #3590

We have also opened issue #3604 suggesting a related change in the init function of the Sign class in lib/crypto.py.

The Sign class is used by the SQL audit module. This means that in our case we make three such calls per request (two for the sqlaudit module, for read and write, wrapped around the containeraudit module, and one for the actual endpoint function), with a total cost of about 400 msecs according to our profiling runs.

Disabling the private key check results in a significant throughput increase when benchmarking. For instance, benchmarking the /validate/polltransaction endpoint with ab (100 total requests, 4 concurrent connections) increases the rate from 8 reqs/sec to 57 reqs/sec. At the same time, total CPU usage (derived from docker stats) decreases from 3.8 to 2.3 CPU cores, i.e. by roughly 40%.

In our view such a change makes privacyIDEA enterprise ready.

With private key checking enabled we are only able to handle 14 reqs/sec on the /validate/polltransaction endpoint on an 8-core server before saturating the CPU (at a bit more than 8 concurrent requests in ab).

Without private key checking we are able to reach 120 reqs/sec with 16 concurrent requests and no CPU saturation (only 6 CPU cores are used in that scenario, leaving 2 for MariaDB to do its work).