Single Trace Attack Against RSA Key Generation in Intel SGX SSL

Microarchitectural side-channel attacks have received significant attention recently. However, while side-channel analyses on secret key operations such as decryption and signature generation are well established, the process of key generation has not received particular attention so far. Especially because microarchitectural attacks usually require multiple observations (more than one measurement trace) to break an implementation, one-time operations such as key generation routines are often considered uncritical and out of scope. However, this assumption is no longer valid for shielded execution architectures, where sensitive code is executed, in the realm of a potential attacker, inside hardware enclaves. In such a setting, an untrusted operating system can conduct noiseless controlled-channel attacks by exploiting page access patterns. In this work, we identify a critical vulnerability in the RSA key generation procedure of Intel SGX SSL (and the underlying OpenSSL library) that allows secret keys to be recovered from observations of a single execution. In particular, we mount a controlled-channel attack on the binary Euclidean algorithm (BEA), which is used for checking the validity of the RSA key parameters generated within an SGX enclave. Thereby, we recover all but 16 bits of one of the two prime factors of the public modulus. For an 8192-bit RSA modulus, we recover the remaining 16 bits and thus the full key in less than 12 seconds on a commodity PC. In light of these results, we urge a careful re-evaluation of cryptographic libraries with respect to single trace attacks, especially if they are intended for shielded execution environments such as Intel SGX.


INTRODUCTION
Side-channel attacks represent a serious threat to cryptographic implementations. Software-based side-channel attacks [23] are particularly dangerous, as they can be performed purely by executing code on a targeted machine. These attacks typically exploit optimizations on the software level, e.g., optimized implementations where executed code paths depend on the processed data [30], and on the hardware level, e.g., the cache hierarchy where memory accesses depend on the processed data [47,52]. In order to prevent such attacks, implementations should favor constant-time programming paradigms [15,31] over performance optimizations.
Although cryptographic implementations (e.g., in OpenSSL [20]) are often hardened against side-channel attacks on secret key operations such as decryption and signature generation of digital signature schemes, the process of key generation has been mostly neglected in these analyses. While power analysis attacks targeting the prime factor generation during RSA key generation have been investigated [7,19,48], software-based side-channel attacks have been considered out of scope for various side-channel attack scenarios. On the one hand, key generation is usually a one-time operation, limiting possible attack observations to a minimum. Especially in case of noisy side channels, e.g., timing attacks and cache attacks, targeting one-time operations such as the key generation procedure seems to be infeasible given only a single attack observation. On the other hand, key generation might be done in a trusted execution environment inaccessible to an attacker.
The situation, however, has changed with the introduction of shielded execution environments that aim to support secure software execution in untrusted environments and a possibly compromised operating system (OS). For example, Intel Software Guard Extensions (SGX) [17] provide hardware support that allows software to be executed isolated from the untrusted OS. While the OS cannot access memory of enclaves directly, it is still responsible for management tasks of enclaved programs such as virtual-to-physical page mapping. These management tasks enable new attack techniques such as controlled-channel attacks [12,43,51]. By monitoring page faults of enclaved programs, the OS can gather noiseless measurement traces of executed code paths and accessed data, although only at page-size granularity (4 KB). Therefore, the Intel SGX documentation demands side-channel security of code which is to be executed inside enclaves, in particular, to avoid leaking information through page access patterns [16, p. 35].
In light of this powerful attack technique, we investigated the RSA key generation routine of Intel SGX SSL and identified a critical vulnerability that allows the generated private key to be fully recovered by observing page accesses. Different from other microarchitectural attacks on RSA implementations that targeted the modular inversion [2] or the exponentiation operations [1,4,11,39], the attack presented in this paper targets the RSA key generation routine and can be performed with a single trace. The identified vulnerability is due to an optimized version of the Euclidean algorithm (binary Euclidean algorithm), which features input-dependent branches and is used for checking the correctness of the generated prime factors p and q, i.e., whether p − 1 and q − 1 are coprime to the public exponent e, where e is usually fixed to 65 537.
By launching a controlled-channel attack, we recover the executed branches of the binary Euclidean algorithm running inside an enclave program and establish linear equations on the secret input, i.e., the prime factors p or q. Based on these equations, we factor the modulus N = pq with minor computational effort on a commodity PC, i.e., in less than 12 seconds for an 8 192-bit modulus, which trivially yields the private key.
Contributions. The contributions of this work are as follows: (1) We consider an SGX setting and identify a critical vulnerability in the RSA key generation routine of OpenSSL, which relies on the binary Euclidean algorithm (BEA) to check the validity of generated parameters. (2) We present an attack to recover most of the bits of one of the two RSA prime factors, which allows us to factor N = pq and to recover the generated private key. (3) We implement a proof-of-concept attack that recovers generated RSA keys with a single observation only. (4) We provide a patch to mitigate the vulnerability, which is even faster than the original implementation.
Outline. In Section 2, we discuss background information on Intel SGX and related work. In Section 3, we describe the RSA key generation procedure and the binary Euclidean algorithm as implemented in OpenSSL. In Section 4, we discuss the identified vulnerability and our key recovery attack on RSA. In Section 5 and Section 6, we outline our threat model and evaluate our attack in a real-world setting. In Section 7, we discuss existing countermeasures on an architectural level and propose a software patch to fix the identified vulnerability. Finally, we discuss further vulnerabilities in Section 8, and we conclude in Section 9.

BACKGROUND
In this section, we briefly introduce the concept of Intel SGX, and we discuss related work in terms of microarchitectural attacks against the RSA cryptosystem, both in standard settings on general-purpose computing platforms as well as in Intel SGX settings.

Intel SGX
Intel Software Guard Extensions (SGX) [17] provide hardware support for software to be executed isolated from the (untrusted) OS. Thereby, SGX reduces the trust assumption to the hardware only. Hardware-level encryption of memory ensures the confidentiality and integrity of code as well as data within an enclave. Irrespective of the privilege level, memory of enclaves cannot be accessed by software external to the enclave, not even by the OS itself. This policy is enforced by the CPU.
Although in case of Intel SGX the underlying OS need not be trusted, it still performs (security) critical tasks for enclaved programs. Among these tasks are the memory management including virtual-to-physical page mapping. To prevent misconfiguration of a running enclave by the OS, the CPU validates all management tasks that might affect enclave security [33]. Furthermore, enclaved programs share other system resources, such as the underlying hardware, with untrusted processes running on the same system. This makes them vulnerable to various kinds of side-channel attacks based on these shared resources.

Intel SGX SSL
The Intel SGX SSL library [18] is a cryptographic library for SGX enclaves. It is built on top of OpenSSL [20], a widely used toolkit for cryptographic purposes. Since Intel SGX SSL operates on OpenSSL, it inherits all of OpenSSL's side-channel properties including mitigation techniques but also potential vulnerabilities. In particular, OpenSSL employs several side-channel countermeasures to thwart traditional side-channel attacks such as cache attacks.

Microarchitectural Attacks on RSA
Acıiçmez [1] proposed the first attack exploiting the instruction cache (I-cache) to infer executed instruction paths taken by square and multiply operations in sliding window exponentiations. In a subsequent work, Acıiçmez and Schindler [4] attacked the extra reduction step of the Montgomery multiplication routine by exploiting the I-cache. Recently, Bernstein et al. [9] showed how to use knowledge of performed sliding window operations to infer private exponents.
Percival [37] proposed to monitor the square and multiply operations during the modular exponentiation of RSA by means of a technique that later became known as Prime+Probe [47]. In an effort to thwart cache-based attacks on the modular exponentiation, OpenSSL implemented a technique denoted as scatter-gather, which has been improved in [24,26]. The idea of scatter-gather is to store fragments of sensitive data in multiple cache lines, such that the same cache lines are fetched irrespective of the accessed data elements. Yarom et al. [53] attacked the scatter-gather technique by exploiting cache-bank conflicts [8,47], resulting in a sub-cacheline granularity attack. For a 4 096-bit RSA modulus they required 16 000 decryptions in order to recover the key.
Another procedure that has been attacked in the context of RSA (as well as ECDSA) is the modular inversion operation, i.e., computing the inverse x of an element a modulo n such that ax ≡ 1 mod n. Modular inversion operations are central to public key cryptography. Therefore, in the past, software implementations relied on an optimized variant of the extended Euclidean algorithm (EEA), namely the binary extended Euclidean algorithm (BEEA) [34, Algorithm 14.57]. Based on the observation that this optimized variant executes input-dependent (i.e., secret-dependent) branches, Acıiçmez et al. [2] suggested attacking the modular inversion during RSA computations by means of branch prediction analysis (cf. [3]). They speculated that all branches of an attacked application can be monitored precisely, but did not implement the attack. At the same time, Aravamuthan and Thumparthy [5] pointed out that the BEEA is vulnerable to simple power analysis (SPA) attacks. Both attacks assumed the possibility to precisely distinguish between all branches taken in order to attack the modular inversion operation.
Later on, García and Brumley [22] suggested a Flush+Reload attack on the BEEA to attack the ECDSA implementation of OpenSSL. García and Brumley implemented the proposed attack and recovered parts of the nonce values used in subsequent signature computations, which allowed them to recover the secret key. In order to mitigate these attacks, the OpenSSL procedure computing the modular inverse has been rewritten such that it prevents branches that leak sensitive information.
Side-Channel Attacks against RSA Key Generation. So far, side-channel attacks against RSA key generation routines relied on power analysis and targeted the prime generation procedure. For example, Finke et al. [19] performed a simple power analysis attack (SPA) on the prime generation procedure, i.e., the sieving process, by assuming that the power consumption reveals the number of trial divisions before the Miller-Rabin [34, Algorithm 4.24] primality test is applied. Assuming that the prime candidates are incremented by a constant value in case of a failure, Finke et al. establish equations that allow factoring the modulus. Similarly, Vuillaume et al. [48] considered differential power analysis (DPA), template attacks, and fault attacks to attack the prime generation procedure. However, Vuillaume et al. consider the Fermat test [34, Algorithm 4.9], which is rarely used in practice due to false positives (Carmichael numbers). Bauer et al. [7] also attacked the prime sieve procedure during the prime number generation. All these side-channel attacks either target the primality test or the prime generation itself and cannot be executed by only running software on the targeted machine. They all require physical access.
Differentiation from Existing Attacks on Key Generation. The attack presented in this paper differs from previous attacks on RSA key generation as follows. First, contrary to related work, which targets the prime generation itself [48] or the primality tests [7,19], we target the subsequent parameter checking routine. Second, previous attacks rely on power analysis, while we use a purely software-based side channel. To the best of our knowledge, software-based microarchitectural attacks on the RSA key generation procedure have not been analyzed so far.

Attacks in SGX Settings
Currently, three types of side-channel attacks have been investigated against SGX enclaves, namely controlled-channel attacks, cache attacks, and branch prediction attacks. Controlled-channel attacks only allow monitoring data accesses and execution at page granularity (4 KB), but in a noiseless manner. In contrast, cache attacks enable more fine-grained monitoring (e.g., 64 bytes), but at the cost of measurement noise. Hence, there is a trade-off between granularity and measurement noise. Branch prediction attacks can distinguish single code branches at instruction granularity.
Controlled-Channel Attacks. Controlled-channel attacks [51] (also referred to as pigeonhole attacks [43] or page-level attacks [50]) rely on the fact that the OS manages the mapping between virtual and physical pages for all processes, including processes executed inside hardware enclaves. Hence, the OS can modify the present bit for page table entries (PTEs), which allows the OS to cause page faults and to precisely monitor these page faults for an enclaved process that accesses the unmapped pages during its execution. Thus, the OS can observe the memory accesses or executed code paths of an enclave at page granularity. Instead of using the present bit, page faults can also be triggered by making pages non-executable [50] using the non-executable (NX) bit, or by setting a reserved bit [50,51]. As before, this allows precise monitoring of page accesses.
Xu et al. [51] used controlled-channel attacks to extract sensitive data such as images and processed texts from enclaved programs. Shinde et al. [43] studied known information leaks in cryptographic primitives of OpenSSL and Libgcrypt with respect to page-level attacks. However, Shinde et al. did not identify the information leak exploited in this paper. Xiao et al. [50] used page-level attacks to mount Bleichenbacher and padding oracle attacks on various TLS implementations.
Previous page-fault based attacks could not monitor the execution of single instructions on a page. Hähnel et al. [27] and van Bulck et al. [12] relied on frequent timer interrupts of the Advanced Programmable Interrupt Controller (APIC) in order to read and clear the accessed bit of the PTE. This even allows single-stepping page table accesses during enclave execution. As an example, they suggested attacking a string comparison function, where the APIC interrupts the SGX enclave after every single memory access (byte granularity). Thereby, they are able to determine the length of the compared strings.
Cache Attacks. Since enclaves do not share memory with other processes or even the OS, Flush+Reload attacks [52] are not directly possible against enclaved programs. Nevertheless, other techniques such as Prime+Probe [37,47] can be applied to enclaves. For example, the attacks by Götzfried [25,35] consider an all-powerful attacker who compromised the OS in order to minimize the influence of noise (e.g., by scheduling the enclave on one specific core), yet they still suffer from false positives and false negatives.
Brasser et al. [11] relied on Prime+Probe to attack the decryption process of an RSA implementation running inside an SGX enclave. Schwarz et al. [39] considered a slightly different attack scenario, where also the attack process runs inside an SGX enclave. They also relied on Prime+Probe to attack an RSA implementation running in a co-located SGX enclave. Although they extract 96% of a 4096-bit RSA key within a single trace, the number of remaining bits is still impractically high for a brute-force approach. Even worse, recovery suffers from random bit insertions and deletions at unknown positions. Hence, due to the measurement noise of Prime+Probe, several measurement traces need to be gathered in both attacks [11,39].
Although Flush+Reload cannot be applied on enclaved programs directly, van Bulck et al. [13] proposed to use Flush+Reload to attack the page table entries (managed by the OS) in order to infer what pages have been accessed by the enclave. Thereby, they defeat countermeasures that aim to detect page faults [41,43] or that mask the accessed and dirty flags of page table entries. However, their attack comes at the cost of an even coarser-grained granularity (32 KB) since one cache line holds eight PTEs.
Branch Prediction. Branch prediction represents a special type of cache attack that exploits the branch target buffer (BTB) cache in order to learn information about executed branches [2]. Lee et al. [32] observed that SGX does not clear the branch history when switching between enclave and non-enclave mode, which enables branch shadowing attacks. Branch shadowing represents an enhanced version of branch prediction analysis (cf. [3]), which relies on the last branch record (LBR) instead of RDTSC time measurements as well as APIC timer interrupts to increase the precision.

RSA KEY GENERATION IN OPENSSL
The RSA public key cryptosystem [38] provides public key encryption as well as digital signatures. The RSA key generation routine of OpenSSL (implemented in rsa_gen.c) starts by generating two primes p and q, which are then used to compute the public modulus N = pq. While p and q are chosen randomly during the key generation procedure, it is common practice that the public exponent is fixed to $e = 65\,537_{10} = \mathtt{0x010001}_{16}$ (cf. [10]). The private key is later computed as $d \equiv e^{-1} \bmod \phi(N)$, with ϕ being Euler's totient function. For two prime numbers p and q, $\phi(N) = (p - 1)(q - 1)$. Among other checks, the key generation routine ensures that (p − 1) and (q − 1) are coprime to e, i.e., that the greatest common divisor (GCD) of the public exponent e and (p − 1) as well as (q − 1) is one. These checks are performed by relying on a variant of the Euclidean algorithm, which will be attacked in this paper.
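As a minimal sketch of the parameter check and the private-exponent computation described above, the following Python snippet uses small toy primes. The helper name `toy_rsa_private_exponent` and the toy values are illustrative assumptions and not OpenSSL code; the snippet also relies on Python's built-in `math.gcd` rather than the binary GCD variant discussed in this paper.

```python
from math import gcd

def toy_rsa_private_exponent(p, q, e=65537):
    """Hypothetical helper: validate (p, q, e) and derive d = e^-1 mod phi(N)."""
    # Key generation requires gcd(e, p - 1) = gcd(e, q - 1) = 1.
    if gcd(e, p - 1) != 1 or gcd(e, q - 1) != 1:
        raise ValueError("p - 1 or q - 1 shares a factor with e")
    phi = (p - 1) * (q - 1)   # Euler's totient of N = p * q for primes p and q
    return pow(e, -1, phi)    # modular inverse (requires Python 3.8 or newer)

# Toy parameters only; real keys use primes of 1024 bits or more.
p, q, e = 11083, 9941, 17
d = toy_rsa_private_exponent(p, q, e)
n = p * q
message = 42
assert pow(pow(message, e, n), d, n) == message  # encrypt, then decrypt
```

A candidate prime such as p = 103 would be rejected for e = 17, because p − 1 = 102 is a multiple of 17.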

Binary Euclidean Algorithm
A well-known algorithm to compute the GCD is the Euclidean algorithm [34, Algorithm 2.104]. For two positive integers a > b, it holds that gcd(a, b) = gcd(b, a mod b). Since this algorithm relies on costly multi-precision divisions, a more efficient variant is usually implemented for architectures with no dedicated division unit, using simple (and more efficient) shift operations and subtractions. Listing 1 depicts an excerpt of the Euclidean algorithm as implemented in OpenSSL, which is an optimized version denoted as binary GCD [34,Algorithm 14.54] that has been introduced by Stein [44]. As can be seen in Listing 1, OpenSSL uses the BIGNUM implementation for arbitrary-precision arithmetic. The functionality of each BIGNUM procedure is indicated with comments.
The binary GCD works as follows. If b is zero, a holds the GCD and the algorithm terminates. Otherwise, the algorithm distinguishes the following cases in a loop.
(1) Branch 1 (Lines 7-10): If a and b are odd, then gcd(a, b) = gcd((a − b)/2, b). The division by 2 (implemented as a right shift) is valid because the difference of two odd numbers is always even, whereas 2 does not divide the odd number b and therefore cannot be part of the GCD.
During the execution, the algorithm always ensures that a ≥ b. It swaps a and b as soon as this condition is not satisfied anymore.
A Note on the Implementation. In the source code, the function BN_gcd(...)-used to compute the GCD-calls the function euclid(...) as depicted in Listing 1, but the compiler inlines the corresponding function into BN_gcd(...). Hence, in the remainder of this paper, we will refer to BN_gcd(...) when talking about the vulnerable code.
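To make the case distinction of the binary GCD concrete, the following Python sketch mirrors the structure described above. It is not the OpenSSL source: the function name `binary_gcd` and the returned branch log are illustrative assumptions, and the numbering of branches 2 to 4 follows the standard binary GCD [34, Algorithm 14.54] together with the way the third and fourth branches are referred to later in the text.

```python
def binary_gcd(a, b):
    """Return (gcd(a, b), branches) for non-negative integers a and b.

    branches lists the branch number (1-4) taken in each loop iteration.
    """
    branches = []
    shifts = 0                      # common factors of two removed so far
    if a < b:
        a, b = b, a                 # keep the larger value in a
    while b != 0:
        if a & 1:
            if b & 1:               # branch 1: a and b odd
                a = (a - b) >> 1    # subtraction followed by a right shift
                branches.append(1)
            else:                   # branch 2: a odd, b even
                b >>= 1
                branches.append(2)
        else:
            if b & 1:               # branch 3: a even, b odd
                a >>= 1
                branches.append(3)
            else:                   # branch 4: a and b even
                a >>= 1
                b >>= 1
                shifts += 1
                branches.append(4)
        if a < b:
            a, b = b, a             # restore a >= b
    return a << shifts, branches
```

For inputs of the form a = p − 1 (even) and b = e (odd), the log starts with branch 3 and then alternates between branches 1 and 3 until the first swap.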

ATTACKING RSA KEY GENERATION
During RSA key generation, the binary GCD variant described in Section 3 is used to ensure that p − 1 and e are coprime. In order to do so, the algorithm depicted in Listing 1 is executed with a = p − 1 (with p being the secret prime) and b = e (the public exponent). The crucial observation is that the binary GCD executes different branches depending on the input parameters. An attacker who is able to observe the executed branches can recover the secret input value a = p − 1 and, hence, the secret prime factor p.
Without loss of generality, we describe the attack by targeting the prime factor p, but the presented attack can also be applied to recover the prime factor q. Once we have recovered either of the two prime factors, N can be factored trivially, which also allows us to compute the private exponent d.

Idealized Attacker
For the sake of completeness, we first consider an attacker who can precisely distinguish all executed branches of the binary GCD algorithm, including the swapping operations (e.g., in lines 10 and 15). Such an attacker could rely, for example, on branch shadowing [32] or the generalized attack described in Section 4.4. Let a be the unknown secret input to be recovered, b the known input, and $a_i$, $b_i$, $i \geq 0$, all intermediate values calculated by the algorithm. To recover the secret input a, we build a system of linear equations, starting with $a = a_0$ and $b = b_0$. We then iteratively add one pair of equations per loop iteration, relating $(a_{i+1}, b_{i+1})$ to $(a_i, b_i)$ according to the executed branch, and increment i by one before proceeding with the next iteration.
In addition, if a and b are swapped, i.e., BN_cmp(a, b) < 0 yields true, we add the following two equations and increment i again: $a_{i+1} = b_i$ and $b_{i+1} = a_i$. The algorithm finishes after n steps with $a_n = \gcd(a, b)$ and $b_n = 0$. By recursively substituting all equations, one can express the unknown a as a linear equation, which is trivial to solve, given that gcd(a, b) is known to be 1 in case of valid RSA parameters.
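The idealized attack can be sketched in Python as follows. This is a model under stated assumptions, not the authors' implementation: `trace_gcd` is a hypothetical stand-in for the victim that logs every branch ("B1" to "B4", numbered as in the standard binary GCD) and every swap ("SWAP"), and `invert_trace` undoes the logged steps in reverse order, starting from the known final state (gcd, 0).

```python
def trace_gcd(a, b):
    """Hypothetical victim model: binary GCD that logs branches and swaps."""
    trace, shifts = [], 0
    if a < b:
        a, b = b, a
        trace.append("SWAP")
    while b != 0:
        if a & 1 and b & 1:          # both odd
            a = (a - b) >> 1
            trace.append("B1")
        elif a & 1:                  # a odd, b even
            b >>= 1
            trace.append("B2")
        elif b & 1:                  # a even, b odd
            a >>= 1
            trace.append("B3")
        else:                        # both even
            a >>= 1
            b >>= 1
            shifts += 1
            trace.append("B4")
        if a < b:
            a, b = b, a
            trace.append("SWAP")
    return a << shifts, trace

def invert_trace(trace, g=1):
    """Rebuild the initial (a, b) from the full trace and the known GCD g."""
    # Final loop state: a holds the GCD without the common factors of two, b = 0.
    a, b = g >> trace.count("B4"), 0
    for op in reversed(trace):
        if op == "SWAP":
            a, b = b, a
        elif op == "B1":
            a = 2 * a + b            # undo a = (a - b) / 2
        elif op == "B2":
            b = 2 * b                # undo b = b / 2
        elif op == "B3":
            a = 2 * a                # undo a = a / 2
        elif op == "B4":
            a, b = 2 * a, 2 * b      # undo halving of both values
        else:
            raise ValueError(f"unknown trace symbol: {op}")
    return a, b

g, trace = trace_gcd(11082, 17)      # toy input a = p - 1, b = e
assert invert_trace(trace, g) == (11082, 17)
```

Note that the inversion reproduces both inputs, so the known value b = e can serve as a consistency check on the recorded trace.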

Page-level Attacker
Although considering a powerful attacker who is capable of distinguishing all branches is a realistic assumption [32], we resort to a weaker assumption in the rest of this paper. We consider a page-level attacker [43,51], who recovers the secret input a from even fewer observations (up to the point where the two variables are swapped) and at a coarser granularity (page level). Figure 1 illustrates an excerpt of the control flow of the binary GCD for the four important branches being executed and, for illustration purposes, also the mapping of specific functions to their corresponding code pages (the mapping depicts the actual offsets of the most recent commit 899e62d of OpenSSL 1.1.0g). If an attacker can distinguish executed branches based on page-access observations, the Euclidean algorithm can be reverted and the secret input a can be recovered. Indeed, the functions BN_sub(...) and BN_rshift1(...) reside on different pages within the memory, denoted as page 1 and page 4, while BN_gcd(...) is on page 2.
Observations. If this algorithm is executed with RSA parameters (a = p − 1 and b = e), we observe the following:
(1) Since p is a prime number, p − 1 is even. Hence, in the first iteration, the first parameter (a = p − 1) is always even and the second parameter (b = e) is always odd, as otherwise the GCD of p − 1 and e cannot be 1 as required for valid RSA parameters.
Recall that in our setting the algorithm is always executed with an odd b = 65 537, which is much smaller than a. Thus, in the beginning, the algorithm will only execute the third (and the first) branch, reducing the value of $a_i$, while $b_i$ remains an unchanged odd value. This is true until $a_i$ and $b_i$ are swapped for the first time, which is the case if $a_i < b_i$. Since each iteration reduces $a_i$ by one bit (in general) due to the right shift operation, the first swap will occur after approximately $\log_2(p - 1) - \log_2(e)$ iterations. Until then, every time we observe a single access to code page 4, we can be sure that branch 3 has been executed.
(4) The fourth branch will only be executed if the greatest common divisor of the parameters a and b is a multiple of 2. Since the parameter a = p − 1 is even and b = e is odd, this branch will never be executed (indicated as a red branch), as otherwise we would have invalid RSA parameters.
These observations, combined with the fact that the public exponent e is known, allow us to "revert" the computations for all bits of a = p − 1, except about $\log_2(e)$ bits. As mentioned before, the public exponent is fixed to e = 65 537. This means that about $\log_2(65\,537) \approx 16$ bits of a = p − 1 cannot be recovered based on the accessed code pages. However, they can be easily determined based on the relations established from these observations.
As mentioned, the functions BN_sub(...) and BN_rshift1(...) reside on different pages within the memory. In our tested implementation, they are even 20 pages apart. Thus, it is very unlikely that a different compiler setting would link them to the same page, which would make them indistinguishable to a page-level attacker monitoring these functions only. Even if this happened, one could easily distinguish them by monitoring the sub-functions called by BN_sub(...) only, i.e., BN_wexpand(...), BN_ucmp(...), BN_usub(...), etc.

Exploiting the Information Leak
We denote the sequence of page accesses observed by an attacker as $P = (p_0, \ldots, p_n)$. Without loss of generality, let us assume the same mapping from functions to code pages as in the previous example. For instance, the function BN_sub(...) resides on page 1 (0x00C4), BN_gcd(...) resides on page 2 (0x00CA), and the function BN_rshift1(...) resides on page 4 (0x00D8). That is, the sequence of page accesses consists of pages $p_i \in \{P_1, P_2, P_4\}$ since we are only interested in these page accesses.
In order to recover the prime factor p (or p − 1 respectively), we observe a sequence of page accesses up to the point where the two variables are swapped for the first time. All later page accesses are discarded. We denote this number of iterations as m. Given the modulus N or its bit size $\log_2(N)$, we denote the bit size of p and q as $K = \log_2(N)/2$. Thus, m is upper-bounded by $\lceil K - \log_2(e) \rceil$. Similar to before, we build a system of linear equations based on $a_i$, starting with the unknown input $a = a_0$. Since i < m, b will remain unchanged and we only need to distinguish two branches:
Access to page 1 and page 4: $a_{i+1} = (a_i - b)/2$
Access to page 4 only: $a_{i+1} = a_i/2$
Accesses to page 2 allow us to distinguish iterations. After m iterations, we express these equations by recursive substitution as a linear equation $a = f(a_m, b)$, or, more precisely, $a = a_m \cdot c_a + b \cdot c_b$ with known constants $c_a$ and $c_b$, which result from the substitution.
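The constants $c_a$ and $c_b$ can be made explicit by unrolling the two update rules. The following derivation is a sketch; the indicator $s_i$ (1 if page 1 was accessed in iteration i, i.e., a subtraction took place, and 0 otherwise) is notation introduced here for illustration only.

```latex
\begin{align*}
a_{i+1} &= \frac{a_i - s_i\, b}{2}, \qquad s_i \in \{0, 1\}
  \quad\Longrightarrow\quad a_i = 2\, a_{i+1} + s_i\, b, \\
a = a_0 &= 2^m\, a_m + b \sum_{i=0}^{m-1} s_i\, 2^i, \\
c_a &= 2^m, \qquad c_b = \sum_{i=0}^{m-1} s_i\, 2^i .
\end{align*}
```

For instance, a trace of m = 9 iterations with subtractions in iterations i = 1, 3, 5, 7 yields $c_a = 512$ and $c_b = 2 + 8 + 32 + 128 = 170$.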
Both a and $a_m$ are unknown. However, we additionally know that swapping occurred after m iterations, i.e., $a_m < b$. Hence, we can determine the correct a by iterating over values $a_m \in [1, e)$ and evaluating the above equation. We use the resulting values a to check the GCD of (a + 1) and N. In case the GCD is greater than 1, we recovered a as well as the corresponding prime factor p = a + 1. We can then factor the modulus N by computing q = N/p.
As mentioned before, the iteration counter m is upper-bounded by $\lceil K - \log_2(e) \rceil$ with K being the bit size of the prime numbers. This is because each iteration reduces $a_i$ by at least one bit due to the right shift operation. For example, a 4 096-bit RSA key will have prime numbers of length K = 2 048 bits, yielding m = 2 032 iterations to consider. However, a prime number which is closer to $2^{K-1}$ than to $2^K$, combined with the subtraction in branch 1, could reduce $a_i$ by one additional bit. This would make swapping occur one iteration earlier. We would erroneously consider an incorrect equation due to swapping, and determining the correct a might fail. In this case, we simply omit the last erroneous equation $a_m$ from the recursive substitution and try to determine a again by iterating over values $a_{m-1} \in [1, e)$. As we will see in Section 6, this happens in approximately 25% of all runs, meaning that about 75% of the generated RSA keys can be recovered in the first run.
In case p − 1 is not coprime to e (which is the reason why the binary GCD algorithm is executed), the RSA key generation will discard this prime factor candidate p and generate another prime factor candidate p. Nevertheless, by observing the page fault pattern, an attacker is also able to detect this (extremely rare) case, and we run the same attack on the newly generated p.
Example. For an illustrative example, let us assume the following hypothetical parameters. Let the public exponent be $e = 17 = \mathtt{0x11}_{16}$ and let the two 14-bit primes be $p = 11083 = \mathtt{0x2B4B}_{16}$ and $q = 9941 = \mathtt{0x26D5}_{16}$, respectively, and N = pq. In the course of validating the selected parameters, the OpenSSL implementation calls the binary GCD function with a = 11082 and b = 17. Table 1 illustrates the executed operations for the given input parameters a and b. In the first loop iteration, a is even and b is odd, which means that the function BN_rshift1(...) will be called. In the second loop iteration, a is odd and b is odd, which means that BN_sub(...) followed by BN_rshift1(...) will be executed, and so on. Finally, the algorithm returns 1 as the GCD of a = 11082 and b = 17.
Based on a controlled-channel attack, we are able to observe accesses to pages $P_1$, $P_2$, and $P_4$, and to precisely recover the executed operations up to the point where a and b are swapped. We recursively substitute the recovered operations on $a_i$, which leads to the equations shown in the last column of Table 1. Recall that the first swap will happen at the latest after $m = \lceil 14 - \log_2(17) \rceil = 10$ iterations. In our example, swapping is done already in iteration 9 due to a smaller p and additional subtractions. This leads to the erroneously recovered operation marked bold (and colored red) in Table 1. To recover the secret a, we start with the m-th substituted equation $a_{10}$, not knowing that it is erroneous. If the attempt to recover a based on $a_{10}$ fails, we would need to fall back to equation $a_9$. However, in this particular case the error cancels out and we already succeed with $a_{10}$. Recall that $a_{10} = \frac{a}{1024} - \frac{85b}{512}$. With b = 17, we can rearrange it to
$a = 1024 \cdot a_{10} + 2890$ (1)
The unknown variable $a_{10}$ is bound by the parameter b. Since a and b have been swapped, $a_{10}$ must be smaller than b. We try to solve this equation by iterating over $a_{10} \in [1, b)$ and checking the GCD of a + 1 and N. If the GCD is greater than 1, we are able to factor N. Indeed, for $a_{10} = 8$ the equation yields a = 11 082 and gcd(a + 1, N) > 1. Thus, we recovered the first prime p = 11 083, which allows us to factor N (q = N/p = 9941) and to recover the secret exponent $d \equiv e^{-1} \bmod (p - 1)(q - 1)$. To see why recovery on the erroneous equation $a_{10}$ works in this case, we compare it to the valid equation $a_9 = \frac{a}{512} - \frac{85b}{256}$, which can be rewritten as
$a = 512 \cdot a_9 + 2890$ (2)
Here, recovering a succeeds for $a_9 = 16$. Observe that in equations (1) and (2) the first constants are only off by a factor of 2 because the erroneous operation does not introduce a subtraction but only a right shift. Hence, we hit the correct guess with $a_{10} = a_9/2 = 8$.
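The recovery procedure and the worked example above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the proof-of-concept code: the function name `recover_prime` and the encoding of each loop iteration as "S" (access to pages 1 and 4, i.e., subtraction plus shift) or "R" (access to page 4 only, i.e., shift) are hypothetical, and the translation from the raw page-fault sequence into these symbols is assumed to have been done already.

```python
from math import gcd

def recover_prime(ops, e, n):
    """Recover a prime factor of n from a page-level trace of the binary GCD.

    ops holds one symbol per observed iteration, cut at the upper bound m:
      "S" = pages 1 and 4 accessed  -> a_{i+1} = (a_i - e) / 2
      "R" = only page 4 accessed    -> a_{i+1} = a_i / 2
    """
    # First try all m equations; if that fails, drop the last (possibly
    # erroneous) equation and try again with m - 1 equations.
    for m in (len(ops), len(ops) - 1):
        c_a = 1 << m
        c_b = sum(1 << i for i, op in enumerate(ops[:m]) if op == "S")
        for a_m in range(1, e):          # a_m < b = e because a swap occurred
            a = c_a * a_m + c_b * e      # candidate for p - 1
            g = gcd(a + 1, n)
            if 1 < g < n:
                return g
    return None

# Toy example from the text: e = 17, p = 11083, q = 9941, m = 10 iterations.
n = 11083 * 9941
observed = list("RSRSRSRSRR")            # the last "R" is the erroneous entry
p = recover_prime(observed, 17, n)
assert p == 11083 and n // p == 9941
```

In this toy run, the search already succeeds on the full trace with $a_{10} = 8$, so the fallback to m − 1 equations is not needed.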

Generalization
The proposed attack on RSA key generation is not limited to code pages. One could also monitor accesses to data pages, especially if the buffers a and b are located on different heap pages. Even if a and b are located on the same heap page, attacks might still be possible by means of carefully crafted user input that also gets copied onto the heap and thus shifts the targeted buffers a and b onto different heap pages. We did not investigate such generalized attacks further, since our attack already recovers the full key by monitoring page faults up to the point where a and b are swapped.

THREAT MODEL AND ATTACK SCENARIO
In order to exploit the identified vulnerability, we consider an enclave that dynamically generates RSA keys, which are intended to never leave the enclave. Dynamic key generation already has broad applications in other trusted execution environments, such as trusted platform modules and smart cards. In line with SGX's threat model, the operating system (OS) is considered untrusted and compromised, trying to extract secret keys from the enclave. Although, in general, attackers in SGX settings are considered able to trigger enclave operations arbitrarily often by repeatedly invoking the enclave with a fresh state (SGX does not protect against rolling back to a fresh state; this would require external persistent storage [45]), our attacker is naturally limited to at most one observation of the enclave's key generation, as the next invocation will generate a different, independent key.
Using a noiseless controlled-channel attack [43,50,51], the attacker can observe page access patterns of the executing enclave.
While this is sufficient for the attack presented in this paper, we note that, without loss of generality, an attacker could also resort to different techniques. Among them are side channels using branch shadowing [32] or single-step approaches based on the APIC timer interrupts [12,27] or even attacks with fewer or no page faults [13,49], given that enough information can be extracted from a single execution.
Attack Scenarios. Dynamic key generation is a fundamental operation for most SGX applications. For example, scenarios like audio and video streaming with SGX [28] fall into our threat model. Here, a streaming enclave dynamically generates an RSA key pair and registers the public key with its streaming counterpart. The latter delivers all streaming content encrypted under this key, allowing the enclave to securely decrypt it and display it to the user, all in the sphere of a possibly compromised OS. Another example is a document signing enclave that generates its own signature keys inside the enclave and issues a certificate signing request to an external certification authority. Thereby, the enclave protects the signing key against malware. In any case, compromise of the private key could lead to signature forgery, espionage, or video piracy with all its legal and financial consequences.

ATTACK EVALUATION
We evaluate the presented attack on an Intel Core i7-6700K 4.00 GHz platform running Ubuntu 17.10 (Linux kernel 4.13.0-37). In order to do so, we developed an SGX application that generates an RSA key based on the latest version of Intel SGX SSL. We used the Linux Intel SGX software stack v1.9, consisting of the Intel SGX driver, the Intel SGX software development kit (SDK), and the Intel SGX platform software (PSW). For controlling the page mapping, we used the SGX-Step kernel module as well as the corresponding SGX-Step library functions (cf. [12]). Note that we do not use the single-stepping feature of SGX-Step but rather its page mapping capability. Since the Intel SGX threat model assumes an untrusted OS, the use of SGX-Step is in line with the threat model. We describe the implementation details below.

Implementation Details
We consider a victim enclave using the Intel SGX SSL library to generate an RSA key pair. The enclave is hosted by a malicious attack application that interacts with the OS to manipulate page mappings and to record page accesses within the corresponding fault handler. Figure 2 depicts the principle of the attack. After this recording step, the collected trace of page accesses is evaluated to recover the secret key.

SGX Enclave Application (Victim Enclave).
We developed an enclave program that generates a single RSA key using the Intel SGX SSL library and outputs the public parts only, i.e., the modulus N. To this end, we implemented an ECALL function for invoking key generation and an OCALL function which prints the modulus of the generated key to the standard output. Recall that the public exponent is fixed to e = 65537. The project is built in pre-release hardware mode, i.e., it uses the same compiler optimizations as a production enclave in release mode and yields the same memory layout. Without loss of generality, the enclave program does not perform any other tasks apart from generating the RSA key.
Attack Application. Based on the SGX-Step framework [12], we developed an attack application that enables and disables executable regions (pages) of the enclave program. To do so, it toggles the NX bit of the page table entries belonging to the code pages to be traced. Without loss of generality, one could also use the present bit or a reserved bit [50,51] for the same purpose. The application registers a fault handler (via a sigaction standard library function call) which is executed whenever the enclave encounters a segmentation fault (due to a non-executable page). This fault handler conveniently serves as the basis to monitor page faults, which later allow us to recover the executed code paths.

Mounting the Attack
In order to determine the pages of interest, i.e., the ones where the BN_gcd(...), BN_sub(...), and BN_rshift1(...) functions are located, we dissect the enclave binary by means of objdump. In our case, objdump reveals the following page frame numbers: 0x00CA for BN_gcd(...), 0x00C4 for BN_sub(...), and 0x00D8 for BN_rshift1(...). When starting the victim enclave, the attack application disables the execution of the BN_gcd(...) page by setting the non-executable (NX) bit in the corresponding page table entry. This causes the enclave to trap as soon as it attempts to execute this page.
When the fault handler function is executed for the first time, i.e., when a page fault (segmentation fault) occurs, we start recording subsequent page faults. On the one hand, we enable execution of the current page which caused the page fault by clearing its NX bit in order to allow the enclave to continue. On the other hand, we also disable the other pages of interest by setting their NX bits. Whenever the page fault handler is triggered, we record the accessed page and toggle the non-executable bits accordingly. Thus, we are able to precisely monitor each access to these pages.
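The bookkeeping performed by the fault handler can be modeled in a few lines. The following Python sketch is a pure simulation and does not touch real page tables: it takes a hypothetical sequence of executed code pages and returns the faults that would be recorded under the toggling policy described above. The page frame numbers are the ones reported by objdump in our setup; the function name `record_faults` and the example execution sequence are assumptions made for illustration.

```python
GCD_PAGE, SUB_PAGE, RSHIFT1_PAGE = 0x00CA, 0x00C4, 0x00D8
PAGES_OF_INTEREST = {GCD_PAGE, SUB_PAGE, RSHIFT1_PAGE}


def record_faults(executed_pages):
    """Model of the fault handler's NX-bit bookkeeping.

    executed_pages: code pages the enclave executes, in order.
    Returns the sequence of pages on which a fault is recorded.
    """
    non_executable = {GCD_PAGE}       # initially only BN_gcd's page is revoked
    faults = []
    for page in executed_pages:
        if page in non_executable:
            faults.append(page)
            # Re-enable the faulting page, revoke the other pages of interest.
            non_executable = PAGES_OF_INTEREST - {page}
    return faults


# Hypothetical execution: BN_sub runs before BN_gcd is entered (no fault,
# since its page is not yet revoked), followed by one "a even" iteration
# and one "a odd, b odd" iteration of the GCD loop.
executed = [SUB_PAGE, GCD_PAGE, GCD_PAGE, RSHIFT1_PAGE, GCD_PAGE,
            SUB_PAGE, GCD_PAGE, RSHIFT1_PAGE, GCD_PAGE]
faults = record_faults(executed)
```

Because exactly one page of interest is executable at any time after the first fault, every transition between the three functions shows up in the recorded sequence, while repeated execution within the same page stays silent.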
Our practical evaluation confirmed that we observe the following page fault patterns. Executing branch 1 leads to consecutive page faults on 0x00C4 (BN_sub(...)) and 0x00D8 (BN_rshift1(...)), interleaved with page faults on 0x00CA (BN_gcd(...)), whereas executing branch 3 leads to a page fault on 0x00D8 (BN_rshift1(...)) only. Once the attack application has finished gathering the page faults, we process the page-fault sequence from left to right and build up an equation system according to the rules established in Section 4.3. That is, whenever we observe consecutive page accesses to page 0x00C4 and page 0x00D8, we add $a_{i+1} = (a_i - b)/2$, while for a single access to page 0x00D8 we add $a_{i+1} = a_i/2$. Based on these equations we run a SageMath script in order to recursively substitute the equations, recover the remaining bits by solving the equation for $a_m$, and finally recover the RSA private key.
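The translation from a page-fault sequence to a single substituted equation can be sketched as follows. This is an illustrative Python stand-in for the SageMath processing, not the script itself; the names `faults_to_ops` and `substitute`, the labels "D" and "S", and the exact shape of the synthetic trace built at the end are assumptions.

```python
GCD_PAGE, SUB_PAGE, RSHIFT1_PAGE = 0x00CA, 0x00C4, 0x00D8


def faults_to_ops(faults):
    """Translate a page-fault sequence into per-iteration operations.

    "D": BN_sub followed by BN_rshift1, i.e. a_{i+1} = (a_i - b) / 2
    "S": BN_rshift1 only,               i.e. a_{i+1} = a_i / 2
    """
    calls = [p for p in faults if p != GCD_PAGE]   # drop BN_gcd re-entries
    ops = []
    i = 0
    while i < len(calls):
        if calls[i:i + 2] == [SUB_PAGE, RSHIFT1_PAGE]:
            ops.append("D")
            i += 2
        elif calls[i] == RSHIFT1_PAGE:
            ops.append("S")
            i += 1
        else:
            raise ValueError("unexpected page-fault pattern at index %d" % i)
    return ops


def substitute(ops, b):
    """Fold all equations into a = multiplier * a_m + constant, m = len(ops)."""
    multiplier, constant = 1, 0
    for op in ops:
        if op == "D":                 # inverted rule: a_i = 2 * a_{i+1} + b
            constant += multiplier * b
        multiplier *= 2               # inverted rule: a_i = 2 * a_{i+1}
    return multiplier, constant


# Synthetic fault trace for the toy example (ten observed iterations).
faults = []
for op in "SDSDSDSDSS":
    faults.append(GCD_PAGE)
    if op == "D":
        faults += [SUB_PAGE, GCD_PAGE]
    faults.append(RSHIFT1_PAGE)
faults.append(GCD_PAGE)

multiplier, constant = substitute(faults_to_ops(faults), 17)
```

For the toy parameters this reproduces the constants of the substituted equation, a = 1024 · a_10 + 2890.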
The execution time of the whole attack including the gathering of the page-fault trace as well as the parsing of the gathered trace is negligible, even when attacking larger RSA keys. Causing page faults on the above mentioned pages slightly increases runtime and gathering the page-fault traces terminates immediately. Compared to normal key generation, running the attack causes moderate overall slowdowns of 65 ms (15.5%) for 4096-bit keys and 248 ms (5.87%) for 8192-bit keys due to the intentionally induced page faults. The biggest share of the execution time is consumed by the generation of the two random primes, i.e., the random number generation and the primality test, during RSA key generation.

Key Recovery Complexity
We developed a simple SageMath script that iterates over all possible values $1 \le a_m < 65537$, evaluates $a = f(a_m)$, and checks the GCD of a + 1 and N. If it is not equal to 1, p can be recovered. Figure 3 illustrates the complexity of recovering the remaining bits. The complexity has been averaged over 100 runs per modulus size, and the computations were performed with SageMath on an Intel Xeon E5-2660 v3 (2.60 GHz). The area plot (right x-axis) indicates that in about 75%-80% of all cases, the prime factors can be recovered at the first attempt, considering $m = \lceil K - \log_2(e) \rceil$ equations. In only about 20%-25% of all cases the first attempt fails due to an early swap in the binary GCD algorithm. In this case, we need to remove the last equation $a_m$ and restart the search in the range $1 \le a_{m-1} < 65537$. The asymptotic complexity of the key recovery is O(1), i.e., the number of iterations is bounded by the public exponent e, which is a constant value. In contrast, the computation time of the GCD for candidates a increases with the bit size of the modulus N. In 75% of all cases, an 8192-bit modulus can be factored in less than 5 seconds on average, after gathering the measurement trace. In only 25% of all cases, we need approximately 12 seconds on average. Although 15360-bit RSA keys (providing 256-bit security according to NIST [36]) are currently not being used in practice, we provide the results here for the sake of completeness.
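The two-attempt search, including the fallback to m − 1 equations, can be sketched in Python as follows. This is an illustrative stand-in rather than the SageMath script used for the measurements; the function name `recover_prime` and the encoding of observed operations as "D" (subtraction plus shift) and "S" (shift only) are assumptions.

```python
from math import ceil, gcd, log2


def recover_prime(ops, e, N, K):
    """Recover a prime factor p of N from the observed operations.

    ops: observed operations ("D" or "S"), covering at least m iterations
    e:   public exponent, N: public modulus, K: bit length of the prime p
    Tries m = ceil(K - log2(e)) equations first, then m - 1.
    Returns (p, equations_used) or None.
    """
    m = ceil(K - log2(e))
    for used in (m, m - 1):
        # Substitute: a = multiplier * a_used + constant
        multiplier, constant = 1, 0
        for op in ops[:used]:
            if op == "D":
                constant += multiplier * e
            multiplier *= 2
        # After the swap the remaining unknown is smaller than e.
        for x in range(1, e):
            p = gcd(multiplier * x + constant + 1, N)
            if 1 < p < N:
                return p, used
    return None


# Toy example from Section 4: e = 17, 14-bit primes, ten observed operations.
N = 11083 * 9941
result = recover_prime(list("SDSDSDSDSS"), 17, N, 14)
```

The outer loop runs at most twice and the inner loop at most e − 1 times, which reflects the constant bound on the number of candidates; the cost of each gcd call is what grows with the modulus size.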

COUNTERMEASURES
Architectural Countermeasures. In order to mitigate controlled-channel attacks, various architectural countermeasures have been proposed. Shinde et al. [43] introduced the notion of page-fault obliviousness, which means that the OS is still able to observe page faults, but the observable page-fault pattern is independent of the input and the executed code paths. They proposed a software-based approach incurring a significant performance overhead. This can be reduced by additional hardware support which guarantees to deliver page faults directly into the enclave [42]. Another proposal denoted as SGX-LAPD [21] considers large pages (i.e., 2 MB instead of the usual 4 KB) in order to reduce the overall number of page faults. The enclave relies on the EXINFO data structure, which tracks page fault addresses of an enclave, to verify that the OS indeed provides large pages. Their solution is based on a dedicated compiler as well as a linker in order to generate the corresponding code for large-page verification inside enclaves. Strackx et al. [46] propose hardware modifications that allow preloading all critical page mappings in the translation lookaside buffer (TLB) whenever entering the enclave. Moreover, they protect the TLB mapping from being tampered with during enclave execution.
Detect Frequent Page Faults. Shih et al. [41] observed that transactional synchronization extensions (TSX) can be used to detect exceptions such as page faults and report them to enclave-internal code only, rather than to the OS. They proposed T-SGX, in which they execute blocks of enclave code inside TSX transactions. If an exception is thrown, the transaction aborts and the enclave decides whether or not to terminate its execution. Chen et al. [14] proposed an alternative approach to detect side-channel attacks within enclaves, i.e., detecting frequent page faults and aborting the execution. In order to do so, they rely on the execution time within the enclave as an indicator of an ongoing side-channel attack. Since timers are also accessed through the untrusted OS, they implement a reference clock inside the enclave. The reference clock itself (a timer variable) is protected by means of TSX.
Detecting page faults does not prevent stealthier attacks that do not rely on page faults [13,49]. These attacks derive page access patterns either by monitoring the accessed and dirty bits of page table entries or by mounting cache attacks such as Flush+Reload on page table entries.
Randomization. Seo et al. [40] propose SGX-Shield which randomizes the memory layout of enclaves in a multi-stage loading step. While primarily intended as a countermeasure against runtime attacks, it also raises the bar for controlled-channel attacks.
Prevent Input-Dependent Code Paths. The most straightforward approach to prevent the attack described in this work is to fix the RSA key generation procedure at the implementation level. We propose an appropriate patch in the following subsection.

Patching OpenSSL
Listing 2 shows our proposed patch for OpenSSL. Instead of relying on BN_gcd(...) to ensure that p − 1 and e are coprime, i.e., that the GCD of p − 1 and e is one, we compute the modular inverse of p − 1 modulo e using a side-channel protected modular inversion algorithm (BN_mod_inverse(...)). The inverse only exists if gcd(p − 1, e) = 1. Hence, if BN_mod_inverse(...) signals (through an error) that the inverse does not exist, we know that gcd(p − 1, e) ≠ 1. In order to ensure that the side-channel protected implementation of the inversion algorithm is called, we need to set the BN_FLG_CONSTTIME flag on the public exponent e. This ensures that BN_mod_inverse(...) internally calls the protected function BN_mod_inverse_no_branch(...), which does not contain branches that leak sensitive information.
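The logic behind the patch (an inverse exists exactly when the two numbers are coprime) can be illustrated independently of OpenSSL. The following Python sketch mirrors only that logic; it is not the patch itself, the function name `is_coprime_via_inverse` is ours, and Python's built-in `pow` offers no constant-time guarantees.

```python
def is_coprime_via_inverse(a, e):
    """Return True iff gcd(a, e) == 1, decided by attempting an inversion.

    Illustrates the idea of replacing a GCD computation with a modular
    inversion; this Python version is NOT side-channel protected.
    """
    try:
        pow(a, -1, e)                 # modular inverse, needs Python >= 3.8
        return True
    except ValueError:                # raised when a is not invertible mod e
        return False


# p - 1 = 11082 from the toy example is coprime to e = 17 ...
ok = is_coprime_via_inverse(11082, 17)
# ... whereas a multiple of e is rejected.
rejected = not is_coprime_via_inverse(34, 17)
```

As in the patch, the GCD itself is never computed; the only information extracted is whether it equals 1.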
Performance Impact. An appealing benefit of our proposed patch is that it is even faster than the vulnerable implementation (note that we do not need to compute the GCD but only check whether or not it is 1). We benchmarked 10000 coprimality checks for a random number a and e = 65537, and provide the corresponding cumulative execution times in Table 2. As can be seen in the table, our patch is one to two orders of magnitude faster than the original implementation on our test machine. On an Intel Core i7-5600U 2.6 GHz CPU (notebook), the speedup even exceeds a factor of 500 for 8192-bit numbers. The reason for this massive speedup is that inversion, as implemented in OpenSSL, uses the original Euclidean algorithm with gcd(a, b) = gcd(b, a mod b). This algorithm requires far fewer loop iterations (e.g., between 5 and 13 iterations for 8192-bit numbers) than the binary GCD (≈ 8192 iterations). The original Euclidean algorithm relies on a costly modular reduction in each iteration, which was the initial motivation to use the binary GCD instead, as it avoids these costly modular reductions. Yet, the original Euclidean algorithm is in fact significantly faster because OpenSSL leverages the x86 div instruction to perform the expensive modular reductions directly in hardware. Nevertheless, the check whether gcd(p − 1, e) ≠ 1 handles a corner case in RSA key generation, which is highly unlikely to occur in practice. Hence, the corresponding check is in general only executed once per generated prime factor and, thus, two times during the RSA key generation.
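The difference in loop counts can be made tangible with a small experiment. The Python sketch below counts iterations only and says nothing about wall-clock time or about OpenSSL's actual implementations; both functions are simplified models written for this illustration, and the fixed random seed is an arbitrary choice.

```python
import random


def euclid_iterations(a, b):
    """Classic Euclid, gcd(a, b) = gcd(b, a mod b). Returns (gcd, iterations)."""
    n = 0
    while b:
        a, b = b, a % b
        n += 1
    return a, n


def binary_gcd_iterations(a, b):
    """Binary GCD for a >= b > 0 with b odd. Returns (gcd, iterations)."""
    n = 0
    while b:
        if a % 2 and b % 2:           # both odd: subtract, then shift
            a = (a - b) // 2
        elif a % 2:                   # a odd, b even: shift b
            b //= 2
        else:                         # a even, b odd: shift a
            a //= 2
        if a < b:
            a, b = b, a
        n += 1
    return a, n


random.seed(1)                                   # reproducible illustration
a = random.getrandbits(8192) | (1 << 8191)       # random 8192-bit number
e = 65537
g_euclid, n_euclid = euclid_iterations(a, e)
g_binary, n_binary = binary_gcd_iterations(a, e)
```

The first Euclidean step reduces the 8192-bit operand modulo e in a single iteration, after which only small numbers remain, whereas the binary variant has to shift the large operand down roughly one bit per iteration.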

FURTHER VULNERABILITIES
RSA X9.31. Further investigation of the OpenSSL source code revealed that the prime derivation function based on the ANSI X9.31 standard [29] (BN_X931_derive_prime_ex(...)) is also vulnerable to the presented attack. As in the default RSA key generation procedure implemented in rsa_gen.c, the generated primes p and q are verified, i.e., it is checked that p − 1 and q − 1 are coprime to the public exponent e. Hence, the exact same attack technique also applies to the X9.31 implementation. Irrespective of whether or not this implementation is actually used (ANSI X9.31 has already been withdrawn [6]), we suggest patching this implementation. The patch presented in Section 7 also applies here.
Furthermore, there are two additional usages of the vulnerable BN_gcd(...) function, namely in RSA_X931_derive_ex(...) and RSA_check_key_ex(...). In these cases, the GCD is not used as a mere security check but to factor out the GCD of the product (p − 1)(q − 1). Since the calculated GCD is never 1, our patch using the inversion algorithm cannot be applied here. Instead, we suggest adding a constant-time implementation of the GCD algorithm, which is resistant against software side-channel attacks. Ideally, this implementation is even faster than the binary GCD implementation (cf. the performance analysis of our proposed patch in Section 7).
RSA Blinding. While our attack highlights a critical vulnerability in RSA key generation, other algorithms also need careful evaluation with respect to single-trace attacks. For example, we found a vulnerability in the generation of RSA blinding values used to thwart side-channel attacks on sensitive RSA exponentiation. The vulnerability causes preparation of the blinding value to fall back to an exponentiation implementation vulnerable to side-channel attacks. Similar to the attack presented in this paper, a controlled-channel attacker could attempt to recover the blinding value from a single trace and subsequently peel off the side-channel protection offered by blinding. The OpenSSL team fixed this issue in response to our findings by using the side-channel protected exponentiation algorithm appropriately.

Responsible Disclosure
We responsibly notified Intel as well as OpenSSL about our findings and provided a patch to fix the RSA key generation, as shown in Listing 2. In response, OpenSSL patched the RSA key generation vulnerability in commit 8db7946e. Also, the RSA blinding vulnerability was fixed in commit e913d11f.

CONCLUSION
In this paper, we investigated the RSA key generation routine executed inside SGX enclaves with respect to microarchitectural side-channel attacks. Our investigations revealed a critical vulnerability inside Intel SGX SSL that allows an attacker to recover the generated RSA secret key from a single observation using a controlled-channel attack. More specifically, the observable page fault patterns during the RSA key generation allow an attacker to recover the prime factor p and, thus, to factor the modulus N. To the best of our knowledge, this represents the first microarchitectural attack targeting the RSA key generation process by means of a software-based attack.
Ironically, the vulnerability is due to an optimized binary GCD algorithm that should improve the performance compared to the original Euclidean algorithm but in fact is significantly slower on Intel x86 platforms. Nevertheless, our work demonstrates that software-based microarchitectural attacks on shielded execution environments such as Intel SGX represent a severe threat to key generation routines and need further consideration.