Internet-Draft | ECDH-PSI | October 2024 |
Wang, et al. | Expires 24 April 2025 | [Page] |
This document describes Elliptic Curve Diffie-Hellman Private Set Intersection (ECDH-PSI). It instantiates the classical Meadows match-making protocol with standard elliptic curves and hash-to-curve methods. In ECDH-PSI, data items are encoded to points on an elliptic curve, and masked by the private keys of both parties. After collecting the mutually masked datasets from both parties, a participant computes their intersection and outputs the corresponding original data items as result.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 24 April 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Private Set Intersection (PSI) schemes enable the discovery of shared elements among different parties' datasets without revealing individual data. They are widely used when there's a need to identify overlapping data elements between two or more parties while preserving the confidentiality of each party's original data. PSI is one of the most frequently used privacy preserving techniques in business, e.g. it enables a user to detect whether his/her password is leaked without giving away the password to the server [MS21], or multiple companies to find their common customers without giving each other their raw data.¶
The classical Diffie-Hellman PSI [Meadows86] has been published for more than thirty years. The academic has also proposed numerous PSI schemes with new functionalities and secure guarantees, such as [KKRT16][CHLR18][RR22], but DH-PSI is still preferable due to its simplicity and communication efficiency. This document describes a widely deployed instance of DH-PSI denoted by Elliptic Curve Diffie-Hellman PSI (ECDH-PSI). Compared with the finite field parameters used in the original version, elliptic curves are commonly considered to be more efficient and secure [IMC][LOGJAM], and are extensively adopted by recent standards including [RFC8031][RFC8446][RFC9497].¶
In document describes 2-party ECDH-PSI as a two stage protocol, including a handshake phase and a data exchange phase. In the handshake phase, the parties agree on the cipher suite and message flow to be used in data exchange phase. Then, during the data exchange phase, original data elements are masked and transmitted in batches, allowing one or both parties to output the result.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
In this document, data structures are defined and encoded according to the conventions laid out in Section 3 of [RFC8446].¶
This section gives brief introductions for elliptic curves and hash-to-curve methods.¶
An elliptic curve E is defined by a two-variable equation over a finite field F with prime characteristic p larger than three. Except for a special point called point at infinity, a point on E is a (x,y)-coordinate pair that satisfies the curve equation, where x and y are elements of F. All distinct points of curve E consists an algebraic group of order n, where the point at infinity is the identity element. Applications and standards generally use a prime order (sub)group <G> generated by a point G on E. The order of <G> is denoted by r.¶
This document uses term "addition" to refer to the group operation of a elliptic curve group, meaning two points P and Q can be added together to get a third point R.
Scalar multiplication refers to the operation of adding a point P with itself repeatedly.
This document uses scalar_mul
to denote the scalar multiplication process.
It takes a integer x<r and a point P as inputs and outputs the multiplication result Q, which is written by Q=scalar_mul(P,x)
.¶
Many cryptographic applications require the participants to generate elliptic curve key pairs.
Usually, a key pair on refers to a (PK,sk)
tuple, where sk
is a integer generated uniformly between 1 and r-1, and PK=scalar_mul(G,sk)
.
In the context of ECDH-PSI, only sk
, namely the private key, is used.
Thus, this document uses the notation sk=keygen()
to refer to the process of generating a private key on the elliptic curve and omits PK
.¶
A hash_to_curve function encodes data of arbitrary length to a point on a elliptic curve. The encoding process first hashes the byte string to one or more elements of fixed length, and then calculates a point on the curve with the elements using so-called map_to_curve and clear_cofactor functions.¶
[RFC9380] describes a series of hash_to_curve methods (which are also called "suites") for standard curves such as NIST P-256 and curve25519.
It specifies two encoding types denoted by "encode_to_curve" and "hash_to_curve".
Both encodings can be proven indifferentiable from random oracles (ROs)[MRH04], but have different output distributions.
This document only uses "hash_to_curve" encodings which output uniformly random points.
It also uses the notation P=hash_to_curve(data)
to refer to the process of encoding data
of arbitrary length to a point P
on the curve given in the context.¶
[RFC9380] requires all uses of the hash_to_curve suites must enforce domain separation with domain separation tags (DSTs).
In particular, DSTs are distinct strings that enable different applications to simulate distinct RO instances with the same hash function.
That is, each application concatenates the DST with its message as the input to hash function.
This document allocates each suit with a unique DST.
For simplicity, DST strings are omitted when writing the process of calculating the hash value of a string.
The notation tag=hash(str)
means that the hash function actually takes "dst||str
" as its input rather than str
alone, where dst
is the DST string defined for the corresponding hash_to_curve suite.
Similarly, this document also omits DST strings when describing hash_to_curve
functions, assuming that the corresponding DST has been taken as a part of the input.¶
A suite defined by [RFC9380] includes a series of parameters such as the curve, hash function, and encoding type. In applications, these suites are very efficient to be referred to, as the included parameters can also be used beyond the scenario of mapping data to curves. For example, the participants can negotiate a hash_to_curve suite, and then use the curve deduced from the suite for other cryptographic functions such as encryption and key establishment.¶
To extend the usage of hash_to_curve suites, [RFC9380] also describes methods and naming conventions for defining new suites, which enables developers to define suites with new curves and hash functions.¶
We begin with a very brief description on how ECDH-PSI works. Firstly, the participants, A and B, agree on a EC group <G> along with the hash_to_curve suites, and generate private keys over <G>. Then, both A and B convert their data to points over <G> and multiply the points with both parties' private keys. Next, they exchange the sets of masked elements as sets of EC points, and compute the intersection over these sets. Finally, they output the original data items corresponding to the intersection of masked sets.¶
Assume that the parties have agreed on the suite, and generated private keys of sk_A
and sk_B
,
an simplified protocol flow of ECDH-PSI is shown by Figure 1, and explained step-by-step as follows:¶
hash_to_curve
,
and masks the points locally with its own private key by scalar_mul
.¶
In order to clarify the statement, this document uses "the first round" to refer to the data exchange process of Step 2, and "the second round" for the process of Step 4.¶
The classical description of ECDH-PSI enables both parties to output PSI results and thus requires two complete rounds (i.e. Step 2 and Step 4) to exchange masked data. However, many application scenarios require that should only one participant output the intersection and the other one get nothing. In such scenarios, Step 4 and Step 5 are executed unidirectionally, where only one party sends doubly-masked data in Step 4 and only the receiver of that message can output the result.¶
To output PSI result, the participants should also match the original data items with their corresponding EC points. In particular, the relationship can be established with unique indexes, where the original data item and the masked EC point(s) are linked to the same index.¶
The specification of ECDH-PSI consists of two phases as shown in Figure 2, including a handshake phase which negotiates the protocol parameters and a data exchange phase that performs the operations described by Figure 1.¶
HandshakeRequest
message to initiate the ECDH-PSI protocol flow.
This message carries lists of parameters supported by the requester and information about the requester's data.
Then, the other participant, denoted by responder, selects parameters from the requester's lists, and sends the parameters with its data information in a HandshakeResponse
message.¶
EcdhPsiBatch
messages are exchanged.
The receiver of an EcdhPsiBatch
replies with a EcdhPsiBatchResponse
message,
which tells sender the handling status of last EcdhPsiBatch
message.¶
This specification enables masked data to be transferred by many batches, as the entire data set may be extremely large and requires numerous resources to be handled with properly.
For example, the size of a data set may exceed the limit of a single package of the underlying network protocol.
The participants can divide the data set to multiple batches and use multiple EcdhPsiBatch
messages to send them so as to meet the limits of network, storage or computation resources.¶
Besides hash_to_curve suites and data formats, the handshake phase also determines the interaction pattern for data exchange phase. The pattern is decided by the following options:¶
EcdhPsiBatch
messages will be only sent from B to A.¶
EcdhPsiBatch
message.
Upon the determination of maximum batch size, the participants can deduce the number of batch messages sent or received in each round by dividing the size of entire data set with the maximum size.¶
EcdhPsiBatch
messages.
When both participants need to send batch messages in the same round, the requester will always send the first one.
Then, they can send EcdhPsiBatch
messages in turn, or let the requester send all batches in sequence before the responder sending its data.¶
The handshake phase includes a HandshakeRequest
and a HandshakeResponse
message.
The structures of these messages are documented as follows.¶
Structure of HandshakeRequest
:¶
struct { uint8 version = 1; ProtocolProposal protocol_param; DataProposal data_param; } HandshakeRequest;¶
version
: The version of ECDH-PSI protocol.¶
protocol_param
: ECDH-PSI protocol configurations supported by the requester.¶
data_param
: The requester's data information, including its data set size and proposals for data exchange pattern.¶
The ProtocolProposal
message describes ECDH-PSI protocol parameters supported by the requester.¶
Structure of this message:¶
enum { null (0), P256_XMD:SHA-256_SSWU_NU_ (1), P384_XMD:SHA-384_SSWU_NU_ (2), P521_XMD:SHA-512_SSWU_NU_ (3), curve25519_XMD:SHA-512_ELL2_NU_ (4), (255) } HashToCurveSuite enum { compressed (0), uncompressed (1), (255) } PointOctetFormat enum { no_truncation (0), 128_bit_truncation (1), (255) } TruncationOption struct { HashToCurveSuite suites<1..255>; PointOctetFormat point_octet_formats<1..255>; TruncationOption truncation_options<1..255>; } ProtocolProposal¶
suites
: The list of supported HashToCurveSuite
in order of the requester's preference.
.¶
Different suites use different DST strings as required by [RFC9380].
In particular, ECDH-PSI uses "ECDH-PSI-V<xx>-<suites>" as its DST, where <xx> is a two-digit number indicating the current protocol version as specified by the version
field of HandshakeRequest
, and <suites> is the ASCII string representation of the currently adopted suite as defined by [RFC9380].
As an example, when both participants agree to use the suite for P256 curve, the corresponding DST is "ECDH-PSI-V01-P256_XMD:SHA-256_SSWU_NU_".¶
point_octet_formats
: The list of octet string formats representing EC points, in the order of requester's preference.
This field enables the participants to weigh the tradeoffs between the costs of communication and computation as discussed by Section 4.1.¶
PointOctetFormat
values refer to the corresponding compressed/uncompressed formats documented by [ECDSA].¶
truncation_options
:
Whether the requester supports truncation for data transferred in the second round.
The options are listed in the requester's preference.
The truncation could decrease the costs of data transmission and storage, and even accelerate the computation process of intersection, at the expense of a (negligible) false-positive rate.¶
In this document, only 128-bit truncation is allowed, with a restriction that the total number of data items owned by both participants SHOULD be no more than 2^40, ensuring that the probability of collision is smaller than 2^-48. See Section 5.1 for a detailed discussion.
The requester MUST set this field by a single no_truncation
option when its data set size has already been equal or larger than 2^40.¶
The truncation process utilizes Key Derivation Functions (KDFs) as many key-exchange protocols that use shared secrets to derive shorter strings which can be seen as uniformly distributed cryptographic keys (e.g., [RFC8446][RFC9382]). In this document, the KDFs are instantiated by Hashed Message Authentication Code (HMAC) [RFC2104], along with HMAC-based KDF (HKDF)[RFC5869], and the hash function included in hash_to_curve suite.¶
Following [RFC5869], a KDF function kdf()
takes a input keying material ikm
, a salt value salt
, an application specific information info
and output length len
as its inputs.
Let mask_data
be the string representation of a doubly-masked element, a participant SHOULD truncate mask_data
with:¶
truncated_mask_data = kdf(mask_data, nil, "ECDH-PSI", 128)¶
In particular, the KDF takes mask_data
as the input keying material.
It does not take a salt value, uses ASCII string "ECDH-PSI" as the application specification information, and outputs a 128-bit string as the truncation result.¶
The data_param
field describes requester's data size and provides supported options on how to exchange EcdhPsiBatch
messages.¶
Structure of this message:¶
enum { continuous (0), interactive (1), (255) } BatchMode enum { both (0), requester (1), responder (2), (255) } OutputMode struct{ BatchMode supported_batch_modes<1..255>; OutputMode supported_output_modes<1..255>; uint64 max_batch_size; uint64 item_num; } DataProposal¶
supported_batch_modes
: The list of interaction patterns supported by the requester in the order of preference.
BatchMode
describes the message interaction pattern when both parties need to send their data sets with multiple EcdhPsiBatch
messages in the same round.¶
continuous
: A participant sends its entire data set with continuous EcdhPsiBatch
messages.¶
interactive
: The participants send EcdhPsiBatch
messages in turn.
If one party has sent all its batch messages before its partner, the partner SHOULD send its messages continuously.¶
When both parties need to send data batches in the same round, the requester SHOULD always send the first EcdhPsiBatch
message.¶
supported_output_modes
: The list of output modes supported by the requester in the order of its preference,
where OutputMode
describes which party has the privilege of obtaining the result.¶
The meaning of OutputMode
is explained as follows:¶
both
: Both parties output the intersection result, which means EcdhPsiBatch
are sent mutually in the second round.¶
requester
: Only the requester outputs PSI result, which means only the responder sends message batches in the second round.¶
responder
: Only the responder outputs PSI result, which means only the requester sends message batches in the second round.¶
max_batch_size
: The maximum byte size of a single EcdhPsiBatch
message allowed by the requester.
That is, the requester will not send a batch message beyond this limit, and it cannot receive a batch message having larger size.
The requester SHOULD decide the value of this field comprehensively according to its hardware/software configuration and environment.¶
item_num
: The total number of requester's data items.¶
The responder sends HandshakeResponse
to reply to a HandshakeRequest
message.
It includes selected parameters and the responder's data information.¶
Structure of this message:¶
enum { success(0), generic_error (0x01), unsupported_version (0x02) invalid_request (0x03), out_of_resource (0x04), unsupported_parameter (0x05), (0xFF) } HandshakeStatus struct { HandshakeStatus status; ProtocolResult protocol_param; DataResult data_param; } HandshakeResponse;¶
status
: Indicating the status of handshake. The meanings of error statuses are listed as follows:¶
generic_error
: The error cannot be categorized to the rest error statuses defined by HandshakeStatus
.¶
unsupported_version
: The responder does not support the ECDH-PSI protocol version given by version
.¶
invalid_request
: The responder cannot parse the request correctly.
The message may be in wrong format and cannot be parsed as a normal HandshakeRequest
message.¶
out_of_resource
: The responder does not have enough resource to handle the ECDH-PSI process following to the options provided by the requester.¶
unsupported_parameter
: The responder rejects all options proposed by one of the suggested option lists included in HandshakeRequest
.¶
The responder MUST ignore all options or parameters that it cannot recognize when parsing the HandshakeRequest
message.
If one of the suggested option lists is filled with unrecognized parameters, it SHOULD reply with a HandshakeResponse
carrying unsupported_parameter
.¶
Upon receiving a HandshakeResponse
message without success
status, the requester MUST ignore the other fields included in this message.¶
protocol_param
: The protocol parameter selected by the responder, including the hash_to_curve suite, EC point string format and truncation option.¶
data_param
: This message includes the data exchange pattern selected by the responder and the size of responder's dataset.¶
The structure of ProtocolResult
is defined as follows:¶
struct { HashToCurveSuite suite; PointOctetFormat point_octet_format; TruncationOption truncation_option; } ProtocolResult;¶
suite
: The hash_to_curve suite selected by the responder.¶
point_octet_format
: The format of EC point octet string chosen by the responder.¶
truncation_option
: The truncation option selected by the responder.
When deciding the field, the responder SHOULD consider the sum of both participants' dataset sizes.
If the sum is larger than 2^40, truncation MUST no be used, which means the responder MUST set this field by no_truncation
.¶
The structure of DataInfoResult
is defined as follows:¶
struct { BatchMode batch_mode; OutputMode output_mode; uint64 max_batch_size; uint64 item_num; } DataInfoResult;¶
batch_mode
: The BatchMode
selected by responder.¶
output_mode
: The OutputMode
selected by responder.¶
max_batch_size
: The maximum size of a EcdhPsiBatch
message, which SHOULD NOT be larger than the max_batch_size
given by the requester.¶
item_num
: The size of responder's dataset.¶
The data exchange phase consists of two rounds.
In each round, a participant may send or receive multiple EcdhPsiBatch
messages.
Upon receiving a EcdhPsiBatch
and handling the data properly, the participant SHOULD respond with a EcdhPsiBatchResponse
message,
indicating its partner that it is ready to receive the next EcdhPsiBatch
message.¶
The structure of EcdhPsiBatch
is defined as follows:¶
enum { transmit(0), retransmit(1), fatal_error(2), (255)} BatchStatus struct { uint64 index; opaque octet; } ECPointOctet; struct { BatchStatus status; uint32 batch_type; uint64 batch_index; uint64 batch_count; uint32 is_last_batch; ECPointOctet data<1..2^64-1>; } EcdhPsiBatch;¶
status
: Status of current batch message. Values of this field are explained as follows:¶
transmit
: It is a normal batch message which has not been transmitted before.¶
retransmit
: This message has been transmitted, but it is re-transmitted due to the receiver's request.
A batch message may be retransmitted several times before it is properly handled by the receiver.
For example, the receiver may need a relative long time to prepare the storage resource, and require the sender to re-send the batch until it has enough space.¶
fatal_error
: The sender of this batch message cannot continue the ECDH-PSI protocol procedure due to some fatal errors.
The receiver MUST ignore the other fields included in the current EcdhPsiBatch
message and terminate the PSI process.¶
Such fatal error may occur in the handshake phase or data exchange phase.
For example, if the responder sends a HandshakeRespond
message with parameters unsupported by the requester, the requester should construct and send a EcdhPsiBatch
message carrying fatal_error
status so as to terminate the process.¶
batch_type
: This field indicates the mask status of current batch.
That is, value "1" stands for that the data included in the current batch are only masked by the owner's private key, and "2" means the data has been doubly-masked with both participants' private keys.¶
batch_index
: The index of current EcdhPsiBatch
message, which is an incremental counter starting with "1".
The participant SHOULD use independent counters for different rounds and protocol runs.
For example, if participant A needs to send dataset p_A in the first round by 10 batches, and sends p_AB in the second round by 5 batches, it should use batch_index
from 1 to 10 for the first round, and 1 to 5 for the second.¶
batch_count
: Number of EC points included in the current EcdhPsiBatch
message.¶
is_last_batch
: Indicating whether this batch is the last one, where "1" means the batch is the last one from the sender in current round, and "0" for there exists subsequent batches.¶
data
:
This field carries multiple EC points with ECPointOctet
, which number is specified by the batch_count
field.
Each ECPointOctet
structure includes an unique index
allocated by the owner of the corresponding original data, and the encoded octet string.¶
The purpose of index
is to associate the original data item with the (doubly)-masked EC points, such that the participant can match its dataset with the intersection of masked datasets.
The values of index
are generated by the owner of data so as to identify each data item uniquely.
When masking the data from its partner, a participant MUST reserve the index
for every record.
Section 4.2 gives more discussions on the implementation and maintenance of such indexes.¶
The content of each octet string is determined by the selected hash_to_curve suite, octet format and truncation option. In the first round, the receiver SHOULD decode the octet strings with the configurations negotiated in the handshake phase in order to mask the data provided by the sender. However, in the second round where masking is not required, the receiver SHOULD treat the contents as octet strings rather than decode them to EC points, as such strings are enough for computing the intersection.¶
The structure of EcdhPsiBatchResponse
is defined as follows:¶
enum { success (0), retransmit (1), fatal_error (2), (255) } BatchResponseStatus; struct { BatchResponseStatus status; uint64 batch_index; } EcdhPsiBatchResponse;¶
status
: Indicating whether the EcdhPsiBatch
message identified by batch_index
has been handled properly.
Values of this field are explained as follows:¶
success
: The EcdhPsiBatch
message has been handled in a proper way.
Upon receiving a response with this status, the sender of EcdhPsiBatch
can send the subsequence batch
or prepare to receive the EcdhPsiBatch
message from its partner.¶
retransmit
: Upon receiving retransmit
, the participant MUST re-send the EcdhPsiBatch
message identified by batch_index
.¶
fatal_error
: The ECDH-PSI process meets some fatal error such that both the participants MUST terminate the current session, and discard all data received.¶
When deciding the point encoding format for the elliptic curves expect curve25519, the participants should consider the aspects of bandwidth, storage and computation comprehensively. Compressed point format could decrease the costs of transmission and storage, but it also needs more computation resources to "decompress" the point. The "decompress" process may be as costly as scalar multiplication [CZZDL24]. Uncompressed format requires more storage spaces and needs more time to transmit, but the participant can perform scalar multiplications without extra effort to recover the points.¶
For example, when both parties are deployed in the same data center and linked with a high-bandwidth local area network (LAN), they can choose to use the uncompressed format to achieve better performance.¶
As another example, the parties may be deployed in geographically separated data centers connected with low bandwidth wide area network (WAN), and equipped with high-end computation acceleration devices such as Graphics Processing Units (GPUs). In this case, the computation resource may not be the bottleneck, and the participants can choose the compress format to minimize the costs of data transmission.¶
A RECOMMENDED method of maintaining indexes is storing the records with a database table and use the row numbers or prime keys as indexes. The intersection result can be matched with original data items by a simple join operation. The participant could also design a different indexing mechanism on its own, as long as guaranteeing its uniqueness.¶
This document uses explicit indexes to identify data items rather than treats the order of records as "implicit" indexes. Compare with the implicit counterpart, explicit indexes can be efficiently created and maintained by modern databases, and do not require the participants to implement extra logic to preserve the order of records.¶
When masking the EC points from its partner, a participant MUST send the doubly-masked data with correct indexes. To achieve this, it keeps the indexes of each data items received in the first round, masks the records with its private key, and associates the masked records with the same indexes. These operations can also be implemented easily with database operations by viewing the received records and masked records as columns in the same table, which enables them to be linked naturally via the primary key.¶
Participants can negotiate hash_to_curve suites besides the ones listed by Section 3.2.1.1. They can use other suites defined by [RFC9380], or use a self-defined suite following the syntax and convention given by Section 8.9 and 8.10 of [RFC9380].¶
As an example, this document next defines a hash_to_curve suite for the ShangMi(SM) curve and cryptography algorithms. The SM algorithms are becoming mandatory in China, and have been accepted by international standards such as [ISO-SM2], [ISO-SM3] and [RFC8998].¶
The new suite, denoted by curveSM2_XMD:SM3_SSWU_RO, encodes data to points on the so-called curveSM2[GBT.32918.5-2017] with SM3 hash algorithm[GBT.32905-2016], and reuses the expand_message_xmd and Simplified Shallue-van de Woestijne-Ulas (SSWU) methods for message expansion and mapping.¶
In particular, curveSM2_XMD:SM3_SSWU_RO_ is defined as follows:¶
E: y^2 = x^3 + A * x + B, where¶
The value Z is calculated by the sage script given by Appendix H.2 of [RFC9380] with the p, A and B parameters of curveSM2.¶
The participants SHOULD agree on the meaning of HashToCurveSuite
values.
In this document, suite curveSM2_XMD:SM3_SSWU_RO_ is identified as follows:¶
enum { null (0), ... // (1) to (4) follow section 3.2.1.1 curveSM2_XMD:SM3_SSWU_RO_ (5), ... // other hash_to_curve suites (255) } HashToCurveSuite;¶
This section provides a detailed discussion on the truncation mechanism presented in this document.¶
The truncation, undoubtedly, will raise the probability of message collision. That is, two different data item may be mapped to the same string after the procedures of masking and truncation by coincidence. The collision may happen across the datasets owned by different participants, which causes a false-positive case that two different records are matched by PSI, or happen in the same dataset. The later situation may also lead to a false-positive case. To be more specific, consider a participant A has two different data items data_X and data_Y which are mapped to the same record mask_data_XY. Its partner, say B, also has record data_X which is mapped to mask_data_XY. Finally, A outputs both data_X and data_Y as the result of PSI, as it cannot distinguish which one matches the record of mask_data_XY.¶
To avoid such false-positive case, we have to consider the collision probability with respect to the sum number of data items owned by A and B.
Such probability can be computed with the well-known birthday paradox bound.
Let n
be the number of sampling and d
be the output space, the probability of collision can be approximately computed with:¶
p(n,d)=1-e^(-n(n-1)/2d)¶
This document uses a truncated length of 128 bit, which means d=2^128
.
For different n
, which is the sum size of records, the probability are calculated as follows:¶
If n=2^50
, p(2^50, 2^128) = 2^-29
¶
If n=2^45
, p(2^45, 2^128) = 2^-33
¶
If n=2^41
, p(2^41, 2^128) = 2^-47
¶
If n=2^40
, p(2^40, 2^128) = 2^-49
¶
If n=2^39
, p(2^39, 2^128) = 2^-51
¶
That is, if the number of records is less than 2^40, the probability of false-positive will be smaller than 2^-48. The participant can also decide the truncation option by calculating the collision probability, and only use truncation when they both agree that the probability is acceptable.¶
This document has no IANA actions.¶