Consumer Data Right (CDR) – User-specific Identifiers for ID Permanence

Version 1.0.0 of the Consumer Data Right standard was released in September, and it introduces a common set of Banking APIs in line with Australian government legislation. The principles behind the standard's design are very solid, though some of the specific requirements are pretty wild, and they force a bit of rethinking of some classical API conventions. The most prominent example is the approach the CDR standards take towards ‘object identifiers’ in the ID Permanence section. I considered these requirements interesting enough to spend some time thinking about and documenting them.

In this context, an ‘object identifier’ refers to the way in which you refer to an individual instance of an object from your API, such as the ‘accountId’ in the following URI:

GET /banking/accounts/{accountId}

In this blog post we will look at what the CDR requires for these types of identifiers, and provide some sample code which implements the obfuscation requirements specified in the standard.

Typically when you design an API, you use an Object/Collections model, and your API endpoints reflect this – i.e. if you are designing an API for cars, you have an endpoint /Cars which allows for searching for cars, then a /Cars/{carId} endpoint which allows for actions on a specific vehicle. Most of the CDR requirements for identifiers reflect this design, i.e. it MUST be supplied if required (since it is part of the URL anyway), it SHOULD be unique within a context (i.e. ‘carId’ shouldn’t be ambiguous, but any clash with ‘motorbikeId’ is acceptable), etc.

The standard also specifies that the IDs must be arbitrary and have no inherent meaning, which is also fine; it just means use an internal index number rather than the numberplate or VIN for your ‘carId’. Where it gets interesting is the following requirement:

“IDs MUST be immutable across sessions and consents but MUST NOT be transferable across data recipients. For example, data recipient A obtaining an account ID would get a different result from data recipient B obtaining the ID for the same account even if the same customer authorised the access. Under this constraint IDs cannot be usefully transferred between client organisations or data holders.”

Consumer Data Right – ID Permanence

This means that every user has different sets of identifiers! /Cars/12345 returns a different car for Alice than it does for Bob, and the car Alice receives when they invoke /Cars/12345 might be obtained by Bob by invoking /Cars/54321. The likely rationale behind this is to avoid leaking information through the URL – if Charlie can observe that Alice is viewing /Cars/12345, there is no way for them to determine precisely which car that is, Charlie can’t simply invoke /Cars/12345 to get more detail.

It does however have a downside, which is the portability of calls – Alice can’t send a link to Bob and expect Bob to see the same thing that she is seeing, which forces consumers of the API down the HATEOAS path. HATEOAS (for ‘Hypermedia as the Engine of Application State’) comes from some of the original RESTful API concepts, which assumed that consumers would essentially ‘discover’ API endpoints by making a series of calls to navigate through the API from the root URI. In reality of course, no one really did this, as network hops are not free, and it was much easier to simply invoke the precise endpoint you needed rather than make an initial call, then parse the response to determine the next call, make that call, etc. With user-specific identifiers, users are forced to make a call to get their own set of identifiers, before they are able to access objects using those identifiers, i.e. it requires a ‘pseudo-discovery’ step.

With the concept and the implications well established, we are able to consider what is required to implement user-specific identifier sets. I am assuming that most implementations will have some sort of internal store, such as a database, which provides canonical identifiers, i.e. a database of cars identified by UUIDs in our ongoing example. As such, at a high level, we need a way to translate keys from ‘canonical internal space’ to ‘user-specific space’, in a way that must:

  • be reversible
  • use a user identifier which is available on each call
  • present an opaque result  (i.e. someone with both the user-space identifier and the user identifier must not be able to derive the canonical identifier)
  • be fast enough to be called on 50+ objects as part of a response
  • be printable
  • not make the resulting id too large

The conclusion I have come to is that the best approach is simply to use AES with a key derived from a secret and the user identifier. Simple sample code for this, using the Node.js crypto library, is below (it is also on GitHub):

const crypto = require('crypto');

function canonicalToUserIdentifier(id, userId){
  const mySecret = "This is a password used to derive a key.";
  //Super simple user-specific key creation, there are definitely better
  //ways to do this, but given I assume lots of validation will be done
  //around this userId (since it will be derived from a signed token),
  //concatenation is probably fine.
  const userPassword = mySecret + userId;
  //Using a 256 bit key length, since it lines up nicely with AES
  const hash = crypto.createHash('sha256');
  hash.update(userPassword);
  const key = hash.digest();
  //Fixed all-zero IV, so the same id always maps to the same result
  const cipher = crypto.createCipheriv('aes-256-cbc', key, Buffer.alloc(16, 0));
  let userIdentifier = cipher.update(id, 'utf8', 'hex');
  userIdentifier += cipher.final('hex');
  return userIdentifier;
}

function userToCanonicalIdentifier(id, userId){
  const mySecret = "This is a password used to derive a key.";
  //Simple user-specific key creation, see above.
  const userPassword = mySecret + userId;
  const hash = crypto.createHash('sha256');
  hash.update(userPassword);
  const key = hash.digest();
  const decipher = crypto.createDecipheriv('aes-256-cbc', key, Buffer.alloc(16, 0));
  //Note that the id is somewhat user controlled here - I assume you have a lot
  //of validation and error catching around this. Depending on the Node.js
  //version, this may throw a TypeError if the string can't be parsed as hex,
  //and final() throws if the padding is invalid.
  let canonicalIdentifier = decipher.update(id, 'hex', 'utf8');
  canonicalIdentifier += decipher.final('utf8');
  return canonicalIdentifier;
}

As a simple test of these functions:

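var testCanonicalId = "1234567890";
var testUserId = "Alice";
var userSpaceId = canonicalToUserIdentifier(testCanonicalId, testUserId);
var canonicalSpaceId = userToCanonicalIdentifier(userSpaceId, testUserId);
console.log(userSpaceId);  //32 hex characters (one AES block)
console.log(canonicalSpaceId === testCanonicalId);  //true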

The use of AES allows for the transformation to be reversible, such that we are able to determine which internal resource a consumer is asking for when they use their own user-specific identifiers. The identifiers are indeed user-specific, as the user identifier is used to derive the encryption key; however it is combined with a secret which is held by the service to ensure that the resulting identifier is suitably opaque. I have chosen to present the encrypted identifier as a hex string, since it is printable, plus I like how it displays.
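The comments in the sample admit that concatenating the secret and the user identifier is the simplest possible key derivation. A minimal sketch of one alternative, assuming HMAC-SHA256 keyed with the service secret, is shown here. The function name `deriveUserKey` and the constant `SERVICE_SECRET` are hypothetical, and nothing in the CDR standard prescribes this construction.

```javascript
const crypto = require('crypto');

// Hypothetical service secret; in practice this would come from a secrets store.
const SERVICE_SECRET = 'This is a password used to derive a key.';

// Derive a 32-byte, user-specific AES-256 key with HMAC-SHA256.
// HMAC is the standard construction for combining a secret with a message,
// so the secret is used as a key rather than as a string prefix.
function deriveUserKey(userId) {
  return crypto.createHmac('sha256', SERVICE_SECRET)
    .update(userId, 'utf8')
    .digest();
}

const aliceKey = deriveUserKey('Alice');
const bobKey = deriveUserKey('Bob');
console.log(aliceKey.length);          // 32
console.log(aliceKey.equals(bobKey));  // false
```

The returned Buffer can be passed to `crypto.createCipheriv` in place of the SHA-256 digest used in the sample functions.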

This approach only really violates the last point, in that the resulting id is larger than the initial id. This is due to both the hex encoding (two characters per byte) and the block size of the cipher. The latter is especially apparent if the canonical id is slightly larger than the AES block, as it results in a huge chunk of padding. You could make this a little more compact by using base64url encoding instead of hex.
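To make the size overhead concrete, the following sketch encrypts three canonical ids of different lengths and compares the hex and base64url forms. The helper `encryptId` is a hypothetical variant of the sample function that returns raw bytes, and the `'base64url'` Buffer encoding assumes a reasonably recent Node.js release (it was added in the 14.18 / 15.7 lines).

```javascript
const crypto = require('crypto');

// Same key derivation and fixed IV as the sample functions in this post,
// but returning the raw ciphertext so it can be encoded in different ways.
function encryptId(id, userId) {
  const key = crypto.createHash('sha256')
    .update('This is a password used to derive a key.' + userId)
    .digest();
  const cipher = crypto.createCipheriv('aes-256-cbc', key, Buffer.alloc(16, 0));
  return Buffer.concat([cipher.update(id, 'utf8'), cipher.final()]);
}

// PKCS#7 padding always rounds up to the next full 16-byte block,
// so a 16-byte id gains a whole extra block of padding.
for (const id of ['1234567890', '0123456789abcdef', '0123456789abcdefg']) {
  const encrypted = encryptId(id, 'Alice');
  console.log(
    Buffer.byteLength(id, 'utf8'),          // canonical id bytes: 10, 16, 17
    encrypted.length,                       // ciphertext bytes:   16, 32, 32
    encrypted.toString('hex').length,       // hex characters:     32, 64, 64
    encrypted.toString('base64url').length  // base64url chars:    22, 43, 43
  );
}
```

Decoding the compact form on the way back in is a matter of `Buffer.from(value, 'base64url')` before handing the bytes to the decipher.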

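An alternative approach, if your identifiers are slightly larger than your block size, could be to use AES in CTR mode, which turns it into a stream cipher and removes the padding. However, with a fixed IV every identifier for a given user is XORed with the same keystream, so there is a much closer relationship between the form of the canonical identifier and the resulting user-space identifier (similar canonical ids produce similar user-space ids), which should probably be avoided.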
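The relationship CTR mode introduces can be shown directly. In this sketch, which assumes the same fixed all-zero IV and per-user key as the sample functions (`encryptCtr` and `xor` are hypothetical helpers), two canonical ids that differ only in their last character produce ciphertexts that differ only in their last byte.

```javascript
const crypto = require('crypto');

// Encrypt with AES-256-CTR using a fixed all-zero IV, mirroring the CBC sample.
function encryptCtr(id, key) {
  const cipher = crypto.createCipheriv('aes-256-ctr', key, Buffer.alloc(16, 0));
  return Buffer.concat([cipher.update(id, 'utf8'), cipher.final()]);
}

// XOR two equal-length buffers byte by byte.
function xor(a, b) {
  const out = Buffer.alloc(a.length);
  for (let i = 0; i < a.length; i++) out[i] = a[i] ^ b[i];
  return out;
}

const key = crypto.createHash('sha256')
  .update('This is a password used to derive a key.' + 'Alice')
  .digest();
const idA = 'account-0001';
const idB = 'account-0002';
const encA = encryptCtr(idA, key);
const encB = encryptCtr(idB, key);

// No padding: the ciphertext is exactly as long as the plaintext.
console.log(encA.length === Buffer.byteLength(idA));  // true
// Same keystream for both, so the XOR of the ciphertexts equals the XOR of the ids.
console.log(xor(encA, encB).equals(xor(Buffer.from(idA), Buffer.from(idB))));  // true
```

Anyone holding two user-space ids can therefore learn how the underlying canonical ids differ, which is exactly the kind of leakage the opacity requirement is meant to prevent.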

We are forced down this path because we need to be able to print the resulting id, and are not making any assumptions about our internal ids. Given the internal canonical ids are never exposed externally, it probably doesn’t matter too much that the resulting id is larger than the original as the consumer is only ever seeing the user-specific identifier forms. If it is an issue, and you are able to control your internal identifiers, then you could investigate the possibility of using some sort of format-preserving encryption.

Once you have some variant of these two functions, you can use them to ensure that each user receives separate identifiers for the same internal objects. As they simply perform a single hash and a single encryption pass, they should be fast enough to apply to a whole page of responses before returning them to a user. Then, when a user requests an object by id, you are able to transform that request into your internal identifier, use that to look up the result in the backend data store, and return it to the user.
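As a rough sketch of that flow, the following wires compact versions of the two functions into a list call and a get call over an in-memory store. The `accounts` map, `listAccounts` and `getAccount` are hypothetical stand-ins for a real data store and API handlers, and authorisation and error handling are deliberately left out.

```javascript
const crypto = require('crypto');

const SECRET = 'This is a password used to derive a key.';
const ZERO_IV = Buffer.alloc(16, 0);
const keyFor = (userId) => crypto.createHash('sha256').update(SECRET + userId).digest();

function toUserSpaceId(canonicalId, userId) {
  const cipher = crypto.createCipheriv('aes-256-cbc', keyFor(userId), ZERO_IV);
  return cipher.update(canonicalId, 'utf8', 'hex') + cipher.final('hex');
}

function toCanonicalId(userSpaceId, userId) {
  const decipher = crypto.createDecipheriv('aes-256-cbc', keyFor(userId), ZERO_IV);
  return decipher.update(userSpaceId, 'hex', 'utf8') + decipher.final('utf8');
}

// Hypothetical backend store keyed by canonical id.
const accounts = new Map([
  ['1001', { accountId: '1001', displayName: 'Everyday' }],
  ['1002', { accountId: '1002', displayName: 'Savings' }],
]);

// List: translate every canonical id on the page before returning it.
function listAccounts(userId) {
  return [...accounts.values()].map((account) => ({
    ...account,
    accountId: toUserSpaceId(account.accountId, userId),
  }));
}

// Get: translate the caller's id back, then look up the canonical record.
function getAccount(userSpaceId, userId) {
  const account = accounts.get(toCanonicalId(userSpaceId, userId));
  return account && { ...account, accountId: userSpaceId };
}

const alicePage = listAccounts('Alice');
const bobPage = listAccounts('Bob');
console.log(alicePage[0].accountId !== bobPage[0].accountId);          // true
console.log(getAccount(alicePage[1].accountId, 'Alice').displayName);  // Savings
```

The list call is the ‘pseudo-discovery’ step: a consumer has to call it once to learn its own identifiers before it can address an individual account.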

Note that this approach is not great cryptographically – using an empty initialisation vector is pretty poor practice. Unfortunately, given the limitations of the standard, I cannot figure out a better way to handle this – since generating a random IV means you need to persist it somehow, and you can’t send it as part of the identifier, since that breaks identifier permanence across calls. Fortunately this is a tightly scoped use of cryptography just to provide obfuscation of identifiers with no, or very few, user-modifiable fields, so provided you are careful with how you perform authentication and authorisation as well as taking some care around how you present errors this approach should be relatively resistant to attacks on the cryptography.
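One way to take that care around errors, sketched here under the assumption that ids are presented as lowercase hex, is to validate the incoming id and collapse every decryption failure into the same result. `safeToCanonicalId` is a hypothetical wrapper, not part of the original sample.

```javascript
const crypto = require('crypto');

const SECRET = 'This is a password used to derive a key.';
const keyFor = (userId) => crypto.createHash('sha256').update(SECRET + userId).digest();

function toUserSpaceId(canonicalId, userId) {
  const cipher = crypto.createCipheriv('aes-256-cbc', keyFor(userId), Buffer.alloc(16, 0));
  return cipher.update(canonicalId, 'utf8', 'hex') + cipher.final('hex');
}

// Returns the canonical id, or null for anything that does not decrypt cleanly.
function safeToCanonicalId(userSpaceId, userId) {
  // Valid ids are lowercase hex covering a whole number of 16-byte AES blocks.
  if (typeof userSpaceId !== 'string' || !/^(?:[0-9a-f]{32})+$/.test(userSpaceId)) {
    return null;
  }
  try {
    const decipher = crypto.createDecipheriv('aes-256-cbc', keyFor(userId), Buffer.alloc(16, 0));
    return decipher.update(userSpaceId, 'hex', 'utf8') + decipher.final('utf8');
  } catch (err) {
    // Invalid padding, e.g. a tampered value or an id issued to someone else.
    return null;
  }
}

const aliceId = toUserSpaceId('1234567890', 'Alice');
console.log(safeToCanonicalId(aliceId, 'Alice'));    // 1234567890
console.log(safeToCanonicalId('not-hex', 'Alice'));  // null
```

The caller can then treat `null` exactly like an id that decrypts but matches no record, returning one generic not-found response so that the error itself does not reveal which check failed.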

The other concern, which isn’t explored here, is how you determine your user identifier – while you probably should simply use the ‘sub’ attribute from the access token, it is worth keeping in mind that these are ‘Pairwise Pseudonymous Identifiers’ according to the spec – which means that they are themselves further obfuscated (I suggest using the AES variant, as opposed to the SHA one, since you might actually need to retrieve the local user id from requests). Trying to perform authorisation under CDR is a bit of an adventure, given both the user and the resource they are attempting to access must be obfuscated.

Hopefully this gives you some ideas on how to approach one of the more interesting requirements of the Consumer Data Right standard. When I read through the requirement I was initially taken aback, since I am so used to simply presenting internal identifiers out to the world – which seems to make sense in the world of product catalogues and incident numbers, but I can understand some caution in the banking context. This requirement can be solved by a simple application of cryptography, though it would be nice to see some notes on this in an implementation guide.
