# IDs and Hashing in Chalk
> **NOTE**: This document is still a work in progress.
Chalk uses cryptographic hashing (SHA-256) to help uniquely identify artifacts. Key to this is the concept of the Chalk hash, which is a normalized hash of an artifact, without a chalk mark inserted.

The normalized hash is generally used to create the `CHALK_ID` field. Additionally, Chalk validates the integrity of the metadata fields in a chalk mark with the `METADATA_ID` field.
Two artifacts with the same `CHALK_ID` should be the same core artifact. But if the same set of bits were marked multiple times in multiple ways, they will produce different `METADATA_ID`s.
There are important things to understand about the system:
- The Chalk hash may not always be the same as what you'd get from hashing the bits on disk.
- Chalking of Docker images and Docker containers cannot currently work the same way.
- Digital signatures using the In-Toto standard are available across all artifact types, to provide some additional assurance to those who want it.
In this document, we explain the trade-offs, and look at how to handle related use cases with Chalk.
## Chalk-internal IDs and Hashes
It's not sufficient to compare hashes of fully chalked artifacts for identity. Even if all other bits are the same, if two artifacts aren't identically chalked, they'll end up giving different hashes.
Of course, once metadata is added, it can be valuable to understand when an artifact's metadata seems to have been tampered with.
As a result, Chalk uses two identifiers:
- The `CHALK_ID` field usually allows you to compare the equality of two different artifacts, in a way that is not in any way dependent on the contents of the chalk mark. This field has some subtleties that are important to understand.
- The `METADATA_ID` field is 100% dependent on the contents of both the executable and the chalk mark added to the executable. It is much more straightforward, and consistent across all artifact types.
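To make the distinction concrete, here is a minimal Python sketch of the relationship between the two IDs. It is purely illustrative: the helper names, the JSON encoding of the mark, and the use of full hex digests are assumptions for this sketch, not Chalk's actual implementation.

```python
import hashlib
import json

def chalk_id(artifact: bytes) -> str:
    # Stand-in for Chalk's normalization: here we simply hash the
    # raw bytes of the unmarked artifact.
    return hashlib.sha256(artifact).hexdigest()

def metadata_id(artifact: bytes, mark: dict) -> str:
    # Depends on BOTH the artifact and the chalk mark's contents.
    blob = artifact + json.dumps(mark, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

artifact = b"the same core artifact, bit for bit"
mark_a = {"BUILD_HOST": "ci-runner-1"}   # marked one way...
mark_b = {"BUILD_HOST": "ci-runner-2"}   # ...and marked another way

# Same bits => same CHALK_ID, no matter how the artifact is marked.
assert chalk_id(artifact) == chalk_id(artifact)
# Different marks => different METADATA_IDs.
assert metadata_id(artifact, mark_a) != metadata_id(artifact, mark_b)
```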
## The Chalk Hash
One challenge Chalk IDs have to solve is that we might be asked to chalk artifacts that are already chalked! Therefore, just using the hash of the files on disk is not good enough.
In some cases we can solve that by removing the chalk mark before hashing (or computing as if the mark were removed, anyway). However, that overcomplicates life for complex file formats, where being able to recover the exact unchalked state can require significant additional complexity and / or storage.
For instance, consider ZIP files. The file format has some complexities, with multiple versions, and internal bits that can differ from implementation to implementation. And, we use a third-party library both to read data out of a ZIP file and to reassemble it.

It would be a tremendous amount of additional work to guarantee that, if we're removing a mark from a ZIP file, the resulting bits would be the exact same bits that were there before the current mark was inserted. (We could do it easily by keeping a full second copy of the artifact, but we don't consider doubling the artifact's size reasonable.)
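As a quick illustration of why exact-bit recovery is hard (this is generic Python, not Chalk code): two ZIP archives with identical contents can still differ on disk, for example because of entry timestamps.

```python
import io
import zipfile

def make_zip(date_time: tuple) -> bytes:
    # Build an in-memory ZIP containing one identical file.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("hello.txt", date_time=date_time)
        zf.writestr(info, "hello, world")
    return buf.getvalue()

a = make_zip((2020, 1, 1, 0, 0, 0))
b = make_zip((2021, 1, 1, 0, 0, 0))

# Semantically identical archives, different bytes on disk.
print(a == b)  # False
```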
Similarly, while our ELF codec does not use third-party libraries, the transformations we make don't necessarily have a single inverse. For instance, ELF binaries can keep the "section table" basically anywhere in the binary, but much like the Unix `strip` command, for the sake of simplicity and correctness, we move the section table to the back of the binary. Making this operation invertible would take significant additional work and require some storage.
As a result, the Chalk hash (the `HASH` metadata key) is not defined based on the file system hash. Instead, it is a normalized hash, meaning it should be a deterministic hash where the exact same input will always give the same output.
However, the hash function isn't required to be able to reverse our transformations and recover the original file system bits. Instead, we ensure that calculating the Chalk hash without physically marking the file gives the same result as marking the file and then calculating it. We also ensure that deleting a chalk mark and re-adding it always gives the same results.
To be clear, every bit of the post-normalization content that is not part of the chalk mark must be authenticated. And we require that, if a normalization process gives two artifacts the same normalized input stream, those two artifacts must be semantically identical.
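This property can be sketched in a few lines of Python, using a toy format where the mark sits between known delimiters. Chalk's real codecs locate and remove marks in format-specific ways; the magic string and regex here are invented for illustration.

```python
import hashlib
import re

# Toy convention: a chalk mark is a JSON blob following a magic string.
MARK_RE = re.compile(rb'CHALK-MARK\{[^}]*\}')

def normalize(data: bytes) -> bytes:
    # Strip any existing mark; everything left over is authenticated.
    return MARK_RE.sub(b"", data)

def chalk_hash(data: bytes) -> str:
    return hashlib.sha256(normalize(data)).hexdigest()

unmarked = b"some artifact contents"
marked = b'some artifact CHALK-MARK{"CHALK_ID":"..."}contents'

# Hashing before marking, after marking, or after deleting and
# re-adding a mark all give the same Chalk hash.
assert chalk_hash(unmarked) == chalk_hash(marked)
```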
## More on the Chalk ID
Once an artifact has been normalized, and the normalized data stream has been hashed using SHA-256, we programmatically take 100 bits of the raw hash output, base32-encode those bits, and then add some hyphens for clarity, to get the `CHALK_ID`.
Being 100 bits, the ID has enough cryptographic strength to ensure integrity, and can therefore usually be considered a more human-readable version of the Chalk hash.
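As a sketch of the derivation, in Python (the choice of the first 100 bits and the 6-4-4-6 hyphen grouping shown here are assumptions about the format, not a specification):

```python
import base64
import hashlib

def chalk_id_from_digest(digest: bytes) -> str:
    # Take the first 100 bits of the SHA-256 output: 13 bytes, with
    # the low nibble of the last byte masked off.
    raw = bytearray(digest[:13])
    raw[-1] &= 0xF0
    # 100 bits encode to exactly 20 base32 characters (5 bits each).
    chars = base64.b32encode(bytes(raw)).decode()[:20]
    # Hypothetical hyphen grouping, purely for readability.
    return "-".join((chars[0:6], chars[6:10], chars[10:14], chars[14:20]))

digest = hashlib.sha256(b"normalized artifact stream").digest()
print(chalk_id_from_digest(digest))  # e.g. XXXXXX-XXXX-XXXX-XXXXXX
```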
As a result, `CHALK_ID` is a required field in chalk marks, whereas the `HASH` field is not. When validating an artifact via `CHALK_ID`, we re-normalize, re-compute the Chalk hash, and then compare to what's stored inside the chalk mark. If the two IDs do not match, then we know that the bits inside the artifact have changed, and Chalk complains.
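Putting it together, the validation flow looks roughly like this, reusing the toy `normalize()` and `chalk_id_from_digest()` sketches above; `extract_mark` is a hypothetical codec-specific callback, since mark parsing differs per artifact type.

```python
import hashlib

def validate(artifact: bytes, extract_mark) -> bool:
    # Parse the chalk mark out of the artifact (codec-specific).
    mark = extract_mark(artifact)
    # Re-normalize and re-compute the Chalk hash and CHALK_ID.
    digest = hashlib.sha256(normalize(artifact)).digest()
    recomputed = chalk_id_from_digest(digest)
    # A mismatch means bits outside the chalk mark have changed,
    # and Chalk complains.
    return mark["CHALK_ID"] == recomputed
```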
## The Docker exception
## Understanding the Metadata ID
## Doing file-system level comparisons
- The ELF codec adds a data section to executables, which we want to do in a way that is as obviously correct as possible. To that end, we always end up moving the section table to the back of the binary.