Python's builtin hash
I was reading an article today and came across Python’s builtin hash()
.
I was curious what the underlying implementation was, or at least its guarantees,
so I did a little reading. Turns out help(hash)
doesn’t actually give much
indication of the impl:
# `help(hash)`
Help on built-in function hash in module __builtin__:
hash(...)
hash(object) -> integer
Return a hash value for the object. Two objects with the same value have
the same hash value. The reverse is not necessarily true, but likely.
More googling turned up a few key things about the builtin hash function
-
It’s a consistent & well-distributed hash function.
hash('hi')
is always13312079976120147
…with some exceptions: -
Exception 1: hashing python’s core types. As it turns out, the builtin
hash()
is used for the dictionary & set implementations in Python, so to harden it against “dictionary insertion DoS attacks,”hash()
is randomized each time you start a new instance of Python 2.6.8+. -
Exception 2:
hash()
computes different values for 32-bit and 64-bit builds. -
In CPython,
hash()
on internal objects used to return theid()
of the object or some variant (/16). This may or may not be the case in the future, but at any rate you should expect hashing an object to give the same value when you hash the exact same object again later.
# In case you don't know what id() in python does... `help(id)`
Help on built-in function id in module __builtin__:
id(...)
id(object) -> integer
Return the identity of an object. This is guaranteed to be unique among
simultaneously existing objects. (Hint: it's the object's memory address.)
Since the purpose of hash()
has always been to provide something usable for
data structure implementations, it’s not collision resistant. It’s not suitable
for cryptographic applications in that regard.
In python, the hashlib
library provides various other standard cryptographic hashes
which are well-balanced (collision-prone md5, collision-possible sha-1, and currently
more robust sha224,256,384,512).