Celery: Get Your Task Together
An [outdated] comparison between different Distributed Task Queues
Read MorePublished on Apr 07, 2016 by Sep Dehpour
Here is the actual talk!
I gave a talk at Pycon 2016 about how DeepDiff does what it does.
This is my Python talk proposal for 2016. It is about how my DeepDiff does what it does. I will be presenting this talk on June 1st at Pycon. https://us.pycon.org/2016/schedule/presentation/1862/
Edit (June 2016): read more about how the presentation went at Pycon presentation after thoughts
Anybody who has used git diff will know that your life is not the same once you start diffing. When you get to the habit, there is no going back. In this talk we will see how to diff python objects, and the different scenarios that you can bump into while diffing: from diffing built-in datatypes, nested objects, slots, unhashables to objects with loops and more. Diffing python objects can give you a good insight into Python data types. Finally a package called DeepDiff is introduced which makes diffing easy.
Once upon a time in land a far away, I had to write a test that would check a system’s output to the expected one. This didn’t sound like a lot of work, except when there was a change anywhere in the dictionary, I had to dig it to find what exactly changed. (show 2 big nested dictionaries that are very different and the result of diffing.)
So the question was:
how do I feed in 2 objects (t1, t2) and get their diff in a way that it handles:
So I thought, ok cool let’s start simple and build upon that. First we categories the objects depending on both the approach we want to take to diff them and the format of the output we want to get:
Text Sequences aka strings (basestring, str, bytes)
Numerics (int, float, long, complex, datetime.datetime, Decimal)
Mappings (Dict, OrderedDict, Defaultdict)
Set, Frozen set
Other Iterables (List, Generator, Deque, Tuple, Custom Iterable)
Custom Objects
basestring, str, bytes
py3 | example | py2 | example |
---|---|---|---|
str | “hello” | unicode | u"hello" |
bytes | b"hello" | str | “hello” |
The best tool for multi-line strings diffing in Python is the difflib:
import difflib
difflib.unified_diff(t1.splitlines(), t2.splitlines(), lineterm='')
Example:
>>> import difflib
>>> t1="""
... Hello World!
... """
>>> t2="""
... Hello World!
... It is Ice-cream time.
... """
>>> mygenerator = difflib.unified_diff(t1.splitlines(), t2.splitlines(), lineterm='')
>>> print '\n'.join(list(mygenerator))
---
+++
@@ -1,2 +1,3 @@
Hello World!
+It is Ice-cream time.
Py3 | int, float, complex, Decimal |
---|---|
Py2 | int, float, long, complex, Decimal |
A set object is an unordered collection of distinct hashable objects. Frozenset is immutable set.
>>> t1 = {1,2,3}
>>> t2 = {3,4,5}
>>> items_added = t2 - t1
>>> items_removed = t1 - t2
>>> items_added
set([4, 5])
>>> items_removed
set([1, 2])
Dict, OrderedDict, Defaultdict
When dealing with dictionaries, first we need to find the mutual keys, keys added and removed:
t1_keys= set(t1.keys())
t2_keys= set(t2.keys())
same_keys = t2_keys.intersection(t1_keys)
added = t2_keys - same_keys
removed = t1_keys - same_keys
Then we want to diff the values of common keys between t1 and t2. We will do this recursively.
List, Tuple and other iterables
What is an iterable? any object that has __iter__
so you can iterate over it. For example a list, tuple or a generator.
>>> from itertools import zip_longest
>>>
>>> t1=[1,2,3]
>>> t2=[1,2,5,6]
>>>
>>> class ListItemRemovedOrAdded(object):
... pass
...
>>> items_removed = []
>>> items_added = []
>>> items_modified = []
>>>
>>> for (x, y) in zip_longest(t1, t2, fillvalue=ListItemRemovedOrAdded):
... if y is ListItemRemovedOrAdded:
... items_removed.append(x)
... elif x is ListItemRemovedOrAdded:
... items_added.append(y)
... else:
... items_modified.append("{} changed to {}".format(x, y))
...
>>>
>>> items_removed
[]
>>> items_added
[6]
>>> items_modified
['1 changed to 1', '2 changed to 2', '3 changed to 5']
Convert the sequence to set and then diff it!
>>> t1=[1,2]
>>> t2=[1,3,4]
>>> t1set=set(t1)
>>> t2set=set(t2)
>>> t1set-t2set
{2}
>>> t2set-t1set
{3, 4}
But…
>>> t1=[1, 2, {3:3}]
>>> t2=[1]
>>> t1set = set(t1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
Let’s read the definition of set again:
A set object is an unordered collection of distinct hashable objects.
If your iterable contains unhashable item, we can’t convert it to a set. What is the difference of Unhashable and Mutable?
Let’s review this again:
Mutable objects can change their value but keep their id().
An object with a fixed value. Immutable objects include numbers, strings and tuples. Such an object cannot be altered. A new object has to be created if a different value has to be stored. They play an important role in places where a constant hash value is needed, for example as a key in a dictionary.
https://docs.python.org/3/glossary.html#term-hashable
An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__()
method), and can be compared to other objects (it needs an __eq__()
method).
All of Python’s immutable built-in objects are hashable, while no mutable containers (such as lists or dictionaries) are. Objects which are instances of user-defined classes are hashable by default; they all compare unequal (except with themselves), and their hash value is derived from their id().
A lot of people confuse these 2 concepts with each other.
Example of Hashable object that is Mutable:
>>> class A:
... aa=1
...
>>> hash(A)
2857987
>>> A.aa=2
>>> hash(A)
2857987
Objects which are instances of user-defined classes are hashable by default; they all compare unequal (except with themselves), and their hash value is derived from their id().
So how do we diff iterable that contains unhashable? And we don’t care about the order or items?
The first approach I took was to convert iterable to list, then sort it and then compare objects one by one. The irony here is we don’t care about the order of items in the original iterables but in order to compare them side by side we order them:
Python 2:
>>> t1=[{1:1}, {3:3}, {4:4}]
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t1.sort()
>>> t1
[{1: 1}, {3: 3}, {4: 4}]
>>> t2.sort()
>>> t2
[{1: 1}, {3: 3}, {4: 4}]
Ok great, now we can iterate through both lists and compare each item in t1 to the corresponding one in t2:
>>> [(a, b) for a, b in zip(t1,t2) if a != b]
[]
Python 3:
>>> t1=[{1:1}, {3:3}, {4:4}]
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t1.sort()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unorderable types: dict() < dict()
Ok, so this is weird. It worked in Python 2 but NOT python 3. Obviously the sorting algorithm has changed from Py2 to 3.
What is the solution?
https://wiki.python.org/moin/HowTo/Sorting
There is a key parameter to specify a function to be called on each list element prior to making comparisons. The output of key function is used for comparison and sorting. Think about it as a function that creates hooks that represent the object for the sort algorithm. Then the sort algorithm grabs these hooks and orders the hooks and thus sorts the actual object.
>>> student_tuples = [
('john', 'A', 15),
('jane', 'B', 12),
('dave', 'B', 10),
]
>>> sorted(student_tuples, key=lambda student: student[2]) # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
So what can we use for sorting unhashables?
The idea that came to me was to use the hash of objects as the key parameter.
But the problem was dictionaries were unhashable so how could I create a hash for them? Hmm, what about serializing the dictionary and then use the hash of the serialized object for sorting. The best thing about this idea was that if 2 dictionaries are the same, their serialized version should be the same and thus their hash should be the same:
>>> import json
>>> t1=[{1:1}, {3:3}, {4:4}]
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t1.sort(key=lambda x: hash(json.dumps(x)))
>>> t2.sort(key=lambda x: hash(json.dumps(x)))
>>> t1
[{1: 1}, {3: 3}, {4: 4}]
>>> t2
[{1: 1}, {3: 3}, {4: 4}]
>>> [(a, b) for a, b in zip(t1,t2) if a != b]
[]
yay!
>>> from itertools import zip_longest
>>> import json
>>>
>>> t1=[10, {1:1}, {3:3}, {4:4}]
>>> t1.sort(key=lambda x: hash(json.dumps(x)))
>>> t1
[{1: 1}, {3: 3}, {4: 4}, 10]
>>>
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t2.sort(key=lambda x: hash(json.dumps(x)))
>>> t2
[{1: 1}, {3: 3}, {4: 4}]
>>>
>>>
>>> class ListItemRemovedOrAdded(object):
... pass
...
>>> items_removed = []
>>> items_added = []
>>>
>>> for (x, y) in zip_longest(t1, t2, fillvalue=ListItemRemovedOrAdded):
... if y is ListItemRemovedOrAdded:
... items_removed.append(x)
... elif x is ListItemRemovedOrAdded:
... items_added.append(y)
... # else:
... # keep_diffing(x, y)
...
>>> items_removed
[10]
>>> items_added
[]
Now what about adding another item to t1:
>>> t1=[10, "a", {1:1}, {3:3}, {4:4}]
>>> t1.sort(key=lambda x: hash(json.dumps(x)))
>>> t1
['a', {1: 1}, {3: 3}, {4: 4}, 10]
>>> t2
[{1: 1}, {3: 3}, {4: 4}]
Ooops, the items with the same hash do not align with each other anymore!
If I run our diff now, we get the wrong result:
>>> items_removed = []
>>> items_added = []
>>> items_modified = []
>>>
>>> for (x, y) in zip_longest(t1, t2, fillvalue=ListItemRemovedOrAdded):
... if y is ListItemRemovedOrAdded:
... items_removed.append(x)
... elif x is ListItemRemovedOrAdded:
... items_added.append(y)
... else:
... items_modified.append("{} changed to {}".format(x, y))
...
>>> items_removed
[{4: 4}, 10]
>>> items_added
[]
>>> items_modified
['a changed to {1: 1}', '{1: 1} changed to {3: 3}', '{3: 3} changed to {4: 4}']
So ordering items based on their hashes was a wrong idea to begin with. But at least we learnt about the key parameter for sorting.
>>> import json
>>>
>>> t1 = [10, "a", {1:1}, {3:3}, {4:4}]
>>> t2 = [{3:3}, {1:1}, {4:4}, "b"]
>>>
>>>
>>> def create_hashtable(t):
... '''
... Creates hashtable of {item_hash: item}
... '''
... hashes = {}
... for item in t:
... try:
... item_hash = hash(item)
... except TypeError:
... try:
... item_hash = hash(json.dumps(item))
... except:
... print ("Warning: Can not produce a hash for %s item" % (item))
... else:
... hashes[item_hash] = item
... else:
... hashes[item_hash] = item
... return hashes
...
>>> t1_hashtable = create_hashtable(t1)
>>> t2_hashtable = create_hashtable(t2)
>>>
>>> items_added = [t2_hashtable[i] for i in t2_hashtable if i not in t1_hashtable]
>>> items_removed = [t1_hashtable[i] for i in t1_hashtable if i not in t2_hashtable]
>>>
>>> items_added
['b']
>>> items_removed
['a', 10]
This approach worked!
Now what if the object is not serializable? Ok in that case we just have to iterate over every object in t1 to see if it exists in t2. And do the same thing for t2 objects comparing to t1. Yes, you got it right as simple as that.
>>> t1 = [1, {1:1}]
>>> t2 = [2, {1:1}]
>>> id(t2)
4566093704
>>> id(t2[1])
4566093512
>>> id(t1[1])
4564810504
>>> t1[1] in t2
True
For example Decimals are not Json serializable:
>>> from decimal import Decimal
>>> num = Decimal(10.0)
>>> num
Decimal('10')
>>> from json import dumps
>>> dumps(num)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\ProgramsB\Python27\lib\json\__init__.py", line 243, in dumps
return _default_encoder.encode(obj)
File "C:\ProgramsB\Python27\lib\json\encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "C:\ProgramsB\Python27\lib\json\encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "C:\ProgramsB\Python27\lib\json\encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: Decimal('10') is not JSON serializable
>>>
And if we convert Decimal(10) to integer 10, then the serialization of the two will be the same. Thus the hash will be the same. And when we diff it will think these 2 objects are the same.
Here is how another serialization method comes in:
Pickle can serialze most Python objects. It is rare that you bump into objects that can’t be pickled. However the important thing for us to create serialization of content in a way that the serialization does not change if object’s contents are the same.
Let’s see how Pickle’s serialization result looks like for 2 objects that have the same content:
>>> from pickle import dumps
>>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
>>> dumps(t)
"((dp0\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\
nI10\nsS'Hello World'\np1\n(I1\nI2\nI3\nI4\nI5\
ntp2\n(lp3\nI1\naI2\naI3\naI4\naI5\natp4\n."
>>> dumps(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
'Hello World', (1, 2, 3, 4, 5),
[1, 2, 3, 4, 5]))
"((dp0\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\
nI10\nsS'Hello World'\np1\n(I1\nI2\nI3\nI4\nI5
\ntp2\n(lp3\nI1\naI2\naI3\naI4\naI5\natp4\n."
The serilzation of both objects look the same. Yes!
What about cPIckle? It is faster than Pickle! It is written in C. And supposed to be 1000 times faster than Pickle.
>>> from cPickle import dumps
>>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
>>> dumps(t)
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\n
I10\nsS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\n
tp3\n(lp4\nI1\naI2\naI3\naI4\naI5\nat."
>>> dumps(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
'Hello World', (1, 2, 3, 4, 5),
[1, 2, 3, 4, 5]))
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\n
I10\nsS'Hello World'\n(I1\nI2\nI3\nI4\nI5\nt(lp2
\nI1\naI2\naI3\naI4\naI5\natp3\n."
Uh oh! The result serializations of the same objects are not the same using cPickle!.
Why? cPickle tries to optimize the serialization by including if the object is referenced once or more than once! For a discussion around this look at http://stackoverflow.com/q/7501577/1497443
I ended up using cPickle to create the serilzation to generate the {item_hash: item}
hash table in DeepDiff.
The reason it worked for DeepDiff was that we were creating references to the objects inside the libaray while diffing.
So the cPickle of objects with equal content was always the same during the program run.
Again cPickle ONLY produces different results for the same content if there is only one reference to the object vs. more than one reference.
Let’s use __dict__
attribute to get the instance’s attributes.
By default, instances of classes have a dictionary for attribute storage called __dict__
:
>>> class CL:
... attr1 = 0
... def __init__(self, thing):
... self.thing = thing
>>> obj1 = CL(1)
>>> obj2 = CL(2)
>>> obj2.attr1 = 10
>>> obj1.__dict__
{'thing': 1} # Notice that att1 is not here
>>> obj2.__dict__
{'attr1': 10, 'thing': 2}
And we deal with it as if we were diffing 2 dictionaries.
__slots__
aka When there is no __dict__
If the class is using __slots__
to have more control over the object attributes, there won’t be a __dict__
:
Example:
>>> class ClassA(object):
... __slots__ = ['x', 'y']
... def __init__(self, x, y):
... self.x = x
... self.y = y
...
>>> t1 = ClassA(1, 1)
>>> t2 = ClassA(1, 2)
>>>
>>> t1.d = 10
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'ClassA' object has no attribute 'd'
As you can see with __slots__
we can’t dynamically add attributes. You can think of __slots__
as a bridge towards static typing in python.
And unlike __dict__
, __slots__
is NOT a dictionary:
>>> t1.__slots__
['x', 'y']
So in order to create a dictionary out of the attributes:
>>> t1 = {i: getattr(t1, i) for i in t1.__slots__}
>>> t2 = {i: getattr(t2, i) for i in t2.__slots__}
>>> t1
{'x': 1, 'y': 1}
>>> t2
{'x': 1, 'y': 2}
As far as diffing goes, the rest is just like comparing 2 dictionaries.
Did you know that Python objects can have attributes pointing to themselves!
>>> class LoopTest(object):
... def __init__(self, a):
... self.loop = self
... self.a = a
...
>>> t1 = LoopTest(1)
>>> t2 = LoopTest(2)
>>> t1
<__main__.LoopTest object at 0x02B9A910>
>>> t1.__dict__
{'a': 1, 'loop': <__main__.LoopTest object at 0x02B9A910>}
Let’s step back and think about it for a sec:
For custom objects, we are creating a dictionary out of instance attributes. Then we are creating 3 lists: attributes added, attributes removed and common attributes (common keys of the dictionaries). Then we are recursively diffing the common attributes values. So once we hit the same object again, we get stuck in a never ending loop!
As you know every object in python has a unique ID. And if 2 objects have the same ID, they are the same object.
Let’s say A is parent of B and B is parent of C and C is parent of A:
A --> B --> C --> A
object | A | B | C | A |
---|---|---|---|---|
id | 1 | 2 | 3 | 1 |
If we keep the ID of every object and every parent of the object, once we start processing an object, we can check to see if its ID exists in the list of IDs of its parents. And in that case, we get out of the loop!
Example:
object | ID List |
---|---|
A | 1 |
B | 1, 2 |
C | 1, 2, 3 |
A | 1, 2, 3, 1 (Loop detected, get me out of here!) |
Since we will be creating this list of IDs for every object in the nested structure that we process recursively, we have to make sure the ID list is immutable.
def diff_common_children_of_dictionary(t1, t2, t_keys_intersect, parents_ids):
''' difference between common attributes of objects or values of common keys of dictionaries '''
for item_key in t_keys_intersect:
t1_child = t1[item_key]
t2_child = t2[item_key]
item_id = id(t1_child)
if parents_ids and item_id in parents_ids:
print ("Warning, a loop is detected.")
continue
parents_added = set(parents_ids)
parents_added.add(item_id)
parents_added = frozenset(parents_added)
diff(t1_child, t2_child, parents_ids=parents_added)
Why it has to be immutable? because otherwise the same parents_ids object is gonna be passed around the recursion (passed by reference) instead of a new parents_ids object per object.
__dict__
or __slots__
when diffing.Diffing python objects can give you a good insight into Python data types. I find diffing very useful in debugging and testing. For example I have made all my assertEqual
calls to diff the test result object and the expected result if the assertion fails. This helps a lot to understand how exactly your expected result and test result are different.
To make things easier, I have made a package called DeepDiff that you can easily install from Pypi and use:
pip install DeepDiff
https://github.com/seperman/deepdiff
DeepDiff makes it super easy to diff complex objects:
Examples:
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2, 3]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'iterable_item_added': ["root[4]['b']: [3]"],
'values_changed': ["root[4]['b'][1]: 2 ===> 3", "root[4]['b'][2]: 3 ===> 2"]}
>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:1, 2:2}]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:3}]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'dic_item_removed': ["root[4]['b'][2][2]"],
'values_changed': ["root[4]['b'][2][1]: 1 ===> 3"]}
>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> t1 = Point(x=11, y=22)
>>> t2 = Point(x=11, y=23)
>>> print (DeepDiff(t1, t2))
{'values_changed': ['root.y: 22 ===> 23']}
>>> t1 = [{"a": 2}, {"b": [3, 4, {1: 1}]}]
>>> t2 = [{"b": [3, 4, {1: 1}]}, {"a": 2}]
ddiff = DeepDiff(t1, t2, ignore_order=True)
>>>
>>> print(DeepDiff(t1, t2))
{}
And more: https://github.com/seperman/deepdiff
An [outdated] comparison between different Distributed Task Queues
Read More