Diff It To Digg It


Table of Contents:

The Talk

Here is the actual talk!

I gave a talk at Pycon 2016 about how DeepDiff does what it does.

The proposal

This is my Python talk proposal for 2016. It is about how my DeepDiff does what it does. I will be presenting this talk on June 1st at Pycon. https://us.pycon.org/2016/schedule/presentation/1862/

Edit (June 2016): read more about how the presentation went at Pycon presentation after thoughts

Summary

Anybody who has used git diff will know that your life is not the same once you start diffing. When you get to the habit, there is no going back. In this talk we will see how to diff python objects, and the different scenarios that you can bump into while diffing: from diffing built-in datatypes, nested objects, slots, unhashables to objects with loops and more. Diffing python objects can give you a good insight into Python data types. Finally a package called DeepDiff is introduced which makes diffing easy.

Description

Intro

Once upon a time in land a far away, I had to write a test that would check a system’s output to the expected one. This didn’t sound like a lot of work, except when there was a change anywhere in the dictionary, I had to dig it to find what exactly changed. (show 2 big nested dictionaries that are very different and the result of diffing.)

So the question was:

how do I feed in 2 objects (t1, t2) and get their diff in a way that it handles:

  • Diff nested dictionaries and custom objects
  • Get the exact path and value of the item that had been changed, added or removed
  • Ignore an iterable’s order when required
  • Work with both Python 2 and 3

So I thought, ok cool let’s start simple and build upon that. First we categories the objects depending on both the approach we want to take to diff them and the format of the output we want to get:

Objects categories

  1. Text Sequences aka strings (basestring, str, bytes)

  2. Numerics (int, float, long, complex, datetime.datetime, Decimal)

  3. Mappings (Dict, OrderedDict, Defaultdict)

  4. Set, Frozen set

  5. Other Iterables (List, Generator, Deque, Tuple, Custom Iterable)

  6. Custom Objects

Text Sequences aka Strings

basestring, str, bytes

py3 example py2 example
str “hello” unicode u”hello”
bytes b”hello” str “hello”

The best tool for multi-line strings diffing in Python is the difflib:

import difflib
difflib.unified_diff(t1.splitlines(), t2.splitlines(), lineterm='')

Example:

>>> import difflib
>>> t1="""
... Hello World!
... """
>>> t2="""
... Hello World!
... It is Ice-cream time.
... """
>>> mygenerator = difflib.unified_diff(t1.splitlines(), t2.splitlines(), lineterm='')
>>> print '\n'.join(list(mygenerator))
---
+++
@@ -1,2 +1,3 @@

 Hello World!
+It is Ice-cream time.

Numerics

Py3 int, float, complex, Decimal
Py2 int, float, long, complex, Decimal

Set, Frozenset

A set object is an unordered collection of distinct hashable objects. Frozenset is immutable set.

>>> t1 = {1,2,3}
>>> t2 = {3,4,5}
>>> items_added = t2 - t1
>>> items_removed = t1 - t2
>>> items_added
set([4, 5])
>>> items_removed
set([1, 2])

Mappings

Dict, OrderedDict, Defaultdict

When dealing with dictionaries, first we need to find the mutual keys, keys added and removed:

t1_keys= set(t1.keys())
t2_keys= set(t2.keys())

same_keys = t2_keys.intersection(t1_keys)

added = t2_keys - same_keys
removed = t1_keys - same_keys

Then we want to diff the values of common keys between t1 and t2. We will do this recursively.

Diff other Sequences

List, Tuple and other iterables

What is an iterable? any object that has __iter__ so you can iterate over it. For example a list, tuple or a generator.

Diff sequence when order is important

>>> from itertools import zip_longest
>>> 
>>> t1=[1,2,3]
>>> t2=[1,2,5,6]
>>> 
>>> class ListItemRemovedOrAdded(object):
...     pass
... 
>>> items_removed = []
>>> items_added = []
>>> items_modified = []
>>> 
>>> for (x, y) in zip_longest(t1, t2, fillvalue=ListItemRemovedOrAdded):
...     if y is ListItemRemovedOrAdded:
...         items_removed.append(x)
...     elif x is ListItemRemovedOrAdded:
...         items_added.append(y)
...     else:
...         items_modified.append("{} changed to {}".format(x, y))
... 
>>> 
>>> items_removed
[]
>>> items_added
[6]
>>> items_modified
['1 changed to 1', '2 changed to 2', '3 changed to 5']

Diff sequence when we don not care about the order

Convert the sequence to set and then diff it!

>>> t1=[1,2]
>>> t2=[1,3,4]
>>> t1set=set(t1)
>>> t2set=set(t2)
>>> t1set-t2set
{2}
>>> t2set-t1set
{3, 4}

But…

>>> t1=[1, 2, {3:3}]
>>> t2=[1]
>>> t1set = set(t1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

Let’s read the definition of set again:

A set object is an unordered collection of distinct hashable objects.

If your iterable contains unhashable item, we can’t convert it to a set. What is the difference of Unhashable and Mutable?

Let’s review this again:

Mutable vs. Immutable

Mutable objects can change their value but keep their id().

An object with a fixed value. Immutable objects include numbers, strings and tuples. Such an object cannot be altered. A new object has to be created if a different value has to be stored. They play an important role in places where a constant hash value is needed, for example as a key in a dictionary.

Hashable

https://docs.python.org/3/glossary.html#term-hashable

An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() method).

All of Python’s immutable built-in objects are hashable, while no mutable containers (such as lists or dictionaries) are. Objects which are instances of user-defined classes are hashable by default; they all compare unequal (except with themselves), and their hash value is derived from their id().

Unhashable vs. Mutable

A lot of people confuse these 2 concepts with each other.

Example of Hashable object that is Mutable:

>>> class A:
...     aa=1
...
>>> hash(A)
2857987
>>> A.aa=2
>>> hash(A)
2857987

Objects which are instances of user-defined classes are hashable by default; they all compare unequal (except with themselves), and their hash value is derived from their id().

Diff sequence containing unhashable ignoring the order

So how do we diff iterable that contains unhashable? And we don’t care about the order or items?

Approach 1: Sort Iterable and then compare objects one by one

The first approach I took was to convert iterable to list, then sort it and then compare objects one by one. The irony here is we don’t care about the order of items in the original iterables but in order to compare them side by side we order them:

Python 2:

>>> t1=[{1:1}, {3:3}, {4:4}]
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t1.sort()
>>> t1
[{1: 1}, {3: 3}, {4: 4}]
>>> t2.sort()
>>> t2
[{1: 1}, {3: 3}, {4: 4}]

Ok great, now we can iterate through both lists and compare each item in t1 to the corresponding one in t2:

>>> [(a, b) for a, b in zip(t1,t2) if a != b]
[]

Python 3:

>>> t1=[{1:1}, {3:3}, {4:4}]
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t1.sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: dict() < dict()

Ok, so this is weird. It worked in Python 2 but NOT python 3. Obviously the sorting algorithm has changed from Py2 to 3.

What is the solution?

Sort key

https://wiki.python.org/moin/HowTo/Sorting

There is a key parameter to specify a function to be called on each list element prior to making comparisons. The output of key function is used for comparison and sorting. Think about it as a function that creates hooks that represent the object for the sort algorithm. Then the sort algorithm grabs these hooks and orders the hooks and thus sorts the actual object.

>>> student_tuples = [
        ('john', 'A', 15),
        ('jane', 'B', 12),
        ('dave', 'B', 10),
]
>>> sorted(student_tuples, key=lambda student: student[2])   # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

So what can we use for sorting unhashables?

The idea that came to me was to use the hash of objects as the key parameter.

But the problem was dictionaries were unhashable so how could I create a hash for them? Hmm, what about serializing the dictionary and then use the hash of the serialized object for sorting. The best thing about this idea was that if 2 dictionaries are the same, their serialized version should be the same and thus their hash should be the same:

>>> import json
>>> t1=[{1:1}, {3:3}, {4:4}]
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t1.sort(key=lambda x: hash(json.dumps(x)))
>>> t2.sort(key=lambda x: hash(json.dumps(x)))
>>> t1
[{1: 1}, {3: 3}, {4: 4}]
>>> t2
[{1: 1}, {3: 3}, {4: 4}]
>>> [(a, b) for a, b in zip(t1,t2) if a != b]
[]

yay!

Diffing iterables with different lengths

>>> from itertools import zip_longest
>>> import json
>>>
>>> t1=[10, {1:1}, {3:3}, {4:4}]
>>> t1.sort(key=lambda x: hash(json.dumps(x)))
>>> t1
[{1: 1}, {3: 3}, {4: 4}, 10]
>>>
>>> t2=[{3:3}, {1:1}, {4:4}]
>>> t2.sort(key=lambda x: hash(json.dumps(x)))
>>> t2
[{1: 1}, {3: 3}, {4: 4}]
>>>
>>>
>>> class ListItemRemovedOrAdded(object):
...     pass
...
>>> items_removed = []
>>> items_added = []
>>>
>>> for (x, y) in zip_longest(t1, t2, fillvalue=ListItemRemovedOrAdded):
...     if y is ListItemRemovedOrAdded:
...         items_removed.append(x)
...     elif x is ListItemRemovedOrAdded:
...         items_added.append(y)
...     # else:
...     #     keep_diffing(x, y)
...
>>> items_removed
[10]
>>> items_added
[]

Now what about adding another item to t1:

>>> t1=[10, "a", {1:1}, {3:3}, {4:4}]
>>> t1.sort(key=lambda x: hash(json.dumps(x)))
>>> t1
['a', {1: 1}, {3: 3}, {4: 4}, 10]
>>> t2
[{1: 1}, {3: 3}, {4: 4}]

Ooops, the items with the same hash do not align with each other anymore!

If I run our diff now, we get the wrong result:

>>> items_removed = []
>>> items_added = []
>>> items_modified = []
>>>
>>> for (x, y) in zip_longest(t1, t2, fillvalue=ListItemRemovedOrAdded):
...     if y is ListItemRemovedOrAdded:
...         items_removed.append(x)
...     elif x is ListItemRemovedOrAdded:
...         items_added.append(y)
...     else:
...         items_modified.append("{} changed to {}".format(x, y))
...
>>> items_removed
[{4: 4}, 10]
>>> items_added
[]
>>> items_modified
['a changed to {1: 1}', '{1: 1} changed to {3: 3}', '{3: 3} changed to {4: 4}']

So ordering items based on their hashes was a wrong idea to begin with. But at least we learnt about the key parameter for sorting.

Approach 2: Put items in a dictionary of {item_hash: item}

>>> import json
>>>
>>> t1 = [10, "a", {1:1}, {3:3}, {4:4}]
>>> t2 = [{3:3}, {1:1}, {4:4}, "b"]
>>>
>>>
>>> def create_hashtable(t):
...     '''
...     Creates hashtable of {item_hash: item}
...     '''
...     hashes = {}
...     for item in t:
...         try:
...             item_hash = hash(item)
...         except TypeError:
...             try:
...                 item_hash = hash(json.dumps(item))
...             except:
...                 print ("Warning: Can not produce a hash for %s item" % (item))
...             else:
...                 hashes[item_hash] = item
...         else:
...             hashes[item_hash] = item
...     return hashes
...
>>> t1_hashtable = create_hashtable(t1)
>>> t2_hashtable = create_hashtable(t2)
>>>
>>> items_added = [t2_hashtable[i] for i in t2_hashtable if i not in t1_hashtable]
>>> items_removed = [t1_hashtable[i] for i in t1_hashtable if i not in t2_hashtable]
>>>
>>> items_added
['b']
>>> items_removed
['a', 10]

This approach worked!

Now what if the object is not serializable? Ok in that case we just have to iterate over every object in t1 to see if it exists in t2. And do the same thing for t2 objects comparing to t1. Yes, you got it right as simple as that.

>>> t1 = [1, {1:1}]
>>> t2 = [2, {1:1}]
>>> id(t2)
4566093704
>>> id(t2[1])
4566093512
>>> id(t1[1])
4564810504
>>> t1[1] in t2
True

When it is not Json serializable

  • What if the object is not json serializable?
  • What if json serializable version of 2 different objects are the same?

For example Decimals are not Json serializable:

>>> from decimal import Decimal
>>> num = Decimal(10.0)
>>> num
Decimal('10')
>>> from json import dumps
>>> dumps(num)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramsB\Python27\lib\json\__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "C:\ProgramsB\Python27\lib\json\encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\ProgramsB\Python27\lib\json\encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "C:\ProgramsB\Python27\lib\json\encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: Decimal('10') is not JSON serializable
>>>

And if we convert Decimal(10) to integer 10, then the serialization of the two will be the same. Thus the hash will be the same. And when we diff it will think these 2 objects are the same.

Here is how another serialization method comes in:

Pickle

Pickle can serialze most Python objects. It is rare that you bump into objects that can’t be pickled. However the important thing for us to create serialization of content in a way that the serialization does not change if object’s contents are the same.

Let’s see how Pickle’s serialization result looks like for 2 objects that have the same content:

>>> from pickle import dumps
>>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
>>> dumps(t)
"((dp0\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\
nI10\nsS'Hello World'\np1\n(I1\nI2\nI3\nI4\nI5\
ntp2\n(lp3\nI1\naI2\naI3\naI4\naI5\natp4\n."
>>> dumps(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
           'Hello World', (1, 2, 3, 4, 5),
            [1, 2, 3, 4, 5]))
"((dp0\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\
nI10\nsS'Hello World'\np1\n(I1\nI2\nI3\nI4\nI5
\ntp2\n(lp3\nI1\naI2\naI3\naI4\naI5\natp4\n."

The serilzation of both objects look the same. Yes!

Cpickle

What about cPIckle? It is faster than Pickle! It is written in C. And supposed to be 1000 times faster than Pickle.

>>> from cPickle import dumps
>>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
>>> dumps(t)
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\n
I10\nsS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\n
tp3\n(lp4\nI1\naI2\naI3\naI4\naI5\nat."
>>> dumps(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10},
           'Hello World', (1, 2, 3, 4, 5),
            [1, 2, 3, 4, 5]))
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\n
I10\nsS'Hello World'\n(I1\nI2\nI3\nI4\nI5\nt(lp2
\nI1\naI2\naI3\naI4\naI5\natp3\n."

Uh oh! The result serializations of the same objects are not the same using cPickle!.

Why? cPickle tries to optimize the serialization by including if the object is referenced once or more than once! For a discussion around this look at http://stackoverflow.com/q/7501577/1497443

Why cPickle works for our case

I ended up using cPickle to create the serilzation to generate the {item_hash: item} hash table in DeepDiff. The reason it worked for DeepDiff was that we were creating references to the objects inside the libaray while diffing. So the cPickle of objects with equal content was always the same during the program run. Again cPickle ONLY produces different results for the same content if there is only one reference to the object vs. more than one reference.

What did we learn from diffing iterables

  • Difference of unhashable and mutable
  • Sets can only contain hashable
  • Create hash for dictionary
  • Custom sorting with a key function
  • Converting a squence into hashtable

Diff Custom objects

Let’s use __dict__ attribute to get the instance’s attributes. By default, instances of classes have a dictionary for attribute storage called __dict__:

>>> class CL:
...     attr1 = 0
...     def __init__(self, thing):
...         self.thing = thing

>>> obj1 = CL(1)
>>> obj2 = CL(2)
>>> obj2.attr1 = 10
>>> obj1.__dict__
{'thing': 1}  # Notice that att1 is not here
>>> obj2.__dict__
{'attr1': 10, 'thing': 2}

And we deal with it as if we were diffing 2 dictionaries.

__slots__ aka When there is no __dict__

If the class is using __slots__ to have more control over the object attributes, there won’t be a __dict__:

Example:

>>> class ClassA(object):
...     __slots__ = ['x', 'y']
...     def __init__(self, x, y):
...         self.x = x
...         self.y = y
...
>>> t1 = ClassA(1, 1)
>>> t2 = ClassA(1, 2)
>>>
>>> t1.d = 10
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ClassA' object has no attribute 'd'

As you can see with __slots__ we can’t dynamically add attributes. You can think of __slots__ as a bridge towards static typing in python.

And unlike __dict__, __slots__ is NOT a dictionary:

>>> t1.__slots__
['x', 'y']

So in order to create a dictionary out of the attributes:

>>> t1 = {i: getattr(t1, i) for i in t1.__slots__}
>>> t2 = {i: getattr(t2, i) for i in t2.__slots__}
>>> t1
{'x': 1, 'y': 1}
>>> t2
{'x': 1, 'y': 2}

As far as diffing goes, the rest is just like comparing 2 dictionaries.

loops

Did you know that Python objects can have attributes pointing to themselves!

>>> class LoopTest(object):
...     def __init__(self, a):
...         self.loop = self
...         self.a = a
...
>>> t1 = LoopTest(1)
>>> t2 = LoopTest(2)
>>> t1
<__main__.LoopTest object at 0x02B9A910>
>>> t1.__dict__
{'a': 1, 'loop': <__main__.LoopTest object at 0x02B9A910>}

Let’s step back and think about it for a sec:

For custom objects, we are creating a dictionary out of instance attributes. Then we are creating 3 lists: attributes added, attributes removed and common attributes (common keys of the dictionaries). Then we are recursively diffing the common attributes values. So once we hit the same object again, we get stuck in a never ending loop!

Detect loops with ID

As you know every object in python has a unique ID. And if 2 objects have the same ID, they are the same object.

Let’s say A is parent of B and B is parent of C and C is parent of A:

A --> B --> C --> A

| object | A | B | C | A | |–|–|–|–|–| | id | 1 | 2 | 3 | 1 |

If we keep the ID of every object and every parent of the object, once we start processing an object, we can check to see if its ID exists in the list of IDs of its parents. And in that case, we get out of the loop!

Example:

| object | ID List | |–|–| | A | 1 | | B | 1, 2 | | C | 1, 2, 3 | | A | 1, 2, 3, 1 (Loop detected, get me out of here!) |

Since we will be creating this list of IDs for every object in the nested structure that we process recursively, we have to make sure the ID list is immutable.

def diff_common_children_of_dictionary(t1, t2, t_keys_intersect, parents_ids):
    ''' difference between common attributes of objects or values of common keys of dictionaries '''
    for item_key in t_keys_intersect:

        t1_child = t1[item_key]
        t2_child = t2[item_key]

        item_id = id(t1_child)

        if parents_ids and item_id in parents_ids:
            print ("Warning, a loop is detected.")
            continue

        parents_added = set(parents_ids)
        parents_added.add(item_id)
        parents_added = frozenset(parents_added)

        diff(t1_child, t2_child, parents_ids=parents_added)

Why it has to be immutable? because otherwise the same parents_ids object is gonna be passed around the recursion (passed by reference) instead of a new parents_ids object per object.

What did we learn about diffing custom objects

  • We try to convert every object into a dictionary. Either using its __dict__ or __slots__ when diffing.
  • Once objects are dictionaries, we will process it recursively to diff all their common attributes.
  • Python objects can have attributes pointing to themselves or any of their parents!
  • In order to detect loops in objects, we can use an immutable container of object IDs to know all the ancestors of an object.

Conclusion

Diffing python objects can give you a good insight into Python data types. I find diffing very useful in debugging and testing. For example I have made all my assertEqual calls to diff the test result object and the expected result if the assertion fails. This helps a lot to understand how exactly your expected result and test result are different.

To make things easier, I have made a package called DeepDiff that you can easily install from Pypi and use:

pip install DeepDiff

https://github.com/seperman/deepdiff

DeepDiff makes it super easy to diff complex objects:

Examples:

List difference

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2, 3]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'iterable_item_added': ["root[4]['b']: [3]"],
  'values_changed': ["root[4]['b'][1]: 2 ===> 3", "root[4]['b'][2]: 3 ===> 2"]}

List that contains dictionary:

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:1, 2:2}]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:3}]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'dic_item_removed': ["root[4]['b'][2][2]"],
  'values_changed': ["root[4]['b'][2][1]: 1 ===> 3"]}

Named Tuples:

>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> t1 = Point(x=11, y=22)
>>> t2 = Point(x=11, y=23)
>>> print (DeepDiff(t1, t2))
{'values_changed': ['root.y: 22 ===> 23']}

Ignoring order:

>>> t1 = [{"a": 2}, {"b": [3, 4, {1: 1}]}]
>>> t2 = [{"b": [3, 4, {1: 1}]}, {"a": 2}]
ddiff = DeepDiff(t1, t2, ignore_order=True)
>>>
>>> print(DeepDiff(t1, t2))
{}

And more: https://github.com/seperman/deepdiff

Seperman

Sep Dehpour is a principal engineer at ChowNow.

Los Angeles, California, USA http://zepworks.com