Serializing empty collections can be surprisingly wasteful! This post describes a technique which speeds things up and reduces the data size. The principle behind it is simple: avoid serializing empty collections. If your objects contain many empty collections this can make a big difference, as you’ll see. For our simple example class we’ll reduce the data size by more than 80% and need more than 60% less time to serialize and deserialize.

Serializing empty collections takes space too

The serialization algorithm used by the JDK is very sophisticated. During serialization it ensures that each object is encoded in the output only once. This minimizes the size of the output. But what about different objects with the same value? Or, to be more precise, what happens when our objects contain a lot of empty List, Set, Queue and Map instances? Although these are all different objects, as in o1 != o2, they are all equal, as in o1.equals(o2).
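To make the difference between "same object" and "equal object" concrete, here is a minimal sketch. The class SharedVsDistinctDemo and its helper serializedSize() are hypothetical names introduced only for this illustration. It writes the same empty list twice, then two distinct but equal empty lists, and compares the number of bytes produced.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original EmptyData example.
final class SharedVsDistinctDemo {

    // Serializes all given objects into one stream and returns the byte count.
    static int serializedSize(Object... objects) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (Object o : objects) {
                out.writeObject(o);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>();
        List<String> b = new ArrayList<>();

        // a and b are different objects that happen to be equal.
        System.out.println("a != b: " + (a != b) + ", a.equals(b): " + a.equals(b));

        // The second write of the same instance is encoded as a back reference.
        System.out.println("same instance twice: " + serializedSize(a, a) + " bytes");

        // The second, distinct instance is encoded again.
        System.out.println("two equal instances: " + serializedSize(a, b) + " bytes");
    }
}
```

Running it should show that the stream with two distinct instances is larger, even though both lists are empty and equal.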

In this post we use the term Collection instances to mean List, Queue and Set as well as Map implementations. We are well aware that the Map interface does not extend the Collection interface. Collection here should be read as the “Collection API”.

What happens when we serialize all those empty Collection instances? Well, each instance will be written to the output, of course. The serialization of an empty collection object still produces some data in the output. Each of the Collection classes writes at least its size to the output during serialization. Add to that the header that ObjectOutputStream uses and you can see how you end up with quite a few bytes to encode an empty data structure. And if you have plenty of empty collections this may amount to a lot of data to encode, well… nothing. But it turns out that with a bit of effort we can squeeze more out of the serialization process!
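A quick way to see this for yourself is to measure it. The sketch below is an illustration only; EmptyVsNullDemo and serializedSize() are hypothetical names. It serializes a single value into a fresh stream and reports the byte count, so every number includes the fixed stream header.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;

// Hypothetical demo class, not part of the original EmptyData example.
final class EmptyVsNullDemo {

    // Serializes one value into a fresh stream and returns the byte count.
    static int serializedSize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        // Each size includes the stream header written by ObjectOutputStream.
        System.out.println("null:            " + serializedSize(null) + " bytes");
        System.out.println("empty ArrayList: " + serializedSize(new ArrayList<String>()) + " bytes");
        System.out.println("empty HashMap:   " + serializedSize(new HashMap<String, String>()) + " bytes");
    }
}
```

Note that the first write of a class also carries its class descriptor, so these single-object numbers overstate the per-instance cost in a large stream. The point is only that an empty collection is never free, while null is close to it.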

The technique

The serialization algorithm of the JDK is a very generic implementation. It treats Collection instances as it would any other object. This is where this technique can help. Instead of repeatedly encoding empty collection objects in the output we’ll replace them with something smaller during serialization. Of course we need to handle this alternative encoding during deserialization.

The EmptyData class below shows how this can be achieved for a simple Serializable class. We implement the writeObject() and readObject() methods but with a twist. For this technique to work we need to do two things:

  1. replace empty instances in the serialized output; this will be done in writeObject()
  2. create empty instances when deserializing the input; this will happen in readObject()

writeObject(empty) vs writeObject(null)

Look at the writeObject() method of EmptyData. It is pretty straightforward, but instead of writing each collection as-is we added logic to write null to the output whenever an empty instance is encountered. This is where we achieve both a performance and a size gain. There are two reasons for that:

  1. writeObject(null) is an edge case which is handled very early on during serialization making it faster
  2. writeObject(null) is handled by calling the special writeNull() method, which writes just a single byte to the output, taking very little space

 

private void writeObject(ObjectOutputStream out) throws IOException {
 out.writeObject(m.isEmpty() ? null : m);
 out.writeObject(l.isEmpty() ? null : l);
}

Using writeObject(null) also improves performance because it reduces the number of objects tracked by the ObjectOutputStream. This tracking is done using a HashMap-like data structure. writeNull() does not add entries to this data structure, making it a fast operation.

readObject(): null means creating a new instance

For the technique to work we need to perform the inverse step during deserialization. We need to replace null values with new Collection instances. This is done within the readObject() method.

The readObject() method of the EmptyData class first reads an Object from the ObjectInputStream. If the reference is null we create a new instance. If not, we cast the reference to the correct type before assigning it to the appropriate field.

private void readObject(java.io.ObjectInputStream in) throws IOException, ClassNotFoundException {
 Object o = in.readObject();
 m = (o != null) ? ((Map) o) : new HashMap();

 o = in.readObject();
 l = (o != null) ? ((List) o) : new ArrayList();
}
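To check that the two halves fit together, a round trip helps. The sketch below is a variation on the idea, not the EmptyData class itself. RoundTripDemo, Holder and roundTrip() are hypothetical names, and marking the fields transient and calling defaultWriteObject() / defaultReadObject() is our own choice for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of the original EmptyData example.
final class RoundTripDemo {

    // Stand-in for EmptyData; fields are transient because we write them by hand.
    static final class Holder implements Serializable {
        private static final long serialVersionUID = 1L;

        transient Map<String, String> m = new HashMap<>();
        transient List<String> l = new ArrayList<>();

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            // Empty collections are replaced by null in the output.
            out.writeObject(m.isEmpty() ? null : m);
            out.writeObject(l.isEmpty() ? null : l);
        }

        @SuppressWarnings("unchecked")
        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            // null in the input means "this collection was empty".
            Object o = in.readObject();
            m = (o != null) ? (Map<String, String>) o : new HashMap<String, String>();
            o = in.readObject();
            l = (o != null) ? (List<String>) o : new ArrayList<String>();
        }
    }

    // Serializes the holder to a byte array and reads it back.
    static Holder roundTrip(Holder h) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(h);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Holder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Holder h = new Holder();
        h.l.add("x"); // the list is non-empty, the map stays empty
        Holder copy = roundTrip(h);
        System.out.println("map restored as empty: " + copy.m.isEmpty());
        System.out.println("list restored: " + copy.l);
    }
}
```

The empty map comes back as a fresh, empty HashMap rather than null, while the non-empty list is written and read the usual way.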

 

The full EmptyData class

The entire EmptyData class can be found below. The implementation of the writeObject() and readObject() methods incorporates the technique described above. In order to demonstrate the benefit we use a public static final boolean field which allows us to enable or disable the optimization with little effort.

import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EmptyData implements Serializable {
    public static final boolean SMALL_AND_FAST = false;
    private static final int SZ = 1000 * 1000;

    Map m = new HashMap();
    List l = new ArrayList();

    public EmptyData() {
    }

    private void readObject(java.io.ObjectInputStream in) throws IOException, ClassNotFoundException {
        Object o = in.readObject();
        m = (o != null) ? ((Map) o) : new HashMap();

        o = in.readObject();
        l = (o != null) ? ((List) o) : new ArrayList();
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        if (SMALL_AND_FAST) {
            out.writeObject(m.isEmpty() ? null : m);
            out.writeObject(l.isEmpty() ? null : l);
        } else {
            out.writeObject(m);
            out.writeObject(l);
        }
    }
}

Benchmark

Time for some benchmarking. We use a simple benchmark to compare both the size and speed benefits of this technique. No JMH benchmark this time. Instead we serialize 1 million EmptyData instances into a ByteArrayOutputStream and measure its size. Then we deserialize the data and measure how long it took from start to finish. We repeat this 20 times.
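For readers who want to try something similar, here is a rough sketch of such a measurement. It is not the harness used for the numbers in this post. EmptyCollectionsBenchmark, Item, the runtime skipEmpty switch and the smaller COUNT are all assumptions made for this illustration, and the fields are transient by our own choice. Expect different absolute numbers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical benchmark sketch, not the harness used in the post.
final class EmptyCollectionsBenchmark {
    static final int COUNT = 100_000; // the post uses 1 million instances
    static boolean skipEmpty;         // runtime switch, assumed for this sketch

    static final class Item implements Serializable {
        private static final long serialVersionUID = 1L;
        transient Map<String, String> m = new HashMap<>();
        transient List<String> l = new ArrayList<>();

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeObject((skipEmpty && m.isEmpty()) ? null : m);
            out.writeObject((skipEmpty && l.isEmpty()) ? null : l);
        }

        @SuppressWarnings("unchecked")
        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            Object o = in.readObject();
            m = (o != null) ? (Map<String, String>) o : new HashMap<String, String>();
            o = in.readObject();
            l = (o != null) ? (List<String>) o : new ArrayList<String>();
        }
    }

    // Writes `count` fresh Item instances into one in-memory stream.
    static byte[] serialize(int count, boolean optimize) {
        skipEmpty = optimize;
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(count);
            for (int i = 0; i < count; i++) {
                out.writeObject(new Item());
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Reads all instances back and returns how many were read.
    static int deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                Item item = (Item) in.readObject();
                if (item.m == null || item.l == null) {
                    throw new IllegalStateException("collections must never be null");
                }
            }
            return count;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (boolean optimize : new boolean[] {false, true}) {
            long start = System.nanoTime();
            byte[] data = serialize(COUNT, optimize);
            deserialize(data);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("optimize=" + optimize + ": " + data.length + " bytes, " + millis + " ms");
        }
    }
}
```

A single timed pass like this is sensitive to JIT warm-up and garbage collection, so repeat the loop several times and look at the later runs before drawing conclusions.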

Size: 81% smaller

When serializing 1 million EmptyData instances, each containing an empty ArrayList and an empty HashMap, we end up with nearly 50 megabytes (49000170 bytes) of data in the output. Using the simple replacement technique we can represent the same state with only 9 megabytes!

[Figure: serialized size is significantly reduced by not serializing empty collections]

This difference in serialized data size is due to the special writeNull() used by writeObject() internally. Instead of writing the collection size to the serialized output it only writes a single byte. The result: 40 megabytes less data is written to the output.

[Figure: writeNull() handles the special case]

Speed: 64% faster

How about the serialization speed then? We compare the time needed to serialize and deserialize 1 million EmptyData instances. Using the standard scheme this takes 49.1 seconds. But when we replace empty collections with the more compact null value and back, the time drops to 17.7 seconds!

The speedup is due to:

  • the fact that we write much less data to the output stream (see above)
  • writeObject() involves inspecting the fields of the object being serialized while writeNull() ends the object graph traversal

[Figure: improved serialization speed by not serializing empty collections]

 

Use with care

The technique shown here cannot always be used! Before you rush off and start using this all over the place, there is an important condition you should be aware of. In our EmptyData class we made sure that the List and Map references never get shared with other objects. We need to be sure that there is only a single reference pointing to the Collection instances which we’ll be replacing.

In the case of the EmptyData class we initialize the l and m fields during construction. The references should never be shared with other objects. As soon as a collection is shared between objects, or there are cycles in the object graph, you should not use the technique shown here! Why? Because the replacement scheme will result in a different object graph after deserialization: every reference that pointed to the same empty collection gets its own new instance. This is a very dangerous and very hard to find source of data corruption.
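The following sketch shows how this goes wrong. It is an illustration only. SharedReferenceHazard, Node and stillSharedAfterRoundTrip() are hypothetical names, and the transient field is our own choice. Two nodes share one list; we then check whether they still share it after a round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original EmptyData example.
final class SharedReferenceHazard {

    static final class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        transient List<String> items;

        Node(List<String> items) {
            this.items = items;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeObject(items.isEmpty() ? null : items);
        }

        @SuppressWarnings("unchecked")
        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            Object o = in.readObject();
            items = (o != null) ? (List<String>) o : new ArrayList<String>();
        }
    }

    // Serializes the array of nodes to memory and reads it back.
    static Node[] roundTrip(Node[] nodes) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(nodes);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Node[]) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if two nodes sharing `shared` still share one list afterwards.
    static boolean stillSharedAfterRoundTrip(List<String> shared) {
        Node[] copy = roundTrip(new Node[] {new Node(shared), new Node(shared)});
        return copy[0].items == copy[1].items;
    }

    public static void main(String[] args) {
        List<String> nonEmpty = new ArrayList<>();
        nonEmpty.add("x");
        System.out.println("non-empty list still shared: " + stillSharedAfterRoundTrip(nonEmpty));
        System.out.println("empty list still shared:     "
                + stillSharedAfterRoundTrip(new ArrayList<String>()));
    }
}
```

With a non-empty shared list the stream's back references preserve the sharing. With an empty shared list both nodes write null and each gets its own new list on the way back, so an element added through one node is no longer visible through the other.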

Conclusion

Whenever your application serializes objects which contain many empty collection instances, consider using this technique. With a little effort you optimize serialization both in terms of size and speed. By serializing null instead of empty instances we use this special value to our advantage. It requires only 1 byte in the output and thus minimizes the size of the output. Since null is handled early on during serialization it is also faster.

No time to optimize by hand?

Then try the automated optimization performed by Externalizer4J for free. It is easy to set up and use.

 

Resources