HAppS applications, like any application with persistent data storage, are faced with the issue of migrating existing data when the format of the persistent data is changed. This tutorial will explore the binary serialization and migration facilities provided by HAppS-Data. If you think I am doing it all wrong, please let me know. Writing this tutorial is the extent of my experience using the HApps-Data migration facilities.
Requirements
This tutorial only uses the HAppS-Data (and dependencies) portion of HAppS. It has been tested with HAppS-Data 0.9.3. The first three lines of the module look like this:
> {-# LANGUAGE TemplateHaskell, UndecidableInstances, FlexibleInstances, GeneralizedNewtypeDeriving, MultiParamTypeClasses, DeriveDataTypeable, TypeFamilies #-}
> module Main where
> import HAppS.Data
Serialization
The most obvious way to serialize data in Haskell is to use the familiar Read
and Show
classes. Simply use show
to turn a value into a String
, and read
to turn a String
back into a value. This method has three serious flaws however:
- The law
read . show == id
does not hold for all Show/Read instances.
- The serialized representation is rather verbose
- No migration path when types change, leaving your old data inaccessible
HAppS-Data provides a Serialize
class which addresses these three issues. From an end user point of view the Serialize
functionality provides three items of interest:
- The
Serialize
class
- the
serialize
anddeserialize
functions
- The
deriveSerialize
function
> class (Typeable a, Version a) => Serialize a where
> ...
>
> serialize :: Serialize a => a -> Lazy.ByteString
> deserialize :: Serialize a => Lazy.ByteString -> (a, Lazy.ByteString)
>
> deriveSerialize :: Language.Haskell.TH.Syntax.Name
> -> Language.Haskell.TH.Syntax.Q [Language.Haskell.TH.Syntax.Dec]
The Version
superclass is used during data migration. The serialize
and deserialize
functions are the counterparts to show
and read
. deriveSerialize
is a Template Haskell function which provides functionality similar to deriving (Read, Show)
.
The Version
class
The Version
class is very straight-forward. It consists of a single function which returns the Mode
(aka, the version) of a datatype.
>
> class Version a where
> mode :: Mode a
> mode = Versioned 0 Nothing
>
> data Mode a = Primitive -- ^ Data layout won't change. Used for types like Int and Char.
> | Versioned (VersionId a) (Maybe (Previous a))
>
> newtype VersionId a = VersionId {unVersion :: Int} deriving (Num,Read,Show,Eq)
There are two categories of datatypes:
- primitives whose layout will never change, and, hence, will never need to be migrated
- everything else
The Versioned
constructor takes two arguments. The first argument is a version number which you increment when you make an change to the data-type. The second argument is an indicator of the previous version of the data-type. The exact details are covered in the next section.
Putting it all together
Let's say we have the following types:
>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data Foo
> = Bar String
> | Baz Beep
>
> data Beep
> = Beep
> |])
The deriveAll
template haskell function is similar to the normal haskell deriving clause, except it also has the ability to derive Default
instances. Additionally, it always derives Typeable
and Data
instances even though they are not explicitly listed.
To make the types serializeable we first need to create Version
instances.
> instance Version Beep where
> mode = Versioned 0 Nothing
>
> instance Version Foo where
> mode = Versioned 0 Nothing
We want to indicate that Beep
and Foo
are non-primative types, so we use the Versioned
constructor. Next we specify a version number for the type. It could be anything, but 0
is the most sensible choice. Since there are now previous versions of these types we mark the previous type as Nothing
.
For all non-primitive types the initial version of Versioned 0 Nothing
is sensible. So the Version
class provides it as a default value for mode
:
> class Version a where
> mode :: Mode a
> mode = Versioned 0 Nothing
Hence, we could shorten our Version
instances from above to:
> instance Version Beep
> instance Version Foo
Next we derive Serialize
instances for our types:
> $(deriveSerialize ''Beep)
> $(deriveSerialize ''Foo)
Now we can use serialize
to serialize values. Let's look at the output of serialize Beep
*Main> Data.ByteString.Lazy.unpack $ serialize Beep
[0,0,0,0,0,0,0,0,0]
*Main>
We see that Beep
serializes to 9 bytes. The first 8 bytes represent the VersionId
. VersionId
is basically an Int
, and the serialization code always treats Int
s as a 64-bit values to avoid cross-platform issues. The final byte indicates which constructor of Beep
was used. In this case the zeroth constructor was used.
At first it may seem like we don't have enough information here to deserialize the data, after all there are no type names, constructors, etc. But deserializing these bytes is no different than doing read "1" :: Int
. Because we know the type of the value we want to be reading at compile time, we do not need to record that information in the stored data. We just do:
*Main> deserialize (serialize Beep) :: (Beep,ByteString)
(Beep,Empty)
*Main>
As a side note, String
s are serialized to a very compact representation. In fact, they are stored as compactly as ByteString
s because they are first converted to a ByteString
.
*Main> Data.ByteString.Lazy.unpack $ serialize "hello"
[0,0,0,0,0,0,0,5,104,101,108,108,111]
*Main> Data.ByteString.Lazy.unpack $ serialize (Data.ByteString.Lazy.Char8.pack "hello")
[0,0,0,0,0,0,0,5,104,101,108,108,111]
*Main>
The first 8 bytes are the length of the String
, and the remaining bytes are the utf-8 encoded characters of the String
.
So, if you application is best served by using String
s instead of ByteString
s, you do not have to take an extra steps to ensure that the serialized data is compactly represented.
Simple Migration
Let's say we want to add another constructor to the Beep
type. As a first pass, we will actually create a whole new type named Beep'
, which is similar to the old type, but has an additional constructor BeepBeep
.
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data Beep' = BeepBeep' | Beep'
> |])
>
> $(deriveSerialize ''Beep')
Because we are extending a previous type, our Version
instance will look a bit different:
> instance Version Beep' where
> mode = extension 1 (Proxy :: Proxy Beep)
This indicates that we are extending the old type Beep
. The new version number must be higher than the old version, but does not have to be strictly sequential.
Because we specified that this type is a newer version of an older type, we also need to tell HAppS how to migrate the old data to the new type. To do this, we simply create an instance of the Migrate
class.
> class Migrate a b where
> migrate :: a -> b
The Migrate
class is quite simple, it contains a single function, migrate
which migrates something of type a
to type b
. In our current example, all we need is:
> instance Migrate Beep Beep' where
> migrate Beep = Beep'
We can demonstrate migration by serializing a value of type Beep
and deserializing it as type Beep'
. The migration happens automatically in the deserialize
function.
*Main> fst $ deserialize (serialize Beep) :: Beep'
Beep'
*Main>
When deserialize
tries to deserialize the data produced by serialize Beep
, it will first check the version number. When it sees that the version number in the stored data is lower than the version number of the current type it will instead try to decode it as the type you specified as the previous version. If the version associated with the previous type is still higher than the value in the serialized data, the migration code will recurse until it finds a matching version number. Once it finds a matching version number, it will call the corresponding deserialization "instance" to decode the old data. Then as the recursion unwinds, it will apply the migrate
function to migrate the data to newer and newer formats until it reaches the newest format.
Managing History
A big issue in the above example is that when we added the new constructor we also changed the name of the type and its existing constructors. That is not very convenient in a real application where you have a multitude of references to the old names.
Fortunately, we do not have to change the name of the type to add a new constructor. As we saw in the beginning, the name of the type and the names of the constructors are not actually stored in the serialized data. So, instead we can change the name of the old type from Beep
to OldBeep
and update its constructor as well.
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data OldBeep = OldBeep
> |])
>
> $(deriveSerialize ''OldBeep)
> instance Version OldBeep
Because OldBeep
and Beep
have the same shape, they will serialize to the same byte sequence:
*Main> Data.ByteString.Lazy.unpack $ serialize OldBeep
[0,0,0,0,0,0,0,0,0]
*Main> Data.ByteString.Lazy.unpack $ serialize Beep
[0,0,0,0,0,0,0,0,0]
*Main>
that means we can serialize an OldBeep
value and then deserialize it as a Beep
value, like this:
*Main> fst $ deserialize (serialize OldBeep) :: Beep
Beep
*Main>
Note that this is not the same as migration. Here we are just exploiting the fact that because the type name and constructor names are not encoded in the serialized data we can change those names and still be able to deserialize the data.
Full Migration Example #1
Here is the full example which shows:
Beep
renamed toOldBeep
- the new
Beep
with the extra constructor
- the migration code from
OldBeep
toBeep
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data OldBeep
> = OldBeep
> |])
>
> instance Version OldBeep
> $(deriveSerialize ''OldBeep)
>
>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data OldBeep
> = OldBeep
> data Beep = BeepBeep | Beep
> |])
>
> instance Version Beep where
> mode = extension 1 (Proxy :: Proxy OldBeep)
>
> $(deriveSerialize ''Beep)
Using separate files to manage type history
Keeping all the revisions of your type in one file, and changing the name of the type and its constructors every revision is tedious and hard to manage. Instead, we can use a system where we rename the files that contain our types. To start, we will put the types we want to serialize in a separate file (or files), such as Types.lhs.
> {-# LANGUAGE TemplateHaskell, UndecidableInstances, FlexibleInstances, GeneralizedNewtypeDeriving, MultiParamTypeClasses, DeriveDataTypeable, TypeFamilies #-}
> module Types
> ( Bar(..)
> , Foo(..)
> ) where
> import HAppS.Data
>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data Foo
> = Bar String
> | Baz Beep
>
> data Beep
> = Beep
> |])
>
> instance Version Beep
> $(deriveSerialize ''Beep)
> instance Version Foo
> $(deriveSerialize ''Foo)
Now let's say we want to add a constructor Bop
to the type Foo
. First we rename Types.lhs to Types_000.lhs and change the module name to reflect the changed file name:
> {-# LANGUAGE TemplateHaskell, UndecidableInstances, FlexibleInstances, GeneralizedNewtypeDeriving, MultiParamTypeClasses, DeriveDataTypeable, TypeFamilies #-}
> module Types_000
> ( Beep(..)
> , Foo(..)
> ) where
> import HAppS.Data
>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data Foo
> = Bar String
> | Baz Beep
>
> data Beep
> = Beep
> |])
>
> instance Version Beep
> $(deriveSerialize ''Beep)
> instance Version Foo
> $(deriveSerialize ''Foo)
Next we create a new Types.lhs:
> {-# LANGUAGE TemplateHaskell, UndecidableInstances, FlexibleInstances, GeneralizedNewtypeDeriving, MultiParamTypeClasses, DeriveDataTypeable, TypeFamilies #-}
> module Types
> ( Beep(..) -- ^ re-exported from Types_000
> , Foo(..) -- ^ extended here
> ) where
> import HAppS.Data
We import the old Types_000 qualified as T0
to avoid name clashes.
> import qualified Types_000 as T0
Since we are only modifying Foo, we can import and re-export the old version of Beep unmodified (also see the module export list above):
> import Types_000 (Beep)
Then we extend Foo
with the new constructor Bop Int
:
>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data Foo
> = Bar String
> | Baz Beep
> | Bop Int
> |])
Next we create a Version
instance which indicates that the previous version of Foo
is T0.Foo
.
>
> instance Version Foo where
> mode = extension 1 (Proxy :: Proxy T0.Foo)
>
> $(deriveSerialize ''Foo)
>
And, finally, we specify how to migrate the old data:
> instance Migrate T0.Foo Foo where
> migrate (T0.Bar str) = Bar str
> migrate (T0.Baz beep) = Baz beep
Note that Foo
in Types.lhs and Foo
in Types_000.lhs are different types, namely Types.Foo
and Types_000.Foo
. So this works for the same reason that renaming Beep
to OldBeep
works.
Serializing Datatypes from 3rd Party Libraries
It is also possible to serialize datatypes from 3rd party libraries, provided those types have Data
and Typeable
instances. There is a caveat with this however. If the third party library changes the type, then you will not be able to read your data. This is not a fatal flaw however. You can simply copy the old type definition into a local module, and then migrate the old data to the new format.
Suggested Policy
- Put the types you will serialize in one or more files which only contain types
- Deploy your web 2.718 killer app
- Before you do any more development, copy the current type files to sequential versions and create new current type files which re-export all the types. You can skip this step if the current type file only contains re-exports. i.e., if no type changes were made to that type file during the previous iteration.
- Make changes for current development cycle, and then go to step 2.