Sunday, December 14, 2008

Data Migration with HApps-Data

HAppS applications, like any application with persistent data storage, are faced with the issue of migrating existing data when the format of the persistent data is changed. This tutorial will explore the binary serialization and migration facilities provided by HAppS-Data. If you think I am doing it all wrong, please let me know. Writing this tutorial is the extent of my experience using the HApps-Data migration facilities.


Requirements


This tutorial only uses the HAppS-Data (and dependencies) portion of HAppS. It has been tested with HAppS-Data 0.9.3. The first three lines of the module look like this:


> {-# LANGUAGE TemplateHaskell, UndecidableInstances, FlexibleInstances, GeneralizedNewtypeDeriving, MultiParamTypeClasses, DeriveDataTypeable, TypeFamilies #-}
> module Main where
> import HAppS.Data

Serialization


The most obvious way to serialize data in Haskell is to use the familiar Read and Show classes. Simply use show to turn a value into a String, and read to turn a String back into a value. This method has three serious flaws however:



  1. The law read . show == id does not hold for all Show/Read instances.
  2. The serialized representation is rather verbose
  3. No migration path when types change, leaving your old data inaccessible

HAppS-Data provides a Serialize class which addresses these three issues. From an end user point of view the Serialize functionality provides three items of interest:

  1. The Serialize class
  2. the serialize and deserialize functions
  3. The deriveSerialize function

> class (Typeable a, Version a) => Serialize a where 
> ...
>
> serialize :: Serialize a => a -> Lazy.ByteString
> deserialize :: Serialize a => Lazy.ByteString -> (a, Lazy.ByteString)
>
> deriveSerialize :: Language.Haskell.TH.Syntax.Name
> -> Language.Haskell.TH.Syntax.Q [Language.Haskell.TH.Syntax.Dec]

The Version superclass is used during data migration. The serialize and deserialize functions are the counterparts to show and read. deriveSerialize is a Template Haskell function which provides functionality similar to deriving (Read, Show).


The Version class


The Version class is very straight-forward. It consists of a single function which returns the Mode (aka, the version) of a datatype.


>
> class Version a where
> mode :: Mode a
> mode = Versioned 0 Nothing
>
> data Mode a = Primitive -- ^ Data layout won't change. Used for types like Int and Char.
> | Versioned (VersionId a) (Maybe (Previous a))
>
> newtype VersionId a = VersionId {unVersion :: Int} deriving (Num,Read,Show,Eq)

There are two categories of datatypes:



  • primitives whose layout will never change, and, hence, will never need to be migrated
  • everything else

The Versioned constructor takes two arguments. The first argument is a version number which you increment when you make an change to the data-type. The second argument is an indicator of the previous version of the data-type. The exact details are covered in the next section.

Putting it all together


Let's say we have the following types:


>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data Foo
> = Bar String
> | Baz Beep
>
> data Beep
> = Beep
> |])

The deriveAll template haskell function is similar to the normal haskell deriving clause, except it also has the ability to derive Default instances. Additionally, it always derives Typeable and Data instances even though they are not explicitly listed.


To make the types serializeable we first need to create Version instances.


> instance Version Beep where
> mode = Versioned 0 Nothing
>
> instance Version Foo where
> mode = Versioned 0 Nothing

We want to indicate that Beep and Foo are non-primative types, so we use the Versioned constructor. Next we specify a version number for the type. It could be anything, but 0 is the most sensible choice. Since there are now previous versions of these types we mark the previous type as Nothing.

For all non-primitive types the initial version of Versioned 0 Nothing is sensible. So the Version class provides it as a default value for mode:


> class Version a where
> mode :: Mode a
> mode = Versioned 0 Nothing

Hence, we could shorten our Version instances from above to:


> instance Version Beep
> instance Version Foo

Next we derive Serialize instances for our types:


> $(deriveSerialize ''Beep)
> $(deriveSerialize ''Foo)

Now we can use serialize to serialize values. Let's look at the output of serialize Beep



*Main> Data.ByteString.Lazy.unpack $ serialize Beep
[0,0,0,0,0,0,0,0,0]
*Main>

We see that Beep serializes to 9 bytes. The first 8 bytes represent the VersionId. VersionId is basically an Int, and the serialization code always treats Ints as a 64-bit values to avoid cross-platform issues. The final byte indicates which constructor of Beep was used. In this case the zeroth constructor was used.


At first it may seem like we don't have enough information here to deserialize the data, after all there are no type names, constructors, etc. But deserializing these bytes is no different than doing read "1" :: Int. Because we know the type of the value we want to be reading at compile time, we do not need to record that information in the stored data. We just do:



*Main> deserialize (serialize Beep) :: (Beep,ByteString)
(Beep,Empty)
*Main>

As a side note, Strings are serialized to a very compact representation. In fact, they are stored as compactly as ByteStrings because they are first converted to a ByteString.



*Main> Data.ByteString.Lazy.unpack $ serialize "hello"
[0,0,0,0,0,0,0,5,104,101,108,108,111]
*Main> Data.ByteString.Lazy.unpack $ serialize (Data.ByteString.Lazy.Char8.pack "hello")
[0,0,0,0,0,0,0,5,104,101,108,108,111]
*Main>

The first 8 bytes are the length of the String, and the remaining bytes are the utf-8 encoded characters of the String.


So, if you application is best served by using Strings instead of ByteStrings, you do not have to take an extra steps to ensure that the serialized data is compactly represented.


Simple Migration


Let's say we want to add another constructor to the Beep type. As a first pass, we will actually create a whole new type named Beep', which is similar to the old type, but has an additional constructor BeepBeep.


> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data Beep' = BeepBeep' | Beep'
> |])
>
> $(deriveSerialize ''Beep')

Because we are extending a previous type, our Version instance will look a bit different:


> instance Version Beep' where
> mode = extension 1 (Proxy :: Proxy Beep)

This indicates that we are extending the old type Beep. The new version number must be higher than the old version, but does not have to be strictly sequential.


Because we specified that this type is a newer version of an older type, we also need to tell HAppS how to migrate the old data to the new type. To do this, we simply create an instance of the Migrate class.


> class Migrate a b where
> migrate :: a -> b

The Migrate class is quite simple, it contains a single function, migrate which migrates something of type a to type b. In our current example, all we need is:


> instance Migrate Beep Beep' where
> migrate Beep = Beep'

We can demonstrate migration by serializing a value of type Beep and deserializing it as type Beep'. The migration happens automatically in the deserialize function.



*Main> fst $ deserialize (serialize Beep) :: Beep'
Beep'
*Main>

When deserialize tries to deserialize the data produced by serialize Beep, it will first check the version number. When it sees that the version number in the stored data is lower than the version number of the current type it will instead try to decode it as the type you specified as the previous version. If the version associated with the previous type is still higher than the value in the serialized data, the migration code will recurse until it finds a matching version number. Once it finds a matching version number, it will call the corresponding deserialization "instance" to decode the old data. Then as the recursion unwinds, it will apply the migrate function to migrate the data to newer and newer formats until it reaches the newest format.


Managing History


A big issue in the above example is that when we added the new constructor we also changed the name of the type and its existing constructors. That is not very convenient in a real application where you have a multitude of references to the old names.


Fortunately, we do not have to change the name of the type to add a new constructor. As we saw in the beginning, the name of the type and the names of the constructors are not actually stored in the serialized data. So, instead we can change the name of the old type from Beep to OldBeep and update its constructor as well.


> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data OldBeep = OldBeep
> |])
>
> $(deriveSerialize ''OldBeep)
> instance Version OldBeep

Because OldBeep and Beep have the same shape, they will serialize to the same byte sequence:


*Main> Data.ByteString.Lazy.unpack $ serialize OldBeep
[0,0,0,0,0,0,0,0,0]
*Main> Data.ByteString.Lazy.unpack $ serialize Beep
[0,0,0,0,0,0,0,0,0]
*Main>

that means we can serialize an OldBeep value and then deserialize it as a Beep value, like this:



*Main> fst $ deserialize (serialize OldBeep) :: Beep
Beep
*Main>

Note that this is not the same as migration. Here we are just exploiting the fact that because the type name and constructor names are not encoded in the serialized data we can change those names and still be able to deserialize the data.


Full Migration Example #1


Here is the full example which shows:



  1. Beep renamed to OldBeep
  2. the new Beep with the extra constructor
  3. the migration code from OldBeep to Beep

> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data OldBeep
> = OldBeep
> |])
>
> instance Version OldBeep
> $(deriveSerialize ''OldBeep)
>
>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data OldBeep
> = OldBeep
> data Beep = BeepBeep | Beep
> |])
>
> instance Version Beep where
> mode = extension 1 (Proxy :: Proxy OldBeep)
>
> $(deriveSerialize ''Beep)

Using separate files to manage type history


Keeping all the revisions of your type in one file, and changing the name of the type and its constructors every revision is tedious and hard to manage. Instead, we can use a system where we rename the files that contain our types. To start, we will put the types we want to serialize in a separate file (or files), such as Types.lhs.


> {-# LANGUAGE TemplateHaskell, UndecidableInstances, FlexibleInstances, GeneralizedNewtypeDeriving, MultiParamTypeClasses, DeriveDataTypeable, TypeFamilies #-}
> module Types
> ( Bar(..)
> , Foo(..)
> ) where

> import HAppS.Data

>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data Foo
> = Bar String
> | Baz Beep
>
> data Beep
> = Beep
> |])
>
> instance Version Beep
> $(deriveSerialize ''Beep)
> instance Version Foo
> $(deriveSerialize ''Foo)

Now let's say we want to add a constructor Bop to the type Foo. First we rename Types.lhs to Types_000.lhs and change the module name to reflect the changed file name:


> {-# LANGUAGE TemplateHaskell, UndecidableInstances, FlexibleInstances, GeneralizedNewtypeDeriving, MultiParamTypeClasses, DeriveDataTypeable, TypeFamilies #-}
> module Types_000
> ( Beep(..)
> , Foo(..)
> ) where

> import HAppS.Data

>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data Foo
> = Bar String
> | Baz Beep
>
> data Beep
> = Beep
> |])
>
> instance Version Beep
> $(deriveSerialize ''Beep)
> instance Version Foo
> $(deriveSerialize ''Foo)

Next we create a new Types.lhs:


> {-# LANGUAGE TemplateHaskell, UndecidableInstances, FlexibleInstances, GeneralizedNewtypeDeriving, MultiParamTypeClasses, DeriveDataTypeable, TypeFamilies #-}
> module Types
> ( Beep(..) -- ^ re-exported from Types_000
> , Foo(..) -- ^ extended here
> ) where

> import HAppS.Data

We import the old Types_000 qualified as T0 to avoid name clashes.


> import qualified Types_000 as T0

Since we are only modifying Foo, we can import and re-export the old version of Beep unmodified (also see the module export list above):


> import           Types_000 (Beep)

Then we extend Foo with the new constructor Bop Int:


>
> $(deriveAll [''Eq,''Ord,''Read,''Show, ''Default]
> [d|
> data Foo
> = Bar String
> | Baz Beep
> | Bop Int
> |])

Next we create a Version instance which indicates that the previous version of Foo is T0.Foo.


>
> instance Version Foo where
> mode = extension 1 (Proxy :: Proxy T0.Foo)
>
> $(deriveSerialize ''Foo)
>

And, finally, we specify how to migrate the old data:


> instance Migrate T0.Foo Foo where
> migrate (T0.Bar str) = Bar str
> migrate (T0.Baz beep) = Baz beep

Note that Foo in Types.lhs and Foo in Types_000.lhs are different types, namely Types.Foo and Types_000.Foo. So this works for the same reason that renaming Beep to OldBeep works.


Serializing Datatypes from 3rd Party Libraries


It is also possible to serialize datatypes from 3rd party libraries, provided those types have Data and Typeable instances. There is a caveat with this however. If the third party library changes the type, then you will not be able to read your data. This is not a fatal flaw however. You can simply copy the old type definition into a local module, and then migrate the old data to the new format.


Suggested Policy



  1. Put the types you will serialize in one or more files which only contain types
  2. Deploy your web 2.718 killer app
  3. Before you do any more development, copy the current type files to sequential versions and create new current type files which re-export all the types. You can skip this step if the current type file only contains re-exports. i.e., if no type changes were made to that type file during the previous iteration.
  4. Make changes for current development cycle, and then go to step 2.

4 comments:

Anonymous said...

Excellent write-up. Maybe an addition would be that you can also use HAppS-State to automatically do the serializing for you, but that's also maybe something for another post =).

Jeremy Shaw said...

Thanks!

My next post will likely be about using migration in the context of a full HAppS application. I plan to explain how HAppS-State actually works, and how to deal with the issue that HAppS-State can cause when you rename modules.

But first, I have to figure those things out ;)

Anonymous said...

Great article - I'm trying to figure out HAppS at the moment so reading this post was really useful.

Joe said...

Thanks for this excellent post. Love to see more Happstack-related posts written in this clear style.