Information Representation for Storage and Communication

There are many different data representations of a single piece of information, and many ways of interpreting data as information. In this blog entry, we show that, when designing systems, we have to distinguish information from its data representation. Doing so lets us choose a representation that suits each system's role as authoritative or redundant, and lets us communicate the information in a fault-tolerant way.

At the Net-A-Porter group, Fulcrum is our proprietary back-end application for managing products displayed on the public websites. Fulcrum sends this content as messages to various ‘web services’ (in the most general sense – including Solr and SQL databases) which are then used to populate the public websites.  Fulcrum is authoritative, while the web services, which host content for public consumption, are redundant.

Until recently, our authoritative database and redundant services have had very similar data representations. We’ll look at a problem resulting from this, and the solution we came up with. The challenges we’re addressing in this blog are:

  • identifying the principles that define a good data representation, depending on whether it is for an authoritative or a redundant service;
  • finding a way of transforming the data between the representations; and
  • ensuring that systems communicate the information atomically, so that communication is fault tolerant: a failure must not leave behind data which is an invalid representation of the information.

Problem – Swatches

swatch  (swɒtʃ)

  1. a sample of cloth
  2. a number of such samples, usually fastened together in book form
  3. (at Net-A-Porter) a set of products which differ only by colour

When managing a product in Fulcrum, we can edit its swatch by adding and removing other products. On the public websites, when we see one of these products we also see a row of colours linking to the other products in the swatch.

When translating this relationship between products into SQL, an intuitive first guess might be a table with fields (product_id, swatch_product_id), where getting all the other products in the swatch of product 166321 would be:

SELECT swatch_product_id FROM product_swatches
WHERE product_id = 166321;

We call this a graph representation.

A 16 Product Swatch

Pictured is the graph representation of a swatch of 16 products, with each product as a vertex and each row as an edge between the product and one of the other products in its swatch.

Things to note:

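  • A swatch of n products has n²-n rows (each product paired with each of the other n-1)
  • Adding/removing a product affects 2n-2 rows, where n counts the product being added or removed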
  • Can you see the missing edge?
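
The counts in these notes can be checked with a short sketch. It is only an illustration: the product ids 1 to 16 and the helper names `graph_rows` and `missing_reverse_rows` are made up here, and rows are held in memory rather than in the product_swatches table.

```perl
use strict;
use warnings;

# Build every (product_id, swatch_product_id) row for one swatch:
# each product points at every other product, in both directions.
sub graph_rows {
    my @products = @_;
    my @rows;
    for my $pid (@products) {
        for my $other (@products) {
            next if $pid == $other;    # a product is not its own swatch
            push @rows, [ $pid, $other ];
        }
    }
    return @rows;
}

# Return every row whose reverse direction is absent.
sub missing_reverse_rows {
    my @rows = @_;
    my %seen;
    $seen{ join ':', @$_ } = 1 for @rows;
    return grep { !$seen{ join ':', reverse @$_ } } @rows;
}

my @swatch = (1 .. 16);                # hypothetical product ids
my @rows   = graph_rows(@swatch);
printf "%d products need %d rows\n", scalar @swatch, scalar @rows;

# Removing one product deletes its outgoing and its incoming rows.
my @touching = grep { $_->[0] == 16 || $_->[1] == 16 } @rows;
printf "removing product 16 touches %d rows\n", scalar @touching;

# Simulate the bug: drop a single row, then look for it.
my @damaged = grep { !( $_->[0] == 3 && $_->[1] == 7 ) } @rows;
my @orphans = missing_reverse_rows(@damaged);
printf "row (%d, %d) is missing\n", $_->[1], $_->[0] for @orphans;
```

For 16 products this gives 16² - 16 = 240 rows, and removing one product touches 2 × 16 - 2 = 30 of them. The symmetry check only finds a row whose reverse survives; if both directions are lost, nothing in the remaining rows points at the gap.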

 

This is a good place to clarify our definition of data redundancy: storing more data than is necessary to deduce the information being stored. In traditional database design, this redundancy is regarded as A Bad Thing, since omissions and errors can cause multiple mappings of the data into contradictory information.

Back to the diagram above. Apart from the large amount of data required for representing this swatch, populating and maintaining it requires nested for-loops where off-by-one errors (think a missing edge) are hard to discern. Once this data discrepancy has been propagated across the redundant web services, it’s also difficult to repair. In this example, we had products whose pages were missing links to some of the other products in their swatch. On the other hand, we need redundancy in our public-facing data for speed of populating the web pages, and to diminish the impact of failure or corruption of the data services.

Solutions

Solution 1: Simple authoritative data representation

For what we refer to as the container representation, we have a new field in the product table called swatch_id referencing a table called swatch with a single field – id. Now, to find all the products in a swatch, we can do:

SELECT id FROM product WHERE swatch_id = 
      (SELECT swatch_id FROM product WHERE id = 166321);

Pictured is the container representation of two swatches, one with 16 products and another with 3, with each product as a circle and each swatch as a container.

Things to note:

  • A swatch of n products only requires storing the swatch_id along with each product: n updates
  • Adding/removing a product from a swatch affects 1 row (product)
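
To make the one-row claim concrete, here is a minimal in-memory sketch of the container representation. The hash `%swatch_of`, the helper `swatch_members` and every id other than 166321 are invented for illustration; the real data lives in the product table.

```perl
use strict;
use warnings;

# Container representation: one swatch_id per product, nothing else.
# All swatch ids and most product ids here are made up.
my %swatch_of = (
    166321 => 1, 166322 => 1, 166323 => 1,
    200001 => 2, 200002 => 2,
);

# Equivalent of the nested SELECT: products sharing the swatch of $pid.
sub swatch_members {
    my ($swatch_of, $pid) = @_;
    my $sid = $swatch_of->{$pid};
    return () unless defined $sid;
    return sort { $a <=> $b }
           grep { defined $swatch_of->{$_} && $swatch_of->{$_} == $sid }
           keys %$swatch_of;
}

# Moving a product into another swatch is a single assignment (one row).
$swatch_of{200002} = 1;

my @members = swatch_members(\%swatch_of, 166321);
print "swatch of 166321: @members\n";
```

Like the SQL query, the lookup returns the product itself along with the other members of its swatch.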

 

Solution 2: Use the algorithm for updating redundant services

Adding or removing a swatch’s product in the authoritative service is now a single add or delete. The for-loop generating all (product_id, swatch_product_id) pairs is still required, but only when publishing this information, and only if the redundant service still uses the previous data representation.
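
As a rough sketch of that publishing step, the loop below expands swatch membership into the pairs the old representation expects. The function name `expand_to_pairs` and the ids are hypothetical, and the membership lists are passed in directly rather than read from the authoritative database.

```perl
use strict;
use warnings;

# Expand { swatch_id => [product ids] } into (product_id, swatch_product_id)
# rows: every product paired with every other product of the same swatch.
sub expand_to_pairs {
    my ($members_by_swatch) = @_;
    my @rows;
    for my $sid ( sort { $a <=> $b } keys %$members_by_swatch ) {
        my @members = sort { $a <=> $b } @{ $members_by_swatch->{$sid} };
        for my $pid (@members) {
            for my $other (@members) {
                next if $other == $pid;
                push @rows, [ $pid, $other ];
            }
        }
    }
    return \@rows;
}

# Made-up data: one swatch of three products and one product on its own.
my $rows = expand_to_pairs( { 1 => [ 101, 102, 103 ], 2 => [ 201 ] } );
printf "(%d, %d)\n", @$_ for @$rows;
```

Because the pairs are always regenerated from the minimal data, a row cannot go missing in one direction only; any discrepancy is repaired by the next publish.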

Solution 3: Populate the redundant data service with a single snapshot

Rather than storing swatches and products in separate tables, our Indexed Product Service stores the entire swatch list of a product within the product’s representation. Updating a swatch is a matter of sending the redundant service a snapshot of each of the products in it. While it involves the same number of selects, it is a single call to a well-defined atomic ‘snapshot’ operation. If the message doesn’t go through, the product retains the previous swatch set rather than being left with a partial change.
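
The all-or-nothing behaviour can be sketched as follows. This is not the Indexed Product Service itself: the hash `%store`, the document layout and the function `apply_snapshot` are assumptions made up for this example, with a truncated message standing in for a failed delivery.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the redundant service:
# product id => document holding the product's whole swatch list.
my %store = (
    101 => { id => 101, swatch => [102] },
    102 => { id => 102, swatch => [101] },
);

# Replace a product's document in one step, or leave it untouched.
# Returns 1 on success, 0 if the snapshot was rejected.
sub apply_snapshot {
    my ($store, $snapshot) = @_;
    return 0 unless ref $snapshot eq 'HASH'
        && defined $snapshot->{id}
        && ref $snapshot->{swatch} eq 'ARRAY';
    # Copy first, then swap in with a single assignment, so a reader
    # never sees a half-updated swatch list.
    my %doc = ( id => $snapshot->{id}, swatch => [ @{ $snapshot->{swatch} } ] );
    $store->{ $doc{id} } = \%doc;
    return 1;
}

# A truncated message (no swatch list) is rejected; the old list survives.
my $lost       = apply_snapshot( \%store, { id => 101 } );
my @after_lost = @{ $store{101}{swatch} };

# A complete snapshot replaces the whole list at once.
my $ok = apply_snapshot( \%store, { id => 101, swatch => [ 102, 103 ] } );

print "lost=$lost, swatch still: @after_lost\n";
print "ok=$ok, swatch now: @{ $store{101}{swatch} }\n";
```

The point of the design is that there is no intermediate state to recover from: a product's document is either the old snapshot or the new one.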

Solution 4: HOWTO restructure the authoritative data

As you can see from the notes above, the new authoritative structure is much easier to use, but there’s still the question of moving the data from the product_swatches graph representation to the container representation. Happily, I’m not the first Perl developer to run into this problem!

use strict;
use warnings;
use Graph::Undirected;    # part of the CPAN Graph distribution
my $graph = Graph::Undirected->new;

Create a graph where each vertex is a product, and each edge is a swatch relationship:

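# $rs_pairs is assumed to iterate over the rows of product_swatches
while (my $sch_pair = $rs_pairs->next) {
    my $pid   = $sch_pair->product_id;
    my $s_pid = $sch_pair->swatch_product_id;

    # add_edge would create missing vertices itself; adding them is harmless
    $graph->add_vertex($pid);
    $graph->add_vertex($s_pid);
    $graph->add_edge($pid, $s_pid);
}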

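With a single call, we can now get a list of array references, each holding the products of one swatch. Because the graph is undirected, a missing row does not split a swatch as long as its products are still connected through other rows: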

my @g_components = $graph->connected_components;
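
What remains is turning each component into a swatch_id. The sketch below shows one way that step could look. To keep it runnable without CPAN it swaps in a small breadth-first search in place of Graph::Undirected, and the function `components_from_pairs`, the sample rows and the numbering of swatch ids from 1 are all assumptions for illustration, not the migration we actually ran.

```perl
use strict;
use warnings;

# Group products into swatches by walking the pairs breadth-first.
# Input: list of [product_id, swatch_product_id] rows.
sub components_from_pairs {
    my @pairs = @_;
    my %neighbours;
    for my $pair (@pairs) {
        my ($pid, $s_pid) = @$pair;
        # Record both directions, so a missing reverse row does no harm.
        $neighbours{$pid}{$s_pid} = 1;
        $neighbours{$s_pid}{$pid} = 1;
    }
    my (%visited, @components);
    for my $start ( sort { $a <=> $b } keys %neighbours ) {
        next if $visited{$start}++;
        my @queue = ($start);
        my @component;
        while ( defined( my $pid = shift @queue ) ) {
            push @component, $pid;
            # Queue each unvisited neighbour exactly once.
            push @queue, grep { !$visited{$_}++ }
                         sort { $a <=> $b } keys %{ $neighbours{$pid} };
        }
        push @components, [ sort { $a <=> $b } @component ];
    }
    return @components;
}

# Rows for swatch {1,2,3} with the (3,1) row missing, plus swatch {7,8}.
my @pairs = ( [1,2], [2,1], [1,3], [2,3], [3,2], [7,8], [8,7] );
my @components = components_from_pairs(@pairs);

# Number the components to get a swatch_id for each product.
my %swatch_of;
for my $i ( 0 .. $#components ) {
    $swatch_of{$_} = $i + 1 for @{ $components[$i] };
}
printf "product %d -> swatch %d\n", $_, $swatch_of{$_}
    for sort { $a <=> $b } keys %swatch_of;
```

Even with the (3,1) row missing, products 1, 2 and 3 end up in the same swatch, which is exactly the repair the restructuring is meant to deliver.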

Conclusion

What we learnt from this redesign can be applied to many other situations when systems are being restructured. The principles are

  • Ensure the data representation on the authority is minimal. If you can remove a row and still deduce the relationship it stored (think the graph structure above with the missing edge) then the data representation is not minimal. A minimal representation is easier to maintain (fewer lines of code) and makes it easier to understand what the data represents.
  • The redundant data representation should be quick and easy to access – why put it in two tables when you can have it in a single document?
  • Publishing from the authority is an atomic snapshot. Ensure that if a message is lost, it doesn’t leave the service in an invalid state, such as a product being connected to some, but not all, of the products in the swatch.
  • Authoritative data overrides the redundant data when there are discrepancies.