Josef is a chemical engineer who began his career in computer-controlled analyzers. With more than 15 years’ experience in primary and secondary storage, high availability, disaster recovery and virtualization, Josef is a well-recognized expert in his field.
Josef joined Bull in 2007. Today he leads the Austrian Consulting Team and acts as Advisory Services Champion for Bull Austria. Holding more than 30 IT certifications – including vendor-independent SNIA Certifications – Josef has recently been recognized as a SCSN-E worldwide High Achiever by the Storage Networking Industry Association.
Duplication, redundancy, expansion… the challenge of managing data
Organizations in every industry are weighed down by the relentless growth in the data they produce and which they need to manage. What if there was a way to get rid of redundant data that has been stored for ages and gone through various iterations? In fact, there is not just one way, but multiple technologies available to help customers cope with their data growth.
Has anyone not heard of data deduplication yet? But does deduplication solve all problems? Obviously not, but again several other options are available to customers to slow down their data growth and increase efficiency. While all technologies have their advantages, they may not fit all customer requirements, which means it is important to look at the available options and match them with customers’ business requirements.
The following article looks at technologies available and where each can be used most effectively, and closes with a simple data reduction matrix to help evaluate the needs of individual organizations.
Data management technologies: what are they, and when can you use them?
Every day, IT departments have to manage more and more data. Unfortunately for those who need to manage and cope with this growth, it is the unstructured data that grows; in other words, data that sits outside databases or ERP systems. Often it is stored in e-mail systems or on file servers, in the form of word-processing documents, presentations and even voice or video data.
These days, a number of new technologies, as well as established solutions, are available to help deal with this growth. It is not always easy to find the right technology to suit your needs, however, as you have to understand the benefits, downsides and application areas of each. Technologies such as deduplication, compression and thin provisioning are already available in Bull’s StoreWay™ offering, for primary storage, secondary storage or both[1].
In any case, technology alone may not solve all the customer’s needs, as Gartner clearly points out in its 2009 research paper “Invest in Storage Professional Services, Not More Hardware”[2]. A simple advisory package delivered by Bull’s StoreWay experts can help identify the right technologies and ensure that they are used in the right way.
Compression
Even though compression has been used for many years to reduce data, it should not be seen as ‘old technology’ at all. Organizations have successfully used compression for years, implemented in secondary storage technology with unstructured data; but we are also seeing more and more storage vendors implementing loss-less compression technologies in primary storage. It is easy to implement and can be supported by dedicated hardware. When applying compression to unstructured data, space savings are relatively similar to other technologies, such as deduplication.
The space-saving results that can be achieved depend entirely on the source data. The more ‘raw’ the source data (e.g. word-processing documents, spreadsheets…), the better the results; but the more compressed the source already is (e.g. jpeg, movies, mp3, pdf…), the less compression you can expect.
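The dependence on source data is easy to observe with any general-purpose loss-less compressor. The following is a minimal sketch using Python’s standard zlib module; the sample inputs are assumptions chosen for illustration (repetitive text standing in for a ‘raw’ document, random bytes standing in for an already compressed file) and do not represent any particular storage product.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


# 'Raw' data: repetitive plain text, standing in for a word-processing document.
raw_text = b"Quarterly report: storage utilization remains high. " * 2000

# Dense data: random bytes, standing in for an already compressed jpeg or mp3.
dense = os.urandom(len(raw_text))

print(f"raw text ratio  : {compression_ratio(raw_text):.3f}")
print(f"dense data ratio: {compression_ratio(dense):.3f}")
```

The repetitive text shrinks to a small fraction of its size, while the dense input stays at roughly its original size, which is why compressing already compressed formats brings little benefit.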
Two different types of compression algorithm are used: ‘lossy’ and ‘loss-less’. Lossy compression is acceptable where not all the information needs to be kept. Very often it is used with multimedia formats such as jpeg, mp3 or mp4. A lossy compression algorithm removes most of the redundant data through reduction. It is important to recognize that the compressed data cannot be reduced much further using other technologies.
A typical scenario for using lossy compression would be to reduce multimedia information that is no longer needed in a high-definition format: for example, to reduce pictures from 10 Mbit to 2 Mbit resolution.
Loss-less compression has also been used for a number of years. With loss-less compression the original data is kept, but stored in an optimized format. It is typically used for software-based file compression, such as zip, or in hardware-based implementations in secondary storage, such as tape drives. Loss-less compression is a widely used technology which is often implemented in optimized hardware. Bull expects this technology to appear more and more in primary storage arrays as well.
Other technologies, such as deduplication and thin provisioning, can be implemented together with loss-less compression to achieve further data reduction.
Let’s look at an example: how to store a number like 7.88888888 using compression algorithms. Using loss-less compression, it could be stored as 7.[8]8 (seven dot 8 times 8), which still keeps all the information but takes up about 40% less space. With lossy compression, it is rounded to 7.9 or even 8, depending on the granularity of information we want to keep, and so saves even more space with the trade-off that not all the original information is kept.
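The number example can be reproduced in a few lines. The sketch below assumes a simple run-length scheme in which a run of four or more identical characters is written as [count]character; the threshold, the function names and the restriction to inputs without square brackets are illustrative assumptions, not a description of any real compression format.

```python
import re
from itertools import groupby

MIN_RUN = 4  # assumed threshold: shorter runs are cheaper to store literally


def rle_encode(text: str) -> str:
    """Loss-less: replace long runs with [count]char. Input must not contain '['."""
    out = []
    for ch, group in groupby(text):
        n = len(list(group))
        out.append(f"[{n}]{ch}" if n >= MIN_RUN else ch * n)
    return "".join(out)


def rle_decode(encoded: str) -> str:
    """Expand every [count]char marker back into the original run."""
    return re.sub(r"\[(\d+)\](.)", lambda m: m.group(2) * int(m.group(1)), encoded)


original = "7.88888888"
encoded = rle_encode(original)                 # "7.[8]8"
saving = 1 - len(encoded) / len(original)      # fraction of characters saved

# Lossy: rounding keeps less detail and cannot be reversed.
lossy_one_decimal = round(float(original), 1)  # 7.9
lossy_integer = round(float(original))         # 8

print(encoded, f"{saving:.0%}", rle_decode(encoded) == original)
print(lossy_one_decimal, lossy_integer)
```

Decoding the loss-less form returns exactly the original string, whereas nothing can recover 7.88888888 from 7.9 or 8.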
Thin provisioning
Thin provisioning is a space-saving technology that leverages the fact that today most applications are assigned more space than they currently need, which leads to poor storage utilization rates. Thin provisioning can help to increase storage efficiency while making management easier.
Today’s IT systems are often over-provisioned in terms of storage, and there are several good reasons for doing this, such as file system optimization (file systems like to have some free space to optimize themselves), avoiding maintenance tasks such as growing file systems or database files, or standardization for the purposes of virtualization (where all machines are provisioned from the same master).
Using thin provisioning, only the space that is really needed is physically allocated, which in theory saves a great deal. In reality you have to take a close look at what space-saving results are really achievable, as applications can either be thin friendly or thin unfriendly, depending on how they allocate space.
For example, traditional file systems, which are usually ‘thin unfriendly’, always try to allocate new space before overwriting old blocks, to avoid fragmentation. For thin provisioning, that means that over time all blocks are written at least once, and the storage array cannot see which space is really in use and which is free.
Despite this, we are seeing more and more ‘thin friendly’ applications, such as databases – as long as they do not pre-allocate storage – and virtualization solutions. Due to the maturity of this technology, applications usually use built-in mechanisms to tell a thin provisioning storage array which blocks are used and which are free. This mechanism is called ‘space reclamation’ and is required to not only start thin but to stay thin provisioned over time.
The low-hanging fruit for thin provisioning is Network Attached Storage (NAS), where the file system is running inside the storage array itself, which makes it relatively easy to use only the space which is allocated.
In any case, organizations should check for thin storage and thin reclamation capabilities, to achieve the best results and get thin at the same time!
Be aware, though, that with thin provisioning you may assign more storage than you have actually got. So monitoring is key in thin provisioned environments, to avoid running out of space.
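To make the relationship between assigned space, physically used space and space reclamation concrete, here is a toy model of a thin-provisioned pool. It is only a sketch: the ThinPool class, its method names and the block-level granularity are hypothetical and do not correspond to the interface of any storage array.

```python
class ThinPool:
    """Toy thin-provisioned pool; all names and sizes are illustrative."""

    def __init__(self, physical_blocks: int):
        self.physical_blocks = physical_blocks
        self.virtual_sizes = {}  # volume name -> size promised to the application
        self.allocated = {}      # volume name -> blocks that are physically backed

    def create_volume(self, name: str, virtual_blocks: int) -> None:
        # Creating a volume consumes no physical space.
        self.virtual_sizes[name] = virtual_blocks
        self.allocated[name] = set()

    def write(self, name: str, block: int) -> None:
        if not 0 <= block < self.virtual_sizes[name]:
            raise IndexError("block outside the volume")
        if block not in self.allocated[name]:
            if self.used() >= self.physical_blocks:
                raise RuntimeError("pool has run out of physical space")
            self.allocated[name].add(block)  # allocate on first write only

    def reclaim(self, name: str, block: int) -> None:
        # Space reclamation: the application reports that a block is free again.
        self.allocated[name].discard(block)

    def used(self) -> int:
        return sum(len(blocks) for blocks in self.allocated.values())

    def oversubscription(self) -> float:
        return sum(self.virtual_sizes.values()) / self.physical_blocks

    def utilization(self) -> float:
        return self.used() / self.physical_blocks


pool = ThinPool(physical_blocks=100)
pool.create_volume("db", 80)
pool.create_volume("files", 80)      # 160 blocks promised on 100 physical blocks

for block in range(30):
    pool.write("db", block)
for block in range(60):
    pool.write("files", block)       # a thin-unfriendly writer touching new blocks

before = pool.utilization()          # 0.9: time for the monitoring to raise an alert
for block in range(40):
    pool.reclaim("files", block)     # deleted data is reported back to the pool
after = pool.utilization()           # 0.5

print(pool.oversubscription(), before, after)
```

The pool starts thin because nothing is allocated at volume creation, and it only stays thin because freed blocks are handed back through reclaim; without that call the utilization would remain at 90% regardless of how much data the file system had deleted.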
Deduplication
In general, deduplication is a process which ensures that duplicate or redundant information is only stored once. Two main methodologies of deduplication have evolved in the market. The first is ‘record linking’, which has been around for decades and is used to link redundant records in databases. The other, which is relatively new, is data deduplication as a storage technology, implemented either in hardware or in software.
Let’s look at storage-related deduplication technologies, especially how, where and when to deduplicate.
Data deduplication works by computing the hash[3] and checksum[4] of objects and comparing them with information in a database to identify duplicate data. The checksum is used to ensure that the data is the same as before the deduplication.
The most widely used terms in data deduplication are: object-based deduplication (called ‘single instance storage’ by Microsoft[5]) versus block-based deduplication (with fixed or variable block length, sometimes called segment size deduplication); source versus target; and in-line versus post-process.
Let’s take a closer look at each of these technologies, to better understand their usage.
How to deduplicate
Object-based deduplication has been around for a while, and is implemented in a number of products. The hash of the object (typically a file) is compared with a database table, and if it already exists, it is only linked and not saved again. In object-based deduplication, a small change in an element like a file already changes the hash of the element and therefore the space saving works only if the content of the object is the same. If you look at file servers, the same files are often saved under different names without any change. So object-based deduplication is easy to manage and can save a lot of space.
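A minimal sketch of the object-based approach might look like the following. It assumes SHA-256 as the hash and an in-memory dictionary as the ‘database table’; the SingleInstanceStore class and its methods are hypothetical names used only for illustration.

```python
import hashlib


class SingleInstanceStore:
    """Toy object store that keeps each distinct content only once."""

    def __init__(self):
        self.objects = {}  # hash of content -> content (stored once)
        self.names = {}    # object name -> hash (a link, not a copy)

    def put(self, name: str, content: bytes) -> bool:
        """Store an object; return True only if new content had to be saved."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.objects
        if is_new:
            self.objects[digest] = content
        self.names[name] = digest
        return is_new

    def get(self, name: str) -> bytes:
        return self.objects[self.names[name]]

    def stored_bytes(self) -> int:
        return sum(len(content) for content in self.objects.values())


store = SingleInstanceStore()
report = b"Annual storage report " * 100

first = store.put("report.doc", report)                # new content is saved
copy = store.put("report_copy_for_bob.doc", report)    # same content, only linked
edited = store.put("report_v2.doc", report + b"!")     # one byte added: saved in full

print(first, copy, edited, store.stored_bytes())
```

The renamed copy costs no additional space, but the edited version is stored again in its entirety, which is the limitation that block-based approaches try to address.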
To achieve greater granularity, objects can be divided into smaller pieces. As this is often implemented in the block storage directly, we then speak of ‘block-based deduplication’. When using a fixed block size, deduplication depends on where the data is saved in a file. If the change is written to the end of an object, duplicates are more likely to be found than if the change is written at the beginning, because an insertion near the beginning shifts the content of every block that follows.
The most advanced technologies available work with so-called variable block size. The idea behind this is that data streams are analyzed for redundant data in blocks of different sizes. So even if the change is at the beginning or in the middle of an object, the redundant parts are more likely to be identified.
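The difference between fixed and variable block sizes can be illustrated with a small experiment: insert a single byte at the beginning of an object and count how many blocks are still recognized as duplicates. The sketch below is a simplification under stated assumptions. It cuts a variable block wherever a CRC32 fingerprint of the last 16 bytes matches a bit pattern, and it recomputes that fingerprint at every position rather than using a true rolling hash. The window size, mask and function names are illustrative and not taken from any product.

```python
import random
import zlib


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Cut after any position where the fingerprint of the last bytes matches."""
    chunks, start = [], 0
    for i in range(len(data)):
        fingerprint = zlib.crc32(data[max(0, i - window + 1):i + 1])
        if fingerprint & mask == 0:  # boundary depends on content, not on offset
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # remaining tail
    return chunks


def shared(a: list, b: list) -> int:
    """Number of distinct blocks that appear in both versions."""
    return len(set(a) & set(b))


rng = random.Random(42)  # fixed seed so the experiment is repeatable
original = bytes(rng.getrandbits(8) for _ in range(8192))
modified = b"X" + original  # one byte inserted at the very beginning

fixed_shared = shared(fixed_chunks(original), fixed_chunks(modified))
variable_shared = shared(variable_chunks(original), variable_chunks(modified))

print("duplicate blocks found, fixed size   :", fixed_shared)
print("duplicate blocks found, variable size:", variable_shared)
```

With fixed blocks the insertion shifts every block, so essentially nothing matches any more. With content-defined boundaries only the blocks around the insertion change, and the rest of the object is still identified as redundant.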
As with all deduplication technologies you can only estimate space-saving rates based on average results with different data types provided by vendors. To find out how much space saving you can achieve, you’d need to evaluate deduplication with your own data, for example through a deduplication assessment.
Where to deduplicate
The availability of reliable deduplication in primary storage is somewhat limited today. But we expect to see more products coming onto the market over the next few years, especially in combination with compression. So for now, we will focus on its implementation in secondary storage or backup products.
In the backup process, we find two areas where data can be deduplicated: first, deduplication by the back-up client at the data source. This needs to be supported by the back-up application. Traditionally this has its origin in the need to carry out tape-less backups over slower wide-area networks. As an example, you could back up an organization’s remote locations to a central Data Center.
Nowadays, with the convergence of standard back-up clients and deduplication functionality within the back-up clients, this feature has been introduced into Data Centers as well. It is often used to minimize resource utilization in virtualized environments.
The second point of deduplication is at the target device of the back-up process, which is carried out both in software and dedicated hardware. As deduplication requires a great deal of computing power and memory, the achievable performance of target-based deduplication is growing with every generation of CPU and memory. Today we have reached a point where these technologies are moving out of their niche and entering the mainstream market. Gartner expects that over the next few years these technologies will change the back-up market and secondary storage dramatically.
When to deduplicate
Given that deduplication requires a lot of computing power it is often implemented as a two-stage process where, with many hardware-based solutions, the deduplication process runs after the back-up of data is complete. As a result, data needs to be stored without any reduction on disk first and then deduplicated at a later point in time. This process is called post-process deduplication, and requires more disk performance and space.
Some software technologies, as well as a few hardware devices, can handle deduplication directly while data is being backed up. This process is called in-line deduplication and requires more computing power but much less disk space and performance to handle the throughput.
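The trade-off between the two approaches can be sketched by tracking the peak disk space each one needs for the same backup stream. This is a deliberately simplified model: the function names are hypothetical, chunks are deduplicated by their SHA-256 hash, and the post-process variant is assumed to free its landing area once the deduplication pass has finished.

```python
import hashlib


def inline_dedup(stream):
    """Deduplicate while data arrives; returns (final bytes, peak bytes on disk)."""
    store, stored, peak = {}, 0, 0
    for chunk in stream:
        digest = hashlib.sha256(chunk).digest()  # CPU work during the backup
        if digest not in store:
            store[digest] = chunk
            stored += len(chunk)
        peak = max(peak, stored)
    return stored, peak


def post_process_dedup(stream):
    """Land everything first, deduplicate later; returns (final bytes, peak bytes)."""
    landing = list(stream)                       # raw data written to disk first
    peak = sum(len(chunk) for chunk in landing)  # full, unreduced size
    store = {}
    for chunk in landing:                        # second pass after the backup
        store.setdefault(hashlib.sha256(chunk).digest(), chunk)
    stored = sum(len(chunk) for chunk in store.values())
    return stored, peak


# Five backup runs of the same two 1 KB chunks.
backup_stream = [b"A" * 1024, b"B" * 1024] * 5

inline_result = inline_dedup(backup_stream)
post_result = post_process_dedup(backup_stream)

print("in-line      (final, peak):", inline_result)
print("post-process (final, peak):", post_result)
```

Both approaches end up storing the same amount of data, but the post-process variant temporarily needs disk space for the full, unreduced backup, while the in-line variant pays with hashing work on the ingest path.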
As CPU and memory performance grow faster than comparable disk performance, we expect in-line technologies to be the prevailing technologies in a few years. Only a few vendors have already implemented true in-line technology, with some using in-line up to a certain throughput, but many leveraging post-process technology.

A new way of thinking
As data deduplication technologies have reached a level of maturity where they have proved their success in various customer implementations, it is now time for customers to rethink the back-up and storage management process.
After reading this article, you might notice that investing in assessment and integration services is key for a successful implementation. Bull offers a wide range of services in this area, including consulting services such as GPS, COMPASS and deduplication assessments, as well as integration services delivered by Bull Infrastructure Services and Support.
For more information about our Europe-wide roadshow “Data Center next generation”, run together with EMC, Intel and VMware, see http://www.bull.com/bulldirect/N47/agenda.html#roadshow
[1] In this context, primary storage means storage for data that is in active use by users or applications, whereas secondary storage is used for backup and restore purposes.
[2] Gartner Research ID Number: G00165133, Publication Date: 27 February 2009 – “Invest in Storage Professional Services, Not More Hardware”
[3] Wikipedia – Hash function – http://en.wikipedia.org/wiki/Hash_function
[4] Wikipedia – Checksum – http://en.wikipedia.org/wiki/Checksum
[5] United States Patent 5,813,008; Benson, et al., September 22, 1998









