The new world of massive scale data
By Robin Lester
Data is more relevant today than it has ever been. Data flows through industry; it is the lifeblood that connects businesses, people and machines. It powers the gears of industry, dissolving geographic boundaries through communication systems and, with the internet, creating a sea of knowledge so vast that it is no longer necessary for a person to remember details, only where those details may be found.
However, it was not always this way. Although the amount of data has been growing since writing was first invented, the process of data generation, storage and retrieval has been slow. The Great Library of Alexandria held, at its peak, up to 400,000 scrolls – a lot to read, but estimated at only around 20GB, small enough to fit onto a memory stick or a modern phone. There have, of course, been a few blips in the collection and storage of data in the form of political change, wars and natural disasters, but in spite of this data has slowly been multiplying, stored by those who understand how important it is to keep – whether for sentimental reasons or a need to understand the world, the opposition or, more sinisterly, the people who are producing it.
The need to understand the world has always been there. From the first expedition, map or telescope, people have wanted to know about the world so they could gain an edge and survive. However, the ability to collect and consume data was always somewhat lacking, until around 40 years ago. With the birth of the internet and consumer-grade electronics, the amount of data stored and produced is accelerating rapidly. Gartner has estimated that there are 7.3 billion PCs, smartphones and tablets worldwide, along with 28 billion other devices that can produce electronic data. Compare that with the number of people who could consume this data: in 2020 it is estimated that there will be fewer than 8 billion people on the planet. (The number of gadgets, in fact, reached parity with the number of people in 2014.)
With all these devices producing data, it is estimated that by 2020 there will be 1.7 megabytes of data produced for every person, every second. This equates to more than seven new Libraries of Alexandria every day for every person on the planet. In total, that is more than 58 billion new Libraries of Alexandria every 24 hours.
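Those figures are easier to trust once the arithmetic is laid out. The short Python sketch below reproduces them, assuming the estimates quoted above: 1.7 megabytes per person per second, roughly 20GB per Library of Alexandria and around 8 billion people.

```python
# Back-of-the-envelope check of the figures quoted above.
MB_PER_PERSON_PER_SECOND = 1.7          # estimated data produced per person per second
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400 seconds
LIBRARY_OF_ALEXANDRIA_GB = 20           # rough digital size of the library's scrolls
WORLD_POPULATION = 8_000_000_000        # approximate population in 2020

gb_per_person_per_day = MB_PER_PERSON_PER_SECOND * SECONDS_PER_DAY / 1000
libraries_per_person_per_day = gb_per_person_per_day / LIBRARY_OF_ALEXANDRIA_GB
libraries_per_day_worldwide = libraries_per_person_per_day * WORLD_POPULATION

print(f"{gb_per_person_per_day:.0f} GB per person per day")            # ~147 GB
print(f"{libraries_per_person_per_day:.1f} Libraries per person/day")  # ~7.3
print(f"{libraries_per_day_worldwide / 1e9:.0f} billion Libraries per day")  # ~59 billion
```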
But what use is all this data, and how can it be understood? It is impossible for a human brain to comprehend that much data. With the explosion in data, new concepts needed to be developed to understand, shape and mould it. Phrases like Big Data have been coined to mark the difference between understandable amounts of data and data that is too vast to be comprehensible. Phrases like the Internet of Things have come into being to help conceptualize the devices that are pumping out oceans of information. How do we understand it all? We understand it as a need arises. The more data you have, the more colour can be added to understanding a problem. For example, understanding why a machine fails means understanding the correlation of events that could have worked together to cause a component failure – humidity, temperature, ambient temperature, external movements, fuel mix, lubrication levels and so on.
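As a toy illustration of that kind of correlation hunt – the sensor names and readings here are invented for the sketch, not taken from any real machine – a first pass might simply measure how strongly each reading moves with a failure flag:

```python
# Hypothetical sensor readings for one machine, one row per hour (invented data).
import statistics

humidity    = [0.31, 0.45, 0.52, 0.60, 0.72, 0.80]
temperature = [61.0, 63.5, 64.0, 70.2, 75.8, 79.1]
failure     = [0,    0,    0,    1,    1,    1]    # 1 = component failed that hour

# Pearson correlation of each reading against the failure flag (Python 3.10+).
for name, series in [("humidity", humidity), ("temperature", temperature)]:
    r = statistics.correlation(series, failure)
    print(f"{name}: correlation with failure = {r:.2f}")
```

In practice the readings would number in the millions and the analysis would run where the data lives, but the principle is the same: the value is in how the particles of data relate to one another.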
We understand data as grains of information
When we understand data, we see it as a collection of single elements of information, such as a tweet, a text, a temperature reading or a frame in a video. We understand data as atomic grains of information, or particles of knowledge in a greater collection. In physics, when more and more particles exist in a frame of reference, they start to take on new meaning. They start to behave less like individual units and more like a liquid. Particles start to flow and separate like a wave moving between rocks, ripples can occur and, at scale, the particles start to lose their own shape and take on the shape of their host container. The same is seen in Big Data, where particles of information become less about what they individually contain and more about what they mean in relation to all the other particles of data. This analogy is used in data science and Big Data to convey the immensity of all the particles of data being looked at.
A sensor device, like a stalactite in a dark cave, will create a tiny drop of information. This droplet will combine with other droplets and form a small stream. The stream (or streaming data) begins to combine with other streams. As more drops of data flow, a lake is formed (a Data Lake). More data is added to the lake, increasing the volume and knowledge held within.
As with water, this is just the start of the journey. A lake is not the right kind of water for most needs and a Data Lake is not data that can give meaning on its own. A local water company could not give out vats of lake water for drinking and washing and the same is true of a Data Lake. The data in a Data Lake is dirty, much as the water in a lake is not clean. As with drinking water, data must be processed and cleaned before it can be consumed.
A Data Lake is not data that can give meaning on its own
A local water authority will draw some water out of its reservoirs and push it through a cleaning process. The same has to happen to Big Data held in a Data Lake. Data has to be drawn out. In much the same way as stones and grit are removed, elements that do not fit into the data need to be cleaned out. After data is pre-processed, it has to be given some kind of meaning. Who is going to use it? What does it say? Is there too much data, or not enough, for the recipient person or system? As in a water processing plant, the data is worked on to change it into something that is understandable and relevant.
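A minimal sketch of that cleaning step, with the raw readings and validity rules invented purely for illustration, might look like this:

```python
# Raw 'lake' records: some are malformed, duplicated or out of range (invented data).
raw_readings = [
    {"sensor": "s1", "temp_c": 21.4},
    {"sensor": "s1", "temp_c": 21.4},      # duplicate
    {"sensor": "s2", "temp_c": None},      # missing value
    {"sensor": "s3", "temp_c": 999.0},     # physically impossible reading
    {"sensor": "s4", "temp_c": 19.8},
]

def clean(readings):
    """Remove duplicates, missing values and out-of-range readings."""
    seen, cleaned = set(), []
    for r in readings:
        key = (r["sensor"], r["temp_c"])
        if r["temp_c"] is None or not (-50 <= r["temp_c"] <= 60) or key in seen:
            continue            # filter out the 'stones and grit' before consumption
        seen.add(key)
        cleaned.append(r)
    return cleaned

print(clean(raw_readings))      # only the plausible, de-duplicated rows remain
```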
The concept of a Data Lake does, in some respects, lead us to think of one large container full of random data. The reality is somewhat more ordered. While data may be stored in a huge vault, types of data are generally categorized together. For example, tweets would not be stored with movies, tweets from last year may not even be stored with tweets from this year. In reality, data is not stored in a lake but in a series of ponds, all with their own input and outlet streams.
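In practice those ponds are often nothing more exotic than partitioned storage. The sketch below, with made-up record types and a made-up path layout, groups records by kind and year so that each pond has its own place:

```python
from collections import defaultdict
from datetime import date

# Incoming records of different kinds (invented examples).
records = [
    {"kind": "tweet", "created": date(2015, 6, 1),  "body": "hello"},
    {"kind": "tweet", "created": date(2016, 2, 9),  "body": "world"},
    {"kind": "video", "created": date(2016, 3, 14), "body": "<frames>"},
]

# One 'pond' per kind and year, e.g. tweet/2015 and tweet/2016 kept apart.
ponds = defaultdict(list)
for r in records:
    pond_path = f"{r['kind']}/{r['created'].year}"
    ponds[pond_path].append(r)

for path, contents in ponds.items():
    print(path, len(contents), "record(s)")
```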
While a water company would let water flow from its lake in no particular order, within a Data Lake it is necessary to identify and pull out the correct data for the task. Within Big Data, languages exist that embrace the terminology of the lake, such as U-SQL, named to represent a U-boat diving deep into the lake to find the data that is required.
With so much data, new ways of understanding it come into play. There is too much data for a human to see patterns in, and there is simply no point in collecting data if it cannot be used.
Understanding and using Big Data calls for a shift in companies. No longer can the IT department get the right data to the right people; with so much data, it is not possible to know who might need what. Forward-thinking companies have responded to this problem by moving to a more self-service model of reporting, shifting the job of finding meaning in the data from IT departments to the data consumers. Statistical languages are used to help identify patterns, as are machine learning algorithms.
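A hedged example of what that self-service pattern finding can look like – the event log and machine names below are made up for the sketch – is a data consumer summarising failure rates directly, without waiting for a report from IT:

```python
from collections import Counter

# Invented event log: which machine, and whether it failed that day.
events = [
    ("press-1", True), ("press-1", False), ("press-2", False),
    ("press-1", True), ("press-2", False), ("press-2", True),
]

failures = Counter(machine for machine, failed in events if failed)
totals   = Counter(machine for machine, _ in events)

for machine in totals:
    rate = failures[machine] / totals[machine]
    print(f"{machine}: {rate:.0%} failure rate")   # a simple self-service report
```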
It also stops making sense for companies to own the hardware their data is sitting on, and it makes more sense to rent storage from data centres where scale enables better economic models.
It makes more sense to rent storage from data centres where scale enables better economic models
What this all boils down to is that, as data grows and flows, we stop being able to analyse it in a traditional manner and must rely on machines to identify the currents and eddies.
With the new order of massive scale data, it will not be important who has the best data, but rather who can pump the most relevant data around their organization’s circulatory system, supplying the right people with the right level of information and patterns so they can make the best business decisions. The company that can do that will stay on top.
Robin Lester is a Premier Field Engineer at Microsoft