Big Data for Dummies - What Data is BIG DATA?
By Saheli Roy Chowdhuri
Big data is a term used by many analytics and ecommerce companies, and usually refers to vast amounts of data that can be used to provide intelligence. Although many of us have heard the term, not all of us have stopped to ask ourselves, “What is big data?”
90% of the information in the world today has been created over the last 2 years, and by 2020 we will be generating 50 times more information than we are today! Organizations are no longer measuring data capacity in gigabytes or terabytes, but in exabytes, and soon in zettabytes. To break down data types, their storage, and the analytics of big data, it's important to understand what big data really is.
So, what is Big Data?
Stored data is broadly classified into two types: Structured Data and Unstructured Data.
Structured Data is data stored in a highly organized format, such as information in a relational database that can easily be queried using readily available search terms and algorithms. Examples include a hospital chain's patient records, or data libraries that can be queried by a search engine. This can be thought of as traditional data in a sense, as humans are used to working with and analyzing data in this format.
Unstructured Data is data that lacks organization. It is hard to manage and make sense of, and the tools for managing and retrieving it are still improving. Images and video uploaded to the internet, your email, and text and social media conversations are all examples of unstructured data.
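To make the contrast concrete, here is a minimal sketch. The patient table and the email text are made-up illustrations, not data from any real system: structured records answer a question with one query, while unstructured text can only be searched or parsed.

```python
import sqlite3
import re

# --- Structured data: a tiny, hypothetical patient-records table ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE patients (id INTEGER, name TEXT, hospital TEXT, visits INTEGER)")
db.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?)",
    [(1, "A. Gupta", "City General", 3), (2, "B. Lee", "Northside", 1)],
)
# Because the format is organized, a simple query answers the question directly.
frequent = db.execute("SELECT name FROM patients WHERE visits >= 2").fetchall()
print("Frequent patients:", frequent)

# --- Unstructured data: free-form text has no schema to query against ---
email = "Hi team, the MRI scanner at City General was down again on Tuesday..."
# The best we can do without further processing is search or pattern-match it.
mentions = re.findall(r"City General", email)
print("Mentions of the hospital in the email:", len(mentions))
```

The structured half works because someone decided the schema up front; the unstructured half only yields anything because we guessed what to look for.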
Wikipedia defines big data as "data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time". Simply put, big data refers to the exabytes of information that are a challenge to manage and organize using today's technology. Big data is data from everywhere - GPS locations and mobile devices, sensors on aeronautical or space equipment, satellites, cameras, and every other imaginable source. According to IBM, 2.5 exabytes, or 2.5 billion gigabytes, of data were generated every day in 2012. This is the volume of data we're managing, and with storage becoming cheaper and more manageable, these volumes will only increase rapidly.
Characteristics of Big Data
Big data is characterized by the following three V’s.
1) Volume - Volume refers to the extensive data being generated on a daily basis, which is then used as historical data to derive insights. As mentioned earlier in this blog, 90% of the world's information was generated over the last 2 years, and this number will only multiply. Volume is an important characteristic because sheer size alone is a consideration for whether data is categorized, and hence managed, as big data. Aeroplanes generate approximately 2.5 billion terabytes of data each year from the sensors installed in their engines. Self-driving cars will generate 2 petabytes of data every year. Even that is nothing compared to the Square Kilometer Array telescope, which will generate 1 exabyte of data per day.
2) Variety - Variety is simply the range of data formats we currently use. This could be email, photos, videos, weather and sensor data, social media chatter, and more. Variety is an important characteristic that a data analyst must consider in order to use big data effectively.
3) Velocity - Velocity refers to the rate at which new information is generated and the rate at which existing information moves around. This is an important characteristic because computation time becomes a large consideration when analyzing and visualizing large amounts of real-time data. For instance, technology now lets us analyze data while it's being created, without even storing it in a database (see the sketch below). With so many people connected to the internet, there is always new data being generated. Every minute we upload 100 hours of video to YouTube. In addition, every minute over 200 million emails and 300,000 tweets are sent. This high rate of content generation is a challenge companies must now tackle.
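Here is a minimal sketch of that "analyze it as it arrives" idea. The feed is simulated with random records; in a real system it would be a socket, message queue, or streaming API, and the point is that only running aggregates are kept, never the raw stream.

```python
import random
import time

def tweet_stream(n=10):
    """Simulate a hypothetical live feed of short messages."""
    for _ in range(n):
        yield {"user": f"user{random.randint(1, 5)}", "length": random.randint(10, 280)}

# Process each record the moment it arrives: keep only running totals,
# never storing the raw stream in a database.
count, total_length = 0, 0
for tweet in tweet_stream():
    count += 1
    total_length += tweet["length"]
    print(f"seen={count}  avg_length={total_length / count:.1f}")
    time.sleep(0.1)  # pretend there is a small gap between arrivals
```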
Where’s the world's data being stored?
For consumers and enterprises alike, the cost of data storage per gigabyte has dropped drastically, and storage will only continue to become cheaper and faster. This storage could be on locally owned storage media or on a cloud service provider's network.
90% of the data an organization generates is unstructured, driving the need to record and manage all of it. Beyond structured and unstructured, data can also be semi-structured or have complex structures. Cloud service providers have set up large data farms on which data can be stored, used, and modified on demand. To support these large volumes of unstructured and complex data, open-source tools like Hadoop and NoSQL databases have been developed and are being widely used.
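Hadoop's core idea is the map/reduce pattern: split a huge job into independent pieces, process them in parallel, then combine the results. The sketch below simulates the classic word-count example in plain Python; it is not actual Hadoop code, and real jobs would read documents from HDFS and run the map and reduce steps on many machines.

```python
from collections import defaultdict

# A toy batch of "unstructured" documents; a real Hadoop job reads these from HDFS.
documents = [
    "big data is data that is big",
    "hadoop splits big jobs across many machines",
]

# Map phase: each document is turned into (word, 1) pairs independently,
# which is what lets the work be spread across a cluster.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle + reduce phase: pairs with the same key are grouped and summed.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))
```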
To learn more about big data analytics using Hadoop concepts and for an introduction to the Hadoop architecture, take the online course on Big Data Concepts & Hadoop Architecture.