What are Big Data problems? What are the issues faced by big companies in maintaining the data and how they manage it?

Deepak Patra
11 min read · Sep 17, 2020

Big Data has been described by some Data Management pundits (with a bit of a snicker) as “huge, overwhelming, and uncontrollable amounts of information.”

The evolution of Big Data rests on a number of preliminary steps, and while looking all the way back to 1663 isn’t necessary to explain today’s growth in data volumes, the point remains that “Big Data” is a relative term depending on who is discussing it. Big Data to Amazon or Google is very different than Big Data to a medium-sized insurance organization, but no less “Big” in the minds of those contending with it.

Over 90% of the world’s data was created in the last two years, and with 2.5 quintillion bytes of data generated daily, it is clear that the future is filled with more data, which can also mean more data problems.

Whilst it is clear that companies can benefit from this growth in data, executives must be cautious and aware of the challenges they will need to overcome, particularly around:

  • Collecting, storing, sharing and securing data.
  • Creating and utilising meaningful insights from their data.

90% of the data on the internet has been created since 2016, according to an IBM Marketing Cloud study. People, businesses, and devices have all become data factories that are pumping out incredible amounts of information to the web each day.

The Amount of Data Created Each Day on the Internet in 2019

In 2014, there were 2.4 billion internet users. That number grew to 3.4 billion by 2016, and in 2017 another 300 million internet users were added. As of June 2019, there are over 4.4 billion internet users. That is an 83% increase in the number of people using the internet in just five years!
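That 83% figure is easy to sanity-check yourself. A quick back-of-the-envelope calculation in Python, using only the user counts quoted above:

```python
# Growth in internet users, using the figures quoted above.
users_2014 = 2.4e9   # internet users in 2014
users_2019 = 4.4e9   # internet users as of June 2019

growth = (users_2019 - users_2014) / users_2014
print(f"Growth from 2014 to 2019: {growth:.0%}")   # ~83%
```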

Not only are there more people using the internet, but they are using it in many different ways.

Each minute of every day, an enormous amount of activity happens on the internet: photos are uploaded, messages and tweets are sent, videos are streamed, and searches are run.

If we do some quick calculations, we can see the amount of data created on the internet each day. There are 1,440 minutes per day, so every one of those per-minute figures multiplies into a staggering daily total.

Facebook once revealed some big, big stats on big data to a few reporters at its HQ, including that its system processes 2.5 billion pieces of content and 500+ terabytes of data each day. It’s pulling in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data each half hour.
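To get a feel for how those rates turn into per-minute and per-day figures, here is a small back-of-the-envelope conversion in Python, using only the Facebook numbers quoted above:

```python
# Convert the quoted Facebook rates into per-minute and per-day figures.
MINUTES_PER_DAY = 24 * 60        # 1,440 minutes in a day
HALF_HOURS_PER_DAY = 24 * 2      # 48 half-hour windows in a day

photos_per_day = 300_000_000             # photos uploaded per day (quoted above)
scanned_tb_per_half_hour = 105           # terabytes scanned each half hour (quoted above)

print(f"Photos uploaded per minute: {photos_per_day / MINUTES_PER_DAY:,.0f}")          # ~208,333
print(f"Data scanned per day: {scanned_tb_per_half_hour * HALF_HOURS_PER_DAY:,} TB")   # 5,040 TB
```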

The 3 Vs:

Though there are numerous problems in the Big Data world, the three V’s (volume, velocity, and variety) are the most important to keep in mind.

VOLUME

Volume is the V most associated with big data because, well, volume can be big. What we’re talking about here is quantities of data that reach almost incomprehensible proportions.

Facebook, for example, stores photographs. That statement doesn’t begin to boggle the mind until you start to realize that Facebook has more users than China has people. Each of those users has stored a whole lot of photographs. Facebook is storing roughly 250 billion images.

Can you imagine? Seriously. Go ahead. Try to wrap your head around 250 billion images. Try this one. As far back as 2016, Facebook had 2.5 trillion posts. Seriously, that’s a number so big it’s pretty much impossible to picture.

So, in the world of big data, when we start talking about volume, we’re talking about insanely large amounts of data. As we move forward, we’re going to have more and more huge collections. For example, as we add connected sensors to pretty much everything, all that telemetry data will add up.

How much will it add up? Consider this. Gartner, Cisco, and Intel estimate there will be somewhere between 20 billion and 200 billion connected IoT devices (no, they don’t agree, surprise!), but the number is huge no matter what. And it’s not just the quantity of devices.

Consider how much data is coming off of each one. I have a temperature sensor in my garage. Even with a one-minute level of granularity (one measurement a minute), that’s still about 525,600 data points in a year, and that’s just one sensor. If you have a factory with a thousand sensors, you’re looking at more than half a billion data points a year, just for the temperature alone.
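The arithmetic behind that sensor example is easy to verify. A minimal sketch in Python, with the thousand-sensor factory being the same hypothetical one as above:

```python
# Data points produced by temperature sensors sampling once per minute.
MINUTES_PER_YEAR = 60 * 24 * 365      # 525,600 one-minute readings per sensor per year
SENSORS_IN_FACTORY = 1_000            # the hypothetical factory from the example above

points_per_factory = MINUTES_PER_YEAR * SENSORS_IN_FACTORY

print(f"One sensor:    {MINUTES_PER_YEAR:,} readings per year")
print(f"Whole factory: {points_per_factory:,} readings per year")   # 525,600,000, over half a billion
```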

Then, of course, there are all the internal enterprise collections of data, ranging from the energy industry to healthcare to national security. All of these industries are generating and capturing vast amounts of data.

That’s the volume vector.

VELOCITY

Remember our Facebook example? 250 billion images may seem like a lot. But if you want your mind blown, consider this: Facebook users upload more than 900 million photos a day. A day. So that 250 billion number from last year will seem like a drop in the bucket in a few months.
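To see why that 250 billion figure dates so quickly, just multiply the upload rate out, using only the numbers above:

```python
# How quickly daily uploads dwarf the existing photo archive.
photos_stored = 250_000_000_000     # ~250 billion images already stored
uploads_per_day = 900_000_000       # more than 900 million new photos per day

print(f"New photos per year: {uploads_per_day * 365:,}")                       # ~328.5 billion
print(f"Days to double the archive: {photos_stored / uploads_per_day:,.0f}")   # ~278 days
```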

Velocity is the measure of how fast the data is coming in. Facebook has to handle a tsunami of photographs every day. It has to ingest it all, process it, file it, and somehow, later, be able to retrieve it.

Twitter’s full feed of tweets is often called “the firehose” because so much data (in the form of tweets) is being produced that consuming it feels like being at the business end of a firehose.

Here’s another velocity example: packet analysis for cybersecurity. The Internet sends a vast amount of information across the world every second. For an enterprise IT team, a portion of that flood has to travel through firewalls into a corporate network.

Unfortunately, due to the rise in cyberattacks, cybercrime, and cyberespionage, sinister payloads can be hidden in that flow of data passing through the firewall. To prevent compromise, that flow of data has to be investigated and analyzed for anomalies, patterns of behavior that are red flags. This is getting harder as more and more data is protected using encryption. At the very same time, bad guys are hiding their malware payloads inside encrypted packets.
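Real network-security analytics gets far more sophisticated than this, but the basic idea described above can be sketched in a few lines. The example below is only a toy illustration, not a production tool: it flags flows whose byte counts sit far outside a baseline of normal traffic, using a simple z-score; all of the numbers and the threshold are hypothetical.

```python
import statistics

def flag_anomalies(baseline, new_flows, threshold=3.0):
    """Flag flows whose byte counts sit far outside the baseline traffic.

    baseline:  byte counts of flows considered normal (historical traffic).
    new_flows: byte counts of flows currently crossing the firewall.
    threshold: how many standard deviations from the baseline mean counts as a red flag.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline) or 1.0   # guard against a zero standard deviation
    return [size for size in new_flows if abs(size - mean) / stdev > threshold]

# Hypothetical traffic: typical flows are around 1 KB, one transfer is ~2 GB.
normal_traffic = [1_200, 950, 1_100, 1_050, 980, 1_150, 1_020, 990]
incoming = [1_080, 2_000_000_000, 1_010]
print(flag_anomalies(normal_traffic, incoming))   # -> [2000000000]
```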

That flow of data is the velocity vector.

VARIETY

You may have noticed that I’ve talked about photographs, sensor data, tweets, encrypted packets, and so on. Each of these is very different from the others. This data isn’t the old rows and columns and database joins of our forefathers. It’s very different from application to application, and much of it is unstructured. That means it doesn’t easily fit into fields on a spreadsheet or a database application.

Take, for example, email messages. A legal discovery process might require sifting through thousands to millions of email messages in a collection. Not one of those messages is going to be exactly like another. Each one will consist of a sender’s email address, a destination, plus a time stamp. Each message will have human-written text and possibly attachments.
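Semi-structured data like email shows the variety problem well: a few header fields are predictable, but the body is free-form text. A minimal sketch using Python’s standard-library email parser (the message itself is made up purely for illustration):

```python
from email import message_from_string

# A made-up message, just to show which parts are structured and which are not.
raw = """From: alice@example.com
To: legal-team@example.com
Date: Thu, 17 Sep 2020 09:30:00 +0000
Subject: Q3 contract review

Hi all,

Attached are my notes from yesterday's call. Let me know what I missed.
"""

msg = message_from_string(raw)

# The structured part: predictable header fields.
print(msg["From"], msg["To"], msg["Date"])

# The unstructured part: free-form human text that doesn't fit neat rows and columns.
print(msg.get_payload())
```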

Photos and videos and audio recordings and email messages and documents and books and presentations and tweets and ECG strips are all data, but they’re generally unstructured, and incredibly varied.

All that data diversity makes up the variety vector of big data.

Hadoop Distributed File System — a Solution to Big Data?

The Hadoop Distributed File System (HDFS) is part of a comprehensive business solution for storing big data for analysis. It is the storage layer of Hadoop, holding data for companies that need to maintain large amounts of it. If you’re looking for an answer to the big data storage dilemma, HDFS is right for you. Here’s a guide to help you understand it better.

What Is Hadoop Distributed File System?

The Hadoop Distributed File System (HDFS) is the way Hadoop programs store data. It is instrumental in solving big data issues for companies. HDFS splits large files into blocks and spreads those blocks across many connected machines, which essentially enables companies to expand storage drastically.
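Conceptually, what HDFS does with a large file can be sketched in plain Python: chop the file into fixed-size blocks and hand each block to several different machines. This is only a simplified illustration of the idea (real HDFS defaults to 128 MB blocks and a replication factor of 3, and the node names below are hypothetical):

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the usual HDFS default block size
REPLICATION = 3                  # each block is kept on 3 different machines by default

def place_blocks(file_size_bytes, datanodes):
    """Return a toy block-to-node placement, HDFS-style."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    ring = itertools.cycle(datanodes)                # naive round-robin placement
    return {block_id: [next(ring) for _ in range(REPLICATION)]
            for block_id in range(num_blocks)}

nodes = ["node-01", "node-02", "node-03", "node-04", "node-05"]
for block, replicas in place_blocks(1_000_000_000, nodes).items():   # a ~1 GB file
    print(f"block {block}: stored on {replicas}")
```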

HDFS is a core part of Hadoop, which is made up of several main modules: the processing layer (MapReduce, with YARN managing resources) transforms the data, while HDFS stores it, and that storage role is one of the most important functions within Hadoop. These pieces are crucial to solving big data problems for enterprises. Before we explain HDFS further, it’s important to understand what Hadoop is and why HDFS became part of its process.

What Is Hadoop?

Hadoop is a collection of open source software that companies use to handle their big data processes. Because it is open source, the software is available to essentially everyone, and those who use it can modify it to fit their own needs. Ultimately, Hadoop handles both the storage and the processing of data.

Hadoop is a way for companies to analyze big data. It is dynamic and can be modified when necessary; as data systems change, Hadoop can be adapted to fit those changes. It allows companies to use large networks of machines to manage large quantities of data.

Hadoop as a Solution?

Hadoop is designed to handle the three V’s of Big Data: volume, variety, and velocity. First, let’s look at volume. Hadoop is a distributed architecture that scales cost-effectively: it was designed to scale out, so when you need more storage or computing capacity, all you need to do is add more nodes to the cluster. Second is variety: Hadoop allows you to store data in any format, structured or unstructured.

This means that you will not need to alter your data to fit a single schema before putting it into Hadoop. Next is velocity: with Hadoop you can load raw data into the system and then later define how you want to view it. Because of that flexibility, you avoid many of the network and processing bottlenecks associated with transforming data as it is loaded, and since data is always changing, it is much easier to integrate any changes.

Hadoop will allow you to process massive amounts of data very quickly. Hadoop is a distributed processing engine that leverages data locality, meaning it was designed to execute transformations and processing where the data actually lives.
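The classic way to run such a distributed job from Python is Hadoop Streaming, where the mapper and reducer are small scripts that read standard input and write standard output, and Hadoop runs copies of them on the nodes where the data blocks already live. Below is a minimal word-count sketch in that style; the exact streaming jar location and job options vary by installation, so treat the usage comment as an assumption:

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming word count. Hypothetical usage (jar path varies by install):
#   hadoop jar hadoop-streaming.jar -input /logs -output /counts \
#       -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
import sys

def mapper():
    # Emit "word<TAB>1" for every word; Hadoop shuffles and sorts these lines by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by word, so counts for the same word are adjacent.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:2] == ["map"] else reducer()
```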

Another benefit, from an analytics perspective, is that Hadoop allows you to load raw data and then define the structure of the data at the time of query. This means that Hadoop is quick, flexible, and able to handle any type of analysis you want to conduct.
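That schema-on-read idea is easiest to see with a query engine running on top of Hadoop. The sketch below uses PySpark, which is not part of this article but commonly runs over HDFS; the file path and field name are hypothetical. The raw JSON is loaded as-is, and its structure is inferred and queried only at read time:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Hypothetical path on HDFS; the events were loaded raw, with no schema defined up front.
events = spark.read.json("hdfs:///data/raw/events/*.json")   # schema inferred at read time

events.printSchema()                           # see what structure the raw data actually has
events.groupBy("event_type").count().show()    # define the "view" only when querying

spark.stop()
```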

Organizations begin to utilize Hadoop when they need faster processing on large data sets, and they often find it saves them some money too. Large users of Hadoop include Facebook, Amazon, Adobe, eBay, and LinkedIn. It is also in use throughout the financial sector and the US government.

Hadoop is first and foremost a way for large data sets to be stored for analysis. It is intended as a viable alternative to a single-storage solution such as one hard drive. This transformed the way data is stored, because it is more efficient to distribute the data across numerous physical storage locations than to keep it in one giant receptacle: the more places the data is stored, the more machines can read it in parallel and the quicker it can be retrieved.

Think of it like driving down a one-lane freeway: eventually cars pile up. Hadoop builds a multi-lane freeway that lets data keep cruising quickly. With Hadoop, companies save money by essentially pooling their storage resources into a cluster on which Hadoop can operate. This is where HDFS comes in.

Where Does HDFS Fit Into This Equation?

Hadoop stores data using the distributed file system. The company’s computers and other hardware are connected through HDFS, allowing data files to be stored across an array of machines rather than in a single location.
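From a programmer’s point of view, reading and writing files that are physically spread across the cluster looks much like working with one big file system. A small sketch, assuming the third-party Python hdfs package (HdfsCLI), WebHDFS enabled on the cluster, and hypothetical host, user, and path names:

```python
from hdfs import InsecureClient   # third-party "hdfs" (HdfsCLI) package, an assumption

# Hypothetical NameNode address and user; WebHDFS must be enabled on the cluster.
client = InsecureClient("http://namenode.example.com:9870", user="analyst")

# Write a small file; HDFS decides which machines actually hold its blocks and replicas.
client.write("/data/notes/hello.txt", data=b"stored across the cluster", overwrite=True)

# List and read it back as if it all lived in one place.
print(client.list("/data/notes"))
with client.read("/data/notes/hello.txt") as reader:
    print(reader.read())
```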

There are many benefits to this distributed approach; one is that it acts as a backup of sorts. Imagine you had twenty one-dollar bills in your pocket versus one twenty-dollar bill. If you dropped the twenty, you would be out twenty dollars. If you dropped a one, you would still have nineteen dollars. HDFS works similarly, keeping replicas of each block of data on several machines, so technical and hardware failures aren’t a data death sentence.

The HDFS is a large part of the solution for companies attempting to manage big data efficiently. As the nature of Hadoop evolves, keep in mind its core functions to understand the changes.
