26 July 2021

Yottabytes, Big Data, Data Science, Storage


Yottabytes: Big Data Storage

1. Zettabyte Era

Data is measured in bits and bytes. One bit contains a value of 0 or 1. Eight bits make a byte. Then we have kilobytes (1,000 bytes), megabytes (1000² bytes), gigabytes (1000³ bytes), terabytes (1000⁴ bytes), petabytes (1000⁵ bytes), exabytes (1000⁶ bytes) and zettabytes (1000⁷ bytes).
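This ladder of prefixes is simple to express in code. A minimal Python sketch (the `unit_size` helper is illustrative, not a standard library function):

```python
# SI decimal prefixes for bytes: each step up is a factor of 1000.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

def unit_size(prefix: str) -> int:
    """Return the number of bytes in one <prefix>byte."""
    return 1000 ** (PREFIXES.index(prefix) + 1)

print(unit_size("zetta"))  # 1000**7 = 10**21 bytes
```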

Cisco estimated that in 2016 we passed one zettabyte in total annual Internet traffic, that is, all the data we have uploaded and shared on the World Wide Web, most of it file sharing. A zettabyte is a measure of storage capacity equal to 1000⁷ bytes (1,000,000,000,000,000,000,000 bytes). One zettabyte equals a thousand exabytes, a billion terabytes, or a trillion gigabytes. In other words — that’s a lot! Especially if we take into account that the Internet is not even 40 years old. Cisco also estimated that by 2020 the annual traffic would grow to over 2 zettabytes.


Internet traffic is only one part of total data storage, which also includes all personal and business devices. Estimates of the total data storage capacity we have right now, in 2019, vary, but are already in the 10–50 zettabyte range. By 2025 this is estimated to grow to 150–200 zettabytes.

Data creation will certainly only accelerate in the coming years, so you might wonder: is there any limit to data storage? Not really; or rather, there are limits, but they are so far away that we won’t get anywhere near them anytime soon. For example, just one gram of DNA can store 700 terabytes of data, which means we could store all the data we have right now on 1,500 kg of DNA; packed densely, it would fit into an ordinary room. That, however, is very far from what we can manufacture currently. The largest hard drive being manufactured holds 15 terabytes, and the largest SSD reaches 100 terabytes.

The term Big Data refers to a dataset that is too large or too complex for ordinary computing devices to process. As such, it is relative to the computing power available on the market. Looking at the recent history of data: in 1999 we had a total of 1.5 exabytes of data, and 1 gigabyte was considered big data. By 2006, total data was estimated at 160 exabytes, more than a hundredfold increase in 7 years. In our Zettabyte Era, 1 gigabyte is no longer really big data, and it makes sense to talk about big data starting at 1 terabyte or more. To put that in more mathematical terms, it seems natural to speak of Big Data for datasets that exceed the total data created in the world divided by 1000³.
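Under that definition the threshold tracks total data. A quick sketch, taking a rough mid-range figure from the estimates quoted above:

```python
# "Big data" threshold = total data in the world / 1000**3 (per the text).
ZB = 1000 ** 7  # bytes in a zettabyte
TB = 1000 ** 4  # bytes in a terabyte

total_2019 = 33 * ZB           # rough mid-range estimate of total storage, 2019
threshold = total_2019 / 1000 ** 3

print(threshold / TB)  # 33.0, i.e. big data starts in the tens-of-terabytes range
```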

2. Petaflops

For data to be useful, it’s not enough to store it; you also have to access and process it. A computer’s processing power can be measured either in instructions per second (IPS) or in floating-point operations per second (FLOPS). While IPS is broader than FLOPS, it is also less precise and depends on the programming language used. FLOPS, on the other hand, are easy to picture, as they relate directly to the number of multiplications/divisions we can do per second. For example, a simple handheld calculator needs several FLOPS to be functional, while most modern CPUs are in the range of 20–60 GFLOPS (gigaFLOPS = 1000³ FLOPS). The record-breaking computer built by IBM in 2018 reached 122.3 petaFLOPS (1000⁵ FLOPS) sustained, with 200 petaFLOPS at peak performance, which is several million times faster than an ordinary PC.
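To get a feel for those orders of magnitude, compare how long a fixed workload takes at each speed. A back-of-the-envelope sketch using the figures from the paragraph above:

```python
# Time to perform 10**18 floating-point operations at different speeds.
ops = 10 ** 18

cpu_flops = 50 * 10 ** 9         # ~50 GFLOPS, a typical modern desktop CPU
summit_flops = 122.3 * 10 ** 15  # IBM's 2018 record machine, 122.3 petaFLOPS

print(ops / cpu_flops)     # 2e7 seconds, i.e. roughly 8 months on a PC
print(ops / summit_flops)  # about 8 seconds on the supercomputer
```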


GPUs perform better on floating-point computations, with mass-market devices reaching several hundred GFLOPS. Things get interesting when you look at specialized architectures. The latest trend is building hardware to accelerate machine learning; the best-known example is Google’s TPU, which reaches 45 teraFLOPS (1000⁴ FLOPS) and can be accessed through the cloud.

If you need to perform large computations and you don’t have a supercomputer of your own, the next best thing is to rent one, that is, to compute in the cloud. Amazon gives you up to 1 petaFLOPS with P3 instances, while Google offers a pod of TPUs with speeds up to 11.5 petaFLOPS.

3. Artificial Intelligence and Big Data

Let’s put it all together: you have the data and the computing power to match it, so it’s time to use them to gain new insights. To really benefit from both, you have to turn to machine learning. Artificial Intelligence is at the forefront of data usage, helping make predictions about weather, traffic, and health (from discovering new drugs to early detection of cancer).

AI needs training to perform specialized tasks, and looking at how much training is needed to achieve peak performance is a great indicator of computing power versus data. There is a great report by OpenAI from 2018 evaluating these metrics and concluding that since 2012, the compute used in AI training, measured in petaflop/s-days (petaFD), has been doubling every 3.5 months. One petaFD consists of performing 1000⁵ neural-net operations per second for one day, a total of about 10²⁰ operations. The great thing about this metric is that it not only captures the architecture of a network (in the form of the number of operations needed) but connects it with the implementation on current devices (compute time).

You can compare how many petaFD were used in recent advances in AI by looking at the following chart:

The leader is, unsurprisingly, AlphaGo Zero by DeepMind, with over 1,000 petaFD used, or 1 exaFD. How much is that really in terms of resources? If you were to replicate the training yourself on the same hardware, you could easily end up spending close to $3M, as estimated here in detail. For a lower estimate, based on the above chart, 1,000 petaFD is at least like running the best available Amazon P3 instance for 1,000 days. At the current price of $31.218 per hour, this gives $31.218 x 24 (hours) x 1,000 (days) = $749,232. This is a lower bound, as it assumes that one neural-net operation equals one floating-point operation and that you get the same performance on a P3 as on the various GPUs/TPUs used by DeepMind.
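That lower bound is easy to reproduce. A quick sketch using the figures quoted above (the hourly rate is the one in the text and will have changed since):

```python
# Lower bound on replicating ~1,000 petaFD of training on a 1-petaFLOPS
# cloud instance: 1,000 days of compute at the quoted hourly rate.
price_per_hour = 31.218  # USD, the rate quoted in the text
days = 1_000

cost = price_per_hour * 24 * days
print(f"${cost:,.0f}")  # $749,232
```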


This shows that AI needs a lot of power and resources to be trained. There are recent advances in machine learning that did not require much computing power or data, but more often than not, additional computing power helps. That is why building better supercomputers and larger data centers makes sense if we want to develop artificial intelligence, and thus our civilization as a whole. You can think of supercomputers the way you think of the Large Hadron Collider: you build larger and larger colliders to access deeper truths about our universe. The same is true of computing power and artificial intelligence. We don’t understand our own intelligence or how we perform creative tasks, but increasing the scale of FLOPS can help unravel the mystery.

Embrace the Zettabyte Era! And profit from it quickly, as the Yottabyte Era is not far away.

As data gets bigger, what comes after a yottabyte?

An exabyte of data is created on the Internet each day, which equates to 250 million DVDs’ worth of information. And the idea of even larger amounts of data — a zettabyte — isn’t too far off when it comes to the amount of information traversing the web in any one year. Cisco estimates we’ll see 1.3 zettabytes of traffic annually over the Internet in 2016, and soon enough we might start talking about even bigger volumes.

After the zettabyte comes the yottabyte, which big-data scientists use to talk about how much government data the NSA or FBI have on people altogether. Put in terms of DVDs, a yottabyte would require 250 trillion of them. But we’ll eventually have to think bigger, and thanks to a presentation from Shantanu Gupta, director of Connected Intelligent Solutions at Intel, we now know the next-generation prefixes for going beyond the yottabyte: the brontobyte and the gegobyte.

A brontobyte, which isn’t an official SI prefix but is apparently recognized by some in the measurement community, is a 1 followed by 27 zeros. Gupta uses it to describe the type of sensor data we’ll get from the Internet of Things. A gegobyte is 10 to the power of 30. It’s meaningless to think about how many DVDs that would be; suffice it to say it’s more than I could watch in a lifetime.


And to drive home the influx of data, Gupta offered the following stats (although in the case of CERN, the SKA telescope and maybe the jet engine sensors, not all of that data needs to be stored):

  • On YouTube, 72 hours of video are uploaded per minute, translating to a terabyte every four minutes.
  • 500 terabytes of new data per day are ingested in Facebook databases.
  • The CERN Large Hadron Collider generates 1 petabyte per second.
  • The proposed Square Kilometer Array telescope will generate an exabyte of data per day.
  • Sensors from a Boeing jet engine create 20 terabytes of data every hour.
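Normalizing those rates to a common per-day figure makes them easier to compare. A rough sketch (the rates are the ones listed above):

```python
# Convert the quoted ingest rates to bytes per day for comparison.
TB, PB, EB = 1000 ** 4, 1000 ** 5, 1000 ** 6

per_day = {
    "YouTube uploads": TB / 4 * 60 * 24,  # a terabyte every four minutes
    "Facebook ingest": 500 * TB,          # 500 TB per day
    "LHC (raw)": PB * 86_400,             # 1 PB per second
    "SKA telescope": EB,                  # 1 EB per day
    "Jet engine sensors": 20 * TB * 24,   # 20 TB per hour
}
for name, rate in sorted(per_day.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rate / PB:,.1f} PB/day")
```

The raw LHC stream dwarfs everything else, which is why (as noted above) most of it is never stored.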

This raises the question: which tools should we use?

8 Best Big Data Hadoop Analytics Tools in 2021

Most companies have big data but are unaware of how to use it. Firms have started to realize how important it is to analyze their data in order to make better business decisions.

With the help of big data analytics tools, organizations can now use their data to harness new business opportunities. This in turn leads to smarter business leads, happy customers, and higher profits. Big data tools are crucial and can help an organization in multiple ways: better decision-making, new products and services for customers, and cost efficiency.

Let us explore the top analytics tools that are useful for big data:

1. Apache Hive

A Java-based, cross-platform tool, Apache Hive is a data warehouse built on top of Hadoop. A data warehouse is simply a place where data generated from multiple sources is stored on a single platform. Apache Hive is considered one of the best tools for data analysis, and a big data professional who is well acquainted with SQL can use Hive easily. The query language used here is HiveQL (HQL).


Hive supports several storage formats, such as ORC, HBase, and plain text.

The HQL queries resemble SQL queries.

Hive can operate on compressed data stored inside the Hadoop ecosystem.

It has built-in functions for data mining.

2. Apache Mahout

The term Mahout comes from a Hindi word for a person who rides an elephant. Apache Mahout’s algorithms run on top of Hadoop, hence the name. Apache Mahout is ideal for implementing machine learning algorithms on the Hadoop ecosystem. Notably, Mahout can also run machine learning algorithms standalone, without requiring Hadoop integration.


Composed of matrix and vector libraries.

Used for analyzing large datasets.

Ideal for machine learning algorithms.


3. Apache Impala

Designed for Hadoop, Apache Impala is an open-source SQL engine. It offers faster processing and overcomes the speed-related issues of Apache Hive. Impala uses SQL-like syntax and the same user interface and ODBC driver as Apache Hive, and it integrates easily with the Hadoop ecosystem for big data analytics.


Offers easy integration.

It is scalable.

Provides security.

Offers in-memory data processing.

4. Apache Spark

It is an open-source framework used in data analytics, fast cluster computing, and even machine learning. Apache Spark is ideally designed for batch applications, interactive queries, streaming data processing, and machine learning.


Easy and cost-efficient.

Spark offers a high-level library that is used for streaming.

Its powerful processing engine makes it run fast.

It has in-memory processing.

5. Apache Pig

Apache Pig was first developed by Yahoo to make programming easier for developers. Ever since, it has offered the advantage of processing extensive datasets. Pig is used to analyze large datasets represented as data flows. Most of these tools can be learned through professional certifications from some of the top big data certification platforms available online. As big data keeps evolving, big data tools will be of the utmost significance to most industries.


Known to handle multiple types of data.

Easily extensible.

Easy to program.


6. Apache Storm

Apache Storm is a free, open-source distributed real-time computation system, built with programming languages such as Java and Clojure. Apache Storm is used for streaming because of its speed, and it can also be used for real-time processing and machine learning workloads. It is used by top companies such as Twitter, Spotify, and Yahoo.


Easy to operate.

Fault tolerance.


7. Apache Sqoop

Sqoop is a command-line tool developed by Apache. Its major purpose is to import structured data from relational database management systems (RDBMS) such as Oracle, SQL Server, and MySQL into the Hadoop Distributed File System (HDFS). Apache Sqoop can also transfer data from HDFS back to an RDBMS.


Sqoop controls parallelism.

Helps connect to the database server.

Offers features to import data into HBase or Hive.


8. HBase

HBase is a distributed, column-oriented, non-relational database. It is composed of multiple tables, each consisting of many rows of data. Each row has multiple column families, and each column family consists of key-value pairs. HBase is ideal when looking up small amounts of data in large datasets.


The Java API is used for client access.

It offers a block cache for real-time data queries.

Offers modularity and linear scalability.

Besides the above-mentioned tools, you can also use Tableau to provide interactive visualizations of the insights drawn from the data, and MapReduce, the batch processing engine that underlies Hadoop.

However, you need to pick the right tool for your project.


As Big Data Explodes, Are You Ready For Yottabytes?

The inescapable truth about big data, the thing you must plan for, is that it just keeps getting bigger. As transactions, electronic records, and images flow in by the millions, terabytes grow into petabytes, which swell into exabytes. Next come zettabytes and, beyond those, yottabytes.

A yottabyte is a billion petabytes. Most calculators can’t even display a number of that size, yet the federal government’s most ambitious research efforts are already moving in that direction. In April, the White House announced a new scientific program, called the Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative, to “map” the human brain. Francis Collins, the director of the National Institutes of Health, said the project, which was launched with $100 million in initial funding, could eventually entail yottabytes of data.


And earlier this year, the US Department of Defense solicited bids for up to 4 exabytes of storage, to be used for image files generated by satellites and drones. That’s right—4 exabytes! The contract award has been put on hold temporarily as the Pentagon weighs its options, but the request for proposals is a sign of where things are heading.

Businesses also are racing to capitalize on the vast amounts of data they’re generating from internal operations, customer interactions, and many other sources that, when analyzed, provide actionable insights. An important first step in scoping out these big data projects is to calculate how much data you’ve got—then multiply by a thousand.

If you think I’m exaggerating, I’m not. It’s easy to underestimate just how much data is really pouring into your company. Businesses are collecting more data, new types of data, and bulkier data, and it’s coming from new and unforeseen sources. Before you know it, your company’s all-encompassing data store isn’t just two or three times what it had been; it’s a hundred times more, then a thousand.

Not that long ago, the benchmark for databases was a terabyte, or a trillion bytes. Say you had a 1 terabyte database and it doubled in size every year—a robust growth rate, but not unheard of these days. That system would exceed a petabyte (a thousand terabytes) in 10 years.
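The ten-year figure follows from compounding: 2¹⁰ = 1,024, so a terabyte doubling annually crosses the thousand-terabyte (petabyte) mark in the tenth year. A minimal check:

```python
# Years for a 1 TB database, doubling annually, to exceed a petabyte (1,000 TB).
size_tb, years = 1, 0
while size_tb <= 1000:
    size_tb *= 2
    years += 1
print(years, size_tb)  # 10 years, at 1,024 TB
```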


And many businesses are accumulating data even faster. For example, data is doubling every six months at Novation, a healthcare supply contracting company, according to Alex Latham, the company’s vice president of e-business and systems development. Novation has deployed Oracle Exadata Database Machine and Oracle’s Sun ZFS Storage appliance products to scale linearly—in other words, without any slowdown in performance—as data volumes keep growing. (In this short video interview, Latham explains the business strategy behind Novation’s tech investment.)

Terabytes are still the norm in most places, but a growing number of data-intensive businesses and government agencies are pushing into the petabyte realm. In the latest survey of the Independent Oracle Users Group, 5 percent of respondents said their organizations were managing 1 to 10 petabytes of data, and 6 percent had more than 10 petabytes. You can find the full results of the survey, titled “Big Data, Big Challenges, Big Opportunities,” here.

These burgeoning databases are forcing CIOs to rethink their IT infrastructures. Turkcell, the leading mobile communications and technology company in Turkey, has also turned to Oracle Exadata Database Machine, which combines advanced compression, flash memory, and other performance-boosting features, to condense 1.2 petabytes of data into 100 terabytes for speedier analysis and reporting.

Envisioning a Yottabyte

Some of these big data projects involve public-private partnerships, making best practices of utmost importance as petabytes of information are stored and shared. On the new federal brain-mapping initiative, the National Institutes of Health is collaborating with other government agencies, businesses, foundations, and neuroscience researchers, including the Allen Institute, the Howard Hughes Medical Institute, the Kavli Foundation, and the Salk Institute for Biological Studies.


Space exploration and national intelligence are other government missions soon to generate yottabytes of data. The National Security Agency’s new 1-million-square-foot data center in Utah will reportedly be capable of storing a yottabyte.

That brings up a fascinating question: Just how much storage media and real-world physical space are necessary to house so much data that a trillion bytes are considered teensy-weensy? By one estimate, a zettabyte (that’s 10 to the twenty-first power) of data is the equivalent of all of the grains of sand on all of Earth’s beaches.

Of course, IT pros in business and government manage data centers, not beachfront, so the real question is how can they possibly cram so much raw information into their data centers, and do so when budget pressures are forcing them to find ways to consolidate, not expand, those facilities?

The answer is to optimize big data systems to do more with less—actually much, much more with far less. I mentioned earlier that mobile communications company Turkcell is churning out analysis and reports nearly 10 times faster than before. What I didn’t say was that, in the process, the company also shrank its floor space requirements by 90 percent and energy consumption by 80 percent through its investment in Oracle Exadata Database Machine, which is tuned for these workloads.

Businesses will find that there are a growing number of IT platforms designed for petabyte and even exabyte workloads. A case in point is Oracle’s StorageTek SL8500 modular library system, the world’s first exabyte storage system. And if one isn’t enough, 32 of those systems can be connected to create 33.8 exabytes of storage managed through a single interface.

So, as your organization generates, collects, and manages terabytes upon terabytes of data, and pursues an analytics strategy to take advantage of all of that pent-up business value, don’t underestimate how quickly it adds up. Think about all of the grains of sand on all of Earth’s beaches, and remember: The goal is to build sand castles, not get buried by the sand.

The BRAIN Initiative

The BRAIN Initiative — short for Brain Research through Advancing Innovative Neurotechnologies — builds on the President’s State of the Union call for historic investments in research and development to fuel the innovation, job creation, and economic growth that together create a thriving middle class.

The Initiative promises to accelerate the invention of new technologies that will help researchers produce real-time pictures of complex neural circuits and visualize the rapid-fire interactions of cells that occur at the speed of thought. Such cutting-edge capabilities, applied to both simple and complex systems, will open new doors to understanding how brain function is linked to human behavior and learning, and the mechanisms of brain disease.

In his remarks this morning, the President highlighted the BRAIN Initiative as one of the Administration’s “Grand Challenges” – ambitious but achievable goals that require advances in science and technology to accomplish. The President called on companies, research universities, foundations, and philanthropies to join with him in identifying and pursuing additional Grand Challenges of the 21st century—challenges that can create the jobs and industries of the future while improving lives.

In addition to fueling invaluable advances that improve lives, the pursuit of Grand Challenges can create the jobs and industries of the future.


That’s what happened when the Nation took on the Grand Challenge of the Human Genome Project. As a result of that daunting but focused endeavor, the cost of sequencing a single human genome has declined from $100 million to $7,000, opening the door to personalized medicine.

Like sequencing the human genome, President Obama’s BRAIN Initiative provides an opportunity to rally innovative capacities in every corner of the Nation and leverage the diverse skills, tools, and resources from a variety of sectors to have a lasting positive impact on lives, the economy, and our national security.

That’s why we’re so excited that critical partners from within and outside government are already stepping up to the President’s BRAIN Initiative Grand Challenge.

The BRAIN Initiative is launching with approximately $100 million in funding for research supported by the National Institutes of Health (NIH), the Defense Advanced Research Projects Agency (DARPA), and the National Science Foundation (NSF) in the President’s Fiscal Year 2014 budget. 

Foundations and private research institutions are also investing in the neuroscience that will advance the BRAIN Initiative. The Allen Institute for Brain Science, for example, will spend at least $60 million annually to support projects related to this initiative. The Kavli Foundation plans to support BRAIN Initiative-related activities with approximately $4 million per year over the next ten years. The Howard Hughes Medical Institute and the Salk Institute for Biological Studies will also dedicate research funding for projects that support the BRAIN Initiative.

This is just the beginning. We hope many more foundations, Federal agencies, philanthropists, non-profits, companies, and others will step up to the President’s call to action. 


Microsoft Linux is Here!

CBL-Mariner is an internal Linux distribution for Microsoft’s cloud infrastructure and edge products and services. CBL-Mariner is designed to provide a consistent platform for these devices and services and will enhance Microsoft’s ability to stay current on Linux updates. This initiative is part of Microsoft’s increasing investment in a wide range of Linux technologies, such as SONiC, Azure Sphere OS and Windows Subsystem for Linux (WSL). CBL-Mariner is being shared publicly as part of Microsoft’s commitment to Open Source and to contribute back to the Linux community. CBL-Mariner does not change our approach or commitment to any existing third-party Linux distribution offerings.

CBL-Mariner has been engineered with the notion that a small common core set of packages can address the universal needs of first party cloud and edge services while allowing individual teams to layer additional packages on top of the common core to produce images for their workloads. This is made possible by a simple build system that enables:

Package Generation: This produces the desired set of RPM packages from SPEC files and source files.

Image Generation: This produces the desired image artifacts like ISOs or VHDs from a given set of packages.

Whether deployed as a container or a container host, CBL-Mariner consumes limited disk and memory resources. The lightweight characteristics of CBL-Mariner also provide faster boot times and a minimal attack surface. By limiting the features in the core image to just what is needed for our internal cloud customers, there are fewer services to load and fewer attack vectors.

When security vulnerabilities arise, CBL-Mariner supports both a package-based and an image-based update model. Leveraging the common RPM Package Manager system, CBL-Mariner makes the latest security patches and fixes available for download, with the goal of fast turnaround times.
