Thursday, 24 April 2014

Understanding of Big Data

  • What is Big Data?
  1. The term Big Data applies to information that can't be processed or analyzed using traditional processes or tools. Increasingly, organizations today are facing more and more Big Data challenges. They have access to a wealth of information, but they don't know how to get value out of it because it is sitting in its most raw form or in a semi-structured or unstructured format; as a result, they don't even know whether it's worth keeping (or whether they are even able to keep it).
  2. Big Data is a collection of digital information whose size is beyond the ability of most software tools and people to capture, manage, and process.
  3. Big Data solutions are ideal for analyzing not only raw structured data but also semi-structured and unstructured data from a wide variety of sources.
  4. Big Data solutions are ideal when all, or most, of the data needs to be analyzed: a sample of the data is not nearly as effective as a larger set of data from which to derive the analysis.
  5. Big Data solutions are ideal for iterative and exploratory analysis, when the business measures on the data are not predetermined.
  6. Big Data technologies describe a new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and analysis.
  7. "Big Data" is a big buzz phrase in the IT and business world right now, and there is a dizzying array of opinions on just what these two simple words really mean. Technology vendors in the legacy database or data warehouse spaces say "Big Data" simply refers to a traditional data warehousing scenario involving data volumes in either the single or multi-terabyte range. Others disagree: they say "Big Data" isn't limited to traditional data warehouse situations, but includes real-time or operational data stores used as the primary data foundation for online applications that power key external or internal business systems. These transactional, real-time databases used to be "pruned" so they stayed manageable from a data-volume standpoint: the most recent or "hot" data remained in the database, while older information was archived to a data warehouse via extract-transform-load (ETL) routines.
  8. But Big Data has changed dramatically. The evolution of the web has redefined:
     The speed at which information flows into these primary online systems.
     The number of customers a company must deal with.
     The acceptable interval between the time that data first enters a system and its transformation into information that can be analyzed to make key business decisions.
  9. In short, Big Data is a term used to describe large collections of data (also known as data sets) that may be unstructured and that grow so large and so quickly that they are difficult to manage with regular database or statistics tools.
  • Characteristics of Big Data
Four characteristics define Big Data: Volume, Velocity, Variety, and Value.

1. Volume – terabytes (TB) to petabytes (PB) of data.
2. Velocity – how fast the data is arriving and must be handled.
3. Variety – all types of data are now being captured (structured, semi-structured, and unstructured).
4. Value – mining the valuable pieces of data from among the data that does not matter.


  • The Volume of Data
    The sheer volume of data being stored today is exploding. In the year 2000, 800,000 petabytes (PB) of data were stored in the world; we expect this number to reach 35 zettabytes (ZB) by 2020. Twitter alone generates more than 7 terabytes (TB) of data every day, Facebook 10 TB, and some enterprises generate terabytes of data every hour of every day of the year. Of course, a lot of the data that's being created today isn't analyzed at all, and that's another problem we're trying to address with BigInsights.

The volume of data available to organizations today is on the rise, while the percent of data they can analyze is on the decline.

  • The Variety of Data
The volume associated with the Big Data phenomenon brings along a new challenge for data centers trying to deal with it: its variety. With the explosion of sensors and smart devices, as well as social collaboration technologies, data in an enterprise has become complex, because it includes not only traditional relational data but also raw, semi-structured, and unstructured data from web pages, web log files (including click-stream data), search indexes, social media forums, e-mail, documents, sensor data from active and passive systems, and so on. What's more, traditional systems can struggle to store and perform the required analytics to gain understanding from the contents of these logs, because much of the information being generated doesn't lend itself to traditional database technologies. In our experience, although some companies are moving down the path, by and large most are just beginning to understand the opportunities of Big Data (and what's at stake if it's not considered).
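
To make the structured/semi-structured/unstructured distinction concrete, here is a minimal Python sketch that parses hypothetical click-stream log lines (a timestamp, a user id, and a JSON payload; the format, field names, and URLs are assumptions for illustration) into structured records that could then be loaded into an analytics system:

```python
import json
from urllib.parse import urlparse

# Hypothetical click-stream log lines: a timestamp, a user id, and a JSON payload.
# The format is an assumption for illustration; real web logs vary widely.
raw_lines = [
    '2014-04-24T10:15:02 user42 {"url": "http://example.com/products/15", "referrer": "http://google.com"}',
    '2014-04-24T10:15:09 user42 {"url": "http://example.com/cart", "referrer": "http://example.com/products/15"}',
]

def parse_line(line):
    """Turn one semi-structured log line into a structured record (dict)."""
    timestamp, user_id, payload = line.split(" ", 2)
    event = json.loads(payload)
    return {
        "timestamp": timestamp,
        "user": user_id,
        "path": urlparse(event["url"]).path,
        "referrer": event.get("referrer"),
    }

records = [parse_line(line) for line in raw_lines]
for record in records:
    print(record)
```

In practice this parsing step usually runs inside a distributed framework such as Hadoop, because a single machine cannot keep up with the volumes described above.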

  • The Velocity of Data
Just as the sheer volume and variety of data we collect and store has changed, so, too, has the velocity at which it is generated and needs to be handled. A conventional understanding of velocity typically considers how quickly the data is arriving and being stored, and its associated rates of retrieval. While managing all of that quickly is good (and the volumes of data we are looking at are a consequence of how quickly the data arrives), we believe the idea of velocity is actually something far more compelling than these conventional definitions.
To accommodate velocity, a new way of thinking about a problem must start at the inception point of the data. Rather than confining the idea of velocity to the growth rates associated with your data repositories, we suggest you apply this definition to data in motion: the speed at which the data is flowing. After all, we're in agreement that today's enterprises are dealing with petabytes of data instead of terabytes, and the increase in RFID sensors and other information streams has led to a constant flow of data at a pace that has made it impossible for traditional systems to handle.
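
To illustrate the "data in motion" idea, the sketch below processes a simulated sensor stream event by event, keeping only a rolling window of recent readings rather than storing everything first and analyzing later; the sensor names, readings, and spike threshold are made up for illustration:

```python
import random
import collections

def sensor_stream(n_events=1000):
    """Simulated stream of (sensor_id, reading) pairs; stands in for RFID or sensor feeds."""
    for _ in range(n_events):
        yield random.choice(["rfid-1", "rfid-2", "rfid-3"]), random.gauss(20.0, 5.0)

# Process each event as it arrives (data in motion) instead of storing first, analyzing later.
window = collections.deque(maxlen=100)   # rolling window of the most recent readings
counts = collections.Counter()

for sensor_id, reading in sensor_stream():
    counts[sensor_id] += 1
    window.append(reading)
    rolling_avg = sum(window) / len(window)
    if reading > rolling_avg + 15:        # illustrative threshold, an assumption
        print(f"spike on {sensor_id}: {reading:.1f} (rolling avg {rolling_avg:.1f})")

print("events per sensor:", dict(counts))
```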

  • The Value of Data
The economic value of different data varies significantly. Typically there is good information hidden amongst a larger body of non-traditional data; the challenge is identifying what is valuable and then transforming and extracting that data for analysis.
  • Big Data Use Cases

  • Sentiment Analysis
Let's start with the most widely discussed use case, sentiment analysis. Whether looking for broad economic indicators, specific market indicators, or sentiments concerning a specific company or its stocks, there is obviously a trove of data to be harvested here, available from traditional as well as new media (including social media) sources. While news keyword analysis and entity extraction have been in play for a while, and are readily offered by many vendors, the availability of social media intelligence is relatively new and has certainly captured the attention of those looking to gauge public opinion. (In a previous post, I discussed the applicability of Semantic technology and Entity Extraction for this purpose, but as promised, I'm sticking to the usage topic this time).
Sentiment analysis is considered straightforward, as the data resides outside the institution and is therefore not confined by organizational boundaries. In fact, sentiment analysis is becoming so popular that some hedge funds are basing their entire strategies on trading signals generated by Twitter analytics. While this is an extreme example, most financial institutions at this point are using some sort of sentiment analysis to gauge public opinion about their company, market, or the economy as a whole.
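
As a toy illustration of the idea (not of any production system), the sketch below scores a couple of made-up messages against small positive/negative word lists; real sentiment analysis relies on trained models and far richer lexicons:

```python
# Toy word lists; real sentiment analysis uses trained models or much larger lexicons.
POSITIVE = {"up", "gain", "strong", "beat", "bullish"}
NEGATIVE = {"down", "loss", "weak", "miss", "bearish"}

def sentiment_score(text):
    """Very rough polarity score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Hypothetical messages standing in for a social media feed.
messages = [
    "Earnings beat expectations, stock up and looking strong",
    "Weak guidance, shares down on the miss",
]
for msg in messages:
    print(sentiment_score(msg), msg)
```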
  • Predictive Analytics
Another fairly common use case is predictive analytics. Including correlations, back-testing strategies, and probability calculations using Monte Carlo simulations, these analytics are the bread and butter of all capital market firms, and are relevant both for strategy development and risk management. The large amounts of historical market data, and the speed at which new data sometimes needs to be evaluated (e.g. complex derivatives valuations) certainly make this a big data problem. And while traditionally these types of analytics have been processed by large compute grids, today, more and more institutions are looking at technologies that would bring compute workloads closer to the data, in order to speed things up. In the past, these types of analytics have been primarily executed using proprietary tools, while today they are starting to move towards open source frameworks such as R and Hadoop (detailed in previous posts).
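
As a small illustration of the Monte Carlo side of this, the sketch below estimates the probability that a price finishes above a target after one year under geometric Brownian motion; all parameters (starting price, drift, volatility, target, path count) are illustrative assumptions:

```python
import random
import math

def monte_carlo_prob(s0=100.0, mu=0.05, sigma=0.2, days=252, n_paths=10_000, target=120.0):
    """Estimate the probability that a price ends above `target` after one year,
    assuming geometric Brownian motion (all parameters are illustrative)."""
    dt = 1.0 / days
    hits = 0
    for _ in range(n_paths):
        price = s0
        for _ in range(days):
            z = random.gauss(0.0, 1.0)
            price *= math.exp((mu - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)
        if price > target:
            hits += 1
    return hits / n_paths

print(f"P(price > 120 in 1 year) = {monte_carlo_prob():.2%}")
```

Scaling this from ten thousand paths on one machine to millions of paths over years of market data is exactly where the argument for bringing the compute workload closer to the data comes in.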
  • Risk Management
As we move closer to continuous risk management, broader calculations such as the aggregation of counterparty exposure or VaR (value at risk) also fall within the realm of Big Data, if only due to the mounting pressure to rapidly analyze risk scenarios well beyond the capacity of current systems, while dealing with ever-growing volumes of data. New computing paradigms that parallelize data access as well as computation are gaining a lot of traction in this space. A somewhat related topic is the integration of risk and finance, as risk-adjusted returns and P&L require that growing amounts of data be integrated from multiple, standalone departments across the firm, and accessed and analyzed on the fly.
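
As a minimal example of one such calculation, here is a historical VaR sketch over a made-up series of daily portfolio returns; at scale, the same idea runs over years of positions and market data rather than fourteen numbers:

```python
# Minimal historical VaR sketch: given a series of daily portfolio returns,
# the 95% VaR is (minus) the 5th percentile of that return distribution.
# The returns below are made-up numbers for illustration only.
returns = [-0.021, 0.004, 0.013, -0.007, 0.009, -0.032, 0.001,
           0.006, -0.011, 0.017, -0.004, 0.008, -0.019, 0.002]

def historical_var(returns, confidence=0.95):
    """Return the loss threshold not exceeded with the given confidence."""
    ordered = sorted(returns)                      # worst returns first
    index = int((1.0 - confidence) * len(ordered)) # cut-off for the left tail
    return -ordered[index]

print(f"1-day 95% VaR: {historical_var(returns):.1%} of portfolio value")
```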
  • Rogue Trading
Speaking of finance and accounting, a less common use case - but one that is frequently discussed as we're faced with increasing implications - is rogue trading. Deep analytics that correlate accounting data with position tracking and order management systems can provide valuable insights that are not available using traditional data management tools. Here too, a lot of data needs to be crunched from multiple, inconsistent sources in a very dynamic way, requiring some of the technologies and patterns discussed in earlier posts.
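
A minimal sketch of that correlation idea: reconcile positions reported by an order management system against positions implied by accounting records, and flag any mismatch for review. The system names, books, quantities, and tolerance are all hypothetical:

```python
# Sketch of correlating two internal sources: positions reported by an order
# management system (OMS) vs. positions implied by accounting records.
# Any mismatch above a tolerance is flagged for review.
oms_positions = {"desk-7/AAPL": 150_000, "desk-7/MSFT": 80_000, "desk-9/GOOG": 25_000}
accounting_positions = {"desk-7/AAPL": 150_000, "desk-7/MSFT": 30_000, "desk-9/GOOG": 25_000}

def reconcile(oms, accounting, tolerance=0):
    """Yield (book, oms_qty, accounting_qty) for every position that does not reconcile."""
    for book in set(oms) | set(accounting):
        oms_qty = oms.get(book, 0)
        acc_qty = accounting.get(book, 0)
        if abs(oms_qty - acc_qty) > tolerance:
            yield book, oms_qty, acc_qty

for book, oms_qty, acc_qty in reconcile(oms_positions, accounting_positions):
    print(f"mismatch in {book}: OMS={oms_qty}, accounting={acc_qty}")
```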
  • Fraud
Turning our attention to the detection of more sinister fraud, a similar point can be made. Correlating data from multiple, unrelated sources has the potential to catch fraudulent activities earlier than current methods. Consider for instance the potential of correlating Point of Sale data (available to a credit card issuer) with web behavior analysis (either on the bank's site or externally), and cross-examining it with other financial institutions or service providers such as First Data or SWIFT. This would not only improve fraud detection but could also decrease the number of false positives (which are part and parcel of many travelers' experience today).
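
A minimal sketch of this kind of cross-source correlation, using hypothetical point-of-sale transactions and web banking sessions: flag a card-present purchase when the same customer was recently seen online in a different country. The records, field names, and one-hour window are assumptions for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical records: card-present transactions from a point-of-sale feed and
# web/mobile banking sessions for the same customer.
pos_transactions = [
    {"card": "4111", "country": "US", "time": datetime(2014, 4, 24, 14, 5), "amount": 420.0},
    {"card": "4111", "country": "BR", "time": datetime(2014, 4, 24, 14, 50), "amount": 980.0},
]
web_sessions = [
    {"card": "4111", "country": "US", "time": datetime(2014, 4, 24, 14, 30)},
]

def suspicious(transactions, sessions, window=timedelta(hours=1)):
    """Yield transactions whose country disagrees with a recent web session's country."""
    for tx in transactions:
        for s in sessions:
            same_card = s["card"] == tx["card"]
            recent = abs(tx["time"] - s["time"]) <= window
            if same_card and recent and s["country"] != tx["country"]:
                yield tx

for tx in suspicious(pos_transactions, web_sessions):
    print("review:", tx)
```

Combining more sources in this way is also what helps drive down the false positives mentioned above: the more context available per transaction, the fewer legitimate purchases get blocked.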
  • Retail Banking
Most banks are paying much closer attention to their customers these days than in the past, as many look at ways to offer new, targeted services in order to reduce customer turnover and increase customer loyalty (and, in turn, the banks' revenue). In some ways this is no different from retailers' targeted offering and discounting strategies. The attention that mobile wallets alone have been getting recently attests to the importance that all parties involved – from retailers to telcos to financial institutions – are putting on these types of analytics, which are rendered even more powerful when geo-location information is added to the mix.
Banks, however, have additional concerns, as their products all revolve around risk, and the ability to accurately assess the risk profile of an individual or a loan is paramount to offering (or denying) services to a customer. Though the need to protect consumer privacy will always prevail, banks now have more access to web data about their customers – undoubtedly putting more informational options at their fingertips – to provide them with the valuable information needed to target service offerings with a greater level of sophistication and certainty. Additionally, web data can help to signal customer life events such as a marriage, childbirth, or a home purchase, which can help banks introduce opportunities for more targeted services. And again, with location information (available from almost every cell phone) banks can achieve extremely granular customer targeting.
