Engineering Math - Quick Reference                                 Home : www.sharetechnote.com

Big Data

What is Big Data ?

Even though a lot of people are talking about Big Data, it is not easy to clearly define what it is. Big Data is not a technical term; it is more like a marketing word. Like most marketing words, the term Big Data sounds simple, but it is very difficult to define clearly.

It may sound like 'a huge amount of data' and seem to mean 'only size matters'. Size is an important criterion for Big Data, but it is not the only one. In addition to the size of the data, the format of the data, the relationships among the data, and the structure of the data set can also be important criteria for distinguishing Big Data from conventional data.

One of the definitions that I have heard and that I like is "Big Data is a set of data that cannot be processed / analyzed by conventional relational database technology". This sounds more technical and clearer to me.

In a conventional database, the first thing we do is define a clear data structure (table structure), and all the data are stored according to that predefined format. Every piece of data is placed under a predefined label (e.g, a column name), and most analysis and processing is done based on those predefined labels. In other words, we already know the overall data structure and the relationships between the items in the database. But in most cases of Big Data, a huge amount of data is just collected and accumulated. The data don't have any clear labels, and there is no predefined structure categorizing them. (Of course, there may be very primitive/basic categories like 'text', 'image', 'video' etc, but this kind of labeling would not give you any specific information.) So, in Big Data, splitting the accumulated data into a set of meaningful groups becomes a very important step of the analysis.
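To make the contrast concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the `sales` table and its columns, the function names `region_totals` and `group_by_keyword`, and the sample records. The first function shows how a question is answered directly when labels (column names) are predefined. The second shows a very crude way of imposing groups on unlabeled text; real Big Data grouping would need far more than keyword matching.

```python
import sqlite3

def region_totals(rows):
    """Structured case: the column names are known in advance,
    so 'total amount per region' is a single SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, item TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    totals = dict(conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"))
    conn.close()
    return totals

def group_by_keyword(records, keywords):
    """Unlabeled case: raw text has no columns, so we first have to
    decide on groups ourselves. Here: first matching keyword wins."""
    groups = {k: [] for k in keywords}
    groups["other"] = []
    for text in records:
        lowered = text.lower()
        for k in keywords:
            if k in lowered:
                groups[k].append(text)
                break
        else:
            # No keyword matched this record.
            groups["other"].append(text)
    return groups

# Hypothetical sample data.
print(region_totals([("east", "pen", 2.0), ("east", "ink", 3.0),
                     ("west", "pen", 1.5)]))
print(group_by_keyword(["Traffic jam downtown", "Great pizza tonight",
                        "hello"], ["traffic", "pizza"]))
```

Note that in the second function the choice of keywords is itself the hard part: somebody (or some algorithm) has to decide which groups are meaningful before any analysis can start.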

Units of Data Volume

Even though the size of the data is not the only criterion for Big Data, very large units are usually used to describe it. If you read or watch things about Big Data, you will hear many data units that we don't normally use. For your convenience, I have summarized some of the data units you are likely to hear in various Big Data discussions.

 Unit          Amount (Decimal)    Amount (Binary)
 Yotta Byte    10^24               2^80
 Zetta Byte    10^21               2^70
 Exa Byte      10^18               2^60
 Peta Byte     10^15               2^50
 Tera Byte     10^12               2^40
 Giga Byte     10^9                2^30
 Mega Byte     10^6                2^20
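The decimal and binary amounts in the table are close but not equal, and the gap grows with each step up. The following Python sketch computes both columns and the relative difference. The names `UNITS`, `unit_sizes` and `binary_excess_percent` are my own choices for illustration.

```python
# Prefixes from the table, smallest first.
UNITS = ["Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]

def unit_sizes():
    """Return {prefix: (decimal_bytes, binary_bytes)}."""
    sizes = {}
    # Mega is the 2nd power of 1000 (decimal) or of 1024 (binary).
    for i, name in enumerate(UNITS, start=2):
        sizes[name] = (10 ** (3 * i), 2 ** (10 * i))
    return sizes

def binary_excess_percent(name):
    """How much larger the binary amount is than the decimal one, in %."""
    decimal, binary = unit_sizes()[name]
    return (binary - decimal) / decimal * 100

for name, (decimal, binary) in unit_sizes().items():
    print(f"{name:>5} Byte: decimal = {decimal:.0e}, "
          f"binary is {binary_excess_percent(name):.1f}% larger")
```

For Mega Byte the binary amount is about 4.9% larger than the decimal one; for Yotta Byte the difference is already about 20.9%, so it matters which convention a given figure uses.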

Who is generating big data ?

In most conventional database systems, there is usually a certain group of people who generate the data, and most of those people know that they are generating it. For example, a lot of sales people generate data by entering sales records by keyboard, scanner, bar code reader etc.

But in the case of Big Data, everybody (every one of us) and even machines (like sensors and cameras) are generating the data, and in most cases we don't even realize that we are generating it. The following are only a few examples of Big Data generation :

• All the text and images in Google
• All the messages you create on Twitter
• All the text and images you post on Facebook
• All the search words you type into search engines (e.g, Google, Bing, Yahoo etc)
• Location information when you access the internet, a mobile communication system etc

How to Analyze Big Data ?

How do you analyze conventional data ? Probably the most common things you would do are as follows :

• Plotting the data along time, region, age etc
• Taking an average, moving average etc
• Doing regression based on various functions
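As a reminder of what two of these conventional techniques look like in practice, here is a small Python sketch of a simple moving average and a least-squares straight-line fit (the simplest case of regression). The function names and the daily sales numbers are made up for illustration.

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical daily sales counts.
days = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 11, 15, 14, 18]
print(moving_average(sales, 3))
print(linear_fit(days, sales))
```

Both functions assume the data already sit in clean numeric columns (day, sales), which is exactly the assumption that often does not hold for Big Data.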

These techniques are used as tools in Big Data analysis as well, but in many cases they are not enough on their own. For example, let's assume that you have all the messages posted on Twitter. What would you do ? Of course, you can apply the conventional methods to this data as well. For example, you can plot the number of tweets on an hourly and daily basis. But what if you ask 'What are the most popular topics on Twitter ?' or 'How does a specific topic spread across people ?' Answering this kind of question from tweets would not be easy with conventional database technology (mostly based on SQL). You would need special tools that provide very high processing power, flexible search algorithms, relation finding algorithms etc.
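To get a feeling for the 'most popular topics' question, here is a deliberately naive Python sketch that counts hashtags in a handful of messages. It assumes that a hashtag is a usable stand-in for a topic, which is a strong simplification. The function name `top_topics` and the sample messages are made up. The sketch does not address the harder parts mentioned above: messages without hashtags, how a topic spreads from person to person, or the processing power needed for the real data volume.

```python
import re
from collections import Counter

# A hashtag is '#' followed by word characters (letters, digits, '_').
HASHTAG = re.compile(r"#(\w+)")

def top_topics(messages, n=3):
    """Count hashtags (case-insensitive) and return the n most common."""
    counts = Counter()
    for text in messages:
        # Use a set so a tag repeated inside one message counts once.
        counts.update({tag.lower() for tag in HASHTAG.findall(text)})
    return counts.most_common(n)

# Hypothetical messages.
messages = [
    "#BigData is here",
    "I like #bigdata and #Hadoop",
    "#hadoop #bigdata #bigdata",
]
print(top_topics(messages, 2))
```

On three messages this runs instantly. The point of Big Data tools is to make the same kind of counting feasible when the input is every message ever posted and does not fit on one machine.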

The most important technique for Big Data analysis is "Pattern Finding" or "Pattern Recognition".

How do you do this pattern finding ? Is there any clearly defined algorithm applicable to all Big Data ?

The most important tool for pattern finding/recognition is 'your brain'. It is even better if your brain holds a lot of knowledge in a specific field. In that case, your brain would be the best tool for Big Data in that field.

If you are an expert with deep experience in a certain area and you keep pouring a lot of information into your brain for a long time, your brain will build its own intuition and may spit out some meaningful conclusion/information on its own. In many cases you won't know exactly how the conclusion came out, but it will often be very useful anyway. Just trust your brain and make sure that you put a big enough amount of data into it over a long enough period of time.

I know.. this does not sound technical. If you are more interested in the technical side, search for topics like machine learning and graph theory as discussed in connection with Big Data analysis. Take some time to watch the videos linked below; they can be a good starting point.

Actually, the huge amount of video posted on YouTube is one of the typical examples of Big Data. Currently, YouTube categorizes and recommends those videos based on the text title given by the person who posted them. As far as I know as of now, only the human brain can understand the contents of this video data and extract the information as a whole. It would be excellent if software (or any kind of machine) could extract information from this kind of video/audio/image data as our brain does, and this is one of the biggest topics in Big Data research (especially in machine learning and Artificial Intelligence). I think there has already been a great deal of improvement in extracting information from images and short audio sentences.

Video Digest

• Basic Introduction/Short Clips : In these videos, you will find introductions to Big Data presented in many different ways by many different people.
• Talks/Discussion on Big Data
• Forum/Presentation/Lecture on Big Data
• Application Examples of Big Data : You will see various applications of Big Data in real life and in business models.
• Hadoop / other tools : Hadoop is the most widely used tool for Big Data collection/management and analysis. In almost any Big Data related video, you will hear some comments on this tool.
• Microsoft Excel for Big Data
• Algorithm/Machine Learning for Big Data Analysis