Big Data Tutorial Summarization

Big Data

"What is big data?" is a question that is often asked. It is not regular data: it is data that does not fit well into a familiar analytic paradigm. It will not fit into the rows and columns of an Excel sheet, it is hard to analyze, and it will not fit on a normal computer's hard drive. Big data can be described by looking at the three V's: volume, velocity and variety.
Volume

What was called big data in the 1970s and 1980s can now fit on a $3 flash drive. The maximum number of rows in a single Excel spreadsheet has changed over time: it was once in the thousands and is now over a million. This may seem like a lot, but if you are logging internet activity it is actually a small number. There are cameras today that capture video at 18 GB per minute. In terms of volume, big data is the kind of data we are used to, just a lot more of it.

Velocity

Computer systems are creating data at ever-increasing speeds, the number of data consumers keeps growing, and so does the demand for access to the data. Data arrives very fast, and constant streaming is a challenge because the data is no longer static: it has become a moving target. This is driving the trend toward high-velocity data and real-time analytics.

Variety

Variety is an important aspect of big data. It means that there can be many types of data, including data that is hard to fit into the rows and columns of an Excel sheet or a relational database. Variety may be the main reason to use big data solutions.
A recent study shows that variety is the biggest factor leading companies to big data solutions.

Consumers

For consumers, big data does a big job in providing valuable services while operating invisibly: it takes a huge amount of information and reduces it to the two or three things the consumer needs. For example, Google Now, Netflix and Spotify all use big data to gather information about what you might like.

Research

In research, big data has contributed much to science. Google Flu Trends, for example, is able to identify flu outbreaks in the United States much faster than government specialists, and Wikipedia data reportedly does so with even more accuracy.
A quick look at how big data is different from small data

Small data is usually gathered for a specific goal. Big data, on the other hand, may have had a goal in mind when it was first collected, but things can evolve and take other directions. Small data is usually in one place, often on a single computer. Big data can be spread over multiple files, multiple servers and even computers in different geographical locations. When it comes to data preparation, small data is usually prepared by the end user for their own purposes. Big data is often prepared by one group of people, analyzed by a second group and then used by a third.

What does it take to be a data scientist

The Data Science Venn Diagram

Hacking Skills

Data is transferred and traded electronically; therefore, in order to be in this market you need to speak hacker. This does not require a background in computer science, however: many hackers these days never took a single CS course. Being able to manipulate text files at the command line, understanding vectorized operations and thinking algorithmically are the hacking skills that make for a successful data hacker.
Statistics

Once you have acquired and cleaned the data, the next step is to extract insight from it. To do this, you need to apply appropriate math and statistics methods, which requires at least a baseline familiarity with these tools. This is not to say that a PhD in statistics is required to be a competent data scientist, but it does require a good background in statistics.

Substantive expertise

Data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world that can be brought to data and tested with statistical methods.

The question now is: do you need all three skills in order to work with big data, or just one? Unless you have at least two of them, you cannot work with big data on your own. If you have only one, you need to work with other people who have the other skills so that you can complement each other.
Big Data and Privacy

With so much data being generated, collected and processed, the importance of data privacy and its protection will only grow. Hackers' interest will grow with it, creating the potential for unauthorized access to large amounts of data that could be private.

When it comes to businesses and consumers, there are some tricky big data situations that need to be handled carefully in order not to invade anyone's privacy. For example, a firm used data analytics on its customer database to work out when female customers appeared to be pregnant, based on their browsing and buying habits. The company then sent coupons for certain items to its customers, including those it expected to become mothers. All of this was done to improve customer loyalty.

This also means that if one data set is combined with another, completely separate database without first determining whether any data items should be removed to protect privacy and hidden information, some people could be re-identified. The important point is that two data files that are each anonymous can, once combined, breach people's privacy if the data is not masked well.
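The re-identification risk described above can be illustrated with a small sketch. Everything in it is hypothetical: the records are made up, and the choice of ZIP code, birth year and gender as linking fields (QUASI_IDENTIFIERS), as well as the helper names reidentify and mask, are assumptions for illustration only. Neither table alone links a name to a purchase, but joining them on the shared fields does.

```python
# Hypothetical, made-up records used only to illustrate the idea.
purchases = [  # "anonymous" shopping data: no names
    {"zip": "10001", "birth_year": 1985, "gender": "F", "item": "prenatal vitamins"},
    {"zip": "10002", "birth_year": 1990, "gender": "M", "item": "headphones"},
]
directory = [  # separate data set: names, but no purchases
    {"name": "Alice Example", "zip": "10001", "birth_year": 1985, "gender": "F"},
    {"name": "Bob Example", "zip": "10002", "birth_year": 1990, "gender": "M"},
    {"name": "Carol Example", "zip": "10003", "birth_year": 1985, "gender": "F"},
    {"name": "Dan Example", "zip": "10004", "birth_year": 1992, "gender": "M"},
]

# Assumption: these three fields appear in both data sets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def key(record):
    """Build the linking key from the shared fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymous_rows, named_rows):
    """Return (name, item) pairs for rows that match exactly one named person."""
    names_by_key = {}
    for row in named_rows:
        names_by_key.setdefault(key(row), []).append(row["name"])
    matches = []
    for row in anonymous_rows:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # a unique match re-identifies the person
            matches.append((names[0], row["item"]))
    return matches


def mask(record):
    """Coarsen the linking fields: 3-digit ZIP prefix and decade of birth."""
    masked = dict(record)
    masked["zip"] = record["zip"][:3]
    masked["birth_year"] = record["birth_year"] // 10 * 10
    return masked


if __name__ == "__main__":
    print(reidentify(purchases, directory))
    print(reidentify([mask(r) for r in purchases], [mask(r) for r in directory]))
```

In this toy data, coarsening the fields makes several people share the same key, so no purchase can be tied to a single name. Real data would need a much more careful masking analysis than this sketch suggests.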
On the other hand, information may be stolen from a company. Several companies have had data stolen from them, including credit card numbers, addresses and other private information. Companies may lose a lot of money, perhaps hundreds of millions of dollars, because consumers are no longer sure that their information is safe with them.

Human and Machine Generated Data

Human-generated data can be emails, texts, documents, photos, cell phone calls, online purchases and tweets. It is being generated faster and faster each moment.
Just imagine the number of videos uploaded to YouTube and the number of tweets going around. This data can be big data, and it is intentional data too.

Machine-generated data is a newer kind of data. Recent figures suggest that 95% of data today is never seen by human eyes. This category consists of sensor data and logs generated by machines, such as email logs, clickstream logs and much more.
The volume of machine-generated data is far larger than that of human-generated data. That is why, before Hadoop came on the scene, machine-generated data was mostly ignored and not captured: dealing with the volume was neither possible nor cost-effective.

Unstructured, structured and semi-structured data

Unstructured data mainly refers to information that is not stored in a traditional row-column database.
For instance, unstructured data files often include text and multimedia content. Studies say that almost 80% of the data in any organization is unstructured, and the amount of unstructured data is growing very fast.

Unstructured data is the opposite of structured data. Structured data generally resides in a relational database and is sometimes called relational data. Structured data can easily be put into fields. For example, we can set up fields for phone numbers and credit card numbers that accept a certain number of digits.
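As a rough illustration of fields that accept a certain number of digits, here is a minimal sketch. The record type CustomerRecord, the helper fits_structure and the digit counts (10 for a phone number, 16 for a card number) are assumptions made for this example, not rules taken from any particular database.

```python
import re
from dataclasses import dataclass

# Assumed formats for illustration: 10-digit phone, 16-digit card number.
PHONE_PATTERN = re.compile(r"\d{10}")
CARD_PATTERN = re.compile(r"\d{16}")


@dataclass
class CustomerRecord:
    """A structured record: every value must fit its predefined field."""
    phone: str
    card_number: str

    def __post_init__(self):
        if not PHONE_PATTERN.fullmatch(self.phone):
            raise ValueError(f"phone does not fit the field: {self.phone!r}")
        if not CARD_PATTERN.fullmatch(self.card_number):
            raise ValueError("card number does not fit the field")


def fits_structure(phone, card_number):
    """Return True if the values can be stored as a CustomerRecord."""
    try:
        CustomerRecord(phone, card_number)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    print(fits_structure("5551234567", "4111111111111111"))
    print(fits_structure("call me after 5pm", "4111111111111111"))
```

Free text such as "call me after 5pm" carries information, but it cannot be stored in the fixed phone field, which is the kind of content that ends up as unstructured data.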
On the other hand, unstructured data is not relational and does not fit into predefined fields.

In addition to structured and unstructured data, there is a third type called semi-structured data. It is information that cannot be placed in a relational database but is nonetheless organized in some way, which makes it easier to work with and analyze. For example, XML documents and NoSQL databases can be called semi-structured data.

Big data is mostly associated with unstructured data. Big data refers to extremely large datasets that are difficult to analyze with traditional tools.
Big data can include both structured and unstructured data.

Hadoop

What is it? It is not a single thing. It is a collection of software applications that work with big data, a framework that consists of different modules. The most important part of Hadoop is HDFS, the Hadoop Distributed File System, which is used to store files across many computers. It takes a collection or a piece of information and spreads it across a number of computers, sometimes thousands of them. It is not a database, and it is not a single file with rows and columns: it can hold hundreds or millions of files that are spread across the computers.

MapReduce

MapReduce is another critical part of Hadoop. It is a process that consists of mapping and reducing. Map takes a task and its data and splits them into pieces, because you want to send them to different computers. Each computer can handle a certain amount of data, so they can all work in parallel.
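The split-then-combine idea behind MapReduce can be sketched in a few lines of plain Python. This is a simplified word count, not Hadoop's actual API: the function names are made up for illustration, threads stand in for the separate computers of a cluster, and a real job would also shuffle intermediate results between machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, n_chunks):
    """Split the input so that each worker receives its own piece."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partial_counts):
    """Reduce step: combine the per-chunk results into a single result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return dict(total)


def word_count(lines, n_workers=2):
    """Run the map step in parallel, then reduce the partial results."""
    chunks = split_into_chunks(lines, n_workers)
    # Threads stand in for the separate computers of a real cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    return reduce_counts(partial_counts)


if __name__ == "__main__":
    print(word_count(["big data is big", "data is data"]))
```

Each chunk is processed without knowledge of the others, which is what lets the work be spread over many machines. Only the final combining step needs to see all of the partial results.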
The reduce process takes the results of the analysis done on the other computers and combines the output into a single result. In Hadoop 2, the resource management of the original MapReduce was taken over by YARN (Yet Another Resource Negotiator), and the result is sometimes called MapReduce 2. YARN allows more to be done than the original MapReduce, which did only batch processing: you had to gather everything at once, split it out, wait until it was done and then get your results. With YARN, a cluster can do batch processing and also stream processing, in which data comes in as fast as possible and results come out simultaneously instead of waiting for other processes to complete.

Hive

Hive might be the most frequently used major component of Hadoop. It summarizes, queries and analyzes data stored in Hadoop, using the HiveQL language.

Who uses Hadoop? Yahoo, Facebook, LinkedIn and many others.

Data mining

Data mining is an analytic process designed to explore large amounts of data, also known as big data, in search of consistent patterns and systematic relationships between variables.
The goal of data mining is prediction, and predictive data mining is probably its most common type.

Stages of data mining

Exploration: This stage starts with data preparation, which involves cleaning the data, transforming it, selecting subsets of records and performing operations on them to bring the number of variables down to a manageable range. A simple selection of straightforward predictors is then explored using a wide variety of graphical and statistical methods.
Model building and validation: This stage involves considering various models and choosing the best one based on predictive performance. It is not a simple operation and sometimes involves very elaborate processes.
Plenty of techniques have been developed for this purpose, many of which are based on so-called competitive evaluation of models, that is, applying different models to the same data set and then comparing their performance to choose the best.

Deployment: The final stage involves taking the model selected as best in the model building and validation stage and applying it to new data in order to generate predictions or estimates of the expected outcome.

The concept of data mining is becoming increasingly popular as a business information management tool, where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty.
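The three stages of data mining (exploration, model building and validation, deployment) can be sketched end to end on toy data. This is only an illustration under stated assumptions: the numbers are made up, the two candidate models (a constant mean and a least-squares line) were chosen for simplicity, and the helper names are hypothetical.

```python
from statistics import mean

# Made-up (x, y) records; one has a missing value that must be cleaned out.
raw_records = [(1, 3.1), (2, 4.9), (3, None), (4, 9.2),
               (5, 10.8), (6, 13.1), (7, 15.0), (8, 16.9)]


def clean(records):
    """Exploration: drop records with missing values."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def fit_mean(train):
    """Candidate model 1: always predict the average training outcome."""
    average = mean(y for _, y in train)
    return lambda x: average


def fit_line(train):
    """Candidate model 2: straight line fitted by least squares."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in train)
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mse(model, rows):
    """Mean squared prediction error of a fitted model on the given rows."""
    return mean((model(x) - y) ** 2 for x, y in rows)


def select_best(candidates, train, validation):
    """Competitive evaluation: fit every model on the same training data,
    then keep the one with the lowest error on the validation data."""
    fitted = {name: fit(train) for name, fit in candidates.items()}
    scores = {name: mse(model, validation) for name, model in fitted.items()}
    best_name = min(scores, key=scores.get)
    return best_name, fitted[best_name], scores


if __name__ == "__main__":
    rows = clean(raw_records)
    train, validation = rows[:5], rows[5:]
    name, model, scores = select_best(
        {"mean": fit_mean, "line": fit_line}, train, validation)
    print("best model:", name, scores)
    # Deployment: apply the selected model to new, unseen inputs.
    print([model(x) for x in (9, 10)])
```

Holding back part of the data for validation matters here: comparing the models on the same rows they were fitted on would favor whichever model memorizes the training data best, not the one that predicts new data best.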