Big Data: A Very Short Introduction
Author: Dawn E. Holmes

This book aims to introduce you to Big Data: how it is stored and analysed, together with its social impact. How does it fare?

This small book aims to introduce you to Big Data, explaining how it is stored, analysed, and used by industry, together with the social concerns of privacy and security. The audience is the curious general reader. That said, I could imagine an IT manager, IT architect, or even a developer new to Big Data finding this book useful. This relatively cheap book has only 112 main pages, covering eight chapters, but is packed with detail. Below is a chapter-by-chapter exploration of the topics covered.

Chapter 1 The Data Explosion

The book opens with a look at what data is, providing some history of how data has been used, from Palaeolithic tally sticks through to taxes and census data. The creation of the World Wide Web and the subsequent rise of social media have led to increasing amounts of data being generated. The various types of data are briefly described (i.e. structured, semi-structured, and unstructured), and metrics are provided on recent and expected future data volumes. Big Data refers to this very large amount of data. Next, various sources of Big Data are examined; these include search engine data, logs, healthcare data, and sensors. Real-time analysis of Big Data allows near-immediate decisions to be made, and this is discussed in relation to autonomous cars. The chapter ends by asking the question "What use is all this data?" and proceeds to provide answers relating to new techniques that aid decision making.

This chapter provides useful context on how data has been used in the past, and on how Big Data is being used to create new ways of analysing data and produce innovative means of decision making. The chapter is easy to read, interesting, detailed, and wide-ranging, and not a single word is wasted. These traits apply to the whole of the book.

Chapter 2 Why is big data special?

This chapter opens with a comparison of Big Data with traditional small data processing. The characteristics of Big Data are defined in terms of the usual 3 Vs (volume, variety, and velocity), together with some more recent additions (veracity, value, and visualization). Next, the process of mining Big Data to discover patterns that can be acted upon is described. A brief introduction to some machine learning algorithms and techniques (unsupervised clustering and supervised classification) provides the background to their use in detecting credit card fraud.

This chapter provides an interesting overview of what constitutes Big Data, and of some of the techniques that make it useful.

Chapter 3 Storing big data

Massive amounts of data have an obvious knock-on effect on data storage. Some history of storage is provided (e.g. in the 1980s the average PC hard drive held 5MB). Next, the storing of structured data is examined; this is typically held in Relational Database Management Systems (RDBMSs), which have techniques to reduce the amount of redundant data (i.e. normalisation). While RDBMSs provide suitable vertical scalability for most applications, there comes a point when a threshold is reached and performance becomes impractical (there are some intermediate workarounds, such as sharding, sketched below). This limit is removed by using distributed computing, and the most pertinent example, the Hadoop Distributed File System (HDFS), is discussed as a solution.
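The sharding workaround mentioned above can be pictured with a minimal sketch; the customer IDs, shard count, and hashing scheme are my own illustrative choices, not taken from the book.

    # A minimal sketch (not from the book) of key-based sharding: rows are
    # routed to one of several database servers by hashing their key, so no
    # single server has to hold or scan the whole table.
    import hashlib

    NUM_SHARDS = 4  # hypothetical number of database servers

    def shard_for(customer_id: str) -> int:
        # Hash the key so the mapping is stable and roughly even.
        digest = hashlib.md5(customer_id.encode()).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    for cid in ["C1001", "C1002", "C1003", "C1004"]:
        print(cid, "->", "shard", shard_for(cid))

Hashing the key keeps the routing stable, so the same customer always lands on the same shard, but the scheme still only postpones the scaling limit that distributed file systems such as HDFS are designed to remove.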
The many and varied types of NoSQL (i.e. Not Only SQL) databases are then discussed in terms of providing different solutions for typically unstructured data. Additionally, the CAP theorem is discussed, where on a distributed system a trade-off must be made between data availability and data consistency. The four main types of NoSQL database are briefly described (i.e. key-value, column-based, document, and graph). The chapter next looks at the growing importance of cloud storage, with reference to Amazon Web Services' S3 storage. The chapter ends with a look at both lossless and lossy data compression as means of reducing storage.

This chapter provides a useful look at the limits of relational databases, and at the many alternative NoSQL and distributed processing solutions for storing Big Data.

Chapter 4 Big data analytics

With so much data being generated, new techniques have been developed to ensure it is processed in a timely manner. The chapter looks at MapReduce, a well-established distributed processing algorithm. Since the data volumes are huge, it is more efficient to pass the algorithm (e.g. calculation instructions) to the data held on distributed storage (e.g. HDFS), rather than the more traditional method of passing the data to the algorithm. The workings of MapReduce are described with some simple data relating to diseases. Similarly, other analytic techniques are examined with practical examples.
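To make the map-shuffle-reduce flow concrete, here is a minimal single-machine sketch in Python; the ward/disease records are invented for illustration, whereas a real MapReduce job would run the same steps in parallel across a cluster.

    # Single-machine sketch of the MapReduce pattern described in Chapter 4,
    # counting disease mentions (the records below are made up).
    from itertools import groupby
    from operator import itemgetter

    records = [
        "ward-1 measles", "ward-2 flu", "ward-1 flu",
        "ward-3 measles", "ward-2 flu",
    ]

    # Map step: emit a (disease, 1) pair for each record.
    mapped = []
    for record in records:
        _, disease = record.split()
        mapped.append((disease, 1))

    # Shuffle step: group the pairs by key (a framework does this across
    # machines; sorting then grouping is enough here).
    mapped.sort(key=itemgetter(0))

    # Reduce step: sum the counts for each disease.
    counts = {disease: sum(v for _, v in pairs)
              for disease, pairs in groupby(mapped, key=itemgetter(0))}

    print(counts)  # {'flu': 3, 'measles': 2}

The appeal of the pattern is that the map and reduce steps are independent of one another, so each can be farmed out to whichever machines already hold the relevant blocks of data.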
Next, some commonly available big datasets are briefly described. The chapter ends with a short discussion of Big Data as an example of what Thomas Kuhn described as a scientific revolution (i.e. a paradigm shift, moving away from statistics based on sampled datasets towards big datasets that often contain all the data, removing the need for sampling techniques). There is a very useful point that correlation is not the same as causation, and that with such large datasets some implausible correlations may arise.

This is another interesting and diverse chapter, which discusses some common distributed processing techniques.
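As an aside on the correlation point above, a short sketch (mine, not the book's) shows how easily implausible correlations appear once there are enough variables: even purely random columns will contain a strongly correlated pair by chance.

    # Illustration (not from the book) of spurious correlation: with many
    # unrelated variables, some pair correlates strongly by chance alone.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=(50, 200))       # 50 observations, 200 unrelated variables

    corr = np.corrcoef(data, rowvar=False)  # 200 x 200 correlation matrix
    np.fill_diagonal(corr, 0)               # ignore self-correlations

    i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
    print(f"Strongest correlation: variables {i} and {j}, r = {corr[i, j]:.2f}")

None of these variables has anything to do with any other, yet the strongest pair can easily exceed r = 0.5 with only 50 observations, which is exactly why the book's caution matters at Big Data scale.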