Big Data

1 Big Data - The basics

Learn It - What is Big Data and where does it come from

  • Big Data is a catch-all term for data that cannot be stored and processed in the usual way, for example in a database with fields and records.
  • Big Data can be described in terms of the three “V”s:
    1. Volume - too much data to fit on a single server
    2. Velocity - the speed at which the data is generated and received
    3. Variety - the data comes in a range of types, such as structured, unstructured, text and multimedia
  • Example sources of Big Data:
    • social media - updates, streaming
    • IoT - from smart home devices to smart cars and wearables
    • Government agencies - demographic data, census data
    • Scientific research - data gathered from various sensors and observations
    • Networked sensors, smartphones, video surveillance, mouse clicks etc. - data that is streamed continuously

Learn It - Challenges of Big Data

  • The challenges come from the three "V"s of Big Data: the size of the data set, the constantly incoming and changing data, and the unstructured nature of much of that data.
  • The Volume of Big Data is too big to fit onto a single server
  • The Volume of Big Data is too large to be analysed easily
  • The Volume and the Velocity of data generation demand more processing power, which has to come from multiple servers
  • Unstructured data make analysis difficult
  • Unstructured data means:
    • the data cannot be defined in fields and records as in a traditional database.
    • the data do not have a pre-defined data model or are not organised in a pre-defined manner.
    • Unstructured data are typically text-heavy, but may contain data such as dates, numbers, and facts. Think of customers' written comments versus a multiple-choice survey.
    • Examples of unstructured data include webpages, emails and multimedia data. Such data cannot easily be stored or processed using a standard database.
  • Structured data means:
    • data can be defined using traditional fields and records with each field having a name and data type.
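
The difference between structured and unstructured data can be sketched in a few lines of Python. This is only an illustration: the SurveyResponse record, its field names and the sample comment are made up for this example and do not come from any real system.

```python
from dataclasses import dataclass
from datetime import date

# Structured: every field has a name and a data type, like a database record.
# (SurveyResponse and its fields are hypothetical, for illustration only.)
@dataclass(frozen=True)
class SurveyResponse:
    customer_id: int
    visit_date: date
    rating: int  # multiple-choice answer, 1 to 5

responses = [
    SurveyResponse(1, date(2024, 3, 3), 4),
    SurveyResponse(2, date(2024, 3, 4), 2),
]

# Unstructured: free text with no pre-defined data model.
comment = "Visited on 3 March. Staff were friendly but I waited 20 minutes."

# Structured data can be queried directly by field name.
average_rating = sum(r.rating for r in responses) / len(responses)

# Unstructured data needs extra processing, here a crude keyword search,
# and the result is only a guess at what the customer meant.
mentions_waiting = "waited" in comment.lower()

print(average_rating, mentions_waiting)
```

The rating can be averaged because its position and type are known in advance, whereas the written comment has to be interpreted before anything can be measured.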

2 Big Data - Big Data and Functional Programming

Learn It - Solving the Big Data challenges

  • When data sizes are too big to fit on a single server:
    • The processing must be distributed across more than one machine
    • Functional programming is a solution, because it makes it easier to write correct and efficient distributed code.
  • Functional programming languages support:
    • immutable data structures
    • statelessness
    • higher-order functions
  • These features make it easier to write:
    • correct code
    • code that can be distributed to run across more than one server
  • This is because:
    • functional programming operations are often collection-oriented and can therefore be parallelised easily
    • one part of a functional program cannot change data and so cannot affect another part
    • the order of execution is less rigidly defined in a functional language than in procedural, object-oriented or other paradigms, which makes it possible to distribute the processing
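
The points above can be sketched in Python using a functional style. This is a minimal illustration, not a real distributed system: the data, the chunk size and the helper names (chunk, sum_of_squares, add) are assumptions for the example, and a thread pool on one machine stands in for a set of separate servers.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Immutable input: a tuple cannot be changed by any part of the program.
readings = tuple(range(1, 101))

def chunk(data, size):
    """Split data into tuples of at most `size` items."""
    return tuple(data[i:i + size] for i in range(0, len(data), size))

def sum_of_squares(part):
    """Pure, stateless function: the result depends only on the argument."""
    return sum(x * x for x in part)

def add(a, b):
    return a + b

parts = chunk(readings, 25)

# Sequential version: map and reduce are higher-order functions.
sequential_total = reduce(add, map(sum_of_squares, parts), 0)

# "Distributed" version: each chunk is handed to a separate worker.
# Worker threads stand in for servers here (an assumption of this sketch).
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_total = reduce(add, pool.map(sum_of_squares, parts), 0)

# The chunks can be processed in any order without changing the answer.
reversed_total = reduce(add, map(sum_of_squares, reversed(parts)), 0)

print(sequential_total, parallel_total, reversed_total)
```

Because sum_of_squares never modifies shared data, it does not matter which worker handles which chunk or in what order the chunks are processed. That property is what allows the work to be spread across machines.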

3 Big Data - Modelling Big Data

Learn It - Fact based model

  • In the fact-based model, you deconstruct the data into fundamental units called facts.
  • Each fact within a fact-based model captures a single piece of information.
  • Concept of atomicity:
    • Facts are atomic and cannot be subdivided into smaller meaningful components

bigdata-factmodel.png credit: https://notes.shichao.io/bd/ch2/

  • In a relational database, update is one of the fundamental operations. In Big Data, however, data is treated as immutable: you don't update or delete data, you only add more.
  • Facts are timestamped, so each fact states what was true at a given moment and stays true forever. This is what makes facts immutable.
  • With a fact-based model, the master dataset will be an ever-growing list of immutable, atomic facts, which has the following advantages:
    • Is queryable at any time in its history
    • Tolerates human errors
    • Handles partial information
    • Has the advantages of both normalised and denormalised forms
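
A fact-based master dataset can be sketched as an append-only collection of immutable, timestamped records. The Fact class, its field names, the integer timestamps and the helper functions below are hypothetical choices for this illustration, not part of any particular Big Data system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)  # frozen=True makes each fact immutable
class Fact:
    person_id: int
    attribute: str
    value: str
    timestamp: int  # when the fact became true (hypothetical integer clock)

def add_fact(dataset, fact):
    """Return a new dataset with the fact appended. Nothing is updated or deleted."""
    return dataset + (fact,)

def value_at(dataset, person_id, attribute, when) -> Optional[str]:
    """Return the latest known value of an attribute at time `when`, or None."""
    matching = [
        f for f in dataset
        if f.person_id == person_id
        and f.attribute == attribute
        and f.timestamp <= when
    ]
    if not matching:
        return None
    return max(matching, key=lambda f: f.timestamp).value

master = ()
master = add_fact(master, Fact(1, "name", "Tom", 100))
master = add_fact(master, Fact(1, "location", "Leeds", 100))
# Tom moves: a new fact is added, the old one is kept.
master = add_fact(master, Fact(1, "location", "York", 250))

print(value_at(master, 1, "location", 200))
print(value_at(master, 1, "location", 300))
```

Because the old location fact is never overwritten, the dataset can be queried as it stood at any earlier time. A fact added by mistake can be corrected by adding another fact rather than by editing history.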

Learn It - Graph schema

  • Facts on their own do not convey the structure behind the data: there are no types of facts and no relationships between the facts.
  • Graph schemas are graphs that capture the structure of a dataset stored using the fact-based model.
  • The three core components of a graph schema:
    • Nodes are the entities in the system. In this example, the nodes are the FaceSpace users, each represented by a Person ID.
    • Edges are relationships between nodes.
    • Properties are information about entities.

bigdata-graphmodel.png credit: https://notes.shichao.io/bd/ch2/
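
The three components of a graph schema can be sketched in Python by giving each fact a type. The class names (PropertyFact, EdgeFact), the sample people and the "friend" relationship are assumptions made for this illustration. Edges are treated as directed here, which is also an assumption of the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PropertyFact:
    """A property: information about one entity (node)."""
    person_id: int
    name: str
    value: str
    timestamp: int

@dataclass(frozen=True)
class EdgeFact:
    """An edge: a relationship between two nodes."""
    source_id: int
    relationship: str
    target_id: int
    timestamp: int

facts = (
    PropertyFact(1, "name", "Alice", 10),
    PropertyFact(2, "name", "Bob", 12),
    PropertyFact(1, "location", "Leeds", 15),
    EdgeFact(1, "friend", 2, 20),
)

def nodes(facts):
    """Collect every Person ID mentioned by any fact."""
    ids = set()
    for f in facts:
        if isinstance(f, PropertyFact):
            ids.add(f.person_id)
        else:
            ids.update((f.source_id, f.target_id))
    return ids

def properties_of(facts, person_id):
    """All properties recorded for one node."""
    return {f.name: f.value for f in facts
            if isinstance(f, PropertyFact) and f.person_id == person_id}

def neighbours(facts, person_id, relationship):
    """Nodes reached from person_id along edges of the given type."""
    return {f.target_id for f in facts
            if isinstance(f, EdgeFact)
            and f.source_id == person_id
            and f.relationship == relationship}

print(nodes(facts), properties_of(facts, 1), neighbours(facts, 1, "friend"))
```

The facts are still atomic and immutable, but the two fact types now make explicit which facts describe an entity and which describe a relationship between entities.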

  • Try to work out the following exam question:

bigdata-exercise1.png

  • Answer to the above:

bigdata-exercise1-answer.png