Big Data

1 Big Data - The basics

Big Data is a catch-all term for data that cannot be stored and processed in the usual way, for example in a database with fields and records.
Big Data can be described in terms of the three “V”s:
1. Volume - too big to fit into a single server
2. Velocity - the speed at which the data is generated and received
3. Variety - the data comes in a range of types, such as structured, unstructured, text and multimedia
Example sources of Big Data:
- Social media - updates, streaming
- IoT - from smart home devices to smart cars and wearables
- Government agencies - demographic data, census data
- Scientific research - data gathered from various sensors and observations
- Networked sensors, smartphones, video surveillance, mouse clicks etc. - data that is continuously streamed

The challenges come from the three "V"s of Big Data: the data set is constantly growing and changing, and much of the data is unstructured.
- The Volume of Big Data is too big to fit onto a single server
- The Volume of Big Data is too large to be analysed easily
- The Volume and the Velocity of data generation demand more processing power than one machine can supply, so multiple servers are needed
- Unstructured data makes analysis difficult
Unstructured data means:
- the data cannot be defined in fields and records as in a traditional database.
- the data do not have a pre-defined data model or are not organised in a pre-defined manner.
- Unstructured data are typically text-heavy, but may contain data such as dates, numbers, and facts. Think of customers' written comments versus a multiple-choice survey.
- Examples of unstructured data include webpages, emails, multimedia data etc. Such data cannot easily be stored or processed using a standard database.
Structured data means:
- data can be defined using traditional fields and records with each field having a name and data type.
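
The contrast between the two can be sketched in a few lines of Python. This is only an illustration: the SurveyAnswer record, its field names and the sample comments are invented for the example and do not come from any real data set.

```python
from dataclasses import dataclass, fields


# Structured: every record has the same named, typed fields.
# (SurveyAnswer and its fields are hypothetical, for illustration only.)
@dataclass(frozen=True)
class SurveyAnswer:
    customer_id: int
    question: str
    choice: str  # one of a fixed set of multiple-choice options


structured = [
    SurveyAnswer(1, "delivery", "good"),
    SurveyAnswer(2, "delivery", "poor"),
    SurveyAnswer(3, "delivery", "good"),
]

# Unstructured: free text with no pre-defined data model.
# Dates and facts may be buried inside, but there are no fields to query.
unstructured = [
    "Arrived two days late but the driver was friendly.",
    "Great!! Would order again on 3 May.",
]

# The structured data can be described and queried by field name...
field_names = [f.name for f in fields(SurveyAnswer)]
good_count = sum(1 for answer in structured if answer.choice == "good")

# ...whereas the written comments would first need text analysis
# before a question like "how many were satisfied?" could be answered.
```

Counting the "good" answers is a one-line query on the structured records; nothing comparable exists for the written comments without extra processing.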

When the data is too big to fit on a single server, the processing must be distributed across more than one machine.
- Functional programming is a solution, because it makes it easier to write correct and efficient distributed code.
Functional programming languages support:
- immutable data structures
- statelessness
- higher-order functions
These features make it easier to write correct code, and code that can be distributed to run across more than one server:
- functional programming operations are often collection-oriented and so can be parallelised easily
- one part of a functional program cannot change data and so cannot affect another part
- the order of execution is less rigidly defined in a functional language than in procedural, object-oriented or other paradigms, which makes it possible to distribute the processing
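
The points above can be illustrated with a small Python sketch. It is a toy under stated assumptions: the readings are made-up numbers, the helper names (square, add, chunk, process_chunk) are invented for the example, and a thread pool stands in for the separate servers of a real distributed system.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Immutable data: a tuple cannot be changed after it is created.
readings = (3, 8, 1, 9, 4, 7, 2, 6)


def square(x):
    """Pure, stateless function: the result depends only on the argument."""
    return x * x


def add(a, b):
    return a + b


def chunk(data, size):
    """Split the data into independent parts of at most `size` items."""
    return tuple(data[i:i + size] for i in range(0, len(data), size))


def process_chunk(part):
    """Higher-order functions: map and reduce take functions as arguments."""
    return reduce(add, map(square, part), 0)


# Everything processed in one place.
sequential_total = process_chunk(readings)

# The same work split into chunks that are processed independently.
# Threads stand in here for separate servers (an assumption of this sketch).
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunk(readings, 2)))
distributed_total = reduce(add, partials, 0)
```

Because square and add have no side effects and readings never changes, the chunks can be processed in any order, or at the same time, and the combined result is the same as the sequential one.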

In the fact-based model, you deconstruct the data into fundamental units called facts.
Each fact within a fact-based model captures a single piece of information.
Concept of atomicity:
- Facts are atomic and cannot be subdivided into smaller meaningful components

In a relational database, update is one of the fundamental operations. However, Big Data is kept immutable: you don't update or delete data, you only add more.
Facts are timestamped to make them immutable and eternally true.
With a fact-based model, the master dataset will be an ever-growing list of immutable, atomic facts, which has the following advantages:
- Is queryable at any time in its history
- Tolerates human errors
- Handles partial information
- Has the advantages of both normalised and denormalised forms
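
A minimal Python sketch of such a master dataset follows. The Fact record, its field names, the helper functions and the sample values are all hypothetical, and integer timestamps are used purely to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen: a fact cannot be modified once created
class Fact:
    person_id: int
    field: str
    value: str
    timestamp: int  # when the fact became true (arbitrary integer units)


def add_fact(dataset: Tuple[Fact, ...], fact: Fact) -> Tuple[Fact, ...]:
    """Append only: returns a new dataset and never changes existing facts."""
    return dataset + (fact,)


def value_as_of(dataset, person_id, field, when) -> Optional[str]:
    """Return the most recent value of a field at time `when`, if any."""
    candidates = [
        f for f in dataset
        if f.person_id == person_id and f.field == field and f.timestamp <= when
    ]
    if not candidates:
        return None  # partial information: nothing known yet
    return max(candidates, key=lambda f: f.timestamp).value


master = ()
master = add_fact(master, Fact(1, "location", "Leeds", 100))
master = add_fact(master, Fact(1, "name", "Tom", 110))
# A move is recorded as a new fact; the old location fact stays in place.
master = add_fact(master, Fact(1, "location", "York", 200))
```

Here value_as_of(master, 1, "location", 150) gives "Leeds" and the same query at time 250 gives "York", so the dataset can be queried at any point in its history. A mistaken fact could be corrected by adding a later one rather than by overwriting, and a person with no recorded name simply yields None.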

Facts on their own do not convey the structure behind the data: there are no types of facts and no relationships between the facts.
Graph schemas are graphs that capture the structure of a dataset stored using the fact-based model.
The three core components of a graph schema:
- Nodes are the entities in the system. In this example, the nodes are the FaceSpace users, represented by a Person ID
- Edges are relationships between nodes.
- Properties are information about entities
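
The three components can be sketched in Python as follows. This is an illustrative assumption rather than a definitive schema: apart from the Person ID, the class names, the friendship edge and the sample values are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:  # node: an entity, identified by a Person ID
    person_id: int


@dataclass(frozen=True)
class Property:  # property: a piece of information about one entity
    owner: Person
    name: str
    value: str


@dataclass(frozen=True)
class Friendship:  # edge: a relationship between two nodes (hypothetical)
    a: Person
    b: Person


alice, bob, carol = Person(1), Person(2), Person(3)

properties = (
    Property(alice, "name", "Alice"),
    Property(bob, "location", "Leeds"),
)
edges = (
    Friendship(alice, bob),
    Friendship(bob, carol),
)


def friends_of(person, edges):
    """Follow the edges that touch a node and return the nodes at the other end."""
    return {e.b if e.a == person else e.a for e in edges if person in (e.a, e.b)}


def properties_of(person, properties):
    """Collect everything known about one node."""
    return {p.name: p.value for p in properties if p.owner == person}
```

With these definitions friends_of(bob, edges) returns the nodes for alice and carol, and properties_of(carol, properties) returns an empty dictionary because nothing has been recorded about that user yet.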