Big Data
1 Big Data - The basics
Learn It - What Big Data is and where it comes from
- Big Data is a catch-all term for data that cannot be stored and processed in the usual way, for example using a database with fields and records.
- Big Data can be described in terms of the three “V”s:
- Volume - too big to fit on a single server
- Velocity - the speed at which the data is generated and received
- Variety - the data comes in a range of types, such as structured, unstructured, text and multimedia.
- Example sources of Big Data:
- Social media - updates, streaming
- IoT - from smart home devices to smart cars and wearables
- Government agencies - demographic data, census data
- Scientific research - data gathered from various sensors and observations
- Data from networked sensors, smartphones, video surveillance, mouse clicks etc. is streamed continuously.
Learn It - Challenges of Big Data
- The challenges come from the three "V"s of Big Data: the data set is constantly arriving and changing, and much of the data is unstructured.
- The Volume of Big Data is too big to fit on a single server
- The Volume of Big Data is too large to be analysed easily
- The Volume and the Velocity of data generation demand more processing power, spread across multiple servers
- Unstructured data makes analysis difficult
- Unstructured data means:
- the data cannot be defined in fields and records as in a traditional database.
- the data does not have a pre-defined data model or is not organised in a pre-defined manner.
- Unstructured data is typically text-heavy, but may contain data such as dates, numbers and facts. Think of customers' written comments versus a multiple-choice survey.
- Examples of unstructured data include webpages, emails and multimedia data. Such data cannot be easily stored or processed using a standard database.
- Structured data means:
- data can be defined using traditional fields and records with each field having a name and data type.
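The difference between structured and unstructured data can be sketched in a few lines of Python. This is only an illustration: the `SurveyResponse` record, its field names and the sample comment are invented for the example and do not come from any real system.

```python
from dataclasses import dataclass
from datetime import date

# Structured: every field has a name and a data type, like a database record.
@dataclass
class SurveyResponse:
    customer_id: int
    visit_date: date
    rating: int  # multiple-choice answer, 1 to 5

# Unstructured: free text with no pre-defined data model.
comment = "Visited on 3 March. Staff were friendly but I waited 20 minutes."

responses = [
    SurveyResponse(1, date(2024, 3, 3), 4),
    SurveyResponse(2, date(2024, 3, 4), 2),
]

# Structured data can be queried directly by field name.
average_rating = sum(r.rating for r in responses) / len(responses)

# Unstructured data has to be interpreted first; here, a crude word search.
mentions_waiting = "waited" in comment.lower()
```

The rating can be averaged straight away because it sits in a typed field, whereas the date and waiting time buried in the comment would need extra text processing to extract.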
2 Big Data - Big Data and Functional Programming
Learn It - Solving the Big Data challenges
- When data sizes are too big to fit on a single server:
- The processing must be distributed across more than one machine
- Functional programming is a solution, because it makes it easier to write correct and efficient distributed code.
- Functional programming languages support:
- immutable data structures
- statelessness
- higher-order functions
- These features make it easier to write:
- correct code
- code that can be distributed to run across more than one server.
- Functional programming operations are often collection-oriented and can therefore be parallelised easily.
- Because data is immutable, one part of a functional program cannot change data and so cannot affect another part.
- The order of execution is less rigidly defined in a functional language than in procedural, object-oriented or other paradigms; this makes it possible to distribute the processing.
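The points above can be sketched in Python using an immutable tuple, pure functions and the higher-order functions `map` and `reduce`. This is a minimal single-machine illustration, not real distributed code: the sample readings, the `split` helper and the idea that each chunk stands for one server are assumptions made for the example.

```python
from functools import reduce

# Immutable input: a tuple cannot be modified in place.
readings = (3, 8, 1, 9, 4, 7, 2, 6)

def square(x):
    """Pure function: the result depends only on the argument."""
    return x * x

def add(a, b):
    """Pure function used to combine results."""
    return a + b

def process_chunk(chunk):
    """Work that one server could do on its own share of the data."""
    return reduce(add, map(square, chunk), 0)

def split(data, parts):
    """Divide the data into roughly equal chunks, one per (hypothetical) server."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

# Result when everything is processed in one place.
total = process_chunk(readings)

# Each chunk is processed independently, then the partial results are combined.
partials = [process_chunk(chunk) for chunk in split(readings, 3)]
distributed_total = reduce(add, partials, 0)
```

Because `square` and `add` have no side effects and the data never changes, the chunks could be processed in any order, or on different machines, and `distributed_total` would still equal `total`.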
3 Big Data - Modelling Big Data
Learn It - Fact based model
- In the fact-based model, you deconstruct the data into fundamental units called facts.
- Each fact within a fact-based model captures a single piece of information.
- Concept of atomicity:
- Facts are atomic and cannot be subdivided into smaller meaningful components.
Credit: https://notes.shichao.io/bd/ch2/
- In a relational database, update is one of the fundamental operations. In Big Data, however, data is immutable: you do not update or delete data, you only add more.
- Facts are timestamped to make them immutable and eternally true.
- With a fact-based model, the master dataset will be an ever-growing list of immutable, atomic facts, which has the following advantages:
- Is queryable at any time in its history
- Tolerates human errors
- Handles partial information
- Has the advantages of both normalised and denormalised forms
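A fact-based master dataset can be sketched as an append-only list of immutable, timestamped records. The `Fact` class, its field names, the integer timestamps and the `value_as_of` helper below are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True makes each fact immutable
class Fact:
    person_id: int
    attribute: str
    value: str
    timestamp: int  # assumed here to be a simple increasing number

master_dataset = []  # only ever appended to, never updated or deleted from

def record(fact):
    """Add a new fact to the master dataset."""
    master_dataset.append(fact)

def value_as_of(person_id, attribute, when):
    """Return the most recent value known at time `when`, or None."""
    matching = [
        f for f in master_dataset
        if f.person_id == person_id
        and f.attribute == attribute
        and f.timestamp <= when
    ]
    if not matching:
        return None
    return max(matching, key=lambda f: f.timestamp).value

record(Fact(1, "location", "London", 100))
record(Fact(1, "name", "Tom", 150))
record(Fact(1, "location", "Leeds", 200))  # a move is a new fact, not an update
```

With these facts, `value_as_of(1, "location", 150)` gives "London" and `value_as_of(1, "location", 250)` gives "Leeds", so the dataset can be queried at any point in its history. A mistaken fact can be corrected by adding a later one, without losing the record of what was stored before.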
Learn It - Graph schema
- Facts alone do not convey the structure behind the data: there are no types of facts and no relationships between the facts.
- Graph schemas are graphs that capture the structure of a dataset stored using the fact-based model.
- The three core components of a graph schema:
- Nodes are the entities in the system. In this example, the nodes are the FaceSpace users, represented by a Person ID.
- Edges are relationships between nodes.
- Properties are information about entities.
Credit: https://notes.shichao.io/bd/ch2/
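The three components of a graph schema can be sketched with plain Python collections. The person IDs, names, locations and the `friends_of` helper are made up for illustration; a real system would use a dedicated schema or serialisation framework rather than dictionaries.

```python
# Nodes: the entities, each identified by a person ID.
nodes = {1, 2, 3}

# Edges: relationships between two nodes.
edges = [
    (1, "friend", 2),
    (2, "friend", 3),
]

# Properties: information about an entity (may be partial).
properties = {
    1: {"name": "Tom", "location": "London"},
    2: {"name": "Alice"},
    3: {"name": "Priya", "location": "Leeds"},
}

def friends_of(person_id):
    """Names of everyone connected to person_id by a 'friend' edge."""
    result = []
    for source, relation, target in edges:
        if relation != "friend":
            continue
        if source == person_id:
            result.append(properties[target]["name"])
        elif target == person_id:
            result.append(properties[source]["name"])
    return result
```

Here `friends_of(2)` follows both edges that touch node 2 and looks up the name property of the node at the other end.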