# Little Big Data

## 1 Handling data

### Learn It

• Last lesson, we looked at the scale of data used globally, and thought about how collecting and analysing lots of records can help organisations and governments to improve.
• In this lesson, we'll try handling a large data set for ourselves.
• Black-hat hackers often wish to log in to others' IT systems by trying to guess their password.
• One approach that can be taken is to 'brute-force' attack the system, by having a computer program attempt every possible password combination until it guesses correctly. It might try these first…
• a
• b
• c
• …z
• aa
• ab
• ac
• …az
• ba
• bb
• etc…
• By doing this, any password can eventually be 'cracked'.
• As you can imagine, if the password can also contain capital letters, then it'll take twice as long to run through all single-character passwords.
• If you include digits, that's another 10 possibilities for every character position.
• If you include punctuation marks (e.g. %, ^, &) then this will take even longer.
• This can be rather slow.
• Let's imagine that it's rather important that you access a password-protected document, and so you have written a password-cracking program that repeatedly tries to open the document with a brute-force attack.
• Set the password length to 6 characters (e.g. 'lockit' or 'system')
• Set the keys per second to 'PDF - 22014K/s', for a PDF file.
• Set the character set to 'mixalpha'
• Click 'Get time'
• On average, how long would it take to find the password if we used a 7-character password?
• How about 8 characters? Does the extra character make much difference?
• What if we used mixed alphabet and numeric characters for an 8-character password?
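• The brute-force idea above can be sketched in a few lines of Python. This is only an illustration, not the tool used in the steps above: the function names are made up for this example, it assumes 'mixalpha' means the 52 upper- and lower-case letters, and it reads '22014K/s' as 22,014,000 guesses per second. The "average" is a rough estimate of half the worst-case time.

```python
from itertools import islice, product
import string


def brute_force_candidates(charset, max_length):
    """Yield every string over charset, shortest first: a, b, ..., z, aa, ab, ..."""
    for length in range(1, max_length + 1):
        for letters in product(charset, repeat=length):
            yield "".join(letters)


def keyspace(charset_size, length):
    """Count the possible passwords of exactly `length` characters."""
    return charset_size ** length


def worst_case_seconds(charset_size, length, keys_per_second):
    """Time needed to try every password of exactly `length` characters."""
    return keyspace(charset_size, length) / keys_per_second


# Peek at the order in which candidates are tried (lower-case letters only).
first_few = list(islice(brute_force_candidates(string.ascii_lowercase, 2), 28))
print(first_few[:3], first_few[25:28])

# Assumed values (see the note above): 52 letters, 22,014,000 guesses a second.
MIXALPHA_SIZE = 52
KEYS_PER_SECOND = 22_014_000

for length in (6, 7, 8):
    seconds = worst_case_seconds(MIXALPHA_SIZE, length, KEYS_PER_SECOND)
    # On average the password turns up about halfway through the search.
    print(length, "characters: roughly", round(seconds / 2), "seconds on average")
```

• Each extra character multiplies the number of possibilities by the size of the character set, which is why one more character makes such a big difference.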

### Try It

• Most people tend to use words for passwords, rather than random characters.
• A second form of password attack is using a dictionary file, which contains common words.
• Getting a human to work through the list, typing each password until they get the correct one could take some time.
• Getting a computer to run through the list won't take quite so long.
• We'll write a short (white-hat) program to allow users to test their passwords to see if they're secure or not.
• We'll start simple, and add complexity slowly.

### Try It

• Often, developers want to write an outline of a program before writing it. By writing the code in human-readable English, it becomes easier to articulate the flow of our program, even if we don't know all the command words to actually write it yet.
• We call this pseudocode. A simple dictionary search algorithm could be expressed like this:
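```
1. Make a variable called 'foundPass' and set it to False. We'll change this if your password is found.
2. Make a list of common passwords.
3. Ask the user to type in a password to test.
4. For each password in the list:
4a. If it matches the typed password: i. Print a message to say it was found; ii. Set foundPass to True
5. If foundPass is False, print a message to say the password wasn't found.
```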
• Load up IDLE, start a new program (File -> New) and copy the code below into the window:
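```
# A tiny list of common passwords to check against
passwordList=['password','123456','batman']

# Ask the user for a password to test
userPass=input("Type a password to test: ")
foundPass=False

# Compare the typed password with each entry in the list
for eachPassword in passwordList:
    if eachPassword==userPass:
        print("That password is in my list!")
        foundPass=True

if foundPass==False:
    print("That password isn't in my list.")
```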
• Save the file as 'simplePassword.py' and run the program a few times to see how it works.
• This is a start, but three passwords simply isn't enough for a dictionary attack.
• Time for an upgrade. Save this file to your home drive, ensuring it's in the same directory that you save your Python work into.
• Make another new file, and add this code to it:
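```
print("Loading passwords... Please wait...")

# Read every word from the dictionary file into a list.
# NOTE: 'passwords.txt' is a placeholder name; use the name of the file you saved.
passwordList=[]
theFile=open("passwords.txt","r")
for eachLine in theFile:
    for eachWord in eachLine.split():
        passwordList.append(eachWord)

theFile.close()

userPass=input("Type a password to test: ")
foundPass=False

# Check each record in turn, counting positions from 1
passwordPosition=0
for eachPassword in passwordList:
    passwordPosition=passwordPosition+1
    if eachPassword==userPass:
        print("Found it in record " + str(passwordPosition))
        foundPass=True

if foundPass==False:
    print("That password is not in the dictionary.")
```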
• Test the program, using some passwords. Try predictable ones (e.g. 123456) and any others you can think of.
• Tip: Avoid typing your own password; another person sitting nearby could see it.

• This program uses a lot of the ideas from the previous version of the program, but has a few new lines. If you look at the code and think about the variable names, you should be able to come up with an explanation of how it works.
• Write out pseudocode to explain how the program works.
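• Once your pseudocode is written, the same steps can be repackaged as functions, which makes them easier to test. The sketch below is one possible arrangement, not the only one: the names `load_words` and `find_record` are made up for this example, and it writes its own tiny demo file so that it runs without the real dictionary file.

```python
import os
import tempfile


def load_words(path):
    """Read a text file and return every whitespace-separated word in order."""
    words = []
    with open(path, "r") as the_file:
        for each_line in the_file:
            words.extend(each_line.split())
    return words


def find_record(words, candidate):
    """Return the 1-based position of candidate in words, or None if absent."""
    for position, word in enumerate(words, start=1):
        if word == candidate:
            return position
    return None


# Build a three-word demo dictionary in a temporary folder (hypothetical data).
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo_words.txt")
    with open(demo_path, "w") as demo_file:
        demo_file.write("password 123456\nbatman\n")
    demo_words = load_words(demo_path)

print(find_record(demo_words, "batman"))  # position of a listed password
print(find_record(demo_words, "x9!Tq"))   # None: not in the demo dictionary
```

• Returning the position (or None) instead of printing inside the loop means the search can be reused by other programs, and it stops at the first match rather than scanning the whole list.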