Little Big Data

1 Handling data

Learn It

  • Last lesson, we looked at the scale of data used globally, and thought about how collecting and analysing lots of records can help organisations and governments to improve.
  • In this lesson, we'll try handling a large data set for ourselves.
  • Black-hat hackers often wish to log in to others' IT systems by trying to guess their password.
  • One approach that can be taken is to 'brute-force' attack the system, by having a computer program attempt every possible password combination until it guesses correctly. It might try these first…
    • a
    • b
    • c
    • …z
    • aa
    • ab
    • ac
    • …az
    • ba
    • bb
    • etc…
  • By doing this, eventually any password can be eventually 'cracked'.
  • As you can imagine, if the password contains capital letters, then it'll take twice as long to run through all single-character passwords.
  • If you include numbers, that's another 10 combinations on each sweep through.
  • If you include puncutation marks (e.g. %, ^, &) then this will take even longer.
  • This can be rather slow.
  • Imagine you had a password-protected PDF file with your homework in it, and had forgotten the password to open it.
  • Let's imagine that its rather important you access the file, and so have written a password cracking program to repeatedly try to open the document with a brute-force attack.
  • Click here to see a password cracking calculator.
    • Set the password length to 6 characters (e.g. 'lockit' or 'system')
    • Set the keys per second to 'PDF - 22014K/s', for a PDF file.
    • Set the characterset to 'mixalpha'
    • Click 'Get time'
  • On average, how long will it take to find the password if we used a 7 character password?
  • How about 8 characters? Does the extra character make much difference?
  • What if we used mixed alphabet and numeric characters for an 8 character password?

Try It

  • Most people tend to use words for passwords, rather than random characters.
  • A second form of password attack is using a dictionary file, which contains common words.
  • Dictionary files are readily (and freely) available on the web. Click here to download a relatively small password file, containing 1.1 million passwords.
  • Getting a human to work through the list, typing each password until they get the correct one could take some time.
  • Getting a computer to run through the list won't take quite so long.
  • We'll write a short (white-hat) program to allow users to test their passwords to see if they're secure or not.
  • We'll start simple, and add complexity slowly.

Try It

  • Often, developers want to write an outline of a program before writing it. By writing the code in human-readable English, it becomes easier to articulate the flow of our program, even if we don't know all the command words to actually write it yet.
  • We call this pseudocode. A simple dictionary search algorithm could be expressed like this:
1. Make a variable called 'foundPass' and set it to False. We'll change this if your password is found.
2. Ask user for the password to test. Store as 'yourPassword'
3. Store some passwords in a list. Let's call it 'passwordList'
4. For each password in passwordList...
    4a. If yourPassword is the same as the current passwordList item...
        4a. i. Print a message, saying we've found your password
        4a. ii. Set foundPass to True
5. If foundPass is False, print a message to say the password wasn't found.
  • Load up IDLE, start a new program (File -> New) and copy the code below into the window:
passwordList=['password','123456','batman']

yourPassword=input("Please enter a password: ")
foundPass=False

for tryPass in passwordList:
    if yourPassword==tryPass:
	print("Your password could be hacked.")
	foundPass=True

if foundPass==False:
    print("That password isn't in my list.")
  • Save the file as 'simplePassword.py' and run the program a few times to see how it works.
  • This is a start, but three passwords simply isn't enough for a dictionary attack.
  • Time for an upgrade. Save this file to your home drive, ensuring its in the same directory that you save your Python work into.
  • Make another new file, and add this code to it:
print("Loading passwords... Please wait...")
theFile = open('passwords.txt', 'r')
passwordList=[]

for eachLine in theFile:
    for eachWord in eachLine.split():
	passwordList.append(eachWord)

theFile.close()

yourPassword=input("Please enter a password: ")
passwordPosition=0
foundPass=False

for tryPass in passwordList:
    passwordPosition+=1
    if yourPassword==tryPass:
	print("Found it in record " + str(passwordPosition))
	foundPass=True

if foundPass==False:
    print("That password is not in the dictionary.")
  • Test the program, using some passwords. Try predictable ones (e.g. 123456) and any others you can think of.
  • Tip: Avoid typing your own password; another person sitting nearby could see it.

Badge It - Silver

  • This program uses a lot of the ideas from the previous version of the program, but has a few new lines. If you look at the code and think about the variable names, you should be able to come up with an explanation of how it works.
  • Write out pseudocode to explain how the program works.

Badge It - Gold

  • Make a text file with this Monday's lessons in it; one on each line.
  • Write another with Tuesday's lessons in it. Again, put one on each line.
  • Write a program which allows you to enter a lesson.
  • The program should tell you if the lesson is on Monday, Tuesday or neither.

Badge It - Platinum

  • Improve your program by having it tell you which lesson period AND which day the lesson is on.