Learning Diary: Statistics With Python Part One: The Basics

MikeNM –

Rants and Reviews. Mostly just BS and Affiliate Links.

May 24, 2019

I am rounding out the Python stuff at Codecademy with their Learning Statistics with Python course. Along with a couple of books I bought, this should be a fun way to find direct applications for Python at my current skill level.

I finished the first section, which covers the basics of statistics that you probably remember from middle school: Mean, Median, and Mode. These are relatively straightforward to explain:

* Mean is the mathematical average found by adding all of the values in a set and dividing by the total number of entries. For example, your four frames of bowling are 12, 7, 9, 8. Added together these are 36. You then divide by 4 which gives us the Mean of 9.
* Median is the middle of a number set. In sets with an odd number of values, you take the middle value of the ordered set. Let's add another frame to our game and order the scores from lowest to highest: 7, 8, 9, 9, 12. Because the third value has two other values on each side, it is the middle, making the Median 9. In even sets, like our original four scores, you still order them by value: 7, 8, 9, 12. Because there is no exact middle here, you take the two middle values, 8 and 9, and average them, giving us 8.5.
* Mode is a bit simpler. It is merely the value that appears most often in a set. Going back to our five frames: 7,8,9,9,12. In this case, 9 appears most often.
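The three definitions above can be checked against the bowling scores with Python's built-in statistics module. This is only a quick sketch; the variable names are mine and not from the course.

```python
import statistics

four_frames = [12, 7, 9, 8]
five_frames = [7, 8, 9, 9, 12]

mean_four = statistics.mean(four_frames)      # 36 divided by 4
median_five = statistics.median(five_frames)  # middle value of the sorted list
median_four = statistics.median(four_frames)  # average of the two middle values, 8 and 9
mode_five = statistics.mode(five_frames)      # the value that appears most often

print(mean_four, median_five, median_four, mode_five)  # 9 9 8.5 9
```

Note that statistics.median sorts the values itself, so the list does not need to be ordered first.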

I haven't gotten much farther than these basics. I wanted to make a program that looks at each of these values and uses the Mode to figure out whether the Mean or the Median is a better representative of the samples. I am working from a layman's perspective here: whichever of the Mean or Median is closer to the Mode is what I treat as the better indicator of a set.

All of these articles are going to use the same set of data from the city of Vancouver: a 2018 report listing the salaries and expenses of city employees who make over $75,000 a year. This was an approachable CSV dataset I could import and work with as I went through the course ideas.

Below is the simple program I made. It imports the CSV. (I will complain again about how annoying Python's CSV reader is compared to PowerShell; it requires a lot more work to turn a dataset into an array.) It loops through that file and creates an array of the salaries. Using the statistics library, I then find the Mean, Median, and Mode and print them out to the console.

Then I find how far the Median and the Mean each are from the Mode. I compare those two distances to find out which value is closer to the most frequent salary and report that one as the better representative. I wrote this just trying to play around, and some quick reading showed that choosing the proper figure is a bit more complicated than my swift calculations. As I learn more, I'll return to this sheet and find out how close this initial impression was. I'll paste the output into a second code block so you can see how it turned out.
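Before the full program, here is the comparison described above boiled down to a small function and run on toy numbers. Treat it as a sketch of my layman's rule, not a real statistical test; the function name closer_to_mode and the sample list are made up for illustration.

```python
import statistics

def closer_to_mode(values):
    """Return 'mean', 'median', or 'tie' depending on which is nearer the mode."""
    mode = statistics.mode(values)
    mean_gap = abs(statistics.mean(values) - mode)
    median_gap = abs(statistics.median(values) - mode)
    if mean_gap < median_gap:
        return "mean"
    if median_gap < mean_gap:
        return "median"
    return "tie"

# Hypothetical toy data, not taken from the Vancouver file:
# one very high score drags the mean away from the other values.
sample = [7, 8, 9, 9, 12, 40]
print(closer_to_mode(sample))  # median
```

With the five bowling frames from earlier (7, 8, 9, 9, 12) all three values are 9, so the function reports a tie.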

Code

    import csv
    import statistics

    salaryList = []
    # Open the file in a with block so it is closed once the loop finishes
    with open("2018StaffRemunerationOver75KWithExpenses.csv", 'r', newline='') as statfile:
        statset = csv.reader(statfile)
        for row in statset:
            if row == ['', '', '', '', '']:
                # Skip blank rows
                continue
            elif row == ['Name','Department','Title','Remuneration','Expenses']:
                # Skip the header row
                continue
            else:
                # Remuneration is the fourth column
                salaryList.append(int(row[3]))

    # int() drops any decimal part so the figures print as whole dollars
    salaryMean = int(statistics.mean(salaryList))
    print("The Mean (average) salary is: $" + str(salaryMean))
    salaryMode = int(statistics.mode(salaryList))
    print("The Mode (most frequent) salary is: $" + str(salaryMode))
    salaryMedian = int(statistics.median(salaryList))
    print("The Median (midpoint of all salaries) is: $" + str(salaryMedian))

    # Distance of the Median and the Mean from the Mode
    mediantoMode = abs(salaryMedian - salaryMode)
    meantoMode = abs(salaryMean - salaryMode)
    print()
    print("The Mean is " + str(meantoMode) + " from the Mode and the Median is " + str(mediantoMode) + " from the Mode salary of $" + str(salaryMode))
    if meantoMode < mediantoMode:
        print("The Mean is a better representative sample of the salaries at: $" + str(salaryMean))
    elif meantoMode == mediantoMode:
        print("The Mean and Median are equally distant from the Mode. The Mean is $" + str(salaryMean) + " and the Median is $" + str(salaryMedian))
    else:
        print("The Median is a better representative sample of the salaries at: $" + str(salaryMedian))

Output

The Mean (average) salary is: $103848
The Mode (most frequent) salary is: $95664
The Median (midpoint of all salaries) is: $98716

The Mean is 8184 from the Mode and the Median is 3052 from the Mode salary of $95664
The Median is a better representative sample of the salaries at: $98716
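On my CSV complaint: the csv module also has a DictReader that treats the first row as column names, which would remove the header check from the loop. Here is a rough sketch of that idea. The rows are invented stand-ins in the same five-column layout as the file, fed in through io.StringIO so the example runs without the real dataset.

```python
import csv
import io

# Hypothetical rows in the same five-column layout as the program's header check;
# the names and figures are invented for illustration.
sample_csv = io.StringIO(
    "Name,Department,Title,Remuneration,Expenses\n"
    "A. Example,Parks,Planner,90000,120\n"
    ",,,,\n"
    "B. Example,Library,Manager,101500,0\n"
)

salaries = []
for row in csv.DictReader(sample_csv):
    # DictReader uses the first row as keys, so no header check is needed
    if row["Remuneration"]:  # skip rows where the column is empty
        salaries.append(int(row["Remuneration"]))

print(salaries)  # [90000, 101500]
```

Reading the column by name instead of by position (row[3]) would also keep working if the columns were ever reordered.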
