Homework 5

Dear all,

For Homework 5 you will need to:

  1. Create an array of values and calculate the standard deviation of this array in R (using the sd() function) and in Python (using the std() function from the numpy package).
  2. Think about and explain why the results are different!
  3. Implement your own version of the std function in Python that returns the same result as the sd() function in R.
  4. Using your own function, calculate the standard deviation of the column AGE in the fhs.sas7bdat dataset.
  5. Submit your code as a job on the SCC.
    Extra Credit: Run your code from within a Jupyter notebook and, using Markdown, include formulas and an explanation of 2 different ways of computing the standard deviation.

Below you will find detailed instructions, links, and examples that should help you complete this homework. Please read the instructions for each step very carefully. You should develop your code on the SCC. Once your code works interactively, you can run it as a job.

You will need to submit your work on Blackboard. You can submit either your Python script along with the output from the job you ran on the SCC or (if you go for the extra credit) a single PDF or HTML document, which is created as output when you submit a job using your Jupyter notebook.

There is very detailed documentation for the Jupyter notebook.

You can find Markdown basics as well as many examples of how to use LaTeX commands to include formulas if you would like to complete the extra credit.
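
As a small illustration of the syntax, a Markdown cell could contain something like the following. It uses the mean rather than the standard deviation, so the formulas for the extra credit are still yours to write. The symbols n and x_i are only the usual notation for the array length and its elements, not something prescribed by the homework.

```latex
The mean of $n$ values $x_1, \dots, x_n$ is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
```

Text between single dollar signs is rendered inline; text between double dollar signs is rendered as a centered formula on its own line.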

Detailed Instructions for the homework

Open RStudio and create a vector x:
x <- c(21, 7, 5, 3, 8)
Calculate the standard deviation:
sd(x)
[1] 7.085196


In a Python environment (either Spyder or a Jupyter notebook), load the 3 packages that you will need for this homework:

In [1]:
import math
import numpy as np
import pandas as pd

Create a vector (in Python this is a list, which numpy functions treat as an array) using the same values:

In [2]:
x = [21,7,5,3,8]

Use the std() function from the numpy package to calculate the standard deviation:

In [3]:
np.std(x)
Out[3]:
6.3371918071019442

Why are the values so different?

Your task is to write your own function that calculates a standard deviation value that is equal to the one you get in R.

You cannot use any Python functions other than len(), which returns the length of the vector.
To complete this homework you might need to use loops, conditional if statements, and function definitions. Below I provide some simple examples of all three. Please remember that the body of an if statement, loop, or function must be indented, and add an empty line at the end to separate your function, loop, or if from the code that follows.

If statement:

In [4]:
# Example of if:
if len(x) > 0:
    print("Length of the array is", len(x))
else:
    print("Array has no elements")
    
Length of the array is 5
In [5]:
y = []  # empty array
if len(y) > 0:
    print("Length of the array is", len(y))
else:
    print("Array has no elements")
    
Array has no elements

for loop

In [6]:
# calculate sum of values in array
# (the name total is used so the built-in sum() is not overwritten):
total = 0
for value in x:
    total = total + value
    
print("Sum of elements in x is", total)
Sum of elements in x is 44

Function

In [7]:
# function definition
def my_mean(array):
    total = 0
    for value in array:
        total = total + value
    
    return total/len(array)
In [8]:
# Call function (check to make sure the mean() function in R returns the same value):
my_mean(x)
Out[8]:
8.8
In [9]:
# Let's check with the numpy function mean:
np.mean(x)
Out[9]:
8.8000000000000007
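
The two outputs above are printed with a different number of digits, which is most likely only a display difference between plain Python and the numpy version used here. It is still a useful reminder that floating-point results are better compared with a tolerance than with ==. The sketch below uses math.isclose from the standard library. my_mean is repeated only so the block runs on its own, and r_mean is simply the value typed in from R.

```python
import math

def my_mean(array):
    total = 0
    for value in array:
        total = total + value
    return total / len(array)

x = [21, 7, 5, 3, 8]
r_mean = 8.8  # value reported by mean(x) in R, typed in by hand

# Compare with a relative tolerance instead of ==
print(math.isclose(my_mean(x), r_mean, rel_tol=1e-9))

# A classic case where == is misleading
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True
```

The same kind of check can be used later to compare your own standard deviation function with the value printed by R.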

But my function does not work if I call it with an empty vector:

y=[]
my_mean(y)
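
To see what "does not work" means here, the sketch below calls the first version of my_mean on an empty list and catches the error that Python raises when dividing by a length of zero. The try/except block is only for illustration and is not something the homework requires.

```python
def my_mean(array):
    total = 0
    for value in array:
        total = total + value
    return total / len(array)   # len(array) is 0 for an empty list

y = []
try:
    my_mean(y)
except ZeroDivisionError as err:
    error_message = str(err)
    print("my_mean failed:", error_message)
```

Checking the length before dividing avoids this error.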

In [10]:
# Improved version of function mean:
def my_mean(array):
    
    # an empty array has no mean
    if len(array) < 1:
        return np.nan
    
    total = 0
    for value in array:
        total = total + value
    
    return total/len(array)
In [11]:
# Apply function to an empty vector
y = []
my_mean(y)
Out[11]:
nan
In [12]:
# Check if it still works on a regular vector
my_mean(x)
Out[12]:
8.8

Reading FHS dataset

In [13]:

# To read this file on the SCC, use the /project/bs803 path:
fhs = pd.read_sas("/project/bs803/fhs.sas7bdat")
In [14]:
# You can explore this dataset using the head() method:
fhs.head()
Out[14]:
   SEX  RANDID  TOTCHOL   AGE  SYSBP  DIABP  DIABETES  BPMEDS  PERIOD  CIGPDAY  HEARTRTE  HDLC   LDLC        MAP
0  1.0  2448.0    195.0  39.0  106.0   70.0       0.0     0.0     1.0      0.0      80.0   NaN    NaN  82.000000
1  1.0  2448.0    209.0  52.0  121.0   66.0       0.0     0.0     3.0      0.0      69.0  31.0  178.0  84.333333
2  2.0  6238.0    250.0  46.0  121.0   81.0       0.0     0.0     1.0      0.0      95.0   NaN    NaN  94.333333
3  2.0  6238.0    260.0  52.0  105.0   69.5       0.0     0.0     2.0      0.0      80.0   NaN    NaN  81.333333
4  2.0  6238.0    237.0  58.0  108.0   66.0       0.0     0.0     3.0      0.0      80.0  54.0  141.0  80.000000
In [15]:
# Now let's calculate the mean of column AGE using the pandas mean() method
fhs['AGE'].mean()
Out[15]:
54.792809839167454
In [16]:
# We can also run the mean() function from the numpy package:
np.mean(fhs['AGE'])
Out[16]:
54.792809839167454
In [17]:
# What about my own mean function:
my_mean(fhs['AGE'])
Out[17]:
54.792809839167454

It looks like all 3 agree on the mean value of column AGE. What about the standard deviation?

In [18]:
# standard deviation method from pandas package
fhs['AGE'].std()
Out[18]:
9.564299222631858
In [19]:
# standard deviation function from numpy package
np.std(fhs['AGE'])
Out[19]:
9.56388791684017

As you can see, the above 2 values differ, and in R we get a slightly different value than from the std() function in numpy... Why? Your task is to implement your own function my_std() that computes the standard deviation the way R and the pandas package do, so that you get the same value as the sd() function in R.

First check it on the simple vector x that we defined earlier. Make sure your function also works if you pass an empty vector! When everything works, apply it to the column AGE in the FHS dataset above.
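
One thing to keep in mind when moving from x to a dataset column: pandas methods such as mean() skip missing values by default, while a plain loop does not. The AGE column appears to have none (the three means above agree), but the head() output shows that columns such as HDLC do contain NaN. The sketch below uses a small made-up list, not the real FHS data, to show what a loop-based function does with missing values. my_mean stands in for whatever function you write.

```python
import math

def my_mean(array):
    if len(array) < 1:
        return math.nan
    total = 0
    for value in array:
        total = total + value
    return total / len(array)

# Hypothetical values for illustration only (not taken from fhs.sas7bdat)
hdlc = [31.0, math.nan, 54.0, math.nan, 45.0]

# A plain loop propagates NaN, so the result is nan
with_missing = my_mean(hdlc)
print(with_missing)

# Keep only the non-missing values, then call the function again
clean = []
for value in hdlc:
    if not math.isnan(value):
        clean.append(value)

without_missing = my_mean(clean)
print(without_missing)
```

With a pandas column, the same filtering can be done with the dropna() method, for example fhs['HDLC'].dropna().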

Please make sure you add to your Python script a detailed explanation of what you are doing. If you are creating a regular Python file in Spyder or another Python environment, use comments. Single-line comments in Python are similar to R, but if you want multi-line comments in Python, you can use the following syntax:

'''
This function calculates mean value of an array.
I calculate this value as a sum of all values in array divided by the length of the vector
'''

If, however, you are using a Jupyter notebook, you can place your text in a Markdown cell.
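
When a triple-quoted string like the one shown above is the first statement inside a function, Python stores it as the function's docstring, so the explanation travels with the function. A minimal sketch, reusing my_mean as the example:

```python
def my_mean(array):
    '''
    This function calculates mean value of an array.
    The value is the sum of all values divided by the length of the array.
    '''
    total = 0
    for value in array:
        total = total + value
    return total / len(array)

# The docstring is available at run time; help(my_mean) displays it as well
print(my_mean.__doc__)
```

This works the same way in a script and in a notebook cell.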

Check your code by running it interactively to make sure it runs without errors. Once it runs correctly, you can run it as a batch job on the SCC.

Submitting Python script on the SCC

If you want to submit a Python script (hw5.py) on the SCC, you need to create a batch script (job.qsub or similar) just like we did for the SAS and R scripts, but at the bottom of your script you should put the following lines:

module load python3/3.7.7
python hw5.py

Submitting Python Jupyter Notebook as a job on the SCC (optional)

For the extra credit you can execute a Jupyter Notebook on the SCC instead of a Python script. In this case you need to include the following lines in your job.qsub file:

module load python3/3.7.7
module load pandoc/2.5
module load texlive/2018
jupyter nbconvert --to notebook --execute hw5.ipynb
jupyter nbconvert hw5.nbconvert.ipynb --to pdf

Or, if you want to save it in HTML format instead:
module load python3/3.7.7
jupyter nbconvert --execute hw5.ipynb

Useful links

Below are some helpful links if you want to learn more about Python:
Learn Python interactively
Link to the html page with many useful commands from pandas and seaborn packages that we went through during the class.
