# This is Python Code
print("Hello World!")
Hello World!
At the end of this week, you will be able to:
Python has emerged in recent years as one of the most widely used tools for data science projects. It is known for code readability and interactive features. Similar to R, Python is supported by a large number of packages that extend its features and functions. Common packages are, to name a few:
We will use the RStudio IDE to run Python, but there are other IDEs that you may want to check out for your information, such as PyCharm, Jupyter, and others. We will be using Python 3. We will see that there are multiple similarities between R and Python.
Please be advised that if you experience any problems with Python while using the RStudio Server UWF, we suggest installing R/RStudio on your device and then installing the R package reticulate.
As we have seen with R, Python is also organized into libraries/packages/modules. These libraries need to be installed as needed and then loaded. To install a library in Python on RStudio Server, go to Terminal (not Console) and run the following:
python3 -m pip install --user "package name"
For example, to install numpy, run:
python3 -m pip install --user numpy
You need to do this for each library you need to use, such as numpy, pandas, matplotlib, statsmodels, seaborn, and sklearn (the pip package name for sklearn is scikit-learn).
Indentation refers to the spaces at the beginning of a code line. The indentation in Python is very important.
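As a small illustration of why indentation matters, here is a minimal sketch (the variable names `temperature` and `messages` are made up for this example). The lines indented under `if` run only when the condition is true, while the unindented line always runs:

```python
# Hypothetical example: indentation decides which lines belong to the if block
temperature = 30
messages = []

if temperature > 25:
    # these two lines are indented, so they belong to the if block
    messages.append("It is warm")
    messages.append("Drink water")

# this line is not indented, so it runs regardless of the condition
messages.append("Done")

print(messages)
```

Removing the indentation from the lines inside the `if` block would raise an IndentationError.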
Recordings of this week provide lessons about the following concepts:
You can name a variable following these rules:
x = "HeyHey"
y = 40
x
y
x, y = "Hey", 45 # Assign values to multiple variables
print(x)
print(y)
ranks = ["first","second","third"] # list
x, y, z = ranks
print(ranks)
x
y
z
# define a function
def myf():
    x = "Hello"
    print(x)
# use the function
myf()
# define another function
def myf():
    global x # x to be global - outside the function
    x = "Hello"
    print(x)
myf()
Hey
45
['first', 'second', 'third']
Hello
Hello
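To make the difference between a local variable and a variable outside the function easier to see, here is a minimal sketch. It reuses the names `ranks` and `myf` from the code above; the name `returned` is introduced only for this illustration:

```python
# Unpack a list into three variables (the number of names must match the list length)
ranks = ["first", "second", "third"]
x, y, z = ranks

def myf():
    x = "Hello"  # local x: it exists only inside the function
    return x

returned = myf()
print(returned)  # value of the local x
print(x)         # the outer x is unchanged by the function call
```

Without the `global` keyword, assigning to `x` inside the function leaves the outer `x` untouched.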
Data Types:
x = str(3) # x will be '3'
x = int(3) # x will be 3
x = float(3) # x is a float - 3.0
x = 1j # x is complex
x = range(5,45) # x is a range type
x = [1,2,1,24,54,45,2,1] # x is a list
x = (1,2,1,24,54,45,2,1) # x is a tuple
x = {"name" : "Ach", "age" : 85} # x is a dict (mapping)
x
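One way to check which type a value has is the built-in `type()` function. The sketch below collects the type names of the values used above (the names `values` and `type_names` are made up for this example):

```python
# Build one value of each type shown above and look up its type name
values = [str(3), int(3), float(3), 1j, range(5, 45), [1, 2], (1, 2), {"name": "Ach", "age": 85}]
type_names = [type(v).__name__ for v in values]
print(type_names)
```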
Math operations:
5+4 # Addition
5*4 # Multiplication
5**4 # power / exponent
print("Hey"*3) # String operations
import math as mt # More math functions using the math package
mt.cos(556) # cosine function
import random # generate random numbers
print(random.randrange(1, 10))
import numpy as np # generate random numbers
print(np.random.normal(loc=0,scale=1,size=2))
HeyHeyHey
7
[-0.23600152 0.31460777]
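Two details of the functions above are easy to miss: `math.cos` expects its argument in radians, and `random.randrange(1, 10)` never returns 10. A minimal sketch follows; the angle and the seed are arbitrary choices for this example:

```python
import math
import random

angle_deg = 60
cos_value = math.cos(math.radians(angle_deg))  # convert degrees to radians first

random.seed(123)  # arbitrary seed so the draws can be repeated
draws = [random.randrange(1, 10) for _ in range(1000)]

print(round(cos_value, 3))
print(min(draws) >= 1, max(draws) <= 9)  # the upper bound 10 is excluded
```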
Strings operations:
word = "Hello There!"
word[1] # accessing characters in a String
for z in word:
    print(z)
len(word) # strings length
"or" in word # check if "or" is in word
word1 = "Do you use Python or R or both!"
"or" in word1 # check if "or" is in word1
H
e
l
l
o
T
h
e
r
e
!
True
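The string operations above can be summarized in a short sketch. Indexing starts at 0, and a slice such as `word[:5]` stops before index 5. The variable names other than `word` are made up for this example:

```python
word = "Hello There!"

first_char = word[0]   # indexing starts at 0
second_char = word[1]
first_five = word[:5]  # characters at index 0 up to (not including) 5
length = len(word)     # counts every character, including the space
has_or = "or" in word  # substring check

print(first_char, second_char, first_five, length, has_or)
```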
Python assignment operators:
Operator | Example | Equivalent to |
---|---|---|
= | x = 10 | x = 10 |
+= | x += 10 | x = x+10 |
-= | x -= 10 | x = x-10 |
*= | x *= 10 | x = x*10 |
/= | x /= 10 | x = x/10 |
%= | x %= 10 | x = x%10 |
**= | x **= 10 | x = x**10 |
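The operators in the table can be chained on one variable to see their effect step by step. This is a minimal sketch with arbitrary numbers; the comments show the value of `x` after each line:

```python
x = 10
x += 10  # 20
x -= 5   # 15
x *= 2   # 30
x /= 4   # 7.5 (division always gives a float)
x %= 4   # 3.5 (remainder of 7.5 divided by 4)
x **= 2  # 12.25
print(x)
```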
If-Else Statements:
h = 2
if h > 2:
    print("Yes!") # indentation is very important; otherwise Python raises an ERROR
elif h > 50:
    print("Yes Yes!")
else:
    print("No")
No
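Conditions are checked from top to bottom, and only the first true branch runs. In the example above, any value greater than 50 is also greater than 2, so the `elif h > 50` branch can never be reached. A possible reordering, sketched here with a hypothetical helper function `describe`, puts the more specific condition first:

```python
def describe(h):
    # check the most specific condition first so every branch is reachable
    if h > 50:
        return "Yes Yes!"
    elif h > 2:
        return "Yes!"
    else:
        return "No"

print(describe(2))
print(describe(10))
print(describe(60))
```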
For Loop Statements:
for k in range(1,10):
    print(str(k)) # 10 is not printed; the loop goes up to 9
1
2
3
4
5
6
7
8
9
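Because `range` excludes its stop value, the stop has to be one past the last number we want. The sketch below also shows the optional third argument, the step (the variable names are made up for this example):

```python
one_to_nine = list(range(1, 10))   # stop value 10 is excluded
one_to_ten = list(range(1, 11))    # use 11 to include 10
even_numbers = list(range(0, 10, 2))  # third argument is the step

print(one_to_nine)
print(one_to_ten)
print(even_numbers)
```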
NumPy is a Python library. Its name stands for Numerical Python, and it is very useful for manipulating arrays. NumPy arrays are faster than Python lists and quite useful for machine learning applications.
import numpy # this code imports the NumPy library
arr1 = numpy.array([1,2,45,564,98]) # create array using NumPy
print(arr1)
[ 1 2 45 564 98]
Usually, we give a library an alias, such as np for the NumPy library. Array objects in NumPy are called ndarray. We can pass any array-like object (list, tuple, etc.) to the function array():
import numpy as np
arr1 = np.array([1,2,45,564,98])
print(arr1)
# Multidimensional arrays!
d0 = np.array(56)
d1 = np.array([15, 52, 83, 84, 55])
d2 = np.array([[1, 2, 3], [4, 5, 6]])
d3 = np.array([[[1, 2, 3], [4, 5, 6]], [[11, 21, 31], [41, 51, 61]]])

print(d0.ndim) # print dimension
print(d1.ndim)
print(d2.ndim)
print(d3.ndim)
[ 1 2 45 564 98]
0
1
2
3
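Besides `ndim`, an array also reports its `shape` (the length along each dimension) and `size` (the total number of elements). A minimal sketch using the same 2-D array as above:

```python
import numpy as np

d2 = np.array([[1, 2, 3], [4, 5, 6]])

print(d2.ndim)   # number of dimensions
print(d2.shape)  # (rows, columns)
print(d2.size)   # total number of elements
```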
Array Indexing:
import numpy as np
D2 = np.array([[1,2,3,4,5], [6,7,8,9,10]], dtype=float)

print('4th element on 1st dim: ', D2[0, 3])
print('4th element on 2nd dim: ', D2[1, 3])
print('1st dim: ', D2[0, :])
arr = np.array([1, 2, 3, 4, 5, 6, 7])

print("From the start to index 2 (not included): ", arr[:2])
print("From the index 2 (included) to the end: ", arr[2:])
4th element on 1st dim: 4.0
4th element on 2nd dim: 9.0
1st dim: [1. 2. 3. 4. 5.]
From the start to index 2 (not included): [1 2]
From the index 2 (included) to the end: [3 4 5 6 7]
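The same indexing rules extend to columns, negative indices, and slices inside a row. The sketch below reuses the `D2` array from above; the other variable names are made up for this example:

```python
import numpy as np

D2 = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], dtype=float)

second_col = D2[:, 1]     # every row, column with index 1
last_of_row2 = D2[1, -1]  # negative index counts from the end
middle = D2[0, 1:4]       # first row, indices 1, 2, 3

print(second_col)
print(last_of_row2)
print(middle)
```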
Arithmetic operations and Math/Stat functions:
import numpy as np
a = np.array([[1,2,3,4,5], [6,7,8,9,10]], dtype="f")
b = np.array([[10,20,30,40,50], [60,70,80,90,100]], dtype="i")

np.subtract(b,a) # b-a
np.add(b,a) # b+a
np.divide(b,a) # b/a
np.multiply(b,a) # b*a
np.exp(a) # exponential function
np.log(a) # natural logarithm function
np.sqrt(a) # square root function
np.full((3,3),5) # 3x3 constant array
a.mean() # mean
a.std() # standard deviation
a.var() # variance
a.mean(axis=0) # mean along axis 0 (one mean per column)
np.median(a) # median
np.median(a,axis=0) # median along axis 0 (one median per column)
array([3.5, 4.5, 5.5, 6.5, 7.5], dtype=float32)
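The `axis` argument decides the direction in which a statistic is computed, and the arithmetic functions are element-wise, so `np.subtract(b, a)` gives the same result as `b - a`. A minimal sketch with the same arrays as above (the names `same_result`, `col_means`, and `row_means` are made up for this example):

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], dtype="f")
b = np.array([[10, 20, 30, 40, 50], [60, 70, 80, 90, 100]], dtype="i")

same_result = np.array_equal(b - a, np.subtract(b, a))  # operator and function agree

col_means = a.mean(axis=0)  # collapse the rows: one mean per column
row_means = a.mean(axis=1)  # collapse the columns: one mean per row

print(same_result)
print(col_means)
print(row_means)
```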
Random numbers generation:
random is a module in NumPy that offers functions for working with random numbers.
from numpy import random
x = random.randint(100) # a random integer from 0 to 99 (100 is excluded)
print(x)
x = random.rand(10) # 10 random floats between 0 and 1
print(x)
x = random.randint(100,size=(10)) # 10 random integers from 0 to 99
print(x)
x = random.randint(100,size=(10,10)) # 10x10 random integers from 0 to 99
print(x)
x = random.choice([100,12,0,45]) # sample one value from an array
print(x)
x = random.choice([100,12,0,45],size=(10)) # sample 10 values from an array (with replacement)
print(x)
x = random.choice([100, 12, 0, 45], p=[0.1, 0.3, 0.6, 0.0], size=(10)) # Probability sampling
print(x)
x = random.normal(loc=1, scale=0.5, size=(10)) # Normal distribution
print(x)
x = random.normal(loc=1, scale=0.5, size=(10)) # Normal distribution
print(x)
13
[0.81040855 0.97471457 0.95913855 0.61207417 0.87371005 0.13088357
0.78756168 0.52396309 0.48394988 0.2610489 ]
[43 10 86 55 16 23 89 40 33 31]
[[84 20 5 97 50 34 12 46 40 89]
[21 26 67 85 86 90 0 0 39 87]
[40 89 71 28 5 97 94 98 97 2]
[92 38 71 53 67 91 6 5 4 35]
[43 42 87 45 11 58 34 1 51 30]
[31 74 83 63 38 90 23 77 29 90]
[48 91 31 9 99 98 51 92 80 21]
[ 4 35 76 65 28 88 94 10 22 75]
[13 43 32 25 9 47 81 76 98 10]
[11 51 42 36 55 25 84 30 84 12]]
45
[ 0 0 12 45 0 100 100 45 12 100]
[12 0 0 0 0 0 0 0 0 0]
[1.04037504 0.72046807 1.38105608 0.42761585 0.69574904 1.28736621
1.9797221 1.11598722 1.8180968 1.4188014 ]
[1.00355013 0.02429212 0.95880871 0.05053701 1.5125858 0.92567197
0.82939892 0.45977718 1.42182413 0.8339264 ]
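The numbers above change on every run. If repeatable results are needed, one option is a seeded generator created with `np.random.default_rng`. This is a sketch of that alternative, not the approach used in the code above; the seed 2024 is an arbitrary choice:

```python
import numpy as np

# Two generators with the same (arbitrary) seed produce the same sequence
rng_a = np.random.default_rng(2024)
rng_b = np.random.default_rng(2024)

ints_a = rng_a.integers(0, 100, size=10)  # 10 integers from 0 to 99
ints_b = rng_b.integers(0, 100, size=10)

# Probability sampling: 45 has probability 0.0, so it is never drawn
picks = rng_a.choice([100, 12, 0, 45], p=[0.1, 0.3, 0.6, 0.0], size=10)

normals = rng_a.normal(loc=1, scale=0.5, size=10)  # normal distribution

print(np.array_equal(ints_a, ints_b))
print(picks)
```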
📚 For more reading visit Introduction to NumPy.
Pandas is a Python library. It is useful for data wrangling and working with data sets. The name Pandas refers to both Panel Data and Python Data Analysis. This is a handy Cheat Sheet for Pandas for data wrangling.
import pandas as pd
a = [1,6,8]
series = pd.Series(a) # this is a pandas Series
print(series)
mydata = {
  "calories": [1000, 690, 190],
  "duration": [50, 40, 20]
}
mydataframe = pd.DataFrame(mydata) # data frame
mydataframe
0 1
1 6
2 8
dtype: int64
| | calories | duration |
---|---|---|
0 | 1000 | 50 |
1 | 690 | 40 |
2 | 190 | 20 |
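Once the data frame exists, single columns and rows can be pulled out, and new columns can be computed from existing ones. A minimal sketch using the same data as above; the column name `calories_per_minute` is made up for this example:

```python
import pandas as pd

mydata = {
    "calories": [1000, 690, 190],
    "duration": [50, 40, 20],
}
mydataframe = pd.DataFrame(mydata)

calories = mydataframe["calories"]  # select one column (a Series)
second_row = mydataframe.loc[1]     # select the row with index label 1

# Hypothetical derived column: element-wise division of two columns
mydataframe["calories_per_minute"] = mydataframe["calories"] / mydataframe["duration"]

print(mydataframe)
```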
Read CSV Files
CSV files are a simple way to store large data sets. Pandas can easily read CSV files into a data frame:
import pandas as pd
import numpy as np
df = pd.read_csv("../datasets/mycars.csv")
print(df.info()) # Info about Data
df.head()

df["speed"] = df["speed"].astype(float) # cast to float so the column can hold NaN
df.loc[3,"speed"] = np.nan # insert NaN in the row with index 3 of the speed column
df.head()

newdf = df.dropna() # drop rows with NA cells
newdf.head()

df.dropna(inplace = True) # drop rows with NA cells and replace "df" with the new data
df.head()

df = pd.read_csv("../datasets/mycars.csv")
df.fillna(100, inplace = True) # replace NA values with 100.

df["speed"] = df["speed"].fillna(10) # replace NA values with 10 only in column "speed"

x = df["speed"].mean() # find mean of speed
df["speed"] = df["speed"].fillna(x) # replace NA values with mean only in column "speed"

print(df.duplicated().head()) # show duplicates
df.drop_duplicates().head() # drop duplicates
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 50 non-null int64
1 speed 50 non-null int64
2 dist 50 non-null int64
dtypes: int64(3)
memory usage: 1.3 KB
None
0 False
1 False
2 False
3 False
4 False
dtype: bool
| | Unnamed: 0 | speed | dist |
---|---|---|---|
0 | 1 | 4 | 2 |
1 | 2 | 4 | 10 |
2 | 3 | 7 | 4 |
3 | 4 | 7 | 22 |
4 | 5 | 8 | 16 |
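The missing-value steps above can be tried without the course file. The sketch below assumes a small in-memory CSV built from the five rows shown in the table above (it stands in for mycars.csv, which is not reproduced here) and keeps the original data frame unchanged by working on copies:

```python
import io

import numpy as np
import pandas as pd

# Assumed sample data: the first five speed/dist rows shown above
csv_text = """speed,dist
4,2
4,10
7,4
7,22
8,16
"""
df = pd.read_csv(io.StringIO(csv_text))

df["speed"] = df["speed"].astype(float)  # float columns can hold NaN
df.loc[3, "speed"] = np.nan              # insert a missing value at index 3

dropped = df.dropna()  # option 1: remove rows that contain NaN

filled = df.copy()     # option 2: replace NaN with the column mean
speed_mean = filled["speed"].mean()  # NaN is skipped when computing the mean
filled["speed"] = filled["speed"].fillna(speed_mean)

print(dropped)
print(filled)
```

Assigning the result of `fillna` back to the column avoids modifying a temporary copy of the column by accident.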
🛎 🎙️ Recordings on Canvas will cover more details and examples! Have fun learning and coding 😃! Let me know how I can help!
Instructions are posted on Canvas.