R Basics

At the end of this week, you will be able to:

Data Science?

The use of data as evidence is crucial, but it is not new. If we examine a definition of the field of statistics, we can observe that it comprises four subtopics:

  • Data Collection
  • Data Analysis
  • Results Interpretation
  • Data Visualization

Originally, statistics was viewed as the analysis and interpretation of information about states. And science is understood as organized knowledge in the form of testable explanations and predictions about the universe.

So, what is data science? Data science is more than just using statistics and data to answer scientific questions.

Nowadays, data science is viewed as the use of various sources of data to extract knowledge and provide insights using multiple skills including programming, math and statistics, and communication.

The Venn diagram by Drew Conway provides a visualization of data science.

Data Science Venn diagram by Drew Conway

Typical examples of data science projects:

  • Market analysis: What product will sell better in conjunction with another popular product?
  • Market segmentation: Are there distinguishable features that characterize different groups of sales agents, customers, or businesses?
  • Advertising and marketing: What advertisement should be placed on what site?
  • Fraud: How can we detect whether a retail or finance transaction is valid?
  • Demand forecasting: What is the demand for a particular service at a specific time and place?
  • Classification: Email classification (spam vs. valid email)

Tools for Data Science

Data science helps managers, engineers, policymakers, and researchers (almost everybody) make informed decisions based on evidence from data. Advances in computing have greatly increased how much data we can store, manipulate, and analyze. Technologies and tools have been developed to support this work and to make data science projects more productive and efficient.

Data Science Workflow

The technologies used in analytics and data science have advanced quickly, and many open source projects exist, for example:

  • Data framework: Hadoop, Spark,…
  • Query Languages: SQL, SQL-like,…
  • Data manipulation, modeling, and graphing: R, Python,…
  • Software management: Git, GitHub,…

Data Science Workflow

Often, the data science process is iterative. Some steps in the data science workflow include:

  1. Specify the question of interest (business understanding, scientific goal, predict or estimate,…)

  2. Collect data (internal, external, sampled, relevant, ethics,…)

  3. Manipulate data (explore, transform, merge, filter,…)

  4. Model data (machine learning, statistics, probability, fit, validate,…)

  5. Communicate and interpret the results (storytelling, visualization, dashboard, reports,…)

  6. Deploy and monitor models
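
As a rough illustration of how these steps can map onto code, here is a minimal R sketch. It assumes the built-in `mtcars` data set stands in for collected data; the question, the object names (`car_data`, `fit`, `new_car`), and the choice of a linear model are illustrative assumptions, not part of the course material.

```r
# 1. Question (illustrative): does car weight help explain fuel efficiency?

# 2. Collect: the built-in mtcars data set stands in for real collected data
car_data <- mtcars

# 3. Manipulate: keep two columns and convert weight from 1000 lbs to lbs
car_data <- car_data[, c("mpg", "wt")]
car_data$wt_lbs <- car_data$wt * 1000

# 4. Model: fit a simple linear regression of mpg on weight
fit <- lm(mpg ~ wt_lbs, data = car_data)

# 5. Communicate: report the estimated intercept and slope
print(coef(fit))

# 6. Deploy: reuse the fitted model on new data (a hypothetical 3000 lb car)
new_car <- data.frame(wt_lbs = 3000)
predicted_mpg <- predict(fit, newdata = new_car)
print(predicted_mpg)
```

In a real project each step would be revisited several times, which is what makes the process iterative.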

Introduction to R / RStudio / Quarto

The two programming languages we cover in this course are R and Python. These are both open source programming languages. Let’s start off with R.

A few features of R are:

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and macOS. There are links to download R, documentation and manuals, The R Journal, books related to R, and R packages by topic.

RStudio is an integrated development environment (IDE) for R and Python, with a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management. It runs on a wide variety of UNIX platforms, Windows, and macOS. An open source edition is available to install for free from here: download RStudio

Quarto provides an authoring framework for data science reporting. It creates dynamic content with Python, R, Julia, and Observable, and produces high-quality reports that can be shared with an audience. Quarto (.qmd) documents are fully reproducible and support dozens of static and dynamic output formats.

Install your R/RStudio

For TFDS, we will be using RStudio Server hosted at UWF, available at https://rstudio.hmcse.uwf.edu/. Log in using your UWF account.

You don’t need to install R and RStudio on your computer, but you are welcome to do so if you wish.

Getting started with R

πŸ›ŽοΈ Recordings of this week provide lessons about R, RStudio, and GitHub. The following will be covered:

  • RStudio (editor, console, global environment, and so on)
  • R (scripts, packages, help)
  • GitHub and connection to RStudio
  • R Markdown - Cheat Sheet
  • My first R script - the basics
    • Values, vectors, matrices, factors, data.frames, lists. Here is an example of code:
# assign a value to object named "x"
x = 1
# or
x <- 1
1 -> x  
# Calculator 
x=10^2
y=2*x
# vectors / arrays
c(1,21,50,80,45,0)
[1]  1 21 50 80 45  0
# characters array
c("d","4","r")
[1] "d" "4" "r"
# characters
"R is useful and cool"
[1] "R is useful and cool"
# boolean - TRUE or FALSE
45>96
[1] FALSE
# built-in functions
sum(1,3,5)
[1] 9
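
The example above covers values and vectors. As a minimal sketch of the remaining structures named in this item (matrices, factors, data frames, and lists), the following uses made-up values; all object names are illustrative.

```r
# matrix: 2 rows and 3 columns, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)         # number of rows and columns

# factor: a categorical variable with a fixed set of levels
sizes <- factor(c("small", "large", "small", "medium"))
levels(sizes)  # levels are sorted alphabetically by default

# data frame: a table whose columns can have different types
df <- data.frame(name = c("a", "b", "c"), score = c(90, 75, 82))
df$score       # extract one column as a vector
nrow(df)       # number of rows

# list: a container whose elements can be of any type
lst <- list(number = 1, text = "R", values = c(1, 2, 3))
lst$values     # access an element by name
```
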
  • Statistical and mathematical functions: An example of code:
# a vector / array
vec1= c(1,21,50,80,45,0)
# minimum
min(vec1)
[1] 0
# maximum
max(vec1)
[1] 80
# exponential function
exp(vec1)
[1] 2.718282e+00 1.318816e+09 5.184706e+21 5.540622e+34 3.493427e+19
[6] 1.000000e+00
# cosine function
cos(vec1)
[1]  0.5403023 -0.5477293  0.9649660 -0.1103872  0.5253220  1.0000000
# sine function
sin(vec1)
[1]  0.8414710  0.8366556 -0.2623749 -0.9938887  0.8509035  0.0000000
# logarithm function of base 0.5 (the second argument of log() is the base; log(vec1) alone gives base e)
log(vec1,0.5)
[1]  0.000000 -4.392317 -5.643856 -6.321928 -5.491853       Inf
# square root
sqrt(vec1)
[1] 1.000000 4.582576 7.071068 8.944272 6.708204 0.000000
# logarithm function of base 10
log10(10)
[1] 1
# logarithm function of base 2
log2(2)
[1] 1
# logarithm function of base 45
log(45,base = 45)
[1] 1
# factorial
factorial(3)
[1] 6
# binomial coefficient / combination
choose(10,5)
[1] 252
  • Summary statistics, random number generation. An example:
# a set of values
vec1= c(1,21,50,80,45,0)
# summation
sum(vec1)
[1] 197
# arithmetic mean
mean(vec1)
[1] 32.83333
# standard deviation
sd(vec1)
[1] 31.30122
# summary statistics
summary(vec1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    6.00   33.00   32.83   48.75   80.00 
# variance
var(vec1)
[1] 979.7667
# quantile
quantile(vec1,0.5)
50% 
 33 
# 100 Standard normal random numbers
x=rnorm(100,mean=0,sd=1)
# histogram
hist(x)

Histogram of 100 normal random numbers
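
Random draws differ from run to run, so the histogram above will look slightly different each time the code is executed. One common way to make results repeatable is to fix the random seed first. The sketch below assumes an arbitrary seed value of 123, and the object names are illustrative.

```r
# fix the seed so the same random numbers are generated on every run
set.seed(123)
x1 <- rnorm(5, mean = 0, sd = 1)

# resetting the same seed reproduces the same draws
set.seed(123)
x2 <- rnorm(5, mean = 0, sd = 1)

identical(x1, x2)  # TRUE

# uniform random numbers between 0 and 1
u <- runif(5, min = 0, max = 1)

# random sample of 3 values from 1 to 10, without replacement
s <- sample(1:10, size = 3)
```
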

  • Functions, conditional statements, and loops: if, for and while. A code example:
# create your own function
myfunction=function(){
  print("Hello there!")
}
# call the function
myfunction()
[1] "Hello there!"
# if statement
lucky.number=100
if(lucky.number<=54){
  print("You win!")
}else{
  print("You lost!")
}
[1] "You lost!"
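
The item above also names for and while, which the example does not show. A minimal sketch of both loops, together with a function that takes an argument, could look like this; the function name `square` and all values are illustrative.

```r
# a function with one argument that returns a value
square <- function(n) {
  n^2
}
square(4)  # 16

# for loop: repeat once for each element of a vector
total <- 0
for (i in 1:5) {
  total <- total + i
}
total  # 15

# while loop: repeat as long as the condition is TRUE
count <- 0
while (count < 3) {
  count <- count + 1
}
count  # 3
```

A for loop suits a known number of repetitions, while a while loop suits cases where the stopping point depends on a condition checked at each pass.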

πŸ›Ž πŸŽ™οΈ Recordings on Canvas will cover more details and examples! Have fun learning and coding πŸ˜ƒ! Let me know how I can help!

📚 👈 Assignment - R basics

Instructions are posted on Canvas.