Write R Code with ChatGPT
Do you even need to learn to program at this point?
Welcome to the “Illumined Insights” newsletter! Thank you so much for subscribing. This weekly newsletter touches on all things analytics and data science with a focus on areas such as data visualization and sports analytics.
This week we add our “two cents” to the ongoing discussion about Artificial Intelligence (AI). I’ve been using ChatGPT and several of its competitors extensively over the last few months. I’m deeply impressed by the potential of these tools and ultimately believe that they will be revolutionary and change how we work, learn, and teach. For full disclosure, I am a paying subscriber to ChatGPT, which grants access to GPT-4. Let’s get started.
Stephen Hill, Ph.D.
I started learning the R programming language around 2013. I had been asked to teach an “advanced” analytics course and using Excel only wasn’t going to cut it. So I undertook a self-taught crash course to learn enough R before mid-2014 so that I wouldn’t embarrass myself teaching it. Back then (as now) there were quite a few online tutorials and short courses that I could use to, over time, cobble together enough R knowledge. There wasn’t, however, a publicly accessible tool such as ChatGPT.
I’ve been using ChatGPT on a nearly daily basis to help me write and improve my R code. Let’s take a look at several coding examples in R using ChatGPT to generate our code. Let’s start out with asking ChatGPT to use the “ggplot2” package to create a simple scatterplot from the “diamonds” dataset. Here’s the prompt: “Provide R code to generate a scatterplot of carat versus price using the ggplot2 package and the diamonds dataset.”
Here’s ChatGPT’s response (I’m using horizontal dividers to show where the ChatGPT response begins and ends):
The `diamonds` dataset is a part of the `ggplot2` package in R, which includes information about almost 54,000 diamonds, such as their carat, cut, color, clarity, depth, table, price, and dimensional measurements.
Here is a simple R script that uses `ggplot2` to create a scatterplot of `carat` versus `price`.
# Load the necessary library
library(ggplot2)
# Use the diamonds dataset
data(diamonds)
# Generate the scatterplot
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point() +
  labs(title = "Scatterplot of Carat vs Price", x = "Carat", y = "Price")
In this script:
- The `library(ggplot2)` command is used to load the ggplot2 package.
- The `data(diamonds)` command is used to load the diamonds dataset.
- `ggplot(data = diamonds, aes(x = carat, y = price))` initializes a ggplot object and sets up the aesthetics for the plot.
- `geom_point()` adds a layer to the plot that will draw the points for each diamond, with the position of each point determined by the `carat` and `price`.
- The `labs()` function is used to set the title of the plot and the labels for the x and y axes.
Remember to have `ggplot2` package installed in your R environment. You can install it using `install.packages("ggplot2")` if it's not installed.
Not only did ChatGPT provide the code that we requested, it went above and beyond our request. The output begins with a quick overview of the “diamonds” dataset before providing the code to create the requested plot. The code itself is commented and goes so far as to provide appropriate labels for the plot. After the code is a detailed description of each line of the code with a helpful reminder to make sure that the “ggplot2” package is installed. The code runs fine in R and generates the plot shown below:
R ggplot2 scatterplot from ChatGPT code
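Looking at the plot is one check. A slightly more systematic one is to store the plot in an object and ask ggplot2 what it actually drew. The following is a minimal sketch and not part of the ChatGPT response: the object names `p` and `plotted` are arbitrary, and it assumes `layer_data()` from ggplot2 is used to pull out the data behind the first layer.

```r
library(ggplot2)

# Same plot as in the ChatGPT response, stored in an object instead of printed
p <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point() +
  labs(title = "Scatterplot of Carat vs Price", x = "Carat", y = "Price")

# Data that ggplot2 computed for the first (and only) layer
plotted <- layer_data(p, 1)

# There should be one point per diamond
n_points <- nrow(plotted)

# The x and y positions should be the carat and price columns we asked for
axes_match <- all(plotted$x == diamonds$carat) &&
  all(plotted$y == diamonds$price)
```

If `n_points` equals `nrow(diamonds)` and `axes_match` is `TRUE`, the code did what the prompt asked, at least as far as the mapping goes.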
Let’s increase the level of difficulty a bit. Can ChatGPT write a functioning Shiny app and what happens when we ask ChatGPT to modify the code to enhance the app?
Here’s the prompt: “Write a Shiny app to display a scatterplot of carat versus price from the diamonds dataset using the ggplot2 package. Add a user input to select the size of the points in the scatterplot.”
And here’s the ChatGPT output:
Sure, here is a simple Shiny application with the specifications you mentioned:
# Ensure necessary packages are installed
if(!require(shiny)) install.packages("shiny")
if(!require(ggplot2)) install.packages("ggplot2")
library(shiny)
library(ggplot2)
ui <- fluidPage(
  titlePanel("Diamonds Scatterplot: Carat vs Price"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("pointSize",
                  "Point Size:",
                  min = 1,
                  max = 5,
                  value = 2)
    ),
    mainPanel(
      plotOutput("scatterPlot")
    )
  )
)
server <- function(input, output) {
  output$scatterPlot <- renderPlot({
    ggplot(diamonds, aes(x = carat, y = price)) +
      geom_point(size = input$pointSize) +
      labs(title = "Scatterplot of Carat vs Price",
           x = "Carat",
           y = "Price") +
      theme_minimal()
  })
}
shinyApp(ui = ui, server = server)
This Shiny app provides a scatterplot of carat versus price from the diamonds dataset. The size of the points in the scatterplot can be adjusted using a slider on the sidebar panel. You can run this code in your local R environment. Please make sure that you have the "shiny" and "ggplot2" packages installed and loaded.
Note: the "diamonds" dataset is a part of the "ggplot2" package. It contains information about 53,940 diamonds, including the price, carat, and other attributes.
If you read our previous three-part series, in which we developed a Shiny app and then added various enhancements, you’ll be familiar with the basic server and UI structure of a Shiny app. The app runs perfectly, with no corrections to the code needed. A screenshot of the app is shown below.
Shiny app developed with ChatGPT code
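Running the app and moving the slider is the obvious test. For a check that does not need a browser, one option is to pull the plotting code out of `renderPlot()` into an ordinary function. This is only a sketch: `make_scatter()` is a hypothetical helper name that ChatGPT did not produce, and `layer_data()` from ggplot2 is used to read back the point size.

```r
library(ggplot2)

# Hypothetical helper: the same plot the server builds,
# with the point size passed in as an argument
make_scatter <- function(point_size) {
  ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(size = point_size) +
    labs(title = "Scatterplot of Carat vs Price",
         x = "Carat",
         y = "Price") +
    theme_minimal()
}

# Build the plot at both ends of the slider range and
# read back the size that ggplot2 applied to the points
size_small <- unique(layer_data(make_scatter(1), 1)$size)
size_large <- unique(layer_data(make_scatter(5), 1)$size)
```

With a helper like this, the body of `renderPlot()` would shrink to `make_scatter(input$pointSize)`, and the plotting logic can be checked from a plain R session.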
So far so good from ChatGPT. What if we ask it to create a coding challenge and then provide a solution? Here’s the first prompt: “Please provide me an R coding challenge to give to ChatGPT.” The ChatGPT output was:
Sure, here is a challenge for you:
R Coding Challenge:
In the world of data science, data preprocessing is a key aspect of data analysis. Often, data comes with missing or NA values that need to be treated before proceeding to analysis. In this challenge, write an R function that takes in a dataframe and does the following:
- The function should be named `treat_missing_data`.
- It should take in two parameters: a dataframe `df` and a method `method`.
- If `method` is "mean", it should replace all NA values in numeric columns with their respective column mean.
- If `method` is "median", it should replace all NA values in numeric columns with their respective column median.
- If `method` is "mode", it should replace all NA values in numeric columns with their respective column mode.
- If `method` is "remove", it should remove all rows with at least one NA value.
- If `method` is not provided or any other value is given, the function should return the original dataframe.
- The function should return the treated dataframe.
Ensure your function is robust and can handle dataframes of different sizes and with different types of columns (numeric, character, etc.).
Test your function with an appropriate dataframe and different method inputs.
You can use the `mtcars` dataset in R for testing your function. Add some NA values to this dataset before testing.
Remember to document your function appropriately so it is clear what it does and how to use it.
OK. This is pretty good. My main complaint would be that ChatGPT seemed to misinterpret my prompt. I was thinking that it would develop an R coding challenge for itself and, perhaps, even present a solution without being explicitly told to do so. I followed up this output with another prompt: “Can you provide a solution to this coding challenge? Use the mtcars dataset with randomly added NA values.” ChatGPT then generated this output:
Sure, here is a solution to the problem. Please note that R does not have a built-in function to calculate mode. Hence, we first need to create a function `calculate_mode()` to calculate mode of a numeric vector.
# Calculate mode
calculate_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# Main function
treat_missing_data <- function(df, method = NULL){
  if(is.null(method)){
    return(df)
  } else if (method %in% c('mean', 'median', 'mode', 'remove')){
    for(col in names(df)){
      if(is.numeric(df[[col]])){
        if(method == 'mean'){
          df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
        } else if(method == 'median'){
          df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
        } else if(method == 'mode'){
          df[[col]][is.na(df[[col]])] <- calculate_mode(df[[col]][!is.na(df[[col]])])
        }
      }
    }
    if(method == 'remove'){
      df <- df[complete.cases(df),]
    }
  } else {
    return(df)
  }
  return(df)
}
# Testing
set.seed(123)
df <- mtcars
na_rows <- sample(1:nrow(df), 5)
na_cols <- sample(1:ncol(df), 5)
for(i in 1:5){
  df[na_rows[i], na_cols[i]] <- NA
}
df_treated <- treat_missing_data(df, "mean")
print(df_treated)
In this code, we first create a helper function `calculate_mode()` to calculate mode of a numeric vector. Then, we define the main function `treat_missing_data()` which takes in a dataframe and an optional method parameter. Depending on the value of `method`, it replaces NA values with column mean, median or mode, or removes rows with NA values.
We then test our function by creating a dataframe from the `mtcars` dataset and randomly inserting NA values. Finally, we treat the dataframe using the `treat_missing_data()` function with "mean" method and print the treated dataframe.
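ChatGPT’s remark about the mode deserves a quick check, because base R does have a function called `mode()`. It reports how a value is stored, not which element occurs most often. The short sketch below shows that, along with one behavior of `calculate_mode()` that the response does not mention: when two values tie, the one that appears first wins. The example vectors are made up for illustration.

```r
# Helper copied from the ChatGPT response
calculate_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Base R's mode() describes storage, not frequency
storage <- mode(c(1, 2, 2, 3))

# The helper returns the most frequent value
most_common <- calculate_mode(c(1, 2, 2, 3))

# Ties: 5 and 7 both appear twice, and the first one seen is returned
tie_result <- calculate_mode(c(5, 7, 7, 5))
```

Here `storage` is the string "numeric", `most_common` is 2, and `tie_result` is 5. Whether "first value wins" is the right tie rule depends on the data, so it is worth knowing about before imputing with it.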
The code that ChatGPT provides works perfectly. The commentary provided after the code is a nice extra touch. Could I have written the code that ChatGPT provided? Sure, but it would have taken me a while. ChatGPT generated the code in less than ten seconds.
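The test run in the ChatGPT response only exercises the "mean" method, while the challenge asked for several. A small hand-made data frame makes the remaining branches easy to verify by eye. In this sketch the function is restated in condensed form so the block runs on its own (it is intended to behave the same as the ChatGPT version, not to replace it), and the `toy` data frame and its values are invented for the example.

```r
# Helper from the ChatGPT response: most frequent value
calculate_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Condensed restatement of treat_missing_data()
treat_missing_data <- function(df, method = NULL) {
  known <- c("mean", "median", "mode", "remove")
  # Missing or unrecognized method: hand the data back untouched
  if (is.null(method) || !(method %in% known)) return(df)
  # "remove" drops every row that contains at least one NA
  if (method == "remove") return(df[complete.cases(df), ])
  # Otherwise pick the summary used to fill numeric columns
  fill_value <- switch(method,
    mean   = function(x) mean(x, na.rm = TRUE),
    median = function(x) median(x, na.rm = TRUE),
    mode   = function(x) calculate_mode(x[!is.na(x)])
  )
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- fill_value(df[[col]])
    }
  }
  df
}

# Invented test data: one numeric column and one character column
toy <- data.frame(
  x = c(1, 2, 2, NA, 10),
  label = c("a", NA, "c", "d", "e"),
  stringsAsFactors = FALSE
)

# The NA in x sits in row 4; see what each method puts there
filled_mean   <- treat_missing_data(toy, "mean")$x[4]
filled_median <- treat_missing_data(toy, "median")$x[4]
filled_mode   <- treat_missing_data(toy, "mode")$x[4]

# Rows 2 and 4 each contain an NA, so "remove" should leave three rows
rows_left <- nrow(treat_missing_data(toy, "remove"))

# Unknown or missing methods should return the input unchanged
same_for_unknown <- identical(treat_missing_data(toy, "bogus"), toy)
same_for_missing <- identical(treat_missing_data(toy), toy)

# Character columns are never filled, so the NA label stays NA
label_still_na <- is.na(treat_missing_data(toy, "mean")$label[2])
```

The non-missing values of `x` are 1, 2, 2, and 10, so the mean fill is 3.75 while the median and mode fills are both 2. The gap between 3.75 and 2 is a small reminder that the choice of method matters when a column has an outlier.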
So, is ChatGPT a perfect companion for R coding? Yes and no. If you frequently need to write R code, ChatGPT will help you write code faster and more efficiently. It does make mistakes, but the mistakes that it makes are often a product of poor prompting rather than bad coding. If you are learning R code, ChatGPT can be a great companion as it does an admirable job of explaining the code that it writes and can also provide insights about code that you provide to it.
Are you interested in learning more about data visualization using R? Click below to get notified about my upcoming book “Data Visualization in R”.
Each week we’ll feature a dataset that we find interesting, useful, etc. This week we’ll deviate a bit in this feature and instead highlight a free, online course in ChatGPT prompt engineering. I’ve found this course, from DeepLearning.ai, to be an invaluable resource. Check it out and let me know what you think.
Feedback?
Did you enjoy this week’s newsletter? Do you have a topic, tool, or technique that you would like to see featured in a future edition? I’d love to hear from you!
Support the Newsletter?
Support this newsletter with a “coffee” (optional, but appreciated).
Start Your Own Newsletter?
This newsletter is created on and distributed via Beehiiv, the world’s best newsletter platform. Want to start your own newsletter? Click below to get started. Please note that this is an affiliate link. I may receive a small commission if you sign up for Beehiiv via this link.