Exploring ChatGPT's Code Interpreter
An Analytics/Data Science Revolution?
Welcome to the “Illumined Insights” newsletter! Thank you so much for subscribing. This weekly newsletter touches on all things analytics and data science with a focus on areas such as data visualization and sports analytics.
This week we look at ChatGPT’s Code Interpreter and discuss the implications of this tool for analytics and data science professionals and students.
Stephen Hill, Ph.D.
Before we get started exploring ChatGPT’s Code Interpreter tool, let’s get a few caveats out of the way. Code Interpreter is currently (July 2023) only available to ChatGPT Plus subscribers ($20/month) and is very much a “beta” tool. If you are a ChatGPT Plus subscriber, the Code Interpreter tool must be enabled before it can be used. To enable it, log in to ChatGPT, go to Settings, then Beta Features, and then turn on Code Interpreter.
Enabling Code Interpreter
With Code Interpreter successfully enabled, let’s get started. I used the prompt: “Let's start by generating a series of 10 random numbers. Calculate the mean and standard deviation of these numbers. Show the numbers, mean, and standard deviation as output.” The screenshot below shows ChatGPT’s output.
ChatGPT Code Interpreter Output (Random Numbers, Mean, and Standard Deviation)
We can see that Code Interpreter used the Python function “np.random.rand” from NumPy to generate ten random numbers. By default, this function draws numbers uniformly between zero and one (much like Excel’s RAND function). The “np.mean” and “np.std” functions were then used to calculate the mean and standard deviation, respectively. The random numbers, the mean, and the standard deviation were stored in objects with reasonable names and then output to the screen.
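For readers following along without the screenshot, here is a minimal sketch of the same three calls. It is a reconstruction from the description above, not a copy of ChatGPT's code, and the variable names are my own. One detail worth knowing: “np.std” uses the population formula (ddof=0) unless told otherwise, so the sample standard deviation is shown alongside it.

```python
import numpy as np

# Ten random numbers drawn uniformly from [0, 1).
# No seed is set, so the values differ on every run.
random_numbers = np.random.rand(10)

mean_value = np.mean(random_numbers)

# np.std defaults to the population formula (ddof=0);
# pass ddof=1 to get the sample standard deviation instead.
std_population = np.std(random_numbers)
std_sample = np.std(random_numbers, ddof=1)

print("Numbers:", random_numbers)
print("Mean:", mean_value)
print("Standard deviation (population, ddof=0):", std_population)
print("Standard deviation (sample, ddof=1):", std_sample)
```

With only ten values, the two standard deviations differ noticeably, so it is worth checking which one a tool reports.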
That worked very well and took only seconds to execute. Let’s give the Code Interpreter a slightly more challenging task. Here’s the next prompt that I provided: “Create an appropriate visualization of the population of the top ten most populous countries in the world. The resulting visualization should be in a downloadable format.” The prompt is intentionally a bit vague and I did not provide any data to ChatGPT. It will have to obtain the data itself. I have also discovered, through a bit of trial-and-error, that the Code Interpreter does a better job of rendering visualizations when asked to provide the visualizations in “a downloadable format”. Let’s see how it does. The output was a bit long, so I’ll split it across two screenshots.
ChatGPT Population Data Visualization (Part 1)
ChatGPT Population Data Visualization (Part 2)
Here we see a limitation of ChatGPT. The model is trained on data up to September 2021. Because of this, the model provides population data from 2021 and then politely asks if I would like to use this data or provide my own. I chose to proceed with the 2021 data. The result is below:
ChatGPT Population Data Visualization (Part 3)
We see a bar chart of the population of the ten most populous countries with an option to download the chart. I decided to then add another layer of difficulty with the prompt: “Color the bars in the bar chart with the primary color of the country's flag.” ChatGPT’s response was:
ChatGPT Population Data Visualization (Part 4)
The resulting chart follows, with the Python code shown below it as a Carbon screenshot:
ChatGPT Population Data Visualization (Part 5)
ChatGPT Population Data Visualization (Part 6)
I have to admit to being pretty impressed. This is a pretty good result for about two minutes of work. There’s quite a bit of green in the country bar colors, so some adjustments might be useful, but beyond that the chart is decent. Could I make this chart myself? Sure, but it would take a few minutes to go and find the data, prepare it, and then generate the chart.
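For anyone who wants to try the chart by hand, a rough sketch is below. It is not the code from the screenshot. The population figures are rounded approximations for 2021 that I typed in as placeholders, the flag colors are one subjective reading of each flag, and the output file name is hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt

# Placeholder values: rough 2021 populations in millions, rounded.
# These are NOT the figures ChatGPT used in the screenshots.
populations = {
    "China": 1410, "India": 1390, "United States": 330, "Indonesia": 275,
    "Pakistan": 225, "Brazil": 215, "Nigeria": 210, "Bangladesh": 165,
    "Russia": 145, "Mexico": 130,
}

# One subjective choice of "primary" flag color per country (an assumption).
flag_colors = {
    "China": "red", "India": "orange", "United States": "navy",
    "Indonesia": "red", "Pakistan": "green", "Brazil": "green",
    "Nigeria": "green", "Bangladesh": "green", "Russia": "blue",
    "Mexico": "green",
}

countries = list(populations)
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(
    countries,
    [populations[c] for c in countries],
    color=[flag_colors[c] for c in countries],
)
ax.set_ylabel("Population (millions)")
ax.set_title("Ten most populous countries (approximate, 2021)")
ax.tick_params(axis="x", labelrotation=45)
fig.tight_layout()

# Save to a file so the chart is "downloadable"; the file name is hypothetical.
output_path = os.path.join(tempfile.gettempdir(), "top10_population.png")
fig.savefig(output_path, dpi=150)
plt.close(fig)
print("Chart saved to", output_path)
```

Half of the bars come out green under this color mapping, which matches the observation above that some adjustment would help.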
Let’s up the level of difficulty a bit. One of the key features of the Code Interpreter is the ability to upload a data file and then perform analysis on this data. I uploaded a dataset that I often use in class. This dataset is downloadable here: Link. The dataset includes health insurance charges and a few pieces of personal information: age, Body Mass Index, number of children, sex, smoking status, and region of the country lived in.
I started by uploading the dataset into the Code Interpreter and asked ChatGPT to “Provide a summary of this dataset”. Here’s what it produced:
ChatGPT Insurance Data Analysis (Part 1)
ChatGPT Insurance Data Analysis (Part 2)
ChatGPT Insurance Data Analysis (Part 3)
Uploading the data and producing this initial, descriptive analysis of the dataset took about 20 seconds. Code Interpreter used the pandas “pd.read_csv” function to read in the dataset, the “df.describe” method to summarize the numerical variables, and the “nunique” method to count the number of unique categories in each categorical variable.
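A minimal sketch of those three calls is below. The uploaded file is not reproduced here, so the sketch reads a few made-up rows from an in-memory string instead. The column names “sex”, “children”, and “region”, along with every value in the rows, are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up rows standing in for the uploaded file; values are illustrative only.
csv_text = """age,sex,bmi,children,smoker,region,charges
23,female,24.5,0,no,northeast,2900.50
45,male,31.2,2,yes,southeast,39500.00
36,female,27.8,1,no,southwest,6100.25
52,male,29.9,3,no,northwest,11200.75
29,male,35.1,0,yes,southeast,36800.40
61,female,26.3,2,no,northeast,14100.10
"""

# With a real file this would be pd.read_csv("path/to/file.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Count, mean, standard deviation, quartiles, min, and max for numeric columns.
numeric_summary = df.describe()

# Number of distinct categories in each categorical column.
categorical_columns = ["sex", "smoker", "region"]
categorical_counts = df[categorical_columns].nunique()

print(numeric_summary)
print(categorical_counts)
```

Note that “describe” skips the text columns by default, which is why the categorical variables need a separate step.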
Next, I asked: “Can you create some interesting visualizations from this data?” I was intentionally vague in my question to give Code Interpreter plenty of latitude to produce visualizations. Here are the results:
ChatGPT Insurance Data Analysis (Part 4)
ChatGPT Insurance Data Analysis (Part 5)
Not bad. In addition to providing four visualizations, ChatGPT also provided a bit of commentary on each chart.
Let’s go for a more directed approach by providing a more complex prompt: “Let's assume that we will want to eventually build a model to predict the "charges" variable. Can you create some visualizations that would be appropriate to help us determine which variables might be strong predictors of "charges"?” Here are the results:
ChatGPT Insurance Data Analysis (Part 6)
ChatGPT Insurance Data Analysis (Part 7)
This is pretty impressive stuff. ChatGPT recognizes that the “smoker” variable is an important one (i.e., smokers tend to have higher charges) and creates appropriate visualizations to see that effect and the effect of “age”, “bmi”, and the other variables. Thoughtful commentary on these relationships is also provided. All of this analysis has been completed in less than five minutes. The time-saving aspects of this are becoming more and more apparent to me as I explore this tool.
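To give a flavor of this kind of predictor screening, here is a small sketch that draws a box plot of charges by smoking status and a scatter plot of charges against age. It runs on synthetic data that I generate with a built-in smoker effect, not on the real dataset, and it is not the code ChatGPT produced. The coefficients used to simulate the data and the output file name are arbitrary assumptions.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data; the effect sizes below are arbitrary assumptions.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n).round(1),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
})
df["charges"] = (
    250 * df["age"]
    + 300 * df["bmi"]
    + 24000 * (df["smoker"] == "yes")
    + rng.normal(0, 3000, size=n)
)

fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(11, 4))

# Box plot: distribution of charges for non-smokers versus smokers.
levels = ["no", "yes"]
groups = [df.loc[df["smoker"] == level, "charges"].to_numpy() for level in levels]
ax_box.boxplot(groups)
ax_box.set_xticks([1, 2])
ax_box.set_xticklabels(levels)
ax_box.set_xlabel("smoker")
ax_box.set_ylabel("charges")

# Scatter plot: charges against age, colored by smoking status.
for level, color in [("no", "tab:blue"), ("yes", "tab:red")]:
    subset = df[df["smoker"] == level]
    ax_scatter.scatter(subset["age"], subset["charges"], s=10,
                       color=color, label=f"smoker = {level}")
ax_scatter.set_xlabel("age")
ax_scatter.set_ylabel("charges")
ax_scatter.legend()

fig.tight_layout()
output_path = os.path.join(tempfile.gettempdir(), "charges_predictors.png")
fig.savefig(output_path, dpi=150)
plt.close(fig)

# A numeric companion to the plots: mean charges per smoking status.
mean_by_smoker = df.groupby("smoker")["charges"].mean()
print(mean_by_smoker)
```

Because the smoker effect is baked into the simulated data, the two groups separate cleanly in both panels; on real data the picture will be noisier.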
Let’s see if the Code Interpreter tool can build a predictive model for us. Here’s the prompt: “Create the best predictive model that you can in order to predict the "charges" variable.” Here are the results (spoiler alert: I’m impressed):
ChatGPT Insurance Data Analysis (Part 8)
ChatGPT Insurance Data Analysis (Part 9)
ChatGPT Insurance Data Analysis (Part 10)
Here’s the code (hidden in the screenshots above):
ChatGPT Insurance Data Analysis (Part 11)
ChatGPT Insurance Data Analysis (Part 12)
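Since the code above survives only as screenshots, here is a hedged sketch of a comparable workflow: create indicator variables, split the data into training and testing sets, fit a model, and score it on both sets. I am assuming an ordinary linear regression, fit here with NumPy's least-squares solver, and an 80/20 split. The data is synthetic, and none of this is ChatGPT's actual code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and effect sizes are assumptions.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "children": rng.integers(0, 4, size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], size=n),
})
df["charges"] = (
    250 * df["age"] + 300 * df["bmi"] + 500 * df["children"]
    + 24000 * (df["smoker"] == "yes") + rng.normal(0, 3000, size=n)
)

# Pre-processing: turn categorical columns into indicator (dummy) variables,
# dropping one level per category to avoid perfect collinearity.
X = pd.get_dummies(
    df.drop(columns="charges"), columns=["smoker", "region"], drop_first=True
).astype(float)
X.insert(0, "intercept", 1.0)
y = df["charges"].to_numpy()

# Train/test split: shuffle the row indices and hold out 20% for testing.
shuffled = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
X_values = X.to_numpy()
X_train, X_test = X_values[train_idx], X_values[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Fit ordinary least squares on the training rows only.
coefficients, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


train_r2 = r_squared(y_train, X_train @ coefficients)
test_r2 = r_squared(y_test, X_test @ coefficients)
print(f"Training R-squared: {train_r2:.3f}")
print(f"Testing R-squared:  {test_r2:.3f}")
```

Comparing the two scores is the point of the split: a training score far above the testing score would suggest overfitting.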
There’s a lot to digest here. ChatGPT used an appropriate modeling approach by pre-processing the data (i.e., making sure that variable types are correct and indicator variables are created), splitting the data into training and testing sets, and evaluating the model performance on the training and testing data. The biggest oversight is the lack of presentation of the model itself (e.g., the model coefficients, p-values, etc.). Let’s see if we can extract that with the prompt: “Can I see the model itself? It would be nice to see the model coefficients and p-values.” Here are the results:
ChatGPT Insurance Data Analysis (Part 13)
ChatGPT Insurance Data Analysis (Part 14)
ChatGPT Insurance Data Analysis (Part 15)
The Python code was:
ChatGPT Insurance Data Analysis (Part 16)
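For readers curious where coefficients and p-values come from, the sketch below computes them by hand for a small linear regression. It is not ChatGPT's code. The data is synthetic, the variable names are my own, and the p-values use a normal approximation to the t distribution, which is only reasonable for samples of a few hundred rows or more. A library such as statsmodels reports the same quantities using the exact t distribution.

```python
import math

import numpy as np

# Synthetic stand-in data; effect sizes are arbitrary assumptions.
rng = np.random.default_rng(7)
n = 400
age = rng.integers(18, 65, size=n).astype(float)
smoker = (rng.random(n) < 0.2).astype(float)
noise_feature = rng.normal(size=n)  # unrelated to charges by construction
charges = 250 * age + 24000 * smoker + rng.normal(0, 3000, size=n)

names = ["intercept", "age", "smoker_yes", "noise_feature"]
X = np.column_stack([np.ones(n), age, smoker, noise_feature])

# Ordinary least squares estimates of the coefficients.
beta, *_ = np.linalg.lstsq(X, charges, rcond=None)

# Residual variance, using n minus the number of estimated coefficients.
residuals = charges - X @ beta
degrees_of_freedom = n - X.shape[1]
sigma_squared = residuals @ residuals / degrees_of_freedom

# Standard errors come from the diagonal of sigma^2 * (X'X)^-1.
covariance = sigma_squared * np.linalg.inv(X.T @ X)
std_errors = np.sqrt(np.diag(covariance))
t_stats = beta / std_errors

# Two-sided p-values via the normal approximation: P(|Z| > |t|) = erfc(|t| / sqrt(2)).
p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

for name, b, se, p in zip(names, beta, std_errors, p_values):
    print(f"{name:>14}  coef={b:10.2f}  std_err={se:8.2f}  p={p:.4f}")
```

The “noise_feature” column is included on purpose as a predictor with no real relationship to the response, which is the kind of variable a p-value is meant to flag.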
Wow. OK. This is really good. ChatGPT provided plenty of explanation and context as it did its work. It would have been nice to see the tool address what to do with the variables in the model that were found to not be statistically significant predictors. It would be a logical next step to probe a bit in this direction. However, let’s take a different course of action. Let’s see if the Code Interpreter can “put it all together”. Here’s the prompt: “Can you create a report in a downloadable PDF format that provides a descriptive analysis of the data, appropriate data visualizations, and a predictive model? The practical implications of this analysis should be discussed.” This prompt is very similar to what I would ask of my students when they analyze a dataset and then create models. Let’s see what we get from ChatGPT.
ChatGPT Insurance Data Analysis (Part 17)
ChatGPT Insurance Data Analysis (Part 18)
There’s more coming, but this is stunning. ChatGPT recognizes that it “forgot” or “overlooked” parts of this process and corrects itself. I’ve seen this behavior when working with the standard ChatGPT interface, but not like this. This is impressive.
ChatGPT Insurance Data Analysis (Part 19)
The PDF report was then provided in a downloadable format via link. If you’ve made it this far, I assume you would like to see the report? Well here it is: Link. I’ll be honest here, the work is impressive, but the report itself is underwhelming. I would be pretty disappointed if a student of mine submitted this.
I created a short video with an entirely new chat to see if we can coax ChatGPT to do a better job. Check it out below. The result may be surprising!
What are your impressions of ChatGPT and the Code Interpreter tool? “Gamechanger” or “Meh”?