
Chaos To Clarity: Streamlining Data Cleaning With AI

Data Cleansing Survey Responses Using OpenAI’s GPT Model

In the era of data-driven decision-making, ensuring clean and accurate data is crucial. However, traditional data cleansing methods often prove time-consuming and labor-intensive. Enter large language models, the game-changers of data cleansing. This article explores how artificial intelligence and large language models can revolutionize the data cleansing process, providing an efficient and scalable solution for organizations. Discover how embracing large language models can transform chaos into clarity in your data-cleansing workflows.

The Use Case:

When conducting a survey among university students, leaving a factual field as free-form text can lead to unpredictable and varied responses. This is precisely what Study Fetch, an AI-powered platform creating personalized study sets for students, experienced. After receiving over 10,000 responses, their CEO and Co-Founder, Esan Durrani, encountered a small challenge. The “major” field in the survey allowed respondents to enter any text they wanted, making it difficult for data analysis. The raw data ended up being unstructured and challenging to work with. 

But fret not! We have a powerful tool in our arsenal, the Large Language Model (LLM).

As the saying goes, when all you have is a hammer, everything looks like a nail. And in this case, data cleansing becomes the perfect nail for our LLM.

To tackle this issue, we can employ OpenAI’s Generative Pre-trained Transformers (GPT), a prominent LLM that powers the widely used Chatbot app, ChatGPT. GPT models have an impressive capacity of over 175 billion parameters and have been trained on an extensive dataset of 2.6 billion web pages sourced from Common Crawl. Moreover, through reinforcement learning from human feedback (RLHF), trainers can guide the model to generate more accurate and useful responses. [2]

With over 175 billion parameters, the GPT model is well-suited for our purposes, provided we craft the correct prompt.

It’s All In The Prompt: 

When working on the prompt for categorizing survey responses into program categories, Ryan and Esan, experts at crafting effective prompts from our AI company, proposed the initial version. Although it worked well using language inference [3], there were two areas for improvement:

1. It was designed for a single record.
2. It was written as a ‘Completion’ using the Davinci model (which incurred high costs).

To address these concerns, Ryan and I independently rewrote the prompt as a chat prompt, using the ‘gpt-3.5-turbo’ model for bulk processing. We followed OpenAI’s prompt best practices and drew insights from the course “ChatGPT Prompt Engineering for Developers.” Through iterations of ideation, implementation, analysis, and refinement, we arrived at an improved version.

Without further ado, here is the revised prompt:

1. Arts and Humanities: Literature and arts.
2. Social Sciences: psychology, sociology, and political science.
3. Business and Management: marketing, finance, economics, MBA.
4. Life Sciences: Biology, environmental, agriculture-related.
5. Engineering and Technology: Electrical, mechanical, etc.
6. Medicine Health and Allied Sciences: Nursing, medical school.
7. All Other Programs: Everything else.

The above information pertains to categories of undergraduate programs.
Categorize the below data into one of the program categories above.

Desired Format: JSON format of an array with response number and program number.

Example Output: [[1,7],[2,1],[3,5],[4,8],[5,7]]

Data:
###
1. Drama
2. IMB
3. Health sciences
4. MBA
5. Mechanical Engineering
###
The output from the model:

[
  [1, 1],
  [2, 3],
  [3, 6],
  [4, 3],
  [5, 5]
]

By utilizing this prompt, we obtain a mapping between the ordinal numbers of survey responses and the predefined program categories. For example, the first entry, [1, 1], indicates that response number 1 corresponds to program number 1. Survey response 1 is “Drama,” which aligns with the program category “Arts and Humanities.” This mapping allows for meaningful analysis, aggregation, and actionable insights.
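As a quick illustration (a throwaway sketch, not the article’s actual code, which appears later), the mapping array can be translated back into readable category names like so:

// Illustrative sketch: translate the model's [responseNumber, programNumber]
// pairs back into category names for the sample data above.
const categories = [
    "Arts and Humanities", "Social Sciences", "Business and Management",
    "Life Sciences", "Engineering and Technology",
    "Medicine Health and Allied Sciences", "All Other Programs"
];
const responses = ["Drama", "IMB", "Health sciences", "MBA", "Mechanical Engineering"];
const mapping = [[1, 1], [2, 3], [3, 6], [4, 3], [5, 5]];

for (const [responseNumber, programNumber] of mapping) {
    console.log(`${responses[responseNumber - 1]} -> ${categories[programNumber - 1]}`);
}
// Drama -> Arts and Humanities, IMB -> Business and Management, and so on.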

Performing this mapping manually for over 10,000 responses is impractical. Instead, we turn to the power of automation through the API…

API To The Rescue:

Utilizing the Application Programming Interface (API), our program can seamlessly interact with third-party services, unlocking the true potential of Language Models. While impressive feats have been accomplished with ChatGPT, integrating natural language capabilities into applications through the API remains the pinnacle of leveraging these models. This integration, like the advanced technology behind the device you’re currently using to read this article, can be imperceptible to users.

To access the API, you can request access here (https://openai.com/blog/openai-api) [4]. Once you’ve signed up and obtained your API key, you can find the API specification here (https://platform.openai.com/docs/api-reference/chat). Valuable examples with code samples can be found here (https://platform.openai.com/examples). Additionally, the playground feature allows you to test prompts with different settings before implementing them [5].

We will be using the chat completion API via REST. Here is a sample payload for the API call:


{
  "model": "gpt-3.5-turbo",
  "temperature": 0,
  "n": 1,
  "messages": [
     {"role": "user", "content": "Hello, I'm a nail."}
  ]
}
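As a concrete illustration (not part of the original walkthrough), this payload can be POSTed directly to the chat completions endpoint. The sketch below assumes Node 18+ for the built-in fetch and an OPENAI_API_KEY environment variable; the OpenAI Node library used later in this article wraps the same REST call:

// Minimal sketch: sending the sample payload to the chat completions endpoint.
// Assumes Node 18+ (built-in fetch), an OPENAI_API_KEY environment variable,
// and that this code runs inside an async function or an ES module.
const payload = {
    model: "gpt-3.5-turbo",
    temperature: 0,
    n: 1,
    messages: [{ role: "user", content: "Hello, I'm a nail." }]
};

const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`
    },
    body: JSON.stringify(payload)
});
const result = await response.json();
console.log(result.choices[0].message.content);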

Let’s briefly examine the parameters and their effects:

  • Model: We used gpt-3.5-turbo for chat completions. While the gpt-4 model offers higher accuracy and fewer hallucinations [2], it is significantly more expensive. For our purposes, gpt-3.5-turbo, also known as Mr. Turbo, proved to be quite sufficient.
  • Temperature: Alongside the prompt, the temperature is a crucial setting we can specify for the model. It ranges from 0 to 2 and controls the randomness of the output. Think of it as the level of caffeine in your system before writing. Guides to suitable temperature values for different applications can be found in the references [6][7]. In our case, we desired no variation, and a value of 0 ensured consistent mappings.
  • n: This parameter determines the number of chat completion choices to generate. For creative writing or when multiple options are desired, values like 2 or 3 can be used. However, the default value of 1 (n=1) sufficed for our use case.
  • Messages: Each message in the messages array has a role: system, user, or assistant. These roles help structure conversations and enable effective interaction between users and the AI assistant. The system role provides instructions and sets the context, the user role carries the prompt from the end user, and the assistant role holds the model’s responses based on the conversation history.
  • Model Max Tokens: In addition to the optional max_tokens request parameter, which caps the length of the generated response, each model has a maximum token window that the prompt and response together must fit within. Tokens can be considered word fragments, with each English token roughly equaling 4 characters. For example, the quote “The best way to predict the future is to create it,” attributed to Abraham Lincoln and others, contains 11 tokens.

A token is not exactly one word, however, and the mapping from text to tokens isn’t always straightforward. Emojis, in particular, can consume multiple tokens (up to 6 tokens each). Carefully considering token usage is therefore essential to avoid exceeding the model’s maximum limit.

The model max token window is a technical limit. Your prompt (including any additional data you place into it) and the answer must all fit within the model max limit listed here (https://platform.openai.com/docs/models/model-endpoint-compatibility). In the case of chat completions, the content, role, and all the previous messages consume tokens. If you remove a message from the input or the output (assistant messages), the model will lose all knowledge of it [8].

For gpt-3.5-turbo, the model’s maximum token limit is 4096 tokens or approximately 16,000 characters. In our use case, the prompt is around 2000 characters, each survey response averages about 20 characters, and the mapping response is 7 characters. To stay within the limit, we calculated that if we put N survey responses in each prompt, the total characters should be:

2000 + 20*N + 7*N < 16,000.

Solving this inequality yields an N value less than 518, or approximately 500. While we technically could include 500 survey responses in each request and iterate through our data 20 times, we chose to include 50 responses per request and repeat the process 200 times. This decision was made due to intermittent abnormal responses encountered when including more than 50 survey responses in a single request. Occasionally, the service exhibited temperamental behavior, as if plagued by chronic systemic petulance or a stroke of unfortunate luck.
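As a rough sketch of that back-of-the-envelope calculation (the character estimates are the same ones quoted above, and the 4-characters-per-token rule of thumb is only an approximation):

// Back-of-the-envelope batch sizing using the rough 4-characters-per-token heuristic.
const maxChars = 16000;          // ~4096 tokens * ~4 characters per token
const promptChars = 2000;        // fixed prompt text (categories + instructions)
const charsPerResponse = 20;     // average survey response
const charsPerMapping = 7;       // e.g. "[12,3],"

// promptChars + (charsPerResponse + charsPerMapping) * N < maxChars
const maxN = Math.floor((maxChars - promptChars) / (charsPerResponse + charsPerMapping));
console.log(maxN);               // 518 -- though 50 per batch proved far more reliable in practice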

With an understanding of the API and its parameters, let’s delve into the exciting part—the code.

The Way Of The Code:

Node.js, a JavaScript runtime environment [9], will serve as the foundation for our program. We will be writing a Node.js/JavaScript program that follows a simple flow: read the survey CSV, batch the unique responses, send each batch to the API for classification, and write out the normalized results.

Although my JavaScript skills aren’t particularly strong, I’m more proficient in languages such as Java, PHP, Julia, Go, C#, or even Python. However, at Esan’s insistence, we’ll proceed with JavaScript and Node.js.

The complete code, prompt, and sample input are available on this GitHub link (https://github.com/aaxis-nram/data-cleanser-llm-node). Let’s now focus on the most significant portions:

To begin, let’s explore how we can read the CSV file using the “csv-parser” Node library:

// Required modules: Node's built-in fs and the csv-parser package
const fs = require('fs');
const csv = require('csv-parser');

const rows = [];
fs.createReadStream(fileName)
  .pipe(csv())
  .on('data', (data) => rows.push(data))
  .on('end', async () => {
     // Reading is done. Call classifier here
     // …
});
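The classification loop that follows iterates over a deduplicated list of responses (uniqueData and totalDataPoints). That preparation step is elided from the excerpts here; a plausible sketch, assuming srcCol holds the name of the free-form “major” column, would be:

// Hypothetical sketch (the actual step is elided in the excerpts):
// build the deduplicated list of responses the classification loop works on.
const uniqueData = [...new Set(rows.map((row) => row[srcCol].trim()))];
const totalDataPoints = uniqueData.length;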

Next, we invoke the classifier to generate the mappings:

for (index = 0; index < totalDataPoints; ++index) {
    dataText += (index + 1).toString() + ". " + uniqueData[index] + "\n";
    requestCount++;
    if (requestCount > batchSize || index == totalDataPoints - 1) {
        dataText += "###\n";
        // We have loaded batchSize (50) responses.
        // Now construct the prompt
        // …
    }
}

The prompt is then constructed by combining the categories, main prompt text, and data from the CSV. We subsequently send the prompt to the service using the OpenAI Node Library:

let prompt = categoriesText + mainPrompt + dataText;
let payload = {
    model: "gpt-3.5-turbo",
    temperature: 0,
    messages: [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
};

try {
   const response = await openai.createChatCompletion(payload);
   let mapping = JSON.parse(response.data.choices[0].message.content);
   // Save the mappings here
   // …
} catch (error) {
   console.error('API Error:', error);
}
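The “Save the mappings here” step is elided above. Since the final translation loop below looks values up in a dataMapping object keyed by the original response text, one plausible way to fill it in (an illustrative assumption, not the repository’s exact code; batchData and categoryNames are hypothetical names) is:

// Hypothetical sketch of the elided "save the mappings" step: turn the
// [responseNumber, programNumber] pairs returned by the model into a lookup
// from original response text to normalized category name.
// Assumes dataMapping = {} is initialized elsewhere, batchData holds the
// responses sent in this batch, and categoryNames lists the 7 categories.
for (const [responseNumber, programNumber] of mapping) {
    const originalText = batchData[responseNumber - 1];
    dataMapping[originalText] = categoryNames[programNumber - 1];  // e.g. "Arts and Humanities"
}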

Finally, once all iterations are complete, we can translate the srcCol text (survey response) into the targetCol (normalized program name) and write the CSV output:

for (let row of rows) {
    srcVal = row[srcCol].trim();
    if (!row[targetCol] && dataMapping[srcVal])
        row[targetCol] = dataMapping[srcVal];
}

stringify(rows, {
    header: true
}, function (err, output) {
   fs.writeFile(__dirname + '/' + destFileName, output,
       function (err, result) {
          if (err) console.log('error', err);
       });
});

Contrary to my expectations, this particular JavaScript code turned out to be less complex and was completed within 2 to 3 hours. It seems that challenges often appear more daunting until we immerse ourselves in them.

With the code ready, it’s time for the final execution…

The Execution:

Now, the question arose of where to run the code. After considering whether to obtain a cloud instance for the workload, I quickly did some calculations and realized I could execute it on my laptop in less than an hour. Surprisingly, it turned out to be quite manageable.

We began with a test round and noticed that, about 1 out of 10 times, the service would return the original data instead of the expected mappings. Consequently, we received a list of survey responses without any corresponding mappings. To handle this, instead of implementing code to detect and retry, I opted to rerun the script but process only the records with empty target columns.

The script would initialize all rows with empty target columns and gradually populate them with normalized program names. Due to the occasional response error, some rows would remain unmapped, leaving the target column empty. During the second run, the script would focus on constructing the prompt for those responses that were not processed in the initial run. We ran the program a few times, and eventually, everything was successfully mapped out.
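In practice, the rerun logic can be as simple as rebuilding the batch list from only the rows whose target column is still empty; a minimal sketch, reusing the column names from the code above, might look like:

// Hypothetical sketch: on a rerun, collect only the responses the service
// failed to map last time (target column still empty) and classify just those.
const unmapped = [...new Set(
    rows.filter((row) => !row[targetCol]).map((row) => row[srcCol].trim())
)];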

The multiple runs took approximately 30 minutes and required minimal supervision. Reviewing the mappings generated by the model, most appear correct, although there is some uncertainty about whether Organizational Behavior falls under Social Sciences or Business and Management. Either categorization would likely suffice.

Each request, consisting of about 50 records, used a total of approximately 800 tokens. The main run cost about 40 cents, with around 10 cents more spent on testing, reruns, and other related activities, bringing the total cost to approximately 50 cents, alongside 2½ hours of coding/testing time and half an hour of runtime. Notably:

Total Cost: Less than $1

Total Time: Approximately 3 hours

One might argue that manual conversion using Excel, sorting, regex, and copy-pasting could have achieved the same results within a similar timeframe, with some cost savings. However, this approach was far more enjoyable, imparted valuable knowledge, established a repeatable script/process, and even yielded an article. Besides, I think that StudyFetch can easily afford the 50 cents.

This project was a prime example of efficient and cost-effective utilization, but what other possibilities exist for Language Models?

Looking For More Nails:

Adding language capability to your applications can offer a range of use cases beyond the one demonstrated earlier. Let’s explore additional use cases specifically related to the review data we examined:

  • Data Parsing and Standardization: Leveraging Language Models (LLMs), you can parse and standardize data by extracting relevant information from unstructured or semi-structured sources, such as the review data we analyzed.
  • Data Deduplication: LLMs can help identify duplicate records by comparing various data points. For instance, comparing names, majors, and universities in review data can flag potential duplicates.
  • Data Summarization: LLMs can summarize distinct records to gain insights. For example, when analyzing responses to the question “What is the biggest challenge you face while studying?” from the same major and university, a large language model can identify patterns. By consolidating the summarizations into a single request, an overall list can be obtained. However, segment-specific summarization is often more beneficial.
  • Sentiment Analysis: LLMs can analyze reviews to determine sentiment and extract valuable insights. By categorizing sentiment on a scale from 0 (very negative) to 5 (very positive) for a question like “Would you pay for a service to help you study?”, LLMs can assess student interest in a paid service across different segments (a minimal prompt sketch follows this list).
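As an illustration of the last item, a sentiment-scoring request can reuse exactly the same chat-completion pattern as the classification task. The prompt wording and sample responses below are hypothetical, untested placeholders:

// Hypothetical sketch: a sentiment-scoring prompt reusing the same payload
// structure as the classification task (the prompt text and sample data are
// illustrative placeholders, not from the actual survey).
const sentimentPrompt =
    "Rate the sentiment of each response below on a scale from 0 (very negative) " +
    "to 5 (very positive) for the question 'Would you pay for a service to help you study?'. " +
    "Desired format: JSON array of [response number, score] pairs.\n" +
    "###\n1. Absolutely, if it saves me time.\n2. No, everything should be free.\n###";

const sentimentPayload = {
    model: "gpt-3.5-turbo",
    temperature: 0,
    messages: [{ role: "user", content: sentimentPrompt }]
};
// Sent with openai.createChatCompletion(sentimentPayload), exactly as before.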

While student reviews serve as an excellent microcosm, the applications of this technology extend far beyond. At AAXIS, we implement digital commerce solutions for business-to-business and business-to-consumer scenarios. This involves migrating substantial amounts of data from legacy systems with varying data structures. To ensure data consistency, we employ various data tools for source data analysis. The techniques discussed in this article can prove invaluable in such cases.

Some other digital commerce use cases include error checking in product catalogs, generating product copy, scanning review responses, and summarizing product reviews. These applications are often simpler to implement than navigating the complex realm of undergraduate creativity when trying to extract their majors.

However, it’s important to acknowledge that while LLMs can be powerful tools for data cleansing, they should be used alongside other techniques and human oversight. Data cleansing processes typically require domain expertise, contextual understanding, and manual review to make informed decisions and maintain data integrity. Additionally, it’s crucial to note that LLMs are not inference engines but rather next-word predictors. They may confidently and convincingly provide incorrect information (AI hallucinations). Fortunately, in our testing, we did not encounter any hallucinations since our use case primarily involved classification.

LLMs can be a valuable addition to your toolkit if approached with caution and a keen awareness of the potential pitfalls.

The Final Nail:

In this article, we explored a specific data cleansing use case: normalizing survey responses to a predefined set of values. This process allowed us to group responses and extract valuable insights. To accomplish this, we harnessed the power of a Large Language Model (LLM), specifically OpenAI’s GPT-3.5 Turbo. We discussed the prompt used, the API calls required for integration, and the code for automation. By bringing it all together, we successfully completed the task at hand, incurring a minimal OpenAI utility cost of less than one dollar.

Did we have a proverbial LLM hammer and find the perfectly shiny nail in free-form survey responses? Maybe. More likely, we had a Swiss army knife and used it to skin and eat some fish. Not quite purpose-built, but still very adequate. 

What is your use case? We’d love to hear from you!

SOURCES


1. Temu Raitaluoto, “The importance of personalized marketing in the digital age,” MarketTailor Blog, May 2023, https://www.markettailor.io/blog/importance-of-personalized-marketing-in-digital-age

2. Ankur A. Patel, Bryant Linton and Dina Sostarec, GPT-4, GPT-3, and GPT-3.5 Turbo: A Review Of OpenAI’s Large Language Models, Apr 2023, Ankur’s Newsletter, https://www.ankursnewsletter.com/p/gpt-4-gpt-3-and-gpt-35-turbo-a-review

3. Alexandra Mendes, Ultimate ChatGPT prompt engineering guide for general users and developers, Jun 2023, Imaginary Cloud Blog, https://www.imaginarycloud.com/blog/chatgpt-prompt-engineering/

4. Sebastian, How to Use OpenAI’s ChatGPT API in Node.js, Mar 2023, Medium — Coding the Smart Way, https://medium.com/codingthesmartway-com-blog/how-to-use-openais-chatgpt-api-in-node-js-3f01c1f8d473

5. Tristan Wolff, Liberate Your Prompts From ChatGPT Restrictions With The OpenAI API Playground, Feb 2023, Medium — Tales of Tomorrow, https://medium.com/tales-of-tomorrow/liberate-your-prompts-from-chatgpt-restrictions-with-the-openai-api-playground-a0ac92644c6f

6. AlgoWriting, A simple guide to setting the GPT-3 temperature, Nov 2020, Medium, https://algowriting.medium.com/gpt-3-temperature-setting-101-41200ff0d0be

7. Kane Hooper, Mastering the GPT-3 Temperature Parameter with Ruby, Jan 2023, Plain English, https://plainenglish.io/blog/mastering-the-gpt-3-temperature-parameter-with-ruby

8. OpenAI Authors, GPT Guide — Managing tokens, 2023, OpenAI Documentation, https://platform.openai.com/docs/guides/gpt/managing-tokens

9. Priyesh Patel, What exactly is Node.js?, Apr 2018, Medium — Free Code Camp, https://medium.com/free-code-camp/what-exactly-is-node-js-ae36e97449f5

10. Ben Dickson, Large language models have a reasoning problem, June 2022, Tech Talks Blog, https://bdtechtalks.com/2022/06/27/large-language-models-logical-reasoning/

11. Frank Neugebauer, Understanding LLM Hallucinations, May 2023, Towards Data Science, https://towardsdatascience.com/llm-hallucinations-ec831dcd7786
