Topic Modeling on Customer Reviews using BERTopic and Llama2

Author(s): Boris Dorian Da Silva

Originally published on Towards AI.

A Quick Guide to Creating Interpretable Topics from Customer Reviews with BERTopic and Llama2 using Ollama.

Image by playground.com

Introduction

No matter the industry, most companies use customer reviews to gather crucial insights about their products and services. Topic modeling is a technique that facilitates the discovery of the main themes and topics within a large collection of text documents. This method aids in understanding customer sentiment, preferences, and challenges.

Customer reviews typically consist of two main components: an overall score for the product/service and a descriptive comment. The scores may come in various formats, such as "1 to 10", "1 to 5", or "Negative/Neutral/Positive". For ease of analysis, it is advisable to standardize them into three categories: Negative, Neutral, and Positive. For instance, for a "1 to 5" score format, the standardization could be defined as follows:

- Negative: 1 or 2
- Neutral: 3
- Positive: 4 or 5

Once the review scores are standardized, the goal is to address two main questions:

- What are people talking about in the negative reviews?
- What are people talking about in the positive reviews?

In this article, I present a guide for building a quick and straightforward topic model using the powerful BERTopic library to extract topics from documents, and leveraging the Large Language Model Llama2 to improve the topic representations. This guide serves as an effective strategy for gaining initial insights from the reviews.

BERTopic

According to the official documentation, BERTopic "is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions". BERTopic employs a sequence of five steps to generate topic representations:

Image from BERTopic's documentation (source)

For further details, I recommend referring to the official website and the original paper by Maarten Grootendorst.

Dataset

In this article, I use a publicly available dataset from Kaggle comprising over 33,000 anonymized reviews of McDonald's stores in the United States, sourced from Google reviews. You can access the dataset at: https://www.kaggle.com/datasets/nelgiriyewithana/mcdonalds-store-reviews/data

Below, you can see the contents of the dataset:

```python
import pandas as pd

df = pd.read_csv('McDonald_s_Reviews.csv', encoding='latin1')
df.head()
```

We will divide the dataset into two dataframes, one for the negative reviews and one for the positive ones, based on the standardization described above.

```python
df_neg = df[df['rating'].isin(['1 star', '2 stars'])]
df_pos = df[df['rating'].isin(['4 stars', '5 stars'])]

print('Quantity of negative reviews: ', len(df_neg))
print('Quantity of positive reviews: ', len(df_pos))
```
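The split above hard-codes which star ratings count as negative and positive. As a small illustration of the standardization described in the introduction, the same mapping can be made explicit as a column, including the Neutral category. This helper is a minimal sketch rather than code from the original guide, and it assumes the dataset's `rating` values follow the pattern "1 star" through "5 stars".

```python
# Illustrative helper (not from the original article): make the 1-to-5 score
# standardization explicit as a 'sentiment' column, including the Neutral category.
# Assumes the 'rating' column uses values like '1 star', '2 stars', ..., '5 stars'.
import pandas as pd

RATING_TO_SENTIMENT = {
    '1 star': 'Negative',
    '2 stars': 'Negative',
    '3 stars': 'Neutral',
    '4 stars': 'Positive',
    '5 stars': 'Positive',
}

df = pd.read_csv('McDonald_s_Reviews.csv', encoding='latin1')
df['sentiment'] = df['rating'].map(RATING_TO_SENTIMENT)

df_neg = df[df['sentiment'] == 'Negative']
df_pos = df[df['sentiment'] == 'Positive']
print(df['sentiment'].value_counts())
```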
Topic Model Training

Once we have the reviews from the dataset, we will implement a simple configuration for BERTopic to train two topic models: one for negative reviews and another for positive reviews. The particularity of this configuration lies in the use of KeyBERTInspired to perform an initial stage of representation fine-tuning. The outcome of this process is a list of keywords that represent each topic. We will apply the same configuration to both the positive and negative reviews.

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer

representation_model = KeyBERTInspired()
#vectorizer_model = CountVectorizer(min_df=5, stop_words = 'english')

topic_model_neg = BERTopic(#nr_topics = 'auto',
                           #vectorizer_model = vectorizer_model,
                           representation_model = representation_model)
topic_model_pos = BERTopic(#nr_topics = 'auto',
                           #vectorizer_model = vectorizer_model,
                           representation_model = representation_model)

print('Training topic model for negative reviews...')
topics_neg, ini_probs_neg = topic_model_neg.fit_transform(list(df_neg.review.values))

print('Training topic model for positive reviews...')
topics_pos, ini_probs_pos = topic_model_pos.fit_transform(list(df_pos.review.values))

df_neg['topic'] = topics_neg
df_neg['topic_prob'] = ini_probs_neg
df_pos['topic'] = topics_pos
df_pos['topic_prob'] = ini_probs_pos

topics_info_neg = topic_model_neg.get_topic_info()
topics_info_pos = topic_model_pos.get_topic_info()
```

After training the topic models, we obtained 195 topics for negative reviews and 317 topics for positive reviews. To view the topics, you can execute the following lines of code:

```python
# Topics information from Negative Reviews
topic_model_neg.get_topic_info()

# Topics information from Positive Reviews
topic_model_pos.get_topic_info()
```

As you can observe, the largest group is denoted by the topic "-1", which corresponds to outlier reviews; essentially, the resulting model couldn't allocate a topic to these reviews. If you wish to reduce the number of outliers, I suggest referring to the official documentation to explore new configurations for the model.

Below, you can visualize graphs displaying the word scores for the first 8 topics.

```python
# Topic Word Scores from Negative Reviews
topic_model_neg.visualize_barchart()
```

Graph by author

```python
# Topic Word Scores from Positive Reviews
topic_model_pos.visualize_barchart()
```

Graph by author

2nd Stage of Representation Fine-Tuning using an LLM

The "Representation" column of the topic information dataframe (obtained in the previous section) contains the list of keywords that represents each topic, produced by the KeyBERTInspired algorithm. While this representation can provide insight into the meaning of each topic, fully understanding and interpreting these keywords in the context of the topic may require considerable analysis. To speed up this process, we propose a second stage of representation fine-tuning using a Large Language Model (LLM), in this case Llama2. The idea behind this approach is for the LLM to generate a concise label for each topic, based on samples of reviews associated with the topic and the corresponding keywords obtained from KeyBERTInspired.

Ollama and Llama2

To run the Llama2 model, we will use Ollama, a streamlined tool that allows users to easily set up and run large language models locally. You can download the installer from the following link: https://ollama.com/download

Once installed, you will need to download the model from the terminal using the following command:

```bash
ollama pull llama2
```

Image by author

Llama2 is released by Meta Platforms, Inc. The model is trained on 2 trillion tokens and by default supports a context length of 4096 tokens. The Llama 2 Chat models are fine-tuned on over 1 million human annotations and are made for chat. By default, Ollama uses 4-bit quantization for this model.
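Before building any prompts, it is worth checking that the local model actually responds. The snippet below is a minimal smoke test, not part of the original guide; it assumes Ollama is running locally on its default port (11434) and that the `llama2` model has already been pulled.

```python
# Quick smoke test (illustrative): ask the locally running Llama2 model a trivial
# question through Ollama's generate endpoint on the default port 11434.
import requests

resp = requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama2', 'prompt': 'Reply with the single word: ready', 'stream': False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()['response'])
```

If this prints a response, the model is reachable and we can move on to constructing the prompts.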
Prompt Engineering

The next step involves building a custom prompt to send to the LLM. The Llama2-chat model uses the following template to define system and instruction prompts:

```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```

Below, you can find the breakdown of each component within the template:

- <s>: the beginning of the entire sequence.
- [INST]: the beginning of some instructions.
- <<SYS>>: the beginning of the system message.
- {{ system_prompt }}: Where the user should edit the […]
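As a rough illustration of how this second fine-tuning stage could be wired together, below is a minimal sketch rather than the guide's original code. For a given topic, it collects the KeyBERTInspired keywords and a few representative reviews from a trained BERTopic model, then asks the locally running Llama2 model for a short label via Ollama's chat endpoint (which applies the Llama2 chat template shown above on our behalf). The `label_topic` helper, the prompt wording, and the endpoint assumptions (default port 11434, model name `llama2`) are illustrative.

```python
# Illustrative sketch: generate a short label for a topic with Llama2 via Ollama.
# Assumes a trained BERTopic model (e.g. topic_model_neg from above) and a local
# Ollama server with the llama2 model pulled. The prompt wording is an assumption.
import requests

OLLAMA_CHAT_URL = 'http://localhost:11434/api/chat'  # Ollama's default local endpoint

SYSTEM_PROMPT = (
    'You are a helpful assistant that labels topics extracted from customer reviews. '
    'Reply with a short label of at most five words.'
)

def label_topic(topic_model, topic_id, n_docs=5):
    # Keywords from the KeyBERTInspired representation for this topic
    keywords = [word for word, _ in topic_model.get_topic(topic_id)]
    # A few representative reviews assigned to this topic
    docs = topic_model.get_representative_docs(topic_id)[:n_docs]

    user_message = (
        'These customer reviews belong to the same topic:\n'
        + '\n'.join(f'- {doc}' for doc in docs)
        + '\n\nThe topic is described by these keywords: '
        + ', '.join(keywords)
        + '\n\nReturn a short label for this topic.'
    )

    response = requests.post(
        OLLAMA_CHAT_URL,
        json={
            'model': 'llama2',
            'messages': [
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': user_message},
            ],
            'stream': False,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()['message']['content'].strip()

# Example: label the largest non-outlier topic of the negative reviews
print(label_topic(topic_model_neg, topic_id=0))
```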
