Author(s): Okoh Anita

Originally published on Towards AI.

Image by Author

Introduction

I have spent a considerable number of years in the identity resolution field, identifying duplicate customer accounts and associating them into groups. In my experience, there are two types of B2C new customers:

1. A true new customer: a customer with no existing account in the company's database.
2. A false new customer: a customer with one or more existing accounts in the company's database.

One common issue most large B2C companies face is customer account duplication, i.e., customers opening multiple accounts with the same company. Depending on the company's new-customer incentives, a customer with multiple accounts could potentially claim an incentive more than once. Left untracked, this can lead to significant monetary losses over time as the number of false new customers grows.

Lately, I have been thinking about how LLMs could help identify and associate customers in real time, i.e., deciding whether a customer is eligible for new-customer incentives as soon as they register. This led to a solution that can be summarized in two steps:

1. Use an LLM to find semantic similarity between the new registration and the customer details in the database as soon as a customer registers. However, semantic search alone would not suffice: it can output false-positive similarities, and these may hurt the business's reputation, especially when an actual new customer is penalized unfairly.
2. Create a simple re-ranking logic layer as a postprocessing task, a second layer of validation that helps narrow the results down to true positives.

As always, my thought process typically ends with me finding tools that help build simple MVP demos fast. This time was no different.
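At its core, the two-step decision can be sketched in a few lines. This is a minimal outline, with hypothetical `find_similar` and `rerank` helpers standing in for the vector search and validation layers described below:

```python
def is_eligible_for_trial(new_customer, find_similar, rerank):
    # Step 1: semantic search over existing customer records
    # (find_similar is a hypothetical helper returning the n nearest records)
    candidates = find_similar(new_customer, n=5)
    # Step 2: re-ranking / second validation to filter out false positives
    # (rerank is a hypothetical helper keeping only genuinely similar records)
    confirmed = rerank(new_customer, candidates)
    # Eligible for the incentive only if no existing account survives validation
    return len(confirmed) == 0
```

The rest of the article fills in concrete implementations for these two steps.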
A framework I have recently been playing with is SuperDuperDB.

SuperDuperDB is an open-source framework that aims to eliminate complex MLOps pipelines, specialized vector databases, and the need to migrate and duplicate data by integrating AI at the data's source, directly on top of your existing data infrastructure.

More info in the framework documentation here.

I used SuperDuperDB to extend the vector search capability of MongoDB, my backend database, and then used the RecordLinkage library to re-rank the results of the vector search as a postprocessing step.

The Python Record Linkage Toolkit is a library for linking records within or between data sources. The toolkit provides most of the tools needed for record linkage and deduplication, including indexing methods, functions to compare records, and classifiers.

More info in the toolkit documentation here.

Code Demo

Creating scenarios

Let's say we have a course website called "UpSite" that gives a 10-day free trial to new accounts. The goal is to decide whether a customer is eligible for the new-customer incentive.

First, let's understand the architecture.

Image by Author

The image above gives a high-level view of the simple customer de-duplication architecture. Here is the flow:

1. A new customer enters a name, phone number, address, and email to register.
2. These details are first combined into a single string to serve as a search string.
3. The search string is converted into vectors with an embedding model, which is used to find the five nearest similar customer records in the database. Since no cut-off score is applied to the similarity search, we always get the five closest results, whether or not they are actually similar records.
4. To reduce these potential false positives, a second validation compares each field and scores the similarity between the new customer's details and the similar customer details returned by the vector search.
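The search-string step above can be sketched as follows. The field names match the sample dataset shown later in the article, and the lowercasing mirrors its precomputed `details` field:

```python
def build_search_string(customer: dict) -> str:
    # Concatenate the registration fields into a single lowercase
    # string, ready to be embedded for the vector search
    fields = ["Full Name", "Email", "Address", "Phone Number"]
    return " ".join(str(customer[f]) for f in fields).lower()
```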
This involves using basic string similarity algorithms, specifically the Jaro-Winkler method with a similarity threshold of 0.85, for the names, emails, and addresses. For the phone number, an exact-match score is checked instead. If a returned customer record has a similarity score sum greater than 0, it is added to a response DataFrame.

Image by Author

If, at the end of the validation, the length of the response DataFrame is greater than zero, three things are returned as responses:

1. The customer's details
2. The similar customers' details
3. An eligibility rejection message: "Sorry, you are not eligible for the new customer 10-day trial."

Image by Author

However, if the DataFrame length is zero, only the eligibility acceptance message is returned: "Thank you for registering. Verify your email in your inbox and start enjoying your new customer 10-day trial."

Image by Author

To bring all this logic to life, I wrapped it up in a Streamlit app.

GIF by Author

Code

Now that you understand the flow, let's translate it into code in five steps.

First, let's convert our MongoDB instance into a SuperDuperDB object and insert our data:

```python
import json

from superduperdb import superduper
from superduperdb import Document
from superduperdb.backends.mongodb import Collection

with open('data.json') as f:
    data = json.load(f)

mongodb_uri = "mongomock://test"
db = superduper(mongodb_uri, artifact_store="filesystem://./data/")

collection = Collection('customer_details')
db.execute(collection.insert_many([Document(r) for r in data]))
```

All the code above does is:

1. Convert the MongoDB instance into a SuperDuperDB object and define where the artifacts are stored.
2. Instantiate a Collection object and insert the customer details from the JSON file into it.

Note: Adding the artifacts path is optional. It can be a local file path or an external storage path.
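The second-validation scoring described above can be sketched in plain Python. This is an illustrative stand-in, not the app's actual code: it uses the stdlib's `difflib.SequenceMatcher` in place of RecordLinkage's Jaro-Winkler comparison, while keeping the same 0.85 threshold, the exact phone-number match, and the "score sum greater than 0" rule:

```python
from difflib import SequenceMatcher

THRESHOLD = 0.85  # same cut-off the article uses for Jaro-Winkler

def field_score(a: str, b: str) -> int:
    # 1 if the two strings clear the similarity threshold, else 0.
    # SequenceMatcher is a stdlib stand-in for Jaro-Winkler here.
    return int(SequenceMatcher(None, a.lower(), b.lower()).ratio() >= THRESHOLD)

def rerank(new_customer: dict, candidates: list) -> list:
    matches = []
    for cand in candidates:
        # Fuzzy-compare name, email, and address
        score = sum(field_score(new_customer[f], cand[f])
                    for f in ("Full Name", "Email", "Address"))
        # Phone numbers are compared with an exact match instead
        score += int(new_customer["Phone Number"] == cand["Phone Number"])
        if score > 0:  # any matching field flags a potential duplicate
            matches.append(cand)
    return matches
```

In the actual demo, this scoring is done with the RecordLinkage toolkit, which ships a Jaro-Winkler string comparison out of the box.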
If you already have a MongoDB database with datasets, all you need to do is convert it into a SuperDuperDB object using your Mongo URI and instantiate the Collection.

Feel free to check out your dataset afterward by running the code below:

```python
result = db.execute(Collection('customer_details').find_one())
print(result)
```

This is what the first row of the data looks like:

```python
Document({'Full Name': 'Denny Täsche',
          'Email': 'denny.täsche@gmail.com',
          'Address': 'Corina-Stumpf-Ring 36 02586 Gransee',
          'Phone Number': '03772092016',
          'details': 'denny täsche denny.täsche@gmail.com corina-stumpf-ring 36 02586 gransee 03772092016',
          '_fold': 'train',
          '_id': ObjectId('6565c8c620d98740773c2874')})
```

SuperDuperDB supports other popular databases such as Postgres and DuckDB, which can be converted into a SuperDuperDB object the same way.

More info about how I generated the data can be found below.

Creating Dataset Synthetically: Thought-Process Using domain knowledge to […]