We all have lists...and they can be annoying to de-duplicate.
* User feedback
* Groceries
* Employee surveys
* Bug reports
* You name it
Most ways to consolidate similar items rely on keywords or, worse, exact phrases (Sheets/Excel).
But LLMs are much better at understanding an item's semantic meaning and determining whether two items should be combined.
I decided to build my first Python package, The Semantic Deduplicator, to help me consolidate items based on their meaning, not keywords.
For example, on groceries:
['We need more berries', 'I want more more milk', 'Can we get more carbonated water please?', 'We need more sparkling water']
...deduplicated...
['Berries', 'Milk', 'Sparkling Water']
How it works:
1. Start with an empty list, ready to populate
2. The first item you add gets 1) transformed into a clean name (e.g. user feedback > product request) and 2) added to the list
3. For each additional item you add:
* Check to see if the new item's embedding is close to any existing item's embedding
* If it is, ask the LLM to compare the two items and decide whether they should be combined
* If the LLM says yes, combine them
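The steps above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code or API: `deduplicate`, `toy_clean_name`, `toy_embed`, `toy_should_merge`, the filler-word and synonym tables, and the `0.3` threshold are all hypothetical stand-ins. A real version would call an embedding model and an LLM where the toy functions are used, and "combining" here simply means keeping the entry that was already in the list.

```python
import math
import string
from collections import Counter

# Hypothetical stand-ins so the sketch runs without any API calls.
FILLER = {"we", "i", "need", "want", "more", "can", "get", "please"}
SYNONYMS = {"carbonated": "sparkling"}


def toy_clean_name(text):
    """Stand-in for the LLM 'clean name' step: drop punctuation and filler words."""
    words = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
    return " ".join(w for w in words if w not in FILLER)


def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.split())


def toy_should_merge(a, b):
    """Stand-in for the LLM comparison: same words after synonym mapping."""
    def normalize(s):
        return {SYNONYMS.get(w, w) for w in s.split()}
    return normalize(a) == normalize(b)


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def deduplicate(items, embed, should_merge, clean_name, threshold=0.3):
    kept = []  # (clean name, embedding) pairs; starts empty
    for item in items:
        name = clean_name(item)
        vec = embed(name)
        merged = False
        for existing_name, existing_vec in kept:
            # Cheap check first: is the embedding close to an existing item?
            if cosine(vec, existing_vec) >= threshold:
                # Only then ask the (stand-in) LLM whether to combine the two.
                if should_merge(existing_name, name):
                    merged = True  # combined: the existing entry is kept
                    break
        if not merged:
            kept.append((name, vec))
    return [name for name, _ in kept]


items = [
    "We need more berries",
    "I want more more milk",
    "Can we get more carbonated water please?",
    "We need more sparkling water",
]
result = deduplicate(items, toy_embed, toy_should_merge, toy_clean_name)
print(result)
```

The point of the two-stage check is cost: the embedding comparison is cheap and filters out obviously unrelated pairs, so the more expensive LLM comparison only runs on plausible matches.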
This package is more of an exploration and proof of concept (POC), so be careful with it. I'd love to hear any feedback.
All the links:
* YT Explainer Video: https://www.youtube.com/watch?v=etLsNgkGbeM
* Twitter Thread: https://twitter.com/GregKamradt/status/1719760658936545336
* Pypi: https://pypi.org/project/semantic-deduplicator/
* Github: https://github.com/gkamradt/SemanticDeduplicator
As a capstone project for Galvanize's data science immersive, I took another look at the NYC Taxi data set. A ton of analysis has been done on individual rides/cars, and I was curious about what story the data would tell when viewed in aggregate.
Through the clustered map you can identify different 'personalities' of the city from a bird's-eye view. Check it out here: http://ryd.io/cluster_map
I've spent the past couple of weeks working hard on this project and would love to talk to anyone who is interested in it.
Once the program wraps up, I'm excited to join a new data team and work on awesome problems.
Feel free to contact me with any questions.
Tech:
* Backend - Python, Flask, Jinja
* Front end - Bootstrap, Leaflet, AJAX
* Graphic - Originally in matplotlib/CartoDB and styled in Photoshop
* Data analysis - Python + stats packages
gkamradt {at} gmail