Show HN: Yobulk – Open-source CSV importer powered by GPT3

(github.com)

233 points · yosai · 3y ago · 109 comments

65 comments · 15 top-level

data_ders3y ago· 14 in thread

wait… if importing malformed csvs gets automated that’s like half of a data professional’s job gone in a poof of smoke /s. jk -- great use case

so often w/ pandas I’d: 1. “yeet” the csv into a dataframe 2. use dataframe methods to massage the data to a “clean” state 3. push as much of the df methods into pd.read_csv() parameter options
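
As a rough illustration of the loop described above, here is a minimal sketch in pandas. The sample data, the column names, and the `unknown` marker are made up for the example; only `pd.read_csv`, `pd.to_datetime`, and `pd.to_numeric` and their parameters come from pandas itself.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the thread.
RAW = (
    "id, signup_date, plan, amount\n"
    "1, 2023-01-05, pro, 12.50\n"
    "2, 2023-02-11, unknown, 8.00\n"
    "3, 2023-03-02, free,\n"
)

# Steps 1 and 2: load everything as text, then clean with DataFrame methods.
df = pd.read_csv(io.StringIO(RAW), dtype=str)
df.columns = df.columns.str.strip()                 # header names carry leading spaces
df = df.apply(lambda col: col.str.strip())          # so do the values
df["plan"] = df["plan"].replace("unknown", pd.NA)   # project-specific missing marker
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["amount"] = pd.to_numeric(df["amount"])

# Step 3: the same cleanup expressed as read_csv parameters.
df_direct = pd.read_csv(
    io.StringIO(RAW),
    skipinitialspace=True,        # drop the space after each delimiter
    na_values=["unknown"],        # treat the marker as missing
    parse_dates=["signup_date"],  # parse dates while reading
)
```

Once the options in step 3 are known, the intermediate cleanup code can usually be deleted, which is the part of the loop that takes the most iteration.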

it’d be great to iterate more quickly on the above loop. better yet: what if it could auto-generate a letter to send to the folks from whom you got this data on how they could better output to csv to make ingestion simpler and easier for downstream users…. but maybe that letter would just be "don't use CSV!"

related to flat data formats, it obviously makes sense to start with CSV, but what about the future? If this tool became ubiquitous, how might a SWE or data professional's job change? What opportunities would be created? As in:

1. CSV is ubiquitous but has no singularly well-adopted standard. 2. software and data engineers struggle with CSVs as a result of #1. 3. tool is created to reduce pain and friction. 4. profit? a new market? a new standard?

Last, but most personally interestingly, how much do you know about the Apache Arrow ecosystem and how its mission might overlap with YoBulk's?

anothernewdude3y ago

You'd have to be insane to trust GPT that much. I wouldn't want anything hallucinated in my data.

yosaiOP3y ago

@data_ders We realized that more than 70% of business data is shared in CSV and Excel formats, and only a small percentage uses API integrations for data exchange. So CSV is here to stay for sure. On the other side, the data engine is a submodule inside YoBulk. We are trying to automate the complete CSV importing workflow, mostly solving CSV errors in a collaborative way with the data donor. YoBulk's USP is how we show errors in a human-readable way. We have written a wrapper on top of some open source data validation engines. Yes, I have used Apache Arrow. We are not competing with Apache Arrow; we are creating an alternative to flatfile.com.

chaps3y ago

Man, been down this path for a long while. It gets tough! Flattening csvs with hierarchical headers (as in, headers that apply a category to a second row of headers) is tough.

The ways csv can fail are just fucking nuts. Especially when they're half hand written, half automated, or where a failure is 20m rows in. Hard to have speed and strong checks simultaneously.

faebi3y ago

I still like the jsonl standard quite a lot. JSON is pretty much universal, yet it's better structured than csv. The 'l'ine in jsonl makes it easier to parse and independent of the remaining structure. Keys are heavily duplicated, but that's where gzip comes in.
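
A minimal sketch of that combination, using only the Python standard library. The records and function names are hypothetical; the point is that each line is a complete JSON document, so a reader can process one record at a time, and gzip absorbs most of the repeated-key overhead.

```python
import gzip
import io
import json

# Hypothetical records, used only for the round trip below.
records = [
    {"artist": "Boards of Canada", "plays": 412},
    {"artist": "Autechre", "plays": 97},
]


def write_jsonl_gz(rows, fileobj):
    """Write one JSON object per line, gzip-compressed."""
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        for row in rows:
            line = json.dumps(row, ensure_ascii=False) + "\n"
            gz.write(line.encode("utf-8"))


def read_jsonl_gz(fileobj):
    """Yield records one line at a time; each line parses independently."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        for line in gz:
            if line.strip():
                yield json.loads(line)


buf = io.BytesIO()
write_jsonl_gz(records, buf)
buf.seek(0)
round_tripped = list(read_jsonl_gz(buf))
```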

boringg3y ago

"half of a data professional's job" ... you mean like 90%

textninja3y ago

It’s useful to make a distinction between what is one’s job and where one spends his time - that is, the theory and practice of job titles. Data cleaning is 0% of a Data Scientist’s job despite occupying 90% of their time. My hope is that AI can help bridge that gap.

recursive3y ago

> CSV is ubiquitous but has no singularly well-adopted standard

RFC4180 exists regardless of adoption level. In a way, the simplicity of the spec causes the proliferation of grammars. No one thinks they can just yolo a PDF by hand in a text editor. Ok, maybe PDF is a bad example. But CSV (as specified in RFC4180) is so dead simple that people take shortcuts.
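
To make the "shortcuts" point concrete, here is a small sketch comparing a hand-rolled `",".join(...)` writer with Python's standard `csv` module, whose default dialect quotes fields and doubles embedded quotes in the way RFC 4180 describes. The sample rows are invented for the example.

```python
import csv
import io

# Hypothetical rows containing the characters that break naive writers.
rows = [
    ["name", "note"],
    ["Ada", 'said "hi", then left'],
    ["Bob", "line one\nline two"],
]

# The shortcut: join fields with commas and hope nothing contains one.
naive = "\r\n".join(",".join(row) for row in rows)

# The library: quotes fields that contain commas, quotes, or line breaks.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
proper = buf.getvalue()

naive_parsed = list(csv.reader(io.StringIO(naive, newline="")))
proper_parsed = list(csv.reader(io.StringIO(proper, newline="")))
```

Here `proper_parsed` matches `rows`, while `naive_parsed` does not: the comma and the line break inside the notes are read as extra fields and an extra record.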

yosaiOP3y ago

Yes, you are absolutely right. We need a solution beyond the standard, as 80% of businesses run on CSV.

silent_cal3y ago

pd.yeet("/data/data_1.csv")

pd.yeet(lambda x: yeet(x))

pd.yeet_to_csv("/clean_data/cleaned_data_1.csv")

refulgentis3y ago

yeet is dispose of with haste, not “move, but zoomer” or “sloppily with haste”

chaps3y ago

Wot, no. It's throw hastily with no care of what it splats into. Pretty sure it came from a video of someone throwing something (food or drink?) in a crowded highschool hallway, from within the hallway. It's chaotic, reckless energy in.. mostly.. harmless form.

sgerenser3y ago

The "revolutionary new tool" to replace CSVs was XML in the late 1990s.

nerdponx3y ago

Except not at all. XML is harder to enter by hand than CSV and not mess it up. CSV optimizes for the easiest cases and performs well on them. XML optimizes for the most complicated cases and therefore performs poorly on the easiest cases. JSON is somewhere in the middle. The main problems with CSV have to do with 1) MS Excel and 2) some kind of delusion among programmers that formatting or parsing arbitrary data is easy and you don't need a library for it, so you get hand-rolled generators and parsers that emit broken files.

Otherwise, the problem with CSV has little to do with the CSV format as such and more to do with the fact that the data is stringly-typed. XML has the same problem. JSON interestingly does not. Everything has tradeoffs.

yosaiOP3y ago

@sgerenser Yes, CSV is everywhere. YoBulk is positioning itself for the data donor/provider or customer. It is the end customer or data provider who has to bite the bullet and do the time-consuming data cleaning. The customer should know about the errors, duplicates, PII data, and inconsistencies in the data, and has to be properly guided to clean the data in the best possible manner.

anoonmoose3y ago· 10 in thread

I've been looking for an AI/GPT/deep learning tool that would help me perform some sanitation and normalization of a large data set that's quite personal to me- my last.fm data, time-stamped logs of (nearly) every song I've listened to for almost twenty years now. The data has all kinds of issues- for example, yesterday I realized that I had two sets of logs for one album. One version of the album used U+2026 (…) and one used three periods (...). There are problems like that, stuff more akin to typos, styling stuff (& vs and), or even garbage-in garbage-out stuff (YouTube Music changing the tags on the same album over time making it look like I actually listened to different albums, or not actually having all of the tags they're supposed to have).

I've got .NET code that hits the last.fm api and dumps the info to a LiteDB database, so I can export to CSV pretty easily if this tool would be useful to me, unless anyone has any better directions to point me in. Appreciate any thoughts you folks have.
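
For the specific issues listed above (U+2026 vs three periods, `&` vs `and`, stray whitespace and casing), a deterministic normalization key may already go a long way before any AI is involved. The following is a sketch under assumptions: the function names and the choice of rules are illustrative, and NFKC also folds other compatibility characters (ligatures, full-width forms), so the key is best used only for grouping candidates while the original tags are kept.

```python
import re
import unicodedata
from collections import defaultdict


def canonical_key(text: str) -> str:
    """Build a comparison key; the original string is left untouched."""
    text = unicodedata.normalize("NFKC", text)  # U+2026 becomes "..."
    text = re.sub(r"\s*&\s*", " and ", text)    # "&" and "and" compare equal
    text = " ".join(text.split())               # collapse runs of whitespace
    return text.casefold()                      # case-insensitive comparison


def group_variants(titles):
    """Return only the keys that have more than one distinct spelling."""
    groups = defaultdict(set)
    for title in titles:
        groups[canonical_key(title)].add(title)
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Running `group_variants` over the distinct album names in an export gives a review list of likely duplicates, which can then be merged by hand or by rule.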

nerdponx3y ago

In the case of Unicode at least, the Unicode consortium maintains a database of "confusable" characters and a tool to detect them: https://util.unicode.org/UnicodeJsps/confusables.jsp?a=%E2%8...

You can download the database for use in your own programs, and there is at least one Python package built around it: https://pypi.org/project/confusables/

carterschonwald3y ago

Is there a schema document explaining the format of the dataset?

Edit: Found it in the associated doc. It’s a cute approach!

Data File Format

Each line in the data file has the following format: Field 1 is the source, Field 2 is the target, and Field 3 is obsolete, always containing the letters “MA” for backwards compatibility. For example:

0441 ; 0063 ; MA # ( с → c ) CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C #

2CA5 ; 0063 ; MA # ( ⲥ → c ) COPTIC SMALL LETTER SIMA → LATIN SMALL LETTER C # →ϲ→

Everything after the # is a comment and is purely informative. An asterisk after the comment indicates that the character is not an XID character [UAX31]. The comments provide the character names.

Implementations that use the confusable data do not have to recursively apply the mappings, because the transforms are idempotent. That is,

skeleton(skeleton(X)) = skeleton(X)

IanCal3y ago

If you've got borked encodings around as well, the python package ftfy is wonderful: https://pypi.org/project/ftfy/

Undoes whatever on earth it is excel does, helps clean up bits of html/etc.

breck3y ago

It's still early, but TreeBase might be worth a look. (https://jtree.treenotation.org/treeBase/index.html)

It's the public domain software that powers PLDB.com and CancerDB.com.

You store your data in Tree Notation in plain text files, and use the Grammar Language (a Tree Language) for schemas, which also enforces correctness. You use Git for version control. You then can query the data using TQL (also a Tree Language). You can display your data using Scroll (also a Tree Language)

So your data, your query language, your schemas, your display language are all in the same simple plain-text notation: Tree Notation. Of course, there's also a lot of Javascript glue.

Very little documentation at the moment, and it's brand new, but it simply was not possible before the M1s, which came out in December 2020, and the growth rate is very good.

It's all signal, no noise, so it's a timeless solution, and you won't regret putting your data in there.

0live3y ago

I would suggest https://github.com/OpenRefine/OpenRefine to clean your data.

yosaiOP3y ago

@0live Cleaning data is only one module. YoBulk helps you automate the complete CSV import workflow. Please read our blog https://www.yobulk.dev/blog/Building%20an%20In-house%20CSV%2... to understand the CSV workflow problem. Happy to answer your queries.

anoonmoose3y ago

Love this suggestion, excited to check it out!

aarondia3y ago

It's not an AI-based approach, but it is a step up from writing code by hand -- you could try using open source Mito -> https://www.trymito.io -> full disclosure I built it -> to do some of this messy data wrangling. Mito lets you view and manipulate your data in a spreadsheet in Jupyter and it generates the equivalent Python code for each edit. For things like identifying that the data uses '&' and 'and', viewing your data in a spreadsheet is >> just writing code.

Once you generate the code, you could copy it into your pipeline so that you pull the data from the last.fm API, preprocess it with the Python code that Mito generated, and then dump it into the LiteDB.

yosaiOP3y ago

@anoonmoose We have an internal pipeline which streams the MongoDB data to any CSV or any webhook URL path. It's an export pipeline which streams the processed data to CSVs. We will expose an API in the coming days which will fit your use case.

a_subsystem3y ago

We're using PowerBI for this kind of thing.

It's certainly not open source, but you put in wonky tables, give it a couple/few examples of how you would like it to be, and it uses AI to spit out clean tables for export.

I'm not a fan of proprietary working files, but if that ever becomes a problem, at least we've still got the data.

yosaiOP3y ago· 8 in thread

Hey Everybody,

We are really excited to open source YoBulk today.

YoBulk is an open source CSV importer for any SaaS application - It's a free alternative to https://flatfile.com/

Why are we building YoBulk:

In our previous startup, we were receiving CSV files from various billboard screen owners every day, following a specific template that we defined. Despite the well-defined template, the CSV files we received often contained manual errors, which was a challenge to fix with the data provider.

We were receiving around 500,000 billboard data updates each day, including price changes and creative info data. It was a difficult and time-consuming job to clean and format the data to fit our database schema and upload it into the database. As a result, we wanted to automate the entire CSV importing process. In our second startup, we encountered similar challenges when cleaning large CSV files with location and timestamp data.

We realised that more than 70% of business data is shared in CSV and Excel formats, and only a small percentage use API integrations for data exchange. As developers and product managers, we have experienced the difficulties of building a scalable CSV importer, and we know that many others face the same challenges. Our goal is to solve this problem by taking an open source AI and developer-centric approach

Who can use YoBulk:

YoBulk is a highly beneficial tool for a variety of professionals, such as Developers, Product Managers, Customer Success teams, and Marketers. It simplifies the process of onboarding and verifying customer data, making it an indispensable asset for those who deal with frequent CSV data uploads to a system with a predetermined schema or template.

This tool is particularly valuable for updating sales CRM or product catalog data, and it effectively solves the initial challenge of customer data ingestion.

The Problem:

Importing a CSV is a really hard problem to solve. Some of the key problems are:

1. Lack of collaboration and automation in the CSV importing workflow:

In a usual situation, the customer success team responsible for receiving CSV data has to engage in extensive back-and-forth communication with the customer to address unintentional manual errors present in a CSV. This process requires a high level of collaboration and may even necessitate assistance from the customer's internal teams to correct the data. The entire workflow is currently manual and therefore needs to be automated. Being able to quickly see data errors and fix them on the spot in a collaborative way with the customer is the way forward.

2. Scale: CRM CSV files can sometimes reach sizes as large as 4 GB, making it nearly impossible to open them on a standalone machine for data correction. This presents a significant challenge for small businesses who cannot afford to invest in big data technologies such as EMR, Databricks, and ETL tools to address CSV import scaling problems.

3. Countless complex validation types: A single date field can arrive in as many as 100 different format variations, such as dd-mm-yyyy, mm-dd-yyyy, and dd.mm.yyyy. Manually setting validation rules for each of these formats is almost impossible, and correcting errors manually can also be difficult. Additionally, it can be challenging to comprehend errors without a human touch. Cross-validation between fields/columns is always a challenge in a specific CSV. For example, if a CSV contains two fields such as first name and age, creating custom validation to flag an error if the first name is missing and the age is greater than 50 can be really difficult.

4. Data mapping issues: In a typical scenario, the recipient of CSV data provides a template to the data donor and creates a CSV column to template mapping before importing. However, in many cases, the CSV column names do not match the corresponding template column names. For instance, the data receiver may provide a field labeled "EMP date of Joining," but the uploaded CSV may contain a field labeled "EMP DOJ." These mapping issues can significantly slow down the CSV importing process.

5. Data security and privacy: It is always risky to share your customer data with third-party companies for data cleaning purposes.

6. Non-availability of low-code/no-code tools: Product managers and customer success teams, who are typically no-code users, often rely on data analysts to create a programmed CSV template with validation rules, which must be shared with customers to receive CSV data in a specific format. However, in an ideal scenario, no-code users should be able to create a template independently, without depending on developers.

7. Vague error messages: Unclear error messages do not provide users with enough context to confidently resolve their issues before uploading their data. Without a specific explanation of the problem, users may have to try various fixes until they find one that works. Example: while uploading a CSV file to a portal, I received an error like “baseID is null”. I was clueless :)
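
To make points 3 and 7 concrete, here is a small sketch of a row validator that tries several date formats, applies the cross-field rule from the example (first name missing and age greater than 50), and reports problems in plain language. This is not YoBulk's implementation; the column names `first_name`, `age`, and `joined`, the format list, and the messages are assumptions for illustration. Note that dd-mm-yyyy and mm-dd-yyyy cannot be told apart for values like 03-04-2023, so a real importer has to fix the expected order per column rather than guess.

```python
from datetime import datetime

# Assumed list of accepted formats; the first one that parses wins.
DATE_FORMATS = ("%d-%m-%Y", "%d.%m.%Y", "%Y-%m-%d")


def parse_date(value):
    """Return a date if the value matches any accepted format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None


def validate_row(row):
    """Return a list of human-readable problems for one CSV row (a dict)."""
    errors = []
    joined = row.get("joined", "")
    if parse_date(joined) is None:
        errors.append(
            f"joined: '{joined}' is not a recognised date (expected e.g. 31-12-2022)"
        )
    name = row.get("first_name", "").strip()
    age_text = row.get("age", "").strip()
    if not age_text.isdigit():
        errors.append(f"age: '{age_text}' is not a whole number")
    elif not name and int(age_text) > 50:
        # Cross-field rule: the check depends on two columns at once.
        errors.append(
            f"first_name is empty but age is {age_text}; "
            "rows with age over 50 need a first name"
        )
    return errors
```

For example, `validate_row({"first_name": "", "age": "61", "joined": "31.12.2022"})` returns a single message naming the empty `first_name` and the age that triggered the rule, rather than something like "baseID is null".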

The Solution:

1. Smart Spreadsheet View: Designed to be a data exchange hub for any business that utilizes CSV files, YoBulk makes it easy to import and transform any CSV into a smart spreadsheet interface. This user-friendly interface highlights errors in a clear, concise manner, simplifying the task of cleaning data.

2. Bring your validation function: YoBulk offers a platform for Developers to create a custom CSV importer that includes personalized validation rules based on JSON schema. With this functionality, developers can design an importer that meets their specific needs and preferences.

3. AI first : YoBulk harnesses the power of OpenAI to provide advanced column matching, data cleaning and JSON schema generation features.

4. Build for Scale: YoBulk is designed for large-scale CSV validation, with the ability to process files in the gigabyte range without any glitches or errors.

5. Embeddable: Take advantage of YoBulk's customizable import button feature, which can be embedded on any SaaS or App. This allows you to receive CSV data in the exact format you require, streamlining your workflows.

Hosting and Deployment:

YoBulk can be self-hosted and currently runs on MongoDB.

Github : git clone git@github.com:yobulkdev/yobulkdev.git

Getting started is really simple :

Please refer to https://doc.yobulk.dev/GetStarted/Installation

Docker command:

git clone https://github.com/yobulkdev/yobulkdev.git
cd yobulkdev
docker-compose up -d

Or

docker run --rm -it -p 5050:5050/tcp yobulk/yobulk

Or

git clone https://github.com/yobulkdev/yobulkdev
cd yobulkdev
yarn install
yarn run dev

Also please join our community at :

- Github : https://github.com/yobulkdev/yobulkdev
- Slack : https://join.slack.com/t/yobulkdev/signup
- Twitter : https://twitter.com/YoBulkDev
- Reddit : https://reddit.com/r/YoBulk

Would love to hear your feedback & how we can make this better.

Thank you,

Team YoBulk

hermitcrab3y ago

I take issue with "Non-availability of low code/No code tool". There are plenty of no-code and low-code ETL tools that are heavily used for reading, re-formatting and restructuring CSV files. For example, our own Easy Data Transform, which is a drag and drop data transformation tool aimed very much at business users, rather than professional data scientists.

yosaiOP3y ago

@hermitcrab Here we mean no-code/low-code for validation template creation. We are not an ETL tool, although we do ETL operations internally. YoBulk is a flatfile.com alternative and primarily meant for the data donor. We provide a spreadsheet view so the data donor, who is mostly a non-technical person, can intuitively solve data errors. It's not meant for data scientists.

nerdponx3y ago

This is a really interesting use of AI, and I think this has been a sought-after use case for a while. I recall the wave of "ML APIs" and auto-ML frameworks a few years ago that promised to use an ML model to automatically perform feature engineering, hyperparameter optimization, data cleaning, etc., but never caught on as tools in the hands of non-experts.

However I'm surprised that this works in completely automated fashion. Given the fundamentally nondeterministic nature of language models, how do you ensure that the output is correct? Do you have a set of assertions that must become true about the data before the result is returned? How do you prevent the model from being too clever with your assertions, and replacing the data with all 0s or something similarly à la Asimov's Three Laws of Robotics (see eg https://en.wikipedia.org/wiki/Runaround_(story))?

yosaiOP3y ago

@nerdponx This is really a great question. We are currently using AI for schema generation as well as column matching. In the YoBulk system, column matching is done with Dice's coefficient. But with ChatGPT's column matcher, we are leveraging the ChatGPT model to match the columns. Further, there is a roadmap for auto-cleaning by keeping the historical records and building a model to sense the data type entered into the CSV for the specific organization. We give the user the final power to decide whether the GPT output is correct or not. Happy to engage with you on this topic.
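
For readers unfamiliar with Dice's coefficient, here is a minimal sketch of column matching with it, computed over character bigrams. This is an assumption about how such a matcher could look, not YoBulk's code; the normalization (lowercase, letters and digits only) and the function names are illustrative.

```python
from collections import Counter


def bigrams(text):
    """Character bigrams of the lowercased text, ignoring non-alphanumerics."""
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a, b):
    """Dice's coefficient: 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    shared = sum((x & y).values())  # multiset intersection
    return 2 * shared / total


def best_match(uploaded, template_columns):
    """Return the template column most similar to the uploaded header, with its score."""
    scored = [(dice(uploaded, candidate), candidate) for candidate in template_columns]
    score, name = max(scored)
    return name, score
```

With these definitions, `best_match("E-mail Addr", ["Email Address", "First Name", "Phone"])` picks `"Email Address"`. Abbreviations are the weak spot: `dice("EMP DOJ", "EMP date of Joining")` is only 0.3, because "DOJ" shares almost no bigrams with the words it abbreviates, which may be why a language model is attractive for this step.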

Mystery-Machine3y ago

Please let someone proof read the Readme, it's embarrassing

yosaiOP3y ago

@Mystery-Machine Happy to get your detailed feedback on the Readme. We will correct it.

hattermat3y ago

wow - this is huge, wonder how a lot of the companies in this space will respond

yosaiOP3y ago

Some of the companies in this space: https://flatfile.com/, https://www.oneschema.co/, https://www....

WhiteNoiz33y ago· 4 in thread

When I read the headline I thought this would take a few rows of your CSV file and generate the schema from that using AI. Seems like you still need to manually describe the columns.

yosaiOP3y ago

Yes, we have a workflow for your use case. YoBulk can create a template or schema from an uploaded CSV: we read some lines and create the schema. Right now we have not added AI for that. This flow is very handy when you want to upload a CSV file to HubSpot or LinkedIn and want to clean the data according to the template that HubSpot or LinkedIn defines. People can upload a LinkedIn or HubSpot template CSV to YoBulk and create a template. They can then validate their CSV data against the YoBulk template before uploading to the HubSpot or LinkedIn portal.
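
The "read some lines and create the schema" step can be sketched without any AI. The following is an illustrative guess at the idea, not YoBulk's code: it samples the first rows of a CSV and emits a JSON-Schema-style dictionary with a type per column. The function names and the three-type rule (`integer`, `number`, `string`) are assumptions.

```python
import csv
import io
from itertools import islice


def infer_type(values):
    """Pick the narrowest type that every non-empty sample value fits."""
    samples = [v.strip() for v in values if v and v.strip()]
    if not samples:
        return "string"
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for value in samples:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(fileobj, sample_rows=100):
    """Read up to sample_rows rows and describe each column."""
    reader = csv.DictReader(fileobj)
    sample = list(islice(reader, sample_rows))
    properties = {
        name: {"type": infer_type([row.get(name) for row in sample])}
        for name in (reader.fieldnames or [])
    }
    return {"type": "object", "properties": properties}


schema = infer_schema(io.StringIO("id,price,name\n1,2.5,Ada\n2,3,Bob\n"))
```

A sample-based guess like this can be wrong when a later row breaks the pattern, so the result is best treated as a draft template for a person to confirm.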

iLoveOncall3y ago

This is a trivial problem you can solve in a hundred lines in any programming language, you don't need AI.

WhiteNoiz33y ago

This is probably true (and I agree in this case it would be better for a tool like this). What I like about the idea of LLMs is that you won't need to write a couple hundred line program to parse something, you can just describe what you want. And I don't need to wait for someone to write a tool exactly suited for my usecase.

blowski3y ago

It's also something you can do manually with a few admin people, or perhaps using a COBOL script. Innovation means we'll have different ways of doing the same thing.

PaulHoule3y ago· 3 in thread

Around the time that BERT and fasttext were just coming out, I worked at a startup that had built a system that used text CNNs to interpret CSV files. In particular, we had models that profiled at the level of individual cells by classifying either the content of the cell or the content of the cell plus the label of the column.

I was thinking we were going to get bought by one of our big clients like Airbus or a major accounting firm but actually the firm got bought by a major shoe and clothing brand. I still wear swag from that employer to the gym sometimes so I like to think it got transmigrated when the acquisition happened.

yosaiOP3y ago

Thanks for your comment, PaulHoule. Text CNN is the way forward for YoBulk also. We will be building a model for self-correction of CSV errors. It has to understand the context at the level of each cell, and the model has to be trained accordingly.

Der_Einzige3y ago

My guess is Zalando, right? I was always intrigued at the quality of NLP research coming out of them!

PaulHoule3y ago

No, it was this

https://fdra.org/latest-news/nike-acquires-data-integration-...

jimlongton3y ago· 2 in thread

Does this send all the data to a third party? What if it contains personal information?

yosaiOP3y ago

No data is sent to any 3rd party. YoBulk is self-hosted. Your personal information is stored in your own database. Feel free to ask any data-security-related question.

counttheforks3y ago

How are you running GPT locally? OpenAI is a third party.

cjtechie3y ago· 2 in thread

This is what exactly I was looking for. Will it help me to run big files and cleanse?

yosaiOP3y ago

YoBulk uses buffer streaming internally, so you can upload a CSV that is gigabytes in size. You can try it once at your end and let me know your feedback.
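
The general idea behind streaming validation is to never hold the whole file in memory. Here is a minimal Python sketch of that idea; it is not YoBulk's implementation (which this thread does not show), and the function name and the "required columns" rule are assumptions for illustration.

```python
import csv
import io


def stream_errors(fileobj, required):
    """Yield (line_number, missing_columns) one row at a time.

    Only the current row is kept in memory, so the file size does not
    matter as long as the caller consumes the generator lazily.
    """
    reader = csv.DictReader(fileobj)
    for row in reader:
        missing = [col for col in required if not (row.get(col) or "").strip()]
        if missing:
            yield reader.line_num, missing


# Small in-memory stand-in for a large file on disk.
data = io.StringIO("id,email\n1,a@x.io\n2,\n3,c@x.io\n")
errors = list(stream_errors(data, required=["id", "email"]))
```

With a real file, `open(path, newline="")` would replace the `StringIO` object, and errors could be written out as they are found instead of being collected in a list.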

cjtechie3y ago

Thank you. Let me try this out

bcrl3y ago· 1 in thread

This strikes me as an idea as bad as the Xerox document scanner that implemented a compression algorithm that changed digits. It'll be really fun debugging when something completely unexpected gets spit out of the neural network.

yosaiOP3y ago

@bcrl We provide our own schema, which is understood by our validation engine. It's a user option to use GPT or not. GPT output always comes with a disclaimer that it might not be correct. We will be solving that slowly.

RileyJames3y ago· 1 in thread

I’m looking for a solution to this problem at the moment. Of course I could DIY, but I’m more attracted to a solution like flat file or YoBulk, rather than reinvent the wheel.

But from the Flatfile website it’s quite clear how I’d integrate it into our product. From your GitHub repo, it’s clear that it’s open source, and thus I know I could integrate it, but it’d be great to see some specific examples of how, and the benefits.

If you're going to compare yourself as a flatfile alternative, I feel that’s a key point to make.

Will check it out more deeply tho, looks quite interesting. I’m sold on the concept, but the GitHub repo didn’t push me over the line to “this is the solution”.

yosaiOP3y ago

Yes, we are making the point very clear here that we are the open source alternative to Flatfile, and it resonates with your thoughts also. Curious to know why the GitHub repo left you uncertain. Happy to receive candid feedback from you.

icelancer3y ago· 1 in thread

This is a neat tool. I am looking for ways to use GPT-3 to read CSVs / DBs with a bunch of time series numerical data (and summaries of said numerical data) and ask it questions, but I'm having trouble doing so. Does anyone have suggestions for this related project?

yosaiOP3y ago

icelancer, right now YoBulk flattens the CSV to JSON in a key-value document DB format and stores it in MongoDB. You can use GPT3 to create Mongo queries to fetch the data. We will be adding APIs soon where you can make a query through GPT3 and fetch the data from MongoDB. Keep a watch on the YoBulk Git repo.

dstala3y ago· 1 in thread

> YoBulk harnesses the power of OpenAI to provide advanced column matching

@yosai, can you give an example? just curious

yosaiOP3y ago

@dstala Thanks for exploring YoBulk. Under the hood, YoBulk uses OpenAI APIs which take the uploaded CSV column names and the template columns as input and give accurate matching. You can try the product and let me know if you have any comments.

dontcontactme3y ago· 1 in thread

"Open source tool powered by closed source API" Is it really open source then?

yosaiOP3y ago

Yes, it's definitely open source. On a lighter note, until we build our own text CNN, we will use OpenAI to show the power and use cases of AI to our users.

ddgflorida3y ago· 1 in thread

Besides the commercial CSV importer mentioned, there are several more. I won't name them but if you google it - you'll see 4 or 5 right away.

yosaiOP3y ago

YoBulk's scope is much, much bigger than a standalone CSV importer. We want to completely automate the CSV import workflow with an AI-first approach. https://www.yobulk.dev/blog/Building%20an%20In-house%20CSV%2... gives an idea about the collaboration and workflow scope of YoBulk.

yarapavan3y ago· 1 in thread

This looks good and promising! Congrats and best wishes, yosai!!

yosaiOP3y ago

@yarapavan Thanks. Please explore the product!!

nektro3y ago

what a joke

j / k navigate · click thread line to collapse

109 comments

65 comments · 15 top-level

data_ders3y ago· 14 in thread

wait… if importing malformed csvs gets automated that’s like half of a data professional’s job gone in a poof of smoke /s. jk -- great use case

Last, but most personally interestingly, how much do you know about the Apache Arrow ecosystem and how it's mission might overlap with YoBulk's

anothernewdude3y ago

You'd have to be insane to trust GPT that much. I wouldn't want anything hallucinated in my data.

yosaiOP3y ago

chaps3y ago

Man, been down this path for a long while. It gets tough! Flattening csvs with hierarchical headers (as in, headers that that apply a category to a second row of headers) are tough.

The ways csv can fail is just fucking nuts. Especially when they're half hand written, half automated, or where a failure is 20m rows in. Hard to have speed and strong checks simultaneously.

1 more reply

faebi3y ago

boringg3y ago

"half of a data professional's job" ... you mean like 90%

textninja3y ago

1 more reply

recursive3y ago

> CSV is ubiquitous but has no singularly well-adopted standard

yosaiOP3y ago

Yes you are absolutely right..We need a solution beyond standard as 80% businesses run on CSV..

1 more reply

silent_cal3y ago

pd.yeet("/data/data_1.csv")

pd.yeet(lambda x: yeet(x))

pd.yeet_to_csv("/clean_data/cleaned_data_1.csv")

refulgentis3y ago

yeet is dispose of with haste, not “move, but zoomer” or “sloppily with haste”

chaps3y ago

1 more reply

sgerenser3y ago

The "revolutionary new tool" to replace CSVs was XML in the late 1990s.

nerdponx3y ago

4 more replies

yosaiOP3y ago

anoonmoose3y ago· 10 in thread

nerdponx3y ago

In the case of Unicode at least, the Unicode consortium maintains a database of "confusable" characters and a tool to detect them: https://util.unicode.org/UnicodeJsps/confusables.jsp?a=%E2%8...

You can download the database for use in your own programs, and there is at least one Python package built around it: https://pypi.org/project/confusables/

carterschonwald3y ago

Is there a schema document explaining the format of the dataset?

Edit Found it in the associated doc It’s a cute approach !

Data File Format

0441 ; 0063 ; MA # ( с → c ) CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C #

2CA5 ; 0063 ; MA # ( ⲥ → c ) COPTIC SMALL LETTER SIMA → LATIN SMALL LETTER C # →ϲ→

Everything after the # is a comment and is purely informative. A asterisk after the comment indicates that the character is not an XID character [UAX31]. The comments provide the character names.

Implementations that use the confusable data do not have to recursively apply the mappings, because the transforms are idempotent. That is,

skeleton(skeleton(X)) = skeleton(X)

IanCal3y ago

If you've got borked encodings around as well, the python package ftfy is wonderful: https://pypi.org/project/ftfy/

Undoes whatever on earth it is excel does, helps clean up bits of html/etc.

breck3y ago

It's still early, but TreeBase might be worth a look. (https://jtree.treenotation.org/treeBase/index.html)

It's the public domain software that powers PLDB.com and CancerDB.com.

So your data, your query language, your schemas, your display language are all in the same simple plain-text notation: Tree Notation. Of course, there's also a lot of Javascript glue.

Very little documentation at the moment, and it's brand new, but it simple was not possible before the M1s, which came out in December 2020, and the growth rate is very good.

It's all signal, no noise, so it's a timeless solution, and you won't regret putting your data in there.

0live3y ago

I would suggest https://github.com/OpenRefine/OpenRefine to clean your data.

yosaiOP3y ago

1 more reply

anoonmoose3y ago

Love this suggestion, excited to check it out!

aarondia3y ago

yosaiOP3y ago

a_subsystem3y ago

We're using PowerBI for this kind of thing.

It's certainly not open source, but you put in wonky tables, give it a couple/few examples of how you would like it to be, and it uses AI to spit out clean tables for export.

I'm not a fan of proprietary working files, but if that ever becomes a problem, at least we've still got the data.

yosaiOP3y ago· 8 in thread

Hey Everybody,

We are really excited to open source YoBulk today.

YoBulk is an open source CSV importer for any SaaS application - It's a free alternative to https://flatfile.com/

Why are we building YoBulk:

Who can use YoBulk:

This tool is particularly valuable for updating sales CRM or product catalog data, and it effectively solves the initial challenge of customer data ingestion.

The Problem:

Importing a CSV is a really hard problem to solve. Some of the key problems are:

1.Missing of Collaboration and Automation in CSV importing workflow:

5.Data Security and Privacy:It is always risky to share your customer data with third-party companies for data cleaning purposes.

The Solution:

3. AI first: YoBulk harnesses the power of OpenAI to provide advanced column matching, data cleaning and JSON schema generation features.

4. Built for Scale: YoBulk is designed for large-scale CSV validation, with the ability to process files in the gigabyte range without any glitches or errors.

Hosting and Deployment:

YoBulk can be self-hosted and currently runs on MongoDB.

GitHub: git clone git@github.com:yobulkdev/yobulkdev.git

Getting started is really simple:

Please refer to https://doc.yobulk.dev/GetStarted/Installation

Also, please join our community at:

- GitHub: https://github.com/yobulkdev/yobulkdev
- Slack: https://join.slack.com/t/yobulkdev/signup.
- Twitter: https://twitter.com/YoBulkDev
- Reddit: https://reddit.com/r/YoBulk

Would love to hear your feedback & how we can make this better.

Thank you,

Team YoBulk

hermitcrab3y ago

yosaiOP3y ago

nerdponx3y ago

yosaiOP3y ago

Mystery-Machine3y ago

Please let someone proofread the README; it's embarrassing.

yosaiOP3y ago

@Mystery-Machine Happy to get your detailed feedback on the README. We will correct it.

hattermat3y ago

wow - this is huge, wonder how a lot of the companies in this space will respond

yosaiOP3y ago

Some of the companies in this space: https://flatfile.com/, https://www.oneschema.co/, https://www....

WhiteNoiz33y ago· 4 in thread

When I read the headline I thought this would take a few rows of your CSV file and generate the schema from that using AI. Seems like you still need to manually describe the columns.

yosaiOP3y ago

iLoveOncall3y ago

This is a trivial problem you can solve in a hundred lines in any programming language, you don't need AI.
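For what it's worth, a non-AI version of "look at a few rows and guess the schema" can be sketched with the standard library alone. This is only an illustration of the idea in this sub-thread, not how YoBulk works; the function names `infer_type` and `infer_schema`, the type labels, and the single `%Y-%m-%d` date format are all assumptions made for the example.

```python
import csv
import io
import itertools
from datetime import datetime


def infer_type(values):
    """Return the narrowest type label that fits every non-empty value."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "string"  # nothing to go on, fall back to the widest type

    def all_parse(parser):
        # True only if the parser accepts every sampled value.
        for value in non_empty:
            try:
                parser(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    # Assumption: only ISO-style dates are recognized in this sketch.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"


def infer_schema(csv_text, sample_size=50):
    """Guess a column -> type mapping from the first `sample_size` rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(itertools.islice(reader, sample_size))
    return {
        name: infer_type([row[name] or "" for row in rows])
        for name in (reader.fieldnames or [])
    }


sample = "id,price,joined,name\n1,9.99,2023-01-15,Ann\n2,12,2023-02-01,Bob\n"
schema = infer_schema(sample)
```

The hard part in practice is everything this skips: mixed date formats, thousands separators, sentinel values like "N/A", and columns whose first 50 rows happen to look numeric.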

WhiteNoiz33y ago

blowski3y ago

It's also something you can do manually with a few admin people, or perhaps using a COBOL script. Innovation means we'll have different ways of doing the same thing.

PaulHoule3y ago· 3 in thread

yosaiOP3y ago

Der_Einzige3y ago

My guess is Zalando, right? I was always intrigued by the quality of NLP research coming out of them!

PaulHoule3y ago

No, it was this

https://fdra.org/latest-news/nike-acquires-data-integration-...

jimlongton3y ago· 2 in thread

Does this send all the data to a third party? What if it contains personal information?

yosaiOP3y ago

No data is sent to any third party. YoBulk is self-hosted. Your personal information is stored in your own database. Feel free to ask any data-security-related questions.

counttheforks3y ago

How are you running GPT locally? OpenAI is a third party.

cjtechie3y ago· 2 in thread

This is exactly what I was looking for. Will it help me process and cleanse big files?

yosaiOP3y ago

YoBulk uses buffer streaming internally, so you can upload CSVs that are gigabytes in size. Try it on your end and let me know your feedback.
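To illustrate the general idea of streaming validation (this is a rough sketch, not YoBulk's actual code), the snippet below reads a CSV one row at a time so memory use stays flat regardless of file size. The function name `validate_stream`, the "required column" rule, and the `max_errors` cap are assumptions made for the example.

```python
import csv
import io


def validate_stream(lines, required, max_errors=100):
    """Validate rows one at a time; `lines` is any iterable of text lines.

    Returns (rows_seen, errors), where each error is
    (line_number, column_name, message).
    """
    reader = csv.DictReader(lines)
    errors = []
    rows_seen = 0
    for rows_seen, row in enumerate(reader, start=1):
        for column in required:
            # Short rows yield None for missing fields, so normalize to "".
            if not (row.get(column) or "").strip():
                errors.append((reader.line_num, column, "missing value"))
        if len(errors) >= max_errors:
            break  # stop early instead of collecting an unbounded error list
    return rows_seen, errors


data = io.StringIO("id,email\n1,a@x.com\n2,\n,c@x.com\n")
rows_seen, errors = validate_stream(data, required=["id", "email"])
```

For a real file you would pass an open file handle instead of `io.StringIO`, for example `open(path, newline="")`, which is also consumed lazily line by line.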

cjtechie3y ago

Thank you. Let me try this out

bcrl3y ago· 1 in thread

yosaiOP3y ago

RileyJames3y ago· 1 in thread

I’m looking for a solution to this problem at the moment. Of course I could DIY, but I’m more attracted to a solution like flat file or YoBulk, rather than reinvent the wheel.

If you're going to position yourself as a flatfile alternative, I feel that’s a key point to make.

Will check it out more deeply tho, looks quite interesting. I’m sold on the concept, but the GitHub repo didn’t push me over the line to “this is the solution”.

yosaiOP3y ago

icelancer3y ago· 1 in thread

yosaiOP3y ago

dstala3y ago· 1 in thread

> YoBulk harnesses the power of OpenAI to provide advanced column matching

@yosai, can you give an example? just curious
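For context, the baseline that "advanced" column matching would need to beat is plain fuzzy string matching. A minimal sketch of that baseline, assuming nothing about how YoBulk or OpenAI does it (the names `normalize` and `match_columns` and the 0.6 cutoff are made up for illustration):

```python
import difflib


def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_columns(source_headers, target_fields, cutoff=0.6):
    """Map each uploaded CSV header to the closest template field, or None."""
    normalized_targets = {normalize(field): field for field in target_fields}
    mapping = {}
    for header in source_headers:
        candidates = difflib.get_close_matches(
            normalize(header), list(normalized_targets), n=1, cutoff=cutoff
        )
        mapping[header] = normalized_targets[candidates[0]] if candidates else None
    return mapping


mapping = match_columns(
    ["First Name", "E-mail", "Phone #", "zzz"],
    ["first_name", "email", "phone"],
)
```

This handles spelling and punctuation variants, but not semantic ones such as "Surname" versus "last_name", which is presumably where a language model would earn its keep.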

yosaiOP3y ago

dontcontactme3y ago· 1 in thread

"Open source tool powered by closed source API" Is it really open source then?

yosaiOP3y ago

Yes, it's definitely open source. On a lighter note, until we build our own text CNN, we will use OpenAI to show the power and use cases of AI to our users.

ddgflorida3y ago· 1 in thread

Besides the commercial CSV importer mentioned, there are several more. I won't name them but if you google it - you'll see 4 or 5 right away.

yosaiOP3y ago

yarapavan3y ago· 1 in thread

This looks good and promising! Congrats and best wishes, yosai!!

yosaiOP3y ago

@yarapavan Thanks. Please explore the product!!

nektro3y ago

what a joke
