so often w/ pandas I’d: 1. “yeet” the csv into a dataframe 2. use dataframe methods to massage the data to a “clean” state 3. push as much of the df methods into pd.read_csv() parameter options
it’d be great to iterate more quickly on the above loop. better yet: what if it could auto-generate a letter to send to the folks from whom you got this data on how they could better output to csv to make ingestion simpler and easier for downstream users... but maybe that letter would just be "don't use CSV!"
related to flat data formats, it obviously makes sense to start with CSV, but what about the future? If this tool became ubiquitous, how might a SWE or data professional's job change? What opportunities would be created? As in:
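For what it's worth, here is a minimal sketch of that loop on a made-up three-column file. The column names and the "missing" sentinel are assumptions for illustration; the point is only that the cleanup in step 2 can often be folded into read_csv() keyword arguments in step 3.

```python
import io

import pandas as pd

# Hypothetical messy input: spaces after the commas and a custom NA marker.
RAW = "id, joined, amount\n1, 2023-01-05, 10.5\n2, 2023-02-11, missing\n"

# Steps 1 and 2: load as-is, then clean up with DataFrame methods.
rough = pd.read_csv(io.StringIO(RAW))
rough.columns = rough.columns.str.strip()
rough["joined"] = pd.to_datetime(rough["joined"].str.strip())
rough["amount"] = pd.to_numeric(rough["amount"].str.strip(), errors="coerce")

# Step 3: push the same cleanup into read_csv() parameters.
clean = pd.read_csv(
    io.StringIO(RAW),
    skipinitialspace=True,    # drop the space after each delimiter
    na_values=["missing"],    # treat the custom marker as NaN
    parse_dates=["joined"],   # parse the date column while reading
)

print(clean.dtypes)
```

Once the keyword arguments are settled, the method-chaining version can be deleted, which is about as fast as this loop gets without better tooling.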
1. CSV is ubiquitous but has no singularly well-adopted standard. 2. software and data engineers struggle with CSVs as a result of #1. 3. tool is created to reduce pain and friction. 4. profit? a new market? a new standard?
Last, and most interesting to me personally: how much do you know about the Apache Arrow ecosystem, and how might its mission overlap with YoBulk's?
The ways CSV can fail are just fucking nuts. Especially when the files are half hand-written, half automated, or where a failure is 20M rows in. Hard to have speed and strong checks simultaneously.
RFC4180 exists regardless of adoption level. In a way, the simplicity of the spec causes the proliferation of grammars. No one thinks they can just yolo a PDF by hand in a text editor. Ok, maybe PDF is a bad example. But CSV (as specified in RFC4180) is so dead simple that people take shortcuts.
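A small illustration of the shortcut problem, using only Python's standard csv module (the sample row is made up). Joining fields with commas looks equivalent to proper CSV output until a field contains a comma, a quote, or a line break.

```python
import csv
import io

row = ["Acme, Inc.", 'He said "hi"', "line one\nline two"]

# The shortcut: join on commas and hope no field ever contains one.
naive = ",".join(row)

# The csv module quotes fields containing commas, quotes, or line breaks
# and doubles embedded quotes, in line with RFC 4180.
buf = io.StringIO()
csv.writer(buf, lineterminator="\r\n").writerow(row)
proper = buf.getvalue()


def parse(text):
    """Parse CSV text into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))


print(parse(naive))   # the single record comes back split into two rows
print(parse(proper))  # the single record round-trips intact
```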
pd.yeet(lambda x: yeet(x))
pd.yeet_to_csv("/clean_data/cleaned_data_1.csv")
The "revolutionary new tool" to replace CSVs was XML in the late 1990s.
Otherwise, the problem with CSV has little to do with the CSV format as such and more to do with the fact that the data is stringly-typed. XML has the same problem. JSON interestingly does not. Everything has tradeoffs.
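To make "stringly-typed" concrete, here is a small round trip through both formats with Python's standard library (the record is invented for the example). CSV hands everything back as a string, while JSON keeps numbers, booleans, and null distinct. JSON still has no native date type, so the tradeoff remark stands.

```python
import csv
import io
import json

record = {"id": 7, "price": 9.5, "active": False, "note": None}

# CSV round trip: every value comes back as a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buf.seek(0)
from_csv = next(csv.DictReader(buf))

# JSON round trip: int, float, bool, and null survive.
from_json = json.loads(json.dumps(record))

print(from_csv)   # all values are str; None became an empty string
print(from_json)  # same types as the original record

# Classic pitfall: the string "False" is truthy.
print(bool(from_csv["active"]))
```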
I've got .NET code that hits the last.fm api and dumps the info to a LiteDB database, so I can export to CSV pretty easily if this tool would be useful to me, unless anyone has any better directions to point me in. Appreciate any thoughts you folks have.
You can download the database for use in your own programs, and there is at least one Python package built around it: https://pypi.org/project/confusables/
Edit: Found it in the associated doc. It’s a cute approach!
Data File Format
Each line in the data file has the following format: Field 1 is the source, Field 2 is the target, and Field 3 is obsolete, always containing the letters “MA” for backwards compatibility. For example:
0441 ; 0063 ; MA # ( с → c ) CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C #
2CA5 ; 0063 ; MA # ( ⲥ → c ) COPTIC SMALL LETTER SIMA → LATIN SMALL LETTER C # →ϲ→
Everything after the # is a comment and is purely informative. An asterisk after the comment indicates that the character is not an XID character [UAX31]. The comments provide the character names.
Implementations that use the confusable data do not have to recursively apply the mappings, because the transforms are idempotent. That is,
skeleton(skeleton(X)) = skeleton(X)
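A rough sketch of how the data format and the skeleton transform fit together, assuming a simplified skeleton (NFD, per-character mapping, NFD again) and a table built only from the two sample lines above. The function names are hypothetical, and a real implementation would load the full data file and follow every step of the specification.

```python
import unicodedata

SAMPLE = """\
0441 ; 0063 ; MA # ( с → c ) CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C #
2CA5 ; 0063 ; MA # ( ⲥ → c ) COPTIC SMALL LETTER SIMA → LATIN SMALL LETTER C # →ϲ→
"""


def parse_confusables(text):
    """Build a source -> target mapping from lines in the data file format."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the informative comment
        if not data:
            continue
        source, target, _obsolete = [field.strip() for field in data.split(";")]
        src = "".join(chr(int(cp, 16)) for cp in source.split())
        dst = "".join(chr(int(cp, 16)) for cp in target.split())
        table[src] = dst
    return table


def skeleton(s, table):
    """Simplified skeleton: normalize, map each character, normalize again."""
    nfd = unicodedata.normalize("NFD", s)
    mapped = "".join(table.get(ch, ch) for ch in nfd)
    return unicodedata.normalize("NFD", mapped)


table = parse_confusables(SAMPLE)
spoofed = "\u0441at"  # starts with CYRILLIC SMALL LETTER ES
print(skeleton(spoofed, table) == skeleton("cat", table))
print(skeleton(skeleton(spoofed, table), table) == skeleton(spoofed, table))
```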
Undoes whatever on earth it is Excel does, helps clean up bits of HTML, etc.
It's the public domain software that powers PLDB.com and CancerDB.com.
You store your data in Tree Notation in plain text files, and use the Grammar Language (a Tree Language) for schemas, which also enforces correctness. You use Git for version control. You then can query the data using TQL (also a Tree Language). You can display your data using Scroll (also a Tree Language).
So your data, your query language, your schemas, your display language are all in the same simple plain-text notation: Tree Notation. Of course, there's also a lot of JavaScript glue.
Very little documentation at the moment, and it's brand new, but it simply was not possible before the M1s, which came out in November 2020, and the growth rate is very good.
It's all signal, no noise, so it's a timeless solution, and you won't regret putting your data in there.
Once you generate the code, you could copy it into your pipeline so that you pull the data from the last.fm API, preprocess it with the Python code that Mito generated, and then dump it into the LiteDB.
It's certainly not open source, but you put in wonky tables, give it a couple/few examples of how you would like it to be, and it uses AI to spit out clean tables for export.
I'm not a fan of proprietary working files, but if that ever becomes a problem, at least we've still got the data.
We are really excited to open source YoBulk today.
YoBulk is an open source CSV importer for any SaaS application - It's a free alternative to https://flatfile.com/
Why are we building YoBulk:
In our previous startup, we were receiving CSV files from various billboard screen owners every day, following a specific template that we defined. Despite the well-defined template, the CSV files we received often contained manual errors, which were a challenge to fix with the data provider.
We were receiving around 500,000 billboard data updates each day, including price changes and creative info data. It was a difficult and time-consuming job to clean and format the data to fit our database schema and upload it into the database. As a result, we wanted to automate the entire CSV importing process. In our second startup, we encountered similar challenges when cleaning large CSV files with location and timestamp data.
We realised that more than 70% of business data is shared in CSV and Excel formats, and only a small percentage use API integrations for data exchange. As developers and product managers, we have experienced the difficulties of building a scalable CSV importer, and we know that many others face the same challenges. Our goal is to solve this problem by taking an open source, AI- and developer-centric approach.
Who can use YoBulk:
YoBulk is a highly beneficial tool for a variety of professionals, such as Developers, Product Managers, Customer Success teams, and Marketers. It simplifies the process of onboarding and verifying customer data, making it an indispensable asset for those who deal with frequent CSV data uploads to a system with a predetermined schema or template.
This tool is particularly valuable for updating sales CRM or product catalog data, and it effectively solves the initial challenge of customer data ingestion.
The Problem:
Importing a CSV is a really hard problem to solve. Some of the key problems are:
1. Lack of collaboration and automation in the CSV importing workflow:
In a typical situation, the customer success team responsible for receiving CSV data has to engage in extensive back-and-forth communication with the customer to address unintentional manual errors present in a CSV. This process requires a high level of collaboration and may even necessitate assistance from the customer's internal teams to correct the data. The entire workflow is currently manual and needs to be automated. Being able to quickly see data errors and fix them on the spot, collaboratively with the customer, is the way forward.
2. Scale: CRM CSV files can sometimes reach sizes as large as 4 GB, making it nearly impossible to open them on a standalone machine for data correction. This presents a significant challenge for small businesses that cannot afford to invest in big data technologies such as EMR, Databricks, and ETL tools to address CSV import scaling problems.
3. Countless complex validation types: A single date can be written in as many as 100 different formats, such as dd-mm-yyyy, mm-dd-yyyy, and dd.mm.yyyy. Manually setting validation rules for each of these formats is almost impossible, and correcting errors manually can also be difficult. Additionally, it can be challenging to comprehend errors without a human touch. Cross-validation between fields/columns is always a challenge in a specific CSV. For example, if a CSV contains two fields such as first name and age, creating custom validation to flag an error if the first name is missing and the age is greater than 50 can be really difficult.
4. Data mapping issues: In a typical scenario, the recipient of CSV data provides a template to the data provider and creates a CSV column to template mapping before importing. However, in many cases, the CSV column names do not match the corresponding template column names. For instance, the data receiver may provide a field labeled "EMP date of Joining," but the uploaded CSV may contain a field labeled "EMP DOJ." These mapping issues can significantly slow down the CSV importing process.
5. Data security and privacy: It is always risky to share your customer data with third-party companies for data cleaning purposes.
6. Non-availability of low-code/no-code tools: Product managers and customer success teams, who are typically no-code users, often rely on data analysts to create a programmed CSV template with validation rules, which must be shared with customers to receive CSV data in a specific format. However, in an ideal scenario, no-code users should be able to create a template independently, without depending on developers.
7. Vague error messages: Unclear error messages do not provide users with enough context to confidently resolve their issues before uploading their data. Without a specific explanation of the problem, users may have to try various fixes until they find one that works. Example: while uploading a CSV file to a portal, I received an error like “baseID is null”. I was clueless :)
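To illustrate problems 3 and 7 together, here is a minimal Python sketch of a row validator. It is not YoBulk's implementation; the column names (first_name, age, joined), the format list, and the error wording are assumptions made up for the example. It tries several date formats, flags values that parse to different dates under different formats, applies the first name/age cross-field rule, and reports errors with a row number and column name.

```python
from datetime import datetime

# A few of the many formats a "date" column may arrive in (illustrative only).
DATE_FORMATS = ("%d-%m-%Y", "%m-%d-%Y", "%d.%m.%Y", "%Y-%m-%d")


def parse_date(value):
    """Return every (format, date) pair that accepts the value."""
    matches = []
    for fmt in DATE_FORMATS:
        try:
            matches.append((fmt, datetime.strptime(value, fmt).date()))
        except ValueError:
            pass
    return matches


def validate_row(row_number, row):
    """Return a list of human-readable errors for one CSV row (a dict)."""
    errors = []

    value = row.get("joined", "")
    matches = parse_date(value)
    if not matches:
        errors.append(f"row {row_number}, column 'joined': {value!r} is not a known date format")
    elif len({date for _, date in matches}) > 1:
        formats = ", ".join(fmt for fmt, _ in matches)
        errors.append(f"row {row_number}, column 'joined': {value!r} is ambiguous ({formats})")

    # Cross-field rule from the post: first name missing and age over 50.
    age = row.get("age", "")
    if not row.get("first_name", "").strip() and age.isdigit() and int(age) > 50:
        errors.append(f"row {row_number}, column 'first_name': required when age is over 50 (age is {age})")

    return errors


print(validate_row(2, {"first_name": "", "age": "61", "joined": "03-04-2023"}))
```

An error that names the row, the column, and the rule is usually enough for the person who produced the file to fix it without a support thread.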
The Solution:
1. Smart Spreadsheet View: Designed to be a data exchange hub for any business that utilizes CSV files, YoBulk makes it easy to import and transform any CSV into a smart spreadsheet interface. This user-friendly interface highlights errors in a clear, concise manner, simplifying the task of cleaning data.
2. Bring your validation function: YoBulk offers a platform for Developers to create a custom CSV importer that includes personalized validation rules based on JSON schema. With this functionality, developers can design an importer that meets their specific needs and preferences.
3. AI first: YoBulk harnesses the power of OpenAI to provide advanced column matching, data cleaning and JSON schema generation features.
4. Built for scale: YoBulk is designed for large-scale CSV validation, with the ability to process files in the gigabyte range without any glitches or errors.
5. Embeddable: Take advantage of YoBulk's customizable import button feature, which can be embedded on any SaaS or App. This allows you to receive CSV data in the exact format you require, streamlining your workflows.
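For readers wondering what column matching has to accomplish, here is a deliberately simple Python sketch. It does not use OpenAI and is not how YoBulk does it; it swaps in a hand-maintained alias table, and every name in it (TEMPLATE_ALIASES, map_headers, the aliases themselves) is hypothetical. It maps uploaded headers onto template columns and reports the ones it cannot place.

```python
def normalize(name):
    """Lowercase, treat underscores as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("_", " ").split())


# Hypothetical template columns and the aliases seen in uploads so far.
TEMPLATE_ALIASES = {
    "EMP date of Joining": ["EMP DOJ", "joining date"],
    "EMP name": ["employee name", "name"],
}


def map_headers(uploaded, aliases=TEMPLATE_ALIASES):
    """Return (mapping of uploaded header -> template column, unmatched headers)."""
    lookup = {}
    for template_col, alternatives in aliases.items():
        for candidate in [template_col, *alternatives]:
            lookup[normalize(candidate)] = template_col

    mapping, unmatched = {}, []
    for header in uploaded:
        target = lookup.get(normalize(header))
        if target is None:
            unmatched.append(header)
        else:
            mapping[header] = target
    return mapping, unmatched


mapping, unmatched = map_headers(["emp_doj", "Name", "Shoe size"])
print(mapping)
print(unmatched)
```

The unmatched list is the part a human (or a model) still has to resolve, which is where a smarter matcher earns its keep.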
Hosting and Deployment:
YoBulk can be self-hosted and currently runs on MongoDB.
Github : git clone git@github.com:yobulkdev/yobulkdev.git
Getting started is really simple:
Please refer to https://doc.yobulk.dev/GetStarted/Installation
Docker command:
  git clone https://github.com/yobulkdev/yobulkdev.git
  cd yobulkdev
  docker-compose up -d
Or:
  docker run --rm -it -p 5050:5050/tcp yobulk/yobulk
Or:
  git clone https://github.com/yobulkdev/yobulkdev
  cd yobulkdev
  yarn install
  yarn run dev
Also, please join our community at:
- Github : https://github.com/yobulkdev/yobulkdev
- Slack : https://join.slack.com/t/yobulkdev/signup
- Twitter : https://twitter.com/YoBulkDev
- Reddit : https://reddit.com/r/YoBulk
Would love to hear your feedback & how we can make this better.
Thank you,
Team YoBulk
However, I'm surprised that this works in a completely automated fashion. Given the fundamentally nondeterministic nature of language models, how do you ensure that the output is correct? Do you have a set of assertions that must hold for the data before the result is returned? How do you prevent the model from being too clever with your assertions and replacing the data with all 0s or something similar, à la Asimov's Three Laws of Robotics (see e.g. https://en.wikipedia.org/wiki/Runaround_(story))?
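To make the question concrete, this is the kind of check I have in mind, as a rough Python sketch. It says nothing about what YoBulk actually does; the function name and the three invariants are my own assumptions. It compares the rows before and after an automated cleaning step and lists anything the step was not allowed to change.

```python
def check_cleaning(before, after, allowed_columns):
    """Return a list of invariant violations for a cleaning step.

    before/after are lists of dicts (one per row); allowed_columns names
    the columns the cleaning step is permitted to modify.
    """
    problems = []
    if len(before) != len(after):
        return [f"row count changed from {len(before)} to {len(after)}"]

    for i, (old, new) in enumerate(zip(before, after), start=1):
        if old.keys() != new.keys():
            problems.append(f"row {i}: columns changed")
            continue
        for col in old:
            if col not in allowed_columns and old[col] != new[col]:
                problems.append(f"row {i}: column {col!r} was modified")

    # Guard against the "replace everything with 0" failure mode.
    for col in allowed_columns:
        had_variety = len({row.get(col) for row in before}) > 1
        has_variety = len({row.get(col) for row in after}) > 1
        if had_variety and not has_variety:
            problems.append(f"column {col!r} collapsed to a single value")
    return problems


before = [{"id": "1", "date": "1/2/23"}, {"id": "2", "date": "3.4.2023"}]
good = [{"id": "1", "date": "2023-02-01"}, {"id": "2", "date": "2023-04-03"}]
bad = [{"id": "0", "date": "0"}, {"id": "0", "date": "0"}]

print(check_cleaning(before, good, {"date"}))
print(check_cleaning(before, bad, {"date"}))
```

Checks like these cannot prove the cleaned values are right, only that the model stayed within bounds, so the question about correctness still stands.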
I was thinking we were going to get bought by one of our big clients like Airbus or a major accounting firm but actually the firm got bought by a major shoe and clothing brand. I still wear swag from that employer to the gym sometimes so I like to think it got transmigrated when the acquisition happened.
But from the Flatfile website it’s quite clear how I’d integrate it into our product. From your GitHub repo, it’s clear that it’s open source, and thus I know I could integrate it, but it’d be great to see some specific examples of how, and the benefits.
If you're going to position yourself as a Flatfile alternative, I feel that’s a key point to make.
Will check it out more deeply tho, looks quite interesting. I’m sold on the concept, but the GitHub repo didn’t push me over the line to “this is the solution”.
@yosai, can you give an example? just curious