DeepFabric – Generate high-quality synthetic datasets at scale (opens in new tab)

(lukehinds.github.io)

106 pointsdecodebytes8mo ago19 comments

19 comments

13 comments · 5 top-level

jossclimb8mo ago· 3 in thread

How is the diversity, duplication?

Very good, and even better with the new DAG approach - we have been using great-expectations to bench and seeing very good diversity and low amounts of duplication - you check out one of the recent CoT examples here: https://huggingface.co/datasets/lukehinds/deepfabric-devops-...

evgen8mo ago

This dataset disappeared. Did it move or get pulled for some reason? (glanced at it when you noted this and went back today to check it out and found a 404...)

tehryanx8mo ago

based on the description, I think it's using something similar to GLAN https://arxiv.org/abs/2402.13064

scosman8mo ago· 2 in thread

If anyone's interested in synthetic data generation, we've built a fully interactive visual tool for SDG. It supports generating hierarchical topic trees like other tools, but we do two things others don't:

First: fully interactive UI. This might sound unnecessary, but synthetic data is a creative and iterative process. It helps to review each step as you go, tweaking prompts. Are the topics right? Are the inputs realistic? Are the outputs reasonable? Once your prompts are dialed in, you can scale up the volume, but there's a creative iterative process to get there.

Second: we have many templates for common synthetic data gen use cases. For fine-tuning you want to focus on the breadth of realistic inputs. For "bug" evals you want to trigger specific error cases based on a description of the issue. For measuring evaluators/LLM judges you need a topic tree mixing passing and failing data. We also provide templates for common use cases: bias, maliciousness, toxicity, jailbreaking, etc. These are good to bootstrap the creative process above, but you can edit each to meet your needs.

It's a free app on GitHub. Docs and videos: https://docs.kiln.tech/docs/synthetic-data-generation

decodebytesOP8mo ago

Ah right, kiln - Deepfabric was originally named promptwright , and I can see kiln has copied over some of our code and used it for its synth-gen (which is a nice compliment!)

We are actually planning on moving to graphs now, which we are seeing better results with over trees, check it out if you also want to use them in kiln - but you might want to wait until we validate a little more and lift it out of experimental.

I think the key difference between the two since kiln adopted the same approach is the ability to generate reasoning / chain of thought and export to alpaca, chatml, etc - along with direct to unsloth.ai's formatting. I doubt we will have UI as its for running on backend systems and part of an ML pipeline along with being a library / SDK.

scosman8mo ago

I personally wrote Kiln's SDG code myself -- no code was copied from here or anywhere else. Not sure where that claim is coming from, but it's not accurate.

I might have taken some of the prompts and modified them. I didn't recognize the new name, do recognize the old one.

Edit:

- just confirmed. No code copied. Prompts were originally from the Pluto library, then modified by the library above, then modified again by me for Kiln.

- And just to clarify, Kiln has had supported for chain of thought, reasoning, and all major export formats (ChatML/Unsloth/OpenAI/Hugging Face). Plus API integrations with Together, Fireworks, OpenAI, Google Vertex.

People should try both. I just want to clear on the origins of the code/prompts, and the feature set.

1 more reply

dcreater8mo ago· 2 in thread

are their good synthetic data sets generated from DeepFabric publicly available?

decodebytesOP8mo ago

sure, just starting to get some up on HF. A good example might be GSM8K as this shows the structured output where every result is strictly formatted - I am using this right now to train models and managaing to get a small qwen model up in the 60% range, which wildly is higher then llama2 and xAI Grok 1

GSM8K: https://huggingface.co/datasets/lukehinds/deepfabric-GSM8K-c...

also some others

infra failures reasoning / CoT: https://huggingface.co/datasets/lukehinds/deepfabric-devops-...

Medical (multi-turn): https://huggingface.co/datasets/lukehinds/deepfabric-7k-medi...

Programming challenges: https://huggingface.co/datasets/lukehinds/programming-challe...

If there is anything in particular you need, drop me a message or feel free to open an issue and I can create something for you.

dcreater8mo ago

Thanks, what LLMs were used to create these?

1 more reply

crashabr8mo ago· 1 in thread

How easy it is to pass an existing db schema to this library in order to generate a testable synthetic dataset?

decodebytesOP8mo ago

I would love to learn more and have a try, I figure you can dump out to txt or csv -

you can raise and issue and I will certainly give it a go - or also reach me via the discord link on the main repo. Let's see what we can do.

bumseltagbaerbi8mo ago

"Synthetic CDOs"

j / k navigate · click thread line to collapse

19 comments

13 comments · 5 top-level

jossclimb8mo ago· 3 in thread

How is the diversity, duplication?

decodebytesOP8mo ago

evgen8mo ago

This dataset disappeared. Did it move or get pulled for some reason? (glanced at it when you noted this and went back today to check it out and found a 404...)

tehryanx8mo ago

based on the description, I think it's using something similar to GLAN https://arxiv.org/abs/2402.13064

scosman8mo ago· 2 in thread

It's a free app on GitHub. Docs and videos: https://docs.kiln.tech/docs/synthetic-data-generation

decodebytesOP8mo ago

Ah right, kiln - Deepfabric was originally named promptwright , and I can see kiln has copied over some of our code and used it for its synth-gen (which is a nice compliment!)

scosman8mo ago

I personally wrote Kiln's SDG code myself -- no code was copied from here or anywhere else. Not sure where that claim is coming from, but it's not accurate.

I might have taken some of the prompts and modified them. I didn't recognize the new name, do recognize the old one.

Edit:

- just confirmed. No code copied. Prompts were originally from the Pluto library, then modified by the library above, then modified again by me for Kiln.

People should try both. I just want to clear on the origins of the code/prompts, and the feature set.

1 more reply

dcreater8mo ago· 2 in thread

are their good synthetic data sets generated from DeepFabric publicly available?

decodebytesOP8mo ago

GSM8K: https://huggingface.co/datasets/lukehinds/deepfabric-GSM8K-c...

also some others

infra failures reasoning / CoT: https://huggingface.co/datasets/lukehinds/deepfabric-devops-...

Medical (multi-turn): https://huggingface.co/datasets/lukehinds/deepfabric-7k-medi...

Programming challenges: https://huggingface.co/datasets/lukehinds/programming-challe...

If there is anything in particular you need, drop me a message or feel free to open an issue and I can create something for you.

dcreater8mo ago

Thanks, what LLMs were used to create these?

1 more reply

crashabr8mo ago· 1 in thread

How easy it is to pass an existing db schema to this library in order to generate a testable synthetic dataset?

decodebytesOP8mo ago

I would love to learn more and have a try, I figure you can dump out to txt or csv -

you can raise and issue and I will certainly give it a go - or also reach me via the discord link on the main repo. Let's see what we can do.

bumseltagbaerbi8mo ago

"Synthetic CDOs"

j / k navigate · click thread line to collapse