dpleban | Better HN

dpleban

18 karma | Joined March 7, 2019 | 52 submissions

CEO and Co-Founder of DagsHub. Talk to me about physics, ML, or design.

Recent submissions

Show HN: An ML-Oriented Alternative for Google Drive in Colab

(colab.research.google.com)

1 point | dpleban | 2y ago | 0 comments

Active Learning with Domain Experts

We recently worked on a project with a domain expert and felt there were lessons worth sharing with the community.

In our case, the domain expert was a dentist who reached out to us to help him create a machine learning model that would segment teeth in panoramic X-rays. He had some data pre-labeled, but the vast majority of his dataset was unlabeled.

Since labeling these X-rays is time-consuming and requires domain knowledge, we decided to use Active Learning.
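
For readers new to the idea, here is a minimal sketch of one common Active Learning strategy, least-confidence uncertainty sampling. It is not the pipeline from the post. It assumes a classifier that outputs class probabilities per image, and the function name, sample ids, and probabilities are all hypothetical. The model's least certain samples are sent to the expert first, so labeling time goes where it is likely to help most.

```python
def least_confident(probabilities, k):
    """Return the ids of the k samples whose top class probability is lowest."""
    # Confidence of a sample = the model's highest class probability for it.
    confidence = {sample_id: max(probs) for sample_id, probs in probabilities.items()}
    # Sort ascending so the least confident samples come first.
    return sorted(confidence, key=confidence.get)[:k]


# Hypothetical model outputs for four unlabeled images (illustrative values only).
unlabeled_predictions = {
    "xray_001": [0.98, 0.02],
    "xray_002": [0.55, 0.45],
    "xray_003": [0.70, 0.30],
    "xray_004": [0.51, 0.49],
}

# Pick the two images the expert should label next.
to_label = least_confident(unlabeled_predictions, k=2)
print(to_label)
```

A segmentation model like the one described would need a per-image uncertainty score (for example, aggregated over pixels) in place of the single top-class probability used here.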

Following our success in creating an Active Learning pipeline in a Jupyter Notebook using Data Engine, we created a new Tooth Fairy project, which expands on that and brings even more capabilities into the notebook.

https://dagshub.com/blog/active-learning-with-domain-experts-a-case-study/

Check out our post and learn:

* Why and when you should use Active Learning
* How to efficiently work with domain experts (and mistakes to avoid!)
* What a real use-case Active Learning pipeline looks like, by checking out the accompanying repo

Curious to get your input on this

1 point | dpleban | 2y ago | 0 comments

GPTs and Large Language Models in Production with Hamel Husain

(podcasters.spotify.com)

1 point | dpleban | 3y ago | 0 comments

Show HN: I built a choose your adventure (dataset and model) for Hugging Face

(colab.research.google.com)

2 points | dpleban | 3y ago | 0 comments

Operationalizing ML: An Interview Study

(arxiv.org)

3 points | dpleban | 3y ago | 1 comment

Ask HN: What does Machine Learning production look like in your case?

Hey HN! I'm working on a talk/blog on a related topic and was curious to see what the case is for people here. I think each case is a bit different so I don't want to constrain people too much.

If you deploy ML to production in your group/team/company – what does production mean for you?

Examples:

- "We run a model once a week that predicts some stuff and stores it in a table, then the customer queries it"
- "We create an inference endpoint on some cloud resource, which our product/users use to predict poses in videos"
- "I wish I knew, we're still figuring it out"
- "We deploy a model as part of a larger pipeline in a system of microservices (and other buzzwords)"

Also, if you are in an extra-sharing mood – in your version of production, were there any counter-intuitive things you learned when you first set up the pipeline?

Cheers! Enjoy the picture Dall-E2 made for you of a cat asking for upvotes in return. https://labs.openai.com/s/2enTplV9c9OxU7lyqhyIjXlN

2 points | dpleban | 3y ago | 0 comments

Part 2: Don't Buy Cord; Build It Yourself

(cord.com)

5 points | dpleban | 4y ago | 1 comment

Gods Go Data Structures

(github.com)

1 point | dpleban | 4y ago | 0 comments

Large-Scale WebGL-Powered Geospatial Data Visualization Tool

(kepler.gl)

2 points | dpleban | 4y ago | 0 comments

Show HN: Play Wordle in Google Colab OR Build an AI agent to do it

(colab.research.google.com)

1 point | dpleban | 4y ago | 0 comments

Unpopular Opinion: Agile is the only way to run Data Science projects

(laszlo.substack.com)

2 points | dpleban | 4y ago | 0 comments

Most important metrics for labeling data

There are a lot of great articles about measuring how well data annotators agree on labels, like this one: https://towardsdatascience.com/the-definite-guide-for-creating-an-academic-level-dataset-with-industry-requirements-and-6db446a26cb2

I see a lot of mentions of Cohen’s kappa, Krippendorff’s alpha, Fleiss’ kappa, comparing to a predefined ground truth, etc.
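
To make the first of these concrete, here is a minimal sketch of Cohen's kappa for two annotators, written from the standard definition (observed agreement corrected for chance agreement). The function name and the toy labels are made up for illustration. In practice a library implementation such as scikit-learn's `cohen_kappa_score` is probably the safer choice.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Toy example: two annotators, six items (illustrative labels only).
annotator_1 = ["cat", "cat", "dog", "dog", "cat", "dog"]
annotator_2 = ["cat", "dog", "dog", "dog", "cat", "cat"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 4 of 6 items (about 0.67), but with balanced labels half of that agreement is expected by chance, so kappa comes out at about 0.33.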

If you’re managing an annotation process in your organization, how do you evaluate your annotators, and what challenges have you faced in the process?

As a side note, is anyone using programmatic labeling in a real dataset? Thoughts?

2 points | dpleban | 4y ago | 0 comments

Supporting ML Reproducibility

TL;DR: We announced that we'd support the Reproducibility Challenge with $500 per paper reproduced. Now we're announcing the award winners for the Spring '21 edition, as well as our support for the Fall '21 edition of the challenge. Check out the awesome papers below.

---

Hey HN! Creator of DagsHub here. We really care about reproducibility. That's why, a while back, we announced our support for the Papers with Code ML Reproducibility Challenge, and that we'd award participants $500 per paper reproduced (according to the guidelines), to align incentives and put our money where our mouth is!

Today, I'm really happy to share the teams that were given the award, and the projects they worked on – read the full blog here: https://dagshub.com/blog/ml-reproducibility-challenge-spring-2021/

I honestly think the full read is interesting and worth your time, but here are the highlights from the papers:

1. Contextual Decomposition Explanation Penalization (CDEP) – The original paper proposes a method to reduce the chance of models learning spurious correlations instead of the features that actually matter. The team that reproduced it re-implemented the original project in TensorFlow, rewriting some functions completely from scratch! Along the way, they made a contribution to the TensorFlow Addons repo.

2. Self-supervision for Few-shot Learning – As its name suggests, this paper tests the importance of self-supervised learning in few-shot learning contexts. The team that reproduced it explored input configurations other than the one proposed in the paper, and found that they significantly affect performance.

3. GANSpace: Discovering Interpretable GAN Controls – A proposed method that uses "simple" PCA to create controls for GANs that are more humanly interpretable while being more computationally efficient. The team re-implemented the original implementation in TensorFlow and trained the model on a few benchmark datasets. They have a lot of very cool examples of the method in their report.

Thank you to everyone who took part in this challenge! None of this would be possible without you, and we learned a lot in the process!

So what's next? We've decided to continue our support for the Fall 2021 edition of the Reproducibility Challenge! We want to host more reproduced papers, since this makes the ML field better for everyone.

If you want to take part and move the field forward on the reproducibility front, check out the guidelines for more information on how to take part: https://dagshub.com/DAGsHub-Official/reproducibility-challenge/wiki/ML+Reproducibility+Challenge+Fall+2021

2 points | dpleban | 4y ago | 0 comments

Community Sourced Open Audio Datasets – Hacktoberfest 2021

Hey HN! For Hacktoberfest this year, we wanted to do something that was geared towards the ML community, and we decided to create an open-source catalog of audio datasets.

The response has been truly amazing! We received 40 dataset contributions, which are now publicly available, and viewable on DagsHub. They cover various tasks, languages, and sizes, and you can use them all for your projects.

If you want to check out the list of datasets: https://dagshub.com/blog/hacktoberfest-2021-open-source-audio-datasets/. I can't wait to see what everyone builds with these.

A huge THANK YOU to everyone who participated! You are what made this possible! The fact that Hacktoberfest is over doesn't mean you can't continue contributing. We'd love to see more datasets, both in the audio domain and others.

11 points | dpleban | 4y ago | 0 comments

Supporting Hacktoberfest for ML Datasets

Hey HN community! Creator of DagsHub here. Hacktoberfest 2021 is well underway, but there's still a lot of time left, and I felt there weren't enough opportunities to contribute to the community on the ML/DS front.

We've decided to support Hacktoberfest by creating an open-source catalog of datasets in the audio domain. The idea is to have a bunch of completely open-source audio datasets that you can view, visualize (waveforms, spectrograms, etc.), and download to use in your projects. Check out this dataset that I created as an example: https://dagshub.com/DagsHub/Librispeech-ASR-corpus/src/master/dev-clean/84/121123/84-121123-0000.flac.

You can read the full guidelines here: https://dagshub.com/blog/hacktoberfest-x-dagshub-2/ Would be happy to answer questions, but I think if you're passionate about open-source ML, this is a great opportunity to contribute.

1 point | dpleban | 4y ago | 0 comments
