How much public money, time, and careers have been wasted chasing something that is already known not to work?
It seems better than current work both overall and per parameter, on both relative and absolute depth measurements.
Is there any research people are aware of that provides sub-mm level models? For 3D modeling purposes? Or is "classic" photogrammetry still the best option there?
We had a whole workshop on various monitoring technologies, and the take-home on the video tools was that having highly trained grad students and/or techs watch and analyze the video is extremely slow and expensive.
I haven't worked with video in a while now, but I wonder if any labs are doing more automated identification these days. It feels like the kind of problem that is probably completely solvable if the right tech gets applied.
I work at an industrial plant, and we have been able to measure a lot of things simply by analyzing the pixels in the video. For example, in one application we have a camera pointed down at a conveyor belt. The conveyor belt is one color, and objects on the belt are a distinctly different color.
- We just count how many pixels in a given frame are a specific color/brightness. Then you can easily work out how much of the conveyor belt has material on it in any given frame.
So if you are trying to work out which sections of a video have fish in them, you could count how many pixels differ from the normal background color.
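The pixel-counting idea above can be sketched in a few lines of NumPy. This is a toy illustration with a synthetic frame and a made-up tolerance value, not the plant's actual pipeline; on a real camera feed you'd tune the tolerance for lighting and sensor noise.

```python
import numpy as np

# Hypothetical setup: a 100x100 RGB frame where the "belt" is a uniform
# gray and an object on it is a distinctly different color.
belt_color = np.array([90, 90, 90], dtype=np.uint8)
frame = np.tile(belt_color, (100, 100, 1))
frame[20:40, 30:70] = [200, 120, 40]  # object covering part of the belt

# Count pixels whose color is far from the belt color. The tolerance is
# an assumed knob you would tune for real lighting conditions.
tolerance = 30
diff = np.abs(frame.astype(int) - belt_color.astype(int)).max(axis=2)
object_pixels = int((diff > tolerance).sum())
coverage = object_pixels / (frame.shape[0] * frame.shape[1])
print(f"{object_pixels} object pixels, {coverage:.1%} of the belt covered")
```

The same per-frame count, applied over time, is what lets you flag which sections of a video contain something other than background.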
Did they have depth maps for all 62 million images or not?
Any FSD startup that put their money on LiDAR is even more screwed now.
Computer vision has 1-2 of those three, and I don't think we are near an AGI for self driving yet. Driving is IMO, an AGI level task.
Does your dataset have a crocodile in it? Does your monocular depth model get fooled by a billboard that's just a photo?
This is actually a pretty clever example. I tried a few billboards on the online demo, and since these are regression models that effectively output the mean of the plausible depths, the model sometimes seems perplexed: it doesn't know whether to output something completely flat or something with real depth, so it outputs something in between.
How well would a monocular system cope with headlights moving toward it at night? How about in rain, snow, or fog?
I'm not saying LiDAR is the only way, but I don't see a reason to use this as a solution.
I'm not saying this isn't valuable. I used to work in the 3D/metaverse space, and having depth from a single photo, and being able to recreate a 3D scene from it, is very valuable, and is the future.