The key here is that a binary model is just a bag of floats with primitively typed inputs and outputs.
It's ~impossible to write up more than what's here because either:
A) you understand reverse engineering and model basics, in which case the current content makes it clear that you'd use Frida to figure out how the arguments are passed to TensorFlow,
or
B) you don't understand that this is a binary reverse engineering problem, even when shown Frida. If more content were provided, you'd see it as specific to a particular problem, which it has to be. You'd also need a hand-held walkthrough of batching, tokenization, and so on: too much for a write-up, and it'd be too confusing to follow for another model.
TL;DR: a request for more content is asking a reverse engineering article to give you a full education in model inference.
An analogous situation is seeing a blog post that purports to "show you code", where the code returns an object, and commenting "This is cool, but it doesn't show you how to turn a function return value into a human-readable format." More noise than signal.
The techniques in the article trivially extend to discovering the input tokenization format, and Netron shows you the types of the inputs and outputs.
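The "bag of floats" framing is easy to demonstrate with the standard library alone: once Netron has told you the tensor shapes and dtypes, a weights buffer is just packed primitives. A toy sketch (the flat little-endian float32 layout here is an assumption for illustration; real formats like TFLite wrap the floats in FlatBuffer metadata):

```python
import struct

# Hypothetical "model file": nothing but packed little-endian float32s.
weights = [0.5, -1.25, 3.0]
blob = struct.pack("<3f", *weights)            # 12 bytes on disk

# Recovering the floats needs only the layout, no ML framework at all.
recovered = list(struct.unpack("<3f", blob))
```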
Thanks for the article OP, really fascinating.
E.g. there are two common spellings of "bizarre" that are popular on Gboard: "bizzare" and "bizarre".
Can something similar help in model encryption?
they also claim unsupervised users typing away is better than tagged training data, which explains the wild grammar suggestions on the top comment. guess the age of quantity over quality is finally peaking.
in the end it's the same as grammarly, but without any verification of the ingested data, and calling the collection of user data "federation"
So, e.g.: Apple CoreML issues a public key, the model is encrypted with that public key, and somewhere in a trusted computing environment the model is decrypted using the private key and then inferred.
They should of course use multiple keypairs etc. but in the end this is just another obstacle in your way. When you own the device, root it or even gain JTAG access to it, you can access and control everything.
And matrix multiplication is a computationally expensive process; I doubt they'd add some sort of encryption technique to each and every cycle.
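For scale, a rough FLOP count for a single dense multiply shows why per-operation encryption is a non-starter:

```python
def matmul_flops(m, k, n):
    # A dense (m x k) @ (k x n) multiply does one multiply-add per
    # (i, j, l) triple: roughly 2*m*k*n floating-point operations.
    return 2 * m * k * n

# One 4096x4096 layer multiply is ~1.4e11 FLOPs; wrapping each of them
# in even a cheap cipher round would dwarf the useful work many times over.
per_layer = matmul_flops(4096, 4096, 4096)
```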
You usually run inference for those on GPUs anyway, and GPUs usually have some kind of hardware DRM support for video already.
The way hardware DRM works is that you pass some encrypted content to the GPU and get a blob containing the content key from somewhere, encrypted in a way that only this GPU can decrypt. This way, even if the OS is fully compromised, it never sees the decrypted content.
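The key flow described above is envelope encryption. A toy stdlib-only sketch of the structure (the SHA-256 keystream stands in for AES and is not secure; in real DRM the device key is fused into the hardware, not a Python variable):

```python
import hashlib, secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Toy stream cipher: XOR the data with a SHA-256-derived keystream.
    # Illustrative only -- do not use for real secrecy.
    out, counter = bytearray(), 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

device_key = secrets.token_bytes(32)    # known only to the GPU/enclave
content_key = secrets.token_bytes(32)   # per-title key

# The OS only ever handles these two ciphertexts:
wrapped_key = keystream_xor(device_key, content_key)
ciphertext = keystream_xor(content_key, b"model weights or video frames")

# Inside the trusted hardware: unwrap the content key, then decrypt.
plaintext = keystream_xor(keystream_xor(device_key, wrapped_key), ciphertext)
```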
My general writing style is directed mainly towards my non-technical colleagues, whom I wish to inspire to learn about computers.
This is no novelty by far; it's a pretty standard use case of Frida. But I think many people, even software developers, don't grasp the concept of "what runs on your device is yours, you just don't have it yet".
Especially in mobile apps, many devs get sloppy with their mobile APIs because you can't just open the developer tools.
(And a bunch of people seem to be interested in the "IP" note, but I took it as just trying not to run into legal trouble for advertising "here's how you can 'steal' models!")
It's supposed to be "free-IDA" and the work put in by the developers and maintainers is truly phenomenal.
EDIT: This isn't really an attack, imo. If you are going to take "secrets" and shove them into a mobile app, they can't really be considered secret. I suppose it's a tradeoff: if you want to do this kind of thing client-side, the secret sauce isn't so secret.
Is it ironic or missing a /s? I can't really tell here.
If you use one of those AI models as a basis for your AI model, the real danger is that the owners of the originating data could come after you at some point as well.
If you actually expected anything to be open about OpenAI's products, please get in touch, I have an incredible business opportunity for you in the form of a bridge in New York.
So if a model is copyrighted, you should still be able to use it if you generate a different one based on it. I.e., copyright laundering. I assume this would be similar to how fonts work: you can copyright a font file, but not the actual shapes. So if you re-encode the shapes with different points, that's legal.
But, I don't think a model can be copyrighted. Isn't it the case that something created mechanically can't be copyrighted? It has to be authored by a person.
I find it weird that so many hackers go out of their way to approve of the legal claims of Big AI before it's even settled, instead of undermining Big AI. Isn't the hacker ethos all about decentralization?
I understand that it's not very clear whether a neural net and its weights & biases are considered IP. I personally think that if some OpenAI employee just leaks GPT-4o, it isn't magically public domain and everyone can just use it. I think lawyers would start suing AWS if it just re-hosted ChatGPT. Not that I endorse it, but especially in IP, and in law in general, "judge law" ("Richterrecht" in German) is prevalent, and laws are not a DSL with a few ifs and whiles.
But it is also a "cover my ass" notice as others said, I live in Germany and our law regarding "hacking" is quite ancient.
The simple fact that models are released under licenses, which may or may not be free, implies that they are intellectual property. You can't license something that is not intellectual property.
It is a standard disclaimer, if you disagree, talk to your lawyer. The legal situation of AI models is such a mess that I am not even sure that a non-specialist professional will be of great help, let alone random people on the internet.
1. the current, unproven-in-court legal understanding;
2. a standard disclaimer to cover OP's ass;
3. a tongue-in-cheek reference to the prevalent argument that training AI on data, and then offering it via AI, is being a parasite on that original data.
Prevalent or not, phrased this way it's clear how nonsensical it is. The data isn't hurt or destroyed in the process of being trained on, nor does the process deprive the data owners of their data, or of the opportunity to monetize it the way they ordinarily would.
The right terms here are "learning from", "taking inspiration from", not "being a parasite".
(Now, feeling entitled to rent because someone invented something useful and your work accidentally turned out to be useful, infinitesimally, in making it happen - now that is wanting to be a parasite on society.)
(1) That they are actually protected by copyright in the first place, or
(2) That the particular act described does not fall into an exception to copyright like fair use, exactly as many model creators assert that the exact same act done with the materials models are trained on does, rendering the restrictions of the license offered moot for that purpose.
An example for legal reference might be convolution reverb. Basically, it's a way to record what a fancy reverb machine does (using copyrighted, complex math algorithms) and cheaply recreate the reverb on my computer. It seems companies can do this as long as they distribute the protected reverbs separately from the commercial application. So Liquidsonics (https://www.liquidsonics.com/software/) sells reverb software but offers for free download the 'protected' convolution reverbs, specifically the Bricasti ones in dispute (https://www.liquidsonics.com/fusion-ir/reverberate-3/).
Also, while a SQL server can be copyright protected, copyright in a SQL database is not, by extension, granted to the SQL server software's creators.
The author only mentions APK for Android, but what about iOS IPA? Is there an alternative method for handling that archive?
But Objective-C code is actually compiled, and decompilation is a lot harder than with the JVM languages on Android.
My next article will be about CoreML on iOS, doing the same exact thing :)
Can't wait - thanks for writing it up!
As for the copyright treatment? As far as I know it’s a bit up in the air at the moment. I suspect that the major frontier vendors would mostly contend that training data is fair use but weights are copyrighted. But that’s because they’re bad people.
Basically it's just a synthetic loop: using a previously-SOTA model like GPT-4 to train your model.
This can produce models with seemingly similar performance at a smaller size, but to some extent, fewer bits will be less good.
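A related way to see the "fewer bits, less good" trade-off (quantization rather than distillation, but the same size/fidelity tension) is squeezing float weights into int8. A toy sketch, assuming the largest weight magnitude is nonzero:

```python
def quantize(ws, bits=8):
    # Symmetric quantization: map floats onto signed (bits)-bit integers.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in ws) / qmax   # assumes max(|w|) > 0
    return [round(w / scale) for w in ws], scale

def dequantize(q, scale):
    # Reconstruction is off by up to ~scale/2 per weight: the lost bits.
    return [v * scale for v in q]
```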
I'm not quite sure I understand the Firebase feature, btw. From the docs, it's pretty much file storage with a dedicated API? I suppose you can use those models for inference in the cloud, but still, the storage API seems redundant.
This works, just like ChatGPT works, but has downsides: 1. You have to pay for the compute on every inference. 2. Your users can't access it offline. 3. Your users will burn through a lot of data from their mobile network operator. 4. Your inference will be slower.
And since SeeingAI runs inference on the model every second, your and your customers' bills will be huge.
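A back-of-envelope for point 1, using a purely hypothetical price per thousand cloud predictions (real rates vary by provider and model):

```python
price_per_1000_calls = 0.10          # USD -- assumed, not a real quote
calls_per_month = 60 * 60 * 24 * 30  # one inference per second, all month

# ~2.6M calls -> ~$259 per always-on user per month at this made-up rate.
monthly_cost_per_user = calls_per_month / 1000 * price_per_1000_calls
```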
The file is only 7.7 KB, so it couldn't contain many weights anyway.
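That capacity claim checks out with simple arithmetic: at four bytes per float32 parameter (one byte if int8-quantized), 7.7 KB tops out at a couple of thousand weights:

```python
file_bytes = 7.7 * 1024            # 7.7 KB, taking 1 KB = 1024 bytes
max_f32_weights = file_bytes / 4   # float32 is 4 bytes/weight -> ~1971
max_int8_weights = file_bytes / 1  # int8 best case -> ~7885
```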
And before you knee-jerk "it's a compression algo!", I invite you to archive all your data with an LLM's "compression algo".
Neither am I, yet, I am still capable of reproducing copyrighted works to a level that most would describe as illegal.
> And before you knee-jerk "it's a compression algo!"
It's literally a fundamental part of the technology so I can't see how you call it a "knee jerk." It's lossy compression, the same way a JPEG might be, and simply recompressing your picture to a lower resolution does not at all obviate your copyright.
> I invite you to archive all your data with an LLMs "compression algo".
As long as we agree it is _my data_ and not yours.
And is technically copyright infringement outside fair use exceptions.
At the start of the tape, there was a copyright notice forbidding the VHS tape from being played at, amongst other places, schools.
Copyright rules are a strange thing.
as an example: saying “i really like james holden’s inheritors album for the rough and dissonant sounds” isn’t covered by copyright.
if i reproduced it verbatim using my mouth, or created a derived work which is noticeably similar to the original, that’s a different question though.
in your example, a derivative work could be akin to only quoting from the book for the audience and modifying a word of each quote.
“derived” works are always a grey area, especially around generative machine learning right now.
most other places use fair dealing which is more restrictive https://en.m.wikipedia.org/wiki/Fair_dealing
Unless all LLM are a ruthless parody of human intelligence, which they may be, the legal issues will continue.
- Addenda -
For the interested parties, the law states the following [0].
Notwithstanding the provisions of sections 17 U.S.C. § 106 and 17 U.S.C. § 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include:
1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
2. the nature of the copyrighted work;
3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
4. the effect of the use upon the potential market for or value of the copyrighted work.
The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.

So, if you say that these factors can be flexed depending on the defendant, and can just be waved away to protect the wealthy, then it becomes something else. But given these factors, and how damaging this "fair use" is, I can certainly say that training AI models on a copyrighted corpus is not fair use in any way.
Of course, at the end of the day, IANAL & IANAJ. However, my moral compass directly bars the use of a copyrighted corpus in publicly accessible, for-profit models that deprive many people of their livelihoods.
From my perspective, people can whitewash AI training as they see fit to sleep sound at night, but this doesn't change anything from my PoV.
[0]: https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors
LLMs are not humans. They're essentially a probabilistic compression algorithm (encode data into model weights, decode with a prompt to retrieve data).
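A toy version of that encode/decode framing: even a trivial character-bigram table "trained" on a string regenerates it verbatim from a one-character prompt. (Real LLMs are vastly larger and their recall far less reliable; this illustrates the mechanism, not an equivalence.)

```python
def train(text):
    # The "weights": each character maps to its first observed successor.
    table = {}
    for a, b in zip(text, text[1:]):
        table.setdefault(a, b)
    return table

def decode(table, prompt, length):
    # The "prompt" greedily retrieves the memorized continuation.
    out = prompt
    while len(out) < length and out[-1] in table:
        out += table[out[-1]]
    return out
```

Here `decode(train("copyright"), "c", 9)` returns "copyright" exactly: the training data survives, reshaped, inside the table.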
Although with the 'good news everyone, we built the torment nexus' trajectory of AI, my guess is that at this point AI companies would just incorporate actual human brains instead of digital storage if that were the requirement.
Is that really true? Is the law settled in this area? Is it the same everywhere, or does it vary from jurisdiction to jurisdiction?
Paper: https://arxiv.org/pdf/2204.03738
Code: https://github.com/microsoft/banknote-net Training data: https://raw.githubusercontent.com/microsoft/banknote-net/ref...
model: https://github.com/microsoft/banknote-net/blob/main/models/b...
Kinda easier to download it straight from GitHub.
It's licensed under the MIT and CDLA-Permissive-2.0 licenses.
But let's not let that get in the way of hating on AI, shall we?
Can you please edit this kind of thing out of your HN comments? (This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.)
It leads to a downward spiral, as one can see in the progression to https://news.ycombinator.com/item?id=42604422 and https://news.ycombinator.com/item?id=42604728. That's what we're trying to avoid here.
Your post is informative and would be just fine without the last sentence (well, plus the snarky first two words).
If the last sentence were explicit rather than implied, for instance
This article seems to be serving the growing prejudice against AI
Is that better? It is still likely to be controversial and the accuracy debatable, but it is at least sincere and could be the start of a reasonable conversation, provided the responders behave accordingly.
I would like people to talk about controversial things here if they do so in a considerate manner.
I'd also like to personally acknowledge how much work you do to defuse situations on HN. You represent an excellent example of how to behave. Even when the people you are talking to assume bad faith you hold your composure.
I would also like to point out that this is a fine-tuned classifier vision model based on MobileNetV2, and not an LLM.
I was rather interested in the process of instrumenting TF to make this "attack" scalable to other apps.
The same method/strategy applies to closed tools and models too, although you should probably be careful if you've handed over a credit card for a decryption key to a service and try this ;)
That’s not true, is it? It would be a copyright violation to distribute an extracted model, but you can do what you want with it yourself.
Can you copyright a set of coefficients for a formula? With a JPEG, it's considered that the image being reproduced is the thing that holds the copyright. Being the first to run the calculations that produce a compressed version of that data should not grant you any special rights to that compressed form.
An AI model is just a form of that writ large. When the models generalize and create new content, it seems hard to see how either the output or the model that generated it could be considered someone's property.
People possess models, I'm not sure if they own them.
There are however billions of dollars at play here and enough money can buy you whichever legal opinion you want.
If companies train on data they don't own and expect to own their model weights, that's hypocritical.
Model weights shouldn't be copyrightable if the training data was pilfered.
But this hasn't been tested because models are locked away in data centers as trade secrets. There's no opportunity to observe or copy them outside of using their outputs as synthetic data.
On that subject, training on model outputs should be fair use, and an area we should use legislation to defend access to (similar to web scraping provisions).
IANAL, but I have serious doubts about the applicability of current copyright law to existing AI models. I imagine the courts will decide the same.
- Models are processes/concepts, thus not copyrightable, but are subject to trade secret law, contract and license restrictions, patents, etc.
- Concrete implementations may be copyrighted like any code.
- Parameters are "facts", thus not copyrightable, but are similarly subject to trade secret and contract law.
IANAL, not legal advice, yadda yadda yadda.
https://en.m.wikipedia.org/wiki/Chamberlain_Group,_Inc._v._S...
If that is the law, it is a defect that we need to fix. Laws do not come down from heaven in the form of commandments. We, humans, write laws. If there is a defect in the laws, we should fix it.
If this is the law, time shifting and format shifting is unlawful as well which to me is unacceptable.
Disclaimer: As usual, I anal.
> Circumventing a copy-prevention system without a valid exemption is a crime, even if you don't make unlawful copies.
Yep, this is the DMCA section 1201. Late '90s law in the US.
> Copyright covers the right to make copies, not the right to distribute
This is where I got confused. Copyright covers four rights: copying, distribution, creation of derivative works, and public performance. So I'm not sure what you were getting at with the copy/distribute dichotomy.
But here's a question I'm curious about: Can DMCA apply to a copy-protection mechanism that's being applied to non-copyrightable work? Based on my reading of https://www.copyright.gov/dmca/:
> First, it prohibits circumventing technological protection measures (or TPMs) used by copyright owners to control access to their works.
That's not the letter of the law, but an overview, but it does seem to suggest you can't bring a DMCA 1201 claim against someone circumventing copy-protection for uncopyrightable works.
> Whether or not model weights are copyrightable remains an open question.
And this is where the interaction with the wording of 1201 gets interesting, in my (non-professional) opinion!
so this is where it'd all go in several years, if i were the gov.
(OTOH, whether models, as the output of a mechanical process, are subject to copyright is a matter of some debate. The firms training models tend to treat the models as if they were protected by copyright but also tend to treat the source works as if copying for the purpose of training AI were within a copyright exception; why each of those positions is in their interest is obvious, but neither is well-established.)
Additionally, the debate around the sources companies use to train their models remains unresolved, raising ethical and legal questions about data ownership and consent.
AI models are generally regarded as a company’s asset (like a customer database would also be), and rightly so given the cost required to generate one. But that’s a different matter entirely to copyright.
Laundering IP. FTFY.
If the weights and biases contained in "AI models" are proprietary, then for one model owner to detect infringement by another model owner, it may be necessary to download and extract them.