Why is it impossible to reverse-engineer closed source software?

gedaliyah@lemmy.world · 4 months ago

Why is it impossible to reverse-engineer closed source software?

KISSmyOS@feddit.org · edit-2 4 months ago

The first programs were written in binary/hexadecimal, and only later did we invent coding languages to convert between human readable code and binary machine code.

The first programs were thousands of times smaller and less complicated.

So why can’t we just do the same thing in reverse?

We can.

Couldn’t a very smart person (or AI) just take the existing program and turn it into code?

No, you’d need a team of experienced developers and lots of time.
So much time that the target software you’re trying to reverse engineer usually moves faster than you can catch up.
So you’re constantly falling further behind the current state of what people want to use.
And no one will give you money for your effort. And if you do manage to become successful and make money, a swarm of lawyers will descend upon you.

mesamune@lemmy.world · 4 months ago

It’s taken years for devs to decompile Zelda, let alone other projects. It’s crazy how much work goes into such projects.

foggy@lemmy.world · edit-2 4 months ago

It is not. idk who told you it was.

Disassembling an executable is trivial to do. Everything is open source if you can read assembly. Obfuscation be damned.

LavenderDay3544@lemmy.world · 4 months ago

The hard part isn’t reading assembly. The hard part is figuring out why it’s doing what it’s doing with no comments or function names or anything useful to help.

This is like saying if you can read English you can understand an advanced math or physics paper written in English without having any knowledge or context of those subjects.

Thorry84@feddit.nl · 4 months ago

Well, decompiling is only one step in the reverse engineering process. I would recommend taking a look at the Legend of Zelda: Ocarina of Time decompilation project. They reverse engineered the whole thing, which took years and was a team effort.

In the end they got perfectly readable source code, fully documented. And the most amazing thing is, when compiled with the right compiler and the right flags, it recreates the original ROM perfectly.

I would also recommend a YouTuber called Kaze. He’s been working on Mario 64 for years, re-writing large parts of the engine to get some pretty cool stuff going.

Lemminary@lemmy.world · 4 months ago

I’ve used a decompiler to peek at the source code of an app written in Visual Basic I wanted to recreate as a browser addon. It was mostly successful but some variable and function names were messed up.

peopleproblems@lemmy.world · 4 months ago

Variable names, class names, package structure, method names, etc. won’t normally be maintained in the disassembled code. They are meaningless to the CPU, which only deals with a series of memory addresses. In cases where you do see method names, it’s likely an external call into an existing library rather than the program’s own code. I’m not familiar with VB, but at least in .NET and .NET Framework, this would be something like System.Collections.Generic providing the implementation for List<string>: when .Sort() is called, the call goes into that compiled .dll.
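To make the point about names concrete, here is a rough sketch that uses CPython bytecode as a stand-in for machine code (that substitution is an assumption for illustration; native code behaves differently in detail). The two functions are hypothetical and differ only in their identifiers, yet the instruction bytes should come out identical, because the instructions refer to local variables by slot number, not by name.

```python
def add_coin(coin_count, life_count):
    # Readable version: names describe what the values mean.
    coin_count += 1
    if coin_count % 100 == 0:
        life_count += 1
    return coin_count, life_count


def f(a, b):
    # Same logic with meaningless names.
    a += 1
    if a % 100 == 0:
        b += 1
    return a, b


# The instruction stream addresses locals by index, so renaming every
# identifier leaves the bytes unchanged.
same_instructions = add_coin.__code__.co_code == f.__code__.co_code
print(same_instructions)

# CPython happens to keep the names in a side table. A stripped native
# binary usually keeps nothing comparable.
print(add_coin.__code__.co_varnames, f.__code__.co_varnames)
```

Python is unusually generous here since the names survive as metadata. With a stripped native executable, only the equivalent of the instruction bytes is left.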

Naich@lemmings.world · 4 months ago

You could chuck it at an AI to reverse compile it into something readable.

peopleproblems@lemmy.world · 4 months ago

Instead of just getting the down votes, I’ll explain why that wouldn’t work.

The AI itself cannot decompile it without the same tools I would use. The AI would then end up with the same starting spot I have.
Current LLMs do not know how to interpret code logic, and would likely make mistakes in syscalls, register addresses, and instructions.
Assembly languages themselves offer nothing beyond the instruction set. I’m sure there are ways to organize code in the super rare case of actually writing assembly, but not to the effect of object oriented or functional programming.

Lastly, other comments have pointed out decompiled code is extremely expensive to analyze. The output from whatever we decompile would easily exceed the input limits for all existing LLMs.

Naich@lemmings.world · 4 months ago

Thanks. I was thinking that you could have an AI “looking over the shoulder” of a compiler, seeing what comes out for the code going in to it. Basically training it to spot sequences in compiled code in order to guess the instructions that compiled into that code.
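A very rough sketch of what collecting such “over the shoulder” pairs could look like, again using CPython bytecode as a stand-in for real compiler output (an assumption for illustration only). The helper name `instruction_names` and the sample snippets are made up, and this only gathers (compiled, source) pairs; it does not train anything.

```python
import dis


def instruction_names(source: str) -> list[str]:
    """Compile a snippet and return the names of the resulting instructions."""
    code = compile(source, "<sample>", "exec")
    return [ins.opname for ins in dis.get_instructions(code)]


# Hypothetical training snippets: what went into the compiler.
samples = ["x = a + b", "x = a * b", "print(a)"]

# Each pair holds what came out of the compiler and what went in.
pairs = [(instruction_names(src), src) for src in samples]

for compiled, src in pairs:
    print(compiled, "->", src)
```

Depending on the Python version, different snippets can produce the same list of instruction names, which hints at why guessing the source from the output is ambiguous.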

Norgur@fedia.io · 4 months ago

Imagine being presented with an aircraft. You bloody well know what it does and you get permission to disassemble the whole thing to your heart’s content. How big of a task do you think it’d still be to be able to work out how the winged metal tube works and why it does what it does when it does it?

Exactly.

ℕ𝕖𝕞𝕠@midwest.social · 4 months ago

We can and have done this, but there’s not much gain, which is why it’s mostly done by hobbyists to their favorite older software whose parent company went bust. It’s especially common for older games.

Toes♀@ani.social · 4 months ago

It’s not impossible, just difficult.

You could use a tool like Ghidra, for example, to study a program and work out what everything does.

https://ghidra-sre.org/

FuglyDuck@lemmy.world · edit-2 4 months ago

most software packages are ridiculously complicated. it’s not as simple as just running a decompiler and seeing code. It’s labor intensive, and the output is loaded with bugs and errors, many of which you would never catch unless you already had an idea of what was supposed to be there.
many applications rely on external services/system packages that may or may not exist on your machine.
companies take steps to protect their application from being reverse engineered, making it that much more difficult to actually pull off.
you don’t have access to the documentation/commenting that would be in the uncompiled code, turning a lot of the code into incomprehensible gibberish.
all the labor involved means it’s very likely to not pass the cost/benefit analysis. unless you’re able to add something to it, something the other guy doesn’t have, you’re not going to be getting a substantial market share. It won’t be profitable.

Contramuffin@lemmy.world · 4 months ago

Yes, and people do do it. It’s just incredibly difficult even for relatively simple programs, and the more complex the program is, the harder the reverse engineering gets.

The problem is not necessarily turning it into code, since many decompilers do it already for you nowadays. The issue is understanding what in the world the code is supposed to do. Normally, open source code would be commented and there would be documentation, so it’s easy to edit or build on the code. Decompiled code comes with no documentation or comments, and all the variable names are virtually illegible.

It’s sometimes easier to build something new than to fix what’s broken, and this would be one of those cases where it’s true

Lost_My_Mind@lemmy.world · 4 months ago

people do do it.

giggles under my breath

remotelove@lemmy.ca · 4 months ago

Apps are huge and compilers optimize the fuck out of the code. Optimized code doesn’t always make sense, so you need to have a detailed understanding of which compilers were used. There could be hundreds of libraries involved or even layers of obfuscation in some cases. Loops can be unrolled, or other bits of code optimized for specific architectures. Some of the logic won’t appear logical.

Decompilers can do a decent job converting code back to C/C++, but even then, you have to go through the code line by line, converting function names and variable names back to something meaningful that can be referenced later.

You aren’t wrong: All the code is there. It’s just a matter of putting all the human readable references back into anything you disassembled.

Waaaay back in the day, we could tear apps apart easily if they were small. There were only a few flavors of assembly and compilers were still fairly basic for what they were. Regardless, it wasn’t a small task.

I played around with cracking for a while just to learn about it and honestly, it was kinda easy before everything was offloaded to “the cloud”. It’s just a matter of tracing execution and finding a few critical comparisons or jumps to alter. Even then, it could take me a day or two just to walk through what was basically one or two functions.

gaiussabinus@lemmy.world · 4 months ago

It’s not. I believe Low Level Learning has a tutorial on tearing down binaries. If not him, John Hammond does for sure. Both are on YouTube. That skill set is usually employed in security research, since it pays more than reverse engineering old software with problematic licensing and uncertain ownership.

Björn Tantau@swg-empire.de · 4 months ago

It’s not impossible. It’s being done all the time. It’s just tedious, complicated work. So if you don’t have someone willing to invest their time and expertise, it won’t be done for most stuff.

Emily (she/her)@lemmy.blahaj.zone · edit-2 4 months ago

It’s not impossible, just very labour intensive and difficult. Compiling an abstract, high level language into machine code is not a reversible process. Even though there are already automated tools to “decompile” machine code back to a high level language, there is still a huge amount of information loss as nearly everything that made the code readable in the first place was stripped away in compilation. Comments? Gone. Function names? Gone. Class names? Gone. Type information? Probably also gone.
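As a small illustration of that information loss, here is a sketch using CPython’s own compiler as a stand-in for a native one (an assumption; the helper `instructions` and both snippets are invented for this example). Two different source texts should compile to the same instruction sequence, so nothing in the compiled form can tell you which one the author actually wrote.

```python
import dis


def instructions(source: str) -> list:
    """Return (opname, argval) pairs for the compiled form of `source`."""
    code = compile(source, "<example>", "exec")
    return [(ins.opname, ins.argval) for ins in dis.get_instructions(code)]


# What the author wrote: a comment, grouping, and a self-explaining formula.
documented = """
# Number of seconds in one day, used for the daily reset timer.
seconds_per_day = (60 * 60) * 24  # 60 s * 60 min * 24 h
"""

# What a decompiler could at best give back: the folded constant.
bare = "seconds_per_day = 86400\n"

# Comments, parentheses and the original arithmetic are all gone.
same = instructions(documented) == instructions(bare)
print(same)
```

Compilation is many-to-one, so the best a decompiler can do is pick some source that would produce the same output. Python even keeps the variable name here, which a stripped native binary would not.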

Working through the decompiled code to bring it back into something readable (and thus something that can be worked with) is not something a lone “very smart person” can do in any reasonable time. It likely takes a team of smart people months of work (if not years) to understand the entire structure, as well as every function and piece of logic in the entire program. Once they’ve done that, they can’t even use their work directly, since publishing reconstructed code is copyright infringement. Instead, they need to write extremely detailed documentation about every aspect of the program, to be handed to another, completely isolated person who will then write a new program based off the logic and APIs detailed in the documentation (the so-called clean-room approach). Only at that point do they have a legally usable reverse engineered program that they can then distribute or modify as needed.

Doing this kind of reverse engineering takes a huge amount of effort and motivation, something that an app for 350 total sneakers is unlikely to warrant. AI can’t do it either, because they are incapable of the kind of novel deductive reasoning required for the task. Also, the CarThing has actually always been “open-source”, and people have already experimented with flashing custom firmware. You haven’t heard about it because people quickly realised there was no point - the CarThing is too underpowered to do much beyond its original use.

Fondots@lemmy.world · 4 months ago

To build on/give some example about what you said with the comments and function names (programmers, excuse the sloppy pseudocode that’s about to follow, it’s been a long time since high school intro to computer science)

Let’s say in a video game, you run around collecting coins, and if you get 100 coins you earn an extra life

One small part of that code may look something like:

    IF
        newGame = TRUE
    THEN
        coinCount = 0
        lifeCount = 3
        coinModel.all.visibility = TRUE
    //Players start a new game with 3 lives and 0 coins, and all coins are visible in the level

    IF
        playerModel.isTouching.coinModel.x = TRUE
    THEN
        coinModel.x.visibility = FALSE
        coinCount++
        //If the player character model touches one of the coin models, that coin model disappears, and the players coin count is increased by 1
        IF
            coinCount % 100 = 0
        THEN
            lifeCount++
            //if that coin count is divisible evenly by 100, then the players life count is also increased by 1

Quick notes for people who have even less programming background than me

++ Is used by a lot of programming languages to increase a value by 1

% is often used as the “modulo” operator, which basically returns the remainder from division. So 10 % 2 = 0, because 10 is evenly divisible by 2, 10 % 3 = 1, because 10 is divisible by 3 but not evenly and leaves a remainder of 1

// Are comments, they don’t affect the code, they’re just there for human readability to make it more understandable, so you can explain why you did what you did for anyone who has to maintain the code after you, etc.

Hopefully, between the simple variable names and comments, those pseudocode blocks are all pretty readable for laypeople, but if not:

The first block basically detects if you’re starting a new game (IF newGame = TRUE)
If it is, then it resets your life counter to a default 3, and you start with 0 coins and sets all of the coins in the level to be visible so you can collect them
Otherwise it would carry over the values from your previous level, or save game, or whatever

The second block detects if you touch a coin (playerModel.isTouching.coinModel.x = TRUE). If you do, that coin vanishes (coinModel.x.visibility = FALSE)
It also increases your coin count (coinCount++)
Then if your coin count is divisible evenly by 100 (coinCount % 100 = 0) it increases your life total (lifeCount++)
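For anyone who wants to run the logic rather than read pseudocode, here is a minimal Python sketch of the same idea. The names `GameState` and `collect_coin` are made up for this example, and the coin models and visibility are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    # A new game starts with 0 coins and 3 lives, as in the first block.
    coin_count: int = 0
    life_count: int = 3


def collect_coin(state: GameState) -> GameState:
    """Second block: count the coin, and award a life on every 100th coin."""
    state.coin_count += 1
    if state.coin_count % 100 == 0:
        state.life_count += 1
    return state


state = GameState()
for _ in range(250):
    collect_coin(state)

# 250 coins cross the 100 and 200 marks, so two extra lives are earned.
print(state)  # GameState(coin_count=250, life_count=5)
```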

When the code gets compiled, that gets turned into machine code, basically all 1s and 0s that the computer can understand. The computer doesn’t care if you call a coin a coin or if you call it object1, it’s going to strip all of those human-readable elements out because it would just be a waste of storage and processing power to keep it in.

So when you decompile that, you don’t get any of the explanatory comments or the easy to read variable names, so you might end up with something looking kind of like this

    IF
        variable1 = TRUE
    THEN
        variable2 = 0
        variable3 = 3
        object1.all.condition1 = TRUE

    IF
        object2.condition2.object1.x = TRUE
    THEN
        object1.x.condition1 = FALSE
        variable2++
        IF
            variable2 % 100 = 0
        THEN
            variable3++

Which is a lot harder to understand. The code will still work, you could recompile it and run it, but if you want to make any changes, you’d basically need to comb through it, figure out what all the variables, objects, conditions, etc. are, and try to piece together why the programmers who originally wrote the code did it the way they did
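The before/after effect can be imitated with a toy script. This is only a sketch of the idea, not how a real compiler or decompiler works: it parses a small Python snippet (parsing already throws the comments away), swaps every variable name for a placeholder, and prints what is left. The class name `StripNames` and the sample snippet are invented for the example, and `ast.unparse` assumes Python 3.9 or newer.

```python
import ast

SOURCE = '''
# Players start a new game with 3 lives and 0 coins
coin_count = 0
life_count = 3
coin_count += 1  # the player touched a coin
if coin_count % 100 == 0:
    life_count += 1  # every 100th coin earns an extra life
'''


class StripNames(ast.NodeTransformer):
    """Replace every variable name with a meaningless placeholder."""

    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        # Reuse the placeholder if this name was seen before.
        placeholder = self.mapping.setdefault(
            node.id, f"variable{len(self.mapping) + 1}"
        )
        return ast.copy_location(ast.Name(id=placeholder, ctx=node.ctx), node)


tree = StripNames().visit(ast.parse(SOURCE))  # comments are already gone here
stripped = ast.unparse(tree)
print(stripped)
```

The stripped version still runs and behaves the same, but whoever reads it has to work out for themselves that variable1 counts coins and variable2 counts lives.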

And that’s of course a bit of an oversimplification. For various reasons it may not decompile and recompile exactly 1:1 with the original code; it’s almost like translating the same sentence back and forth between 2 languages with Google Translate.

And even this little snippet of fairly simple and straightforward code would probably be backed up by dozens, if not hundreds or thousands, of other lines of code just to make this bit work: defining what a coin is, the hit boxes, animations, how it determines if it’s a new game or continuing a previous game, etc.

Emily (she/her)@lemmy.blahaj.zone · edit-2 4 months ago

Thank you for adding this! If people want a real life example of the effect shown in this pseudocode, here is a side-by-side comparison of real production code I wrote and its decompiled counterpart:

    override fun process(event: MapStateEvent) {
        when(event) {
            is MapStateEvent.LassoButtonClicked -> {
                action(
                    MapStateAction.LassoButtonSelected(false),
                    MapStateAction.Transition(BrowseMapState::class.java)
                )
            }
            is MapStateEvent.SaveSearchClicked -> {
                save(event.name)
            }
            // Propagated from the previous level
            is MapStateEvent.LassoCursorLifted -> {
                load(event.line + event.line.first())
            }
            is MapStateEvent.ClusterClick -> {
                when (val action = ClusterHelper.handleClick(event.cluster)) {
                    is ClusterHelper.Action.OpenBottomDialog ->
                        action(MapStateAction.OpenBottomDialog(action.items))
                    is ClusterHelper.Action.AnimateCamera ->
                        action(MapStateAction.AnimateCamera(action.animation))
                }
            }
            is MapStateEvent.ClusterItemClick -> {
                action(
                    MapStateAction.OpenItem(event.item.proposal)
                )
            }
            else -> {}
        }
    }

decompiled:

    public void c(@l j jVar) {
        L.p(jVar, D.f10724I0);
        if (jVar instanceof j.c) {
            f(new i.h(false), new i.r(c.class, (j) null, 2, (C2498w) null));
        } else if (jVar instanceof j.e) {
            m(((j.e) jVar).f8620a);
        } else if (jVar instanceof j.d) {
            List<LatLng> list = ((j.d) jVar).f8619a;
            j(I.A4(list, I.w2(list)));
        } else if (jVar instanceof j.a) {
            d.a a7 = d.f8573a.a(((j.a) jVar).f8616a);
            if (a7 instanceof d.a.b) {
                f(new i.j(((d.a.b) a7).f8575a));
            } else if (a7 instanceof d.a.C0058a) {
                f(new i.a(((d.a.C0058a) a7).f8574a));
            }
        } else if (jVar instanceof j.b) {
            f(new i.k(((j.b) jVar).f8617a.f11799a));
        }
    }

Keep in mind, this was buried in hundreds of unlabeled classes and functions. I was only able to find this in a short amount of time because I have the most intimate knowledge of the code possible, having written it myself.

2484345508@lemy.lol · 4 months ago

In addition to the other comments that explained it well… Back in the day, that process was easier in part because executable files had far fewer instructions.

FaceDeer@fedia.io · 4 months ago

As others have mentioned, it’s possible but very complicated. Decompilers produce code that isn’t very readable for humans.

I am indeed awaiting the big news headlines that will for some reason catch everyone by surprise when an LLM comes along that’s trained to “translate” machine code into a nice easily-comprehensible high-level programming language. It’s going to be a really big development. Even though it won’t make programs legally “open source”, it’ll make them all source-available.

xmunk@sh.itjust.works · 4 months ago

Assuming you have all the source code… it is possible. It’s usually a huge pain in the ass though and software is so complicated that it’s extremely difficult to get anything useful.