Our journey from a Python monolith to a managed platform

(dropbox.tech)

133 points · mands · 5y ago · 29 comments

rowls66 · 5y ago

> Every line of code they wrote was, whether they wanted or not, shared code—they didn’t get to choose what was smart to share, and what was best to keep isolated to a single endpoint.

I know very little about Python, but does this mean that Python has no way to encapsulate code at a level larger than a class? Something like a package or a module. It does not seem like it should be necessary to break a system into separate services just to get encapsulation at a module or subsystem level.

nonameiguess · 5y ago

Python doesn't have true encapsulation unless you drop down to the C level to enforce it separately from the actual runtime. Unofficially, there are conventions on how to mark functions as part of a public interface, but practically speaking, if someone can import your library, they can do whatever they want with it, including adding and changing behavior at runtime. Classes, modules, and even class and function definitions are just hash tables under the hood, and you can change what a given symbol refers to in whatever arbitrary way you want to.

Granted, you shouldn't do very much of that except in some extremely specialized and careful circumstances, but in a language that makes it really easy, you have to be extremely disciplined and often delivery velocity takes precedence over careful planning of what really constitutes a public API.

Practically speaking, putting separate services in literally separate process spaces where the only way they can communicate is via message passing is the only way to really enforce encapsulation.
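
To make the "conventions, not enforcement" point concrete, here is a minimal sketch. The module name `billing` and its functions are hypothetical, built in memory only so the snippet is self-contained. `__all__` and the leading underscore are the usual ways to signal a public interface, and neither stops a caller from reaching in or rebinding a name:

```python
import types

# Build a throwaway module in memory so the example is self-contained.
# "billing" and its function names are hypothetical, not from the article.
billing = types.ModuleType("billing")
exec(
    '''
__all__ = ["charge"]          # convention: the intended public API

def _round_cents(amount):     # convention: leading underscore means internal
    return round(amount, 2)

def charge(amount):
    return _round_cents(amount + 0.5)   # flat fee, purely illustrative
''',
    billing.__dict__,
)

# Nothing stops a caller from using the "private" helper directly...
internal_result = billing._round_cents(1.23456)

# ...or from rebinding it. The module namespace is an ordinary dict, and
# charge() looks the helper up in that dict each time it is called.
before = billing.charge(10)
billing.__dict__["_round_cents"] = lambda amount: -1
after = billing.charge(10)
```

After the rebinding, `charge` silently uses the replacement, which is the behavior the comment above is warning about.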

nonameiguess · 5y ago

For an example of what I mean, check out the https://pypi.org/project/aioify/ library. This intercepts all of the function and method definitions in a module as they are being imported and rewrites them on the fly to turn all normal functions into async functions.

The language designers definitely took the path of "we're just going to assume you know what you're doing" and give you absolutely every last foot of rope you could possibly want to hang yourself with.

It does make things fast and easy. See all of the complaints here lately about having to rewrite significant parts of an application to become async as soon as any part of it becomes async? Not a problem in Python. There's a library to just automatically rewrite all code as it is being imported, at the cost that the code you're actually running is not the code you see in your repos.
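
A rough sketch of the general idea, wrapping a module's plain functions so they can be awaited. This is not aioify's actual implementation or API; `asyncify` and `asyncify_module` are names made up for illustration, and the wrapper simply pushes each call onto a worker thread with `asyncio.to_thread` (Python 3.9+):

```python
import asyncio
import functools
import math
import types


def asyncify(func):
    """Wrap a blocking callable so it can be awaited (runs in a worker thread)."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        return await asyncio.to_thread(func, *args, **kwargs)
    return wrapper


def asyncify_module(module):
    """Return a namespace whose public callables are awaitable wrappers."""
    wrapped = types.SimpleNamespace()
    for name, value in vars(module).items():
        if name.startswith("_") or not callable(value):
            continue  # skip private names and plain constants such as math.pi
        setattr(wrapped, name, asyncify(value))
    return wrapped


async def main():
    amath = asyncify_module(math)   # every public math function is now awaitable
    return await amath.sqrt(16)


result = asyncio.run(main())
```

The trade-off is the one described above: what runs (`amath.sqrt`) is no longer the function you would find by reading the module's source.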

dragonwriter · 5y ago

> I know very little about Python, but does this mean that Python has no way to encapsulate code at a level larger than a class?

Python has packages, which contain modules. This doesn't seem to be a reference to the lack of encapsulation, in any case.

Though it seems to be projecting a social problem (absence of a policy involving active decisions on when particular code can be used by endpoints) onto architecture.

neolog · 5y ago

Python has functions, classes, modules, packages. I don't know why separate codebases would be necessary.

willcipriano · 5y ago

You can do stuff like this; Java devs in particular get freaked out by this sort of thing, in my experience.

    import foo

    foo.auth("admin", "badpass")  # returns False

    # "pass" is a reserved word in Python, so the parameter is named "password"
    def new_auth(user, password):
        return True

    foo.auth = new_auth  # rebind the module attribute at runtime

    foo.auth("admin", "badpass")  # now returns True

lalos · 5y ago

Semi-related: when people talk about monorepos, is it implied that the whole project has only one version number? Why not just version the subprojects of the monorepo? That way you have a small vetting process when cutting a release of a specific subproject. The rest of the subprojects that depend on it can read the release notes for breaking changes, etc.

tudelo · 5y ago

No. Not at all. A monorepo with multiple back-end projects might push them all at different times. So what this means is that when designing new features across multiple services you need to design them with push safety in mind, with a rollout plan to accomplish that.

For example, you are updating service B to call a new endpoint on service A. First you need to make the service A endpoint available, and then make service B call service A.

Just because everything exists in the same repo does not mean it all gets shoved out at once. The downside is that you can't just read the code and assume the running service is doing that, unless it's embedded in your build. Processes like automated updates and a forced update cadence (no running binaries over X days old) with proper canary/vetting before a full release allow a large org to still manage this complexity.
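
One way to picture push safety is a caller that tolerates either push order. The sketch below is an assumption-heavy illustration, not anything from the article: `ServiceA`, `EndpointNotFound`, `fetch_quota`, and the endpoint names are all hypothetical, and a real system would make the same decision over RPC rather than in-process:

```python
class EndpointNotFound(Exception):
    """Raised when the callee has not been pushed with the endpoint yet."""


class ServiceA:
    """Stand-in for service A; `endpoints` maps endpoint names to handlers."""

    def __init__(self, endpoints):
        self.endpoints = dict(endpoints)

    def call(self, name, payload):
        if name not in self.endpoints:
            raise EndpointNotFound(name)
        return self.endpoints[name](payload)


def fetch_quota(service_a, user_id):
    """Service B's caller: prefer the new endpoint, fall back to the old one."""
    try:
        return service_a.call("quota_v2", {"user_id": user_id})
    except EndpointNotFound:
        return service_a.call("quota_v1", {"user_id": user_id})


# Service A before and after its push (hypothetical handlers).
old_a = ServiceA({"quota_v1": lambda p: {"user": p["user_id"], "source": "v1"}})
new_a = ServiceA({
    "quota_v1": lambda p: {"user": p["user_id"], "source": "v1"},
    "quota_v2": lambda p: {"user": p["user_id"], "source": "v2"},
})
```

With a fallback like this, service B keeps working whether it lands before or after service A's push; the old endpoint can be retired once every running copy of A serves the new one.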

dec0dedab0de · 5y ago

To me the whole point of having a monorepo is to avoid versioning. But that doesn't mean you always need to deploy everything; you can take the newest commit hash from each project directory and only deploy if that has changed.
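
A minimal sketch of that approach, assuming each project lives in its own directory and that you record the hash last deployed for each one somewhere. The function names and directory names are made up for illustration:

```python
import subprocess


def latest_commit(path, repo="."):
    """Hash of the newest commit that touched `path` (empty string if none)."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "-1", "--format=%H", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def projects_to_deploy(current, deployed):
    """Project directories whose newest commit differs from what is deployed.

    `current` and `deployed` both map directory name -> commit hash.
    """
    return sorted(
        project for project, sha in current.items()
        if deployed.get(project) != sha
    )


# Hypothetical hashes; in practice `current` would come from latest_commit().
current = {"services/api": "a1b2c3", "services/sync": "d4e5f6"}
deployed = {"services/api": "a1b2c3", "services/sync": "000000"}
pending = projects_to_deploy(current, deployed)
```

Here only `services/sync` would be redeployed, since the newest commit under `services/api` is already running.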

mumblemumble · 5y ago

When I'm doing it, I version subprojects independently using Git tags, but that's primarily because we mix and match versions in production, which creates a need to make version numbering semantic.

If we were doing continuous delivery, too, I could see there not being much value in messing with independent versioning, semver, whatever. Just make today's date the universal version numbering system for all modules and move along.
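
For illustration, per-subproject tagging could look like the sketch below. The `project/vMAJOR.MINOR.PATCH` tag scheme is an assumption, not something described in this thread, and the tag list would normally come from `git tag --list`:

```python
import re

# Assumed tag scheme: "<project>/v<major>.<minor>.<patch>", e.g. "billing/v1.4.0".
TAG_PATTERN = re.compile(
    r"^(?P<project>[\w-]+)/v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


def latest_versions(tags):
    """Map each project to its highest tagged (major, minor, patch) version."""
    latest = {}
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if match is None:
            continue  # ignore tags that do not follow the assumed scheme
        version = (int(match["major"]), int(match["minor"]), int(match["patch"]))
        project = match["project"]
        if version > latest.get(project, (-1, -1, -1)):
            latest[project] = version
    return latest


# Hypothetical tags.
tags = ["billing/v1.4.0", "billing/v1.10.0", "search/v0.3.2", "release-2021-03"]
versions = latest_versions(tags)
```

Comparing versions as integer tuples rather than strings keeps `1.10.0` ahead of `1.4.0`.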

reidrac · 5y ago

Good question! I always assume it is the same version, mostly because a common pattern is to use SCM tags to track versions, and I haven't seen that work well on any monorepo.

If you have a specific version per subproject, how do you track that in the repo? Different tag schemes for different subprojects? I have used that in a small-ish monorepo and I didn't especially like it.

derekperkins · 5y ago

We just use the git SHA and it works great; no reason to overcomplicate it.

qbasic_forever · 5y ago

So what's the end game here? Is Dropbox going to keep building out an internal Kubernetes-like platform with Atlas, or do they plan to eventually just move to k8s? I noticed this line in particular:

"We evaluated using off-the-shelf solutions to run the platform. But in order to de-risk our migration and ensure low engineering costs, it made sense for us to continue hosting services on the same deployment orchestration platform used by the rest of Dropbox."

It sounds like they acknowledge they're reinventing a lot of stuff but for now are sticking to their internal platform. Perhaps Atlas is a half-step then to get teams used to owning and running their code as isolated services. But everything I read that they built in Atlas (isolated orchestrated services, gRPC load balancing, canary deployments, horizontal scaling, etc.) is a bog-standard feature of Kubernetes today. I'd be very leery of maintaining a bespoke Kubernetes-like platform in 2021 and beyond; in some ways it seems like it's just shifting the monolith technical debt into an internal Atlas platform team's technical debt. What's the plan to get rid of that debt for good, I wonder?

This hurdle shows there are already some cracks in the idea of long-term Atlas too:

"While splitting up Metaserver had wins in production, it was infeasible to spin up 200+ Python processes in our integration testing framework. We decided to merge the processes back into a monolith for local development and testing purposes. We also built heavy integration with our Bazel rules, so that the merging happens behind the scene and developers can reference Atlasservlets as regular services."

If I read that right, does it really mean the first time a developer's code is run like it will run in production is when it goes out to canary deployment? I.e., integration tests are done in a local monolith instead of setting up a mini-prod cluster. That seems a bit nerve-racking as a dev, to have no way to really test the service until bits are hitting user requests. In the k8s world a ton of work has been put into tooling and processes to make setting up local clusters easy. It's a shame to not have something similar for Atlas.

hayst4ck · 5y ago

Counter considerations would be: What is the delta between out of box solutions and current solutions? What is the cost of the migration? For what period will two services be supported simultaneously? Will development effort continue on the previous service while the new one is created? Will the new service successfully solve problems the old one didn't? What happens when Kubernetes is insufficient for a task or has a critical bug that only appears at scale? How will people be onboarded into the new system? Will the team handling how services run perform the migration or will the teams who own services perform the migration? How much time should be spent experimenting with new things compared to fixing bugs/adding requested features?

> in some ways it seems like it's just shifting the monolith technical debt into an internal Atlas platform team's technical debt.

This is a key insight into the monolith problem. How does a monolith become poor and unmaintainable? A monolith becomes poor quality and unmaintainable when there is no entity enforcing architectural simplicity. It becomes unmaintainable when there is no team focusing solely on how the monolith functions. It becomes unmaintainable if there is no entity capable of saying "no" to a product engineer. A monolith in a company with weak leadership is a tragedy of the commons where everyone takes from the commons by adding complexity and there is no governing entity to ensure that the commons remains viable.

The exact statement you made is the key strength of this approach. Where there was a vacuum of responsibility before (monolith technical debt), a team has been created with direct responsibility and authority creating a governing force over that technical debt/overall complexity and therefore an entity directly responsible for improving it. This is a key first step. Atlas appears to be a compromise solution rather than an ideal end state.

spondyl · 5y ago

> A monolith becomes poor quality and unmaintainable when there is no entity enforcing architectural simplicity.

Having worked in a company where no single team "owned" the monolith, the term "communally owned" tended to come up.

It was generally understood within the platform teams that if everyone owns it then in reality, no one owns it :)

JonAtkinson · 5y ago

I'd be interested to better understand the timeline around this statement:

"Metaserver was stuck on a deprecated legacy framework that unsurprisingly had poor performance and caused maintenance headaches due to esoteric bugs. For example, the legacy framework only supports HTTP/1.0 while modern libraries have moved to HTTP/1.1 as the minimum version."

Dropbox has been around for a lot of years, and raised a lot of cash; was it only recently that they could pay down this technical debt? Were they really so busy in other areas that this was allowed to fester?

muglug · 5y ago

There's normally some sort of budget for paying down technical debt – presumably there was more pressing technical debt.

wittekm · 5y ago

(I was at Dropbox from 2015-2020)

The legacy framework was Pylons, which eventually evolved into Pyramid.

The tl;dr is that there were hundreds of unowned endpoints that, yes, were allowed to fester. They eventually got ownership on all endpoints, so you had somebody to exert pressure on to make things happen.

eevilspock · 5y ago

I just want to know when they'll switch to native, battery-efficient clients, especially given the daemon is always running and monitoring file system events.

nickm12 · 5y ago

The Dropbox sync engine which runs on desktops was rewritten from Python to Rust and released last year:

https://dropbox.tech/infrastructure/rewriting-the-heart-of-o...

eevilspock · 5y ago

Good to know! Thanks.

Then I don’t understand the delay in shipping an Apple Silicon build. Right now we still have to use Rosetta... it’s the only piece of software I have that does.

bps4484 · 5y ago

I'm most curious about how the experience was, both for the Atlas team and the product team, around this quote: "Atlas is “managed,” which means that developers writing code in Atlas only need to write the interface and implementation of their endpoints. Atlas then takes care of creating a production cluster to serve these endpoints. The Atlas team owns pushing to and monitoring these clusters."

Does this imply that the Atlas team gets into the weeds of understanding the business and business logic behind these endpoints to know the scalability and throughput needs? Is the autoscaler really good enough to handle this? If it's transparent to the product team, are they aware of their usage (potentially unexpected)? I imagine the Atlas team would have to be very large with these sorts of responsibilities.

From a product team perspective, I imagine they are still responsible for database configuration and tuning? Has the daily auto-deployment led to unexpected breaks? Who is responsible for rollbacks? And is the product team responsible for, and capable of, hotfixes?

Maybe a broader question which all of my questions above speak to: how are the roles and responsibilities set up between the Atlas team and the product engineering team that owns the code, and how has the transition to that system been?

kwdc · 5y ago

Headline of article: "Our journey from a Python monolith to a managed platform"

So... about this headline. I read this aloud to a friend at a cafe. We laughed. It makes perfect sense to us. We know what Python is. We know what a monolith means in this context.

To my other friends it was the funniest / silliest / most nonsensical thing they'd heard in a while.

IT is weird.

(PS: I know no one will see this comment but I'll leave it here. Because.)
