Gremlin Free – Run chaos experiments to prevent outages

(gremlin.com)

164 points · dpritchett · 7y ago · 52 comments

lklig · 7y ago · 9 in thread

Hey folks, I work at Gremlin and we're super excited to announce this launch. Drop any questions, comments, or concerns, we're happy to help!

dkersten · 7y ago

I like the look of this and love that you have released a free version. I am a little dismayed, though, that the two options are $0 and $1000/m (paid annually) with nothing in between. The free version seems great to get started, but I'd really like a lot more of the attacks that the paid version has, but $12,000 is much, much too high a price for a startup or personal project. That's quite a jump in cost.

TheSpiceIsLife · 7y ago

I can’t speak for this vendor in particular, but one common reason for pricing like this is the vendor doesn’t want to deal with smaller customers as they often have the highest support requirements.

mpls · 7y ago

Thanks for putting this out! I caught a Gremlin talk at a recent conference and was very impressed with how knowledgeable the developers were.

How does the Gremlin platform interact with one of my hosts? Do I need to install an agent or something? Does it need root access to my host, hypervisor, cloud console?

lklig · 7y ago

Simply install an agent, authenticate with our control plane, and create attacks through our webapp. No root access required.

Check out more info at https://www.gremlin.com/docs/infrastructure-layer/installati... .

gfs · 7y ago

I see that you are on the Rust production user page [0]. Can you talk a little bit about what Rust is used for and how the experience has been?

[0]: https://www.rust-lang.org/production/users

philgebhardt · 7y ago

Hey, I'm an engineer at Gremlin! When you install Gremlin onto your linux hosts for infrastructure experiments, you're using binaries that were completely written in Rust. I would be lying if I said there wasn't a bit of a learning curve (coming from mostly working with Java). Most of that can be attributed to the memory management concepts built into Rust. At first you fight the compiler a bit (asking things like, why am I not allowed to reference this variable?!), but you soon learn to love and rely on the compiler as it builds more confidence in the runtime behavior of the product.

One game changer for Rust is the treatment of Errors as first class citizens. It's literally built into the native types that Rust wants you to work with. That's huge for our product, given it runs in an inherently error-prone environment.

keyle · 7y ago

I laughed out loud at "Failure as a service". Thanks for that.

ksmail99 · 7y ago

Hi, we are a startup using a lot of Lambda, Fargate, RDS and DynamoDB. Will Gremlin work for this? I didn't see any mention of support for Fargate or Lambda on your website.

lklig · 7y ago

We've got you covered! Gremlin supports serverless products with application layer fault injection.

Take a look at our docs for more: https://www.gremlin.com/docs/application-layer/overview/

Negitivefrags · 7y ago · 5 in thread

So here is what I don't get about this stuff.

What happens to the in-flight requests? Don't a few users run into random errors whenever a host is killed unexpectedly?

You could have your loadbalancer retry everything that fails, but then wouldn't every single request in your app have to be idempotent?
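One common way to make blind retries safe is an idempotency key: the client attaches the same key to every attempt, and the server applies the operation at most once per key. The sketch below is only an illustration of that idea; `PaymentService`, `deposit`, and `send_with_retry` are hypothetical names, and the in-memory dictionary stands in for whatever durable store a real service would use.

```python
class PaymentService:
    """Hypothetical service that deduplicates retried requests by key."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # idempotency key -> response already returned

    def deposit(self, idempotency_key, amount):
        # A retry of a request we already applied returns the cached
        # response instead of applying the deposit a second time.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.balance += amount
        response = {"status": "ok", "balance": self.balance}
        self._seen[idempotency_key] = response
        return response


def send_with_retry(service, idempotency_key, amount, attempts=3):
    """Retry on connection errors, reusing the same key on every attempt."""
    last_error = None
    for _ in range(attempts):
        try:
            return service.deposit(idempotency_key, amount)
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

With this shape, a request whose response was lost when a host died can be replayed without double-applying the write, so only non-keyed operations still need to be naturally idempotent.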

perfmode · 7y ago

Server crashes happen. This forces you to deal with them instead of pretending they won’t.

Negitivefrags · 7y ago

Well yes, but I would suggest that they are uncommon enough that a few requests failing isn't a problem when those happen.

It's an entirely different story when you are killing processes constantly.

flowardnut · 7y ago

idempotent requests, stateless services, etc are all parts of a fault tolerant system.

your service has a few ways to deal with a dependency going down -- maybe it's a retry, maybe it's opening a circuit breaker and returning a default payload instead of calling that service.

It really depends on what specifically the service is and what it's calling (so it's a very case by case issue).

One of the very neat features of istio is that you can do this tuning in real time -- spin up your services, simulate faults, and then test your service while tuning your retry logic to see what the best user experience is.
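As a rough illustration of the circuit-breaker option mentioned above, here is a minimal sketch. It is not taken from Istio or any particular library; `CircuitBreaker`, `failure_threshold`, `reset_after`, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: after repeated failures, skip the dependency
    for a while and return a default payload instead."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback  # open: do not call the dependency at all
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            return fallback
        self._failures = 0  # a success closes the circuit fully
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_recommendations, fallback=[])`, where `fetch_recommendations` is a placeholder for the real call. The threshold and reset window are exactly the knobs one would tune while injecting faults.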

LoSboccacc · 7y ago

Well, for example, in our systems every API call only moves from one known state to another known state, and any call failure redirects the client/user to the dashboard through an error handler, so they have to reload the last good state saved in the database.

Not perfect, but having a server crash is not much different from having a connection reset by a wifi status change, an upload timing out due to the mobile network going away, or the user navigating away or closing the browser.

Negitivefrags · 7y ago

It sounds like you are saying "The in-flight requests fail" to me.

I really don't like the idea of saying that it's simply okay to give random users a bad user experience like that when you are actually killing servers yourself all the time.

djb_hackernews · 7y ago · 2 in thread

Is anyone aware of a chaos tool that isn't a SaaS (free or not) and doesn't require using Spinnaker like the current Netflix chaos tool does?

lklig · 7y ago

Yes, we compiled a list of all the OSS alternatives to Chaos Monkey here!

https://www.gremlin.com/chaos-monkey/chaos-monkey-alternativ...

jedberg · 7y ago

The old chaos monkey didn’t require spinnaker. You can find it here: https://github.com/Netflix/SimianArmy

espeed · 7y ago · 2 in thread

NB: This company "Gremlin, Inc", its product "Gremlin Free", and its use of the Gremlin name is in no way affiliated with or related to Apache TinkerPop™ Gremlin, its ASF marks, name, the open-source Gremlin graph programming language, ASF TinkerPop Gremlin Graph Traversal Machine (GSM), associated libraries, or the Gremlin Graph developers group formed in 2009.

http://www.apache.org/foundation/marks/faq/

yodon · 7y ago

It's also probably not related to the 1984 movie "Gremlins" or the 1970's car of the same name (listed as one of the ugliest cars of all time[0])

[0] https://www.cbsnews.com/pictures/worlds-15-ugliest-cars/7/

espeed · 7y ago

ASF marks were filed at formation. Identifying and distinguishing the use of similar and potentially confusing marks (esp in software products) is one of the required duties. The use of ASF marks in software products is prohibited to prevent confusion.

goldenkey · 7y ago · 1 in thread

Failure as a service doesn't make all that much sense considering that many failure scenarios would make the target host inaccessible to Gremlin.

How does Gremlin handle this?

lklig · 7y ago

Good question! All of the network attacks have a whitelisting capability, to keep the host accessible. This isn't an issue with state attacks, as the client will come back online once the host reboots. And with resource attacks the client typically remains active, if your application is handling starved resources well.

isuckatcoding · 7y ago · 1 in thread

How do you prevent abuse of this tool?

lklig · 7y ago

Security is extremely important to us. Clients authenticate to our control plane with either a secret string or a certificate. Clients can be revoked at any point from our webapp, and if a client loses communication with our control plane, any ongoing attack is halted.

Check out our security page for more: https://gremlin.com/security
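The "halt the attack if the control plane goes quiet" behavior described above is a dead-man's switch. The sketch below shows the general pattern only and is not Gremlin's implementation; `AttackWatchdog`, `timeout_s`, `heartbeat`, and `tick` are hypothetical names chosen for the example.

```python
import time


class AttackWatchdog:
    """Hypothetical dead-man's switch: stop a running attack when no
    heartbeat from the control plane has arrived within timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self.last_heartbeat = clock()
        self.attack_running = False

    def start_attack(self):
        self.attack_running = True

    def heartbeat(self):
        # Called whenever the control plane is successfully contacted.
        self.last_heartbeat = self._clock()

    def tick(self):
        # Called periodically by the agent loop; returns whether the
        # attack is still allowed to run after this check.
        silent_for = self._clock() - self.last_heartbeat
        if self.attack_running and silent_for > self.timeout_s:
            self.attack_running = False  # fail safe: halt the experiment
        return self.attack_running
```

The design choice is that losing contact defaults to the safe state, so a network attack that accidentally cuts the agent off cannot keep running unattended.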

debaserab2 · 7y ago · 1 in thread

What infrastructure size does one need to have where this technique is beneficial? Genuinely curious where the threshold is.

farazbabar · 7y ago

Multiple criteria:

1. When you go from one machine running the code to more than one.
2. Any system that may experience failures, where detection of such failures and recovery is desirable.
3. Most distributed systems, due to the failure scenarios inherent in such systems.

ingrid · 7y ago

Ha, after working on building Uber’s chaos monkey (which was hard and took a while to build) and working with Netflix’s chaos monkey — it’s super nice to see Gremlin release this service so anyone can see the benefits of chaos engineering. I hope they add a “random chaos” feature to keep engineers on their feet. ;-)
