How We Are Improving Performance

(blog.imgix.com)

45 points · snackai · 9y ago · 55 comments

mabbo · 9y ago · 5 in thread

There are two kinds of apologies, and they look very similar on the surface. There's "Sorry for this problem, but here's why it's not my fault" and there's "Sorry for this problem. Here's what happened". Am I taking the blame for the failure of others, or deflecting the blame onto others?

This feels a lot more like the latter, and it's wonderful to see. I'm sure it's a lot harder to write, but it shows a heck of a lot more integrity.

hinkley · 9y ago

I generally have issues when an engineering manager blames bad luck. Luck doesn't happen as often as people like to believe.

For instance, losing three backup generators in ten minutes (as happened to someone else a while back) might be bad luck, but I'd want to see the maintenance logs and talk to the vendor. Because what's much more likely is that they hadn't been maintained properly, or there was a manufacturing defect in that run and they all came from the same batch (see also, clustered hard drive failures).

If their upstreams have had intermittent problems before, it was only a matter of time until they both had problems at the same time. And capacity planning is one of those strategic decisions that gets blamed on the engineering team, who never get what they need because they're a 'cost center', which is a very different and highly damaging way of thinking about the cost of doing business (intrinsic costs).

skuhn · 9y ago

[I work at imgix and have helped lead the team on the production issues we face and gathering the details for this blog post]

> If their upstreams have had intermittent problems before

Sorry, this might not have been clear. We had never had major issues with any of our upstream providers before -- beyond scheduled maintenances there had never been a service interruption over a number of years.

There's almost always the risk of coordinated failure, no matter how many levels of redundancy you put into place. The cost gap between 99.99% and 100% is humongous for this reason.
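For a sense of scale, the downtime each availability target permits per year can be worked out directly. This is only illustrative arithmetic, assuming a 365-day year; the targets listed are examples, not figures from imgix's SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year allowed at the given availability."""
    return MINUTES_PER_YEAR * (100.0 - availability_pct) / 100.0


# Example targets only; none of these come from the thread except 99.99% and 100%.
for pct in (99.9, 99.99, 99.999, 100.0):
    print(f"{pct}% -> {downtime_budget_minutes(pct):.2f} minutes/year")
```

At 99.99% the budget is under an hour a year; at 100% it is zero, so a single coordinated failure across otherwise redundant paths already exceeds it.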

From these failures, we learned some valuable lessons. One of our transit providers was acquired a year or so back, and it's apparent that things have dramatically changed in how they operate their service and respond to issues. We won't be continuing the customer relationship there for much longer unless we see positive steps taken. Another transit provider has been much more responsive and the conversations have been more productive -- we can work through the issue we observed and get back to a positive situation there (I think).

And then finally, we need more diversity. We turned up another circuit yesterday, and I'm still working on adding another 1-2 providers in addition to more peering or direct connect situations. These take a bit of time, but it's work we need to do before we actually require it, because by then it's too late. We're working as quickly as possible to get this part of our infrastructure on rock solid footing.

> capacity planning is one of those strategic decisions that gets blamed on the engineering team

Absolutely. We were doing capacity planning, but it wasn't anyone's idea of a good time.

We are correcting this with the resources we have today, and we're going to do a better job in the short term. In the long term, this is a non-negotiable core aspect of someone's job -- either already on the team, or a new role that we create. It can't be a secondary, rainy day kind of task you take on time permitting and it also can't be a task that we're not well equipped to handle on a regular basis.

skuhn · 9y ago

Thanks, it was definitely tough for us to admit that we failed our customers, but the entire team at imgix felt that we had an obligation to be honest and transparent with our customers. I particularly wanted to be sure that we weren't deflecting blame, but also the reality is that there were other parties involved at various points -- that's the reality of operating something on the Internet, there's always a transit provider or datacenter operator or somebody that you're buying critical services from.

The driving reason that led us to start imgix, and then spend years of our lives growing and operating it, is to empower our customers to build awesome stuff on the Internet. We have to work hard to keep up our end of the bargain, and we also have to forge our partnerships and relationships with customers in a million little ways. I hope that through adversity we've forged some new relationships with customers in the way we've communicated this time (and how we will continue to communicate).

As an aside, I was personally impressed by how one of our vendors handled a service issue they had last year [1]. That was a definite inspiration for how we handled this communication.

[1] https://ns1.com/blog/how-we-responded-to-last-weeks-major-mu...

jhorowitz103 · 9y ago

Do you mean the former?

Fishkins · 9y ago

I think the issue is mabbo listed the types of apology in different orders in two consecutive sentences. This apology is the latter in the former sentence, but the former in the latter sentence.

gingerlime · 9y ago · 4 in thread

We were using imgix for a while and were generally happy, but things started to go downhill at some point, or so it felt anyway. Their support was always a bit opaque. And the service itself didn't evolve as far as rendering goes (e.g. composing more than one of the same filter, such as adding two watermarks, wasn't really possible). We've also had issues with CORS headers that weren't resolved, and our end users couldn't get images sometimes...

We switched to hosting our own thumbor (open source), and couldn't be happier. We also pay around a quarter or less of what we paid before (even with failover in place).

We really wanted to use a hosted service. We're not keen on hosting stuff outside of our core business. But in this case it just didn't work out.

EDIT: link to Thumbor https://github.com/thumbor/thumbor

skuhn · 9y ago

Sorry to hear that your experience with imgix wasn't great. Performance, quality and correctness are obviously super important and non-negotiable aspects of our service. While we have always heavily focused on these areas, we're going to double down for the short term.

We do have some interesting solutions in the lab for more elaborate compositing amongst other things. Part of the challenge there is to stand up a new API interface, since the URL API gets pretty cumbersome for coordinates and tons of composites. We have some interesting ideas there, but it needs to bake a little while longer.

It's also a challenge to get that new stuff out while focusing heavily on our core areas, but we have to do both. You'll start to see more exciting new things ship from us over the next few months.

If you have a solution that works for you, that's great and best of luck. If you do ever feel the urge to check in on what imgix is up to, we would welcome it. Feel free to reach out to me directly any time (e-mail in profile) and I'll work with you to get what you need.

gingerlime · 9y ago

Thanks. Not sure why you're being downvoted. Recognising those shortcomings is important, and I understand that you might not have all the answers for all customers.

I think your support could have been more responsive though, and communication improved. You guys could probably detect that we weren't happy and about to leave. Reaching out at that point could have made a difference perhaps...

We did genuinely want to work with imgix, but those hurdles left us looking elsewhere. These things happen all the time. Competition is a good thing overall. I'd like to hope that our feedback helps in some way. Even if it's hard to hear.

ShirsenduK · 9y ago

I am in the process of switching to thumbor! Glad to see confirmation of the solution. :)

gingerlime · 9y ago

Feel free to get in touch if you need any tips, although it's mostly a black box that "just works" for us.

I was actually surprised there were no hosted thumbor services out there (I found one in Beta I think at the time I was looking, but we weren't sure how stable it is). Maybe because it's so easy to self-host that nobody bothers offering a hosted solution.

ShirsenduK · 9y ago · 4 in thread

They put out a blogpost while their support keeps stonewalling and denying any such issues. Nor do they reply on twitter or even tweet the blogpost. This seems too little too late.

Disclaimer: I am currently a customer who suffered a financial loss because of them.

skuhn · 9y ago

[I work at imgix and have helped lead the team on the production issues we face and gathering the details for this blog post]

I'm sorry that you had this experience with imgix. We have had challenges in correctly addressing support issues during this period -- support has been swamped. We've expanded that team, and we're continuing to do so.

You're right, we didn't tweet this blog post, but we did e-mail it to all customers with activity since Jan 1 2017.

In this case, we didn't properly convey that the Shopify integration guide we provided wasn't an official integration with Shopify. It was meant as a best effort "here's how you can do it" sort of thing. I requested that we take it down because it wasn't working due to a change on the Shopify side which prevented us from purging images.

I'd love to talk with you directly to try to make this right for you. If you'd like, my e-mail is in my profile.

ShirsenduK · 9y ago

I understand your support might be swamped, but this is what your support wrote to me:

"Support is on-call during regular business hours, Monday - Friday, and premium customers can request SLAs for support."

This, while other users complained to me via twitter/email that they were also not getting any valid responses. I would not send those words while my service is not working as expected.

I'm in the process of moving out of Imgix, esp. after your support said they would be happy to cancel my account if that's what I wanted.

The pushback from your support made me set up a production-level thumbor instance which will go live next week.

Thanks for your reply but I wish you had replied to https://twitter.com/troysk704/status/841158793287278593 sooner.

Exuma · 9y ago

What happened?

ShirsenduK · 9y ago

They kept serving an old image even after I flushed it multiple times. The image had an old discount code, which customers presented to us. We had to honor it even though we were technically not running the offer, and it was financially tough.

Their support said (after a few days) that they don't support Shopify. Their whole pitch when we signed up was: we are Shopify compatible.

rcchen · 9y ago · 4 in thread

I wonder if their render service is still backed by racked Mac Pros (http://photos.imgix.com/racking-mac-pros). If so, considering the lack of updates around that machine for the last several years, I wonder if they are planning to remain with that solution.

ryanSrich · 9y ago

Wow. I just read that.

> We operate our own hardware, run our own datacenters, and manage our own network infrastructure.

This seems insane to me. Although I don't work with image processing beyond "saving for web" in Photoshop, so I could be wrong. Why would they not use AWS or any number of other cloud providers where capacity planning is handled for you?

billyhoffman · 9y ago

The bigger question is, why not use elastic infrastructure?

Sure, image optimization is CPU intensive, but their use case is super bursty. When the source image changes, you do multiple optimizations (convert to WebP, lossless JPEG, lossy JPEG quality change, etc), cache the results and you are done. The ratio of "optimizing image" to "serving optimized copy" must be insane.

Dedicated physical hardware feels like a waste.

detaro · 9y ago

Running the delivery/CDN part on AWS would probably be way too expensive, and if you are running parts of your own infrastructure anyway (and thus are already paying for sysadmins, data center, ...) you can put at least the base load of your processing there as well. But they could of course have the option to fall back/scale to a cloud provider, if they didn't "require" Mac OS X...

Exuma · 9y ago

I would guess because of costs. At one point I was going to do a video processing app, and after doing some calculations the cloud costs were insane, and the savings with actual hardware were tremendous. I imagine image processing costs a lot less than video, but maybe it's the same deal. Then again... they did rack Mac Pros, so who the heck knows what they're thinking.

trevyn · 9y ago · 2 in thread

I wish this post gave more concrete details, so we could learn from it.

> A critical piece of our network infrastructure failed after 3 years of correct operation in a way that proved difficult for our network engineering team to troubleshoot.

Which piece of infrastructure? How did it fail? Why was it difficult to troubleshoot? Can you do anything to prevent this type of issue in the future?

> We observed new traffic patterns with significantly lower cache hit rates than our historical median, and it took us some time to determine whether the source was abusive in nature or a legitimate new customer use case.

What were the new traffic patterns, and why did this cause lower cache hit rates? Why did it take longer than expected to determine the nature of the traffic?

skuhn · 9y ago

[I work at imgix and have helped lead the team on the production issues we face and gathering the details for this blog post]

> Which piece of infrastructure?

In this case it was a top-of-rack switch. That rack had a few of our external DNS resolvers on it. Normally this wouldn't be a problem, because we deploy services across multiple racks to prevent this kind of failure mode.

In this case it was a bit of a problem, because this was the first rack deployed at this site and it turned out to not conform to the standard configuration across the rest of the racks (because it was what we bootstrapped the site with).

So two things: it was tough to get into the device to troubleshoot it (we wound up using the serial port infrastructure deployed for this purpose), and it had a service impact even though it shouldn't have (we have since migrated DNS resolvers out of this rack and have scheduled a future maintenance to get this switch's configuration corrected and upgrade its OS).

> New traffic patterns

This is a bit tougher to dig into, but essentially it was a needle-in-a-haystack kind of situation. Customers can use any imgix URL API parameter they like -- some of these use cases get pretty complicated for the backend to handle. Think of the watermark parameter stacked with another URL that is also an imgix URL with another watermark parameter, five layers deep. These sorts of operations take a lot more rendering resources than a simple ?h=600&w=600 operation.

In this case, we observed an influx of these sorts of more difficult operations. We have various logs and metrics sprinkled throughout the system -- we use Prometheus, Kafka, heka, BigQuery, Grafana and a few other systems to collect and present the data we need to run the service. We also issue unique IDs per request to track their path through the system. What we don't have -- and need -- is one end-to-end view of a request's path through the system and the system's capacity and performance at each point in our stack.

It turned out that for some amount of the increased rendering traffic we saw, it wasn't that we suddenly got more requests. There were simply many new permutations for each original object. That lowers the CDN cache hit rate for a period of time.
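A toy model of the effect described above, assuming an idealized cache that never evicts and made-up URLs (none of this reflects imgix's actual system): with the same request volume, spreading requests over more distinct parameter permutations means more first-time renders and a lower hit rate.

```python
def hit_rate(request_keys):
    """Fraction of requests served from an idealized, never-evicting cache."""
    seen = set()
    hits = 0
    total = 0
    for key in request_keys:
        total += 1
        if key in seen:
            hits += 1      # already rendered once: cache hit
        else:
            seen.add(key)  # first time this permutation is seen: render it
    return hits / total if total else 0.0


def make_requests(n_requests, n_variants):
    """Hypothetical traffic: n_requests cycling over n_variants widths of one image."""
    return [f"/photo.jpg?w={100 + i % n_variants}" for i in range(n_requests)]


# Same number of requests, different number of permutations per original object.
print(hit_rate(make_requests(1000, 10)))   # few permutations
print(hit_rate(make_requests(1000, 500)))  # many permutations
```

In this model the hit rate drops from 0.99 to 0.5 without any increase in request count.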

The other thing that comes to mind (and I think I'm forgetting a third type here), is that some of our request parameters require normalization so that they can utilize the same cache object. Think of parameters like dpr, which is a floating point number but realistically is only useful up to a few decimal places. dpr=1.33 and dpr=1.3333333333 are actually the same image, but they would have different cache keys and require two renders that are effectively the same object.

We normalize the dpr parameter down to three significant digits. What we found is that this sort of normalization was necessary for another parameter as well.
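As a rough sketch of this kind of normalization (not imgix's actual code; the function names and example URLs are hypothetical), the `dpr` value can be rounded to three significant digits before the cache key is built, so that equivalent requests share one cached object. Only `dpr` is handled here because the thread does not name the other parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def normalize_dpr(raw: str) -> str:
    """Round a dpr value to three significant digits, e.g. '1.3333333333' -> '1.33'."""
    return format(float(raw), ".3g")


def cache_key(url: str) -> str:
    """Build a cache key from the path and query, with dpr normalized."""
    parts = urlsplit(url)
    params = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name == "dpr":
            value = normalize_dpr(value)
        params.append((name, value))
    return parts.path + "?" + urlencode(params)


# Hypothetical URLs: both requests map to the same key, so only one render is needed.
a = cache_key("https://example.com/photo.jpg?w=600&dpr=1.33")
b = cache_key("https://example.com/photo.jpg?w=600&dpr=1.3333333333")
print(a == b)
```

Without the normalization step the two URLs would produce different keys and trigger two effectively identical renders.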

billyhoffman · 9y ago

Why isn't this in your blog post? Or at least a separate "technical details" blog post linked from the announcement?

When you have an incident, especially something that hurts your customers beyond just a service outage, you come clean, fast, and as completely transparent as possible.

mnutt · 9y ago · 2 in thread

> Improved our GIF encoding pathway to increase throughput.

I'd be curious to hear exactly what they did around this. I recently worked on some GIF encoding, and was surprised that there's actually quite a tradeoff to be made between performance and good color palette choices.

skuhn · 9y ago

[I work at imgix and have helped lead the team on the production issues we face and gathering the details for this blog post]

I don't have the specific details handy, but I can confirm that it's a bit of a tough one. It's actually much easier to do work in pretty much any other format -- the very reason why people use GIFs (universal support) is also what makes them tough to work with (the spec is old and crusty and from 1989).

I'd like to do more of these sorts of posts -- talk about the challenges with GIF encoding and how imgix has solved them. Probably not this month or anything, but I'll try to get it on the content roadmap for April or May.

mnutt · 9y ago

That'd be great. I've found https://www.lcdf.org/gifsicle/ to be a good code resource for optimization, if a bit hard to follow.

microcolonel · 9y ago · 2 in thread

The company that literally racks Mac Pro trashcans sideways is telling us that they have made technical decisions that allow them to offer a competitive price for features.

rb2k_ · 9y ago

Ha, I thought that was a joke.

Apparently not: https://blog.imgix.com/2015/05/08/racking-mac-pros-hardware-...

brianwawok · 9y ago

Can you really not run image rendering code on Linux servers? Vs you know, spending thousands of dollars racking OSX servers? Because that would make "increasing capacity" about 3 clicks in AWS....

Perhaps I am missing the competitive advantage of using OSX to resize images, but it sure seems like the shortcomings are obvious :)

Nabi · 9y ago · 1 in thread

Thanks for the explanation and the openness. We were a bit frustrated, as we had just finished migrating to imgix and hit many issues with image delivery.

skuhn · 9y ago

[I work at imgix and have helped lead the team on the production issues we face and gathering the details for this blog post]

Sorry that you experienced these issues. I appreciate your understanding.

Please feel free to reach out directly to me (e-mail in profile) if there's anything you'd like to talk through.

I'm optimistic that you will have a better experience going forward, but I can guarantee you that we are doing everything possible to provide the service to you and we will do a better job of communicating in future.

amelius · 9y ago · 1 in thread

It says on their homepage that they use a third-party CDN. I'm wondering how you can improve upon that if you are just a customer from the CDN's point of view.

skuhn · 9y ago

[I work at imgix and have helped lead the team on the production issues we face and gathering the details for this blog post]

imgix does utilize a partner to handle some of our image delivery components. We're a customer of that CDN in that we pay them money for the services they provide, but we're actually more of a partner in that we have done considerable integration work and we work closely with them to build what we need to deliver our product.

We talk with them frequently, and there are pieces that aren't just off-the-shelf that we utilize to provide the integrated service we sell to our own customers.

To be clear though, the issues we faced that are discussed in this blog post are not related to the CDN or delivering already rendered images to end users. It's around rendering new images, which doesn't involve the CDN.

mkup · 9y ago · 1 in thread

Font on https://www.imgix.com/pricing page is broken in Windows/Firefox (look how small "e" and "a" are displayed): https://image.ibb.co/ei7m8v/windows_font_problem.png

This is without zooming (100%), 120 DPI, Windows 7 x64.

Please use standard fonts and don't overengineer; font hacks like this will never work as expected across all platforms.

skuhn · 9y ago

Thanks for pointing this out. The pricing page is on an older version of our site design, and I believe this may already have been corrected in the design proofs we've been working on.

I'll prioritize rolling out the fix for this particular issue next week.

Thaxll · 9y ago

Imgix is a good example of the burden of legacy code / infrastructure that they can't get away from.
