This feels a lot more like the latter, and it's wonderful to see. I'm sure it's a lot harder to write, but it shows a heck of a lot more integrity.
For instance, losing three backup generators in ten minutes (as happened to someone else a while back) might be bad luck, but I'd want to see the maintenance logs and talk to the vendor. Because what's much more likely is that they hadn't been maintained properly, or there was a manufacturing defect in that run and they all came from the same batch (see also: clustered hard drive failures).
If their upstreams have had intermittent problems before, it was only a matter of time until they both had problems at the same time. And capacity planning is one of those strategic decisions that gets blamed on the engineering team, who never get what they need because they're a 'cost center', which is a very different and highly damaging way of thinking about the cost of doing business (intrinsic costs).
> If their upstreams have had intermittent problems before
Sorry, this might not have been clear. We had never had major issues with any of our upstream providers before -- beyond scheduled maintenances there had never been a service interruption over a number of years.
There's almost always the risk of coordinated failure, no matter how many levels of redundancy you put into place. The cost gap between 99.99% and 100% is humongous for this reason.
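To put rough numbers on that gap, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration, not anything taken from imgix's SLA.

```python
# Minutes in a 365-day year (an assumption; leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999, 100.0):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% -> {minutes:.1f} minutes/year")
```

At 99.99% the budget is already under an hour per year, and at 100% it is zero, which leaves no room for any coordinated failure at all.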
From these failures, we learned some valuable lessons. One of our transit providers was acquired a year or so back, and it's apparent that things have dramatically changed in how they operate their service and respond to issues. We won't be continuing the customer relationship there for much longer unless we see positive steps taken. Another transit provider has been much more responsive and the conversations have been more productive -- we can work through the issue we observed and get back to a positive situation there (I think).
And then finally, we need more diversity. We turned up another circuit yesterday, and I'm still working on adding another 1-2 providers in addition to more peering or direct connect situations. These take a bit of time, but it's work we need to do before we actually require it, because by then it's too late. We're working as quickly as possible to get this part of our infrastructure on rock solid footing.
> capacity planning is one of those strategic decisions that gets blamed on the engineering team
Absolutely. We were doing capacity planning, but it wasn't anyone's idea of a good time.
We are correcting this with the resources we have today, and we're going to do a better job in the short term. In the long term, this is a non-negotiable core aspect of someone's job -- either already on the team, or a new role that we create. It can't be a secondary, rainy day kind of task you take on time permitting and it also can't be a task that we're not well equipped to handle on a regular basis.
The driving reason that led us to start imgix, and then spend years of our lives growing and operating it, is to empower our customers to build awesome stuff on the Internet. We have to work hard to keep up our end of the bargain, and we also have to forge our partnerships and relationships with customers in a million little ways. I hope that through adversity we've forged some new relationships with customers in the way we've communicated this time (and how we will continue to communicate).
As an aside, I was personally impressed by how one of our vendors handled a service issue they had last year [1]. That was a definite inspiration for how we handled this communication.
[1] https://ns1.com/blog/how-we-responded-to-last-weeks-major-mu...
We switched to hosting our own thumbor (open source), and couldn't be happier. We also pay around a quarter or less of what we paid before (even with failover in place).
We really wanted to use a hosted service. We're not keen on hosting stuff out of our core business. But in this case it just didn't work out.
EDIT: link to Thumbor https://github.com/thumbor/thumbor
We do have some interesting solutions in the lab for more elaborate compositing amongst other things. Part of the challenge there is to stand up a new API interface, since the URL API gets pretty cumbersome for coordinates and tons of composites. We have some interesting ideas there, but it needs to bake a little while longer.
It's also a challenge to get that new stuff out while focusing heavily on our core areas, but we have to do both. You'll start to see more exciting new things ship from us over the next few months.
If you have a solution that works for you, that's great and best of luck. If you do ever feel the urge to check in on what imgix is up to, we would welcome it. Feel free to reach out to me directly any time (e-mail in profile) and I'll work with you to get what you need.
I think your support could have been more responsive though, and communication improved. You guys could probably detect that we weren't happy and about to leave. Reaching out at that point could have made a difference perhaps...
We did genuinely want to work with imgix, but those hurdles left us looking elsewhere. These things happen all the time. Competition is a good thing overall. I'd like to hope that our feedback helps in some way. Even if it's hard to hear.
I was actually surprised there were no hosted thumbor services out there (I found one in Beta I think at the time I was looking, but we weren't sure how stable it is). Maybe because it's so easy to self-host that nobody bothers offering a hosted solution.
Disclaimer: I am currently a customer who had financial loss because of them.
I'm sorry that you had this experience with imgix. We have had challenges in correctly addressing support issues during this period -- support has been swamped. We've expanded that team, and we're continuing to do so.
You're right, we didn't tweet this blog post, but we did e-mail it to all customers with activity since Jan 1 2017.
In this case, we didn't properly convey that the Shopify integration guide we provided wasn't an official integration with Shopify. It was meant as a best effort "here's how you can do it" sort of thing. I requested that we take it down because it wasn't working due to a change on the Shopify side which prevented us from purging images.
I'd love to talk with you directly to try to make this right for you. If you'd like, my e-mail is in my profile.
"Support is on-call during regular business hours, Monday - Friday, and premium customers can request SLAs for support."
This, while other users complained to me via twitter/email that they were also not getting any valid responses. I would not send those words while my service is not working as expected.
I'm in the process of moving off imgix, especially after your support said they would be happy to cancel my account if that's what I wanted.
The pushback from your support made me set up a production-level thumbor instance, which will go live next week.
Thanks for your reply but I wish you had replied to https://twitter.com/troysk704/status/841158793287278593 sooner.
Their support said (after a few days) that they don't support Shopify. Their whole pitch when we signed up was that they are Shopify compatible.
> We operate our own hardware, run our own datacenters, and manage our own network infrastructure.
This seems insane to me. Although I don't work with image processing beyond "saving for web" in Photoshop, so I could be wrong. Why would they not use AWS or any number of other cloud providers where capacity planning is handled for you?
Sure, image optimization is CPU intensive, but their use case is super bursty. When the source image changes, you do multiple optimizations (convert to WebP, lossless JPEG, lossy JPEG quality change, etc), cache the results and you are done. The ratio of "optimizing image" to "serving optimized copy" must be insane.
Dedicated physical hardware feels like a waste.
A critical piece of our network infrastructure failed after 3 years of correct operation in a way that proved difficult for our network engineering team to troubleshoot.
Which piece of infrastructure? How did it fail? Why was it difficult to troubleshoot? Can you do anything to prevent this type of issue in the future?
We observed new traffic patterns with significantly lower cache hit rates than our historical median, and it took us some time to determine whether the source was abusive in nature or a legitimate new customer use case.
What were the new traffic patterns, and why did this cause lower cache hit rates? Why did it take longer than expected to determine the nature of the traffic?
> Which piece of infrastructure?
In this case it was a top-of-rack switch. That rack had a few of our external DNS resolvers on it. Normally this wouldn't be a problem, because we deploy services across multiple racks to prevent this kind of failure mode.
In this case it was a bit of a problem, because this was the first rack deployed at this site and it turned out to not conform to the standard configuration across the rest of the racks (because it was what we bootstrapped the site with).
So two things: it was tough to get into the device to troubleshoot it (we wound up using the serial port infrastructure deployed for this purpose), and it had a service impact even though it shouldn't have (we have since migrated DNS resolvers out of this rack and have scheduled a future maintenance to get this switch's configuration corrected and upgrade its OS).
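As a rough illustration of the "deploy across multiple racks" rule, a check like the following could flag services whose instances all sit behind one top-of-rack switch. This is only a sketch: the inventory format, the service names, and the rack names are hypothetical and not taken from imgix's tooling.

```python
from collections import defaultdict


def single_rack_services(placements):
    """Return services whose instances all live in a single rack.

    `placements` is an iterable of (service, rack) pairs, one per instance.
    """
    racks_by_service = defaultdict(set)
    for service, rack in placements:
        racks_by_service[service].add(rack)
    # A service confined to one rack shares that rack's switch as a
    # single point of failure.
    return sorted(s for s, racks in racks_by_service.items() if len(racks) < 2)


# Hypothetical inventory: both resolvers ended up in the bootstrap rack.
placements = [
    ("dns-resolver", "rack-01"),
    ("dns-resolver", "rack-01"),
    ("renderer", "rack-02"),
    ("renderer", "rack-03"),
]

if __name__ == "__main__":
    print(single_rack_services(placements))
```

Running a check like this against the real inventory on a schedule is one way a non-standard bootstrap rack could be caught before its switch fails.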
> New traffic patterns
This is a bit tougher to dig into, but essentially it was a needle-in-a-haystack kind of situation. Customers can use any imgix URL API parameter they like -- some of these use cases get pretty complicated for the backend to handle. Think of the watermark parameter stacked with another URL that is also an imgix URL with another watermark parameter, five layers deep. These sorts of operations take a lot more rendering resources than a simple ?h=600&w=600 operation.
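To make the "five layers deep" case concrete, here is a minimal sketch that measures how deeply watermark URLs are nested inside a request URL. It assumes the watermark parameter is called `mark` and that each inner URL is percent-encoded into the outer query string; the hostname is a placeholder, and none of this reflects how the imgix backend actually evaluates requests.

```python
from urllib.parse import parse_qs, quote, urlsplit


def nesting_depth(url: str, param: str = "mark") -> int:
    """Count how many levels of `param` URLs are nested inside `url`."""
    values = parse_qs(urlsplit(url).query).get(param)
    if not values:
        return 0
    # Each value is itself a URL that may carry its own watermark.
    return 1 + max(nesting_depth(inner, param) for inner in values)


def wrap(inner_url: str, host: str = "https://example.com/photo.jpg") -> str:
    """Build an outer URL that uses `inner_url` as its watermark."""
    return f"{host}?mark={quote(inner_url, safe='')}"


if __name__ == "__main__":
    url = "https://example.com/photo.jpg?h=600&w=600"  # simple resize
    for _ in range(5):
        url = wrap(url)
    print(nesting_depth(url))
```

A depth of zero corresponds to the simple `?h=600&w=600` case. Each extra level implies at least one more render before the outer image can be composed, which is why these requests cost so much more.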
In this case, we observed an influx of these sorts of more difficult operations. We have various logs and metrics sprinkled throughout the system -- we use Prometheus, Kafka, heka, BigQuery, Grafana and a few other systems to collect and present the data we need to run the service. We also issue unique IDs per request to track their path through the system. What we don't have -- and need -- is one end-to-end view of a request's path through the system and the system's capacity and performance at each point in our stack.
It turned out that for some amount of the increased rendering traffic we saw, it wasn't that we suddenly got more requests. There were simply many new permutations for each original object. That lowers the CDN cache hit rate for a period of time.
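The effect can be seen in a toy model: with the same number of requests and the same originals, more distinct parameter permutations means more first-time misses. This sketch assumes an idealized cache with no eviction or expiry, and the request mix is made up for illustration.

```python
def cache_hit_rate(requests):
    """Fraction of requests served from an idealized cache with no eviction.

    Each request is a hashable cache key, e.g. (object_id, variant_id).
    The first request for a key is a miss; every repeat is a hit.
    """
    requests = list(requests)
    if not requests:
        return 0.0
    seen = set()
    hits = 0
    for key in requests:
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits / len(requests)


# 1000 requests for 10 originals, one rendering variant each.
few_variants = [(i % 10, 0) for i in range(1000)]
# Same 1000 requests and 10 originals, but 50 variants per original.
many_variants = [(i % 10, (i // 10) % 50) for i in range(1000)]

if __name__ == "__main__":
    print(cache_hit_rate(few_variants))
    print(cache_hit_rate(many_variants))
```

In this model the hit rate falls from 0.99 to 0.5 even though request volume is unchanged, and every miss is a render that has to be done on the backend.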
The other thing that comes to mind (and I think I'm forgetting a third type here), is that some of our request parameters require normalization so that they can utilize the same cache object. Think of parameters like dpr, which is a floating point number but realistically is only useful up to a few decimal places. dpr=1.33 and dpr=1.3333333333 are actually the same image, but they would have different cache keys and require two renders that are effectively the same object.
We normalize the dpr parameter down to three significant digits. What we found is that this sort of normalization was necessary for another parameter as well.
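A minimal sketch of that kind of normalization, assuming the value is rounded (rather than truncated) to three significant digits and that the cache key is just the path plus sorted parameters. The function names and key format are illustrative, not the actual imgix implementation.

```python
from urllib.parse import urlencode


def normalize_dpr(value: str) -> str:
    """Round a dpr value to three significant digits, e.g. '1.3333' -> '1.33'."""
    # The 'g' format rounds to significant digits and drops trailing zeros.
    # Very large values would come out in exponent form, but dpr is small.
    return f"{float(value):.3g}"


def cache_key(path: str, params: dict) -> str:
    """Build a cache key with dpr normalized and parameters in sorted order."""
    normalized = dict(params)
    if "dpr" in normalized:
        normalized["dpr"] = normalize_dpr(normalized["dpr"])
    return f"{path}?{urlencode(sorted(normalized.items()))}"


if __name__ == "__main__":
    a = cache_key("/photo.jpg", {"w": "600", "dpr": "1.33"})
    b = cache_key("/photo.jpg", {"dpr": "1.3333333333", "w": "600"})
    print(a)
    print(a == b)
```

Both requests map to the same key, so the second one is a cache hit instead of a second render of an effectively identical image.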
When you have an incident, especially something that hurts your customers beyond just a service outage, you come clean quickly and as transparently as possible.
I'd be curious to hear exactly what they did around this. I recently worked on some gif encoding, and was surprised that there's actually quite a tradeoff to be made between performance and good color palette choices.
I don't have the specific details handy, but I can confirm that it's a bit of a tough one. It's actually much easier to do work in pretty much any other format -- the very reason why people use GIFs (universal support) is also what makes them tough to work with (the spec is old and crusty and from 1989).
I'd like to do more of these sorts of posts -- talk about the challenges with GIF encoding and how imgix has solved them. Probably not this month or anything, but I'll try to get it on the content roadmap for April or May.
Apparently not: https://blog.imgix.com/2015/05/08/racking-mac-pros-hardware-...
Perhaps I am missing the competitive advantage of using OSX to resize images, but it sure seems like the shortcomings are obvious :)
Sorry that you experienced these issues. I appreciate your understanding.
Please feel free to reach out directly to me (e-mail in profile) if there's anything you'd like to talk through.
I'm optimistic that you will have a better experience going forward, but I can guarantee you that we are doing everything possible to provide the service to you and we will do a better job of communicating in future.
imgix does utilize a partner to handle some of our image delivery components. We're a customer of that CDN in that we pay them money for the services they provide, but we're actually more of a partner in that we have done considerable integration work and we work closely with them to build what we need to deliver our product.
We talk with them frequently, and there are pieces that aren't just off-the-shelf that we utilize to provide the integrated service we sell to our own customers.
To be clear though, the issues we faced that are discussed in this blog post are not related to the CDN or delivering already rendered images to end users. It's around rendering new images, which doesn't involve the CDN.
This is without zooming (100%), 120 DPI, Windows 7 x64.
Please use standard fonts and don't overengineer, font hacks like this will never work as expected across all platforms.
I'll prioritize rolling out the fix for this particular issue next week.