To be blunt, you're abusing the shit out of SOMEONE ELSE'S product that you're not even paying for. Your first question shouldn't be to see what Github can do for you to make it so you don't have to make changes. You should be falling over yourself investigating all available avenues for reducing load.
It's an incredibly entitled way to think about things and I would have a real hard time employing someone whose first response was like this.
I certainly don't think the barb about your willingness to employ people who write things on Github issues threads that you disagree with is helping anyone understand any part of this situation. I understand the urge to find ways to be emphatic about how much you disagree with things, and I often find myself compelled to write lines like that, but I think they're virtually always a bad idea.
And it had this passive-aggressive ring to it, with the hand-clapping and the hurray at the beginning and the stonewalling in effect.
And what the hell was with the quotes around "free"? Are you paying? No? Then there's no quotes about it.
fwiw: "I would have a really hard time working for someone whose first inclination was always towards criticism over accommodation or compassion." But then I also acknowledge there may be a whole bunch of other stuff going on here behind the scenes. ;)
But I think you may be reading his tone more negatively than necessary. What I see is him starting off by expressing gratitude and then switching voices to communicate very explicitly what the needs and desires of his stakeholders are. He's simply trying to reflect that as clearly as possible and discover the additional context. This was clearly effective, as you can see from the rest of the conversation that with all the information out there, everybody comes to a mutually acceptable consensus. Problem solved!
What we've ultimately got here is a free service built on a free service. CocoaPods has nothing but the time and effort of volunteers. GitHub has input of resources from the commercial side of their business. Both sides clearly want to preserve the utility of this end-to-end workflow in a more sustainable way.
"Falling over yourself" is a subjective amount of effort, but clearly CocoaPods has tried to minimize their impact on GitHub. As it turns out, the attempted optimization of shallow fetching backfired, but that's not from lack of regard for the resources they rely on. What was missing was exactly the context the Github employees provided.
Honestly, I think people are offended second-hand by a perceived lack of groveling on CocoaPods part, and to me, that's way overblown.
> To be blunt, you're abusing the shit out of SOMEONE ELSE’S
> product that you're not even paying for.
I <3 GitHub, but for their own business/values/whatever reason they choose to host open source for free. It's not like these people have found a loophole and are getting a paid service for free. AFAIC, that makes every free user a customer. They may not be a paying customer, but it's GitHub's choice to be in the free hosting business.
that being said, i do believe it could help CocoaPods' use case since the fetches are done automatically (as i understand it)
EDIT: the upside is that cocoapods will have to either rethink their architecture in order to eat fewer resources or move to their own paid infrastructure, because their package manager will soon be less than functional given the aggressive rate limiting github is performing.
I'd like to see both happen:
* CocoaPods refactoring to be more efficient
* GitHub providing open source projects the option to buy reserved capacity if they're using excessive resources (versus just saying "No").
GitHub people are truly going above and beyond in service even when barely warranted. I'll give them that.
> CocoaPods is a dependency manager for Swift and Objective-C Cocoa projects. It has over ten thousand libraries and can help you scale your projects elegantly.
The developer response:
> [As CocoaPods developers] Scaling and operating this repo is actually quite simple for us as CocoaPods developers whom do not want to take on the burden of having to maintain a cloud service around the clock (users in all time zones) or, frankly, at all. Trying to have a few devs do this, possibly in their spare-time, is a sure way to burn them out. And then there’s also the funding aspect to such a service.
--
So they want to be the go-to scaling solution, but they don't want to have to spend any time thinking about how to scale anything. It should just happen. Other people have free scalable services, they should just hand over their resources.
Thank goodness Github thought about these kinds of cases from the beginning and instituted automatic rate limiting. Having an entire end user base use git to sync up a 16K+ directory tree is not a good idea in the first place. The developers should have long since been thinking about a more efficient solution.
Honestly if I was GitHub, I'd be tempted to just increase the throttling on CocoaPods and call it done, it isn't their problem if the users of that project have a bad experience. GitHub has provided solutions to the problem, it's CocoaPods that's resisting implementing those solutions.
I think that long-term, the solution will be the Swift Package Manager, and CocoaPods will just be deprecated in favor of it. Let Apple host iOS packages; they're the ones that gain the most benefit from easy iOS development; they have the developer expertise, and the hosting costs are a drop in the bucket compared to iCloud & CloudKit. But that's not all that helpful for people who need an Objective-C package now.
I don't think working on CocoaPods is an altruistic endeavor. I imagine (know) that some of the cocoapods folks are app developers and ostensibly CocoaPods makes developing applications easier.
Side Note: it's not a tragedy of the commons. Github owns the infrastructure and they enforced their private property rights by rate limiting a group of users that were disproportionately using resources. It is a collective action problem for CP users.
No direct financial benefit, but they are deriving a benefit out of their work.
(Edit: </rhetorical> </sarcasm>)
Also while nit-picking, I would clear up your use of "for free </r></s>" as "as a freebie", again post-[insert: x̄ȳz, inc].
I understand the desire to personally maintain as few of one's own servers as possible, but when the result is negative effects on the service hosting the project and a worse experience for the end-user, it might be time to start looking over what google cloud offers.
2) It makes perfect sense to let GitHub handle the performance hit until issues arise. Premature optimization is the devil, right? But once there start to be issues, it's definitely unfair to turn around and say "well you offer the service for free, so you should fix it"
Sentence 1 would still be true if CocoaPods was only used by ten companies developing the ten biggest (in terms of lines of code) Objective-C projects, but there would no longer be a need to scale in the sense of sentence 2.
"Not having to develop a system that somehow syncs required data at all means we get to spend more time on the work that matters more to us, in this case. (i.e. funding of dev hours)"
In other words, using github as a free unlimited CDN lets them be as inefficient as they like. Such as having 16k entries in a directory ( https://github.com/CocoaPods/Specs/tree/master/Specs ) which every user downloads.
Package management and sync seems to suffer really badly from NIH. Dpkg is over 20 years old and yum is over a decade old. What's up with this particular wheel that people keep reinventing it seemingly without improvement?
Trivial apt operations (e.g. trying to install a package which is already installed) on an NSLU2 (an ancient 266MHz ARM machine) take several minutes, whereas the same operation takes several seconds on a modern laptop.
It turns out this is due to the fact that Debian "main" (Packages.gz) has ballooned to 32MB of plain text when uncompressed, comprising more than 41,000 packages, and it has to be parsed and assembled into a dependency tree for every apt operation. This problem screams for SQLite.
A side project I've started looking into is to make a transparent apt proxy which provides a trimmed-down Packages.gz (e.g., removing anything which uses X11), which would be a lot easier than rewriting apt to use a SQLite backend.
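To make the SQLite idea above concrete, here's a minimal sketch of parsing the stanza-based Packages format into an indexed table. The sample data is hypothetical, and this only handles a couple of fields; real Packages stanzas have many more, but the point is that lookups become indexed queries instead of a full re-parse of 32MB of text on every apt operation.

```python
import sqlite3

# Hypothetical miniature Packages file (real ones have many more fields).
SAMPLE = """\
Package: libfoo
Version: 1.2-1
Depends: libc6 (>= 2.17), libbar

Package: libbar
Version: 0.9-3
Depends: libc6 (>= 2.17)
"""

def parse_packages(text):
    """Parse RFC-822-style stanzas (blank-line separated) into dicts."""
    stanza, out, last = {}, [], None
    for line in text.splitlines():
        if not line.strip():
            if stanza:
                out.append(stanza)
                stanza = {}
        elif line[0].isspace():
            # Continuation line: append to the previous field's value.
            stanza[last] += "\n" + line.strip()
        else:
            last, _, value = line.partition(":")
            stanza[last] = value.strip()
    if stanza:
        out.append(stanza)
    return out

def build_index(text):
    """Load parsed stanzas into an in-memory SQLite table keyed by name."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pkg (name TEXT PRIMARY KEY, version TEXT, depends TEXT)")
    for s in parse_packages(text):
        db.execute("INSERT INTO pkg VALUES (?, ?, ?)",
                   (s["Package"], s.get("Version", ""), s.get("Depends", "")))
    db.commit()
    return db

db = build_index(SAMPLE)
# Indexed point lookup instead of scanning every stanza:
row = db.execute("SELECT version FROM pkg WHERE name = ?", ("libfoo",)).fetchone()
print(row[0])
```

Persist the database to disk and only rebuild it when Packages.gz changes, and the per-operation cost on a slow ARM box should drop dramatically.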
This is precisely why yum/dnf has been switching from XML for repodata to SQLite. In fact, the only thing that is still XML-only is the comps file which just lists package groups, is updated rarely and "only" weighs in at half a MB.
Actually, I'm surprised one can actually run a modern Linux on the NSLU2 given its shameful lack of RAM and slooow USB port. But it was a nice gadget when it came out and it was fun to experiment with it.
> It turns out this is due to the fact that Debian "main" (Packages.gz) has ballooned to 32MB of plain text when uncompressed, comprising more than 41,000 packages, and it has to be parsed and assembled into a dependency tree for every apt operation. This problem screams for SQLite.
Correct me if I'm wrong, but isn't apt (and dpkg) basically composed out of a ton of different (perl/shellscript) modules? So it should be possible to create an interface-compatible sqlite data store.
Wouldn't cleaning up the package search interface be a similar effort with much greater payoff?
0: p2pacman - Bittorrent powered pacman wrapper
1: pacman & torrent, feasible?
2: DebTorrent
(0) https://bbs.archlinux.org/viewtopic.php?id=163362
(1) https://bbs.archlinux.org/viewtopic.php?id=115731
(2) https://wiki.debian.org/DebTorrent
I believe scaling this could happen with either: 1) lightweight filesystem/directory versioning support, like how btrfs allows you to mount snapshots. This way, peers could update whichever version of a torrent they have. Or 2) very reliable means to update to the latest torrent release (as reliable as syncing with peers), which afaict means smarter bittorrent clients that can perform DHT-based "crawling". Those recent defcon(?) "hacks" to query peers for similar torrents based on user pools and connection histories (or something like that) would make sense here.
A cool side-note: In one of my few experiences diving into `.git`, I diff'd it before and after making changes to its tracked sources, like adding files and modifying them. It looked like a torrent that included version control data would make out just fine if an updated torrent expected similar data in the same location. Again, a smarter bittorrent client would need to sort some of this out. See also 0': Updating Torrents Via Feed URL. Anyway, most users would probably leave that part out, in favor of only which parts they need.
(0') http://www.bittorrent.org/beps/bep_0039.html
Another cool side-note: This would also allow for easily adding repos from multiple sources... Look at how many ( non-automated :-( ) merge requests github.com/CocoaPods/Specs's caregivers have reviewed: 13,331 as of now (0'').
(0'') https://github.com/CocoaPods/Specs/pulls?q=is%3Apr+is%3Aclos...
> 0: p2pacman - Bittorrent powered pacman wrapper
> 1: pacman & torrent, feasible?
> 2: DebTorrent
That's about distributing packages via p2p. The problematic repository doesn't store any package data, it stores package metadata (it's the cocoapods index if you will).
(It may not have been earlier on, I really don't know. ;>)
Something nifty about the new dnf is several of the older yum commands (e.g. builddep, yumdownloader) are now integrated directly so don't need extra utils installed. Seems like refinement is still happening.
If only my fingers didn't keep typing "dns" instead of "dnf" all the time, it would be great. :D
Fortunately for us, Firefox is canonically hosted in Mercurial. So, I implemented support in Mercurial for transparently cloning from server-advertised pre-generated static files. For hg.mozilla.org, we're serving >1TB/day from a CDN. Our server CPU load has fallen off a cliff, allowing us to scale hg.mozilla.org cheaply. Additionally, consumers around the globe now clone faster and more reliably since they are using a global CDN instead of hitting servers on the USA west coast!
If you have Mercurial 3.7 installed, `hg clone https://hg.mozilla.org/mozilla-central` will automatically clone from a CDN and our servers will incur maybe 5s of CPU time to service that clone. Before, they were taking minutes of CPU time to repackage server data in an optimal format for the client (very similar to the repack operation that Git servers perform).
More technical details and instructions on deploying this are documented in Mercurial itself: https://selenic.com/repo/hg/file/9974b8236cac/hgext/clonebun.... You can see a list of Mozilla's advertised bundles at https://hg.cdn.mozilla.net/ and what a manifest looks like on the server at https://hg.mozilla.org/mozilla-central?cmd=clonebundles.
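For reference, a clone bundles manifest is essentially just a list of pre-generated bundle URLs with attributes the client filters on. A hypothetical example (the URLs and attribute values here are illustrative, not Mozilla's actual entries):

```text
https://hg.cdn.example.net/mozilla-central/abc123.gzip-v2.hg BUNDLESPEC=gzip-v2 REQUIRESNI=true
https://backup-mirror.example.org/mozilla-central/abc123.gzip-v2.hg BUNDLESPEC=gzip-v2
```

The client picks the first entry it's compatible with, downloads the static bundle from the CDN, then does a cheap incremental pull from the origin server for anything committed since the bundle was generated.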
A number of months ago I saw talk on the Git mailing list about implementing a similar feature (which would likely save GitHub in this scenario). But I don't believe it has manifested into patches. Hopefully GitHub (or any large Git hosting provider) realizes the benefits of this feature and implements it.
Mercurial was designed to be easy to extend, and it shows.
Hg was designed to be a DVCS system.
One correction to the post title: it's not maxing five nodes, but five CPUs.
Even Finder or `ls` will have trouble with that, and anything with * is almost certainly going to fail. Is the use-case for this something that refers to each library directly, such that nobody ever lists or searches all 16k entries?
The other side to consider: “one directory per package” is a very simple policy and it feels right in many ways to people (e.g. Homebrew has a similar structure because it's a natural fit for the domain). If the filesystem and basic tools like ls work just fine (which is certainly the case on OS X, where even "ls -l" or the Finder take less than a second on a directory of that size), isn't there a valid argument that the answer should be some combination of fixing tools which don't handle that well or encouraging people to learn about things like `find` instead of using wildcards which match huge numbers of files?
As loath as I am to admit anything about Perl is good, CPAN got this right. 161k packages by 12k authors, grouped by A/AU/AUTHOR/Module. That even gives you the added bonus of authorship attribution. Debian splits in a similar way as well, /pool/BRANCH/M/Module/ and even /pool/BRANCH/libM/Module/ as a special case.
Tooling can be considered part of the problem in this case. Because the tooling hides the implementation, nobody (in the project) noticed just how bad it was. I hadn't seen modern FS performance on something of this scale, apparently everything I've worked with has been either much smaller or much larger. Ext4 (and I assume HFS+) is crazy-fast for either `ls -l` or `find` on that repo.
It seems like tooling is part of the solution as well, but from the `git` side. Having "weird" behavior for a tool that's so integral to so many projects scares me a little, but it's awesome that Github has (and uses) enough resources to identify and address such weirdness.
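The "crazy-fast on modern filesystems" claim above is easy to sanity-check yourself. Here's a throwaway sketch that builds a flat directory of 16k empty files (roughly the size of the Specs directory under discussion; the file names are made up) and times a plain listing:

```python
import os
import tempfile
import time

N = 16_000  # roughly the entry count of the Specs directory

with tempfile.TemporaryDirectory() as d:
    for i in range(N):
        # Empty files suffice: listing cost depends on entry count,
        # not on file contents.
        open(os.path.join(d, f"Pod{i:05d}"), "w").close()

    start = time.perf_counter()
    entries = os.listdir(d)
    elapsed = time.perf_counter() - start

print(len(entries), f"{elapsed:.3f}s")
```

On ext4, HFS+, or APFS this typically finishes in well under a second, which supports the point that the filesystem itself isn't the bottleneck here; git's tree diffing is.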
Think about it from their perspective. GitHub advertises a free service, and encourages using it. Partly it's free because it's a loss leader for their paid offerings, and partly it's free because free usage is effectively advertising GitHub. CocoaPods builds their project on this free service, and everything is fine for years.
Then one day things start failing mysteriously. It looks like GitHub is down, except GitHub isn't reporting any problems, and other repositories aren't affected.
After lots of headscratching, GitHub gets in touch and says: you're using a ton of resources, we're rate limiting you, you're using git wrong, and you shouldn't even be using git.
That's going to be a bit of a shock! Everything seemed fine, then suddenly it turns out you've been a major problem for a while, but nobody bothered to tell you. And now you're in hair-on-fire mode because it's reached the point where the rate-limiting is making things fail, and nobody told you about any of these problems before they reached a crisis point.
It strikes me as extremely unreasonable to expect a group to avoid abusing a free service when nobody tells them that it's abuse, and as far as they know they're using it in a way that's accepted and encouraged. If somebody is doing something you don't like and you want them to stop, you have to tell them, or nothing will happen!
I'm not blaming GitHub here either. I'm sure they didn't make this a surprise on purpose, and they have a ton of other stuff going on. This looks like one of those things where nobody's really to blame, it's just an unfortunate thing that happened.
(And just to be clear, I don't have much of a dog in this fight on either side. My only real exposure to CocoaPods is having people occasionally bug me to tag my open source repositories to make them easier to incorporate into CocoaPods. I use GitHub for various things like I imagine most of us do, but am not particularly attached to them.)
With respect to CocoaPods, I would hope someone on the team had thought through performance characteristics of their architecture.
It's like they brought a shopping cart onto a city bus and were then surprised that it inconvenienced the bus driver and the other passengers.
GitHub is for source control. That means a limited number of people pulling and submitting changes. That does not mean the general public using it as a CDN.
In fact I seem to remember seeing somewhere active discouragement of using it as a CDN.
> It strikes me as extremely unreasonable to expect a group to avoid abusing a free service when nobody tells them that it's abuse
I don't think so at all. An experienced developer should expect that a free service will rate-limit their offerings at some point, and design around that. Viewing 'free' as 'an eternal resource sponge that we never have to think about' is the extremely unreasonable thing to do, in my opinion. I think that 'abuse' is probably the wrong word to use here, since that implies malice, and they don't appear to be malicious.
Also, I'm amazed this is even a problem. 5 CPUs is not a lot in the scheme of things (even if they mean physical instead of cores). TBs of bandwidth are also virtually free compared to a company the size of Github.
Even better: they are getting basically real world loadtested for free and finding loads of pain points, which may hit paying customers.
Unless I'm missing something, throw more metal at the problem. Many companies would love to have every single cocoapod user (which is nearly every iOS developer) have to type github.com into their terminal, for the cost of a bunch of servers + some bandwidth.
Pretty strange, unless this is hitting some really bad area of their service that can't easily be scaled out of (but i would be surprised)
I think their point is that it's using the system in a way that isn't intended or desired. How does that count as "real world" load testing?
And by that logic, shouldn't anybody who gets hit with a DoS attack just say "thanks"? It's tons of free load testing on your network infrastructure, and you'll definitely find some pain points.
What seems insane is to use a single github repo as the universal directory of packages and their versions driving your package manager.
There's a reason rubygems has their own servers and web services to support this use case for the central library registry, even if the sources for gems are all individual projects hosted on github.
That only has 3,000 packages vs 15,000 for CocoaPods or 115,000 for RubyGems.
The CocoaPods developers seem to be missing the entire point of git: it's a _distributed_ revision control system.
Set up a post-receive hook on Github to notify another server, set up with a basic installation of git, to pull from Github so as to mirror the master repo. Then have your client program randomly choose one of these servers to pull from at the start of an operation. A simple load balancer solves this problem.
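The client side of that mirroring scheme could be sketched in a few lines. The mirror URLs below are entirely hypothetical; the idea is just that each fetch picks a random mirror, with a shuffled fallback order for retries:

```python
import random

# Hypothetical mirror list; in the scheme above these would be plain git
# servers that re-pull from the GitHub master repo on each post-receive.
MIRRORS = [
    "https://specs-mirror-1.example.org/Specs.git",
    "https://specs-mirror-2.example.org/Specs.git",
    "https://specs-mirror-3.example.org/Specs.git",
]

def choose_remote(mirrors, rng=random):
    """Pick one mirror at random so fetch load spreads evenly."""
    return rng.choice(mirrors)

def fetch_order(mirrors, rng=random):
    """Shuffled copy of the mirror list, for retry/fallback on failure."""
    order = list(mirrors)
    rng.shuffle(order)
    return order

remote = choose_remote(MIRRORS)
```

The trade-off, as noted elsewhere in the thread, is that someone now has to monitor and maintain those mirrors.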
If CocoaPods reach out to Rackspace and/or other hosting providers, there's a decent chance they'll be able to pull together a good solution. :)
The downside though, is they'll need to figure out some way to keep it monitored/maintained. :/
> GitHub Support is unable to help with issues specific to CocoaPods/CocoaPods.
---
That's a pretty neat feature!
Therefore, it's not Apple's problem. In fact, I've talked to a non-trivial number of engineers (both in Cupertino and long time Cocoa devs) that disapprove of the shortcuts that Cocoapods takes all over, software architecture be damned. Reasonable parties can agree to disagree, but I do include 3rd party framework inclusion without a dependency manager as an interview screen for prospective iOS hires.
Since you mention developer relations, I'll assume you're not actually arguing that this is Apple's technical responsibility, but that they should throw around some $ to grease the wheels to make dependency inclusion better. As a platform vendor, funding hosting costs for some project that you don't agree with just to "support the community" is a bad idea. Better idea is to allocate resources to setup a structure that can fix the issue in a technically agreeable way while also benefitting from the independence a FOSS project provides. In doing so, you are correct that it'd be preferable for Apple to fund/use well-known FOSS standards, such as Github.
In conclusion, Apple should setup a FOSS project to address the current inconveniences associated with third party package inclusion and should involve and pay Github somehow.
Afaik the swift package manager works only for swift code, so it's not a replacement.
Also, it's a very bad habit to lord it over the developers who supply software for your most valuable product. We've seen a lot of stories lately that indie app development is dead. We also regularly see how weak Apple is in web services (cloud sync).
So either developers invest a lot of time to build something that works (and maybe even share it on GitHub) or they will stick with the holy Apple solution and provide a crappy user experience and go bankrupt. Companies like Google or Amazon (AWS) do a very good propag^developer relations job, IMHO way better than Apple ever did (in the last 10 years).
[1] In my opinion, they have since lost that edge on the UI.
This one is pretty big. https://github.com/torvalds/linux
I wonder how much traffic the Github Linux repo gets. Seems to me that people who want to use Linux, will go get a distro instead. And people who want to develop the kernel, will follow the kernel development process (which doesn't rely on Github).
https://github.com/apache - Lots of mirrors but many projects use it as their main source.
- My school uses GitHub to host and track our software engineering project (which still can be argued as OSS).
- People using GitHub issue system as a forum.
- Friends uploading pdfs to GitHub.
- Recently people posted on HN about using GitHub to generate a status page.
I think this is a really bad trend and people should stop doing that.
Using GitHub Issues as a forum and a source for generating status pages are both ok from a use/abuse perspective, but you may not have the best experience since that isn't what Issues is intended for.
It should be fine to come up with new ways to use Github, as long as it's not causing excessive load.
Imagine a world where GitTorrent is fully developed, includes support for issue tracking, and has a nice GUI client that makes the experience on-par with browsing github.com.
I mention this not as an "Everybody bail out of GitHub and run to GitTorrent!!!" sort of statement, because I believe GitHub's response here was excellent and confidence inspiring. But it's an unnatural relationship for community supported, open source projects to host themselves on commercial platforms such as GitHub. GitHub primarily hosts them to promote its business. That's not necessarily a bad thing, but it results in impedance mismatches like the one demonstrated here.
That isn't to say that a mature GitTorrent would replace GitHub. Rather, I envision GitHub becoming a supernode in the network, an identity provider, and a paid seed offering, all alongside their existing private repo business.
Honestly, once I scrape a few projects off my plate, I'm inclined to dive into GitTorrent, see where it's at in development, and see if I can start contributing code. It just seems like such a cool and useful idea.
The potential downsides seem much more annoying. Do you really want to have your dependencies on an overloaded central server somewhere?
How so? I bet the cocoapods team knew they were hammering Github with that gigantic repo. They just didn't care and expected Github to just give them more bandwidth, for free.
The response from mhagger is unnecessarily apologetic, and I predict we'll see an official update from GitHub on this soon.
I don't know about that. Both oh-my-zsh[1] and emacs prelude[2] use git repos as their code distribution mechanisms, and that works really well. I think the real issues here are exactly what is called out in the issue: poor usage of git, and poor directory layout.
[1] https://github.com/robbyrussell/oh-my-zsh [2] https://github.com/bbatsov/prelude
CloudFlare starts at $0 and doesn't meter/charge for bandwidth. CloudFront charges 9 cents per GB and is integrated with other AWS APIs (which can be very useful). Both those solutions could be managed with a donation pool, I would try the CloudFlare free tier first.
That price point works out to $0.35/Gbyte. More typical list pricing for US/EU is in the $0.10-0.15/Gbyte ballpark. Prices decline rapidly as your utilization approaches 1PB per month.
Things that GitHub suggested help with that: faster check for updates, breaking up big directories so diffs are computed faster.
At the time, I wrote a script that hammered git commits into a repository using different strategies and looked at what the git repository would look like after 100,000 and a million commits. The "one version per file, nested in a flat structure" had serious issues.
There may still be scaling limits with the Cargo approach, but if we reach them, we have plans to create a new registry with a new initial commit and let the old registry age out, then rinse/repeat. At the moment, we haven't hit limits yet (with about 1/3 of the packages that Cocoapods has).
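Cargo's index sidesteps the one-giant-directory problem by sharding package names into nested prefix directories. A sketch of that mapping (reproduced from my understanding of Cargo's scheme, so treat the exact rules as approximate):

```python
def index_path(name: str) -> str:
    """Shard a package name into nested directories, Cargo-index style.

    Short names get their own length-based buckets; longer names are
    sharded by their first four characters, so no single directory
    ever accumulates tens of thousands of entries.
    """
    n = name.lower()
    if len(n) == 1:
        return f"1/{n}"
    if len(n) == 2:
        return f"2/{n}"
    if len(n) == 3:
        return f"3/{n[0]}/{n}"
    return f"{n[:2]}/{n[2:4]}/{n}"

print(index_path("serde"))  # se/rd/serde
print(index_path("a"))      # 1/a
```

Small directories keep git's tree-diff cost bounded, which is exactly the pathology the flat 16k-entry Specs directory runs into.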
Go get on the other hand doesn't keep any index. It just uses the url to download the dependency because of the mapping "url==project name" that exists with go projects.
$ git fetch --depth=2147483647
Shallow (depth=1) can be converted into a full clone with the above.
I had no idea it was just CocoaPods repo because my other repos were working fine. I accepted defeat, went to bed and everything was working great in the morning.
1. NuGet packages, which are built from the GitHub repository but then redistributed over a non-Github CDN (NuGet's)
2. tsd management tool (https://github.com/Definitelytyped/tsd), which looks like it prefers Github's CDN raw URLs rather than full/shallow git clones
3. typings management tool (https://github.com/typings/typings) with "ambient" typings searches (`--ambient` flag), which I believe also prefers Github's CDN raw URLs, as it essentially forked from TSD
That said, it's past time to move beyond the giant huge DefinitelyTyped repo, and I for one heartily recommend people migrate to typings which has better support for NPM and other module and package management systems, as well as smaller unit/module-focused Github repositories.
> We would like to provide a package index in the future, and are investigating possible solutions.
[1]: https://github.com/apple/swift-package-manager/blob/master/D...
1. You can't edit the frameworks unless you open a separate project, re-edit and recompile. With Pods, you can edit the Pod in your workspace.
2. Carthage doesn't go the last mile to bundle the framework into your project.
If they addressed those two things, I feel like Cocoapods would probably start losing a lot of steam. Although, after watching some videos about what the goals of Carthage were from the creators, I doubt that those two things will be addressed so I'm waiting for Swift's Package Manager.
The main difference is that homebrew actually updates the git tree to provide updated versions of package specs. CocoaPods adds a new directory and some files for each package version, causing the repo to balloon.
No matter which way you slice it, what CocoaPods is doing is a bit daft, especially at their scale.
I don't know what's normal; last night was one of my first iOS projects.
Seems like a poor design decision on the CocoaPods side.
The problem is not packages, it's the index, which contains 16,000 subdirectories.
Short term sure, they’re doing the right thing, implementing a nice way to manage the free rider problem without hurting them too much.
But long term it’s different.
Financially, one average programmer = $80k/year, one average cloud server = $4k/year. And, GitHub has hundreds of millions of venture capital. More than enough to provision a few more servers, even if they will be installing new servers just for those pods.
The way they act now will lead to someone developing a decentralized git+torrent hybrid. When that happens, sure, those pods will no longer consume GitHub's precious resources. However, for the rest of the github users, there will be no reason to stay on GitHub either.
Not to mention that "just buy another server for this one project" sounds like something CocoaPods should pay for.
Very likely true, but I don't see how that's related.
>they start to have pathologies when you have many entries in the same directory
Only true for inefficient filesystems like FAT.
For NTFS, 16k entries is nothing; the performance will start to degrade (due to directory fragmentation) at around 100k entries: http://stackoverflow.com/a/291292/126995
>"just buy another server for this one project" sounds like something CocoaPods should pay for.
I don’t think that’s how 21 century economy works in this case.
Github’s value is likely between $0.75B and $2B.
Bad PR caused by this story will exceed 10 years TCO of that extra server.