A few pointers from our own experience:
- centralized logs (we tried an ELK stack and have moved to Datadog since they added logging). Using a correlation ID helps track the flow as it crosses Lambda / service boundaries
- using Datadog also gives us metrics and dashboards for free, although the data you'd want isn't always there
- we started experimenting with X-Ray to track start-up time and where time is spent. I'd definitely advise trying it if you're tracking down performance issues, though it's a bit of a pain to get working
- testing: as described in Yubl's road to serverless (link in another comment), we have a switch to call the code either locally or remotely through whichever service triggers the Lambda. This usually ensures the logic is sound before deploying, so remote bugs are mostly integration or permission issues
- deployment: we rolled our own with Ansible and CloudFormation / SAM, but if your project fits the Serverless framework's use cases you should probably try that first
- discovery: we use SSM Parameter Store as a distributed key/value DB and a poor man's discovery service: if we want to reach a given Lambda or service, we look up its name or ARN in SSM PS.
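To make the correlation-ID point concrete, here's a minimal sketch. The `X-Correlation-Id` header name and the event shape are our own conventions, not anything AWS defines:

```python
import uuid

def get_correlation_id(event):
    # Reuse the caller's ID if the incoming event carries one (here: an
    # API Gateway-style "headers" dict), otherwise mint a fresh one.
    headers = (event or {}).get("headers") or {}
    for key in ("X-Correlation-Id", "x-correlation-id"):
        if key in headers:
            return headers[key]
    return str(uuid.uuid4())

def outgoing_headers(correlation_id):
    # Attach the same ID to every downstream call so the centralized
    # logs can be joined back into a single request trace.
    return {"X-Correlation-Id": correlation_id}
```

Each Lambda logs the ID on entry and forwards it on every outgoing call; searching for that one ID in the centralized logs then reconstructs the whole flow.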
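X-Ray's SDK aside, the "how much time is spent where" question can be answered crudely with a stdlib-only stand-in while you wrestle with the real thing. `timed` and the section names below are hypothetical helpers, not part of any SDK:

```python
import time
from contextlib import contextmanager

timings = {}  # section name -> accumulated seconds

@contextmanager
def timed(section):
    # Poor man's subsegment: time a named block and accumulate the result.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[section] = timings.get(section, 0.0) + time.perf_counter() - start

with timed("cold-start-imports"):  # e.g. wrap heavy module imports
    time.sleep(0.01)
```

Dumping `timings` at the end of a handler gives a rough per-section breakdown you can eyeball in the logs.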
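The local/remote testing switch is roughly this shape. `INVOKE_MODE` and both callables are illustrative names; in remote mode the invoker would go through the real trigger (API Gateway, SNS, etc.):

```python
import os

def invoke(payload, local_handler, remote_invoker):
    # "local" calls the handler function in-process, which checks the
    # business logic; "remote" exercises the full deployment path, so
    # failures there point at integration or permission problems.
    if os.environ.get("INVOKE_MODE", "local") == "local":
        return local_handler(payload)
    return remote_invoker(payload)
```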
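For the discovery lookup, a sketch with the SSM client injected (`boto3.client('ssm')` in production) so it can be stubbed in tests; the `/services/` prefix is our own convention:

```python
import functools

def make_resolver(ssm_client, prefix="/services/"):
    # Returns a cached name -> ARN resolver backed by SSM Parameter Store.
    @functools.lru_cache(maxsize=None)
    def resolve(name):
        resp = ssm_client.get_parameter(Name=prefix + name)
        return resp["Parameter"]["Value"]
    return resolve
```

Caching the lookups matters: without it every invocation hits Parameter Store again.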
I’m in the process of writing a post (or more likely a series) on our experience and will post to HN when ready
Edit: also, decoupling. If your Lambdas are calling each other directly, consider putting a queue or SNS topic in between. Makes it easier to test each unit independently, can manage timeout / retry issues on your behalf, and gives you a convenient observation point for inter-service traffic
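Sketched, the decoupling looks like this: instead of Lambda A invoking Lambda B directly, A publishes to a topic B subscribes to. `sns_client` is `boto3.client('sns')` in production (stubbed here for testing), and the topic ARN would come from config or the SSM discovery lookup:

```python
import json

def publish_event(sns_client, topic_arn, event):
    # Publishing instead of invoking directly: SNS handles retries and
    # fan-out, and the topic is a natural tap point for observing
    # inter-service traffic.
    return sns_client.publish(TopicArn=topic_arn, Message=json.dumps(event))
```

Testing Lambda B then only needs a hand-crafted SNS message, not a running Lambda A.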