My takeaways from QCon SF 2018

- qcon software architectures

[hamy: change to publish on monday]

Last month, I took a trip out to San Francisco to attend QCon [hamy: link] as part of my yearly conference benefit at APT [hamy: link]. This was the first major tech conference I’d attended, and I was bombarded by information from every direction - this post is my attempt to solidify my takeaways and organize additional info for efficient future consumption.

Key takeaways

Talks, loosely organized by major theme

Microservices / architectures

Microservices in Gaming

The video [hamy: link]

In this talk, scale problems are presented against “current” architectures, and you’re walked through the thought process of modernizing those architectures to meet the new challenges.

Gaming systems (League of Legends)

Require:

Had a two-layer system

[hamy: img of the arch]

Takeaway

Media streaming systems (Hulu)

Called out bit.ly/hulu-landscape for further reading

Requires:

In the original arch, they’d built a lot of stuff in-house, but there were limitations to what they built compared to commercial/OSS offerings. He specifically mentioned the lack of “scripting deploys” by devs as a particular issue.

Takeaways

Microservices in general

Cloud v. DC - go cloud for:

  * Elasticity
  * Abstraction

Why microservices?

  * Ownership + independence => velocity and reliability, achieved via:
    * fast, granular deploys
    * built-in CI checks
    * operational and dev scaling

Cloud is especially nice because you’re building on the shoulders of constantly improving infrastructure. Dude recommends just picking a cloud and running with it; you can worry about genericizing your infra later.

Reactive DDD (Domain Driven Design)

The video [hamy: link]

This talk was very much about moving your architecture from a rigid push system to a more organic poll system, as doing so can increase its flexibility and (implied) efficiency. This is very much the argument for log-based systems like Kafka, which can be subscribed to and replayed at any time, over message queues like RabbitMQ, which prioritize single receiver, single play.
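To make the replay distinction concrete, here’s a minimal sketch of a consumer that subscribes to an event stream and replays it from the beginning - something a single-delivery queue doesn’t give you. It assumes a local broker, a made-up “orders” topic, and the confluent-kafka Python client; treat it as an illustration rather than anything from the talk.

```python
# Sketch: replaying an event stream from the start with a Kafka consumer.
# Broker address, topic, and group id are all made up.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-example",     # a fresh group has no committed offsets
    "auto.offset.reset": "earliest",  # so start from the beginning of the log
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)   # poll-based consumption
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        # Each event stays in the log, so any number of consumers can
        # process it independently, now or later.
        print(f"offset={msg.offset()} value={msg.value()}")
finally:
    consumer.close()
```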

[hamy: image of event-driven arch]

Reactive architectures lend themselves well to code fluency - the idea that code easily and accurately conveys the real-world processes it is modeling. Note that this is very similar to one of the primary goals of functional programming.

DDD’s goals are to be:

  * Fluent
  * Explicit

Thus both synergize well with each other.

Creating an architecture like this lends itself well to async operations and thus can increase your ability to scale (though remember that async ops can never increase actual throughput, all other things being equal).

Microservices talk

A microservice is just a system that is also autonomous, with features like:

  * Isolated failure
  * Resiliency
  * Elasticity
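As an illustration of isolated failure and resiliency, here’s a minimal circuit-breaker sketch (not from the talk; the thresholds and names are made up) that fails fast once a dependency has misbehaved enough, so one flaky service doesn’t drag down its callers:

```python
import time

# Minimal circuit-breaker sketch: isolate a failing dependency so its
# trouble doesn't cascade. Thresholds/timeouts are arbitrary examples.
class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # While "open", fail fast instead of hammering the dependency.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow a trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```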

Airbnb: monolith to microservice

SOA design tenets

  1. Services own data, reads + writes
  2. Services address a specific concern
  3. Data mutations publish based on domain events
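As a rough illustration of tenet 3, here’s a sketch of a service that owns its writes and publishes a domain event after each mutation. The service, event, and bus names are hypothetical (the bus is just anything with a `publish` method); the point is the write-then-publish shape:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical domain event describing what changed, not how.
@dataclass
class ListingPriceChanged:
    listing_id: str
    old_price: float
    new_price: float

class ListingService:
    def __init__(self, db, event_bus):
        self.db = db                # this service is the only writer of its data
        self.event_bus = event_bus  # stand-in for Kafka/SNS/etc.

    def change_price(self, listing_id, new_price):
        old_price = self.db[listing_id]["price"]
        self.db[listing_id]["price"] = new_price             # mutate the owned data...
        event = ListingPriceChanged(listing_id, old_price, new_price)
        self.event_bus.publish("listing.price_changed",
                               json.dumps(asdict(event)))    # ...then publish the domain event
```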

SOA best practices

Scaling for cryptomania

Volume testing is good:

  * Perfect parity is not necessary
  * Use the capacity cycle to determine where to increase

Good instrumentation (monitoring/observability) will surface problems, whereas bad instrumentation will obscure them.

Pillars of realistic load testing:

  * Data
  * Patterns of traffic
  * Systems

Faster feedback means faster progress

You want ‘real’ load behavior, so capture and play back real traffic (a rough sketch of replay follows).
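Here’s a rough sketch of that capture-and-playback idea. The capture file format, its fields, and the target URL are all made up - real setups usually lean on dedicated tooling - but it shows the shape of replaying recorded requests against a test environment:

```python
import json
import requests

# Sketch: replay previously captured production requests against a test
# environment to approximate "real" load. Log format and URL are hypothetical.
def replay(capture_path, target_base_url):
    with open(capture_path) as f:
        for line in f:
            # e.g. {"method": "GET", "path": "/search", "body": null}
            entry = json.loads(line)
            requests.request(
                entry["method"],
                target_base_url + entry["path"],
                json=entry.get("body"),
                timeout=5,
            )

# Example (hypothetical paths):
# replay("prod_traffic.jsonl", "https://staging.example.com")
```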

Production ready microservices

Production Readiness: Can trust service to handle production traffic

Stability and reliability:

  * Stable dev cycle
  * Stable deploy cycle
  * Dependency management
  * Routing and discovery
  * Onboarding and dependency procedures

Scalability and performance:

  * Understand growth scales
  * Resource awareness
  * Dependency scaling
  * Constant evaluation
  * Traffic management
  * Capacity planning

Fault tolerance and disaster recovery:

  * Avoid single points of failure
  * Resiliency engineering

Containers

K8s commandments

  1. To go fast, you must start deliberately
  2. Always let them know your next move - move past docker build
  3. Never trust - use pod security policies
  4. Never get high off what kube supplies
  5. Never mix internal/external traffic - lots of cool tools out there to help, like service meshes
  6. If you think you know what’s happening in your cluster, forget it - have observability and logging
  7. Keep your storage and management separate
  8. Use tools
    • Package management
      • Helm2
    • Config management
      • ksonnet
      • pulumi
      • ballerina
    • others
      • skaffold
      • kustomize
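In the spirit of commandment 6 - don’t assume you know what’s happening in your cluster, go ask it - here’s a small sketch using the official kubernetes Python client to flag pods whose containers have no resource limits. It assumes a local kubeconfig and is read-only; the check itself is just an example of scripting visibility into the cluster:

```python
from kubernetes import client, config

# Sketch: list pods whose containers are missing resource limits.
# Assumes a kubeconfig is available locally (e.g. ~/.kube/config).
def pods_without_limits():
    config.load_kube_config()
    v1 = client.CoreV1Api()
    offenders = []
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        for container in pod.spec.containers:
            if not (container.resources and container.resources.limits):
                offenders.append(f"{pod.metadata.namespace}/{pod.metadata.name}")
                break
    return offenders

if __name__ == "__main__":
    for name in pods_without_limits():
        print(name)
```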

Observability

Connectivity, observability, and monitoring

Connect:

  * Service discovery
  * Resiliency
  * Load balancing

Monitor:

  * Metrics
  * Logs
  * Tracing

Manage:

  * Fine-grained traffic control
  * Policy on requests

Secure:

  * Workload identities throughout infra
  * Service-to-service authentication

A service mesh moves all of these functionalities outside of the app, which helps reduce developer overhead and lets you change federated policies very fast (compared with having each dev team roll them out themselves).

Some examples of service meshes:

  * Envoy
  * Linkerd
  * Istio
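To make the “move it out of the app” point concrete, here’s a sketch of the kind of retry + latency-metrics wrapper every service would otherwise carry itself - roughly the logic a sidecar can take off developers’ plates. The endpoint, retry policy, and backoff are made up:

```python
import time
import requests

# Sketch of in-app retry + latency metrics - the sort of cross-cutting
# logic a service mesh sidecar would handle outside the application.
def call_with_retries(url, attempts=3, backoff=0.5):
    for attempt in range(1, attempts + 1):
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=2)
            latency = time.monotonic() - start
            print(f"attempt={attempt} status={resp.status_code} latency={latency:.3f}s")
            if resp.status_code < 500:
                return resp
        except requests.RequestException as exc:
            print(f"attempt={attempt} error={exc}")
        time.sleep(backoff * attempt)   # simple linear backoff between retries
    raise RuntimeError(f"all {attempts} attempts to {url} failed")
```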

How Istio works [hamy: link to the slides?]

Team processes

Netflix: full cycle engineers

The video [hamy: link]

This talk was particularly interesting because it covered many of the inefficiencies I’ve seen in large systems - different teams waiting on each other for things, and stuff often getting lost in the cracks. It presents a way to minimize these botched hand-offs, though it also presents the downsides of the paradigm.

The basis of this talk is essentially that it is an antipattern to need to throw a CM ticket over a silo wall in order to perform your necessary job functions.

Context:

  * dev team
  * dev ops
  * central Ops team

all of whom communicate with each other as part of the dev cycle.

Issues:

  * Devs and testers don’t understand the machines
  * devops / ops don’t understand the apps

This leads to high communication overhead

Issues:

  * people are cautious because they don’t understand everything
  * leads to long troubleshooting sessions and a high mean time to resolution (MTTR)
  * poor feedback (as in low quantity/quality) means slow turn-around and prioritization

Solution:

  * Create specialist teams which will build tools to help overcome each of these problems
  * Add a specialist onto each team to help them interact with underlying/touching technologies with confidence

To do this, your org must:

  * increase staffing
  * increase training via bootcamps and shadowing of other teams/roles
  * prioritize the creation of these paved-road tools
  * “The tooling you build is the tooling your developers need”

Tradeoffs - note this is not for everyone:

  * change is scary
  * each team will need to balance more priorities
  * this will be empowering
  * will also lead to increased interruptions and cognitive load
  * each member will need to decide how they can best fit in/work with the system

Improvements:

  * Tooling that is opinionated and uses the best practices
  * Metrics - to measure impact and areas for improvement

Chaos Engineering with Containers

To do chaos engineering, you need:

  * Monitoring and observability
  * Incident management
  * Cost of downtime/hr

Uses:

  * Outage reproduction
  * On-call training
  * Strengthen new prod builds

Process:

  * Minimize the blast radius
  * Don’t start in prod
  * Don’t start at 100% fail
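A minimal sketch of that process using the Docker Python SDK: kill one container at a time, and only containers that have opted in via a hypothetical chaos.target=true label, so the blast radius stays small and prod stays out of scope:

```python
import random
import docker

# Chaos sketch: kill a single opted-in container. The "chaos.target=true"
# label is a made-up opt-in marker to keep the blast radius small.
def kill_one_victim():
    client = docker.from_env()
    candidates = client.containers.list(filters={"label": "chaos.target=true"})
    if not candidates:
        print("no opted-in containers found; nothing to break")
        return
    victim = random.choice(candidates)
    print(f"killing {victim.name} - now watch your monitoring and incident process")
    victim.kill()

if __name__ == "__main__":
    kill_one_victim()
```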

Testing in production

You should test in both testing && prod:

  * Code confidence can increase as it gets more load
  * When we move to complex systems, it becomes harder to know what might fail
  * Debugging is much more complex -> give it prod traffic
  * We often under-invest in prod tooling, yet the code that ships/deploys our code is some of the most important

Every deploy = process + code + system

By deploying to prod, you can catch 80% of bugs with 20% of the effort:

  * Other environments do not have the same traffic (staging != prod)
  * Real unpredictables (acts of god, acts of stupidity)

The process:

  1. Test before prod
    • Does it reasonably work?
  2. Test in prod
    • Literally everything else

Observability/monitoring is useful to determine what you should actually build:

  * Feature flags are good for testing in prod
  * Use canaries (slowly roll out/fork traffic), can even shadow
  * Allow multiple versions at once
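As a sketch of the feature-flag/canary idea, here’s a percentage-based flag that hashes the user id into a stable bucket so a small slice of traffic exercises the new path. The flag name, percentage, and ranking functions are all made up; real systems typically use a flag service:

```python
import hashlib

# Hypothetical flag table: % of users who get the new code path.
FLAGS = {"new_search_ranking": 5}

def is_enabled(flag, user_id):
    percent = FLAGS.get(flag, 0)
    # Hash the user id so each user lands in a stable bucket across requests.
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < percent

def new_ranking(query):   # stand-in for the canaried code path
    return f"new results for {query!r}"

def old_ranking(query):   # stand-in for the current code path
    return f"old results for {query!r}"

def search(user_id, query):
    if is_enabled("new_search_ranking", user_id):
        return new_ranking(query)   # canary path, watched closely
    return old_ranking(query)       # default path
```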

Misc

Data science as an art

Explorations through solutions:

  * Think and try
  * Hypothesize and experiment

Related reads