Applying Tokyo’s train network practices to Systems Engineering

It’s no secret that Japan has one of the best and safest train networks in the world. This comes from a cultural understanding that prevention is better than recovery. For someone in Systems Engineering, this kind of system is quite interesting.

Sitting in a station in Tokyo, you will notice a few things.

  • Even though load on the network changes, it continues operating in the same manner.
  • Two trains never arrive on either side of the platform at the same time.
  • Every platform has a train guard watching for incidents and announcing to passengers.
  • The station name is printed everywhere.
  • The line colour is distinct and visible from ceiling to floor.
  • Boarding points are well outlined on the floor.

Communication through multiple mediums

  • There is a plethora of communication mediums throughout stations.
  • Dashboards:
    • Dashboards are key to giving the whole organisation a shared understanding of the current status of services.
    • “What’s next” schedule dashboard – Visual representation of:
      • when – When is the next train scheduled / when is the next deployment scheduled for?
      • where – Where is the train at the moment / is the deployment ready?
      • what – What station is the train bound for / what environment will the deployment be in?
    • Failure dashboard – Visual representation of:
      • affected systems – the (service) line, a (component) station…
      • reason for failure – “passenger rescue”…
      • resulting change – “schedule altered”…
  • Auditory cues:
    • There are so many different station chimes and noises that communicate different statuses to passengers. Unfortunately, I’m unsure how to replicate this in the same manner (audio) for our industry.
    • Station chimes – Every station has a different “chime” which announces the imminent departure of a train from the platform. Following this, the train doors close and the train departs.
      • Announcing deployments via Slack (or similar) before they occur. A different chime = different component.
        • The user group mentioned in the notification indicates the service (line).
        • An attached image indicates the application (station).
      • Introducing a purposeful, static delay at the beginning of deployment procedures.
        • This allows teams to become aware and raise issues if there are any.
        • It makes timing well known from the start of deployment to the application being deployed.
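
The announcement and the static delay can be sketched together. This is only an illustration: the class name, its fields, the message wording and the 10-minute hold used in the usage note are assumptions of mine, and sending the message to Slack (or similar) is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DeploymentAnnouncement:
    """Hypothetical "station chime": announce first, depart after a fixed hold."""

    service: str          # the line; here, the user group to mention
    application: str      # the station
    environment: str      # what environment the deployment is bound for
    announced_at: datetime
    hold: timedelta       # static delay between the chime and the departure

    @property
    def departs_at(self) -> datetime:
        # Timing is known from the moment of the announcement.
        return self.announced_at + self.hold

    def message(self) -> str:
        minutes = int(self.hold.total_seconds() // 60)
        return (
            f"@{self.service}: {self.application} deploys to {self.environment} "
            f"at {self.departs_at:%H:%M} ({minutes} min hold). "
            "Raise any issues before then."
        )

    def may_depart(self, now: datetime, objections: int = 0) -> bool:
        # Doors close only once the hold has elapsed and nobody has objected.
        return objections == 0 and now >= self.departs_at
```

For example, a deployment announced at 10:00 with `hold=timedelta(minutes=10)` reports a departure time of 10:10, and `may_depart` stays false until then or while any objection is open.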

Deploy as often as possible

  • The more often you deploy, the fewer components can fail at the same time.
  • This limits your blast radius, because each deployment is unlikely to change as many components at once.
  • It also has the byproduct of reducing team load, because deploying often forces you to automate shifting deployments blue->green.

Don’t deploy alone

  • The people in your organisation are the number one way to prevent and respond to outages. Every deployment should have a well defined:
    • Application specialist – This is the train driver. They take point for the development team and have deep knowledge of the application.
    • Infrastructure specialist – This is the station staff. They take point for the infrastructure team and have deep knowledge of the deployment environment.
  • It is important to note:
    • Not every deployment should have the same specialists. It is important that every single member of an application or infrastructure team (within reason) be able to deploy. If they are not confident enough yet, they should be duet-ing with a more experienced member of the team.
    • The deployment should be visible to the entire organisation. Whether that is in Slack, Microsoft Teams, or god-forbid (gasp) email, it is important to know what is deploying and when it is doing so.
    • All deployments should be planned. Managers and individual team members should not be able to deploy by themselves on a whim or with no notice. Whilst managers can take part in the deployment process, the deployment should be handled by the aforementioned specialists.
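
The rules in this section could be checked before a deployment is allowed to start. A minimal sketch follows. The `DeploymentPlan` fields, the `plan_problems` helper and the one-hour minimum notice are all hypothetical, and treating "one person in both roles" as a problem is my reading of "don't deploy alone".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MIN_NOTICE = timedelta(hours=1)  # assumed value; pick what suits your organisation


@dataclass(frozen=True)
class DeploymentPlan:
    application: str
    application_specialist: Optional[str]     # the train driver
    infrastructure_specialist: Optional[str]  # the station staff
    announced_at: datetime
    scheduled_for: datetime


def plan_problems(plan: DeploymentPlan) -> list[str]:
    """Return the reasons this plan should not go ahead (empty list = fine)."""
    problems = []
    if not plan.application_specialist:
        problems.append("no application specialist assigned")
    if not plan.infrastructure_specialist:
        problems.append("no infrastructure specialist assigned")
    if (
        plan.application_specialist
        and plan.application_specialist == plan.infrastructure_specialist
    ):
        problems.append("one person is filling both roles")
    if plan.scheduled_for - plan.announced_at < MIN_NOTICE:
        problems.append("not enough notice given")
    return problems
```

Returning a list of problems rather than a single boolean means the rejection can be posted to the same channel as the announcement, so everyone sees why a deployment was held.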

Have and follow well-outlined procedures

  • The attributes of components may be unknown, but packaging, destination and transit medium are well defined.
    • You know where the edge of the platform is. Don’t put things in the way of the train.
      • Perform a visual check before deployment. You can see platform staff doing this with a physical hand movement, sweeping horizontally from one side of the platform to the other while looking for obstructions or issues. Let QA and the team inspect deployments before they go out. This also builds team knowledge.

Don’t deploy multiple projects at the same time

  • Two trains never arrive on either side of the platform at the same time. Multiple failures at the same time can cause confusion.
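
One way to enforce this is to treat each environment as a platform that holds at most one deployment at a time. The sketch below is in-process only and every name in it is hypothetical. A real setup would need a lock shared between the tools that deploy.

```python
from typing import Optional


class Platform:
    """A deployment environment that admits one project at a time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.occupied_by: Optional[str] = None

    def arrive(self, project: str) -> bool:
        # Refuse a second arrival while another project is still deploying.
        if self.occupied_by is not None:
            return False
        self.occupied_by = project
        return True

    def depart(self, project: str) -> None:
        # Only the project currently at the platform can free it.
        if self.occupied_by != project:
            raise ValueError(f"{project} is not deploying to {self.name}")
        self.occupied_by = None
```

A second project calling `arrive` gets `False` until the first has called `depart`, so any failure that follows can be traced to a single deployment.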

Every failure should be visible

  • Presenting train line failures and issues to passengers allows them to take alternative routes or wait with knowledge of the situation. Likewise, if a particular service is down, teams can queue their deployment for a later time.
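
As a rough illustration of queueing behind a visible failure, here is a small status board. The `StatusBoard` class and its method names are made up for this sketch, and a real board would be backed by your monitoring rather than by manual reports.

```python
class StatusBoard:
    """Hypothetical failure board: shows what is down and holds deployments behind it."""

    def __init__(self) -> None:
        self.failures: dict[str, str] = {}       # service -> reason for failure
        self.queued: list[tuple[str, str]] = []  # (deployment, blocking service)

    def report_failure(self, service: str, reason: str) -> None:
        self.failures[service] = reason

    def request_deploy(self, deployment: str, depends_on: str) -> str:
        # If the service is down, queue the deployment and say why.
        if depends_on in self.failures:
            self.queued.append((deployment, depends_on))
            return f"queued: {depends_on} is down ({self.failures[depends_on]})"
        return "clear to deploy"

    def resolve(self, service: str) -> list[str]:
        # Clear the failure and return the deployments that were waiting on it.
        self.failures.pop(service, None)
        released = [d for d, s in self.queued if s == service]
        self.queued = [(d, s) for d, s in self.queued if s != service]
        return released
```

The reply to `request_deploy` carries the reason for the failure, so a team that has to wait does so with knowledge of the situation.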

Change schedules rather than introduce delays

  • Don’t deploy if the teams are unsure of anything.

Published by Alexander

- Alexander is a professional Operations (DevOps/NetOps/SysOps) SRE and Developer living in Tokyo.