QCon 2013

QCon 2013: Thursday Papers

Morning Keynote: 8 Lines of Code

Greg Young

The central point of Greg's talk was that simplicity is good, and magic is bad, because by definition we don't understand magic and therefore can't fix it if it stops working. The title refers to the fact that even with tiny and seemingly simple code, imported magic from frameworks can introduce unseen complexities and make a seemingly simple application difficult to understand.

The particular example that he picks is a seemingly simple command on an object repository; something like:

[Transactional]
public class DeactivateCommand {
	private readonly ItemRepository repository;
	
	public DeactivateCommand(ItemRepository repository){
		this.repository = repository;
	}
	
	public virtual void Deactivate(Item item){
		repository.Deactivate(item);
	}
}

What complexity lurks behind that [Transactional]? If you are using a typical aspect-oriented programming (AOP) framework, a dynamic proxy (i.e. a runtime extension of your class) will be involved to perform the interception and wrapping of methods to implement the transactional behaviour. Not only is that complex, difficult to understand and extremely difficult to debug, but it also introduces 'just because' rules to our coding that don't make sense. What happens if we forget to add that virtual? (Answer: the proxy won't work.) What if we return this; from a proxied method? (Answer: you lose the proxy; this is known as the 'leaky this problem'.) How do we explain this to a new team member?

If this command object is instantiated through a dependency injection or IOC container, finding out what it is using as its repository requires looking through magic non-code configuration, too.

Greg says, and I agree with this: "Frameworks have a tendency to introduce magic into my system". They do so in order to hide the complexity involved in solving the complex problem they are designed to address – but we should look at our particular problem and ask whether it requires us to solve that one. Can we rephrase our problem to avoid the need for magic? Using the example of dynamic proxies again, we see that the problem they are designed to solve is intercepting method calls with variable parameters; if we control the whole codebase, we can change the problem so that method calls don't have variable parameters (for example, by giving every command a handler with a single method that takes a single message object), and the AOP framework becomes unnecessary!

In fact, in C# we can sometimes avoid even creating a handler interface if we stop using the object-oriented approach entirely and use functions: Action<T> and Func<...> types (or even old-fashioned custom delegate types) allow for polymorphic behaviour without the boilerplate of a class.
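As a minimal sketch of that functional style (not Greg's own code): the names DeactivateItem, Handlers and InTransaction below are hypothetical, and the 'transaction' is only simulated by appending to a log so that the example runs on its own. The point is that the wrapping is ordinary, debuggable code with no proxy and no virtual rule.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical message type: every command is a single parameter object.
public sealed class DeactivateItem
{
    public int ItemId { get; }
    public DeactivateItem(int itemId) { ItemId = itemId; }
}

public static class Handlers
{
    // Stand-in for a real transaction manager, so the sketch is self-contained.
    public static readonly List<string> Log = new List<string>();

    // Plain code in place of [Transactional]: wraps any one-argument handler.
    public static Action<T> InTransaction<T>(Action<T> inner)
    {
        return message =>
        {
            Log.Add("begin");
            try
            {
                inner(message);
                Log.Add("commit");
            }
            catch
            {
                Log.Add("rollback");
                throw;
            }
        };
    }

    // The actual work; a real version would call repository.Deactivate(item).
    public static void Deactivate(DeactivateItem command)
    {
        Log.Add("deactivate " + command.ItemId);
    }
}

public static class Program
{
    public static void Main()
    {
        Action<DeactivateItem> handler =
            Handlers.InTransaction<DeactivateItem>(Handlers.Deactivate);
        handler(new DeactivateItem(42));
        Console.WriteLine(string.Join(", ", Handlers.Log));
        // prints: begin, deactivate 42, commit
    }
}
```

Because every handler has the same one-parameter shape, a single generic wrapper covers all of them, which is what makes the interception framework unnecessary in this sketch.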

The other framework type which we tend to use without considering whether it is appropriate is dependency injection (or inversion-of-control, IOC). Dependency injection does solve a problem; in some areas, particularly UI libraries and similar component library fields, deep and unwieldy dependency trees are inevitable. But in most application code they aren't, and if you didn't have a framework that makes it easy, you would be more careful with dependencies. As Greg says, "IOC makes it really easy for you to do things you should not be doing" – creating overly complex dependency relationships.

If you take the functional approach, you can manage dependencies without a framework, by creating closures in code. This mechanism is available in C# through lambda functions, as well as in dynamic languages. Wiring up dependencies in code removes some of the magic (application state coming from annotations or configuration files). Variable dependencies can be injected by passing instance-returning factory functions as parameters, instead of dependency instances directly.
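A small sketch of wiring dependencies with closures, under stated assumptions: Wiring and MakeDeactivate are hypothetical names, the 'repository' is just an in-memory HashSet, and the clock is a factory function standing in for a variable dependency. None of this comes from the talk itself.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class Wiring
{
    // Dependencies are explicit parameters; the returned closure captures them.
    // 'clock' is a factory: it is called on every use rather than injected once.
    public static Action<int> MakeDeactivate(
        Func<int, bool> remove, Func<DateTime> clock, Action<string> audit)
    {
        return itemId =>
        {
            if (remove(itemId))
            {
                string day = clock().ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
                audit("item " + itemId + " deactivated on " + day);
            }
        };
    }
}

public static class Program
{
    public static void Main()
    {
        // All wiring is ordinary code in one place: no container, no config file.
        var activeItems = new HashSet<int> { 1, 2, 3 };
        var auditLog = new List<string>();
        Action<int> deactivate = Wiring.MakeDeactivate(
            activeItems.Remove,               // repository operation
            () => new DateTime(2013, 3, 7),   // fixed clock for the example
            auditLog.Add);                    // audit sink

        deactivate(2);
        Console.WriteLine(activeItems.Count); // prints: 2
        Console.WriteLine(auditLog[0]);       // prints: item 2 deactivated on 2013-03-07
    }
}
```

To find out what the command uses as its repository here, you read Main; swapping in a test double means passing a different lambda.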

Tools hide problems: if a tool makes it too easy to do something that you shouldn't really be doing, you won't notice that you shouldn't be doing it. We should always consider the purpose of a tool and whether we really need to use it. Once you have a dependency on a framework, you then 'own' that framework, in that any problems with your application are still your fault, even if they originate in the framework. And if it's brought magic with it, you won't be able to solve that problem because you won't understand it. So only use the tools you need.

Greg also made a point about the recent trend towards IDE extensions to assist with debugging frameworks. "Why can't I use your framework from vim?", he asked. If you can't explain what a framework does and use it from a simple editor, it is perhaps too magic.

Morning 1: The Inevitability of Failure

Dave Cliff

Dave comes from a banking background and this presentation talked about failure in real and software systems from a mostly financial perspective. Failures will happen, eventually, in any complex system. He started with several examples of the failure of the financial markets, from the tulip and South Sea bubbles through to the failure of Long-Term Capital Management and the May 2010 trough in the Dow Jones index – momentarily the worst one-day performance in US market history, although followed shortly afterwards by the best intra-day performance in history as the market recovered.

This type of market failure is happening harder and faster due to the rise of algorithmic trading. In the last decade, algorithmic trading programs ('robot traders') have risen from a small, specialist part of the market into the norm; over 70% of all trades are now performed by computers, often with millisecond response times. A small bug in one of these systems can cause a very fast and serious failure; Knight Capital was seriously damaged by mistakenly deploying development environment market simulation to production, costing them over $400m, and on a less serious level, automated book pricers on Amazon.com resulted in a second-hand book being offered for over $20m.

Stock trading has always been based on the latest technology and speed of information: first horse messengers, then pigeons, then telegraph and telephone communication, and now the Internet. And technology has always had occasional failures, particularly when users become involved, as people can't be rigorously modelled, so even a perfect engineering solution can fail once people are included in the picture: the Millennium Bridge in London was engineered correctly, but was not good for users.

Catastrophic failures often happen because of the normalisation of deviance. We start out doing something in a safe and controlled way, and make some guesses about what the safe operating parameters are. But every time we go outside those parameters and there is no failure, even though it triggered all our warning alarms and processes, it is natural to expand the 'safe' operating zone to include the new conditions ... even though the risk of failure is greatly increased. The Challenger and Columbia shuttles were both lost to events which were known to be a potential problem, but for which the deviant parameters had become normalised so that the increased risk was repeatedly taken, until a catastrophic failure did occur.

This problem is also prevalent in financial trading software engineering. Risk management in a new algorithmic trading program is extremely tight, but as the algorithm gets away with making risky decisions, risk management is relaxed until a catastrophic failure (in financial terms this time) occurs. As we see more algorithmic trading in the markets, we are likely to see more technology-created catastrophic market failures like the one in May 2010 (and Dave listed several other examples of individual markets being destabilised by a failure of an algorithm).

Morning 2: The What, Why and How of Open Data

Jeni Tennison, Open Data Institute

Jeni is an advocate of open data, and in this talk she laid out some reasons for us to join in with our own data.

Good data should be reusable – consumable by several different applications or modules – and combinable – an application should be able to read data from multiple sources and work with all of it. Most services are designed to be linked to other services, via data streams, and offer data through application-specific APIs. Using a well defined standard format makes it easy to pass data between services, whether that data is open or not.

Most current data, even that which is publicly available, is not open. Open data has to be available to everyone, to do anything with it, for example using a Creative Commons attribution licence. (Share-alike licences, more like the GPL, are also available, but they restrict use cases to some degree.)

Why would a company which generates or provides data want to make it open? The benefits for everyone else are clear, but in the case of a non-altruistic business entity, there must be an incentive for the company too. Providing open data allows other companies or individuals to provide additional services, for example mobile applications or visualisations. As long as the data source is required to be attributed, this can extend brand awareness and user base. Collaborative editing and updating of data can also produce excellent and accurate output, if the user base has an interest in keeping it up to date; for example Wikipedia or OpenStreetMap. Offloading some of that data maintenance onto users lowers the cost of maintaining the same quality.

Deciding whether to open up, and which data to open, involves several considerations. Primary data, which is generated at high cost or effort, can have commercial value high enough that it may not make sense to open it up. But most companies generate large amounts of 'secondary data', which is a side effect of other processes (for example transactional data, information in CRMs etc), and which can be opened up if it doesn't contain personal data. Any data referring to individual people is likely to raise data protection concerns and again may not be eligible for open distribution.

Open data is still an experiment: we don't know exactly which business model works the best, how best to measure the usage of data or how to find open data when we want some. But Jeni asks us to consider the benefits that opening some data up can provide to our own businesses, as well as society at large.

Afternoon 1: Agile Adoption in Practice

Benjamin Mitchell

Agile advice is often abstract, confusing or hard to implement. A lot of material is about principles and themes, and not about how to put that into practice. Some of it even falls into the category of "unactionable advice" (Ben's term), which boils down to "If you follow these principles well, your project will succeed". Well ... that's nice, but what if it doesn't? Your advice doesn't help me solve a problem at all! Similarly, a key plank of agile methodology is that teams should be allowed to self-organise; that's fine if it works, but what if the team organises itself in an ineffective way (e.g. one overly aggressive decision maker takes over)?

Ben told some amusing anecdotes about teams where communication or transparency broke down. It's easy to lose sight of the importance of transparency if you perceive a threat, for example if you think a manager won't let you work in the agile fashion you want. It's also difficult to communicate accurately when things aren't perfect. He showed us a quote from Chris Argyris about people's behaviour when a mistake has been made: "People blame others and the system, denying personal responsibility, and then deny that they are denying" – obviously ineffective behaviour for resolving the problem. And although agile principles advocate transparency, the truth needs to be phrased in a way that doesn't embarrass or threaten team members.

We don't actually act in the way that we think we do, particularly when we're under pressure. Our colleagues can see when we are not being effective (for example becoming angry, aggressive or otherwise blocking progress), though, so we should ask them to tell us when that happens. It can also be useful to record our own actions and play them back later, to observe our own behaviour.

The source of most friction within a team or group dynamic is an 'I know best' attitude. People think that they know the correct solution, and therefore anyone who is thinking differently is either uninformed, stupid or deliberately obstructive; that starting position inevitably leads to conflict. Instead, it's better to accept that everyone's world view, including our own, is imperfect, and someone with a different view might be seeing something we aren't. We should also agree on protocols to discuss negative content.

It's also important to change people's views using real, verifiable data. A disagreement of opinion is very hard to resolve; a disagreement about a fact can be ended by finding out who is right about the fact. Ben introduced the idea of the "Ladder of Inference" for how we come to conclusions about what actions should be taken: first we describe a problem that we see, then we explain why it is a problem, evaluate how best to resolve it, and finally decide on actions to propose. If we drop an action into a discussion (for example "I think we need to estimate more carefully") without explaining the steps that we took to get there, it is likely to encounter confusion and resistance.

Afternoon 2: Accelerating Agile

Dan North

The traditional agile model has shortened release cycles, increased productivity and made development more reliable, among other benefits. But what if our usual sprint cycle is just too slow? Dan (the originator of BDD and a big name in the agile world) gave us a twelve-step description of how he had seen a trading desk application, which usually takes several months to create, built in only two weeks.

1. Learn the domain. This company actually sent its programmers on the trading course along with junior traders, as well as including a domain expert in the team. Developers would actually get as close to doing trading work as rules allowed, sitting with traders and watching how they used their trading applications.
2. Prioritise risky functionality over 'valuable'. This is the opposite of the normal approach, where we go for 'quick wins' – but when your application is moving into the unknown, finding out where unknown problems lurk is valuable, and these problems lie in the risky areas. Turning up a problem can change the plan, so work that would have been 'valuable' becomes no longer necessary, and it would have been wasted if you had done it first. Uncertainties and risks can come from the domain and functional requirements, for example how to integrate with existing data sources or services, or technical non-functional requirements like the best latency and throughput a particular approach can give. As long as the entire minimum acceptable functionality gets done, it doesn't matter in which order the work is done, so it makes sense to remove uncertainty as fast as possible.
3. Only plan as far as you need. Planning takes time, and if resolving uncertainties changes the plan for the next days, that time was wasted. The planning horizon can be adjusted as the team grows, as the requirements become clearer and uncertainty reduces, but for accelerated agile it should only ever be as far as you need to plan right now.
4. Try something different. Use different programming languages, like Scala, to develop prototypes fast; they can always be rewritten later if necessary. Using different languages also exposes the team to new ideas (a point also made in Damian's Wednesday closing keynote); Dan's words are that this is "not necessarily to use the language, but to learn the thought processes that go with it".
5. "Fire, Aim, Ready". Make releases before all the functionality is complete, to get user feedback. Automated testing will give immediate technical feedback (i.e. whether the software works); frequent updates to the user get near-immediate non-technical feedback (i.e. whether the software is doing the right thing). If you don't get feedback on design direction frequently, "you can dig a pretty big hole in a week" if you misinterpret the customer's needs.
6. Build small, separate pieces. Modularity is a standard part of modern development, but isolation is not. This company created totally isolated services, communicating only through data, and not even coupled through common code; a violation of the DRY (don't repeat yourself) principle, but when components should be separated, the coupling introduced between them by using common code can be worse than the duplication. Writing as much logic as possible as functions helps too, as functions are by definition small and independent.
7. Deploy small, separate pieces. Consider the deployment requirements of each component. Some will need to be up for extended periods (e.g. feed handlers) and can't be taken down for redeployment; others are non-continuous and can be hot-patched. Some will be critical and require extensive and exhaustive testing to remove the chance of a high impact bug; others won't and can be tested to a lower level to allow time to be spent elsewhere. A large application is as slow to deploy as the slowest part and its potential bug impact is that of the most dangerous part, so by splitting it up, change becomes easier and faster. It's also important to deploy components that are self-describing (not only with regard to metadata but also their current state when running), so the deployment of the whole product can be consistent and reliable, and so monitoring becomes easy.
8. Choose simple solutions over easy answers, a point also covered in Stefan Tilkov's talk about web development.
9. Make the appropriate trade-offs, e.g. build/buy/open-source, or framework versus roll-your-own. Being ideological about issues like this can result in not using the fastest and most efficient option.
10. "Share the love": share knowledge around the team through common agile techniques like pair programming, peer review and thorough onboarding of new team members.
11. Be okay with "failure". Concentrate on product development, not project delivery. Development is a series of experiments, and an experiment that "fails" has still been valuable: now we know that doesn't work. We've learnt something – that's progress!
12. Because there are always twelve steps! And a warning: delivering fast – every day – can be addictive!

Afternoon 3: How Not to Measure Latency

Gil Tene

Gil is the CTO of Azul, who make a fast, low latency and low pause time JVM (Zing). In this presentation he explained how naive measurements of latency and response time can lead to incorrect results and poor decision making.

Before deciding what measurements to take, it's important to consider why we want to measure response times. What features of the response time distribution do we care about? When a system is loaded, it doesn't have a fixed response time as a function of load; typically, the distribution of 'hiccups' (pauses where response times are anomalously long) is multimodal as different types of 'freeze' take effect. These distributions can't be modelled accurately by an average and a standard deviation; the whole shape of the distribution is important.

Different applications have different requirements for latency behaviour. A critical system may have absolute limits on what the worst case response time can be, which in a way makes measuring performance easy: the only factor you care about is the maximum time. But for 'soft' real time applications, like algorithmic trading, or interactive systems where the requirement boils down to 'don't annoy the users', the performance percentiles under projected maximum load are what matter. So before investing time into measuring response times under load, it's important to establish the actual performance percentile requirements of the application. 'Sustainable throughput' is the maximum rate of requests that can be serviced while still satisfying the latency requirements, so it makes no sense without knowing those requirements.

One of the most common problems in measuring response times is the Coordinated Omission Problem: observations don't get missed at random, and it's disproportionately the bad answers that get missed out. Most load testing frameworks create lots of threads or processes, each of which streams requests at the target. That means that if a request takes an unusually long time, the thread or process is waiting for it to return, and not submitting more requests – thereby failing to record as many results during a bad time! This can seriously affect the accuracy of measurements; if you are submitting requests every 10ms, and there is a 'hiccup' of 1 second every 10 seconds, you are failing to record roughly 100 bad results during each hiccup. The 99% latency in this scenario is really close to 1 second, but a measuring tool will record it as 10ms! An unreasonable difference between the 99% value and the maximum value can be a good indication that your load test has this problem.
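The arithmetic in that scenario can be checked with a small simulation. This is only a sketch under assumptions that are mine rather than Gil's: time is modelled in whole milliseconds, a healthy response takes 10ms, the 1 second hiccup sits at the start of each 10 second period, and a request sent during a hiccup is served as soon as it ends. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CoordinatedOmission
{
    const long IntervalMs = 10;     // intended gap between requests
    const long ServiceMs = 10;      // assumed response time when healthy
    const long StallMs = 1000;      // length of each hiccup
    const long PeriodMs = 10000;    // a hiccup starts every period
    const long DurationMs = 10000;  // length of the simulated test

    // Time at which a request sent at sentAt gets its response.
    static long CompletionTime(long sentAt)
    {
        long phase = sentAt % PeriodMs;
        long serviceStart = phase < StallMs ? sentAt - phase + StallMs : sentAt;
        return serviceStart + ServiceMs;
    }

    // Latencies if every request is sent on schedule, hiccup or not.
    public static List<long> IntendedSchedule()
    {
        var latencies = new List<long>();
        for (long t = 0; t < DurationMs; t += IntervalMs)
            latencies.Add(CompletionTime(t) - t);
        return latencies;
    }

    // Latencies seen by a client that waits for each response before sending again.
    public static List<long> BlockingClient()
    {
        var latencies = new List<long>();
        long t = 0;
        while (t < DurationMs)
        {
            long done = CompletionTime(t);
            latencies.Add(done - t);
            // Resume on the first scheduled tick at or after the response.
            t = (done + IntervalMs - 1) / IntervalMs * IntervalMs;
        }
        return latencies;
    }

    // Nearest-rank percentile, using integer arithmetic only.
    public static long Percentile(List<long> samples, int percent)
    {
        var sorted = samples.OrderBy(x => x).ToList();
        int rank = (sorted.Count * percent + 99) / 100;
        return sorted[Math.Max(rank - 1, 0)];
    }
}

public static class Program
{
    public static void Main()
    {
        var recorded = CoordinatedOmission.BlockingClient();
        var intended = CoordinatedOmission.IntendedSchedule();
        Console.WriteLine("blocking client:   " + recorded.Count + " samples, p99 = "
            + CoordinatedOmission.Percentile(recorded, 99) + " ms, max = " + recorded.Max() + " ms");
        Console.WriteLine("intended schedule: " + intended.Count + " samples, p99 = "
            + CoordinatedOmission.Percentile(intended, 99) + " ms, max = " + intended.Max() + " ms");
        // prints:
        // blocking client:   900 samples, p99 = 10 ms, max = 1010 ms
        // intended schedule: 1000 samples, p99 = 910 ms, max = 1010 ms
    }
}
```

Under these assumptions the blocking client silently drops 100 of the 1000 intended samples, all of them slow ones, so its 99th percentile looks healthy while its maximum gives the game away.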

Before running a measuring tool against a real system that you're interested in, it's a good idea to create a synthetic system with known hiccup behaviour (for example deliberately turning it off for some time), and make sure that the monitoring tool you are using correctly characterises that system. If it doesn't, Gil offers the HdrHistogram library which can characterise response time results correctly.

Finally, Gil ended with some comparisons of servers running Azul's Zing JVM against those using the standard one – using non-normalised charts because, as he puts it, "it's really hard to depict being 1000× better in 100 pixels".