QCon 2013: Friday

QCon 2013: Friday Papers

Morning Keynote: A Forward Look at Federated Wiki

Ward Cunningham

Ward is one of the big names of the agile movement, but in this presentation he was talking about another of his technological passions: wikis. He referred to the idea of a wiki as a 'pattern language' – patterns have to be documented and described somewhere, but if they are on a wiki, they can change with the times and benefit from collaborative editing, so that they store authoritative information but also allow evolution and evaluation of the patterns. And although for the most part so far they have done, wikis don't just have to simulate paper; they can store data, calculations and visualisations as well as simple content.

His newest idea is that of a federated wiki which can pull in information from an entire information ecosystem. What does he mean by that? Just as a normal wiki allows collaborative accumulation and interpretation of information about one subject, a federated wiki allows access to information (held on individual wikis, naturally) from a variety of sources. For example, a federation about cars could consist of individual wikis held by manufacturers, parts suppliers, garages, amateur enthusiasts and traffic laws.

The actual federation mechanism would act in a similar way to inheritance, with wikis being able to specify others as their base, with pages within the base being overridden by those in the higher level one if there is a conflict. Because content can be data or calculations, as well as text, and data can be passed between pages, that means that calculations or visualisations from one wiki can operate on data and results in another, or even override data sources in another.

A traditional wiki only shows one page at a time, in one browser tab. Ward showed how a view into a federated wiki opens several panels, each showing a different page, and how data flows 'across' the view between pages – so by changing which pages are open to their left, visualisation pages can show a graph of something different, or calculations can operate on different data sources. Information referred to in a calculation or visualisation is looked for up the current page, and if it isn't there, it's looked for to the left.

The wiki itself allows for client- and server-side plugins to interpret the markup, allowing a particular wiki to interpret its own markup as a domain-specific language (DSL). Ward showed a demo of a federated wiki controlling a microcontroller chip over USB, via a serverside plugin which translates wiki markup commands into chip messages.

The issue I see with this idea is that, although there are several examples Ward gave of fields where a federated wiki could usefully bring information together, it requires that the institutions in that area already store their information in the form of wikis. Outside the technical sphere, I don't really think that's true. It will be a hard job to move towards this future, even though it would be a great place to be.

Morning 1: Sustainable Small Architecture

Robbie Clutton

This talk about the architecture of relatively small systems – the type that all of us as agile development teams will generally be producing – was illustrated by examples all the way through, often of how it had been done wrong but then fixed. The first was at SmallEgg, where they did some usability testing and found that users were repeatedly clicking on headers (which were not links). This shows the importance of doing user testing before release; CrazyEgg and UserTesting.com were mentioned as good tools.

The next example was of a wizard-like application for purchasing a complex product. Each step of the wizard added the default option for that component, and the price displayed to the user updated whenever they changed option. When they did user tracking through the system, a large proportion of users bailed out at the same step. It turned out that this step was adding an expensive default item, so when users clicked through to it they saw their displayed price jump from 50 to 350. They felt deceived and left the site, never to complete a purchase at all. The business behind this site updated it so that users were given the choice of whether they wanted the default option or nothing to initialise each step, and the dropoff at that particular step went away. The point here is to learn about what is blocking users from using your application, and address it.

Another example was of a database application, which was running very slowly. It had no indexes on tables, not even primary or foreign keys. By using tools to help you identify simple mistakes (in this case a database analysis tool like those that come with most professional database systems), you can avoid easily avoidable problems like this.

The next section of the talk was more about general principles of how to construct and improve systems.

A common problem that almost all systems run into at some point is complaints that the system is 'too slow'. This can be addressed by using profiling tools to find the slowest part of the application, and then concentrating efforts on that section. Caching can improve performance, but it can be difficult to do it correctly, for example finding all the places where a cache item should be invalidated is not easy. Resorting to a cache can also hide poorly written code, as it will still be slow, it just won't get run as often. Mark Pacheco made a similar point about the pre-rebuild Songkick architecture in his talk later on Friday.

If you have dependencies on other services in your application request-response cycle (or the equivalent in an interactive application), they should be 'weak dependencies'. Robbie shared an example of a web application which had a bug recorded against it as users not being able to register; it turned out that the problem was with the mailing list provider used for signing people up to newsletters, but it had been coded in as a hard dependency, and being unable to connect to it was causing the whole registration process to fail. Wherever possible, calls to third party services should be done in the background or deferred if they can't be executed immediately, and the main application flow be allowed to continue.

A related point is that web applications should load the primary content first, and then pull in secondary content (ads, links to recent news, Twitter feeds etc), so if some piece of secondary data isn't available, the main content is still showed to the user. The perceived performance of a system (for a website, receiving the content that you want to read) is often more important than the actual performance (completed page load time).

The most expensive resource you have is time, particularly in a small agile team, so it doesn't make sense to spend time replicating something that another tool or library already does, particularly if you only have time to make a poor implementation. Just buy and use the tool! It's generally not possible to build an adequate replacement in the time that is available for the cost of the tool, and it makes sense to concentrate on the functionality that will differentiate your application from the rest of the market, not the functionality that everyone has. And only implement what you really need ... which means you need to ask questions of the customer to find out what that minimum set actually is.

One approach to maintain performance of a system under load is to follow the lead of The Guardian with their emergency mode: if load starts to get too high to offer the full experience to all requests, instead of rejecting requests, disable expensive but secondary features so that the main content can always be served.

Architecture should be allowed to evolve: refactoring should occur on a design level as well as a code level. Don't create abstractions and general solutions until you're sure you will need them – creating an abstract general solution takes work, and if it isn't needed then that is wasted effort. A specific case can always be refactored into a general one later if it is needed.

Morning 2: The Past, Present and Future of NoSQL

Matt Asay

Like many 'big new ideas', that of the schemaless, non-relational database is not entirely new, but takes inspiration from the past. Before the introduction of SQL, NASA developed IMS for their Apollo program in 1969. The schemaless approach meant that thought needed to be put into query design up front, at the same time as application planning. What SQL brought to the table was the ability to think about the schema and the data structure up front, but allow query design to be done later, speeding up the planning phase.

So what has changed to provide pressure in the present day for a change to the SQL/relational model? Very large relational databases (very large amounts of data, or numbers of relational connections) become hard to update or change – it took Craigslist 3 months to complete a tiny schema change in their RDB! The advance of storage technology, bandwidth and Internet connectivity means that large web companies now receive more data and load than a RDB can cope with; the big data revolution began with companies like Google and Facebook trying to find something that would deal with their needs. And as business systems become more people-oriented, the data that needs to be stored becomes more flexible and less suitable for storage in a fixed record format.

NoSQL is moving into the mainstream. Major online companies like Amazon and Netflix use NoSQL databases for their user-oriented data storage, e.g. for their recommendation system, and this type of user-centric data is where their focus and innovation are directed, not their record-based billing systems and accounts (where a relational model will always continue to be the right answer). News media is an area where companies need to be flexible and agile, as nobody really knows what the future of news media will be, and several news sites use NoSQL databases. And more companies are reaching the limits of a single relational database on a single server, so the ability of NoSQL solutions to be scalable on commodity servers is a big advantage.

So what of the future? Matt sees a future where NoSQL becomes the 'new normal': in the majority of cases, the choice of data storage mechanism is quite an open one, and both a relational and non-relational database would be a valid option; in those cases, he sees companies choosing NoSQL databases as their standard. A few mainstream organisations have already made this decision; The Guardian is one of them. NoSQL database implementations have come of age in the present day, and are now general purpose, high performance and easy to use. They are not suitable for everything – Matt was very clear throughout the presentation that the relational database will always have its place – but they are already a reasonable choice for most situations. As they become more mature, they will become the default choice for the majority case for a lot more organisations.

Afternoon 1: A Little Graph Theory for the Busy Developer

Jim Webber

This talk continued the Big Data/NoSQL track, which was well attended, and introduced a different way of thinking about data storage: storing data as a graph or network.

Until recently there was a trend of not only storing data in a database, but performing calculations there too, via stored procedures. But as consumer hardware gets faster and more powerful, complex queries aren't necessarily run against the database any more; instead, data is extracted from the database by simple lookups and processed elsewhere. This means that we are free to optimise databases only for storage and simple reading operations, not complex joins and queries.

Several other talks in this stream were about non-relational systems that acted as some kind of key-indexed storage, which is easy to look up by its index, but hard to write cross-cutting queries for. Jim introduced the idea of a different way of looking at data storage – have the data model store not only data points, but connections between data objects as a graph. Graph traversal allows for individual queries to return very quickly, but the maximum throughput of queries per time period is lower than for a NoSQL database.

In Neo4j, they use a 'property graph model': nodes have arbitrary properties associated with them, and relationships connect two nodes in a directed way (for a two way link, two relationships are created) with a label and other properties. There is no schema applied to the nodes or edges within a graph, each one can have different properties.

Graph theory tells us some useful properties of dynamic graphs, especially social ones, that allow us to make predictions and perform retrospective analysis of a network. The first is that a dynamic graph naturally closes triangles: if A has some relationship with B and also with C, then B and C will naturally develop some kind of relationship, and the relationship between B and C will be such as to maintain structural balance within the triangle. These concepts of 'triadic closure' and 'structural balance' are powerful in making predictions: by constructing a graph of existing relationships, and closing each triangle with an appropriate type of link to maintain structural balance, a representation of all the implicit relationships can be made. Jim demonstrated this with a graph of allies and enemies in the mid 19th century, closed all the triangles and showed that it matched the sides in WW1.

Another important property is known as the 'Local Bridge Property': if a node has two or more strong links, any further connections it makes will be 'weak relationships' or 'bridges' to other parts of the network. Bridges are useful predictive tools when considering how to partition a network, as they will usually break first.

A new way of representing data requires a new type of query language in order to get useful results from it. Neo4j uses a query language called Cypher, which has matching clauses based on relationships between nodes in the network and given starting nodes. By choosing appropriate starting nodes for a query, different questions can be asked of the database, but unlike a traditional indexed model, all queries of the same complexity will be roughly equally fast (and the numbers for graph traversal speed Jim claimed were impressive, of the order of microseconds for simple relationship following). A graph-based data model is applicable for a wide variety of applications so this type of storage deserves a closer look, in my opinion.

Afternoon 2: How we Scaled Songkick

Mark Pacheco

Mark comes from Songkick.com, a 2007 web startup which stores information about upcoming music events, and historic events, and allows users to find gigs by bands they like. It is a major success, scoring over 8 million visits per month. The company is still small, with around 30 people, who work in cross-functional teams that resize themselves as required for the work that is being done at any one time.

Their initial architecture was a typical one for a web startup: a mySQL database, fronted onto by a web application layer that did everything else – not just the web site itself but also auxiliary functions like their scraping and data ingest tools. That meant that a change to any part of the site's functionality required them to redeploy everything, and the unknown dependencies within the web layer meant that changing one part of the site could break things in a completely different part. Their builds would take hours, even using an Amazon EC2 cluster, there were complex relationships between supposedly disparate parts of the system, and the dependency graph became very unwieldy (Mark showed it to us, and it was so dense you could hardly see the lines).

They decided they needed to re-architect it to allow for scalability, to allow their development effort to be applied to functionality, not fixing bugs, and to speed up their release cycle so that they could get new features into production faster. That is a big decision: re-architecting takes a long time, and it isn't guaranteed that the outcome will be better than what already exists. There is also the consideration of whether to re-architect within the framework of the existing system, or whether it would be faster to simply start again with the knowledge gained from the first system.

If you are going to do a major redesign, there are some important points. The design work should be collaborative across the whole team, just as development is. Clear names need to be chosen for objects within the design, and agreed upon by everybody (from developers through to managers, product people and salesmen, so that discussions about what work needs doing are clear for everyone. The existing feature set needs to be looked at and you need to decide what can be cut out from the new system, and what the minimum acceptable set of features is for the first release of the new system. And, if it's possible, the redesign should be done piecewise, so that the whole system is always in a working state of some kind – an application that is taken down so it can be rebuilt, or for which development stops and the old version is left up until the new one is finished, is likely to lose custom.

Songkick decided to move to a strongly decoupled service model. They created a collection of Sinatra (Ruby server technology) applications which deal with the database, and accept requests via HTTP containing either JSON or form-encoded data, returning JSON. Their web application, a Ruby on Rails app, acts as a client to these services, and doesn't have a direct connection to the database. They also redesigned the object model for individual pages, moving to an MVC approach, and their page model now consists of components, which themselves may be made up of elements. Elements can be used in different contexts, and they pick up data from the context they are in (from the page model or the component model).

They also chose to link assets together by convention. A component model class name matches the name of the CSS to be applied to that component, to any JavaScript file that needs to be included with that component, and to a directory for any static assets (e.g. images) it uses. That means that it is very clear what needs to be looked at if a component needs changing or removing.

The result was a radical improvement in productivity, with a release cycle ten times faster; much better performing code (they removed all of their internal cache code and page load times didn't change); a code base that halved in size, partly due to having fewer but better targeted tests; much faster builds, down from over an hour on a cluster to under 10 minutes on a single machine; and a more collaborative, evidence-based development process.

Afternoon 3: Approximate Methods for Scalable Data Mining

Andrew Clegg

We don't necessarily think about it, but characterising large data sets is difficult. select distinct or frequency analysis type queries on a large distributed data source are hard and expensive, particularly when the distinct list doesn't fit in memory. Approximate methods are ways of answering this type of question in an approximate, probabilistic way, usually parallelisably, and with a predictable error, while only storing a small summary of the data which is many times smaller than the full data set. Applying approximate methods to a data set will generally increase the CPU load, although not necessarily in the case of a distributed database where CPU time is used to serialise and deserialise information sent between shards, but database servers are often not running at full CPU load in any case.

The first example was that of set membership: given an index on a table, and an element, is the element in the table? In a small data case it is simple to set up a hashtable of the index and call hashtable.contains(element), but this breaks down when the set becomes large. Instead, we can store with the table a Bloom filter of the index, which is a bitfield of size n, and define k hash functions which each return 0 to n-1. An item has a characteristic set of bits, with k bits turned on, those being the result of running each hash function on that item. When an item is added to the table, the filter is updated with that item's bits; when an item is looked up, if all the bits in the index's filter corresponding to that item are not set, it is not in the index. Existing databases already use this method as a preliminary filter to return false quickly on a lookup that fails.

Next he talked about an approximate cardinality measure, i.e. how many distinct elements a set has. The simplest version of probabilistic counting simply looks at the longest run of trailing zeros in the result of running a hash function on every data item; the estimated cardinality is as simple as 2n. As simple as this method appears, by using several hash functions and combining them all in a final answer, an answer with a low error can be produced.

The next example was of frequency estimation. Similar to the Bloom filter, the count-min sketch algorithm uses several hash functions which return indices into an array. This time, the array is of integers, initialised to 0; when an item is added to the index list, the values corresponding to each of the indices of its hash function are incremented. When looking an item up to find its approximate frequency, each value in the count array corresponding to returns from the hash functions is looked up, and the lowest is returned (because the minimum value of those values is the number of times this item is present, but they could be higher due to other items with the same hash function result). As well as having a tunable error rate (based on the number of hash functions and the size of the array), this method has the useful property that it is more accurate for high counts, which are likely to be the items that are more important to the application.

Finally Andrew talked about the similarity search: finding items that are similar to the one we are looking at. Similar is usually defined as having a low distance in some multi-dimensional distance space, and calculating exact values for that distance can often be expensive. An approximate method requires a suitable locality-sensitive hash function, which (unlike a good normal hash function) will return similar values for items which are nearby in terms of that distance. An example for cosine distance is the 'random hyperplane' algorithm: for a hash of n bits, n hyperplanes are constructed in the space, and for each one that a point is above, the appropriate bit is set to 1. There are other hash functions that approximate other real distance algorithms.

Finally, there is a book entitled "Mining of Massive Datasets" which is important in this area, and freely available to download in PDF format.