Wednesday, May 23, 2012

X-Rays for OSGi

One of the frustrations of my OSGi years is that, while OSGi provides a tremendous wealth of dynamic data about the state of the system, nobody has taken the time to really visualize it. By far the best tool I know is the Apache Felix Web Console because it provides comprehensive information about the operations. However, even with this tool I am known to waste hours chasing problems that would have been obvious had I looked at the proper page. Though the Web Console is the best tool around for finding information, it falls far short of a tool that can inform me when there are problems.

Our intelligence is closely related to our visual brain. We can spot problems in one glance, problems that would take ages to discover in raw data. With a 17" MacBook Pro and an extra 27" screen, I've so much screen space (and many developers have a similar setup) that I gladly devote the Pro's 17" to a continuous X-Ray of my running framework. Alas, such a view does not exist.

Well, it did not exist, past tense. Though it is not related to my venture, my frustration made me spend some time on creating such an X-Ray for OSGi. The excuse was that I needed to learn JavaScript anyway and this seemed an excellent opportunity to learn the intriguing d3 library. I had hoped to finish this last week but Richard S. Hall and Karl Pauls spent a worthwhile week here to discuss their plans and see how we can collaborate. This involved much more (intense!) talking than I had figured, so I had to postpone the finishing touches till this week.

Since I really like the Apache Felix Web Console I started from there. It turns out that it is trivial to write a plugin for the Web Console since there is a good base class that takes care of most of the chores. The Web Console does not only look good, it is also well designed. Adding a plugin is as simple as registering a service. The Abstract Web Console Plugin they provide was an easy way to get started quickly.

So after a few hours I had a basic SVG window with bundles and services. Getting it to look nice was the really hard part. I wanted to use the diagramming technique I always used, triangles for services and boxes with rounded corners for bundles. That was the easy part. Getting the wires to run without causing a visual mess was the hard part. The d3 library contains some layout managers but none was suitable for my purpose (though bundles and services look pretty cool in a force-directed graph!). In the end I settled on a grid where bundles go vertical and services go horizontal. In this model you can always wire with a horizontal and a vertical line that do not meet obstacles. This sounds simpler than it is because the diagramming technique for OSGi requires registering bundles to connect at the sharp side of the triangle, getters at the flat side, and listeners at the angled side. Since wires can overlap you also need a visual cue to see which wires are joined. Getting this right was quite tricky and required me to dust off my trigonometry books. The result looks like the following picture:
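To give an idea of the grid, here is a minimal sketch in Java. It is not the X-Ray code (that is JavaScript with d3); the class name, the sizes, and the choice to give each bundle a row and each service a column are assumptions made for the illustration. It only computes the elbow of a wire and does not attempt the docking sides of the triangle or the cue for joined wires.

```java
// Sketch of the grid idea: bundles get rows, services get columns.
// All names and sizes are illustrative assumptions, not the X-Ray source.
final class GridWiring {
    static final int ROW_HEIGHT = 40;    // vertical spacing between rows
    static final int COLUMN_WIDTH = 60;  // horizontal spacing between services
    static final int BUNDLE_WIDTH = 120; // bundles occupy x in [0, BUNDLE_WIDTH]

    /** Y coordinate of the middle of the given row. */
    static int rowY(int row) {
        return row * ROW_HEIGHT + ROW_HEIGHT / 2;
    }

    /** X coordinate reserved for the service in the given column. */
    static int serviceX(int serviceColumn) {
        return BUNDLE_WIDTH + (serviceColumn + 1) * COLUMN_WIDTH;
    }

    /**
     * Routes a wire as an elbow: horizontally along the bundle's row to the
     * service's column, then vertically to the row where the service is drawn.
     * Returns three points as {x, y} pairs.
     */
    static int[][] route(int bundleRow, int serviceColumn, int serviceRow) {
        int y = rowY(bundleRow);
        int x = serviceX(serviceColumn);
        return new int[][] {
            { BUNDLE_WIDTH, y },     // leave the bundle's right edge
            { x, y },                // corner of the elbow
            { x, rowY(serviceRow) }  // arrive at the service
        };
    }

    public static void main(String[] args) {
        for (int[] p : route(2, 1, 0)) {
            System.out.println(p[0] + "," + p[1]);
        }
    }
}
```

Because every wire consists of one horizontal and one vertical segment, the points can be fed straight into an SVG path or polyline.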


What functions did I implement? In some random order:
  • Objects navigate to the page in the Web Console where they are detailed. Clicking on a bundle takes you to the page for that bundle, clicking on a service takes you to the services page.
  • The state of a bundle is indicated with color. An orange bundle is happy and active, a grey bundle is resolved, and a white bundle is installed only.
  • The information is polled from the system and automatically updated. The update is transitioned so uninstalling a bundle causes the bundle beneath it to smoothly crawl up, automatically adjusting any services if necessary. Useless from a functional point of view but pretty cool to see!
  • You can remove services by dragging them off the screen. Refreshing the window brings them back.
  • If a bundle has recent errors or warnings in the log then a small warning icon is displayed. Clicking on this icon takes you to the log, hovering over it gives you the log messages. The Log Service must obviously be present for this to work.
  • If the Service Component Runtime service is registered then the information about DS components is collected and displayed in the bundle. Each component is summarized with an LED that is either red (not satisfied) or green (active).
  • Listener Hooks are used to find out what bundles are waiting for what service. Services that are not present but still are waited for are drawn with a dashed outline. Services that are only registered but not used are displayed white. Active services used by bundles are yellow.
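The color rules in the list above boil down to two small lookups. The following Java sketch spells them out; the enum, the method names, and the style strings are made up for this illustration and are not taken from the plugin.

```java
// Illustrative mapping of the states described above to drawing styles.
// The enum and the style names are assumptions, not the plugin's code.
final class XRayStyles {
    /** Stand-ins for the three bundle states mentioned in the list. */
    enum BundleState { INSTALLED, RESOLVED, ACTIVE }

    /** Fill color of a bundle box. */
    static String bundleFill(BundleState state) {
        switch (state) {
            case ACTIVE:    return "orange"; // happy and active
            case RESOLVED:  return "grey";
            case INSTALLED: return "white";  // installed only
            default: throw new IllegalArgumentException("Unknown state: " + state);
        }
    }

    /**
     * Style of a service triangle.
     * registered: the service is present in the registry.
     * users: number of bundles currently using the service.
     */
    static String serviceStyle(boolean registered, int users) {
        if (!registered) {
            return "dashed"; // waited for but not present
        }
        return users > 0 ? "yellow" : "white";
    }

    public static void main(String[] args) {
        System.out.println(bundleFill(BundleState.ACTIVE));
        System.out.println(serviceStyle(false, 0));
        System.out.println(serviceStyle(true, 2));
    }
}
```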
The xray bundle only requires the Web Console; you can download it from the GitHub repo. The source code for this plugin is inside the JAR. If you just want to try it out quickly, you can:
$ sudo jpm install xray.demo.jar http://dl.dropbox.com/u/2590603/xray.demo.jar
Installed command [xray]
$ # start/stop bundles with the shell ...
$ xray
-> stop 10 ...
Oops, guess jpm is not ubiquitous yet ... So just download it from the URL and use it like:
$ java -jar xray.demo.jar
-> stop 10 ...
If you got it running, open your browser on http://localhost:8080/system/console/xray and have fun. Feedback very much appreciated! Since I am a committer at Apache Felix I will likely move the code over there once it gets a bit more mature.

Peter Kriens

Wednesday, May 9, 2012

Clustering

One of the fundamental features that I need is a way to distribute tasks reliably over the systems in the cluster. The requirements I have are:
  • Load balancing, tasks should be spread evenly over the nodes in the cluster.
  • Persistent, once a task is submitted it should be executed once. If the component can provide this guarantee then transactional results can be achieved without locking.
  • Transient failures should be handled by trying to re-execute the task.
  • Periodic tasks, some tasks must happen on a regular basis.
  • Timed tasks, some tasks should happen after a future time.
  • Asymmetric clusters, that is, no requirement that each node is identical. Certain nodes can handle certain tasks that others potentially cannot.
  • Low overhead, though it is clear that a persistent queue is needed, the mechanism should be useful for simple tasks.
  • Support for non-Java languages.
The model I came up with is a Task Queue service. For example, to queue a task that takes a Charge item:

Charge w = new Charge();
w.card = "6451429121212";
w.exp  = "03/12";
w.ccv  = 887;
w.charge = 1200;

TaskData td = taskQueue.with(w).queue();

Queuing a task will persist it first and then find a node in the cluster to execute it on. I am currently using Hazelcast as the communications library between nodes. Hazelcast has distributed maps with events for insertion and removal. When a new task is added, each node checks if it can handle that task type; if it can, it queues the task locally. When the task is ready for execution it is removed from the distributed map; the first node to remove it wins and executes the task. So far I really like Hazelcast because it is a very cohesive library and seems to provide straightforward solutions in a really complex area.
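The "first one wins" rule can be sketched with a plain ConcurrentHashMap standing in for the distributed map. This is only an illustration of the idea, assuming that removal from the map is atomic; the class and method names are hypothetical and no Hazelcast API is used here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of claiming a task by removing it from a shared map.
// A local ConcurrentHashMap is a stand-in for the distributed map.
final class TaskClaim {
    // Pending tasks: task id -> payload (a String here for simplicity).
    private final ConcurrentMap<Long, String> pending;

    TaskClaim(ConcurrentMap<Long, String> pending) {
        this.pending = pending;
    }

    /**
     * Tries to claim the task. remove() is atomic, so only one caller gets
     * the payload back; everybody else gets null and must not run the task.
     */
    String claim(long taskId) {
        return pending.remove(taskId);
    }

    public static void main(String[] args) {
        ConcurrentMap<Long, String> shared = new ConcurrentHashMap<>();
        shared.put(1L, "charge 1200");

        TaskClaim nodeA = new TaskClaim(shared);
        TaskClaim nodeB = new TaskClaim(shared);

        System.out.println("node A got: " + nodeA.claim(1L));
        System.out.println("node B got: " + nodeB.claim(1L));
    }
}
```

The node that gets a non-null result executes the task; the others simply drop their local copy.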

Connecting the workers with the task is done with one of my favorite mechanisms: the white board. A worker registers a Worker<T> service, where T is the type of task data it can receive. A worker looks like:

@Component
public class CardWorker implements Worker<Charge> {
  public void execute(Charge charge) throws Exception {
    // ... take your time
  }
}

This model allows nodes to differ, not all nodes have to implement all types of workers. This is especially important for rolling updates where different versions must run at the same time. It also automatically load balances the tasks between the different systems that have the appropriate types.
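How a node could decide whether it can handle a task type is sketched below. In OSGi the service registry does the bookkeeping; here a simple map keyed by the task data class plays that role. The WorkerRegistry class and its methods are hypothetical, and Worker is redeclared locally with the same shape as above so the example stands on its own.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Same shape as the Worker<T> service above, redeclared for this sketch.
interface Worker<T> {
    void execute(T task) throws Exception;
}

// Hypothetical in-memory stand-in for the whiteboard: workers are tracked
// by the type of task data they accept.
final class WorkerRegistry {
    private final Map<Class<?>, Worker<?>> workers = new ConcurrentHashMap<>();

    /** Called when a worker appears (in OSGi: the service is registered). */
    <T> void register(Class<T> type, Worker<T> worker) {
        workers.put(type, worker);
    }

    /** Called when a worker goes away (in OSGi: the service is unregistered). */
    void unregister(Class<?> type) {
        workers.remove(type);
    }

    /** True if this node has a worker for the given task data type. */
    boolean canHandle(Class<?> type) {
        return workers.containsKey(type);
    }

    /** Runs the task on the matching worker; returns false if there is none. */
    <T> boolean dispatch(T task) throws Exception {
        @SuppressWarnings("unchecked") // register() guarantees the types match
        Worker<T> worker = (Worker<T>) workers.get(task.getClass());
        if (worker == null) {
            return false;
        }
        worker.execute(task);
        return true;
    }
}
```

A node without a worker for a type just returns false and leaves the task for a node that does have one, which is what makes rolling updates possible.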

For these systems, the successful execution is usually not that hard to code; the error handling is the hard part, especially if you also want to keep things efficient. And if you think that is difficult then wait until you actually have to test many of the possible error scenarios!
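As a small example of what the requirement to re-execute after transient failures can look like, here is a minimal retry sketch in Java. It is not the Task Queue implementation: the names are hypothetical, every exception is treated as transient, and the doubling delay is just one possible policy.

```java
import java.util.concurrent.Callable;

// Minimal retry helper; treats every exception as a transient failure.
final class Retry {
    /**
     * Calls the task up to maxAttempts times, sleeping between attempts and
     * doubling the delay each time. Rethrows the last failure if all
     * attempts fail.
     */
    static <T> T withRetries(Callable<T> task, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // wait before the next attempt
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

A real queue would also have to distinguish transient from permanent failures and persist the attempt count, which is exactly where the testing effort goes.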

The component does its basic work at the moment and it was very satisfying when I saw a task being executed after I installed a new bundle that provided the appropriate worker type.

The Task Queue component is fully based on the ideas sketched earlier that basically forbid the use of objects between systems. In general this works better than expected and provides many benefits. Lessons are being learned as well, but those will be discussed in another blog post.

Peter Kriens