2015-12-28

SSL is dead. Long live TLS.

Hmm, do you think there are any other protocols that could be resurrected with a different name? How about good old HTTP? It hasn’t been about “Hypertext” transport for a long time. I mean sure, HTML is still around, but half the time it’s being written on the fly by your Javascript app anyway, not to mention there are CSS, SVG, images, audio, video, and a host of other things being transported via HTTP. And it’s not just passively transferring files, it’s communicating complicated application responses.

Maybe we should just call it Transport Protocol, “TP” for short. Yeah, I like it!

Happy New Year!

Implementing an OpenShift Enterprise routing layer for HA applications

My previous post described how to make an application HA and what exactly that means behind the scenes. This post is to augment the explanation in the HA PEP of how an administrator should expect to implement the routing layer for HA apps.

The routing layer implementation is currently left entirely up to the administrator. At some point OpenShift will likely ship a supported routing layer component, but the first priority was to provide an SPI (Service Provider Interface) so that administrators could reuse existing routing and load balancer infrastructure with OpenShift. Since most enterprises already have such infrastructure, we expected they would prefer to leverage that investment (both in equipment and experience) rather than be forced to use something OpenShift-specific.

Still, this leaves the administrator with the task of implementing the interface to the routing layer. Worldline published an nginx implementation, and we have some reference implementations in the works, but I thought I’d outline some of the details that might not be obvious in such an implementation.

The routing SPI

The first step in the journey is to understand the routing SPI events. The routing SPI itself is an interface on the OpenShift broker app that must be implemented via plugin. The example routing plugin that is packaged for Origin and Enterprise simply serializes the SPI events to YAML and puts them on an ActiveMQ message queue/topic. This is just one way to distribute the events, but it’s a pretty good way, at least in the abstract. For routing layer development and testing, you can just publish messages on a topic on the ActiveMQ instance OpenShift already uses (for Enterprise, openshift.sh does this for you) and use the trivial “echo” listener to see exactly what comes through. For production, publish events to a queue (or several if multiple instances need updating) on an HA ActiveMQ deployment that stores messages to disk when shutting down (you really don’t want to lose routing events) – note that the ActiveMQ deployment described in OpenShift docs and deployed by the installer does not do this, being intended for much more ephemeral messages.

I’m not going to go into detail about the routing events. You’ll become plenty familiar if you implement this. You can see some good example events in this description, but always check what is actually coming out of the SPI as there may have been updates (generally additions) since. The general outline of the events can be seen in the Sample Routing Plug-in Notifications table from the Deployment Guide or in the example implementation of the SPI. Remember you can always write your own plugin to give you information in the desired format.

Consuming SPI events for app creation

The routing SPI publishes events for all apps, not just HA ones, and you might want to do something with other apps (e.g. implement blue/green deployments), but the main focus of a routing layer is to implement HA apps. So let’s look at how you do that. I’m assuming YAML entries from the sample activemq plugin below — if you use a different plugin, similar concepts should apply just with different details.

First when an app is created you’re going to get an app creation event:

$ rhc app create phpha php-5.4 -s

:action: :create_application
:app_name: phpha
:namespace: demo
:scalable: true
:ha: false

This is pretty much just a placeholder for the application name. Note that it is not marked as HA. There is some work coming to make apps HA at creation, but currently you just get a scaled app and have to make it HA after it’s created. This plugin doesn’t publish the app UUID; if I were writing a plugin now, I probably would. Instead, you’ll identify the application in any future events by the combination of app_name and namespace.
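To make that concrete, here’s a minimal Python sketch of keying applications by that combination. This assumes the plugin’s YAML has already been deserialized into a dict (so the keys keep their leading colons); the “apps” registry structure is my own illustration, not anything the SPI defines.

```python
# Sketch only: event dicts mirror the plugin's YAML serialization;
# the "apps" registry is an illustrative structure, not part of the SPI.
apps = {}

def handle_create(event):
    # No app UUID in these events, so (app_name, namespace) is the only
    # stable key for correlating later events with this application.
    key = (event[':app_name'], event[':namespace'])
    apps[key] = {'scalable': event[':scalable'], 'endpoints': []}
    return key

handle_create({':action': ':create_application', ':app_name': 'phpha',
               ':namespace': 'demo', ':scalable': True, ':ha': False})
```

Deletion events (see “Things go away too” below) would remove the entry by the same key.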

Once an actual gear is deployed, you’ll get two (or more) :add_public_endpoint actions, one for haproxy’s load_balancer type and one for the cartridge web_framework type (and possibly other endpoints depending on cartridge).

:action: :add_public_endpoint
:app_name: phpha
:namespace: demo
:gear_id: 542b72abafec2de3aa000009
:public_port_name: haproxy-1.4
:public_address: 172.16.4.200
:public_port: 50847
:protocols:
- http
- ws
:types:
- load_balancer
:mappings:
- frontend: ''
  backend: ''
- frontend: /health
  backend: /configuration/health

You might expect that when you make the app HA, there is some kind of event specific to being made HA. There isn’t at this time. You just get another load_balancer endpoint creation event for the same app, and you can infer that it’s now HA. For simplicity of implementation, it’s probably just best to treat all scaled apps as if they were already HA and define routing configuration for them.

Decision point 1: The routing layer can either direct requests only to the load_balancer endpoints and let them forward traffic all to the other gears, or it can actually just send traffic directly to all web_framework endpoints. The recommendation is to send traffic to the load_balancer endpoints, for a few reasons:

  1. This allows haproxy to monitor traffic in order to auto-scale.
  2. It will mean less frequent changes to your routing configuration (important when changes mean restarts).
  3. It will mean fewer entries in your routing configuration, which could grow quite large and become a performance concern.

However, direct routing is viable, and allows an implementation of HA without actually going to the trouble of making apps HA. You would just have to set up a DNS entry for the app that points at the routing layer and use that. You’d also have to handle scaling events manually or from the routing layer somehow (or even customize the HAproxy autoscale algorithm to use stats from the routing layer).
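Whichever way you decide, the bookkeeping has the same shape. Here’s a rough Python sketch of the recommended option, keeping a backend pool per app containing only the load_balancer endpoints (the pool structure is illustrative, not part of the SPI):

```python
# Sketch: route only to load_balancer endpoints, per the recommendation.
pools = {}  # (app_name, namespace) -> set of (address, port) backends

def handle_endpoint(event):
    if event[':action'] != ':add_public_endpoint':
        return
    if 'load_balancer' not in event[':types']:
        return  # ignore web_framework and any other endpoint types
    key = (event[':app_name'], event[':namespace'])
    pools.setdefault(key, set()).add(
        (event[':public_address'], event[':public_port']))

# Feeding in the sample event from earlier:
handle_endpoint({':action': ':add_public_endpoint', ':app_name': 'phpha',
                 ':namespace': 'demo', ':gear_id': '542b72abafec2de3aa000009',
                 ':public_address': '172.16.4.200', ':public_port': 50847,
                 ':types': ['load_balancer']})
```

A second load_balancer entry showing up in an app’s pool is also your only signal that the app has been made HA, per the previous section; the corresponding endpoint-removal events would discard entries the same way.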

Decision point 2: The expectation communicated in the PEP (and how this was intended to be implemented) is that requests will be directed to the external proxy port on the node (in the example above, that would be http://172.16.4.200:50847/). There is one problem with doing this – idling. Idler stats are gathered only on requests that go through the node frontend proxy, so if we direct all traffic to the port proxy, the haproxy gear(s) will eventually idle and the app will be unavailable even though it’s handling lots of traffic. (Fun fact: secondary gears are exempt from idling – doesn’t help, unless the routing layer proxies directly to them.) So, how do we prevent idling? Here are a few options:

  1. Don’t enable the idler on nodes where you expect to have HA apps. This assumes you can set aside nodes for (presumably production) HA apps that you never want to idle. Definitely the simplest option.
  2. Implement health checks that actually go to the node frontend such that HA apps will never idle. You’ll need the gear name, which is slightly tricky – the above endpoint being on the first gear, it will be accessible by a request for http://phpha-demo.cloud_domain/health to the node at 172.16.4.200. When the next gear comes in, you’ll have to recognize that it’s not the head gear and send the health check to e.g. http://542b72abafec2de3aa000009-demo.cloud_domain/health.
  3. Flout the PEP and send actual traffic to the node frontend. This would be the best of all worlds since the idler would work as intended without any special tricks, but there are some caveats I’ll discuss later.
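For option 2, the URL construction looks roughly like this Python sketch. CLOUD_DOMAIN and the head-gear flag are stand-ins for however your implementation tracks them; the head-gear naming rule follows the description above.

```python
# Sketch of option 2: build a node-frontend health check URL so the
# idler sees traffic for HA apps. The head gear answers to the app's
# own DNS name; later gears answer to <gear_id>-<namespace>.
CLOUD_DOMAIN = 'openshift.example.com'  # assumed; use your cloud domain

def health_check_url(app_name, namespace, gear_id, is_head_gear,
                     path='/health'):
    if is_head_gear:
        host = '%s-%s.%s' % (app_name, namespace, CLOUD_DOMAIN)
    else:
        host = '%s-%s.%s' % (gear_id, namespace, CLOUD_DOMAIN)
    # The request must go to the node frontend (standard ports), not
    # the public port proxy, or the idler never sees it.
    return 'http://%s%s' % (host, path)
```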

Terminating SSL (TLS)

When hosting multiple applications behind a proxy, it is basically necessary to terminate SSL at the proxy. (Despite SSL having been essentially replaced by TLS at this point, we’re probably going to call it SSL for the lifetime of the internet.) This has to do with the way routing works under HTTPS; during the initialization of the TLS connection, the client has to indicate the name it wants (in our case the application’s DNS name) in the SNI extension to the TLS “hello”. The proxy can’t behave as a dumb layer 4 proxy (just forwarding packets unexamined to another TLS endpoint) because it has to examine the stream at the protocol level to determine where to send it. Since the SNI information is (from my reading of the RFC) volunteered by the client at the start of the connection, it does seem like it would be possible for a proxy to examine the protocol and then act like a layer 4 proxy based on that examination, and indeed I think F5 LBs have this capability. But it does not seem to be a standard proxy/LB capability, and certainly not for existing open source implementations (nginx, haproxy, httpd – someone correct me if I’m missing something here), so to be inclusive we are left with proxies that operate at layer 7, meaning they perform the TLS negotiation from the client’s perspective.

Edit 2014-10-08: layer 4 routing based on SNI is probably more available than I thought. I should have realized HAproxy 1.5 can do it, given OpenShift’s SNI proxy is using that capability. It’s hard to find details on though. If most candidate routing layer technologies have this ability, then it could simplify a number of the issues around TLS because terminating TLS could be deferred to the node.

If that was all Greek to you, the important point to extract is that a reverse proxy has to have all the information to handle TLS connections, meaning the appropriate key and certificate for any requested application name. This is the same information used at the node frontend proxy; indeed, the routing layer will likely need to reuse the same *.cloud_domain wildcard certificate and key that is shared on all OpenShift nodes, and it needs to be made aware of aliases and their custom certificates so that it can properly terminate requests for them. (If OpenShift supported specifying x509 authentication via client certificates [which BTW could be implemented without large structural changes], the necessary parameters would also need to be published to the routing layer in addition to the node frontend proxy.)

We assume that a wildcard certificate covers the standard HA DNS name created for HA apps (e.g. in this case ha-phpha-demo.cloud_domain, depending of course on configuration; notice that no event announces this name — it is implied when an app is HA). That leaves aliases which have their own custom certs needing to be understood at the routing layer:

$ rhc alias add foo.example.com -a phpha
:action: :add_alias
:app_name: phpha
:namespace: demo
:alias: foo.example.com

$ rhc alias update-cert foo.example.com -a phpha --certificate certfile --private-key keyfile
:action: :add_ssl
:app_name: phpha
:namespace: demo
:alias: foo.example.com
:ssl: [...]
:private_key: [...]
:pass_phrase:

Aliases will of course need their own routing configuration entries regardless of HTTP/S, and something will have to create their DNS entries as CNAMEs to the ha- application DNS record.
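So on top of the app registry, the routing layer needs to track aliases and any custom certificates attached to them. A rough Python sketch follows; the storage is illustrative, and a real implementation would write the cert and key material out where the proxy can load it and trigger a reload.

```python
# Sketch: track aliases and custom certs from :add_alias/:add_ssl events.
aliases = {}  # alias -> (app_name, namespace)
certs = {}    # alias -> cert/key material for TLS termination

def handle_alias(event):
    alias = event[':alias']
    if event[':action'] == ':add_alias':
        aliases[alias] = (event[':app_name'], event[':namespace'])
    elif event[':action'] == ':add_ssl':
        certs[alias] = {'cert': event[':ssl'],
                        'key': event[':private_key'],
                        'pass_phrase': event.get(':pass_phrase')}

# The two sample events above, in order:
handle_alias({':action': ':add_alias', ':app_name': 'phpha',
              ':namespace': 'demo', ':alias': 'foo.example.com'})
handle_alias({':action': ':add_ssl', ':app_name': 'phpha',
              ':namespace': 'demo', ':alias': 'foo.example.com',
              ':ssl': 'PEM cert here', ':private_key': 'PEM key here',
              ':pass_phrase': None})
```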

A security-minded administrator would likely desire to encrypt connections from the routing layer back to the gears. Two methods of doing this present themselves:

  1. Send an HTTPS request back to the gear’s port proxy. This won’t work with any of the existing cartridges OpenShift provides (including the haproxy-1.4 LB cartridge), because none of them expose an HTTPS-aware endpoint. It may be possible to change this, but it would be a great deal of work and is not likely to happen in the lifetime of the current architecture.
  2. Send an HTTPS request back to the node frontend proxy, which does handle HTTPS. This actually works fine, if the app is being accessed via an alias – more about this caveat later.

Sending the right HTTP headers

It is critically important in any reverse-proxy situation to preserve the client’s HTTP request headers indicating the URL at which it is accessing an application. This allows the application to build self-referencing URLs accurately. This can be a little complicated in a reverse-proxy situation, because the same HTTP headers may be used to route requests to the right application. Let’s think a little bit about how this needs to work. Here’s an example HTTP request:

POST /app/login.php HTTP/1.1
Host: phpha-demo.openshift.example.com
[...]

If this request comes into the node frontend proxy, it looks at the Host header, and assuming that it’s a known application, forwards the request to the correct gear on that node. It’s also possible (although OpenShift doesn’t do this, but a routing layer might) to use the path (/app/login.php here) to route to different apps, e.g. requests for /app1/ might go to a different place than /app2/.

Now, when an application responds, it will often create response headers (e.g. a redirect with a Location: header) as well as content based on the request headers that are intended to link to itself relative to what the client requested. The client could be accessing the application by a number of paths – for instance, our HA app above should be reachable either as phpha-demo.openshift.example.com or as ha-phpha-demo.openshift.example.com (default HA config). We would not want a client that requests the ha- address to receive a link to the non-ha- address, which may not even resolve for it, and in any case would not be HA. The application, in order to be flexible, should not make a priori assumptions about how it will be addressed, so every application framework of any note provides methods for creating redirects and content links based on the request headers. Thus, as stated above, it’s critical for these headers to come in with an accurate representation of what the client requested, meaning:

  1. The same path (URI) the client requested
  2. The same host the client requested
  3. The same protocol the client requested

(The last is implemented via the “X-Forwarded-Proto: https” header for secure connections. Interestingly, a recent RFC specifies a new header for communicating items 2 and 3, but not 1. This will be a useful alternative as it becomes adopted by proxies and web frameworks.)
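Concretely, the headers the routing layer should emit upstream look something like this Python sketch (the function is my own illustration, not any particular proxy’s API):

```python
# Sketch: headers to send upstream so the app sees what the client sent.
def proxy_headers(client_host, client_used_tls):
    # Item 1 (the path) needs no header at all: just proxy the
    # request URI through unmodified.
    return {
        # Item 2: pass the client's Host: header through untouched.
        'Host': client_host,
        # Item 3: record the client-facing protocol.
        'X-Forwarded-Proto': 'https' if client_used_tls else 'http',
    }
```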

Most reverse proxy software should be well aware of this requirement and provide options such that when the request is proxied, the headers are preserved (for example, the ProxyPreserveHost directive in httpd). This works perfectly with the HA routing layer scheme proposed in the PEP, where the proxied request goes directly to an application gear. The haproxy cartridge does not need to route based on Host: header (although it does route requests based on a cookie it sets for sticky sessions), so the request can come in for any name at all and it’s simply forwarded as-is for the application to use.

The complication arises in situations where, for example, you would like the routing layer to forward requests to the node frontend proxy (in order to use HTTPS, or to prevent idling). The node frontend does care about the Host header because it’s used for routing, so the requested host name has to be one that the OpenShift node knows in relation to the desired gear. It might be tempting to think that you can just rewrite the request to use the gear’s “normal” name (e.g. phpha-demo.cloud_domain) but this would be a mistake because the application would respond with headers and links based on this name. Reverse proxies often offer options for rewriting the headers and even contents of responses in an attempt to fix this, but they cannot do so accurately for all situations (example: links embedded in JavaScript properties) so this should not be attempted. (Side note: the same prohibition applies to rewriting the URI path while proxying. Rewriting example.com/app/… to app.internal.example.com/… is only safe for sites that provide static content and all-relative links.)

What was that caveat?

I mentioned a caveat both on defeating the idler and proxying HTTPS connections to the node frontend, and it’s related to the section above. You can absolutely forward an HA request to the node frontend if the request is for a configured alias of the application, because the node frontend knows how to route aliases (so you don’t have to rewrite the Host: header, which, as just discussed, is a terrible idea). The caveat is that, strangely, OpenShift does not create an alias for the ha- DNS entry automatically assigned to an HA app, so manual definition of an alias is currently required per-app for implementation of this scheme. I have created a feature request to instate the ha- DNS entry as an alias; since that should be easy to implement, this caveat may soon disappear.

Things go away too

I probably shouldn’t even have to mention this, but: apps, endpoints, aliases, and certificates can all go away, too. Make sure that you process these events and don’t leave any debris lying around in your routing layer confs. Gears can also be moved from one host to another, which is an easy use case to forget about.

And finally, speaking of going away, the example routing plugin initially provided :add_gear and :remove_gear events, and for backwards compatibility it still does (duplicating the endpoint events). These events are deprecated and should disappear soon.

OpenShift, Apache, and severe hair removal

I just solved a problem that stumped me for a week. It was what we in the business call “a doozy”. I’ll share here, mostly to vent, but also in case the process I went through might help someone else.

The problem: a non-starter

I’ve been working on packaging all of OpenShift Enterprise into a VM image with a desktop for convenient hacking on a personal/throwaway environment. It’s about 3 GB (compressed) and takes an hour or so to build. It has to be built on our official build servers using signed RPMs via an unattended kickstart. There have been a few unexpected challenges, and this was basically the cherry on top.

The problem was that the first time the VM booted… openshift-broker and openshift-console failed to start. Everything else worked, including the host-level httpd. Those two (which are httpd-based) didn’t start, and they didn’t leave any logs to indicate why. They didn’t even get to the point of making log files.

And the best part? It only happened the first time. If you started the services manually, they worked. If you simply rebooted after starting the first time… they worked. So basically, the customer’s first impression would be that it was hosed… even though it magically starts working after a reboot, the damage is done. I would look like an idiot trying to release with that little caveat in place.

Can you recreate it? Conveniently?

WTF causes that? And how the heck do I troubleshoot? For a while, the best I could think of was starting the VM up in runlevel 1 (catch GRUB at boot and add the “1” parameter), patching in diagnostics, and then proceeding to init. After one run, if I don’t have what I need… recreate the VM and try again. So painful I literally just avoided it and busied myself with other things.

The first breakthrough was when I tried to test the kinds of things that happen only at the first boot after install. There are a number, potentially – services coming up for the first time and so forth. Another big one is network initialization on a new network and device. I couldn’t see how that would affect these services (they are httpd binding only to localhost), but I did experiment with changing the VM’s network device after booting (remove the udev rule, shut down, remove the device, add another), and found that indeed, it caused the failure on the next boot reliably.

So it had to be something to do with network initialization.

What’s the actual error?

Being able to cause it at will on reboot meant much easier iteration of diagnostics. First I tried just adding a call to ifconfig in the openshift-broker init script. I couldn’t see anything in the console output, so I assumed it was just being suppressed somehow.

Next I tried to zero in on the actual failure. When invoked via init script, the “daemon” function apparently swallows console output from the httpd command, but it provides an opportunity to add options to the command invocation, so I looked up httpd command line options and found two that looked helpful: “-e debug -E /var/log/httpd_console”:

-e level
Sets the LogLevel to level during server startup. This is useful for temporarily increasing the verbosity of the error messages to find problems during startup.

-E file
Send error messages during server startup to file.

This let me bump up the logging level at startup and capture the server startup messages. (Actually probably only the latter matters. Another one to keep in mind is -X which starts it as a single process/thread only – helpful for using strace to follow it. Not helpful here though.)

This let me see the actual failure:

 [crit] (EAI 9)Address family for hostname not supported: alloc_listener: failed to set up sockaddr for 127.0.0.1

Apparently the httpd startup process tries to bind to network interfaces before it even opens logs, and this is what you get when binding to localhost fails.

What’s the fix?

An error is great, but searching the mighty Google for it was not very encouraging. There were a number of reports of the problem, but precious little about what actually caused it. The closest I could find was this httpd bug report:

Bug 52709 – Apache can’t bind to 127.0.0.1 if eth0 has only IPv6

[…]

This bug also affects httpd function ap_get_local_host() as described in http://bugs.debian.org/629899
Httpd will then fail to get the fully qualified domain name.

This occurs when apache starts before dhclient finished its job.

Here was something linking the failure to bind to 127.0.0.1 to incomplete network initialization by dhclient. Suddenly the fact that my test “ifconfig” at service start had no output did not seem like a fluke. When I started the service manually, ifconfig certainly had output.

So, here’s the part I don’t really claim to understand. Apparently there’s a period after NetworkManager does its thing and other services are starting where, in some sense at least, the network isn’t really available. At least not in a way that ifconfig can detect, and not in a way that allows httpd to explicitly bind to localhost.

As a workaround, I added a shim service that would wait for network initialization to actually complete before trying to start the broker and console. I could add a wait into those service scripts directly, but I didn’t want to hack up the official files that way. So I created this very simple service (openshift-await-eth0) that literally just runs “ifconfig eth0” and waits up to 60 seconds for it to include a line beginning “inet addr:”. Notably, if I have it start right after NetworkManager, it finishes immediately, so it seems the network is up at that point, but goes away just in time for openshift-broker and openshift-console to trip over it. So I have the service run right before openshift-broker, and now my first boot successfully starts everything.
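For the curious, the core of that shim boils down to something like the following. This is a Python rendering of the logic (the actual service is a shell init script); the polling callback stands in for shelling out to “ifconfig eth0”.

```python
import time

# Sketch of the check openshift-await-eth0 performs: poll until the
# interface's ifconfig output shows an IPv4 address, up to a timeout.
def has_inet_addr(ifconfig_output):
    # ifconfig on RHEL 6 prints an indented "inet addr:x.x.x.x" line
    # once the interface has an IPv4 address.
    return any(line.strip().startswith('inet addr:')
               for line in ifconfig_output.splitlines())

def await_iface(poll, timeout=60, interval=1):
    # poll() returns the current "ifconfig eth0" output as a string.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if has_inet_addr(poll()):
            return True
        time.sleep(interval)
    return False
```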

Since this is probably the only place we’ll ever deploy OpenShift that has dhclient trying to process a new IP at just the time the broker and console are being started, I don’t know that it will ever be relevant to anyone else. But who knows. Maybe someone can further enlighten me on what’s going on here and a better way to avoid it.

OpenShift with dynamic host IPs?

From the time we began packaging OpenShift Enterprise, we made a decision not to support dynamic changes to host IP addresses. This might seem a little odd since we do demonstrate installation with the assumption that DHCP is in use; we just require it to be used with addresses pinned statically to host names. It’s not that it’s impossible to work with dynamic re-leasing; it’s just that it’s an unnecessary complication and potentially a source of tricky problems.

However, I’ve crawled all over OpenShift configuration for the last few months, and I can say with a fair amount of confidence that it’s certainly possible to handle dynamic changes to host IP, as long as those changes are tracked by DNS with static hostnames.

But there are, of course, a number of caveats.

First off, it should be obvious that DNS must be integrated with DHCP such that hostnames never change and always resolve correctly to the same actual host. Then, if configuration everywhere uses hostnames, it should in theory be able to survive IP changes.

The most obvious exception is the IP(s) of the nameserver(s) themselves. In /etc/resolv.conf clearly the IP must be used, as it’s the source for name resolution, so it can’t bootstrap itself. However, in the unlikely event that nameservers need to re-IP, DHCP could make the transition with a bit of work. You could not use our basic dhclient configuration that statically prepends the installation nameserver IP – instead the DHCP server would need to supply all nameserver definitions, and there would be some complications around the transition since not all hosts would renew leases at the same time. Really, this would probably be the province of a configuration management system. I trust that those who need to do such a thing have thought about it much more than I have.

Then there’s the concern of the dynamic DNS server that OpenShift publishes app hostnames to. Well, no reason that can’t be a hostname as well, as long as the nameserver supplied by DHCP/dhclient knows how to resolve it. Have I mentioned that you should probably implement your host DNS separately from the dynamic app DNS? No reason they need to use the same server, and probably lots of reasons not to.

OK, maybe you’ve looked through /etc/openshift/node.conf and noticed the PUBLIC_IP setting in there. What about that? Well, I’ve tracked that through the code base and as far as I can tell, the only thing it is ever used for is to create a log entry when gears are created. In other words, it has no functional significance. It may have in the past – as I understand it, apps used to be created with A records rather than as CNAMEs to the node hosts. But for now, it’s a red herring.

Something else to worry about are iptables filters. In our instructions we never demonstrate filters for specific addresses, but conscientious sysadmins would likely limit connections to the hosts that are expected to need them in many cases. And they would be unlikely to define them using hostnames. So either don’t do that… or have a plan for handling IP changes.

One more caveat: what do we mean by dynamic IP changes? How dynamic?

If we’re talking about the kind of IP change where you shut down the host (perhaps to migrate its storage) and when it is booted again, it has a new IP, then that use case should be handled pretty well (again, as long as all configured host references use the hostname). This is the sort of thing you would run into in Amazon EC2 where hosts keep their IP as long as they stay up, but when shut down generally get a new IP. All the services on the host are started with the proper IP in use.

It’s a little more tricky to support IP changes while the host is operating. Any services that bind specifically to the external IP address would need restarting. I’ve had a look while writing this, though, and this is a lot less common than I expected. As far as I can see, only one node host service does that: haproxy (which is used by openshift-port-proxy to proxy specific ports from the external interface back to gear ports). The httpd proxy listens to all interfaces so it’s exempt, and individual gears listen on internal interfaces only. On the broker and supporting hosts, ActiveMQ and MongoDB either listen locally or to all interfaces. The nameserver, despite being configured to listen to “all”, appears to bind to specific IPs, so it looks like it would need a restart. You could probably configure dhclient to do this when a lease changes (with appropriate SELinux policy changes to permit it). But you can see how easy this would be to get wrong.

Hopefully this brief exploration of the issues involved demonstrates why we’re going to stick with the current (non-)support policy for the time being. But I also expect some of you out there will try out OpenShift in a dynamic IP environment, and if so I hope you’ll let me know what you run into.

NIC after clone

In the spirit of blogging to remember what I did, I hope I have a solution by the end of this post.

On RHEL 6, using virt-clone, I cloned a paused QEMU/KVM VM, then cloned the clone and brought that up. The MAC on the NIC was changed, but I couldn’t bring it up:

ifup eth0
Device eth0 does not seem to be present, delaying initialization.

Looking in ifcfg-eth0, everything looked generic:

DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
NM_MANAGED=no

Just for completeness, I went ahead and added:

HWADDR=52:54:00:72:a6:c0

I don’t think that should be necessary; anyway, it made no difference.

My coworker helpfully pointed out that udev might need some changes after a clone. Looking at /etc/udev/rules.d/70-persistent-net.rules I could see some of the problem:

# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.
# PCI device 0x1af4:0x1000 (virtio-pci)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="52:54:00:ef:64:a4", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
# PCI device 0x1af4:0x1000 (virtio-pci)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="52:54:00:72:a6:c0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

Looks like rather than replace eth0’s MAC addr, the clone just added a new device, and the record of the old one is still there. But the obvious thing to do here doesn’t seem to help. I tried commenting out the first definition and changing the second one to eth0. No dice – same result. So, more poking around.

StackOverflow turned up something that mentioned /sys/class/net/ which on my clone contains:

# ls /sys/class/net/
eth1 lo

Huh… so as far as the system is concerned, there really is no eth0. I could probably get by just having eth1, but I’d really like to know how to fix this.

This page brought me to some more interesting low-level stuff.

# ip link show eth0
Device "eth0" does not exist.
# ip link show eth1
2: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 52:54:00:72:a6:c0 brd ff:ff:ff:ff:ff:ff

Good to know we’re all reading from the same script. So who determines what gets assigned to eth0 or eth1?

That’s when I started writing this.

And then I found this forum entry. So I deleted the udev rules file and rebooted. And that worked – the NIC came back as eth0 and got a connection without me changing anything.

I guess it makes some sense. The rules file refers to a device that doesn’t exist, so a rule is added for the new device, and just in case the old one comes back, the old rule remains.

But seriously, rebooting? There has to be something I can just run to set things right.

back after a break

It’s kind of gratifying to check the logs and see that people are still ending up at my little blog here even when I’m away having a baby for a month! Hope visitors found something useful.

I’ve had some troubles with DD-WRT since trying it out, so I thought I’d note them. The first, major one was that all of a sudden the DHCP server stopped working. It just wouldn’t hand out IPs anymore – existing leases worked fine. I ended up resetting the router and losing my configuration over that one. Then one day in the middle of a Skype session I simply lost connectivity. The router was up but I couldn’t ping my ISP’s gateway. I cursed the ISP at first but eventually figured out it wasn’t them, and a quick router reboot actually solved it. But this is disconcerting. Maybe I need a firmware upgrade – I haven’t been paying attention.

Demoed WhenDidI at my Android meetup on Wednesday, and folks seemed to think it had good potential and were interested in how it worked. We started talking about the future log analysis capabilities, adding geotags to entries, sharing trackers between multiple people, etc. – stuff that’s mostly way in the future unfortunately, but interesting.

Playing around with DD-WRT

I finally got around to trying DD-WRT. I have a friend who raved about it and I always meant to learn a little more about networking, but I just never got around to it until now. My major motivations for doing it now:

  • Friends/family coming over want to use the wireless, but I don’t want to tell them my password and don’t particularly want them on the same network as my other stuff. So I want to offer multiple SSIDs and segregate networks.
  • OK, this isn’t really specific to DD-WRT, but my old router only supports WEP “security” – needed to get with the times!
  • Want to set up outgoing VPN connections at the router level instead of doing this per machine (having this is another good reason to isolate visitors from the main LAN).
  • And hey, a local VPN accessible from the internet too, though most routers probably do this out of the box?
  • Static DHCP and DNS allocations so all my devices get names without having to cart around a hosts file or set up my own DNS server. Since I’ll be setting up a ton of VMs this is kind of important.
  • Ability to set up virtual LANs and fiddle with various networking things I don’t really understand yet.
  • Ownership of my router. Ability to do all this stuff on my terms instead of with whatever capabilities the manufacturer deigns to put in the thing (reserving the best for a more expensive model).
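The static DHCP/DNS item, for instance, should be straightforward: if I understand right, DD-WRT runs dnsmasq underneath, so static leases and local names presumably come down to a few lines in its additional-options box – something like this (the MACs, names, and addresses here are made up):

```
# Static leases: MAC, hostname, IP
dhcp-host=52:54:00:12:34:56,nas,192.168.1.10
dhcp-host=52:54:00:ab:cd:ef,laptop,192.168.1.11
# Give local names a domain so clients can resolve each other
domain=home.lan
expand-hosts
```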

I am not foolhardy enough to try this with the router I currently have in place and risk having no internet connection except my phone. After a little searching on Amazon (having Prime pretty much rules where I look for things now) I decided on the Linksys WRT160N. This looks pretty decent even out of the box, and for $31 I wasn’t going to cry much if I bricked it. I didn’t have any way of knowing which version I’d be getting – according to the router database v1 and v3 are supported, but not v2 – but as luck would have it I got v3.

The installation guide for DD-WRT in general (and the “Peacock” thread) is full of all kinds of warnings about how you need to read *everything* on the wiki to avoid bricking your router. In that way it’s worse than manufacturer manuals! Sorry, fellas, I know you’re trying to help but it’s a bit much. The main thing I got out of this is that it’s really important to do a 30/30/30 reset before and after flashing the firmware.

The basic procedure for flashing DD-WRT onto a router from scratch is as follows:

  1. Comb the DD-WRT site for all the instructions and firmware images you’ll need, and have them on your computer before you try to do anything.
  2. Connect your computer to the router with a cable (doing it wirelessly isn’t gonna fly).
  3. Do a 30/30/30 reset on the router.
  4. Using the router’s standard administration UI (which allows you to reflash firmware in order to enable manufacturer updates), flash the first firmware image – this will usually be a special bootstrap image (or a “kill image”) just to make the transition off the factory router firmware.
  5. Do a 30/30/30 reset on the router.
  6. Now using the DD-WRT UI, flash the real firmware you’ll be using. You may not have a lot of choices – I had two listed in the router database.
  7. Do a 30/30/30 reset on the router.
  8. Start looking through all the crazy options DD-WRT offers, and enjoy!

The specific instructions for the WRT160Nv3 are quite succinct and I got things working quickly. Yeah – it’s really not too bad! I had it working right away with only one hitch: my browser kept caching the “apply.cgi” page, which is used to alter settings. The cached page was simply blank and didn’t actually apply anything (of course). There used to be a simple way to clear individual cached items in Firefox, but it doesn’t seem to be in the settings anymore – maybe it’s buried in advanced settings or needs a plugin now. I solved it simply by setting up a /etc/hosts entry for the router and referring to it by a new name.
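The /etc/hosts workaround is just one line – the IP and name here are examples, use whatever your router is:

```
192.168.1.1   ddwrt
```

Then browse to http://ddwrt/ and the cached entries under the old name don’t matter.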

DD-WRT offers the capability to set up virtual SSIDs, but this is a little trickier than the interface would have you believe. Once you create a virtual SSID, at least if you want to isolate it as I mentioned above (and why else would you want to do this?), you need to do some hackish stuff. I found a guide here that was very helpful. Alas, I think my stock G1 phone ran afoul of some of the problems with virtual SSIDs mentioned there, because it holds onto the main SSID from an early configuration I tried and completely refuses to see the new setup (even after I manually entered the SSID and WPA2 password for it). My modded G1 seems to see everything fine, as does my laptop. I will probably have to play around with SSID visibility to get this right (or mod my stock G1 – but it’s my main phone currently and I’m reluctant to lose all my settings, etc.).

The DD-WRT help that comes with the firmware (displayed from the UI, so you don’t even need working internet) is pretty decent, to say nothing of the help you can find on the internet. There is an unbelievable variety of features built in. This should keep me busy for a long time!