Developing Perspective

#143: Bare Metal.


00:00:00   Hello and welcome to Developing Perspective.

00:00:03   Developing Perspective is a podcast discussing news of note in iOS development, Apple and

00:00:06   the like.

00:00:07   I'm your host, David Smith.

00:00:08   I'm an independent iOS and Mac developer based in Herndon, Virginia.

00:00:11   This is show number 143.

00:00:13   Today is Tuesday, September 17th.

00:00:15   Developing Perspective is never longer than 15 minutes, so let's get started.

00:00:19   All right, so today I'm going to be unpacking, taking a break from, I guess, the iOS 7 stuff,

00:00:25   new iPhone stuff, all the things I've been talking about recently, and go back a little

00:00:29   bit to something that I talked about for a while before, which is Feed Wrangler and talking

00:00:34   about some of the more web service side of things there. And specifically I'm going to

00:00:39   be unpacking my experience of the last few days and some of the lessons I've learned.

00:00:44   So just a quick bit of background, Feed Wrangler, soon to be Pod Wrangler, is a system that

00:00:50   I built that is an RSS aggregating syncing service. So it's a replacement for Google Reader

00:00:55   in a lot of ways that hopefully adds a bit more value

00:00:58   and does things in another way beyond that.

00:01:01   And this is a service that I launched,

00:01:02   I think it was back in May-ish,

00:01:04   it was about something like that.

00:01:06   And it's been doing pretty well.

00:01:08   It's definitely been meeting my expectations

00:01:09   and doing well, sort of doing well

00:01:12   and being a significant interesting part of my business.

00:01:16   And as a result of, sort of,

00:01:19   Feed Wrangler at its core is a sync service.

00:01:23   It's something whose purpose is to take

00:01:26   rather remarkably large amounts of data,

00:01:29   essentially from RSS feeds all over the internet,

00:01:32   aggregate them together, and then let people browse them

00:01:34   in an easy interface, providing a good API for third party

00:01:38   clients, and so on.

00:01:39   And so it's something that I built initially

00:01:43   on using virtual private servers,

00:01:45   specifically at Linode.

00:01:47   And it's something whose performance characteristics

00:01:51   have always been complicated.

00:01:53   I think by the nature of what it's doing,

00:01:58   there are certain problems that are just part of it

00:02:01   that are inescapable, that no matter what you do,

00:02:03   it's just a hard problem to solve.

00:02:06   There's a lot of data.

00:02:07   There's a lot of, just the sheer bandwidth of items

00:02:08   that I'm trying to manage and sort and deal with

00:02:12   is pretty remarkable.

00:02:15   I think that the simplest version is that right now,

00:02:17   I think the database that sort of stores all the articles

00:02:21   and manages them, I think, is up to about 200 gigabytes or so.

00:02:25   So about 200 gigabytes of data in just about six months or something of work.

00:02:29   And so it's quite a lot to keep track of and to manage.

00:02:33   And I built it using virtual private servers at Linode, which was something that I

00:02:37   was very familiar with and it worked well for me in the past.

00:02:41   And it had, however, what I recently found,

00:02:45   and this is the main story that I'm going to get into now,

00:02:49   that it was never quite performant enough.

00:02:52   It was never quite as fast as I would have liked it to be.

00:02:55   It was never quite as stable as I would have liked it to be.

00:02:58   And so I kept trying and working out ways to get around that.

00:03:03   I kept adding different types of caching,

00:03:06   taking the database and splitting it in two

00:03:08   and having a read slave and a master,

00:03:10   where all the writes happen, and all these types of tricks

00:03:13   that I was kind of heading towards.

00:03:15   What I found, though, is that that brought

00:03:18   all kinds of other problems with it.

00:03:20   And so here comes the story part of this.

00:03:22   So last week, Reeder 2 launched, which I highly recommend.

00:03:27   It's an excellent app.

00:03:28   I was beta testing it for a long time,

00:03:30   and Silvio did a really good job of building it.

00:03:33   And it feels right at home on iOS 7.

00:03:34   It's really sort of slick and has some cool, nice touches.

00:03:38   And I really like it.

00:03:40   And I was actually really excited when it launched,

00:03:43   because it has a lot of features that people

00:03:46   have been asking for.

00:03:47   It has full support for Feed Wrangler,

00:03:48   it has smart stream support.

00:03:50   It does a lot of things that I was really excited about.

00:03:52   One thing I did not anticipate, however,

00:03:54   which is perhaps a little bit of foolishness on my side,

00:03:57   is that it was launched as a new app.

00:03:59   And so everybody who uses it, which

00:04:02   is a pretty high proportion of my user base,

00:04:05   bought it and then proceeded to resync their entire article

00:04:10   lists into it, which meant that my already kind of--

00:04:16   a little bit on the edge, not-a-lot-of-headroom server infrastructure was completely crushed

00:04:20   and destroyed.

00:04:21   And so last Wednesday, I believe it was when it launched, all of a sudden things just went

00:04:26   from bad to worse.

00:04:27   And the things that I had been kind of scraping by with and my patches and sort of intermediate

00:04:33   solutions had been working, but were not working well enough at this point, with that amount

00:04:40   of traffic, where I think I was getting something like two or three times at least normal traffic

00:04:45   load, and especially the kind of traffic that it was, was very--

00:04:49   was much more difficult to deal with than normal traffic.

00:04:52   Because normal traffic, most applications

00:04:54   are only asking for recent things.

00:04:56   They're asking for, what are the new articles

00:04:58   since a particular time?

00:05:00   Whereas now, when you do a full sync, you kind of go back

00:05:04   and go, OK, what are the articles they had,

00:05:07   say, a week ago or a month ago, depending on the client.

00:05:10   And you're kind of going this much further back,

00:05:13   this deep, backward search.

00:05:14   And so that's very hard from a caching perspective to deal with, because rather than dealing

00:05:18   with the normal working set, you're dealing with essentially just an approximation of

00:05:23   the entire working set of the database, which is 200 gigabytes, which is too large to reasonably

00:05:28   be caching.
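
The caching point above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the cache is modelled as a plain least-recently-used (LRU) cache, and the article counts, cache size and client count are hypothetical numbers chosen for illustration, not figures from the show. It compares requests for recent articles against a full walk back through the history.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache that only tracks which keys are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.entries[key] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used key

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def simulate(requests, capacity):
    """Replay a list of article IDs through a fresh cache and return its hit rate."""
    cache = LRUCache(capacity)
    for article_id in requests:
        cache.get(article_id)
    return cache.hit_rate()


# All three numbers below are hypothetical, for illustration only.
TOTAL_ARTICLES = 10_000  # size of the whole article table
CACHE_CAPACITY = 500     # how many articles fit in the cache
CLIENTS = 20             # how many clients replay the same request pattern

# Normal traffic: every client asks for the newest 100 articles.
recent = list(range(TOTAL_ARTICLES - 100, TOTAL_ARTICLES))
normal_traffic = recent * CLIENTS

# Full resync: every client walks through the entire history.
full_sync_traffic = list(range(TOTAL_ARTICLES)) * CLIENTS

normal_hit_rate = simulate(normal_traffic, CACHE_CAPACITY)
full_sync_hit_rate = simulate(full_sync_traffic, CACHE_CAPACITY)

print(f"normal traffic hit rate: {normal_hit_rate:.2%}")
print(f"full sync hit rate:      {full_sync_hit_rate:.2%}")
```

In this toy setup the recent-only traffic is served almost entirely from the cache, while the full resync never hits it, because each article is evicted long before any client asks for it again.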

00:05:29   And so everything just kind of fell apart.

00:05:32   And so I arrived at the place of then having to try and deal with, well, what do I do with

00:05:40   this?

00:05:41   What can I do?

00:05:42   do I keep trying to solve this the way that I have,

00:05:47   with introducing new levels of caching,

00:05:50   new levels of complexity,

00:05:52   splitting traffic in different and interesting ways,

00:05:54   or should I really just sort of do what I probably

00:05:57   should have done from the start,

00:05:58   and that is kind of throw money at the problem,

00:06:00   is one way to say it.

00:06:02   And I'm gonna kind of get into it a little bit later,

00:06:03   but the main lesson that I've been,

00:06:06   as I've been trying to boil down my experience

00:06:07   over the last couple of days,

00:06:09   is into sort of like an aphorism,

00:06:11   I was going to try and put it concisely, is that the lesson I think I've learned is that

00:06:15   I should try to avoid solving with cleverness that which I could solve with money.

00:06:21   And by that I mean I've recently gone through the process of migrating the entire Feed Wrangler

00:06:26   back end to dedicated bare metal, ridiculously fast machines that are hosted, I host them

00:06:34   with SoftLayer, but it's basically I'm literally buying a server that's incredibly beefy, you

00:06:40   know, super fast, big SSD drives in a RAID configuration, the dual processing, big Xeon

00:06:47   chips.

00:06:48   Like, it is a beefy, beefy machine.

00:06:50   And it certainly cost more than what I was paying at Linode, but it means that my performance

00:06:54   now is being-- essentially, my performance problem is being solved by just spending a

00:07:00   lot more on the server.

00:07:02   And the experience of going through this process has been kind of interesting, and I'll talk

00:07:07   about that now.

00:07:08   And it's also just--

00:07:09   I really think I've learned a lot of things

00:07:11   by going through this.

00:07:12   So the process was essentially--

00:07:14   so on Wednesday, the entire system just

00:07:17   started grinding to a halt. Like,

00:07:18   my error rates were getting really high,

00:07:20   and just things were going crazy.

00:07:22   If you're familiar with the load parameter,

00:07:25   if you type top on any Unix or Linux machine,

00:07:28   you can get a load parameter.

00:07:29   And every now and then, my databases

00:07:31   were spiking to loads of like 20 or 30,

00:07:34   which is completely insane and definitely not what you want.

00:07:36   You want the load to probably be between 0 and the number of cores

00:07:41   on your machine at most.

00:07:42   So you want to be no higher than 8 or so.

00:07:45   8's still pretty high, but you don't want

00:07:47   to be anywhere up here, at 20 or 30.
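
The rule of thumb above (load should stay at or below the number of cores) can be checked in a few lines. This is a minimal sketch, not the tooling used on the show: the helper names are made up for illustration, and the 8-core figure in the usage note is an assumption based on the "8" mentioned above. `os.getloadavg()` and `os.cpu_count()` are standard library calls; the former is only available on Unix-like systems.

```python
import os


def load_per_core(load_average, cores):
    """Return the load average divided by the number of CPU cores."""
    return load_average / cores


def is_overloaded(load_average, cores):
    """Rule of thumb from the episode: a load above the core count is too high."""
    return load_average > cores


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one_minute, five_minute, fifteen_minute = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1-minute load {one_minute:.2f} on {cores} cores "
          f"({load_per_core(one_minute, cores):.2f} per core)")
    print("overloaded" if is_overloaded(one_minute, cores) else "ok")
```

Assuming an 8-core machine, `is_overloaded(30, 8)` is true and `load_per_core(20, 8)` is 2.5, so a load of 20 or 30 means several times more work queued than the machine can run at once.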

00:07:50   And so my I/O was just really crushed.

00:07:54   It was essentially running at 100% utilization the entire time.

00:07:57   And so I started looking at my options.

00:07:59   The first thing I thought about is, OK, is it just

00:08:01   because Linode doesn't have SSDs?

00:08:04   So I started looking at DigitalOcean, which is another VPS provider, which does have SSDs.

00:08:09   And so I took my database, replicated it over there, tried it out.

00:08:13   And the reality was it was a little bit faster, but it was a marginal improvement.

00:08:17   And at this point I just started to think about, "Well, you know what I need to do?

00:08:22   I think I just need to try it. What is the most performant thing that I could possibly do?"

00:08:25   And that sort of led me to just deciding that, "You know what I need

00:08:30   to do? I'm just going to put this on a dedicated host."

00:08:32   And so I've since, in the last few days,

00:08:35   over the course of, I think it was three sort of sleepless

00:08:38   nights doing this all between 2 and 5 in the morning,

00:08:42   so that it has the least impact on customers.

00:08:45   The reality is the thing was barely holding together

00:08:48   as it was, so I suppose I could have done it

00:08:50   in the middle of the day in some ways.

00:08:51   But through a lot of sleepless nights,

00:08:53   I was able to take the entire infrastructure that

00:08:56   was currently on Linode and has now

00:08:58   been migrated onto bare metal servers at SoftLayer.

00:09:03   And that process, I will say, is a little bit insane,

00:09:06   and it's something that I hope to not have to do again.

00:09:08   It's kind of, the best analogy I can think for it

00:09:11   is it's like changing the wheel in your car

00:09:14   while you're driving it.

00:09:15   'Cause you're trying to move these things

00:09:18   and migrate them over while still dealing

00:09:20   with all the incoming traffic.

00:09:21   And so it was a little bit crazy and a little bit hectic,

00:09:24   but thankfully it's done now.

00:09:25   This is, I think, Tuesday, and so as of yesterday,

00:09:27   about Monday morning, I did the last transitions,

00:09:32   and other than some lingering traffic

00:09:35   that still hasn't updated its DNS,

00:09:37   everything's now hitting the SoftLayer stuff,

00:09:41   and it's blazing.

00:09:43   It's kind of remarkable.

00:09:44   And at first, I wasn't even really sure what to expect

00:09:46   when I went to this new server,

00:09:49   but the reality is it's been incredibly faster.

00:09:51   Not just a little bit, not like, "Oh, yeah,

00:09:53   it's a 20%, 30% improvement."

00:09:53   Everything is at least two to five times faster.

00:09:56   And many operations are 10 times faster than they were.

00:09:59   And they're much more consistent, too,

00:10:01   which is even probably the more important thing.

00:10:03   Because even if it was the same speed as it had been before,

00:10:06   but the variance between different operations

00:10:11   was better, I would have been happy.

00:10:13   But it's both faster and more consistent.

00:10:15   And I love being able to see this objectively, where

00:10:18   I was looking at my server stats for this.

00:10:21   And I was able to see that my typical response times before,

00:10:25   like my overall average response time for all requests,

00:10:27   was somewhere around 750 milliseconds to one second

00:10:31   before.

00:10:33   And as of right now, it's consistently

00:10:36   around 200 milliseconds, and often even as low

00:10:39   as 150 milliseconds.

00:10:41   So that's about a five times improvement

00:10:44   or so in response time.
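
The "about five times" figure can be worked out from the numbers quoted above. A small sketch of the arithmetic, using only the response times mentioned in the episode (750 milliseconds to one second before, 150 to 200 milliseconds after); the function and variable names are just for illustration.

```python
def speedup(before_ms, after_ms):
    """How many times faster the new response time is than the old one."""
    return before_ms / after_ms


before_range_ms = (750, 1000)  # average response time on the old servers
after_range_ms = (150, 200)    # average response time on the new servers

# Most conservative case: best old time against worst new time.
smallest = speedup(before_range_ms[0], after_range_ms[1])
# Most generous case: worst old time against best new time.
largest = speedup(before_range_ms[1], after_range_ms[0])
# The two middle pairings: 1000 ms -> 200 ms and 750 ms -> 150 ms.
typical = speedup(before_range_ms[1], after_range_ms[1])

print(f"speedup between {smallest:.2f}x and {largest:.2f}x, typically {typical:.0f}x")
```

So the quoted ranges put the improvement somewhere between roughly 3.75 and 6.7 times, which is consistent with "about five times".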

00:10:46   And that's average, so certainly some things are slower

00:10:48   and some things are faster.

00:10:49   That really means that a lot of things are much, much faster than they were before.

00:10:53   And so that's the experience that I've had.

00:10:55   And I think it's interesting as a broader lesson to understand that if there are things

00:11:00   that you can do, it's very easy, I think, in the heat of the moment to be sort of penny wise

00:11:06   and pound foolish, where I was looking at hosting at Linode as kind of a good option

00:11:12   because it sort of did what I wanted, and it had close enough performance characteristics

00:11:18   that it was possible. But the reality was the amount of time that I've probably spent

00:11:23   in the last three or four months working on performance issues and tuning and caching

00:11:27   and dealing with all these kinds of issues that would have completely gone away if I

00:11:31   had just decided to start off investing the money and the resources into just taking,

00:11:39   sort of letting the hardware solve the problem for me, I think the product would have been

00:11:43   better. And that's sort of the tragedy of this. And that's why I wanted to kind of have

00:11:47   an episode talking about this experience, is that making sure that when you're spending

00:11:53   time on something, understanding whether that time could be better spent doing something

00:11:57   else, and if you could outsource that need to something that you can solve with money,

00:12:05   or with resources and whatever that looks like. And it's a funny thing to say, in some

00:12:08   ways, to say that, "Well, what if you can't afford it?" Well, the reality is if you're

00:12:13   you're spending your time on something that's sort of the-- if spending time, your own time,

00:12:19   is the alternative, then the real question is, you know, can you afford that time? Because time is

00:12:23   even more valuable than-- or much more constrained and limited, often, than money is. And so looking

00:12:30   into ways to avoid spending time when you can solve it with money is very important. And it's

00:12:35   like I really-- I also fell into the trap, I think, as an engineer, of thinking that-- in some

00:12:42   ways I'd be cheating, or it's like, oh, I'd be copping out

00:12:47   or whatever if I didn't solve these problems with technology.

00:12:52   If I wasn't being super clever, and I kept being like, oh, what if I did this?

00:12:57   What if I did that? And I kept viewing it as this engineering problem.

00:13:02   When the reality is there was a solution and a way to avoid it, to address this problem that didn't require cleverness.

00:13:08   that wasn't about how smart I was,

00:13:13   and it wasn't this problem that I needed to solve.

00:13:15   I could just say, "You know what?

00:13:17   I know it's like I can just move this to something

00:13:19   where it'll be much, much, much better."

00:13:23   And that's just something-- exactly what that means,

00:13:25   and the implications that might have for yourself

00:13:28   and your own applications are certainly going to be different.

00:13:30   It's, are you solving, would it be better to outsource

00:13:34   a particular part of it to an expert in the field?

00:13:34   Is there a part of your business that you

00:13:35   can outsource in that way?

00:13:37   Are there things that you're spending a lot of time on that

00:13:40   could be solved in another way?

00:13:43   And so the lesson that I've learned from this,

00:13:45   and the thing that I've taken away,

00:13:46   is that, A, it's hard to move servers.

00:13:49   Minor lesson.

00:13:51   But the more important thing is, don't solve with cleverness

00:13:53   that which you can solve with money, with resources.

00:13:57   And so that's kind of the experience that I've had.

00:13:59   And it's really exciting now.

00:14:01   When I look at Feed Wrangler, I'm

00:14:02   really excited about some of the things

00:14:03   that I can do with it and with Pod Wrangler,

00:14:06   and being ready for Pod Wrangler's launch

00:14:08   in a different way, that I'm-- I've got tremendous amounts

00:14:10   of headroom on my servers now that I can grow and play with

00:14:13   and use that I just didn't have before.

00:14:15   And so I'm excited about that.

00:14:17   And I think that's ultimately the biggest plus,

00:14:19   is that I can focus on my apps.

00:14:20   I can focus on my user experience.

00:14:22   I can focus on things that people will notice and enjoy

00:14:26   in a way that I couldn't when I spent so much time and energy

00:14:29   focused on making things just run in the first place.

00:14:32   And so that's my experience.

00:14:34   That's been my last couple of days.

00:14:35   It's been a little rough, but I got through it.

00:14:37   And I think I've learned some things from it, which I'll

00:14:39   hopefully be able to implement in the future.

00:14:40   And anyway, that's it for today's show.

00:14:42   As always, if you have questions, comments,

00:14:43   concerns, complaints, I'm on Twitter

00:14:45   @_davidsmith, david@developingperspective.com.

00:14:47   Otherwise, I hope you have a great week, enjoy your new iPhones,

00:14:50   enjoy your new iOSes.

00:14:51   And I'll talk to you guys next week.

00:14:53   Bye.