Developing Perspective

#126: Scaling.


00:00:00   Hello and welcome to Developing Perspective. Developing Perspective is a podcast discussing

00:00:03   news of note in iOS development, Apple and the like. I'm your host, David Smith. I'm

00:00:07   an independent iOS and Mac developer based in Herndon, Virginia. This is show number

00:00:11   126 and today is Thursday, May 23rd. Developing Perspective is never longer than 15 minutes,

00:00:16   so let's get started.

00:00:17   All right, so I'm going to talk about scale today. And more specifically, probably the

00:00:23   verb scaling. So this is something that I've been obviously spending a lot of time working

00:00:28   on in Feed Wrangler, which is to try and have a reasonable approach

00:00:32   to increasing capacity for back end things.

00:00:36   And there's some amount of front end work that goes into this,

00:00:39   but a lot of it is just about how do you plan for,

00:00:43   and then how do you actually execute

00:00:45   on having something that can scale

00:00:47   to a large number of users.

00:00:49   There's kind of a couple of phases that this goes through,

00:00:51   and so I'll kind of walk through this process.

00:00:53   And this is kind of my experience.

00:00:54   I'm not an expert at it.

00:00:55   And if anything, the experience of Feed Wrangler

00:00:57   definitely gave me a lot of respect for the people who do, who work for these big kind

00:01:02   of crazy VC funded startups, where their goal is just users and, you know, so like a free

00:01:07   service that's just designed and, you know, sort of built on the

00:01:12   premise that within a few hours of launching, you'll have hundreds, if not thousands, if

00:01:15   not millions of users, and how difficult that must be, in many ways, just technically and

00:01:22   emotionally and so on to handle. You know, I'm working through,

00:01:27   or have worked through, a lot of issues, but I

00:01:32   at least have the advantage of having a paywall

00:01:34   in front of it.

00:01:35   So while my user base is good and solid and growing

00:01:38   and kind of what I was hoping for it to be,

00:01:40   it's nowhere near that. There's this nice big barrier

00:01:43   between the number of users

00:01:46   that I can handle and the number of users that will

00:01:48   at any one time jump onto the service.

00:01:50   So definitely mad respect to those people who can handle

00:01:54   and work through those issues.

00:01:55   And obviously there's a lot of that

00:01:56   that's just throwing money at it,

00:01:57   by having people who've done this before,

00:01:59   who are specialists in very specific attributes of it,

00:02:02   and they can kind of stack those all up.

00:02:04   You know, so you have a database guy, a web guy,

00:02:06   a front-end guy, a network guy,

00:02:08   you can do all those kinds of things,

00:02:09   rather than obviously me just being one guy working on it.

00:02:11   So, first I'm gonna talk a little bit about the planning side.

00:02:16   And planning is kind of a tricky thing.

00:02:18   When you're launching something new,

00:02:20   you never really know what the demand for it is gonna be.

00:02:24   And you can guess, you can hope,

00:02:26   you can kind of maybe back of the napkin kind of guesstimate it, you know, oh, well,

00:02:31   the potential user base is this size, and I, you know, I

00:02:36   think I'll have this much of a reach and so on.

00:02:38   But for the most part, you're just guessing.

00:02:41   And so what I tended to do for this is that my goal was to, before launch, to have a system

00:02:46   that could scale pretty widely, but which could also scale down fairly easily.

00:02:53   And by that I mean, at every point in the architecture, when I'm sort of building

00:02:58   how it was going to work and how it's structured, my goal was always to make it so that if I

00:03:02   needed to, I could essentially throw money at the problem, which

00:03:06   is essentially what you want, you want to be able to say, if things are going crazy,

00:03:11   awesome, you can just say, okay, I'm gonna throw money at the problem, I'm going

00:03:13   to get more servers, I'm gonna get bigger servers, I'm gonna get faster servers, whatever

00:03:17   it is, and you know, your capacity and your ability to work with that will just increase,

00:03:22   you know, sort of, if not linearly, then at least

00:03:26   solidly with that.

00:03:29   So a lot of that is making sure that, like in my case,

00:03:32   it's that I could probably handle my web traffic on one

00:03:37   big, beefy web front end.

00:03:40   But instead, I chose to put a load balancer in front

00:03:43   of a couple, or I think a trio, of web application servers.

00:03:49   I didn't necessarily need to do that.

00:03:51   I could have just put one big beefy server in the front,

00:03:53   but then all of a sudden I have these issues

00:03:55   if I need more capacity.

00:03:56   Then I have to insert the node balancer live in the front

00:03:59   and redistribute traffic and those kinds of issues.

00:04:02   It's like, well, I'll just start with that from the beginning.

00:04:04   So now if I need to, I can add more servers on the front end,

00:04:08   and my general capacity will increase.

00:04:11   I can scale it down if I need to.

00:04:12   If I find that I really don't need as many servers as I have,

00:04:15   I can easily unprovision them,

00:04:18   remove them from the pool, and we're fine.
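
The pool idea above can be sketched in a few lines of Python. This is only an illustration of the concept, not how Feed Wrangler or any real load balancer is implemented: the `ServerPool` class and the server names are hypothetical, and round-robin is an assumed distribution policy.

```python
class ServerPool:
    """Hypothetical round-robin pool standing in for a load balancer."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0  # running counter used to rotate through the pool

    def add(self, server):
        # Scaling up: a new server starts receiving traffic immediately.
        self.servers.append(server)

    def remove(self, server):
        # Scaling down: the server simply stops being picked.
        self.servers.remove(server)

    def pick(self):
        if not self.servers:
            raise RuntimeError("no servers in pool")
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server


pool = ServerPool(["web1", "web2", "web3"])  # assumed names for the trio
first_round = [pool.pick() for _ in range(3)]
pool.add("web4")      # add capacity without touching callers
pool.remove("web1")   # unprovision a server that is no longer needed
```

Because callers only ever talk to the pool, adding or removing a server is a local change, which is the property being described.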

00:04:21   Similarly on the back end, you know, the database, you want to

00:04:23   set it up so that if you need to distribute it, that database, if

00:04:27   you need to increase its size, you want to have the ability to

00:04:30   do that. And it's a little bit trickier with databases, because

00:04:33   really, you only have a few options for doing

00:04:36   that. But you know, it's making sure you have an architecture

00:04:38   that can support, for example, having multiple read slaves or

00:04:41   those kinds of things. And then, you know, I have a worker pool

00:04:44   that's doing all the scraping and asynchronous processing for the

00:04:47   system. And same thing, it's designed around that, rather than

00:04:51   putting all that on one machine and having that kind of be my big beefy worker bee, I've

00:04:55   got a whole swarm of those that I can easily add new servers to or remove servers from

00:05:01   in a way that makes sense.

00:05:03   And I found that, you know, worked out pretty well, in that

00:05:07   I was able to launch and I was like, "Ooh," I totally underguessed the number of workers

00:05:10   I'd need.

00:05:11   I think I started with two and I'm now to, I don't even know, I've been as high as I

00:05:15   think eight or nine and I think right now I'm settling in around four, which seems to

00:05:20   be working pretty well, but it was good to have the infrastructure in place and to have practiced

00:05:25   pulling up and bringing down servers and so on.

00:05:29   And then in terms of actual scale, this is something that I talked about a little, a few

00:05:33   episodes ago where I talked about performance. A lot of scale isn't necessarily about

00:05:40   volume of machines and those kinds of things. It's one of these things where everyone

00:05:46   wishes there was a way that you could just kind of take a dial

00:05:48   and turn it up, and your capacity

00:05:51   would increase exactly with that.

00:05:53   And this is kind of the lie or the--

00:05:57   I don't know, the impression that something like Heroku,

00:06:00   one of these managed services, gives you, where it's like,

00:06:02   oh, you just up the workers and it'll happen.

00:06:05   Whereas scaling is a much more subtle

00:06:07   problem in that really what you're trying to do

00:06:10   is you have to find the bottleneck and work on it.

00:06:12   And so much of what I've been doing now

00:06:14   is just constantly working up and down the stack from top

00:06:17   to bottom, finding a bottleneck, killing it, crushing it,

00:06:20   and moving on to the next one.

00:06:23   And that will dramatically increase your scale.

00:06:25   You'll find these weird database indexes

00:06:27   that you missed. It turns out

00:06:29   that there's this one call that needs one

00:06:31   and is taking forever that you didn't really plan for.
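
As a small illustration of the missed-index kind of bottleneck, here is a sketch using Python's built-in `sqlite3` module. The episode's database is Postgres, so SQLite is only a stand-in here, and the `items` table, its columns, and the index name are all made up for the example. It asks the database for its query plan before and after adding an index on the column being filtered.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, feed_id INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO items (feed_id, title) VALUES (?, ?)",
    [(i % 100, f"item {i}") for i in range(10_000)],
)

lookup = "SELECT * FROM items WHERE feed_id = ?"

# Without an index the database has to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# Hypothetical fix: index the column the slow call filters on.
conn.execute("CREATE INDEX idx_items_feed_id ON items (feed_id)")
plan_after = query_plan(conn, lookup, (42,))

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan should describe a scan and the second should mention the new index. In Postgres the equivalent tool is `EXPLAIN`.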

00:06:34   And I found that it's incredibly helpful

00:06:37   to be able to just work through my stack

00:06:44   in a methodical way.

00:06:47   So I have my database machine.

00:06:48   And actually, this is-- I've got professional help.

00:06:50   I've got a DBA to kind of help me tune and optimize that.

00:06:54   It's kind of the crazy thing.

00:06:56   And that's probably a fair point.

00:06:57   If you don't know how to do something,

00:06:59   either spend the time to learn it,

00:07:02   or just find someone who does, and hire them

00:07:04   for a short contract to help you and kind of nail

00:07:06   down the actual issues.

00:07:09   It's kind of remarkable.

00:07:10   I had these database issues.

00:07:11   I couldn't quite track them down.

00:07:13   I was able to locate somebody who's a Postgres DBA,

00:07:17   who was able to take the machine,

00:07:18   and was like, oh, you need to do this, this, this, and this.

00:07:20   And he's just working through it,

00:07:22   which is probably the same kind of experience

00:07:23   that I would have if someone shows me their iOS project

00:07:26   and they're having some problem.

00:07:27   I could be like, oh, here it is.

00:07:29   Fix this, this, and this.

00:07:31   And not being too proud to be like, oh, no, I

00:07:34   need to find and fix it myself.

00:07:35   You do, however, want to make sure you learn all of that,

00:07:38   learn what the problems are.

00:07:39   But it's a nice process to go through.

00:07:43   And the reality is now I could probably run my system on far fewer servers than I ever

00:07:47   could have before.

00:07:49   But one of the things you always have to fight in this process is premature optimization.

00:07:54   Optimization is going to be essential.

00:07:56   It's going to be necessary.

00:07:57   It's going to be something that you have to do.

00:07:58   But if you optimize too soon, I find that you really kind of struggle because you'll

00:08:03   end up adding complexity to your application in a way that you don't necessarily get a

00:08:09   payback for.

00:08:10   So when you're removing bottlenecks,

00:08:13   almost always, unless it's just like a silly mistake

00:08:16   or something that you really shouldn't be doing,

00:08:18   you're going to add some additional complexity

00:08:20   into your application to get a performance benefit out of it.

00:08:24   There's usually something like that, some kind of trade-off

00:08:26   that you're making, in that you're trading speed

00:08:30   for something else.

00:08:31   And often I find that that's complexity,

00:08:32   that you're creating--

00:08:34   you're taking tasks and rather than doing them synchronously,

00:08:36   you're moving them into an asynchronous queue

00:08:38   where you then have to manage what

00:08:40   happens if they succeed and fail and so on.
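
To make that extra complexity concrete, here is a minimal sketch of a job queue with retries, in Python. It is a simplified, single-process illustration, not the queue setup used in the episode: the `process_queue` function, the `max_attempts` limit, and the example handler are all assumptions made for the sketch.

```python
from collections import deque


def process_queue(jobs, handler, max_attempts=3):
    """Run each job through handler, retrying failures up to max_attempts.

    Returns (succeeded, failed) so the caller can decide what to do
    with jobs that never worked.
    """
    queue = deque((job, 1) for job in jobs)
    succeeded, failed = [], []
    while queue:
        job, attempt = queue.popleft()
        try:
            handler(job)
            succeeded.append(job)
        except Exception:
            if attempt < max_attempts:
                # Put the job at the back of the queue for another try.
                queue.append((job, attempt + 1))
            else:
                # Out of attempts: record it rather than losing it silently.
                failed.append(job)
    return succeeded, failed


seen = set()


def example_handler(job):
    """Hypothetical handler: 'bad' always fails, 'flaky' fails once."""
    if job == "bad":
        raise ValueError("permanent failure")
    if job == "flaky" and job not in seen:
        seen.add(job)
        raise ValueError("temporary failure")


done, dead = process_queue(["ok", "flaky", "bad"], example_handler)
```

Even this toy version has to track attempts, decide on a retry limit, and keep a list of dead jobs, none of which exists when the work is done synchronously.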

00:08:42   There's all these kind of other things that you're doing.

00:08:45   And so if you add that complexity too soon,

00:08:49   your system is just

00:08:52   much more complicated than it needs to be.

00:08:54   And obviously, if you add it too late,

00:08:56   your system falls down and doesn't work.

00:08:58   And so typically what I've been doing,

00:09:00   and it's lovely that I've actually been able

00:09:01   to round this out and I think I'm back onto the features,

00:09:04   sort of sprint leg of it, is you just kind of sit down

00:09:06   and you just methodically work it through.

00:09:08   I mean, I have this long list of like,

00:09:09   here's things that aren't quite working right,

00:09:11   here's things that are too slow, here's whatever.

00:09:14   You just work through it.

00:09:15   And it's a little bit of a drudge,

00:09:17   you're kind of just working

00:09:19   your way along.

00:09:20   But the nice thing is,

00:09:22   it has this lovely cumulative effect:

00:09:24   you'll often find things that have knock-on effects

00:09:27   to other problems.

00:09:28   So you find, I found some weird issue in my worker queue

00:09:32   setup, and it's like, okay, well, let me fix that.

00:09:35   And now that actually makes a whole bunch of other issues

00:09:36   go away or be mitigated dramatically.

00:09:39   I find something bad in the way that I'm doing feed processing,

00:09:42   and I can fix it, and it actually

00:09:43   fixes six bugs at a time.

00:09:46   And I think that's been very motivating and helpful for me

00:09:49   as I go through this process to be encouraged by the fact

00:09:56   that there's usually a lot less--

00:09:58   problems are far less severe than you probably

00:10:00   fear at the front.

00:10:02   And just dive in there, tackle it,

00:10:03   get your arms around the problem, and work on it

00:10:06   and almost always you get a result. There's definitely been some times in the last couple of weeks where

00:10:10   I've been just like I'm up at three in the morning trying to work on some weird service

00:10:12   bug and I'm like, "Oh my goodness, what am I doing? This is crazy."

00:10:18   And what I found is just stay calm. It's kind of like what they always talk about with, I guess,

00:10:24   like soldiers, right? The reason you train is so that when you are actually in the situation

00:10:29   that you've prepared for, you'll just act. You don't have to sort of think. You can

00:10:33   just go and do.

00:10:33   And it's just like, rely on your training and go with it.

00:10:35   And that seems to actually have been working.

00:10:37   That it's like, being well prepared before

00:10:39   helps a lot on the back end to be

00:10:41   able to deal with issues and things down the road.

00:10:44   And that's kind of how I've been scaling it.

00:10:46   And the nice thing is if you take this approach,

00:10:50   you become fairly well prepared, and then

00:10:52   you just kind of methodically work your way up and down

00:10:53   the stack, fixing things and improving things.

00:10:56   You very quickly get to a point that you

00:10:57   can-- I think that you can round the corner

00:10:59   and really start accelerating onto features.

00:11:01   And this is where I've been loving these last

00:11:03   few days.

00:11:04   This week, I'm much more feature-oriented

00:11:06   than bug fix-oriented or stack-oriented.

00:11:08   Things are tuned and humming along

00:11:10   in a way that works really well.

00:11:12   And the last thing I wanted to say around that

00:11:14   is, as things are humming along, I definitely

00:11:16   recommend that if you do any kind of web service work,

00:11:18   you have some kind of monitoring

00:11:21   service attached to it.

00:11:22   For me right now, I use Pingdom, which is just a monitoring

00:11:25   service that basically, they'll hit a URL on a regular basis

00:11:30   and send you a push notification, a text, an email,

00:11:34   however you want to configure, whenever something about that

00:11:37   doesn't meet your criteria.

00:11:39   And then at first blush, you're like, OK, well,

00:11:40   I'll just set this up for some useful URL,

00:11:46   and it'll tell me if my servers are down, which is useful.

00:11:49   The thing I wanted to just talk about a little bit

00:11:51   is you can go much further with that in a way that

00:11:54   is much more useful if you are selective and careful

00:11:57   and creative about what URL you have it hit.

00:12:00   And so this is something that I've started doing

00:12:02   that I think has been really helpful for me,

00:12:04   is you can, rather than just hitting a main URL,

00:12:06   like going to feedwrangler.net and telling me if it's up,

00:12:10   I create custom URLs in the app that customers will never see,

00:12:14   that aren't useful in that way, but that

00:12:16   are parts of the application that exercise the whole stack

00:12:20   and that examine exactly what's going on in the system

00:12:23   and then report back if there's something there that

00:12:26   isn't quite right.

00:12:27   And so this is making sure that the request

00:12:30   goes through the load balancer to the web server,

00:12:33   hits the database, comes back, and then gets presented

00:12:36   to the user.

00:12:37   Or on my worker side, it hits

00:12:39   the web server, goes and looks at the worker queue.

00:12:42   Is the worker queue too high?

00:12:43   Is it above what I think it should be?

00:12:45   If it is, send it back.

00:12:47   And in the process, it's also making sure

00:12:49   my Redis server's up, it's handling all the workers,

00:12:51   it's making sure the workers are up.
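
One way to picture a check URL that exercises the whole stack is a function that runs a probe for each layer and reports which ones failed. The Python sketch below is an assumption about how such a handler might be organised, not the actual Feed Wrangler code: `run_stack_checks` and the probe functions are hypothetical, and real probes would run a trivial database query, ping Redis, and so on.

```python
def run_stack_checks(checks):
    """Run each named probe; return (status_code, body) for the monitor.

    checks maps a layer name to a zero-argument callable that raises
    an exception if that layer is unhealthy.
    """
    failures = []
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            failures.append(f"{name}: {exc}")
    if failures:
        # A non-200 status is what makes the monitoring service alert.
        return 503, "FAIL " + "; ".join(failures)
    return 200, "OK"


def database_probe():
    """Hypothetical probe; a real one would run a cheap query."""


def redis_probe():
    """Hypothetical probe that simulates an outage for the example."""
    raise ConnectionError("redis not reachable")


healthy = run_stack_checks({"database": database_probe})
degraded = run_stack_checks({"database": database_probe, "redis": redis_probe})
```

Putting the layer name in the response body means the alert itself says where to start looking.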

00:12:54   And you can create these kind of interesting URLs

00:12:56   And then they send back basic messages, which you can hit directly.

00:13:00   And you can then include just status information.

00:13:03   So I have an alert that hits my worker counter.

00:13:06   And if the worker counter is above 4,000, which is a number-- there should never be

00:13:10   more than 4,000 enqueued jobs at any one time, then it sends me a push notification.
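
The queue-depth alert could be served by something like the following WSGI sketch in Python. Only the 4,000 threshold comes from the episode; the function names are hypothetical, `get_queue_depth` is a placeholder for asking the real queue backend, and returning a 503 status to trigger the alert is an assumed convention.

```python
QUEUE_LIMIT = 4000  # threshold mentioned in the episode


def get_queue_depth():
    """Hypothetical stand-in; a real version would ask the queue backend."""
    return 0


def queue_status(depth, limit=QUEUE_LIMIT):
    """Map a queue depth to an HTTP status line and a short text body."""
    if depth > limit:
        return "503 Service Unavailable", f"FAIL queue depth {depth} > {limit}"
    return "200 OK", f"OK queue depth {depth}"


def app(environ, start_response):
    """Minimal WSGI app a monitoring service could poll."""
    status, body = queue_status(get_queue_depth())
    start_response(status, [("Content-Type", "text/plain")])
    return [body.encode("utf-8")]
```

For a quick local try, this could be served with the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` and the monitor pointed at that URL.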

00:13:18   And the thing that's kind of funny but subtle about that is making sure you have those things

00:13:21   gives you tremendous peace of mind.

00:13:26   Because I can stop kind of compulsively checking on things

00:13:29   in a way that I would have to otherwise,

00:13:33   where I have to be constantly SSHing into machines

00:13:36   and kind of looking at things and playing with stuff.

00:13:38   It's great to be able to just know that

00:13:41   if something goes wrong at any of these,

00:13:43   like these four parts of the system,

00:13:45   then this particular alert will fire.

00:13:46   If anything goes wrong with these parts of the system,

00:13:49   then that alert will fire.

00:13:47   And once you have developed a fair bit of trust with that,

00:13:51   once it really seems to work, it's

00:13:52   great to be able to just put that aside,

00:13:55   and focus on features, and focus on making the product better.

00:13:58   One other random note that I want to make

00:14:00   is if you are doing this kind of work,

00:14:01   if you're working on this kind of stuff,

00:14:03   there's a great app called Prompt, which is by Panic,

00:14:05   which is an SSH client.

00:14:07   And configuring that so that you can SSH into your machines

00:14:10   from your phone is amazing. There was definitely

00:14:13   once so far where I was out to dinner with my wife,

00:14:15   and I got this alert that the servers were down.

00:14:17   Something was funny.

00:14:19   It was great to just pick up my phone,

00:14:21   SSH into the machine, and find the little thing, which

00:14:24   was that some worker had an error and needed to be restarted.

00:14:26   I restarted it.

00:14:27   And boom, I'm back in business.

00:14:29   Go back and enjoy dinner.

00:14:31   So definitely a great tool to have.

00:14:33   You can configure it with your private key

00:14:35   and all that kind of stuff.

00:14:36   So it's really convenient.

00:14:37   All right, hopefully that was helpful.

00:14:38   It's a little bit into the network administrator

00:14:42   side of things.

00:14:42   So I'll be back more into the iOS side soon, which

00:14:45   I think is what most of you listen to this for,

00:14:47   but it's what I've been thinking about and what I've been doing.

00:14:49   All right, so that's it.

00:14:50   If you have questions, comments, concerns, complaints,

00:14:52   compliments, I'm _DavidSmith on Twitter, DavidSmith on App.net.

00:14:56   And otherwise, have a great week and happy coding.