00:00:00 ◼ ► Hello and welcome to Developing Perspective. Developing Perspective is a podcast discussing

00:00:03 ◼ ► news of note in iOS development, Apple and the like. I'm your host, David Smith. I'm

00:00:07 ◼ ► an independent iOS and Mac developer based in Herndon, Virginia. This is show number

00:00:11 ◼ ► 126 and today is Thursday, May 23rd. Developing Perspective is never longer than 15 minutes,

00:00:16 ◼ ► so let's get started.

00:00:17 ◼ ► All right, so I'm going to talk about scale today. And more specifically, probably the

00:00:23 ◼ ► verb scaling. So this is something that I've been obviously spending a lot of time working

00:00:28 ◼ ► on in Feed Wrangler is to try and have a reasonable approach

00:00:32 ◼ ► to increasing capacity for back end things.

00:00:36 ◼ ► And there's some amount of front end work that goes into this,

00:00:39 ◼ ► but a lot of it is just about how do you plan for,

00:00:43 ◼ ► and then how do you actually execute

00:00:45 ◼ ► on having something that can scale

00:00:47 ◼ ► to a large number of users.

00:00:49 ◼ ► There's kind of a couple of phases that this goes through,

00:00:51 ◼ ► and so I'll kind of walk through this process.

00:00:53 ◼ ► And this is kind of my experience.

00:00:54 ◼ ► I'm not an expert at it.

00:00:55 ◼ ► And if anything, the experience of Feed Wrangler

00:00:57 ◼ ► definitely gave me a lot of respect for the people who do, who work for these big kind

00:01:02 ◼ ► of crazy VC funded startups, where their goal is just users and, you know, so like a free

00:01:07 ◼ ► service that's just designed and, you know, sort of built on the

00:01:12 ◼ ► premise that within a few hours of launching, you'll have hundreds, if not thousands, if

00:01:15 ◼ ► not millions of users, and how difficult that must be, in many ways, just technically and

00:01:22 ◼ ► emotionally and so on to handle. You know, I'm working through, or

00:01:27 ◼ ► have worked through, a lot of issues, but I

00:01:32 ◼ ► at least have the advantage of having a paywall

00:01:34 ◼ ► in front of it.

00:01:35 ◼ ► So while my user base is good and solid and growing

00:01:38 ◼ ► and kind of what I was hoping for it to be,

00:01:40 ◼ ► it's nowhere near that; there's this nice big barrier

00:01:43 ◼ ► between the number of users

00:01:46 ◼ ► that I can handle and the number of users that will

00:01:48 ◼ ► at any one time jump onto the service.

00:01:50 ◼ ► So definitely mad respect to those people who can handle

00:01:54 ◼ ► and work through those issues.

00:01:55 ◼ ► And obviously there's a lot of that

00:01:56 ◼ ► that's just throwing money at it,

00:01:57 ◼ ► by having people who've done this before,

00:01:59 ◼ ► who are specialists in very specific attributes of it,

00:02:02 ◼ ► and they can kind of stack those all up.

00:02:04 ◼ ► You know, so you have a database guy, a web guy,

00:02:06 ◼ ► a front-end guy, a network guy,

00:02:08 ◼ ► you can do all those kinds of things,

00:02:09 ◼ ► rather than obviously me just being one guy working on it.

00:02:11 ◼ ► So, first I'm gonna talk a little bit about the planning side.

00:02:16 ◼ ► And planning is kind of a tricky thing.

00:02:18 ◼ ► When you're launching something new,

00:02:20 ◼ ► you never really know what the demand for it is gonna be.

00:02:24 ◼ ► And you can guess, you can hope,

00:02:26 ◼ ► You can kind of maybe back of the napkin kind of guesstimate it, you know, oh, well,

00:02:31 ◼ ► the potential user base is this size, and I, you know, I

00:02:36 ◼ ► think I'll have this much of a reach and so on.

00:02:38 ◼ ► But for the most part, you're just guessing.

00:02:41 ◼ ► And so what I tended to do for this is that my goal was to, before launch, to have a system

00:02:46 ◼ ► that could scale pretty widely, but which could also scale down fairly easily.

00:02:53 ◼ ► And by that I mean, at every point in the architecture, when I'm sort of building

00:02:58 ◼ ► how it was going to work and how it's structured, my goal was always to make it so that if I

00:03:02 ◼ ► needed to, I could essentially throw money at the problem, which

00:03:06 ◼ ► is essentially what you want, you want to be able to say, if things are going crazy,

00:03:11 ◼ ► awesome, that you can just like, okay, I'm gonna throw money at the problem, I'm going

00:03:13 ◼ ► to get more servers, I'm gonna get bigger servers, I'm gonna get faster servers, whatever

00:03:17 ◼ ► it is, and you know, your capacity and your ability to work with that will just increase,

00:03:22 ◼ ► you know, sort of, if not linearly, then at least

00:03:26 ◼ ► increase solidly with that.

00:03:29 ◼ ► So a lot of that is making sure that, like in my case,

00:03:32 ◼ ► it's so that I could probably handle my web traffic on one

00:03:37 ◼ ► big, beefy web front end.

00:03:40 ◼ ► But instead, I chose to put a load balancer in front

00:03:43 ◼ ► of a couple, or I think a trio, of web application servers.

00:03:49 ◼ ► I didn't necessarily need to do that.

00:03:51 ◼ ► I could have just put one big beefy server in the front,

00:03:53 ◼ ► but then all of a sudden I have these issues

00:03:55 ◼ ► if I need more capacity.

00:03:56 ◼ ► Then I have to insert the node balancer live in the front

00:03:59 ◼ ► and redistribute traffic and those kinds of issues.

00:04:02 ◼ ► It's like, well, I'll just start with that from the beginning.

00:04:04 ◼ ► So now if I need to, I can add more servers on the front end,

00:04:08 ◼ ► and my general capacity will increase.

00:04:11 ◼ ► I can scale it down if I need to.

00:04:12 ◼ ► If I find that I really don't need as many servers as I have,

00:04:15 ◼ ► I can just easily just unprovision them,

00:04:18 ◼ ► remove them from the pool, and we're fine.
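
The pool-behind-a-load-balancer idea described here can be sketched roughly as follows. This is only an illustration in Python, not the setup Feed Wrangler actually uses: the `ServerPool` class, the round-robin policy, and the server names are all assumptions made for the example.

```python
from itertools import count


class ServerPool:
    """Hypothetical round-robin pool of backends that can grow or shrink."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._counter = count()  # monotonically increasing request counter

    def add(self, server):
        # Scale up: new capacity is used as soon as it joins the pool.
        self.servers.append(server)

    def remove(self, server):
        # Scale down: the server simply stops receiving requests.
        self.servers.remove(server)

    def next_server(self):
        if not self.servers:
            raise RuntimeError("no servers in pool")
        return self.servers[next(self._counter) % len(self.servers)]


pool = ServerPool(["web1", "web2", "web3"])  # made-up server names
first_three = [pool.next_server() for _ in range(3)]
pool.add("web4")      # things are going crazy: add a server
pool.remove("web1")   # or things are quiet: take one out
```

The point of starting with the pool in place is that adding or removing capacity later is a one-line change rather than a live re-architecture.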

00:04:21 ◼ ► Similarly on the back end, you know, the database, you want to

00:04:23 ◼ ► set it up so that if you need to distribute it, that database, if

00:04:27 ◼ ► you need to increase its size, you want to have the ability to

00:04:30 ◼ ► do that. And it's a little bit trickier with databases, because

00:04:33 ◼ ► really, you only have a few options for doing

00:04:36 ◼ ► that. But you know, it's making sure you have an architecture

00:04:38 ◼ ► that can support, for example, having multiple read slaves or

00:04:41 ◼ ► those kinds of things. And then, you know, I have a worker pool

00:04:44 ◼ ► that's doing all the scraping and asynchronous processing for the

00:04:47 ◼ ► system. And same thing, it's designed around that, rather than

00:04:51 ◼ ► putting all that on one machine and having that kind of be my big beefy worker bee, I've

00:04:55 ◼ ► got a whole swarm of those that I can easily add new servers to or remove servers from

00:05:01 ◼ ► in a way that makes sense.

00:05:03 ◼ ► And I found that worked out pretty well, in that

00:05:07 ◼ ► I was able to launch and I was like, "Ooh, I totally underguessed the number of workers

00:05:10 ◼ ► I'd need."

00:05:11 ◼ ► I think I started with two and I'm now at, I don't even know, I've been as high as I

00:05:15 ◼ ► think eight or nine and I think right now I'm settling in around four, which seems to

00:05:20 ◼ ► work pretty well, but it was good to have the infrastructure in place and to have practiced

00:05:25 ◼ ► pulling up and bringing down servers and so on.

00:05:29 ◼ ► And then in terms of actual scale, this is something that I talked about a little, a few

00:05:33 ◼ ► episodes ago where I talked about performance. A lot of scale isn't necessarily about

00:05:40 ◼ ► volume of machines and those kinds of things. It's one of these things where I think everyone

00:05:46 ◼ ► wishes there was a way that you could just kind of take a dial

00:05:48 ◼ ► and turn it up, and your capacity

00:05:51 ◼ ► would increase exactly with that.

00:05:53 ◼ ► And this is kind of the lie or the--

00:05:57 ◼ ► I don't know, the impression that something like Heroku,

00:06:00 ◼ ► one of these managed services, gives you, where it's like,

00:06:02 ◼ ► oh, you just up the workers and it'll happen.

00:06:05 ◼ ► Whereas the scaling seems far more-- it's a much more subtle

00:06:07 ◼ ► problem in that really what you're trying to do

00:06:10 ◼ ► is you have to find the bottleneck and work on it.

00:06:12 ◼ ► And so much of what I've been doing now

00:06:14 ◼ ► is just constantly working up and down the stack from top

00:06:17 ◼ ► to bottom, finding a bottleneck, killing it, crushing it,

00:06:20 ◼ ► and moving on to the next one.

00:06:23 ◼ ► And that dramatically increases your scale.

00:06:25 ◼ ► You'll find these weird database indexes

00:06:27 ◼ ► that you missed-- turns out

00:06:29 ◼ ► that there's this one call that needs to do it

00:06:31 ◼ ► and is taking forever that you didn't really plan for.

00:06:34 ◼ ► And I found that that's incredibly helpful to have

00:06:37 ◼ ► as a tool for that is to be able to just work through my stack

00:06:44 ◼ ► in a methodical way.
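
As a rough illustration of the missing-index kind of bottleneck, the sketch below asks a database how it plans to run one slow call before and after adding an index. It uses Python's built-in `sqlite3` purely so the example is self-contained; the episode's database is Postgres, where the equivalent tool would be `EXPLAIN`. The `items` table, its columns, and the index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: feed items looked up by the feed they belong to.
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, feed_id INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO items (feed_id, title) VALUES (?, ?)",
    [(i % 100, f"item {i}") for i in range(10_000)],
)

QUERY = "SELECT title FROM items WHERE feed_id = ?"


def plan(sql):
    """Return the query planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (42,)).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


before = plan(QUERY)   # without an index the planner scans the whole table
conn.execute("CREATE INDEX idx_items_feed_id ON items (feed_id)")
after = plan(QUERY)    # with the index it can search on feed_id directly

print("before:", before)
print("after: ", after)
```

Checking the plan for each slow call, one at a time, is one way to work through the stack methodically instead of guessing.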

00:06:47 ◼ ► So I have my database machine.

00:06:48 ◼ ► And actually, this is-- I've got professional help.

00:06:50 ◼ ► I've got a DBA to kind of help me tune and optimize that.

00:06:54 ◼ ► It's kind of the crazy thing.

00:06:56 ◼ ► And that's probably a fair point.

00:06:57 ◼ ► If you don't know how to do something,

00:06:59 ◼ ► either spend the time to learn it,

00:07:02 ◼ ► or just find someone who does, and hire them

00:07:04 ◼ ► for a short contract to help you and kind of nail

00:07:06 ◼ ► down the actual issues.

00:07:09 ◼ ► It's kind of remarkable.

00:07:10 ◼ ► I had these database issues.

00:07:11 ◼ ► I couldn't quite track them down.

00:07:13 ◼ ► I found I was able to locate somebody who's a Postgres DBA,

00:07:17 ◼ ► who was able to take the machine,

00:07:18 ◼ ► and was like, oh, you need to do this, this, this, and this.

00:07:20 ◼ ► And he's just working through it,

00:07:22 ◼ ► which is probably the same kind of experience

00:07:23 ◼ ► that I would have if someone shows me their iOS project

00:07:26 ◼ ► and they're having some problem.

00:07:27 ◼ ► I could be like, oh, here it is.

00:07:29 ◼ ► Fix this, this, and this.

00:07:31 ◼ ► And not being too proud to be like, oh, no, I

00:07:34 ◼ ► need to find and fix it myself.

00:07:35 ◼ ► You do, however, want to make sure you learn all of that,

00:07:38 ◼ ► learn what the problems are.

00:07:39 ◼ ► But it's a nice process to go through.

00:07:43 ◼ ► And the reality is now I could probably run my system on far fewer servers than I ever

00:07:47 ◼ ► could have before.

00:07:49 ◼ ► But one of the things you always have to fight in this process is premature optimization.

00:07:54 ◼ ► Optimization is going to be essential.

00:07:56 ◼ ► It's going to be necessary.

00:07:57 ◼ ► It's going to be something that you have to do.

00:07:58 ◼ ► But if you optimize too soon, I find that you really kind of struggle because you'll

00:08:03 ◼ ► end up adding complexity to your application in a way that you don't necessarily get a

00:08:09 ◼ ► payback for.

00:08:10 ◼ ► So a lot of bottlenecks, removing bottlenecks

00:08:13 ◼ ► is almost always, unless it's just like a silly mistake

00:08:16 ◼ ► or something that you really shouldn't be doing,

00:08:18 ◼ ► you're going to add some additional complexity

00:08:20 ◼ ► into your application to get a performance benefit out of it.

00:08:24 ◼ ► There's usually something like that, some kind of trade-off

00:08:26 ◼ ► that you're making, where you're trading speed

00:08:30 ◼ ► for something else.

00:08:31 ◼ ► And often I find that that's complexity,

00:08:32 ◼ ► that you're creating--

00:08:34 ◼ ► you're taking tasks and rather than doing them synchronously,

00:08:36 ◼ ► you're moving them into an asynchronous queue

00:08:38 ◼ ► where you then have to manage what

00:08:40 ◼ ► happens if they succeed and fail and so on.

00:08:42 ◼ ► There's all these kind of other things that you're doing.
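
To make that added complexity concrete, here is a minimal sketch of moving work onto an asynchronous queue with a small pool of workers, including the bookkeeping for jobs that succeed, fail, or need a retry. It uses only Python's standard library; the function name `run_workers`, the retry limit, and the `fetch` handler are assumptions for the example, not the system described in the episode.

```python
import queue
import threading


def run_workers(jobs, handler, num_workers=4, max_attempts=3):
    """Run `handler` over `jobs` on a thread pool; return (succeeded, failed)."""
    q = queue.Queue()
    for job in jobs:
        q.put((job, 1))  # (payload, attempt number)
    succeeded, failed = [], []
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            if item is None:          # sentinel: time to shut down
                q.task_done()
                return
            job, attempt = item
            try:
                handler(job)
            except Exception:
                if attempt < max_attempts:
                    q.put((job, attempt + 1))   # retry later
                else:
                    with lock:
                        failed.append(job)      # give up and record it
            else:
                with lock:
                    succeeded.append(job)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    q.join()                # every job has now succeeded or been given up on
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    return succeeded, failed


def fetch(feed):
    # Stand-in for real work such as scraping a feed.
    if feed == "bad-feed":
        raise ValueError("cannot parse")


ok, bad = run_workers(["a", "b", "bad-feed"], fetch, num_workers=2)
```

Compared with calling `fetch` inline, everything outside the `handler(job)` line is the extra machinery you take on, which is why it is worth waiting until a bottleneck actually calls for it.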

00:08:45 ◼ ► And so if you add that complexity too soon,

00:08:49 ◼ ► you are just, you're becoming much more,

00:08:52 ◼ ► it's much more complicated than it needs to be.

00:08:54 ◼ ► And if you obviously, if you add it too late,

00:08:56 ◼ ► your system falls down and doesn't work.

00:08:58 ◼ ► And so typically what I've been doing,

00:09:00 ◼ ► and it's lovely that I've actually been able

00:09:01 ◼ ► to round this out and I think I'm back onto the features,

00:09:04 ◼ ► sort of sprint leg of it, which you just kind of sit through

00:09:06 ◼ ► and you just methodically work it through.

00:09:08 ◼ ► I mean, I have this long list of like,

00:09:09 ◼ ► here's things that aren't quite working right,

00:09:11 ◼ ► here's things that are too slow, here's whatever.

00:09:14 ◼ ► You just work through it.

00:09:15 ◼ ► And it's a little bit of a drudge,

00:09:17 ◼ ► it's a little bit of kind of you're just working

00:09:19 ◼ ► your way along.

00:09:20 ◼ ► But the nice thing is, every time you,

00:09:22 ◼ ► it has this lovely cumulative effect that you,

00:09:24 ◼ ► you'll often find things that have knock-on effects

00:09:27 ◼ ► to other problems.

00:09:28 ◼ ► So you find, I found some weird issue in my worker queue

00:09:32 ◼ ► setup, and it's like, okay, well, let me fix that.

00:09:35 ◼ ► And now that actually makes a whole bunch of other issues

00:09:36 ◼ ► go away or are mitigated dramatically.

00:09:39 ◼ ► I find something bad in the way that I'm doing feed processing,

00:09:42 ◼ ► and I can fix it, and it actually

00:09:43 ◼ ► fixes six bugs at a time.

00:09:46 ◼ ► And I think that's been very motivating and helpful for me

00:09:49 ◼ ► as I go through this process to be encouraged by the fact

00:09:56 ◼ ► that there's usually a lot less--

00:09:58 ◼ ► problems are far less severe than you probably

00:10:00 ◼ ► fear at the front.

00:10:02 ◼ ► And just dive in there, tackle it,

00:10:03 ◼ ► get your arms around the problem, and work on it, and

00:10:06 ◼ ► almost always you get a result. There's definitely some times in the last couple of weeks where

00:10:10 ◼ ► I've been just like I'm up at three in the morning trying to work on some weird service

00:10:12 ◼ ► bug and I'm like, "Oh my goodness, what am I doing? This is crazy."

00:10:18 ◼ ► And what I found is just stay calm. It's kind of like what they always talk about with, I guess,

00:10:24 ◼ ► like soldiers, right? The reason you train is so that when you are actually in the situation

00:10:29 ◼ ► that you're prepared for, you'll just act. You don't have to sort of think. You're

00:10:33 ◼ ► just going to do.

00:10:33 ◼ ► And it's just like, rely on your training and go with it.

00:10:35 ◼ ► And that seems to actually have been working.

00:10:37 ◼ ► That it's like, being well prepared before

00:10:39 ◼ ► helps a lot on the back end to be

00:10:41 ◼ ► able to deal with issues and things down the road.

00:10:44 ◼ ► And that's kind of how I've been scaling it.

00:10:46 ◼ ► And the nice thing is if you take this approach,

00:10:50 ◼ ► you become fairly well prepared, and then

00:10:52 ◼ ► you just kind of methodically work your way up and down

00:10:53 ◼ ► the stack, fixing things and improving things.

00:10:56 ◼ ► You very quickly get to a point that you

00:10:57 ◼ ► can-- I think that you can round the corner

00:10:59 ◼ ► and really start accelerating onto features.

00:11:01 ◼ ► And this is where I've been loving this last couple

00:11:03 ◼ ► of few days.

00:11:04 ◼ ► This week, I'm much more feature-oriented

00:11:06 ◼ ► than bug fix-oriented or stack-oriented.

00:11:08 ◼ ► Things are tuned and humming along

00:11:10 ◼ ► in a way that works really well.

00:11:12 ◼ ► And the last thing I wanted to say around that

00:11:14 ◼ ► is, as things are humming along, I definitely

00:11:16 ◼ ► recommend that if you do any kind of web service work,

00:11:18 ◼ ► that you need to have some kind of monitoring

00:11:21 ◼ ► service attached to it.

00:11:22 ◼ ► For me right now, I use Pingdom, which is just a monitoring

00:11:25 ◼ ► service that basically, they'll hit a URL on a regular basis

00:11:30 ◼ ► and send you a push notification, a text, an email,

00:11:34 ◼ ► however you want to configure, whenever something about that

00:11:37 ◼ ► doesn't meet your criteria.

00:11:39 ◼ ► And then at first blush, you're like, OK, well,

00:11:40 ◼ ► I'll just set this up for some useful URL,

00:11:46 ◼ ► and it'll tell me if my servers are down, which is useful.

00:11:49 ◼ ► The thing I wanted to just talk about a little bit

00:11:51 ◼ ► is you can go far farther with that in a way that

00:11:54 ◼ ► is much more useful if you are selective and careful

00:11:57 ◼ ► and creative about what URL you have it hit.

00:12:00 ◼ ► And so this is something that I've started doing

00:12:02 ◼ ► that I think has been really helpful for me,

00:12:04 ◼ ► is you can, rather than just hitting a main URL,

00:12:06 ◼ ► like going to feedwrangler.net and telling me if it's up,

00:12:10 ◼ ► I create custom URLs in the app that customers will never see,

00:12:14 ◼ ► that aren't useful in that way, but that

00:12:16 ◼ ► are parts of the application that exercise the whole stack

00:12:20 ◼ ► and that examine exactly what's going on in the system

00:12:23 ◼ ► and then report back if there's something there that

00:12:26 ◼ ► isn't quite right.

00:12:27 ◼ ► And so this is making sure that I-- it hits the web server,

00:12:30 ◼ ► it goes through the load balancer to the web server,

00:12:33 ◼ ► hits the database, comes back, and then gets presented

00:12:36 ◼ ► to the user.

00:12:37 ◼ ► Or on my worker side, it's like goes and looks and hits

00:12:39 ◼ ► the web server, goes and looks at the worker queue.

00:12:42 ◼ ► Is the worker queue too high?

00:12:43 ◼ ► Is it above what I think it should be?

00:12:45 ◼ ► If it is, send it back.

00:12:47 ◼ ► And in the process, it's also making sure

00:12:49 ◼ ► my Redis server's up, it's handling all the workers,

00:12:51 ◼ ► it's handling-- make sure the workers are up.

00:12:54 ◼ ► And you can create these kind of interesting URLs

00:12:56 ◼ ► And then they send back basic messages, which you can hit directly.

00:13:00 ◼ ► And you can then include just status information.

00:13:03 ◼ ► So I have an alert that hits my worker counter.

00:13:06 ◼ ► And if the worker counter is above 4,000, which is a number-- there should never be

00:13:10 ◼ ► more than 4,000 enqueued jobs at any one time, then it sends me a push notification.
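
A status URL like the one described might look roughly like the sketch below: a check that turns the current queue depth into an HTTP status and a short message that a monitoring service can evaluate. Only the 4,000-job threshold comes from the episode. The function names, the 503 status code, and the idea of passing in a `get_queue_depth` callable (which in practice might ask Redis for the queue length) are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_QUEUED_JOBS = 4000  # threshold mentioned in the episode


def queue_status(get_queue_depth, limit=MAX_QUEUED_JOBS):
    """Return (http_status, body) describing the worker queue's health."""
    try:
        depth = get_queue_depth()
    except Exception as exc:
        # The queue backend itself is unreachable, which is also an alert.
        return 503, f"ERROR queue unreachable: {exc}"
    if depth > limit:
        return 503, f"ERROR queue depth {depth} exceeds {limit}"
    return 200, f"OK queue depth {depth}"


def make_handler(get_queue_depth):
    """Build a request handler class that serves the status check."""

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = queue_status(get_queue_depth)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return StatusHandler


# To serve it locally (blocks forever), with a hypothetical depth function:
# HTTPServer(("127.0.0.1", 8000), make_handler(lambda: 12)).serve_forever()
```

Because answering the request has to pass through the web server and reach the queue backend, a single check like this exercises several parts of the stack at once, and the monitor only needs to look for a non-200 response.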

00:13:18 ◼ ► And the thing that's kind of funny but subtle about that is making sure you have those things

00:13:21 ◼ ► gives you tremendous peace of mind.

00:13:26 ◼ ► Because I can stop kind of compulsively checking on things

00:13:29 ◼ ► in a way that I would have to otherwise,

00:13:33 ◼ ► where I have to be constantly SSHing into machines

00:13:36 ◼ ► and kind of looking at things and playing with stuff.

00:13:38 ◼ ► It's great to be able to just know that

00:13:41 ◼ ► if something goes wrong at any of these,

00:13:43 ◼ ► like these four parts of the system,

00:13:45 ◼ ► then this particular alert will fire.

00:13:46 ◼ ► If anything goes wrong with these parts of the system,

00:13:49 ◼ ► this alert will fire.

00:13:47 ◼ ► And once you have developed a fair bit of trust with that,

00:13:51 ◼ ► once it really seems to work, it's

00:13:52 ◼ ► great to be able to just push that, put that aside,

00:13:55 ◼ ► and focus on features, and focus on making the product better.

00:13:58 ◼ ► One other random note that I want to make

00:14:00 ◼ ► is if you are doing this kind of work,

00:14:01 ◼ ► if you're working on this kind of stuff,

00:14:03 ◼ ► there's a great app called Prompt, which is by Panic,

00:14:05 ◼ ► which is an SSH client.

00:14:07 ◼ ► And configuring that so that you can SSH into your machines

00:14:10 ◼ ► from your phone is amazing in terms of-- there's definitely

00:14:13 ◼ ► once so far I was out to dinner with my wife,

00:14:15 ◼ ► And again, I get this alert that the servers are down.

00:14:17 ◼ ► Something was funny.

00:14:19 ◼ ► It was great to just pick up my phone,

00:14:21 ◼ ► SSH into the machine, found the little thing, which

00:14:24 ◼ ► is some worker had an error and needed to be restarted.

00:14:26 ◼ ► I restarted it.

00:14:27 ◼ ► And boom, I'm back in business.

00:14:29 ◼ ► Come back and enjoy dinner.

00:14:31 ◼ ► So definitely a great tool to have.

00:14:33 ◼ ► You can configure it with your private key

00:14:35 ◼ ► and all that kind of stuff.

00:14:36 ◼ ► So it's really convenient.

00:14:37 ◼ ► All right, hopefully that was helpful.

00:14:38 ◼ ► It's a little bit into the network administrator

00:14:42 ◼ ► side of things.

00:14:42 ◼ ► So I'll be back more into the iOS side, which

00:14:45 ◼ ► I think most of you listen to this for, soon,

00:14:47 ◼ ► but it's what I've been thinking about and what I've been doing.

00:14:49 ◼ ► All right, so that's it.

00:14:50 ◼ ► If you have questions, comments, concerns, complaints,

00:14:52 ◼ ► compliments, I'm _DavidSmith on Twitter, DavidSmith on App.net.

00:14:56 ◼ ► And otherwise, have a great week and happy coding.

Developing Perspective

#126: Scaling.