#126: Scaling.
00:00:00
Hello and welcome to Developing Perspective. Developing Perspective is a podcast discussing
00:00:03
news of note in iOS development, Apple and the like. I'm your host, David Smith. I'm
00:00:07
an independent iOS and Mac developer based in Herndon, Virginia. This is show number
00:00:11
126 and today is Thursday, May 23rd. Developing Perspective is never longer than 15 minutes,
00:00:16
so let's get started.
00:00:17
All right, so I'm going to talk about scale today. And more specifically, probably the
00:00:23
verb scaling. So this is something that I've been obviously spending a lot of time working
00:00:28
on in Feed Wrangler: trying to have a reasonable approach
00:00:32
to increasing capacity for back end things.
00:00:36
And there's some amount of front end work that goes into this,
00:00:39
but a lot of it is just about how do you plan for,
00:00:43
and then how do you actually execute
00:00:45
on having something that can scale
00:00:47
to a large number of users.
00:00:49
There's kind of a couple of phases that this goes through,
00:00:51
and so I'll kind of walk through this process.
00:00:53
And this is kind of my experience.
00:00:54
I'm not an expert at it.
00:00:55
And if anything, the experience of Feed Wrangler
00:00:57
definitely gave me a lot of respect for the people who do, who work for these big kind
00:01:02
of crazy VC funded startups, where their goal is just users and, you know, so like a free
00:01:07
service that's just designed and, you know, sort of built on the
00:01:12
premise that within a few hours of launching, you'll have hundreds, if not thousands, if
00:01:15
not millions of users, and how difficult that must be, in many ways, just technically and
00:01:22
emotionally and so on to handle. You know, I'm working through a lot of issues, or
00:01:27
have worked through a lot of issues, but I
00:01:32
at least have the advantage of having a paywall
00:01:34
in front of it.
00:01:35
So while my user base is good and solid and growing
00:01:38
and kind of what I was hoping for it to be,
00:01:40
it's nowhere near, there's this nice big barrier
00:01:43
between the number of users
00:01:46
that I can handle and the number of users that will
00:01:48
at any one time jump onto the service.
00:01:50
So definitely mad respect to those people who can handle
00:01:54
and work through those issues.
00:01:55
And obviously there's a lot of that
00:01:56
that's just throwing money at it,
00:01:57
by having people who've done this before,
00:01:59
who are specialists in very specific aspects of it,
00:02:02
and they can kind of stack those all up.
00:02:04
You know, so you have a database guy, a web guy,
00:02:06
a front-end guy, a network guy,
00:02:08
you can do all those kinds of things,
00:02:09
rather than obviously me just being one guy working on it.
00:02:11
So, first I'm gonna talk a little bit about the planning side.
00:02:16
And planning is kind of a tricky thing.
00:02:18
When you're launching something new,
00:02:20
you never really know what the demand for it is gonna be.
00:02:24
And you can guess, you can hope,
00:02:26
You can kind of maybe back of the napkin kind of guesstimate it, you know, oh, well, there's
00:02:31
so many, you know, the potential user base is this size, and I, you know, I
00:02:36
think I'll have this much of a reach and so on.
00:02:38
But for the most part, you're just guessing.
00:02:41
And so what I tended to do for this is that my goal was to, before launch, to have a system
00:02:46
that could scale pretty widely, but which could also scale down fairly easily.
00:02:53
And by that I mean, at every point in the architecture, when I'm sort of building
00:02:58
how it was going to work and how it's structured, my goal was always to make it so that if I
00:03:02
needed to, I could essentially throw money at the problem, which
00:03:06
is essentially what you want, you want to be able to say, if things are going crazy,
00:03:11
awesome, that you can just like, okay, I'm gonna throw money at the problem, I'm going
00:03:13
to get more servers, I'm gonna get bigger servers, I'm gonna get faster servers, whatever
00:03:17
it is, and you know, your capacity and your ability to work with that will just increase,
00:03:22
you know, sort of, if not linearly, then at least
00:03:26
increase solidly with that.
00:03:29
So a lot of that is making sure that, like in my case,
00:03:32
I could probably handle my web traffic on one
00:03:37
big, beefy web front end.
00:03:40
But instead, I chose to put a load balancer in front
00:03:43
of a couple, or I think a trio, of web application servers.
00:03:49
I didn't necessarily need to do that.
00:03:51
I could have just put one big beefy server in the front,
00:03:53
but then all of a sudden I have these issues
00:03:55
if I need more capacity.
00:03:56
Then I have to insert the load balancer live in the front
00:03:59
and redistribute traffic and those kinds of issues.
00:04:02
It's like, well, I'll just start with that from the beginning.
00:04:04
So now if I need to, I can add more servers on the front end,
00:04:08
and my general capacity will increase.
00:04:11
I can scale it down if I need to.
00:04:12
If I find that I really don't need as many servers as I have,
00:04:15
I can easily deprovision them,
00:04:18
remove them from the pool, and we're fine.
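
[Editor's sketch] The idea of a pool of web servers behind a load balancer, where capacity changes by adding or removing pool members, can be illustrated with a toy round-robin pool. This is only an illustration of the concept under stated assumptions: the class name `ServerPool` and the host names are hypothetical, and a real load balancer also handles health checks and connection draining.

```python
from itertools import count


class ServerPool:
    """Toy round-robin pool; a real load balancer also does health checks."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._counter = count()

    def add(self, server):
        # Scaling up: new capacity joins the rotation immediately.
        self._servers.append(server)

    def remove(self, server):
        # Scaling down: the server simply stops receiving new requests.
        self._servers.remove(server)

    def next_server(self):
        if not self._servers:
            raise RuntimeError("no servers in pool")
        return self._servers[next(self._counter) % len(self._servers)]


pool = ServerPool(["web1", "web2", "web3"])  # hypothetical host names
first_three = [pool.next_server() for _ in range(3)]
pool.add("web4")
pool.remove("web1")
```

Because clients only ever talk to the pool, adding or removing a server does not require redirecting traffic, which is the point made above about starting with the balancer in place.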
00:04:21
Similarly on the back end, you know, the database, you want to
00:04:23
set it up so that if you need to distribute that database, if
00:04:27
you need to increase its size, you want to have the ability to
00:04:30
do that. And it's a little bit trickier with databases, because
00:04:33
really, you only have a few options for doing
00:04:36
that. But you know, it's making sure you have an architecture
00:04:38
that can support for example, having multiple read slaves or
00:04:41
those kinds of things. And then, you know, I have a worker pool
00:04:44
that's doing all the scraping and asynchronous processing for the
00:04:47
system. And same thing, it's designed around that, rather than
00:04:51
putting all that on one machine and having that kind of be my big beefy worker bee, I've
00:04:55
got a whole swarm of those that I can easily add new servers to or remove servers from
00:05:01
in a way that makes sense.
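
[Editor's sketch] The "swarm of workers" design can be pictured as several workers pulling from one shared queue, so that capacity is a single number you can turn up or down. The sketch below uses threads in one process purely for illustration; the function name `run_workers` is hypothetical, and the episode's workers are separate servers, not threads.

```python
import queue
import threading


def run_workers(jobs, handler, worker_count):
    """Process every job with `worker_count` threads pulling from one queue."""
    todo = queue.Queue()
    for job in jobs:
        todo.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = todo.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            outcome = handler(job)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Same code, different capacity: only worker_count changes.
small = run_workers(range(20), lambda n: n * 2, worker_count=2)
large = run_workers(range(20), lambda n: n * 2, worker_count=8)
```

The jobs and the handler are identical in both runs; only the number of workers differs, which mirrors adding or removing worker servers without touching the rest of the system.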
00:05:03
And I found that worked out pretty well, in that
00:05:07
I was able to launch and I was like, "Ooh, I totally underguessed the number of workers."
00:05:11
I think I started with two and I'm now at, I don't even know, I've been as high as I
00:05:15
think eight or nine and I think right now I'm settling in around four, which seems to
00:05:20
work pretty well, but it was good to have the infrastructure in place and to have practiced
00:05:25
spinning up and bringing down servers and so on.
00:05:29
And then in terms of actual scale, this is something that I talked about a little, a few
00:05:33
episodes ago where I talked about performance. A lot of scale isn't necessarily about
00:05:40
volume of machines and those kinds of things. It's one of these things where everyone
00:05:46
wishes there was a way that you could just kind of take a dial
00:05:48
and turn it up, and your capacity
00:05:51
would increase exactly with that.
00:05:53
And this is kind of the lie or the--
00:05:57
I don't know, the impression that something like Heroku,
00:06:00
one of these managed services gives you, where it's like,
00:06:02
oh, you just up the workers and it'll happen.
00:06:05
Whereas the scaling seems far more-- it's a much more subtle
00:06:07
problem in that really what you're trying to do
00:06:10
is you have to find the bottleneck and work on it.
00:06:12
And so much of what I've been doing now
00:06:14
is just constantly working up and down the stack from top
00:06:17
to bottom, finding a bottleneck, killing it, crushing it,
00:06:20
and moving on to the next one.
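
[Editor's sketch] Finding the bottleneck before working on it usually starts with measuring where the time goes. One minimal way to do that, assuming you can wrap each stage of a request in a timer, is shown below; `StageTimer` and the stage names are hypothetical, and the sleeps merely stand in for real work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage of a request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def slowest(self):
        # The stage with the largest total is the first bottleneck to attack.
        return max(self.totals, key=self.totals.get)


timer = StageTimer()
with timer.stage("database"):
    time.sleep(0.05)  # stand-in for a slow query
with timer.stage("render"):
    time.sleep(0.01)  # stand-in for template rendering
```

After fixing whatever `timer.slowest()` reports, measuring again reveals the next bottleneck, which matches the find, fix, and move on loop described here.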
00:06:23
And that dramatically increases your scale.
00:06:25
You'll find these weird database indexes
00:06:27
that you missed, or it turns out
00:06:29
that there's this one call that needs to do it
00:06:31
and is taking forever that you didn't really plan for.
00:06:34
And I found that that's incredibly helpful to have
00:06:37
as a tool for that is to be able to just work through my stack
00:06:44
in a methodical way.
00:06:47
So I have my database machine.
00:06:48
And actually, this is-- I've got professional help.
00:06:50
I've got a DBA to kind of help me tune and optimize that.
00:06:54
It's kind of the crazy thing.
00:06:56
And that's probably a fair point.
00:06:57
If you don't know how to do something,
00:06:59
either spend the time to learn it,
00:07:02
or just find someone who does, and hire them
00:07:04
for a short contract to help you and kind of nail
00:07:06
down the actual issues.
00:07:09
It's kind of remarkable.
00:07:10
I had these database issues.
00:07:11
I couldn't quite track them down.
00:07:13
I was able to locate somebody who's a Postgres DBA,
00:07:17
who was able to take the machine,
00:07:18
and was like, oh, you need to do this, this, this, and this.
00:07:20
And he's just working through it,
00:07:22
which is probably the same kind of experience
00:07:23
that I would have if someone shows me their iOS project
00:07:26
and they're having some problem.
00:07:27
I could be like, oh, here it is.
00:07:29
Fix this, this, and this.
00:07:31
And not being too proud to be like, oh, no, I
00:07:34
need to find and fix it myself.
00:07:35
You do, however, want to make sure you learn all of that,
00:07:38
learn what the problems are.
00:07:39
But it's a nice process to go through.
00:07:43
And the reality is now I could probably run my system on far fewer servers than I ever
00:07:47
could have before.
00:07:49
But one of the things you always have to fight in this process is premature optimization.
00:07:54
Optimization is going to be essential.
00:07:56
It's going to be necessary.
00:07:57
It's going to be something that you have to do.
00:07:58
But if you optimize too soon, I find that you really kind of struggle because you'll
00:08:03
end up adding complexity to your application in a way that you don't necessarily get a
00:08:09
payback for.
00:08:10
So removing bottlenecks
00:08:13
is almost always, unless it's just like a silly mistake
00:08:16
or something that you really shouldn't be doing,
00:08:18
you're going to add some additional complexity
00:08:20
into your application to get a performance benefit out of it.
00:08:24
There's usually something like that, some kind of trade-off
00:08:26
that you're making, in that you're trading speed
00:08:30
for something else.
00:08:31
And often I find that that's complexity,
00:08:32
that you're creating--
00:08:34
you're taking tasks and rather than doing them synchronously,
00:08:36
you're moving them into an asynchronous queue
00:08:38
where you then have to manage what
00:08:40
happens if they succeed or fail and so on.
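
[Editor's sketch] The extra complexity of moving work into an asynchronous queue shows up as bookkeeping for success, retry, and permanent failure. A minimal in-memory version of that bookkeeping might look like the following; `drain`, `flaky_fetch`, and the retry limit of 3 are all hypothetical and not taken from Feed Wrangler.

```python
from collections import deque


def drain(jobs, handler, max_attempts=3):
    """Run queued jobs, retrying failures up to max_attempts times."""
    pending = deque((job, 1) for job in jobs)
    succeeded, failed = [], []
    while pending:
        job, attempt = pending.popleft()
        try:
            handler(job)
            succeeded.append(job)
        except Exception:
            if attempt < max_attempts:
                pending.append((job, attempt + 1))  # requeue for another try
            else:
                failed.append(job)  # give up; needs human attention
    return succeeded, failed


calls = {}


def flaky_fetch(url):
    # Hypothetical handler: "b" fails once then works, "c" always fails.
    calls[url] = calls.get(url, 0) + 1
    if url == "c" or (url == "b" and calls[url] == 1):
        raise IOError("fetch failed")


ok, dead = drain(["a", "b", "c"], flaky_fetch)
```

None of this retry and dead-job handling exists when the same work runs synchronously inside the request, which is the trade of simplicity for speed being described.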
00:08:42
There's all these kind of other things that you're doing.
00:08:45
And so if you add that complexity too soon,
00:08:49
you are just, you're becoming much more,
00:08:52
it's much more complicated than it needs to be.
00:08:54
And obviously, if you add it too late,
00:08:56
your system falls down and doesn't work.
00:08:58
And so typically what I've been doing,
00:09:00
and it's lovely that I've actually been able
00:09:01
to round this out and I think I'm back onto the features,
00:09:04
sort of sprint leg of it, is you just kind of sit down
00:09:06
and you just methodically work it through.
00:09:08
I mean, I have this long list of like,
00:09:09
here's things that aren't quite working right,
00:09:11
here's things that are too slow, here's whatever.
00:09:14
You just work through it.
00:09:15
And it's a little bit of a drudge,
00:09:17
it's a little bit of kind of you're just working
00:09:19
your way along.
00:09:20
But the nice thing is, every time you,
00:09:22
it has this lovely cumulative effect where
00:09:24
you'll often find things that have knock-on effects
00:09:27
to other problems.
00:09:28
So you find, I found some weird issue in my worker queue
00:09:32
setup, and it's like, okay, well, let me fix that.
00:09:35
And now that actually makes a whole bunch of other issues
00:09:36
go away or are mitigated dramatically.
00:09:39
I find something bad in the way that I'm doing feed processing,
00:09:42
and I can fix it, and it actually
00:09:43
fixes six bugs at a time.
00:09:46
And I think that's been very motivating and helpful for me
00:09:49
as I go through this process to be encouraged by the fact
00:09:56
that there's usually a lot less--
00:09:58
problems are far less severe than you probably
00:10:00
fear at the front.
00:10:02
And just dive in there, tackle it,
00:10:03
get your arms around the problem, and work on it;
00:10:06
almost always you get a result. There's definitely some times in the last couple of weeks where
00:10:10
I've just been up at three in the morning trying to work on some weird service
00:10:12
bug and I'm like, "Oh my goodness, what am I doing? This is crazy."
00:10:18
And what I found is just stay calm. It's kind of like what they always talk about with, I guess,
00:10:24
like soldiers, right? The reason you train is so that when you are actually in the situation
00:10:29
that you're prepared for, you'll just act. You don't have to sort of think. You can just
00:10:33
go and do.
00:10:33
And it's just like, rely on your training and go with it.
00:10:35
And that seems to actually have been working.
00:10:37
That it's like, being well prepared before
00:10:39
helps a lot on the back end to be
00:10:41
able to deal with issues and things down the road.
00:10:44
And that's kind of how I've been scaling it.
00:10:46
And the nice thing is if you take this approach,
00:10:50
you become fairly well prepared, and then
00:10:52
you just kind of methodically work your way up and down
00:10:53
the stack, fixing things and improving things.
00:10:56
You very quickly get to a point that you
00:10:57
can-- I think that you can round the corner
00:10:59
and really start accelerating onto features.
00:11:01
And this is where I've been loving this last couple
00:11:03
of days.
00:11:04
This week, I'm much more feature-oriented
00:11:06
than bug fix-oriented or stack-oriented.
00:11:08
Things are tuned and humming along
00:11:10
in a way that works really well.
00:11:12
And the last thing I wanted to say around that
00:11:14
is, as things are humming along, I definitely
00:11:16
recommend that if you do any kind of web service work,
00:11:18
you have some kind of monitoring
00:11:21
service attached to it.
00:11:22
For me right now, I use Pingdom, which is just a monitoring
00:11:25
service that basically, they'll hit a URL on a regular basis
00:11:30
and send you a push notification, a text, an email,
00:11:34
however you want to configure, whenever something about that
00:11:37
doesn't meet your criteria.
00:11:39
And then at first blush, you're like, OK, well,
00:11:40
I'll just set this up for some useful URL,
00:11:46
and it'll tell me if my servers are down, which is useful.
00:11:49
The thing I wanted to just talk about a little bit
00:11:51
is you can go a lot farther with that in a way that
00:11:54
is much more useful if you are selective and careful
00:11:57
and creative about what URL you have it hit.
00:12:00
And so this is something that I've started doing
00:12:02
that I think has been really helpful for me,
00:12:04
is you can, rather than just hitting a main URL,
00:12:06
like going to feedwrangler.net and telling me if it's up,
00:12:10
I create custom URLs in the app that customers will never see,
00:12:14
that aren't useful in that way, but that
00:12:16
are parts of the application that exercise the whole stack
00:12:20
and that examine exactly what's going on in the system
00:12:23
and then report back if there's something there that
00:12:26
isn't quite right.
00:12:27
And so this is making sure that I-- it hits the web server,
00:12:30
it goes through the load balancer to the web server,
00:12:33
hits the database, comes back, and then gets presented
00:12:36
to the user.
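
[Editor's sketch] A monitoring URL that exercises the whole stack can be thought of as a handler that runs one small probe per component and only reports success when all of them pass. The sketch below assumes the monitoring service treats a non-200 status as "down"; `deep_check` and the probe names are hypothetical, and real probes would run an actual query against each component.

```python
def deep_check(probes):
    """Run each named probe; report 200 only if every one passes."""
    failures = []
    for name, probe in probes.items():
        try:
            if not probe():
                failures.append(name)
        except Exception:
            failures.append(name)  # a crashing probe counts as a failure
    if failures:
        return 503, "FAIL: " + ", ".join(failures)
    return 200, "OK"


def database_broken():
    # Stand-in for a probe that cannot reach its component.
    raise ConnectionError("cannot reach database")


healthy = deep_check({"database": lambda: True, "cache": lambda: True})
unhealthy = deep_check({"database": database_broken, "cache": lambda: True})
```

Naming the failed component in the response body means the alert itself says where to look first, rather than just that something is down.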
00:12:37
Or on my worker side, it goes and hits
00:12:39
the web server, goes and looks at the worker queue.
00:12:42
Is the worker queue too high?
00:12:43
Is it above what I think it should be?
00:12:45
If it is, send it back.
00:12:47
And in the process, it's also making sure
00:12:49
my Redis server's up, it's handling all the workers,
00:12:51
it's handling-- make sure the workers are up.
00:12:54
And you can create these kind of interesting URLs
00:12:56
that you can hit directly, and they send back basic messages.
00:13:00
And you can then include just status information.
00:13:03
So I have an alert that hits my worker counter.
00:13:06
And if the worker counter is above 4,000, which is a number-- there should never be
00:13:10
more than 4,000 enqueued jobs at any one time, then it sends me a push notification.
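
[Editor's sketch] The queue-depth alert described here boils down to comparing one number against a threshold and turning the result into something a monitor can act on. A minimal version, assuming the monitor alerts on a non-200 status, is below; `queue_status` is a hypothetical name, and how the depth is read is left out (for a Redis list it would typically come from a length query such as `LLEN`).

```python
QUEUE_LIMIT = 4000  # threshold mentioned in the episode


def queue_status(depth, limit=QUEUE_LIMIT):
    """Map the current queue depth to an HTTP-style status and message."""
    if depth > limit:
        return 503, f"QUEUE_HIGH depth={depth} limit={limit}"
    return 200, f"OK depth={depth}"
```

Including the depth in the body, even when healthy, gives the basic status information mentioned above for anyone who hits the URL directly.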
00:13:18
And the thing that's kind of funny but subtle about that is making sure you have those things
00:13:21
gives you tremendous peace of mind.
00:13:26
Because I can stop kind of compulsively checking on things
00:13:29
in a way that I would have to otherwise,
00:13:33
where I have to be constantly SSHing into machines
00:13:36
and kind of looking at things and playing with stuff.
00:13:38
It's great to be able to just know that
00:13:41
if something goes wrong at any of these,
00:13:43
like these four parts of the system,
00:13:45
then this particular alert will fire.
00:13:46
If anything goes wrong with these parts of the system,
00:13:49
this alert will fire.
00:13:47
And once you have developed a fair bit of trust with that,
00:13:51
once it really seems to work, it's
00:13:52
great to be able to just push that, put that aside,
00:13:55
and focus on features, and focus on making the product better.
00:13:58
One other random note that I want to make
00:14:00
is if you are doing this kind of work,
00:14:01
if you're working on this kind of stuff,
00:14:03
there's a great app called Prompt, which is by Panic,
00:14:05
which is an SSH client.
00:14:07
And configuring that so that you can SSH into your machines
00:14:10
from your phone is amazing in terms of-- there's definitely
00:14:13
been once so far where I was out to dinner with my wife,
00:14:15
And again, I get this alert that the servers are down.
00:14:17
Something was funny.
00:14:19
It was great to just pick up my phone,
00:14:21
SSH into the machine, and find the little thing, which
00:14:24
was that some worker had an error and needed to be restarted.
00:14:26
I restarted it.
00:14:27
And boom, I'm back in business.
00:14:29
Come back and enjoy dinner.
00:14:31
So definitely a great tool to have.
00:14:33
You can configure it with your private key
00:14:35
and all that kind of stuff.
00:14:36
So it's really convenient.
00:14:37
All right, hopefully that was helpful.
00:14:38
It's a little bit into the network administrator
00:14:42
side of things.
00:14:42
So I'll be back more into the iOS side, which
00:14:45
I think most of you listen to this for, soon,
00:14:47
but it's what I've been thinking about and what I've been doing.
00:14:49
All right, so that's it.
00:14:50
If you have questions, comments, concerns, complaints,
00:14:52
compliments, I'm _DavidSmith on Twitter, DavidSmith on App.net.
00:14:56
And otherwise, have a great week and happy coding.