Learning to program? Be curious and unafraid.
I’ve been asked a few times now how to learn to program. My advice has varied as I’ve continued thinking about this, but I’ve recently realized the answer has very little to do with programming itself:
There’s something you need beyond just knowing a particular language. It doesn’t matter if you stay pure with Haskell or find your one true way with Python; it’s irrelevant whether you focus on embedded systems or concentrate on infrastructure at scale; and it certainly doesn’t matter whether you end your day with :wq or C-x C-c.
If you want to learn to program (or get better at just about anything), you need to be curious and unafraid. You should frequently arrive at the brink of your knowledge and constantly immerse yourself in the depths of each cliff.
—
I was fortunate while growing up that my parents ran a computer-based business out of our house. The Macintosh was new enough (at least for St. Louis businesses) that there was no computer store to call. For a business, this is no excuse: a misbehaving computer could not lead to a delay with our newspaper or any of our clients’ printing jobs.
Recognizing that truth, and aware that we couldn’t rely on any local shops for help, my mom bought a huge book on computers and taught herself how to fix the machines as issues arose. My parents quickly learned to stock up on parts (as some organizations took to abandoning their aging or malfunctioning machines, we stocked up on those pieces to fix our own equipment).
As I was growing up at this time and paying close attention to the world around me, I absorbed this one important lesson from my parents: to not be afraid of the possibility of a computer breaking, but to instead relish its resiliency when it did.
My parents made sure that I grew up on these ideas. With spare computers all around the house, I was encouraged to play around on any machine and not worry about what might go wrong. If I clicked the wrong icon or tugged on a cord that should be left in place, I knew that it’d just be a temporary blip: my mom could fix the computer and I’d pick right back up where I left off.
I’ve carried this throughout my life. In school, of fifty math problems, it was the one that had me wracking my brain for hours that stood out; when I studied taekwondo, it was my sparring partners that knocked me down that had me excited for a rematch; and with programming, it was the symbols that looked most foreign that reminded me how my time with computers was just getting started.
—
It’s the unknown that can keep us on our toes. If you’re just learning to program, you might look at something like this:
    public class Hello {
        public static void main(String[] args) {
            System.out.println("Hello, World!");
        }
    }
and mutter a few curses about whatever the hell a public static void is.
Great if so! The most important thing is not that you immediately know what all of these words mean; instead, it’s that you be slightly irritated by not knowing. Once you’re ready to start exploring those unknowns, just remember that there’s very little you can really break, so don’t worry about what could go wrong – try it anyway and see what happens.
If you can master being aware and bothered by the constraints of your knowledge, then you’ll always know when it’s time to push your own boundaries and you’ll have the drive to do so.
So go on. Give it a try.
Thank you Alex MacCaw and Patrick Collison for reading my drafts.
@Stripe’d
Recently, I decided to join Stripe full-time to help shape and hack on the support team. I’ve actually been working with Stripe since February doing part-time support in the Campfire chat, and even before then in a completely unofficial role as a user and community advocate.
As I also have meld with my mom, I’ve historically been against the thought of joining another company in a full-time capacity, even if I got to continue working on meld. But there’s something really special about Stripe, and I felt it from the moment I first got my beta invite.
When I started using Stripe, my mom and I were just launching an entirely rethought version of meld (then, QR Card Us). Everything was going smoothly (especially when you factor in no sleep for 48 hours—but that’s a post for a different time), but a larger client found our site and was interested for the whole company. There was only one problem: the form wouldn’t generate his card token.
Okay, there were actually two problems: the form wouldn’t generate his card token and I couldn’t reproduce. He was extremely patient (you know who you are—thanks!), but I had to get this fixed. Not knowing what else to do, I blindly emailed support@stripe.com with the customer’s IP address in hopes that they could help pinpoint the problem by verifying whether or not they were receiving his create token request.
What I got back was so much more. Greg Brockman promptly wrote me, not only giving insight into their logs, but actually going to my website, sifting through JavaScript that was completely separate from their end, and helping to point out potential problem areas.
Thanks to Stripe, I was able to find the issue and get the client promptly signed up.
This is not an isolated incident, nor just a matter of Greg being an exception to the rule. Continuing to ask Stripe for pointers along the way, I found myself witnessing a recurring theme: they were, quite simply put, fantastic people.
Thanks to Stripe, I quickly found myself with a whole team of people behind what was really just a 2-person startup.
Many companies claim they care about their customers, but are prompt to levy heavy taxes and curt responses instead. Stripe was different. They surpassed what I had ever expected from a team; even as a total stranger, they actually cared about what was best for me.
In a situation like this, there’s really only one good way to respond: I started hanging around the Campfire room to pay it forward when possible, and I took to the tubes of the Internets to defend Stripe and raise my flag in favor of them. Stripe was not just a product I wanted to recommend, but a group of people I wanted to introduce.
So when Stripe and I started talking about making my role more official, and I gave it some thought, I realized how much of a no-brainer it really was. There is no other company that I’d have even considered this for, but Stripe was and is clearly a creative, morals-driven team that truly cares more about the person than the bottom line.
Now that I’m at Stripe, I’m going to work hard to continue this even as we keep growing and becoming synonymous with accepting payments online. If you even just want to say hi, email us at support@stripe.com or come hang out with us on Campfire.
I hope this is never the case, but if you ever feel that we’re not meeting this, please know that you are welcome to email me directly at michael@stripe.com. I also wouldn’t mind hearing from you for any other reason, too :-)
The people behind Stripe are as crazy as you’d expect from a group choosing to turn the entire concept of accepting payments online on its head by actually making it simple and really caring about the customer. If you think you’re just as crazy, come change how payments are done with us.
Mom, Friend, & Co-Founder
At a dinner recently, I was asked what it’s like to be co-founders with my mom. It was such a fantastic question that it’s been on my mind since then, so I wanted to put my thoughts into writing.
Paul Graham notes friendship as an important quality in founders, and I can’t agree more. I’m proud to say that my mom is one of my best friends, which is why we are able to work so well together.
Like Paul mentions, “startups do to the relationship between the founders what a dog does to a sock: if it can be pulled apart, it will be.” But, even at the lowest times in startup life, the thought of damaging my friendship with my mom outweighs any bad. There are certainly times that we disagree, but we have always had an understanding going into any discussion that there’s some separation between personal and work life. We’ll argue, pause to have a peaceful dinner together with my dad, and pick up where we left off. Remembering the importance of family and friends is key to maintaining sanity in a startup.
Looking at ourselves as not just mom and son, but also best friends, grants us the opportunity to have complete respect for one another (in traditional relationships I’ve seen, respect unfortunately flows only in one direction). I know the areas in which she’s more knowledgeable, where her life experience is most applicable, and she knows the same for me; so, we can ask each other anything, disagree with what the other has to say, and speak our thoughts freely without offending one another.
Relatedly, this open dialog leads to a higher level of understanding each other. When I need help or am uncomfortable with a situation, I don’t even need to say a thing - she almost always knows and is able to lend a hand.
Paul writes in another essay that “you need colleagues…to cheer you up when things go wrong.” Things do go wrong, but my mom is one of the reasons I’m navigating startup life in the first place. Certainly, we want to improve the world, but I would be lying if I said that improving my family’s life wasn’t also a goal. If I’m having a bad day, all I have to do is see my mom working alongside me and it instantly gives me the right perspective and cheers me up. I think the opposite is true, too.
Overall, it’s fantastic having the opportunity to work with my mom. Given the opportunity again, I would—without hesitation—want her to be my co-founder.
NoDaddy In Review (& How I Trolled Back)
Background
I’m sure anyone reading this is aware of SOPA and PIPA by now. At the end of December 2011, the Internet came out in complaint that GoDaddy, a popular but controversial registrar and webhost, supported SOPA.
Enter NoDaddy:

As Drew Olanoff of The Next Web [writes](http://thenextweb.com/insider/2011/12/23/nodaddy-lets-you-pledge-to-boycott-go-daddy-for-its-stance-on-sopa/), he “was talking to Ben Huh and Ben suggested in jest that someone come up with a counter to track everyone who is pledging to leave Go Daddy.” So, I opened a new desktop, my favorite editor, and got started.
The setup was easy because I already had a sandbox in place called Rawr, where I host things like my @episod Tracker, so all I had to do was add a new Django app. The goal was to make the page as simple as possible, so I gave a brief background, a pledge form, and a total count of customers that GoDaddy stood to lose by their support of SOPA.
To drive home that these are real people that GoDaddy was hurting, there was also a random selection (inclusion optional) of people’s profile pictures from Gravatar.
Trolls?
Drew requested in his post that I add a domain count so that people could optionally include the number of domains that they pledged to transfer from GoDaddy. This was a great idea because, for some companies, numbers are necessary to call them to action, and this would certainly help convey economic incentive. However, I was worried that people would troll this by inputting large, false amounts, which would invalidate the overall message.
Still, the audience was limited enough and driven by a common cause, and it was indeed a good idea, so I decided to give it a shot and added the domain count. People that had already pledged could resubmit the form using their same email address and it would update the domain count without inflating the total number of customers.
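The resubmission behavior can be sketched in a few lines. This is only an illustration, not the site's code: the real app was built on Django, and the dictionary, function names, and email normalization below are all assumptions made for the example.

```python
# Hypothetical in-memory stand-in for the pledge table; the real site was a
# Django app whose models are not shown in this post.
pledges = {}  # normalized email -> pledged domain count


def submit_pledge(email, domains=0):
    """Record a pledge, or update the domain count if the email already pledged."""
    key = email.strip().lower()  # assumed normalization so resubmissions match
    pledges[key] = domains
    return key


def totals():
    """Return (number of customers, number of domains) across all pledges."""
    return len(pledges), sum(pledges.values())


submit_pledge("ada@example.com")
submit_pledge("bob@example.com", 3)
submit_pledge("Ada@Example.com", 12)  # resubmission: updates, adds no customer
print(totals())  # (2, 15)
```

Keying on the email address is what keeps the customer count honest: a second submission overwrites the first instead of adding a row.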
Although it was difficult to be certain if someone padded their input at all, I paid close attention to the pledges and for the most part–and frankly, to my surprise–everyone was quite honest. Unfortunately, by the close of the night, I did have one fairly dedicated troll. Thankfully, this troll was extremely consistent in using the same large input number, so deleting became easy (albeit irritating since it was getting late).
By the next morning, after waking up to delete the troll’s overnight submissions, I had an idea.
Troll Back
The only (fun) way to fight trolls is to troll back, and it’s important to be subtle lest they find a workaround. So, I modified the pledge code so that if the domain count was over what I considered to be a reasonable threshold I would set a cookie in their browser representing the number of domains they pledged. With support for multiple pledges (in case they tried many email addresses), they would see the results of their trolling, but no one else would.
This method was amazingly effective as I noticed no trolls slipping through. My only regret is that I didn’t separately track the submissions that were marked as a troll (they were never saved to the database). I would love to know just how many attempts I trapped.
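Here is a rough sketch of the idea, with loud caveats: the threshold value, the cookie name, and every function name are made up for illustration, and a plain dictionary stands in for the visitor's browser cookies. The actual Django code is not reproduced here.

```python
TROLL_THRESHOLD = 1000  # assumed cutoff; the real value is not given in the post

saved_pledges = {}  # email -> domain count (stand-in for the database)


def handle_pledge(email, domains, cookies):
    """Save honest pledges; park suspicious ones in the submitter's cookies only."""
    if domains > TROLL_THRESHOLD:
        # Never saved: remembered only in this visitor's own cookie jar.
        shadow = dict(cookies.get("shadow_pledges", {}))
        shadow[email] = domains  # supports several emails from one browser
        cookies["shadow_pledges"] = shadow
    else:
        saved_pledges[email] = domains


def displayed_totals(cookies):
    """Totals as one visitor sees them: real pledges plus their own shadow ones."""
    merged = {**saved_pledges, **cookies.get("shadow_pledges", {})}
    return len(merged), sum(merged.values())


honest_browser, troll_browser = {}, {}
handle_pledge("ada@example.com", 4, honest_browser)
handle_pledge("troll@example.com", 10_000_000_000, troll_browser)
print(displayed_totals(honest_browser))  # (1, 4)
print(displayed_totals(troll_browser))   # (2, 10000000004)
```

The troll sees an inflated counter and assumes success, while every other visitor sees only the saved pledges.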
Spoils
Although I didn’t keep track of all trapped trolls, I did find one little gem on Twitter: a tweet (picture too!) by Carter Cole about his successful trolling of the afternoon wherein he pledged to transfer 10 billion domains.

The setup was successful, though. Carter seemed to be proud of his accomplishment, but it never made it to anyone else’s screen and therefore didn’t hurt the integrity of NoDaddy’s message.
Conclusion
Overall, I am extremely proud of the pledge site. I’m thankful to Ben and Drew for the idea because it helped unite those against GoDaddy and show a common, strong force. I am also impressed by and thankful to the community for their mature response: I only noticed a handful of abuses out of 658 total pledges, and they were dealt with swiftly to minimize impact.
Thank you everyone for helping make NoDaddy successful and working with me in battling back SOPA and any company that supports such legislation.
TechCrunch: A Better Image Gallery
TechCrunch somewhat recently redesigned their website. While I actually enjoy the new design, I dislike the fact that you can’t easily go through images in an image gallery.
So, I’m scratching my own itch here and releasing a user script to fix that. It makes it so that you can click on an image and see it full-size above, plus it provides handy previous/next buttons for going through the entire gallery. All without popping back and forth from page to page.
The code is commented and on GitHub, so you can easily see what’s going on behind the scenes. It’s a bit rough around the edges in some aspects, but I think it works well for this case.
Installation
To install this, all you need to do is grab the user script.
Google Chrome will support this by default. For other browsers, you will need a user script extension:
Note that I’ve only tried this on Google Chrome and Firefox. Your mileage may vary. If you find issues, please let me know and I’ll update accordingly.
Usage
To use, once it’s installed, simply visit a TechCrunch post and click on an
image in the gallery. A preview will show up above the gallery and you can
cycle through the images using the << Prev and Next >> links.
You can test it on this Ice Cream Sandwich post.
Contributing
Like I mentioned, it’s open source and on GitHub, so if you see something you
want added, either let me know (information in README on GitHub) or, better
yet, send a pull request my way.
Conclusion
I hope everyone enjoys! I know I’m looking forward now to seeing more images in TC posts.
[Notes] A Tale of Three Trees by @chacon
Here’s some more Strange Loop 2011 material–this time about a talk on git given by Scott Chacon (@chacon)!
In his talk, Scott focuses on demystifying git’s reset command through an explanation of git’s three trees: HEAD, index, and the working tree. Toward the end of the talk, he also includes some of Git’s plumbing goodies that can be useful in local scripts for automatic backup, cleaning past commits, and so forth.
Hope you enjoy the notes. As always, I typed quickly and so guarantee absolutely no accuracy. You might reference these notes with his slides (warning: PDF).
Trees
- HEAD
  - Indirect pointer to the last commit
  - Points to a tree
- Index
- Working Directory
- Tree Roles
  - HEAD: last commit, next parent
  - Index: proposed next commit
    - Could `git add` everything, `rm` all files, and `git commit` would still work just fine
  - Work Dir: sandbox
- `git status` tells you the diff of the three trees
git reset
- A tool to manipulate these three trees
- Path form: `git reset [file]`
  - Opposite of `git add [file]`
  - Takes entry from `HEAD` and makes index look like that
  - Lets you manipulate your index without touching your working directory.
- Commit form: `git reset [commit]`
  - Does three things in order. The option determines where in this process it stops.
  - `--soft`: move `HEAD` to target
    - Moving the branch to somewhere else. Does not change index or working directory; just changes where the branch points.
    - `git reset --soft HEAD~`
    - `HEAD~`: parent of `HEAD`
    - Undoes results of last commit.
  - `[--mixed]`: then copy to index
    - Takes what `HEAD` is pointing at and makes your index look like that.
  - `--hard`: then copy to work dir
    - Touches your working directory
- Can use this to squash the last two commits into one
  - `git reset --soft HEAD~2; git commit`
  - Moves `HEAD` back two commits, keeps index
- Awesome for staging your work in progress and then making one nice, beautiful commit
git checkout
- `git checkout [commit] [path]`
- `git checkout [commit]`
  - If you’re in a dev branch and have made some commits:
    - `git reset master` will move your index to where the branch started
    - `git checkout master` will move `HEAD` to point to master
- reset vs. checkout
- Patchy work
  - `git add --patch [file]`
  - `git reset --patch (commit) [file]`
  - `git checkout --patch (commit) [file]`
  - Revert parts of a file for commit
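To make the three steps of the commit form concrete, here is a toy model of the three trees in Python. It is a deliberately simplified sketch and not how Git is implemented: the `ToyRepo` class and its methods are invented for illustration, commits are plain dictionaries, and branches, untracked files, and safety checks are ignored.

```python
class ToyRepo:
    """Toy model of the three trees; each commit is a {path: content} snapshot."""

    def __init__(self):
        self.commits = []   # each entry: (parent_id, snapshot)
        self.head = None    # id of the commit HEAD points at
        self.index = {}     # proposed next commit
        self.workdir = {}   # sandbox

    def add(self, path):
        """Copy one file from the working directory into the index."""
        self.index[path] = self.workdir[path]

    def commit(self):
        """Snapshot the index as a new commit whose parent is the old HEAD."""
        self.commits.append((self.head, dict(self.index)))
        self.head = len(self.commits) - 1
        return self.head

    def parent(self, commit_id):
        return self.commits[commit_id][0]

    def reset(self, target, mode="mixed"):
        """Run the three steps in order, stopping where the mode says."""
        self.head = target                               # --soft stops here
        if mode in ("mixed", "hard"):
            self.index = dict(self.commits[target][1])   # --mixed stops here
        if mode == "hard":
            self.workdir = dict(self.index)              # --hard touches workdir


repo = ToyRepo()
for content in ("v1", "v2", "v3"):
    repo.workdir["notes.txt"] = content
    repo.add("notes.txt")
    repo.commit()

# Squash the last two commits: git reset --soft HEAD~2; git commit
repo.reset(repo.parent(repo.parent(repo.head)), mode="soft")
squashed = repo.commit()
print(repo.commits[squashed])  # (0, {'notes.txt': 'v3'})
```

Because the soft reset leaves the index alone, the new commit carries the latest content but hangs off the older parent, which is the squash described in the notes.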
Tidbits
- `git add -p`
  - Uses the index tree as a staging area for partial addition
- `git commit --amend` == `git reset --soft HEAD~; git commit ...`
- `git log --stat (branch)`
The Plumbing Commands
- A summary of the below commands
- `rev-parse`
  - Take any string you give it and tell you its SHA. E.g., `git rev-parse origin/master`
  - `git rev-parse master~163^2~3^2`: walks backwards, does some cool stuff. Figures out its SHA.
  - Can do ranges: `git rev-parse master~163^2~3^2..origin/master`
- `hash-object`
  - Use git as a raw key-value store
  - `git hash-object -w ~/.ssh/id_rsa.pub`
  - `echo 'my awesome value' | git hash-object -w --stdin`
  - Will return SHA and write into database
- `ls-files`
  - `git ls-files -s`: shows you your staging area
- `read-tree`
  - Reads a tree value into your index at a raw value
  - `git ls-files -s # show index`
  - `git ls-files -r HEAD # show index`
  - `git read-tree HEAD~2 # basically same as git reset`
- `write-tree`
  - Takes whatever your index looks like and writes it out as a tree object.
  - `git write-tree`: tells you the tree that a commit would make if you committed it.
  - Does not commit.
- `commit-tree`
  - `echo 'my commit message' | git commit-tree`
  - Commits without `git commit`
- `update-ref`
  - Part of `git branch` mechanism
  - Updates your reflog as well
  - `git update-ref refs/heads/newbranch`
- `symbolic-ref`
  - Update HEAD itself
  - `git symbolic-ref HEAD refs/heads/newbranch`
- Example usage
  - Could make an auto-backup system for your working directory
  - Publish documentation to another branch
- These are all under the rules that you shouldn’t mess with history if you’ve already pushed that to people.
  - So, use this stuff locally.
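The "raw key-value store" idea behind `hash-object` can be imitated in a few lines of Python. This sketch assumes a repository using Git's SHA-1 object format, where a blob's id is the SHA-1 of a short header plus the content; the `BlobStore` class is hypothetical and keeps objects in memory rather than writing them to `.git/objects`.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 id Git gives a blob with this content (SHA-1 repos)."""
    header = b"blob %d\x00" % len(data)  # object type, size, NUL separator
    return hashlib.sha1(header + data).hexdigest()


class BlobStore:
    """Hypothetical in-memory stand-in for Git's object database."""

    def __init__(self):
        self.objects = {}

    def write(self, data: bytes) -> str:
        """Store the content under its blob id and return that id as the key."""
        key = git_blob_id(data)
        self.objects[key] = data
        return key

    def read(self, key: str) -> bytes:
        return self.objects[key]


store = BlobStore()
# echo appends a trailing newline, so the stored value includes it.
key = store.write(b"my awesome value\n")
print(key, store.read(key))
```

The key is derived entirely from the content, so writing the same value twice yields the same key and stores nothing new.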
[Notes] Machine Learning: A Workshop by @hmason
I had the great fortune to attend Hilary Mason’s (@hmason) workshop this past Sunday at Strange Loop 2011. I was, in fact, so excited by the opportunity to attend this workshop that I actually got up early Saturday morning and prepared to leave for the hotel when I realized that I still had 24 hours to go–at least I didn’t make it all the way to the hotel before realizing.
I do want to give a big shout out to Hilary for giving me permission to post these notes. Note: I have redacted a few pieces of information that are irrelevant to anyone who was not at the workshop.
Our Goals
- Clustering — finding groups of related things (unsupervised automation)
- Named entity disambiguation — “Big Apple” ~ Translation
- Classification — what language is something in?
- Recommendations — Amazon, Netflix
Special Kinds of Data
- (Not getting into today)
- Geographic information (where do people eat bagels?)
- Time-series analysis
- Mathematically the same, applicably different
Black Box Model
- Google Prediction API — free; runs model selection on data to automatically determine best algorithm
Model for Thinking Through Data Problems
- Obtain
- Scrub
- Explore
- Model (and verify; error bars to know if you’re right)
- iNterpret
Supervised Learning and Classification
- Picture of cat and dog — which is cat, which is dog?
- Assignment of a label to a previously unlabeled piece of data
- Not really in opposition to unsupervised learning; can use together
- Examples
  - Spam filters (see public domain Enron email db)
  - Language identification
  - Face detection
- Places to get data
  - Data Source Handbook
  - Hilary’s Bundle of Links
  - NY Times data: human checked, so clean and accurate
    - View source on article: tons of accurate data in the page’s tags
    - Also in their API
Classifying Text
- NY Times: [REDACTED] (see arts, sports)
- Process
  - Parse labeled data
  - Define features of data
  - Calculate likely features for each label
  - For new, unlabeled data, predict
- Trying to create featureset such that one is getting value from data, throwing away the useless
- Naive Bayes
  - `P(A)` is the probability that A is true
  - `P(True) = 1`
  - `P(False) = 0`
  - `0 <= P(A) <= 1`
  - `P(A or B) = P(A) + P(B) - P(A and B)`
  - `p(B|A) = [p(A|B)p(B)]/p(A)`
  - If we know the probability of B, and that of A, and that of A given B, we can figure out B given A
  - Non-Bayesian models too (frequentists)
  - Example
    - Population of 10,000
    - 1% have a rare disease
    - Test that is 99% effective
      - 99% of sick patients test positive
      - 99% of healthy patients test negative
    - 50% probability that a patient who tests positive is actually sick (99 sick patients test positive, 99 healthy patients test positive)
    - `p(sick|test_pos) = [p(test_pos|sick)p(sick)]/p(test_pos) = 99/198 = 1/2`
    - Features
      - what test result is
      - we know how many people are sick
  - Naive because we assume each feature is independent
  - Python has a good one built in, using Hilary’s [REDACTED]
  - Put computation cost up-front to train, pickle (serialize) onto disk
- How to know when probability is
  - Significant
    - Use threshold (e.g., unless 0.8 or higher, not in category)
  - Wrong entirely
- Feature Analysis: want to reduce featureset as much as possible without losing value in data
  - Punctuation
  - Case
  - Length of words
  - Stopwords - e.g., “the”, “and”, “a” (i.e., drop any word 3 chars or less)
  - Stemming: backing word up to linguistic stems
    - e.g., `program = programming = programmer => program`
  - Porter Stemming Algorithm
    - M.F. Porter, 1980; English only
    - Best known
  - More data! WordNet
    - There’s an open source, image-based version of WordNet
    - Use to bootstrap for working with smaller featuresets
    - E.g., Twitter text is small. WordNet gives synonyms
  - No clean text?
    - `lynx -dump`: does a great job of getting text
- Different types of classification algorithms and data
  - K-Nearest Neighbors
    - Works poorly on text data, pretty well on images
    - Small `k`: complex boundary
    - Large `k` results in coarse averaging
    - Can’t have `k` > data set; also CPU bound
  - K-Means and K-Nearest Neighbors are different, but close
    - K-Means is the unsupervised analog
    - K-Nearest Neighbors assumes you have labels
    - i.e., one is without labels, one is with
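The rare-disease example from the Naive Bayes notes above is easy to check by hand or in code. The following is just that arithmetic written out with exact fractions; the variable names are mine, and the numbers are the ones from the notes.

```python
from fractions import Fraction

population = 10_000
p_sick = Fraction(1, 100)                # 1% have the disease
p_pos_given_sick = Fraction(99, 100)     # 99% of sick patients test positive
p_neg_given_healthy = Fraction(99, 100)  # 99% of healthy patients test negative

sick = population * p_sick                       # 100 people
healthy = population - sick                      # 9,900 people
true_pos = sick * p_pos_given_sick               # sick and testing positive
false_pos = healthy * (1 - p_neg_given_healthy)  # healthy but testing positive

# Bayes' rule: p(sick|pos) = p(pos|sick) * p(sick) / p(pos)
p_pos = (true_pos + false_pos) / population
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos

print(true_pos, false_pos, p_sick_given_pos)  # 99 99 1/2
```

Even with a 99% effective test, the healthy group is so much larger that its false positives match the true positives one for one.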
Confusion Matrix: how do you know when you’ve won
Diagram of actual and predicted values

    .        kitten  puppy  penguin   [Predicted Values]
    kitten     5       2       0
    puppy      2       4       0
    penguin    0       0       6      [Actual Values]

- Can assume penguins have a distinct set of features: little confusion
- Use scipy for these kinds of analysis
- Goal of example is to take images and recognize them
  - Goal: OCR (digits)
  - See [REDACTED]
  - Data is in [REDACTED] subdirectory
  - `0` is most often confused with `5`
  - White means there were lots of matches (want to see this on (0,0), (1,1), etc.)
- Sample size
  - Depends on how much discrimination in one’s data set
  - Consider on order of magnitudes: 10,000s of samples if needed to do a pretty good job
- Interpretability is costly (meaning, why did you recommend this?)
  - news.me does two models
    - run a fast one that is not interpretable
    - have a second one that takes longer to generate and can be used later to trace back why a recommendation was made (but with different results, so not the same results)
- Boosting
  - Combine weak-learners to create a strong-learner
    - Compute a weighted sum over the weak-learners
    - Very slow but very interpretable
    - Kitchen sink approach: throw every algorithm, featureset you
    - Canonical boosting algorithm: AdaBoost
    - Google Prediction API: Boosting (but they call it “model selection”)
- Examples from bit.ly
  - Spam and Malware Identification
    - Every URL shortened is queued and analyzed for malicious characteristics
    - Malicious Features
      - Domain (bloom filter)
      - TLD (.co.cc and .info are high-risk)
      - Randomness of URL (hack: gzip URL and see how long it is)
      - Words in HTML title
      - Number of redirections
    - `is_spam` is dependent on threshold on bit.ly API
  - Topic Identification (e.g., what percent is about the weather?)
    - Find the relationship between these topics
    - Can now put in URL and get very fast output of category predictions
    - Wikipedia is the dirty secret for seeding most things
- D3 is awesome visualization library
  - d3py: Python objects to JavaScript for D3
- Sometimes metadata is better than real data
  - What spoken languages are in a page?
    - bit.ly sees the user who clicks on the link, knows their browser locales
    - Use this to know what locales clicking a link are most present
    - Use entropy calculation
- References
  - See slides
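The "gzip the URL and see how long it is" hack from the bit.ly examples above can be tried in a few lines. This is my own guess at what such a check might look like, not bit.ly's code: the ratio function, the toy URLs, and the fixed random seed are all assumptions for the demonstration.

```python
import gzip
import random
import string


def gzip_ratio(text: str) -> float:
    """Compressed size divided by raw size; higher means harder to compress."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)


rng = random.Random(0)  # fixed seed so the demo is repeatable
alphabet = string.ascii_letters + string.digits

# Two toy URLs of equal length: one repetitive, one random-looking.
readable = "http://example.com/" + "news/" * 16
random_looking = "http://example.com/" + "".join(rng.choices(alphabet, k=80))

print(gzip_ratio(readable), gzip_ratio(random_looking))
```

Repetitive text compresses well and random-looking text barely compresses, so the ratio works as a cheap randomness score. On strings this short the gzip header dominates, so any real threshold would need tuning against actual data.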
Unsupervised Learning and Clustering
- Clustering
  - Find similar things when you know nothing about what you’re looking at
  - Parametric vs. Non-Parametric
- Agglomerative Clustering
  - Find each point, find closest point
  - Merge together until one has a cluster you want
  - Manually set thresholds
- K-Means: Canonical Clustering Algorithm
  - Decide on number of clusters
  - Randomly place centroids
  - Iterate until clusters are formed
  - Iterate until convergence
  - Keep iterating: eventually you’ll get something more accurate and useful
- Distance
  - Manhattan distance (city-block distance or taxi cab geometry)
    - Depending on features, it can be better than
  - Jaccard Index
    - Good for text data (or data where there is not a good mapping to Euclidean space)
    - `0` when sets are disjoint
    - `1` when sets are identical
  - Consider absolute value (trying to measure similarity)
  - If it’s intuitive (e.g., 2- or 3-D space), perhaps use Euclidean
  - Unsure, usually best to ask someone who knows.
- Technique
  - Unsupervised approach at first to try to begin getting clusters
  - Follow up with a supervised approach
- Variation: k-medoids (see slide)
- Hierarchical clustering
  - Agglomerative
    - Combine closest items into a new item and repeat until there is just one item
  - “A Grand Tour of the Data”: flash graphs from running different algorithms in front of you; manual intervention (get some popcorn)
  - Dendrogram (“clustered URLs”): typical graph used to represent hierarchical clustering
    - Links together things that are close together (have lines between them)
    - Height of line means at what round of algorithm they were agglomerated
- Build an understanding of what’s normal based on clustering over time; this then lets you know when something isn’t normal.
  - E.g., drastic gains or drops from norm indicate something important
- Recommendation Systems
  - Special case of an unsupervised clustering problem
    - Define a starting node and want to find similar nodes
    - hcluster
    - bit.ly: recommend up to the second
      - Online / Offline Learning Problem
        - Offline: Two parts to the calculation - one every `n` minutes, one query at a time (~10 minutes)
        - Online: Take the URL and query the cluster. Using Singular Value Decomposition (SVD)
          - Do SVD on only one cluster
          - It’s a hack; not necessarily accurate
    - ML on Streaming Data (depends on problem)
      - Can snapshot and just focus on data for a block of time
      - Other options exist (like not iterating over whole set)
        - Less accurate, but it’s an option
    - References
      - See slides
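The K-Means steps in the notes above (pick a number of clusters, place centroids randomly, iterate until convergence) fit in a short sketch. This is a minimal version under assumptions of my own: initial centroids are sampled from the data points, distance is squared Euclidean, and the function names are illustrative rather than from any library.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed: start from k random data points
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        updated = [mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(sorted(cluster) for cluster in clusters))
```

On well-separated data like this the two groups fall out quickly. On messier data the result depends on the starting centroids, which is why the advice to keep iterating (and to rerun with different starts) matters.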
Conclusion
- Don’t use R in production!
  - Once your data hits memory bound, you’re screwed.
  - No mind paid to programming aesthetics
- How do you know if you won?
  - Supervised
    - Save some of labeled data for a test. Also a good way to test if your data needs to be retrained.
  - Unsupervised
    - E.g., looking for realtime recommendations of links.
Going from Here
- Book: Programming Collective Intelligence by Toby Segaran (O’Reilly)
- Book: Pattern Recognition and Machine Learning by Christopher Bishop
- Book: Machine Learning by Tom Mitchell
  - Tom is considered founder of ML field
- Class: (Online) Stanford Machine Learning
- Class: (Online) Stanford AI
- Dataists
[Notes] Vim: From Essentials to Mastery by @wnodom
This morning at Strange Loop 2011, I had the opportunity to attend Bill Odom’s excellent vim talk. Bill co-runs the local vim-geeks group and I have attended his talks on vim in the past and every single time he has impressed me and managed to teach me more and more–he’s a true master at it.
Bill has already managed to put his 300 slides online. I should note that we didn’t go through all 300 slides (I don’t believe that’s even possible in an hour), but instead we jumped around based on the core concepts that Bill wanted us to learn and the things that the audience thought most interesting. So, I’m posting my notes here just to highlight a few things that I saw as notable.
Sorry for any typos, and if you see anything that’s incorrect, please let me know. I had to type quickly to try to keep up, so I can’t guarantee that it’s 100% correct.
Help
- `:help :help`
- `:helpgrep` (`:helpg`)
- `:help!`
- `:h holy-grail`
- `:h 42`
vim
- (insert)
  - `(C-O)` gets you out to do a single normal mode command
- Normal mode
  - Moving around
    - `H`: high
    - `M`: middle
    - `L`: low
    - `gj`: down screen lines
    - `gk`: up screen lines
  - Traveling without moving
    - `zz`: shifts line to middle
    - `zt`: shifts line to top
    - `zb`: shifts line to bottom
  - `*`: find the word you’re sitting on (search forward)
  - `#`: find the word you’re sitting on (search backward)
  - `g*`: same thing as `*`, but also matches the word inside longer words
  - `g#`: same thing as `#`, but also matches the word inside longer words
  - `@:`: rerun last command
Command line mode
- Ex commands
- Search commands
- Filter commands (`!`)
  - Pump your text to an external program and then push it back into your file
Plugins
- File management
- Taglist
- BufExplorer
Registers
- 26 named registers (`"a` through `"z`)
  - Uppercase the name to append
- Numbered registers (`"1` through `"9`)
- `:reg` shows registers
  - Can be more specific: `:reg adg`
- `"%`: current filename
- `"#`: alternate filename
- `"-`: last “small” delete
- `"/`: last search
- `":`: last Ex command
- `"*`: system selection (X11)
- `"+`: system clipboard
- `"_`: black hole: delete without storing in a register
- `(C-R)` register: accessing registers
- `:let @a = ""`: assign register
- macros and registers say
- `(C-R)=`: calculator
- `qaq`: record an empty macro
Resources
- Note: It’s highly recommended to not just copy these, but rather to go through them and pick out the gems that interest one most.
- Bill Odom
- Steven Pritchard
- Damian Conway
- Steve Losh
- Tim Pope
[Notes] Storm: Twitter’s scalable realtime computation system by @nathanmarz
At Strange Loop 2011 this morning, Nathan Marz (@nathanmarz) made a wonderful announcement: the open-sourcing of Storm, the realtime computation system that he developed at BackType (acquired by Twitter).
Here, I’m including the notes that I typed up during his presentation. Apologies in advance for any typos or errors (I removed anything that I was especially unsure of, just to be safe)–I had to type quickly to keep up.
Most importantly, check out the four repositories he open-sourced in the middle of the talk:
At last, here are the notes:
History: Before Storm
- Queues and Workers
- Example
- Firehose ~ Queues ~ Workers (Hadoop) ~ Queues ~ Workers (Cassandra)
- Message Locality
- Any URL update must go through the same worker
- Why?
- No transactions in Cassandra (+ no atomic increments at the time)
- More effective batching of updates
- Implementing
- Have a queue for each consuming worker
- Choose queue for URL using consistent hashing
- Take hash of URL mod Queue = Index of Queue
- Same URL goes to same queue
- Evenly distribute URLs to queues
- Problems
- Scaling: Adding a Worker
- Deploy a new worker and new queue for that worker
- Must redeploy other workers: use consistent hashing, must let them know that there’s a new queue
- Poor fault-tolerance
- Coding is tedious
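The queue-selection rule in the notes ("hash of URL mod Queue = Index of Queue") can be sketched in a few lines. This is only an illustration: the notes call it consistent hashing, but the formula as written is plain modulo hashing, and that is what the sketch follows. The function name `queue_index`, the use of MD5, and the example URLs are assumptions, not anything from the talk.

```python
import hashlib


def queue_index(url: str, num_queues: int) -> int:
    """Pick a queue for a URL: stable hash of the URL, modulo the queue count."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues


# Hypothetical URLs, only for illustration.
urls = [f"http://example.com/page/{i}" for i in range(1000)]

# Four workers, so four queues: the same URL always lands on the same queue.
before = {url: queue_index(url, 4) for url in urls}

# Add a fifth worker and queue: every producer must now use num_queues=5.
after = {url: queue_index(url, 5) for url in urls}

moved = sum(1 for url in urls if before[url] != after[url])
print(f"{moved} of {len(urls)} URLs now map to a different queue")
```

Because the queue count is baked into the formula, every producer has to learn about the new queue at the same time, which matches the redeployment problem listed above.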
Storm
- What we want
- Guaranteed data processing
- Easily horizontally scalable
- Fault-tolerance
- No intermediate message brokers
- Conflict with desire for guaranteed data processing
- If worker fails, can always ask for it from the data broker again.
- Problem: complex and slow. Messages have to go through third party and persist to disk.
- Higher level abstraction than message passing
- Just works
- Use Cases
- Stream Processing
- Distributed RPC
- Parallelize an intense function; invoke on the fly and compute quickly
- Continuous computation
- Storm Cluster
- Three Classes of Nodes
- Nimbus: Master Node (similar to Hadoop JobTracker)
- Submit topologies and code for execution
- Launches workers
- Monitors computations
- Restart things that fail
- ZooKeeper: cluster coordination
- Worker nodes: actually run computations
- Nodes pass messages to each other
- Run daemon called Supervisor, communicates with Nimbus through ZooKeeper
- Concepts
- Streams
- Unbounded sequence of tuples
- All tuples must have same schema (same number and same types)
- Supports primitive types (serialized and deserialized)
- Also support for custom types
- Spouts
- Source of streams
- Examples
- Kestrel spout: read from Kestrel queue
- Read from twitter stream
- Bolts
- Processes input streams
- Can run
- Functions
- Filters
- Aggregations
- Joins
- Talk to databases
- Topologies
- Network of spouts and bolts
- Each bolt subscribes to any number of output streams
- Tasks
- Spouts and bolts execute as many tasks across the cluster
- Lots of tasks across many machines, all passing messages to one another
- Stream grouping
- When a tuple is emitted, to which task does it go?
- Describes how to partition that stream
- Shuffle grouping: picks a random task
- Fields grouping: consistent hashing on a subset of the tuple fields
- Similar to queues and workers, but higher level of abstraction
- All grouping: send to all tasks
- Use with care
- Global grouping: pick task with lowest id
- There are more, but not going into them here
- Streaming word count
- `TopologyBuilder` is used to construct topologies in Java
- See slides for example implementation
- Split sentences into words with parallelism of 8 tasks
- Create a word count stream
- Can easily run some other script, such as Python to evaluate
- Can run topology in local mode for development and testing
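To make the grouping ideas concrete, here is a small single-process simulation of the streaming word count. It is a sketch, not Storm code: it does not use `TopologyBuilder` or any Storm API, and the function names, the CRC32 hash, and the choice of four counting tasks are assumptions. Only the parallelism of 8 for the splitting step comes from the notes.

```python
import random
import zlib
from collections import Counter


def shuffle_grouping(num_tasks: int, rng: random.Random) -> int:
    """Shuffle grouping: pick a random task."""
    return rng.randrange(num_tasks)


def fields_grouping(value: str, num_tasks: int) -> int:
    """Fields grouping: hash the field so equal values reach the same task."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(value.encode("utf-8")) % num_tasks


def run_word_count(sentences, split_tasks=8, count_tasks=4, seed=0):
    rng = random.Random(seed)

    # Spout -> split bolt: any task can split any sentence, so shuffle.
    split_inbox = [[] for _ in range(split_tasks)]
    for sentence in sentences:
        split_inbox[shuffle_grouping(split_tasks, rng)].append(sentence)

    # Split bolt -> count bolt: group on the word so each word has one owner.
    counters = [Counter() for _ in range(count_tasks)]
    for inbox in split_inbox:
        for sentence in inbox:
            for word in sentence.split():
                counters[fields_grouping(word, count_tasks)][word] += 1
    return counters


sentences = ["the cow jumped over the moon", "the man went to the store"]
counters = run_word_count(sentences)
totals = sum(counters, Counter())
print(totals.most_common(3))
```

The fields grouping is what keeps the counts correct: every occurrence of a word goes to the same counting task, so no word's count is split across tasks.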
Traditional data processing
- Traditional Method (pre-Storm)
- All of your data ~ precompute indexes to run query quickly
- Precompute happens with intense processing: Hadoop, databases, etc.
- Example: how many tweets on a URL between 7am on Sun. and 10pm on Mon.
- Indexed by hour; sum over those few hours when querying
- Storm: intense processing on both sides. Distributed RPC flow on Storm.
- Distributed RPC Server (easy to implement, Storm comes with one)
- Coordinates distributed RPC dataflow
- Gives data to spout
- Topology parallelizes computation, gives to bolt
- Bolt gives to distributed RPC
- Client gets result
- Example
- Compute reach of URL
- Get URL, compute all tweeters. Find their followers.
- Get set of distinct followers.
- Count ~ Reach
- Extremely intense computation: can be millions of people
- Storm
- Spout emits (requestid, tweeterid)
- `GetTweeters` goes to `GetFollowers`; emits `(requestid, followerid)`
- `PartialDistinct`
- `CountAggregator` does global grouping, receives one tuple from each, and sums
- All done completely in parallel
- What might take hours now takes two seconds
- Going down to 200ms. See “State spout” below
- Guaranteeing message processing
- Uses ZeroMQ
- “Tuple Tree”
- A spout tuple is fully processed when all tuples in the tree have been completed
- If a tuple tree is not completed within a specified timeout, it is considered failed and replayed from the spout
- Reliability API: must do a little bit of work
- Emit a word: anchor
- Anchoring creates a new edge in the tuple tree
- Collector acks the tuple; marks the single node as complete
- Storm does the rest
- timeouts when necessary
- tracking what’s processed
- seeing when it’s complete
- Storm tracks tuple trees for you in an extremely efficient way
- See the wiki on GitHub for explanation of this algorithm
- Storm UI: see slides
- Storm on EC2: it’s super easy. Use storm-deploy
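The tuple-tree bookkeeping described under "Guaranteeing message processing" can be illustrated with a toy tracker. This is a deliberately simple stand-in that keeps a set of pending tuple ids per spout tuple; it is not the efficient algorithm Storm uses (the notes point to the GitHub wiki for that). The class name, method names, and ids are all hypothetical.

```python
import time


class TupleTreeTracker:
    """Toy tracker: one set of un-acked tuple ids per spout tuple."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.pending = {}  # root id -> ids in the tree not yet acked
        self.started = {}  # root id -> time the spout tuple was emitted

    def emit_root(self, root_id, now=None):
        """The spout emits a tuple: start a new tree containing just the root."""
        self.pending[root_id] = {root_id}
        self.started[root_id] = time.monotonic() if now is None else now

    def anchor(self, root_id, child_id):
        """Anchoring adds a new node (and edge) to the root's tuple tree."""
        self.pending[root_id].add(child_id)

    def ack(self, root_id, tuple_id) -> bool:
        """Mark one node complete; return True once the whole tree is done."""
        tree = self.pending[root_id]
        tree.discard(tuple_id)
        if tree:
            return False
        del self.pending[root_id]
        del self.started[root_id]
        return True

    def expired(self, now=None):
        """Roots whose trees did not finish in time and should be replayed."""
        now = time.monotonic() if now is None else now
        return [root for root, start in self.started.items()
                if now - start > self.timeout_seconds]


tracker = TupleTreeTracker(timeout_seconds=30.0)
tracker.emit_root("sentence-1", now=0.0)

# A bolt emits two word tuples anchored to the sentence, then acks the sentence.
tracker.anchor("sentence-1", "word-1")
tracker.anchor("sentence-1", "word-2")
done_after_sentence = tracker.ack("sentence-1", "sentence-1")
done_after_word_1 = tracker.ack("sentence-1", "word-1")
done_after_word_2 = tracker.ack("sentence-1", "word-2")

# A second spout tuple that is never acked shows up as expired after the timeout.
tracker.emit_root("sentence-2", now=0.0)
to_replay = tracker.expired(now=31.0)
```

In this sketch the order matters: children are anchored before the parent is acked, otherwise the pending set would empty out and the tree would look finished too early.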
The Future
- State spout (almost done)
- Synchronize a large amount of frequently changing state into a topology
- Example 1
- Optimize reach topology by eliminating the database calls.
- Each GetFollowers task keeps a synchronous cache of a subset of the social graph
- Works because GetFollowers repartitions the social graph the same way it partitions GetTweeters’ stream
- Storm on Mesos
- Mesos is cluster/resource framework
- Allow more fine-grained resource usage
- “Swapping”
- If you currently want to update a Storm topology, must kill it and submit a new one. Takes a few minutes.
- This is bad for a realtime system!
- Lets you safely swap one topology for a new one.
- Atomic swaps.
- Minimize downtime
- Prevent message duplication
- Auto-scaling
- Storm can automatically scale topology to data
- No work on your end; increase as message throughput increases
- Also handles bursts of traffic. Temporary provisioning of more resources, then scale itself back down.
- Higher level abstractions
- Work can be done still to improve this
- DSLs in a variety of languages, etc.
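As a rough illustration of the reach computation and of the state spout idea from "Example 1" above, the sketch below gives each GetFollowers task an in-memory cache holding only its share of the social graph, partitioned the same way tweeter ids are routed. Everything here is an assumption made for the example: the data, the CRC32 hash, the task count, and the function names. It runs in one process and does not use Storm.

```python
import zlib

# Hypothetical data standing in for the tweet and follower databases.
TWEETERS_BY_URL = {"http://example.com/a": ["alice", "bob"]}
SOCIAL_GRAPH = {
    "alice": ["carol", "dave", "erin"],
    "bob": ["dave", "erin", "frank"],
}


def task_for(key: str, num_tasks: int) -> int:
    """Stable hash partitioning, used for both the caches and the streams."""
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def build_follower_caches(social_graph, num_tasks):
    """Give each GetFollowers task only the tweeters that hash to it."""
    caches = [{} for _ in range(num_tasks)]
    for tweeter, followers in social_graph.items():
        caches[task_for(tweeter, num_tasks)][tweeter] = followers
    return caches


def compute_reach(url: str, num_tasks: int = 4) -> int:
    caches = build_follower_caches(SOCIAL_GRAPH, num_tasks)

    # One set per partial-distinct task; followers are grouped by follower id.
    partial_sets = [set() for _ in range(num_tasks)]
    for tweeter in TWEETERS_BY_URL.get(url, []):
        # The tweeter id is routed to the task whose cache holds its followers.
        owner = task_for(tweeter, num_tasks)
        for follower in caches[owner].get(tweeter, []):
            partial_sets[task_for(follower, num_tasks)].add(follower)

    # Global step: one partial count from each task, summed.
    return sum(len(followers) for followers in partial_sets)


reach = compute_reach("http://example.com/a")
print(reach)
```

Summing the partial counts is safe because each follower id hashes to exactly one task, so no follower is counted twice.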
Be A Better Entrepreneur: Be A Little Selfish
Some Background
Several weeks ago I listened to a commencement speech by Chris Sacca (@sacca) that he gave to the Carlson School of Management. Many of the things that he said in this have stuck, but I don’t want to talk about all of the awesomeness contained within the speech (you should watch it for that). I instead want to pick out one thing in particular: playing offense.
As an entrepreneur, it is easy to find something to keep you busy no matter the time of the day. In fact, you usually don’t have to try that hard if you’re running a business–stuff to be done practically adds itself to your schedule (and if it doesn’t, others will–thanks, email!). I know this feeling of being busy all too well–I have always felt in my heart that I was an entrepreneur, and so I began pursuing that in 7th grade when I did computer consultation for my neighbors as well as a local insurance company. At the same time, I am someone who is very proud of his grades and dedicated to his formal education, so I was a middle schooler with the task of balancing business with school while still being able to excel in both.
For the point I am making to make sense, I need to mention an aspect of my childhood: I was a pretty chubby kid. I never lied to myself about this–I admitted it to myself, and there were some who picked on me just in case I didn’t manage the self-realization. A couple of times over the years, I started to exercise a little bit, and then decided that I was far too busy for that–there was always more school and business work to fill my schedule, so I wanted to take care of that first. In my mind, I could always come back and revisit my weight later, but making the most of my education and running the business were things that needed to be done now.
Be A Little Selfish
Unfortunately, though I did not realize it then, even by focusing on the business I had co-founded and on my own education, I wasn’t really thinking about myself. If I had been, I would have known that my health was much more important. I was not being selfish.
Having reflected on Chris’ speech, and knowing myself that getting into shape is something I need to do, I have recently started exercising again, but this time with a vigorous commitment. I bike every day, without fail, no matter what else my schedule has in store, and I have to say: I love it!
“What’s changed?” you might be thinking. “Did you suddenly get un-busy?” The answer to that is a resounding no: I’m busier than ever before with the launch of QR Card Us and another service that we have yet to publicly announce. However, I have realized that in order to be a true success–and I mean more than just with the business, which I already consider a success as we have been able to help so many people, but rather to be successful in life as a whole–I need to be physically healthy now, not later. I need to play offense with my entire life, not just with the business.
The Benefits
This seems like an obvious one: I get to be physically fit. I haven’t been doing this long enough to really see those benefits yet (though I am confident I will!), but I have to mention an immediate change: I am so much more mentally agile, and it feels good.
Although the exercising takes a precious chunk of time out of my schedule, I am much more productive with the rest of my time–I think more clearly and I want to code even later into the night. I noticed this just last night: around 2am I was going to break from work to play a game, but I was suddenly having so much more fun than I ever have before with what I was doing that I kept procrastinating with my gaming, instead telling myself that I “just need to write one more function,” and then another, and another, and then it was 5am–a game-free night with tons of productivity to show for it.
(PSA: don’t think that you can add exercise or anything else to your schedule and offset it by staying up later–sleep is still highly important for everyone, especially entrepreneurs that desire to be successful, so try for a good amount. I just tend to schedule my work later into the evenings so I can stay up late and ‘sleep in’.)
Succinctly: although I take a measly hour or so out of my schedule to exercise, during the rest of the time I am so productive because of the mental benefits that I more than make up for this relatively small time commitment. If you really want to be harsh on yourself though, know that exercising is not “wasted time”–the time that I spend exercising is also time that I spend away from the keyboard thinking about the future, how to handle changing market conditions, and even getting ahead of myself by picturing just how awesome this new service is going to be once launched.
Why Wait?
I was originally going to wait until I was physically fit so that I could post a blog entry like this–maybe even show a before/after picture. I mean–why should you listen to someone who hasn’t been doing this for very long? I don’t have a really good answer for that, although I can attest to the success of this commitment in my own personal happiness.
So, while it might make more sense that I should wait until there are better proven results from my commitment, I have been thinking: why wait? I’m currently playing offense with my life, and I think a lot of other entrepreneurs should join me now in this–not later. It doesn’t have to be exercise, especially if you’re one with a fabulous metabolism with little effort (curse you!), but anything: take music lessons like you may have always wanted to (I recently started doing just that), practice a skill unrelated to your work, or just do anything that makes you feel like a more complete person. I promise that it will reflect positively both in your ability to run your business and in your personal happiness.
But, it’s hard!
I will be the first to admit that being selfish and focusing on yourself is a very tough thing for an entrepreneur to do, so I encourage you to have someone else there to keep you in line. Personally, I exercise with my best friend/business partner/mom, so if I’m not feeling like going, she pressures me into it anyway, and vice-versa. The person you use for support does not even need to be directly involved: I share information from my music lessons with my parents so that I know that if I fail to go to a lesson, I’m not just letting myself down, but I’m also letting them down.
So, I guess my real message here is that when you play offense, remember to be a little selfish and keep yourself in mind, because it will ultimately benefit the entire team.
If you have any personal stories of your own to share, or just want to chat, I’d love to hear from you. Feel free to comment here or tweet me @sch. I want to also thank Chris so much for having given that commencement speech, and Carlson for posting it on YouTube. It is one of the most inspirational things I have heard in years, and I suspect it will stay near the top of that list for the duration of my life.