
Mom, Friend, & Co-Founder

At a dinner recently, I was asked what it’s like to be co-founders with my mom. It was such a fantastic question that it’s been on my mind since then, so I wanted to put my thoughts into writing.

Paul Graham notes friendship as an important quality in founders, and I couldn’t agree more. I’m proud to say that my mom is one of my best friends, which is why we are able to work so well together.

Like Paul mentions, “startups do to the relationship between the founders what a dog does to a sock: if it can be pulled apart, it will be.” But even at the lowest times in startup life, the thought of damaging my friendship with my mom outweighs any downside. There are certainly times when we disagree, but we have always gone into any discussion with the understanding that there’s some separation between personal and work life. We’ll argue, pause to have a peaceful dinner together with my dad, and pick up where we left off. Remembering the importance of family and friends is key to maintaining sanity in a startup.

Looking at ourselves as not just mom and son, but also best friends, grants us the opportunity to have complete respect for one another (in traditional relationships I’ve seen, respect unfortunately flows only in one direction). I know the areas in which she’s more knowledgeable, where her life experience is most applicable, and she knows the same for me; so, we can ask each other anything, disagree with what the other has to say, and speak our thoughts freely without offending one another.

Relatedly, this open dialog leads to a deeper mutual understanding. When I need help or am uncomfortable with a situation, I don’t even need to say a thing - she almost always knows and is able to lend a hand.

Paul writes in another essay that “you need colleagues…to cheer you up when things go wrong.” Things do go wrong, but my mom is one of the reasons I’m navigating startup life in the first place. Certainly, we want to improve the world, but I would be lying if I said that improving my family’s life wasn’t also a goal. If I’m having a bad day, all I have to do is see my mom working alongside me and it instantly gives me the right perspective and cheers me up. I think the opposite is true, too.

Overall, it’s fantastic having the opportunity to work with my mom. Given the opportunity again, I would—without hesitation—want her to be my co-founder.


NoDaddy In Review (& How I Trolled Back)

Background

I’m sure anyone reading this is aware of SOPA and PIPA by now. At the end of December 2011, the Internet rose up in protest upon learning that GoDaddy, a popular but controversial registrar and webhost, supported SOPA.

Enter NoDaddy:

As Drew Olanoff of The Next Web [writes](http://thenextweb.com/insider/2011/12/23/nodaddy-lets-you-pledge-to-boycott-go-daddy-for-its-stance-on-sopa/), he “was talking to Ben Huh and Ben suggested in jest that someone come up with a counter to track everyone who is pledging to leave Go Daddy.” So, I opened a new desktop, my favorite editor, and got started.

The setup was easy because I already had a sandbox in place called Rawr, where I host things like my @episod Tracker, so all I had to do was add a new Django app. The goal was to make the page as simple as possible, so I gave a brief background, a pledge form, and a total count of customers that GoDaddy stood to lose by their support of SOPA.

To drive home that these were real people that GoDaddy was hurting, the page also showed a random selection (inclusion optional) of pledgers’ profile pictures from Gravatar.

Trolls?

Drew requested in his post that I add a domain count so that people could optionally include the number of domains that they pledged to transfer from GoDaddy. This was a great idea because, for some companies, numbers are necessary to call them to action, and this would certainly help convey economic incentive. However, I was worried that people would troll this by inputting large, false amounts, which would invalidate the overall message.

Still, the audience was limited enough and driven by a common cause, and it was indeed a good idea, so I decided to give it a shot and added the domain count. People who had already pledged could resubmit the form using the same email address, and it would update their domain count without inflating the total number of customers.
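
In Django terms, that dedup logic is just an upsert keyed on the email address. Here is a minimal sketch of the idea; the Pledge model and its field names are my assumptions, not the actual NoDaddy schema:

```python
from django.db import models

class Pledge(models.Model):
    # Hypothetical model: the real NoDaddy schema may have differed.
    email = models.EmailField(unique=True)
    domains = models.PositiveIntegerField(default=1)

def record_pledge(email, domains):
    # One row per email: resubmitting updates the domain count
    # without bumping the total customer count.
    pledge, created = Pledge.objects.update_or_create(
        email=email, defaults={"domains": domains})
    return created  # only count a new customer when a row was created
```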

Although it was difficult to be certain if someone padded their input at all, I paid close attention to the pledges and for the most part–and frankly, to my surprise–everyone was quite honest. Unfortunately, by the close of the night, I did have one fairly dedicated troll. Thankfully, this troll was extremely consistent in using the same large input number, so deleting became easy (albeit irritating since it was getting late).

By the next morning, after waking up and deleting the troll’s overnight submissions, I had an idea.

Troll Back

The only (fun) way to fight trolls is to troll back, and it’s important to be subtle lest they find a workaround. So, I modified the pledge code so that if the domain count was over what I considered to be a reasonable threshold I would set a cookie in their browser representing the number of domains they pledged. With support for multiple pledges (in case they tried many email addresses), they would see the results of their trolling, but no one else would.
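
The mechanics were roughly as follows. This is a sketch of the idea, not the original code; the view name, the threshold value, and the render_pledge_page helper are assumptions:

```python
TROLL_THRESHOLD = 500  # assumed cutoff: a single pledge above this is suspect

def pledge(request):
    domains = int(request.POST.get("domains", 1))
    # Add any domains already trapped in the cookie, so repeat attempts
    # (even under different email addresses) appear to accumulate.
    trapped = int(request.COOKIES.get("troll_domains", 0))
    response = render_pledge_page(request)  # hypothetical helper that renders the page
    if domains > TROLL_THRESHOLD:
        # Never saved to the database: only the troll sees this total.
        response.set_cookie("troll_domains", str(domains + trapped))
    else:
        record_pledge(request.POST["email"], domains)  # the upsert from earlier
    return response
```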

This method was amazingly effective as I noticed no trolls slipping through. My only regret is that I didn’t separately track the submissions that were marked as a troll (they were never saved to the database). I would love to know just how many attempts I trapped.

Spoils

Although I didn’t keep track of all trapped trolls, I did find one little gem on Twitter: a tweet (picture too!) by Carter Cole about his successful trolling of the afternoon wherein he pledged to transfer 10 billion domains.

The trap was a success, though: Carter seemed proud of his accomplishment, but his pledge never made it to anyone else’s screen and therefore didn’t hurt the integrity of NoDaddy’s message.

Conclusion

Overall, I am extremely proud of the pledge site. I’m thankful to Ben and Drew for the idea because it helped unite those against GoDaddy and show a common, strong force. I am also impressed by and thankful to the community for their mature response: I only noticed a handful of abuses out of 658 total pledges, and they were dealt with swiftly to minimize impact.

Thank you everyone for helping make NoDaddy successful and working with me in battling back SOPA and any company that supports such legislation.


TechCrunch: A Better Image Gallery

TechCrunch somewhat recently redesigned their website. While I actually enjoy the new design, I dislike the fact that you can’t easily go through images in an image gallery.

So, I’m scratching my own itch here and releasing a user script to fix that. It makes it so that you can click on an image and see it full-size above, plus it provides handy previous/next buttons for going through the entire gallery. All without popping back and forth from page to page.

The code is commented and on GitHub, so you can easily see what’s going on behind the scenes. It’s a bit rough around the edges in some aspects, but I think it works well for this case.

Installation

To install this, all you need to do is grab the user script.

Google Chrome supports user scripts by default. For other browsers, you will need a user script extension, such as Greasemonkey for Firefox.

Note that I’ve only tried this on Google Chrome and Firefox. Your mileage may vary. If you find issues, please let me know and I’ll update accordingly.

Usage

To use, once it’s installed, simply visit a TechCrunch post and click on an image in the gallery. A preview will show up above the gallery and you can cycle through the images using the << Prev and Next >> links.

You can test it on this Ice Cream Sandwich post.

Contributing

Like I mentioned, it’s open source and on GitHub, so if you see something you want added, either let me know (information in README on GitHub) or, better yet, send a pull request my way.

Conclusion

I hope everyone enjoys! I know I’m looking forward now to seeing more images in TC posts.


[Notes] A Tale of Three Trees by @chacon

Here’s some more Strange Loop 2011 material–this time about a talk on git given by Scott Chacon (@chacon)!

In his talk, Scott focuses on demystifying git’s reset command through an explanation of git’s three trees: HEAD, index, and the working tree. Toward the end of the talk, he also includes some of Git’s plumbing goodies that can be useful in local scripts for automatic backup, cleaning past commits, and so forth.

Hope you enjoy the notes. As always, I typed quickly and so guarantee absolutely no accuracy. You might reference these notes with his slides (warning: PDF).

Trees

  • HEAD
    • Indirect pointer to the last commit
    • Points to a tree
  • Index
  • Working Directory
  • Tree Roles
    • HEAD: last commit, next parent
    • Index: proposed next commit
      • Could git add everything, rm all files, and git commit would still work just fine
    • Work Dir: sandbox
  • git status tells you the diff of the three trees

git reset

  • A tool to manipulate these three trees
    • Path Form: git reset [file]
      • Opposite of git add [file]
      • Takes the entry from HEAD and makes the index look like that
      • Lets you manipulate your index without touching your working directory.
    • Commit form: git reset [commit]
      • Does three things in order. The option determines where in this process it stops.
      • --soft: move HEAD to target
        • Moving the branch to somewhere else. Does not change index or working directory; just changes where the branch points.
        • git reset --soft HEAD~
          • HEAD~: parent of HEAD
          • Undoes results of last commit.
      • [--mixed]: then copy to index
        • Takes what HEAD is pointing at and makes your index look like that.
      • --hard — then copy to work dir
        • Touches your working directory
  • Can use this to squash the last two commits into one
    • git reset --soft HEAD~2; git commit
      • Moves HEAD back two commits, keeps the index
    • Awesome for staging your work in progress and then making one nice, beautiful commit
  • git checkout
    • git checkout [commit] [path]
    • git checkout [commit]
    • If you’re in a dev branch and have made some commits:
      • git reset master will move your index to where the branch started
      • git checkout master will move HEAD to point to master
    • reset vs. checkout
  • Patchy work
    • git add --patch [file]
    • git reset --patch (commit) [file]
    • git checkout --patch (commit) [file]
      • Revert parts of a file for commit

Tidbits

  • git add -p
    • Uses the index tree as a staging area for partial addition
  • git commit --amend == git reset --soft HEAD~; git commit ...
  • git log --stat (branch)

The Plumbing Commands

  • A summary of the below commands
  • rev-parse
    • Take any string you give it and tell you its SHA. E.g., git rev-parse origin/master
    • git rev-parse master~163^2~3^2 — walks backwards, does some cool stuff. Figures out its SHA.
    • Can do ranges: git rev-parse master~163^2~3^2..origin/master
  • hash-object
    • Use git as a raw key-value store
    • git hash-object -w ~/.ssh/id_rsa.pub
    • echo 'my awesome value' | git hash-object -w --stdin
      • Will return SHA and write into database
  • ls-files
    • git ls-files -s: shows you your staging area
  • read-tree
    • Reads a tree into your index as a raw value
      • git ls-files -s # show the index
      • git ls-tree -r HEAD # show the HEAD tree
      • git read-tree HEAD~2 # basically the same as git reset
  • write-tree
    • Takes whatever your index looks like and writes it out as a tree object.
    • git write-tree: tells you the tree that a commit would make if you committed it.
    • Does not commit.
  • commit-tree
    • echo 'my commit message' | git commit-tree <tree-sha>
    • Commits without git commit
  • update-ref
    • Part of git branch mechanism
    • Updates your reflog as well
    • git update-ref refs/heads/newbranch <commit-sha>
  • symbolic-ref
    • Update HEAD itself
    • git symbolic-ref HEAD refs/heads/newbranch
  • Example usage
    • Could make an auto-backup system for your working directory (see the sketch after these notes)
    • Publish documentation to another branch
  • These are all under the rules that you shouldn’t mess with history if you’ve already pushed that to people.
    • So, use this stuff locally.
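
As a concrete spin on that auto-backup idea, here’s a minimal Python sketch that strings the plumbing together. This is my own illustration, not Scott’s code; it assumes git is on your PATH and uses a throwaway index file so your real staging area is never touched:

```python
import os
import subprocess
import tempfile

def git(*args, env=None):
    """Run a git command and return its stdout, stripped."""
    return subprocess.check_output(["git", *args], text=True, env=env).strip()

def backup_working_tree(ref="refs/heads/backup"):
    with tempfile.TemporaryDirectory() as tmp:
        # A private index: GIT_INDEX_FILE keeps us out of the real staging area.
        env = dict(os.environ, GIT_INDEX_FILE=os.path.join(tmp, "index"))
        git("add", "-A", env=env)           # stage the whole working tree
        tree = git("write-tree", env=env)   # index -> tree object
        commit = git("commit-tree", tree, "-m", "auto-backup")  # commit without git commit
        git("update-ref", ref, commit)      # move the backup ref (reflog updated too)
        return commit
```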

[Notes] Machine Learning: A Workshop by @hmason

I had the great fortune to attend Hilary Mason’s (@hmason) workshop this past Sunday at Strange Loop 2011. I was, in fact, so excited by the opportunity to attend this workshop that I actually got up early Saturday morning and prepared to leave for the hotel when I realized that I still had 24 hours to go–at least I didn’t make it all the way to the hotel before realizing.

I do want to give a big shout out to Hilary for giving me permission to post these notes. Note: I have redacted a few pieces of information that are irrelevant to anyone who was not at the workshop.

Our Goals

  • Clustering — finding groups of related things (unsupervised automation)
  • Named entity disambiguation — “Big Apple” ~ Translation
  • Classification — what language is something in?
  • Recommendations — Amazon, Netflix

Special Kinds of Data

  • (Not getting into today)
  • Geographic information (where do people eat bagels?)
  • Time-series analysis
  • Mathematically the same, applicably different

Black Box Model

Model for Thinking Through Data Problems

  • Obtain
  • Scrub
  • Explore
  • Model (and verify; error bars to know if you’re right)
  • iNterpret

Supervised Learning and Classification

  • Picture of cat and dog — which is cat, which is dog?
  • Assignment of a label to a previously unlabeled piece of data
  • Not really in opposition to unsupervised learning; can use together
  • Examples
    • Spam filters (see public domain Enron email db)
    • Language identification
    • Face detection
  • Places to get data
  • Classifying Text

    • NY Times: [REDACTED] (see arts, sports)
    • Process
      • Parse labeled data
      • Define features of data
      • calculate likely features for each label
      • For new, unlabeled data, predict
    • Trying to create featureset such that one is getting value from data, throwing away the useless
    • Naive Bayes
      • P(A) is the probability that A is true
      • P(True) = 1
      • P(False) = 0
      • 0 <= P(A) <= 1
      • P(A or B) = P(A) + P(B) - P(A and B)
      • p(B|A) = [p(A|B)p(B)]/p(A)
        • If we know the probability of B, and that of A, and that of A given B, we can figure out B versus A
      • Non-bayesian models too (frequentists)
      • Example (see the worked sketch at the end of this section)
        • Population of 10,000
        • 1% have a rare disease
        • Test that is 99% effective
          • 99% of sick patients test positive
          • 99% of healthy patients test negative
        • Given a positive test, there’s only a 50% probability that one is actually sick (99 sick patients test positive, but so do 99 healthy patients)
        • p(sick|test_pos) = [p(test_pos|sick)p(sick)]/p(test_pos) = 99/198 = 1/2
        • Features
          • what test result is
          • we know how many people are sick
      • Naive because we assume each feature is independent
      • Python has a good one built in, using Hilary’s [REDACTED]
      • Put computation cost up-front to train, pickle (serialize) onto disk
      • How to know when probability is
        • Significant
          • Use threshold (e.g., unless 0.8 or higher, not in category)
        • Wrong entirely
      • Feature Analysis — want to reduce featureset as much as possible without losing value in data
        • Punctuation
        • Case
        • Length of words
        • Stopwords - e.g., “the”, “and”, “a” (i.e., drop any word 3 chars or less)
        • Stemming — backing word up to linguistic stems
          • e.g., program = programming = programmer => program
      • Porter Stemming Algorithm
        • M.F. Porter, 1980; English only
        • Best known
      • More data! WordNet
        • There’s an open source, image-based version of WordNet
        • Use to bootstrap for working with smaller featuresets
          • E.g., Twitter text is small. WordNet gives synonyms
      • No clean text? lynx -dump — does a great job of getting text
    • Different types of classification algorithms and data
      • K-Nearest Neighbors
        • Works poorly on text data, pretty well on images
        • Small k complex boundary
        • Large k results in coarse averaging
        • Can’t have k > data set; also CPU bound
      • K-Means and K-Nearest Neighbors are different, but close
        • K-Means is the unsupervised analog
        • K-Nearest Neighbors assumes you have labels
        • i.e., one is without labels, one is with
    • Confusion Matrix: how do you know when you’ve won

      • Diagram of actual and predicted values

                    kitten   puppy   penguin   [Predicted Values]
        kitten         5       2        0
        puppy          2       4        0
        penguin        0       0        6
        [Actual Values]
        
      • Can assume penguins have a distinct set of features: little confusion
    • Use scipy for these kinds of analysis
      • Goal of example is to take images and recognize them
      • Goal: OCR (digits)
      • See [REDACTED]
      • Data is in [REDACTED] subdirectory
      • 0 is most often confused with 5
      • White means there were lots of matches (want to see this on (0,0), (1,1), etc.)
      • Sample size
        • Depends on how much discrimination in one’s data set
        • Consider orders of magnitude: on the order of 10,000s of samples are needed to do a pretty good job
      • Interpretability is costly (meaning, why did you recommend this?)
        • news.me does two models
          • run a fast one that is not interpretable
          • have a second, slower model that can be used later to trace back why a recommendation was made (its results differ slightly, so the trace is approximate)
    • Boosting
      • Combine weak-learners to create a strong-learner
        • Compute a weighted sum over the weak-learners
        • Very slow but very interpretable
        • Kitchen sink approach: throw every algorithm and featureset you have at the problem
      • Canonical boosting algorithm: adaboost
        • Google Prediction API: Boosting (but they call it “model selection”)
    • Examples from bit.ly
      • Spam and Malware Identification
        • Every URL shortened is queued and analyzed for malicious characteristics
        • Malicious Features
          • Domain (bloom filter)
          • TLD (.co.cc and .info are high-risk)
          • Randomness of URL (hack: gzip URL and see how long it is)
          • Words in HTML title
          • Number of redirections
          • is_spam is dependent on threshold on bit.ly API
        • Topic Identification (e.g., what percent is about the weather?)
          • Find the relationship between these topics
          • Can now put in URL and get very fast output of category predictions
          • Wikipedia is the dirty secret for seeding most things
    • D3 is awesome visualization library
      • d3py — Python objects to JavaScript for D3
    • Sometimes metadata is better than real data
      • What spoken languages are in a page?
        • bit.ly sees the user who clicks on the link, knows their browser locales
        • Use this to learn which locales are most represented among those clicking a link
        • Use entropy calculation
    • References
      • See slides
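
To tie the classification thread together, here’s a tiny end-to-end sketch: the disease-test arithmetic checked numerically, then NLTK’s built-in Naive Bayes with Porter stemming for feature reduction and the significance threshold mentioned above. The toy corpus is mine, not the workshop’s:

```python
import nltk
from nltk.stem.porter import PorterStemmer

# The disease example, checked numerically: P(sick | positive) = 1/2.
p_sick, p_pos_sick, p_pos_healthy = 0.01, 0.99, 0.01
p_pos = p_pos_sick * p_sick + p_pos_healthy * (1 - p_sick)
print(p_pos_sick * p_sick / p_pos)  # 0.5

stem = PorterStemmer().stem

def features(text):
    # Drop short, stopword-ish words, then stem the rest:
    # program / programming / programmer all reduce to "program".
    return {stem(w): True for w in text.lower().split() if len(w) > 3}

train = [(features("the programmer shipped the program"), "tech"),
         (features("programming in python all night"), "tech"),
         (features("the team won the championship game"), "sports"),
         (features("a great season for the home team"), "sports")]

classifier = nltk.NaiveBayesClassifier.train(train)
dist = classifier.prob_classify(features("python programming tips"))
label = dist.max()
# Apply a significance threshold rather than trusting every guess.
print(label if dist.prob(label) >= 0.8 else "unsure")
```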

Unsupervised Learning and Clustering

  • Clustering
    • Find similar things when you know nothing about what you’re looking at
    • Parametric vs. Non-Parametric
    • Agglomerative Clustering
      • For each point, find the closest point
      • Merge together until one has a cluster you want
      • Manually set thresholds
    • K-Means: Canonical Clustering Algorithm (see the sketch at the end of these notes)
      • Decide on the number of clusters, k
      • Randomly place the centroids
      • Assign each point to its nearest centroid, then recompute the centroids
      • Iterate until convergence
      • Keep iterating: eventually you’ll get something more accurate and useful
      • Distance
        • Manhattan distance (city-block distance or taxi cab geometry)
        • Depending on the features, it can be better than Euclidean distance
        • Jaccard Index
          • Good for text data (or data where there is not a good mapping to Euclidean space)
          • 0 when sets are disjoint
          • 1 when sets are identical
        • Consider absolute value (trying to measure similarity)
        • If it’s intuitive (e.g., 2- or 3-D space), perhaps use Euclidean distance
          • If unsure, it’s usually best to ask someone who knows.
      • Technique
        • Unsupervised approach at first to try to begin getting clusters
        • Follow up with a supervised approach
    • Variation: k-medoids (see slide)
    • Hierarchical clustering
      • Agglomerative
      • Combine closest items into a new item and repeat until there is just one item
    • "A Grand Tour of the Data" — flash graphs from running different algorithms in front of you; manual intervention (get some popcorn)
    • Dendrogram (“clustered URLs”) — typical graph used to represent hierarchical clustering
      • Links together things that are close together (have lines between them)
      • Height of line means at what round of algorithm they were agglomerated
    • Build an understanding of what’s normal based on clustering over time; this then lets you know when something isn’t normal.
      • E.g., drastic gains or drops from norm indicate something important
  • Recommendation Systems
    • Special case of an unsupervised clustering problem
      • Define a starting node and want to find similar nodes
    • hcluster
    • bit.ly: recommend up to the second
      • Online / Offline Learning Problem
        • Offline: Two parts to the calculation - one every n minutes, one query at a time (~10 minutes)
        • Online: Take the URL and query the cluster, using Singular Value Decomposition (SVD)
          • Do SVD on only one cluster
          • It’s a hack; not necessarily accurate
    • ML on Streaming Data (depends on problem)
      • Can snapshot and just focus on data for a block of time
      • Other options exist (like not iterating over whole set)
        • Less accurate, but it’s an option
    • References
      • See slides
  • Conclusion

    • Don’t use R in production!
      • Once your data hits memory bound, you’re screwed.
      • No mind paid to programming aesthetics
    • How do you know if you won?
      • Supervised
        • Save some of your labeled data for testing. Also a good way to tell when your model needs to be retrained.
      • Unsupervised
        • E.g., looking for realtime recommendations of links.
  • Going from Here

    • Book: Programming Collective Intelligence by Toby Segaran (O’Reilly)
    • Book: Pattern Recognition and Machine Learning by Christopher Bishop
    • Book: Machine Learning by Tom Mitchell
      • Tom is considered a founder of the ML field
    • Class: (Online) Stanford Machine Learning
    • Class: (Online) Stanford AI
    • Dataists
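
And the clustering sketch promised above: scipy’s k-means on toy 2-D points, plus a Jaccard index for data without a good Euclidean mapping. A minimal illustration with data I made up, assuming numpy and scipy:

```python
import numpy as np
from scipy.cluster.vq import kmeans2  # scipy, as suggested in the workshop

# Two synthetic blobs in 2-D Euclidean space; k-means should split them.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (50, 2)),
                    rng.normal(3, 0.5, (50, 2))])
centroids, labels = kmeans2(points, 2, minit="points")
print(centroids)  # one centroid near (0, 0), the other near (3, 3)

def jaccard(a, b):
    """Jaccard index: 0 for disjoint sets, 1 for identical sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Better suited than Euclidean distance to set-like data such as text.
print(jaccard("the cat sat".split(), "the cat stood".split()))  # 0.5
```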

[Notes] Vim: From Essentials to Mastery by @wnodom

This morning at Strange Loop 2011, I had the opportunity to attend Bill Odom’s excellent vim talk. Bill co-runs the local vim-geeks group, and I have attended his talks on vim in the past; every single time, he has impressed me and managed to teach me more and more–he’s a true master at it.

Bill has already managed to put his 300 slides online. I should note that we didn’t go through all 300 slides (I don’t believe that’s even possible in an hour), but instead we jumped around based on the core concepts that Bill wanted us to learn and the things that the audience thought most interesting. So, I’m posting my notes here just to highlight a few things that I saw as notable.

Sorry for any typos, and if you see anything that’s incorrect, please let me know. I had to type quickly to try to keep up, so I can’t guarantee that it’s 100% correct.

Help

  • :help :help
  • :helpgrep (:helpg)
  • :help!
  • :h holy-grail
  • :h 42

vim

  • In insert mode, (C-O) gets you out to run a single normal-mode command
  • Normal mode
    • Moving around
      • H — high
      • M — middle
      • L — low
      • gj — down by screen lines
      • gk — up by screen lines
    • Traveling without moving
      • zz — shifts line to middle
      • zt — shifts line to top
      • zb — shifts line to bottom
    • * — find the word you’re sitting on (search forward)
    • # — find the word you’re sitting on (search backward)
    • g* — like *, but also matches the word as part of longer words
    • g# — like #, but also matches the word as part of longer words
    • @: — rerun the last Ex command

Command line mode

  • Ex commands
  • Search commands
  • Filter commands (!)
    • Pump your text through an external program and then pull the result back into your file

Plugins

Registers

  • 26 named registers ("a through "z)
    • Uppercase the name to append
  • Numbered registers ("1 through "9)
  • :reg shows registers
    • Can be more specific: :reg adg
  • "% — urrent filename
  • "# — alternate filename
  • "_ — Last “small” delete
  • "/ — Last search
  • ": — last Ex command
  • "* — system clipboard
  • "+ — system selection (X11)
  • "_ — black hole: delete without storing in a register
  • (C-R)register — accessing registers
  • :let @a = "" — assign register
    • macros and registers say
  • (C-R)= — calculator
  • qaq — record an empty macro into register a (a quick way to clear it)

Resources


[Notes] Storm: Twitter’s scalable realtime computation system by @nathanmarz

At Strange Loop 2011 this morning, Nathan Marz (@nathanmarz) made a wonderful announcement: the open-sourcing of Storm, the realtime computation system that he developed at BackType (acquired by Twitter).

Here, I’m including the notes that I typed up during his presentation. Apologies in advance for any typos or errors (I removed anything that I was especially unsure of, just to be safe)–I had to type quickly to keep up.

Most importantly, check out the four repositories he open-sourced in the middle of the talk.

At last, here are the notes:

History: Before Storm

  • Queues and Workers
    • Example
      • Firehose → Queues → Workers (Hadoop) → Queues → Workers (Cassandra)
    • Message Locality
      • Any URL update must go through the same worker
      • Why?
        • No transactions in Cassandra (+ no atomic increments at the time)
        • More effective batching of updates
      • Implementing
        • Have a queue for each consuming worker
        • Choose queue for URL using consistent hashing
        • Take hash(URL) mod the number of queues to get the index of the queue
          • Same URL goes to same queue
          • Evenly distribute URLs to queues
      • Problems
        • Scaling: Adding a Worker
          • Deploy a new worker and new queue for that worker
          • Must redeploy the other workers: they use consistent hashing, so they must be told about the new queue
        • Poor fault-tolerance
        • Coding is tedious

Storm

  • What we want
    • Guaranteed data processing
    • Easily horizontally scalable
    • Fault-tolerance
    • No intermediate message brokers
      • Conflict with desire for guaranteed data processing
      • If a worker fails, it can always ask the message broker for the message again.
      • Problem: complex and slow. Messages have to go through a third party and persist to disk.
    • Higher level abstraction than message passing
    • Just works
  • Use Cases
    • Stream Processing
    • Distributed RPC
      • Parallelize an intense function; invoke on the fly and compute quickly
    • Continuous computation
  • Storm Cluster
    • Three Classes of Nodes
      • Nimbus: Master Node (similar to Hadoop JobTracker)
        • Submit topologies and code for execution
        • Launches workers
        • Monitors computations
        • Restart things that fail
      • ZooKeeper: cluster coordination
      • Worker nodes: actually run computations
        • Nodes pass messages to each other
        • Run daemon called Supervisor, communicates with Nimbus through Zookeeper
  • Concepts
    • Streams
      • Unbounded sequence of tuples
      • All tuples must have the same schema (same number of fields and the same types)
        • Supports primitive types (serialized and deserialized)
        • Also support for custom types
    • Spouts
      • Source of streams
      • Examples
        • Kestrel spout: read from a Kestrel queue
        • Read from twitter stream
    • Bolts
      • Processes input streams
      • Can run
        • Functions
        • Filters
        • Aggregations
        • Joins
        • Talk to databases
    • Topologies
      • Network of spouts and bolts
      • Each bolt subscribes to the output streams of any number of spouts or other bolts
  • Tasks
    • Spouts and bolts execute as many tasks across the cluster
    • Lots of tasks across many machines, all passing messages to one another
  • Stream grouping
    • When a tuple is emitted, to which task does it go?
    • Describes how to partition that stream
    • Shuffle grouping: picks a random task
    • Fields grouping: consistent hashing on a subset of the tuple fields
      • Similar to queues and workers, but higher level of abstraction
    • All grouping: send to all tasks
      • Use with care
    • Global grouping: pick task with lowest id
    • There are more, but not going into here
  • Streaming word count
    • TopologyBuilder is used to construct topologies in Java
      • See slides for example implementation
      • Split sentences into words with parallelism of 8 tasks
    • Create a word count stream
    • Can easily run other scripts, such as Python, inside a bolt (see the sketch after these notes)
    • Can run topology in local mode for development test
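
As a taste of that multi-language support, here is the classic sentence-splitting bolt written in Python, modeled on the storm-starter example (it assumes the storm.py multilang helper that ships alongside Storm):

```python
import storm  # the multilang helper module from storm-starter

class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
        # Each input tuple carries one sentence; emit one tuple per word.
        for word in tup.values[0].split(" "):
            storm.emit([word])

SplitSentenceBolt().run()
```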

Traditional data processing

  • Traditional Method (pre-Storm)
    • All of your data → precomputed indexes to run queries quickly
      • Precompute happens with intense processing: Hadoop, databases, etc.
      • Example: how many tweets on a URL between 7am on Sun. and 10pm on Mon.
        • Indexed by hour; sum over those few hours when querying
  • Storm: intense processing on both sides. Distributed RPC flow on Storm.
    • Distributed RPC Server (easy to implement, Storm comes with one)
      • Coordinates distributed RPC dataflow
      • Gives data to spout
      • Topology parallelizes computation, gives to bolt
      • Bolt gives to distributed RPC
      • Client gets result
    • Example
      • Compute reach of URL
        • Get URL, compute all tweeters. Find their followers.
        • Get the set of distinct followers.
        • Count → reach
        • Extremely intense computation: can be millions of people
      • Storm
        • Spout emits (requestid, tweeterid)
        • GetTweeters goes to GetFollowers; emits (requestid, followerid)
        • PartialDistinct
        • CountAggregator does global grouping, receives one tuple from each, and sums
        • All done completely in parallel
      • What might take hours now takes two seconds
        • Going down to 200ms. See “State spout” below
  • Guaranteeing message processing
    • Uses ZeroMQ
    • "Tuple Tree"
    • A spout tuple is fully processed when all tuples in the tree have been completed
    • If a tuple tree is not completed within a specified timeout, it is considered failed and replayed from the spout
    • Reliability API: you must do a little bit of work
      • Emit a word: anchor
        • Anchoring creates a new edge in the tuple tree
      • Collector acks the tuple; marks the single node as complete
      • Storm does the rest
        • timeouts when necessary
        • tracking what’s processed
        • seeing when it’s complete
      • Storm tracks tuple trees for you in an extremely efficient way
        • See the wiki on GitHub for explanation of this algorithm
  • Storm UI: see slides
  • Storm on EC2: it’s super easy. Use storm-deploy

The Future

  • State spout (almost done)
    • Synchronize a large amount of frequently changing state into a topology
    • Example 1
      • Optimize reach topology by eliminating the database calls.
      • Each GetFollowers task keeps a synchronized cache of a subset of the social graph
        • Works because GetFollowers repartitions the social graph the same way it partitions GetTweeter's stream
  • Storm on Mesos
    • Mesos is cluster/resource framework
    • Allow more fine-grained resource usage
  • "Swapping"
    • If you currently want to update a Storm topology, must kill it and submit a new one. Takes a few minutes.
      • This is bad for a realtime system!
    • Lets you safely swap one topology for a new one.
      • Atomic swaps.
      • Minimize downtime
      • Prevent message duplication
  • Auto-scaling
    • Storm can automatically scale topology to data
    • No work on your end; increase as message throughput increases
    • Also handles bursts of traffic. Temporary provisioning of more resources, then scale itself back down.
  • Higher level abstractions
    • Work can be done still to improve this
    • DSLs in variety of language, etc.

Be A Better Entrepreneur: Be A Little Selfish

Some Background

Several weeks ago I listened to a commencement speech by Chris Sacca (@sacca) that he gave to the Carlson School of Management. Many of the things that he said in this have stuck, but I don’t want to talk about all of the awesomeness contained within the speech (you should watch it for that). I instead want to pick out one thing in particular: playing offense.

As an entrepreneur, it is easy to find something to keep you busy no matter the time of the day. In fact, you usually don’t have to try that hard if you’re running a business–stuff to be done practically adds itself to your schedule (and if it doesn’t, others will–thanks, email!). I know this feeling of being busy all too well–I have always felt in my heart that I was an entrepreneur, and so I began pursuing that in 7th grade when I did computer consultation for my neighbors as well as a local insurance company. At the same time, I am someone who is very proud of his grades and dedicated to his formal education, so I was a middle schooler with the task of balancing business with school while still being able to excel in both.

For the point I’m building toward to make sense, I need to mention an aspect of my childhood: I was a pretty chubby kid. I never lied to myself about this–I admitted it to myself, and there were some who would pick on me just in case I didn’t manage the self-realization on my own. A couple of times throughout the years, I started to exercise a little bit, and then realized that I was far too busy for it–there was always more school and business work to fill my schedule, so I wanted to take care of that first. In my mind, I could always come back and revisit my weight later, but making the most of my education and running the business were things that needed to be done now.

Be A Little Selfish

Unfortunately, and I did not realize it so much then, even by focusing on the business I had co-founded as well as my own education, I wasn’t really thinking about myself; if I had been, I would have known that my health was much more important. I was not being selfish.

Having reflected on Chris’ speech, and knowing myself that getting into shape is something I need to do, I have recently started exercising again, but this time with a vigorous commitment. I bike every day, without fail, no matter what else my schedule has in store, and I have to say: I love it!

What’s changed?, you might be thinking, did you suddenly get un-busy?. The answer to that is a resounding no: I’m busier than ever before with the launch of QR Card Us and another service that we have yet to publicly announce. However, I have realized that in order to be a true success–and I mean more than just with the business, which I already consider a success as we have been able to help so many people, but rather to be successful in life as a whole–I need to be physically healthy now, not later. I need to play offense with my entire life, not just with the business.

The Benefits

This seems like an obvious one: I get to be physically fit. I haven’t been doing this long enough to really see the benefits yet (though I am confident I will!), but I have to mention an immediate change: I am so much more mentally agile, and it feels good.

Although the exercising takes a precious chunk of time out of my schedule, I am much more productive with the rest of my time–I think more clearly and I want to code even later into the night. I noticed this just last night: around 2am I was going to break from work to play a game, but I was suddenly having so much more fun than I ever have before with what I was doing that I kept procrastinating with my gaming, instead telling myself that I “just need to write one more function,” and then another, and another, and then it was 5am–a game-free night with tons of productivity to show for it.

(PSA: don’t think that you can add exercise or anything else to your schedule and offset it by staying up later–sleep is still highly important for everyone, especially entrepreneurs that desire to be successful, so try for a good amount. I just tend to schedule my work later into the evenings so I can stay up late and ‘sleep in’.)

Succinctly: although I take a measly hour or so out of my schedule to exercise, during the rest of the time I am so productive because of the mental benefits that I more than make up for this relatively small time commitment. If you really want to be harsh on yourself though, know that exercising is not “wasted time”–the time that I spend exercising is also time that I spend away from the keyboard thinking about the future, how to handle changing market conditions, and even getting ahead of myself by picturing just how awesome this new service is going to be once launched.

Why Wait?

I was originally going to wait until I was physically fit before posting a blog entry like this–maybe even show a before/after picture. I mean, why should you listen to someone who hasn’t been doing this for very long? I don’t have a really good reason for that, although I can attest to the success of this commitment through my own personal happiness.

So, while it might make more sense that I should wait until there are better proven results from my commitment, I have been thinking: why wait? I’m currently playing offense with my life, and I think a lot of other entrepreneurs should join me now in this–not later. It doesn’t have to be exercise, especially if you’re one with a fabulous metabolism with little effort (curse you!), but anything: take music lessons like you may have always wanted to (I recently started doing just that), practice a skill unrelated to your work, or just do anything that makes you feel like a more complete person. I promise that it will reflect positively both in your ability to run your business and in your personal happiness.

But, it’s hard!

I will be the first to admit that being selfish and focusing on yourself is a very tough thing for an entrepreneur to do, so I encourage you to have someone else there to keep you in line. Personally, I exercise with my best friend/business partner/mom, so if I don’t feel like going, she pressures me into it anyway, and vice-versa. The person you use for support does not even need to be directly involved: I share information from my music lessons with my parents so that if I fail to go to a lesson, I know I’m not just letting myself down, but also letting them down.

So, I guess my real message here is that when you play offense, remember to be a little selfish and keep yourself in mind, because it will ultimately benefit the entire team.

If you have any personal stories of your own to share, or just want to chat, I’d love to hear from you. Feel free to comment here or tweet me @sch. I want to also thank Chris so much for having given that commencement speech, and Carlson for posting it on YouTube. It is one of the most inspirational things I have heard in years, and I suspect it will stay near the top of that list for the duration of my life.


HTML Conditional Comments with blaze-html

Why

A couple of days ago I ran across a really neat boilerplate for mobile-friendly development called Skeleton. This seemed great, and because I do my web development in Haskell and use Jasper’s excellent blaze-html, I wanted the index.html coded in Haskell.

No problem, right? Wrong.

As soon as you look at the index.html file, these top lines raise concern:
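
They look roughly like this (reconstructed here from the Skeleton boilerplate of the time; the exact class names may have differed):

```html
<!--[if lt IE 7 ]> <html class="ie ie6" lang="en"> <![endif]-->
<!--[if IE 7 ]>    <html class="ie ie7" lang="en"> <![endif]-->
<!--[if IE 8 ]>    <html class="ie ie8" lang="en"> <![endif]-->
<!--[if (gt IE 8)|!(IE)]><!--> <html lang="en"> <!--<![endif]-->
```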

Can blaze-html do comments like that, or more complex still, are its combinators capable of closing the final if statement but leaving the HTML tag unclosed?

How

The answer to this is yes, although I admit my solution is kind of hackish. As I got farther into this, I began to realize that perhaps I should have just hard-coded these if statements with preEscapedText and moved on with my life, but it was too late at this point. I was already deep in Text.Blaze.Internal's source.

The first realization is that, due to the nature of the combinators, I had to define my own html and body tags that would allow me to:

  • Open the html tag without closing it
  • Open the body tag but close both it and html

This resulted in two custom combinators, html' and htmlBody, built directly on blaze’s Parent constructor, which is given the raw opening and closing tag strings: html' opens <html without ever closing it, while htmlBody opens <body and closes both it and html. Note that leaving the > off each opening string is intentional: blaze adds it automatically after appending any attributes.

Next came actually allowing for the creation of the comments, which turned out to be quite simple once StaticString and OverloadedStrings were understood.

comment is very similar to html' and htmlBody from above, but with one key difference: the addition of ss. This is needed to repack the StaticString that blaze uses for efficiency, which is typically created automagically via the OverloadedStrings language pragma.

Usage

Usage is actually quite simple once the combinators are set up, which was of course my goal all along. First, I created htmlTag, a function with the specific task of outputting just the conditional HTML tags; I was then able to use it effortlessly in the skeletonBase code that outputs the actual template.

And at last, we have HTML conditional comments in blaze!

Notes

As mentioned before, I know this seems a bit hackish, but it feels less hackish (and more fun!) than the alternative of just shoving a hard-coded string into the template via preEscapedText. Plus, it was a great reason to get to know blaze’s internals a bit better.

If you can think of a better way to do this, please do let me know.


Announcing hs-vcard

I recently began working on a new venture, QR Card Us, to help support my education and fund another venture–it just launched tonight. This isn’t so much a sales pitch about that, though, as it is an announcement of an open-sourced Haskell module I wrote in the process (though you’re certainly welcome to go check it out, tell your friends, and order some!).

As I was reading up on vCards, I found it most helpful to read RFC 2426. I wanted to easily play around with vCards in my favorite language, but didn’t care for the existing vCard module, so I decided to write my own instead, thus letting me announce hs-vcard. I think it’s fairly straightforward and well-documented, so I’ll end with an example input and the output.

As the example, I constructed the vCard for Frank Dawson, one of the RFC 2426 authors, in Haskell and printed it out.
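
The printed output matches the author example from RFC 2426 itself; abridged from memory of the RFC’s sample (details may vary slightly), it looks like this:

```
BEGIN:VCARD
VERSION:3.0
FN:Frank Dawson
ORG:Lotus Development Corporation
TEL;TYPE=VOICE,MSG,WORK:+1-919-676-9515
EMAIL;TYPE=INTERNET,PREF:Frank_Dawson@Lotus.com
URL:http://home.earthlink.net/~fdawson
END:VCARD
```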

Please let me know if you find any problems, and even better: fix them and submit patches to the GitHub repository. I also chose to not implement reading them in, primarily because I had absolutely no use for that, but people are again welcome to contribute and do so.

I hope you find it useful! Please feel free to comment with any questions or feedback you might have on how I could do things better.


Indigenous Tweets: True Worldwide Twitter Discovery

tl;dr

My mentor, Prof. Kevin Scannell, made a pretty awesome website called Indigenous Tweets. It finds and ranks tweeters who tweet in one of the over 30 indexed languages. He’ll also be blogging in order to talk with some of the main tweeters in each of these languages and help grow the online presence of these language communities.  

Intro

My mentor and esteemed SLU Professor Kevin Scannell is at it again: he’s providing a way for members of language communities to harness the power of the Internet in order to connect with one another, this time by finding the top users of over 30 languages on Twitter and ranking both them and the languages on Indigenous Tweets.

Why is this needed?

Twitter describes itself as “the best way to discover what’s new in your world,”¹ but there is a fundamental issue with this: “world” is presently limited by the inclusion of only a handful of languages. Although people can tweet in any language on Twitter, finding users who speak the same language is a difficult, or even a seemingly impossible, task. This is especially true for the minority languages on which Prof. Scannell focuses. Further, while Twitter attempts to classify the language of every tweet, it does a poor job.

This isn’t to blame Twitter–in fact, classifying many of these languages can be difficult due to a lack of data, but Prof. Scannell has been working on similar problems for many years and as such has amassed large corpora with which to classify and analyze languages. With his data, Twitter’s API, and the magic of Perl at hand, Prof. Scannell was able to write a bot that crawls Twitter as far as the API allows, seeded by a search of common but distinctive words in each language. For every user encountered in a search, the bot then considers not only that tweeter’s timeline for ranking, but also his or her following and follower graphs in an attempt to find other speakers of the language (sketched below).
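
Kevin’s bot is written in Perl, but the core of the idea fits in a few lines of Python. This is purely illustrative; search, neighbors, and classify stand in for the Twitter search API, the following/follower lookups, and his corpus-based language identifier:

```python
from collections import deque

def find_speakers(seed_words, search, neighbors, classify, max_users=10000):
    """Breadth-first crawl of the social graph, seeded by word search.

    search(word)    -> usernames whose tweets contain the word
    neighbors(user) -> users they follow plus their followers
    classify(user)  -> fraction of the user's timeline in the target language
    """
    queue = deque(user for word in seed_words for user in search(word))
    seen, speakers = set(), {}
    while queue and len(seen) < max_users:
        user = queue.popleft()
        if user in seen:
            continue
        seen.add(user)
        score = classify(user)
        if score > 0.5:               # a likely speaker: rank them and
            speakers[user] = score    # expand the frontier through them
            queue.extend(neighbors(user))
    return sorted(speakers, key=speakers.get, reverse=True)
```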

A bit of social.

Of course, as Twitter continues to grow, Indigenous Tweets aims to do the same. Twitter’s API was very helpful in gathering the data, Prof. Scannell has told me, but he knows that some tweeters were likely missed in the process. To counter this, every page is affixed with a form where the usernames of those thought to have been missed can be suggested, letting the community be directly involved in the website. As he continues to crawl Twitter, those suggested will be added to the queue for consideration.

He’s also created a blog (that he’ll definitely keep updated²). Through the blog, he plans to further engage the community, primarily by interviewing top tweeters in each language. He hopes that this in conjunction with the ranking system on Indigenous Tweets will put the need for increased Internet communication at the forefront of language communities’ minds.

Conclusion

It’s a really great service, so I should just stop talking about it so that you can go check it out!

Footnotes

¹ From Twitter’s about page. ² I visit his office frequently, so I’ll be sure to pester him if I don’t see a new post every once in a while.
