Posterous theme by Cory Watilo

[Notes] A Tale of Three Trees by @chacon

Here’s some more Strange Loop 2011 material–this time about a talk on git given by Scott Chacon (@chacon)!

In his talk, Scott focuses on demystifying git’s reset command through an explanation of git’s three trees: HEAD, index, and the working tree. Toward the end of the talk, he also includes some of Git’s plumbing goodies that can be useful in local scripts for automatic backup, cleaning past commits, and so forth.

Hope you enjoy the notes. As always, I typed quickly and so guarantee absolutely no accuracy. You might reference these notes with his slides (warning: PDF).

Trees

  • Head
    • Indirect pointer to the last commit
    • Points to a tree
  • Index
  • Working Directory
  • Tree Roles
    • Last commit, next parent
    • Index proposed next commit
      • Could git add everything, rm all files, and git commit would still work just fine
    • Work Dir sandbox
  • git status tells you the diff of the three trees

git reset

  • A tool to manipulate these three trees
    • Path Form: git reset [file]
      • Opposite of git add [file]
      • Takes entry from HEAD and make index look like that
      • Lets you manipulate your index without touching your working directory.
    • Commit form: git reset [commit]
      • Does three things in order. The option determines where in this process it stops.
      • --soft: move HEAD to target
        • Moving the branch to somewhere else. Does not change index or working directory; just changes where the branch points.
        • git reset --soft HEAD~
          • HEAD~: parent of HEAD
          • Undoes results of last commit.
      • [--mixed]: then copy to index
        • Takes what HEAD is pointing at and makes your index look like that.
      • --hard — then copy to work dir
        • Touches your working directory
  • Can use this to squash the last two commits into one
    • git reset --soft HEAD~2; git commit
      • Moves HEAD back tw ocommits, keep index
    • Awesome for staging your work in progress and then making one nice, beautiful commit
  • git checkout
    • git checkout [commit] [path]
    • git checkout [commit]
    • If you’re in a dev branch and have made some commits:
      • git reset master will move your index to where the branch started
      • git checkout master will move HEAD to point to master
    • reset vs. checkout
  • Patchy work
    • git add --patch [file]
    • git reset --patch (commit) [file]
    • git checkout --patch (commit) [file]
      • Revert parts of a file for commit

Tidbits

  • git add -p
    • Uses the index tree as a staging area for partial addition
  • git commit --amend == git reset --soft HEAD~; git commit ...
  • git log --stat (branch)

The Plumbing Commands

  • A summary of the below commands
  • rev-parse
    • Take any string you give it and tell you its SHA. E.g., git rev-parse origin/master
    • git rev-parse master~163^2~3^2 — walks backwards, does some cool stuff. Figures out its SHA.
    • Can do ranges: git rev-parse master~163^2~3^2..origin/master
  • hash-object
    • Use git as a raw key-value store
    • git hash-object -w ~/.sshid_rsa.pub
    • echo 'my awesome value' | git hash-object -w --stdin
      • Will return SHA and write into database
  • ls-files
    • git ls-files -s: shows you your staging area
  • read-tree
    • Reads a tree value into your index at a raw value git ls-files -s # show index git ls-files -r HEAD # show index git read-tree HEAD~2 # basically same as git reset
  • write-tree
    • Takes whatever your index looks like and writes it out as a tree object.
    • git write-tree: tells you the tree that a commit would make if you commited it.
    • Does not commit.
  • commit-tree
    • echo 'my commit message' | git commit-tree
    • Commits without git commit
  • update-ref
    • Part of git branch mechanism
    • Updates your reflog as well
    • git update-ref refs/heads/newbranch
  • symbolic-ref
    • Update HEAD itself
    • git symbolic-ref HEAD refs/heads/newbranch
  • Example usage
    • Could make an auto-backup system for your working directory
    • Publish documentation to another branch
  • These are all under the rules that you shouldn’t mess with history if you’ve already pushed that to people.
    • So, use this stuff locally.

[Notes] Machine Learning: A Workshop by @hmason

I had the great fortune to attend Hilary Mason’s (@hmason) workshop this past Sunday at Strange Loop 2011. I was, in fact, so excited by the opportunity to attend this workshop that I actually got up early Saturday morning and prepared to leave for the hotel when I realized that I still had 24 hours to go–at least I didn’t make it all the way to the hotel before realizing.

I do want to give a big shout out to Hilary for giving me permission to post these notes. Note: I have redacted a few pieces of information that are irrelevant to anyone who was not at the workshop.

Our Goals

  • Clustering — finding groups of related things (unsupervised automation)
  • Named entity disambiguation — “Big Apple” ~ Translation
  • Classification — what language is something in?
  • Recommendations — Amazon, Netflix

Special Kinds of Data

  • (Not getting into today)
  • Geographic information (where do people eat bagels?)
  • Time-series analysis
  • Mathematically the same, applicably different

Black Box Model

Model for Thinking Through Data Problems

  • Obtain
  • Scrub
  • Explore
  • Model (and verify; error bars to know if you’re right)
  • iNterpret

Supervised Learning and Classification

  • Picture of cat and dog — which is cat, which is dog?
  • Assignment of a label to a previously unlabeled piece of data
  • Not really in opposition to unsupervised learning; can use together
  • Examples
    • Spam filters (see public domain Enron email db)
    • Language identification
    • Face detections
  • Places to get data
  • Classifying Text
    • NY Times: [REDACTED] (see arts, sports)
    • Process
      • Parse labeled data
      • Define features of data
      • calculate likely features for each label
      • For new, unlabeled data, predict
    • Trying to create featureset such that one is getting value from data, throwing away the useless
    • Naive Bayes
      • P(A) is the probability that A is true
      • P(True) = 1
      • P(False) = 0
      • 0 <= P(A) <= 1
      • P(A or B) = P(A) + P(B) - P(A and B)
      • p(B|A) = [p(A|B)p(B)]/p(A)
        • If we know the probability of B, and that of A, and that of A given B, we can figure out B versus A
      • Non-bayesian models too (frequentists)
      • Example
        • Population of 10,000
        • 1% have a rare desease
        • Test that is 99% effective
          • 99% of sick pateitnts test positive
          • 99% of healthy patients test negative
        • 50% probability that is one actually sick (99 sick patients test positive, 99 healthy patients test positive)
        • p(sick|test_pos) = [p(test_pos|sick)p(sick)]/p(test_pos) = 99/198 = 1/2
        • Features
          • what test result is
          • we know how many people are sick
      • Naive because we assume each feature is independent
      • Python has a good one built in, using Hilary’s [REDACTED]
      • Put computation cost up-front to train, pickle (serialize) onto disk
      • How to know when probability is
        • Significant
          • Use threshold (e.g., unless 0.8 or higher, not in category)
        • Wrong entirely
      • Feature Analysis — want to reduce featureset as much as possible without losing value in data
        • Punctuation
        • Case
        • Length of words
        • Stopwords – e.g., “the”, “and”, “a” (i.e., drop any word 3 chars or less)
        • Stemming — backing word up to linguistic stems
          • e.g., program = programming = programmer => program
      • Porter Stemming Algorithm
        • M.F. Porter, 1980; English only
        • Best known
      • More data! WordNet
        • There’s an open source, image-based version of WordNet
        • Use to bootstrap for working with smaller featuresets
          • E.g., Twitter text is small. WordNet gives synonyms
      • No clean text? lynx -dump — does a great job of getting text
    • Different types of classification algorithms and data
      • K-Nearest Neighbors
        • Works poorly on text data, pretty well on images
        • Small k complex boundary
        • Large k results in course averaging
        • Can’t have k > data set; also CPU bound
      • K-Means and K-Nearest Neighbors are different, but close
        • K-Means is analog of supervised learning
        • K-Nearest Neighbors assumes you have labels
        • i.e., one is without labels, one is with
    • Confusion Matrix: how do you know when you’ve won
      • Diagram of actual and predicted values

        .           kitten   puppy  penguin [Predicted Values]
          kitten      5       2       0
          puppy       2       4       0
          penguin     0       0       6
          [Actual Values]
      • Can assume penguins have a distinct set of features: little confusion
    • Use scipy for these kinds of analysis
      • Goal of example is to take images and recognize them
      • Goal: OCR (digits)
      • See [REDACTED]
      • Data is in [REDACTED] subdirectory
      • 0 is most often confused with 5
      • White means there were lots of matches (want to see this on (0,0), (1,1), etc.)
      • Sample size
        • Depends on how much discrimination in one’s data set
        • Consider on order of magnitudes: 10,000s of samples if needed to do a pretty good job
      • Interpretability is costly (meaning, why did you recommend this?)
        • news.me does two models
          • run a fast one that is not interpretable
          • have a second one that takes longer to generate and can be used later to trace back why a recommendation was made (but with different results, so not the same results)
    • Boosting
      • Combine weak-learners to create a strong-learner
        • Compute a weighted sum over the weak-learners
        • Very slow but very interpretable
        • Kitchen sink approach: throw every algorithm, featureset you
      • Canonical boosting algorithm: adaboost
        • Google Prediction API: Boosting (but they call it “model selection”)
    • Examples from bit.ly
      • Spam and Malware Identification
        • Every URL shortened is queued and analyzed for malicious characteristics
        • Malicious Features
          • Domain (bloom filter)
          • TLD (.co.cc and .info are high-risk)
          • Randomness of URL (hack: gzip URL and see how long it is)
          • Words in HTML title
          • Number of redirections
          • is_spam is dependent on threshold on bit.ly API
        • Topic Identification (e.g., what percent is about the weather?)
          • Find the relationship between these topics
          • Can now put in URL and get very fast output of category predictions
          • Wikipedia is the dirty secret for seeding most things
    • D3 is awesome visualization library
      • d3py — Python objects to JavaScript for D3
    • Sometimes metadata is better than real data
      • What spoken languages are in a page?
        • bit.ly sees the user who clicks on the link, knows their browser locales
        • Use this to know what locales clicking a link are most present
        • Use entropy calculation
    • References
      • See slides

Unsupervised Learning and Clustering

  • Clustering
    • Find similar things when you know nothing about what you’re looking at
    • Parametric vs. Non-Parametric
    • Agglomerative Clustering
      • Find each point, find closest point
      • Merge together until one has a cluster you want
      • Manually set thresholds
    • K-Means: Canonical Clustering Algorithm
      • Decide on number of clusters
      • Randomly place centroids
      • iterate until clusters are formed
      • Iterate until convergence
      • Keep iterating: eventually you’ll get something more accurate and useful
      • Distance
        • Manhattan distance (city-block distance or taxi cab geometry)
        • Depending on features, it can be better than
        • Jacard Index
          • Good for text data (or data where there is not a good mapping to Euclidian space)
          • 0 when sets are disjoint
          • 1 when sets are identical
        • Consider absolute value (trying to measure similarity)
        • If it’s intuitive (e.g., 2- or 3-D space), perhaps use Euclidian
          • Unsure, usually best to ask someone who knows.
      • Technique
        • Unsupervised approach at first to try to begin getting clusters
        • Follow up with a supervised approach
    • VariationL k-mediods (see slide)
    • Hierarchical clustering
      • Agglomerative
      • Combine closest items into a new item and repeat until there is just one item
    • “A Grand Tour of the Data” — flash graphs from running different algorithms in front of you; manual intervention (get some popcorn)
    • Dengigram (“clusted URLs”) — typical graph used to represent hierarchical clustering
      • Links together things that are closetogether (have lines between them)
      • Height of line means at what round of algorithm they were agglomerated
    • Build an understanding of what’s normal based on clustering over time; this then lets you know when something isn’t normal.
      • E.g., drastic gains or drops from norm indicate something important
  • Recommendation Systems
    • Special case of an unsupervised clustering problem
      • Define a starting node and want to find similar nodes
    • hcluster
    • bit.ly: recommend up to the second
      • Online / Offline Learning Problem
        • Offline: Two parts to the calculation – one every n minutes, one query at a time (~10 minutes)
        • Online: Take the URL and query the cluster. Using Single Value Decomposition (SVD)
          • Do SVD on only one cluster
          • It’s a hack; not necessarily accurate
    • ML on Streaming Data (depends on problem)
      • Can snapshot and just focus on data for a block of time
      • Other options exist (like not iterating over whole set)
        • Less accurate, but it’s an option
    • References
      • See slides
  • Conclusion

    • Don’t use R in production!
      • Once your data hits memory bound, you’re screwed.
      • No mind paid to programming aesthetics
    • How do you know if you won?
      • Supervised
        • Save some of labeled data for a test. Also a good way to test if your data needs to be retrained.
      • Unsupervised
        • E.g., looking for realtime recommendations of links.
  • Going from Here

    • Book: Toby Segerand: Programming Collective Intelligence by O'Reilly
    • Book: Pattern Recognition and Machine Learning by Christopher Bishop
    • Book: Reinforcement Learning by Tom Mitchell
      • Tom is considered founder of ML field
    • Class: (Online) Stanford Machine Learning
    • Class: (Online) Stanford AI
    • Dataists

[Notes] Vim: From Essentials to Mastery by @wnodom

This morning at Strange Loop 2011, I had the opportunity to attend Bill Odom’s excellent vim talk. Bill co-runs the local vim-geeks group and I have attended his talks on vim in the past and every single time he has impressed me and managed to teach me more and more–he’s a true master at it.

Bill has already managed to put his 300 slides online. I should note that we didn’t go through all 300 slides (I don’t believe that’s even possible in an hour), but instead we jumped around based on the core concepts that Bill wanted us to learn and the things that the audience thought most interesting. So, I’m posting my notes here just to highlight a few things that I saw as notable.

Sorry for any typos, and if you see anything that’s incorrect, please let me know. I had to type quickly to try to keep up, so I can’t guarantee that it’s 100% correct.

Help

  • :help :help
  • :helpgrep (:helpg)
  • :help!
  • :h holy-grail
  • :h 42

vim

  • (insert) (C-O) gets you out to do a single normal mode command
  • Normal mode
    • Moving around
      • H — high
      • M — middle
      • L — low
      • gj — up screen lines
      • gk — down screen lines
    • Traveling without moving
      • zz — shifts line to middle
      • zt — shifts line to top
      • zb — shifts line to bottom
    • * — find the word you’re sitting on (search forward)
    • # — find the word you’re sitting on (search backward)
    • g* — same thing as *, but in a different way
    • g# — same thing as #, but in a different way
    • @: — rerun last command

Command line mode

  • Ex commands
  • Search commands (!)
    • Pump your text to an external program and then push it back into your file
  • Filter commands

Plugins

Registers

  • 26 named register ("a through "z)
    • Uppercase the name to append
  • Numbered registers ("1 through "9)
  • :reg shows registers
    • Can be more specific: :reg adg
  • "% — urrent filename
  • "# — alternate filename
  • "_ — Last “small” delete
  • "/ — Last search
  • ": — last Ex command
  • "* — system clipboard
  • "+ — system selection (X11)
  • "_ — black hole: delete without storing in a register
  • (C-R)register — accessing registers
  • :let @a = "" — assign register
    • macros and registers say
  • (C-R)= — calculator
  • qaq — Record an empty macro

Resources

[Notes] Storm: Twitter's scalable realtime computation system by @nathanmarz

At Strange Loop 2011 this morning, Nathan Marz (@nathanmarz) made a wonderful announcement this morning: the open-sourcing of Storm, the realtime computation system that he developed at BackType (acquired by Twitter).

Here, I’m including the notes that I typed up during his presentation. Apologies in advance for any typos or errors (I removed anything that I was especially unsure of, just to be safe)–I had to type quickly to keep up.

Most importantly, check out the four repositories he open-sourced in the middle of the talk:

At last, here are the notes:

History: Before Storm

  • Queues and Workers
    • Example
      • Firehose ~ Queues ~ Workers (Hadoop) ~ Queues ~ Workers (Cassandra)
    • Message Locality
      • Any URL update must go through the same worker
      • Why?
        • No transactions in Cassandra (+ no atomic increments at the time)
        • More effective batching of updates
      • Implementing
        • Have a queue for each consuming worker
        • Choose queue for URL using consistent hashing
        • Take hash of URL mod Queue = Index of Queue
          • Same URL goes to same queue
          • Evenly distribute URLs to queues
      • Problems
        • Scaling: Adding a Worker
          • Deploy a new worker and new queue for that worker
          • Must redeploy other workers: use consistent hashing, must let them know that there’s a new queue
        • Poor fault-tolerance
        • Coding is tedious

Storm

  • What we want
    • Guaranteed data processing
    • Easily horizontally scalable
    • Fault-tolerance
    • No intermediate message brokers
      • Conflict with desire for guaranteed data processing
      • If worker fails, can always ask for it from the data broker again.
      • Problem: complex and slow. Messages have to go through third party and persist to disk.
    • Higher level abstraction than message passing
    • Just works
  • Use Cases
    • Stream Processing
    • Distributed RPC
      • Parallelize an intense function; invoke on the fly and compute quickly
    • Continuous computation
  • Storm Cluster
    • Three Classes of Nodes
      • Nimbus: Master Node (similar to Hadoop JobTracker)
        • Submit topologies and code for execution
        • Launches workers
        • Monitors computations
        • Restart things that fail
      • ZooKeeper: cluster coordination
      • Worker nodes: actually run computations
        • Nodes pass messages to each other
        • Run daemon called Supervisor, communicates with Nimbus through Zookeeper
  • Concepts
    • Streams
      • Unbounded sequence of tuples
      • All tuples must have same schema (same number and same types)
        • Supports primitive types (serialized and deserialized)
        • Also support for custom types
    • Spouts
      • Source of streams
      • Examples
        • Kstrel spout: read from kestrel queue
        • Read from twitter stream
    • Bolts
      • Processes input streams
      • Can run
        • Functions
        • Filters
        • Aggregations
        • Joins
        • Talk to databases
    • Topologies
      • Network of spouts and bolts
      • Each bolt subscribes to any number of output streams
  • Tasks
    • Spouts and bolts execute as many tasks across the cluster
    • Lots of tasks across many machines, all passing messages to one another
  • Stream grouping
    • When a tuple is emitted, to which task does it go?
    • Describes how to partition that stream
    • Shuffle grouping: picks a random task
    • Fields grouping: consistent hashing on a subset of the tuple fields
      • Similar to queues and workers, but higher level of abstraction
    • All grouping: send to all tasks
      • Use with care
    • Global grouping: pick task with lowest id
    • There are more, but not going into here
  • Streaming word count
    • TopologyBuilder is used to construct topologies in Java
      • See slides for example implementation
      • Split sentences into words with parallelism of 8 tasks
    • Create a word count stream
    • Can easily run some other script, such as Python to evaluate
    • Can run topology in local mode for development test

Traditional data processing

  • Traditional Method (pre-Storm)
    • All of your data ~ precompute indexes to run query quickly
      • Precompute happens with intense processing: Hadoop, databases, etc.
      • Example: how many tweets on a URL between 7am on Sun. and 10pm on Mon.
        • Indexed by hour; sum over those few hours when querying
  • Storm: intense processing on both sides. Distributed RPC flow on Storm.
    • Distributed RPC Server (easy to implement, Storm comes with one)
      • Coordinates distributed RPC dataflow
      • Gives data to spout
      • Topology parallelizes computation, gives to bolt
      • Bolt gives to distributed RPC
      • Client gets result
    • Example
      • Compute reach of URL
        • Get URL, compute all tweeters. Find their followers.
        • Get set of distrinct follower.
        • Count ~ Reach
        • Extremely intense computation: can be millions of people
      • Storm
        • Spout emits (requestid, tweeterid)
        • GetTweeters goes to GetFollowers; emits (requestid, followerid)
        • PartialDistinct
        • CountAggregator does global grouping, receives one tuple from each, and sums
        • All done completely in parallel
      • What might takes hours now takes two seconds
        • Going down to 200ms. See “State spout” below
  • Guaranteeing message processing
    • Uses ZeroMQ
    • “Tuple Tree”
    • A spout tuple is fully processed when all tuples in the tree have been completed
    • If a tuple tree is not completed within a specified timeout, it is considered failed and replayed from the spout . Reliability API: must do a little bit of work
      • Emit a word: anchor
        • Anchoring creates a new edge in the tuple tree
      • Collector acks the tuple; marks the single node as complete
      • Storm does the rest
        • timeouts when necessary
        • tracking what’s processed
        • seeing when it’s complete
      • Storm tracks tuple trees for you in an extremely efficient way
        • See the wiki on GitHub for explanation of this algorithm
  • Storm UI: see slides
  • Storm on EC2: it’s super easy. Use storm-deploy

The Future

  • State spout (almost done)
    • Synchronize a large amount of frequently changing state into a topology
    • Example 1
      • Optimize reach topology by eliminating the database calls.
      • Each GetFollowers task keeps a synchronous cache of a subst of the social graph
        • Works because GetFollowers repartitions the social graph the same way it partitions GetTweeter’s stream
  • Storm on Mesos
    • Mesos is cluster/resource framework
    • Allow more fine-grained resource usage
  • “Swapping”
    • If you currently want to update a Storm topology, must kill it and submit a new one. Takes a few minutes.
      • This is bad for a realtime system!
    • Lets you safely swap one topology ofr a new one.
      • Atomic swaps.
      • Minimize downtime
      • Prevent message duplication
  • Auto-scaling
    • Storm can automatically scale topology to data
    • No work on your end; increase as message throughput increases
    • Also handles bursts of traffic. Temporary provisioning of more resources, then scale itself back down.
  • Higher level abstractions
    • Work can be done still to improve this
    • DSLs in variety of language, etc.