[Notes] Machine Learning: A Workshop by @hmason


I had the great fortune to attend Hilary Mason’s (@hmason) workshop this past Sunday at Strange Loop 2011. I was so excited to attend that I got up early Saturday morning and prepared to leave for the hotel before realizing that I still had 24 hours to go. At least I didn’t make it all the way to the hotel first.

I do want to give a big shout out to Hilary for giving me permission to post these notes. Note: I have redacted a few pieces of information that are irrelevant to anyone who was not at the workshop.

Our Goals

  • Clustering – finding groups of related things (unsupervised automation)
  • Named entity disambiguation – “Big Apple” ~ Translation
  • Classification – what language is something in?
  • Recommendations – Amazon, Netflix

Special Kinds of Data

  • (Not getting into today)
  • Geographic information (where do people eat bagels?)
  • Time-series analysis
  • Mathematically the same, applicably different

Black Box Model

Model for Thinking Through Data Problems

  • _O_btain
  • _S_crub
  • _E_xplore
  • _M_odel (and verify; error bars to know if you’re right)
  • i_N_terpret

Supervised Learning and Classification

  • Picture of cat and dog – which is cat, which is dog?
  • Assignment of a label to a previously unlabeled piece of data
  • Not really in opposition to unsupervised learning; can use together
  • Examples
    • Spam filters (see public domain Enron email db)
    • Language identification
    • Face detection
  • Places to get data
  • Classifying Text
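To make "assignment of a label to a previously unlabeled piece of data" concrete, here is a minimal sketch of language identification. It is not the method shown in the workshop: the character-trigram scoring, the function names, and the two tiny training sentences are all assumptions chosen for illustration.

```python
def trigrams(text):
    """Return the set of character trigrams in the lowercased text."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def train(labeled_samples):
    """Build one trigram profile (a set) per label from (text, label) pairs."""
    profiles = {}
    for text, label in labeled_samples:
        profiles.setdefault(label, set()).update(trigrams(text))
    return profiles


def classify(text, profiles):
    """Return the label whose profile covers the most trigrams of the text."""
    grams = trigrams(text)

    def score(label):
        # Fraction of the text's trigrams that were seen for this label.
        return len(grams & profiles[label]) / max(len(grams), 1)

    return max(profiles, key=score)


# Hypothetical labeled data; a real classifier needs far more text per label.
SAMPLES = [
    ("the quick brown fox jumps over the lazy dog and the cat", "en"),
    ("el rápido zorro marrón salta sobre el perro perezoso y el gato", "es"),
]

profiles = train(SAMPLES)
print(classify("the dog and the fox", profiles))
print(classify("el gato y el perro", profiles))
```

The supervised part is the labeled `SAMPLES` list: the classifier only knows the labels it was shown, which is why the quality and quantity of labeled data matter so much.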

Unsupervised Learning and Clustering

  • Clustering

    • Find similar things when you know nothing about what you’re looking at
    • Parametric vs. Non-Parametric
    • Agglomerative Clustering
      • Find each point, find closest point
      • Merge together until one has a cluster you want
      • Manually set thresholds
    • K-Means: Canonical Clustering Algorithm
      • Decide on number of clusters
      • Randomly place centroids
      • Iterate (assign each point to its nearest centroid, then recompute the centroids) until clusters are formed
      • Iterate until convergence
      • Keep iterating: eventually you’ll get something more accurate and useful
      • Distance
        • Manhattan distance (city-block distance or taxi cab geometry)
        • Depending on features, it can be better than
        • Jaccard index
          • Good for text data (or data where there is not a good mapping to Euclidean space)
          • 0 when sets are disjoint
          • 1 when sets are identical
        • Consider absolute value (trying to measure similarity)
        • If it’s intuitive (e.g., 2- or 3-D space), perhaps use Euclidean
          • If unsure, it is usually best to ask someone who knows.
      • Technique
        • Unsupervised approach at first to try to begin getting clusters
        • Follow up with a supervised approach
    • Variation: k-medoids (see slide)
    • Hierarchical clustering
      • Agglomerative
      • Combine closest items into a new item and repeat until there is just one item
    • “A Grand Tour of the Data” – flash graphs from running different algorithms in front of you; manual intervention (get some popcorn)
    • Dendrogram (“clustered URLs”) – typical graph used to represent hierarchical clustering
      • Links together things that are close together (have lines between them)
      • Height of a line indicates the round of the algorithm at which they were agglomerated
    • Build an understanding of what’s normal based on clustering over time; this then lets you know when something isn’t normal.
      • E.g., drastic gains or drops from norm indicate something important
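The K-Means steps listed above (decide on the number of clusters, randomly place centroids, iterate until convergence) can be sketched in plain Python. This is a minimal illustration under assumptions, not the workshop's code: the function names, the `seed` and `max_iter` parameters, and the choice to draw the initial centroids from the data points are all mine.

```python
import random


def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """City-block (taxicab) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_means(points, k, distance=euclidean, seed=0, max_iter=100):
    """Cluster points into k groups; return (centroids, labels)."""
    rng = random.Random(seed)
    # Randomly place centroids (assumption: start from k of the data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: distance(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # Converged: no point changed cluster.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        # Note: the mean is the natural update for Euclidean distance; with
        # Manhattan distance it is only a simplification (k-medians uses
        # the per-coordinate median instead).
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Passing `distance=manhattan` swaps the metric without touching the loop, which is one way to experiment with the distance choices discussed in the notes.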
  • Recommendation Systems

    • Special case of an unsupervised clustering problem
      • Define a starting node and want to find similar nodes
    • hcluster
    • bit.ly: recommend up to the second
      • Online / Offline Learning Problem
        • Offline: Two parts to the calculation - one every n minutes, one query at a time (~10 minutes)
        • Online: Take the URL and query the cluster. Using Singular Value Decomposition (SVD)
          • Do SVD on only one cluster
          • It’s a hack; not necessarily accurate
    • ML on Streaming Data (depends on problem)
      • Can snapshot and just focus on data for a block of time
      • Other options exist (like not iterating over whole set)
        • Less accurate, but it’s an option
    • References
      • See slides
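The idea of defining a starting node and finding similar nodes can be sketched with the Jaccard index from the distance notes above. This is only an illustration, not how bit.ly computes recommendations: the `clicks` data, the URL and user names, and the `recommend` function are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets: 0 when disjoint, 1 when identical."""
    union = a | b
    if not union:
        return 1.0  # Assumption: two empty sets count as identical.
    return len(a & b) / len(union)


def recommend(start, items, n=3):
    """Return up to n item names most similar to the starting item."""
    target = items[start]
    scored = [
        (jaccard(target, members), name)
        for name, members in items.items()
        if name != start
    ]
    # Highest similarity first; break ties by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:n]]


# Hypothetical data: which users clicked which URL.
clicks = {
    "url_a": {"u1", "u2", "u3"},
    "url_b": {"u2", "u3", "u4"},
    "url_c": {"u5"},
    "url_d": {"u1", "u2", "u3"},
}

print(recommend("url_a", clicks, n=2))  # ['url_d', 'url_b']
```

Comparing the starting node against every other node is fine for a toy data set; the notes above describe restricting the online query to a single cluster precisely because that full comparison does not scale.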
  • Conclusion

  • Going from Here

michael schade