[Notes] Machine Learning: A Workshop by @hmason

2011-09-21

I had the great fortune to attend Hilary Mason’s (@hmason) workshop this past Sunday at Strange Loop 2011. I was, in fact, so excited by the opportunity to attend this workshop that I actually got up early Saturday morning and prepared to leave for the hotel when I realized that I still had 24 hours to go–at least I didn’t make it all the way to the hotel before realizing.

I do want to give a big shout out to Hilary for giving me permission to post these notes. Note: I have redacted a few pieces of information that are irrelevant to anyone who was not at the workshop.

Our Goals

  • Clustering – finding groups of related things (unsupervised)
  • Named entity disambiguation – “Big Apple” ~ Translation
  • Classification – what language is something in?
  • Recommendations – Amazon, Netflix

Special Kinds of Data

  • (Not getting into today)
  • Geographic information (where do people eat bagels?)
  • Time-series analysis
  • Mathematically the same, but applied differently

Black Box Model

Model for Thinking Through Data Problems

  • _O_btain
  • _S_crub
  • _E_xplore
  • _M_odel (and verify; error bars to know if you’re right)
  • i_N_terpret

Supervised Learning and Classification

  • Picture of cat and dog – which is cat, which is dog?
  • Assignment of a label to a previously unlabeled piece of data
  • Not really in opposition to unsupervised learning; can use together
  • Examples
    • Spam filters (see public domain Enron email db)
    • Language identification
    • Face detection
  • Places to get data
  • Classifying Text (a rough sketch follows below)
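
To make the classification idea concrete, here's a minimal bag-of-words text classifier in Python using scikit-learn; this is just my own sketch after the fact (the documents and labels are made up), not code from the workshop.

```python
# Minimal text-classification sketch (illustrative only; the documents and
# labels below are invented, not from the workshop or the Enron corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled corpus: each document gets a language label.
docs = [
    "the cat sat on the mat",
    "le chat est sur le tapis",
    "where is the train station",
    "où est la gare",
]
labels = ["en", "fr", "en", "fr"]

# Bag-of-words features feeding a naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["the train is late"]))  # expected: ['en']
```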

Unsupervised Learning and Clustering

  • Clustering

    • Find similar things when you know nothing about what you’re looking at
    • Parametric vs. Non-Parametric
    • Agglomerative Clustering
      • For each point, find the closest point
      • Merge the closest pairs together until you have the clusters you want
      • Manually set thresholds
    • K-Means: Canonical Clustering Algorithm (see the sketch after this list)
      • Decide on number of clusters
      • Randomly place centroids
      • Iterate: assign each point to its nearest centroid, then recompute the centroids
      • Repeat until convergence
      • Keep iterating: eventually you’ll get something more accurate and useful
      • Distance
        • Manhattan distance (city-block distance or taxi cab geometry)
        • Depending on the features, it can be better than Euclidean distance
        • Jaccard Index (small sketch after this list)
          • Good for text data (or data where there is not a good mapping to Euclidean space)
          • 0 when sets are disjoint
          • 1 when sets are identical
        • Consider absolute value (trying to measure similarity)
        • If it’s intuitive (e.g., 2- or 3-D space), perhaps use Euclidean
          • If unsure, it’s usually best to ask someone who knows.
      • Technique
        • Unsupervised approach at first to try to begin getting clusters
        • Follow up with a supervised approach
    • Variation: k-medoids (see slide)
    • Hierarchical clustering
      • Agglomerative
      • Combine closest items into a new item and repeat until there is just one item
    • “A Grand Tour of the Data” – flash graphs from running different algorithms in front of you; manual intervention (get some popcorn)
    • Dendrogram (“clustered URLs”) – typical graph used to represent hierarchical clustering (see the SciPy sketch after this list)
      • Links together things that are close together (have lines between them)
      • Height of line means at what round of algorithm they were agglomerated
    • Build an understanding of what’s normal based on clustering over time; this then lets you know when something isn’t normal.
      • E.g., drastic gains or drops from norm indicate something important
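
Since the k-means steps above are terse, here's a minimal NumPy sketch I put together afterwards (my own illustration, not Hilary's code); it uses Euclidean distance, but Manhattan or another metric could be swapped in depending on the features.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means: place k centroids at random points, then alternate
    assigning points to their nearest centroid and recomputing the centroids.
    (Empty-cluster handling is omitted for brevity.)"""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance here).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: two obvious blobs in 2-D.
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(pts, k=2)
print(labels, centroids)
```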
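
The Jaccard index mentioned under distance metrics is simple to compute directly; a tiny sketch (the example word sets are invented):

```python
def jaccard_index(a, b):
    """Jaccard index of two sets: |intersection| / |union|.
    0 when the sets are disjoint, 1 when they are identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)

# Word sets from two short (invented) documents share 2 of 4 distinct words.
print(jaccard_index("the big apple".split(), "the big orange".split()))  # 0.5
```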
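
And here's a rough agglomerative (hierarchical) clustering example using SciPy, matching the merge-until-one-item description above and producing the linkage structure a dendrogram is drawn from; the points are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points (invented for illustration).
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# Agglomerative clustering: repeatedly merge the closest items/clusters into
# a new item until only one remains; Z records every merge and its distance.
Z = linkage(pts, method="single")  # 'single' = distance between closest members

# Cut the tree at a manually chosen distance threshold to get flat clusters.
print(fcluster(Z, t=2.0, criterion="distance"))

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree described above:
# the height of each line is the distance at which that merge happened.
```
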
  • Recommendation Systems

    • Special case of an unsupervised clustering problem
      • Define a starting node and want to find similar nodes
    • hcluster (a Python library for hierarchical clustering)
    • bit.ly: recommend up to the second
      • Online / Offline Learning Problem
        • Offline: Two parts to the calculation - one every n minutes, one query at a time (~10 minutes)
        • Online: Take the URL and query the cluster, using Singular Value Decomposition (SVD) – see the rough sketch below
          • Do SVD on only one cluster
          • It’s a hack; not necessarily accurate
    • ML on Streaming Data (depends on problem)
      • Can snapshot and just focus on data for a block of time
      • Other options exist (like not iterating over whole set)
        • Less accurate, but it’s an option
    • References
      • See slides
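
The notes only gesture at the SVD step, so here's a generic sketch of using a truncated SVD of an interaction matrix to find items similar to a seed item; this is my own illustration, not bit.ly's actual pipeline, and the matrix is invented.

```python
import numpy as np

# Toy user-by-item interaction matrix (rows: users, columns: items); invented.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Truncated SVD: keep the top-k singular vectors as a low-rank item embedding.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
item_vecs = Vt[:k].T * s[:k]  # one k-dimensional vector per item

# Rank items by cosine similarity to a "seed" item (item 0 here).
norms = np.linalg.norm(item_vecs, axis=1)
sims = item_vecs @ item_vecs[0] / (norms * norms[0] + 1e-12)
print(np.argsort(-sims))  # item indices, most similar to item 0 first
```
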
  • Conclusion

  • Going from Here
