[Notes] Machine Learning: A Workshop by @hmason
2011-09-21

I had the great fortune to attend Hilary Mason’s (@hmason) workshop this past Sunday at Strange Loop 2011. I was, in fact, so excited by the opportunity to attend this workshop that I actually got up early Saturday morning and was preparing to leave for the hotel when I realized that I still had 24 hours to go. At least I didn’t make it all the way to the hotel before realizing.
I do want to give a big shout out to Hilary for giving me permission to post these notes. Note: I have redacted a few pieces of information that are irrelevant to anyone who was not at the workshop.
Our Goals
- Clustering – finding groups of related things (unsupervised)
- Named entity disambiguation – “Big Apple” ~ Translation
- Classification – what language is something in?
- Recommendations – Amazon, Netflix
Special Kinds of Data
- (Not getting into today)
- Geographic information (where do people eat bagels?)
- Time-series analysis
- Mathematically the same, but applied differently
Black Box Model
Model for Thinking Through Data Problems
- _O_btain
- _S_crub
- _E_xplore
- _M_odel (and verify; error bars to know if you’re right)
- i_N_terpret
Supervised Learning and Classification
- Picture of cat and dog – which is cat, which is dog?
- Assignment of a label to a previously unlabeled piece of data
- Not really in opposition to unsupervised learning; can use together
- Examples
- Spam filters (see public domain Enron email db)
- Language identification
- Face detection
- Places to get data
- Classifying Text
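To make the language-identification example concrete, here is a toy character-bigram classifier. The training snippets and the scoring scheme are invented stand-ins for illustration, not anything shown in the workshop:

```python
# Minimal character-bigram language identifier, illustrating supervised
# classification: score new text against labeled training samples.
from collections import Counter

TRAINING = {  # tiny hypothetical corpus
    "english": "the quick brown fox jumps over the lazy dog",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso",
}

def profile(text):
    """Relative character-bigram frequencies of a text."""
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(bigrams.values())
    return {b: c / total for b, c in bigrams.items()}

def classify(text):
    """Label text with the training language whose profile overlaps it most."""
    target = profile(text)
    def score(lang):
        ref = profile(TRAINING[lang])
        # dot product of the two bigram distributions
        return sum(target[b] * ref.get(b, 0.0) for b in target)
    return max(TRAINING, key=score)

print(classify("the dog jumps"))   # english
print(classify("el perro salta"))  # spanish
```

A real system would use many more features (and smoothing), but the shape is the same: labeled examples in, a label for unlabeled data out.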
Unsupervised Learning and Clustering
Clustering
- Find similar things when you know nothing about what you’re looking at
- Parametric vs. Non-Parametric
- Agglomerative Clustering
- For each point, find the closest point
- Merge the closest pairs together until you have the clusters you want
- Manually set thresholds
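A minimal sketch of the merge-until-threshold idea on 1-D points (the data and the threshold value are made up for illustration):

```python
# Toy single-linkage agglomerative clustering on 1-D points:
# repeatedly merge the two closest clusters until the closest pair
# is farther apart than a manually set threshold.
def agglomerate(points, threshold):
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > 1:
        # find the closest pair of clusters (min distance between members)
        best = None
        for i in range(len(clusters) - 1):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # the manually set threshold stops the merging
            break
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], threshold=1.0))
# [[1.0, 1.2], [5.0, 5.1], [9.0]]
```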
- K-Means: Canonical Clustering Algorithm
- Decide on number of clusters
- Randomly place centroids
- Assign each point to its nearest centroid, then recompute the centroids
- Iterate until convergence
- Keep iterating: eventually you’ll get something more accurate and useful
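The steps above can be sketched in a few lines; this is a toy 1-D version with made-up data, not production k-means:

```python
# Minimal 1-D k-means: pick k, place centroids randomly,
# assign points to the nearest centroid, recompute, iterate to convergence.
import random

def kmeans(points, k, iterations=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)  # randomly place centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new = [sum(c) / len(c) if c else centroids[i]  # recompute centroids
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged: assignments stopped changing
            break
        centroids = new
    return sorted(centroids)

print(kmeans([1.0, 1.1, 0.9, 5.0, 5.2, 4.8], k=2))
# two centroids, near 1.0 and 5.0
```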
- Distance
- Manhattan distance (city-block distance or taxi cab geometry)
- Depending on the features, it can be better than Euclidean distance
- Jaccard Index
- Good for text data (or data where there is not a good mapping to Euclidean space)
- Ranges from 0 when the sets are disjoint to 1 when the sets are identical
- Consider absolute value (trying to measure similarity)
- If it’s intuitive (e.g., 2- or 3-D space), perhaps use Euclidean distance
- If unsure, it’s usually best to ask someone who knows.
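The two measures mentioned above are each a one-liner; here is a small sketch (the example inputs are invented):

```python
# Manhattan distance for numeric vectors, Jaccard index for sets
# (e.g., the word sets of two documents).
def manhattan(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard(a, b):
    """|A intersect B| / |A union B|: 0 for disjoint sets, 1 for identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

print(manhattan((0, 0), (3, 4)))                              # 7
print(jaccard("the big apple".split(), "the apple".split()))  # ≈ 0.667
```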
- Technique
- Unsupervised approach at first to try to begin getting clusters
- Follow up with a supervised approach
- Variation: k-medoids (see slide)
- Hierarchical clustering
- Agglomerative
- Combine closest items into a new item and repeat until there is just one item
- “A Grand Tour of the Data” – flash graphs from running different algorithms in front of you; manual intervention (get some popcorn)
- Dendrogram (“clustered URLs”) – typical graph used to represent hierarchical clustering
- Links together things that are close together (have lines between them)
- Height of a line indicates the round of the algorithm at which the items were agglomerated
- Build an understanding of what’s normal based on clustering over time; this then lets you know when something isn’t normal.
- E.g., drastic gains or drops from norm indicate something important
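The “height” idea can be sketched by recording the distance at which each merge happens; this is a toy single-linkage run on made-up 1-D points, not the hcluster library:

```python
# Record (height, merged cluster) for every agglomeration round:
# the heights are what a dendrogram draws on its vertical axis.
def linkage_1d(points):
    clusters = [(p,) for p in sorted(points)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters) - 1):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters.pop(j)
        clusters[i] = merged
        merges.append((d, merged))  # later rounds have greater heights
    return merges

for height, cluster in linkage_1d([1.0, 1.2, 5.0, 9.0]):
    print(height, cluster)
```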
Recommendation Systems
- Special case of an unsupervised clustering problem
- Define a starting node and want to find similar nodes
- hcluster
- bit.ly: recommend up to the second
- Online / Offline Learning Problem
- Offline: two parts to the calculation – one runs every n minutes, one query at a time (~10 minutes)
- Online: Take the URL and query the cluster, using Singular Value Decomposition (SVD)
- Do SVD on only one cluster
- It’s a hack; not necessarily accurate
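A hedged sketch of the SVD idea: factor a small click matrix and use the low-rank item vectors to find URLs similar to a query URL. The matrix, names, and rank choice are invented for illustration; this is not bit.ly’s actual pipeline.

```python
# Low-rank SVD of a (made-up) user x URL click matrix, then
# cosine similarity between URL latent vectors for recommendations.
import numpy as np

clicks = np.array([  # rows: users; columns: URLs A..D (hypothetical)
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(clicks, full_matrices=False)
item_vecs = Vt[:2].T  # keep 2 latent dimensions per URL

def most_similar(url_index):
    """Index of the URL most similar (by cosine) to the query URL."""
    q = item_vecs[url_index]
    sims = item_vecs @ q / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(q) + 1e-12)
    sims[url_index] = -1.0  # exclude the query itself
    return int(np.argmax(sims))

print(most_similar(0))  # 1: URL B, co-clicked with URL A
```

Doing the SVD on only one cluster (as in the notes) just means `clicks` holds the URLs of that cluster rather than the whole corpus.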
- ML on Streaming Data (depends on problem)
- Can snapshot and just focus on data for a block of time
- Other options exist (like not iterating over whole set)
- Less accurate, but it’s an option
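The “don’t iterate over the whole set” option can be illustrated with the simplest online statistic, an incrementally updated mean; this is my own toy example, not code from the workshop:

```python
# Online mean over streaming data: one pass, O(1) memory,
# never revisits earlier points (the trade-off is flexibility, not accuracy here).
class StreamingMean:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental mean update
        return self.mean

m = StreamingMean()
for x in [2.0, 4.0, 6.0]:
    m.update(x)
print(m.mean)  # 4.0
```

The same pattern (update a summary per arriving point instead of re-running a batch algorithm) is what streaming variants of clustering do with centroids.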
- References
Conclusion
Going from Here