[Notes] Machine Learning: A Workshop by @hmason
I had the great fortune to attend Hilary Mason’s (@hmason) workshop this past Sunday at Strange Loop 2011. I was, in fact, so excited by the opportunity to attend this workshop that I actually got up early Saturday morning and prepared to leave for the hotel when I realized that I still had 24 hours to go–at least I didn’t make it all the way to the hotel before realizing.
I do want to give a big shout out to Hilary for giving me permission to post these notes. Note: I have redacted a few pieces of information that are irrelevant to anyone who was not at the workshop.
Our Goals
- Clustering — finding groups of related things (unsupervised automation)
- Named entity disambiguation — “Big Apple” ~ Translation
- Classification — what language is something in?
- Recommendations — Amazon, Netflix
Special Kinds of Data
- (Not getting into today)
- Geographic information (where do people eat bagels?)
- Time-series analysis
- Mathematically the same, applicably different
Black Box Model
- Google Prediction API — free; runs model selection on data to automatically determine best algorithm
Model for Thinking Through Data Problems
- Obtain
- Scrub
- Explore
- Model (and verify; error bars to know if you’re right)
- iNterpret
Supervised Learning and Classification
- Picture of cat and dog — which is cat, which is dog?
- Assignment of a label to a previously unlabeled piece of data
- Not really in opposition to unsupervised learning; can use together
- Examples
- Spam filters (see public domain Enron email db)
- Language identification
- Face detections
- Places to get data
- Data Source Handbook
- Hilary’s Bundle of Links
- NY Time’s Data: human checked, so clean and accurate
- View source on article: tons of accurate data in `` tags
- Also in their API
- Classifying Text
- NY Times:
[REDACTED](seearts,sports) - Process
- Parse labeled data
- Define features of data
- calculate likely features for each label
- For new, unlabeled data, predict
- Trying to create featureset such that one is getting value from data, throwing away the useless
- Naive Bayes
P(A)is the probability that A is trueP(True) = 1P(False) = 00 <= P(A) <= 1P(A or B) = P(A) + P(B) - P(A and B)p(B|A) = [p(A|B)p(B)]/p(A)- If we know the probability of B, and that of A, and that of A given B, we can figure out B versus A
- Non-bayesian models too (frequentists)
- Example
- Population of 10,000
- 1% have a rare desease
- Test that is 99% effective
- 99% of sick pateitnts test positive
- 99% of healthy patients test negative
- 50% probability that is one actually sick (99 sick patients test positive, 99 healthy patients test positive)
p(sick|test_pos) = [p(test_pos|sick)p(sick)]/p(test_pos) = 99/198 = 1/2- Features
- what test result is
- we know how many people are sick
- Naive because we assume each feature is independent
- Python has a good one built in, using Hilary’s
[REDACTED] - Put computation cost up-front to train, pickle (serialize) onto disk
- How to know when probability is
- Significant
- Use threshold (e.g., unless 0.8 or higher, not in category)
- Wrong entirely
- Significant
- Feature Analysis — want to reduce featureset as much as possible
without losing value in data
- Punctuation
- Case
- Length of words
- Stopwords – e.g., “the”, “and”, “a” (i.e., drop any word 3 chars or less)
- Stemming — backing word up to linguistic stems
- e.g.,
program = programming = programmer => program
- e.g.,
- Porter Stemming Algorithm
- M.F. Porter, 1980; English only
- Best known
- More data! WordNet
- There’s an open source, image-based version of WordNet
- Use to bootstrap for working with smaller featuresets
- E.g., Twitter text is small. WordNet gives synonyms
- No clean text?
lynx -dump— does a great job of getting text
- Different types of classification algorithms and data
- K-Nearest Neighbors
- Works poorly on text data, pretty well on images
- Small
kcomplex boundary - Large
kresults in course averaging - Can’t have
k > data set; also CPU bound
- K-Means and K-Nearest Neighbors are different, but close
- K-Means is analog of supervised learning
- K-Nearest Neighbors assumes you have labels
- i.e., one is without labels, one is with
- K-Nearest Neighbors
- Confusion Matrix: how do you know when you’ve won
Diagram of actual and predicted values
. kitten puppy penguin [Predicted Values] kitten 5 2 0 puppy 2 4 0 penguin 0 0 6 [Actual Values]
- Can assume penguins have a distinct set of features: little confusion
- Use scipy for
these kinds of analysis
- Goal of example is to take images and recognize them
- Goal: OCR (digits)
- See
[REDACTED] - Data is in
[REDACTED]subdirectory 0is most often confused with5- White means there were lots of matches (want to see this on (0,0), (1,1), etc.)
- Sample size
- Depends on how much discrimination in one’s data set
- Consider on order of magnitudes: 10,000s of samples if needed to do a pretty good job
- Interpretability is costly (meaning, why did you recommend this?)
- news.me does two models
- run a fast one that is not interpretable
- have a second one that takes longer to generate and can be used later to trace back why a recommendation was made (but with different results, so not the same results)
- news.me does two models
- Boosting
- Combine weak-learners to create a strong-learner
- Compute a weighted sum over the weak-learners
- Very slow but very interpretable
- Kitchen sink approach: throw every algorithm, featureset you
- Canonical boosting algorithm: adaboost
- Google Prediction API: Boosting (but they call it “model selection”)
- Combine weak-learners to create a strong-learner
- Examples from bit.ly
- Spam and Malware Identification
- Every URL shortened is queued and analyzed for malicious characteristics
- Malicious Features
- Domain (bloom filter)
- TLD (.co.cc and .info are high-risk)
- Randomness of URL (hack: gzip URL and see how long it is)
- Words in HTML title
- Number of redirections
is_spamis dependent on threshold on bit.ly API
- Topic Identification (e.g., what percent is about the weather?)
- Find the relationship between these topics
- Can now put in URL and get very fast output of category predictions
- Wikipedia is the dirty secret for seeding most things
- Spam and Malware Identification
- D3 is awesome visualization library
- d3py — Python objects to JavaScript for D3
- Sometimes metadata is better than real data
- What spoken languages are in a page?
- bit.ly sees the user who clicks on the link, knows their browser locales
- Use this to know what locales clicking a link are most present
- Use entropy calculation
- What spoken languages are in a page?
- References
- See slides
- NY Times:
Unsupervised Learning and Clustering
- Clustering
- Find similar things when you know nothing about what you’re looking at
- Parametric vs. Non-Parametric
- Agglomerative Clustering
- Find each point, find closest point
- Merge together until one has a cluster you want
- Manually set thresholds
- K-Means: Canonical Clustering Algorithm
- Decide on number of clusters
- Randomly place centroids
- iterate until clusters are formed
- Iterate until convergence
- Keep iterating: eventually you’ll get something more accurate and useful
- Distance
- Manhattan distance (city-block distance or taxi cab geometry)
- Depending on features, it can be better than
- Jacard Index
- Good for text data (or data where there is not a good mapping to Euclidian space)
0when sets are disjoint1when sets are identical
- Consider absolute value (trying to measure similarity)
- If it’s intuitive (e.g., 2- or 3-D space), perhaps use Euclidian
- Unsure, usually best to ask someone who knows.
- Technique
- Unsupervised approach at first to try to begin getting clusters
- Follow up with a supervised approach
- VariationL k-mediods (see slide)
- Hierarchical clustering
- Agglomerative
- Combine closest items into a new item and repeat until there is just one item
- “A Grand Tour of the Data” — flash graphs from running different algorithms in front of you; manual intervention (get some popcorn)
- Dengigram (“clusted URLs”) — typical graph used to represent
hierarchical clustering
- Links together things that are closetogether (have lines between them)
- Height of line means at what round of algorithm they were agglomerated
- Build an understanding of what’s normal based on clustering over time;
this then lets you know when something isn’t normal.
- E.g., drastic gains or drops from norm indicate something important
- Recommendation Systems
- Special case of an unsupervised clustering problem
- Define a starting node and want to find similar nodes
- hcluster
- bit.ly: recommend up to the second
- Online / Offline Learning Problem
- Offline: Two parts to the calculation – one every
nminutes, one query at a time (~10 minutes) - Online: Take the URL and query the cluster. Using Single
Value Decomposition (SVD)
- Do SVD on only one cluster
- It’s a hack; not necessarily accurate
- Offline: Two parts to the calculation – one every
- Online / Offline Learning Problem
- ML on Streaming Data (depends on problem)
- Can snapshot and just focus on data for a block of time
- Other options exist (like not iterating over whole set)
- Less accurate, but it’s an option
- References
- See slides
- Special case of an unsupervised clustering problem
Conclusion
- Don’t use R in production!
- Once your data hits memory bound, you’re screwed.
- No mind paid to programming aesthetics
- How do you know if you won?
- Supervised
- Save some of labeled data for a test. Also a good way to test if your data needs to be retrained.
- Unsupervised
- E.g., looking for realtime recommendations of links.
- Supervised
- Don’t use R in production!
Going from Here
- Book: Toby Segerand: Programming Collective Intelligence by O'Reilly
- Book: Pattern Recognition and Machine Learning by Christopher Bishop
- Book: Reinforcement Learning by Tom Mitchell
- Tom is considered founder of ML field
- Class: (Online) Stanford Machine Learning
- Class: (Online) Stanford AI
- Dataists