Posterous theme by Cory Watilo

Filed under: machinelearning

[Notes] Machine Learning: A Workshop by @hmason

I had the great fortune to attend Hilary Mason’s (@hmason) workshop this past Sunday at Strange Loop 2011. I was, in fact, so excited by the opportunity to attend this workshop that I actually got up early Saturday morning and prepared to leave for the hotel when I realized that I still had 24 hours to go–at least I didn’t make it all the way to the hotel before realizing.

I do want to give a big shout out to Hilary for giving me permission to post these notes. Note: I have redacted a few pieces of information that are irrelevant to anyone who was not at the workshop.

Our Goals

  • Clustering — finding groups of related things (unsupervised automation)
  • Named entity disambiguation — “Big Apple” ~ Translation
  • Classification — what language is something in?
  • Recommendations — Amazon, Netflix

Special Kinds of Data

  • (Not getting into today)
  • Geographic information (where do people eat bagels?)
  • Time-series analysis
  • Mathematically the same, applicably different

Black Box Model

Model for Thinking Through Data Problems

  • Obtain
  • Scrub
  • Explore
  • Model (and verify; error bars to know if you’re right)
  • iNterpret

Supervised Learning and Classification

  • Picture of cat and dog — which is cat, which is dog?
  • Assignment of a label to a previously unlabeled piece of data
  • Not really in opposition to unsupervised learning; can use together
  • Examples
    • Spam filters (see public domain Enron email db)
    • Language identification
    • Face detections
  • Places to get data
  • Classifying Text
    • NY Times: [REDACTED] (see arts, sports)
    • Process
      • Parse labeled data
      • Define features of data
      • calculate likely features for each label
      • For new, unlabeled data, predict
    • Trying to create featureset such that one is getting value from data, throwing away the useless
    • Naive Bayes
      • P(A) is the probability that A is true
      • P(True) = 1
      • P(False) = 0
      • 0 <= P(A) <= 1
      • P(A or B) = P(A) + P(B) - P(A and B)
      • p(B|A) = [p(A|B)p(B)]/p(A)
        • If we know the probability of B, and that of A, and that of A given B, we can figure out B versus A
      • Non-bayesian models too (frequentists)
      • Example
        • Population of 10,000
        • 1% have a rare desease
        • Test that is 99% effective
          • 99% of sick pateitnts test positive
          • 99% of healthy patients test negative
        • 50% probability that is one actually sick (99 sick patients test positive, 99 healthy patients test positive)
        • p(sick|test_pos) = [p(test_pos|sick)p(sick)]/p(test_pos) = 99/198 = 1/2
        • Features
          • what test result is
          • we know how many people are sick
      • Naive because we assume each feature is independent
      • Python has a good one built in, using Hilary’s [REDACTED]
      • Put computation cost up-front to train, pickle (serialize) onto disk
      • How to know when probability is
        • Significant
          • Use threshold (e.g., unless 0.8 or higher, not in category)
        • Wrong entirely
      • Feature Analysis — want to reduce featureset as much as possible without losing value in data
        • Punctuation
        • Case
        • Length of words
        • Stopwords – e.g., “the”, “and”, “a” (i.e., drop any word 3 chars or less)
        • Stemming — backing word up to linguistic stems
          • e.g., program = programming = programmer => program
      • Porter Stemming Algorithm
        • M.F. Porter, 1980; English only
        • Best known
      • More data! WordNet
        • There’s an open source, image-based version of WordNet
        • Use to bootstrap for working with smaller featuresets
          • E.g., Twitter text is small. WordNet gives synonyms
      • No clean text? lynx -dump — does a great job of getting text
    • Different types of classification algorithms and data
      • K-Nearest Neighbors
        • Works poorly on text data, pretty well on images
        • Small k complex boundary
        • Large k results in course averaging
        • Can’t have k > data set; also CPU bound
      • K-Means and K-Nearest Neighbors are different, but close
        • K-Means is analog of supervised learning
        • K-Nearest Neighbors assumes you have labels
        • i.e., one is without labels, one is with
    • Confusion Matrix: how do you know when you’ve won
      • Diagram of actual and predicted values

        .           kitten   puppy  penguin [Predicted Values]
          kitten      5       2       0
          puppy       2       4       0
          penguin     0       0       6
          [Actual Values]
      • Can assume penguins have a distinct set of features: little confusion
    • Use scipy for these kinds of analysis
      • Goal of example is to take images and recognize them
      • Goal: OCR (digits)
      • See [REDACTED]
      • Data is in [REDACTED] subdirectory
      • 0 is most often confused with 5
      • White means there were lots of matches (want to see this on (0,0), (1,1), etc.)
      • Sample size
        • Depends on how much discrimination in one’s data set
        • Consider on order of magnitudes: 10,000s of samples if needed to do a pretty good job
      • Interpretability is costly (meaning, why did you recommend this?)
        • news.me does two models
          • run a fast one that is not interpretable
          • have a second one that takes longer to generate and can be used later to trace back why a recommendation was made (but with different results, so not the same results)
    • Boosting
      • Combine weak-learners to create a strong-learner
        • Compute a weighted sum over the weak-learners
        • Very slow but very interpretable
        • Kitchen sink approach: throw every algorithm, featureset you
      • Canonical boosting algorithm: adaboost
        • Google Prediction API: Boosting (but they call it “model selection”)
    • Examples from bit.ly
      • Spam and Malware Identification
        • Every URL shortened is queued and analyzed for malicious characteristics
        • Malicious Features
          • Domain (bloom filter)
          • TLD (.co.cc and .info are high-risk)
          • Randomness of URL (hack: gzip URL and see how long it is)
          • Words in HTML title
          • Number of redirections
          • is_spam is dependent on threshold on bit.ly API
        • Topic Identification (e.g., what percent is about the weather?)
          • Find the relationship between these topics
          • Can now put in URL and get very fast output of category predictions
          • Wikipedia is the dirty secret for seeding most things
    • D3 is awesome visualization library
      • d3py — Python objects to JavaScript for D3
    • Sometimes metadata is better than real data
      • What spoken languages are in a page?
        • bit.ly sees the user who clicks on the link, knows their browser locales
        • Use this to know what locales clicking a link are most present
        • Use entropy calculation
    • References
      • See slides

Unsupervised Learning and Clustering

  • Clustering
    • Find similar things when you know nothing about what you’re looking at
    • Parametric vs. Non-Parametric
    • Agglomerative Clustering
      • Find each point, find closest point
      • Merge together until one has a cluster you want
      • Manually set thresholds
    • K-Means: Canonical Clustering Algorithm
      • Decide on number of clusters
      • Randomly place centroids
      • iterate until clusters are formed
      • Iterate until convergence
      • Keep iterating: eventually you’ll get something more accurate and useful
      • Distance
        • Manhattan distance (city-block distance or taxi cab geometry)
        • Depending on features, it can be better than
        • Jacard Index
          • Good for text data (or data where there is not a good mapping to Euclidian space)
          • 0 when sets are disjoint
          • 1 when sets are identical
        • Consider absolute value (trying to measure similarity)
        • If it’s intuitive (e.g., 2- or 3-D space), perhaps use Euclidian
          • Unsure, usually best to ask someone who knows.
      • Technique
        • Unsupervised approach at first to try to begin getting clusters
        • Follow up with a supervised approach
    • VariationL k-mediods (see slide)
    • Hierarchical clustering
      • Agglomerative
      • Combine closest items into a new item and repeat until there is just one item
    • “A Grand Tour of the Data” — flash graphs from running different algorithms in front of you; manual intervention (get some popcorn)
    • Dengigram (“clusted URLs”) — typical graph used to represent hierarchical clustering
      • Links together things that are closetogether (have lines between them)
      • Height of line means at what round of algorithm they were agglomerated
    • Build an understanding of what’s normal based on clustering over time; this then lets you know when something isn’t normal.
      • E.g., drastic gains or drops from norm indicate something important
  • Recommendation Systems
    • Special case of an unsupervised clustering problem
      • Define a starting node and want to find similar nodes
    • hcluster
    • bit.ly: recommend up to the second
      • Online / Offline Learning Problem
        • Offline: Two parts to the calculation – one every n minutes, one query at a time (~10 minutes)
        • Online: Take the URL and query the cluster. Using Single Value Decomposition (SVD)
          • Do SVD on only one cluster
          • It’s a hack; not necessarily accurate
    • ML on Streaming Data (depends on problem)
      • Can snapshot and just focus on data for a block of time
      • Other options exist (like not iterating over whole set)
        • Less accurate, but it’s an option
    • References
      • See slides
  • Conclusion

    • Don’t use R in production!
      • Once your data hits memory bound, you’re screwed.
      • No mind paid to programming aesthetics
    • How do you know if you won?
      • Supervised
        • Save some of labeled data for a test. Also a good way to test if your data needs to be retrained.
      • Unsupervised
        • E.g., looking for realtime recommendations of links.
  • Going from Here

    • Book: Toby Segerand: Programming Collective Intelligence by O'Reilly
    • Book: Pattern Recognition and Machine Learning by Christopher Bishop
    • Book: Reinforcement Learning by Tom Mitchell
      • Tom is considered founder of ML field
    • Class: (Online) Stanford Machine Learning
    • Class: (Online) Stanford AI
    • Dataists