In the previous post, the content of each webpage was processed into bag-of-words format, which is suitable for machine learning. The algorithm chosen for the labeled bag-of-words data is RandomForestClassifier from sklearn.ensemble. The code and the results are on my GitHub.

Use LabelEncoder from sklearn.preprocessing to transform the labels into integers. Use cross_val_score from sklearn.model_selection (sklearn.cross_validation in scikit-learn releases before 0.18) to evaluate the forest. f1_score from sklearn.metrics, called with average=None, reports the F1 score of each class.
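The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the toy documents, the label names, the use of CountVectorizer to build the bag of words, the 2-fold split, and the choice of cross_val_predict to get out-of-fold predictions for f1_score are all assumptions made for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy corpus standing in for the webpage text from the previous post.
docs = [
    "python pandas dataframe tutorial code",
    "python function class module code",
    "python numpy array code example",
    "python script import package code",
    "football match goal team league",
    "tennis match player team score",
    "basketball game team player score",
    "football player league goal score",
    "recipe oven bake flour sugar",
    "recipe pan fry onion garlic",
    "recipe boil pasta sauce garlic",
    "recipe bake bread flour yeast",
]
labels = ["python"] * 4 + ["sports"] * 4 + ["cooking"] * 4

# Bag of words: one row per document, one column per vocabulary word.
X = CountVectorizer().fit_transform(docs)

# LabelEncoder maps string labels to integers (classes are sorted alphabetically).
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validated accuracy, one score per fold.
scores = cross_val_score(forest, X, y, cv=2)
print("accuracy per fold:", scores)

# Out-of-fold predictions, then one F1 score per class.
y_pred = cross_val_predict(forest, X, y, cv=2)
per_class_f1 = f1_score(y, y_pred, average=None)
for name, score in zip(encoder.classes_, per_class_f1):
    print(name, round(float(score), 3))
```

Passing average=None is what makes f1_score return one value per class instead of a single aggregate, and encoder.classes_ gives the label names in the same order as those values.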

To visualize the data, TSNE is used to reduce the dimensionality of the data. The colors for the multiple classes come from seaborn.color_palette("hls", 12). Details on how to use color_palette can be found in seaborn's documentation.
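A rough sketch of that visualization step is below. It assumes TSNE means sklearn.manifold.TSNE and that the target is two dimensions; the random feature matrix, the perplexity value, and the output filename tsne.png are placeholders for illustration, not values from the original code.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.manifold import TSNE

n_classes = 12

# Hypothetical stand-in for the bag-of-words features and integer labels.
rng = np.random.RandomState(0)
X = rng.rand(60, 20)
y = np.arange(60) % n_classes

# Reduce the features to 2 dimensions; perplexity must be below the sample count.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

# One evenly spaced "hls" color per class, looked up by integer label.
palette = sns.color_palette("hls", n_classes)
point_colors = [palette[label] for label in y]

fig, ax = plt.subplots()
ax.scatter(embedding[:, 0], embedding[:, 1], c=point_colors)
ax.set_title("t-SNE projection colored by class")
fig.savefig("tsne.png")  # hypothetical output path
```

The "hls" palette spaces hues evenly around the color wheel, which helps keep 12 classes distinguishable in one scatter plot.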