2011-12-08

Yahoo!のニュースコメント欄からスパムを排除するには

研究

論文紹介のコーナー．*1
今回紹介するのは，KDD'2011のUnbiased Online Active Learning in Data Streams (Wei Chu, Martin Zinkevich, Lihong Li, Achint Thomas, and Belle Tseng)．
Yahoo! Labsのグループによる研究です．(その後，第一著者はMicrosoftへ移動しています)

本論文は，ユーザーがコンテンツを生成できるウェブサービスから効率的にスパムやアダルトコンテンツを排除する手法について提案されています．
このようなサービス形態はUser-Generated Content(UGC)と呼ばれ，ニュースサイトのコメント欄や掲示板・SNS・ソーシャルゲーム・ユーザー投稿型動画サイトが主な例として挙げられます．

*1:院の論文輪講講義でこの論文を紹介したので，ブログにも載せてみました．

2011-12-08

適当な仕事をしている注釈者を発見せよ！

研究

NIPS2011の論文を紹介していくコーナー．
今回対象とする論文は，Ranking annotators for crowdsourced labeling tasks．

概要

本論文は，標本のラベル付を複数人にしてもらう時に，標本をきちんと見ず，適当な注釈を行なっている人を見付け出すためのスコアリングを提案しています．
Mechanical Turk等のクラウドソーシングで今後必要になりそうなテーマですね．
本論文では，このような適当な注釈者をスパマーと呼び，スパマーを効率的に見つけ出すためのランキング手法を提案しています．

手法

2クラスの場合と多クラスの場合について議論していますが，今回は2クラスの場合を簡単に紹介します．
スパマーはコイン投げと同じようにラベルを選んでいるため，P(注釈者のラベル|真のラベル)が0.5になります．[α,β]
一方で，仕事が出来る注釈者は上の確率が1に，悪意を持った注釈者は確率が0に近づきます．
この性質を利用し，スパマーはスコアが0に，その他の注釈者は1に値が近づく様なスコアリング手法を設計したのが(3)式となります．
- 今回はバイアスのないコイン投げを仮定しましたが，(3)式の定義からバイアスを持ったコイン投げの場合も同様にスコアが0に近づきます．

ここで，悪意のある注釈者と仕事のできる注釈者のスコアが同様に1に近づくことに疑問を持つところですが，悪意のある注釈者がいた場合，その注釈者のラベルを反転させれば有用な情報となりえます．
- 最終的な情報価値では，スパマーの方がいらない子，となるわけです．

スコアリングの定義をした後に，真のラベルyの値，α,βの値，そして(3)式のスコアSの値を求めるため，EMアルゴリズムを用います．
論文中にはEMアルゴリズムの定式化について記載していませんが，Learning from Crowds(MLR 2010)と同じ定式化となります．
1. E-step : スコアに基づき各注釈者を重み付けして，真のラベルyを改善
2. M-step : 真のラベルyから，新たな条件付き確率α,β，そしてスコアSを再計算

実験

人工データとMechanical Turkで実験しており，スパマーを上手く抽出出来ていることを確認しています．

感想

モデル化は，Learning from Crowdsの素直な改良．
とにかく読みやすい．

2011-11-14

WSDM 2012気になった論文リスト

WSDM2012のAccepted Paperが公開されています． http://wsdm2012.org/program/overview.html
最近，気になった論文リストしか書いていないですね・・．
何かエントリ書きます．

How to Win Friends and Influence People, Truthfully: Influence Maximization Mechanisms for Social Networks,
- Yaron Singer

Find Me Opinion Sources in Blogosphere: Opinionated Blog Feed Retrieval,
- Xueke Xu, Songbo Tan, Yue Liu and Xueqi Cheng

Learning Evolving and Emerging Topics in Social Media: A Dynamic NMF approach with Temporal Regularization,
- Ankan Saha and Vikas Sindhwani

Adding Semantics to Microblog Posts,
- Edgar Meij, Wouter Weerkamp and Maarten De Rijke

A Large-Scale Sentiment Analysis for Yahoo! Answers,
- Onur Kucuktunc, B. Barla Cambazoglu, Ingmar Weber and Hakan Ferhatosmanoglu

What's in a Hashtag? Content based Prediction of the Spread of Ideas in Microblogging Communities,
- Oren Tsur and Ari Rappoport

Scalable Inference in Latent Variable Models,
- Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy and Alex Smola

The Life and Death of Online Groups: Predicting Group Growth and Longevity,
- Sanjay Kairam, Dan Wang and Jure Leskovec

2011-09-16

NIPS2011気になった論文リスト

機械学習研究

NIPS2011のAccepted Papersが公開されました。（まだタイトルのみですが）
- http://nips.cc/Conferences/2011/Program/accepted-papers.php
いつもどおり、備忘録です。
Active Learning, Crowd, Submodular, Manifoldといったキーワードが流行しているように見えます。
まだタイトルを眺めただけですが、NIPSは良い論文が多いですね…。

A Collaborative Mechanism for Crowdsourcing Prediction Problems

J. Abernethy, R. Frongillo

A Convergence Analysis of Log-Linear Training

S. Wiesler, H. Ney

Active Classification based on Value of Classifier

T. Gao, D. Koller

An Annealing Technique for Selecting Good Citizens on Manifolds

N. Shroff, P. Turaga, R. Chellappa
この論文タイトルをつけるセンスを見習いたい

A Non-Parametric Approach to Dynamic Programming

O. Kroemer, J. Peters

Beating SGD: Learning SVMs in Sublinear Time

E. Hazan, T. Koren, N. Srebro
必読。

Better Mini-Batch Algorithms via Accelerated Gradient Methods

K. Sridharan, O. Shamir, A. Cotter, N. Srebro

Composite Multiclass Losses

R. Williamson, E. Vernet, M. Reid

Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

M. Schmidt, N. Le Roux, F. Bach

Distributed Delayed Stochastic Optimization

A. Agarwal, J. Duchi

Divide-and-Conquer Matrix Factorization

L. Mackey, A. Talwalkar, M. Jordan
分散学習とかに適用できそうなタイトル…

Efficient Online Learning via Randomized Rounding

N. Cesa-Bianchi, O. Shamir
Sublinearといい、乱択系のが多い気がしますね

Fast and Accurate k-means For Large Datasets

M. Shindler, A. Meyerson, A. Wong
ICMLでいうNB枠…？

From Bandits to Experts: On the Value of Side-Observations

S. Mannor, O. Shamir

High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

P. Loh, M. Wainwright

Improved Algorithms for Linear Stochastic Bandits

Y. Abbasi-yadkori, D. Pal, C. Szepesvari

Large-Scale Sparse Principal Component Analysis with Application to Text Data

Y. Zhang, L. Ghaoui
応用面では、これが気になる

Learning Anchor Planes for Classification

Z. Zhang, L. Ladicky, A. Saffari, P. Torr
必読。

Learning large-margin halfspaces with more malicious noise

P. Long, R. Servedio

Lower Bounds for Passive and Active Learning

M. Raginsky, S. Rakhlin
Passive Learning…？

Multiclass Boosting: Theory and Algorithms

M. Saberian, N. Vasconcelos

Nearest Neighbor based Greedy Coordinate Descent

A. Tewari, P. Ravikumar, I. Dhillon

Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

E. Hazan, S. Kale
Perceptron一族に新たな仲間

Online Learning: Stochastic, Constrained, and Smoothed Adversaries

S. Rakhlin, K. Sridharan, A. Tewari

On the Universality of Online Mirror Descent

K. Sridharan, N. Srebro, A. Tewari

Sparse Features for PCA-Like Linear Regression

M. Magdon-Ismail, C. Boutsidis, P. Drineas

Trace Lasso: a trace norm regularization for correlated designs

E. Grave, G. Obozinski, F. Bach

2011-08-19

ECML/PKDD2011気になった論文リスト

自分用メモ．
当日，チェックしておきたいAccepted Papersを以下に纏めておきます．
ペーパーが公開されているものは，アブストをナナメ読みした感想を簡単に書いています．

Frequency-aware Truncated methods for Sparse Online Learning

Hidekazu Oiwa, Shin Matsushima, Hiroshi Nakagawa
手前味噌ですが，自分達の論文．

Active learning with evolving streaming data

Indrė Žliobaitė, Albert Bifet, Bernhard Pfahringer, Geoff Holmes
ストリームデータ環境での能動学習．

Manifold Coarse Graining for Online Semi-Supervised Learning

Mehrdad Farajtabar, Amirreza Shaban, Hamid Rabiee, Mohammad Hossein Rohban
初見では全くわからない．既存手法より少ないデータ数で，多様体の構造を保ったままで最適解を求めることが可能なアルゴリズムを提案しているらしい．

A boosting approach to multiview classification with cooperation

Sokol Koço, Cécile Capponi
Multiview Classificationに対するboosting手法．

On the Stratification of Multi-Label Data

Konstantinos Sechidis, Grigorios Tsoumakas, Ioannis Vlahavas
stratification samplingをマルチラベルなデータに適応する手法，及びその性能を評価．

Compact Coding for Hyperplane Classifiers in Heterogeneous Environment

Hao Shao, Bin Tong, Einoshin Suzuki

Fast approximate text document clustering using Compressive Sampling

Laurence A. F. Park

Is there a best quality metric for graph clusters?

Hélio Almeida, Dorgival Guedes, Wagner Meira Jr, Mohammed Zaki

Active Supervised Domain Adaptation

Avishek Saha, Piyush Rai, Hal Daumé III, Suresh Venkatasubramanian, Scott DuVall
能動学習とドメイン適応の融合．

Multi-Label Ensemble Learning

Chuan Shi, Xiangnan Kong, Philip S. Yu, Bai Wang

Fast projections onto L1,q-norm balls for grouped feature selection

Suvrit Sra

Feature Selection for Transfer Learning

Selen Uguroglu, Jaime Carbonell

Influence and Passivity in Social Media

Daniel M. Romero, Wojciech Galuba, Sitaram Asur, Bernardo A. Huberman

Learning Recommendations in Social Media Systems By Weighting Multiple Relations

Boris Chidlovskii

Toward a Fair Review-Management System

Theodoros Lappas, Evimaria Terzi

Adaptive Boosting for Transfer Learning using Dynamic Updates

Samir Al-Stouhi, Chandan K. Reddy

Datum-Wise Classification: A Sequential Approach to Sparsity

Gabriel Dulac-Arnold, Ludovic Denoyer, Philippe Preux, Patrick Gallinari
L0正則化を導入したempirical riskを最小化する問題を解く（実際に解くのは緩和したもの）．アルゴリズムは各データに関して，予測の際に用いる特徴集合をあらかじめ抽出する形となる．
一押し．

Transfer Learning With Adaptive Regularizers

Ulrich Rückert, Marius Kloft

Analyzing Word Frequencies in Large Text Corpora using Inter-arrival Times and Bootstrapping

Jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamäki, Heikki Mannila

A Game Theoretic Framework for Data Privacy Preservation in Recommender Systems

Maria Halkidi, Iordanis Koutsopoulos

Linear Discriminant Dimensionality Reduction

Quanquan Gu, Zhenhui Li, Jiawei Han

2011-08-05

PRankを実装しました

ランキング学習

ランキング学習の一手法であるPRankを実装しました．
PRankはPerceptronに似たアルゴリズムであり，実装も非常に簡単です．

ソースコード

github

解説

ランキング学習及びPRankの解説は，先日のDSIRNLPで発表された以下の記事が詳しいです．
DSIRNLP#1で「ランキング学習ことはじめ」を発表しました - シリコンの谷のゾンビ

PRankは，1文書の特徴量と対応するランク情報のみを用いてパラメータ更新を行うPointwise手法の一種です．
(2文書のペアを用いてパラメータ更新を行うのがPairwise，1つのクエリに対するランキング情報を全て用いてパラメータ更新を行うのがListwise手法)
PRankでは，重みベクトルとランキング順位数と同じ数のしきい値を用意し，重みベクトルと特徴ベクトルとの内積の値を初めて超えるしきい値の番号を予測値として出力します．
予測が間違っていた場合重みベクトルとしきい値の両方を更新し，よりよい予測を実現します．
Perceptronにしきい値という概念を導入し，順序関係を持つ多値分類アルゴリズムへと拡張したのがPRankという見方も可能です．

実験結果

以下のような簡単なテストデータを作成して，実験を行ないました．

4 1:1 2:4 3:6
4 1:2 2:3 3:5
3 1:3 2:4 3:4
3 1:6 2:1 3:5
2 1:5 2:4 3:3
2 1:4 2:3 3:3
1 1:8 2:2 3:1
1 1:7 2:5
0 1:10 2:2
0 1:12 2:3 3:1

データはlibsvm形式で記述しています．
（各行がひとつの文書を表し，一番左の数値が <特徴id>:<特徴値> <特徴id>:<特徴値>...という形式）
特徴id 1の特徴は，rankingが低いほど多く，id 3の特徴はrankingが高いほど多いデータになっています．

実験結果は，学習されたパラメータは以下の通り．

重みベクトル : -18 -6 29
しきい値 : -88 -27 29 78 inf

想定通りの挙動を示しています．
もう少し大きなデータセットで試してみたいですね．

2011-08-03

デマをデマと見抜けない人はTwitterを使うのは難しい

研究

Twitterにおけるデマ検出手法を論じた研究が，ついにEMNLP2011に出てきたので紹介します．
論文：Rumor has it: Identifying Misinformation in Microblogs[Qazvinian et al., 2011]
Twitter上のデマに関する興味深い統計情報も幾つか含まれているので，興味のある方は一読されると良いかと思います．

概要

噂と噂に関連するツイートを検出すると同時に，その噂の信頼度を推定
様々な特徴量を用いて実験
ツイートの文面を使って分類器を作るだけで，高い精度が実現可能！
- ただし，アノテートされたツイートを教師データとして使用

背景

マイクロブログ上で噂は急速に広まる
デマや誤情報は，企業にとって大きな障害となりうるので自動で特定したい
この研究では，以下の手順でデマや誤情報を検出する
- 特定の噂に関して言及しているツイートを網羅的に取得 [Rumor Retrieval]
- 噂をどのくらいの割合の人が信じているか（噂の信頼度）を推定 [Belief classification]

問題設定／手法

タスク１：Rumor Retrieval

誤情報・デマを含むツイートを同定
高いpresicion/recall率が求められる
- 特定の噂に関してのツイート[presicion]を網羅的に[recall]取得したいため
- 標準的なIR手法では不十分

タスク２：Belief Classification

タスク１で集めたツイートを，噂の支持度に応じて分類
- 噂を信じているツイート
- 噂に対し疑問を呈しているツイート
「噂」という正解が曖昧なものに対する評判分析
- こちらも手法を工夫する必要がある

利用したデータ

Twitter API + 正規表現(Regexp)で噂に関連するツイートを網羅的に取得
- 手作業で高recallとなる正規表現(Regexp)を記述
教師データを作成するため，上で集めたツイートをアノテート (10400tweets)

手法

タスク１・２共にBayes Classifierによる尤度最大化
- L1-regularized log-linear model [Andrew and Gao, 2007] + QWL-QN [Gao et al., 2007]
用いる特徴量を色々変化させ，実験を行う

Content-based Features

単語情報 [TXT1 : unigram] [TXT2 : bigram]
品詞情報 (+HASHTAG/URL) [POS1 : unigram] [POS2 : bigram]

Network-based Features

RTした側のユーザーは，噂に対してPositiveかNegativeかという情報
RTされた側のユーザーは，噂に対してPositiveかNegativeかという情報

Twitter Specific Memes

Hashtag
URL [URL1 : unigram] [URL2 : bigram]

実験結果

Rumor Retrieval / Belief Classification共に，Content-based Featuresが高性能
- F値 : 約95% (Rumor Retrieval) / 93.2% (Belief Classification)
- 全特徴を入れて実験した場合も大体同じ結果
教師データの数に応じてPresicionがどのように変化するかを実験 (Figure 2)
- 教師データが全くない(新規のデマ検出)場合は，Presicionは約60%

概要

手法

実験

感想

A Collaborative Mechanism for Crowdsourcing Prediction Problems

A Convergence Analysis of Log-Linear Training

Active Classification based on Value of Classifier

An Annealing Technique for Selecting Good Citizens on Manifolds

A Non-Parametric Approach to Dynamic Programming

Beating SGD: Learning SVMs in Sublinear Time

Better Mini-Batch Algorithms via Accelerated Gradient Methods

Composite Multiclass Losses

Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

Distributed Delayed Stochastic Optimization

Divide-and-Conquer Matrix Factorization

Efficient Online Learning via Randomized Rounding

Fast and Accurate k-means For Large Datasets

From Bandits to Experts: On the Value of Side-Observations

High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

Improved Algorithms for Linear Stochastic Bandits

Large-Scale Sparse Principal Component Analysis with Application to Text Data

Learning Anchor Planes for Classification

Learning large-margin halfspaces with more malicious noise

Lower Bounds for Passive and Active Learning

Multiclass Boosting: Theory and Algorithms

Nearest Neighbor based Greedy Coordinate Descent

Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

Online Learning: Stochastic, Constrained, and Smoothed Adversaries

On the Universality of Online Mirror Descent

Sparse Features for PCA-Like Linear Regression

Trace Lasso: a trace norm regularization for correlated designs

Frequency-aware Truncated methods for Sparse Online Learning

Active learning with evolving streaming data

Manifold Coarse Graining for Online Semi-Supervised Learning

A boosting approach to multiview classification with cooperation

On the Stratification of Multi-Label Data

Compact Coding for Hyperplane Classifiers in Heterogeneous Environment

Fast approximate text document clustering using Compressive Sampling

Is there a best quality metric for graph clusters?

Active Supervised Domain Adaptation

Multi-Label Ensemble Learning

Fast projections onto L1,q-norm balls for grouped feature selection

Feature Selection for Transfer Learning

Influence and Passivity in Social Media

Learning Recommendations in Social Media Systems By Weighting Multiple Relations

Toward a Fair Review-Management System

Adaptive Boosting for Transfer Learning using Dynamic Updates

Datum-Wise Classification: A Sequential Approach to Sparsity

Transfer Learning With Adaptive Regularizers

Analyzing Word Frequencies in Large Text Corpora using Inter-arrival Times and Bootstrapping

A Game Theoretic Framework for Data Privacy Preservation in Recommender Systems

Linear Discriminant Dimensionality Reduction

解説

実験結果

概要

背景

問題設定／手法

タスク１：Rumor Retrieval

タスク２：Belief Classification

利用したデータ

手法

Content-based Features

Network-based Features

Twitter Specific Memes

実験結果

関連研究

噂（デマ・誤情報含む）の検知と分析

評判分析