REU Site on Integrated Machine Learning Systems:
Detecting Internet Censorship


Jedidiah Crandall

Past work has shown that the keywords a country censors can be discovered by probing, using simple machine learning techniques such as latent semantic analysis and k-means clustering to choose which words to test (e.g., see ConceptDoppler). More advanced techniques that make this probing more efficient would give us a more complete picture of Internet censorship. Other research opportunities related to this effort include named entity extraction to pick the important names, dates, and places out of news articles; developing algorithms for understanding the wealth of data the ConceptDoppler project has collected; and using Good-Turing estimation (a technique developed at Bletchley Park during World War II as part of the effort to crack the Enigma code) to better understand the scope and size of different keyword blacklists.
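To illustrate how Good-Turing estimation might bear on blacklist size, here is a minimal sketch. It assumes that probing yields a list of observed censored-keyword hits, and it uses the basic Good-Turing result that the probability mass of items never seen is roughly N1 / N, where N1 is the number of items seen exactly once and N is the total number of observations. The function names and the sample data are hypothetical and are not taken from the ConceptDoppler project.

```python
from collections import Counter


def good_turing_unseen_mass(observations):
    """Estimate the probability mass of never-observed items as N1 / N.

    N1 is the number of distinct items seen exactly once and N is the
    total number of observations (the basic Good-Turing estimate).
    """
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one observation")
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


def frequency_of_frequencies(observations):
    """Map each count r to the number of distinct items seen r times."""
    counts = Counter(observations)
    return dict(Counter(counts.values()))


# Hypothetical probe results: each entry is one observed censored keyword.
hits = ["kw_a", "kw_b", "kw_a", "kw_c", "kw_a", "kw_b", "kw_d", "kw_e"]

unseen = good_turing_unseen_mass(hits)
print(f"Estimated mass of blacklist keywords not yet observed: {unseen:.3f}")
print("Frequency of frequencies:", frequency_of_frequencies(hits))
```

A large estimated unseen mass would suggest that much of a blacklist remains undiscovered and that further probing is worthwhile; a small one would suggest the observed keywords already cover most of what is being filtered, under the assumption that hits are sampled roughly independently.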



Last modified: Mon Jan 5 23:14:41 2009