Discovering Word Associations in News Media via Feature Selection and Sparse Classification
We were fortunate to have our work accepted into the proceedings of the
11th ACM SIGMM International Conference on
Multimedia Information Retrieval. Here you may find a gallery of material complementary
to this work.
The paper
You can download a PDF copy of the paper here.
The presentation
You can download a PDF copy of the slides presented at MIR 2010
here.
The data
-
NYTWData.txt, [44.6 MB]: A tab-delimited text file
encoding the appearance of words across paragraphs. Each paragraph-word pair present in
the data gets one line of text: Column 1 gives the paragraph ID, Column 2
gives the word ID, and Column 3 gives the number of times the word appeared in the
paragraph.
-
NYTWDict.txt, [1.1 MB]: A tab-delimited text file
with each line providing a word identification number (as used in the matrix above) and
its associated plaintext word.
-
NYTWStops.txt, [4 KB]: A text file listing words and
word IDs deemed a priori uninteresting; these were dropped from the above matrix
when conducting our imaging experiments.
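As a rough illustration of the file layouts described above, the following Python sketch reads the paragraph-word counts and the word dictionary, then drops a set of stop-word IDs. It is not the code used for the paper. The function names (read_counts, read_dictionary, drop_stops) and the tiny in-memory samples are made up for this example, IDs are kept as strings, and parsing of NYTWStops.txt itself is left out because its column layout is not specified here.

```python
import io
from collections import defaultdict


def read_counts(lines):
    """Parse 'paragraph_id<TAB>word_id<TAB>count' lines into nested dicts."""
    counts = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        paragraph_id, word_id, count = line.split("\t")
        counts[paragraph_id][word_id] = int(count)
    return counts


def read_dictionary(lines):
    """Parse 'word_id<TAB>word' lines into a {word_id: word} mapping."""
    words = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        word_id, word = line.split("\t", 1)
        words[word_id] = word
    return words


def drop_stops(counts, stop_ids):
    """Return a copy of counts without the word IDs listed in stop_ids."""
    return {
        paragraph_id: {w: c for w, c in row.items() if w not in stop_ids}
        for paragraph_id, row in counts.items()
    }


# Made-up sample content standing in for NYTWData.txt and NYTWDict.txt.
sample_data = io.StringIO("1\t10\t2\n1\t11\t1\n2\t10\t3\n")
sample_dict = io.StringIO("10\tmarket\n11\tthe\n")

counts = read_counts(sample_data)
words = read_dictionary(sample_dict)
filtered = drop_stops(counts, stop_ids={"11"})  # hypothetical stop-word ID

# Replace word IDs with plaintext words for inspection.
readable = {
    p: {words[w]: c for w, c in row.items()} for p, row in filtered.items()
}
print(readable)
```

To use it on the real files, pass open file objects (for example, open("NYTWData.txt", encoding="utf-8")) in place of the in-memory samples. For the full 44.6 MB file a sparse matrix may be a better fit than nested dictionaries.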
The experiments
-
NYTWQueries.txt, [4 KB]: A text file listing the
47 words and their associated IDs used as labels for our imaging experiments.
-
NYTWSplits.zip, [11.3 MB]: A ZIP file
containing 47 text files, each corresponding to one of the 47 queries. Each line of each text file
gives a paragraph ID and a flag indicating training set membership (a value
of 0) or test set membership (a value of 1) for that query.
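The split files can be read along the following lines. This is only a sketch under stated assumptions: each line is taken to hold the paragraph ID first and the 0/1 flag second, separated by whitespace (the exact delimiter and column order are not stated here), and the function name read_split and the sample content are invented for illustration.

```python
import io


def read_split(lines):
    """Separate paragraph IDs into training (flag 0) and test (flag 1) lists.

    Assumes 'paragraph_id flag' per line, whitespace-separated.
    """
    train_ids, test_ids = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        paragraph_id, flag = fields[0], fields[1]
        if flag == "0":
            train_ids.append(paragraph_id)
        elif flag == "1":
            test_ids.append(paragraph_id)
        else:
            raise ValueError(f"unexpected split flag: {flag!r}")
    return train_ids, test_ids


# Made-up sample standing in for one of the 47 split files.
sample_split = io.StringIO("1\t0\n2\t1\n3\t0\n")
train_ids, test_ids = read_split(sample_split)
print(len(train_ids), "training paragraphs;", len(test_ids), "test paragraphs")
```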
The authors
(Note from Jan 2023: Many of these links are from 2010! I'll update them soon.)
This work was conducted as part of the
StatNews Project.