My dissertation advisor was Dr. Roger T. Hartley.
Reliability of Information on the World Wide Web
My dissertation research, to develop a measure of the reliability of
information found on medical web pages, focused on:
* Defining reliability of information in the medical domain.
* Determining which features, extracted from the page and from the
topology of the Web near the page, are good indicators of reliability.
* Learning to classify medical web pages based on their features.
To define reliability, I started with standards developed by library
and information scientists (e.g. accuracy, authority, completeness,
currency, objectivity), then considered document attributes specific to
the Web (e.g. hub/authority scores, inlinks), and finally the standards
of Evidence-Based Medicine. (The validity of the final set of standards
will be shown empirically.)
I used a variety of techniques from natural language processing, data
mining, and machine learning to extract features from web documents
that are indicative of the reliability of medical information. My
parser takes advantage of both the natural language and the HTML
structure of the page. Some example features are: the vector length of
the page (in LSA semantic space), the presence of subjective
adjectives, the presence of a copyright notice, and the number of
inlinks from outside the page's domain.
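As an illustration, two of these surface features can be computed in a few lines of Python. This is a simplified sketch, not the dissertation's parser: the adjective list is an invented placeholder, and a real parser would keep the HTML structure rather than stripping it.

```python
import re

# Invented miniature lexicon; the real feature used a much larger
# list of subjective adjectives.
SUBJECTIVE_ADJECTIVES = {"amazing", "terrible", "miraculous", "worthless"}

def extract_features(html):
    """Extract two simple surface features from a page's HTML."""
    # Crudely strip tags and lower-case the text; a real parser would
    # use both the natural language and the HTML structure.
    text = re.sub(r"<[^>]+>", " ", html).lower()
    words = re.findall(r"[a-z]+", text)
    return {
        # Presence of a copyright notice.
        "has_copyright": "copyright" in words,
        # Presence of any subjective adjective.
        "has_subjective_adj": any(w in SUBJECTIVE_ADJECTIVES for w in words),
    }

page = ("<html><body><p>This miraculous cure works!</p>"
        "<p>Copyright 2004 Example Clinic</p></body></html>")
features = extract_features(page)
```

Boolean features like these would then be combined with the LSA-based and link-based features before classification.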
Given this set of features, I used decision trees (C4.5), Naive Bayes,
k-NN, and Support Vector Machines to learn to classify the pages, both
in terms of the reliability of the page and in terms of page type
(patient leaflet, links, commercial, etc.). I also used
straightforward hierarchical clustering based on the similarity of the
documents in the LSA-created semantic space. The results I obtained
were comparable to those of other, similarly difficult, classification
tasks involving natural language.
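The classification step can be sketched with a toy Naive Bayes classifier over boolean features. This is illustrative only: the training examples and feature values below are invented, and the dissertation also used C4.5, k-NN, and SVMs.

```python
import math
from collections import defaultdict

def train_nb(examples):
    """Train Naive Bayes with add-one smoothing on boolean feature dicts.

    examples: list of (feature_dict, label) pairs.
    """
    label_counts = defaultdict(int)
    # label -> feature -> count of examples where the feature is True
    feat_counts = defaultdict(lambda: defaultdict(int))
    for feats, label in examples:
        label_counts[label] += 1
        for f, v in feats.items():
            if v:
                feat_counts[label][f] += 1
    return label_counts, feat_counts

def classify_nb(model, feats):
    """Return the label with the highest posterior log-probability."""
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label, n in label_counts.items():
        lp = math.log(n / total)
        for f, v in feats.items():
            p_true = (feat_counts[label][f] + 1) / (n + 2)  # add-one smoothing
            lp += math.log(p_true if v else 1 - p_true)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Invented toy data: reliable pages tend to carry a copyright notice
# and avoid subjective adjectives.
train = [
    ({"has_copyright": True, "has_subjective_adj": False}, "reliable"),
    ({"has_copyright": True, "has_subjective_adj": False}, "reliable"),
    ({"has_copyright": False, "has_subjective_adj": True}, "unreliable"),
    ({"has_copyright": False, "has_subjective_adj": True}, "unreliable"),
]
model = train_nb(train)
prediction = classify_nb(model, {"has_copyright": True,
                                 "has_subjective_adj": False})
```

In practice the feature set was much larger and mixed boolean, count, and real-valued (LSA) features, which is one reason several learners were compared.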
Other Research Projects
Discourse Modeling with Latent
Semantic Analysis (LSA): work with Dr. Peter Foltz
(Computing Research Laboratory)
From January 2002 to the present, I have worked for Dr. Peter Foltz as
part of a research group with co-PIs Dr. Foltz and Dr. Nancy Cooke at
Arizona State East, funded by grants from ARL and ONR. My work with Dr.
Foltz is on automatic discourse analysis in the team-communication
domain. The ultimate goal is a real-time system that analyzes team
discourse and provides feedback to team members and to trainers or
supervisors, so that team performance on critical tasks can be
improved.
My work has primarily centered on automatic discourse tagging using a
predefined tag set developed to categorize the content of sequential
team communication. My basic algorithm takes an utterance, uses LSA to
find the most semantically similar previously tagged utterances, and
estimates the most probable tag for the current utterance. I have
improved the algorithm by adding some shallow syntactic features. We
are in the process of investigating the correlation of tag counts and
counts of tag sequences (e.g. bigrams) with overall team performance.
Preliminary results confirm the findings of Bowers et al. Results are
reported in Martin and Foltz (NAACL 2004). A demo of this work can be
found by clicking the "Discourse Analysis" link on
http://bluff.nmsu.edu/~ahmed/.
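The core of that tagging algorithm can be sketched as a nearest-neighbor lookup by cosine similarity. The 3-dimensional vectors and tag names below are invented stand-ins; in the real system each utterance is a vector in an LSA semantic space trained on the team-communication corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def tag_utterance(vector, tagged_corpus, k=3):
    """Tag an utterance vector by majority vote among the k previously
    tagged utterances most similar to it."""
    ranked = sorted(tagged_corpus,
                    key=lambda item: cosine(vector, item[0]),
                    reverse=True)
    votes = {}
    for vec, tag in ranked[:k]:
        votes[tag] = votes.get(tag, 0) + 1
    return max(votes, key=votes.get)

# Invented 3-d stand-ins for LSA vectors of previously tagged utterances.
corpus = [
    ([0.9, 0.1, 0.0], "statement"),
    ([0.8, 0.2, 0.1], "statement"),
    ([0.1, 0.9, 0.0], "question"),
    ([0.0, 0.8, 0.2], "question"),
]
tag = tag_utterance([0.85, 0.15, 0.05], corpus, k=3)
```

The shallow syntactic features mentioned above can be folded in by reweighting or breaking ties among the nearest neighbors.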
In addition, I have been involved in hiring and supervising other
research group members. I bear primary responsibility for ensuring the
integrity and consistency of the data used in LSA processing, and I am
currently supervising a project to annotate additional data.
Automatic Recognition of Subjectivity:
work with Dr. Janyce Wiebe
(University of Pittsburgh)
From May 1998 to December 2001, I worked with Dr. Janyce Wiebe and
colleagues on projects aimed at automatically recognizing subjectivity
in text. Projects included:
* Topic segmentation
* Ideological point of view
* Flame recognition
We used discourse processing techniques from computational linguistics
as well as probabilistic classification. On the probabilistic side, we
investigated model search procedures and methods for representing
lexical information as input features to machine learning algorithms.
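Representing lexical information as input features can be sketched as mapping an utterance onto binary indicators. The tiny lexicons here are invented placeholders; the actual work used learned and hand-built lexical cues for subjectivity.

```python
# Invented miniature lexicons standing in for real subjectivity cues.
STRONG_SUBJECTIVE = {"hate", "love", "idiot", "stupid"}
WEAK_SUBJECTIVE = {"seems", "perhaps", "apparently"}

def lexical_features(text):
    """Map an utterance onto binary lexical features for a classifier."""
    words = set(text.lower().split())
    return {
        "has_strong_subj": bool(words & STRONG_SUBJECTIVE),
        "has_weak_subj": bool(words & WEAK_SUBJECTIVE),
        "has_second_person": "you" in words,  # a common cue in flames
    }

feats = lexical_features("You are an idiot and apparently proud of it")
```

Feature vectors of this shape are what the probabilistic classifiers took as input.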
My comprehensive exam was on the feasibility of developing an automatic
system that would, given a collection of text from the Internet about a
given topic, segment the text by ideological point of view.
The written portion is available at:
http://www.cs.nmsu.edu/~mmartin/courses/comps_all.html
I developed annotation instructions for recognizing flames (hostile or
abusive messages) in Usenet newsgroups and supervised the annotation.
Results are reported in Wiebe et al. (SIGDIAL 2001) and in
Computational Linguistics (2004). I worked with master's students to
implement algorithms for automatic flame detection.