I am interested in the synergy between statistical data analysis
and knowledge-based reasoning and, in particular, in investigating the use of
lightweight semantics in combination
with machine learning techniques for scalable and automated applications in
information retrieval, text and image understanding, and data mining.
Technology areas I am working on at Siemens
Text Mining: Analyzing diverse unstructured text sources such as free-text service reports, scanned-in legacy
documents, call center work order descriptions, news articles, and blogs. Technologies developed include statistical NLP-based
term extraction, ontology mining, unsupervised clustering, SVM and logistic regression supervised classifiers, topic models,
automatic text summarization, semantic search, tracking and monitoring, and information extraction from OCR-processed documents. These
technologies are being applied to various applications deployed for Siemens Energy, Healthcare, Industry, and Corporate
Communications.
Data Quality: Developed systems for data cleansing for Siemens Energy and for spend management for Corporate Supply
Chain and Procurement. Technologies developed include large-scale statistical supervised classification, rule engines, and
learning-based data reconciliation.
Association Rule Mining: Developed a system for mining patterns that correlate events with errors by analyzing
event and error sequences in log files of Siemens Healthcare instruments. Technologies developed include information-theoretic
association rule mining on large-scale event sequence data, and monitoring of patterns and key performance indicators for
a fleet of instruments.
Medical Informatics: Working on the Theseus-Medico next-generation clinical image search project, particularly
applying knowledge representation and reasoning technologies, including RDF/RDFS, OWL, and various clinical ontologies, to
semantic image annotation and retrieval.
Forecasting: A parts forecasting system for Siemens Energy that uses bottom-up risk analysis on
data integrated from diverse sources containing sales, operational, and engineering information.
Selected Publications:
Classifying Spend Transactions with Off-the-Shelf Learning Components -
Saikat Mukherjee, Dmitriy Fradkin, Michael Roth in
IEEE International Conference on Tools with Artificial Intelligence (ICTAI) 2008
Context-driven Ontological Annotations in DICOM Images: Towards Semantic PACS -
Manuel Moeller and Saikat Mukherjee in
International Conference on Health Informatics (HEALTHINF) 2009
Medical Image Understanding through the Integration of Cross-Modal Object Recognition with Formal Domain Knowledge -
Manuel Moeller, Michael Sintek, Paul Buitelaar, Saikat Mukherjee, Xiang Sean Zhou, Joerg Freund in
International Conference on Health Informatics (HEALTHINF) 2008
PhD dissertation
In my PhD dissertation,
Automated Semantic Analysis of Schematic Data,
I worked on the semantic understanding
of template-generated semi-structured and unstructured data sources.
By coupling machine learning with domain knowledge, I developed
highly automated and scalable solutions to this problem.
These ideas were applied to relevant problems in
assistive browsing, mobile device browsing, Web transactions, and
information extraction.
Selected Publications:
Automated Semantic Analysis of Schematic Data -
Saikat Mukherjee, I.V. Ramakrishnan in World Wide Web Journal (Springer), 11(4), 2008
Bootstrapping Semantic Annotation for Content-Rich HTML Documents -
Saikat Mukherjee, I.V. Ramakrishnan, Amarjeet Singh in
International Conference on Data Engineering (ICDE) 2005
Automatic Annotation of Content-Rich Web Documents: Structural and Semantic Analysis -
Saikat Mukherjee, Guizhen Yang, I.V. Ramakrishnan in
International Semantic Web Conference (ISWC) 2003