Seminars

UPCOMING SEMINARS 

Series: Department Seminar
Title: The Construction of a Semidefinite Representation of a Convex Set from its Projections
Speaker: Dr. Anusuya Ghosh, Postdoctoral Fellow, IIM Bangalore
Date and Time: Tuesday, May 02, 2017, 10:00 AM
Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract: Construction methods for semidefinite representations of convex sets play a major role in recent research on modern convex optimization. Even when a convex set is known to be semidefinite representable, there is no general procedure for constructing its semidefinite representation. This work deals with a method for constructing a semidefinite representation of a convex body when its projections onto lower-dimensional spaces are known. An algorithm is proposed that yields an approximate solution for such a construction, and we further show that the approximate solution converges to the convex set. We also compare our construction method in detail with other methods available in the recent literature.
Speaker Bio: Anusuya Ghosh is currently a postdoctoral fellow at IIM Bangalore. She received her Ph.D. in Industrial Engineering and Operations Research from IIT Bombay. She also received the "Best Project Award" from the Department of Mathematics at IIT Kharagpur in 2009, during her M.Sc.
Host Faculty: Chiranjib Bhattacharyya
Series: M.Sc. (Engg) Colloquium
Title: A Study of the Thompson Sampling Approach for Sleeping Multi-Armed Bandit Problems
Speaker: Mr. Aritra Chatterjee, M.Sc. (Engg) Student, CSA, IISc
Faculty Advisor: Prof. Y. Narahari
Date and Time: Friday, April 28, 2017, 4:00 PM
Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract: The multi-armed bandit (MAB) problem provides a convenient abstraction for many online learning problems arising in modern applications, including Internet advertising, crowdsourcing, online procurement, and smart grids. Several variants of the MAB problem have been proposed to extend the basic model to a variety of practical and general settings. The sleeping multi-armed bandit (SMAB) problem is one such variant, where the subset of available arms varies with time. This study focuses on analyzing the efficacy of the Thompson Sampling algorithm for solving the SMAB problem.
Any algorithm for the classical MAB problem is expected to choose one of K available arms (actions) in each of T consecutive rounds. Each choice of an arm generates a stochastic reward from an unknown but fixed distribution. The goal of the algorithm is to maximize the expected sum of rewards over the T rounds (or, equivalently, to minimize the expected total regret) relative to the best fixed action in hindsight. In many of these settings, however, not all arms may be available in any given round. For example, in Internet advertising, some advertisers might choose to stay away from the auction due to budget constraints; in crowdsourcing, some workers may not be available at a given time due to time-zone differences. This implies that the algorithm must work only with the set of available arms. Further, unlike in the classical setting, the performance of an algorithm can no longer be evaluated relative to the best fixed arm in hindsight; the regret is now measured relative to the best "available" arm in hindsight. One possible way is to compute the overall regret as the worst-case regret over all possible sample paths of availabilities (that is, under adversarial selection of available arms).
In the literature, upper confidence bound (UCB)-based approaches have been proposed and investigated for the SMAB problem. Our contribution is to investigate a Thompson Sampling-based approach. Our key result is a logarithmic regret bound, which non-trivially generalizes a similar bound known for this approach in the classical MAB setting. Our bound also matches (up to constants) the best-known lower bound for the SMAB problem. Further, we show via detailed simulations that the Thompson Sampling approach in fact outperforms the known algorithms for the SMAB problem.
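As a concrete illustration of the setup described in the abstract, the following is a minimal sketch (not the thesis's algorithm) of Thompson Sampling restricted to the currently available arms, with Bernoulli rewards and Beta priors; the reward and availability functions are hypothetical stand-ins.

```python
import random

def thompson_sampling_sleeping(rewards, availability, horizon, n_arms, seed=0):
    """Thompson Sampling over whichever arms are awake in each round.

    rewards(t, arm)  -> stochastic 0/1 reward (unknown to the learner)
    availability(t)  -> set of arms available in round t
    Each arm starts with a Beta(1, 1) prior; its posterior is updated
    only in rounds where it is actually pulled.
    """
    rng = random.Random(seed)
    alpha = [1] * n_arms  # posterior success counts + 1
    beta = [1] * n_arms   # posterior failure counts + 1
    total = 0
    for t in range(horizon):
        avail = availability(t)
        # Sample a mean estimate for each available arm; play the best sample.
        arm = max(avail, key=lambda a: rng.betavariate(alpha[a], beta[a]))
        r = rewards(t, arm)
        alpha[arm] += r
        beta[arm] += 1 - r
        total += r
    return total
```

With all arms always available, this reduces to classical Thompson Sampling; the only change needed for the sleeping setting is that sampling and maximization run over the available subset.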


PAST SEMINARS 

Series: Ph.D. Colloquium
Title: New Methods for Learning from Heterogeneous and Strategic Agents
Speaker: Divya Padmanabhan, Ph.D. student
Faculty Advisors: Prof. Shirish Shevade and Prof. Y. Narahari
Date and Time: Friday, April 21, 2017, 12:00 PM
Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract: In this thesis, we address several problems concerned with learning from multiple heterogeneous agents. The problems are relevant to several applications today, such as crowdsourcing and sponsored search auctions. In scenarios such as crowdsourcing, the agents that provide the training data for the learning problems are noisy, and their noise levels are unknown. Any learning algorithm which relies on information provided by these agents must learn their qualities while simultaneously learning a model. Another challenge arises when agents are strategic: they could be strategic about the effort they put into performing a task, and they could also be strategic in reporting the costs they incur. Learning algorithms suffer if incorrect information is provided by the agents, so it is necessary to ensure that the agents exhibit desirable behaviour. We solve the following problems that arise while learning from such agents.
(1) Multi-label Classification with Heterogeneous Noisy Agents:
Multi-label classification is a well-known supervised machine learning problem where each instance is associated with multiple classes. Since several labels can be assigned to a single instance, one of the key challenges in this problem is to learn the correlations between the classes. We first assume labels from a perfect source and propose a novel topic model called Multi-Label Presence-Absence Latent Dirichlet Allocation (ML-PA-LDA). In the current era, a natural source for procuring the training dataset is mining user-generated content, or sourcing it directly from users on a crowdsourcing platform. In this more practical scenario of crowdsourcing, an additional challenge arises because the labels of the training instances are provided by noisy, heterogeneous crowd workers with unknown qualities. With this motivation, we further extend our topic model to the scenario where the labels are provided by multiple noisy sources, and refer to this model as ML-PA-LDA-MNS. With experiments on standard datasets, we show that the proposed models achieve superior performance over the state of the art.
(2) Active Linear Regression with Heterogeneous, Noisy and Strategic Agents:
We study the problem of training a linear regression model by procuring labels from multiple noisy agents or crowd annotators, under a budget constraint. We propose a Bayesian model for linear regression from multiple noisy sources and use variational inference for parameter estimation. When labels are sought from agents, it is important to minimize the number of labels procured, as every call to an agent incurs a cost. Towards this, we adopt an active learning approach. In this specific context, we prove the equivalence of well-studied active learning criteria such as entropy minimization and expected error reduction. For the purpose of annotator selection in active learning, we observe a useful connection with the multi-armed bandit framework. Due to the nature of the distribution of the rewards on the arms, we resort to the Robust Upper Confidence Bound (UCB) scheme with a truncated empirical mean estimator to solve the annotator selection problem. This yields provable guarantees on the regret. We further apply our model to the scenario where annotators are strategic, and design suitable incentives to induce them to put in their best efforts.
(3) Ranking with Heterogeneous Strategic Agents:
We look at the problem where a planner must rank multiple strategic agents, a problem with many applications, including sponsored search auctions (SSA). Stochastic multi-armed bandit (MAB) mechanisms have been used to solve this problem. Existing stochastic MAB mechanisms with a deterministic payment rule, proposed in the literature, necessarily suffer a regret of Ω(T^(2/3)), where T is the number of time steps. This happens because the existing mechanisms consider the worst-case scenario, where the means of the agents' stochastic rewards are separated by a very small amount that depends on T. Our idea is to allow the planner to indicate the resolution, Δ, with which the agents must be distinguished. This immediately leads us to introduce the notion of Δ-regret. We propose a dominant strategy incentive compatible (DSIC) and individually rational (IR) deterministic MAB mechanism, based on ideas from the Upper Confidence Bound (UCB) family of MAB algorithms. The proposed mechanism, Δ-UCB, achieves a Δ-regret of O(log T). We first establish the results for single-slot SSA and then non-trivially extend them to the case where multiple slots are to be allocated.
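To make the Δ-regret notion concrete, one illustrative reading of it (a sketch of the idea, not the thesis's formal definition) charges regret only in rounds where the chosen agent's mean reward falls more than Δ below the best mean:

```python
def delta_regret(means, choices, delta):
    """Illustrative Delta-regret: sum the gap to the best mean, counting
    only rounds where the gap exceeds the resolution delta."""
    best = max(means)
    return sum(best - means[a] for a in choices if best - means[a] > delta)
```

With Δ = 0 this reduces to the usual expected regret against the best fixed arm; a positive Δ lets the mechanism ignore agents whose means are within the planner's resolution of the best.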
Series: M.Sc. (Engg) Colloquium
Title: An Improved Lower Bound for Multi-r-ic Depth Four Circuits as a Function of the Number of Input Variables
Speaker: Mr. Sumant Hegde, M.Sc. (Engg) student, CSA
Faculty Advisor: Dr. Chandan Saha
Date and Time: Friday, April 21, 2017, 11:00 AM
Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract: In this work we study the multi-r-ic formula model introduced by Kayal and Saha (2015) and improve upon the lower bound for multi-r-ic depth four circuits given by Kayal, Saha and Tavenas (2016) [KST16], when viewed as a function of the number of input variables N. The improvement leads to superpolynomial lower bounds for values of r significantly higher than what is known from prior works.
A (syntactically) multi-r-ic formula is an arithmetic formula in which the formal degree with respect to every variable is at most r at every gate. The formal degree of an input gate with respect to a variable x is defined to be 1 if the gate is labelled with x, and 0 if it is labelled with a field element or a different variable. The formal degree of a sum (respectively, product) gate with respect to x is defined as the maximum (respectively, sum) of the formal degrees of its children with respect to x. A multi-r-ic formula thus computes a polynomial in which the individual degree of every variable is bounded by r.
Multi-r-ic formulas are a natural extension of the relatively well-studied multilinear formulas of Raz (2009) and Raz and Yehudayoff (2009) [RY09]. In this work, we focus on multi-r-ic formulas that compute multilinear polynomials. These are interesting because they allow the formal degree of the formula to be as high as r times the number of underlying variables. This gives extra room for ‘clever’ cancellations of the high-degree components inside the formula, thereby making this type of formula harder to analyse (as formula homogenization is not known to be doable without blowing up the size superpolynomially unless the degree is very small, as Raz (2010) showed). Most lower bound proofs in the literature operate under the restriction of low formal degree or multilinearity (as in the works of Raz (2009), [RY09], Kayal, Saha and Saptharishi (2014), and Kayal, Limaye, Saha and Srinivasan (2014)). In this light, multi-r-ic formulas computing multilinear polynomials form a reasonable intermediate model to study in order to gain insight into how to deal with high formal degree in general formulas. Another motivation for understanding the high formal degree case better (even at depth three) comes from the depth reduction result of Gupta, Kamath, Kayal and Saptharishi (2013).
With the aim of making progress on multi-r-ic formula lower bounds, [KST16] gave an (N/(d·r^2))^Ω(√(d/r)) lower bound for multi-r-ic depth four formulas computing the N-variate Iterated Matrix Multiplication (IMM) polynomial of degree d. As a function of N, the lower bound is at most 2^Ω(√(N/r^3)), attained when d = Θ(N/r^2). In this thesis, our focus is on bringing multi-r-ic depth four formulas with larger r into the arena of models that provably admit a superpolynomial lower bound. In [KST16], r can be at most N^(1/3) for the bound to remain superpolynomial. Our result (stated below) gives a superpolynomial lower bound for multi-r-ic depth four formulas where r can be as high as (N·log N)^(0.9).
Theorem: Let N, d, r be positive integers such that 0.51·N ≤ d ≤ 0.9·N and r ≤ (N·log N)^(0.9). Then there is an explicit N-variate degree-d multilinear polynomial in VNP such that any multi-r-ic depth four circuit computing it has size 2^Ω(√(N·(log N)/r)).
The theorem yields a better lower bound than that of [KST16] when viewed as a function of N. Also, the bound matches the best-known lower bound (as a function of N) for multilinear (r = 1) depth four circuits, given by [RY09], which is 2^Ω(√(N·log N)).
The improvement is obtained by analysing the shifted partials dimension (SPD) of an N-variate polynomial in VNP (as opposed to a VP polynomial in [KST16]) in the high degree range of Θ(N), and comparing it with the SPD of a depth four multi-r-ic circuit. In [KST16], a variant of shifted partials, called shifted skewed partials, is critically used to analyse the IMM polynomial (which is in VP). We observe that SPD (without "skew") is still effective for the Nisan-Wigderson polynomial (which is in VNP), and yields a better lower bound as a function of N when the degree d is naturally chosen to be high. Our analysis gives a better range for r and a better lower bound in the high degree regime, not only for depth four multi-r-ic circuits but also for the weaker models: multi-r-ic depth three circuits and multi-r-ic depth four circuits with low bottom support. These (weaker) models are instrumental in gaining insight about general depth four multi-r-ic circuits, both in [KST16] and in our work.
Series: Ph.D. Thesis Defense
Title: Scalable Sparse Bayesian Nonparametric and Matrix Tri-factorization Models for Text Mining Applications
Speaker: Ranganath B. N.
Faculty Advisor: Prof. Shalabh Bhatnagar
Date and Time: Tuesday, April 18, 2017, 2:00 PM
Venue: CSA Conference Room (Room No. 225, First Floor)
Abstract: Hierarchical Bayesian models and matrix factorization methods provide an unsupervised way to learn the latent components of data from grouped or sequence data. For example, in document data, a latent component corresponds to a topic, with each topic being a distribution over a finite vocabulary of words. For many applications, there exist sparse relationships between the domain entities and the latent components of the data. Traditional approaches to topic modeling do not take these sparsity considerations into account. Modeling these sparse relationships helps in extracting relevant information, leading to improvements in topic accuracy and to scalable solutions. In this thesis, we explore these sparsity relationships for different applications, such as text segmentation, topical analysis and entity resolution in dyadic data, through Bayesian and matrix tri-factorization approaches, proposing scalable solutions.
In our first work, we address the problem of segmentation of a collection of sequence data, such as documents, using probabilistic models. Existing state-of-the-art hierarchical Bayesian models are connected to the notion of Complete Exchangeability or Markov Exchangeability. Bayesian nonparametric models based on the notion of Markov Exchangeability, such as HDP-HMM and Sticky HDP-HMM, allow very restricted permutations of latent variables in grouped data (topics in documents), which in turn leads to computational challenges for inference. At the other extreme, models based on Complete Exchangeability, such as HDP, allow arbitrary permutations within each group or document, and inference is significantly more tractable as a result, but segmentation is not meaningful using such models. To overcome these problems, we explore a new notion of exchangeability, Block Exchangeability, that lies between Markov Exchangeability and Complete Exchangeability, for which segmentation is meaningful but inference is computationally less expensive than for both Markov and Complete Exchangeability. Parametrically, Block Exchangeability requires a sparser set of transition parameters, linear in the number of states, compared to the quadratic order for Markov Exchangeability, which is in turn less than that for Complete Exchangeability, where the parameters are on the order of the number of documents. Based on this notion, we propose a nonparametric Block Exchangeable Model (BEM), which we show to be a superclass of Complete Exchangeability and a subclass of Markov Exchangeability. We propose a scalable inference algorithm for BEM that infers the topics for words, and the segment boundaries associated with topics, for a document using a collapsed Gibbs sampling procedure.
Empirical results show that BEM outperforms state-of-the-art nonparametric models in terms of scalability and generalization ability, while showing nearly the same segmentation quality, on a news dataset, a product review dataset and a synthetic dataset. Interestingly, the scalability can be tuned by varying the block size through a parameter in our model, at a small trade-off in segmentation quality.
In addition to exploring the association between documents and words, we also explore sparse relationships for dyadic data, where associations between one pair of domain entities, such as (documents, words), and associations between another pair, such as (documents, users), are completely observed. We motivate the analysis of such dyadic data by introducing an additional discrete dimension, which we call topics, and explore sparse relationships between the domain entities and the topics, such as user-topic and document-topic associations respectively.
In our second work, for this problem of sparse topical analysis of dyadic data, we propose a formulation using sparse matrix tri-factorization. This formulation requires sparsity constraints not only on the individual factor matrices, but also on the product of two of the factors. To the best of our knowledge, this problem of sparse matrix tri-factorization has not been studied before. We propose a solution that introduces a surrogate for the product of factors and enforces sparsity on this surrogate, as well as on the individual factors, through L1-regularization. The resulting optimization problem is efficiently solvable in an alternating minimization framework over subproblems involving individual factors, using the well-known FISTA algorithm; for the subproblems that are constrained, we use a projected variant of FISTA. We also show that our formulation leads to independent subproblems when solving for a factor matrix, thereby supporting a parallel implementation and hence a scalable solution. We perform experiments over bibliographic and product review data to show that the proposed framework, based on the sparse tri-factorization formulation, achieves better generalization ability and factorization accuracy compared to baselines that use sparse bi-factorization.
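The FISTA step referred to above can be sketched on a single L1-regularized least-squares subproblem; this is a generic, self-contained stand-in for one factor update, not the thesis's exact subproblem.

```python
import numpy as np

def fista_lasso(A, b, lam, n_iters=500):
    """FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Gradient steps on the smooth term are combined with a soft-threshold
    (the proximal operator of the l1 norm) and Nesterov momentum.
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y = x.copy()
    t = 1.0
    for _ in range(n_iters):
        grad = A.T @ (A @ y - b)
        z = y - grad / L
        x_new = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)                # momentum step
        x, t = x_new, t_new
    return x
```

In an alternating minimization scheme, one such solve would be run per factor matrix while the other factors are held fixed; the constrained subproblems mentioned above would additionally project the iterate onto the feasible set after each step.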
Even though the second work performs sparse topical analysis of dyadic data, finding sparse topical associations for the users, user references with different names could belong to the same entity, and references with the same name could belong to different entities. The problem of entity resolution, where the goal is to identify the real users associated with the user references in the documents, is widely studied in the research community. Finally, we focus on the problem of entity resolution in dyadic data, where associations between one pair of domain entities, such as documents-words, and associations between another pair, such as documents-users, are observed; bibliographic data is one example.
In our final work, for this problem of entity resolution in bibliographic data, we propose a Bayesian nonparametric Sparse Entity Resolution Model (SERM), exploring the sparse relationships between the grouped data, involving grouping of the documents, and the topic/author entities in the group. Further, we also exploit the sparseness between an author entity and the associated author aliases. Grouping of the documents is achieved with a stick-breaking prior for the Dirichlet processes (DP). To achieve sparseness, we propose a solution that introduces separate Indian Buffet Process (IBP) priors over topics and author entities for the groups, and a kNN mechanism for selecting author aliases for the author entities. We propose a scalable inference for SERM by appropriately combining the partially collapsed Gibbs sampling scheme of the Focussed Topic Model (FTM), the inference scheme used for the parametric IBP prior, and the kNN mechanism. We perform experiments over the Citeseer and Rexa bibliographic datasets to show that the proposed SERM model improves the accuracy of entity resolution by finding relevant author entities through modeling sparse relationships, and is scalable when compared to the state-of-the-art baseline.
Series: Department Seminar
Title: Identifying Groups of Strongly Correlated Variables through Smoothed Ordered Weighted l1-norms
Speaker: Mr. Raman Sankaran, Ph.D. student, CSA
Faculty Advisor: Prof. Chiranjib Bhattacharyya
Date and Time: Thursday, April 13, 2017, 3:00 PM
Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract: The failure of LASSO to identify groups of correlated predictors in linear regression has sparked significant research interest. Recently, various norms [1, 2] were proposed, which can be best described as instances of ordered weighted l1 norms (OWL) [3], as an alternative to the l1 regularization used in LASSO. OWL can identify groups of correlated variables, but it forces the model to be constant within a group. This artifact induces unnecessary bias in the model estimation. In this paper we take a submodular perspective and show that OWL can be posed as the Lovász extension of a suitably defined submodular function. The submodular perspective not only explains the group-wise constant behavior of OWL, but also suggests alternatives. The main contribution of this paper is smoothed OWL (SOWL), a new family of norms, which not only identifies the groups but also allows the model to be flexible inside a group. We establish several algorithmic and theoretical properties of SOWL, including group identification and model consistency. We also provide algorithmic tools to compute the SOWL norm and its proximal operator, whose computational complexity, O(d log d), is significantly better than that of general purpose solvers, O(d^2 log d). In our experiments, SOWL compares favorably with OWL in the regimes of interest.
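For context, the OWL norm referenced above is the inner product of a fixed non-increasing weight vector with the coordinates of |x| sorted in decreasing order, so evaluating it is dominated by the sort, i.e. O(d log d). A small sketch (the weights here are illustrative):

```python
def owl_norm(x, w):
    """Ordered weighted l1 (OWL) norm: sum_i w_i * |x|_(i),
    where |x|_(1) >= |x|_(2) >= ... are the sorted absolute values
    and w is non-increasing. Cost is dominated by the O(d log d) sort."""
    assert len(x) == len(w)
    assert all(w[i] >= w[i + 1] for i in range(len(w) - 1)), "weights must be non-increasing"
    s = sorted((abs(v) for v in x), reverse=True)
    return sum(wi * si for wi, si in zip(w, s))
```

With all weights equal, OWL reduces to the l1 norm; with w = (1, 0, ..., 0) it is the l-infinity norm, which is one way to see how the weight profile controls the grouping behavior.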
Speaker Bio: Raman Sankaran is a Ph.D. student in the Department of CSA. This is a practice talk for AISTATS.
Series: M.Sc. (Engg) Colloquium
Title: Supervised Classification of Missense Mutations as Pathogenic or Tolerated using Ensemble Learning Methods
Speaker: Ms. Rashmi Balasubramanyam, M.Sc. student
Faculty Advisor: Prof. Chiranjib Bhattacharyya
Date and Time: Thursday, April 06, 2017, 9:30 AM
Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract: Missense mutations account for more than 50% of the mutations known to be involved in human inherited diseases. Missense classification is a challenging task that involves sequencing the genome, identifying the variations, and assessing their deleteriousness. This is a very laborious, time- and cost-intensive task to carry out in the laboratory. Advancements in bioinformatics have led to several large-scale next-generation genome sequencing projects, and subsequently to the identification of genome variations. Several studies have combined this data with information on established deleterious and neutral variants to develop machine learning based classifiers.
There are significant issues with missense classifiers, due to which missense classification is still an open area of research. These issues fall under two broad categories: (a) Dataset overlap: the performance estimates reported by the state-of-the-art classifiers are overly optimistic, as they have often been evaluated on datasets that overlap significantly with their training datasets. There is also no comparative analysis of these tools on a common benchmark dataset that has no overlap with the training datasets, making it impossible to identify the best classifier among them; indeed, such a common benchmark dataset has not been available. (b) Inadequate capture of vital biological information about the protein and the mutations, such as conservation of long-range amino acid dependencies and changes in certain physicochemical properties of the wild-type and mutant amino acids due to the mutation. It is also not clear how to extract and use this information, and some classifiers use structural information that is not available for all proteins.
In this study, we compiled a new dataset, containing around 2-15% overlap with the popularly used training datasets, with 18,036 mutations in 5,642 proteins. We reviewed and evaluated 15 state-of-the-art missense classifiers: SIFT, PANTHER, PROVEAN, PhD-SNP, Mutation Assessor, FATHMM, SNPs&GO, SNPs&GO3D, nsSNPAnalyzer, PolyPhen-2, SNAP, MutPred, PON-P2, CONDEL and Meta-SNP, using six metrics: accuracy, sensitivity, specificity, precision, NPV and MCC. When evaluated on our dataset, we observe huge performance drops from what has been claimed: on average, around 15% in accuracy, 17% in sensitivity, 14% in specificity, 7% in NPV, 24% in precision and 30% in MCC. With this we show that the performance of these tools is not consistent across datasets, and thus not reliable for practical use in a clinical setting.
As we observed that the performance of the existing classifiers is poor in general, we sought to develop a new classifier that is robust, performs consistently across datasets, and improves on the state-of-the-art classifiers. We developed a novel method of capturing the conservation of long-range amino acid dependencies by boosting the conservation frequencies of substrings of amino acids of various lengths around the mutation position, using the AdaBoost learning algorithm. This score alone performed on par with the sequence conservation based tools in classifying missense mutations. We then combined popularly used sequence conservation properties with these boosted long-range dependency conservation scores, again using AdaBoost. This reduced the class bias and improved the overall accuracy of the classifier. We trained a third classifier by additionally incorporating the changes in 21 important physicochemical properties due to the mutation. In this case, we observed that the overall performance further improved and the class bias further reduced. The performance of our final classifier is comparable with the state-of-the-art classifiers. We did not find any significant improvement, but the class-specific accuracies and precisions are marginally better, by around 1-2%, than those of the existing classifiers. In order to understand our classifier better, we dissected our benchmark dataset into (a) seen and unseen proteins, and (b) pure and mixed proteins, and analyzed the performance in detail. Finally, we concluded that our classifier performs consistently across each of these categories of seen, unseen, pure and mixed proteins.
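For reference, the six evaluation metrics used in the comparison above can be computed from a binary confusion matrix as follows; this is a standard sketch, not tied to any particular classifier from the study.

```python
import math

def classifier_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, precision, NPV and MCC
    from the four cells of a binary confusion matrix."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sens = tp / (tp + fn)    # sensitivity (recall, true positive rate)
    spec = tn / (tn + fp)    # specificity (true negative rate)
    prec = tp / (tp + fp)    # precision (positive predictive value)
    npv = tn / (tn + fn)     # negative predictive value
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sens, spec, prec, npv, mcc
```

Unlike accuracy, MCC stays informative on class-imbalanced benchmarks, which is one reason studies like this report it alongside the class-specific rates.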
Series: Ph.D. Thesis Defense
Title: Program Repair by Automated Generation of Hints
Speaker: Ms. Shalini Kaleeswaran, Ph.D. student
Faculty Advisor: Prof. Aditya Kanade
Date and Time: Wednesday, April 05, 2017, 12:00 PM
Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract: Programming has become an important skill in today's technology-driven world. It is a complex activity, and programmers inevitably make mistakes in their software. Student programmers make mistakes in their programs due to lack of programming experience and also due to lack of understanding of the algorithmic concepts being used. An experienced programmer in industry, on the other hand, makes mistakes because of the sheer complexity of industry-grade software. Debugging one's programs, i.e., finding and fixing faults, is one of the most challenging activities for a programmer. This motivates the development of automated techniques that are effective in helping programmers repair faults in their programs.
In this thesis, we present new computer-assisted approaches to help programmers repair their programs. Finding a repair for a fault in a program is essentially a search over the space of all possible repairs. Traditional repair techniques output a complete repaired program that eliminates the fault in the original program. This ambitious goal of finding a complete repair is often unachievable because the search space becomes too large to navigate efficiently. The key insight of our work is that programmers only need guidelines that help them understand the faults they made, and suggestions on how to fix them, as opposed to a complete repair. Therefore, this thesis proposes a novel perspective on the program repair problem, where we generate textual hints that direct the programmer on how to fix the fault. A hint specifies which part of the program needs modification, as well as how to modify it. For student programmers, our hints help them reflect on the mistakes they made and improve their understanding of the problem being solved in the program. For experienced programmers, our hints help them easily identify the cause of a fault and guide them to a repair. Our approaches use novel combinations of syntactic, symbolic and statistical reasoning techniques to achieve this goal of semi-automated program repair.
Our first contribution is an algorithmic technique for automated synthesis of repair hints. Our technique takes as input a faulty program and a set of test cases, and outputs a ranked list of repair hints that suggest how to rectify a faulty statement in the program. In this technique, we leverage symbolic execution and statistical correlation analysis to identify expressions that are likely to occur in the repaired code. We then use syntactic pattern matching to combine these expressions algorithmically and generate repair hints. Since we restrict ourselves to finding ``building blocks'' of the repair, our hints are very effective in guiding the programmer to the final repair, even in scenarios where a complete repair cannot be obtained. We implemented this technique in a tool called ``MintHint'', which performs automated synthesis of repair hints for C programs. We evaluated the effectiveness of MintHint through a user study and found that programmers who used MintHint were able to repair more faults than those who didn't, and obtained the repair 5.8 times faster.
Our next contribution is a semisupervised feedback generation methodology
for student programming assignments. In this methodology, we take a set of
student submissions as input and generate personalized, verified feedback
for each submission that suggests modifications to fix faults in the
program, if any. Student submissions are first clustered by solution
strategy, which is followed by an automated feedback generation phase. We
use unsupervised clustering to reduce the burden on the instructor in
providing reference implementations for each possible solution strategy.
The instructor is called upon simply to identify a correct submission in
each cluster, which acts as the reference implementation for all
submissions in the cluster. In the next phase, we use a novel
counterexample guided feedback generation algorithm that compares a
student submission with a reference implementation to generate verified
feedback. We applied this methodology to programs implementing iterative
dynamic programming algorithms and developed a tool called ``DPAssist''
that works on C programs. We evaluated DPAssist on a dataset of 2226 student
submissions to 4 programming problems from the programming contest site
CodeChef and successfully generated verified feedback for 85% of them.
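The cluster-then-reference workflow described above can be sketched as follows; the signature function, the submission format, and the is_correct oracle are illustrative assumptions for exposition, not DPAssist's actual implementation.

```python
from collections import defaultdict

def strategy_signature(source):
    """Crude stand-in for a solution-strategy feature: which constructs
    the source uses (illustrative assumption only)."""
    return ("for" in source, "while" in source, "memo" in source)

def cluster_by_strategy(submissions):
    """Unsupervised step: group submissions sharing a strategy signature."""
    clusters = defaultdict(list)
    for sid, src in submissions.items():
        clusters[strategy_signature(src)].append(sid)
    return dict(clusters)

def pick_references(clusters, is_correct):
    """Instructor step: mark one correct submission per cluster as the
    reference implementation for every submission in that cluster."""
    return {sig: next((s for s in members if is_correct(s)), None)
            for sig, members in clusters.items()}
```

Each cluster's reference is then compared against the remaining submissions in the cluster by the counterexample-guided feedback generation phase.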
 Series: M.Sc. (Engg) Thesis Defense Title: Learning Tournament Solutions from Preference-based Multi-Armed Bandits  Speaker: Mr. Siddartha Ramamohan
M.Sc (Engg.) student at CSA department  Faculty Advisor: Prof. Chiranjib Bhattacharyya
 Date and Time: Thursday, March 30, 2017, 10:00 AM
 Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract We consider the dueling bandits problem, a sequential decision task where the goal is to learn
to pick ‘good’ arms out of an available pool by actively querying for and observing relative
preferences between selected pairs of arms. The noisy observed preferences are assumed to be
generated by a fixed but unknown stochastic preference model. Motivated by applications in
information retrieval, e-commerce, crowdsourcing, etc., a number of bandit algorithms have
been proposed in recent years for this task. These have mostly addressed restricted settings
wherein the underlying preference model satisfies various structural assumptions, such as
being based on a random utility function or a feature-space embedding, or satisfying
transitivity or sparsity properties, or at least possessing a Condorcet winner – a
single ‘best’ arm that is preferred to all others. In seeking to move beyond such
restricted settings, there has been a recent shift towards alternative notions of ‘good’
arms (including Borda, Copeland and von Neumann winners).
In this work, we extend the dueling bandits problem by adopting, as the
desired target set of good (or ‘winning’) arms, a number of tournament
solutions that have been proposed in the social choice and voting theory
literature as embodying principled and natural criteria for identifying
good arms based on preference relations. We then propose a family of upper
confidence bound (UCB) based dueling bandit algorithms that learn to play
winning arms from several popular tournament solutions: the top
cycle, uncovered set, Banks set and Copeland set.
We derive these algorithms by first proposing a generic UCB-based
framework algorithm that can be instantiated for different tournament
solutions by means of appropriately designed ‘selection procedures’. We
show sufficiency conditions for the resulting dueling bandit algorithms to
satisfy distribution-dependent, horizon-free bounds on natural regret
measures defined w.r.t. the target tournament solutions. In contrast to
previous work, these bounds do not require restrictive structural
assumptions on the preference model and hold for a range of different tournament solutions.
We develop selection procedures that satisfy the sufficiency conditions for several popular
tournament solutions, yielding dueling bandit algorithms UCB-TC, UCB-UC, UCB-BA and
UCB-CO for the top cycle, uncovered set, Banks set and the Copeland set
respectively. The O((K^2 ln T)/g^2) bounds we derive are optimal in their
dependence on the time horizon T, and we show that the distribution-dependent ‘margin’ g is lower-bounded
by the separation or the relative advantage of top-cycle arms over non-top-cycle arms.
We empirically validate these claims and evaluate the proposed algorithms, comparing
them to the dueling bandit algorithms RUCB, SAVAGE and BTMB over synthetic and real-world
preference models. We show that our UCB-TS algorithms perform competitively over
models that possess a Condorcet winner, and outperform the other algorithms over
more general models that do not possess a Condorcet winner.
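As a toy illustration of the estimation target behind these algorithms, the sketch below duels every pair of arms a fixed number of times and reports the empirical Copeland winner. The uniform exploration schedule and the preference matrix P are illustrative assumptions; the thesis algorithms replace this with UCB-based selection procedures.

```python
import random

def empirical_copeland_winner(P, rounds=500, seed=0):
    """Duel every pair of the K arms `rounds` times under preference
    matrix P (arm i beats arm j with probability P[i][j]), then return
    the arm that beats the most rivals on the empirical estimates,
    i.e. the empirical Copeland winner."""
    rng = random.Random(seed)
    K = len(P)
    wins = [[0] * K for _ in range(K)]
    for _ in range(rounds):
        for i in range(K):
            for j in range(i + 1, K):
                if rng.random() < P[i][j]:   # noisy duel outcome
                    wins[i][j] += 1
                else:
                    wins[j][i] += 1
    # Copeland score: number of rivals beaten in a majority of duels
    scores = [sum(wins[a][b] > rounds / 2 for b in range(K) if b != a)
              for a in range(K)]
    return scores.index(max(scores))
```

With a preference matrix possessing a Condorcet winner, that arm is also the Copeland winner, so uniform exploration recovers it given enough duels; the UCB-based algorithms achieve this with far fewer, adaptively chosen duels.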
 Series: Ph.D. Thesis Defense Title: New Techniques for Automatic Short Answer Grading  Speaker: Mr. Shourya Roy
Xerox Research Centre India
Bangalore  Faculty Advisor: Prof. Y. Narahari (IISc) Dr. Manish Gupta (Xerox Research Centre India)
 Date and Time: Tuesday, March 28, 2017, 3:00 PM
 Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract Assessing the knowledge acquired by learners is a key aspect of the
pedagogical ecosystem. Conducting periodic assessment exercises by manual
means is clearly monotonous, non-rewarding, and moreover infeasible for
large student populations such as those in massive open online courses. The
scope of computer-assisted assessment has been predominantly constrained
to "recognition"-type questions (for example, multiple choice questions).
Research towards automatic grading of students' constructed responses is
still in its infancy, and this thesis represents an attempt to advance the
state-of-the-art in this area. In particular, the thesis focuses attention
on automatic short answer grading (ASAG), which involves automatically
grading short answers to questions in natural language, having a length of
a few words to a few sentences. We propose new techniques and novel
algorithms that could become the core of ASAG systems.
Unsupervised Approach
We first explore approaches based on textual similarity measures. We
advance the current art by applying word-embedding models and showing
that they provide a strong baseline. Second, we focus our attention on the
vexed problem of the unreliable performance of unsupervised ASAG techniques,
caused by their sole reliance on instructor-provided model answers. We
make an intriguing observation, the "wisdom of students": students' answers to
a question contain significant lexical commonalities, and more often than
not, these commonalities are related to the correct answer to the question.
Harnessing classical sequential pattern mining techniques, we propose a
"fluctuation smoothing" approach that not only improves the performance
of ASAG techniques but makes them more reliable irrespective of how the
model answers are articulated.
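The "wisdom of students" idea might be sketched as follows, where simple word overlap stands in for the sequential pattern mining used in the thesis; the scoring formula and the weight alpha are illustrative assumptions.

```python
from collections import Counter

def tokens(text):
    """Lowercased word tokens with trailing punctuation stripped."""
    return [w.strip(".,").lower() for w in text.split()]

def wisdom_scores(student_answers, model_answer, alpha=0.5):
    """Score each answer by (a) overlap with the instructor's model answer
    and (b) how many classmates share its terms, on the assumption that
    commonly used terms tend to track the correct answer."""
    model = set(tokens(model_answer))
    df = Counter()                     # document frequency of each term
    for ans in student_answers:
        df.update(set(tokens(ans)))
    n = len(student_answers)
    scores = []
    for ans in student_answers:
        words = set(tokens(ans))
        model_sim = len(words & model) / max(len(model), 1)
        crowd_sim = sum(df[w] / n for w in words) / max(len(words), 1)
        scores.append(alpha * model_sim + (1 - alpha) * crowd_sim)
    return scores
```

The crowd term makes the score less dependent on exactly how the model answer is articulated, which is the reliability benefit argued for above.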
Supervised Approach
Most prior work on supervised ASAG systems has considered ground-truth
scores of nominal or continuous type. We show that an ordinal regression
formulation with discrete ratio scores yields superior performance
compared to classification or regression. Besides, we demonstrate that considering
students' responses in an ensemble along with model-answer-based features
helps handle certain types of questions in a much better way.
Transfer Learning Based Approach
While supervised techniques provide superior performance, they need
ongoing instructor involvement for creating labeled data in the form of
graded answer scripts. This undermines the benefit obtained by the
automation of the grading process. We propose an iterative transfer
learning based approach for ASAG that treats different questions as
different domains. Building on the supervised ensemble framework, we
employ a canonical correlation analysis based common feature representation
to transfer model-answer-based features from a source question to a target
question, and iteratively build the labeled data pool for the students'
response based classifier.
Selecting an Optimal Technique
We highlight that the performance of ASAG techniques depends on the nature
and difficulty levels of questions as well as on the diversity of
students' answers. This is compounded by the fact that evaluation measures
for ASAG techniques have remained fairly non-standard, and different
measures favor different ASAG techniques. Building on this observation, we
propose a rigorous method to automatically learn a mapping from questions
to ASAG techniques using minimal human (expert/crowd) feedback. We do this
by formulating the learning task as a contextual bandits problem and
provide a regret minimization algorithm that handles key practical
considerations such as noisy experts and similarity between questions.
 Series: M.Sc. (Engg) Colloquium Title: Supervised Classification of Missense Mutations as Pathogenic or Tolerated
using Ensemble Learning Methods  Speaker: Ms. Rashmi Balasubramanyam
M.Sc (Engg.) student at CSA department  Faculty Advisor: Prof. Chiranjib Bhattacharyya
 Date and Time: Tuesday, March 28, 2017, 10:00 AM
 Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract Missense mutations account for more than 50% of the mutations known to be involved
in human inherited diseases. Missense classification is a challenging task
that involves sequencing the genome, identifying the variations, and
assessing their deleteriousness. This is a very laborious, time- and
cost-intensive task to carry out in the laboratory. Advances in bioinformatics
have led to several large-scale next-generation genome sequencing projects, and
subsequently the identification of genome variations. Several studies have
combined this data with information on established deleterious and neutral
variants to develop machine learning based classifiers.
There are significant issues with missense classifiers due to which missense
classification is still an open area of research. These issues fall
under two broad categories: (a) Dataset overlap, where the performance
estimates reported by the state-of-the-art classifiers are overly optimistic,
as they have often been evaluated on datasets that have significant overlaps
with their training datasets. Moreover, there is no comparative analysis of these
tools using a common benchmark dataset that contains no overlap with the
training datasets, making it impossible to identify the best classifier
among them; indeed, no such common benchmark dataset is available.
(b) Inadequate capture of vital biological information about the
protein and mutations, such as conservation of long-range amino acid
dependencies and changes in certain physicochemical properties of the
wild-type and mutant amino acids due to the mutation. It is also not
clear how to extract and use this information. Moreover, some classifiers
use structural information that is not available for all proteins.
In this study, we compiled a new dataset, containing around 2–15%
overlap with the popularly used training datasets, with 18,036 mutations
in 5,642 proteins. We reviewed and evaluated 15 state-of-the-art missense
classifiers: SIFT, PANTHER, PROVEAN, PhD-SNP, Mutation Assessor, FATHMM,
SNPs&GO, SNPs&GO3D, nsSNPAnalyzer, PolyPhen-2, SNAP, MutPred, PON-P2,
CONDEL and Meta-SNP, using six metrics: accuracy, sensitivity, specificity,
precision, NPV and MCC. When evaluated on our dataset, we observe large
drops in performance from what has been claimed. The average drops in
performance for these classifiers are around 15% in accuracy, 17% in
sensitivity, 14% in specificity, 7% in NPV, 24% in precision and 30% in MCC.
With this we show that the performance of these tools is not consistent on
different datasets, and thus not reliable for practical use in a clinical setting.
Having observed that the performance of the existing classifiers is poor in
general, we set out to develop a new classifier that is robust, performs
consistently across datasets, and improves on the state-of-the-art
classifiers. We developed a novel method of capturing long-range amino
acid dependency conservation by boosting the conservation frequencies of
substrings of amino acids of various lengths around the mutation position
using the AdaBoost learning algorithm. This score alone performed on par
with the sequence conservation based tools in classifying missense mutations.
We then combined popularly used sequence conservation properties with these
boosted long-range dependency conservation scores using the AdaBoost algorithm. This
reduced the class bias and improved the overall accuracy of the classifier.
We trained a third classifier by incorporating changes in 21 important
physicochemical properties due to the mutation. In this case, we
observed that the overall performance further improved and the class
bias further reduced. The performance of our final classifier is comparable
with that of the state-of-the-art classifiers. We did not find any significant
improvement, but the class-specific accuracies and precisions are marginally
better, by around 1–2%, than those of the existing classifiers. In order to
understand our classifier better, we dissected our benchmark
dataset into (a) seen and unseen proteins, and (b) pure and mixed proteins,
and analyzed the performance in detail. We conclude that our
classifier performs consistently across each of these categories of seen, unseen,
pure and mixed proteins.
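The substring extraction around a mutation position described above might be sketched as follows; the windowing scheme and names are illustrative assumptions, and the AdaBoost weighting of the resulting conservation features is omitted.

```python
def substring_features(sequence, pos, max_len=3):
    """Enumerate all substrings of lengths 1..max_len that cover the
    mutation position `pos` in an amino acid sequence. Each (start,
    substring) pair would be scored by its conservation frequency in a
    multiple sequence alignment before boosting (not shown)."""
    feats = []
    for length in range(1, max_len + 1):
        # every window of this length that still contains `pos`
        lo = max(0, pos - length + 1)
        hi = min(len(sequence) - length, pos)
        for start in range(lo, hi + 1):
            feats.append((start, sequence[start:start + length]))
    return feats
```

For a mutation at position 2 of "ACDEFG" this yields one length-1, two length-2 and three length-3 windows, giving the learner views of progressively longer-range context around the mutation.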
 Series: M.Sc. (Engg) Colloquium Title: Computing Contour Trees for 2D Piecewise Polynomial Functions  Speaker: Girijanandan Nucha
MSc (Engg) student  Faculty Advisor: Prof. Vijay Natarajan
 Date and Time: Monday, March 27, 2017, 11:00 AM
 Venue: CSA Multimedia Class (Room No. 252, First Floor)
Abstract Contour trees are extensively used in scalar field analysis. The contour tree is a
data structure that tracks the evolution of level set topology in a scalar field. Scalar
fields are typically available as samples at vertices of a mesh and are linearly
interpolated within each cell of the mesh. A more suitable way of representing scalar
fields, especially when a smoother function needs to be modeled, is via higher-order
interpolants. We propose an algorithm to compute the contour tree for such functions.
The algorithm computes a local structure by connecting critical points using a
numerically stable monotone path tracing procedure. Such structures are computed for
each cell and are stitched together to obtain the contour tree of the function. The
algorithm is scalable to higher-degree interpolants, whereas previous methods were
restricted to quadratic or linear interpolants. The algorithm is intrinsically
parallelizable and has potential applications to isosurface extraction.
 Series: Ph.D. Thesis Defense Title: Resolving the Complexity of Some Fundamental
Problems in Computational Social Choice  Speaker: Mr. Palash Dey
Ph.D student  Faculty Advisor: Prof. Y. Narahari and Dr. Arnab Bhattacharyya
 Date and Time: Friday, March 24, 2017, 9:00 AM
 Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract In many real-world situations, especially those involving multiagent systems and
artificial intelligence, participating agents often need to agree upon a
common alternative even if they have differing preferences over the
available alternatives. Voting is one of the tools of choice in these
situations. Common and classic applications of voting in modern
settings include collaborative filtering and recommender systems,
meta-search engines, coordination and planning among multiple automated
agents, etc. Agents in these applications usually have computational power
at their disposal. This makes the study of computational aspects of voting
crucial. This thesis is devoted to a study of the computational complexity of
several fundamental problems arising in the context of voting theory.
The typical setting for our work is an "election": an election consists of
a set of voters or agents, a set of alternatives, and a voting rule. The
vote of any agent can be thought of as a ranking (more precisely, a
complete order) of the set of alternatives. A voting profile comprises a
collection of the votes of all the agents. Finally, a voting rule is a mapping
that takes as input a voting profile and outputs an alternative, which is
called the "winner" or "outcome" of the election. Our work can be
categorized into three parts, as described below.
Part I: Preference Elicitation
In the first part of the thesis, we study the problem of eliciting the
preferences of a set of voters by asking a small number of comparison
queries (such as which of two given alternatives a voter prefers)
for various interesting domains of preferences.
We begin by considering the domain of single-peaked preferences on
trees. We show tight dependencies between the query complexity of
preference elicitation and various parameters of the single-peaked tree,
for example, the number of leaves, diameter, pathwidth, maximum degree of a
node, etc. We next consider preference elicitation for the domain of
single-crossing preference profiles. We establish that the query
complexity of preference elicitation in this domain crucially depends on
how the votes are accessed and on whether or not a single-crossing
ordering is known a priori.
Part II: Winner Determination in Elections
In the second part of the thesis, we undertake a study of the computational
complexity of several important problems related to determining the winner
of an election. We begin with a study of the following problem: given an
election, predict the winner of the election under some fixed voting rule
by sampling as few votes as possible. We establish optimal or almost
optimal bounds on the number of votes one needs to sample for many
commonly used voting rules. We next study efficient sampling based
algorithms for estimating the margin of victory of a given election for
many common voting rules.
Following this, we design an optimal algorithm for determining the
plurality winner of an election when the votes arrive one by one in
a streaming fashion. Our work resolves an intriguing question on finding
heavy hitters that has remained open for more than 35 years in the data
stream literature. We also provide near-optimal algorithms for determining
the winner of a stream of votes for other popular voting rules, for
example, veto, Borda, maximin, etc.
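The connection to heavy hitters can be illustrated with the classic Misra-Gries summary, which finds candidate plurality winners from a vote stream in one pass with bounded memory; this is the textbook routine, not the thesis's optimal algorithm.

```python
def misra_gries(stream, k):
    """One-pass Misra-Gries summary with k-1 counters. The returned set
    contains every item occurring more than len(stream)/k times (it may
    also contain false positives, removable with a second pass)."""
    counters = {}
    for vote in stream:
        if vote in counters:
            counters[vote] += 1
        elif len(counters) < k - 1:
            counters[vote] = 1
        else:
            # decrement every counter; drop those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)
```

With k = 2 this reduces to the Boyer-Moore majority vote: a single counter suffices to track any candidate holding a strict majority of the stream.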
Voters' preferences are often partial orders instead of complete orders.
This is known as the incomplete information setting in computational
social choice theory. We study the kernelization complexity (under the
complexity-theoretic framework of parameterized complexity) of the possible winner
problem and show that there do not exist kernels of size polynomial
in the number of alternatives for this problem for commonly
used voting rules. However, we show that the problem of coalitional
manipulation, which is an important special case of the possible winner
problem, admits a kernel whose size is polynomially bounded in the number
of alternatives for common voting rules.
Part III: Election Control
In the final part of the thesis, we study the computational complexity of
various interesting aspects of strategic behavior in voting.
First, we consider the impact of partial information in the context of
strategic manipulation. We show that lack of complete information makes
the computational problem of manipulation intractable for many commonly
used voting rules. Next, we initiate the study of the computational
problem of detecting possible instances of election manipulation. We show
that detecting manipulation may be computationally easy under certain
scenarios even when manipulation itself is intractable.
The computational problem of bribery is an extensively studied problem in
computational social choice theory. We study the computational complexity of
bribery when the briber is frugal in nature. We show, for many common
voting rules, that the bribery problem remains intractable even when the
briber's behavior is restricted to be frugal, thereby strengthening
intractability results from the literature.
 Series: Department Seminar Title: Learning from Untrusted Data  Speaker: Prof. Moses Charikar,
Computer Science, Stanford University, USA  Date and Time: Friday, March 24, 2017, 3:00 PM
 Venue: CSA Lecture Hall (Room No. 117, Ground Floor)
Abstract The vast majority of theoretical results in machine learning and statistics assume that the available
training data is a reasonably reliable reflection of the phenomena to be learned or estimated. Similarly,
the majority of machine learning and statistical techniques used in practice are brittle to the presence
of large amounts of biased or malicious data. In this work we propose two novel frameworks in which
to study estimation, learning, and optimization in the presence of significant fractions of arbitrary data.
The first framework, which we term list-decodable learning, asks whether it is possible to return a list of
answers, with the guarantee that at least one of them is accurate. For example, given a dataset of
n points for which an unknown subset of αn points are drawn from a distribution of interest, and no
assumptions are made about the remaining (1−α)n points, is it possible to return a list of poly(1/α)
answers, one of which is correct? The second framework, which we term the semi-verified
learning model, considers the extent to which a small dataset of trusted data (drawn from the
distribution in question) can be leveraged to enable the accurate extraction of information from a much
larger but untrusted dataset (of which only an α-fraction is drawn from the distribution).
We show strong positive results in both settings, and provide an algorithm for robust learning in a very
general stochastic optimization setting. This general result has immediate implications for robust
estimation in a number of settings, including robustly estimating the mean of distributions with bounded
second moments, robustly learning mixtures of such distributions, and robustly finding planted partitions
in random graphs in which significant portions of the graph have been perturbed by an adversary.
Speaker Bio: Moses Charikar is a professor of Computer Science at Stanford University.
He obtained his PhD from Stanford in 2000, spent a year in the research
group at Google, and was on the faculty at Princeton from 2001 to 2014. He is
broadly interested in the design and analysis of algorithms with an
emphasis on approximation algorithms for hard problems, metric embeddings
and algorithmic techniques for big data. His work on dimension reduction
won the best paper award at FOCS 2003. He was awarded the 2012 Paris
Kanellakis Theory and Practice Award for his work on locality sensitive
hashing, and was named a Simons Investigator in theoretical computer
science in 2014.
Host Faculty: Dr. Anand Louis
 Series: Ph.D. Thesis Defense Title: "Design of Quality Assuring Mechanisms with Learning, for Strategic
Crowds"  Speaker: Satyanath Bhat
Ph.D student  Faculty Advisor: Prof. Y Narahari
 Date and Time: Wednesday, March 22, 2017, 11:30 AM
 Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract In this thesis, we address several generic problems concerned with the
procurement of tasks from a crowd that consists of strategic workers with
uncertainty in their qualities. These problems assume importance as the
quality of services in a service marketplace is known to degrade when
there is (unchecked) information asymmetry pertaining to quality.
Moreover, crowdsourcing is increasingly being used for a wide variety of
tasks these days since it offers high levels of flexibility to workers as
well as employers. We seek to address the issue of quality uncertainty in
crowdsourcing through mechanism design and machine learning. As the
interactions in a web-based crowdsourcing platform are logged, the data
captured can be used to learn unknown parameters such as the qualities of
individual crowd workers. Further, many of these platforms invite bids by
crowd workers for available tasks, but "strategic" workers may not bid
truthfully. This warrants the use of mechanism design to induce truthful
bidding. There ensues a complex interplay between machine learning and
mechanism design, leading to interesting technical challenges. We resolve
some generic challenges in the context of the following problems.
(1) Design of a quality eliciting mechanism with interdependent values
We consider an expert sourcing problem, where a planner seeks opinions
from a pool of experts. Execution of the task at an assured quality level
in a cost-effective manner turns out to be a mechanism design problem
when the individual qualities are private information of the experts.
Moreover, the task execution problem involves interdependent values, where
truthfulness and efficiency cannot be achieved in an unrestricted setting
due to an impossibility result. We propose a novel mechanism that exploits
the special structure of the problem and guarantees allocative efficiency,
ex-post incentive compatibility and strict budget balance for the
mechanism, and ex-post individual rationality for the experts.
(2) Design of an optimal bi-dimensional crowdsourcing auction
We study the problem faced by an auctioneer who gains stochastic rewards
by procuring multiple units of a service from a pool of heterogeneous
strategic workers. The reward obtained depends on the inherent quality of
the worker; the worker's quality is fixed but unknown. The costs and
capacities are private information of the workers. The auctioneer is
required to elicit costs and capacities (making the mechanism design
bi-dimensional) and, further, has to learn the qualities of the workers as
well, to enable utility maximization. To solve this problem, we design a
bi-dimensional multi-armed bandit auction that maximizes the expected
utility of the auctioneer subject to incentive compatibility and
individual rationality while simultaneously learning the unknown qualities
of the agents.
(3) Design of a multi-parameter learning mechanism for crowdsourcing
We investigate the problem of allocating divisible jobs, arriving online,
to workers in a crowdsourcing platform. Each job is split into a certain
number of tasks that are then allocated to workers. These tasks have to
meet several constraints that depend on worker performance. The
performance of each worker in turn is characterized by several intrinsic
stochastic parameters. In particular, we study a problem where each
arriving job has to be completed within a deadline and each task has to
be completed while honouring a lower bound on quality. The job completion time
and quality of each worker are stochastic with fixed but unknown means. We
propose a learning mechanism to elicit the costs truthfully while
simultaneously learning the stochastic parameters. Our proposed mechanism
is dominant strategy incentive compatible and ex-post individually
rational, with asymptotically optimal regret performance.
 Series: Ph.D. Thesis Defense Title: Incentive Design for Crowdfunding and Crowdsourcing Markets  Speaker: Mr. Praphul Chandra,
Hewlett Packard Enterprise, Bangalore
(ERP Candidate)  Faculty Advisor: Prof. Y. Narahari (IISc) and Dr. Sitaram Ramachandrula (Hewlett Packard Enterprise)
 Date and Time: Tuesday, March 21, 2017, 11:30 AM
 Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract With the ever-increasing trend of social interactions
being intermediated by technology (the world wide web) as the
backdrop, this thesis focuses on the design of mechanisms for
online communities (crowds) which strive to come together, albeit
in an ad hoc fashion, to achieve a social objective. Two examples of
such web-based social communities which are of central concern
in this thesis are crowdsourcing markets and crowdfunding platforms.
For these settings, which involve strategic human agents, we design
mechanisms that incentivize contributions (effort, funds, or information)
from the crowd and aggregate these contributions to achieve the specified
objective. Our work is thus concerned with the challenge of designing mechanisms
which encourage social coordination and cooperation among large groups of
(often) anonymous users to achieve group efficiency (social welfare).
We address the following related challenges:
• Can we design mechanisms to solve the free-rider problem in public
goods settings? Can large anonymous groups of individuals be incentivized
to contribute to create public goods?
• Can we design mechanisms that harness social networks to achieve
coordination of contributions towards a public good to ensure that
publicly desirable goods are successfully funded via private contributions?
How do we make such mechanisms fair?
• Can we design mechanisms that improve the efficiency of markets by
expanding the set of individuals who participate in the market? Can these
individuals be incentivized to increase the group efficiency and, if so, at what cost?
• Can we design mechanisms that make crowdsourcing markets more
equitable by offering participation opportunities to novices and higher
incentives to agents with high reliability? What is the price of reliability?
Using mechanism design as the principal design tool, the thesis attempts
to offer rigorous solutions to the above challenges.
 Series: Ph.D. Thesis Defense Title: NonParametric Clustering of Multivariate Count Data  Speaker: Ms. Lavanya Sita Tekumalla
 Faculty Advisor: Prof. Chiranjib Bhattacharyya
 Date and Time: Monday, March 20, 2017, 9:30 AM
 Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract While there has been significant work in Bayesian nonparametric modeling
in the last decade, there has been much less work on nonparametric
clustering of multivariate count data. The first contribution of this
thesis addresses extensions to clustering with Poisson based models for
multivariate counts. While the Poisson is the most popular distribution for
count modeling, the Multivariate Poisson often leads to intractable
inference and a suboptimal fit of the data. To address this, we introduce
a family of models based on the Sparse Multivariate Poisson that exploit
the inherent sparsity in multivariate data, leading to a better fit and
more efficient inference. We explore Dirichlet process mixture model
extensions and temporal nonparametric extensions to models based on the
Sparse Multivariate Poisson for practical use of Poisson based models for
nonparametric clustering of multivariate counts. A second contribution of
this thesis addresses moving beyond the limitations of Poisson based models
for nonparametric clustering, for instance in handling overdispersed data
or data with negative correlations. We explore, for the first time,
marginal independent inference techniques based on the Gaussian Copula for
multivariate count data in the Dirichlet process mixture model setting.
This enables nonparametric clustering of multivariate counts without
limiting assumptions that usually restrict the marginals to belong to a
particular family. This inference technique can also work for mixed data (a
combination of count, binary and continuous data), enabling Bayesian
nonparametric modeling to be used for a wide variety of data types. The
third contribution of this thesis addresses modeling a wide range of more
complex dependencies, such as asymmetric and tail dependencies, with Vine
Copula based Dirichlet process mixtures. While vine copula inference has
been well explored for continuous data, it is still a topic of active
research for multivariate counts and mixed multivariate data. An efficient
marginal independent inference approach based on the extended rank likelihood
is proposed in this thesis, extending the use of vines for multivariate counts
and mixed data in practical clustering scenarios.
This thesis also explores a novel systems application, Bulk Cache
Preloading, by analyzing I/O traces through predictive models for temporal
nonparametric clustering of multivariate count data. State-of-the-art
techniques in the caching domain are limited to exploiting short-range
correlations in memory accesses at the millisecond granularity or smaller.
We explore, for the first time, Bulk Cache Preloading: the process of
proactively predicting data to load into cache, minutes or hours before
the actual request from the application, by leveraging longer-range
correlations at the granularity of minutes or hours. Our approach involves
data aggregation, converting I/O traces into a temporal sequence of
multivariate counts that we analyze with the temporal nonparametric
clustering models.
As an additional contribution, this thesis addresses multilevel
nonparametric admixture modeling for discrete data in the form of grouped
categorical data. Nonparametric topic modeling for document collections is
well explored with admixture models such as the Hierarchical Dirichlet
Process. However, there exist scenarios where a document needs to be
associated with themes at multiple levels, where each theme is itself an
admixture over themes at the previous level, motivating the need for
multilevel admixtures. We propose the nested Hierarchical Dirichlet
Process to address this problem and apply a two-level version of our model
for nonparametric entity-topic modeling, to automatically learn entities
(in the form of authors) and topics from research corpora.
 Series: Department Seminar Title: Designing Efficient Budget-Constrained Incentive Schemes: To
Compete or To Cooperate?  Speaker: Dr. Ragavendran Gopalakrishnan
Conduent Labs  Date and Time: Thursday, March 16, 2017, 4:00 PM
 Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract This paper studies the design of real-world, performance-based
incentive schemes, which are necessarily budget-constrained, must be simple
to understand and implement, and must satisfy certain fairness criteria.
Participants incur a cost of effort for their performance, which they
trade off with the incentives received. The resulting competition for
incentives is modeled as a non-cooperative game. First, we characterize
budget-constrained incentive schemes that maximize performance at Nash
equilibrium, and observe, interestingly, that the maximum performance is
a concave function of the available budget. Second, we study competitive
vs. cooperative fairness among high and low performing players to
quantify the efficiency loss in moving from Nash to dominant strategy
equilibrium; perhaps surprisingly, under a certain mixed fairness model,
this loss is zero, and the equilibrium is unique. Third, we perform
numerical experiments (when players are homogeneous/heterogeneous in
their type, and when the employer has only partial type information),
and observe interesting phenomena, e.g., an employer’s inability to
type-discriminate is not a significant disadvantage, and the expected
maximum average performance is quite robust to one-sided type-estimation
error. Finally, using real employee data from a large BPO organization,
we show that our incentive scheme can result in up to a 29.8% increase
in average performance of its employees.
Host Faculty: Dr. Siddharth Barman
 Series: M.Sc. (Engg) Colloquium Title: Generalizations of Hitting, Covering and Packing problems on Intervals  Speaker: Mr. Datta Krupa
M.Sc. student, CSA department  Faculty Advisor: Prof. Sathish Govindarajan
 Date and Time: Tuesday, March 14, 2017, 10:30 AM
 Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract Interval graphs are well-studied structures. Intervals can represent
resources such as jobs to be scheduled; finding a maximum independent set
in an interval graph corresponds to scheduling the maximum number of
non-conflicting jobs on a machine. Most optimization problems on interval
graphs, such as independent set, vertex cover, dominating set, and maximum
clique, can be solved efficiently by combinatorial algorithms in
polynomial time. Hitting, covering and packing problems have been
extensively studied over the last few decades and have applications in
diverse areas. While they are NP-hard in most settings, they are
polynomial-time solvable for intervals. In this thesis, we consider
generalizations of the hitting, covering and packing problems for
intervals. For the general setting, we model the problems as min-cost flow
problems via non-trivial reductions and solve them using standard flow
algorithms.
The demand-hitting problem, a generalization of the hitting problem, is
defined as follows: given n intervals, each with a positive integer demand,
and m points, each with a real weight, select a subset H of the points such
that every interval contains at least as many points of H as its demand,
and the total weight of the points in H is minimized. When every interval
has demand one, we give a dynamic-programming-based O(m+n)-time algorithm,
assuming the intervals and points are sorted. When all demands are equal
(say k), we give an O(m^2 n)-time flow-based algorithm. For the general
case, we assume that no interval is contained in another interval; under
this assumption, we give an O(m^2 n)-time flow-based algorithm.
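For intuition, the unweighted special case of the demand-one variant
(minimize the number of chosen points rather than their total weight)
admits a classical greedy solution. The sketch below illustrates that
special case only; it is not the thesis's weighted O(m+n) dynamic program.

```python
import bisect

def min_hitting_points(intervals, points):
    """Unweighted special case of the demand-one hitting problem: pick the
    fewest points so that every interval [l, r] contains at least one.
    intervals: list of (l, r) pairs; points: sorted list of candidates.
    Returns the chosen points, or None if some interval has no candidate.
    Classical greedy: scan intervals by right endpoint; whenever an
    interval is not yet hit, take the largest candidate point inside it."""
    chosen = []
    last = None
    for l, r in sorted(intervals, key=lambda iv: iv[1]):
        if last is not None and l <= last <= r:
            continue  # already hit by the most recently chosen point
        i = bisect.bisect_right(points, r) - 1  # largest point <= r
        if i < 0 or points[i] < l:
            return None  # infeasible: interval contains no candidate point
        last = points[i]
        chosen.append(last)
    return chosen

# Point 4 hits both (1,5) and (3,6); point 9 hits (8,10).
# min_hitting_points([(1, 5), (3, 6), (8, 10)], [4, 9]) -> [4, 9]
```

Since intervals are processed in order of right endpoint, chosen points are
strictly increasing, so checking only the last chosen point suffices.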
The demand-covering problem, a generalization of the covering problem, is
defined as follows: given n intervals, each with a real weight, and m
points, each with a positive integer demand, select a subset C of the
intervals such that every point is contained in at least as many intervals
of C as its demand, and the total weight of the intervals in C is
minimized. When every point has demand one, we give a
dynamic-programming-based O(m + n log n)-time algorithm. When all demands
are equal (say k), we give an O(m n^2)-time flow-based algorithm, and for
the general case we also give an O(m n^2)-time flow-based algorithm.
The k-pack-points problem, a generalization of the packing problem, is
defined as follows: given n intervals, an integer k, and m points, each
with a real weight, select a subset Y of the points such that every
interval contains at most k points of Y and the total weight of the points
in Y is maximized. For this problem, we give an O(m^2 log m)-time
flow-based algorithm, assuming the intervals and points are sorted.
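As a small illustration of the constraint in the k-pack-points problem, a
subset of chosen points is feasible exactly when no interval contains more
than k of them; a hypothetical feasibility checker (not the flow-based
optimization algorithm itself) might look like:

```python
import bisect

def is_k_packing(intervals, chosen, k):
    """Check the k-pack constraint: every interval [l, r] contains at most
    k of the chosen points. chosen must be sorted. Illustrative helper
    only; the thesis solves the optimization version via network flow."""
    for l, r in intervals:
        # count chosen points falling inside [l, r] via binary search
        cnt = bisect.bisect_right(chosen, r) - bisect.bisect_left(chosen, l)
        if cnt > k:
            return False
    return True

# Interval (1, 5) contains all three of 2, 3, 4, so k = 2 is violated
# but k = 3 is satisfied.
```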
 Series: Ph.D. Colloquium Title: Optimization Algorithms for Deterministic,
Stochastic and Reinforcement Learning Settings  Speaker: Mr. Ajin George Joseph
Ph.D. student, CSA department  Faculty Advisor: Prof. Shalabh Bhatnagar
 Date and Time: Monday, March 13, 2017, 2:30 PM
 Venue: CSA Seminar Hall (Room No. 254, First Floor)
Abstract Optimization is a very important field with diverse applications in
the physical, social and biological sciences and in various areas of
engineering. It appears widely in machine learning, information retrieval,
regression, estimation, operations research and a wide variety of computing
domains. The subject has been studied in depth, and a large number of
algorithms are available in the literature. These algorithms, which can be
executed (sequentially or concurrently) on a computing machine, explore the
space of input parameters to seek high-quality solutions to the
optimization problem, with the search mostly guided by structural
properties of the objective function. In this thesis, we propose an
optimization algorithm that is gradient-free, i.e., it does not employ any
knowledge of the gradient or higher-order derivatives of the objective
function, but instead uses the objective function values themselves to
steer the search. The proposed algorithm is particularly effective in a
black-box setting, where a closed-form expression of the objective function
is unavailable and the gradient or higher-order derivatives are hard to
compute or estimate. Our algorithm is inspired by the well-known
cross-entropy (CE) method. The CE method is a model-based search method for
continuous/discrete multi-extremal optimization problems in which the
objective function has minimal structure. The proposed method searches the
statistical manifold of parameters identifying the probability
distributions/models defined over the input space for the degenerate
distribution concentrated on the global optima (assumed to be finite in
number). In the early part of the thesis, we propose a novel stochastic
approximation version of the CE method for the unconstrained optimization
problem, where the objective function is real-valued and deterministic.
The basis of the algorithm is a stochastic process of model parameters that
depends probabilistically on the past history: all samples obtained up to
the current instant are reused via discounted averaging, which saves
overall computational and storage cost. Our algorithm is incremental in
nature and possesses attractive features such as stability, computational
and storage efficiency, and improved accuracy. We further investigate, both
theoretically and empirically, the asymptotic behaviour of the algorithm
and find that it exhibits convergence to the global optimum for a
particular class of objective functions.
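To make the model-based search idea concrete, here is a minimal sketch of
the plain batch CE method for one-dimensional continuous minimization with
a Gaussian model. It is only a baseline illustration: the thesis's
contribution is a stochastic-approximation variant that reuses past samples
through discounted averaging, which this sketch does not implement, and all
parameter values below are arbitrary.

```python
import random
import statistics

def ce_minimize(f, mu=0.0, sigma=5.0, n=200, elite_frac=0.1, iters=60):
    """Plain batch cross-entropy minimization with a Gaussian model:
    sample from the model, keep the elite fraction with the lowest
    objective values, refit the Gaussian to the elites, and repeat.
    The model concentrates toward a degenerate distribution at a minimum."""
    for _ in range(iters):
        xs = [random.gauss(mu, sigma) for _ in range(n)]
        xs.sort(key=f)                            # rank by objective value
        elite = xs[:max(2, int(elite_frac * n))]  # best-performing samples
        mu = statistics.fmean(elite)              # refit model mean
        sigma = statistics.pstdev(elite) + 1e-9   # refit model spread
    return mu

random.seed(0)
best = ce_minimize(lambda x: (x - 2.0) ** 2)  # global minimum at x = 2
```

The refit-on-elites loop is what "model-based search" means here: the
sampling distribution itself, not an iterate, carries the search state.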
Further, we extend the algorithm to solve the simulation/stochastic
optimization problem. In stochastic optimization, the objective function is
stochastic in character, with an underlying probability distribution that
is in most cases hard to comprehend and quantify. This begets a more
challenging optimization problem: the difficulty lies in the hardness of
computing the objective function values for various input parameters with
absolute certainty. In this case, one can only hope to obtain
noise-corrupted objective function values. Settings of this kind arise in
scenarios where the objective function is evaluated using a continuously
evolving dynamical system or through a simulation. We propose a
multi-timescale stochastic approximation algorithm, in which we integrate
an additional timescale to accommodate the noisy measurements and suppress
the effects of the noise asymptotically. We find that if the objective
function and the measurement noise are well behaved, and the timescales are
compatible, then our algorithm can generate high-quality solutions. In the
later part of the thesis, we propose algorithms for reinforcement
learning/Markov decision processes (MDPs) using the optimization techniques
developed earlier. The MDP can be considered a general framework for
modeling planning under uncertainty. We provide a novel algorithm for the
prediction problem in reinforcement learning, i.e., estimating the value
function of a given stationary policy of a model-free MDP (with large state
and action spaces) using the linear function approximation architecture.
Here, the value function is defined as the long-run average of the
discounted transition costs. The resource requirements of the proposed
method, in terms of computational and storage cost, scale quadratically in
the size of the feature set. The algorithm is an adaptation of the
multi-timescale variant of the CE method proposed in the earlier part of
the thesis for simulation optimization. We also provide both theoretical
and empirical evidence to corroborate the credibility and effectiveness of
the approach.
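The linear function approximation architecture mentioned above represents
the value function as an inner product, v(s) ≈ φ(s)ᵀw, so only one weight
per feature is stored rather than one value per state. The toy feature map
and weights below are illustrative assumptions; the thesis's CE-based
algorithm tunes w by a different scheme.

```python
def features(s, n=4):
    """Toy polynomial feature map phi(s) for a scalar state s.
    Purely illustrative; any fixed feature map could be used."""
    x = s / 10.0
    return [x ** i for i in range(n)]

def value(s, w):
    """Linear value estimate v(s) = phi(s) . w; storage is one float per
    feature, independent of the number of states."""
    return sum(p * wi for p, wi in zip(features(s, len(w)), w))

w = [1.0, 2.0, 0.0, 0.0]  # illustrative weight vector
# value(5, w) = 1*1 + 2*0.5 + 0 + 0 = 2.0
```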
In the final part of the thesis, we consider a modified version of the
control problem in a model-free MDP with large state and action spaces. The
control problem most commonly addressed in the literature is to find an
optimal policy that maximizes the value function, i.e., the long-run
average of the discounted transition payoffs. Contemporary methods also
presume access to a generative model/simulator of the MDP, with the hidden
premise that observations of the system behaviour, in the form of sample
trajectories, can be obtained with ease from the model. In this thesis, we
consider a modified version in which the cost function to be optimized is a
real-valued performance function (possibly non-convex) of the value
function, and the optimal policy must be sought without presuming access to
a generative model. We propose a stochastic approximation algorithm for
this modified control problem. The only information we presuppose available
to the algorithm is a sample trajectory generated using an a priori chosen
behaviour policy. The algorithm is data- (sample-trajectory-) efficient,
stable, robust, and computationally and storage efficient. We also provide
a proof of convergence of our algorithm to a high-performing policy
relative to the behaviour policy.

