[silpa-discuss] Fwd: Re: Regarding GSoC in Indic

---------- Forwarded message ----------
From: Anivar Aravind <address@hidden>
Date: 25-Feb-2017 6:00 PM
Subject: Fwd: Re: Regarding GSoC in Indic
To: Santhosh Thottingal <address@hidden>, Santhosh Thottingal <address@hidden>, Vasudev Kamath <address@hidden>, Jishnu Mohan <address@hidden>
Cc: Anivar Aravind <address@hidden>

Dear Santhosh, Vasudev & Jishnu,

What do you think about the proposed idea?
Please respond to the student.
---------- Forwarded message ----------
From: "Rohan Saxena" <address@hidden>
Date: 25 Feb 2017 5:47 p.m.
Subject: Re: Regarding GSoC in Indic
To: "Akshay S Dinesh" <address@hidden>
Cc: <address@hidden>

Hello Sir,

I did not hear from you regarding my previous mail. I just wanted to know what you think about the idea of using RNNs for sandhi splitting. Would the community be interested in such a project being implemented?

If not, I am also open to the idea of porting the LibIndic code to Python 3.

Thanks,
Rohan Saxena

On Thu, Feb 23, 2017 at 9:28 PM, Rohan Saxena <address@hidden> wrote:
Hello Sir,

I went through the approach used in the present version of Sandhi-splitter (the documentation mentions that this approach is employed).

While it is an interesting algorithm, I think it relies heavily on human judgement (see below on how and why this can be improved). The entire model has been constructed by making decisions at crucial stages of the architecture design based on the designers' understanding of the structure and semantics of the dataset. One example is the technique of skipping an initial part of the word up to a particular position and then moving an equal number of characters on either side. This is the designers' choice, and nothing shows that it is one of the best strategies for this task. The model has not arrived at this approach on its own.
(On a side note, we calculate the probability of a substring being at the start or end of a sandhi by going through the entire dataset and counting all occurrences of the substring in such a position. This relies strongly on the dataset being uniformly distributed with respect to the substrings occurring before and after the sandhi.)
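
To make the counting idea in the side note concrete, here is a minimal sketch. It is not the Sandhi-splitter code: the function names, the (word, split position) sample format, the context length k and the way the two estimates are multiplied are all assumptions made for illustration. It also slices by Unicode code point, which is only a rough stand-in for Indic syllable boundaries.

```python
from collections import Counter


def context_counts(samples, k=2):
    """Count k-character contexts before/after every position and every split.

    `samples` is a hypothetical training format: (word, split_index) pairs.
    """
    before_split, before_all = Counter(), Counter()
    after_split, after_all = Counter(), Counter()
    for word, split_at in samples:
        for i in range(1, len(word)):
            left = word[max(0, i - k):i]   # up to k characters before position i
            right = word[i:i + k]          # up to k characters after position i
            before_all[left] += 1
            after_all[right] += 1
            if i == split_at:
                before_split[left] += 1
                after_split[right] += 1
    return before_split, before_all, after_split, after_all


def split_probability(word, i, counts, k=2):
    """Relative-frequency estimate that position i of `word` is a split point."""
    before_split, before_all, after_split, after_all = counts
    left = word[max(0, i - k):i]
    right = word[i:i + k]
    # Unseen contexts get probability 0.0 (no smoothing in this sketch).
    p_left = before_split[left] / before_all[left] if before_all[left] else 0.0
    p_right = after_split[right] / after_all[right] if after_all[right] else 0.0
    return p_left * p_right


# Toy Latin-letter data, purely illustrative.
counts = context_counts([("abcd", 2), ("abef", 2), ("xbcd", 1)])
print(split_probability("abcd", 2, counts))
print(split_probability("abcd", 3, counts))
```

Because the estimates are plain relative frequencies, any context that is over- or under-represented in the training data skews them directly, which is the dependence on the dataset's distribution mentioned above.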

From above, "How and why can this be improved?" While strong human architectural decisions are not necessarily bad, recent advances in deep learning have shown that allowing the model to learn features on its own, at increasing levels of abstraction, gives stellar results (see 'evidence paper'). Such automatic learning techniques have reached very high accuracy in computer vision and pattern recognition. This is one of the important factors that has made deep learning so effective and popular.

From above, 'evidence paper': See this paper by Y. LeCun et al. "The main message of this paper is that better pattern recognition systems can be built by relying more on automatic learning, and less on hand-designed heuristics... We show that hand-crafted feature extraction can be advantageously replaced by carefully designed learning machines that operate directly on pixel images."

I think it would be interesting to see if we could apply this philosophy of deep learning to try to achieve better results on sandhi splitting.

RNNs are popular models that are showing good performance on NLP tasks. If you are unsure whether feature learning can be incorporated into RNNs, see this.

I have my mid-semester exams coming up soon, so unfortunately I have little time on my hands. However, if you wish me to go deeper into a particular aspect of this argument, let me know!

Thank you,
Rohan Saxena

On Wed, Feb 22, 2017 at 1:22 PM, Rohan Saxena <address@hidden> wrote:
Sure. Let me get back to you on this in a couple of days.

Thanks,
Rohan

On Wed, Feb 22, 2017 at 1:59 AM, Akshay S Dinesh <address@hidden> wrote:
Can you go through http://jerinphilip.github.io/posts/2016-08-22-gsoc-final-report.html which is the final report by jerin from last year and think of how that approach differs from RNN and compare them?

On Tue, 21 Feb 2017 at 21:17, Rohan Saxena <address@hidden> wrote:
Hello sir,

Thank you for sharing the wonderful links. The essay on free software was backed by some pretty deep psychology, and really changed how I view open source software. I have also shared it with some of my friends and fellow programmers :)

I went through the ideas list you mentioned and am interested in working on improving the Sandhi Splitter to include RNN strategies. This project is in line with my interests (deep learning and neural networks) and my skills (machine learning and Python).

I apologise for giving a very brief introduction about myself in my first mail. Here is some more information about me:
I am a second-year computer science student at BITS Pilani. I have been a member of the Embedded Systems and Robotics lab here since my freshman year. I work on robotics and artificial intelligence (specifically computer vision and deep learning).
I am also enrolled in the Udacity self-driving car nanodegree, and as part of the programme I have implemented various deep learning architectures which have achieved, for example, accuracies of over 99% on the MNIST dataset (classification of handwritten digits) and 95% on the GTS dataset (classification of German Traffic Signs). I want to work on a similar project for GSoC 2017.

Kindly advise me on how to proceed.

Thank you,
Rohan Saxena.

On Tue, Feb 21, 2017 at 4:34 PM, Akshay S Dinesh <address@hidden> wrote:
Hey Rohan,
If this is your first time contributing to free software, read http://asd.learnlearn.in/gsoc-handbook/
Then, go through the gsoc repo at https://gitlab.com/indicproject/gsoc-2017/

See how and where you can contribute and try to come up with an idea.

Let us know.

Akshay

On Tue, 21 Feb 2017 at 16:31, Rohan Saxena <address@hidden> wrote:
Hello,

I am a second-year Computer Science student at BITS Pilani, Pilani campus, interested in machine learning (especially deep learning).

I wish to contribute to the Indic project as part of GSoC 2017. How can I go about doing this? It would be helpful if you could point me in the right direction.

Thank you,
Rohan Saxena.


From:	vasudev
Subject:	[silpa-discuss] Fwd: Re: Regarding GSoC in Indic
Date:	Sat, 25 Feb 2017 20:03:14 +0530