SE Asian Language Course in Development

Discussion in 'Learning Techniques and Advice' started by Big_Dog, May 2, 2014.

  1. Big_Dog

    Big_Dog Administrator Staff Member

    Joined:
    Jan 11, 2014
    Messages:
    1,039
    Native Language:
    English
    Advanced Languages:
    Spanish
    Intermediate Languages:
    French, Japanese, Mandarin, Russian, Swahili, Thai
    Basic Languages:
    Korean
    There is a thread elsewhere that posed some interesting questions. I wanted to comment without worrying about not being able to quote my own post, so I'm commenting here. I'm also interested in kikenyoy's opinion, since he is a skilled Thai learner, and Cainntear's opinion, since he is also developing something (top secret?). If you get a chance, please read the old thread. Here is the recent question:

    This sounds similar to a mouse-over dictionary that uses statistics. LingQ does something similar, and I find it extremely useful. Something LingQ does that yours doesn't (I think) is color code (by shading) unknown and previously looked-up words. LingQ does some other things too, but that's the only useful thing they have that you don't, imo. The thing that your tool has that LingQ doesn't is the connection to an SRS, rather than just a bare-bones statistic program. This will appeal to lots of people. Your flashcards will be SRS, whereas LingQ flashcards are very basic.

    But I have to admit, there are things about an SRS-based learning program that don't appeal to me. It's some sort of bias that I have against doing too many flashcards. I normally advise people not to let flashcards get to be over 25% of their total study time. But if the whole program is in an SRS - yikes! And part of this bias comes from years of dealing with SuperMemo and Anki. SRSs are so unforgiving. Have a couple of bad sessions, and the whole collection could spin out of control, and the user could be facing a nightmarish quantity of repetitions. So have you thought of this? Are you going to give the user a way out? If he's sick, can he skip a day without increasing his workload? I had several arguments with Damien (Anki creator) about this. He refused to allow Anki to do a time shift; said the user should make up the repetitions. This isn't realistic, imo. Anyway, he ended up adding the ability to put an upper limit on reps per day for a given deck. At first, I refused to use that, because it obviously violates the algorithm when you hit the limit. But since it was the only solution I had, I finally gave in and started using it. It seems to work pretty well.
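
    A minimal sketch of the three options discussed above: making up every missed repetition, capping reps per day, and shifting the schedule. This is a toy model, not Anki's scheduler. The function name simulate, its parameters, and the numbers are all hypothetical, and it ignores the fact that a delayed review also changes a card's next interval.

```python
def simulate(due_per_day, skipped_days, daily_cap=None, time_shift=False):
    """Return how many reviews the learner does on each day (toy model).

    due_per_day  -- reviews that newly fall due on each scheduled day
    skipped_days -- set of day indexes on which the learner does not study
    daily_cap    -- optional upper limit on reviews per day; the rest carry over
    time_shift   -- if True, a skipped day pushes the whole schedule back a day
                    instead of piling that day's reviews onto the next session
    """
    if daily_cap is not None and daily_cap <= 0:
        raise ValueError("daily_cap must be positive")
    schedule = list(due_per_day)
    backlog = 0
    done = []
    day = 0
    while schedule or backlog:
        if day in skipped_days:
            if not time_shift and schedule:
                backlog += schedule.pop(0)  # missed reviews pile up
            done.append(0)
        else:
            if schedule:
                backlog += schedule.pop(0)
            today = backlog if daily_cap is None else min(backlog, daily_cap)
            done.append(today)
            backlog -= today
        day += 1
    return done


due = [100] * 5      # hypothetical: 100 reviews fall due every day
sick = {1, 2}        # the learner is ill on days 1 and 2

print(simulate(due, sick))                   # [100, 0, 0, 300, 100]
print(simulate(due, sick, daily_cap=150))    # [100, 0, 0, 150, 150, 100]
print(simulate(due, sick, time_shift=True))  # [100, 0, 0, 100, 100, 100, 100]
```

    In this toy model the cap spreads the spike over extra days, while the time shift keeps every session at its normal size and finishes two days later.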
  2. Shaun Noonan

    Shaun Noonan New Member

    Joined:
    May 2, 2014
    Messages:
    3
    Thanks for carrying this thread over. Thanks for the feedback regarding LingQ. I think we can get to similar or better functionality after some iterations with real users. It's not a hard software problem to annotate the text.

    I can answer all of this by saying that SRS software is kind of dumb, in my opinion. Somewhere along the way, spaced repetition came to equal boring flashcard implementations. We're not robots, right? My theory is that the implementation can be MUCH MUCH more subtle than a flashcard app while being equally effective. So, imagine it's tracking what you're reading and understanding in L2. It knows what you know, kind of know and don't know. Why should it be concerned with drilling on individual terms as if it's programming you like a robot?

    The big revelation I had as a language course developer, a coder and a language learner myself is that SRS can be more subtle. It can passively extract what you know and don't know from how you scored your understanding of whole chunks of text, right? A human could do it. It's not hard. SRS can be a LOT smarter than Anki. It can act on whole chunks of material and infer the SRS entries if you're using it a lot. It can then let you know what you need to drill on to understand a specific text (or audio if you have the matching transcript). Then you only need to drill for a minute or two to understand the text you don't know. Or not drill at all, and keep it on the sidebar or in hover-overs (it doesn't matter, same result). Understanding it in context will be the same as an SRS hit. You don't have to look at flashcards anymore. Just read and/or listen.
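
    One way the passive inference described above could be sketched: the reader scores a whole chunk, and every term in it is nudged toward that score, except terms that were looked up. This is an illustration only, not the course's actual algorithm. The function names, the 0-to-1 score scale, the rate and threshold values, and the moving-average update are all assumptions.

```python
def update_estimates(estimates, chunk_terms, comprehension, looked_up=(), rate=0.3):
    """Nudge per-term knowledge estimates after the reader scores one chunk.

    estimates     -- dict mapping term -> estimate in [0, 1]; updated in place
    chunk_terms   -- the terms that appeared in the chunk
    comprehension -- the reader's self-rating for the whole chunk, 0.0 to 1.0
    looked_up     -- terms checked in the sidebar or hover-over; treated as
                     evidence that the term is not known yet
    rate          -- how strongly one chunk moves an estimate (assumed value)
    """
    for term in set(chunk_terms):
        evidence = 0.0 if term in looked_up else comprehension
        old = estimates.get(term, 0.0)
        estimates[term] = old + rate * (evidence - old)
    return estimates


def terms_to_drill(estimates, text_terms, threshold=0.5):
    """Terms in an upcoming text whose estimate is still below the threshold."""
    return sorted(t for t in set(text_terms) if estimates.get(t, 0.0) < threshold)


known = {}
chunk = ["saya", "makan", "nasi"]  # illustrative Indonesian tokens
update_estimates(known, chunk, comprehension=1.0, looked_up={"nasi"}, rate=0.5)
update_estimates(known, chunk, comprehension=1.0, looked_up={"nasi"}, rate=0.5)
print(known["saya"], known["nasi"])                       # 0.75 0.0
print(terms_to_drill(known, ["saya", "nasi", "goreng"]))  # ['goreng', 'nasi']
```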

    It bothers me to see Anki == spaced repetition because that's just the most boring way to implement computer-assisted spaced repetition. Time to get a little smarter in 2014. :)

    It's SRS without you having to actually directly drill with it. Read, understand, SRS tracks and catches you if you're missing things. You can drill if you want for focused results, but it will fill in the blanks on-demand as needed for the material you're reading. It's there, in the background, providing you with term instruction when you need it.

    And in defense of Damien, he's done great work in refining his implementation. He's put much more time into developing Anki than we have with our system and he really knows what he's doing. But I now disagree with Anki as the be-all and end-all of vocab learning with SRS for L2 acquisition. It's too much of a grind and people quit. It's not aware of context and doesn't know what you're doing with the words. If it knew whether your GOAL was being satisfied (vs you scoring your cards), it could infer much more and be smarter about how it models what it thinks you know. It could recommend stuff for you to read. It could suggest drilling if you're missing some key terms. It could know about 1, 2 & 3 word collocation frequencies and make sure you're reading stuff with important words and phrases that you're weak on.
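
    A rough sketch of the recommendation idea above: count 1-, 2- and 3-word sequences in each candidate text and rank the texts by how often they contain items the learner is weak on. The names ngrams and rank_texts, the whitespace tokenizer, and the sample phrases are assumptions for illustration. A real system would need proper tokenization for each language.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every 1- to max_n-word sequence in tokens as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def rank_texts(texts, weak_items, max_n=3):
    """Order text names by how many occurrences of weak n-grams they contain.

    texts      -- dict mapping a text name to its raw text
    weak_items -- set of n-gram tuples the learner is weak on
    Ties are broken alphabetically by name so the order is deterministic.
    """
    scored = []
    for name, text in texts.items():
        counts = Counter(ngrams(text.lower().split(), max_n))
        score = sum(counts[item] for item in weak_items)
        scored.append((score, name))
    return [name for score, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


texts = {
    "a": "terima kasih banyak",
    "b": "selamat pagi",
    "c": "terima kasih terima kasih",
}
weak = {("terima", "kasih"), ("pagi",)}
print(rank_texts(texts, weak))  # ['c', 'a', 'b']
```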

    I am advocating stepping outside of that specifically for language learning and analyzing L2 text you are reading. The SRS interface becomes the material you're reading, not the flashcard. This is something like what a good language tutor is doing for you... knowing what you know and providing N+1 input to drive you to understand more. It mostly does away with the flashcard interface altogether and puts the interface at the reading/listening (where there's a transcript) level.

    So, that's where I'm coming from here. It's a new way to implement it. It might not be as absolutely time efficient as flashcard drilling for vocab acquisition, BUT I expect it leads to useful understanding much faster while actually being interesting and keeping the learner immersed. I can't show any data until we are running with live users, but we do have a lot of Indonesian users to test with and will happily discuss results.
  3. Big_Dog

    Big_Dog Administrator Staff Member

    I think I'm understanding you better now, especially regarding the SRS. I kept on thinking of it as a flashcard program, which is very specific, rather than general software that drives spaced repetition. It sounds really good, and quite advanced. I'm going to remain skeptical, of course, but I really look forward to implementation.

    Looking back to the other thread, is the main obstacle right now the creation of the 100 hr audio program? And are you working on Malaysian and Tagalog concurrently?
  4. Nobody

    Nobody Member

    Joined:
    Apr 20, 2014
    Messages:
    31
    Native Language:
    English
    Advanced Languages:
    Korean
    Intermediate Languages:
    Chinese_classical, Mandarin
    Let me try this again, this time not on my phone.

    I feel like you're being far too hard on Anki here, for a reason you yourself say: "It might not be as absolutely time efficient as flashcard drilling for vocab acquisition." Anki's advantages are that it's quick (just a second or two per card) and convenient (one can easily stop at any moment and resume later, making it ideal for studying while walking, waiting, having a coffee, or so forth). You're obviously correct that it's not the be-all and end-all of vocabulary acquisition, but at the same time, it fills an excellent niche: it's what you study when studying anything else wouldn't work for whatever reason (especially since it's on a phone). By contrast, it seems to me -- emphasis on that word because I'm not sure I perfectly understand what you're proposing here -- that what you're describing is a process even slower and more focused than simply reading. Surely for any sort of SRS system to be subtly integrated into a text reader, the user must give feedback regarding the text, yes? So if I'm reading something and I don't know a word, not only do I have to pause to look it up, but I also must pause to tell the system, "By the way, I didn't know this word"? And then it provides me with texts which have words it decides I need to practice based upon that feedback? This means it's going to be studied not in the time I'd do Anki, but in the time I'd do something like reading, writing, or listening practice. Assuming I am understanding you correctly (and maybe I'm not), what concerns me is that if there's one thing that's more boring than doing flashcards, it's reading texts in which you have no interest. I tried some system called Bliu Bliu a while back, and I found the texts it provided me so horribly boring I just gave up. By contrast, if I just read what I want, the "SRS" element is gone; whatever words are in the text are in the text, with no regard to what I actually "need" to study, right? So instead of replacing boring Anki, it seems like I'd be replacing texts which genuinely interest me?

    I'm not saying this sounds like a bad system, by the way. It sounds interesting. It just doesn't sound like it could replace Anki for me.
  5. Shaun Noonan

    Shaun Noonan New Member

    @BigDog, yes, I appreciate the skepticism. I'm skeptical too. I almost feel like it was too early to start the thread on it, but I wanted to get feedback early on the direction of it. I ended up defending it a bit, but I really shouldn't. It's just a theory that I'm implementing and will test. It may not be better in the end.

    We're working through Tagalog, but Indonesian and Malay will probably come at about the same time, since we already have more finished resources and reliable people for those. Tagalog, Indonesian and Malaysian are collectively Batch 1.

    I can probably list a lot of huge obstacles! If we were doing a 4-6 CD set kind of thing, we could bang it out in 1 day of recording studio time. It would be something we could develop in just a few weeks. This is why you see these smaller things on the market a lot. We're doing the equivalent of like 85-90 CDs. So that means we need a lot of script material, a lot of editing work and a lot of recording work.

    The largest obstacle is retaining skilled language editors who will stick it through. It's about as hard as writing a book. We'd like to think partnering up for a large work like this is as easy as finding people with the right skills and paying them for their work. The reality is a bit rougher. People just stop working on things. They give up because it's hard. They cut corners when no one is looking. For example, we lost 2 months of work because we found one of our language editors was wholesale copying material from websites and claiming it was original work for lesson scripts. She was mixing it in with good material we had written. It tainted hundreds of pages of work before we noticed some of it was oddly similar to another site we happened across. Then there's the wrangling of these issues and getting the right people to show up for recording sessions on a consistent basis without having to switch them out. We try to have 2 or 3 people working on the same thing. After that, any software development project is a major challenge (we have 4 in this!). Lastly, there's the part that's not about building the thing, but about getting a return so we don't end up having spent tens of thousands out of pocket! We have to fight to get eyes on it with tons of junk like Rosetta Stone and crap mobile apps polluting the channels where people could find it. It's not enough to just make a good course; people have to actually find it. Anyway, that got a little ranty. The truth is that the hardest part is staffing and making sure what comes out is clean, correct and consistent across hundreds of source script pages.

    @Nobody, I'm actually going to step back from advocating what I'm working on, as it's kind of becoming a defense of something untested. What we'll do is publish courses with the method and people can try them out, see how it works and provide feedback. The impetus for posting early was to get feedback, not customers! :) The discussion has gone a little too far toward explaining and defending something that hasn't been tested with new learners yet.

    I think I mentioned it before, but the story part of what we're doing is actually being written by fiction writers. We have been posting sample pilot stories and surveying random samples of 5000 Indonesian learners, getting direct feedback on how interesting each character and story line is, how likely they are to want to read each pilot story line, etc. We don't go with anything less than 4's or 5's out of 5. So I can say with some confidence that our reading/audio material will be enjoyable for most people. Our current writer bangs out fun stuff and is absolutely onboard with having such a user-driven approach to tweaking the work. The reason for doing all this writing work was the lack of interesting material suitable for language learning.
  6. Cainntear

    Cainntear Active Member VIP member

    Joined:
    Apr 29, 2014
    Messages:
    343
    Native Language:
    English
    Advanced Languages:
    Catalan, French, Italian, Scottish_Gaelic, Spanish
    Intermediate Languages:
    Corsican
    Basic Languages:
    Dutch, German, Irish, Polish, Russian, Welsh, Sicilian
    I'm not wanting to say too much, as Shaun's ideas are starting to approach mine in certain ways. He seems to be directly addressing some of the biggest gaps between SRS and things like Learning With Texts and LingQ.

    One of the problems with SRS is that it fails to recognise the "outside" world -- the longer you've been engaging with a language, the less the SRS system knows about what you do and can do, and the less relevant its scheduling becomes.

    Meanwhile, LWT and LingQ just jump from "never seen it" to "know it" without paying any attention to repetition.

    What Shaun's proposing is certainly not technically difficult and there's really no good reason that similar software doesn't already exist. The use of n-grams is an excellent way to get round the difficulties of identifying specific contexts and senses, so the software can make a good guess at whether a text is genuinely comprehensible. The fact that it is just an approximation is not a bad thing. It's a weakness, but if implemented like he says, it will be stronger than anything that we've got at the moment, so that's not a problem.
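
    To make the "good guess" concrete, here is one possible n-gram approximation, offered as an assumption rather than as what either developer has built. Each word in a new text is credited with the length of the longest previously seen n-gram that covers it, so a word met before in the same surrounding context counts for more than one met only in isolation. The function name comprehensibility and the scoring scheme are hypothetical.

```python
def comprehensibility(tokens, seen, max_n=3):
    """Rough 0-to-1 guess at how comprehensible a token list is.

    tokens -- the new text, already split into words
    seen   -- set of n-gram tuples the learner has met (and understood) before
    Each token scores n / max_n, where n is the length of the longest seen
    n-gram covering it; the result is the average over all tokens.
    """
    if not tokens:
        return 0.0
    best = [0] * len(tokens)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) in seen:
                for j in range(i, i + n):
                    best[j] = max(best[j], n)
    return sum(best) / (max_n * len(tokens))


seen = {("selamat",), ("pagi",), ("selamat", "pagi")}
score = comprehensibility(["selamat", "pagi", "semua"], seen)
print(round(score, 3))  # 0.444
```

    Here "selamat" and "pagi" each score 2 because the pair has been seen together, "semua" scores 0, and the total of 4 is divided by the maximum possible 9.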
  7. Shaun Noonan

    Shaun Noonan New Member

    @Cainntear, thanks. This expresses it well. I would agree on the strengths and weaknesses.

    Going off topic and from a programming perspective, I'd love to see ways around using n-grams for novel text and where we don't have lovely things like WordNet and good NLTK support. Since we're working with very underserved languages in both available quality course material and computational linguistics work, there's not a lot to base anything useful on. If you're really going to nail it, drop me a line if licensing is something you'd be interested in. We don't have not-invented-here syndrome and we'd rather be more focused on the content side.
  8. Big_Dog

    Big_Dog Administrator Staff Member

    I can sympathize with this, being a partner in several businesses myself. The Devil's in the details, I guess. Sounds like things are going pretty well though; you keep persevering. I might be asking you some business questions myself in the near future :)
    Last edited: May 4, 2014
  9. Cainntear

    Cainntear Active Member VIP member

    I'd certainly be interested in licensing if I had a product ready to license! At the moment, though, I've just got a collection of scrappy prototypes in Python and (shudder) JavaScript. Various factors (including myself) have conspired to make progress much slower than it ought to be. Once my current teaching contract finishes (mid-July) I'll be getting my head down and trying to get a working system.

    My views are kind of divergent from yours, though, in that I'm trying to merge procedurally generated content with authentic materials.
  10. Big_Dog

    Big_Dog Administrator Staff Member

    I remember some discussion at LingQ about automatically toggling words from "previously looked up" (yellow) to "known" (unhighlighted) after a certain number of views. What they have now is a rating system that allows you to grade how well you know the word, and/or the ability to toggle the word manually. I don't think LingQ lets you do this automatically.
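
    The automatic toggle described above could look something like the sketch below. It is hypothetical and does not describe LingQ's actual behaviour. The status names, the record_view function, and the promote_after count are all assumptions.

```python
def record_view(states, word, looked_up=False, promote_after=5):
    """Update one word's status after the reader passes it in a text.

    states        -- dict mapping word -> (status, views since last lookup)
    looked_up     -- True if the reader opened the dictionary for this word
    promote_after -- clean views needed to move "looked_up" to "known"
    Statuses: "new" (never looked up), "looked_up" (yellow), "known".
    """
    status, views = states.get(word, ("new", 0))
    if looked_up:
        status, views = "looked_up", 0  # a lookup resets the counter
    elif status == "looked_up":
        views += 1
        if views >= promote_after:
            status = "known"
    states[word] = (status, views)
    return status


states = {}
print(record_view(states, "rumah", looked_up=True, promote_after=3))  # looked_up
print(record_view(states, "rumah", promote_after=3))                  # looked_up
print(record_view(states, "rumah", promote_after=3))                  # looked_up
print(record_view(states, "rumah", promote_after=3))                  # known
```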

    Personally, I don't think it's worth the time to grade the words. I just leave them yellow, and never get credit for knowing them. I don't think this makes reading any less effective, and I'm sure it saves me from interrupting my reading.
  11. Peregrinus

    Peregrinus Active Member

    Joined:
    May 27, 2014
    Messages:
    613
    Native Language:
    English
    Intermediate Languages:
    German
    Basic Languages:
    Spanish
    I just noticed this thread although the convo died off a while back. Obviously from the other threads I have participated in, Anki interests me a great deal and I attribute to it my rapid growth of passive skills in German. And there are also some other highly interesting points in this conversation, not the least of which is concentrating on languages without a lot of good resources.

    There are two metrics important to me with Anki or other methods: time efficiency and effectiveness. The efficiency of SRS vs extensive reading was the topic of the thread I started, If You Don't Like Anki or Wordlists. As of yet I don't see anyone asserting that extensive reading (which I believe in and use!) is nearly as efficient timewise as SRS. If all one does is briefly look up a word each time one encounters an unknown or partially known one, then one still needs huge exposure to encounter even mid-frequency vocabulary enough times for it to stick. Anki wins hands down here IMO. As for effectiveness, Anki is not an end in itself, and is most effectively used in conjunction with courses or sources where the drilled words are reinforced by reading. After a certain point though one runs out of courses and easier sources, and must rely on encountering lower frequency vocabulary "in the wild" as it were, i.e. in normal reading where one is not guaranteed to encounter such low frequency vocabulary any time soon.
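
    The "huge exposure" point can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic only. The frequency of 10 occurrences per million words, the 10 encounters, and the reading speed of 150 words per minute are hypothetical values chosen for illustration, not measurements.

```python
def tokens_needed(per_million, encounters):
    """Running words to read, on average, to meet a word `encounters` times."""
    return encounters * 1_000_000 / per_million


def reading_hours(tokens, words_per_minute):
    """Hours needed to read `tokens` words at a steady pace."""
    return tokens / words_per_minute / 60


tokens = tokens_needed(per_million=10, encounters=10)
print(tokens)                                # 1000000.0
print(round(reading_hours(tokens, 150), 1))  # 111.1
```

    Under these assumptions a single word at that frequency needs about a million words of reading to turn up ten times. That reading also exposes the learner to everything else in the text, so this is not a like-for-like comparison with ten flashcard reviews.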

    I think the prime selling point of the envisioned course here is simply the HUGE size of audio support in the form of dialogues. It seems like Assimil multiplied by a factor of 25 or so. There is an IF though. That is IF this course also teaches a vocabulary size that is multiples of Assimil. Otherwise it is Assimil plus a large number of graded readers to support the same vocabulary level. So my question would be, how many lemma forms is this course going to teach? Obviously a little more difficult question to answer re Tagalog given its highly agglutinative nature.

    The SRS aspect of this proposed course is actually the least interesting facet to me, because I would simply stick the vocab in Anki to reinforce the course if it did not have the capability itself. Rather the size of the course, again assuming it teaches a very large vocabulary, is what interests me, along with it being based on dialogues. In other languages like German, once you get to upper level courses as taught in universities, the audio support is in the form of longish monologues, news-like reports and short snippets of conversation. But shorter dialogues are more easily remembered and more easily broken into bite-sized chunks of learning, which is what accounts for the success of Assimil.

    Another interesting aspect of this discussion is n-grams, though not in the way you are using them, which is to analyze what the learner still needs to work on. Rather I am interested in frequency-based n-grams as sources of lexical chunks to be used for learning. Apart from English there is little analysis of this by language, merely the raw data as one gets from the Google sources. Identifying and teaching high frequency lexical chunks is the frontier of language learning to my mind.

    I wish the course designers the best of luck.
  12. Cainntear

    Cainntear Active Member VIP member

    I meant to respond to this ages ago.

    I actually think Shaun's approach to n-grams is a very clever one. In computing, and particularly in artificial intelligence, a lot can be done with "dumb" approximations. Shaun's software won't exactly identify idiomatic bundles -- that's a hard task, because it needs some kind of verification step to avoid making untrue assumptions based on limited training data.

    Instead, the software makes the observation that whether or not this is a genuine lexical bundle, you are more likely to understand the word in this situation than in a situation you have never encountered before.

    But actually, it doesn't even assume this. It may be that you didn't understand it first time, but by exposing you to a similar usage, it will either help you learn by exposure, or push you to look it up in a reference book.

    Either way, you win.

    OK, for each individual language feature, it's less efficient than well-programmed instruction, but no programmed course is going to cover that much material, so you're getting more bang for your buck in the long-term.