Sunspotter Talk

Passes, and all the mystical backroom workings.

  • Quia

    As I'm writing this, we're currently at pass 15. The ratio of rankings to images lines up nicely with the conclusion that each image has been viewed at least 15 times, and some of them 16 times.

    So what happens at the end of a pass/start of a new one? Do we get new, random pairings of two images, or does each new pass constrain the pairs to images of more similar complexity? Or is some other method being used to make the pairs?

    I've worked on ranking sets of 5k items before, and I know the completely random approach can get reasonable results, but ranks converge on their 'real' values a lot faster if you do non-random matches. Here, though, there's a human factor in the rankings! If you do non-random matchups, each pass will get harder to classify, with the complexities getting closer and closer together. I wonder if you couldn't retire images once they stop changing appreciably in rank over a given number of passes, because people can't decide whether they're more or less complex than the ones around them...

    Sorry, rambling a bit, I'm just curious about the approach taken with this project!

  • Quia

    There's a tiny bit on the Sunspotter poster linked on the blog that mentions sorting the sunspots rather than ranking them; with sorting, all the logical problems I was thinking of get replaced by a series of nice binary comparisons. The idea of being the binary predicate in a merge sort amuses me!
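
    Just to amuse myself a bit more: the volunteer really is the comparison function handed to the sort. Here's a toy sketch in Python (purely illustrative, and nothing to do with how the site actually works):

    ```python
    import functools

    # Pretend stand-in for a volunteer's click: return the image judged more complex.
    def volunteer_click(image_a, image_b):
        return image_a if image_a["complexity"] >= image_b["complexity"] else image_b

    # The volunteer acts as the binary predicate the sort needs.
    def compare(image_a, image_b):
        return 1 if volunteer_click(image_a, image_b) is image_a else -1

    images = [{"id": "spot_1", "complexity": 7},
              {"id": "spot_2", "complexity": 3},
              {"id": "spot_3", "complexity": 9}]

    # Python's built-in sort is merge-based, so this really is being the predicate in a merge sort.
    ranked = sorted(images, key=functools.cmp_to_key(compare))
    print([img["id"] for img in ranked])  # least to most complex
    ```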

    So, to answer my own question... Yes, the images we see are getting sorted into more intelligent pairings. It may not seem like it some of the time, as it largely depends on how the data is being served up to us, and on the approach used to deal with getting multiple, potentially differing classifications on the same pair. If the spots were sorted as soon as they had one classification, we'd have been through the list almost three times over by now and every classification would be between very similar images. Clearly that's not happening, so I imagine each pair/image is required to have more than one classification before it's sorted.

    NOW I'm curious about the logistics of repeated sorts vs. combining classifications into one sort. Or whether I'm even close to what's actually going on behind the scenes.

    My mind is clearly wandering a little while I'm clicking away. Clickclickclickclick go the sunspots.

  • pahiggins

    Sorry it took a while to respond; we scientists were having a discussion with the development team to make sure we understood how the sort was working. So, the ELO scoring algorithm (used for chess ranking as well as 'hot or not') is used to give the images a complexity score, and each new comparison affects the score according to ELO. As for which images get served up to volunteers following each click: that is random. Also, the Pass # only ticks up once all images have been compared at least once, so currently lots of images get compared multiple times before a new Pass is declared, and some images only get compared once. This is how the ranking is being done at this first stage of the project. Once the next set of data is input (~200k images), the comparing and sorting may be altered, depending on how the results of this test come out.
    If that didn't make sense, let us know!
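
    If a concrete formula helps, a single textbook ELO update looks roughly like this (a simplified sketch with illustrative numbers for k and the starting scores, not the actual project code):

    ```python
    def elo_update(score_a, score_b, a_won, k=32):
        """One ELO update after a volunteer picks image A or image B as more complex.

        score_a, score_b: the two images' current complexity scores.
        a_won: True if the volunteer chose image A as more complex.
        k: step size controlling how far a single click can move a score.
        """
        # Expected probability that A 'wins' given the current scores.
        expected_a = 1.0 / (1.0 + 10 ** ((score_b - score_a) / 400.0))
        actual_a = 1.0 if a_won else 0.0
        new_a = score_a + k * (actual_a - expected_a)
        new_b = score_b + k * ((1.0 - actual_a) - (1.0 - expected_a))
        return new_a, new_b

    # Example: two images starting from an arbitrary default score of 1400.
    print(elo_update(1400, 1400, a_won=True))  # -> (1416.0, 1384.0)
    ```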

  • Quia

    That makes perfect sense, thank you!

    Is there any significance to the current data being classified, or is it just a random subset of the larger dataset?

    I also wonder about the potential efficiency gains, in terms of classifications vs. added confidence in the rating, from non-random pairs. ELO and other ranking systems provide the most information per match when the outcome of the matchup is not already clear; otherwise you're just confirming what you already know.
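
    To put rough numbers on that, using the textbook ELO formula (so not necessarily the project's exact parameters):

    ```python
    def expected(score_a, score_b):
        """Probability that A 'wins' under the standard ELO model."""
        return 1.0 / (1.0 + 10 ** ((score_b - score_a) / 400.0))

    k = 32
    # Evenly matched pair: a single click moves each score by 16.
    print(k * (1 - expected(1400, 1400)))            # 16.0
    # Pair 400 points apart: a win by the favourite moves the scores by only ~3.
    print(round(k * (1 - expected(1800, 1400)), 1))  # 2.9
    ```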

  • pahiggins

    For the first question, this data set has been used for a large collaborative solar flare forecasting investigation. Thus many physical properties of these sunspot group detections have already been determined. So, it will be interesting to compare complexity with polarity separation line length, and other measures. I can't wait to see how well complexity aids us in forecasting flares.

    Secondly, we are calculating a standard deviation for the ELO score that tells us how much the sunspot group's ranking is still changing with each successive click. Once it gets to a low enough value, we can be fairly certain of its ordering in the overall ranking.
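
    Very roughly, the check is something along these lines (simplified, and the window length and threshold here are just placeholders rather than our real settings):

    ```python
    import statistics

    def is_stable(score_history, window=20, threshold=5.0):
        """Has this sunspot group's ELO score stopped changing much?

        score_history: the group's score after each comparison it has appeared in.
        window and threshold are placeholder numbers for illustration only.
        """
        if len(score_history) < window:
            return False  # not enough clicks yet to judge
        return statistics.stdev(score_history[-window:]) < threshold
    ```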

  • Quia

    Gotcha. It makes more sense to add information to a well-studied subset than to try to classify the whole data set without knowing exactly how the data will help advance understanding.

    My second point was that the data you get out of each comparison is not a static value; it relates inversely to the difference in ELO scores. E.g., after a small number of passes, we probably know that ASZ000046m is more complex than ASZ00009dt, and comparing the two of them doesn't give much data about either of them (and, using ELO, neither of their rankings will change much). If we do get to classifying the whole dataset, reducing the number of classifications needed to reach the same certainty seems like a very good idea!

    That said, I haven't run the numbers to see how much of an effect this has; maybe it's not as big a gain as I think it is. Theory vs. practice. I should throw together a test suite and see how well random vs. targeted matchups perform, rather than just talking about what-ifs.

    Thanks for the discussion, I often find the data analysis as fascinating as the data itself!

  • Quia

    Alright, tests have been run, data is here!

    You said you used ELO with a standard deviation to tell when an image has a good enough classification. TrueSkill has the uncertainty built into the rating (and I may have borrowed most of the code I used from an old project that used TrueSkill...), so that's what I used as the ranking algorithm.

    Objects are generated with random values from 0-100, and a TrueSkill rating. We loop through the objects and pick a random object to pair each one with. If we're using targeted pairs, we keep selecting random objects until the outcome of the match, according to the ratings, is less than 95% certain. (That's a semi-arbitrary high number; too low discards too much data, too high and the ranking converges slower. Somewhere between .9 and .99 works nicely...) The pair is then presented to the classifier, which infallibly picks the object with the greater value (more on that later).
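
    In case anyone wants to poke at it, the core of the test looks roughly like this (trimmed down; I'm using the freely available trueskill Python package, and the object count and number of passes here are just whatever I felt like running):

    ```python
    import math
    import random

    from trueskill import BETA, Rating, rate_1vs1

    def win_probability(r_a, r_b):
        """P(A beats B) under the TrueSkill model: normal CDF of the scaled mu gap."""
        denom = math.sqrt(2 * BETA ** 2 + r_a.sigma ** 2 + r_b.sigma ** 2)
        return 0.5 * (1.0 + math.erf((r_a.mu - r_b.mu) / (denom * math.sqrt(2))))

    # Objects with hidden 'true' complexities (0-100) and fresh TrueSkill ratings.
    objects = [{"value": random.uniform(0, 100), "rating": Rating()} for _ in range(500)]

    def pick_opponent(obj, targeted, threshold=0.95, max_tries=50):
        """Random opponent; in targeted mode, re-roll while the outcome is near-certain."""
        while True:
            other = random.choice(objects)
            if other is obj:
                continue
            certainty = max(win_probability(obj["rating"], other["rating"]),
                            win_probability(other["rating"], obj["rating"]))
            max_tries -= 1
            if not targeted or certainty < threshold or max_tries <= 0:
                return other

    def run_pass(targeted):
        for obj in objects:
            other = pick_opponent(obj, targeted)
            # The simulated classifier infallibly picks the higher true value.
            winner, loser = (obj, other) if obj["value"] > other["value"] else (other, obj)
            winner["rating"], loser["rating"] = rate_1vs1(winner["rating"], loser["rating"])

    for n in range(1, 51):           # number of passes is arbitrary here
        run_pass(targeted=True)      # flip to False for the purely random baseline
        if n % 5 == 0:
            mean_sigma = sum(o["rating"].sigma for o in objects) / len(objects)
            print(f"pass {n}: mean sigma {mean_sigma:.3f}")
    ```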

    And the results... This plot shows the mean sigma scores of a sample of 500 randomly generated objects, calculated every 5 passes. Excel doesn't know what error bars are, so the standard deviations are in the data but not on the graph.
    [Image: Sunspotter Classifications]

    Conclusions: it takes roughly a third fewer classifications to reach the same certainty level when the matchmaking excludes pairs whose outcome is already known to within 95%. I would say that's pretty significant! Another thing worth noting is that the standard deviation of the standard deviations (now there's a mouthful) of the objects is a lot smaller, allowing you to say that the whole dataset is done at a certain point, rather than removing images as they reach a threshold. This is because we guarantee that each classification gives us useful data.

    I did try to model the effects of inaccurate classifications by making my algorithm completely wrong for a certain percentage of the classifications, but it increased the uncertainty of both methods about equally, so I didn't go any further down that route.

    I hope this sparks some consideration on how to serve up images if we start classifying the whole 200k image set!

  • pahiggins

    Very interesting post! Thanks for running this experiment. I think that doing this sort of analysis on our current dataset (once it is 'finished') would allow us to get a sense of how much time/work we can save by doing the kind of targeting you discuss. A couple questions:

    Can you explain a bit more about how you are doing the targeting?

    Specifically, in the targeted case, how are you deciding what pair to serve up next?

    Also, like you said, this analysis assumes that everyone picks the 'true' more complex one, so I suppose the fact that multiple people may choose A rather than B as more complex will slow the convergence. But even then, serving them a targeted pair would be more efficient.

    I think this is an extremely useful discussion. I'll have to rope some of the developers into it!

    BTW: I had to look up TrueSkill. According to Wikipedia: 'TrueSkill is patented and the name is trademarked, so therefore it is limited to Microsoft projects and commercial projects that obtain a license to use the algorithm.' I can't believe you can patent an equation (or rather the implementation thereof). That's almost as bad as patenting a gene sequence (imo)! But apparently it is not legal to patent such things in Europe, so as long as we don't publish anything using TrueSkill in the USA, Microsoft probably can't sue us. (Would want to ask a lawyer to be sure, though.)

  • parrish (admin)

    Glad to see I'm not the only one fascinated by this stuff 😃

    Currently, our pairing algorithm is pretty well random.

    Our implementation of ELO is a bit different from a standard chess system. In chess, the system is geared towards keeping highly ranked players highly ranked. As scores reach higher brackets, the k-value, which determines the magnitude of a score change, decreases. In our implementation, a standard deviation calculated from a moving average of the past N scores modifies the k-value in pairings. Basically, as an image's score starts to stabilize, it is less likely (though able) to change.
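
    In very rough pseudocode (heavily simplified, with made-up numbers rather than our actual parameters), the k-value modulation is along these lines:

    ```python
    import statistics

    def adaptive_k(score_history, base_k=32, window=10, min_k=4):
        """Shrink the ELO k-value as an image's recent scores settle down.

        base_k, window and min_k are illustrative values, not the real settings.
        """
        if len(score_history) < window:
            return base_k  # not enough history yet: keep the full step size
        spread = statistics.stdev(score_history[-window:])
        # A score that is still bouncing around keeps a large k; a settled score gets a small one.
        return max(min_k, base_k * min(1.0, spread / base_k))

    def update(score_a, score_b, a_won, history_a, history_b):
        """Standard ELO expectation, but with a per-image adaptive k-value."""
        expected_a = 1.0 / (1.0 + 10 ** ((score_b - score_a) / 400.0))
        outcome_a = 1.0 if a_won else 0.0
        new_a = score_a + adaptive_k(history_a) * (outcome_a - expected_a)
        new_b = score_b + adaptive_k(history_b) * ((1.0 - outcome_a) - (1.0 - expected_a))
        return new_a, new_b
    ```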

    In terms of optimizing for the least number of required classifications, that's something we're still working on.

    This system keeps track of scores in real time, but the parameters were only optimized against the smaller subset of data from the project beta.

    After we have a pretty good volume of classifications on this dataset, I'll go back and look for ways to gain the most information per click.

    Some of the optimizations we're testing:

    • Initial score seeds. We have a fair amount of information about each image. Starting the images with a "best-guess" score can cause them to require fewer classifications to reach a stable score.

    • Early "retirement." We may be able to stop comparing images that have stable scores. That's a bit tricky though as removing images changes the population which in turn will cause the relative score distribution to change.

    • User weighting. Not something we often talk about, but some volunteers are better at classifying certain types of data. For instance, if we can spot that a person is really accurate when comparing two images close to the limb, then we can add a weight to their classification. Likewise, if your cat starts playing with your mouse, we can fix that too 😃

    • Nonrandom selection. I tested a few methods in simulations without much luck, but it's worth revisiting with real data from the project. Potentially, focusing more classifications on images with high standard deviations will be worthwhile.
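
    For that last one, the kind of selection I have in mind would look something like this toy sketch (not anything deployed, just the shape of the idea):

    ```python
    import random

    def pick_pair(images):
        """Bias pair selection toward images whose scores are still unsettled.

        Each image dict is assumed to carry a 'std' field: the spread of its recent scores.
        """
        # Weight by standard deviation, plus a small floor so settled images can still appear.
        weights = [img["std"] + 0.01 for img in images]
        first = random.choices(images, weights=weights, k=1)[0]
        second = first
        while second is first:
            second = random.choices(images, weights=weights, k=1)[0]
        return first, second
    ```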

    As Paul mentioned, the next dataset we're going to be ranking will be much larger, so we'll need to optimize wherever we can.

    The more classifications we can get for this dataset, the more informed we'll be!

  • Quia

    Can you explain a bit more about how you are doing the targeting?
    Specifically, in the targeted case, how are you deciding what pair to serve up next?

    TrueSkill and ELO both have quality functions to estimate the likelihood of one side winning, based on their previous performance. In the targeted case, I keep picking a random second object and checking the quality until the match certainty is below the threshold. The number of retries scales linearly with passes, with the exact scaling depending on the threshold. At .95 certainty the number of retries is roughly 0.1*pass.

    BTW: I had to look up TrueSkill. According to Wikipedia: 'TrueSkill is patented and the name is trademarked, so therefore it is limited to Microsoft projects and commercial projects that obtain a license to use the algorithm.' I can't believe you can patent an equation (or rather the implementation thereof). That's almost as bad as patenting a gene sequence (imo)! But apparently it is not legal to patent such things in Europe, so as long as we don't publish anything using TrueSkill in the USA, Microsoft probably can't sue us. (Would want to ask a lawyer to be sure, though.)

    I had actually forgotten about this. They don't seem to care too much, given how freely available the libraries implementing TrueSkill are; very odd. I used TrueSkill because your description of using a standard deviation to track the certainty of an ELO rating is almost exactly what it does, so it saved me a bit of time in implementing a best guess of your system.

    Our implementation of ELO is a bit different from a standard chess system. In chess, the system is geared towards keeping highly ranked players highly ranked. As scores reach higher brackets, the k-value, which determines the magnitude of a score change, decreases. In our implementation, a standard deviation calculated from a moving average of the past N scores modifies the k-value in pairings. Basically, as an image's score starts to stabilize, it is less likely (though able) to change.

    I will have to experiment with this. Most ranking systems are not designed for a set of objects with static values that you want to reach a consensus on. As you said, ELO is biased towards making high rankings harder to change, which makes sense when you're talking about chess, but not so much when you're talking about sunspots!

    Next step for me is to make a fitness function to see how good my actual ranking is, and to compare some of these different systems. Right now all my analysis shows is that the certainty of each rank is increased when using targeted pairs, which you'd think would translate directly to more accurate ranks, but I'd like to get a better picture of what's happening.
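
    For the fitness function I'm leaning towards something simple like a rank correlation between the recovered order and the hidden true values. Roughly this (a sketch reusing the simulated objects, which each carry a hidden 'value' and a TrueSkill 'rating'):

    ```python
    from scipy.stats import kendalltau

    def ranking_fitness(objects):
        """Kendall's tau between the true values and the order implied by the ratings.

        Returns ~1.0 for a near-perfect ranking and ~0 for a random one.
        """
        true_values = [obj["value"] for obj in objects]
        estimated = [obj["rating"].mu for obj in objects]
        tau, _pvalue = kendalltau(true_values, estimated)
        return tau
    ```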

    Some of the optimizations we're testing:

    Hmm... I can fiddle with all of these except for user weighting. User weighting also seems like something you'd do after the data collection has finished, to improve the quality of what you've collected, rather than in real time.

    The Cat Detector sounds like a good addition to any project! Sometimes my mouse double-clicks on a single button press and classifies the next pair immediately; classifications made within a few milliseconds should be easy to catch compared to cats!
