linusbohman

Interesting findings after indexing all Eurobricks reviews


Hey guys! As part of an experiment I've indexed all reviews from Eurobricks. I wanted to share some fun statistics! Here's what I've done:

  • Gather all reviews
  • For each review: *
      - Extract the set ID and map it to Brickset's database
      - Extract a short summary (if possible)
      - Extract the author
      - If there's an overall score, extract it and normalize it to 1-100
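The score normalization step could be sketched roughly as follows. This is a minimal Python illustration, not the site's actual code: the function name `normalize_score` and the simple linear scaling are assumptions. With linear scaling, an average of 4.1 on a 5-point scale becomes 82.

```python
def normalize_score(score: float, max_score: float) -> float:
    """Linearly rescale a score from a 0..max_score scale to a 0..100 scale.

    Linear scaling is an assumption; the real pipeline may differ.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= score <= max_score:
        raise ValueError("score must lie between 0 and max_score")
    return round(score / max_score * 100, 2)


print(normalize_score(4.1, 5))   # 82.0
print(normalize_score(9, 10))    # 90.0
```

Reviews that use letter grades or per-category ratings would need their own mapping before a function like this applies.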

This lets us figure out some fun stuff:

  • I've found a total of 3275 reviews from Eurobricks (whoa!)
  • 74% of these are scored in one way or another (that's 2422 scored and 853 unscored)
  • The average score is 81.83. Calculating standard deviation means a good score for a Eurobricks review is better than 93.11, and a bad score is below 70.55. Everything between those numbers is average, statistically speaking.
  • 627 individual authors have produced these reviews.
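The "good" and "bad" thresholds above are the mean plus and minus one standard deviation (93.11 - 81.83 = 81.83 - 70.55 = 11.28). A minimal Python sketch of that calculation is below; the function names and the sample scores are made up for illustration, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def score_bands(scores):
    """Return (low, average, high), with low/high one standard deviation from the mean."""
    avg = mean(scores)
    sd = pstdev(scores)  # population standard deviation (assumed, not confirmed)
    return avg - sd, avg, avg + sd


def classify(score, low, high):
    """Label a score relative to the two thresholds."""
    if score > high:
        return "good"
    if score < low:
        return "bad"
    return "average"


# Made-up scores, only to show the mechanics
low, avg, high = score_bands([60, 70, 80, 90, 100])
print(round(low, 2), avg, round(high, 2))

# Using the thresholds reported above
print(classify(95, 70.55, 93.11))     # good
print(classify(81.83, 70.55, 93.11))  # average
print(classify(65, 70.55, 93.11))     # bad
```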

I've been doing the same with a lot of other review sources as well, and the global average score is 79.67. Eurobricks is pretty darn spot on - no mean feat!

[Image: Eurobricks review statistics]

This has been a fun experiment. If you're interested in seeing more you can see the stats for Eurobricks here. That page also lets you find all reviews written by a specific author. The entire experiment can be found on brickinsights.com - it's a work in progress, but any feedback you have is appreciated.

I'd be happy to try to answer any questions you have about this, or provide the data for you to do your own analysis. Hope you enjoyed it as much as I did!

* Some caveats:

  • I found the reviews by doing a site-wide search for titles containing the word "review". If a review hasn't followed this convention it is not included.
  • Even then, not all reviews were included. There were a few that mostly consisted of broken images or just links to other sites. These were left out. I thought of it this way: If a random person looking at the review would find it useful, it should be included in the dataset. If not, then it's out. Not always an easy call to make.

Cheers!
Linus

Edited by linusbohman


Sure, we could do that pretty easily! At the moment I have a JSON export, but I could build an exporter for Eurobricks as well. A few questions, then:

- Do you want an exporter where you select categories and get all sets in that category?

- Or would it be better for you to select individual set ids manually?

I’ll research this more in the coming days, but I’d love to put something together that makes it as easy as possible for you.


I think by theme would be most useful. That is how the reviews are (or were) indexed in the forum.

This is quite the impressive feat I have to say.


Wha....? :oh:

This is quite remarkable. I will echo Pep's request for reviews by theme (and I don't know if you'd be able to do it by host subforum), but I also wonder if you've done any data crunching with regard to the Reviewers Academy and RA Approved Reviews?

@WhiteFang and @Rufus I think you should stop by here.

Thank you for this work Linus. Astonishing.

EDIT: I also notice that a few of my reviews are not in the database, and I know I do (I really do, being an RA Teacher!) put "Review" in the title, and there shouldn't be any broken images, only proper reviews. :look: 


@Pandora no, I haven’t done any differentiation between RA reviews and others. I wanted to create as generalized a structure as possible so that it could work for multiple review sources (for that fun inter-site comparison). Would be a really interesting thing to explore, though! I’ll add it to my feature wishlist. Reading all of these reviews made me appreciate the work you guys in the RA do even more - thanks for everything you and your peers do :)

Could you point me to the reviews you’re missing so I could try to figure out where the error lies? I built a small script that scrapes the search page, so the error could be either on the search page’s end, my script’s end or my own manual curation’s end. I really want the dataset to be as complete as possible - thanks for letting me know!


My mind is totally blown by the sheer catalogue of global reviews, and by seeing Eurobricks reviews included in the stats and index. It is a truly amazing piece of analytical insight and a project I simply adore. I am on my mobile phone, so I am surely seeing only the tip of the iceberg.

Thanks Pandy for the shout out.

Linus, I echo Pandy’s sentiments. If you could somehow segregate the EB RA reviews, it would help us understand how many reviews were done through this programme over the years.

I see my recent reviews have also been added to the index. So pleased to see that. :sweet:

Not sure if this helps, but below is the Flickr group for completed RA reviews. Not all of us upload there, but it is a place for us to showcase some of our work.

https://www.flickr.com/groups/reviewers_academy/

Edit: if it helps, I have done about 196 reviews. :blush: That does include some of my customization reviews, though.

https://www.eurobricks.com/forum/index.php?/profile/2351-whitefang/&tab=field_core_pfield_23


Hi @linusbohman, thank you for providing the statistics. I have been discussing with @WhiteFang how good the statistics are. There are some quirks, and I'll get into those in a little while. I just want to congratulate you on the effort, as it clearly shows the love you have for your work.

The presentation looks awesome. The progress circle / donut chart plus the breakdown of each rating makes the aggregator site helpful for the average Joe.

I do think you can improve how the scores from S&H and Amazon are represented, because the site seems to round the value to the nearest integer.

You don't seem to distinguish RA reviews from other reviews and, as was already pointed out, it would be great to separate them, much like you separated Brickset news reviews from Brickset user reviews.

As part of Reviewers Academy, I really like how easily I can show all my completed reviews in the link below:

https://brickinsights.com/statistics/reviewers/15?author=makoy&page=1

Question: if we retroactively add a score to an unscored review, will it be updated to count as a scored review too?


Really interesting stuff. I find it strange how a 79 is average - not that Lego releases garbage sets on the regular, just that around 70 tends to be most people's "average set" score, instead of a 50.

On 6/22/2018 at 11:11 PM, linusbohman said:

Could you point me to the reviews you’re missing so I could try to figure out where the error lies? I built a small script that scrapes the search page, so the error could be either on the search page’s end, my script’s end or my own manual curation’s end. I really want the dataset to be as complete as possible - thanks for letting me know!

So I had a look through and I've identified a few reviews of mine that are missing (sorry):

41041 - Turtle's Little Paradise, 41042 Tiger's Beautiful Temple, 41043 Penguin's Playground (Triple Review)

41005 - Heartlake High

41007 - Heartlake Pet Salon

70706 - Crater Creeper

3629 - Barney Bear

329 - Bernard Bear And His Delivery Lorry

and 21013 - Big Ben which was a joint review with Rufus (and it's not on his profile either).

As suspected, they all do have "review" in their title, so I don't know why the search didn't pick them up.

I note that there's another 21013 review from EB in your dataset (produced after ours), but I presume you would include all reviews, regardless of duplication?


Thanks so much for these stats! I appreciate the time and effort that you've put into this.

 

8 hours ago, JackJonespaw said:

Really interesting stuff. I find it strange how a 79 is average - not that Lego releases garbage sets on the regular, just that around 70 tends to be most people's "average set" score, instead of a 50.

I would imagine that this is because most people have a tendency to rate sets a little higher than their gut is telling them in order to be less controversial :shrug_confused:


@makoy thanks for the input! I really want the statistics to be as accurate as possible, so I'm especially interested in figuring out where my math is incorrect. However, I'm not sure I can see where the numbers from LEGO Shop and Amazon are wrong. I collect the average from each of those sites (not every individual review), and it seems to be working, unless I'm missing something. Here's a set with an Amazon review:

https://brickinsights.com/sets/7094-1

The score the system identified is 4.1 (and we normalised that to 82). That's the average from 29 reviewers, which seems to be correct when I go to Amazon:

https://www.amazon.com/LEGO-Castle-Kings-Siege/dp/B000NOB9Z8#customerReviews

The same is true for S&H:

https://brickinsights.com/sets/60204-1 and
https://shop.lego.com/en-US/LEGO-City-Hospital-60204#product-reviews

Both of those numbers are floats, as they should be. There are a lot of integers, especially for S&H, when there is just one review from that site. Is that what you're referring to, or am I misunderstanding something? I really appreciate the input, so if you or @WhiteFang can help me see where I'm wrong I'd love to fix it. (Thought: I'm experimenting with displaying a snippet from one of the reviews, even if the score is an aggregate of several. I note this in the footnote at the bottom of the review, but perhaps this is just confusing?)

When it comes to retroactively editing the score, that's not something the system picks up automatically. Since all reviews from Eurobricks are formatted differently I have to manually enter the score. If you do notice I have the wrong score, however, just let me know and I'll fix it. I plan to build a "report errors in this review" feature to make it easier, but that comes when/if people actually use the site. (To geek out though, some sources are updated automatically. They are the ones with programmatically readable structures, like Amazon, LEGO Shop, Brickpicker, Brickset and a few others.)

Action plan on all the great feedback you've given so far:

You've all given me excellent feedback - more than I could ever dream of. Besides the potential data error, you've commented on missing reviews, segregating Reviewers Academy reviews and having an exporter for Eurobricks. I'm tracking it all in this Trello card, so you can see the current status there - I intend to solve it all as well as I can. Here's the current progress:

  • Missing reviews: Thank you so much for the list, @Pandora! I'll look into it further. I haven't manually excluded them, so they have either been omitted from the Eurobricks search page or been misidentified by my script. I'll explore further, but this helps a lot. To make it slightly easier to troubleshoot, I added the number of sets a particular author has reviewed to the Eurobricks stats page. It's important to note that this is not the same as the number of reviews written, since each review can cover multiple sets. It's probably most noticeable for WhiteFang, since each collectible minifig review (which are awesome, by the way) contains 10-20 "sets", as Brickset considers each minifig a different set.
  • Exporting reviews: I've started building a tool where you can select categories and a few other options, and have the site generate a list of sets and reviews. I can't export by theme since that data doesn't exist in Brickset's database, but hopefully this will be good enough - you could generate your own theme list this way. My goal is to generate output that looks like this: https://www.eurobricks.com/forum/index.php?/forums/topic/30357-lego-action-themes-pictoral-reviews-index/ that you can copy/paste, @Peppermint_M. I'll get back to you when I have something you can test. It'll be a few days (or if it's trickier than it seems a week or two) due to family, work and such. (By the way, if it's useful you can already embed the score for a particular set.)
  • Segregating RA reviews: I can, like you say, figure out a way to consider RA its own entity. The tricky part is that technically there's no difference between an RA post and an "ordinary" Eurobricks post - but perhaps you could help me a bit. Do all RA reviews have the RA review logo? I seem to recall that there are multiple images symbolising the Reviewers Academy. Could you give me examples of those? This is by far the most time-consuming task for me to fix since it will require a lot of manual work, so while I really want to make it happen, I'll prioritize the above tasks before this. I hope you understand :)

Thanks again for all of the feedback - I really appreciate it! I hope this site can become a useful tool for finding fantastic reviews and helping people figure out whether a set is good or not. There's still a way to go, but this is exactly the kind of input I need.

4 hours ago, linusbohman said:

I'll get back to you when I have something you can test. It'll be a few days (or if it's trickier than it seems a week or two) due to family, work and such.

Most excellent, and of course, life does come first so I'm not concerned. Thanks for this, it is very helpful.

Edit: as for the categories, EB made up the title Action Theme. LEGO theme name is perfect for indexing.

2 hours ago, linusbohman said:

The tricky part is that technically there's no difference between an RA post and an "ordinary" Eurobricks post

Technically there is -- but not from the post body. You can find the difference from the aside element with class cAuthorPane_info. Inside the post author info you parse the element with this image or a variation of it (silver/bronze/gold):

rsgold.gif

By right, those with these badges are reviewers who are part of Reviewers Academy team.
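A rough Python sketch of that badge check, using only the standard library, might look like this. The class name `cAuthorPane_info` and the file name `rsgold.gif` come from the description above; the silver and bronze file names, the helper names and the sample HTML are assumptions. Note that this reads whatever badge the page shows for the author, not anything stored with the post itself.

```python
from html.parser import HTMLParser

# Only "rsgold.gif" is named in the thread; the other two names are guesses.
BADGE_FILES = ("rsgold.gif", "rssilver.gif", "rsbronze.gif")


class AuthorBadgeParser(HTMLParser):
    """Looks for a badge image inside <aside class="... cAuthorPane_info ...">."""

    def __init__(self):
        super().__init__()
        self.aside_depth = 0    # > 0 while inside the author pane
        self.has_badge = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "aside":
            classes = (attrs.get("class") or "").split()
            # Count nested asides too, so the matching close tag is tracked
            if self.aside_depth or "cAuthorPane_info" in classes:
                self.aside_depth += 1
        elif tag == "img" and self.aside_depth:
            src = (attrs.get("src") or "").lower()
            if src.endswith(BADGE_FILES):
                self.has_badge = True

    def handle_endtag(self, tag):
        if tag == "aside" and self.aside_depth:
            self.aside_depth -= 1


def author_has_ra_badge(post_html: str) -> bool:
    parser = AuthorBadgeParser()
    parser.feed(post_html)
    return parser.has_badge


# Made-up markup, only to exercise the parser
sample = """
<article>
  <aside class="ipsComment_author cAuthorPane_info">
    <img src="/uploads/rsgold.gif" alt="">
  </aside>
  <div class="post"><p>Review text</p></div>
</article>
"""
print(author_has_ra_badge(sample))  # True
```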

2 hours ago, linusbohman said:

However, I'm not sure I can see where the numbers from LEGO Shop and Amazon are wrong. I collect the average from each of those sites (not every individual review)

As for the integer problem, I cannot find the one that I saw yesterday and I was not able to capture a screenshot of the set. If I see something unusual on the score tally, I will inform you.

3 hours ago, linusbohman said:

When it comes to retroactively editing the score, that's not something the system picks up automatically. Since all reviews from Eurobricks are formatted differently I have to manually enter the score. If you do notice I have the wrong score, however, just let me know and I'll fix it. I plan to build a "report errors in this review" feature to make it easier, but that comes when/if people actually use the site. (To geek out though, some sources are updated automatically. They are the ones with programmatically readable structures, like Amazon, LEGO Shop, Brickpicker, Brickset and a few others.)

One little-known feature of the forum is that RSS still exists. Each sub-theme has its own RSS feed.

https://www.eurobricks.com/forum/index.php?/forums/forum/86-lego-star-wars.xml/

So in a way, you can trigger a workflow that fetches the latest version of a topic, if that helps. However, I checked the RSS feed structure and only the publish date is available there. I know there are higher-priority items on your list, but a "report errors in this review" feature would be a great way to crowdsource the correction of data errors.
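As a sketch of such a workflow, the Python below parses an RSS document and picks out items published after a given time. It assumes a standard RSS 2.0 layout (`item`, `title`, `link`, `pubDate`); the sample feed, the example.com links and the function names are made up, and fetching the real feed is left out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_feed(xml_text):
    """Return a list of {title, link, published} dicts from RSS 2.0 text."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": parsedate_to_datetime(pub) if pub else None,
        })
    return items


def new_since(items, last_seen):
    """Keep only items published after last_seen (a timezone-aware datetime)."""
    return [i for i in items if i["published"] and i["published"] > last_seen]


# Made-up feed content, only to exercise the parser
SAMPLE = """<rss version="2.0"><channel>
  <title>Example subforum</title>
  <item>
    <title>REVIEW: Example set</title>
    <link>https://example.com/topic/2</link>
    <pubDate>Fri, 22 Jun 2018 21:11:00 +0000</pubDate>
  </item>
  <item>
    <title>Older topic</title>
    <link>https://example.com/topic/1</link>
    <pubDate>Mon, 04 Jun 2018 08:00:00 +0000</pubDate>
  </item>
</channel></rss>"""

last_run = datetime(2018, 6, 10, tzinfo=timezone.utc)
for entry in new_since(parse_feed(SAMPLE), last_run):
    print(entry["title"], entry["link"])
```

Since the feed only exposes the publish date, this can detect new topics but not edits to existing ones.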

 

Cheers!

18 hours ago, makoy said:

Technically there is -- but not from the post body. You can find the difference from the aside element with class cAuthorPane_info. Inside the post author info you parse the element with this image or a variation of it (silver/bronze/gold):

rsgold.gif

By right, those with these badges are reviewers who are part of Reviewers Academy team.

That's interesting, but if this forum is built the same way most forums are, it wouldn't be 100% correct to use the sidebar to identify whether a post is part of the RA or not. I've been checking all posts the search gave me (since 2005, if I recall correctly) and it is my understanding that the sidebar is static for the user. It's not a snapshot for when the post was initially made, but rather a representation of what the user is right now. This could lead to me categorizing reviews from reviewers before they were in the RA as RA. I've been looking at the post content, and many differentiate RA reviews with an image, but not all. Or am I incorrect? Reading this review by JackJonesPaw makes me think it's part of the RA. Is it? If so, I might need to check the post content for either RA images or the words "Reviewer's Academy", methinks. See any caveats or problems with that?
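A content-based check along those lines could be sketched like this in Python. The function name and the `image_hints` parameter are hypothetical; no real RA logo file names are known here, so the caller has to supply them. A review that carries neither the words nor a logo would be missed.

```python
import re

# Matches "Reviewers Academy", "Reviewer's Academy", "Reviewers' Academy"
RA_TEXT = re.compile(r"reviewers?['’]?s?\s+academy", re.IGNORECASE)
IMG_SRC = re.compile(r"<img[^>]+src=[\"']([^\"']+)", re.IGNORECASE)


def looks_like_ra_review(post_html, image_hints=()):
    """Guess whether a post body marks itself as an RA review.

    image_hints: lowercase substrings of known RA logo file names
    (hypothetical; must be supplied by the caller).
    """
    if RA_TEXT.search(post_html):
        return True
    sources = [src.lower() for src in IMG_SRC.findall(post_html)]
    return any(hint in src for src in sources for hint in image_hints)


print(looks_like_ra_review("<p>Part of the Reviewer's Academy</p>"))  # True
print(looks_like_ra_review("<p>Just a regular review</p>"))           # False
```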

15 hours ago, linusbohman said:

it is my understanding that the sidebar is static for the user. It's not a snapshot for when the post was initially made, but rather a representation of what the user is right now. This could lead to me categorizing reviews from reviewers before they were in the RA as RA.

 

Reading this review by JackJonesPaw makes me think it's part of the RA. Is it? If so, I might need to check the post content for either RA images or the words "Reviewer's Academy", methinks. See any caveats or problems with that?

The sidebar is indeed a representation of the user right now, and you are correct in your logic regarding the reviews - for instance a few of my reviews were made before I joined the RA, thus they are not RA reviews. We try to put the RA logo at the top of RA reviews, but sometimes we forget. So yes, that review by JackJonespaw you linked to is indeed an RA review, but there isn't a logo.

I fear we might be asking you the impossible to segregate RA reviews, or if not impossible, then very difficult without more input from us. Even going by author will lead to errors without knowing which reviews by that author are RA. I am willing to help, time permitting, but I suspect it will require a lot of manual checking one way or another. :sceptic: 

On 6/22/2018 at 3:05 AM, linusbohman said:

 

  • The average score is 81.83. Calculating standard deviation means a good score for a Eurobricks review is better than 93.11, and a bad score is below 70.55. Everything between those numbers is average, statistically speaking.

First of all, this is an awesome amount of work. Congratulations, it’s very impressive.

My only minor quibble is with the approach you’ve used for the standard deviation. We have here a dataset that is on a percentage basis (0% to 100%). These are rarely normally distributed - typically there’s a negative skew, so in order to calculate a +1/-1 standard deviation you’d have to transform the source data using some form of power function (like a y=a + b^c or a log-based version). Then you calculate the STD, then you reverse-transform it back into values in the source data. You’ll end up with a -1 STD that is further away from the mean than the +1 STD, but that’s okay; that’s what it should be for negative skew.

As I said, ‘just a minor quibble’, and I don’t want it to detract from the serious amount of fine work you’ve put into this.

Excellent job!

1 hour ago, eurotrash said:

First of all, this is an awesome amount of work. Congratulations, it’s very impressive.

My only minor quibble is with the approach you’ve used for the standard deviation. We have here a dataset that is on a percentage basis (0% to 100%). These are rarely normally distributed - typically there’s a negative skew, so in order to calculate a +1/-1 standard deviation you’d have to transform the source data using some form of power function (like a y=a + b^c or a log-based version). Then you calculate the STD, then you reverse-transform it back into values in the source data. You’ll end up with a -1 STD that is further away from the mean than the +1 STD, but that’s okay; that’s what it should be for negative skew.

As I said, ‘just a minor quibble’, and I don’t want it to detract from the serious amount of fine work you’ve put into this.

Excellent job!

I'm not a statistician (or even a mathematician!), so it's really interesting to hear this. I have to admit I don't fully understand what you mean. Instead of using the raw values (0-100), I should convert them beforehand? Why is that important? Do you have any links for me to read up on? I'm really interested in improving.

Here's a link describing the formula I implemented (unless I did something wrong :) ): https://www.mathsisfun.com/data/standard-deviation-formulas.html

 


No worries mate! Luckily I get paid to be an applied statistician and mathematician. AND I build things in Lego. So, here’s a MOC to help with the problem.

Lego Standard Deviation

 

Both charts contain the same amount of data (studs) and I’ve shown the +1/-1 Standard deviation in red and the mean with the black tile. The top chart shows data that is normally distributed and the lower chart shows skewed data. The same amount of tan and red plates straddle the mean but the shape is very different.

Top chart: mean = 6.5, standard deviation = 1.5, so mean + 1 STDEV = 8 and mean - 1 STDEV = 5.

Bottom chart: mean = 8.5. The standard deviation is broken and meaningless because the data is asymmetric, but the thresholds equivalent to mean + 1 STDEV and mean - 1 STDEV are 9 and 7.

 

So the usual methods for calculating the STDEV just do not work for skewed data. However if we can mathematically manipulate the shape of that lower chart (using a formula) into something that resembles the top chart then we can calculate the STDEV thresholds and then using the same formula backwards we can calculate the thresholds for the skewed data. 

Luckily I’m a lazy applied statistician, so I use Excel to calculate the =skew()* of the data, then create a simple power function like y=x^N and use ‘Goal seek’ to find a legitimate value of N that gives me a zero-skew data set. Then I calculate the STDEV as for ‘normally distributed data’ and work out the thresholds (in this case the first 9 plates and the last 9 plates would be outside of +1/-1 STDEV). Then I can determine where those thresholds are on the actual skewed data by running the inverse of that original power function.....  good grief, I’m boring myself to sleep here. 😀
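For anyone without Excel, the same procedure can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a vetted statistical routine: the function names are made up, bisection stands in for Goal Seek, the search range 0.1 to 10 for N is arbitrary, the sample scores are invented, and all values must be positive. Only skewness is corrected here, not kurtosis.

```python
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: mean of the cubed standardised deviations."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def find_power(values, lo=0.1, hi=10.0, steps=60):
    """Bisect for N so that the skewness of v**N is about zero."""
    def f(n):
        return skewness([v ** n for v in values])

    if f(lo) * f(hi) > 0:
        raise ValueError("no zero-skew power found in the search range")
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def skew_aware_bands(values):
    """Return (low, centre, high) thresholds mapped back to the original scale."""
    n = find_power(values)
    transformed = [v ** n for v in values]
    m, s = mean(transformed), pstdev(transformed)

    def back(t):
        # Inverse of y = x**N, clamped so the root stays real
        return max(t, 0.0) ** (1 / n)

    return back(m - s), back(m), back(m + s)


# Invented, negatively skewed scores (a long tail of low scores)
scores = [40, 60, 70, 75, 80, 82, 85, 88, 90, 92, 95, 98]
low, centre, high = skew_aware_bands(scores)
print(round(low, 1), round(centre, 1), round(high, 1))
```

With negatively skewed data the lower threshold ends up further from the centre than the upper one, which is the asymmetry described above.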

TL;DR: Open up the tool you’re using to crunch the numbers and google ‘How to calculate STD in skewed data sets in Tool X’

And as for useful links Wikipedia’s section on Stats is awesome! I’d start with Skewness and Data Transformation (Statistics) and see where that leads you.

 

As I said before Congratulations on an awesome job!

 

* I worry about =KURT() as well and also adjust for that in the power function. Kurt is kurtosis; the pointiness of the data set.

Edited by eurotrash


This is fantastic stuff, Chris! Thanks for the explanation (and very cool MOC, haha). I use a web programming language (PHP) to do the math, and there seem to be a few extensions that deal with skewed datasets. I just didn't know it was something you ought to compensate for. I'll be researching this as soon as I can, and implement it as soon as I feel I understand it. Thank you so much for the input! Mind if I run the finished result by you when I'm done?

4 hours ago, linusbohman said:

This is fantastic stuff, Chris! Thanks for the explanation (and very cool MOC, haha). I use a web programming language (PHP) to do the math, and there seem to be a few extensions that deal with skewed datasets. I just didn't know it was something you ought to compensate for. I'll be researching this as soon as I can, and implement it as soon as I feel I understand it. Thank you so much for the input! Mind if I run the finished result by you when I'm done?

You're very welcome and, of course, if there's anything I can help with then please don't hesitate to reach out to me.


Alright, I've had time to do some work! @Peppermint_M, this is about your request to export reviews.

I've built the base for an exporter now. It can be used in two ways. First, to export a single set, just navigate to the desired set and click the export button. Before, you could only select JSON, but now you can also choose BBCode. The other way is the big one. By heading to the export page you can select a bunch of options. They're a bit sparse right now, but you can choose to export by category and pick which reviewers you want to include. Here's what the export for an individual set currently looks like:

6941-1.jpg 
6941-1 Battrax 94 / 100  
Brickset 100 / 100  
Lugnet 91 / 100  
Eurobricks by RangerBob 90 / 100  
Eurobricks by ZeeK  

If you use the export page to export multiple sets they'll all look like this, with spacing between them. No matter what you use to export, you probably need to ensure you only paste clean text to not confuse Eurobricks' CKEditor. I do it by copy/pasting into notepad before copy/pasting the code here.
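To illustrate the idea, a BBCode exporter for one set could be sketched in Python as below. This is a guess at the shape of the output, not the site's actual exporter: the function names, the input structure and the choice of `[url]` and `[b]` tags are assumptions, and the set link simply follows the brickinsights.com/sets/<id> pattern seen earlier in the thread.

```python
def review_line(source, author=None, score=None):
    """Format one review row, e.g. 'Eurobricks by RangerBob 90 / 100'."""
    label = f"{source} by {author}" if author else source
    return f"{label} {score} / 100" if score is not None else label


def export_set_bbcode(set_id, name, score, reviews):
    """Build a BBCode snippet for one set; reviews is a list of dicts."""
    url = f"https://brickinsights.com/sets/{set_id}"
    lines = [f"[url={url}][b]{set_id} {name}[/b][/url] {score} / 100"]
    lines += [review_line(**review) for review in reviews]
    return "\n".join(lines)


# Data taken from the sample export shown above
print(export_set_bbcode("6941-1", "Battrax", 94, [
    {"source": "Brickset", "score": 100},
    {"source": "Lugnet", "score": 91},
    {"source": "Eurobricks", "author": "RangerBob", "score": 90},
    {"source": "Eurobricks", "author": "ZeeK"},  # unscored review
]))
```

Exporting several sets would then just mean joining the snippets with a blank line between them.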

Thoughts? Something you want me to change or improve? Options you need? Meanwhile, I'll focus on researching @Pandora's missing reviews.

Edited by linusbohman
words

4 minutes ago, linusbohman said:

Thoughts? Something you want me to change or improve? Options you need?

This looks great so far, I will have a play around with it on my afternoon off tomorrow and see how I do. I have a little (very basic) HTML knowledge and some good friends who know plenty so we'll see how I do!

This is excellent and helpful, seriously amazing work. 

5 minutes ago, Peppermint_M said:

This looks great so far, I will have a play around with it on my afternoon off tomorrow and see how I do. I have a little (very basic) HTML knowledge and some good friends who know plenty so we'll see how I do!

This is excellent and helpful, seriously amazing work. 

Glad you like it! If you find that you don't like the look of the export, that something is missing from it or that you need greater control over the selected sets, just let me know and I'll tweak it. The end goal is for you not to have to mess around with the generated output at all :)

