Deciding when to filter out large scale refactorings from code analysis

I want to be able to see the level of change between OpenStack releases. However, there are a relatively small number of changes with simply huge amounts of delta in them — they’re generally large refactors or the delete which happens when part of a repository is spun out into its own project.

I therefore wanted to explore what was a reasonable size for a change in OpenStack so that I could decide what maximum size to filter away as likely to be a refactor. After playing with a couple of approaches, including just randomly picking a number, it seems the logical way to decide is to simply plot a histogram of the various sizes, and then pick a reasonable place on the curve as the cutoff. Due to the large range of values (from zero lines of change to over a million!), I ended up deciding a logarithmic axis was the way to go.

For the projects listed in the OpenStack compute starter kit reference set, that produces the following histogram:A histogram of the sizes of various OpenStack commitsI feel that filtering out commits over 10,000 lines of delta feels justified based on that graph. For reference, the raw histogram buckets are:

Commit sizeCount
< 225747
< 11237436
< 101326314
< 1001148865
< 1000116928
< 1000013277
< 1000001522
< 1000000113

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.