TimSort Really Is O(nlogn)
Written by Mike James   
Wednesday, 05 September 2018

Most of us suppose that sorting algorithms are all done-and-dusted. Nothing new or exciting is left to find or discover, but TimSort, used in both Python and Java, was born in 2002 and is still being investigated.

Ineffective Sorts

Obligatory xkcd cartoon.

More cartoon fun at xkcd a webcomic of romance,sarcasm, math, and language

If you have done almost any sort of computer science course, or just read a book, then you will know that the problem of sorting things into order is so important that we have been looking at it for ages. You probably also know that what matters is how fast the sort is in terms of the number of things to be sorted. The most popular "theoretical" sort is Quicksort because it is elegant, sophisticated and sorts n items in time proportional to nlogn which is generally regarded as good.

End of story?

No, because there are many more considerations than theoretical performance. For instance, you can ask what the worst case performance of a sorting method is. In other words, what is the slowest a sorting algorithm goes if you are evil and throw just the right, or more accurately wrong, data at it? In the case of Quicksort the answer is O(n2), which is not so good after all. The problems often arise when the list presented is almost sorted and the algorithm has to avoid doing extra work in trying to sort what is already sorted.

In practice, sorting algorithms have had to become clever at avoiding the sort of data that chokes their basic algorithm and switch to another algorithm if the data isn't right for them.

This is where TimSort enters the picture. In 2002 Tim Peters designed a sorting algorithm for Python. It was convincingly good enough for Java to use it with some differences a bit later. The basis of the algorithm is that the input sequence is first scanned and decomposed into monotonic runs, i.e. increasing or decreasing sub-sequences. These sequences are then merged to give a sorted list. The originator of the basic idea of merging natural sequences was, like so much in computer science, due to Donald Knuth.

This makes TimSort sound easy, but it also includes heuristics about what to do and when, and this makes it difficult to analyse. In short, this is not the elegant logic of a pure algorithm like Quicksort, but a practically optimized sorting procedure complete with magic numbers and choices. So much so that it was assumed that its average complexity was O(nlogn) and more that its worst case complexity was also O(nlogn). It took till 2015 for Nicolas Auger, Cyril Nicaud and Carine Pivoteau to prove that indeed it was. Now they have a new paper which substantially simplifies the proof and in the process throws new light on how the algorithm works.

"Based on this description, we were also able to answer positively a natural question, which was left open so far: does TimSort runs in O(n + n log ρ), where ρ is the number of runs? We hope our theoretical work highlights that TimSort is actually a very good sorting algorithm. Even if all its fine-tuned heuristics are removed, the dynamics of its merges, induced by a small number of local rules, results in a very efficient global behavior, particularly well suited for almost sorted inputs."

What is slightly more worrying is that:

"Besides, we want to stress the need for a thorough algorithm analysis, in order to prevent errors and misunderstandings. As obvious as it may sound, the three consecutive mistakes on the stack height in Java illustrate perfectly how the best ideas can be spoiled by the lack of a proper complexity analysis."

Yes, they found bugs in the Java implementation of TimSort.

Sorting is still important and there is still much to be done. We need to move beyond the theoretical algorithms implemented on theoretical machines and look at what happens in real machines with complexities such as cache and out of order execution.

Now is still a good time to be a computer scientist - if not the best.

sortsicon

More Information

On the Worst-Case Complexity of TimSort

Related Articles

Sorting Algorithms as Dances

At last - Quicksort the dance

Silverlight Sorting Lab

SilverSort

Magic of Merging

QuickSort exposed

Sequential storage

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

 

Banner


52nd Mersenne Prime Found
27/10/2024

It has been nearly six years since the last Mersenne prime was discovered. Now, at last, we have Mersenne prime number 52 and it has 41,024,320 digits!



Google Opensources Privacy Library
08/11/2024

Google is making a new differential privacy library available as open source. PipelineDP4J is a Java-based library that can be used to analyse data sets while preserving privacy.


More News

espbook

 

Comments




or email your comment to: comments@i-programmer.info

Last Updated ( Saturday, 08 September 2018 )