The PaSh Project - Advancing the Unix Philosophy One Step Further

Written by Nikos Vaggalis

Thursday, 04 November 2021

The PaSh Project gives your POSIX script superpowers by utilizing parallelization in order to speed up execution times. This leads to faster results for data scientists, engineers, biologists, economists, administrators, and programmers.

I remember the time when the saying was "Learn Perl so you don't have to learn the Shell and its hundreds of utilities".
Fast forward some decades and the use of shell scripts still has not been eradicated. On the contrary, their use has increased due to the rise of containers, VMs, cloud administration, and Linux itself.

This also serves as a lesson to those who are quick to denounce technologies as 'dead'. There comes a time when a new use case revitalizes an old technology.

So what is meant by "Unix philosophy"? It means taking simple, high-quality components and combining them in smart ways to obtain a complex result. An example that encapsulates this notion comes straight from the PaSh documentation and shows how several utilities can be combined with pipes and redirections to get the desired outcome:

Consider the following spell-checking script, applied to two large markdown files f1.md and f2.md
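A minimal sketch of such a script, in the spirit of the PaSh documentation's example rather than a verbatim copy, might look like this. The function name spellcheck and the word list dict.txt are assumptions made for illustration; the word list is assumed to be lowercase, one word per line, and sorted.

```shell
# Sketch only: spellcheck DICT FILE...
# "spellcheck" and DICT are illustrative names, not taken from the article.
# DICT is assumed to be a sorted, lowercase, one-word-per-line word list.
spellcheck() {
    dict=$1
    shift
    cat "$@" |
        tr A-Z a-z |              # fold everything to lowercase
        tr -cs a-z '\n' |         # turn every run of non-letters into one newline
        grep -v '^$' |            # drop the empty line a leading non-letter leaves
        LC_ALL=C sort |           # comm needs sorted input
        uniq |                    # keep each word once
        LC_ALL=C comm -13 "$dict" -   # print only words that are absent from DICT
}
```

Running spellcheck dict.txt f1.md f2.md prints the unknown words, one per line, and piping the result through wc -l gives their count. Every stage is an ordinary Unix utility, and the pipes are what glue them together.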

The speed of an operation like this depends on the size of the two files. It could take anything from seconds to minutes. What if you could speed it up by breaking it up into pieces that run in parallel, and afterwards combining their results? You can.

PaSh is such a system for parallelizing POSIX shell scripts, shown to achieve order-of-magnitude performance improvements. Given a shell script, PaSh converts it to a dataflow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the dataflow graph back into a POSIX script. The new parallel script has POSIX constructs added to explicitly guide parallelism, coupled with PaSh-provided Unix-aware runtime primitives for addressing performance- and correctness-related issues.

For instance, the script above, run from PaSh with -w 2 (that is, 2x parallelism), would be split into 2 pipelines that are then run in parallel. The resulting dataflow graph therefore has two parallel branches whose results are combined at the end.

You could say that GNU Parallel already does that. The problem with Parallel is that it doesn't know the semantics of commands like grep, so it is hard to use: the user has to write a carefully parameterized command to parallelize a task. In addition, some commands have their own ad-hoc parallel flags, such as -j, --jobs, and --parallel. These are all different, hard to use, and hard to compose.

PaSh instead has a compiler which works in the following way:

Takes a shell script and command annotations as input
Constructs a dataflow graph
Applies graph transformations
Outputs a new shell script with low-level & and wait parallelism
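
To give a feel for what "& and wait parallelism" means in practice, here is a hand-written sketch of a 2x-parallel spell check over two files. It is not PaSh's actual output, which relies on its own runtime primitives; the function names, the temporary-file approach, and the sorted lowercase word list passed as the first argument are all assumptions made for illustration.

```shell
# Hand-written illustration only; this is NOT what PaSh generates.
# All names here (sorted_words, spellcheck_2x) are hypothetical.

# One branch of the graph: lowercase a file, split it into words, sort them.
sorted_words() {
    tr A-Z a-z < "$1" | tr -cs a-z '\n' | grep -v '^$' | LC_ALL=C sort
}

# spellcheck_2x DICT FILE1 FILE2
# The body is a subshell "( ... )" so variables and background jobs stay local.
spellcheck_2x() (
    dict=$1
    work=$(mktemp -d) || exit 1

    # Run the two branches concurrently, one per input file.
    sorted_words "$2" > "$work/a" &
    sorted_words "$3" > "$work/b" &
    wait    # block until both background branches have finished

    # Combine: merge the two already-sorted streams, then finish sequentially.
    LC_ALL=C sort -m "$work/a" "$work/b" | uniq | LC_ALL=C comm -13 "$dict" -
    rm -rf "$work"
)
```

The important design point is the merge step: because each branch emits sorted output, sort -m can combine them cheaply, and the result is the same as sorting the concatenated input. Knowing that sort can be split and merged this way is exactly the kind of command-specific knowledge the annotations are meant to capture.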

Since PaSh is a source-to-source compiler, it allows the optimized shell script to be inspected and executed using the same tools, in the same environment, and with the same data as the original script.

The other two main components of PaSh are a lightweight annotation language, which allows command developers to express key parallelizability properties of their commands, and a small runtime library, which provides the PaSh compiler with high-performance primitives and supports its key functions.

Various benchmarks on common Unix one-liners show speedups of up to 60x.

PaSh can be run on Ubuntu, Fedora, Debian, and Arch. Use one of the following ways to set it up:

Run curl up.binpa.sh | sh from your terminal,
Clone the repo and run ./scripts/distro-deps.sh; ./scripts/setup-pash.sh,
Fetch a Docker container by running docker pull binpash/pash-18.04, or
Build a Docker container from scratch.

It also runs on Windows via WSL.

More Information

PaSh: Light-touch Data-Parallel Shell Processing

PaSh on GitHub

The Linux Upskill Challenge

Three Tips for the Linux Shell Addict


Last Updated ( Thursday, 04 November 2021 )
