When it comes to wrangling data at scale, R, Python, Scala, and Java have you covered -- mostly

You have a big data project. You understand the problem domain, you know what infrastructure to use, and maybe you've even decided on the framework you will use to process all that data, but one decision looms large: What language should I choose? (Or perhaps more pointed: What language should I force all my developers and data scientists to suffer?) It's a question that can be put off for only so long.

Sure, there's nothing stopping you from doing big data work with, say, XSLT transformations (a good April Fools' suggestion for tomorrow, simply to see the looks on everybody's faces). But in general, there are three languages of choice for big data these days -- R, Python, and Scala -- plus the perennial stalwart enterprise tortoise of Java. What language should you choose and why ... or when?

Here's a rundown of each to help guide your decision.

R

R is often called "a language for statisticians built by statisticians." If you need an esoteric statistical model for your calculations, you'll likely find it on CRAN -- it's not called the Comprehensive R Archive Network for nothing, you know. For analysis and plotting, you can't beat ggplot2. And if you need to harness more power than your machine can offer, you can use the SparkR bindings to run R on Spark.

However, if you are not a data scientist and haven't used MATLAB, SAS, or Octave before, it can take a bit of adjustment to be productive in R. While it's great for data analysis, it's less suited to general-purpose programming. You'd construct a model in R, but you would consider translating the model into Scala or Python for production, and you'd be unlikely to write a clustering control system using the language (good luck debugging it if you do).

Python

If your data scientists don't do R, they'll likely know Python inside and out. Python has been very popular in academia for more than a decade, especially in areas like Natural Language Processing (NLP). As a result, if you have a project that requires NLP work, you'll face an embarrassing number of choices, including the classic NLTK, topic modeling with GenSim, or the blazing-fast and accurate spaCy. Similarly, Python punches well above its weight when it comes to neural networks, with Theano and TensorFlow; then there's scikit-learn for machine learning, as well as NumPy and Pandas for data analysis.

There's Jupyter/IPython too -- the Web-based notebook server that allows you to mix code, plots, and, well, almost anything, in a shareable logbook format. This was once one of Python's killer features, although these days, the concept has proved so useful that it has spread across almost all languages that have a concept of a Read-Evaluate-Print Loop (REPL), including both Scala and R.

Python tends to be supported in big data processing frameworks, but at the same time, it tends not to be a first-class citizen. For example, new features in Spark will almost always appear first in the Scala/Java bindings, and it may take a few minor versions for those updates to be made available in PySpark (especially true for the Spark Streaming/MLlib side of development).

As opposed to R, Python is a traditional object-oriented language, so most developers will be fairly comfortable working with it, whereas first exposure to R or Scala can be quite intimidating. A slight issue is that Python requires correct whitespace in your code. This splits people between "this is great for enforcing readability" and those of us who believe that in 2016 we shouldn't need to fight an interpreter to get a program running because a line has one character out of place (you might guess where I fall on this issue).

Scala

Ah, Scala -- of the four languages in this article, Scala is the one that leans back effortlessly against the wall with everybody admiring its type system. Running on the JVM, Scala is a mostly successful marriage of the functional and object-oriented paradigms, and it's currently making huge strides in the financial world and at companies that need to operate on very large amounts of data, often in a massively distributed fashion (such as Twitter and LinkedIn). It's also the language that drives both Spark and Kafka.

As it runs on the JVM, it immediately gets access to the Java ecosystem for free, but it also has a wide variety of "native" libraries for handling data at scale (in particular Twitter's Algebird and Summingbird). It also includes a very handy REPL for interactive development and analysis, as in Python and R.

I'm very fond of Scala, if you can't tell, as it includes lots of useful programming features like pattern matching and is considerably less verbose than standard Java. However, there's often more than one way to do something in Scala, and the language advertises this as a feature. And that's good! But given that it has a Turing-complete type system and all sorts of squiggly operators ('/:' for foldLeft and ':\' for foldRight), it is quite easy to open a Scala file and think you're looking at a particularly nasty bit of Perl. You'll need a set of good practices and guidelines to follow when writing Scala (Databricks' are reasonable).

The other downside: The Scala compiler is a touch slow, to the extent that it brings back the days of the classic "compiling!" XKCD strip. Still, it has the REPL, big data support, and Web-based notebooks in the form of Jupyter and Zeppelin, so I forgive a lot of its quirks.

Java

Finally, there's always Java -- unloved, forlorn, owned by a company that only seems to care about it when there's money to be made by suing Google, and completely unfashionable. Only drones in the enterprise use Java! Yet Java could be a great fit for your big data project. Consider Hadoop MapReduce -- Java. HDFS? Written in Java. Even Storm, Kafka, and Spark run on the JVM (in Clojure and Scala), meaning that Java is a first-class citizen of these projects. Then there are new technologies like Google Cloud Dataflow (now Apache Beam), which until very recently supported Java only.

Java may not be the ninja rock star language of choice. But while the rock stars are straining to sort out the nest of callbacks in their Node.js application, using Java gives you access to a large ecosystem of profilers, debuggers, monitoring tools, libraries for enterprise security and interoperability, and much more besides, most of which have been battle-tested over the past two decades. (I'm sorry, everybody; Java turns 21 this year and we are all old.)

The main complaints against Java are the heavy verbosity and the lack of a REPL (present in R, Python, and Scala) for iterative development. I've seen 10 lines of Scala-based Spark code balloon into a 200-line monstrosity in Java, complete with huge type statements that take up most of the screen. However, the new lambda support in Java 8 does a lot to rectify this situation. Java is never going to be as compact as Scala, but Java 8 really does make developing in Java less painful.
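
To make the Java 8 point a little more concrete, here is a minimal sketch in plain Java rather than actual Spark code. The class LambdaDemo and both of its methods are hypothetical names invented for this illustration. Each method filters a list of words by length, uppercases the survivors, and sorts them by length; the first does it the pre-Java 8 way, the second with lambdas and the Stream API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; the names are illustrative and not taken from any framework.
class LambdaDemo {

    // Pre-Java 8 style: an explicit loop plus an anonymous inner class for sorting.
    static List<String> longWordsClassic(List<String> words, int minLength) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (word.length() >= minLength) {
                result.add(word.toUpperCase());
            }
        }
        Collections.sort(result, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(a.length(), b.length());
            }
        });
        return result;
    }

    // Java 8 style: the same logic as a stream pipeline with lambdas and method references.
    static List<String> longWordsLambda(List<String> words, int minLength) {
        return words.stream()
                .filter(word -> word.length() >= minLength)
                .map(String::toUpperCase)
                .sorted(Comparator.comparingInt(String::length))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "r", "hadoop", "jvm", "scala");
        // Both calls print the same list.
        System.out.println(longWordsClassic(words, 4));
        System.out.println(longWordsLambda(words, 4));
    }
}
```

Both methods return the same list for the sample input; the lambda version simply gets there without the hand-rolled loop and the anonymous Comparator. It's still not Scala-compact, but it is a long way from the old days.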
