August 2018 - wiky apps

Software Development, Games Development, Mobile Development, iOS Development, Android Development, Windows Phone Development, .NET, Windows Services, WCF Services, Web Services, MVC, MySQL, SQL Server and Oracle Tutorials, Articles and their Resources

Wednesday, August 22, 2018

How-to: Run a Simple Apache Spark App in CDH 5

Getting started with Apache Spark in CDH 5.x is easy using this simple example. Apache Spark is a general-purpose, cluster computing f...

Apache Hive on Apache Spark: Motivations and Design Principles

Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that...

How-to: Build Advanced Time-Series Pipelines in Apache Crunch

Learn how creating dataflow pipelines for time-series analysis is a lot easier with Apache Crunch. In a previous blog post, I describe...

Bayesian Machine Learning on Apache Spark

Markov Chain Monte Carlo methods are another example of useful statistical computation for Big Data that is capably enabled by Apache ...

Building Lambda Architecture with Spark Streaming

The versatility of Apache Spark's API for both batch/ETL and streaming workloads brings the promise of lambda architecture to the ...

Getting Started with Big Data Architecture

What does a "Big Data engineer" do, and what does "Big Data architecture" look like? In this post, you'll get ...

Apache Kafka for Beginners

When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data in...

Calculating CVA with Apache Spark

Thanks to Matthew Dixon, principal consultant at Quiota LLC and Professor of Analytics at the University of San Francisco, and Mohamma...

How-to: Translate from MapReduce to Apache Spark (Part 2)

The conclusion to this series covers Combiner-like aggregation functionality, counters, partitioning, and serialization. Apache Spark ...

Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle

Our thanks to Ilya Ganelin, Senior Data Engineer at Capital One Labs, for the guest post below about his hard-earned lessons from usin...

Deploying Apache Kafka: A Practical FAQ

This post contains answers to common questions about deploying and configuring Apache Kafka as part of a Cloudera-powered enterprise d...

Ibis on Impala: Python at Scale for Data Science

This new Cloudera Labs project promises to deliver the great Python user experience and ecosystem at Hadoop scale. Across the user com...

How Apache Spark, Scala, and Functional Programming Made Hard Problems Easy at Barclays

Thanks to Barclays employees Sam Savage, VP Data Science, and Harry Powell, Head of Advanced Analytics, for the guest post below about...

Interactive Analytics on Dynamic Big Data in Python using Kudu, Impala, and Ibis

The following post was originally published in the Ibis project blog. (Ibis is a data analysis framework incubating in Cloudera Labs t...

Time Series for Spark: 0.2.0 Released

The 0.2.0 release of the spark-ts package includes a fleshed-out Java API, among other things. The spark-ts library, which wa...

Progress Report: Bringing Erasure Coding to Apache Hadoop

Get an update on the progress of the effort to bring erasure coding to HDFS, including a report about fresh performance benchmark test...

How-to: Build a Real-Time Search System using StreamSets, Apache Kafka, and Cloudera Search

Thanks to Jonathan Natkins, a field engineer from StreamSets, for the guest post below about using StreamSets Data Collector—open sour...

Making Python on Apache Hadoop Easier with Anaconda and CDH

Enabling Python development on CDH clusters (for PySpark, for example) is now much easier thanks to new integration with Continuum Ana...

Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de-facto standard for columnar in-memory p...

Time Series for Spark Joins Cloudera Labs

Bringing Time Series for Spark into Cloudera Labs is a reflection of its potential future usefulness in more use cases. Time is more...

Building a Data Science Portfolio: Storytelling with Data (Part 2: Data Exploration)

The following post (Part 2 of two parts) by Vik Paruchuri, founder of data science learning platform Dataquest, offers some detailed a...

Securing Apache Spark Shuffle using Apache Commons Crypto

Learn how the performance advantages of the Crypto cryptographic library will provide an upgrade for Spark shuffle encryption over the...

Apache Kudu and Apache Impala (Incubating): The Integration Roadmap

Impala users can expect new performance and usability benefits via improved integration with Kudu. It's been nearly one year since...

Introducing sparklyr, an R Interface for Apache Spark

Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStud...

Resource Management for Apache Impala (incubating)

Apache Impala (incubating) includes several features that allow you to restrict or allocate resources so as to maximize stability and ...

How-to: Deploy a Secure Enterprise Data Hub on Microsoft Azure – Part 1

Learn how to use Cloudera Director, Microsoft Active Directory (AD DS, AD CS, AD DNS), SAMBA, and SSSD to deploy a secure EDH cluster ...

How-to: Fuzzy Name Indexing in Apache Hadoop with Rosette and Cloudera Search

In this guide, learn how to use Cloudera Search with Basis Technology's Rosette® to perform fuzzy name searches in multiple langua...

How-to: Deploy a Secure Enterprise Data Hub on Microsoft Azure – Part 2

In Part 1 of the blog, we covered all the prerequisites needed to deploy a CDH cluster on the Microsoft Azure cloud platform. In Part ...

How to secure ‘Internet exposed’ Apache Hadoop

You may have heard of the recent (and ongoing) hacks targeting open source database solutions like MongoDB and Apache Hadoop. From wha...

Hardening Apache ZooKeeper Security: SASL Quorum Peer Mutual Authentication and Authorization

Background: Apache ZooKeeper is a core infrastructure component in the Apache Hadoop stack and is also widely used by many companies for se...

Up and running with Apache Spark on Apache Kudu

After the GA of Apache Kudu in Cloudera CDH 5.10, we take a look at the Apache Spark on Kudu integration, share code snippets, and exp...

Working with UDFs in Apache Spark

User-defined functions (UDFs) are a key feature of most SQL environments to extend the system's built-in functionality. UDFs allow...

Analyzing US flight data on Amazon S3 with sparklyr and Apache Spark 2.0

We posted several blog posts about sparklyr (introduction, automation), which enables you to analyze big data leveraging Apache Spark ...

How-to: Log Analytics with Solr, Spark, OpenTSDB and Grafana

Organizations analyze logs for a variety of reasons. Some typical use cases include predicting server failures, analyzing customer beh...

Blacklisting in Apache Spark

At Cloudera, we're always working to provide our customers and the Apache Spark community with the most robust, most reliable soft...

Deep Learning Frameworks on CDH and Cloudera Data Science Workbench

The emergence of "Big Data" has made machine learning much easier because the key burden of statistical estimation—generaliz...

Apache Impala Leads Traditional Analytic Database

An unmodified TPC-DS-based performance benchmark shows Impala's leadership compared to a traditional analytic database (Greenplum), es...