Spring Framework Hadoop-Hive Integration

Terry Cho 2013. 3. 19. 00:46

Spring for Apache Hadoop Project #2

(Hive Integration)

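Hive is an Apache open source project in the Hadoop ecosystem.

It lets you query data stored in HDFS with an SQL-like language, much as you would query an RDBMS, which makes it very useful alongside Hadoop for complex data queries.

SHDP (Spring for Apache Hadoop) supports Hive as well. Broadly, it can start a Hive server, run Hive scripts, and invoke the API that Hive exposes. As with its Hadoop support, it also provides a Tasklet for integration with Spring Batch.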

Starting the Hive Server

The server is defined with the hive-server element. It can be started from a configuration file, and additional configuration properties can be given as the text content of the hive-server element.

<hdp:hive-server host="some-other-host" port="10001" properties-location="classpath:hive-dev.properties" configuration-ref="hadoopConfiguration">
  someproperty=somevalue
  hive.exec.scratchdir=/tmp/mydir
</hdp:hive-server>

Running Hive Scripts with the Thrift Client

To use Hive you need a client that connects to the Hive server, and the first option is the Thrift client. Because the Thrift client is not thread-safe, SHDP returns a client factory rather than a single shared client.
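The reason for handing out a factory instead of a client can be illustrated with a small, library-free sketch. FakeHiveClient and ClientFactoryDemo are hypothetical stand-ins invented for this example, not SHDP classes. The only point is that the factory is shared while each task asks it for its own client.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical stand-in for a client that must not be shared between threads.
class FakeHiveClient {
    private final String host;
    private final int port;

    FakeHiveClient(String host, int port) {
        this.host = host;
        this.port = port;
    }

    String endpoint() {
        return host + ":" + port;
    }
}

class ClientFactoryDemo {
    // The factory itself is safe to share; every call returns a fresh client.
    static Supplier<FakeHiveClient> factory(String host, int port) {
        return () -> new FakeHiveClient(host, port);
    }

    // Runs `tasks` jobs on a small pool, each obtaining its own client,
    // and returns how many distinct client instances were handed out.
    static int distinctClients(Supplier<FakeHiveClient> factory, int tasks) {
        Set<FakeHiveClient> seen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> seen.add(factory.get()));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        Supplier<FakeHiveClient> factory = factory("some-other-host", 10001);
        System.out.println(distinctClients(factory, 8));
    }
}
```

Since no client instance is ever shared between tasks, the client itself does not need to be thread-safe.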

In the configuration below, hive-client-factory is given the host and port of the Hive server in order to create clients.

A runner is then declared to execute the script, and it references the client factory created above. The script location on hive-runner is set so that the script defined in password-analysis.hal is executed.

<hdp:hive-client-factory host="some-other-host" port="10001" />
<hdp:hive-runner id="hiveRunner" hive-client-ref="hiveClientFactory" run-at-startup="false" pre-action="hdfsScript">
  <hdp:script location="password-analysis.hal"/>
</hdp:hive-runner>

The Java code that runs with the configuration above looks like this.

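import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.context.support.AbstractApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;
import org.springframework.data.hadoop.hive.HiveRunner;

public class HiveAppWithApacheLogs {

    private static final Log log = LogFactory.getLog(HiveAppWithApacheLogs.class);

    public static void main(String[] args) throws Exception {
        // Load the Spring context that declares the Hive client factory and runner
        AbstractApplicationContext context = new ClassPathXmlApplicationContext(
                "/META-INF/spring/hive-apache-log-context.xml", HiveAppWithApacheLogs.class);
        log.info("Hive Application Running");
        context.registerShutdownHook();

        // run-at-startup is false, so the script is triggered explicitly here
        HiveRunner runner = context.getBean(HiveRunner.class);
        runner.call();
    }
}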

An initialization script can also be run automatically each time a Hive client is created.

<hive-client-factory host="some-host" port="some-port" xmlns="http://www.springframework.org/schema/hadoop">
   <hdp:script>
     DROP TABLE IF EXISTS testHiveBatchTable;
     CREATE TABLE testHiveBatchTable (key int, value string);
   </hdp:script>
   <hdp:script location="classpath:org/company/hive/script.q">
       <arguments>ignore-case=true</arguments>
   </hdp:script>
</hive-client-factory>

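With the configuration above, two scripts run automatically every time a client is created: the inline DROP TABLE / CREATE TABLE script and the script in script.q.

Likewise, a runner can be configured to execute several queries in sequence.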
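The per-client initialization described above can be pictured with a short plain-Java sketch. RecordingClient and InitializingClientFactory are hypothetical names made up for this illustration and do not reflect how SHDP is implemented. The sketch only shows that every newly created client receives the same scripts in the declared order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client that only records the statements it is asked to run.
class RecordingClient {
    final List<String> executed = new ArrayList<>();

    void execute(String statement) {
        executed.add(statement);
    }
}

// Hypothetical factory: runs every init script, in order, on each new client.
class InitializingClientFactory {
    private final List<String> initScripts;

    InitializingClientFactory(List<String> initScripts) {
        this.initScripts = List.copyOf(initScripts);
    }

    RecordingClient create() {
        RecordingClient client = new RecordingClient();
        for (String script : initScripts) {
            client.execute(script);
        }
        return client;
    }
}

class InitScriptDemo {
    public static void main(String[] args) {
        InitializingClientFactory factory = new InitializingClientFactory(List.of(
                "DROP TABLE IF EXISTS testHiveBatchTable",
                "CREATE TABLE testHiveBatchTable (key int, value string)"));
        RecordingClient first = factory.create();
        RecordingClient second = factory.create();
        // Both clients were initialized with the same statements, in the same order.
        System.out.println(first.executed);
        System.out.println(second.executed.equals(first.executed));
    }
}
```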

Running Scripts over JDBC

Besides Thrift, Hive can also be accessed through a JDBC driver, just like an RDBMS. Spring supports Hive integration through JDBC as well.

Usage is the same as with an ordinary JdbcTemplate.

First declare the Hive JDBC driver as hive-driver, use it to define a Hive data source, and then wire a JdbcTemplate to that data source (see the example below).

<beans xmlns="http://www.springframework.org/schema/beans"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:c="http://www.springframework.org/schema/c"
         xmlns:context="http://www.springframework.org/schema/context"
         xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
         http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd">

    <!-- basic Hive driver bean -->
    <bean id="hive-driver" class="org.apache.hadoop.hive.jdbc.HiveDriver"/>

    <!-- wrapping a basic datasource around the driver -->
    <!-- notice the 'c:' namespace (available in Spring 3.1+) for inlining constructor arguments,
         in this case the url (default is 'jdbc:hive://localhost:10000/default') -->
    <bean id="hive-ds" class="org.springframework.jdbc.datasource.SimpleDriverDataSource"
       c:driver-ref="hive-driver" c:url="${hive.url}"/>

    <!-- standard JdbcTemplate declaration -->
    <bean id="jdbcTemplate" class="org.springframework.jdbc.core.JdbcTemplate" c:data-source-ref="hive-ds"/>

    <context:property-placeholder location="hive.properties"/>
</beans>

Java code can then run Hive queries through the jdbcTemplate bean, in the same way as against any other JDBC data source.
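As a rough sketch of what such code does underneath, the following uses only the standard java.sql and javax.sql APIs. HiveJdbcSketch, hiveUrl and firstColumn are hypothetical names chosen for this example. The URL format follows the default mentioned in the configuration comment, and the DataSource is assumed to be something like the hive-ds bean.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

class HiveJdbcSketch {
    // Builds a Hive JDBC URL in the form jdbc:hive://host:port/database.
    static String hiveUrl(String host, int port, String database) {
        return "jdbc:hive://" + host + ":" + port + "/" + database;
    }

    // Runs a query against any DataSource and collects the first column of each row.
    // The caller supplies the DataSource; nothing here is specific to Hive.
    static List<String> firstColumn(DataSource dataSource, String sql) throws SQLException {
        List<String> values = new ArrayList<>();
        try (Connection connection = dataSource.getConnection();
             Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(sql)) {
            while (rows.next()) {
                values.add(rows.getString(1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Only the URL helper is exercised here, since no Hive server is assumed to be running.
        System.out.println(hiveUrl("localhost", 10000, "default"));
    }
}
```

With the jdbcTemplate bean injected, the same kind of query would typically be a one-liner along the lines of jdbcTemplate.queryForList("show tables", String.class), with connection handling done by the template.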

Running the Hive API with HiveTemplate

Similar to JdbcTemplate, a template is also provided for Hive execution.

As shown below, the context file declares a hive-template, which is then injected into a bean named someBean of class SomeClass.

<hdp:hive-client-factory ... />

<!-- Hive template wires automatically to 'hiveClientFactory'-->
<hdp:hive-template />

<!-- wire hive template into a bean -->
<bean id="someBean" class="org.SomeClass" p:hive-template-ref="hiveTemplate"/>

SomeClass receives the template and calls its execute() method.

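import java.util.List;

import org.apache.hadoop.hive.service.HiveClient;
import org.springframework.data.hadoop.hive.HiveClientCallback;
import org.springframework.data.hadoop.hive.HiveTemplate;

public class SomeClass {

    private HiveTemplate hiveTemplate;

    // Setter used by the p:hive-template-ref property injection
    public void setHiveTemplate(HiveTemplate hiveTemplate) { this.hiveTemplate = hiveTemplate; }

    public List<String> getDbs() {
        return hiveTemplate.execute(new HiveClientCallback<List<String>>() {
            @Override
            public List<String> doInHive(HiveClient hiveClient) throws Exception {
                return hiveClient.get_all_databases();
            }
        });
    }
}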
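The template-and-callback shape used here can be sketched without any Hive dependency. SketchClient, SketchCallback and SketchTemplate are hypothetical names for this illustration only. They are a rough analogy for the pattern, assuming the template obtains a client from a factory, hands it to the callback and cleans up afterwards, and they are not SHDP's actual implementation.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical client: something that can be queried and must be closed afterwards.
class SketchClient implements AutoCloseable {
    boolean closed = false;

    List<String> getAllDatabases() {
        return List.of("default");
    }

    @Override
    public void close() {
        closed = true;
    }
}

// Callback that receives a live client, mirroring the shape of doInHive(...).
interface SketchCallback<T> {
    T doWithClient(SketchClient client) throws Exception;
}

// Hypothetical template: opens a client, runs the callback, always closes the client.
class SketchTemplate {
    private final Supplier<SketchClient> factory;

    SketchTemplate(Supplier<SketchClient> factory) {
        this.factory = factory;
    }

    <T> T execute(SketchCallback<T> callback) {
        try (SketchClient client = factory.get()) {
            return callback.doWithClient(client);
        } catch (Exception e) {
            // Wrap checked exceptions so callers are not forced to handle them.
            throw new IllegalStateException("Hive callback failed", e);
        }
    }
}

class TemplateDemo {
    public static void main(String[] args) {
        SketchTemplate template = new SketchTemplate(SketchClient::new);
        List<String> dbs = template.execute(SketchClient::getAllDatabases);
        System.out.println(dbs);
    }
}
```

The benefit of this design is that calling code such as getDbs() contains only the query logic, while client creation and cleanup stay in one place.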

Spring Batch Integration

Finally, as with the Hadoop integration, a tasklet is provided for integration with Spring Batch.

<hdp:hive-tasklet id="hive-script">
   <hdp:script>
     DROP TABLE IF EXISTS testHiveBatchTable;
     CREATE TABLE testHiveBatchTable (key int, value string);
   </hdp:script>
   <hdp:script location="classpath:org/company/hive/script.q" />
</hdp:hive-tasklet>
