Apparently from + size is capped at 10,000 in total.
You can raise the limit by tweaking a setting, but that doesn't fix the root problem, so I'll try scroll as the error message suggests.
elasticsearch.exceptions.TransportError: TransportError(500, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
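For reference, this is roughly how you would bump the `index.max_result_window` setting the error mentions (a sketch only, reusing the placeholder `hoge_host` / `hoge_index` names from the sample below; it needs a live cluster to actually run):

```python
# Raising the cap instead of using scroll -- works, but costs memory on the
# cluster and isn't a root fix. Placeholder host/index names.
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=[{'host': 'hoge_host', 'port': 9200}])

# Allow from + size up to 30,000 on this one index
es.indices.put_settings(
    index="hoge_index",
    body={"index": {"max_result_window": 30000}},
)
```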
I tried out scroll while googling around:
```python
# Sample: fetch more than 10,000 documents from Elasticsearch
from elasticsearch import Elasticsearch

counter = 1
size = 10000

# Per-document processing
def get_doc(hits):
    global counter
    for hit in hits:
        text = hit['_source']['text']
        print(counter, text)
        counter += 1

# Elasticsearch client
es = Elasticsearch(hosts=[{'host': 'hoge_host', 'port': 9200}])

# Initial search; keep the scroll context alive for 2 minutes
response = es.search(
    scroll='2m',
    index="hoge_index",
    size=size,
    body={
        "query": {
            "range": {
                "timestamp_jpn": {
                    "from": "2018-10-01 00:00:00",
                    "to": "2018-10-02 00:00:00"
                }
            }
        },
        "sort": [{"timestamp_jpn": {"order": "asc"}}]
    }
)

sid = response['_scroll_id']
print('sid', sid)
print('total', response['hits']['total'])
scroll_size = len(response['hits']['hits'])
print('scroll_size', scroll_size)

while True:
    # Stop once a page comes back empty
    if scroll_size <= 0:
        break
    # Process the current page of hits
    get_doc(response['hits']['hits'])
    # Fetch the next page from the scroll context
    response = es.scroll(scroll_id=sid, scroll='10m')
    sid = response['_scroll_id']  # the scroll id can change between pages
    scroll_size = len(response['hits']['hits'])
    print('scroll_size', scroll_size)

# Release the scroll context when done
es.clear_scroll(scroll_id=sid)
```
Reference: https://gist.github.com/hmldd/44d12d3a61a8d8077a3091c4ff7b9307
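The elasticsearch-py client also ships a `helpers.scan` wrapper that manages the whole scroll loop for you. A minimal sketch of the same fetch, again under the assumption of the placeholder `hoge_host` / `hoge_index` names (needs a live cluster to run):

```python
# helpers.scan calls the scroll API internally and yields hits one by one,
# so the manual while-loop above disappears. Placeholder host/index names.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts=[{'host': 'hoge_host', 'port': 9200}])

query = {
    "query": {
        "range": {
            "timestamp_jpn": {
                "from": "2018-10-01 00:00:00",
                "to": "2018-10-02 00:00:00"
            }
        }
    }
}

# scan() ignores sort by default; pass preserve_order=True if you need the
# query's sort order (slower, since it keeps the scroll ordered).
for counter, hit in enumerate(
        helpers.scan(es, query=query, index="hoge_index",
                     scroll='2m', size=10000),
        start=1):
    print(counter, hit['_source']['text'])
```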
If you just need to pull the data out, elasticdump looks like it would be faster...